
Implementing Security Groups in OpenStack using OVN Port Groups

Some time back, when looking at the performance of OpenStack using OVN as the networking backend, we noticed that it didn’t scale well, and it turned out that the major culprit was the way we had implemented Neutron Security Groups. In order to illustrate the issue and the optimizations that we carried out, let’s first explain how security was originally implemented:

Networking-ovn and Neutron Security Groups

Originally, Security Groups were implemented using a combination of OVN resources such as Address Sets and Access Control Lists (ACLs):

  • Address Sets: An OVN Address Set contains a number of IP addresses that can be referenced from an ACL. In networking-ovn we directly map Security Groups to OVN Address Sets: every time a new IP address is allocated for a port, this address is added to the Address Set(s) representing the Security Group(s) that the port belongs to.
$ ovn-nbctl list address_set
_uuid : 039032e4-9d98-4368-8894-08e804e9ee78
addresses : ["10.0.0.118", "10.0.0.123", "10.0.0.138", "10.0.0.143"]
external_ids : {"neutron:security_group_id"="0509db24-4755-4321-bb6f-9a094962ec91"}
name : "as_ip4_0509db24_4755_4321_bb6f_9a094962ec91"
  • ACLs: They are applied to a Logical Switch (a Neutron network). Each Neutron Security Group Rule has a 1-to-many relationship with OVN ACLs: for instance, when the user creates a single Neutron rule within a Security Group to allow ingress ICMP traffic, it will map to N ACLs in the OVN Northbound database, with N being the number of ports that belong to that Security Group.
$ openstack security group rule create --protocol icmp default

One of the N resulting ACLs in the OVN Northbound database looks like this:
_uuid : 6f7635ff-99ae-498d-8700-eb634a16903b
action : allow-related
direction : to-lport
external_ids : {"neutron:lport"="95fb15a4-c638-42f2-9035-bee989d80603", "neutron:security_group_rule_id"="70bcb4ca-69d6-499f-bfcf-8f353742d3ff"}
log : false
match : "outport == \"95fb15a4-c638-42f2-9035-bee989d80603\" && ip4 && ip4.src == 0.0.0.0/0 && icmp4"
meter : []
name : []
priority : 1002
severity : []

On the other hand, Neutron makes it possible to filter traffic between ports within the same Security Group, or coming from a remote Security Group. One use case may be: a set of VMs whose ports belong to SG1, only allowing HTTP traffic from the outside, and another set of VMs whose ports belong to SG2, blocking all incoming traffic. From Neutron, you can create a rule to allow database connections from SG1 to SG2. In this case, in OVN we’ll see ACLs referencing the aforementioned Address Sets. For example:

$ openstack security group rule create --protocol tcp --dst-port 3306 --remote-group webservers default
+-------------------+--------------------------------------+
| Field | Value |
+-------------------+--------------------------------------+
| created_at | 2018-12-21T11:29:32Z |
| description | |
| direction | ingress |
| ether_type | IPv4 |
| id | 663012c1-67de-45e1-a398-d15bd4f295bb |
| location | None |
| name | None |
| port_range_max | 3306 |
| port_range_min | 3306 |
| project_id | 471603b575184afc85c67d0c9e460e85 |
| protocol | tcp |
| remote_group_id | 11059b7d-725c-4740-8db8-5c5b89865d0f |
| remote_ip_prefix | None |
| revision_number | 0 |
| security_group_id | 0509db24-4755-4321-bb6f-9a094962ec91 |
| updated_at | 2018-12-21T11:29:32Z |
+-------------------+--------------------------------------+

This results in the following OVN ACL in the Northbound database (again, one such ACL per port in the Security Group):


_uuid : 03dcbc0f-38b2-42da-8f20-25996044e516
action : allow-related
direction : to-lport
external_ids : {"neutron:lport"="7d6247b7-65b9-4864-a9a0-a85bacb4d9ac", "neutron:security_group_rule_id"="663012c1-67de-45e1-a398-d15bd4f295bb"}
log : false
match : "outport == \"7d6247b7-65b9-4864-a9a0-a85bacb4d9ac\" && ip4 && ip4.src == $as_ip4_11059b7d_725c_4740_8db8_5c5b89865d0f && tcp && tcp.dst == 3306"
meter : []
name : []
priority : 1002
severity : []

Problem “at scale”

In order to best illustrate the impact of the optimizations that the Port Groups feature brought to OpenStack, let’s take a look at the number of ACLs on a typical setup when creating just 100 ports on a single network. All those ports will belong to a Security Group with the following rules:

  1. Allow incoming SSH traffic
  2. Allow incoming HTTP traffic
  3. Allow incoming ICMP traffic
  4. Allow all IPv4 traffic between ports of this same Security Group
  5. Allow all IPv6 traffic between ports of this same Security Group
  6. Allow all outgoing IPv4 traffic
  7. Allow all outgoing IPv6 traffic

Every time we create a port, 10 new ACLs (the 7 rules above + the DHCP traffic ACL + the default egress drop ACL + the default ingress drop ACL) will be created in OVN:


$ ovn-nbctl list ACL | grep ce2ad98f-58cf-4b47-bd7c-38019f844b7b | grep match
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip6 && ip6.src == $as_ip6_0509db24_4755_4321_bb6f_9a094962ec91"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == 0.0.0.0/0 && icmp4"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == $as_ip4_0509db24_4755_4321_bb6f_9a094962ec91"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip6"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 80"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.dst == {255.255.255.255, 10.0.0.0/8} && udp && udp.src == 68 && udp.dst == 67"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip"

With 100 ports, we’ll observe 1K ACLs in the system:

$ ovn-nbctl lsp-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
100
$ ovn-nbctl acl-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
1000

When ovn-northd sees these new ACLs, it’ll create the corresponding Logical Flows in the Southbound database, which ovn-controller will then translate into OpenFlow flows on the actual hypervisors. The number of Logical Flows for this 100-port system can be pulled like this:

$ ovn-sbctl lflow-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
3052

At this point, you can pretty much tell that this doesn’t look very promising at scale.

Optimization

One can quickly spot an optimization: have just one ACL per Security Group Rule, instead of one ACL per Security Group Rule per port, if only we could reference a set of ports, rather than each port individually, in the ‘match’ column of an ACL. This would alleviate calculations mainly on the networking-ovn side, where we saw a bottleneck at scale when processing new ports due to the high number of ACLs.

Such optimization would require a few changes on the core OVN side:

  • Changes to the schema to create a new table in the Northbound database (Port_Group) and to be able to apply ACLs also to a Port Group.
  • Changes to ovn-northd so that it creates new Logical Flows based on ACLs applied to Port Groups.
  • Changes to ovn-controller so that it can figure out the physical flows to install on every hypervisor based on the new Logical Flows.

These changes happened mainly in the following three patches, and the feature is present in OvS 2.10 and beyond:

https://github.com/openvswitch/ovs/commit/3d2848bafa93a2b483a4504c5de801454671dccf

https://github.com/openvswitch/ovs/commit/1beb60afd25a64f1779903b22b37ed3d9956d47c

https://github.com/openvswitch/ovs/commit/689829d53612a573f810271a01561f7b0948c8c8

On the networking-ovn side, we needed to adapt the code as well to:

  • Make use of the new feature and implement Security Groups using Port Groups.
  • Ensure a migration path from old implementation to Port Groups.
  • Keep backwards compatibility: in case an older version of OvS is used, we need to fall back to the previous implementation (see the sketch below).
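
As an illustration of that fallback decision, here is a minimal sketch of the kind of capability probe involved (a hypothetical helper; networking-ovn’s actual check may differ). The Port_Group table only exists in the Northbound schema from OvS 2.10 on, so its presence in the connected IDL tells us whether the feature can be used:

def supports_port_groups(nb_idl):
    # The Port_Group table was added to the OVN Northbound schema in
    # OvS 2.10; if the schema we are connected to lacks it, fall back
    # to the older per-port ACLs + Address Sets implementation.
    return 'Port_Group' in nb_idl.tables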

Here you can see the main patch to accomplish the changes above:

https://github.com/openstack/networking-ovn/commit/f01169b405bb5080a1bc1653f79512eb0664c35d
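
To get a feel for the feature from the client side, here’s a minimal sketch of creating a Port Group and attaching an ACL to it through ovsdbapp’s OVN Northbound API (this is not code from the patch above; the connection details, the pg_demo/lsp-1 names and the exact command signatures are illustrative assumptions):

from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.schema.ovn_northbound import impl_idl

# Connect to the OVN Northbound database (address is an example).
idl = connection.OvsdbIdl.from_server('tcp:127.0.0.1:6641', 'OVN_Northbound')
api = impl_idl.OvnNbApiIdlImpl(connection.Connection(idl, timeout=60))

with api.transaction(check_error=True) as txn:
    # One Port Group representing the whole Security Group...
    txn.add(api.pg_add('pg_demo', may_exist=True))
    # ...to which ports are added as Neutron creates them...
    txn.add(api.pg_add_ports('pg_demo', 'lsp-1'))
    # ...and a single ACL that applies to every port in the group.
    txn.add(api.pg_acl_add('pg_demo', 'to-lport', 1002,
                           'outport == @pg_demo && ip4 && tcp && tcp.dst == 22',
                           'allow-related'))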

If we now recreate the same scenario as earlier (100 ports in our Security Group, which previously resulted in 1000 ACLs) using the new feature, we can compare the number of resources in use:

$ ovn-nbctl lsp-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
100

Two OVN Port Groups have been created: one for our Security Group, and neutron_pg_drop, which is used to add fallback, low-priority drop ACLs (by default, OVN will allow all traffic if no explicit drop ACLs are added):

$ ovn-nbctl --bare --columns=name list Port_Group
neutron_pg_drop
pg_0509db24_4755_4321_bb6f_9a094962ec91

ACLs are now applied to Port Groups and not to the Logical Switch:

$ ovn-nbctl acl-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
0
$ ovn-nbctl acl-list pg_0509db24_4755_4321_bb6f_9a094962ec91 | wc -l
7
$ ovn-nbctl acl-list neutron_pg_drop | wc -l
2

The number of ACLs has gone from 1000 (10 per port) to just 9 regardless of the number of ports in the system:

$ ovn-nbctl --bare --columns=match list ACL
inport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4
inport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip6
inport == @neutron_pg_drop && ip
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == 0.0.0.0/0 && icmp4
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip6 && ip6.src == $pg_0509db24_4755_4321_bb6f_9a094962ec91_ip6
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == $pg_0509db24_4755_4321_bb6f_9a094962ec91_ip4
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 80
outport == @neutron_pg_drop && ip

This change was merged in OpenStack Queens and requires at least OvS 2.10. Also, if upgrading from an earlier version of either OpenStack or OvS, networking-ovn will take care of the migration from Address Sets to Port Groups when the Neutron server starts, and the new implementation will be used automatically.

As a bonus, this makes it easier to apply the conjunctive match action on Logical Flows, resulting in a big performance improvement, as it was reported here.

Some fun with race conditions and threads in OpenStack

Almost a year ago I reported a bug we were hitting in OpenStack using networking-ovn as the network backend. The symptom was that sometimes Tempest tests were failing in the gate when trying to reach a Floating IP (FIP) of a VM. The failure rate was not really high, so a couple of ‘rechecks’ here and there were enough for us to delay chasing down the bug.

Last week I decided to hunt the bug and attempt to find out the root cause of the failure. Why was the FIP unreachable? For a FIP to be reachable from the external network (Tempest), the following high-level steps should happen:

  1. Tempest needs to ARP query the FIP of the VM
  2. A TCP SYN packet is sent out to the FIP
  3. Routing will happen between external network and internal VM network
  4. The packet will reach the VM and it’ll respond back with a SYN/ACK packet to the originating IP (Tempest executor)
  5. Routing will happen between internal VM network and external network
  6. SYN/ACK packet reaches Tempest executor and the connection will get established on its side
  7. ACK sent to the VM

One of those steps is clearly failing, so it’s time to figure out which.

2018-09-18 17:09:17,276 13788 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.11 after 15 attempts

First things first, let’s start off by checking that we get an ARP response for the FIP. We spawn tcpdump on the external interface where Tempest runs and check the traffic to/from 172.24.5.11:

Sep 18 17:09:17.644405 ubuntu-xenial-ovh-bhs1-0002095917 tcpdump[28445]: 17:09:17.275597 1e:d5:ec:49:df:4f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.24.5.11 tell 172.24.5.1, length 28

I can see plenty of those ARP requests but not a single reply. Something’s fishy…

In OVN, ARP queries are answered by ovn-controller, so they should hit the gateway node. Let’s inspect the flows there to see if they were installed in OVN ingress table 1 (which corresponds to OpenFlow table 9):

http://www.openvswitch.org/support/dist-docs/ovn-northd.8.html

Ingress Table 1: IP Input

These flows reply to ARP requests for the virtual IP
addresses configured in the router for DNAT or load
balancing. For a configured DNAT IP address or a load
balancer VIP A, for each router port P with Ethernet
address E, a priority-90 flow matches inport == P &&
arp.op == 1 && arp.tpa == A (ARP request) with the
following actions:

eth.dst = eth.src;
eth.src = E;
arp.op = 2; /* ARP reply. */
arp.tha = arp.sha;
arp.sha = E;
arp.tpa = arp.spa;
arp.spa = A;
outport = P;
flags.loopback = 1;
output;


$ sudo ovs-ofctl dump-flows br-int | grep table=9 | grep "arp_tpa=172.24.5.11"
cookie=0x549ad196, duration=105.055s, table=9, n_packets=0, n_bytes=0, idle_age=105, priority=90,arp,reg14=0x1,metadata=0x87,arp_tpa=172.24.5.11,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],mod_dl_src:fa:16:3e:3d:bf:46,load:0xfa163e3dbf46->NXM_NX_ARP_SHA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xac18050b->NXM_OF_ARP_SPA[],load:0x1->NXM_NX_REG15[],load:0x1->NXM_NX_REG10[0],resubmit(,32)

So the flow for the ARP responder is installed but it has no hits (note n_packets=0). Looks like for some reason the ARP query is not reaching the router pipeline from the public network. Let’s now take a look at the Logical Router:


$ ovn-nbctl show
router 71c37cbd-4aa9-445d-a9d1-cb54ee1d3207 (neutron-cc163b42-1fdf-4cfa-a2ff-50c521f04222) (aka tempest-TestSecurityGroupsBasicOps-router-1912572360)
     port lrp-2a1bbf89-adee-4e74-b65e-1ac7a1ba4089
         mac: "fa:16:3e:3d:bf:46"
         networks: ["10.1.0.1/28"]
     nat 2eaaa99e-3be8-49ad-b801-ad198a6084fd
         external ip: "172.24.5.7"
         logical ip: "10.1.0.0/28"
         type: "snat"
     nat 582bab87-8acb-4905-8723-948651811193
         external ip: "172.24.5.11"
         logical ip: "10.1.0.6"
         type: "dnat_and_snat"

We can see that we have two NAT entries: one for the FIP (172.24.5.11 <-> 10.1.0.6) and one SNAT entry for the gateway which should allow the VM to access the external network.

There’s also one router port which connects the VM subnet to the router but… wait… there’s no gateway port connected to the router!! This means that the FIP is unreachable, so at this point we know what’s going on, but… WHY? We need to figure out why the gateway port is not added. Time to check the code and logs:

Code-wise (see below), the gateway is added upon router creation. It implies creating the router itself (create_lrouter), creating the gateway port (_create_lrouter_port, called from _add_router_ext_gw) and attaching it to the Logical Switch, and then adding a static route with the gateway as the next hop (add_static_route). Also, you can see that everything is inside a context manager that opens a transaction with OVSDB, so all the commands are expected to be committed/failed as a whole:

def create_router(self, router, add_external_gateway=True):
    """Create a logical router."""
    context = n_context.get_admin_context()
    external_ids = self._gen_router_ext_ids(router)
    enabled = router.get('admin_state_up')
    lrouter_name = utils.ovn_name(router['id'])
    added_gw_port = None
    with self._nb_idl.transaction(check_error=True) as txn:
        txn.add(self._nb_idl.create_lrouter(lrouter_name,
                                            external_ids=external_ids,
                                            enabled=enabled,
                                            options={}))
        if add_external_gateway:
            networks = self._get_v4_network_of_all_router_ports(
                context, router['id'])
            if router.get(l3.EXTERNAL_GW_INFO) and networks is not None:
                added_gw_port = self._add_router_ext_gw(context, router,
                                                        networks, txn)

def _add_router_ext_gw(self, context, router, networks, txn):
    router_id = router['id']
    # 1. Add the external gateway router port.
    gw_info = self._get_gw_info(context, router)
    gw_port_id = router['gw_port_id']
    port = self._plugin.get_port(context, gw_port_id)
    self._create_lrouter_port(router_id, port, txn=txn)

    columns = {}
    if self._nb_idl.is_col_present('Logical_Router_Static_Route',
                                   'external_ids'):
        columns['external_ids'] = {
            ovn_const.OVN_ROUTER_IS_EXT_GW: 'true',
            ovn_const.OVN_SUBNET_EXT_ID_KEY: gw_info.subnet_id}

    # 2. Add default route with nexthop as gateway ip
    lrouter_name = utils.ovn_name(router_id)
    txn.add(self._nb_idl.add_static_route(
        lrouter_name, ip_prefix='0.0.0.0/0', nexthop=gw_info.gateway_ip,
        **columns))

    # 3. Add snat rules for tenant networks in lrouter if snat is enabled
    if utils.is_snat_enabled(router) and networks:
        self.update_nat_rules(router, networks, enable_snat=True, txn=txn)
    return port

So, how is it possible that the Logical Router exists while the gateway port does not, if everything is under the same transaction? We’re nailing this down; now we need to figure that out by inspecting the transactions in the neutron-server logs:

DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): AddLRouterCommand(may_exist=True, columns={'external_ids': {'neutron:gw_port_id': u'8dc49792-b37a-48be-926a-af2c76e269a9', 'neutron:router_name': u'tempest-TestSecurityGroupsBasicOps-router-1912572360', 'neutron:revision_number': '2'}, 'enabled': True, 'options': {}}, name=neutron-cc163b42-1fdf-4cfa-a2ff-50c521f04222) {{(pid=32314) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

The above trace is written when the Logical Router gets created but, surprisingly, we can see idx=2, meaning that it’s not the first command of its transaction. But… how is this possible? We saw in the code that it was the first command to be executed when creating a router, and this is the expected sequence:

  1. AddLRouterCommand
  2. AddLRouterPortCommand
  3. SetLRouterPortInLSwitchPortCommand
  4. AddStaticRouteCommand

Let’s check the other commands in this transaction in the log file:

DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddLRouterPortCommand(name=lrp-1763c78f-5d0d-41d4-acc7-dda1882b79bd, may_exist=True, lrouter=neutron-d1d3b2f2-42cb-4a86-ac5a-77001da8fee2, columns={'mac': u'fa:16:3e:e7:63:b9', 'external_ids': {'neutron:subnet_ids': u'97d41327-4ea6-4fff-859c-9884f6d1632d', 'neutron:revision_number': '3', 'neutron:router_name': u'd1d3b2f2-42cb-4a86-ac5a-77001da8fee2'}, 'networks': [u'10.1.0.1/28']}) {{(pid=32314) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): SetLRouterPortInLSwitchPortCommand(if_exists=True, lswitch_port=1763c78f-5d0d-41d4-acc7-dda1882b79bd, lrouter_port=lrp-1763c78f-5d0d-41d4-acc7-dda1882b79bd, is_gw_port=False, lsp_address=router) {{(pid=32314) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): AddLRouterCommand(may_exist=True, columns={'external_ids': {'neutron:gw_port_id': u'8dc49792-b37a-48be-926a-af2c76e269a9', 'neutron:router_name': u'tempest-TestSecurityGroupsBasicOps-router-1912572360', 'neutron:revision_number': '2'}, 'enabled': True, 'options': {}}, name=neutron-cc163b42-1fdf-4cfa-a2ff-50c521f04222)

DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=3): AddNATRuleInLRouterCommand(lrouter=neutron-d1d3b2f2-42cb-4a86-ac5a-77001da8fee2, columns={'external_ip': u'172.24.5.26', 'type': 'snat', 'logical_ip': '10.1.0.0/28'}) {{(pid=32314) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

Instead, it looks like this:

  1. AddLRouterPortCommand
  2. SetLRouterPortInLSwitchPortCommand
  3. AddLRouterCommand (our faulty router)
  4. AddNATRuleInLRouterCommand

Clearly, number 3 is our command, and it got in the middle of a totally different transaction from a totally different test being executed concurrently in the gate. Most likely, judging from commands 1, 2 and 4, it’s another test adding a new interface to a different router. That will create a logical router port, add it to the switch and update the SNAT rules. All those debug traces come from the same process ID, so the transactions are getting mixed up by two concurrent threads attempting to write to OVSDB.

This is the code that will create a new transaction object in ovsdbapp or, if it was already created, will return the same object. The problem here comes when two different threads/greenthreads attempt to create their own separate transactions concurrently:

    @contextlib.contextmanager
    def transaction(self, check_error=False, log_errors=True, **kwargs):
        """Create a transaction context.

        :param check_error: Allow the transaction to raise an exception?
        :type check_error:  bool
        :param log_errors:  Log an error if the transaction fails?
        :type log_errors:   bool
        :returns: Either a new transaction or an existing one.
        :rtype: :class:`Transaction`
        """
        if self._nested_txn:
            yield self._nested_txn
        else:
            with self.create_transaction(
                    check_error, log_errors, **kwargs) as txn:
                self._nested_txn = txn
                try:
                    yield txn
                finally:
                    self._nested_txn = None

With two tests running concurrently in the same process, this is the sequence of events that leads to the failure (a toy model reproducing it follows the list):

  1. Test 1 opens a new transaction and appends the AddLRouterPortCommand and SetLRouterPortInLSwitchPortCommand commands to it.
  2. Test 1 now yields its execution (due to some I/O, in the case of greenthreads, as eventlet is heavily used in OpenStack).
  3. Test 2 (our failing test!) attempts to create its own transaction.
  4. As everything happens in the same process, self._nested_txn was previously assigned due to step 1 and is returned here.
  5. Test 2 appends the AddLRouterCommand command to it.
  6. At this point, Test 2 also yields its execution, mainly due to the many debug traces emitted while scheduling a gateway port on the available chassis in OVN; this is one of the most likely parts of the code for the race condition to occur.
  7. Test 1 appends the rest of its commands and commits the transaction to OVSDB.
  8. OVSDB executes the commands and closes the transaction.
  9. When Test 2 appends the rest of its commands, the transaction is already closed and will never get committed, i.e. only AddLRouterCommand was applied to the DB, leaving the gateway port behind in OVN.
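
Here’s a self-contained toy model of that interleaving, with plain threads and a Python list standing in for the ovsdbapp transaction (an illustration of the bug, not the real code):

import threading
import time
from contextlib import contextmanager

class FakeAPI(object):
    """Mimics the buggy shared nested-transaction logic shown above."""
    def __init__(self):
        self._nested_txn = None

    @contextmanager
    def create_transaction(self):
        yield []  # a "transaction" is just a list of commands here

    @contextmanager
    def transaction(self):
        # Bug: a single attribute shared by every thread in the process.
        if self._nested_txn is not None:
            yield self._nested_txn
        else:
            with self.create_transaction() as txn:
                self._nested_txn = txn
                try:
                    yield txn
                finally:
                    self._nested_txn = None

api = FakeAPI()

def test1():
    with api.transaction() as txn:
        txn.append('AddLRouterPortCommand')
        txn.append('SetLRouterPortInLSwitchPortCommand')
        time.sleep(0.2)  # simulate yielding on I/O (step 2)
        txn.append('AddNATRuleInLRouterCommand')
    # The transaction "commits" here, with test2's command inside it.
    print('test1 committed:', txn)

def test2():
    time.sleep(0.1)  # start while test1 is asleep (step 3)
    with api.transaction() as txn:  # gets test1's transaction! (step 4)
        txn.append('AddLRouterCommand')  # step 5
        time.sleep(0.3)  # yield again (step 6)
        txn.append('AddLRouterPortCommand-gw')  # step 9: never committed

threads = [threading.Thread(target=f) for f in (test1, test2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Running it prints test1’s committed transaction with AddLRouterCommand at idx=2, just like in the neutron-server log above, while test2’s remaining command is appended to an already-closed transaction.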

The nested transaction mechanism in ovsdbapp was not taking into account that two different threads may attempt to open separate transactions, so we needed to patch it to make it thread-safe. The solution that we came up with was to create a different transaction per thread ID and to add the necessary coverage to the unit tests.
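
A minimal sketch of that idea (the actual ovsdbapp patch differs in its details): key the nested transaction on the current thread’s identifier, so that concurrent threads each get their own transaction. It assumes the threading module is imported, that self._nested_txns_map is a plain dict initialized in the constructor, and that with eventlet monkey-patching the ident is effectively per-greenthread:

    @contextlib.contextmanager
    def transaction(self, check_error=False, log_errors=True, **kwargs):
        # One nested transaction per thread/greenthread instead of a
        # single shared one (assumes self._nested_txns_map = {}).
        cur_thread_id = threading.current_thread().ident
        if cur_thread_id in self._nested_txns_map:
            yield self._nested_txns_map[cur_thread_id]
        else:
            with self.create_transaction(
                    check_error, log_errors, **kwargs) as txn:
                self._nested_txns_map[cur_thread_id] = txn
                try:
                    yield txn
                finally:
                    del self._nested_txns_map[cur_thread_id]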

Two more questions arise at this point:

  • Why is this not happening in Neutron, if it also uses ovsdbapp there?

Perhaps it is happening and we simply don’t notice, but Neutron doesn’t make much use of multiple-command transactions. Instead, most of the time it issues single-command transactions, which are less prone to this particular race condition.

  • Why does it happen mostly when creating a gateway port?

Failing to create a gateway port leads to VMs losing external connectivity, so the failure is very obvious. There might be other situations where this bug happens and we don’t even realize it, producing weird behaviors. However, it’s true that this particular code which creates the gateway port is complex, as it implies scheduling the port on the least loaded available chassis, so a number of DEBUG traces were added. As mentioned throughout this post, writing those traces yields execution to a different greenthread, and that’s where the disaster occurs!

This blog post tries to show a top-down approach to debugging failures that are not easily reproducible. It requires a solid understanding of the system architecture (OpenStack components, OVN, databases) and its underlying technologies (Python, greenthreads) to be able to tackle this kind of race condition in an effective way.

OVN – Profiling and optimizing ports creation

One of the most important abstractions for handling virtual networking in OpenStack clouds is definitely the port. A port is basically a connection point for attaching a single device, such as the NIC of a server, to a network, or a subnet to a router. Ports represent entry and exit points for data traffic, playing a critical role in OpenStack networking.

Given the importance of ports, it’s clear that any operation on them should perform well, especially at scale, and we should be able to catch any regressions ASAP, before they land on production environments. Such tests should be done in a periodic fashion and on consistent hardware with enough resources.

In one of our first performance tests using OVN as the backend for OpenStack networking, we found out that port creation was around 20-30% slower than with ML2/OVS. At this point those are merely DB operations, so there had to be something really odd going on. The first thing I did was to measure the time for a single port creation by creating 800 ports in an empty cloud. Each port would belong to a security group allowing all egress traffic and ingress ICMP and SSH. These are the initial results:

[Chart: time to create a single port vs. number of ports already in the cloud; the time grows linearly]

As you can see, the time for creating a single port grows linearly with the number of ports in the cloud.  This is obviously a problem at scale that requires further investigation.

In order to understand how a port gets created in OVN, it’s recommended to take a look at the Northbound database schema and at this interesting blog post by my colleague Russell Bryant about how Neutron security groups are implemented in OVN. When a port is first created in OVN, the following DB operations will occur:

  1.  Logical_Switch_Port ‘insert’
  2.  Address_Set ‘modify’
  3.  ACL ‘insert’ x8 (one per ACL)
  4.  Logical_Switch ‘modify’ (to add the new 8 ACLs)
  5.  Logical_Switch_Port ‘modify’ (‘up’ = False)

After a closer look at the OVN NB database contents, with 800 ports created there will be a total of 6400 ACLs (8 per port), which all look the same except for the inport/outport fields. A first optimization looks obvious: let’s try to get all those 800 ports into the same group and make the ACLs reference that group, avoiding duplication. There are some details and discussion on the Open vSwitch mailing list here. So far so good; we’ll be cutting the number of ACLs down to 8 in this case, no matter how much the number of ports in the system grows. This would reduce the number of DB operations considerably and scale better.

However, I went deeper and wanted to understand this linear growth through profiling. I’d like to write a separate blog post about how to do the profiling but, so far, what I found is that the ‘modify’ operations on the Logical_Switch table to insert the new 8 ACLs take longer and longer every time (the same goes for the Address_Set table, where a new element is added with every new port). As the number of elements grows in those two tables, inserting a new element becomes more expensive. Digging further into the profiling results, it looks like the “__apply_diff()” method is the main culprit and, more precisely, “Datum.to_json()”. Here’s a link to the actual source code.
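
For reference, here’s a minimal sketch of the kind of harness that surfaces hotspots like this, using the standard library’s cProfile (create_ports is a hypothetical driver that creates the 800 ports through the Neutron API; the actual setup isn’t shown here):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
create_ports(800)  # hypothetical: create 800 ports through the Neutron API
profiler.disable()

# Sorting by cumulative time shows which call chains dominate;
# Datum.to_json() showing up near the top is the smoking gun here.
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(20)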

Whenever a ‘modify’ operation happens in OVSDB, ovsdb-server sends a JSON update notification to all clients (it’s actually an update2 notification, which is essentially the same but includes only the diff with the previous state). This notification is processed by the client (in this case networking-ovn, the OpenStack integration) in the “process_update2” method here, and it happens that the Datum.to_json() method is taking most of the execution time. The main questions that came to my mind at this point are:

  1. How is this JSON data later being used?
  2. Why do we need to convert to JSON if the original notification was already in JSON?
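
For context, this is roughly the shape of the payload a client receives in an update2 notification (a hand-written illustration based on the OVSDB protocol, not a real capture; note that for a set-typed column like acls the “modify” value carries only the diff, not the whole set):

table_updates2 = {
    'Logical_Switch': {
        'ebde771e-a93d-438d-a689-d02e9c91c7cf': {  # row UUID
            'modify': {
                # Only the ACL being added, not the full 'acls' set.
                'acls': ['set', [['uuid',
                                  '03dcbc0f-38b2-42da-8f20-25996044e516']]],
            },
        },
    },
}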

Regarding question 1, it looks like the JSON data is used to build a Row object, which will then be used to notify networking-ovn of the change, just in case it’s monitoring any events:

    def __apply_diff(self, table, row, row_diff):
        old_row_diff_json = {}
        for column_name, datum_diff_json in six.iteritems(row_diff):
            [...]
            old_row_diff_json[column_name] = row._data[column_name].to_json()
            [...]
        return old_row_diff_json
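
# At the call site, the JSON diff produced above is immediately
# converted back into a Row: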
old_row_diff_json = self.__apply_diff(table, row,
                                      row_update['modify'])
self.notify(ROW_UPDATE, row,
            Row.from_json(self, table, uuid, old_row_diff_json))

Doesn’t it look pointless to convert data “to_json” and then back “from_json”? As the number of elements in the row grows, it’ll take longer to produce the associated JSON document of the whole column contents (not just the diff), and the same goes the other way around. This is the linear growth we observed.

I thought of finding a way to build the Row from the data itself before going through the expensive (and useless) JSON conversions. The patch would look something like this:

diff --git a/python/ovs/db/idl.py b/python/ovs/db/idl.py
index 60548bcf5..5a4d129c0 100644
--- a/python/ovs/db/idl.py
+++ b/python/ovs/db/idl.py
@@ -518,10 +518,8 @@ class Idl(object):
             if not row:
                 raise error.Error('Modify non-existing row')
 
-            old_row_diff_json = self.__apply_diff(table, row,
-                                                  row_update['modify'])
-            self.notify(ROW_UPDATE, row,
-                        Row.from_json(self, table, uuid, old_row_diff_json))
+            old_row = self.__apply_diff(table, row, row_update['modify'])
+            self.notify(ROW_UPDATE, row, Row(self, table, uuid, old_row))
             changed = True
         else:
             raise error.Error('<row-update> unknown operation',
@@ -584,7 +582,7 @@ class Idl(object):
                         row_update[column.name] = self.__column_name(column)
 
     def __apply_diff(self, table, row, row_diff):
-        old_row_diff_json = {}
+        old_row = {}
         for column_name, datum_diff_json in six.iteritems(row_diff):
             column = table.columns.get(column_name)
             if not column:
@@ -601,12 +599,12 @@ class Idl(object):
                           % (column_name, table.name, e))
                 continue
 
-            old_row_diff_json[column_name] = row._data[column_name].to_json()
+            old_row[column_name] = row._data[column_name].copy()
             datum = row._data[column_name].diff(datum_diff)
             if datum != row._data[column_name]:
                 row._data[column_name] = datum
 
-        return old_row_diff_json
+        return old_row
 
     def __row_update(self, table, row, row_json):
         changed = False

Now that we have everything in place, it’s time to repeat the tests with the new code and compare the results:

[Chart: time to create a single port with the patch applied; roughly constant regardless of the number of existing ports]

The improvement is obvious! Now the time to create a port is mostly constant regardless of the number of ports in the cloud, and the best part is that this gain is not specific to port creation: it applies to ANY modify operation taking place in OVSDB through the Python implementation of the Open vSwitch IDL.

The intent of this blog post is to show how to deal with performance bottlenecks through profiling and debugging in a top-down fashion: first through simple API calls, measuring the time they take to complete, and then all the way down to the database change notifications.

Daniel