Tagneutron

Trunk ports in OVN

The purpose of this post is explaining the low level details of how trunking works in OVN. A typical use case of trunk ports could be the use of nested VMs or containers inside VMs; where all the traffic is directed to the vNIC of the virtual machine and then forwarded to the right container based on their VLAN ID. For more context around the feature and use cases, please check out the OpenStack documentation guide.

Let's take a very simple topology for our deep dive. You can also deploy it in your machine using this vagrant setup and replay the commands as we go 🙂

This sample setup has two Logical Switches and two ports on each of them. The physical layout is as follows:

  • vm1 bound to worker1
  • vm2 bound to worker2
  • child1 (VLAN 30) inside vm1
  • child2 (VLAN 50) inside vm2

Let's quickly check the OVN databases info:

[root@central vagrant]# ovn-nbctl show
switch db4e7781-370c-4439-becd-35803c0e3f12 (network1)
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.0.11"]
    port vm2
        addresses: ["40:44:00:00:00:02 192.168.0.12"]
switch 40ac144b-a32a-4202-bce2-3329f8f3e98f (network2)
    port child1
        parent: vm1
        tag: 30
        addresses: ["40:44:00:00:00:03 192.168.1.13"]
    port child2
        parent: vm2
        tag: 50
        addresses: ["40:44:00:00:00:04 192.168.1.14"]



[root@central vagrant]# ovn-sbctl show
Chassis worker2
    hostname: worker2
    Encap geneve
        ip: "192.168.50.101"
[root@central vagrant]# ovn-sbctl show
Chassis worker2
    hostname: worker2
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding child2
    Port_Binding vm2
Chassis worker1
    hostname: worker1
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding child1
    Port_Binding vm1

Instead of booting actual VMs and containers, I simulated it with network namespaces and VLAN devices inside them:

[root@worker1 vagrant]# ip netns exec vm1 ip -d link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
2: child1@vm1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:03 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 30 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
24: vm1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:01 brd ff:ff:ff:ff:ff:ff promiscuity 2
    openvswitch addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535


[root@worker2 vagrant]# ip netns exec vm2 ip -d link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
2: child2@vm2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:04 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 50 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
15: vm2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:02 brd ff:ff:ff:ff:ff:ff promiscuity 2
    openvswitch addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Now, as you can see, none of the subports (child1 and child2) are connected directly to the integration bridge so both vm1 and vm2 ports act as trunk ports for VLAN IDs 30 and 50. OVN will install flows to tag/untag the traffic directed to/from these ports.

 

Traffic and OpenFlow analysis

To illustrate this, let's ping from child1 (worker1) to child 2 (worker2):

[root@worker1 vagrant]# ip netns exec vm1 ping 192.168.1.14
PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data.
64 bytes from 192.168.1.14: icmp_seq=21 ttl=64 time=0.824 ms
64 bytes from 192.168.1.14: icmp_seq=22 ttl=64 time=0.211 ms

The traffic will arrive tagged to the vm1 interface and will be sent out untagged to worker2 (where vm2 is bound) via the Geneve tunnel:

[root@worker1 ~]# ip netns exec vm1 tcpdump -vvnee -i vm1 icmp -c1

tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
13:18:16.980650 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype 802.1Q (0x8100), length 102: vlan 30, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 55255, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 9833, seq 176, length 64



[root@worker1 ~]# tcpdump -vvneei genev_sys_6081 icmp -c1

tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
13:19:11.980671 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 16226, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 9833, seq 231, length 64

On worker1, let's inspect the OVS flows that determine the source network/port based on the VLAN ID:

[root@worker1 ~]# ovs-ofctl dump-flows br-int table=0 |grep vlan

 cookie=0x4b8d6fa5, duration=337116.380s, table=0, n_packets=270983, n_bytes=27165634, idle_age=0, hard_age=65534, priority=150,in_port=12,dl_vlan=30 actions=load:0x1->NXM_NX_REG10[5],strip_vlan,load:0x5->NXM_NX_REG13[],load:0x1->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],resubmit(,8)

The flow above in table 0 matches on the VLAN tag (dl_vlan=30). Also, note that there's no matching flow for VLAN 50 as vm2 is not bound to worker1.

As each parent should have subports with unique VLAN IDs, this ID will determine the source port (nested VM or container) that is sending the traffic. In our example, this will be child1 as it is the subport tagged with VLAN 30. In the actions section of the table 0 flow, the packet will be untagged (strip_vlan action), and the relevant registers will be populated to identify both the subport network and the logical input port:

  • The packet is coming from OF port 12 (in_port=12) which corresponds to vm1
[root@worker1 ~]# ovs-ofctl show br-int | grep vm1
 12(vm1): addr:40:44:00:00:00:01
  • The network identifier (metadata) is populated with the value 2 (load:0x2->OXM_OF_METADATA[]) which corresponds to the network of the subport (network2)
[root@central ~]# ovn-sbctl find datapath_binding tunnel_key=2
_uuid               : 2d762d73-5ab9-4f43-a303-65a6046e41e7
external_ids        : {logical-switch="40ac144b-a32a-4202-bce2-3329f8f3e98f", name=network2}
load_balancers      : []
tunnel_key          : 2
  • The logical input port (register 14) will be populated with the tunnel key of the child1 subport (load:0x1->NXM_NX_REG14[])
[root@central vagrant]# ovn-sbctl get port_binding child1 tunnel_key
1
  • Now the pipeline execution with the untagged packet gets resumed from table 8 (resubmit(,8)). Eventually it gets sent through the tunnel to worker2, where the parent (vm2) of the destination port (child2) is bound to.

 

Let's inspect the traffic and flows on worker2, the destination hypervisor:

The traffic arrives untagged to br-int from the Geneve interface and later gets delivered to the vm2 interface tagged with the child2 VLAN ID (50).

[root@worker2 vagrant]# tcpdump -vvnee -i genev_sys_6081 icmp -c1
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
13:57:25.000587 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 56431, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 10218, seq 31, length 64


[root@worker2 vagrant]# ip netns exec vm2 tcpdump -vvneei vm2 icmp -c1
tcpdump: listening on vm2, link-type EN10MB (Ethernet), capture size 262144 bytes
13:57:39.000617 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype 802.1Q (0x8100), length 102: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 59701, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 10218, seq 45, length 64

The packet processing takes place as with any regular VIF but in the output stage, the traffic will be tagged before it is sent out to the vm2 interface:

[root@worker2 vagrant]# ovs-ofctl dump-flows br-int |grep mod_vlan_vid:50
 cookie=0x969b8a9d, duration=338995.147s, table=65, n_packets=263914, n_bytes=25814112, idle_age=0, hard_age=65534, priority=100,reg15=0x2,metadata=0x2 actions=mod_vlan_vid:50,output:4,strip_vlan

[root@worker2 vagrant]# ovs-ofctl show br-int | grep vm2
 4(vm2): addr:00:00:00:00:00:00

 

As you can see, the way that OVN implements this feature is very simple and only adds a couple of extra flows. Hope that this article helps understanding the details of trunk ports and how it's leveraged by projects like Kuryr to run Kubernetes on top of OpenStack.

OpenStack TripleO networking layout

The goal of this post is to describe how network isolation is typically achieved for both the control and data planes in OpenStack using TripleO. In particular, how all this happens in a virtual setup, using one baremetal node (hypervisor, from now on) to deploy the OpenStack nodes with libvirt. For the purpose of this post, we'll work with a 3 controllers + 1 compute virtual setup.

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+
| ID | Name | Status | Networks |
+--------------------------------------+--------------+--------+------------------------+
| b3bd5157-b3ea-4331-91af-3820c4e12252 | controller-0 | ACTIVE | ctlplane=192.168.24.15 |
| 6f228b08-49a0-4b68-925a-17d06224d5f9 | controller-1 | ACTIVE | ctlplane=192.168.24.37 |
| e5c649b5-c968-4293-a994-04293cb16da1 | controller-2 | ACTIVE | ctlplane=192.168.24.10 |
| 9f15ed23-efb1-4972-b578-7b0da3500053 | compute-0 | ACTIVE | ctlplane=192.168.24.14 |
+--------------------------------------+--------------+--------+------------------------+

The tool used to deploy this setup is Infrared (documentation) which is an easy-to-use wrapper around TripleO. Don't be scared about the so many layers involved here; the main point is to understand that a physical - and somewhat powerful - server is running an OpenStack cluster formed by:

  • 3 virtual controllers that run the OpenStack control plane services (Neutron, Nova, Glance, ...)
  • 1 virtual compute node that will serve to host the workloads (virtual machines) of the OpenStack cluster 

From a Networking perspective (I'll omit the undercloud for simplicity), things are wired like this:

Let's take a look at the bridges in the hypervisor node:

[root@hypervisor]# brctl show

bridge name     bridge id               STP enabled     interfaces
management      8000.525400cc1d8b       yes             management-nic
                                                        vnet0
                                                        vnet12
                                                        vnet3
                                                        vnet6
                                                        vnet9

external        8000.5254000ceb7c       yes             external-nic
                                                        vnet11
                                                        vnet14
                                                        vnet2
                                                        vnet5
                                                        vnet8

data            8000.5254007bc90a       yes             data-nic
                                                        vnet1
                                                        vnet10
                                                        vnet13
                                                        vnet4
                                                        vnet7

Each bridge has 6 ports (3 controllers, 1 compute, 1 undercloud, and the local port in the hypervisor). Now, each virtual machine running in this node can be mapped to the right interface:

[root@hypervisor]# for i in controller-0 controller-1 controller-2 compute-0; do virsh domiflist $i; done


 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet9       network   management   virtio   52:54:00:74:29:4f
 vnet10      network   data         virtio   52:54:00:1c:44:26
 vnet11      network   external     virtio   52:54:00:20:3c:4e

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet3       network   management   virtio   52:54:00:0b:ad:3b
 vnet4       network   data         virtio   52:54:00:2f:9f:3e
 vnet5       network   external     virtio   52:54:00:75:a5:ed

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet6       network   management   virtio   52:54:00:da:a3:1e
 vnet7       network   data         virtio   52:54:00:57:26:67
 vnet8       network   external     virtio   52:54:00:2c:21:d5

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet0       network   management   virtio   52:54:00:de:4a:38
 vnet1       network   data         virtio   52:54:00:c7:74:4b
 vnet2       network   external     virtio   52:54:00:22:de:5c

Network configuration templates

This section will go through the Infrared/TripleO configuration to understand how this layout was defined. This will also help the reader to change the CIDRs, VLANs, number of virtual NICs, etc.

First, the deployment script:

$ cat overcloud_deploy.sh
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/docker-images.yaml \
--log-file overcloud_deployment_99.log

Now, let's take a look at the network related templates to understand the different networks and how they map to the physical NICs inside the controllers/compute nodes:

$ grep -i -e cidr -e vlan /home/stack/virt/network/network-environment.yaml
ControlPlaneSubnetCidr: '192.168.24.0/24'

ExternalNetCidr: 10.0.0.0/24
ExternalNetworkVlanID: 10

InternalApiNetCidr: 172.17.1.0/24
InternalApiNetworkVlanID: 20

StorageMgmtNetCidr: 172.17.4.0/24
StorageMgmtNetworkVlanID: 40

StorageNetCidr: 172.17.3.0/24
StorageNetworkVlanID: 30

TenantNetCidr: 172.17.2.0/24
TenantNetworkVlanID: 50

NeutronNetworkVLANRanges: tenant:1000:2000

OS::TripleO::Compute::Net::SoftwareConfig: three-nics-vlans/compute.yaml
OS::TripleO::Controller::Net::SoftwareConfig: three-nics-vlans/controller.yaml

In the output above you can see 6 different networks:

  • ControlPlane (flat): used mainly for provisioning (PXE) and remote access to the nodes via SSH.
  • External (VLAN 10): external network used for dataplane floating IP traffic and access to the OpenStack API services via their external endpoints.
  • InternalApi (VLAN 20): network where the OpenStack control plane services will listen for internal communication (eg. Neutron <-> Nova).
  • StorageMgmt (VLAN 40): network used to manage the storage (in this deployment, swift-object-server, swift-container-server, and swift-account-server will listen to requests on this network)   
  • Storage (VLAN 30): network used for access to the Object storage (in this deployment, swift-proxy will listen to requests on this network).
  • Tenant: this network will carry the overlay tunnelled traffic (Geneve for OVN, VXLAN in the case of ML2/OVS) in the VLAN 50 but will also carry dataplane traffic if VLAN tenant networks are used in Neutron. The VLAN range allowed for such traffic is specified also in the template (in the example, VLAN ids ranging from 1000-2000 are reserved for Neutron tenant networks).

The way that each NIC is mapped to each network is defined in the yaml files below. For this deployment, I used a customized layout via this patch (controller.yaml and compute.yaml). Essentially, the mapping looks like this:

  • Controllers:
    • nic1: ControlPlaneIp (flat); InternalApi (20), Storage (30) , StorageMgmt (40), VLAN devices
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic 
  • Compute:
    • nic1: ControlPlaneIp (flat); InternalApi (20), Storage (30), VLAN devices 
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic 

The nodes map nic1, nic2, nic3 to ens4, ens5, ens6 respectively:

[root@controller-0 ~]# ip l | egrep "vlan[2-4]0"
9: vlan20@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
10: vlan30@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: vlan40@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

[root@controller-0 ~]# ovs-vsctl list-ports br-tenant
ens4
vlan50

[root@controller-0 ~]# ovs-vsctl list-ports br-ex
ens5

In the controller nodes we'll find an haproxy instance load balancing the requests to the different nodes and we can see here the network layout as well:

[root@controller-1 ~]# podman exec -uroot -it haproxy-bundle-podman-1 cat /etc/haproxy/haproxy.cfg

listen neutron
  bind 10.0.0.122:9696 transparent      <--- External network
  bind 172.17.1.48:9696 transparent     <--- InternalApi network
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk
  option httplog
# Now the backends in the InternalApi network
  server controller-0.internalapi.local 172.17.1.72:9696 check fall 5 inter 2000 rise 2
  server controller-1.internalapi.local 172.17.1.101:9696 check fall 5 inter 2000 rise 2
  server controller-2.internalapi.local 172.17.1.115:9696 check fall 5 inter 2000 rise 2

In the above output, the IP address 172.17.1.48 is a virtual IP managed by pacemaker and will live in the InternalApi (VLAN 20) network where it is master:

[root@controller-1 ~]# pcs status | grep 172.17.1.48
  * ip-172.17.1.48      (ocf::heartbeat:IPaddr2):       Started controller-0

[root@controller-0 ~]# ip a |grep 172.17.1.48
    inet 172.17.1.48/32 brd 172.17.1.255 scope global vlan20

Traffic inspection

With a clear view on the networking layout, now we can use the hypervisor to hook a tcpdump in the right bridge and check for whatever traffic we're interested in.

Let's for example ping from the InternalApi (172.17.1.0/24) network on controller-0 to controller-1 and check the traffic in the hypervisor:

[heat-admin@controller-0 ~]$ ping controller-1.internalapi.local
PING controller-1.internalapi.redhat.local (172.17.1.101) 56(84) bytes of data.
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=2 ttl=64 time=0.096 ms


[root@hypervisor]# tcpdump -i management -vvne icmp -c2
tcpdump: listening on management, link-type EN10MB (Ethernet), capture size 262144 bytes
15:19:08.418046 52:54:00:74:29:4f > 52:54:00:0b:ad:3b, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 58494, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.1.72 > 172.17.1.101: ICMP echo request, id 53086, seq 5, length 64 15:19:08.418155 52:54:00:0b:ad:3b > 52:54:00:74:29:4f, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 39897, offset 0, flags [none], proto ICMP (1), length 84) 172.17.1.101 > 172.17.1.72: ICMP echo reply, id 53086, seq 5, length 64 [root@hypervisor]# brctl showmacs management | egrep "52:54:00:0b:ad:3b|52:54:00:74:29:4f" port no mac addr is local? ageing timer 3 52:54:00:0b:ad:3b no 0.01 5 52:54:00:74:29:4f no 0.01

When we ping to the controller-1 IP address of the InternalApi network, the traffic is tagged (VLAN 20) and going through the management bridge in the hypervisor. This matches our expectations as we defined such network in the template files that way.

Similarly, we could trace more complicated scenarios like an OpenStack instance in a tenant network pinging an external destination:

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+---------+--------+-----------------------+--------+
| ID | Name | Status | Networks | Image |
+--------------------------------------+---------+--------+-----------------------+--------+
| 3d9f6957-5311-4590-8c62-097b576ffa04 | cirros1 | ACTIVE | private=192.168.0.166 | cirros |
+--------------------------------------+---------+--------+-----------------------+--------+
[root@compute-0 ~]# sudo ip net e ovnmeta-e49cc182-247c-4dc9-9589-4df6fcb09511 ssh cirros@192.168.0.166 cirros@192.168.0.166's password: $ ping 8.8.8.8 PING 8.8.8.8 (8.8.8.8): 56 data bytes 64 bytes from 8.8.8.8: seq=0 ttl=53 time=10.356 ms 64 bytes from 8.8.8.8: seq=1 ttl=53 time=8.591 ms

Now in the hypervisor, we'll trace the Geneve traffic (VLAN50):

# tcpdump -i data -vvnne vlan 50 and "(udp port 6081) and (udp[10:2] = 0x6558) and (udp[(8 + (4 * (2 + (udp[8:1] & 0x3f))) + 12):2] = 0x0800) and (udp[8 + (4 * (2 + (udp[8:1] & 0x3f))) + 14 + 9:1] = 01)"  -c2

tcpdump: listening on data, link-type EN10MB (Ethernet), capture size 262144 bytes
16:21:28.642671 6a:9b:72:22:3f:68 > 0e:d0:eb:00:1b:e7, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 15872, offset 0, flags [DF], proto UDP (17), length 142) 172.17.2.119.27073 > 172.17.2.143.6081: [bad udp cksum 0x5db4 -> 0x1e8c!] Geneve, Flags [C], vni 0x5, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00010003] fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 50335, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.166 > 8.8.8.8: ICMP echo request, id 2818, seq 2145, length 64 16:21:28.650412 0e:d0:eb:00:1b:e7 > 6a:9b:72:22:3f:68, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 26871, offset 0, flags [DF], proto UDP (17), length 142) 172.17.2.143.31003 > 172.17.2.119.6081: [bad udp cksum 0x5db4 -> 0x4a04!] Geneve, Flags [C], vni 0x3, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00040002] fa:16:3e:34:a2:0e > fa:16:3e:63:c0:7a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 53, id 0, offset 0, flags [none], proto ICMP (1), length 84) 8.8.8.8 > 192.168.0.166: ICMP echo reply, id 2818, seq 2145, length 64

(First, sorry for the complicated filter; I picked it up from here and adapted it to match on the inner protocol of the Geneve traffic against ICMP. If there's an easier way please tell me :p)

We can see that the Geneve traffic goes between 6a:9b:72:22:3f:68 and 0e:d0:eb:00:1b:e7 and now we can determine the source/dest nodes:

[root@hypervisor]# brctl showmacs data
  2     6a:9b:72:22:3f:68       no                 0.32
  2     fe:54:00:c7:74:4b       yes                0.00
  2     fe:54:00:c7:74:4b       yes                0.00
  3     0e:d0:eb:00:1b:e7       no                 0.40
  3     fe:54:00:2f:9f:3e       yes                0.00
  3     fe:54:00:2f:9f:3e       yes                0.00

From the info above we can see that port 2 corresponds to the MAC ending in "74:4b" and port 3 corresponds to the MAC ending in "9f:3e". Therefore, this Geneve traffic is flowing from the compute-0 node to the controller-1 node which is where Neutron is running the gateway to do the SNAT towards the external network. Now, this last portion can be examined in the external bridge:

[root@hypervisor]# tcpdump -i external icmp -vvnnee -c2
tcpdump: listening on external, link-type EN10MB (Ethernet), capture size 262144 bytes
16:33:35.016198 fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 13537, offset 0, flags [DF], proto ICMP (1), length 84) 10.0.0.225 > 8.8.8.8: ICMP echo request, id 4354, seq 556, length 64 16:33:35.023570 52:54:00:0c:eb:7c > fa:16:3e:a7:95:87, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 54, id 0, offset 0, flags [none], proto ICMP (1), length 84) 8.8.8.8 > 10.0.0.225: ICMP echo reply, id 4354, seq 556, length 64

In case that you're wondering what's 10.0.0.225; that's the IP address of the Neutron gateway:

(overcloud) [stack@undercloud-0 ~]$ openstack router show router1 | grep gateway
| external_gateway_info   | {"network_id": "fe8330fe-540a-4acf-bda8-394398fb4272", "external_fixed_ips": [{"subnet_id": "e388a080-1953-4cdd-9e35-48d416fe2ae1", "ip_address": "10.0.0.225"}

Similarly, the MAC addresses can be matched to confirm that the traffic goes from the gateway node (controller-1), as the MAC ending in "a5:ed"  - in the same port as the source MAC from the ICMP packet - corresponds to the NIC attached to the external network on the controller-1.

[root@hypervisor]# brctl showmacs external
  3     fa:16:3e:a7:95:87       no                 0.47
  3     fe:54:00:75:a5:ed       yes                0.00
  3     fe:54:00:75:a5:ed       yes                0.00

Reflection

This is a virtual setup and everything is confined to the boundaries of a physical server. However, it is a great playground to get yourself familiar with the underlay networking of an OpenStack setup (and networking in general ;). Once you get your hands on a real production environment, all these Linux bridges will be replaced by ToR switches (or even routers on a pure L3 Spine & Leaf architecture) but the fundamentals are the same.

Improving OpenStack Neutron memory footprint

Recently, the Performance & Scale team at Red Hat ran some tests to stress both the control and data planes of OpenStack. One of the biggest issues detected during that exercise was the memory consumption of all the Neutron workers across the controller nodes raising all the way up to 75 GB of RSS:

 

 

The team did some analysis and we determined that there were close to a million MAC_Binding entries in the OVN Southbound database. These entries are kept in memory by Neutron and they never age out so the memory just grows and grows.

     MAC bindings

       The MAC_Binding table tracks the bindings from IP addresses to Ethernet
       addresses that are dynamically discovered  using  ARP  (for  IPv4)  and
       neighbor  discovery (for IPv6). Usually, IP-to-MAC bindings for virtual
       machines are statically  populated  into  the  Port_Binding  table,  so
       MAC_Binding  is  primarily  used  to discover bindings on physical net‐
       works.

The MAC_Binding table is populated by ovn-controller when it sees a new ARP/ND packet in the network even if they don't belong to OVN. It is common in OpenStack deployments to have multiple tenants connecting their routers to a relatively large provider network and once a new MAC address is learned, OVN will add one MAC_Binding entry per router connected to the external network.

In this particular exercise, the external network was a /16 and we observed close to 1M entries. This doesn't only pose a memory problem but also a lot of network traffic and stress to the OVN Southbound server which needs to commit the transactions from the ovn-controllers and send out the notifications to all the clients.

Why does Neutron care about the MAC_Binding table?

The problem - and its workaround - is described here but, in short, it is very common in OpenStack to reuse a Floating IP address (eg. during testing on CI) and Neutron implements a mechanism to delete the MAC address associated to a Floating IP from the MAC_Binding entry in order to force learning the new MAC address when needed.

For Neutron to do this, we monitored the table forcing ourselves to keep an in-memory copy of all its entries. Since these entries do not age out, the most likely scenario is that we'll hit OOM killers eventually - depending on the network topology, network size and other factors -.

The "solution"

Ideally, core OVN should implement a mechanism to eliminate stale MAC_Binding entries after a certain time but that's not an easy task as discussed in the links above. In the meantime, we thought of a way to stop monitoring the problematic table but, at the same time, keep the mechanism of deleting Floating IP MAC addresses upon association and disassociation of them from Neutron. The adopted solution was to invoke an external tool, ovsdb-client, to be able to delete such entries by:

  • Avoid monitoring the MAC_Binding table
  • Avoid downloading the database contents, hence, deleting those entries in a constant time despite the database size

The patch got recently merged into the Neutron repository and the result is that, running the same tests, the overall memory consumption decreased by an order of magnitude from ~75GB to 7.5GB of RSS, without impacting the run time of the exercise.

This great improvement was thanks to the collaboration of the Performance & Scale, Neutron and core OVN teams at Red Hat. I'm lucky to work with such great and smart people 😉

OVN external ports

Some time back a new type of virtual port was introduced in OVN: external ports. The initial motivation was to support two main use cases in OpenStack:

  1. SR-IOV:  For this type of workloads, the VM will bypass the hypervisor by accessing the physical NIC directly on the host. This means that the traffic sent out by the VM which requires a controller action (such as DHCP requests or IGMP traffic) will not hit the local OVN bridge and will be missed. You can read more about SR-IOV in this excellent blog post.
  2. Baremetal provisioning: Similarly, when provisioning a baremetal server, DHCP requests for PXE booting will be missed as those ports are not bound in OVN so even they're sent in broadcast, all the ovn-controller instances in the cluster will ignore them.

For both cases we needed to have a way in OVN to process those requests coming from ports that are not bound to the local hypervisor and the solution was to implement the OVN external ports.

Testing scenario

From a pure OVN perspective, I created the following setup that will hopefully help understand the details and how to troubleshoot this type of ports.

  • Two private networks:
    • network1: 192.168.0.0/24 of type VLAN and ID 190
    • network2: 192.168.1.0/24 of type VLAN and ID 170
  • Two provider networks:
    • external: 172.24.14.0/24 used for Floating IP traffic
    • tenant: used by network1 and network2 VLAN networks
  • One Logical router that connects both private networks and the Floating IP network
    • router1-net1: "40:44:00:00:00:03" "192.168.0.1/24"
    • router1-net2: "40:44:33:00:00:05" "192.168.1.1/24"
    • router1-public: "40:44:00:00:00:04" "172.24.14.1/24"
  • 4 Virtual Machines (simulated with network namespaces), 2 on each network
    • vm1: addresses: ["40:44:00:00:00:01 192.168.0.11"]
    • vm2: addresses: ["40:44:00:00:00:02 192.168.0.12"]
    • vm3: addresses: ["40:44:33:00:00:03 192.168.1.13"]
    • pext: addresses: ["40:44:44:00:00:10 192.168.1.111"]

The physical layout involves the following nodes:

  • Worker nodes:
    • They have one NIC connected to the Geneve overlay network and another NIC on the provider network using the same OVS bridge (br-ex) for the flat external network and for both VLAN tenant networks.
  • Gateway nodes:
    • Same network configuration as the worker nodes 
    • gw1: which hosts the router gateway port for the Logical Router
    • gw2: which hosts the external port (pext)
  • Host
    • This server is just another machine which has access to the provider networks and no OVN/OVS components are running on it.
    • To illustrate different scenarios, during the provisioning phase, it'll be configured with a network namespace connected to an OVS bridge and a VLAN device on the network2 (VLAN 170) with the MAC address of the external port.

A full vagrant setup can be found here that will deploy and configure the above setup for you in less than 10 minutes 🙂

The configuration of the external port in the OVN NorthBound database can be found here:

 

# Create the external port
ovn-nbctl lsp-add network2 pext
ovn-nbctl lsp-set-addresses pext \
          "40:44:44:00:00:10 192.168.1.111"
ovn-nbctl lsp-set-type pext external

# Schedule the external port in gw2
ovn-nbctl --id=@ha_chassis  create HA_Chassis \
          chassis_name=gw2 priority=1 -- \
          --id=@ha_chassis_group create \
          HA_Chassis_Group name=default2 \
          ha_chassis=[@ha_chassis] -- \
          set Logical_Switch_Port pext \
          ha_chassis_group=@ha_chassis_group

DHCP

When using a regular OVN port, the DHCP request from the VM hits the integration bridge and, via a controller action, is served by the local ovn-controller instance. Also, the DHCP request never leaves the hypervisor.

However, when it comes to external ports, the broadcast request will be processed by the chassis where the port is scheduled and the ovn-controller instance running there will be responsible of serving DHCP for it.

Let's get to our setup and issue the request from host1:

[root@host1 vagrant] ip netns exec pext dhclient -v -i pext.170 --no-pid
Listening on LPF/pext.170/40:44:44:00:00:10
Sending on   LPF/pext.170/40:44:44:00:00:10
Sending on   Socket/fallback
DHCPREQUEST on pext.170 to 255.255.255.255 port 67 (xid=0x5149c1a3)
DHCPACK from 192.168.1.1 (xid=0x56bf40c1)
bound to 192.168.1.111 -- renewal in 1667 seconds.

Since we scheduled the external port on the gw2, we'd expect the request being handled there which we can check by inspecting the relevant logs:

[root@gw2 ~] tail -f /usr/var/log/ovn/ovn-controller.log
2020-09-08T09:01:52.547Z|00007|pinctrl(ovn_pinctrl0)
|INFO|DHCPACK 40:44:44:00:00:10 192.168.1.111

Below you can see the flows installed to handle DHCP in the gw2 node:

table=22, priority=100, udp, reg14=0x2, metadata=0x3, dl_src=40:44:44:00:00:10, nw_src=192.168.1.111, nw_dst=255.255.255.255, tp_src=68, tp_dst=67 actions=controller(...

table=22, priority=100, udp, reg14=0x2, metadata=0x3, dl_src=40:44:44:00:00:10, nw_src=0.0.0.0, nw_dst=255.255.255.255, tp_src=68, tp_dst=67 actions=controller(...

table=22, priority=100, udp, reg14=0x2, metadata=0x3, dl_src=40:44:44:00:00:10, nw_src=192.168.1.111, nw_dst=192.168.1.1, tp_src=68, tp_dst=67 actions=controller(...

Regular L2 traffic path

This is the easy path. As the external port MAC/IP addresses are known to OVN, whenever the packet arrives to the destination chassis via a localnet port they'll be processed normally and delivered to the output port. No extra hops are observed:

Let's ping from the external port pext to vm3 which are in the same network. The expected path is for the ICMP packets to exit host1 from eth1 tagged with VLAN 170 and reach worker2 on eth2:

[root@host1 vagrant]# ip netns exec pext ping 192.168.1.13 -c2
PING 192.168.1.13 (192.168.1.13) 56(84) bytes of data.
64 bytes from 192.168.1.13: icmp_seq=1 ttl=64 time=0.331 ms
64 bytes from 192.168.1.13: icmp_seq=2 ttl=64 time=0.299 ms

--- 192.168.1.13 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.299/0.315/0.331/0.016 ms

Traffic from host1 arrives directly with the destination MAC address of the vm3 port and the reply is received from that same MAC:

[root@host1 ~]# tcpdump -i eth1 -vvnee
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
14:45:22.919136 40:44:44:00:00:10 > 40:44:33:00:00:03, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 14603, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.1.13: ICMP echo request, id 19425, seq 469, length 64


14:45:22.919460 40:44:33:00:00:03 > 40:44:44:00:00:10, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 13721, offset 0, flags [none], proto ICMP (1), length 84)
192.168.1.13 > 192.168.1.111: ICMP echo reply, id 19425, seq 469, length 64

At worker2, we see the traffic coming in from the eth2 NIC on the network2 VLAN 170:

[root@worker2 ~]# tcpdump -i eth2 -vvne
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
14:43:33.009370 40:44:44:00:00:10 > 40:44:33:00:00:03, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 22650, offset 0, flags [DF], proto ICMP (1), length 84)


192.168.1.111 > 192.168.1.13: ICMP echo request, id 19425, seq 360, length 64
14:43:33.009472 40:44:33:00:00:03 > 40:44:44:00:00:10, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 23154, offset 0, flags [none], proto ICMP (1), length 84)
192.168.1.13 > 192.168.1.111: ICMP echo reply, id 19425, seq 360, length 64

Routed traffic path

In this example, we're going to ping from the external port pext to vm1 (192.168.0.11) located in worker1. Now, the packet will exit the host1 with the destination MAC address of the router port (192.168.1.1). Since the traffic is originated by pext and it is bound to the gateway gw2, the routing will happen there.

 

The router pipeline, and network1 and 2 pipelines will run in the gw2 and, from here, the packet goes out already to the network2 (VLAN 190) to worker1.

 

  • Ping from external port pext to vm1

[root@host1 ~]# ip netns exec pext ping 192.168.0.11 -c2
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=63 time=2.67 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=63 time=0.429 ms

--- 192.168.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.429/1.553/2.678/1.125 ms

[root@host1 ~]# ip net e pext ip neigh
192.168.1.1 dev pext.170 lladdr 40:44:33:00:00:05 REACHABLE

[root@host1 ~]# tcpdump -i eth1 -vvnee
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:18:52.285722 40:44:44:00:00:10 > 40:44:33:00:00:05, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 15048, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 1257, seq 6, length 64

  • Packet arrives to gw2 for routing:

[root@gw2 ~]# tcpdump -i eth2 -vvne icmp
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
16:19:40.901874 40:44:44:00:00:10 > 40:44:33:00:00:05, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 41726, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 1257, seq 56, length 64

  • And from gw2, the packet is sent to the destination vm vm1 on the VLAN 190 network:

12:31:13.737551 40:44:44:00:00:10 > 40:44:33:00:00:05, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 35405, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 6103, seq 5455, length 64
12:31:13.737583 1e:02:ad:bb:aa:cc > 40:44:00:00:00:01, ethertype 802.1Q (0x8100), length 102: vlan 190, p 0, ethertype IPv4, (tos 0x0, ttl 63, id 35405, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 6103, seq 5455, length 64

  • At worker1, the packet is delivered to the tap port of vm1 and this will reply to the ping

[root@worker1 ~]# ip netns exec vm1 tcpdump -i vm1 -vvne icmp
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:21:38.561881 40:44:00:00:00:03 > 40:44:00:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 32180, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 1278, seq 18, length 64
16:21:38.561925 40:44:00:00:00:01 > 40:44:00:00:00:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 25498, offset 0, flags [none], proto ICMP (1), length 84)
192.168.0.11 > 192.168.1.111: ICMP echo reply, id 1278, seq 18, length 64

  • The ICMP echo reply packet will be sent directly to pext through the localnet port to the physical network tagged with VLAN 170:

[root@host1 ~]# tcpdump -vvnne -i eth1 icmp
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
12:35:00.322735 1e:02:ad:bb:aa:77 > 40:44:44:00:00:10, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 63, id 18741, offset 0, flags [none], proto ICMP (1), length 84)
192.168.0.11 > 192.168.1.111: ICMP echo reply, id 6103, seq 5681, length 644

You might have noticed that the source MAC address (1e:02:ad:bb:aa:77) in the last step doesn't correspond to any OVN logical port. This is because, in order for the routing to be distributed, OVN rewrites this MAC address to that of external_ids:ovn-chassis-mac-mappings. In our case, this tell us as well that the traffic is coming directly from worker1 saving the extra hop that we see for the pext -> vm1 path via gw2.

[root@worker1 ~]# ovs-vsctl get open . external_ids:ovn-chassis-mac-mappings
"tenant:1e:02:ad:bb:aa:77"

Floating IP traffic path

The FIP traffic path (for non Distributed Virtual Routing - DVR -  case) is similar to the case that we just described except for the fact that since we require to traverse the distributed gateway port, our traffic will take an extra hop.

Example of ping from pext to the Floating IP of vm1 (172.24.14.100):

  1. The packet goes from host1 to gw2 via the VLAN 170 network with the destination MAC address of the network2 router port.
  2. From gw2, the traffic will be steered to the gw1 node which is hosting the distributed gateway port. This traffic is sent out via the Geneve overlay tunnel.
  3. gw1 will perform the routing and send the traffic to the FIP of vm1 via the public flat network (172.24.14.0/24) and src IP that of the SNAT (172.24.14.1).
  4. The request arrives to worker1 where ovn-controller un-NATs the packet to the vm1 (192.168.0.11) and delivers it to the tap interface. The reply from the vm1 is NATed to the FIP (172.14.14.100) and sent back to the router at gw1 via the flat network.
  5. gw1 will perform the routing to the network2 and push the reply packet directly onto the network2 with VLAN 190 which will be received at host1.

OpenStack

For use cases like SR-IOV workloads, OpenStack is responsible for creating an OVN port of type 'external' in the NorthBound database and also its location, ie., which chassis are going to claim the port and install the relevant flows to, for example, serve DHCP.

This is done through a HA Chassis Group where the OVN Neutron plugin adds all the external ports to and, usually, the controller nodes belong to it. This way if one node goes down, all the external ports will be served from the next highest priority chassis in the group.

Information about how OpenStack handles this can be found here.

Future uses of OVN external ports include baremetal provisioning but as of this writing we lack some features like PXE chaining in OVN.