
OpenStack TripleO networking layout

The goal of this post is to describe how network isolation is typically achieved for both the control and data planes in OpenStack using TripleO, and in particular how this works in a virtual setup where a single baremetal node (the hypervisor, from now on) hosts the OpenStack nodes with libvirt. For the purpose of this post, we'll work with a virtual setup of 3 controllers + 1 compute.

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+
| ID | Name | Status | Networks |
+--------------------------------------+--------------+--------+------------------------+
| b3bd5157-b3ea-4331-91af-3820c4e12252 | controller-0 | ACTIVE | ctlplane=192.168.24.15 |
| 6f228b08-49a0-4b68-925a-17d06224d5f9 | controller-1 | ACTIVE | ctlplane=192.168.24.37 |
| e5c649b5-c968-4293-a994-04293cb16da1 | controller-2 | ACTIVE | ctlplane=192.168.24.10 |
| 9f15ed23-efb1-4972-b578-7b0da3500053 | compute-0 | ACTIVE | ctlplane=192.168.24.14 |
+--------------------------------------+--------------+--------+------------------------+

The tool used to deploy this setup is Infrared (documentation), an easy-to-use wrapper around TripleO. Don't be scared by the number of layers involved; the main point is that a physical (and somewhat powerful) server is running an OpenStack cluster formed by:

  • 3 virtual controllers that run the OpenStack control plane services (Neutron, Nova, Glance, ...)
  • 1 virtual compute node that will serve to host the workloads (virtual machines) of the OpenStack cluster 

From a Networking perspective (I'll omit the undercloud for simplicity), things are wired like this:

Let's take a look at the bridges in the hypervisor node:

[root@hypervisor]# brctl show

bridge name     bridge id               STP enabled     interfaces
management      8000.525400cc1d8b       yes             management-nic
                                                        vnet0
                                                        vnet12
                                                        vnet3
                                                        vnet6
                                                        vnet9

external        8000.5254000ceb7c       yes             external-nic
                                                        vnet11
                                                        vnet14
                                                        vnet2
                                                        vnet5
                                                        vnet8

data            8000.5254007bc90a       yes             data-nic
                                                        vnet1
                                                        vnet10
                                                        vnet13
                                                        vnet4
                                                        vnet7

Each bridge has 6 ports (3 controllers, 1 compute, 1 undercloud, and the local port in the hypervisor). Now, each virtual machine running in this node can be mapped to the right interface:

[root@hypervisor]# for i in controller-0 controller-1 controller-2 compute-0; do virsh domiflist $i; done


 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet9       network   management   virtio   52:54:00:74:29:4f
 vnet10      network   data         virtio   52:54:00:1c:44:26
 vnet11      network   external     virtio   52:54:00:20:3c:4e

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet3       network   management   virtio   52:54:00:0b:ad:3b
 vnet4       network   data         virtio   52:54:00:2f:9f:3e
 vnet5       network   external     virtio   52:54:00:75:a5:ed

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet6       network   management   virtio   52:54:00:da:a3:1e
 vnet7       network   data         virtio   52:54:00:57:26:67
 vnet8       network   external     virtio   52:54:00:2c:21:d5

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet0       network   management   virtio   52:54:00:de:4a:38
 vnet1       network   data         virtio   52:54:00:c7:74:4b
 vnet2       network   external     virtio   52:54:00:22:de:5c
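The per-node tables above can be folded into a single lookup from tap device to node, which is handy later when a capture only shows a bridge port or a MAC. The following is a minimal sketch, assuming the `virsh domiflist` output has already been collected as text (the function names and the `SAMPLE` dictionary are illustrative, and the node names follow the order of the `for` loop above):

```python
def parse_domiflist(text):
    """Return {interface: (source_network, mac)} from `virsh domiflist` output."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a vnetN tap device.
        if len(fields) == 5 and fields[0].startswith("vnet"):
            iface, _type, source, _model, mac = fields
            rows[iface] = (source, mac)
    return rows


def build_index(outputs):
    """Return {interface: (node, source_network, mac)} across all nodes."""
    index = {}
    for node, text in outputs.items():
        for iface, (source, mac) in parse_domiflist(text).items():
            index[iface] = (node, source, mac)
    return index


# Output for two of the nodes, copied from the listing above.
SAMPLE = {
    "controller-0": """
 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet9       network   management   virtio   52:54:00:74:29:4f
 vnet10      network   data         virtio   52:54:00:1c:44:26
 vnet11      network   external     virtio   52:54:00:20:3c:4e
""",
    "compute-0": """
 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet0       network   management   virtio   52:54:00:de:4a:38
 vnet1       network   data         virtio   52:54:00:c7:74:4b
 vnet2       network   external     virtio   52:54:00:22:de:5c
""",
}

index = build_index(SAMPLE)
for iface in sorted(index):
    print(iface, *index[iface])
```

In a live environment the text would come from running `virsh domiflist <node>` for each node; that step is left out here so the sketch stays self-contained.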

Network configuration templates

This section will go through the Infrared/TripleO configuration to understand how this layout was defined. This will also help the reader to change the CIDRs, VLANs, number of virtual NICs, etc.

First, the deployment script:

$ cat overcloud_deploy.sh
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/docker-images.yaml \
--log-file overcloud_deployment_99.log

Now, let's take a look at the network related templates to understand the different networks and how they map to the physical NICs inside the controllers/compute nodes:

$ grep -i -e cidr -e vlan /home/stack/virt/network/network-environment.yaml
ControlPlaneSubnetCidr: '192.168.24.0/24'

ExternalNetCidr: 10.0.0.0/24
ExternalNetworkVlanID: 10

InternalApiNetCidr: 172.17.1.0/24
InternalApiNetworkVlanID: 20

StorageMgmtNetCidr: 172.17.4.0/24
StorageMgmtNetworkVlanID: 40

StorageNetCidr: 172.17.3.0/24
StorageNetworkVlanID: 30

TenantNetCidr: 172.17.2.0/24
TenantNetworkVlanID: 50

NeutronNetworkVLANRanges: tenant:1000:2000

OS::TripleO::Compute::Net::SoftwareConfig: three-nics-vlans/compute.yaml
OS::TripleO::Controller::Net::SoftwareConfig: three-nics-vlans/controller.yaml

In the output above you can see 6 different networks:

  • ControlPlane (flat): used mainly for provisioning (PXE) and remote access to the nodes via SSH.
  • External (VLAN 10): external network used for dataplane floating IP traffic and access to the OpenStack API services via their external endpoints.
  • InternalApi (VLAN 20): network where the OpenStack control plane services will listen for internal communication (e.g. Neutron <-> Nova).
  • StorageMgmt (VLAN 40): network used to manage the storage (in this deployment, swift-object-server, swift-container-server, and swift-account-server will listen to requests on this network).
  • Storage (VLAN 30): network used for access to the object storage (in this deployment, swift-proxy will listen to requests on this network).
  • Tenant (VLAN 50): this network carries the overlay tunnelled traffic (Geneve for OVN, VXLAN in the case of ML2/OVS) on VLAN 50, and it also carries dataplane traffic if VLAN tenant networks are used in Neutron. The VLAN range allowed for that traffic is specified in the same template (in the example, VLAN IDs 1000-2000 are reserved for Neutron tenant networks).
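When reading captures or service configs, it helps to translate an address back into one of these networks quickly. Here is a small sketch using only the CIDRs and VLAN IDs from the template output above; the `NETWORKS` table and the `network_for` helper are illustrative names, not part of TripleO:

```python
import ipaddress

# CIDRs and VLAN IDs copied from network-environment.yaml above.
# None marks the flat (untagged) control plane network.
NETWORKS = {
    "ControlPlane": ("192.168.24.0/24", None),
    "External": ("10.0.0.0/24", 10),
    "InternalApi": ("172.17.1.0/24", 20),
    "Storage": ("172.17.3.0/24", 30),
    "StorageMgmt": ("172.17.4.0/24", 40),
    "Tenant": ("172.17.2.0/24", 50),
}


def network_for(address):
    """Return (network_name, vlan_id) for an IP, or None if it is outside all CIDRs."""
    ip = ipaddress.ip_address(address)
    for name, (cidr, vlan) in NETWORKS.items():
        if ip in ipaddress.ip_network(cidr):
            return name, vlan
    return None


print(network_for("192.168.24.15"))  # a ctlplane address from `openstack server list`
print(network_for("172.17.1.72"))
```

Any address that falls outside the six CIDRs (a tenant instance IP, for example) returns `None`, which is a hint that it only exists inside an overlay.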

The way that each NIC is mapped to each network is defined in the yaml files below. For this deployment, I used a customized layout via this patch (controller.yaml and compute.yaml). Essentially, the mapping looks like this:

  • Controllers:
    • nic1: ControlPlaneIp (flat); VLAN devices for InternalApi (20), Storage (30), and StorageMgmt (40)
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic 
  • Compute:
    • nic1: ControlPlaneIp (flat); VLAN devices for InternalApi (20) and Storage (30)
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic 

The nodes map nic1, nic2, nic3 to ens3, ens4, ens5 respectively:

[root@controller-0 ~]# ip l | egrep "vlan[2-4]0"
9: vlan20@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
10: vlan30@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: vlan40@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

[root@controller-0 ~]# ovs-vsctl list-ports br-tenant
ens4
vlan50

[root@controller-0 ~]# ovs-vsctl list-ports br-ex
ens5

In the controller nodes we'll find an haproxy instance load balancing the requests to the different nodes and we can see here the network layout as well:

[root@controller-1 ~]# podman exec -uroot -it haproxy-bundle-podman-1 cat /etc/haproxy/haproxy.cfg

listen neutron
  bind 10.0.0.122:9696 transparent      <--- External network
  bind 172.17.1.48:9696 transparent     <--- InternalApi network
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk
  option httplog
# Now the backends in the InternalApi network
  server controller-0.internalapi.local 172.17.1.72:9696 check fall 5 inter 2000 rise 2
  server controller-1.internalapi.local 172.17.1.101:9696 check fall 5 inter 2000 rise 2
  server controller-2.internalapi.local 172.17.1.115:9696 check fall 5 inter 2000 rise 2

In the above output, the IP address 172.17.1.48 is a virtual IP managed by pacemaker. It lives in the InternalApi (VLAN 20) network, on whichever controller currently holds the resource (controller-0 here):

[root@controller-1 ~]# pcs status | grep 172.17.1.48
  * ip-172.17.1.48      (ocf::heartbeat:IPaddr2):       Started controller-0

[root@controller-0 ~]# ip a |grep 172.17.1.48
    inet 172.17.1.48/32 brd 172.17.1.255 scope global vlan20

Traffic inspection

With a clear view on the networking layout, now we can use the hypervisor to hook a tcpdump in the right bridge and check for whatever traffic we're interested in.

Let's for example ping from the InternalApi (172.17.1.0/24) network on controller-0 to controller-1 and check the traffic in the hypervisor:

[heat-admin@controller-0 ~]$ ping controller-1.internalapi.local
PING controller-1.internalapi.redhat.local (172.17.1.101) 56(84) bytes of data.
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=2 ttl=64 time=0.096 ms


[root@hypervisor]# tcpdump -i management -vvne icmp -c2
tcpdump: listening on management, link-type EN10MB (Ethernet), capture size 262144 bytes
15:19:08.418046 52:54:00:74:29:4f > 52:54:00:0b:ad:3b, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 58494, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.1.72 > 172.17.1.101: ICMP echo request, id 53086, seq 5, length 64
15:19:08.418155 52:54:00:0b:ad:3b > 52:54:00:74:29:4f, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 39897, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.1.101 > 172.17.1.72: ICMP echo reply, id 53086, seq 5, length 64

[root@hypervisor]# brctl showmacs management | egrep "52:54:00:0b:ad:3b|52:54:00:74:29:4f"
port no mac addr                is local?       ageing timer
  3     52:54:00:0b:ad:3b       no                 0.01
  5     52:54:00:74:29:4f       no                 0.01

When we ping to the controller-1 IP address of the InternalApi network, the traffic is tagged (VLAN 20) and going through the management bridge in the hypervisor. This matches our expectations as we defined such network in the template files that way.
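To make the "vlan 20, p 0" part of the capture concrete, here is a minimal sketch of how the 802.1Q tag is read from a frame. The frame bytes are rebuilt by hand from the MAC addresses and VLAN ID shown in the tcpdump output (they are not a real capture), and `parse_vlan` is an illustrative helper:

```python
import struct


def parse_vlan(frame):
    """Return (vlan_id, priority, inner_ethertype), or None for an untagged frame."""
    # Bytes 12-13 hold the TPID, 14-15 the TCI, 16-17 the encapsulated ethertype.
    tpid, tci, inner_ethertype = struct.unpack("!HHH", frame[12:18])
    if tpid != 0x8100:
        return None
    # TCI layout: 3 bits priority, 1 bit DEI, 12 bits VLAN ID.
    return tci & 0x0FFF, tci >> 13, inner_ethertype


# Header of the echo request above: dst MAC, src MAC, 802.1Q tag, IPv4 ethertype.
tagged = bytes.fromhex("5254000bad3b" "52540074294f" "8100" "0014" "0800")
print(parse_vlan(tagged))
```

The 4-byte tag also explains the sizes in the capture: 14 bytes of Ethernet header, 4 bytes of 802.1Q tag and the 84-byte IP packet add up to the reported frame length of 102.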

Similarly, we could trace more complicated scenarios like an OpenStack instance in a tenant network pinging an external destination:

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+---------+--------+-----------------------+--------+
| ID | Name | Status | Networks | Image |
+--------------------------------------+---------+--------+-----------------------+--------+
| 3d9f6957-5311-4590-8c62-097b576ffa04 | cirros1 | ACTIVE | private=192.168.0.166 | cirros |
+--------------------------------------+---------+--------+-----------------------+--------+

[root@compute-0 ~]# sudo ip net e ovnmeta-e49cc182-247c-4dc9-9589-4df6fcb09511 ssh cirros@192.168.0.166
cirros@192.168.0.166's password:
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=53 time=10.356 ms
64 bytes from 8.8.8.8: seq=1 ttl=53 time=8.591 ms

Now in the hypervisor, we'll trace the Geneve traffic (VLAN50):

# tcpdump -i data -vvnne vlan 50 and "(udp port 6081) and (udp[10:2] = 0x6558) and (udp[(8 + (4 * (2 + (udp[8:1] & 0x3f))) + 12):2] = 0x0800) and (udp[8 + (4 * (2 + (udp[8:1] & 0x3f))) + 14 + 9:1] = 01)"  -c2

tcpdump: listening on data, link-type EN10MB (Ethernet), capture size 262144 bytes
16:21:28.642671 6a:9b:72:22:3f:68 > 0e:d0:eb:00:1b:e7, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 15872, offset 0, flags [DF], proto UDP (17), length 142)
    172.17.2.119.27073 > 172.17.2.143.6081: [bad udp cksum 0x5db4 -> 0x1e8c!] Geneve, Flags [C], vni 0x5, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00010003]
        fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 50335, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.166 > 8.8.8.8: ICMP echo request, id 2818, seq 2145, length 64
16:21:28.650412 0e:d0:eb:00:1b:e7 > 6a:9b:72:22:3f:68, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 26871, offset 0, flags [DF], proto UDP (17), length 142)
    172.17.2.143.31003 > 172.17.2.119.6081: [bad udp cksum 0x5db4 -> 0x4a04!] Geneve, Flags [C], vni 0x3, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00040002]
        fa:16:3e:34:a2:0e > fa:16:3e:63:c0:7a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 53, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 192.168.0.166: ICMP echo reply, id 2818, seq 2145, length 64

(First, sorry for the complicated filter; I picked it up from here and adapted it to match on the inner protocol of the Geneve traffic against ICMP. If there's an easier way please tell me :p)
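The filter is easier to follow once its byte offsets are spelled out. The sketch below recomputes them for a hand-built Geneve segment that mimics the first packet of the capture (8 bytes of OVN options, inner IPv4/ICMP); the segment is synthetic and the helper names are illustrative:

```python
import struct

UDP_HEADER = 8     # bytes; udp[...] offsets are relative to the UDP header start
GENEVE_FIXED = 8   # fixed part of the Geneve header (the "4 * 2" in the filter)
ETH_HEADER = 14


def geneve_offsets(segment):
    """Return the udp[...] offsets that the tcpdump filter inspects."""
    opt_words = segment[UDP_HEADER] & 0x3F  # udp[8:1] & 0x3f, options in 4-byte words
    inner_eth = UDP_HEADER + GENEVE_FIXED + 4 * opt_words
    return {
        "geneve_proto": UDP_HEADER + 2,                # udp[10:2], 0x6558 = TEB
        "inner_ethertype": inner_eth + 12,             # 0x0800 = IPv4
        "inner_ip_proto": inner_eth + ETH_HEADER + 9,  # 1 = ICMP
    }


def matches_filter(segment):
    """Apply the same three inner checks as the filter to a raw UDP segment."""
    off = geneve_offsets(segment)
    (proto,) = struct.unpack_from("!H", segment, off["geneve_proto"])
    (ethertype,) = struct.unpack_from("!H", segment, off["inner_ethertype"])
    return proto == 0x6558 and ethertype == 0x0800 and segment[off["inner_ip_proto"]] == 1


# Synthetic segment: Geneve header (2 option words, C flag, TEB, VNI 5),
# one OVN option (class 0x0102, type 0x80, 4 data bytes), inner Ethernet + IPv4.
geneve = bytes([0x02, 0x40]) + struct.pack("!H", 0x6558) + bytes.fromhex("00000500")
options = bytes.fromhex("0102" "80" "01" "00010003")
inner_eth = bytes.fromhex("5254000ceb7c" "fa163ea79587" "0800")
inner_ip = bytearray(20)
inner_ip[0] = 0x45  # IPv4, 20-byte header
inner_ip[9] = 1     # protocol = ICMP
payload = geneve + options + inner_eth + bytes(inner_ip)
segment = struct.pack("!HHHH", 27073, 6081, UDP_HEADER + len(payload), 0) + payload

print(geneve_offsets(segment), matches_filter(segment))
```

With two option words the inner Ethernet header starts at byte 24 of the UDP segment, so the filter ends up comparing bytes 36-37 (inner ethertype) and byte 47 (inner IP protocol).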

We can see that the Geneve traffic goes between 6a:9b:72:22:3f:68 and 0e:d0:eb:00:1b:e7 and now we can determine the source/dest nodes:

[root@hypervisor]# brctl showmacs data
  2     6a:9b:72:22:3f:68       no                 0.32
  2     fe:54:00:c7:74:4b       yes                0.00
  2     fe:54:00:c7:74:4b       yes                0.00
  3     0e:d0:eb:00:1b:e7       no                 0.40
  3     fe:54:00:2f:9f:3e       yes                0.00
  3     fe:54:00:2f:9f:3e       yes                0.00

From the info above we can see that port 2 corresponds to the MAC ending in "74:4b" and port 3 corresponds to the MAC ending in "9f:3e". Therefore, this Geneve traffic is flowing from the compute-0 node to the controller-1 node which is where Neutron is running the gateway to do the SNAT towards the external network. Now, this last portion can be examined in the external bridge:

[root@hypervisor]# tcpdump -i external icmp -vvnnee -c2
tcpdump: listening on external, link-type EN10MB (Ethernet), capture size 262144 bytes
16:33:35.016198 fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 13537, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.225 > 8.8.8.8: ICMP echo request, id 4354, seq 556, length 64
16:33:35.023570 52:54:00:0c:eb:7c > fa:16:3e:a7:95:87, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 54, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 10.0.0.225: ICMP echo reply, id 4354, seq 556, length 64

In case you're wondering what 10.0.0.225 is, that's the IP address of the Neutron gateway:

(overcloud) [stack@undercloud-0 ~]$ openstack router show router1 | grep gateway
| external_gateway_info   | {"network_id": "fe8330fe-540a-4acf-bda8-394398fb4272", "external_fixed_ips": [{"subnet_id": "e388a080-1953-4cdd-9e35-48d416fe2ae1", "ip_address": "10.0.0.225"}

Similarly, the MAC addresses can be matched to confirm that the traffic goes from the gateway node (controller-1), as the MAC ending in "a5:ed"  - in the same port as the source MAC from the ICMP packet - corresponds to the NIC attached to the external network on the controller-1.

[root@hypervisor]# brctl showmacs external
  3     fa:16:3e:a7:95:87       no                 0.47
  3     fe:54:00:75:a5:ed       yes                0.00
  3     fe:54:00:75:a5:ed       yes                0.00

Reflection

This is a virtual setup and everything is confined to the boundaries of a physical server. However, it is a great playground to get yourself familiar with the underlay networking of an OpenStack setup (and networking in general ;). Once you get your hands on a real production environment, all these Linux bridges will be replaced by ToR switches (or even routers on a pure L3 Spine & Leaf architecture) but the fundamentals are the same.

OVN: Where is my packet?

Working with rather complex OpenFlow pipelines as in the case of OVN can be tricky when it comes to debugging. Being able to tell where a packet gets dropped may not be the easiest task ever but fortunately, this is getting a bit more friendly thanks to the available tools around OVS and OVN these days.

In this post, I'll walk through a recent troubleshooting session where the symptom was that an ICMPv6 Neighbor Advertisement packet sent out by a guest VM was not arriving at its destination. It turned out to be a bug in ovn-controller that has already been fixed, but it hopefully illustrates some of the available tools and techniques to debug OVN through a real-world example.

Let's start by showing a failing ping to an OVN destination and we'll build up from here:

# ping 2001:db8::f816:3eff:fe1e:5a0f -c2
PING 2001:db8::f816:3eff:fe1e:5a0f(2001:db8::f816:3eff:fe1e:5a0f) 56 data bytes
From f00d:f00d:f00d:f00d:f00d:f00d:f00d:9: icmp_seq=1 Destination unreachable: Address unreachable
From f00d:f00d:f00d:f00d:f00d:f00d:f00d:9: icmp_seq=2 Destination unreachable: Address unreachable

--- 2001:db8::f816:3eff:fe1e:5a0f ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 61ms

In the destination hypervisor we can see the Neighbor Solicitation packets but no answer to them:

$ tcpdump -i br-ex -ne "icmp6" -c2

13:11:18.350789 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

13:11:19.378648 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

The first step is to determine whether the actual VM is responding with Neighbor Advertisement packets by examining the traffic on the tap interface:

$ tcpdump -veni tapdf8a80f8-0c icmp6 -c2

15:10:26.862344 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: (hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2001:db8::f816:3eff:fe1e:5a0f
          source link-address option (1), length 8 (1): b6:c9:b9:6e:50:48

15:10:26.865489 fa:16:3e:1e:5a:0f > b6:c9:b9:6e:50:48, ethertype IPv6 (0x86dd), length 86: (hlim 255, next-header ICMPv6 (58) payload length: 32) 2001:db8::f816:3eff:fe1e:5a0f > fe80::b4c9:b9ff:fe6e:5048: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is 2001:db8::f816:3eff:fe1e:5a0f, Flags [solicited, override]
          destination link-address option (2), length 8 (1): fa:16:3e:1e:5a:0f

At this point we know that the Neighbor Advertisement (NA) packets are sent by the VM but dropped somewhere in the OVN integration bridge, which makes this task about inspecting Logical and Physical flows to find out what's causing this.

Let's begin by checking the Logical flows to see if, from the OVN database standpoint, all is correctly configured. For this, we need to collect all the info that we are going to pass to the ovn-trace tool to simulate the NA packet originated from the VM such as:

  • datapath: OVN Logical Switch name or ID
  • inport: OVN port name corresponding to the VM
  • eth.src & eth.dst: source and destination MAC addresses (we can copy them from the tcpdump output above)
  • ip6.src: the IPv6 address of the VM
  • nd.target: the IPv6 address that is advertised

# ovn-nbctl show
[...]
switch 0309ffff-7c89-427e-a68e-cd87b0658005 (neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873) (aka provider-flat)
    port provnet-64e303a9-af62-4833-b713-6b361cdd6ecd
        type: localnet
        addresses: ["unknown"]
    port 916eb7ea-4d64-4d9c-be28-b693af2e7ed3
        type: localport
        addresses: ["fa:16:3e:65:76:bd 172.24.100.2 2001:db8::f816:3eff:fe65:76bd"]
    port df8a80f8-0cbf-48af-8b32-78377f034797 (aka vm-provider-flat-port)
        addresses: ["fa:16:3e:1e:5a:0f 2001:db8::f816:3eff:fe1e:5a0f"]
    port f1d49ca3-0e9c-4387-8d5d-c833fc3b7943
        type: router
        router-port: lrp-f1d49ca3-0e9c-4387-8d5d-c833fc3b7943

With all this info we are ready to run ovn-trace:

$ ovn-trace --summary neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873 'inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == fa:16:3e:1e:5a:0f && eth.dst == b6:c9:b9:6e:50:48 && ip6.src == 2001:db8::f816:3eff:fe1e:5a0f  && nd.target == 2001:db8::f816:3eff:fe1e:5a0f && nd_na && nd.tll == fa:16:3e:1e:5a:0f'
# icmp6,reg14=0x4,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=::,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
ingress(dp="provider-flat", inport="vm-provider-flat-port") {
    next;
    next;
    next;
    next;
    next;
    reg0[8] = 1;
    reg0[9] = 1;
    next;
    next;
    outport = "_MC_unknown";
    output;
    multicast(dp="provider-flat", mcgroup="_MC_unknown") {
        egress(dp="provider-flat", inport="vm-provider-flat-port", outport="provnet-64e303") {
            next;
            next;
            reg0[8] = 1;
            reg0[9] = 1;
            next;
            next;
            output;
            /* output to "provnet-64e303", type "localnet" */;
        };
    };
};
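The expression passed to ovn-trace is just the collected fields joined with `&&`. If you need to repeat the trace for several ports, a small helper can assemble the argument list; this is only a sketch, with `nd_na_expression` and `ovn_trace_args` as illustrative names, and it builds the command without running it:

```python
def nd_na_expression(inport, eth_src, eth_dst, ip6_src, nd_target, nd_tll):
    """Build the ovn-trace expression for a Neighbor Advertisement sent by a VM."""
    clauses = [
        f'inport == "{inport}"',
        f"eth.src == {eth_src}",
        f"eth.dst == {eth_dst}",
        f"ip6.src == {ip6_src}",
        f"nd.target == {nd_target}",
        "nd_na",
        f"nd.tll == {nd_tll}",
    ]
    return " && ".join(clauses)


def ovn_trace_args(datapath, expression):
    """Argument list suitable for subprocess.run(); nothing is executed here."""
    return ["ovn-trace", "--summary", datapath, expression]


expr = nd_na_expression(
    inport="df8a80f8-0cbf-48af-8b32-78377f034797",
    eth_src="fa:16:3e:1e:5a:0f",
    eth_dst="b6:c9:b9:6e:50:48",
    ip6_src="2001:db8::f816:3eff:fe1e:5a0f",
    nd_target="2001:db8::f816:3eff:fe1e:5a0f",
    nd_tll="fa:16:3e:1e:5a:0f",
)
args = ovn_trace_args("neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873", expr)
print(" ".join(args[:3]), repr(args[3]))
```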

As we can see from the ovn-trace output above, the packet should be delivered to the localnet port as per the OVN DB contents, meaning that we should have been able to see it with tcpdump. Since we know that the packet was not present in the bridge, the next step is to figure out where it is being dropped in the OpenFlow pipeline by inspecting the Physical flows.

First, let's capture a live packet (filter by ICMP6 and Neighbor Advertisement):

$ flow=$(tcpdump -nXXi tapdf8a80f8-0c "icmp6 && ip6[40] = 136 && host 2001:db8::f816:3eff:fe1e:5a0f"  -c1 | ovs-tcpundump)

1 packet captured
1 packet received by filter
0 packets dropped by kernel

$ echo $flow
b6c9b96e5048fa163e1e5a0f86dd6000000000203aff20010db800000000f8163efffe1e5a0ffe80000000000000b4c9b9fffe6e504888004d626000000020010db800000000f8163efffe1e5a0f0201fa163e1e5a0f
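The hex string is the raw Ethernet frame, so it can be decoded by hand to double check that the right packet was captured. The sketch below unpacks the Ethernet, IPv6 and ICMPv6 headers of the frame printed above; the `decode` helper is illustrative and only handles this fixed layout (no VLAN tag, no IPv6 extension headers):

```python
import ipaddress
import struct

# The frame printed by `echo $flow` above, split by header for readability.
FLOW_HEX = (
    "b6c9b96e5048fa163e1e5a0f86dd"          # Ethernet: dst, src, ethertype
    "6000000000203aff"                      # IPv6: version/flow, length, next header, hop limit
    "20010db800000000f8163efffe1e5a0f"      # IPv6 source
    "fe80000000000000b4c9b9fffe6e5048"      # IPv6 destination
    "88004d6260000000"                      # ICMPv6: type, code, checksum, NA flags
    "20010db800000000f8163efffe1e5a0f"      # ND target
    "0201fa163e1e5a0f"                      # option: target link-layer address
)


def mac(raw):
    return ":".join(f"{b:02x}" for b in raw)


def decode(hex_frame):
    frame = bytes.fromhex(hex_frame)
    (ethertype,) = struct.unpack("!H", frame[12:14])
    ip6 = frame[14:54]  # fixed 40-byte IPv6 header
    payload_len, next_header, hop_limit = struct.unpack("!HBB", ip6[4:8])
    icmp6 = frame[54:]
    return {
        "length": len(frame),
        "eth_dst": mac(frame[0:6]),
        "eth_src": mac(frame[6:12]),
        "ethertype": ethertype,
        "payload_len": payload_len,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "ip6_src": str(ipaddress.IPv6Address(ip6[8:24])),
        "ip6_dst": str(ipaddress.IPv6Address(ip6[24:40])),
        "icmp6_type": icmp6[0],  # this is the byte the ip6[40] filter tests
        "nd_target": str(ipaddress.IPv6Address(icmp6[8:24])),
        "nd_tll": mac(icmp6[26:32]),
    }


for key, value in decode(FLOW_HEX).items():
    print(f"{key:12} {value}")
```

The ICMPv6 type sits right after the 40-byte IPv6 header, which is why the capture filter tests `ip6[40] = 136`.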

Now, let's feed this packet into ovs-appctl to see the execution of the OVS pipeline:

$ ovs-appctl ofproto/trace br-int in_port=`ovs-vsctl get Interface tapdf8a80f8-0c ofport` $flow

Flow: icmp6,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

bridge("br-int")
----------------
 0. in_port=9, priority 100, cookie 0x146e7f9c
    set_field:0x1->reg13
    set_field:0x3->reg11
    set_field:0x2->reg12
    set_field:0x7->metadata
    set_field:0x4->reg14
    resubmit(,8)
 8. reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f, priority 50, cookie 0x195ddd1b
    resubmit(,9)
 9. ipv6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f, priority 90, cookie 0x1c05660e
    resubmit(,10)
10. icmp6,reg14=0x4,metadata=0x7,nw_ttl=255,icmp_type=136,icmp_code=0, priority 80, cookie 0xd3c58704
    drop

Final flow: icmp6,reg11=0x3,reg12=0x2,reg13=0x1,reg14=0x4,metadata=0x7,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Megaflow: recirc_id=0,eth,icmp6,in_port=9,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::/16,nw_ttl=255,nw_frag=no,icmp_type=0x88/0xff,icmp_code=0x0/0xff,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

We see that the packet is dropped in table 10. We can also feed this output into ovn-detrace for a friendlier view that matches each OpenFlow rule to its OVN logical flow and table:

$ ovs-appctl ofproto/trace br-int in_port=9 $flow | ovn-detrace --ovnsb="tcp:99.88.88.88:6642" --ovnnb="tcp:99.88.88.88:6641"

Flow: icmp6,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

bridge("br-int")
----------------
0. in_port=9, priority 100, cookie 0x146e7f9c
set_field:0x1->reg13
set_field:0x3->reg11
set_field:0x2->reg12
set_field:0x7->metadata
set_field:0x4->reg14
resubmit(,8)
  *  Logical datapath: "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204)
  *  Port Binding: logical_port "df8a80f8-0cbf-48af-8b32-78377f034797", tunnel_key 4, chassis-name "4bcf0b65-8d0c-4dde-9783-7ccb23fe3627", chassis-str "cmp-3-1.bgp.ftw"
8. reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f, priority 50, cookie 0x195ddd1b
resubmit(,9)
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=0 (ls_in_port_sec_l2), priority=50, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == {fa:16:3e:1e:5a:0f}), actions=(next;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']
9. ipv6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f, priority 90, cookie 0x1c05660e
resubmit(,10)
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=1 (ls_in_port_sec_ip), priority=90, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == fa:16:3e:1e:5a:0f && ip6.src == {fe80::f816:3eff:fe1e:5a0f, 2001:db8::f816:3eff:fe1e:5a0f}), actions=(next;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']
10. icmp6,reg14=0x4,metadata=0x7,nw_ttl=255,icmp_type=136,icmp_code=0, priority 80, cookie 0xd3c58704
drop
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=2 (ls_in_port_sec_nd), priority=80, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && (arp || nd)), actions=(drop;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']

Final flow: icmp6,reg11=0x3,reg12=0x2,reg13=0x1,reg14=0x4,metadata=0x7,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Megaflow: recirc_id=0,eth,icmp6,in_port=9,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::/16,nw_ttl=255,nw_frag=no,icmp_type=0x88/0xff,icmp_code=0x0/0xff,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Datapath actions: drop
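For long traces it can be convenient to pull out the rule that drops the packet automatically. The sketch below scans an abridged copy of the ofproto/trace output above and reports the OpenFlow table, priority and cookie of the dropping rule. The `find_drop` helper is illustrative, and the offset of 8 between OpenFlow tables and logical ingress tables is simply what this trace shows (tables 8, 9, 10 map to logical tables 0, 1, 2), not something to rely on across OVN versions:

```python
import re

# Abridged ofproto/trace output from above (some set_field lines removed).
TRACE = """\
 0. in_port=9, priority 100, cookie 0x146e7f9c
    set_field:0x7->metadata
    set_field:0x4->reg14
    resubmit(,8)
 8. reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f, priority 50, cookie 0x195ddd1b
    resubmit(,9)
 9. ipv6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f, priority 90, cookie 0x1c05660e
    resubmit(,10)
10. icmp6,reg14=0x4,metadata=0x7,nw_ttl=255,icmp_type=136,icmp_code=0, priority 80, cookie 0xd3c58704
    drop
"""

RULE = re.compile(r"^\s*(\d+)\. .*priority (\d+), cookie (0x[0-9a-f]+)$")
INGRESS_OFFSET = 8  # as observed in this trace


def find_drop(trace):
    """Return (openflow_table, priority, cookie) of the rule whose action is drop."""
    current = None
    for line in trace.splitlines():
        match = RULE.match(line)
        if match:
            current = (int(match.group(1)), int(match.group(2)), match.group(3))
        elif line.strip() == "drop" and current is not None:
            return current
    return None


table, priority, cookie = find_drop(TRACE)
print(f"dropped in OpenFlow table {table} (logical ingress table {table - INGRESS_OFFSET}), "
      f"priority {priority}, cookie {cookie}")
```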

The packet is explicitly dropped in table 10 (OVN Logical Switch table 2, ls_in_port_sec_nd) because no higher-priority flow in that table matched first. Let's inspect the relevant OVN Logical flows in that table:

_uuid               : e58f66b6-07f0-4f31-bd6b-97ea609ac3fb
actions             : "next;"
external_ids        : {source="ovn-northd.c:4455", stage-hint=bc1dbbdf, stage-name=ls_in_port_sec_nd}
logical_datapath    : 67b33068-8baa-4e45-9447-61352f8d5204
logical_dp_group    : []
match               : "inport == \"df8a80f8-0cbf-48af-8b32-78377f034797\" && eth.src == fa:16:3e:1e:5a:0f && ip6 && nd && ((nd.sll == 00:00:00:00:00:00 || nd.sll == fa:16:3e:1e:5a:0f) || ((nd.tll == 00:00:00:00:00:00 || nd.tll == fa:16:3e:1e:5a:0f) && (nd.target == fe80::f816:3eff:fe1e:5a0f || nd.target == 2001:db8::f816:3eff:fe1e:5a0f)))"
pipeline            : ingress
priority            : 90
table_id            : 2



_uuid               : d3c58704-c6be-4876-b023-2ec7ae870bde
actions             : "drop;"
external_ids        : {source="ovn-northd.c:4462", stage-hint=bc1dbbdf, stage-name=ls_in_port_sec_nd}
logical_datapath    : 67b33068-8baa-4e45-9447-61352f8d5204
logical_dp_group    : []
match               : "inport == \"df8a80f8-0cbf-48af-8b32-78377f034797\" && (arp || nd)"
pipeline            : ingress
priority            : 80
table_id            : 2

The first flow above should have matched: it has higher priority than the drop and all the fields seem correct. Let's take a look at the actual OpenFlow rules installed by ovn-controller, matching on some fields of the logical flow such as the uuid/cookie or the nd_target:

$ ovs-ofctl dump-flows br-int |grep cookie=0xe58f66b6 | grep icmp_type=136
 cookie=0xe58f66b6, duration=20946.048s, table=10, n_packets=0, n_bytes=0, idle_age=20946, priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)

$ ovs-ofctl dump-flows br-int table=10 | grep priority=90 | grep nd_target=2001:db8::f816:3eff:fe1e:5a0f
 cookie=0x0, duration=21035.890s, table=10, n_packets=0, n_bytes=0, idle_age=21035, priority=90,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f actions=conjunction(3,1/2)
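The cookie used in the first grep is the first 32 bits of the logical flow's UUID, which is what links the OpenFlow rules back to the Logical_Flow rows shown earlier. A tiny sketch of that conversion (`cookie_for` is an illustrative helper):

```python
import uuid


def cookie_for(logical_flow_uuid):
    """Return the OpenFlow cookie (first 32 bits of the UUID) as ovs-ofctl prints it."""
    return hex(uuid.UUID(logical_flow_uuid).int >> 96)


# The two logical flows in ls_in_port_sec_nd shown above.
print(cookie_for("e58f66b6-07f0-4f31-bd6b-97ea609ac3fb"))  # priority 90, next;
print(cookie_for("d3c58704-c6be-4876-b023-2ec7ae870bde"))  # priority 80, drop;
```

Note that the second OpenFlow rule above carries `cookie=0x0`, so grepping by cookie alone would miss it; that is why the second command filters by table, priority and nd_target instead.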

At first glance, the fact that the packet counter is 0 when all the fields seem correct (src/dest addresses, ICMP6 type, ...) looks suspicious. The conjunctive flows are key here. The first flow matches on conj_id=19, while the second flow feeds conjunction ID 3 (the action conjunction(3,1/2) marks this flow as clause 1 of 2 for conjunction ID 3). However, no flow matches on conj_id=3, so even when all the clauses of conjunction 3 are satisfied nothing acts on the result, and the packet will not hit these flows:

$ ovs-ofctl dump-flows br-int | grep conj_id=3 -c
0
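
The same mismatch can be spotted mechanically. The following is a minimal Python sketch, not part of the OVS/OVN tooling, that scans the text output of ovs-ofctl dump-flows and compares the ids used in conj_id= matches with the ids referenced by conjunction() actions. The helper names (conjunction_ids, dangling) are made up for illustration and the sample data are the two flows shown above:

```python
import re

# The two flows shown above, as printed by `ovs-ofctl dump-flows br-int`.
FLOWS = """\
 cookie=0xe58f66b6, duration=20946.048s, table=10, n_packets=0, n_bytes=0, idle_age=20946, priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)
 cookie=0x0, duration=21035.890s, table=10, n_packets=0, n_bytes=0, idle_age=21035, priority=90,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f actions=conjunction(3,1/2)
"""


def conjunction_ids(dump):
    """Return (ids used in conj_id= matches, ids used in conjunction() actions)."""
    matched = {int(i) for i in re.findall(r"\bconj_id=(\d+)", dump)}
    referenced = {int(i) for i in re.findall(r"\bconjunction\((\d+),\d+/\d+\)", dump)}
    return matched, referenced


def dangling(dump):
    """List the conjunction ids that appear on only one side."""
    matched, referenced = conjunction_ids(dump)
    return {
        # conjunction(N, ...) actions with no flow matching on conj_id=N
        "actions_without_match": sorted(referenced - matched),
        # conj_id=N matches that no conjunction(N, ...) action feeds
        "matches_without_action": sorted(matched - referenced),
    }


if __name__ == "__main__":
    print(dangling(FLOWS))
```

On a real system the input would be the full dump of br-int rather than two lines; with only these two flows as input, id 3 shows up as an action without a match and id 19 as a match without an action.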

If our assumptions are correct, we could rewrite the first flow so that its conjunction id matches the expected value (3). For this, we'll first stop ovn-controller so that it doesn't overwrite our manual changes, then delete the flow and add the new one:

$ ovs-ofctl del-flows --strict br-int table=10,priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0

$ ovs-ofctl add-flow br-int "cookie=0xe58f66b6,table=10,priority=90,conj_id=3,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0,actions=resubmit(,11)"

At this point we can test the ping and see if it works and the new flow gets hit:

$ tcpdump -ni tapdf8a80f8-0c
11:00:24.975037 IP6 fe80::b4c9:b9ff:fe6e:5048 > 2001:db8::f816:3eff:fe1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32
11:00:24.975544 IP6 2001:db8::f816:3eff:fe1e:5a0f > fe80::b4c9:b9ff:fe6e:5048: ICMP6, neighbor advertisement, tgt is 2001:db8::f816:3eff:fe1e:5a0f, length 24
11:00:25.831695 IP6 f00d:f00d:f00d:4::1 > 2001:db8::f816:3eff:fe1e:5a0f: ICMP6, echo request, seq 55, length 64
11:00:25.832182 IP6 2001:db8::f816:3eff:fe1e:5a0f > f00d:f00d:f00d:4::1: ICMP6, echo reply, seq 55, length 64

$ sudo ovs-ofctl dump-flows br-int | grep cookie=0xe58f66b6 | grep conj_id
 cookie=0xe58f66b6, duration=5.278s, table=10, n_packets=2, n_bytes=172, idle_age=4, priority=90,conj_id=3,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)
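
To compare the counters before and after the manual change without eyeballing the output, something along these lines could be used. It's only a sketch: packet_count is a hypothetical helper, and the two sample lines are the original flow and the rewritten one as dumped above:

```python
import re

# The original flow (conj_id=19) and the rewritten one (conj_id=3), as dumped above.
BEFORE = " cookie=0xe58f66b6, duration=20946.048s, table=10, n_packets=0, n_bytes=0, idle_age=20946, priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)"
AFTER = " cookie=0xe58f66b6, duration=5.278s, table=10, n_packets=2, n_bytes=172, idle_age=4, priority=90,conj_id=3,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)"


def packet_count(flow_line):
    """Extract the n_packets counter from one line of dump-flows output."""
    found = re.search(r"\bn_packets=(\d+)", flow_line)
    if found is None:
        raise ValueError("no n_packets field in: " + flow_line)
    return int(found.group(1))


if __name__ == "__main__":
    print("before:", packet_count(BEFORE), "after:", packet_count(AFTER))
```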

As I mentioned earlier, this issue is now solved upstream and you can check the fix here. With this patch, the conjunction ids will be properly generated and we'll no longer hit this bug. 

Most of the issues that we'll face in OVN will likely be related to misconfigurations (missing ACLs, port security, ...) and tools like ovn-trace will help us spot them. However, when tackling legitimate bugs in core OVN, there's no easy, well-defined way to find them. Luckily, we have some tools at hand that, at the very least, will help us provide a good bug report that will be promptly picked up and solved by the very active OVN community.

Multinode OVN setup

As a follow-up to the last post, we are now going to deploy a 3-node OVN setup to demonstrate basic L2 communication across different hypervisors. This is the physical topology and how services are distributed:

  • Central node: ovn-northd and ovsdb-servers (North and Southbound databases) as well as ovn-controller
  • Worker1 / Worker2: ovn-controller connected to Central node Southbound ovsdb-server (TCP port 6642)

To deploy the 3 machines, I'm using Vagrant + libvirt, and you can check out the Vagrant files and scripts used from this link. After running 'vagrant up', we'll have 3 nodes with OVS/OVN installed from sources. We should then be able to log in to the central node and verify that OVN is up and running and that Geneve tunnels have been established to both workers:

 

[vagrant@central ~]$ sudo ovs-vsctl show
f38658f5-4438-4917-8b51-3bb30146877a
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-worker-1"
            Interface "ovn-worker-1"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.101"}
        Port "ovn-worker-0"
            Interface "ovn-worker-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.100"}
    ovs_version: "2.11.90"

 

For demonstration purposes, we're going to create a Logical Switch (network1) and two Logical Ports (vm1 and vm2). Then we're going to bind VM1 to Worker1 and VM2 to Worker2. If everything works as expected, both Logical Ports will be able to communicate with each other through the overlay network formed between the two worker nodes.

We can run the following commands on any node to create the logical topology. Note that if we run them on Worker1 or Worker2, we need to specify the NB database location by running ovn-nbctl with "--db=tcp:192.168.50.10:6641", as 6641 is the default port for the NB database:

ovn-nbctl ls-add network1
ovn-nbctl lsp-add network1 vm1
ovn-nbctl lsp-add network1 vm2
ovn-nbctl lsp-set-addresses vm1 "40:44:00:00:00:01 192.168.0.11"
ovn-nbctl lsp-set-addresses vm2 "40:44:00:00:00:02 192.168.0.12"
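
When running from a worker, every one of those commands needs the same --db option. As an illustration, here is a small Python sketch that builds the command lines for the topology above, with or without a remote NB database. The helpers nbctl and topology_commands are hypothetical names, and the script only prints the commands; it doesn't run them:

```python
import shlex

# NB database endpoint of the central node, as mentioned above.
CENTRAL_NB = "tcp:192.168.50.10:6641"

# Logical port name -> "MAC IP", same values as in the commands above.
PORTS = {
    "vm1": "40:44:00:00:00:01 192.168.0.11",
    "vm2": "40:44:00:00:00:02 192.168.0.12",
}


def nbctl(args, db=None):
    """Build an ovn-nbctl argument list, adding --db for a remote NB database."""
    cmd = ["ovn-nbctl"]
    if db is not None:
        cmd.append(f"--db={db}")
    return cmd + list(args)


def topology_commands(switch, ports, db=None):
    """Commands that create one logical switch, its ports and their addresses."""
    cmds = [nbctl(["ls-add", switch], db)]
    cmds += [nbctl(["lsp-add", switch, name], db) for name in ports]
    cmds += [nbctl(["lsp-set-addresses", name, addr], db) for name, addr in ports.items()]
    return cmds


if __name__ == "__main__":
    # Print the commands as they would be typed on a worker node.
    for cmd in topology_commands("network1", PORTS, db=CENTRAL_NB):
        print(shlex.join(cmd))
```

Passing db=None yields the plain commands that can be run on the central node.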

And now let's check the contents of the Northbound and Southbound databases. As we haven't bound any port to the workers yet, the "ovn-sbctl show" command should only list the chassis (OVN jargon for hosts):

[root@central ~]# ovn-nbctl show
switch a51334e8-f77d-4d85-b01a-e547220eb3ff (network1)
    port vm2
        addresses: ["40:44:00:00:00:02 192.168.0.12"]
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.0.11"]

[root@central ~]# ovn-sbctl show
Chassis "worker2"
    hostname: "worker2"
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
Chassis central
    hostname: central
    Encap geneve
        ip: "127.0.0.1"
        options: {csum="true"}
Chassis "worker1"
    hostname: "worker1"
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}

Now we're going to bind VM1 to Worker1:

ovs-vsctl add-port br-int vm1 -- set Interface vm1 type=internal -- set Interface vm1 external_ids:iface-id=vm1
ip netns add vm1
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 40:44:00:00:00:01
ip netns exec vm1 ip addr add 192.168.0.11/24 dev vm1
ip netns exec vm1 ip link set vm1 up

And VM2 to Worker2:

ovs-vsctl add-port br-int vm2 -- set Interface vm2 type=internal -- set Interface vm2 external_ids:iface-id=vm2
ip netns add vm2
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 40:44:00:00:00:02
ip netns exec vm2 ip addr add 192.168.0.12/24 dev vm2
ip netns exec vm2 ip link set vm2 up
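
Both sequences follow the same pattern: an internal OVS port whose external_ids:iface-id equals the Logical Port name (this is what lets ovn-controller bind it), moved into a namespace and configured with the MAC and IP address declared in the NB database. A sketch of that pattern in Python, where bind_commands is a hypothetical helper that only generates the command lines and assumes the namespace is named after the port, as above:

```python
def bind_commands(port, mac, cidr, bridge="br-int"):
    """Commands that plug a fake VM (a network namespace) into the given bridge.

    The namespace and the OVS interface are both named after the logical port,
    as in the commands above; mac and cidr must agree with the NB database.
    """
    ns = port
    return [
        ["ovs-vsctl", "add-port", bridge, port,
         "--", "set", "Interface", port, "type=internal",
         "--", "set", "Interface", port, f"external_ids:iface-id={port}"],
        ["ip", "netns", "add", ns],
        ["ip", "link", "set", port, "netns", ns],
        ["ip", "netns", "exec", ns, "ip", "link", "set", port, "address", mac],
        ["ip", "netns", "exec", ns, "ip", "addr", "add", cidr, "dev", port],
        ["ip", "netns", "exec", ns, "ip", "link", "set", port, "up"],
    ]


if __name__ == "__main__":
    # Reproduces the six commands run on Worker2 above.
    for cmd in bind_commands("vm2", "40:44:00:00:00:02", "192.168.0.12/24"):
        print(" ".join(cmd))
```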

Checking the Southbound database again, we should now see the port bindings under each chassis:

[root@central ~]# ovn-sbctl show
Chassis "worker2"
    hostname: "worker2"
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding "vm2"
Chassis central
    hostname: central
    Encap geneve
        ip: "127.0.0.1"
        options: {csum="true"}
Chassis "worker1"
    hostname: "worker1"
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding "vm1"
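
If this check needs to be scripted, the text output is simple enough to parse. Below is a minimal Python sketch (port_bindings is a hypothetical helper, and it relies on the layout of the "ovn-sbctl show" output exactly as printed above) that maps each chassis to the ports bound to it:

```python
# Output of `ovn-sbctl show` copied from above.
SB_SHOW = """\
Chassis "worker2"
    hostname: "worker2"
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding "vm2"
Chassis central
    hostname: central
    Encap geneve
        ip: "127.0.0.1"
        options: {csum="true"}
Chassis "worker1"
    hostname: "worker1"
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding "vm1"
"""


def port_bindings(show_output):
    """Map each chassis name to the list of logical ports bound to it."""
    bindings = {}
    current = None
    for line in show_output.splitlines():
        words = line.split()
        if len(words) < 2:
            continue
        if words[0] == "Chassis":
            # Names may or may not be quoted in the output.
            current = words[1].strip('"')
            bindings[current] = []
        elif words[0] == "Port_Binding" and current is not None:
            bindings[current].append(words[1].strip('"'))
    return bindings


if __name__ == "__main__":
    print(port_bindings(SB_SHOW))
```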

Now let's check connectivity between VM1 (Worker1) and VM2 (Worker2):

[root@worker1 ~]# ip netns exec vm1 ping 192.168.0.12 -c2
PING 192.168.0.12 (192.168.0.12) 56(84) bytes of data.
64 bytes from 192.168.0.12: icmp_seq=1 ttl=64 time=0.416 ms
64 bytes from 192.168.0.12: icmp_seq=2 ttl=64 time=0.307 ms

--- 192.168.0.12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.307/0.361/0.416/0.057 ms


[root@worker2 ~]# ip netns exec vm2 ping 192.168.0.11 -c2
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.825 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.275 ms

--- 192.168.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.275/0.550/0.825/0.275 ms

As the two ports are located on different hypervisors, OVN is pushing the traffic between Worker1 and Worker2 through the Geneve overlay tunnel. In the next post, we'll analyze the Geneve encapsulation and how OVN uses its metadata internally.

For now, let's ping from VM1 to VM2 and capture traffic on the Geneve interface on Worker2 to verify that the ICMP packets are coming through the tunnel:

[root@worker2 ~]# tcpdump -i genev_sys_6081 -vvnn icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
15:07:42.395318 IP (tos 0x0, ttl 64, id 45147, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 26, length 64
15:07:42.395383 IP (tos 0x0, ttl 64, id 39282, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.0.12 > 192.168.0.11: ICMP echo reply, id 1251, seq 26, length 64
15:07:43.395221 IP (tos 0x0, ttl 64, id 45612, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 27, length 64

In coming posts we'll cover Geneve encapsulation as well as OVN pipelines and L3 connectivity.