
Trunk ports in OVN

The purpose of this post is to explain the low-level details of how trunking works in OVN. A typical use case for trunk ports is nested VMs or containers inside VMs, where all the traffic is directed to the vNIC of the virtual machine and then forwarded to the right container based on its VLAN ID. For more context around the feature and its use cases, please check out the OpenStack documentation guide.

Let’s take a very simple topology for our deep dive. You can also deploy it on your machine using this vagrant setup and replay the commands as we go 🙂

This sample setup has two Logical Switches and two ports on each of them. The physical layout is as follows:

  • vm1 bound to worker1
  • vm2 bound to worker2
  • child1 (VLAN 30) inside vm1
  • child2 (VLAN 50) inside vm2

Let’s quickly check the OVN databases info:

[root@central vagrant]# ovn-nbctl show
switch db4e7781-370c-4439-becd-35803c0e3f12 (network1)
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.0.11"]
    port vm2
        addresses: ["40:44:00:00:00:02 192.168.0.12"]
switch 40ac144b-a32a-4202-bce2-3329f8f3e98f (network2)
    port child1
        parent: vm1
        tag: 30
        addresses: ["40:44:00:00:00:03 192.168.1.13"]
    port child2
        parent: vm2
        tag: 50
        addresses: ["40:44:00:00:00:04 192.168.1.14"]



[root@central vagrant]# ovn-sbctl show
Chassis worker2
    hostname: worker2
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding child2
    Port_Binding vm2
Chassis worker1
    hostname: worker1
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding child1
    Port_Binding vm1

Instead of booting actual VMs and containers, I simulated it with network namespaces and VLAN devices inside them:

[root@worker1 vagrant]# ip netns exec vm1 ip -d link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
2: child1@vm1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:03 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 30 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
24: vm1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:01 brd ff:ff:ff:ff:ff:ff promiscuity 2
    openvswitch addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535


[root@worker2 vagrant]# ip netns exec vm2 ip -d link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
2: child2@vm2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:04 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 50 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
15: vm2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:02 brd ff:ff:ff:ff:ff:ff promiscuity 2
    openvswitch addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Now, as you can see, neither of the subports (child1 and child2) is connected directly to the integration bridge, so the vm1 and vm2 ports act as trunk ports for VLAN IDs 30 and 50 respectively. OVN will install flows to tag/untag the traffic directed to/from these ports.

 

Traffic and OpenFlow analysis

To illustrate this, let’s ping from child1 (worker1) to child2 (worker2):

[root@worker1 vagrant]# ip netns exec vm1 ping 192.168.1.14
PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data.
64 bytes from 192.168.1.14: icmp_seq=21 ttl=64 time=0.824 ms
64 bytes from 192.168.1.14: icmp_seq=22 ttl=64 time=0.211 ms

The traffic arrives tagged (VLAN 30) at the vm1 interface and is sent out untagged to worker2 (where vm2 is bound) via the Geneve tunnel:

[root@worker1 ~]# ip netns exec vm1 tcpdump -vvnee -i vm1 icmp -c1

tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
13:18:16.980650 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype 802.1Q (0x8100), length 102: vlan 30, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 55255, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 9833, seq 176, length 64



[root@worker1 ~]# tcpdump -vvneei genev_sys_6081 icmp -c1

tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
13:19:11.980671 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 16226, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 9833, seq 231, length 64

On worker1, let’s inspect the OVS flows that determine the source network/port based on the VLAN ID:

[root@worker1 ~]# ovs-ofctl dump-flows br-int table=0 |grep vlan

 cookie=0x4b8d6fa5, duration=337116.380s, table=0, n_packets=270983, n_bytes=27165634, idle_age=0, hard_age=65534, priority=150,in_port=12,dl_vlan=30 actions=load:0x1->NXM_NX_REG10[5],strip_vlan,load:0x5->NXM_NX_REG13[],load:0x1->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],resubmit(,8)

The flow above in table 0 matches on the VLAN tag (dl_vlan=30). Also, note that there’s no matching flow for VLAN 50 as vm2 is not bound to worker1.

As each parent should have subports with unique VLAN IDs, this ID will determine the source port (nested VM or container) that is sending the traffic. In our example, this will be child1 as it is the subport tagged with VLAN 30. In the actions section of the table 0 flow, the packet will be untagged (strip_vlan action), and the relevant registers will be populated to identify both the subport network and the logical input port:

  • The packet is coming from OF port 12 (in_port=12) which corresponds to vm1
[root@worker1 ~]# ovs-ofctl show br-int | grep vm1
 12(vm1): addr:40:44:00:00:00:01
  • The network identifier (metadata) is populated with the value 2 (load:0x2->OXM_OF_METADATA[]) which corresponds to the network of the subport (network2)
[root@central ~]# ovn-sbctl find datapath_binding tunnel_key=2
_uuid               : 2d762d73-5ab9-4f43-a303-65a6046e41e7
external_ids        : {logical-switch="40ac144b-a32a-4202-bce2-3329f8f3e98f", name=network2}
load_balancers      : []
tunnel_key          : 2
  • The logical input port (register 14) will be populated with the tunnel key of the child1 subport (load:0x1->NXM_NX_REG14[])
[root@central vagrant]# ovn-sbctl get port_binding child1 tunnel_key
1
  • Now the pipeline execution with the untagged packet resumes from table 8 (resubmit(,8)). Eventually the packet is sent through the tunnel to worker2, where the parent (vm2) of the destination port (child2) is bound.
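
The table 0 classification described above can be thought of as a lookup keyed on the OpenFlow port and the VLAN ID. The following Python sketch only models that idea; it is not OVN code, the names INGRESS_FLOWS and classify are made up for illustration, and the values are taken from the flow and command outputs above.

```python
# Illustrative model of the table 0 lookup on worker1 (not actual OVN code).
# Key: (OpenFlow in_port, VLAN ID); value: what the flow loads into the registers.
INGRESS_FLOWS = {
    (12, 30): {"datapath": 2, "inport": 1, "name": "child1"},
}


def classify(in_port, vlan_id):
    """Return (datapath key, logical inport key, next table) or None if no subport flow matches."""
    entry = INGRESS_FLOWS.get((in_port, vlan_id))
    if entry is None:
        # e.g. VLAN 50 on worker1: no subport flow is installed for that tag
        return None
    # In the real flow the VLAN header is stripped here (strip_vlan)
    # and the pipeline resumes at table 8 (resubmit(,8)).
    return entry["datapath"], entry["inport"], 8


print(classify(12, 30))
print(classify(12, 50))
```

The pair (in_port, VLAN ID) is what makes the subport unique, which is why each parent needs distinct VLAN IDs for its subports.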

 

Let’s inspect the traffic and flows on worker2, the destination hypervisor:

The traffic arrives untagged to br-int from the Geneve interface and later gets delivered to the vm2 interface tagged with the child2 VLAN ID (50).

[root@worker2 vagrant]# tcpdump -vvnee -i genev_sys_6081 icmp -c1
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
13:57:25.000587 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 56431, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 10218, seq 31, length 64


[root@worker2 vagrant]# ip netns exec vm2 tcpdump -vvneei vm2 icmp -c1
tcpdump: listening on vm2, link-type EN10MB (Ethernet), capture size 262144 bytes
13:57:39.000617 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype 802.1Q (0x8100), length 102: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 59701, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 10218, seq 45, length 64

The packet processing takes place as with any regular VIF but in the output stage, the traffic will be tagged before it is sent out to the vm2 interface:

[root@worker2 vagrant]# ovs-ofctl dump-flows br-int |grep mod_vlan_vid:50
 cookie=0x969b8a9d, duration=338995.147s, table=65, n_packets=263914, n_bytes=25814112, idle_age=0, hard_age=65534, priority=100,reg15=0x2,metadata=0x2 actions=mod_vlan_vid:50,output:4,strip_vlan

[root@worker2 vagrant]# ovs-ofctl show br-int | grep vm2
 4(vm2): addr:00:00:00:00:00:00
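
To make the order of the actions in that table 65 flow easier to read, here is a small Python sketch that splits one line of ovs-ofctl dump-flows output into its match fields and its action list. The helper names (split_actions, parse_flow) are hypothetical and the parser only covers the simple syntax seen in this post.

```python
import re


def split_actions(actions):
    """Split an OpenFlow action string on commas that are not inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in actions:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def parse_flow(line):
    """Return (fields, actions) for one line of `ovs-ofctl dump-flows` output."""
    head, actions = line.strip().split(" actions=", 1)
    fields = {}
    for token in re.split(r",\s*", head):
        key, sep, value = token.partition("=")
        fields[key.strip()] = value if sep else True
    return fields, split_actions(actions)


FLOW = (" cookie=0x969b8a9d, duration=338995.147s, table=65, n_packets=263914, "
        "n_bytes=25814112, idle_age=0, hard_age=65534, priority=100,reg15=0x2,"
        "metadata=0x2 actions=mod_vlan_vid:50,output:4,strip_vlan")

fields, actions = parse_flow(FLOW)
print(fields["metadata"], fields["reg15"], actions)
```

Read in order, the actions tag the packet with VLAN 50, send it out of OpenFlow port 4 (vm2), and then remove the tag again from the packet that remains in the pipeline.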

 

As you can see, the way OVN implements this feature is very simple and only adds a couple of extra flows. I hope this article helps you understand the details of trunk ports and how they are leveraged by projects like Kuryr to run Kubernetes on top of OpenStack.

OpenStack TripleO networking layout

The goal of this post is to describe how network isolation is typically achieved for both the control and data planes in OpenStack using TripleO. In particular, how all this happens in a virtual setup, using one baremetal node (hypervisor, from now on) to deploy the OpenStack nodes with libvirt. For the purpose of this post, we’ll work with a 3 controllers + 1 compute virtual setup.

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+
| ID | Name | Status | Networks |
+--------------------------------------+--------------+--------+------------------------+
| b3bd5157-b3ea-4331-91af-3820c4e12252 | controller-0 | ACTIVE | ctlplane=192.168.24.15 |
| 6f228b08-49a0-4b68-925a-17d06224d5f9 | controller-1 | ACTIVE | ctlplane=192.168.24.37 |
| e5c649b5-c968-4293-a994-04293cb16da1 | controller-2 | ACTIVE | ctlplane=192.168.24.10 |
| 9f15ed23-efb1-4972-b578-7b0da3500053 | compute-0 | ACTIVE | ctlplane=192.168.24.14 |
+--------------------------------------+--------------+--------+------------------------+

The tool used to deploy this setup is Infrared (documentation), which is an easy-to-use wrapper around TripleO. Don’t be put off by the many layers involved here; the main point is to understand that a physical (and somewhat powerful) server is running an OpenStack cluster formed by:

  • 3 virtual controllers that run the OpenStack control plane services (Neutron, Nova, Glance, …)
  • 1 virtual compute node that will serve to host the workloads (virtual machines) of the OpenStack cluster 

From a Networking perspective (I’ll omit the undercloud for simplicity), things are wired like this:

Let’s take a look at the bridges in the hypervisor node:

[root@hypervisor]# brctl show

bridge name     bridge id               STP enabled     interfaces
management      8000.525400cc1d8b       yes             management-nic
                                                        vnet0
                                                        vnet12
                                                        vnet3
                                                        vnet6
                                                        vnet9

external        8000.5254000ceb7c       yes             external-nic
                                                        vnet11
                                                        vnet14
                                                        vnet2
                                                        vnet5
                                                        vnet8

data            8000.5254007bc90a       yes             data-nic
                                                        vnet1
                                                        vnet10
                                                        vnet13
                                                        vnet4
                                                        vnet7

Each bridge has 6 ports (3 controllers, 1 compute, 1 undercloud, and the local port in the hypervisor). Now, each virtual machine running in this node can be mapped to the right interface:

[root@hypervisor]# for i in controller-0 controller-1 controller-2 compute-0; do virsh domiflist $i; done


 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet9       network   management   virtio   52:54:00:74:29:4f
 vnet10      network   data         virtio   52:54:00:1c:44:26
 vnet11      network   external     virtio   52:54:00:20:3c:4e

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet3       network   management   virtio   52:54:00:0b:ad:3b
 vnet4       network   data         virtio   52:54:00:2f:9f:3e
 vnet5       network   external     virtio   52:54:00:75:a5:ed

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet6       network   management   virtio   52:54:00:da:a3:1e
 vnet7       network   data         virtio   52:54:00:57:26:67
 vnet8       network   external     virtio   52:54:00:2c:21:d5

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet0       network   management   virtio   52:54:00:de:4a:38
 vnet1       network   data         virtio   52:54:00:c7:74:4b
 vnet2       network   external     virtio   52:54:00:22:de:5c
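
When tracing traffic later on, it helps to have this mapping in a form that can be queried by interface name. The Python sketch below simply re-indexes the virsh domiflist output shown above; the names DOMIFLIST and index_by_interface are illustrative, and the parsing assumes the five-column layout printed by this version of virsh.

```python
# Output of `virsh domiflist <domain>` as shown above, keyed by domain name.
DOMIFLIST = {
    "controller-0": """
 vnet9       network   management   virtio   52:54:00:74:29:4f
 vnet10      network   data         virtio   52:54:00:1c:44:26
 vnet11      network   external     virtio   52:54:00:20:3c:4e
""",
    "controller-1": """
 vnet3       network   management   virtio   52:54:00:0b:ad:3b
 vnet4       network   data         virtio   52:54:00:2f:9f:3e
 vnet5       network   external     virtio   52:54:00:75:a5:ed
""",
    "controller-2": """
 vnet6       network   management   virtio   52:54:00:da:a3:1e
 vnet7       network   data         virtio   52:54:00:57:26:67
 vnet8       network   external     virtio   52:54:00:2c:21:d5
""",
    "compute-0": """
 vnet0       network   management   virtio   52:54:00:de:4a:38
 vnet1       network   data         virtio   52:54:00:c7:74:4b
 vnet2       network   external     virtio   52:54:00:22:de:5c
""",
}


def index_by_interface(domiflist):
    """Return {vnet name: (domain, bridge, guest MAC)}."""
    index = {}
    for domain, text in domiflist.items():
        for line in text.splitlines():
            cols = line.split()
            # Data rows have five columns: Interface, Type, Source, Model, MAC
            if len(cols) == 5 and cols[0].startswith("vnet"):
                iface, _type, source, _model, mac = cols
                index[iface] = (domain, source, mac)
    return index


INDEX = index_by_interface(DOMIFLIST)
for iface in ("vnet1", "vnet4"):
    print(iface, INDEX[iface])
```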

Network configuration templates

This section will go through the Infrared/TripleO configuration to understand how this layout was defined. This will also help the reader to change the CIDRs, VLANs, number of virtual NICs, etc.

First, the deployment script:

$ cat overcloud_deploy.sh
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/docker-images.yaml \
--log-file overcloud_deployment_99.log

Now, let’s take a look at the network related templates to understand the different networks and how they map to the physical NICs inside the controllers/compute nodes:

$ grep -i -e cidr -e vlan /home/stack/virt/network/network-environment.yaml
ControlPlaneSubnetCidr: '192.168.24.0/24'

ExternalNetCidr: 10.0.0.0/24
ExternalNetworkVlanID: 10

InternalApiNetCidr: 172.17.1.0/24
InternalApiNetworkVlanID: 20

StorageMgmtNetCidr: 172.17.4.0/24
StorageMgmtNetworkVlanID: 40

StorageNetCidr: 172.17.3.0/24
StorageNetworkVlanID: 30

TenantNetCidr: 172.17.2.0/24
TenantNetworkVlanID: 50

NeutronNetworkVLANRanges: tenant:1000:2000

OS::TripleO::Compute::Net::SoftwareConfig: three-nics-vlans/compute.yaml
OS::TripleO::Controller::Net::SoftwareConfig: three-nics-vlans/controller.yaml

In the output above you can see 6 different networks:

  • ControlPlane (flat): used mainly for provisioning (PXE) and remote access to the nodes via SSH.
  • External (VLAN 10): external network used for dataplane floating IP traffic and access to the OpenStack API services via their external endpoints.
  • InternalApi (VLAN 20): network where the OpenStack control plane services will listen for internal communication (e.g. Neutron <-> Nova).
  • StorageMgmt (VLAN 40): network used to manage the storage (in this deployment, swift-object-server, swift-container-server, and swift-account-server will listen to requests on this network).
  • Storage (VLAN 30): network used for access to the Object storage (in this deployment, swift-proxy will listen to requests on this network).
  • Tenant (VLAN 50): this network carries the overlay tunnelled traffic (Geneve for OVN, VXLAN in the case of ML2/OVS) on VLAN 50, but it will also carry dataplane traffic if VLAN tenant networks are used in Neutron. The VLAN range allowed for such traffic is also specified in the template (in the example, VLAN IDs 1000 to 2000 are reserved for Neutron tenant networks).
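
As a quick aid for reading captures, the CIDR and VLAN values from the template can be turned into a lookup that tells which isolated network an address belongs to. This is a minimal Python sketch using the standard ipaddress module; the NETWORKS table mirrors the grep output above, and the function name network_for is made up for the example.

```python
import ipaddress

# (CIDR, VLAN ID) per network, copied from network-environment.yaml above.
# ControlPlane is flat, so it has no VLAN ID.
NETWORKS = {
    "ControlPlane": ("192.168.24.0/24", None),
    "External": ("10.0.0.0/24", 10),
    "InternalApi": ("172.17.1.0/24", 20),
    "StorageMgmt": ("172.17.4.0/24", 40),
    "Storage": ("172.17.3.0/24", 30),
    "Tenant": ("172.17.2.0/24", 50),
}


def network_for(ip):
    """Return (network name, VLAN ID) for an address, or None if it matches no network."""
    addr = ipaddress.ip_address(ip)
    for name, (cidr, vlan) in NETWORKS.items():
        if addr in ipaddress.ip_network(cidr):
            return name, vlan
    return None


for ip in ("192.168.24.15", "172.17.1.72", "172.17.2.119"):
    print(ip, network_for(ip))
```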

The way that each NIC is mapped to each network is defined in the yaml files below. For this deployment, I used a customized layout via this patch (controller.yaml and compute.yaml). Essentially, the mapping looks like this:

  • Controllers:
    • nic1: ControlPlaneIp (flat); InternalApi (20), Storage (30) and StorageMgmt (40) as VLAN devices
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic
  • Compute:
    • nic1: ControlPlaneIp (flat); InternalApi (20) and Storage (30) as VLAN devices
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic

The nodes map nic1, nic2, nic3 to ens3, ens4, ens5 respectively:

[root@controller-0 ~]# ip l | egrep "vlan[2-4]0"
9: vlan20@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
10: vlan30@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: vlan40@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

[root@controller-0 ~]# ovs-vsctl list-ports br-tenant
ens4
vlan50

[root@controller-0 ~]# ovs-vsctl list-ports br-ex
ens5

On the controller nodes we’ll find an haproxy instance load balancing the requests across the different nodes, and its configuration reflects the network layout as well:

[root@controller-1 ~]# podman exec -uroot -it haproxy-bundle-podman-1 cat /etc/haproxy/haproxy.cfg

listen neutron
  bind 10.0.0.122:9696 transparent      <--- External network
  bind 172.17.1.48:9696 transparent     <--- InternalApi network
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk
  option httplog
# Now the backends in the InternalApi network
  server controller-0.internalapi.local 172.17.1.72:9696 check fall 5 inter 2000 rise 2
  server controller-1.internalapi.local 172.17.1.101:9696 check fall 5 inter 2000 rise 2
  server controller-2.internalapi.local 172.17.1.115:9696 check fall 5 inter 2000 rise 2

In the above output, the IP address 172.17.1.48 is a virtual IP managed by pacemaker. It lives on the InternalApi (VLAN 20) network, on whichever controller currently holds it (controller-0 in this case):

[root@controller-1 ~]# pcs status | grep 172.17.1.48
  * ip-172.17.1.48      (ocf::heartbeat:IPaddr2):       Started controller-0

[root@controller-0 ~]# ip a |grep 172.17.1.48
    inet 172.17.1.48/32 brd 172.17.1.255 scope global vlan20

Traffic inspection

With a clear view on the networking layout, now we can use the hypervisor to hook a tcpdump in the right bridge and check for whatever traffic we’re interested in.

Let’s for example ping from the InternalApi (172.17.1.0/24) network on controller-0 to controller-1 and check the traffic in the hypervisor:

[heat-admin@controller-0 ~]$ ping controller-1.internalapi.local
PING controller-1.internalapi.redhat.local (172.17.1.101) 56(84) bytes of data.
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=2 ttl=64 time=0.096 ms


[root@hypervisor]# tcpdump -i management -vvne icmp -c2
tcpdump: listening on management, link-type EN10MB (Ethernet), capture size 262144 bytes
15:19:08.418046 52:54:00:74:29:4f > 52:54:00:0b:ad:3b, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 58494, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.1.72 > 172.17.1.101: ICMP echo request, id 53086, seq 5, length 64
15:19:08.418155 52:54:00:0b:ad:3b > 52:54:00:74:29:4f, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 39897, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.1.101 > 172.17.1.72: ICMP echo reply, id 53086, seq 5, length 64

[root@hypervisor]# brctl showmacs management | egrep "52:54:00:0b:ad:3b|52:54:00:74:29:4f"
port no mac addr                is local?       ageing timer
  3     52:54:00:0b:ad:3b       no                 0.01
  5     52:54:00:74:29:4f       no                 0.01

When we ping the controller-1 IP address on the InternalApi network, the traffic is tagged (VLAN 20) and goes through the management bridge in the hypervisor. This matches our expectations, as that is how we defined the network in the template files.

Similarly, we could trace more complicated scenarios like an OpenStack instance in a tenant network pinging an external destination:

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+---------+--------+-----------------------+--------+
| ID | Name | Status | Networks | Image |
+--------------------------------------+---------+--------+-----------------------+--------+
| 3d9f6957-5311-4590-8c62-097b576ffa04 | cirros1 | ACTIVE | private=192.168.0.166 | cirros |
+--------------------------------------+---------+--------+-----------------------+--------+

[root@compute-0 ~]# sudo ip net e ovnmeta-e49cc182-247c-4dc9-9589-4df6fcb09511 ssh cirros@192.168.0.166
cirros@192.168.0.166's password:
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=53 time=10.356 ms
64 bytes from 8.8.8.8: seq=1 ttl=53 time=8.591 ms

Now in the hypervisor, we’ll trace the Geneve traffic (VLAN50):

# tcpdump -i data -vvnne vlan 50 and "(udp port 6081) and (udp[10:2] = 0x6558) and (udp[(8 + (4 * (2 + (udp[8:1] & 0x3f))) + 12):2] = 0x0800) and (udp[8 + (4 * (2 + (udp[8:1] & 0x3f))) + 14 + 9:1] = 01)"  -c2

tcpdump: listening on data, link-type EN10MB (Ethernet), capture size 262144 bytes
16:21:28.642671 6a:9b:72:22:3f:68 > 0e:d0:eb:00:1b:e7, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 15872, offset 0, flags [DF], proto UDP (17), length 142)
    172.17.2.119.27073 > 172.17.2.143.6081: [bad udp cksum 0x5db4 -> 0x1e8c!] Geneve, Flags [C], vni 0x5, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00010003]
        fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 50335, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.166 > 8.8.8.8: ICMP echo request, id 2818, seq 2145, length 64
16:21:28.650412 0e:d0:eb:00:1b:e7 > 6a:9b:72:22:3f:68, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 26871, offset 0, flags [DF], proto UDP (17), length 142)
    172.17.2.143.31003 > 172.17.2.119.6081: [bad udp cksum 0x5db4 -> 0x4a04!] Geneve, Flags [C], vni 0x3, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00040002]
        fa:16:3e:34:a2:0e > fa:16:3e:63:c0:7a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 53, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 192.168.0.166: ICMP echo reply, id 2818, seq 2145, length 64

(First, sorry for the complicated filter; I picked it up from here and adapted it to match on the inner protocol of the Geneve traffic against ICMP. If there’s an easier way please tell me :p)
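
For readers who want to decode that filter, the offsets are all relative to the start of the UDP header: 8 bytes of UDP header, 8 bytes of fixed Geneve header, then a variable number of option bytes whose length (in 4-byte words) sits in the low 6 bits of the first Geneve byte. The Python sketch below recomputes the same offsets and checks them against a synthetic packet built by hand; the packet contents are illustrative (modelled on the capture above) and the function name inner_offsets is an assumption of this example.

```python
import struct

UDP_HDR = 8   # UDP header length; udp[N] in the filter counts from its first byte
ETH_HDR = 14  # inner Ethernet header length


def inner_offsets(udp_bytes):
    """Offsets, relative to the start of the UDP header, that the filter inspects."""
    opt_words = udp_bytes[UDP_HDR] & 0x3F      # udp[8:1] & 0x3f: Geneve option length in 4-byte words
    inner_eth = UDP_HDR + 4 * (2 + opt_words)  # 8 + 4 * (2 + optlen): start of the inner Ethernet frame
    return {
        "geneve_proto": UDP_HDR + 2,                # udp[10:2], expected 0x6558 (TEB)
        "inner_ethertype": inner_eth + 12,          # expected 0x0800 (IPv4)
        "inner_ip_proto": inner_eth + ETH_HDR + 9,  # expected 1 (ICMP)
    }


# Synthetic packet: UDP header + Geneve header + one 8-byte OVN option + inner frame.
udp = struct.pack("!HHHH", 27073, 6081, 0, 0)
geneve = struct.pack("!BBH", 0x02, 0x40, 0x6558) + b"\x00\x00\x05\x00"  # 2 option words, C flag, VNI 5
options = struct.pack("!HBB", 0x0102, 0x80, 0x01) + bytes.fromhex("00010003")
inner_eth = bytes.fromhex("5254000ceb7c") + bytes.fromhex("fa163ea79587") + struct.pack("!H", 0x0800)
inner_ip = bytearray(20)
inner_ip[0] = 0x45  # IPv4, 20-byte header
inner_ip[9] = 1     # protocol: ICMP
packet = udp + geneve + options + inner_eth + bytes(inner_ip)

off = inner_offsets(packet)
print(off)
print(packet[off["geneve_proto"]:off["geneve_proto"] + 2].hex(),
      packet[off["inner_ethertype"]:off["inner_ethertype"] + 2].hex(),
      packet[off["inner_ip_proto"]])
```

With the 8 bytes of OVN options seen in the capture, the inner Ethernet frame starts 24 bytes after the beginning of the UDP header.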

We can see that the Geneve traffic goes between 6a:9b:72:22:3f:68 and 0e:d0:eb:00:1b:e7 and now we can determine the source/dest nodes:

[root@hypervisor]# brctl showmacs data
  2     6a:9b:72:22:3f:68       no                 0.32
  2     fe:54:00:c7:74:4b       yes                0.00
  2     fe:54:00:c7:74:4b       yes                0.00
  3     0e:d0:eb:00:1b:e7       no                 0.40
  3     fe:54:00:2f:9f:3e       yes                0.00
  3     fe:54:00:2f:9f:3e       yes                0.00

From the info above we can see that port 2 also holds the local MAC ending in “74:4b” and port 3 the local MAC ending in “9f:3e”. Those suffixes match the data NICs of compute-0 (vnet1) and controller-1 (vnet4) in the virsh output shown earlier. Therefore, this Geneve traffic is flowing from the compute-0 node to the controller-1 node, which is where Neutron is running the gateway to do the SNAT towards the external network. Now, this last portion can be examined in the external bridge:

[root@hypervisor]# tcpdump -i external icmp -vvnnee -c2
tcpdump: listening on external, link-type EN10MB (Ethernet), capture size 262144 bytes
16:33:35.016198 fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 13537, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.225 > 8.8.8.8: ICMP echo request, id 4354, seq 556, length 64
16:33:35.023570 52:54:00:0c:eb:7c > fa:16:3e:a7:95:87, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 54, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 10.0.0.225: ICMP echo reply, id 4354, seq 556, length 64

In case you’re wondering what 10.0.0.225 is, it’s the IP address of the Neutron router gateway:

(overcloud) [stack@undercloud-0 ~]$ openstack router show router1 | grep gateway
| external_gateway_info   | {"network_id": "fe8330fe-540a-4acf-bda8-394398fb4272", "external_fixed_ips": [{"subnet_id": "e388a080-1953-4cdd-9e35-48d416fe2ae1", "ip_address": "10.0.0.225"}

Similarly, the MAC addresses can be matched to confirm that the traffic leaves from the gateway node (controller-1), as the MAC ending in “a5:ed” (learned on the same port as the source MAC of the ICMP packet) corresponds to the NIC attached to the external network on controller-1.

[root@hypervisor]# brctl showmacs external
  3     fa:16:3e:a7:95:87       no                 0.47
  3     fe:54:00:75:a5:ed       yes                0.00
  3     fe:54:00:75:a5:ed       yes                0.00
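
The matching done by eye above relies on the bridge-side tap devices showing the guest MAC with the first octet replaced by fe, which is how libvirt normally names them. A small Python sketch of that lookup follows; it assumes guest MACs start with 52 (as they all do in this setup), and the GUEST_NICS table and function names are illustrative, with values copied from the virsh domiflist output earlier in the post.

```python
# Guest NIC MACs on the data and external bridges, from `virsh domiflist` above.
GUEST_NICS = {
    "52:54:00:c7:74:4b": ("compute-0", "data"),
    "52:54:00:2f:9f:3e": ("controller-1", "data"),
    "52:54:00:75:a5:ed": ("controller-1", "external"),
}


def guest_mac(tap_mac):
    """Map a tap device MAC (fe:...) back to the guest MAC (52:...).

    Assumption: the guest MAC starts with 52, as in this deployment.
    """
    first, rest = tap_mac.split(":", 1)
    if first.lower() != "fe":
        raise ValueError("not a tap-style MAC: " + tap_mac)
    return "52:" + rest


def owner_of(tap_mac):
    """Return (domain, bridge) for the tap MAC marked as local in `brctl showmacs`."""
    return GUEST_NICS.get(guest_mac(tap_mac))


for mac in ("fe:54:00:c7:74:4b", "fe:54:00:2f:9f:3e", "fe:54:00:75:a5:ed"):
    print(mac, owner_of(mac))
```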

Reflection

This is a virtual setup and everything is confined to the boundaries of a physical server. However, it is a great playground to get yourself familiar with the underlay networking of an OpenStack setup (and networking in general ;). Once you get your hands on a real production environment, all these Linux bridges will be replaced by ToR switches (or even routers on a pure L3 Spine & Leaf architecture) but the fundamentals are the same.

Improving OpenStack Neutron memory footprint

Recently, the Performance & Scale team at Red Hat ran some tests to stress both the control and data planes of OpenStack. One of the biggest issues detected during that exercise was the memory consumption of all the Neutron workers across the controller nodes, rising all the way up to 75 GB of RSS:

 

 

The team did some analysis and we determined that there were close to a million MAC_Binding entries in the OVN Southbound database. These entries are kept in memory by Neutron and they never age out so the memory just grows and grows.

     MAC bindings

       The MAC_Binding table tracks the bindings from IP addresses to Ethernet
       addresses that are dynamically discovered  using  ARP  (for  IPv4)  and
       neighbor  discovery (for IPv6). Usually, IP-to-MAC bindings for virtual
       machines are statically  populated  into  the  Port_Binding  table,  so
       MAC_Binding  is  primarily  used  to discover bindings on physical net‐
       works.

The MAC_Binding table is populated by ovn-controller when it sees a new ARP/ND packet in the network, even if the packet doesn’t belong to OVN. It is common in OpenStack deployments to have multiple tenants connecting their routers to a relatively large provider network, and once a new MAC address is learned, OVN will add one MAC_Binding entry per router connected to the external network.

In this particular exercise, the external network was a /16 and we observed close to 1M entries. This not only poses a memory problem but also generates a lot of network traffic and stress on the OVN Southbound server, which needs to commit the transactions from the ovn-controllers and send out notifications to all the clients.
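
To get a feel for how quickly this multiplies, the worst case is roughly the number of usable addresses on the external network times the number of routers attached to it. The Python sketch below only illustrates that arithmetic; the router count is a made-up value (the number used in the actual test is not stated here) and the function name is hypothetical.

```python
import ipaddress


def mac_binding_upper_bound(external_cidr, routers):
    """Rough upper bound: one MAC_Binding row per learned address per attached router."""
    # Usable host addresses: exclude the network and broadcast addresses.
    hosts = ipaddress.ip_network(external_cidr).num_addresses - 2
    return hosts * routers


# Hypothetical example: a /16 external network with 16 routers attached.
print(mac_binding_upper_bound("10.0.0.0/16", 16))
```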

Why does Neutron care about the MAC_Binding table?

The problem, and its workaround, is described here but, in short, it is very common in OpenStack to reuse a Floating IP address (e.g. during testing in CI), and Neutron implements a mechanism to delete the MAC_Binding entries associated with a Floating IP in order to force the new MAC address to be learned when needed.

For Neutron to do this, we monitored the table, which forced us to keep an in-memory copy of all its entries. Since these entries do not age out, the most likely outcome is that we’ll eventually hit the OOM killer (depending on the network topology, network size and other factors).

The “solution”

Ideally, core OVN should implement a mechanism to eliminate stale MAC_Binding entries after a certain time, but that’s not an easy task, as discussed in the links above. In the meantime, we thought of a way to stop monitoring the problematic table while keeping the mechanism that deletes Floating IP MAC addresses when they are associated or disassociated in Neutron. The adopted solution was to invoke an external tool, ovsdb-client, to delete such entries, which lets us:

  • Avoid monitoring the MAC_Binding table
  • Avoid downloading the database contents, and hence delete those entries in constant time regardless of the database size
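
As a rough idea of what such an invocation can look like, the Python sketch below builds an ovsdb-client transact command that deletes the MAC_Binding rows whose ip column matches a given Floating IP. It is not the code from the Neutron patch: the function name, the default connection string and the example address are assumptions, and the command is only built, not executed.

```python
import json


def build_delete_mac_binding(fip_address, db_remote="tcp:127.0.0.1:6642"):
    """Build an `ovsdb-client transact` command deleting MAC_Binding rows for one IP.

    db_remote is a placeholder; point it at your OVN Southbound server.
    """
    transaction = [
        "OVN_Southbound",
        {
            "op": "delete",
            "table": "MAC_Binding",
            "where": [["ip", "==", fip_address]],
        },
    ]
    return ["ovsdb-client", "transact", db_remote, json.dumps(transaction)]


cmd = build_delete_mac_binding("10.0.0.225")
print(" ".join(cmd[:3]), repr(cmd[3]))
# To actually run it, something like subprocess.run(cmd, check=True) could be used.
```

Because the condition is evaluated by the server, the client never needs a copy of the table.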

The patch was recently merged into the Neutron repository and the result is that, running the same tests, the overall memory consumption decreased by an order of magnitude, from ~75 GB to ~7.5 GB of RSS, without impacting the run time of the exercise.

This great improvement was thanks to the collaboration of the Performance & Scale, Neutron and core OVN teams at Red Hat. I’m lucky to work with such great and smart people 😉

Pure L3 OVN dataplane setup

Nowadays, with the huge amount of virtualization in the workloads, it is becoming more and more popular to see pure layer-3 Spine and Leaf datacenters. This overcomes the challenge of having large L2 domains (complex at scale, significant failure domains, …), and since most of the switching devices these days are capable of doing routing, it makes more sense to move to a pure Layer 3 fabric design where L2 connectivity is only assumed within a rack or leaf.

The goal of this blogpost is to experiment with the idea of having a single OVN Logical Switch (L2 domain in the overlay) on a pure L3 underlay without using tunnels as well as to set the grounds for further ideas and code to support this type of deployments.

As a playground, I have set up the following three-machine topology:


From an OVN perspective, the topology is pretty simple as well: one Logical Switch and two VMs (actually, network namespaces with an OVS port), placed on separate workers:

 

[root@central ~]# ovn-nbctl show
switch 03145198-0722-452a-b381-032dcec47cd9 (public)
    port public-segment1
        type: localnet
        addresses: ["unknown"]
    port vm1
        addresses: ["40:44:00:00:00:01 10.0.0.10"]
    port vm2
        addresses: ["40:44:00:00:00:02 10.0.0.20"]

[root@central ~]# ovn-sbctl show
Chassis worker1
    hostname: worker1
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding vm1
Chassis worker2
    hostname: worker2
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding vm2

As we are only interested (for now) in the data plane implications, I have set up a dedicated shared network across the 3 machines for the OVN control plane. So for now, L2 is only assumed for OVN control plane.

Traffic flow between vm1 and vm2

ARP Resolution

When vm1 (10.0.0.10) wants to communicate with vm2 (10.0.0.20) it first needs to resolve its MAC address. Normally, ovn-controller will reply to vm1’s ARP request with the MAC address of vm2. On a typical deployment where L2 connectivity is present, this would just work as the traffic would be placed on the wire and picked by ovn-controller where the destination machine is running.

However, this is no longer true so we need to figure out a way to place the traffic into vm1’s kernel to do the routing through our 100.{64, 65}.{1, 2}.0/24 networks. The answer to this is the Proxy ARP technique.

On a traditional OVN setup, we would have the NIC attached to br-ex and the traffic will hit it via a patch-port between br-int and br-ex. Now, we won’t have any NIC attached to br-ex. Instead, we’ll have an IP within the same virtual segment and we’ll enable proxy_arp on the br-ex interface:

# Enable proxy-ARP and forwarding
ip link set dev br-ex up
ip address add 10.0.0.42/24 dev br-ex
sudo sysctl -w net.ipv4.conf.br-ex.proxy_arp=1
sudo sysctl -w net.ipv4.ip_forward=1

At this point, we need to prevent OVN from responding to ARP requests that are directed to VMs outside the worker node. For now, let’s just remove the ARP responder flows manually (hack!):

  uuid=0xb9394db0, table=14(ls_in_arp_rsp), priority=50   , match=(arp.tpa == 10.0.0.20 && arp.op == 1), action=(eth.dst = eth.src; eth.src = 40:44:00:00:00:02; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = 40:44:00:00:00:02; arp.tpa = arp.spa; arp.spa = 10.0.0.20; outport = inport; flags.loopback = 1; output;)

  uuid=0x7c4824a5, table=14(ls_in_arp_rsp), priority=50   , match=(arp.tpa == 10.0.0.10 && arp.op == 1), action=(eth.dst = eth.src; eth.src = 40:44:00:00:00:01; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = 40:44:00:00:00:01; arp.tpa = arp.spa; arp.spa = 10.0.0.10; outport = inport; flags.loopback = 1; output;)

Now, on worker1 (vm1) we’ll remove the OpenFlow flows that correspond to the ARP responder of vm2, and vice versa on worker2:

[vagrant@worker1 ~]$ sudo ovs-ofctl del-flows br-int cookie=0xb9394db0/-1

[vagrant@worker2 ~]$ sudo ovs-ofctl del-flows br-int cookie=0x7c4824a5/-1
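The cookie passed to del-flows is the uuid= value printed next to each logical flow. As a minimal sketch, a small helper can pull that value out of a flow listing instead of copying it by hand. The function name arp_rsp_cookie and the sample_lflows input are hypothetical, and the sketch assumes the listing has the same shape as the two flows shown above.

```shell
# Hypothetical helper: print the cookie (uuid= field) of the ARP responder
# logical flow for a given target IP, reading a flow listing on stdin.
arp_rsp_cookie() {
    target_ip=$1
    grep 'ls_in_arp_rsp' |
        grep -F "arp.tpa == ${target_ip} " |
        sed -n 's/.*uuid=\(0x[0-9a-f]*\).*/\1/p'
}

# Sample input copied from the two flows shown above (actions shortened).
sample_lflows() {
    cat <<'EOF'
  uuid=0xb9394db0, table=14(ls_in_arp_rsp), priority=50   , match=(arp.tpa == 10.0.0.20 && arp.op == 1), action=(...)
  uuid=0x7c4824a5, table=14(ls_in_arp_rsp), priority=50   , match=(arp.tpa == 10.0.0.10 && arp.op == 1), action=(...)
EOF
}

sample_lflows | arp_rsp_cookie 10.0.0.20   # prints 0xb9394db0
```

On a live system the input would come from the Southbound flow listing rather than from the sample function, and the printed value would go into the cookie=<value>/-1 argument used above.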

With this setup, we expect br-ex to reply to the ARP requests, so each VM will learn the MAC address of br-ex instead of the actual one for the remote VM:

[vagrant@worker1 ~]$ sudo ip address show br-ex
8: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 9a:a1:47:9b:36:45 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.42/24 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::98a1:47ff:fe9b:3645/64 scope link
       valid_lft forever preferred_lft forever
[vagrant@worker1 ~]$ sudo ip netns exec vm1 ip nei | grep 10.0.0.20
10.0.0.20 dev vm1 lladdr 9a:a1:47:9b:36:45 REACHABLE


[root@worker2 ~]# sudo ip address show br-ex
9: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 72:84:42:d8:f6:48 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.42/24 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::7084:42ff:fed8:f648/64 scope link
       valid_lft forever preferred_lft forever
[root@worker2 ~]# sudo ip netns exec vm2 ip nei | grep 10.0.0.10
10.0.0.10 dev vm2 lladdr 72:84:42:d8:f6:48 REACHABLE

Now that we have solved the L2 portion, we need to steer the traffic through the L3 networks with the central node (our leaf / ToR switch).

Routing

Ideally, we would have BGP speakers running on each node advertising host routes to our VMs. This way we could learn the routes dynamically without any configuration needed on the ToRs or on the hypervisors.

As a first step, I have omitted this for now and configured routes statically:

[root@central ~]# ip route
default via 192.168.121.1 dev eth0 proto dhcp metric 100
10.0.0.10
        nexthop via 100.64.1.3 dev eth4 weight 1
        nexthop via 100.65.1.3 dev eth2 weight 1
10.0.0.20
        nexthop via 100.64.2.3 dev eth5 weight 1
        nexthop via 100.65.2.3 dev eth3 weight 1
100.64.1.0/24 dev eth4 proto kernel scope link src 100.64.1.2 metric 104
100.64.2.0/24 dev eth5 proto kernel scope link src 100.64.2.2 metric 105
100.65.1.0/24 dev eth2 proto kernel scope link src 100.65.1.2 metric 102
100.65.2.0/24 dev eth3 proto kernel scope link src 100.65.2.2 metric 103
192.168.50.0/24 dev eth1 proto kernel scope link src 192.168.50.10 metric 101
192.168.121.0/24 dev eth0 proto kernel scope link src 192.168.121.52 metric 100


[vagrant@worker1 ~]$ ip route
default via 192.168.121.1 dev eth0 proto dhcp metric 100
10.0.0.0/24 dev br-ex proto kernel scope link src 10.0.0.42
10.0.0.20
        nexthop via 100.65.1.2 dev eth2 weight 1
        nexthop via 100.64.1.2 dev eth3 weight 1
100.64.1.0/24 dev eth3 proto kernel scope link src 100.64.1.3 metric 103
100.65.1.0/24 dev eth2 proto kernel scope link src 100.65.1.3 metric 102
192.168.50.0/24 dev eth1 proto kernel scope link src 192.168.50.100 metric 101
192.168.121.0/24 dev eth0 proto kernel scope link src 192.168.121.27 metric 100


[root@worker2 ~]# ip route
default via 192.168.121.1 dev eth0 proto dhcp metric 100
10.0.0.0/24 dev br-ex proto kernel scope link src 10.0.0.42
10.0.0.10
        nexthop via 100.65.2.2 dev eth2 weight 1
        nexthop via 100.64.2.2 dev eth3 weight 1
100.64.2.0/24 dev eth3 proto kernel scope link src 100.64.2.3 metric 103
100.65.2.0/24 dev eth2 proto kernel scope link src 100.65.2.3 metric 102
192.168.50.0/24 dev eth1 proto kernel scope link src 192.168.50.101 metric 101
192.168.121.0/24 dev eth0 proto kernel scope link src 192.168.121.147 metric 100

You can notice that there are Equal-cost multi-path (ECMP) routes to the VM addresses through the 100.{64, 65}.x.x networks. This provides load balancing using a 5-tuple hash algorithm as two NICs are going to be used for our data plane traffic.
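The listings above only show the resulting routing tables. As a minimal sketch, the two-path host routes could be created with iproute2 commands like the one printed below. The helper name ecmp_route_cmd is hypothetical, and it only prints the command (a dry run) so it can be reviewed before being run as root.

```shell
# Hypothetical dry-run helper: print (not run) the iproute2 command for a
# two-path ECMP host route. Arguments: dest, nexthop1, dev1, nexthop2, dev2.
ecmp_route_cmd() {
    printf 'ip route add %s nexthop via %s dev %s weight 1 nexthop via %s dev %s weight 1\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# worker1's route towards vm2, using the next hops from the table above.
ecmp_route_cmd 10.0.0.20 100.65.1.2 eth2 100.64.1.2 eth3
```

On Linux, which header fields feed the multipath hash is controlled by the net.ipv4.fib_multipath_hash_policy sysctl, so it is worth checking its value when validating how traffic is spread.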

Combined with the use of Bidirectional Forwarding Detection (BFD), we could also provide fault tolerance by detecting when one of the links goes down and steering all the traffic through the other link. For the sake of this blogpost, we’ll ignore this part.

With the routes listed above and ARP proxy enabled on both worker nodes, we’re now able to route the traffic between the two VMs:

[vagrant@worker1 ~]$ sudo ip netns exec vm1 ping 10.0.0.20 -c4
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.20: icmp_seq=1 ttl=61 time=0.891 ms
64 bytes from 10.0.0.20: icmp_seq=2 ttl=61 time=0.563 ms
64 bytes from 10.0.0.20: icmp_seq=3 ttl=61 time=0.701 ms
64 bytes from 10.0.0.20: icmp_seq=4 ttl=61 time=0.772 ms

--- 10.0.0.20 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3001ms
rtt min/avg/max/mdev = 0.563/0.731/0.891/0.123 ms

Also we can check that the ECMP routes are working by inspecting the ICMP traffic on eth2/eth3 on one of the workers while the ping is running:

[root@worker2 ~]# tcpdump -i eth2 -vvne icmp
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
11:55:18.523039 52:54:00:c2:46:52 > 52:54:00:3c:9e:fd, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 62387, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.20 > 10.0.0.10: ICMP echo reply, id 9935, seq 7, length 64
11:55:19.523825 52:54:00:c2:46:52 > 52:54:00:3c:9e:fd, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 62817, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.20 > 10.0.0.10: ICMP echo reply, id 9935, seq 8, length 64


[root@worker2 ~]# tcpdump -i eth3 -vvnee icmp
tcpdump: listening on eth3, link-type EN10MB (Ethernet), capture size 262144 bytes
11:55:24.524921 52:54:00:86:a6:67 > 52:54:00:76:11:56, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 9656, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.10 > 10.0.0.20: ICMP echo request, id 9935, seq 13, length 64
11:55:25.526151 52:54:00:86:a6:67 > 52:54:00:76:11:56, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 10378, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.10 > 10.0.0.20: ICMP echo request, id 9935, seq 14, length 64
11:55:26.527168 52:54:00:86:a6:67 > 52:54:00:76:11:56, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 11334, offset 0, flags [DF], proto ICMP (1), length 84)

We can see that the ICMP requests are arriving on one interface (eth3) while the ICMP replies are being sent on the other (eth2), effectively splitting the load across the two NICs.

Next steps:

  • Add support in OVN for Proxy ARP (right now, we are using Proxy ARP in the kernel and removing the ARP responder flows manually)
  • Add BGP support to this deployment in order to avoid the static configuration of the nodes and routers
  • Add BFD capabilities to provide fault tolerance

I have written an automated setup based on Vagrant that will deploy and configure everything as explained above. You can clone it from here.

Debugging scaling issues on OVN

One of the great things of OVN is that North/South routing can happen locally in the hypervisor when a ‘dnat_and_snat‘ rule is used (aka Floating IP) without having to go through a central or network node.

The way it works is that when outgoing traffic reaches the OVN bridge, the source IP address is replaced with the Floating IP (snat) and the packet is pushed out to the external network. Similarly, when a packet comes in, the Floating IP address is replaced (dnat) with that of the virtual machine in the private network.

In the OpenStack world, this is called DVR (“Distributed Virtual Routing”) as the routing doesn’t need to traverse any central node and happens on compute nodes meaning no extra hops, no overlay traffic and distributed routing processing.

The main advantage is that if all your workloads have a Floating IP and a lot of N/S traffic, the cloud can be well dimensioned and it’s very scalable (no need to scale any network/gateway nodes as you scale out the number of computes and less dependency on the control plane). The drawback is that you’ll need to consume IP addresses from the FIP pool. Yeah, it couldn’t be all good news :p

All this introduction to say that during some testing on an OVN cluster with lots of Floating IPs, we noticed that the amount of Logical Flows was *huge* and that led to numerous problems related to a very high CPU and memory consumption on both server (ovsdb-server) and client (ovn-controller) sides.

I wanted to understand how the flows were distributed and what the main contributors to this explosion were. What I did was simply count the number of flows in every stage and sort them. This showed that 93% of all the Logical Flows were in two stages:

$ head -n 6 logical_flows_distribution_sorted.txt
lr_out_egr_loop: 423414  62.24%
lr_in_ip_routing: 212199  31.19%
lr_in_ip_input: 10831  1.59%
ls_out_acl: 4831  0.71%
ls_in_port_sec_ip: 3471  0.51%
ls_in_l2_lkup: 2360  0.34%

Here are the simple commands that I used to figure out the flow distribution:

$ ovn-sbctl list Logical_Flow > logical_flows.txt

# Retrieve all the stages in the current pipeline
$ grep ^external_ids logical_flows.txt | sed 's/.*stage-name=//' | tr -d '}' | sort | uniq > stage_names.txt

# Count how many flows on each stage
$ while read -r stage; do echo "$stage: $(grep -c "$stage" logical_flows.txt)"; done < stage_names.txt > logical_flows_distribution.txt

$ sort -k 2 -g -r logical_flows_distribution.txt > logical_flows_distribution_sorted.txt
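As a minimal sketch, the same distribution can be computed in a single pass, which also produces the percentage column and counts exact stage names rather than substring matches. The function name lflow_distribution is hypothetical, and the sample input is made up only to show the expected shape of the Logical_Flow listing.

```shell
# Count logical flows per stage in a single pass and print each stage's share.
# Reads the text output of "ovn-sbctl list Logical_Flow" on stdin.
lflow_distribution() {
    sed -n 's/.*stage-name=\([A-Za-z0-9_]*\).*/\1/p' |
        sort | uniq -c |
        awk '{ count[$2] = $1; total += $1 }
             END { for (s in count)
                       printf "%s: %d %.2f%%\n", s, count[s], 100 * count[s] / total }' |
        sort -k 2,2nr
}

# Tiny made-up sample, only to show the expected input shape.
sample_lflows() {
    cat <<'EOF'
external_ids        : {source="ovn-northd.c:8958", stage-name=lr_out_egr_loop}
external_ids        : {source="ovn-northd.c:8950", stage-name=lr_out_egr_loop}
external_ids        : {source="ovn-northd.c:6945", stage-name=lr_in_ip_routing}
EOF
}

sample_lflows | lflow_distribution
```

With real data, the function would be fed from logical_flows.txt instead of the sample.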

Next step would be to understand what’s in those two tables (lr_out_egr_loop & lr_in_ip_routing):

_uuid               : e1cc600a-fb9c-4968-a124-b0f78ed8139f
actions             : "next;"
external_ids        : {source="ovn-northd.c:8958", stage-name=lr_out_egr_loop}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "ip4.src == 172.24.4.10 && ip4.dst == 172.24.4.209"
pipeline            : egress
priority            : 200
table_id            : 2
hash                : 0

_uuid               : c8d8400a-590e-4b7e-b433-7a1491d31488
actions             : "inport = outport; outport = \"\"; flags = 0; flags.loopback = 1; reg9[1] = 1; next(pipeline=ingress, table=0); "
external_ids        : {source="ovn-northd.c:8950", stage-name=lr_out_egr_loop}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "is_chassis_resident(\"vm1\") && ip4.src == 172.24.4.218 && ip4.dst == 172.24.4.220"
pipeline            : egress
priority            : 300
table_id            : 2
hash                : 0

_uuid               : 0777b005-0ff0-40cb-8532-f7e2261dae06
actions             : "outport = \"router1-public\"; eth.src = 40:44:00:00:00:06; eth.dst = 40:44:00:00:00:07; reg0 = ip4.dst; reg1 = 172.24.4.218; reg9[2] = 1; reg9[0] = 0; next;"
external_ids        : {source="ovn-northd.c:6945", stage-name=lr_in_ip_routing}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "inport == \"router1-net1\" && ip4.src == 192.168.0.11 && ip4.dst == 172.24.4.226"
pipeline            : ingress
priority            : 400
table_id            : 9
hash                : 0

Turns out that those flows are intended to handle inter-FIP communication. Basically, there are flows for every possible FIP pair so that the traffic doesn’t flow through a Geneve tunnel.

While FIP-to-FIP traffic between two OVN ports is perhaps not the most common use case, those flows are there so that such traffic is handled in a distributed way and never sent through the overlay network.

A git blame on the code that generates those flows will show the commits [1][2] and some background on the actual issue.

With the results above, one would expect a quadratic growth but still, it’s always nice to pull some graphs 🙂
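To make the quadratic expectation concrete, here is a minimal sketch that counts the ordered (source, destination) pairs among n Floating IPs. It assumes that the per-pair flows scale with the number of ordered pairs; how many Logical Flows each pair produces is not derived here. The function name fip_pairs is hypothetical.

```shell
# Number of ordered (source, destination) pairs among n Floating IPs.
fip_pairs() {
    echo $(( $1 * ($1 - 1) ))
}

for n in 10 50 100 200; do
    echo "$n FIPs -> $(fip_pairs "$n") ordered pairs"
done
# The last line printed is: 200 FIPs -> 39800 ordered pairs
```

Doubling the number of FIPs roughly quadruples the number of pairs, which is the shape one would expect to see in the graph.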

 

And again, the simple script that I used to get the numbers in the graph:

ovn-nbctl clear logical_router router1 nat
for i in {1..200}; do
  ovn-nbctl lr-nat-add router1 dnat_and_snat 172.24.4.$i 192.168.0.11 vm1 40:44:00:00:00:07
  # Allow some time to northd to generate the new lflows.
  ovn-sbctl list logical_flow > /dev/null
  ovn-sbctl list logical_flow > /tmp/lflows.txt
  S1=$(grep lr_out_egr_loop -c /tmp/lflows.txt )
  S2=$(grep lr_in_ip_routing -c /tmp/lflows.txt )
  echo $S1 $S2
done

Soon after I reported this scaling issue with the above findings, my colleague Numan did an amazing job fixing it with this patch. The results are impressive: for the same test scenario with 200 FIPs, the total amount of lflows dropped from ~127K to ~2.7K and, most importantly, the growth went from quadratic to linear.

 

Since the Logical Flows are represented in ASCII, they are also quite expensive to process due to the string parsing, and not very cheap to transmit in the OVSDB protocol. This has been a great leap when it comes to scaling environments with heavy N/S traffic and lots of Floating IPs.

OVN – Geneve Encapsulation

In the last post we created a Logical Switch with two ports residing on different hypervisors. Communication between those two ports took place over the tunnel interface using Geneve encapsulation. Let’s now take a closer look at this overlay traffic.

Without diving too much into the packet processing in OVN, we need to know that each Logical Datapath (Logical Switch / Logical Router) has an ingress and an egress pipeline. Whenever a packet comes in, the ingress pipeline is executed and after the output action, the egress pipeline will run to deliver the packet to its destination. More info here: http://docs.openvswitch.org/en/latest/faq/ovn/#ovn

In our scenario, when we ping from VM1 to VM2, the ingress pipeline of each ICMP packet runs on Worker1 (where VM1 is bound to) and the packet is pushed to the tunnel interface to Worker2 (where VM2 resides). When Worker2 receives the packet on its physical interface, the egress pipeline of the Logical Switch (network1) is executed to deliver the packet to VM2. But … How does OVN know where the packet comes from and which Logical Datapath should process it? This is where the metadata in the Geneve headers comes in.

Let’s get back to our setup and ping from VM1 to VM2 and capture traffic on the physical interface (eth1) of Worker2:

[root@worker2 ~]# sudo tcpdump -i eth1 -vvvnnexx

17:02:13.403229 52:54:00:13:e0:a2 > 52:54:00:ac:67:5b, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 63920, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.100.7549 > 192.168.50.101.6081: [bad udp cksum 0xe6a5 -> 0x7177!] Geneve, Flags [C], vni 0x1, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00010002]
        40:44:00:00:00:01 > 40:44:00:00:00:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 41968, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 6897, length 64
        0x0000:  5254 00ac 675b 5254 0013 e0a2 0800 4500
        0x0010:  008e f9b0 4000 4011 5a94 c0a8 3264 c0a8
        0x0020:  3265 1d7d 17c1 007a e6a5 0240 6558 0000
        0x0030:  0100 0102 8001 0001 0002 4044 0000 0002
        0x0040:  4044 0000 0001 0800 4500 0054 a3f0 4000
        0x0050:  4001 1551 c0a8 000b c0a8 000c 0800 c67b
        0x0060:  04e3 1af1 94d9 6e5c 0000 0000 41a7 0e00
        0x0070:  0000 0000 1011 1213 1415 1617 1819 1a1b
        0x0080:  1c1d 1e1f 2021 2223 2425 2627 2829 2a2b
        0x0090:  2c2d 2e2f 3031 3233 3435 3637

17:02:13.403268 52:54:00:ac:67:5b > 52:54:00:13:e0:a2, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 46181, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.101.9683 > 192.168.50.100.6081: [bad udp cksum 0xe6a5 -> 0x6921!] Geneve, Flags [C], vni 0x1, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00020001]
        40:44:00:00:00:02 > 40:44:00:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 16422, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.0.12 > 192.168.0.11: ICMP echo reply, id 1251, seq 6897, length 64
        0x0000:  5254 0013 e0a2 5254 00ac 675b 0800 4500
        0x0010:  008e b465 4000 4011 9fdf c0a8 3265 c0a8
        0x0020:  3264 25d3 17c1 007a e6a5 0240 6558 0000
        0x0030:  0100 0102 8001 0002 0001 4044 0000 0001
        0x0040:  4044 0000 0002 0800 4500 0054 4026 0000
        0x0050:  4001 b91b c0a8 000c c0a8 000b 0000 ce7b
        0x0060:  04e3 1af1 94d9 6e5c 0000 0000 41a7 0e00
        0x0070:  0000 0000 1011 1213 1415 1617 1819 1a1b
        0x0080:  1c1d 1e1f 2021 2223 2425 2627 2829 2a2b
        0x0090:  2c2d 2e2f 3031 3233 3435 3637

Let’s now decode the ICMP request packet (I’m using this tool):

ICMP request inside the Geneve tunnel

Metadata

 

In the ovn-architecture(7) document, you can check how the Metadata is used in OVN in the Tunnel Encapsulations section. In short, OVN encodes the following information in the Geneve packets:

  • Logical Datapath (switch/router) identifier (24 bits) – Geneve VNI
  • Ingress and Egress port identifiers – Option with class 0x0102 and type 0x80 with 32 bits of data:
         1       15          16
       +---+------------+-----------+
       |rsv|ingress port|egress port|
       +---+------------+-----------+
         0

Back to our example: VNI = 0x000001 and Option Data = 00010002, so from the above:

Logical Datapath = 1   Ingress Port = 1   Egress Port = 2
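The decoding above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch that follows the bit layout in the diagram (1 reserved bit, 15 bits of ingress port, 16 bits of egress port). The function name decode_ovn_option is hypothetical, and its argument is the option data exactly as tcpdump prints it.

```shell
# Decode the 32-bit OVN Geneve option data (hex, no 0x prefix) into the
# ingress and egress logical port tunnel keys:
# 1 reserved bit, 15 bits of ingress port, 16 bits of egress port.
decode_ovn_option() {
    value=$(( 0x$1 ))
    ingress=$(( (value >> 16) & 0x7fff ))
    egress=$(( value & 0xffff ))
    echo "ingress=$ingress egress=$egress"
}

decode_ovn_option 00010002   # ICMP request: prints ingress=1 egress=2
decode_ovn_option 00020001   # ICMP reply:   prints ingress=2 egress=1
```

The second call uses the option data from the ICMP reply in the capture, where the two ports are swapped as expected.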

Let’s take a look at SB database contents to see if they match what we expect:

[root@central ~]# ovn-sbctl get Datapath_Binding network1 tunnel-key
1

[root@central ~]# ovn-sbctl get Port_Binding vm1 tunnel-key
1

[root@central ~]# ovn-sbctl get Port_Binding vm2 tunnel-key
2

We can see that the Logical Datapath belongs to network1, that the ingress port is vm1 and that the output port is vm2 which makes sense as we’re analyzing the ICMP request from VM1 to VM2. 

By the time this packet hits the Worker2 hypervisor, OVN has all the information it needs to process the packet on the right pipeline and deliver the packet to VM2 without having to run the ingress pipeline again.

What if we don’t use any encapsulation?

This is technically possible in OVN, and there are scenarios that call for it, such as managing a physical network directly without any kind of overlay technology. In this case, our ICMP request packet would have been pushed directly to the network and, when Worker2 receives the packet, OVN needs to figure out (based on the IP/MAC addresses) which ingress pipeline to execute (a second time, as it was already executed by Worker1) before it can go to the egress pipeline and deliver the packet to VM2.

Multinode OVN setup

As a follow-up to the last post, we are now going to deploy a 3-node OVN setup to demonstrate basic L2 communication across different hypervisors. This is the physical topology and how services are distributed:

  • Central node: ovn-northd and ovsdb-servers (North and Southbound databases) as well as ovn-controller
  • Worker1 / Worker2: ovn-controller connected to Central node Southbound ovsdb-server (TCP port 6642)

In order to deploy the 3 machines, I’m using Vagrant + libvirt and you can check out the Vagrant files and scripts used from this link. After running ‘vagrant up’, we’ll have 3 nodes with OVS/OVN installed from sources and we should be able to log in to the central node and verify that OVN is up and running and that Geneve tunnels have been established to both workers:

 

[vagrant@central ~]$ sudo ovs-vsctl show
f38658f5-4438-4917-8b51-3bb30146877a
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-worker-1"
            Interface "ovn-worker-1"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.101"}
        Port "ovn-worker-0"
            Interface "ovn-worker-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.100"}
    ovs_version: "2.11.90"

 

For demonstration purposes, we’re going to create a Logical Switch (network1) and two Logical Ports (vm1 and vm2). Then we’re going to bind VM1 to Worker1 and VM2 to Worker2. If everything works as expected, both Logical Ports should be able to communicate through the overlay network formed between both worker nodes.

We can run the following commands on any node to create the logical topology (please note that if we run them on Worker1 or Worker2, we need to specify the NB database location by running ovn-nbctl with "--db=tcp:192.168.50.10:6641", as 6641 is the default port for the NB database):

ovn-nbctl ls-add network1
ovn-nbctl lsp-add network1 vm1
ovn-nbctl lsp-add network1 vm2
ovn-nbctl lsp-set-addresses vm1 "40:44:00:00:00:01 192.168.0.11"
ovn-nbctl lsp-set-addresses vm2 "40:44:00:00:00:02 192.168.0.12"

And now let’s check the contents of the Northbound and Southbound databases. As we didn’t bind any port to the workers yet, the “ovn-sbctl show” command should only list the chassis (OVN jargon for hosts):

[root@central ~]# ovn-nbctl show
switch a51334e8-f77d-4d85-b01a-e547220eb3ff (network1)
    port vm2
        addresses: ["40:44:00:00:00:02 192.168.0.12"]
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.0.11"]

[root@central ~]# ovn-sbctl show
Chassis "worker2"
    hostname: "worker2"
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
Chassis central
    hostname: central
    Encap geneve
        ip: "127.0.0.1"
        options: {csum="true"}
Chassis "worker1"
    hostname: "worker1"
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}

Now we’re going to bind VM1 to Worker1:

ovs-vsctl add-port br-int vm1 -- set Interface vm1 type=internal -- set Interface vm1 external_ids:iface-id=vm1
ip netns add vm1
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 40:44:00:00:00:01
ip netns exec vm1 ip addr add 192.168.0.11/24 dev vm1
ip netns exec vm1 ip link set vm1 up

And VM2 to Worker2:

ovs-vsctl add-port br-int vm2 -- set Interface vm2 type=internal -- set Interface vm2 external_ids:iface-id=vm2
ip netns add vm2
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 40:44:00:00:00:02
ip netns exec vm2 ip addr add 192.168.0.12/24 dev vm2
ip netns exec vm2 ip link set vm2 up
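Since the two binding sequences above differ only in the port name, MAC and IP address, they can be parameterized. The following is a minimal sketch with a hypothetical helper, bind_port_cmds, that only prints the commands (a dry run), so the output can be reviewed before being run as root on the right worker.

```shell
# Hypothetical dry-run helper: print the commands that bind a logical port to
# the local chassis using a network namespace. Arguments: name, MAC, IP/prefix.
bind_port_cmds() {
    name=$1 mac=$2 cidr=$3
    cat <<EOF
ovs-vsctl add-port br-int $name -- set Interface $name type=internal -- set Interface $name external_ids:iface-id=$name
ip netns add $name
ip link set $name netns $name
ip netns exec $name ip link set $name address $mac
ip netns exec $name ip addr add $cidr dev $name
ip netns exec $name ip link set $name up
EOF
}

# Commands for VM1 on Worker1, matching the sequence shown above.
bind_port_cmds vm1 40:44:00:00:00:01 192.168.0.11/24
```

The key piece is external_ids:iface-id, which has to match the Logical Port name so that ovn-controller can claim the port for the local chassis.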

Checking again the Southbound database, we should see the port binding status:

[root@central ~]# ovn-sbctl show
Chassis "worker2"
    hostname: "worker2"
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding "vm2"
Chassis central
    hostname: central
    Encap geneve
        ip: "127.0.0.1"
        options: {csum="true"}
Chassis "worker1"
    hostname: "worker1"
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding "vm1"

Now let’s check connectivity between VM1 (Worker1) and VM2 (Worker2):

[root@worker1 ~]# ip netns exec vm1 ping 192.168.0.12 -c2
PING 192.168.0.12 (192.168.0.12) 56(84) bytes of data.
64 bytes from 192.168.0.12: icmp_seq=1 ttl=64 time=0.416 ms
64 bytes from 192.168.0.12: icmp_seq=2 ttl=64 time=0.307 ms

--- 192.168.0.12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.307/0.361/0.416/0.057 ms


[root@worker2 ~]# ip netns exec vm2 ping 192.168.0.11 -c2
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.825 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.275 ms

--- 192.168.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.275/0.550/0.825/0.275 ms

As both ports are located in different hypervisors, OVN is pushing the traffic via the overlay Geneve tunnel from Worker1 to Worker2. In the next post, we’ll analyze the Geneve encapsulation and how OVN uses its metadata internally.

For now, let’s ping from VM1 to VM2 and just capture traffic on the geneve interface on Worker2 to verify that ICMP packets are coming through the tunnel:

[root@worker2 ~]# tcpdump -i genev_sys_6081 -vvnn icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
15:07:42.395318 IP (tos 0x0, ttl 64, id 45147, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 26, length 64
15:07:42.395383 IP (tos 0x0, ttl 64, id 39282, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.0.12 > 192.168.0.11: ICMP echo reply, id 1251, seq 26, length 64
15:07:43.395221 IP (tos 0x0, ttl 64, id 45612, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 27, length 64

In coming posts we’ll cover Geneve encapsulation as well as OVN pipelines and L3 connectivity.

Implementing Security Groups in OpenStack using OVN Port Groups

Some time back, when looking at the performance of OpenStack using OVN as the networking backend, we noticed that it didn’t scale really well, and it turned out that the major culprit was the way we implemented Neutron Security Groups. In order to illustrate the issue and the optimizations that we carried out, let’s first explain how Security Groups were originally implemented:

Networking-ovn and Neutron Security Groups

Originally, Security Groups were implemented using a combination of OVN resources such as Address Sets and Access Control Lists (ACLs):

  • Address Sets: An OVN Address set contains a number of IP addresses that can be referenced from an ACL. In networking-ovn  we directly map Security Groups to OVN Address Sets: every time a new IP address is allocated for a port, this address will be added to the Address Set(s) representing the Security Groups which the port belongs to.
$ ovn-nbctl list address_set
_uuid : 039032e4-9d98-4368-8894-08e804e9ee78
addresses : ["10.0.0.118", "10.0.0.123", "10.0.0.138", "10.0.0.143"]
external_ids : {"neutron:security_group_id"="0509db24-4755-4321-bb6f-9a094962ec91"}
name : "as_ip4_0509db24_4755_4321_bb6f_9a094962ec91"
  • ACLs: They are applied to a Logical Switch (Neutron network). They have a 1-to-many relationship with Neutron Security Group Rules. For instance, when the user creates a single Neutron rule within a Security Group to allow ingress ICMP traffic, it will map to N ACLs in OVN Northbound database with N being the number of ports that belong to that Security Group.
$ openstack security group rule create --protocol icmp default
_uuid : 6f7635ff-99ae-498d-8700-eb634a16903b
action : allow-related
direction : to-lport
external_ids : {"neutron:lport"="95fb15a4-c638-42f2-9035-bee989d80603", "neutron:security_group_rule_id"="70bcb4ca-69d6-499f-bfcf-8f353742d3ff"}
log : false
match : "outport == \"95fb15a4-c638-42f2-9035-bee989d80603\" && ip4 && ip4.src == 0.0.0.0/0 && icmp4"
meter : []
name : []
priority : 1002
severity : []

On the other hand, Neutron has the possibility to filter traffic between ports within the same Security Group or a remote Security Group. One use case may be: a set of VMs whose ports belong to SG1 only allowing HTTP traffic from the outside and another set of VMs whose ports belong to SG2 blocking all incoming traffic. From Neutron, you can create a rule to allow database connections from SG1 to SG2. In this case, in OVN we’ll see ACLs referencing the aforementioned Address Sets.

$ openstack security group rule create --protocol tcp --dst-port 3306 --remote-group webservers default
+-------------------+--------------------------------------+
| Field | Value |
+-------------------+--------------------------------------+
| created_at | 2018-12-21T11:29:32Z |
| description | |
| direction | ingress |
| ether_type | IPv4 |
| id | 663012c1-67de-45e1-a398-d15bd4f295bb |
| location | None |
| name | None |
| port_range_max | 3306 |
| port_range_min | 3306 |
| project_id | 471603b575184afc85c67d0c9e460e85 |
| protocol | tcp |
| remote_group_id | 11059b7d-725c-4740-8db8-5c5b89865d0f |
| remote_ip_prefix | None |
| revision_number | 0 |
| security_group_id | 0509db24-4755-4321-bb6f-9a094962ec91 |
| updated_at | 2018-12-21T11:29:32Z |
+-------------------+--------------------------------------+

This results in the following OVN ACL in the Northbound database:


_uuid : 03dcbc0f-38b2-42da-8f20-25996044e516
action : allow-related
direction : to-lport
external_ids : {"neutron:lport"="7d6247b7-65b9-4864-a9a0-a85bacb4d9ac", "neutron:security_group_rule_id"="663012c1-67de-45e1-a398-d15bd4f295bb"}
log : false
match : "outport == \"7d6247b7-65b9-4864-a9a0-a85bacb4d9ac\" && ip4 && ip4.src == $as_ip4_11059b7d_725c_4740_8db8_5c5b89865d0f && tcp && tcp.dst == 3306"
meter : []
name : []
priority : 1002
severity : []

Problem “at scale”

In order to best illustrate the impact of the optimizations that the Port Groups feature brought in OpenStack, let’s take a look at the number of ACLs on a typical setup when creating just 100 ports on a single network. All those ports will belong to a Security Group with the following rules:

  1. Allow incoming SSH traffic
  2. Allow incoming HTTP traffic
  3. Allow incoming ICMP traffic
  4. Allow all IPv4 traffic between ports of this same Security Group
  5. Allow all IPv6 traffic between ports of this same Security Group
  6. Allow all outgoing IPv4 traffic
  7. Allow all outgoing IPv6 traffic

Every time we create a port, 10 new ACLs (the 7 rules above + DHCP traffic ACL + default egress drop ACL + default ingress drop ACL) will be created in OVN:


$ ovn-nbctl list ACL| grep ce2ad98f-58cf-4b47-bd7c-38019f844b7b | grep match
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip6 && ip6.src == $as_ip6_0509db24_4755_4321_bb6f_9a094962ec91"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == 0.0.0.0/0 && icmp4"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == $as_ip4_0509db24_4755_4321_bb6f_9a094962ec91"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip6"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 80"
match : "outport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip4 && ip4.dst == {255.255.255.255, 10.0.0.0/8} && udp && udp.src == 68 && udp.dst == 67"
match : "inport == \"ce2ad98f-58cf-4b47-bd7c-38019f844b7b\" && ip"

With 100 ports, we’ll observe 1K ACLs in the system:

$ ovn-nbctl lsp-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
100
$ ovn-nbctl acl-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
1000
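The arithmetic behind that figure can be sketched as follows. The function and constant names are purely illustrative (they are not networking-ovn code); the numbers come from the rule list above.

```python
# Illustrative arithmetic only; names are hypothetical, not networking-ovn code.
SECURITY_GROUP_RULES = 7   # the seven rules listed above
DHCP_ACLS = 1              # ACL allowing DHCP traffic from the port
DEFAULT_DROP_ACLS = 2      # default ingress drop + default egress drop


def acls_per_port_model(num_ports: int) -> int:
    """Total ACLs when every rule is instantiated once per port."""
    per_port = SECURITY_GROUP_RULES + DHCP_ACLS + DEFAULT_DROP_ACLS
    return num_ports * per_port


for ports in (1, 100, 1000):
    print(ports, acls_per_port_model(ports))
```

The count grows linearly with the number of ports, which is the root of the scaling problem.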

When ovn-northd sees these new ACLs, it creates the corresponding Logical Flows in the Southbound database, which ovn-controller then translates into OpenFlow flows on the actual hypervisors. The number of Logical Flows for this 100-port system can be pulled like this:

$ ovn-sbctl lflow-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
3052

At this point, you can pretty much tell that this doesn’t look very promising at scale.

Optimization

One can quickly spot an optimization: have just one ACL per Security Group Rule instead of one ACL per Security Group Rule per port. This is possible if the ‘match’ column of an ACL can reference a set of ports rather than each port individually. It would mainly alleviate the calculations on the networking-ovn side, where we saw a bottleneck at scale when processing new ports due to the high number of ACLs.
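To make the idea concrete, here is a minimal sketch of how the match expression changes. The helper names are hypothetical. The match syntax (a quoted port name versus an `@`-prefixed Port Group name) and the `pg_` naming, with dashes replaced by underscores, follow the ACLs shown in this post.

```python
def port_group_name(security_group_id: str) -> str:
    """Port Group name as it appears in this post: 'pg_' + id with '-' -> '_'."""
    return "pg_" + security_group_id.replace("-", "_")


def per_port_matches(port_ids, condition):
    """Old model: one match (hence one ACL) per port for the same rule."""
    return ['outport == "%s" && %s' % (port, condition) for port in port_ids]


def port_group_match(security_group_id, condition):
    """New model: a single match referencing the whole set of ports."""
    return "outport == @%s && %s" % (port_group_name(security_group_id), condition)


ssh_rule = "ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22"
ports = ["ce2ad98f-58cf-4b47-bd7c-38019f844b7b",
         "7d6247b7-65b9-4864-a9a0-a85bacb4d9ac"]

print(len(per_port_matches(ports, ssh_rule)))   # grows with the number of ports
print(port_group_match("0509db24-4755-4321-bb6f-9a094962ec91", ssh_rule))
```

Adding a port to the Security Group then only means adding it to the set, while the ACL itself stays untouched.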

Such an optimization requires a few changes on the core OVN side:

  • Changes to the schema to create a new table in the Northbound database (Port_Group) and to be able to apply ACLs also to a Port Group.
  • Changes to ovn-northd so that it creates new Logical Flows based on ACLs applied to Port Groups.
  • Changes to ovn-controller so that it can figure out the physical flows to install on every hypervisor based on the new Logical Flows.

These changes happened mainly in the following three patches, and the feature is present in OvS 2.10 and beyond:

https://github.com/openvswitch/ovs/commit/3d2848bafa93a2b483a4504c5de801454671dccf

https://github.com/openvswitch/ovs/commit/1beb60afd25a64f1779903b22b37ed3d9956d47c

https://github.com/openvswitch/ovs/commit/689829d53612a573f810271a01561f7b0948c8c8

On the networking-ovn side, we needed to adapt the code as well to:

  • Make use of the new feature and implement Security Groups using Port Groups.
  • Ensure a migration path from old implementation to Port Groups.
  • Keep backwards compatibility: in case an older version of OvS is used, we need to fall back to the previous implementation.

Here you can see the main patch to accomplish the changes above:

https://github.com/openstack/networking-ovn/commit/f01169b405bb5080a1bc1653f79512eb0664c35d

If we recreate the earlier scenario (1000 ACLs for 100 ports on our Security Group) using the new feature, we can compare the number of resources that we’re now using:

$ ovn-nbctl lsp-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
100

Two OVN Port Groups have been created: one for our Security Group, and neutron_pg_drop, which is used to add fallback, low-priority drop ACLs (by default OVN will allow all traffic if no explicit drop ACLs are added):

$ ovn-nbctl --bare --columns=name list Port_Group
neutron_pg_drop
pg_0509db24_4755_4321_bb6f_9a094962ec91

ACLs are now applied to Port Groups and not to the Logical Switch:

$ ovn-nbctl acl-list neutron-ebde771e-a93d-438d-a689-d02e9c91c7cf | wc -l
0
$ ovn-nbctl acl-list pg_0509db24_4755_4321_bb6f_9a094962ec91 | wc -l
7
$ ovn-nbctl acl-list neutron_pg_drop | wc -l
2

The number of ACLs has gone from 1000 (10 per port) to just 9 regardless of the number of ports in the system:

$ ovn-nbctl --bare --columns=match list ACL
inport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4
inport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip6
inport == @neutron_pg_drop && ip
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == 0.0.0.0/0 && icmp4
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip6 && ip6.src == $pg_0509db24_4755_4321_bb6f_9a094962ec91_ip6
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == $pg_0509db24_4755_4321_bb6f_9a094962ec91_ip4
outport == @pg_0509db24_4755_4321_bb6f_9a094962ec91 && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 80
outport == @neutron_pg_drop && ip

 

This change was merged in OpenStack Queens and requires at least OvS 2.10. Also, when upgrading from an earlier version of either OpenStack or OvS, networking-ovn takes care of the migration from Address Sets to Port Groups when the Neutron server starts, and the new implementation is used automatically.

As a bonus, this makes it easier to apply the conjunctive match action on Logical Flows, resulting in a big performance improvement, as was reported here.

Encrypting your connections with stunnel

stunnel is open source software that provides SSL/TLS tunneling. It is especially useful for protecting existing client-server communications that do not provide any encryption at all. Another use is to avoid exposing many services individually: all of them pass through the tunnel, which secures all the traffic at once.

Since I have a WR703N with an OpenVPN server installed, I decided to set up stunnel and give it a try. The advantage over using my existing VPN, under certain circumstances, is that establishing the secure tunnel looks pretty much like a normal connection to an HTTPS website, so most networks/proxies will allow this traffic whilst the VPN might be blocked (especially if UDP is used). So the OpenVPN+stunnel combo looks like a pretty good security solution to install on our OpenWRT device.

The stunnel service is configured for mutual TLS (client and server authentication) and disables SSLv2, SSLv3 and TLSv1.0, so that in practice connections negotiate TLSv1.2. These are the relevant lines in stunnel.conf (server side):

; protocol version (all, SSLv2, SSLv3, TLSv1)
sslVersion = all
options = CIPHER_SERVER_PREFERENCE
options = NO_SSLv2
options = NO_SSLv3
options = NO_TLSv1
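For comparison, the policy described here (mutual authentication, with TLSv1.2 as the oldest accepted version) can be expressed with Python’s standard `ssl` module. This is only an illustration of the policy, not how stunnel itself is configured, and the certificate paths in the comments are placeholders.

```python
import ssl

# Server-side context: a sketch of the policy, not a stunnel replacement.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# Refuse anything older than TLSv1.2.
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Mutual TLS: the client must present a certificate signed by a trusted CA.
ctx.verify_mode = ssl.CERT_REQUIRED

# Prefer the server's cipher ordering, like CIPHER_SERVER_PREFERENCE.
ctx.options |= ssl.OP_CIPHER_SERVER_PREFERENCE

# In a real deployment the key material would be loaded as well, e.g.:
# ctx.load_cert_chain("server.pem", "server.key")   # placeholder paths
# ctx.load_verify_locations("ca.pem")               # placeholder path

print(ctx.minimum_version, ctx.verify_mode)
```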

Just for testing, I installed stunnel on a Windows box and configured it as a client (with a client certificate signed by the same CA as the server). Connections to server port 443 are forwarded to the SSH service running on the server side. This allows us to SSH into our server without needing to expose it and, for example, set up a SOCKS proxy and browse the internet securely through the tunnel.

stunnel Diagram

Client side:

[https]
accept  = 22
protocol = connect
connect = proxy:8080
protocolHost= server:443

Server side:

[https]
accept  = 443
connect = 22
TIMEOUTclose = 0

On the client side, simply SSH to localhost on the configured port (22); stunnel will intercept this connection and establish a TLS tunnel with the server, reaching the SSH service running on it.
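Because the client section uses `protocol = connect`, stunnel first asks the HTTP proxy to open a tunnel to `server:443`, and only then starts the TLS handshake. A minimal sketch of that proxy exchange, with hypothetical helper names and without the Proxy-Authorization header, could look like this:

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build the HTTP CONNECT request a client sends to the proxy."""
    target = "%s:%d" % (host, port)
    lines = [
        "CONNECT %s HTTP/1.1" % target,
        "Host: %s" % target,
        "",   # the empty line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def proxy_accepted(status_line: str) -> bool:
    """True if the proxy answered with a 2xx status."""
    parts = status_line.split(None, 2)
    return len(parts) >= 2 and parts[1].startswith("2")


print(build_connect_request("server", 443).decode("ascii"))
print(proxy_accepted("HTTP/1.1 200 Connection established"))
```

Once the proxy answers with a 2xx status, the TCP connection behaves like a direct pipe to the server, which is why the TLS session looks like ordinary HTTPS traffic to the network.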

These are the logs on the client side when SSH’ing localhost:

2016.07.20 21:37:09 LOG7[12]: Service [https] started
2016.07.20 21:37:09 LOG5[12]: Service [https] accepted connection from 127.0.0.1:43858
2016.07.20 21:37:09 LOG6[12]: s_connect: connecting proxy:8080
2016.07.20 21:37:09 LOG7[12]: s_connect: s_poll_wait proxy:8080: waiting 10 seconds
2016.07.20 21:37:09 LOG5[12]: s_connect: connected proxy:8080
2016.07.20 21:37:09 LOG5[12]: Service [https] connected remote server from x.x.x.x:43859
2016.07.20 21:37:09 LOG7[12]: Remote descriptor (FD=732) initialized
2016.07.20 21:37:09 LOG7[12]:  -> CONNECT server:443 HTTP/1.1
2016.07.20 21:37:09 LOG7[12]:  -> Host: server:443
2016.07.20 21:37:09 LOG7[12]:  -> Proxy-Authorization: basic **
2016.07.20 21:37:09 LOG7[12]:  -> 
2016.07.20 21:37:09 LOG7[12]:  <- HTTP/1.1 200 Connection established
2016.07.20 21:37:09 LOG6[12]: CONNECT request accepted
2016.07.20 21:37:09 LOG7[12]:  <- 
2016.07.20 21:37:09 LOG6[12]: SNI: sending servername: server
2016.07.20 21:37:09 LOG7[12]: SSL state (connect): before/connect initialization
2016.07.20 21:37:09 LOG7[12]: SSL state (connect): SSLv3 write client hello A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 read server hello A
2016.07.20 21:37:11 LOG7[12]: Verification started at depth=1: C=ES, ST=M, O=O, CN=wrtServer
2016.07.20 21:37:11 LOG7[12]: CERT: Pre-verification succeeded
2016.07.20 21:37:11 LOG6[12]: Certificate accepted at depth=1: C=ES, ST=M, O=O, CN=wrtServer
2016.07.20 21:37:11 LOG7[12]: Verification started at depth=0: C=ES, ST=S, O=O, CN=wrtClient
2016.07.20 21:37:11 LOG7[12]: CERT: Pre-verification succeeded
2016.07.20 21:37:11 LOG5[12]: Certificate accepted at depth=0: C=ES, ST=S, O=O, CN=wrtClient
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 read server certificate A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 read server key exchange A
2016.07.20 21:37:11 LOG6[12]: Client CA: C=ES, ST=M, O=O, CN=wrtCA
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 read server certificate request A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 read server done A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 write client certificate A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 write client key exchange A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 write certificate verify A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 write change cipher spec A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 write finished A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 flush data
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 read server session ticket A
2016.07.20 21:37:11 LOG7[12]: SSL state (connect): SSLv3 read finished A
2016.07.20 21:37:11 LOG7[12]:      8 client connect(s) requested
2016.07.20 21:37:11 LOG7[12]:      7 client connect(s) succeeded
2016.07.20 21:37:11 LOG7[12]:      0 client renegotiation(s) requested
2016.07.20 21:37:11 LOG7[12]:      2 session reuse(s)
2016.07.20 21:37:11 LOG6[12]: SSL connected: new session negotiated
2016.07.20 21:37:11 LOG7[12]: Deallocating application specific data for addr index
2016.07.20 21:37:11 LOG6[12]: Negotiated TLSv1.2 ciphersuite ECDHE-RSA-AES256-GCM-SHA384 (256-bit encryption)

As you can see, the traffic is routed through a TLSv1.2 channel encrypted with AES256 in GCM mode, and the session key has been derived using ephemeral ECDH, which provides Perfect Forward Secrecy. The traffic is therefore fairly well protected, at least up to the stunnel server.
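If you want to check this automatically rather than by eye, the negotiated parameters can be pulled from the log line. The sketch below assumes the log format shown above; the function name is hypothetical, and the forward secrecy test relies on the OpenSSL naming convention for TLSv1.2 suites, where ephemeral key exchange shows up as an `ECDHE-` or `DHE-` prefix.

```python
import re

NEGOTIATED = re.compile(r"Negotiated (\S+) ciphersuite (\S+)")


def check_negotiation(log_line: str):
    """Return (protocol, cipher, forward_secret) parsed from a stunnel log line."""
    m = NEGOTIATED.search(log_line)
    if m is None:
        return None
    protocol, cipher = m.groups()
    # Ephemeral (EC)DH key exchange is what provides forward secrecy.
    forward_secret = cipher.startswith(("ECDHE-", "DHE-"))
    return protocol, cipher, forward_secret


line = ("2016.07.20 21:37:11 LOG6[12]: Negotiated TLSv1.2 ciphersuite "
        "ECDHE-RSA-AES256-GCM-SHA384 (256-bit encryption)")
print(check_negotiation(line))
```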

Make sure to keep an eye on the vulnerabilities listed on the stunnel website and have the server properly patched.

Building RPM packages

I wanted to learn how to build an RPM package out of a Python module so, now that I’m playing a bit with OpenStack, I decided to pick a log merger for OpenStack files and build the corresponding package on my Fedora 24 machine.

The first thing is to set up the distribution with the right packages:

[root@localhost ~]$ dnf install @development-tools fedora-packager
[dani@localhost ~]$ rpmdev-setuptree
[dani@localhost ~]$ ls rpmbuild/
BUILD  BUILDROOT  RPMS  SOURCES  SPECS  SRPMS

Now, under the SPECS directory, we need to create the spec file which will include the necessary info to build the RPM:

%global srcname os-log-merger
%global	sum	OpenStack Log Merger

Name:		python-%{srcname}
Version:	1.0.6
Release:	1%{?dist}
Summary:	%{sum}

License:	Apache
URL:		https://github.com/mangelajo/os-log-merger/
Source:         https://pypi.python.org/packages/6f/f1/b2a46907086c29725dd0e2296d6f45e7965670a05b43626abc1c81a098a0/os-log-merger-%{version}.tar.gz

BuildRoot:      %{_tmppath}/%{srcname}-%{version}-build
BuildArch:	noarch
BuildRequires:	python2

%description
A tool designed to take a bunch of openstack logs across different projects, and merge them in a single file, ordered by time entries

%package -n %{srcname}
Summary:	%{sum}
%{?python_provide:%python_provide python2-%{srcname}}

%description -n %{srcname}
A tool designed to take a bunch of openstack logs across different projects, and merge them in a single file, ordered by time entries

%prep
%autosetup -n %{srcname}-%{version}

%install
%py2_install

%check
%{__python2} setup.py test

%files -n %{srcname}
#%license LICENSE
%doc README.rst
%{python2_sitelib}/*
%{_bindir}/os-log-merger
%{_bindir}/oslogmerger
%{_bindir}/netprobe

%changelog
* Tue Jul 19 2016 dani - 1.0.6-1
- First version of the os-log-merger-package

Once the file is created, it’s time to build the RPM package:

[dani@localhost SPECS]$ rpmbuild -bb os-log-merger.spec 
....
+ umask 022
+ cd /home/dani/rpmbuild/BUILD
+ cd os-log-merger-1.0.6
+ /usr/bin/rm -rf /home/dani/rpmbuild/BUILDROOT/python-os-log-merger-1.0.6-1.fc24.x86_64
+ exit 0
[dani@localhost SPECS]$ ls -alh ../RPMS/noarch/
total 44K
drwxr-xr-x. 2 dani dani 4,0K jul 19 20:35 .
drwxr-xr-x. 3 dani dani 4,0K jul 19 20:35 ..
-rw-rw-r--. 1 dani dani  34K jul 19 20:47 os-log-merger-1.0.6-1.fc24.noarch.rpm
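Note that the binary RPM is named os-log-merger rather than python-os-log-merger: it comes from the `%package -n %{srcname}` subpackage, while the `Name:` tag only shows up in the source RPM name. A small sketch of how the file names are put together (the helper is hypothetical, and `.fc24` is the value of `%{dist}` on this Fedora 24 system):

```python
def rpm_filename(name, version, release, arch):
    """RPM file name: <name>-<version>-<release>.<arch>.rpm"""
    return "%s-%s-%s.%s.rpm" % (name, version, release, arch)


srcname = "os-log-merger"
version = "1.0.6"
release = "1" + ".fc24"   # Release: 1%{?dist}

# Binary package built from "%package -n %{srcname}"
print(rpm_filename(srcname, version, release, "noarch"))

# Source package named after "Name: python-%{srcname}"
print(rpm_filename("python-" + srcname, version, release, "src"))
```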

We can see that the rpmbuild command produced the RPM file inside ~/rpmbuild/RPMS/noarch. Let’s pull the info from it and check whether it’s correct:

[dani@localhost SPECS]$ rpm -qip ../RPMS/noarch/os-log-merger-1.0.6-1.fc24.noarch.rpm 
Name        : os-log-merger
Version     : 1.0.6
Release     : 1.fc24
Architecture: noarch
Install Date: (not installed)
Group       : Unspecified
Size        : 85356
License     : Apache
Signature   : (none)
Source RPM  : python-os-log-merger-1.0.6-1.fc24.src.rpm
Build Date  : mar 19 jul 2016 20:47:42 CEST
Build Host  : localhost
Relocations : (not relocatable)
URL         : https://github.com/mangelajo/os-log-merger/
Summary     : OpenStack Log Merger
Description :
A tool designed to take a bunch of openstack logs across different projects, and merge them in a single file, ordered by time entries
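If you build packages often, this check can be scripted. The following is a rough sketch that parses the `Key : value` header of this output into a dictionary. The function name is hypothetical, and the sample text is a shortened copy of the output above.

```python
def parse_rpm_info(text: str) -> dict:
    """Parse the 'Key : value' header of `rpm -qi` output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs and dates stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "Description":
            break   # free-form text follows; stop parsing fields
        info[key] = value.strip()
    return info


sample = """\
Name        : os-log-merger
Version     : 1.0.6
Release     : 1.fc24
Architecture: noarch
License     : Apache
URL         : https://github.com/mangelajo/os-log-merger/
Description :
A tool designed to take a bunch of openstack logs
"""

info = parse_rpm_info(sample)
print(info["Name"], info["Version"], info["Architecture"], info["URL"])
```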

The last step is to install the actual file and run the module to see if everything went fine:

[root@localhost noarch]$ rpm -qa | grep os-log-merger
[root@localhost noarch]$ rpm -i os-log-merger-1.0.6-1.fc24.noarch.rpm 
[root@localhost noarch]$ oslogmerger 
usage: oslogmerger [-h] [-v] [--log-base  LOG_BASE]
                   [--log-postfix  LOG_POSTFIX] [--alias-level ALIAS_LEVEL]
                   [--min-memory] [--msg-logs file[:ALIAS] [file[:ALIAS] ...]]
                   [--timestamp-logs file[:ALIAS] [file[:ALIAS] ...]]
                   log_file[:ALIAS] [log_file[:ALIAS] ...]

References:
https://fedoraproject.org/wiki/How_to_create_a_GNU_Hello_RPM_package
https://fedoraproject.org/wiki/Packaging:Python