Tagnetworking

OVN: Where is my packet?

Working with rather complex OpenFlow pipelines as in the case of OVN can be tricky when it comes to debugging. Being able to tell where a packet gets dropped may not be the easiest task ever but fortunately, this is getting a bit more friendly thanks to the available tools around OVS and OVN these days.

In this post, I'll show a recent troubleshooting process that I did where the symptom was that an ICMP Neighbor Advertisement packet sent out by a guest VM was not arriving at its destination. It turned out to be a bug in ovn-controller that has already been fixed so it will hopefully illustrate some of the available tools and techniques to debug OVN through a real world example.

Let's start by showing a failing ping to an OVN destination and we'll build up from here:

# ping 2001:db8::f816:3eff:fe1e:5a0f -c2
PING 2001:db8::f816:3eff:fe1e:5a0f(2001:db8::f816:3eff:fe1e:5a0f) 56 data bytes
From f00d:f00d:f00d:f00d:f00d:f00d:f00d:9: icmp_seq=1 Destination unreachable: Address unreachable
From f00d:f00d:f00d:f00d:f00d:f00d:f00d:9: icmp_seq=2 Destination unreachable: Address unreachable

--- 2001:db8::f816:3eff:fe1e:5a0f ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 61ms

In the destination hypervisor we can see the Neighbor Solicitation packets but no answer to them:

$ tcpdump -i br-ex -ne "icmp6" -c2

13:11:18.350789 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

13:11:19.378648 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

The first step is to determine whether the actual VM is responding with Neighbor Advertisement packets by examining the traffic on the tap interface:

$ tcpdump -veni tapdf8a80f8-0c icmp6 -c2

15:10:26.862344 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: (hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2001:db8::f816:3eff:fe1e:5a0f
          source link-address option (1), length 8 (1): b6:c9:b9:6e:50:48

15:10:26.865489 fa:16:3e:1e:5a:0f > b6:c9:b9:6e:50:48, ethertype IPv6 (0x86dd), length 86: (hlim 255, next-header ICMPv6 (58) payload length: 32) 2001:db8::f816:3eff:fe1e:5a0f > fe80::b4c9:b9ff:fe6e:5048: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is 2001:db8::f816:3eff:fe1e:5a0f, Flags [solicited, override]
          destination link-address option (2), length 8 (1): fa:16:3e:1e:5a:0f

At this point we know that the Neighbor Advertisement (NA) packets are sent by the VM but dropped somewhere in the OVN integration bridge, which makes this task about inspecting Logical and Physical flows to find out what's causing this.

Let's begin by checking the Logical flows to see if, from the OVN database standpoint, all is correctly configured. For this, we need to collect all the info that we are going to pass to the ovn-trace tool to simulate the NA packet originated from the VM such as:

  • datapath: OVN Logical Switch name or ID
  • inport: OVN port name corresponding to the VM
  • eth.src & eth.dst: source and destination MAC addresses (we can copy them from the tcpdump output above)
  • ip6.src: the IPv6 address of the VM
  • nd_target: The IPv6 address that is advertised
# ovn-nbctl show
[...]
switch 0309ffff-7c89-427e-a68e-cd87b0658005 (neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873) (aka provider-flat)
    port provnet-64e303a9-af62-4833-b713-6b361cdd6ecd
        type: localnet
        addresses: ["unknown"]
    port 916eb7ea-4d64-4d9c-be28-b693af2e7ed3
        type: localport
        addresses: ["fa:16:3e:65:76:bd 172.24.100.2 2001:db8::f816:3eff:fe65:76bd"]
    port df8a80f8-0cbf-48af-8b32-78377f034797 (aka vm-provider-flat-port)
        addresses: ["fa:16:3e:1e:5a:0f 2001:db8::f816:3eff:fe1e:5a0f"]
    port f1d49ca3-0e9c-4387-8d5d-c833fc3b7943
        type: router
        router-port: lrp-f1d49ca3-0e9c-4387-8d5d-c833fc3b7943

With all this info we are ready to run ovn-trace:

$ ovn-trace --summary neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873 'inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == fa:16:3e:1e:5a:0f && eth.dst == b6:c9:b9:6e:50:48 && ip6.src == 2001:db8::f816:3eff:fe1e:5a0f  && nd.target == 2001:db8::f816:3eff:fe1e:5a0f && nd_na && nd.tll == fa:16:3e:1e:5a:0f'
# icmp6,reg14=0x4,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=::,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
ingress(dp="provider-flat", inport="vm-provider-flat-port") {
    next;
    next;
    next;
    next;
    next;
    reg0[8] = 1;
    reg0[9] = 1;
    next;
    next;
    outport = "_MC_unknown";
    output;
    multicast(dp="provider-flat", mcgroup="_MC_unknown") {
        egress(dp="provider-flat", inport="vm-provider-flat-port", outport="provnet-64e303") {
            next;
            next;
            reg0[8] = 1;
            reg0[9] = 1;
            next;
            next;
            output;
            /* output to "provnet-64e303", type "localnet" */;
        };
    };
};

As we can see from the ovn-trace output above, the packet should be delivered to the localnet port as per the OVN DB contents, meaning that we should have been able to see it with tcpdump. Since we know that the packet was not present in the bridge, the next step is to figure out where it is being dropped in the OpenFlow pipeline by inspecting the Physical flows.

First, let's capture a live packet (filter by ICMP6 and Neighbor Advertisement):

$ flow=$(tcpdump -nXXi tapdf8a80f8-0c "icmp6 && ip6[40] = 136 && host 2001:db8::f816:3eff:fe1e:5a0f"  -c1 | ovs-tcpundump)

1 packet captured
1 packet received by filter
0 packets dropped by kernel

$ echo $flow
b6c9b96e5048fa163e1e5a0f86dd6000000000203aff20010db800000000f8163efffe1e5a0ffe80000000000000b4c9b9fffe6e504888004d626000000020010db800000000f8163efffe1e5a0f0201fa163e1e5a0f

Now, let's feed this packet into ovs-appctl to see the execution of the OVS pipeline:

$ ovs-appctl ofproto/trace br-int in_port=`ovs-vsctl get Interface tapdf8a80f8-0c ofport` $flow

Flow: icmp6,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

bridge("br-int")
----------------
 0. in_port=9, priority 100, cookie 0x146e7f9c
    set_field:0x1->reg13
    set_field:0x3->reg11
    set_field:0x2->reg12
    set_field:0x7->metadata
    set_field:0x4->reg14
    resubmit(,8)
 8. reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f, priority 50, cookie 0x195ddd1b
    resubmit(,9)
 9. ipv6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f, priority 90, cookie 0x1c05660e
    resubmit(,10)
10. icmp6,reg14=0x4,metadata=0x7,nw_ttl=255,icmp_type=136,icmp_code=0, priority 80, cookie 0xd3c58704
    drop

Final flow: icmp6,reg11=0x3,reg12=0x2,reg13=0x1,reg14=0x4,metadata=0x7,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Megaflow: recirc_id=0,eth,icmp6,in_port=9,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::/16,nw_ttl=255,nw_frag=no,icmp_type=0x88/0xff,icmp_code=0x0/0xff,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

We see that the packet is dropped in the table 10. Also, we can feed this output into ovn-detrace for a more friendly output which allows us to match each OpenFlow rule to OVN logical flows and tables:

$ ovs-appctl ofproto/trace br-int in_port=9 $flow | ovn-detrace --ovnsb="tcp:99.88.88.88:6642" --ovnnb="tcp:99.88.88.88:6641"

Flow: icmp6,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

bridge("br-int")
----------------
0. in_port=9, priority 100, cookie 0x146e7f9c
set_field:0x1->reg13
set_field:0x3->reg11
set_field:0x2->reg12
set_field:0x7->metadata
set_field:0x4->reg14
resubmit(,8)
  *  Logical datapath: "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204)
  *  Port Binding: logical_port "df8a80f8-0cbf-48af-8b32-78377f034797", tunnel_key 4, chassis-name "4bcf0b65-8d0c-4dde-9783-7ccb23fe3627", chassis-str "cmp-3-1.bgp.ftw"
8. reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f, priority 50, cookie 0x195ddd1b
resubmit(,9)
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=0 (ls_in_port_sec_l2), priority=50, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == {fa:16:3e:1e:5a:0f}), actions=(next;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']
9. ipv6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f, priority 90, cookie 0x1c05660e
resubmit(,10)
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=1 (ls_in_port_sec_ip), priority=90, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == fa:16:3e:1e:5a:0f && ip6.src == {fe80::f816:3eff:fe1e:5a0f, 2001:db8::f816:3eff:fe1e:5a0f}), actions=(next;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']
10. icmp6,reg14=0x4,metadata=0x7,nw_ttl=255,icmp_type=136,icmp_code=0, priority 80, cookie 0xd3c58704
drop
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=2 (ls_in_port_sec_nd), priority=80, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && (arp || nd)), actions=(drop;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']

Final flow: icmp6,reg11=0x3,reg12=0x2,reg13=0x1,reg14=0x4,metadata=0x7,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Megaflow: recirc_id=0,eth,icmp6,in_port=9,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::/16,nw_ttl=255,nw_frag=no,icmp_type=0x88/0xff,icmp_code=0x0/0xff,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Datapath actions: drop

The packet is explicitly dropped in table 10 (OVN Logical Switch table 2 (ls_in_port_sec_nd) because no other higher priority flow in that table was matched previously. Let's inspect the relevant OVN Logical flows in such table:

_uuid               : e58f66b6-07f0-4f31-bd6b-97ea609ac3fb
actions             : "next;"
external_ids        : {source="ovn-northd.c:4455", stage-hint=bc1dbbdf, stage-name=ls_in_port_sec_nd}
logical_datapath    : 67b33068-8baa-4e45-9447-61352f8d5204
logical_dp_group    : []
match               : "inport == \"df8a80f8-0cbf-48af-8b32-78377f034797\" && eth.src == fa:16:3e:1e:5a:0f && ip6 && nd && ((nd.sll == 00:00:00:00:00:00 || nd.sll == fa:16:3e:1e:5a:0f) || ((nd.tll == 00:00:00:00:00:00 || nd.tll == fa:16:3e:1e:5a:0f) && (nd.target == fe80::f816:3eff:fe1e:5a0f || nd.target == 2001:db8::f816:3eff:fe1e:5a0f)))"
pipeline            : ingress
priority            : 90
table_id            : 2



_uuid               : d3c58704-c6be-4876-b023-2ec7ae870bde
actions             : "drop;"
external_ids        : {source="ovn-northd.c:4462", stage-hint=bc1dbbdf, stage-name=ls_in_port_sec_nd}
logical_datapath    : 67b33068-8baa-4e45-9447-61352f8d5204
logical_dp_group    : []
match               : "inport == \"df8a80f8-0cbf-48af-8b32-78377f034797\" && (arp || nd)"
pipeline            : ingress
priority            : 80
table_id            : 2

The first flow above should have matched, it has higher priority than the drop and all the fields seem correct. Let's take a look at the actual OpenFlow rules installed by ovn-controller matching on some fields of the logical flow such as the uuid/cookie or the nd_target: 

$ ovs-ofctl dump-flows br-int |grep cookie=0xe58f66b6 | grep icmp_type=136
 cookie=0xe58f66b6, duration=20946.048s, table=10, n_packets=0, n_bytes=0, idle_age=20946, priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)

$ ovs-ofctl dump-flows br-int table=10 | grep priority=90 | grep nd_target=2001:db8::f816:3eff:fe1e:5a0f
 cookie=0x0, duration=21035.890s, table=10, n_packets=0, n_bytes=0, idle_age=21035, priority=90,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f actions=conjunction(3,1/2)

At first glance, the fact that the packet counter is 0 when all the fields seem correct (src/dest IP addresses, ICMP6 type, ...) looks suspicious. The conjunctive flows are key here. On one hand, the first flow is referencing conj_id=19 while the second flow is referencing conj_id=3 (the notation for conjunction(3,1/2) means that the action to be executed is the first out of 2 clauses for the conjunction id 3). However, there's no such conjunction and hence, the packet will not hit these flows:

$ ovs-ofctl dump-flows br-int | grep conj_id=3 -c
0

If our assumptions are correct, we could rewrite the first flow so that the conjunction id matches the expected value (3). For this, we'll stop first ovn-controller, delete the flow and add the new one:

$ ovs-ofctl del-flows --strict br-int table=10,priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0

$ ovs-ofctl add-flow br-int "cookie=0xe58f66b6,table=10,priority=90,conj_id=3,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0,actions=resubmit(,11)"

At this point we can test the ping and see if it works and the new flow gets hit:

$ tcpdump -ni tapdf8a80f8-0c

11:00:24.975037 IP6 fe80::b4c9:b9ff:fe6e:5048 > 2001:db8::f816:3eff:fe1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

11:00:24.975544 IP6 2001:db8::f816:3eff:fe1e:5a0f > fe80::b4c9:b9ff:fe6e:5048: ICMP6, neighbor advertisement, tgt is 2001:db8::f816:3eff:fe1e:5a0f, length 24

11:00:25.831695 IP6 f00d:f00d:f00d:4::1 > 2001:db8::f816:3eff:fe1e:5a0f: ICMP6, echo request, seq 55, length 64

11:00:25.832182 IP6 2001:db8::f816:3eff:fe1e:5a0f > f00d:f00d:f00d:4::1: ICMP6, echo reply, seq 55, length 64

$ sudo ovs-ofctl dump-flows br-int| grep cookie=0xe58f66b6  | grep conj_id

 cookie=0xe58f66b6, duration=5.278s, table=10, n_packets=2, n_bytes=172, idle_age=4, priority=90,conj_id=3,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)

As I mentioned earlier, this issue is now solved upstream and you can check the fix here. With this patch, the conjunction ids will be properly generated and we'll no longer hit this bug. 

Most of the issues that we'll face in OVN will be likely related to misconfigurations (missing ACLs, port security, ...) and the use of tools like ovn-trace will help us spot them. However, when tackling legitimate bugs in the core OVN side, there's no easy and defined way to find them but luckily we have some tools at hand that, at the very least, will help us providing a good bug report that will be promptly picked and solved by the very active OVN community. 

Debugging scaling issues on OVN

One of the great things of OVN is that North/South routing can happen locally in the hypervisor when a 'dnat_and_snat' rule is used (aka Floating IP) without having to go through a central or network node.

The way it works is that when outgoing traffic reaches the OVN bridge, the source IP address is changed by the Floating IP (snat) and pushed out to the external network. Similarly, when the packet comes in, the Floating IP address is changed (dnat) by that of the virtual machine in the private network.

In the OpenStack world, this is called DVR ("Distributed Virtual Routing") as the routing doesn't need to traverse any central node and happens on compute nodes meaning no extra hops, no overlay traffic and distributed routing processing.

The main advantage is that if all your workloads have a Floating IP and a lot of N/S traffic, the cloud can be well dimensioned and it's very scalable (no need to scale any network/gateway nodes as you scale out the number of computes and less dependency on the control plane). The drawback is that you'll need to consume IP addresses from the FIP pool. Yeah, it couldn't be all good news :p

All this introduction to say that during some testing on an OVN cluster with lots of Floating IPs, we noticed that the amount of Logical Flows was *huge* and that led to numerous problems related to a very high CPU and memory consumption on both server (ovsdb-server) and client (ovn-controller) sides.

I wanted to understand how the flows were distributed and what was the main contributor(s) to this explosion. What I did is simply count the number of flows on every stage and sorting them. This showed that 93% of all the Logical Flows were in two stages:

$ head -n 6 logical_flows_distribution_sorted.txt
lr_out_egr_loop: 423414  62.24%
lr_in_ip_routing: 212199  31.19%
lr_in_ip_input: 10831  1.59%
ls_out_acl: 4831  0.71%
ls_in_port_sec_ip: 3471  0.51%
ls_in_l2_lkup: 2360  0.34%

Here's the simple commands that I used to figure out the flow distribution:

# ovn-sbctl list Logical_Flow > logical_flows.txt

# Retrieve all the stages in the current pipeline
$ grep ^external_ids logical_flows.txt | sed 's/.*stage-name=//' | tr -d '}' | sort | uniq

# Count how many flows on each stage
$ while read stage; do echo $stage: $(grep $stage logical_flows.txt -c); done < stage_names.txt  > logical_flows_distribution.txt

$ sort  -k 2 -g -r logical_flows_distribution.txt  > logical_flows_distribution_sorted.txt

Next step would be to understand what's in those two tables (lr_out_egr_loop & lr_in_ip_routing):

_uuid               : e1cc600a-fb9c-4968-a124-b0f78ed8139f
actions             : "next;"
external_ids        : {source="ovn-northd.c:8958", stage-name=lr_out_egr_loop}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "ip4.src == 172.24.4.10 &amp;&amp; ip4.dst == 172.24.4.209"
pipeline            : egress
priority            : 200
table_id            : 2
hash                : 0

_uuid               : c8d8400a-590e-4b7e-b433-7a1491d31488
actions             : "inport = outport; outport = \"\"; flags = 0; flags.loopback = 1; reg9[1] = 1; next(pipeline=ingress, table=0); "
external_ids        : {source="ovn-northd.c:8950", stage-name=lr_out_egr_loop}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "is_chassis_resident(\"vm1\") &amp;&amp; ip4.src == 172.24.4.218 &amp;&amp; ip4.dst == 172.24.4.220"
pipeline            : egress
priority            : 300
table_id            : 2
hash                : 0
_uuid               : 0777b005-0ff0-40cb-8532-f7e2261dae06
actions             : "outport = \"router1-public\"; eth.src = 40:44:00:00:00:06; eth.dst = 40:44:00:00:00:07; reg0 = ip4.dst; reg1 = 172.24.4.218; reg9[2] = 1; reg9[0] = 0; ne
xt;"
external_ids        : {source="ovn-northd.c:6945", stage-name=lr_in_ip_routing}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "inport == \"router1-net1\" &amp;&amp; ip4.src == 192.168.0.11 &amp;&amp; ip4.dst == 172.24.4.226"
pipeline            : ingress
priority            : 400
table_id            : 9
hash                : 0

Turns out that those flows are intended to handle inter-FIP communication. Basically, there are flows for every possible FIP pair so that the traffic doesn't flow through a Geneve tunnel.

While FIP-to-FIP traffic between two OVN ports is not perhaps the most common use case, those flows are there to handle it that way that the traffic would be distributed and never sent through the overlay network.

A git blame on the code that generates those flows will show the commits [1][2] and some background on the actual issue.

With the results above, one would expect a quadratic growth but still, it's always nice to pull some graphs 🙂

 

And again, the simple script that I used to get the numbers in the graph:

ovn-nbctl clear logical_router router1 nat
for i in {1..200}; do
  ovn-nbctl lr-nat-add router1 dnat_and_snat 172.24.4.$i 192.168.0.11 vm1 40:44:00:00:00:07
  # Allow some time to northd to generate the new lflows.
  ovn-sbctl list logical_flow > /dev/null
  ovn-sbctl list logical_flow > /tmp/lflows.txt
  S1=$(grep lr_out_egr_loop -c /tmp/lflows.txt )
  S2=$(grep lr_in_ip_routing -c /tmp/lflows.txt )
  echo $S1 $S2
done

Soon after I reported this scaling issue with the above findings, my colleague Numan did an amazing job fixing it with this patch. The results are amazing and for the same test scenario with 200 FIPs, the total amount of lflows dropped from ~127K to ~2.7K and most importantly, from an exponential to a linear growth.

 

Since the Logical Flows are represented in ASCII, they are also quite expensive in processing due to the string parsing, and not very cheap to transmit in the OVSDB protocol. This has been a great leap when it comes to scaling scaling environments with a heavy N/S traffic and lots of Floating IPs.

OVN Cluster Interconnection

A new feature has been recently introduced in OVN that allows multiple clusters to be interconnected at L3 level (here's a link to the series of patches). This can be useful for scenarios with multiple availability zones (or physical regions) or simply to allow better scaling by having independent control planes yet allowing connectivity between workloads in separate zones.

Simplifying things, logical routers on each cluster can be connected via transit overlay networks. The interconnection layer is responsible for creating the transit switches in the IC database that will become visible to the connected clusters. Each cluster can then connect their logical routers to the transit switches. More information can be found in the ovn-architecture manpage.

I created a vagrant setup to test it out and become a bit familiar with it. All you need to do to recreate it is cloning and running 'vagrant up' inside the ovn-interconnection folder:

https://github.com/danalsan/vagrants/tree/master/ovn-interconnection

This will deploy 7 CentOS machines (300MB of RAM each) with two separate OVN clusters (west & east) and the interconnection services. The layout is described in the image below:

Once the services are up and running, a few resources will be created on each cluster and the interconnection services will be configured with a transit switch between them:

Let's see, for example, the logical topology of the east availability zone, where the transit switch ts1 is listed along with the port in the west remote zone:

[root@central-east ~]# ovn-nbctl show
switch c850599c-263c-431b-b67f-13f4eab7a2d1 (ts1)
    port lsp-ts1-router_west
        type: remote
        addresses: ["aa:aa:aa:aa:aa:02 169.254.100.2/24"]
    port lsp-ts1-router_east
        type: router
        router-port: lrp-router_east-ts1
switch 8361d0e1-b23e-40a6-bd78-ea79b5717d7b (net_east)
    port net_east-router_east
        type: router
        router-port: router_east-net_east
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.1.11"]
router b27d180d-669c-4ca8-ac95-82a822da2730 (router_east)
    port lrp-router_east-ts1
        mac: "aa:aa:aa:aa:aa:01"
        networks: ["169.254.100.1/24"]
        gateway chassis: [gw_east]
    port router_east-net_east
        mac: "40:44:00:00:00:04"
        networks: ["192.168.1.1/24"]

As for the Southbound database, we can see the gateway port for each router. In this setup I only have one gateway node but, as any other distributed gateway port in OVN, it could be scheduled in multiple nodes providing HA

[root@central-east ~]# ovn-sbctl show
Chassis worker_east
    hostname: worker-east
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding vm1
Chassis gw_east
    hostname: gw-east
    Encap geneve
        ip: "192.168.50.102"
        options: {csum="true"}
    Port_Binding cr-lrp-router_east-ts1
Chassis gw_west
    hostname: gw-west
    Encap geneve
        ip: "192.168.50.103"
        options: {csum="true"}
    Port_Binding lsp-ts1-router_west

If we query the interconnection databases, we will see the transit switch in the NB and the gateway ports in each zone:

[root@central-ic ~]# ovn-ic-nbctl show
Transit_Switch ts1

[root@central-ic ~]# ovn-ic-sbctl show
availability-zone east
    gateway gw_east
        hostname: gw-east
        type: geneve
            ip: 192.168.50.102
        port lsp-ts1-router_east
            transit switch: ts1
            address: ["aa:aa:aa:aa:aa:01 169.254.100.1/24"]
availability-zone west
    gateway gw_west
        hostname: gw-west
        type: geneve
            ip: 192.168.50.103
        port lsp-ts1-router_west
            transit switch: ts1
            address: ["aa:aa:aa:aa:aa:02 169.254.100.2/24"]

With this topology, traffic flowing from vm1 to vm2 shall flow from gw-east to gw-west through a Geneve tunnel. If we list the ports in each gateway we should be able to see the tunnel ports. Needless to say, gateways have to be mutually reachable so that the transit overlay network can be established:

[root@gw-west ~]# ovs-vsctl show
6386b867-a3c2-4888-8709-dacd6e2a7ea5
    Bridge br-int
        fail_mode: secure
        Port ovn-gw_eas-0
            Interface ovn-gw_eas-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.102"}

Now, when vm1 pings vm2, the traffic flow should be like:

(vm1) worker_east ==== gw_east ==== gw_west ==== worker_west (vm2).

Let's see it via ovn-trace tool:

[root@central-east vagrant]# ovn-trace  --ovs --friendly-names --ct=new net_east  'inport == "vm1" &amp;&amp; eth.src == 40:44:00:00:00:01 &amp;&amp; eth.dst == 40:44:00:00:00:04 &amp;&amp; ip4.src == 192.168.1.11 &amp;&amp; ip4.dst == 192.168.2.12 &amp;&amp; ip.ttl == 64 &amp;&amp; icmp4.type == 8'


ingress(dp="net_east", inport="vm1")
...
egress(dp="net_east", inport="vm1", outport="net_east-router_east")
...
ingress(dp="router_east", inport="router_east-net_east")
...
egress(dp="router_east", inport="router_east-net_east", outport="lrp-router_east-ts1")
...
ingress(dp="ts1", inport="lsp-ts1-router_east")
...
egress(dp="ts1", inport="lsp-ts1-router_east", outport="lsp-ts1-router_west")
 9. ls_out_port_sec_l2 (ovn-northd.c:4543): outport == "lsp-ts1-router_west", priority 50, uuid c354da11
    output;
    /* output to "lsp-ts1-router_west", type "remote" */

Now let's capture Geneve traffic on both gateways while a ping between both VMs is running:

[root@gw-east ~]# tcpdump -i genev_sys_6081 -vvnee icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
10:43:35.355772 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 11379, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 40, length 64
10:43:35.356077 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 11379, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 40, length 64
10:43:35.356442 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 42610, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 40, length 64
10:43:35.356734 40:44:00:00:00:04 > 40:44:00:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 42610, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 40, length 64


[root@gw-west ~]# tcpdump -i genev_sys_6081 -vvnee icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
10:43:29.169532 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 8875, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 34, length 64
10:43:29.170058 40:44:00:00:00:10 > 40:44:00:00:00:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 8875, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 34, length 64
10:43:29.170308 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38667, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 34, length 64
10:43:29.170476 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38667, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 34, length 64

You can observe that the ICMP traffic flows between the transit switch ports (aa:aa:aa:aa:aa:02 <> aa:aa:aa:aa:aa:01) traversing both zones.

Also, as the packet has gone through two routers (router_east and router_west), the TTL at the destination has been decremented twice (from 64 to 62):

[root@worker-west ~]# ip net e vm2 tcpdump -i any icmp -vvne
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
10:49:32.491674  In 40:44:00:00:00:10 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 57504, offset 0, flags [DF], proto ICMP (1), length 84)

This is a really great feature that opens a lot of possibilities for cluster interconnection and scaling. However, it has to be taken into account that it requires another layer of management that handles isolation (multitenancy) and avoids IP overlapping across the connected availability zones.

OVN Routing and ovn-trace

In the last post we covered how OVN distributes the pipeline execution across the different hypervisors thanks to the encapsulation used. Using the same system as a base, let's create a Logical Router and a second Logical Switch where we can see how routing works between different nodes.

 

ovn-nbctl ls-add network1
ovn-nbctl ls-add network2
ovn-nbctl lsp-add network1 vm1
ovn-nbctl lsp-add network2 vm2
ovn-nbctl lsp-set-addresses vm1 "40:44:00:00:00:01 192.168.0.11"
ovn-nbctl lsp-set-addresses vm2 "40:44:00:00:00:02 192.168.1.11"

ovn-nbctl lr-add router1
ovn-nbctl lrp-add router1 router1-net1 40:44:00:00:00:03 192.168.0.1/24
ovn-nbctl lsp-add network1 net1-router1
ovn-nbctl lsp-set-addresses net1-router1 40:44:00:00:00:03
ovn-nbctl lsp-set-type net1-router1 router
ovn-nbctl lsp-set-options net1-router1 router-port=router1-net1

ovn-nbctl lrp-add router1 router1-net2 40:44:00:00:00:04 192.168.1.1/24
ovn-nbctl lsp-add network2 net2-router1
ovn-nbctl lsp-set-addresses net2-router1 40:44:00:00:00:04
ovn-nbctl lsp-set-type net2-router1 router
ovn-nbctl lsp-set-options net2-router1 router-port=router1-net2

And then on Worker1 and Worker2 respectively, let's bind VM1 and VM2 ports inside network namespaces:

# Worker1
ovs-vsctl add-port br-int vm1 -- set Interface vm1 type=internal -- set Interface vm1 external_ids:iface-id=vm1
ip netns add vm1
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 40:44:00:00:00:01
ip netns exec vm1 ip addr add 192.168.0.11/24 dev vm1
ip netns exec vm1 ip link set vm1 up
ip netns exec vm1 ip route add default via 192.168.0.1

# Worker2
ovs-vsctl add-port br-int vm2 -- set Interface vm2 type=internal -- set Interface vm2 external_ids:iface-id=vm2
ip netns add vm2
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 40:44:00:00:00:02
ip netns exec vm2 ip addr add 192.168.1.11/24 dev vm2
ip netns exec vm2 ip link set vm2 up
ip netns exec vm2 ip route add default via 192.168.1.1

First, let's check connectivity between two VMs through the Logical Router:

[root@worker1 ~]# ip netns exec vm1 ping 192.168.1.11 -c2
PING 192.168.1.11 (192.168.1.11) 56(84) bytes of data.
64 bytes from 192.168.1.11: icmp_seq=1 ttl=63 time=0.371 ms
64 bytes from 192.168.1.11: icmp_seq=2 ttl=63 time=0.398 ms

--- 192.168.1.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.371/0.384/0.398/0.023 ms

ovn-trace

Let's now check with ovn-trace how the flow of a packet from VM1 to VM2 looks like and then we'll verify how this gets realized in our physical deployment with 3 hypervisors: 

[root@central ~]# ovn-trace --summary network1 'inport == "vm1" &amp;&amp; eth.src == 40:44:00:00:00:01 &amp;&amp; eth.dst == 40:44:00:00:00:03 &amp;&amp; ip4.src == 192.168.0.11 &amp;&amp; ip4.dst == 192.168.1.11 &amp;&amp; ip.ttl == 64'
# ip,reg14=0x1,vlan_tci=0x0000,dl_src=40:44:00:00:00:01,dl_dst=40:44:00:00:00:03,nw_src=192.168.0.11,nw_dst=192.168.1.11,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=64
ingress(dp="network1", inport="vm1") {
    next;
    outport = "net1-router1";
    output;
    egress(dp="network1", inport="vm1", outport="net1-router1") {
        output;
        /* output to "net1-router1", type "patch" */;
        ingress(dp="router1", inport="router1-net1") {
            next;
            ip.ttl--;
            reg0 = ip4.dst;
            reg1 = 192.168.1.1;
            eth.src = 40:44:00:00:00:04;
            outport = "router1-net2";
            flags.loopback = 1;
            next;
            eth.dst = 40:44:00:00:00:02;
            next;
            output;
            egress(dp="router1", inport="router1-net1", outport="router1-net2") {
                output;
                /* output to "router1-net2", type "patch" */;
                ingress(dp="network2", inport="net2-router1") {
                    next;
                    outport = "vm2";
                    output;
                    egress(dp="network2", inport="net2-router1", outport="vm2") {
                        output;
                        /* output to "vm2", type "" */;
                    };
                };
            };
        };
    };
};

As you can see, the packet goes through 3 OVN datapaths: network1, router1 and network2. As VM1 is on Worker1 and VM2 is on Worker2, the packet will traverse the tunnel between both hypervisors and thanks to the Geneve encapsulation, the execution pipeline will be distributed:

  • Worker1 will execute both ingress and egress pipelines of network1
  • Worker1 will also perform the routing (lines 10 to 28 above) and the ingress pipeline for network2. Then the packet will be pushed to Worker2 via the tunnel.
  • Worker2 will execute the egress pipeline for network2 delivering the packet to VM2.

Let's launch a ping from VM1 and check the Geneve traffic on Worker2: 

52:54:00:13:e0:a2 > 52:54:00:ac:67:5b, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 46587, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.100.55145 > 192.168.50.101.6081: [bad udp cksum 0xe6a5 -> 0xb087!] Geneve, Flags [C],
vni 0x8,
proto TEB (0x6558),
options [class Open Virtual Networking (OVN) (0x102)
type 0x80(C) len 8
data 00020001]

From what we learnt in the previous post, when the packet arrives to Worker2, the egress pipeline of Datapath 8 (VNI) will be executed being ingress port = 2 and egress port = 1. Let's see if it matches what ovn-trace gave us earlier:

egress(dp="network2", inport="net2-router1", outport="vm2") 

[root@central ~]# ovn-sbctl get Datapath_Binding network2 tunnel_key
8
[root@central ~]# ovn-sbctl get Port_Binding net2-router1 tunnel-key
2
[root@central ~]# ovn-sbctl get Port_Binding vm2 tunnel-key
1

What about the reply from VM2 to VM1? If we check the tunnel traffic on Worker2 we'll see the ICMP echo reply packets with different datapath and ingress/egress ports:

22:19:03.896340 52:54:00:ac:67:5b > 52:54:00:13:e0:a2, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 30538, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.101.13530 > 192.168.50.100.6081: [bad udp cksum 0xe6a5 -> 0x5419!] Geneve, Flags [C],
vni 0x7,
proto TEB (0x6558),
options [class Open Virtual Networking (OVN) (0x102)
type 0x80(C) len 8
data 00020001]


[root@central ~]# ovn-sbctl get Datapath_Binding network1 tunnel_key
7
[root@central ~]# ovn-sbctl get Port_Binding net1-router1 tunnel-key
2
[root@central ~]# ovn-sbctl get Port_Binding vm1 tunnel-key
1

As we can see, the routing happens locally in the source node; there's no need to send the packet to any central/network node as the East/West routing is always distributed with OVN. In the case of North/South traffic, where SNAT is required, traffic will go to the gateway node but we'll cover this in coming posts.

OpenFlow analysis

Until now, we just focused on the Logical elements but ovn-controller will ultimately translate the Logical Flows into physical flows on the OVN bridge.  Let's inspect the ports that our bridge has on Worker2:

[root@worker2 ~]# ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000404400000002
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(ovn-centra-0): addr:ea:64:c2:a9:86:fe
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(ovn-worker-0): addr:de:96:64:2b:21:4a
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 8(vm2): addr:40:44:00:00:00:02
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:40:44:00:00:00:02
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

We have 3 ports in the OVN bridge:

  1. ovn-centra-0: Tunnel port with Central node
  2. ovn-worker-0: Tunnel port with Worker1
  3. vm2: Our fake virtual machine that we bound to Worker2

When ICMP echo request packets are coming from VM1, we expect them to be arriving via port 2 and being delivered to VM2 on port 8. We can check this by inspecting the flows at table 0 (input) and table 65 (output) for br-int. For a more detailed information of the OpenFlow tables, see ovn-architecture(7) document, section "Architectural Physical Life Cycle of a Packet".

[root@worker2 ~]# ovs-ofctl dump-flows br-int table=0 | grep idle_age='[0-1],'
cookie=0x0, duration=3845.572s, table=0, n_packets=2979, n_bytes=291942, idle_age=0, priority=100,in_port=2 actions=move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],resubmit(,33)


[root@worker2 ~]# ovs-ofctl dump-flows br-int table=65 | grep idle_age='[0-1],'
 cookie=0x0, duration=3887.336s, table=65, n_packets=3062, n_bytes=297780, idle_age=0, priority=100,reg15=0x1,metadata=0x8 actions=output:8

I was filtering the dump-flows output to display only those flows with an idle_age of just 0 or 1 seconds. This means that they've been recently hit so if we're inspecting tables with lots of flows, we would filter away unwanted flows to focus just on the ones that we're really interested. Please, note that for this particular example, I've set the VM2 link down to see only incoming ICMP packets from VM1 and hide the replies.

Let's inspect the input flow on table 0 where the following actions happen:

actions=
move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],
move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],
move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],
resubmit(,33)

These actions will:

  • Copy the 24 bits of the VNI (Tunnel ID / Logical Datapath) to the OpenFlow metadata.
  • Copy bits 16-30 of the tunnel data (Ingress port) to the OpenvSwitch register 14
  • Copy bits 0-15 of the tunnel data (Egress port) to the OpenvSwitch register 15.

From the ovn-architecture document, we can read:

logical datapath field
A field that denotes the logical datapath through which a
packet is being processed. OVN uses the field that Open‐
Flow 1.1+ simply (and confusingly) calls "metadata" to
store the logical datapath. (This field is passed across
tunnels as part of the tunnel key.)

logical input port field
A field that denotes the logical port from which the
packet entered the logical datapath. OVN stores this in
Open vSwitch extension register number 14.

Geneve and STT tunnels pass this field as part of the
tunnel key. Although VXLAN tunnels do not explicitly
carry a logical input port, OVN only uses VXLAN to commu‐
nicate with gateways that from OVN’s perspective consist
of only a single logical port, so that OVN can set the
logical input port field to this one on ingress to the
OVN logical pipeline.

logical output port field
A field that denotes the logical port from which the
packet will leave the logical datapath. This is initial‐
ized to 0 at the beginning of the logical ingress pipe‐
line. OVN stores this in Open vSwitch extension register
number 15.

Geneve and STT tunnels pass this field as part of the
tunnel key. VXLAN tunnels do not transmit the logical
output port field. Since VXLAN tunnels do not carry a
logical output port field in the tunnel key, when a
packet is received from VXLAN tunnel by an OVN hypervi‐
sor, the packet is resubmitted to table 8 to determine
the output port(s); when the packet reaches table 32,
these packets are resubmitted to table 33 for local de‐
livery by checking a MLF_RCV_FROM_VXLAN flag, which is
set when the packet arrives from a VXLAN tunnel. 

Similarly, we can see that in the output action from table 65, the OpenFlow rule is matching on "reg15=0x1,metadata=0x8" when it outputs the packet to the OpenFlow port number 8 (VM2). As we saw earlier, metadata 8 corresponds to network2 Logical Switch and the output port (reg15) 1 corresponds to the VM2 logical port (tunnel_key 1).

 

Due to its distributed nature and everything being OpenFlow based, it may be a little bit tricky to trace packets when implementing Software Defined Networking with OVN compared to traditional solutions (iptables, network namespaces, ...). Knowing the above concepts and getting familiar with the tools is key to debug OVN systems in an effective way.

OVN - Geneve Encapsulation

In the last post we created a Logical Switch with two ports residing on different hypervisors. Communication between those two ports took place over the tunnel interface using Geneve encapsulation. Let's now take a closer look at this overlay traffic.

Without diving too much into the packet processing in OVN, we need to know that each Logical Datapath (Logical Switch / Logical Router) has an ingress and an egress pipeline. Whenever a packet comes in, the ingress pipeline is executed and after the output action, the egress pipeline will run to deliver the packet to its destination. More info here: http://docs.openvswitch.org/en/latest/faq/ovn/#ovn

In our scenario, when we ping from VM1 to VM2, the ingress pipeline of each ICMP packet runs on Worker1 (where VM1 is bound to) and the packet is pushed to the tunnel interface to Worker2 (where VM2 resides). When Worker2 receives the packet on its physical interface, the egress pipeline of the Logical Switch (network1) is executed to deliver the packet to VM2. But ... How does OVN know where the packet comes from and which Logical Datapath should process it? This is where the metadata in the Geneve headers comes in.

Let's get back to our setup and ping from VM1 to VM2 and capture traffic on the physical interface (eth1) of Worker2:

[root@worker2 ~]# sudo tcpdump -i eth1 -vvvnnexx

17:02:13.403229 52:54:00:13:e0:a2 > 52:54:00:ac:67:5b, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 63920, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.100.7549 > 192.168.50.101.6081: [bad udp cksum 0xe6a5 -> 0x7177!] Geneve, Flags [C], vni 0x1, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00010002]
        40:44:00:00:00:01 > 40:44:00:00:00:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 41968, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 6897, length 64
        0x0000:  5254 00ac 675b 5254 0013 e0a2 0800 4500
        0x0010:  008e f9b0 4000 4011 5a94 c0a8 3264 c0a8
        0x0020:  3265 1d7d 17c1 007a e6a5 0240 6558 0000
        0x0030:  0100 0102 8001 0001 0002 4044 0000 0002
        0x0040:  4044 0000 0001 0800 4500 0054 a3f0 4000
        0x0050:  4001 1551 c0a8 000b c0a8 000c 0800 c67b
        0x0060:  04e3 1af1 94d9 6e5c 0000 0000 41a7 0e00
        0x0070:  0000 0000 1011 1213 1415 1617 1819 1a1b
        0x0080:  1c1d 1e1f 2021 2223 2425 2627 2829 2a2b
        0x0090:  2c2d 2e2f 3031 3233 3435 3637

17:02:13.403268 52:54:00:ac:67:5b > 52:54:00:13:e0:a2, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 46181, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.101.9683 > 192.168.50.100.6081: [bad udp cksum 0xe6a5 -> 0x6921!] Geneve, Flags [C], vni 0x1, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00020001]
        40:44:00:00:00:02 > 40:44:00:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 16422, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.0.12 > 192.168.0.11: ICMP echo reply, id 1251, seq 6897, length 64
        0x0000:  5254 0013 e0a2 5254 00ac 675b 0800 4500
        0x0010:  008e b465 4000 4011 9fdf c0a8 3265 c0a8
        0x0020:  3264 25d3 17c1 007a e6a5 0240 6558 0000
        0x0030:  0100 0102 8001 0002 0001 4044 0000 0001
        0x0040:  4044 0000 0002 0800 4500 0054 4026 0000
        0x0050:  4001 b91b c0a8 000c c0a8 000b 0000 ce7b
        0x0060:  04e3 1af1 94d9 6e5c 0000 0000 41a7 0e00
        0x0070:  0000 0000 1011 1213 1415 1617 1819 1a1b
        0x0080:  1c1d 1e1f 2021 2223 2425 2627 2829 2a2b
        0x0090:  2c2d 2e2f 3031 3233 3435 3637

Let's now decode the ICMP request packet (I'm using this tool):

ICMP request inside the Geneve tunnel

Metadata

 

In the ovn-architecture(7) document, you can check how the Metadata is used in OVN in the Tunnel Encapsulations section. In short, OVN encodes the following information in the Geneve packets:

  • Logical Datapath (switch/router) identifier (24 bits) - Geneve VNI
  • Ingress and Egress port identifiers - Option with class 0x0102 and type 0x80 with 32 bits of data:
         1       15          16
       +---+------------+-----------+
       |rsv|ingress port|egress port|
       +---+------------+-----------+
         0

Back to our example: VNI = 0x000001 and Option Data = 00010002, so from the above:

Logical Datapath = 1   Ingress Port = 1   Egress Port = 2

Let's take a look at SB database contents to see if they match what we expect:

[root@central ~]# ovn-sbctl get Datapath_Binding network1 tunnel-key
1

[root@central ~]# ovn-sbctl get Port_Binding vm1 tunnel-key
1

[root@central ~]# ovn-sbctl get Port_Binding vm2 tunnel-key
2

We can see that the Logical Datapath belongs to network1, that the ingress port is vm1 and that the output port is vm2 which makes sense as we're analyzing the ICMP request from VM1 to VM2. 

By the time this packet hits Worker2 hypervisor, OVN has all the information to process the packet on the right pipeline and deliver the port to VM2 without having to run the ingress pipeline again.

What if we don't use any encapsulation?

This is technically possible in OVN and there's such scenarios like in the case where we're managing a physical network directly and won't use any kind of overlay technology. In this case, our ICMP request packet would've been pushed directly to the network and when Worker2 receives the packet, OVN needs to figure out (based on the IP/MAC addresses) which ingress pipeline to execute (twice, as it was also executed by Worker1) before it can go to the egress pipeline and deliver the packet to VM2.

Multinode OVN setup

As a follow up from the last post, we are now going to deploy a 3 nodes OVN setup to demonstrate basic L2 communication across different hypervisors. This is the physical topology and how services are distributed:

  • Central node: ovn-northd and ovsdb-servers (North and Southbound databases) as well as ovn-controller
  • Worker1 / Worker2: ovn-controller connected to Central node Southbound ovsdb-server (TCP port 6642)

In order to deploy the 3 machines, I'm using Vagrant + libvirt and you can checkout the Vagrant files and scripts used from this link. After running 'vagrant up', we'll have 3 nodes with OVS/OVN installed from sources and we should be able to log in to the central node and verify that OVN is up and running and Geneve tunnels have been established to both workers:

 

[vagrant@central ~]$ sudo ovs-vsctl show
f38658f5-4438-4917-8b51-3bb30146877a
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-worker-1"
            Interface "ovn-worker-1"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.101"}
        Port "ovn-worker-0"
            Interface "ovn-worker-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.100"}
    ovs_version: "2.11.90"

 

For demonstration purposes, we're going to create a Logical Switch (network1) and two Logical Ports (vm1 and vm2). Then we're going to bind VM1 to Worker1 and VM2 to Worker2. If everything works as expected, we would be able to communicate both Logical Ports through the overlay network formed between both workers nodes.

We can run the following commands on any node to create the logical topology (please, note that if we run them on Worker1 or Worker2, we need to specify the NB database location by running ovn-nbctl with "--db=tcp:192.168.50.10:6641" as 6641 is the default port for NB database):

ovn-nbctl ls-add network1
ovn-nbctl lsp-add network1 vm1
ovn-nbctl lsp-add network1 vm2
ovn-nbctl lsp-set-addresses vm1 "40:44:00:00:00:01 192.168.0.11"
ovn-nbctl lsp-set-addresses vm2 "40:44:00:00:00:02 192.168.0.12"

And now let's check the Northbound and Southbound databases contents. As we didn't bind any port to the workers yet, "ovn-sbctl show" command should only list the chassis (or hosts in OVN jargon):

[root@central ~]# ovn-nbctl show
switch a51334e8-f77d-4d85-b01a-e547220eb3ff (network1)
    port vm2
        addresses: ["40:44:00:00:00:02 192.168.0.12"]
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.0.11"]

[root@central ~]# ovn-sbctl show
Chassis "worker2"
    hostname: "worker2"
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
Chassis central
    hostname: central
    Encap geneve
        ip: "127.0.0.1"
        options: {csum="true"}
Chassis "worker1"
    hostname: "worker1"
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}

Now we're going to bind VM1 to Worker1:

ovs-vsctl add-port br-int vm1 -- set Interface vm1 type=internal -- set Interface vm1 external_ids:iface-id=vm1
ip netns add vm1
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 40:44:00:00:00:01
ip netns exec vm1 ip addr add 192.168.0.11/24 dev vm1
ip netns exec vm1 ip link set vm1 up

And VM2 to Worker2:

ovs-vsctl add-port br-int vm2 -- set Interface vm2 type=internal -- set Interface vm2 external_ids:iface-id=vm2
ip netns add vm2
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 40:44:00:00:00:02
ip netns exec vm2 ip addr add 192.168.0.12/24 dev vm2
ip netns exec vm2 ip link set vm2 up

Checking again the Southbound database, we should see the port binding status:

[root@central ~]# ovn-sbctl show
Chassis "worker2"
    hostname: "worker2"
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding "vm2"
Chassis central
    hostname: central
    Encap geneve
        ip: "127.0.0.1"
        options: {csum="true"}
Chassis "worker1"
    hostname: "worker1"
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding "vm1"

Now let's check connectivity between VM1 (Worker1) and VM2 (Worker2):

[root@worker1 ~]# ip netns exec vm1 ping 192.168.0.12 -c2
PING 192.168.0.12 (192.168.0.12) 56(84) bytes of data.
64 bytes from 192.168.0.12: icmp_seq=1 ttl=64 time=0.416 ms
64 bytes from 192.168.0.12: icmp_seq=2 ttl=64 time=0.307 ms

--- 192.168.0.12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.307/0.361/0.416/0.057 ms


[root@worker2 ~]# ip netns exec vm2 ping 192.168.0.11 -c2
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.825 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.275 ms

--- 192.168.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.275/0.550/0.825/0.275 ms

As both ports are located in different hypervisors, OVN is pushing the traffic via the overlay Geneve tunnel from Worker1 to Worker2. In the next post, we'll analyze the Geneve encapsulation and how OVN uses its metadata internally.

For now, let's ping from VM1 to VM2 and just capture traffic on the geneve interface on Worker2 to verify that ICMP packets are coming through the tunnel:

[root@worker2 ~]# tcpdump -i genev_sys_6081 -vvnn icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
15:07:42.395318 IP (tos 0x0, ttl 64, id 45147, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 26, length 64
15:07:42.395383 IP (tos 0x0, ttl 64, id 39282, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.0.12 > 192.168.0.11: ICMP echo reply, id 1251, seq 26, length 64
15:07:43.395221 IP (tos 0x0, ttl 64, id 45612, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.12: ICMP echo request, id 1251, seq 27, length 64

In coming posts we'll cover Geneve encapsulation as well as OVN pipelines and L3 connectivity.