Tagcloud

OVN external ports

Some time back a new type of virtual port was introduced in OVN: external ports. The initial motivation was to support two main use cases in OpenStack:

  1. SR-IOV:  For this type of workloads, the VM will bypass the hypervisor by accessing the physical NIC directly on the host. This means that the traffic sent out by the VM which requires a controller action (such as DHCP requests or IGMP traffic) will not hit the local OVN bridge and will be missed. You can read more about SR-IOV in this excellent blog post.
  2. Baremetal provisioning: Similarly, when provisioning a baremetal server, DHCP requests for PXE booting will be missed as those ports are not bound in OVN so even they're sent in broadcast, all the ovn-controller instances in the cluster will ignore them.

For both cases we needed to have a way in OVN to process those requests coming from ports that are not bound to the local hypervisor and the solution was to implement the OVN external ports.

Testing scenario

From a pure OVN perspective, I created the following setup that will hopefully help understand the details and how to troubleshoot this type of ports.

  • Two private networks:
    • network1: 192.168.0.0/24 of type VLAN and ID 190
    • network2: 192.168.1.0/24 of type VLAN and ID 170
  • Two provider networks:
    • external: 172.24.14.0/24 used for Floating IP traffic
    • tenant: used by network1 and network2 VLAN networks
  • One Logical router that connects both private networks and the Floating IP network
    • router1-net1: "40:44:00:00:00:03" "192.168.0.1/24"
    • router1-net2: "40:44:33:00:00:05" "192.168.1.1/24"
    • router1-public: "40:44:00:00:00:04" "172.24.14.1/24"
  • 4 Virtual Machines (simulated with network namespaces), 2 on each network
    • vm1: addresses: ["40:44:00:00:00:01 192.168.0.11"]
    • vm2: addresses: ["40:44:00:00:00:02 192.168.0.12"]
    • vm3: addresses: ["40:44:33:00:00:03 192.168.1.13"]
    • pext: addresses: ["40:44:44:00:00:10 192.168.1.111"]

The physical layout involves the following nodes:

  • Worker nodes:
    • They have one NIC connected to the Geneve overlay network and another NIC on the provider network using the same OVS bridge (br-ex) for the flat external network and for both VLAN tenant networks.
  • Gateway nodes:
    • Same network configuration as the worker nodes 
    • gw1: which hosts the router gateway port for the Logical Router
    • gw2: which hosts the external port (pext)
  • Host
    • This server is just another machine which has access to the provider networks and no OVN/OVS components are running on it.
    • To illustrate different scenarios, during the provisioning phase, it'll be configured with a network namespace connected to an OVS bridge and a VLAN device on the network2 (VLAN 170) with the MAC address of the external port.

A full vagrant setup can be found here that will deploy and configure the above setup for you in less than 10 minutes 🙂

The configuration of the external port in the OVN NorthBound database can be found here:

 

# Create the external port
ovn-nbctl lsp-add network2 pext
ovn-nbctl lsp-set-addresses pext \
          "40:44:44:00:00:10 192.168.1.111"
ovn-nbctl lsp-set-type pext external

# Schedule the external port in gw2
ovn-nbctl --id=@ha_chassis  create HA_Chassis \
          chassis_name=gw2 priority=1 -- \
          --id=@ha_chassis_group create \
          HA_Chassis_Group name=default2 \
          ha_chassis=[@ha_chassis] -- \
          set Logical_Switch_Port pext \
          ha_chassis_group=@ha_chassis_group

DHCP

When using a regular OVN port, the DHCP request from the VM hits the integration bridge and, via a controller action, is served by the local ovn-controller instance. Also, the DHCP request never leaves the hypervisor.

However, when it comes to external ports, the broadcast request will be processed by the chassis where the port is scheduled and the ovn-controller instance running there will be responsible of serving DHCP for it.

Let's get to our setup and issue the request from host1:

[root@host1 vagrant] ip netns exec pext dhclient -v -i pext.170 --no-pid
Listening on LPF/pext.170/40:44:44:00:00:10
Sending on   LPF/pext.170/40:44:44:00:00:10
Sending on   Socket/fallback
DHCPREQUEST on pext.170 to 255.255.255.255 port 67 (xid=0x5149c1a3)
DHCPACK from 192.168.1.1 (xid=0x56bf40c1)
bound to 192.168.1.111 -- renewal in 1667 seconds.

Since we scheduled the external port on the gw2, we'd expect the request being handled there which we can check by inspecting the relevant logs:

[root@gw2 ~] tail -f /usr/var/log/ovn/ovn-controller.log
2020-09-08T09:01:52.547Z|00007|pinctrl(ovn_pinctrl0)
|INFO|DHCPACK 40:44:44:00:00:10 192.168.1.111

Below you can see the flows installed to handle DHCP in the gw2 node:

table=22, priority=100, udp, reg14=0x2, metadata=0x3, dl_src=40:44:44:00:00:10, nw_src=192.168.1.111, nw_dst=255.255.255.255, tp_src=68, tp_dst=67 actions=controller(...

table=22, priority=100, udp, reg14=0x2, metadata=0x3, dl_src=40:44:44:00:00:10, nw_src=0.0.0.0, nw_dst=255.255.255.255, tp_src=68, tp_dst=67 actions=controller(...

table=22, priority=100, udp, reg14=0x2, metadata=0x3, dl_src=40:44:44:00:00:10, nw_src=192.168.1.111, nw_dst=192.168.1.1, tp_src=68, tp_dst=67 actions=controller(...

Regular L2 traffic path

This is the easy path. As the external port MAC/IP addresses are known to OVN, whenever the packet arrives to the destination chassis via a localnet port they'll be processed normally and delivered to the output port. No extra hops are observed:

Let's ping from the external port pext to vm3 which are in the same network. The expected path is for the ICMP packets to exit host1 from eth1 tagged with VLAN 170 and reach worker2 on eth2:

[root@host1 vagrant]# ip netns exec pext ping 192.168.1.13 -c2
PING 192.168.1.13 (192.168.1.13) 56(84) bytes of data.
64 bytes from 192.168.1.13: icmp_seq=1 ttl=64 time=0.331 ms
64 bytes from 192.168.1.13: icmp_seq=2 ttl=64 time=0.299 ms

--- 192.168.1.13 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.299/0.315/0.331/0.016 ms

Traffic from host1 arrives directly with the destination MAC address of the vm3 port and the reply is received from that same MAC:

[root@host1 ~]# tcpdump -i eth1 -vvnee
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
14:45:22.919136 40:44:44:00:00:10 > 40:44:33:00:00:03, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 14603, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.1.13: ICMP echo request, id 19425, seq 469, length 64


14:45:22.919460 40:44:33:00:00:03 > 40:44:44:00:00:10, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 13721, offset 0, flags [none], proto ICMP (1), length 84)
192.168.1.13 > 192.168.1.111: ICMP echo reply, id 19425, seq 469, length 64

At worker2, we see the traffic coming in from the eth2 NIC on the network2 VLAN 170:

[root@worker2 ~]# tcpdump -i eth2 -vvne
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
14:43:33.009370 40:44:44:00:00:10 > 40:44:33:00:00:03, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 22650, offset 0, flags [DF], proto ICMP (1), length 84)


192.168.1.111 > 192.168.1.13: ICMP echo request, id 19425, seq 360, length 64
14:43:33.009472 40:44:33:00:00:03 > 40:44:44:00:00:10, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 23154, offset 0, flags [none], proto ICMP (1), length 84)
192.168.1.13 > 192.168.1.111: ICMP echo reply, id 19425, seq 360, length 64

Routed traffic path

In this example, we're going to ping from the external port pext to vm1 (192.168.0.11) located in worker1. Now, the packet will exit the host1 with the destination MAC address of the router port (192.168.1.1). Since the traffic is originated by pext and it is bound to the gateway gw2, the routing will happen there.

 

The router pipeline, and network1 and 2 pipelines will run in the gw2 and, from here, the packet goes out already to the network2 (VLAN 190) to worker1.

 

  • Ping from external port pext to vm1

[root@host1 ~]# ip netns exec pext ping 192.168.0.11 -c2
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=63 time=2.67 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=63 time=0.429 ms

--- 192.168.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.429/1.553/2.678/1.125 ms

[root@host1 ~]# ip net e pext ip neigh
192.168.1.1 dev pext.170 lladdr 40:44:33:00:00:05 REACHABLE

[root@host1 ~]# tcpdump -i eth1 -vvnee
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:18:52.285722 40:44:44:00:00:10 > 40:44:33:00:00:05, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 15048, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 1257, seq 6, length 64

  • Packet arrives to gw2 for routing:

[root@gw2 ~]# tcpdump -i eth2 -vvne icmp
tcpdump: listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
16:19:40.901874 40:44:44:00:00:10 > 40:44:33:00:00:05, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 41726, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 1257, seq 56, length 64

  • And from gw2, the packet is sent to the destination vm vm1 on the VLAN 190 network:

12:31:13.737551 40:44:44:00:00:10 > 40:44:33:00:00:05, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 35405, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 6103, seq 5455, length 64
12:31:13.737583 1e:02:ad:bb:aa:cc > 40:44:00:00:00:01, ethertype 802.1Q (0x8100), length 102: vlan 190, p 0, ethertype IPv4, (tos 0x0, ttl 63, id 35405, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 6103, seq 5455, length 64

  • At worker1, the packet is delivered to the tap port of vm1 and this will reply to the ping

[root@worker1 ~]# ip netns exec vm1 tcpdump -i vm1 -vvne icmp
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:21:38.561881 40:44:00:00:00:03 > 40:44:00:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 32180, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.111 > 192.168.0.11: ICMP echo request, id 1278, seq 18, length 64
16:21:38.561925 40:44:00:00:00:01 > 40:44:00:00:00:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 25498, offset 0, flags [none], proto ICMP (1), length 84)
192.168.0.11 > 192.168.1.111: ICMP echo reply, id 1278, seq 18, length 64

  • The ICMP echo reply packet will be sent directly to pext through the localnet port to the physical network tagged with VLAN 170:

[root@host1 ~]# tcpdump -vvnne -i eth1 icmp
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
12:35:00.322735 1e:02:ad:bb:aa:77 > 40:44:44:00:00:10, ethertype 802.1Q (0x8100), length 102: vlan 170, p 0, ethertype IPv4, (tos 0x0, ttl 63, id 18741, offset 0, flags [none], proto ICMP (1), length 84)
192.168.0.11 > 192.168.1.111: ICMP echo reply, id 6103, seq 5681, length 644

You might have noticed that the source MAC address (1e:02:ad:bb:aa:77) in the last step doesn't correspond to any OVN logical port. This is because, in order for the routing to be distributed, OVN rewrites this MAC address to that of external_ids:ovn-chassis-mac-mappings. In our case, this tell us as well that the traffic is coming directly from worker1 saving the extra hop that we see for the pext -> vm1 path via gw2.

[root@worker1 ~]# ovs-vsctl get open . external_ids:ovn-chassis-mac-mappings
"tenant:1e:02:ad:bb:aa:77"

Floating IP traffic path

The FIP traffic path (for non Distributed Virtual Routing - DVR -  case) is similar to the case that we just described except for the fact that since we require to traverse the distributed gateway port, our traffic will take an extra hop.

Example of ping from pext to the Floating IP of vm1 (172.24.14.100):

  1. The packet goes from host1 to gw2 via the VLAN 170 network with the destination MAC address of the network2 router port.
  2. From gw2, the traffic will be steered to the gw1 node which is hosting the distributed gateway port. This traffic is sent out via the Geneve overlay tunnel.
  3. gw1 will perform the routing and send the traffic to the FIP of vm1 via the public flat network (172.24.14.0/24) and src IP that of the SNAT (172.24.14.1).
  4. The request arrives to worker1 where ovn-controller un-NATs the packet to the vm1 (192.168.0.11) and delivers it to the tap interface. The reply from the vm1 is NATed to the FIP (172.14.14.100) and sent back to the router at gw1 via the flat network.
  5. gw1 will perform the routing to the network2 and push the reply packet directly onto the network2 with VLAN 190 which will be received at host1.

OpenStack

For use cases like SR-IOV workloads, OpenStack is responsible for creating an OVN port of type 'external' in the NorthBound database and also its location, ie., which chassis are going to claim the port and install the relevant flows to, for example, serve DHCP.

This is done through a HA Chassis Group where the OVN Neutron plugin adds all the external ports to and, usually, the controller nodes belong to it. This way if one node goes down, all the external ports will be served from the next highest priority chassis in the group.

Information about how OpenStack handles this can be found here.

Future uses of OVN external ports include baremetal provisioning but as of this writing we lack some features like PXE chaining in OVN.

 

Debugging scaling issues on OVN

One of the great things of OVN is that North/South routing can happen locally in the hypervisor when a 'dnat_and_snat' rule is used (aka Floating IP) without having to go through a central or network node.

The way it works is that when outgoing traffic reaches the OVN bridge, the source IP address is changed by the Floating IP (snat) and pushed out to the external network. Similarly, when the packet comes in, the Floating IP address is changed (dnat) by that of the virtual machine in the private network.

In the OpenStack world, this is called DVR ("Distributed Virtual Routing") as the routing doesn't need to traverse any central node and happens on compute nodes meaning no extra hops, no overlay traffic and distributed routing processing.

The main advantage is that if all your workloads have a Floating IP and a lot of N/S traffic, the cloud can be well dimensioned and it's very scalable (no need to scale any network/gateway nodes as you scale out the number of computes and less dependency on the control plane). The drawback is that you'll need to consume IP addresses from the FIP pool. Yeah, it couldn't be all good news :p

All this introduction to say that during some testing on an OVN cluster with lots of Floating IPs, we noticed that the amount of Logical Flows was *huge* and that led to numerous problems related to a very high CPU and memory consumption on both server (ovsdb-server) and client (ovn-controller) sides.

I wanted to understand how the flows were distributed and what was the main contributor(s) to this explosion. What I did is simply count the number of flows on every stage and sorting them. This showed that 93% of all the Logical Flows were in two stages:

$ head -n 6 logical_flows_distribution_sorted.txt
lr_out_egr_loop: 423414  62.24%
lr_in_ip_routing: 212199  31.19%
lr_in_ip_input: 10831  1.59%
ls_out_acl: 4831  0.71%
ls_in_port_sec_ip: 3471  0.51%
ls_in_l2_lkup: 2360  0.34%

Here's the simple commands that I used to figure out the flow distribution:

# ovn-sbctl list Logical_Flow > logical_flows.txt

# Retrieve all the stages in the current pipeline
$ grep ^external_ids logical_flows.txt | sed 's/.*stage-name=//' | tr -d '}' | sort | uniq

# Count how many flows on each stage
$ while read stage; do echo $stage: $(grep $stage logical_flows.txt -c); done < stage_names.txt  > logical_flows_distribution.txt

$ sort  -k 2 -g -r logical_flows_distribution.txt  > logical_flows_distribution_sorted.txt

Next step would be to understand what's in those two tables (lr_out_egr_loop & lr_in_ip_routing):

_uuid               : e1cc600a-fb9c-4968-a124-b0f78ed8139f
actions             : "next;"
external_ids        : {source="ovn-northd.c:8958", stage-name=lr_out_egr_loop}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "ip4.src == 172.24.4.10 &amp;&amp; ip4.dst == 172.24.4.209"
pipeline            : egress
priority            : 200
table_id            : 2
hash                : 0

_uuid               : c8d8400a-590e-4b7e-b433-7a1491d31488
actions             : "inport = outport; outport = \"\"; flags = 0; flags.loopback = 1; reg9[1] = 1; next(pipeline=ingress, table=0); "
external_ids        : {source="ovn-northd.c:8950", stage-name=lr_out_egr_loop}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "is_chassis_resident(\"vm1\") &amp;&amp; ip4.src == 172.24.4.218 &amp;&amp; ip4.dst == 172.24.4.220"
pipeline            : egress
priority            : 300
table_id            : 2
hash                : 0
_uuid               : 0777b005-0ff0-40cb-8532-f7e2261dae06
actions             : "outport = \"router1-public\"; eth.src = 40:44:00:00:00:06; eth.dst = 40:44:00:00:00:07; reg0 = ip4.dst; reg1 = 172.24.4.218; reg9[2] = 1; reg9[0] = 0; ne
xt;"
external_ids        : {source="ovn-northd.c:6945", stage-name=lr_in_ip_routing}
logical_datapath    : 9cd315f4-1033-4f71-a26e-045a379aebe8
match               : "inport == \"router1-net1\" &amp;&amp; ip4.src == 192.168.0.11 &amp;&amp; ip4.dst == 172.24.4.226"
pipeline            : ingress
priority            : 400
table_id            : 9
hash                : 0

Turns out that those flows are intended to handle inter-FIP communication. Basically, there are flows for every possible FIP pair so that the traffic doesn't flow through a Geneve tunnel.

While FIP-to-FIP traffic between two OVN ports is not perhaps the most common use case, those flows are there to handle it that way that the traffic would be distributed and never sent through the overlay network.

A git blame on the code that generates those flows will show the commits [1][2] and some background on the actual issue.

With the results above, one would expect a quadratic growth but still, it's always nice to pull some graphs 🙂

 

And again, the simple script that I used to get the numbers in the graph:

ovn-nbctl clear logical_router router1 nat
for i in {1..200}; do
  ovn-nbctl lr-nat-add router1 dnat_and_snat 172.24.4.$i 192.168.0.11 vm1 40:44:00:00:00:07
  # Allow some time to northd to generate the new lflows.
  ovn-sbctl list logical_flow > /dev/null
  ovn-sbctl list logical_flow > /tmp/lflows.txt
  S1=$(grep lr_out_egr_loop -c /tmp/lflows.txt )
  S2=$(grep lr_in_ip_routing -c /tmp/lflows.txt )
  echo $S1 $S2
done

Soon after I reported this scaling issue with the above findings, my colleague Numan did an amazing job fixing it with this patch. The results are amazing and for the same test scenario with 200 FIPs, the total amount of lflows dropped from ~127K to ~2.7K and most importantly, from an exponential to a linear growth.

 

Since the Logical Flows are represented in ASCII, they are also quite expensive in processing due to the string parsing, and not very cheap to transmit in the OVSDB protocol. This has been a great leap when it comes to scaling scaling environments with a heavy N/S traffic and lots of Floating IPs.

OVN Cluster Interconnection

A new feature has been recently introduced in OVN that allows multiple clusters to be interconnected at L3 level (here's a link to the series of patches). This can be useful for scenarios with multiple availability zones (or physical regions) or simply to allow better scaling by having independent control planes yet allowing connectivity between workloads in separate zones.

Simplifying things, logical routers on each cluster can be connected via transit overlay networks. The interconnection layer is responsible for creating the transit switches in the IC database that will become visible to the connected clusters. Each cluster can then connect their logical routers to the transit switches. More information can be found in the ovn-architecture manpage.

I created a vagrant setup to test it out and become a bit familiar with it. All you need to do to recreate it is cloning and running 'vagrant up' inside the ovn-interconnection folder:

https://github.com/danalsan/vagrants/tree/master/ovn-interconnection

This will deploy 7 CentOS machines (300MB of RAM each) with two separate OVN clusters (west & east) and the interconnection services. The layout is described in the image below:

Once the services are up and running, a few resources will be created on each cluster and the interconnection services will be configured with a transit switch between them:

Let's see, for example, the logical topology of the east availability zone, where the transit switch ts1 is listed along with the port in the west remote zone:

[root@central-east ~]# ovn-nbctl show
switch c850599c-263c-431b-b67f-13f4eab7a2d1 (ts1)
    port lsp-ts1-router_west
        type: remote
        addresses: ["aa:aa:aa:aa:aa:02 169.254.100.2/24"]
    port lsp-ts1-router_east
        type: router
        router-port: lrp-router_east-ts1
switch 8361d0e1-b23e-40a6-bd78-ea79b5717d7b (net_east)
    port net_east-router_east
        type: router
        router-port: router_east-net_east
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.1.11"]
router b27d180d-669c-4ca8-ac95-82a822da2730 (router_east)
    port lrp-router_east-ts1
        mac: "aa:aa:aa:aa:aa:01"
        networks: ["169.254.100.1/24"]
        gateway chassis: [gw_east]
    port router_east-net_east
        mac: "40:44:00:00:00:04"
        networks: ["192.168.1.1/24"]

As for the Southbound database, we can see the gateway port for each router. In this setup I only have one gateway node but, as any other distributed gateway port in OVN, it could be scheduled in multiple nodes providing HA

[root@central-east ~]# ovn-sbctl show
Chassis worker_east
    hostname: worker-east
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding vm1
Chassis gw_east
    hostname: gw-east
    Encap geneve
        ip: "192.168.50.102"
        options: {csum="true"}
    Port_Binding cr-lrp-router_east-ts1
Chassis gw_west
    hostname: gw-west
    Encap geneve
        ip: "192.168.50.103"
        options: {csum="true"}
    Port_Binding lsp-ts1-router_west

If we query the interconnection databases, we will see the transit switch in the NB and the gateway ports in each zone:

[root@central-ic ~]# ovn-ic-nbctl show
Transit_Switch ts1

[root@central-ic ~]# ovn-ic-sbctl show
availability-zone east
    gateway gw_east
        hostname: gw-east
        type: geneve
            ip: 192.168.50.102
        port lsp-ts1-router_east
            transit switch: ts1
            address: ["aa:aa:aa:aa:aa:01 169.254.100.1/24"]
availability-zone west
    gateway gw_west
        hostname: gw-west
        type: geneve
            ip: 192.168.50.103
        port lsp-ts1-router_west
            transit switch: ts1
            address: ["aa:aa:aa:aa:aa:02 169.254.100.2/24"]

With this topology, traffic flowing from vm1 to vm2 shall flow from gw-east to gw-west through a Geneve tunnel. If we list the ports in each gateway we should be able to see the tunnel ports. Needless to say, gateways have to be mutually reachable so that the transit overlay network can be established:

[root@gw-west ~]# ovs-vsctl show
6386b867-a3c2-4888-8709-dacd6e2a7ea5
    Bridge br-int
        fail_mode: secure
        Port ovn-gw_eas-0
            Interface ovn-gw_eas-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.102"}

Now, when vm1 pings vm2, the traffic flow should be like:

(vm1) worker_east ==== gw_east ==== gw_west ==== worker_west (vm2).

Let's see it via ovn-trace tool:

[root@central-east vagrant]# ovn-trace  --ovs --friendly-names --ct=new net_east  'inport == "vm1" &amp;&amp; eth.src == 40:44:00:00:00:01 &amp;&amp; eth.dst == 40:44:00:00:00:04 &amp;&amp; ip4.src == 192.168.1.11 &amp;&amp; ip4.dst == 192.168.2.12 &amp;&amp; ip.ttl == 64 &amp;&amp; icmp4.type == 8'


ingress(dp="net_east", inport="vm1")
...
egress(dp="net_east", inport="vm1", outport="net_east-router_east")
...
ingress(dp="router_east", inport="router_east-net_east")
...
egress(dp="router_east", inport="router_east-net_east", outport="lrp-router_east-ts1")
...
ingress(dp="ts1", inport="lsp-ts1-router_east")
...
egress(dp="ts1", inport="lsp-ts1-router_east", outport="lsp-ts1-router_west")
 9. ls_out_port_sec_l2 (ovn-northd.c:4543): outport == "lsp-ts1-router_west", priority 50, uuid c354da11
    output;
    /* output to "lsp-ts1-router_west", type "remote" */

Now let's capture Geneve traffic on both gateways while a ping between both VMs is running:

[root@gw-east ~]# tcpdump -i genev_sys_6081 -vvnee icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
10:43:35.355772 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 11379, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 40, length 64
10:43:35.356077 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 11379, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 40, length 64
10:43:35.356442 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 42610, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 40, length 64
10:43:35.356734 40:44:00:00:00:04 > 40:44:00:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 42610, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 40, length 64


[root@gw-west ~]# tcpdump -i genev_sys_6081 -vvnee icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
10:43:29.169532 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 8875, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 34, length 64
10:43:29.170058 40:44:00:00:00:10 > 40:44:00:00:00:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 8875, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 34, length 64
10:43:29.170308 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38667, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 34, length 64
10:43:29.170476 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38667, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 34, length 64

You can observe that the ICMP traffic flows between the transit switch ports (aa:aa:aa:aa:aa:02 <> aa:aa:aa:aa:aa:01) traversing both zones.

Also, as the packet has gone through two routers (router_east and router_west), the TTL at the destination has been decremented twice (from 64 to 62):

[root@worker-west ~]# ip net e vm2 tcpdump -i any icmp -vvne
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
10:49:32.491674  In 40:44:00:00:00:10 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 57504, offset 0, flags [DF], proto ICMP (1), length 84)

This is a really great feature that opens a lot of possibilities for cluster interconnection and scaling. However, it has to be taken into account that it requires another layer of management that handles isolation (multitenancy) and avoids IP overlapping across the connected availability zones.