Tag: sdn

Learning OVN and BGP with Claude AI

Recently, OVN has introduced BGP support and I did not have the time to follow the development so I decided to use the help of AI (Claude Code CLI) to learn about the new feature. The path I chose here is:

  1. Run one of the existing BGP tests  (ovn multinode bgp unnumbered)
  2. Use AI (claude-sonnet-4.5 model) to help me understand what the test does
  3. Identify the parts of the codebase that implement some of the bits in the test
  4. Do a small modification to the test to do something new

1. Run the BGP test from the OVN sources

To run this test, I need first to deploy an ovn-fake-multinode setup with 4 chassis and 4 gateways. I did it just by following the instructions in the README.md file.

$ sudo -E CHASSIS_COUNT=4 GW_COUNT=4 ./ovn_cluster.sh start

After deploying the cluster, you’ll have multiple containers running but only the ovn-central-az1, ovn-gw-1, and ovn-gw-2 are relevant to this test. This cluster will have some OVN resources but the test cleans them up upon start.

I did not want to investigate much how these containers are wired so I asked Claude to read the sources and produce a diagram for me. It also helped me to render it into a PNG file using ImageMagick :p

Essentially, it will create two OVS bridges in my host to connect eth1 on each container for the underlay and eth2 for the dataplane. The eth0 interface is connected to the podman network for management.

Now we run the test and stop it before the clean up. This you can do by adding a check false right before the cleanup, and then executing the test like this:

OVS_PAUSE_TEST=1 make check-multinode TESTSUITEFLAGS="-k 'ovn multinode bgp unnumbered' -v"

Once the test stops, you can see that it’s completed successfully and it was able to ping some IP address, and the containers are running with the configuration and resources created by the test.

multinode.at:2959: waiting until m_as ovn-gw-1 ip netns exec frr-ns ping -W 1 -c 1 172.16.10.2...
PING 172.16.10.2 (172.16.10.2) 56(84) bytes of data.
64 bytes from 172.16.10.2: icmp_seq=1 ttl=62 time=2.32 ms

--- 172.16.10.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.322/2.322/2.322/0.000 ms
multinode.at:2959: wait succeeded immediately
multinode.at:2960: waiting until m_as ovn-gw-2 ip netns exec frr-ns ip route | grep -q 'ext1'...
multinode.at:2960: wait succeeded immediately
multinode.at:2961: waiting until m_as ovn-gw-2 ip netns exec frr-ns ping -W 1 -c 1 172.16.10.2...
PING 172.16.10.2 (172.16.10.2) 56(84) bytes of data.
64 bytes from 172.16.10.2: icmp_seq=1 ttl=62 time=1.77 ms

--- 172.16.10.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.773/1.773/1.773/0.000 ms
multinode.at:2961: wait succeeded immediately
false
./ovn-macros.at:856: "$@"
./ovn-macros.at:856: exit code was 1, expected 0
=====================================================
Set following environment variable to use various ovs utilities
export OVS_RUNDIR=/root/ovn/tests/multinode-testsuite.dir/17
Press ENTER to continue:


$ podman exec ovn-central-az1 ovn-nbctl show | grep -E '^(router|switch)'
switch 05e453e9-f5b2-47b9-9eb6-10ef6ea8a08c (ls-ovn-gw-2-ext0)
switch 8a12c075-6367-46b0-ac5a-88346450fc60 (ls-ovn-gw-1-ext0)
switch 3cc96979-1aab-4cd3-a84b-6f5f36bd4a42 (ls-guest-ovn-gw-1)
switch e24ad9c9-ef80-4686-96eb-88ef47d9bc01 (ls-guest-ovn-gw-2)
switch 9a279d1b-0929-4386-a704-fb9faaf6dfa6 (ls-join)
router e0a37965-c854-4c17-a4bb-4401d61c48a6 (lr-guest)
router 1f81175b-1168-4956-b7a8-64c85c07af4f (lr-ovn-gw-1-ext0)
router ce2161ff-ea11-4fb0-954b-75f0cf26f3d7 (lr-ovn-gw-2-ext0)

2. Use Claude AI to understand what the test does

At this point I read the test and I can see that there’s a fake VM with a Floating IP (172.16.10.2) that’s reachable through the simulated ToR switches. The ToRs learn the route via BGP, but I’m struggling to understand the exact topology and configuration because the test uses several layers of macros that call other macros.

Claude to the rescue!

I used Claude to help me:

  • Trace through the macro definitions to understand what topology gets created
  • Identify which components run where (containers, namespaces, VRFs)
  • Map out the BGP peering relationships and AS numbers
  • Understand how FRR is configured on both the ToRs and OVN gateways
  • Figure out why there are two separate FRR instances running on the same container The key insight Claude helped me understand was that the test simulates both the ToR switches and the OVN gateways within the same container nodes, using:
  • Network namespaces (frr-ns) to isolate the ToR’s network stack
  • Two separate FRR instances – one for the ToR (in frr-ns) and one for the OVN gateway (in default namespace, using VRFs)
  • BGP unnumbered peering using IPv6 link-local addresses between them

As a bonus, I asked Claude to draw a couple of diagrams for me to better understand the topology that this test creates:

 

Since I stopped the test before the cleanup, I told Claude how to access the remote lab and asked to inspect the live environment via SSH commands.

NOTE: Before I got the answer below I had been asking Claude things about the macros that the test uses and it’s been unpacking and reading a lot of the OVN code that I checked out locally. I re-ran this on a fresh session and it failed to figure out things like who creates the VRFs where the blackhole routes are installed, so it required a few more follow-up questions until I was able to find the right answers.

> I would like you to inspect the live deployment and show how the routes get advertised (eg. listing the different OVN tables, FRR configuration, etcetera). You can access the live environment by running 'ssh ovnlab' directly and using 'sudo'.


⏺ Perfect! Let me trace through exactly how the Floating IP (172.16.10.2) gets advertised from OVN through to the external ToR switches via BGP.

  ---  Step 1: NAT Configuration in OVN Northbound

  First, the NAT rule is configured on the guest router:

  $ podman exec ovn-central-az1 ovn-nbctl list NAT
  _uuid               : 4305c6db-b431-400d-b3e8-248d920394fc
  external_ip         : "172.16.10.2"
  logical_ip          : "192.168.10.10"
  type                : dnat_and_snat
  gateway_port        : lrp-guest-join-dgp
  options             : {add_route="true", ...}

  Key points:
  - external_ip: 172.16.10.2 (the Floating IP)
  - logical_ip: 192.168.10.10 (the internal VM IP)
  - gateway_port: Associated with the Distributed Gateway Port (DGP)
  - add_route=true: Tells OVN to create a route for this NAT entry

  Step 2: Dynamic Routing and VRF Configuration

  The gateway routers are configured with dynamic routing and VRF settings:

  $ podman exec ovn-central-az1 ovn-nbctl get logical_router lr-ovn-gw-1-ext0 options
  {chassis=ovn-gw-1, dynamic-routing="true", dynamic-routing-vrf-id="10"}

  $ podman exec ovn-central-az1 ovn-nbctl get logical_router_port lrp-ovn-gw-1-ext0 options
  {dynamic-routing-maintain-vrf="true", dynamic-routing-redistribute=nat,
   routing-protocol-redirect=lsp-ovn-gw-1-ext0-bgp, routing-protocols="BGP,BFD"}

  This configuration tells OVN:
  - Enable dynamic routing on lr-ovn-gw-1-ext0
  - Use VRF ID 10 (will create VRF named ovnvrf10)
  - dynamic-routing-maintain-vrf=true on the router port - This is the key setting that tells ovn-controller to create and maintain the VRF in the kernel
  - Redistribute NAT routes via the router port
  - Redirect BGP protocol traffic to the lsp-ovn-gw-1-ext0-bgp port

  When ovn-controller on the ovn-gw-1 chassis sees the dynamic-routing-maintain-vrf=true option, it automatically creates the VRF in the kernel:

  $ podman exec ovn-gw-1 ip link show type vrf
  11: ovnvrf10: <NOARP,MASTER,UP,LOWER_UP> mtu 65575 qdisc noqueue state UP mode DEFAULT group default qlen 1000
      link/ether 7a:7c:6f:2c:0d:22 brd ff:ff:ff:ff:ff:ff

  The VRF name is constructed as ovnvrf + dynamic-routing-vrf-id (10 in this case).

  Similarly, on ovn-gw-2, the VRF ID 20 creates ovnvrf20.

  Step 3: Advertised Routes in OVN Southbound

  OVN northd processes the NAT configuration and creates advertised route entries:

  $ podman exec ovn-central-az1 ovn-sbctl list Advertised_Route
  _uuid               : 360a360a-4ff4-4e87-888f-80e7c3b55e0a
  ip_prefix           : "172.16.10.2"
  logical_port        : lrp-ovn-gw-1-to-join
  ...

  _uuid               : d3a74c47-2ad9-471e-8b5e-e186ab774f21
  ip_prefix           : "172.16.10.2"
  logical_port        : lrp-ovn-gw-2-to-join
  ...

  Both gateway routers have advertised route entries for 172.16.10.2 because:
  - Each gateway router (lr-ovn-gw-1-ext0, lr-ovn-gw-2-ext0) has dynamic-routing-redistribute=nat configured on its port connecting to the join switch
  - The NAT entry from lr-guest gets redistributed to both gateway routers
  - Each gateway then advertises the FIP to its respective ToR (ovn-ext-1 and ovn-ext-2)
  - This provides connectivity from both external ASNs (ASN 4200000100 and ASN 4200000200) to reach the FIP

  The DGP location only matters for actual traffic forwarding, not for route advertisement. Both ToRs can advertise reachability to 172.16.10.2, and when traffic arrives at either gateway, OVN's overlay handles forwarding it to whichever chassis actually hosts the DGP.

  Step 4: Blackhole Routes in the Kernel

  OVN-controller on each gateway node installs blackhole routes in the VRF:

  $ podman exec ovn-gw-1 ip route show vrf ovnvrf10
  blackhole 172.16.10.2 proto 84 metric 1000

  $ podman exec ovn-gw-2 ip route show vrf ovnvrf20
  blackhole 172.16.10.2 proto 84 metric 1000

  These blackhole routes serve as markers for FRR to detect and advertise via BGP. They don't actually route traffic (that's handled by OVN flows), but FRR's redistribute kernel configuration picks them up.

  Step 5: FRR BGP Advertisement (OVN Gateway Side)

  FRR on the OVN gateway sees the kernel route and advertises it to its BGP neighbor:

  $ podman exec ovn-gw-1 vtysh --vty_socket /run/frr/ -c "show bgp vrf ovnvrf10 ipv4"
  BGP table version is 1, local router ID is 14.14.14.14, vrf id 11
  Default local pref 100, local AS 4210000000

       Network          Next Hop            Metric LocPrf Weight Path
   *>  172.16.10.2/32   0.0.0.0               1000         32768 ?

  The route shows:
  - Next Hop: 0.0.0.0 - locally originated (from kernel redistribution)
  - Weight: 32768 - default weight for local routes
  - Path: ? - incomplete origin (redistributed from kernel)

  Step 6: BGP Peering (Unnumbered)

  The OVN gateway and ToR are peered using BGP unnumbered (IPv6 link-local addresses):

  $ podman exec ovn-gw-1 vtysh --vty_socket /run/frr/ -c "show bgp vrf ovnvrf10 neighbors ext0-bgp"
  BGP neighbor on ext0-bgp: fe80::a8ed:baff:feb6:585e, remote AS 4200000100, local AS 4210000000
  BGP state = Established, up for 1d06h12m
  ...
  Prefix statistics:
    0 accepted, 1 sent prefixes

  Key details:
  - Neighbor: fe80::a8ed:baff:feb6:585e (IPv6 link-local - unnumbered)
  - Remote AS: 4200000100 (the ToR's ASN)
  - Local AS: 4210000000 (OVN gateway's ASN)
  - Sent prefixes: 1 (the 172.16.10.2/32 route)

  Step 7: FRR BGP Learning (ToR Side)

  The ToR's FRR instance receives and accepts the route:

  $ podman exec ovn-gw-1 vtysh --vty_socket /run/frr/frr-ns/ -c "show bgp ipv4"
  BGP table version is 1, local router ID is 41.41.41.41, vrf id 0
  Default local pref 100, local AS 4200000100

       Network          Next Hop            Metric LocPrf Weight Path
   *>  172.16.10.2/32   ext1                  1000             0 4210000000 ?

  The ToR now knows:
  - Network: 172.16.10.2/32
  - Next Hop: ext1 (the interface toward the OVN gateway)
  - Metric: 1000 (inherited from the advertised route)
  - ASN Path: 4210000000 (learned from the OVN gateway's ASN)

  Summary: Complete Advertisement Chain

  1. NAT Config (NB)
     ↓
  2. Dynamic Routing + VRF Config (NB)
     ↓
  3. VRF Creation (Kernel) ← ovn-controller sees dynamic-routing-maintain-vrf=true
     ↓
  4. Advertised Routes (SB) ← northd processing
     ↓
  5. Blackhole Routes (Kernel VRF) ← ovn-controller
     ↓
  6. BGP Table (FRR Gateway) ← redistribute kernel
     ↓
  7. BGP Peering (IPv6 LLA)
     ↓
  8. BGP Table (FRR ToR) ← received from peer

  This shows the complete journey of how a NAT entry in OVN becomes a BGP route advertised to external infrastructure, enabling seamless integration between OVN's virtual networking and physical network routing.

3. Identify the parts of the codebase that implement some of the bits in the test

At this point I believe I understand what the test does and how the BGP implementation works on the surface. In order for me to get a deeper understanding of certain parts of the code I want to understand how and when the VRF is created by ovn-controller. Since Claude has access to the OVN sources and my live environment, it should be able to give me good pointers to it 🙂

> I want to get some details (source code level) about how and when ovn-controller creates the VRF where it installs the blackhole routes.


⏺ Let me trace through the source code to show you exactly how and when ovn-controller creates the VRF and installs the blackhole routes.

⏺ Read(ovn/controller/route-exchange.c)
  ⎿  Read 100 lines

⏺ Search(pattern: "vrf.*create|create.*vrf", path: "ovn/controller", output_mode: "content")
  ⎿  Found 5 lines (ctrl+o to expand)

⏺ Read(ovn/controller/route-exchange.c)
  ⎿  Read 150 lines

⏺ Read(ovn/controller/route-exchange-netlink.c)
  ⎿  Read 80 lines

⏺ Now let me check where the advertise_datapath_entry gets populated with the maintain_vrf flag:

⏺ Search(pattern: "maintain_vrf.*=|->maintain_vrf", path: "ovn/controller", output_mode: "content")
  ⎿  Found 2 lines (ctrl+o to expand)

⏺ Read(ovn/controller/route.c)
  ⎿  Read 50 lines

⏺ Perfect! Now let me check how advertised routes get added:

⏺ Search(pattern: "Advertised_Route|advertised.*route", path: "ovn/controller/route.c", output_mode: "content")
  ⎿  Found 3 lines (ctrl+o to expand)

⏺ Read(ovn/controller/route.c)
  ⎿  Read 80 lines

⏺ Perfect! Now I have the complete picture. Let me create a detailed source code-level explanation:

  ---
  Source Code Deep Dive: VRF Creation and Blackhole Route Installation

  Overview

  The VRF creation and blackhole route installation happens in ovn-controller through a multi-step process coordinated between several source files. Let me trace through the exact code path.

  Step 1: Checking for dynamic-routing-maintain-vrf Option

  File: controller/route.c (lines 206-209)

  ad->maintain_vrf |=
      smap_get_bool(&repb->options,
                    "dynamic-routing-maintain-vrf",
                    false);

  What happens here:
  - ovn-controller iterates through all router port bindings (repb) that have dynamic routing enabled
  - For each port, it checks if the dynamic-routing-maintain-vrf option is set to true
  - If found, it sets the maintain_vrf flag in the advertise_datapath_entry structure
  - This flag tells ovn-controller that it is responsible for creating and managing the VRF

  Step 2: Determining VRF Name

  File: controller/route.c (lines 211-225)

  const char *vrf_name = smap_get(&repb->options,
                                  "dynamic-routing-vrf-name");
  if (vrf_name && strlen(vrf_name) >= IFNAMSIZ) {
      // Warn and ignore if name is too long
      vrf_name = NULL;
  }
  if (vrf_name) {
      memcpy(ad->vrf_name, vrf_name, strlen(vrf_name) + 1);
  } else {
      snprintf(ad->vrf_name, sizeof ad->vrf_name, "ovnvrf%"PRIu32,
               route_get_table_id(ad->db));
  }

  What happens here:
  - First checks if a custom VRF name is specified via dynamic-routing-vrf-name option
  - If no custom name is provided, constructs the default name as ovnvrf + VRF ID
  - For example, with dynamic-routing-vrf-id=10, it creates ovnvrf10

  Step 3: Creating the VRF

  File: controller/route-exchange.c (lines 265-277)

  if (ad->maintain_vrf) {
      if (!sset_contains(&old_maintained_vrfs, ad->vrf_name)) {
          error = re_nl_create_vrf(ad->vrf_name, table_id);
          if (error && error != EEXIST) {
              VLOG_WARN_RL(&rl,
                           "Unable to create VRF %s for datapath "
                           UUID_FMT": %s.", ad->vrf_name,
                           UUID_ARGS(&ad->db->header_.uuid),
                           ovs_strerror(error));
              SET_ROUTE_EXCHANGE_NL_STATUS(error);
              continue;
          }
      }
      sset_add(&_maintained_vrfs, ad->vrf_name);
  }

  What happens here:
  - During the route_exchange_run() function execution (called on every ovn-controller iteration)
  - Checks if the VRF was already created in a previous iteration (by checking old_maintained_vrfs)
  - If not, calls re_nl_create_vrf() to create it via netlink
  - Adds the VRF name to _maintained_vrfs set to track it
  - If VRF already exists (EEXIST error), silently continues (this is normal)

  Step 4: Netlink VRF Creation

  File: controller/route-exchange-netlink.c (lines 42-77)

  int
  re_nl_create_vrf(const char *ifname, uint32_t table_id)
  {
      if (!TABLE_ID_VALID(table_id)) {
          // ... error handling
          return EINVAL;
      }

      // Prepare netlink message
      struct ofpbuf request;
      uint8_t request_stub[NETNL_REQ_BUFFER_SIZE];
      ofpbuf_use_stub(&request, request_stub, sizeof(request_stub));

      // Create RTM_NEWLINK message with VRF attributes
      nl_msg_put_nlmsghdr(&request, 0, RTM_NEWLINK,
                          NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL);
      ifinfo = ofpbuf_put_zeros(&request, sizeof *ifinfo);
      nl_msg_put_string(&request, IFLA_IFNAME, ifname);  // VRF name

      ifinfo->ifi_change = ifinfo->ifi_flags = IFF_UP;   // Bring it UP
      linkinfo_off = nl_msg_start_nested(&request, IFLA_LINKINFO);
      nl_msg_put_string(&request, IFLA_INFO_KIND, "vrf");  // Type: VRF
      infodata_off = nl_msg_start_nested(&request, IFLA_INFO_DATA);
      nl_msg_put_u32(&request, IFLA_VRF_TABLE, table_id);  // VRF table ID
      nl_msg_end_nested(&request, infodata_off);
      nl_msg_end_nested(&request, linkinfo_off);

      // Send to kernel via netlink
      err = nl_transact(NETLINK_ROUTE, &request, NULL);

      ofpbuf_uninit(&request);
      return err;
  }

  What happens here:
  - Constructs a netlink RTM_NEWLINK message to create a new network interface
  - Sets the interface type to "vrf"
  - Associates it with the routing table ID (from dynamic-routing-vrf-id)
  - Sets the interface to UP state immediately
  - Sends the netlink message to the kernel's routing subsystem
  - The kernel creates the VRF device (e.g., ovnvrf10)

  Step 5: Reading Advertised Routes from Southbound DB

  File: controller/route.c (lines 267-295)

  const struct sbrec_advertised_route *route;
  SBREC_ADVERTISED_ROUTE_TABLE_FOR_EACH (route,
                                         r_ctx_in->advertised_route_table) {
      struct advertise_datapath_entry *ad =
          advertise_datapath_find(r_ctx_out->announce_routes,
                                  route->datapath);
      if (!ad) {
          continue;
      }

      struct in6_addr prefix;
      unsigned int plen;
      if (!ip46_parse_cidr(route->ip_prefix, &prefix, &plen)) {
          // ... error handling
          continue;
      }

      if (!lport_is_local(r_ctx_in->sbrec_port_binding_by_name,
                          r_ctx_in->chassis,
                          route->logical_port->logical_port)) {
          // Skip routes for ports not on this chassis
          continue;
      }

      // Add route to the advertise_datapath_entry
      struct advertise_route_entry *ar = xmalloc(sizeof(*ar));
      ar->addr = prefix;
      ar->plen = plen;
      ar->priority = priority;
      hmap_insert(&ad->routes, &ar->node,
                  advertise_route_hash(&prefix, plen));
  }

  What happens here:
  - Reads all Advertised_Route entries from the Southbound database
  - These are created by northd when it processes NAT rules with redistribution enabled
  - Filters to only routes whose logical port is bound to this chassis
  - Builds a hash map of routes to be installed in the kernel

  Step 6: Installing Blackhole Routes

  File: controller/route-exchange-netlink.c (lines 98-121)

  static int
  modify_route(uint32_t type, uint32_t flags_arg, uint32_t table_id,
               const struct in6_addr *dst, unsigned int plen,
               unsigned int priority)
  {
      uint32_t flags = NLM_F_REQUEST | NLM_F_ACK;
      bool is_ipv4 = IN6_IS_ADDR_V4MAPPED(dst);
      struct rtmsg *rt;

      flags |= flags_arg;

      struct ofpbuf request;
      // ... buffer initialization

      nl_msg_put_nlmsghdr(&request, 0, type, flags);
      rt = ofpbuf_put_zeros(&request, sizeof *rt);
      rt->rtm_family = is_ipv4 ? AF_INET : AF_INET6;
      rt->rtm_table = RT_TABLE_UNSPEC;
      rt->rtm_protocol = RTPROT_OVN;      // Mark as OVN-managed (proto 84)
      rt->rtm_type = RTN_BLACKHOLE;       // ← BLACKHOLE route type
      rt->rtm_scope = RT_SCOPE_UNIVERSE;

  What happens here:
  - For each advertised route, constructs a netlink RTM_NEWROUTE message
  - Sets route type to RTN_BLACKHOLE - this is the key!
  - Sets protocol to RTPROT_OVN (protocol 84) so they can be identified later
  - Sends to kernel, which installs the route in the specified VRF table

  Step 7: FRR Picks Up the Routes

  Once the blackhole routes are installed in the kernel VRF, FRR's redistribute kernel configuration (in the VRF BGP context) automatically detects them and advertises via BGP.

  Key Timing and Lifecycle

  1. When VRF is created: On first ovn-controller iteration after seeing dynamic-routing-maintain-vrf=true
  2. When routes are installed: Every time ovn-controller runs and processes Advertised_Route entries
  3. Route updates: Synchronized continuously - stale routes are removed, new ones added
  4. VRF cleanup: When dynamic-routing-maintain-vrf is removed or chassis loses binding, VRF is deleted

  Why Blackhole Routes?

  The routes are marked as RTN_BLACKHOLE because:
  - They are markers for FRR to advertise, not for actual packet forwarding
  - OVN handles the actual packet forwarding through OpenFlow rules
  - If a packet somehow reaches the kernel and matches one of these routes, it will be dropped (blackholed)
  - This prevents routing loops and ensures all traffic goes through OVN's datapath

Impressive, right?

Let’s now try to do something with the environment!

3. Modify the environment to advertise internal IPs via BGP

One potential modification to this test is to advertise the internal IP address of the guest VM (192.168.10.10) and ping it from the ToR. Instead of changing the code of the test I’ll modify the live environment to advertise the route to the VM and use Claude to handhold me.

First, verify that the VM is not reachable.

# podman exec ovn-gw-2 ip netns exec frr-ns ping -c 1 192.168.10.10
ping: connect: Network is unreachable

Now, since the ToRs are on different ASNs and are not connected,  we will only be able to access the VM from the ToR where the DGP  is bound. In our environment, the active gateway is on the gw-2 node. On a typical datacenter with a spine-and-leaf topology, the ToRs will be in the same AS and will be connected through the spine so we could reach the VM from anywhere in the network.

> I would like the lr-guest to advertise the route to 192.168.10.10, how can i enable that option?

⏺ To make lr-guest advertise its connected routes, you need to configure dynamic routing on lr-guest itself and on its router ports. Here's how:

  Step 1: Enable dynamic routing on lr-guest

  podman exec ovn-central-az1 ovn-nbctl set logical_router lr-guest \
      options:dynamic-routing=true \
      options:dynamic-routing-vrf-id=100

  Step 2: Configure the router ports to redistribute connected routes

  You need to set dynamic-routing-redistribute on the ports where you want to advertise routes from:

  # Advertise from the port connected to ls-guest-ovn-gw-1
  podman exec ovn-central-az1 ovn-nbctl set logical_router_port lrp-guest-ls-guest-ovn-gw-1 \
      options:dynamic-routing-redistribute=connected

  # Advertise from the port connected to ls-guest-ovn-gw-2
  podman exec ovn-central-az1 ovn-nbctl set logical_router_port lrp-guest-ls-guest-ovn-gw-2 \
      options:dynamic-routing-redistribute=connected

  However, there's a critical limitation here:

  - lr-guest is only instantiated on the chassis where the DGP is active (ovn-gw-2)
  - The VRF (ovnvrf100) will only be created on ovn-gw-2
  - Routes from lr-guest won't propagate to the gateway routers (lr-ovn-gw-1-ext0, lr-ovn-gw-2-ext0) automatically

  To actually make this work end-to-end, you would still need the static routes approach on the gateway routers, because OVN doesn't automatically propagate routes between different logical routers.

  The dynamic-routing-redistribute=connected on lr-guest would be useful if you had FRR running directly in the lr-guest VRF (ovnvrf100), but in this test topology, the BGP peering is happening in the gateway router VRFs (ovnvrf10, ovnvrf20), not in lr-guest's VRF.

  So the answer is: you can enable it, but it won't achieve what you want without also adding the static routes on the gateway routers.

This looks sane but still it does not seem that we’re telling OVN to create the VRF 100 and also FRR won’t be reading any routes from that VRF. Claude has missed that part, let’s talk about it 🙂

> if i understood you correctly earlier, we still need to add the maintain-vrf option to the LRP so that the vrf100 gets created

⏺ You're absolutely right! We need to add the maintain-vrf option to trigger vrf100 creation. Let me provide the complete configuration:

  # 1. Set maintain-vrf=true on the DGP to create vrf100
  # This is the critical step that triggers VRF creation in ovn-controller
  podman exec -it ovn-central-az1 \
    ovn-nbctl set logical_router_port lrp-guest-join-dgp \
    options:dynamic-routing-maintain-vrf=true

  # 2. Add static routes on gateway routers pointing to the DGP
  podman exec ovn-central-az1 ovn-nbctl lr-route-add lr-ovn-gw-2-ext0 \
      192.168.10.0/24 fe80::216:6ff:fe12:f00d lrp-ovn-gw-2-to-join


> FRR on ovn-gw-2 does not know anything about the vrf100. Can you help me adding the config to import this VRF?

⏺ You are right. Here's the configuration to add the import:

  podman exec -i ovn-gw-2 vtysh <<'EOF'
  configure
    router bgp 4210000000 vrf ovnvrf100
      bgp router-id 100.100.100.100
      no bgp ebgp-requires-policy
      address-family ipv4 unicast
        redistribute kernel
      exit-address-family
    exit
    router bgp 4210000000 vrf ovnvrf20
      address-family ipv4 unicast
        import vrf ovnvrf100
      exit-address-family
    exit
    write memory
  EOF

  This will:
  1. Create the BGP instance in ovnvrf100 that redistributes kernel routes (the blackhole route for 192.168.10.0/24)
  2. Add import vrf ovnvrf100 to the existing ovnvrf20 BGP configuration, so routes from vrf100 get leaked into vrf20 and advertised to the ToR


> Can you explain in detail what we have just done to reach 192.168.10.10 from the ToR?


⏺ Let me explain in detail how this works by tracing through the entire flow:

  1. OVN Configuration Creates the Foundation

  You've configured lr-guest with:
  - options:dynamic-routing=true
  - options:dynamic-routing-vrf-id=100
  - lrp-guest-join-dgp has options:dynamic-routing-maintain-vrf=true

  2. VRF Creation in Kernel

  On ovn-gw-2 (where the DGP is bound), ovn-controller sees the dynamic-routing-maintain-vrf=true option and calls re_nl_create_vrf() to create ovnvrf100 in the kernel with routing table ID 100.

  You can verify this:
  ssh ovnlab sudo podman exec ovn-gw-2 ip link show ovnvrf100
  ssh ovnlab sudo podman exec ovn-gw-2 ip route show vrf ovnvrf100

  3. Blackhole Route Installation

  When lr-guest has routes to advertise (like the connected network 192.168.10.0/24), ovn-controller queries the Advertised_Route table in the Southbound DB and installs blackhole routes in vrf100:

  ssh ovnlab sudo podman exec ovn-gw-2 ip route show vrf ovnvrf100 table 100
  # You should see something like:
  # blackhole 192.168.10.0/24 proto 84

  The route is marked as:
  - RTN_BLACKHOLE - doesn't actually route traffic, just a marker
  - RTPROT_OVN (protocol 84) - identifies it as an OVN-managed route

  This happens in route-exchange-netlink.c:98-121.

  4. FRR Sees the Kernel Route in vrf100

  When you configure:
  router bgp 4210000000 vrf ovnvrf100
    address-family ipv4 unicast
      redistribute kernel

  FRR monitors the kernel routing table for vrf100 (table 100) and sees the blackhole route for 192.168.10.0/24. Because of redistribute kernel, FRR adds this route to BGP's Routing Information Base (RIB) for vrf100.

  You can verify:
  ssh ovnlab "sudo podman exec ovn-gw-2 vtysh -c 'show ip bgp vrf ovnvrf100'"

  5. VRF Route Leaking (The Magic Part)

  When you configure in ovnvrf20:
  router bgp 4210000000 vrf ovnvrf20
    address-family ipv4 unicast
      import vrf ovnvrf100

  This tells FRR to import routes from vrf100's BGP RIB into vrf20's BGP RIB. This is called "VRF route leaking" or "route target import."

  Internally, FRR copies the 192.168.10.0/24 route from vrf100's BGP table into vrf20's BGP table.

  You can verify:
  ssh ovnlab "sudo podman exec ovn-gw-2 vtysh -c 'show ip bgp vrf ovnvrf20'"
  # Should now show 192.168.10.0/24 in addition to the NAT routes

  6. BGP Advertisement to ToR

  Once the route is in ovnvrf20's BGP RIB, the existing BGP session advertises it:
  router bgp 4210000000 vrf ovnvrf20
    neighbor ext0-bgp interface remote-as external
    address-family ipv4 unicast
      redistribute kernel
      neighbor ext0-bgp prefix-list no-default out

  The route gets advertised to the ToR peer via the ext0-bgp unnumbered BGP session.


  Key Points

  1. vrf100 has no external BGP peers - it's purely for internal route collection from lr-guest
  2. vrf20 has the actual ToR peering - it's where routes get advertised externally
  3. VRF import bridges the two - routes from vrf100 appear in vrf20's BGP table
  4. Blackhole routes are just markers - they don't route actual traffic, they just signal to FRR "advertise this prefix"
  5. This only works on ovn-gw-2 - because that's where the DGP (lrp-guest-join-dgp) is currently bound

And, of course, we can now ping the VM IP address from the ToR connected to ovn-gw-2:

$ podman exec -it ovn-central-az1 ovn-sbctl list advertised_route | grep 192.168.10 -C3
_uuid               : 1c417fc1-6af6-4945-9aaf-5443237a120a
datapath            : c1a7942f-bf59-4835-8c56-4df45a5c5855
external_ids        : {}
ip_prefix           : "192.168.10.0/24"
logical_port        : 782b2357-2c37-4e6f-bec6-ff204fcb4733
tracked_port        : []

$ podman exec ovn-gw-2 ip netns exec frr-ns ip route
172.16.10.2 nhid 6 via inet6 fe80::20fb:d6ff:fe66:992c dev ext1 proto bgp metric 20
192.168.10.0/24 nhid 6 via inet6 fe80::20fb:d6ff:fe66:992c dev ext1 proto bgp metric 20
192.168.20.0/24 nhid 6 via inet6 fe80::20fb:d6ff:fe66:992c dev ext1 proto bgp metric 20

$ podman exec ovn-gw-2 ip netns exec frr-ns ping -c 2 192.168.10.10
PING 192.168.10.10 (192.168.10.10) 56(84) bytes of data.
64 bytes from 192.168.10.10: icmp_seq=1 ttl=62 time=1.48 ms
64 bytes from 192.168.10.10: icmp_seq=2 ttl=62 time=0.575 ms

--- 192.168.10.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.575/1.027/1.479/0.452 ms


Conclusions

This combination of code analysis + live system inspection + domain knowledge made understanding the feature much faster than traditional code reading alone. Even for someone familiar with OVN, having AI as a “knowledgeable assistant” that can quickly locate and explain specific implementation details is awesome! It’s been very handy for:

  • Rapid ramp-up on the new BGP integration feature as a user
  • Tracing execution paths across multiple source files without manually grepping through thousands of lines
  • Understanding the “how” and “why” of implementation details like the VRF creation
  • Connecting the dots between configuration options and actual kernel-level operations

Claude is definitely getting a permanent route in my learning topology. Next hop: more complex features! 😉

Trunk ports in OVN

The purpose of this post is explaining the low level details of how trunking works in OVN. A typical use case of trunk ports could be the use of nested VMs or containers inside VMs; where all the traffic is directed to the vNIC of the virtual machine and then forwarded to the right container based on their VLAN ID. For more context around the feature and use cases, please check out the OpenStack documentation guide.

Let’s take a very simple topology for our deep dive. You can also deploy it in your machine using this vagrant setup and replay the commands as we go 🙂

This sample setup has two Logical Switches and two ports on each of them. The physical layout is as follows:

  • vm1 bound to worker1
  • vm2 bound to worker2
  • child1 (VLAN 30) inside vm1
  • child2 (VLAN 50) inside vm2

Let’s quickly check the OVN databases info:

[root@central vagrant]# ovn-nbctl show
switch db4e7781-370c-4439-becd-35803c0e3f12 (network1)
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.0.11"]
    port vm2
        addresses: ["40:44:00:00:00:02 192.168.0.12"]
switch 40ac144b-a32a-4202-bce2-3329f8f3e98f (network2)
    port child1
        parent: vm1
        tag: 30
        addresses: ["40:44:00:00:00:03 192.168.1.13"]
    port child2
        parent: vm2
        tag: 50
        addresses: ["40:44:00:00:00:04 192.168.1.14"]



[root@central vagrant]# ovn-sbctl show
Chassis worker2
    hostname: worker2
    Encap geneve
        ip: "192.168.50.101"
[root@central vagrant]# ovn-sbctl show
Chassis worker2
    hostname: worker2
    Encap geneve
        ip: "192.168.50.101"
        options: {csum="true"}
    Port_Binding child2
    Port_Binding vm2
Chassis worker1
    hostname: worker1
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding child1
    Port_Binding vm1

Instead of booting actual VMs and containers, I simulated it with network namespaces and VLAN devices inside them:

[root@worker1 vagrant]# ip netns exec vm1 ip -d link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
2: child1@vm1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:03 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 30 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
24: vm1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:01 brd ff:ff:ff:ff:ff:ff promiscuity 2
    openvswitch addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535


[root@worker2 vagrant]# ip netns exec vm2 ip -d link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
2: child2@vm2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:04 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vlan protocol 802.1Q id 50 <REORDER_HDR> addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
15: vm2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 40:44:00:00:00:02 brd ff:ff:ff:ff:ff:ff promiscuity 2
    openvswitch addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Now, as you can see, none of the subports (child1 and child2) are connected directly to the integration bridge so both vm1 and vm2 ports act as trunk ports for VLAN IDs 30 and 50. OVN will install flows to tag/untag the traffic directed to/from these ports.

 

Traffic and OpenFlow analysis

To illustrate this, let’s ping from child1 (worker1) to child 2 (worker2):

[root@worker1 vagrant]# ip netns exec vm1 ping 192.168.1.14
PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data.
64 bytes from 192.168.1.14: icmp_seq=21 ttl=64 time=0.824 ms
64 bytes from 192.168.1.14: icmp_seq=22 ttl=64 time=0.211 ms

The traffic will arrive tagged to the vm1 interface and will be sent out untagged to worker2 (where vm2 is bound) via the Geneve tunnel:

[root@worker1 ~]# ip netns exec vm1 tcpdump -vvnee -i vm1 icmp -c1

tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
13:18:16.980650 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype 802.1Q (0x8100), length 102: vlan 30, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 55255, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 9833, seq 176, length 64



[root@worker1 ~]# tcpdump -vvneei genev_sys_6081 icmp -c1

tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
13:19:11.980671 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 16226, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 9833, seq 231, length 64

On worker1, let’s inspect the OVS flows that determine the source network/port based on the VLAN ID:

[root@worker1 ~]# ovs-ofctl dump-flows br-int table=0 |grep vlan

 cookie=0x4b8d6fa5, duration=337116.380s, table=0, n_packets=270983, n_bytes=27165634, idle_age=0, hard_age=65534, priority=150,in_port=12,dl_vlan=30 actions=load:0x1->NXM_NX_REG10[5],strip_vlan,load:0x5->NXM_NX_REG13[],load:0x1->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],resubmit(,8)

The flow above in table 0 matches on the VLAN tag (dl_vlan=30). Also, note that there’s no matching flow for VLAN 50 as vm2 is not bound to worker1.

As each parent should have subports with unique VLAN IDs, this ID will determine the source port (nested VM or container) that is sending the traffic. In our example, this will be child1 as it is the subport tagged with VLAN 30. In the actions section of the table 0 flow, the packet will be untagged (strip_vlan action), and the relevant registers will be populated to identify both the subport network and the logical input port:

  • The packet is coming from OF port 12 (in_port=12) which corresponds to vm1
[root@worker1 ~]# ovs-ofctl show br-int | grep vm1
 12(vm1): addr:40:44:00:00:00:01
  • The network identifier (metadata) is populated with the value 2 (load:0x2->OXM_OF_METADATA[]) which corresponds to the network of the subport (network2)
[root@central ~]# ovn-sbctl find datapath_binding tunnel_key=2
_uuid               : 2d762d73-5ab9-4f43-a303-65a6046e41e7
external_ids        : {logical-switch="40ac144b-a32a-4202-bce2-3329f8f3e98f", name=network2}
load_balancers      : []
tunnel_key          : 2
  • The logical input port (register 14) will be populated with the tunnel key of the child1 subport (load:0x1->NXM_NX_REG14[])
[root@central vagrant]# ovn-sbctl get port_binding child1 tunnel_key
1
  • Now the pipeline execution with the untagged packet gets resumed from table 8 (resubmit(,8)). Eventually it gets sent through the tunnel to worker2, where the parent (vm2) of the destination port (child2) is bound to.

 

Let’s inspect the traffic and flows on worker2, the destination hypervisor:

The traffic arrives untagged to br-int from the Geneve interface and later gets delivered to the vm2 interface tagged with the child2 VLAN ID (50).

[root@worker2 vagrant]# tcpdump -vvnee -i genev_sys_6081 icmp -c1
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
13:57:25.000587 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 56431, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 10218, seq 31, length 64


[root@worker2 vagrant]# ip netns exec vm2 tcpdump -vvneei vm2 icmp -c1
tcpdump: listening on vm2, link-type EN10MB (Ethernet), capture size 262144 bytes
13:57:39.000617 40:44:00:00:00:03 > 40:44:00:00:00:04, ethertype 802.1Q (0x8100), length 102: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 59701, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.13 > 192.168.1.14: ICMP echo request, id 10218, seq 45, length 64

The packet processing takes place as with any regular VIF but in the output stage, the traffic will be tagged before it is sent out to the vm2 interface:

[root@worker2 vagrant]# ovs-ofctl dump-flows br-int |grep mod_vlan_vid:50
 cookie=0x969b8a9d, duration=338995.147s, table=65, n_packets=263914, n_bytes=25814112, idle_age=0, hard_age=65534, priority=100,reg15=0x2,metadata=0x2 actions=mod_vlan_vid:50,output:4,strip_vlan

[root@worker2 vagrant]# ovs-ofctl show br-int | grep vm2
 4(vm2): addr:00:00:00:00:00:00

 

As you can see, the way that OVN implements this feature is very simple and only adds a couple of extra flows. Hope that this article helps understanding the details of trunk ports and how it’s leveraged by projects like Kuryr to run Kubernetes on top of OpenStack.

OpenStack TripleO networking layout

The goal of this post is to describe how network isolation is typically achieved for both the control and data planes in OpenStack using TripleO. In particular, how all this happens in a virtual setup, using one baremetal node (hypervisor, from now on) to deploy the OpenStack nodes with libvirt. For the purpose of this post, we’ll work with a 3 controllers + 1 compute virtual setup.

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+
| ID | Name | Status | Networks |
+--------------------------------------+--------------+--------+------------------------+
| b3bd5157-b3ea-4331-91af-3820c4e12252 | controller-0 | ACTIVE | ctlplane=192.168.24.15 |
| 6f228b08-49a0-4b68-925a-17d06224d5f9 | controller-1 | ACTIVE | ctlplane=192.168.24.37 |
| e5c649b5-c968-4293-a994-04293cb16da1 | controller-2 | ACTIVE | ctlplane=192.168.24.10 |
| 9f15ed23-efb1-4972-b578-7b0da3500053 | compute-0 | ACTIVE | ctlplane=192.168.24.14 |
+--------------------------------------+--------------+--------+------------------------+

The tool used to deploy this setup is Infrared (documentation) which is an easy-to-use wrapper around TripleO. Don’t be scared about the so many layers involved here; the main point is to understand that a physical – and somewhat powerful – server is running an OpenStack cluster formed by:

  • 3 virtual controllers that run the OpenStack control plane services (Neutron, Nova, Glance, …)
  • 1 virtual compute node that will serve to host the workloads (virtual machines) of the OpenStack cluster 

From a Networking perspective (I’ll omit the undercloud for simplicity), things are wired like this:

Let’s take a look at the bridges in the hypervisor node:

[root@hypervisor]# brctl show

bridge name     bridge id               STP enabled     interfaces
management      8000.525400cc1d8b       yes             management-nic
                                                        vnet0
                                                        vnet12
                                                        vnet3
                                                        vnet6
                                                        vnet9

external        8000.5254000ceb7c       yes             external-nic
                                                        vnet11
                                                        vnet14
                                                        vnet2
                                                        vnet5
                                                        vnet8

data            8000.5254007bc90a       yes             data-nic
                                                        vnet1
                                                        vnet10
                                                        vnet13
                                                        vnet4
                                                        vnet7

Each bridge has 6 ports (3 controllers, 1 compute, 1 undercloud, and the local port in the hypervisor). Now, each virtual machine running in this node can be mapped to the right interface:

[root@hypervisor]# for i in controller-0 controller-1 controller-2 compute-0; do virsh domiflist $i; done


 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet9       network   management   virtio   52:54:00:74:29:4f
 vnet10      network   data         virtio   52:54:00:1c:44:26
 vnet11      network   external     virtio   52:54:00:20:3c:4e

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet3       network   management   virtio   52:54:00:0b:ad:3b
 vnet4       network   data         virtio   52:54:00:2f:9f:3e
 vnet5       network   external     virtio   52:54:00:75:a5:ed

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet6       network   management   virtio   52:54:00:da:a3:1e
 vnet7       network   data         virtio   52:54:00:57:26:67
 vnet8       network   external     virtio   52:54:00:2c:21:d5

 Interface   Type      Source       Model    MAC
----------------------------------------------------------------
 vnet0       network   management   virtio   52:54:00:de:4a:38
 vnet1       network   data         virtio   52:54:00:c7:74:4b
 vnet2       network   external     virtio   52:54:00:22:de:5c

Network configuration templates

This section will go through the Infrared/TripleO configuration to understand how this layout was defined. This will also help the reader to change the CIDRs, VLANs, number of virtual NICs, etc.

First, the deployment script:

$ cat overcloud_deploy.sh
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/docker-images.yaml \
--log-file overcloud_deployment_99.log

Now, let’s take a look at the network related templates to understand the different networks and how they map to the physical NICs inside the controllers/compute nodes:

$ grep -i -e cidr -e vlan /home/stack/virt/network/network-environment.yaml
ControlPlaneSubnetCidr: '192.168.24.0/24'

ExternalNetCidr: 10.0.0.0/24
ExternalNetworkVlanID: 10

InternalApiNetCidr: 172.17.1.0/24
InternalApiNetworkVlanID: 20

StorageMgmtNetCidr: 172.17.4.0/24
StorageMgmtNetworkVlanID: 40

StorageNetCidr: 172.17.3.0/24
StorageNetworkVlanID: 30

TenantNetCidr: 172.17.2.0/24
TenantNetworkVlanID: 50

NeutronNetworkVLANRanges: tenant:1000:2000

OS::TripleO::Compute::Net::SoftwareConfig: three-nics-vlans/compute.yaml
OS::TripleO::Controller::Net::SoftwareConfig: three-nics-vlans/controller.yaml

In the output above you can see 6 different networks:

  • ControlPlane (flat): used mainly for provisioning (PXE) and remote access to the nodes via SSH.
  • External (VLAN 10): external network used for dataplane floating IP traffic and access to the OpenStack API services via their external endpoints.
  • InternalApi (VLAN 20): network where the OpenStack control plane services will listen for internal communication (eg. Neutron <-> Nova).
  • StorageMgmt (VLAN 40): network used to manage the storage (in this deployment, swift-object-server, swift-container-server, and swift-account-server will listen to requests on this network)   
  • Storage (VLAN 30): network used for access to the Object storage (in this deployment, swift-proxy will listen to requests on this network).
  • Tenant: this network will carry the overlay tunnelled traffic (Geneve for OVN, VXLAN in the case of ML2/OVS) in the VLAN 50 but will also carry dataplane traffic if VLAN tenant networks are used in Neutron. The VLAN range allowed for such traffic is specified also in the template (in the example, VLAN ids ranging from 1000-2000 are reserved for Neutron tenant networks).

The way that each NIC is mapped to each network is defined in the yaml files below. For this deployment, I used a customized layout via this patch (controller.yaml and compute.yaml). Essentially, the mapping looks like this:

  • Controllers:
    • nic1: ControlPlaneIp (flat); InternalApi (20), Storage (30) , StorageMgmt (40), VLAN devices
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic 
  • Compute:
    • nic1: ControlPlaneIp (flat); InternalApi (20), Storage (30), VLAN devices 
    • nic2: br-tenant OVS bridge and VLAN50 for the tunnelled traffic
    • nic3: br-ex OVS bridge for external traffic 

The nodes map nic1, nic2, nic3 to ens4, ens5, ens6 respectively:

[root@controller-0 ~]# ip l | egrep "vlan[2-4]0"
9: vlan20@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
10: vlan30@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: vlan40@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

[root@controller-0 ~]# ovs-vsctl list-ports br-tenant
ens4
vlan50

[root@controller-0 ~]# ovs-vsctl list-ports br-ex
ens5

In the controller nodes we’ll find an haproxy instance load balancing the requests to the different nodes and we can see here the network layout as well:

[root@controller-1 ~]# podman exec -uroot -it haproxy-bundle-podman-1 cat /etc/haproxy/haproxy.cfg

listen neutron
  bind 10.0.0.122:9696 transparent      <--- External network
  bind 172.17.1.48:9696 transparent     <--- InternalApi network
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk
  option httplog
# Now the backends in the InternalApi network
  server controller-0.internalapi.local 172.17.1.72:9696 check fall 5 inter 2000 rise 2
  server controller-1.internalapi.local 172.17.1.101:9696 check fall 5 inter 2000 rise 2
  server controller-2.internalapi.local 172.17.1.115:9696 check fall 5 inter 2000 rise 2

In the above output, the IP address 172.17.1.48 is a virtual IP managed by pacemaker and will live in the InternalApi (VLAN 20) network where it is master:

[root@controller-1 ~]# pcs status | grep 172.17.1.48
  * ip-172.17.1.48      (ocf::heartbeat:IPaddr2):       Started controller-0

[root@controller-0 ~]# ip a |grep 172.17.1.48
    inet 172.17.1.48/32 brd 172.17.1.255 scope global vlan20

Traffic inspection

With a clear view on the networking layout, now we can use the hypervisor to hook a tcpdump in the right bridge and check for whatever traffic we’re interested in.

Let’s for example ping from the InternalApi (172.17.1.0/24) network on controller-0 to controller-1 and check the traffic in the hypervisor:

[heat-admin@controller-0 ~]$ ping controller-1.internalapi.local
PING controller-1.internalapi.redhat.local (172.17.1.101) 56(84) bytes of data.
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from controller-1.redhat.local (172.17.1.101): icmp_seq=2 ttl=64 time=0.096 ms


[root@hypervisor]# tcpdump -i management -vvne icmp -c2
tcpdump: listening on management, link-type EN10MB (Ethernet), capture size 262144 bytes
15:19:08.418046 52:54:00:74:29:4f > 52:54:00:0b:ad:3b, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 58494, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.1.72 > 172.17.1.101: ICMP echo request, id 53086, seq 5, length 64 15:19:08.418155 52:54:00:0b:ad:3b > 52:54:00:74:29:4f, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 39897, offset 0, flags [none], proto ICMP (1), length 84) 172.17.1.101 > 172.17.1.72: ICMP echo reply, id 53086, seq 5, length 64 [root@hypervisor]# brctl showmacs management | egrep "52:54:00:0b:ad:3b|52:54:00:74:29:4f" port no mac addr is local? ageing timer 3 52:54:00:0b:ad:3b no 0.01 5 52:54:00:74:29:4f no 0.01

When we ping to the controller-1 IP address of the InternalApi network, the traffic is tagged (VLAN 20) and going through the management bridge in the hypervisor. This matches our expectations as we defined such network in the template files that way.

Similarly, we could trace more complicated scenarios like an OpenStack instance in a tenant network pinging an external destination:

(overcloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+---------+--------+-----------------------+--------+
| ID | Name | Status | Networks | Image |
+--------------------------------------+---------+--------+-----------------------+--------+
| 3d9f6957-5311-4590-8c62-097b576ffa04 | cirros1 | ACTIVE | private=192.168.0.166 | cirros |
+--------------------------------------+---------+--------+-----------------------+--------+
[root@compute-0 ~]# sudo ip net e ovnmeta-e49cc182-247c-4dc9-9589-4df6fcb09511 ssh cirros@192.168.0.166 cirros@192.168.0.166's password: $ ping 8.8.8.8 PING 8.8.8.8 (8.8.8.8): 56 data bytes 64 bytes from 8.8.8.8: seq=0 ttl=53 time=10.356 ms 64 bytes from 8.8.8.8: seq=1 ttl=53 time=8.591 ms

Now in the hypervisor, we’ll trace the Geneve traffic (VLAN50):

# tcpdump -i data -vvnne vlan 50 and "(udp port 6081) and (udp[10:2] = 0x6558) and (udp[(8 + (4 * (2 + (udp[8:1] & 0x3f))) + 12):2] = 0x0800) and (udp[8 + (4 * (2 + (udp[8:1] & 0x3f))) + 14 + 9:1] = 01)"  -c2

tcpdump: listening on data, link-type EN10MB (Ethernet), capture size 262144 bytes
16:21:28.642671 6a:9b:72:22:3f:68 > 0e:d0:eb:00:1b:e7, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 15872, offset 0, flags [DF], proto UDP (17), length 142) 172.17.2.119.27073 > 172.17.2.143.6081: [bad udp cksum 0x5db4 -> 0x1e8c!] Geneve, Flags [C], vni 0x5, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00010003] fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 50335, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.166 > 8.8.8.8: ICMP echo request, id 2818, seq 2145, length 64 16:21:28.650412 0e:d0:eb:00:1b:e7 > 6a:9b:72:22:3f:68, ethertype 802.1Q (0x8100), length 160: vlan 50, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 26871, offset 0, flags [DF], proto UDP (17), length 142) 172.17.2.143.31003 > 172.17.2.119.6081: [bad udp cksum 0x5db4 -> 0x4a04!] Geneve, Flags [C], vni 0x3, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00040002] fa:16:3e:34:a2:0e > fa:16:3e:63:c0:7a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 53, id 0, offset 0, flags [none], proto ICMP (1), length 84) 8.8.8.8 > 192.168.0.166: ICMP echo reply, id 2818, seq 2145, length 64

(First, sorry for the complicated filter; I picked it up from here and adapted it to match on the inner protocol of the Geneve traffic against ICMP. If there’s an easier way please tell me :p)

We can see that the Geneve traffic goes between 6a:9b:72:22:3f:68 and 0e:d0:eb:00:1b:e7 and now we can determine the source/dest nodes:

[root@hypervisor]# brctl showmacs data
  2     6a:9b:72:22:3f:68       no                 0.32
  2     fe:54:00:c7:74:4b       yes                0.00
  2     fe:54:00:c7:74:4b       yes                0.00
  3     0e:d0:eb:00:1b:e7       no                 0.40
  3     fe:54:00:2f:9f:3e       yes                0.00
  3     fe:54:00:2f:9f:3e       yes                0.00

From the info above we can see that port 2 corresponds to the MAC ending in “74:4b” and port 3 corresponds to the MAC ending in “9f:3e“. Therefore, this Geneve traffic is flowing from the compute-0 node to the controller-1 node which is where Neutron is running the gateway to do the SNAT towards the external network. Now, this last portion can be examined in the external bridge:

[root@hypervisor]# tcpdump -i external icmp -vvnnee -c2
tcpdump: listening on external, link-type EN10MB (Ethernet), capture size 262144 bytes
16:33:35.016198 fa:16:3e:a7:95:87 > 52:54:00:0c:eb:7c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 13537, offset 0, flags [DF], proto ICMP (1), length 84) 10.0.0.225 > 8.8.8.8: ICMP echo request, id 4354, seq 556, length 64 16:33:35.023570 52:54:00:0c:eb:7c > fa:16:3e:a7:95:87, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 54, id 0, offset 0, flags [none], proto ICMP (1), length 84) 8.8.8.8 > 10.0.0.225: ICMP echo reply, id 4354, seq 556, length 64

In case that you’re wondering what’s 10.0.0.225; that’s the IP address of the Neutron gateway:

(overcloud) [stack@undercloud-0 ~]$ openstack router show router1 | grep gateway
| external_gateway_info   | {"network_id": "fe8330fe-540a-4acf-bda8-394398fb4272", "external_fixed_ips": [{"subnet_id": "e388a080-1953-4cdd-9e35-48d416fe2ae1", "ip_address": "10.0.0.225"}

Similarly, the MAC addresses can be matched to confirm that the traffic goes from the gateway node (controller-1), as the MAC ending in “a5:ed”  – in the same port as the source MAC from the ICMP packet – corresponds to the NIC attached to the external network on the controller-1.

[root@hypervisor]# brctl showmacs external
  3     fa:16:3e:a7:95:87       no                 0.47
  3     fe:54:00:75:a5:ed       yes                0.00
  3     fe:54:00:75:a5:ed       yes                0.00

Reflection

This is a virtual setup and everything is confined to the boundaries of a physical server. However, it is a great playground to get yourself familiar with the underlay networking of an OpenStack setup (and networking in general ;). Once you get your hands on a real production environment, all these Linux bridges will be replaced by ToR switches (or even routers on a pure L3 Spine & Leaf architecture) but the fundamentals are the same.

OVN: Where is my packet?

Working with rather complex OpenFlow pipelines as in the case of OVN can be tricky when it comes to debugging. Being able to tell where a packet gets dropped may not be the easiest task ever but fortunately, this is getting a bit more friendly thanks to the available tools around OVS and OVN these days.

In this post, I’ll show a recent troubleshooting process that I did where the symptom was that an ICMP Neighbor Advertisement packet sent out by a guest VM was not arriving at its destination. It turned out to be a bug in ovn-controller that has already been fixed so it will hopefully illustrate some of the available tools and techniques to debug OVN through a real world example.

Let’s start by showing a failing ping to an OVN destination and we’ll build up from here:

# ping 2001:db8::f816:3eff:fe1e:5a0f -c2
PING 2001:db8::f816:3eff:fe1e:5a0f(2001:db8::f816:3eff:fe1e:5a0f) 56 data bytes
From f00d:f00d:f00d:f00d:f00d:f00d:f00d:9: icmp_seq=1 Destination unreachable: Address unreachable
From f00d:f00d:f00d:f00d:f00d:f00d:f00d:9: icmp_seq=2 Destination unreachable: Address unreachable

--- 2001:db8::f816:3eff:fe1e:5a0f ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 61ms

In the destination hypervisor we can see the Neighbor Solicitation packets but no answer to them:

$ tcpdump -i br-ex -ne "icmp6" -c2

13:11:18.350789 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

13:11:19.378648 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

The first step is to determine whether the actual VM is responding with Neighbor Advertisement packets by examining the traffic on the tap interface:

$ tcpdump -veni tapdf8a80f8-0c icmp6 -c2

15:10:26.862344 b6:c9:b9:6e:50:48 > 33:33:ff:1e:5a:0f, ethertype IPv6 (0x86dd), length 86: (hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::b4c9:b9ff:fe6e:5048 > ff02::1:ff1e:5a0f: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2001:db8::f816:3eff:fe1e:5a0f
          source link-address option (1), length 8 (1): b6:c9:b9:6e:50:48

15:10:26.865489 fa:16:3e:1e:5a:0f > b6:c9:b9:6e:50:48, ethertype IPv6 (0x86dd), length 86: (hlim 255, next-header ICMPv6 (58) payload length: 32) 2001:db8::f816:3eff:fe1e:5a0f > fe80::b4c9:b9ff:fe6e:5048: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is 2001:db8::f816:3eff:fe1e:5a0f, Flags [solicited, override]
          destination link-address option (2), length 8 (1): fa:16:3e:1e:5a:0f

At this point we know that the Neighbor Advertisement (NA) packets are sent by the VM but dropped somewhere in the OVN integration bridge, which makes this task about inspecting Logical and Physical flows to find out what’s causing this.

Let’s begin by checking the Logical flows to see if, from the OVN database standpoint, all is correctly configured. For this, we need to collect all the info that we are going to pass to the ovn-trace tool to simulate the NA packet originated from the VM such as:

  • datapath: OVN Logical Switch name or ID
  • inport: OVN port name corresponding to the VM
  • eth.src & eth.dst: source and destination MAC addresses (we can copy them from the tcpdump output above)
  • ip6.src: the IPv6 address of the VM
  • nd_target: The IPv6 address that is advertised
# ovn-nbctl show
[...]
switch 0309ffff-7c89-427e-a68e-cd87b0658005 (neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873) (aka provider-flat)
    port provnet-64e303a9-af62-4833-b713-6b361cdd6ecd
        type: localnet
        addresses: ["unknown"]
    port 916eb7ea-4d64-4d9c-be28-b693af2e7ed3
        type: localport
        addresses: ["fa:16:3e:65:76:bd 172.24.100.2 2001:db8::f816:3eff:fe65:76bd"]
    port df8a80f8-0cbf-48af-8b32-78377f034797 (aka vm-provider-flat-port)
        addresses: ["fa:16:3e:1e:5a:0f 2001:db8::f816:3eff:fe1e:5a0f"]
    port f1d49ca3-0e9c-4387-8d5d-c833fc3b7943
        type: router
        router-port: lrp-f1d49ca3-0e9c-4387-8d5d-c833fc3b7943

With all this info we are ready to run ovn-trace:

$ ovn-trace --summary neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873 'inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == fa:16:3e:1e:5a:0f && eth.dst == b6:c9:b9:6e:50:48 && ip6.src == 2001:db8::f816:3eff:fe1e:5a0f  && nd.target == 2001:db8::f816:3eff:fe1e:5a0f && nd_na && nd.tll == fa:16:3e:1e:5a:0f'
# icmp6,reg14=0x4,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=::,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
ingress(dp="provider-flat", inport="vm-provider-flat-port") {
    next;
    next;
    next;
    next;
    next;
    reg0[8] = 1;
    reg0[9] = 1;
    next;
    next;
    outport = "_MC_unknown";
    output;
    multicast(dp="provider-flat", mcgroup="_MC_unknown") {
        egress(dp="provider-flat", inport="vm-provider-flat-port", outport="provnet-64e303") {
            next;
            next;
            reg0[8] = 1;
            reg0[9] = 1;
            next;
            next;
            output;
            /* output to "provnet-64e303", type "localnet" */;
        };
    };
};

As we can see from the ovn-trace output above, the packet should be delivered to the localnet port as per the OVN DB contents, meaning that we should have been able to see it with tcpdump. Since we know that the packet was not present in the bridge, the next step is to figure out where it is being dropped in the OpenFlow pipeline by inspecting the Physical flows.

First, let’s capture a live packet (filter by ICMP6 and Neighbor Advertisement):

$ flow=$(tcpdump -nXXi tapdf8a80f8-0c "icmp6 && ip6[40] = 136 && host 2001:db8::f816:3eff:fe1e:5a0f"  -c1 | ovs-tcpundump)

1 packet captured
1 packet received by filter
0 packets dropped by kernel

$ echo $flow
b6c9b96e5048fa163e1e5a0f86dd6000000000203aff20010db800000000f8163efffe1e5a0ffe80000000000000b4c9b9fffe6e504888004d626000000020010db800000000f8163efffe1e5a0f0201fa163e1e5a0f

Now, let’s feed this packet into ovs-appctl to see the execution of the OVS pipeline:

$ ovs-appctl ofproto/trace br-int in_port=`ovs-vsctl get Interface tapdf8a80f8-0c ofport` $flow

Flow: icmp6,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

bridge("br-int")
----------------
 0. in_port=9, priority 100, cookie 0x146e7f9c
    set_field:0x1->reg13
    set_field:0x3->reg11
    set_field:0x2->reg12
    set_field:0x7->metadata
    set_field:0x4->reg14
    resubmit(,8)
 8. reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f, priority 50, cookie 0x195ddd1b
    resubmit(,9)
 9. ipv6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f, priority 90, cookie 0x1c05660e
    resubmit(,10)
10. icmp6,reg14=0x4,metadata=0x7,nw_ttl=255,icmp_type=136,icmp_code=0, priority 80, cookie 0xd3c58704
    drop

Final flow: icmp6,reg11=0x3,reg12=0x2,reg13=0x1,reg14=0x4,metadata=0x7,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Megaflow: recirc_id=0,eth,icmp6,in_port=9,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::/16,nw_ttl=255,nw_frag=no,icmp_type=0x88/0xff,icmp_code=0x0/0xff,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

We see that the packet is dropped in the table 10. Also, we can feed this output into ovn-detrace for a more friendly output which allows us to match each OpenFlow rule to OVN logical flows and tables:

$ ovs-appctl ofproto/trace br-int in_port=9 $flow | ovn-detrace --ovnsb="tcp:99.88.88.88:6642" --ovnnb="tcp:99.88.88.88:6641"

Flow: icmp6,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f

bridge("br-int")
----------------
0. in_port=9, priority 100, cookie 0x146e7f9c
set_field:0x1->reg13
set_field:0x3->reg11
set_field:0x2->reg12
set_field:0x7->metadata
set_field:0x4->reg14
resubmit(,8)
  *  Logical datapath: "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204)
  *  Port Binding: logical_port "df8a80f8-0cbf-48af-8b32-78377f034797", tunnel_key 4, chassis-name "4bcf0b65-8d0c-4dde-9783-7ccb23fe3627", chassis-str "cmp-3-1.bgp.ftw"
8. reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f, priority 50, cookie 0x195ddd1b
resubmit(,9)
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=0 (ls_in_port_sec_l2), priority=50, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == {fa:16:3e:1e:5a:0f}), actions=(next;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']
9. ipv6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f, priority 90, cookie 0x1c05660e
resubmit(,10)
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=1 (ls_in_port_sec_ip), priority=90, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && eth.src == fa:16:3e:1e:5a:0f && ip6.src == {fe80::f816:3eff:fe1e:5a0f, 2001:db8::f816:3eff:fe1e:5a0f}), actions=(next;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']
10. icmp6,reg14=0x4,metadata=0x7,nw_ttl=255,icmp_type=136,icmp_code=0, priority 80, cookie 0xd3c58704
drop
  *  Logical datapaths:
  *      "neutron-aab27d39-a3c0-4666-81a0-aa4be26ec873" (67b33068-8baa-4e45-9447-61352f8d5204) [ingress]
  *  Logical flow: table=2 (ls_in_port_sec_nd), priority=80, match=(inport == "df8a80f8-0cbf-48af-8b32-78377f034797" && (arp || nd)), actions=(drop;)
   *  Logical Switch Port: df8a80f8-0cbf-48af-8b32-78377f034797 type  (addresses ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f'], dynamic addresses [], security ['fa:16:3e:1e:5a:0f 172.24.100.163 2001:db8::f816:3eff:fe1e:5a0f']

Final flow: icmp6,reg11=0x3,reg12=0x2,reg13=0x1,reg14=0x4,metadata=0x7,in_port=9,vlan_tci=0x0000,dl_src=fa:16:3e:1e:5a:0f,dl_dst=b6:c9:b9:6e:50:48,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::b4c9:b9ff:fe6e:5048,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Megaflow: recirc_id=0,eth,icmp6,in_port=9,dl_src=fa:16:3e:1e:5a:0f,ipv6_src=2001:db8::f816:3eff:fe1e:5a0f,ipv6_dst=fe80::/16,nw_ttl=255,nw_frag=no,icmp_type=0x88/0xff,icmp_code=0x0/0xff,nd_target=2001:db8::f816:3eff:fe1e:5a0f,nd_sll=00:00:00:00:00:00,nd_tll=fa:16:3e:1e:5a:0f
Datapath actions: drop

The packet is explicitly dropped in table 10 (OVN Logical Switch table 2 (ls_in_port_sec_nd) because no other higher priority flow in that table was matched previously. Let’s inspect the relevant OVN Logical flows in such table:

_uuid               : e58f66b6-07f0-4f31-bd6b-97ea609ac3fb
actions             : "next;"
external_ids        : {source="ovn-northd.c:4455", stage-hint=bc1dbbdf, stage-name=ls_in_port_sec_nd}
logical_datapath    : 67b33068-8baa-4e45-9447-61352f8d5204
logical_dp_group    : []
match               : "inport == \"df8a80f8-0cbf-48af-8b32-78377f034797\" && eth.src == fa:16:3e:1e:5a:0f && ip6 && nd && ((nd.sll == 00:00:00:00:00:00 || nd.sll == fa:16:3e:1e:5a:0f) || ((nd.tll == 00:00:00:00:00:00 || nd.tll == fa:16:3e:1e:5a:0f) && (nd.target == fe80::f816:3eff:fe1e:5a0f || nd.target == 2001:db8::f816:3eff:fe1e:5a0f)))"
pipeline            : ingress
priority            : 90
table_id            : 2



_uuid               : d3c58704-c6be-4876-b023-2ec7ae870bde
actions             : "drop;"
external_ids        : {source="ovn-northd.c:4462", stage-hint=bc1dbbdf, stage-name=ls_in_port_sec_nd}
logical_datapath    : 67b33068-8baa-4e45-9447-61352f8d5204
logical_dp_group    : []
match               : "inport == \"df8a80f8-0cbf-48af-8b32-78377f034797\" && (arp || nd)"
pipeline            : ingress
priority            : 80
table_id            : 2

The first flow above should have matched, it has higher priority than the drop and all the fields seem correct. Let’s take a look at the actual OpenFlow rules installed by ovn-controller matching on some fields of the logical flow such as the uuid/cookie or the nd_target: 

$ ovs-ofctl dump-flows br-int |grep cookie=0xe58f66b6 | grep icmp_type=136
 cookie=0xe58f66b6, duration=20946.048s, table=10, n_packets=0, n_bytes=0, idle_age=20946, priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)

$ ovs-ofctl dump-flows br-int table=10 | grep priority=90 | grep nd_target=2001:db8::f816:3eff:fe1e:5a0f
 cookie=0x0, duration=21035.890s, table=10, n_packets=0, n_bytes=0, idle_age=21035, priority=90,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0,nd_target=2001:db8::f816:3eff:fe1e:5a0f actions=conjunction(3,1/2)

At first glance, the fact that the packet counter is 0 when all the fields seem correct (src/dest IP addresses, ICMP6 type, …) looks suspicious. The conjunctive flows are key here. On one hand, the first flow is referencing conj_id=19 while the second flow is referencing conj_id=3 (the notation for conjunction(3,1/2) means that the action to be executed is the first out of 2 clauses for the conjunction id 3). However, there’s no such conjunction and hence, the packet will not hit these flows:

$ ovs-ofctl dump-flows br-int | grep conj_id=3 -c
0

If our assumptions are correct, we could rewrite the first flow so that the conjunction id matches the expected value (3). For this, we’ll stop first ovn-controller, delete the flow and add the new one:

$ ovs-ofctl del-flows --strict br-int table=10,priority=90,conj_id=19,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0

$ ovs-ofctl add-flow br-int "cookie=0xe58f66b6,table=10,priority=90,conj_id=3,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0,actions=resubmit(,11)"

At this point we can test the ping and see if it works and the new flow gets hit:

$ tcpdump -ni tapdf8a80f8-0c

11:00:24.975037 IP6 fe80::b4c9:b9ff:fe6e:5048 > 2001:db8::f816:3eff:fe1e:5a0f: ICMP6, neighbor solicitation, who has 2001:db8::f816:3eff:fe1e:5a0f, length 32

11:00:24.975544 IP6 2001:db8::f816:3eff:fe1e:5a0f > fe80::b4c9:b9ff:fe6e:5048: ICMP6, neighbor advertisement, tgt is 2001:db8::f816:3eff:fe1e:5a0f, length 24

11:00:25.831695 IP6 f00d:f00d:f00d:4::1 > 2001:db8::f816:3eff:fe1e:5a0f: ICMP6, echo request, seq 55, length 64

11:00:25.832182 IP6 2001:db8::f816:3eff:fe1e:5a0f > f00d:f00d:f00d:4::1: ICMP6, echo reply, seq 55, length 64

$ sudo ovs-ofctl dump-flows br-int| grep cookie=0xe58f66b6  | grep conj_id

 cookie=0xe58f66b6, duration=5.278s, table=10, n_packets=2, n_bytes=172, idle_age=4, priority=90,conj_id=3,icmp6,reg14=0x4,metadata=0x7,dl_src=fa:16:3e:1e:5a:0f,nw_ttl=255,icmp_type=136,icmp_code=0 actions=resubmit(,11)

As I mentioned earlier, this issue is now solved upstream and you can check the fix here. With this patch, the conjunction ids will be properly generated and we’ll no longer hit this bug. 

Most of the issues that we’ll face in OVN will be likely related to misconfigurations (missing ACLs, port security, …) and the use of tools like ovn-trace will help us spot them. However, when tackling legitimate bugs in the core OVN side, there’s no easy and defined way to find them but luckily we have some tools at hand that, at the very least, will help us providing a good bug report that will be promptly picked and solved by the very active OVN community. 

OVN Cluster Interconnection

A new feature has been recently introduced in OVN that allows multiple clusters to be interconnected at L3 level (here’s a link to the series of patches). This can be useful for scenarios with multiple availability zones (or physical regions) or simply to allow better scaling by having independent control planes yet allowing connectivity between workloads in separate zones.

Simplifying things, logical routers on each cluster can be connected via transit overlay networks. The interconnection layer is responsible for creating the transit switches in the IC database that will become visible to the connected clusters. Each cluster can then connect their logical routers to the transit switches. More information can be found in the ovn-architecture manpage.

I created a vagrant setup to test it out and become a bit familiar with it. All you need to do to recreate it is cloning and running ‘vagrant up‘ inside the ovn-interconnection folder:

https://github.com/danalsan/vagrants/tree/master/ovn-interconnection

This will deploy 7 CentOS machines (300MB of RAM each) with two separate OVN clusters (west & east) and the interconnection services. The layout is described in the image below:

Once the services are up and running, a few resources will be created on each cluster and the interconnection services will be configured with a transit switch between them:

Let’s see, for example, the logical topology of the east availability zone, where the transit switch ts1 is listed along with the port in the west remote zone:

[root@central-east ~]# ovn-nbctl show
switch c850599c-263c-431b-b67f-13f4eab7a2d1 (ts1)
    port lsp-ts1-router_west
        type: remote
        addresses: ["aa:aa:aa:aa:aa:02 169.254.100.2/24"]
    port lsp-ts1-router_east
        type: router
        router-port: lrp-router_east-ts1
switch 8361d0e1-b23e-40a6-bd78-ea79b5717d7b (net_east)
    port net_east-router_east
        type: router
        router-port: router_east-net_east
    port vm1
        addresses: ["40:44:00:00:00:01 192.168.1.11"]
router b27d180d-669c-4ca8-ac95-82a822da2730 (router_east)
    port lrp-router_east-ts1
        mac: "aa:aa:aa:aa:aa:01"
        networks: ["169.254.100.1/24"]
        gateway chassis: [gw_east]
    port router_east-net_east
        mac: "40:44:00:00:00:04"
        networks: ["192.168.1.1/24"]

As for the Southbound database, we can see the gateway port for each router. In this setup I only have one gateway node but, as any other distributed gateway port in OVN, it could be scheduled in multiple nodes providing HA

[root@central-east ~]# ovn-sbctl show
Chassis worker_east
    hostname: worker-east
    Encap geneve
        ip: "192.168.50.100"
        options: {csum="true"}
    Port_Binding vm1
Chassis gw_east
    hostname: gw-east
    Encap geneve
        ip: "192.168.50.102"
        options: {csum="true"}
    Port_Binding cr-lrp-router_east-ts1
Chassis gw_west
    hostname: gw-west
    Encap geneve
        ip: "192.168.50.103"
        options: {csum="true"}
    Port_Binding lsp-ts1-router_west

If we query the interconnection databases, we will see the transit switch in the NB and the gateway ports in each zone:

[root@central-ic ~]# ovn-ic-nbctl show
Transit_Switch ts1

[root@central-ic ~]# ovn-ic-sbctl show
availability-zone east
    gateway gw_east
        hostname: gw-east
        type: geneve
            ip: 192.168.50.102
        port lsp-ts1-router_east
            transit switch: ts1
            address: ["aa:aa:aa:aa:aa:01 169.254.100.1/24"]
availability-zone west
    gateway gw_west
        hostname: gw-west
        type: geneve
            ip: 192.168.50.103
        port lsp-ts1-router_west
            transit switch: ts1
            address: ["aa:aa:aa:aa:aa:02 169.254.100.2/24"]

With this topology, traffic flowing from vm1 to vm2 shall flow from gw-east to gw-west through a Geneve tunnel. If we list the ports in each gateway we should be able to see the tunnel ports. Needless to say, gateways have to be mutually reachable so that the transit overlay network can be established:

[root@gw-west ~]# ovs-vsctl show
6386b867-a3c2-4888-8709-dacd6e2a7ea5
    Bridge br-int
        fail_mode: secure
        Port ovn-gw_eas-0
            Interface ovn-gw_eas-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.50.102"}

Now, when vm1 pings vm2, the traffic flow should be like:

(vm1) worker_east ==== gw_east ==== gw_west ==== worker_west (vm2).

Let’s see it via ovn-trace tool:

[root@central-east vagrant]# ovn-trace  --ovs --friendly-names --ct=new net_east  'inport == "vm1" && eth.src == 40:44:00:00:00:01 && eth.dst == 40:44:00:00:00:04 && ip4.src == 192.168.1.11 && ip4.dst == 192.168.2.12 && ip.ttl == 64 && icmp4.type == 8'


ingress(dp="net_east", inport="vm1")
...
egress(dp="net_east", inport="vm1", outport="net_east-router_east")
...
ingress(dp="router_east", inport="router_east-net_east")
...
egress(dp="router_east", inport="router_east-net_east", outport="lrp-router_east-ts1")
...
ingress(dp="ts1", inport="lsp-ts1-router_east")
...
egress(dp="ts1", inport="lsp-ts1-router_east", outport="lsp-ts1-router_west")
 9. ls_out_port_sec_l2 (ovn-northd.c:4543): outport == "lsp-ts1-router_west", priority 50, uuid c354da11
    output;
    /* output to "lsp-ts1-router_west", type "remote" */

Now let’s capture Geneve traffic on both gateways while a ping between both VMs is running:

[root@gw-east ~]# tcpdump -i genev_sys_6081 -vvnee icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
10:43:35.355772 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 11379, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 40, length 64
10:43:35.356077 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 11379, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 40, length 64
10:43:35.356442 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 42610, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 40, length 64
10:43:35.356734 40:44:00:00:00:04 > 40:44:00:00:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 42610, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 40, length 64


[root@gw-west ~]# tcpdump -i genev_sys_6081 -vvnee icmp
tcpdump: listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
10:43:29.169532 aa:aa:aa:aa:aa:01 > aa:aa:aa:aa:aa:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 8875, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 34, length 64
10:43:29.170058 40:44:00:00:00:10 > 40:44:00:00:00:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 8875, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.11 > 192.168.2.12: ICMP echo request, id 5494, seq 34, length 64
10:43:29.170308 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38667, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 34, length 64
10:43:29.170476 aa:aa:aa:aa:aa:02 > aa:aa:aa:aa:aa:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38667, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.2.12 > 192.168.1.11: ICMP echo reply, id 5494, seq 34, length 64

You can observe that the ICMP traffic flows between the transit switch ports (aa:aa:aa:aa:aa:02 <> aa:aa:aa:aa:aa:01) traversing both zones.

Also, as the packet has gone through two routers (router_east and router_west), the TTL at the destination has been decremented twice (from 64 to 62):

[root@worker-west ~]# ip net e vm2 tcpdump -i any icmp -vvne
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
10:49:32.491674  In 40:44:00:00:00:10 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 57504, offset 0, flags [DF], proto ICMP (1), length 84)

This is a really great feature that opens a lot of possibilities for cluster interconnection and scaling. However, it has to be taken into account that it requires another layer of management that handles isolation (multitenancy) and avoids IP overlapping across the connected availability zones.

OVN Routing and ovn-trace

In the last post we covered how OVN distributes the pipeline execution across the different hypervisors thanks to the encapsulation used. Using the same system as a base, let’s create a Logical Router and a second Logical Switch where we can see how routing works between different nodes.

 

ovn-nbctl ls-add network1
ovn-nbctl ls-add network2
ovn-nbctl lsp-add network1 vm1
ovn-nbctl lsp-add network2 vm2
ovn-nbctl lsp-set-addresses vm1 "40:44:00:00:00:01 192.168.0.11"
ovn-nbctl lsp-set-addresses vm2 "40:44:00:00:00:02 192.168.1.11"

ovn-nbctl lr-add router1
ovn-nbctl lrp-add router1 router1-net1 40:44:00:00:00:03 192.168.0.1/24
ovn-nbctl lsp-add network1 net1-router1
ovn-nbctl lsp-set-addresses net1-router1 40:44:00:00:00:03
ovn-nbctl lsp-set-type net1-router1 router
ovn-nbctl lsp-set-options net1-router1 router-port=router1-net1

ovn-nbctl lrp-add router1 router1-net2 40:44:00:00:00:04 192.168.1.1/24
ovn-nbctl lsp-add network2 net2-router1
ovn-nbctl lsp-set-addresses net2-router1 40:44:00:00:00:04
ovn-nbctl lsp-set-type net2-router1 router
ovn-nbctl lsp-set-options net2-router1 router-port=router1-net2

And then on Worker1 and Worker2 respectively, let’s bind VM1 and VM2 ports inside network namespaces:

# Worker1
ovs-vsctl add-port br-int vm1 -- set Interface vm1 type=internal -- set Interface vm1 external_ids:iface-id=vm1
ip netns add vm1
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 40:44:00:00:00:01
ip netns exec vm1 ip addr add 192.168.0.11/24 dev vm1
ip netns exec vm1 ip link set vm1 up
ip netns exec vm1 ip route add default via 192.168.0.1

# Worker2
ovs-vsctl add-port br-int vm2 -- set Interface vm2 type=internal -- set Interface vm2 external_ids:iface-id=vm2
ip netns add vm2
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 40:44:00:00:00:02
ip netns exec vm2 ip addr add 192.168.1.11/24 dev vm2
ip netns exec vm2 ip link set vm2 up
ip netns exec vm2 ip route add default via 192.168.1.1

First, let’s check connectivity between two VMs through the Logical Router:

[root@worker1 ~]# ip netns exec vm1 ping 192.168.1.11 -c2
PING 192.168.1.11 (192.168.1.11) 56(84) bytes of data.
64 bytes from 192.168.1.11: icmp_seq=1 ttl=63 time=0.371 ms
64 bytes from 192.168.1.11: icmp_seq=2 ttl=63 time=0.398 ms

--- 192.168.1.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.371/0.384/0.398/0.023 ms

ovn-trace

Let’s now check with ovn-trace how the flow of a packet from VM1 to VM2 looks like and then we’ll verify how this gets realized in our physical deployment with 3 hypervisors: 

[root@central ~]# ovn-trace --summary network1 'inport == "vm1" && eth.src == 40:44:00:00:00:01 && eth.dst == 40:44:00:00:00:03 && ip4.src == 192.168.0.11 && ip4.dst == 192.168.1.11 && ip.ttl == 64'
# ip,reg14=0x1,vlan_tci=0x0000,dl_src=40:44:00:00:00:01,dl_dst=40:44:00:00:00:03,nw_src=192.168.0.11,nw_dst=192.168.1.11,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=64
ingress(dp="network1", inport="vm1") {
    next;
    outport = "net1-router1";
    output;
    egress(dp="network1", inport="vm1", outport="net1-router1") {
        output;
        /* output to "net1-router1", type "patch" */;
        ingress(dp="router1", inport="router1-net1") {
            next;
            ip.ttl--;
            reg0 = ip4.dst;
            reg1 = 192.168.1.1;
            eth.src = 40:44:00:00:00:04;
            outport = "router1-net2";
            flags.loopback = 1;
            next;
            eth.dst = 40:44:00:00:00:02;
            next;
            output;
            egress(dp="router1", inport="router1-net1", outport="router1-net2") {
                output;
                /* output to "router1-net2", type "patch" */;
                ingress(dp="network2", inport="net2-router1") {
                    next;
                    outport = "vm2";
                    output;
                    egress(dp="network2", inport="net2-router1", outport="vm2") {
                        output;
                        /* output to "vm2", type "" */;
                    };
                };
            };
        };
    };
};

As you can see, the packet goes through 3 OVN datapaths: network1, router1 and network2. As VM1 is on Worker1 and VM2 is on Worker2, the packet will traverse the tunnel between both hypervisors and thanks to the Geneve encapsulation, the execution pipeline will be distributed:

  • Worker1 will execute both ingress and egress pipelines of network1
  • Worker1 will also perform the routing (lines 10 to 28 above) and the ingress pipeline for network2. Then the packet will be pushed to Worker2 via the tunnel.
  • Worker2 will execute the egress pipeline for network2 delivering the packet to VM2.

Let’s launch a ping from VM1 and check the Geneve traffic on Worker2: 

52:54:00:13:e0:a2 > 52:54:00:ac:67:5b, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 46587, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.100.55145 > 192.168.50.101.6081: [bad udp cksum 0xe6a5 -> 0xb087!] Geneve, Flags [C],
vni 0x8,
proto TEB (0x6558),
options [class Open Virtual Networking (OVN) (0x102)
type 0x80(C) len 8
data 00020001]

From what we learnt in the previous post, when the packet arrives to Worker2, the egress pipeline of Datapath 8 (VNI) will be executed being ingress port = 2 and egress port = 1. Let’s see if it matches what ovn-trace gave us earlier:

egress(dp=”network2″, inport=”net2-router1″, outport=”vm2″) 

[root@central ~]# ovn-sbctl get Datapath_Binding network2 tunnel_key
8
[root@central ~]# ovn-sbctl get Port_Binding net2-router1 tunnel-key
2
[root@central ~]# ovn-sbctl get Port_Binding vm2 tunnel-key
1

What about the reply from VM2 to VM1? If we check the tunnel traffic on Worker2 we’ll see the ICMP echo reply packets with different datapath and ingress/egress ports:

22:19:03.896340 52:54:00:ac:67:5b > 52:54:00:13:e0:a2, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 30538, offset 0, flags [DF], proto UDP (17), length 142)
    192.168.50.101.13530 > 192.168.50.100.6081: [bad udp cksum 0xe6a5 -> 0x5419!] Geneve, Flags [C],
vni 0x7,
proto TEB (0x6558),
options [class Open Virtual Networking (OVN) (0x102)
type 0x80(C) len 8
data 00020001]


[root@central ~]# ovn-sbctl get Datapath_Binding network1 tunnel_key
7
[root@central ~]# ovn-sbctl get Port_Binding net1-router1 tunnel-key
2
[root@central ~]# ovn-sbctl get Port_Binding vm1 tunnel-key
1

As we can see, the routing happens locally in the source node; there’s no need to send the packet to any central/network node as the East/West routing is always distributed with OVN. In the case of North/South traffic, where SNAT is required, traffic will go to the gateway node but we’ll cover this in coming posts.

OpenFlow analysis

Until now, we just focused on the Logical elements but ovn-controller will ultimately translate the Logical Flows into physical flows on the OVN bridge.  Let’s inspect the ports that our bridge has on Worker2:

[root@worker2 ~]# ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000404400000002
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(ovn-centra-0): addr:ea:64:c2:a9:86:fe
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(ovn-worker-0): addr:de:96:64:2b:21:4a
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 8(vm2): addr:40:44:00:00:00:02
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:40:44:00:00:00:02
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

We have 3 ports in the OVN bridge:

  1. ovn-centra-0: Tunnel port with Central node
  2. ovn-worker-0: Tunnel port with Worker1
  3. vm2: Our fake virtual machine that we bound to Worker2

When ICMP echo request packets are coming from VM1, we expect them to be arriving via port 2 and being delivered to VM2 on port 8. We can check this by inspecting the flows at table 0 (input) and table 65 (output) for br-int. For a more detailed information of the OpenFlow tables, see ovn-architecture(7) document, section “Architectural Physical Life Cycle of a Packet”.

[root@worker2 ~]# ovs-ofctl dump-flows br-int table=0 | grep idle_age='[0-1],'
cookie=0x0, duration=3845.572s, table=0, n_packets=2979, n_bytes=291942, idle_age=0, priority=100,in_port=2 actions=move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],resubmit(,33)


[root@worker2 ~]# ovs-ofctl dump-flows br-int table=65 | grep idle_age='[0-1],'
 cookie=0x0, duration=3887.336s, table=65, n_packets=3062, n_bytes=297780, idle_age=0, priority=100,reg15=0x1,metadata=0x8 actions=output:8

I was filtering the dump-flows output to display only those flows with an idle_age of just 0 or 1 seconds. This means that they’ve been recently hit so if we’re inspecting tables with lots of flows, we would filter away unwanted flows to focus just on the ones that we’re really interested. Please, note that for this particular example, I’ve set the VM2 link down to see only incoming ICMP packets from VM1 and hide the replies.

Let’s inspect the input flow on table 0 where the following actions happen:

actions=
move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],
move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],
move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],
resubmit(,33)

These actions will:

  • Copy the 24 bits of the VNI (Tunnel ID / Logical Datapath) to the OpenFlow metadata.
  • Copy bits 16-30 of the tunnel data (Ingress port) to the OpenvSwitch register 14
  • Copy bits 0-15 of the tunnel data (Egress port) to the OpenvSwitch register 15.

From the ovn-architecture document, we can read:

logical datapath field
A field that denotes the logical datapath through which a
packet is being processed. OVN uses the field that Open‐
Flow 1.1+ simply (and confusingly) calls “metadata” to
store the logical datapath. (This field is passed across
tunnels as part of the tunnel key.)

logical input port field
A field that denotes the logical port from which the
packet entered the logical datapath. OVN stores this in
Open vSwitch extension register number 14.

Geneve and STT tunnels pass this field as part of the
tunnel key. Although VXLAN tunnels do not explicitly
carry a logical input port, OVN only uses VXLAN to commu‐
nicate with gateways that from OVN’s perspective consist
of only a single logical port, so that OVN can set the
logical input port field to this one on ingress to the
OVN logical pipeline.

logical output port field
A field that denotes the logical port from which the
packet will leave the logical datapath. This is initial‐
ized to 0 at the beginning of the logical ingress pipe‐
line. OVN stores this in Open vSwitch extension register
number 15.

Geneve and STT tunnels pass this field as part of the
tunnel key. VXLAN tunnels do not transmit the logical
output port field. Since VXLAN tunnels do not carry a
logical output port field in the tunnel key, when a
packet is received from VXLAN tunnel by an OVN hypervi‐
sor, the packet is resubmitted to table 8 to determine
the output port(s); when the packet reaches table 32,
these packets are resubmitted to table 33 for local de‐
livery by checking a MLF_RCV_FROM_VXLAN flag, which is
set when the packet arrives from a VXLAN tunnel. 

Similarly, we can see that in the output action from table 65, the OpenFlow rule is matching on “reg15=0x1,metadata=0x8” when it outputs the packet to the OpenFlow port number 8 (VM2). As we saw earlier, metadata 8 corresponds to network2 Logical Switch and the output port (reg15) 1 corresponds to the VM2 logical port (tunnel_key 1).

 

Due to its distributed nature and everything being OpenFlow based, it may be a little bit tricky to trace packets when implementing Software Defined Networking with OVN compared to traditional solutions (iptables, network namespaces, …). Knowing the above concepts and getting familiar with the tools is key to debug OVN systems in an effective way.