Recently, the Performance & Scale team at Red Hat ran some tests to stress both the control and data planes of OpenStack. One of the biggest issues detected during that exercise was the memory consumption of all the Neutron workers across the controller nodes, which climbed all the way up to 75 GB of RSS.
After some analysis, we determined that there were close to a million MAC_Binding entries in the OVN Southbound database. These entries are kept in memory by Neutron and they never age out, so the memory just grows and grows.
MAC bindings
The MAC_Binding table tracks the bindings from IP addresses to Ethernet addresses that are dynamically discovered using ARP (for IPv4) and neighbor discovery (for IPv6). Usually, IP-to-MAC bindings for virtual machines are statically populated into the Port_Binding table, so MAC_Binding is primarily used to discover bindings on physical networks.
The MAC_Binding table is populated by ovn-controller whenever it sees a new ARP/ND packet on the network, even if the packet does not belong to OVN. It is common in OpenStack deployments to have multiple tenants connecting their routers to a relatively large provider network. Once a new MAC address is learned, OVN adds one MAC_Binding entry per router connected to the external network.
In this particular exercise, the external network was a /16 and we observed close to 1M entries. This is not only a memory problem: it also generates a lot of network traffic and puts stress on the OVN Southbound server, which needs to commit the transactions from the ovn-controllers and send out the notifications to all the clients.
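To get a feel for how quickly the table can grow, here is a rough back-of-the-envelope sketch. It assumes every usable address on the external network ends up being learned and that each learned address produces one row per connected router, as described above. The router count of 16 is purely hypothetical (the actual number in the test environment is not stated here); it was picked only because it lands near the observed order of magnitude.

```python
import ipaddress


def estimate_mac_binding_rows(cidr, routers, fraction_learned=1.0):
    """Rough upper bound on MAC_Binding rows for one external network.

    Assumption: one row per (learned address, router) pair.
    """
    network = ipaddress.ip_network(cidr)
    # Exclude the network and broadcast addresses of an IPv4 subnet.
    usable_hosts = network.num_addresses - 2
    learned = int(usable_hosts * fraction_learned)
    return learned * routers


# Hypothetical example: a /16 external network with 16 tenant routers.
print(estimate_mac_binding_rows("10.0.0.0/16", routers=16))
```

With these made-up numbers the estimate is already just over a million rows, and every one of them has to be replicated to each client that monitors the table.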
Why does Neutron care about the MAC_Binding table?
The problem, and its workaround, is described here but, in short, it is very common in OpenStack to reuse a Floating IP address (e.g., during testing on CI), and Neutron implements a mechanism to delete the MAC address associated with a Floating IP from the MAC_Binding table in order to force learning the new MAC address when needed.
For Neutron to do this, we monitored the table, which forced us to keep an in-memory copy of all its entries. Since these entries do not age out, the most likely scenario is that we'll eventually hit the OOM killer, depending on the network topology, network size and other factors.
Ideally, core OVN should implement a mechanism to eliminate stale MAC_Binding entries after a certain time, but that's not an easy task, as discussed in the links above. In the meantime, we thought of a way to stop monitoring the problematic table while keeping the mechanism that deletes Floating IP MAC addresses when they are associated with or disassociated from a port in Neutron. The adopted solution was to invoke an external tool, ovsdb-client, which lets us delete such entries while:
- Avoiding monitoring the MAC_Binding table
- Avoiding downloading the database contents, so those entries are deleted in constant time regardless of the database size
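As an illustration of the idea (not the exact code from the patch), the sketch below builds an `ovsdb-client transact` command that asks the Southbound server to delete every MAC_Binding row matching a given IP. The helper names, the connection string `tcp:127.0.0.1:6642`, the address `203.0.113.10` and the 10 second timeout are all assumptions made for the example.

```python
import json
import subprocess


def build_mac_binding_delete_cmd(sb_connection, ip_address, timeout=10):
    """Return the argv for a one-shot delete, without any monitoring."""
    # An OVSDB transaction is a JSON array: database name, then operations.
    transaction = json.dumps([
        "OVN_Southbound",
        {
            "op": "delete",
            "table": "MAC_Binding",
            "where": [["ip", "==", ip_address]],
        },
    ])
    return [
        "ovsdb-client",
        "--timeout=%d" % timeout,
        "transact",
        sb_connection,
        transaction,
    ]


def delete_mac_binding(sb_connection, ip_address):
    """Run the command; requires ovsdb-client and a reachable SB server."""
    cmd = build_mac_binding_delete_cmd(sb_connection, ip_address)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Hypothetical values, only to show what the command looks like.
print(build_mac_binding_delete_cmd("tcp:127.0.0.1:6642", "203.0.113.10"))
```

The server evaluates the `where` clause itself, so the client never needs a local copy of the table. That is what makes the cost independent of how many entries the database holds.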
The patch was recently merged into the Neutron repository. As a result, running the same tests, the overall memory consumption decreased by an order of magnitude, from ~75 GB to ~7.5 GB of RSS, without impacting the run time of the exercise.
This great improvement was thanks to the collaboration of the Performance & Scale, Neutron and core OVN teams at Red Hat. I'm lucky to work with such great and smart people 😉