Using nftables for QUIC load balancing
In this post, we’ll take a deep dive into how to use nftables to build a QUIC load balancer. A load balancer distributes incoming connections across a set of backend servers, typically dedicated machines or VMs running the application software.
nftables is a subsystem of the Linux kernel that replaces the older iptables / ip6tables. It was merged into the Linux kernel in early 2014 (version 3.13). nftables offers a unified interface to the kernel’s netfilter infrastructure, which is used for a variety of networking tasks, including filtering, network address translation (NAT), and packet mangling.
In this post, we’ll first take a brief look at how to load balance TCP traffic, and then highlight the differences when it comes to QUIC traffic. As we will see, the two have very little in common.
TCP Load Balancing
Assume we want to implement the simplest TCP load balancing scheme possible: New connections should be sent to a random backend server. Of course, we need to make sure that:
- TCP segments sent after establishment of the TCP connection are sent to the same backend server.
- For outgoing packets, the source IP address and port of the backend server are rewritten to the IP and port of the load balancer.
We can achieve this with the following nftables rules:
table ip lb-tcp {
    map backends {
        type mark : ipv4_addr . inet_service
        flags constant
        elements = { 0 : 192.168.0.1 . 8080, 1 : 192.168.0.2 . 8081, 2 : 192.168.0.3 . 8082 }
    }
    chain prerouting {
        type nat hook prerouting priority filter; policy accept;
        tcp dport 443 dnat ip to numgen random mod 3 map @backends
    }
}
This is pretty straightforward: We first define a map backends containing the addresses of the backend servers. Then, we define a chain prerouting that matches all incoming TCP connections on port 443 (HTTPS) and rewrites the destination IP address to a random backend server.
The random selection of the backend server happens when the TCP SYN packet is received: the rule instructs the kernel to generate a random number between 0 and 2, and to use this number to index the backends map.
After the TCP handshake, the kernel does all the heavy lifting for us: it keeps track of TCP flows and sends TCP segments belonging to the same flow to the same backend server, and it also takes care of rewriting the source IP address / port for packets sent from the backend server back to the client.
When adding a new backend server, we simply add its address to the backends map and adjust the numgen modulus in the rule (to 4, in this case). Removing a backend server is fairly straightforward as well: we remove its address from the backends map, re-number the remaining backends starting from 0, and adjust the numgen modulus accordingly. Existing connections are not affected by these changes: since connection tracking is done by the kernel, and not by the nftables rules, the kernel will continue to send packets to the backend server that was assigned when the connection was established.
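The selection that the numgen rule performs in the kernel can be sketched in user space. The following Go snippet is illustrative only (the backend type and function name are mine); it mirrors the random lookup into the backends map:

```go
package main

import (
	"fmt"
	"math/rand"
)

// backend mirrors one entry of the nftables backends map.
type backend struct {
	addr string
	port int
}

// pickBackend emulates "numgen random mod N map @backends": a uniformly
// random index selects the backend for each new connection.
func pickBackend(backends []backend) backend {
	return backends[rand.Intn(len(backends))]
}

func main() {
	backends := []backend{
		{"192.168.0.1", 8080},
		{"192.168.0.2", 8081},
		{"192.168.0.3", 8082},
	}
	b := pickBackend(backends)
	fmt.Printf("new connection goes to %s:%d\n", b.addr, b.port)
}
```

Note that when a backend is added or removed, the slice and the modulus change together, just like the map and the numgen value in the nftables rule.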
QUIC Load Balancing
When load balancing QUIC traffic, we don’t have the luxury of using the kernel’s connection tracking to keep track of QUIC flows. This is because QUIC flows are identified by the QUIC connection ID, and not by the 4-tuple (IP and port of client and server). In fact, there might be multiple QUIC connections between the same client and server on the same 4-tuple, each with a different connection ID. As a result, we need to implement our own connection tracking.
In this setup, for established connections, we use the first byte of the connection ID to determine the backend server to which the packet should be sent. This only works after completion of the QUIC handshake, since the client uses a random connection ID when it starts the handshake, and switches over to server-chosen connection IDs during the handshake. Using the first byte of the connection ID to encode the backend server limits the number of backend servers to 256, which worked well for my use case. It would be trivial to extend this to use more than 8 bits, if one needed more than 256 backend servers.
The current version of the nft command line tool contains a bug that prevents it from correctly displaying the rules we need to achieve this connection tracking. This only affects how the rules are rendered on the command line; it doesn’t affect the rules stored in the kernel. In my case, I’m using the github.com/google/nftables package to programmatically set the nft rules as backend servers are added to or removed from the load balancer.
For the same setup as above (with the backend servers listening for QUIC connections on the respective UDP ports), we can achieve this with the following nft rules:
table ip lb-udp {
    map backend-zones {
        type mark : mark
        elements = { 0x00000000 : 1, 0x00000001 : 2, 0x00000002 : 3 }
    }
    map flows {
        type ipv4_addr : mark
        size 65536
        flags dynamic,timeout
        timeout 5s
    }
    chain raw {
        type filter hook prerouting priority raw; policy accept;
        udp dport 443 @th,64,8 & 0x80 == 0x0 jump raw-short-header
        udp dport 443 @th,64,8 & 0x80 == 0x80 jump raw-other
    }
    chain raw-short-header {
        @th,72,8 0x2 ct zone set 3 return
        @th,72,8 0x1 ct zone set 2 return
        @th,72,8 0x0 ct zone set 1 return
        jump raw-other
    }
    chain raw-other {
        ct zone set ip saddr map @flows return
        meta mark set numgen random mod 3
        ct zone set meta mark map @backend-zones add @flows { ip saddr : meta mark map @backend-zones }
    }
    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;
        ct zone 3 dnat to 192.168.0.3:8082
        ct zone 2 dnat to 192.168.0.2:8081
        ct zone 1 dnat to 192.168.0.1:8080
    }
    chain output {
        type filter hook output priority raw; policy accept;
        ip saddr 192.168.0.1 udp sport 8080 ct zone set 1
        ip saddr 192.168.0.2 udp sport 8081 ct zone set 2
        ip saddr 192.168.0.3 udp sport 8082 ct zone set 3
    }
}
This is a lot! Let’s go through it step by step. First of all, we have to understand how packets traverse multiple chains.
Every nftables chain is attached to a hook: a fixed point in the kernel’s packet-processing pipeline where user-defined rules can run. The two hooks we use in this post are prerouting (for incoming packets, before routing decisions) and output (for locally generated outgoing packets). Multiple chains can attach to the same hook. When they do, their priority determines the execution order: chains with a lower priority number run first. The keywords raw and dstnat in the chain definitions are named constants for numeric priority values:
- raw = -300
- dstnat = -100
In the ruleset above, both the raw chain and the prerouting chain are attached to the prerouting hook, but with different priorities. Because raw (-300) is lower than dstnat (-100), the raw chain runs first. That ordering is essential here, as we’ll see later.
map backend-zones {
    type mark : mark
    elements = { 0x00000000 : 1, 0x00000001 : 2, 0x00000002 : 3 }
}
The backend-zones map translates a mark value into a conntrack zone number. Each backend server is assigned a unique zone. As we’ll see, the mark is generated randomly for new connections, and the zone is then used by the prerouting chain to determine the actual IP address and port of the backend server. The indirection through zones (rather than mapping directly to IP addresses) is what enables graceful shutdown of backend servers later on.
Next, we’ll look at how an incoming UDP packet is processed. These rules only rely on the version-independent QUIC packet format, as defined in Section 5 of RFC 8999. This means that this setup will work with any (present and future) version of QUIC.
chain raw {
    type filter hook prerouting priority raw; policy accept;
    udp dport 443 @th,64,8 & 0x80 == 0x0 jump raw-short-header
    udp dport 443 @th,64,8 & 0x80 == 0x80 jump raw-other
}
These rules match UDP packets arriving on port 443 (HTTPS), and then inspect the 8 bits after offset 64 of the transport header (th). Since the UDP header is 8 bytes long, this means that we’re inspecting the first byte of the UDP payload, i.e. the first byte of the QUIC packet.
If the first bit (0x80) is set, we’re dealing with a QUIC Long Header packet, and nft jumps to the raw-other chain. If the first bit is not set, we’re dealing with a QUIC Short Header packet, and nft jumps to the raw-short-header chain.
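The header-form check that these two rules perform can be expressed in a few lines of Go (a sketch; the function name is my own):

```go
package main

import "fmt"

// isLongHeader reports whether a QUIC packet uses the long header form.
// Per RFC 8999, the most significant bit of the first byte is the
// Header Form bit: 1 for Long Header packets, 0 for Short Header packets.
func isLongHeader(firstByte byte) bool {
	return firstByte&0x80 == 0x80
}

func main() {
	fmt.Println(isLongHeader(0xc0)) // e.g. an Initial packet: long header
	fmt.Println(isLongHeader(0x40)) // e.g. a 1-RTT packet: short header
}
```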
Long Header Packets
First we’ll look at handling of Long Header packets. This is a bit more complex, as we can’t use the QUIC Connection ID for routing yet, since clients use a random connection ID when they start the handshake, and only switch over to server-chosen connection IDs later. In the example below, Long Header packets are therefore routed based on the client’s source IP address only (ip saddr). If the same client initiates multiple QUIC handshakes at the same time, they will be routed to the same backend server, and multiple clients behind the same NAT (shared public source IP) will also stick to the same backend.
chain raw-other {
    ct zone set ip saddr map @flows return
    meta mark set numgen random mod 3
    ct zone set meta mark map @backend-zones add @flows { ip saddr : meta mark map @backend-zones }
}
The raw-other chain handles packets for which we need to make a new routing decision: typically QUIC handshake packets using long headers, or Short Header packets that didn’t match the connection ID check (or other non-QUIC UDP packets).
This chain sets up connection tracking (referenced as ct). conntrack is the Linux netfilter subsystem that maintains state for network flows, while a conntrack zone (ct zone) is a 16-bit identifier assigned to packets to partition that tracking into isolated namespaces. Note that we’re not making any modifications to the IP packet yet: we’re only attaching some metadata to the packet, which will persist as long as this packet is processed by the kernel.
The raw-other chain works in three steps. The first rule looks up the client’s source IP in the dynamic flows map. If an entry exists, it restores the previously assigned conntrack zone and skips to the next chain. This ensures that all future packets that share that source IP are sent to the same backend.
If no entry is found (a new client), the second rule selects a random backend server by setting meta mark to a random value between 0 and 2. The third rule then uses this mark to determine the correct ct zone (via the backend-zones map) and inserts a new entry into flows that maps the client’s IP address to that zone for the duration of the map’s timeout.
Together, these lines give us our own simple connection-tracking mechanism for QUIC.
While a few QUIC handshakes are in progress, the flows map might look something like this:
map flows {
    type ipv4_addr : mark
    size 65536
    flags dynamic,timeout
    timeout 5s
    elements = { 101.102.103.104 timeout 5s expires 2s800ms : 3,
                 105.106.107.108 timeout 5s expires 3s200ms : 3,
                 109.110.111.112 timeout 5s expires 4s900ms : 2 }
}
This means that the clients with the IP addresses 101.102.103.104 and 105.106.107.108 have been (randomly) assigned to the third backend server, and the client with the IP address 109.110.111.112 has been assigned to the second backend server.
The flows map is a dynamic map that is stored in kernel memory, with a maximum size of 65536 entries. Entries have a timeout of 5 seconds, which under normal conditions is more than enough time to complete the QUIC handshake. If another packet from the same client is received within this timeout, the entry is updated with a new timeout. If no packet from the same client is received within this timeout, the entry is removed from the map.
Now that the conntrack zone has been set, the packets are passed to the next chain, the prerouting chain:
chain prerouting {
    type nat hook prerouting priority dstnat; policy accept;
    ct zone 3 dnat to 192.168.0.3:8082
    ct zone 2 dnat to 192.168.0.2:8081
    ct zone 1 dnat to 192.168.0.1:8080
}
The prerouting chain now actually modifies the IP packet: it rewrites the destination IP address and port to the IP and port of the backend server, based on the ct zone. As we will see later, this chain is also used for Short Header packets.
Short Header Packets
Next we’ll look at handling of Short Header packets. This is a lot more straightforward, since we can use QUIC connection IDs to route packets. Most importantly, short-header packets can be routed without consulting the flows map.
chain raw-short-header {
    @th,72,8 0x2 ct zone set 3 return
    @th,72,8 0x1 ct zone set 2 return
    @th,72,8 0x0 ct zone set 1 return
    jump raw-other
}
Here we look at the second byte of the QUIC packet (the 8 bits after offset 72), which is where the QUIC connection ID begins in Short Header packets; we set the conntrack zone accordingly. As we’ve seen for Long Header packets, the prerouting chain can now use this information to rewrite the destination IP address and port to the IP and port of the backend server that was assigned to the QUIC connection.
When a backend server registers with the load balancer, it is assigned a unique byte (in this example: 0x00 for the first backend, 0x01 for the second, etc.). The load balancer uses this byte in the raw-short-header chain, and the backend must use exactly the same byte as the first byte of every server-chosen Connection ID it generates.
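On the backend side, connection ID generation could look like this sketch (the 8-byte length and the function name are my choices; quic-go, for example, supports plugging in a custom connection ID generator):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// generateConnectionID returns a fresh connection ID whose first byte is
// the backend's assigned routing byte, followed by random bytes.
func generateConnectionID(routingByte byte, length int) ([]byte, error) {
	cid := make([]byte, length)
	if _, err := rand.Read(cid[1:]); err != nil {
		return nil, err
	}
	cid[0] = routingByte // the load balancer routes on this byte
	return cid, nil
}

func main() {
	// This backend was assigned 0x01, so the load balancer sets
	// ct zone 2 for all of its Short Header packets.
	cid, err := generateConnectionID(0x01, 8)
	if err != nil {
		panic(err)
	}
	fmt.Printf("first byte: 0x%02x, length: %d\n", cid[0], len(cid))
}
```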
If none of the rules match, the packet is passed to the raw-other chain. This can happen when we receive a Short Header packet for a connection that was handled by a backend server that crashed, or that was shut down after completion of the graceful shutdown process (see below).
Routing based on the connection ID allows multiple QUIC connections that share the same 4-tuple to coexist, so that different QUIC flows on the same 4-tuple can be consistently directed to the same backend.
Stateless Resets
Stateless Resets (see Section 10.3 of RFC 9000) are used to immediately terminate QUIC connections for which the server has lost state. This can happen when the server crashes (e.g. due to a bug). Depending on where such a crash happened, the faulty server can deregister itself from the load balancer, or, as a backstop, the load balancer can detect the crash by performing regular health checks. An incoming Short Header packet for an existing connection will then neither match a rule in the raw-short-header chain nor have an entry in the flows map, so it will be routed to a random backend server (via the raw-other chain).
It is therefore important to configure the backend servers such that they are able to generate stateless resets for each other’s connections. The QUIC specification leaves it open how servers generate stateless reset tokens. quic-go uses an HMAC to derive the stateless reset token from a secret key and the connection ID. By setting the same StatelessResetKey in the quic.Config on all backend servers, they will be able to generate stateless resets for each other’s connections.
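One plausible HMAC-based construction (a sketch of the idea, not necessarily quic-go’s exact derivation) computes the 16-byte token from the shared key and the connection ID:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// statelessResetToken derives the 16-byte stateless reset token for a
// connection ID from a key shared across all backend servers.
func statelessResetToken(key [32]byte, connID []byte) [16]byte {
	mac := hmac.New(sha256.New, key[:])
	mac.Write(connID)
	var token [16]byte
	copy(token[:], mac.Sum(nil)) // truncate HMAC-SHA256 to 16 bytes
	return token
}

func main() {
	var key [32]byte // in practice: a random secret, identical on all backends
	connID := []byte{0x01, 0xde, 0xad, 0xbe, 0xef, 0x42, 0x42, 0x42}
	t1 := statelessResetToken(key, connID)
	t2 := statelessResetToken(key, connID)
	// Any backend holding the same key derives the same token, so it can
	// reset connections it has no state for.
	fmt.Println(t1 == t2)
}
```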
Return Path
Finally, we need the output chain to handle traffic in the reply direction:
chain output {
    type filter hook output priority raw; policy accept;
    ip saddr 192.168.0.1 udp sport 8080 ct zone set 1
    ip saddr 192.168.0.2 udp sport 8081 ct zone set 2
    ip saddr 192.168.0.3 udp sport 8082 ct zone set 3
}
When a backend server sends a reply packet, the source IP and port belong to the backend itself. This chain simply restores the correct ct zone (based on the backend’s address and port) so that conntrack can match the reply against the original flow we created earlier. Once the zone is set, the kernel can perform the reverse NAT, rewriting the source address and port back to the load balancer’s public IP. Without this chain, return traffic would break.
Graceful Shutdown
This setup allows backend servers to be added or removed from the load balancer without downtime. This can be very useful when deploying new backend servers (to respond to increased traffic) or when removing backend servers (to perform maintenance or upgrades). In any case, we want to make sure that existing connections are not affected by this change.
When the application protocol is HTTP/3, or any other protocol built on HTTP/3, such as WebTransport or one of the MASQUE proxying protocols, we can make use of HTTP/3’s graceful shutdown mechanism (section 5.2 of RFC 9114) to inform the client that the server is shutting down: The server sends a GOAWAY on the HTTP/3 control stream. This lets the client know that existing requests will still be processed, but no new requests can be sent on the connection. The client will establish a new QUIC connection, which, as we will see, will then be routed to another backend server.
While graceful shutdown was trivial for TCP (the kernel handles the entire TCP state machine for us), we need to implement it ourselves for QUIC. Graceful shutdown is a two-step process:
- The server is marked as shutting down: the load balancer stops sending new connections to the server, but continues routing existing connections (and the respective return path). When using HTTP/3, the server can send a GOAWAY to initiate the application-layer graceful shutdown.
- After a while, the server is deregistered entirely from the load balancer, and turned off. This can be done after a pre-defined shutdown period, when the connection count drops to 0, or a combination of both. This shutdown logic is implemented on the backend servers themselves; the load balancer only responds to the deregistration request.
Let’s have a look at how this works in practice. We’ll look at the changes to lb-udp when the third backend server (192.168.0.3:8082) starts graceful shutdown, and a new backend server (192.168.0.4:8083) is added to the load balancer.
table ip lb-udp {
    map backend-zones {
        type mark : mark
-       elements = { 0x00000000 : 1, 0x00000001 : 2, 0x00000002 : 3 }
+       elements = { 0x00000000 : 1, 0x00000001 : 2, 0x00000002 : 4 }
    }
    map flows {
        type ipv4_addr : mark
        size 65536
        flags dynamic,timeout
        timeout 5s
    }
    chain raw {
        type filter hook prerouting priority raw; policy accept;
        udp dport 443 @th,64,8 & 0x80 == 0x0 jump raw-short-header
        udp dport 443 @th,64,8 & 0x80 == 0x80 jump raw-other
    }
    chain raw-short-header {
+       @th,72,8 0x3 ct zone set 4 return
        @th,72,8 0x2 ct zone set 3 return
        @th,72,8 0x1 ct zone set 2 return
        @th,72,8 0x0 ct zone set 1 return
        jump raw-other
    }
    chain raw-other {
        ct zone set ip saddr map @flows return
        meta mark set numgen random mod 3
        ct zone set meta mark map @backend-zones add @flows { ip saddr : meta mark map @backend-zones }
    }
    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;
+       ct zone 4 dnat to 192.168.0.4:8083
        ct zone 3 dnat to 192.168.0.3:8082
        ct zone 2 dnat to 192.168.0.2:8081
        ct zone 1 dnat to 192.168.0.1:8080
    }
    chain output {
        type filter hook output priority raw; policy accept;
        ip saddr 192.168.0.1 udp sport 8080 ct zone set 1
        ip saddr 192.168.0.2 udp sport 8081 ct zone set 2
        ip saddr 192.168.0.3 udp sport 8082 ct zone set 3
+       ip saddr 192.168.0.4 udp sport 8083 ct zone set 4
    }
}
The backend-zones map is updated: the entry for mark 0x00000002 now maps to zone 4 (the new server) instead of zone 3 (the old server). Since numgen random mod 3 still generates values 0, 1, and 2, but mark 0x00000002 now resolves to the new server’s zone, new connections are never routed to the old server.
Crucially, the entries for the old server are not removed from the raw-short-header, prerouting, and output chains. This ensures that packets for existing connections are still routed to the old server, and that packets originating from the old server are still routed back to the clients.
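The effect of the remapping can be checked with a small sketch (illustrative Go, names are mine): enumerating every mark that numgen random mod 3 can produce shows that zone 3 (the draining server) is no longer reachable for new connections, while zone 4 is:

```go
package main

import "fmt"

// reachableZones enumerates all marks that numgen can produce and
// collects the zones they resolve to via the backend-zones map.
func reachableZones(backendZones map[uint32]int, numMarks uint32) map[int]bool {
	reachable := map[int]bool{}
	for mark := uint32(0); mark < numMarks; mark++ {
		reachable[backendZones[mark]] = true
	}
	return reachable
}

func main() {
	// backend-zones after the update: mark 0x00000002 now points to zone 4.
	zones := reachableZones(map[uint32]int{0: 1, 1: 2, 2: 4}, 3)
	fmt.Println(zones[3], zones[4]) // zone 3 unreachable, zone 4 reachable
}
```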
Now let’s have a look at how lb-udp changes once the old server finishes graceful shutdown, and is deregistered from the load balancer.
table ip lb-udp {
    map backend-zones {
        type mark : mark
        elements = { 0x00000000 : 1, 0x00000001 : 2, 0x00000002 : 4 }
    }
    map flows {
        type ipv4_addr : mark
        size 65536
        flags dynamic,timeout
        timeout 5s
    }
    chain raw {
        type filter hook prerouting priority raw; policy accept;
        udp dport 443 @th,64,8 & 0x80 == 0x0 jump raw-short-header
        udp dport 443 @th,64,8 & 0x80 == 0x80 jump raw-other
    }
    chain raw-short-header {
        @th,72,8 0x3 ct zone set 4 return
-       @th,72,8 0x2 ct zone set 3 return
        @th,72,8 0x1 ct zone set 2 return
        @th,72,8 0x0 ct zone set 1 return
        jump raw-other
    }
    chain raw-other {
        ct zone set ip saddr map @flows return
        meta mark set numgen random mod 3
        ct zone set meta mark map @backend-zones add @flows { ip saddr : meta mark map @backend-zones }
    }
    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;
        ct zone 4 dnat to 192.168.0.4:8083
-       ct zone 3 dnat to 192.168.0.3:8082
        ct zone 2 dnat to 192.168.0.2:8081
        ct zone 1 dnat to 192.168.0.1:8080
    }
    chain output {
        type filter hook output priority raw; policy accept;
        ip saddr 192.168.0.1 udp sport 8080 ct zone set 1
        ip saddr 192.168.0.2 udp sport 8081 ct zone set 2
-       ip saddr 192.168.0.3 udp sport 8082 ct zone set 3
        ip saddr 192.168.0.4 udp sport 8083 ct zone set 4
    }
}
All we need to do is remove the server’s entries from the raw-short-header, prerouting, and output chains. Should any delayed packet be received after the server has been deregistered (e.g. because a client chose to disregard the GOAWAY and continued sending packets), it will be routed to a random backend server, where it will trigger a stateless reset, which immediately terminates the connection.
Conclusion
This approach gives us a simple yet robust way to load-balance QUIC traffic using only nftables. While the ruleset is admittedly more involved than its TCP counterpart, it delivers true zero-downtime scaling: backend servers can be added or removed at any time without dropping a single connection or requiring client-side changes. I implemented the complete configuration shown here (including the dynamic registration logic) using the Go nftables library.
The quic-go project and the QUIC Interop Runner are community-funded projects.
If you find my work useful, please consider sponsoring: