Controlling IP Fragmentation for Path MTU Discovery

IP Fragmentation

When sending packets over the network, every hop along the path needs to read the packet, make a routing decision, and forward the packet to the next hop. Most routers operate with a fixed buffer size, and if they encounter a packet that exceeds this size, they will do one of two things:

  1. Drop the packet, or
  2. Fragment the packet into smaller pieces (only possible with IPv4, not with IPv6).

This limit is called the Maximum Transmission Unit (MTU).

QUIC defines a minimum packet size of 1200 bytes and requires the client to pad the first packets sent on a new connection to at least this size. This ensures that the QUIC handshake fails if the network path doesn’t support the minimum MTU.

To achieve high rates of throughput on a QUIC connection, it is desirable to send QUIC packets as large as possible, for multiple reasons:

  • Every packet comes with an encapsulation overhead: IP and UDP headers, and of course the QUIC packet header itself. The IP header weighs 20 bytes for IPv4 (without options), and 40 bytes for IPv6. The UDP header weighs 8 bytes. The QUIC short header has somewhere between 2 and 25 bytes: one byte of flags, a connection ID of 0 to 20 bytes, and a packet number of 1 to 4 bytes.
  • In terms of CPU cycles, there is a processing overhead for every packet, as the packet and the packet header need to be composed and encrypted on the sender side, and decrypted and processed on the receiver side.
  • Queues in the network might allocate a fixed-size buffer for each incoming packet, and the unused space in the buffer is essentially wasted.
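
To put rough numbers on the first point, here is a minimal sketch that computes how much of an IP packet is left for QUIC frames. The function names are made up for illustration, the QUIC header length is passed in by the caller because it depends on the connection ID length, and everything other than the three headers (for example the authentication tag) is ignored.

```c
#include <stddef.h>

/* Header sizes taken from the list above. */
enum { IPV4_HEADER = 20, IPV6_HEADER = 40, UDP_HEADER = 8 };

/* Bytes left after the IP, UDP and QUIC headers in an IP packet of
 * ip_packet_size bytes. Returns 0 if the headers do not fit. */
size_t quic_payload_bytes(size_t ip_packet_size, int is_ipv6, size_t quic_header) {
    size_t overhead = (is_ipv6 ? IPV6_HEADER : IPV4_HEADER) + UDP_HEADER + quic_header;
    return ip_packet_size > overhead ? ip_packet_size - overhead : 0;
}

/* Header overhead as a percentage of the whole IP packet.
 * ip_packet_size must be greater than 0. */
double overhead_percent(size_t ip_packet_size, int is_ipv6, size_t quic_header) {
    size_t payload = quic_payload_bytes(ip_packet_size, is_ipv6, quic_header);
    return 100.0 * (double)(ip_packet_size - payload) / (double)ip_packet_size;
}
```

Assuming a 9-byte QUIC header, a 1500-byte IPv4 packet leaves 1463 bytes (roughly 2.5% overhead), while a 576-byte packet spends about 6.4% on the same headers.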

As we’ve mentioned before, IPv4 allows routers to split packets if they exceed the MTU of the outgoing link. This is undesirable in (almost) all cases: the network now needs to deal with at least twice as many packets, and the receiver’s kernel has to reassemble the packet before the receiving application can process it. It also means that if any of the fragments is lost, the entire packet needs to be retransmitted. For this reason, IPv6 does not allow routers to fragment packets.
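
To make the cost concrete, the following sketch counts the packets that come out of a fragmenting router. It assumes a 20-byte IPv4 header without options and relies on the IPv4 rule that every fragment except the last carries a multiple of 8 payload bytes. The function names are illustrative only.

```c
#include <stddef.h>

enum { IPV4_HEADER_LEN = 20 }; /* assumption: IPv4 header without options */

/* Number of IPv4 packets on the wire after a packet of packet_len bytes
 * (header included) is fragmented for a link with the given MTU.
 * Returns 0 if the MTU is too small to carry any fragment payload. */
size_t ipv4_fragment_count(size_t packet_len, size_t mtu) {
    if (packet_len <= mtu) return 1; /* fits, no fragmentation */
    if (mtu < IPV4_HEADER_LEN + 8) return 0;
    size_t payload = packet_len - IPV4_HEADER_LEN;
    size_t per_fragment = ((mtu - IPV4_HEADER_LEN) / 8) * 8; /* round down to 8 */
    return (payload + per_fragment - 1) / per_fragment;      /* round up */
}

/* Total bytes on the wire: every fragment repeats the IP header. */
size_t ipv4_fragmented_bytes(size_t packet_len, size_t mtu) {
    size_t n = ipv4_fragment_count(packet_len, mtu);
    if (n == 0) return 0;
    if (n == 1) return packet_len;
    return (packet_len - IPV4_HEADER_LEN) + n * IPV4_HEADER_LEN;
}
```

A 1500-byte packet that hits a link with an MTU of 1400 bytes turns into 2 packets and 1520 bytes on the wire; at an MTU of 576 bytes it turns into 3 packets.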

Instead of fragmenting packets, routers can be instructed to drop packets that exceed the outgoing link’s MTU. This uses the “Don’t Fragment” (DF) flag in the IPv4 header. IPv6 always drops oversized packets, so it does not have a DF flag.

While IPv6 routers are not permitted to fragment IPv6 packets, the sending host can still break up an IPv6 packet into multiple fragments before it leaves the machine.

In this blog post, we’ll dive into my most recent IETF draft (draft-seemann-tsvwg-udp-fragmentation), which describes how the operating system can be instructed to set the DF flag. Unfortunately, this is not as simple as it sounds, and every operating system (and versions thereof) handles this in slightly different ways.

ICMP “Packet Too Big”

If a router drops a packet because it exceeds the MTU of the outgoing link, it sends an ICMP “Packet Too Big” message back to the sender. This message contains the maximum transmission unit of the outgoing link, allowing the sender to reduce the size of the next packet sent.
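
As an illustration of where that MTU value lives, here is a small parser sketch. On IPv6 the message is ICMPv6 type 2 (Packet Too Big) with a 32-bit MTU field; on IPv4 the equivalent is a Destination Unreachable message (type 3) with code 4, "fragmentation needed", carrying a 16-bit next-hop MTU. The function name is hypothetical, and the input is assumed to start at the ICMP header.

```c
#include <stddef.h>
#include <stdint.h>

/* Extract the MTU reported by an ICMP message.
 * Returns 0 if the message is truncated or is not a "too big" report. */
uint32_t icmp_reported_mtu(const uint8_t *msg, size_t len, int is_ipv6) {
    if (len < 8) return 0;
    if (is_ipv6) {
        /* ICMPv6 Packet Too Big: type 2, MTU in bytes 4..7 (big endian). */
        if (msg[0] != 2) return 0;
        return ((uint32_t)msg[4] << 24) | ((uint32_t)msg[5] << 16) |
               ((uint32_t)msg[6] << 8) | (uint32_t)msg[7];
    }
    /* ICMPv4 Destination Unreachable (type 3), code 4:
     * next-hop MTU in bytes 6..7 (big endian). */
    if (msg[0] != 3 || msg[1] != 4) return 0;
    return ((uint32_t)msg[6] << 8) | (uint32_t)msg[7];
}
```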

This mechanism can be used by the kernel to implement Path MTU Discovery (PMTUD) for TCP connections. By probing different packet sizes, the sender can find the maximum transmission unit of the path, and use that to set the size of the TCP segments sent.

Path MTU Discovery for QUIC

This mechanism might not work well for QUIC: ICMP packets are routed on a best-effort basis and may not reach the right sender if load balancing based on connection IDs is in use. They also aren’t encrypted or authenticated, so an attacker could inject spoofed or modified ICMP packets.

The good news is that QUIC stacks already track the fate of every packet sent (after all, if a packet is lost, the sender needs to retransmit it). It is therefore possible to implement Path MTU Discovery without relying on ICMP. To do this, a QUIC stack can occasionally send a larger packet and observe whether it is acknowledged by the receiver. If it is, the sender knows that the network path can handle larger packets. If it is not, the sender can suspect that the packet was dropped due to its size. Importantly, the packet could also be dropped due to some other reason, for example due to normal congestion on the path.

RFC 8899 defines an algorithm for probing the MTU of a path that deals with the above issues (and many more, such as an MTU that shrinks during the lifetime of a connection). It goes by the lovely acronym DPLPMTUD: Datagram Packetization Layer Path MTU Discovery.
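
The core idea of probing can be sketched as a binary search. This is a deliberately simplified illustration, not the RFC 8899 state machine and not quic-go's implementation: the probe callback, the retry count of 3, and the simulated path are all assumptions made for the example. A real stack learns about probe outcomes asynchronously from its loss detection.

```c
#include <stddef.h>

/* Sends one probe of the given size and reports whether it was
 * acknowledged. Hypothetical interface, synchronous for simplicity. */
typedef int (*probe_fn)(size_t size, void *ctx);

enum { PROBE_ATTEMPTS = 3 }; /* assumption: retry to tolerate unrelated loss */

/* Binary search for the largest size in [confirmed, upper] that gets
 * acknowledged. `confirmed` must already be known to work (for QUIC,
 * 1200 bytes once the handshake has succeeded). */
size_t search_max_packet_size(size_t confirmed, size_t upper, probe_fn probe, void *ctx) {
    while (confirmed < upper) {
        size_t candidate = confirmed + (upper - confirmed + 1) / 2; /* round up */
        int acked = 0;
        for (int i = 0; i < PROBE_ATTEMPTS && !acked; i++) {
            acked = probe(candidate, ctx);
        }
        if (acked) {
            confirmed = candidate; /* the path carries this size */
        } else {
            upper = candidate - 1; /* repeated loss: assume it is too big */
        }
    }
    return confirmed;
}

/* Simulated path for experiments: ctx points to the path's real limit. */
int simulated_path(size_t size, void *ctx) {
    return size <= *(const size_t *)ctx;
}
```

Retrying each size a few times is what guards against mistaking ordinary congestion loss for a size limit; the price is that confirming a "too big" verdict takes several round trips.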

quic-go (partially) implements this algorithm, and allows the application to set an initial packet size.

Controlling IP Fragmentation for IPv4 and IPv6 Sockets

The following table summarizes how the different operating systems handle IP fragmentation.

                                  Linux                macOS          Windows
Enabling DF on IPv4               IP_PMTUDISC_PROBE    IP_DONTFRAG    IP_DONTFRAGMENT
Preventing fragmentation on IPv6  IPV6_PMTUDISC_PROBE  IPV6_DONTFRAG  IPV6_DONTFRAG

Annoyingly, every operating system chose different names for the same feature.

For example, to enable the DF bit for IPv4, and to disable fragmentation for IPv6, on Linux one would set the IP_MTU_DISCOVER and IPV6_MTU_DISCOVER socket options to the respective PROBE value:

int mode = IP_PMTUDISC_PROBE;
setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode)); // IPv4
int mode6 = IPV6_PMTUDISC_PROBE;
setsockopt(fd, IPPROTO_IPV6, IPV6_MTU_DISCOVER, &mode6, sizeof(mode6)); // IPv6
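
The differences between platforms can be hidden behind a small helper. The sketch below is one possible shape, not code from quic-go: the function name is made up, Windows is left out, and the second branch only compiles where the system headers expose both IP_DONTFRAG and IPV6_DONTFRAG. Note that on Linux the PMTUDISC constants are passed as the value of the IP_MTU_DISCOVER and IPV6_MTU_DISCOVER options.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask the kernel not to fragment packets sent on fd.
 * family is AF_INET or AF_INET6. Returns 0 on success and -1 with errno
 * set on failure. Windows (IP_DONTFRAGMENT) is not covered here. */
int set_dont_fragment(int fd, int family) {
#if defined(__linux__)
    /* Linux: the PMTUDISC mode is the option value, not the option name. */
    int mode = (family == AF_INET6) ? IPV6_PMTUDISC_PROBE : IP_PMTUDISC_PROBE;
    if (family == AF_INET6)
        return setsockopt(fd, IPPROTO_IPV6, IPV6_MTU_DISCOVER, &mode, sizeof(mode));
    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode));
#elif defined(IP_DONTFRAG) && defined(IPV6_DONTFRAG)
    /* macOS and other BSD-derived systems: a boolean socket option. */
    int enable = 1;
    if (family == AF_INET6)
        return setsockopt(fd, IPPROTO_IPV6, IPV6_DONTFRAG, &enable, sizeof(enable));
    return setsockopt(fd, IPPROTO_IP, IP_DONTFRAG, &enable, sizeof(enable));
#else
    (void)fd;
    (void)family;
    errno = ENOTSUP; /* no known option on this platform */
    return -1;
#endif
}
```

Checking the return value matters here: if the option cannot be set, oversized probe packets might be fragmented instead of dropped, and the probing result would be meaningless.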

IP_PMTUDISC_PROBE vs IP_PMTUDISC_DO

On Linux, in addition to IP_PMTUDISC_PROBE and IPV6_PMTUDISC_PROBE, there are also IP_PMTUDISC_DO and IPV6_PMTUDISC_DO, respectively. In terms of controlling the fragmentation behavior, they do almost exactly the same thing: they enable the DF bit on IPv4 and prevent the kernel from fragmenting IPv6 packets.

The difference lies in the handling of ICMP messages: with IP_PMTUDISC_DO and IPV6_PMTUDISC_DO set, the kernel will process ICMP “Packet Too Big” messages, and return an error if the application attempts to send a packet that exceeds the MTU advertised in the ICMP message.

For QUIC, this is a security vulnerability: an attacker can inject false ICMP “Packet Too Big” messages that reduce the MTU below the minimum QUIC packet size of 1200 bytes, thereby effectively blocking the connection. quic-go used to be vulnerable to this attack, and we published a security advisory (CVE-2024-53259). We fixed this vulnerability in #4729 by using IP_PMTUDISC_PROBE and IPV6_PMTUDISC_PROBE, respectively.
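
The difference between the two modes can be illustrated with a toy model. This is not kernel code; the struct and function names are invented, and the model only captures the behavior described above: in DO mode an ICMP report lowers the cached path MTU and larger sends are refused, while in PROBE mode the report is ignored.

```c
#include <stddef.h>

enum pmtu_mode { MODE_DO, MODE_PROBE };

/* Toy stand-in for the kernel's per-destination state. */
struct toy_socket {
    enum pmtu_mode mode;
    size_t cached_pmtu; /* path MTU the kernel currently believes in */
};

/* Models the arrival of an (unauthenticated) ICMP "Packet Too Big". */
void toy_on_packet_too_big(struct toy_socket *s, size_t reported_mtu) {
    if (s->mode == MODE_DO && reported_mtu < s->cached_pmtu) {
        s->cached_pmtu = reported_mtu; /* DO: the report is believed */
    }
    /* PROBE: the report is ignored */
}

/* Returns 0 if the packet would be sent, -1 if the send would be refused. */
int toy_send(const struct toy_socket *s, size_t ip_packet_len) {
    if (s->mode == MODE_DO && ip_packet_len > s->cached_pmtu) {
        return -1; /* the real kernel reports EMSGSIZE in this case */
    }
    return 0;
}
```

After a single spoofed report of 600 bytes, the DO socket can no longer send a 1228-byte IPv4 packet (1200 bytes of QUIC plus 28 bytes of IP and UDP headers), while the PROBE socket is unaffected.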

Controlling IP Fragmentation for Dual-Stack Sockets

Dual-stack sockets can send and receive IPv4 and IPv6 packets. They are implemented differently across operating systems.

Importantly for quic-go, the Go standard library creates a dual-stack socket when a UDP socket is created the standard way:

net.ListenUDP("udp", <local address>)

Linux

On Linux, dual-stack sockets work as expected: you can set IP_PMTUDISC_PROBE and IPV6_PMTUDISC_PROBE independently to control fragmentation for IPv4 and IPv6.

macOS

macOS is a completely different story, and the behavior is not well documented. It also depends on the version of macOS.

A dual-stack socket is created by first creating an IPv6 socket, and then unsetting the IPV6_V6ONLY socket option:

int disable = 0;
setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &disable, sizeof(disable));

It is then possible to send and receive IPv4 packets on this socket. When handling IPv4 packets on this socket, one needs to use IPv4-mapped IPv6 addresses (for example ::ffff:192.0.2.1).
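
For illustration, this is how such a mapped address can be built by hand: the first 10 bytes are zero, the next two are 0xff, and the last four hold the IPv4 address. The helper name is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Build the IPv4-mapped IPv6 address (::ffff:a.b.c.d) for an IPv4 address
 * and port, suitable for sendto() on a dual-stack socket.
 * Returns 0 on success, -1 if ipv4_text is not a valid IPv4 address. */
int make_v4mapped(const char *ipv4_text, uint16_t port, struct sockaddr_in6 *out) {
    struct in_addr v4;
    if (inet_pton(AF_INET, ipv4_text, &v4) != 1) return -1;
    memset(out, 0, sizeof(*out));
    out->sin6_family = AF_INET6;
    out->sin6_port = htons(port);
    out->sin6_addr.s6_addr[10] = 0xff;
    out->sin6_addr.s6_addr[11] = 0xff;
    memcpy(&out->sin6_addr.s6_addr[12], &v4, sizeof(v4)); /* already network order */
    return 0;
}
```

Addresses of received IPv4 packets arrive in the same form and can be recognized with the standard IN6_IS_ADDR_V4MAPPED macro.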

Before the current macOS version (15, Sequoia), setting IPV6_DONTFRAG didn’t have any effect on the DF bit in the IPv4 header. Setting IP_DONTFRAG was not possible either, since macOS regards the socket as an IPv6 socket. Effectively, it was therefore not possible to set the DF bit on dual-stack sockets, which is why quic-go disables DPLPMTUD on older macOS versions.

This bug was fixed in macOS 15. It’s still not possible to set IP_DONTFRAG on dual-stack sockets, but IPV6_DONTFRAG now controls the fragmentation behavior for IPv6 and IPv4. This is still not ideal in the general case, but it is sufficient for running Path MTU Discovery for QUIC.

Windows

Windows dual-stack sockets are also created by disabling IPV6_V6ONLY, but Windows does allow setting IP_DONTFRAGMENT and IPV6_DONTFRAG separately for IPv4 and IPv6.