Exploiting QUIC’s Connection ID Management

QUIC’s connection ID issuance mechanism is vulnerable to a resource exhaustion attack similar to the recently reported attack against QUIC’s path validation mechanism.

I discovered this vulnerability in December 2023 and disclosed it to the IETF QUIC working group. Among the 17 QUIC stacks surveyed, 11 were found to be vulnerable, including my own (quic-go), Cloudflare’s quiche, Mozilla’s Neqo, LiteSpeed’s lsquic, and Microsoft’s MsQuic. Due to the large number of affected implementations, and the lengthy release cycles of some of them, the vulnerability was only publicly disclosed on March 12th. Since then, most affected implementations have released fixes.

In this post, we’ll dive into how QUIC uses connection IDs, and how the current protocol mechanism introduces this vulnerability. We’ll also take a step back and explore the lessons to be learned from these recent attacks.

QUIC Connection IDs

Whereas TCP connections are identified by their 4-tuple (i.e. the combination of the sender’s and the receiver’s IP address and TCP port), QUIC uses connection IDs to demultiplex packets belonging to different QUIC connections. This means that it is possible to run an unlimited number of QUIC connections on the same UDP socket, and even run a QUIC server and multiple outgoing QUIC connections on the same socket.
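
As a concrete illustration, here’s a minimal sketch using quic-go’s Transport API (the exact API shape is an assumption based on recent quic-go versions) that runs a QUIC server and an outgoing QUIC connection on the same UDP socket:

package main

import (
	"context"
	"crypto/tls"
	"log"
	"net"

	"github.com/quic-go/quic-go"
)

func main() {
	udpConn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 4242})
	if err != nil {
		log.Fatal(err)
	}
	// A single Transport wraps a single UDP socket. Incoming packets are
	// demultiplexed to the right QUIC connection using their connection ID.
	tr := &quic.Transport{Conn: udpConn}

	// Accept incoming QUIC connections on this socket...
	ln, err := tr.Listen(&tls.Config{ /* server certificates go here */ }, &quic.Config{})
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		for {
			conn, err := ln.Accept(context.Background())
			if err != nil {
				return
			}
			_ = conn // hand off to application logic (not shown)
		}
	}()

	// ...while at the same time dialing out from the very same socket.
	remote, err := net.ResolveUDPAddr("udp", "example.com:443")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := tr.Dial(context.Background(), remote, &tls.Config{ServerName: "example.com"}, &quic.Config{})
	if err != nil {
		log.Fatal(err)
	}
	_ = conn
}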

Contrary to what the name may suggest, a single QUIC connection is not identified by a single connection ID. Connection IDs change several times during the QUIC handshake and may also change during the lifetime of the connection. It’s also important to keep in mind that connection IDs are unidirectional, i.e. the client uses a connection ID generated by the server when sending packets to the server, and vice versa.

Privacy Considerations for Connection Migration

The design around changing connection IDs is primarily motivated by privacy, especially during connection migration, to prevent on-path observers from tracking the connection. If a client migrates to a new path (e.g. from a WiFi to a cellular connection) and keeps sending and receiving packets with the same connection ID, an on-path observer can track the client across the migration. Privacy-wise, this would put QUIC in a worse spot than TCP: since TCP doesn’t support connection migration, the client would dial a new TCP connection, and the observer wouldn’t be able to (easily) correlate the two connections. Therefore, QUIC endpoints use a new connection ID whenever they switch to a new network path.

Load Balancing based on Connection IDs

Turning towards typical server deployments, connection IDs serve yet another purpose: a QUIC-aware load balancer can route QUIC packets to a specific backend server by inspecting the connection ID. This requires the server and the load balancer to agree on some kind of encoding scheme, and the QUIC working group is working on a document (called QUIC-LB) to standardize a protocol for exactly this use case. The ability to load-balance is the reason why QUIC version 1 allows connection IDs of up to 20 bytes in length: this gives the load balancer room to encode routing information and even attach a (cryptographic) signature.
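
To make this concrete, here’s a hedged sketch of what such an encoding could look like. This is a hypothetical scheme, not the QUIC-LB wire format: the first byte carries a config rotation, the next two bytes the backend server ID, and the rest is random:

package cidrouting

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// EncodeConnectionID builds a routing-aware connection ID: 1 byte config
// rotation, 2 bytes server ID, random padding up to the requested length.
// This layout is illustrative, not the QUIC-LB encoding.
func EncodeConnectionID(configRotation uint8, serverID uint16, length int) ([]byte, error) {
	if length < 3 || length > 20 { // QUIC v1 caps connection IDs at 20 bytes
		return nil, fmt.Errorf("invalid connection ID length: %d", length)
	}
	cid := make([]byte, length)
	cid[0] = configRotation
	binary.BigEndian.PutUint16(cid[1:3], serverID)
	if _, err := rand.Read(cid[3:]); err != nil {
		return nil, err
	}
	return cid, nil
}

// ServerID extracts the routing information; this is what the load
// balancer would do for every incoming packet.
func ServerID(cid []byte) uint16 {
	return binary.BigEndian.Uint16(cid[1:3])
}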

Some implementations go even further and use Receive Side Scaling (RSS) to bind QUIC connections to a particular CPU core, increasing cache locality and therefore throughput, all based on the packet’s connection ID.

Providing and Retiring Connection IDs

As mentioned above, endpoints need to use connection IDs chosen by their peer. QUIC therefore defines how fresh connection IDs are supplied, and how old connection IDs can be deactivated. New connection IDs are - unsurprisingly - provided in NEW_CONNECTION_ID frames. Here’s how the frame looks on the wire:

NEW_CONNECTION_ID Frame {
  Type (i) = 0x18,
  Sequence Number (i),
  Retire Prior To (i),
  Length (8),
  Connection ID (8..160),
  Stateless Reset Token (128),
}

Each connection ID is identified by a sequence number, is between 1 and 20 bytes in length, and comes with a Stateless Reset Token. We won’t dive into Stateless Resets in this post.
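
To illustrate the fields and the validation rules RFC 9000 attaches to them, here’s a hedged Go sketch (the struct and method names are mine, not taken from any particular implementation):

package frames

import "errors"

// NewConnectionIDFrame mirrors the wire image shown above.
type NewConnectionIDFrame struct {
	SequenceNumber      uint64   // identifies this connection ID
	RetirePriorTo       uint64   // asks the peer to retire all CIDs with lower sequence numbers
	ConnectionID        []byte   // 1 to 20 bytes in QUIC v1
	StatelessResetToken [16]byte // used for Stateless Resets (not covered here)
}

// Validate applies the checks RFC 9000 requires on receipt of the frame.
func (f *NewConnectionIDFrame) Validate() error {
	if len(f.ConnectionID) < 1 || len(f.ConnectionID) > 20 {
		return errors.New("FRAME_ENCODING_ERROR: invalid connection ID length")
	}
	if f.RetirePriorTo > f.SequenceNumber {
		return errors.New("FRAME_ENCODING_ERROR: Retire Prior To larger than Sequence Number")
	}
	return nil
}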

Connection IDs can be “retired” by the endpoint that would use the connection ID on outgoing packets. QUIC defines a RETIRE_CONNECTION_ID frame for this purpose:

RETIRE_CONNECTION_ID Frame {
  Type (i) = 0x19,
  Sequence Number (i),
}

This is necessary when a client uses a connection ID to probe a new path, but path validation fails. To preserve the unlinkability properties discussed above, this connection ID should not be reused on a different path. Instead, the client retires the connection ID, and the server might issue a new connection ID as a replacement.

The protocol comes with protection against Denial of Service (DoS) attacks using connection IDs. During the handshake, endpoints declare how many connection IDs they are willing to keep track of using the active_connection_id_limit transport parameter. This effectively limits the number of new paths that can be probed at the same time. In practice, most QUIC implementations use a (low) single-digit number.

There’s also a way to ask the peer to retire older connection IDs (i.e. those with lower sequence numbers) via the Retire Prior To field. This is useful when connection ID-based load balancing and / or RSS is in use and the routing configuration changes: an endpoint can provide a new set of connection IDs to the peer and quickly retire the old ones.

The receiver of the NEW_CONNECTION_ID frame needs to explicitly retire these connection IDs: for every connection ID retired this way, it needs to send a RETIRE_CONNECTION_ID frame.
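
A hedged sketch of this bookkeeping (the types are hypothetical; real implementations track more state, such as stateless reset tokens):

package connid

import "errors"

// manager tracks the connection IDs issued by the peer.
type manager struct {
	active      map[uint64][]byte // sequence number -> connection ID
	retireQueue []uint64          // RETIRE_CONNECTION_ID frames waiting to be sent
	activeLimit int               // our active_connection_id_limit
}

func newManager(activeLimit int) *manager {
	return &manager{active: make(map[uint64][]byte), activeLimit: activeLimit}
}

// handleNewConnectionID stores the new connection ID and retires everything
// below Retire Prior To, queueing one RETIRE_CONNECTION_ID frame per
// retired connection ID.
func (m *manager) handleNewConnectionID(seq, retirePriorTo uint64, cid []byte) error {
	m.active[seq] = cid
	for s := range m.active {
		if s < retirePriorTo {
			delete(m.active, s)
			// The peer expects an explicit RETIRE_CONNECTION_ID for each one.
			m.retireQueue = append(m.retireQueue, s)
		}
	}
	// Retired connection IDs don't count towards the limit.
	if len(m.active) > m.activeLimit {
		return errors.New("CONNECTION_ID_LIMIT_ERROR")
	}
	return nil
}

Note that retireQueue in this sketch is unbounded. That’s exactly the weakness the next section exploits.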

The Attack

If you’ve read the last post about the DoS attack against QUIC’s path validation mechanism, you might see where this is going: the active_connection_id_limit in fact doesn’t protect against a flood of connection IDs, since the attacker can immediately retire newly issued connection IDs. This way, the number of active connection IDs always stays below the peer’s limit.

sequenceDiagram
    participant Attacker
    participant Victim
    Attacker->>Victim: NEW_CONNECTION_ID { Sequence: 1, Retire Prior To: 1 }
    Note right of Victim: queue RETIRE_CONNECTION_ID { Sequence: 0 }
    Attacker->>Victim: NEW_CONNECTION_ID { Sequence: 2, Retire Prior To: 2 }
    Note right of Victim: queue RETIRE_CONNECTION_ID { Sequence: 1 }
    Attacker->>Victim: NEW_CONNECTION_ID { Sequence: 3, Retire Prior To: 3 }
    Note right of Victim: queue RETIRE_CONNECTION_ID { Sequence: 2 }
    Attacker->>Victim: NEW_CONNECTION_ID { Sequence: 4, Retire Prior To: 4 }
    Note right of Victim: queue RETIRE_CONNECTION_ID { Sequence: 3 }

However, for every such NEW_CONNECTION_ID frame received, the receiver now needs to send a RETIRE_CONNECTION_ID frame. On asymmetric paths, it might not be able to send out RETIRE_CONNECTION_ID frames as quickly as it receives NEW_CONNECTION_ID frames, leading to the buildup of an ever-growing queue of RETIRE_CONNECTION_ID frames.

This situation is exacerbated by the fact that the attacker can manipulate the rate at which the victim sends data, for example by delaying acknowledgements (thereby inflating the victim’s RTT estimate), or by selectively acknowledging packets (thereby misleading the victim into believing that packets were lost, which causes its congestion window to collapse).

General Lessons Learned

This vulnerability, the recent attack on QUIC’s path validation mechanism, the HTTP/2 Ping Flood and the HTTP/2 Rapid Reset attack all share some similarities.

Protocol Mechanisms without Flow Control

The attack on QUIC’s path validation mechanism and the HTTP/2 Ping Flood involved the sending of frames without flow control measures in place, typically because these frames are expected to be sent infrequently.

Introducing explicit flow control for infrequent frames such as PING frames may not be practical. An easy defense against these attacks is to simply drop these frames if too many of them are received within a short time frame. Protocols need to allow this to happen - otherwise, standard-compliant implementations will be vulnerable, and implementations that protect themselves (technically) violate the standard, as happened in the attack on QUIC’s path validation.
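
As a hedged sketch, such a defense can be as simple as a per-window counter; the names and numbers here are illustrative:

package ratelimit

import "time"

// frameLimiter drops frames once more than maxPerWindow of them have been
// received within the current time window.
type frameLimiter struct {
	maxPerWindow int
	window       time.Duration
	count        int
	windowStart  time.Time
}

// Allow reports whether a newly received frame should be processed.
// Excess frames are silently dropped instead of being acted upon.
func (l *frameLimiter) Allow(now time.Time) bool {
	if now.Sub(l.windowStart) > l.window {
		l.windowStart, l.count = now, 0 // start a new window
	}
	l.count++
	return l.count <= l.maxPerWindow
}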

Protocol Mechanisms with Flow Control

For both the HTTP/2 Rapid Reset attack and the attack described in this blog post, the protocol made an attempt to limit the number of concurrent objects in flight, but did so in a way that allowed the attacker to replenish this limit: by resetting streams (for Rapid Reset), and by retiring connection IDs (in this attack). This renders the limit ineffective precisely when it’s needed most.

It is interesting to look at flow control mechanisms that were not vulnerable to this class of attacks. In QUIC, both stream-level flow control and the limit on the number of streams work fundamentally differently: in both cases, an explicit limit is communicated by the receiver. For stream-level flow control, the receiver defines a byte offset up to which the sender is allowed to send data. This offset is communicated in a MAX_STREAM_DATA frame, which the receiver sends at regular intervals of its own choosing. To limit the total number of streams, the receiver communicates the maximum number of streams using a MAX_STREAMS frame. The downside of this approach is that replenishing the limit requires the receiver to keep sending these frames. However, it also allows the receiver to grant flow control credit to the sender at its own discretion; this is how flow control auto-tuning works.
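
As a hedged sketch of this receiver-driven approach (the structure and names are mine; real implementations differ), here’s how a receiver might decide when to send the next MAX_STREAM_DATA frame:

package flowcontrol

// streamFlowController models receiver-side stream flow control: only the
// receiver can advance the offset the sender is allowed to write to.
type streamFlowController struct {
	bytesRead  uint64 // bytes the application has consumed so far
	sendOffset uint64 // highest offset advertised in a MAX_STREAM_DATA frame
	windowSize uint64 // current window size (auto-tuning would grow this)
}

// maybeUpdateWindow returns the offset for the next MAX_STREAM_DATA frame
// once the sender has used up more than half of its credit.
func (c *streamFlowController) maybeUpdateWindow() (newOffset uint64, ok bool) {
	if c.sendOffset-c.bytesRead > c.windowSize/2 {
		return 0, false // plenty of credit left, no update needed
	}
	// Auto-tuning (not shown) would increase windowSize here if the window
	// is being consumed in less than roughly one RTT.
	c.sendOffset = c.bytesRead + c.windowSize
	return c.sendOffset, true
}

The key property is that replenishment only ever happens because the receiver chose to send an update: a misbehaving sender can exhaust its credit, but cannot restore it.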

This seems to suggest a general design principle for network protocols: replenishing a limit should always involve a conscious decision by the receiver. This

  1. ensures that no more than one limit’s worth of data / frames can be sent per roundtrip, and
  2. gives the receiver a protocol-level way to protect itself from this class of flooding attacks.

For connection IDs, this could have been achieved by defining a MAX_NEW_CONNECTION_ID frame:

MAX_NEW_CONNECTION_ID Frame {
  Type (i),
  Sequence Number (i),
}
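
With such a frame, the Sequence Number field would communicate the highest connection ID sequence number the peer is permitted to issue. Forcing the peer to retire (and thus replace) connection IDs would then first require the receiver to explicitly raise this limit, mirroring how MAX_STREAMS works for streams.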

Given that the QUIC v1 protocol is already specified and widely deployed, it is unlikely that this solution will be adopted. Fortunately, the attack is easy to mitigate without changing the protocol: implementations can impose a limit on the number of queued RETIRE_CONNECTION_ID frames, and close the connection if that limit is exceeded. Since these frames are small, the limit can be set very high (even 1000 queued frames consume a negligible amount of memory), so high that it will never be reached unless the node is under attack. In fact, this is how I fixed the bug in quic-go; the fix shipped in the v0.42 release.
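
Extending the earlier bookkeeping sketch, the mitigation boils down to a single check when queueing a frame (the constant is illustrative, and quic-go’s actual fix may differ in detail):

package connid

import "errors"

// maxQueuedRetireFrames is deliberately generous: in normal operation the
// queue stays tiny, so the limit is only ever hit under attack.
const maxQueuedRetireFrames = 1000

// queueRetireFrame queues a RETIRE_CONNECTION_ID frame for sending, and
// signals that the connection should be closed if the queue grows
// suspiciously large.
func (m *manager) queueRetireFrame(seq uint64) error {
	if len(m.retireQueue) >= maxQueuedRetireFrames {
		// Too many unsent RETIRE_CONNECTION_ID frames: treat this as an
		// attack and terminate the connection instead of buffering forever.
		return errors.New("too many queued RETIRE_CONNECTION_ID frames")
	}
	m.retireQueue = append(m.retireQueue, seq)
	return nil
}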