The MPTCP protocol is complex, mainly to be able to survive on the Internet, where middleboxes such as NATs, firewalls, IDSes or proxies can modify parts of the TCP packets. In the worst case, an MPTCP connection falls back to “plain” TCP. Today, such fallbacks are rarer than before – probably because MPTCP has been used since 2013 on millions of Apple smartphones worldwide – but they can still happen, e.g. on some mobile networks using Performance Enhancing Proxies (PEPs) where MPTCP connections are not bypassed. In such cases, a solution to continue benefiting from MPTCP is to tunnel the MPTCP connections. Different solutions exist, but they usually add extra layers, and require setting up a virtual private network (VPN) with private IP addresses between the client and the server.

Here, a simpler solution is presented: TCP-in-UDP. This solution relies on eBPF, doesn’t add extra data per packet, and doesn’t require a virtual private network. Read on to find out more about that!


First, if the network you use blocks TCP extensions like MPTCP or other protocols, the best thing to do is to contact your network operator: maybe they are simply not aware of this issue, and can easily fix it.

TCP-in-UDP

Many tunnel solutions exist, but they target other use-cases: getting access to private networks, possibly with encryption – with solutions like OpenVPN, IPsec, WireGuard®, etc. – or adding extra info in each packet for routing purposes – like GRE, GENEVE, etc. The Linux kernel supports many of these tunnels. In our case, the goal is not to get access to private networks, nor to add an extra layer of encryption, but to make sure packets are not modified by the network.

For our use-case, it is then enough to “convert the TCP packets into UDP ones”. This is what TCP-in-UDP does. The idea is not new: it is inspired by an old IETF draft. In short, the items from the TCP header are re-ordered to start with the items from the UDP header.

TCP to UDP header

To better understand the translation, let’s see what the different headers look like:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |       |C|E|U|A|P|R|S|F|                               |
| Offset| Reser |R|C|R|C|S|S|Y|I|            Window             |
|       |       |W|E|G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |       |C|E| |A|P|R|S|F|                               |
| Offset| Reser |R|C|0|C|S|S|Y|I|            Window             |
|       |       |W|E| |K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

As described here, the first eight bytes of the TCP-in-UDP header correspond to the classical UDP header. Then come the Data Offset, the flags and the Window field, followed by the sequence and acknowledgment numbers. Placing the Data Offset right after the Checksum ensures that a value of at least 0x5 will appear there, which is required for STUN traversal. With this translation, the TCP header has been reordered, but now starts with a UDP header, without modifying the packet length. The informed reader will have noticed that the URG flag and the Urgent Pointer have disappeared: they are rarely used, and some middleboxes reset them anyway, so this is not a huge loss for most TCP applications.
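The reordering can be sketched in plain C on 16-bit words. This is only an illustrative user-space version under our own conventions (the real work happens in the eBPF hooks, the URG bit is assumed to be zero, and the checksum fix-up is ignored here); the function name and the word-array representation are not from the actual implementation:

```c
#include <stdint.h>

/* 16-bit words of a 20-byte TCP header (no options):
 * 0: Source Port        1: Destination Port
 * 2-3: Sequence Number  4-5: Acknowledgment Number
 * 6: Data Offset/flags  7: Window
 * 8: Checksum           9: Urgent Pointer
 */
static void tcp_to_tcp_in_udp(const uint16_t tcp[10], uint16_t udp[10],
                              uint16_t udp_len)
{
    udp[0] = tcp[0];   /* Source Port */
    udp[1] = tcp[1];   /* Destination Port */
    udp[2] = udp_len;  /* Length, replacing the Urgent Pointer */
    udp[3] = tcp[8];   /* Checksum (still needs an incremental update) */
    udp[4] = tcp[6];   /* Data Offset + flags (URG assumed 0) */
    udp[5] = tcp[7];   /* Window */
    udp[6] = tcp[2];   /* Sequence Number (high) */
    udp[7] = tcp[3];   /* Sequence Number (low) */
    udp[8] = tcp[4];   /* Acknowledgment Number (high) */
    udp[9] = tcp[5];   /* Acknowledgment Number (low) */
}
```

Every word is copied as-is; only the Length is new, taking the slot freed by the dropped Urgent Pointer.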

In other words, apart from a different order, the only two modifications are:

  • the layer 4 protocol indicated in layer 3 (IPv4/IPv6)
  • the switch from Urgent Pointer to Length (and vice versa)

These two modifications will of course affect the Checksum field, which will need to be updated accordingly.
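The reason pure reordering is checksum-neutral is that the Internet checksum is a one’s-complement sum of 16-bit words, which is commutative. A quick sanity check in C (our own helper, not from the implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum: one's-complement sum of 16-bit words, complemented */
static uint16_t csum16(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)                     /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Summing the same words in any order gives the same result, so only the two modifications listed above actually change the checksum.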

Dealing with network stack optimisations

On paper, the required modifications – the protocol number, one 16-bit word, and a checksum adjustment – are small, and should be easy to do using eBPF with TC ingress and egress hooks. But doing that in a highly optimised stack is more complex than expected.

Accessing all required data

On Linux, all per-packet data are stored in a socket buffer, or “SKB”. In our case, the eBPF code needs to access the packet headers, which should be available between skb->data and skb->data_end. Except that skb->data_end might not point to the end of the packet: it typically points to the end of the packet headers. This is an optimisation: the kernel often operates on the packet headers only, and doesn’t really care about the content of the data, which is usually destined for userspace, or to be forwarded to another network interface.

In our case, in egress – translation from TCP to UDP – this is fine: the whole TCP header is available, and that’s where the modifications need to be done. In ingress – translation from UDP to TCP – that’s different: some network drivers will only put data up to the end of the layer 4 header in the linear area, so only the 8 bytes of the UDP header here. This is not enough to do the translation, as 12 more bytes need to be accessed. This issue is easy to fix: eBPF helpers were introduced a long time ago to pull in non-linear data, e.g. bpf_skb_pull_data or bpf_skb_load_bytes.
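The ingress side of this check could look roughly like the sketch below. This is not the actual program, just an illustration of the pattern, assuming IPv4 without IP options; the section and function names are made up, and the usual BPF and network headers are omitted:

```c
/* Sketch of the start of a TC ingress hook (UDP -> TCP direction) */
SEC("tc")
int tcp_in_udp_ingress(struct __sk_buff *skb)
{
    const int l4_off = ETH_HLEN + sizeof(struct iphdr);
    /* UDP header (8 bytes) + the rest of the reordered TCP header (12) */
    const int needed = l4_off + 20;
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    if (data + needed > data_end) {
        /* Headers not fully in the linear area: pull them in, then
         * reload the (possibly moved) pointers. */
        if (bpf_skb_pull_data(skb, needed) < 0)
            return TC_ACT_OK;
        data = (void *)(long)skb->data;
        data_end = (void *)(long)skb->data_end;
        if (data + needed > data_end)
            return TC_ACT_OK;
    }

    /* ... UDP -> TCP translation of the header words goes here ... */
    return TC_ACT_OK;
}
```

Note that bpf_skb_pull_data invalidates previously computed packet pointers, hence the reload and re-check, which the verifier requires anyway.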

GRO & TSO/GSO

On the Internet, packets are usually limited to 1500 bytes or fewer. Each packet still needs to carry some headers to indicate the source and destination, but also per-packet information like the data sequence number. Having to deal with “small” packets has a cost, which becomes very high at very high throughput. To counter that, the Linux networking stack prefers to deal with bigger chunks of data – “internal” packets of tens of kilobytes – and split them into smaller ones with very similar headers later on. Some network devices can even do this segmentation or aggregation work in hardware. That’s what GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload) / GSO (Generic Segmentation Offload) are for.

With TCP-in-UDP, it is required to act on a per-packet basis: each TCP packet will be translated to UDP, containing the UDP header (8 bytes), the rest of the TCP header (12 bytes plus the TCP options), then the TCP payload. In other words, for each UDP packet, the UDP payload will contain a part of the TCP header: data that is per-packet specific. It means the traditional GRO and TSO cannot be used, because the data can no longer “simply” be merged with the next packet’s like before.

Informed readers will then say that these network device features can be easily disabled using ethtool, e.g.

ethtool -K "${IFACE}" gro off gso off tso off

Correct, but even if all hardware offload accelerations are disabled, in egress, the Linux networking stack still has an interest in dealing with bigger packets internally, doing the segmentation in software at the end. Because it is not easily possible to modify how this segmentation is done with eBPF, the stack must be told not to apply this optimisation, e.g. with:

ip link set "${IFACE}" gso_max_segs 1

Checksum

The following was certainly the most frustrating issue to deal with!

Thanks to how the checksum is computed – a one’s-complement sum of 16-bit words – moving 16-bit words (or bigger items) around doesn’t change the checksum. Still, some fields need to be updated:

  • The layer 4 protocol, set in layer 3 (IPv4/IPv6) here, which is also used to compute the next layer (UDP/TCP) checksum via the pseudo-header.
  • The switch from the TCP Urgent Pointer (0) to the UDP Length (and the opposite).

It is not required to recompute the full checksum. Instead, this can be done incrementally, and some eBPF helpers can do that for us, e.g. bpf_l3_csum_replace and bpf_l4_csum_replace.
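The incremental technique behind these helpers follows RFC 1624: starting from the folded one’s-complement sum, subtract the old field value and add the new one. A small stand-alone C sketch of the arithmetic for a 16-bit field (our own helper names; the kernel implements this in csum_replace2() and friends, used under the hood by the helpers above):

```c
#include <stddef.h>
#include <stdint.h>

/* Full Internet checksum over 16-bit words (for verification only) */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Incremental update when one 16-bit field changes,
 * per RFC 1624: HC' = ~(~HC + ~m + m') */
static uint16_t csum_update16(uint16_t check, uint16_t old_val,
                              uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;  /* ~HC */

    sum += (uint16_t)~old_val;        /* ~m: remove the old value */
    sum += new_val;                   /* m': add the new one */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This is exactly the kind of update needed for the Urgent Pointer / Length swap and for the protocol number in the pseudo-header.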

When testing with network namespaces (netns), with one host dedicated to the translation when forwarding packets, everything was fine: the correct checksum was visible in each packet. But when testing on real hardware, with the TCP-in-UDP eBPF hooks directly on the client and the server, that was different: the checksum in egress was incorrect on most network interfaces, even when the transmission checksum offload (tx) was disabled on the network interface.

After quite a bit of investigation, it appeared that both the layer 3 and layer 4 checksums were correctly updated by the eBPF hook, but either the NIC or the networking stack was then modifying the layer 4 checksum at the wrong place. This deserves some explanation.

In egress, the Linux TCP networking stack of the sender will typically set skb->ip_summed to CHECKSUM_PARTIAL. In short, it means the TCP/IP stack will only compute a part of the checksum: the one covering the pseudo-header – IP addresses, protocol number and length. The rest will be computed later on, ideally by the networking device. At that last stage, the device needs to know not only where layer 4 starts in the packet, but also where the checksum field sits from the start of this layer 4. This info is internally registered in skb->csum_offset, and it is different for TCP and UDP, because the checksum field is not at the same place in their headers.

When switching from TCP to UDP, it is then not enough to change the protocol number in layer 3: this internal checksum offset also needs to be updated. If I’m not mistaken, it is not currently possible to update it directly with eBPF. A proper solution would certainly be to add a new eBPF helper, but that would only work with newer kernels, or possibly with a custom module. Instead, a workaround has been found: chaining the eBPF TC egress hook with a TC ACT_CSUM action when the packet is translated from TCP to UDP. This csum action triggers a software checksum recalculation of the specified packet headers. In other words, and in our case, it is used to compute the rest of the checksum for a given protocol (UDP), and to mark the checksum as computed (CHECKSUM_NONE). This last step is important: even if it is possible to compute the full checksum with eBPF code – like we did at some point – it is wrong to do so if we cannot change the CHECKSUM_PARTIAL flag, which expects a later stage to update a checksum at a (now wrong) offset with the rest of the data.

So with a combination of both TC ACT_CSUM and eBPF, it is possible to get the right checksum after having modified the layer 4 protocol.

MTU/MSS

This last point is not linked to the highly optimised Linux networking stack but to the network itself: on the wire, the packets will be UDP, not TCP. It means that some operations, like the dynamic adaptation of the MSS (TCP Maximum Segment Size) – aka MSS clamping – will have no effect here. Many mobile networks use encapsulation without jumbo frames, meaning that the maximum packet size is lower than 1500 bytes. For performance reasons, and not to have to deal with this, it is important to avoid IP fragmentation. In other words, it might be required to adapt the interface Maximum Transmission Unit (MTU), or the MTU / MSS per destination.
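Since the translation adds no bytes per packet, the usual MSS arithmetic applies unchanged: subtract the network and transport headers from the path MTU. A tiny sketch (our own helper names, assuming no IP or TCP options):

```c
/* TCP MSS to advertise for a given path MTU. TCP-in-UDP keeps the packet
 * length identical, so no extra margin is needed for the translation. */
#define IPV4_HDR_LEN 20
#define IPV6_HDR_LEN 40
#define TCP_HDR_LEN  20   /* without options */

static int mss_for_mtu_v4(int mtu) { return mtu - IPV4_HDR_LEN - TCP_HDR_LEN; }
static int mss_for_mtu_v6(int mtu) { return mtu - IPV6_HDR_LEN - TCP_HDR_LEN; }
```

So on an encapsulating mobile network with, say, a 1400-byte path MTU, the IPv4 MSS would need to be lowered to 1360 bytes.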

Conclusion

In conclusion, this new eBPF program can easily be deployed on both the client and the server sides to circumvent middleboxes that are still blocking MPTCP or other protocols. All you might still need to do is modify the destination port, which is currently hardcoded.

Acknowledgments

Thanks to Xpedite Technologies for having supported this work, and in particular Chester for his help investigating the checksum issues with real hardware. Also thanks to Nickz from the eBPF.io community for his support while working on these checksum issues.