Introducing the TCP-in-UDP solution
The MPTCP protocol is complex, mainly in order to survive on the Internet, where middleboxes such as NATs, firewalls, IDSes or proxies can modify parts of the TCP packets. In the worst case, an MPTCP connection should fall back to “plain” TCP. Today, such fallbacks are rarer than before – probably because MPTCP has been used since 2013 on millions of Apple smartphones worldwide – but they can still happen, e.g. on some mobile networks using Performance Enhancing Proxies (PEPs) where MPTCP connections are not bypassed. In such cases, a solution to continue benefiting from MPTCP is to tunnel the MPTCP connections. Different solutions exist, but they usually add extra layers, and require setting up a virtual private network (VPN) with private IP addresses between the client and the server.
Here, a simpler solution is presented: TCP-in-UDP. This solution relies on eBPF, doesn’t add extra data per packet, and doesn’t require a virtual private network. Read on to find out more about that!
First, if the network you use blocks TCP extensions like MPTCP or other protocols, the best thing to do is to contact your network operator: maybe they are simply not aware of this issue, and can easily fix it.
TCP-in-UDP
Many tunnel solutions exist, but they target other use-cases: getting access to private networks, possibly with encryption – with solutions like OpenVPN, IPsec, WireGuard®, etc. – or adding extra info to each packet for routing purposes – like GRE, GENEVE, etc. The Linux kernel supports many of these tunnels. In our case, the goal is neither to get access to private networks nor to add an extra layer of encryption, but to make sure packets are not modified by the network.
For our use-case, it is then enough to “convert” the TCP packets into UDP ones. That is what TCP-in-UDP does. The idea is not new: it is inspired by an old IETF draft. In short, the fields of the TCP header are re-ordered so that the packet starts with the fields of a UDP header.
TCP to UDP header
To better understand the translation, let’s look at what the different headers look like:
- UDP:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- TCP:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Acknowledgment Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data | |C|E|U|A|P|R|S|F| |
| Offset| Reser |R|C|R|C|S|S|Y|I| Window |
| | |W|E|G|K|H|T|N|N| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum | Urgent Pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| (Optional) Options |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- TCP-in-UDP:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data | |C|E| |A|P|R|S|F| |
| Offset| Reser |R|C|0|C|S|S|Y|I| Window |
| | |W|E| |K|H|T|N|N| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Acknowledgment Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| (Optional) Options |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
As described here, the first eight bytes of the TCP-in-UDP header correspond to the classical UDP header. Then, the Data Offset is placed with the flags and the Window field. Placing the Data Offset after the Checksum ensures that a value larger than 0x5 will appear there, which is required for STUN traversal. Then the Sequence and Acknowledgment Numbers follow. With this translation, the TCP header has been reordered, but it now starts with a UDP header, without modifying the packet length. The informed reader will have noticed that the URG flag and the Urgent Pointer have disappeared: this field is rarely used, some middleboxes reset it, and losing it is not a huge loss for most TCP applications.
In other words, apart from a different order, the only two modifications are:
- the layer 4 protocol indicated in layer 3 (IPv4/IPv6)
- the switch from Urgent Pointer to Length (and the opposite)
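To make the reordering concrete, here is a userspace C sketch of the TCP to TCP-in-UDP translation. The struct layouts follow the diagrams above, but the names (`tcp_hdr`, `tiu_hdr`, `tcp_to_tiu`) are illustrative, not taken from the actual eBPF program:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Classical 20-byte TCP header, fields in network byte order. */
struct tcp_hdr {
    uint16_t source, dest;
    uint32_t seq, ack_seq;
    uint16_t off_flags;   /* Data Offset (4 bits), reserved, flags */
    uint16_t window;
    uint16_t check;
    uint16_t urg_ptr;
};

/* TCP-in-UDP header: same size, reordered to start like a UDP header. */
struct tiu_hdr {
    uint16_t source, dest; /* UDP ports, unchanged */
    uint16_t len;          /* UDP Length, replaces the Urgent Pointer */
    uint16_t check;
    uint16_t off_flags;    /* Data Offset and flags, URG cleared */
    uint16_t window;
    uint32_t seq, ack_seq;
};

#define TIU_URG_FLAG htons(0x0020) /* URG bit within off_flags */

/* udp_len is the UDP Length (header + payload), derived from layer 3. */
void tcp_to_tiu(const struct tcp_hdr *t, uint16_t udp_len, struct tiu_hdr *u)
{
    u->source    = t->source;
    u->dest      = t->dest;
    u->len       = htons(udp_len);
    u->check     = t->check;  /* fixed up incrementally afterwards */
    u->off_flags = t->off_flags & ~TIU_URG_FLAG;
    u->window    = t->window;
    u->seq       = t->seq;
    u->ack_seq   = t->ack_seq;
}
```

Note that every 16-bit word of the original header is either copied verbatim or dropped (the Urgent Pointer): only the new Length value and the cleared URG flag actually change the packet content.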
These two modifications will of course affect the Checksum field, which will need to be updated accordingly.
Dealing with network stack optimisations
On paper, the required modifications – changing the protocol number, swapping a 16-bit word, and adapting the checksum – are small, and should be easy to do using eBPF with TC ingress and egress hooks. But doing that in a highly optimised stack is more complex than expected.
Accessing all required data
On Linux, all per-packet data are stored in a socket buffer, or “SKB”. In our case, the eBPF code needs to access the packet header, which should be available between skb->data and skb->data_end. Except that skb->data_end might not point to the end of the packet: typically, it points to the end of the packet header. This is an optimisation: the kernel will often do operations depending on the packet header, and it doesn’t really care about the content of the data, which is usually for the userspace, or to be forwarded to another network interface.
In our case, in egress – translation from TCP to UDP – this is fine: the whole TCP header is available, and that’s where the modifications need to be done. In ingress – translation from UDP to TCP – that’s different: some network drivers will only align data up to the end of the layer 4 header, so only the 8 bytes of the UDP header here. This is not enough to do the translation, as 12 more bytes need to be accessed. This issue is easy to fix: eBPF helpers were introduced a long time ago to pull in non-linear data, e.g. bpf_skb_pull_data or bpf_skb_load_bytes.
GRO & TSO/GSO
On the Internet, packets are usually limited to 1500 bytes or fewer. Each packet still needs to carry some headers to indicate the source and the destination, but also per-packet information like the data sequence number. Having to deal with “small” packets has a cost, which can become very high at very high throughput. To counter that, the Linux networking stack prefers to deal with bigger chunks of data, using “internal” packets of tens of kilobytes, and splitting them into smaller packets with very similar headers later on. Some network devices can even do this segmentation or aggregation work in hardware. That’s what GRO (Generic Receive Offload) and TSO (TCP Segmentation Offload) / GSO (Generic Segmentation Offload) are for.
With TCP-in-UDP, it is required to act on a per-packet basis: each TCP packet is translated to a UDP one, containing the UDP header (8 bytes), the rest of the TCP header (12 bytes plus the TCP options), then the TCP payload. In other words, for each UDP packet, the UDP payload starts with a part of the TCP header: data that is packet-specific. It means that the traditional GRO and TSO cannot be used, because the payloads can no longer “simply” be merged together as before.
Informed readers will then say that these network device features can easily be disabled using ethtool, e.g.
ethtool -K "${IFACE}" gro off gso off tso off
Correct, but even if all hardware offload accelerations are disabled, in egress the Linux networking stack still has an interest in dealing with bigger packets internally, and in doing the segmentation in software at the end. Because it is not easily possible to modify with eBPF how the segmentation is done, the stack has to be told not to apply this optimisation, e.g. with:
ip link set "${IFACE}" gso_max_segs 1
Checksum
The following was certainly the most frustrating issue to deal with!
Thanks to how the checksum is computed, moving 16-bit words, or bigger fields, around doesn’t change the checksum. Still, some fields need to be updated:
- The layer 4 protocol, set in layer 3 (IPv4/IPv6) here, also used to compute the next layer (UDP/TCP) checksum.
- The switch from the TCP Urgent Pointer (0) to the UDP Length (and the opposite).
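The reordering-invariance claim above is easy to check: the Internet checksum is a one's-complement sum, so it does not depend on the order of the 16-bit words, only on their values. A small illustrative C sketch (not from the actual implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum over n 16-bit words. */
uint16_t inet_csum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Reordering the words of a header leaves this value untouched; only replacing a word (protocol number, Urgent Pointer becoming Length) changes it.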
It is not required to recompute the full checksum. Instead, this can be done incrementally, and some eBPF helpers can do that for us, e.g. bpf_l3_csum_replace and bpf_l4_csum_replace.
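The incremental update these helpers perform follows RFC 1624 (HC' = ~(~HC + ~m + m')). A minimal userspace equivalent for one 16-bit field, mirroring what bpf_l4_csum_replace does for a 2-byte replacement (the function name is borrowed from the kernel's csum_replace2, as an illustration):

```c
#include <stdint.h>

/* Incrementally update an Internet checksum when one 16-bit field
 * changes from old_val to new_val (RFC 1624, eqn. 3). */
uint16_t csum_replace2(uint16_t check, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;  /* back to the one's-complement sum */

    sum += (uint16_t)~old_val;        /* subtract the old value ... */
    sum += new_val;                   /* ... and add the new one */
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This is much cheaper than summing the whole packet again, which is exactly why the eBPF helpers expose the incremental form.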
When testing with network namespaces (netns), with one host dedicated to the translation when forwarding packets, everything was fine: the correct checksum was visible in each packet. But when testing with real hardware, with the TCP-in-UDP eBPF hooks directly on the client and the server, that was different: the checksum in egress was incorrect on most network interfaces, even when the transmission checksum offload (tx) was disabled on the network interface.
After quite a bit of investigation, it appeared that both the layer 3 and layer 4 checksums were correctly updated by the eBPF hook, but either the NIC or the networking stack was then modifying the layer 4 checksum at the wrong place. This deserves some explanation.
In egress, the Linux TCP networking stack of the sender will typically set skb->ip_summed to CHECKSUM_PARTIAL. In short, it means the TCP/IP stack will compute only a part of the checksum: the one covering the pseudo-header (IP addresses, protocol number and length). The rest will be computed later on, ideally by the networking device. At that last stage, the device only needs to know where the layer 4 header starts in the packet, and where the checksum field is located from the start of this layer 4 header. This info is internally recorded in skb->csum_offset, and it is different for TCP and UDP, because the checksum field is not at the same place in their headers.
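Those two per-protocol offsets can be checked from userspace with the standard socket headers (a small illustrative check, assuming glibc-style struct definitions where the member is named `check`):

```c
#define _GNU_SOURCE /* for the Linux-style tcphdr/udphdr member names */
#include <stddef.h>
#include <netinet/tcp.h>
#include <netinet/udp.h>

/* skb->csum_offset equivalents: where the Checksum field sits from the
 * start of the layer 4 header, for each protocol. */
size_t tcp_csum_offset(void) { return offsetof(struct tcphdr, check); }
size_t udp_csum_offset(void) { return offsetof(struct udphdr, check); }
```

TCP stores its checksum 16 bytes into the header, UDP only 6 bytes in, which is exactly why a TCP-configured csum_offset writes into the wrong place of a packet that has been turned into UDP.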
When switching from TCP to UDP, it is then not enough to change the protocol number in the layer 3: this internal checksum offset value also needs to be updated. If I’m not mistaken, today, it is not possible to update it directly with eBPF. A proper solution is certainly to add a new eBPF helper, but that would only work with newer kernels, or possibly with a custom module. Instead, a workaround has been found: chain the eBPF TC egress hook with a TC ACT_CSUM action when the packet is translated from TCP to UDP. This csum action triggers a software checksum recalculation of the specified packet headers. In other words, and in our case, it is used to compute the rest of the checksum for a given protocol (UDP), and to mark the checksum as computed (CHECKSUM_NONE). This last step is important: even if it is possible to compute the full checksum with eBPF code, like we did at some point, it is wrong to do so if we cannot change the CHECKSUM_PARTIAL flag, which expects a later stage to update a checksum at a (wrong) offset with the rest of the data.
So with a combination of TC ACT_CSUM and eBPF, it is possible to get the right checksum after having modified the layer 4 protocol.
MTU/MSS
This issue is not linked to the highly optimised Linux network stack, but to the fact that, on the wire, the packets will be UDP and not TCP ones. It means that some operations, like the dynamic adaptation of the MSS (TCP Maximum Segment Size) – aka MSS clamping – will have no effect here. Many mobile networks use encapsulation without jumbo frames, meaning that the maximum packet size is lower than 1500 bytes. For performance reasons, and to avoid having to deal with this, it is important to avoid IP fragmentation. In other words, it might be required to adapt the interface Maximum Transmission Unit (MTU), or the MTU / MSS per destination.
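For example, the interface MTU, or the MTU and the advertised MSS for a given destination, can be lowered with commands like the following. The values and addresses here are purely illustrative: the right numbers depend on the actual encapsulation overhead of your network:

```shell
# Lower the MTU of the whole interface...
ip link set dev "${IFACE}" mtu 1400

# ...or only towards one destination, also clamping the advertised MSS
# (192.0.2.0/24 and 198.51.100.1 are documentation addresses).
ip route replace 192.0.2.0/24 via 198.51.100.1 dev "${IFACE}" \
    mtu 1400 advmss 1360
```

Here 1360 is simply 1400 minus the 40 bytes of IPv4 and TCP headers.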
Conclusion
In conclusion, this new eBPF program can easily be deployed on both the client and the server sides to circumvent middleboxes that are still blocking MPTCP or other protocols. All you might still need to do is modify the destination port, which is currently hardcoded.
Acknowledgments
Thanks to Xpedite Technologies for having supported this work, and in particular Chester for his help investigating the checksum issues with real hardware. Also thanks to Nickz from the eBPF.io community for his support while working on these checksum issues.