Summer update and MPTCP features in Linux v6.18

Long time no see (or read?), as we could say! The last update was in January. Since then, we have been very busy! Read on to find out what happened around MPTCP during the last few months, and which new features will be present in the upcoming v6.18.

Activities

In March, I was at Netdev 0x19 in Zagreb, where I held a BoF session: MPTCP: present, future, and its development workflow (CI). Do not hesitate to check the video or the slides. The session covered different aspects of MPTCP: what it is, its use-cases, and its different components, then how easy it is to use MPTCP today with a recent and up-to-date Linux environment. There were some words about the current status and what was planned, followed by some discussions. A second part covered the development workflow, and how a CI with a specific setup can greatly help!

Soon after, I started a temporary part-time contract at UCLouvain, as a Research Assistant in the IP Networking Lab. That was a great opportunity to work with excellent colleagues, learn more about current research in the academic world, and contribute to different research projects. It was also a way to get financial support for the MPTCP maintenance, plus access to some servers to run syzkaller, an excellent kernel fuzzer, to continue finding bugs in the current implementation.

A few months ago, I got a mission to find a solution for middleboxes intercepting TCP connections, and thus forcing MPTCP to fall back to “plain” TCP. This resulted in the TCP-in-UDP eBPF program. Please check the dedicated blog post for more details about that.

In July, I presented two unrelated and independent extensions to the MPTCP protocol. The first one extends the Data-Level Length (DLL) size to allow MPTCP packets of more than 64 KB, mainly to allow internal egress packets of more than 64 KB and improve performance in a data centre. It can also be helpful when IPv6 jumbograms are used. See this draft for more details. The second extension suggests using application-level keys to better secure MPTCP when establishing new subflows, announcing addresses, and resetting connections. See this other draft for more explanations about this idea. If you know a company or an actor present on the Internet that is interested in these extensions and can help push them towards acceptance, feel free to contact me.

Finally, it is important to note that more funding around MPTCP recently got accepted! 🎉 Thanks again, NLnet, for your invaluable support!

New features

Better `MPCapable`’s C-flag support on the client side

The MPTCP protocol and its implementation in the Linux kernel support deployments behind load balancers. This is typically used by CDNs. When a layer-4 load balancer is in place, a connection will be handled by one server out of many placed behind it. In other words, multiple servers are accepting connections to the same IP address and port. An MPTCP connection can be composed of … multiple TCP subflows (paths), and it is important to make sure new path requests (MPJoin) reach the right end-server. If such a path request is sent to the original IP address and port, there is a high chance the load balancer will route it to a different end-server. To cope with that, the MPTCP protocol allows a host to set a flag (C-flag) in the connection request (MPCapable) to tell the receiver it must not try to open any additional subflows toward this address and port. Instead, the same host will announce a unique IP address and port that can be used to reach the right end-server. For more details about this case, please see this page: Deployment behind a load balancer.

The server side has been supported on Linux for a few years now, and is already widely used. A server simply has to set the net.mptcp.allow_join_initial_addr_port sysctl knob to 0, and add a signal MPTCP endpoint with a dedicated IP address and an optional port.
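As a rough illustration of the two server-side steps above, here is a minimal sketch that only builds the corresponding command lines without running them. The helper name, the address, the interface, and the port are placeholders chosen for this example. The sysctl knob and the `signal` endpoint flag come from the text above, and the argument order follows the `ip mptcp endpoint add` synopsis.

```python
import shlex


def server_c_flag_commands(addr, dev=None, port=None):
    """Build (but do not run) the commands for a server behind a load balancer.

    `addr`, `dev` and `port` are placeholders: use the dedicated address that
    reaches this specific end-server directly.
    """
    endpoint = ["ip", "mptcp", "endpoint", "add", addr]
    if port is not None:
        endpoint += ["port", str(port)]
    if dev is not None:
        endpoint += ["dev", dev]
    # 'signal': this address will be announced to the peer.
    endpoint.append("signal")
    return [
        # Sets the C-flag: the peer must not join the initial address and port.
        ["sysctl", "-w", "net.mptcp.allow_join_initial_addr_port=0"],
        endpoint,
    ]


if __name__ == "__main__":
    for argv in server_c_flag_commands("192.0.2.10", dev="eth0", port=8443):
        print(shlex.join(argv))
```

Both commands need root privileges when actually executed, which is why the sketch only prints them.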

So far, it looks like this setup was mainly used when interacting with iOS devices, meaning the client side was not running the Linux kernel. On a Linux client, the in-kernel path-manager respected the C-flag by not establishing new paths to the initial address, but that was it. By default, in such situations with the C-flag and the in-kernel path-manager, if a client had multiple interfaces, the non-primary ones were not used to establish extra paths: extra interfaces are by default only used to create new paths to the initial address of the server, which is not allowed in this case. As a result, the additional interfaces stayed unused, which was not the desired behaviour. A fix has recently been sent to improve this situation. Now, in this particular case, the in-kernel path-manager considers using the other MPTCP endpoints to establish new paths to the announced address.
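To make the before/after difference more concrete, here is a toy model of which extra paths a client with several interfaces may try. It is only a sketch of the behaviour described above, not the kernel logic: the function name, the `with_fix` switch, and the addresses are made up for this illustration.

```python
def extra_interface_paths(initial_remote, announced_remotes, extra_locals,
                          c_flag, with_fix=True):
    """Return the (local, remote) pairs tried from the non-primary interfaces.

    Toy model only: real path-managers also apply limits, flags, etc.
    """
    if not c_flag:
        # Default case: extra interfaces target the server's initial address.
        remotes = [initial_remote]
    elif with_fix:
        # C-flag set: the initial address is forbidden, so the announced
        # address(es) are used instead.
        remotes = list(announced_remotes)
    else:
        # C-flag set, previous behaviour: nothing was attempted.
        remotes = []
    return [(local, remote) for local in extra_locals for remote in remotes]


if __name__ == "__main__":
    paths = extra_interface_paths(
        initial_remote="203.0.113.1:443",
        announced_remotes=["203.0.113.50:443"],
        extra_locals=["10.0.1.2", "10.0.2.2"],
        c_flag=True,
    )
    for local, remote in paths:
        print(f"{local} -> {remote}")
```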

With the userspace path-manager, the userspace daemon didn’t know when the other peer had set this C-flag. That means it was not able to respect the protocol when the flag was set. The kernel now announces this info, and the “official” userspace daemon (mptcpd) will support it soon.

New `laminar` endpoints

Before Linux v6.18, upon the reception of an ADD_ADDR (and when the fullmesh flag was not used), the in-kernel PM only created new subflows using the local address picked by the routing configuration. That works well when the announced addresses can be predicted, but not on the Internet with servers controlled by someone else. In that case, it is easier to pick local addresses from a selected list of endpoints, and use each of them only once, than to rely on routing rules. This is why laminar endpoints have been added in v6.18.

In other words, on the client side, it is now recommended to set both subflow and laminar flags by default. If both the client and the server sides have multiple network interfaces they want to use, it might be interesting to use only the laminar flag on all client-side MPTCP endpoints, and only the signal one on all server-side MPTCP endpoints.
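The idea of picking a local address from a list of endpoints and using it only once can be sketched as follows. This is a simplified model, not the kernel code: the function name and the addresses are hypothetical, and what happens once every laminar endpoint is already in use is deliberately left out.

```python
def pick_laminar_local(laminar_addrs, in_use):
    """Return the first laminar endpoint address not yet used by a subflow
    of this connection, or None when they are all taken."""
    for addr in laminar_addrs:
        if addr not in in_use:
            return addr
    return None


if __name__ == "__main__":
    laminar_addrs = ["10.0.1.2", "10.0.2.2"]  # endpoints flagged 'laminar'
    in_use = {"10.0.0.2"}                     # initial subflow's local address

    # Two ADD_ADDR received from the peer, handled one after the other.
    for remote in ["198.51.100.7", "198.51.100.8"]:
        local = pick_laminar_local(laminar_addrs, in_use)
        if local is not None:
            in_use.add(local)
        print(f"ADD_ADDR {remote}: local address {local}")
```

In this model, each announced address gets a different local address, without any dedicated routing rule.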

mptcpd: security report & improvements

Thanks to the NLnet funding, Radically Open Security B.V. did a security review of mptcpd. Thank you, Tim and Marcus, for this great work! No security issues have been found 🎉

The report mentioned one point of attention: the plugin directory should not be world-writable, so that other pieces of code cannot be executed with extra permissions (CAP_NET_ADMIN). The full report is available here.

In terms of improvements, it is good to note that mptcpd is now available in more Linux distributions: OpenWrt, Alpine Linux, NixOS, etc. A future v0.14 version is planned, and it will include some new features around mptcpize: setting the GODEBUG=multipathtcp=1 environment variable, and also appending LD_PRELOAD if previously set, instead of overriding it. This version should also support new laminar endpoints, and the new deny_join_id0 parameter.
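The environment handling described for mptcpize can be illustrated with a small sketch. It is not mptcpize's actual code: the function name and the wrapper library path are hypothetical. Merging into an existing GODEBUG value and placing the wrapper after the existing LD_PRELOAD entries are assumptions made for this example.

```python
def mptcpize_env(base_env, wrapper_lib):
    """Return a copy of `base_env` prepared for an MPTCP-wrapped program."""
    env = dict(base_env)

    # Go programs: GODEBUG is a comma-separated list of key=value settings.
    godebug = env.get("GODEBUG", "")
    env["GODEBUG"] = f"{godebug},multipathtcp=1" if godebug else "multipathtcp=1"

    # Other programs: keep any library already preloaded instead of
    # overriding it (LD_PRELOAD entries can be separated by colons).
    preload = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{preload}:{wrapper_lib}" if preload else wrapper_lib
    return env


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    env = mptcpize_env({"LD_PRELOAD": "/opt/lib/other.so"},
                       "/usr/lib/mptcpize/libmptcpwrap.so")
    print(env["GODEBUG"])
    print(env["LD_PRELOAD"])
```

The resulting dictionary could then be passed to something like `subprocess.run(cmd, env=env)` to launch the target program.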

User applications

Quite a few new applications now have a dedicated option to enable MPTCP support: iperf3, sing-box, Valkey, FreeNginx, etc. Please also note that since Go 1.24, all applications written in Go have MPTCP enabled by default on the server side! This includes Caddy, Traefik, Shadowsocks Go, and many more!
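For applications without such an option yet, enabling MPTCP usually boils down to requesting the MPTCP protocol when creating the socket. Here is a minimal sketch in Python, assuming a Linux host. The helper name is made up for this example, and falling back to plain TCP on any socket creation error is a deliberately simple policy, not a recommendation for every application.

```python
import socket

# socket.IPPROTO_MPTCP exists in Python 3.10+ on Linux; 262 is the Linux value.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def stream_socket(family=socket.AF_INET):
    """Create an MPTCP socket, or a plain TCP one if MPTCP is unavailable."""
    try:
        return socket.socket(family, socket.SOCK_STREAM, IPPROTO_MPTCP)
    except OSError:
        # E.g. kernel built without MPTCP, or MPTCP disabled via sysctl.
        return socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP)


if __name__ == "__main__":
    with stream_socket() as sock:
        kind = "MPTCP" if sock.proto == IPPROTO_MPTCP else "TCP"
        print(f"Created a {kind} socket")
```

The returned socket is then used like any other stream socket (`bind()`, `listen()`, `connect()`, etc.).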

Miscellaneous

When working on current and future features around the path-manager, a lot of clean-ups have been done by Geliang and me. Some were required to allow new features, but others have also been added to improve the code itself by renaming variables, splitting large functions, regrouping code per purpose, etc. This might require a bit more attention during the backports, but it will help with the maintenance in the long term.

To help with debugging, new MIB counters for rejected MPJoin requests and for fallbacks to TCP have been added by Paolo and me. Some of them have been validated by Gang when working on improving the code coverage when running the whole test suite.
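MPTCP MIB counters are exposed in /proc/net/netstat as a pair of MPTcpExt lines: one with the counter names, one with the values. Here is a small sketch of a parser for that format. The function name is made up, and the sample text below only contains a few long-standing counters with made-up values, not the newly added ones.

```python
def parse_mptcp_counters(netstat_text):
    """Extract the MPTcpExt counters from /proc/net/netstat-style text."""
    rows = [line.split() for line in netstat_text.splitlines()
            if line.startswith("MPTcpExt:")]
    if len(rows) < 2:
        return {}  # no MPTCP counters exposed
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(value) for value in values)))


# Made-up sample in the same layout as /proc/net/netstat.
SAMPLE = """\
TcpExt: SyncookiesSent SyncookiesRecv
TcpExt: 0 0
MPTcpExt: MPCapableSYNRX MPCapableSYNTX MPJoinSynRx
MPTcpExt: 3 5 1
"""

if __name__ == "__main__":
    for name, value in parse_mptcp_counters(SAMPLE).items():
        print(f"{name}: {value}")
```

On a real system, the same function can be fed with the content of /proc/net/netstat. The `nstat` tool shows the same counters with an `MPTcpExt` prefix.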

The MPTCP CI was taking more and more time due to the addition of new tests. To accelerate the whole process, more builders are used in parallel: now the mptcp_join selftest is executed in a dedicated job for the normal and debug modes. Results can now be shared after ~1h15 instead of 2h.

Performance is being improved thanks to the work from Paolo and Christoph! More work is still ongoing, and a proper perf regression lab should be put in place soon. More explanations will be shared in a later blog post.

Regarding the socket options, support for TCP_MAXSEG has been added by Geliang, and an MPTCP version of SO_MAX_PACING_RATE from Christoph is in discussion. More work will be done around the socket options to simplify the code and improve the maintenance in the long term.

When an address is announced by a peer via an ADD_ADDR, the signalling packet carried in a TCP ACK can be lost. Before v6.18, the retransmissions were done after a timeout controlled by the net.mptcp.add_addr_timeout sysctl knob. The default value is set to 2 minutes, which is a safe choice, but certainly too high for most use-cases. Geliang changed this knob to act as a maximum value for the timeout. The timeout itself now depends on the connection’s round-trip-time (RTT) to better adapt to the situation.
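The new behaviour can be summarised as "an RTT-based timeout, capped by the sysctl value". Here is a minimal sketch of that idea. The function name and the `rtt_factor` multiplier are hypothetical, as the exact way the kernel derives the timeout from the RTT is not described here. Only the 2-minute default cap comes from the text above.

```python
def add_addr_retrans_timeout(rtt_s, max_timeout_s=120.0, rtt_factor=2.0):
    """Return a retransmission timeout in seconds.

    rtt_s:         measured round-trip-time of the connection
    max_timeout_s: upper bound, playing the role of net.mptcp.add_addr_timeout
    rtt_factor:    hypothetical multiplier, for illustration only
    """
    return min(rtt_s * rtt_factor, max_timeout_s)


if __name__ == "__main__":
    for rtt in (0.05, 0.5, 100.0):
        timeout = add_addr_retrans_timeout(rtt)
        print(f"RTT {rtt}s: retransmit after {timeout}s")
```

With such a scheme, a connection with a short RTT no longer waits for the full 2 minutes before retransmitting a lost ADD_ADDR.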

Last but not least, thanks to Paolo for helping with some fixes, to Mat for the code review, and to everybody who has reported issues, sent fixes and promoted MPTCP! A great community!

Conclusion

Quite a lot of new features and improvements will be present in the future Linux kernel LTS version (v6.18)! Looking forward to even more of them in the coming months!

If you like my work and would like me to continue, you can become a sponsor via LiberaPay, GitHub or Patreon.

Please contact me for professional collaborations, short or long missions, or for financial support for my contributions to the maintenance of MPTCP and various apps around it.