CI CI CI
The previous post was predicting more work on the CI side, and… That was correct! A new environment causing unexpected kernel crashes, new CI instances finding new bugs, third party tools causing issues, etc. Read on to find out why January seems to be the month dedicated to CI activities!
CI
In addition to the work around MPTCP CI, some modifications have also been done for other CIs validating MPTCP selftests.
MPTCP CI
As mentioned in my previous post, the migration to Github Actions continued in January: adding email notifications, display test results and extra metrics, enabling them on new patches, etc. In short, re-adding many bits we had before with Cirrus-CI, and needed to keep the workflow closed to what it was before.
Unfortunately, the migration came with some unexpected surprises! The first one is a small increase in the number of “flaky” tests. This can partially be explained by the fact KVM acceleration cannot be used on GitHub Actions when using free runners. Packetdrill tests seem to suffer more than the others. Compared to the other tests, they are executed in parallel. It looks like they take all/most of the available CPU resources. Reducing the limit to have maximum 2 tests in parallel greatly improved the results, but there are still some rare hiccups.
The second surprise initially looked bad: some new kernel panics have been detected. They were not due to MPTCP code – it was happening while doing some “ping” between net namespaces – but it was reported by the CI ~25-30% of the time. The bug has been reported to the Netdev community, and quickly identified as an issue somewhere in x86 code. It was not easy to reproduce the issue, but with determination, and hours of tests, a patch causing the issue has been identified. It turns out, the patch accidentally removed some code that was preventing a bug in QEmu’s TCG code. Before reporting this to the QEmu’s community, I tried to reproduce the issue with the last stable version of QEmu – requiring an update of other dependences, including a fix for virtme that is no longer maintained – and I didn’t manage to. Still, the QEmu’s version from the last Ubuntu stable release doesn’t include this fix. This has been reported to the Ubuntu bug tracker. After another long bisecting session, I just managed to identify the commits fixing the bug. Sadly, these commits have not been backported on QEmu’s side, and the commit message was lacking explanations about how the bug was initially found, and when the bug was introduced. This doesn’t help to get fixes backported, and to quickly identify which patch fixed someone else’s issue. At the end, quite some time has been spent on a bug impacting us, but not due to MPTCP. On the other hand, it helped to improve kernel panic detection, quickly stopping the VM in case of such critical issues, and the fix will likely be available to all Ubuntu users.
Not related to the migration, but some other subtests have started to fail for a
few days in a row. It turns out it was due to the last version of IPRoute2, more
precisely due to a regression
in the ss
tool we use for some tests. A
fix
has been sent, applied, but not backported. Even if it breaks our selftests, and
it might break scripts using ss
to monitor connections or to report info, the
issue is not big enough
to justify a new bug-fix version. In other words, if you want to validate MPTCP
selftests, don’t use IPRoute 6.7.0. To reduce the number of users running tests
with this buggy version, note that the version on Debian, used by a few CIs, has
been patched
with the fix.
LKFT CI
Linaro’s Linux Kernel Functional Test framework, or LKFT, test a very impressive
number of kernel versions. From old stable ones, to linux-next. They have been
validating MPTCP selftests for years, and that’s great, thanks guys for doing
that! It is especially useful to find issues due to a bad backport in stable
versions, or to prevent issues by testing linux-next. One important
particularity when validating stable kernel versions, is that they follow stable
team recommendations:
running the kselftests suite from the last stable version (e.g. 6.7.4) when
validating (old) stable kernel (e.g. 5.15.148). It is an important assumption to
know when writing kselftests, and I don’t think this is well known. I don’t
think that’s easy to support all kernels, especially in the networking area,
where some features/behaviours are not directly exposed to the userspace. Some
MPTCP kselftests have to look at /proc/kallsyms
or use other (ugly?)
workarounds
to predict what we are supposed to have, depending on the kernel that is being
used. But it is important to do something to support these old kernels, not to
have big kselftests, with many different subtests, always marked as “failed”
when validating new stable releases, removing all their value.
Regularly, test results from LKFT are analysed, to check if some tests have not been skipped by accident, and if there are no real issues being ignored there. Recently, some modifications have been done to support more environments, add missing kernel configs, increase timeout, and other techniques trying to support very slow environments, and more.
Netdev CI
Great news for the Networking subsystem: a new CI environment has been put in place by Jakub Kicinski, supported by Meta. It currently runs most tests available in the kernel source code, including ours for MPTCP: kselftests, KUnit, and more. The initial goal is to validate patches that are sent on the mailing list, and send reports on Patchwork. That’s an important addition, something we have done with MPTCP for a long time, and is part of the common practice: each new modification has to be validated by a CI. Certainly due to the same reasons as mentioned in my previous post and valid for most kernel subsystems, and Open-Source project in general – no infrastructure in place that can be easily used by kernel dev to monitor new patches sent by email, to automatically apply, then run tests, and send a report ; who would pay to develop and maintain that, for the VM or the specific hardware, etc. –, this has been lacking for years. This doesn’t mean no tests were being executed, it’s just that this was not done in public: it is easier for a company to run these tests internally, especially when some specific hardware is needed, or the test suite is not public. But that’s not impossible: DRM subsystem has been using GitLab for years, with self-hosted runners, for everybody’s benefit. But it needs coordination, and some brave people to push their manager to start such projects that are not going to be directly beneficial for this company. Anyway, here, that’s great to have Meta’s support.
I’m glad I was able to share some knowledge to help a bit having this in place. The environment is a bit different, mainly because no “public CIs” are involved. When I look at the different issues I had in the past, and the workarounds I put in place to have our different tests supported on such “public CIs”, I’m now thinking I could have probably done a lot more with the same time if I was able to control everything. But on the other hand, that’s great anybody can maintain it, and it is not tied to a single company – which is currently not the case with the Netdev CI, even if a lot of components are in a public repository.
Having CI executed in a new environment, also means new issues specific to this new environment! This was the case, especially because the tests are running without KVM acceleration. It is good that some of the fixes needed for LKFT CI can be re-used here.
Talking about fixes, Paolo managed to find what was causing some subtests to be
unstable in very
slow environments. It took a bit of time to identify the root cause, but it
appears to be due to the use of a TCP-specific helper on an MPTCP socket. As
mentioned in the commit message,
funnily enough, fuzzers and static checkers are happy, as the accessed memory
still belongs to the mptcp_sock struct
, but completely unrelated. Even from a
functional perspective, the check being done was always skipped before. Until an
unrelated refactoring which
exposed the issue. In short, always be careful when casting a structure into
another one! BTW, that’s not the first time such an issue has been introduced
because that’s easy to mix-up MPTCP and TCP socket structures. I’m working on
adding new small checks in debug mode to avoid such issues in the future.
One last thing, not to have entire kselftests, including hundreds of subtests, being ignored due to a single unstable subtest, I’m adding subtests support. We would feel better skipping just one or two subtests while we are trying to improve the situation, than having hundreds of them being ignored for days, adding unneeded pressure.
Misc. Maintenance Work
Mainly to expose that the maintenance work is always full of unexpected stuff that needs to be done “quickly”, here are a bunch of tasks that needed to be done this month, on top of the classic reviews, meetings, questions, bugs, etc.:
-
BPF: we have some code in progress in our tree that is not fully ready yet. But then it means that we continuously have to adapt our code when some changes are done upstream. It was the case recently, to adapt to different
bpf_struct_ops
changes. -
mptcpd
debian packages: two new versions – 0.12-3 and 0.12-4 – have been released to fix some issues due to some changes in the test infrastructure, and how SystemD dependence is handled. -
packetdrill
: Packetdrill is a very handy tool to validate networking features. We maintain an out-of-tree version supporting MPTCP. This version is not “upstreamable” as it is, because the old experimental code needs a serious refactoring. But still, it works quite well. Anyway, when trying to find a fix for the bug we had with QEmu (see above), the build environment for the tests has been updated, breaking our branches dedicated to different stable versions containing tests valid for the specific versions: MPTCP support has been added progressively. To fix compilation issues when using newer compiler versions, but also to help with the maintenance later, each stable branch has been synced with modifications from the upstream repository maintained by Google.
What’s next?
Even if I would prefer to add new features, there is still some work needed for the CI:
- using self-hosted runners to improve the stability of some tests
- switching to virtme-ng
because the
virtme
tool we use is no longer maintained - validating MPTCP BPF tests to track regressions in this area
- tracking flaky tests to be able to easily identify them
But once done, that will help the whole MPTCP community, and I will be able to focus on other things!
Team work
It is important to note that what I presented here so far is mostly what I’m
working on. But I’m not alone in this project. For example, Paolo is looking at
supporting TCP_NOTSENT_LOWAT
with MPTCP + fixing various issues – especially
one only affecting v6.1 and reported by Christoph –, Mat helped a lot with the
reviews, Geliang is looking at validating MPTCP BPF tests and improving the
userspace path-manager, etc. A great community!