We can't use the systemd-journal-upload user here, since it's created
dynamically by DynamicUser=yes. However, we can use the group specified
in SupplementaryGroups=, so do exactly that.
Turns out that redirecting a lot of output to the console can have some
funny effects, like random kernel soft lockups. I spotted this in
various CIs, but it remained almost entirely hidden thanks to
`softlockup_panic=1`, until 1a36d2672f which introduced a couple of
tests that log quite a lot in a short amount of time. This, in
combination with newer kernel version, which, for some reason, seem to
be more susceptible to such soft lockups, made the Arch Linux jobs soft
lockup quite a lot, see [0].
While debugging this I also noticed that runs which don't redirect
stdout/stderr to the console are noticeably faster, e.g.:
# TEST-71 nspawn + QEMU (KVM), StandardOutput=journal+console
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.64
# TEST-71 nspawn + QEMU (KVM), StandardOutput=journal
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.95
# TEST-71 nspawn + QEMU, StandardOutput=journal+console
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:04.70
# TEST-71 nspawn + QEMU, StandardOutput=journal
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:44.48
# TEST-04 QEMU, StandardOutput=journal+console
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:22.70
# TEST-04 QEMU, StandardOutput=console
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:04.67
Given all this, let's effectively revert ba7abf79a5, and dump the
testsuite-related journal messages only after the test finishes, so they
don't go through the slow console.
Resolves: systemd/systemd-centos-ci#660
[0] https://github.com/systemd/systemd-centos-ci/issues/660
Let's save all journals from the test machine instead of calling export
on each journal file separately, which makes the code less complicated
(and probably faster).
I'm pretty sure this is not the only case, but it's the one I recently
noticed. Even though we call ddebug() from a function, that function is
called before ddebug() is defined, resulting in the same issue as if we
called just ddebug() in its place, i.e.:
..//test-functions: line 276: ddebug: command not found
Currently the test works only with policy shipped by Fedora, which makes
it pretty much useless in most of our CIs. Let's drop the custom module
and make the test more generic, so it works with the refpolicy as well,
which should allow us to run it on Arch and probably even in Ubuntu CI.
Let's just rely on the word splitting done by bash instead of messing
with that ourselves, as it's just adding extra complexity to appease one
ShellCheck check. Also, this apparently never worked for the nspawn
stuff anyway, since I forgot to set $IFS to an appropriate value, so it
always put all arguments from $KERNEL_APPEND into a single array item
with an extra newline, which then made systemd sad:
~# readarray arr <<< "foo bar baz"; for i in "${arr[@]}"; do echo "'$i'"; done
'foo bar baz
'
~# make -C test/TEST-45-TIMEDATE/ clean setup run BUILD_DIR=$PWD/build TEST_NO_QEMU=1 KERNEL_APPEND="systemd.log_level=console"
...
~# journalctl -o short-monotonic --no-hostname --file /var/tmp/systemd-tests/systemd-test.XaDX67/system.journal --grep "Failed to parse" -p info --no-pager
[551138.986882] systemd-tmpfiles[21]: Failed to parse log level 'console
[551138.987179] systemd-remount-fs[20]: Failed to parse log level 'console
[551138.993125] systemd-sysusers[23]: Failed to parse log level 'console
[551138.998685] journalctl[29]: Failed to parse log level 'console
Resolves: #29945
Let's put this back in, as it could help with occasional machine lock ups
on overloaded systems (and it didn't help with the original issue
anyway).
This reverts commit 3a89904e45.
On slower/overloaded systems it may take a bit for the swtpm socket
to show up:
I: Started swtpm as PID 189419 with state dir /tmp/tmp.pWqUutuGUj
I: Configured emulated TPM2 device tpm-spapr
+ tee /var/tmp/systemd-test-TEST-70-TPM2_1/console.log
+ timeout --foreground 1200 /bin/qemu-system-ppc64le -smp 4 ...
qemu-system-ppc64le: -chardev socket,id=chrtpm,path=/tmp/tmp.pWqUutuGUj/sock: Failed to connect to '/tmp/tmp.pWqUutuGUj/sock': No such file or directory
E: qemu failed with exit code 1
Spotted regularly in the ppc64le cron job and in some Ubuntu CI/CentOS CI
pr runs [0].
[0] https://github.com/systemd/systemd/pull/29183#issuecomment-1721727927
We can't do anything about them anyway, and most importantly this seems
to alleviate systemd/systemd-centos-ci#660, which should make the CIs
a bit less angry (at least until the issue is addressed properly).
The pipe stuff introduced in 701e0c2660 causes nspawn to switch the
console from 'interactive' into 'read-only' which is a bit useless when
debugging. Let's set --console=interactive explicitly in such case.
Follow-up to 701e0c2660.
This metadata (EXTENSION_RELOAD_MANAGER) can be set to "1" to reload the manager
when merging/refreshing/unmerging a system extension image. This can be useful in case the sysext
image provides systemd units that need to be loaded.
With `--no-reload`, one can deactivate the EXTENSION_RELOAD_MANAGER metadata interpretation.
Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
Same scenario as with libsystemd - ldd might use unprefixed RPATH, and
we install our own stuff into the image unconditionally anyway.
Also, bail out early if we hit a missing DSO with a possibly helpful
message.
This extends the test framework a bit, and allows adding additional
initrds to the qemu invocation, which we use here to place credentials
in the new /run/systemd/@initrd/ credentials dir which are then passed
to the host.
So we're able to detect memory leaks in our NSS modules.
An example after introducing a memory leak in nss-myhostname.c:
testsuite-71.sh[2881]: =================================================================
testsuite-71.sh[2881]: ==2880==ERROR: LeakSanitizer: detected memory leaks
testsuite-71.sh[2881]: Direct leak of 2 byte(s) in 1 object(s) allocated from:
testsuite-71.sh[2881]: #0 0x7fa28907243b in strdup (/usr/lib64/libasan.so.8.0.0+0x7243b)
testsuite-71.sh[2881]: #1 0x7fa286a7bc10 in gethostname_full ../src/basic/hostname-util.c:67
testsuite-71.sh[2881]: #2 0x7fa286a74af9 in gethostname_malloc ../src/basic/hostname-util.h:24
testsuite-71.sh[2881]: #3 0x7fa286a756f4 in _nss_myhostname_gethostbyname4_r ../src/nss-myhostname/nss-myhostname.c:79
testsuite-71.sh[2881]: #4 0x7fa288f17588 in getaddrinfo (/lib64/libc.so.6+0xf4588)
testsuite-71.sh[2881]: #5 0x7fa2890a4d93 in __interceptor_getaddrinfo.part.0 (/usr/lib64/libasan.so.8.0.0+0xa4d93)
testsuite-71.sh[2881]: #6 0x55a54b2b7159 in ahosts_keys_int.part.0 (/usr/bin/getent.orig+0x4159)
testsuite-71.sh[2881]: SUMMARY: AddressSanitizer: 2 byte(s) leaked in 1 allocation(s).
As hitting an ASan/UBSan error in PID1 results in a crash (and a kernel
panic when running under qemu), we usually lose the stack trace which
makes debugging quite painful. Let's mitigate this by forwarding the
stack trace to multiple places - namely to a file and the syslog.
Let's drop stuff from the current $BUILD_DIR from the final coverage
report, as it's all generated files (mostly gperf) which we don't
really care about and it makes the Coveralls report confusing, since it
reports "source not available" for all such files.
When running on Debian/Ubuntu, I get a minute delay or so on every boot
because the local initramfs tries to resume from hibernation. This is
not really useful here, so always skip it
This gets rid of the all-but-one remaining uses of perl. I tested the new code
on my machine, so I'm fairly confident that it works as expected.
install_iscsi() has one majestic perl invocation, but we can't get rid of it
easily: it extends the code of tgt-admin to print some list of files. Obviously
this only works because tgt-admin is written in perl, and perl will be installed
if tgt-admin is installed. install_iscsi() is used in TEST-64-UDEV-STORAGE
conditionally if tgtadm is installed, so this can stay as is.
Stripping the binaries in the test images makes potential stack straces
quite useless, so let's drop the stripping stuff to make test fails a bit
more developer friendly.
Related: https://github.com/systemd/systemd-centos-ci/pull/616
Let's make the dropin, to make the build dir writable for gcov, a bit
more generic, so it can be used by all units starting with prefix test-.
This should help with a bunch of recent reports about missing coverage I
got, as well as with existing test units using DynamicUser=true.
This might feel a bit like a magic trick from behind the curtains, but I
want to touch the actual tests as little as possible, since it makes them
unnecessarily messy (see the various workarounds for sanitizers), and
the coverage reports are generated only in a specific CI job anyway.
systemd-repart needs to find mkfs.ext4 for the test.
This is located in the directory /usr/sbin on openSUSE Tumbleweed.
But since the variable ALWAYS_SET_PATH in /etc/login.defs is set to yes,
su re-initializes the $PATH variable and removes /usr/sbin.
Hence, mkfs.ext4 is not found and the test fails.
Using setpriv instead of su fixes this issue and is more appropriate to
do the switch user task from root.
[zjs: move setpriv to $BASICTOOLS and force-push to retrigger CI]
This is useful to identify log messages with metadata from the images
they run on. Look for ID/VERSION_ID/IMAGE_ID/IMAGE_VERSION/BUILD_ID,
with a SYSEXT_ prefix if we are looking at an extension, and append via
LogExtraFields= as respectively PORTABLE_NAME_AND_VERSION= in case of a
single image. In case of extensions, append as PORTABLE_ROOT_NAME_AND_VERSION=
for the base and one PORTABLE_EXTENSION_AND_VERSION= for each extension.
Example with a base and two extensions, with the unit coming from the
first extension:
[Service]
RootImage=/home/bluca/git/systemd/base.raw
Environment=PORTABLE=app0.raw
BindReadOnlyPaths=/etc/os-release:/run/host/os-release
LogExtraFields=PORTABLE=app0.raw
Environment=PORTABLE_ROOT=base.raw
LogExtraFields=PORTABLE_ROOT=base.raw
LogExtraFields=PORTABLE_ROOT_NAME_AND_VERSION=debian_10
ExtensionImages=/home/bluca/git/systemd/app0.raw
LogExtraFields=PORTABLE_EXTENSION=app0.raw
LogExtraFields=PORTABLE_EXTENSION_NAME_AND_VERSION=app_0
ExtensionImages=/home/bluca/git/systemd/app1.raw
LogExtraFields=PORTABLE_EXTENSION=app1.raw
LogExtraFields=PORTABLE_EXTENSION_NAME_AND_VERSION=app_1
When testing the binaries from the host, make sure to not store the state data
below /usr but use a dedicated directory in /var/tmp/ instead.
The working directories of the tests, initially located in /var/tmp, are also
moved in a dedicated directory /var/tmp/systemd-tests.
I noticed that our coverage reports miss some files completely - this
happens when the test doesn't touch the code in them at all, so the
generated coverage data (and resulting reports) have no information
about them. Let's fix this by doing an initial zero coverage capture
that contains a zeroed counter for every instrumented line in every
file, so when we later merge it with a capture from the test, it shows up
with a missing coverage instead of not showing at all.
/usr/lib/systemd/tests may contain more than the unit tests. For example on
SUSE we also install the integration tests there.
Putting the unit tests in a dedicated directory named 'unit-tests' makes the
layout cleaner.
Note that `run-unit-tests.py` has not been moved so we don't need to adjust
(Fedora) packaging and users also don't need to descend into the subdirectory.
Let's attempt to reduce the amount of flakes further when the AWS region
we run in is under heavy load and the hypervisor stars stealing our CPU
time.
Follow-up to e0cbb73911 and c78d18215b.
(The one case that is left unchanged is '< <(subcommand)'.)
This way, the style with no gap was already dominant. This way, the reader
immediately knows that ' < ' is a comparison operator and ' << ' is a shift.
In a few cases, replace custom EOF replacement by just EOF. There is no point
in using someting like "_EOL" unless "EOF" appears in the text.
On Arch both delv and dig pull in libnss_resolve:
```
$ grep resolve /etc/nsswitch.conf
hosts: mymachines resolve [!UNAVAIL=return] files myhostname dns
```
Since c78d18215b D-Bus services now have 60s to start, but the client
side (sd-bus) still waits only for 25s before giving up:
```
[ 226.196380] testsuite-71.sh[556]: + assert_in 'Static hostname: H' ''
[ 226.332965] testsuite-71.sh[576]: + set +ex
[ 226.332965] testsuite-71.sh[576]: FAIL: 'Static hostname: H' not found in:
[ 228.910782] sh[577]: + systemctl poweroff --no-block
[ 232.255584] hostnamectl[565]: Failed to query system properties: Connection timed out
[ 236.827514] systemd[1]: end.service: Consumed 2.131s CPU time.
[ 237.476969] dbus-daemon[566]: [system] Successfully activated service 'org.freedesktop.hostname1'
[ 237.516308] systemd[1]: system-modprobe.slice: Consumed 1.533s CPU time.
[ 237.794635] systemd[1]: testsuite-71.service: Main process exited, code=exited, status=1/FAILURE
[ 237.818469] systemd[1]: testsuite-71.service: Failed with result 'exit-code'.
[ 237.931415] systemd[1]: Failed to start testsuite-71.service.
[ 238.000833] systemd[1]: testsuite-71.service: Consumed 5.651s CPU time.
[ 238.181030] systemd[1]: Reached target testsuite.target.
```
Let's override the timeout in sd-bus as well to mitigate this.
Follow-up to c78d18215b.