This commit introduces a build-time option to enable/disable sysupdated
separately from sysupdate. 'auto' translated to enabled by default in
developer builds.
Closes#28367 (but not really in the exact form, see below)
We have the problem of restarting all user manager instances
after upgrade. Current approaches involve systemctl kill
with SIGRTMIN+25, which is async and feels rather ugly [1][2];
or systemctl --machine=user@ --user, which requires entering
each user session. Neither is particularly elegant.
Instead, let's just signal daemon-reexec when user@.service
is reloaded from system manager. Our long goal of dropping
daemon-reload in favor of reexec (see TODO) is unlikely to happen
due to user dbus restrictions, but here the synchronization
is done via READY=1.
[1] https://gitlab.archlinux.org/archlinux/packaging/packages/systemd/-/blob/main/systemd.install?ref_type=heads#L37
[2] https://salsa.debian.org/systemd-team/systemd/-/blob/debian/master/debian/systemd.postinst#L24#28367 would not really work for us now I come to think about it,
because all processes will be reparented to pid1 as soon as
original user manager process exits. This alternative approach
seems good enough for our use case.
Add support for opening /dev/hidraw devices via logind's TakeDevice().
Same semantics as our support for evdev devices, but it requires the
HIDIOCREVOKE ioctl in the kernel.
tmpfiles might be linking the configuration for ldconfig into /etc
so make sure it runs after it so that the configuration is guaranteed
to be in place.
The configuration files required by ldconfig could be put into
place by systemd-confext.service (ldconfig only looks in /etc) so
let's order the service after systemd-confext.service to make sure
any config files are in place before the service runs.
Monitor the sysctl set by networkd for writes, if a sysctl is
overwritten with a different value than the one we set, emit a warning.
Writes are detected with an eBPF program attached as BPF_CGROUP_SYSCTL
which reports the sysctl writes only in net/.
The eBPF program only reports sysctl writes from a different cgroup than networkd.
To do this, it uses the `bpf_current_task_under_cgroup_proto()` helper,
which will be available allowed in BPF_CGROUP_SYSCTL from kernel 6.12[1].
Loading a BPF_CGROUP_SYSCTL program requires the CAP_SYS_ADMIN capability,
so drop it just after the program load, whether it loads successfully or not.
Writes are logged but permitted, in future the functionality can be
extended to also deny writes to managed sysctls.
[1] https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com/
Linux kernel v4.18 (2018-08-12) added user-namespace support to FUSE, and
bumped the FUSE version to 7.27 (see: da315f6e0398 (Merge tag
'fuse-update-4.18' of
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse, Linus Torvalds,
2018-06-07). This means that on such kernels it is safe to enable FUSE in
nspawn containers.
In outer_child(), before calling copy_devnodes(), check the FUSE version to
decide whether enable (>=7.27) or disable (<7.27) FUSE in the container. We
look at the FUSE version instead of the kernel version in order to enable FUSE
support on older-versioned kernels that may have the mentioned patchset
backported ([as requested by @poettering][1]). However, I am not sure that
this is safe; user-namespace support is not a documented part of the FUSE
protocol, which is what FUSE_KERNEL_VERSION/FUSE_KERNEL_MINOR_VERSION are meant
to capture. While the same patchset
- added FUSE_ABORT_ERROR (which is all that the 7.27 version bump
is documented as including),
- bumped FUSE_KERNEL_MINOR_VERSION from 26 to 27, and
- added user-namespace support
these 3 things are not inseparable; it is conceivable to me that a backport
could include the first 2 of those things and exclude the 3rd; perhaps it would
be safer to check the kernel version.
Do note that our get_fuse_version() function uses the fsopen() family of
syscalls, which were not added until Linux kernel v5.2 (2019-07-07); so if
nothing has been backported, then the minimum kernel version for FUSE-in-nspawn
is actually v5.2, not v4.18.
Pass whether or not to enable FUSE to copy_devnodes(); have copy_devnodes()
copy in /dev/fuse if enabled.
Pass whether or not to enable FUSE back over fd_outer_socket to run_container()
so that it can pass that to append_machine_properties() (via either
register_machine() or allocate_scope()); have append_machine_properties()
append "DeviceAllow=/dev/fuse rw" if enabled.
For testing, simply check that /dev/fuse can be opened for reading and writing,
but that actually reading from it fails with EPERM. The test assumes that if
FUSE is supported (/dev/fuse exists), then the testsuite is running on a kernel
with FUSE >= 7.27; I am unsure how to go about writing a test that validates
that the version check disables FUSE on old kernels.
[1]: https://github.com/systemd/systemd/issues/17607#issuecomment-745418835Closes#17607
In 924453c225
ProtectHome was set to true for systemd-coredump in order to reduce risk, since an attacker could craft a malicious binary in order to compromise systemd-coredump.
At that point the object analysis was done in the main systemd-coredump process.
Because of this systemd-coredump is unable to product symbolicated call-stacks for binaries running under /home ("n/a" is shown instead of function names).
However, later in 61aea456c1 systemd-coredump was changed to do the object analysis in a forked process,
covering those security concerns.
Let's set ProtectHome to read-only so that systemd-coredump produces symbolicated call-stacks for processes running under /home.
This flag was added in db6aedab92 with the justification that locale
environment variables should be preserved by the user session. However,
the companion patch to drop the UnsetEnvironment= directive blocking
these variables was never merged, so the intended change was never
effected.
While the patch was ineffective toward its stated goal, the "-p" option
does have material negative consequences for the user session in
systemd — environment variables to support the use of
credentials and memory pressure directives, such as
$CREDENTIALS_DIRECTORY and $MEMORY_PRESSURE_WATCH, which are now
directly used by agetty and login, get leaked into the user session
potentially breaking applications that rely on these values.
E.g. systemd-ask-password fails from the tty when $CREDENTIALS_DIRECTORY
has been leaked from agetty, because it expects to be able to access
credentials in $CREDENTIALS_DIRECTORY.
This effectively reverts db6aedab92.
References: db6aedab92 (units: Tell login to preserve environment (#6023), 2017-05-24)
Let's always rely on our own TTY reset logic and tty disallocation/clear
screen logic, thus always pass --noclear and --noreset.
Also, bring the list of baud rates to try into sync for console-getty
and serial-getty (the former might or might not be connected to rs232,
we can't know, hence assume the worst, and copy what
serial-getty@.service does)
* Remove extra period at end of unit description.
Having an extra period at the end of this unit description makes log entries pertaining to it appear weirdly, as it seems the default expectation is that there is not to be a period at the end of a unit description.
e.g.: `systemd[1]: Started Displays emergency message in full screen..`
Let's make sure logind is accessible by the time user@.service runs, and
that logind stays around as long as it does so.
Addresses an issue reported here:
https://lists.freedesktop.org/archives/systemd-devel/2024-June/050468.html
This addresses an issued introduced by
278e815bfa, which dropped the a dependency
from user@.service systemd-user-sessions.service without replacement.
While dropping that dependency does make sense, it should have been
replaced with the weaker dependency on systemd-logind.service, hence fix
that now.
user@.service is after all a logind concept, hence logind really should
be around for its lifetime.
systemd-user-sessions.service is a later milestone that only really
should apply to regular users (not root), hence it's too strong a
requirement.
Previously, importd was only accessible via D-Bus, which required it to
be a late boot service. Now that we have Varlink we can rearrange things
to become early-boot activated, just after the image directories are
mounted.
This will later allow us to have generator that auto-downloads images on
boot.
Historically, systemd-tmpfiles was designed to manager temporary
files, but nowadays it has become a generic tool for managing
all kinds of files. To avoid user confusion, let's remove "temporary"
from the tool's description.
As discussed in #33349
The TPM might be password/pin protected for various reasons even if
there is no SRK yet. Let's handle those cases gracefully instead of
failing the unit as it is enabled by default.
Enabling this service by default means every CI image without a
regular user now gets stuck on first boot due to the password prompt
from systemd-homed-firstboot.service. Let's not enable the service
by default but instead require users to enable it explicitly if they
want its behavior.
Fixes#33249
A unit with StandardOutput=journal (the default) will get its stdout/stderr sockets
disconnected when journald stops, as the file descriptors on journald's side are
not preserved (it works on restart, as the FD Store keeps them open during restarts).
Set FileDescriptorStorePreserve=yes so that the journal FD's stay open during a soft
reboot, and applications don't get broken stdout/stderr.
After soft-reboot, /var/log/journal may be initially read-only,
and becomes writable a bit later. In such case, runtime journal is
initially opened by journald. Hence, we need to flush to /var when it is
ready.
Typically, soft-reboot.target is never reached. So, without this change,
systemd-journald may be killed by PID1 on soft-reboot, and may cause
journal corruption.
This reverts commit 4263d7617f.
Still I think this is the way to go. But the change was merged after -rc2,
and still discussion is continued. So, at least now let's revert it,
and do that after v256-final is released if approved.
Otherwise, at the time systemd-soft-reboot.service succeeds,
services which has Conflicts= and Before=soft-reboot.target may
not be stopped yet, and may be SIGKILLed.
Especially, systemd-journald.service has the dependencies, thus
journal may be corrupted. See #32223.
Follow-up for 13ffc60749.
Fixes#32834.
The service deos not have DefaultDependencies=no. Hence it has dependencies
of shutdown.target, and dependencies of soft-reboot.target are not
necessary.
Follow-up for f89985ca49.
Otherwise, if a service unit that requests LogNamespace= stopped before
systemd-journald@.service is started, logs generated by the service will be
lost, as systemd-journald@.socket is stopped and
systemd-journald@.service will never started.
To prevent the issue, let's introduce another implicit dependency to
a oneshot service that explicitly synchronizes a namespaced journal file
when the log namespace is not needed anymore.
Fixes#32604.
It's ordered with networkd, but just in case. Lintian complains:
W: systemd: systemd-service-file-shutdown-problems [usr/lib/systemd/system/systemd-networkd-persistent-storage.service]
Follow-up for 91676b6458
Right now systemd-tpm2-setup-early and systemd-pcrphase-initrd.service
are not ordered against each other. However, they require the same slow
resource to operate: the TPM2. If we allow them to access the device
simultaneously, the kernel resource manager like has to save/restore TPM
state while they operate, slowing things down further.
hence, let's avoid all this mess, and just order them against each other
so that the shared resource is first used in full by one and then by the
other.
I opted to order systemd-pcrphase-initrd before
systemd-tpm2-setup-early, since there's value in having the former as
early as possible in userspace, to be a good marker for the transition
from kernel to first userspace. I can see no benefit in the opposite
order however.
This mimics what we do for systemd-cryptsetup@.service (see
src/shared/generator.c), and makes sense since repart might lock up the
root volume against a TPM, which ideally has its SRK already set up by
then.
More importantly though, this ensures that we ordered correctly after
tpm2.target (which systemd-tpm2-setup-early.service has a dependency
on), for systems where the TPM drivers are not compiled into the kernel.
See: https://lists.freedesktop.org/archives/systemd-devel/2024-April/050201.html
This adds a small, socket-activated Varlink daemon that can delegate UID
ranges for user namespaces to clients asking for it.
The primary call is AllocateUserRange() where the user passes in an
uninitialized userns fd, which is then set up.
There are other calls that allow assigning a mount fd to a userns
allocated that way, to set up permissions for a cgroup subtree, and to
allocate a veth for such a user namespace.
Since the UID assignments are supposed to be transitive, i.e. not
permanent, care is taken to ensure that users cannot create inodes owned
by these UIDs, so that persistancy cannot be acquired. This is
implemented via a BPF-LSM module that ensures that any member of a
userns allocated that way cannot create files unless the mount it
operates on is owned by the userns itself, or is explicitly
allowelisted.
BPF LSM program with contributions from Alexei Starovoitov.
stale HibernateLocation EFI variable
Currently, if the HibernateLocation EFI variable exists,
but we failed to resume from it, the boot carries on
without clearing the stale variable. Therefore, the subsequent
boots would still be waiting for the device timeout,
unless the variable is purged manually.
There's no point to keep trying to resume after a successful
switch-root, because the hibernation image state
would have been invalidated by then. OTOH, we don't
want to clear the variable prematurely either,
i.e. in initrd, since if the resume device is the same
as root one, the boot won't succeed and the user might
be able to try resuming again. So, let's introduce a
unit that only runs after switch-root and clears the var.
Fixes#32021