Let's explicitly order these against initrd-switch-root.target, so
that they are properly shut down before we switch root. Otherwise,
there's a race condition where networkd might only shut down after
switching root and after we've already we've loaded the unit graph,
meaning it won't be restarted in the rootfs.
Fixes#27718
These are all services that valid to be run in the initrd, so let's
make sure they have the appropriate dependencies on
initrd-switch-root.target so that they are stopped when we're about
to switch root.
Note that this drops ProtectProc=invisible from
systemd-resolved.service.
This is done because othewise access to the booted "kernel" command line is not
necessarily available. That's because in containers we want to read
/proc/1/cmdline for that.
Fixes: #24103
79a67f3ca4 pulled systemd-resolved.service
in from basic.target instead of multi-user.target, i.e. the idea is to
make it an early boot service, instead of a regular service.
However, early boot services are supposed to be in sysinit.target, not
basic.target (the latter is just one that combines the early boot
services in sysinit.target, the sockets in sockets.targt, the mounts in
local-fs.target and so on into one big target).
Also, the comit actually didn't add a synchronization point, i.e. not
Before=, so that the whole thing was racy.
Let's fix all that.
Follow-up for 79a67f3ca4
This ordering existed since resolved was first created, but there should
not be any need to order the two services against each other, as
resolved should be able to pick up networkd DNS metadata either way (as
it works with inotify in /run).
Let's drop this hence, and not cargo-cult this to eternity
Also see: https://github.com/systemd/systemd/pull/22389#issuecomment-1045978403
In the olden days systemd-resolved used dbus and it didn't make sense to start
it before dbus which is started fairly late. But we have mostly ported resolved
over to varlink. The queries from nss-resolve are done using varlink, so name
resolution can work without dbus. resolvectl still uses dbus, so e.g. 'resolvectl
query' will not work, but by starting systemd-resolved earlier we're not making this
any worse.
If systemd-resolved is started after dbus, it registers the name and everything
is fine. If it is started before dbus, it'll watch for the dbus socket and
connect later. So it should be fine to start systemd-resolved earlier. (If dbus
is stopped and restarted, unfortunately systemd-resolved does not reconnect.
This seems to be a small bug: since our daemons know how to watch for
dbus.socket, they could restart the watch if they ever lose the connection. But
this scenario shouldn't happen in normal boot, and restarting dbus is not
supported anyway.)
Moving the start earlier the following advantages:
- name resolution becomes availabe earlier, in particular for synthesized
hostnames even before the network is up.
- basic.target is part of initrd.target, so systemd-resolved will get started
in the initrd if installed. This is required for nfs-root when the server is
specified using a name (https://bugzilla.redhat.com/show_bug.cgi?id=2037311).
We don't need two (and half) templating systems anymore, yay!
I'm keeping the changes minimal, to make the diff manageable. Some enhancements
due to a better templating system might be possible in the future.
For handling of '## ' — see the next commit.
Add `ProtectClock=yes` to systemd units. Since it implies certain
`DeviceAllow=` rules, make sure that the units have `DeviceAllow=` rules so
they are still able to access other devices. Exclude timesyncd and timedated.
We set ProtectKernelLogs=yes on all long running services except for
udevd, since it accesses /dev/kmsg, and journald, since it calls syslog
and accesses /dev/kmsg.
As discussed on systemd-devel [1], in Fedora we get lots of abrt reports
about the watchdog firing [2], but 100% of them seem to be caused by resource
starvation in the machine, and never actual deadlocks in the services being
monitored. Killing the services not only does not improve anything, but it
makes the resource starvation worse, because the service needs cycles to restart,
and coredump processing is also fairly expensive. This adds a configuration option
to allow the value to be changed. If the setting is not set, there is no change.
My plan is to set it to some ridiculusly high value, maybe 1h, to catch cases
where a service is actually hanging.
[1] https://lists.freedesktop.org/archives/systemd-devel/2019-October/043618.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1300212
ProtectHostname= turns off hostname change propagation from host to
service. This means for services that care about the hostname and need
to be able to notice changes to it it's not suitable (though it is
useful for most other cases still).
Let's turn it off hence for journald (which logs the current hostname)
for networkd (which optionally sends the current hostname to dhcp
servers) and resolved (which announces the current hostname via
llmnr/mdns).
Previously, setting this option by default was problematic due to
SELinux (as this would also prohibit the transition from PID1's label to
the service's label). However, this restriction has since been lifted,
hence let's start making use of this universally in our services.
On SELinux system this change should be synchronized with a policy
update that ensures that NNP-ful transitions from init_t to service
labels is permitted.
An while we are at it: sort the settings in the unit files this touches.
This might increase the size of the change in this case, but hopefully
should result in stabler patches later on.
Fixes: #1219
This is generally the safer approach, and is what container managers
(including nspawn) do, hence let's move to this too for our own
services. This is particularly useful as this this means the new
@system-service system call filter group will get serious real-life
testing quickly.
This also switches from firing SIGSYS on unexpected syscalls to
returning EPERM. This would have probably been a better default anyway,
but it's hard to change that these days. When whitelisting system calls
SIGSYS is highly problematic as system calls that are newly introduced
to Linux become minefields for services otherwise.
Note that this enables a system call filter for udev for the first time,
and will block @clock, @mount and @swap from it. Some downstream
distributions might want to revert this locally if they want to permit
unsafe operations on udev rules, but in general this shiuld be mostly
safe, as we already set MountFlags=shared for udevd, hence at least
@mount won't change anything.
On systems that only use resolved for name resolution, there are usecases that
require resolved to be started before sysinit target, such that network name
resolution is available before network-online/sysinit targets. For example,
cloud-init for some datasources hooks into the boot process ahead of sysinit
target and may need network name resolution at that point already.
systemd-resolved already starts pretty early in the process, thus starting it
slightly earlier should not have negative side effects.
However, this depends on resolved ability to connect to system DBus once that
is up.
Also, rename ProtectedHome= to ProtectHome=, to simplify things a bit.
With this in place we now have two neat options ProtectSystem= and
ProtectHome= for protecting the OS itself (and optionally its
configuration), and for protecting the user's data.
ReadOnlySystem= uses fs namespaces to mount /usr and /boot read-only for
a service.
ProtectedHome= uses fs namespaces to mount /home and /run/user
inaccessible or read-only for a service.
This patch also enables these settings for all our long-running services.
Together they should be good building block for a minimal service
sandbox, removing the ability for services to modify the operating
system or access the user's private data.