Monitor the sysctl set by networkd for writes, if a sysctl is
overwritten with a different value than the one we set, emit a warning.
Writes are detected with an eBPF program attached as BPF_CGROUP_SYSCTL
which reports the sysctl writes only in net/.
The eBPF program only reports sysctl writes from a different cgroup than networkd.
To do this, it uses the `bpf_current_task_under_cgroup_proto()` helper,
which will be available allowed in BPF_CGROUP_SYSCTL from kernel 6.12[1].
Loading a BPF_CGROUP_SYSCTL program requires the CAP_SYS_ADMIN capability,
so drop it just after the program load, whether it loads successfully or not.
Writes are logged but permitted, in future the functionality can be
extended to also deny writes to managed sysctls.
[1] https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com/
Now, networkd accesses the state directory through the file descriptor
passed from systemd-networkd-persistent-storage.service.
Hence, the networkd itself does not need to access the state directory
through its path, and we can use more stronger mode for ProtectSystem=.
ProtectSystem=full remounts /boot and/or /efi read-only, but that
may trigger automount for the paths and delay the service being started.
===
systemd[1]: boot.automount: Got automount request for /boot, triggered by 720 ((networkd))
===
The service does not need to access the paths, so let's hide them.
Follow-up for f90eb08627.
Fixes#31742.
Then, this introduces systemd-networkd-persistent-storage.service.
systemd-networkd.service is an early starting service. So, at the time
it is started, the persistent storage for the service may not be ready,
and we cannot use StateDirectory=systemd/network in
systemd-networkd.service.
The newly added systemd-networkd-persistent-storage.service creates the
state directory for networkd, and notify systemd-networkd that the
directory is usable.
Let's explicitly order these against initrd-switch-root.target, so
that they are properly shut down before we switch root. Otherwise,
there's a race condition where networkd might only shut down after
switching root and after we've already we've loaded the unit graph,
meaning it won't be restarted in the rootfs.
Fixes#27718
These are all services that valid to be run in the initrd, so let's
make sure they have the appropriate dependencies on
initrd-switch-root.target so that they are stopped when we're about
to switch root.
It is used by udevd and networkd. Since udevd is enabled statically, let's also
change the preset to "on". networkd is opt-in, so let's pull in the generator
when enabling networkd too.
In general, it's not very usuful to repeat the unit name as the description.
Especially when the word is a common name and if somebody doesn't understand
the meaning immediately, they are not going to gain anything from the
repeat either, e.g. "halt", "swap".
In the status-unit-format=combined output parentheses are used around
Description, so avoid using parenthesis in the Description itself.
We don't need two (and half) templating systems anymore, yay!
I'm keeping the changes minimal, to make the diff manageable. Some enhancements
due to a better templating system might be possible in the future.
For handling of '## ' — see the next commit.
Add After=systemd-networkd.socket to avoid a race condition and networkd
falling back to the non-socket activation code.
Also add Wants=systemd-networkd.socket, so the socket is started when
networkd is started via `systemctl start systemd-networkd.service`.
A Requires is not strictly necessary, as networkd still ships the
non-socket activation code. Should this code be removed one day, the
Wants should be bumped to Requires accordingly.
See also 5544ee8516.
Fixes: #16809
Add `ProtectClock=yes` to systemd units. Since it implies certain
`DeviceAllow=` rules, make sure that the units have `DeviceAllow=` rules so
they are still able to access other devices. Exclude timesyncd and timedated.
We set ProtectKernelLogs=yes on all long running services except for
udevd, since it accesses /dev/kmsg, and journald, since it calls syslog
and accesses /dev/kmsg.
As discussed on systemd-devel [1], in Fedora we get lots of abrt reports
about the watchdog firing [2], but 100% of them seem to be caused by resource
starvation in the machine, and never actual deadlocks in the services being
monitored. Killing the services not only does not improve anything, but it
makes the resource starvation worse, because the service needs cycles to restart,
and coredump processing is also fairly expensive. This adds a configuration option
to allow the value to be changed. If the setting is not set, there is no change.
My plan is to set it to some ridiculusly high value, maybe 1h, to catch cases
where a service is actually hanging.
[1] https://lists.freedesktop.org/archives/systemd-devel/2019-October/043618.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1300212
The code supports SIGTERM and SIGINT to termiante the process. It would
be possible to reporpose one of those signals for the restart operation,
but I think it's better to use a completely different signal to avoid
misunderstandings.
While the need for access to character devices can be tricky to determine for
the general case, it's obvious that most of our services have no need to access
block devices. For logind and timedated this can be tightened further.
ProtectHostname= turns off hostname change propagation from host to
service. This means for services that care about the hostname and need
to be able to notice changes to it it's not suitable (though it is
useful for most other cases still).
Let's turn it off hence for journald (which logs the current hostname)
for networkd (which optionally sends the current hostname to dhcp
servers) and resolved (which announces the current hostname via
llmnr/mdns).
Previously, setting this option by default was problematic due to
SELinux (as this would also prohibit the transition from PID1's label to
the service's label). However, this restriction has since been lifted,
hence let's start making use of this universally in our services.
On SELinux system this change should be synchronized with a policy
update that ensures that NNP-ful transitions from init_t to service
labels is permitted.
An while we are at it: sort the settings in the unit files this touches.
This might increase the size of the change in this case, but hopefully
should result in stabler patches later on.
Fixes: #1219
This reverts commit d4e9e574ea.
(systemd.conf.m4 part was already reverted in 5b5d82615011b9827466b7cd5756da35627a1608.)
Together those reverts should "fix" #10025 and #10011. ("fix" is in quotes
because this doesn't really fix the underlying issue, which is combining
DynamicUser= with strict container sandbox, but it avoids the problem by not
using that feature in our default installation.)
Dynamic users don't work well if the service requires matching configuration in
other places, for example dbus policy. This is true for those three services.
In effect, distros create the user statically [1, 2]. Dynamic users make more
sense for "add-on" services where not creating the user, or more precisely,
creating the user lazily, can save resources. For "basic" services, if we are
going to create the user on package installation anyway, setting DynamicUser=
just creates unneeded confusion. The only case where it is actually used is
when somebody forgets to do system configuration. But it's better to have the
service fail cleanly in this case too. If we want to turn on some side-effect
of DynamicUser=yes for those services, we should just do that directly through
fine-grained options. By not using DynamicUser= we also avoid the need to
restart dbus.
[1] bd9bf30727
[2] 48ac1cebde/f/systemd.spec (_473)
(Fedora does not create systemd-timesync user.)
This is generally the safer approach, and is what container managers
(including nspawn) do, hence let's move to this too for our own
services. This is particularly useful as this this means the new
@system-service system call filter group will get serious real-life
testing quickly.
This also switches from firing SIGSYS on unexpected syscalls to
returning EPERM. This would have probably been a better default anyway,
but it's hard to change that these days. When whitelisting system calls
SIGSYS is highly problematic as system calls that are newly introduced
to Linux become minefields for services otherwise.
Note that this enables a system call filter for udev for the first time,
and will block @clock, @mount and @swap from it. Some downstream
distributions might want to revert this locally if they want to permit
unsafe operations on udev rules, but in general this shiuld be mostly
safe, as we already set MountFlags=shared for udevd, hence at least
@mount won't change anything.
The daemon requires the busname unit to operate (on kdbus systems),
since it contains the policy that allows it to acquire its service
name.
This fixes https://bugs.freedesktop.org/show_bug.cgi?id=90287
We will be woken up on rtnl or dbus activity, so let's just quit if some time has passed and that is the only thing that can happen.
Note that we will always stay around if we expect network activity (e.g. DHCP is enabled), as we are not restarted on that.
This way we are sure that /dev/net/tun has been given the right permissions before we try to connect to it.
Ideally, we should create tun/tap devices over netlink, and then this whole issue would go away.
networkd-wait-online should never exist in the default transaction,
unless explicitly enable or pulled in via things like NFS. However, just
enabling networkd shouldn't enable networkd-wait-online, since it's
common to use the former without the latter.