We don't need two (and half) templating systems anymore, yay!
I'm keeping the changes minimal, to make the diff manageable. Some enhancements
due to a better templating system might be possible in the future.
For handling of '## ' — see the next commit.
machined needs access to the host mount namespace to propagate bind
mounts created with the "machinectl bind" command. However, the
"ProtectKernelLogs" directive relies on mount namespaces to make the
kernel ring buffer inaccessible. This commit removes the
"ProtectKernelLogs=yes" directive from machined service file introduced
in 6168ae5.
Closes#14559.
We set ProtectKernelLogs=yes on all long running services except for
udevd, since it accesses /dev/kmsg, and journald, since it calls syslog
and accesses /dev/kmsg.
As discussed on systemd-devel [1], in Fedora we get lots of abrt reports
about the watchdog firing [2], but 100% of them seem to be caused by resource
starvation in the machine, and never actual deadlocks in the services being
monitored. Killing the services not only does not improve anything, but it
makes the resource starvation worse, because the service needs cycles to restart,
and coredump processing is also fairly expensive. This adds a configuration option
to allow the value to be changed. If the setting is not set, there is no change.
My plan is to set it to some ridiculusly high value, maybe 1h, to catch cases
where a service is actually hanging.
[1] https://lists.freedesktop.org/archives/systemd-devel/2019-October/043618.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1300212
Previously, setting this option by default was problematic due to
SELinux (as this would also prohibit the transition from PID1's label to
the service's label). However, this restriction has since been lifted,
hence let's start making use of this universally in our services.
On SELinux system this change should be synchronized with a policy
update that ensures that NNP-ful transitions from init_t to service
labels is permitted.
An while we are at it: sort the settings in the unit files this touches.
This might increase the size of the change in this case, but hopefully
should result in stabler patches later on.
Fixes: #1219
This is generally the safer approach, and is what container managers
(including nspawn) do, hence let's move to this too for our own
services. This is particularly useful as this this means the new
@system-service system call filter group will get serious real-life
testing quickly.
This also switches from firing SIGSYS on unexpected syscalls to
returning EPERM. This would have probably been a better default anyway,
but it's hard to change that these days. When whitelisting system calls
SIGSYS is highly problematic as system calls that are newly introduced
to Linux become minefields for services otherwise.
Note that this enables a system call filter for udev for the first time,
and will block @clock, @mount and @swap from it. Some downstream
distributions might want to revert this locally if they want to permit
unsafe operations on udev rules, but in general this shiuld be mostly
safe, as we already set MountFlags=shared for udevd, hence at least
@mount won't change anything.
Let's make this an excercise in dogfooding: let's turn on more security
features for all our long-running services.
Specifically:
- Turn on RestrictRealtime=yes for all of them
- Turn on ProtectKernelTunables=yes and ProtectControlGroups=yes for most of
them
- Turn on RestrictAddressFamilies= for all of them, but different sets of
address families for each
Also, always order settings in the unit files, that the various sandboxing
features are close together.
Add a couple of missing, older settings for a numbre of unit files.
Note that this change turns off AF_INET/AF_INET6 from udevd, thus effectively
turning of networking from udev rule commands. Since this might break stuff
(that is already broken I'd argue) this is documented in NEWS.
Specifically "machinectl shell" (or its OpenShell() bus call) is implemented by
entering the file system namespace of the container and opening a TTY there.
In order to enter the file system namespace, chroot() is required, which is
filtered by SystemCallFilter='s @mount group. Hence, let's make this work again
and drop @mount from the filter list.
Add a line
SystemCallFilter=~@clock @module @mount @obsolete @raw-io ptrace
for daemons shipped by systemd. As an exception, systemd-timesyncd
needs @clock system calls and systemd-localed is not privileged.
ptrace(2) is blocked to prevent seccomp escapes.
Container images from Debian or suchlike contain device nodes in /dev. Let's
make sure we can clone them properly, hence pass CAP_MKNOD to machined.
Fixes: #2867#465
Apparently, disk IO issues are more frequent than we hope, and 1min
waiting for disk IO happens, so let's increase the watchdog timeout a
bit, for all our services.
See #1353 for an example where this triggers.
Otherwise copying full directory trees between container and host won't
work, as we cannot access some fiels and cannot adjust the ownership
properly on the destination.
Of course, adding these many caps to the daemon kinda defeats the
purpose of the caps lock-down... but well...
Fixes#433
This adds a new bus call to machined that enumerates /var/lib/container
and returns all trees stored in it, distuingishing three types:
- GPT disk images, which are files suffixed with ".gpt"
- directory trees
- btrfs subvolumes
Also, rename ProtectedHome= to ProtectHome=, to simplify things a bit.
With this in place we now have two neat options ProtectSystem= and
ProtectHome= for protecting the OS itself (and optionally its
configuration), and for protecting the user's data.
ReadOnlySystem= uses fs namespaces to mount /usr and /boot read-only for
a service.
ProtectedHome= uses fs namespaces to mount /home and /run/user
inaccessible or read-only for a service.
This patch also enables these settings for all our long-running services.
Together they should be good building block for a minimal service
sandbox, removing the ability for services to modify the operating
system or access the user's private data.
Adds a new call sd_event_set_watchdog() that can be used to hook up the
event loop with the watchdog supervision logic of systemd. If enabled
and $WATCHDOG_USEC is set the event loop will ping the invoking systemd
daemon right after coming back from epoll_wait() but not more often than
$WATCHDOG_USEC/4. The epoll_wait() will sleep no longer than
$WATCHDOG_USEC/4*3, to make sure the service manager is called in time.
This means that setting WatchdogSec= in a .service file and calling
sd_event_set_watchdog() in your daemon is enough to hook it up with the
watchdog logic.
Embedded folks don't need the machine registration stuff, hence it's
nice to make this optional. Also, I'd expect that machinectl will grow
additional commands quickly, for example to join existing containers and
suchlike, hence it's better keeping that separate from loginctl.