Since d8f9686c0f we use the chattr +i flag
for marking containers in directories as reead-only. But to do so we
need the cap for it, hence grant it.
Fixes: #19115
We don't need two (and half) templating systems anymore, yay!
I'm keeping the changes minimal, to make the diff manageable. Some enhancements
due to a better templating system might be possible in the future.
For handling of '## ' — see the next commit.
As discussed on systemd-devel [1], in Fedora we get lots of abrt reports
about the watchdog firing [2], but 100% of them seem to be caused by resource
starvation in the machine, and never actual deadlocks in the services being
monitored. Killing the services not only does not improve anything, but it
makes the resource starvation worse, because the service needs cycles to restart,
and coredump processing is also fairly expensive. This adds a configuration option
to allow the value to be changed. If the setting is not set, there is no change.
My plan is to set it to some ridiculusly high value, maybe 1h, to catch cases
where a service is actually hanging.
[1] https://lists.freedesktop.org/archives/systemd-devel/2019-October/043618.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1300212
This is generally the safer approach, and is what container managers
(including nspawn) do, hence let's move to this too for our own
services. This is particularly useful as this this means the new
@system-service system call filter group will get serious real-life
testing quickly.
This also switches from firing SIGSYS on unexpected syscalls to
returning EPERM. This would have probably been a better default anyway,
but it's hard to change that these days. When whitelisting system calls
SIGSYS is highly problematic as system calls that are newly introduced
to Linux become minefields for services otherwise.
Note that this enables a system call filter for udev for the first time,
and will block @clock, @mount and @swap from it. Some downstream
distributions might want to revert this locally if they want to permit
unsafe operations on udev rules, but in general this shiuld be mostly
safe, as we already set MountFlags=shared for udevd, hence at least
@mount won't change anything.
Basically, we turn it on for most long-running services, with the
exception of machined (whose child processes need to join containers
here and there), and importd (which sandboxes tar in a CLONE_NEWNET
namespace). machined is left unrestricted, and importd is restricted to
use only "net"
Let's make this an excercise in dogfooding: let's turn on more security
features for all our long-running services.
Specifically:
- Turn on RestrictRealtime=yes for all of them
- Turn on ProtectKernelTunables=yes and ProtectControlGroups=yes for most of
them
- Turn on RestrictAddressFamilies= for all of them, but different sets of
address families for each
Also, always order settings in the unit files, that the various sandboxing
features are close together.
Add a couple of missing, older settings for a numbre of unit files.
Note that this change turns off AF_INET/AF_INET6 from udevd, thus effectively
turning of networking from udev rule commands. Since this might break stuff
(that is already broken I'd argue) this is documented in NEWS.
Add a line
SystemCallFilter=~@clock @module @mount @obsolete @raw-io ptrace
for daemons shipped by systemd. As an exception, systemd-timesyncd
needs @clock system calls and systemd-localed is not privileged.
ptrace(2) is blocked to prevent seccomp escapes.
Apparently, disk IO issues are more frequent than we hope, and 1min
waiting for disk IO happens, so let's increase the watchdog timeout a
bit, for all our services.
See #1353 for an example where this triggers.
Fedora's filesystem package ships /usr/bin (and other directories) which are
not writable by its owner. machinectl pull-dkr (and possibly others) are not
able to extract those:
14182 mkdirat(3, "usr", 0700) = 0
14182 mkdirat(3, "usr/bin", 0500) = 0
14182 openat(3, "usr/bin/[", O_WRONLY|O_CREAT|O_EXCL|O_NOCTTY|O_NONBLOCK|O_CLOEXEC, 0700) = -1 EACCES (Permission denied)
...
When manipulating container and VM images we need efficient and atomic
directory snapshots and file copies, as well as disk quota. btrfs
provides this, legacy file systems do not. Hence, implicitly create a
loopback file system in /var/lib/machines.raw and mount it to
/var/lib/machines, if that directory is not on btrfs anyway.
This is done implicitly and transparently the first time the user
invokes "machinectl import-xyz".
This allows us to take benefit of btrfs features for container
management without actually having the rest of the system use btrfs.
The loopback is sized 500M initially. Patches to grow it dynamically are
to follow.
The old "systemd-import" binary is now an internal tool. We still use it
as asynchronous backend for systemd-importd. Since the import tool might
require some IO and CPU resources (due to qcow2 explosion, and
decompression), and because we might want to run it with more minimal
priviliges we still keep it around as the worker binary to execute as
child process of importd.
machinectl now has verbs for pulling down images, cancelling them and
listing them.