Linux kernel v4.18 (2018-08-12) added user-namespace support to FUSE, and
bumped the FUSE version to 7.27 (see: da315f6e0398 (Merge tag
'fuse-update-4.18' of
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse, Linus Torvalds,
2018-06-07). This means that on such kernels it is safe to enable FUSE in
nspawn containers.
In outer_child(), before calling copy_devnodes(), check the FUSE version to
decide whether enable (>=7.27) or disable (<7.27) FUSE in the container. We
look at the FUSE version instead of the kernel version in order to enable FUSE
support on older-versioned kernels that may have the mentioned patchset
backported ([as requested by @poettering][1]). However, I am not sure that
this is safe; user-namespace support is not a documented part of the FUSE
protocol, which is what FUSE_KERNEL_VERSION/FUSE_KERNEL_MINOR_VERSION are meant
to capture. While the same patchset
- added FUSE_ABORT_ERROR (which is all that the 7.27 version bump
is documented as including),
- bumped FUSE_KERNEL_MINOR_VERSION from 26 to 27, and
- added user-namespace support
these 3 things are not inseparable; it is conceivable to me that a backport
could include the first 2 of those things and exclude the 3rd; perhaps it would
be safer to check the kernel version.
Do note that our get_fuse_version() function uses the fsopen() family of
syscalls, which were not added until Linux kernel v5.2 (2019-07-07); so if
nothing has been backported, then the minimum kernel version for FUSE-in-nspawn
is actually v5.2, not v4.18.
Pass whether or not to enable FUSE to copy_devnodes(); have copy_devnodes()
copy in /dev/fuse if enabled.
Pass whether or not to enable FUSE back over fd_outer_socket to run_container()
so that it can pass that to append_machine_properties() (via either
register_machine() or allocate_scope()); have append_machine_properties()
append "DeviceAllow=/dev/fuse rw" if enabled.
For testing, simply check that /dev/fuse can be opened for reading and writing,
but that actually reading from it fails with EPERM. The test assumes that if
FUSE is supported (/dev/fuse exists), then the testsuite is running on a kernel
with FUSE >= 7.27; I am unsure how to go about writing a test that validates
that the version check disables FUSE on old kernels.
[1]: https://github.com/systemd/systemd/issues/17607#issuecomment-745418835Closes#17607
When --boot is set, and --keep-unit is not, set CoredumpReceive=yes on
the scope allocated for the container. When --keep-unit is set, nspawn
does not allocate the container's unit, so the existing unit needs to
configure this setting itself.
Since systemd-nspawn@.service sets --boot and --keep-unit, add
CoredumpReceives=yes to that unit.
Let's make use of the new DelegateSubgroup= feature and delegate the
/supervisor/ subcgroup already to nspawn, so that moving the supervisor
process there is unnecessary.
The comment talks about upstream development steps and doesn't make
sense for users. We used special '## ' syntax to strip it out during
build, but it got inadvertently reformatted as a normal comment
in 3982becc92.
We don't need two (and half) templating systems anymore, yay!
I'm keeping the changes minimal, to make the diff manageable. Some enhancements
due to a better templating system might be possible in the future.
For handling of '## ' — see the next commit.
This patch modifies the RequireMountsFor setting in systemd-nspawn@.service to wait for the machine instance directory to be mounted, not just /var/lib/machines.
Closes#14931
This makes things a bit simpler and the build a bit faster, because we don't
have to rewrite files to do the trivial substitution. @rootbindir@ is always in
our internal $PATH that we use for non-absolute paths, so there should be no
functional change.
Devices referred to by `DeviceAllow=` sandboxing are resolved into their
corresponding major numbers when the unit is loaded by looking at
`/proc/devices`. If a reference is made to a device which is not yet
available, the `DeviceAllow` is ignored and the unit's processes cannot
access that device.
In both logind and nspawn, we have `DeviceAllow=` lines, and `modprobe`
in `ExecStartPre=` to load some kernel modules. Those kernel modules
cause device nodes to become available when they are loaded: the device
nodes may not exist when the unit itself is loaded. This means that the
unit's processes will not be able to access the device since the
`DeviceAllow=` will have been resolved earlier and denied it.
One way to fix this would be to re-evaluate the available devices and
re-apply the policy to the cgroup, but this cannot work atomically on
cgroupsv1. So we fall back to a second approach: instead of running
`modprobe` via `ExecStartPre`, we move this out to a separate unit and
order it before the units which want the module.
Closes#14322.
Fixes: #13943.
As discussed on systemd-devel [1], in Fedora we get lots of abrt reports
about the watchdog firing [2], but 100% of them seem to be caused by resource
starvation in the machine, and never actual deadlocks in the services being
monitored. Killing the services not only does not improve anything, but it
makes the resource starvation worse, because the service needs cycles to restart,
and coredump processing is also fairly expensive. This adds a configuration option
to allow the value to be changed. If the setting is not set, there is no change.
My plan is to set it to some ridiculusly high value, maybe 1h, to catch cases
where a service is actually hanging.
[1] https://lists.freedesktop.org/archives/systemd-devel/2019-October/043618.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1300212
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
When a container scope is allocated via machined it gets 16K set already since
cf7d1a30e4. Make sure when a container is run as
system service it gets the same values.
When using `%I` for instances of `systemd-nspawn@.service`, the result
will be `systemd-nspawn` trying to launch a container named e.g.
`fedora/23` instead of `fedora-23`.
Using `%i` instead prevents escaping `-` in a container name and uses
the unmodified container name from the machine store.
This way we know that any bridges and other user-created network devices
are in place, and can be properly added to the container.
In the long run this should be dropped, and replaced by direct calls
inside nspawn that cause the devices to be created when necessary.
- Unescape instance name so that we can take almost anything as instance
name.
- Introduce "machines.target" which consists of all enabled nspawns and
can be used to start/stop them altogether
- Look for container directory using -M instead of harcoding the path in
/var/lib/container
--link-journal={host,guest} fail if the host does not have persistent
journalling enabled and /var/log/journal/ does not exist. Even worse, as there
is no stdout/err any more, there is no error message to point that out.
Introduce two new modes "try-host" and "try-guest" which don't fail in this
case, and instead just silently skip the guest journal setup.
Change -j to mean "try-guest" instead of "guest", and fix the wrong --help
output for it (it said "host" before).
Change systemd-nspawn@.service.in to use "try-guest" so that this unit works
with both persistent and non-persistent journals on the host without failing.
https://bugs.debian.org/770275
For priviliged units this resource control property ensures that the
processes have all controllers systemd manages enabled.
For unpriviliged services (those with User= set) this ensures that
access rights to the service cgroup is granted to the user in question,
to create further subgroups. Note that this only applies to the
name=systemd hierarchy though, as access to other controllers is not
safe for unpriviliged processes.
Delegate=yes should be set for container scopes where a systemd instance
inside the container shall manage the hierarchies below its own cgroup
and have access to all controllers.
Delegate=yes should also be set for user@.service, so that systemd
--user can run, controlling its own cgroup tree.
This commit changes machined, systemd-nspawn@.service and user@.service
to set this boolean, in order to ensure that container management will
just work, and the user systemd instance can run fine.