This makes use of the option switch that was added in the previous commit.
We used a pretty big hammer on a relatively small nail: we would do daemon-reload
and (in principle) allow any configuration to be changed. But in fact we only
made use of this in systemd-fstab-generator. systemd-fstab-generator filters
out all mountpoints except /usr and those marked with x-initrd.mount, i.e. on
a big majority of systems it wouldn't do anything.
Also, since systemd-fstab-generator first parses /proc/cmdline, and then
initrd's /etc/fstab, and only then /sysroot/etc/fstab, configuration in the
host would only matter if it the same mountpoint wasn't configured "earlier".
So the config in the host could be used for new mountpoints, but it couldn't
be used to amend configuration for existing mountpoints. And we wouldn't actually
remount anything, so mountpoints that were already mounted wouldn't be affected,
even if did change some config.
In the new scheme, we will parse /sysroot/etc/fstab and explicitly start
sysroot-usr.mount and other units that we just wrote. In most cases (as written
above), this will actually result in no units being created or started.
If the generator is invoked on a system with /sysroot/etc/fstab present,
behaviour is not changed and we'll create units as before. This is needed so
that if daemon-reload is later at some points, we don't "lose" those units.
There's a minor bugfix here: we honour x-initrd.mount for swaps, but we
wouldn't restart swap.target, i.e. the new swaps wouldn't necessarilly be
pulled in immediately.
If for any reason something goes wrong during the boot process (most likely due
to a network issue), system admins should be allowed to log in to the system to
debug the problem. However due to the login session barrier enforced by
systemd-user-sessions.service for all users, logins for root will be delayed
until a (dbus) timeout expires. Beside being confusing, it's not a nice user
experience to wait for an indefinite period of time (no message is shown) this
and also suggests that something went wrong in the background.
The reason of this delay is due to the fact that all units involved in the
creation of a user session are ordered after systemd-user-sessions.service,
which is subject to network issues. If root needs to log in at that time,
logind is requested to create a new session (via pam_systemd), which ultimately
ends up waiting for systemd-user-session.service to be activated. This has the
bad side effect to block login for root until the dbus call done by pam_systemd
times out and the PAM stack proceeds anyways.
To solve this problem, this patch orders the session scope units and the user
instances only after systemd-user-sessions.service for unprivileged users only.
So far we didn't enable the cpu controller because of overhead of the
accounting. If I'm reading things correctly, delegation was enabled for a while
for the units with user and pam context set, i.e. for user@.service too.
a931ad47a8 added the explicit Delegate=yes|no
switch, but it was initially set to 'yes'.
acc8059129 disabled delegation for user@.service
with the justication that CPU accounting is expensive, but half a year later
a88c5b8ac4 changed DefaultCPUAccounting=yes for
kernels >=4.15 with the justification that CPU accounting is inexpensive there.
In my (very noncomprehensive) testing, I don't see a measurable overhead if the
cpu controller is enabled for user slices. I tried some repeated compilations,
and there is was no statistical difference, but the noise level was fairly
high. Maybe better benchmarking would reveal a difference.
The goal of this change is very simple: currently all of the user session,
including services like the display server and pipewire are under user@.service.
This means that when e.g. a compilation job is started in the session's
app.slice, the processes in session.slice compete for CPU and can be starved.
In particular, audio starts to stutter, etc. With CPU controller enabled,
I can start start 'ninja -C build -j40' in a tab and this doesn't have any
noticable effect on audio.
I don't think the particular values matter too much: the CPU controller is
work-convserving, and presumably the session slice would never need more than
e.g. one 1 full CPU, i.e. half or a quarter of available CPU resources on even
the smallest of today's machines. app.slice and session.slice are assigned
equal weights, background.slice is assigned a smaller fraction. CPUWeight=100
is the default, but I wrote it explicitly to make it easier for users to see
how the split is done. So effectively this should result in session.slice
getting as much power as it needs.
If if turns out that this does have a noticable overhead, we could make it
opt-in. But I think that the benefit to usability is important enough to enable
it by default. W/o something like this the session is not really usable with
background tasks.
We already had it on the socket units, so it's possible that
systemd-journald.service would be stopped and then restarted when trafic hits
the sockets when something logs. Let's not try to stop it. It is supposed to
run until the end and be eventually killed in the final killing spree.
This might (or not) help with #23287.
They are various cases where the same module might be repeatedly
loaded in a short time frame, for example if a service depending on a
module keep restarting, or if many instances of such service get
started at the same time. If this happend the modprobe@.service
instance will be marked as failed because it hit the restart limit.
Overall it doesn't seems to make much sense to have a restart limit on
the modprobe service so just disable it.
Fixes: #23742
The systemd-pstore service takes pstore files on boot and transfers them
to disk. It only does it once on boot and only if it finds any. The typical
location of the pstore on modern systems is the UEFI variable store.
Most distributions ship with CONFIG_EFI_VARS_PSTORE=m. That means, the
UEFI variable store is only available on boot after the respective module
is loaded.
In most situations, the pstore service gets loaded before the UEFI pstore,
so we don't get to transfer logs. Instead, they accumulate, filling up the
pstore over time, potentially breaking the UEFI variable store.
Let's add a service dependency on any kernel module that can provide a
pstore to ensure we only scan for pstate after we can actually see pstate.
I have seen live occurences of systems breaking because we did not erase
the pstates and ran out of UEFI nvram space.
Fixes https://github.com/systemd/systemd/issues/18540
All wiki pages that contain a deprecation banner
pointing to systemd.io or manpages are updated to
point to their replacements directly.
Helpful command for identification of available links:
git grep freedesktop.org/wiki | \
sed "s#.*\(https://www.freedesktop.org/wiki[^ $<'\\\")]*\)\(.*\)#\\1#" | \
sort | uniq
GIT_VERSION is not available as a config.h variable, because it's rendered
into version.h during builds. Let's rework jinja2 rendering to also
parse version.h. No functional change, the new variable is so far unused.
I guess this will make partial rebuilds a bit slower, but it's useful
to be able to use the full version string.
These unit (if enabled) will try to update the OS in regular intervals.
Moreover, every day in the early morning this will attempt to reboot the
system if there's a newer version installed than running.
And enable cgroup delegation for udevd.
Then, processes invoked through ExecReload= are assigned .control
subcgroup, and they are not killed by cg_kill().
Fixes#16867 and #22686.
The current description for the factory reset target does not add any
value and doesn't respect the definition of the related property as
described in systemd.unit(5).
Starting the target currently results in the following log:
[ 11.139174] systemd[1]: Reached target Target that triggers factory reset. Does nothing by default..
[ OK ] Reached target Target that…set. Does nothing by default..
Simply update the target description to "Factory Reset".
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
79a67f3ca4 pulled systemd-resolved.service
in from basic.target instead of multi-user.target, i.e. the idea is to
make it an early boot service, instead of a regular service.
However, early boot services are supposed to be in sysinit.target, not
basic.target (the latter is just one that combines the early boot
services in sysinit.target, the sockets in sockets.targt, the mounts in
local-fs.target and so on into one big target).
Also, the comit actually didn't add a synchronization point, i.e. not
Before=, so that the whole thing was racy.
Let's fix all that.
Follow-up for 79a67f3ca4
This ordering existed since resolved was first created, but there should
not be any need to order the two services against each other, as
resolved should be able to pick up networkd DNS metadata either way (as
it works with inotify in /run).
Let's drop this hence, and not cargo-cult this to eternity
Also see: https://github.com/systemd/systemd/pull/22389#issuecomment-1045978403
This is a follow-up for d5ee050ffc, and
reintroduces a requirement dep from systemd-journal-flush.service onto
systemd-journald.service, but a weaker one than originally: a Wants= one
instead of a Requires= one.
Why? Simply because the service issues an IPC call to the journald,
hence it should pull it in. (Note that socket activation doesn't happen
for the Varlink socket it uses, hence we should pull in the service
itself.)
The systemd-oomd.service unit contains
[Install]
WantedBy=multi-user.target
Alias=dbus-org.freedesktop.oom1.service
which means the symlink is supposed to be created dynamically when the
service is enabled.
In the olden days systemd-resolved used dbus and it didn't make sense to start
it before dbus which is started fairly late. But we have mostly ported resolved
over to varlink. The queries from nss-resolve are done using varlink, so name
resolution can work without dbus. resolvectl still uses dbus, so e.g. 'resolvectl
query' will not work, but by starting systemd-resolved earlier we're not making this
any worse.
If systemd-resolved is started after dbus, it registers the name and everything
is fine. If it is started before dbus, it'll watch for the dbus socket and
connect later. So it should be fine to start systemd-resolved earlier. (If dbus
is stopped and restarted, unfortunately systemd-resolved does not reconnect.
This seems to be a small bug: since our daemons know how to watch for
dbus.socket, they could restart the watch if they ever lose the connection. But
this scenario shouldn't happen in normal boot, and restarting dbus is not
supported anyway.)
Moving the start earlier the following advantages:
- name resolution becomes availabe earlier, in particular for synthesized
hostnames even before the network is up.
- basic.target is part of initrd.target, so systemd-resolved will get started
in the initrd if installed. This is required for nfs-root when the server is
specified using a name (https://bugzilla.redhat.com/show_bug.cgi?id=2037311).
Otherwise, systemd-homed-active.service will fail to deactivate all
homes because homectl can no longer talk to homed if dbus stops first.
As a result, /home cannot be umounted.
Doing this on systemd-homed-active.service instead works as well, but
systemd-homed will exit 1 if dbus is already shut down.
It is used by udevd and networkd. Since udevd is enabled statically, let's also
change the preset to "on". networkd is opt-in, so let's pull in the generator
when enabling networkd too.
Fixes#21626. (The bug report talks about /run, but the issue is actually with
/tmp.) People use /tmp for various things that fit in memory, e.g. unpacking
packages, and 400k is not much. Let's raise is a bit.
Programs run by udev triggers may need to execute the bpf() syscall. Even more
so, since on a cgroup v2 system, the only way to set up device access filtering
is to install a BPF program on the cgroup in question and one way of passing
data to such program is through BPF maps, which can only be access using the
bpf() syscall. One such use case was identified in RHBZ#2025264 related to
snap-device-helper, and led to RHBZ#2027627 being filed.
Unfortunately there is no finer grained control over what gets passed in the
syscall, so just enable bpf() and leave fine grained mediation to other
security layers (eg. SELinux).
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2027627
Signed-off-by: Maciek Borzecki <maciek.borzecki@gmail.com>
Due to the fact that systemd-journal-flush.service has
"Requires=systemd-journald.service", this service is stopped too when journald
is requested to do so.
However stopping systemd-journal-flush.service implies that journald
relinquishes /var hence implicitly switching back to the volatile storage
mode and removing /run/systemd/journal/flushed.
If journald is started afterwards, it will run in volatile storage mode
regardless of the value of 'Storage=' as it believes now that /var is not yet
ready (because the flushed flag is missing).
Because this flag is mainly an indication for journald that the initialization
of /var/log/journal (during the boot process) has been done,
systemd-journal-flush.service shouldn't be tied to the state of journald itself
but to the state of /var/log/journal, hence to the state of the system.
Parsing objects is risky as data could be malformed or malicious,
so avoid doing that from the main systemd-coredump process and
instead fork another process, and set it to avoid generating
core files itself.
Users may use rules that refer to binaries e.g. in /opt or /usr/local,
and those directories may be separate mount points. We don't need the
binfmt rules in early boot, so let's delay the service so that we can
rely on the full local filesystem being visible.
Fixes#21178.
When using "capture : true" in custom_target()s the mode of the source
file is not preserved when the generated file is not installed and so
needs to be tweaked manually. Switch from output capture to creating the
target file and copy the permissions from the input file.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
If the tty arg is set to "-", agetty uses the stdin fd as the tty.
Let's pass the tty this way so that we keep an fd open to the tty
at all times. If all fd's to a tty are closed, the kernel might
reset the tty which we want to avoid.
This adds support for dm integrity targets and an associated
/etc/integritytab file which is required as the dm integrity device
super block doesn't include all of the required metadata to bring up
the device correctly. See integritytab man page for details.
Let's make it slightly more likely that a per-user service manager is
killed than any system service. We use a conservative 100 (from a range
that goes all the way to 1000).
Replaces: #17426
Together with the previous commit this means: system manager and system
services are placed at OOM score adjustment 0 (specifically: they
inherit kernel default of 0). User service manager (both for root and
non-root) are placed at 100. User services for non-root are placed at
200, those for root inherit 100.
Note that processes forked off the user *sessions* (i.e. not forked off
the per-user service manager) remain at 0 (e.g. the shell process
created by a tty or ssh login). This probably should be
addressed too one day (maybe in pam_systemd?), but is not covered here.