IPE is a new LSM being introduced in 6.12. Like IMA, it works based on a
policy file that has to be loaded at boot, the earlier the better. So
like IMA, if such a policy is present, load it and activate it.
If there are any .p7b files in /etc/ipe/, load them as policies.
The files have to be inline signed in DER format as per IPE documentation.
For more information on the details of IPE:
https://microsoft.github.io/ipe/
Now that we have multi-profile UKIs people likely want to stick more PE
sections into them than before. Hence, bump the number of available PE
section slots to 30 (up from 15). Also, make this configurable at build
time since some folks probably want even more, and others don't want
this at all.
(pre-allocating too many shouldn't matter too much btw, I'd advise
everyone to overshoot, except maybe on the tiniest of embedded boards)
The new link-executor-shared option is similar to the existing
link-udev-shared: when set to false, we link to the static versions of our
internal libraries.
The resulting exuctor binary is fairly large, about as large as libsystemd-core
(14 MB without lto, 8 with lto).
This is intended as a workaround for the fuckup with the pinned executor
binary:
when an upgrade is performed, the package manager will install new version of
the libraries and new version of the code, and some time later reexecute the
managers. This creates a window when the pinned executor binary will fail to
execute. There are two factors which make the issue easier to hit:
- when the distribution uses a finely-grained shared-lib-tag. E.g. Fedora
uses version-release as the tag, which means that the issue occurs on
every package upgrade. This is the right thing to do, because the
ABI of our internal libraries is not stable at all, so replacing the
library from a different version in place creates a window where our
programs may crash or misbehave.
- when the distribution doesn't immediately reexec all the managers after
upgrade. In early versions of systemd, we used to hammer the machine during
upgrade, doing daemon-reexecs repeatedly. This works, but is ugly and
wasteful. Doing the reexecs while the upgrade is in progres also creates a
window where a mix of old and new configs or both is loaded. Users are
particularly annoyed by those reloads if there is some issue in the
configuration causing us to emit warnings on every reexec. Doing the
reexecs once after the new configuration and libraries have been put
in place is nicer.
The pinning of the executor binary breaks upgrades and in particular
it penalizes the distributions which make use of the features which
were previously added to avoid bugs and inefficiency during upgrades.
When the executor is linked statically, there is a smaller chance that it'll
fail to load libraries. The issue can still occur because other libraries, not
our own, are linked dynamically.
nscd is known to be racy [1] and it was already deprecated and later dropped in
Fedora a while back [1,2]. We don't need to support obsolete stuff in systemd,
and the cache in systemd-resolved provides a better solution anyway.
We announced the plan to drop nscd in d44934f378.
[1] https://fedoraproject.org/wiki/Changes/DeprecateNSCD
[2] https://fedoraproject.org/wiki/Changes/RemoveNSCD
The option is kept as a stub without any effect to make the transition easier.
This adds a small, socket-activated Varlink daemon that can delegate UID
ranges for user namespaces to clients asking for it.
The primary call is AllocateUserRange() where the user passes in an
uninitialized userns fd, which is then set up.
There are other calls that allow assigning a mount fd to a userns
allocated that way, to set up permissions for a cgroup subtree, and to
allocate a veth for such a user namespace.
Since the UID assignments are supposed to be transitive, i.e. not
permanent, care is taken to ensure that users cannot create inodes owned
by these UIDs, so that persistancy cannot be acquired. This is
implemented via a BPF-LSM module that ensures that any member of a
userns allocated that way cannot create files unless the mount it
operates on is owned by the userns itself, or is explicitly
allowelisted.
BPF LSM program with contributions from Alexei Starovoitov.
To make it easy to have a workable ssh-generator on various distros,
let's optionally generate the ssh privsep dir via tmpfiles.d/ drop-in.
This enables the concept with a path of /run/sshd/ as default. This is
the path Debian/Ubuntu uses, and means that we just work on those
distros. Debian/Ubuntu is the only distro (apparently?) that puts the
privsep dir under /run/, hence always needs the dir to be created
manually. Other distros don't need it that much, because they place the
dir in /usr/ (fedora, best choice!) or /var/ (others, not ideal, because
still mutable).
Also adds a longer explanation about this in NEWS, in the hope that
distro maintaines read that and maybe start cleaning this up.
Alternative to: #31543
Partially reverts commit b0d3095fd6.
While it is generally worthwhile for systemd to drop split-usr support,
these options are NOT about split-usr support. The universal location of
POSIX sh is always /bin/sh. Bash is pretty reasonably standardized there
too.
This happens irrespective of /bin being a symlink to /usr/bin.
Ramifications of this change include things like:
- portably running shell scripts that might run very nearly anywhere
- /etc/shells support
For standardization and compatibility reasons, these commands with these
paths need to be consistently found on any system, and thus distros make
sure this works, although even on split-usr systems /usr/bin/bash may be
a symlink to /bin/bash.
Embedding the *access path* of bash as /usr/bin/bash in systemd, for
example in libnss_systemd.so, means that login shells must agree with
systemd on how they invoke the shell. End result: users fail to login
because of access violations.
This cannot be fixed by "fixing PAM" because PAM does not follow
symlinks by design: one example is that it needs to treat rbash as
different from bash.
Fixes: https://bugs.gentoo.org/919749
Signed-off-by: Eli Schwartz <eschwartz93@gmail.com>
Most of our kernel cmdline options use underscores as word separators in
kernel cmdline options, but there were some exceptions. Let's fix those,
and also use underscores.
Since our /proc/cmdline parsers don't distinguish between the two
characters anyway this should not break anything, but makes sure our own
codebase (and in particular docs and log messages) are internally
consistent.
Let's split off a new vcs-tag option from version-tag that configures whether
the current commit should be appended to the version tag. Doing this saves
us from having to fiddle around with generating git versions in packaging
specs and instead let's meson do it for us, even if we pass in a custom
version tag.
With this approach there's no more need for tools/meson-vcs-tag.sh so
we remove it.
This functionality relied on telinit being available in a different path
then the compat symlink shipped by systemd itself. This is no longer the
case for any known distro, so remove that code.
Fixes: #31220
Replaces: #31249
This adds a tiny binary that is hooked into SSH client config via
ProxyCommand and which simply connects to an AF_UNIX or AF_VSOCK socket
of choice.
The syntax is as simple as this:
ssh unix/some/path # (this connects to AF_UNIX socket /some/path)
or:
ssh vsock/4711
I used "/" as separator of the protocol ID and the value since ":" is
already taken by SSH itself when doing sftp. And "@" is already taken
for separating the user name.
sshd now supports config file drop-ins, hence let's install one to hook
up "userdb ssh-authorized-keys", so that things just work.
We put the drop-in relatively early, so that other drop-ins generally
will override this.
Ideally sshd would support such drop-ins in /usr/ rather than /etc/, but
let's take what we can get. It's not that sshd's upstream was
particularly open to weird ideas from Linux people.
This should also implicitly enabled vmspawn in CI. It wasn't passing even the
basic tests, which we didn't see, because it needs to be explicitly enabled.
Also this renames 80-ethernet.network.example -> 89-ethernet.network.example,
to make it have lower precedence over other default .network files for
Ethernet interfaces.
Closes#29765.
This implements a "storage target mode", similar to what MacOS provides
since a long time as "Target Disk Mode":
https://en.wikipedia.org/wiki/Target_Disk_Mode
This implementation is relatively simple:
1. a new generic target "storage-target-mode.target" is added, which
when booted into defines the target mode.
2. a small tool and service "systemd-storagetm.service" is added which
exposes a specific device or all devices as NVMe-TCP devices over the
network. NVMe-TCP appears to be hot shit right now how to expose
block devices over the network. And it's really simple to set up via
configs, hence our code is relatively short and neat.
The idea is that systemd-storagetm.target can be extended sooner or
later, for example to expose block devices also as USB mass storage
devices and similar, in case the system has "dual mode" USB controller
that can also work as device, not just as host. (And people could also
plug in sharing as NBD, iSCSI, whatever they want.)
How to use this? Boot into your system with a kernel cmdline of
"rd.systemd.unit=storage-target-mode.target ip=link-local", and you'll see on
screen the precise "nvme connect" command line to make the relevant
block devices available locally on some other machine. This all requires
that the target mode stuff is included in the initrd of course. And the
system will the stay in the initrd forever.
Why bother? Primarily three use-cases:
1. Debug a broken system: with very few dependencies during boot get
access to the raw block device of a broken machine.
2. Migrate from system to another system, by dd'ing the old to the new
directly.
3. Installing an OS remotely on some device (for example via Thunderbolt
networking)
(And there might be more, for example the ability to boot from a
laptop's disk on another system)
Limitations:
1. There's no authentication/encryption. Hence: use this on local links
only.
2. NVMe target mode on Linux supports r/w operation only. Ideally, we'd
have a read-only mode, for security reasons, and default to it.
Future love:
1. We should have another mode, where we simply expose the homed LUKS
home dirs like that.
2. Some lightweight hookup with plymouth, to display a (shortened)
version of the info we write to the console.
To test all this, just run:
mkosi --kernel-command-line-extra="rd.systemd.unit=storage-target-mode.target" qemu
This allows distros to install configuration file templates in /usr/lib/systemd
for example.
Currently we install "empty" config files in /etc/systemd/. They serve two
purposes:
- The file contains commented-out values that show the default settings.
- It is easier to edit the right file if it is already there, the user doesn't
have to type in the path correctly, and the basic file structure is already in
place so it's easier to edit.
Things that have happened since this approach was put in place:
- We started supporting drop-ins for config files, and drop-ins are the
recommended way to create local configuration overrides.
- We have systemd-analyze cat-config which takes care of iterating over
all possible locations (/etc, /run, /usr, /usr/local) and figuring out
the right file.
- Because of the first two points, systemd-analyze cat-config is much better,
because it takes care of finding all the drop-ins and figuring out the
precedence. Looking at files manually is still possible of course, but not
very convenient.
The disadvantages of the current approach with "empty" files in /etc:
- We clutter up /etc so it's harder to see what the local configuration actually is.
- If a user edits the file, package updates will not override the file (e.g.
systemd.rpm uses %config(noreplace). This means that the "documented defaults"
will become stale over time, if the user ever edits the main config file.
Thus, I think that it's reasonable to:
- Install the main config file to /usr/lib so that it serves as reference for
syntax and option names and default values and is properly updated on package
upgrades.
- Recommend to users to always use drop-ins for configuration and
systemd-analyze cat-config to view the documentation.
This setting makes this change opt-in.
Fixes#18420.
[zjs: add more text to the description]
Now that we use meson feature options for our dependencies, we can just
rely on '--auto-features=disabled' to do the same. One benefit of this
is that specific features can still be force-enabled by overriding it
with the appropriate '-Dfeature=enabled' flag.
The two remaining uses for skip-deps can simply rely on their default
logic that sets the value to 'no' when the dependency is disabled.
Also, there is no need to conditionalize the get_variable() calls
because not-found dependencies will just return the passed default value
if provided.
This uses a two-step approach to make sure we can fall back to
find_library(), while also skipping the detection if the features are
explicitly disabled.
By making this a disabler dependency, we can slightly simplify the code
and it als fixes the build for -Dfdisk=disabled as we failed to create a
fallback empty libshared_fdisk variable.
By using meson features we can replace the handcrafted dependency
auto-detection by just passing the value from get_option directly to the
required arg for dependency, find_library etc.
'auto' features make the dependency optional, 'enabled' requires it
while 'disabled' features will skip detection entirely.
Any skipped or not found dependency will just be a no-op when passed to
build steps and therefore we can also skip the creation of empty vars.
The use of skip_deps for these is dropped here as meson provides a way
to disable all optional features in one go by passing
'-Dauto_features=disabled'.
Build option "link-portabled-shared" to build a statically linked
systemd-portabled by using
-Dlink-portabled-shared=false
on systems with full systemd stack except systemd-portabled, such
as CentOS/RHEL 9.
This drops all mentions of gnu-efi and its manual build machinery. A
future commit will bring bootloader builds back. A new bootloader meson
option is now used to control whether to build sd-boot and its userspace
tooling.
Most of the support for valgrind was under HAVE_VALGRIND_VALGRIND_H, i.e. we
would enable if the valgrind headers were found. The operations then we be
conditionalized on RUNNING_UNDER_VALGRIND.
But in a few places we had code which was conditionalized on VALGRIND, i.e. the
config option. I noticed because I compiled with -Dvalgrind=true on a machine
that didn't have valgrind.h, and the build failed because
RUNNING_UNDER_VALGRIND was not defined. My first idea was to add a check that
the header is present if the option is set, but it seems better to just remove
the option. The code to support valgrind is trivial, and if we're
!RUNNING_UNDER_VALGRIND, it has negligible cost. And the case of running under
valgrind is always some special testing/debugging mode, so we should just do
those extra steps to make valgrind output cleaner. Removing the option makes
things simpler and we don't have to think if something should be covered by the
one or the other configuration bit.
I had a vague recollection that in some places we used -Dvalgrind=true not
for valgrind support, but to enable additional cleanup under other sanitizers.
But that code would fail to build without the valgrind headers anyway, so
I'm not sure if that was still used. If there are uses like that, we can
extend the condition for cleanup_pools().
Allow defining the default keymap to be used by
vconsole-setup through a build option. A template
vconsole.conf also gets populated by tmpfiles if
it doesn't exist.
Config options are -Ddefault-timeout-sec= and -Ddefault-user-timeout-sec=.
Existing -Dupdate-helper-user-timeout= is renamed to -Dupdate-helper-user-timeout-sec=
for consistency. All three options take an integer value in seconds. The
renaming and type-change of the option is a small compat break, but it's just
at compile time and result in a clear error message. I also doubt that anyone was
actually using the option.
This commit separates the user manager timeouts, but keeps them unchanged at 90 s.
The timeout for the user manager is set to 4/3*user-timeout, which means that it
is still 120 s.
Fedora wants to experiment with lower timeouts, but doing this via a patch would
be annoying and more work than necessary. Let's make this easy to configure.
The option is added because we have a similar one for kernel-install. This
program requires python, and some people might want to skip it because of this.
The tool is installed in /usr/lib/systemd for now, since the interface might
change.
A template file is used, but there is no .in suffix.
The problem is that we'll later want to import the file as a module
for tests, but recent Python versions make it annoyingly hard to import
a module from a file without a .py suffix. imp.load_sources() works, but it
is deprecated and throws warnings.
importlib.machinery.SourceFileLoader().load_module() works, but is also
deprecated. And the documented replacements are a maze of twisted little
callbacks that result in an empty module.
So let's take the easy way out, and skip the suffix which makes it easy
to import the template as a module after adding the directory to sys.path.
In the Xen case, it's the hypervisor which manages kexec. We thus
have to ask it whether a kernel is loaded, instead of relying on
/sys/kernel/kexec_loaded.
There might be (embedded) systems that get never updated (things like
e.g. entertainment systems of trains, for example) and where the adjustment of
the system clock (introduced by b10abe4bba) would
do the wrong thing even if the difference between the systemd build time and
the rtc is 15 years or more.
This patch allows disabling the adjustment by setting
'clock-valid-range-usec-max' meson option to 0 or to a negative value.
Integers and booleans are supposed to be actual integers and booleans,
not strings describing their value, but Meson silently accepted either
one. It's still wrong to do it though, and other implementations of
Meson such as muon choke on it.
0 UID and GID are special, and should not be acceptable for the settings.
Hence, we can handle 0 as unset.
Strictly speaking, time epoch with 0 is valid, but I guess no one use
0 as a valid value.
The journalctl tool may be needed on cross compilation hosts in order
to run --update-catalog against a target rootfs.
To avoid reliability issues caused by shared linking allow journalctl
to be linked statically.