Optional features allow distros to define sets of transfers that can
be enabled or disabled by the system administrator. This is useful for
situations where a distro may want to ship some resources version-locked
to the core OS, but many people have no need for the resource, such as:
development tools/compilers, drivers for specialized hardware, language
packs, etc.
We also rename sysupdate.d/*.conf -> sysupdate.d/*.transfer, because
now there is more than one type of definition in sysupdate.d/. For
backwards compat, we still load *.conf files as long as no *.transfer
files are found and the *.conf files don't try to declare themselves
as part of any feature.
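The fallback rule above can be modelled roughly as follows. This is a minimal Python sketch, not the sysupdate code: the function name and the `declares_feature` callback are hypothetical, and it assumes that a legacy *.conf file declaring a feature is simply skipped.

```python
from typing import Callable, Iterable, List


def select_definitions(
    names: Iterable[str],
    declares_feature: Callable[[str], bool],
) -> List[str]:
    """Pick which sysupdate.d/ files to load as transfer definitions."""
    names = sorted(names)
    transfers = [n for n in names if n.endswith(".transfer")]
    if transfers:
        # New-style files exist, so legacy *.conf files are not considered.
        return transfers
    # Legacy mode (assumption): load *.conf files, skipping any that
    # declare themselves as part of a feature.
    return [n for n in names if n.endswith(".conf") and not declares_feature(n)]
```

With both `a.conf` and `b.transfer` present, only `b.transfer` is returned.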
Fixes https://github.com/systemd/systemd/issues/33343
Fixes https://github.com/systemd/systemd/issues/33344
This will allow units (scopes/slices/services) to override the default
systemd-oomd setting DefaultMemoryPressureDurationSec=.
The semantics of ManagedOOMMemoryPressureDurationSec= are:
- If >= 1 second, overrides DefaultMemoryPressureDurationSec= from oomd.conf
- If empty, uses DefaultMemoryPressureDurationSec= from oomd.conf
- Ignored if ManagedOOMMemoryPressure= is not "kill"
- Disallowed if < 1 second
Note the corresponding dbus property is DefaultMemoryPressureDurationUSec
which is in microseconds. This is consistent with other time-based
dbus properties.
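The semantics listed above can be written down as a small decision function. This is only a sketch under stated assumptions: the function and parameter names are hypothetical, values are in microseconds, and "ignored" is modelled as falling back to the oomd.conf default.

```python
from typing import Optional

USEC_PER_SEC = 1_000_000


def effective_duration_usec(
    unit_duration_usec: Optional[int],
    default_duration_usec: int,
    pressure_mode: str,
) -> int:
    """Return the memory pressure duration that applies to a unit."""
    # Values below one second are rejected outright.
    if unit_duration_usec is not None and unit_duration_usec < USEC_PER_SEC:
        raise ValueError("values below 1 second are disallowed")
    # An empty setting, or a mode other than "kill", leaves the default in place.
    if unit_duration_usec is None or pressure_mode != "kill":
        return default_duration_usec
    return unit_duration_usec
```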
soft-reboot allows switching into a different root/installation,
i.e. potentially invalidate settings from kernel cmdline and such.
Let's hence inform generators about soft-reboots.
By default, in instances where timers are running on a realtime schedule,
if a service takes longer to run than the interval of a timer, the
service will immediately start again when the previous invocation finishes.
This is caused by the fact that the next elapse is calculated based on
the last trigger time, which, combined with the fact that the interval
is shorter than the runtime of the service, causes that elapse to be in
the past, which in turn means the timer will trigger as soon as the
service finishes running.
This behavior can be changed by enabling the new DeferReactivation setting,
which will cause the next calendar elapse to be calculated based on when
the trigger unit enters inactivity, rather than the last trigger time.
Thus, if a timer is on a realtime interval, the trigger will always
adhere to that specified interval.
E.g. if you have a timer that runs on a minutely interval, the setting
guarantees that triggers will happen at *:*:00 times, whereas by default
this may skew depending on how long the service runs.
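The difference between the two modes can be illustrated with a toy model of a minutely timer. This is a sketch, not the pid1 logic: times are plain seconds, and both function names are made up for the example.

```python
INTERVAL = 60  # a minutely calendar timer, elapsing at *:*:00


def next_calendar_elapse(after: int) -> int:
    """First multiple of INTERVAL strictly after `after` (in seconds)."""
    return (after // INTERVAL + 1) * INTERVAL


def next_trigger(last_trigger: int, inactive_at: int, defer_reactivation: bool) -> int:
    """When the timer fires next, given when the service last started and stopped."""
    base = inactive_at if defer_reactivation else last_trigger
    elapse = next_calendar_elapse(base)
    # An elapse that is already in the past fires as soon as the service is inactive.
    return max(elapse, inactive_at)
```

For a service triggered at second 0 that runs for 90 seconds, the default mode yields a trigger at second 90 (off the minute grid), while the deferred mode yields second 120.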
Co-authored-by: Matteo Croce <teknoraver@meta.com>
The documentation claimed that ExecStartPre=/ExecStartPost= accepts
multiple command lines, in contrast to ExecStart=. This is half an
untruth, because ExecStart= allows that too – as long as Type=oneshot is
set.
Hence, reword this a bit, and do not emphasize the contrast.
Prompted by: #34570
Teaches systemd-stub how to load additional initrds from addon files.
This is very similar to the support for .ucode sections in addon files,
but with different ordering. Initrds from addons have a chance to
overwrite files from the base initrd in the UKI.
Follow-up for fa693fdc7e.
The documentation says the option takes a boolean or one of the "self"
and "identity". But the parser uses private_users_from_string() which
also accepts "off". Let's drop the implicit support of "off".
The method was added with migration of resources in mind (e.g. process's
allocated memory will follow it to the new scope), however, such a
resource migration is not in cgroup semantics. The method may thus have
the intended users and others could be guided to StartTransientUnit().
Since this API was advertised in a regular release, start the removal
with a deprecation message to callers.
Eventually, the goal is to remove the method to clean up DBus API and
simplify code (removal of cgroup_context_copy()).
Part of DBus docs is retained to satisfy build checks.
This adds the ExtraFileDescriptor property to StartTransient dbus API
with format "a(hs)" - array of (file descriptor, name) pairs. The FD
will be passed to the unit via sd_notify like Socket and OpenFile.
systemctl show also shows ExtraFileDescriptorName for these transient
units. We only show the name passed to dbus as the FD numbers will
change once passed over the unix socket and are duplicated, so it's
confusing to display the numbers.
We do not add this functionality for systemd-run or general systemd
service units as it is not useful for general systemd services.
Arguably, it could be useful for systemd-run in bash scripts but we
prefer to be cautious and not expose the API yet.
Fixes: #34396
The API introduced in https://github.com/systemd/systemd/pull/34295
is less than ideal:
- It doesn't consider signing at all (ukify can't sign separately yet)
- Measurement is completely broken (all profile sections are marked to
not be measured)
- It focuses on a very niche use case of extending existing UKIs and makes
the more common use case of building a UKI with several profiles included
much harder than needed.
Let's instead rework the API to focus on the primary use case of building
a UKI with multiple profiles added to it immediately. We require the profiles
to be built upfront as separate PE binaries with UKI. There's no need to sign
or measure these, they're solely vehicles for profile sections. This saves us
from having to complicate the command line and config parsing to support defining
multiple profiles.
To add the profiles when building a UKI, we introduce the new --add-profile
switch which takes a path to a PE binary describing a profile. The required
sections are read from each PE binary, measured and added as a profile.
The integration test is disabled until the new API is merged and exposed in
mkosi so that building a UKI with profiles can be left to mkosi and the integration
test will only test the switching between profiles and not the building of UKIs
with profiles.
- The text was clearly edited in various places to e.g. allow multiple
sections, so it first said that sections are singletons, and immediately
after that that some sections are not.
- Replace "regardless of the kernel" with "regardless of the kernel version".
The kernel is very much involved e.g. in loading of the initrds.
- Various other small rewordings to make the text more legible.
We had several users who wrote their unit files with
WantedBy=default.target because the unit should be started "every time".
But in Fedora/CentOS/RHEL, for example, this often breaks things such as
SELinux relabels (where we just want to do a relabel and reboot).
So far we supported this syntax:
ExecStart=foo ; bar
as equivalent to:
ExecStart=foo
ExecStart=bar
With this change we'll "soft" deprecate the first syntax. i.e. it's
still supported in code, but not documented anymore.
The concept was originally added to make things easier for 3rd party
.ini readers, as it allowed writing unit files with a .ini framework
that doesn't allow multiple assignments for the same key. But frankly,
this is kinda pointless, as so many other of our knobs require the
double assignment.
Hence, let's just stop advertising the concept, let's simplify the docs,
by removing one entirely redundant feature from it.
Replaces: #34570
Today it seems this is mostly used by mail and printer servers, and it's
not clear to me at all what the property is that makes
/var/spool/<package> the better place for the relevant data than
/var/lib/<package>.
Hence, in the interest of shortening the spec, let's not mention the dir
anymore. In particular as the dir really isn't used by us much, for
example we do not have a counterpart for RuntimeDirectory=,
StateDirectory=, … that would cover the spool.
Since most systems these days we care about probably come *without* a
printer or mail server, let's maybe not mention this in the man page that
is supposed to discuss the rough skeleton of how things are set up. After
all, people are supposed to extend the skeleton with their stuff, and
this sounds more like a case for an extension of the skeleton instead of
being considered part of the skeleton itself.
The man page is supposed to provide a "generalized, though minimal and
modernized subset" (as per introductory paragraphs), from a systemd
perspective. But the thing is that /usr/include/ really doesn't matter
to us. It's a development thing, and slightly weird (because it arguably
would be better placed in /usr/share/include/ or so). It's not going to
be there on 95% of deployed systems, and we really don't want people to
bother with it on such systems.
We only define the skeleton of directories in this document, and it's
expected that people extend it, and I think this really should be one of
those dirs that is an extension of our skeleton, but not part of the
skeleton, if that makes any sense.
Since 3b16e9f419, even the libraries are
documented in the man page, so it is useful to mention which libraries are
checked in the command output.
Of course, the dependencies are kind of implementation detail, and may
be changed in the future version, but that's especially why I think
showing the library deps in the output is useful.
systemd-analyze is a debugging tool, and already shows many internal
states. I think there is nothing that prevents us from showing the deps.
Prompted by #34477.
Somebody wrapped the text, but whitespace is preserved in <programlisting>, so
the output was mangled. It also doesn't make sense to run systemd-path as root
(as indicated by '#'), so drop that. Also, this chunk should be a separate
paragraph.
We generally do _not_ want the same sysexts to be loaded in both initrd and
exitrd phases. The environment is completely different and it's unlikely that
the same code can be useful in both places. Nevertheless, it can be useful in
_some_ cases, for example when the sysext contains debugging tools.
I think we don't need to differentiate between initrds and exitrds through
SYSEXT_SCOPE, because the two types are made available in completely different
locations and loaded through a different mechanism, with very little chance of
an initrd being loaded as an exitrd without an explicit admin action (or the
other way around). So let's not complicate our code or definitions by an
explicit "exitrd" sysext designator, but just clarify that "initrd" also
encompasses exitrds in this context.
In a sentence like "The system manager does a, b, c, which is really d, and e.",
it is generally understood that the manager also does "e". This can be
quite confusing if the manager cannot do "e", in our case unmount the file
system on which it is sitting.
Similarly, we cannot "fall back to x if it is missing", since "it" in that
sentence means "x".
The concept is fairly well established and present in our docs in various
places.
Say that the exitrd is also marked by the presence of /etc/initrd-release.
Otherwise, `<variable>$BOOT</variable>` is rendered:
```
[2548/2992] Generating man/repart.d.5 with a custom command
Element variable in namespace '' encountered in para, but no template matches.
Element variable in namespace '' encountered in para, but no template matches.
```
SetShowStatus() was added in order to fix #11447. Recently, I ran into
the exact same problem that OP was experiencing in #11447. I wasn’t able
to figure out how to deal with the problem until I found #11447, and it
took me a while to find #11447.
This commit takes what I learned from reading #11447 and adds it to the
documentation. Hopefully, this will make it easier for other people who
run into the same problem in the future.
This was designed to deal with $BOOT, as defined by the Boot Loader
Specification, but it was made a generic mechanism because it is useful
elsewhere too. See the updated man page for usage examples, motivation,
and an explanation of how this works.
These are inspired by the existing commands that return the path to the
boot or ESP partitions. However, these new commands show the path to the
boot loader (systemd-boot) or UKI/stub (systemd-stub) that was used on
the current boot. This information is derived from EFI variables.
This introduces 'i' prefix for match string. When specified, string or
pattern will match case-insensitively.
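The effect of the prefix can be sketched as below. This is an illustration only: it assumes shell-style glob patterns as handled by Python's fnmatch module, and the function name is hypothetical rather than anything in udev.

```python
import fnmatch


def rule_match(pattern: str, value: str, case_insensitive: bool = False) -> bool:
    """Match `value` against a glob `pattern`, optionally ignoring case."""
    if case_insensitive:
        # Fold both sides to lower case before comparing.
        return fnmatch.fnmatchcase(value.lower(), pattern.lower())
    return fnmatch.fnmatchcase(value, pattern)
```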
Closes #34359.
Co-authored-by: Ryan Wilson <ryantimwilson@meta.com>
The verb is not really specific to credential management, it was always a
bit misplaced. Hence move it to systemd-analyze, where we already have
some general TPM related verbs such as "srk" and "pcrs".
systemd-stub provides the signing key for TPM2 signed PCR policies in a
file tpm2-pcr-public-key.pem to userspace. Hence, to clarify that this
is the same key as used when signing via "systemd-measure", let's rename
it in the docs like that.
Also rename the private key to tpm2-pcr-private-key.pem, to keep the
symmetry.
With this we should universally stick to this nomenclature:
1. tpm2-pcr-public-key.pem ← public part of signing key
2. tpm2-pcr-private-key.pem ← private part of signing key
3. tpm2-pcr-signature.json ← signature file made with key pair
Inspired by: #34069
It is not true that "no string" is written to journal; the binary
name is used when run via `systemd-cat command`, or `cat` is used
when run via `command | systemd-cat`.
These variables closely mirror the existing
LoaderDevicePartUUID/LoaderImageIdentifier variables. But the Stub…
variables indicate the location of the stub/UKI (i.e. of systemd-stub),
while the Loader… variables indicate the location of the boot loader
(i.e. of systemd-boot). (Except, of course, if there is no boot loader used,
in which case both sets point to the stub/UKI, as a special case.)
This actually matters, as we support that sd-boot runs off the ESP,
while a UKI then runs off XBOOTLDR, i.e. two distinct partitions.
First of all, these were always set, i.e. since sd-boot was merged into
our tree, i.e. v220. Let's say so explicitly.
Also, let's be more accurate regarding which partition this refers to:
it's usually "the" ESP, but given that you can make firmware boot from
arbitrary disks, it could be any other partition too. Hence, be
explicit on this.
Also, clarify that sd-stub will set this too, if sd-boot never set it.
This adds an ability to add alternative sections of a specific type in
the same UKI. The primary usecase is for supporting multiple different
kernel cmdlines that are baked into a UKI.
The mechanism is relatively simple (I think), in order to make it robust.
1. A new PE section ".profile" is introduced, that is a lot like
".osrel", but contains information about a specific "profile" to
boot. The ".profile" section can appear multiple times in the same
PE, and acts as delimiter indicating where a new profile starts.
Everything before the first ".profile" is called the "base profile",
and is shared among all other profiles, which can then override or
add additional PE sections on top.
2. A UKI's command line can be prefixed with an argument such as "@0" or
"@1" or "@2" which indicates the "profile" to boot. If no argument is
specified the default is profile 0. Also, a UKI that lacks any
.profile section is treated like one with only a profile 0, but with
no data in that profile section.
3. The stub will first search for its usual set of PE sections
(hereafter called "base sections"), and stop at the first .profile PE
section if any. It will then find the .profile matching the selected
profile by its index, and apply any sections found as part of that profile
on top of the base sections.
And that's already it.
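Step 2 above, the handling of the profile selector, can be sketched like this. It is a toy model in Python rather than the stub's code, and the function name is hypothetical; it only shows how a leading "@N" would be split off and how a missing selector defaults to profile 0.

```python
from typing import Tuple


def split_profile_selector(cmdline: str) -> Tuple[int, str]:
    """Split an optional leading "@N" selector off a UKI command line."""
    stripped = cmdline.lstrip()
    first, _, rest = stripped.partition(" ")
    digits = first[1:]
    if first.startswith("@") and digits.isascii() and digits.isdigit():
        return int(digits), rest.lstrip()
    # No selector given: profile 0 is the default.
    return 0, stripped
```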
Example: let's say a distro wants to provide a single UKI that can be
invoked in one of three ways:
1. The regular profile that just boots the system
2. A profile that boots into storagetm
3. A profile that initiates factory reset and reboots.
For this it would define a classic UKI with sections .linux, .initrd,
.cmdline, and whatever else it needs. The .cmdline section would contain
the kernel command line for the regular profile.
It would then insert one ".profile" section, with contents like the
following:
ID=regular
This is the profile for profile 0. It would immediately afterwards add
another ".profile" section:
ID=storagetm
TITLE=Boot into Storage Target Mode
This would then be followed by a .cmdline section that is just like the
basic one, but with "rd.systemd.unit=storage-target-mode.target"
suffixed. Then, another .profile section would be added:
ID=factory-reset
TITLE=Factory Reset
Which is then followed by one last PE section: a .cmdline one with
"systemd.unit=factory-reset.target" suffixed to the regular command line.
i.e. expressed in tabular form the above would be:
The base profile:
.linux
.initrd
.cmdline
.osrel
The regular boot profile:
.profile
The storagetm profile:
.profile
.cmdline
The factory reset profile:
.profile
.cmdline
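How the sections in this layout combine for a given profile can be modelled as follows. This is a sketch only: the section payloads, including the "root=/dev/sda1" command line, are placeholders invented for the example, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Section = Tuple[str, str]  # (section name, placeholder payload)


def resolve_sections(sections: List[Section], profile: int) -> Dict[str, str]:
    """Return the sections in effect for `profile`: base sections plus overrides."""
    result: Dict[str, str] = {}
    current = -1  # -1 means we are still inside the base profile
    for name, payload in sections:
        if name == ".profile":
            current += 1  # each .profile section starts the next profile
        if current == -1 or current == profile:
            result[name] = payload  # later sections override earlier ones
    return result


# The layout from the table above, with made-up payloads.
UKI = [
    (".linux", "kernel"), (".initrd", "initrd"),
    (".cmdline", "root=/dev/sda1"), (".osrel", "os-release"),
    (".profile", "ID=regular"),
    (".profile", "ID=storagetm"),
    (".cmdline", "root=/dev/sda1 rd.systemd.unit=storage-target-mode.target"),
    (".profile", "ID=factory-reset"),
    (".cmdline", "root=/dev/sda1 systemd.unit=factory-reset.target"),
]
```

Resolving profile 1 keeps .linux, .initrd and .osrel from the base profile and replaces .cmdline with the storagetm one.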
You might wonder why the first .cmdline in the list above is placed in
the base profile rather than in the regular boot profile, given that it
is overridden in all other profiles anyway. And you are right. The only
reason I'd place it in the base profile is that it makes the UKI more
nicely extensible if later profiles are added that want to replace
something else instead of the .cmdline, for example .ucode or so. But it
really doesn't matter much.
While the primary usecase is of course multiple alternative command
lines, the concept is more powerful than that: for various usecases it
might be valuable to offer multiple choices of devicetree, ucode or
initrds.
The .profile contents is also passed to the invoked kernel as a file in
/.extra/profile (via a synthetic initrd). Thus, this functionality can
even be useful without overriding any section at all, simply by means of
reading that file from userspace.
Design choices:
1. On purpose I used a special command line marker (i.e. the "@" thing,
which maybe we should call the "profile selector"), that doesn't look
like a regular kernel command line option. This is because this is
really not a regular kernel command line option – we process it in
the stub, then remove it as prefix, and measure the unprefixed
command line only after that. The kernel will not see the profile
selector either. I think these special semantics are best
communicated by making it look substantially different from regular
options.
2. This moves around measurements a bit. Previously we measured our UKI
sections right after finding them. Now we first parse the profile
number from the command line, then search for the profile's sections,
and only then measure the sections we actually end up using for this
profile. I think that this logic makes most sense: measure what we
are using, not what we are overriding. Or in other words, if you boot
profile @3, then we'll measure .cmdline (assuming it exists) of
profile 3, and *not* measure .cmdline of the base profile. Also note
that if the user passes in a custom kernel command line via command
line arguments we'll strip off the profile selector (i.e. the initial
"@X" thing) before we pass it on.
3. The .profile stuff is supposed to be generic and extensible. For
example we could use it in future to mark "dangerous" options such as
factory reset, so that boot menus can ask for confirmation before
booting into it. Or we could introduce match expressions against
SMBIOS or other system identifiers, to filter out profiles on
specific hw.
Note btw, that PE allows defining multiple sections that point to the
same offsets in the file. This allows sharing payload under different
names. For example, if profile @4 and @7 shall carry the same .ucode
section, they can define .ucode in each profile and then make it point to
the same offset.
Also note that one can even "mask" a base section in a profile, by
inserting an empty section. For example, if the base .dtb section should
not be used for profile @4, then add a section .dtb right after the
fourth .profile with a zero size to the UKI, and you will get your wish
fulfilled.
This code only contains changes to sd-stub. A follow-up commit will
teach sd-boot to also find these profile PE sections to synthesize
additional menu entries from a single UKI.
A later commit will add support for generating this via ukify.
Fixes: #24539
In mkosi, I want to add a sysupdate verb to wrap systemd-sysupdate.
The definitions will be picked up from mkosi.sysupdate/ and passed
to systemd-sysupdate. I want users to be able to write transfer
definitions that are independent of the output directory used by
mkosi. To make this possible, it should be possible to specify the
directory that transfer sources should be looked up in on the sysupdate
command line. Let's allow this via a new --transfer-source= option.
Additionally, transfer sources that want to take advantage of this
feature should specify PathRelativeTo=directory to indicate the configured
Path= is interpreted relative to the transfer source directory specified
on the CLI.
This allows for the following transfer definition to be put in
mkosi.sysupdate:
"""
[Transfer]
ProtectVersion=%A
[Source]
Type=regular-file
Path=/
PathRelativeTo=directory
MatchPattern=ParticleOS_@v.usr-%a.@u.raw
[Target]
Type=partition
Path=auto
MatchPattern=ParticleOS_@v
MatchPartitionType=usr
PartitionFlags=0
ReadOnly=1
"""
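The path lookup this enables can be sketched as below. This is a rough Python model, not sysupdate's resolver: the function name is hypothetical, and raising an error when no transfer source directory was given is an assumption of the sketch.

```python
import posixpath
from typing import Optional


def resolve_source_path(path: str, relative_to: str, transfer_source: Optional[str]) -> str:
    """Resolve a [Source] Path= against the --transfer-source= directory."""
    if relative_to != "directory":
        return path  # other PathRelativeTo= values are out of scope here
    if transfer_source is None:
        # Assumption: without a directory on the command line there is nothing to resolve against.
        raise ValueError("PathRelativeTo=directory needs --transfer-source=")
    # Path= is written as an absolute path; strip the leading "/" before joining.
    return posixpath.normpath(posixpath.join(transfer_source, path.lstrip("/")))
```

With the definition above, Path=/ and a transfer source directory of /work/output (a made-up path) resolve to /work/output.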
This option is pretty simple: it allows specifying a UKI whose
sections to import first, and place at the beginning of the new UKI.
This is useful for generating multi-profile UKIs piecemeal: generate the
base UKI first, then append a profile, and another one and another one.
The sections imported this way are not included in any PCR signature,
the assumption is that that already happened before in the imported UKI.
Now that mkfs.btrfs is adding support for compressing the generated
filesystem (https://github.com/kdave/btrfs-progs/pull/882), let's
add general support for specifying the compression algorithm and
compression level to use.
We opt to not parse the specified compression algorithm and instead
pass it on as is to the mkfs tool. This has a few benefits:
- We support every compression algorithm supported by every tool
automatically.
- Users don't need to modify systemd-repart if a mkfs tool learns a
new compression algorithm in the future
- We don't need to maintain a bunch of tables for filesystem to map
from our generic compression algorithm enum to the filesystem specific
names.
We don't add support for btrfs just yet until the corresponding PR
in btrfs-progs is merged.
These operations might require slow I/O, and thus might block PID1's main
loop for an indeterminate amount of time. Instead of performing them
inline, fork a worker process and stash away the D-Bus message, and reply
once we get a SIGCHLD indicating they have completed. That way we don't
break compatibility and callers can continue to rely on the fact that when
they get the method reply the operation either succeeded or failed.
To keep backward compatibility, unlike reload control processes, these
are run inside init.scope and not the target cgroup. Unlike ExecReload,
this is under our control and is not defined by the unit. This is necessary
because previously the operation also wasn't run from the target cgroup,
so suddenly forking a copy-on-write copy of pid1 into the target cgroup
will make memory usage spike, and if there is a MemoryMax= or MemoryHigh=
set and the cgroup is already close to the limit, it will cause an OOM
kill, where previously it would have worked fine.
This adds two more fields in 'udevadm info':
- J for device ID, e.g. b128:1, c10:1, n1, and so on.
- B for driver subsystem, e.g. pci, i2c, and so on.
These, especially the device ID field may be useful to find udev
database file under /run/udev/data for a device.
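The mapping from a device to its ID and database file can be sketched like this. It is an illustration, not the sd-device code: the function names are hypothetical, and the "+<subsystem>:<sysname>" form for devices without a device number or interface index is an assumption not stated in this message.

```python
from typing import Optional, Tuple


def device_id(
    subsystem: str,
    sysname: str,
    devnum: Optional[Tuple[int, int]] = None,
    ifindex: Optional[int] = None,
) -> str:
    """Build a device ID string in the style shown by the J field."""
    if devnum is not None:
        major, minor = devnum
        prefix = "b" if subsystem == "block" else "c"
        return f"{prefix}{major}:{minor}"
    if ifindex is not None:
        return f"n{ifindex}"
    # Assumption: everything else is identified by subsystem and sysname.
    return f"+{subsystem}:{sysname}"


def database_path(dev_id: str) -> str:
    """Path of the udev database file for a device ID."""
    return f"/run/udev/data/{dev_id}"
```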
To create the sd_device object of a driver, the function
sd_device_new_from_subsystem_sysname() requires "drivers" for subsystem
and e.g. "pci:iwlwifi" for sysname. Similarly, sd_device_new_from_device_id()
also requires driver subsystem. However, we have never provided a
way to get the driver subsystem ("pci" for the previous example) from
an existing sd_device object.
Let's introduce a way to get driver subsystem.
One of the major pain points of managing fleets of headless nodes is
that when something fails at startup, unless debug level was already
enabled (which it usually isn't, as it's a firehose), one needs to manually
enable it and pray the issue can be reproduced, which often is really
hard and time consuming, just to get extra info. Usually the extra log
messages are enough to triage an issue.
This new option makes it so that when a service fails and is restarted
due to Restart=, log level for that unit is set to debug, so that all
setup code in pid1 and sd-executor logs at debug level, and also a new
DEBUG_INVOCATION=1 env var is passed to the service itself, so that it
knows it should start with a higher log level. Once the unit succeeds
or reaches the rate limit the original level is restored.
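On the service side, reacting to the new variable could look roughly like this. It is a minimal sketch for a hypothetical Python service; the function name and the choice of INFO as the normal level are assumptions of the example.

```python
import logging
import os
from typing import Mapping


def pick_log_level(environ: Mapping[str, str] = os.environ) -> int:
    """Use debug logging when the service manager requests a debug invocation."""
    if environ.get("DEBUG_INVOCATION") == "1":
        return logging.DEBUG
    return logging.INFO  # assumed normal level for this example


# Configure logging once at startup, before the service does any work.
logging.basicConfig(level=pick_log_level())
```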
So far we manually hardcoded $LISTEN_FDNAMES to "varlink" in various
varlink service units we ship, even though FileDescriptorName=varlink
is specified in associated socket units already, because
FileDescriptorName= is currently silently ignored when combined with
Accept=yes. Let's step away from this, which seems saner.
Note that this is technically a compat break, but a mostly negligible
one as there shall be few users setting FileDescriptorName= but
still expecting LISTEN_FDNAMES=connection in the actual executable.
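For reference, this is roughly how an executable would read the names. The sketch follows the socket activation environment protocol (descriptors start at 3, names are colon-separated) but skips the $LISTEN_PID check for brevity; the function name and the "unknown" fallback are choices of the example.

```python
from typing import Dict, Mapping

SD_LISTEN_FDS_START = 3  # first passed file descriptor


def named_listen_fds(environ: Mapping[str, str]) -> Dict[int, str]:
    """Map each passed file descriptor number to its name."""
    count = int(environ.get("LISTEN_FDS", "0"))
    raw = environ.get("LISTEN_FDNAMES", "")
    names = raw.split(":") if raw else []
    fds = range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count)
    # Descriptors without a name fall back to "unknown" in this sketch.
    return {fd: (names[i] if i < len(names) else "unknown") for i, fd in enumerate(fds)}
```

A service that compares the name against "connection" sees "varlink" instead once FileDescriptorName=varlink is honoured.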
Preparation for #34080
DefaultRoute is a D-Bus property, not a valid setting name in .network
files nor resolved.conf.
Whether a link is the default route or not is configured with
DNSDefaultRoute= setting in .network files.
I don't actually need this anymore since we're going with a
unit based approach for the containers stuff internally so
let's just revert it.
Fixes #34085
This reverts commit ce2291730d.
A previous commit made sysupdate recognize installed versions where some
transfers are missing. This commit teaches sysupdate how to correctly
repair these incomplete versions.
Previously, if you had an incomplete installation of the OS booted, and
ran sysupdate in an attempt to repair it, sysupdate would make things
worse by creating copies of the currently-booted partitions in the
inactive slots. Then at boot you have two identical partitions, with
identical labels and UUIDs, and end up with a mess.
With this commit, sysupdate is able to recognize situations where it can
simply download the missing transfers and leave the rest of the system
undisturbed.
Partial fix for https://github.com/systemd/systemd/issues/33339
When enumerating what versions exist for a given target, sysupdate would
completely throw out any version that's incomplete (where some of the
transfers in the target have that version installed or available, and
other transfers do not).
If we're trying to find what versions we can offer for download, this is
great behavior. If the server side is advertising a partial update to
download, we shouldn't present it to the user.
On the other hand, if we're enumerating what versions we have currently
installed, this is a bad behavior. It makes sysupdate fragile. For
example, if a sysext introduces a new .conf file into
/usr/lib/sysupdate.d, suddenly the currently-installed OS stops being a
version that we've enumerated. Since it's not enumerated, it's not
protected, and so sysupdate will wipe the booted OS.
So if we're looking for installed versions, we now loosen the
restrictions and enumerate incomplete installations.
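The two enumeration modes can be contrasted with a small model. This is a sketch, not sysupdate's data structures: the function name and the transfer names in the usage note are made up, and versions are sorted as plain strings rather than by version comparison.

```python
from typing import Dict, List, Set


def enumerate_versions(
    per_transfer: Dict[str, Set[str]],
    allow_incomplete: bool,
) -> List[str]:
    """List versions across transfers, optionally keeping incomplete ones."""
    all_versions: Set[str] = set().union(*per_transfer.values())
    if allow_incomplete:
        # Installed versions: a version counts even if some transfer lacks it.
        return sorted(all_versions)
    # Downloadable versions: every transfer must provide the version.
    return sorted(
        v for v in all_versions
        if all(v in versions for versions in per_transfer.values())
    )
```

With `{"usr": {"1", "2"}, "verity": {"1", "2"}, "new": set()}`, the strict mode enumerates nothing, while the relaxed mode still enumerates versions 1 and 2, so they remain protected.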
Partial fix for https://github.com/systemd/systemd/issues/33339
This has been a glaring omission in the docs: when people create
.user/.group/.user-privileged/.group-privileged drop-in files, they
should also create matching .membership files.
This softens the behavior originally introduced in eded61e410 to apply
only to the fallback dns servers.
The intent is that the global FallbackDNS (instead of DNS) can now be
used in conjunction with the per-link dns, providing a fallback behavior
without introducing a scope overlap.
References: eded61e410 (resolved: demote the global unicast scope, 2024-08-19)
This commit may have been a breaking change for sd-resolved foreign
resolv.conf mode, where a legacy network management daemon directly
modifies resolv.conf and sd-resolved consumes that.
This reverts commit eded61e410.
mkfs.btrfs has recently learned new options --subvol and --default-subvol
so let's stop failing when Subvolumes= and DefaultSubvolume= are used offline
and use the new --subvol and --default-subvol options instead to create subvolumes
in the generated root filesystem without root privileges or loop devices.
This is the command-line tool to manage systemd-sysupdated.
Co-authored-by: Tom Coldrick <thomas.coldrick@codethink.co.uk>
Co-authored-by: Abderrahim Kitouni <abderrahim.kitouni@codethink.co.uk>
This will greatly reduce the number of cases where the global unicast
scope overlaps with link scopes configured as default-route, making it
feasible to use the global DNS setting in conjunction with per-link dns
servers configured by the network.
This change is preferred over demoting links to default-route=no where
the user prefers to use the network provided DNS servers, and I expect
it is non-disruptive in that it should not degrade the efficacy of any
existing configuration.
Note, `systemd-analyze foo@.service --instance=hoge` is equivalent to
`systemd-analyze foo@hoge.service`. But, the option may be useful when
e.g. passing multiple template units that have restriction on their
instance name:
```
$ ls
template_aaa@.service template_bbb@.service template_ccc@.service
$ systemd-analyze ./template_* --instance=hoge
```
Without the option, we need to embed an instance name into each unit
name, so cannot use globs.
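The equivalence can be expressed as a tiny name transformation. This is a sketch of the idea in Python, with a hypothetical helper name; it is not how systemd-analyze parses unit names.

```python
def instantiate(unit: str, instance: str) -> str:
    """Fill `instance` into a template unit name; leave other names untouched."""
    prefix, at, suffix = unit.partition("@")
    if at and suffix.startswith("."):
        # "foo@.service" is a template: the instance goes between "@" and ".".
        return f"{prefix}@{instance}{suffix}"
    return unit
```

Applied to the three templates from the listing above with instance "hoge", this gives template_aaa@hoge.service, template_bbb@hoge.service and template_ccc@hoge.service.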
Prompted by #33681.