Instead of running meson install and hoping for the best, let's build
distribution packages from the downstream packaging specs. This gets
us the following:
- Vastly simplified mkosi scripts since we don't need a separate initrd
image anymore but can just reuse the default mkosi initrd.
- Almost everything can move to the base image as its not the basis
anymore for the initrd and as such we don't need to care about the
size anymore.
- The systemd packages that get pulled in as dependencies of other
packages get properly uninstalled and replaced with our packages that
we built instead of just installing on top of an existing systemd
installation with no guarantee that everything from that previous
installation was removed.
- Much better testing coverage as what we're testing is much closer
to what will actually be deployed in distributions.
- Immediate feedback if something we change breaks distribution packaging
- We get integration with the distribution for free as we'll automatically
use the proper directories and such instead of having to hack this
into a mkosi build script.
- ...
Whenever a home directory is in a locked state, accessing the files of
the home directory is extremely likely to cause the thread to hang. This
will put the session in a strange state, where some threads are hanging
due to file access and others are not hanging because they are not
trying to access any of the user's files.
This can lead to a whole slew of consequences. For example, imagine a
likely situation where the Wayland compositor is not hanging, but the
user's open apps are. Eventually, the compositor will detect that none
of the apps are responding to its pings, assume that they're frozen
(which they are), and kill them. The systemd user instance can end up in
a similarly confused state and start killing user services. In the worst
case, killing an app at an unexpected moment can lead to data loss.
The solution is to suspend execution of the whole user session by
freezing the user's slice.
Previously, we'd only freeze user.slice in the case of s2h, because we
didn't want the user session to resume while systemd was transitioning
from suspend to hibernate.
This commit extends this freezing behavior to all sleep modes.
We also have an environment variable to disable the freezing behavior
outright. This is a necessary workaround for someone that has hooks
in /usr/lib/systemd/system-sleep/ which communicate with some
process running under user.slice, or if someone is using the proprietary
NVIDIA driver which breaks when user.slice is frozen (issue #27559)
Fixes#27559
This is a kinda a follow-up for ce266330fc: it
makes resolved authoritative on our local hostname, and never contacts
DNS anymore for it.
We effectively already were authoritative for it, except if the user
queried for other RR types than just A/AAAA. This closes the gap and
refuses routing other RR type queries to DNS.
Fixes: #23662
It turns out it's mostly PKCS11 that supports the URI format,
and other engines just take files. For example the tpm2-tss-openssl
engine just takes a sealed private key file path as the key input,
and the engine needs to be specified separately.
Add --private-key-source=file|engine:foo|provider:bar to
manually specify how to use the private key parameter.
Follow-up for 0a8264080a
These will be used by display managers to pre-select the user's
preferred desktop environment and display server type. On homed, the
display manager will also be able to set these fields to cache the
user's last selection.
Most of our kernel cmdline options use underscores as word separators in
kernel cmdline options, but there were some exceptions. Let's fix those,
and also use underscores.
Since our /proc/cmdline parsers don't distinguish between the two
characters anyway this should not break anything, but makes sure our own
codebase (and in particular docs and log messages) are internally
consistent.
This reverts commit 5e8ff010a1.
This broke all the URLs, we can't have that. (And actually, we probably don't
_want_ to make the change either. It's nicer to have all the pages in one
directory, so one doesn't have to figure out to which collection the page
belongs.)
We're documenting the behavior of blob directories here. These docs
refer to things that aren't yet implemented at the time of the commit, but will be later in the same PR.
Let's make sure that versions generated by meson-vcs-tag.sh always
sort higher than official and stable releases. We achieve this by
immediately updating the meson version in meson.build after a new
release. To make sure this version always sorts lower than future
rcs, we suffix it with "~devel" which will sort lower than "~rcX".
The new release workflow is to update the version in meson.build
for each rc and the official release and to also update the version
number after a new release to the next development version.
The full version is exposed as PROJECT_VERSION_FULL and used where
it makes sense over PROJECT_VERSION.
We also switch to reading the version from a meson.version file in
the repo instead of hardcoding it in meson.build. This makes it
easier to access both inside and outside of the project.
The meson-vcs-tag.sh script is rewritten to query the version from
meson.version instead of passing it in via the command line. This
makes it easier to use outside of systemd since users don't have to
query the version themselves first.
This adds fields to the user record logic to allow a "fallback" home
directory and shell to be set as part of the "status" section of the
user record, i.e. supplied by the manager of the user record.
The idea is that if the fallback homedir/shell is set it will take
precedence over the real one in most ways.
Usecase: let's try to make ssh logins into homed directories work.
systemd-homed would set a fallback shell/homedir for inactive home dirs.
Thus, when ssh logins take place via key auth, we can allow them, and
these fallback session params would be used because the real home cannot
be activated just yet becasue we cannot acquire any password for it from
the user.
This field is like preferredLanguage, but takes a priority list of
languages instead. If an app isn't translated into a user's primary
language, it can fall back to one of the other languages in the list
thus making the app more accessible to the user.
For instance: in my experience, many Ukrainians are fluent in Russian,
often significantly better than English (especially if they are of a
generation that grew up during the USSR). Such a person might set this
new variable to ["uk_UA.UTF-8", "ru_UA.UTF-8"] so that software that
lacks Ukrainian translations will first try Russian translations before
defaulting to English.
Fixes#31290
tilde sorts lower in the version comparison spec:
https://uapi-group.org/specifications/specs/version_format_specification/
➜ systemd git:(strip) systemd-analyze compare-versions 249\~rc1 249
249\~rc1 < 249
➜ systemd git:(strip) systemd-analyze compare-versions 249-rc1 249
249-rc1 > 249
Also update tools/meson-vcs-tag.sh to use carets instead of hyphens
for the git part of the version as carets are allowed to be part of
a version by pacman while hyphens are not and both sort higher than
a version without the git part.
When running an image that cannot be mounted (e.g.: key missing intentionally
for development purposes), there's a retry loop that takes some time
and slows development down. Add an env var to disable it.
It silly for our docs to say that they aren't when we added support for this a
few years ago.
Also, drop some mentions of "runtime". This implied that those values can be
changed almost at will, but actually, they can only be meaningfully changed
_before_ the allocations are made.
varlink_server_listen_auto() is supposed to be the one-stop solution for
turning simple command line tools into IPC services. They aren't easy to
test/debug however, since you have to invoke them through a service
manager.
Let's make this easier: if the SYSTEMD_VARLINK_LISTEN env var is set,
let's listen on the socket specified therein. This makes things easier
to gdb: just run the service from the cmdline.
Both building and booting a directory image is much faster than
building or booting a disk image so let's default to a directory
image.
In CI, we stick to a disk image to make sure that keeps working as
well.
The only extra dependency this introduces is virtiofsd which is
packaged in all distributions except Debian stable. For users
hacking on systemd on Debian stable, a disk image can be built by
writing the following to mkosi.local.conf:
```
[Output]
Format=disk
```
To make things symmetric to the $SYSTEMD_SSH logic that the varlink
transport supports, let's also honour such a variable in sd-bus when
picking ssh transport.
This uses openssh 9.4's -W support for AF_UNIX. Unfortunately older versions
don't work with this, and I couldn#t figure a way that would work for
older versions too, would not be racy and where we'd still could keep
track of the forked off ssh process.
Unfortunately, on older versions -W will just hang (because it tries to
resolve the AF_UNIX path as regular host name), which sucks, but hopefully this
issue will go away sooner or later on its own, as distributions update.
Fedora is still stuck at 9.3 at the time of posting this (even on
Fedora), even though 9.4, 9.5, 9.6 have all already been released by
now.
Example:
varlinkctl call -j ssh:root@somehost:/run/systemd/io.systemd.Credentials io.systemd.Credentials.Encrypt '{"text":"foobar"}'
Otherwise, udev workers cannot detect slow programs invoked by
IMPORT{program}=, PROGRAM=, or RUN=, and whole worker process may be
killed.
Fixes#30436.
Co-authored-by: sushmbha <sushmita.bhattacharya@oracle.com>
Same as $KERNEL_INSTALL_BYPASS, but for hwdb. This will speed up
cross architecture image builds in mkosi as I can disable package
managers from running the costly hwdb update stuff in qemu user
mode and run it myself with a native systemd-hwdb with --root=.
Now that mkosi-kernel is a thing, this logic in systemd is just mostly
bitrotting since I just use mkosi-kernel these days. If I ever need to
hack on systemd and the kernel in tandem, I'll just add support for
building systemd to mkosi-kernel instead, so let's drop the support for
building a custom kernel in systemd's mkosi configuration.
This code doesn't link when gcc+lld is used:
$ LDFLAGS=-fuse-ld=lld meson setup build-lld && ninja -C build-lld udevadm
...
ld.lld: error: src/shared/libsystemd-shared-255.a(libsystemd-shared-255.a.p/cryptsetup-util.c.o):
symbol crypt_token_external_path@@ has undefined version
collect2: error: ld returned 1 exit status
As a work-around, restrict it to developer mode.
Closes https://github.com/systemd/systemd/issues/30218.
- Use mkosi.images/ instead of mkosi.presets/
- Use the .chroot suffix to run scripts in the image
- Use BuildSources= match for the kernel build
- Move 10-systemd.conf to mkosi.conf and rely on mkosi.local.conf
for local configuration
don't let the devices to be announced just as model "Linux". Let's instead
propagate the underlying block device's model. Also do something
reasonably smart for the serial and firmware version fields.
Introduce a new env variable $SYSTEMD_NSPAWN_CHECK_OS_RELEASE, that can
be used to disable the os-release check for bootable OS trees. Useful
when trying to boot a container with empty /etc/ and bind-mounted /usr/.
Resolves: #29185
I tried to get something similar upstream:
https://gitlab.com/cryptsetup/cryptsetup/-/issues/846
But no luck, it was suggested I use ELF interposition instead. Hence,
let's do so (but not via ugly LD_PRELOAD, but simply by overriding the
relevant symbol natively in our own code).
This makes debugging tokens a ton easier.
Automatically softreboot if the nextroot has been set up with an OS
tree, or automatically kexec if a kernel has been loaded with kexec
--load.
Add SYSTEMCTL_SKIP_AUTO_KEXEC and SYSTEMCTL_SKIP_AUTO_SOFT_REBOOT to
skip the automated switchover.
Instead of using ExtraTrees=, let's use the new RuntimeTrees= option
to mount the full repository into the VM/container. Let's also store
the sources under /usr/src/systemd and update the gdbinit file and
vscode HACKING guide section to match the new location.
Currently we have a 100ms delay which allows for people to enter/show
the boot menu even when timeout is set to zero.
In a handful of cases, that may not be needed - both in terms of access
policy, as well as latency.
For example: the option to provide the boot menu may be hidden behind an
"expert only" UX in the OS, to avoid end users from accidentally
entering it.
In addition, the current 100ms input polling may cause unexpected
additional delays in the boot. Some example numbers from my SteamDeck:
- boot counting/rename/flush doubles 300us -> 600us
- seed/hash setup doubles 900us -> 1800us
- kernel/image load gets ~40% slower 107ms -> 167ms
It's not entirely clear why the UEFI calls gets slower, nevertheless the
information in itself proves useful.
This commit introduces a new option "menu-disabled", which omits the
100ms delay. The option is documented throughout the manual pages as
well as the Boot Loader Specification.
v2:
- use STR_IN_SET
v3:
- drop erroneous whitespace
v4:
- add a new LoaderFeature bit,
- don't change ABI keep TIMEOUT_* tokens the same
- move new token in the 64bit range, update API and storage for it
- change inc/dec behaviour to TIMEOUT_MIN : TIMEOUT_MENU_FORCE
- user cannot opt-in from sd-boot itself, add assert_not_reached()
v5:
- s/Menu disablement control/Menu can be disabled/
- rewrap comments to 109
- use SYNTHETIC_ERRNO(EOPNOTSUPP)
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
To be on the safe side, explicitly mention that apart from the numerical
entries we can allow string ones.
Implementation-wise, bootctl will use internal numerical values that
match sd-boot's ABI. The latter also accepts the string options.
Going forward we'd like to avoid adding more internal magic and be more
explicit.
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Currently we spawn services by forking a child process, doing a bunch
of work, and then exec'ing the service executable.
There are some advantages to this approach:
- quick: we immediately have access to all the enourmous amount of
state simply by virtue of sharing the memory with the parent
- easy to refactor and add features
- part of the same binary, will never be out of sync
There are however significant drawbacks:
- doing work after fork and before exec is against glibc's supported
case for several APIs we call
- copy-on-write trap: anytime any memory is touched in either parent
or child, a copy of that page will be triggered
- memory footprint of the child process will be memory footprint of
PID1, but using the cgroup memory limits of the unit
The last issue is especially problematic on resource constrained
systems where hard memory caps are enforced and swap is not allowed.
As soon as PID1 is under load, with no page out due to no swap, and a
service with a low MemoryMax= tries to start, hilarity ensues.
Add a new systemd-executor binary, that is able to receive all the
required state via memfd, deserialize it, prepare the appropriate
data structures and call exec_child.
Use posix_spawn which uses CLONE_VM + CLONE_VFORK, to ensure there is
no copy-on-write (same address space will be used, and parent process
will be frozen, until exec).
The sd-executor binary is pinned by FD on startup, so that we can
guarantee there will be no incompatibilities during upgrades.
Let's mention that we just need the latest stable release of mkosi,
not the latest git commit. We also split the instructions for building
on the host and the instructions for building with mkosi into two blocks,
as it's not required to build on the host anymore to build with mkosi.
On normal systems, triggering a timeout should be a bug in code or
configuration error, so I do not think we should extend the default
timeout. Also, we should not introduce a 'first class' configuration
option about that. But, making it configurable may be useful for cases
such that "an extremely highly utilized system (lots of OOM kills,
very high CPU utilization, etc)".
Closes#25441.
The tool initially just measured the boot phase, but was subsequently
extended to measure file system and machine IDs, too. At AllSystemsGo
there were request to add more, and make the tool generically
accessible.
Hence, let's rename the binary (but not the pcrphase services), to make
clear the tool is not just measureing the boot phase, but a lot of other
things too.
The tool is located in /usr/lib/ and still relatively new, hence let's
just rename the binary and be done with it, while keeping the unit names
stable.
While we are at it, also move the tool out of src/boot/ and into its own
src/pcrextend/ dir, since it's not really doing boot related stuff
anymore.
ispell made some suggestions which I applied.
Addresses: https://github.com/systemd/systemd/pull/29209#pullrequestreview-1632623460
Also adds a brief paragraph about initrd transitions. (Plymouth really
should start using the fdstore for pinning DRM objects, and stop trying
to survive the initrd→host transition)
There were a couple spelling/grammatical errors in the docs that made
it hard to read and understand parts of this doc. I cleaned up those
errors and reflowed the line breaks to keep to the 80 char limit.
The article "a" goes before consonant sounds and "an" goes before vowel
sounds. This commit changes an to a for UKI, UDP, UTF-8, URL, UUID, U-Label, UI
and USB, since they start with the sound /ˌjuː/.
Since mkosi is now smart enough to drop the caches when the list of
packages changes, let's enable Incremental= mode by default to ensure
a good experience for anyone new to hacking on systemd with mkosi.
ImportCredential= takes a credential name and searches for a matching
credential in all the credential stores we know about it. It supports
globs which are expanded so that all matching credentials are loaded.
The kernel, systemd, and many other things print their version during boot.
sd-boot and sd-stub are also important, so let's print the version if EFI_DEBUG.
(If !EFI_DEBUG, continue to be quiet.)
When updating the docs, I saw that that the text in HACKING.md was out of date.
Instead of trying to update the instructions there, make it shorter and refer
the reader to tools/debug-sd-boot.sh for details.
Before 7cd43e34c5, it was possible to use
SYSTEMD_PROC_CMDLINE=systemd.condition-first-boot to override autodetection.
But now this doesn't work anymore, and it's useful to be able to do that for
testing.
We provide the same stability for all the headers that are public.
Also, mark id128 as portable to other systems. There is really nothing in the
code that would make it hard. It would probably work out-of-the-box.
Let's start moving towards a more involved partitioning setup to
test our stuff more when using mkosi.
The root partition is generated on boot with systemd-repart.
CentOS supports neither erofs nor btrfs so we use squashfs and xfs
instead.
We also enable SecureBoot= locally for additional coverage. This
and the use of verity means users need to run `mkosi genkey` once
to generate the keys necessary to do secure boot and verity.
This implements a minimal subset of #24961, but in a lot more
restrictive way: we only allow one level of subcgroup (as that's enough
to address the no-processes in inner cgroups rule), and does not change
anything about threaded cgroup logic or similar, or make any of this new
behaviour mandatory.
All this does is this: all non-control processes we invoke for a unit
we'll invoke in a subgroup by the specified name.
We'll later port all our current services that use cgroup delegation
over to this, i.e. user@.service, systemd-nspawn@.service and
systemd-udevd.service.
To make it consistent with other env vars, e.g. $SYSTEMD_ESP_PATH or
$SYSTEMD_XBOOTLDR_PATH.
This is useful when the root is specified by a file descriptor, instead
of a path.
Fixes RHBZ#2183546 (https://bugzilla.redhat.com/show_bug.cgi?id=2183546).
Previously, journal file is always compressed with the default algorithm
set at compile time. So, if a newer algorithm is used, journal files
cannot be read by older version of journalctl that does not support the
algorithm.
Co-authored-by: Colin Walters <walters@verbum.org>
- Drop Netdev= as it was removed in mkosi
- Always install python-psutil in the final image (required for networkd tests)
- Always Install python-pytest in the final image (required for ukify tests)
- Use the narrow glob for all centos python packages
- Drop the networkd mkosi config files (the default image can be used instead)
- Use ".conf" as the mkosi config file suffix everywhere
- Copy src/ to /root/src in the final image and set gdb substitute path in
.gdbinit to make gdb work properly
This is useful to identify log messages with metadata from the images
they run on. Look for ID/VERSION_ID/IMAGE_ID/IMAGE_VERSION/BUILD_ID,
with a SYSEXT_ prefix if we are looking at an extension, and append via
LogExtraFields= as respectively PORTABLE_NAME_AND_VERSION= in case of a
single image. In case of extensions, append as PORTABLE_ROOT_NAME_AND_VERSION=
for the base and one PORTABLE_EXTENSION_AND_VERSION= for each extension.
Example with a base and two extensions, with the unit coming from the
first extension:
[Service]
RootImage=/home/bluca/git/systemd/base.raw
Environment=PORTABLE=app0.raw
BindReadOnlyPaths=/etc/os-release:/run/host/os-release
LogExtraFields=PORTABLE=app0.raw
Environment=PORTABLE_ROOT=base.raw
LogExtraFields=PORTABLE_ROOT=base.raw
LogExtraFields=PORTABLE_ROOT_NAME_AND_VERSION=debian_10
ExtensionImages=/home/bluca/git/systemd/app0.raw
LogExtraFields=PORTABLE_EXTENSION=app0.raw
LogExtraFields=PORTABLE_EXTENSION_NAME_AND_VERSION=app_0
ExtensionImages=/home/bluca/git/systemd/app1.raw
LogExtraFields=PORTABLE_EXTENSION=app1.raw
LogExtraFields=PORTABLE_EXTENSION_NAME_AND_VERSION=app_1
When a portable service uses extensions, we use the 'main' image name
(the one where the unit was found in) as PORTABLE=. It is useful to
also list all the images actually used at runtime, as they might
contain libraries and so on.
Use PORTABLE_ROOT= for the image/directory that is used as RootImage=
or RootDirectory=, and PORTABLE_EXTENSION= for the image/directory that
is used as ExtensionImages= or ExtensionDirectories=.
Note that these new fields are only added if extensions are used,
there's no change for single-DDI portables.
Example with a base and two extensions, with the unit coming from the
first extension:
[Service]
RootImage=/home/bluca/git/systemd/base.raw
Environment=PORTABLE=app0.raw
BindReadOnlyPaths=/etc/os-release:/run/host/os-release
LogExtraFields=PORTABLE=app0.raw
LogExtraFields=PORTABLE_ROOT=base.raw
ExtensionImages=/home/bluca/git/systemd/app0.raw
LogExtraFields=PORTABLE_EXTENSION=app0.raw
ExtensionImages=/home/bluca/git/systemd/app1.raw
LogExtraFields=PORTABLE_EXTENSION=app1.raw
This way we can quickly find the most recent entry, without searching or
traversing entry array chains.
This is relevant later, as it it allows us to quickly determine the most
recent timestamps of each journal file, in a roughly atomic way.
(The one case that is left unchanged is '< <(subcommand)'.)
This way, the style with no gap was already dominant. This way, the reader
immediately knows that ' < ' is a comparison operator and ' << ' is a shift.
In a few cases, replace custom EOF replacement by just EOF. There is no point
in using someting like "_EOL" unless "EOF" appears in the text.
The documentation on moving an existing homedir into a systemd-homed managed
one suggests using rsync(1) with a bunch of flags to preserve as much metadata
as possible: permissions, xattrs, timestamps, etc. The previously suggested
flags were:
rsync -aHAXv --remove-source-files …
… which does include mtimes, but not ctimes and atimes, because -a does not
include those:
--archive, -a archive mode is -rlptgoD (no -A,-X,-U,-N,-H)
This change adds the -N and -U flags to preserve even more file timestamps,
turning the command into:
rsync -aHANUXv --remove-source-files …
The new flags are:
--crtimes, -N preserve create times (newness)
--atimes, -U preserve access (use) times
Let's introduce a common implementation of a function that checks
whether we are booted on a kernel with systemd-stub that has TPM PCR
measurements enabled. Do our own userspace measurements only if we
detect that.
PCRs are scarce and most likely there are projects which already make
use of them in other ways. Hence, instead of blindly stepping into their
territory let's conditionalize things so that people have to explicitly
buy into our PCR assignments before we start measuring things into them.
Specifically bind everything to an UKI that reported measurements.
This was previously already implemented in systemd-pcrphase, but with
this change we expand this to all tools that process PCR measurement
settings.
The env var to override the check is renamed to SYSTEMD_FORCE_MEASURE,
to make it more generic (since we'll use it at multiple places now).
This is not a compat break, since the original env var for that was not
included in any stable release yet.
This was dropped on reviewers' request in the revision that got merged,
but reference in two documents was not updated. Fix it.
Follow-up for: https://github.com/systemd/systemd/pull/25918
This commit adds support for attaching extra metadata to log
messages written to the journal via log.h. We keep track of a
thread local log context in log.c onto which we can push extra
metadata fields that should be logged. Once a field is no longer
relevant, it can be popped again from the log context.
On top of this, we then add macros to allow pushing extra fields
onto the log context.
LOG_CONTEXT_PUSH() will push the provided field onto the log context
and pop the last field from the log context when the current block
ends. LOG_CONTEXT_PUSH_STRV() will do the same but for all fields in
the given strv.
Using the macros is as simple as putting them anywhere inside a block
to add a field to all following log messages logged from inside that
block.
void myfunction(...) {
...
LOG_CONTEXT_PUSH("MYMETADATA=abc");
// Every journal message logged will now have the MYMETADATA=abc
// field included.
}
For convenience, there's also LOG_CONTEXT_PUSHF() to allow constructing
the field to be logged using printf() syntax.
log_context_new()/log_context_free() can be used to attach a log context
to an async operation by storing it in the associated userdata struct.
This is intended to be used with VSOCK, to notify the hypervisor/VMM, eg on the host:
qemu <...> -smbios type=11,value=io.systemd.credential:vmm.notify_socket=vsock:2:1234 -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=42
(vsock:2:1234 -> send to host on vsock port 1234, default is to send to 0 which is
the hypervisor itself)
Also on the host:
$ socat - VSOCK-LISTEN:1234,socktype=5
READY=1
STATUS=Ready.
The text said /dev/tty* as a whole was the VT subsystem and that VT is
not supported in containers.
But that's not accurate as /dev/tty* will match /dev/tty too and that
one device node is special and is not related to VT: it always points to
the current process own controlling tty, regardless what that is.
hence, rewrite /dev/tty* as /dev/tty[0-9]*.
When we dissect images automatically, let's be a bit more conservative
with the file system types we are willing to mount: only mount common
file systems automatically.
Explicit mounts requested by admins should always be OK, but when we do
automatic mounts, let's not permit barely maintained, possibly legacy
file systems.
The list for now covers the four common writable and two common
read-only file systems. Sooner or later we might want to add more to the
list.
Also, it might make sense to eventually make this configurable via the
image dissection policy logic.
-1 was used everywhere, but -EBADF or -EBADFD started being used in various
places. Let's make things consistent in the new style.
Note that there are two candidates:
EBADF 9 Bad file descriptor
EBADFD 77 File descriptor in bad state
Since we're initializating the fd, we're just assigning a value that means
"no fd yet", so it's just a bad file descriptor, and the first errno fits
better. If instead we had a valid file descriptor that became invalid because
of some operation or state change, the other errno would fit better.
In some places, initialization is dropped if unnecessary.
Define new unit parameter (LogFilterPatterns) to filter logs processed by
journald.
This option is used to store a regular expression which is carried from
PID1 to systemd-journald through a cgroup xattrs:
`user.journald_log_filter_patterns`.
This is an octal number. We used the 0 prefix in some places inconsistently.
The kernel always interprets in base-8, so this has no effect, but I think
it's nicer to use the 0 to remind the reader that this is not a decimal number.
So, i think "erofs" is probably the better, more modern alternative to
"squashfs". Many of the benefits don't matter too much to us I guess,
but there's one thing that stands out: erofs has a UUID in the
superblock, squashfs has not. Having an UUID in the superblock matters
if the file systems are used in an overlayfs stack, as overlayfs uses
the UUIDs to robustly and persistently reference inodes on layers in
case of metadata copy-up.
Since we probably want to allow such uses in overlayfs as emplyoed by
sysext (and the future syscfg) we probably should ramp up our erofs game
early on. Hence let's natively support erofs, test it, and in fact
mention it in the docs before squashfs even.
- Mention "/please-review" in the contributing guide
- Remove "needs-rebase" on push
- Don't add "please-review" if a green label is set
- Don't add please-review label to draft PRs
- Add please-review when a PR moves out of draft
Now that the random seed is used on virtualized systems, there's no
point in having a random-seed-mode toggle switch. Let's just always
require it now, with the existing logic already being there to allow not
having it if EFI itself has an RNG. In other words, the logic for this
can now be automatic.
Removing the virtualization check might not be the worst thing in the
world, and would potentially get many, many more systems properly seeded
rather than not seeded. There are a few reasons to consider this:
- In most QEMU setups and most guides on how to setup QEMU, a separate
pflash file is used for nvram variables, and this generally isn't
copied around.
- We're now hashing in a timestamp, which should provide some level of
differentiation, given that EFI_TIME has a nanoseconds field.
- The kernel itself will additionally hash in: a high resolution time
stamp, a cycle counter, RDRAND output, the VMGENID uniquely
identifying the virtual machine, any other seeds from the hypervisor
(like from FDT or setup_data).
- During early boot, the RNG is reseeded quite frequently to account for
the importance of early differentiation.
So maybe the mitigating factors make the actual feared problem
significantly less likely and therefore the pros of having file-based
seeding might outweigh the cons of weird misconfigured setups having a
hypothetical problem on first boot.
Rather than passing seeds up to userspace via EFI variables, pass seeds
directly to the kernel's EFI stub loader, via LINUX_EFI_RANDOM_SEED_TABLE_GUID.
EFI variables can potentially leak and suffer from forward secrecy
issues, and processing these with userspace means that they are
initialized much too late in boot to be useful. In contrast,
LINUX_EFI_RANDOM_SEED_TABLE_GUID uses EFI configuration tables, and so
is hidden from userspace entirely, and is parsed extremely early on by
the kernel, so that every single call to get_random_bytes() by the
kernel is seeded.
In order to do this properly, we use a bit more robust hashing scheme,
and make sure that each input is properly memzeroed out after use. The
scheme is:
key = HASH(LABEL || sizeof(input1) || input1 || ... || sizeof(inputN) || inputN)
new_disk_seed = HASH(key || 0)
seed_for_linux = HASH(key || 1)
The various inputs are:
- LINUX_EFI_RANDOM_SEED_TABLE_GUID from prior bootloaders
- 256 bits of seed from EFI's RNG
- The (immutable) system token, from its EFI variable
- The prior on-disk seed
- The UEFI monotonic counter
- A timestamp
This also adjusts the secure boot semantics, so that the operation is
only aborted if it's not possible to get random bytes from EFI's RNG or
a prior boot stage. With the proper hashing scheme, this should make
boot seeds safe even on secure boot.
There is currently a bug in Linux's EFI stub in which if the EFI stub
manages to generate random bytes on its own using EFI's RNG, it will
ignore what the bootloader passes. That's annoying, but it means that
either way, via systemd-boot or via EFI stub's mechanism, the RNG *does*
get initialized in a good safe way. And this bug is now fixed in the
efi.git tree, and will hopefully be backported to older kernels.
As the kernel recommends, the resultant seeds are 256 bits and are
allocated using pool memory of type EfiACPIReclaimMemory, so that it
gets freed at the right moment in boot.
This is useful to force off fancy unicode glyph use (i.e. use "->"
instead of "→"), which is useful in tests where locales might be
missing, and thus control via $LC_CTYPE is not reliable.
Use this in TEST-58, to ensure the output checks we do aren't confused
by missing these glyphs being unicode or not.
This reverts commit 1f22621ba3.
As described in the reverted commit, we don't want to get rid of the check
completely. But the check requires opting-in by setting SYSTEMD_IN_INITRD=lenient,
which is cumbersome and doesn't seem to actually happen.
https://bugzilla.redhat.com/show_bug.cgi?id=2137631 is caused by systemd refusing
to treat the system as an initrd because overlayfs is used. Let's revert this
approach and do something that doesn't require opt-in instead.
I don't think it makes sense to keep support for "SYSTEMD_IN_INITRD=lenient" or
"SYSTEMD_IN_INITRD=auto". To get "auto" behaviour, just unset the option. And
"lenient" will be reimplemented as a better check. Thus the changes to the
option interface are completely reverted.
(s) is just ugly with a vibe of DOS. In most cases just using the normal plural
form is more natural and gramatically correct.
There are some log_debug() statements left, and texts in foreign licenses or
headers. Those are not touched on purpose.
Previously, we'd iterate an entry array from start to end every time
we added an entry offset to it. To speed up this operation, we cache
the last entry array in the chain and how many items it contains.
This allows the addition of an entry to the chain to be done in
constant time instead of linear time as we don't have to iterate
the entire chain anymore every time we add an entry.
To do this, we move EntryItem out of journal-def.h and turn it into
a host only struct in native endian mode so we can still use it to
ship the necessary info around.
Aside from that, the changes are pretty simple, we introduce some
extra functions to access the right field depending on the mode and
convert all the other code to use those functions instead of
accessing the raw fields.
We also drop the unused entry item hash field in compact mode. We
already stopped doing anything with this field a while ago, now we
actually drop it from the format in compact mode.
We also add an environment variable $SYSTEMD_JOURNAL_COMPACT that
can be used to disable compact mode if needed (similar to
$SYSTEMD_JOURNAL_KEYED_HASH).
This adds a new flag in preparation for incompatible journal changes
which will be gated behind this flag. The max file size of journal
files in compact mode is limited to 4 GiB.
The text made it sound like breaking ABI in libsystemd is allowed with good reasons.
In fact, we plan never to do this, so make the language stronger.
Also remind people about distro forums for reporting bugs. Those are probably a
better place than systemd-devel for new users.
Also, add some missing articles and apostrophes, fix URLs, remove repeated phrases,
etc.
If mkosi.kernel/ exists, the mkosi script will try to build a kernel
image from it. We use the architecture defconfig as a base and add
our own extra configuration on top.
We also add some extra tooling to the build image required to build
the kernel and include some documentation in HACKING.md on how to
use this new feature.
To avoid the kernel sources from being copied into the build or
final image (which we don't want because it takes a while), we put
the mkosi.kernel/ directory in .gitignore and use
"SourceFileTransfer=mount" so that the sources are still accessible
in the build image.
The linked filter gives an up-to-date list of pull requests that need review.
(Yes, there's too many.) We used to set 'needs-review' label, but that is
not available to non-members, and also every pull requests which is not labeled
'reviewed/needs-rework'/'ci-fails/needs-rework'/'needs-rebase' can and should
be reviewed.
If this is merged, I'll drop the 'needs-review' label.
In 53c26db4da the meaning of $BOOT was
redefined. I think that's quite problematic, since the concept is
implemented in code and interface of bootctl. Thus, I think we should
stick to the original definition, which is: "where to *place* boot menu
entries" (as opposed to "where to *read* boot menu entries from").
The aforementioned change was done to address two things afaiu:
1. it focussed on a $BOOT as the single place to put boot entries in,
instead of mentioning that both ESP and $BOOT are expected to be
the source
2. it mentioned the /loader/ dir (as location for boot loader resources)
itself as part of the spec, which however only really makes sense in
the ESP. /loader/entries/ otoh makes sense in either the ESP or
$BOOT.
With this rework I try to address these two issues differently:
1. I intend to make clear the $BOOT is the "primary" place to put stuff
in, and is what should be mounted to /boot/.
2. The ESP (if different from $BOOT) is listed as "secondary" source to
read from, and is what should be mounted to /efi/. NB we now make the
distinction between "where to put" (which is single partition) and
"where to read from".
3. This drops any reference of the /loader/ dir witout the /entries/
suffix. Only the full /loader/entries/ dir (and its companion file
/loader/entries.srel) are now mentioned. Thus isolated /loader/
directory hence becomes irrelevant in the spec, and the fact that
sd-boot maintains some files there (and only in the ESP) is kept out
of the spec, because it is irrelevant to other boot loaders.
4. It puts back the suggestion to mount $BOOT to /boot/ and the ESP to
/efi/ (and suggests adding a symlink or bind mount if both are the
same partition). Why? Because the dirs are semantically unrelated:
it's OK and common to have and ESP but no $BOOT, hence putting ESP
inside of a useless, non-existing "ghost" dir /boot/ makes little
sense. More importantly though, because these partitions are
typically backed by VFAT we want to maintain them as an autofs, with
a short idle delay, so that the file systems are unmounted (and thus
fully clean) at almost all times. This doesn't work if they are
nested within each other, as the establishment of the inner autofs
would pin the outer one, making the excercise useless. Now I don't
think the spec should mention autofs (since that is an implementation
detail), but it should arrange things so that this specific, very
efficient, safe and robust implementation can be implemented.
The net result should be easy from an OS perspective:
1. *Put* boot loader entries in /boot/, always.
2. *Read* boot loader entries from both /boot/ and /efi/ -- if these are distinct.
3. The only things we define in the spec are /loader/entries/*.conf and
/EFI/Linux/*.efi in these two partitions (well, and the companion
file /loader/entries.srel
4. /efi/ and /boot/ because not nested can be autofs.
5. bootctl code and interface (in particular --esp-path= and
--boot-path=) match the spec again. `bootctl -x` and `bootctl -p`
will now print the path to $BOOT and ESP again, matching the concepts
in the spec again.
From the sd-boot perspective things are equally easy:
1. Read boot enrties from ESP and XBOOTLDR.
2. Maintain boot loader config/other resources in ESP only.
And that's it.
Fixes: #24247
In all other cases we have the older variant before the newer. And since we
generate some documentation tables from the header, this order is also visible
for users. Let's restore the order. This commit does
4565246911 in a slightly different fashion.
This documents that explicit `Before=`/`After=` dependencies can be
used to selectively override implicit ordering coming from default
dependencies. That allows for more granular control compared to the
already documented `DefaultDependencies=no` option.
The alternative approach came up in a discussion around the ordering
of `boot-complete.target`, so this also adds an explicit suggestion
in that direction to the "Automatic Boot Assessment" documentation.
Ref: https://lists.freedesktop.org/archives/systemd-devel/2022-September/048330.html