For run0 (as opposed to systemd-run in general), connecting to
the system bus (of localhost or container) as a different user
than root and then trying to elevate privilege from that
makes little sense:
https://github.com/systemd/systemd/issues/32997#issuecomment-2127992973
The @ syntax is mostly useful when connecting to the user bus,
which is not a use case for run0. Hence, let's remove the example.
The syntax will be properly refused in #32999.
- mention that /run/machine-id is used if exist.
- mention system.machine_id credential,
- credential, VM uuid, and container uuid are not read when --root=
is specified or running in a chroot environment.
Fixes https://github.com/systemd/systemd/issues/28514.
Quoting https://github.com/systemd/systemd/issues/28514#issuecomment-1831781486:
> Whenever PAM is enabled for a service, we set up the PAM session and then
> fork off a process whose only job is to eventually close the PAM session when
> the service dies. That services we run with service privileges, both to
> minimize attack surface and because we want to use PR_SET_DEATHSIG to be get
> a notification via signal whenever the main process dies. But that only works
> if we have the same credentials as that main process.
>
> Now, if pam_systemd runs inside the PAM stack (which it normally does) it's
> session close hook will ask logind to synchronously end the session via a bus
> call. Currently that call is not accessible to unprivileged clients. And
> that's the part we need to relax: allow users to end their own sessions.
The check is implemented in a way that allows the kill if the sender is in
the target session.
I found 'sudo systemctl --user -M "zbyszek@" is-system-running' to
be a convenient reproducer.
Before:
May 16 16:25:26 x1c systemd[1]: run-u24754.service: Deactivated successfully.
May 16 16:25:26 x1c dbus-broker[1489]: A security policy denied :1.24757 to send method call /org/freedesktop/login1:org.freedesktop.login1.Manager.ReleaseSession to org.freedesktop.login1.
May 16 16:25:26 x1c (sd-pam)[3036470]: pam_systemd(login:session): Failed to release session: Access denied
May 16 16:25:26 x1c systemd[1]: Stopping session-114.scope...
May 16 16:25:26 x1c systemd[1]: session-114.scope: Deactivated successfully.
May 16 16:25:26 x1c systemd[1]: Stopped session-114.scope.
May 16 16:25:26 x1c systemd[1]: session-c151.scope: Deactivated successfully.
May 16 16:25:26 x1c systemd-logind[1513]: Session c151 logged out. Waiting for processes to exit.
May 16 16:25:26 x1c systemd-logind[1513]: Removed session c151.
After:
May 16 17:02:15 x1c systemd[1]: run-u24770.service: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: Stopping session-115.scope...
May 16 17:02:15 x1c systemd[1]: session-c153.scope: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: session-115.scope: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: Stopped session-115.scope.
May 16 17:02:15 x1c systemd-logind[1513]: Session c153 logged out. Waiting for processes to exit.
May 16 17:02:15 x1c systemd-logind[1513]: Removed session c153.
Edit: this seems to also fix https://github.com/systemd/systemd/issues/8598.
It seems that with the call to ReleaseSession, we wait for the pam session
close hooks to finish. I inserted a 'sleep(10)' after the call to ReleaseSession
in pam_systemd, and things block on that, nothing is killed prematurely.
Prompted by #32895
Rather than ordering with each power operation targets,
ordering against shutdown.target which is a valid
synchronization point. This has no effect if soft-reboot
is being performed.
A fixed name is too rigid, let's give users the ability to define
custom drop-in names which at the same time also allows defining
multiple dropins per unit.
We use ~ as the separator because:
- ':' is not allowed in credential names
- '=' is used to separate credential from value in mkosi's --credential
argument.
- '-' is commonly used in filenames
- '@' already has meaning as the unit template specifier which might be
confusing when adding dropins for template units
Like much English text, the systemd documentation uses "may not" in the
sense of both "will possibly not" and "is forbidden to". In many cases
this is OK because the context makes it clear, but in others I felt it
was possible to read the "is forbidden to" sense by mistake: in
particular, I tripped over "the target file may not exist" in
systemd.unit(5) before realizing the correct interpretation.
Use "might not" or "may choose not to" in these cases to make it clear
which sense we mean.
This commit adds the new varlink interface io.systemd.Machine at
/run/systemd/machine/io.systemd.Machine with a single method Register
It supports all combinations of RegisterMachine[WithSSH,WithNetwork] all
under the same method.
Also adds three properties:
- VsockCid: the VSOCK CID of the VM
- SshAddress: the address of the VM in a format SSH can connect to
- SshPrivateKeyPath: the path to the SSH private key to use to connect
to the VM.
GetMachineSSHInfo is essentially a convenience method to query both the
SshAddress and SshPrivateKeyPath properties at once.
In the majority of cases, this is caused by
sleep_supported() returning error. Hence it's
very likely that it would fail again, so
the fallback is not really useful. Instead,
honor the --force option for these verbs.
The systemd-confext use case description was mentioning an OSConfig
project which won't say much to users. Also, it's good to call out that
systemd-confext provides a reliable way to manage configuration because
in contrast to other tools it will remove all old configuration files.
According to the documentation in systemd.resource-control(5),
resource-control options may be used in mount, scope, service,
slice, socket and swap units.
While e.g. systemd.service(5) includes that information,
documentation for some other units does not.
The most problematic example is systemd.slice(5).
Its documentation states a slice unit may only contain [Install]
and [Unit] sections, while actually it may contain also a [Slice]
section with options from systemd.resource-control(5).
units/user/app.slice is an example of a slice unit having a [Slice]
section.
TEST-26-SYSTEMCTL is racy as we call systemctl is-active immediately
after systemctl kill. Let's implement --wait for systemctl kill and
use it in TEST-26-SYSTEMCTL to avoid the race.
In mkosi CI, we want persistent journals when running interactively
and runtime journals when running in CI, so let's add a credential
that allows us to configure which one to use.
Required for integration tests to power off on PID 1 crashes. We
deprecate systemd.crash_reboot and related options by removing them
from the documentation but still parsing them.
LinkLocalAddressing accepts a boolean. This can be seen by looking at
`link_local_address_family_from_strong(cont char *s)` in
`src/network/netword-util.c#L102-108` which falls back to
`address_family_from_string`, defined two lines above (L100)
using `DEFINE_STRING_TABLE_LOOKUP_WITH_BOOLEAN`.
Also: rename Handover → Handoff. I think it makes it clearer that this
is not really about handing over any resources, but that the executor is
out off the game from that point on.
With 1df4b21abd we started to default to
enrolling into the LUKS device backing the root fs if none was specified
(and no wipe operation is used). This changes to look for /var/ instead.
On most systems /var/ is going to be on the root fs, hence this change
is with little effect.
However, on systems where / and /var/ is separate it makes more sense to
default to /var/ because that's where the persistent and variable data
is placed (i.e. where LUKS should be used) while / doesn't really have
to be variable, could as well be immutable, or ephemeral. Hence /var/
should be a safer default.
Or to say this differently: I think it makes sense to support systems
with /var/ being on / well. I also think it makes sense to support
systems with them being separate, and /var/ being variable and
persistent. But any other kind of system I find much less interesting to
support, and in that case people should just specify the device name.
Also, while we are at it, tighten the checks a bit, insist on a dm-crypt
+ LUKS superblock before continuing.
And finally, let's print a short message indicating the device we
operate on.
The log files defined using file:, append: or truncate: inherit the owner and other privileges from the effective user running systemd.
The log files are NOT created using the "User", "Group" or "UMask" defined in the service.
When starting a container with --user, the new uid will be resolved and switched to
only in the inner child, at the end of the setup, by spawning getent. But the
credentials are set up in the outer child, long before the user is resolvable,
and the directories/files are made only readable by root and read-only, which
means they cannot be changed later and made visible to the user.
When this particular combination is specified, it is obvious the caller wants
the single-process container to be able to use credentials, so make them world
readable only in that specific case.
Fixes https://github.com/systemd/systemd/issues/31794
Enable the exec_fd logic for Type=notify* services too, and change it
to send a timestamp instead of a '1' byte. Record the timestamp in a
new ExecMainHandoverTimestamp property so that users can track accurately
when control is handed over from systemd to the service payload, so
that latency and startup performance can be trivially and accurately
tracked and attributed.
When an IO event source owns relevant fd, replacing with a new fd leaks
the previously assigned fd.
===
sd_event_add_io(event, &s, fd, ...);
sd_event_source_set_io_fd_own(s, true);
sd_event_source_set_io_fd(s, new_fd); <-- The previous fd is not closed.
sd_event_source_unref(s); <-- new_fd is closed as expected.
===
Without the change, valgrind reports the leak:
==998589==
==998589== FILE DESCRIPTORS: 4 open (3 std) at exit.
==998589== Open file descriptor 4:
==998589== at 0x4F119AB: pipe2 (in /usr/lib64/libc.so.6)
==998589== by 0x408830: test_sd_event_source_set_io_fd (test-event.c:862)
==998589== by 0x403302: run_test_table (tests.h:171)
==998589== by 0x408E31: main (test-event.c:935)
==998589==
==998589==
==998589== HEAP SUMMARY:
==998589== in use at exit: 0 bytes in 0 blocks
==998589== total heap usage: 33,305 allocs, 33,305 frees, 1,283,581 bytes allocated
==998589==
==998589== All heap blocks were freed -- no leaks are possible
==998589==
==998589== For lists of detected and suppressed errors, rerun with: -s
==998589== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Follow-ups for 74c4231ce5.
Previously, the path is obtained from the fd, but it is closed in
sd_event_loop() to unpin the filesystem.
So, let's save the path when the event source is created, and make
sd_event_source_get_inotify_path() simply read it.
We look for the root fs on the device of the booted ESP, and for the
other partitions on the device of the root fs. On EFI systems this
generally boils down to the same, but there are cases where this doesn't
hold, hence document this properly.
Fixes: #31199
This commit adds support for loading, measuring and handling a ".ucode"
UKI section. This section is functionally an initrd, intended for
microcode updates. As such it will always be passed to the kernel first.
Resolve at attach/detach/inspect time, so that the image is pinned and requires
re-attaching on update, given files are extracted from it so just passing
img.v/ to RootImage= is not enough to get a portable image updated
This reworkds --recovery-pin= from a parameter that takes a boolean to
an enum supporting one of "hide", "show", "query".
If "hide" (default behaviour) we'll generate a recovery pin
automatically, but never show it, and thus just seal it and good.
If "show" we'll generate a recovery pin automatically, but display it in
the output, so the user can write it down.
If "query" we'll ask the user for a recovery pin, and not automatically
generate any.
For compatibility the old boolean behaviour is kept.
With this you can now do "systemd-pcrlock make-policy
--recovery-pin=show" to set up the first policy, write down the recovery
PIN. Later, if the PCR prediction didn't work out one day you can then
do "systemd-pcrlock make-policy --recovery-pin=query" and enter the
recovery key and write a new policy.
Running both sd-ndisc and sd-radv should be mostly a misconfiguration,
but may not. So, let's only disable sd-ndisc by default when sd-radv is
enabled, but allow when both are explicitly requested.
This adds a small, socket-activated Varlink daemon that can delegate UID
ranges for user namespaces to clients asking for it.
The primary call is AllocateUserRange() where the user passes in an
uninitialized userns fd, which is then set up.
There are other calls that allow assigning a mount fd to a userns
allocated that way, to set up permissions for a cgroup subtree, and to
allocate a veth for such a user namespace.
Since the UID assignments are supposed to be transitive, i.e. not
permanent, care is taken to ensure that users cannot create inodes owned
by these UIDs, so that persistancy cannot be acquired. This is
implemented via a BPF-LSM module that ensures that any member of a
userns allocated that way cannot create files unless the mount it
operates on is owned by the userns itself, or is explicitly
allowelisted.
BPF LSM program with contributions from Alexei Starovoitov.
This is just good style. In this particular case, if the argument is incorrect and
the function is not tested with $NOTIFY_SOCKET set, the user could not get the
proper error until running for real.
Also, remove mention of systemd. The protocol is fully generic on purpose.
F40 will be out soon, so we can update the man page already. The example should
already work.
The cloud link was dropped in fd571c9df0, so
drop the unused variable too.
Unfortunately, sd-bus-vtable.h, sd-journal.h, and sd-id128.h
have variadic macro and inline initialization of sub-object, these are
not supported in C90. So, we need to silence some errors.
I think the example should reflect the full set of lifecycle messages,
including STOPPING=1, which tells the service manager that the service
is already terminating. This is useful for reporting this information
back to the user and to suppress repeated shutdown requests.
It's not as important as the READY=1 and RELOADING=1 messages, since we
actively wait for those from the service message if the right Type= is
set. But it's still very valuable information, easy to do, and completes
the state engine.
stale HibernateLocation EFI variable
Currently, if the HibernateLocation EFI variable exists,
but we failed to resume from it, the boot carries on
without clearing the stale variable. Therefore, the subsequent
boots would still be waiting for the device timeout,
unless the variable is purged manually.
There's no point to keep trying to resume after a successful
switch-root, because the hibernation image state
would have been invalidated by then. OTOH, we don't
want to clear the variable prematurely either,
i.e. in initrd, since if the resume device is the same
as root one, the boot won't succeed and the user might
be able to try resuming again. So, let's introduce a
unit that only runs after switch-root and clears the var.
Fixes#32021
We are saying in public that the protocl is stable and can be easily
reimplemented, so provide an example doing so in the documentation,
license as MIT-0 so that it can be copied and pasted at will.
Same reason as the reload, reexec is disruptive and it requires the
same privileges, so if somebody wants to limit reloads, they'll also
want to limit reexecs, so use the same setting.
The setting is used when /sys/power/state is set to 'mem'
(common for suspend) or /sys/power/disk is set to 'suspend'
(hybrid-sleep). We default to kernel choice here, i.e.
respect what's set through 'mem_sleep_default=' kernel
cmdline option.
Today listen file descriptors created by socket unit don't get passed to
commands in Exec{Start,Stop}{Pre,Post}= socket options.
This prevents ExecXYZ= commands from accessing the created socket FDs to do
any kind of system setup which involves the socket but is not covered by
existing socket unit options.
One concrete example is to insert a socket FD into a BPF map capable of
holding socket references, such as BPF sockmap/sockhash [1] or
reuseport_sockarray [2]. Or, similarly, send the file descriptor with
SCM_RIGHTS to another process, which has access to a BPF map for storing
sockets.
To unblock this use case, pass ListenXYZ= file descriptors to ExecXYZ=
commands as listen FDs [4]. As an exception, ExecStartPre= command does not
inherit any file descriptors because it gets invoked before the listen FDs
are created.
This new behavior can potentially break existing configurations. Commands
invoked from ExecXYZ= might not expect to inherit file descriptors through
sd_listen_fds protocol.
To prevent breakage, add a new socket unit parameter,
PassFileDescriptorsToExec=, to control whether ExecXYZ= programs inherit
listen FDs.
[1] https://docs.kernel.org/bpf/map_sockmap.html
[2] https://lore.kernel.org/r/20180808075917.3009181-1-kafai@fb.com
[3] https://man.archlinux.org/man/socket.7#SO_INCOMING_CPU
[4] https://www.freedesktop.org/software/systemd/man/latest/sd_listen_fds.html
Drop connections and caches and reload config from files, to allow
for low-interruptions updates, and hook up to the usual SIGHUP and
ExecReload=. Mark servers and services configured directly via D-Bus
so that they can be kept around, and only the configuration file
settings are dropped and reloaded.
Fixes https://github.com/systemd/systemd/issues/17503
Fixes https://github.com/systemd/systemd/issues/20604