Unprivileged overlayfs is supported since Linux 5.11. The only
change needed to get ExtensionDirectories to work is to avoid
hard-coding the staging directory to the system manager runtime
directory, everything else just works (TM).
Remove the list logic, and simply skip passing metadata if more than one
unit triggered an OnFailure/OnSuccess handler.
Instead of a single env var to loop over, provide each separate item
as its own variable.
Fixes https://github.com/systemd/systemd/issues/22370
The only piece missing was to somehow make /proc appear in the
new user+mount namespace. It is not possible to mount a new
/proc instance, not even with hidepid=invisible,subset=pid, in
a user namespace unless a PID namespace is created too (and also
at the same time as the other namespaces, it is not possible to
mount a new /proc in a child process that creates a PID namespace
forked from a parent that created a user+mount namespace, it has
to happen at the same time).
Use the host's /proc with a bind-mount as a fallback for this
case. User session services would already run with it, so
nothing is lost.
Remove incorrect claim that C escapes (such as \t and \n) are recognized and that control characters are disallowed. Specify the allowed characters and escapes with single quotes, with double quotes, and without quotes.
Add a new setting that follows the same principle and implementation
as ExtensionImages, but using directories as sources.
It will be used to implement support for extending portable images
with directories, since portable services can already use a directory
as root.
Let's not mention a redundant setting of "none". Let's instead only
mention "best-effort", which is the same. Also mention the default
settings properly.
(Also, while we are at it, don#t document the numeric alias, that's
totally redundant and harder to use, so no need to push people towards
it.)
When combined with a tmpfs on /run or /var/lib, allows to create
arbitrary and ephemeral symlinks for StateDirectory or RuntimeDirectory.
This is especially useful when sharing these directories between
different services, to make the same state/runtime directory 'backend'
appear as different names to each service, so that they can be added/removed
to a sharing agreement transparently, without code changes.
An example (simplified, but real) use case:
foo.service:
StateDirectory=foo
bar.service:
StateDirectory=bar
foo.service.d/shared.conf:
StateDirectory=
StateDirectory=shared:foo
bar.service.d/shared.conf:
StateDirectory=
StateDirectory=shared:bar
foo and bar use respectively /var/lib/foo and /var/lib/bar. Then
the orchestration layer decides to stop this sharing, the drop-in
can be removed. The services won't need any update and will keep
working and being able to store state, transparently.
To keep backward compatibility, new DBUS messages are added.
Currently there does not exist a way to specify a path relative to which
all binaries executed by Exec should be found. The only way is to
specify the absolute path.
This change implements the functionality to specify a path relative to which
binaries executed by Exec*= can be found.
Closes#6308
They are somewhat similar, but not easy to discover, esp. considering that
they are described in different pages.
For PrivateDevices=, split out the first paragraph that gives the high-level
overview. (The giant second paragraph could also use some heavy editing to break
it up into more digestible chunks, alas.)
There is some inconsistency, partially caused by the awkward naming
of the docs/ pages. But let's be consistent and use the "official" title.
If we ever change plural↔singular, we should use the same form everywhere.
The mount option has special meaning when SELinux is enabled. To make
NoNewPrivileges=yes not break SELinux enabled systems, let's not set the
mount flag on such systems.
This reverts commit 1753d30215.
Let's re-enable that feature now. As reported when the original commit
was merged, this causes some trouble on SELinux enabled systems. So,
in the subsequent commit, the feature will be disabled when SELinux is enabled.
But, anyway, this commit just re-enable that feature unconditionally.
This reverts commit d8e3c31bd8.
A poorly documented fact is that SELinux unfortunately uses nosuid mount flag
to specify that also a fundamental feature of SELinux, domain transitions, must
not be allowed either. While this could be mitigated case by case by changing
the SELinux policy to use `nosuid_transition`, such mitigations would probably
have to be added everywhere if systemd used automatic nosuid mount flags when
`NoNewPrivileges=yes` would be implied. This isn't very desirable from SELinux
policy point of view since also untrusted mounts in service's mount namespaces
could start triggering domain transitions.
Alternatively there could be directives to override this behavior globally or
for each service (for example, new directives `SUIDPaths=`/`NoSUIDPaths=` or
more generic mount flag applicators), but since there's little value of the
commit by itself (setting NNP already disables most setuid functionality), it's
simpler to revert the commit. Such new directives could be used to implement
the original goal.
When `NoNewPrivileges=yes`, the service shouldn't have a need for any
setuid/setgid programs, so in case there will be a new mount namespace anyway,
mount the file systems with MS_NOSUID.
This commit applies the filtering imposed by LogLevelMax on a unit's
processes to messages logged by PID1 about the unit as well.
The target use case for this feature is a service that runs on a timer
many times an hour, where the system administrator decides that writing
a generic success message to the journal every few minutes or seconds
adds no diagnostic value and isn't worth the clutter or disk I/O.
This allows "LoadCredentials=foo" to be used as shortcut for
"LoadCredentials=foo:foo", i.e. it's a very short way to inherit a
credential under its original name from the service manager into a
service.
When using hidepid=invisible on procfs, the kernel will check if the
gid of the process trying to access /proc is the same as the gid of
the process that mounted the /proc instance, or if it has the ptrace
capability:
https://github.com/torvalds/linux/blob/v5.10/fs/proc/base.c#L723https://github.com/torvalds/linux/blob/v5.10/fs/proc/root.c#L155
Given we set up the /proc instance as root for system services,
The same restriction applies to CAP_SYS_PTRACE, if a process runs with
it then hidepid=invisible has no effect.
ProtectProc effectively can only be used with User= or DynamicUser=yes,
without CAP_SYS_PTRACE.
Update the documentation to explicitly state these limitations.
Fixes#18997
Tables with only one column aren't really tables, they are lists. And if
each cell only consists of a single word, they are probably better
written in a single line. Hence, shorten the man page a bit, and list
boot loader spec partition types in a simple sentence.
Also, drop "root-secondary" from the list. When dissecting images we'll
upgrade "root-secondary" to "root" if we mount it, and do so only if
"root" doesn't exist. Hence never mention "root-secondary" as we never
will mount a partition under that id.
A "Credentials" section name in systemd.exec man page was used
both for User/Group and for actual credentials support in systemd.
Rename the first instance to "User/Group Identity"
Implement directives `NoExecPaths=` and `ExecPaths=` to control `MS_NOEXEC`
mount flag for the file system tree. This can be used to implement file system
W^X policies, and for example with allow-listing mode (NoExecPaths=/) a
compromised service would not be able to execute a shell, if that was not
explicitly allowed.
Example:
[Service]
NoExecPaths=/
ExecPaths=/usr/bin/daemon /usr/lib64 /usr/lib
Closes: #17942.