When a service is triggered by a path unit, pass the
path unit name and the path that triggered it via env vars
to the spawned processes.
Note that this is best-effort, as there might be many triggers
at the same time, but we only get woken up by one.
This imports credentials also via SMBIOS' "OEM vendor string" section,
similar to the existing import logic from fw_cfg.
Functionality-wise this is very similar to the existing fw_cfg logic,
both of which are easily settable on the qemu command line.
Pros and cons of each:
SMBIOS OEM vendor strings:
- pro: fast, because memory mapped
- pro: somewhat VMM independent, at least in theory
- pro: qemu upstream sees this as the future
- pro: no additional kernel module needed
- con: strings only, thus binary data is base64 encoded
fw_cfg:
- pro: has been supported for longer in qemu
- pro: supports binary data
- con: slow, because IO port based
- con: only qemu
- con: requires qemu_fw_cfg.ko kernel module
- con: qemu upstream sees this as legacy
This reverts PR #22587 and its follow-up commit. More specifically,
2299b1cae3 (partially),
e176f85527,
ceb46a31a0, and
51bb9076ab.
The PR was merged without final approval, and has several issues:
- OSS fuzz reported issues in the conf parser,
- It calls synchrnous netlink call, it should not be especially in PID1,
- The importance of NFTSet for CGroup and DynamicUser may be
questionable, at least, there was no justification PID1 should support
it.
- For networkd, it should be implemented with Request object,
- There is no test for the feature.
Fixes#23711.
Fixes#23717.
Fixes#23719.
Fixes#23720.
Fixes#23721.
Fixes#23759.
New directive `DynamicUserNFTSet=` provides a method for integrating
configuration of dynamic users into firewall rules with NFT sets.
Example:
```
table inet filter {
set u {
typeof meta skuid
}
chain service_output {
meta skuid != @u drop
accept
}
}
```
```
/etc/systemd/system/dunft.service
[Service]
DynamicUser=yes
DynamicUserNFTSet=inet:filter:u
ExecStart=/bin/sleep 1000
[Install]
WantedBy=multi-user.target
```
```
$ sudo nft list set inet filter u
table inet filter {
set u {
typeof meta skuid
elements = { 64864 }
}
}
$ ps -n --format user,group,pid,command -p `pgrep sleep`
USER GROUP PID COMMAND
64864 64864 55158 /bin/sleep 1000
```
I don't know why this didn't occur to me earlier, but of course, it
*has* to be this data.
(This replaces some German prose about Berlin, that i guess only very
few people will get. With the new blob I think we have a much broader
chance of delivering smiles.)
We use authenticated encryption, and that deserves mention. This in
particular relevant as the fact they are authenticated makes the
credentials useful as initrd parameterization items.
Unprivileged overlayfs is supported since Linux 5.11. The only
change needed to get ExtensionDirectories to work is to avoid
hard-coding the staging directory to the system manager runtime
directory, everything else just works (TM).
Remove the list logic, and simply skip passing metadata if more than one
unit triggered an OnFailure/OnSuccess handler.
Instead of a single env var to loop over, provide each separate item
as its own variable.
Fixes https://github.com/systemd/systemd/issues/22370
The only piece missing was to somehow make /proc appear in the
new user+mount namespace. It is not possible to mount a new
/proc instance, not even with hidepid=invisible,subset=pid, in
a user namespace unless a PID namespace is created too (and also
at the same time as the other namespaces, it is not possible to
mount a new /proc in a child process that creates a PID namespace
forked from a parent that created a user+mount namespace, it has
to happen at the same time).
Use the host's /proc with a bind-mount as a fallback for this
case. User session services would already run with it, so
nothing is lost.
Remove incorrect claim that C escapes (such as \t and \n) are recognized and that control characters are disallowed. Specify the allowed characters and escapes with single quotes, with double quotes, and without quotes.
Add a new setting that follows the same principle and implementation
as ExtensionImages, but using directories as sources.
It will be used to implement support for extending portable images
with directories, since portable services can already use a directory
as root.
Let's not mention a redundant setting of "none". Let's instead only
mention "best-effort", which is the same. Also mention the default
settings properly.
(Also, while we are at it, don#t document the numeric alias, that's
totally redundant and harder to use, so no need to push people towards
it.)
When combined with a tmpfs on /run or /var/lib, allows to create
arbitrary and ephemeral symlinks for StateDirectory or RuntimeDirectory.
This is especially useful when sharing these directories between
different services, to make the same state/runtime directory 'backend'
appear as different names to each service, so that they can be added/removed
to a sharing agreement transparently, without code changes.
An example (simplified, but real) use case:
foo.service:
StateDirectory=foo
bar.service:
StateDirectory=bar
foo.service.d/shared.conf:
StateDirectory=
StateDirectory=shared:foo
bar.service.d/shared.conf:
StateDirectory=
StateDirectory=shared:bar
foo and bar use respectively /var/lib/foo and /var/lib/bar. Then
the orchestration layer decides to stop this sharing, the drop-in
can be removed. The services won't need any update and will keep
working and being able to store state, transparently.
To keep backward compatibility, new DBUS messages are added.
Currently there does not exist a way to specify a path relative to which
all binaries executed by Exec should be found. The only way is to
specify the absolute path.
This change implements the functionality to specify a path relative to which
binaries executed by Exec*= can be found.
Closes#6308
They are somewhat similar, but not easy to discover, esp. considering that
they are described in different pages.
For PrivateDevices=, split out the first paragraph that gives the high-level
overview. (The giant second paragraph could also use some heavy editing to break
it up into more digestible chunks, alas.)
There is some inconsistency, partially caused by the awkward naming
of the docs/ pages. But let's be consistent and use the "official" title.
If we ever change plural↔singular, we should use the same form everywhere.
The mount option has special meaning when SELinux is enabled. To make
NoNewPrivileges=yes not break SELinux enabled systems, let's not set the
mount flag on such systems.
This reverts commit 1753d30215.
Let's re-enable that feature now. As reported when the original commit
was merged, this causes some trouble on SELinux enabled systems. So,
in the subsequent commit, the feature will be disabled when SELinux is enabled.
But, anyway, this commit just re-enable that feature unconditionally.
This reverts commit d8e3c31bd8.
A poorly documented fact is that SELinux unfortunately uses nosuid mount flag
to specify that also a fundamental feature of SELinux, domain transitions, must
not be allowed either. While this could be mitigated case by case by changing
the SELinux policy to use `nosuid_transition`, such mitigations would probably
have to be added everywhere if systemd used automatic nosuid mount flags when
`NoNewPrivileges=yes` would be implied. This isn't very desirable from SELinux
policy point of view since also untrusted mounts in service's mount namespaces
could start triggering domain transitions.
Alternatively there could be directives to override this behavior globally or
for each service (for example, new directives `SUIDPaths=`/`NoSUIDPaths=` or
more generic mount flag applicators), but since there's little value of the
commit by itself (setting NNP already disables most setuid functionality), it's
simpler to revert the commit. Such new directives could be used to implement
the original goal.
When `NoNewPrivileges=yes`, the service shouldn't have a need for any
setuid/setgid programs, so in case there will be a new mount namespace anyway,
mount the file systems with MS_NOSUID.
This commit applies the filtering imposed by LogLevelMax on a unit's
processes to messages logged by PID1 about the unit as well.
The target use case for this feature is a service that runs on a timer
many times an hour, where the system administrator decides that writing
a generic success message to the journal every few minutes or seconds
adds no diagnostic value and isn't worth the clutter or disk I/O.
This allows "LoadCredentials=foo" to be used as shortcut for
"LoadCredentials=foo:foo", i.e. it's a very short way to inherit a
credential under its original name from the service manager into a
service.
When using hidepid=invisible on procfs, the kernel will check if the
gid of the process trying to access /proc is the same as the gid of
the process that mounted the /proc instance, or if it has the ptrace
capability:
https://github.com/torvalds/linux/blob/v5.10/fs/proc/base.c#L723https://github.com/torvalds/linux/blob/v5.10/fs/proc/root.c#L155
Given we set up the /proc instance as root for system services,
The same restriction applies to CAP_SYS_PTRACE, if a process runs with
it then hidepid=invisible has no effect.
ProtectProc effectively can only be used with User= or DynamicUser=yes,
without CAP_SYS_PTRACE.
Update the documentation to explicitly state these limitations.
Fixes#18997
Tables with only one column aren't really tables, they are lists. And if
each cell only consists of a single word, they are probably better
written in a single line. Hence, shorten the man page a bit, and list
boot loader spec partition types in a simple sentence.
Also, drop "root-secondary" from the list. When dissecting images we'll
upgrade "root-secondary" to "root" if we mount it, and do so only if
"root" doesn't exist. Hence never mention "root-secondary" as we never
will mount a partition under that id.
A "Credentials" section name in systemd.exec man page was used
both for User/Group and for actual credentials support in systemd.
Rename the first instance to "User/Group Identity"
Implement directives `NoExecPaths=` and `ExecPaths=` to control `MS_NOEXEC`
mount flag for the file system tree. This can be used to implement file system
W^X policies, and for example with allow-listing mode (NoExecPaths=/) a
compromised service would not be able to execute a shell, if that was not
explicitly allowed.
Example:
[Service]
NoExecPaths=/
ExecPaths=/usr/bin/daemon /usr/lib64 /usr/lib
Closes: #17942.
Allow to setup new bind mounts for a service at runtime (via either
DBUS or a new 'systemctl bind' verb) with a new helper that forks into
the unit's mount namespace.
Add a new integration test to cover this.
Useful for zero-downtime addition to services that are running inside
mount namespaces, especially when using RootImage/RootDirectory.
If a service runs with a read-only root, a tmpfs is added on /run
to ensure we can create the airlock directory for incoming mounts
under /run/host/incoming.
We need a writable /run for most operations, but in case a read-only
RootImage (or similar) is used, by default there's no additional
tmpfs mount on /run. Change this behaviour and document it.
This adds the ability to specify truncate:PATH for StandardOutput= and
StandardError=, similar to the existing append:PATH. The code is mostly
copied from the related append: code. Fixes#8983.
Importing the full environment is convenient, but it doesn't work too well in
practice, because we get a metric ton of shell-specific crap that should never
end up in the global environment block:
$ systemctl --user show-environment
...
SHELL=/bin/zsh
AUTOJUMP_ERROR_PATH=/home/zbyszek/.local/share/autojump/errors.log
AUTOJUMP_SOURCED=1
CONDA_SHLVL=0
CVS_RSH=ssh
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
DESKTOP_SESSION=gnome
DISPLAY=:0
FPATH=/usr/share/Modules/init/zsh-functions:/usr/local/share/zsh/site-functions:/usr/share/zsh/site-functions:/usr/share/zsh/5.8/functions
GDMSESSION=gnome
GDM_LANG=en_US.UTF-8
GNOME_SETUP_DISPLAY=:1
GUESTFISH_INIT=$'\\e[1;34m'
GUESTFISH_OUTPUT=$'\\e[0m'
GUESTFISH_PS1=$'\\[\\e[1;32m\\]><fs>\\[\\e[0;31m\\] '
GUESTFISH_RESTORE=$'\\e[0m'
HISTCONTROL=ignoredups
HISTSIZE=1000
LOADEDMODULES=
OLDPWD=/home/zbyszek
PWD=/home/zbyszek
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
QTLIB=/usr/lib64/qt-3.3/lib
QT_IM_MODULE=ibus
SDL_VIDEO_MINIMIZE_ON_FOCUS_LOSS=0
SESSION_MANAGER=local/unix:@/tmp/.ICE-unix/2612,unix/unix:/tmp/.ICE-unix/2612
SHLVL=0
STEAM_FRAME_FORCE_CLOSE=1
TERM=xterm-256color
USERNAME=zbyszek
WISECONFIGDIR=/usr/share/wise2/
...
Plenty of shell-specific and terminal-specific stuff that have no global
significance.
Let's start warning when this is used to push people towards importing only
specific variables.
Putative NEWS entry:
* systemctl import-environment will now emit a warning when called without
any arguments (i.e. to import the full environment block of the called
program). This command will usually be invoked from a shell, which means
that it'll inherit a bunch of variables which are specific to that shell,
and usually to the tty the shell is connected to, and don't have any
meaning in the global context of the system or user service manager.
Instead, only specific variables should be imported into the manager
environment block.
Similarly, programs which update the manager environment block by directly
calling the D-Bus API of the manager, should also push specific variables,
and not the full inherited environment.
This adds a general description of "philosphy" of keeping the environemnt
block small and hints about systemd-run -P env.
The list of generated variables is split out to a subsection. Viewing
the patch with ignoring whitespace changes is recommended.
We don't ignore invalid assignments (except in import-environment to some
extent), previous description was wrong.
For https://bugzilla.redhat.com/show_bug.cgi?id=1912046#c17.
This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of
read_full_file_full() a bit: when used a sender socket name may be
specified. If specified as NULL behaviour is as before: the client
socket name is picked by the kernel. But if specified as non-NULL the
client can pick a socket name to use when connecting. This is useful to
communicate a minimal amount of metainformation from client to server,
outside of the transport payload.
Specifically, these beefs up the service credential logic to pass an
abstract AF_UNIX socket name as client socket name when connecting via
READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name
and the eventual credential name. This allows servers implementing the
trivial credential socket logic to distinguish clients: via a simple
getpeername() it can be determined which unit is requesting a
credential, and which credential specifically.
Example: with this patch in place, in a unit file "waldo.service" a
configuration line like the following:
LoadCredential=foo:/run/quux/creds.sock
will result in a connection to the AF_UNIX socket /run/quux/creds.sock,
originating from an abstract namespace AF_UNIX socket:
@$RANDOM/unit/waldo.service/foo
(The $RANDOM is replaced by some randomized string. This is included in
the socket name order to avoid namespace squatting issues: the abstract
socket namespace is open to unprivileged users after all, and care needs
to be taken not to use guessable names)
The services listening on the /run/quux/creds.sock socket may thus
easily retrieve the name of the unit the credential is requested for
plus the credential name, via a simpler getpeername(), discarding the
random preifx and the /unit/ string.
This logic uses "/" as separator between the fields, since both unit
names and credential names appear in the file system, and thus are
designed to use "/" as outer separators. Given that it's a good safe
choice to use as separators here, too avoid any conflicts.
This is a minimal patch only: the new logic is used only for the unit
file credential logic. For other places where we use
READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this
scheme too, but this should be done carefully in later patches, since
the socket names become API that way, and we should determine the right
amount of info to pass over.
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.
---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
Define explicit action "kill" for SystemCallErrorNumber=.
In addition to errno code, allow specifying "kill" as action for
SystemCallFilter=.
---
v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP
v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes,
init syscall_errno
v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit
parsing without seccomp
v4: fix build without seccomp
v3: drop log action
v2: action -> number
Mention the JSON user record stuff. Mention pam_umask explicitly.
Mention that UMask= of the per-user user@.service instance can be used
too.
Fixes: #16963