mirror of
https://github.com/systemd/systemd.git
synced 2024-11-24 10:43:35 +08:00
Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
commit
cc238590e4
@ -1639,8 +1639,14 @@ EXTRA_DIST += \
test/test-execute/exec-personality-aarch64.service \
test/test-execute/exec-privatedevices-no.service \
test/test-execute/exec-privatedevices-yes.service \
test/test-execute/exec-privatedevices-no-capability-mknod.service \
test/test-execute/exec-privatedevices-yes-capability-mknod.service \
test/test-execute/exec-privatetmp-no.service \
test/test-execute/exec-privatetmp-yes.service \
test/test-execute/exec-readonlypaths.service \
test/test-execute/exec-readonlypaths-mount-propagation.service \
test/test-execute/exec-readwritepaths-mount-propagation.service \
test/test-execute/exec-inaccessiblepaths-mount-propagation.service \
test/test-execute/exec-spec-interpolation.service \
test/test-execute/exec-systemcallerrornumber.service \
test/test-execute/exec-systemcallfilter-failing2.service \
14
NEWS
@ -137,6 +137,20 @@ CHANGES WITH 232 in spe
$SYSTEMD_NSPAWN_SHARE_NS_UTS may be used to control the unsharing of
individual namespaces.

* systemd-udevd.service is now run in a Seccomp-based sandbox that
prohibits access to AF_INET and AF_INET6 sockets and thus access to
the network. This might break code that runs from udev rules that
tries to talk to the network. Doing that is generally a bad idea and
unsafe due to a variety of reasons. It's also racy as device
management would race against network configuration. It is
recommended to rework such rules to use the SYSTEMD_WANTS property on
the relevant devices to pull in a proper systemd service (which can
be sandboxed differently and ordered correctly after the network
having come up). If that's not possible consider reverting this
sandboxing feature locally by removing the RestrictAddressFamilies=
setting from the systemd-udevd.service unit file, or adding AF_INET
and AF_INET6 to it.

CHANGES WITH 231:

* In service units the various ExecXYZ= settings have been extended
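The local opt-out mentioned in the NEWS entry above could be expressed as a drop-in rather than by editing the shipped unit. This is a minimal sketch, not part of the commit: the drop-in file name is hypothetical, and it assumes that repeated RestrictAddressFamilies= assignments are combined with the allow-list already present in systemd-udevd.service.

```ini
# /etc/systemd/system/systemd-udevd.service.d/allow-inet.conf
# (hypothetical file name, chosen for illustration)
[Service]
# Add the two IP families to the allow-list of the shipped unit
RestrictAddressFamilies=AF_INET AF_INET6
```

Reworking the udev rule to pull in a separate service via SYSTEMD_WANTS remains the approach the NEWS entry recommends; the drop-in only restores the previous behaviour locally.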
38
TODO
@ -32,6 +32,8 @@ Janitorial Clean-ups:
Features:

* switch to ProtectSystem=strict for all our long-running services where that's possible

* introduce an "invocation ID" for units, that is randomly generated, and
identifies each runtime-cycle of a unit. It should be set freshly each time
we traverse inactive → activating/active, and should be the primary key to
@ -40,8 +42,9 @@ Features:
the cgroup of a service. The former is accessible without privileges, the
latter ensures the ID cannot be faked.

* Introduce ProtectSystem=strict for making the entire OS hierarchy read-only
except for a select few
* If RootDirectory= is used, mount /proc, /sys, /dev into it, if not mounted yet

* Permit masking specific netlink APIs with RestrictAddressFamily=

* nspawn: start UID allocation loop from hash of container name

@ -55,16 +58,14 @@ Features:

* ProtectClock= (drops CAP_SYS_TIME, adds seccomp filters for settimeofday, adjtimex), sets DeviceAllow on /dev/rtc

* ProtectKernelModules= (drops CAP_SYS_MODULE and filters the kmod syscalls)

* ProtectTracing= (drops CAP_SYS_PTRACE, blocks ptrace syscall, makes /sys/kernel/tracing go away)

* ProtectMount= (drop mount/umount/pivot_root from seccomp, disallow fuse via DeviceAllow, imply MountFlags=slave)

* ProtectDevices= should also take iopl/ioperm/pciaccess away

* ProtectKeyRing= to take keyring calls away

* ProtectControlGroups= which mounts all of /sys/fs/cgroup read-only

* ProtectKernelTunables= which mounts /sys and /proc/sys read-only

* RemoveKeyRing= to remove all keyring entries of the specified user

* Add DataDirectory=, CacheDirectory= and LogDirectory= to match
@ -72,9 +73,6 @@ Features:

* Add BindDirectory= for allowing arbitrary, private bind mounts for services

* Beef up RootDirectory= to use namespacing/bind mounts as soon as fs
namespaces are enabled by the service

* Add RootImage= for mounting a disk image or file as root directory

* RestrictNamespaces= or so in services (taking away the ability to create namespaces, with setns, unshare, clone)
@ -180,7 +178,7 @@ Features:
* implement a per-service firewall based on net_cls

* Port various tools to make use of verbs.[ch], where applicable: busctl,
bootctl, coredumpctl, hostnamectl, localectl, systemd-analyze, timedatectl
coredumpctl, hostnamectl, localectl, systemd-analyze, timedatectl

* hostnamectl: show root image uuid

@ -293,9 +291,6 @@ Features:

* MessageQueueMessageSize= (and suchlike) should use parse_iec_size().

* "busctl status" works only as root on dbus1, since we cannot read
/proc/$PID/exe

* implement Distribute= in socket units to allow running multiple
service instances processing the listening socket, and open this up
for ReusePort=
@ -306,8 +301,6 @@ Features:
and passes this back to PID1 via SCM_RIGHTS. This also could be used
to allow Chown/chgrp on sockets without requiring NSS in PID 1.

* New service property: maximum CPU runtime for a service

* introduce bus call FreezeUnit(s, b), as well as "systemctl freeze
$UNIT" and "systemctl thaw $UNIT" as wrappers around this. The calls
should SIGSTOP all unit processes in a loop until all processes of
@ -344,12 +337,10 @@ Features:
error. Currently, we just ignore it and read the unit from the search
path anyway.

* refuse boot if /etc/os-release is missing or /etc/machine-id cannot be set up
* refuse boot if /usr/lib/os-release is missing or /etc/machine-id cannot be set up

* btrfs raid assembly: some .device jobs stay stuck in the queue

* make sure gdm does not use multi-user-x but the new default X configuration file, and then remove multi-user-x from systemd

* man: the documentation of Restart= currently is very misleading and suggests the tools from ExecStartPre= might get restarted.

* load .d/*.conf dropins for device units
@ -606,9 +597,6 @@ Features:
* currently x-systemd.timeout is lost in the initrd, since crypttab is copied into dracut, but fstab is not

* nspawn:
- to allow "linking" of nspawn containers, extend --network-bridge= so
that it can dynamically create bridge interfaces that are refcounted
by the containers on them. For each group of containers to link together
- nspawn -x should support ephemeral instances of gpt images
- emulate /dev/kmsg using CUSE and turn off the syslog syscall
with seccomp. That should provide us with a useful log buffer that
@ -617,8 +605,6 @@ Features:
- as soon as networkd has a bus interface, hook up --network-interface=,
--network-bridge= with networkd, to trigger netdev creation should an
interface be missing
- don't copy /etc/resolv.conf from host into container unless we are in
shared-network mode
- a nice way to boot up without machine id set, so that it is set at boot
automatically for supporting --ephemeral. Maybe hash the host machine id
together with the machine name to generate the machine id for the container
@ -684,7 +670,6 @@ Features:

* coredump:
- save coredump in Windows/Mozilla minidump format
- move PID 1 segfaults to /var/lib/systemd/coredump?

* support crash reporting operation modes (https://live.gnome.org/GnomeOS/Design/Whiteboards/ProblemReporting)

@ -751,7 +736,6 @@ Features:
- GC unreferenced jobs (such as .device jobs)
- move PAM code into its own binary
- when we automatically restart a service, ensure we restart its rdeps, too.
- for services: do not set $HOME in services unless requested
- hide PAM options in fragment parser when compile time disabled
- Support --test based on current system state
- If we show an error about a unit (such as not showing up) and it has no Description string, then show a description string generated from the reverse of unit_name_mangle().

@ -160,14 +160,18 @@
use. However, UID/GIDs are recycled after a unit is terminated. Care should be taken that any processes running
as part of a unit for which dynamic users/groups are enabled do not leave files or directories owned by these
users/groups around, as a different unit might get the same UID/GID assigned later on, and thus gain access to
these files or directories. If <varname>DynamicUser=</varname> is enabled, <varname>RemoveIPC=</varname> and
these files or directories. If <varname>DynamicUser=</varname> is enabled, <varname>RemoveIPC=</varname>,
<varname>PrivateTmp=</varname> are implied. This ensures that the lifetime of IPC objects and temporary files
created by the executed processes is bound to the runtime of the service, and hence the lifetime of the dynamic
user/group. Since <filename>/tmp</filename> and <filename>/var/tmp</filename> are usually the only
world-writable directories on a system this ensures that a unit making use of dynamic user/group allocation
cannot leave files around after unit termination. Use <varname>RuntimeDirectory=</varname> (see below) in order
to assign a writable runtime directory to a service, owned by the dynamic user/group and removed automatically
when the unit is terminated. Defaults to off.</para></listitem>
cannot leave files around after unit termination. Moreover <varname>ProtectSystem=strict</varname> and
<varname>ProtectHome=read-only</varname> are implied, thus prohibiting the service to write to arbitrary file
system locations. In order to allow the service to write to certain directories, they have to be whitelisted
using <varname>ReadWritePaths=</varname>, but care must be taken so that UID/GID recycling doesn't
create security issues involving files created by the service. Use <varname>RuntimeDirectory=</varname> (see
below) in order to assign a writable runtime directory to a service, owned by the dynamic user/group and
removed automatically when the unit is terminated. Defaults to off.</para></listitem>
</varlistentry>

<varlistentry>
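As an illustration of the DynamicUser= text in the hunk above, a unit might look like the following. This is a minimal sketch under stated assumptions: the daemon path and directory name are hypothetical, and the comments only restate what the changed documentation says is implied.

```ini
[Service]
# Hypothetical daemon binary
ExecStart=/usr/bin/example-daemon
# Per the documentation above, this implies RemoveIPC=, PrivateTmp=,
# ProtectSystem=strict and ProtectHome=read-only
DynamicUser=yes
# Writable directory under /run, owned by the dynamic user and removed
# when the unit is terminated
RuntimeDirectory=example
```

Keeping all writable state in RuntimeDirectory= avoids leaving files behind that a later unit with a recycled UID/GID could access.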
@ -817,49 +821,37 @@
<listitem><para>Controls which capabilities to include in the capability bounding set for the executed
process. See <citerefentry
project='man-pages'><refentrytitle>capabilities</refentrytitle><manvolnum>7</manvolnum></citerefentry> for
details. Takes a whitespace-separated list of capability names as read by <citerefentry
project='mankier'><refentrytitle>cap_from_name</refentrytitle><manvolnum>3</manvolnum></citerefentry>,
e.g. <constant>CAP_SYS_ADMIN</constant>, <constant>CAP_DAC_OVERRIDE</constant>,
<constant>CAP_SYS_PTRACE</constant>. Capabilities listed will be included in the bounding set, all others are
removed. If the list of capabilities is prefixed with <literal>~</literal>, all but the listed capabilities
will be included, the effect of the assignment inverted. Note that this option also affects the respective
capabilities in the effective, permitted and inheritable capability sets. If this option is not used, the
capability bounding set is not modified on process execution, hence no limits on the capabilities of the
process are enforced. This option may appear more than once, in which case the bounding sets are merged. If the
empty string is assigned to this option, the bounding set is reset to the empty capability set, and all prior
settings have no effect. If set to <literal>~</literal> (without any further argument), the bounding set is
reset to the full set of available capabilities, also undoing any previous settings. This does not affect
commands prefixed with <literal>+</literal>.</para></listitem>
details. Takes a whitespace-separated list of capability names, e.g. <constant>CAP_SYS_ADMIN</constant>,
<constant>CAP_DAC_OVERRIDE</constant>, <constant>CAP_SYS_PTRACE</constant>. Capabilities listed will be
included in the bounding set, all others are removed. If the list of capabilities is prefixed with
<literal>~</literal>, all but the listed capabilities will be included, the effect of the assignment
inverted. Note that this option also affects the respective capabilities in the effective, permitted and
inheritable capability sets. If this option is not used, the capability bounding set is not modified on process
execution, hence no limits on the capabilities of the process are enforced. This option may appear more than
once, in which case the bounding sets are merged. If the empty string is assigned to this option, the bounding
set is reset to the empty capability set, and all prior settings have no effect. If set to
<literal>~</literal> (without any further argument), the bounding set is reset to the full set of available
capabilities, also undoing any previous settings. This does not affect commands prefixed with
<literal>+</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><varname>AmbientCapabilities=</varname></term>

<listitem><para>Controls which capabilities to include in the
ambient capability set for the executed process. Takes a
whitespace-separated list of capability names as read by
<citerefentry project='mankier'><refentrytitle>cap_from_name</refentrytitle><manvolnum>3</manvolnum></citerefentry>,
e.g. <constant>CAP_SYS_ADMIN</constant>,
<constant>CAP_DAC_OVERRIDE</constant>,
<constant>CAP_SYS_PTRACE</constant>. This option may appear more than
once in which case the ambient capability sets are merged.
If the list of capabilities is prefixed with <literal>~</literal>, all
but the listed capabilities will be included, the effect of the
assignment inverted. If the empty string is
assigned to this option, the ambient capability set is reset to
the empty capability set, and all prior settings have no effect.
If set to <literal>~</literal> (without any further argument), the
ambient capability set is reset to the full set of available
capabilities, also undoing any previous settings. Note that adding
capabilities to ambient capability set adds them to the process's
inherited capability set.
</para><para>
Ambient capability sets are useful if you want to execute a process
as a non-privileged user but still want to give it some capabilities.
Note that in this case option <constant>keep-caps</constant> is
automatically added to <varname>SecureBits=</varname> to retain the
capabilities over the user change. <varname>AmbientCapabilities=</varname> does not affect
commands prefixed with <literal>+</literal>.</para></listitem>
<listitem><para>Controls which capabilities to include in the ambient capability set for the executed
process. Takes a whitespace-separated list of capability names, e.g. <constant>CAP_SYS_ADMIN</constant>,
<constant>CAP_DAC_OVERRIDE</constant>, <constant>CAP_SYS_PTRACE</constant>. This option may appear more than
once in which case the ambient capability sets are merged. If the list of capabilities is prefixed with
<literal>~</literal>, all but the listed capabilities will be included, the effect of the assignment
inverted. If the empty string is assigned to this option, the ambient capability set is reset to the empty
capability set, and all prior settings have no effect. If set to <literal>~</literal> (without any further
argument), the ambient capability set is reset to the full set of available capabilities, also undoing any
previous settings. Note that adding capabilities to ambient capability set adds them to the process's inherited
capability set. </para><para> Ambient capability sets are useful if you want to execute a process as a
non-privileged user but still want to give it some capabilities. Note that in this case option
<constant>keep-caps</constant> is automatically added to <varname>SecureBits=</varname> to retain the
capabilities over the user change. <varname>AmbientCapabilities=</varname> does not affect commands prefixed
with <literal>+</literal>.</para></listitem>
</varlistentry>

<varlistentry>
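The use case named in the AmbientCapabilities= text above (a non-privileged user that still holds some capabilities) might be configured as follows. This is only a sketch: the binary path and the account name are hypothetical, and CAP_NET_BIND_SERVICE is picked as an example capability, not taken from the commit.

```ini
[Service]
# Hypothetical daemon and account names
ExecStart=/usr/bin/example-httpd
User=example
# Limit the bounding set to the single capability the daemon needs
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Hand that capability to the unprivileged process
AmbientCapabilities=CAP_NET_BIND_SERVICE
```

Per the documentation above, neither setting affects command lines prefixed with "+".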
@ -885,48 +877,34 @@
<term><varname>ReadOnlyPaths=</varname></term>
<term><varname>InaccessiblePaths=</varname></term>

<listitem><para>Sets up a new file system namespace for
executed processes. These options may be used to limit access
a process might have to the main file system hierarchy. Each
setting takes a space-separated list of paths relative to
the host's root directory (i.e. the system running the service manager).
Note that if entries contain symlinks, they are resolved from the host's root directory as well.
Entries (files or directories) listed in
<varname>ReadWritePaths=</varname> are accessible from
within the namespace with the same access rights as from
outside. Entries listed in
<varname>ReadOnlyPaths=</varname> are accessible for
reading only, writing will be refused even if the usual file
access controls would permit this. Entries listed in
<varname>InaccessiblePaths=</varname> will be made
inaccessible for processes inside the namespace, and may not
contain any other mountpoints, including those specified by
<varname>ReadWritePaths=</varname> or
<varname>ReadOnlyPaths=</varname>.
Note that restricting access with these options does not extend
to submounts of a directory that are created later on.
Non-directory paths can be specified as well. These
options may be specified more than once, in which case all
paths listed will have limited access from within the
namespace. If the empty string is assigned to this option, the
specific list is reset, and all prior assignments have no
effect.</para>
<para>Paths in
<varname>ReadOnlyPaths=</varname>
and
<varname>InaccessiblePaths=</varname>
may be prefixed with
<literal>-</literal>, in which case
they will be ignored when they do not
exist. Note that using this
setting will disconnect propagation of
mounts from the service to the host
(propagation in the opposite direction
continues to work). This means that
this setting may not be used for
services which shall be able to
install mount points in the main mount
namespace.</para></listitem>
<listitem><para>Sets up a new file system namespace for executed processes. These options may be used to limit
access a process might have to the file system hierarchy. Each setting takes a space-separated list of paths
relative to the host's root directory (i.e. the system running the service manager). Note that if paths
contain symlinks, they are resolved relative to the root directory set with
<varname>RootDirectory=</varname>.</para>

<para>Paths listed in <varname>ReadWritePaths=</varname> are accessible from within the namespace with the same
access modes as from outside of it. Paths listed in <varname>ReadOnlyPaths=</varname> are accessible for
reading only, writing will be refused even if the usual file access controls would permit this. Nest
<varname>ReadWritePaths=</varname> inside of <varname>ReadOnlyPaths=</varname> in order to provide writable
subdirectories within read-only directories. Use <varname>ReadWritePaths=</varname> in order to whitelist
specific paths for write access if <varname>ProtectSystem=strict</varname> is used. Paths listed in
<varname>InaccessiblePaths=</varname> will be made inaccessible for processes inside the namespace (along with
everything below them in the file system hierarchy).</para>

<para>Note that restricting access with these options does not extend to submounts of a directory that are
created later on. Non-directory paths may be specified as well. These options may be specified more than once,
in which case all paths listed will have limited access from within the namespace. If the empty string is
assigned to this option, the specific list is reset, and all prior assignments have no effect.</para>

<para>Paths in <varname>ReadWritePaths=</varname>, <varname>ReadOnlyPaths=</varname> and
<varname>InaccessiblePaths=</varname> may be prefixed with <literal>-</literal>, in which case they will be ignored
when they do not exist. Note that using this setting will disconnect propagation of mounts from the service to
the host (propagation in the opposite direction continues to work). This means that this setting may not be used
for services which shall be able to install mount points in the main mount namespace. Note that the effect of
these settings may be undone by privileged processes. In order to set up an effective sandboxed environment for
a unit it is thus recommended to combine these settings with either
<varname>CapabilityBoundingSet=~CAP_SYS_ADMIN</varname> or <varname>SystemCallFilter=~@mount</varname>.</para></listitem>
</varlistentry>

<varlistentry>
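The nesting and the "-" prefix described in the hunk above can be sketched in a unit file as follows. All paths and the daemon name are hypothetical and chosen only for illustration; the last line applies the recommendation from the new text to keep privileged processes from undoing the mounts.

```ini
[Service]
# Hypothetical daemon binary
ExecStart=/usr/bin/example-daemon
# Make /var read-only for the service ...
ReadOnlyPaths=/var
# ... but keep one nested subdirectory writable
ReadWritePaths=/var/lib/example
# The leading "-" means the path is ignored if it does not exist
InaccessiblePaths=-/srv/private
# Drop CAP_SYS_ADMIN so the service cannot remount these paths
CapabilityBoundingSet=~CAP_SYS_ADMIN
```

According to the text above, SystemCallFilter=~@mount is the alternative to the last line.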
@ -941,37 +919,33 @@
private <filename>/tmp</filename> and <filename>/var/tmp</filename> namespace by using the
<varname>JoinsNamespaceOf=</varname> directive, see
<citerefentry><refentrytitle>systemd.unit</refentrytitle><manvolnum>5</manvolnum></citerefentry> for
details. Note that using this setting will disconnect propagation of mounts from the service to the host
(propagation in the opposite direction continues to work). This means that this setting may not be used for
services which shall be able to install mount points in the main mount namespace. This setting is implied if
<varname>DynamicUser=</varname> is set.</para></listitem>
details. This setting is implied if <varname>DynamicUser=</varname> is set. For this setting the same
restrictions regarding mount propagation and privileges apply as for <varname>ReadOnlyPaths=</varname> and
related calls, see above.</para></listitem>

</varlistentry>

<varlistentry>
<term><varname>PrivateDevices=</varname></term>

<listitem><para>Takes a boolean argument. If true, sets up a
new /dev namespace for the executed processes and only adds
API pseudo devices such as <filename>/dev/null</filename>,
<filename>/dev/zero</filename> or
<filename>/dev/random</filename> (as well as the pseudo TTY
subsystem) to it, but no physical devices such as
<filename>/dev/sda</filename>. This is useful to securely turn
off physical device access by the executed process. Defaults
to false. Enabling this option will also remove
<constant>CAP_MKNOD</constant> from the capability bounding
set for the unit (see above), and set
<listitem><para>Takes a boolean argument. If true, sets up a new /dev namespace for the executed processes and
only adds API pseudo devices such as <filename>/dev/null</filename>, <filename>/dev/zero</filename> or
<filename>/dev/random</filename> (as well as the pseudo TTY subsystem) to it, but no physical devices such as
<filename>/dev/sda</filename>, system memory <filename>/dev/mem</filename>, system ports
<filename>/dev/port</filename> and others. This is useful to securely turn off physical device access by the
executed process. Defaults to false. Enabling this option will install a system call filter to block low-level
I/O system calls that are grouped in the <varname>@raw-io</varname> set, will also remove
<constant>CAP_MKNOD</constant> from the capability bounding set for the unit (see above), and set
<varname>DevicePolicy=closed</varname> (see
<citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>
for details). Note that using this setting will disconnect
propagation of mounts from the service to the host
(propagation in the opposite direction continues to work).
This means that this setting may not be used for services
which shall be able to install mount points in the main mount
namespace. The /dev namespace will be mounted read-only and 'noexec'.
The latter may break old programs which try to set up executable
memory by using <citerefentry><refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum></citerefentry>
of <filename>/dev/zero</filename> instead of using <constant>MAP_ANON</constant>.</para></listitem>
for details). Note that using this setting will disconnect propagation of mounts from the service to the host
(propagation in the opposite direction continues to work). This means that this setting may not be used for
services which shall be able to install mount points in the main mount namespace. The /dev namespace will be
mounted read-only and 'noexec'. The latter may break old programs which try to set up executable memory by
using <citerefentry><refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum></citerefentry> of
<filename>/dev/zero</filename> instead of using <constant>MAP_ANON</constant>. This setting is implied if
<varname>DynamicUser=</varname> is set. For this setting the same restrictions regarding mount propagation and
privileges apply as for <varname>ReadOnlyPaths=</varname> and related calls, see above.</para></listitem>
</varlistentry>

<varlistentry>
@ -1020,74 +994,80 @@
|
||||
<varlistentry>
|
||||
<term><varname>ProtectSystem=</varname></term>
|
||||
|
||||
<listitem><para>Takes a boolean argument or
|
||||
<literal>full</literal>. If true, mounts the
|
||||
<filename>/usr</filename> and <filename>/boot</filename>
|
||||
directories read-only for processes invoked by this unit. If
|
||||
set to <literal>full</literal>, the <filename>/etc</filename>
|
||||
directory is mounted read-only, too. This setting ensures that
|
||||
any modification of the vendor-supplied operating system (and
|
||||
optionally its configuration) is prohibited for the service.
|
||||
It is recommended to enable this setting for all long-running
|
||||
services, unless they are involved with system updates or need
|
||||
to modify the operating system in other ways. Note however
|
||||
that processes retaining the CAP_SYS_ADMIN capability can undo
|
||||
the effect of this setting. This setting is hence particularly
|
||||
useful for daemons which have this capability removed, for
|
||||
example with <varname>CapabilityBoundingSet=</varname>.
|
||||
Defaults to off.</para></listitem>
|
||||
<listitem><para>Takes a boolean argument or the special values <literal>full</literal> or
|
||||
<literal>strict</literal>. If true, mounts the <filename>/usr</filename> and <filename>/boot</filename>
|
||||
directories read-only for processes invoked by this unit. If set to <literal>full</literal>, the
|
||||
<filename>/etc</filename> directory is mounted read-only, too. If set to <literal>strict</literal> the entire
|
||||
file system hierarchy is mounted read-only, except for the API file system subtrees <filename>/dev</filename>,
|
||||
<filename>/proc</filename> and <filename>/sys</filename> (protect these directories using
|
||||
<varname>PrivateDevices=</varname>, <varname>ProtectKernelTunables=</varname>,
|
||||
<varname>ProtectControlGroups=</varname>). This setting ensures that any modification of the vendor-supplied
|
||||
operating system (and optionally its configuration, and local mounts) is prohibited for the service. It is
|
||||
recommended to enable this setting for all long-running services, unless they are involved with system updates
|
||||
or need to modify the operating system in other ways. If this option is used,
|
||||
<varname>ReadWritePaths=</varname> may be used to exclude specific directories from being made read-only. This
|
||||
setting is implied if <varname>DynamicUser=</varname> is set. For this setting the same restrictions regarding
|
||||
mount propagation and privileges apply as for <varname>ReadOnlyPaths=</varname> and related calls, see
|
||||
above. Defaults to off.</para></listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><varname>ProtectHome=</varname></term>
|
||||
|
||||
<listitem><para>Takes a boolean argument or
|
||||
<literal>read-only</literal>. If true, the directories
|
||||
<filename>/home</filename>, <filename>/root</filename> and
|
||||
<filename>/run/user</filename>
|
||||
are made inaccessible and empty for processes invoked by this
|
||||
unit. If set to <literal>read-only</literal>, the three
|
||||
directories are made read-only instead. It is recommended to
|
||||
enable this setting for all long-running services (in
|
||||
particular network-facing ones), to ensure they cannot get
|
||||
access to private user data, unless the services actually
|
||||
require access to the user's private data. Note however that
|
||||
processes retaining the CAP_SYS_ADMIN capability can undo the
|
||||
effect of this setting. This setting is hence particularly
|
||||
useful for daemons which have this capability removed, for
|
||||
example with <varname>CapabilityBoundingSet=</varname>.
|
||||
Defaults to off.</para></listitem>
|
||||
<listitem><para>Takes a boolean argument or <literal>read-only</literal>. If true, the directories
|
||||
<filename>/home</filename>, <filename>/root</filename> and <filename>/run/user</filename> are made inaccessible
|
||||
and empty for processes invoked by this unit. If set to <literal>read-only</literal>, the three directories are
|
||||
made read-only instead. It is recommended to enable this setting for all long-running services (in particular
|
||||
network-facing ones), to ensure they cannot get access to private user data, unless the services actually
|
||||
require access to the user's private data. This setting is implied if <varname>DynamicUser=</varname> is
|
||||
set. For this setting the same restrictions regarding mount propagation and privileges apply as for
|
||||
<varname>ReadOnlyPaths=</varname> and related calls, see above.</para></listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><varname>ProtectKernelTunables=</varname></term>
|
||||
|
||||
<listitem><para>Takes a boolean argument. If true, kernel variables accessible through
|
||||
<filename>/proc/sys</filename>, <filename>/sys</filename>, <filename>/proc/sysrq-trigger</filename>,
|
||||
<filename>/proc/latency_stats</filename>, <filename>/proc/acpi</filename>,
|
||||
<filename>/proc/timer_stats</filename>, <filename>/proc/fs</filename> and <filename>/proc/irq</filename> will
|
||||
be made read-only to all processes of the unit. Usually, tunable kernel variables should only be written at
|
||||
boot-time, with the <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry>
|
||||
mechanism. Almost no services need to write to these at runtime; it is hence recommended to turn this on for
|
||||
most services. For this setting the same restrictions regarding mount propagation and privileges apply as for
|
||||
<varname>ReadOnlyPaths=</varname> and related settings, see above. Defaults to off.</para></listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><varname>ProtectControlGroups=</varname></term>
|
||||
|
||||
<listitem><para>Takes a boolean argument. If true, the Linux Control Groups (<citerefentry
|
||||
project='man-pages'><refentrytitle>cgroups</refentrytitle><manvolnum>7</manvolnum></citerefentry>) hierarchies
|
||||
accessible through <filename>/sys/fs/cgroup</filename> will be made read-only to all processes of the
|
||||
unit. Except for container managers, no services should require write access to the control group hierarchies;
|
||||
it is hence recommended to turn this on for most services. For this setting the same restrictions regarding
|
||||
mount propagation and privileges apply as for <varname>ReadOnlyPaths=</varname> and related settings, see
|
||||
above. Defaults to off.</para></listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><varname>MountFlags=</varname></term>
|
||||
|
||||
<listitem><para>Takes a mount propagation flag:
|
||||
<option>shared</option>, <option>slave</option> or
|
||||
<option>private</option>, which control whether mounts in the
|
||||
file system namespace set up for this unit's processes will
|
||||
receive or propagate mounts or unmounts. See
|
||||
<citerefentry project='man-pages'><refentrytitle>mount</refentrytitle><manvolnum>2</manvolnum></citerefentry>
|
||||
for details. Defaults to <option>shared</option>. Use
|
||||
<option>shared</option> to ensure that mounts and unmounts are
|
||||
propagated from the host to the container and vice versa. Use
|
||||
<option>slave</option> to run processes so that none of their
|
||||
mounts and unmounts will propagate to the host. Use
|
||||
<option>private</option> to also ensure that no mounts and
|
||||
unmounts from the host will propagate into the unit processes'
|
||||
namespace. Note that <option>slave</option> means that file
|
||||
systems mounted on the host might stay mounted continuously in
|
||||
the unit's namespace, and thus keep the device busy. Note that
|
||||
the file system namespace related options
|
||||
(<varname>PrivateTmp=</varname>,
|
||||
<varname>PrivateDevices=</varname>,
|
||||
<varname>ProtectSystem=</varname>,
|
||||
<varname>ProtectHome=</varname>,
|
||||
<varname>ReadOnlyPaths=</varname>,
|
||||
<varname>InaccessiblePaths=</varname> and
|
||||
<varname>ReadWritePaths=</varname>) require that mount
|
||||
and unmount propagation from the unit's file system namespace
|
||||
is disabled, and hence downgrade <option>shared</option> to
|
||||
<listitem><para>Takes a mount propagation flag: <option>shared</option>, <option>slave</option> or
|
||||
<option>private</option>, which control whether mounts in the file system namespace set up for this unit's
|
||||
processes will receive or propagate mounts or unmounts. See <citerefentry
|
||||
project='man-pages'><refentrytitle>mount</refentrytitle><manvolnum>2</manvolnum></citerefentry> for
|
||||
details. Defaults to <option>shared</option>. Use <option>shared</option> to ensure that mounts and unmounts
|
||||
are propagated from the host to the container and vice versa. Use <option>slave</option> to run processes so
|
||||
that none of their mounts and unmounts will propagate to the host. Use <option>private</option> to also ensure
|
||||
that no mounts and unmounts from the host will propagate into the unit processes' namespace. Note that
|
||||
<option>slave</option> means that file systems mounted on the host might stay mounted continuously in the
|
||||
unit's namespace, and thus keep the device busy. Note that the file system namespace related options
|
||||
(<varname>PrivateTmp=</varname>, <varname>PrivateDevices=</varname>, <varname>ProtectSystem=</varname>,
|
||||
<varname>ProtectHome=</varname>, <varname>ProtectKernelTunables=</varname>,
|
||||
<varname>ProtectControlGroups=</varname>, <varname>ReadOnlyPaths=</varname>,
|
||||
<varname>InaccessiblePaths=</varname>, <varname>ReadWritePaths=</varname>) require that mount and unmount
|
||||
propagation from the unit's file system namespace is disabled, and hence downgrade <option>shared</option> to
|
||||
<option>slave</option>. </para></listitem>
|
||||
</varlistentry>
|
||||
|
||||
@ -1322,7 +1302,15 @@
|
||||
</table>
|
||||
|
||||
Note that as new system calls are added to the kernel, additional system calls might be added to the groups
|
||||
above, so the contents of the sets may change between systemd versions.</para></listitem>
|
||||
above, so the contents of the sets may change between systemd versions.</para>
|
||||
|
||||
<para>It is recommended to combine the file system namespacing related options with
|
||||
<varname>SystemCallFilter=~@mount</varname>, in order to prohibit the unit's processes from undoing the
|
||||
mappings. Specifically these are the options <varname>PrivateTmp=</varname>,
|
||||
<varname>PrivateDevices=</varname>, <varname>ProtectSystem=</varname>, <varname>ProtectHome=</varname>,
|
||||
<varname>ProtectKernelTunables=</varname>, <varname>ProtectControlGroups=</varname>,
|
||||
<varname>ReadOnlyPaths=</varname>, <varname>InaccessiblePaths=</varname> and
|
||||
<varname>ReadWritePaths=</varname>.</para></listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
@ -1629,8 +1617,8 @@
|
||||
<varname>ExecStop=</varname> and <varname>ExecStopPost=</varname> processes, and encodes the service
|
||||
"result". Currently, the following values are defined: <literal>timeout</literal> (in case of an operation
|
||||
timeout), <literal>exit-code</literal> (if a service process exited with a non-zero exit code; see
|
||||
<varname>$EXIT_STATUS</varname> below for the actual exit status returned), <literal>signal</literal> (if a
|
||||
service process was terminated abnormally by a signal; see <varname>$EXIT_STATUS</varname> below for the actual
|
||||
<varname>$EXIT_CODE</varname> below for the actual exit code returned), <literal>signal</literal> (if a
|
||||
service process was terminated abnormally by a signal; see <varname>$EXIT_CODE</varname> below for the actual
|
||||
signal used for the termination), <literal>core-dump</literal> (if a service process terminated abnormally and
|
||||
dumped core), <literal>watchdog</literal> (if the watchdog keep-alive ping was enabled for the service but it
|
||||
missed the deadline), or <literal>resources</literal> (a catch-all condition in case a system operation
|
||||
@ -1675,32 +1663,32 @@
|
||||
<row>
|
||||
<entry morerows="1" valign="top"><literal>timeout</literal></entry>
|
||||
<entry valign="top"><literal>killed</literal></entry>
|
||||
<entry><literal>TERM</literal><sbr/><literal>KILL</literal></entry>
|
||||
<entry><literal>TERM</literal>, <literal>KILL</literal></entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
<entry valign="top"><literal>exited</literal></entry>
|
||||
<entry><literal>0</literal><sbr/><literal>1</literal><sbr/><literal>2</literal><sbr/><literal
|
||||
>3</literal><sbr/>…<sbr/><literal>255</literal></entry>
|
||||
<entry><literal>0</literal>, <literal>1</literal>, <literal>2</literal>, <literal
|
||||
>3</literal>, …, <literal>255</literal></entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
<entry valign="top"><literal>exit-code</literal></entry>
|
||||
<entry valign="top"><literal>exited</literal></entry>
|
||||
<entry><literal>0</literal><sbr/><literal>1</literal><sbr/><literal>2</literal><sbr/><literal
|
||||
>3</literal><sbr/>…<sbr/><literal>255</literal></entry>
|
||||
<entry><literal>0</literal>, <literal>1</literal>, <literal>2</literal>, <literal
|
||||
>3</literal>, …, <literal>255</literal></entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
<entry valign="top"><literal>signal</literal></entry>
|
||||
<entry valign="top"><literal>killed</literal></entry>
|
||||
<entry><literal>HUP</literal><sbr/><literal>INT</literal><sbr/><literal>KILL</literal><sbr/>…</entry>
|
||||
<entry><literal>HUP</literal>, <literal>INT</literal>, <literal>KILL</literal>, …</entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
<entry valign="top"><literal>core-dump</literal></entry>
|
||||
<entry valign="top"><literal>dumped</literal></entry>
|
||||
<entry><literal>ABRT</literal><sbr/><literal>SEGV</literal><sbr/><literal>QUIT</literal><sbr/>…</entry>
|
||||
<entry><literal>ABRT</literal>, <literal>SEGV</literal>, <literal>QUIT</literal>, …</entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
@ -1710,12 +1698,12 @@
|
||||
</row>
|
||||
<row>
|
||||
<entry><literal>killed</literal></entry>
|
||||
<entry><literal>TERM</literal><sbr/><literal>KILL</literal></entry>
|
||||
<entry><literal>TERM</literal>, <literal>KILL</literal></entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><literal>exited</literal></entry>
|
||||
<entry><literal>0</literal><sbr/><literal>1</literal><sbr/><literal>2</literal><sbr/><literal
|
||||
>3</literal><sbr/>…<sbr/><literal>255</literal></entry>
|
||||
<entry><literal>0</literal>, <literal>1</literal>, <literal>2</literal>, <literal
|
||||
>3</literal>, …, <literal>255</literal></entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
|
@ -597,3 +597,190 @@ int inotify_add_watch_fd(int fd, int what, uint32_t mask) {
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
int chase_symlinks(const char *path, const char *_root, char **ret) {
|
||||
_cleanup_free_ char *buffer = NULL, *done = NULL, *root = NULL;
|
||||
_cleanup_close_ int fd = -1;
|
||||
unsigned max_follow = 32; /* how many symlinks to follow before giving up and returning ELOOP */
|
||||
char *todo;
|
||||
int r;
|
||||
|
||||
assert(path);
|
||||
|
||||
/* This is a lot like canonicalize_file_name(), but takes an additional "root" parameter, that allows following
|
||||
* symlinks relative to a root directory, instead of the root of the host.
|
||||
*
|
||||
* Note that "root" matters only if we encounter an absolute symlink, it's unused otherwise. Most importantly
|
||||
* this means the path parameter passed in is not prefixed by it.
|
||||
*
|
||||
* Algorithmically this operates on two path buffers: "done" are the components of the path we already
|
||||
* processed and resolved symlinks, "." and ".." of. "todo" are the components of the path we still need to
|
||||
* process. On each iteration, we move one component from "todo" to "done", processing its special meaning
|
||||
* each time. The "todo" path always starts with at least one slash, the "done" path always ends in no
|
||||
* slash. We always keep an O_PATH fd to the component we are currently processing, thus keeping lookup races
|
||||
* at a minimum. */
|
||||
|
||||
r = path_make_absolute_cwd(path, &buffer);
|
||||
if (r < 0)
|
||||
return r;
|
||||
|
||||
if (_root) {
|
||||
r = path_make_absolute_cwd(_root, &root);
|
||||
if (r < 0)
|
||||
return r;
|
||||
}
|
||||
|
||||
fd = open("/", O_CLOEXEC|O_NOFOLLOW|O_PATH);
|
||||
if (fd < 0)
|
||||
return -errno;
|
||||
|
||||
todo = buffer;
|
||||
for (;;) {
|
||||
_cleanup_free_ char *first = NULL;
|
||||
_cleanup_close_ int child = -1;
|
||||
struct stat st;
|
||||
size_t n, m;
|
||||
|
||||
/* Determine length of first component in the path */
|
||||
n = strspn(todo, "/"); /* The slashes */
|
||||
m = n + strcspn(todo + n, "/"); /* The entire length of the component */
|
||||
|
||||
/* Extract the first component. */
|
||||
first = strndup(todo, m);
|
||||
if (!first)
|
||||
return -ENOMEM;
|
||||
|
||||
todo += m;
|
||||
|
||||
/* Just a single slash? Then we reached the end. */
|
||||
if (isempty(first) || path_equal(first, "/"))
|
||||
break;
|
||||
|
||||
/* Just a dot? Then let's eat this up. */
|
||||
if (path_equal(first, "/."))
|
||||
continue;
|
||||
|
||||
/* Two dots? Then chop off the last bit of what we already found out. */
|
||||
if (path_equal(first, "/..")) {
|
||||
_cleanup_free_ char *parent = NULL;
|
||||
int fd_parent = -1;
|
||||
|
||||
if (isempty(done) || path_equal(done, "/"))
|
||||
return -EINVAL;
|
||||
|
||||
parent = dirname_malloc(done);
|
||||
if (!parent)
|
||||
return -ENOMEM;
|
||||
|
||||
/* Don't allow this to leave the root dir */
|
||||
if (root &&
|
||||
path_startswith(done, root) &&
|
||||
!path_startswith(parent, root))
|
||||
return -EINVAL;
|
||||
|
||||
free(done);
|
||||
done = parent;
|
||||
parent = NULL;
|
||||
|
||||
fd_parent = openat(fd, "..", O_CLOEXEC|O_NOFOLLOW|O_PATH);
|
||||
if (fd_parent < 0)
|
||||
return -errno;
|
||||
|
||||
safe_close(fd);
|
||||
fd = fd_parent;
|
||||
|
||||
continue;
|
||||
}
|
||||
|
||||
/* Otherwise let's see what this is. */
|
||||
child = openat(fd, first + n, O_CLOEXEC|O_NOFOLLOW|O_PATH);
|
||||
if (child < 0)
|
||||
return -errno;
|
||||
|
||||
if (fstat(child, &st) < 0)
|
||||
return -errno;
|
||||
|
||||
if (S_ISLNK(st.st_mode)) {
|
||||
_cleanup_free_ char *destination = NULL;
|
||||
|
||||
/* This is a symlink, in this case read the destination. But let's make sure we don't follow
|
||||
* symlinks without bounds. */
|
||||
if (--max_follow <= 0)
|
||||
return -ELOOP;
|
||||
|
||||
r = readlinkat_malloc(fd, first + n, &destination);
|
||||
if (r < 0)
|
||||
return r;
|
||||
if (isempty(destination))
|
||||
return -EINVAL;
|
||||
|
||||
if (path_is_absolute(destination)) {
|
||||
|
||||
/* An absolute destination. Start the loop from the beginning, but use the root
|
||||
* directory as base. */
|
||||
|
||||
safe_close(fd);
|
||||
fd = open(root ?: "/", O_CLOEXEC|O_NOFOLLOW|O_PATH);
|
||||
if (fd < 0)
|
||||
return -errno;
|
||||
|
||||
free(buffer);
|
||||
buffer = destination;
|
||||
destination = NULL;
|
||||
|
||||
todo = buffer;
|
||||
free(done);
|
||||
|
||||
/* Note that we do not revalidate the root, we take it as is. */
|
||||
if (isempty(root))
|
||||
done = NULL;
|
||||
else {
|
||||
done = strdup(root);
|
||||
if (!done)
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
} else {
|
||||
char *joined;
|
||||
|
||||
/* A relative destination. If so, this is what we'll prefix what's left to do with what
|
||||
* we just read, and start the loop again, but remain in the current directory. */
|
||||
|
||||
joined = strjoin("/", destination, todo, NULL);
|
||||
if (!joined)
|
||||
return -ENOMEM;
|
||||
|
||||
free(buffer);
|
||||
todo = buffer = joined;
|
||||
}
|
||||
|
||||
continue;
|
||||
}
|
||||
|
||||
/* If this is not a symlink, then let's just add the name we read to what we already verified. */
|
||||
if (!done) {
|
||||
done = first;
|
||||
first = NULL;
|
||||
} else {
|
||||
if (!strextend(&done, first, NULL))
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
/* And iterate again, but go one directory further down. */
|
||||
safe_close(fd);
|
||||
fd = child;
|
||||
child = -1;
|
||||
}
|
||||
|
||||
if (!done) {
|
||||
/* Special case, turn the empty string into "/", to indicate the root directory. */
|
||||
done = strdup("/");
|
||||
if (!done)
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
*ret = done;
|
||||
done = NULL;
|
||||
|
||||
return 0;
|
||||
}
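The component extraction at the top of the loop in chase_symlinks() can be tried in isolation. The following is a minimal sketch, not code from this commit: split_first_component() is a hypothetical helper that uses the same strspn()/strcspn() arithmetic to peel one component (including its leading slashes) off the front of a path.

```c
#define _POSIX_C_SOURCE 200809L /* for strndup() */

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of systemd: returns a newly allocated copy of
 * the first component of "path", leading slashes included, and stores a
 * pointer to the unprocessed remainder in *rest. Returns NULL if the
 * allocation fails. */
static char *split_first_component(const char *path, const char **rest) {
        size_t n, m;
        char *first;

        assert(path);
        assert(rest);

        n = strspn(path, "/");          /* number of leading slashes */
        m = n + strcspn(path + n, "/"); /* slashes plus the component name */

        first = strndup(path, m);
        if (!first)
                return NULL;

        *rest = path + m;
        return first;
}
```

Called repeatedly on the remainder, this walks a path one component at a time. A result consisting only of slashes, or an empty one, marks the end of the path, which corresponds to the break condition in the loop above.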
|
||||
|
@ -77,3 +77,5 @@ union inotify_event_buffer {
|
||||
};
|
||||
|
||||
int inotify_add_watch_fd(int fd, int what, uint32_t mask);
|
||||
|
||||
int chase_symlinks(const char *path, const char *_root, char **ret);
|
||||
|
@ -36,6 +36,7 @@
|
||||
#include "set.h"
|
||||
#include "stdio-util.h"
|
||||
#include "string-util.h"
|
||||
#include "strv.h"
|
||||
|
||||
static int fd_fdinfo_mnt_id(int fd, const char *filename, int flags, int *mnt_id) {
|
||||
char path[strlen("/proc/self/fdinfo/") + DECIMAL_STR_MAX(int)];
|
||||
@ -287,10 +288,12 @@ int umount_recursive(const char *prefix, int flags) {
|
||||
continue;
|
||||
|
||||
if (umount2(p, flags) < 0) {
|
||||
r = -errno;
|
||||
r = log_debug_errno(errno, "Failed to umount %s: %m", p);
|
||||
continue;
|
||||
}
|
||||
|
||||
log_debug("Successfully unmounted %s", p);
|
||||
|
||||
again = true;
|
||||
n++;
|
||||
|
||||
@ -311,24 +314,21 @@ static int get_mount_flags(const char *path, unsigned long *flags) {
|
||||
return 0;
|
||||
}
|
||||
|
||||
int bind_remount_recursive(const char *prefix, bool ro) {
|
||||
int bind_remount_recursive(const char *prefix, bool ro, char **blacklist) {
|
||||
_cleanup_set_free_free_ Set *done = NULL;
|
||||
_cleanup_free_ char *cleaned = NULL;
|
||||
int r;
|
||||
|
||||
/* Recursively remount a directory (and all its submounts)
|
||||
* read-only or read-write. If the directory is already
|
||||
* mounted, we reuse the mount and simply mark it
|
||||
* MS_BIND|MS_RDONLY (or remove the MS_RDONLY for read-write
|
||||
* operation). If it isn't we first make it one. Afterwards we
|
||||
* apply MS_BIND|MS_RDONLY (or remove MS_RDONLY) to all
|
||||
* submounts we can access, too. When mounts are stacked on
|
||||
* the same mount point we only care for each individual
|
||||
* "top-level" mount on each point, as we cannot
|
||||
* influence/access the underlying mounts anyway. We do not
|
||||
* have any effect on future submounts that might get
|
||||
* propagated, they might be writable. This includes future
|
||||
* submounts that have been triggered via autofs. */
|
||||
/* Recursively remount a directory (and all its submounts) read-only or read-write. If the directory is already
|
||||
* mounted, we reuse the mount and simply mark it MS_BIND|MS_RDONLY (or remove the MS_RDONLY for read-write
|
||||
* operation). If it isn't we first make it one. Afterwards we apply MS_BIND|MS_RDONLY (or remove MS_RDONLY) to
|
||||
* all submounts we can access, too. When mounts are stacked on the same mount point we only care for each
|
||||
* individual "top-level" mount on each point, as we cannot influence/access the underlying mounts anyway. We
|
||||
* do not have any effect on future submounts that might get propagated, they might be writable. This includes
|
||||
* future submounts that have been triggered via autofs.
|
||||
*
|
||||
* If the "blacklist" parameter is specified it may contain a list of subtrees to exclude from the
|
||||
* remount operation. Note that we'll ignore the blacklist for the top-level path. */
|
||||
|
||||
cleaned = strdup(prefix);
|
||||
if (!cleaned)
|
||||
@ -385,6 +385,33 @@ int bind_remount_recursive(const char *prefix, bool ro) {
|
||||
if (r < 0)
|
||||
return r;
|
||||
|
||||
if (!path_startswith(p, cleaned))
|
||||
continue;
|
||||
|
||||
/* Ignore this mount if it is blacklisted, but only if it isn't the top-level mount we shall
|
||||
* operate on. */
|
||||
if (!path_equal(cleaned, p)) {
|
||||
bool blacklisted = false;
|
||||
char **i;
|
||||
|
||||
STRV_FOREACH(i, blacklist) {
|
||||
|
||||
if (path_equal(*i, cleaned))
|
||||
continue;
|
||||
|
||||
if (!path_startswith(*i, cleaned))
|
||||
continue;
|
||||
|
||||
if (path_startswith(p, *i)) {
|
||||
blacklisted = true;
|
||||
log_debug("Not remounting %s, because blacklisted by %s, called for %s", p, *i, cleaned);
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (blacklisted)
|
||||
continue;
|
||||
}
|
||||
|
||||
/* Let's ignore autofs mounts. If they aren't
|
||||
* triggered yet, we want to avoid triggering
|
||||
* them, as we don't make any guarantees for
|
||||
@ -396,12 +423,9 @@ int bind_remount_recursive(const char *prefix, bool ro) {
|
||||
continue;
|
||||
}
|
||||
|
||||
if (path_startswith(p, cleaned) &&
|
||||
!set_contains(done, p)) {
|
||||
|
||||
if (!set_contains(done, p)) {
|
||||
r = set_consume(todo, p);
|
||||
p = NULL;
|
||||
|
||||
if (r == -EEXIST)
|
||||
continue;
|
||||
if (r < 0)
|
||||
@ -418,8 +442,7 @@ int bind_remount_recursive(const char *prefix, bool ro) {
|
||||
|
||||
if (!set_contains(done, cleaned) &&
|
||||
!set_contains(todo, cleaned)) {
|
||||
/* The prefix directory itself is not yet a
|
||||
* mount, make it one. */
|
||||
/* The prefix directory itself is not yet a mount, make it one. */
|
||||
if (mount(cleaned, cleaned, NULL, MS_BIND|MS_REC, NULL) < 0)
|
||||
return -errno;
|
||||
|
||||
@ -430,6 +453,8 @@ int bind_remount_recursive(const char *prefix, bool ro) {
|
||||
if (mount(NULL, prefix, NULL, orig_flags|MS_BIND|MS_REMOUNT|(ro ? MS_RDONLY : 0), NULL) < 0)
|
||||
return -errno;
|
||||
|
||||
log_debug("Made top-level directory %s a mount point.", prefix);
|
||||
|
||||
x = strdup(cleaned);
|
||||
if (!x)
|
||||
return -ENOMEM;
|
||||
@ -447,8 +472,7 @@ int bind_remount_recursive(const char *prefix, bool ro) {
|
||||
if (r < 0)
|
||||
return r;
|
||||
|
||||
/* Deal with mount points that are obstructed by a
|
||||
* later mount */
|
||||
/* Deal with mount points that are obstructed by a later mount */
|
||||
r = path_is_mount_point(x, 0);
|
||||
if (r == -ENOENT || r == 0)
|
||||
continue;
|
||||
@ -463,6 +487,7 @@ int bind_remount_recursive(const char *prefix, bool ro) {
|
||||
if (mount(NULL, x, NULL, orig_flags|MS_BIND|MS_REMOUNT|(ro ? MS_RDONLY : 0), NULL) < 0)
|
||||
return -errno;
|
||||
|
||||
log_debug("Remounted %s read-only.", x);
|
||||
}
|
||||
}
|
||||
}
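The blacklist handling added to bind_remount_recursive() boils down to a prefix test on paths. Below is a rough standalone sketch of that decision, under stated assumptions: has_path_prefix() is a simplified, hypothetical stand-in for systemd's path_startswith() that only handles already normalized paths (no duplicate or trailing slashes), and is_blacklisted() is a hypothetical name for the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for path_startswith(): true if "path" is
 * "prefix" itself or lies below it. Assumes normalized paths. */
static bool has_path_prefix(const char *path, const char *prefix) {
        size_t l;

        assert(path);
        assert(prefix);

        l = strlen(prefix);
        if (strncmp(path, prefix, l) != 0)
                return false;

        if (l > 0 && prefix[l - 1] == '/') /* prefix is "/" */
                return true;

        return path[l] == '\0' || path[l] == '/';
}

/* Decide whether mount point "p" should be skipped while remounting the tree
 * rooted at "top". The top-level path itself is never skipped, and blacklist
 * entries that are equal to "top" or outside of it are ignored. */
static bool is_blacklisted(const char *top, const char *p, char *const *blacklist) {
        if (strcmp(top, p) == 0)
                return false;

        if (!blacklist)
                return false;

        for (char *const *i = blacklist; *i; i++) {
                if (strcmp(*i, top) == 0)
                        continue;
                if (!has_path_prefix(*i, top))
                        continue;
                if (has_path_prefix(p, *i))
                        return true;
        }

        return false;
}
```

The comparison is made per path component rather than per character, so a mount at /sys/fs/cgroup2 is not treated as being below a blacklisted /sys/fs/cgroup.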
|
||||
|
@ -35,7 +35,7 @@ int path_is_mount_point(const char *path, int flags);
|
||||
int repeat_unmount(const char *path, int flags);
|
||||
|
||||
int umount_recursive(const char *target, int flags);
|
||||
int bind_remount_recursive(const char *prefix, bool ro);
|
||||
int bind_remount_recursive(const char *prefix, bool ro, char **blacklist);
|
||||
|
||||
int mount_move_root(const char *path);
|
||||
|
||||
|
@ -31,14 +31,15 @@
|
||||
#include <unistd.h>
|
||||
#include <utmp.h>
|
||||
|
||||
#include "missing.h"
|
||||
#include "alloc-util.h"
|
||||
#include "fd-util.h"
|
||||
#include "formats-util.h"
|
||||
#include "macro.h"
|
||||
#include "missing.h"
|
||||
#include "parse-util.h"
|
||||
#include "path-util.h"
|
||||
#include "string-util.h"
|
||||
#include "strv.h"
|
||||
#include "user-util.h"
|
||||
#include "utf8.h"
|
||||
|
||||
@ -175,6 +176,35 @@ int get_user_creds(
|
||||
return 0;
|
||||
}
|
||||
|
||||
int get_user_creds_clean(
|
||||
const char **username,
|
||||
uid_t *uid, gid_t *gid,
|
||||
const char **home,
|
||||
const char **shell) {
|
||||
|
||||
int r;
|
||||
|
||||
/* Like get_user_creds(), but resets home/shell to NULL if they don't contain anything relevant. */
|
||||
|
||||
r = get_user_creds(username, uid, gid, home, shell);
|
||||
if (r < 0)
|
||||
return r;
|
||||
|
||||
if (shell &&
|
||||
(isempty(*shell) || PATH_IN_SET(*shell,
|
||||
"/bin/nologin",
|
||||
"/sbin/nologin",
|
||||
"/usr/bin/nologin",
|
||||
"/usr/sbin/nologin")))
|
||||
*shell = NULL;
|
||||
|
||||
if (home &&
|
||||
(isempty(*home) || path_equal(*home, "/")))
|
||||
*home = NULL;
|
||||
|
||||
return 0;
|
||||
}
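The cleanup step of get_user_creds_clean() can be illustrated without the user database lookup. This is a minimal sketch, assuming plain strcmp() in place of path_equal() and PATH_IN_SET(), so unlike the real helpers it does not treat "//bin/nologin" and "/bin/nologin" as equal. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if "shell" is one of the well-known nologin
 * binaries checked in the hunk above. */
static bool is_nologin_shell(const char *shell) {
        static const char *const nologin[] = {
                "/bin/nologin",
                "/sbin/nologin",
                "/usr/bin/nologin",
                "/usr/sbin/nologin",
        };

        assert(shell);

        for (size_t i = 0; i < sizeof(nologin) / sizeof(nologin[0]); i++)
                if (strcmp(shell, nologin[i]) == 0)
                        return true;

        return false;
}

/* Reset *home and *shell to NULL if they carry no useful information. Either
 * argument may be NULL if the caller is not interested in that field. */
static void clean_home_and_shell(const char **home, const char **shell) {
        if (shell && (!*shell || **shell == '\0' || is_nologin_shell(*shell)))
                *shell = NULL;

        if (home && (!*home || **home == '\0' || strcmp(*home, "/") == 0))
                *home = NULL;
}
```

A caller can then treat NULL uniformly as "not set" instead of special-casing "/" and the various nologin paths.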
|
||||
|
||||
int get_group_creds(const char **groupname, gid_t *gid) {
|
||||
struct group *g;
|
||||
gid_t id;
|
||||
|
@ -40,6 +40,7 @@ char* getlogname_malloc(void);
|
||||
char* getusername_malloc(void);
|
||||
|
||||
int get_user_creds(const char **username, uid_t *uid, gid_t *gid, const char **home, const char **shell);
|
||||
int get_user_creds_clean(const char **username, uid_t *uid, gid_t *gid, const char **home, const char **shell);
|
||||
int get_group_creds(const char **groupname, gid_t *gid);
|
||||
|
||||
char* uid_to_name(uid_t uid);
|
||||
|
@ -707,6 +707,8 @@ const sd_bus_vtable bus_exec_vtable[] = {
|
||||
SD_BUS_PROPERTY("MountFlags", "t", bus_property_get_ulong, offsetof(ExecContext, mount_flags), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
SD_BUS_PROPERTY("PrivateTmp", "b", bus_property_get_bool, offsetof(ExecContext, private_tmp), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
SD_BUS_PROPERTY("PrivateDevices", "b", bus_property_get_bool, offsetof(ExecContext, private_devices), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
SD_BUS_PROPERTY("ProtectKernelTunables", "b", bus_property_get_bool, offsetof(ExecContext, protect_kernel_tunables), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
SD_BUS_PROPERTY("ProtectControlGroups", "b", bus_property_get_bool, offsetof(ExecContext, protect_control_groups), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
SD_BUS_PROPERTY("PrivateNetwork", "b", bus_property_get_bool, offsetof(ExecContext, private_network), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
SD_BUS_PROPERTY("PrivateUsers", "b", bus_property_get_bool, offsetof(ExecContext, private_users), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
SD_BUS_PROPERTY("ProtectHome", "s", bus_property_get_protect_home, offsetof(ExecContext, protect_home), SD_BUS_VTABLE_PROPERTY_CONST),
|
||||
@ -1072,7 +1074,8 @@ int bus_exec_context_set_transient_property(
|
||||
"IgnoreSIGPIPE", "TTYVHangup", "TTYReset",
|
||||
"PrivateTmp", "PrivateDevices", "PrivateNetwork", "PrivateUsers",
|
||||
"NoNewPrivileges", "SyslogLevelPrefix", "MemoryDenyWriteExecute",
|
||||
"RestrictRealtime", "DynamicUser", "RemoveIPC")) {
|
||||
"RestrictRealtime", "DynamicUser", "RemoveIPC", "ProtectKernelTunables",
|
||||
"ProtectControlGroups")) {
|
||||
int b;
|
||||
|
||||
r = sd_bus_message_read(message, "b", &b);
|
||||
@ -1106,6 +1109,10 @@ int bus_exec_context_set_transient_property(
|
||||
c->dynamic_user = b;
|
||||
else if (streq(name, "RemoveIPC"))
|
||||
c->remove_ipc = b;
|
||||
else if (streq(name, "ProtectKernelTunables"))
|
||||
c->protect_kernel_tunables = b;
|
||||
else if (streq(name, "ProtectControlGroups"))
|
||||
c->protect_control_groups = b;
|
||||
|
||||
unit_write_drop_in_private_format(u, mode, name, "%s=%s", name, yes_no(b));
|
||||
}
|
||||
|
@ -837,6 +837,8 @@ static int null_conv(
|
||||
return PAM_CONV_ERR;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static int setup_pam(
|
||||
const char *name,
|
||||
const char *user,
|
||||
@ -845,6 +847,8 @@ static int setup_pam(
|
||||
char ***env,
|
||||
int fds[], unsigned n_fds) {
|
||||
|
||||
#ifdef HAVE_PAM
|
||||
|
||||
static const struct pam_conv conv = {
|
||||
.conv = null_conv,
|
||||
.appdata_ptr = NULL
|
||||
@ -1038,8 +1042,10 @@ fail:
|
||||
closelog();
|
||||
|
||||
return r;
|
||||
}
|
||||
#else
|
||||
return 0;
|
||||
#endif
|
||||
}
|
||||
|
||||
static void rename_process_from_path(const char *path) {
|
||||
char process_name[11];
|
||||
@ -1273,6 +1279,10 @@ static int apply_memory_deny_write_execute(const Unit* u, const ExecContext *c)
|
||||
if (!seccomp)
|
||||
return -ENOMEM;
|
||||
|
||||
r = seccomp_add_secondary_archs(seccomp);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
r = seccomp_rule_add(
|
||||
seccomp,
|
||||
SCMP_ACT_ERRNO(EPERM),
|
||||
@ -1322,6 +1332,10 @@ static int apply_restrict_realtime(const Unit* u, const ExecContext *c) {
|
||||
if (!seccomp)
|
||||
return -ENOMEM;
|
||||
|
||||
r = seccomp_add_secondary_archs(seccomp);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
/* Determine the highest policy constant we want to allow */
|
||||
for (i = 0; i < ELEMENTSOF(permitted_policies); i++)
|
||||
if (permitted_policies[i] > max_policy)
|
||||
@ -1375,12 +1389,121 @@ finish:
|
||||
return r;
|
||||
}
|
||||
|
||||
static int apply_protect_sysctl(Unit *u, const ExecContext *c) {
|
||||
scmp_filter_ctx *seccomp;
|
||||
int r;
|
||||
|
||||
assert(c);
|
||||
|
||||
/* Turn off the legacy sysctl() system call. Many distributions turn this off while building the kernel, but
|
||||
* let's protect even those systems where this is left on in the kernel. */
|
||||
|
||||
if (skip_seccomp_unavailable(u, "ProtectKernelTunables="))
|
||||
return 0;
|
||||
|
||||
seccomp = seccomp_init(SCMP_ACT_ALLOW);
|
||||
if (!seccomp)
|
||||
return -ENOMEM;
|
||||
|
||||
r = seccomp_add_secondary_archs(seccomp);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
r = seccomp_rule_add(
|
||||
seccomp,
|
||||
SCMP_ACT_ERRNO(EPERM),
|
||||
SCMP_SYS(_sysctl),
|
||||
0);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
r = seccomp_attr_set(seccomp, SCMP_FLTATR_CTL_NNP, 0);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
r = seccomp_load(seccomp);
|
||||
|
||||
finish:
|
||||
seccomp_release(seccomp);
|
||||
return r;
|
||||
}
|
||||
|
||||
static int apply_private_devices(Unit *u, const ExecContext *c) {
|
||||
const SystemCallFilterSet *set;
|
||||
scmp_filter_ctx *seccomp;
|
||||
const char *sys;
|
||||
bool syscalls_found = false;
|
||||
int r;
|
||||
|
||||
assert(c);
|
||||
|
||||
/* If PrivateDevices= is set, also turn off iopl and all @raw-io syscalls. */
|
||||
|
||||
if (skip_seccomp_unavailable(u, "PrivateDevices="))
|
||||
return 0;
|
||||
|
||||
seccomp = seccomp_init(SCMP_ACT_ALLOW);
|
||||
if (!seccomp)
|
||||
return -ENOMEM;
|
||||
|
||||
r = seccomp_add_secondary_archs(seccomp);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
for (set = syscall_filter_sets; set->set_name; set++)
|
||||
if (streq(set->set_name, "@raw-io")) {
|
||||
syscalls_found = true;
|
||||
break;
|
||||
}
|
||||
|
||||
/* We should never fail here */
|
||||
if (!syscalls_found) {
|
||||
r = -EOPNOTSUPP;
|
||||
goto finish;
|
||||
}
|
||||
|
||||
NULSTR_FOREACH(sys, set->value) {
|
||||
int id;
|
||||
bool add = true;
|
||||
|
||||
#ifndef __NR_s390_pci_mmio_read
|
||||
if (streq(sys, "s390_pci_mmio_read"))
|
||||
add = false;
|
||||
#endif
|
||||
#ifndef __NR_s390_pci_mmio_write
|
||||
if (streq(sys, "s390_pci_mmio_write"))
|
||||
add = false;
|
||||
#endif
|
||||
|
||||
if (!add)
|
||||
continue;
|
||||
|
||||
id = seccomp_syscall_resolve_name(sys);
|
||||
|
||||
r = seccomp_rule_add(
|
||||
seccomp,
|
||||
SCMP_ACT_ERRNO(EPERM),
|
||||
id, 0);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
}
|
||||
|
||||
r = seccomp_attr_set(seccomp, SCMP_FLTATR_CTL_NNP, 0);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
r = seccomp_load(seccomp);
|
||||
|
||||
finish:
|
||||
seccomp_release(seccomp);
|
||||
return r;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
static void do_idle_pipe_dance(int idle_pipe[4]) {
|
||||
assert(idle_pipe);
|
||||
|
||||
|
||||
idle_pipe[1] = safe_close(idle_pipe[1]);
|
||||
idle_pipe[2] = safe_close(idle_pipe[2]);
|
||||
|
||||
@ -1581,7 +1704,9 @@ static bool exec_needs_mount_namespace(
|
||||
|
||||
if (context->private_devices ||
|
||||
context->protect_system != PROTECT_SYSTEM_NO ||
|
||||
context->protect_home != PROTECT_HOME_NO)
|
||||
context->protect_home != PROTECT_HOME_NO ||
|
||||
context->protect_kernel_tunables ||
|
||||
context->protect_control_groups)
|
||||
return true;
|
||||
|
||||
return false;
|
||||
@ -1740,6 +1865,111 @@ static int setup_private_users(uid_t uid, gid_t gid) {
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int setup_runtime_directory(
|
||||
const ExecContext *context,
|
||||
const ExecParameters *params,
|
||||
uid_t uid,
|
||||
gid_t gid) {
|
||||
|
||||
char **rt;
|
||||
int r;
|
||||
|
||||
assert(context);
|
||||
assert(params);
|
||||
|
||||
STRV_FOREACH(rt, context->runtime_directory) {
|
||||
_cleanup_free_ char *p;
|
||||
|
||||
p = strjoin(params->runtime_prefix, "/", *rt, NULL);
|
||||
if (!p)
|
||||
return -ENOMEM;
|
||||
|
||||
r = mkdir_p_label(p, context->runtime_directory_mode);
|
||||
if (r < 0)
|
||||
return r;
|
||||
|
||||
r = chmod_and_chown(p, context->runtime_directory_mode, uid, gid);
|
||||
if (r < 0)
|
||||
return r;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int setup_smack(
|
||||
const ExecContext *context,
|
||||
const ExecCommand *command) {
|
||||
|
||||
#ifdef HAVE_SMACK
|
||||
int r;
|
||||
|
||||
assert(context);
|
||||
assert(command);
|
||||
|
||||
if (!mac_smack_use())
|
||||
return 0;
|
||||
|
||||
if (context->smack_process_label) {
|
||||
r = mac_smack_apply_pid(0, context->smack_process_label);
|
||||
if (r < 0)
|
||||
return r;
|
||||
}
|
||||
#ifdef SMACK_DEFAULT_PROCESS_LABEL
|
||||
else {
|
||||
_cleanup_free_ char *exec_label = NULL;
|
||||
|
||||
r = mac_smack_read(command->path, SMACK_ATTR_EXEC, &exec_label);
|
||||
if (r < 0 && r != -ENODATA && r != -EOPNOTSUPP)
|
||||
return r;
|
||||
|
||||
r = mac_smack_apply_pid(0, exec_label ? : SMACK_DEFAULT_PROCESS_LABEL);
|
||||
if (r < 0)
|
||||
return r;
|
||||
}
|
||||
#endif
|
||||
#endif
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int compile_read_write_paths(
|
||||
const ExecContext *context,
|
||||
const ExecParameters *params,
|
||||
char ***ret) {
|
||||
|
||||
_cleanup_strv_free_ char **l = NULL;
|
||||
char **rt;
|
||||
|
||||
/* Compile the list of writable paths. This is the combination of the explicitly configured paths, plus all
|
||||
* runtime directories. */
|
||||
|
||||
if (strv_isempty(context->read_write_paths) &&
|
||||
strv_isempty(context->runtime_directory)) {
|
||||
*ret = NULL; /* NOP if neither is set */
|
||||
return 0;
|
||||
}
|
||||
|
||||
l = strv_copy(context->read_write_paths);
|
||||
if (!l)
|
||||
return -ENOMEM;
|
||||
|
||||
STRV_FOREACH(rt, context->runtime_directory) {
|
||||
char *s;
|
||||
|
||||
s = strjoin(params->runtime_prefix, "/", *rt, NULL);
|
||||
if (!s)
|
||||
return -ENOMEM;
|
||||
|
||||
if (strv_consume(&l, s) < 0)
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
*ret = l;
|
||||
l = NULL;
|
||||
|
||||
return 0;
|
||||
}
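The list built by compile_read_write_paths() above (explicitly configured paths, followed by each runtime directory prefixed with the runtime prefix) can be sketched without the systemd strv helpers. This is only an illustration under stated assumptions: the sketch_* names are hypothetical, plain libc allocation stands in for strv_copy()/strjoin()/strv_consume(), and -1 stands in for -ENOMEM.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Counts the entries of a NULL-terminated string list; NULL counts as empty. */
static size_t sketch_count(const char *const *l) {
        size_t n = 0;

        if (l)
                while (l[n])
                        n++;
        return n;
}

/* Frees a NULL-terminated list and every string in it. */
static void sketch_free_list(char **l) {
        if (!l)
                return;
        for (char **i = l; *i; i++)
                free(*i);
        free(l);
}

/* Returns a newly allocated "<a><sep><b>" string, or NULL on allocation failure. */
static char *sketch_join(const char *a, const char *sep, const char *b) {
        size_t sz = strlen(a) + strlen(sep) + strlen(b) + 1;
        char *s = malloc(sz);

        if (s)
                snprintf(s, sz, "%s%s%s", a, sep, b);
        return s;
}

/* Hypothetical stand-in for compile_read_write_paths(): returns 0 on success,
 * -1 on allocation failure. *ret is set to NULL if both inputs are empty. */
static int sketch_compile_rw(
                const char *const *rw_paths,
                const char *runtime_prefix,
                const char *const *runtime_dirs,
                char ***ret) {

        size_t n_rw = sketch_count(rw_paths), n_rt = sketch_count(runtime_dirs), k = 0;
        char **l;

        assert(ret);
        assert(n_rt == 0 || runtime_prefix); /* assumption: a prefix is given whenever runtime dirs are */

        if (n_rw + n_rt == 0) {
                *ret = NULL; /* NOP if neither is set */
                return 0;
        }

        l = calloc(n_rw + n_rt + 1, sizeof(char *));
        if (!l)
                return -1;

        /* First the explicitly configured paths, copied verbatim... */
        for (size_t i = 0; i < n_rw; i++, k++) {
                l[k] = sketch_join(rw_paths[i], "", "");
                if (!l[k])
                        goto fail;
        }

        /* ...then one "<prefix>/<dir>" entry per runtime directory. */
        for (size_t i = 0; i < n_rt; i++, k++) {
                l[k] = sketch_join(runtime_prefix, "/", runtime_dirs[i]);
                if (!l[k])
                        goto fail;
        }

        *ret = l;
        return 0;

fail:
        sketch_free_list(l);
        return -1;
}
```

With rw_paths = { "/var/lib/foo" }, a prefix of "/run" and runtime_dirs = { "foo", "bar" }, the resulting list is "/var/lib/foo", "/run/foo", "/run/bar".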
|
||||
|
||||
static void append_socket_pair(int *array, unsigned *n, int pair[2]) {
|
||||
assert(array);
|
||||
assert(n);
|
||||
@ -1796,6 +2026,37 @@ static int close_remaining_fds(
|
||||
return close_all_fds(dont_close, n_dont_close);
|
||||
}
|
||||
|
||||
static bool context_has_address_families(const ExecContext *c) {
|
||||
assert(c);
|
||||
|
||||
return c->address_families_whitelist ||
|
||||
!set_isempty(c->address_families);
|
||||
}
|
||||
|
||||
static bool context_has_syscall_filters(const ExecContext *c) {
|
||||
assert(c);
|
||||
|
||||
return c->syscall_whitelist ||
|
||||
!set_isempty(c->syscall_filter) ||
|
||||
!set_isempty(c->syscall_archs);
|
||||
}
|
||||
|
||||
static bool context_has_no_new_privileges(const ExecContext *c) {
|
||||
assert(c);
|
||||
|
||||
if (c->no_new_privileges)
|
||||
return true;
|
||||
|
||||
if (have_effective_cap(CAP_SYS_ADMIN)) /* if we are privileged, we don't need NNP */
|
||||
return false;
|
||||
|
||||
return context_has_address_families(c) || /* we need NNP if we have any form of seccomp and are unprivileged */
|
||||
c->memory_deny_write_execute ||
|
||||
c->restrict_realtime ||
|
||||
c->protect_kernel_tunables ||
|
||||
context_has_syscall_filters(c);
|
||||
}
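The decision taken by context_has_no_new_privileges() above can be restated as a small standalone predicate. This is a minimal sketch, not the systemd implementation: SketchContext is a hypothetical reduction of ExecContext to the fields consulted here, and the privileged parameter stands in for have_effective_cap(CAP_SYS_ADMIN).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the ExecContext fields consulted above. */
typedef struct SketchContext {
        bool no_new_privileges;         /* explicit NoNewPrivileges= */
        bool has_address_families;      /* result of context_has_address_families() */
        bool has_syscall_filters;       /* result of context_has_syscall_filters() */
        bool memory_deny_write_execute;
        bool restrict_realtime;
        bool protect_kernel_tunables;
} SketchContext;

/* Returns true if PR_SET_NO_NEW_PRIVS would have to be set before installing
 * seccomp filters. 'privileged' stands in for have_effective_cap(CAP_SYS_ADMIN). */
static bool sketch_needs_nnp(const SketchContext *c, bool privileged) {
        assert(c);

        /* An explicit request always wins. */
        if (c->no_new_privileges)
                return true;

        /* A privileged caller may install seccomp filters without NNP. */
        if (privileged)
                return false;

        /* Unprivileged: any seccomp-backed option requires NNP. */
        return c->has_address_families ||
                c->memory_deny_write_execute ||
                c->restrict_realtime ||
                c->protect_kernel_tunables ||
                c->has_syscall_filters;
}
```

The point of the sketch is the ordering of the three checks: the explicit setting is honoured regardless of privilege, and the implicit cases only apply to unprivileged callers.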
|
||||
|
||||
static int send_user_lookup(
|
||||
Unit *unit,
|
||||
int user_lookup_fd,
|
||||
@ -1940,22 +2201,14 @@ static int exec_child(
|
||||
} else {
|
||||
if (context->user) {
|
||||
username = context->user;
|
||||
r = get_user_creds(&username, &uid, &gid, &home, &shell);
|
||||
r = get_user_creds_clean(&username, &uid, &gid, &home, &shell);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_USER;
|
||||
return r;
|
||||
}
|
||||
|
||||
/* Don't set $HOME or $SHELL if they are not particularly enlightening anyway. */
|
||||
if (isempty(home) || path_equal(home, "/"))
|
||||
home = NULL;
|
||||
|
||||
if (isempty(shell) || PATH_IN_SET(shell,
|
||||
"/bin/nologin",
|
||||
"/sbin/nologin",
|
||||
"/usr/bin/nologin",
|
||||
"/usr/sbin/nologin"))
|
||||
shell = NULL;
|
||||
/* Note that we don't set $HOME or $SHELL if they are not particularly enlightening anyway
|
||||
* (i.e. are "/" or "/bin/nologin"). */
|
||||
}
|
||||
|
||||
if (context->group) {
|
||||
@ -2108,28 +2361,10 @@ static int exec_child(
|
||||
}
|
||||
|
||||
if (!strv_isempty(context->runtime_directory) && params->runtime_prefix) {
|
||||
char **rt;
|
||||
|
||||
STRV_FOREACH(rt, context->runtime_directory) {
|
||||
_cleanup_free_ char *p;
|
||||
|
||||
p = strjoin(params->runtime_prefix, "/", *rt, NULL);
|
||||
if (!p) {
|
||||
*exit_status = EXIT_RUNTIME_DIRECTORY;
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
r = mkdir_p_label(p, context->runtime_directory_mode);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_RUNTIME_DIRECTORY;
|
||||
return r;
|
||||
}
|
||||
|
||||
r = chmod_and_chown(p, context->runtime_directory_mode, uid, gid);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_RUNTIME_DIRECTORY;
|
||||
return r;
|
||||
}
|
||||
r = setup_runtime_directory(context, params, uid, gid);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_RUNTIME_DIRECTORY;
|
||||
return r;
|
||||
}
|
||||
}
|
||||
|
||||
@ -2168,41 +2403,15 @@ static int exec_child(
|
||||
}
|
||||
accum_env = strv_env_clean(accum_env);
|
||||
|
||||
umask(context->umask);
|
||||
(void) umask(context->umask);
|
||||
|
||||
if ((params->flags & EXEC_APPLY_PERMISSIONS) && !command->privileged) {
|
||||
r = enforce_groups(context, username, gid);
|
||||
r = setup_smack(context, command);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_GROUP;
|
||||
*exit_status = EXIT_SMACK_PROCESS_LABEL;
|
||||
return r;
|
||||
}
|
||||
#ifdef HAVE_SMACK
|
||||
if (context->smack_process_label) {
|
||||
r = mac_smack_apply_pid(0, context->smack_process_label);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_SMACK_PROCESS_LABEL;
|
||||
return r;
|
||||
}
|
||||
}
|
||||
#ifdef SMACK_DEFAULT_PROCESS_LABEL
|
||||
else {
|
||||
_cleanup_free_ char *exec_label = NULL;
|
||||
|
||||
r = mac_smack_read(command->path, SMACK_ATTR_EXEC, &exec_label);
|
||||
if (r < 0 && r != -ENODATA && r != -EOPNOTSUPP) {
|
||||
*exit_status = EXIT_SMACK_PROCESS_LABEL;
|
||||
return r;
|
||||
}
|
||||
|
||||
r = mac_smack_apply_pid(0, exec_label ? : SMACK_DEFAULT_PROCESS_LABEL);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_SMACK_PROCESS_LABEL;
|
||||
return r;
|
||||
}
|
||||
}
|
||||
#endif
|
||||
#endif
|
||||
#ifdef HAVE_PAM
|
||||
if (context->pam_name && username) {
|
||||
r = setup_pam(context->pam_name, username, uid, context->tty_path, &accum_env, fds, n_fds);
|
||||
if (r < 0) {
|
||||
@ -2210,7 +2419,6 @@ static int exec_child(
|
||||
return r;
|
||||
}
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
if (context->private_network && runtime && runtime->netns_storage_socket[0] >= 0) {
|
||||
@ -2222,8 +2430,8 @@ static int exec_child(
|
||||
}
|
||||
|
||||
needs_mount_namespace = exec_needs_mount_namespace(context, params, runtime);
|
||||
|
||||
if (needs_mount_namespace) {
|
||||
_cleanup_free_ char **rw = NULL;
|
||||
char *tmp = NULL, *var = NULL;
|
||||
|
||||
/* The runtime struct only contains the parent
|
||||
@ -2239,14 +2447,22 @@ static int exec_child(
|
||||
var = strjoina(runtime->var_tmp_dir, "/tmp");
|
||||
}
|
||||
|
||||
r = compile_read_write_paths(context, params, &rw);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_NAMESPACE;
|
||||
return r;
|
||||
}
|
||||
|
||||
r = setup_namespace(
|
||||
(params->flags & EXEC_APPLY_CHROOT) ? context->root_directory : NULL,
|
||||
context->read_write_paths,
|
||||
rw,
|
||||
context->read_only_paths,
|
||||
context->inaccessible_paths,
|
||||
tmp,
|
||||
var,
|
||||
context->private_devices,
|
||||
context->protect_kernel_tunables,
|
||||
context->protect_control_groups,
|
||||
context->protect_home,
|
||||
context->protect_system,
|
||||
context->mount_flags);
|
||||
@ -2264,6 +2480,14 @@ static int exec_child(
|
||||
}
|
||||
}
|
||||
|
||||
if ((params->flags & EXEC_APPLY_PERMISSIONS) && !command->privileged) {
|
||||
r = enforce_groups(context, username, gid);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_GROUP;
|
||||
return r;
|
||||
}
|
||||
}
|
||||
|
||||
if (context->working_directory_home)
|
||||
wd = home;
|
||||
else if (context->working_directory)
|
||||
@ -2335,11 +2559,6 @@ static int exec_child(
|
||||
|
||||
if ((params->flags & EXEC_APPLY_PERMISSIONS) && !command->privileged) {
|
||||
|
||||
bool use_address_families = context->address_families_whitelist ||
|
||||
!set_isempty(context->address_families);
|
||||
bool use_syscall_filter = context->syscall_whitelist ||
|
||||
!set_isempty(context->syscall_filter) ||
|
||||
!set_isempty(context->syscall_archs);
|
||||
int secure_bits = context->secure_bits;
|
||||
|
||||
for (i = 0; i < _RLIMIT_MAX; i++) {
|
||||
@ -2416,15 +2635,14 @@ static int exec_child(
|
||||
return -errno;
|
||||
}
|
||||
|
||||
if (context->no_new_privileges ||
|
||||
(!have_effective_cap(CAP_SYS_ADMIN) && (use_address_families || context->memory_deny_write_execute || context->restrict_realtime || use_syscall_filter)))
|
||||
if (context_has_no_new_privileges(context))
|
||||
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0) {
|
||||
*exit_status = EXIT_NO_NEW_PRIVILEGES;
|
||||
return -errno;
|
||||
}
|
||||
|
||||
#ifdef HAVE_SECCOMP
|
||||
if (use_address_families) {
|
||||
if (context_has_address_families(context)) {
|
||||
r = apply_address_families(unit, context);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_ADDRESS_FAMILIES;
|
||||
@ -2448,7 +2666,23 @@ static int exec_child(
|
||||
}
|
||||
}
|
||||
|
||||
if (use_syscall_filter) {
|
||||
if (context->protect_kernel_tunables) {
|
||||
r = apply_protect_sysctl(unit, context);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_SECCOMP;
|
||||
return r;
|
||||
}
|
||||
}
|
||||
|
||||
if (context->private_devices) {
|
||||
r = apply_private_devices(unit, context);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_SECCOMP;
|
||||
return r;
|
||||
}
|
||||
}
|
||||
|
||||
if (context_has_syscall_filters(context)) {
|
||||
r = apply_seccomp(unit, context);
|
||||
if (r < 0) {
|
||||
*exit_status = EXIT_SECCOMP;
|
||||
@ -2880,6 +3114,8 @@ void exec_context_dump(ExecContext *c, FILE* f, const char *prefix) {
|
||||
"%sNonBlocking: %s\n"
|
||||
"%sPrivateTmp: %s\n"
|
||||
"%sPrivateDevices: %s\n"
|
||||
"%sProtectKernelTunables: %s\n"
|
||||
"%sProtectControlGroups: %s\n"
|
||||
"%sPrivateNetwork: %s\n"
|
||||
"%sPrivateUsers: %s\n"
|
||||
"%sProtectHome: %s\n"
|
||||
@ -2893,6 +3129,8 @@ void exec_context_dump(ExecContext *c, FILE* f, const char *prefix) {
|
||||
prefix, yes_no(c->non_blocking),
|
||||
prefix, yes_no(c->private_tmp),
|
||||
prefix, yes_no(c->private_devices),
|
||||
prefix, yes_no(c->protect_kernel_tunables),
|
||||
prefix, yes_no(c->protect_control_groups),
|
||||
prefix, yes_no(c->private_network),
|
||||
prefix, yes_no(c->private_users),
|
||||
prefix, protect_home_to_string(c->protect_home),
|
||||
|
@ -174,6 +174,8 @@ struct ExecContext {
|
||||
bool private_users;
|
||||
ProtectSystem protect_system;
|
||||
ProtectHome protect_home;
|
||||
bool protect_kernel_tunables;
|
||||
bool protect_control_groups;
|
||||
|
||||
bool no_new_privileges;
|
||||
|
||||
|
@ -89,6 +89,8 @@ $1.ReadOnlyPaths, config_parse_namespace_path_strv, 0,
|
||||
$1.InaccessiblePaths, config_parse_namespace_path_strv, 0, offsetof($1, exec_context.inaccessible_paths)
|
||||
$1.PrivateTmp, config_parse_bool, 0, offsetof($1, exec_context.private_tmp)
|
||||
$1.PrivateDevices, config_parse_bool, 0, offsetof($1, exec_context.private_devices)
|
||||
$1.ProtectKernelTunables, config_parse_bool, 0, offsetof($1, exec_context.protect_kernel_tunables)
|
||||
$1.ProtectControlGroups, config_parse_bool, 0, offsetof($1, exec_context.protect_control_groups)
|
||||
$1.PrivateNetwork, config_parse_bool, 0, offsetof($1, exec_context.private_network)
|
||||
$1.PrivateUsers, config_parse_bool, 0, offsetof($1, exec_context.private_users)
|
||||
$1.ProtectSystem, config_parse_protect_system, 0, offsetof($1, exec_context)
|
||||
|
@ -996,10 +996,8 @@ static int parse_argv(int argc, char *argv[]) {
|
||||
|
||||
case ARG_MACHINE_ID:
|
||||
r = set_machine_id(optarg);
|
||||
if (r < 0) {
|
||||
log_error("MachineID '%s' is not valid.", optarg);
|
||||
return r;
|
||||
}
|
||||
if (r < 0)
|
||||
return log_error_errno(r, "MachineID '%s' is not valid.", optarg);
|
||||
break;
|
||||
|
||||
case 'h':
|
||||
|
@ -29,6 +29,7 @@
|
||||
#include "alloc-util.h"
|
||||
#include "dev-setup.h"
|
||||
#include "fd-util.h"
|
||||
#include "fs-util.h"
|
||||
#include "loopback-setup.h"
|
||||
#include "missing.h"
|
||||
#include "mkdir.h"
|
||||
@ -53,15 +54,104 @@ typedef enum MountMode {
|
||||
PRIVATE_TMP,
|
||||
PRIVATE_VAR_TMP,
|
||||
PRIVATE_DEV,
|
||||
READWRITE
|
||||
READWRITE,
|
||||
} MountMode;
|
||||
|
||||
typedef struct BindMount {
|
||||
const char *path; /* stack memory, doesn't need to be freed explicitly */
|
||||
char *chased; /* malloc()ed memory, needs to be freed */
|
||||
MountMode mode;
|
||||
bool ignore; /* Ignore if path does not exist */
|
||||
} BindMount;
|
||||
|
||||
typedef struct TargetMount {
|
||||
const char *path;
|
||||
MountMode mode;
|
||||
bool done;
|
||||
bool ignore;
|
||||
} BindMount;
|
||||
bool ignore; /* Ignore if path does not exist */
|
||||
} TargetMount;
|
||||
|
||||
/*
|
||||
 * The following Protect tables protect paths and mark some of them READONLY.
 * If a path is also covered by an option from another table, it is marked
 * READWRITE in the current one, and the more restrictive mode is applied from
 * that other table. This way all options can be combined in a safe and
 * comprehensible way for users.
|
||||
*/
|
||||
|
||||
/* ProtectKernelTunables= option and the related filesystem APIs */
|
||||
static const TargetMount protect_kernel_tunables_table[] = {
|
||||
{ "/proc/sys", READONLY, false },
|
||||
{ "/proc/sysrq-trigger", READONLY, true },
|
||||
{ "/proc/latency_stats", READONLY, true },
|
||||
{ "/proc/mtrr", READONLY, true },
|
||||
{ "/proc/apm", READONLY, true },
|
||||
{ "/proc/acpi", READONLY, true },
|
||||
{ "/proc/timer_stats", READONLY, true },
|
||||
{ "/proc/asound", READONLY, true },
|
||||
{ "/proc/bus", READONLY, true },
|
||||
{ "/proc/fs", READONLY, true },
|
||||
{ "/proc/irq", READONLY, true },
|
||||
{ "/sys", READONLY, false },
|
||||
{ "/sys/kernel/debug", READONLY, true },
|
||||
{ "/sys/kernel/tracing", READONLY, true },
|
||||
{ "/sys/fs/cgroup", READWRITE, false }, /* READONLY is set by ProtectControlGroups= option */
|
||||
};
|
||||
|
||||
/*
|
||||
 * ProtectHome=read-only table: protects $HOME and $XDG_RUNTIME_DIR. The rest of
 * the system should be protected by ProtectSystem=.
|
||||
*/
|
||||
static const TargetMount protect_home_read_only_table[] = {
|
||||
{ "/home", READONLY, true },
|
||||
{ "/run/user", READONLY, true },
|
||||
{ "/root", READONLY, true },
|
||||
};
|
||||
|
||||
/* ProtectHome=yes table */
|
||||
static const TargetMount protect_home_yes_table[] = {
|
||||
{ "/home", INACCESSIBLE, true },
|
||||
{ "/run/user", INACCESSIBLE, true },
|
||||
{ "/root", INACCESSIBLE, true },
|
||||
};
|
||||
|
||||
/* ProtectSystem=yes table */
|
||||
static const TargetMount protect_system_yes_table[] = {
|
||||
{ "/usr", READONLY, false },
|
||||
{ "/boot", READONLY, true },
|
||||
{ "/efi", READONLY, true },
|
||||
};
|
||||
|
||||
/* ProtectSystem=full includes ProtectSystem=yes */
|
||||
static const TargetMount protect_system_full_table[] = {
|
||||
{ "/usr", READONLY, false },
|
||||
{ "/boot", READONLY, true },
|
||||
{ "/efi", READONLY, true },
|
||||
{ "/etc", READONLY, false },
|
||||
};
|
||||
|
||||
/*
|
||||
 * ProtectSystem=strict table. In this strict mode, we mount everything
 * read-only, except for /proc, /dev and /sys, which are the kernel API VFS.
 * Those are left writable here because PrivateDevices= and
 * ProtectKernelTunables= protect them, and these options should be fully
 * orthogonal. (And of course /home and friends are also left writable, as
 * ProtectHome= shall manage those, orthogonally.)
|
||||
*/
|
||||
static const TargetMount protect_system_strict_table[] = {
|
||||
{ "/", READONLY, false },
|
||||
{ "/proc", READWRITE, false }, /* ProtectKernelTunables= */
|
||||
{ "/sys", READWRITE, false }, /* ProtectKernelTunables= */
|
||||
{ "/dev", READWRITE, false }, /* PrivateDevices= */
|
||||
{ "/home", READWRITE, true }, /* ProtectHome= */
|
||||
{ "/run/user", READWRITE, true }, /* ProtectHome= */
|
||||
{ "/root", READWRITE, true }, /* ProtectHome= */
|
||||
};
|
||||
|
||||
static void set_bind_mount(BindMount **p, const char *path, MountMode mode, bool ignore) {
|
||||
(*p)->path = path;
|
||||
(*p)->mode = mode;
|
||||
(*p)->ignore = ignore;
|
||||
}
|
||||
|
||||
static int append_mounts(BindMount **p, char **strv, MountMode mode) {
|
||||
char **i;
|
||||
@ -69,45 +159,125 @@ static int append_mounts(BindMount **p, char **strv, MountMode mode) {
|
||||
assert(p);
|
||||
|
||||
STRV_FOREACH(i, strv) {
|
||||
bool ignore = false;
|
||||
|
||||
(*p)->ignore = false;
|
||||
(*p)->done = false;
|
||||
|
||||
if ((mode == INACCESSIBLE || mode == READONLY || mode == READWRITE) && (*i)[0] == '-') {
|
||||
(*p)->ignore = true;
|
||||
if (IN_SET(mode, INACCESSIBLE, READONLY, READWRITE) && startswith(*i, "-")) {
|
||||
(*i)++;
|
||||
ignore = true;
|
||||
}
|
||||
|
||||
if (!path_is_absolute(*i))
|
||||
return -EINVAL;
|
||||
|
||||
(*p)->path = *i;
|
||||
(*p)->mode = mode;
|
||||
set_bind_mount(p, *i, mode, ignore);
|
||||
(*p)++;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int append_target_mounts(BindMount **p, const char *root_directory, const TargetMount *mounts, const size_t size) {
|
||||
unsigned i;
|
||||
|
||||
assert(p);
|
||||
assert(mounts);
|
||||
|
||||
for (i = 0; i < size; i++) {
|
||||
/*
|
||||
 * Here we assume that the ignore field is set during declaration;
 * a "-" prefix at the beginning of the path is not supported.
|
||||
*/
|
||||
const TargetMount *m = &mounts[i];
|
||||
const char *path = prefix_roota(root_directory, m->path);
|
||||
|
||||
if (!path_is_absolute(path))
|
||||
return -EINVAL;
|
||||
|
||||
set_bind_mount(p, path, m->mode, m->ignore);
|
||||
(*p)++;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int append_protect_kernel_tunables(BindMount **p, const char *root_directory) {
|
||||
assert(p);
|
||||
|
||||
return append_target_mounts(p, root_directory, protect_kernel_tunables_table,
|
||||
ELEMENTSOF(protect_kernel_tunables_table));
|
||||
}
|
||||
|
||||
static int append_protect_home(BindMount **p, const char *root_directory, ProtectHome protect_home) {
|
||||
int r = 0;
|
||||
|
||||
assert(p);
|
||||
|
||||
if (protect_home == PROTECT_HOME_NO)
|
||||
return 0;
|
||||
|
||||
switch (protect_home) {
|
||||
case PROTECT_HOME_READ_ONLY:
|
||||
r = append_target_mounts(p, root_directory, protect_home_read_only_table,
|
||||
ELEMENTSOF(protect_home_read_only_table));
|
||||
break;
|
||||
case PROTECT_HOME_YES:
|
||||
r = append_target_mounts(p, root_directory, protect_home_yes_table,
|
||||
ELEMENTSOF(protect_home_yes_table));
|
||||
break;
|
||||
default:
|
||||
r = -EINVAL;
|
||||
break;
|
||||
}
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
static int append_protect_system(BindMount **p, const char *root_directory, ProtectSystem protect_system) {
|
||||
int r = 0;
|
||||
|
||||
assert(p);
|
||||
|
||||
if (protect_system == PROTECT_SYSTEM_NO)
|
||||
return 0;
|
||||
|
||||
switch (protect_system) {
|
||||
case PROTECT_SYSTEM_STRICT:
|
||||
r = append_target_mounts(p, root_directory, protect_system_strict_table,
|
||||
ELEMENTSOF(protect_system_strict_table));
|
||||
break;
|
||||
case PROTECT_SYSTEM_YES:
|
||||
r = append_target_mounts(p, root_directory, protect_system_yes_table,
|
||||
ELEMENTSOF(protect_system_yes_table));
|
||||
break;
|
||||
case PROTECT_SYSTEM_FULL:
|
||||
r = append_target_mounts(p, root_directory, protect_system_full_table,
|
||||
ELEMENTSOF(protect_system_full_table));
|
||||
break;
|
||||
default:
|
||||
r = -EINVAL;
|
||||
break;
|
||||
}
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
static int mount_path_compare(const void *a, const void *b) {
|
||||
const BindMount *p = a, *q = b;
|
||||
int d;
|
||||
|
||||
d = path_compare(p->path, q->path);
|
||||
|
||||
if (d == 0) {
|
||||
/* If the paths are equal, check the mode */
|
||||
if (p->mode < q->mode)
|
||||
return -1;
|
||||
|
||||
if (p->mode > q->mode)
|
||||
return 1;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* If the paths are not equal, then order prefixes first */
|
||||
return d;
|
||||
d = path_compare(p->path, q->path);
|
||||
if (d != 0)
|
||||
return d;
|
||||
|
||||
/* If the paths are equal, check the mode */
|
||||
if (p->mode < q->mode)
|
||||
return -1;
|
||||
|
||||
if (p->mode > q->mode)
|
||||
return 1;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void drop_duplicates(BindMount *m, unsigned *n) {
|
||||
@ -116,16 +286,110 @@ static void drop_duplicates(BindMount *m, unsigned *n) {
|
||||
assert(m);
|
||||
assert(n);
|
||||
|
||||
/* Drops duplicate entries. Expects that the array is properly ordered already. */
|
||||
|
||||
for (f = m, t = m, previous = NULL; f < m+*n; f++) {
|
||||
|
||||
/* The first one wins */
|
||||
if (previous && path_equal(f->path, previous->path))
|
||||
/* The first one wins (which is the one with the more restrictive mode), see mount_path_compare()
|
||||
* above. */
|
||||
if (previous && path_equal(f->path, previous->path)) {
|
||||
log_debug("%s is duplicate.", f->path);
|
||||
continue;
|
||||
}
|
||||
|
||||
*t = *f;
|
||||
|
||||
previous = t;
|
||||
t++;
|
||||
}
|
||||
|
||||
*n = t - m;
|
||||
}
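The interplay of mount_path_compare() and drop_duplicates() above (sort by path, then by mode, so that for equal paths the first and most restrictive entry survives) can be illustrated in isolation. This is a sketch under stated assumptions: the Sketch* types are hypothetical, the enum is ordered most-restrictive-first purely for the illustration, and plain strcmp() stands in for path_compare()/path_equal(), which is enough for exact duplicates but is not path-aware.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical modes, ordered from most to least restrictive. */
typedef enum SketchMode {
        SKETCH_INACCESSIBLE,
        SKETCH_READONLY,
        SKETCH_READWRITE,
} SketchMode;

typedef struct SketchMount {
        const char *path;
        SketchMode mode;
} SketchMount;

/* qsort() comparator: order by path first, then by mode. */
static int sketch_compare(const void *a, const void *b) {
        const SketchMount *p = a, *q = b;
        int d;

        d = strcmp(p->path, q->path); /* assumption: stands in for path_compare() */
        if (d != 0)
                return d;

        /* If the paths are equal, the more restrictive mode sorts first. */
        if (p->mode < q->mode)
                return -1;
        if (p->mode > q->mode)
                return 1;
        return 0;
}

/* Compacts a sorted array in place, keeping the first entry of each path.
 * Returns the new number of entries. */
static size_t sketch_drop_duplicates(SketchMount *m, size_t n) {
        SketchMount *f, *t, *previous = NULL;

        assert(m);

        for (f = m, t = m; f < m + n; f++) {
                if (previous && strcmp(f->path, previous->path) == 0)
                        continue; /* the first one wins */

                *t = *f;
                previous = t;
                t++;
        }

        return (size_t) (t - m);
}
```

For the input { "/usr" read-write, "/etc" read-only, "/usr" read-only }, sorting followed by the duplicate pass leaves "/etc" read-only and "/usr" read-only.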
|
||||
|
||||
static void drop_inaccessible(BindMount *m, unsigned *n) {
|
||||
BindMount *f, *t;
|
||||
const char *clear = NULL;
|
||||
|
||||
assert(m);
|
||||
assert(n);
|
||||
|
||||
/* Drops all entries obstructed by another entry further up the tree. Expects that the array is properly
|
||||
* ordered already. */
|
||||
|
||||
for (f = m, t = m; f < m+*n; f++) {
|
||||
|
||||
/* If we found a path set for INACCESSIBLE earlier, and this entry has it as prefix we should drop
|
||||
* it, as inaccessible paths really should drop the entire subtree. */
|
||||
if (clear && path_startswith(f->path, clear)) {
|
||||
log_debug("%s is masked by %s.", f->path, clear);
|
||||
continue;
|
||||
}
|
||||
|
||||
clear = f->mode == INACCESSIBLE ? f->path : NULL;
|
||||
|
||||
*t = *f;
|
||||
t++;
|
||||
}
|
||||
|
||||
*n = t - m;
|
||||
}
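The masking pass of drop_inaccessible() above can be tried out on a small array. This is a minimal sketch, not the systemd code: the Mask* types and mask_* functions are hypothetical, and mask_path_startswith() is a simplified stand-in for path_startswith() that only handles already-normalized absolute paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum MaskMode {
        MASK_INACCESSIBLE,
        MASK_READONLY,
        MASK_READWRITE,
} MaskMode;

typedef struct MaskMount {
        const char *path;
        MaskMode mode;
} MaskMount;

/* Simplified stand-in for path_startswith(): true if 'path' equals 'prefix'
 * or lies below it. Assumes normalized absolute paths. */
static bool mask_path_startswith(const char *path, const char *prefix) {
        size_t l = strlen(prefix);

        if (strncmp(path, prefix, l) != 0)
                return false;
        if (l > 0 && prefix[l - 1] == '/') /* e.g. prefix "/" */
                return true;
        return path[l] == '/' || path[l] == '\0';
}

/* Drops every entry located below an earlier INACCESSIBLE entry. Expects the
 * array to be sorted so that parents precede their children. Returns the new
 * number of entries. */
static size_t mask_drop_inaccessible(MaskMount *m, size_t n) {
        MaskMount *f, *t;
        const char *clear = NULL;

        assert(m);

        for (f = m, t = m; f < m + n; f++) {
                /* Anything below an inaccessible path is hidden anyway. */
                if (clear && mask_path_startswith(f->path, clear))
                        continue;

                clear = f->mode == MASK_INACCESSIBLE ? f->path : NULL;

                *t = *f;
                t++;
        }

        return (size_t) (t - m);
}
```

Because skipped entries do not reset the remembered prefix, a whole subtree below an inaccessible path is dropped, not just its direct children.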
|
||||
|
||||
static void drop_nop(BindMount *m, unsigned *n) {
|
||||
BindMount *f, *t;
|
||||
|
||||
assert(m);
|
||||
assert(n);
|
||||
|
||||
/* Drops all entries which have an immediate parent that has the same type, as they are redundant. Assumes the
|
||||
* list is ordered by prefixes. */
|
||||
|
||||
for (f = m, t = m; f < m+*n; f++) {
|
||||
|
||||
/* Only suppress such subtrees for READONLY and READWRITE entries */
|
||||
if (IN_SET(f->mode, READONLY, READWRITE)) {
|
||||
BindMount *p;
|
||||
bool found = false;
|
||||
|
||||
/* Now let's find the first parent of the entry we are looking at. */
|
||||
for (p = t-1; p >= m; p--) {
|
||||
if (path_startswith(f->path, p->path)) {
|
||||
found = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
/* We found it, let's see if it's the same mode, if so, we can drop this entry */
|
||||
if (found && p->mode == f->mode) {
|
||||
log_debug("%s is redundant by %s", f->path, p->path);
|
||||
continue;
|
||||
}
|
||||
}
|
||||
|
||||
*t = *f;
|
||||
t++;
|
||||
}
|
||||
|
||||
*n = t - m;
|
||||
}
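The redundancy check of drop_nop() above (an entry is dropped when its closest kept parent already has the same READONLY or READWRITE mode) can be sketched as follows. The Nop* types and nop_* functions are hypothetical, and nop_path_startswith() is a simplified stand-in for path_startswith() that assumes normalized absolute paths. The backwards search is written so that it never forms a pointer before the start of the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum NopMode {
        NOP_INACCESSIBLE,
        NOP_READONLY,
        NOP_READWRITE,
} NopMode;

typedef struct NopMount {
        const char *path;
        NopMode mode;
} NopMount;

/* Simplified stand-in for path_startswith(). */
static bool nop_path_startswith(const char *path, const char *prefix) {
        size_t l = strlen(prefix);

        if (strncmp(path, prefix, l) != 0)
                return false;
        if (l > 0 && prefix[l - 1] == '/')
                return true;
        return path[l] == '/' || path[l] == '\0';
}

/* Drops entries whose closest kept parent has the same mode. Expects the array
 * to be sorted by prefix and free of duplicates. Returns the new count. */
static size_t nop_drop_nop(NopMount *m, size_t n) {
        NopMount *f, *t;

        assert(m);

        for (f = m, t = m; f < m + n; f++) {

                /* Only suppress such subtrees for READONLY and READWRITE entries. */
                if (f->mode == NOP_READONLY || f->mode == NOP_READWRITE) {
                        NopMount *p = t;
                        bool found = false;

                        /* Walk the kept entries backwards to find the closest parent. */
                        while (p > m) {
                                p--;
                                if (nop_path_startswith(f->path, p->path)) {
                                        found = true;
                                        break;
                                }
                        }

                        /* Same mode as the parent: the entry changes nothing. */
                        if (found && p->mode == f->mode)
                                continue;
                }

                *t = *f;
                t++;
        }

        return (size_t) (t - m);
}
```

Given "/usr" read-only, "/usr/lib" read-only, "/usr/local" read-write and "/usr/local/share" read-write, only "/usr" and "/usr/local" remain, since each of the other two merely repeats the mode of its parent.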
|
||||
|
||||
static void drop_outside_root(const char *root_directory, BindMount *m, unsigned *n) {
|
||||
BindMount *f, *t;
|
||||
|
||||
assert(m);
|
||||
assert(n);
|
||||
|
||||
if (!root_directory)
|
||||
return;
|
||||
|
||||
/* Drops all mounts that are outside of the root directory. */
|
||||
|
||||
for (f = m, t = m; f < m+*n; f++) {
|
||||
|
||||
if (!path_startswith(f->path, root_directory)) {
|
||||
log_debug("%s is outside of root directory.", f->path);
|
||||
continue;
|
||||
}
|
||||
|
||||
*t = *f;
|
||||
t++;
|
||||
}
|
||||
|
||||
@ -278,24 +542,23 @@ static int apply_mount(
|
||||
|
||||
const char *what;
|
||||
int r;
|
||||
struct stat target;
|
||||
|
||||
assert(m);
|
||||
|
||||
log_debug("Applying namespace mount on %s", m->path);
|
||||
|
||||
switch (m->mode) {
|
||||
|
||||
case INACCESSIBLE:
|
||||
case INACCESSIBLE: {
|
||||
struct stat target;
|
||||
|
||||
/* First, get rid of everything that is below if there
|
||||
* is anything... Then, overmount it with an
|
||||
* inaccessible path. */
|
||||
umount_recursive(m->path, 0);
|
||||
(void) umount_recursive(m->path, 0);
|
||||
|
||||
if (lstat(m->path, &target) < 0) {
|
||||
if (m->ignore && errno == ENOENT)
|
||||
return 0;
|
||||
return -errno;
|
||||
}
|
||||
if (lstat(m->path, &target) < 0)
|
||||
return log_debug_errno(errno, "Failed to lstat() %s to determine what to mount over it: %m", m->path);
|
||||
|
||||
what = mode_to_inaccessible_node(target.st_mode);
|
||||
if (!what) {
|
||||
@ -303,11 +566,20 @@ static int apply_mount(
|
||||
return -ELOOP;
|
||||
}
|
||||
break;
|
||||
}
|
||||
|
||||
case READONLY:
|
||||
case READWRITE:
|
||||
/* Nothing to mount here, we just later toggle the
|
||||
* MS_RDONLY bit for the mount point */
|
||||
return 0;
|
||||
|
||||
r = path_is_mount_point(m->path, 0);
|
||||
if (r < 0)
|
||||
return log_debug_errno(r, "Failed to determine whether %s is already a mount point: %m", m->path);
|
||||
if (r > 0) /* Nothing to do here, it is already a mount. We just later toggle the MS_RDONLY bit for the mount point if needed. */
|
||||
return 0;
|
||||
|
||||
/* This isn't a mount point yet, let's make it one. */
|
||||
what = m->path;
|
||||
break;
|
||||
|
||||
case PRIVATE_TMP:
|
||||
what = tmp_dir;
|
||||
@ -326,38 +598,104 @@ static int apply_mount(
|
||||
|
||||
assert(what);
|
||||
|
||||
r = mount(what, m->path, NULL, MS_BIND|MS_REC, NULL);
|
||||
if (r >= 0) {
|
||||
log_debug("Successfully mounted %s to %s", what, m->path);
|
||||
return r;
|
||||
} else {
|
||||
if (m->ignore && errno == ENOENT)
|
||||
return 0;
|
||||
if (mount(what, m->path, NULL, MS_BIND|MS_REC, NULL) < 0)
|
||||
return log_debug_errno(errno, "Failed to mount %s to %s: %m", what, m->path);
|
||||
}
|
||||
|
||||
log_debug("Successfully mounted %s to %s", what, m->path);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int make_read_only(BindMount *m) {
|
||||
int r;
|
||||
static int make_read_only(BindMount *m, char **blacklist) {
|
||||
int r = 0;
|
||||
|
||||
assert(m);
|
||||
|
||||
if (IN_SET(m->mode, INACCESSIBLE, READONLY))
|
||||
r = bind_remount_recursive(m->path, true);
|
||||
else if (IN_SET(m->mode, READWRITE, PRIVATE_TMP, PRIVATE_VAR_TMP, PRIVATE_DEV)) {
|
||||
r = bind_remount_recursive(m->path, false);
|
||||
if (r == 0 && m->mode == PRIVATE_DEV) /* can be readonly but the submounts can't*/
|
||||
if (mount(NULL, m->path, NULL, MS_REMOUNT|DEV_MOUNT_OPTIONS|MS_RDONLY, NULL) < 0)
|
||||
r = -errno;
|
||||
r = bind_remount_recursive(m->path, true, blacklist);
|
||||
else if (m->mode == PRIVATE_DEV) { /* Can be read-only, but the submounts can't */
|
||||
if (mount(NULL, m->path, NULL, MS_REMOUNT|DEV_MOUNT_OPTIONS|MS_RDONLY, NULL) < 0)
|
||||
r = -errno;
|
||||
} else
|
||||
r = 0;
|
||||
|
||||
if (m->ignore && r == -ENOENT)
|
||||
return 0;
|
||||
|
||||
/* Note that we only turn on the MS_RDONLY flag here, we never turn it off. Something that was marked read-only
|
||||
* already stays this way. This improves compatibility with container managers, where we won't attempt to undo
|
||||
* read-only mounts already applied. */
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
static int chase_all_symlinks(const char *root_directory, BindMount *m, unsigned *n) {
|
||||
BindMount *f, *t;
|
||||
int r;
|
||||
|
||||
assert(m);
|
||||
assert(n);
|
||||
|
||||
/* Since mount() will always follow symlinks and we need to take the different root directory into account we
|
||||
* chase the symlinks on our own first. This call will do so for all entries and remove all entries where we
|
||||
* can't resolve the path, and which have been marked for such removal. */
|
||||
|
||||
for (f = m, t = m; f < m+*n; f++) {
|
||||
|
||||
r = chase_symlinks(f->path, root_directory, &f->chased);
|
||||
if (r == -ENOENT && f->ignore) /* Doesn't exist? Then remove it! */
|
||||
continue;
|
||||
if (r < 0)
|
||||
return log_debug_errno(r, "Failed to chase symlinks for %s: %m", f->path);
|
||||
|
||||
if (path_equal(f->path, f->chased))
|
||||
f->chased = mfree(f->chased);
|
||||
else {
|
||||
log_debug("Chased %s → %s", f->path, f->chased);
|
||||
f->path = f->chased;
|
||||
}
|
||||
|
||||
*t = *f;
|
||||
t++;
|
||||
}
|
||||
|
||||
*n = t - m;
|
||||
return 0;
|
||||
}
|
||||
|
||||
static unsigned namespace_calculate_mounts(
|
||||
char** read_write_paths,
|
||||
char** read_only_paths,
|
||||
char** inaccessible_paths,
|
||||
const char* tmp_dir,
|
||||
const char* var_tmp_dir,
|
||||
bool private_dev,
|
||||
bool protect_sysctl,
|
||||
bool protect_cgroups,
|
||||
ProtectHome protect_home,
|
||||
ProtectSystem protect_system) {
|
||||
|
||||
unsigned protect_home_cnt;
|
||||
unsigned protect_system_cnt =
|
||||
(protect_system == PROTECT_SYSTEM_STRICT ?
|
||||
ELEMENTSOF(protect_system_strict_table) :
|
||||
((protect_system == PROTECT_SYSTEM_FULL) ?
|
||||
ELEMENTSOF(protect_system_full_table) :
|
||||
((protect_system == PROTECT_SYSTEM_YES) ?
|
||||
ELEMENTSOF(protect_system_yes_table) : 0)));
|
||||
|
||||
protect_home_cnt =
|
||||
(protect_home == PROTECT_HOME_YES ?
|
||||
ELEMENTSOF(protect_home_yes_table) :
|
||||
((protect_home == PROTECT_HOME_READ_ONLY) ?
|
||||
ELEMENTSOF(protect_home_read_only_table) : 0));
|
||||
|
||||
return !!tmp_dir + !!var_tmp_dir +
|
||||
strv_length(read_write_paths) +
|
||||
strv_length(read_only_paths) +
|
||||
strv_length(inaccessible_paths) +
|
||||
private_dev +
|
||||
(protect_sysctl ? ELEMENTSOF(protect_kernel_tunables_table) : 0) +
|
||||
(protect_cgroups ? 1 : 0) +
|
||||
protect_home_cnt + protect_system_cnt;
|
||||
}
|
||||
|
||||
int setup_namespace(
|
||||
const char* root_directory,
|
||||
char** read_write_paths,
|
||||
@ -366,28 +704,31 @@ int setup_namespace(
|
||||
const char* tmp_dir,
|
||||
const char* var_tmp_dir,
|
||||
bool private_dev,
|
||||
bool protect_sysctl,
|
||||
bool protect_cgroups,
|
||||
ProtectHome protect_home,
|
||||
ProtectSystem protect_system,
|
||||
unsigned long mount_flags) {
|
||||
|
||||
BindMount *m, *mounts = NULL;
|
||||
bool make_slave = false;
|
||||
unsigned n;
|
||||
int r = 0;
|
||||
|
||||
if (mount_flags == 0)
|
||||
mount_flags = MS_SHARED;
|
||||
|
||||
if (unshare(CLONE_NEWNS) < 0)
|
||||
return -errno;
|
||||
n = namespace_calculate_mounts(read_write_paths,
|
||||
read_only_paths,
|
||||
inaccessible_paths,
|
||||
tmp_dir, var_tmp_dir,
|
||||
private_dev, protect_sysctl,
|
||||
protect_cgroups, protect_home,
|
||||
protect_system);
|
||||
|
||||
n = !!tmp_dir + !!var_tmp_dir +
|
||||
strv_length(read_write_paths) +
|
||||
strv_length(read_only_paths) +
|
||||
strv_length(inaccessible_paths) +
|
||||
private_dev +
|
||||
(protect_home != PROTECT_HOME_NO ? 3 : 0) +
|
||||
(protect_system != PROTECT_SYSTEM_NO ? 2 : 0) +
|
||||
(protect_system == PROTECT_SYSTEM_FULL ? 1 : 0);
|
||||
/* Set mount slave mode */
|
||||
if (root_directory || n > 0)
|
||||
make_slave = true;
|
||||
|
||||
if (n > 0) {
|
||||
m = mounts = (BindMount *) alloca0(n * sizeof(BindMount));
|
||||
@ -421,95 +762,113 @@ int setup_namespace(
|
||||
m++;
|
||||
}
|
||||
|
||||
if (protect_home != PROTECT_HOME_NO) {
|
||||
const char *home_dir, *run_user_dir, *root_dir;
|
||||
if (protect_sysctl)
|
||||
append_protect_kernel_tunables(&m, root_directory);
|
||||
|
||||
home_dir = prefix_roota(root_directory, "/home");
|
||||
home_dir = strjoina("-", home_dir);
|
||||
run_user_dir = prefix_roota(root_directory, "/run/user");
|
||||
run_user_dir = strjoina("-", run_user_dir);
|
||||
root_dir = prefix_roota(root_directory, "/root");
|
||||
root_dir = strjoina("-", root_dir);
|
||||
|
||||
r = append_mounts(&m, STRV_MAKE(home_dir, run_user_dir, root_dir),
|
||||
protect_home == PROTECT_HOME_READ_ONLY ? READONLY : INACCESSIBLE);
|
||||
if (r < 0)
|
||||
return r;
|
||||
if (protect_cgroups) {
|
||||
m->path = prefix_roota(root_directory, "/sys/fs/cgroup");
|
||||
m->mode = READONLY;
|
||||
m++;
|
||||
}
|
||||
|
||||
if (protect_system != PROTECT_SYSTEM_NO) {
|
||||
const char *usr_dir, *boot_dir, *etc_dir;
|
||||
r = append_protect_home(&m, root_directory, protect_home);
|
||||
if (r < 0)
|
||||
return r;
|
||||
|
||||
usr_dir = prefix_roota(root_directory, "/usr");
|
||||
boot_dir = prefix_roota(root_directory, "/boot");
|
||||
boot_dir = strjoina("-", boot_dir);
|
||||
etc_dir = prefix_roota(root_directory, "/etc");
|
||||
|
||||
r = append_mounts(&m, protect_system == PROTECT_SYSTEM_FULL
|
||||
? STRV_MAKE(usr_dir, boot_dir, etc_dir)
|
||||
: STRV_MAKE(usr_dir, boot_dir), READONLY);
|
||||
if (r < 0)
|
||||
return r;
|
||||
}
|
||||
r = append_protect_system(&m, root_directory, protect_system);
|
||||
if (r < 0)
|
||||
return r;
|
||||
|
||||
assert(mounts + n == m);
|
||||
|
||||
/* Resolve symlinks manually first, as mount() will always follow them relative to the host's
|
||||
* root. Moreover we want to suppress duplicates based on the resolved paths. This of course is a bit
|
||||
* racy. */
|
||||
r = chase_all_symlinks(root_directory, mounts, &n);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
|
||||
qsort(mounts, n, sizeof(BindMount), mount_path_compare);
|
||||
|
||||
drop_duplicates(mounts, &n);
|
||||
drop_outside_root(root_directory, mounts, &n);
|
||||
drop_inaccessible(mounts, &n);
|
||||
drop_nop(mounts, &n);
|
||||
}
|
||||
|
||||
if (n > 0 || root_directory) {
|
||||
if (unshare(CLONE_NEWNS) < 0) {
|
||||
r = -errno;
|
||||
goto finish;
|
||||
}
|
||||
|
||||
if (make_slave) {
|
||||
/* Remount / as SLAVE so that nothing now mounted in the namespace
|
||||
shows up in the parent */
|
||||
if (mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL) < 0)
|
||||
return -errno;
|
||||
if (mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL) < 0) {
|
||||
r = -errno;
|
||||
goto finish;
|
||||
}
|
||||
}
|
||||
|
||||
if (root_directory) {
|
||||
/* Turn directory into bind mount */
|
||||
if (mount(root_directory, root_directory, NULL, MS_BIND|MS_REC, NULL) < 0)
|
||||
return -errno;
|
||||
/* Turn directory into bind mount, if it isn't one yet */
|
||||
r = path_is_mount_point(root_directory, AT_SYMLINK_FOLLOW);
|
||||
if (r < 0)
|
||||
goto finish;
|
||||
if (r == 0) {
|
||||
if (mount(root_directory, root_directory, NULL, MS_BIND|MS_REC, NULL) < 0) {
|
||||
r = -errno;
|
||||
goto finish;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (n > 0) {
|
||||
char **blacklist;
|
||||
unsigned j;
|
||||
|
||||
/* First round, add in all special mounts we need */
|
||||
for (m = mounts; m < mounts + n; ++m) {
|
||||
r = apply_mount(m, tmp_dir, var_tmp_dir);
|
||||
if (r < 0)
|
||||
goto fail;
|
||||
goto finish;
|
||||
}
|
||||
|
||||
/* Create a blacklist we can pass to bind_mount_recursive() */
|
||||
blacklist = newa(char*, n+1);
|
||||
for (j = 0; j < n; j++)
|
||||
blacklist[j] = (char*) mounts[j].path;
|
||||
blacklist[j] = NULL;
|
||||
|
||||
/* Second round, flip the ro bits if necessary. */
|
||||
for (m = mounts; m < mounts + n; ++m) {
|
||||
r = make_read_only(m);
|
||||
r = make_read_only(m, blacklist);
|
||||
if (r < 0)
|
||||
goto fail;
|
||||
goto finish;
|
||||
}
|
||||
}
|
||||
|
||||
if (root_directory) {
|
||||
/* MS_MOVE does not work on MS_SHARED so the remount MS_SHARED will be done later */
|
||||
r = mount_move_root(root_directory);
|
||||
|
||||
/* at this point, we cannot rollback */
|
||||
if (r < 0)
|
||||
return r;
|
||||
goto finish;
|
||||
}
|
||||
|
||||
/* Remount / as the desired mode. Note that this will not
|
||||
* reestablish propagation from our side to the host, since
|
||||
* what's disconnected is disconnected. */
|
||||
if (mount(NULL, "/", NULL, mount_flags | MS_REC, NULL) < 0)
|
||||
/* at this point, we cannot rollback */
|
||||
return -errno;
|
||||
|
||||
return 0;
|
||||
|
||||
fail:
|
||||
if (n > 0) {
|
||||
for (m = mounts; m < mounts + n; ++m)
|
||||
if (m->done)
|
||||
(void) umount2(m->path, MNT_DETACH);
|
||||
if (mount(NULL, "/", NULL, mount_flags | MS_REC, NULL) < 0) {
|
||||
r = -errno;
|
||||
goto finish;
|
||||
}
|
||||
|
||||
r = 0;
|
||||
|
||||
finish:
|
||||
for (m = mounts; m < mounts + n; m++)
|
||||
free(m->chased);
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
@ -658,6 +1017,7 @@ static const char *const protect_system_table[_PROTECT_SYSTEM_MAX] = {
|
||||
[PROTECT_SYSTEM_NO] = "no",
|
||||
[PROTECT_SYSTEM_YES] = "yes",
|
||||
[PROTECT_SYSTEM_FULL] = "full",
|
||||
[PROTECT_SYSTEM_STRICT] = "strict",
|
||||
};
|
||||
|
||||
DEFINE_STRING_TABLE_LOOKUP(protect_system, ProtectSystem);
|
||||
|
@ -35,6 +35,7 @@ typedef enum ProtectSystem {
|
||||
PROTECT_SYSTEM_NO,
|
||||
PROTECT_SYSTEM_YES,
|
||||
PROTECT_SYSTEM_FULL,
|
||||
PROTECT_SYSTEM_STRICT,
|
||||
_PROTECT_SYSTEM_MAX,
|
||||
_PROTECT_SYSTEM_INVALID = -1
|
||||
} ProtectSystem;
|
||||
@ -46,6 +47,8 @@ int setup_namespace(const char *chroot,
|
||||
const char *tmp_dir,
|
||||
const char *var_tmp_dir,
|
||||
bool private_dev,
|
||||
bool protect_sysctl,
|
||||
bool protect_cgroups,
|
||||
ProtectHome protect_home,
|
||||
ProtectSystem protect_system,
|
||||
unsigned long mount_flags);
|
||||
|
@ -3377,8 +3377,14 @@ int unit_patch_contexts(Unit *u) {
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
/* If the dynamic user option is on, let's make sure that the unit can't leave its UID/GID
|
||||
* around in the file system or on IPC objects. Hence enforce a strict sandbox. */
|
||||
|
||||
ec->private_tmp = true;
|
||||
ec->remove_ipc = true;
|
||||
ec->protect_system = PROTECT_SYSTEM_STRICT;
|
||||
if (ec->protect_home == PROTECT_HOME_NO)
|
||||
ec->protect_home = PROTECT_HOME_READ_ONLY;
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -314,19 +314,21 @@ int mount_all(const char *dest,
|
||||
} MountPoint;
|
||||
|
||||
static const MountPoint mount_table[] = {
|
||||
{ "proc", "/proc", "proc", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, true, true, false },
|
||||
{ "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND, true, true, false }, /* Bind mount first ...*/
|
||||
{ "/proc/sys/net", "/proc/sys/net", NULL, NULL, MS_BIND, true, true, true }, /* (except for this) */
|
||||
{ NULL, "/proc/sys", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, true, true, false }, /* ... then, make it r/o */
|
||||
{ "tmpfs", "/sys", "tmpfs", "mode=755", MS_NOSUID|MS_NOEXEC|MS_NODEV, true, false, true },
|
||||
{ "sysfs", "/sys", "sysfs", NULL, MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV, true, false, false },
|
||||
{ "tmpfs", "/dev", "tmpfs", "mode=755", MS_NOSUID|MS_STRICTATIME, true, false, false },
|
||||
{ "tmpfs", "/dev/shm", "tmpfs", "mode=1777", MS_NOSUID|MS_NODEV|MS_STRICTATIME, true, false, false },
|
||||
{ "tmpfs", "/run", "tmpfs", "mode=755", MS_NOSUID|MS_NODEV|MS_STRICTATIME, true, false, false },
|
||||
{ "tmpfs", "/tmp", "tmpfs", "mode=1777", MS_STRICTATIME, true, false, false },
|
||||
{ "proc", "/proc", "proc", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, true, true, false },
|
||||
{ "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND, true, true, false }, /* Bind mount first ...*/
|
||||
{ "/proc/sys/net", "/proc/sys/net", NULL, NULL, MS_BIND, true, true, true }, /* (except for this) */
|
||||
{ NULL, "/proc/sys", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, true, true, false }, /* ... then, make it r/o */
|
||||
{ "/proc/sysrq-trigger", "/proc/sysrq-trigger", NULL, NULL, MS_BIND, false, true, false }, /* Bind mount first ...*/
|
||||
{ NULL, "/proc/sysrq-trigger", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, false, true, false }, /* ... then, make it r/o */
|
||||
{ "tmpfs", "/sys", "tmpfs", "mode=755", MS_NOSUID|MS_NOEXEC|MS_NODEV, true, false, true },
|
||||
{ "sysfs", "/sys", "sysfs", NULL, MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV, true, false, false },
|
||||
{ "tmpfs", "/dev", "tmpfs", "mode=755", MS_NOSUID|MS_STRICTATIME, true, false, false },
|
||||
{ "tmpfs", "/dev/shm", "tmpfs", "mode=1777", MS_NOSUID|MS_NODEV|MS_STRICTATIME, true, false, false },
|
||||
{ "tmpfs", "/run", "tmpfs", "mode=755", MS_NOSUID|MS_NODEV|MS_STRICTATIME, true, false, false },
|
||||
{ "tmpfs", "/tmp", "tmpfs", "mode=1777", MS_STRICTATIME, true, false, false },
|
||||
#ifdef HAVE_SELINUX
|
||||
{ "/sys/fs/selinux", "/sys/fs/selinux", NULL, NULL, MS_BIND, false, false, false }, /* Bind mount first */
|
||||
{ NULL, "/sys/fs/selinux", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, false, false, false }, /* Then, make it r/o */
|
||||
{ "/sys/fs/selinux", "/sys/fs/selinux", NULL, NULL, MS_BIND, false, false, false }, /* Bind mount first */
|
||||
{ NULL, "/sys/fs/selinux", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, false, false, false }, /* Then, make it r/o */
|
||||
#endif
|
||||
};
|
||||
|
||||
@ -356,7 +358,7 @@ int mount_all(const char *dest,
|
||||
continue;
|
||||
|
||||
r = mkdir_p(where, 0755);
|
||||
if (r < 0) {
|
||||
if (r < 0 && r != -EEXIST) {
|
||||
if (mount_table[k].fatal)
|
||||
return log_error_errno(r, "Failed to create directory %s: %m", where);
|
||||
|
||||
@ -476,7 +478,7 @@ static int mount_bind(const char *dest, CustomMount *m) {
|
||||
return log_error_errno(errno, "mount(%s) failed: %m", where);
|
||||
|
||||
if (m->read_only) {
|
||||
r = bind_remount_recursive(where, true);
|
||||
r = bind_remount_recursive(where, true, NULL);
|
||||
if (r < 0)
|
||||
return log_error_errno(r, "Read-only bind mount failed: %m");
|
||||
}
|
||||
@ -990,7 +992,7 @@ int setup_volatile_state(
|
||||
/* --volatile=state means we simply overmount /var
|
||||
with a tmpfs, and the rest read-only. */
|
||||
|
||||
r = bind_remount_recursive(directory, true);
|
||||
r = bind_remount_recursive(directory, true, NULL);
|
||||
if (r < 0)
|
||||
return log_error_errno(r, "Failed to remount %s read-only: %m", directory);
|
||||
|
||||
@ -1065,7 +1067,7 @@ int setup_volatile(
|
||||
|
||||
bind_mounted = true;
|
||||
|
||||
r = bind_remount_recursive(t, true);
|
||||
r = bind_remount_recursive(t, true, NULL);
|
||||
if (r < 0) {
|
||||
log_error_errno(r, "Failed to remount %s read-only: %m", t);
|
||||
goto fail;
|
||||
|
@ -3019,7 +3019,7 @@ static int outer_child(
|
||||
return r;
|
||||
|
||||
if (arg_read_only) {
|
||||
r = bind_remount_recursive(directory, true);
|
||||
r = bind_remount_recursive(directory, true, NULL);
|
||||
if (r < 0)
|
||||
return log_error_errno(r, "Failed to make tree read-only: %m");
|
||||
}
|
||||
|
@ -1168,17 +1168,21 @@ static int start_transient_scope(
|
||||
uid_t uid;
|
||||
gid_t gid;
|
||||
|
||||
r = get_user_creds(&arg_exec_user, &uid, &gid, &home, &shell);
|
||||
r = get_user_creds_clean(&arg_exec_user, &uid, &gid, &home, &shell);
|
||||
if (r < 0)
|
||||
return log_error_errno(r, "Failed to resolve user %s: %m", arg_exec_user);
|
||||
|
||||
r = strv_extendf(&user_env, "HOME=%s", home);
|
||||
if (r < 0)
|
||||
return log_oom();
|
||||
if (home) {
|
||||
r = strv_extendf(&user_env, "HOME=%s", home);
|
||||
if (r < 0)
|
||||
return log_oom();
|
||||
}
|
||||
|
||||
r = strv_extendf(&user_env, "SHELL=%s", shell);
|
||||
if (r < 0)
|
||||
return log_oom();
|
||||
if (shell) {
|
||||
r = strv_extendf(&user_env, "SHELL=%s", shell);
|
||||
if (r < 0)
|
||||
return log_oom();
|
||||
}
|
||||
|
||||
r = strv_extendf(&user_env, "USER=%s", arg_exec_user);
|
||||
if (r < 0)
|
||||
|
@ -204,7 +204,7 @@ int bus_append_unit_property_assignment(sd_bus_message *m, const char *assignmen
|
||||
"IgnoreSIGPIPE", "TTYVHangup", "TTYReset", "RemainAfterExit",
|
||||
"PrivateTmp", "PrivateDevices", "PrivateNetwork", "PrivateUsers", "NoNewPrivileges",
|
||||
"SyslogLevelPrefix", "Delegate", "RemainAfterElapse", "MemoryDenyWriteExecute",
|
||||
"RestrictRealtime", "DynamicUser", "RemoveIPC")) {
|
||||
"RestrictRealtime", "DynamicUser", "RemoveIPC", "ProtectKernelTunables", "ProtectControlGroups")) {
|
||||
|
||||
r = parse_boolean(eq);
|
||||
if (r < 0)
|
||||
|
@ -133,6 +133,28 @@ static void test_exec_privatedevices(Manager *m) {
|
||||
test(m, "exec-privatedevices-no.service", 0, CLD_EXITED);
|
||||
}
|
||||
|
||||
static void test_exec_privatedevices_capabilities(Manager *m) {
|
||||
if (detect_container() > 0) {
|
||||
log_notice("testing in container, skipping private device tests");
|
||||
return;
|
||||
}
|
||||
test(m, "exec-privatedevices-yes-capability-mknod.service", 0, CLD_EXITED);
|
||||
test(m, "exec-privatedevices-no-capability-mknod.service", 0, CLD_EXITED);
|
||||
}
|
||||
|
||||
static void test_exec_readonlypaths(Manager *m) {
|
||||
test(m, "exec-readonlypaths.service", 0, CLD_EXITED);
|
||||
test(m, "exec-readonlypaths-mount-propagation.service", 0, CLD_EXITED);
|
||||
}
|
||||
|
||||
static void test_exec_readwritepaths(Manager *m) {
|
||||
test(m, "exec-readwritepaths-mount-propagation.service", 0, CLD_EXITED);
|
||||
}
|
||||
|
||||
static void test_exec_inaccessiblepaths(Manager *m) {
|
||||
test(m, "exec-inaccessiblepaths-mount-propagation.service", 0, CLD_EXITED);
|
||||
}
|
||||
|
||||
static void test_exec_systemcallfilter(Manager *m) {
|
||||
#ifdef HAVE_SECCOMP
|
||||
if (!is_seccomp_available())
|
||||
@ -345,6 +367,10 @@ int main(int argc, char *argv[]) {
|
||||
test_exec_ignoresigpipe,
|
||||
test_exec_privatetmp,
|
||||
test_exec_privatedevices,
|
||||
test_exec_privatedevices_capabilities,
|
||||
test_exec_readonlypaths,
|
||||
test_exec_readwritepaths,
|
||||
test_exec_inaccessiblepaths,
|
||||
test_exec_privatenetwork,
|
||||
test_exec_systemcallfilter,
|
||||
test_exec_systemcallerrornumber,
|
||||
|
@ -20,16 +20,109 @@
|
||||
#include <unistd.h>
|
||||
|
||||
#include "alloc-util.h"
|
||||
#include "fileio.h"
|
||||
#include "fd-util.h"
|
||||
#include "fileio.h"
|
||||
#include "fs-util.h"
|
||||
#include "macro.h"
|
||||
#include "mkdir.h"
|
||||
#include "path-util.h"
|
||||
#include "rm-rf.h"
|
||||
#include "string-util.h"
|
||||
#include "strv.h"
|
||||
#include "util.h"
|
||||
|
||||
static void test_chase_symlinks(void) {
|
||||
_cleanup_free_ char *result = NULL;
|
||||
char temp[] = "/tmp/test-chase.XXXXXX";
|
||||
const char *top, *p, *q;
|
||||
int r;
|
||||
|
||||
assert_se(mkdtemp(temp));
|
||||
|
||||
top = strjoina(temp, "/top");
|
||||
assert_se(mkdir(top, 0700) >= 0);
|
||||
|
||||
p = strjoina(top, "/dot");
|
||||
assert_se(symlink(".", p) >= 0);
|
||||
|
||||
p = strjoina(top, "/dotdot");
|
||||
assert_se(symlink("..", p) >= 0);
|
||||
|
||||
p = strjoina(top, "/dotdota");
|
||||
assert_se(symlink("../a", p) >= 0);
|
||||
|
||||
p = strjoina(temp, "/a");
|
||||
assert_se(symlink("b", p) >= 0);
|
||||
|
||||
p = strjoina(temp, "/b");
|
||||
assert_se(symlink("/usr", p) >= 0);
|
||||
|
||||
p = strjoina(temp, "/start");
|
||||
assert_se(symlink("top/dot/dotdota", p) >= 0);
|
||||
|
||||
r = chase_symlinks(p, NULL, &result);
|
||||
assert_se(r >= 0);
|
||||
assert_se(path_equal(result, "/usr"));
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks(p, temp, &result);
|
||||
assert_se(r == -ENOENT);
|
||||
|
||||
q = strjoina(temp, "/usr");
|
||||
assert_se(mkdir(q, 0700) >= 0);
|
||||
|
||||
r = chase_symlinks(p, temp, &result);
|
||||
assert_se(r >= 0);
|
||||
assert_se(path_equal(result, q));
|
||||
|
||||
p = strjoina(temp, "/slash");
|
||||
assert_se(symlink("/", p) >= 0);
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks(p, NULL, &result);
|
||||
assert_se(r >= 0);
|
||||
assert_se(path_equal(result, "/"));
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks(p, temp, &result);
|
||||
assert_se(r >= 0);
|
||||
assert_se(path_equal(result, temp));
|
||||
|
||||
p = strjoina(temp, "/slashslash");
|
||||
assert_se(symlink("///usr///", p) >= 0);
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks(p, NULL, &result);
|
||||
assert_se(r >= 0);
|
||||
assert_se(path_equal(result, "/usr"));
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks(p, temp, &result);
|
||||
assert_se(r >= 0);
|
||||
assert_se(path_equal(result, q));
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks("/etc/./.././", NULL, &result);
|
||||
assert_se(r >= 0);
|
||||
assert_se(path_equal(result, "/"));
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks("/etc/./.././", "/etc", &result);
|
||||
assert_se(r == -EINVAL);
|
||||
|
||||
result = mfree(result);
|
||||
r = chase_symlinks("/etc/machine-id/foo", NULL, &result);
|
||||
assert_se(r == -ENOTDIR);
|
||||
|
||||
result = mfree(result);
|
||||
p = strjoina(temp, "/recursive-symlink");
|
||||
assert_se(symlink("recursive-symlink", p) >= 0);
|
||||
r = chase_symlinks(p, NULL, &result);
|
||||
assert_se(r == -ELOOP);
|
||||
|
||||
assert_se(rm_rf(temp, REMOVE_ROOT|REMOVE_PHYSICAL) >= 0);
|
||||
}
|
||||
|
||||
static void test_unlink_noerrno(void) {
|
||||
char name[] = "/tmp/test-close_nointr.XXXXXX";
|
||||
int fd;
|
||||
@ -144,6 +237,7 @@ int main(int argc, char *argv[]) {
|
||||
test_readlink_and_make_absolute();
|
||||
test_get_files_in_directory();
|
||||
test_var_tmp();
|
||||
test_chase_symlinks();
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
@ -26,13 +26,18 @@
|
||||
int main(int argc, char *argv[]) {
|
||||
const char * const writable[] = {
|
||||
"/home",
|
||||
"-/home/lennart/projects/foobar", /* this should be masked automatically */
|
||||
NULL
|
||||
};
|
||||
|
||||
const char * const readonly[] = {
|
||||
"/",
|
||||
"/usr",
|
||||
/* "/", */
|
||||
/* "/usr", */
|
||||
"/boot",
|
||||
"/lib",
|
||||
"/usr/lib",
|
||||
"-/lib64",
|
||||
"-/usr/lib64",
|
||||
NULL
|
||||
};
|
||||
|
||||
@ -42,11 +47,12 @@ int main(int argc, char *argv[]) {
|
||||
};
|
||||
char *root_directory;
|
||||
char *projects_directory;
|
||||
|
||||
int r;
|
||||
char tmp_dir[] = "/tmp/systemd-private-XXXXXX",
|
||||
var_tmp_dir[] = "/var/tmp/systemd-private-XXXXXX";
|
||||
|
||||
log_set_max_level(LOG_DEBUG);
|
||||
|
||||
assert_se(mkdtemp(tmp_dir));
|
||||
assert_se(mkdtemp(var_tmp_dir));
|
||||
|
||||
@ -69,6 +75,8 @@ int main(int argc, char *argv[]) {
|
||||
tmp_dir,
|
||||
var_tmp_dir,
|
||||
true,
|
||||
true,
|
||||
true,
|
||||
PROTECT_HOME_NO,
|
||||
PROTECT_SYSTEM_NO,
|
||||
0);
|
||||
|
@ -0,0 +1,7 @@
|
||||
[Unit]
|
||||
Description=Test to make sure that InaccessiblePaths= disconnect mount propagation
|
||||
|
||||
[Service]
|
||||
InaccessiblePaths=-/i-dont-exist
|
||||
ExecStart=/bin/sh -x -c 'mkdir -p /TEST; mount -t tmpfs tmpfs /TEST; grep TEST /proc/self/mountinfo && ! grep TEST /proc/$${PPID}/mountinfo && ! grep TEST /proc/1/mountinfo'
|
||||
Type=oneshot
|
@ -0,0 +1,7 @@
|
||||
[Unit]
|
||||
Description=Test CAP_MKNOD capability for PrivateDevices=no
|
||||
|
||||
[Service]
|
||||
PrivateDevices=no
|
||||
ExecStart=/bin/sh -x -c 'capsh --print | grep cap_mknod'
|
||||
Type=oneshot
|
@ -0,0 +1,7 @@
|
||||
[Unit]
|
||||
Description=Test CAP_MKNOD capability for PrivateDevices=yes
|
||||
|
||||
[Service]
|
||||
PrivateDevices=yes
|
||||
ExecStart=/bin/sh -x -c '! capsh --print | grep cap_mknod'
|
||||
Type=oneshot
|
@ -0,0 +1,7 @@
|
||||
[Unit]
|
||||
Description=Test to make sure that passing ReadOnlyPaths= disconnect mount propagation
|
||||
|
||||
[Service]
|
||||
ReadOnlyPaths=-/i-dont-exist
|
||||
ExecStart=/bin/sh -x -c 'mkdir -p /TEST; mount -t tmpfs tmpfs /TEST; grep TEST /proc/self/mountinfo && ! grep TEST /proc/$${PPID}/mountinfo && ! grep TEST /proc/1/mountinfo'
|
||||
Type=oneshot
|
7
test/test-execute/exec-readonlypaths.service
Normal file
@ -0,0 +1,7 @@
|
||||
[Unit]
|
||||
Description=Test for ReadOnlyPaths=
|
||||
|
||||
[Service]
|
||||
ReadOnlyPaths=/etc -/i-dont-exist /usr
|
||||
ExecStart=/bin/sh -x -c 'test ! -w /etc && test ! -w /usr && test ! -e /i-dont-exist && test -w /var'
|
||||
Type=oneshot
|
@ -0,0 +1,7 @@
|
||||
[Unit]
|
||||
Description=Test to make sure that passing ReadWritePaths= disconnect mount propagation
|
||||
|
||||
[Service]
|
||||
ReadWritePaths=-/i-dont-exist
|
||||
ExecStart=/bin/sh -x -c 'mkdir -p /TEST; mount -t tmpfs tmpfs /TEST; grep TEST /proc/self/mountinfo && ! grep TEST /proc/$${PPID}/mountinfo && ! grep TEST /proc/1/mountinfo'
|
||||
Type=oneshot
|
@ -13,12 +13,16 @@ Documentation=http://www.freedesktop.org/wiki/Software/systemd/hostnamed
|
||||
[Service]
|
||||
ExecStart=@rootlibexecdir@/systemd-hostnamed
|
||||
BusName=org.freedesktop.hostname1
|
||||
CapabilityBoundingSet=CAP_SYS_ADMIN
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=CAP_SYS_ADMIN
|
||||
PrivateTmp=yes
|
||||
PrivateDevices=yes
|
||||
PrivateNetwork=yes
|
||||
ProtectSystem=yes
|
||||
ProtectHome=yes
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
|
@ -13,9 +13,11 @@ Documentation=http://www.freedesktop.org/wiki/Software/systemd/importd
|
||||
[Service]
|
||||
ExecStart=@rootlibexecdir@/systemd-importd
|
||||
BusName=org.freedesktop.import1
|
||||
CapabilityBoundingSet=CAP_CHOWN CAP_FOWNER CAP_FSETID CAP_MKNOD CAP_SETFCAP CAP_SYS_ADMIN CAP_SETPCAP CAP_DAC_OVERRIDE
|
||||
NoNewPrivileges=yes
|
||||
WatchdogSec=3min
|
||||
KillMode=mixed
|
||||
CapabilityBoundingSet=CAP_CHOWN CAP_FOWNER CAP_FSETID CAP_MKNOD CAP_SETFCAP CAP_SYS_ADMIN CAP_SETPCAP CAP_DAC_OVERRIDE
|
||||
NoNewPrivileges=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @obsolete @raw-io
|
||||
|
@ -20,6 +20,11 @@ PrivateDevices=yes
|
||||
PrivateNetwork=yes
|
||||
ProtectSystem=full
|
||||
ProtectHome=yes
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
|
||||
|
||||
# If there are many split up journal files we need a lot of fds to
|
||||
# access them all and combine
|
||||
|
@ -11,15 +11,20 @@ Documentation=man:systemd-journal-remote(8) man:journal-remote.conf(5)
|
||||
Requires=systemd-journal-remote.socket
|
||||
|
||||
[Service]
|
||||
ExecStart=@rootlibexecdir@/systemd-journal-remote \
|
||||
--listen-https=-3 \
|
||||
--output=/var/log/journal/remote/
|
||||
ExecStart=@rootlibexecdir@/systemd-journal-remote --listen-https=-3 --output=/var/log/journal/remote/
|
||||
User=systemd-journal-remote
|
||||
Group=systemd-journal-remote
|
||||
WatchdogSec=3min
|
||||
PrivateTmp=yes
|
||||
PrivateDevices=yes
|
||||
PrivateNetwork=yes
|
||||
WatchdogSec=3min
|
||||
ProtectSystem=full
|
||||
ProtectHome=yes
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
|
||||
|
||||
[Install]
|
||||
Also=systemd-journal-remote.socket
|
||||
|
@ -11,13 +11,19 @@ Documentation=man:systemd-journal-upload(8)
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
ExecStart=@rootlibexecdir@/systemd-journal-upload \
|
||||
--save-state
|
||||
ExecStart=@rootlibexecdir@/systemd-journal-upload --save-state
|
||||
User=systemd-journal-upload
|
||||
SupplementaryGroups=systemd-journal
|
||||
WatchdogSec=3min
|
||||
PrivateTmp=yes
|
||||
PrivateDevices=yes
|
||||
WatchdogSec=3min
|
||||
ProtectSystem=full
|
||||
ProtectHome=yes
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
|
||||
|
||||
# If there are many split up journal files we need a lot of fds to
|
||||
# access them all and combine
|
||||
|
@ -21,10 +21,12 @@ Restart=always
|
||||
RestartSec=0
|
||||
NotifyAccess=all
|
||||
StandardOutput=null
|
||||
CapabilityBoundingSet=CAP_SYS_ADMIN CAP_DAC_OVERRIDE CAP_SYS_PTRACE CAP_SYSLOG CAP_AUDIT_CONTROL CAP_AUDIT_READ CAP_CHOWN CAP_DAC_READ_SEARCH CAP_FOWNER CAP_SETUID CAP_SETGID CAP_MAC_OVERRIDE
|
||||
WatchdogSec=3min
|
||||
FileDescriptorStoreMax=1024
|
||||
CapabilityBoundingSet=CAP_SYS_ADMIN CAP_DAC_OVERRIDE CAP_SYS_PTRACE CAP_SYSLOG CAP_AUDIT_CONTROL CAP_AUDIT_READ CAP_CHOWN CAP_DAC_READ_SEARCH CAP_FOWNER CAP_SETUID CAP_SETGID CAP_MAC_OVERRIDE
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_NETLINK
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
|
||||
# Increase the default a bit in order to allow many simultaneous
|
||||
|
@ -13,12 +13,16 @@ Documentation=http://www.freedesktop.org/wiki/Software/systemd/localed
|
||||
[Service]
|
||||
ExecStart=@rootlibexecdir@/systemd-localed
|
||||
BusName=org.freedesktop.locale1
|
||||
CapabilityBoundingSet=
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=
|
||||
PrivateTmp=yes
|
||||
PrivateDevices=yes
|
||||
PrivateNetwork=yes
|
||||
ProtectSystem=yes
|
||||
ProtectHome=yes
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
|
@ -23,9 +23,11 @@ ExecStart=@rootlibexecdir@/systemd-logind
|
||||
Restart=always
|
||||
RestartSec=0
|
||||
BusName=org.freedesktop.login1
|
||||
CapabilityBoundingSet=CAP_SYS_ADMIN CAP_MAC_ADMIN CAP_AUDIT_CONTROL CAP_CHOWN CAP_KILL CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_FOWNER CAP_SYS_TTY_CONFIG
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=CAP_SYS_ADMIN CAP_MAC_ADMIN CAP_AUDIT_CONTROL CAP_CHOWN CAP_KILL CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_FOWNER CAP_SYS_TTY_CONFIG
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_NETLINK AF_INET AF_INET6
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @obsolete @raw-io
|
||||
|
||||
# Increase the default a bit in order to allow many simultaneous
|
||||
|
@ -15,9 +15,11 @@ After=machine.slice
|
||||
[Service]
|
||||
ExecStart=@rootlibexecdir@/systemd-machined
|
||||
BusName=org.freedesktop.machine1
|
||||
CapabilityBoundingSet=CAP_KILL CAP_SYS_PTRACE CAP_SYS_ADMIN CAP_SETGID CAP_SYS_CHROOT CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_CHOWN CAP_FOWNER CAP_FSETID CAP_MKNOD
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=CAP_KILL CAP_SYS_PTRACE CAP_SYS_ADMIN CAP_SETGID CAP_SYS_CHROOT CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_CHOWN CAP_FOWNER CAP_FSETID CAP_MKNOD
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_NETLINK AF_INET AF_INET6
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @obsolete @raw-io
|
||||
|
||||
# Note that machined cannot be placed in a mount namespace, since it
|
||||
|
@ -27,11 +27,14 @@ Type=notify
|
||||
Restart=on-failure
|
||||
RestartSec=0
|
||||
ExecStart=@rootlibexecdir@/systemd-networkd
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=CAP_NET_ADMIN CAP_NET_BIND_SERVICE CAP_NET_BROADCAST CAP_NET_RAW CAP_SETUID CAP_SETGID CAP_SETPCAP CAP_CHOWN CAP_DAC_OVERRIDE CAP_FOWNER
|
||||
ProtectSystem=full
|
||||
ProtectHome=yes
|
||||
WatchdogSec=3min
|
||||
ProtectControlGroups=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_NETLINK AF_INET AF_INET6 AF_PACKET
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
|
||||
[Install]
|
||||
|
@ -23,11 +23,17 @@ Type=notify
|
||||
Restart=always
|
||||
RestartSec=0
|
||||
ExecStart=@rootlibexecdir@/systemd-resolved
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=CAP_SETUID CAP_SETGID CAP_SETPCAP CAP_CHOWN CAP_DAC_OVERRIDE CAP_FOWNER CAP_NET_RAW CAP_NET_BIND_SERVICE
|
||||
PrivateTmp=yes
|
||||
PrivateDevices=yes
|
||||
ProtectSystem=full
|
||||
ProtectHome=yes
|
||||
WatchdogSec=3min
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_NETLINK AF_INET AF_INET6
|
||||
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
|
||||
[Install]
|
||||
|
@ -13,10 +13,14 @@ Documentation=http://www.freedesktop.org/wiki/Software/systemd/timedated
|
||||
[Service]
|
||||
ExecStart=@rootlibexecdir@/systemd-timedated
|
||||
BusName=org.freedesktop.timedate1
|
||||
CapabilityBoundingSet=CAP_SYS_TIME
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=CAP_SYS_TIME
|
||||
PrivateTmp=yes
|
||||
ProtectSystem=yes
|
||||
ProtectHome=yes
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX
|
||||
SystemCallFilter=~@cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
|
@ -22,13 +22,17 @@ Type=notify
|
||||
Restart=always
|
||||
RestartSec=0
|
||||
ExecStart=@rootlibexecdir@/systemd-timesyncd
|
||||
WatchdogSec=3min
|
||||
CapabilityBoundingSet=CAP_SYS_TIME CAP_SETUID CAP_SETGID CAP_SETPCAP CAP_CHOWN CAP_DAC_OVERRIDE CAP_FOWNER
|
||||
PrivateTmp=yes
|
||||
PrivateDevices=yes
|
||||
ProtectSystem=full
|
||||
ProtectHome=yes
|
||||
WatchdogSec=3min
|
||||
ProtectControlGroups=yes
|
||||
ProtectKernelTunables=yes
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
|
||||
SystemCallFilter=~@cpu-emulation @debug @keyring @module @mount @obsolete @raw-io
|
||||
|
||||
[Install]
|
||||
|
@ -21,7 +21,10 @@ Sockets=systemd-udevd-control.socket systemd-udevd-kernel.socket
|
||||
Restart=always
|
||||
RestartSec=0
|
||||
ExecStart=@rootlibexecdir@/systemd-udevd
|
||||
MountFlags=slave
|
||||
KillMode=mixed
|
||||
WatchdogSec=3min
|
||||
TasksMax=infinity
|
||||
MountFlags=slave
|
||||
MemoryDenyWriteExecute=yes
|
||||
RestrictRealtime=yes
|
||||
RestrictAddressFamilies=AF_UNIX AF_NETLINK
|
||||
|