<?xml version='1.0'?>
|
||
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
|
||
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
|
||
<!-- SPDX-License-Identifier: LGPL-2.1-or-later -->
|
||
|
||
<refentry id="systemd.resource-control" xmlns:xi="http://www.w3.org/2001/XInclude">
|
||
<refentryinfo>
|
||
<title>systemd.resource-control</title>
|
||
<productname>systemd</productname>
|
||
</refentryinfo>
|
||
|
||
<refmeta>
|
||
<refentrytitle>systemd.resource-control</refentrytitle>
|
||
<manvolnum>5</manvolnum>
|
||
</refmeta>
|
||
|
||
<refnamediv>
|
||
<refname>systemd.resource-control</refname>
|
||
<refpurpose>Resource control unit settings</refpurpose>
|
||
</refnamediv>
|
||
|
||
<refsynopsisdiv>
|
||
<para>
|
||
<filename><replaceable>slice</replaceable>.slice</filename>,
|
||
<filename><replaceable>scope</replaceable>.scope</filename>,
|
||
<filename><replaceable>service</replaceable>.service</filename>,
|
||
<filename><replaceable>socket</replaceable>.socket</filename>,
|
||
<filename><replaceable>mount</replaceable>.mount</filename>,
|
||
<filename><replaceable>swap</replaceable>.swap</filename>
|
||
</para>
|
||
</refsynopsisdiv>
|
||
|
||
<refsect1>
|
||
<title>Description</title>
|
||
|
||
<para>Unit configuration files for services, slices, scopes, sockets, mount points, and swap devices share a subset
|
||
of configuration options for resource control of spawned processes. Internally, this relies on the Linux Control
|
||
Groups (cgroups) kernel concept for organizing processes in a hierarchical tree of named groups for the purpose of
|
||
resource management.</para>
|
||
|
||
<para>This man page lists the configuration options shared by
|
||
those six unit types. See
|
||
<citerefentry><refentrytitle>systemd.unit</refentrytitle><manvolnum>5</manvolnum></citerefentry>
|
||
for the common options of all unit configuration files, and
|
||
<citerefentry><refentrytitle>systemd.slice</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.scope</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.service</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.socket</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.mount</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
and
|
||
<citerefentry><refentrytitle>systemd.swap</refentrytitle><manvolnum>5</manvolnum></citerefentry>
|
||
for more information on the specific unit configuration files. The
|
||
resource control configuration options are set in the
|
||
[Slice], [Scope], [Service], [Socket], [Mount], or [Swap]
|
||
sections, depending on the unit type.</para>
|
||
|
||
<para>In addition, options which control resources available to programs
|
||
<emphasis>executed</emphasis> by systemd are listed in
|
||
<citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
|
||
Those options complement options listed here.</para>
|
||
|
||
<refsect2>
|
||
<title>Enabling and disabling controllers</title>
|
||
|
||
<para>Controllers in the cgroup hierarchy are hierarchical, and resource control is realized by
|
||
distributing resource assignments between siblings in branches of the cgroup hierarchy. There is no
|
||
need to explicitly <emphasis>enable</emphasis> a cgroup controller for a unit.
|
||
<command>systemd</command> will instruct the kernel to enable a controller for a given unit when this
|
||
unit has configuration for a given controller. For example, when <varname>CPUWeight=</varname> is set,
|
||
the <option>cpu</option> controller will be enabled, and when <varname>TasksMax=</varname> is set, the
<option>pids</option> controller will be enabled. In addition, various controllers may also be
|
||
enabled explicitly via the
|
||
<varname>MemoryAccounting=</varname>/<varname>TasksAccounting=</varname>/<varname>IOAccounting=</varname>
|
||
settings. Because of how the cgroup hierarchy works, controllers will be automatically enabled for all
|
||
parent units and for any sibling units starting with the lowest level at which a controller is enabled.
|
||
Units for which a controller is enabled may be subject to resource control even if they don't have any
|
||
explicit configuration.</para>
|
||
|
||
<para>Setting <varname>Delegate=</varname> enables any delegated controllers for that unit (see below).
|
||
The delegatee may then enable controllers for its children as appropriate. In particular, if the
|
||
delegatee is <command>systemd</command> (in the <filename>user@.service</filename> unit), it will
|
||
repeat the same logic as the system instance and enable controllers for user units which have resource
|
||
limits configured, and their siblings and parents and parents' siblings.</para>
|
||
|
||
<para>Controllers may be <emphasis>disabled</emphasis> for parts of the cgroup hierarchy with
|
||
<varname>DisableControllers=</varname> (see below).</para>
|
||
|
||
<example>
|
||
<title>Enabling and disabling controllers</title>
|
||
|
||
<programlisting>
                      -.slice
                     /       \
              /-----/         \--------------\
             /                                \
      system.slice                       user.slice
        /       \                          /      \
       /         \                        /        \
      /           \              user@42.service  user@1000.service
     /             \             Delegate=        Delegate=yes
a.service       b.slice                             /        \
CPUWeight=20   DisableControllers=cpu              /          \
                 /  \                      app.slice      session.slice
                /    \                     CPUWeight=100  CPUWeight=100
               /      \
       b1.service   b2.service
                    CPUWeight=1000
</programlisting>
|
||
|
||
<para>In this hierarchy, the <option>cpu</option> controller is enabled for all units shown except
|
||
<filename>b1.service</filename> and <filename>b2.service</filename>. Because there is no explicit
|
||
configuration for <filename>system.slice</filename> and <filename>user.slice</filename>, CPU
|
||
resources will be split equally between them. Similarly, resources are allocated equally between
|
||
children of <filename>user.slice</filename> and between the child slices beneath
|
||
<filename>user@1000.service</filename>. Assuming that there is no further configuration of resources
|
||
or delegation below slices <filename>app.slice</filename> or <filename>session.slice</filename>, the
|
||
<option>cpu</option> controller would not be enabled for units in those slices and CPU resources
|
||
would be further allocated using other mechanisms, e.g. based on nice levels. The manager for user
|
||
42 has delegation enabled without any controllers, i.e. it can manipulate its subtree of the cgroup
|
||
hierarchy, but without resource control.</para>
|
||
|
||
<para>In the slice <filename>system.slice</filename>, CPU resources are split so that service
<filename>a.service</filename> receives 1/6 and slice <filename>b.slice</filename> receives 5/6 (weights
of 20 and 100 out of a total of 120), because slice <filename>b.slice</filename> gets the default value
of 100 for <filename>cpu.weight</filename> when <varname>CPUWeight=</varname> is not set.</para>
|
||
|
||
<para>The <varname>CPUWeight=</varname> setting in service <filename>b2.service</filename> is neutralized
|
||
by <varname>DisableControllers=</varname> in slice <filename>b.slice</filename>, so the
|
||
<option>cpu</option> controller would not be enabled for services <filename>b1.service</filename> and
|
||
<filename>b2.service</filename>, and CPU resources would be further allocated using other mechanisms,
|
||
e.g. based on nice levels.</para>
|
||
</example>
|
||
</refsect2>
|
||
|
||
<refsect2>
|
||
<title>Setting resource controls for a group of related units</title>
|
||
|
||
<para>As described in
|
||
<citerefentry><refentrytitle>systemd.unit</refentrytitle><manvolnum>5</manvolnum></citerefentry>, the
|
||
settings listed here may be set through the main file of a unit and drop-in snippets in
|
||
<filename index="false">*.d/</filename> directories. The list of directories searched for drop-ins
|
||
includes names formed by repeatedly truncating the unit name after all dashes. This is particularly
|
||
convenient to set resource limits for a group of units with similar names.</para>
|
||
|
||
<para>For example, every user gets their own slice
|
||
<filename>user-<replaceable>nnn</replaceable>.slice</filename>. Drop-ins with local configuration that
|
||
affect user 1000 may be placed in
|
||
<filename index="false">/etc/systemd/system/user-1000.slice</filename>,
|
||
<filename index="false">/etc/systemd/system/user-1000.slice.d/*.conf</filename>, but also
|
||
<filename index="false">/etc/systemd/system/user-.slice.d/*.conf</filename>. This last directory
|
||
applies to all user slices.</para>
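
<para>As an illustrative sketch, a drop-in such as the following could be used to limit memory for every
user slice at once. The file name <filename index="false">50-memory.conf</filename> and the limit values
are assumptions chosen for this example, not defaults:</para>

<programlisting># /etc/systemd/system/user-.slice.d/50-memory.conf (hypothetical file name)
[Slice]
# Illustrative values only
MemoryHigh=4G
MemoryMax=5G
</programlisting>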
|
||
</refsect2>
|
||
|
||
<para>See the <ulink
|
||
url="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface">New
|
||
Control Group Interfaces</ulink> for an introduction on how to make
|
||
use of resource control APIs from programs.</para>
|
||
</refsect1>
|
||
|
||
<refsect1>
|
||
<title>Implicit Dependencies</title>
|
||
|
||
<para>The following dependencies are implicitly added:</para>
|
||
|
||
<itemizedlist>
|
||
<listitem><para>Units with the <varname>Slice=</varname> setting set automatically acquire
|
||
<varname>Requires=</varname> and <varname>After=</varname> dependencies on the specified
|
||
slice unit.</para></listitem>
|
||
</itemizedlist>
|
||
</refsect1>
|
||
|
||
<!-- We don't have any default dependency here. -->
|
||
|
||
<refsect1>
|
||
<title>Options</title>
|
||
|
||
<para>Units of the types listed above can have settings for resource control configuration:</para>
|
||
|
||
<refsect2><title>CPU Accounting and Control</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>CPUAccounting=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Turn on CPU usage accounting for this unit. Takes a
|
||
boolean argument. Note that turning on CPU accounting for
|
||
one unit will also implicitly turn it on for all units
|
||
contained in the same slice and for all its parent slices
|
||
and the units contained therein. The system default for this
|
||
setting may be controlled with
|
||
<varname>DefaultCPUAccounting=</varname> in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
|
||
|
||
<para>Under the unified cgroup hierarchy, CPU accounting is available for all units and this
|
||
setting has no effect.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>CPUWeight=<replaceable>weight</replaceable></varname></term>
|
||
<term><varname>StartupCPUWeight=<replaceable>weight</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>cpu</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>These options accept an integer value or the special string "idle":</para>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para>If set to an integer value, assign the specified CPU time weight to the processes
|
||
executed, if the unified control group hierarchy is used on the system. These options control
|
||
the <literal>cpu.weight</literal> control group attribute. The allowed range is 1 to 10000.
|
||
Defaults to unset, but the kernel default is 100. For details about this control group
|
||
attribute, see <ulink url="https://docs.kernel.org/admin-guide/cgroup-v2.html">Control Groups
|
||
v2</ulink> and <ulink url="https://docs.kernel.org/scheduler/sched-design-CFS.html">CFS
|
||
Scheduler</ulink>. The available CPU time is split up among all units within one slice
|
||
relative to their CPU time weight. A higher weight means more CPU time, a lower weight means
|
||
less.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>If set to the special string "idle", mark the cgroup for "idle scheduling", which means
that it will get CPU resources only when no processes that are not marked in this way are waiting to
execute in this cgroup or its siblings. This setting corresponds to the <literal>cpu.idle</literal> cgroup attribute.</para>

<para>Note that this value only has an effect on cgroup-v2; on cgroup-v1 it is equivalent to the minimum weight.</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
|
||
<para>While <varname>StartupCPUWeight=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>CPUWeight=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupCPUWeight=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
|
||
|
||
<para>In addition to the resource allocation performed by the <option>cpu</option> controller, the
|
||
kernel may automatically divide resources based on session-id grouping, see "The autogroup feature"
|
||
in <citerefentry
|
||
project='man-pages'><refentrytitle>sched</refentrytitle><manvolnum>7</manvolnum></citerefentry>.
|
||
The effect of this feature is similar to the <option>cpu</option> controller with no explicit
|
||
configuration, so users should be careful to not mistake one for the other.</para>
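
<para>A minimal sketch of how the two weight settings might be combined in a service unit. The weights
are illustrative assumptions. Relative to siblings left at the kernel default of 100, the service would
compete with twice that weight during normal runtime and with half of it during boot-up and
shutdown:</para>

<programlisting>[Service]
# Twice the kernel default weight of 100 during normal runtime
CPUWeight=200
# Half the kernel default weight during boot-up and shutdown
StartupCPUWeight=50
</programlisting>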
|
||
|
||
<xi:include href="version-info.xml" xpointer="v232"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>CPUQuota=</varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>cpu</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Assign the specified CPU time quota to the processes executed. Takes a percentage value, suffixed with
|
||
"%". The percentage specifies how much CPU time the unit shall get at maximum, relative to the total CPU time
|
||
available on one CPU. Use values above 100% to allot CPU time on more than one CPU. This controls the
|
||
<literal>cpu.max</literal> attribute on the unified control group hierarchy and
|
||
<literal>cpu.cfs_quota_us</literal> on legacy. For details about these control group attributes, see <ulink
|
||
url="https://docs.kernel.org/admin-guide/cgroup-v2.html">Control Groups v2</ulink> and <ulink
|
||
url="https://docs.kernel.org/scheduler/sched-bwc.html">CFS Bandwidth Control</ulink>.
|
||
Setting <varname>CPUQuota=</varname> to an empty value unsets the quota.</para>
|
||
|
||
<para>Example: <varname>CPUQuota=20%</varname> ensures that the executed processes will never get more than
|
||
20% CPU time on one CPU.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v213"/>
|
||
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>CPUQuotaPeriodSec=</varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>cpu</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Assign the duration over which the CPU time quota specified by <varname>CPUQuota=</varname> is measured.
|
||
Takes a time duration value in seconds, with an optional suffix such as "ms" for milliseconds (or "s" for seconds).
|
||
The default setting is 100ms. The period is clamped to the range supported by the kernel, which is [1ms, 1000ms].
|
||
Additionally, the period is adjusted up so that the quota interval is also at least 1ms.
|
||
Setting <varname>CPUQuotaPeriodSec=</varname> to an empty value resets it to the default.</para>
|
||
|
||
<para>This controls the second field of <literal>cpu.max</literal> attribute on the unified control group hierarchy
|
||
and <literal>cpu.cfs_period_us</literal> on legacy. For details about these control group attributes, see
|
||
<ulink url="https://docs.kernel.org/admin-guide/cgroup-v2.html">Control Groups v2</ulink> and
|
||
<ulink url="https://docs.kernel.org/scheduler/sched-design-CFS.html">CFS Scheduler</ulink>.</para>
|
||
|
||
<para>Example: <varname>CPUQuotaPeriodSec=10ms</varname> requests that the CPU quota be measured in periods of 10ms.</para>
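
<para>The following sketch combines the quota and period settings; the values are illustrative. Assuming
the unified hierarchy, where <literal>cpu.max</literal> holds the quota followed by the period, both in
microseconds, this configuration should amount to 2ms of CPU time per 10ms period:</para>

<programlisting>[Service]
# At most 20% of one CPU ...
CPUQuota=20%
# ... measured over periods of 10ms, i.e. 2ms of CPU time per period
CPUQuotaPeriodSec=10ms
# Expected cpu.max contents under these assumptions: 2000 10000
</programlisting>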
|
||
|
||
<xi:include href="version-info.xml" xpointer="v242"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>AllowedCPUs=</varname></term>
|
||
<term><varname>StartupAllowedCPUs=</varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>cpuset</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Restrict processes to be executed on specific CPUs. Takes a list of CPU indices or ranges separated by either
|
||
whitespace or commas. CPU ranges are specified by the lower and upper CPU indices separated by a dash.</para>
|
||
|
||
<para>Setting <varname>AllowedCPUs=</varname> or <varname>StartupAllowedCPUs=</varname> doesn't guarantee that all
|
||
of the CPUs will be used by the processes as it may be limited by parent units. The effective configuration is
|
||
reported as <varname>EffectiveCPUs=</varname>.</para>
|
||
|
||
<para>While <varname>StartupAllowedCPUs=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>AllowedCPUs=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupAllowedCPUs=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
|
||
|
||
<para>These settings are supported only with the unified control group hierarchy.</para>
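
<para>For illustration, the list syntax described above might be used as follows. The CPU indices are
assumptions and must correspond to CPUs that exist on the system in question:</para>

<programlisting>[Service]
# CPUs 0 through 3, plus CPU 8 (comma-separated list with a range)
AllowedCPUs=0-3,8
# Only CPUs 0 and 1 during boot-up and shutdown (whitespace-separated list)
StartupAllowedCPUs=0 1
</programlisting>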
|
||
|
||
<xi:include href="version-info.xml" xpointer="v244"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>Memory Accounting and Control</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemoryAccounting=</varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>memory</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Turn on process and kernel memory accounting for this
|
||
unit. Takes a boolean argument. Note that turning on memory
|
||
accounting for one unit will also implicitly turn it on for
|
||
all units contained in the same slice and for all its parent
|
||
slices and the units contained therein. The system default
|
||
for this setting may be controlled with
|
||
<varname>DefaultMemoryAccounting=</varname> in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemoryMin=<replaceable>bytes</replaceable></varname>, <varname>MemoryLow=<replaceable>bytes</replaceable></varname></term>
|
||
<term><varname>StartupMemoryLow=<replaceable>bytes</replaceable></varname>, <varname>DefaultStartupMemoryLow=<replaceable>bytes</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>memory</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Specify the memory usage protection of the executed processes in this unit.
|
||
When reclaiming memory, the unit is treated as if it were using less memory, so that memory
is preferentially reclaimed from unprotected units.
|
||
Using <varname>MemoryLow=</varname> results in a weaker protection where memory may still
|
||
be reclaimed to avoid invoking the OOM killer in case there is no other reclaimable memory.</para>
|
||
<para>
|
||
For a protection to be effective, it is generally required to set a corresponding
|
||
allocation on all ancestors, which is then distributed between children
|
||
(with the exception of the root slice).
|
||
Any <varname>MemoryMin=</varname> or <varname>MemoryLow=</varname> allocation that is not
|
||
explicitly distributed to specific children is used to create a shared protection for all children.
|
||
As this is a shared protection, the children will freely compete for the memory.</para>
|
||
|
||
<para>Takes a memory size in bytes. If the value is suffixed with K, M, G or T, the specified memory size is
|
||
parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes (with the base 1024), respectively. Alternatively, a
|
||
percentage value may be specified, which is taken relative to the installed physical memory on the
|
||
system. If assigned the special value <literal>infinity</literal>, all available memory is protected, which may be
|
||
useful in order to always inherit all of the protection afforded by ancestors.
|
||
This controls the <literal>memory.min</literal> or <literal>memory.low</literal> control group attribute.
|
||
For details about this control group attribute, see <ulink
|
||
url="https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files">Memory Interface Files</ulink>.</para>
|
||
|
||
<para>Units may have their children use a default <literal>memory.min</literal> or
<literal>memory.low</literal> value by specifying <varname>DefaultMemoryMin=</varname> or
<varname>DefaultMemoryLow=</varname>, which have the same semantics as
<varname>MemoryMin=</varname> and <varname>MemoryLow=</varname>, or <varname>DefaultStartupMemoryLow=</varname>
which has the same semantics as <varname>StartupMemoryLow=</varname>.
These settings do not affect <literal>memory.min</literal> or <literal>memory.low</literal>
in the unit itself.
Using them to set a default child allocation is only useful on kernels older than 5.7,
|
||
which do not support the <literal>memory_recursiveprot</literal> cgroup2 mount option.</para>
|
||
|
||
<para>While <varname>StartupMemoryLow=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>MemoryLow=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupMemoryLow=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
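
<para>A short sketch of the value formats accepted by these settings; the values are illustrative
assumptions. As noted above, ancestors generally need a corresponding allocation for the protection to
be effective:</para>

<programlisting>[Service]
# Stronger protection: 256 * 1024 * 1024 bytes
MemoryMin=256M
# Weaker protection: 10% of the installed physical memory
MemoryLow=10%
</programlisting>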
|
||
|
||
<xi:include href="version-info.xml" xpointer="v240"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemoryHigh=<replaceable>bytes</replaceable></varname></term>
|
||
<term><varname>StartupMemoryHigh=<replaceable>bytes</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>memory</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Specify the throttling limit on memory usage of the executed processes in this unit. Memory usage may go
|
||
above the limit if unavoidable, but the processes are heavily slowed down and memory is taken away
|
||
aggressively in such cases. This is the main mechanism to control memory usage of a unit.</para>
|
||
|
||
<para>Takes a memory size in bytes. If the value is suffixed with K, M, G or T, the specified memory size is
|
||
parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes (with the base 1024), respectively. Alternatively, a
|
||
percentage value may be specified, which is taken relative to the installed physical memory on the
|
||
system. If assigned the
|
||
special value <literal>infinity</literal>, no memory throttling is applied. This controls the
|
||
<literal>memory.high</literal> control group attribute. For details about this control group attribute, see
|
||
<ulink url="https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files">Memory Interface Files</ulink>.</para>
|
||
|
||
<para>While <varname>StartupMemoryHigh=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>MemoryHigh=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupMemoryHigh=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v231"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemoryMax=<replaceable>bytes</replaceable></varname></term>
|
||
<term><varname>StartupMemoryMax=<replaceable>bytes</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>memory</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Specify the absolute limit on memory usage of the executed processes in this unit. If memory usage
|
||
cannot be contained under the limit, the out-of-memory killer is invoked inside the unit. It is recommended to
|
||
use <varname>MemoryHigh=</varname> as the main control mechanism and use <varname>MemoryMax=</varname> as the
|
||
last line of defense.</para>
|
||
|
||
<para>Takes a memory size in bytes. If the value is suffixed with K, M, G or T, the specified memory size is
|
||
parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes (with the base 1024), respectively. Alternatively, a
|
||
percentage value may be specified, which is taken relative to the installed physical memory on the system. If
|
||
assigned the special value <literal>infinity</literal>, no memory limit is applied. This controls the
|
||
<literal>memory.max</literal> control group attribute. For details about this control group attribute, see
|
||
<ulink url="https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files">Memory Interface Files</ulink>.</para>
|
||
|
||
<para>While <varname>StartupMemoryMax=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>MemoryMax=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupMemoryMax=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
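
<para>The recommendation above, throttling with <varname>MemoryHigh=</varname> and keeping
<varname>MemoryMax=</varname> as the last line of defense, might be sketched as follows. The sizes are
illustrative assumptions:</para>

<programlisting>[Service]
# Throttle the unit and reclaim its memory aggressively above 1 GiB ...
MemoryHigh=1G
# ... and invoke the out-of-memory killer inside the unit if usage cannot be kept under 2 GiB
MemoryMax=2G
</programlisting>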
|
||
|
||
<xi:include href="version-info.xml" xpointer="v231"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemorySwapMax=<replaceable>bytes</replaceable></varname></term>
|
||
<term><varname>StartupMemorySwapMax=<replaceable>bytes</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>memory</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Specify the absolute limit on swap usage of the executed processes in this unit.</para>
|
||
|
||
<para>Takes a swap size in bytes. If the value is suffixed with K, M, G or T, the specified swap size is
|
||
parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes (with the base 1024), respectively. If assigned the
|
||
special value <literal>infinity</literal>, no swap limit is applied. These settings control the
|
||
<literal>memory.swap.max</literal> control group attribute. For details about this control group attribute,
|
||
see <ulink url="https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files">Memory Interface Files</ulink>.</para>
|
||
|
||
<para>While <varname>StartupMemorySwapMax=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>MemorySwapMax=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupMemorySwapMax=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
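
<para>As a hedged illustration, a limit of zero bytes could be used to keep the processes of a unit out
of swap entirely. The value is an assumption for this example:</para>

<programlisting>[Service]
# No swap space may be used by this unit
MemorySwapMax=0
</programlisting>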
|
||
|
||
<xi:include href="version-info.xml" xpointer="v232"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemoryZSwapMax=<replaceable>bytes</replaceable></varname></term>
|
||
<term><varname>StartupMemoryZSwapMax=<replaceable>bytes</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>memory</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Specify the absolute limit on zswap usage of the processes in this unit. Zswap is a lightweight compressed
|
||
cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a
|
||
dynamically allocated RAM-based memory pool. If the limit specified is hit, no entries from this unit will be
|
||
stored in the pool until existing entries are faulted back or written out to disk. See the kernel's
|
||
<ulink url="https://www.kernel.org/doc/html/latest/admin-guide/mm/zswap.html">Zswap</ulink> documentation for more details.</para>
|
||
|
||
<para>Takes a size in bytes. If the value is suffixed with K, M, G or T, the specified size is
|
||
parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes (with the base 1024), respectively. If assigned the
|
||
special value <literal>infinity</literal>, no limit is applied. These settings control the
|
||
<literal>memory.zswap.max</literal> control group attribute. For details about this control group attribute,
|
||
see <ulink url="https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files">Memory Interface Files</ulink>.</para>
|
||
|
||
<para>While <varname>StartupMemoryZSwapMax=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>MemoryZSwapMax=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupMemoryZSwapMax=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v253"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>AllowedMemoryNodes=</varname></term>
|
||
<term><varname>StartupAllowedMemoryNodes=</varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>cpuset</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Restrict processes to be executed on specific memory NUMA nodes. Takes a list of memory NUMA node indices
or ranges separated by either whitespace or commas. Memory NUMA node ranges are specified by the lower and upper
NUMA node indices separated by a dash.</para>
|
||
|
||
<para>Setting <varname>AllowedMemoryNodes=</varname> or <varname>StartupAllowedMemoryNodes=</varname> doesn't
|
||
guarantee that all of the memory NUMA nodes will be used by the processes as it may be limited by parent units.
|
||
The effective configuration is reported as <varname>EffectiveMemoryNodes=</varname>.</para>
|
||
|
||
<para>While <varname>StartupAllowedMemoryNodes=</varname> applies to the startup and shutdown phases of the system,
|
||
<varname>AllowedMemoryNodes=</varname> applies to normal runtime of the system, and if the former is not set also to
|
||
the startup and shutdown phases. Using <varname>StartupAllowedMemoryNodes=</varname> allows prioritizing specific services at
|
||
boot-up and shutdown differently than during normal runtime.</para>
|
||
|
||
<para>These settings are supported only with the unified control group hierarchy.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v244"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>Process Accounting and Control</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>TasksAccounting=</varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>pids</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Turn on task accounting for this unit. Takes a boolean argument. If enabled, the kernel will
|
||
keep track of the total number of tasks in the unit and its children. This number includes both
|
||
kernel threads and userspace processes, with each thread counted individually. Note that turning on
|
||
task accounting for one unit will also implicitly turn it on for all units contained in the same
|
||
slice and for all its parent slices and the units contained therein. The system default for this
|
||
setting may be controlled with <varname>DefaultTasksAccounting=</varname> in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v227"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>TasksMax=<replaceable>N</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>pids</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Specify the maximum number of tasks that may be created in the unit. This ensures that the
|
||
number of tasks accounted for the unit (see above) stays below a specific limit. This either takes
|
||
an absolute number of tasks or a percentage value that is taken relative to the configured maximum
|
||
number of tasks on the system. If assigned the special value <literal>infinity</literal>, no tasks
|
||
limit is applied. This controls the <literal>pids.max</literal> control group attribute. For
|
||
details about this control group attribute, see the
<ulink url="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#pid">pids controller</ulink>
documentation.</para>
|
||
|
||
<para>The system default for this setting may be controlled with
|
||
<varname>DefaultTasksMax=</varname> in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
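<para>Example (a minimal sketch; the limit is illustrative and not a recommendation):
<programlisting>[Service]
TasksAccounting=yes
# Allow at most 512 tasks (processes and threads) in this unit.
TasksMax=512</programlisting>
A relative limit may be given instead, for example <literal>TasksMax=10%</literal>, which is taken
relative to the configured maximum number of tasks on the system.</para>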
|
||
|
||
<xi:include href="version-info.xml" xpointer="v227"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>IO Accounting and Control</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>IOAccounting=</varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>io</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Turn on Block I/O accounting for this unit, if the unified control group hierarchy is used on the
|
||
system. Takes a boolean argument. Note that turning on block I/O accounting for one unit will also implicitly
|
||
turn it on for all units contained in the same slice and for all its parent slices and the units contained
|
||
therein. The system default for this setting may be controlled with <varname>DefaultIOAccounting=</varname>
|
||
in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v230"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>IOWeight=<replaceable>weight</replaceable></varname></term>
|
||
<term><varname>StartupIOWeight=<replaceable>weight</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>io</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Set the default overall block I/O weight for the executed processes, if the unified control
|
||
group hierarchy is used on the system. Takes a single weight value (between 1 and 10000) to set the
|
||
default block I/O weight. This controls the <literal>io.weight</literal> control group attribute,
|
||
which defaults to 100. For details about this control group attribute, see <ulink
|
||
url="https://docs.kernel.org/admin-guide/cgroup-v2.html#io-interface-files">IO
|
||
Interface Files</ulink>. The available I/O bandwidth is split up among all units within one slice
|
||
relative to their block I/O weight. A higher weight means more I/O bandwidth, a lower weight means
|
||
less.</para>
|
||
|
||
<para>While <varname>StartupIOWeight=</varname> applies
|
||
to the startup and shutdown phases of the system,
|
||
<varname>IOWeight=</varname> applies to the later runtime of
|
||
the system, and if the former is not set also to the startup
|
||
and shutdown phases. This allows prioritizing specific services at boot-up
|
||
and shutdown differently than during runtime.</para>
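<para>Example (a minimal sketch; the weights are illustrative):
<programlisting>[Service]
# Half the default weight of 100 during normal runtime...
IOWeight=50
# ...but a larger share of I/O bandwidth while the system boots or shuts down.
StartupIOWeight=500</programlisting>
Weights only matter relative to the weights of the other units within the same slice.</para>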
|
||
|
||
<xi:include href="version-info.xml" xpointer="v230"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>IODeviceWeight=<replaceable>device</replaceable> <replaceable>weight</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>io</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Set the per-device overall block I/O weight for the executed processes, if the unified control group
|
||
hierarchy is used on the system. Takes a space-separated pair of a file path and a weight value to specify
|
||
the device specific weight value, between 1 and 10000. (Example: <literal>/dev/sda 1000</literal>). The file
|
||
path may be specified as path to a block device node or as any other file, in which case the backing block
|
||
device of the file system of the file is determined. This controls the <literal>io.weight</literal> control
|
||
group attribute, which defaults to 100. Use this option multiple times to set weights for multiple devices.
|
||
For details about this control group attribute, see <ulink
|
||
url="https://docs.kernel.org/admin-guide/cgroup-v2.html#io-interface-files">IO Interface Files</ulink>.</para>
|
||
|
||
<para>The specified device node should reference a block device that has an I/O scheduler
|
||
associated, i.e. should not refer to partition or loopback block devices, but to the originating,
|
||
physical device. When a path to a regular file or directory is specified, an attempt is made to
discover the correct originating device backing the file system of the specified path. This works
|
||
correctly only for simpler cases, where the file system is directly placed on a partition or
|
||
physical block device, or where simple 1:1 encryption using dm-crypt/LUKS is used. This discovery
|
||
does not cover complex storage and in particular RAID and volume management storage devices.</para>
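<para>Example (a minimal sketch, assuming <filename>/dev/sda</filename> and
<filename>/dev/sdb</filename> are whole physical disks with an I/O scheduler associated; the device
names and weights are illustrative):
<programlisting>[Service]
IOWeight=100
# Per-device overrides, one assignment per device.
IODeviceWeight=/dev/sda 500
IODeviceWeight=/dev/sdb 50</programlisting></para>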
|
||
|
||
<xi:include href="version-info.xml" xpointer="v230"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>IOReadBandwidthMax=<replaceable>device</replaceable> <replaceable>bytes</replaceable></varname></term>
|
||
<term><varname>IOWriteBandwidthMax=<replaceable>device</replaceable> <replaceable>bytes</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>io</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Set the per-device overall block I/O bandwidth maximum limit for the executed processes, if the unified
|
||
control group hierarchy is used on the system. This limit is not work-conserving and the executed processes
|
||
are not allowed to use more even if the device has idle capacity. Takes a space-separated pair of a file
|
||
path and a bandwidth value (in bytes per second) to specify the device specific bandwidth. The file path may
|
||
be a path to a block device node, or any other file, in which case the backing block device of the file
system of the file is used. If the bandwidth is suffixed with K, M, G, or T, the specified bandwidth is
parsed as Kilobytes, Megabytes, Gigabytes, or Terabytes, respectively, to the base of 1000. (Example:
"/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 5M"). This controls the <literal>io.max</literal> control
group attribute. Use this option multiple times to set bandwidth limits for multiple devices. For details
|
||
about this control group attribute, see <ulink
|
||
url="https://docs.kernel.org/admin-guide/cgroup-v2.html#io-interface-files">IO Interface Files</ulink>.
|
||
</para>
|
||
|
||
<para>Similar restrictions on block device discovery as for <varname>IODeviceWeight=</varname> apply, see above.</para>
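<para>Example (a minimal sketch; the device path and limits are illustrative):
<programlisting>[Service]
# Cap reads at 10 MB/s and writes at 5 MB/s (base 1000) on /dev/sda.
IOReadBandwidthMax=/dev/sda 10M
IOWriteBandwidthMax=/dev/sda 5M</programlisting></para>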
|
||
|
||
<xi:include href="version-info.xml" xpointer="v230"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>IOReadIOPSMax=<replaceable>device</replaceable> <replaceable>IOPS</replaceable></varname></term>
|
||
<term><varname>IOWriteIOPSMax=<replaceable>device</replaceable> <replaceable>IOPS</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>These settings control the <option>io</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Set the per-device overall block I/O IOs-Per-Second maximum limit for the executed processes, if the
|
||
unified control group hierarchy is used on the system. This limit is not work-conserving and the executed
|
||
processes are not allowed to use more even if the device has idle capacity. Takes a space-separated pair of
|
||
a file path and an IOPS value to specify the device specific IOPS. The file path may be a path to a block
|
||
device node, or any other file, in which case the backing block device of the file system of the file is
used. If the IOPS is suffixed with K, M, G, or T, the specified IOPS is parsed as KiloIOPS, MegaIOPS,
GigaIOPS, or TeraIOPS, respectively, to the base of 1000. (Example:
"/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 1K"). This controls the <literal>io.max</literal> control
group attribute. Use this option multiple times to set IOPS limits for multiple devices. For details about
|
||
this control group attribute, see <ulink
|
||
url="https://docs.kernel.org/admin-guide/cgroup-v2.html#io-interface-files">IO Interface Files</ulink>.
|
||
</para>
|
||
|
||
<para>Similar restrictions on block device discovery as for <varname>IODeviceWeight=</varname> apply, see above.</para>
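<para>Example (a minimal sketch; the device path and limits are illustrative):
<programlisting>[Service]
# Cap reads at 1000 and writes at 500 I/O operations per second on /dev/sda.
IOReadIOPSMax=/dev/sda 1K
IOWriteIOPSMax=/dev/sda 500</programlisting></para>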
|
||
|
||
<xi:include href="version-info.xml" xpointer="v230"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>IODeviceLatencyTargetSec=<replaceable>device</replaceable> <replaceable>target</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>This setting controls the <option>io</option> controller in the unified hierarchy.</para>
|
||
|
||
<para>Set the per-device average target I/O latency for the executed processes, if the unified control group
|
||
hierarchy is used on the system. Takes a file path and a timespan separated by a space to specify
|
||
the device specific latency target. (Example: "/dev/sda 25ms"). The file path may be specified
|
||
as path to a block device node or as any other file, in which case the backing block device of the file
|
||
system of the file is determined. This controls the <literal>io.latency</literal> control group
|
||
attribute. Use this option multiple times to set latency targets for multiple devices. For details about this
|
||
control group attribute, see <ulink
|
||
url="https://docs.kernel.org/admin-guide/cgroup-v2.html#io-interface-files">IO Interface Files</ulink>.</para>
|
||
|
||
<para>Implies <literal>IOAccounting=yes</literal>.</para>
|
||
|
||
<para>This setting is supported only if the unified control group hierarchy is used.</para>
|
||
|
||
<para>Similar restrictions on block device discovery as for <varname>IODeviceWeight=</varname> apply, see above.</para>
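<para>Example (a minimal sketch; the device path and target are illustrative):
<programlisting>[Service]
# Aim for an average I/O latency of 25ms on /dev/sda.
IODeviceLatencyTargetSec=/dev/sda 25ms</programlisting></para>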
|
||
|
||
<xi:include href="version-info.xml" xpointer="v240"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>Network Accounting and Control</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>IPAccounting=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Takes a boolean argument. If true, turns on IPv4 and IPv6 network traffic accounting for packets sent
|
||
or received by the unit. When this option is turned on, all IPv4 and IPv6 sockets created by any process of
|
||
the unit are accounted for.</para>
|
||
|
||
<para>When this option is used in socket units, it applies to all IPv4 and IPv6 sockets
|
||
associated with it (including both listening and connection sockets where this applies). Note that for
|
||
socket-activated services, this configuration setting and the accounting data of the service unit and the
|
||
socket unit are kept separate, and displayed separately. No propagation of the setting and the collected
|
||
statistics is done, in either direction. Moreover, any traffic sent or received on any of the socket unit's
|
||
sockets is accounted to the socket unit — and never to the service unit it might have activated, even if the
|
||
socket is used by it.</para>
|
||
|
||
<para>The system default for this setting may be controlled with <varname>DefaultIPAccounting=</varname> in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v235"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>IPAddressAllow=<replaceable>ADDRESS[/PREFIXLENGTH]…</replaceable></varname></term>
|
||
<term><varname>IPAddressDeny=<replaceable>ADDRESS[/PREFIXLENGTH]…</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>Turn on network traffic filtering for IP packets sent and received over
|
||
<constant>AF_INET</constant> and <constant>AF_INET6</constant> sockets. Both directives take a
|
||
space-separated list of IPv4 or IPv6 addresses, each optionally suffixed with an address prefix
|
||
length in bits after a <literal>/</literal> character. If the suffix is omitted, the address is
|
||
considered a host address, i.e. the filter covers the whole address (32 bits for IPv4, 128 bits for
|
||
IPv6).</para>
|
||
|
||
<para>The access lists configured with this option are applied to all sockets created by processes
|
||
of this unit (or in the case of socket units, associated with it). The lists are implicitly
|
||
combined with any lists configured for any of the parent slice units this unit might be a member
|
||
of. By default both access lists are empty. Both ingress and egress traffic is filtered by these
|
||
settings. In case of ingress traffic the source IP address is checked against these access lists,
|
||
in case of egress traffic the destination IP address is checked. The following rules are applied in
|
||
turn:</para>
|
||
|
||
<itemizedlist>
|
||
<listitem><para>Access is granted when the checked IP address matches an entry in the
|
||
<varname>IPAddressAllow=</varname> list.</para></listitem>
|
||
|
||
<listitem><para>Otherwise, access is denied when the checked IP address matches an entry in the
|
||
<varname>IPAddressDeny=</varname> list.</para></listitem>
|
||
|
||
<listitem><para>Otherwise, access is granted.</para></listitem>
|
||
</itemizedlist>
|
||
|
||
<para>In order to implement an allow-listing IP firewall, it is recommended to use an
|
||
<varname>IPAddressDeny=</varname><constant>any</constant> setting on an upper-level slice unit
|
||
(such as the root slice <filename>-.slice</filename> or the slice containing all system services
|
||
<filename>system.slice</filename> – see
|
||
<citerefentry><refentrytitle>systemd.special</refentrytitle><manvolnum>7</manvolnum></citerefentry>
|
||
for details on these slice units), plus individual per-service <varname>IPAddressAllow=</varname>
|
||
lines permitting network access to relevant services, and only them.</para>
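<para>Example (a minimal sketch of such a setup; the drop-in file names, the service name
<filename>example.service</filename> and the address range are hypothetical):
<programlisting># /etc/systemd/system/system.slice.d/10-ip-deny.conf
[Slice]
IPAddressDeny=any

# /etc/systemd/system/example.service.d/10-ip-allow.conf
[Service]
IPAddressAllow=localhost 192.168.1.0/24</programlisting>
Since a match in <varname>IPAddressAllow=</varname> is checked first, the service can exchange traffic
with the listed addresses, while all other traffic is denied by the setting inherited from the slice.</para>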
|
||
|
||
<para>Note that for socket-activated services, the IP access list configured on the socket unit
|
||
applies to all sockets associated with it directly, but not to any sockets created by the
|
||
ultimately activated services for it. Conversely, the IP access list configured for the service is
|
||
not applied to any sockets passed into the service via socket activation. Thus, it is usually a
|
||
good idea to replicate the IP access lists on both the socket and the service unit. Nevertheless,
|
||
it may make sense to maintain one list more open and the other one more restricted, depending on
|
||
the use case.</para>
|
||
|
||
<para>If these settings are used multiple times in the same unit the specified lists are combined. If an
|
||
empty string is assigned to these settings the specific access list is reset and all previous settings undone.</para>
|
||
|
||
<para>In place of explicit IPv4 or IPv6 address and prefix length specifications a small set of symbolic
|
||
names may be used. The following names are defined:</para>
|
||
|
||
<table>
|
||
<title>Special address/network names</title>
|
||
|
||
<tgroup cols='3'>
|
||
<colspec colname='name'/>
|
||
<colspec colname='definition'/>
|
||
<colspec colname='meaning'/>
|
||
|
||
<thead>
|
||
<row>
|
||
<entry>Symbolic Name</entry>
|
||
<entry>Definition</entry>
|
||
<entry>Meaning</entry>
|
||
</row>
|
||
</thead>
|
||
|
||
<tbody>
|
||
<row>
|
||
<entry><constant>any</constant></entry>
|
||
<entry>0.0.0.0/0 ::/0</entry>
|
||
<entry>Any host</entry>
|
||
</row>
|
||
|
||
<row>
|
||
<entry><constant>localhost</constant></entry>
|
||
<entry>127.0.0.0/8 ::1/128</entry>
|
||
<entry>All addresses on the local loopback</entry>
|
||
</row>
|
||
|
||
<row>
|
||
<entry><constant>link-local</constant></entry>
|
||
<entry>169.254.0.0/16 fe80::/64</entry>
|
||
<entry>All link-local IP addresses</entry>
|
||
</row>
|
||
|
||
<row>
|
||
<entry><constant>multicast</constant></entry>
|
||
<entry>224.0.0.0/4 ff00::/8</entry>
|
||
<entry>All IP multicasting addresses</entry>
|
||
</row>
|
||
</tbody>
|
||
</tgroup>
|
||
</table>
|
||
|
||
<para>Note that these settings might not be supported on some systems (for example if eBPF control group
|
||
support is not enabled in the underlying kernel or container manager). These settings will have no effect in
|
||
that case. If compatibility with such systems is desired it is hence recommended to not exclusively rely on
|
||
them for IP security.</para>
|
||
|
||
<xi:include href="cgroup-sandboxing.xml" xpointer="singular"/>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v235"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>SocketBindAllow=<replaceable>bind-rule</replaceable></varname></term>
|
||
<term><varname>SocketBindDeny=<replaceable>bind-rule</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>Allow or deny binding a socket address to a socket by matching it with the <replaceable>bind-rule</replaceable> and
|
||
applying a corresponding action if there is a match.</para>
|
||
|
||
<para><replaceable>bind-rule</replaceable> describes socket properties such as <replaceable>address-family</replaceable>,
|
||
<replaceable>transport-protocol</replaceable> and <replaceable>ip-ports</replaceable>.</para>
|
||
|
||
<para><replaceable>bind-rule</replaceable> :=
|
||
{ [<replaceable>address-family</replaceable><constant>:</constant>][<replaceable>transport-protocol</replaceable><constant>:</constant>][<replaceable>ip-ports</replaceable>] | <constant>any</constant> }</para>
|
||
|
||
<para><replaceable>address-family</replaceable> := { <constant>ipv4</constant> | <constant>ipv6</constant> }</para>
|
||
|
||
<para><replaceable>transport-protocol</replaceable> := { <constant>tcp</constant> | <constant>udp</constant> }</para>
|
||
|
||
<para><replaceable>ip-ports</replaceable> := { <replaceable>ip-port</replaceable> | <replaceable>ip-port-range</replaceable> }</para>
|
||
|
||
<para>An optional <replaceable>address-family</replaceable> expects <constant>ipv4</constant> or <constant>ipv6</constant> values.
|
||
If not specified, a rule will be matched for both IPv4 and IPv6 addresses and applied depending on other socket fields, e.g. <replaceable>transport-protocol</replaceable>,
|
||
<replaceable>ip-port</replaceable>.</para>
|
||
|
||
<para>An optional <replaceable>transport-protocol</replaceable> expects <constant>tcp</constant> or <constant>udp</constant> transport protocol names.
|
||
If not specified, a rule will be matched for any transport protocol.</para>
|
||
|
||
<para>An optional <replaceable>ip-port</replaceable> value must lie within the 1…65535 interval, inclusive, i.e.
|
||
dynamic port <constant>0</constant> is not allowed. A range of sequential ports is described by
|
||
<replaceable>ip-port-range</replaceable> := <replaceable>ip-port-low</replaceable><constant>-</constant><replaceable>ip-port-high</replaceable>,
|
||
where <replaceable>ip-port-low</replaceable> is smaller than or equal to <replaceable>ip-port-high</replaceable>
|
||
and both are within 1…65535 inclusively.</para>
|
||
|
||
<para>A special value <constant>any</constant> can be used to apply a rule to any address family, transport protocol and any port with a positive value.</para>
|
||
|
||
<para>To allow multiple rules assign <varname>SocketBindAllow=</varname> or <varname>SocketBindDeny=</varname> multiple times.
|
||
To clear the existing assignments pass an empty <varname>SocketBindAllow=</varname> or <varname>SocketBindDeny=</varname>
|
||
assignment.</para>
|
||
|
||
<para>For each of <varname>SocketBindAllow=</varname> and <varname>SocketBindDeny=</varname>, the maximum allowed number of assignments is
|
||
<constant>128</constant>.</para>
|
||
|
||
<itemizedlist>
|
||
<listitem><para>Binding to a socket is allowed when a socket address matches an entry in the
|
||
<varname>SocketBindAllow=</varname> list.</para></listitem>
|
||
|
||
<listitem><para>Otherwise, binding is denied when the socket address matches an entry in the
|
||
<varname>SocketBindDeny=</varname> list.</para></listitem>
|
||
|
||
<listitem><para>Otherwise, binding is allowed.</para></listitem>
|
||
</itemizedlist>
|
||
|
||
<para>The feature is implemented with <constant>cgroup/bind4</constant> and <constant>cgroup/bind6</constant> cgroup-bpf hooks.</para>
|
||
<para>Examples:<programlisting>…
# Allow binding IPv6 socket addresses with a port greater than or equal to 10000.
[Service]
SocketBindAllow=ipv6:10000-65535
SocketBindDeny=any
…
# Allow binding IPv4 and IPv6 socket addresses with 1234 and 4321 ports.
[Service]
SocketBindAllow=1234
SocketBindAllow=4321
SocketBindDeny=any
…
# Deny binding IPv6 socket addresses.
[Service]
SocketBindDeny=ipv6
…
# Deny binding IPv4 and IPv6 socket addresses.
[Service]
SocketBindDeny=any
…
# Allow binding only over TCP
[Service]
SocketBindAllow=tcp
SocketBindDeny=any
…
# Allow binding only over IPv6/TCP
[Service]
SocketBindAllow=ipv6:tcp
SocketBindDeny=any
…
# Allow binding ports within 10000-65535 range over IPv4/UDP.
[Service]
SocketBindAllow=ipv4:udp:10000-65535
SocketBindDeny=any
…</programlisting></para>
|
||
|
||
<xi:include href="cgroup-sandboxing.xml" xpointer="singular"/>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v249"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>RestrictNetworkInterfaces=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Takes a list of space-separated network interface names. This option restricts the network
|
||
interfaces that processes of this unit can use. By default processes can only use the network interfaces
|
||
listed (allow-list). If the first character of the rule is <literal>~</literal>, the effect is inverted:
|
||
the processes can only use network interfaces not listed (deny-list).
|
||
</para>
|
||
|
||
<para>This option can appear multiple times, in which case the network interface names are merged. If the
|
||
empty string is assigned the set is reset, and all prior assignments will have no effect.
|
||
</para>
|
||
|
||
<para>If you specify both types of this option (i.e. allow-listing and deny-listing), the first encountered
|
||
will take precedence and will dictate the default action (allow vs deny). Then the next occurrences of this
|
||
option will add or delete the listed network interface names from the set, depending on its type and the
|
||
default action.
|
||
</para>
|
||
|
||
<para>The loopback interface ("lo") is not treated in any special way; you have to configure it explicitly
|
||
in the unit file.
|
||
</para>
|
||
<para>Example 1: allow-list
<programlisting>
RestrictNetworkInterfaces=eth1
RestrictNetworkInterfaces=eth2</programlisting>
Programs in the unit will only be able to use the eth1 and eth2 network
interfaces.
</para>

<para>Example 2: deny-list
<programlisting>
RestrictNetworkInterfaces=~eth1 eth2</programlisting>
Programs in the unit will be able to use any network interface but eth1 and eth2.
</para>

<para>Example 3: mixed
<programlisting>
RestrictNetworkInterfaces=eth1 eth2
RestrictNetworkInterfaces=~eth1</programlisting>
Programs in the unit will only be able to use the eth2 network interface.
</para>
|
||
|
||
<xi:include href="cgroup-sandboxing.xml" xpointer="singular"/>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v250"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>NFTSet=</varname><replaceable>source</replaceable>:<replaceable>family</replaceable>:<replaceable>table</replaceable>:<replaceable>set</replaceable></term>
|
||
<listitem>
|
||
<para>This setting provides a method for integrating dynamic cgroup, user and group IDs into
|
||
firewall rules with <ulink url="https://netfilter.org/projects/nftables/index.html">NFT</ulink>
|
||
sets. The benefit of using this setting is to be able to use the IDs as selectors in firewall rules
easily and this in turn allows more fine-grained filtering. NFT rules for cgroup matching use
numeric cgroup IDs, which change every time a service is restarted, making them hard to use in
a systemd environment otherwise. Dynamic and random IDs used by <varname>DynamicUser=</varname> can
also be integrated with this setting.</para>
|
||
|
||
<para>This option expects a whitespace-separated list of NFT set definitions. Each definition
|
||
consists of a colon-separated tuple of source type (one of <literal>cgroup</literal>,
|
||
<literal>user</literal> or <literal>group</literal>), NFT address family (one of
|
||
<literal>arp</literal>, <literal>bridge</literal>, <literal>inet</literal>, <literal>ip</literal>,
|
||
<literal>ip6</literal>, or <literal>netdev</literal>), table name and set name. The names of tables
|
||
and sets must conform to lexical restrictions of NFT table names. The type of the element used in
|
||
the NFT filter must match the type implied by the directive (<literal>cgroup</literal>,
|
||
<literal>user</literal> or <literal>group</literal>) as shown in the table below. When a control
|
||
group or a unit is realized, the corresponding ID will be appended to the NFT sets and it will
be removed when the control group or unit is removed. <command>systemd</command> only inserts
|
||
elements to (or removes from) the sets, so the related NFT rules, tables and sets must be prepared
|
||
elsewhere in advance. Failures to manage the sets will be ignored.</para>
|
||
|
||
<table>
|
||
<title>Defined <varname>source type</varname> values</title>
|
||
<tgroup cols='3'>
|
||
<colspec colname='source type'/>
|
||
<colspec colname='description'/>
|
||
<colspec colname='NFT type name'/>
|
||
<thead>
|
||
<row>
|
||
<entry>Source type</entry>
|
||
<entry>Description</entry>
|
||
<entry>Corresponding NFT type name</entry>
|
||
</row>
|
||
</thead>
|
||
|
||
<tbody>
|
||
<row>
|
||
<entry><literal>cgroup</literal></entry>
|
||
<entry>control group ID</entry>
|
||
<entry><literal>cgroupsv2</literal></entry>
|
||
</row>
|
||
<row>
|
||
<entry><literal>user</literal></entry>
|
||
<entry>user ID</entry>
|
||
<entry><literal>meta skuid</literal></entry>
|
||
</row>
|
||
<row>
|
||
<entry><literal>group</literal></entry>
|
||
<entry>group ID</entry>
|
||
<entry><literal>meta skgid</literal></entry>
|
||
</row>
|
||
</tbody>
|
||
</tgroup>
|
||
</table>
|
||
|
||
<para>If the firewall rules are reinstalled so that the contents of NFT sets are destroyed, command
|
||
<command>systemctl daemon-reload</command> can be used to refill the sets.</para>
|
||
|
||
<para>Example:
<programlisting>[Service]
NFTSet=cgroup:inet:filter:my_service user:inet:filter:serviceuser
</programlisting>
Corresponding NFT rules:
<programlisting>table inet filter {
        set my_service {
                type cgroupsv2
        }
        set serviceuser {
                typeof meta skuid
        }
        chain x {
                socket cgroupv2 level 2 @my_service accept
                drop
        }
        chain y {
                meta skuid @serviceuser accept
                drop
        }
}</programlisting>
</para>
|
||
<xi:include href="version-info.xml" xpointer="v255"/></listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>BPF Programs</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>IPIngressFilterPath=<replaceable>BPF_FS_PROGRAM_PATH</replaceable></varname></term>
|
||
<term><varname>IPEgressFilterPath=<replaceable>BPF_FS_PROGRAM_PATH</replaceable></varname></term>
|
||
|
||
<listitem>
|
||
<para>Add custom network traffic filters implemented as BPF programs, applying to all IP packets
|
||
sent and received over <constant>AF_INET</constant> and <constant>AF_INET6</constant> sockets.
|
||
Takes an absolute path to a pinned BPF program in the BPF virtual filesystem (<filename>/sys/fs/bpf/</filename>).
|
||
</para>
|
||
|
||
<para>The filters configured with this option are applied to all sockets created by processes
|
||
of this unit (or in the case of socket units, associated with it). The filters are loaded in addition
|
||
to the filters of any parent slice units this unit might be a member of, as well as any
|
||
<varname>IPAddressAllow=</varname> and <varname>IPAddressDeny=</varname> filters in any of these units.
|
||
By default there are no filters specified.</para>
|
||
|
||
<para>If these settings are used multiple times in the same unit all the specified programs are attached. If an
|
||
empty string is assigned to these settings the program list is reset and all previously specified programs are ignored.</para>
|
||
|
||
<para>If the path <replaceable>BPF_FS_PROGRAM_PATH</replaceable> in an <varname>IPIngressFilterPath=</varname> assignment
is already being handled by a <varname>BPFProgram=</varname> ingress hook, e.g.
<varname>BPFProgram=</varname><constant>ingress</constant>:<replaceable>BPF_FS_PROGRAM_PATH</replaceable>,
the assignment will still be considered valid and the program will be attached to a cgroup. The same applies to an
<varname>IPEgressFilterPath=</varname> path and the <constant>egress</constant> hook.</para>
|
||
|
||
<para>Note that for socket-activated services, the IP filter programs configured on the socket unit apply to
|
||
all sockets associated with it directly, but not to any sockets created by the ultimately activated services
|
||
for it. Conversely, the IP filter programs configured for the service are not applied to any sockets passed into
|
||
the service via socket activation. Thus, it is usually a good idea to replicate the IP filter programs on both
the socket and the service unit. However, it often makes sense to maintain one configuration more open and the other
one more restricted, depending on the use case.</para>
|
||
|
||
<para>Note that these settings might not be supported on some systems (for example if eBPF control group
|
||
support is not enabled in the underlying kernel or container manager). These settings will fail the service in
|
||
that case. If compatibility with such systems is desired it is hence recommended to attach your filter manually
|
||
(requires <varname>Delegate=</varname><constant>yes</constant>) instead of using this setting.</para>
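<para>Example (a minimal sketch, assuming suitable BPF programs have already been loaded and pinned
at the hypothetical paths shown):
<programlisting>[Service]
# Both paths are placeholders for programs pinned in the BPF file system.
IPIngressFilterPath=/sys/fs/bpf/example-ingress
IPEgressFilterPath=/sys/fs/bpf/example-egress</programlisting></para>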
|
||
|
||
<xi:include href="version-info.xml" xpointer="v243"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>BPFProgram=<replaceable>type</replaceable>:<replaceable>program-path</replaceable></varname></term>
|
||
<listitem>
|
||
<para><varname>BPFProgram=</varname> allows attaching custom BPF programs to the cgroup of a
|
||
unit. (This generalizes the functionality exposed via <varname>IPEgressFilterPath=</varname> and
|
||
<varname>IPIngressFilterPath=</varname> for other hooks.) Cgroup-bpf hooks in the form of BPF
|
||
programs loaded to the BPF filesystem are attached with cgroup-bpf attach flags determined by the
|
||
unit. For details about attachment types and flags see <ulink
|
||
url="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/include/uapi/linux/bpf.h"><filename>bpf.h</filename></ulink>. Also
|
||
refer to the general <ulink url="https://docs.kernel.org/bpf/">BPF documentation</ulink>.</para>
|
||
|
||
<para>A BPF program specification consists of a BPF program type and a program path in
the file system, separated by <literal>:</literal>:
<replaceable>type</replaceable>:<replaceable>program-path</replaceable>.</para>
|
||
|
||
<para>The BPF program type is equivalent to the BPF attach type used in
<citerefentry project='mankier'><refentrytitle>bpftool</refentrytitle><manvolnum>8</manvolnum></citerefentry>.
It may be one of
<constant>egress</constant>,
<constant>ingress</constant>,
<constant>sock_create</constant>,
<constant>sock_ops</constant>,
<constant>device</constant>,
<constant>bind4</constant>,
<constant>bind6</constant>,
<constant>connect4</constant>,
<constant>connect6</constant>,
<constant>post_bind4</constant>,
<constant>post_bind6</constant>,
<constant>sendmsg4</constant>,
<constant>sendmsg6</constant>,
<constant>sysctl</constant>,
<constant>recvmsg4</constant>,
<constant>recvmsg6</constant>,
<constant>getsockopt</constant>,
or <constant>setsockopt</constant>.
</para>
|
||
|
||
<para>The specified program path must be an absolute path referencing a BPF program inode in the
|
||
bpffs file system (which generally means it must begin with <filename>/sys/fs/bpf/</filename>). If
|
||
a specified program does not exist (i.e. has not been uploaded to the BPF subsystem of the kernel
|
||
yet), it will not be installed but unit activation will continue (a warning will be printed to the
|
||
logs).</para>
|
||
|
||
<para>Setting <varname>BPFProgram=</varname> to an empty value makes previous assignments
|
||
ineffective.</para>
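<para>For illustration, a drop-in might reset assignments made earlier and then attach a different program.
The path below is a hypothetical example and assumes the program has already been pinned in bpffs:
<programlisting>[Service]
# An empty assignment makes all earlier BPFProgram= entries ineffective
BPFProgram=
# Hypothetical pinned program
BPFProgram=egress:/sys/fs/bpf/replacement-egress-hook
</programlisting></para>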
|
||
|
||
<para>Multiple assignments of the same program type/path pair have the same effect as a single
|
||
assignment: the program will be attached just once.</para>
|
||
|
||
<para>If a BPF <constant>egress</constant> program pinned at <replaceable>program-path</replaceable> is already being
handled by <varname>IPEgressFilterPath=</varname>, the <varname>BPFProgram=</varname>
assignment will still be considered valid and the program will be attached to the cgroup.
The same applies to the <constant>ingress</constant> hook and an <varname>IPIngressFilterPath=</varname> assignment.</para>
|
||
|
||
<para>BPF programs passed with <varname>BPFProgram=</varname> are attached to the cgroup of a unit
with the BPF attach flag <constant>multi</constant>, which allows further attachments of the same
<replaceable>type</replaceable> within the cgroup hierarchy rooted at the unit's cgroup.</para>
|
||
|
||
<para>Examples:<programlisting>BPFProgram=egress:/sys/fs/bpf/egress-hook
BPFProgram=bind6:/sys/fs/bpf/sock-addr-hook
</programlisting></para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v249"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>Device Access</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>DeviceAllow=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Control access to specific device nodes by the executed processes. Takes two space-separated
|
||
strings: a device node specifier followed by a combination of <constant>r</constant>,
|
||
<constant>w</constant>, <constant>m</constant> to control <emphasis>r</emphasis>eading,
|
||
<emphasis>w</emphasis>riting, or creation of the specific device nodes by the unit
|
||
(<emphasis>m</emphasis>knod), respectively. This functionality is implemented using eBPF
|
||
filtering.</para>
|
||
|
||
<para>When access to <emphasis>all</emphasis> physical devices should be disallowed,
|
||
<varname>PrivateDevices=</varname> may be used instead. See
|
||
<citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
|
||
</para>
|
||
|
||
<para>The device node specifier is either a path to a device node in the file system, starting with
|
||
<filename>/dev/</filename>, or a string starting with either <literal>char-</literal> or
|
||
<literal>block-</literal> followed by a device group name, as listed in
|
||
<filename>/proc/devices</filename>. The latter is useful to allow-list all current and future
|
||
devices belonging to a specific device group at once. The device group is matched according to
filename globbing rules; you may hence use the <literal>*</literal> and <literal>?</literal>
|
||
wildcards. (Note that such globbing wildcards are not available for device node path
|
||
specifications!) In order to match device nodes by numeric major/minor, use device node paths in
|
||
the <filename>/dev/char/</filename> and <filename>/dev/block/</filename> directories. However,
|
||
matching devices by major/minor is generally not recommended as assignments are neither stable nor
|
||
portable between systems or different kernel versions.</para>
|
||
|
||
<para>Examples: <filename>/dev/sda5</filename> is a path to a device node, referring to an ATA or
|
||
SCSI block device. <literal>char-pts</literal> and <literal>char-alsa</literal> are specifiers for
|
||
all pseudo TTYs and all ALSA sound devices, respectively. <literal>char-cpu/*</literal> is a
|
||
specifier matching all CPU related device groups.</para>
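<para>Written as unit settings, the specifiers above might be used as follows. The access modes are chosen
for illustration only:
<programlisting>[Service]
# Read and write access to a single block device node
DeviceAllow=/dev/sda5 rw
# Read and write access to all pseudo TTYs
DeviceAllow=char-pts rw
# Read-only access to all ALSA sound devices
DeviceAllow=char-alsa r
</programlisting></para>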
|
||
|
||
<para>Note that allow lists defined this way should only reference device groups which are
resolvable at the time the unit is started. Any device groups not resolvable then are not added to
the device allow list. In order to work around this limitation, consider extending service units
with a pair of <command>After=modprobe@xyz.service</command> and
<command>Wants=modprobe@xyz.service</command> lines that load the necessary kernel module
implementing the device group if missing.
Example: <programlisting>…
[Unit]
Wants=modprobe@loop.service
After=modprobe@loop.service

[Service]
DeviceAllow=block-loop
DeviceAllow=/dev/loop-control
…</programlisting></para>
|
||
|
||
<xi:include href="cgroup-sandboxing.xml" xpointer="singular"/>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>DevicePolicy=auto|closed|strict</varname></term>
|
||
|
||
<listitem>
|
||
<para>
|
||
Control the policy for allowing device access:
|
||
</para>
|
||
<variablelist>
|
||
<varlistentry>
|
||
<term><option>strict</option></term>
|
||
<listitem>
|
||
<para>means to only allow types of access that are
|
||
explicitly specified.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><option>closed</option></term>
|
||
<listitem>
|
||
<para>in addition, allows access to standard pseudo
|
||
devices including
|
||
<filename>/dev/null</filename>,
|
||
<filename>/dev/zero</filename>,
|
||
<filename>/dev/full</filename>,
|
||
<filename>/dev/random</filename>, and
|
||
<filename>/dev/urandom</filename>.
|
||
</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><option>auto</option></term>
|
||
<listitem>
|
||
<para>
|
||
in addition, allows access to all devices if no
|
||
explicit <varname>DeviceAllow=</varname> is present.
|
||
This is the default.
|
||
</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
</variablelist>
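<para>A short sketch of combining the two settings, assuming a service that needs only the standard pseudo
devices plus one serial device (the device path is a hypothetical example):
<programlisting>[Service]
# Permit the standard pseudo devices in addition to what is listed explicitly
DevicePolicy=closed
# Hypothetical device node the service needs to read and write
DeviceAllow=/dev/ttyUSB0 rw
</programlisting></para>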
|
||
|
||
<xi:include href="cgroup-sandboxing.xml" xpointer="singular"/>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>Control Group Management</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>Slice=</varname></term>
|
||
|
||
<listitem>
|
||
<para>The name of the slice unit to place the unit
|
||
in. Defaults to <filename>system.slice</filename> for all
|
||
non-instantiated units of all unit types (except for slice
units themselves, see below). Instance units are by default
|
||
placed in a subslice of <filename>system.slice</filename>
|
||
that is named after the template name.</para>
|
||
|
||
<para>This option may be used to arrange systemd units in a
|
||
hierarchy of slices each of which might have resource
|
||
settings applied.</para>
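<para>For example, a service could be placed in a dedicated slice that carries its own resource settings.
The unit names, the binary path, and the weight below are hypothetical and serve only as a sketch:
<programlisting># batch.slice (hypothetical slice unit)
[Slice]
CPUWeight=50

# nightly-report.service (hypothetical service unit)
[Service]
Slice=batch.slice
ExecStart=/usr/local/bin/nightly-report
</programlisting></para>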
|
||
|
||
<para>For units of type slice, the only accepted value for
|
||
this setting is the parent slice. Since the name of a slice
|
||
unit implies the parent slice, it is hence redundant to ever
|
||
set this parameter directly for slice units.</para>
|
||
|
||
<para>Special care should be taken when relying on the default slice assignment in templated service units
|
||
that have <varname>DefaultDependencies=no</varname> set, see
|
||
<citerefentry><refentrytitle>systemd.service</refentrytitle><manvolnum>5</manvolnum></citerefentry>, section
|
||
"Default Dependencies" for details.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v208"/>
|
||
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>Delegate=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Turns on delegation of further resource control partitioning to processes of the unit. Units
|
||
where this is enabled may create and manage their own private subhierarchy of control groups below
|
||
the control group of the unit itself. For unprivileged services (i.e. those using the
|
||
<varname>User=</varname> setting) the unit's control group will be made accessible to the relevant
|
||
user.</para>
|
||
|
||
<para>When enabled, the service manager will refrain from manipulating control groups or moving
|
||
processes below the unit's control group, so that a clear concept of ownership is established: the
|
||
control group tree at the level of the unit's control group and above (i.e. towards the root
|
||
control group) is owned and managed by the service manager of the host, while the control group
|
||
tree below the unit's control group is owned and managed by the unit itself.</para>
|
||
|
||
<para>Takes either a boolean argument or a (possibly empty) list of control group controller names.
|
||
If true, delegation is turned on, and all supported controllers are enabled for the unit, making
|
||
them available to the unit's processes for management. If false, delegation is turned off entirely
|
||
(and no additional controllers are enabled). If set to a list of controllers, delegation is turned
|
||
on, and the specified controllers are enabled for the unit. Assigning the empty string will enable
|
||
delegation, but reset the list of controllers, and all assignments prior to this will have no
|
||
effect. Note that additional controllers other than the ones specified might be made available as
|
||
well, depending on configuration of the containing slice unit or other units contained in it.
|
||
Defaults to false.</para>
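<para>As an illustration of the accepted forms (the controller selection is an arbitrary example):
<programlisting>[Service]
# Boolean form: turn on delegation and enable all supported controllers
Delegate=yes

# List form (alternative): turn on delegation and enable only the listed controllers
#Delegate=cpu memory pids
</programlisting></para>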
|
||
|
||
<para>Note that controller delegation to less privileged code is only safe on the unified control
|
||
group hierarchy. Accordingly, access to the specified controllers will not be granted to
|
||
unprivileged services on the legacy hierarchy, even when requested.</para>
|
||
|
||
<xi:include href="supported-controllers.xml" xpointer="controllers-text" />
|
||
|
||
<para>Not all of these controllers are available on all kernels however, and some are specific to
|
||
the unified hierarchy while others are specific to the legacy hierarchy. Also note that the kernel
|
||
might support further controllers, which aren't covered here yet as delegation is either not
|
||
supported at all for them or not defined cleanly.</para>
|
||
|
||
<para>Note that because of the hierarchical nature of control groups, any controllers that are
delegated will be enabled for the parent and sibling units of the unit with delegation.</para>
|
||
|
||
<para>For further details on the delegation model consult <ulink
|
||
url="https://systemd.io/CGROUP_DELEGATION">Control Group APIs and Delegation</ulink>.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v218"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>DelegateSubgroup=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Place unit processes in the specified subgroup of the unit's control group. Takes a valid
|
||
control group name (not a path!) as parameter, or an empty string to turn this feature
|
||
off. Defaults to off. The control group name must be usable as filename and avoid conflicts with
|
||
the kernel's control group attribute files (i.e. <filename>cgroup.procs</filename> is not an
|
||
acceptable name, since the kernel exposes a native control group attribute file by that name). This
|
||
option has no effect unless control group delegation is turned on via <varname>Delegate=</varname>,
|
||
see above. Note that this setting only applies to "main" processes of a unit, i.e. for services to
|
||
<varname>ExecStart=</varname>, but not for <varname>ExecReload=</varname> and similar. If
|
||
delegation is enabled, the latter are always placed inside a subgroup named
|
||
<filename>.control</filename>. The specified subgroup is automatically created (and potentially
|
||
ownership is passed to the unit's configured user/group) when a process is started in it.</para>
|
||
|
||
<para>This option is useful to avoid manually moving the invoked process into a subgroup after it
|
||
has been started. Since no processes should live in inner nodes of the control group tree, it is
almost always necessary to run the main ("supervising") process of a unit that has delegation
|
||
turned on in a subgroup.</para>
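<para>A minimal sketch, assuming a hypothetical supervisor binary and subgroup name:
<programlisting>[Service]
Delegate=yes
# The main process is started in the "supervisor" subgroup of the unit's control group
DelegateSubgroup=supervisor
# Hypothetical binary path
ExecStart=/usr/local/bin/example-supervisor
</programlisting></para>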
|
||
|
||
<xi:include href="version-info.xml" xpointer="v254"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>DisableControllers=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Disables controllers from being enabled for a unit's children. If a controller listed is
|
||
already in use in its subtree, the controller will be removed from the subtree. This can be used to
prevent configuration in child units from implicitly or explicitly enabling a controller.
|
||
Defaults to empty.</para>
|
||
|
||
<para>Multiple controllers may be specified, separated by spaces. You may also pass
|
||
<varname>DisableControllers=</varname> multiple times, in which case each new instance adds another controller
|
||
to disable. Passing <varname>DisableControllers=</varname> by itself with no controller name present resets
|
||
the disabled controller list.</para>
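<para>The accumulation and reset behavior can be sketched as follows (the controller names are picked for
illustration):
<programlisting>[Slice]
# Both lines accumulate: cpu, io and memory are disabled for children
DisableControllers=cpu
DisableControllers=io memory

# An empty assignment would reset the list again
#DisableControllers=
</programlisting></para>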
|
||
|
||
<para>It may not be possible to disable a controller after units have been started, if the unit or
|
||
any child of the unit in question delegates controllers to its children, as any delegated subtree
|
||
of the cgroup hierarchy is unmanaged by systemd.</para>
|
||
|
||
<xi:include href="supported-controllers.xml" xpointer="controllers-text" />
|
||
|
||
<xi:include href="version-info.xml" xpointer="v240"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>Memory Pressure Control</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>ManagedOOMSwap=auto|kill</varname></term>
|
||
<term><varname>ManagedOOMMemoryPressure=auto|kill</varname></term>
|
||
|
||
<listitem>
|
||
<para>Specifies how
|
||
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
|
||
will act on this unit's cgroups. Defaults to <option>auto</option>.</para>
|
||
|
||
<para>When set to <option>kill</option>, the unit becomes a candidate for monitoring by
|
||
<command>systemd-oomd</command>. If the cgroup passes the limits set by
|
||
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry> or
|
||
the unit configuration, <command>systemd-oomd</command> will select a descendant cgroup and send
|
||
<constant>SIGKILL</constant> to all of the processes under it. You can find more details on
|
||
candidates and kill behavior at
|
||
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
|
||
and
|
||
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
|
||
|
||
<para>Setting either of these properties to <option>kill</option> will also result in
|
||
<varname>After=</varname> and <varname>Wants=</varname> dependencies on
|
||
<filename>systemd-oomd.service</filename> unless <varname>DefaultDependencies=no</varname>.</para>
|
||
|
||
<para>When set to <option>auto</option>, <command>systemd-oomd</command> will not actively use this
|
||
cgroup's data for monitoring and detection. However, if an ancestor cgroup has one of these
|
||
properties set to <option>kill</option>, a unit with <option>auto</option> can still be a candidate
|
||
for <command>systemd-oomd</command> to terminate.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v247"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>ManagedOOMMemoryPressureLimit=</varname></term>
|
||
|
||
<listitem>
|
||
<para>Overrides the default memory pressure limit set by
|
||
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry> for
|
||
this unit (cgroup). Takes a percentage value between 0% and 100%, inclusive. This property is
|
||
ignored unless <varname>ManagedOOMMemoryPressure=</varname><option>kill</option>. Defaults to 0%,
|
||
which means to use the default set by
|
||
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
|
||
</para>
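<para>For illustration, a unit might opt into monitoring and override the pressure limit. The 60% value is an
arbitrary example, not a recommendation:
<programlisting>[Service]
# Make this unit's cgroup a candidate for monitoring by systemd-oomd
ManagedOOMMemoryPressure=kill
# Ignored unless ManagedOOMMemoryPressure=kill is set
ManagedOOMMemoryPressureLimit=60%
</programlisting></para>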
|
||
|
||
<xi:include href="version-info.xml" xpointer="v247"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>ManagedOOMPreference=none|avoid|omit</varname></term>
|
||
|
||
<listitem>
|
||
<para>Allows deprioritizing or omitting this unit's cgroup as a candidate when
|
||
<command>systemd-oomd</command> needs to act. Requires support for extended attributes (see
|
||
<citerefentry project='man-pages'><refentrytitle>xattr</refentrytitle><manvolnum>7</manvolnum></citerefentry>)
|
||
in order to use <option>avoid</option> or <option>omit</option>.</para>
|
||
|
||
<para>When calculating candidates to relieve swap usage, <command>systemd-oomd</command> will
|
||
only respect these extended attributes if the unit's cgroup is owned by root.</para>
|
||
|
||
<para>When calculating candidates to relieve memory pressure, <command>systemd-oomd</command>
|
||
will only respect these extended attributes if the unit's cgroup is owned by root, or if the
unit's cgroup owner and the owner of the monitored ancestor cgroup are the same. For example,
|
||
if <command>systemd-oomd</command> is calculating candidates for <filename>-.slice</filename>,
|
||
then extended attributes set on descendants of <filename>/user.slice/user-1000.slice/user@1000.service/</filename>
|
||
will be ignored because the descendants are owned by UID 1000, and <filename>-.slice</filename>
|
||
is owned by UID 0. But, if calculating candidates for
|
||
<filename>/user.slice/user-1000.slice/user@1000.service/</filename>, then extended attributes set
|
||
on the descendants would be respected.</para>
|
||
|
||
<para>If this property is set to <option>avoid</option>, the service manager will convey this to
|
||
<command>systemd-oomd</command>, which will only select this cgroup if there are no other viable
|
||
candidates.</para>
|
||
|
||
<para>If this property is set to <option>omit</option>, the service manager will convey this to
|
||
<command>systemd-oomd</command>, which will ignore this cgroup as a candidate and will not perform
|
||
any actions on it.</para>
|
||
|
||
<para>It is recommended to use <option>avoid</option> and <option>omit</option> sparingly, as they
can adversely affect <command>systemd-oomd</command>'s kill behavior. Also note that these extended
|
||
attributes are not applied recursively to cgroups under this unit's cgroup.</para>
|
||
|
||
<para>Defaults to <option>none</option> which means <command>systemd-oomd</command> will rank this
|
||
unit's cgroup as defined in
|
||
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
|
||
and <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
|
||
</para>
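<para>A minimal example of deprioritizing a unit:
<programlisting>[Service]
# Selected by systemd-oomd only if there are no other viable candidates
ManagedOOMPreference=avoid
</programlisting></para>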
|
||
|
||
<xi:include href="version-info.xml" xpointer="v248"/>
|
||
</listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemoryPressureWatch=</varname></term>
|
||
|
||
<listitem><para>Controls memory pressure monitoring for invoked processes. Takes one of
<literal>off</literal>, <literal>on</literal>, <literal>auto</literal> or <literal>skip</literal>. If
set to <literal>off</literal>, the service is told not to watch for memory pressure events, by setting the
<varname>$MEMORY_PRESSURE_WATCH</varname> environment variable to the literal string
<filename>/dev/null</filename>. If set to <literal>on</literal>, the service is told to watch for memory
pressure events. This enables memory accounting for the service, and ensures the
<filename>memory.pressure</filename> cgroup attribute file is accessible for reading and writing by the
service's user. It then sets the <varname>$MEMORY_PRESSURE_WATCH</varname> environment variable for
processes invoked by the unit to the file system path to this file. The threshold information
configured with <varname>MemoryPressureThresholdSec=</varname> is encoded in the
<varname>$MEMORY_PRESSURE_WRITE</varname> environment variable. If set to <literal>auto</literal>,
the protocol is enabled if memory accounting is enabled for the unit anyway, and disabled
otherwise. If set to <literal>skip</literal>, the logic is neither enabled nor disabled, and the two
environment variables are not set.</para>
|
||
|
||
<para>Note that services are free to use the two environment variables, but it's unproblematic if
|
||
they ignore them. Memory pressure handling must be implemented individually in each service, and
|
||
usually means different things for different software. For further details on memory pressure
|
||
handling see <ulink url="https://systemd.io/MEMORY_PRESSURE">Memory Pressure Handling in
|
||
systemd</ulink>.</para>
|
||
|
||
<para>Services implemented using
|
||
<citerefentry><refentrytitle>sd-event</refentrytitle><manvolnum>3</manvolnum></citerefentry> may use
|
||
<citerefentry><refentrytitle>sd_event_add_memory_pressure</refentrytitle><manvolnum>3</manvolnum></citerefentry>
|
||
to watch for and handle memory pressure events.</para>
|
||
|
||
<para>If not explicitly set, defaults to the <varname>DefaultMemoryPressureWatch=</varname> setting in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v254"/></listitem>
|
||
</varlistentry>
|
||
|
||
<varlistentry>
|
||
<term><varname>MemoryPressureThresholdSec=</varname></term>
|
||
|
||
<listitem><para>Sets the memory pressure threshold time for the memory pressure monitor configured via
<varname>MemoryPressureWatch=</varname>. Specifies the maximum allocation latency before a memory
pressure event is signalled to the service, per 2s window. If not specified, defaults to the
|
||
<varname>DefaultMemoryPressureThresholdSec=</varname> setting in
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
|
||
(which in turn defaults to 200ms). The specified value expects a time unit such as
|
||
<literal>ms</literal> or <literal>μs</literal>, see
|
||
<citerefentry><refentrytitle>systemd.time</refentrytitle><manvolnum>7</manvolnum></citerefentry> for
|
||
details on the permitted syntax.</para>
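<para>A sketch of enabling the protocol with a custom threshold. The 100ms value is an arbitrary example:
<programlisting>[Service]
MemoryPressureWatch=on
# Signal an event once allocation latency exceeds 100ms per 2s window
MemoryPressureThresholdSec=100ms
</programlisting></para>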
|
||
|
||
<xi:include href="version-info.xml" xpointer="v254"/></listitem>
|
||
</varlistentry>
|
||
</variablelist>
|
||
|
||
</refsect2><refsect2><title>Coredump Control</title>
|
||
|
||
<variablelist class='unit-directives'>
|
||
|
||
<varlistentry>
|
||
<term><varname>CoredumpReceive=</varname></term>
|
||
|
||
<listitem><para>Takes a boolean argument. This setting is used to enable coredump forwarding for containers
|
||
that belong to this unit's cgroup. Units with <varname>CoredumpReceive=yes</varname> must also be configured
|
||
with <varname>Delegate=yes</varname>. Defaults to false.</para>
|
||
|
||
<para>When <command>systemd-coredump</command> is handling a coredump for a process from a container,
|
||
if the container's leader process is a descendant of a cgroup with <varname>CoredumpReceive=yes</varname>
|
||
and <varname>Delegate=yes</varname>, then <command>systemd-coredump</command> will attempt to forward
|
||
the coredump to <command>systemd-coredump</command> within the container.</para>
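<para>For example, a unit running a container manager might combine both settings as follows:
<programlisting>[Service]
# Units with CoredumpReceive=yes must also be configured with Delegate=yes
Delegate=yes
CoredumpReceive=yes
</programlisting></para>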
|
||
|
||
<xi:include href="version-info.xml" xpointer="v255"/></listitem>
|
||
</varlistentry>
|
||
|
||
</variablelist>
|
||
</refsect2>
|
||
</refsect1>
|
||
|
||
<refsect1>
|
||
<title>History</title>
|
||
|
||
<variablelist>
|
||
<varlistentry>
|
||
<term>systemd 252</term>
|
||
<listitem><para>Options for controlling the Legacy Control Group Hierarchy (<ulink
url="https://docs.kernel.org/admin-guide/cgroup-v1/index.html">Control Groups version 1</ulink>)
are now fully deprecated:
<varname>CPUShares=<replaceable>weight</replaceable></varname>,
<varname>StartupCPUShares=<replaceable>weight</replaceable></varname>,
<varname>MemoryLimit=<replaceable>bytes</replaceable></varname>,
<varname>BlockIOAccounting=</varname>,
<varname>BlockIOWeight=<replaceable>weight</replaceable></varname>,
<varname>StartupBlockIOWeight=<replaceable>weight</replaceable></varname>,
<varname>BlockIODeviceWeight=<replaceable>device</replaceable>
<replaceable>weight</replaceable></varname>,
<varname>BlockIOReadBandwidth=<replaceable>device</replaceable>
<replaceable>bytes</replaceable></varname>,
<varname>BlockIOWriteBandwidth=<replaceable>device</replaceable> <replaceable>bytes</replaceable></varname>.
Please switch to the unified cgroup hierarchy.</para>
|
||
|
||
<xi:include href="version-info.xml" xpointer="v252"/></listitem>
|
||
</varlistentry>
|
||
</variablelist>
|
||
</refsect1>
|
||
|
||
<refsect1>
|
||
<title>See Also</title>
|
||
<para>
|
||
<citerefentry><refentrytitle>systemd</refentrytitle><manvolnum>1</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.unit</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.service</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.slice</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.scope</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.socket</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.mount</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.swap</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.directives</refentrytitle><manvolnum>7</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd.special</refentrytitle><manvolnum>7</manvolnum></citerefentry>,
|
||
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>,
the documentation for control groups and specific controllers in the Linux kernel:
|
||
<ulink url="https://docs.kernel.org/admin-guide/cgroup-v2.html">Control Groups v2</ulink>.
|
||
</para>
|
||
</refsect1>
|
||
</refentry>
|