docs: add documentation for developers

This commit is contained in:
Mariano Giménez 2024-01-23 17:44:31 +01:00 committed by hulkoba
parent 313f2ebc88
commit 163e2c8346
No known key found for this signature in database
GPG Key ID: ACB6C4A3A4F2BE2F
19 changed files with 1526 additions and 0 deletions

92
docs/AUTOPGKTEST.md Normal file
View File

@ -0,0 +1,92 @@
---
title: Autopkgtest - Defining tests for Debian packages
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Test description
Full system integration/acceptance testing is done through [autopkgtests](https://salsa.debian.org/ci-team/autopkgtest/-/blob/master/doc/README.package-tests.rst). These test the actual installed binary distribution packages. They are run in QEMU or containers and thus can do intrusive and destructive things such as installing arbitrary packages, modifying arbitrary files in the system (including grub boot parameters), rebooting, or loading kernel modules.
The tests for systemd are defined in the [Debian package's debian/tests](https://salsa.debian.org/systemd-team/systemd/tree/master/debian/tests) directory. For validating a pull request, the Debian package is built using the unpatched code from that PR (via the [checkout-upstream](https://salsa.debian.org/systemd-team/systemd/blob/master/debian/extra/checkout-upstream) script), and the tests run against these built packages. Note that some tests which check Debian specific behaviour are skipped in "test upstream" mode.
# Infrastructure
systemd's GitHub project has webhooks that trigger autopkgtests on Ubuntu 18.04 LTS on three architectures:
* i386: 32 bit x86, little endian, QEMU (OpenStack cloud instance)
* amd64: 64 bit x86, little endian, QEMU (OpenStack cloud instance)
* arm64: 64 bit ARM, little endian, QEMU (OpenStack cloud instance)
* s390x: 64 bit IBM z/Series, big endian, LXC (this architecture is not yet available in Canonical's OpenStack and thus skips some tests)
Please see the [Ubuntu CI infrastructure](https://wiki.ubuntu.com/ProposedMigration/AutopkgtestInfrastructure) documentation for details about how this works.
# Manually retrying/triggering tests on the infrastructure
The current tests are fairly solid by now, but rarely they fail on infrastructure/network issues or race conditions. If you encounter these, please notify @iainlane in the GitHub PR for debugging/fixing those -- transient infrastructure issues are supposed to be detected automatically, and tests auto-retry on those; and flaky tests should of course be fixed properly. But sometimes it is useful to trigger tests on a different Ubuntu release too, for example to test a PR on a newer kernel or against current build/binary dependencies (cgroup changes, util-linux, gcc, etc.).
This can be done using the generic [retry-github-test](https://git.launchpad.net/autopkgtest-cloud/tree/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/tools/retry-github-test) script from [Ubuntu's autopkgtest infrastructure](https://git.launchpad.net/autopkgtest-cloud): you need the parameterized URL from the [configured webhooks](https://github.com/systemd/systemd/settings/hooks) and the shared secret (Ubuntu's CI needs to restrict access to avoid DoSing and misuse).
You can use Martin Pitt's [retry-gh-systemd-test](https://piware.de/gitweb/?p=bin.git;a=blob;f=retry-gh-systemd-test) shell wrapper around retry-github-test for that. You need to adjust the path where you put retry-github-test and the file with the shared secret, then you can call it like this:
```sh
$ retry-gh-systemd-test <#PR> <architecture> [release]
```
where `release` defaults to `bionic` (aka Ubuntu 18.04 LTS). For example:
```sh
$ retry-gh-systemd-test 1234 amd64
$ retry-gh-systemd-test 2345 s390x cosmic
```
Please make sure to not trigger unknown [releases](https://launchpad.net/ubuntu/+series) or architectures as they will cause a pending test on the PR which never gets finished.
# Test the code from the PR locally
As soon as a test on the infrastructure finishes, the "Details" link in the PR "checks" section will point to the `log.gz` log. You can download the individual test log, built .debs, and other artifacts that tests leave behind (some dump a complete journal or the udev database on failure) by replacing `/log.gz` with `/artifacts.tar.gz` in that URL. You can then unpack the tarball and use `sudo dpkg -iO binaries/*.deb` to install the debs from the PR into an Ubuntu VM of the same release/architecture for manually testing a PR.
# Run autopkgtests locally
Preparations:
* Get autopkgtest:
```sh
git clone https://salsa.debian.org/ci-team/autopkgtest.git
```
* Install necessary dependencies; on Debian/Ubuntu you can simply run `sudo apt install autopkgtest` (instead of the above cloning), on Fedora do `yum install qemu-kvm dpkg-perl`
* Build a test image based on Ubuntu cloud images for the desired release/arch:
```sh
autopkgtest/tools/autopkgtest-buildvm-ubuntu-cloud -r bionic -a amd64
```
This will build `autopkgtest-bionic-amd64.img`. This is normally being used through the `autopkgtest` command (see below), but you can boot this normally in QEMU (using `-snapshot` is highly recommended) to interactively poke around; this provides a easy throw-away test environment.
The most basic mode of operation is to run the tests for the current distro packages:
```sh
autopkgtest/runner/autopkgtest systemd -- qemu autopkgtest-bionic-amd64.img
```
But autopkgtest allows lots of [different modes](https://salsa.debian.org/ci-team/autopkgtest/-/blob/master/doc/README.running-tests.rst) and [options](http://manpages.ubuntu.com/autopkgtest), like running a shell on failure (`-s`), running a single test only (`--test-name`), running the tests from a local checkout of the Debian source tree (possibly with modifications to the test) instead of from the distribution source, or running QEMU with more than one CPU (check the [autopkgtest-virt-qemu manpage](http://manpages.ubuntu.com/autopkgtest-virt-qemu).
A common use case is to check out the Debian packaging git for getting/modifying the tests locally:
```sh
git clone https://salsa.debian.org/systemd-team/systemd.git /tmp/systemd-debian/
```
and running these against the binaries from a PR (see above), running only the `logind` test, getting a shell on failure, showing the boot output, and running with 2 CPUs:
```sh
autopkgtest/runner/autopkgtest --test-name logind /tmp/binaries/*.deb /tmp/systemd-debian/ -s -- \
qemu --show-boot --cpus 2 /srv/vm/autopkgtest-bionic-amd64.img
```
# Contact
For troubles with the infrastructure, please notify [iainlane](https://github.com/iainlane) in the affected PR.

25
docs/BACKPORTS.md Normal file
View File

@ -0,0 +1,25 @@
---
title: Backports
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Backports
The upstream systemd git repo at [https://github.com/systemd/systemd](https://github.com/systemd/systemd) only contains the main systemd branch that progresses at a quick pace, continuously bringing both bugfixes and new features. Distributions usually prefer basing their releases on stabilized versions branched off from this, that receive the bugfixes but not the features.
## Stable Branch Repository
Stable branches are available from [https://github.com/systemd/systemd-stable](https://github.com/systemd/systemd-stable).
Stable branches are started for certain releases of systemd and named after them, e.g. v208-stable. Stable branches are typically managed by distribution maintainers on an as needed basis. For example v208 has been chosen for stable as several distributions are shipping this version and the official/upstream cycle of v208-v209 was a long one due to kdbus work. If you are using a particular version and find yourself backporting several patches, you may consider pushing a stable branch here for that version so others can benefit. Please contact us if you are interested.
The following types of commits are cherry-picked onto those branches:
* bugfixes
* documentation updates, when relevant to this version
* hardware database additions, especially the keymap updates
* small non-conflicting features deemed safe to add in a stable release
Please try to ensure that anything backported to the stable repository is done with the `git cherry-pick -x` option such that text stating the original SHA1 is added into the commit message. This makes it easier to check where the code came from (as sometimes it is necessary to add small fixes as new code due to the upstream refactors that are deemed too invasive to backport as a stable patch.

112
docs/BOOT.md Normal file
View File

@ -0,0 +1,112 @@
---
title: systemd-boot UEFI Boot Manager
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# systemd-boot UEFI Boot Manager
systemd-boot is a UEFI boot manager which executes configured EFI images. The default entry is selected by a configured pattern (glob) or an on-screen menu.
systemd-boot operates on the EFI System Partition (ESP) only. Configuration file fragments, kernels, initrds, other EFI images need to reside on the ESP. Linux kernels need to be built with CONFIG\_EFI\_STUB to be able to be directly executed as an EFI image.
systemd-boot reads simple and entirely generic boot loader configuration files; one file per boot loader entry to select from. All files need to reside on the ESP.
Pressing the Space key (or most other keys actually work too) during bootup will show an on-screen menu with all configured loader entries to select from. Pressing Enter on the selected entry loads and starts the EFI image.
If no timeout is configured, which is the default setting, and no key pressed during bootup, the default entry is executed right away.
![systemd-boot menu](/assets/systemd-boot-menu.png)
All configuration files are expected to be 7-bit ASCII or valid UTF8. The loader configuration file understands the following keywords:
| Config |
|---------|------------------------------------------------------------|
| default | pattern to select the default entry in the list of entries |
| timeout | timeout in seconds for how long to show the menu |
The entry configuration files understand the following keywords:
| Entry |
|--------|------------------------------------------------------------|
| title | text to show in the menu |
| version | version string to append to the title when the title is not unique |
| machine-id | machine identifier to append to the title when the title is not unique |
| efi | executable EFI image |
| options | options to pass to the EFI image / kernel command line |
| linux | linux kernel image (systemd-boot still requires the kernel to have an EFI stub) |
| initrd | initramfs image (systemd-boot just adds this as option initrd=) |
Examples:
```
/boot/loader/loader.conf
timeout 3
default 6a9857a393724b7a981ebb5b8495b9ea-*
/boot/loader/entries/6a9857a393724b7a981ebb5b8495b9ea-3.8.0-2.fc19.x86_64.conf
title Fedora 19 (Rawhide)
version 3.8.0-2.fc19.x86_64
machine-id 6a9857a393724b7a981ebb5b8495b9ea
linux /6a9857a393724b7a981ebb5b8495b9ea/3.8.0-2.fc19.x86_64/linux
initrd /6a9857a393724b7a981ebb5b8495b9ea/3.8.0-2.fc19.x86_64/initrd
options root=UUID=f8f83f73-df71-445c-87f7-31f70263b83b quiet
/boot/loader/entries/custom-kernel.conf
title My kernel
efi /bzImage
options root=PARTUUID=084917b7-8be2-4e86-838d-f771a9902e08
/boot/loader/entries/custom-kernel-initrd.conf
title My kernel with initrd
linux /bzImage
initrd /initrd.img
options root=PARTUUID=084917b7-8be2-4e86-838d-f771a9902e08 quiet`
```
While the menu is shown, the following keys are active:
| Keys |
|--------|------------------------------------------------------------|
| Up/Down | Select menu entry |
| Enter | boot the selected entry |
| d | select the default entry to boot (stored in a non-volatile EFI variable) |
| t/T | adjust the timeout (stored in a non-volatile EFI variable) |
| e | edit the option line (kernel command line) for this bootup to pass to the EFI image |
| Q | quit |
| v | show the systemd-boot and UEFI version |
| P | print the current configuration to the console |
| h | show key mapping |
Hotkeys to select a specific entry in the menu, or when pressed during bootup to boot the entry right-away:
| Keys |
|--------|------------------------------------------------------------|
| l | Linux |
| w | Windows |
| a | OS X |
| s | EFI Shell |
| 1-9 | number of entry |
Some EFI variables control the loader or exported the loaders state to the started operating system. The vendor UUID `4a67b082-0a4c-41cf-b6c7-440b29bb8c4f` and the variable names are supposed to be shared across all loaders implementations which follow this scheme of configuration:
| EFI Variables |
|---------------|------------------------|-------------------------------|
| LoaderEntryDefault | entry identifier to select as default at bootup | non-volatile |
| LoaderConfigTimeout | timeout in seconds to show the menu | non-volatile |
| LoaderEntryOneShot | entry identifier to select at the next and only the next bootup | non-volatile |
| LoaderDeviceIdentifier | list of identifiers of the volume the loader was started from | volatile |
| LoaderDevicePartUUID | partition GPT UUID of the ESP systemd-boot was executed from | volatile |
Links:
[https://github.com/systemd/systemd](https://github.com/systemd/systemd)
[http://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/](http://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/)

68
docs/CATALOG.md Normal file
View File

@ -0,0 +1,68 @@
---
title: Journal Message Catalogs
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Journal Message Catalogs
Starting with 196 systemd includes a message catalog system which allows augmentation on display of journal log messages with short explanation texts, keyed off the MESSAGE\_ID= field of the entry. Many important log messages generated by systemd itself have message catalog entries. External packages can easily provide catalog data for their own messages.
The message catalog has a number of purposes:
* Provide the administrator, user or developer with further information about the issue at hand, beyond the actual message text
* Provide links to further documentation on the topic of the specific message
* Provide native language explanations for English language system messages
* Provide links for support forums, hotlines or other contacts
## Format
Message catalog source files are simple text files that follow an RFC822 inspired format. To get an understanding of the format [here's an example file](http://cgit.freedesktop.org/systemd/systemd/plain/catalog/systemd.catalog), which includes entries for many important messages systemd itself generates. On installation of a package that includes message catalogs all installed message catalog source files get compiled into a binary index, which is then used to look up catalog data.
journalctl's `-x` command line parameter may be used to augment on display journal log messages with message catalog data when browsing. `journalctl --list-catalog` may be used to print a list of all known catalog entries.
To register additional catalog entries, packages may drop (text) catalog files into /usr/lib/systemd/catalog/ with a suffix of .catalog. The files are not accessed directly when needed, but need to be built into a binary index file with `journalctl --update-catalog`.
Here's an example how a single catalog entry looks like in the text source format. Multiple of these may be listed one after the other per catalog source file:
```
-- fc2e22bc6ee647b6b90729ab34a250b1
Subject: Process @COREDUMP_PID@ (@COREDUMP_COMM@) dumped core
Defined-By: systemd
Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Documentation: man:core(5)
Documentation: http://www.freedesktop.org/wiki/Software/systemd/catalog/@MESSAGE_ID@
Process @COREDUMP_PID@ (@COREDUMP_COMM@) crashed and dumped core.
This usually indicates a programming error in the crashing program and
should be reported to its vendor as a bug.
```
The text format of the .catalog files is as follows:
* Simple, UTF-8 text files, with usual line breaks at 76 chars. URLs and suchlike where line-breaks are undesirable may use longer lines. As catalog files need to be usable on text consoles it is essential that the 76 char line break rule is otherwise followed for human readable text.
* Lines starting with `#` are ignored, and may be used for comments.
* The files consist of a series of entries. For each message ID (in combination with a locale) only a single entry may be defined. Every entry consists of:
* A separator line beginning with `-- `, followed by a hexadecimal message ID formatted as lower case ASCII string. Optionally, the message ID may be suffixed by a space and a locale identifier, such as `de` or `fr\_FR`, if i10n is required.
* A series of entry headers, in RFC822-style but not supporting continuation lines. Some header fields may appear more than once per entry. The following header fields are currently known (but additional fields may be added later):
* Subject: A short, one-line human readable description of the message
* Defined-By: Who defined this message. Usually a package name or suchlike
* Support: A URI for getting further support. This can be a web URL or a telephone number in the tel:// namespace
* Documentation: URIs for further user, administrator or developer documentation on the log entry. URIs should be listed in order of relevance, the most relevant documentation first.
* An empty line
* The actual catalog entry payload, as human readable prose. Multiple paragraphs may be separated by empty lines. The prose should first describe the message and when it occurs, possibly followed by recommendations how to deal with the message and (if it is an error message) correct the problem at hand. This message text should be readable by users and administrators. Information for developers should be stored externally instead, and referenced via a Documentation= header field.
* When a catalog entry is printed on screen for a specific log entry simple variable replacements are applied. Journal field names enclosed in @ will be replaced by their values, if such a field is available in an entry. If such a field is not defined in an entry the enclosing @ will be dropped but the variable name is kept. See [systemd's own message catalog](http://cgit.freedesktop.org/systemd/systemd/plain/catalog/systemd.catalog) for a complete example for a catalog file.
## Adding Message Catalog Support to Your Program
Note that the message catalog is only available for messages generated with the MESSAGE\_ID= journal meta data field, as this is need to find the right entry for a message. For more information on the MESSAGE\_ID= journal entry field see [systemd.journal-fields(7)](http://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html).
To add message catalog entries for log messages your application generates, please follow the following guidelines:
* Use the [native Journal logging APIs](http://0pointer.de/blog/projects/journal-submit.html) to generate your messages, and define message IDs for all messages you want to add catalog entries for. You may use `journalctl --new-id128` to allocate new message IDs.
* Write a catalog entry file for your messages and ship them in your package and install them to `/usr/lib/systemd/catalog/` (if you package your software with RPM use `%_journalcatalogdir`)
* Ensure that after installation of your application's RPM/DEB "`journalctl --update-catalog`" is executed, in order to update the binary catalog index. (if you package your software with RPM use the `%journal_catalog_update` macro to achieve that.)

View File

@ -0,0 +1,240 @@
---
title: New Control Group Interfaces
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# The New Control Group Interfaces
> _aka "I want to make use of kernel cgroups, how do I do this in the new world order?"_
Starting with version 205 systemd provides a number of interfaces that may be used to create and manage labelled groups of processes for the purpose of monitoring and controlling them and their resource usage. This is built on top of the Linux kernel Control Groups ("cgroups") facility. Previously, the kernel's cgroups API was exposed directly as shared application API, following the rules of the [Pax Control Groups](http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups/) document. However, the kernel cgroup interface has been reworked into an API that requires that each individual cgroup is managed by a single writer only. With this change the main cgroup tree becomes private property of that userspace component and is no longer a shared resource. On systemd systems PID 1 takes this role and hence needs to provide APIs for clients to take benefit of the control groups functionality of the kernel. Note that services running on systemd systems may manage their own subtrees of the cgroups tree, as long as they explicitly turn on delegation mode for them (see below).
That means explicitly, that:
1. The root control group may only be written to by systemd (PID 1). Services that create and manipulate control groups in the top level cgroup are in direct conflict with the kernel's requirement that each control group should have a single-writer only.
2. Services must set Delegate=yes for the units they intend to manage subcgroups of. If they create and manipulate cgroups outside of units that have Delegate=yes set, they violate the access contract for control groups.
For a more high-level background story, please have a look at this [Linux Foundation News Story](http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign).
### Why this all again?
- Objects placed in the same level of the cgroup tree frequently need to propagate properties from one to each other. For example, when using the "cpu" controller for one object then all objects on the same level need to do the same, otherwise the entire cgroup of the first object will be scheduled against the individual processes of the others, thus giving the first object a drastic malus on scheduling if it uses many processes.
- Similar, some properties also require propagation up the tree.
- The tree needs to be refreshed/built in scheduled steps as devices show up/go away as controllers like "blkio" or "devices" refer to devices via major/minor device node indexes, which are not fixed but determined only as a device appears.
- The tree also needs refreshing/rebuilding as new services are installed/started/instantiated/stopped/uninstalled.
- Many of the cgroup attributes are too low-level as API. For example, the major/minor device interface in order to be useful requires a userspace component for translating stable device paths into major/minor at the right time.
- By unifying the cgroup logic under a single arbiter it is possible to write tools that can manage all objects the system contains, including services, virtual machines containers and whatever else applications register.
- By unifying the cgroup logic under a single arbiter a good default that encompasses all kinds of objects may be shipped, thus making manual configuration unnecessary to take benefit of basic resource control.
systemd through its "unit" concept already implements a dependency network between objects where propagation can take place and contains a powerful execution queue. Also, a major part of the objects resources need to be controlled for are already systemd objects, most prominently the services systemd manages.
### Why is this not managed by a component independent of systemd?
Well, as mentioned above, a dependency network between objects, usable for propagation, combined with a powerful execution engine is basically what systemd _is_. Since cgroups management requires precisely this it is an obvious choice to simply implement this in systemd itself.
Implementing a similar propagation/dependency network with execution scheduler outside of systemd in an independent "cgroup" daemon would basically mean reimplementing systemd a second time. Also, accessing such an external service from PID 1 for managing other services would result in cyclic dependencies between PID 1 which would need this functionality to manage the cgroup service which would only be available however after that service finished starting up. Such cyclic dependencies can certainly be worked around, but make such a design complex.
### I don't use systemd, what does this mean for me?
Nothing. This page is about systemd's cgroups APIs. If you don't use systemd then the kernel cgroup rework will probably affect you eventually, but a different component will be the single writer userspace daemon managing the cgroup tree, with different APIs. Note that the APIs described here expose a lot of systemd-specific concepts and hence are unlikely to be available outside of systemd systems.
### I want to write cgroup code that should work on both systemd systems and others (such as Ubuntu), what should I do?
On systemd systems use the systemd APIs as described below. At this time we are not aware of any component that would take the cgroup managing role on Upstart/sysvinit systems, so we cannot help you with this. Sorry.
### What's the timeframe of this? Do I need to care now?
In the short-term future writing directly to the control group tree from applications should still be OK, as long as the [Pax Control Groups](http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups/) document is followed. In the medium-term future it will still be supported to alter/read individual attributes of cgroups directly, but no longer to create/delete cgroups without using the systemd API. In the longer-term future altering/reading attributes will also be unavailable to userspace applications, unless done via systemd's APIs (either D-Bus based IPC APIs or shared library APIs for _passive_ operations).
It is recommended to use the new systemd APIs described below in any case. Note that the kernel cgroup interface is currently being reworked (available when the "sane_behaviour" kernel option is used). This will change the cgroupfs interface. By using systemd's APIs this change is abstracted away and invisible to applications.
## systemd's Resource Control Concepts
Systemd provides three unit types that are useful for the purpose of resource control:
- [_Services_](http://www.freedesktop.org/software/systemd/man/systemd.service.html) encapsulate a number of processes that are started and stopped by systemd based on configuration. Services are named in the style of `quux.service`.
- [_Scopes_](http://www.freedesktop.org/software/systemd/man/systemd.scope.html) encapsulate a number of processes that are started and stopped by arbitrary processes via fork(), and then registered at runtime with PID1. Scopes are named in the style of `wuff.scope`.
- [_Slices_](http://www.freedesktop.org/software/systemd/man/systemd.slice.html) may be used to group a number of services and scopes together in a hierarchial tree. Slices do not contain processes themselves, but the services and slices contained in them do. Slices are named in the style of `foobar-waldo.slice`, where the path to the location of the slice in the tree is encoded in the name with "-" as separator for the path components (`foobar-waldo.slice` is hence a subslice of `foobar.slice`). There's one special slices defined, `-.slice`, which is the root slice of all slices (`foobar.slice` is hence subslice of `-.slice`). This is similar how in regular file paths, "/" denotes the root directory.
Service, scope and slice units directly map to objects in the cgroup tree. When these units are activated they each map to directly (modulo some character escaping) to cgroup paths built from the unit names. For example, a service `quux.service` in a slice `foobar-waldo.slice` is found in the cgroup `foobar.slice/foobar-waldo.slice/quux.service/`.
Services, scopes and slices may be created freely by the administrator or dynamically by programs. However by default the OS defines a number of built-in services that are necessary to start-up the system. Also, there are four slices defined by default: first of all the root slice `-.slice` (as mentioned above), but also `system.slice`, `machine.slice`, `user.slice`. By default all system services are placed in the first slice, all virtual machines and containers in the second, and user sessions in the third. However, this is just a default, and the administrator my freely define new slices and assign services and scopes to them. Also note that all login sessions automatically are placed in an individual scope unit, as are VM and container processes. Finally, all users logging in will also get an implicit slice of their own where all the session scopes are placed.
Here's an example how the cgroup tree could look like (as generated with `systemd-cgls(1)`, see below):
```
├─user.slice
│ └─user-1000.slice
│ ├─session-18.scope
│ │ ├─703 login -- lennart
│ │ └─773 -bash
│ ├─session-1.scope
│ │ ├─ 518 gdm-session-worker [pam/gdm-autologin]
│ │ ├─ 540 gnome-session
│ │ ├─ 552 dbus-launch --sh-syntax --exit-with-session
│ │ ├─ 553 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
│ │ ├─ 589 /usr/libexec/gvfsd
│ │ ├─ 593 /usr/libexec//gvfsd-fuse -f /run/user/1000/gvfs
│ │ ├─ 598 /usr/libexec/gnome-settings-daemon
│ │ ├─ 617 /usr/bin/gnome-keyring-daemon --start --components=gpg
│ │ ├─ 630 /usr/bin/pulseaudio --start
│ │ ├─ 726 /usr/bin/gnome-shell
│ │ ├─ 728 syndaemon -i 1.0 -t -K -R
│ │ ├─ 736 /usr/libexec/gsd-printer
│ │ ├─ 742 /usr/libexec/dconf-service
│ │ ├─ 798 /usr/libexec/mission-control-5
│ │ ├─ 802 /usr/libexec/goa-daemon
│ │ ├─ 823 /usr/libexec/gvfsd-metadata
│ │ ├─ 866 /usr/libexec/gvfs-udisks2-volume-monitor
│ │ ├─ 880 /usr/libexec/gvfs-gphoto2-volume-monitor
│ │ ├─ 886 /usr/libexec/gvfs-afc-volume-monitor
│ │ ├─ 891 /usr/libexec/gvfs-mtp-volume-monitor
│ │ ├─ 895 /usr/libexec/gvfs-goa-volume-monitor
│ │ ├─ 999 /usr/libexec/telepathy-logger
│ │ ├─ 1076 /usr/libexec/gnome-terminal-server
│ │ ├─ 1079 gnome-pty-helper
│ │ ├─ 1080 bash
│ │ ├─ 1134 ssh-agent
│ │ ├─ 1137 gedit l
│ │ ├─ 1160 gpg-agent --daemon --write-env-file
│ │ ├─ 1371 /usr/lib64/firefox/firefox
│ │ ├─ 1729 systemd-cgls
│ │ ├─ 1929 bash
│ │ ├─ 2057 emacs src/login/org.freedesktop.login1.policy.in
│ │ ├─ 2060 /usr/libexec/gconfd-2
│ │ ├─29634 /usr/libexec/gvfsd-http --spawner :1.5 /org/gtk/gvfs/exec_spaw/0
│ │ └─31416 bash
│ └─user@1000.service
│ ├─532 /usr/lib/systemd/systemd --user
│ └─541 (sd-pam)
└─system.slice
├─1 /usr/lib/systemd/systemd --system --deserialize 22
├─sshd.service
│ └─29701 /usr/sbin/sshd -D
├─udisks2.service
│ └─743 /usr/lib/udisks2/udisksd --no-debug
├─colord.service
│ └─727 /usr/libexec/colord
├─upower.service
│ └─633 /usr/libexec/upowerd
├─wpa_supplicant.service
│ └─488 /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplicant.log -c /etc/wpa_supplicant/wpa_supplicant.conf -u -f /var/log/wpa_supplicant.log -P /var/run/wpa_supplicant.pid
├─bluetooth.service
│ └─463 /usr/sbin/bluetoothd -n
├─polkit.service
│ └─443 /usr/lib/polkit-1/polkitd --no-debug
├─alsa-state.service
│ └─408 /usr/sbin/alsactl -s -n 19 -c -E ALSA_CONFIG_PATH=/etc/alsa/alsactl.conf --initfile=/lib/alsa/init/00main rdaemon
├─systemd-udevd.service
│ └─253 /usr/lib/systemd/systemd-udevd
├─systemd-journald.service
│ └─240 /usr/lib/systemd/systemd-journald
├─rtkit-daemon.service
│ └─419 /usr/libexec/rtkit-daemon
├─rpcbind.service
│ └─475 /sbin/rpcbind -w
├─cups.service
│ └─731 /usr/sbin/cupsd -f
├─avahi-daemon.service
│ ├─417 avahi-daemon: running [delta.local]
│ └─424 avahi-daemon: chroot helper
├─dbus.service
│ ├─418 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
│ └─462 /usr/sbin/modem-manager
├─accounts-daemon.service
│ └─416 /usr/libexec/accounts-daemon
├─systemd-ask-password-wall.service
│ └─434 /usr/bin/systemd-tty-ask-password-agent --wall
├─systemd-logind.service
│ └─415 /usr/lib/systemd/systemd-logind
├─ntpd.service
│ └─429 /usr/sbin/ntpd -u ntp:ntp -g
├─rngd.service
│ └─412 /sbin/rngd -f
├─libvirtd.service
│ └─467 /usr/sbin/libvirtd
├─irqbalance.service
│ └─411 /usr/sbin/irqbalance --foreground
├─crond.service
│ └─421 /usr/sbin/crond -n
├─NetworkManager.service
│ ├─ 410 /usr/sbin/NetworkManager --no-daemon
│ ├─1066 /sbin/dhclient -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-enp0s20u2.pid -lf /var/lib/NetworkManager/dhclient-35c8218b-9e45-4b1f-b79e-22334f687340-enp0s20u2.lease -cf /var/lib/NetworkManager/dhclient-enp0s20u2.conf enp0s20u2
│ └─1070 /sbin/dhclient -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-enp0s26u1u4i2.pid -lf /var/lib/NetworkManager/dhclient-f404f1ca-ccfe-4957-aead-dec19c126dea-enp0s26u1u4i2.lease -cf /var/lib/NetworkManager/dhclient-enp0s26u1u4i2.conf enp0s26u1u4i2
└─gdm.service
├─420 /usr/sbin/gdm
├─449 /usr/libexec/gdm-simple-slave --display-id /org/gnome/DisplayManager/Displays/_0
└─476 /usr/bin/Xorg :0 -background none -verbose -auth /run/gdm/auth-for-gdm-pJjwsi/database -seat seat0 -nolisten tcp vt1
```
As you can see, services and scopes contain process and are placed in slices, and slices do not contain processes of their own. Also note that the special "-.slice" is not shown as it is implicitly identified with the root of the entire tree.
Resource limits may be set on services, scopes and slices the same way. All active service, scope and slice units may easily be viewed with the "systemctl" command. The hierarchy of services and scopes in the slice tree may be viewed with the "systemd-cgls" command.
Service and slice units may be configured via unit files on disk, or alternatively be created dynamically at runtime via API calls to PID 1. Scope units may only be created at runtime via API calls to PID 1, but not from unit files on disk. Units that are created dynamically at runtime via API calls are called _transient_ units. Transient units exist only during runtime and are released automatically as soon as they finished/got deactivated or the system is rebooted.
If a service/slice is configured via unit files on disk the resource controls may be configured with the settings documented in [systemd.resource-control(5)](http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html). While the unit are started they may be reconfigured for services/slices/scopes (with changes applying instantly) with the a command line such as:
```
# systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M
```
This will make these changes persistently, so that after the next reboot they are automatically applied right when the services are first started. By passing the `--runtime` switch the changes can alternatively be made in a volatile way so that they are lost on the next reboot.
Note that the number of cgroup attributes currently exposed as unit properties is limited. This will be extended later on, as their kernel interfaces are cleaned up. For example cpuset or freezer are currently not exposed at all due to the broken inheritance semantics of the kernel logic. Also, migrating units to a different slice at runtime is not supported (i.e. altering the Slice= property for running units) as the kernel currently lacks atomic cgroup subtree moves.
(Note that the resource control settings are actually also available on mount, swap and socket units. This is because they may also involve processes run for them. However, normally it should not be necessary to alter resource control settings on these unit types.)
## The APIs
Most relevant APIs are exposed via D-Bus, however some _passive_ interfaces are available as shared library, bypassing IPC so that they are much cheaper to call.
### Creating and Starting
To create and start a transient (scope, service or slice) unit in the cgroup tree use the `StartTransientUnit()` method on the `Manager` object exposed by systemd's PID 1 on the bus, see the [Bus API Documentation](http://www.freedesktop.org/wiki/Software/systemd/dbus/) for details. This call takes four arguments. The first argument is the full unit name you want this unit to be known under. This unit name is the handle to the unit, and is shown in the "systemctl" output and elsewhere. This name must be unique during runtime of the unit. You should generate a descriptive name for this that is useful for the administrator to make sense of it. The second parameter is the mode, and should usually be `replace` or `fail`. The third parameter contains an array of initial properties to set for the unit. It is an array of pairs of property names as string and values as variant. Note that this is an array and not a dictionary! This is that way in order to match the properties array of the `SetProperties()` call (see below). The fourth parameter is currently not used and should be passed as empty array. This call will first create the transient unit and then immediately queue a start job for it. This call returns an object path to a `Job` object for the start job of this unit.
### Properties
The properties array of `StartTransientUnit()` may take many of the settings that may also be configured in unit files. Not all parameters are currently accepted though, but we plan to cover more properties with future release. Currently you may set the `Description`, `Slice` and all dependency types of units, as well as `RemainAfterExit`, `ExecStart` for service units, `TimeoutStopUSec` and `PIDs` for scope units, and `CPUAccounting`, `CPUShares`, `BlockIOAccounting`, `BlockIOWeight`, `BlockIOReadBandwidth`, `BlockIOWriteBandwidth`, `BlockIODeviceWeight`, `MemoryAccounting`, `MemoryLimit`, `DevicePolicy`, `DeviceAllow` for services/scopes/slices. These fields map directly to their counterparts in unit files and as normal D-Bus object properties. The exception here is the `PIDs` field of scope units which is used for construction of the scope only and specifies the initial PIDs to add to the scope object.
To alter resource control properties at runtime use the `SetUnitProperty()` call on the `Manager` object or `SetProperty()` on the individual Unit objects. This also takes an array of properties to set, in the same format as `StartTransientUnit()` takes. Note again that this is not a dictionary, and allows properties to be set multiple times with a single invocation. THis is useful for array properties: if a property is assigned the empty array it will be reset to the empty array itself, however if it is assigned a non-empty array then this array is appended to the previous array. This mimics behaviour of array settings in unit files. Note that most settings may only be set during creation of units with `StartTransientUnit()`, and may not be altered later on. The exception here are the resource control settings, more specifically `CPUAccounting`, `CPUShares`, `BlockIOAccounting`, `BlockIOWeight`, `BlockIOReadBandwidth`, `BlockIOWriteBandwidth`, `BlockIODeviceWeight`, `MemoryAccounting`, `MemoryLimit`, `DevicePolicy`, `DeviceAllow` for services/scopes/slices. Note that the standard D-Bus `org.freedesktop.DBus.Properties.Set()` call is currently not supported by any of the unit objects to set these properties, but might eventually (note however, that it is substantially less useful as it only allows setting a single property at a time, resulting in races).
The [`systemctl set-property`](http://www.freedesktop.org/software/systemd/man/systemctl.html) command internally is little more than a wrapper around `SetUnitProperty()`. The [`systemd-run`](http://www.freedesktop.org/software/systemd/man/systemd-run.html) tool is a wrapper around `StartTransientUnit()`. It may be used to either run a process as a transient service in the background, where it is invoked from PID 1, or alternatively as a scope unit in the foreground, where it is run from the `systemd-run` process itself.
### Enumeration
To acquire a list of currently running units, use the `ListUnits()` call on the Manager bus object. To determine the scope/service unit and slice unit a process is running in use [`sd_pid_get_unit()`](http://www.freedesktop.org/software/systemd/man/sd_pid_get_unit.html) and `sd_pid_get_slice()`. These two calls are implemented in `libsystemd-login.so`. These call bypass the system bus (which they can because they are passive and do not require privileges) and are hence very effecient to invoke.
### VM and Container Managers
Use these APIs to register any kind of process workload with systemd to be placed in a resource controlled cgroup. Note however that for containers and virtual machines it is better to use the [`machined`](http://www.freedesktop.org/wiki/Software/systemd/machined/) interfaces since they provide integration with "ps" and similar tools beyond what mere cgroup registration provides. Also see [Writing VM and Container Managers](http://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers/) for details.
### Reading Accounting Information
Note that there's currently no systemd API to retrieve accounting information from cgroups. For now, if you need to retrieve this information use `/proc/&#036;PID/cgroup` to determine the cgroup path for your process in the `cpuacct` controller (or whichever controller matters to you), and then read the attributes directly from the cgroup tree.
If you want to collect the exit status and other runtime parameters of your transient scope or service unit after the processes in them ended set the `RemainAfterExited` boolean property when creating it. This will has the effect that the unit will stay around even after all processes in it died, in the `SubState="exited"` state. Simply watch for state changes until this state is reached, then read the status details from the various properties you need, and finally terminate the unit via `StopUnit()` on the `Manager` object or `Stop()` on the `Unit` object itself.
### Becoming a Controller
Optionally, it is possible for a program that registers a scope unit (the "scope manager") for one or more of its child processes to hook into the shutdown logic of the scope unit. Normally, if this is not done, and the scope needs to be shut down (regardless if during normal operation when the user invokes `systemctl stop` -- or something equivalent -- on the scope unit, or during system shutdown), then systemd will simply send SIGTERM to its processes. After a timeout this will be followed by SIGKILL unless the scope processes exited by then. If a scope manager program wants to be involved in the shutdown logic of its scopes it may set the `Controller` property of the scope unit when creating it via `StartTransientUnit()`. It should be set to the bus name (either unique name or well-known name) of the scope manager program. If this is done then instead of SIGTERM to the scope processes systemd will send the RequestStop() bus signal to the specified name. If the name is gone by then it will automatically fallback to SIGTERM, in order to make this robust. As before in either case this will be followed by SIGKILL to the scope unit processes after a timeout.
Scope units implement a special `Abandon()` method call. This method call is useful for informing the system manager that the scope unit is no longer managed by any scope manager process. Among other things it is useful for manager daemons which terminate but want to leave the scopes they started running. When a scope is abandoned its state will be set to "abandoned" which is shown in the usual systemctl output, as information to the user. Also, if a controller has been set for the scope, it will be unset. Note that there is not strictly need to ever invoke the `Abandon()` method call, however it is recommended for cases like the ones explained above.
### Delegation
Service and scope units know a special `Delegate` boolean property. If set, then the processes inside the scope or service may control their own control group subtree (that means: create subcgroups directly via /sys/fs/cgroup). The effect of the property is that:
1. All controllers systemd knows are enabled for that scope/service, if the scope/service runs privileged code.
2. Access to the cgroup directory of the scope/service is permitted, and files/and directories are updated to get write access for the user specified in `User=` if the scope/unit runs unprivileged. Note that in this case access to any controllers is not available.
3. systemd will refrain from moving processes across the "delegation" boundary.
Generally, the `Delegate` property is only useful for services that need to manage their own cgroup subtrees, such as container managers. After creating a unit with this property set, they should use `/proc/&#036;PID/cgroup` to figure out the cgroup subtree path they may manage (the one from the name=systemd hierarchy!). Managers should refrain from making any changes to the cgroup tree outside of the subtrees for units they created with the `Delegate` flag turned on.
Note that scope units created by `machined`'s `CreateMachine()` call have this flag set.
### Example
Please see the [systemd-run sources](http://cgit.freedesktop.org/systemd/systemd/plain/src/run/run.c) for a relatively simple example how to create scope or service units transiently and pass properties to them.

160
docs/INHIBITOR_LOCKS.md Normal file
View File

@ -0,0 +1,160 @@
---
title: Inhibitor Locks
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Inhibitor Locks
systemd 183 and newer include a logic to inhibit system shutdowns and sleep states. This is implemented as part of [systemd-logind.daemon(8)](http://www.freedesktop.org/software/systemd/man/systemd-logind.service.html) There are a couple of different use cases for this:
- A CD burning application wants to ensure that the system is not turned off or suspended while the burn process is in progress.
- A package manager wants to ensure that the system is not turned off while a package upgrade is in progress.
- An office suite wants to be notified before system suspend in order to save all data to disk, and delay the suspend logic until all data is written.
- A web browser wants to be notified before system hibernation in order to free its cache to minimize the amount of memory that needs to be virtualized.
- A screen lock tool wants to bring up the screen lock right before suspend, and delay the suspend until that's complete.
Applications which want to make use of the inhibition logic shall take an inhibitor lock via the [logind D-Bus API](http://www.freedesktop.org/wiki/Software/systemd/logind).
Seven distinct inhibitor lock types may be taken, or a combination of them:
1. _sleep_ inhibits system suspend and hibernation requested by (unprivileged) **users**
2. _shutdown_ inhibits high-level system power-off and reboot requested by (unprivileged) **users**
3. _idle_ inhibits that the system goes into idle mode, possibly resulting in **automatic** system suspend or shutdown depending on configuration.
- _handle-power-key_ inhibits the low-level (i.e. logind-internal) handling of the system power **hardware** key, allowing (possibly unprivileged) external code to handle the event instead.
4. Similar, _handle-suspend-key_ inhibits the low-level handling of the system **hardware** suspend key.
5. Similar, _handle-hibernate-key_ inhibits the low-level handling of the system **hardware** hibernate key.
6. Similar, _handle-lid-switch_ inhibits the low-level handling of the systemd **hardware** lid switch.
Two different modes of locks are supported:
1. _block_ inhibits operations entirely until the lock is released. If such a lock is taken the operation will fail (but still may be overridden if the user possesses the necessary privileges).
2. _delay_ inhibits operations only temporarily, either until the lock is released or up to a certain amount of time. The InhibitDelayMaxSec= setting in [logind.conf(5)](http://www.freedesktop.org/software/systemd/man/logind.conf.html) controls the timeout for this. This is intended to be used by applications which need a synchronous way to execute actions before system suspend but shall not be allowed to block suspend indefinitely. This mode is only available for _sleep_ and _shutdown_ locks.
Inhibitor locks are taken via the Inhibit() D-Bus call on the logind Manager object:
```
$ gdbus introspect --system --dest org.freedesktop.login1 --object-path /org/freedesktop/login1
node /org/freedesktop/login1 {
interface org.freedesktop.login1.Manager {
methods:
Inhibit(in s what,
in s who,
in s why,
in s mode,
out h fd);
ListInhibitors(out a(ssssuu) inhibitors);
...
signals:
PrepareForShutdown(b active);
PrepareForSleep(b active);
...
properties:
readonly s BlockInhibited = '';
readonly s DelayInhibited = '';
readonly t InhibitDelayMaxUSec = 5000000;
readonly b PreparingForShutdown = false;
readonly b PreparingForSleep = false;
...
};
...
};
```
**Inhibit()** is the only API necessary to take a lock. It takes four arguments:
- _What_ is a colon-separated list of lock types, i.e. `shutdown`, `sleep`, `idle`, `handle-power-key`, `handle-suspend-key`, `handle-hibernate-key`, `handle-lid-switch`. Example: "shutdown:idle"
- _Who_ is a human-readable, descriptive string of who is taking the lock. Example: "Package Updater"
- _Why_ is a human-readable, descriptive string of why the lock is taken. Example: "Package Update in Progress"
- _Mode_ is one of `block` or `delay`, see above. Example: "block"
Inhibit() returns a single value, a file descriptor that encapsulates the lock. As soon as the file descriptor is closed (and all its duplicates) the lock is automatically released. If the client dies while the lock is taken the kernel automatically closes the file descriptor so that the lock is automatically released. A delay lock taken this way should be released ASAP on reception of PrepareForShutdown(true) (see below), but of course only after execution of the actions the application wanted to delay the operation for in the first place.
**ListInhibitors()** lists all currently active inhibitor locks. It returns an array of structs, each consisting of What, Who, Why, Mode as above, plus the PID and UID of the process that requested the lock.
The **PrepareForShutdown()** and **PrepareForSleep()** signals are emitted when a system suspend or shutdown has been requested and is about to be executed, as well as after the the suspend/shutdown was completed (or failed). The signals carry a boolean argument. If _True_ the shutdown/sleep has been requested, and the preparation phase for it begins, if _False_ the operation has finished completion (or failed). If _True_, this should be used as indication for applications to quickly execute the operations they wanted to execute before suspend/shutdown and then release any delay lock taken. If _False_ the suspend/shutdown operation is over, either successfully or unsuccessfully (of course, this signal will never be sent if a shutdown request was successful). The signal with _False_ is generally delivered only after the system comes back from suspend, the signal with _True_ possibly as well, for example when no delay lock was taken in the first place, and the system suspend hence executed without any delay. The signal with _False_ is usually the signal on which applications request a new delay lock in order to be synchronously notified about the next suspend/shutdown cycle. Note that watching PrepareForShutdown(true)[?](//secure.freedesktop.org/write/www/ikiwiki.cgi?do=create&from=Software%2Fsystemd%2Finhibit&page=Software%2Fsystemd%2Finhibit%2FPrepareForSleep)/PrepareForSleep(true) without taking a delay lock is racy and should not be done, as any code that an application might want to execute on this signal might not actually finish before the suspend/shutdown cycle is executed. _Again_: if you watch PrepareForSuspend(true), then you really should have taken a delay lock first. PrepareForShutdown(false) may be subscribed to by applications which want to be notified about system resume events. Note that this will only be sent out for suspend/resume cycles done via logind, i.e. generally only for high-level user-induced suspend cycles, and not automatic, low-level kernel induced ones which might exist on certain devices with more aggressive power management.
The **BlockInhibited** and **DelayInhibited** properties encode what types of locks are currently taken. These fields are a colon separated list of `shutdown`, `sleep`, `idle`, `handle-power-key`, `handle-suspend-key`, `handle-hibernate-key`, `handle-lid-switch`. The list is basically the union of the What fields of all currently active locks of the specific mode.
**InhibitDelayMaxUSec** contains the delay timeout value as configured in [logind.conf(5)](http://www.freedesktop.org/software/systemd/man/logind.conf.html).
The **PreparingForShutdown** and **PreparingForSleep** boolean properties are true between the two PrepareForShutdown() resp PrepareForSleep() signals that are sent out. Note that these properties do not trigger PropertyChanged signals.
## Taking Blocking Locks
Here's the basic scheme for applications which need blocking locks such as a package manager or CD burning application:
1. Take the lock
2. Do your work you don't want to see interrupted by system sleep or shutdown
3. Release the lock
Example pseudo code:
```
fd = Inhibit("shutdown:idle", "Package Manager", "Upgrade in progress...", "block");
/* ...
do your work
... */
close(fd);
```
## Taking Delay Locks
Here's the basic scheme for applications which need delay locks such as a web browser or office suite:
1. As you open a document, take the delay lock
2. As soon as you see PrepareForSleep(true), save your data, then release the lock
3. As soon as you see PrepareForSleep(false), take the delay lock again, continue as before.
Example pseudo code:
```
int fd = -1;
takeLock() {
if (fd >= 0)
return;
fd = Inhibit("sleep", "Word Processor", "Save any unsaved data in time...", "delay");
}
onDocumentOpen(void) {
takeLock();
}
onPrepareForSleep(bool b) {
if (b) {
saveData();
if (fd >= 0) {
close(fd);
fd = -1;
}
} else
takeLock();
}
```
## Taking Key Handling Locks
By default logind will handle the power and sleep keys of the machine, as well as the lid switch in all states. This ensures that this basic system behavior is guaranteed to work in all circumstances, on text consoles as well as on all graphical environments. However, some DE might want to do their own handling of these keys, for example in order to show a pretty dialog box before executing the relevant operation, or to simply disable the action under certain conditions. For these cases the handle-power-key, handle-suspend-key, handle-hibernate-key and handle-lid-switch type inhibitor locks are available. When taken, these locks simply disable the low-level handling of the keys, they have no effect on system suspend/hibernate/poweroff executed with other mechanisms than the hardware keys (such as the user typing "systemctl suspend" in a shell). A DE intending to do its own handling of these keys should simply take the locks at login time, and release them on logout; alternatively it might make sense to take this lock only temporarily under certain circumstances (e.g. take the lid switch lock only when a second monitor is plugged in, in order to support the common setup where people close their laptops when they have the big screen connected).
These locks need to be taken in the "block" mode, "delay" is not supported for them.
If a DE wants to ensure the lock screen for the eventual resume is on the screen before the system enters suspend state, it should do this via a suspend delay inhibitor block (see above).
## Miscellanea
Taking inhibitor locks is a privileged operation. Depending on the action _org.freedesktop.login1.inhibit-block-shutdown_, _org.freedesktop.login1.inhibit-delay-shutdown_, _org.freedesktop.login1.inhibit-block-sleep_, _org.freedesktop.login1.inhibit-delay-sleep_, _org.freedesktop.login1.inhibit-block-idle_, _org.freedesktop.login1.inhibit-handle-power-key_, _org.freedesktop.login1.inhibit-handle-suspend-key_, _org.freedesktop.login1.inhibit-handle-hibernate-key_,_org.freedesktop.login1.inhibit-handle-lid-switch_. In general it should be assumed that delay locks are easier to obtain than blocking locks, simply because their impact is much more minimal. Note that the policy checks for Inhibit() are never interactive.
Inhibitor locks should not be misused. For example taking idle blocking locks without a very good reason might cause mobile devices to never auto-suspend. This can be quite detrimental for the battery.
If an application finds a lock denied it should not consider this much of an error and just continue its operation without the protecting lock.
The tool [systemd-inhibit(1)](http://www.freedesktop.org/software/systemd/man/systemd-inhibit.html) may be used to take locks or list active locks from the command line.
Note that gnome-session also provides an [inhibitor API](http://people.gnome.org/~mccann/gnome-session/docs/gnome-session.html#org.gnome.SessionManager.Inhibit), which is very similar to the one of systemd. Internally, locks taken on gnome-session's interface will be forwarded to logind, hence both APIs are supported. While both offer similar functionality they do differ in some regards. For obvious reasons gnome-session can offer logout locks and screensaver avoidance locks which logind lacks. logind's API OTOH supports delay locks in addition to block locks like GNOME. Also, logind is available to system components, and centralizes locks from all users, not just those of a specific one. In general: if in doubt it is probably advisable to stick to the GNOME locks, unless there is a good reason to use the logind APIs directly. When locks are to be enumerated it is better to use the logind APIs however, since they also include locks taken by system services and other users.

18
docs/MINIMAL_BUILDS.md Normal file
View File

@ -0,0 +1,18 @@
---
title: Minimal Builds
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Minimal Builds
systemd includes a variety of components. The core components are always built (which includes systemd itself, as well as udevd and journald). Many of the other components can be disabled at compile time with configure switches.
For some uses the configure switches do not provide sufficient modularity. For example, they cannot be used to build only the man pages, or to build only the tmpfiles tool, only detect-virt or only udevd. If such modularity is required that goes beyond what we support in the configure script we can suggest you two options:
1. Build systemd as usual, but pick only the built files you need from the result of "make install DESTDIR=<directory>", by using the file listing functionality of your packaging software. For example: if all you want is the tmpfiles tool, then build systemd normally, and list only /usr/bin/systemd-tmpfiles in the .spec file for your RPM package. This is simple to do, allows you to pick exactly what you need, but requires a larger number of build dependencies (but not runtime dependencies).
2. If you want to reduce the build time dependencies (though only dbus and libcap are needed as build time deps) and you know the specific component you are interested in doesn't need it, then create a dummy .pc file for that dependency (i.e. basically empty), and configure systemd with PKG_CONFIG_PATH set to the path of these dummy .pc files. Then, build only the few bits you need with "make foobar", where foobar is the file you need.
We are open to merging patches for the build system that make more "fringe" components of systemd optional. However, please be aware that in order to keep the complexity of our build system small and its readability high, and to make our lives easier, we will not accept patches that make the minimal core components optional, i.e. systemd itself, journald and udevd.
Note that the .pc file trick mentioned above currently doesn't work for libcap, since libcap doesn't provide a .pc file. We invite you to go ahead and post a patch to libcap upstream to get this corrected. We'll happily change our build system to look for that .pc file then. (a .pc file has been sent to upstream by Bryan Kadzban. It is also available at [http://kdzbn.homelinux.net/libcap-add-pkg-config.patch](http://kdzbn.homelinux.net/libcap-add-pkg-config.patch)).

52
docs/OPTIMIZATIONS.md Normal file
View File

@ -0,0 +1,52 @@
---
title: systemd Optimizations
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# systemd Optimizations
_So you are working on a Linux distribution or appliance and need very fast boot-ups?_
systemd can already offer boot times of < 1s for the Core OS (userspace only, i.e. only the bits controlled by systemd) and < 2s for a complete up-to-date desktop environments on simpler (but modern, i.e. SSDs) laptops if configured properly (examples: [http://git.fenrus.org/tmp/bootchart-20120512-1036.svg](http://git.fenrus.org/tmp/bootchart-20120512-1036.svg)). In this page we want to suggest a couple of ideas how to achieve that, and if the resulting boot times do not suffice where we believe room for improvements are that we'd like to see implemented sooner or later. If you are interested in investing engineering manpower in systemd to get to even shorter boot times, this list hopefully includes a few good suggestions to start with.
Of course, before optimizing you should instrument the boot to generate profiling data, so make sure you know your way around with systemd-bootchart, systemd-analyze and pytimechart! Optimizations without profiling are premature optimizations!
Note that systemd's fast performance is a side effect of its design but wasn't the primary design goal. As it stands now systemd (and Fedora using it) has been optimized very little and still has a lot of room for improvements. There are still many low hanging fruits to pick!
We are very interested in merging optimization work into systemd upstream. Note however that we are careful not to merge work that would drastically limit the general purpose usefulness or reliability of our code, or that would make systemd harder to maintain. So in case you work on optimizations for systemd, try to keep your stuff mainlineable. If in doubt, ask us.
The distributions have adopted systemd to varying levels. While there are many compatibility scripts in the boot process on Debian for example, Fedora has much less (but still too many). For better performance consider disabling these scripts, or using a different distribution.
It is our intention to optimize the upstream distributions by default (in particular Fedora) so that these optimizations won't be necessary. However, this will take some time, especially since making these changes is often not trivial when the general purpose usefulness cannot be compromised.
What you can optimize (locally) without writing any code:
1. Make sure not to use any fake block device storage technology such as LVM (as installed by default by various distributions, including Fedora) they result in the systemd-udev-settle.service unit to be pulled in. Settling device enumeration is slow, racy and mostly obsolete. Since LVM (still) hasn't been updated to handle Linux' event based design properly, settling device enumeration is still required for it, but it will slow down boot substantially. On Fedora, use "systemctl mask fedora-wait-storage.service fedora-storage-init-late.service fedora-storage-init.service" to get rid of all those storage technologies. Of course, don't try this if you actually installed your system with LVM. (The only fake block device storage technology that currently handles this all properly and doesn't require settling device enumerations is LUKS disk encryption.)
2. Consider bypassing the initrd, if you use one. On Fedora, make sure to install the OS on a plain disk without encryption, and without LVM/RAID/... (encrypted /home is fine) when doing this. Then, simply edit grub.conf and remove the initrd from your configuration, and change the root= kernel command line parameter so that it uses kernel device names instead of UUIDs, i.e. "root=sda5" or what is appropriate for your system. Also specify the root FS type with "rootfstype=ext4" (or as appropriate). Note that using kernel devices names is not really that nice if you have multiple hard disks, but if you are doing this for a laptop (i.e. with a single hdd), this should be fine. Note that you shouldn't need to rebuild your kernel in order to bypass the initrd. Distribution kernels (at least Fedora's) work fine with and without initrd, and systemd supports both ways to be started.
3. Consider disabling SELinux and auditing. We recommend leaving SELinux on, for security reasons, but truth be told you can save 100ms of your boot if you disable it. Use selinux=0 on the kernel cmdline.
4. Consider disabling Plymouth. If userspace boots in less than 1s, a boot splash is hardly useful, hence consider passing plymouth.enable=0 on the kernel command line. Plymouth is generally quite fast, but currently still forces settling device enumerations for graphics cards, which is slow. Disabling plymouth removes this bit of the boot.
5. Consider uninstalling syslog. The journal is used anyway on newer systemd systems, and is usually more than sufficient for desktops, and embedded, and even many servers. Just uninstall all syslog implementations and remember that "journalctl" will get you a pixel perfect copy of the classic /var/log/messages message log. To make journal logs persistent (i.e. so that they aren't lost at boot) make sure to run "mkdir -p /var/log/journal".
6. Consider masking a couple of redundant distribution boot scripts, that artificially slow down the boot. For example, on Fedora it's a good idea to mask fedora-autoswap.service fedora-configure.service fedora-loadmodules.service fedora-readonly.service. Also remove all LVM/RAID/FCOE/iSCSI related packages which slow down the boot substantially even if no storage of the specific kind is used (and if these RPMs can't be removed because some important packages require them, at least mask the respective services).
7. Console output is slow. So if you measure your boot times and ship your system, make sure to use "quiet" on the command line and disable systemd debug logging (if you enabled it before).
8. Consider removing cron from your system and use systemd timer units instead. Timer units currently have no support for calendar times (i.e. cannot be used to spawn things "at 6 am every Monday", but can do "run this every 7 days"), but for the usual /etc/cron.daily/, /etc/cron.weekly/, ... should be good enough, if the time of day of the execution doesn't matter (just add four small service and timer units for supporting these dirs. Eventually we might support these out of the box, but until then, just write your own scriplets for this).
9. If you work on an appliance, consider disabling readahead collection in the shipped devices, but leave readahead replay enabled.
10. If you work on an appliance, make sure to build all drivers you need into the kernel, since module loading is slow. If you build a distribution at least built all the stuff 90% of all people need into your kernel, i.e. at least USB, AHCI and HDA!
11. If it works, use libahci.ignore_sss=1 when booting.
12. Use a modern desktop that doesn't pull in ConsoleKit anymore. For example GNOME 3.4.
13. Get rid of a local MTA, if you are building a desktop or appliance. I.e. on Fedora remove the sendmail RPMs which are (still!) installed by default.
14. If you build an appliance, don't forget that various components of systemd are optional and may be disabled during build time, see "./configure --help" for details. For example, get rid of the virtual console setup if you never have local console users (this is a major source of slowness, actually). In addition, if you never have local users at all, consider disabling logind. And there are more components that are frequently unnecessary on appliances.
15. This goes without saying: the boot-up gets faster if you started less stuff at boot. So run "systemctl" and check if there's stuff you don't need and disable it, or even remove its package.
16. Don't use debug kernels. Debug kernels are slow. Fedora exclusively uses debug kernels during the development phase of each release. If you care about boot performance, either recompile these kernels with debugging turned off or wait for the final distribution release. It's a drastic difference. That also means that if you publish boot performance data of a Fedora pre-release distribution you are doing something wrong. ;-) So much about the basics of how to get a quick boot. Now, here's an incomprehensive list of things we'd like to see improved in systemd (and elsewhere) over short or long and need a bit of hacking (sometimes more, and sometimes less):
17. Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs.
18. Make use of EXT4_IOC_MOVE_EXT in systemd's readahead implementation. This allows reordering/defragmentation of the files needed for boot. According to the data from [http://e4rat.sourceforge.net/](http://e4rat.sourceforge.net/) this might shorten the boot time to 40%. Implementation is not trivial, but given that we already support btrfs defragmentation and example code for this exists (e4rat as linked) should be fairly straightforward.
19. Compress readahead pack files with XZ or so. Since boot these days tends to be clearly IO bound (and not CPU bound) it might make sense to reduce the IO load for the pack file by compressing it. Since we already have a dependency on XZ we'd recommend using XZ for this.
20. Update the readahead logic to also precache directories (in addition to files).
21. Improve a couple of algorithms in the unit dependency graph calculation logic, as well as unit file loading. For example, right now when loading units we match them up with a subset of the other loaded units in order to add automatic dependencies between them where appropriate. Usually the set of units matched up is small, but the complexity is currently O(n^2), and this could be optimized. Since unit file loading and calculations in the dependency graphs is the only major, synchronous, computation-intensive bit of PID 1, and is executed before any services are started this should bring relevant improvements, especially on systems with big dependency graphs.
22. Add socket activation to X. Due to the special socket allocation semantics of X this is useful only for display :0. This should allow parallelization of X startup with its clients.
23. The usual housekeeping: get rid of shell-based services (i.e. SysV init scripts), replace them with unit files. Don't make use of Type=forking and ordering dependencies if possible, use socket activation with Type=simple instead. This allows drastically better parallelized start-up for your services. Also, if you cannot use socket activation, at least consider patching your services to support Type=notify in place of Type=forking. Consider making seldom used services activated on-demand (for example, printer services), and start frequently used services already at boot instead of delaying them until they are used.
24. Consider making use of systemd for the session as well, the way Tizen is doing this. This still needs some love in systemd upstream to be a smooth ride, but we definitely would like to go this way sooner or later, even for the normal desktops.
25. Add an option for service units to temporarily bump the CPU and IO priority of the startup code of important services. Note however, that we assume that this will not bring much and hence recommend looking into this only very late. Since boot-up tends to be IO bound, solutions such as readahead are probably more interesting than prioritizing service startup IO. Also, this would probably always require a certain amount of manual configuration since determining automatically which services are important is hard (if not impossible), because we cannot track properly which services other services wait for.
26. Same as the previous item, but temporarily lower the CPU/IO priority of the startups part of unimportant leaf services. This is probably more useful than 11 as it is easier to determine which processes don't matter.
27. Add a kernel sockopt for AF_UNIX to increase the maximum datagram queue length for SOCK_DGRAM sockets. This would allow us to queue substantially more logging datagrams in the syslog and journal sockets, and thus move the point where syslog/journal clients have to block before their message writes finish much later in the boot process. The current kernel default is rather low with 10. (As a temporary hack it is possible to increase /proc/sys/net/unix/max_dgram_qlen globally, but this has implications beyond systemd, and should probably be avoided.) The kernel patch to make this work is most likely trivial. In general, this should allow us to improve the level of parallelization between clients and servers for AF_UNIX sockets of type SOCK_DGRAM or SOCK_SEQPACKET. Again: the list above contains things we'd like to see in systemd anyway. We didn't do much profiling for these features, but we have enough indication to assume that these bits will bring some improvements. But yeah, if you work on this, keep your profiling tools ready at all times.

44
docs/PRESET.md Normal file
View File

@ -0,0 +1,44 @@
---
title: Presets
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Presets
## Why?
Different **distributions** have different policies on which services shall be enabled by default when the package they are shipped in is installed. On Fedora all services stay off by default, so that installing a package will not cause a service to be enabled (with some exceptions). On Debian all services are immediately enabled by default, so that installing a package will cause its service(s) to be enabled right-away.
Different **spins** (flavours, remixes, whatever you might want to call them) of a distribution also have different policies on what services to enable, and what services to leave off. For example, the Fedora default will enable gdm as display manager by default, while the Fedora KDE spin will enable kdm instead.
Different **sites** might also have different policies what to turn on by default and what to turn off. For example, one administrator would prefer to enforce the policy of "ssh should be always on, but everything else off", while another one might say "snmp always on, and for everything else use the distribution policy defaults".
## The Logic
Traditionally, policy about what services shall be enabled and what services shall not have been decided globally by the distributions, and were enforced in each package individually. This made it cumbersome to implement different policies per spin or per site, or to create software packages that do the right thing on more than one distribution. The enablement _mechanism_ was also encoding the enablement _policy_.
systemd 32 and newer support package "preset" policies. These encode which units shall be enabled by default when they are installed, and which units shall not be enabled.
Preset files may be written for specific distributions, for specific spins or for specific sites, in order to enforce different policies as needed. Preset policies are stored in .preset files in /usr/lib/systemd/system-preset/. If no policy exists the default implied policy of "enable everything" is enforced, i.e. in Debian style.
The policy encoded in preset files is applied to a unit by invoking "systemctl preset ". It is recommended to use this command in all package post installation scriptlets. "systemctl preset " is identical to "systemctl enable " resp. "systemctl disable " depending on the policy.
Preset files allow clean separation of enablement mechanism (inside the package scriptlets, by invoking "systemctl preset"), and enablement policy (centralized in the preset files).
## Documentation
Documentation for the preset policy file format is available here: [http://www.freedesktop.org/software/systemd/man/systemd.preset.html](http://www.freedesktop.org/software/systemd/man/systemd.preset.html)
Documentation for "systemctl preset" you find here: [http://www.freedesktop.org/software/systemd/man/systemctl.html](http://www.freedesktop.org/software/systemd/man/systemctl.html)
Documentation for the recommended package scriptlets you find here: [http://www.freedesktop.org/software/systemd/man/daemon.html](http://www.freedesktop.org/software/systemd/man/daemon.html)
## How To
For the preset logic to be useful, distributions need to implement a couple of steps:
- The default distribution policy needs to be encoded in a preset file /usr/lib/systemd/system-preset/99-default.preset or suchlike, unless the implied policy of "enable everything" is the right choice. For a Fedora-like policy of "enable nothing" it is sufficient to include the single line "disable" into that file. The default preset file should be installed as part of one the core packages of the distribution.
- All packages need to be updated to use "systemctl preset" in the post install scriptlets.
- (Optionally) spins/remixes/flavours should define their own preset file, either overriding or extending the default distribution preset policy. Also see the fedora feature page: [https://fedoraproject.org/wiki/Features/PackagePresets](https://fedoraproject.org/wiki/Features/PackagePresets)

50
docs/SYSLOG.md Normal file
View File

@ -0,0 +1,50 @@
---
title: Writing syslog Daemons Which Cooperate Nicely With systemd
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Writing syslog Daemons Which Cooperate Nicely With systemd
Here are a few notes on things to keep in mind when you work on a classic BSD syslog daemon for Linux, to ensure that your syslog daemon works nicely together with systemd. If your syslog implementation does not follow these rules, then it will not be compatible with systemd v38 and newer.
A few notes in advance: systemd centralizes all log streams in the Journal daemon. Messages coming in via /dev/log, via the native protocol, via STDOUT/STDERR of all services and via the kernel are received in the journal daemon. The journal daemon then stores them to disk or in RAM (depending on the configuration of the Storage= option in journald.conf), and optionally forwards them to the console, the kernel log buffer, or to a classic BSD syslog daemon -- and that's where you come in.
Note that it is now the journal that listens on /dev/log, no longer the BSD syslog daemon directly. If your logging daemon wants to get access to all logging data then it should listen on /run/systemd/journal/syslog instead via the syslog.socket unit file that is shipped along with systemd. On a systemd system it is no longer OK to listen on /dev/log directly, and your daemon may not bind to the /run/systemd/journal/syslog socket on its own. If you do that then you will lose logging from STDOUT/STDERR of services (as well as other stuff).
Your BSD compatible logging service should alias `syslog.service` to itself (i.e. symlink) when it is _enabled_. That way [syslog.socket](http://cgit.freedesktop.org/systemd/systemd/plain/units/syslog.socket) will activate your service when things are logged. Of course, only one implementation of BSD syslog can own that symlink, and hence only one implementation can be enabled at a time, but that's intended as there can only be one process listening on that socket. (see below for details how to manage this symlink.) Note that this means that syslog.socket as shipped with systemd is _shared_ among all implementations, and the implementation that is in control is configured with where syslog.service points to.
Note that journald tries hard to forward to your BSD syslog daemon as much as it can. That means you will get more than you traditionally got on /dev/log, such as stuff all daemons log on STDOUT/STDERR and the messages that are logged natively to systemd. Also, we will send stuff like the original SCM_CREDENTIALS along if possible.
(BTW, journald is smart enough not to forward the kernel messages it gets to you, you should read that on your own, directly from /proc/kmsg, as you always did. It's also smart enough never to forward kernel messages back to the kernel, but that probably shouldn't concern you too much...)
And here are the recommendations:
- First of all, make sure your syslog daemon installs a native service unit file (SysV scripts are not sufficient!) and is socket activatable. Newer systemd versions (v35+) do not support non-socket-activated syslog daemons anymore and we do no longer recommend people to order their units after syslog.target. That means that unless your syslog implementation is socket activatable many services will not be able to log to your syslog implementation and early boot messages are lost entirely to your implementation. Note that your service should install only one unit file, and nothing else. Do not install socket unit files.
- Make sure that in your unit file you set StandardOutput=null in the [Service] block. This makes sure that regardless what the global default for StandardOutput= is the output of your syslog implementation goes to /dev/null. This matters since the default StandardOutput= value for all units can be set to syslog and this should not create a feedback loop with your implementation where the messages your syslog implementation writes out are fed back to it. In other words: you need to explicitly opt out of the default standard output redirection we do for other services. (Also note that you do not need to set StandardError= explicitly, since that inherits the setting of StandardOutput= by default)
- /proc/kmsg is your property, flush it to disk as soon as you start up.
- Name your service unit after your daemon (e.g. rsyslog.service or syslog-ng.service) and make sure to include Alias=syslog.service in your [Install] section in the unit file. This is ensures that the symlink syslog.service is created if your service is enabled and that it points to your service. Also add WantedBy=multi-user.target so that your service gets started at boot, and add Requires=syslog.socket in [Unit] so that you pull in the socket unit.
Here are a few other recommendations, that are not directly related to systemd:
- Make sure to read the priority prefixes of the kmsg log messages the same way like from normal userspace syslog messages. When systemd writes to kmsg it will prefix all messages with valid priorities which include standard syslog facility values. OTOH for kernel messages the facility is always 0. If you need to know whether a message originated in the kernel rely on the facility value, not just on the fact that you read the message from /proc/kmsg! A number of userspace applications write messages to kmsg (systemd, udev, dracut, others), and they'll nowadays all set correct facility values.
- When you read a message from the socket use SCM_CREDENTIALS to get information about the client generating it, and possibly patch the message with this data in order to make it impossible for clients to fake identities.
The unit file you install for your service should look something like this:
```
[Unit]
Description=System Logging Service
Requires=syslog.socket
[Service]
ExecStart=/usr/sbin/syslog-ng -n
StandardOutput=null
[Install]
Alias=syslog.service
WantedBy=multi-user.target
```
And remember: don't ship any socket unit for /dev/log or /run/systemd/journal/syslog (or even make your daemon bind directly to these sockets)! That's already shipped along with systemd for you.

View File

@ -0,0 +1,20 @@
---
title: systemd File Hierarchy Requirements
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# systemd File Hierarchy Requirements
There are various attempts to standardize the file system hierarchy of Linux systems. In systemd we leave much of the file system layout open to the operating system, but here's what systemd strictly requires:
- /, /usr, /etc must be mounted when the host systemd is first invoked. This may be achieved either by using the kernel's built-in root disk mounting (in which case /, /usr and /etc need to be on the same file system), or via an initrd, which could mount the three directories from different sources.
- /bin, /sbin, /lib (and /lib64 if applicable) should reside on /, or be symlinks to the /usr file system (recommended). All of them must be available before the host systemd is first executed.
- /var does not have to be mounted when the host systemd is first invoked, however, it must be configured so that it is mounted writable before local-fs.target is reached (for example, by simply listing it in /etc/fstab).
- /tmp is recommended to be a tmpfs (default), but doesn't have to. If configured, it must be mounted before local-fs.target is reached (for example, by listing it in /etc/fstab).
- /dev must exist as an empty mount point and will automatically be mounted by systemd with a devtmpfs. Non-devtmpfs boots are not supported.
- /proc and /sys must exist as empty mount points and will automatically be mounted by systemd with procfs and sysfs.
- /run must exist as an empty mount point and will automatically be mounted by systemd with a tmpfs.
The other directories usually found in the root directory (such as /home, /boot, /opt) are irrelevant to systemd. If they are defined they may be mounted from any source and at any time, though it is a good idea to mount them also before local-fs.target is reached.

View File

@ -0,0 +1,115 @@
---
title: The Case for the /usr Merge
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# The Case for the /usr Merge
**Why the /usr Merge Makes Sense for Compatibility Reasons**
_This is based on the [Fedora feature](https://fedoraproject.org/wiki/Features/UsrMove) for the same topic, put together by Harald Hoyer and Kay Sievers. This feature has been implemented successfully in Fedora 17._
Note that this page discusses a topic that is actually independent of systemd. systemd supports both systems with split and with merged /usr, and the /usr merge also makes sense for systemd-less systems. That said we want to encourage distributions adopting systemd to also adopt the /usr merge.
### What's Being Discussed Here?
Fedora (and other distributions) have finished work on getting rid of the separation of /bin and /usr/bin, as well as /sbin and /usr/sbin, /lib and /usr/lib, and /lib64 and /usr/lib64. All files from the directories in / will be merged into their respective counterparts in /usr, and symlinks for the old directories will be created instead:
```
/bin → /usr/bin
/sbin → /usr/sbin
/lib → /usr/lib
/lib64 → /usr/lib64
```
You are wondering why merging /bin, /sbin, /lib, /lib64 into their respective counterparts in /usr makes sense, and why distributions are pushing for it? You are wondering whether your own distribution should adopt the same change? Here are a few answers to these questions, with an emphasis on a compatibility point of view:
### Compatibility: The Gist of It
- Improved compatibility with other Unixes/Linuxes in _behavior_: After the /usr merge all binaries become available in both /bin and /usr/bin, resp. both /sbin and /usr/sbin (simply because /bin becomes a symlink to /usr/bin, resp. /sbin to /usr/sbin). That means scripts/programs written for other Unixes or other Linuxes and ported to your distribution will no longer need fixing for the file system paths of the binaries called, which is otherwise a major source of frustration. /usr/bin and /bin (resp. /usr/sbin and /sbin) become entirely equivalent.
- Improved compatibility with other Unixes (in particular Solaris) in _appearance_: The primary commercial Unix implementation is nowadays Oracle Solaris. Solaris has already completed the same /usr merge in Solaris 11. By making the same change in Linux we minimize the difference towards the primary Unix implementation, thus easing portability from Solaris.
- Improved compatibility with GNU build systems: The biggest part of Linux software is built with GNU autoconf/automake (i.e. GNU autotools), which are unaware of the Linux-specific /usr split. Maintaining the /usr split requires non-trivial project-specific handling in the upstream build system, and in your distribution's packages. With the /usr merge, this work becomes unnecessary and porting packages to Linux becomes simpler.
- Improved compatibility with current upstream development: In order to minimize the delta from your Linux distribution to upstream development the /usr merge is key.
### Compatibility: The Longer Version
A unified filesystem layout (as it results from the /usr merge) is more compatible with UNIX than Linux traditional split of /bin vs. /usr/bin. Unixes differ in where individual tools are installed, their locations in many cases are not defined at all and differ in the various Linux distributions. The /usr merge removes this difference in its entirety, and provides full compatibility with the locations of tools of any Unix via the symlink from /bin to /usr/bin.
#### Example
- /usr/bin/foo may be called by other tools either via /usr/bin/foo or /bin/foo, both paths become fully equivalent through the /usr merge. The operating system ends up executing exactly the same file, simply because the symlink /bin just redirects the invocation to /usr/bin.
The historical justification for a /bin, /sbin and /lib separate from /usr no longer applies today. ([More on the historical justification for the split](http://lists.busybox.net/pipermail/busybox/2010-December/074114.html), by Rob Landley) They were split off to have selected tools on a faster hard disk (which was small, because it was more expensive) and to contain all the tools necessary to mount the slower /usr partition. Today, a separate /usr partition already must be mounted by the initramfs during early boot, thus making the justification for a split-off moot. In addition a lot of tools in /bin and /sbin in the status quo already lost the ability to run without a pre-mounted /usr. There is no valid reason anymore to have the operating system spread over multiple hierarchies, it lost its purpose.
Solaris implemented the core part of the /usr merge 15 years ago already, and completed it with the introduction of Solaris 11. Solaris has /bin and /sbin only as symlinks in the root file system, the same way as you will have after the /usr merge: [Transitioning From Oracle Solaris 10 to Oracle Solaris 11 - User Environment Feature Changes](http://docs.oracle.com/cd/E23824_01/html/E24456/userenv-1.html).
Not implementing the /usr merge in your distribution will isolate it from upstream development. It will make porting of packages needlessly difficult, because packagers need to split up installed files into multiple directories and hard code different locations for tools; both will cause unnecessary incompatibilities. Several Linux distributions are agreeing with the benefits of the /usr merge and are already in the process to implement the /usr merge. This means that upstream projects will adapt quickly to the change, those making portability to your distribution harder.
### Beyond Compatibility
One major benefit of the /usr merge is the reduction of complexity of our system: the new file system hierarchy becomes much simpler, and the separation between (read-only, potentially even immutable) vendor-supplied OS resources and users resources becomes much cleaner. As a result of the reduced complexity of the hierarchy, packaging becomes much simpler too, since the problems of handling the split in the .spec files go away.
The merged directory /usr, containing almost the entire vendor-supplied operating system resources, offers us a number of new features regarding OS snapshotting and options for enterprise environments for network sharing or running multiple guests on one host. Static vendor-supplied OS resources are monopolized at a single location, that can be made read-only easily, either for the whole system or individually for each service. Most of this is much harder to accomplish, or even impossible, with the current arbitrary split of tools across multiple directories.
_With all vendor-supplied OS resources in a single directory /usr they may be shared atomically, snapshots of them become atomic, and the file system may be made read-only as a single unit._
#### Example: /usr Network Share
- With the merged /usr directory we can offer a read-only export of the vendor supplied OS to the network, which will contain almost the entire operating system resources. The client hosts will then only need a minimal host-specific root filesystem with symlinks pointing into the shared /usr filesystem. From a maintenance perspective this is the first time where sharing the operating system over the network starts to make sense. Without the merged /usr directory (like in traditional Linux) we can only share parts of the OS at a time, but not the core components of it that are located in the root file system. The host-specific root filesystem hence traditionally needs to be much larger, cannot be shared among client hosts and needs to be updated individually and often. Vendor-supplied OS resources traditionally ended up "leaking" into the host-specific root file system.
#### Example: Multiple Guest Operating Systems on the Same Host
- With the merged /usr directory, we can offer to share /usr read-only with multiple guest operating systems, which will shrink down the guest file system to a couple of MB. The ratio of the per-guest host-only part vs. the shared operating system becomes nearly optimal.
In the long run the maintenance burden resulting of the split-up tools in your distribution, and hard-coded deviating installation locations to distribute binaries and other packaged files into multiple hierarchies will very likely cause more trouble than the /usr merge itself will cause.
## Myths and Facts
**Myth #1**: Fedora is the first OS to implement the /usr merge
**Fact**: Oracle Solaris has implemented the /usr merge in parts 15 years ago, and completed it in Solaris 11. Fedora is following suit here, it is not the pioneer.
**Myth #2**: Fedora is the only Linux distribution to implement the /usr merge
**Fact**: Multiple other Linux distributions have been working in a similar direction.
**Myth #3**: The /usr merge decreases compatibility with other Unixes/Linuxes
**Fact**: By providing all binary tools in /usr/bin as well as in /bin (resp. /usr/sbin + /sbin) compatibility with hard coded binary paths in scripts is increased. When a distro A installs a tool “foo” in /usr/bin, and distro B installs it in /bin, then well provide it in both, thus creating compatibility with both A and B.
**Myth #4**: The /usr merges only purpose is to look pretty, and has no other benefits
**Fact**: The /usr merge makes sharing the vendor-supplied OS resources between a host and networked clients as well as a host and local light-weight containers easier and atomic. Snapshotting the OS becomes a viable option. The /usr merge also allows making the entire vendor-supplied OS resources read-only for increased security and robustness.
**Myth #5**: Adopting the /usr merge in your distribution means additional work for your distribution's package maintainers
**Fact**: When the merge is implemented in other distributions and upstream, not adopting the /usr merge in your distribution means more work, adopting it is cheap.
**Myth #6**: A split /usr is Unix “standard”, and a merged /usr would be Linux-specific
**Fact**: On SysV Unix /bin traditionally has been a symlink to /usr/bin. A non-symlinked version of that directory is specific to non-SysV Unix and Linux.
**Myth #7**: After the /usr merge one can no longer mount /usr read-only, as it is common usage in many areas.
**Fact**: Au contraire! One of the reasons we are actually doing this is to make a read-only /usr more thorough: the entire vendor-supplied OS resources can be made read-only, i.e. all of what traditionally was stored in /bin, /sbin, /lib on top of what is already in /usr.
**Myth #8**: The /usr merge will break my old installation which has /usr on a separate partition.
**Fact**: This is perfectly well supported, and one of the reasons we are actually doing this is to make placing /usr of a separate partition more thorough. What changes is simply that you need to boot with an initrd that mounts /usr before jumping into the root file system. Most distributions rely on initrds anyway, so effectively little changes.
**Myth #9**: The /usr split is useful to have a minimal rescue system on the root file system, and the rest of the OS on /usr.
**Fact**: On Fedora the root directory contains ~450MB already. This hasn't been minimal since a long time, and due to today's complex storage and networking technologies it's unrealistic to ever reduce this again. In fact, since the introduction of initrds to Linux the initrd took over the role as minimal rescue system that requires only a working boot loader to be started, but not a full file system.
**Myth #10**: The status quo of a split /usr with mounting it without initrd is perfectly well supported right now and works.
**Fact**: A split /usr without involvement of an initrd mounting it before jumping into the root file system [hasn't worked correctly since a long time](http://freedesktop.org/wiki/Software/systemd/separate-usr-is-broken).
**Myth #11**: Instead of merging / into /usr it would make a lot more sense to merge /usr into /.
**Fact**: This would make the separation between vendor-supplied OS resources and machine-specific even worse, thus making OS snapshots and network/container sharing of it much harder and non-atomic, and clutter the root file system with a multitude of new directories.
---
If this page didn't answer your questions you may continue reading [on the Fedora feature page](https://fedoraproject.org/wiki/Features/UsrMove) and this [mail from Lennart](http://thread.gmane.org/gmane.linux.redhat.fedora.devel/155511/focus=155792).

View File

@ -0,0 +1,78 @@
---
title: Testing systemd during Development in Virtualization
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Testing systemd during Development in Virtualization
For quickly testing systemd during development it us useful to boot it up in a container and in a QEMU VM.
## Testing in a VM
Here's a nice hack if you regularly build and test-boot systemd, are gutsy enough to install it into your host, but too afraid or too lazy to continuously reboot your host.
Create a shell script like this:
```
#!/bin/sh
sudo sync
sudo /bin/sh -c 'echo 3 > /proc/sys/vm/drop_caches'
sudo umount /
sudo modprobe kvm-intel
exec sudo qemu-kvm -smp 2 -m 512 -snapshot /dev/sda
```
This will boot your local host system as a throw-away VM guest. It will take your main harddisk, boot from it in the VM, allow changes to it, but these changes are all just buffered in memory and never hit the real disk. Any changes made in this VM will be lost when the VM terminates. I have called this script "q", and hence for test booting my own system all I do is type the following command in my systemd source tree and I can see if it worked.
```
$ make -j10 && sudo make install && q
```
The first three lines are necessary to ensure that the kernel's disk caches are all synced to disk before qemu takes the snapshot of it. Yes, invoking "umount /" will sync your file system to disk as a side effect, even though it will actually fail. When the machine boots up the file system will still be marked dirty (and hence you will get an fsck, usually), but it will work fine nonetheless in virtually all cases.
Of course, if the host's hard disk changes while the VM is running this will be visible to the VM, and might confuse it. If you use this little hack you should keep changes on the host at a minimum, hence. Yeah this all is a hack, but a really useful and neat one.
YMMV if you use LVM or btrfs.
## Testing in a Container
Test-booting systemd in a container has the benefit of being much easier to debug/instrument from the outside.
**Important**: As preparation it is essential to turn off auditing entirely on your system. Auditing is broken with containers, and will trigger all kinds of error in containers if not turned off. Use `audit=0` on the host's kernel command line to turn it off.
Then, as the first step I install Fedora into a container tree:
```
$ sudo yum -y --releasever=20 --installroot=$HOME/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
```
You can do something similar with debootstrap on a Debian system. Now, we need to set a root password in order to be able to log in:
```
$ sudo systemd-nspawn -D ~/fedora-tree/
# passwd
...
# ^D
```
As the next step we can already boot the container:
```
$ sudo systemd-nspawn -bD ~/fedora-tree/ 3
```
To test systemd in the container I then run this from my source tree on the host:
```
$ make -j10 && sudo DESTDIR=$HOME/fedora-tree make install && sudo systemd-nspawn -bD ~/fedora-tree/ 3
```
And that's already it.

View File

@ -0,0 +1,30 @@
---
title: Writing Desktop Environments
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Writing Desktop Environments
_Or: how to hook up your favorite desktop environment with logind_
systemd's logind service obsoletes ConsoleKit which was previously widely used on Linux distributions. This provides a number of new features, but also requires updating of the Desktop Environment running on it, in a few ways.
This document should be read together with [Writing Display Managers](http://www.freedesktop.org/wiki/Software/systemd/writing-display-managers) which focuses on the porting work necessary for display managers.
If required it is possible to implement ConsoleKit and systemd-logind support in the same desktop environment code, detecting at runtime which interface is needed. The [sd_booted()](http://www.freedesktop.org/software/systemd/man/sd_booted.html) call may be used to determine at runtime whether systemd is used.
To a certain level ConsoleKit and systemd-logind may be used side-by-side, but a number of features are not available if ConsoleKit is used.
Please have a look at the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) and the C API as documented in [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html). (Also see below)
Here are the suggested changes:
- Your session manager should listen to "Lock" and "Unlock" messages that are emitted from the session object logind exposes for your DE session, on the system bus. If "Lock" is received the screen lock should be activated, if "Unlock" is received it should be deactivated. This can easily be tested with "loginctl lock-sessions". See the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) for further details.
- Whenever the session gets idle the DE should invoke the SetIdleHint(True) call on the respective session object on the session bus. This is necessary for the system to implement auto-suspend when all sessions are idle. If the session gets used again it should call SetIdleHint(False). A session should be considered idle if it didn't receive user input (mouse movements, keyboard) in a while. See the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) for further details.
- To reboot/power-off/suspend/hibernate the machine from the DE use logind's bus calls Reboot(), PowerOff(), Suspend(), Hibernate(), HybridSleep(). For further details see [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind).
- If your session manager handles the special power, suspend, hibernate hardware keys or the laptop lid switch on its own it is welcome to do so, but needs to disable logind's built-in handling of these events. Take one or more of the _handle-power-key_, _handle-suspend-key_, _handle-hibernate-key_, _handle-lid-switch_ inhibitor locks for that. See [Inhibitor Locks](http://www.freedesktop.org/wiki/Software/systemd/inhibit) for further details on this.
- Before rebooting/powering-off/suspending/hibernating and when the operation is triggered by the user by clicking on some UI elements (or suchlike) it is recommended to show the list of currently active inhibitors for the operation, and ask the user to acknowledge the operation. Note that PK often allows the user to execute the operation ignoring the inhibitors. Use logind's ListInhibitors() call to get a list of these inhibitors. See [Inhibitor Locks](http://www.freedesktop.org/wiki/Software/systemd/inhibit) for further details on this.
- If your DE contains a process viewer of some kind ("system monitor") it's a good idea to show session, service and seat information for each process. Use sd_pid_get_session(), sd_pid_get_unit(), sd_session_get_seat() to determine these. For details see [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html).
And that's all! Thank you!

View File

@ -0,0 +1,39 @@
---
title: Writing Display Managers
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Writing Display Managers
_Or: How to hook up your favorite X11 display manager with systemd_
systemd's logind service obsoletes ConsoleKit which was previously widely used on Linux distributions. For X11 display managers the switch to logind requires a minimal amount of porting, however brings a couple of new features: true automatic multi-seat support, proper tracking of session processes, (optional) automatic killing of user processes on logout, a synchronous low-level C API and much simplification.
This document should be read together with [Writing Desktop Environments](http://www.freedesktop.org/wiki/Software/systemd/writing-desktop-environments) which focuses on the porting work necessary for desktop environments.
If required it is possible to implement ConsoleKit and systemd-logind support in the same display manager, detecting at runtime which interface is needed. The [sd_booted()](http://www.freedesktop.org/software/systemd/man/sd_booted.html) call may be used to determine at runtime whether systemd is used.
To a certain level ConsoleKit and systemd-logind may be used side-by-side, but a number of features are not available if ConsoleKit is used, for example automatic multi-seat support.
Please have a look at the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) and the C API as documented in [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html). (Also see below)
Minimal porting (without multi-seat) requires the following:
1. Remove/disable all code responsible for registering your service with ConsoleKit.
2. Make sure to register your greeter session via the PAM session stack, and make sure the PAM session modules include pam_systemd. Also, make sure to set the session class to "greeter." This may be done by setting the environment variable XDG_SESSION_CLASS to "greeter" with pam_misc_setenv() or setting the "class=greeter" option in the pam_systemd module, in order to allow applications to filter out greeter sessions from normal login sessions.
3. Make sure to register your logged in session via the PAM session stack as well, also including pam_systemd in it.
4. Optionally, use pam_misc_setenv() to set the environment variables XDG_SEAT and XDG_VTNR. The former should contain "seat0", the latter the VT number your session runs on. pam_systemd can determine these values automatically but it's nice to pass these variables anyway.
In summary: porting a display manager from ConsoleKit to systemd primarily means removing code, not necessarily adding any new code. Here, a cheers to simplicity!
Complete porting (with multi-seat) requires the following (Before you continue, make sure to read up on [Multi-Seat on Linux](http://www.freedesktop.org/wiki/Software/systemd/multiseat) first.):
1. Subscribe to seats showing up and going away, via the systemd-logind D-Bus interface's SeatAdded and SeatRemoved signals. Take possession of each seat by spawning your greeter on it. However, do so exclusively for seats where the boolean CanGraphical property is true. Note that there are seats that cannot do graphical, and there are seats that are text-only first, and gain graphical support later on. Most prominently this is actually seat0 which comes up in text mode, and where the graphics driver is then loaded and probed during boot. This means display managers must watch PropertyChanged events on all seats, to see if they gain (or lose) the CanGraphical field.
2. Use ListSeats() on the D-Bus interface to acquire a list of already available seats and also take possession of them.
3. For each seat you spawn a greeter/user session on use the XDG_SEAT and XDG_VTNR PAM environment variables to inform pam_systemd about the seat name, resp. VT number you start them on. Note that only the special seat "seat0" actually knows kernel VTs, so you shouldn't pass the VT number on any but the main seat, since it doesn't make any sense there.
4. Pass the seat name to the X server you start via the -seat parameter.
5. At this time X interprets the -seat parameter natively only for input devices, not for graphics devices. To work around this limitation we provide a tiny wrapper /lib/systemd/systemd-multi-seat-x which emulates the enumeration for graphics devices too. This wrapper will eventually go away, as soon as X learns udev-based graphics device enumeration natively, instead of the current PCI based one. Hence it is a good idea to fall back to the real X when this wrapper is not found. You may use this wrapper exactly like the real X server, and internally it will just exec() it after putting together a minimal multi-seat configuration.
And that's already it.
While most information about seats, sessions and users is available on systemd-logind's D-Bus interface, this is not the only API. The synchronous [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html) C interface is often easier to use and much faster too. In fact it is possible to implement the scheme above entirely without D-Bus relying only on this API. Note however, that this C API is purely passive, and if you want to execute an actually state changing operation you need to use the bus interface (for example, to switch sessions, or to kill sessions and suchlike). Also have a look at the [logind Bus API](http://www.freedesktop.org/wiki/Software/systemd/logind).

View File

@ -0,0 +1,64 @@
---
title: Writing Network Configuration Managers
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Writing Network Configuration Managers
_Or: How to hook up your favourite network configuration manager's DNS logic with `systemd-resolved`_
_(This is a longer explanation how to use some parts of `systemd-resolved` bus API. If you are just looking for an API reference, consult the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/) instead.)_
Since systemd 229 `systemd-resolved` offers a powerful bus API that may be used by network configuration managers (e.g. NetworkManager, connman, …, but also lower level DHCP, VPN or PPP daemons managing specific interfaces) to pass DNS server and DNSSEC configuration directly to `systemd-resolved`. Note that `systemd-resolved` also reads the DNS configuration data in `/etc/resolv.conf`, for compatibility. However, by passing the DNS configuration directly to `systemd-resolved` via the bus a couple of benefits are available:
1. `systemd-resolved` maintains DNS configuration per-interface, instead of simply system-wide, and is capable of sending DNS requests to servers on multiple different network interfaces simultaneously, returning the first positive response (or if all fail, the last negative one). This allows effective "merging" of DNS views on different interfaces, which makes private DNS zones on multi-homed hosts a lot nicer to use. For example, if you are connected to a LAN and a VPN, and both have private DNS zones, then you will be able to resolve both, as long as they don't clash in names. By using the bus API to configure DNS settings, the per-interface configuration is opened up.
2. Per-link configuration of DNSSEC is available. This is particularly interesting for network configuration managers that implement captive portal detection: as long as a verified connection to the Internet is not found DNSSEC should be turned off (as some captive portal systems alter the DNS in order to redirect clients to their internal pages).
3. Per-link configuration of LLMNR and MulticastDNS is available.
4. In contrast to changes to `/etc/resolv.conf` all changes made via the bus take effect immediately for all future lookups.
5. Statistical data about executed DNS transactions is available, as well as information about whether DNSSEC is supported on the chosen DNS server.
Note that `systemd-networkd` is already hooked up with `systemd-resolved`, exposing this functionality in full.
## Suggested Mode of Operation
Whenever a network configuration manager sets up an interface for operation, it should pass the DNS configuration information for the interface to `systemd-resolved`. It's recommended to do that after the Linux network interface index ("ifindex") has been allocated, but before the interface has beeen upped (i.e. `IFF_UP` turned on). That way, `systemd-resolved` will be able to use the configuration the moment the network interface is available. (Note that `systemd-resolved` watches the kernel interfaces come and go, and will make use of them as soon as they are suitable to be used, which among other factors requires `IFF_UP` to be set). That said it is OK to change DNS configuration dynamically any time: simply pass the new data to resolved, and it is happy to use it.
In order to pass the DNS configuration information to resolved, use the following methods of the `org.freedesktop.resolve1.Manager` interface of the `/org/freedesktop/resolve1` object, on the `org.freedesktop.resolve1` service:
1. To set the DNS server IP addresses for a network interface, use `SetLinkDNS()`
2. To set DNS search and routing domains for a network interface, use `SetLinkDomains()`
3. To configure the DNSSEC mode for a network interface, use `SetLinkDNSSEC()`
4. To configure DNSSEC Negative Trust Anchors (NTAs, i.e. domains for which not to do DNSSEC validation), use `SetLinkDNSSECNegativeTrustAnchors()`
5. To configure the LLMNR and MulticastDNS mode, use `SetLinkLLMNR()` and `SetLinkMulticastDNS()`
For details about these calls see the [full resolved bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/).
The calls should be pretty obvious to use: they simply take an interface index and the parameters to set. IP addresses are encoded as an address family specifier (an integer, that takes the usual `AF_INET` and `AF_INET6` constants), followed by a 4 or 16 byte array with the address in network byte order.
`systemd-resolved` distinguishes between "search" and "routing" domains. Routing domains are used to route DNS requests of specific domains to particular interfaces. i.e. requests for a hostname `foo.bar.com` will be routed to any interface that has `bar.com` as routing domain. The same routing domain may be defined on multiple interfaces, in which case the request is routed to all of them in parallel. Resolver requests for hostnames that do not end in any defined routing domain of any interface will be routed to all suitable interfaces. Search domains work like routing domain, but are also used to qualify single-label domain names. They hence are identical to the traditional search domain logic on UNIX. The `SetLinkDomains()` call may used to define both search and routing domains.
The most basic support of `systemd-resolved` in a network configuration manager would be to simply invoke `SetLinkDNS()` and `SetLinkDomains()` for the specific interface index with the data traditionally written to `/etc/resolv.conf`. More advanced integration could mean the network configuration manager also makes the DNSSEC mode, the DNSSEC NTAs and the LLMNR/MulticastDNS modes available for configuration.
It is strongly recommended for network configuration managers that implement captive portal detection to turn off DNSSEC validation during the detection phase, so that captive portals that modify DNS do not result in all DNSSEC look-ups to fail.
If a network configuration manager wants to reset specific settings to the defaults (such as the DNSSEC, LLMNR or MulticastDNS mode), it may simply call the function with an empty argument. To reset all per-link changes it made it may call `RevertLink()`.
To read back the various settings made, use `GetLink()` to get a `org.freedesktop.resolve1.Link` object for a specific network interface. It exposes the current settings in its bus properties. See the [full bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/) for details on this.
In order to translate a network interface name to an interface index, use the usual glibc `if_nametoindex()` call.
If the network configuration UI shall expose information about whether the selected DNS server supports DNSSEC, check the `DNSSECSupported` on the link object.
Note that it is fully OK if multiple different daemons push DNS configuration data into `systemd-resolved` as long as they do this only for the network interfaces they own and manage.
## Handling of `/etc/resolv.conf`
`systemd-resolved` receives DNS configuration from a number of sources, via the bus, as well as directly from `systemd-networkd` or user configuration. It uses this data to write a file that is compatible with the traditional Linux `/etc/resolv.conf` file. This file is stored in `/run/systemd/resolve/resolv.conf`. It is recommended to symlink `/etc/resolv.conf` to this file, in order to provide compatibility with programs reading the file directly and not going via the NSS and thus `systemd-resolved`.
For network configuration managers it is recommended to rely on this resolved-provided mechanism to update `resolv.conf`. Specifically, the network configuration manager should stop modifying `/etc/resolv.conf` directly if it notices it being a symlink to `/run/systemd/resolve/resolv.conf`.
If a system configuration manager desires to be compatible both with systems that use `systemd-resolved` and those which do not, it is recommended to first push any discovered DNS configuration into `systemd-resolved`, and deal gracefully with `systemd-resolved` not being available on the bus. If `/etc/resolv.conf` is a not a symlink to `/run/systemd/resolve/resolv.conf` the manager may then proceed and also update `/etc/resolv.conf`. With this mode of operation optimal compatibility is provided, as `systemd-resolved` is used for `/etc/resolv.conf` management when this is configured, but transparent compatibility with non-`systemd-resolved` systems is maintained. Note that `systemd-resolved` is part of systemd, and hence likely to be pretty universally available on Linux systems soon.
By allowing `systemd-resolved` to manage `/etc/resolv.conf` ownership issues regarding different programs overwriting each other's DNS configuration are effectively removed.

View File

@ -0,0 +1,290 @@
---
title: Writing Resolver Clients
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Writing Resolver Clients
_Or: How to look up hostnames and arbitrary DNS Resource Records via \_systemd-resolved_'s bus APIs\_
_(This is a longer explanation how to use some parts of \_systemd-resolved_ bus API. If you are just looking for an API reference, consult the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/) instead.)\_
_systemd-resolved_ provides a set of APIs on the bus for resolving DNS resource records. These are:
1. _ResolveHostname()_ for resolving hostnames to acquire their IP addresses
2. _ResolveAddress()_ for the reverse operation: acquire the hostname for an IP address
3. _ResolveService()_ for resolving a DNS-SD or SRV service
4. _ResolveRecord()_ for resolving arbitrary resource records.
Below you'll find examples for two of these calls, to show how to use them. Note that glibc offers similar (and more portable) calls in _getaddrinfo()_, _getnameinfo()_ and _res_query()_. Of these _getaddrinfo()_ and _getnameinfo()_ are directed to the calls above via the _nss-resolve_ NSS module, but _req_query()_ is not. There are a number of reasons why it might be preferable to invoke _systemd-resolved_'s bus calls rather than the glibc APIs:
1. Bus APIs are naturally asynchronous, which the glibc APIs generally are not.
2. The bus calls above pass back substantially more information about the resolved data, including where and how the data was found (i.e. which protocol was used: DNS, LLMNR, MulticastDNS, and on which network interface), and most importantly, whether the data could be authenticated via DNSSEC. This in particular makes these APIs useful for retrieving certificate data from the DNS, in order to implement DANE, SSHFP, OPENGPGKEY and IPSECKEY clients.
3. _ResolveService()_ knows no counterpart in glibc, and has the benefit of being a single call that collects all data necessary to connect to a DNS-SD or pure SRV service in one step.
4. _ResolveRecord()_ in contrast to _res_query()_ supports LLMNR and MulticastDNS as protocols on top of DNS, and makes use of _systemd-resolved_'s local DNS record cache. The processing of the request is done in the sandboxed _systemd-resolved_ process rather than in the local process, and all packets are pre-validated. Because this relies on _systemd-resolved_ the per-interface DNS zone handling is supported.
Of course, by using _systemd-resolved_ you lose some portability, but this could be handled via an automatic fallback to the glibc counterparts.
Note that the various resolver calls provided by _systemd-resolved_ will consult _/etc/hosts_ and synthesize resource records for these entries in order to ensure that this file is honoured fully.
The examples below use the _sd-bus_ D-Bus client implementation, which is part of _libsystemd_. Any other D-Bus library, including the original _libdbus_ or _GDBus_ may be used too.
## Resolving a Hostname
To resolve a hostname use the _ResolveHostname()_ call. For details on the function parameters see the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/).
This example specifies _AF_UNSPEC_ as address family for the requested address. This means both an _AF_INET_ (A) and an _AF_INET6_ (AAAA) record is looked for, depending on whether the local system has configured IPv4 and/or IPv6 connectivity. It is generally recommended to request _AF_UNSPEC_ addresses for best compatibility with both protocols, in particular on dual-stack systems.
The example specifies a network interface index of "0", i.e. does not specify any at all, so that the request may be done on any. Note that the interface index is primarily relevant for LLMNR and MulticastDNS lookups, which distinguish different scopes for each network interface index.
This examples makes no use of either the input flags parameter, nor the output flags parameter. See the _ResolveRecord()_ example below for information how to make use of the _SD_RESOLVED_AUTHENTICATED_ bit in the returned flags paramter.
```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <systemd/sd-bus.h>
int main(int argc, char*argv[]) {
sd_bus_error error = SD_BUS_ERROR_NULL;
sd_bus_message *reply = NULL;
const char *canonical;
sd_bus *bus = NULL;
uint64_t flags;
int r;
r = sd_bus_open_system(&bus);
if (r < 0) {
fprintf(stderr, "Failed to open system bus: %s\n", strerror(-r));
goto finish;
}
r = sd_bus_call_method(bus,
"org.freedesktop.resolve1",
"/org/freedesktop/resolve1",
"org.freedesktop.resolve1.Manager",
"ResolveHostname",
&error,
&reply,
"isit",
0, /* Network interface index where to look (0 means any) */
argc >= 2 ? argv[1] : "www.github.com", /* Hostname */
AF_UNSPEC, /* Which address family to look for */
UINT64_C(0)); /* Input flags parameter */
if (r < 0) {
fprintf(stderr, "Failed to resolve hostnme: %s\n", error.message);
sd_bus_error_free(&error);
goto finish;
}
r = sd_bus_message_enter_container(reply, 'a', "(iiay)");
if (r < 0)
goto parse_failure;
for (;;) {
char buf[INET6_ADDRSTRLEN];
int ifindex, family;
const void *data;
size_t length;
r = sd_bus_message_enter_container(reply, 'r', "iiay");
if (r < 0)
goto parse_failure;
if (r == 0) /* Reached end of array */
break;
r = sd_bus_message_read(reply, "ii", &ifindex, &family);
if (r < 0)
goto parse_failure;
r = sd_bus_message_read_array(reply, 'y', &data, &length);
if (r < 0)
goto parse_failure;
r = sd_bus_message_exit_container(reply);
if (r < 0)
goto parse_failure;
printf("Found IP address %s on interface %i.\n", inet_ntop(family, data, buf, sizeof(buf)), ifindex);
}
r = sd_bus_message_exit_container(reply);
if (r < 0)
goto parse_failure;
r = sd_bus_message_read(reply, "st", &canonical, &flags);
if (r < 0)
goto parse_failure;
printf("Canonical name is %s\n", canonical);
goto finish;
parse_failure:
fprintf(stderr, "Parse failure: %s\n", strerror(-r));
finish:
sd_bus_message_unref(reply);
sd_bus_flush_close_unref(bus);
return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
```
Compile this with a command line like the following (under the assumption you save the sources above as `addrtest.c`):
```
gcc addrtest.c -o addrtest -Wall `pkg-config --cflags --libs libsystemd`
```
## Resolving an Abitrary DNS Resource Record
Use `ResolveRecord()` in order to resolve arbitrary resource records. The call will return the binary RRset data. This calls is useful to acquire resource records for which no high-level calls such as ResolveHostname(), ResolveAddress() and ResolveService() exist. In particular RRs such as MX, SSHFP, TLSA, CERT, OPENPGPKEY or IPSECKEY may be requested via this API.
This example also shows how to determine whether the acquired data has been authenticated via DNSSEC (or another means) by checking the `SD_RESOLVED_AUTHENTICATED` bit in the
returned `flags` parameter.
This example contains a simple MX record parser. Note that the data comes pre-validated from `systemd-resolved`, hence we allow the example to parse the record slightly sloppily, to keep the example brief. For details on the MX RR binary format, see [RFC 1035](https://www.rfc-editor.org/rfc/rfc1035.txt).
For details on the function parameters see the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/).
```
#include <assert.h>
#include <endian.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <systemd/sd-bus.h>
#define DNS_CLASS_IN 1U
#define DNS_TYPE_MX 15U
#define SD_RESOLVED_AUTHENTICATED (UINT64_C(1) << 9)
static const uint8_t* print_name(const uint8_t* p) {
bool dot = false;
for (;;) {
if (*p == 0)
return p + 1;
if (dot)
putchar('.');
else
dot = true;
printf("%.*s", (int) *p, (const char*) p + 1);
p += *p + 1;
}
}
static void process_mx(const void *rr, size_t sz) {
uint16_t class, type, rdlength, preference;
const uint8_t *p = rr;
uint32_t ttl;
fputs("Found MX: ", stdout);
p = print_name(p);
memcpy(&type, p, sizeof(uint16_t));
p += sizeof(uint16_t);
memcpy(&class, p, sizeof(uint16_t));
p += sizeof(uint16_t);
memcpy(&ttl, p, sizeof(uint32_t));
p += sizeof(uint32_t);
memcpy(&rdlength, p, sizeof(uint16_t));
p += sizeof(uint16_t);
memcpy(&preference, p, sizeof(uint16_t));
p += sizeof(uint16_t);
assert(be16toh(type) == DNS_TYPE_MX);
assert(be16toh(class) == DNS_CLASS_IN);
printf(" preference=%u ", be16toh(preference));
p = print_name(p);
putchar('\n');
assert(p == (const uint8_t*) rr + sz);
}
int main(int argc, char*argv[]) {
sd_bus_error error = SD_BUS_ERROR_NULL;
sd_bus_message *reply = NULL;
sd_bus *bus = NULL;
uint64_t flags;
int r;
r = sd_bus_open_system(&bus);
if (r < 0) {
fprintf(stderr, "Failed to open system bus: %s\n", strerror(-r));
goto finish;
}
r = sd_bus_call_method(bus,
"org.freedesktop.resolve1",
"/org/freedesktop/resolve1",
"org.freedesktop.resolve1.Manager",
"ResolveRecord",
&error,
&reply,
"isqqt",
0, /* Network interface index where to look (0 means any) */
argc >= 2 ? argv[1] : "gmail.com", /* Domain name */
DNS_CLASS_IN, /* DNS RR class */
DNS_TYPE_MX, /* DNS RR type */
UINT64_C(0)); /* Input flags parameter */
if (r < 0) {
fprintf(stderr, "Failed to resolve record: %s\n", error.message);
sd_bus_error_free(&error);
goto finish;
}
r = sd_bus_message_enter_container(reply, 'a', "(iqqay)");
if (r < 0)
goto parse_failure;
for (;;) {
uint16_t class, type;
const void *data;
size_t length;
int ifindex;
r = sd_bus_message_enter_container(reply, 'r', "iqqay");
if (r < 0)
goto parse_failure;
if (r == 0) /* Reached end of array */
break;
r = sd_bus_message_read(reply, "iqq", &ifindex, &class, &type);
if (r < 0)
goto parse_failure;
r = sd_bus_message_read_array(reply, 'y', &data, &length);
if (r < 0)
goto parse_failure;
r = sd_bus_message_exit_container(reply);
if (r < 0)
goto parse_failure;
process_mx(data, length);
}
r = sd_bus_message_exit_container(reply);
if (r < 0)
goto parse_failure;
r = sd_bus_message_read(reply, "t", &flags);
if (r < 0)
goto parse_failure;
printf("Response is authenticated: %s\n", flags & SD_RESOLVED_AUTHENTICATED ? "yes" : "no");
goto finish;
parse_failure:
fprintf(stderr, "Parse failure: %s\n", strerror(-r));
finish:
sd_bus_message_unref(reply);
sd_bus_flush_close_unref(bus);
return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
```
Compile this with a command line like the following (under the assumption you save the sources above as `rrtest.c`):
```
gcc rrtest.c -o rrtest -Wall `pkg-config --cflags --libs libsystemd`
```

View File

@ -0,0 +1,29 @@
---
title: Writing VM and Container Managers
category: Documentation for Developers
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Writing VM and Container Managers
_Or: How to hook up your favorite VM or container manager with systemd_
Nomenclature: a _Virtual Machine_ shall refer to a system running on virtualized hardware consisting of a full OS with its own kernel. A _Container_ shall refer to a system running on the same shared kernel of the host, but running a mostly complete OS with its own init system. Both kinds of virtualized systems shall collectively be called "machines".
systemd provides a number of integration points with virtual machine and container managers, such as libvirt, LXC or systemd-nspawn. On one hand there are integration points of the VM/container manager towards the host OS it is running on, and on the other there integration points for container managers towards the guest OS it is managing.
Note that this document does not cover lightweight containers for the purpose of application sandboxes, i.e. containers that do _not_ run a init system of their own.
## Host OS Integration
All virtual machines and containers should be registered with the [machined](http://www.freedesktop.org/wiki/Software/systemd/machined) mini service that is part of systemd. This provides integration into the core OS at various points. For example, tools like ps, cgls, gnome-system-manager use this registration information to show machine information for running processes, as each of the VM's/container's processes can reliably attributed to a registered machine. The various systemd tools (like systemctl, journalctl, loginctl, systemd-run, ...) all support a -M switch that operates on machines registered with machined. "machinectl" may be used to execute operations on any such machine. When a machine is registered via machined its processes will automatically be placed in a systemd scope unit (that is located in the machines.slice slice) and thus appear in "systemctl" and similar commands. The scope unit name is based on the machine meta information passed to machined at registration.
For more details on the APIs provided by machine consult [the bus API interface documentation](http://www.freedesktop.org/wiki/Software/systemd/machined).
## Guest OS Integration
As container virtualization is much less comprehensive, and the guest is less isolated from the host, there are a number of interfaces defined how the container manager can set up the environment for systemd running inside a container. These Interfaces are documented in [Container Interface of systemd](http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface).
VM virtualization is more comprehensive and fewer integration APIs are available. In fact there's only one: a VM manager may initialize the SMBIOS DMI field "Product UUUID" to a UUID uniquely identifying this virtual machine instance. This is read in the guest via /sys/class/dmi/id/product_uuid, and used as configuration source for /etc/machine-id if in the guest, if that file is not initialized yet. Note that this is currently only supported for kvm hosts, but may be extended to other managers as well.

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.7 KiB