2020-01-14 01:33:18 +08:00
|
|
|
---
|
|
|
|
title: Storage Daemons for the Root File System
|
|
|
|
category: Interfaces
|
|
|
|
layout: default
|
|
|
|
---
|
|
|
|
|
|
|
|
# systemd and Storage Daemons for the Root File System
|
|
|
|
|
|
|
|
a.k.a. _Pax Cellae pro Radix Arbor_
|
|
|
|
|
|
|
|
(or something like that, my Latin is a bit rusty)
|
|
|
|
|
|
|
|
A number of complex storage technologies on Linux (e.g. RAID, volume
|
|
|
|
management, networked storage) require user space services to run while the
|
|
|
|
storage is active and mountable. This requirement becomes tricky as soon as the
|
|
|
|
root file system of the Linux operating system is stored on such storage
|
|
|
|
technology. Previously no clear path to make this work was available. This text
|
|
|
|
tries to clear up the resulting confusion, and what is now supported and what
|
|
|
|
is not.
|
|
|
|
|
|
|
|
## A Bit of Background
|
|
|
|
|
|
|
|
When complex storage technologies are used as backing for the root file system
|
|
|
|
this needs to be set up by the initial RAM file system (initrd), i.e. on Fedora
|
|
|
|
by Dracut. In newer systemd versions tear-down of the root file system backing
|
|
|
|
is also done by the initrd: after terminating all remaining running processes
|
|
|
|
and unmounting all file systems it can (which means excluding the root fs)
|
|
|
|
systemd will jump back into the initrd code allowing it to unmount the final
|
|
|
|
file systems (and its storage backing) that could not be unmounted as long as
|
|
|
|
the OS was still running from the main root file system. The initrd' job is to
|
|
|
|
detach/unmount the root fs, i.e. inverting the exact commands it used to set
|
|
|
|
them up in the first place. This is not only cleaner, but also allows for the
|
|
|
|
first time arbitrary complex stacks of storage technology.
|
|
|
|
|
|
|
|
Previous attempts to handle root file system setups with complex storage as
|
|
|
|
backing usually tried to maintain the root storage with program code stored on
|
|
|
|
the root storage itself, thus creating a number of dependency loops. Safely
|
|
|
|
detaching such a root file system becomes messy, since the program code on the
|
|
|
|
storage needs to stay around longer than the storage, which is technically
|
|
|
|
contradicting.
|
|
|
|
|
|
|
|
|
|
|
|
## What's new?
|
|
|
|
|
|
|
|
As a result, we hereby clarify that we do not support storage technology setups
|
|
|
|
where the storage daemons are being run from the storage it maintains
|
|
|
|
itself. In other words: a storage daemon backing the root file system cannot be
|
|
|
|
stored on the root file system itself.
|
|
|
|
|
|
|
|
What we do support instead is that these storage daemons are started from the
|
|
|
|
initrd, stay running all the time during normal operation and are terminated
|
|
|
|
only after we returned control back to the initrd and by the initrd. As such,
|
|
|
|
storage daemons involved with maintaining the root file system storage
|
|
|
|
conceptually are more like kernel threads than like normal system services:
|
|
|
|
from the perspective of the init system (i.e. systemd) these services have been
|
|
|
|
started before systemd got initialized and stay around until after systemd is
|
|
|
|
already gone. These daemons can only be updated by updating the initrd and
|
|
|
|
rebooting, a takeover from initrd-supplied services to replacements from the
|
|
|
|
root file system is not supported.
|
|
|
|
|
|
|
|
|
|
|
|
## What does this mean?
|
|
|
|
|
|
|
|
Near the end of system shutdown, systemd executes a small tool called
|
|
|
|
systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as
|
|
|
|
it entirely replaces the systemd init process) then iterates through the
|
|
|
|
mounted file systems and running processes (as well as a couple of other
|
|
|
|
resources) and tries to unmount/read-only mount/detach/kill them. It continues
|
|
|
|
to do this in a tight loop as long as this results in any effect. From this
|
|
|
|
killing spree a couple of processes are automatically excluded: PID 1 itself of
|
|
|
|
course, as well as all kernel threads. After the killing/unmounting spree
|
|
|
|
control is passed back to the initrd, whose job is then to unmount/detach
|
|
|
|
whatever might be remaining.
|
|
|
|
|
|
|
|
The same killing spree logic (but not the unmount/detach/read-only logic) is
|
|
|
|
applied during the transition from the initrd to the main system (i.e. the
|
|
|
|
"`switch_root`" operation), so that no processes from the initrd survive to the
|
|
|
|
main system.
|
|
|
|
|
|
|
|
To implement the supported logic proposed above (i.e. where storage daemons
|
|
|
|
needed for the root fs which are started by the initrd stay around during
|
|
|
|
normal operation and are only killed after control is passed back to the
|
|
|
|
initrd) we need to exclude these daemons from the shutdown/switch_root killing
|
|
|
|
spree. To accomplish this the following logic is available starting with
|
|
|
|
systemd 38:
|
|
|
|
|
|
|
|
Processes (run by the root user) whose first character of the zeroth command
|
|
|
|
line argument is `@` are excluded from the killing spree, much the same way as
|
|
|
|
kernel threads are excluded too. Thus, a daemon which wants to take advantage
|
|
|
|
of this logic needs to place the following at the top of its main() function:
|
|
|
|
|
|
|
|
```c
|
2020-01-14 04:03:15 +08:00
|
|
|
...
|
2020-01-14 01:33:18 +08:00
|
|
|
[0][0] = '@';
|
2020-01-14 04:03:15 +08:00
|
|
|
...
|
2020-01-14 01:33:18 +08:00
|
|
|
```
|
|
|
|
|
|
|
|
And that's already it. Note that this functionality is only to be used by
|
|
|
|
programs running from the initrd, and **not** for programs running from the
|
|
|
|
root file system itself. Programs which use this functionality and are running
|
|
|
|
from the root file system are considered buggy since they effectively prohibit
|
|
|
|
clean unmounting/detaching of the root file system and its backing storage.
|
|
|
|
|
|
|
|
_Again: if your code is being run from the root file system, then this logic
|
|
|
|
suggested above is **NOT** for you. Sorry. Talk to us, we can probably help you
|
|
|
|
to find a different solution to your problem._
|
|
|
|
|
|
|
|
The recommended way to distinguish between run-from-initrd and run-from-rootfs
|
|
|
|
for a daemon is to check for `/etc/initrd-release` (which exists on all modern
|
|
|
|
initrd implementations, see the [initrd
|
|
|
|
Interface](http://www.freedesktop.org/wiki/Software/systemd/InitrdInterface)
|
|
|
|
for details) which when exists results in `argv[0][0]` being set to `@`, and
|
|
|
|
otherwise doesn't. Something like this:
|
|
|
|
|
|
|
|
```c
|
|
|
|
#include <unistd.h>
|
|
|
|
|
|
|
|
int main(int argc, char *argv[]) {
|
2020-01-14 04:03:15 +08:00
|
|
|
...
|
2020-01-14 01:33:18 +08:00
|
|
|
if (access("/etc/initrd-release", F_OK) >= 0)
|
|
|
|
argv[0][0] = '@';
|
2020-01-14 04:03:15 +08:00
|
|
|
...
|
2020-01-14 01:33:18 +08:00
|
|
|
}
|
|
|
|
```
|
|
|
|
|
|
|
|
Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without
|
|
|
|
precedent: traditionally Unix login shells set `argv[0][0]` to `-` to clarify
|
|
|
|
they are login shells. This logic is also very easy to implement. We have been
|
|
|
|
looking for other ways to mark processes for exclusion from the killing spree,
|
|
|
|
but could not find any that was equally simple to implement and quick to read
|
|
|
|
when traversing through `/proc/`. Also, as a side effect replacing the first
|
|
|
|
character of `argv[0]` with `@` also visually invalidates the path normally
|
|
|
|
stored in `argv[0]` (which usually starts with `/`) thus helping the
|
|
|
|
administrator to understand that your daemon is actually not originating from
|
|
|
|
the actual root file system, but from a path in a completely different
|
|
|
|
namespace (i.e. the initrd namespace). Other than that we just think that `@`
|
|
|
|
is a cool character which looks pretty in the ps output... 😎
|
|
|
|
|
|
|
|
Note that your code should only modify `argv[0][0]` and leave the comm name
|
|
|
|
(i.e. `/proc/self/comm`) of your process untouched.
|
|
|
|
|
|
|
|
## To which technologies does this apply?
|
|
|
|
|
|
|
|
These recommendations apply to those storage daemons which need to stay around
|
|
|
|
until after the storage they maintain is unmounted. If your storage daemon is
|
|
|
|
fine with being shut down before its storage device is unmounted you may ignore
|
|
|
|
the recommendations above.
|
|
|
|
|
|
|
|
This all applies to storage technology only, not to daemons with any other
|
|
|
|
(non-storage related) purposes.
|
|
|
|
|
|
|
|
## What else to keep in mind?
|
|
|
|
|
|
|
|
If your daemon implements the logic pointed out above it should work nicely
|
|
|
|
from initrd environments. In many cases it might be necessary to additionally
|
|
|
|
support storage daemons to be started from within the actual OS, for example
|
|
|
|
when complex storage setups are used for auxiliary file systems, i.e. not the
|
|
|
|
root file system, or created by the administrator during runtime. Here are a
|
|
|
|
few additional notes for supporting these setups:
|
|
|
|
|
|
|
|
* If your storage daemon is run from the main OS (i.e. not the initrd) it will
|
|
|
|
also be terminated when the OS shuts down (i.e. before we pass control back
|
|
|
|
to the initrd). Your daemon needs to handle this properly.
|
|
|
|
|
|
|
|
* It is not acceptable to spawn off background processes transparently from
|
|
|
|
user commands or udev rules. Whenever a process is forked off on Unix it
|
|
|
|
inherits a multitude of process attributes (ranging from the obvious to the
|
|
|
|
not-so-obvious such as security contexts or audit trails) from its parent
|
|
|
|
process. It is practically impossible to fully detach a service from the
|
|
|
|
process context of the spawning process. In particular, systemd tracks which
|
|
|
|
processes belong to a service or login sessions very closely, and by spawning
|
|
|
|
off your storage daemon from udev or an administrator command you thus make
|
|
|
|
it part of its service/login. Effectively this means that whenever udev is
|
|
|
|
shut down, your storage daemon is killed too, resp. whenever the login
|
|
|
|
session goes away your storage might be terminated as well. (Also note that
|
|
|
|
recent udev versions will automatically kill all long running background
|
|
|
|
processes forked off udev rules now.) So, in summary: double-forking off
|
|
|
|
processes from user commands or udev rules is **NOT** OK!
|
|
|
|
|
|
|
|
* To automatically spawn storage daemons from udev rules or administrator
|
|
|
|
commands, the recommended technology is socket-based activation as
|
|
|
|
implemented by systemd. Transparently for your client code connecting to the
|
|
|
|
socket of your storage daemon will result in the storage to be started. For
|
|
|
|
that it is simply necessary to inform systemd about the socket you'd like it
|
|
|
|
to listen on on behalf of your daemon and minimally modify the daemon to
|
|
|
|
receive the listening socket for its services from systemd instead of
|
|
|
|
creating it on its own. Such modifications can be minimal, and are easily
|
|
|
|
written in a way that does not negatively impact usability on non-systemd
|
|
|
|
systems. For more information on making use of socket activation in your
|
|
|
|
program consult this blog story: [Socket
|
|
|
|
Activation](http://0pointer.de/blog/projects/socket-activation.html)
|
|
|
|
|
2020-01-14 04:03:15 +08:00
|
|
|
* Consider having a look at the [initrd Interface of systemd](https://systemd.io/INITRD_INTERFACE/).
|