Commit Graph

27201 Commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek
9aa8fa701b test-bus-creds: are more debugging info
This test sometimes fails in semaphore, but not when run interactively,
so it's hard to debug.
2016-09-26 22:22:28 +02:00
Keith Busch
b4c6f71b82 udev/path_id: introduce support for NVMe devices (#4169)
This appends the nvme name and namespace identifier attribute the the
PCI path for by-path links. Symlinks like the following are now present:

lrwxrwxrwx. 1 root root 13 Sep 16 12:12 pci-0000:01:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Sep 16 12:12 pci-0000:01:00.0-nvme-1-part1 -> ../../nvme0n1p1

Cc: Michal Sekletar <sekletar.m@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2016-09-26 21:01:07 +02:00
Paweł Szewczyk
00bb64ecfa core: Fix USB functionfs activation and clarify its documentation (#4188)
There was no certainty about how the path in service file should look
like for usb functionfs activation. Because of this it was treated
differently in different places, which made this feature unusable.

This patch fixes the path to be the *mount directory* of functionfs, not
ep0 file path and clarifies in the documentation that ListenUSBFunction should be
the location of functionfs mount point, not ep0 file itself.
2016-09-26 18:45:47 +02:00
Zbigniew Jędrzejewski-Szmek
bc3bb330b8 machinectl: prefer user@ to --uid=user for shell (#4006)
It seems to me that the explicit positional argument should have higher
priority than "an option".
2016-09-26 11:45:31 -04:00
HATAYAMA Daisuke
eeb084806b journald,ratelimit: fix wrong calculation of burst_modulate() (#4218)
This patch fixes wrong calculation of burst_modulate(), which now calculates
the values smaller than really expected ones if available disk space is
strictly more than 1MB.

In particular, if available disk space is strictly more than 1MB and strictly
less than 16MB, the resulted value becomes smaller than its original one.

>>> (math.log2(1*1024**2)-16) / 4
1.0
>>> (math.log2(16*1024**2)-16) / 4
2.0
>>> (math.log2(256*1024**2)-16) / 4
3.0
→ This matches the comment in the function.
2016-09-26 11:36:20 -04:00
Matej Habrnal
a5ca3649d3 coredump: initialize coredump_size in submit_coredump() (#4219)
If ulimit is smaller than page_size(), function save_external_coredump()
returns -EBADSLT and this causes skipping whole core dumping part in
submit_coredump(). Initializing coredump_size to UINT64_MAX prevents
evaluating a condition with uninitialized varialbe which leads to
calling allocate_journal_field() with coredump_fd = -1 which causes
aborting.

Signed-off-by: Matej Habrnal <mhabrnal@redhat.com>
2016-09-26 11:28:58 -04:00
Torstein Husebø
d23a0044a3 treewide: fix typos (#4217) 2016-09-26 11:32:47 +02:00
Djalal Harouni
615a1f4b26 test: add CAP_MKNOD tests for PrivateDevices= 2016-09-25 13:04:30 +02:00
Djalal Harouni
8f81a5f61b core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= is set
Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
2016-09-25 12:52:27 +02:00
Djalal Harouni
b6c432ca7e core:namespace: simplify ProtectHome= implementation
As with previous patch simplify ProtectHome and don't care about
duplicates, they will be sorted by most restrictive mode and cleaned.
2016-09-25 12:41:16 +02:00
Djalal Harouni
f471b2afa1 core: simplify ProtectSystem= implementation
ProtectSystem= with all its different modes and other options like
PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal,
however currently it's a bit hard to parse that from the implementation
view. Simplify it by giving each mode its own table with all paths and
references to other Protect options.

With this change some entries are duplicated, but we do not care since
duplicate mounts are first sorted by the most restrictive mode then
cleaned.
2016-09-25 12:21:25 +02:00
Djalal Harouni
49accde7bd core:sandbox: add more /proc/* entries to ProtectKernelTunables=
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.

Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
2016-09-25 11:30:11 +02:00
Djalal Harouni
9221aec8d0 doc: explicitly document that /dev/mem and /dev/port are blocked by PrivateDevices=true 2016-09-25 11:25:44 +02:00
Djalal Harouni
e778185bb5 doc: documentation fixes for ReadWritePaths= and ProtectKernelTunables=
Documentation fixes for ReadWritePaths= and ProtectKernelTunables=
as reported by Evgeny Vereshchagin.
2016-09-25 11:25:31 +02:00
Djalal Harouni
2652c6c103 core:namespace: simplify mount calculation
Move out mount calculation on its own function. Actually the logic is
smart enough to later drop nop and duplicates mounts, this change
improves code readability.
---
 src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)
2016-09-25 11:25:00 +02:00
Djalal Harouni
11a30cec2a core:namespace: put paths protected by ProtectKernelTunables= in
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
2016-09-25 11:16:44 +02:00
Djalal Harouni
9c94d52e09 core:namespace: minor improvements to append_mounts() 2016-09-25 11:03:21 +02:00
Lennart Poettering
cefc33aee2 execute: move SMACK setup code into its own function
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the
main execution logic a bit.
2016-09-25 10:52:57 +02:00
Lennart Poettering
cd2902c954 namespace: drop all mounts outside of the new root directory
There's no point in mounting these, if they are outside of the root directory
we'll move to.
2016-09-25 10:52:57 +02:00
Lennart Poettering
54500613a4 main: minor simplification 2016-09-25 10:52:57 +02:00
Lennart Poettering
0439746492 Update TODO 2016-09-25 10:52:57 +02:00
Lennart Poettering
ba128bb809 execute: filter low-level I/O syscalls if PrivateDevices= is set
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
2016-09-25 10:52:57 +02:00
Lennart Poettering
1ecdba149b NEWS: update news about systemd-udevd.service 2016-09-25 10:52:57 +02:00
Lennart Poettering
0c28d51ac8 units: further lock down our long-running services
Let's make this an excercise in dogfooding: let's turn on more security
features for all our long-running services.

Specifically:

- Turn on RestrictRealtime=yes for all of them

- Turn on ProtectKernelTunables=yes and ProtectControlGroups=yes for most of
  them

- Turn on RestrictAddressFamilies= for all of them, but different sets of
  address families for each

Also, always order settings in the unit files, that the various sandboxing
features are close together.

Add a couple of missing, older settings for a numbre of unit files.

Note that this change turns off AF_INET/AF_INET6 from udevd, thus effectively
turning of networking from udev rule commands. Since this might break stuff
(that is already broken I'd argue) this is documented in NEWS.
2016-09-25 10:52:57 +02:00
Lennart Poettering
f6eb19a474 units: permit importd to mount stuff
Fixes #3996
2016-09-25 10:52:57 +02:00
Lennart Poettering
6757c06a1a man: shorten the exit status table a bit
Let's merge a couple of columns, to make the table a bit shorter. This
effectively just drops whitespace, not contents, but makes the currently
humungous table much much more compact.
2016-09-25 10:52:57 +02:00
Lennart Poettering
81c8aceed4 man: the exit code/signal is stored in $EXIT_CODE, not $EXIT_STATUS 2016-09-25 10:52:57 +02:00
Lennart Poettering
effbd6d2ea man: rework documentation for ReadOnlyPaths= and related settings
This reworks the documentation for ReadOnlyPaths=, ReadWritePaths=,
InaccessiblePaths=. It no longer claims that we'd follow symlinks relative to
the host file system. (Which wasn't true actually, as we didn't follow symlinks
at all in the most recent releases, and we know do follow them, but relative to
RootDirectory=).

This also replaces all references to the fact that all fs namespacing options
can be undone with enough privileges and disable propagation by a single one in
the documentation of ReadOnlyPaths= and friends, and then directs the read to
this in all other places.

Moreover a hint is added to the documentation of SystemCallFilter=, suggesting
usage of ~@mount in case any of the fs namespacing related options are used.
2016-09-25 10:42:18 +02:00
Lennart Poettering
b2656f1b1c man: in user-facing documentaiton don't reference C function names
Let's drop the reference to the cap_from_name() function in the documentation
for the capabilities setting, as it is hardly helpful. Our readers are not
necessarily C hackers knowing the semantics of cap_from_name(). Moreover, the
strings we accept are just the plain capability names as listed in
capabilities(7) hence there's really no point in confusing the user with
anything else.
2016-09-25 10:42:18 +02:00
Lennart Poettering
8f1ad200f0 namespace: don't make the root directory of a namespace a mount if it already is one
Let's not stack mounts needlessly.
2016-09-25 10:42:18 +02:00
Lennart Poettering
d944dc9553 namespace: chase symlinks for mounts to set up in userspace
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.

(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)

Fixes: #3867
2016-09-25 10:42:18 +02:00
Lennart Poettering
1e4e94c881 namespace: invoke unshare() only after checking all parameters
Let's create the new namespace only after we validated and processed all
parameters, right before we start with actually mounting things.

This way, the window where we can roll back is larger (not that it matters
IRL...)
2016-09-25 10:42:18 +02:00
Lennart Poettering
096424d123 execute: drop group priviliges only after setting up namespace
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
2016-09-25 10:42:18 +02:00
Lennart Poettering
920a7899de nspawn: let's mount /proc/sysrq-trigger read-only by default
LXC does this, and we should probably too. Better safe than sorry.
2016-09-25 10:42:18 +02:00
Lennart Poettering
63bb64a056 core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.

This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
2016-09-25 10:42:18 +02:00
Lennart Poettering
3f815163ff core: introduce ProtectSystem=strict
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.

In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.

While we are at, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for
b52a109ad3 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
2016-09-25 10:42:18 +02:00
Lennart Poettering
160cfdbed3 namespace: add some debug logging when enforcing InaccessiblePaths= 2016-09-25 10:42:18 +02:00
Lennart Poettering
6b7c9f8bce namespace: rework how ReadWritePaths= is applied
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.

This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.

With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.

This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.

This is not only the safer option, but also more in-line with what the man page
currently claims:

        "Entries (files or directories) listed in ReadWritePaths= are
        accessible from within the namespace with the same access rights as
        from outside."

To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.

A number of functions are updated to add more debug logging to make this more
digestable.
2016-09-25 10:40:51 +02:00
Lennart Poettering
7648a565d1 namespace: when enforcing fs namespace restrictions suppress redundant mounts
If /foo is marked to be read-only, and /foo/bar too, then the latter may be
suppressed as it has no effect.
2016-09-25 10:19:15 +02:00
Lennart Poettering
6ee1a919cf namespace: simplify mount_path_compare() a bit 2016-09-25 10:19:10 +02:00
Lennart Poettering
3fbe8dbe41 execute: if RuntimeDirectory= is set, it should be writable
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept
otherwise makes no sense.
2016-09-25 10:19:05 +02:00
Lennart Poettering
be39ccf3a0 execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
2016-09-25 10:18:57 +02:00
Lennart Poettering
07689d5d2c execute: split out creation of runtime dirs into its own functions 2016-09-25 10:18:54 +02:00
Lennart Poettering
fe3c2583be namespace: make sure InaccessibleDirectories= masks all mounts further down
If a dir is marked to be inaccessible then everything below it should be masked
by it.
2016-09-25 10:18:51 +02:00
Lennart Poettering
59eeb84ba6 core: add two new service settings ProtectKernelTunables= and ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
2016-09-25 10:18:48 +02:00
Lennart Poettering
72246c2a65 core: enforce seccomp for secondary archs too, for all rules
Let's make sure that all our rules apply to all archs the local kernel
supports.
2016-09-25 10:18:44 +02:00
Zbigniew Jędrzejewski-Szmek
e4662e553c journal-remote: fix error format string
Bug introduced in 1b4cd64683.
2016-09-24 21:46:48 -04:00
Zbigniew Jędrzejewski-Szmek
bd5b9f0a12 systemctl: suppress errors with "show" for nonexistent units and properties
Show is documented to be program-parseable, and printing the warning about
about a non-existent unit, while useful for humans, broke a lot of scripts.
Restore previous behaviour of returning success and printing empty or useless
stuff for units which do not exist, and printing empty values for properties
which do not exists.

With SYSTEMD_LOG_LEVEL=debug, hints are printed, but the return value is
still 0.

This undoes parts of e33a06a and 3dced37b7 and fixes #3856.

We might consider adding an explicit switch to fail on missing units/properties
(e.g. --ensure-exists or similar), and make -P foobar equivalent to
--ensure-exists --property=foobar.
2016-09-24 21:09:39 -04:00
Zbigniew Jędrzejewski-Szmek
1cf03a4f8e systemctl,networkctl,busctl,backlight: use STRPTR_IN_SET 2016-09-24 20:22:05 -04:00
Zbigniew Jędrzejewski-Szmek
c7bf9d5183 basic/strv: add STRPTR_IN_SET
Also some trivial tests for STR_IN_SET and STRPTR_IN_SET.
2016-09-24 20:13:28 -04:00