Back in 2018 we introduced support for killing the whole QEMU process
instead of just one thread, when a seccomp rule is violated:
commit bda08a5764
Author: Marc-André Lureau <marcandre.lureau@redhat.com>
Date: Wed Aug 22 19:02:48 2018 +0200
seccomp: prefer SCMP_ACT_KILL_PROCESS if available
Fast forward a year and we introduced a patch to avoid killing the
process for resource control syscalls tickled by Mesa.
commit 9a1565a03b
Author: Daniel P. Berrangé <berrange@redhat.com>
Date: Wed Mar 13 09:49:03 2019 +0000
seccomp: don't kill process for resource control syscalls
Unfortunately a logic bug effectively reverted the first commit
mentioned so that we go back to only killing the thread, not the whole
process.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
Most of the seccomp functions return errnos as a negative return
value. The code is currently ignoring these and reporting a generic
error message for all seccomp failure scenarios making debugging
painful. Report a more precise error from each failed call and include
errno if it is available.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
The Mesa library tries to set process affinity on some of its threads in
order to optimize its performance. Currently this results in QEMU being
immediately terminated when seccomp is enabled.
Mesa doesn't consider failure of the process affinity settings to be
fatal to its operation, but our seccomp policy gives it no choice in
gracefully handling this denial.
It is reasonable to consider that malicious code using the resource
control syscalls to be a less serious attack than if they were trying
to spawn processes or change UIDs and other such things. Generally
speaking changing the resource control setting will "merely" affect
quality of service of processes on the host. With this in mind, rather
than kill the process, we can relax the policy for these syscalls to
return the EPERM errno value. This allows callers to detect that QEMU
does not want them to change resource allocations, and apply some
reasonable fallback logic.
The main downside to this is for code which uses these syscalls but does
not check the return value, blindly assuming they will always
succeeed. Returning an errno could result in sub-optimal behaviour.
Arguably though such code is already broken & needs fixing regardless.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
We'd like to compile QEMU with -std=gnu99, but GCC 4.8 currently
fails to compile qemu-seccomp.c in this mode:
qemu-seccomp.c:45:1: error: initializer element is not constant
};
^
qemu-seccomp.c:45:1: error: (near initialization for ‘sched_setscheduler_arg[0]’)
This is due to a compiler bug which has just been fixed in GCC 5.0:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63567
Since we still want to support GCC 4.8 for a while and also want to use
gnu99 mode, work-around the issue by expanding the macro manually.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Calling error_report() in a function that takes an Error ** argument
is suspicious. parse_sandbox() does that, and then fails without
setting an error. Its caller main(), via qemu_opts_foreach(), is fine
with it, but clean it up anyway.
Cc: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
Message-Id: <20181017082702.5581-18-armbru@redhat.com>
Remove -sandbox option if the host is not capable of TSYNC, since the
sandbox will fail at setup time otherwise. This will help libvirt, for
ex, to figure out if -sandbox will work.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
When using "-seccomp on", the seccomp policy is only applied to the
main thread, the vcpu worker thread and other worker threads created
after seccomp policy is applied; the seccomp policy is not applied to
e.g. the RCU thread because it is created before the seccomp policy is
applied and SECCOMP_FILTER_FLAG_TSYNC isn't used.
This can be verified with
for task in /proc/`pidof qemu`/task/*; do cat $task/status | grep Secc ; done
Seccomp: 2
Seccomp: 0
Seccomp: 0
Seccomp: 2
Seccomp: 2
Seccomp: 2
Starting with libseccomp 2.2.0 and kernel >= 3.17, we can use
seccomp_attr_set(ctx, > SCMP_FLTATR_CTL_TSYNC, 1) to update the policy
on all threads.
libseccomp requirement was bumped to 2.2.0 in previous patch.
libseccomp should fail to set the filter if it can't honour
SCMP_FLTATR_CTL_TSYNC (untested), and thus -sandbox will now fail on
kernel < 3.17.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
The upcoming libseccomp release should have SCMP_ACT_KILL_PROCESS
action (https://github.com/seccomp/libseccomp/issues/96).
SCMP_ACT_KILL_PROCESS is preferable to immediately terminate the
offending process, rather than having the SIGSYS handler running.
Use SECCOMP_GET_ACTION_AVAIL to check availability of kernel support,
as libseccomp will fallback on SCMP_ACT_KILL otherwise, and we still
prefer SCMP_ACT_TRAP.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
The seccomp action SCMP_ACT_KILL results in immediate termination of
the thread that made the bad system call. However, qemu being
multi-threaded, it keeps running. There is no easy way for parent
process / management layer (libvirt) to know about that situation.
Instead, the default SIGSYS handler when invoked with SCMP_ACT_TRAP
will terminate the program and core dump.
This may not be the most secure solution, but probably better than
just killing the offending thread. SCMP_ACT_KILL_PROCESS has been
added in Linux 4.14 to improve the situation, which I propose to use
by default if available in the next patch.
Related to:
https://bugzilla.redhat.com/show_bug.cgi?id=1594456
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
Current and upcoming mesa releases rely on a shader disk cash. It uses
a thread job queue with low priority, set with
sched_setscheduler(SCHED_IDLE). However, that syscall is rejected by
the "resourcecontrol" seccomp qemu filter.
Since it should be safe to allow lowering thread priority, let's allow
scheduling thread to idle policy.
Related to:
https://bugzilla.redhat.com/show_bug.cgi?id=1594456
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
If CONFIG_SECCOMP is undefined, the option 'elevatedprivileges' remains
compiled. This would make libvirt set the corresponding capability and
then trigger failure during guest startup. This patch moves the code
regarding seccomp command line options to qemu-seccomp.c file and
wraps qemu_opts_foreach finding sandbox option with CONFIG_SECCOMP.
Because parse_sandbox() is moved into qemu-seccomp.c file, change
seccomp_start() to static function.
Signed-off-by: Yi Min Zhao <zyimin@linux.ibm.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Tested-by: Ján Tomko <jtomko@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
This patch adds [,resourcecontrol=deny] to `-sandbox on' option. It
blacklists all process affinity and scheduler priority system calls to
avoid any bigger of the process.
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
This patch adds [,spawn=deny] argument to `-sandbox on' option. It
blacklists fork and execve system calls, avoiding Qemu to spawn new
threads or processes.
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
This patch introduces the new argument
[,elevateprivileges=allow|deny|children] to the `-sandbox on'. It allows
or denies Qemu process to elevate its privileges by blacklisting all
set*uid|gid system calls. The 'children' option will let forks and
execves run unprivileged.
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
This patch introduces the argument [,obsolete=allow] to the `-sandbox on'
option. It allows Qemu to run safely on old system that still relies on
old system calls.
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
This patch changes the default behavior of the seccomp filter from
whitelist to blacklist. By default now all system calls are allowed and
a small black list of definitely forbidden ones was created.
Signed-off-by: Eduardo Otubo <otubo@redhat.com>
getrusage is used in a number of places throughout the qemu codebase
(notably, in crypto/pbkdf.c). Without this syscall being whitelisted,
qemu ends up getting killed by the kernel whenever you try to connect to
a VNC console.
Signed-off-by: Brian Rak <brak@gameservers.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Newer version of nss-softokn libraries (> 3.16.2.3) use sysinfo call
so qemu using rbd image hang after start when run in sandbox mode.
To allow using rbd images in sandbox mode we have to whitelist it.
Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
The cacheflush system call (found on MIPS and ARM) has been included in
the libseccomp header since 2.2.0, so include it back to that version.
Previously it was only enabled since 2.2.3 since that is when it was
enabled properly for ARM.
This will allow seccomp support to be enabled for MIPS back to
libseccomp 2.2.0.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Reviewed-By: Andrew Jones <drjones@redhat.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1454089805-5470-16-git-send-email-peter.maydell@linaro.org
cacheflush is an arm-specific syscall that qemu built for arm
uses. Add it to the whitelist, but only if we're linking with
a recent enough libseccomp.
Signed-off-by: Andrew Jones <drjones@redhat.com>
This is used by memfd code.
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Thibaut Collet <thibaut.collet@6wind.com>
This is used by "-realtime mlock=on".
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Tested-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
The "memory-backend-ram" QOM object utilizes the mbind(2) syscall to
set the policy for a memory range. Add the syscall to the seccomp
sandbox whitelist.
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Tested-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
fallocate() is needed for snapshotting. If it isn’t whitelisted
$ qemu-img create -f qcow2 x.qcow 1G
Formatting 'x.qcow', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off
$ qemu-kvm -display none -monitor stdio -sandbox on x.qcow
QEMU 2.1.50 monitor - type 'help' for more information
(qemu) savevm foo
(qemu) loadvm foo
will fail, as will subsequent savevm commands on the same image.
fadvise64(), inotify_init1(), inotify_add_watch() are needed by
the SDL display. Without the whitelist entries,
qemu-kvm -sandbox on
fails immediately.
In my tests fadvise64() is called 50--51 times per VM run. That
number seems independent of the duration of the run. fallocate(),
inotify_init1(), inotify_add_watch() are called once each.
Accordingly, they are added to the whitelist at a very low
priority.
Signed-off-by: Philipp Gesang <philipp.gesang@intra2net.com>
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
QEMU needs to call semctl() for correct operation. This particular
problem was identified on shutdown with the following commandline:
# qemu -sandbox on -monitor stdio \
-device intel-hda -device hda-duplex -vnc :0
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Additional testing reveals that PulseAudio requires shmctl() and the
mlock()/munlock() syscalls on some systems/configurations. As before,
on systems that do require these syscalls, the problem can be seen with
the following command line:
# qemu -monitor stdio -sandbox on \
-device intel-hda -device hda-duplex
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
PulseAudio requires the use of shared memory so add shmget(), shmat(),
and shmdt() to the syscall whitelist.
Reported-by: xuhan@redhat.com
Signed-off-by: Paul Moore <pmoore@redhat.com>
The PulseAudio library attempts to do a mkdir(2) and fchmod(2) on
"/run/user/<UID>/pulse" which is currently blocked by the syscall
filter; this patch adds the two missing syscalls to the whitelist.
You can reproduce this problem with the following command:
# qemu -monitor stdio -device intel-hda -device hda-duplex
If watched under strace the following syscalls are shown:
mkdir("/run/user/0/pulse", 0700)
fchmod(11, 0700) [NOTE: 11 is the fd for /run/user/0/pulse]
Reported-by: xuhan@redhat.com
Signed-off-by: Paul Moore <pmoore@redhat.com>
This fixes a bug where we weren't exiting if seccomp_init() failed.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Acked-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Acked-by: Paul Moore <pmoore@redhat.com>
This was causing Qemu process to hang when using -sandbox on as
discribed on RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1004175
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Tested-by: Paul Moore <pmoore@redhat.com>
Acked-by: Paul Moore <pmoore@redhat.com>
It appears that even a very simple /etc/qemu-ifup configuration can
require the arch_prctl() syscall, see the example below:
#!/bin/sh
/sbin/ifconfig $1 0.0.0.0 up
/usr/sbin/brctl addif <switch> $1
Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Message-id: 20130718135703.8247.19213.stgit@localhost
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
A previous commit, "seccomp: add the asynchronous I/O syscalls to the
whitelist", added several asynchronous I/O syscalls but left out the
io_submit() and io_cancel() syscalls. This patch corrects this by
adding the two missing asynchronous I/O syscalls.
Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Message-id: 20130715193201.943.4913.stgit@localhost
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
v3 update:
- reincluding getrlimit(), it is used by Xen.
v2 update:
- reincluding setrlimit(), it is used by Xen.
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1374518017-10424-3-git-send-email-otubo@linux.vnet.ibm.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
v2 update:
- set libseccomp 2.1.0 as requirement on configure script.
Since libseccomp 2.0 there's no need to check the architecture type
anymore.
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1374518017-10424-2-git-send-email-otubo@linux.vnet.ibm.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
In order to enable the asynchronous I/O functionality when using the
seccomp sandbox we need to add the associated syscalls to the
whitelist.
Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Message-id: 20130529203001.20939.83322.stgit@localhost
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
According to the bug 855162[0] - there's the need of adding new syscalls
to the whitelist when using Qemu with Libvirt.
[0] - https://bugzilla.redhat.com/show_bug.cgi?id=855162
Reported-by: Paul Moore <pmoore@redhat.com>
Tested-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
v1:
- I added a syscall struct using priority levels as described in the
libseccomp man page. The priority numbers are based to the frequency
they appear in a sample strace from a regular qemu guest run under
libvirt.
Libseccomp generates linear BPF code to filter system calls, those rules
are read one after another. The priority system places the most common
rules first in order to reduce the overhead when processing them.
v1 -> v2:
- Fixed some style issues
- Removed code from vl.c and created qemu-seccomp.[ch]
- Now using ARRAY_SIZE macro
- Added more syscalls without priority/frequency set yet
v2 -> v3:
- Adding copyright and license information
- Replacing seccomp_whitelist_count just by ARRAY_SIZE
- Adding header protection to qemu-seccomp.h
- Moving QemuSeccompSyscall definition to qemu-seccomp.c
- Negative return from seccomp_start is fatal now.
- Adding open() and execve() to the whitelis
v3 -> v4:
- Tests revealed a bigger set of syscalls.
- seccomp_start() now has an argument to set the mode according to the
configure option trap or kill.
v4 -> v5:
- Tests on x86_64 required a new specific set of system calls.
- libseccomp release 1.0.0: part of the API have changed in this last
release, had to adapt to the new function signatures.