Pull year 2038 updates from Thomas Gleixner:
"Another round of changes to make the kernel ready for 2038. After lots
of preparatory work this is the first set of syscalls which are 2038
safe:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
The syscall numbers are identical all over the architectures"
* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
riscv: Use latest system call ABI
checksyscalls: fix up mq_timedreceive and stat exceptions
unicore32: Fix __ARCH_WANT_STAT64 definition
asm-generic: Make time32 syscall numbers optional
asm-generic: Drop getrlimit and setrlimit syscalls from default list
32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
compat ABI: use non-compat openat and open_by_handle_at variants
y2038: add 64-bit time_t syscalls to all 32-bit architectures
y2038: rename old time and utime syscalls
y2038: remove struct definition redirects
y2038: use time32 syscall names on 32-bit
syscalls: remove obsolete __IGNORE_ macros
y2038: syscalls: rename y2038 compat syscalls
x86/x32: use time64 versions of sigtimedwait and recvmmsg
timex: change syscalls to use struct __kernel_timex
timex: use __kernel_timex internally
sparc64: add custom adjtimex/clock_adjtime functions
time: fix sys_timer_settime prototype
time: Add struct __kernel_timex
time: make adjtime compat handling available for 32 bit
...
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].
This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).
/* prototype and argument /*
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).
In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).
/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that given that if the
flags argument decides the scope of the signal instead of different types
of fds it might make sense to either settle for "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.
The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
fomer because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.
/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.
/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).
/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_generic() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.
/* testing */
This patch was tested on x64 and x86.
/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
return -ENOSYS;
#endif
}
int main(int argc, char *argv[])
{
int fd, ret, saved_errno, sig;
if (argc < 3)
exit(EXIT_FAILURE);
fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
exit(EXIT_FAILURE);
}
sig = atoi(argv[2]);
printf("Sending signal %d to process %s\n", sig, argv[1]);
ret = do_pidfd_send_signal(fd, sig, NULL, 0);
saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to send signal %d to process %s\n",
strerror(errno), sig, argv[1]);
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
* For the sake of progress please consider these arguments settled unless
* there is a new point that desperately needs to be addressed. Please make
* sure to check the links to the threads in this commit message whether
* this has not already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created exactly for the reason to not require to
pin struct task_struct (cf. [22]).
Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle not that pids are not
recycled (cf. [22]).
Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [22])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscalls just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscalls to get such file descriptors is planned for a
future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety or reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].
/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
In the middle of do_exit() there is there is a call
"ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
in TACKED_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
for the debugger to release the task or SIGKILL to be delivered.
Skipping past dequeue_signal when we know a fatal signal has already
been delivered resulted in SIGKILL remaining pending and
TIF_SIGPENDING remaining set. This in turn caused the
scheduler to not sleep in PTACE_EVENT_EXIT as it figured
a fatal signal was pending. This also caused ptrace_freeze_traced
in ptrace_check_attach to fail because it left a per thread
SIGKILL pending which is what fatal_signal_pending tests for.
This difference in signal state caused strace to report
strace: Exit of unknown pid NNNNN ignored
Therefore update the signal handling state like dequeue_signal
would when removing a per thread SIGKILL, by removing SIGKILL
from the per thread signal mask and clearing TIF_SIGPENDING.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Ivan Delalande <colona@arista.com>
Cc: stable@vger.kernel.org
Fixes: 35634ffa17 ("signal: Always notice exiting tasks")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop failing
to deliver SIGHUP but always trying.
When the stack overflows delivery of SIGHUP fails and force_sigsegv is
called. Unfortunately because SIGSEGV is numerically higher than
SIGHUP next_signal tries again to deliver a SIGHUP.
From a quality of implementation standpoint attempting to deliver the
timer SIGHUP signal is wrong. We should attempt to deliver the
synchronous SIGSEGV signal we just forced.
We can make that happening in a fairly straight forward manner by
instead of just looking at the signal number we also look at the
si_code. In particular for exceptions (aka synchronous signals) the
si_code is always greater than 0.
That still has the potential to pick up a number of asynchronous
signals as in a few cases the same si_codes that are used
for synchronous signals are also used for asynchronous signals,
and SI_KERNEL is also included in the list of possible si_codes.
Still the heuristic is much better and timer signals are definitely
excluded. Which is enough to prevent all known ways for someone
sending a process signals fast enough to cause unexpected and
arguably incorrect behavior.
Cc: stable@vger.kernel.org
Fixes: a27341cd5f ("Prioritize synchronous signals over 'normal' signals")
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop
failing to deliver SIGHUP but always trying.
Upon examination it turns out part of the problem is actually most of
the solution. Since 2.5 signal delivery has found all fatal signals,
marked the signal group for death, and queued SIGKILL in every threads
thread queue relying on signal->group_exit_code to preserve the
information of which was the actual fatal signal.
The conversion of all fatal signals to SIGKILL results in the
synchronous signal heuristic in next_signal kicking in and preferring
SIGHUP to SIGKILL. Which is especially problematic as all
fatal signals have already been transformed into SIGKILL.
Instead of dequeueing signals and depending upon SIGKILL to
be the first signal dequeued, first test if the signal group
has already been marked for death. This guarantees that
nothing in the signal queue can prevent a process that needs
to exit from exiting.
Cc: stable@vger.kernel.org
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.
The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.
Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.
In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Since 2.5.34 the code has had the potential to not allocate siginfo
for SIGSTOP signals. Except for ptrace this is perfectly fine as only
ptrace can use PTRACE_PEEK_SIGINFO and see what the contents of
the delivered siginfo are.
Users of PTRACE_PEEK_SIGINFO that care about the contents siginfo
for SIGSTOP are rare, but they do exist. A seccomp self test
has cared and lldb cares.
Jack Andersen <jackoalan@gmail.com> writes:
> The patch titled
> `signal: Never allocate siginfo for SIGKILL or SIGSTOP`
> created a regression for users of PTRACE_GETSIGINFO needing to
> discern signals that were raised via the tgkill syscall.
>
> A notable user of this tgkill+ptrace combination is lldb while
> debugging a multithreaded program. Without the ability to detect a
> SIGSTOP originating from tgkill, lldb does not have a way to
> synchronize on a per-thread basis and falls back to SIGSTOP-ing the
> entire process.
Everyone affected by this please note. The kernel can still fail to
allocate a siginfo structure. The allocation is with GFP_KERNEL and
is best effort only. If memory is tight when the signal allocation
comes in this will fail to allocate a siginfo.
So I strongly recommend looking at more robust solutions for
synchronizing with a single thread such as PTRACE_INTERRUPT. Or if
that does not work persuading your friendly local kernel developer to
build the interface you need.
Reported-by: Tycho Andersen <tycho@tycho.ws>
Reported-by: Kees Cook <keescook@chromium.org>
Reported-by: Jack Andersen <jackoalan@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Christian Brauner <christian@brauner.io>
Cc: stable@vger.kernel.org
Fixes: f149b31557 ("signal: Never allocate siginfo for SIGKILL or SIGSTOP")
Fixes: 6dfc88977e42 ("[PATCH] shared thread signals")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that 32-bit architectures have two variants of sys_rt_sigtimedwaid()
for 32-bit and 64-bit time_t, we also need to have a second compat system
call entry point on the corresponding 64-bit architectures.
The traditional system call keeps getting handled
by compat_sys_rt_sigtimedwait(), and this adds a new
compat_sys_rt_sigtimedwait_time64() that differs only in the timeout
argument type.
The naming remains a bit asymmetric for the moment. Ideally we would
want to have compat_sys_rt_sigtimedwait_time32() for the old version
and compat_sys_rt_sigtimedwait() for the new one to mirror the names
of the native entry points, but renaming the existing system call
tables causes unnecessary churn. I would suggest renaming all such
system calls together at a later point.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Once sys_rt_sigtimedwait() gets changed to a 64-bit time_t, we have
to provide compatibility support for existing binaries.
An earlier version of this patch reused the compat_sys_rt_sigtimedwait
entry point to avoid code duplication, but this newer approach
duplicates the existing native entry point instead, which seems
a bit cleaner.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Refactor the logic to restore the sigmask before the syscall
returns into an api.
This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during
the execution and restored after the execution of the syscall.
With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Refactor reading sigset from userspace and updating sigmask
into an api.
This is useful for versions of syscalls that pass in the
sigmask and expect the current->sigmask to be changed during,
and restored after, the execution of the syscall.
With the advent of new y2038 syscalls in the subsequent patches,
we add two more new versions of the syscalls (for pselect, ppoll,
and io_pgetevents) in addition to the existing native and compat
versions. Adding such an api reduces the logic that would need to
be replicated otherwise.
Note that the calls to sigprocmask() ignored the return value
from the api as the function only returns an error on an invalid
first argument that is hardcoded at these call sites.
The updated logic uses set_current_blocked() instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Because get_signal_to_deliver() was renamed to get_signal() the
comment should be fixed.
Link: http://lkml.kernel.org/r/1539179128-45709-1-git-send-email-swkhack@gmail.com
Signed-off-by: Weikang Shi <swkhack@gmail.com>
Reported-by: Christian Brauner <christian@brauner.io>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timekeeping updates from Thomas Gleixner:
"The timers and timekeeping departement provides:
- Another large y2038 update with further preparations for providing
the y2038 safe timespecs closer to the syscalls.
- An overhaul of the SHCMT clocksource driver
- SPDX license identifier updates
- Small cleanups and fixes all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tick/sched : Remove redundant cpu_online() check
clocksource/drivers/dw_apb: Add reset control
clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
clocksource/drivers: Unify the names to timer-* format
clocksource/drivers/sh_cmt: Add R-Car gen3 support
dt-bindings: timer: renesas: cmt: document R-Car gen3 support
clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
clocksource/drivers/sh_cmt: Fixup for 64-bit machines
clocksource/drivers/sh_tmu: Convert to SPDX identifiers
clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
clocksource/drivers/sh_cmt: Convert to SPDX identifiers
clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
clocksource: Convert to using %pOFn instead of device_node.name
tick/broadcast: Remove redundant check
RISC-V: Request newstat syscalls
y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
y2038: socket: Change recvmmsg to use __kernel_timespec
y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
y2038: utimes: Rework #ifdef guards for compat syscalls
...
Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.
The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.
At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.
This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.
I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.
Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.
Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...
While fixing an out of bounds array access in known_siginfo_layout
reported by the kernel test robot it became apparent that the same bug
exists in siginfo_layout and affects copy_siginfo_from_user32.
The straight forward fix that makes guards against making this mistake
in the future and should keep the code size small is to just take an
unsigned signal number instead of a signed signal number, as I did to
fix known_siginfo_layout.
Cc: stable@vger.kernel.org
Fixes: cc731525f2 ("signal: Remove kernel interal si_code magic")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The bounds checks in known_siginfo_layout only guards against positive
numbers that are too large, large negative can slip through and
can cause out of bounds accesses.
Ordinarily this is not a concern because early in signal processing
the signal number is filtered with valid_signal which ensures it
is a small positive signal number, but copy_siginfo_from_user
is called before this check is performed.
[ 73.031126] BUG: unable to handle kernel paging request at ffffffff6281bcb6
[ 73.032038] PGD 3014067 P4D 3014067 PUD 0
[ 73.032565] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 73.033287] CPU: 0 PID: 732 Comm: trinity-c3 Tainted: G W T 4.19.0-rc1-00077-g4ce5f9c #1
[ 73.034423] RIP: 0010:copy_siginfo_from_user+0x4d/0xd0
[ 73.034908] Code: 00 8b 53 08 81 fa 80 00 00 00 0f 84 90 00 00 00 85 d2 7e 2d 48 63 0b 83 f9 1f 7f 1c 8d 71 ff bf d8 04 01 50 48 0f a3 f7 73 0e <0f> b6 8c 09 20 bb 81 82 39 ca 7f 15 eb 68 31 c0 83 fa 06 7f 0c eb
[ 73.036665] RSP: 0018:ffff88001b8f7e20 EFLAGS: 00010297
[ 73.037160] RAX: 0000000000000000 RBX: ffff88001b8f7e90 RCX: fffffffff00000cb
[ 73.037865] RDX: 0000000000000001 RSI: 00000000f00000ca RDI: 00000000500104d8
[ 73.038546] RBP: ffff88001b8f7e80 R08: 0000000000000000 R09: 0000000000000000
[ 73.039201] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008
[ 73.039874] R13: 00000000000002dc R14: 0000000000000000 R15: 0000000000000000
[ 73.040613] FS: 000000000104a880(0000) GS:ffff88001f000000(0000) knlGS:0000000000000000
[ 73.041649] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 73.042405] CR2: ffffffff6281bcb6 CR3: 000000001cb52003 CR4: 00000000001606b0
[ 73.043351] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 73.044286] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 73.045221] Call Trace:
[ 73.045556] __x64_sys_rt_tgsigqueueinfo+0x34/0xa0
[ 73.046199] do_syscall_64+0x1a4/0x390
[ 73.046708] ? vtime_user_enter+0x61/0x80
[ 73.047242] ? __context_tracking_enter+0x4e/0x60
[ 73.047714] ? __context_tracking_enter+0x4e/0x60
[ 73.048278] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Therefore fix known_siginfo_layout to take an unsigned signal number
instead of a signed signal number. All valid signal numbers are small
positive numbers so they will not be affected, but invalid negative
signal numbers will now become large positive signal numbers and will
not be used as indices into the sig_sicodes array.
Making the signal number unsigned makes it difficult for similar mistakes to
happen in the future.
Fixes: 4ce5f9c9e7 ("signal: Use a smaller struct siginfo in the kernel")
Inspired-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Andrei Vagin <avagin@gmail.com> reported:
> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.
>
> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany the signal. This
> argument is a pointer to a structure of type siginfo_t, described in
> sigaction(2) (and defined by including <sigaction.h>). The caller
> should set the following fields in this structure:
>
> si_code
> This must be one of the SI_* codes in the Linux kernel source
> file include/asm-generic/siginfo.h, with the restriction that
> the code must be negative (i.e., cannot be SI_USER, which is
> used by the kernel to indicate a signal sent by kill(2)) and
> cannot (since Linux 2.6.39) be SI_TKILL (which is used by the
> kernel to indicate a signal sent using tgkill(2)).
>
> si_pid This should be set to a process ID, typically the process ID of
> the sender.
>
> si_uid This should be set to a user ID, typically the real user ID of
> the sender.
>
> si_value
> This field contains the user data to accompany the signal. For
> more information, see the description of the last (union sigval)
> argument of sigqueue(3).
>
> Internally, the kernel sets the si_signo field to the value specified
> in sig, so that the receiver of the signal can also obtain the signal
> number via that field.
>
> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>>
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.
Looking at the kernel code the historical behavior has alwasy been to prefer
the signal number passed in by the kernel.
So sigh. Implmenet __copy_siginfo_from_user and __copy_siginfo_from_user32 to
take that signal number and prefer it. The user of ptrace will still
use copy_siginfo_from_user and copy_siginfo_from_user32 as they do not and
never have had a signal number there.
Luckily this change has never made it farther than linux-next.
Fixes: e75dc036c4 ("signal: Fail sigqueueinfo if si_signo != sig")
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We reserve 128 bytes for struct siginfo but only use about 48 bytes on
64bit and 32 bytes on 32bit. Someday we might use more but it is unlikely
to be anytime soon.
Userspace seems content with just enough bytes of siginfo to implement
sigqueue. Or in the case of checkpoint/restart reinjecting signals
the kernel has sent.
Reducing the stack footprint and the work to copy siginfo around from
2 cachelines to 1 cachelines seems worth doing even if I don't have
benchmarks to show a performance difference.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.
The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.
So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.
The definition of struct kernel_siginfo I have put in include/signal_types.h
A set of buildtime checks has been added to verify the two structures have
the same field offsets.
To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for using a smaller version of siginfo in the kernel
introduce copy_siginfo_from_user and use it when siginfo is copied from
userspace.
Make the pattern for using copy_siginfo_from_user and
copy_siginfo_from_user32 to capture the return value and return that
value on error.
This is a necessary prerequisite for using a smaller siginfo
in the kernel than the kernel exports to userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Rework the defintion of struct siginfo so that the array padding
struct siginfo to SI_MAX_SIZE can be placed in a union along side of
the rest of the struct siginfo members. The result is that we no
longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The kernel needs to validate that the contents of struct siginfo make
sense as siginfo is copied into the kernel, so that the proper union
members can be put in the appropriate locations. The field si_signo
is a fundamental part of that validation. As such changing the
contents of si_signo after the validation make no sense and can result
in nonsense values in the kernel.
As such simply fail if someone is silly enough to set si_signo out of
sync with the signal number passed to sigqueueinfo.
I don't expect a problem as glibc's sigqueue implementation sets
"si_signo = sig" and CRIU just returns to the kernel what the kernel
gave to it.
If there is some application that calls sigqueueinfo directly that has
a problem with this added sanity check we can revisit this when we see
what kind of crazy that application is doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When moving all of the architectures specific si_codes into
siginfo.h, I apparently overlooked EMT_TAGOVF. Move it now.
Remove the now redundant test in siginfo_layout for SIGEMT
as now NSIGEMT is always defined.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The sigaltstack(2) system call fails with -ENOMEM if the new alternative
signal stack is found to be smaller than SIGMINSTKSZ. On architectures
such as arm64, where the native value for SIGMINSTKSZ is larger than
the compat value, this can result in an unexpected error being reported
to a compat task. See, for example:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385
This patch fixes the problem by extending do_sigaltstack to take the
minimum signal stack size as an additional parameter, allowing the
native and compat system call entry code to pass in their respective
values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
been defined by the architecture.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Steve McIntyre <steve.mcintyre@arm.com>
Tested-by: Steve McIntyre <93sam@debian.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
For readability and consistency with the other exports in
kernel/signal.c pair the exports of signal sending functions with
their functions, instead of having the exports in one big clump.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This function is static and it only has two callers. As
specific_send_sig_info is only called twice remembering what
specific_send_sig_info does when reading the code is difficutl and it
makes it hard to see which sending sending functions are equivalent to
which others.
So remove specific_send_sig_info to make the code easier to read.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are no more users of SEND_SIG_FORCED so it may be safely removed.
Remove the definition of SEND_SIG_FORCED, it's use in is_si_special,
it's use in TP_STORE_SIGINFO, and it's use in __send_signal as without
any users the uses of SEND_SIG_FORCED are now unncessary.
This makes the code simpler, easier to understand and use. Users of
signal sending functions now no longer need to ask themselves do I
need to use SEND_SIG_FORCED.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The SIGKILL and SIGSTOP signals are never delivered to userspace so
queued siginfo for these signals can never be observed. Therefore
remove the chance of failure by never even attempting to allocate
siginfo in those cases.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today kernel threads never dequeue siginfo so it is pointless to
enqueue siginfo for them. The usb gadget mass storage driver goes
one farther and uses SEND_SIG_FORCED to guarantee that no siginfo is
even enqueued.
Generalize the optimization of the usb mass storage driver and never
perform an unnecessary allocation when delivering signals to kthreads.
Switch the mass storage driver from sending signals with
SEND_SIG_FORCED to SEND_SIG_PRIV. As using SEND_SIG_FORCED is now
unnecessary.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Instead of playing whack-a-mole and changing SEND_SIG_PRIV to
SEND_SIG_FORCED throughout the kernel to ensure a pid namespace init
gets signals sent by the kernel, stop allowing a pid namespace init to
ignore SIGKILL or SIGSTOP sent by the kernel. A pid namespace init is
only supposed to be able to ignore signals sent from itself and
children with SIG_DFL.
Fixes: 921cf9f630 ("signals: protect cinit from unblocked SIG_DFL signals")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If the first process started (aka /sbin/init) receives a SIGKILL it
will panic the system if it is delivered. Making the system unusable
and undebugable. It isn't much better if the first process started
receives SIGSTOP.
So always ignore SIGSTOP and SIGKILL sent to init.
This is done in a separate clause in sig_task_ignored as force_sig_info
can clear SIG_UNKILLABLE and this protection should work even then.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This changes sys_rt_sigtimedwait() to use get_timespec64(), changing
the timeout type to __kernel_timespec, which will be changed to use
a 64-bit time_t in the future. Since the do_sigtimedwait() core
function changes, we also have to modify the compat version of this
system call in the same way.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:
Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.
The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:
old new
--- ---
compat_time_t old_time32_t
struct compat_timeval struct old_timeval32
struct compat_timespec struct old_timespec32
struct compat_itimerspec struct old_itimerspec32
ns_to_compat_timeval() ns_to_old_timeval32()
get_compat_itimerspec64() get_old_itimerspec32()
put_compat_itimerspec64() put_old_itimerspec32()
compat_get_timespec64() get_old_timespec32()
compat_put_timespec64() put_old_timespec32()
As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.
I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.
This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
make get_signal() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-18-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sigkill_pending() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-17-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
legacy_queue() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-16-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wants_signal() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-15-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return value of flush_sigqueue_mask() is never checked anywhere.
Link: http://lkml.kernel.org/r/20180602103653.18181-14-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
unhandled_signal() already behaves like a boolean function. Let's
actually declare it as such too. All callers treat it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-13-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
recalc_sigpending_tsk() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-12-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
has_pending_signals() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-11-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sig_ignored() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-10-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sig_task_ignored() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-9-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sig_handler_ignored() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-8-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kill_ok_by_cred() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-7-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The goto is not needed and does not add any clarity. Simply return
-EINVAL on unexpected sigset_t struct size directly.
Link: http://lkml.kernel.org/r/20180602103653.18181-6-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_sigpending() returned 0 unconditionally so it doesn't make sense to
have it return at all. This allows us to simplify a bunch of syscall
callers.
Link: http://lkml.kernel.org/r/20180602103653.18181-5-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
may_ptrace_stop() already behaves like a boolean function. Let's actually
declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-4-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kill_as_cred_perm() already behaves like a boolean function. Let's
actually declare it as such too.
Link: http://lkml.kernel.org/r/20180602103653.18181-3-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "signal: refactor some functions", v3.
This series refactors a bunch of functions in signal.c to simplify parts
of the code.
The greatest single change is declaring the static do_sigpending() helper
as void which makes it possible to remove a bunch of unnecessary checks in
the syscalls later on.
This patch (of 17):
force_sigsegv() returned 0 unconditionally so it doesn't make sense to have
it return at all. In addition, there are no callers that check
force_sigsegv()'s return value.
Link: http://lkml.kernel.org/r/20180602103653.18181-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.
The code was being overly pessimistic. Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and it's newly spawned child. For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to logically
delivered after the fork().
While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread. Similarly the current code will also miss blocked
signals that are delivered to multiple process, as those signals will
not appear pending during fork.
Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes. When fork completes initialize the new processes
shared_pending signal set with it. The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals. Making it
appear as if those signals were received immediately after the fork.
It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that that are
no signals sent to multiple processes that require siginfo. This
means it is safe to not bother collecting siginfo on signals sent
during fork.
The sigaction of a child of fork is initially the same as the
sigaction of the parent process. So a signal the parent ignores the
child will also initially ignore. Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.
Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with. They won't cause any problems.
V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
- Initialize signal_struct.multiprocess in init_task
- Move setting of shared_pending to before the new task
is visible to signals. This prevents signals from comming
in before shared_pending.signal is set to delayed.signal
and being lost.
V5: - rework list add and delete to account for idle threads
v6: - Use sigdelsetmask when removing stop signals
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837 ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL. Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered. Which
leaves only SIGSTOP that needs to be considered when creating new
threads.
Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.
A fork (especially of a process with a lot of memory) is one of the
most expensive system so we really only want to restart a fork when
necessary.
It is easy so check to see if a SIGSTOP is ongoing and have the new
thread join it immediate after the clone completes. Making it appear
the clone completed happened just before the SIGSTOP.
The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
V2: The call to task_join_group_stop was moved before the new task is
added to the thread group list. This should not matter as
sighand->siglock is held over both the addition of the threads,
the call to task_join_group_stop and do_signal_stop. But the change
is trivial and it is one less thing to worry about when reading
the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a function calculate_sigpending to test to see if any signals are
pending for a new task immediately following fork. Signals have to
happen either before or after fork. Today our practice is to push
all of the signals to before the fork, but that has the downside that
frequent or periodic signals can make fork take much much longer than
normal or prevent fork from completing entirely.
So we need move signals that we can after the fork to prevent that.
This updates the code to set TIF_SIGPENDING on a new task if there
are signals or other activities that have moved so that they appear
to happen after the fork.
As the code today restarts if it sees any such activity this won't
immediately have an effect, as there will be no reason for it
to set TIF_SIGPENDING immediately after the fork.
Adding calculate_sigpending means the code in fork can safely be
changed to not always restart if a signal is pending.
The new calculate_sigpending function sets sigpending if there
are pending bits in jobctl, pending signals, the freezer needs
to freeze the new task or the live kernel patching framework
need the new thread to take the slow path to userspace.
I have verified that setting TIF_SIGPENDING does make a new process
take the slow path to userspace before it executes it's first userspace
instruction.
I have looked at the callers of signal_wake_up and the code paths
setting TIF_SIGPENDING and I don't see anything else that needs to be
handled. The code probably doesn't need to set TIF_SIGPENDING for the
kernel live patching as it uses a separate thread flag as well. But
at this point it seems safer reuse the recalc_sigpending logic and get
the kernel live patching folks to sort out their story later.
V2: I have moved the test into schedule_tail where siglock can
be grabbed and recalc_sigpending can be reused directly.
Further as the last action of setting up a new task this
guarantees that TIF_SIGPENDING will be properly set in the
new process.
The helper calculate_sigpending takes the siglock and
uncontitionally sets TIF_SIGPENDING and let's recalc_sigpending
clear TIF_SIGPENDING if it is unnecessary. This allows reusing
the existing code and keeps maintenance of the conditions simple.
Oleg Nesterov <oleg@redhat.com> suggested the movement
and pointed out the need to take siglock if this code
was going to be called while the new task is discoverable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This is the bottom and by pushing this down it simplifies the callers
and otherwise leaves things as is. This is in preparation for allowing
fork to implement better handling of signals set to groups of processes.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This information is already available in the callers and by pushing it
down it makes the code a little clearer, and allows implementing
better handling of signales set to a group of processes in fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This information is already available in the callers and by pushing it
down it makes the code a little clearer, and allows better group
signal behavior in fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This passes the information we already have at the call sight into
do_send_sig_info. Ultimately allowing for better handling of signals
sent to a group of processes during fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This passes the information we already have at the call sight
into group_send_sig_info. Ultimatelly allowing for to better handle
signals sent to a group of processes.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Make the code more maintainable by performing more of the signal
related work in send_sigqueue.
A quick inspection of do_timer_create will show that this code path
does not lookup a thread group by a thread's pid. Making it safe
to find the task pointed to by it_pid with "pid_task(it_pid, type)";
This supports the changes needed in fork to tell if a signal was sent
to a single process or a group of processes.
Having the pid to task transition in signal.c will also make it easier
to sort out races with de_thread and and the thread group leader
exiting when it comes time to address that.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Commit a841796f11 ("signal: align __lock_task_sighand() irq disabling and
RCU") introduced a rcu read side critical section with interrupts
disabled. The changelog suggested that a better long-term fix would be "to
make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's
->wait_lock".
This long-term fix has been made in commit b4abf91047 ("rtmutex: Make
wait_lock irq safe") for a different reason.
Therefore revert commit a841796f11 ("signal: align >
__lock_task_sighand() irq disabling and RCU") as the interrupt disable
dance is not longer required.
The change was tested on the base of b4abf91047 ("rtmutex: Make wait_lock
irq safe") with a four hour run of rcutorture scenario TREE03 with lockdep
enabled as suggested by Paul McKenney.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: bigeasy@linutronix.de
Link: https://lkml.kernel.org/r/20180525090507.22248-3-anna-maria@linutronix.de
Pull siginfo updates from Eric Biederman:
"This set of changes close the known issues with setting si_code to an
invalid value, and with not fully initializing struct siginfo. There
remains work to do on nds32, arc, unicore32, powerpc, arm, arm64, ia64
and x86 to get the code that generates siginfo into a simpler and more
maintainable state. Most of that work involves refactoring the signal
handling code and thus careful code review.
Also not included is the work to shrink the in kernel version of
struct siginfo. That depends on getting the number of places that
directly manipulate struct siginfo under control, as it requires the
introduction of struct kernel_siginfo for the in kernel things.
Overall this set of changes looks like it is making good progress, and
with a little luck I will be wrapping up the siginfo work next
development cycle"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
signal/sh: Stop gcc warning about an impossible case in do_divide_error
signal/mips: Report FPE_FLTUNK for undiagnosed floating point exceptions
signal/um: More carefully relay signals in relay_signal.
signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR}
signal: Remove unncessary #ifdef SEGV_PKUERR in 32bit compat code
signal/signalfd: Add support for SIGSYS
signal/signalfd: Remove __put_user from signalfd_copyinfo
signal/xtensa: Use force_sig_fault where appropriate
signal/xtensa: Consistenly use SIGBUS in do_unaligned_user
signal/um: Use force_sig_fault where appropriate
signal/sparc: Use force_sig_fault where appropriate
signal/sparc: Use send_sig_fault where appropriate
signal/sh: Use force_sig_fault where appropriate
signal/s390: Use force_sig_fault where appropriate
signal/riscv: Replace do_trap_siginfo with force_sig_fault
signal/riscv: Use force_sig_fault where appropriate
signal/parisc: Use force_sig_fault where appropriate
signal/parisc: Use force_sig_mceerr where appropriate
signal/openrisc: Use force_sig_fault where appropriate
signal/nios2: Use force_sig_fault where appropriate
...
Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(), but the
reported issue can (and does) in fact happen for states that do not do
condition based sleeps.
When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can loose the sleep state.
Normal condition based wait-loops are immune to this problem, but for
sleep states that are not condition based are subject to this problem.
There already is a fix for TASK_DEAD. Abstract that and also apply it
to TASK_STOPPED and TASK_TRACED, both of which are also without
condition based wait-loop.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update the siginfo_layout function and enum siginfo_layout to represent
all of the possible field layouts of struct siginfo.
This allows the uses of siginfo_layout in um and arm64 where they are testing
for SIL_FAULT to be more accurate as this rules out the other cases.
Further this allows the switch statements on siginfo_layout to be simpler
if perhaps a little more wordy. Making it easier to understand what is
actually going on.
As SIL_FAULT_BNDERR and SIL_FAULT_PKUERR are never expected to appear
in signalfd just treat them as SIL_FAULT. To include them would take
20 extra bytes an pretty much fill up what is left of
signalfd_siginfo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only architecture that does not support SEGV_PKUERR is ia64 and
ia64 has not had 32bit support since some time in 2008. Therefore
copy_siginfo_to_user32 and copy_siginfo_from_user32 do not need to
include support for a missing SEGV_PKUERR.
Compile test on ia64.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With the recent architecture cleanups these si_codes are always
defined so there is no need to test for them.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
After the last round of cleanups to siginfo.h SEGV_BNDERR is defined
on all architectures so testing to see if it is defined is unnecessary.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
After more experience with the cases where no one the si_code of 0
is used both as a signal specific si_code, and as SI_USER it appears
that no one cares about the signal specific si_code case and the
good solution is to just fix the architectures by using
a different si_code.
In none of the conversations has anyone even suggested that
anything depends on the signal specific redefinition of SI_USER.
There are at least test cases that care when si_code as 0 does
not work as si_user.
So make things simple and keep the generic code from introducing
problems by removing the special casing of TRAP_FIXME and FPE_FIXME.
This will ensure the generic case of sending a signal with
kill will always set SI_USER and work.
The architecture specific, and signal specific overloads that
set si_code to 0 will now have problems with signalfd and
the 32bit compat versions of siginfo copying. At least
until they are fixed.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that every instance of struct siginfo is now initialized it is no
longer necessary to copy struct siginfo piece by piece to userspace
but instead the entire structure can be copied.
As well as making the code simpler and more efficient this means that
copy_sinfo_to_user no longer cares which union member of struct
siginfo is in use.
In practice this means that all 32bit architectures that define
FPE_FIXME will handle properly send SI_USER when kill(SIGFPE) is sent.
While still performing their historic architectural brokenness when 0
is used a floating pointer signal. This matches the current behavior
of 64bit architectures that define FPE_FIXME who get lucky and an
overloaded SI_USER has continuted to work through copy_siginfo_to_user
because the 8 byte si_addr occupies the same bytes in struct siginfo
as the 4 byte si_pid and the 4 byte si_uid.
Problematic architectures still need to fix their ABI so that signalfd
and 32bit compat code will work properly.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull general security layer updates from James Morris:
- Convert security hooks from list to hlist, a nice cleanup, saving
about 50% of space, from Sargun Dhillon.
- Only pass the cred, not the secid, to kill_pid_info_as_cred and
security_task_kill (as the secid can be determined from the cred),
from Stephen Smalley.
- Close a potential race in kernel_read_file(), by making the file
unwritable before calling the LSM check (vs after), from Kees Cook.
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security: convert security hooks to use hlist
exec: Set file unwritable before LSM check
usb, signal, security: only pass the cred, not the secid, to kill_pid_info_as_cred and security_task_kill
Nothing particularly stands out here, probably because people were tied
up with spectre/meltdown stuff last time around. Still, the main pieces
are:
- Rework of our CPU features framework so that we can whitelist CPUs that
don't require kpti even in a heterogeneous system
- Support for the IDC/DIC architecture extensions, which allow us to elide
instruction and data cache maintenance when writing out instructions
- Removal of the large memory model which resulted in suboptimal codegen
by the compiler and increased the use of literal pools, which could
potentially be used as ROP gadgets since they are mapped as executable
- Rework of forced signal delivery so that the siginfo_t is well-formed
and handling of show_unhandled_signals is consolidated and made
consistent between different fault types
- More siginfo cleanup based on the initial patches from Eric Biederman
- Workaround for Cortex-A55 erratum #1024718
- Some small ACPI IORT updates and cleanups from Lorenzo Pieralisi
- Misc cleanups and non-critical fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJaw1TCAAoJELescNyEwWM0gyQIAJVMK4QveBW+LwF96NYdZo16
p90Aa+nqKelh/s93govQArDMv1gxyuXdFlQZVOGPQHfqpz6RhJWmBA2tFsUbQrUc
OBcioPrRihqTmKBe+1r1XORwZxkVX6GGmCn0LYpPR7I3TjxXZpvxqaxGxiUvHkci
yVxWlDTyN/7eL3akhCpCDagN3Fxwk3QnJLqE3fxOFMlY7NvQcmUxcITiUl/s469q
xK6SWH9SRH1JK8jTHPitwUBiU//3FfCqSI9HLEdDIDoTuPcVM8UetWvi4QzrzJL1
UYg8lmU0CXNmflDzZJDaMf+qFApOrGxR0YVPpBzlQvxe0JIY69g48f+JzDPz8nc=
=+gNa
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"Nothing particularly stands out here, probably because people were
tied up with spectre/meltdown stuff last time around. Still, the main
pieces are:
- Rework of our CPU features framework so that we can whitelist CPUs
that don't require kpti even in a heterogeneous system
- Support for the IDC/DIC architecture extensions, which allow us to
elide instruction and data cache maintenance when writing out
instructions
- Removal of the large memory model which resulted in suboptimal
codegen by the compiler and increased the use of literal pools,
which could potentially be used as ROP gadgets since they are
mapped as executable
- Rework of forced signal delivery so that the siginfo_t is
well-formed and handling of show_unhandled_signals is consolidated
and made consistent between different fault types
- More siginfo cleanup based on the initial patches from Eric
Biederman
- Workaround for Cortex-A55 erratum #1024718
- Some small ACPI IORT updates and cleanups from Lorenzo Pieralisi
- Misc cleanups and non-critical fixes"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (70 commits)
arm64: uaccess: Fix omissions from usercopy whitelist
arm64: fpsimd: Split cpu field out from struct fpsimd_state
arm64: tlbflush: avoid writing RES0 bits
arm64: cmpxchg: Include linux/compiler.h in asm/cmpxchg.h
arm64: move percpu cmpxchg implementation from cmpxchg.h to percpu.h
arm64: cmpxchg: Include build_bug.h instead of bug.h for BUILD_BUG
arm64: lse: Include compiler_types.h and export.h for out-of-line LL/SC
arm64: fpsimd: include <linux/init.h> in fpsimd.h
drivers/perf: arm_pmu_platform: do not warn about affinity on uniprocessor
perf: arm_spe: include linux/vmalloc.h for vmap()
Revert "arm64: Revert L1_CACHE_SHIFT back to 6 (64-byte cache line size)"
arm64: cpufeature: Avoid warnings due to unused symbols
arm64: Add work around for Arm Cortex-A55 Erratum 1024718
arm64: Delay enabling hardware DBM feature
arm64: Add MIDR encoding for Arm Cortex-A55 and Cortex-A35
arm64: capabilities: Handle shared entries
arm64: capabilities: Add support for checks based on a list of MIDRs
arm64: Add helpers for checking CPU MIDR against a range
arm64: capabilities: Clean up midr range helpers
arm64: capabilities: Change scope of VHE to Boot CPU feature
...
Using this helper allows us to avoid the in-kernel call to the
compat_sys_sigaltstack() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
A similar but not fully equivalent code path is already open-coded
three times (in sys_rt_sigpending and in the two compat stubs), so
do it a fourth time here.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Currently, as reported by Eric, an invalid si_code value 0 is
passed in many signals delivered to userspace in response to faults
and other kernel errors. Typically 0 is passed when the fault is
insufficiently diagnosable or when there does not appear to be any
sensible alternative value to choose.
This appears to violate POSIX, and is intuitively wrong for at
least two reasons arising from the fact that 0 == SI_USER:
1) si_code is a union selector, and SI_USER (and si_code <= 0 in
general) implies the existence of a different set of fields
(siginfo._kill) from that which exists for a fault signal
(siginfo._sigfault). However, the code raising the signal
typically writes only the _sigfault fields, and the _kill
fields make no sense in this case.
Thus when userspace sees si_code == 0 (SI_USER) it may
legitimately inspect fields in the inactive union member _kill
and obtain garbage as a result.
There appears to be software in the wild relying on this,
albeit generally only for printing diagnostic messages.
2) Software that wants to be robust against spurious signals may
discard signals where si_code == SI_USER (or <= 0), or may
filter such signals based on the si_uid and si_pid fields of
siginfo._sigkill. In the case of fault signals, this means
that important (and usually fatal) error conditions may be
silently ignored.
In practice, many of the faults for which arm64 passes si_code == 0
are undiagnosable conditions such as exceptions with syndrome
values in ESR_ELx to which the architecture does not yet assign any
meaning, or conditions indicative of a bug or error in the kernel
or system and thus that are unrecoverable and should never occur in
normal operation.
The approach taken in this patch is to translate all such
undiagnosable or "impossible" synchronous fault conditions to
SIGKILL, since these are at least probably localisable to a single
process. Some of these conditions should really result in a kernel
panic, but due to the lack of diagnostic information it is
difficult to be certain: this patch does not add any calls to
panic(), but this could change later if justified.
Although si_code will not reach userspace in the case of SIGKILL,
it is still desirable to pass a nonzero value so that the common
siginfo handling code can detect incorrect use of si_code == 0
without false positives. In this case the si_code dependent
siginfo fields will not be correctly initialised, but since they
are not passed to userspace I deem this not to matter.
A few faults can reasonably occur in realistic userspace scenarios,
and _should_ raise a regular, handleable (but perhaps not
ignorable/blockable) signal: for these, this patch attempts to
choose a suitable standard si_code value for the raised signal in
each case instead of 0.
arm64 was the only arch to define a BUS_FIXME code, so after this
patch nobody defines it. This patch therefore also removes the
relevant code from siginfo_layout().
Cc: James Morse <james.morse@arm.com>
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
commit d178bc3a70 ("user namespace: usb:
make usb urbs user namespace aware (v2)") changed kill_pid_info_as_uid
to kill_pid_info_as_cred, saving and passing a cred structure instead of
uids. Since the secid can be obtained from the cred, drop the secid fields
from the usb_dev_state and async structures, and drop the secid argument to
kill_pid_info_as_cred. Replace the secid argument to security_task_kill
with the cred. Update SELinux, Smack, and AppArmor to use the cred, which
avoids the need for Smack and AppArmor to use a secid at all in this hook.
Further changes to Smack might still be required to take full advantage of
this change, since it should now be possible to perform capability
checking based on the supplied cred. The changes to Smack and AppArmor
have only been compile-tested.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
Pull livepatching updates from Jiri Kosina:
- handle 'infinitely'-long sleeping tasks, from Miroslav Benes
- remove 'immediate' feature, as it turns out it doesn't provide the
originally expected semantics, and brings more issues than value
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add locking to force and signal functions
livepatch: Remove immediate feature
livepatch: force transition to finish
livepatch: send a fake signal to all blocking tasks
There are so many places that build struct siginfo by hand that at
least one of them is bound to get it wrong. A handful of cases in the
kernel arguably did just that when using the errno field of siginfo to
pass no errno values to userspace. The usage is limited to a single
si_code so at least does not mess up anything else.
Encapsulate this questionable pattern in a helper function so
that the userspace ABI is preserved.
Update all of the places that use this pattern to use the new helper
function.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The helpers added are:
send_sig_mceerr
force_sig_mceerr
force_sig_bnderr
force_sig_pkuerr
Filling out siginfo properly can ge tricky. Especially for these
specialized cases where the temptation is to share code with other
cases which use a different subset of siginfo fields. Unfortunately
that code sharing frequently results in bugs with the wrong siginfo
fields filled in, and makes it harder to verify that the siginfo
structure was properly initialized.
Provide these helpers instead that get all of the details right, and
guarantee that siginfo is properly initialized.
send_sig_mceerr and force_sig_mceer are a little special as two si
codes BUS_MCEERR_AO and BUS_MCEER_AR both use the same extended
signinfo layout.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The vast majority of signals sent from architecture specific code are
simple faults. Encapsulate this reality with two helper functions so
that the nit-picky implementation of preparing a siginfo does not need
to be repeated many times on each architecture.
As only some architectures support the trapno field, make the trapno
arguement only present on those architectures.
Similary as ia64 has three fields: imm, flags, and isr that
are specific to it. Have those arguments always present on ia64
and no where else.
This ensures the architecture specific code always remembers which
fields it needs to pass into the siginfo structure.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The siginfo structure has all manners of holes with the result that a
structure initializer is not guaranteed to initialize all of the bits.
As we have to copy the structure to userspace don't even try to use
a structure initializer. Instead use clear_siginfo followed by initializing
selected fields. This gives a guarantee that uninitialized kernel memory
is not copied to userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Among the existing architecture specific versions of
copy_siginfo_to_user32 there are several different implementation
problems. Some architectures fail to handle all of the cases in in
the siginfo union. Some architectures perform a blind copy of the
siginfo union when the si_code is negative. A blind copy suggests the
data is expected to be in 32bit siginfo format, which means that
receiving such a signal via signalfd won't work, or that the data is
in 64bit siginfo and the code is copying nonsense to userspace.
Create a single instance of copy_siginfo_to_user32 that all of the
architectures can share, and teach it to handle all of the cases in
the siginfo union correctly, with the assumption that siginfo is
stored internally to the kernel is 64bit siginfo format.
A special case is made for x86 x32 format. This is needed as presence
of both x32 and ia32 on x86_64 results in two different 32bit signal
formats. By allowing this small special case there winds up being
exactly one code base that needs to be maintained between all of the
architectures. Vastly increasing the testing base and the chances of
finding bugs.
As the x86 copy of copy_siginfo_to_user32 the call of the x86
signal_compat_build_tests were moved into sigaction_compat_abi, so
that they will keep running.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function copy_siginfo_from_user32 is used for two things, in ptrace
since the dawn of siginfo for arbirarily modifying a signal that
user space sees, and in sigqueueinfo to send a signal with arbirary
siginfo data.
Create a single copy of copy_siginfo_from_user32 that all architectures
share, and teach it to handle all of the cases in the siginfo union.
In the generic version of copy_siginfo_from_user32 ensure that all
of the fields in siginfo are initialized so that the siginfo structure
can be safely copied to userspace if necessary.
When copying the embedded sigval union copy the si_int member. That
ensures the 32bit values passes through the kernel unchanged.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Having si_codes in many different files simply encourages duplicate definitions
that can cause problems later. To avoid that merge the blackfin specific si_codes
into uapi/asm-generic/siginfo.h
Update copy_siginfo_to_user to copy with the absence of BUS_MCEERR_AR that blackfin
defines to be something else.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for unconditionally copying the whole of siginfo
to userspace clear si_sys_private. So this kernel internal
value is guaranteed not to make it to userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.
This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.
This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Setting si_code to 0 results in a userspace seeing an si_code of 0.
This is the same si_code as SI_USER. Posix and common sense requires
that SI_USER not be a signal specific si_code. As such this use of 0
for the si_code is a pretty horribly broken ABI.
Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
value of __SI_KILL and now sees a value of SIL_KILL with the result
that uid and pid fields are copied and which might copying the si_addr
field by accident but certainly not by design. Making this a very
flakey implementation.
Utilizing FPE_FIXME, BUS_FIXME, TRAP_FIXME siginfo_layout will now return
SIL_FAULT and the appropriate fields will be reliably copied.
But folks this is a new and unique kind of bad. This is massively
untested code bad. This is inventing new and unique was to get
siginfo wrong bad. This is don't even think about Posix or what
siginfo means bad. This is lots of eyeballs all missing the fact
that the code does the wrong thing bad. This is getting stuck
and keep making the same mistake bad.
I really hope we can find a non userspace breaking fix for this on a
port as new as arm64.
Possible ABI fixes include:
- Send the signal without siginfo
- Don't generate a signal
- Possibly assign and use an appropriate si_code
- Don't handle cases which can't happen
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tyler Baicar <tbaicar@codeaurora.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Olof Johansson <olof@lixom.net>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arm-kernel@lists.infradead.org
Ref: 53631b54c8 ("arm64: Floating point and SIMD")
Ref: 32015c2356 ("arm64: exception: handle Synchronous External Abort")
Ref: 1d18c47c73 ("arm64: MMU fault handling and page table management")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Rename from kdb_send_sig_info to kdb_send_sig
As there is no meaningful siginfo sent
- Use SEND_SIG_PRIV instead of generating a siginfo for a kdb
signal. The generated siginfo had a bogus rationale and was
not correct in the face of pid namespaces. SEND_SIG_PRIV
is simpler and actually correct.
- As the code grabs siglock just send the signal with siglock
held instead of dropping siglock and attempting to grab it again.
- Move the sig_valid test into kdb_kill where it can generate
a good error message.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Live patching consistency model is of LEAVE_PATCHED_SET and
SWITCH_THREAD. This means that all tasks in the system have to be marked
one by one as safe to call a new patched function. Safe means when a
task is not (sleeping) in a set of patched functions. That is, no
patched function is on the task's stack. Another clearly safe place is
the boundary between kernel and userspace. The patching waits for all
tasks to get outside of the patched set or to cross the boundary. The
transition is completed afterwards.
The problem is that a task can block the transition for quite a long
time, if not forever. It could sleep in a set of patched functions, for
example. Luckily we can force the task to leave the set by sending it a
fake signal, that is a signal with no data in signal pending structures
(no handler, no sign of proper signal delivered). Suspend/freezer use
this to freeze the tasks as well. The task gets TIF_SIGPENDING set and
is woken up (if it has been sleeping in the kernel before) or kicked by
rescheduling IPI (if it was running on other CPU). This causes the task
to go to kernel/userspace boundary where the signal would be handled and
the task would be marked as safe in terms of live patching.
There are tasks which are not affected by this technique though. The
fake signal is not sent to kthreads. They should be handled differently.
They can be woken up so they leave the patched set and their
TIF_PATCH_PENDING can be cleared thanks to stack checking.
For the sake of completeness, if the task is in TASK_RUNNING state but
not currently running on some CPU it doesn't get the IPI, but it would
eventually handle the signal anyway. Second, if the task runs in the
kernel (in TASK_RUNNING state) it gets the IPI, but the signal is not
handled on return from the interrupt. It would be handled on return to
the userspace in the future when the fake signal is sent again. Stack
checking deals with these cases in a better way.
If the task was sleeping in a syscall it would be woken by our fake
signal, it would check if TIF_SIGPENDING is set (by calling
signal_pending() predicate) and return ERESTART* or EINTR. Syscalls with
ERESTART* return values are restarted in case of the fake signal (see
do_signal()). EINTR is propagated back to the userspace program. This
could disturb the program, but...
* each process dealing with signals should react accordingly to EINTR
return values.
* syscalls returning EINTR happen to be quite common situation in the
system even if no fake signal is sent.
* freezer sends the fake signal and does not deal with EINTR anyhow.
Thus EINTR values are returned when the system is resumed.
The very safe marking is done in architectures' "entry" on syscall and
interrupt/exception exit paths, and in a stack checking functions of
livepatch. TIF_PATCH_PENDING is cleared and the next
recalc_sigpending() drops TIF_SIGPENDING. In connection with this, also
call klp_update_patch_state() before do_signal(), so that
recalc_sigpending() in dequeue_signal() can clear TIF_PATCH_PENDING
immediately and thus prevent a double call of do_signal().
Note that the fake signal is not sent to stopped/traced tasks. Such task
prevents the patching to finish till it continues again (is not traced
anymore).
Last, sending the fake signal is not automatic. It is done only when
admin requests it by writing 1 to signal sysfs attribute in livepatch
sysfs directory.
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: x86@kernel.org
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
complete_signal() checks SIGNAL_UNKILLABLE before it starts to destroy
the thread group, today this is wrong in many ways.
If nothing else, fatal_signal_pending() should always imply that the
whole thread group (except ->group_exit_task if it is not NULL) is
killed, this check breaks the rule.
After the previous changes we can rely on sig_task_ignored();
sig_fatal(sig) && SIGNAL_UNKILLABLE can only be true if we actually want
to kill this task and sig == SIGKILL OR it is traced and debugger can
intercept the signal.
This should hopefully fix the problem reported by Dmitry. This
test-case
static int init(void *arg)
{
for (;;)
pause();
}
int main(void)
{
char stack[16 * 1024];
for (;;) {
int pid = clone(init, stack + sizeof(stack)/2,
CLONE_NEWPID | SIGCHLD, NULL);
assert(pid > 0);
assert(ptrace(PTRACE_ATTACH, pid, 0, 0) == 0);
assert(waitpid(-1, NULL, WSTOPPED) == pid);
assert(ptrace(PTRACE_DETACH, pid, 0, SIGSTOP) == 0);
assert(syscall(__NR_tkill, pid, SIGKILL) == 0);
assert(pid == wait(NULL));
}
}
triggers the WARN_ON_ONCE(!(task->jobctl & JOBCTL_STOP_PENDING)) in
task_participate_group_stop(). do_signal_stop()->signal_group_exit()
checks SIGNAL_GROUP_EXIT and return false, but task_set_jobctl_pending()
checks fatal_signal_pending() and does not set JOBCTL_STOP_PENDING.
And his should fix the minor security problem reported by Kyle,
SECCOMP_RET_TRACE can miss fatal_signal_pending() the same way if the
task is the root of a pid namespace.
Link: http://lkml.kernel.org/r/20171103184246.GD21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Kyle Huey <me@kylehuey.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change sig_task_ignored() to drop the SIG_DFL && !sig_kernel_only()
signals even if force == T. This simplifies the next change and this
matches the same check in get_signal() which will drop these signals
anyway.
Link: http://lkml.kernel.org/r/20171103184227.GC21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment in sig_ignored() says "Tracers may want to know about even
ignored signals" but SIGKILL can not be reported to debugger and it is
just wrong to return 0 in this case: SIGKILL should only kill the
SIGNAL_UNKILLABLE task if it comes from the parent ns.
Change sig_ignored() to ignore ->ptrace if sig == SIGKILL and rely on
sig_task_ignored().
SISGTOP coming from within the namespace is not really right too but at
least debugger can intercept it, and we can't drop it here because this
will break "gdb -p 1": ptrace_attach() won't work. Perhaps we will add
another ->ptrace check later, we will see.
Link: http://lkml.kernel.org/r/20171103184206.GB21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull compat and uaccess updates from Al Viro:
- {get,put}_compat_sigset() series
- assorted compat ioctl stuff
- more set_fs() elimination
- a few more timespec64 conversions
- several removals of pointless access_ok() in places where it was
followed only by non-__ variants of primitives
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
coredump: call do_unlinkat directly instead of sys_unlink
fs: expose do_unlinkat for built-in callers
ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
ipmi: get rid of pointless access_ok()
pi433: sanitize ioctl
cxlflash: get rid of pointless access_ok()
mtdchar: get rid of pointless access_ok()
r128: switch compat ioctls to drm_ioctl_kernel()
selection: get rid of field-by-field copyin
VT_RESIZEX: get rid of field-by-field copyin
i2c compat ioctls: move to ->compat_ioctl()
sched_rr_get_interval(): move compat to native, get rid of set_fs()
mips: switch to {get,put}_compat_sigset()
sparc: switch to {get,put}_compat_sigset()
s390: switch to {get,put}_compat_sigset()
ppc: switch to {get,put}_compat_sigset()
parisc: switch to {get,put}_compat_sigset()
get_compat_sigset()
get rid of {get,put}_compat_itimerspec()
io_getevents: Use timespec64 to represent timeouts
...
Convert all allocations that used a NOTRACK flag to stop using it.
Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit cc731525f2 ("signal: Remove kernel interal si_code magic")
added a check for SIGMET and NSIGEMT being defined. That SIGMET should
in fact be SIGEMT, with SIGEMT being defined in
arch/{alpha,mips,sparc}/include/uapi/asm/signal.h
This was actually pointed out by BenHutchings in a lwn.net comment
here https://lwn.net/Comments/734608/
Fixes: cc731525f2 ("signal: Remove kernel interal si_code magic")
Signed-off-by: Andrew Clayton <andrew@digital-domain.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>