linux/kernel
Christian Brauner 7f192e3cd3
fork: add clone3
This adds the clone3 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.

The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from  the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identical
  (Arnd suggested, that we can probably also port do_fork() itself in a
   separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible

In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:

/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)

clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseeded by a dedicated "exit_signal" argument in struct
  clone_args freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])

/* References */
[1]: b3e5838252
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-06-09 09:29:28 +02:00
..
bpf bpf: fix undefined behavior in narrow load handling 2019-05-13 02:05:50 +02:00
cgroup kernel/sched/psi.c: expose pressure metrics on root cgroup 2019-05-14 19:52:48 -07:00
configs
debug kdb: Fix bound check compiler warning 2019-05-14 13:44:24 +01:00
dma DMA mapping updates for 5.2 2019-05-09 08:40:55 -07:00
events mm/mmu_notifier: use correct mmu_notifier events for each invalidation 2019-05-14 09:47:49 -07:00
gcov gcov: clang support 2019-05-14 19:52:51 -07:00
irq Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 10:58:45 -07:00
livepatch The major changes in this tracing update includes: 2019-05-15 16:05:47 -07:00
locking locking/rwsem: Prevent decrement of reader count before increment 2019-05-07 08:46:46 +02:00
power Power management updates for 5.2-rc1 2019-05-06 19:40:31 -07:00
printk panic: add an option to replay all the printk message in buffer 2019-05-18 15:52:26 -07:00
rcu The major changes in this tracing update includes: 2019-05-15 16:05:47 -07:00
sched kernel/sched/psi.c: expose pressure metrics on root cgroup 2019-05-14 19:52:48 -07:00
time Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-16 11:00:20 -07:00
trace The major changes in this tracing update includes: 2019-05-15 16:05:47 -07:00
.gitignore Provide in-kernel headers to make extending kernel easier 2019-04-29 16:48:03 +02:00
acct.c acct_on(): don't mess with freeze protection 2019-04-04 21:04:13 -04:00
async.c treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
audit_fsnotify.c audit_compare_dname_path(): switch to const struct qstr * 2019-04-28 20:33:43 -04:00
audit_tree.c fsnotify: switch send_to_group() and ->handle_event to const struct qstr * 2019-04-26 13:51:03 -04:00
audit_watch.c audit_compare_dname_path(): switch to const struct qstr * 2019-04-28 20:33:43 -04:00
audit.c audit: connect LOGIN record to its syscall record 2019-03-20 20:57:48 -04:00
audit.h audit_compare_dname_path(): switch to const struct qstr * 2019-04-28 20:33:43 -04:00
auditfilter.c Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
auditsc.c Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-05-07 20:03:32 -07:00
backtracetest.c backtrace-test: Simplify stack trace handling 2019-04-29 12:37:47 +02:00
bounds.c
capability.c LSM: add SafeSetID module that gates setid calls 2019-01-25 11:22:43 -08:00
compat.c kernel/compat.c: mark expected switch fall-throughs 2019-05-15 08:16:14 -07:00
configs.c kernel/configs: use .incbin directive to embed config_data.gz 2019-03-07 18:32:02 -08:00
context_tracking.c
cpu_pm.c
cpu.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-06 14:50:46 -07:00
crash_core.c kexec: export PG_offline to VMCOREINFO 2019-03-05 21:07:14 -08:00
crash_dump.c
cred.c SELinux: Remove cred security blob poisoning 2019-01-08 13:18:44 -08:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE 2019-05-14 19:52:47 -07:00
extable.c
fail_function.c treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
fork.c fork: add clone3 2019-06-09 09:29:28 +02:00
freezer.c
futex.c mm/gup: change GUP fast to use flags rather than a write 'bool' 2019-05-14 09:47:46 -07:00
gen_ikh_data.sh Provide in-kernel headers to make extending kernel easier 2019-04-29 16:48:03 +02:00
groups.c
hung_task.c kernel/hung_task.c: Use continuously blocked time when reporting. 2019-03-07 18:31:59 -08:00
iomem.c mm/resource: Use resource_overlaps() to simplify region_intersects() 2019-04-19 12:59:36 +02:00
irq_work.c irq_work: Do not raise an IPI when queueing work on the local CPU 2019-04-18 14:07:52 +02:00
jump_label.c locking/static_key: Don't take sleeping locks in __static_key_slow_dec_deferred() 2019-04-29 08:29:21 +02:00
kallsyms.c bpf: Add module name [bpf] to ksymbols for bpf programs 2019-01-21 17:38:56 -03:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb()) 2019-05-06 16:57:52 -07:00
Kconfig.preempt kconfig: warn no new line at end of file 2018-12-15 17:44:35 +09:00
kcov.c kcov: convert kcov.refcount to refcount_t 2019-03-07 18:32:02 -08:00
kexec_core.c power/suspend: Add function to disable secondaries for suspend 2019-05-03 19:42:41 +02:00
kexec_file.c mm: memblock: make keeping memblock memory opt-in rather than opt-out 2019-05-14 09:47:50 -07:00
kexec_internal.h
kexec.c
kheaders.c Provide in-kernel headers to make extending kernel easier 2019-04-29 16:48:03 +02:00
kmod.c
kprobes.c kprobes: Fix error check when reusing optimized probes 2019-04-16 09:38:16 +02:00
ksysfs.c
kthread.c include/: refactor headers to allow kthread.h inclusion in psi_types.h 2019-05-14 19:52:48 -07:00
latencytop.c kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing 2019-05-14 19:52:49 -07:00
Makefile kernel/Makefile: don't assume that kernel/gen_ikh_data.sh is executable 2019-05-14 19:52:47 -07:00
memremap.c kernel/memremap.c: remove the unused device_private_entry_fault() export 2019-05-14 09:47:51 -07:00
module_signing.c
module-internal.h kallsyms: store type information in its own array 2019-03-28 15:00:37 +01:00
module.c Modules updates for v5.2 2019-05-14 10:55:54 -07:00
notifier.c kernel/notifier.c: double register detection 2019-05-14 19:52:49 -07:00
nsproxy.c
padata.c padata: Replace padata_attr_type default_attrs field with groups 2019-04-25 22:06:11 +02:00
panic.c panic: add an option to replay all the printk message in buffer 2019-05-18 15:52:26 -07:00
params.c
pid_namespace.c
pid.c kernel/pid.c: remove unneeded hash header file 2019-05-14 19:52:51 -07:00
profile.c
ptrace.c ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK 2019-03-29 10:01:37 -07:00
range.c
reboot.c panic/reboot: allow specifying reboot_mode for panic only 2019-05-14 19:52:51 -07:00
relay.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 13:27:20 -07:00
resource.c mm/resource: Use resource_overlaps() to simplify region_intersects() 2019-04-19 12:59:36 +02:00
rseq.c rseq: Remove superfluous rseq_len from task_struct 2019-04-19 12:39:32 +02:00
seccomp.c audit/stable-5.2 PR 20190507 2019-05-07 19:06:04 -07:00
signal.c signal: unconditionally leave the frozen state in ptrace_stop() 2019-05-16 10:43:58 -07:00
smp.c cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM 2019-01-30 19:27:00 +01:00
smpboot.c
smpboot.h
softirq.c softirq: Remove tasklet_hrtimer 2019-03-22 14:36:02 +01:00
stackleak.c stackleak: Mark stackleak_track_stack() as notrace 2018-12-05 19:31:44 -08:00
stacktrace.c stacktrace: Provide common infrastructure 2019-04-29 12:37:57 +02:00
stop_machine.c treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
sys_ni.c signal: support CLONE_PIDFD with pidfd_send_signal 2019-05-07 14:31:03 +02:00
sys.c kernel/sys.c: prctl: fix false positive in validate_prctl_map() 2019-05-14 09:47:44 -07:00
sysctl_binary.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
sysctl.c kernel/sysctl.c: fix proc_do_large_bitmap for large input buffers 2019-05-14 19:52:51 -07:00
task_work.c
taskstats.c genetlink: optionally validate strictly/dumps 2019-04-27 17:07:22 -04:00
test_kprobes.c
torture.c torture: Don't try to offline the last CPU 2019-03-26 14:42:53 -07:00
tracepoint.c tracing: Replace synchronize_sched() and call_rcu_sched() 2018-11-27 09:21:41 -08:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c umh: add exit routine for UMH process 2019-01-11 18:05:40 -08:00
up.c
user_namespace.c
user-return-notifier.c
user.c kernel/user.c: clean up some leftover code 2019-05-14 19:52:49 -07:00
utsname_sysctl.c
utsname.c
watchdog_hld.c kernel/watchdog_hld.c: hard lockup message should end with a newline 2019-04-19 09:46:05 -07:00
watchdog.c watchdog: Fix typo in comment 2019-04-18 14:05:51 +02:00
workqueue_internal.h sched/core, workqueues: Distangle worker accounting from rq lock 2019-04-16 16:55:15 +02:00
workqueue.c Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2019-05-09 13:48:52 -07:00