Go to file
Kirill Tkhai 1a57feb847 net: Introduce net_sem for protection of pernet_list
Currently, the mutex is mostly used to protect pernet operations
list. It orders setup_net() and cleanup_net() with parallel
{un,}register_pernet_operations() calls, so ->exit{,batch} methods
of the same pernet operations are executed for a dying net, as
were used to call ->init methods, even after the net namespace
is unlinked from net_namespace_list in cleanup_net().

But there are several problems with scalability. The first one
is that more than one net can't be created or destroyed
at the same moment on the node. For big machines with many cpus
running many containers it's very sensitive.

The second one is that it's need to synchronize_rcu() after net
is removed from net_namespace_list():

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                                  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on the systems with many processors
and/or when preemptible RCU is enabled in config. So, all the time, while
cleanup_net() is waiting for RCU grace period, creation of new net namespaces
is not possible, the tasks, who makes it, are sleeping on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages

I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop
with preemptible RCU enabled after CRIU tests round is finished.

The solution is to convert net_mutex to the rw_semaphore and add fine grain
locks to really small number of pernet_operations, what really need them.

Then, pernet_operations::init/::exit methods, modifying the net-related data,
will require down_read() locking only, while down_write() will be used
for changing pernet_list (i.e., when modules are being loaded and unloaded).

This gives signify performance increase, after all patch set is applied,
like you may see here:

%for i in {1..10000}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779

(5.8 times faster)

This patch starts replacing net_mutex to net_sem. It adds rw_semaphore,
describes the variables it protects, and makes to use, where appropriate.
net_mutex is still present, and next patches will kick it out step-by-step.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:04 -05:00
arch unify {de,}mangle_poll(), get rid of kernel-side POLL... 2018-02-11 14:37:22 -08:00
block vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
certs License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
crypto vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
Documentation KVM changes for 4.16 2018-02-10 13:16:35 -08:00
drivers Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 2018-02-12 19:55:33 -05:00
firmware kbuild: remove all dummy assignments to obj- 2017-11-18 11:46:06 +09:00
fs net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
include net: Introduce net_sem for protection of pernet_list 2018-02-13 10:36:04 -05:00
init membarrier: Provide core serializing command, *_SYNC_CORE 2018-02-05 21:35:03 +01:00
ipc vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
kernel vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
lib Kbuild updates for v4.16 (2nd) 2018-02-09 19:32:41 -08:00
LICENSES LICENSES: Add MPL-1.1 license 2018-01-06 10:59:44 -07:00
mm vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
net net: Introduce net_sem for protection of pernet_list 2018-02-13 10:36:04 -05:00
samples sample/bpf: fix erspan metadata 2018-02-06 11:32:49 -05:00
scripts Kbuild updates for v4.16 (2nd) 2018-02-09 19:32:41 -08:00
security net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
sound vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
tools Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-09 15:34:18 -08:00
usr initramfs: fix initramfs rebuilds w/ compression after disabling 2017-11-03 07:39:19 -07:00
virt vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore scripts/package: snap-pkg target 2017-12-13 00:00:18 +09:00
.mailmap mailmap: update Mark Yao's email address 2018-01-04 16:45:09 -08:00
COPYING
CREDITS MAINTAINERS: update TPM driver infrastructure changes 2017-11-09 17:58:40 -08:00
Kbuild Kbuild updates for v4.15 2017-11-17 17:45:29 -08:00
Kconfig License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
MAINTAINERS KVM changes for 4.16 2018-02-10 13:16:35 -08:00
Makefile Linux 4.16-rc1 2018-02-11 15:04:29 -08:00
README README: add a new README file, pointing to the Documentation/ 2016-10-24 08:12:35 -02:00

Linux kernel
============

This file was moved to Documentation/admin-guide/README.rst

Please notice that there are several guides for kernel developers and users.
These guides can be rendered in a number of formats, like HTML and PDF.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
See Documentation/00-INDEX for a list of what is contained in each file.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.