linux/net/core
Kirill Tkhai 1a57feb847 net: Introduce net_sem for protection of pernet_list
Currently, net_mutex is mostly used to protect the pernet operations
list. It orders setup_net() and cleanup_net() against parallel
{un,}register_pernet_operations() calls, so that the ->exit{,batch}
methods executed for a dying net belong to the same pernet operations
whose ->init methods were called for it, even after the net namespace
is unlinked from net_namespace_list in cleanup_net().

But this approach has several scalability problems. The first one
is that only one net can be created or destroyed at a time
on a node. On big machines with many CPUs running many containers,
this is very noticeable.

The second one is that synchronize_rcu() has to be called after a net
is removed from net_namespace_list:

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                                  <--- Sleeps here for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on systems with many processors
and/or when preemptible RCU is enabled in the config. So for the whole time
cleanup_net() is waiting for the RCU grace period, no new net namespaces
can be created; the tasks trying to create them sleep on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)                    <--- Sleeps here for ages

I observed 20-30 second hangs of "unshare -n" on an ordinary 8-CPU laptop
with preemptible RCU enabled after a round of CRIU tests had finished.

The solution is to convert net_mutex to an rw_semaphore and add fine-grained
locks to the really small number of pernet_operations that actually need them.

Then the pernet_operations::init/::exit methods, which modify net-related data,
will require only down_read() locking, while down_write() will be used
for changing pernet_list (i.e., when modules are being loaded and unloaded).

This gives a significant performance increase once the whole patch set
is applied, as you can see here:

%for i in {1..10000}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779s

(about 5.9 times faster in wall-clock time)
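
The speedup figure is the ratio of the two "real" times reported above. As a minimal sketch for checking such a ratio, the helpers below parse time(1)-style strings such as "1m40,377s" (accepting either a comma or a dot as the decimal separator) and divide them. The function names parse_time_seconds() and speedup() are hypothetical and not part of any tool mentioned here.

```c
#include <ctype.h>
#include <stdlib.h>

/* Parse "<min>m<sec>[,|.]<frac>[s]" into seconds; returns -1.0 on bad input. */
static double parse_time_seconds(const char *s)
{
    char *end;
    long minutes = strtol(s, &end, 10);

    if (end == s || *end != 'm')
        return -1.0;
    s = end + 1;

    long secs = strtol(s, &end, 10);
    if (end == s)
        return -1.0;

    double frac = 0.0, scale = 0.1;
    if (*end == ',' || *end == '.') {
        /* Accumulate fractional digits one decimal place at a time. */
        for (++end; isdigit((unsigned char)*end); ++end) {
            frac += (*end - '0') * scale;
            scale /= 10.0;
        }
    }
    return minutes * 60.0 + (double)secs + frac;
}

/* Ratio before/after of two wall-clock times; returns -1.0 on bad input. */
static double speedup(const char *before, const char *after)
{
    double b = parse_time_seconds(before);
    double a = parse_time_seconds(after);

    if (b < 0.0 || a <= 0.0)
        return -1.0;
    return b / a;
}
```

For the measurements above, the call would be speedup("1m40,377s", "0m17,007s").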

This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore,
describes the variables it protects, and uses it where appropriate.
net_mutex is still present; the next patches will remove it step by step.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:04 -05:00
datagram.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
dev_addr_lists.c net: fix spelling for synchronized 2014-11-18 15:26:32 -05:00
dev_ioctl.c dev_ioctl(): move copyin/copyout to callers 2018-01-24 19:13:45 -05:00
dev.c net_sched: plug in qdisc ops change_tx_queue_len 2018-01-29 12:42:15 -05:00
devlink.c devlink: fix memory leak on 'resource' 2018-01-22 09:27:10 -05:00
drop_monitor.c treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
dst_cache.c net: dst_cache_per_cpu_dst_set() can be static 2016-03-18 17:45:08 -04:00
dst.c net: Remove dst->next 2017-11-30 09:54:27 -05:00
ethtool.c bitmap: replace bitmap_{from,to}_u32array 2018-02-06 18:32:44 -08:00
fib_notifier.c net: Protect iterations over net::fib_notifier_ops in fib_seq_sum() 2017-11-15 14:01:30 +09:00
fib_rules.c fib_rules: exit_net cleanup check added 2017-11-14 15:45:53 +09:00
filter.c bpf: fix subprog verifier bypass by div/mod by 0 exception 2018-01-26 16:42:05 -08:00
flow_dissector.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-19 22:59:33 -05:00
gen_estimator.c net_sched: gen_estimator: fix lockdep splat 2018-01-29 14:29:10 -05:00
gen_stats.c net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq 2017-12-08 13:32:26 -05:00
gro_cells.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
hwbm.c net: hwbm: Fix unbalanced spinlock in error case 2016-05-25 12:35:09 -07:00
link_watch.c net: link_watch: mark bonding link events urgent 2018-01-23 19:43:30 -05:00
lwt_bpf.c bpf: rename bpf_compute_data_end into bpf_compute_data_pointers 2017-09-26 13:36:44 -07:00
lwtunnel.c ipv6: sr: define core operations for seg6local lightweight tunnel 2017-08-07 14:16:22 -07:00
Makefile xdp: base API for new XDP rx-queue info concept 2018-01-05 15:21:20 -08:00
neighbour.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-17 00:10:42 -05:00
net_namespace.c net: Introduce net_sem for protection of pernet_list 2018-02-13 10:36:04 -05:00
net-procfs.c net: delete /proc THIS_MODULE references 2018-01-16 15:01:33 -05:00
net-sysfs.c net: introduce helper dev_change_tx_queue_len() 2018-01-29 12:42:15 -05:00
net-sysfs.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
net-traces.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-11-04 09:26:51 +09:00
netclassid_cgroup.c cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS 2017-07-21 11:14:51 -04:00
netevent.c netevent: remove automatic variable in register_netevent_notifier() 2015-05-31 00:03:21 -07:00
netpoll.c netpoll: Use lockdep to assert IRQs are disabled/enabled 2017-11-08 11:13:54 +01:00
netprio_cgroup.c net: remove duplicate includes 2017-12-13 13:18:46 -05:00
pktgen.c pktgen: Clean read user supplied flag mess 2018-01-24 15:03:36 -05:00
ptp_classifier.c ptp: Change ptp_class to a proper bitmask 2015-11-03 11:08:22 -05:00
request_sock.c ipv4: Namespaceify tcp_max_syn_backlog knob 2016-12-29 11:38:31 -05:00
rtnetlink.c net: Introduce net_sem for protection of pernet_list 2018-02-13 10:36:04 -05:00
scm.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h> 2017-03-02 08:42:29 +01:00
secure_seq.c tcp: Namespaceify sysctl_tcp_timestamps 2017-06-08 10:53:29 -04:00
skbuff.c net: Whitelist the skbuff_head_cache "cb" field 2018-02-08 15:15:48 -05:00
sock_diag.c net: core: fix module type in sock_diag_bind 2018-01-09 11:28:58 -05:00
sock_reuseport.c soreuseport: fix mem leak in reuseport_add_sock() 2018-02-02 19:47:03 -05:00
sock.c net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
stream.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
sysctl_net_core.c bpf: restrict access to core bpf sysctls 2018-01-19 18:37:00 -08:00
timestamping.c net: skb_defer_rx_timestamp should check for phydev before setting up classify 2015-07-09 14:17:15 -07:00
tso.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
utils.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-05-02 16:40:27 -07:00
xdp.c xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg 2018-01-05 15:21:21 -08:00