2019-05-27 14:55:01 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0-or-later */
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the  BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Definitions for the AF_INET socket handler.
 *
 * Version:	@(#)sock.h	1.0.4	05/13/93
 *
|
2005-05-06 07:16:16 +08:00
|
|
|
* Authors: Ross Biro
|
2005-04-17 06:20:36 +08:00
|
|
|
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Corey Minyard <wf-rch!minyard@relay.EU.net>
 *		Florian La Roche <flla@stud.uni-sb.de>
 *
 * Fixes:
 *		Alan Cox	:	Volatiles in skbuff pointers. See
 *					skbuff comments. May be overdone,
 *					better to prove they can be removed
 *					than the reverse.
 *		Alan Cox	:	Added a zapped field for tcp to note
 *					a socket is reset and must stay shut up
 *		Alan Cox	:	New fields for options
 *	Pauline Middelink	:	identd support
 *		Alan Cox	:	Eliminate low level recv/recvfrom
 *		David S. Miller	:	New socket lookup architecture.
 *		Steve Whitehouse:	Default routines for sock_ops
 *		Arnaldo C. Melo :	removed net_pinfo, tp_pinfo and made
 *					protinfo be just a void pointer, as the
 *					protocol specific parts were moved to
 *					respective headers and ipv4/v6, etc now
 *					use private slabcaches for its socks
 *		Pedro Hortas	:	New flags field for socket options
 */
#ifndef _SOCK_H
#define _SOCK_H
|
|
|
|
|
2011-06-06 18:43:46 +08:00
|
|
|
#include <linux/hardirq.h>
|
2007-08-29 06:50:33 +08:00
|
|
|
#include <linux/kernel.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/list.h>
|
2008-11-17 11:39:21 +08:00
|
|
|
#include <linux/list_nulls.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/timer.h>
|
|
|
|
#include <linux/cache.h>
|
memcg: decrement static keys at real destroy time
We call the destroy function when a cgroup starts to be removed, such as
by a rmdir event.
However, because of our reference counters, some objects are still
inflight. Right now, we are decrementing the static_keys at destroy()
time, meaning that if we get rid of the last static_key reference, some
objects will still have charges, but the code to properly uncharge them
won't be run.
This becomes a problem especially if it is ever enabled again, because
new charges will then be added to the stale charges, making it pretty
much impossible to keep track of them.
We just need to be careful with the static branch activation: since there
is no particular preferred order of their activation, we need to make sure
that we only start using it after all call sites are active. This is
achieved by having a per-memcg flag that is only updated after
static_key_slow_inc() returns. At this time, we are sure all sites are
active.
This is made per-memcg, not global, for a reason: it also has the effect
of making socket accounting more consistent. The first memcg to be
limited will trigger static_key() activation, therefore, accounting. But
all the others will then be accounted no matter what. After this patch,
only limited memcgs will have their sockets accounted.
[akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
document enum sock_flag_bits,
convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
redo tcp_update_limit() comment to 80 cols]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
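The activation order described in this commit message (take the key reference first, set the per-memcg flag only afterwards, and drop the reference only at real destroy time) can be sketched in user space with C11 atomics. This is a toy model, not kernel code: `toy_key_users`, `struct toy_group` and the `toy_*` functions are hypothetical names, and a plain counter stands in for the static key.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global static key: a plain user counter. */
static atomic_int toy_key_users;

/* Hypothetical per-group state, loosely modelled on the per-memcg flags. */
struct toy_group {
	atomic_bool active;    /* set only after the key increment returned */
	atomic_bool activated; /* this group holds one key reference */
	long charged;          /* pages currently charged to this group */
};

/* Called when a limit is set on the group. */
static void toy_group_limit(struct toy_group *g)
{
	if (!atomic_load(&g->activated)) {
		/* Stands in for static_key_slow_inc(): once this returns,
		 * every call site sees the key as enabled. */
		atomic_fetch_add(&toy_key_users, 1);
		atomic_store(&g->activated, true);
	}
	/* Only now may the accounting paths start charging this group. */
	atomic_store(&g->active, true);
}

/* Charge path: the global key gates the check, the flag gates the group. */
static bool toy_charge(struct toy_group *g, long pages)
{
	if (atomic_load(&toy_key_users) == 0 || !atomic_load(&g->active))
		return false; /* not accounted */
	g->charged += pages;
	return true;
}

static void toy_uncharge(struct toy_group *g, long pages)
{
	assert(g->charged >= pages);
	g->charged -= pages;
}

/* Real destroy: runs only once no charges are in flight any more. */
static void toy_group_destroy(struct toy_group *g)
{
	assert(g->charged == 0);
	atomic_store(&g->active, false);
	if (atomic_exchange(&g->activated, false))
		atomic_fetch_sub(&toy_key_users, 1);
}
```

In this sketch a group that was never limited is not accounted even while the key is enabled for another group, which is the per-memcg behaviour the message argues for.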
2012-05-30 06:07:11 +08:00
|
|
|
#include <linux/bitops.h>
|
2006-07-03 15:25:35 +08:00
|
|
|
#include <linux/lockdep.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/netdevice.h>
|
|
|
|
#include <linux/skbuff.h> /* struct sk_buff */
|
2006-12-04 12:15:30 +08:00
|
|
|
#include <linux/mm.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/security.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2011-04-05 13:30:30 +08:00
|
|
|
#include <linux/uaccess.h>
|
mm: memcontrol: lockless page counters
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page. The
counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then only
happens when interfacing with userspace.
The removed locking overhead is noticeable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:
vanilla:
       18631648.500498  task-clock (msec)         # 140.643 CPUs utilized   ( +- 0.33% )
             1,380,638  context-switches          # 0.074 K/sec             ( +- 0.75% )
                24,390  cpu-migrations            # 0.001 K/sec             ( +- 8.44% )
         1,843,305,768  page-faults               # 0.099 M/sec             ( +- 0.00% )
    50,134,994,088,218  cycles                    # 2.691 GHz               ( +- 0.33% )
       <not supported>  stalled-cycles-frontend
       <not supported>  stalled-cycles-backend
     8,049,712,224,651  instructions              # 0.16 insns per cycle    ( +- 0.04% )
     1,586,970,584,979  branches                  # 85.176 M/sec            ( +- 0.05% )
         1,724,989,949  branch-misses             # 0.11% of all branches   ( +- 0.48% )
         132.474343877  seconds time elapsed                                ( +- 0.21% )
lockless:
       12195979.037525  task-clock (msec)         # 133.480 CPUs utilized   ( +- 0.18% )
               832,850  context-switches          # 0.068 K/sec             ( +- 0.54% )
                15,624  cpu-migrations            # 0.001 K/sec             ( +- 10.17% )
         1,843,304,774  page-faults               # 0.151 M/sec             ( +- 0.00% )
    32,811,216,801,141  cycles                    # 2.690 GHz               ( +- 0.18% )
       <not supported>  stalled-cycles-frontend
       <not supported>  stalled-cycles-backend
     9,999,265,091,727  instructions              # 0.30 insns per cycle    ( +- 0.10% )
     2,076,759,325,203  branches                  # 170.282 M/sec           ( +- 0.12% )
         1,656,917,214  branch-misses             # 0.08% of all branches   ( +- 0.55% )
          91.369330729  seconds time elapsed                                ( +- 0.45% )
On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.
Notable differences between the old and new API:
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()
- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()
- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves
- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitly requested
hard upper limits.
- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim up to 2MB (4MB) before the limit
would have been reached. This should be acceptable.
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
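The speculative charging described in the last bullet of this commit message (charge first, roll back if the result exceeds the limit) can be illustrated with a small user-space sketch using C11 atomics. It is not the kernel implementation: `struct toy_page_counter` and the `toy_*` helpers are hypothetical names, and the hierarchy of parent counters is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical word-sized counter: usage and limit are both in pages. */
struct toy_page_counter {
	atomic_long usage;
	long limit;
};

/* Charge speculatively, then undo the charge if it overshot the limit. */
static bool toy_try_charge(struct toy_page_counter *c, long nr_pages)
{
	long new_usage = atomic_fetch_add(&c->usage, nr_pages) + nr_pages;

	if (new_usage > c->limit) {
		/* Roll back. Between the add and this sub, concurrent
		 * smaller charges can observe the inflated usage and fail
		 * although they would otherwise have fit. */
		atomic_fetch_sub(&c->usage, nr_pages);
		return false;
	}
	return true;
}

/* Give back pages that were charged earlier. */
static void toy_uncharge(struct toy_page_counter *c, long nr_pages)
{
	long old = atomic_fetch_sub(&c->usage, nr_pages);

	assert(old >= nr_pages);
}
```

The window between the add and the rollback is the source of the bounded error mentioned above: a failing large charge can briefly lock out smaller ones, in exchange for a charge path that needs no lock.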
2014-12-11 07:42:31 +08:00
|
|
|
#include <linux/page_counter.h>
|
2011-12-12 05:47:02 +08:00
|
|
|
#include <linux/memcontrol.h>
|
2012-02-24 15:31:31 +08:00
|
|
|
#include <linux/static_key.h>
|
2012-02-13 11:58:52 +08:00
|
|
|
#include <linux/sched.h>
|
2015-11-26 13:55:39 +08:00
|
|
|
#include <linux/wait.h>
|
2015-12-08 06:38:52 +08:00
|
|
|
#include <linux/cgroup-defs.h>
|
2017-10-06 13:21:27 +08:00
|
|
|
#include <linux/rbtree.h>
|
2008-11-17 11:39:21 +08:00
|
|
|
#include <linux/rculist_nulls.h>
|
2009-07-08 20:09:13 +08:00
|
|
|
#include <linux/poll.h>
|
2020-07-23 14:08:50 +08:00
|
|
|
#include <linux/sockptr.h>
|
2020-11-13 23:08:09 +08:00
|
|
|
#include <linux/indirect_call_wrapper.h>
|
2010-11-16 03:58:26 +08:00
|
|
|
#include <linux/atomic.h>
|
2017-06-30 18:08:01 +08:00
|
|
|
#include <linux/refcount.h>
|
tcp: defer skb freeing after socket lock is released
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.
A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.
Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.
This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.
This patch adds logic to defer these frees until after the socket
lock is released, or to do them directly from BH handler if possible.
Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free asymmetry,
when BH handler and user thread do not run on same cpu or
NUMA node.
One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
Tested:
100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs
MTU : 1500
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+(headers)
Before: 82 Gbit
After: 95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
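The idea in this commit message, parking consumed buffers on a list while the socket lock is held and freeing them only after it is released, can be sketched with a single-threaded toy model. All names here (`struct toy_buf`, `struct toy_sock`, the `toy_*` functions) are hypothetical, and a plain singly linked list stands in for whatever list type the kernel uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer: only the list linkage matters for this sketch. */
struct toy_buf {
	struct toy_buf *next;
};

/* Hypothetical socket with a lock flag and a deferred-free list. */
struct toy_sock {
	bool lock_held;
	struct toy_buf *defer_list; /* consumed buffers waiting to be freed */
	unsigned int freed;         /* number of buffers actually freed */
};

static void toy_lock(struct toy_sock *sk)
{
	assert(!sk->lock_held);
	sk->lock_held = true;
}

/* Payload has been copied out: queue the buffer instead of freeing it. */
static void toy_consume(struct toy_sock *sk, struct toy_buf *buf)
{
	assert(sk->lock_held);
	buf->next = sk->defer_list;
	sk->defer_list = buf;
}

/* Detach the whole list first, then free it with no lock involved. */
static void toy_flush_deferred(struct toy_sock *sk)
{
	struct toy_buf *buf = sk->defer_list;

	sk->defer_list = NULL;
	while (buf) {
		struct toy_buf *next = buf->next;

		free(buf);
		sk->freed++;
		buf = next;
	}
}

/* Release the lock, then do the expensive frees outside of it. */
static void toy_unlock(struct toy_sock *sk)
{
	assert(sk->lock_held);
	sk->lock_held = false;
	toy_flush_deferred(sk);
}
```

The lock hold time in this model no longer includes the cost of freeing, which is the effect the throughput numbers above are attributed to.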
2021-11-16 03:02:46 +08:00
|
|
|
#include <linux/llist.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <net/dst.h>
|
|
|
|
#include <net/checksum.h>
|
2015-03-16 12:12:12 +08:00
|
|
|
#include <net/tcp_states.h>
|
2014-08-05 10:11:46 +08:00
|
|
|
#include <linux/net_tstamp.h>
|
2018-01-05 06:03:54 +08:00
|
|
|
#include <net/l3mdev.h>
|
2021-08-04 15:55:56 +08:00
|
|
|
#include <uapi/linux/socket.h>
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
 * This structure really needs to be cleaned up.
 * Most of it is for TCP, and not used by any of
 * the other protocols.
 */

/* Define this to get the SOCK_DBG debugging facility. */
#define SOCK_DEBUGGING
#ifdef SOCK_DEBUGGING
#define SOCK_DEBUG(sk, msg...) do { if ((sk) && sock_flag((sk), SOCK_DBG)) \
					printk(KERN_DEBUG msg); } while (0)
#else
|
2008-03-22 06:54:53 +08:00
|
|
|
/* Validate arguments and do nothing */
|
2011-11-01 08:11:33 +08:00
|
|
|
static inline __printf(2, 3)
|
2012-05-17 06:48:15 +08:00
|
|
|
void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)
|
2008-03-22 06:54:53 +08:00
|
|
|
{
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif
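The two SOCK_DEBUG variants above follow a common pattern: a `do { } while (0)` macro when debugging is compiled in, and an empty inline function with a printf format attribute when it is not, so that the arguments are still type-checked. A user-space sketch of the same pattern follows. `TOY_DEBUG`, `TOY_DEBUGGING` and `struct toy_sock` are hypothetical names, `fprintf` stands in for `printk`, and the format attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical socket: "debug" stands in for the SOCK_DBG flag. */
struct toy_sock {
	bool debug;
	unsigned int logged; /* messages emitted, kept for illustration */
};

#define TOY_DEBUGGING

#ifdef TOY_DEBUGGING
/* do { } while (0) makes the macro behave like a single statement. */
#define TOY_DEBUG(sk, ...)					\
	do {							\
		if ((sk) && (sk)->debug) {			\
			fprintf(stderr, __VA_ARGS__);		\
			(sk)->logged++;				\
		}						\
	} while (0)
#else
/* Validate arguments and do nothing. */
static inline __attribute__((format(printf, 2, 3)))
void TOY_DEBUG(struct toy_sock *sk, const char *msg, ...)
{
	(void)sk;
	(void)msg;
}
#endif
```

Either way, a call such as `TOY_DEBUG(sk, "state %d\n", state);` compiles, and a mismatch between the format string and its arguments is reported by the compiler in both configurations.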
|
|
|
|
|
|
|
|
/* This is the per-socket lock.  The spinlock provides a synchronization
 * between user contexts and software interrupt processing, whereas the
 * mini-semaphore synchronizes multiple users amongst themselves.
 */
typedef struct {
|
|
|
|
spinlock_t slock;
|
2007-09-12 16:44:19 +08:00
|
|
|
int owned;
|
2005-04-17 06:20:36 +08:00
|
|
|
wait_queue_head_t wq;
|
2006-07-03 15:25:35 +08:00
|
|
|
/*
 * We express the mutex-alike socket_lock semantics
 * to the lock validator by explicitly managing
 * the slock as a lock variant (in addition to
 * the slock itself):
 */
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
} socket_lock_t;
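How the `owned` flag interacts with softirq processing can be shown with a single-threaded toy model: while a user context owns the socket, incoming packets are parked on a backlog and are processed when the owner releases the lock. This is only a sketch under simplifying assumptions. The names are hypothetical, the spinlock that would protect `owned` is omitted, and a second user, who would sleep on the wait queue in the real structure, is simply rejected by an assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of socket_lock_t to its ownership flag. */
struct toy_socket_lock {
	bool owned; /* a user context currently owns the socket */
};

struct toy_sock {
	struct toy_socket_lock lock;
	unsigned int backlog;   /* packets queued while the socket was owned */
	unsigned int processed; /* packets fully processed */
};

/* User context takes ownership of the socket. */
static void toy_lock_sock(struct toy_sock *sk)
{
	assert(!sk->lock.owned); /* a real second user would sleep here */
	sk->lock.owned = true;
}

/* Softirq side: process directly, or queue if a user owns the socket. */
static void toy_bh_receive(struct toy_sock *sk)
{
	if (sk->lock.owned)
		sk->backlog++;
	else
		sk->processed++;
}

/* User context drops ownership and first drains the backlog. */
static void toy_release_sock(struct toy_sock *sk)
{
	assert(sk->lock.owned);
	sk->processed += sk->backlog;
	sk->backlog = 0;
	sk->lock.owned = false;
}
```

The model makes one cost visible: the longer a user context owns the socket, the more work accumulates for it to do at release time.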
|
|
|
|
|
|
|
|
struct sock;
|
2005-08-10 11:09:30 +08:00
|
|
|
struct proto;
|
2007-12-04 17:15:45 +08:00
|
|
|
struct net;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-12-02 15:33:10 +08:00
|
|
|
typedef __u32 __bitwise __portpair;
|
|
|
|
typedef __u64 __bitwise __addrpair;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
2005-05-01 23:59:25 +08:00
|
|
|
* struct sock_common - minimal network layer representation of sockets
|
net: optimize INET input path further
Followup of commit b178bb3dfc30 (net: reorder struct sock fields)
Optimize INET input path a bit further, by :
1) moving sk_refcnt close to sk_lock.
This reduces number of dirtied cache lines by one on 64bit arches (and
64 bytes cache line size).
2) moving inet_daddr & inet_rcv_saddr at the beginning of sk
(same cache line as hash / family / bound_dev_if / nulls_node)
This reduces number of accessed cache lines in lookups by one, and does
not increase size of inet and timewait socks.
inet and tw sockets now share same place-holder for these fields.
Before patch :
offsetof(struct sock, sk_refcnt) = 0x10
offsetof(struct sock, sk_lock) = 0x40
offsetof(struct sock, sk_receive_queue) = 0x60
offsetof(struct inet_sock, inet_daddr) = 0x270
offsetof(struct inet_sock, inet_rcv_saddr) = 0x274
After patch :
offsetof(struct sock, sk_refcnt) = 0x44
offsetof(struct sock, sk_lock) = 0x48
offsetof(struct sock, sk_receive_queue) = 0x68
offsetof(struct inet_sock, inet_daddr) = 0x0
offsetof(struct inet_sock, inet_rcv_saddr) = 0x4
compute_score() (udp or tcp) now uses a single cache line per ignored
item, instead of two.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01 03:04:07 +08:00
|
|
|
* @skc_daddr: Foreign IPv4 addr
|
|
|
|
* @skc_rcv_saddr: Bound local IPv4 addr
|
2020-02-16 03:42:37 +08:00
|
|
|
* @skc_addrpair: 8-byte-aligned __u64 union of @skc_daddr & @skc_rcv_saddr
|
2009-07-16 07:13:10 +08:00
|
|
|
* @skc_hash: hash value used with various protocol lookup tables
|
2009-11-08 18:17:30 +08:00
|
|
|
* @skc_u16hashes: two u16 hash values used by UDP lookup tables
|
2012-11-30 17:49:27 +08:00
|
|
|
* @skc_dport: placeholder for inet_dport/tw_dport
|
|
|
|
* @skc_num: placeholder for inet_num/tw_num
|
2020-02-16 03:42:37 +08:00
|
|
|
* @skc_portpair: __u32 union of @skc_dport & @skc_num
|
2005-05-01 23:59:25 +08:00
|
|
|
* @skc_family: network address family
|
|
|
|
* @skc_state: Connection state
|
|
|
|
* @skc_reuse: %SO_REUSEADDR setting
|
2013-01-22 17:49:50 +08:00
|
|
|
* @skc_reuseport: %SO_REUSEPORT setting
|
2020-02-16 03:42:37 +08:00
|
|
|
* @skc_ipv6only: socket is IPV6 only
|
|
|
|
* @skc_net_refcnt: socket is using net ref counting
|
2005-05-01 23:59:25 +08:00
|
|
|
* @skc_bound_dev_if: bound device index if != 0
|
|
|
|
* @skc_bind_node: bind hash linkage for various protocol lookup tables
|
2009-11-08 18:17:58 +08:00
|
|
|
* @skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
|
2005-08-10 11:09:30 +08:00
|
|
|
* @skc_prot: protocol handlers inside a network family
|
2007-09-12 17:58:02 +08:00
|
|
|
* @skc_net: reference to the network namespace of this socket
|
2020-02-16 03:42:37 +08:00
|
|
|
* @skc_v6_daddr: IPV6 destination address
|
|
|
|
* @skc_v6_rcv_saddr: IPV6 source address
|
|
|
|
* @skc_cookie: socket's cookie value
|
2010-12-01 03:04:07 +08:00
|
|
|
* @skc_node: main hash linkage for various protocol lookup tables
|
|
|
|
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
|
|
|
|
* @skc_tx_queue_mapping: tx queue number for this connection
|
2018-06-30 12:26:57 +08:00
|
|
|
* @skc_rx_queue_mapping: rx queue number for this connection
|
2015-10-09 10:33:22 +08:00
|
|
|
* @skc_flags: place holder for sk_flags
|
|
|
|
* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
|
|
|
|
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
|
2020-02-16 03:42:37 +08:00
|
|
|
* @skc_listener: connection request listener socket (aka rsk_listener)
|
|
|
|
* [union with @skc_flags]
|
|
|
|
* @skc_tw_dr: (aka tw_dr) ptr to &struct inet_timewait_death_row
|
|
|
|
* [union with @skc_flags]
|
2015-10-09 10:33:21 +08:00
|
|
|
* @skc_incoming_cpu: record/match cpu processing incoming packets
|
2020-02-16 03:42:37 +08:00
|
|
|
* @skc_rcv_wnd: (aka rsk_rcv_wnd) TCP receive window size (possibly scaled)
|
|
|
|
* [union with @skc_incoming_cpu]
|
|
|
|
* @skc_tw_rcv_nxt: (aka tw_rcv_nxt) TCP window next expected seq number
|
|
|
|
* [union with @skc_incoming_cpu]
|
2010-12-01 03:04:07 +08:00
|
|
|
* @skc_refcnt: reference count
|
2005-05-01 23:59:25 +08:00
|
|
|
*
|
|
|
|
* This is the minimal network layer representation of sockets, the header
|
2005-08-10 11:09:30 +08:00
|
|
|
* for struct sock and struct inet_timewait_sock.
|
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
struct sock_common {
|
2012-11-30 17:49:27 +08:00
|
|
|
union {
|
2012-12-02 15:33:10 +08:00
|
|
|
__addrpair skc_addrpair;
|
2012-11-30 17:49:27 +08:00
|
|
|
struct {
|
|
|
|
__be32 skc_daddr;
|
|
|
|
__be32 skc_rcv_saddr;
|
|
|
|
};
|
|
|
|
};
|
2009-11-08 18:17:30 +08:00
|
|
|
union {
|
|
|
|
unsigned int skc_hash;
|
|
|
|
__u16 skc_u16hashes[2];
|
|
|
|
};
|
2012-11-30 17:49:27 +08:00
|
|
|
/* skc_dport && skc_num must be grouped as well */
|
|
|
|
union {
|
2012-12-02 15:33:10 +08:00
|
|
|
__portpair skc_portpair;
|
2012-11-30 17:49:27 +08:00
|
|
|
struct {
|
|
|
|
__be16 skc_dport;
|
|
|
|
__u16 skc_num;
|
|
|
|
};
|
|
|
|
};
|
|
|
|
|
2009-07-16 07:13:10 +08:00
|
|
|
unsigned short skc_family;
|
|
|
|
volatile unsigned char skc_state;
|
2013-01-22 17:49:50 +08:00
|
|
|
unsigned char skc_reuse:4;
|
2014-06-27 23:36:16 +08:00
|
|
|
unsigned char skc_reuseport:1;
|
|
|
|
unsigned char skc_ipv6only:1;
|
2015-05-09 10:10:31 +08:00
|
|
|
unsigned char skc_net_refcnt:1;
|
2009-07-16 07:13:10 +08:00
|
|
|
int skc_bound_dev_if;
|
2009-11-08 18:17:58 +08:00
|
|
|
union {
|
|
|
|
struct hlist_node skc_bind_node;
|
2016-04-01 23:52:13 +08:00
|
|
|
struct hlist_node skc_portaddr_node;
|
2009-11-08 18:17:58 +08:00
|
|
|
};
|
2005-08-10 11:09:30 +08:00
|
|
|
struct proto *skc_prot;
|
2015-03-12 12:06:44 +08:00
|
|
|
possible_net_t skc_net;
|
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common.
Now it is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counterpart, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-04 06:42:29 +08:00
|
|
|
|
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
|
|
|
struct in6_addr skc_v6_daddr;
|
|
|
|
struct in6_addr skc_v6_rcv_saddr;
|
|
|
|
#endif
|
|
|
|
|
2015-03-12 09:53:14 +08:00
|
|
|
atomic64_t skc_cookie;
|
|
|
|
|
2015-10-09 10:33:22 +08:00
|
|
|
/* following fields are padding to force
 * offset(struct sock, sk_refcnt) == 128 on 64bit arches
 * assuming IPV6 is enabled. We use this padding differently
 * for different kinds of 'sockets'
 */
|
|
|
|
union {
|
|
|
|
unsigned long skc_flags;
|
|
|
|
struct sock *skc_listener; /* request_sock */
|
|
|
|
struct inet_timewait_death_row *skc_tw_dr; /* inet_timewait_sock */
|
|
|
|
};
|
2010-12-01 03:04:07 +08:00
|
|
|
/*
 * fields between dontcopy_begin/dontcopy_end
 * are not copied in sock_copy()
 */
|
2011-01-09 01:39:21 +08:00
|
|
|
/* private: */
|
2010-12-01 03:04:07 +08:00
|
|
|
int skc_dontcopy_begin[0];
|
2011-01-09 01:39:21 +08:00
|
|
|
/* public: */
|
2010-12-01 03:04:07 +08:00
|
|
|
union {
|
|
|
|
struct hlist_node skc_node;
|
|
|
|
struct hlist_nulls_node skc_nulls_node;
|
|
|
|
};
|
2018-06-30 12:26:51 +08:00
|
|
|
unsigned short skc_tx_queue_mapping;
|
2021-02-11 19:35:51 +08:00
|
|
|
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
|
2018-06-30 12:26:57 +08:00
|
|
|
unsigned short skc_rx_queue_mapping;
|
|
|
|
#endif
|
2015-10-09 10:33:23 +08:00
|
|
|
union {
|
|
|
|
int skc_incoming_cpu;
|
|
|
|
u32 skc_rcv_wnd;
|
2015-10-09 10:33:24 +08:00
|
|
|
u32 skc_tw_rcv_nxt; /* struct tcp_timewait_sock */
|
2015-10-09 10:33:23 +08:00
|
|
|
};
|
2015-10-09 10:33:21 +08:00
|
|
|
|
2017-06-30 18:08:01 +08:00
|
|
|
refcount_t skc_refcnt;
|
2011-01-09 01:39:21 +08:00
|
|
|
/* private: */
|
2010-12-01 03:04:07 +08:00
|
|
|
int skc_dontcopy_end[0];
|
2015-10-09 10:33:23 +08:00
|
|
|
union {
|
|
|
|
u32 skc_rxhash;
|
|
|
|
u32 skc_window_clamp;
|
2015-10-09 10:33:24 +08:00
|
|
|
u32 skc_tw_snd_nxt; /* struct tcp_timewait_sock */
|
2015-10-09 10:33:23 +08:00
|
|
|
};
|
2011-01-09 01:39:21 +08:00
|
|
|
/* public: */
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2020-08-26 02:29:13 +08:00
|
|
|
struct bpf_local_storage;
|
2021-12-29 08:49:13 +08:00
|
|
|
struct sk_filter;
|
bpf: Introduce bpf sk local storage
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
into different bpf running context.
this patch is another effort to make bpf's network programming
more intuitive to do (together with memory and performance benefit).
When a bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuple (src/dst ip/port) as the key.
If multiple bpf progs need to store different sk data, multiple maps
have to be defined, which wastes memory by storing the duplicated
keys (i.e. the 4-tuple here) in each of the bpf maps.
[ The smallest key could be the sk pointer itself which requires
some enhancement in the verifier and it is a separate topic. ]
Also, the bpf prog needs to clean up the elem when sk is freed.
Otherwise, the bpf map will become full and un-usable quickly.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).
The size of the map needs to be predefined, which usually ends up
with an over-provisioned map in production. Even if the map were resizable,
sks naturally come and go, so this potential resize
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxying through a bpf map.
This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use. The space will be allocated when the first bpf
prog has created data for this particular sk.
The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update). bpf_spin_lock should be used if the inline update needs
to be protected.
BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs. The map enforces
BTF to allow printing the sk-local-storage during a system-wise
sk dump (e.g. "ss -ta") in the future.
The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to look up, update or delete
"sk-local-storage" data from a particular sk.
Think of the map as the meta-data (or "type") of a "sk-local-storage". This
particular "type" of "sk-local-storage" data can then be stored in any sk.
The main purposes of this map are:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
when the map is freed.
sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage"). When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list. The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.
The lookup should be very fast, as fast as indexing an
array at a stable offset. At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "types" that the
system can have. Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[] which is a stable sized array. Each
"sk-local-storage" type has a stable offset to the cache[] array.
In the future, a map's flag could be introduced to do cache
opt-out/enforcement if it became necessary.
The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share a map. On the program side, having a few bpf_progs
running in the networking hotpath is already a lot. The bpf_prog
should have already consolidated the existing sock-key-ed map usage
to minimize the map lookup penalty. 16 has enough runway to grow.
All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.
bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete(). The verifier can then enforce the
ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to
"create" new elem if one does not exist in the sk. It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.
Misc notes:
----------
1. map_get_next_key is not supported. From the userspace syscall
perspective, the map has the socket fd as the key while the map
can be shared by pinned-file or map-id.
Since btf is enforced, the existing "ss" could be enhanced to pretty
print the local-storage.
Supporting a kernel defined btf with 4 tuples as the return key could
be explored later also.
2. The sk->sk_lock cannot be acquired. Atomic operations are used instead.
e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
Please refer to the source code comments for the details in
synchronization cases and considerations.
3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:
One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (The verifier is modified to support the sk ptr as the key,
which should have shortened the key lookup time.)
Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt. netperf is used to drive
data with 4096 connected UDP sockets.
BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
loaded_at 2019-04-15T13:46:39-0700 uid 0
xlated 344B jited 258B memlock 4096B map_ids 16
btf_id 5
BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
loaded_at 2019-04-15T13:47:54-0700 uid 0
xlated 168B jited 156B memlock 4096B map_ids 17
btf_id 6
Here is a high-level picture of how the objects are organized:
       sk
    ┌──────┐
    │      │
    │      │
    │      │
    │*sk_bpf_storage─────▶ bpf_sk_storage
    └──────┘                 ┌───────┐
                 ┌───────────┤ list  │
                 │           │       │
                 │           │       │
                 │           │       │
                 │           └───────┘
                 │
                 │     elem
                 │  ┌────────┐
                 ├─▶│ snode  │
                 │  ├────────┤
                 │  │  data  │          bpf_map
                 │  ├────────┤        ┌─────────┐
                 │  │map_node│◀─┬─────┤  list   │
                 │  └────────┘  │     │         │
                 │              │     │         │
                 │     elem     │     │         │
                 │  ┌────────┐  │     └─────────┘
                 └─▶│ snode  │  │
                    ├────────┤  │
   bpf_map          │  data  │  │
 ┌─────────┐        ├────────┤  │
 │  list   ├───────▶│map_node│  │
 │         │        └────────┘  │
 │         │                    │
 │         │           elem     │
 └─────────┘        ┌────────┐  │
                 ┌─▶│ snode  │  │
                 │  ├────────┤  │
                 │  │  data  │  │
                 │  ├────────┤  │
                 │  │map_node│◀─┘
                 │  └────────┘
                 │
                 │
                 │          ┌───────┐
     sk          └──────────│ list  │
  ┌──────┐                  │       │
  │      │                  │       │
  │      │                  │       │
  │      │                  └───────┘
  │*sk_bpf_storage───────▶bpf_sk_storage
  └──────┘
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
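The lookup described in this commit message (the "map" pointer used as the key, a per-sk list, and a 16-slot cache of the last hit per type) can be modelled in plain userspace C. This is only a sketch under simplifying assumptions: every `model_*` name is hypothetical, the list is singly linked, and the RCU and locking that the real implementation needs are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CACHE_SIZE 16	/* cache size quoted in the commit message */

struct model_map {
	unsigned int cache_idx;	/* stable slot of this "type" in cache[] */
};

struct model_elem {
	struct model_elem *next;	/* link in the per-sk list */
	const struct model_map *map;	/* owning map, used as the lookup key */
	unsigned int data;		/* stored value, e.g. a "u32 cnt" */
};

struct model_storage {
	struct model_elem *list;
	struct model_elem *cache[MODEL_CACHE_SIZE];
};

static struct model_elem *model_lookup(struct model_storage *st,
				       const struct model_map *map)
{
	struct model_elem *e;

	assert(map->cache_idx < MODEL_CACHE_SIZE);

	/* Fast path: the slot still holds an elem of the requested type. */
	e = st->cache[map->cache_idx];
	if (e && e->map == map)
		return e;

	/* Slow path: walk the list, then remember the hit in the cache. */
	for (e = st->list; e; e = e->next) {
		if (e->map == map) {
			st->cache[map->cache_idx] = e;
			return e;
		}
	}
	return NULL;
}

/* Small usage example; returns 1 when every step behaves as expected. */
static int model_storage_demo(void)
{
	struct model_map map_a = { .cache_idx = 0 };
	struct model_map map_b = { .cache_idx = 1 };
	struct model_map map_c = { .cache_idx = 2 };
	struct model_elem elem_b = { .next = NULL, .map = &map_b, .data = 7 };
	struct model_elem elem_a = { .next = &elem_b, .map = &map_a, .data = 3 };
	struct model_storage st = { .list = &elem_a };
	struct model_elem *hit = model_lookup(&st, &map_b);

	return hit == &elem_b &&			/* found by the list walk */
	       st.cache[1] == &elem_b &&		/* and cached afterwards */
	       model_lookup(&st, &map_b) == &elem_b &&	/* served from the cache */
	       model_lookup(&st, &map_c) == NULL;	/* no elem of this type */
}
```

Two maps that happened to share a `cache_idx` would simply evict each other from the slot and fall back to the list walk, which is why the cache is an optimization rather than a hard limit on the number of types.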
2019-04-27 07:39:39 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
|
|
|
* struct sock - network layer representation of sockets
|
2005-08-10 11:09:30 +08:00
|
|
|
* @__sk_common: shared layout with inet_timewait_sock
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
|
|
|
|
* @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
|
|
|
|
* @sk_lock: synchronizer
|
2017-03-09 16:09:05 +08:00
|
|
|
* @sk_kern_sock: True if sock is using kernel lock classes
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_rcvbuf: size of receive buffer in bytes
|
2010-04-29 19:01:49 +08:00
|
|
|
* @sk_wq: sock wait queue and async head
|
2013-10-08 00:01:39 +08:00
|
|
|
* @sk_rx_dst: receive input route used by early demux
|
2021-10-26 00:48:16 +08:00
|
|
|
* @sk_rx_dst_ifindex: ifindex for @sk_rx_dst
|
2021-10-26 00:48:17 +08:00
|
|
|
* @sk_rx_dst_cookie: cookie for @sk_rx_dst
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_dst_cache: destination cache
|
2017-02-07 05:14:11 +08:00
|
|
|
* @sk_dst_pending_confirm: need to confirm neighbour
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_policy: flow policy
|
|
|
|
* @sk_receive_queue: incoming packets
|
|
|
|
* @sk_wmem_alloc: transmit queue bytes committed
|
2017-07-13 00:29:06 +08:00
|
|
|
* @sk_tsq_flags: TCP Small Queues flags
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_write_queue: Packet sending queue
|
|
|
|
* @sk_omem_alloc: "o" is "option" or "other"
|
|
|
|
* @sk_wmem_queued: persistent queue size
|
|
|
|
* @sk_forward_alloc: space allocated forward
|
2021-09-30 01:25:11 +08:00
|
|
|
* @sk_reserved_mem: space reserved and non-reclaimable for the socket
|
2013-06-10 16:39:50 +08:00
|
|
|
* @sk_napi_id: id of the last napi context to receive data for sk
|
2013-06-14 21:33:57 +08:00
|
|
|
* @sk_ll_usec: usecs to busypoll when there is no data
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_allocation: allocation mode
|
tcp: TSO packets automatic sizing
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
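The formula in this commit message, sk_pacing_rate = 2 * cwnd * mss / srtt, and the goal of sending about one TSO packet per millisecond can be worked through numerically. The sketch below is an illustration only: the `model_*` helpers are hypothetical, srtt is assumed to be in microseconds, the rate is in bytes per second, and the upper bound that the GSO limits would impose is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, in bytes per second.
 * Assumption: srtt_us is the smoothed RTT in microseconds. */
static uint64_t model_pacing_rate(uint32_t cwnd, uint32_t mss, uint32_t srtt_us)
{
	assert(srtt_us > 0);
	return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/* Segments per TSO packet so that roughly one packet leaves per
 * millisecond, never below min_tso_segs (tcp_min_tso_segs, default 2). */
static uint32_t model_tso_segs(uint64_t pacing_rate, uint32_t mss,
			       uint32_t min_tso_segs)
{
	uint64_t bytes_per_ms, segs;

	assert(mss > 0);
	bytes_per_ms = pacing_rate / 1000;
	segs = bytes_per_ms / mss;
	return segs < min_tso_segs ? min_tso_segs : (uint32_t)segs;
}
```

With cwnd = 10, mss = 1460 and srtt = 100 ms, the rate is 292000 bytes per second, which is less than one MSS per millisecond, so the result is clamped to the minimum of 2 segments. With cwnd = 100 and srtt = 1 ms, the same arithmetic gives 200 segments before any GSO cap is applied.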
2013-08-27 20:46:32 +08:00
|
|
|
* @sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
|
2017-05-16 19:24:36 +08:00
|
|
|
* @sk_pacing_status: Pacing status (requested, handled by sch_fq)
|
2013-09-29 16:12:40 +08:00
|
|
|
* @sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_sndbuf: size of send buffer in bytes
|
2017-07-13 00:29:06 +08:00
|
|
|
* @__sk_flags_offset: empty field used to determine location of bitfield
|
2016-10-24 00:28:29 +08:00
|
|
|
* @sk_padding: unused element for alignment
|
2014-05-23 23:47:19 +08:00
|
|
|
* @sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
|
|
|
|
* @sk_no_check_rx: allow zero checksum in RX packets
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
|
2021-11-16 03:02:35 +08:00
|
|
|
* @sk_gso_disabled: if set, NETIF_F_GSO_MASK is forbidden.
|
2006-07-01 04:36:35 +08:00
|
|
|
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
|
[NET]: Add per-connection option to set max TSO frame size
Update: My mailer ate one of Jarek's feedback mails... Fixed the
parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the
whitespace issue due to a patch import botch. Changed the types from
u32 to unsigned int to be more consistent with other variables in the
area. Also brought the patch up to the latest net-2.6.26 tree.
Update: Made gso_max_size container 32 bits, not 16. Moved the
location of gso_max_size within netdev to be less hotpath. Made more
consistent names between the sock and netdev layers, and added a
define for the max GSO size.
Update: Respun for net-2.6.26 tree.
Update: changed max_gso_frame_size and sk_gso_max_size from signed to
unsigned - thanks Stephen!
This patch adds the ability for device drivers to control the size of
the TSO frames being sent to them, per TCP connection. By setting the
netdevice's gso_max_size value, the socket layer will set the GSO
frame size based on that value. This will propagate into the TCP
layer, and send TSO's of that size to the hardware.
This can be desirable to help tune the bursty nature of TSO on a
per-adapter basis, where one may have 1 GbE and 10 GbE devices
coexisting in a system, one running multiqueue and the other not, etc.
This can also be desirable for devices that cannot support full 64 KB
TSO's, but still want to benefit from some level of segmentation
offloading.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 18:43:19 +08:00
|
|
|
* @sk_gso_max_size: Maximum GSO segment size to build
|
2012-07-31 00:11:42 +08:00
|
|
|
* @sk_gso_max_segs: Maximum number of GSO segments
|
2017-11-12 07:54:12 +08:00
|
|
|
* @sk_pacing_shift: scaling factor for TCP Small Queues
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_lingertime: %SO_LINGER l_linger setting
|
|
|
|
* @sk_backlog: always used with the per-socket spinlock held
|
|
|
|
* @sk_callback_lock: used with the callbacks in the end of this struct
|
|
|
|
* @sk_error_queue: rarely used
|
2007-11-14 12:30:01 +08:00
|
|
|
* @sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
|
|
|
|
* IPV6_ADDRFORM for instance)
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_err: last error
|
2007-11-14 12:30:01 +08:00
|
|
|
* @sk_err_soft: errors that don't cause failure but are the cause of a
|
|
|
|
* persistent failure not just 'timed out'
|
2008-06-18 12:04:56 +08:00
|
|
|
* @sk_drops: raw/udp drops counter
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_ack_backlog: current listen backlog
|
|
|
|
* @sk_max_ack_backlog: listen backlog set in listen()
|
2017-07-13 00:29:06 +08:00
|
|
|
* @sk_uid: user id of owner
|
net: Introduce preferred busy-polling
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic. That means that if the NAPI context is not
scheduled, busy-polling will poll it. If, after busy-polling, the budget is
exceeded, the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavy loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature"), and allow a user to defer enabling interrupts and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will time out, and
regular softirq handling will resume.
In summary: heavy-traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will time out and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
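The last step of the example usage, enabling the options on a socket, might look like the sketch below. It is an assumption-laden illustration rather than the author's code: `model_enable_prefer_busy_poll()` is a hypothetical helper, the fallback option numbers are the asm-generic values and differ on a few architectures, and SO_BUSY_POLL_BUDGET is included only because `sk_busy_poll_budget` is documented alongside `sk_prefer_busy_poll`.

```c
#include <assert.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers. Assumption: these are the
 * asm-generic values; some architectures use different numbers. */
#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

/* Returns 0 on success, -1 if any setsockopt() call fails (errno is set). */
static int model_enable_prefer_busy_poll(int fd, int busy_poll_usecs, int budget)
{
	int on = 1;

	assert(busy_poll_usecs >= 0 && budget > 0);

	/* How long to busy-poll when no data is available. */
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
		       &busy_poll_usecs, sizeof(busy_poll_usecs)) < 0)
		return -1;
	/* Prefer busy-polling over softirq processing. */
	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on)) < 0)
		return -1;
	/* NAPI budget used while busy-polling. */
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
		       &budget, sizeof(budget)) < 0)
		return -1;
	return 0;
}
```

A caller would pass an open socket, for example `model_enable_prefer_busy_poll(fd, 200, 8)`. The calls can fail on kernels that predate these options or when the process lacks the required privileges, so the return value should be checked.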
2020-12-01 02:51:56 +08:00
|
|
|
* @sk_prefer_busy_poll: prefer busypolling over softirq processing
|
2020-12-01 02:51:57 +08:00
|
|
|
* @sk_busy_poll_budget: napi processing budget when busypolling
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_priority: %SO_PRIORITY setting
|
|
|
|
* @sk_type: socket type (%SOCK_STREAM, etc)
|
|
|
|
* @sk_protocol: which protocol this socket belongs in this network family
|
2021-10-02 00:46:22 +08:00
|
|
|
* @sk_peer_lock: lock protecting @sk_peer_pid and @sk_peer_cred
|
2010-08-09 21:41:07 +08:00
|
|
|
* @sk_peer_pid: &struct pid for this socket's peer
|
|
|
|
* @sk_peer_cred: %SO_PEERCRED setting
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_rcvlowat: %SO_RCVLOWAT setting
|
|
|
|
* @sk_rcvtimeo: %SO_RCVTIMEO setting
|
|
|
|
* @sk_sndtimeo: %SO_SNDTIMEO setting
|
2014-07-02 12:32:17 +08:00
|
|
|
* @sk_txhash: computed flow hash for use on transmit
|
2022-01-31 21:31:22 +08:00
|
|
|
* @sk_txrehash: enable TX hash rethink
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_filter: socket filtering instructions
|
|
|
|
* @sk_timer: sock cleanup timer
|
|
|
|
* @sk_stamp: time stamp of last packet received
|
2018-12-28 10:55:09 +08:00
|
|
|
* @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
|
2021-06-30 16:11:59 +08:00
|
|
|
* @sk_tsflags: SO_TIMESTAMPING flags
|
|
|
|
* @sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
|
|
|
|
* for timestamping
|
2014-08-05 10:11:47 +08:00
|
|
|
* @sk_tskey: counter to disambiguate concurrent tstamp requests
|
2017-08-04 04:29:39 +08:00
|
|
|
* @sk_zckey: counter to order MSG_ZEROCOPY notifications
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_socket: Identd and reporting IO signals
|
|
|
|
* @sk_user_data: RPC layer private data
|
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It is done to increase the probability of coalescing small write() calls into
single segments in skbs still in the write queue (not yet sent).
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
It is also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
the page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on the loopback device,
and also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It is possible that some SG-enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub-fragments, since some arches have PAGE_SIZE=65536.
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
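The "16 pages per skb" and "8x" figures in this commit message follow from a ceiling division. The helper below is a hypothetical illustration (`model_frags_needed()` does not exist in the kernel) that reproduces those numbers for a 64KB TSO packet.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fragments needed to cover len bytes when each
 * fragment can hold at most frag_size bytes (ceiling division). */
static size_t model_frags_needed(size_t len, size_t frag_size)
{
	assert(frag_size > 0);
	return (len + frag_size - 1) / frag_size;
}
```

For `len = 65536`, a 4096-byte frag size gives 16 fragments and a 32768-byte frag size gives 2, which is the factor of eight mentioned in the message.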
2012-09-24 07:04:42 +08:00
|
|
|
* @sk_frag: cached page frag
|
2012-04-17 22:03:53 +08:00
|
|
|
* @sk_peek_off: current peek_offset value
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_send_head: front of stuff to transmit
|
2020-02-16 03:42:37 +08:00
|
|
|
* @tcp_rtx_queue: TCP re-transmit queue [union with @sk_send_head]
|
2005-05-01 23:59:26 +08:00
|
|
|
* @sk_security: used by security modules
|
2008-02-19 12:52:13 +08:00
|
|
|
* @sk_mark: generic packet mark
|
2015-12-08 06:38:52 +08:00
|
|
|
* @sk_cgrp_data: cgroup data for this cgroup
|
2016-01-15 07:21:17 +08:00
|
|
|
* @sk_memcg: this socket's memory cgroup association
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_write_pending: a write to stream socket waits to start
|
|
|
|
* @sk_state_change: callback to indicate change in the state of the sock
|
|
|
|
* @sk_data_ready: callback to indicate there is data to be processed
|
|
|
|
* @sk_write_space: callback to indicate there is buffer sending space available
|
|
|
|
* @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
|
|
|
|
* @sk_backlog_rcv: callback to process the backlog
|
2020-02-16 03:42:37 +08:00
|
|
|
* @sk_validate_xmit_skb: ptr to an optional validate function
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
|
2016-01-05 06:41:45 +08:00
|
|
|
* @sk_reuseport_cb: reuseport group container
|
2020-02-16 03:42:37 +08:00
|
|
|
* @sk_bpf_storage: ptr to cache and control for bpf_sk_storage
|
2016-10-24 00:28:29 +08:00
|
|
|
* @sk_rcu: used during RCU grace period
|
2018-07-04 06:42:48 +08:00
|
|
|
* @sk_clockid: clockid used by time-based scheduling (SO_TXTIME)
|
|
|
|
* @sk_txtime_deadline_mode: set deadline mode for SO_TXTIME
|
2020-02-16 03:42:37 +08:00
|
|
|
* @sk_txtime_report_errors: set report errors mode for SO_TXTIME
|
2018-07-04 06:42:48 +08:00
|
|
|
* @sk_txtime_unused: unused txtime flags
|
2021-12-10 15:44:22 +08:00
|
|
|
* @ns_tracker: tracker for netns reference
|
2016-10-24 00:28:29 +08:00
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
struct sock {
|
|
|
|
/*
|
2005-08-10 11:09:30 +08:00
|
|
|
* Now struct inet_timewait_sock also uses sock_common, so please just
|
2005-04-17 06:20:36 +08:00
|
|
|
* do not add anything before this first member (__sk_common) --acme
|
|
|
|
*/
|
|
|
|
struct sock_common __sk_common;
|
2009-07-16 07:13:10 +08:00
|
|
|
#define sk_node __sk_common.skc_node
|
|
|
|
#define sk_nulls_node __sk_common.skc_nulls_node
|
|
|
|
#define sk_refcnt __sk_common.skc_refcnt
|
2009-10-20 07:46:20 +08:00
|
|
|
#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping
|
2021-02-11 19:35:51 +08:00
|
|
|
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
|
2018-06-30 12:26:57 +08:00
|
|
|
#define sk_rx_queue_mapping __sk_common.skc_rx_queue_mapping
|
|
|
|
#endif
|
2009-07-16 07:13:10 +08:00
|
|
|
|
2010-12-01 03:04:07 +08:00
|
|
|
#define sk_dontcopy_begin __sk_common.skc_dontcopy_begin
|
|
|
|
#define sk_dontcopy_end __sk_common.skc_dontcopy_end
|
2009-07-16 07:13:10 +08:00
|
|
|
#define sk_hash __sk_common.skc_hash
|
inet: consolidate INET_TW_MATCH
TCP listener refactoring, part 2 :
We can use a generic lookup, sockets being in whatever state, if
we are sure all relevant fields are at the same place in all socket
types (ESTABLISH, TIME_WAIT, SYN_RECV)
This patch removes these macros :
inet_addrpair, inet_portpair, tw_addrpair, tw_portpair
And adds :
sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr
Then, INET_TW_MATCH() is really the same as INET_MATCH()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
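The idea behind sk_addrpair and sk_portpair is that two adjacent fields are overlaid with one wider integer, so a lookup can compare both at once regardless of socket state. The following userspace model is a sketch under assumptions: the `model_*` names are hypothetical, plain `uint*_t` types stand in for the kernel's byte-order-annotated types, and the keys are built through the same struct so that byte order does not matter.

```c
#include <assert.h>
#include <stdint.h>

struct model_sock_common {
	union {
		uint64_t addrpair;		/* both addresses in one word */
		struct {
			uint32_t daddr;
			uint32_t rcv_saddr;
		};
	};
	union {
		uint32_t portpair;		/* both ports in one word */
		struct {
			uint16_t dport;
			uint16_t num;
		};
	};
};

/* One comparison per pair instead of one per field. */
static int model_inet_match(const struct model_sock_common *sk,
			    const struct model_sock_common *key)
{
	assert(sk && key);
	return sk->addrpair == key->addrpair && sk->portpair == key->portpair;
}

/* Usage example; returns 1 when matching behaves as expected. */
static int model_match_demo(void)
{
	struct model_sock_common sk = { 0 };
	struct model_sock_common key;

	sk.daddr = 0x0a000001;
	sk.rcv_saddr = 0x0a000002;
	sk.dport = 443;
	sk.num = 50000;

	key = sk;
	if (!model_inet_match(&sk, &key))
		return 0;		/* identical keys must match */

	key.dport = 80;			/* changing one port breaks the pair */
	return !model_inet_match(&sk, &key);
}
```

Because the pairs live at the same offsets in every socket type, the same match routine can serve established, TIME_WAIT and (later) SYN_RECV entries, which is the point the commit message makes about INET_TW_MATCH() and INET_MATCH().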
2013-10-02 19:29:50 +08:00
|
|
|
#define sk_portpair __sk_common.skc_portpair
|
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same way.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:22:02 +08:00
|
|
|
#define sk_num __sk_common.skc_num
|
|
|
|
#define sk_dport __sk_common.skc_dport
|
2013-10-02 19:29:50 +08:00
|
|
|
#define sk_addrpair __sk_common.skc_addrpair
|
|
|
|
#define sk_daddr __sk_common.skc_daddr
|
|
|
|
#define sk_rcv_saddr __sk_common.skc_rcv_saddr
|
2005-04-17 06:20:36 +08:00
|
|
|
#define sk_family __sk_common.skc_family
|
|
|
|
#define sk_state __sk_common.skc_state
|
|
|
|
#define sk_reuse __sk_common.skc_reuse
|
2013-01-22 17:49:50 +08:00
|
|
|
#define sk_reuseport __sk_common.skc_reuseport
|
2014-06-27 23:36:16 +08:00
|
|
|
#define sk_ipv6only __sk_common.skc_ipv6only
|
2015-05-09 10:10:31 +08:00
|
|
|
#define sk_net_refcnt __sk_common.skc_net_refcnt
|
2005-04-17 06:20:36 +08:00
|
|
|
#define sk_bound_dev_if __sk_common.skc_bound_dev_if
|
|
|
|
#define sk_bind_node __sk_common.skc_bind_node
|
2005-08-10 11:09:30 +08:00
|
|
|
#define sk_prot __sk_common.skc_prot
|
2007-09-12 17:58:02 +08:00
|
|
|
#define sk_net __sk_common.skc_net
|
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now it is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counterpart, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not easily doable.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-04 06:42:29 +08:00
|
|
|
#define sk_v6_daddr __sk_common.skc_v6_daddr
|
|
|
|
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
|
2015-03-12 09:53:14 +08:00
|
|
|
#define sk_cookie __sk_common.skc_cookie
|
2015-10-09 10:33:21 +08:00
|
|
|
#define sk_incoming_cpu __sk_common.skc_incoming_cpu
|
2015-10-09 10:33:22 +08:00
|
|
|
#define sk_flags __sk_common.skc_flags
|
2015-10-09 10:33:23 +08:00
|
|
|
#define sk_rxhash __sk_common.skc_rxhash
|
2013-10-04 06:42:29 +08:00
|
|
|
|
2021-11-16 03:02:49 +08:00
|
|
|
/* early demux fields */
|
2021-12-24 08:09:58 +08:00
|
|
|
struct dst_entry __rcu *sk_rx_dst;
|
2021-11-16 03:02:49 +08:00
|
|
|
int sk_rx_dst_ifindex;
|
|
|
|
u32 sk_rx_dst_cookie;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
socket_lock_t sk_lock;
|
2016-12-04 03:14:56 +08:00
|
|
|
atomic_t sk_drops;
|
|
|
|
int sk_rcvlowat;
|
|
|
|
struct sk_buff_head sk_error_queue;
|
net: reorder struct sock fields
Right now, fields in struct sock are not optimally ordered, because each
path (RX softirq, TX completion, RX user, TX user) has to touch fields
that are contained in many different cache lines.
The really critical thing is to shrink number of cache lines that are
used at RX softirq time : CPU handling softirqs for a device can receive
many frames per second for many sockets. If load is too big, we can drop
frames at NIC level. RPS or multiqueue cards can help, but better reduce
latency if possible.
This patch starts with UDP protocol, then additional patches will try to
reduce latencies of other ones as well.
At RX softirq time, fields of interest for UDP protocol are :
(not counting ones in inet struct for the lookup)
Read/Written:
sk_refcnt (atomic increment/decrement)
sk_rmem_alloc & sk_backlog.len (to check if there is room in queues)
sk_receive_queue
sk_backlog (if socket locked by user program)
sk_rxhash
sk_forward_alloc
sk_drops
Read only:
sk_rcvbuf (sk_rcvqueues_full())
sk_filter
sk_wq
sk_policy[0]
sk_flags
Additional notes :
- sk_backlog has one hole on 64bit arches. We can fill it to save 8
bytes.
- sk_backlog is used only if RX softirq handler finds the socket while
locked by user.
- sk_rxhash is written only once per flow.
- sk_drops is written only if queues are full
Final layout :
[1] One section grouping all read/write fields, but placing rxhash and
sk_backlog at the end of this section.
[2] One section grouping all read fields in RX handler
(sk_filter, sk_rcvbuf, sk_wq)
[3] Section used by other paths
I'll post a patch on its own to put sk_refcnt at the end of struct
sock_common so that it shares the same cache line as section [1]
New offsets on 64bit arch :
sizeof(struct sock)=0x268
offsetof(struct sock, sk_refcnt) =0x10
offsetof(struct sock, sk_lock) =0x48
offsetof(struct sock, sk_receive_queue)=0x68
offsetof(struct sock, sk_backlog)=0x80
offsetof(struct sock, sk_rmem_alloc)=0x80
offsetof(struct sock, sk_forward_alloc)=0x98
offsetof(struct sock, sk_rxhash)=0x9c
offsetof(struct sock, sk_rcvbuf)=0xa4
offsetof(struct sock, sk_drops) =0xa0
offsetof(struct sock, sk_filter)=0xa8
offsetof(struct sock, sk_wq)=0xb0
offsetof(struct sock, sk_policy)=0xd0
offsetof(struct sock, sk_flags) =0xe0
Instead of :
sizeof(struct sock)=0x270
offsetof(struct sock, sk_refcnt) =0x10
offsetof(struct sock, sk_lock) =0x50
offsetof(struct sock, sk_receive_queue)=0xc0
offsetof(struct sock, sk_backlog)=0x70
offsetof(struct sock, sk_rmem_alloc)=0xac
offsetof(struct sock, sk_forward_alloc)=0x10c
offsetof(struct sock, sk_rxhash)=0x128
offsetof(struct sock, sk_rcvbuf)=0x4c
offsetof(struct sock, sk_drops) =0x16c
offsetof(struct sock, sk_filter)=0x198
offsetof(struct sock, sk_wq)=0x88
offsetof(struct sock, sk_policy)=0x98
offsetof(struct sock, sk_flags) =0x130
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-16 13:56:04 +08:00
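The effect of the reordering described above can be checked with a little arithmetic on the offsets quoted in the commit message. The sketch below is only an illustration, not kernel code: it assumes 64-byte cache lines and a cache-line-aligned struct sock, looks only at each field's starting offset, and uses hypothetical names (`demo_lines_touched`, `demo_new_offsets`, `demo_old_offsets`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines (common on x86-64, not universal). */
#define DEMO_CACHE_LINE 64u
#define DEMO_NFIELDS    13u

/*
 * Offsets copied from the commit message, in the order it lists them:
 * sk_refcnt, sk_lock, sk_receive_queue, sk_backlog, sk_rmem_alloc,
 * sk_forward_alloc, sk_rxhash, sk_rcvbuf, sk_drops, sk_filter, sk_wq,
 * sk_policy, sk_flags.
 */
static const size_t demo_new_offsets[DEMO_NFIELDS] = {
	0x10, 0x48, 0x68, 0x80, 0x80, 0x98, 0x9c,
	0xa4, 0xa0, 0xa8, 0xb0, 0xd0, 0xe0,
};

static const size_t demo_old_offsets[DEMO_NFIELDS] = {
	0x10, 0x50, 0xc0, 0x70, 0xac, 0x10c, 0x128,
	0x4c, 0x16c, 0x198, 0x88, 0x98, 0x130,
};

/*
 * Count the distinct cache lines in which the given field offsets start.
 * Only lines 0..63 are tracked, which covers a struct of up to 4 KiB.
 */
static unsigned demo_lines_touched(const size_t *offsets, size_t n)
{
	uint64_t seen = 0;
	unsigned count = 0;

	assert(offsets != NULL);
	for (size_t i = 0; i < n; i++) {
		size_t line = offsets[i] / DEMO_CACHE_LINE;

		if (line < 64 && !(seen & (UINT64_C(1) << line))) {
			seen |= UINT64_C(1) << line;
			count++;
		}
	}
	return count;
}
```

Under these assumptions, the 13 listed fields start in 4 distinct cache lines with the new layout and in 7 with the old one. Fields that span a line boundary are not accounted for, so this is a lower bound rather than an exact count.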
|
|
|
struct sk_buff_head sk_receive_queue;
|
2007-03-05 08:05:44 +08:00
|
|
|
/*
|
|
|
|
* The backlog queue is special, it is always used with
|
|
|
|
* the per-socket spinlock held and requires low latency
|
|
|
|
* access. Therefore we special case its implementation.
|
2010-11-16 13:56:04 +08:00
|
|
|
* Note : rmem_alloc is in this structure to fill a hole
|
|
|
|
* on 64bit arches, not because it's logically part of
|
|
|
|
* backlog.
|
2007-03-05 08:05:44 +08:00
|
|
|
*/
|
|
|
|
struct {
|
2010-11-16 13:56:04 +08:00
|
|
|
atomic_t rmem_alloc;
|
|
|
|
int len;
|
|
|
|
struct sk_buff *head;
|
|
|
|
struct sk_buff *tail;
|
2007-03-05 08:05:44 +08:00
|
|
|
} sk_backlog;
|
tcp: defer skb freeing after socket lock is released
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.
A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.
Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.
This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.
This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.
Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free asymmetry,
when BH handler and user thread do not run on same cpu or
NUMA node.
One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
Tested:
100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs
MTU : 1500
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+(headers)
Before: 82 Gbit
After: 95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 03:02:46 +08:00
|
|
|
|
2010-11-16 13:56:04 +08:00
|
|
|
#define sk_rmem_alloc sk_backlog.rmem_alloc
|
net: introduce SO_INCOMING_CPU
Alternative to RPS/RFS is to use hardware support for multiple
queues.
Then split a set of millions of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.
Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.
We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.
After accept(), connect(), or even file descriptor passing around
processes, applications can use :
int cpu;
socklen_t len = sizeof(cpu);
getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
And use this information to put the socket into the right silo
for optimal performance, as all networking stack should run
on the appropriate cpu, without need to send IPI (RPS/RFS).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 21:54:28 +08:00
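The getsockopt() call quoted in the commit message can be wrapped in a small userspace helper. This is a minimal sketch, not part of the kernel tree: the helper name `demo_incoming_cpu` is hypothetical, and the fallback value 49 for SO_INCOMING_CPU is an assumption taken from the generic Linux socket option numbering (some architectures use a different value, so the system header should win whenever it defines the constant).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49 /* assumption: asm-generic value, arch-specific elsewhere */
#endif

/*
 * Ask the kernel which CPU last delivered a packet to this socket.
 * Returns 0 and stores the CPU number in *cpu, or -1 with errno set
 * (for example EBADF when fd is not an open descriptor).
 */
static int demo_incoming_cpu(int fd, int *cpu)
{
	socklen_t len = sizeof(*cpu);

	assert(cpu != NULL);
	return getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, cpu, &len);
}
```

A worker-per-queue server could call this helper after accept() or connect() and hand the descriptor to the thread pinned to the reported CPU, which is the "silo" placement the commit message describes.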
|
|
|
|
2016-12-04 03:14:56 +08:00
|
|
|
int sk_forward_alloc;
|
2021-09-30 01:25:11 +08:00
|
|
|
u32 sk_reserved_mem;
|
2013-08-01 11:10:25 +08:00
|
|
|
#ifdef CONFIG_NET_RX_BUSY_POLL
|
2013-06-14 21:33:57 +08:00
|
|
|
unsigned int sk_ll_usec;
|
2016-12-04 03:14:56 +08:00
|
|
|
/* ===== mostly read cache line ===== */
|
|
|
|
unsigned int sk_napi_id;
|
2010-11-16 13:56:04 +08:00
|
|
|
#endif
|
|
|
|
int sk_rcvbuf;
|
|
|
|
|
|
|
|
struct sk_filter __rcu *sk_filter;
|
2015-11-30 12:03:11 +08:00
|
|
|
union {
|
|
|
|
struct socket_wq __rcu *sk_wq;
|
2020-02-16 03:42:37 +08:00
|
|
|
/* private: */
|
2015-11-30 12:03:11 +08:00
|
|
|
struct socket_wq *sk_wq_raw;
|
2020-02-16 03:42:37 +08:00
|
|
|
/* public: */
|
2015-11-30 12:03:11 +08:00
|
|
|
};
|
2008-10-29 04:24:06 +08:00
|
|
|
#ifdef CONFIG_XFRM
|
2015-12-08 23:22:02 +08:00
|
|
|
struct xfrm_policy __rcu *sk_policy[2];
|
2008-10-29 04:24:06 +08:00
|
|
|
#endif
|
2021-10-26 00:48:16 +08:00
|
|
|
|
2013-01-23 05:09:51 +08:00
|
|
|
struct dst_entry __rcu *sk_dst_cache;
|
2005-04-17 06:20:36 +08:00
|
|
|
atomic_t sk_omem_alloc;
|
2007-05-30 04:17:47 +08:00
|
|
|
int sk_sndbuf;
|
2016-12-04 03:14:56 +08:00
|
|
|
|
|
|
|
/* ===== cache line for TX ===== */
|
|
|
|
int sk_wmem_queued;
|
2017-06-30 18:08:00 +08:00
|
|
|
refcount_t sk_wmem_alloc;
|
2016-12-04 03:14:56 +08:00
|
|
|
unsigned long sk_tsq_flags;
|
2017-10-06 13:21:27 +08:00
|
|
|
union {
|
|
|
|
struct sk_buff *sk_send_head;
|
|
|
|
struct rb_root tcp_rtx_queue;
|
|
|
|
};
|
2005-04-17 06:20:36 +08:00
|
|
|
struct sk_buff_head sk_write_queue;
|
2016-12-04 03:14:56 +08:00
|
|
|
__s32 sk_peek_off;
|
|
|
|
int sk_write_pending;
|
2017-02-07 05:14:11 +08:00
|
|
|
__u32 sk_dst_pending_confirm;
|
2017-05-16 19:24:36 +08:00
|
|
|
u32 sk_pacing_status; /* see enum sk_pacing */
|
2016-12-04 03:14:56 +08:00
|
|
|
long sk_sndtimeo;
|
|
|
|
struct timer_list sk_timer;
|
|
|
|
__u32 sk_priority;
|
|
|
|
__u32 sk_mark;
|
net: extend sk_pacing_rate to unsigned long
sk_pacing_rate has been introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
segs_in:3916318 data_segs_out:177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:37:53 +08:00
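The 34Gbit ceiling mentioned above follows directly from the field width: sk_pacing_rate is in bytes per second, so a u32 tops out at (2^32 - 1) * 8 bits per second. The helpers below are a hypothetical illustration of that arithmetic (the names are not from the kernel).

```c
#include <stdint.h>

/* Largest pacing rate, in bits per second, that a u32 byte rate can express. */
static uint64_t demo_u32_pacing_limit_bps(void)
{
	return (uint64_t)UINT32_MAX * 8u;
}

/* Non-zero when a rate given in bytes per second still fits in a u32. */
static int demo_rate_fits_u32(uint64_t bytes_per_sec)
{
	return bytes_per_sec <= UINT32_MAX;
}
```

The limit works out to 34,359,738,360 bit/s, roughly 34.36 Gbit/s. The pacing_rate of 38705.0Mbps in the ss output above corresponds to 4,838,125,000 bytes per second, which no longer fits in 32 bits.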
|
|
|
unsigned long sk_pacing_rate; /* bytes per second */
|
|
|
|
unsigned long sk_max_pacing_rate;
|
2016-12-04 03:14:56 +08:00
|
|
|
struct page_frag sk_frag;
|
|
|
|
netdev_features_t sk_route_caps;
|
|
|
|
int sk_gso_type;
|
|
|
|
unsigned int sk_gso_max_size;
|
|
|
|
gfp_t sk_allocation;
|
|
|
|
__u32 sk_txhash;
|
2016-05-19 00:19:27 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Because of non atomicity rules, all
|
|
|
|
* changes are protected by socket lock.
|
|
|
|
*/
|
2021-11-16 03:02:35 +08:00
|
|
|
u8 sk_gso_disabled : 1,
|
2017-03-09 16:09:05 +08:00
|
|
|
sk_kern_sock : 1,
|
2014-05-23 23:47:19 +08:00
|
|
|
sk_no_check_tx : 1,
|
|
|
|
sk_no_check_rx : 1,
|
2020-01-09 23:59:15 +08:00
|
|
|
sk_userlocks : 4;
|
2017-11-12 07:54:12 +08:00
|
|
|
u8 sk_pacing_shift;
|
2020-01-09 23:59:15 +08:00
|
|
|
u16 sk_type;
|
|
|
|
u16 sk_protocol;
|
|
|
|
u16 sk_gso_max_segs;
|
2005-04-17 06:20:36 +08:00
|
|
|
unsigned long sk_lingertime;
|
2005-05-06 04:35:15 +08:00
|
|
|
struct proto *sk_prot_creator;
|
2005-04-17 06:20:36 +08:00
|
|
|
rwlock_t sk_callback_lock;
|
|
|
|
int sk_err,
|
|
|
|
sk_err_soft;
|
2015-03-20 10:04:21 +08:00
|
|
|
u32 sk_ack_backlog;
|
|
|
|
u32 sk_max_ack_backlog;
|
net: core: Add a UID field to struct sock.
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.
Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().
Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:
1. Whenever sk_socket is non-null: sk_uid is the same as the UID
in sk_socket, i.e., matches the return value of sock_i_uid.
Specifically, the UID is set when userspace calls socket(),
fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
- For a socket that no longer has a sk_socket because
userspace has called close(): the previous UID.
- For a cloned socket (e.g., an incoming connection that is
established but on which userspace has not yet called
accept): the UID of the socket it was cloned from.
- For a socket that has never had an sk_socket: UID 0 inside
the user namespace corresponding to the network namespace
the socket belongs to.
Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 01:23:41 +08:00
|
|
|
kuid_t sk_uid;
|
2022-01-31 21:31:22 +08:00
|
|
|
u8 sk_txrehash;
|
net: Introduce preferred busy-polling
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavily loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.
In summary; Heavy traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
2020-12-01 02:51:56 +08:00
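The last step of the example usage ("Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket") might look like the sketch below in userspace. It is not taken from the patch: the helper name is hypothetical, and the fallback values 46 and 69 are assumptions based on the generic Linux socket option numbering, used only when the system headers are too old to define the constants. Enabling these options may also require CAP_NET_ADMIN, depending on the kernel version.

```c
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46        /* assumption: asm-generic value */
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69 /* assumption: asm-generic value */
#endif

/*
 * Turn on busy polling for up to 'usecs' microseconds per blocking call,
 * then ask the kernel to prefer busy polling over softirq processing.
 * Returns 0 on success, or -1 with errno set by the failing setsockopt().
 */
static int demo_enable_preferred_busy_poll(int fd, int usecs)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on));
}
```

As the commit message notes, the socket options only have the intended effect when napi_defer_hard_irqs and gro_flush_timeout are also configured on the device.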
|
|
|
#ifdef CONFIG_NET_RX_BUSY_POLL
|
|
|
|
u8 sk_prefer_busy_poll;
|
2020-12-01 02:51:57 +08:00
|
|
|
u16 sk_busy_poll_budget;
|
2020-12-01 02:51:56 +08:00
|
|
|
#endif
|
2021-09-30 06:57:50 +08:00
|
|
|
spinlock_t sk_peer_lock;
|
2021-11-16 03:02:37 +08:00
|
|
|
int sk_bind_phc;
|
2010-06-13 11:30:14 +08:00
|
|
|
struct pid *sk_peer_pid;
|
|
|
|
const struct cred *sk_peer_cred;
|
2021-09-30 06:57:50 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
long sk_rcvtimeo;
|
2007-04-20 07:16:32 +08:00
|
|
|
ktime_t sk_stamp;
|
2018-12-28 10:55:09 +08:00
|
|
|
#if BITS_PER_LONG==32
|
|
|
|
seqlock_t sk_stamp_seq;
|
|
|
|
#endif
|
2014-08-05 10:11:46 +08:00
|
|
|
u16 sk_tsflags;
|
2016-05-19 00:19:27 +08:00
|
|
|
u8 sk_shutdown;
|
2022-02-18 01:05:02 +08:00
|
|
|
atomic_t sk_tskey;
|
2017-08-04 04:29:39 +08:00
|
|
|
atomic_t sk_zckey;
|
2018-07-04 06:42:48 +08:00
|
|
|
|
|
|
|
u8 sk_clockid;
|
|
|
|
u8 sk_txtime_deadline_mode : 1,
|
2018-07-04 06:43:00 +08:00
|
|
|
sk_txtime_report_errors : 1,
|
|
|
|
sk_txtime_unused : 6;
|
2018-07-04 06:42:48 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
struct socket *sk_socket;
|
|
|
|
void *sk_user_data;
|
2008-11-05 06:45:58 +08:00
|
|
|
#ifdef CONFIG_SECURITY
|
2005-04-17 06:20:36 +08:00
|
|
|
void *sk_security;
|
2008-11-05 06:45:58 +08:00
|
|
|
#endif
|
2015-12-08 06:38:52 +08:00
|
|
|
struct sock_cgroup_data sk_cgrp_data;
|
2016-01-15 07:21:17 +08:00
|
|
|
struct mem_cgroup *sk_memcg;
|
2005-04-17 06:20:36 +08:00
|
|
|
void (*sk_state_change)(struct sock *sk);
|
2014-04-12 04:15:36 +08:00
|
|
|
void (*sk_data_ready)(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
void (*sk_write_space)(struct sock *sk);
|
|
|
|
void (*sk_error_report)(struct sock *sk);
|
2012-05-17 06:48:15 +08:00
|
|
|
int (*sk_backlog_rcv)(struct sock *sk,
|
|
|
|
struct sk_buff *skb);
|
2018-04-30 15:16:12 +08:00
|
|
|
#ifdef CONFIG_SOCK_VALIDATE_XMIT
|
|
|
|
struct sk_buff* (*sk_validate_xmit_skb)(struct sock *sk,
|
|
|
|
struct net_device *dev,
|
|
|
|
struct sk_buff *skb);
|
|
|
|
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
void (*sk_destruct)(struct sock *sk);
|
2016-01-05 06:41:45 +08:00
|
|
|
struct sock_reuseport __rcu *sk_reuseport_cb;
|
bpf: Introduce bpf sk local storage
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
into different bpf running context.
this patch is another effort to make bpf's network programming
more intuitive to do (together with memory and performance benefit).
When a bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuple (src/dst ip/port) as the key.
If multiple bpf progs need to store different sk data, multiple maps
have to be defined, wasting memory on the duplicated
keys (i.e. the 4-tuples here) in each of the bpf maps.
[ The smallest key could be the sk pointer itself, which requires
some enhancement in the verifier and is a separate topic. ]
Also, the bpf prog needs to clean up the elem when the sk is freed.
Otherwise, the bpf map will quickly become full and unusable.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).
The size of the map needs to be predefined, which usually ends up
with an over-provisioned map in production. Even if the map were resizable,
sks naturally come and go already, so this potential resize
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxying through a bpf map.
This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use. The space will be allocated when the first bpf
prog has created data for this particular sk.
The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update). bpf_spin_lock should be used if the inline update needs
to be protected.
BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs. The map enforces
BTF to allow printing the sk-local-storage during a system-wide
sk dump (e.g. "ss -ta") in the future.
The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to look up, update, or
delete "sk-local-storage" data from a particular sk.
Think of the map as a meta-data (or "type") of a "sk-local-storage". This
particular "type" of "sk-local-storage" data can then be stored in any sk.
The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
when the map is freed.
sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage"). When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list. The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.
To allow very fast lookup, it should be as fast as looking up an
array at a stable offset. At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "types" that the
system can have. Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[], which is a fixed-size array. Each
"sk-local-storage" type has a stable offset into the cache[] array.
In the future, a map flag could be introduced to do cache
opt-out/enforcement if it becomes necessary.
The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share a map. On the program side, having a few bpf_progs
running in the networking hotpath is already a lot. The bpf_prog
should have already consolidated the existing sock-keyed map usage
to minimize the map lookup penalty. 16 leaves enough runway to grow.
All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.
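The cached lookup described above can be modeled in plain userspace C. This is only a sketch under stated assumptions: the `toy_*` names and struct layouts are hypothetical, and it leaves out the RCU and locking that the kernel code needs. It shows the map pointer acting as the key, with a 16-slot cache consulted before the list walk.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 16 /* cache size stated in the commit message */

/* Hypothetical stand-in for a map: it only acts as the storage "type". */
struct toy_map {
	unsigned int cache_idx; /* stable slot in cache[], below TOY_CACHE_SIZE */
};

/* Hypothetical stand-in for one element hanging off a socket. */
struct toy_elem {
	const struct toy_map *map; /* the map pointer is the lookup key */
	struct toy_elem *next;
	unsigned int data;
};

/* Hypothetical stand-in for the per-socket storage. */
struct toy_storage {
	struct toy_elem *list;
	struct toy_elem *cache[TOY_CACHE_SIZE];
};

/* Look up the element of type `map`: try the cache slot, then walk the list. */
static struct toy_elem *toy_lookup(struct toy_storage *st,
				   const struct toy_map *map)
{
	struct toy_elem *e;

	assert(map->cache_idx < TOY_CACHE_SIZE);

	/* Fast path: the slot holds the last hit for this index. */
	e = st->cache[map->cache_idx];
	if (e && e->map == map)
		return e;

	/* Slow path: linear search, then remember the result. */
	for (e = st->list; e; e = e->next) {
		if (e->map == map) {
			st->cache[map->cache_idx] = e;
			return e;
		}
	}
	return NULL;
}
```

Two maps that share a cache index simply evict each other from the slot, so more than 16 types still work and only lose the fast path.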
bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete(). The verifier can then enforce the
ARG_PTR_TO_SOCKET argument. bpf_sk_storage_get() also allows
"creating" a new elem if one does not exist in the sk. It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.
Misc notes:
----------
1. map_get_next_key is not supported. From the userspace syscall
perspective, the map has the socket fd as the key while the map
can be shared by pinned-file or map-id.
Since btf is enforced, the existing "ss" could be enhanced to pretty
print the local-storage.
Supporting a kernel defined btf with 4 tuples as the return key could
be explored later also.
2. The sk->sk_lock cannot be acquired. Atomic operations are used instead,
e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
Please refer to the source code comments for the details in
synchronization cases and considerations.
3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:
One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (The verifier is modified to support the sk ptr as the key.
That should have shortened the key lookup time.)
Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt. netperf is used to drive
data with 4096 connected UDP sockets.
BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
loaded_at 2019-04-15T13:46:39-0700 uid 0
xlated 344B jited 258B memlock 4096B map_ids 16
btf_id 5
BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
loaded_at 2019-04-15T13:47:54-0700 uid 0
xlated 168B jited 156B memlock 4096B map_ids 17
btf_id 6
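The per-run figures quoted above follow from dividing run_time_ns by run_cnt. A minimal sketch of that arithmetic, where `avg_ns_per_run` is a hypothetical helper name and not a kernel or bpftool function:

```c
#include <assert.h>
#include <stdint.h>

/* Average nanoseconds per program run, truncated by integer division. */
static inline uint64_t avg_ns_per_run(uint64_t run_time_ns, uint64_t run_cnt)
{
	assert(run_cnt != 0); /* a program that never ran has no average */
	return run_time_ns / run_cnt;
}
```

With the counters shown, the hashmap program gives 58280107540 / 381347633, about 152.8 ns, and the sk-storage program gives 25617093319 / 390989739, about 65.5 ns. The truncating helper returns 152 and 65, and the second figure rounds to the 66 ns quoted.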
Here is a high-level picture of how the objects are organized:
 sk
┌──────┐
│      │
│      │
│      │
│*sk_bpf_storage─────▶ bpf_sk_storage
└──────┘                 ┌───────┐
             ┌───────────┤ list  │
             │           │       │
             │           │       │
             │           │       │
             │           └───────┘
             │
             │     elem
             │  ┌────────┐
             ├─▶│ snode  │
             │  ├────────┤
             │  │  data  │          bpf_map
             │  ├────────┤        ┌─────────┐
             │  │map_node│◀─┬─────┤  list   │
             │  └────────┘  │     │         │
             │              │     │         │
             │     elem     │     │         │
             │  ┌────────┐  │     └─────────┘
             └─▶│ snode  │  │
                ├────────┤  │
   bpf_map      │  data  │  │
 ┌─────────┐    ├────────┤  │
 │  list   ├───▶│map_node│  │
 │         │    └────────┘  │
 │         │                │
 │         │       elem     │
 └─────────┘    ┌────────┐  │
             ┌─▶│ snode  │  │
             │  ├────────┤  │
             │  │  data  │  │
             │  ├────────┤  │
             │  │map_node│◀─┘
             │  └────────┘
             │
             │
             │          ┌───────┐
 sk          └──────────│ list  │
┌──────┐                │       │
│      │                │       │
│      │                │       │
│      │                └───────┘
│*sk_bpf_storage───────▶bpf_sk_storage
└──────┘
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27 07:39:39 +08:00
|
|
|
#ifdef CONFIG_BPF_SYSCALL
|
2020-08-26 02:29:13 +08:00
|
|
|
struct bpf_local_storage __rcu *sk_bpf_storage;
|
2019-04-27 07:39:39 +08:00
|
|
|
#endif
|
2016-04-01 23:52:12 +08:00
|
|
|
struct rcu_head sk_rcu;
|
2021-12-10 15:44:22 +08:00
|
|
|
netns_tracker ns_tracker;
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2017-05-16 19:24:36 +08:00
|
|
|
enum sk_pacing {
|
|
|
|
SK_PACING_NONE = 0,
|
|
|
|
SK_PACING_NEEDED = 1,
|
|
|
|
SK_PACING_FQ = 2,
|
|
|
|
};
|
|
|
|
|
net, sk_msg: Clear sk_user_data pointer on clone if tagged
sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at a reference-counted
object, like struct sk_psock.
One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.
The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, as we don't hold any locks at that time.
This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned to at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.
Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
when the flag-bit is set when setting sk_user_data to catch any future
misuses.
It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.
Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
2020-02-19 01:10:14 +08:00
|
|
|
/* Pointer stored in sk_user_data might not be suitable for copying
|
|
|
|
* when cloning the socket. For instance, it can point to a reference
|
|
|
|
* counted object. sk_user_data bottom bit is set if pointer must not
|
|
|
|
* be copied.
|
|
|
|
*/
|
|
|
|
#define SK_USER_DATA_NOCOPY 1UL
|
2020-07-09 14:11:10 +08:00
|
|
|
#define SK_USER_DATA_BPF 2UL /* Managed by BPF */
|
|
|
|
#define SK_USER_DATA_PTRMASK ~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF)
|
2020-02-19 01:10:14 +08:00
|
|
|
|
|
|
|
/**
|
|
|
|
* sk_user_data_is_nocopy - Test if sk_user_data pointer must not be copied
|
|
|
|
* @sk: socket
|
|
|
|
*/
|
|
|
|
static inline bool sk_user_data_is_nocopy(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return ((uintptr_t)sk->sk_user_data & SK_USER_DATA_NOCOPY);
|
|
|
|
}
|
|
|
|
|
2013-09-25 01:25:40 +08:00
|
|
|
#define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
|
|
|
|
|
2020-02-19 01:10:14 +08:00
|
|
|
#define rcu_dereference_sk_user_data(sk) \
|
|
|
|
({ \
|
|
|
|
void *__tmp = rcu_dereference(__sk_user_data((sk))); \
|
|
|
|
(void *)((uintptr_t)__tmp & SK_USER_DATA_PTRMASK); \
|
|
|
|
})
|
|
|
|
#define rcu_assign_sk_user_data(sk, ptr) \
|
|
|
|
({ \
|
|
|
|
uintptr_t __tmp = (uintptr_t)(ptr); \
|
|
|
|
WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK); \
|
|
|
|
rcu_assign_pointer(__sk_user_data((sk)), __tmp); \
|
|
|
|
})
|
|
|
|
#define rcu_assign_sk_user_data_nocopy(sk, ptr) \
|
|
|
|
({ \
|
|
|
|
uintptr_t __tmp = (uintptr_t)(ptr); \
|
|
|
|
WARN_ON_ONCE(__tmp & ~SK_USER_DATA_PTRMASK); \
|
|
|
|
rcu_assign_pointer(__sk_user_data((sk)), \
|
|
|
|
__tmp | SK_USER_DATA_NOCOPY); \
|
|
|
|
})
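The tagging scheme used by the macros above can be tried out in userspace. The sketch below is an illustration under assumptions, not the kernel implementation: the `TAG_*` constants and helper names are hypothetical stand-ins for the SK_USER_DATA_* definitions, RCU is left out, and `assert()` plays the role of WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the SK_USER_DATA_* flags. */
#define TAG_NOCOPY  ((uintptr_t)1)
#define TAG_BPF     ((uintptr_t)2)
#define TAG_PTRMASK (~(TAG_NOCOPY | TAG_BPF))

/* Return ptr with the no-copy flag set in bit 0. */
static inline void *tag_nocopy(void *ptr)
{
	uintptr_t v = (uintptr_t)ptr;

	/* Both low bits must be free, so ptr needs 4-byte alignment. */
	assert((v & ~TAG_PTRMASK) == 0);
	return (void *)(v | TAG_NOCOPY);
}

/* Strip the flag bits to recover the real pointer. */
static inline void *untag(void *tagged)
{
	return (void *)((uintptr_t)tagged & TAG_PTRMASK);
}

/* Test the no-copy flag without dereferencing anything. */
static inline bool is_nocopy(const void *tagged)
{
	return ((uintptr_t)tagged & TAG_NOCOPY) != 0;
}

/* Value a cloned child ends up with: NULL if the parent field is tagged. */
static inline void *clone_user_data(void *parent_field)
{
	return is_nocopy(parent_field) ? NULL : parent_field;
}
```

Because the flag travels inside the pointer value, a single load of the field yields both the pointer and the copy policy, which is the atomicity argument made in the commit message.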
|
2013-09-25 01:25:40 +08:00
|
|
|
|
2022-01-31 21:31:21 +08:00
|
|
|
static inline
|
|
|
|
struct net *sock_net(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return read_pnet(&sk->sk_net);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline
|
|
|
|
void sock_net_set(struct sock *sk, struct net *net)
|
|
|
|
{
|
|
|
|
write_pnet(&sk->sk_net, net);
|
|
|
|
}
|
|
|
|
|
2012-04-19 11:39:36 +08:00
|
|
|
/*
|
|
|
|
* SK_CAN_REUSE and SK_NO_REUSE on a socket indicate whether the socket
|
|
|
|
* allows its port to be reused by someone else. SK_FORCE_REUSE
|
|
|
|
* on a socket means that the socket will reuse everybody else's port
|
|
|
|
* without looking at the other's sk_reuse value.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define SK_NO_REUSE 0
|
|
|
|
#define SK_CAN_REUSE 1
|
|
|
|
#define SK_FORCE_REUSE 2
|
|
|
|
|
2016-04-06 00:41:16 +08:00
|
|
|
int sk_set_peek_off(struct sock *sk, int val);
|
|
|
|
|
2022-06-06 19:34:58 +08:00
|
|
|
static inline int sk_peek_offset(const struct sock *sk, int flags)
|
2012-02-21 15:31:34 +08:00
|
|
|
{
|
2016-04-06 00:41:14 +08:00
|
|
|
if (unlikely(flags & MSG_PEEK)) {
|
datagram: When peeking datagrams with offset < 0 don't skip empty skbs
Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
headers from UDP packets before queueing"), when udp packets are being
peeked the requested extra offset is always 0 as there is no need to skip
the udp header. However, when the offset is 0 and the next skb is
of length 0, it is only returned once. The behaviour can be seen with
the following python script:
from socket import *;
f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
f.bind(('::', 0));
addr=('::1', f.getsockname()[1]);
g.sendto(b'', addr)
g.sendto(b'b', addr)
print(f.recvfrom(10, MSG_PEEK));
print(f.recvfrom(10, MSG_PEEK));
Where the expected output should be the empty string twice.
Instead, make sk_peek_offset return negative values, and pass those values
to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset
to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
__skb_try_recv_from_queue will then ensure the offset is reset back to 0
if a peek is requested without an offset, unless no packets are found.
Also simplify the if condition in __skb_try_recv_from_queue. If _off is
greater than 0, and off is greater than or equal to skb->len, then
(_off || skb->len) must always be true assuming skb->len >= 0 is always
true.
Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
as it double checked if MSG_PEEK was set in the flags.
V2:
- Moved the negative fixup into __skb_try_recv_from_queue, and remove now
redundant checks
- Fix peeking in udp{,v6}_recvmsg to report the right value when the
offset is 0
V3:
- Marked new branch in __skb_try_recv_from_queue as unlikely.
Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19 03:04:54 +08:00
|
|
|
return READ_ONCE(sk->sk_peek_off);
|
2016-04-06 00:41:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2012-02-21 15:31:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_peek_offset_bwd(struct sock *sk, int val)
|
|
|
|
{
|
2016-04-06 00:41:14 +08:00
|
|
|
s32 off = READ_ONCE(sk->sk_peek_off);
|
|
|
|
|
|
|
|
if (unlikely(off >= 0)) {
|
|
|
|
off = max_t(s32, off - val, 0);
|
|
|
|
WRITE_ONCE(sk->sk_peek_off, off);
|
2012-02-21 15:31:34 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_peek_offset_fwd(struct sock *sk, int val)
|
|
|
|
{
|
2016-04-06 00:41:14 +08:00
|
|
|
sk_peek_offset_bwd(sk, -val);
|
2012-02-21 15:31:34 +08:00
|
|
|
}
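The clamping behaviour of sk_peek_offset_bwd() and sk_peek_offset_fwd() can be checked with a small userspace model. This is a sketch only: `struct peek_state` and the `peek_offset_*` names are hypothetical, READ_ONCE/WRITE_ONCE are replaced by plain accesses, and a negative offset is assumed to mean that no peek offset is in effect, as the `off >= 0` test suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_peek_off field of struct sock. */
struct peek_state {
	int32_t peek_off; /* negative: no peek offset in effect (assumed) */
};

/* Move the offset back by val, never going below zero. */
static inline void peek_offset_bwd(struct peek_state *st, int val)
{
	int32_t off;

	assert(st != NULL);
	off = st->peek_off;
	if (off >= 0) {
		off -= val;
		st->peek_off = off > 0 ? off : 0;
	}
}

/* Moving forward is moving backward by a negative amount. */
static inline void peek_offset_fwd(struct peek_state *st, int val)
{
	peek_offset_bwd(st, -val);
}
```

Starting from an offset of 10, a forward step of 5 gives 15, and a backward step of 20 then clamps to 0. A negative offset is left untouched by both helpers.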
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Hashed lists helper routines
|
|
|
|
*/
|
2010-02-09 07:18:45 +08:00
|
|
|
static inline struct sock *sk_entry(const struct hlist_node *node)
|
|
|
|
{
|
|
|
|
return hlist_entry(node, struct sock, sk_node);
|
|
|
|
}
|
|
|
|
|
2005-08-10 11:09:46 +08:00
|
|
|
static inline struct sock *__sk_head(const struct hlist_head *head)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return hlist_entry(head->first, struct sock, sk_node);
|
|
|
|
}
|
|
|
|
|
2005-08-10 11:09:46 +08:00
|
|
|
static inline struct sock *sk_head(const struct hlist_head *head)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return hlist_empty(head) ? NULL : __sk_head(head);
|
|
|
|
}
|
|
|
|
|
2008-11-17 11:39:21 +08:00
|
|
|
static inline struct sock *__sk_nulls_head(const struct hlist_nulls_head *head)
|
|
|
|
{
|
|
|
|
return hlist_nulls_entry(head->first, struct sock, sk_nulls_node);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct sock *sk_nulls_head(const struct hlist_nulls_head *head)
|
|
|
|
{
|
|
|
|
return hlist_nulls_empty(head) ? NULL : __sk_nulls_head(head);
|
|
|
|
}
|
|
|
|
|
2005-08-10 11:09:46 +08:00
|
|
|
static inline struct sock *sk_next(const struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-01-20 22:27:04 +08:00
|
|
|
return hlist_entry_safe(sk->sk_node.next, struct sock, sk_node);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-11-17 11:39:21 +08:00
|
|
|
static inline struct sock *sk_nulls_next(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return (!is_a_nulls(sk->sk_nulls_node.next)) ?
|
|
|
|
hlist_nulls_entry(sk->sk_nulls_node.next,
|
|
|
|
struct sock, sk_nulls_node) :
|
|
|
|
NULL;
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_unhashed(const struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return hlist_unhashed(&sk->sk_node);
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_hashed(const struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2006-04-29 06:21:23 +08:00
|
|
|
return !sk_unhashed(sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void sk_node_init(struct hlist_node *node)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
node->pprev = NULL;
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void sk_nulls_node_init(struct hlist_nulls_node *node)
|
2008-11-17 11:39:21 +08:00
|
|
|
{
|
|
|
|
node->pprev = NULL;
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void __sk_del_node(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
__hlist_del(&sk->sk_node);
|
|
|
|
}
|
|
|
|
|
2010-02-22 15:57:18 +08:00
|
|
|
/* NB: equivalent to hlist_del_init_rcu */
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool __sk_del_node_init(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
if (sk_hashed(sk)) {
|
|
|
|
__sk_del_node(sk);
|
|
|
|
sk_node_init(&sk->sk_node);
|
2012-05-17 06:48:15 +08:00
|
|
|
return true;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2012-05-17 06:48:15 +08:00
|
|
|
return false;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Grab socket reference count. This operation is valid only
|
|
|
|
when sk is ALREADY grabbed, e.g. it is found in a hash table
|
|
|
|
or a list and the lookup is made under lock preventing hash table
|
|
|
|
modifications.
|
|
|
|
*/
|
|
|
|
|
net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
Arguably, gcc should do better, but gcc people aren't willing
to invest time into it, asking to use __always_inline instead.
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
netif_tx_stop_queue: 207 copies, 590 calls:
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 80 8f e0 01 00 00 01 lock orb $0x1,0x1e0(%rdi)
5d pop %rbp
c3 retq
netif_tx_start_queue: 47 copies, 111 calls
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 80 a7 e0 01 00 00 fe lock andb $0xfe,0x1e0(%rdi)
5d pop %rbp
c3 retq
sock_hold: 39 copies, 124 calls
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 ff 87 80 00 00 00 lock incl 0x80(%rdi)
5d pop %rbp
c3 retq
__sock_put: 6 copies, 13 calls
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 ff 8f 80 00 00 00 lock decl 0x80(%rdi)
5d pop %rbp
c3 retq
This patch fixes this via s/inline/__always_inline/.
Code size decrease after the patch is ~2.5k:
    text     data      bss       dec     hex filename
56719876 56364551 36196352 149280779 8e5d80b vmlinux_before
56717440 56364551 36196352 149278343 8e5ce87 vmlinux
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: netfilter-devel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-08 23:51:54 +08:00
|
|
|
static __always_inline void sock_hold(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-06-30 18:08:01 +08:00
|
|
|
refcount_inc(&sk->sk_refcnt);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Ungrab socket in a context which assumes that the socket refcnt
|
|
|
|
cannot hit zero, e.g. it is true in the context of any socketcall.
|
|
|
|
*/
|
2016-04-08 23:51:54 +08:00
|
|
|
static __always_inline void __sock_put(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-06-30 18:08:01 +08:00
|
|
|
refcount_dec(&sk->sk_refcnt);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_del_node_init(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-05-17 06:48:15 +08:00
|
|
|
bool rc = __sk_del_node_init(sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
if (rc) {
|
|
|
|
/* paranoid for a while -acme */
|
2017-06-30 18:08:01 +08:00
|
|
|
WARN_ON(refcount_read(&sk->sk_refcnt) == 1);
|
2005-04-17 06:20:36 +08:00
|
|
|
__sock_put(sk);
|
|
|
|
}
|
|
|
|
return rc;
|
|
|
|
}
|
2010-02-22 15:57:18 +08:00
|
|
|
#define sk_del_node_init_rcu(sk) sk_del_node_init(sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool __sk_nulls_del_node_init_rcu(struct sock *sk)
|
udp: RCU handling for Unicast packets.
Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in the socket structure, which would increase memory needs
and, more importantly, force us to use call_rcu() calls,
which have the bad property of making socket structures cold.
(The RCU grace period between freeing a socket and its potential reuse
makes the socket cold in the CPU cache.)
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Christoph Lameter :
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lock-free (using rcu_read_lock()/rcu_read_unlock()),
readers and writers must both take special care.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, and inserted in a different chain, or in the worst case in the same
chain, while readers are doing lookups at the same time.
To avoid loops, a reader must check that each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, the lookup must start again at the beginning. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we don't want to send the same message several times to the same socket.
We use RCU only for the fast path.
Thus, /proc/net/udp still takes spinlocks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
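The restart rule described in the theory of operation can be modelled in single-threaded userspace C. The sketch below is an assumption-laden toy, not the kernel code: all toy_* names are hypothetical, there are no RCU primitives or memory barriers, and concurrent movement of a node is simulated by editing fields by hand. It checks each node's chain during the walk and, in the style of the sk_nulls_* helpers in this header, also checks an odd-valued end-of-chain marker that encodes the chain number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NCHAINS 4u

enum toy_result { TOY_FOUND, TOY_NOT_FOUND, TOY_RESTART };

struct toy_node {
	uintptr_t next;  /* next node, or an odd "nulls" marker ending the chain */
	unsigned chain;  /* chain this node currently belongs to */
	int key;
};

struct toy_table {
	uintptr_t heads[TOY_NCHAINS];
};

/* End marker: low bit set, chain number in the remaining bits. */
static uintptr_t toy_make_nulls(unsigned chain)
{
	return ((uintptr_t)chain << 1) | 1u;
}

static int toy_is_nulls(uintptr_t p)
{
	return (int)(p & 1u);
}

static unsigned toy_nulls_value(uintptr_t p)
{
	return (unsigned)(p >> 1);
}

/* Every empty chain starts out as just its own end marker. */
static void toy_table_init(struct toy_table *t)
{
	for (unsigned i = 0; i < TOY_NCHAINS; i++)
		t->heads[i] = toy_make_nulls(i);
}

static void toy_add_head(struct toy_table *t, struct toy_node *n, unsigned chain)
{
	n->chain = chain;
	n->next = t->heads[chain];
	t->heads[chain] = (uintptr_t)n;
}

/*
 * One pass over a chain. TOY_RESTART means the walk saw a node from another
 * chain, or ended on another chain's marker, so the caller must start over.
 */
static enum toy_result toy_walk(const struct toy_table *t, unsigned chain,
				int key, struct toy_node **out)
{
	uintptr_t p = t->heads[chain];

	while (!toy_is_nulls(p)) {
		struct toy_node *n = (struct toy_node *)p;

		if (n->chain != chain)
			return TOY_RESTART;
		if (n->key == key) {
			*out = n;
			return TOY_FOUND;
		}
		p = n->next;
	}
	return toy_nulls_value(p) == chain ? TOY_NOT_FOUND : TOY_RESTART;
}
```

A real reader would call the walk in a loop until it stops returning TOY_RESTART; the walk is kept as a single pass here so that the restart condition can be observed deterministically.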
2008-10-29 17:11:14 +08:00
|
|
|
{
|
|
|
|
if (sk_hashed(sk)) {
|
2008-11-17 11:39:21 +08:00
|
|
|
hlist_nulls_del_init_rcu(&sk->sk_nulls_node);
|
2012-05-17 06:48:15 +08:00
|
|
|
return true;
|
2008-10-29 17:11:14 +08:00
|
|
|
}
|
2012-05-17 06:48:15 +08:00
|
|
|
return false;
|
2008-10-29 17:11:14 +08:00
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
|
2008-10-29 17:11:14 +08:00
|
|
|
{
|
2012-05-17 06:48:15 +08:00
|
|
|
bool rc = __sk_nulls_del_node_init_rcu(sk);
|
2008-10-29 17:11:14 +08:00
|
|
|
|
|
|
|
if (rc) {
|
|
|
|
/* paranoid for a while -acme */
|
2017-06-30 18:08:01 +08:00
|
|
|
WARN_ON(refcount_read(&sk->sk_refcnt) == 1);
|
2008-10-29 17:11:14 +08:00
|
|
|
__sock_put(sk);
|
|
|
|
}
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
hlist_add_head(&sk->sk_node, list);
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void sk_add_node(struct sock *sk, struct hlist_head *list)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
sock_hold(sk);
|
|
|
|
__sk_add_node(sk, list);
|
|
|
|
}
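The pairing between sk_add_node() and sk_del_node_init() shown above (the list itself owns one reference) can be sketched in userspace as follows. All toy_* names are hypothetical, the count is a plain int rather than a refcount_t, and there is no locking. The kernel's WARN_ON(refcount == 1) check is only described in a comment here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal hlist-style node: pprev points at whatever points at us. */
struct toy_hnode {
	struct toy_hnode *next, **pprev;
};

struct toy_hhead {
	struct toy_hnode *first;
};

struct toy_sock {
	int refcnt;
	struct toy_hnode node;
};

static bool toy_hashed(const struct toy_sock *sk)
{
	return sk->node.pprev != NULL;
}

/* Adding to the list takes a reference on behalf of the list. */
static void toy_add_node(struct toy_sock *sk, struct toy_hhead *h)
{
	struct toy_hnode *n = &sk->node;

	sk->refcnt++;
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/*
 * Unlink if hashed and drop the list's reference. The caller is expected to
 * hold its own reference, so the count must not reach zero here; that is the
 * condition the kernel version warns about before calling __sock_put().
 */
static bool toy_del_node_init(struct toy_sock *sk)
{
	struct toy_hnode *n = &sk->node;

	if (!toy_hashed(sk))
		return false;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
	sk->refcnt--;
	return true;
}
```

Returning a bool from the delete helper lets the caller tell whether a list reference was actually dropped, which matters when two paths may race to unhash the same socket.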
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
|
2010-02-22 15:57:18 +08:00
|
|
|
{
|
|
|
|
sock_hold(sk);
|
2016-04-25 22:42:12 +08:00
|
|
|
if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
|
|
|
|
sk->sk_family == AF_INET6)
|
|
|
|
hlist_add_tail_rcu(&sk->sk_node, list);
|
|
|
|
else
|
|
|
|
hlist_add_head_rcu(&sk->sk_node, list);
|
2010-02-22 15:57:18 +08:00
|
|
|
}
|
|
|
|
|
2019-03-16 21:41:30 +08:00
|
|
|
static inline void sk_add_node_tail_rcu(struct sock *sk, struct hlist_head *list)
|
|
|
|
{
|
|
|
|
sock_hold(sk);
|
|
|
|
hlist_add_tail_rcu(&sk->sk_node, list);
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
|
2008-10-29 17:11:14 +08:00
|
|
|
{
|
2017-12-06 04:45:56 +08:00
|
|
|
hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
|
2008-10-29 17:11:14 +08:00
|
|
|
}
|
|
|
|
|
2019-12-14 10:20:41 +08:00
|
|
|
static inline void __sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nulls_head *list)
|
|
|
|
{
|
|
|
|
hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
|
2008-10-29 17:11:14 +08:00
|
|
|
{
|
|
|
|
sock_hold(sk);
|
2008-11-17 11:39:21 +08:00
|
|
|
__sk_nulls_add_node_rcu(sk, list);
|
2008-10-29 17:11:14 +08:00
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void __sk_del_bind_node(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
__hlist_del(&sk->sk_bind_node);
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void sk_add_bind_node(struct sock *sk,
|
2005-04-17 06:20:36 +08:00
|
|
|
struct hlist_head *list)
|
|
|
|
{
|
|
|
|
hlist_add_head(&sk->sk_bind_node, list);
|
|
|
|
}
|
|
|
|
|
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for-each-entry iterators were conceived
differently from the list ones. The list iterators take three arguments:
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small number of places were using the 'node' parameter; these
were modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28 09:06:00 +08:00
|
|
|
#define sk_for_each(__sk, list) \
|
|
|
|
hlist_for_each_entry(__sk, list, sk_node)
|
|
|
|
#define sk_for_each_rcu(__sk, list) \
|
|
|
|
hlist_for_each_entry_rcu(__sk, list, sk_node)
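To make the three-argument iterator form concrete, here is a userspace sketch of an entry iterator that needs no separate node cursor, together with a sk_for_each-style wrapper. The toy_* names are hypothetical, the list is a simplified singly linked one, and __typeof__ is a GCC/Clang extension; this only approximates how the kernel macros are built.

```c
#include <assert.h>
#include <stddef.h>

struct toy_hnode {
	struct toy_hnode *next;
};

struct toy_hhead {
	struct toy_hnode *first;
};

/* Recover the enclosing structure from a pointer to its embedded node. */
#define toy_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same, but a NULL node maps to a NULL entry so the loop can terminate. */
#define toy_entry_safe(ptr, type, member) \
	((ptr) ? toy_entry(ptr, type, member) : NULL)

/* The entry pointer itself is the cursor; no extra node variable is needed. */
#define toy_for_each_entry(pos, head, member)                              \
	for (pos = toy_entry_safe((head)->first, __typeof__(*(pos)), member); \
	     pos;                                                              \
	     pos = toy_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

struct toy_sock {
	int port;
	struct toy_hnode node;
};

/* Wrapper in the style of sk_for_each(): fixes the member name. */
#define toy_sk_for_each(sk, list) toy_for_each_entry(sk, list, node)

static void toy_add_head(struct toy_hhead *h, struct toy_hnode *n)
{
	n->next = h->first;
	h->first = n;
}

/* Example walk: add up the port of every socket on the list. */
static int toy_sum_ports(struct toy_hhead *list)
{
	struct toy_sock *sk;
	int sum = 0;

	toy_sk_for_each(sk, list)
		sum += sk->port;
	return sum;
}
```

Because the cursor is derived from the entry on each step, call sites only declare the entry pointer, which is what lets the hlist iterators look like their list counterparts.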
|
2008-11-17 11:39:21 +08:00
|
|
|
#define sk_nulls_for_each(__sk, node, list) \
|
|
|
|
hlist_nulls_for_each_entry(__sk, node, list, sk_nulls_node)
|
|
|
|
#define sk_nulls_for_each_rcu(__sk, node, list) \
|
|
|
|
hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)
|
2013-02-28 09:06:00 +08:00
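The effect of the commit above can be sketched in userspace. This is a simplified stand-in, not the kernel's implementation: the `demo_*` names and `struct item` are hypothetical, the list is singly linked for brevity, and `__typeof__` is a GNU C extension. It shows a three-argument iterator that derives the entry from the node internally, so the caller no longer declares a separate node cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for hlist types (singly linked, no pprev). */
struct demo_hnode { struct demo_hnode *next; };
struct demo_hhead { struct demo_hnode *first; };

/* Map a node pointer back to its containing object; NULL marks list end. */
#define demo_entry_safe(ptr, type, member) \
	((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three-argument form: only the entry cursor is passed in. */
#define demo_for_each_entry(pos, head, member)					\
	for (pos = demo_entry_safe((head)->first, __typeof__(*(pos)), member);	\
	     pos;								\
	     pos = demo_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

struct item {			/* hypothetical payload type */
	int value;
	struct demo_hnode node;
};

static void demo_add_head(struct demo_hnode *n, struct demo_hhead *h)
{
	assert(n && h);
	n->next = h->first;
	h->first = n;
}

/* Build a three-element list and sum it with the iterator. */
int demo_sum(void)
{
	struct demo_hhead head = { NULL };
	struct item a = { 1, { NULL } }, b = { 2, { NULL } }, c = { 3, { NULL } };
	struct item *it;
	int sum = 0;

	demo_add_head(&a.node, &head);
	demo_add_head(&b.node, &head);
	demo_add_head(&c.node, &head);

	demo_for_each_entry(it, &head, node)
		sum += it->value;
	return sum;
}
```

With the older four-argument form the caller would also have needed a `struct demo_hnode *` variable that was used only by the macro.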
|
|
|
#define sk_for_each_from(__sk) \
|
|
|
|
hlist_for_each_entry_from(__sk, sk_node)
|
2008-11-17 11:39:21 +08:00
|
|
|
#define sk_nulls_for_each_from(__sk, node) \
|
|
|
|
if (__sk && ({ node = &(__sk)->sk_nulls_node; 1; })) \
|
|
|
|
hlist_nulls_for_each_entry_from(__sk, node, sk_nulls_node)
|
2013-02-28 09:06:00 +08:00
|
|
|
#define sk_for_each_safe(__sk, tmp, list) \
|
|
|
|
hlist_for_each_entry_safe(__sk, tmp, list, sk_node)
|
|
|
|
#define sk_for_each_bound(__sk, list) \
|
|
|
|
hlist_for_each_entry(__sk, list, sk_bind_node)
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2014-07-16 11:28:32 +08:00
|
|
|
/**
|
2016-04-01 23:52:13 +08:00
|
|
|
* sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
|
2014-07-16 11:28:32 +08:00
|
|
|
* @tpos: the type * to use as a loop cursor.
|
|
|
|
* @pos: the &struct hlist_node to use as a loop cursor.
|
|
|
|
* @head: the head for your list.
|
|
|
|
* @offset: offset of hlist_node within the struct.
|
|
|
|
*
|
|
|
|
*/
|
2016-04-01 23:52:13 +08:00
|
|
|
#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset) \
|
2017-10-24 03:35:58 +08:00
|
|
|
for (pos = rcu_dereference(hlist_first_rcu(head)); \
|
2016-04-01 23:52:13 +08:00
|
|
|
pos != NULL && \
|
2014-07-16 11:28:32 +08:00
|
|
|
({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;}); \
|
2017-10-24 03:35:58 +08:00
|
|
|
pos = rcu_dereference(hlist_next_rcu(pos)))
|
2014-07-16 11:28:32 +08:00
|
|
|
|
2022-06-06 19:34:58 +08:00
|
|
|
static inline struct user_namespace *sk_user_ns(const struct sock *sk)
|
2012-05-25 07:56:43 +08:00
|
|
|
{
|
|
|
|
/* Careful: only use this in a context where these parameters
|
|
|
|
* can not change and must all be valid, such as recvmsg from
|
|
|
|
* userspace.
|
|
|
|
*/
|
|
|
|
return sk->sk_socket->file->f_cred->user_ns;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Sock flags */
|
|
|
|
enum sock_flags {
|
|
|
|
SOCK_DEAD,
|
|
|
|
SOCK_DONE,
|
|
|
|
SOCK_URGINLINE,
|
|
|
|
SOCK_KEEPOPEN,
|
|
|
|
SOCK_LINGER,
|
|
|
|
SOCK_DESTROY,
|
|
|
|
SOCK_BROADCAST,
|
|
|
|
SOCK_TIMESTAMP,
|
|
|
|
SOCK_ZAPPED,
|
|
|
|
SOCK_USE_WRITE_QUEUE, /* whether to call sk->sk_write_space in sock_wfree */
|
|
|
|
SOCK_DBG, /* %SO_DEBUG setting */
|
|
|
|
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
|
2007-03-26 13:14:49 +08:00
|
|
|
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
|
2005-04-17 06:20:36 +08:00
|
|
|
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
|
2012-08-01 07:44:16 +08:00
|
|
|
SOCK_MEMALLOC, /* VM depends on this socket for swapping */
|
2009-02-12 13:03:38 +08:00
|
|
|
SOCK_TIMESTAMPING_RX_SOFTWARE, /* %SOF_TIMESTAMPING_RX_SOFTWARE */
|
net: speedup sk_wake_async()
An incoming datagram must bring a *lot* of cache lines into the cpu cache,
in particular : (other parts omitted (hash chains, ip route cache...))
On 32bit arches :
offsetof(struct sock, sk_rcvbuf) =0x30 (read)
offsetof(struct sock, sk_lock) =0x34 (rw)
offsetof(struct sock, sk_sleep) =0x50 (read)
offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
offsetof(struct sock, sk_receive_queue)=0x74 (rw)
offsetof(struct sock, sk_forward_alloc)=0x98 (rw)
offsetof(struct sock, sk_callback_lock)=0xcc (rw)
offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
offsetof(struct sock, sk_filter) =0xf8 (read)
offsetof(struct sock, sk_socket) =0x138 (read)
offsetof(struct sock, sk_data_ready) =0x15c (read)
We can avoid sk->sk_socket and socket->fasync_list referencing on sockets
with no fasync() structures. (socket->fasync_list ptr is probably already in cache
because it shares a cache line with socket->wait, ie location pointed by sk->sk_sleep)
This avoids one cache line load per incoming packet for common cases (no fasync())
We can leave (or even move in a future patch) sk->sk_socket in a cold location
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-07 08:28:29 +08:00
|
|
|
SOCK_FASYNC, /* fasync() active */
|
net: Generalize socket rx gap / receive queue overflow cmsg
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. After I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch. It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if that's
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-13 04:26:31 +08:00
|
|
|
SOCK_RXQ_OVFL,
|
2011-07-06 20:17:30 +08:00
|
|
|
SOCK_ZEROCOPY, /* buffers from userspace */
|
2011-11-09 17:15:42 +08:00
|
|
|
SOCK_WIFI_STATUS, /* push wifi status to userspace */
|
2012-02-11 23:39:30 +08:00
|
|
|
SOCK_NOFCS, /* Tell NIC not to do the Ethernet FCS.
|
|
|
|
* Will use last 4 bytes of packet sent from
|
|
|
|
* user-space instead.
|
|
|
|
*/
|
2013-01-17 05:55:49 +08:00
|
|
|
SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
|
2013-03-28 19:19:25 +08:00
|
|
|
SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
|
2016-04-01 23:52:12 +08:00
|
|
|
SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
|
2018-07-04 06:42:48 +08:00
|
|
|
SOCK_TXTIME,
|
2018-09-12 11:16:59 +08:00
|
|
|
SOCK_XDP, /* XDP is attached */
|
2019-02-02 23:34:50 +08:00
|
|
|
SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
|
2022-04-28 04:02:37 +08:00
|
|
|
SOCK_RCVMARK, /* Receive SO_MARK ancillary data with packet */
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2015-12-05 01:14:04 +08:00
|
|
|
#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
|
|
|
|
|
2022-06-06 19:34:58 +08:00
|
|
|
static inline void sock_copy_flags(struct sock *nsk, const struct sock *osk)
|
2005-08-24 01:11:30 +08:00
|
|
|
{
|
|
|
|
nsk->sk_flags = osk->sk_flags;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
|
|
|
|
{
|
|
|
|
__set_bit(flag, &sk->sk_flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sock_reset_flag(struct sock *sk, enum sock_flags flag)
|
|
|
|
{
|
|
|
|
__clear_bit(flag, &sk->sk_flags);
|
|
|
|
}
|
|
|
|
|
2020-06-20 23:30:50 +08:00
|
|
|
static inline void sock_valbool_flag(struct sock *sk, enum sock_flags bit,
|
|
|
|
int valbool)
|
|
|
|
{
|
|
|
|
if (valbool)
|
|
|
|
sock_set_flag(sk, bit);
|
|
|
|
else
|
|
|
|
sock_reset_flag(sk, bit);
|
|
|
|
}
|
|
|
|
|
2012-05-16 13:57:07 +08:00
|
|
|
static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return test_bit(flag, &sk->sk_flags);
|
|
|
|
}
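The flag helpers above treat each `enum sock_flags` value as a bit position inside `sk_flags`, not as a mask. A userspace sketch of the same pattern is shown here; the `demo_*` names and the reduced flag set are hypothetical, and plain bit operations stand in for the kernel's `__set_bit()`, `__clear_bit()` and `test_bit()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of flags; each enumerator is a bit index. */
enum demo_flags { DEMO_DEAD, DEMO_DONE, DEMO_TIMESTAMP, DEMO_FLAG_MAX };

struct demo_sock { unsigned long flags; };

void demo_set_flag(struct demo_sock *sk, enum demo_flags f)
{
	assert(f < DEMO_FLAG_MAX);
	sk->flags |= 1UL << f;		/* non-atomic, like __set_bit() */
}

void demo_reset_flag(struct demo_sock *sk, enum demo_flags f)
{
	assert(f < DEMO_FLAG_MAX);
	sk->flags &= ~(1UL << f);	/* non-atomic, like __clear_bit() */
}

bool demo_flag(const struct demo_sock *sk, enum demo_flags f)
{
	return (sk->flags >> f) & 1UL;
}

/* Any non-zero valbool sets the bit, zero clears it. */
void demo_valbool_flag(struct demo_sock *sk, enum demo_flags f, int valbool)
{
	if (valbool)
		demo_set_flag(sk, f);
	else
		demo_reset_flag(sk, f);
}

/* Returns 1 when set, reset and test agree with each other. */
int demo_flags_roundtrip(void)
{
	struct demo_sock sk = { 0 };

	demo_valbool_flag(&sk, DEMO_TIMESTAMP, 5);
	demo_set_flag(&sk, DEMO_DEAD);
	demo_reset_flag(&sk, DEMO_DEAD);
	return demo_flag(&sk, DEMO_TIMESTAMP) && !demo_flag(&sk, DEMO_DEAD);
}
```

A mask such as `SK_FLAGS_TIMESTAMP` is then built by shifting `1UL` by each bit index and OR-ing the results.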
|
|
|
|
|
2012-08-01 07:44:19 +08:00
|
|
|
#ifdef CONFIG_NET
|
2018-05-09 00:06:59 +08:00
|
|
|
DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
|
2012-08-01 07:44:19 +08:00
|
|
|
static inline int sk_memalloc_socks(void)
|
|
|
|
{
|
2018-05-09 00:06:59 +08:00
|
|
|
return static_branch_unlikely(&memalloc_socks_key);
|
2012-08-01 07:44:19 +08:00
|
|
|
}
|
2020-06-10 07:11:29 +08:00
|
|
|
|
|
|
|
void __receive_sock(struct file *file);
|
2012-08-01 07:44:19 +08:00
|
|
|
#else
|
|
|
|
|
|
|
|
static inline int sk_memalloc_socks(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-06-10 07:11:29 +08:00
|
|
|
static inline void __receive_sock(struct file *file)
|
|
|
|
{ }
|
2012-08-01 07:44:19 +08:00
|
|
|
#endif
|
|
|
|
|
2015-12-01 00:57:28 +08:00
|
|
|
static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
|
2012-08-01 07:44:14 +08:00
|
|
|
{
|
2015-12-01 00:57:28 +08:00
|
|
|
return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
|
2012-08-01 07:44:14 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline void sk_acceptq_removed(struct sock *sk)
|
|
|
|
{
|
2019-11-06 06:11:53 +08:00
|
|
|
WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_acceptq_added(struct sock *sk)
|
|
|
|
{
|
2019-11-06 06:11:53 +08:00
|
|
|
WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog + 1);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2021-04-01 00:35:12 +08:00
|
|
|
/* Note: If you think the test should be:
|
|
|
|
* return READ_ONCE(sk->sk_ack_backlog) >= READ_ONCE(sk->sk_max_ack_backlog);
|
|
|
|
* Then please take a look at commit 64a146513f8f ("[NET]: Revert incorrect accept queue backlog changes.")
|
|
|
|
*/
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_acceptq_is_full(const struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-04-01 00:35:12 +08:00
|
|
|
return READ_ONCE(sk->sk_ack_backlog) > READ_ONCE(sk->sk_max_ack_backlog);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
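The strict `>` in the test above means, as far as I understand the referenced commit, that a backlog of N lets N + 1 connections queue, so a backlog of zero still admits one. The sketch below models only that arithmetic in userspace; `struct demo_listener` and the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model of the two counters compared by the helper above. */
struct demo_listener {
	unsigned int ack_backlog;	/* connections currently queued */
	unsigned int max_ack_backlog;	/* backlog value given to listen() */
};

bool demo_acceptq_is_full(const struct demo_listener *l)
{
	return l->ack_backlog > l->max_ack_backlog;
}

/* Count how many connections can be queued before the queue reports full. */
unsigned int demo_acceptq_capacity(unsigned int backlog)
{
	struct demo_listener l = { 0, backlog };
	unsigned int queued = 0;

	assert(backlog < UINT_MAX);	/* keep the loop below finite */
	while (!demo_acceptq_is_full(&l)) {
		l.ack_backlog++;
		queued++;
	}
	return queued;
}
```

Replacing `>` with `>=` in this model would make `demo_acceptq_capacity(0)` return 0, which is the behaviour the comment warns against.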
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute minimal free write space needed to queue new packets.
|
|
|
|
*/
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline int sk_stream_min_wspace(const struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2019-10-11 11:17:46 +08:00
|
|
|
return READ_ONCE(sk->sk_wmem_queued) >> 1;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline int sk_stream_wspace(const struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2019-10-11 11:17:46 +08:00
|
|
|
return READ_ONCE(sk->sk_sndbuf) - READ_ONCE(sk->sk_wmem_queued);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_wmem_queued_add(struct sock *sk, int val)
|
|
|
|
{
|
|
|
|
WRITE_ONCE(sk->sk_wmem_queued, sk->sk_wmem_queued + val);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void sk_stream_write_space(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2010-03-05 02:01:40 +08:00
|
|
|
/* OOB backlog add */
|
2010-03-05 02:01:47 +08:00
|
|
|
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
|
2005-11-09 01:39:42 +08:00
|
|
|
{
|
2010-05-12 07:19:48 +08:00
|
|
|
/* don't leave the skb dst non-refcounted, we are going to leave rcu lock */
|
2017-09-22 00:15:46 +08:00
|
|
|
skb_dst_force(skb);
|
2010-05-12 07:19:48 +08:00
|
|
|
|
|
|
|
if (!sk->sk_backlog.tail)
|
2019-11-07 02:04:11 +08:00
|
|
|
WRITE_ONCE(sk->sk_backlog.head, skb);
|
2010-05-12 07:19:48 +08:00
|
|
|
else
|
2005-11-09 01:39:42 +08:00
|
|
|
sk->sk_backlog.tail->next = skb;
|
2010-05-12 07:19:48 +08:00
|
|
|
|
2019-11-07 02:04:11 +08:00
|
|
|
WRITE_ONCE(sk->sk_backlog.tail, skb);
|
2005-11-09 01:39:42 +08:00
|
|
|
skb->next = NULL;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2010-04-28 06:13:20 +08:00
|
|
|
/*
|
|
|
|
* Take into account size of receive queue and backlog queue
|
2011-12-21 15:11:44 +08:00
|
|
|
* Do not take into account this skb truesize,
|
|
|
|
* to allow even a single big packet to come.
|
2010-04-28 06:13:20 +08:00
|
|
|
*/
|
2014-07-23 02:16:51 +08:00
|
|
|
static inline bool sk_rcvqueues_full(const struct sock *sk, unsigned int limit)
|
2010-04-28 06:13:20 +08:00
|
|
|
{
|
|
|
|
unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
|
|
|
|
|
2012-04-23 07:34:26 +08:00
|
|
|
return qsize > limit;
|
2010-04-28 06:13:20 +08:00
|
|
|
}
|
|
|
|
|
2010-03-05 02:01:40 +08:00
|
|
|
/* The per-socket spinlock must be held here. */
|
2012-04-23 07:34:26 +08:00
|
|
|
static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb,
|
|
|
|
unsigned int limit)
|
2010-03-05 02:01:40 +08:00
|
|
|
{
|
2014-07-23 02:16:51 +08:00
|
|
|
if (sk_rcvqueues_full(sk, limit))
|
2010-03-05 02:01:40 +08:00
|
|
|
return -ENOBUFS;
|
|
|
|
|
2015-09-30 09:52:25 +08:00
|
|
|
/*
|
|
|
|
* If the skb was allocated from pfmemalloc reserves, only
|
|
|
|
* allow SOCK_MEMALLOC sockets to use it as this socket is
|
|
|
|
* helping free memory
|
|
|
|
*/
|
|
|
|
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-03-05 02:01:47 +08:00
|
|
|
__sk_add_backlog(sk, skb);
|
2010-03-05 02:01:40 +08:00
|
|
|
sk->sk_backlog.len += skb->truesize;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
|
2012-08-01 07:44:26 +08:00
|
|
|
|
2021-11-16 03:02:41 +08:00
|
|
|
INDIRECT_CALLABLE_DECLARE(int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb));
|
|
|
|
INDIRECT_CALLABLE_DECLARE(int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb));
|
|
|
|
|
2008-10-08 05:18:42 +08:00
|
|
|
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
|
|
|
|
{
|
2012-08-01 07:44:26 +08:00
|
|
|
if (sk_memalloc_socks() && skb_pfmemalloc(skb))
|
|
|
|
return __sk_backlog_rcv(sk, skb);
|
|
|
|
|
2021-11-16 03:02:41 +08:00
|
|
|
return INDIRECT_CALL_INET(sk->sk_backlog_rcv,
|
|
|
|
tcp_v6_do_rcv,
|
|
|
|
tcp_v4_do_rcv,
|
|
|
|
sk, skb);
|
2008-10-08 05:18:42 +08:00
|
|
|
}
|
|
|
|
|
net: introduce SO_INCOMING_CPU
Alternative to RPS/RFS is to use hardware support for multiple
queues.
Then split a set of millions of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.
Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.
We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.
After accept(), connect(), or even file descriptor passing around
processes, applications can use :
int cpu;
socklen_t len = sizeof(cpu);
getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
And use this information to put the socket into the right silo
for optimal performance, as all networking stack should run
on the appropriate cpu, without need to send IPI (RPS/RFS).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 21:54:28 +08:00
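The getsockopt() call quoted in the commit message above can be wrapped as follows. This is a sketch under assumptions: the fallback value 49 for `SO_INCOMING_CPU` is the asm-generic number and is only used when the system headers lack the definition, and `pick_worker()` is a hypothetical policy for mapping a CPU to a worker "silo", not something the commit defines.

```c
#include <assert.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49	/* assumption: asm-generic value, for old headers */
#endif

/* Returns the CPU recorded for the socket, or -1 if the query fails. */
int incoming_cpu(int fd)
{
	int cpu = -1;
	socklen_t len = sizeof(cpu);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
		return -1;
	return cpu;
}

/* Hypothetical policy: fold CPU numbers onto nworkers, unknown CPU goes to 0. */
int pick_worker(int cpu, int nworkers)
{
	assert(nworkers > 0);
	return cpu < 0 ? 0 : cpu % nworkers;
}
```

An application would call `incoming_cpu(fd)` after accept() or connect() and hand the socket to `pick_worker(cpu, nworkers)`.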
|
|
|
static inline void sk_incoming_cpu_update(struct sock *sk)
|
|
|
|
{
|
2017-06-21 17:45:31 +08:00
|
|
|
int cpu = raw_smp_processor_id();
|
|
|
|
|
2019-10-31 04:00:04 +08:00
|
|
|
if (unlikely(READ_ONCE(sk->sk_incoming_cpu) != cpu))
|
|
|
|
WRITE_ONCE(sk->sk_incoming_cpu, cpu);
|
2014-11-11 21:54:28 +08:00
|
|
|
}
|
|
|
|
|
2013-12-22 18:54:31 +08:00
|
|
|
static inline void sock_rps_record_flow_hash(__u32 hash)
|
2010-04-28 06:05:31 +08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_RPS
|
|
|
|
struct rps_sock_flow_table *sock_flow_table;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
sock_flow_table = rcu_dereference(rps_sock_flow_table);
|
2013-12-22 18:54:31 +08:00
|
|
|
rps_record_sock_flow(sock_flow_table, hash);
|
2010-04-28 06:05:31 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2013-12-22 18:54:31 +08:00
|
|
|
static inline void sock_rps_record_flow(const struct sock *sk)
|
|
|
|
{
|
2014-01-01 04:31:01 +08:00
|
|
|
#ifdef CONFIG_RPS
|
2019-03-22 23:56:38 +08:00
|
|
|
if (static_branch_unlikely(&rfs_needed)) {
|
2016-12-08 00:29:10 +08:00
|
|
|
/* Reading sk->sk_rxhash might incur an expensive cache line
|
|
|
|
* miss.
|
|
|
|
*
|
|
|
|
* TCP_ESTABLISHED does cover almost all states where RFS
|
|
|
|
* might be useful, and is cheaper [1] than testing :
|
|
|
|
* IPv4: inet_sk(sk)->inet_daddr
|
|
|
|
* IPv6: ipv6_addr_any(&sk->sk_v6_daddr)
|
|
|
|
* OR an additional socket flag
|
|
|
|
* [1] : sk_state and sk_prot are in the same cache line.
|
|
|
|
*/
|
|
|
|
if (sk->sk_state == TCP_ESTABLISHED)
|
|
|
|
sock_rps_record_flow_hash(sk->sk_rxhash);
|
|
|
|
}
|
2014-01-01 04:31:01 +08:00
|
|
|
#endif
|
2013-12-22 18:54:31 +08:00
|
|
|
}
|
|
|
|
|
2011-08-15 03:45:55 +08:00
|
|
|
static inline void sock_rps_save_rxhash(struct sock *sk,
|
|
|
|
const struct sk_buff *skb)
|
2010-04-28 06:05:31 +08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_RPS
|
net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection uses low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for a non-connected flow can avoid the
IPI through an unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 04:59:01 +08:00
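The split of one 32-bit table entry into a CPU number and the upper hash bits, as described in the commit message above, can be sketched like this. The table layout, its fixed size of 64 entries and all `demo_*` names are assumptions made for illustration; only the bit-splitting idea comes from the commit text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow table; mask is (entries - 1), entries a power of two. */
struct demo_flow_table {
	uint32_t mask;
	uint32_t ents[64];
};

/* Smallest (2^n - 1) mask able to hold CPU ids 0 .. nr_cpus - 1. */
uint32_t demo_cpu_mask(uint32_t nr_cpus)
{
	uint32_t m = 1;

	assert(nr_cpus > 0 && nr_cpus <= (1u << 16));
	while (m < nr_cpus)
		m <<= 1;
	return m - 1;
}

/* Store the CPU in the low bits and keep the upper hash bits above them. */
void demo_record_flow(struct demo_flow_table *t, uint32_t cpu_mask,
		      uint32_t hash, uint32_t cpu)
{
	assert((cpu & ~cpu_mask) == 0);
	t->ents[hash & t->mask] = (hash & ~cpu_mask) | cpu;
}

/* Returns the recorded CPU, or -1 when the stored upper hash bits differ. */
int demo_lookup_cpu(const struct demo_flow_table *t, uint32_t cpu_mask,
		    uint32_t hash)
{
	uint32_t ident = t->ents[hash & t->mask];

	if ((ident ^ hash) & ~cpu_mask)
		return -1;	/* mismatch: the caller would fall back to RPS */
	return (int)(ident & cpu_mask);
}

/* Returns 1 when a recorded flow is found and a colliding one is rejected. */
int demo_rfs_roundtrip(void)
{
	struct demo_flow_table t = { 63, { 0 } };
	uint32_t cpu_mask = demo_cpu_mask(64);

	demo_record_flow(&t, cpu_mask, 0xABCD1234u, 5);
	return demo_lookup_cpu(&t, cpu_mask, 0xABCD1234u) == 5 &&
	       demo_lookup_cpu(&t, cpu_mask, 0x11111234u) == -1;
}
```

Both hashes in the check land in the same bucket (low bits 0x34), so the second lookup exercises the collision path.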
|
|
|
if (unlikely(sk->sk_rxhash != skb->hash))
|
2014-03-25 06:34:47 +08:00
|
|
|
sk->sk_rxhash = skb->hash;
|
2010-04-28 06:05:31 +08:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2011-08-15 03:45:55 +08:00
|
|
|
static inline void sock_rps_reset_rxhash(struct sock *sk)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_RPS
|
|
|
|
sk->sk_rxhash = 0;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2016-11-12 02:20:50 +08:00
|
|
|
#define sk_wait_event(__sk, __timeo, __condition, __wait) \
|
2007-10-09 16:59:42 +08:00
|
|
|
({ int __rc; \
|
|
|
|
release_sock(__sk); \
|
|
|
|
__rc = __condition; \
|
|
|
|
if (!__rc) { \
|
2016-11-12 02:20:50 +08:00
|
|
|
*(__timeo) = wait_woken(__wait, \
|
|
|
|
TASK_INTERRUPTIBLE, \
|
|
|
|
*(__timeo)); \
|
2007-10-09 16:59:42 +08:00
|
|
|
} \
|
2016-11-12 02:20:50 +08:00
|
|
|
sched_annotate_sleep(); \
|
2007-10-09 16:59:42 +08:00
|
|
|
lock_sock(__sk); \
|
|
|
|
__rc = __condition; \
|
|
|
|
__rc; \
|
|
|
|
})
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
int sk_stream_wait_connect(struct sock *sk, long *timeo_p);
|
|
|
|
int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
|
|
|
|
void sk_stream_wait_close(struct sock *sk, long timeo_p);
|
|
|
|
int sk_stream_error(struct sock *sk, int flags, int err);
|
|
|
|
void sk_stream_kill_queues(struct sock *sk);
|
|
|
|
void sk_set_memalloc(struct sock *sk);
|
|
|
|
void sk_clear_memalloc(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-04-30 05:16:53 +08:00
|
|
|
void __sk_flush_backlog(struct sock *sk);
|
|
|
|
|
|
|
|
static inline bool sk_flush_backlog(struct sock *sk)
|
|
|
|
{
|
|
|
|
if (unlikely(READ_ONCE(sk->sk_backlog.tail))) {
|
|
|
|
__sk_flush_backlog(sk);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2015-07-25 00:19:25 +08:00
|
|
|
int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2005-06-19 13:47:21 +08:00
|
|
|
struct request_sock_ops;
|
2005-12-14 15:25:19 +08:00
|
|
|
struct timewait_sock_ops;
|
[SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 20:06:04 +08:00
|
|
|
struct inet_hashinfo;
|
2008-03-23 07:56:51 +08:00
|
|
|
struct raw_hashinfo;
|
2017-01-09 23:55:26 +08:00
|
|
|
struct smc_hashinfo;
|
2011-05-27 01:46:22 +08:00
|
|
|
struct module;
|
2021-04-07 11:21:11 +08:00
|
|
|
struct sk_psock;
|
[NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basically tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-19 13:46:52 +08:00
|
|
|
|
2013-05-09 18:28:16 +08:00
|
|
|
/*
|
2017-01-18 18:53:44 +08:00
|
|
|
 * caches using SLAB_TYPESAFE_BY_RCU should leave the .next pointer of nulls nodes
|
2013-05-09 18:28:16 +08:00
|
|
|
 * unmodified. Special care is taken when initializing the object to zero.
|
|
|
|
*/
|
|
|
|
static inline void sk_prot_clear_nulls(struct sock *sk, int size)
|
|
|
|
{
|
|
|
|
if (offsetof(struct sock, sk_node.next) != 0)
|
|
|
|
memset(sk, 0, offsetof(struct sock, sk_node.next));
|
|
|
|
memset(&sk->sk_node.pprev, 0,
|
|
|
|
size - offsetof(struct sock, sk_node.pprev));
|
|
|
|
}
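The two memset() calls above zero everything in the object except one pointer field. A userspace sketch of the same technique is given below with a hypothetical `struct demo_obj`; it uses a byte pointer plus `offsetof()` for the second memset(), which is assumed to be equivalent to taking the address of the `pprev` member.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_node {
	struct demo_node *next;		/* must survive the clear */
	struct demo_node **pprev;
};

struct demo_obj {			/* hypothetical object layout */
	int before;
	struct demo_node node;
	int after;
};

/* Zero the object but leave node.next untouched. */
void demo_clear_keep_next(struct demo_obj *obj, size_t size)
{
	assert(obj != NULL && size >= sizeof(*obj));

	/* Bytes in front of node.next (none if it is the first field). */
	if (offsetof(struct demo_obj, node.next) != 0)
		memset(obj, 0, offsetof(struct demo_obj, node.next));

	/* Bytes from node.pprev up to the end of the object. */
	memset((char *)obj + offsetof(struct demo_obj, node.pprev), 0,
	       size - offsetof(struct demo_obj, node.pprev));
}

/* Returns 1 when only node.next keeps its old value. */
int demo_clear_check(void)
{
	struct demo_node sentinel = { NULL, NULL };
	struct demo_obj o = { 1, { &sentinel, &sentinel.next }, 3 };

	demo_clear_keep_next(&o, sizeof(o));
	return o.before == 0 && o.after == 0 &&
	       o.node.pprev == NULL && o.node.next == &sentinel;
}
```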
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Networking protocol blocks we attach to sockets.
|
|
|
|
* socket layer -> transport layer interface
|
|
|
|
*/
|
|
|
|
struct proto {
|
2012-05-17 06:48:15 +08:00
|
|
|
void (*close)(struct sock *sk,
|
2005-04-17 06:20:36 +08:00
|
|
|
long timeout);
|
2018-03-31 06:08:05 +08:00
|
|
|
int (*pre_connect)(struct sock *sk,
|
|
|
|
struct sockaddr *uaddr,
|
|
|
|
int addr_len);
|
2005-04-17 06:20:36 +08:00
|
|
|
int (*connect)(struct sock *sk,
|
2012-05-17 06:48:15 +08:00
|
|
|
struct sockaddr *uaddr,
|
2005-04-17 06:20:36 +08:00
|
|
|
int addr_len);
|
|
|
|
int (*disconnect)(struct sock *sk, int flags);
|
|
|
|
|
2017-03-09 16:09:05 +08:00
|
|
|
struct sock * (*accept)(struct sock *sk, int flags, int *err,
|
|
|
|
bool kern);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
int (*ioctl)(struct sock *sk, int cmd,
|
|
|
|
unsigned long arg);
|
|
|
|
int (*init)(struct sock *sk);
|
2008-06-15 08:04:49 +08:00
|
|
|
void (*destroy)(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
void (*shutdown)(struct sock *sk, int how);
|
2012-05-17 06:48:15 +08:00
|
|
|
int (*setsockopt)(struct sock *sk, int level,
|
2020-07-23 14:09:07 +08:00
|
|
|
int optname, sockptr_t optval,
|
2009-10-01 07:12:20 +08:00
|
|
|
unsigned int optlen);
|
2012-05-17 06:48:15 +08:00
|
|
|
int (*getsockopt)(struct sock *sk, int level,
|
|
|
|
int optname, char __user *optval,
|
|
|
|
int __user *option);
|
2017-01-09 23:55:12 +08:00
|
|
|
void (*keepalive)(struct sock *sk, int valbool);
|
2008-08-28 17:53:51 +08:00
|
|
|
#ifdef CONFIG_COMPAT
|
2011-01-30 00:15:56 +08:00
|
|
|
int (*compat_ioctl)(struct sock *sk,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
2008-08-28 17:53:51 +08:00
|
|
|
#endif
|
2015-03-02 15:37:48 +08:00
|
|
|
int (*sendmsg)(struct sock *sk, struct msghdr *msg,
|
|
|
|
size_t len);
|
|
|
|
int (*recvmsg)(struct sock *sk, struct msghdr *msg,
|
net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().
Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
sk, msg, size, flags & MSG_DONTWAIT,
flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-11 20:49:55 +08:00
|
|
|
size_t len, int flags, int *addr_len);
|
2005-04-17 06:20:36 +08:00
|
|
|
int (*sendpage)(struct sock *sk, struct page *page,
|
|
|
|
int offset, size_t size, int flags);
|
2012-05-17 06:48:15 +08:00
|
|
|
int (*bind)(struct sock *sk,
|
2020-05-29 20:09:42 +08:00
|
|
|
struct sockaddr *addr, int addr_len);
|
|
|
|
int (*bind_add)(struct sock *sk,
|
|
|
|
struct sockaddr *addr, int addr_len);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
int (*backlog_rcv) (struct sock *sk,
|
2005-04-17 06:20:36 +08:00
|
|
|
struct sk_buff *skb);
|
2021-01-16 00:34:59 +08:00
|
|
|
bool (*bpf_bypass_getsockopt)(int level,
|
|
|
|
int optname);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
tcp: TCP Small Queues
This introduces TSQ (TCP Small Queues)
TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 13:50:31 +08:00
|
|
|
void (*release_cb)(struct sock *sk);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Keeping track of sk's, looking them up, and port selection methods. */
|
2016-02-11 00:50:35 +08:00
|
|
|
int (*hash)(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
void (*unhash)(struct sock *sk);
|
udp: add rehash on connect()
commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).
Problem is that following sequence :
fd = socket(...)
connect(fd, &remote, ...)
not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)
Sequence is :
- autobind() : choose a random local port, insert socket in hash tables
[while local address is INADDR_ANY]
- connect() : set remote address and port, change local address to IP
given by a route lookup.
When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.
One solution to this problem is to rehash datagram socket if needed.
We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.
This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.
Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08 13:08:44 +08:00
|
|
|
void (*rehash)(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
int (*get_port)(struct sock *sk, unsigned short snum);
|
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. When the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP as an example and see what will happen. A UDP
socket is added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
call succeeds. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in an 'hslot2' of 'hash2' that it doesn't belong
to (because inet_saddr is changed to 0), and received UDP packets
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packets
properly, which is weird, as the 'bind()' has already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fails because of
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), the socket will be unbound, which
means that it can try to bind to another port.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06 21:20:20 +08:00
|
|
|
void (*put_port)(struct sock *sk);
|
2021-03-31 10:32:31 +08:00
|
|
|
#ifdef CONFIG_BPF_SYSCALL
|
2021-04-07 11:21:11 +08:00
|
|
|
int (*psock_update_sk_prot)(struct sock *sk,
|
|
|
|
struct sk_psock *psock,
|
|
|
|
bool restore);
|
2021-03-31 10:32:31 +08:00
|
|
|
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
|
[NET]: Define infrastructure to keep 'inuse' changes in an efficient SMP/NUMA way.
"struct proto" currently uses an array stats[NR_CPUS] to track change on
'inuse' sockets per protocol.
If NR_CPUS is big, this means we use a big memory area for this.
Moreover, all this memory area is located on a single node on NUMA
machines, increasing memory pressure on the boot node.
In this patch, I tried to :
- Keep a fast !CONFIG_SMP implementation
- Keep a fast CONFIG_SMP implementation for often used protocols
(tcp,udp,raw,...)
- Introduce a NUMA efficient implementation
Some helper macros are defined in include/net/sock.h
These macros take into account CONFIG_SMP
If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
REF_PROTO_INUSE
macros, it will automatically use a default implementation, using a
dynamically allocated percpu zone.
This default implementation will be NUMA efficient, but might use 32/64
bytes per possible cpu
because of current alloc_percpu() implementation.
However it still should be better than previous implementation based on
stats[NR_CPUS] field.
When a "struct proto" is changed to use the new macros, we use a single
static "int" percpu variable,
lowering the memory and cpu costs, still preserving NUMA efficiency.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-06 15:38:39 +08:00
|
|
|
/* Keeping track of sockets in use */
|
2008-01-04 12:46:48 +08:00
|
|
|
#ifdef CONFIG_PROC_FS
|
2008-03-29 07:38:17 +08:00
|
|
|
unsigned int inuse_idx;
|
2008-01-04 12:46:48 +08:00
|
|
|
#endif
|
2007-11-21 22:08:50 +08:00
|
|
|
|
2021-11-16 03:02:38 +08:00
|
|
|
#if IS_ENABLED(CONFIG_MPTCP)
|
2021-10-27 07:29:14 +08:00
|
|
|
int (*forward_alloc_get)(const struct sock *sk);
|
2021-11-16 03:02:38 +08:00
|
|
|
#endif
|
2021-10-27 07:29:14 +08:00
|
|
|
|
2018-12-04 23:58:17 +08:00
|
|
|
bool (*stream_memory_free)(const struct sock *sk, int wake);
|
2021-10-09 04:33:03 +08:00
|
|
|
bool (*sock_is_readable)(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Memory pressure */
|
2008-07-17 11:28:10 +08:00
|
|
|
void (*enter_memory_pressure)(struct sock *sk);
|
2017-06-08 04:29:12 +08:00
|
|
|
void (*leave_memory_pressure)(struct sock *sk);
|
2010-11-10 07:24:26 +08:00
|
|
|
atomic_long_t *memory_allocated; /* Current allocated memory. */
|
2022-06-09 14:34:08 +08:00
|
|
|
int __percpu *per_cpu_fw_alloc;
|
2008-11-26 13:16:35 +08:00
|
|
|
struct percpu_counter *sockets_allocated; /* Current number of sockets. */
|
2021-10-27 07:29:14 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Pressure flag: try to collapse.
|
|
|
|
* Technical note: it is used by multiple contexts non atomically.
|
2007-12-31 16:11:19 +08:00
|
|
|
* All the __sk_mem_schedule() is of this nature: accounting
|
2005-04-17 06:20:36 +08:00
|
|
|
* is strict, actions are advisory and have some latency.
|
|
|
|
*/
|
2017-06-08 04:29:12 +08:00
|
|
|
unsigned long *memory_pressure;
|
2010-11-10 07:24:26 +08:00
|
|
|
long *sysctl_mem;
|
2017-11-07 16:29:27 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
int *sysctl_wmem;
|
|
|
|
int *sysctl_rmem;
|
2017-11-07 16:29:27 +08:00
|
|
|
u32 sysctl_wmem_offset;
|
|
|
|
u32 sysctl_rmem_offset;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
int max_header;
|
2010-07-11 04:41:55 +08:00
|
|
|
bool no_autobind;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
udp: RCU handling for Unicast packets.
Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in socket structure, increasing memory needs,
but more important, forcing us to use call_rcu() calls,
that have the bad property of making sockets structure cold.
(rcu grace period between socket freeing and its potential reuse
make this socket being cold in CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Cristopher Lameter :
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups at the same time.
In order to avoid loops, a reader must check that each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the beginning. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we don't want to send the same message several times to the same socket.
We use RCU only for fast path.
Thus, /proc/net/udp still takes spinlocks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 17:11:14 +08:00
|
|
|
struct kmem_cache *slab;
|
2005-04-17 06:20:36 +08:00
|
|
|
unsigned int obj_size;
|
2017-11-16 09:32:18 +08:00
|
|
|
slab_flags_t slab_flags;
|
2018-04-06 07:21:31 +08:00
|
|
|
unsigned int useroffset; /* Usercopy region offset */
|
|
|
|
unsigned int usersize; /* Usercopy region size */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-10-14 21:41:26 +08:00
|
|
|
unsigned int __percpu *orphan_count;
|
2005-08-10 11:09:30 +08:00
|
|
|
|
2005-06-19 13:47:21 +08:00
|
|
|
struct request_sock_ops *rsk_prot;
|
2005-12-14 15:25:19 +08:00
|
|
|
struct timewait_sock_ops *twsk_prot;
|
2005-06-19 13:46:52 +08:00
|
|
|
|
2008-03-23 07:50:58 +08:00
|
|
|
union {
|
|
|
|
struct inet_hashinfo *hashinfo;
|
2008-10-29 16:41:45 +08:00
|
|
|
struct udp_table *udp_table;
|
2008-03-23 07:56:51 +08:00
|
|
|
struct raw_hashinfo *raw_hash;
|
2017-01-09 23:55:26 +08:00
|
|
|
struct smc_hashinfo *smc_hash;
|
2008-03-23 07:50:58 +08:00
|
|
|
} h;
|
[SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 20:06:04 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
struct module *owner;
|
|
|
|
|
|
|
|
char name[32];
|
|
|
|
|
|
|
|
struct list_head node;
|
2005-08-10 10:45:38 +08:00
|
|
|
#ifdef SOCK_REFCNT_DEBUG
|
|
|
|
atomic_t socks;
|
2011-12-12 05:47:03 +08:00
|
|
|
#endif
|
2015-12-16 11:30:03 +08:00
|
|
|
int (*diag_destroy)(struct sock *sk, int err);
|
2016-10-28 16:22:25 +08:00
|
|
|
} __randomize_layout;
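struct proto is a table of per-protocol callbacks, some of which come in pairs: put_port exists so that a successful get_port can be undone when a later step of bind() fails. The sketch below is a user-space toy of that pairing, not the kernel's bind path; struct toy_proto, struct toy_sock, toy_bind() and the post_bind_err parameter (standing in for a late hook that may veto the bind) are all assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PORTS 8

struct toy_sock;

/* Hypothetical, much-reduced analogue of struct proto. */
struct toy_proto {
	int  (*get_port)(struct toy_sock *sk, unsigned short snum);
	void (*put_port)(struct toy_sock *sk);   /* optional, may be NULL */
};

struct toy_sock {
	const struct toy_proto *prot;
	unsigned short port;                     /* 0 means "not bound" */
};

static bool toy_port_used[TOY_PORTS];

/* Reserve snum for sk; fails if the port is invalid or already taken. */
static int toy_get_port(struct toy_sock *sk, unsigned short snum)
{
	if (snum == 0 || snum >= TOY_PORTS || toy_port_used[snum])
		return -1;
	toy_port_used[snum] = true;
	sk->port = snum;
	return 0;
}

/* Release whatever toy_get_port() reserved. */
static void toy_put_port(struct toy_sock *sk)
{
	toy_port_used[sk->port] = false;
	sk->port = 0;
}

static const struct toy_proto toy_prot = {
	.get_port = toy_get_port,
	.put_port = toy_put_port,
};

/* Bind sk to snum; a non-zero post_bind_err models a late veto. */
static int toy_bind(struct toy_sock *sk, unsigned short snum, int post_bind_err)
{
	int err = sk->prot->get_port(sk, snum);

	if (err)
		return err;
	if (post_bind_err) {
		/* Undo get_port() so the failed bind does not keep the port. */
		if (sk->prot->put_port)
			sk->prot->put_port(sk);
		return post_bind_err;
	}
	return 0;
}
```

Because the failing path releases the port, a second socket can bind to the same port right after the first bind was vetoed. Without the put_port call the port would stay reserved by a socket whose bind() reported an error.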
|
2011-12-12 05:47:03 +08:00
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
int proto_register(struct proto *prot, int alloc_slab);
|
|
|
|
void proto_unregister(struct proto *prot);
|
sock_diag: request _diag module only when the family or proto has been registered
Now when using 'ss' in iproute, kernel would try to load all _diag
modules, which also causes corresponding family and proto modules
to be loaded as well due to module dependencies.
Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
would be loaded.
For example:
$ lsmod|grep sctp
$ ss
$ lsmod|grep sctp
sctp_diag 16384 0
sctp 323584 5 sctp_diag
inet_diag 24576 4 raw_diag,tcp_diag,sctp_diag,udp_diag
libcrc32c 16384 3 nf_conntrack,nf_nat,sctp
As these family and proto modules are loaded unintentionally, it
could cause some problems, like:
- Some debug tools use 'ss' to collect the socket info, which loads all
those diag and family and protocol modules. It's noisy for identifying
issues.
- Users usually expect to drop sctp init packet silently when they
have no sense of sctp protocol instead of sending abort back.
- It wastes resources (especially with multiple netns), and SCTP module
can't be unloaded once it's loaded.
...
In short, it's really inappropriate to have these family and proto
modules loaded unexpectedly when just doing debugging with inet_diag.
This patch introduces sock_load_diag_module(), which loads
the _diag module only when its corresponding family or proto has
already been registered.
Note that we can't just load _diag module without the family or
proto loaded, as some symbols used in _diag module are from the
family or proto module.
v1->v2:
- move inet proto check to inet_diag to avoid a compiling err.
v2->v3:
- define sock_load_diag_module in sock.c and export one symbol
only.
- improve the changelog.
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-10 18:57:50 +08:00
|
|
|
int sock_load_diag_module(int family, int protocol);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2005-08-10 10:45:38 +08:00
|
|
|
#ifdef SOCK_REFCNT_DEBUG
|
|
|
|
static inline void sk_refcnt_debug_inc(struct sock *sk)
|
|
|
|
{
|
|
|
|
atomic_inc(&sk->sk_prot->socks);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_refcnt_debug_dec(struct sock *sk)
|
|
|
|
{
|
|
|
|
atomic_dec(&sk->sk_prot->socks);
|
|
|
|
printk(KERN_DEBUG "%s socket %p released, %d are still alive\n",
|
|
|
|
sk->sk_prot->name, sk, atomic_read(&sk->sk_prot->socks));
|
|
|
|
}
|
|
|
|
|
2013-02-16 06:28:25 +08:00
|
|
|
static inline void sk_refcnt_debug_release(const struct sock *sk)
|
2005-08-10 10:45:38 +08:00
|
|
|
{
|
2017-06-30 18:08:01 +08:00
|
|
|
if (refcount_read(&sk->sk_refcnt) != 1)
|
2005-08-10 10:45:38 +08:00
|
|
|
printk(KERN_DEBUG "Destruction of the %s socket %p delayed, refcnt=%d\n",
|
2017-06-30 18:08:01 +08:00
|
|
|
sk->sk_prot->name, sk, refcount_read(&sk->sk_refcnt));
|
2005-08-10 10:45:38 +08:00
|
|
|
}
|
|
|
|
#else /* SOCK_REFCNT_DEBUG */
|
|
|
|
#define sk_refcnt_debug_inc(sk) do { } while (0)
|
|
|
|
#define sk_refcnt_debug_dec(sk) do { } while (0)
|
|
|
|
#define sk_refcnt_debug_release(sk) do { } while (0)
|
|
|
|
#endif /* SOCK_REFCNT_DEBUG */
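The SOCK_REFCNT_DEBUG block above uses a common pattern: real helpers when the debug switch is defined, and empty do { } while (0) macros otherwise, so call sites need no #ifdef of their own. A user-space sketch of that pattern is shown here; TOY_REFCNT_DEBUG, struct toy_proto and the toy_refcnt_debug_*() helpers are hypothetical names, and a plain int replaces the kernel's atomic_t.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_REFCNT_DEBUG 1   /* hypothetical switch; remove to compile the hooks out */

/* Hypothetical stand-in for the counter kept in struct proto. */
struct toy_proto {
	const char *name;
	int socks;               /* plain int here; the kernel uses atomic_t */
};

#ifdef TOY_REFCNT_DEBUG
static inline void toy_refcnt_debug_inc(struct toy_proto *p)
{
	p->socks++;
}

static inline void toy_refcnt_debug_dec(struct toy_proto *p)
{
	p->socks--;
	fprintf(stderr, "%s socket released, %d are still alive\n",
		p->name, p->socks);
}
#else
/* Empty statements that still demand a trailing semicolon at the call site. */
#define toy_refcnt_debug_inc(p) do { } while (0)
#define toy_refcnt_debug_dec(p) do { } while (0)
#endif
```

The do { } while (0) form matters when the macro is used as the body of an unbraced if/else: it expands to exactly one statement, so the else still attaches to the intended if.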
|
|
|
|
|
2020-11-13 23:08:09 +08:00
|
|
|
INDIRECT_CALLABLE_DECLARE(bool tcp_stream_memory_free(const struct sock *sk, int wake));
|
|
|
|
|
2021-10-27 07:29:14 +08:00
|
|
|
static inline int sk_forward_alloc_get(const struct sock *sk)
|
|
|
|
{
|
2021-11-16 03:02:38 +08:00
|
|
|
#if IS_ENABLED(CONFIG_MPTCP)
|
|
|
|
if (sk->sk_prot->forward_alloc_get)
|
|
|
|
return sk->sk_prot->forward_alloc_get(sk);
|
|
|
|
#endif
|
|
|
|
return sk->sk_forward_alloc;
|
2021-10-27 07:29:14 +08:00
|
|
|
}
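sk_forward_alloc_get() shows how optional struct proto callbacks are consumed: if the protocol filled in the pointer it is called, otherwise a generic default applies. The following user-space sketch mirrors that dispatch under invented names (struct toy_proto, struct toy_sock, toy_combined_get() and the extra_alloc field are assumptions for illustration only).

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock;

/* Hypothetical ops table with one optional callback. */
struct toy_proto {
	int (*forward_alloc_get)(const struct toy_sock *sk);   /* may be NULL */
};

struct toy_sock {
	const struct toy_proto *prot;
	int forward_alloc;
	int extra_alloc;     /* only meaningful to the overriding protocol */
};

/* Use the protocol's override when present, else the generic field. */
static int toy_forward_alloc_get(const struct toy_sock *sk)
{
	if (sk->prot->forward_alloc_get)
		return sk->prot->forward_alloc_get(sk);
	return sk->forward_alloc;
}

/* Example override: a protocol that accounts memory in two places. */
static int toy_combined_get(const struct toy_sock *sk)
{
	return sk->forward_alloc + sk->extra_alloc;
}

static const struct toy_proto toy_plain = { .forward_alloc_get = NULL };
static const struct toy_proto toy_override = { .forward_alloc_get = toy_combined_get };
```

A socket using toy_plain reports its own forward_alloc, while one using toy_override reports the combined value. Callers go through toy_forward_alloc_get() in both cases and never test the pointer themselves.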
|
|
|
|
|
2018-12-04 23:58:17 +08:00
|
|
|
static inline bool __sk_stream_memory_free(const struct sock *sk, int wake)
|
tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.
TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :
Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)
For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or to be woken up in a blocking sendmsg())
This patch adds two ways to set the limit :
1) Per socket option TCP_NOTSENT_LOWAT
2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->notsent_lowat
Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol)
Tested:
netperf sessions, and watching /proc/net/protocols "memory" column for TCP
With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.
A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
412,514 context-switches
200.034645535 seconds time elapsed
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
2,675,818 context-switches
200.029651391 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 11:27:07 +08:00
|
|
|
{
|
2019-10-11 11:17:46 +08:00
|
|
|
if (READ_ONCE(sk->sk_wmem_queued) >= READ_ONCE(sk->sk_sndbuf))
|
2013-07-23 11:27:07 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
return sk->sk_prot->stream_memory_free ?
|
2021-10-28 09:12:05 +08:00
|
|
|
INDIRECT_CALL_INET_1(sk->sk_prot->stream_memory_free,
|
|
|
|
tcp_stream_memory_free, sk, wake) : true;
|
2013-07-23 11:27:07 +08:00
|
|
|
}
|
|
|
|
|
2018-12-04 23:58:17 +08:00
|
|
|
static inline bool sk_stream_memory_free(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return __sk_stream_memory_free(sk, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool __sk_stream_is_writeable(const struct sock *sk, int wake)
|
2013-07-23 11:26:31 +08:00
|
|
|
{
|
2013-07-23 11:27:07 +08:00
|
|
|
return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
|
2018-12-04 23:58:17 +08:00
|
|
|
__sk_stream_memory_free(sk, wake);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool sk_stream_is_writeable(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return __sk_stream_is_writeable(sk, 0);
|
2013-07-23 11:26:31 +08:00
|
|
|
}
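The writeability test above can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel code: `struct model_sock` and every `model_*` name are hypothetical, and the bodies assumed for `sk_stream_wspace()`, `sk_stream_min_wspace()` and the TCP `stream_memory_free` callback are not shown in this part of the file.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the few struct sock fields involved. */
struct model_sock {
	int sndbuf;                 /* stands in for sk->sk_sndbuf */
	int wmem_queued;            /* stands in for sk->sk_wmem_queued */
	unsigned int notsent;       /* bytes queued but not yet sent */
	unsigned int notsent_lowat; /* per-socket option or sysctl value */
};

/* Assumed shape of the TCP callback: unsent bytes must stay below the limit.
 * With wake == 1 the unsent count is doubled, so wakeups need more headroom.
 */
static bool model_stream_memory_free(const struct model_sock *sk, int wake)
{
	return (sk->notsent << wake) < sk->notsent_lowat;
}

/* Mirrors __sk_stream_is_writeable(): enough send-buffer space AND
 * the protocol reports that stream memory is free.
 */
static bool model_stream_is_writeable(const struct model_sock *sk, int wake)
{
	int wspace = sk->sndbuf - sk->wmem_queued; /* assumed sk_stream_wspace() */
	int min_wspace = sk->wmem_queued >> 1;     /* assumed sk_stream_min_wspace() */

	return wspace >= min_wspace && model_stream_memory_free(sk, wake);
}
```

With a 128 KB limit and 70000 unsent bytes, the model reports the socket writeable for `wake == 0` but not for `wake == 1`.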
|
2011-12-12 05:47:03 +08:00
|
|
|
|
2016-08-18 07:00:41 +08:00
|
|
|
static inline int sk_under_cgroup_hierarchy(struct sock *sk,
|
|
|
|
struct cgroup *ancestor)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_SOCK_CGROUP_DATA
|
|
|
|
return cgroup_is_descendant(sock_cgroup_ptr(&sk->sk_cgrp_data),
|
|
|
|
ancestor);
|
|
|
|
#else
|
|
|
|
return -ENOTSUPP;
|
|
|
|
#endif
|
|
|
|
}
|
2013-07-23 11:27:07 +08:00
|
|
|
|
2011-12-12 05:47:02 +08:00
|
|
|
static inline bool sk_has_memory_pressure(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return sk->sk_prot->memory_pressure != NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool sk_under_memory_pressure(const struct sock *sk)
|
|
|
|
{
|
|
|
|
if (!sk->sk_prot->memory_pressure)
|
|
|
|
return false;
|
2011-12-12 05:47:03 +08:00
|
|
|
|
2016-01-15 07:21:17 +08:00
|
|
|
if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
|
|
|
|
mem_cgroup_under_socket_pressure(sk->sk_memcg))
|
net: tcp_memcontrol: sanitize tcp memory accounting callbacks
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.
Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.
Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable. However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either. However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchronously when packets are processed). When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.
As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 07:21:14 +08:00
|
|
|
return true;
|
2011-12-12 05:47:03 +08:00
|
|
|
|
2013-10-24 03:49:21 +08:00
|
|
|
return !!*sk->sk_prot->memory_pressure;
|
2011-12-12 05:47:02 +08:00
|
|
|
}
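The decision order in sk_under_memory_pressure() can be sketched as a pure function. This is only an illustration: `model_under_memory_pressure` and its parameters are hypothetical names that stand in for the protocol's `memory_pressure` pointer, the `mem_cgroup_sockets_enabled` switch together with a non-NULL `sk->sk_memcg`, and the result of `mem_cgroup_under_socket_pressure()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* global_pressure: NULL when the protocol does not track memory pressure.
 * memcg_active:    models mem_cgroup_sockets_enabled && sk->sk_memcg.
 * memcg_pressure:  models mem_cgroup_under_socket_pressure(sk->sk_memcg).
 */
static bool model_under_memory_pressure(const unsigned long *global_pressure,
					bool memcg_active,
					bool memcg_pressure)
{
	/* Protocols without a pressure flag are never under pressure. */
	if (!global_pressure)
		return false;

	/* Either level asserting pressure is enough. */
	if (memcg_active && memcg_pressure)
		return true;

	return !!*global_pressure;
}
```

Note that the per-memcg state is consulted only for protocols that also expose a global pressure flag, which matches the early return in the function above.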
|
|
|
|
|
2022-06-09 14:34:09 +08:00
|
|
|
static inline long
|
|
|
|
proto_memory_allocated(const struct proto *prot)
|
|
|
|
{
|
|
|
|
return max(0L, atomic_long_read(prot->memory_allocated));
|
|
|
|
}
|
|
|
|
|
2011-12-12 05:47:02 +08:00
|
|
|
static inline long
|
|
|
|
sk_memory_allocated(const struct sock *sk)
|
|
|
|
{
|
2022-06-09 14:34:09 +08:00
|
|
|
return proto_memory_allocated(sk->sk_prot);
|
2011-12-12 05:47:02 +08:00
|
|
|
}
|
|
|
|
|
2022-06-09 14:34:09 +08:00
|
|
|
/* 1 MB per cpu, in page units */
|
|
|
|
#define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))
|
|
|
|
|
2022-06-11 11:30:16 +08:00
|
|
|
static inline void
|
2016-01-15 07:21:14 +08:00
|
|
|
sk_memory_allocated_add(struct sock *sk, int amt)
|
2011-12-12 05:47:02 +08:00
|
|
|
{
|
2022-06-09 14:34:09 +08:00
|
|
|
int local_reserve;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
|
|
|
|
if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
|
|
|
|
__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
|
|
|
|
atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
|
|
|
|
}
|
|
|
|
preempt_enable();
|
2011-12-12 05:47:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
net: introduce res_counter_charge_nofail() for socket allocations
There is a case in __sk_mem_schedule(), where an allocation
is beyond the maximum, but yet we are allowed to proceed.
It happens under the following condition:
sk->sk_wmem_queued + size >= sk->sk_sndbuf
The network code won't revert the allocation in this case,
meaning that at some point later it'll try to do it. Since
this is never communicated to the underlying res_counter
code, there is an imbalance in res_counter uncharge operation.
I see two ways of fixing this:
1) storing the information about those allocations somewhere
in memcg, and then deducting from that first, before
we start draining the res_counter,
2) providing a slightly different allocation function for
the res_counter, that matches the original behavior of
the network code more closely.
I decided to go for #2 here, believing it to be more elegant,
since #1 would require us to do basically that, but in a more
obscure way.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Laurent Chavey <chavey@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-20 12:57:16 +08:00
|
|
|
sk_memory_allocated_sub(struct sock *sk, int amt)
|
2011-12-12 05:47:02 +08:00
|
|
|
{
|
2022-06-09 14:34:09 +08:00
|
|
|
int local_reserve;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
local_reserve = __this_cpu_sub_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
|
|
|
|
if (local_reserve <= -SK_MEMORY_PCPU_RESERVE) {
|
|
|
|
__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
|
|
|
|
atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
|
|
|
|
}
|
|
|
|
preempt_enable();
|
2011-12-12 05:47:02 +08:00
|
|
|
}
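The per-CPU batching done by sk_memory_allocated_add() and sk_memory_allocated_sub() can be illustrated with a single-threaded model. This is a sketch under stated assumptions: one `struct model_acct` stands in for one CPU's `per_cpu_fw_alloc` slot plus the shared `memory_allocated` counter, 4 KB pages are assumed (so the 1 MB reserve is 256 pages), and preemption control and atomics are omitted because the model runs on one thread.

```c
/* 1 MB expressed in pages, assuming PAGE_SHIFT == 12 (hypothetical value). */
#define MODEL_PCPU_RESERVE (1 << (20 - 12))

struct model_acct {
	long global; /* stands in for *prot->memory_allocated */
	int local;   /* stands in for this CPU's per_cpu_fw_alloc */
};

/* Charge amt pages locally; flush to the global counter once the
 * local reserve reaches the threshold.
 */
static void model_allocated_add(struct model_acct *a, int amt)
{
	a->local += amt;
	if (a->local >= MODEL_PCPU_RESERVE) {
		a->global += a->local;
		a->local = 0;
	}
}

/* Uncharge amt pages locally; flush once the local deficit reaches
 * the threshold. The flushed value is negative, so adding it lowers
 * the global counter.
 */
static void model_allocated_sub(struct model_acct *a, int amt)
{
	a->local -= amt;
	if (a->local <= -MODEL_PCPU_RESERVE) {
		a->global += a->local;
		a->local = 0;
	}
}
```

The point of the design is that small charges touch only the local slot, so the shared counter is updated once per megabyte or so rather than on every call.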
|
|
|
|
|
2021-02-03 03:34:08 +08:00
|
|
|
#define SK_ALLOC_PERCPU_COUNTER_BATCH 16
|
|
|
|
|
2011-12-12 05:47:02 +08:00
|
|
|
static inline void sk_sockets_allocated_dec(struct sock *sk)
|
|
|
|
{
|
2021-02-03 03:34:08 +08:00
|
|
|
percpu_counter_add_batch(sk->sk_prot->sockets_allocated, -1,
|
|
|
|
SK_ALLOC_PERCPU_COUNTER_BATCH);
|
2011-12-12 05:47:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_sockets_allocated_inc(struct sock *sk)
|
|
|
|
{
|
2021-02-03 03:34:08 +08:00
|
|
|
percpu_counter_add_batch(sk->sk_prot->sockets_allocated, 1,
|
|
|
|
SK_ALLOC_PERCPU_COUNTER_BATCH);
|
2011-12-12 05:47:02 +08:00
|
|
|
}
|
|
|
|
|
2019-02-13 04:26:27 +08:00
|
|
|
static inline u64
|
2011-12-12 05:47:02 +08:00
|
|
|
sk_sockets_allocated_read_positive(struct sock *sk)
|
|
|
|
{
|
2016-01-15 07:21:08 +08:00
|
|
|
return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);
|
2011-12-12 05:47:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int
|
|
|
|
proto_sockets_allocated_sum_positive(struct proto *prot)
|
|
|
|
{
|
|
|
|
return percpu_counter_sum_positive(prot->sockets_allocated);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool
|
|
|
|
proto_memory_pressure(struct proto *prot)
|
|
|
|
{
|
|
|
|
if (!prot->memory_pressure)
|
|
|
|
return false;
|
|
|
|
return !!*prot->memory_pressure;
|
|
|
|
}
|
|
|
|
|
2008-01-04 12:46:48 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_PROC_FS
|
2021-11-16 01:11:47 +08:00
|
|
|
#define PROTO_INUSE_NR 64 /* should be enough for the first time */
|
|
|
|
struct prot_inuse {
|
2021-11-16 01:11:49 +08:00
|
|
|
int all;
|
2021-11-16 01:11:47 +08:00
|
|
|
int val[PROTO_INUSE_NR];
|
|
|
|
};
|
2021-11-16 01:11:50 +08:00
|
|
|
|
2021-11-16 01:11:47 +08:00
|
|
|
static inline void sock_prot_inuse_add(const struct net *net,
|
|
|
|
const struct proto *prot, int val)
|
|
|
|
{
|
2021-11-16 01:11:50 +08:00
|
|
|
this_cpu_add(net->core.prot_inuse->val[prot->inuse_idx], val);
|
2021-11-16 01:11:47 +08:00
|
|
|
}
|
2021-11-16 01:11:48 +08:00
|
|
|
|
|
|
|
static inline void sock_inuse_add(const struct net *net, int val)
|
|
|
|
{
|
2021-11-16 01:11:49 +08:00
|
|
|
this_cpu_add(net->core.prot_inuse->all, val);
|
2021-11-16 01:11:48 +08:00
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
int sock_prot_inuse_get(struct net *net, struct proto *proto);
|
2017-12-14 21:51:58 +08:00
|
|
|
int sock_inuse_get(struct net *net);
|
2008-01-04 12:46:48 +08:00
|
|
|
#else
|
2021-11-16 01:11:47 +08:00
|
|
|
static inline void sock_prot_inuse_add(const struct net *net,
|
|
|
|
const struct proto *prot, int val)
|
2008-01-04 12:46:48 +08:00
|
|
|
{
|
|
|
|
}
|
2021-11-16 01:11:48 +08:00
|
|
|
|
|
|
|
static inline void sock_inuse_add(const struct net *net, int val)
|
|
|
|
{
|
|
|
|
}
|
2008-01-04 12:46:48 +08:00
|
|
|
#endif
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2005-08-10 10:47:37 +08:00
|
|
|
/* With per-bucket locks this operation is not-atomic, so that
|
|
|
|
* this version is not worse.
|
|
|
|
*/
|
2016-02-11 00:50:35 +08:00
|
|
|
static inline int __sk_prot_rehash(struct sock *sk)
|
2005-08-10 10:47:37 +08:00
|
|
|
{
|
|
|
|
sk->sk_prot->unhash(sk);
|
2016-02-11 00:50:35 +08:00
|
|
|
return sk->sk_prot->hash(sk);
|
2005-08-10 10:47:37 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* About 10 seconds */
|
|
|
|
#define SOCK_DESTROY_TIME (10*HZ)
|
|
|
|
|
|
|
|
/* Sockets 0-1023 can't be bound to unless you are superuser */
|
|
|
|
#define PROT_SOCK 1024
|
|
|
|
|
|
|
|
#define SHUTDOWN_MASK 3
|
|
|
|
#define RCV_SHUTDOWN 1
|
|
|
|
#define SEND_SHUTDOWN 2
|
|
|
|
|
|
|
|
#define SOCK_BINDADDR_LOCK 4
|
|
|
|
#define SOCK_BINDPORT_LOCK 8
|
|
|
|
|
|
|
|
struct socket_alloc {
|
|
|
|
struct socket socket;
|
|
|
|
struct inode vfs_inode;
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline struct socket *SOCKET_I(struct inode *inode)
|
|
|
|
{
|
|
|
|
return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct inode *SOCK_INODE(struct socket *socket)
|
|
|
|
{
|
|
|
|
return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
|
|
|
|
}
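SOCKET_I() and SOCK_INODE() rely on both objects living in one `struct socket_alloc`, so either pointer can be recovered from the other. A minimal userspace sketch of that idea follows; the `model_*` types and the simplified `model_container_of` macro are hypothetical stand-ins, and the kernel macro additionally performs type checking.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_socket { int state; };
struct model_inode { unsigned long ino; };

/* Both objects are allocated together, as in struct socket_alloc. */
struct model_socket_alloc {
	struct model_socket socket;
	struct model_inode vfs_inode;
};

/* inode -> socket, in the manner of SOCKET_I() */
static struct model_socket *model_socket_i(struct model_inode *inode)
{
	return &model_container_of(inode, struct model_socket_alloc,
				   vfs_inode)->socket;
}

/* socket -> inode, in the manner of SOCK_INODE() */
static struct model_inode *model_sock_inode(struct model_socket *socket)
{
	return &model_container_of(socket, struct model_socket_alloc,
				   socket)->vfs_inode;
}
```

The conversion is only valid for pointers that really point into a `model_socket_alloc`; applying it to a standalone inode would yield a bogus address.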
|
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
/*
|
|
|
|
* Functions for memory accounting
|
|
|
|
*/
|
2016-10-21 19:55:45 +08:00
|
|
|
int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind);
|
2013-09-23 01:32:26 +08:00
|
|
|
int __sk_mem_schedule(struct sock *sk, int size, int kind);
|
2016-10-21 19:55:45 +08:00
|
|
|
void __sk_mem_reduce_allocated(struct sock *sk, int amount);
|
2015-05-16 03:39:25 +08:00
|
|
|
void __sk_mem_reclaim(struct sock *sk, int amount);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
#define SK_MEM_SEND 0
|
|
|
|
#define SK_MEM_RECV 1
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2022-06-09 14:34:06 +08:00
|
|
|
/* sysctl_mem values are in pages */
|
2016-11-01 04:32:55 +08:00
|
|
|
static inline long sk_prot_mem_limits(const struct sock *sk, int index)
|
|
|
|
{
|
2022-06-09 14:34:06 +08:00
|
|
|
return sk->sk_prot->sysctl_mem[index];
|
2016-11-01 04:32:55 +08:00
|
|
|
}
|
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
static inline int sk_mem_pages(int amt)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2022-06-09 14:34:07 +08:00
|
|
|
return (amt + PAGE_SIZE - 1) >> PAGE_SHIFT;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
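The rounding in sk_mem_pages() is the usual "round up to whole pages" idiom. A small sketch, assuming 4 KB pages; `MODEL_PAGE_SHIFT`, `MODEL_PAGE_SIZE` and `model_mem_pages` are hypothetical names, since the real PAGE_SHIFT depends on the architecture:

```c
#define MODEL_PAGE_SHIFT 12                      /* assumed: 4 KB pages */
#define MODEL_PAGE_SIZE  (1 << MODEL_PAGE_SHIFT)

/* Number of whole pages needed to hold amt bytes (amt >= 0). */
static int model_mem_pages(int amt)
{
	return (amt + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
}
```

Under that assumption a 1-byte request costs one page, exactly 4096 bytes still costs one page, and 4097 bytes costs two.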
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_has_account(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-12-31 16:11:19 +08:00
|
|
|
/* return true if protocol supports memory accounting */
|
|
|
|
return !!sk->sk_prot->memory_allocated;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_wmem_schedule(struct sock *sk, int size)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2022-06-09 14:34:10 +08:00
|
|
|
int delta;
|
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
if (!sk_has_account(sk))
|
2012-05-17 06:48:15 +08:00
|
|
|
return true;
|
2022-06-09 14:34:10 +08:00
|
|
|
delta = size - sk->sk_forward_alloc;
|
|
|
|
return delta <= 0 || __sk_mem_schedule(sk, delta, SK_MEM_SEND);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
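The `delta` logic in sk_wmem_schedule() only asks the slow path for the bytes not already covered by `sk_forward_alloc`. The sketch below models that with a hypothetical page budget. `struct model_sock`, `model_mem_schedule` and the 4 KB page size are assumptions made for illustration; the real __sk_mem_schedule() applies protocol limits and memory-pressure rules that are not reproduced here.

```c
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12                      /* assumed: 4 KB pages */
#define MODEL_PAGE_SIZE  (1 << MODEL_PAGE_SHIFT)

struct model_sock {
	int forward_alloc; /* bytes already reserved for this socket */
	long pages_used;   /* pages charged so far (hypothetical budget) */
	long pages_limit;  /* hypothetical hard limit, in pages */
};

/* Hypothetical slow path: reserve whole pages or fail at the limit. */
static bool model_mem_schedule(struct model_sock *sk, int size)
{
	int pages = (size + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;

	if (sk->pages_used + pages > sk->pages_limit)
		return false;
	sk->pages_used += pages;
	sk->forward_alloc += pages << MODEL_PAGE_SHIFT;
	return true;
}

/* Same shape as sk_wmem_schedule(): succeed immediately when the
 * existing reservation covers size, otherwise schedule only the delta.
 */
static bool model_wmem_schedule(struct model_sock *sk, int size)
{
	int delta = size - sk->forward_alloc;

	return delta <= 0 || model_mem_schedule(sk, delta);
}
```

In this model a second request that fits in the page reserved by the first one never reaches the slow path, which is the saving the delta computation is after.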
|
|
|
|
|
netvm: prevent a stream-specific deadlock
This patch series is based on top of "Swap-over-NBD without deadlocking
v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. In diskless systems this is not an option so if swap is
required then swapping over the network is considered. The two likely
scenarios are when blade servers are used as part of a cluster where the
form factor or maintenance costs do not allow the use of disks and thin
clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap but this is not always an option. There is no
guarantee that the network attached storage (NAS) device is running Linux
or supports NBD. However, it is likely that it supports NFS so there are
users that want support for swapping over NFS despite any performance
concern. Some distributions currently carry patches that support swapping
over NFS but it would be preferable to support it in the mainline kernel.
Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
reserves.
Patch 3 adds three helpers for filesystems to handle swap cache pages.
For example, page_file_mapping() returns page->mapping for
file-backed pages and the address_space of the underlying
swap file for swap cache pages.
Patch 4 adds two address_space_operations to allow a filesystem
to pin all metadata relevant to a swapfile in memory. Upon
successful activation, the swapfile is marked SWP_FILE and
the address space operation ->direct_IO is used for writing
and ->readpage for reading in swap pages.
Patch 5 notes that patch 3 is bolting
filesystem-specific-swapfile-support onto the side and that
the default handlers have different information to what
is available to the filesystem. This patch refactors the
code so that there are generic handlers for each of the new
address_space operations.
Patch 6 adds an API to allow a vector of kernel addresses to be
translated to struct pages and pinned for IO.
Patch 7 adds support for using highmem pages for swap by kmapping
the pages before calling the direct_IO handler.
Patch 8 updates NFS to use the helpers from patch 3 where necessary.
Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
Patch 10 implements the new swapfile-related address_space operations
for NFS and teaches the direct IO handler how to manage
kernel addresses.
Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
where appropriate.
Patch 12 fixes a NULL pointer dereference that occurs when using
swap-over-NFS.
With the patches applied, it is possible to mount a swapfile that is on an
NFS filesystem. Swap performance is not great with a swap stress test
taking roughly twice as long to complete as when the swap device is
backed by NBD.
This patch: netvm: prevent a stream-specific deadlock
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
buffers from receiving data, which will prevent userspace from running,
which is needed to reduce the buffered data.
Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
this change is applied, it is important that sockets that set
SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
If this happens, a warning is generated and the tokens reclaimed to avoid
accounting errors until the bug is fixed.
[davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:44:41 +08:00
|
|
|
static inline bool
|
2012-09-18 05:09:11 +08:00
|
|
|
sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, int size)
|
2005-09-02 08:48:23 +08:00
|
|
|
{
|
2022-06-09 14:34:10 +08:00
|
|
|
int delta;
|
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
if (!sk_has_account(sk))
|
2012-05-17 06:48:15 +08:00
|
|
|
return true;
|
2022-06-09 14:34:10 +08:00
|
|
|
delta = size - sk->sk_forward_alloc;
|
|
|
|
return delta <= 0 || __sk_mem_schedule(sk, delta, SK_MEM_RECV) ||
|
2012-08-01 07:44:41 +08:00
|
|
|
skb_pfmemalloc(skb);
|
2007-12-31 16:11:19 +08:00
|
|
|
}
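The decision made by sk_rmem_schedule() above can be restated as a small truth function. This is only a sketch: the two boolean parameters are assumptions standing in for the results of __sk_mem_schedule() and skb_pfmemalloc(), whose bodies are not shown here, and the sk_has_account() early return is left out.

```c
#include <stdbool.h>

/*
 * Toy model of the decision in sk_rmem_schedule(). The two boolean
 * parameters are hypothetical stand-ins for __sk_mem_schedule() and
 * skb_pfmemalloc(); they are not kernel API.
 */
static bool toy_rmem_schedule(int size, int forward_alloc,
			      bool pool_grants, bool skb_is_pfmemalloc)
{
	/* Bytes still missing after using what is already pre-charged. */
	int delta = size - forward_alloc;

	if (delta <= 0)
		return true;	/* the forward allocation already covers it */
	if (pool_grants)
		return true;	/* the global pool handed out more memory */
	/* Last resort: skbs allocated from emergency reserves are exempt. */
	return skb_is_pfmemalloc;
}
```

The last branch is the exemption the commit message describes: a buffer that came from the reserves is accepted even when the global receive limit has been reached.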
|
|
|
|
|
2021-09-30 01:25:11 +08:00
|
|
|
static inline int sk_unused_reserved_mem(const struct sock *sk)
|
|
|
|
{
|
|
|
|
int unused_mem;
|
|
|
|
|
|
|
|
if (likely(!sk->sk_reserved_mem))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
unused_mem = sk->sk_reserved_mem - sk->sk_wmem_queued -
|
|
|
|
atomic_read(&sk->sk_rmem_alloc);
|
|
|
|
|
|
|
|
return unused_mem > 0 ? unused_mem : 0;
|
|
|
|
}
|
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
static inline void sk_mem_reclaim(struct sock *sk)
|
|
|
|
{
|
2021-09-30 01:25:11 +08:00
|
|
|
int reclaimable;
|
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
if (!sk_has_account(sk))
|
|
|
|
return;
|
2021-09-30 01:25:11 +08:00
|
|
|
|
|
|
|
reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
|
|
|
|
|
2022-06-09 14:34:07 +08:00
|
|
|
if (reclaimable >= (int)PAGE_SIZE)
|
2021-09-30 01:25:11 +08:00
|
|
|
__sk_mem_reclaim(sk, reclaimable);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_mem_reclaim_final(struct sock *sk)
|
|
|
|
{
|
|
|
|
sk->sk_reserved_mem = 0;
|
|
|
|
sk_mem_reclaim(sk);
|
2007-12-31 16:11:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_mem_charge(struct sock *sk, int size)
|
|
|
|
{
|
|
|
|
if (!sk_has_account(sk))
|
|
|
|
return;
|
|
|
|
sk->sk_forward_alloc -= size;
|
|
|
|
}
|
|
|
|
|
2022-06-09 14:34:11 +08:00
|
|
|
/* the following macros control memory reclaiming in mptcp_rmem_uncharge()
|
2021-10-27 07:29:13 +08:00
|
|
|
*/
|
|
|
|
#define SK_RECLAIM_THRESHOLD (1 << 21)
|
|
|
|
#define SK_RECLAIM_CHUNK (1 << 20)
|
|
|
|
|
2007-12-31 16:11:19 +08:00
|
|
|
static inline void sk_mem_uncharge(struct sock *sk, int size)
|
|
|
|
{
|
|
|
|
if (!sk_has_account(sk))
|
|
|
|
return;
|
|
|
|
sk->sk_forward_alloc += size;
|
2022-06-09 14:34:11 +08:00
|
|
|
sk_mem_reclaim(sk);
|
2007-12-31 16:11:19 +08:00
|
|
|
}
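The arithmetic in sk_unused_reserved_mem(), sk_mem_reclaim(), sk_mem_charge() and sk_mem_uncharge() can be followed in a userspace sketch. The struct toy_sock type, the TOY_PAGE_SIZE value of 4096 and the "returned" counter are assumptions made for illustration. The body of __sk_mem_reclaim() is not part of this section, so the sketch assumes it hands back whole pages only.

```c
#define TOY_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the struct sock fields used above. */
struct toy_sock {
	int forward_alloc;	/* bytes pre-charged but not yet used */
	int reserved_mem;	/* bytes the socket wants kept in reserve */
	int wmem_queued;	/* bytes in the write queue */
	int rmem_alloc;		/* bytes in the receive queue */
	int returned;		/* toy only: bytes given back to the pool */
};

static int toy_unused_reserved_mem(const struct toy_sock *sk)
{
	int unused;

	if (!sk->reserved_mem)
		return 0;
	unused = sk->reserved_mem - sk->wmem_queued - sk->rmem_alloc;
	return unused > 0 ? unused : 0;
}

static void toy_mem_reclaim(struct toy_sock *sk)
{
	int reclaimable = sk->forward_alloc - toy_unused_reserved_mem(sk);

	if (reclaimable >= TOY_PAGE_SIZE) {
		/* Assumed policy: round down to a whole number of pages. */
		int bytes = reclaimable - (reclaimable % TOY_PAGE_SIZE);

		sk->forward_alloc -= bytes;
		sk->returned += bytes;
	}
}

static void toy_mem_charge(struct toy_sock *sk, int size)
{
	sk->forward_alloc -= size;	/* consume pre-charged bytes */
}

static void toy_mem_uncharge(struct toy_sock *sk, int size)
{
	sk->forward_alloc += size;	/* bytes become available again */
	toy_mem_reclaim(sk);		/* and surplus pages are returned */
}
```

With forward_alloc = 10000, reserved_mem = 8000, wmem_queued = 1000 and rmem_alloc = 2000, the unused reservation is 5000 bytes, so 5000 bytes are reclaimable and, under the whole-page assumption, one page of 4096 bytes is returned.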
|
|
|
|
|
2006-12-07 12:35:24 +08:00
|
|
|
/*
|
|
|
|
* Macro so as to not evaluate some arguments when
|
|
|
|
* lockdep is not enabled.
|
|
|
|
*
|
|
|
|
* Mark both the sk_lock and the sk_lock.slock as a
|
|
|
|
* per-address-family lock class.
|
|
|
|
*/
|
2012-05-17 06:48:15 +08:00
|
|
|
#define sock_lock_init_class_and_name(sk, sname, skey, name, key) \
|
2006-12-07 12:35:24 +08:00
|
|
|
do { \
|
2008-11-12 09:38:36 +08:00
|
|
|
sk->sk_lock.owned = 0; \
|
2006-12-07 12:35:24 +08:00
|
|
|
init_waitqueue_head(&sk->sk_lock.wq); \
|
|
|
|
spin_lock_init(&(sk)->sk_lock.slock); \
|
|
|
|
debug_check_no_locks_freed((void *)&(sk)->sk_lock, \
|
|
|
|
sizeof((sk)->sk_lock)); \
|
|
|
|
lockdep_set_class_and_name(&(sk)->sk_lock.slock, \
|
2012-05-17 06:48:15 +08:00
|
|
|
(skey), (sname)); \
|
2006-12-07 12:35:24 +08:00
|
|
|
lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0); \
|
|
|
|
} while (0)
|
|
|
|
|
2018-01-17 23:14:14 +08:00
|
|
|
static inline bool lockdep_sock_is_held(const struct sock *sk)
|
2016-04-05 23:10:15 +08:00
|
|
|
{
|
|
|
|
return lockdep_is_held(&sk->sk_lock) ||
|
|
|
|
lockdep_is_held(&sk->sk_lock.slock);
|
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void lock_sock_nested(struct sock *sk, int subclass);
|
2006-11-09 14:44:35 +08:00
|
|
|
|
|
|
|
static inline void lock_sock(struct sock *sk)
|
|
|
|
{
|
|
|
|
lock_sock_nested(sk, 0);
|
|
|
|
}
|
|
|
|
|
2020-11-27 18:10:22 +08:00
|
|
|
void __lock_sock(struct sock *sk);
|
2018-10-02 14:24:26 +08:00
|
|
|
void __release_sock(struct sock *sk);
|
2013-09-23 01:32:26 +08:00
|
|
|
void release_sock(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* BH context may only use the following locking interface. */
|
|
|
|
#define bh_lock_sock(__sk) spin_lock(&((__sk)->sk_lock.slock))
|
2006-07-03 15:25:13 +08:00
|
|
|
#define bh_lock_sock_nested(__sk) \
|
|
|
|
spin_lock_nested(&((__sk)->sk_lock.slock), \
|
|
|
|
SINGLE_DEPTH_NESTING)
|
2005-04-17 06:20:36 +08:00
|
|
|
#define bh_unlock_sock(__sk) spin_unlock(&((__sk)->sk_lock.slock))
|
|
|
|
|
net: introduce and use lock_sock_fast_nested()
Syzkaller reported a false positive deadlock involving
the nl socket lock and the subflow socket lock:
MPTCP: kernel_bind error, err=-98
============================================
WARNING: possible recursive locking detected
5.15.0-rc1-syzkaller #0 Not tainted
--------------------------------------------
syz-executor998/6520 is trying to acquire lock:
ffff8880795718a0 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
but task is already holding lock:
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(k-sk_lock-AF_INET);
lock(k-sk_lock-AF_INET);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by syz-executor998/6520:
#0: ffffffff8d176c50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 net/netlink/genetlink.c:802
#1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_lock net/netlink/genetlink.c:33 [inline]
#1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x3e0/0x580 net/netlink/genetlink.c:790
#2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
#2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
stack backtrace:
CPU: 1 PID: 6520 Comm: syz-executor998 Not tainted 5.15.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_deadlock_bug kernel/locking/lockdep.c:2944 [inline]
check_deadlock kernel/locking/lockdep.c:2987 [inline]
validate_chain kernel/locking/lockdep.c:3776 [inline]
__lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5015
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
lock_sock_fast+0x36/0x100 net/core/sock.c:3229
mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
__sock_release net/socket.c:649 [inline]
sock_release+0x87/0x1b0 net/socket.c:677
mptcp_pm_nl_create_listen_socket+0x238/0x2c0 net/mptcp/pm_netlink.c:900
mptcp_nl_cmd_add_addr+0x359/0x930 net/mptcp/pm_netlink.c:1170
genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_no_sendpage+0x101/0x150 net/core/sock.c:2980
kernel_sendpage.part.0+0x1a0/0x340 net/socket.c:3504
kernel_sendpage net/socket.c:3501 [inline]
sock_sendpage+0xe5/0x140 net/socket.c:1003
pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
splice_from_pipe_feed fs/splice.c:418 [inline]
__splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
splice_from_pipe fs/splice.c:597 [inline]
generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
do_splice_from fs/splice.c:767 [inline]
direct_splice_actor+0x110/0x180 fs/splice.c:936
splice_direct_to_actor+0x34b/0x8c0 fs/splice.c:891
do_splice_direct+0x1b3/0x280 fs/splice.c:979
do_sendfile+0xae9/0x1240 fs/read_write.c:1249
__do_sys_sendfile64 fs/read_write.c:1314 [inline]
__se_sys_sendfile64 fs/read_write.c:1300 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1300
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f215cb69969
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc96bb3868 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007f215cbad072 RCX: 00007f215cb69969
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
RBP: 0000000000000000 R08: 00007ffc96bb3a08 R09: 00007ffc96bb3a08
R10: 0000000100000002 R11: 0000000000000246 R12: 00007ffc96bb387c
R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
The problem originates from an incorrect lock annotation in the mptcp
code and is only visible since commit 2dcb96bacce3 ("net: core: Correct
the sock::sk_lock.owned lockdep annotations"), but has been present since
the initial implementation of port-based endpoint support.
This patch addresses the issue by introducing a nested variant of
lock_sock_fast() and using it in the relevant code path.
Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port")
Fixes: 2dcb96bacce3 ("net: core: Correct the sock::sk_lock.owned lockdep annotations")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-and-tested-by: syzbot+1dd53f7a89b299d59eaf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-29 17:59:17 +08:00
|
|
|
bool __lock_sock_fast(struct sock *sk) __acquires(&sk->sk_lock.slock);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* lock_sock_fast - fast version of lock_sock
|
|
|
|
* @sk: socket
|
|
|
|
*
|
|
|
|
* This version should be used for very small sections, where the process won't block.
|
|
|
|
* return false if fast path is taken:
|
|
|
|
*
|
|
|
|
* sk_lock.slock locked, owned = 0, BH disabled
|
|
|
|
*
|
|
|
|
* return true if slow path is taken:
|
|
|
|
*
|
|
|
|
* sk_lock.slock unlocked, owned = 1, BH enabled
|
|
|
|
*/
|
|
|
|
static inline bool lock_sock_fast(struct sock *sk)
|
|
|
|
{
|
|
|
|
/* The sk_lock has mutex_lock() semantics here. */
|
|
|
|
mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);
|
|
|
|
|
|
|
|
return __lock_sock_fast(sk);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* fast socket lock variant for caller already holding a [different] socket lock */
|
|
|
|
static inline bool lock_sock_fast_nested(struct sock *sk)
|
|
|
|
{
|
|
|
|
mutex_acquire(&sk->sk_lock.dep_map, SINGLE_DEPTH_NESTING, 0, _RET_IP_);
|
|
|
|
|
|
|
|
return __lock_sock_fast(sk);
|
|
|
|
}
|
2020-11-18 02:43:49 +08:00
|
|
|
|
2010-05-27 03:20:18 +08:00
|
|
|
/**
|
|
|
|
* unlock_sock_fast - complement of lock_sock_fast
|
|
|
|
* @sk: socket
|
|
|
|
* @slow: slow mode
|
|
|
|
*
|
|
|
|
* fast unlock socket for user context.
|
|
|
|
* If slow mode is on, we call regular release_sock()
|
|
|
|
*/
|
|
|
|
static inline void unlock_sock_fast(struct sock *sk, bool slow)
|
2020-11-18 02:43:49 +08:00
|
|
|
__releases(&sk->sk_lock.slock)
|
2010-04-29 05:35:48 +08:00
|
|
|
{
|
2020-11-18 02:43:49 +08:00
|
|
|
if (slow) {
|
2010-05-27 03:20:18 +08:00
|
|
|
release_sock(sk);
|
2020-11-18 02:43:49 +08:00
|
|
|
__release(&sk->sk_lock.slock);
|
|
|
|
} else {
|
net: core: Correct the sock::sk_lock.owned lockdep annotations
lock_sock_fast() and lock_sock_nested() contain lockdep annotations for the
sock::sk_lock.owned 'mutex'. sock::sk_lock.owned is not a regular mutex. It
is just lockdep wise equivalent. In fact it's an open coded trivial mutex
implementation with some interesting features.
sock::sk_lock.slock is a regular spinlock protecting the 'mutex'
representation sock::sk_lock.owned which is a plain boolean. If 'owned' is
true, then some other task holds the 'mutex', otherwise it is uncontended.
As this locking construct is obviously endangered by lock ordering issues as
any other locking primitive it got lockdep annotated via a dedicated
dependency map sock::sk_lock.dep_map which has to be updated at the lock
and unlock sites.
lock_sock_nested() is a straight forward 'mutex' lock operation:
might_sleep();
spin_lock_bh(sock::sk_lock.slock)
while (!try_lock(sock::sk_lock.owned)) {
spin_unlock_bh(sock::sk_lock.slock);
wait_for_release();
spin_lock_bh(sock::sk_lock.slock);
}
The lockdep annotation for sock::sk_lock.owned is for unknown reasons
_after_ the lock has been acquired, i.e. after the code block above and
after releasing sock::sk_lock.slock, but inside the bottom halves disabled
region:
spin_unlock(sock::sk_lock.slock);
mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
local_bh_enable();
The placement after the unlock is obvious because otherwise the
mutex_acquire() would nest into the spin lock held region.
But that's from the lockdep perspective still the wrong place:
1) The mutex_acquire() is issued _after_ the successful acquisition which
is pointless because in a dead lock scenario this point is never
reached which means that if the deadlock is the first instance of
exposing the wrong lock order lockdep does not have a chance to detect
it.
2) It only works because lockdep is rather lax on the context from which
the mutex_acquire() is issued. Acquiring a mutex inside a bottom halves
and therefore non-preemptible region is obviously invalid, except for a
trylock which is clearly not the case here.
This 'works' stops working on RT enabled kernels where the bottom halves
serialization is done via a local lock, which exposes this misplacement
because the 'mutex' and the local lock nest the wrong way around and
lockdep complains rightfully about a lock inversion.
The placement is wrong since the initial commit a5b5bb9a053a ("[PATCH]
lockdep: annotate sk_locks") which introduced this.
Fix it by moving the mutex_acquire() in front of the actual lock
acquisition, which is what the regular mutex_lock() operation does as well.
lock_sock_fast() is not that straight forward. It looks at the first glance
like a convoluted trylock operation:
spin_lock_bh(sock::sk_lock.slock)
if (!sock::sk_lock.owned)
return false;
while (!try_lock(sock::sk_lock.owned)) {
spin_unlock_bh(sock::sk_lock.slock);
wait_for_release();
spin_lock_bh(sock::sk_lock.slock);
}
spin_unlock(sock::sk_lock.slock);
mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
local_bh_enable();
return true;
But that's not the case: lock_sock_fast() is an interesting optimization
for short critical sections which can run with bottom halves disabled and
sock::sk_lock.slock held. This allows to shortcut the 'mutex' operation in
the non contended case by preventing other lockers to acquire
sock::sk_lock.owned because they are blocked on sock::sk_lock.slock, which
in turn avoids the overhead of doing the heavy processing in release_sock()
including waking up wait queue waiters.
In the contended case, i.e. when sock::sk_lock.owned == true the behavior
is the same as lock_sock_nested().
Semantically this shortcut means, that the task acquired the 'mutex' even
if it does not touch the sock::sk_lock.owned field in the non-contended
case. Not telling lockdep about this shortcut acquisition is hiding
potential lock ordering violations in the fast path.
As a consequence the same reasoning as for the above lock_sock_nested()
case vs. the placement of the lockdep annotation applies.
The current placement of the lockdep annotation was just copied from
the original lock_sock(), now renamed to lock_sock_nested(),
implementation.
Fix this by moving the mutex_acquire() in front of the actual lock
acquisition and adding the corresponding mutex_release() into
unlock_sock_fast(). Also document the fast path return case with a comment.
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-18 20:42:35 +08:00
|
|
|
mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
|
2010-05-27 03:20:18 +08:00
|
|
|
spin_unlock_bh(&sk->sk_lock.slock);
|
2020-11-18 02:43:49 +08:00
|
|
|
}
|
2010-04-29 05:35:48 +08:00
|
|
|
}
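The fast and slow return states of lock_sock_fast() and unlock_sock_fast() can be modelled in userspace with a C11 atomic_flag acting as the spinlock. This is a rough sketch under stated assumptions: struct toy_sock_lock and every toy_* function are hypothetical names, bottom-half disabling and the lockdep annotations have no userspace equivalent and are omitted, and the slow path spins where the kernel would sleep on a wait queue.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_sock_lock {
	atomic_flag slock;	/* stands in for sk_lock.slock */
	bool owned;		/* stands in for sk_lock.owned */
};

static void toy_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the flag was clear */
}

static void toy_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Returns false on the fast path: the spinlock stays held and
 * "owned" is left untouched. Returns true on the slow path:
 * "owned" is set and the spinlock has been dropped.
 */
static bool toy_lock_sock_fast(struct toy_sock_lock *lk)
{
	toy_spin_lock(&lk->slock);
	if (!lk->owned)
		return false;

	/* Contended: wait for the current owner, then take ownership. */
	while (lk->owned) {
		toy_spin_unlock(&lk->slock);
		/* a real implementation would sleep here */
		toy_spin_lock(&lk->slock);
	}
	lk->owned = true;
	toy_spin_unlock(&lk->slock);
	return true;
}

static void toy_unlock_sock_fast(struct toy_sock_lock *lk, bool slow)
{
	if (slow) {
		/* Slow mode: give up ownership under the spinlock. */
		toy_spin_lock(&lk->slock);
		lk->owned = false;
		toy_spin_unlock(&lk->slock);
	} else {
		/* Fast mode: only the spinlock was ever taken. */
		toy_spin_unlock(&lk->slock);
	}
}
```

A caller keeps the returned flag and passes it back unchanged, as in `bool slow = toy_lock_sock_fast(&lk); /* short critical section */ toy_unlock_sock_fast(&lk, slow);`.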
|
|
|
|
|
2016-04-08 21:11:27 +08:00
|
|
|
/* Used by processes to "lock" a socket state, so that
|
|
|
|
* interrupts and bottom half handlers won't change it
|
|
|
|
* from under us. It essentially blocks any incoming
|
|
|
|
* packets, so that we won't get any new data or any
|
|
|
|
* packets that change the state of the socket.
|
|
|
|
*
|
|
|
|
* While locked, BH processing will add new packets to
|
|
|
|
* the backlog queue. This queue is processed by the
|
|
|
|
* owner of the socket lock right before it is released.
|
|
|
|
*
|
|
|
|
* Since ~2.3.5 it is also exclusive sleep lock serializing
|
|
|
|
* accesses from user process context.
|
|
|
|
*/
|
|
|
|
|
2016-05-04 07:56:03 +08:00
|
|
|
static inline void sock_owned_by_me(const struct sock *sk)
|
2016-04-08 21:11:27 +08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_LOCKDEP
|
2016-04-25 21:34:09 +08:00
|
|
|
WARN_ON_ONCE(!lockdep_sock_is_held(sk) && debug_locks);
|
2016-04-08 21:11:27 +08:00
|
|
|
#endif
|
2016-05-04 07:56:03 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool sock_owned_by_user(const struct sock *sk)
|
|
|
|
{
|
|
|
|
sock_owned_by_me(sk);
|
2016-04-08 21:11:27 +08:00
|
|
|
return sk->sk_lock.owned;
|
|
|
|
}
|
|
|
|
|
2017-12-29 03:00:43 +08:00
|
|
|
static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return sk->sk_lock.owned;
|
|
|
|
}
|
|
|
|
|
2021-12-08 14:21:58 +08:00
|
|
|
static inline void sock_release_ownership(struct sock *sk)
|
|
|
|
{
|
|
|
|
if (sock_owned_by_user_nocheck(sk)) {
|
|
|
|
sk->sk_lock.owned = 0;
|
|
|
|
|
|
|
|
/* The sk_lock has mutex_unlock() semantics: */
|
|
|
|
mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-04-08 21:11:27 +08:00
|
|
|
/* no reclassification while locks are held */
|
|
|
|
static inline bool sock_allow_reclassification(const struct sock *csk)
|
|
|
|
{
|
|
|
|
struct sock *sk = (struct sock *)csk;
|
|
|
|
|
2021-12-08 14:21:58 +08:00
|
|
|
return !sock_owned_by_user_nocheck(sk) &&
|
|
|
|
!spin_is_locked(&sk->sk_lock.slock);
|
2016-04-08 21:11:27 +08:00
|
|
|
}
|
2010-04-29 05:35:48 +08:00
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
|
2015-05-09 10:09:13 +08:00
|
|
|
struct proto *prot, int kern);
|
2013-09-23 01:32:26 +08:00
|
|
|
void sk_free(struct sock *sk);
|
2015-06-15 23:26:18 +08:00
|
|
|
void sk_destruct(struct sock *sk);
|
2013-09-23 01:32:26 +08:00
|
|
|
struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority);
|
2017-03-02 03:35:08 +08:00
|
|
|
void sk_free_unlock_clone(struct sock *sk);
|
2013-09-23 01:32:26 +08:00
|
|
|
|
|
|
|
struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
|
|
|
|
gfp_t priority);
|
2016-05-03 01:56:27 +08:00
|
|
|
void __sock_wfree(struct sk_buff *skb);
|
2013-09-23 01:32:26 +08:00
|
|
|
void sock_wfree(struct sk_buff *skb);
|
2017-08-04 04:29:37 +08:00
|
|
|
struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
|
|
|
|
gfp_t priority);
|
2013-09-23 01:32:26 +08:00
|
|
|
void skb_orphan_partial(struct sk_buff *skb);
|
|
|
|
void sock_rfree(struct sk_buff *skb);
|
2014-09-05 01:31:35 +08:00
|
|
|
void sock_efree(struct sk_buff *skb);
|
2014-09-05 01:32:11 +08:00
|
|
|
#ifdef CONFIG_INET
|
2013-09-23 01:32:26 +08:00
|
|
|
void sock_edemux(struct sk_buff *skb);
|
2020-03-30 06:53:38 +08:00
|
|
|
void sock_pfree(struct sk_buff *skb);
|
2014-09-05 01:32:11 +08:00
|
|
|
#else
|
2017-01-27 23:11:27 +08:00
|
|
|
#define sock_edemux sock_efree
|
2014-09-05 01:32:11 +08:00
|
|
|
#endif
|
2013-09-23 01:32:26 +08:00
|
|
|
|
|
|
|
int sock_setsockopt(struct socket *sock, int level, int op,
|
2020-07-23 14:08:50 +08:00
|
|
|
sockptr_t optval, unsigned int optlen);
|
2013-09-23 01:32:26 +08:00
|
|
|
|
|
|
|
int sock_getsockopt(struct socket *sock, int level, int op,
|
|
|
|
char __user *optval, int __user *optlen);
|
2019-04-18 04:51:48 +08:00
|
|
|
int sock_gettstamp(struct socket *sock, void __user *userstamp,
|
|
|
|
bool timeval, bool time32);
|
2013-09-23 01:32:26 +08:00
|
|
|
struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
|
|
|
|
unsigned long data_len, int noblock,
|
|
|
|
int *errcode, int max_page_order);
|
2022-04-28 18:58:44 +08:00
|
|
|
|
|
|
|
static inline struct sk_buff *sock_alloc_send_skb(struct sock *sk,
|
|
|
|
unsigned long size,
|
|
|
|
int noblock, int *errcode)
|
|
|
|
{
|
|
|
|
return sock_alloc_send_pskb(sk, size, 0, noblock, errcode, 0);
|
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void *sock_kmalloc(struct sock *sk, int size, gfp_t priority);
|
|
|
|
void sock_kfree_s(struct sock *sk, void *mem, int size);
|
2014-11-20 00:13:11 +08:00
|
|
|
void sock_kzfree_s(struct sock *sk, void *mem, int size);
|
2013-09-23 01:32:26 +08:00
|
|
|
void sk_send_sigurg(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2015-10-09 05:56:48 +08:00
|
|
|
struct sockcm_cookie {
|
2018-07-04 06:42:48 +08:00
|
|
|
u64 transmit_time;
|
2015-10-09 05:56:48 +08:00
|
|
|
u32 mark;
|
2016-04-03 11:08:09 +08:00
|
|
|
u16 tsflags;
|
2015-10-09 05:56:48 +08:00
|
|
|
};
|
|
|
|
|
2018-07-06 22:12:56 +08:00
|
|
|
static inline void sockcm_init(struct sockcm_cookie *sockc,
|
|
|
|
const struct sock *sk)
|
|
|
|
{
|
|
|
|
*sockc = (struct sockcm_cookie) { .tsflags = sk->sk_tsflags };
|
|
|
|
}
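sockcm_init() relies on a C99 rule: assigning a compound literal with designated initializers sets every member that is not named to zero. A minimal standalone sketch, where struct toy_cookie and toy_cookie_init() are hypothetical names that mirror the layout above:

```c
#include <stdint.h>

/* Hypothetical userspace mirror of struct sockcm_cookie. */
struct toy_cookie {
	uint64_t transmit_time;
	uint32_t mark;
	uint16_t tsflags;
};

static void toy_cookie_init(struct toy_cookie *c, uint16_t tsflags)
{
	/* Members not named in the literal are zero-initialised. */
	*c = (struct toy_cookie) { .tsflags = tsflags };
}
```

Any stale values in transmit_time and mark are therefore cleared by the single assignment, without a separate memset().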
|
|
|
|
|
2016-04-03 11:08:06 +08:00
|
|
|
int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
|
|
|
|
struct sockcm_cookie *sockc);
|
2015-10-09 05:56:48 +08:00
|
|
|
int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
|
|
|
|
struct sockcm_cookie *sockc);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Functions to fill in entries in struct proto_ops when a protocol
|
|
|
|
* does not implement a particular function.
|
|
|
|
*/
|
2013-09-23 01:32:26 +08:00
|
|
|
int sock_no_bind(struct socket *, struct sockaddr *, int);
|
|
|
|
int sock_no_connect(struct socket *, struct sockaddr *, int, int);
|
|
|
|
int sock_no_socketpair(struct socket *, struct socket *);
|
2017-03-09 16:09:05 +08:00
|
|
|
int sock_no_accept(struct socket *, struct socket *, int, bool);
|
2018-02-13 03:00:20 +08:00
|
|
|
int sock_no_getname(struct socket *, struct sockaddr *, int);
|
2013-09-23 01:32:26 +08:00
|
|
|
int sock_no_ioctl(struct socket *, unsigned int, unsigned long);
|
|
|
|
int sock_no_listen(struct socket *, int);
|
|
|
|
int sock_no_shutdown(struct socket *, int);
|
2015-03-02 15:37:48 +08:00
|
|
|
int sock_no_sendmsg(struct socket *, struct msghdr *, size_t);
|
2017-07-29 07:22:41 +08:00
|
|
|
int sock_no_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
|
2015-03-02 15:37:48 +08:00
|
|
|
int sock_no_recvmsg(struct socket *, struct msghdr *, size_t, int);
|
2013-09-23 01:32:26 +08:00
|
|
|
int sock_no_mmap(struct file *file, struct socket *sock,
|
|
|
|
struct vm_area_struct *vma);
|
|
|
|
ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset,
|
|
|
|
size_t size, int flags);
|
2017-07-29 07:22:41 +08:00
|
|
|
ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page,
|
|
|
|
int offset, size_t size, int flags);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Functions to fill in entries in struct proto_ops when a protocol
|
|
|
|
* uses the inet style.
|
|
|
|
*/
|
2013-09-23 01:32:26 +08:00
|
|
|
int sock_common_getsockopt(struct socket *sock, int level, int optname,
|
2005-04-17 06:20:36 +08:00
|
|
|
char __user *optval, int __user *optlen);
|
2015-03-02 15:37:48 +08:00
|
|
|
int sock_common_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
|
|
|
|
int flags);
|
2013-09-23 01:32:26 +08:00
|
|
|
int sock_common_setsockopt(struct socket *sock, int level, int optname,
|
2020-07-23 14:09:07 +08:00
|
|
|
sockptr_t optval, unsigned int optlen);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void sk_common_release(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Default socket callbacks and setup code
|
|
|
|
*/
|
2012-05-17 06:48:15 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Initialise core socket variables */
|
2013-09-23 01:32:26 +08:00
|
|
|
void sock_init_data(struct socket *sock, struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Socket reference counting postulates.
|
|
|
|
*
|
|
|
|
* * Each user of socket SHOULD hold a reference count.
|
|
|
|
* * Each access point to socket (a hash table bucket, reference from a list,
|
|
|
|
* running timer, skb in flight) MUST hold a reference count.
|
|
|
|
* * When reference count hits 0, it means it will never increase back.
|
|
|
|
* * When reference count hits 0, it means that no references from
|
|
|
|
* outside exist to this socket and current process on current CPU
|
|
|
|
* is last user and may/should destroy this socket.
|
|
|
|
* * sk_free is called from any context: process, BH, IRQ. When
|
|
|
|
* it is called, socket has no references from outside -> sk_free
|
|
|
|
* may release descendant resources allocated by the socket, but
|
|
|
|
* to the time when it is called, socket is NOT referenced by any
|
|
|
|
* hash tables, lists etc.
|
|
|
|
* * Packets, delivered from outside (from network or from another process)
|
|
|
|
* and enqueued on receive/error queues SHOULD NOT grab reference count,
|
|
|
|
* when they sit in queue. Otherwise, packets will leak to hole, when
|
|
|
|
* socket is looked up by one CPU and unhashing is made by another CPU.
|
|
|
|
* It is true for udp/raw, netlink (leak to receive and error queues), tcp
|
|
|
|
* (leak to backlog). Packet socket does all the processing inside
|
|
|
|
* BR_NETPROTO_LOCK, so that it does not have this race condition. UNIX sockets
|
|
|
|
* use separate SMP lock, so that they are prone too.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* Ungrab socket and destroy it, if it was the last reference. */
|
|
|
|
static inline void sock_put(struct sock *sk)
|
|
|
|
{
|
2017-06-30 18:08:01 +08:00
|
|
|
if (refcount_dec_and_test(&sk->sk_refcnt))
|
2005-04-17 06:20:36 +08:00
|
|
|
sk_free(sk);
|
|
|
|
}
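The "drop a reference and destroy on the last one" pattern of sock_put() can be sketched in userspace with C11 atomics. struct toy_obj, toy_hold() and toy_put() are hypothetical names, a plain atomic_int replaces the kernel's refcount_t, and a boolean marker replaces the call to sk_free().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_obj {
	atomic_int refcnt;
	bool freed;	/* toy marker instead of really freeing */
};

static void toy_hold(struct toy_obj *o)
{
	int old = atomic_fetch_add_explicit(&o->refcnt, 1,
					    memory_order_relaxed);

	/* Postulate: once the count has hit 0 it never increases again. */
	assert(old > 0);
}

static void toy_put(struct toy_obj *o)
{
	/* fetch_sub returns the previous value: 1 means we were last. */
	if (atomic_fetch_sub_explicit(&o->refcnt, 1,
				      memory_order_acq_rel) == 1)
		o->freed = true;
}
```

Only the caller that observes the transition from 1 to 0 performs the teardown, which is what lets the release path run from any context without extra locking.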
|
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:22:02 +08:00
|
|
|
/* Generic version of sock_put(), dealing with all sockets
|
2015-03-13 07:44:08 +08:00
|
|
|
* (TCP_TIMEWAIT, TCP_NEW_SYN_RECV, ESTABLISHED...)
|
2013-10-03 15:22:02 +08:00
|
|
|
*/
|
|
|
|
void sock_gen_put(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-07-13 06:18:57 +08:00
|
|
|
int __sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested,
|
2016-11-03 08:14:41 +08:00
|
|
|
unsigned int trim_cap, bool refcounted);
|
2016-07-13 06:18:57 +08:00
|
|
|
static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
|
|
|
|
const int nested)
|
|
|
|
{
|
2016-11-03 08:14:41 +08:00
|
|
|
return __sk_receive_skb(sk, skb, nested, 1, true);
|
2016-07-13 06:18:57 +08:00
|
|
|
}
|
2005-12-27 12:42:22 +08:00
|
|
|
|
2009-10-20 07:46:20 +08:00
|
|
|
static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
|
|
|
|
{
|
2018-06-30 12:26:51 +08:00
|
|
|
/* sk_tx_queue_mapping accepts only up to a 16-bit value */
|
|
|
|
if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX))
|
|
|
|
return;
|
2009-10-20 07:46:20 +08:00
|
|
|
sk->sk_tx_queue_mapping = tx_queue;
|
|
|
|
}
|
|
|
|
|
2018-06-30 12:26:51 +08:00
|
|
|
#define NO_QUEUE_MAPPING USHRT_MAX
|
|
|
|
|
2009-10-20 07:46:20 +08:00
|
|
|
static inline void sk_tx_queue_clear(struct sock *sk)
|
|
|
|
{
|
2018-06-30 12:26:51 +08:00
|
|
|
sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
|
2009-10-20 07:46:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int sk_tx_queue_get(const struct sock *sk)
|
|
|
|
{
|
2018-06-30 12:26:51 +08:00
|
|
|
if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
|
|
|
|
return sk->sk_tx_queue_mapping;
|
|
|
|
|
|
|
|
return -1;
|
2009-10-20 07:46:20 +08:00
|
|
|
}
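
The set/clear/get helpers above rely on a sentinel value in a 16-bit field. A minimal user-space sketch of that pattern follows, assuming a hypothetical `struct demo_sock` with a single `uint16_t` field; the `demo_*` names and `DEMO_NO_QUEUE_MAPPING` are illustrative only and do not exist in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NO_QUEUE_MAPPING USHRT_MAX	/* sentinel: "no queue recorded" */

/* Hypothetical stand-in for the 16-bit mapping field in struct sock. */
struct demo_sock {
	uint16_t tx_queue_mapping;
};

static void demo_tx_queue_clear(struct demo_sock *sk)
{
	sk->tx_queue_mapping = DEMO_NO_QUEUE_MAPPING;
}

/* Store a queue index; a value that would collide with the sentinel is ignored. */
static void demo_tx_queue_set(struct demo_sock *sk, int tx_queue)
{
	if ((unsigned short)tx_queue >= USHRT_MAX)
		return;
	sk->tx_queue_mapping = (uint16_t)tx_queue;
}

/* Return the stored index, or -1 when nothing is recorded or sk is NULL. */
static int demo_tx_queue_get(const struct demo_sock *sk)
{
	if (sk && sk->tx_queue_mapping != DEMO_NO_QUEUE_MAPPING)
		return sk->tx_queue_mapping;
	return -1;
}
```

Reserving the largest 16-bit value as "unset" keeps the field at two bytes while still letting the getter report the absence of a mapping as -1.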
|
|
|
|
|
2021-12-01 02:29:39 +08:00
|
|
|
static inline void __sk_rx_queue_set(struct sock *sk,
|
|
|
|
const struct sk_buff *skb,
|
|
|
|
bool force_set)
|
2018-06-30 12:26:57 +08:00
|
|
|
{
|
2021-02-11 19:35:51 +08:00
|
|
|
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
|
2018-06-30 12:26:57 +08:00
|
|
|
if (skb_rx_queue_recorded(skb)) {
|
|
|
|
u16 rx_queue = skb_get_rx_queue(skb);
|
|
|
|
|
2021-12-01 02:29:39 +08:00
|
|
|
if (force_set ||
|
|
|
|
unlikely(READ_ONCE(sk->sk_rx_queue_mapping) != rx_queue))
|
2021-10-26 00:48:19 +08:00
|
|
|
WRITE_ONCE(sk->sk_rx_queue_mapping, rx_queue);
|
2018-06-30 12:26:57 +08:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2021-12-01 02:29:39 +08:00
|
|
|
static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
__sk_rx_queue_set(sk, skb, true);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_rx_queue_update(struct sock *sk, const struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
__sk_rx_queue_set(sk, skb, false);
|
|
|
|
}
|
|
|
|
|
2018-06-30 12:26:57 +08:00
|
|
|
static inline void sk_rx_queue_clear(struct sock *sk)
|
|
|
|
{
|
2021-02-11 19:35:51 +08:00
|
|
|
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
|
2021-10-26 00:48:20 +08:00
|
|
|
WRITE_ONCE(sk->sk_rx_queue_mapping, NO_QUEUE_MAPPING);
|
2018-06-30 12:26:57 +08:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2018-06-30 12:27:02 +08:00
|
|
|
static inline int sk_rx_queue_get(const struct sock *sk)
|
|
|
|
{
|
2021-02-11 19:35:51 +08:00
|
|
|
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
|
2021-10-26 00:48:20 +08:00
|
|
|
if (sk) {
|
|
|
|
int res = READ_ONCE(sk->sk_rx_queue_mapping);
|
|
|
|
|
|
|
|
if (res != NO_QUEUE_MAPPING)
|
|
|
|
return res;
|
|
|
|
}
|
2021-02-11 19:35:51 +08:00
|
|
|
#endif
|
2018-06-30 12:27:02 +08:00
|
|
|
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2008-06-18 13:41:38 +08:00
|
|
|
static inline void sk_set_socket(struct sock *sk, struct socket *sock)
|
|
|
|
{
|
|
|
|
sk->sk_socket = sock;
|
|
|
|
}
|
|
|
|
|
2010-04-20 21:03:51 +08:00
|
|
|
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
|
|
|
|
{
|
2011-02-18 11:26:36 +08:00
|
|
|
BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
|
|
|
|
return &rcu_dereference_raw(sk->sk_wq)->wait;
|
2010-04-20 21:03:51 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Detach socket from process context.
|
|
|
|
* Announce socket dead, detach it from wait queue and inode.
|
|
|
|
* Note that parent inode held reference count on this struct sock,
|
|
|
|
* we do not release it in this function, because protocol
|
|
|
|
* probably wants some additional cleanups or even continuing
|
|
|
|
* to work with this socket (TCP).
|
|
|
|
*/
|
|
|
|
static inline void sock_orphan(struct sock *sk)
|
|
|
|
{
|
|
|
|
write_lock_bh(&sk->sk_callback_lock);
|
|
|
|
sock_set_flag(sk, SOCK_DEAD);
|
2008-06-18 13:41:38 +08:00
|
|
|
sk_set_socket(sk, NULL);
|
2010-04-29 19:01:49 +08:00
|
|
|
sk->sk_wq = NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
write_unlock_bh(&sk->sk_callback_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sock_graft(struct sock *sk, struct socket *parent)
|
|
|
|
{
|
2017-07-06 23:15:07 +08:00
|
|
|
WARN_ON(parent->sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
write_lock_bh(&sk->sk_callback_lock);
|
2019-07-06 03:14:16 +08:00
|
|
|
rcu_assign_pointer(sk->sk_wq, &parent->wq);
|
2005-04-17 06:20:36 +08:00
|
|
|
parent->sk = sk;
|
2008-06-18 13:41:38 +08:00
|
|
|
sk_set_socket(sk, parent);
|
net: core: Add a UID field to struct sock.
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.
Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().
Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:
1. Whenever sk_socket is non-null: sk_uid is the same as the UID
in sk_socket, i.e., matches the return value of sock_i_uid.
Specifically, the UID is set when userspace calls socket(),
fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
- For a socket that no longer has a sk_socket because
userspace has called close(): the previous UID.
- For a cloned socket (e.g., an incoming connection that is
established but on which userspace has not yet called
accept): the UID of the socket it was cloned from.
- For a socket that has never had an sk_socket: UID 0 inside
the user namespace corresponding to the network namespace
the socket belongs to.
Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 01:23:41 +08:00
|
|
|
sk->sk_uid = SOCK_INODE(parent)->i_uid;
|
2006-07-25 14:32:50 +08:00
|
|
|
security_sock_graft(sk, parent);
|
2005-04-17 06:20:36 +08:00
|
|
|
write_unlock_bh(&sk->sk_callback_lock);
|
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
kuid_t sock_i_uid(struct sock *sk);
|
|
|
|
unsigned long sock_i_ino(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-11-04 01:23:41 +08:00
|
|
|
static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
|
|
|
|
{
|
|
|
|
return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
|
|
|
|
}
|
|
|
|
|
2015-09-16 06:24:20 +08:00
|
|
|
static inline u32 net_tx_rndhash(void)
|
2015-07-29 07:02:05 +08:00
|
|
|
{
|
2015-09-16 06:24:20 +08:00
|
|
|
u32 v = prandom_u32();
|
|
|
|
|
|
|
|
return v ?: 1;
|
|
|
|
}
|
2015-07-29 07:02:05 +08:00
|
|
|
|
2015-09-16 06:24:20 +08:00
|
|
|
static inline void sk_set_txhash(struct sock *sk)
|
|
|
|
{
|
2021-06-10 22:44:11 +08:00
|
|
|
/* This pairs with READ_ONCE() in skb_set_hash_from_sk() */
|
|
|
|
WRITE_ONCE(sk->sk_txhash, net_tx_rndhash());
|
2015-07-29 07:02:05 +08:00
|
|
|
}
|
|
|
|
|
2021-01-20 03:26:19 +08:00
|
|
|
static inline bool sk_rethink_txhash(struct sock *sk)
|
2015-07-29 07:02:06 +08:00
|
|
|
{
|
2022-01-31 21:31:22 +08:00
|
|
|
if (sk->sk_txhash && sk->sk_txrehash == SOCK_TXREHASH_ENABLED) {
|
2015-07-29 07:02:06 +08:00
|
|
|
sk_set_txhash(sk);
|
2021-01-20 03:26:19 +08:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
2015-07-29 07:02:06 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline struct dst_entry *
|
|
|
|
__sk_dst_get(struct sock *sk)
|
|
|
|
{
|
2016-04-05 23:10:15 +08:00
|
|
|
return rcu_dereference_check(sk->sk_dst_cache,
|
|
|
|
lockdep_sock_is_held(sk));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct dst_entry *
|
|
|
|
sk_dst_get(struct sock *sk)
|
|
|
|
{
|
|
|
|
struct dst_entry *dst;
|
|
|
|
|
2010-04-09 07:03:29 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
dst = rcu_dereference(sk->sk_dst_cache);
|
2014-06-25 01:05:11 +08:00
|
|
|
if (dst && !atomic_inc_not_zero(&dst->__refcnt))
|
|
|
|
dst = NULL;
|
2010-04-09 07:03:29 +08:00
|
|
|
rcu_read_unlock();
|
2005-04-17 06:20:36 +08:00
|
|
|
return dst;
|
|
|
|
}
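
sk_dst_get() above only returns the cached entry if atomic_inc_not_zero() succeeds, so an entry whose count has already dropped to zero is never revived. A minimal user-space sketch of that "increment unless zero" primitive follows, assuming C11 atomics and a compare-and-swap loop; `demo_inc_not_zero()` is a hypothetical name, and the kernel's own implementation may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count is still non-zero. */
static bool demo_inc_not_zero(atomic_int *cnt)
{
	int old = atomic_load(cnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current value and we retry. */
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return true;
	}
	return false;	/* the object is already on its way to destruction */
}
```

A lockless reader that finds a pointer under RCU cannot know whether the last reference was just dropped, so it has to use this conditional form and treat failure as "not found".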
|
|
|
|
|
2021-01-20 03:26:19 +08:00
|
|
|
static inline void __dst_negative_advice(struct sock *sk)
|
2010-04-09 07:03:29 +08:00
|
|
|
{
|
|
|
|
struct dst_entry *ndst, *dst = __sk_dst_get(sk);
|
|
|
|
|
|
|
|
if (dst && dst->ops->negative_advice) {
|
|
|
|
ndst = dst->ops->negative_advice(dst);
|
|
|
|
|
|
|
|
if (ndst != dst) {
|
|
|
|
rcu_assign_pointer(sk->sk_dst_cache, ndst);
|
2013-10-22 16:23:38 +08:00
|
|
|
sk_tx_queue_clear(sk);
|
2017-02-07 05:14:11 +08:00
|
|
|
sk->sk_dst_pending_confirm = 0;
|
2010-04-09 07:03:29 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-01-20 03:26:19 +08:00
|
|
|
static inline void dst_negative_advice(struct sock *sk)
|
|
|
|
{
|
|
|
|
sk_rethink_txhash(sk);
|
|
|
|
__dst_negative_advice(sk);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline void
|
|
|
|
__sk_dst_set(struct sock *sk, struct dst_entry *dst)
|
|
|
|
{
|
|
|
|
struct dst_entry *old_dst;
|
|
|
|
|
2009-10-20 07:46:20 +08:00
|
|
|
sk_tx_queue_clear(sk);
|
2017-02-07 05:14:11 +08:00
|
|
|
sk->sk_dst_pending_confirm = 0;
|
2017-03-07 03:23:55 +08:00
|
|
|
old_dst = rcu_dereference_protected(sk->sk_dst_cache,
|
|
|
|
lockdep_sock_is_held(sk));
|
2010-04-09 07:03:29 +08:00
|
|
|
rcu_assign_pointer(sk->sk_dst_cache, dst);
|
2005-04-17 06:20:36 +08:00
|
|
|
dst_release(old_dst);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
sk_dst_set(struct sock *sk, struct dst_entry *dst)
|
|
|
|
{
|
2014-06-30 16:26:23 +08:00
|
|
|
struct dst_entry *old_dst;
|
|
|
|
|
|
|
|
sk_tx_queue_clear(sk);
|
2017-02-07 05:14:11 +08:00
|
|
|
sk->sk_dst_pending_confirm = 0;
|
2014-07-02 17:39:38 +08:00
|
|
|
old_dst = xchg((__force struct dst_entry **)&sk->sk_dst_cache, dst);
|
2014-06-30 16:26:23 +08:00
|
|
|
dst_release(old_dst);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
__sk_dst_reset(struct sock *sk)
|
|
|
|
{
|
2010-04-09 07:03:29 +08:00
|
|
|
__sk_dst_set(sk, NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
sk_dst_reset(struct sock *sk)
|
|
|
|
{
|
2014-06-30 16:26:23 +08:00
|
|
|
sk_dst_set(sk, NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-07 05:14:11 +08:00
|
|
|
static inline void sk_dst_confirm(struct sock *sk)
|
|
|
|
{
|
2019-11-06 06:11:51 +08:00
|
|
|
if (!READ_ONCE(sk->sk_dst_pending_confirm))
|
|
|
|
WRITE_ONCE(sk->sk_dst_pending_confirm, 1);
|
2017-02-07 05:14:11 +08:00
|
|
|
}
|
|
|
|
|
2017-02-07 05:14:12 +08:00
|
|
|
static inline void sock_confirm_neigh(struct sk_buff *skb, struct neighbour *n)
|
|
|
|
{
|
|
|
|
if (skb_get_dst_pending_confirm(skb)) {
|
|
|
|
struct sock *sk = skb->sk;
|
|
|
|
|
2019-11-06 06:11:51 +08:00
|
|
|
if (sk && READ_ONCE(sk->sk_dst_pending_confirm))
|
|
|
|
WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
|
2021-11-23 10:54:30 +08:00
|
|
|
neigh_confirm(n);
|
2017-02-07 05:14:12 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-04-01 23:07:44 +08:00
|
|
|
bool sk_mc_loop(struct sock *sk);
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_can_gso(const struct sock *sk)
|
2006-07-01 04:36:35 +08:00
|
|
|
{
|
|
|
|
return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
|
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void sk_setup_caps(struct sock *sk, struct dst_entry *dst);
|
2005-08-10 10:49:02 +08:00
|
|
|
|
2021-11-16 03:02:35 +08:00
|
|
|
static inline void sk_gso_disable(struct sock *sk)
|
2010-05-16 15:36:33 +08:00
|
|
|
{
|
2021-11-16 03:02:35 +08:00
|
|
|
sk->sk_gso_disabled = 1;
|
|
|
|
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
|
2010-05-16 15:36:33 +08:00
|
|
|
}
|
|
|
|
|
2011-04-05 13:30:30 +08:00
|
|
|
static inline int skb_do_copy_data_nocache(struct sock *sk, struct sk_buff *skb,
|
2014-11-29 02:40:20 +08:00
|
|
|
struct iov_iter *from, char *to,
|
2011-04-07 02:40:12 +08:00
|
|
|
int copy, int offset)
|
2011-04-05 13:30:30 +08:00
|
|
|
{
|
|
|
|
if (skb->ip_summed == CHECKSUM_NONE) {
|
2014-11-29 02:40:20 +08:00
|
|
|
__wsum csum = 0;
|
2016-11-02 10:42:45 +08:00
|
|
|
if (!csum_and_copy_from_iter_full(to, copy, &csum, from))
|
2014-11-29 02:40:20 +08:00
|
|
|
return -EFAULT;
|
2011-04-07 02:40:12 +08:00
|
|
|
skb->csum = csum_block_add(skb->csum, csum, offset);
|
2011-04-05 13:30:30 +08:00
|
|
|
} else if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY) {
|
2016-11-02 10:42:45 +08:00
|
|
|
if (!copy_from_iter_full_nocache(to, copy, from))
|
2011-04-05 13:30:30 +08:00
|
|
|
return -EFAULT;
|
2016-11-02 10:42:45 +08:00
|
|
|
} else if (!copy_from_iter_full(to, copy, from))
|
2011-04-05 13:30:30 +08:00
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int skb_add_data_nocache(struct sock *sk, struct sk_buff *skb,
|
2014-11-29 02:40:20 +08:00
|
|
|
struct iov_iter *from, int copy)
|
2011-04-05 13:30:30 +08:00
|
|
|
{
|
2011-04-07 02:40:12 +08:00
|
|
|
int err, offset = skb->len;
|
2011-04-05 13:30:30 +08:00
|
|
|
|
2011-04-07 02:40:12 +08:00
|
|
|
err = skb_do_copy_data_nocache(sk, skb, from, skb_put(skb, copy),
|
|
|
|
copy, offset);
|
2011-04-05 13:30:30 +08:00
|
|
|
if (err)
|
2011-04-07 02:40:12 +08:00
|
|
|
__skb_trim(skb, offset);
|
2011-04-05 13:30:30 +08:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2014-11-29 02:40:20 +08:00
|
|
|
static inline int skb_copy_to_page_nocache(struct sock *sk, struct iov_iter *from,
|
2011-04-05 13:30:30 +08:00
|
|
|
struct sk_buff *skb,
|
|
|
|
struct page *page,
|
|
|
|
int off, int copy)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
2011-04-07 02:40:12 +08:00
|
|
|
err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) + off,
|
|
|
|
copy, skb->len);
|
2011-04-05 13:30:30 +08:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2022-06-23 00:09:03 +08:00
|
|
|
skb_len_add(skb, copy);
|
2019-10-11 11:17:46 +08:00
|
|
|
sk_wmem_queued_add(sk, copy);
|
2011-04-05 13:30:30 +08:00
|
|
|
sk_mem_charge(sk, copy);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-06-16 18:12:03 +08:00
|
|
|
/**
|
|
|
|
* sk_wmem_alloc_get - returns write allocations
|
|
|
|
* @sk: socket
|
|
|
|
*
|
2020-02-16 03:42:37 +08:00
|
|
|
* Return: sk_wmem_alloc minus initial offset of one
|
2009-06-16 18:12:03 +08:00
|
|
|
*/
|
|
|
|
static inline int sk_wmem_alloc_get(const struct sock *sk)
|
|
|
|
{
|
2017-06-30 18:08:00 +08:00
|
|
|
return refcount_read(&sk->sk_wmem_alloc) - 1;
|
2009-06-16 18:12:03 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sk_rmem_alloc_get - returns read allocations
|
|
|
|
* @sk: socket
|
|
|
|
*
|
2020-02-16 03:42:37 +08:00
|
|
|
* Return: sk_rmem_alloc
|
2009-06-16 18:12:03 +08:00
|
|
|
*/
|
|
|
|
static inline int sk_rmem_alloc_get(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return atomic_read(&sk->sk_rmem_alloc);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sk_has_allocations - check if allocations are outstanding
|
|
|
|
* @sk: socket
|
|
|
|
*
|
2020-02-16 03:42:37 +08:00
|
|
|
* Return: true if socket has write or read allocations
|
2009-06-16 18:12:03 +08:00
|
|
|
*/
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sk_has_allocations(const struct sock *sk)
|
2009-06-16 18:12:03 +08:00
|
|
|
{
|
|
|
|
return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
|
|
|
|
}
|
|
|
|
|
2009-07-08 20:09:13 +08:00
|
|
|
/**
|
2015-11-26 13:55:39 +08:00
|
|
|
* skwq_has_sleeper - check if there are any waiting processes
|
2010-05-25 14:54:18 +08:00
|
|
|
* @wq: struct socket_wq
|
2009-07-08 20:09:13 +08:00
|
|
|
*
|
2020-02-16 03:42:37 +08:00
|
|
|
* Return: true if socket_wq has waiting processes
|
2009-07-08 20:09:13 +08:00
|
|
|
*
|
2015-11-26 13:55:39 +08:00
|
|
|
* The purpose of the skwq_has_sleeper and sock_poll_wait is to wrap the memory
|
2009-07-08 20:09:13 +08:00
|
|
|
* barrier call. They were added due to the race found within the tcp code.
|
|
|
|
*
|
2017-05-12 20:35:46 +08:00
|
|
|
* Consider the following tcp code paths::
|
2009-07-08 20:09:13 +08:00
|
|
|
*
|
2017-05-12 20:35:46 +08:00
|
|
|
*   CPU1                CPU2
*   sys_select          receive packet
*   ...                 ...
*   __add_wait_queue    update tp->rcv_nxt
*   ...                 ...
*   tp->rcv_nxt check   sock_def_readable
*   ...                 {
*   schedule               rcu_read_lock();
*                          wq = rcu_dereference(sk->sk_wq);
*                          if (wq && waitqueue_active(&wq->wait))
*                              wake_up_interruptible(&wq->wait)
*                          ...
*                       }
|
|
|
|
*
|
|
|
|
* The race for tcp fires when the __add_wait_queue changes done by CPU1 stay
|
|
|
|
* in its cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1
|
|
|
|
* could then end up calling schedule and sleep forever if there are no more
|
|
|
|
* data on the socket.
|
2009-07-08 20:10:31 +08:00
|
|
|
*
|
2009-07-08 20:09:13 +08:00
|
|
|
*/
|
2015-11-26 13:55:39 +08:00
|
|
|
static inline bool skwq_has_sleeper(struct socket_wq *wq)
|
2009-07-08 20:09:13 +08:00
|
|
|
{
|
2015-11-26 13:55:39 +08:00
|
|
|
return wq && wq_has_sleeper(&wq->wait);
|
2009-07-08 20:09:13 +08:00
|
|
|
}
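
The race described in the comment above is the classic "store, then load" pattern on both sides, which needs a full barrier between each side's store and load. A minimal user-space sketch follows, assuming C11 sequentially consistent fences in place of smp_mb(); the two flags and the `demo_*` functions are hypothetical stand-ins for the wait queue entry and tp->rcv_nxt, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool demo_waiter_queued;	/* stands in for the wait queue entry */
static atomic_bool demo_data_ready;	/* stands in for tp->rcv_nxt advancing */

/* Waiter side: register first, full barrier, then check for data. */
static bool demo_waiter_should_sleep(void)
{
	atomic_store_explicit(&demo_waiter_queued, true, memory_order_relaxed);
	/* Plays the role of smp_mb() in sock_poll_wait(). */
	atomic_thread_fence(memory_order_seq_cst);
	return !atomic_load_explicit(&demo_data_ready, memory_order_relaxed);
}

/* Waker side: publish data first, full barrier, then look for a sleeper. */
static bool demo_waker_must_wake(void)
{
	atomic_store_explicit(&demo_data_ready, true, memory_order_relaxed);
	/* Plays the role of the barrier paired in wq_has_sleeper(). */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&demo_waiter_queued, memory_order_relaxed);
}
```

With both fences in place, at least one side is guaranteed to observe the other's store: either the waiter sees the data and does not sleep, or the waker sees the waiter and wakes it.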
|
|
|
|
|
|
|
|
/**
|
|
|
|
* sock_poll_wait - place memory barrier behind the poll_wait call.
|
|
|
|
* @filp: file
|
2018-10-23 19:40:39 +08:00
|
|
|
* @sock: socket to wait on
|
2009-07-08 20:09:13 +08:00
|
|
|
* @p: poll_table
|
|
|
|
*
|
2010-04-29 19:01:49 +08:00
|
|
|
* See the comments in the wq_has_sleeper function.
|
2009-07-08 20:09:13 +08:00
|
|
|
*/
|
2018-10-23 19:40:39 +08:00
|
|
|
static inline void sock_poll_wait(struct file *filp, struct socket *sock,
|
|
|
|
poll_table *p)
|
2009-07-08 20:09:13 +08:00
|
|
|
{
|
2018-07-30 15:42:11 +08:00
|
|
|
if (!poll_does_not_wait(p)) {
|
2019-07-06 03:14:16 +08:00
|
|
|
poll_wait(filp, &sock->wq.wait, p);
|
2012-05-17 06:48:15 +08:00
|
|
|
/* We need to be sure we are in sync with the
|
2009-07-08 20:09:13 +08:00
|
|
|
* socket flags modification.
|
|
|
|
*
|
2010-04-29 19:01:49 +08:00
|
|
|
* This memory barrier is paired in the wq_has_sleeper.
|
2012-05-17 06:48:15 +08:00
|
|
|
*/
|
2009-07-08 20:09:13 +08:00
|
|
|
smp_mb();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-07-02 12:32:17 +08:00
|
|
|
static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
|
|
|
|
{
|
2021-06-10 22:44:11 +08:00
|
|
|
/* This pairs with WRITE_ONCE() in sk_set_txhash() */
|
|
|
|
u32 txhash = READ_ONCE(sk->sk_txhash);
|
|
|
|
|
|
|
|
if (txhash) {
|
2014-07-02 12:32:17 +08:00
|
|
|
skb->l4_hash = 1;
|
2021-06-10 22:44:11 +08:00
|
|
|
skb->hash = txhash;
|
2014-07-02 12:32:17 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-11-02 07:36:55 +08:00
|
|
|
void skb_set_owner_w(struct sk_buff *skb, struct sock *sk);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2012-05-17 06:48:15 +08:00
|
|
|
* Queue a received datagram if it will fit. Stream and sequenced
|
2005-04-17 06:20:36 +08:00
|
|
|
* protocols can't normally use this as they need to fit buffers in
|
|
|
|
* and play with them.
|
|
|
|
*
|
2012-05-17 06:48:15 +08:00
|
|
|
* Inlined as it's very short and called for pretty much every
|
2005-04-17 06:20:36 +08:00
|
|
|
* packet ever received.
|
|
|
|
*/
|
|
|
|
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
|
|
|
|
{
|
2009-06-22 10:25:25 +08:00
|
|
|
skb_orphan(skb);
|
2005-04-17 06:20:36 +08:00
|
|
|
skb->sk = sk;
|
|
|
|
skb->destructor = sock_rfree;
|
|
|
|
atomic_add(skb->truesize, &sk->sk_rmem_alloc);
|
2007-12-31 16:11:19 +08:00
|
|
|
sk_mem_charge(sk, skb->truesize);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2021-05-11 16:35:21 +08:00
|
|
|
static inline __must_check bool skb_set_owner_sk_safe(struct sk_buff *skb, struct sock *sk)
|
2021-03-31 00:43:54 +08:00
|
|
|
{
|
|
|
|
if (sk && refcount_inc_not_zero(&sk->sk_refcnt)) {
|
|
|
|
skb_orphan(skb);
|
|
|
|
skb->destructor = sock_efree;
|
|
|
|
skb->sk = sk;
|
2021-05-11 16:35:21 +08:00
|
|
|
return true;
|
2021-03-31 00:43:54 +08:00
|
|
|
}
|
2021-05-11 16:35:21 +08:00
|
|
|
return false;
|
2021-03-31 00:43:54 +08:00
|
|
|
}
|
|
|
|
|
2021-07-29 00:24:03 +08:00
|
|
|
static inline void skb_prepare_for_gro(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
if (skb->destructor != sock_wfree) {
|
|
|
|
skb_orphan(skb);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
skb->slow_gro = 1;
|
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void sk_reset_timer(struct sock *sk, struct timer_list *timer,
|
|
|
|
unsigned long expires);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void sk_stop_timer(struct sock *sk, struct timer_list *timer);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-09-24 08:30:01 +08:00
|
|
|
void sk_stop_timer_sync(struct sock *sk, struct timer_list *timer);
|
|
|
|
|
2017-05-16 17:20:13 +08:00
|
|
|
int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
|
|
|
|
struct sk_buff *skb, unsigned int flags,
|
2017-02-06 01:25:24 +08:00
|
|
|
void (*destructor)(struct sock *sk,
|
|
|
|
struct sk_buff *skb));
|
2016-04-06 00:41:15 +08:00
|
|
|
int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
|
2022-04-07 14:20:49 +08:00
|
|
|
|
|
|
|
int sock_queue_rcv_skb_reason(struct sock *sk, struct sk_buff *skb,
|
|
|
|
enum skb_drop_reason *reason);
|
|
|
|
|
|
|
|
static inline int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
return sock_queue_rcv_skb_reason(sk, skb, NULL);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb);
|
2014-09-01 09:30:27 +08:00
|
|
|
struct sk_buff *sock_dequeue_err_skb(struct sock *sk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Recover an error report and clear atomically
|
|
|
|
*/
|
2012-05-17 06:48:15 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline int sock_error(struct sock *sk)
|
|
|
|
{
|
2005-12-14 15:22:19 +08:00
|
|
|
int err;
|
2021-06-10 22:27:37 +08:00
|
|
|
|
|
|
|
/* Avoid an atomic operation for the common case.
|
|
|
|
* This is racy since another cpu/thread can change sk_err under us.
|
|
|
|
*/
|
|
|
|
if (likely(data_race(!sk->sk_err)))
|
2005-12-14 15:22:19 +08:00
|
|
|
return 0;
|
2021-06-10 22:27:37 +08:00
|
|
|
|
2005-12-14 15:22:19 +08:00
|
|
|
err = xchg(&sk->sk_err, 0);
|
2005-04-17 06:20:36 +08:00
|
|
|
return -err;
|
|
|
|
}
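
sock_error() above combines a cheap racy check with an atomic exchange so that a pending error is reported once and cleared in the same step. A minimal user-space sketch of that pattern follows, assuming a C11 `atomic_int` in place of sk_err; `demo_take_error()` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>

/* Fetch a pending error and clear it in one step; returns 0 or a negative value. */
static int demo_take_error(atomic_int *pending)
{
	/* Fast path: a plain read avoids the costlier exchange when nothing is pending. */
	if (atomic_load_explicit(pending, memory_order_relaxed) == 0)
		return 0;

	/* The exchange ensures a given error is collected by exactly one caller. */
	return -atomic_exchange(pending, 0);
}
```

If another thread clears the error between the check and the exchange, the exchange simply returns 0, so the racy fast path never produces a wrong result.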
|
|
|
|
|
2021-06-28 06:48:21 +08:00
|
|
|
void sk_error_report(struct sock *sk);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline unsigned long sock_wspace(struct sock *sk)
|
|
|
|
{
|
|
|
|
int amt = 0;
|
|
|
|
|
|
|
|
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
|
2017-06-30 18:08:00 +08:00
|
|
|
amt = sk->sk_sndbuf - refcount_read(&sk->sk_wmem_alloc);
|
2012-05-17 06:48:15 +08:00
|
|
|
if (amt < 0)
|
2005-04-17 06:20:36 +08:00
|
|
|
amt = 0;
|
|
|
|
}
|
|
|
|
return amt;
|
|
|
|
}
|
|
|
|
|
2015-11-30 12:03:11 +08:00
|
|
|
/* Note:
|
|
|
|
* We use sk->sk_wq_raw, from contexts knowing this
|
|
|
|
* pointer is not NULL and cannot disappear/change.
|
|
|
|
*/
|
2015-11-30 12:03:10 +08:00
|
|
|
static inline void sk_set_bit(int nr, struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2016-04-26 01:39:34 +08:00
|
|
|
if ((nr == SOCKWQ_ASYNC_NOSPACE || nr == SOCKWQ_ASYNC_WAITDATA) &&
|
|
|
|
!sock_flag(sk, SOCK_FASYNC))
|
2016-04-26 01:39:32 +08:00
|
|
|
return;
|
|
|
|
|
2015-11-30 12:03:11 +08:00
|
|
|
set_bit(nr, &sk->sk_wq_raw->flags);
|
2015-11-30 12:03:10 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sk_clear_bit(int nr, struct sock *sk)
|
|
|
|
{
|
2016-04-26 01:39:34 +08:00
|
|
|
if ((nr == SOCKWQ_ASYNC_NOSPACE || nr == SOCKWQ_ASYNC_WAITDATA) &&
|
|
|
|
!sock_flag(sk, SOCK_FASYNC))
|
2016-04-26 01:39:32 +08:00
|
|
|
return;
|
|
|
|
|
2015-11-30 12:03:11 +08:00
|
|
|
clear_bit(nr, &sk->sk_wq_raw->flags);
|
2015-11-30 12:03:10 +08:00
|
|
|
}
|
|
|
|
|
2015-11-30 12:03:11 +08:00
|
|
|
static inline void sk_wake_async(const struct sock *sk, int how, int band)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2015-11-30 12:03:11 +08:00
|
|
|
if (sock_flag(sk, SOCK_FASYNC)) {
|
|
|
|
rcu_read_lock();
|
|
|
|
sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
net: sock: adapt SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF
The current situation is that SOCK_MIN_RCVBUF is 2048 + sizeof(struct sk_buff))
while SOCK_MIN_SNDBUF is 2048. Since in both cases, skb->truesize is used for
sk_{r,w}mem_alloc accounting, we should have both sizes adjusted via defining a
TCP_SKB_MIN_TRUESIZE.
Further, as Eric Dumazet points out, the minimal skb truesize in transmit path is
SKB_TRUESIZE(2048) after commit f07d960df33c5 ("tcp: avoid frag allocation for
small frames"), and tcp_sendmsg() tries to limit skb size to half the congestion
window, meaning we try to build two skbs at minimum. Thus, having SOCK_MIN_SNDBUF
as 2048 can hit a small regression for some applications setting to low
SO_SNDBUF / SO_RCVBUF. Note that we define a TCP_SKB_MIN_TRUESIZE, because
SKB_TRUESIZE(2048) adds SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), but in
case of TCP skbs, the skb_shared_info is part of the 2048 bytes allocation for
skb->head.
The minor adaption in sk_stream_moderate_sndbuf() is to silence a warning by
using a typed max macro, as similarly done in SOCK_MIN_RCVBUF occurences, that
would appear otherwise.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:51:20 +08:00
|
|
|
/* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might
|
|
|
|
* need sizeof(sk_buff) + MTU + padding, unless net driver performs copybreak.
|
|
|
|
* Note: for send buffers, TCP works better if we can build two skbs at
|
|
|
|
* minimum.
|
2010-09-27 09:53:07 +08:00
|
|
|
*/
|
2013-07-03 20:02:22 +08:00
|
|
|
#define TCP_SKB_MIN_TRUESIZE (2048 + SKB_DATA_ALIGN(sizeof(struct sk_buff)))
|
2013-06-19 18:51:20 +08:00
|
|
|
|
|
|
|
#define SOCK_MIN_SNDBUF (TCP_SKB_MIN_TRUESIZE * 2)
|
|
|
|
#define SOCK_MIN_RCVBUF TCP_SKB_MIN_TRUESIZE
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
static inline void sk_stream_moderate_sndbuf(struct sock *sk)
|
|
|
|
{
|
2019-10-11 11:17:45 +08:00
|
|
|
u32 val;
|
|
|
|
|
|
|
|
if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
|
|
|
|
return;
|
|
|
|
|
|
|
|
val = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
|
2021-09-30 01:25:12 +08:00
|
|
|
val = max_t(u32, val, sk_unused_reserved_mem(sk));
|
2019-10-11 11:17:45 +08:00
|
|
|
|
|
|
|
WRITE_ONCE(sk->sk_sndbuf, max_t(u32, val, SOCK_MIN_SNDBUF));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
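The clamping order in sk_stream_moderate_sndbuf() can be sketched in user space as follows. This is a minimal illustration, not kernel code: `demo_moderate_sndbuf()`, its helpers, and the `DEMO_MIN_SNDBUF` value are hypothetical stand-ins, since the real SOCK_MIN_SNDBUF depends on sizeof(struct sk_buff) on the target platform.

```c
#include <stdint.h>

/* Hypothetical floor for this sketch only; the kernel derives
 * SOCK_MIN_SNDBUF from TCP_SKB_MIN_TRUESIZE * 2. */
#define DEMO_MIN_SNDBUF 4608u

static uint32_t demo_min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t demo_max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Same order as the helper above: skip if the user locked the size,
 * take min(sndbuf, queued / 2), raise to the unused reserved memory,
 * then apply the minimum send buffer floor. */
static uint32_t demo_moderate_sndbuf(uint32_t sndbuf, uint32_t wmem_queued,
				     uint32_t unused_reserved, int sndbuf_locked)
{
	uint32_t val;

	if (sndbuf_locked)
		return sndbuf;

	val = demo_min_u32(sndbuf, wmem_queued >> 1);
	val = demo_max_u32(val, unused_reserved);
	return demo_max_u32(val, DEMO_MIN_SNDBUF);
}
```

For example, with a send buffer of 87380 bytes and 20000 bytes queued, the sketch returns 10000; with only 2000 bytes queued, the floor applies and it returns DEMO_MIN_SNDBUF.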
|
|
|
|
|
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase the probability of coalescing small write() calls into
single segments in skbs still in the write queue (not yet sent).
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
the page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub fragments, since some arches have PAGE_SIZE=65536.
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 07:04:42 +08:00
|
|
|
/**
|
|
|
|
* sk_page_frag - return an appropriate page_frag
|
|
|
|
* @sk: socket
|
|
|
|
*
|
net: fix sk_page_frag() recursion from memory reclaim
sk_page_frag() optimizes skb_frag allocations by using per-task
skb_frag cache when it knows it's the only user. The condition is
determined by seeing whether the socket allocation mask allows
blocking - if the allocation may block, it obviously owns the task's
context and ergo exclusively owns current->task_frag.
Unfortunately, this misses recursion through memory reclaim path.
Please take a look at the following backtrace.
[2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
...
tcp_sendmsg+0x27/0x40
sock_sendmsg+0x30/0x40
sock_xmit.isra.24+0xa1/0x170 [nbd]
nbd_send_cmd+0x1d2/0x690 [nbd]
nbd_queue_rq+0x1b5/0x3b0 [nbd]
__blk_mq_try_issue_directly+0x108/0x1b0
blk_mq_request_issue_directly+0xbd/0xe0
blk_mq_try_issue_list_directly+0x41/0xb0
blk_mq_sched_insert_requests+0xa2/0xe0
blk_mq_flush_plug_list+0x205/0x2a0
blk_flush_plug_list+0xc3/0xf0
[1] blk_finish_plug+0x21/0x2e
_xfs_buf_ioapply+0x313/0x460
__xfs_buf_submit+0x67/0x220
xfs_buf_read_map+0x113/0x1a0
xfs_trans_read_buf_map+0xbf/0x330
xfs_btree_read_buf_block.constprop.42+0x95/0xd0
xfs_btree_lookup_get_block+0x95/0x170
xfs_btree_lookup+0xcc/0x470
xfs_bmap_del_extent_real+0x254/0x9a0
__xfs_bunmapi+0x45c/0xab0
xfs_bunmapi+0x15/0x30
xfs_itruncate_extents_flags+0xca/0x250
xfs_free_eofblocks+0x181/0x1e0
xfs_fs_destroy_inode+0xa8/0x1b0
destroy_inode+0x38/0x70
dispose_list+0x35/0x50
prune_icache_sb+0x52/0x70
super_cache_scan+0x120/0x1a0
do_shrink_slab+0x120/0x290
shrink_slab+0x216/0x2b0
shrink_node+0x1b6/0x4a0
do_try_to_free_pages+0xc6/0x370
try_to_free_mem_cgroup_pages+0xe3/0x1e0
try_charge+0x29e/0x790
mem_cgroup_charge_skmem+0x6a/0x100
__sk_mem_raise_allocated+0x18e/0x390
__sk_mem_schedule+0x2a/0x40
[0] tcp_sendmsg_locked+0x8eb/0xe10
tcp_sendmsg+0x27/0x40
sock_sendmsg+0x30/0x40
___sys_sendmsg+0x26d/0x2b0
__sys_sendmsg+0x57/0xa0
do_syscall_64+0x42/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
In [0], tcp_sendmsg_locked() was using current->page_frag when it
called sk_wmem_schedule(). It already calculated how many bytes can
be fit into current->page_frag. Due to memory pressure,
sk_wmem_schedule() called into memory reclaim path which called into
xfs and then IO issue path. Because the filesystem in question is
backed by nbd, the control goes back into the tcp layer - back into
tcp_sendmsg_locked().
nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress. However, this
confused sk_page_frag() called from [2]. Because it only tests
whether the allocation allows blocking which it does, it now thinks
current->page_frag can be used again although it already was being
used in [0].
After [2] used current->page_frag, the offset would be increased by
the used amount. When the control returns to [0],
current->page_frag's offset is increased and the previously calculated
number of bytes now may overrun the end of allocated memory leading to
silent memory corruptions.
Fix it by adding gfpflags_normal_context() which tests sleepable &&
!reclaim and use it to determine whether to use current->task_frag.
v2: Eric didn't like gfp flags being tested twice. Introduce a new
helper gfpflags_normal_context() and combine the two tests.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-25 04:50:27 +08:00
|
|
|
* Use the per task page_frag instead of the per socket one for
|
2021-11-27 02:34:21 +08:00
|
|
|
* optimization when we know that we're in process context and own
|
2019-10-25 04:50:27 +08:00
|
|
|
* everything that's associated with %current.
|
|
|
|
*
|
2021-11-27 02:34:21 +08:00
|
|
|
* Both direct reclaim and page faults can nest inside other
|
|
|
|
* socket operations and end up recursing into sk_page_frag()
|
|
|
|
* while it's already in use: explicitly avoid task page_frag
|
|
|
|
* usage if the caller is potentially doing any of them.
|
|
|
|
* This assumes that page fault handlers use the GFP_NOFS flags.
|
2020-02-16 03:42:37 +08:00
|
|
|
*
|
|
|
|
* Return: a per task page_frag if context allows that,
|
|
|
|
* otherwise a per socket one.
|
2012-09-24 07:04:42 +08:00
|
|
|
*/
|
|
|
|
static inline struct page_frag *sk_page_frag(struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-11-27 02:34:21 +08:00
|
|
|
if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
|
|
|
|
(__GFP_DIRECT_RECLAIM | __GFP_FS))
|
2012-09-24 07:04:42 +08:00
|
|
|
return &current->task_frag;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-09-24 07:04:42 +08:00
|
|
|
return &sk->sk_frag;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
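The mask comparison in sk_page_frag() can be illustrated with a small user-space sketch. The bit values below are hypothetical and chosen only for the example; the kernel's real __GFP_* constants are defined in its gfp headers. The point is that the per task frag is chosen only when direct reclaim and filesystem access are both allowed and the MEMALLOC bit is clear.

```c
#include <stdbool.h>

/* Hypothetical bit values for illustration only. */
#define DEMO_GFP_DIRECT_RECLAIM 0x1u
#define DEMO_GFP_MEMALLOC       0x2u
#define DEMO_GFP_FS             0x4u

/* Returns true when the per task page_frag would be selected:
 * among the three tested bits, exactly DIRECT_RECLAIM and FS are set.
 * Bits outside the mask do not influence the decision. */
static bool demo_use_task_frag(unsigned int allocation)
{
	const unsigned int mask = DEMO_GFP_DIRECT_RECLAIM |
				  DEMO_GFP_MEMALLOC |
				  DEMO_GFP_FS;
	const unsigned int want = DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_FS;

	return (allocation & mask) == want;
}
```

Under these assumed values, a mask with DIRECT_RECLAIM and MEMALLOC but without FS (similar in spirit to the nbd case described in the commit message above) yields false, so the per socket frag is used instead.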
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
|
2012-09-24 07:04:42 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Default write policy as shown to user space via poll/select/SIGIO
|
|
|
|
*/
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline bool sock_writeable(const struct sock *sk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2019-10-11 11:17:45 +08:00
|
|
|
return refcount_read(&sk->sk_wmem_alloc) < (READ_ONCE(sk->sk_sndbuf) >> 1);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2005-10-07 14:46:04 +08:00
|
|
|
static inline gfp_t gfp_any(void)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2009-02-13 08:43:17 +08:00
|
|
|
return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2021-08-18 03:40:03 +08:00
|
|
|
static inline gfp_t gfp_memcg_charge(void)
|
|
|
|
{
|
|
|
|
return in_softirq() ? GFP_NOWAIT : GFP_KERNEL;
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return noblock ? 0 : sk->sk_rcvtimeo;
|
|
|
|
}
|
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline long sock_sndtimeo(const struct sock *sk, bool noblock)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return noblock ? 0 : sk->sk_sndtimeo;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int sock_rcvlowat(const struct sock *sk, int waitall, int len)
|
|
|
|
{
|
2019-10-10 06:32:35 +08:00
|
|
|
int v = waitall ? len : min_t(int, READ_ONCE(sk->sk_rcvlowat), len);
|
|
|
|
|
|
|
|
return v ?: 1;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
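The low-water-mark selection in sock_rcvlowat() reduces to a few lines of arithmetic. The sketch below is a plain user-space restatement with a hypothetical name, `demo_rcvlowat()`, taking the socket's low-water mark as an ordinary integer instead of reading it from struct sock.

```c
/* With waitall set, the caller wants the full length; otherwise the
 * target is the smaller of the low-water mark and the requested length.
 * A result of zero is promoted to 1 so a reader always waits for at
 * least one byte. */
static int demo_rcvlowat(int sk_rcvlowat, int waitall, int len)
{
	int v = waitall ? len : (sk_rcvlowat < len ? sk_rcvlowat : len);

	return v ? v : 1;
}
```

For instance, a low-water mark of 64 with a 16-byte request gives 16, while a low-water mark of 0 still gives 1.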
|
|
|
|
|
|
|
|
/* Alas, with timeout socket operations are not restartable.
|
|
|
|
* Compare this to poll().
|
|
|
|
*/
|
|
|
|
static inline int sock_intr_errno(long timeo)
|
|
|
|
{
|
|
|
|
return timeo == MAX_SCHEDULE_TIMEOUT ? -ERESTARTSYS : -EINTR;
|
|
|
|
}
|
|
|
|
|
2015-03-01 20:58:31 +08:00
|
|
|
struct sock_skb_cb {
|
|
|
|
u32 dropcount;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* Store sock_skb_cb at the end of skb->cb[] so protocol families
|
|
|
|
* using skb->cb[] would keep using it directly and utilize its
|
|
|
|
* alignment guarantee.
|
|
|
|
*/
|
2019-12-10 02:31:43 +08:00
|
|
|
#define SOCK_SKB_CB_OFFSET ((sizeof_field(struct sk_buff, cb) - \
|
2015-03-01 20:58:31 +08:00
|
|
|
sizeof(struct sock_skb_cb)))
|
|
|
|
|
|
|
|
#define SOCK_SKB_CB(__skb) ((struct sock_skb_cb *)((__skb)->cb + \
|
|
|
|
SOCK_SKB_CB_OFFSET))
|
|
|
|
|
2015-03-01 20:58:29 +08:00
|
|
|
#define sock_skb_cb_check_size(size) \
|
2015-03-01 20:58:31 +08:00
|
|
|
BUILD_BUG_ON((size) > SOCK_SKB_CB_OFFSET)
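The layout trick behind SOCK_SKB_CB_OFFSET and sock_skb_cb_check_size() can be sketched outside the kernel. Everything prefixed `demo_` is hypothetical: the 48-byte scratch area is an assumption for the example, `_Static_assert` stands in for BUILD_BUG_ON, and memcpy() is used for the accesses so the sketch stays within ISO C aliasing rules.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer with a per-packet scratch area. */
struct demo_skb {
	_Alignas(8) char cb[48];
};

/* Socket-layer data stored at the very end of cb[]. */
struct demo_sock_cb {
	uint32_t dropcount;
};

#define DEMO_CB_OFFSET \
	(sizeof(((struct demo_skb *)0)->cb) - sizeof(struct demo_sock_cb))

/* Hypothetical protocol-private data living at the start of cb[]. */
struct demo_proto_cb {
	uint32_t seq;
	uint32_t flags;
};

/* Compile-time check that the protocol data does not reach the tail. */
_Static_assert(sizeof(struct demo_proto_cb) <= DEMO_CB_OFFSET,
	       "protocol cb overlaps the socket cb tail");

static void demo_set_dropcount(struct demo_skb *skb, uint32_t drops)
{
	struct demo_sock_cb scb = { .dropcount = drops };

	memcpy(skb->cb + DEMO_CB_OFFSET, &scb, sizeof(scb));
}

static uint32_t demo_get_dropcount(const struct demo_skb *skb)
{
	struct demo_sock_cb scb;

	memcpy(&scb, skb->cb + DEMO_CB_OFFSET, sizeof(scb));
	return scb.dropcount;
}
```

Because the socket-layer struct sits at the tail, a protocol can keep addressing cb[] from offset 0 as before, and the static assertion fails the build if its private data grows into the tail.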
|
2015-03-01 20:58:29 +08:00
|
|
|
|
2015-03-01 20:58:30 +08:00
|
|
|
static inline void
|
|
|
|
sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
|
|
|
|
{
|
2016-12-08 02:05:36 +08:00
|
|
|
SOCK_SKB_CB(skb)->dropcount = sock_flag(sk, SOCK_RXQ_OVFL) ?
|
|
|
|
atomic_read(&sk->sk_drops) : 0;
|
2015-03-01 20:58:30 +08:00
|
|
|
}
|
|
|
|
|
2016-04-01 23:52:19 +08:00
|
|
|
static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
|
|
|
|
|
|
|
|
atomic_add(segs, &sk->sk_drops);
|
|
|
|
}
|
|
|
|
|
2018-12-28 10:55:09 +08:00
|
|
|
static inline ktime_t sock_read_timestamp(struct sock *sk)
|
|
|
|
{
|
|
|
|
#if BITS_PER_LONG==32
|
|
|
|
unsigned int seq;
|
|
|
|
ktime_t kt;
|
|
|
|
|
|
|
|
do {
|
|
|
|
seq = read_seqbegin(&sk->sk_stamp_seq);
|
|
|
|
kt = sk->sk_stamp;
|
|
|
|
} while (read_seqretry(&sk->sk_stamp_seq, seq));
|
|
|
|
|
|
|
|
return kt;
|
|
|
|
#else
|
2019-11-05 13:38:43 +08:00
|
|
|
return READ_ONCE(sk->sk_stamp);
|
2018-12-28 10:55:09 +08:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void sock_write_timestamp(struct sock *sk, ktime_t kt)
|
|
|
|
{
|
|
|
|
#if BITS_PER_LONG==32
|
|
|
|
write_seqlock(&sk->sk_stamp_seq);
|
|
|
|
sk->sk_stamp = kt;
|
|
|
|
write_sequnlock(&sk->sk_stamp_seq);
|
|
|
|
#else
|
2019-11-05 13:38:43 +08:00
|
|
|
WRITE_ONCE(sk->sk_stamp, kt);
|
2018-12-28 10:55:09 +08:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2013-09-23 01:32:26 +08:00
|
|
|
void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
|
|
|
|
struct sk_buff *skb);
|
|
|
|
void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
|
|
|
|
struct sk_buff *skb);
|
2007-03-26 13:14:49 +08:00
|
|
|
|
2012-05-17 06:48:15 +08:00
|
|
|
static inline void
|
2005-04-17 06:20:36 +08:00
|
|
|
sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
|
|
|
|
{
|
2007-04-20 07:16:32 +08:00
|
|
|
ktime_t kt = skb->tstamp;
|
2009-02-12 13:03:38 +08:00
|
|
|
struct skb_shared_hwtstamps *hwtstamps = skb_hwtstamps(skb);
|
2005-08-15 08:24:31 +08:00
|
|
|
|
2009-02-12 13:03:38 +08:00
|
|
|
/*
|
|
|
|
* generate control messages if
|
2014-08-05 10:11:46 +08:00
|
|
|
* - receive time stamping in software requested
|
2009-02-12 13:03:38 +08:00
|
|
|
* - software time stamp available and wanted
|
|
|
|
* - hardware time stamps available and wanted
|
|
|
|
*/
|
|
|
|
if (sock_flag(sk, SOCK_RCVTSTAMP) ||
|
2014-08-05 10:11:46 +08:00
|
|
|
(sk->sk_tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
|
2016-12-25 18:38:40 +08:00
|
|
|
(kt && sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
|
|
|
|
(hwtstamps->hwtstamp &&
|
2014-08-05 10:11:46 +08:00
|
|
|
(sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
|
2007-03-26 13:14:49 +08:00
|
|
|
__sock_recv_timestamp(msg, sk, skb);
|
|
|
|
else
|
2018-12-28 10:55:09 +08:00
|
|
|
sock_write_timestamp(sk, kt);
|
2011-11-09 17:15:42 +08:00
|
|
|
|
|
|
|
if (sock_flag(sk, SOCK_WIFI_STATUS) && skb->wifi_acked_valid)
|
|
|
|
__sock_recv_wifi_status(msg, sk, skb);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2022-04-28 04:02:37 +08:00
|
|
|
void __sock_recv_cmsgs(struct msghdr *msg, struct sock *sk,
|
|
|
|
struct sk_buff *skb);
|
2010-04-29 03:14:43 +08:00
|
|
|
|
2017-03-30 20:03:06 +08:00
|
|
|
#define SK_DEFAULT_STAMP (-1L * NSEC_PER_SEC)
|
2022-04-28 04:02:37 +08:00
|
|
|
static inline void sock_recv_cmsgs(struct msghdr *msg, struct sock *sk,
|
|
|
|
struct sk_buff *skb)
|
2010-04-29 03:14:43 +08:00
|
|
|
{
|
2022-04-28 04:02:37 +08:00
|
|
|
#define FLAGS_RECV_CMSGS ((1UL << SOCK_RXQ_OVFL) | \
|
|
|
|
(1UL << SOCK_RCVTSTAMP) | \
|
|
|
|
(1UL << SOCK_RCVMARK))
|
2014-08-05 10:11:46 +08:00
|
|
|
#define TSFLAGS_ANY (SOF_TIMESTAMPING_SOFTWARE | \
|
|
|
|
SOF_TIMESTAMPING_RAW_HARDWARE)
|
2010-04-29 03:14:43 +08:00
|
|
|
|
2022-04-28 04:02:37 +08:00
|
|
|
if (sk->sk_flags & FLAGS_RECV_CMSGS || sk->sk_tsflags & TSFLAGS_ANY)
|
|
|
|
__sock_recv_cmsgs(msg, sk, skb);
|
2017-04-01 05:59:25 +08:00
|
|
|
else if (unlikely(sock_flag(sk, SOCK_TIMESTAMP)))
|
2018-12-28 10:55:09 +08:00
|
|
|
sock_write_timestamp(sk, skb->tstamp);
|
2017-03-30 20:03:06 +08:00
|
|
|
else if (unlikely(sk->sk_stamp == SK_DEFAULT_STAMP))
|
2018-12-28 10:55:09 +08:00
|
|
|
sock_write_timestamp(sk, 0);
|
2010-04-29 03:14:43 +08:00
|
|
|
}
|
net: Generalize socket rx gap / receive queue overflow cmsg
Create a new socket level option to report the number of queue overflows.
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. After I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch. It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if that's
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
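Note 1 above says the reported value is a running total and that deltas must be computed in user space. A minimal sketch of that subtraction follows, assuming the counter is read as an unsigned 32-bit value (matching the u32 dropcount field in struct sock_skb_cb); `demo_drops_delta()` is a hypothetical helper name.

```c
#include <stdint.h>

/* Drops that occurred between two consecutive readings of the
 * cumulative counter. Unsigned subtraction is modulo 2^32, so the
 * result stays correct across a single wraparound of the counter. */
static uint32_t demo_drops_delta(uint32_t prev, uint32_t curr)
{
	return curr - prev;
}
```

An application would keep the previous reading per socket and call the helper each time a new control message arrives.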
2009-10-13 04:26:31 +08:00
|
|
|
|
2016-04-03 11:08:12 +08:00
|
|
|
void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags);
|
2014-09-09 07:58:58 +08:00
|
|
|
|
2009-02-12 13:03:38 +08:00
|
|
|
/**
|
2018-12-18 01:24:00 +08:00
|
|
|
* _sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
|
2009-02-12 13:03:38 +08:00
|
|
|
* @sk: socket sending this packet
|
2016-04-03 11:08:12 +08:00
|
|
|
* @tsflags: timestamping flags to use
|
2014-08-06 17:49:29 +08:00
|
|
|
* @tx_flags: completed with instructions for time stamping
|
2018-12-18 01:24:00 +08:00
|
|
|
* @tskey: filled in with next sk_tskey (not for TCP, which uses seqno)
|
2014-08-06 17:49:29 +08:00
|
|
|
*
|
2017-05-12 20:35:46 +08:00
|
|
|
* Note: callers should take care of initial ``*tx_flags`` value (usually 0)
|
2009-02-12 13:03:38 +08:00
|
|
|
*/
|
2018-12-18 01:24:00 +08:00
|
|
|
static inline void _sock_tx_timestamp(struct sock *sk, __u16 tsflags,
|
|
|
|
__u8 *tx_flags, __u32 *tskey)
|
2014-09-09 07:58:58 +08:00
|
|
|
{
|
2018-12-18 01:24:00 +08:00
|
|
|
if (unlikely(tsflags)) {
|
2016-04-03 11:08:12 +08:00
|
|
|
__sock_tx_timestamp(tsflags, tx_flags);
|
2018-12-18 01:24:00 +08:00
|
|
|
if (tsflags & SOF_TIMESTAMPING_OPT_ID && tskey &&
|
|
|
|
tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
|
2022-02-18 01:05:02 +08:00
|
|
|
*tskey = atomic_inc_return(&sk->sk_tskey) - 1;
|
2018-12-18 01:24:00 +08:00
|
|
|
}
|
2014-09-09 07:58:58 +08:00
|
|
|
if (unlikely(sock_flag(sk, SOCK_WIFI_STATUS)))
|
|
|
|
*tx_flags |= SKBTX_WIFI_STATUS;
|
|
|
|
}
|
2009-02-12 13:03:38 +08:00
|
|
|
|
2018-12-18 01:24:00 +08:00
|
|
|
static inline void sock_tx_timestamp(struct sock *sk, __u16 tsflags,
|
|
|
|
__u8 *tx_flags)
|
|
|
|
{
|
|
|
|
_sock_tx_timestamp(sk, tsflags, tx_flags, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void skb_setup_tx_timestamp(struct sk_buff *skb, __u16 tsflags)
|
|
|
|
{
|
|
|
|
_sock_tx_timestamp(skb->sk, tsflags, &skb_shinfo(skb)->tx_flags,
|
|
|
|
&skb_shinfo(skb)->tskey);
|
|
|
|
}
|
|
|
|
|
2021-11-16 03:02:33 +08:00
|
|
|
static inline bool sk_is_tcp(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
|
|
|
* sk_eat_skb - Release a skb if it is no longer needed
|
2005-05-01 23:59:25 +08:00
|
|
|
* @sk: socket to eat this skb from
|
|
|
|
* @skb: socket buffer to eat
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* This routine must be called with interrupts disabled or with the socket
|
|
|
|
* locked so that the sk_buff queue operation is ok.
|
|
|
|
*/
|
2013-12-31 04:37:29 +08:00
|
|
|
static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
__skb_unlink(skb, &sk->sk_receive_queue);
|
|
|
|
__kfree_skb(skb);
|
|
|
|
}
|
|
|
|
|
2020-03-30 06:53:38 +08:00
|
|
|
static inline bool
|
|
|
|
skb_sk_is_prefetched(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_INET
|
|
|
|
return skb->destructor == sock_pfree;
|
|
|
|
#else
|
|
|
|
return false;
|
|
|
|
#endif /* CONFIG_INET */
|
|
|
|
}
|
|
|
|
|
2020-03-30 06:53:40 +08:00
|
|
|
/* This helper checks if a socket is a full socket,
|
|
|
|
* i.e. _not_ a timewait or request socket.
|
|
|
|
*/
|
|
|
|
static inline bool sk_fullsock(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
|
|
|
|
}
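The state test in sk_fullsock() is a one-hot bitmask check: the state number is turned into a single bit and compared against a set of excluded states. The sketch below uses its own illustrative state numbering and hypothetical `demo_` names; the kernel's TCP_* and TCPF_* definitions live in its TCP state headers.

```c
#include <stdbool.h>

/* Illustrative state numbers for this sketch only. */
enum demo_state {
	DEMO_ESTABLISHED  = 1,
	DEMO_TIME_WAIT    = 6,
	DEMO_LISTEN       = 10,
	DEMO_NEW_SYN_RECV = 12,
};

/* One bit per state, analogous to the TCPF_* flags. */
#define DEMO_F(state) (1u << (state))

/* A socket is "full" unless it is in one of the two mini-socket states. */
static bool demo_fullsock(int state)
{
	const unsigned int mini = DEMO_F(DEMO_TIME_WAIT) |
				  DEMO_F(DEMO_NEW_SYN_RECV);

	return (DEMO_F(state) & ~mini) != 0;
}
```

Testing membership in a set of states this way costs a shift and a mask, regardless of how many states the set contains.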
|
|
|
|
|
|
|
|
static inline bool
|
|
|
|
sk_is_refcounted(struct sock *sk)
|
|
|
|
{
|
|
|
|
/* Only full sockets have sk->sk_flags. */
|
|
|
|
return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
|
|
|
|
}
|
|
|
|
|
2020-03-30 06:53:39 +08:00
|
|
|
/**
|
2020-04-08 06:55:25 +08:00
|
|
|
* skb_steal_sock - steal a socket from an sk_buff
|
|
|
|
* @skb: sk_buff to steal the socket from
|
|
|
|
* @refcounted: is set to true if the socket is reference-counted
|
2020-03-30 06:53:39 +08:00
|
|
|
*/
|
|
|
|
static inline struct sock *
|
|
|
|
skb_steal_sock(struct sk_buff *skb, bool *refcounted)
|
2008-10-08 03:41:01 +08:00
|
|
|
{
|
2012-06-24 21:03:07 +08:00
|
|
|
if (skb->sk) {
|
2008-10-08 03:41:01 +08:00
|
|
|
struct sock *sk = skb->sk;
|
|
|
|
|
2020-03-30 06:53:39 +08:00
|
|
|
*refcounted = true;
|
2020-03-30 06:53:40 +08:00
|
|
|
if (skb_sk_is_prefetched(skb))
|
|
|
|
*refcounted = sk_is_refcounted(sk);
|
2008-10-08 03:41:01 +08:00
|
|
|
skb->destructor = NULL;
|
|
|
|
skb->sk = NULL;
|
|
|
|
return sk;
|
|
|
|
}
|
2020-03-30 06:53:39 +08:00
|
|
|
*refcounted = false;
|
2008-10-08 03:41:01 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2018-04-30 15:16:12 +08:00
|
|
|
/* Checks if this SKB belongs to an HW offloaded socket
 * and whether any SW fallbacks are required based on dev.
 * Check decrypted mark in case skb_orphan() cleared socket.
 */
static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
						   struct net_device *dev)
{
#ifdef CONFIG_SOCK_VALIDATE_XMIT
	struct sock *sk = skb->sk;

	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb) {
		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
#ifdef CONFIG_TLS_DEVICE
	} else if (unlikely(skb->decrypted)) {
		pr_warn_ratelimited("unencrypted skb with no associated socket - dropping\n");
		kfree_skb(skb);
		skb = NULL;
#endif
	}
#endif

	return skb;
}

/* This helper checks if a socket is a LISTEN or NEW_SYN_RECV socket.
 * SYNACK messages can be attached to either one (depending on SYNCOOKIE).
 */
static inline bool sk_listener(const struct sock *sk)
{
	return (1 << sk->sk_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
}

void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
		       int type);

bool sk_ns_capable(const struct sock *sk,
		   struct user_namespace *user_ns, int cap);
bool sk_capable(const struct sock *sk, int cap);
bool sk_net_capable(const struct sock *sk, int cap);

void sk_get_meminfo(const struct sock *sk, u32 *meminfo);

/* Take into consideration the size of the struct sk_buff overhead in the
 * determination of these values, since that is non-constant across
 * platforms.  This makes socket queueing behavior and performance
 * not depend upon such differences.
 */
#define _SK_MEM_PACKETS		256
#define _SK_MEM_OVERHEAD	SKB_TRUESIZE(256)
#define SK_WMEM_MAX		(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
#define SK_RMEM_MAX		(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)

extern __u32 sysctl_wmem_max;
extern __u32 sysctl_rmem_max;

extern int sysctl_tstamp_allow_data;
extern int sysctl_optmem_max;

extern __u32 sysctl_wmem_default;
extern __u32 sysctl_rmem_default;

#define SKB_FRAG_PAGE_ORDER	get_order(32768)
DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);

static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
{
	/* Does this proto have a per-netns sysctl_wmem? */
	if (proto->sysctl_wmem_offset)
		return *(int *)((void *)sock_net(sk) + proto->sysctl_wmem_offset);

	return *proto->sysctl_wmem;
}

static inline int sk_get_rmem0(const struct sock *sk, const struct proto *proto)
{
	/* Does this proto have a per-netns sysctl_rmem? */
	if (proto->sysctl_rmem_offset)
		return *(int *)((void *)sock_net(sk) + proto->sysctl_rmem_offset);

	return *proto->sysctl_rmem;
}

/* Default TCP Small queue budget is ~1 ms of data (1sec >> 10).
 * Some wifi drivers need to tweak it to get more chunks.
 * They can use this helper from their ndo_start_xmit().
 */
static inline void sk_pacing_shift_update(struct sock *sk, int val)
{
	if (!sk || !sk_fullsock(sk) || READ_ONCE(sk->sk_pacing_shift) == val)
		return;
	WRITE_ONCE(sk->sk_pacing_shift, val);
}

/* If a socket is bound to a device, check that the given device
 * index is either the same or that the socket is bound to an L3
 * master device and the given device index is also enslaved to
 * that L3 master.
 */
static inline bool sk_dev_equal_l3scope(struct sock *sk, int dif)
{
	int bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
	int mdif;

	if (!bound_dev_if || bound_dev_if == dif)
		return true;

	mdif = l3mdev_master_ifindex_by_index(sock_net(sk), dif);
	if (mdif && mdif == bound_dev_if)
		return true;

	return false;
}

void sock_def_readable(struct sock *sk);

int sock_bindtoindex(struct sock *sk, int ifindex, bool lock_sk);
void sock_set_timestamp(struct sock *sk, int optname, bool valbool);
int sock_set_timestamping(struct sock *sk, int optname,
			  struct so_timestamping timestamping);

void sock_enable_timestamps(struct sock *sk);
void sock_no_linger(struct sock *sk);
void sock_set_keepalive(struct sock *sk);
void sock_set_priority(struct sock *sk, u32 priority);
void sock_set_rcvbuf(struct sock *sk, int val);
void sock_set_mark(struct sock *sk, u32 val);
void sock_set_reuseaddr(struct sock *sk);
void sock_set_reuseport(struct sock *sk);
void sock_set_sndtimeo(struct sock *sk, s64 secs);

int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len);

int sock_get_timeout(long timeo, void *optval, bool old_timeval);
int sock_copy_user_timeval(struct __kernel_sock_timeval *tv,
			   sockptr_t optval, int optlen, bool old_timeval);

static inline bool sk_is_readable(struct sock *sk)
{
	if (sk->sk_prot->sock_is_readable)
		return sk->sk_prot->sock_is_readable(sk);
	return false;
}
#endif /* _SOCK_H */