linux/net/dccp
Joanne Koong 28044fc1d4 net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.

This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.

Please note a few things:
* There can be the case where the a socket's address changes after it
has been bound. There are two cases where this happens:

  1) The case where there is a bind() call on INADDR_ANY (ipv4) or
  IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
  assign the socket an address when it handles the connect()

  2) In inet_sk_reselect_saddr(), which is called when rebuilding the
  sk header and a few pre-conditions are met (eg rerouting fails).

In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.

* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.

This brings up a few stipulations:
  1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
  will always be acquired after the bhash lock and released before the
  bhash lock is released.

  2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
  acquired+released before another bhash2 lock is acquired+released.

* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-24 19:30:07 -07:00
..
ccids dccp: tfrc: fix doc warnings in tfrc_equation.c 2021-06-10 14:08:49 -07:00
ackvec.c net: dccp: Fix most of the kerneldoc warnings 2020-10-30 12:08:54 -07:00
ackvec.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
ccid.c net: dccp: Add __printf() markup to fix -Wsuggest-attribute=format 2020-10-30 11:31:46 -07:00
ccid.h net: dccp: Replace zero-length array with flexible-array member 2020-02-28 12:08:37 -08:00
dccp.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
diag.c inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->data 2020-02-27 18:50:19 -08:00
feat.c dccp: Return the correct errno code 2021-02-06 11:15:28 -08:00
feat.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
input.c net: dccp: Convert to use the preferred fallthrough macro 2020-08-22 12:38:34 -07:00
ipv4.c net: Add a bhash2 table hashed by port and address 2022-08-24 19:30:07 -07:00
ipv6.c net: Add a bhash2 table hashed by port and address 2022-08-24 19:30:07 -07:00
ipv6.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
Kconfig dccp: Replace HTTP links with HTTPS ones 2020-07-13 11:54:07 -07:00
Makefile net: dccp: Remove dccpprobe module 2018-01-02 14:27:30 -05:00
minisocks.c tcp: allocate tcp_death_row outside of struct netns_ipv4 2022-01-26 19:00:31 -08:00
options.c net: dccp: Convert to use the preferred fallthrough macro 2020-08-22 12:38:34 -07:00
output.c net: dccp: Fix most of the kerneldoc warnings 2020-10-30 12:08:54 -07:00
proto.c net: Add a bhash2 table hashed by port and address 2022-08-24 19:30:07 -07:00
qpolicy.c net: dccp: Fix most of the kerneldoc warnings 2020-10-30 12:08:54 -07:00
sysctl.c proc/sysctl: add shared variables for range check 2019-07-18 17:08:07 -07:00
timer.c net: sock: introduce sk_error_report 2021-06-29 11:28:21 -07:00
trace.h net: dccp: Use memset_startat() for TP zeroing 2021-11-19 11:22:49 +00:00