2009-08-21 20:28:31 +08:00
|
|
|
/*
|
2018-07-24 11:51:22 +08:00
|
|
|
* Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
|
2009-08-21 20:28:31 +08:00
|
|
|
*
|
|
|
|
* This software is available to you under a choice of one of two
|
|
|
|
* licenses. You may choose to be licensed under the terms of the GNU
|
|
|
|
* General Public License (GPL) Version 2, available from the file
|
|
|
|
* COPYING in the main directory of this source tree, or the
|
|
|
|
* OpenIB.org BSD license below:
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or
|
|
|
|
* without modification, are permitted provided that the following
|
|
|
|
* conditions are met:
|
|
|
|
*
|
|
|
|
* - Redistributions of source code must retain the above
|
|
|
|
* copyright notice, this list of conditions and the following
|
|
|
|
* disclaimer.
|
|
|
|
*
|
|
|
|
* - Redistributions in binary form must reproduce the above
|
|
|
|
* copyright notice, this list of conditions and the following
|
|
|
|
* disclaimer in the documentation and/or other materials
|
|
|
|
* provided with the distribution.
|
|
|
|
*
|
|
|
|
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
|
|
|
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
|
|
|
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
|
|
|
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
|
|
|
|
* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
|
|
|
|
* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
|
|
|
|
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
|
|
* SOFTWARE.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
#include <linux/kernel.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2009-08-21 20:28:31 +08:00
|
|
|
#include <linux/in.h>
|
2011-05-27 21:12:25 +08:00
|
|
|
#include <linux/module.h>
|
2009-08-21 20:28:31 +08:00
|
|
|
#include <net/tcp.h>
|
2015-08-05 13:43:26 +08:00
|
|
|
#include <net/net_namespace.h>
|
|
|
|
#include <net/netns/generic.h>
|
2018-07-24 11:51:21 +08:00
|
|
|
#include <net/addrconf.h>
|
2009-08-21 20:28:31 +08:00
|
|
|
|
|
|
|
#include "rds.h"
|
|
|
|
#include "tcp.h"
|
|
|
|
|
|
|
|
/* only for info exporting */
|
|
|
|
static DEFINE_SPINLOCK(rds_tcp_tc_list_lock);
|
|
|
|
static LIST_HEAD(rds_tcp_tc_list);
|
2018-07-24 11:51:22 +08:00
|
|
|
|
|
|
|
/* rds_tcp_tc_count counts only IPv4 connections.
|
|
|
|
* rds6_tcp_tc_count counts both IPv4 and IPv6 connections.
|
|
|
|
*/
|
2010-10-19 16:08:33 +08:00
|
|
|
static unsigned int rds_tcp_tc_count;
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:22 +08:00
|
|
|
static unsigned int rds6_tcp_tc_count;
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2009-08-21 20:28:31 +08:00
|
|
|
|
|
|
|
/* Track rds_tcp_connection structs so they can be cleaned up */
|
|
|
|
static DEFINE_SPINLOCK(rds_tcp_conn_lock);
|
|
|
|
static LIST_HEAD(rds_tcp_conn_list);
|
2018-02-03 20:26:51 +08:00
|
|
|
static atomic_t rds_tcp_unloading = ATOMIC_INIT(0);
|
2009-08-21 20:28:31 +08:00
|
|
|
|
|
|
|
static struct kmem_cache *rds_tcp_conn_slab;
|
|
|
|
|
2016-03-17 02:38:12 +08:00
|
|
|
static int rds_tcp_skbuf_handler(struct ctl_table *ctl, int write,
|
2020-04-24 14:43:38 +08:00
|
|
|
void *buffer, size_t *lenp, loff_t *fpos);
|
2016-03-17 02:38:12 +08:00
|
|
|
|
2016-06-18 02:12:46 +08:00
|
|
|
static int rds_tcp_min_sndbuf = SOCK_MIN_SNDBUF;
|
|
|
|
static int rds_tcp_min_rcvbuf = SOCK_MIN_RCVBUF;
|
2016-03-17 02:38:12 +08:00
|
|
|
|
|
|
|
static struct ctl_table rds_tcp_sysctl_table[] = {
|
|
|
|
#define RDS_TCP_SNDBUF 0
|
|
|
|
{
|
|
|
|
.procname = "rds_tcp_sndbuf",
|
|
|
|
/* data is per-net pointer */
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = rds_tcp_skbuf_handler,
|
|
|
|
.extra1 = &rds_tcp_min_sndbuf,
|
|
|
|
},
|
|
|
|
#define RDS_TCP_RCVBUF 1
|
|
|
|
{
|
|
|
|
.procname = "rds_tcp_rcvbuf",
|
|
|
|
/* data is per-net pointer */
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = rds_tcp_skbuf_handler,
|
|
|
|
.extra1 = &rds_tcp_min_rcvbuf,
|
|
|
|
},
|
|
|
|
{ }
|
|
|
|
};
|
|
|
|
|
2018-01-19 05:11:07 +08:00
|
|
|
u32 rds_tcp_write_seq(struct rds_tcp_connection *tc)
|
2009-08-21 20:28:31 +08:00
|
|
|
{
|
2018-01-19 05:11:07 +08:00
|
|
|
/* seq# of the last byte of data in tcp send buffer */
|
|
|
|
return tcp_sk(tc->t_sock->sk)->write_seq;
|
2009-08-21 20:28:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
u32 rds_tcp_snd_una(struct rds_tcp_connection *tc)
|
|
|
|
{
|
|
|
|
return tcp_sk(tc->t_sock->sk)->snd_una;
|
|
|
|
}
|
|
|
|
|
|
|
|
void rds_tcp_restore_callbacks(struct socket *sock,
|
|
|
|
struct rds_tcp_connection *tc)
|
|
|
|
{
|
|
|
|
rdsdebug("restoring sock %p callbacks from tc %p\n", sock, tc);
|
|
|
|
write_lock_bh(&sock->sk->sk_callback_lock);
|
|
|
|
|
|
|
|
/* done under the callback_lock to serialize with write_space */
|
|
|
|
spin_lock(&rds_tcp_tc_list_lock);
|
|
|
|
list_del_init(&tc->t_list_item);
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:22 +08:00
|
|
|
rds6_tcp_tc_count--;
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2018-07-24 11:51:22 +08:00
|
|
|
if (!tc->t_cpath->cp_conn->c_isv6)
|
|
|
|
rds_tcp_tc_count--;
|
2009-08-21 20:28:31 +08:00
|
|
|
spin_unlock(&rds_tcp_tc_list_lock);
|
|
|
|
|
|
|
|
tc->t_sock = NULL;
|
|
|
|
|
|
|
|
sock->sk->sk_write_space = tc->t_orig_write_space;
|
|
|
|
sock->sk->sk_data_ready = tc->t_orig_data_ready;
|
|
|
|
sock->sk->sk_state_change = tc->t_orig_state_change;
|
|
|
|
sock->sk->sk_user_data = NULL;
|
|
|
|
|
|
|
|
write_unlock_bh(&sock->sk->sk_callback_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2016-06-05 04:59:58 +08:00
|
|
|
* rds_tcp_reset_callbacks() switches the tc to the new sock and
|
|
|
|
* releases the existing tc->t_sock.
|
|
|
|
*
|
|
|
|
* The only functions that set tc->t_sock are rds_tcp_set_callbacks
|
|
|
|
* and rds_tcp_reset_callbacks. Send and receive trust that
|
|
|
|
* it is set. The absence of RDS_CONN_UP bit protects those paths
|
|
|
|
* from being called while it isn't set.
|
|
|
|
*/
|
|
|
|
void rds_tcp_reset_callbacks(struct socket *sock,
|
2016-07-01 07:11:14 +08:00
|
|
|
struct rds_conn_path *cp)
|
2016-06-05 04:59:58 +08:00
|
|
|
{
|
2016-07-01 07:11:14 +08:00
|
|
|
struct rds_tcp_connection *tc = cp->cp_transport_data;
|
2016-06-05 04:59:58 +08:00
|
|
|
struct socket *osock = tc->t_sock;
|
|
|
|
|
|
|
|
if (!osock)
|
|
|
|
goto newsock;
|
|
|
|
|
|
|
|
/* Need to resolve a duelling SYN between peers.
|
|
|
|
* We have an outstanding SYN to this peer, which may
|
|
|
|
* potentially have transitioned to the RDS_CONN_UP state,
|
|
|
|
* so we must quiesce any send threads before resetting
|
2016-07-01 07:11:14 +08:00
|
|
|
* cp_transport_data. We quiesce these threads by setting
|
|
|
|
* cp_state to something other than RDS_CONN_UP, and then
|
2016-06-05 04:59:58 +08:00
|
|
|
* waiting for any existing threads in rds_send_xmit to
|
|
|
|
* complete release_in_xmit(). (Subsequent threads entering
|
|
|
|
* rds_send_xmit() will bail on !rds_conn_up().)
|
2016-06-05 05:00:00 +08:00
|
|
|
*
|
|
|
|
* However an incoming syn-ack at this point would end up
|
|
|
|
* marking the conn as RDS_CONN_UP, and would again permit
|
|
|
|
* rds_send_xmit() threads through, so ideally we would
|
|
|
|
* synchronize on RDS_CONN_UP after lock_sock(), but cannot
|
|
|
|
* do that: waiting on !RDS_IN_XMIT after lock_sock() may
|
|
|
|
* end up deadlocking with tcp_sendmsg(), and the RDS_IN_XMIT
|
|
|
|
* would not get set. As a result, we set cp_state to
|
|
|
|
* RDS_CONN_RESETTING, to ensure that rds_tcp_state_change
|
|
|
|
* cannot mark rds_conn_path_up() in the window before lock_sock()
|
2016-06-05 04:59:58 +08:00
|
|
|
*/
|
2016-07-01 07:11:14 +08:00
|
|
|
atomic_set(&cp->cp_state, RDS_CONN_RESETTING);
|
|
|
|
wait_event(cp->cp_waitq, !test_bit(RDS_IN_XMIT, &cp->cp_flags));
|
2016-06-05 04:59:58 +08:00
|
|
|
/* reset receive side state for rds_tcp_data_recv() for osock */
|
2016-07-14 18:51:02 +08:00
|
|
|
cancel_delayed_work_sync(&cp->cp_send_w);
|
|
|
|
cancel_delayed_work_sync(&cp->cp_recv_w);
|
2022-09-28 23:25:37 +08:00
|
|
|
lock_sock(osock->sk);
|
2016-06-05 04:59:58 +08:00
|
|
|
if (tc->t_tinc) {
|
|
|
|
rds_inc_put(&tc->t_tinc->ti_inc);
|
|
|
|
tc->t_tinc = NULL;
|
|
|
|
}
|
|
|
|
tc->t_tinc_hdr_rem = sizeof(struct rds_header);
|
|
|
|
tc->t_tinc_data_rem = 0;
|
2016-07-14 18:51:02 +08:00
|
|
|
rds_tcp_restore_callbacks(osock, tc);
|
2016-06-05 04:59:58 +08:00
|
|
|
release_sock(osock->sk);
|
|
|
|
sock_release(osock);
|
|
|
|
newsock:
|
2016-07-01 07:11:14 +08:00
|
|
|
rds_send_path_reset(cp);
|
2016-06-05 04:59:58 +08:00
|
|
|
lock_sock(sock->sk);
|
2016-07-14 18:51:02 +08:00
|
|
|
rds_tcp_set_callbacks(sock, cp);
|
2016-06-05 04:59:58 +08:00
|
|
|
release_sock(sock->sk);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Add tc to rds_tcp_tc_list and set tc->t_sock. See comments
|
|
|
|
* above rds_tcp_reset_callbacks for notes about synchronization
|
|
|
|
* with data path
|
2009-08-21 20:28:31 +08:00
|
|
|
*/
|
2016-07-01 07:11:14 +08:00
|
|
|
void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
|
2009-08-21 20:28:31 +08:00
|
|
|
{
|
2016-07-01 07:11:14 +08:00
|
|
|
struct rds_tcp_connection *tc = cp->cp_transport_data;
|
2009-08-21 20:28:31 +08:00
|
|
|
|
|
|
|
rdsdebug("setting sock %p callbacks to tc %p\n", sock, tc);
|
|
|
|
write_lock_bh(&sock->sk->sk_callback_lock);
|
|
|
|
|
|
|
|
/* done under the callback_lock to serialize with write_space */
|
|
|
|
spin_lock(&rds_tcp_tc_list_lock);
|
|
|
|
list_add_tail(&tc->t_list_item, &rds_tcp_tc_list);
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:22 +08:00
|
|
|
rds6_tcp_tc_count++;
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2018-07-24 11:51:22 +08:00
|
|
|
if (!tc->t_cpath->cp_conn->c_isv6)
|
|
|
|
rds_tcp_tc_count++;
|
2009-08-21 20:28:31 +08:00
|
|
|
spin_unlock(&rds_tcp_tc_list_lock);
|
|
|
|
|
|
|
|
/* accepted sockets need our listen data ready undone */
|
|
|
|
if (sock->sk->sk_data_ready == rds_tcp_listen_data_ready)
|
|
|
|
sock->sk->sk_data_ready = sock->sk->sk_user_data;
|
|
|
|
|
|
|
|
tc->t_sock = sock;
|
2016-07-01 07:11:14 +08:00
|
|
|
tc->t_cpath = cp;
|
2009-08-21 20:28:31 +08:00
|
|
|
tc->t_orig_data_ready = sock->sk->sk_data_ready;
|
|
|
|
tc->t_orig_write_space = sock->sk->sk_write_space;
|
|
|
|
tc->t_orig_state_change = sock->sk->sk_state_change;
|
|
|
|
|
2016-07-01 07:11:14 +08:00
|
|
|
sock->sk->sk_user_data = cp;
|
2009-08-21 20:28:31 +08:00
|
|
|
sock->sk->sk_data_ready = rds_tcp_data_ready;
|
|
|
|
sock->sk->sk_write_space = rds_tcp_write_space;
|
|
|
|
sock->sk->sk_state_change = rds_tcp_state_change;
|
|
|
|
|
|
|
|
write_unlock_bh(&sock->sk->sk_callback_lock);
|
|
|
|
}
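
The save-and-swap pattern used by rds_tcp_set_callbacks() and rds_tcp_restore_callbacks() can be sketched in user space as follows. This is only an illustration: struct fake_sock, struct fake_conn and every function name below are hypothetical stand-ins, only one callback is shown, and the sk_callback_lock and rds_tcp_tc_list_lock locking of the real code is deliberately left out.

```c
#include <stddef.h>

typedef void (*cb_fn)(void *owner);

struct fake_sock {		/* hypothetical stand-in for struct sock */
	cb_fn data_ready;
	void *user_data;
};

struct fake_conn {		/* hypothetical stand-in for rds_tcp_connection */
	struct fake_sock *sock;
	cb_fn orig_data_ready;
	int events;
};

/* The callback the socket had before we took it over. */
static void default_data_ready(void *owner)
{
	(void)owner;
}

/* Our replacement callback: counts events on the owning connection. */
static void conn_data_ready(void *owner)
{
	struct fake_conn *c = owner;

	c->events++;
}

/* Remember the socket's current callback, then install ours. */
static void set_callbacks(struct fake_sock *sk, struct fake_conn *c)
{
	c->sock = sk;
	c->orig_data_ready = sk->data_ready;
	sk->user_data = c;
	sk->data_ready = conn_data_ready;
}

/* Put the saved callback back and detach the connection. */
static void restore_callbacks(struct fake_sock *sk, struct fake_conn *c)
{
	c->sock = NULL;
	sk->data_ready = c->orig_data_ready;
	sk->user_data = NULL;
}
```

Saving the original pointer before overwriting it is what lets the restore step hand the socket back in the state it was found.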
|
|
|
|
|
2018-07-24 11:51:22 +08:00
|
|
|
/* Handle RDS_INFO_TCP_SOCKETS socket option. It only returns IPv4
|
|
|
|
* connections for backward compatibility.
|
|
|
|
*/
|
2016-11-05 01:04:11 +08:00
|
|
|
static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
|
2009-08-21 20:28:31 +08:00
|
|
|
struct rds_info_iterator *iter,
|
|
|
|
struct rds_info_lengths *lens)
|
|
|
|
{
|
|
|
|
struct rds_info_tcp_socket tsinfo;
|
|
|
|
struct rds_tcp_connection *tc;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&rds_tcp_tc_list_lock, flags);
|
|
|
|
|
|
|
|
if (len / sizeof(tsinfo) < rds_tcp_tc_count)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
list_for_each_entry(tc, &rds_tcp_tc_list, t_list_item) {
|
2018-07-24 11:51:22 +08:00
|
|
|
struct inet_sock *inet = inet_sk(tc->t_sock->sk);
|
2009-08-21 20:28:31 +08:00
|
|
|
|
2018-07-24 11:51:22 +08:00
|
|
|
if (tc->t_cpath->cp_conn->c_isv6)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
tsinfo.local_addr = inet->inet_saddr;
|
|
|
|
tsinfo.local_port = inet->inet_sport;
|
|
|
|
tsinfo.peer_addr = inet->inet_daddr;
|
|
|
|
tsinfo.peer_port = inet->inet_dport;
|
2009-08-21 20:28:31 +08:00
|
|
|
|
|
|
|
tsinfo.hdr_rem = tc->t_tinc_hdr_rem;
|
|
|
|
tsinfo.data_rem = tc->t_tinc_data_rem;
|
|
|
|
tsinfo.last_sent_nxt = tc->t_last_sent_nxt;
|
|
|
|
tsinfo.last_expected_una = tc->t_last_expected_una;
|
|
|
|
tsinfo.last_seen_una = tc->t_last_seen_una;
|
2018-10-24 11:21:14 +08:00
|
|
|
tsinfo.tos = tc->t_cpath->cp_conn->c_tos;
|
2009-08-21 20:28:31 +08:00
|
|
|
|
|
|
|
rds_info_copy(iter, &tsinfo, sizeof(tsinfo));
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
lens->nr = rds_tcp_tc_count;
|
|
|
|
lens->each = sizeof(tsinfo);
|
|
|
|
|
|
|
|
spin_unlock_irqrestore(&rds_tcp_tc_list_lock, flags);
|
|
|
|
}
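
The length handling in rds_tcp_tc_info() follows an all-or-nothing export pattern: records are copied only when the caller's buffer can hold every one of them, but the required count and record size are always reported. A minimal user-space sketch of that idea, with hypothetical types and names (struct item, struct lengths, export_items) and no locking:

```c
#include <stddef.h>
#include <string.h>

struct item {
	int value;
};

struct lengths {
	unsigned int nr;	/* how many records exist */
	unsigned int each;	/* size of one record in bytes */
};

/* Copy all items into buf only if it can hold every one of them.
 * Always fill in lens so the caller can size a retry buffer.
 * Returns the number of items actually copied.
 */
static unsigned int export_items(const struct item *items, unsigned int count,
				 void *buf, size_t len, struct lengths *lens)
{
	unsigned int copied = 0;

	if (count > 0 && len / sizeof(*items) >= count) {
		memcpy(buf, items, count * sizeof(*items));
		copied = count;
	}

	lens->nr = count;
	lens->each = sizeof(*items);
	return copied;
}
```

A caller that receives zero copied records can allocate nr * each bytes and call again.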
|
|
|
|
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:23 +08:00
|
|
|
/* Handle RDS6_INFO_TCP_SOCKETS socket option. It returns both IPv4 and
|
|
|
|
* IPv6 connections. IPv4 connection address is returned in an IPv4 mapped
|
|
|
|
* address.
|
|
|
|
*/
|
|
|
|
static void rds6_tcp_tc_info(struct socket *sock, unsigned int len,
|
|
|
|
struct rds_info_iterator *iter,
|
|
|
|
struct rds_info_lengths *lens)
|
|
|
|
{
|
|
|
|
struct rds6_info_tcp_socket tsinfo6;
|
|
|
|
struct rds_tcp_connection *tc;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&rds_tcp_tc_list_lock, flags);
|
|
|
|
|
|
|
|
if (len / sizeof(tsinfo6) < rds6_tcp_tc_count)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
list_for_each_entry(tc, &rds_tcp_tc_list, t_list_item) {
|
|
|
|
struct sock *sk = tc->t_sock->sk;
|
|
|
|
struct inet_sock *inet = inet_sk(sk);
|
|
|
|
|
|
|
|
tsinfo6.local_addr = sk->sk_v6_rcv_saddr;
|
|
|
|
tsinfo6.local_port = inet->inet_sport;
|
|
|
|
tsinfo6.peer_addr = sk->sk_v6_daddr;
|
|
|
|
tsinfo6.peer_port = inet->inet_dport;
|
|
|
|
|
|
|
|
tsinfo6.hdr_rem = tc->t_tinc_hdr_rem;
|
|
|
|
tsinfo6.data_rem = tc->t_tinc_data_rem;
|
|
|
|
tsinfo6.last_sent_nxt = tc->t_last_sent_nxt;
|
|
|
|
tsinfo6.last_expected_una = tc->t_last_expected_una;
|
|
|
|
tsinfo6.last_seen_una = tc->t_last_seen_una;
|
|
|
|
|
|
|
|
rds_info_copy(iter, &tsinfo6, sizeof(tsinfo6));
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
lens->nr = rds6_tcp_tc_count;
|
|
|
|
lens->each = sizeof(tsinfo6);
|
|
|
|
|
|
|
|
spin_unlock_irqrestore(&rds_tcp_tc_list_lock, flags);
|
|
|
|
}
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2018-07-24 11:51:23 +08:00
|
|
|
|
2021-05-22 02:08:06 +08:00
|
|
|
int rds_tcp_laddr_check(struct net *net, const struct in6_addr *addr,
|
|
|
|
__u32 scope_id)
|
2009-08-21 20:28:31 +08:00
|
|
|
{
|
2018-07-24 11:51:21 +08:00
|
|
|
struct net_device *dev = NULL;
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:21 +08:00
|
|
|
int ret;
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2018-07-24 11:51:21 +08:00
|
|
|
|
|
|
|
if (ipv6_addr_v4mapped(addr)) {
|
|
|
|
if (inet_addr_type(net, addr->s6_addr32[3]) == RTN_LOCAL)
|
|
|
|
return 0;
|
|
|
|
return -EADDRNOTAVAIL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If the scope_id is specified, check only those addresses
|
|
|
|
* hosted on the specified interface.
|
|
|
|
*/
|
|
|
|
if (scope_id != 0) {
|
|
|
|
rcu_read_lock();
|
|
|
|
dev = dev_get_by_index_rcu(net, scope_id);
|
|
|
|
/* scope_id is not valid... */
|
|
|
|
if (!dev) {
|
|
|
|
rcu_read_unlock();
|
|
|
|
return -EADDRNOTAVAIL;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:21 +08:00
|
|
|
ret = ipv6_chk_addr(net, addr, dev, 0);
|
|
|
|
if (ret)
|
2009-08-21 20:28:31 +08:00
|
|
|
return 0;
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2009-08-21 20:28:31 +08:00
|
|
|
return -EADDRNOTAVAIL;
|
|
|
|
}
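
The first branch of rds_tcp_laddr_check() treats an IPv4-mapped IPv6 address as an IPv4 address by looking at its last 32 bits. A user-space sketch of that extraction, using the POSIX IN6_IS_ADDR_V4MAPPED macro in place of the kernel's ipv6_addr_v4mapped(); the helper name mapped_v4_addr is hypothetical, and the local-address lookup itself is not reproduced here:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: if addr is an IPv4-mapped IPv6 address
 * (::ffff:a.b.c.d), copy the embedded IPv4 address, in network byte
 * order, into *v4 and return true. Otherwise return false.
 */
static bool mapped_v4_addr(const struct in6_addr *addr, struct in_addr *v4)
{
	if (!IN6_IS_ADDR_V4MAPPED(addr))
		return false;

	/* The IPv4 address occupies the last 4 of the 16 bytes. */
	memcpy(&v4->s_addr, &addr->s6_addr[12], sizeof(v4->s_addr));
	return true;
}
```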
|
|
|
|
|
2017-12-23 01:39:01 +08:00
|
|
|
static void rds_tcp_conn_free(void *arg)
|
|
|
|
{
|
|
|
|
struct rds_tcp_connection *tc = arg;
|
2018-03-15 18:54:26 +08:00
|
|
|
unsigned long flags;
|
2017-12-23 01:39:01 +08:00
|
|
|
|
|
|
|
rdsdebug("freeing tc %p\n", tc);
|
|
|
|
|
2018-03-15 18:54:26 +08:00
|
|
|
spin_lock_irqsave(&rds_tcp_conn_lock, flags);
|
2017-12-23 01:39:01 +08:00
|
|
|
if (!tc->t_tcp_node_detached)
|
|
|
|
list_del(&tc->t_tcp_node);
|
2018-03-15 18:54:26 +08:00
|
|
|
spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
|
2017-12-23 01:39:01 +08:00
|
|
|
|
|
|
|
kmem_cache_free(rds_tcp_conn_slab, tc);
|
|
|
|
}
|
|
|
|
|
2009-08-21 20:28:31 +08:00
|
|
|
static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
|
|
|
|
{
|
|
|
|
struct rds_tcp_connection *tc;
|
2017-12-23 01:39:01 +08:00
|
|
|
int i, j;
|
|
|
|
int ret = 0;
|
2009-08-21 20:28:31 +08:00
|
|
|
|
2016-07-01 07:11:12 +08:00
|
|
|
for (i = 0; i < RDS_MPATH_WORKERS; i++) {
|
|
|
|
tc = kmem_cache_alloc(rds_tcp_conn_slab, gfp);
|
2017-12-23 01:39:01 +08:00
|
|
|
if (!tc) {
|
|
|
|
ret = -ENOMEM;
|
2018-02-03 20:26:51 +08:00
|
|
|
goto fail;
|
2017-12-23 01:39:01 +08:00
|
|
|
}
|
2016-07-01 07:11:12 +08:00
|
|
|
mutex_init(&tc->t_conn_path_lock);
|
|
|
|
tc->t_sock = NULL;
|
|
|
|
tc->t_tinc = NULL;
|
|
|
|
tc->t_tinc_hdr_rem = sizeof(struct rds_header);
|
|
|
|
tc->t_tinc_data_rem = 0;
|
2009-08-21 20:28:31 +08:00
|
|
|
|
2016-07-01 07:11:12 +08:00
|
|
|
conn->c_path[i].cp_transport_data = tc;
|
|
|
|
tc->t_cpath = &conn->c_path[i];
|
2018-02-03 20:26:51 +08:00
|
|
|
tc->t_tcp_node_detached = true;
|
2009-08-21 20:28:31 +08:00
|
|
|
|
2016-07-01 07:11:12 +08:00
|
|
|
rdsdebug("rds_conn_path [%d] tc %p\n", i,
|
|
|
|
conn->c_path[i].cp_transport_data);
|
|
|
|
}
|
2018-03-15 18:54:26 +08:00
|
|
|
spin_lock_irq(&rds_tcp_conn_lock);
|
2018-02-03 20:26:51 +08:00
|
|
|
for (i = 0; i < RDS_MPATH_WORKERS; i++) {
|
|
|
|
tc = conn->c_path[i].cp_transport_data;
|
|
|
|
tc->t_tcp_node_detached = false;
|
|
|
|
list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
|
|
|
|
}
|
2018-03-15 18:54:26 +08:00
|
|
|
spin_unlock_irq(&rds_tcp_conn_lock);
|
2018-02-03 20:26:51 +08:00
|
|
|
fail:
|
2017-12-23 01:39:01 +08:00
|
|
|
if (ret) {
|
|
|
|
for (j = 0; j < i; j++)
|
|
|
|
rds_tcp_conn_free(conn->c_path[j].cp_transport_data);
|
|
|
|
}
|
|
|
|
return ret;
|
2009-08-21 20:28:31 +08:00
|
|
|
}
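
rds_tcp_conn_alloc() allocates one object per path and, if any allocation fails, frees the ones already allocated before returning the error. The same unwind can be sketched in user space as below. NPATHS, struct path_data, alloc_all_paths and budget_alloc are hypothetical names; malloc/free stand in for the slab cache, and the list insertion under rds_tcp_conn_lock is omitted.

```c
#include <stdlib.h>

#define NPATHS 8	/* illustrative stand-in for RDS_MPATH_WORKERS */

struct path_data {
	int index;
};

/* Allocator that succeeds a fixed number of times, then returns NULL.
 * It exists only so the failure path can be exercised.
 */
static int budget = 3;

static void *budget_alloc(size_t size)
{
	if (budget <= 0)
		return NULL;
	budget--;
	return malloc(size);
}

/* Allocate one path_data per slot using the supplied allocator.
 * On failure, free the slots already filled, leave every slot NULL
 * and return -1. Returns 0 on success.
 */
static int alloc_all_paths(struct path_data *slots[NPATHS],
			   void *(*alloc)(size_t))
{
	int i, j;

	for (i = 0; i < NPATHS; i++)
		slots[i] = NULL;

	for (i = 0; i < NPATHS; i++) {
		slots[i] = alloc(sizeof(*slots[i]));
		if (!slots[i])
			goto fail;
		slots[i]->index = i;
	}
	return 0;

fail:
	/* i is the slot that failed; only slots below it hold memory. */
	for (j = 0; j < i; j++) {
		free(slots[j]);
		slots[j] = NULL;
	}
	return -1;
}
```

Freeing only indices below the failing one mirrors the `for (j = 0; j < i; j++)` loop in the function above.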
|
|
|
|
|
2016-07-01 07:11:13 +08:00
|
|
|
static bool list_has_conn(struct list_head *list, struct rds_connection *conn)
|
|
|
|
{
|
|
|
|
struct rds_tcp_connection *tc, *_tc;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(tc, _tc, list, t_tcp_node) {
|
|
|
|
if (tc->t_cpath->cp_conn == conn)
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2018-02-03 20:26:51 +08:00
|
|
|
static void rds_tcp_set_unloading(void)
|
|
|
|
{
|
|
|
|
atomic_set(&rds_tcp_unloading, 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool rds_tcp_is_unloading(struct rds_connection *conn)
|
|
|
|
{
|
|
|
|
return atomic_read(&rds_tcp_unloading) != 0;
|
|
|
|
}
|
|
|
|
|
2009-08-21 20:28:31 +08:00
|
|
|
static void rds_tcp_destroy_conns(void)
|
|
|
|
{
|
|
|
|
struct rds_tcp_connection *tc, *_tc;
|
|
|
|
LIST_HEAD(tmp_list);
|
|
|
|
|
|
|
|
/* avoid calling conn_destroy with irqs off */
|
|
|
|
spin_lock_irq(&rds_tcp_conn_lock);
|
2016-07-01 07:11:13 +08:00
|
|
|
list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
|
|
|
|
if (!list_has_conn(&tmp_list, tc->t_cpath->cp_conn))
|
|
|
|
list_move_tail(&tc->t_tcp_node, &tmp_list);
|
|
|
|
}
|
2009-08-21 20:28:31 +08:00
|
|
|
spin_unlock_irq(&rds_tcp_conn_lock);
|
|
|
|
|
2016-07-01 07:11:11 +08:00
|
|
|
list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
|
2016-07-01 07:11:12 +08:00
|
|
|
rds_conn_destroy(tc->t_cpath->cp_conn);
|
2009-08-21 20:28:31 +08:00
|
|
|
}
|
|
|
|
|
2015-08-05 13:43:26 +08:00
|
|
|
static void rds_tcp_exit(void);
|
2009-08-21 20:28:31 +08:00
|
|
|
|
2018-10-13 21:36:49 +08:00
|
|
|
static u8 rds_tcp_get_tos_map(u8 tos)
|
|
|
|
{
|
|
|
|
/* all user tos mapped to default 0 for TCP transport */
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-08-21 20:28:31 +08:00
|
|
|
struct rds_transport rds_tcp_transport = {
|
|
|
|
.laddr_check = rds_tcp_laddr_check,
|
2016-07-01 07:11:10 +08:00
|
|
|
.xmit_path_prepare = rds_tcp_xmit_path_prepare,
|
|
|
|
.xmit_path_complete = rds_tcp_xmit_path_complete,
|
2009-08-21 20:28:31 +08:00
|
|
|
.xmit = rds_tcp_xmit,
|
2016-07-01 07:11:15 +08:00
|
|
|
.recv_path = rds_tcp_recv_path,
|
2009-08-21 20:28:31 +08:00
|
|
|
.conn_alloc = rds_tcp_conn_alloc,
|
|
|
|
.conn_free = rds_tcp_conn_free,
|
2016-07-01 07:11:16 +08:00
|
|
|
.conn_path_connect = rds_tcp_conn_path_connect,
|
2016-07-01 07:11:10 +08:00
|
|
|
.conn_path_shutdown = rds_tcp_conn_path_shutdown,
|
2009-08-21 20:28:31 +08:00
|
|
|
.inc_copy_to_user = rds_tcp_inc_copy_to_user,
|
|
|
|
.inc_free = rds_tcp_inc_free,
|
|
|
|
.stats_info_copy = rds_tcp_stats_info_copy,
|
|
|
|
.exit = rds_tcp_exit,
|
2018-10-13 21:36:49 +08:00
|
|
|
.get_tos_map = rds_tcp_get_tos_map,
|
2009-08-21 20:28:31 +08:00
|
|
|
.t_owner = THIS_MODULE,
|
|
|
|
.t_name = "tcp",
|
2009-08-21 20:28:34 +08:00
|
|
|
.t_type = RDS_TRANS_TCP,
|
2009-08-21 20:28:31 +08:00
|
|
|
.t_prefer_loopback = 1,
|
2016-07-14 18:51:03 +08:00
|
|
|
.t_mp_capable = 1,
|
2018-02-03 20:26:51 +08:00
|
|
|
.t_unloading = rds_tcp_is_unloading,
|
2009-08-21 20:28:31 +08:00
|
|
|
};
|
|
|
|
|
netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into a zero-based array and
thus is an unsigned entity. Using a negative value is an out-of-bounds
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preferred to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a seemingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 09:58:21 +08:00
|
|
|
static unsigned int rds_tcp_netid;
|
2015-08-05 13:43:26 +08:00
|
|
|
|
|
|
|
/* per-network namespace private data for this module */
|
|
|
|
struct rds_tcp_net {
|
|
|
|
struct socket *rds_tcp_listen_sock;
|
|
|
|
struct work_struct rds_tcp_accept_w;
|
2016-03-17 02:38:12 +08:00
|
|
|
struct ctl_table_header *rds_tcp_sysctl;
|
|
|
|
struct ctl_table *ctl_table;
|
|
|
|
int sndbuf_size;
|
|
|
|
int rcvbuf_size;
|
2015-08-05 13:43:26 +08:00
|
|
|
};
|
|
|
|
|
2016-03-17 02:38:12 +08:00
|
|
|
/* All module specific customizations to the RDS-TCP socket should be done in
|
|
|
|
* rds_tcp_tune() and applied after socket creation.
|
|
|
|
*/
|
2022-05-05 09:53:53 +08:00
|
|
|
bool rds_tcp_tune(struct socket *sock)
|
2016-03-17 02:38:12 +08:00
|
|
|
{
|
|
|
|
struct sock *sk = sock->sk;
|
|
|
|
struct net *net = sock_net(sk);
|
2022-05-05 09:53:53 +08:00
|
|
|
struct rds_tcp_net *rtn;
|
2016-03-17 02:38:12 +08:00
|
|
|
|
2020-05-28 13:12:19 +08:00
|
|
|
tcp_sock_set_nodelay(sock->sk);
|
2016-03-17 02:38:12 +08:00
|
|
|
lock_sock(sk);
|
2022-05-02 09:40:18 +08:00
|
|
|
/* TCP timer functions might access net namespace even after
|
|
|
|
* a process which created this net namespace terminated.
|
|
|
|
*/
|
|
|
|
if (!sk->sk_net_refcnt) {
|
2022-05-05 09:53:53 +08:00
|
|
|
if (!maybe_get_net(net)) {
|
|
|
|
release_sock(sk);
|
|
|
|
return false;
|
|
|
|
}
|
2022-10-21 07:20:18 +08:00
|
|
|
/* Update ns_tracker to current stack trace and refcounted tracker */
|
|
|
|
__netns_tracker_free(net, &sk->ns_tracker, false);
|
|
|
|
|
2022-05-02 09:40:18 +08:00
|
|
|
sk->sk_net_refcnt = 1;
|
2022-05-05 09:53:53 +08:00
|
|
|
netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
|
2022-05-02 09:40:18 +08:00
|
|
|
sock_inuse_add(net, 1);
|
|
|
|
}
|
2022-05-05 09:53:53 +08:00
|
|
|
rtn = net_generic(net, rds_tcp_netid);
|
2016-03-17 02:38:12 +08:00
|
|
|
if (rtn->sndbuf_size > 0) {
|
|
|
|
sk->sk_sndbuf = rtn->sndbuf_size;
|
|
|
|
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
|
|
|
|
}
|
|
|
|
if (rtn->rcvbuf_size > 0) {
|
2021-12-01 22:45:22 +08:00
|
|
|
sk->sk_rcvbuf = rtn->rcvbuf_size;
|
2016-03-17 02:38:12 +08:00
|
|
|
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
|
|
|
|
}
|
|
|
|
release_sock(sk);
|
2022-05-05 09:53:53 +08:00
|
|
|
return true;
|
2016-03-17 02:38:12 +08:00
|
|
|
}
|
|
|
|
|
2015-08-05 13:43:26 +08:00
|
|
|
static void rds_tcp_accept_worker(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct rds_tcp_net *rtn = container_of(work,
|
|
|
|
struct rds_tcp_net,
|
|
|
|
rds_tcp_accept_w);
|
|
|
|
|
|
|
|
while (rds_tcp_accept_one(rtn->rds_tcp_listen_sock) == 0)
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
|
|
|
|
void rds_tcp_accept_work(struct sock *sk)
|
|
|
|
{
|
|
|
|
struct net *net = sock_net(sk);
|
|
|
|
struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
|
|
|
|
|
|
|
|
queue_work(rds_wq, &rtn->rds_tcp_accept_w);
|
|
|
|
}
|
|
|
|
|
|
|
|
static __net_init int rds_tcp_init_net(struct net *net)
|
|
|
|
{
|
|
|
|
struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
|
2016-03-17 02:38:12 +08:00
|
|
|
struct ctl_table *tbl;
|
|
|
|
int err = 0;
|
2015-08-05 13:43:26 +08:00
|
|
|
|
2016-03-17 02:38:12 +08:00
|
|
|
memset(rtn, 0, sizeof(*rtn));
|
|
|
|
|
|
|
|
/* {snd, rcv}buf_size default to 0, which implies we let the
|
|
|
|
* stack pick the value, and permit auto-tuning of buffer size.
|
|
|
|
*/
|
|
|
|
if (net == &init_net) {
|
|
|
|
tbl = rds_tcp_sysctl_table;
|
|
|
|
} else {
|
|
|
|
tbl = kmemdup(rds_tcp_sysctl_table,
|
|
|
|
sizeof(rds_tcp_sysctl_table), GFP_KERNEL);
|
|
|
|
if (!tbl) {
|
2019-05-03 20:10:17 +08:00
|
|
|
pr_warn("could not allocate sysctl table\n");
|
2016-03-17 02:38:12 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
rtn->ctl_table = tbl;
|
|
|
|
}
|
|
|
|
tbl[RDS_TCP_SNDBUF].data = &rtn->sndbuf_size;
|
|
|
|
tbl[RDS_TCP_RCVBUF].data = &rtn->rcvbuf_size;
|
2023-08-09 18:50:03 +08:00
|
|
|
rtn->rds_tcp_sysctl = register_net_sysctl_sz(net, "net/rds/tcp", tbl,
|
|
|
|
ARRAY_SIZE(rds_tcp_sysctl_table));
|
2016-03-17 02:38:12 +08:00
|
|
|
if (!rtn->rds_tcp_sysctl) {
|
|
|
|
pr_warn("could not register sysctl\n");
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto fail;
|
|
|
|
}
|
2018-07-31 13:48:42 +08:00
|
|
|
|
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:22 +08:00
|
|
|
rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, true);
|
2018-07-31 13:48:42 +08:00
|
|
|
#else
|
|
|
|
rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, false);
|
|
|
|
#endif
|
2015-08-05 13:43:26 +08:00
|
|
|
if (!rtn->rds_tcp_listen_sock) {
|
2018-07-24 11:51:22 +08:00
|
|
|
pr_warn("could not set up IPv6 listen sock\n");
|
|
|
|
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:22 +08:00
|
|
|
/* Try IPv4 as some systems disable IPv6 */
|
|
|
|
rtn->rds_tcp_listen_sock = rds_tcp_listen_init(net, false);
|
|
|
|
if (!rtn->rds_tcp_listen_sock) {
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2018-07-24 11:51:22 +08:00
|
|
|
unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
|
|
|
|
rtn->rds_tcp_sysctl = NULL;
|
|
|
|
err = -EAFNOSUPPORT;
|
|
|
|
goto fail;
|
2018-07-31 13:48:42 +08:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2018-07-24 11:51:22 +08:00
|
|
|
}
|
2018-07-31 13:48:42 +08:00
|
|
|
#endif
|
2015-08-05 13:43:26 +08:00
|
|
|
}
|
|
|
|
INIT_WORK(&rtn->rds_tcp_accept_w, rds_tcp_accept_worker);
|
|
|
|
return 0;
|
2016-03-17 02:38:12 +08:00
|
|
|
|
|
|
|
fail:
|
|
|
|
if (net != &init_net)
|
|
|
|
kfree(tbl);
|
|
|
|
return err;
|
2015-08-05 13:43:26 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void rds_tcp_kill_sock(struct net *net)
{
	struct rds_tcp_connection *tc, *_tc;
	LIST_HEAD(tmp_list);
	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
	struct socket *lsock = rtn->rds_tcp_listen_sock;

	rtn->rds_tcp_listen_sock = NULL;
	rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
	spin_lock_irq(&rds_tcp_conn_lock);
	list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
		struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net);

		/* Destroy the connection even when tc->t_sock is NULL.
		 * A failed connect() restores the callbacks and resets
		 * t_sock to NULL while rds_connect_worker() (cp_conn_w)
		 * keeps retrying. If such connections were skipped here,
		 * nothing else would cancel that worker, and it would
		 * dereference the netns after net_drop_ns() has freed it
		 * (use-after-free reported by KASAN in inet_create()).
		 */
		if (net != c_net)
			continue;
		if (!list_has_conn(&tmp_list, tc->t_cpath->cp_conn)) {
			list_move_tail(&tc->t_tcp_node, &tmp_list);
		} else {
			list_del(&tc->t_tcp_node);
			tc->t_tcp_node_detached = true;
		}
	}
	spin_unlock_irq(&rds_tcp_conn_lock);
	list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
		rds_conn_destroy(tc->t_cpath->cp_conn);
}

static void __net_exit rds_tcp_exit_net(struct net *net)
{
	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);

	rds_tcp_kill_sock(net);

	if (rtn->rds_tcp_sysctl)
		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);

	if (net != &init_net)
		kfree(rtn->ctl_table);
}

static struct pernet_operations rds_tcp_net_ops = {
	.init = rds_tcp_init_net,
	.exit = rds_tcp_exit_net,
	.id = &rds_tcp_netid,
	.size = sizeof(struct rds_tcp_net),
};

void *rds_tcp_listen_sock_def_readable(struct net *net)
{
	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
	struct socket *lsock = rtn->rds_tcp_listen_sock;

	if (!lsock)
		return NULL;

	return lsock->sk->sk_user_data;
}

/* When sysctl is used to modify some kernel socket parameters, this
 * function resets the RDS connections in that netns so that we can
 * restart with new parameters. The assumption is that such reset
 * events are few and far-between.
 */
static void rds_tcp_sysctl_reset(struct net *net)
{
	struct rds_tcp_connection *tc, *_tc;

	spin_lock_irq(&rds_tcp_conn_lock);
	list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
		struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net);

		if (net != c_net || !tc->t_sock)
			continue;

		/* reconnect with new parameters */
		rds_conn_path_drop(tc->t_cpath, false);
	}
	spin_unlock_irq(&rds_tcp_conn_lock);
}

static int rds_tcp_skbuf_handler(struct ctl_table *ctl, int write,
				 void *buffer, size_t *lenp, loff_t *fpos)
{
	struct net *net = current->nsproxy->net_ns;
	int err;

	err = proc_dointvec_minmax(ctl, write, buffer, lenp, fpos);
	if (err < 0) {
		pr_warn("Invalid input. Must be >= %d\n",
			*(int *)(ctl->extra1));
		return err;
	}
	if (write)
		rds_tcp_sysctl_reset(net);
	return 0;
}

static void rds_tcp_exit(void)
{
	rds_tcp_set_unloading();
	synchronize_rcu();
	rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
#if IS_ENABLED(CONFIG_IPV6)
	rds_info_deregister_func(RDS6_INFO_TCP_SOCKETS, rds6_tcp_tc_info);
#endif
	unregister_pernet_device(&rds_tcp_net_ops);
	rds_tcp_destroy_conns();
	rds_trans_unregister(&rds_tcp_transport);
	rds_tcp_recv_exit();
	kmem_cache_destroy(rds_tcp_conn_slab);
}
module_exit(rds_tcp_exit);

static int __init rds_tcp_init(void)
{
	int ret;

	rds_tcp_conn_slab = kmem_cache_create("rds_tcp_connection",
					      sizeof(struct rds_tcp_connection),
					      0, 0, NULL);
	if (!rds_tcp_conn_slab) {
		ret = -ENOMEM;
		goto out;
	}

	ret = rds_tcp_recv_init();
	if (ret)
		goto out_slab;

	ret = register_pernet_device(&rds_tcp_net_ops);
	if (ret)
		goto out_recv;

	rds_trans_register(&rds_tcp_transport);

	rds_info_register_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
#if IS_ENABLED(CONFIG_IPV6)
	rds_info_register_func(RDS6_INFO_TCP_SOCKETS, rds6_tcp_tc_info);
#endif

	goto out;
out_recv:
	rds_tcp_recv_exit();
out_slab:
	kmem_cache_destroy(rds_tcp_conn_slab);
out:
	return ret;
}
module_init(rds_tcp_init);

MODULE_AUTHOR("Oracle Corporation <rds-devel@oss.oracle.com>");
MODULE_DESCRIPTION("RDS: TCP transport");
MODULE_LICENSE("Dual BSD/GPL");