Completion or congestion notifications were not being checked
if the socket went to sleep. This patch fixes that.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backwards compatibility with rds 3.0 causes protocol-
based flow control to be disabled as a side-effect.
I don't want to pull out FC support from the IB transport
but I do want to document and keep the sysctl consistent
if possible.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since RDS 3.0 and 3.1 have different packet formats,
we need to wait until after protocol negotiation
is complete to layout the rx buffers.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocol negotiation is logically a property of the
transports, so rds core need not set it.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Of course len is in bytes. Calling it data_len hopefully indicates
a little better what the variable is actually for.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The big differences between RDS 3.0 and 3.1 are protocol-level
flow control, and with 3.1 the header is in front of the data. The header
always ends up in the header buffer, and the data goes in the data page.
In 3.0 our "header" is a trailer, and will end up either in the data
page, the header buffer, or split across the two. Since 3.1 is backwards-
compatible with 3.0, we need to continue to support these cases. This
patch does that -- if using RDS 3.0 wire protocol, it will copy the header
from wherever it ended up into the header buffer.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS on IB uses privdata to do protocol version negotiation. Apparently
the IB stack will return a larger privdata buffer than the struct we were
expecting. Just to be extra-sure, this patch adds some checks in this area.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be default cause IB connections to failover faster,
but allow a longer retry count to be used if desired.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a few places where ___cacheline_aligned* is used with
DEFINE_PER_CPU(). Use DEFINE_PER_CPU_SHARED_ALIGNED() instead.
DEFINE_PER_CPU_SHARED_ALIGNED() applies alignment only on SMPs. While
all other converted places used _in_smp variant or only get compiled
for SMP, net/rds used unconditional ____cacheline_aligned. I don't
see any reason these data structures should be aligned on UP and thus
converted together.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andy Grover <andy.grover@oracle.com>
In non-SMP mode, the variable section attribute specified by DECLARE_PER_CPU()
does not agree with that specified by DEFINE_PER_CPU(). This means that
architectures that have a small data section references relative to a base
register may throw up linkage errors due to too great a displacement between
where the base register points and the per-CPU variable.
On FRV, the .h declaration says that the variable is in the .sdata section, but
the .c definition says it's actually in the .data section. The linker throws
up the following errors:
kernel/built-in.o: In function `release_task':
kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o
kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o
To fix this, DECLARE_PER_CPU() should simply apply the same section attribute
as does DEFINE_PER_CPU(). However, this is made slightly more complex by
virtue of the fact that there are several variants on DEFINE, so these need to
be matched by variants on DECLARE.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rdma_create_id() doesn't return NULL, only ERR_PTR().
Found by smatch (http://repo.or.cz/w/smatch.git). Compile tested.
regards,
dan carpenter
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rdma_create_id() returns ERR_PTR() not null.
Found by smatch (http://repo.or.cz/w/smatch.git). Compile tested.
regards,
dan carpenter
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use kmem_cache_zalloc instead of kmem_cache_alloc/memset.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unused #include <version.h> in net/rds/af_rds.c.
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new function that is simpler and faster.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first message to a remote node should prompt a new connection.
Even an RDMA op via CMSG. Therefore move CMSG parsing to after
connection establishment.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Putting the constant first is a supposed "best practice" that actually makes
the code harder to read.
Thanks to Roland Dreier for finding a bug in this "simple, obviously correct"
patch.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix hack that restricts the credit advertisement to 127.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RDS_LL_SEND_FULL bit should be set when we stop transmitted due to
flow control. Otherwise the send worker will keep trying as opposed to
sleeping until we unthrottle. Saves CPU.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Had some lingering instances of _iw_ variable names from when
the listen code was centralized into rdma_transport.c
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the recv ring low water mark is 1/4 the depth. Performance
measurements show that this limits iWARP throughput by flow controlling
the rds-stress senders. Setting it to 1/2 seems to max the T3
performance. I tried even higher levels but that didn't help and it
started to increase the rds thread cpu utilization.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have a 64bit value that needs to be set atomically.
This is easy and quick on all 64bit archs, and can also be done
on x86/32 with set_64bit() (uses cmpxchg8b). However other
32b archs don't have this.
I actually changed this to the current state in preparation for
mainline because the old way (using a spinlock on 32b) resulted in
unsightly #ifdefs in the code. But obviously, being correct takes
precedence.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug where a connection was unexpectedly
not on *any* list while being destroyed. It also
cleans up some code duplication and regularizes some
function names.
* Grab appropriate lock in conn_free() and explain in comment
* Ensure via locking that a conn is never not on either
a dev's list or the nodev list
* Add rds_xx_remove_conn() to match rds_xx_add_conn()
* Make rds_xx_add_conn() return void
* Rename remove_{,nodev_}conns() to
destroy_{,nodev_}conns() and unify their implementation
in a helper function
* Document lock ordering as nodev conn_lock before
dev_conn_lock
Reported-by: Yosef Etigin <yosefe@voltaire.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rs_send_drop_to() is called during socket close. If it takes
m_rs_lock without disabling interrupts, then
rds_send_remove_from_sock() can run from the rx completion
handler and thus deadlock.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As reported by Stephen Rothwell.
> Today's linux-next build (powerpc allyesconfig) failed like this:
>
> net/rds/cong.c: In function 'rds_cong_set_bit':
> net/rds/cong.c:284: error: implicit declaration of function 'generic___set_le_bit'
> net/rds/cong.c: In function 'rds_cong_clear_bit':
> net/rds/cong.c:298: error: implicit declaration of function 'generic___clear_le_bit'
> net/rds/cong.c: In function 'rds_cong_test_bit':
> net/rds/cong.c:309: error: implicit declaration of function 'generic_test_le_bit'
Signed-off-by: David S. Miller <davem@davemloft.net>
Add RDS Kconfig and Makefile, and modify net/'s to add
us to the build.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although most of IB and iWARP are separated from each other,
there is some common code required to handle their shared
CM listen port. This code listens for CM events and then
dispatches the event to the appropriate transport, either
IB or iWARP.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for iWARP NICs is implemented as a separate
RDS transport from IB. The code, however, is very
similar to IB (it was forked, basically.) so let's keep
it in one changeset.
The reason for this duplicationis that despite its similarity
to IB, there are a number of places where it has different
semantics. iwarp zcopy support is still under development,
and giving it its own sandbox ensures that IB code isn't
disrupted while iwarp changes. Over time these transports
will re-converge.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Header parsing, ring refill. It puts the incoming data into an
rds_incoming struct, which is passed up to rds-core.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Specific to IB is a credits-based flow control mechanism, in
addition to the expected usage of the IB API to package outgoing
data into work requests.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Registers as an RDS transport and an IB client, and uses IB CM
API to allocate ids, queue pairs, and the rest of that fun stuff.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some transports may support RDMA features. This handles the
non-transport-specific parts, like pinning user pages and
tracking mapped regions.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon receiving a datagram from the transport, RDS parses the
headers and potentially queues an ACK.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parsing of newly-received RDS message headers (including ext.
headers) and copy-to/from-user routines.
page.c implements a per-cpu page remainder cache, to reduce the
number of allocations needed for small datagrams.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS exposes a few tunable parameters via sysctls.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A simple rds transport to handle loopback connections.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While arguably the fact that the underlying transport needs a
connection to convey RDS's datagrame reliably is not important
to rds proper, the transports implemented so far (IB and TCP)
have both been connection-oriented, and so the connection
state machine-related code is in the common rds code.
This patch also includes several work items, to handle connecting,
sending, receiving, and shutdown.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS currently generates a lot of stats that are accessible via
the rds-info utility. This code implements the support for this.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS supports multiple transports. While this initial submission
only supports Infiniband transport, this abstraction allows others
to be added. We're working on an iWARP transport, and also see
UDP over DCB as another possibility.
This code handles transport registration.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS handles per-socket congestion by updating peers with a complete
congestion map (8KB). This code keeps track of these maps for itself
and ones received from peers.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS's main data structure definitions and exported functions.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the RDS (Reliable Datagram Sockets) interface.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>