The current svcrdma recvfrom code path has a lot of detail about
registration mode and the type of port (iWARP, IB, etc).
Instead, use the RDMA core's generic R/W API. This shares code with
other RDMA-enabled ULPs that manages the gory details of buffer
registration and the posting of RDMA Read Work Requests.
Since the Read list marshaling code is being replaced, I took the
opportunity to replace C structure-based XDR encoding code with more
portable code that uses pointer arithmetic.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
req_maps are no longer used by the send path and can thus be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up. All RDMA Write completions are now handled by
svc_rdma_wc_write_ctx.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current svcrdma sendto code path posts one RDMA Write WR at a
time. Each of these Writes typically carries a small number of pages
(for instance, up to 30 pages for mlx4 devices). That means a 1MB
NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
and one for the Send WR carrying the actual RPC Reply message.
Instead, use the new rdma_rw API. The details of Write WR chain
construction and memory registration are taken care of in the RDMA
core. svcrdma can focus on the details of the RPC-over-RDMA
protocol. This gives three main benefits:
1. All Write WRs for one RDMA segment are posted in a single chain.
As few as one ib_post_send() for each Write chunk.
2. The Write path can now use FRWR to register the Write buffers.
If the device's maximum page list depth is large, this means a
single Write WR is needed for each RPC's Write chunk data.
3. The new code introduces support for RPCs that carry both a Write
list and a Reply chunk. This combination can be used for an NFSv4
READ where the data payload is large, and thus is removed from the
Payload Stream, but the Payload Stream is still larger than the
inline threshold.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The plan is to replace the local bespoke code that constructs and
posts RDMA Read and Write Work Requests with calls to the rdma_rw
API. This shares code with other RDMA-enabled ULPs that manages the
gory details of buffer registration and posting Work Requests.
Some design notes:
o The structure of RPC-over-RDMA transport headers is flexible,
allowing multiple segments per Reply with arbitrary alignment,
each with a unique R_key. Write and Send WRs continue to be
built and posted in separate code paths. However, one whole
chunk (with one or more RDMA segments apiece) gets exactly
one ib_post_send and one work completion.
o svc_xprt reference counting is modified, since a chain of
rdma_rw_ctx structs generates one completion, no matter how
many Write WRs are posted.
o The current code builds the transport header as it is construct-
ing Write WRs. I've replaced that with marshaling of transport
header data items in a separate step. This is because the exact
structure of client-provided segments may not align with the
components of the server's reply xdr_buf, or the pages in the
page list. Thus parts of each client-provided segment may be
written at different points in the send path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The Send Queue depth is temporarily reduced to 1 SQE per credit. The
new rdma_rw API does an internal computation, during QP creation, to
increase the depth of the Send Queue to handle RDMA Read and Write
operations.
This change has to come before the NFSD code paths are updated to
use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
the size of the SQ too much, resulting in memory allocation failures
during QP creation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Same change as Kinglong Mee's fix for the TCP backchannel service.
Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
bugfixes.
A couple changes could theoretically break working setups on upgrade. I
don't expect complaints in practice, but they seem worth calling out
just in case:
- NFS security labels are now off by default; a new
security_label export flag reenables it per export. But,
having them on by default is a disaster, as it generally only
makes sense if all your clients and servers have similar
enough selinux policies. Thanks to Jason Tibbitts for
pointing this out.
- NFSv4/UDP support is off. It was never really supported, and
the spec explicitly forbids it. We only ever left it on out
of laziness; thanks to Jeff Layton for finally fixing that.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYtejbAAoJECebzXlCjuG+JhEQAK3YTYYNrPY26Pfiu0FghLFV
4qOHK4DOkJzrWIom5uWyBo7yOwH6WnQtTe/gCx/voOEW3lsJO7F3IfTnTVp+Smp6
GJeVtsr1vI9EBnwhMlyoJ5hZ2Ju5kX3MBVnew6+momt6620ZO7a+EtT+74ePaY8Y
jxLzWVA1UqbWYoMabNQpqgKypKvNrhwst72iYyBhNuL/qtGeBDQWwcrA+TFeE9tv
Ad7qB53xL1mr0Wn1CNIOR/IzVAj4o2H0vdjqrPjAdvbfwf8YYLNpJXt5k591wx/j
1TpiWIPqnwLjMT3X5NkQN3agZKeD+2ZWPrClr35TgRRe62CK6JblK9/Wwc5BRSzV
paMP3hOm6/dQOBA5C+mqPaHdEI8VqcHyZpxU4VC/ttsVEGTgaLhGwHGzSn+5lYiM
Qx9Sh50yFV3oiBW/sb/y8lBDwYm/Cq0OyqAU277idbdzjcFerMg1qt06tjEQhYMY
K2V7rS8NuADUF6F1BwOONZzvg7Rr7iWHmLh+iSM9TeQoEmz2jIHSaLIFaYOET5Jr
PIZS3rOYoa0FaKOYVYnZMC74n/LqP/Aou8B+1rRcLy5YEdUIIIVpFvqpg1nGv6PI
sA3zx/f13IRte3g0CuQiY0/2cx7uXk/gXJ7s5+ejEzljF/aYWiomx3mr6HqPQETn
CWEtXlfyJCyX+A8hbO+U
=iLLz
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"The nfsd update this round is mainly a lot of miscellaneous cleanups
and bugfixes.
A couple changes could theoretically break working setups on upgrade.
I don't expect complaints in practice, but they seem worth calling out
just in case:
- NFS security labels are now off by default; a new security_label
export flag reenables it per export. But, having them on by default
is a disaster, as it generally only makes sense if all your clients
and servers have similar enough selinux policies. Thanks to Jason
Tibbitts for pointing this out.
- NFSv4/UDP support is off. It was never really supported, and the
spec explicitly forbids it. We only ever left it on out of
laziness; thanks to Jeff Layton for finally fixing that"
* tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd: Fix display of the version string
nfsd: fix configuration of supported minor versions
sunrpc: don't register UDP port with rpcbind when version needs congestion control
nfs/nfsd/sunrpc: enforce transport requirements for NFSv4
sunrpc: flag transports as having congestion control
sunrpc: turn bitfield flags in svc_version into bools
nfsd: remove superfluous KERN_INFO
nfsd: special case truncates some more
nfsd: minor nfsd_setattr cleanup
NFSD: Reserve adequate space for LOCKT operation
NFSD: Get response size before operation for all RPCs
nfsd/callback: Drop a useless data copy when comparing sessionid
nfsd/callback: skip the callback tag
nfsd/callback: Cleanup callback cred on shutdown
nfsd/idmap: return nfserr_inval for 0-length names
SUNRPC/Cache: Always treat the invalid cache as unexpired
SUNRPC: Drop all entries from cache_detail when cache_purge()
svcrdma: Poll CQs in "workqueue" mode
svcrdma: Combine list fields in struct svc_rdma_op_ctxt
svcrdma: Remove unused sc_dto_q field
...
NFSv4 requires a transport protocol with congestion control in most
cases.
On an IP network, that means that NFSv4 over UDP should be forbidden.
The situation with RDMA is a bit more nuanced, but most RDMA transports
are suitable for this. For now, we assume that all RDMA transports are
suitable, but we may need to revise that at some point.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svcrdma calls svc_xprt_put() in its completion handlers, which
currently run in IRQ context.
However, svc_xprt_put() is meant to be invoked in process context,
not in IRQ context. After the last transport reference is gone, it
directly calls a transport release function that expects to run in
process context.
Change the CQ polling modes to IB_POLL_WORKQUEUE so that svcrdma
invokes svc_xprt_put() only in process context. As an added benefit,
bottom half-disabled spin locking can be eliminated from I/O paths.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: The free list and the dto_q list fields are never used at
the same time. Reduce the size of struct svc_rdma_op_ctxt by
combining these fields.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up. Commit be99bb1140 ("svcrdma: Use new CQ API for
RPC-over-RDMA server send CQs") removed code that used the sc_dto_q
field, but neglected to remove sc_dto_q at the same time.
Fixes: be99bb1140 ("svcrdma: Use new CQ API for RPC-over- ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable, and is used
throughout the kernel's existing XDR encoders. The gcc optimizer
generates similar assembler code either way.
Byte-swapping before a memory store on x86 typically results in an
instruction pipeline stall. Avoid byte-swapping when encoding a new
header.
svcrdma currently doesn't alter a connection's credit grant value
after the connection has been accepted, so it is effectively a
constant. Cache the byte-swapped value in a separate field.
Christoph suggested pulling the header encoding logic into the only
function that uses it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially to
the server rdma code.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYVAEqAAoJECebzXlCjuG+VM0QAKaR+ibSM31Ahpnrgit5/wrb
n630KDFztO7iqEeuHfPQ4/n05T2QR0JWsLpjLMFvx88Gy4gyXYk9cuDPIrNKX1IS
3/nnhBo0+EVnjODjufommCrtbPZlqOSsS3N03vWkB7rTi8QYsWBOThh+XLRJYOXo
LZzJE1WmXNeCXV1kXPBsauryywql1fmwTXBzmIf1HbzoGAVROMEA2qqh4Z3nb7BP
sJuGchWx0STBOuAa278ighXQPUW2lUft9uzw2bssOtMwfNyOs/Pd6nx4F1Lg6WwD
1UQXoiR8K3PqelZfoeFJ05v0css/sbNKep+huWRdOXZj3Kjpa20lKBX8xHfat7sN
1OQ4FHx8ToigX3c+wwtlCqRMCcIxqUYkRjqzPHyeBiSSSp0rLrId44rI5x/K0yay
3bkGw7hFDSzc0Nq2uZgmtlbyTC71hLNhkWe7ThofcVG/pS0JtAqBiKIVwXJPh/e0
PLmVHYGU6Xowjag5edJlXY1tlIlxtWfqsWUarCXS5bfKUa3UjMVSjyuljsDqqJsn
96fEWu7DiUo4HeGYmf8MJoeZYV2y0DKSQGeguVkUKWp2DoTzinQHTfdKvrZVwNuu
hVE9/QeWzUvPY13HOUaKD2skozhbUChqv0NHESKUv8gxE3svTEpYZkXrE74WNqMk
l/WXAhw+RdKZof4+qdjU
=JANY
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"The one new feature is support for a new NFSv4.2 mode_umask attribute
that makes ACL inheritance a little more useful in environments that
default to restrictive umasks. Requires client-side support, also on
its way for 4.10.
Other than that, miscellaneous smaller fixes and cleanup, especially
to the server rdma code"
[ The client side of the umask attribute was merged yesterday ]
* tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux:
nfsd: add support for the umask attribute
sunrpc: use DEFINE_SPINLOCK()
svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
svcrdma: Break up dprintk format in svc_rdma_accept()
svcrdma: Remove unused variable in rdma_copy_tail()
svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
svcrdma: Remove svc_rdma_op_ctxt::wc_status
svcrdma: Remove DMA map accounting
svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
svcrdma: Renovate sendto chunk list parsing
svcauth_gss: Close connection when dropping an incoming message
svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
nfsd: constify reply_cache_stats_operations structure
nfsd: update workqueue creation
sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
nfsd: catch errors in decode_fattr earlier
nfsd: clean up supported attribute handling
nfsd: fix error handling for clients that fail to return the layout
nfsd: more robust allocation failure handling in nfsd_reply_cache_init
The current code results in:
Nov 7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800,
newxprt->sc_pd=ffff880852a7ce00#012 cm_id->device=ffff88084dd20000,
sc_pd->device=ffff88084dd20000#012 cap.max_send_wr = 272#012
cap.max_recv_wr = 34#012 cap.max_send_sge = 32#012
cap.max_recv_sge = 32
Nov 7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000
accepted with the following attributes:#012 local_ip :
10.0.0.5#012 local_port#011 : 20049#012 remote_ip :
10.0.0.2#012 remote_port : 59909#012 max_sge : 32#012
max_sge_rd : 30#012 sq_depth : 272#012 max_requests :
32#012 ord : 16
Split up the output over multiple dprintks and take the opportunity
to fix the display of IPv6 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Completion status is already reported in the individual
completion handlers. Save a few bytes in struct svc_rdma_op_ctxt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: sc_dma_used is not required for correct operation. It is
simply a debugging tool to report when svcrdma has leaked DMA maps.
However, manipulating an atomic has a measurable CPU cost, and DMA
map accounting specific to svcrdma will be meaningless once svcrdma
is converted to use the new generic r/w API.
A similar kind of debug accounting can be done simply by enabling
the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
CONFIG_IOMMU_LEAK.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svcrdma's current SQ accounting algorithm takes sc_lock and disables
bottom-halves while posting all RDMA Read, Write, and Send WRs.
This is relatively heavyweight serialization. And note that Write and
Send are already fully serialized by the xpt_mutex.
Using a single atomic_t should be all that is necessary to guarantee
that ib_post_send() is called only when there is enough space on the
send queue. This is what the other RDMA-enabled storage targets do.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
benefit from user testing:
Anna Schumacker contributed a simple NFSv4.2 COPY implementation. COPY
is already supported on the client side, so a call to copy_file_range()
on a recent client should now result in a server-side copy that doesn't
require all the data to make a round trip to the client and back.
Jeff Layton implemented callbacks to notify clients when contended locks
become available, which should reduce latency on workloads with
contended locks.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJX/mcsAAoJECebzXlCjuG+MU0P/3SzTLGYXU5yOTAorx255/uf
fUVKQQhTzzaA2xj3gWWWztYx3y0ZJUVgwU56a+Ap5Z8/goqDQ78H+ePEc+MG7BT/
/UXS/bITvt0MP/dvPrDzhSltvqx/wpelLPBo29hGLlAQ2dsnD4Y75IbOOQccWqcC
iD2v6x7lnpWZ7j9Zhwzg/JNQHwISIb7tiLoYBjfcdNDEMU76KIyhxD0Cx9MSeBzH
9Rq/oEdwGDFS5WqVfNe2jxbngoauq1IupziQ2eQGv2D/POyXCx8fphoYjDz1XaW8
PxaJtJtM2owPGG+z2CxklJqNaS1Z4F+oppjg+nf4i/ibxmIBaTy8NluASX3vMh69
CDO1+ly+TiF0l1VqMOQJWRnqn1qGk6fLpF6P1Ac62B0oWpeLGU7nmik7XN1ORgsi
8ksxRKNAWeprZo3wl5xNrADu/wlZ7XCJTc4QoHEgYT04aHF+j8EMCHv+mtZ8+Bwn
WWiA8iItZOgXV4vitCRJlvsixjYvmF3djPIoI2Lt5KDWIg+eL89sKwzTALSfeC4m
Vjb0svzPX1MmZCNP1rCStFbl3gZYXZyqPk+uA6M7H8mjAjVeKxRPowWpMBgvYZHr
FjCPb878bAuqCeBVbIyOLLcKWBLTw8PsUWZAor3gNg454JGkMjLUyJ/S22Cz5Nbo
HdjoiTJtbPrHnCwTMXwa
=nozl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Some RDMA work and some good bugfixes, and two new features that could
benefit from user testing:
- Anna Schumacker contributed a simple NFSv4.2 COPY implementation.
COPY is already supported on the client side, so a call to
copy_file_range() on a recent client should now result in a
server-side copy that doesn't require all the data to make a round
trip to the client and back.
- Jeff Layton implemented callbacks to notify clients when contended
locks become available, which should reduce latency on workloads
with contended locks"
* tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux:
NFSD: Implement the COPY call
nfsd: handle EUCLEAN
nfsd: only WARN once on unmapped errors
exportfs: be careful to only return expected errors.
nfsd4: setclientid_confirm with unmatched verifier should fail
nfsd: randomize SETCLIENTID reply to help distinguish servers
nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies
nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant
nfsd: add a LRU list for blocked locks
nfsd: have nfsd4_lock use blocking locks for v4.1+ locks
nfsd: plumb in a CB_NOTIFY_LOCK operation
NFSD: fix corruption in notifier registration
svcrdma: support Remote Invalidation
svcrdma: Server-side support for rpcrdma_connect_private
rpcrdma: RDMA/CM private message data structure
svcrdma: Skip put_page() when send_reply() fails
svcrdma: Tail iovec leaves an orphaned DMA mapping
nfsd: fix dprintk in nfsd4_encode_getdeviceinfo
nfsd: eliminate cb_minorversion field
nfsd: don't set a FL_LAYOUT lease for flexfiles layouts
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited. It also prints a warning
everytime this feature is used as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Support Remote Invalidation. A private message is exchanged with
the client upon RDMA transport connect that indicates whether
Send With Invalidation may be used by the server to send RPC
replies. The invalidate_rkey is arbitrarily chosen from among
rkeys present in the RPC-over-RDMA header's chunk lists.
Send With Invalidate improves performance only when clients can
recognize, while processing an RPC reply, that an rkey has already
been invalidated. That has been submitted as a separate change.
In the future, the RPC-over-RDMA protocol might support Remote
Invalidation properly. The protocol needs to enable signaling
between peers to indicate when Remote Invalidation can be used
for each individual RPC.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Prepare to receive an RDMA-CM private message when handling a new
connection attempt, and send a similar message as part of connection
acceptance.
Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The ctxt's count field is overloaded to mean the number of pages in
the ctxt->page array and the number of SGEs in the ctxt->sge array.
Typically these two numbers are the same.
However, when an inline RPC reply is constructed from an xdr_buf
with a tail iovec, the head and tail often occupy the same page,
but each are DMA mapped independently. In that case, ->count equals
the number of pages, but it does not equal the number of SGEs.
There's one more SGE, for the tail iovec. Hence there is one more
DMA mapping than there are pages in the ctxt->page array.
This isn't a real problem until the server's iommu is enabled. Then
each RPC reply that has content in that iovec orphans a DMA mapping
that consists of real resources.
krb5i and krb5p always populate that tail iovec. After a couple
million sent krb5i/p RPC replies, the NFS server starts behaving
erratically. Reboot is needed to clear the problem.
Fixes: 9d11b51ce7 ("svcrdma: Fix send_reply() scatter/gather set-up")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the server has forced a disconnect, the associated QP has not
been moved to the Error state, and thus Receives are still posted.
Ensure Receives (and any other outstanding WRs) are drained to
release resources that can be freed during teardown of the
svcrdma_xprt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since backward direction support was added, the rq_depth was
increased to accommodate both forward and backward Receives.
But only forward Receives need to be posted after a connection
has been accepted. Receives for backward replies are posted as
needed by svc_rdma_bc_sendto().
This doesn't break anything, but it means some resources are
wasted.
Fixes: 03fe993153 ('svcrdma: Define maximum number of ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Allow both IPv4 and IPv6 to bind same port at the same time,
restricts use of the IPv6 socket to IPv6 communication.
Changes from v1:
- Check rdma_set_afonly return value (suggested by Leon Romanovsky)
Changes from v2:
- Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
This new API also aims each completion at a function that is
specific to the WR's opcode. Thus the ctxt->wr_op field and the
switch in process_context is replaced by a set of methods that
handle each completion type.
Because each ib_cqe carries a pointer to a completion method, the
core can now post operations on a consumer's QP, and handle the
completions itself.
The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no
longer updated.
As a clean up, the cq_event_handler, the dto_tasklet, and all
associated locking is removed, as they are no longer referenced or
used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b249
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
Because each ib_cqe carries a pointer to a completion method, the
core can now post operations on a consumer's QP, and handle the
completions itself.
svcrdma receive completions no longer use the dto_tasklet. Each
polled Receive WC is now handled individually in soft IRQ context.
The server transport's rdma_stat_rq_poll and rdma_stat_rq_prod
metrics are no longer updated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix several issues with svc_rdma_send_error():
- Post a receive buffer to replace the one that was consumed by
the incoming request
- Posting a send should use DMA_TO_DEVICE, not DMA_FROM_DEVICE
- No need to put_page _and_ free pages in svc_rdma_put_context
- Make sure the sge is set up completely in case the error
path goes through svc_rdma_unmap_dma()
- Replace the use of ENOSYS, which has a reserved meaning
Related fixes in svc_rdma_recvfrom():
- Don't leak the ctxt associated with the incoming request
- Don't close the connection after sending an error reply
- Let svc_rdma_send_error() figure out the right header error code
As a last clean up, move svc_rdma_send_error() to svc_rdma_sendto.c
with other similar functions. There is some common logic in these
functions that could someday be combined to reduce code duplication.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Most svc_rdma_post_recv() call sites close the transport
connection when a receive cannot be posted. Wrap that in a common
helper.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We now alwasy have a per-PD local_dma_lkey available. Make use of that
fact in svc_rdma and stop registering our own MR.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set up
additional fields in svcxprt_rdma to track these resources.
The max_requests fields are elements of the RPC-over-RDMA
protocol, so they should be u32. To ensure that unsigned
arithmetic is used everywhere, some other fields in the
svcxprt_rdma struct are updated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Clean up.
These functions can otherwise fail, so check for page allocation
failures too.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
svc_rdma_post_recv() allocates pages for receive buffers on-demand.
It uses GFP_KERNEL so the allocator tries hard, and may sleep. But
I'm about to add a call to svc_rdma_post_recv() from a function
that may not sleep.
Since all svc_rdma_post_recv() call sites can tolerate its failure,
allow it to fail if the page allocator returns nothing. Longer term,
receive buffers, being a finite resource per-connection, should be
pre-allocated and re-used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To ensure this allocation cannot fail and will not sleep,
pre-allocate the req_map structures per-connection.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When the maximum payload size of NFS READ and WRITE was increased
by commit cc9a903d91 ("svcrdma: Change maximum server payload back
to RPCSVC_MAXPAYLOAD"), the size of struct svc_rdma_op_ctxt
increased to over 6KB (on x86_64). That makes allocating one of
these from a kmem_cache more likely to fail in situations when
system memory is exhausted.
Since I'm about to add a caller where this allocation must always
work _and_ it cannot sleep, pre-allocate ctxts for each connection.
Another motivation for this change is that NFSv4.x servers are
required by specification not to drop NFS requests. Pre-allocating
memory resources reduces the likelihood of a drop.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Be sure the completed ctxt is put in every path.
The xprt enqueue can take a while, so put the completed ctxt back
in circulation _before_ enqueuing the xprt.
Remove/disable debugging.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
kzalloc is used here, so setting the atomic fields to zero is
unnecessary. sc_ord is set again in handle_connect_req. The other
fields are re-initialized in svc_rdma_accept().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead, use the cached copy of the attributes present on the device.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Highlights include:
Features:
- RDMA client backchannel from Chuck
- Support for NFSv4.2 file CLONE using the btrfs ioctl
Bugfixes + cleanups
- Move socket data receive out of the bottom halves and into a workqueue
- Refactor NFSv4 error handling so synchronous and asynchronous RPC handles
errors identically.
- Fix a panic when blocks or object layouts reads return a bad data length
- Fix nfsroot so it can handle a 1024 byte long path.
- Fix bad usage of page offset in bl_read_pagelist
- Various NFSv4 callback cleanups+fixes
- Fix GETATTR bitmap verification
- Support hexadecimal number for sunrpc debug sysctl files
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWQPMXAAoJEGcL54qWCgDy6ZUQAL32vpgyMXe7R4jcxoQxm52+
tn8FrY8aBZAqucvQsIGCrYfE01W/s8goDTQdZODn0MCcoor12BTPVYNIR42/J/no
MNnRTDF0dJ4WG+inX9G87XGG6sFN3wDaQcCaexknkQZlFNF4KthxojzR2BgjmRVI
p3WKkLSNTt6DYQQ8eDetvKoDT0AjR/KCYm89tiE8GMhKYcaZl6dTazJxwOcp2CX9
YDW6+fQbsv8qp5v2ay03e88O/DSmcNRFoxy/KUGT9OwJgdN08IN8fTt6GG38yycT
D9tb9uObBRcll4PnucouadBcykGr6jAP0z8HklE266LH1dwYLOHQoDFdgAs0QGtq
nlySiKvToj6CYXonXoPOjZF3P/lxlkj5ViZ2enBxgxrPmyWl172cUSa6NTXOMO46
kPpxw50xa1gP5kkBVwIZ6XZuzl/5YRhB3BRP3g6yuJCbAwVBJvawYU7riC+6DEB9
zygVfm21vi9juUQXJ37zXVRBTtoFhFjuSxcAYxc63o181lWYShKQ3IiRYg+zTxnq
7DOhXa0ZNGvMgJJi0tH9Es3/S6TrGhyKh5gKY/o2XUjY0hCSsCSdP6jw6Mb9Ax1s
0LzByHAikxBKPt2OFeoUgwycI2xqow4iAfuFk071iP7n0nwC804cUHSkGxW67dBZ
Ve5Skkg1CV+oWQYxGmGZ
=py1V
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
New features:
- RDMA client backchannel from Chuck
- Support for NFSv4.2 file CLONE using the btrfs ioctl
Bugfixes + cleanups:
- Move socket data receive out of the bottom halves and into a
workqueue
- Refactor NFSv4 error handling so synchronous and asynchronous RPC
handles errors identically.
- Fix a panic when blocks or object layouts reads return a bad data
length
- Fix nfsroot so it can handle a 1024 byte long path.
- Fix bad usage of page offset in bl_read_pagelist
- Various NFSv4 callback cleanups+fixes
- Fix GETATTR bitmap verification
- Support hexadecimal number for sunrpc debug sysctl files"
* tag 'nfs-for-4.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (53 commits)
Sunrpc: Supports hexadecimal number for sysctl files of sunrpc debug
nfs: Fix GETATTR bitmap verification
nfs: Remove unused xdr page offsets in getacl/setacl arguments
fs/nfs: remove unnecessary new_valid_dev check
SUNRPC: fix variable type
NFS: Enable client side NFSv4.1 backchannel to use other transports
pNFS/flexfiles: Add support for FF_FLAGS_NO_IO_THRU_MDS
pNFS/flexfiles: When mirrored, retry failed reads by switching mirrors
SUNRPC: Remove the TCP-only restriction in bc_svc_process()
svcrdma: Add backward direction service for RPC/RDMA transport
xprtrdma: Handle incoming backward direction RPC calls
xprtrdma: Add support for sending backward direction RPC replies
xprtrdma: Pre-allocate Work Requests for backchannel
xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
SUNRPC: Abstract backchannel operations
xprtrdma: Saving IRQs no longer needed for rb_lock
xprtrdma: Remove reply tasklet
xprtrdma: Use workqueue to process RPC/RDMA replies
xprtrdma: Replace send and receive arrays
xprtrdma: Refactor reply handler error handling
...
On NFSv4.1 mount points, the Linux NFS client uses this transport
endpoint to receive backward direction calls and route replies back
to the NFSv4.1 server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support for network namespaces in the ib_cma module. This is
accomplished by:
1. Adding network namespace parameter for rdma_create_id. This parameter is
used to populate the network namespace field in rdma_id_private.
rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
of ib_cma, when listening on an ID and when looking for an ID for an
incoming request.
3. Decrementing the reference count for the appropriate network namespace
when calling rdma_destroy_id.
In order to preserve the current behavior init_net is passed when calling
from other modules.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing
read and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc. mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV7v8wAAoJELgmozMOVy/d2dcP/3PXnGFPgFGJODKE6VCZtTvj
nooNXRKXjxv470UT5DiAX7SNcBxzzS7Zl/Lj+831H9iNXUyzuH31KtBOAZ3W03vZ
yXwCB2caOStSldTRSUUvPe2aIFPnyNmSpC4i6XcJLJMCFijKmxin5pAo8qE44BQU
yjhT+wC9P6LL5wZXsn/nFIMLjOFfu0WBFHNp3gs5j59paxlx5VeIAZk16aQZH135
m7YCyicwrS8iyWQl2bEXRMon2vlCHlX2RHmOJ4f/P5I0quNcGF2+d8Yxa+K1VyC5
zcb3OBezz+wZtvh16yhsDfSPqHWirljwID2VzOgRSzTJWvQjju8VkwHtkq6bYoBW
egIxGCHcGWsD0R5iBXLYr/tB+BmjbDObSm0AsR4+JvSShkeVA1IpeoO+19162ixE
n6CQnk2jCee8KXeIN4PoIKsjRSbIECM0JliWPLoIpuTuEhhpajftlSLgL5hf1dzp
HrSy6fXmmoRj7wlTa7DnYIC3X+ffwckB8/t1zMAm2sKnIFUTjtQXF7upNiiyWk4L
/T1QEzJ2bLQckQ9yY4v528SvBQwA4Dy1amIQB7SU8+2S//bYdUvhysWPkdKC4oOT
WlqS5PFDCI31MvNbbM3rUbMAD8eBAR8ACw9ZpGI/Rffm5FEX5W3LoxA8gfEBRuqt
30ZYFuW8evTL+YQcaV65
=EHLg
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull inifiniband/rdma updates from Doug Ledford:
"This is a fairly sizeable set of changes. I've put them through a
decent amount of testing prior to sending the pull request due to
that.
There are still a few fixups that I know are coming, but I wanted to
go ahead and get the big, sizable chunk into your hands sooner rather
than waiting for those last few fixups.
Of note is the fact that this creates what is intended to be a
temporary area in the drivers/staging tree specifically for some
cleanups and additions that are coming for the RDMA stack. We
deprecated two drivers (ipath and amso1100) and are waiting to hear
back if we can deprecate another one (ehca). We also put Intel's new
hfi1 driver into this area because it needs to be refactored and a
transfer library created out of the factored out code, and then it and
the qib driver and the soft-roce driver should all be modified to use
that library.
I expect drivers/staging/rdma to be around for three or four kernel
releases and then to go away as all of the work is completed and final
deletions of deprecated drivers are done.
Summary of changes for 4.3:
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular
tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing read
and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
IB/ipoib: Suppress warning for send only join failures
IB/ipoib: Clean up send-only multicast joins
IB/srp: Fix possible protection fault
IB/core: Move SM class defines from ib_mad.h to ib_smi.h
IB/core: Remove unnecessary defines from ib_mad.h
IB/hfi1: Add PSM2 user space header to header_install
IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
mlx5: Fix incorrect wc pkey_index assignment for GSI messages
IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
IB/uverbs: reject invalid or unknown opcodes
IB/cxgb4: Fix if statement in pick_local_ip6adddrs
IB/sa: Fix rdma netlink message flags
IB/ucma: HW Device hot-removal support
IB/mlx4_ib: Disassociate support
IB/uverbs: Enable device removal when there are active user space applications
IB/uverbs: Explicitly pass ib_dev to uverbs commands
IB/uverbs: Fix race between ib_uverbs_open and remove_one
IB/uverbs: Fix reference counting usage of event files
IB/core: Make ib_dealloc_pd return void
IB/srp: Create an insecure all physical rkey only if needed
...
Svcrdma was incorrectly allocating fastreg MRs and page lists using
RPCSVC_MAXPAGES, which can exceed the device capabilities. So limit
the depth to the minimum of RPCSVC_MAXPAGES and xprt->sc_frmr_pg_list_len.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>