linux/drivers/infiniband
Håkon Bugge 5f5a650999 RDMA/core/sa_query: Retry SA queries
A MAD packet is sent as an unreliable datagram (UD). SA requests are sent
as MAD packets. As such, SA requests or responses may be silently dropped.

IB Core's MAD layer has a timeout and retry mechanism, which amongst
other, is used by RDMA CM. But it is not used by SA queries. The lack of
retries of SA queries leads to long specified timeout, and error being
returned in case of packet loss. The ULP or user-land process has to
perform the retry.

Fix this by taking advantage of the MAD layer's retry mechanism.

First, a check against a zero timeout is added in rdma_resolve_route(). In
send_mad(), we set the MAD layer timeout to one tenth of the specified
timeout and the number of retries to 10. The special case when timeout is
less than 10 is handled.

With this fix:

 # ucmatose -c 1000 -S 1024 -C 1

runs stable on an Infiniband fabric. Without this fix, we see an
intermittent behavior and it errors out with:

cmatose: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -110

(110 is ETIMEDOUT)

Link: https://lore.kernel.org/r/1628784755-28316-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-25 13:42:47 -03:00
..
core RDMA/core/sa_query: Retry SA queries 2021-08-25 13:42:47 -03:00
hw RDMA/hns: Delete unused hns bitmap interface 2021-08-24 09:15:17 -03:00
sw RDMA: Globally allocate and release QP memory 2021-08-03 13:44:27 -03:00
ulp RDMA/rtrs: Remove (void) casting for functions 2021-08-22 19:22:59 -03:00
Kconfig RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw 2021-06-02 20:06:36 -03:00
Makefile treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00