Commit Graph

14414 Commits

Author SHA1 Message Date
Al Viro
cb787f4ac0 [tree-wide] finally take no_llseek out
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")

To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
	sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done.  Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
	.llseek = no_llseek,
so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-27 08:18:43 -07:00
Linus Torvalds
54d7e8190e RDMA v6.12 merge window
Usual collection of small improvements and fixes:
 
 - Bug fixes and minor improvments in cxgb4, siw, mlx5, rxe, efa, rts, hfi,
   erdma, hns, irdma
 
 - Code cleanups/typos/etc. Tidy alloc_ordered_workqueue() calls
 
 - Multipath PCI for mlx5
 
 - Variable size work queue, SRQ changes, and relaxed ordering for new bnxt HW
 
 - New ODP fault resolution FW protocol in mlx5
 
 - New "rdma monitor" netlink mechanism
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZvGZowAKCRCFwuHvBreF
 YcvbAP9abSxZte3zzG1ZJ/6BShSCGJvu4RMMMQI6wNJWZZiJ5wEA18MdaWzGFS8O
 BzP48Z/0VGsd2MOfNX4JeyYIs7SNYQA=
 =FXLo
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual collection of small improvements and fixes, nothing especially
  stands out to me here.

  The new multipath PCI feature is a sign of things to come, I think we
  will see more of this in the next 10 years. Broadcom and HNS continue
  to update their drivers for their new HW generations.

  Summary:

   - Bug fixes and minor improvments in cxgb4, siw, mlx5, rxe, efa, rts,
     hfi, erdma, hns, irdma

   - Code cleanups/typos/etc. Tidy alloc_ordered_workqueue() calls

   - Multipath PCI for mlx5

   - Variable size work queue, SRQ changes, and relaxed ordering for new
     bnxt HW

   - New ODP fault resolution FW protocol in mlx5

   - New 'rdma monitor' netlink mechanism"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
  RDMA/bnxt_re: Remove the unused variable en_dev
  RDMA/nldev: Add missing break in rdma_nl_notify_err_msg()
  RDMA/irdma: fix error message in irdma_modify_qp_roce()
  RDMA/cxgb4: Added NULL check for lookup_atid
  RDMA/hns: Fix ah error counter in sw stat not increasing
  RDMA/bnxt_re: Recover the device when FW error is detected
  RDMA/bnxt_re: Group all operations under add_device and remove_device
  RDMA/bnxt_re: Use the aux device for L2 ULP callbacks
  RDMA/bnxt_re: Change aux driver data to en_info to hold more information
  RDMA/nldev: Expose whether RDMA monitoring is supported
  RDMA/nldev: Add support for RDMA monitoring
  RDMA/mlx5: Use IB set_netdev and get_netdev functions
  RDMA/device: Remove optimization in ib_device_get_netdev()
  RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation
  RDMA/mlx5: Obtain upper net device only when needed
  RDMA/mlx5: Check RoCE LAG status before getting netdev
  RDMA/mlx5: Consider the query_vuid cap for data_direct
  net/mlx5: Handle memory scheme ODP capabilities
  RDMA/mlx5: Add implicit MR handling to ODP memory scheme
  RDMA/mlx5: Add handling for memory scheme page fault events
  ...
2024-09-24 11:48:00 -07:00
Linus Torvalds
f8ffbc365f struct fd layout change (and conversion to accessor helpers)
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ
 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d
 592+iDgLsema/H/0/CqfqlaNtDNY8Q0=
 =HUl5
 -----END PGP SIGNATURE-----

Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' updates from Al Viro:
 "Just the 'struct fd' layout change, with conversion to accessor
  helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add struct fd constructors, get rid of __to_fd()
  struct fd: representation change
  introduce fd_file(), convert all accessors to it.
2024-09-23 09:35:36 -07:00
Jiapeng Chong
7092094192 RDMA/bnxt_re: Remove the unused variable en_dev
Variable en_dev is not effectively used, so delete it.

drivers/infiniband/hw/bnxt_re/main.c:1980:22: warning: variable ‘en_dev’ set but not used.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10867
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://patch.msgid.link/20240918021632.36091-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-22 16:53:46 +03:00
Nathan Chancellor
7acad3c442 RDMA/nldev: Add missing break in rdma_nl_notify_err_msg()
Clang warns (or errors with CONFIG_WERROR=y):

  drivers/infiniband/core/nldev.c:2795:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
   2795 |         default:
        |         ^

Clang is a little more pedantic than GCC, which does not warn when
falling through to a case that is just break or return. Clang's version
is more in line with the kernel's own stance in deprecated.rst, which
states that all switch/case blocks must end in either break,
fallthrough, continue, goto, or return. Add the missing break to silence
the warning.

Fixes: 9cbed5aab5 ("RDMA/nldev: Add support for RDMA monitoring")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20240916-rdma-fix-clang-fallthrough-nl_notify_err_msg-v1-1-89de6a7423f1@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-16 20:28:40 +03:00
Vitaliy Shevtsov
9f0eafe86e RDMA/irdma: fix error message in irdma_modify_qp_roce()
Use a correct field max_dest_rd_atomic instead of max_rd_atomic for the
error output.

Found by Linux Verification Center (linuxtesting.org) with Svace.

Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Vitaliy Shevtsov <v.shevtsov@maxima.ru>
Link: https://lore.kernel.org/stable/20240916165817.14691-1-v.shevtsov%40maxima.ru
Link: https://patch.msgid.link/20240916165817.14691-1-v.shevtsov@maxima.ru
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-16 20:27:04 +03:00
Mikhail Lobanov
e766e6a924 RDMA/cxgb4: Added NULL check for lookup_atid
The lookup_atid() function can return NULL if the ATID is
invalid or does not exist in the identifier table, which
could lead to dereferencing a null pointer without a
check in the `act_establish()` and `act_open_rpl()` functions.
Add a NULL check to prevent null pointer dereferencing.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: cfdda9d764 ("RDMA/cxgb4: Add driver for Chelsio T4 RNIC")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosalinux.ru>
Link: https://patch.msgid.link/20240912145844.77516-1-m.lobanov@rosalinux.ru
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-16 11:52:19 +03:00
Junxian Huang
39c047d404 RDMA/hns: Fix ah error counter in sw stat not increasing
There are several error cases where hns_roce_create_ah() returns
directly without jumping to sw stat path, thus leading to a problem
that the ah error counter does not increase.

Fixes: ee20cc17e9 ("RDMA/hns: Support DSCP")
Fixes: eb7854d63d ("RDMA/hns: Support SW stats with debugfs")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240912115700.2016443-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-16 11:21:58 +03:00
Selvin Xavier
cc5b9b48d4 RDMA/bnxt_re: Recover the device when FW error is detected
If the FW crashes, L2 driver gets notified and it notifies
the RoCE driver. Currently driver doesn't re-initialize the
device. Add support for re-initialize the RoCE device.

RoCE device is removed and re-attached in the ulp_stop and
ulp_start respectively. The recovery logic expects the RoCE
driver to be registered with L2 driver while its being removed.
So the driver avoids unregistering with L2 driver in the
recovery path.

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1726027710-2292-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:29:34 +03:00
Selvin Xavier
94a9dc6ac8 RDMA/bnxt_re: Group all operations under add_device and remove_device
Adding and removing device need to be handled from multiple contexts
when Firmware error recovery is supported. So group all the add and remove
operations to add_device and remove_device function.

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1726027710-2292-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:29:34 +03:00
Chandramohan Akula
532929ad0a RDMA/bnxt_re: Use the aux device for L2 ULP callbacks
While registering with the L2 for ULP operations, use the
aux device pointer as the handle. Aux device has
the data bnxt_re_en_dev_info, which is used to
store required information for the bnxt_re_suspend
and bnxt_re_resume functions.

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1726027710-2292-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:29:34 +03:00
Chandramohan Akula
dee3da3422 RDMA/bnxt_re: Change aux driver data to en_info to hold more information
rdev will be destroyed and recreated during the FW error
recovery scenarios. So to keep the state, if any, use an
en_info structure which gets created/freed based on auxiliary
device initialization/de-initialization.

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1726027710-2292-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:29:34 +03:00
Chiara Meiohas
12fb1153c5 RDMA/nldev: Expose whether RDMA monitoring is supported
Extend the "rdma sys" command to display whether RDMA
monitoring is supported.

RDMA monitoring is not supported in mlx4 because it does
not use the ib_device_set_netdev() API, which sends the
RDMA events.

Example output for kernel where monitoring is supported:
$ rdma sys show
netns shared privileged-qkey off monitor on copy-on-fork on

Example output for kernel where monitoring is not supported:
$ rdma sys show
netns shared privileged-qkey off monitor off copy-on-fork on

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-8-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:29:34 +03:00
Chiara Meiohas
9cbed5aab5 RDMA/nldev: Add support for RDMA monitoring
Introduce a new netlink command to allow rdma event monitoring.
The rdma events supported now are IB device
registration/unregistration and net device attachment/detachment.

Example output of rdma monitor and the commands which trigger
the events:

$ rdma monitor
$ rmmod mlx5_ib
[UNREGISTER]	dev 1 rocep8s0f1
[UNREGISTER]	dev 0 rocep8s0f0

$ modprobe mlx5_ib
[REGISTER]	dev 2 mlx5_0
[NETDEV_ATTACH]	dev 2 mlx5_0 port 1 netdev 4 eth2
[REGISTER]	dev 3 mlx5_1
[NETDEV_ATTACH]	dev 3 mlx5_1 port 1 netdev 5 eth3

$ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
[UNREGISTER]	dev 2 rocep8s0f0
[REGISTER]	dev 4 mlx5_0
[NETDEV_ATTACH]	dev 4 mlx5_0 port 30 netdev 4 eth2

$ echo 4 > /sys/class/net/eth2/device/sriov_numvfs
[NETDEV_ATTACH]	dev 4 rdmap8s0f0 port 2 netdev 7 eth4
[NETDEV_ATTACH]	dev 4 rdmap8s0f0 port 3 netdev 8 eth5
[NETDEV_ATTACH]	dev 4 rdmap8s0f0 port 4 netdev 9 eth6
[NETDEV_ATTACH]	dev 4 rdmap8s0f0 port 5 netdev 10 eth7
[REGISTER]	dev 5 mlx5_0
[NETDEV_ATTACH]	dev 5 mlx5_0 port 1 netdev 11 eth8
[REGISTER]	dev 6 mlx5_0
[NETDEV_ATTACH]	dev 6 mlx5_0 port 1 netdev 12 eth9
[REGISTER]	dev 7 mlx5_0
[NETDEV_ATTACH]	dev 7 mlx5_0 port 1 netdev 13 eth10
[REGISTER]	dev 8 mlx5_0
[NETDEV_ATTACH]	dev 8 mlx5_0 port 1 netdev 14 eth11

$ echo 0 > /sys/class/net/eth2/device/sriov_numvfs
[UNREGISTER]	dev 5 rocep8s0f0v0
[UNREGISTER]	dev 6 rocep8s0f0v1
[UNREGISTER]	dev 7 rocep8s0f0v2
[UNREGISTER]	dev 8 rocep8s0f0v3
[NETDEV_DETACH]	dev 4 rdmap8s0f0 port 2
[NETDEV_DETACH]	dev 4 rdmap8s0f0 port 3
[NETDEV_DETACH]	dev 4 rdmap8s0f0 port 4
[NETDEV_DETACH]	dev 4 rdmap8s0f0 port 5

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-7-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:29:14 +03:00
Chiara Meiohas
8d159eb211 RDMA/mlx5: Use IB set_netdev and get_netdev functions
The IB layer provides a common interface to store and get net
devices associated to an IB device port (ib_device_set_netdev()
and ib_device_get_netdev()).
Previously, mlx5_ib stored and managed the associated net devices
internally.

Replace internal net device management in mlx5_ib with
ib_device_set_netdev() when attaching/detaching  a net device and
ib_device_get_netdev() when retrieving the net device.

Export ib_device_get_netdev().

For mlx5 representors/PFs/VFs and lag creation we replace the netdev
assignments with the IB set/get netdev functions.

In active-backup mode lag the active slave net device is stored in the
lag itself. To assure the net device stored in a lag bond IB device is
the active slave we implement the following:
- mlx5_core: when modifying the slave of a bond we send the internal driver event
  MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE.
- mlx5_ib: when catching the event call ib_device_set_netdev()

This patch also ensures the correct IB events are sent in switchdev lag.

While at it, when in multiport eswitch mode, only a single IB device is
created for all ports. The said IB device will receive all netdev events
of its VFs once loaded, thus to avoid overwriting the mapping of PF IB
device to PF netdev, ignore NETDEV_REGISTER events if the ib device has
already been mapped to a netdev.

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-6-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:27:40 +03:00
Chiara Meiohas
5f8ca04fdd RDMA/device: Remove optimization in ib_device_get_netdev()
The caller of ib_device_get_netdev() relies on its result to accurately
match a given netdev with the ib device associated netdev.

ib_device_get_netdev returns NULL when the IB device associated
netdev is unregistering, preventing the caller of matching netdevs properly.

Thus, remove this optimization and return the netdev even if
it is undergoing unregistration, allowing matching by the caller.

This change ensures proper netdev matching and reference count handling
by the caller of ib_device_get_netdev/ib_device_set_netdev API.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-5-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:27:40 +03:00
Chiara Meiohas
91b4b2c626 RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation
phys_port_cnt of the IB device must be initialized before calling
ib_device_set_netdev().

Previously, phys_port_cnt was initialized in the mlx5_ib init function.
Remove this initialization to allow setting it separately, providing
the flexibility to call ib_device_set_netdev before registering the
IB device.

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-4-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:27:40 +03:00
Mark Bloch
3ed7f9e239 RDMA/mlx5: Obtain upper net device only when needed
Report the upper device's state as the RDMA port state only in RoCE LAG or
switchdev LAG.

Fixes: 27f9e0ccb6 ("net/mlx5: Lag, Add single RDMA device in multiport mode")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-3-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:27:40 +03:00
Mark Bloch
303ee44ac4 RDMA/mlx5: Check RoCE LAG status before getting netdev
Check if RoCE LAG is active before calling the LAG layer for netdev.
This clarifies if LAG is active. No behavior changes with this patch.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-2-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:27:40 +03:00
Yishai Hadas
c77aec65e8 RDMA/mlx5: Consider the query_vuid cap for data_direct
Consider also the query_vuid cap before enabling the data_direct
functionality.

This may prevent a syndrome from the FW in case the query_vuid command
is not supported. (e.g. migratable VF)

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@nvidia.com>
Link: https://patch.msgid.link/274c4f6f1ac0b1078243dd296695a49dbe58e7d1.1725907637.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-13 08:27:40 +03:00
Michael Guralnik
6f2487bfaf RDMA/mlx5: Add implicit MR handling to ODP memory scheme
Implicit MRs in ODP memory scheme require allocating a private null mkey
and assigning the mkey and va differently in the KSM mkey.
The page faults are received on the null mkey so we also add storing the
null mkey in the odp_mkey xarray.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-8-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-11 14:56:33 +03:00
Michael Guralnik
e4fda2320f RDMA/mlx5: Add handling for memory scheme page fault events
The memory scheme page fault event is a new approch in handling page fault
on mkeys using the on-demand-paging feature.
The major shift in handling the page fault in this scheme is that the HW
is taking responsibilty for parsing the faulted mkey instead of the
previous approach where the driver would read and parse the wqes and
query the mkeys to get to the direct mkey that we need to handle.

Therefore, the event we get from FW in this scheme will contain the
direct mkey and address we need to handle and require much less work
from driver.

Additionally, to optimize performance, the FW can generate the event on
a memory area that is larger than the faulted memory operation is
requiring, to 'prefetch' memory that is around it and will likely be
used soon.

Unlike previous types of page fault, the memory page scheme fault does
not always require a resume command after handling the page fault as the FW
can post multiple events on same mkey and will set the 'last' flag only on
the page fault that requires the resume command.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-7-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-11 14:56:29 +03:00
Michael Guralnik
7f91510af9 RDMA/mlx5: Split ODP mkey search logic
Split the search for the ODP mkey when handling an rdma type page fault to
a helper function, later to be used in other page fault types.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-6-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-11 14:56:25 +03:00
Michael Guralnik
8c6d097d83 RDMA/mlx5: Enforce umem boundaries for explicit ODP page faults
The new memory scheme page faults are requesting the driver to fetch
additinal pages to the faulted memory access.
This is done in order to prefetch pages before and after the area that
got the page fault, assuming this will reduce the total amount of page
faults.

The driver should ensure it handles only the pages that are within the
umem range.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-5-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-11 14:56:19 +03:00
Michael Guralnik
64c68385a3 RDMA/mlx5: Add new ODP memory scheme eqe format
Add new fields to support the new memory scheme page fault and extend
the token field to u64 as in the new scheme the token is 48 bit.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-4-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-11 14:56:15 +03:00
Michael Guralnik
6cd9171d04 net/mlx5: Expose HW bits for Memory scheme ODP
Expose IFC bits to support the new memory scheme on demand paging.
Change the macro reading odp capabilities to be able to read from the
new IFC layout and align the code in upper layers to be compiled.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-3-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-11 14:56:12 +03:00
Michael Guralnik
cef7dde883 net/mlx5: Expand mkey page size to support 6 bits
Protect the usage of the 6th bit with the relevant capability to ensure
we are using the new page sizes with FW that supports the bit extension.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909100504.29797-2-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-11 14:56:07 +03:00
Junxian Huang
f4ccc0a2a0 RDMA/hns: Fix restricted __le16 degrades to integer issue
Fix sparse warnings: restricted __le16 degrades to integer.

Fixes: 5a87279591 ("RDMA/hns: Support hns HW stats")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409080508.g4mNSLwy-lkp@intel.com/
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240909065331.3950268-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:42:53 +03:00
Zhang Zekun
9cd30319bb IB/qib: Remove unused declarations in header file
The definition of qib_rc_rnr_retry() has been removed since
commit b4238e7057 ("IB/qib: Use new rdmavt timers"). Also, the definition
of mr_rcu_callback() has been remove since commit 7c2e11fe2d ("IB/qib:
Remove qp and mr functionality from qib"). So, let's remove the unused
declartions.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Link: https://patch.msgid.link/20240909121408.80079-3-zhangzekun11@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:41:03 +03:00
Zhang Zekun
e4ed570122 IB/iser: Remove unused declaration in header file
The definition of iser_finalize_rdma_unaligned_sg() has been removed
since commit dd0107a089 ("IB/iser: set block queue_virt_boundary").
Let's remove the unused declaration in header file.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Link: https://patch.msgid.link/20240909121408.80079-2-zhangzekun11@huawei.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:41:03 +03:00
Junxian Huang
fe51f6254d RDMA/hns: Optimize hem allocation performance
When allocating MTT hem, for each hop level of each hem that is being
allocated, the driver iterates the hem list to find out whether the
bt page has been allocated in this hop level. If not, allocate a new
one and splice it to the list. The time complexity is O(n^2) in worst
cases.

Currently the allocation for-loop uses 'unit' as the step size. This
actually has taken into account the reuse of last-hop-level MTT bt
pages by multiple buffer pages. Thus pages of last hop level will
never have been allocated, so there is no need to iterate the hem list
in last hop level.

Removing this unnecessary iteration can reduce the time complexity to
O(n).

Fixes: 38389eaa4d ("RDMA/hns: Add mtr support for mixed multihop addressing")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-9-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:06:39 +03:00
Chengchang Tang
ce196f6297 RDMA/hns: Fix 1bit-ECC recovery address in non-4K OS
The 1bit-ECC recovery address read from HW only contain bits 64:12, so
it should be fixed left-shifted 12 bits when used.

Currently, the driver will shift the address left by PAGE_SHIFT when
used, which is wrong in non-4K OS.

Fixes: 2de949abd6 ("RDMA/hns: Recover 1bit-ECC error of RAM on chip")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-8-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:06:39 +03:00
Junxian Huang
4321feefa5 RDMA/hns: Fix VF triggering PF reset in abnormal interrupt handler
In abnormal interrupt handler, a PF reset will be triggered even if
the device is a VF. It should be a VF reset.

Fixes: 2b9acb9a97 ("RDMA/hns: Add the process of AEQ overflow for hip08")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:06:39 +03:00
Chengchang Tang
74d315b5af RDMA/hns: Fix spin_unlock_irqrestore() called with IRQs enabled
Fix missuse of spin_lock_irq()/spin_unlock_irq() when
spin_lock_irqsave()/spin_lock_irqrestore() was hold.

This was discovered through the lock debugging, and the corresponding
log is as follows:

raw_local_irq_restore() called with IRQs enabled
WARNING: CPU: 96 PID: 2074 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x30/0x40
...
Call trace:
 warn_bogus_irq_restore+0x30/0x40
 _raw_spin_unlock_irqrestore+0x84/0xc8
 add_qp_to_list+0x11c/0x148 [hns_roce_hw_v2]
 hns_roce_create_qp_common.constprop.0+0x240/0x780 [hns_roce_hw_v2]
 hns_roce_create_qp+0x98/0x160 [hns_roce_hw_v2]
 create_qp+0x138/0x258
 ib_create_qp_kernel+0x50/0xe8
 create_mad_qp+0xa8/0x128
 ib_mad_port_open+0x218/0x448
 ib_mad_init_device+0x70/0x1f8
 add_client_context+0xfc/0x220
 enable_device_and_get+0xd0/0x140
 ib_register_device.part.0+0xf4/0x1c8
 ib_register_device+0x34/0x50
 hns_roce_register_device+0x174/0x3d0 [hns_roce_hw_v2]
 hns_roce_init+0xfc/0x2c0 [hns_roce_hw_v2]
 __hns_roce_hw_v2_init_instance+0x7c/0x1d0 [hns_roce_hw_v2]
 hns_roce_hw_v2_init_instance+0x9c/0x180 [hns_roce_hw_v2]

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:06:39 +03:00
wenglianfa
d586628b16 RDMA/hns: Fix the overflow risk of hem_list_calc_ba_range()
The max value of 'unit' and 'hop_num' is 2^24 and 2, so the value of
'step' may exceed the range of u32. Change the type of 'step' to u64.

Fixes: 38389eaa4d ("RDMA/hns: Add mtr support for mixed multihop addressing")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:06:39 +03:00
wenglianfa
fd8489294d RDMA/hns: Fix Use-After-Free of rsv_qp on HIP08
Currently rsv_qp is freed before ib_unregister_device() is called
on HIP08. During the time interval, users can still dereg MR and
rsv_qp will be used in this process, leading to a UAF. Move the
release of rsv_qp after calling ib_unregister_device() to fix it.

Fixes: 70f9252158 ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:06:39 +03:00
Junxian Huang
6928d264e3 RDMA/hns: Don't modify rq next block addr in HIP09 QPC
The field 'rq next block addr' in QPC can be updated by driver only
on HIP08. On HIP09 HW updates this field while driver is not allowed.

Fixes: 926a01dc00 ("RDMA/hns: Add QP operations support for hip08 SoC")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20240906093444.3571619-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-10 16:06:39 +03:00
WangYuli
dab2214fec treewide: correct the typo 'retun'
There are some spelling mistakes of 'retun' in comments which
should be instead of 'return'.

Link: https://lkml.kernel.org/r/63D0F870EE8E87A0+20240906054008.390188-1-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:47:43 -07:00
Selvin Xavier
227f51743b RDMA/bnxt_re: Fix the max WQE size for static WQE support
When variable size WQE is supported, max_qp_sges reported
is more than 6. For devices that supports variable size WQE,
the Send WQE size calculation is wrong when an an older library
that doesn't support variable size WQE is used.

Set the WQE size to 128 when static WQE is supported.

Fixes: de1d364c38 ("RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725444253-13221-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Selvin Xavier
c6b2b5c86d RDMA/bnxt_re: Fix the compatibility flag for variable size WQE
For older adapters that doesn't support variable size WQE,
driver is wrongly reporting that variable WQE is supported,
when the latest library is used.

Report the variable WQE capability only if the driver supports
it.

Fixes: 10a104c0de ("RDMA/bnxt_re: Enable variable size WQEs for user space applications")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725444253-13221-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
7ebb00cea4 RDMA/mlx5: Fix MR cache temp entries cleanup
Fix the cleanup of the temp cache entries that are dynamically created
in the MR cache.

The cleanup of the temp cache entries is currently scheduled only when a
new entry is created. Since in the cleanup of the entries only the mkeys
are destroyed and the cache entry stays in the cache, subsequent
registrations might reuse the entry and it will eventually be filled with
new mkeys without cleanup ever getting scheduled again.

On workloads that register and deregister MRs with a wide range of
properties we see the cache ends up holding many cache entries, each
holding the max number of mkeys that were ever used through it.

Additionally, as the cleanup work is scheduled to run over the whole
cache, any mkey that is returned to the cache after the cleanup was
scheduled will be held for less than the intended 30 seconds timeout.

Solve both issues by dropping the existing remove_ent_work and reusing
the existing per-entry work to also handle the temp entries cleanup.

Schedule the work to run with a 30 seconds delay every time we push an
mkey to a clean temp entry.
This ensures the cleanup runs on each entry only 30 seconds after the
first mkey was pushed to an empty entry.

As we have already been distinguishing between persistent and temp entries
when scheduling the cache_work_func, it is not being scheduled in any
other flows for the temp entries.

Another benefit from moving to a per-entry cleanup is we now not
required to hold the rb_tree mutex, thus enabling other flow to run
concurrently.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
ee6d57a2e1 RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache
When searching the MR cache for suitable cache entries, don't use mkeys
larger than twice the size required for the MR.
This should ensure the usage of mkeys closer to the minimal required size
and reduce memory waste.

On driver init we create entries for mkeys with clear attributes and
powers of 2 sizes from 4 to the max supported size.
This solves the issue for anyone using mkeys that fit these
requirements.

In the use case where an MR is registered with different attributes,
like an access flag we can't UMR, we'll create a new cache entry to store
it upon dereg.
Without this fix, any later registration with same attributes and smaller
size will use the newly created cache entry and it's mkeys, disregarding
the memory waste of using mkeys larger than required.

For example, one worst-case scenario can be when registering and
deregistering a 1GB mkey with ATS enabled which will cause the creation of
a new cache entry to hold those type of mkeys. A user registering a 4k MR
with ATS will end up using the new cache entry and an mkey that can
support a 1GB MR, thus wasting x250k memory than actually needed in the HW.

Additionally, allow all small registration to use the smallest size
cache entry that is initialized on driver load even if size is larger
than twice the required size.

Fixes: 73d09b2fe8 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
6f5cd6ac9a RDMA/mlx5: Fix counter update on MR cache mkey creation
After an mkey is created, update the counter for pending mkeys before
reshceduling the work that is filling the cache.

Rescheduling the work with a full MR cache entry and a wrong 'pending'
counter will cause us to miss disabling the fill_to_high_water flag.
Thus leaving the cache full but with an indication that it's still
needs to be filled up to it's full size (2 * limit).
Next time an mkey will be taken from the cache, we'll unnecessarily
continue the process of filling the cache to it's full size.

Fixes: 57e7071683 ("RDMA/mlx5: Implement mkeys management via LIFO queue")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/0f44f462ba22e45f72cb3d0ec6a748634086b8d0.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
30e6bd8d3b RDMA/mlx5: Drop redundant work canceling from clean_keys()
The canceling of dealyed work in clean_keys() is a leftover from years
back and was added to prevent races in the cleanup process of MR cache.
The cleanup process was rewritten a few years ago and the canceling of
delayed work and flushing of workqueue was added before the call to
clean_keys().

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/943d21f5a9dba7b98a3e1d531e3561ffe9745d71.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Cheng Xu
e77127ff64 RDMA/erdma: Return QP state in erdma_query_qp
Fix qp_state and cur_qp_state to return correct values in
struct ib_qp_attr.

Fixes: 1550557717 ("RDMA/erdma: Add verbs implementation")
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-4-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Cheng Xu
b80330f105 RDMA/erdma: Add disassociate ucontext support
All IO pages mapped to user space are handled by rdma_user_mmap_io,
so add empty stub for disassociate ucontext.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-3-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Cheng Xu
b24506f1c3 RDMA/erdma: Refactor the initialization and destruction of EQ
We extracted the common parts of the initialization/destruction
process to make the code cleaner.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-2-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:08 +03:00
Maher Sanalla
34efda1735 RDMA/mlx5: Enable ATS when allocating kernel MRs
When creating kernel MRs, it is not definitive whether they will be used
for peer-to-peer transactions or for other usecases, since address
mapping is performed only after the MR is created.

Since peer-to-peer transactions benefit significantly from ATS
performance-wise, enable ATS on newly-allocated kernel MRs when
supported.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@nvidia.com>
Link: https://patch.msgid.link/fafd4c9f14cf438d2882d88649c2947e1d05d0b4.1725273403.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:16:40 +03:00
Patrisious Haddad
1403c8b147 IB/core: Fix ib_cache_setup_one error flow cleanup
When ib_cache_update return an error, we exit ib_cache_setup_one
instantly with no proper cleanup, even though before this we had
already successfully done gid_table_setup_one, that results in
the kernel WARN below.

Do proper cleanup using gid_table_cleanup_one before returning
the err in order to fix the issue.

WARNING: CPU: 4 PID: 922 at drivers/infiniband/core/cache.c:806 gid_table_release_one+0x181/0x1a0
Modules linked in:
CPU: 4 UID: 0 PID: 922 Comm: c_repro Not tainted 6.11.0-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:gid_table_release_one+0x181/0x1a0
Code: 44 8b 38 75 0c e8 2f cb 34 ff 4d 8b b5 28 05 00 00 e8 23 cb 34 ff 44 89 f9 89 da 4c 89 f6 48 c7 c7 d0 58 14 83 e8 4f de 21 ff <0f> 0b 4c 8b 75 30 e9 54 ff ff ff 48 8    3 c4 10 5b 5d 41 5c 41 5d 41
RSP: 0018:ffffc90002b835b0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff811c8527
RDX: 0000000000000000 RSI: ffffffff811c8534 RDI: 0000000000000001
RBP: ffff8881011b3d00 R08: ffff88810b3abe00 R09: 205d303839303631
R10: 666572207972746e R11: 72746e6520444947 R12: 0000000000000001
R13: ffff888106390000 R14: ffff8881011f2110 R15: 0000000000000001
FS:  00007fecc3b70800(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000340 CR3: 000000010435a001 CR4: 00000000003706b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? show_regs+0x94/0xa0
 ? __warn+0x9e/0x1c0
 ? gid_table_release_one+0x181/0x1a0
 ? report_bug+0x1f9/0x340
 ? gid_table_release_one+0x181/0x1a0
 ? handle_bug+0xa2/0x110
 ? exc_invalid_op+0x31/0xa0
 ? asm_exc_invalid_op+0x16/0x20
 ? __warn_printk+0xc7/0x180
 ? __warn_printk+0xd4/0x180
 ? gid_table_release_one+0x181/0x1a0
 ib_device_release+0x71/0xe0
 ? __pfx_ib_device_release+0x10/0x10
 device_release+0x44/0xd0
 kobject_put+0x135/0x3d0
 put_device+0x20/0x30
 rxe_net_add+0x7d/0xa0
 rxe_newlink+0xd7/0x190
 nldev_newlink+0x1b0/0x2a0
 ? __pfx_nldev_newlink+0x10/0x10
 rdma_nl_rcv_msg+0x1ad/0x2e0
 rdma_nl_rcv_skb.constprop.0+0x176/0x210
 netlink_unicast+0x2de/0x400
 netlink_sendmsg+0x306/0x660
 __sock_sendmsg+0x110/0x120
 ____sys_sendmsg+0x30e/0x390
 ___sys_sendmsg+0x9b/0xf0
 ? kstrtouint+0x6e/0xa0
 ? kstrtouint_from_user+0x7c/0xb0
 ? get_pid_task+0xb0/0xd0
 ? proc_fail_nth_write+0x5b/0x140
 ? __fget_light+0x9a/0x200
 ? preempt_count_add+0x47/0xa0
 __sys_sendmsg+0x61/0xd0
 do_syscall_64+0x50/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 1901b91f99 ("IB/core: Fix potential NULL pointer dereference in pkey cache")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/79137687d829899b0b1c9835fcb4b258004c439a.1725273354.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-04 10:33:09 +03:00
Chris Mi
112e6e83a8 IB/mlx5: Fix UMR pd cleanup on error flow of driver init
The cited commit moves the pd allocation from function
mlx5r_umr_resource_cleanup() to a new function mlx5r_umr_cleanup().
So the fix in commit [1] is broken. In error flow, will hit panic [2].

Fix it by checking pd pointer to avoid panic if it is NULL;

[1] RDMA/mlx5: Fix UMR cleanup on error flow of driver init
[2]
 [  347.567063] infiniband mlx5_0: Couldn't register device with driver model
 [  347.591382] BUG: kernel NULL pointer dereference, address: 0000000000000020
 [  347.593438] #PF: supervisor read access in kernel mode
 [  347.595176] #PF: error_code(0x0000) - not-present page
 [  347.596962] PGD 0 P4D 0
 [  347.601361] RIP: 0010:ib_dealloc_pd_user+0x12/0xc0 [ib_core]
 [  347.604171] RSP: 0018:ffff888106293b10 EFLAGS: 00010282
 [  347.604834] RAX: 0000000000000000 RBX: 000000000000000e RCX: 0000000000000000
 [  347.605672] RDX: ffff888106293ad0 RSI: 0000000000000000 RDI: 0000000000000000
 [  347.606529] RBP: 0000000000000000 R08: ffff888106293ae0 R09: ffff888106293ae0
 [  347.607379] R10: 0000000000000a06 R11: 0000000000000000 R12: 0000000000000000
 [  347.608224] R13: ffffffffa0704dc0 R14: 0000000000000001 R15: 0000000000000001
 [  347.609067] FS:  00007fdc720cd9c0(0000) GS:ffff88852c880000(0000) knlGS:0000000000000000
 [  347.610094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  347.610727] CR2: 0000000000000020 CR3: 0000000103012003 CR4: 0000000000370eb0
 [  347.611421] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [  347.612113] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [  347.612804] Call Trace:
 [  347.613130]  <TASK>
 [  347.613417]  ? __die+0x20/0x60
 [  347.613793]  ? page_fault_oops+0x150/0x3e0
 [  347.614243]  ? free_msg+0x68/0x80 [mlx5_core]
 [  347.614840]  ? cmd_exec+0x48f/0x11d0 [mlx5_core]
 [  347.615359]  ? exc_page_fault+0x74/0x130
 [  347.615808]  ? asm_exc_page_fault+0x22/0x30
 [  347.616273]  ? ib_dealloc_pd_user+0x12/0xc0 [ib_core]
 [  347.616801]  mlx5r_umr_cleanup+0x23/0x90 [mlx5_ib]
 [  347.617365]  mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x36/0x40 [mlx5_ib]
 [  347.618025]  __mlx5_ib_add+0x96/0xd0 [mlx5_ib]
 [  347.618539]  mlx5r_probe+0xe9/0x310 [mlx5_ib]
 [  347.619032]  ? kernfs_add_one+0x107/0x150
 [  347.619478]  ? __mlx5_ib_add+0xd0/0xd0 [mlx5_ib]
 [  347.619984]  auxiliary_bus_probe+0x3e/0x90
 [  347.620448]  really_probe+0xc5/0x3a0
 [  347.620857]  __driver_probe_device+0x80/0x160
 [  347.621325]  driver_probe_device+0x1e/0x90
 [  347.621770]  __driver_attach+0xec/0x1c0
 [  347.622213]  ? __device_attach_driver+0x100/0x100
 [  347.622724]  bus_for_each_dev+0x71/0xc0
 [  347.623151]  bus_add_driver+0xed/0x240
 [  347.623570]  driver_register+0x58/0x100
 [  347.623998]  __auxiliary_driver_register+0x6a/0xc0
 [  347.624499]  ? driver_register+0xae/0x100
 [  347.624940]  ? 0xffffffffa0893000
 [  347.625329]  mlx5_ib_init+0x16a/0x1e0 [mlx5_ib]
 [  347.625845]  do_one_initcall+0x4a/0x2a0
 [  347.626273]  ? gcov_event+0x2e2/0x3a0
 [  347.626706]  do_init_module+0x8a/0x260
 [  347.627126]  init_module_from_file+0x8b/0xd0
 [  347.627596]  __x64_sys_finit_module+0x1ca/0x2f0
 [  347.628089]  do_syscall_64+0x4c/0x100

Fixes: 638420115c ("IB/mlx5: Create UMR QP just before first reg_mr occurs")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://patch.msgid.link/778c40c60287992da5d6ec92bb07b67f7bb5e6ef.1725273295.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-04 10:28:49 +03:00