The 2nd parameter of 'find_first_bit' is the number of bits to search.
In this case, we are passing 'sizeof(u64)' which is 8.
It is likely that the number of bits of 'port_mask' was expected here.
Use sizeof() * 8 to get the correct number.
It has been spotted by the following coccinelle script:
@@
expression ret, x;
@@
* ret = \(find_first_bit \| find_first_zero_bit\) (x, sizeof(...));
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The 2nd parameter of 'find_first_bit' is the number of bits to search.
In this case, we are passing 'sizeof(tmp)' which is likely to be 4 or 8
because 'tmp' is an 'unsigned long'.
It is likely that the number of bits of 'tmp' was expected here. So use
BITS_PER_LONG instead.
It has been spotted by the following coccinelle script:
@@
expression ret, x;
@@
* ret = \(find_first_bit \| find_first_zero_bit\) (x, sizeof(...));
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Majd Dibbiny <majd@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In all other places in this file where 'find_first_bit' is called,
port_num is defined as a 'u8' and no casting is done.
Do the same here in order to be more consistent.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Device notifications are not received after the first interface is
closed; since there is an unregister for notifications on every
interface close. Correct this by unregistering for device
notifications only when the last interface is closed. Also, make
all operations on the i40iw_notifiers_registered atomic as it
can be read/modified concurrently.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update iwqp->hw_iwarp_state to reflect the new state of the CQP
modify QP operation. This avoids reissuing a CQP operation to
modify a QP to a state that it is already in.
Fixes: 4e9042e647 ("i40iw: add hw and utils files")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Send a zero length last streaming mode message for loopback
connections to synchronize between accepting QP and connecting QP.
This avoids data transfer to start on the accepting QP before
the connecting QP is in RTS. Also remove function i40iw_loopback_nop()
as it is no longer used.
Fixes: f27b4746f3 ("i40iw: add connection management code")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch is meant to add support of ACPI to the Hisilicon RoCE
driver.
Changes done are primarily meant to detect the type and then either
use DT specific or ACPI spcific functions. Where ever possible,
this patch tries to make use of Unified Device Property Interface
APIs to support both DT and ACPI through single interface.
This patch depends upon HNS ethernet driver to Reset RoCE. This
function within HNS ethernet driver has also been enhanced to
support ACPI and is part of other accompanying patch with this
patch-set.
NOTE: The changes in this patch are done over below branch,
https://github.com/dledford/linux/tree/hns-roce
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If port_guid is set with the default subnet_prefix, then we get a change
event and run a port refresh, we don't update the port_guid. As a
result, attempts to create a target device that uses the new
subnet_prefix in the wwn will fail to find a match and be rejected by
the ib_srpt driver. This makes it impossible to configure a port if it
was initialized with a default subnet_prefix and later changed to any
non-default subnet-prefix. Updating the port refresh task to always
update the wwn based upon the current subnext_prefix solves this
problem.
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: nab@linux-iscsi.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Current driver is reporting wrong values for max_sge and
max_sge_rd in query_device. This breaks the nfs rdma and iser
in some device profiles. Fixing the driver to report
correct values from FW.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
iwpbl->iwmr points to the structure that contains iwpbl,
which is iwmr. Setting this to NULL would result in
writing to freed memory. So just free iwmr, and return.
Fixes: d374984179 ("i40iw: add files for iwarp interface")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Memory allocated for iwqp; iwqp->allocated_buffer is freed twice in
the create_qp error path. Correct this by having it freed only once in
i40iw_free_qp_resources().
Fixes: d374984179 ("i40iw: add files for iwarp interface")
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This file does not use any structs or functions defined by io-mapping.h
(nor does it directly use iomap, ioremap, iounamp or friends). Remove it
to simplify verification of changes to io-mapping.h
The include existed since its inception in
commit e126ba97db
Author: Eli Cohen <eli@mellanox.com>
Date: Sun Jul 7 17:25:49 2013 +0300
mlx5: Add driver for Mellanox Connect-IB adapters
which looks like a copy across from the Mellanox ethernet driver.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Eli Cohen <eli@mellanox.com>
Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Matan Barak <matanb@mellanox.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_free_virt_mem(), do not set mem->va to NULL
after freeing it as mem->va is a self-referencing pointer
to mem.
Fixes: 4e9042e647 ("i40iw: add hw and utils files")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add NULL check for pdata and pdata->addr before the memcpy in
i40iw_form_cm_frame(). This fixes a NULL pointer de-reference
which occurs when the MPA private data pointer is NULL. Also
only copy pdata->size bytes in the memcpy to prevent reading
past the length of the private data buffer provided by upper layer.
Fixes: f27b4746f3 ("i40iw: add connection management code")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Current cxgb4 arm CQ logic ignores IB_CQ_REPORT_MISSED_EVENTS for
request completion notification on a CQ. Due to this ib_poll_handler()
assumes all events polled and avoids further iopoll scheduling.
This patch adds logic to cxgb4 ib_req_notify_cq() handler to check if
CQ is not empty and return accordingly. Based on the return value of
ib_req_notify_cq() handler, ib_poll_handler() will schedule a run of
iopoll handler.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_open(), check if interface is already open
and return success if it is.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_alloc_resource(), ensure that the update to
req_resource_num is protected by the lock.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
iwdev->mem_resources is incorrectly defined as an unsigned
long instead of u8. As a result, the offset into the dynamic
allocated structures in i40iw_initialize_hw_resources() is
incorrectly calculated and would lead to writing of memory
regions outside of the allocated buffer.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reuse existing functionality from memdup_user() instead of keeping
duplicate source code.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The i40iw initiator sends an MPA-request with ird=16 and ord=16. The cxgb4
responder sends an MPA-reply with ord = 32 causing i40iw to terminate
due to insufficient resources.
The logic to reduce the ORD to <= peer's IRD was wrong.
Reported-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The i40iw initiator sends an MPA-request with ird = 63, ord = 63. The
cxgb4 responder sends a RST. Since the inbound ord=63 and it exceeds
the max_ird/c4iw_max_read_depth (=32 default), chelsio decides to abort.
Instead, cxgb4 should adjust the ord/ird down before presenting it to
the ULP.
Reported-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The unwind logic for creating a user QP has a double vfree
of the non-shared receive queue when handling a "too many qps"
failure.
The code unwinds the mmmap info by decrementing a reference
count which will call rvt_release_mmap_info() which in turn
does the vfree() of the r_rq.wq. The unwind code then does
the same free.
Fix by guarding the vfree() with the same test that is done
in close and only do the vfree() if qp->ip is NULL.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, J_KEY generation was based on the lower 16 bits
of the user's UID. While this works, it was not good enough
as a non-root user could collide with a root user given a
sufficiently large UID.
This patch attempt to improve the J_KEY generation by using
the following algorithm:
The 16 bit J_KEY space is partitioned into 3 separate spaces
reserved for different user classes:
* all users with administtor privileges (including 'root')
will use J_KEYs in the range of 0 to 31,
* all kernel protocols, which use KDETH packets will use
J_KEYs in the range of 32 to 63, and
* all other users will use J_KEYs in the range of 64 to
65535.
The above separation is aimed at preventing different user levels
from sending packets to each other and, additionally, separate
kernel protocols from all other types of users. The later is meant
to prevent the potential corruption of kernel memory by any other
type of user.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The driver does not check if the CableInfo query is supported for the
port type. Return early if CableInfo is not supported for the port type,
making compliance with the specification explicit and preventing lower
level code from potentially doing the wrong thing if the query is not
supported for the hardware implementation.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If 'pci_register_driver' fails, we return 'err' which is known to be 0.
Return the error instead.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It is likely that checking the result of 'setup_ctxt' is expected here.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Validate the etype to insure that the header is correct.
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The "packet" parameter was being passed on the stack,
change it to a pointer.
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The monitor values from bytes 22 through 81 of the QSFP memory space
(SFF 8636) are dynamic and serving them out of the QSFP memory cache
maintained by the driver provides stale data to the CableInfo SMA query.
This patch refreshes the dynamic values from the QSFP memory on request
and overwrites the stale data from the cache for the overlap between the
requested range and the monitor range.
Reviewed-by: Jubin John <jubin.john@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The qp init function does a kzalloc() while holding the RCU
lock that encounters the following warning with a debug kernel
when a cat of the qp_stats is done:
[ 231.723948] rcu_scheduler_active = 1, debug_locks = 0
[ 231.731939] 3 locks held by cat/11355:
[ 231.736492] #0: (debugfs_srcu){......}, at: [<ffffffff813001a5>] debugfs_use_file_start+0x5/0x90
[ 231.746955] #1: (&p->lock){+.+.+.}, at: [<ffffffff81289a6c>] seq_read+0x4c/0x3c0
[ 231.755873] #2: (rcu_read_lock){......}, at: [<ffffffffa0a0c535>] _qp_stats_seq_start+0x5/0xd0 [hfi1]
[ 231.766862]
The init functions do an implicit next which requires the rcu read lock
before the kzalloc().
Fix for both drivers is to change the scope of the init function to only
do the allocation and the initialization of the just allocated iter.
The implict next is moved back into the respective start functions to fix
the issue.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
CC: <stable@vger.kernel.org> # 4.6.x-
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix to return error code -ENOMEM from the alloc error handling
case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
'work' and 'route->path_rec' are malloced in cma_resolve_iboe_route()
and should be freed before leaving from the error handling cases,
otherwise it will cause memory leak.
Fixes: 200298326b ('IB/core: Validate route when we init ah')
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If CONFIG_FRAME_WARN is small (1K) and CONFIG_NR_CPUS big
then a frame size warning is triggered during build.
Allocate the cpu mask dynamically to silence the warning.
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Error code EAGAIN should be used when errors are temporary and next call
might succeeds.
When error code other than EAGAIN is returned, the caller (mlx4_ib_poll)
will assume all CQE in the same bunch are error too and will drop them all.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need to return int if function always returns 0
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In case of error, the function devm_ioremap_resource() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch added Kconfig and Makefile for building RoCE module.
Signed-off-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Nenglong Zhao <zhaonenglong@hisilicon.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
These are the various new source code files for the Hisilicon
RoCE driver for ARM architecture.
Signed-off-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Nenglong Zhao <zhaonenglong@hisilicon.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Replaced mlx5_query_port_proto_oper with separate functions per link
type. The functions should take different arguments so no point in
trying to unite them.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Now as all commands use mlx5 ifc interface, instead of doing two calls
for executing a command we embed command status checking into
mlx5_cmd_exec to simplify the interface.
Also we do here some cleanup for redundant software structures
(inbox/outbox) and functions and improved command failure output.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Prior to this patch we assumed that modify QP commands have the
same layout.
In ConnectX-4 for each QP transition there is a specific command
and their layout can vary.
e.g: 2err/2rst commands don't have QP context in their layout and before
this patch we posted the QP context in those commands.
Fortunately the FW only checks the suffix of the commands and executes
them, while ignoring all invalid data sent after the valid command
layout.
This patch removes mlx5_modify_qp_mbox_in and changes
mlx5_core_qp_modify to receive the required transition and QP context
with opt_param_mask if needed. This way the caller is not required to
provide the command inbox layout and it will be generated automatically.
mlx5_core_qp_modify will generate the command inbox/outbox layouts
according to the requested transition and will fill the requested
parameters.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove old representation of manually created QP/XRCD commands layout
amd use mlx5_ifc canonical structures and defines.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove old representation of manually created MKey/PSV commands layout,
and use mlx5_ifc canonical structures and defines.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove old representation of manually created CQ commands layout,
and use mlx5_ifc canonical structures and defines.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
- hfi1 driver updates
- Fix for max SGEs allowed via RDMA R/W API
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXoqUzAAoJELgmozMOVy/dNKAP/1/Rzn/k97eda1qFqzWpqsPl
lMaxDiZZnRIAFJEqEF9Iwo1JLiFIzjpDJnqHB++CKuXZQT0NY6sHW0yrcyUwzsx7
5gui92ldkVg4vY7PTco171vyzG+79KKRZ1dFS14z7oC8XAg48zQ7yJmfb1op3dEw
mgxyoLaaMwMF5aLwPoWG4+aPkBMtKUGB/ARb4ehq6M2p71c43lb18GaarJuWLdAz
1HxakXL/uzttyvGDyJGKDrT6ktXXSyvdCTRO60OrrPFJ67P2xRYXce85TLRr8srp
Q5RNjyR5fP8uN0qtrQz+hl09mtBeBQHKomyFIOVwkB2r53OKqsR5g5roz3BlpA1X
7PF/MO0pKy4t8XQnLfohEwtNWgszupvxkyAAISI8MwzLOPra/V8smQ9CpTltx1UB
hTu3tpAMy1auAjh8TWzzzII1ZoRZz6YCTziWnTaC3bqAljufjt1mnvjrtNmQ1sNi
MCLeA3yr8HjlKWdwYr+gVfhSR1wEoOxwHZdLsvBsxmC32hFLlh6rbg2x8wceqTlR
4T8l0AERV1YPjsoSe3/pWVImKUA97qppIfeFcCZiBCBHBPlhpw3ebVt6B1mLVUCV
hTMuZeFVcV75D+qr0kR5ZuVn4jgEn9zB1VH3tCV9LJnhBfySZFcP4yhATqiELaHG
RVoVAiTBxq5RgNVOH4Zo
=cQcp
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull second round of rdma updates from Doug Ledford:
"This can be split out into just two categories:
- fixes to the RDMA R/W API in regards to SG list length limits
(about 5 patches)
- fixes/features for the Intel hfi1 driver (everything else)
The hfi1 driver is still being brought to full feature support by
Intel, and they have a lot of people working on it, so that amounts to
almost the entirety of this pull request"
* tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (84 commits)
IB/hfi1: Add cache evict LRU list
IB/hfi1: Fix memory leak during unexpected shutdown
IB/hfi1: Remove unneeded mm argument in remove function
IB/hfi1: Consistently call ops->remove outside spinlock
IB/hfi1: Use evict mmu rb operation
IB/hfi1: Add evict operation to the mmu rb handler
IB/hfi1: Fix TID caching actions
IB/hfi1: Make the cache handler own its rb tree root
IB/hfi1: Make use of mm consistent
IB/hfi1: Fix user SDMA racy user request claim
IB/hfi1: Fix error condition that needs to clean up
IB/hfi1: Release node on insert failure
IB/hfi1: Validate SDMA user iovector count
IB/hfi1: Validate SDMA user request index
IB/hfi1: Use the same capability state for all shared contexts
IB/hfi1: Prevent null pointer dereference
IB/hfi1: Rename TID mmu_rb_* functions
IB/hfi1: Remove unneeded empty check in hfi1_mmu_rb_unregister()
IB/hfi1: Restructure hfi1_file_open
IB/hfi1: Make iovec loop index easy to understand
...
- Updates/fixes for iw_cxgb4 driver
- Updates/fixes for mlx5 driver
- Add flow steering and RSS API
- Add hardware stats to mlx4 and mlx5 drivers
- Add firmware version API for RDMA driver use
- Add the rxe driver (this is a software RoCE driver that makes any
Ethernet device a RoCE device)
- Fixes for i40iw driver
- Support for send only multicast joins in the cma layer
- Other minor fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXo1vCAAoJELgmozMOVy/d0HcQAJqMi7siD9cSaMViYbu812pq
3kNkHZbLNB/947uShDPhhFAWFXU0nRxEnTNSvYxRo+nxnDE/9hEEXpx8OzzKLNU+
GXyDeHsEEriSFcaSne5Tak/QuiFm3PJv73ttXQROCtHG7KxLG9ieVbfusz42Xwiu
5R21qfp6PZEOC+j7L/fTZh/kEN3cfaDYrGnCgmU3z0ka9xG5Qe2/+uWGNkuioRA5
phFUR4MS+1n/VrnxPHrLXTrqv3sw8YfCfRImaXSBrxFVMqhno+cDDtEJQCRnmNrq
7KcJO2KqDMl/QqsjxdwqojNpUTh2t7SeOeQuzUsfXl15yyyetq2Zu7ZurkCGjNtQ
NtTt6hv5eXq3mNuBmOPKYDDgakSYyYjS0zueoi8wFFqIeSYxRJv4wx4xoeJ/Bsz8
2LplpaPMQaTM65FhzYXGhYNBKaRkqjL9ihbIl1OcLNvfXAqLElfONM17/Yc/hgVw
xfDtvNFrZcl7/exIpBBNOnxwbs4h78vvXsXoBiVoN7V/hBnMzDhkiBHNxNCfZXA0
REGs/cnyy6cpiJOnVCWs77NqL75oK/qb1mEwe1M+A2kaxe/tLixUdYXo/zclDPm8
3DLTL9lCgJIBIEiZT4q/alxLK+yUKD+SHtQT3lmF2Bfsmv/I38Uy55SXAiFO4yOq
kwy96TvYtT43SkyNmmBf
=oZOO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull base rdma updates from Doug Ledford:
"Round one of 4.8 code: while this is mostly normal, there is a new
driver in here (the driver was hosted outside the kernel for several
years and is actually a fairly mature and well coded driver). It
amounts to 13,000 of the 16,000 lines of added code in here.
Summary:
- Updates/fixes for iw_cxgb4 driver
- Updates/fixes for mlx5 driver
- Add flow steering and RSS API
- Add hardware stats to mlx4 and mlx5 drivers
- Add firmware version API for RDMA driver use
- Add the rxe driver (this is a software RoCE driver that makes any
Ethernet device a RoCE device)
- Fixes for i40iw driver
- Support for send only multicast joins in the cma layer
- Other minor fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (72 commits)
Soft RoCE driver
IB/core: Support for CMA multicast join flags
IB/sa: Add cached attribute containing SM information to SA port
IB/uverbs: Fix race between uverbs_close and remove_one
IB/mthca: Clean up error unwind flow in mthca_reset()
IB/mthca: NULL arg to pci_dev_put is OK
IB/hfi1: NULL arg to sc_return_credits is OK
IB/mlx4: Add diagnostic hardware counters
net/mlx4: Query performance and diagnostics counters
net/mlx4: Add diagnostic counters capability bit
Use smaller 512 byte messages for portmapper messages
IB/ipoib: Report SG feature regardless of HW UD CSUM capability
IB/mlx4: Don't use GFP_ATOMIC for CQ resize struct
IB/hfi1: Disable by default
IB/rdmavt: Disable by default
IB/mlx5: Fix port counter ID association to QP offset
IB/mlx5: Fix iteration overrun in GSI qps
i40iw: Add NULL check for puda buffer
i40iw: Change dup_ack_thresh to u8
i40iw: Remove unnecessary check for moving CQ head
...
Soft RoCE (RXE) - The software RoCE driver
ib_rxe implements the RDMA transport and registers to the RDMA core
device as a kernel verbs provider. It also implements the packet IO
layer. On the other hand ib_rxe registers to the Linux netdev stack
as a udp encapsulating protocol, in that case RDMA, for sending and
receiving packets over any Ethernet device. This yields a RDMA
transport over the UDP/Ethernet network layer forming a RoCEv2
compatible device.
The configuration procedure of the Soft RoCE drivers requires
binding to any existing Ethernet network device. This is done with
/sys interface.
A userspace Soft RoCE library (librxe) provides user applications
the ability to run with Soft RoCE devices. The use of rxe verbs ins
user space requires the inclusion of librxe as a device specifics
plug-in to libibverbs. librxe is packaged separately.
Architecture:
+-----------------------------------------------------------+
| Application |
+-----------------------------------------------------------+
+-----------------------------------+
| libibverbs |
User +-----------------------------------+
+----------------+ +----------------+
| librxe | | HW RoCE lib |
+----------------+ +----------------+
+---------------------------------------------------------------+
+--------------+ +------------+
| Sockets | | RDMA ULP |
+--------------+ +------------+
+--------------+ +---------------------+
| TCP/IP | | ib_core |
+--------------+ +---------------------+
+------------+ +----------------+
Kernel | ib_rxe | | HW RoCE driver |
+------------+ +----------------+
+------------------------------------+
| NIC driver |
+------------------------------------+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+-----------------------------------------------------------+
| Application |
+-----------------------------------------------------------+
+-----------------------------------+
| libibverbs |
User +-----------------------------------+
+----------------+ +----------------+
| librxe | | HW RoCE lib |
+----------------+ +----------------+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+--------------+ +------------+
| Sockets | | RDMA ULP |
+--------------+ +------------+
+--------------+ +---------------------+
| TCP/IP | | ib_core |
+--------------+ +---------------------+
+------------+ +----------------+
Kernel | ib_rxe | | HW RoCE driver |
+------------+ +----------------+
+------------------------------------+
| NIC driver |
+------------------------------------+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Soft RoCE resources:
[1[ https://github.com/SoftRoCE/librxe-dev librxe - source code in
Github
[2] https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home - Soft RoCE
Wiki page
[3] https://github.com/SoftRoCE/librxe-dev - Soft RoCE userspace library
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The dma-mapping core and the implementations do not change the DMA
attributes passed by pointer. Thus the pointer can point to const data.
However the attributes do not have to be a bitfield. Instead unsigned
long will do fine:
1. This is just simpler. Both in terms of reading the code and setting
attributes. Instead of initializing local attributes on the stack
and passing pointer to it to dma_set_attr(), just set the bits.
2. It brings safeness and checking for const correctness because the
attributes are passed by value.
Semantic patches for this change (at least most of them):
virtual patch
virtual context
@r@
identifier f, attrs;
@@
f(...,
- struct dma_attrs *attrs
+ unsigned long attrs
, ...)
{
...
}
@@
identifier r.f;
@@
f(...,
- NULL
+ 0
)
and
// Options: --all-includes
virtual patch
virtual context
@r@
identifier f, attrs;
type t;
@@
t f(..., struct dma_attrs *attrs);
@@
identifier r.f;
@@
f(...,
- NULL
+ 0
)
Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Acked-by: Mark Salter <msalter@redhat.com> [c6x]
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris]
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm]
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp]
Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core]
Acked-by: David Vrabel <david.vrabel@citrix.com> [xen]
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb]
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32]
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc]
Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added UCMA and CMA support for multicast join flags. Flags are
passed using UCMA CM join command previously reserved fields.
Currently supporting two join flags indicating two different
multicast JoinStates:
1. Full Member:
The initiator creates the Multicast group(MCG) if it wasn't
previously created, can send Multicast messages to the group
and receive messages from the MCG.
2. Send Only Full Member:
The initiator creates the Multicast group(MCG) if it wasn't
previously created, can send Multicast messages to the group
but doesn't receive any messages from the MCG.
IB: Send Only Full Member requires a query of ClassPortInfo
to determine if SM/SA supports this option. If SM/SA
doesn't support Send-Only there will be no join request
sent and an error will be returned.
ETH: When Send Only Full Member is requested no IGMP join
will be sent.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Added a new SA port attribute containing SM ClassPortInfo fields,
(ClassPortInfo fields: Table 126 IB Spec 1.3.). This is useful for
checking SM support for specific features. The attribute is cached
to avoid resending queries, caching is done when a successful
ClassPortInfo reply is received on the port. Invalidation of the
attribute is done on SM change events, SM re-registration events,
and SM LID change events. The fields in ClassPortInfo should not
change during SM runtime without an event.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fixes an oops that might happen if uverbs_close races with
remove_one.
Both contexts may run ib_uverbs_cleanup_ucontext, it depends
on the flow.
Currently, there is no protection for a case that remove_one
didn't make the cleanup it runs to its end, the underlying
ib_device was freed then uverbs_close will call
ib_uverbs_cleanup_ucontext and OOPs.
Above might happen if uverbs_close deleted the file from the list
then remove_one didn't find it and runs to its end.
Fixes to protect against that case by a new cleanup lock so that
ib_uverbs_cleanup_ucontext will be called always before that
remove_one is ended.
Fixes: 35d4a0b63d ("IB/uverbs: Fix race between ib_uverbs_open and remove_one")
Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The kfree() function was called in a few cases by the mthca_reset()
function during error handling even if the passed variables "bridge_header"
and "hca_header" contained a null pointer.
Adjust jump targets according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The pci_dev_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The sc_return_credits() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose IB diagnostic hardware counters.
The counters count IB events and are applicable for IB and RoCE.
The counters can be divided into two groups, per device and per port.
Device counters are always exposed.
Port counters are exposed only if the firmware supports per port counters.
rq_num_dup and sq_num_to are only exposed if we have firmware support
for them, if we do, we expose them per device and per port.
rq_num_udsdprd and num_cqovf are device only counters.
rq - denotes responder.
sq - denotes requester.
|-----------------------|---------------------------------------|
| Name | Description |
|-----------------------|---------------------------------------|
|rq_num_lle | Number of local length errors |
|-----------------------|---------------------------------------|
|sq_num_lle | number of local length errors |
|-----------------------|---------------------------------------|
|rq_num_lqpoe | Number of local QP operation errors |
|-----------------------|---------------------------------------|
|sq_num_lqpoe | Number of local QP operation errors |
|-----------------------|---------------------------------------|
|rq_num_lpe | Number of local protection errors |
|-----------------------|---------------------------------------|
|sq_num_lpe | Number of local protection errors |
|-----------------------|---------------------------------------|
|rq_num_wrfe | Number of CQEs with error |
|-----------------------|---------------------------------------|
|sq_num_wrfe | Number of CQEs with error |
|-----------------------|---------------------------------------|
|sq_num_mwbe | Number of Memory Window bind errors |
|-----------------------|---------------------------------------|
|sq_num_bre | Number of bad response errors |
|-----------------------|---------------------------------------|
|sq_num_rire | Number of Remote Invalid request |
| | errors |
|-----------------------|---------------------------------------|
|rq_num_rire | Number of Remote Invalid request |
| | errors |
|-----------------------|---------------------------------------|
|sq_num_rae | Number of remote access errors |
|-----------------------|---------------------------------------|
|rq_num_rae | Number of remote access errors |
|-----------------------|---------------------------------------|
|sq_num_roe | Number of remote operation errors |
|-----------------------|---------------------------------------|
|sq_num_tree | Number of transport retries exceeded |
| | errors |
|-----------------------|---------------------------------------|
|sq_num_rree | Number of RNR NAK retries exceeded |
| | errors |
|-----------------------|---------------------------------------|
|rq_num_rnr | Number of RNR NAKs sent |
|-----------------------|---------------------------------------|
|sq_num_rnr | Number of RNR NAKs received |
|-----------------------|---------------------------------------|
|rq_num_oos | Number of Out of Sequence requests |
| | received |
|-----------------------|---------------------------------------|
|sq_num_oos | Number of Out of Sequence NAKs |
| | received |
|-----------------------|---------------------------------------|
|rq_num_udsdprd | Number of UD packets silently |
| | discarded on the Receive Queue due to |
| | lack of receive descriptor |
|-----------------------|---------------------------------------|
|rq_num_dup | Number of duplicate requests received |
|-----------------------|---------------------------------------|
|sq_num_to | Number of time out received |
|-----------------------|---------------------------------------|
|num_cqovf | Number of CQ overflows |
|-----------------------|---------------------------------------|
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Portmapper messages are short and do not occupy more than 512 bytes.
Lower portmapper message size to 512 bytes. This change significantly
reduces the amount of memory needed when trying to establish a large
number of connections simultaneously. The old value is based on page
size.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Decouple SG support from HW ability to do UD checksum.
This coupling is for historical reasons and removed with 'commit
ec5f061564 ("net: Kill link between CSUM and SG features.")'
During driver load it is assumed that device does not supports SG. The
final decision is taken after creating UD QP based on device capability.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We allocate a small tracking structure as part of mlx4_ib_resize_cq().
However, we don't need to use GFP_ATOMIC -- immediately after the
allocation, we call mlx4_cq_resize(), which allocates a command
mailbox with GFP_KERNEL and then sleeps on a firmware command, so we
better not be in an atomic context.
This actually has a real impact, because when this GFP_ATOMIC
allocation fails (and GFP_ATOMIC does fail in practice) then a
userspace consumer resizing a CQ will get a spurious failure that we
can easily avoid.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a strict policy in the Linux kernel that new drivers must be
disabled by default. Hence leave out the "default m" line from Kconfig.
Fixes: f48ad614c1 ("IB/hfi1: Move driver out of staging")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Jubin John <jubin.john@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: <stable@vger.kernel.org> # v4.7+
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a strict policy in the Linux kernel that new drivers must be
disabled by default. Hence leave out the "default m" line from Kconfig.
Fixes: 0194621b22 ("IB/rdmavt: Create module framework and handle driver registration")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Jubin John <jubin.john@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: <stable@vger.kernel.org> # v4.6+
Signed-off-by: Doug Ledford <dledford@redhat.com>
The original code used a LRU list to evict nodes which were least
recently used. For correctness the evict code was moved under the
handler->lock, now add back the LRU list.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
During an unexpected shutdown, references to tid_rb_node were NULL'ed out
without properly being released.
Fix this by calling clear_tid_node in the mmu notifier remove callback
rather than after these callbacks are called.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The reworked mmu_rb interface allows the unused mm argument to be removed.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ops->remove() callback was called by hfi1_mmu_unregister() with a
NULL mm argument while holding a spinlock. In the case of sdma_rb_remove()
this caused it to pass current->mm to hfi1_release_user_pages()
This had 2 problems. First this would attempt to acquire the mmap_sem
under a spin lock. Second the use of current->mm is not always guaranteed
to be the proper mm when the fd is being closed.
Rather than depend on this implicit behavior we move all calls to
ops->remove outside of the spinlock. This also allows the correct
mm to be used in the remove callback without fear of deadlock.
Because the MMU notifier is not guaranteed to hold mm->mmap_sem, but
usually does, we must delay all remove callbacks until out of the notifier,
when the callbacks can take the mmap_sem if they need to.
Code comments were added to clarify what the expectations are for the
users of the mmu rb tree.
Suggested-by: Jim Foraker <foraker1@llnl.gov>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use the new cache evict operation in the SDMA code. This allows the cache
to properly coordinate evicts and removes, preventing any race. With this
change, the separate list, lock, and race flag are not needed.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Allow users to clear nodes from the rb tree based on their evict callback.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Per file descriptor TID caching actions depend on a global that can
change midway through the lifetime of that file descriptor.
Make the use of caching consistent for the life of the file descriptor
by using the presence of the cache handler to decide when to use the cache
functions.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The objects which use cache handling should reference their own handler
object not the internal data structure it uses to track the nodes.
Have the "users" of the mmu notifier code pass opaque objects which can
then be properly used in the mmu callbacks depending on the owners needs.
This patch has the additional benefit that operations no longer require a
look up in a list to find the handlers.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The hfi1 driver registers a mmu_notifier callback when /dev/hfi1_* is
opened, and unregisters it when the device is closed. The driver
incorrectly assumes that the close will always happen from the same
context as the open. In particular, closes due to SIGKILL or OOM killer
activity may happen from a different context. In these cases, the wrong
mm is passed to mmu_notifier_unregister(), which causes improper reference
counting for the victim mm, and eventual memory corruption.
Preserve the mm for all open file descriptors and use this mm rather than
current->mm for memory operations for the lifetime of that fd. Note: this
patch leaves 1 use of current->mm in place. This use is removed in a
follow on patch because other functional changes were required prior to
that use being removed.
If registration fails, there is no reason to keep the handler object
around. Free the handler object rather than add it to the list to
prevent any mmu_notifier operations, including unregister, when
registration fails.
Suggested-by: Jim Foraker <foraker1@llnl.gov>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The user SDMA in-use claim bit is in the structure that gets zeroed out
once the claim is made. Move the request in-use flag into its own bit
array and use that for atomic claims. This cleans up the claim code and
removes any race possibility.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If input validation fails, properly free the request before returning.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If unable to insert node into the RB tree cache, node will be freed
before returning from the function. Null out iovec's pointer to node
so iovec does not try to free it later.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Save the current capability state at user context creation
time. Report this saved value for all shared contexts.
Also get rid of unnecessary hfi1_get_base_kinfo function.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If a context has not been assigned or assignment failed, pq may be NULL.
Move the unregister within the protection of the null check.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Clarify the names of the TID mmu functions.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Checking if the rb tree is empty is redundant with the while loop which is
emptying the rb tree.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Rearrange the file open call in prep for new changes.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For bool parameters "false" should be used
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
subctxt is not used, just remove it.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
__mmu_rb_remove was called in only 1 place which was a very simple
call site. Combine this function into its caller.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove, insert, and invalidate are always provided. No
need to test.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This makes it more clear what these functions are
operating on.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Parameter names to function declarations make it more clear
what those parameters do.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
These are no longer needed.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Brackets should be on the next line of a function
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expand the serial number space by using more bits
from the GUID.
Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The driver pads non-double word multiple message sizes but it doesn't
account for this padding when the packet length is calculated. Also, the
data length is miscalculated for message sizes less than 4 bytes due to
the bit representation in LRH. And there's a check for non-double word
multiple message sizes that prevents these messages from being sent.
This patch fixes length miscalculations and enables the functionality to
send non-double word multiple message sizes.
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The use of the specific opcode test is redundant since
all ack entry users correctly manipulate the mr pointer
to selectively trigger the reference clearing.
The overly specific test hinders the use of implementation
specific operations.
The change needs to get rid of the union to insure that
an atomic value is not seen as an MR pointer.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Checking the return value of the memory allocation call in
init_pervl_scs() was missed. Recently the kmalloc() was changed to
kzalloc() which identified the problem.
While fixing this issue 2 other bugs were noticed. First, the array
being allocated is accessed in the nomem path which can be reached before
it is allocated. Second, kernel_send_context was not released on error.
Fix both of these by creating a more common memory unwind label structure.
Fixes: 35f6befc84 ("staging/rdma/hfi1: Add qp to send context mapping for PIO")
Reported-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of copying the actual GRH of type struct ib_grh, existing code
copies the struct ib_global_route into the sge. This patch fixes that
and constructs the actual GRH from ib_global_route and copies the GRH
into the sge.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The interface is used to compute the 5-bit SC field from the
LRH and the RHF bits. Modify code to use the interface instead.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cleanup hfi1_ud_rcv to not have to look at the packet
header fields multiple times. The fields are looked up
once and used throughout the function. Also fix sc
computation when validating MAD packets.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
hfi1_pio_header should really be called hfi1_sdma_header
as it is only used for sdma transmits.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
struct ahg_ib_header has no header specific information.
Rename it to struct hfi1_ahg_info
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
sde and hfi1_ib_header are not used anymore.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Active QSFP cables were reset only every alternate iteration of the
channel tuning algorithm instead of every iteration due to incorrect
reset of the flag that controlled QSFP reset, resulting in using stale
QSFP status in the channel tuning algorithm.
Fixes: 8ebd4cf185 ("Add active and optical cable support")
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some QSFP cables assert the interrupt line as a side effect of module
plug-in and power up. This causes the SerDes and QSFP tuning algorithm
to begin cable initialization by reading the QSFP memory map over I2C,
which fails. This patch ignores any interrupt line assertion until
the module has completed power up and voltage rails have stabilized,
which can take a maximum of 500 ms per the SFF-8679 specification.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
QSFP CDR enablement is now controlled by determining power class
and the configuration file. We disable the DC 8051 from requesting
enablement or disabling of TX and RX CDRs by removing the code
that allowed the DC 8051 to request changes.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Hanging has been observed while writing a file over NFSoRDMA. Dmesg on
the server contains messages like these:
[ 931.992501] svcrdma: Error -22 posting RDMA_READ
[ 952.076879] svcrdma: Error -22 posting RDMA_READ
[ 982.154127] svcrdma: Error -22 posting RDMA_READ
[ 1012.235884] svcrdma: Error -22 posting RDMA_READ
[ 1042.319194] svcrdma: Error -22 posting RDMA_READ
Here is why:
With the base memory management extension enabled, FRMR is used instead
of FMR. The xprtrdma server issues each RDMA read request as the following
bundle:
(1)IB_WR_REG_MR, signaled;
(2)IB_WR_RDMA_READ, signaled;
(3)IB_WR_LOCAL_INV, signaled & fencing.
These requests are signaled. In order to generate completion, the fast
register work request is processed by the hfi1 send engine after being
posted to the work queue, and the corresponding lkey is not valid until
the request is processed. However, the rdmavt driver validates lkey when
the RDMA read request is posted and thus it fails immediately with error
-EINVAL (-22).
This patch changes the work flow of local operations (fast register and
local invalidate) so that fast register work requests are always
processed immediately to ensure that the corresponding lkey is valid
when subsequent work requests are posted. Local invalidate requests are
processed immediately if fencing is not required and no previous local
invalidate request is pending.
To allow completion generation for signaled local operations that have
been processed before posting to the work queue, an internal send flag
RVT_SEND_COMPLETION_ONLY is added. The hfi1 send engine checks this flag
and only generates completion for such requests.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fix allows for support of in-kernel reserved operations
without impacting the ULP user.
The low level driver can register a non-zero value which
will be transparently added to the send queue size and hidden
from the ULP in every respect.
ULP post sends will never see a full queue due to a reserved
post send and reserved operations will never exceed that
registered value.
The s_avail will continue to track the ULP swqe availability
and the difference between the reserved value and the reserved
in use will track reserved availabity.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Trace shows incorrect amount of allocated memory.
Fix trace to display memory in KB.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Grzegorz Heldt <grzegorz.heldt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add sysfs entry to allow user to override affinity for SDMA
engine interrupts.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enhance the PCIe Gen3 recipe to support static CTLE tuning,
and add a switch to choose between static and dynamic
approaches. Make discrete chips default to static CTLE
tuning.
Reviewed-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fixes the following warnings with PROVE_LOCKING and PROVE_RCU
enabled in the kernel:
case (1):
[ INFO: suspicious RCU usage. ]
drivers/infiniband/hw/hfi1/init.c:532
suspicious rcu_dereference_check() usage!
case (2):
[ INFO: suspicious RCU usage. ]
drivers/infiniband/hw/hfi1/hfi.h:1624
suspicious rcu_dereference_check() usage!
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Read the version of the SBus, PCIe SerDes, and Fabric Serdes
firmwares at driver load time.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When link up fails in LNI, the local and peer state complete
frames are reported as numbers. Explain what the values mean
so the operator can better diagnose the problem.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, the default number of kernel receive contexts is set to the
number of NUMA nodes on the system plus one for control context. However,
the systems that have a single socket and/or have NUMA disabled in the BIOS
will have only one receive context by default. This patch would ensure that
by default there will be at least two kernel receive contexts plus one for
control context regardless of the number of NUMA nodes on the system. The
user can override the default number of kernel receive contexts with the
krcvqs module parameter.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Advertise and add the capability of handing all aspects of IBTA extended
memory management support in post send.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support extended memory management support, add send side
processing of work requests of type IB_WR_REG_MR, IB_WR_LOCAL_INV, and
IB_WR_SEND_WITH_INV. The first two are local operations and are supported
for both RC and UC. Send with invalidate is only supported for RC because
the corresponding IB opcodes are not defined for UC.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As part of enabling extended memory management support, add the processing
of the RC send with invalidate.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some work requests are local operations, such as IB_WR_REG_MR and
IB_WR_LOCAL_INV. They differ from non-local operations in that:
(1) Local operations can be processed immediately without being posted
to the send queue if neither fencing nor completion generation is needed.
However, to ensure correct ordering, once a local operation is posted to
the work queue due to fencing or completion requiement, all subsequent
local operations must also be posted to the work queue until all the
local operations on the work queue have completed.
(2) Local operations don't send packets over the wire and thus don't
need (and shouldn't update) the packet sequence numbers.
Define a new a flag bit for the post send table to identify local
operations.
Add a new field to the QP structure to track the number of local
operations on the send queue to determine if direct processing of new
local operations should be enabled/disabled.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support extended memory management, add the mechanism to
invalidate MR keys. This includes a flag "lkey_invalid" in the MR data
structure that is to be checked when validating access to the MR via
the associated key, and two utility functions to perform fast memory
registration and memory key invalidate operations.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This implements the device specific function needed by the verbs
API function ib_map_mr_sg().
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There were multiple places where FECN/BECN processing was
being done for the different types of QPs. All of that code
was very similar, which meant that it could be pulled into
a single function used by the different QP types.
To retain the performance in the fastpath, the common code
starts with an inline function, which only calls the slow
path if the packet has any of the [FB]ECN bits set.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While handling buffer control MAD, partially initialized
dd->kernel_send_context area may cause potential dereference
of uninitialized pointers. Fix by using kzalloc_node()
instead of kmalloc_node().
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Tymoteusz Kielan <tymoteusz.kielan@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
PMA should not sum TX and RX replay counts when reporting
local link integrity errors. Fixed by removing C_DC_TX_REPLAY
counter from calculation of the link integrity errors counter
value.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change rvt_post_one_wr to use the new table mechanism for
post send.
Validate that each low level driver specifies the table.
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add flexibility for driver dependent operations in post send
because different drivers will have differing post send
operation support.
This includes data structure definitions to support a table
driven scheme along with the necessary validation routine
using the new table.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Prevent processing receive packet in case when opcode is
accepted by QP but handler for this type of packet is not
defined.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently each user context is assigned a single SDMA engine
based on the VL, context id, and subcontext id. That means for
MPI applications, each rank can only use one SDMA engine for
all messages. This may create unwanted backup for independent
messages going to different destinations upon congestion at one
destination.
This patch adds the packet "dlid" to the formula of SDMA engine
selection for user SDMA requests. A simple hash table is used
to maintain even distribution among the available SDMA engines
regardless how the "dlid" values are distributed.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove the TWSI code. The driver now uses the kernel's built-in
i2c bit bus module.
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use built-in i2c bit-shift bus adapter to control the
i2c busses on the chip.
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When performing process affinity recommendations for MPI ranks, the current
algorithm doesn't take into account multiple HFI units. Also, real
cores and HT cores are not distinguished from one another. Therefore,
all HT cores are recommended to be assigned first within the local NUMA
node before recommending the assignments of cores in other NUMA nodes.
It's ideal to assign all real cores across all NUMA nodes first, then all
HT 1 cores, then all HT 2 cores, and so on to balance CPU workload. CPU
cores in other NUMA nodes could be running interrupt handlers, and this is
not taken into account.
To balance the CPU workload for user processes, the following
recommendation algorithm is used:
For each user process that is opening a context on HFI Y:
a) If all cores are assigned to user processes, start assignments all
over from the first core
b) Assign real cores first, then HT cores (First set of HT cores on
all physical cores, then second set of HT cores, and, so on) in the
following order:
1. Same NUMA node as HFI Y and not running an IRQ handler
2. Same NUMA node as HFI Y and running an IRQ handler
3. Different NUMA node to HFI Y and not running an IRQ handler
4. Different NUMA node to HFI Y and running an IRQ handler
c) Mark core as assigned in the global affinity structure. As user
processes are done, remove core assignments from global affinity
structure.
This implementation allows an arbitrary number of HT cores and provides
support for multiple HFIs.
This is being included in the kernel rather than user space due to the
fact that user space has no way of knowing the CPU recommendations for
contexts running as part of other jobs.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Kernel receive queues oversubscribe CPU cores on multi-HFI systems.
To prevent this, the kernel receive queues are separated onto
different cores, and the SDMA engine interrupts are constrained to
a lesser number of cores.
hfi1s_on_numa_node*krcvqs is the number of CPU cores that are
reserved for kernel receive queues for all HFIs. Each HFI initializes
its kernel receive queues to one of the reserved CPU cores. If there
ends up being 0 CPU cores leftover for SDMA engines, use the same
CPU cores as receive contexts.
In addition, general and control contexts are assigned to their own
CPU core, however, both types of contexts tend to have low traffic.
To save CPU cores, collapse general and control contexts to one CPU
core for all HFI units. This change prevents SDMA engine interrupts
from wrapping around general contexts.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When HFI units get initialized, they each use their own mask copy for
affinity assignments. On a multi-HFI system, affinity assignments
overbook CPU cores as each HFI doesn't have knowledge of affinity
assignments for other HFI units. Therefore, some CPU cores are never
used for interrupt handlers in systems with high number of CPU cores
per NUMA node.
For multi-HFI systems, SDMA engine interrupt assignments start all over
from the first CPU in the local NUMA node after the first HFI
initialization. This change allows assignments to continue where the
last HFI unit left off.
Add global structure for affinity assignments for multiple HFIs to share
affinity mask.
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Jubin John <jubin.john@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The q-counter-id is given in modify-QP command associates
the QP with the counter. The offset to which the counter
ID was set is incorrect, causing IB port counters not to
count on QP.
Fixes: 0837e86a7a ('IB/mlx5: Add per port counters')
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Number of outstanding_pi may overflow and as a result may indicate that
there are no elements in the queue. The effect of doing this is that the
MAD layer will get stuck waiting for completions. The MAD layer will
think that the QP is full - because it didn't receive these completions.
This fix changes it so the outstanding_pi number is increased
with 32-bit wraparound and is not limited to max_send_wr so
that the difference between outstanding_pi and outstanding_ci will
really indicate the number of outstanding completions.
Cc: Stable <stable@vger.kernel.org>
Fixes: ea6dc20362 ('IB/mlx5: Reorder GSI completions')
Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
i40iw_puda_get_listbuf may return NULL if the list is empty.
Add NULL check prior to accessing the pointer.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change dup_ack_thressh to u8 since it is a 3 bit field.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_cq_poll_completion, we always move the tail. So there is
no reason to check for overflow everytime we move the head.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Replace a subtract and multiply with an add; while populating fragments
in SQ wqe.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Child_listen_node pointer is used in a debug print after kfree.
Move the print before kfree.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix size parameter passed to i40iw_reg_phys_mr and use it to
register memory.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove the complicated logic to free the iw_cm_id inside iw_cm
event handlers vs when an application thread destroys the cm_id.
Also remove the block in iw_destroy_cm_id() to block the application
until all references are removed. This block can cause a deadlock when
disconnecting or destroying cm_ids inside an rdma_cm event handler.
Simply allowing the last deref of the iw_cm_id to free the memory
is cleaner and avoids this potential deadlock. Also a flag is added,
IW_CM_DROP_EVENTS, that is set when the cm_id is marked for destruction.
If any events are pending on this iw_cm_id, then as they are processed
they will be dropped vs posted upstream if IW_CM_DROP_EVENTS is set.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Blocking in c4iw_destroy_qp() causes a deadlock when apps destroy a qp
or disconnect a cm_id from their cm event handler function. There is
no need to block here anyway, so just replace the refcnt atomic with a
kref object and free the memory on the last put.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This forces the connection to abort if the application failed to
disconnect before flushing. This is aligned with how the common
flush services work.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There exists a race where the application can setup a connection
and then disconnect it before iw_cxgb4 processes the fw4_ack
message. For passive side connections, the fw4_ack message is
used to know when to stop the ep timer for MPA_REPLY messages.
If the application disconnects before the fw4_ack is handled then
c4iw_ep_disconnect() needs to clean up the timer state and stop the
timer before restarting it for the disconnect timer. Failure to do this
results in a "timer already started" message and a premature stopping
of the disconnect timer.
Fixes: e4b76a2 ("RDMA/iw_cxgb4: stop_ep_timer() after MPA negotiation")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
During connection establishment with a large number of
connections, it is possible that the connection requests
might fail. Adding flow control prevents this failure.
Change ibnl_unicast to use blocking to enable flow control.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Elsewhere the sin_family field holds a value with a name of the form
AF_..., so it seems reasonable to do so here as well. Also the values
of PF_INET and AF_INET are the same.
The Coccinelle semantic patch that makes this change is as follows:
// <smpl>
@@
struct sockaddr_in sip;
@@
(
sip.sin_family ==
- PF_INET
+ AF_INET
|
sip.sin_family !=
- PF_INET
+ AF_INET
|
sip.sin_family =
- PF_INET
+ AF_INET
)
// </smpl>
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The commit 0f8ab0b6e9 ("RDMA/iw_cxgb4: Low resource fixes for Memory
registration") from Jun 10, 2016, leads to the following static checker
warning:
drivers/infiniband/hw/cxgb4/mem.c:612 c4iw_alloc_mw()
error: use kfree_skb() here instead of kfree(mhp->dereg_skb)
Also fixes skb leak in c4iw_dealloc_mw
Fixes: 0f8ab0b6e9 ("RDMA/iw_cxgb4: Low resource fixes for Memory registration")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Initialize first_wr to &send_wr. This allows to remove a ternary
operator and an else branch. This patch does not change the behavior
of srpt_queue_response().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Parav Pandit <pandit.parav@gmail.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Limit the number of SG elements per work request to what the HCA
and the queue pair support.
Fixes: 34693573fde0 ("IB/srpt: Reduce QP buffer size")
Reported-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Parav Pandit <pandit.parav@gmail.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org> #v4.7+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Compute the SGE limit for RDMA READ and WRITE requests in
ib_create_qp(). Use that limit in the RDMA RW API implementation.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Parav Pandit <pandit.parav@gmail.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org> #v4.7+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some but not all callers of rdma_rw_ctx_init() zero-initialize
struct rdma_rw_ctx. Hence make rdma_rw_ctx_init() initialize all
work request fields that will be read by ib_post_send().
Fixes: a060b5629a ("IB/core: generic RDMA READ/WRITE API")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Parav Pandit <pandit.parav@gmail.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: <stable@vger.kernel.org> #v4.7+
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add sw counter to track dropped unsupported packets.
Report unsupported packets drop as the RcvError.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add per VL XmitDiscards counters to the opapmaquery
status and error response.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix sparse errors by making sure the fast assign destinations
are host cpu typed.
For the void __iomem *, just make the field match source
data.
Fix a bug where the hw_free trace printed the pointer vs.
the dereferenced value.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ftrace infrastructure used to evaluate the TRACE_SYSTEM
macro on every DEFINE_EVENT() macro. Now the TRACE_SYSTEM
macro only gets evaluated when trace/define_trace.h is
included, so the group event information is lost. This was
introduced in
commit acd388fd3a ("tracing: Give system name a pointer")
Therefore, each system tracepoint must be on its own file.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix a copy and paste typo in comment.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Simple code clean up of hfi1_write_iter.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The definition of port state changed mid development and the
old structure was kept accidentally. Remove this dead code.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In preparation for writing the tx descriptor from multiple functions,
create a helper for both normal and blueflame access.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
i40iw_create_cqp() printed the contents of variables maj_err and min_err
in an error message before they could be initialized (by calling
dev->cqp_ops->cqp_create).
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add the missing port_xmit_wait counter. This counter is displayed through
some tools like perfquery but is not available via sysfs.
For the PORT_PMA_ATTR macro the _counter field is set to zero
allowing us to specify the offset directly like with PORT_PMA_ATTR_EXT
See also the earlier work in 2008 by Vladimir Skolovsky
https://www.mail-archive.com/general@lists.openfabrics.org/msg20313.html
Signed-off-by: Vladimir Sokolvsky <vlad@mellanox.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The critical section should protect only the list traversal
and dd->asic_data modification, not the memory allocation.
The fix pulls the allocation out of the critical section.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There are several computatations of the sc in the
ud receive routine.
Besides the code duplication, all are wrong when the
sc is greater than 15. In that case the code incorrectly
or's a 1 into the computed sc instead of 1 shifted left
by 4.
Fix precomputed sc5 by using an already implemented routine
hdr2sc() and deleting flawed duplicated code.
Cc: Stable <stable@vger.kernel.org> # 4.6+
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reduce the set of arguments passed to mlx5_add_flow_rule
by introducing flow_spec structure.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export the firmware version through the core.
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Now that all the devices have stopped exporting their own sysfs
entry points we can have the core export this on their behalf.
Eventually this may be removed but this provides for backwards
compatibility.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Using this allows for devices to specify the format of their
firmware version rather than forcing a format.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove sysfs file in favor of the common core.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove sysfs in favor of the core support.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove the sysfs in favor of the core version.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove the sysfs entry in favor of the core support.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove sysfs entry in favor of the common code.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove the sysfs in favor of common core version.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove sysfs support in favor of the core version.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
And remove sysfs fw_ver in favor of the core.
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Also remove fw_ver sysfs to be replaced by the common core one.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Allow for a common core function to get firmware version strings
from the individual devices.
In later patches this format can then then be used to pass a
properly formated version string through the IPoIB layer.
The problem with the current code in the IPoIB layer is that it is
specific to certain hardware types.
Furthermore, this gives us a common function through which the core
can provide a common sysfs entry. Eventually we may want to
remove the sysfs export but this provides for user space backwards
compatibility.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The memory needed for the send and receive queues associated with
a QP is proportional to the max_sge parameter. The current value
of that parameter is such that with an mlx4 HCA the QP buffer size
is 8 MB. Since DMA is used for communication between HCA and CPU
that buffer either has to be allocated coherently or map_single()
must succeed for that buffer. Since large contiguous allocations
are fragile and since the maximum segment size for e.g. swiotlb
is 256 KB, reduce the max_sge parameter. This patch avoids that
the following text appears on the console after SRP logout and
relogin on a system equipped with multiple IB HCAs:
mlx4_core 0000:05:00.0: swiotlb buffer is full (sz: 8388608 bytes)
swiotlb: coherent allocation failed for device 0000:05:00.0 size=8388608
CPU: 11 PID: 148 Comm: kworker/11:1 Not tainted 4.7.0-rc4-dbg+ #1
Call Trace:
[<ffffffff812c6d35>] dump_stack+0x67/0x92
[<ffffffff812efe71>] swiotlb_alloc_coherent+0x141/0x150
[<ffffffff810458be>] x86_swiotlb_alloc_coherent+0x3e/0x50
[<ffffffffa03861fa>] mlx4_buf_direct_alloc.isra.5+0x9a/0x120 [mlx4_core]
[<ffffffffa0386545>] mlx4_buf_alloc+0x165/0x1a0 [mlx4_core]
[<ffffffffa035053d>] create_qp_common.isra.29+0x57d/0xff0 [mlx4_ib]
[<ffffffffa03510da>] mlx4_ib_create_qp+0x12a/0x3f0 [mlx4_ib]
[<ffffffffa031154a>] ib_create_qp+0x3a/0x250 [ib_core]
[<ffffffffa055dd4b>] srpt_cm_handler+0x4bb/0xcad [ib_srpt]
[<ffffffffa02c1ab0>] cm_process_work+0x20/0xf0 [ib_cm]
[<ffffffffa02c3640>] cm_work_handler+0x1ac0/0x2059 [ib_cm]
[<ffffffff810737ed>] process_one_work+0x19d/0x490
[<ffffffff81073b29>] worker_thread+0x49/0x490
[<ffffffff8107a0ea>] kthread+0xea/0x100
[<ffffffff815b25af>] ret_from_fork+0x1f/0x40
Fixes: b99f8e4d7b ("IB/srpt: convert to the generic RDMA READ/WRITE API")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose new counters using the get protocol stats callback.
We expose the following counters:
|------------------------------------------------------------------------|
| Name | IB | EN | Description |
|------------------------------------------------------------------------|
|rx_write_requests | + | - | Number of received WRITE requests for |
| | | | the associated QP. |
|------------------------------------------------------------------------|
|rx_read_requests | + | - | Number of received READ requests for |
| | | | the associated QP. |
|------------------------------------------------------------------------|
|rx_atomic_requests | + | - | Number of received ATOMIC requests for |
| | | | the associated QP. |
|------------------------------------------------------------------------|
|out_of_buffer | + | + | Number of drops occurred due to lack |
| | | | of WQE for the associated QPs/RQs. |
|------------------------------------------------------------------------|
|out_of_sequence | + | - | Number of errors in the packet |
| | | | transport sequence number |
|------------------------------------------------------------------------|
|duplicate_request | + | + | Number of received duplicated packets. |
| | | | A request that previously executed is |
| | | | named duplicated. |
|------------------------------------------------------------------------|
|rnr_nak_retry_err | + | + | Number of received RNR NAC packets. |
| | | | The QP retry limit did not exceed. |
|------------------------------------------------------------------------|
|packet_seq_err | + | + | Number of received NAK - sequence error|
| | | | packets. The QP retry limit did not |
| | | | exceed. |
|------------------------------------------------------------------------|
|implied_nak_err | + | + | Number of times the requester detected |
| | | | an ACK with a PSN larger than expected |
| | | | PSN for RDMA READ or ATOMIC response |
| | | | The QP retry limit did not exceed. |
|------------------------------------------------------------------------|
|local_ack_timeout_err| + | - | Number of NO ACK responses from |
| | | | responder within timer interval. |
| | | | The QP retry limit did not exceed. |
|------------------------------------------------------------------------|
Counters are available if all of them are supported.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support statistics for ports, we attach
each QP to a counter set which is dedicate to this port.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, the SRQ API uses the obsolete mlx5_*_srq_mbox_{in,out}
structs which limit the ability to pass the SRQ attributes between
net and IB parts of the driver.
This patch changes the SRQ API so as to use auto-generated structs
and provides a better way to pass attributes which will be in use by
coming features.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enable mlx5 based hardware to report TCP segmentation offload (TSO)
capabilities from kernel to user space. A TSO enabled NIC will accept
big chunks of data with sizes greater than MTU for TCP traffic. The TSO
engine will break the data into separate packets and will insert headers
automatically.
The capabilities are exposed to user space through query_device by uhw
directly. The following capabilities are reported:
1. The maximum payload size in bytes supported for segmentation by TSO
engine.
2. Bitmap showing which QP types are supported by TSO operation. The bitmap
is built by members from 'enmu ib_qp_type'. For example, similar code
should be performed if UD QP is supported:
supported_qpts |= 1 << IB_QPT_UD;
To make user-space library aware of whether kernel supports uhw or not, a
new flag: cmds_supp_uhw will be returned back to user-space through
alloc_ucontext.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enable flow steering for IPv6 traffic by using an IPv6 spec.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The driver exposes interfaces that directly relate to HW state.
Upon fatal error, consumers of these interfaces (ULPs) that rely
on completion of all their posted work-request could hang, thereby
introducing dependencies in shutdown order. To prevent this from
happening, we manage the relevant resources (CQs, QPs) that are used
by the device. Upon a fatal error, we now generate simulated
completions for outstanding WQEs that were not completed at the
time the HW was reset.
It includes invoking the completion event handler for all involved
CQs so that the ULPs will poll those CQs. When polled we return
simulated CQEs with IB_WC_WR_FLUSH_ERR return code enabling ULPs
to clean up their resources and not wait forever for completions
upon receiving remove_one.
The above change requires an extra check in the data path to make
sure that when device is in error state, the simulated CQEs will
be returned and no further WQEs will be posted.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Implements the IB core disassociate_ucontext API.
The driver detaches the HW resources for a given user context to
prevent a dependency between application termination and device
disconnect. This is done by managing the VMAs that were mapped
to the HW bars such as doorbell and blueflame. When need to detach,
remap them to an arbitrary kernel page returned by the zap API.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support for Raw Ethernet RX HASH QP. Currently, creation and
destruction of such a QP are supported. This QP is implemented as
a simple TIR object which points to the receive RQ indirection table.
The given hashing configuration is used to configure the TIR and by
that it chooses the right RQ from the RQ indirection table.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
User applications that want to spread incoming traffic between several WQs
should create a QP which contains an indirection table.
When such a QP is created other receive side parameters are not valid
and should not be given. Its send side is optional and assumed active
based on max_send_wr capability value.
Extend create QP to work accordingly.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Extend create QP to get Receive Work Queue (WQ) indirection table.
QP can be created with external Receive Work Queue indirection table,
in that case it is ready to receive immediately.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some mlx5 based hardwares support a RQ table object. This RQ table
points to a few RQ objects. We implement the receive work queue
indirection table API (create and destroy) by using this hardware
object.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
User applications that want to spread traffic on several WQs, need to
create an indirection table, by using already created WQs.
Adding uverbs API in order to create and destroy this table.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce Receive Work Queue (WQ) indirection table.
This object can be used to spread incoming traffic to different
receive Work Queues.
A Receive WQ indirection table points to variable size of WQs.
This table is given to a QP in downstream patches.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimerg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A QP can be created without internal WQs "packaged" inside it,
this QP can be configured to use "external" WQ object as its
receive/send queue.
WQ is a necessary component for RSS technology since RSS mechanism
is supposed to distribute the traffic between multiple
Receive Work Queues
Receive WQs are implemented by RQs.
Implement the WQ creation, modification and destruction verbs.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
User space applications which use RSS functionality need to create
a work queue object (WQ). The lifetime of such an object is:
* Create a WQ
* Modify the WQ from reset to init state.
* Use the WQ (by downstream patches).
* Destroy the WQ.
These commands are added to the uverbs API.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@rimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce Work Queue object and its create/destroy/modify verbs.
QP can be created without internal WQs "packaged" inside it,
this QP can be configured to use "external" WQ object as its
receive/send queue.
WQ is a necessary component for RSS technology since RSS mechanism
is supposed to distribute the traffic between multiple
Receive Work Queues.
WQ associated (many to one) with Completion Queue and it owns WQ
properties (PD, WQ size, etc.).
WQ has a type, this patch introduces the IB_WQT_RQ (i.e.receive queue),
it may be extend to others such as IB_WQT_SQ. (send queue).
WQ from type IB_WQT_RQ contains receive work requests.
PD is an attribute of a work queue (i.e. send/receive queue), it's used
by the hardware for security validation before scattering to a memory
region which is pointed by the WQ. For that, an external WQ object
needs a PD, letting the hardware makes that validation.
When accessing a memory region that is pointed by the WQ its PD
is used and not the QP's PD, this behavior is similar
to a SRQ and a QP.
WQ context is subject to a well-defined state transitions done by
the modify_wq verb.
When WQ is created its initial state becomes IB_WQS_RESET.
>From IB_WQS_RESET it can be modified to itself or to IB_WQS_RDY.
>From IB_WQS_RDY it can be modified to itself, to IB_WQS_RESET
or to IB_WQS_ERR.
>From IB_WQS_ERR it can be modified to IB_WQS_RESET.
Note: transition to IB_WQS_ERR might occur implicitly in case there
was some HW error.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pre-allocate buffers to deallocate completion queue, so that completion
queue is deallocated during RDMA termination when system is running
out of memory.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pre-allocate buffers for deregistering memory region and memory window
during RDMA connection close, when system is running out of memory.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pre-allocate buffers for sending various control messages to close
connection, abort connection, etc so that we gracefully handle
connections when system is running out of memory.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Get rid of unneeded code, and refactor things a bit.
For MPA version 0 we abort the connection. For > 0, we attempt to send
an MPA_START/REJECT Reply, and then disconnect gracefully. If the send
of the MPA message fails, then we abort the connection. We can ignore
c4iw_ep_disconnect() errors here because it will clean up the endpoint
if there are failures.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
With IPv6 addresses, the "qps" debugfs is running out of space and
truncating the output. Bump the required size accordingly.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
markers_enabled should be read only once during MPA negotiation.
The present code does read markers_enabled twice during negotiation
which results in setting wrong recv/xmit markers if the markers_enabled is
changed in the middle of negotiation.
With this change the markers_enabled is read only once during MPA
negotiation. recv markers are set based on markers enabled module
parameter and xmit markers are set based on markers flag from the
MPA_START_REQ/MPA_START_REP.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Set the chunk_size to enable level-1 PBL support when the fast memory
page count is more than one.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
CQ is armed for solicited events only, ignoring other notification
flags. Correct this by arming for next and arming for solicited
event if IB_CQ_SOLICITED is set. Also protect CQ shadow area update
with spinlock.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The current drivers return errors from this calldown
wrapped in an ERR_PTR().
The rdmavt code incorrectly tests for NULL.
The code is fixed to use IS_ERR() and change ret according
to the driver return value.
Cc: Stable <stable@vger.kernel.org> # 4.6+
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since rvt_reset_qp already zero's out qp->s_ack_queue head and tail
pointers, there is no need to zero out qp->s_ack_queue itself.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A failure in the get_txreq() inline will result in a
slow path retry using __get_txreq().
__get_txreq() attempts to procure the qp s_lock, which
is already held in all callers.
Fix by deleting the s_lock maintenance in __get_txreq()
and add sparse syntax hooks to future proof the code.
Cc: Stable <stable@vger.kernel.org> # 4.6+
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Prevent cross page boundary allocation by allocating
new page, this is required to be aligned with ConnectX-3 HW
requirements.
Not doing that might cause to "RDMA read local protection" error.
Fixes: 1b2cd0fc67 ('IB/mlx4: Support the new memory registration API')
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When RC, UC, or RAW QPs are created, a qp object is allocated (kzalloc).
If at a later point (in procedure create_qp_common) the qp creation fails,
this qp object must be freed.
Fixes: 1ffeb2eb8b ("IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support")
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In procedure mlx4_ib_create_flow, passing an invalid port number
will cause an out-of-bounds array access. Data passed to this procedure
can come from user-space. Therefore, need to validate port number
before proceeding onwards.
Note that we check against the number of physical ports declared at
the verbs (ib core) level; When bonding is active, the verbs level
sees one physical port, even though the low-level driver sees two ports.
Fixes: f77c0162a3 ("IB/mlx4: Add receive flow steering support")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix mad send error flow to prevent double freeing address handles,
and leaking tx_ring entries when SRIOV is active.
If ib_mad_post_send fails, the address handle pointer in the tx_ring entry
must be set to NULL (or there will be a double-free) and tx_tail must be
incremented (or there will be a leak of tx_ring entries).
The tx_ring is handled the same way in the send-completion handler.
Fixes: 37bfc7c1e8 ("IB/mlx4: SR-IOV multiplex and demultiplex MADs")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When calculating the required size of an RC QP send queue, leave
enough space for masked atomic operations, which require more space than
"regular" atomic operation.
Fixes: 6fa8f71984 ("IB/mlx4: Add support for masked atomic operations")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@mellanox.co.il>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
port_xmit_data is written instead of port_rcv_data.
Fixes: 3efd9a1121 ('IB/mlx5: Modify MAD reading counters method to use counter registers')
Signed-off-by: Talat Batheesh <talatb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If the caller specified IB_SEND_FENCE in the send flags of the work
request and no previous work request stated that the successive one
should be fenced, the work request would be executed without a fence.
This could result in RDMA read or atomic operations failure due to a MR
being invalidated. Fix this by adding the mlx5 enumeration for fencing
RDMA/atomic operations and fix the logic to apply this.
Fixes: e126ba97db ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Initialize ib_qp_init_attr with zeros in order to avoid from garbage
in fields that won't be set with user values.
Fixes: a060b5629a ('IB/core: generic RDMA READ/WRITE API')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When virtualziation is supported, VFs may send SA MADs to a GID formed
by the concatenation of the subnet prefix with the
IB_SA_WELL_KNOWN_GUID. When a response is required, the current code
will search the local HCA's port for the received GID to figure out the
GID index of the entry containing this GID. However, since this is not a
real GID it will not be found and error will be printed.
We change the logic to check if the destination GID is this special GID
and avoid lookup in this case and use GID index 0.
Fixes: a0c1b2a350 ('IB/core: Support accessing SA in virtualized environment')
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
During multicast join of RoCEv1, IGMP join state and max hop limit
were updated incorrectly. IGMP join should be sent and marked as
joined only on RoCEv2 after a successful join. Max hops should be
updated to the hop limit on RoCEv2 regardless of the join state.
Fixes: bee3c3c918 ('IB/cma: Join and leave multicast groups...')
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, when the netdevice returned by get_netdev is unregistered,
we delete all GIDs (including the default GIDs) and reset their
attributes. Therefore, when we re-register it, no default GIDs
will be assigned (as their "default GID") attribute will be reset.
Fixing this by keeping "default GID" attribute.
Fixes: 03db3a2d81 ('IB/core: Add RoCE GID table management')
Signed-off-by: Talat Batheesh <talatb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Swapping a cable from a "Mgmt Allowed=No" switch port to a
"Mgmt Allowed=Yes" switch port doesn't send a pkey change
notification. Therefore, the link doesn't become active as
the oib_utils layer uses an old pkey table cache.
Fix by ensuring the pkey change notification is sent when
the table is changed both explicitly by the FM and implicitly
by the driver via a cable swap.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
FULL_MGMT_P_KEY doesn't get cleared from the pkey table at link bounce
because the link down and link bounce code paths are different when
moving a QSFP cable on a switch. This causes an HFI unit connected to a
switch to try to be initialized to an FM node when the QSFP cable is
moved from a MgmtAllowed=NO port to a MgmtAllowed=YES port and back to a
MgmtAllowed=NO port. Remove FULL_MGMT_P_KEY from pkey table at link up.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fixes potential buffer overflow because the sprintf function
doesn't check buffer boundaries. Use snprintf instead.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fixes potential NULL ptr dereference because IS_ERR(dd) doesn't
handle NULL. Fix the issue by initializing the pointer with a not NULL
error code.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If a context has already been assigned to an FD, prevent
another assignment.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If a context has already been assigned to an FD, prevent
another assignment.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The current value of 500us for the packet egress timeout is too small
which causes the host to declare failure on draining packets too early
and unnecessarily bounces the link. Increase this to 50ms taking into
account the switch packet discard timer default and the worst case
per-VL package drainage rate.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Correct calculation of the low order bits which should be unset
based on use of qos_shift parameter when assigning QPN.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Functions required for MODIFY_PORT were incorrectly being
required for MODIFY_QP.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The credit return threshold adjustment on mtu change algorithm does not
take into account all the kernel send contexts that are assigned per VL.
Use the pio send context map to adjust the credit return thresholds for
all the allocated and assigned kernel send contexts based on the MTU
adjustment per VL.
The pio send context map can be changed dynamically based on the actual
number of operational vls which is set by the fabric manager. When this
happens update the credit return threshold values for all the remapped
kernel send contexts.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Static source code analysis tools like smatch cannot handle functions
that lock or not lock a mutex depending on the value of the arguments.
Hence inline the function cma_disable_callback(). Additionally, this
patch realizes a small performance optimization by reducing the number of
mutex_lock() and mutex_unlock() calls in the modified functions. With
this patch applied smatch no longer complains about source file cma.c.
Without this patch smatch reports the following for this source file:
drivers/infiniband/core/cma.c:1959: cma_req_handler() warn: inconsistent returns 'mutex:&listen_id->handler_mutex'.
Locked on: line 1880
line 1959
Unlocked on: line 1941
drivers/infiniband/core/cma.c:2112: iw_conn_req_handler() warn: inconsistent returns 'mutex:&listen_id->handler_mutex'.
Locked on: line 2048
Unlocked on: line 2112
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>