This adds a driver that allows userspace to use UD-like QPs over a
proprietary Cisco transport with Cisco's Virtual Interface Cards (VICs),
including VIC 1240 and 1280 cards.
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
OCRDMA_GEN2_FAMILY is wrongly defined as 0x02 -- it should be 0x0F.
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix ah->av->valid bit position and big endian portability.
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Pull networking fixes from David Miller:
"Some holiday bug fixes for 3.13... There is still one bug I'd like to
get fixed before 3.13-final.
The vlan code erroneously assignes the header ops of the underlying
real device to the VLAN device above it when the real device can
hardware offload VLAN handling. That's completely bogus because
header ops are tied to the device type, so they only expect to see a
'dev' argument compatible with their ops.
The fix is the have the VLAN code use a special set of header ops that
does the pass-thru correctly, by calling the underlying real device's
header ops but _also_ passing in the real device instead of the VLAN
device.
That fix is currently waiting some testing.
Anyways, of note here:
1) Fix bitmap edge case in radiotap, from Johannes Berg.
2) Fix oops on driver unload in rtlwifi, from Larry Finger.
3) Bonding doesn't do locking correctly during speed/duplex/link
changes, from Ding Tianhong.
4) Fix header parsing in GRE code, this bug has been around for a few
releases. From Timo Teräs.
5) SIT tunnel driver MTU check needs to take GSO into account, from
Eric Dumazet.
6) Minor info leak in inet_diag, from Daniel Borkmann.
7) Info leak in YAM hamradio driver, from Salva Peiró.
8) Fix route expiration state handling in ipv6 routing code, from Li
RongQing.
9) DCCP probe module does not check request_module()'s return value,
from Wang Weidong.
10) cpsw driver passes NULL device names to request_irq(), from
Mugunthan V N.
11) Prevent a NULL splat in RDS binding code, from Sasha Levin.
12) Fix 4G overflow test in tg3 driver, from Nithin Sujir.
13) Cure use after free in arc_emac and fec driver's software
timestamp handling, from Eric Dumazet.
14) SIT driver can fail to release the route when
iptunnel_handle_offloads() throws an error. From Li RongQing.
15) Several batman-adv fixes from Simon Wunderlich and Antonio
Quartulli.
16) Fix deadlock during TIPC socket release, from Ying Xue.
17) Fix regression in ROSE protocol recvmsg() msg_name handling, from
Florian Westphal.
18) stmmac PTP support releases wrong spinlock, from Vince Bridgers"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
stmmac: Fix incorrect spinlock release and PTP cap detection.
phy: IRQ cannot be shared
net: rose: restore old recvmsg behavior
xen-netback: fix guest-receive-side array sizes
fec: Do not assume that PHY reset is active low
tipc: fix deadlock during socket release
netfilter: nf_tables: fix wrong datatype in nft_validate_data_load()
batman-adv: fix vlan header access
batman-adv: clean nf state when removing protocol header
batman-adv: fix alignment for batadv_tvlv_tt_change
batman-adv: fix size of batadv_bla_claim_dst
batman-adv: fix size of batadv_icmp_header
batman-adv: fix header alignment by unrolling batadv_header
batman-adv: fix alignment for batadv_coded_packet
netfilter: nf_tables: fix oops when updating table with user chains
netfilter: nf_tables: fix dumping with large number of sets
ipv6: release dst properly in ipip6_tunnel_xmit
netxen: Correct off-by-one errors in bounds checks
net: Add some clarification to skb_tx_timestamp() comment.
arc_emac: fix potential use after free
...
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on original work by Santosh Rastapur <santosh@chelsio.com>
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch marks the function _c4iw_write_mem_dma() as static
because it is not used outside this file, which fixes the warning:
drivers/infiniband/hw/cxgb4/mem.c:176:5: warning: no previous prototype for ‘_c4iw_write_mem_dma’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABCAAGBQJSil7BAAoJEENa44ZhAt0hbtgP/A+AmUalbOX6ZKzuOFxsrtY2
r55CX9b1JBeFM/Zhn2o6y+81lpCjkckJSggESMe4izNgocGw0nW4vYGN4SBynatj
y8sR9OSn+G3ihuENrzG41MJUGEa5WbcNMy4boN+Oa+qyTlV/WjLR7Fv4WbikK7Wm
o8FNlXiiDhMoGfHHG5J0MD0EQsnxuLDk2XP+ciu4tLtTs+wBka+gFK8WnMvztle3
gTeMNna5ilvCS2fdBxteuPA3KeDnJE9AgJSMJ2a4Rh+DR8uTgWYQ6n7amjmOc546
yhAKkoBkxPE10+Yj82WOPhCFxSeWcuSwJvpgv5dTVZ1XqUUcC1V3TEcZDHmyyHQ7
uPXgS1A+erBW3OYPBjZqtKvnHObscV12fL+rId3vIhcAQIbFroci08ZwPidEYRkn
fvwlEKcrIsBIpRXEyjlFCxsiiDnfq1wC1VayMR3jrIK0P6idf1SXf/geiRp9+RGT
wKUc0j51jvEx29qc65xuhEP9FQV9pCMxyd+FEE0d0KkjMz5hsIkjmcUcBbgF0CGg
GEyDPlgRLv+vmWDGpT8XraaV/0CJOEQDIgB4WSN87/AZ4UoNt7spW2xqsLsp1toy
5e0100tpWUleTPLe/Wig5GtBdagQ2jAUK1+186CP93pFPtkwc4/7X3hyp7qPIPTz
VDvT9DEy6zjSMCLpMcdo
=xxC+
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband/rdma updates from Roland Dreier:
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
IB/core: Re-enable create_flow/destroy_flow uverbs
IB/core: extended command: an improved infrastructure for uverbs commands
IB/core: Remove ib_uverbs_flow_spec structure from userspace
IB/core: Use a common header for uverbs flow_specs
IB/core: Make uverbs flow structure use names like verbs ones
IB/core: Rename 'flow' structs to match other uverbs structs
IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
IB/ucma: Convert use of typedef ctl_table to struct ctl_table
IB/cm: Convert to using idr_alloc_cyclic()
IB/mlx5: Fix page shift in create CQ for userspace
IB/mlx4: Fix device max capabilities check
IB/mlx5: Fix list_del of empty list
IB/mlx5: Remove dead code
IB/core: Encorce MR access rights rules on kernel consumers
IB/mlx4: Fix endless loop in resize CQ
RDMA/cma: Remove unused argument and minor dead code
RDMA/ucma: Discard events for IDs not yet claimed by user space
IB/core: Add Cisco usNIC rdma node and transport types
RDMA/nes: Remove self-assignment from nes_query_qp()
IB/srp: Report receive errors correctly
...
This commit reverts commit 7afbddfae9 ("IB/core: Temporarily disable
create_flow/destroy_flow uverbs"). Since the uverbs extensions
functionality was experimental for v3.12, this patch re-enables the
support for them and flow-steering for v3.13.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 400dbc9658 ("IB/core: Infrastructure for extensible uverbs
commands") added an infrastructure for extensible uverbs commands
while later commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow
through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
using this new infrastructure.
According to the commit 400dbc9658, the purpose of this
infrastructure is to support passing around provider (eg. hardware)
specific buffers when userspace issue commands to the kernel, so that
it would be possible to extend uverbs (eg. core) buffers independently
from the provider buffers.
But the new kernel command function prototypes were not modified to
take advantage of this extension. This issue was exposed by Roland
Dreier in a previous review[1].
So the following patch is an attempt to a revised extensible command
infrastructure.
This improved extensible command infrastructure distinguish between
core (eg. legacy)'s command/response buffers from provider
(eg. hardware)'s command/response buffers: each extended command
implementing function is given a struct ib_udata to hold core
(eg. uverbs) input and output buffers, and another struct ib_udata to
hold the hw (eg. provider) input and output buffers.
Having those buffers identified separately make it easier to increase
one buffer to support extension without having to add some code to
guess the exact size of each command/response parts: This should make
the extended functions more reliable.
Additionally, instead of relying on command identifier being greater
than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure rely on
unused bits in command field: on the 32 bits provided by command
field, only 6 bits are really needed to encode the identifier of
commands currently supported by the kernel. (Even using only 6 bits
leaves room for about 23 new commands).
So this patch makes use of some high order bits in command field to
store flags, leaving enough room for more command identifiers than one
will ever need (eg. 256).
The new flags are used to specify if the command should be processed
as an extended one or a legacy one. While designing the new command
format, care was taken to make usage of flags itself extensible.
Using high order bits of the commands field ensure that newer
libibverbs on older kernel will properly fail when trying to call
extended commands. On the other hand, older libibverbs on newer kernel
will never be able to issue calls to extended commands.
The extended command header includes the optional response pointer so
that output buffer length and output buffer pointer are located
together in the command, allowing proper parameters checking. This
should make implementing functions easier and safer.
Additionally the extended header ensure 64bits alignment, while making
all sizes multiple of 8 bytes, extending the maximum buffer size:
legacy extended
Maximum command buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
Maximum response buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
For the purpose of doing proper buffer size accounting, the headers
size are no more taken in account in "in_words".
One of the odds of the current extensible infrastructure, reading
twice the "legacy" command header, is fixed by removing the "legacy"
command header from the extended command header: they are processed as
two different parts of the command: memory is read once and
information are not duplicated: it's making clear that's an extended
command scheme and not a different command scheme.
The proposed scheme will format input (command) and output (response)
buffers this way:
- command:
legacy header +
extended header +
command data (core + hw):
+----------------------------------------+
| flags | 00 00 | command |
| in_words | out_words |
+----------------------------------------+
| response |
| response |
| provider_in_words | provider_out_words |
| padding |
+----------------------------------------+
| |
. <uverbs input> .
. (in_words * 8) .
| |
+----------------------------------------+
| |
. <provider input> .
. (provider_in_words * 8) .
| |
+----------------------------------------+
- response, if present:
+----------------------------------------+
| |
. <uverbs output space> .
. (out_words * 8) .
| |
+----------------------------------------+
| |
. <provider output space> .
. (provider_out_words * 8) .
| |
+----------------------------------------+
The overall design is to ensure that the extensible infrastructure is
itself extensible while begin more reliable with more input and bound
checking.
Note:
The unused field in the extended header would be perfect candidate to
hold the command "comp_mask" (eg. bit field used to handle
compatibility). This was suggested by Roland Dreier in a previous
review[2]. But "comp_mask" field is likely to be present in the uverb
input and/or provider input, likewise for the response, as noted by
Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
header.
[1]:
http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
[2]:
http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
[3]:
http://marc.info/?i=525C1149.6000701@mellanox.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
[ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
When creating a CQ, we must use mlx5 adapter page shift.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move the check on max supported CQEs after the final number of entries is
evaluated.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The value of the local variable index is never used in reg_mr_callback().
Signed-off-by: Eli Cohen <eli@mellanox.com>
[ Remove now-unused variable delta too. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
When calling get_sw_cqe() we need pass the consumer_index and not the
masked value. Failure to do so will cause incorrect result of
get_sw_cqe() possibly leading to endless loop.
This problem was reported and analyzed by Michael Rice from HP.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* lookup_one_len() really wants i_mutex held on directory.
* leaks galore - just mount ipathfs, then
cd /sys/bus/pci/drivers/qib_ib; echo *:*:*.* >unbind
on a box with that card present and try to umount ipathfs...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Assigning a value to itself is pointless.
Spotted with coverity, no hardware to test.
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 7fac33014f54("IB/qib: checkpatch fixes") was overzealous in
removing a simple_strtoul for a parse routine, setup_txselect(). That
routine is required to handle a multi-value string.
Unwind that aspect of the fix.
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Convert __attribute__ ((packed)) to __packed.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
qib_user_sdma_queue_pkts() gets called with mmap_sem held for
writing. Except for get_user_pages() deep down in
qib_user_sdma_pin_pages() we don't seem to need mmap_sem at all. Even
more interestingly the function qib_user_sdma_queue_pkts() (and also
qib_user_sdma_coalesce() called somewhat later) call copy_from_user()
which can hit a page fault and we deadlock on trying to get mmap_sem
when handling that fault.
So just make qib_user_sdma_pin_pages() use get_user_pages_fast() and
leave mmap_sem locking for mm.
This deadlock has actually been observed in the wild when the node
is under memory pressure.
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Roland Dreier <roland@purestorage.com>
ipath_user_sdma_queue_pkts() gets called with mmap_sem held for
writing. Except for get_user_pages() deep down in
ipath_user_sdma_pin_pages() we don't seem to need mmap_sem at all.
Even more interestingly the function ipath_user_sdma_queue_pkts() (and
also ipath_user_sdma_coalesce() called somewhat later) call
copy_from_user() which can hit a page fault and we deadlock on trying
to get mmap_sem when handling that fault. So just make
ipath_user_sdma_pin_pages() use get_user_pages_fast() and leave
mmap_sem locking for mm.
This deadlock has actually been observed in the wild when the node
is under memory pressure.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
[ Merged in fix for call to get_user_pages_fast from Tetsuo Handa
<penguin-kernel@I-love.SAKURA.ne.jp>. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove the redundant check of comparing if a 32-bit value is greater
than 0xffffffffULL.
Reported by Dan Carpenter.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1) ocrdma_remove_free() is called from a call_rcu callback funtion
context, which can be a bottom-half context. So the code in
ocrdma_remove_free should not sleep.
But ocrdma_cleanup_hw() can sleep, So move it ocrdma_remove()
instead of ocrdma_remove_free.
2) Fix a couple of kbuild test robot warnings.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We recently added a cap on "max_wqe_allocated" in 43a6b4025c
('RDMA/ocrdma: Create IRD queue fix').
My static checker complains that the cap has a problem because it
casts large values to negative. "attrs->cap.max_send_wr" is a u32.
It comes from the user, but it's capped in ocrdma_check_qp_params() so
it can't wrap here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The Connect-IB adapter has an inherent page size which equals 4K.
Define an new enum that equals the page shift and use it instead of
using the value 12 throughout the code.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
RTS to RTS transition should allow update of alternate path.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
mlx5_cur and mlx5_new cannot have negative values so remove the
redundant condition.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In mlx5_mr_cache_init() the size variable is not used so remove it to
avoid compiler warnings when running with make W=1.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Connect-IB firmware requires 4K pages to be communicated with the
driver. This patch breaks larger pages to 4K units to enable support
for architectures utilizing larger page size, such as PowerPC. This
patch also fixes several places that referred to PAGE_SHIFT instead of
explicit 12 which is the inherent page shift on Connect-IB.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
On destroy QP the driver walks over the relevant CQ and removes CQEs
reported for the destroyed QP. It also frees the related SRQ entry
without checking that this is actually an SRQ-related CQE. In case of
a CQ used for both send and receive QP, we could free SRQ entries for
send CQEs. This patch resolves this issue by verifying that this is a
SRQ related CQE by checking the SRQ number in the CQE is not zero.
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Make use of destroy_srq_kernel() to clear SRQ resouces.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Make sure not to overflow when reading the page list from struct
ib_fast_reg_page_list.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use asynchronous commands to execute up to eight concurrent create MR
commands. This is to fill memory caches faster so we keep consuming
from there. Also, increase timeout for shrinking caches to five
minutes.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Verify that the value is non negative before rounding up to power of 2.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Physical addresses may be wider than virtual addresses (e.g. on i386
with PAE) and must not be formatted with %p.
Compile-tested only.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Roland Dreier <roland@purestorage.com>
To guarantee that all unused fields in all FW commands for both inboxes
and outboxes are zeroed out, initialize the mailbox buffer to all zeroes.
This is especially important for SRIOV comm-channel virtual commands
(such as QUERY_FUNC_CAP), where if new fields are added to support new
features, the driver can depend on older kernels passing zeroes in these
fields.
In addition to zeroing out the mailbox buffer at allocation time, all
(now unnecessary) calls to memset by the callers of
mlx4_alloc_cmd_mailbox() are removed.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is step #1 for implementing SRIOV resource quotas for VFs.
Quotas are implemented per resource type for VFs and the PF, to prevent
any entity from simply grabbing all the resources for itself and leaving
the other entities unable to obtain such resources.
Resources which are allocated using quotas: QPs, CQs, SRQs, MPTs, MTTs, MAC,
VLAN, and Counters.
The quota system works as follows:
Each entity (VF or PF) is given a max number of a given resource (its quota),
and a guaranteed minimum number for each resource (starvation prevention).
For QPs, CQs, SRQs, MPTs and MTTs:
50% of the available quantity for the resource is divided equally among
the PF and all the active VFs (i.e., the number of VFs in the mlx4_core module
parameter "num_vfs"). This 50% represents the "guaranteed minimum" pool.
The other 50% is the "free pool", allocated on a first-come-first-serve basis.
For each VF/PF, resources are first allocated from its "guaranteed-minimum"
pool. When that pool is exhausted, the driver attempts to allocate from
the resource "free-pool".
The quota (i.e., max) for the VFs and the PF is:
The free-pool amount (50% of the real max) + the guaranteed minimum
For MACs:
Guarantee 2 MACs per VF/PF per port. As a result, since we have only
128 MACs per port, reduce the allowable number of VFs from 64 to 63.
Any remaining MACs are put into a free pool.
For VLANs:
For the PF, the per-port quota is 128 and guarantee is 64
(to allow the PF to register at least a VLAN per VF in VST mode).
For the VFs, the per-port quota is 64 and the guarantee is 0.
We assume that VGT VFs are trusted not to abuse the VLAN resource.
For Counters:
For all functions (PF and VFs), the quota is 128 and the guarantee is 0.
In this patch, we define the needed structures, which are added to the
resource-tracker struct. In addition, we do initialization
for the resource quota, and adjust the query_device response to use quotas
rather than resource maxima.
As part of the implementation, we introduce a new field in
mlx4_dev: quotas. This field holds the resource quotas used
to report maxima to the upper layers (ib_core, via query_device).
The HCA maxima of these values are passed to the VFs (via
QUERY_HCA) so that they may continue to use these in handling
QPs, CQs, SRQs and MPTs.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The create_flow/destroy_flow uverbs and the associated extensions to
the user-kernel verbs ABI are under review and are too experimental to
freeze at this point.
So userspace is not exposed to experimental features and an uinstable
ABI, temporarily disable this for v3.12 (with a Kconfig option behind
staging to reenable it if desired).
The feature will be enabled after proper cleanup for v3.13.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1381351016.git.ydroneaud@opteya.com
Link: http://marc.info/?i=cover.1381177342.git.ydroneaud@opteya.com
[ Add a Kconfig option to reenable these verbs. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Call mlx5_ib_populate_pas() before mapping the DMA buffer to ensure
the hardware reads the values written by the CPU.
Found by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The hardware requires that gather buffers for UMR work requests be
aligned to 2K.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
It's helpful for a driver to put the pci slot name in its interrupt
names, so /proc/interrupts will show the pci slot of the device.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Failed to configure opt mask to configure rre from init to rtr.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently Atomic operations don't work properly. Disable them for the
time being.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
On a single ported Connect-IB, its possible for the firmware to issue
events on the non-existing 2nd port. Make sure to ignore events
generated for such ports.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Change the logic so we do not allocate memory nor map the device
before actually posting to the REG_UMR QP. In addition, unmap and free
the memory after we get completion.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The patch fixes the rollback in case of failure in creating SRQ.
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Destroying the workqueue without flushing it first can lead to a case
in which the kernel tries to push a delayed work to the workqueue
which does not exist anymore.
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. Make sure wqe_cnt does not exceed the limit published by firmware.
2. There is no requirement that the number of outstanding work
requests will be a power of two. Remove the ilog2 in the
calculation of sq.max_post to fix that.
3. Add case for IB_QPT_XRC_TGT in sq_overhead and return 0 as XRC
target QPs do not have a send queue.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The callers of qib_tune_pcie_caps() and qib_tune_pcie_coalesce() don't
check the return values, so this patch drops the return values altogether.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Refactor qib_tune_pcie_caps(). Use pcie_get_mps(), pcie_set_mps(),
pcie_get_readrq(), and pcie_set_readrq() to simplify the code. The PCI
core caches the "PCIe Max Payload Size Supported" in pci_dev->pcie_mpss,
so use that instead of pcie_capability_read_word(). Remove the unused
val2fld() and fld2val().
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Use pci_is_root_bus() instead of "if (bus->parent)" statement
for better readability.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull trivial tree from Jiri Kosina:
"The usual trivial updates all over the tree -- mostly typo fixes and
documentation updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
doc: Documentation/cputopology.txt fix typo
treewide: Convert retrun typos to return
Fix comment typo for init_cma_reserved_pageblock
Documentation/trace: Correcting and extending tracepoint documentation
mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
power: Documentation: Update s2ram link
doc: fix a typo in Documentation/00-INDEX
Documentation/printk-formats.txt: No casts needed for u64/s64
doc: Fix typo "is is" in Documentations
treewide: Fix printks with 0x%#
zram: doc fixes
Documentation/kmemcheck: update kmemcheck documentation
doc: documentation/hwspinlock.txt fix typo
PM / Hibernate: add section for resume options
doc: filesystems : Fix typo in Documentations/filesystems
scsi/megaraid fixed several typos in comments
ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
page_isolation: Fix a comment typo in test_pages_isolated()
doc: fix a typo about irq affinity
...
Fix:
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c: In function 'ocrdma_build_fr':
>> drivers/infiniband/hw/ocrdma/ocrdma_verbs.c:1832:7: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
mr = (struct ocrdma_mr *)qp->dev->stag_arr[(hdr->lkey >> 8) &
^
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c: In function 'ocrdma_alloc_frmr':
>> drivers/infiniband/hw/ocrdma/ocrdma_verbs.c:2661:64: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
dev->stag_arr[(mr->hwmr.lkey >> 8) & (OCRDMA_MAX_STAG - 1)] = (u64) mr;
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 36a8f01cd2 ("IB/qib: Add congestion control agent
implementation") caused statements to leak pass the header guard.
Fix this.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In UMC case, driver needs to fill PVID in the address vector
template for UD traffic.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add ABI versioning support between driver and userspace library.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
While posting inline DPP data, we are not considering multiple sges.
Fix this.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1) Increase STAG Array size.
2) Max inline data size should be set to the same value
used during QP creation
3) Set max_sge_rd to zero since we dont support RD transport in our adapters.
4) Max cqes reported in ibv_devinfo should be from QUERY_CONFIG.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Create_CQ verb doesn't provide a PD pointer. So, until now we are
creating all (both userspace and kernel) CQ DB regions from PD0. This
will result in mmapping PD0 to applications. A rogue userspace
application can mess things up.
Also more serious issues is even the be2net NIC uses PD0.
This patch addresses this problem by:
1) Create a PD page for every userspace application when the
alloc_ucontext is called. This will be destroyed in
dealloc_ucontext.
2) All CQs for that context will use the PD allocated in ucontext.
3) The first create_PD call from application will result in returning
the PD address from its ucontext (no new PD will be created).
4) For subsecquent create_pd calls from application, we create new PDs for
the application.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1) Fixed setting FR_MR bit for FRWR stag allocation
2) Access rights are passsed during FRWR stage and not during STAT allocation stage
3) FRWR WQE structure cleanup
4) Add QP level signaled bit.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1) All RQ doorbells are handled by ERX2 and doorbell->num_posted
offset is constant to bit offset 24 for ERX2 irrspective of Q id.
2) Fixed RESET to INIT state change (from ERR->RST->INIT->RTR case).
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There are cases like SRIOV where can get only one MSI-X vector
allocated for RoCE. In that case we need to use the vector for both
data plane and control plane. We need to use EQ create version V2.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Also get the max_srq value from query_config mailbox response.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1) Fix ocrdma_get_num_posted_shift for upto 128 QPs.
2) Create for min of dev->max_wqe and requested wqe in create_qp.
3) As part of creating ird queue, populate with basic header templates.
4) Make sure all the DB memory allocated to userspace are page aligned.
5) Fix issue in checking the mmap local cache.
6) Some code cleanup.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement ib_create_flow() and ib_destroy_flow().
Translate the verbs structures provided by the user to HW structures
and call the MLX4_QP_FLOW_STEERING_ATTACH/DETACH firmware commands.
On the ATTACH command completion, the firmware provides a 64-bit
registration ID, which is placed into struct mlx4_ib_flow that wraps
the instance of struct ib_flow which is retuned to caller. Later,
this reg ID is used for detaching that flow from the firmware.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.
Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Lustre uses a advertised max MR size of ~0ULL to indicate it should
use a dma_mr. Hence advertise max MR size as ~0ULL.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When polling, we do a GTS update if the accumulated cidx_inc == the CQ
depth / 16. However, if the CQ is large enough, Cq depth / 16 exceeds
the size of the field in the GTS word. So we also need to update if
cidx_inc hits CIDXINC_MASK to avoid overflowing the field.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
accept_cr() failed to set the arp error handler on a reused skb. This
results in a kernel crash if the arp does indeed time out.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When determining how many WRs are completed with a signaled CQE,
correctly deal with queue wraps.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch makes following fixes in QP flush logic:
- correctly flushes unsignaled WRs followed by a signaled WR
- supports for flushing a CQ bound to multiple QPs
- resets cidx_flush if a active queue starts getting HW CQEs again
- marks WQ in error when we leave RTS. This was only being done for
user queues, but we need it for kernel queues too so that
post_send/post_recv will start returning the appropriate error
synchronously
- eats unsignaled read resp CQEs. HW always inserts CQEs so we must
silently discard them if the read work request was unsignaled.
- handles QP flushes with pending SW CQEs. The flush and out of order
completion logic has a bug where if out of order completions are
flushed but not yet polled by the consumer and the qp is then
flushed then we end up inserting duplicate completions.
- c4iw_flush_sq() should only flush wrs that have not already been
flushed. Since we already track where in the SQ we've flushed via
sq.cidx_flush, just start at that point and flush any remaining.
This bug only caused a problem in the presence of unsignaled work
requests.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
[ Fixed sparse warning due to htonl/ntohl confusion. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move QP to TERMINATE instead to allow the peer to get the TERM
message. This bug wasn't detectable until newer FW that moves
connections out of RDMA mode as soon as an error is detected.
QP can exit RTS before the last AE arrives. This was introduced by
changes in the FW to kick connections out of RDMA mode as soon as an
error is detected. A side effect of this is that the driver can move
the QP out of RTS before the AE causing the connection to get kicked
out of RDMA mode is processed. Fix for this is to always post async
errors even if the QP is out of RTS.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add new cpl messages, cpl_act_open_req6 and cpl_t5_act_open_req6, for
initiating active open connections.
Use LLD api cxgb4_create_server and cxgb4_create_server6 for
initiating passive open connections. Similarly use cxgb4_remove_server
to remove the passive open connections in place of listen_stop.
Add support for iWARP over VLAN device and enable IPv6 support on VLAN device.
Make use of import_ep in c4iw_reconnect.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
[ Fix build when IPv6 is disabled and make sure iw_cxgb4 is not built-in
when ipv6 is a module. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
PCI core will initialize device MSI/MSI-X capability in
pci_msi_init_pci_dev(). So device drivers should use
pci_dev->msi_cap/msix_cap to determine whether a device supports
MSI/MSI-X instead of using pci_find_capability(pci_dev,
PCI_CAP_ID_MSI/MSIX). Access to PCIe device config space again will
consume more time.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
struct pci_driver qib_driver is only used in qib_init.c. Remove it
from qib.h and make it static in qib_init.c.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. The code accepts chunks of messages, and splits the chunk into
packets when converting packets into sdma queue entries. Adjacent
packets will use user buffer pages smartly to avoid pinning the
same page multiple times.
2. Instead of discarding all the work when SDMA queue is full, the
work is saved in a pending queue. Whenever there are enough SDMA
queue free entries, pending queue is directly put onto SDMA queue.
3. An interrupt handler is used to progress this pending queue.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: CQ Tang <cq.tang@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
[ Fixed up sparse warnings. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify the type of local_addr and remote_addr fields in struct
iw_cm_id from struct sockaddr_in to struct sockaddr_storage to hold
IPv6 and IPv4 addresses uniformly.
Change the references of local_addr and remote_addr in cxgb4, cxgb3,
nes and amso drivers to match this. However to be able to actully run
traffic over IPv6, low-level drivers have to add code to support this.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
[ Fix unused variable warnings when INFINIBAND_NES_DEBUG not set.
- Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
1) In post recv, don't ring the DB doorbell if the QP is in RTR state.
Cache the DB calls, until the QP is moved to RTS state.
2) Add max_rd_sge support to dev->attr.
3) Code cleanup in alloc_pd path.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1) Remove __packed for structures.
2) Align and pad all ABI structure to 64 bit boundaries
instead of using __packed.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove QP state machine in ocrdma low-level driver and use on the core
IB stack's instead.
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In the sq_overhead() function, if qp_typ is equal to IB_QPT_RC, size
will be used uninitialized.
Signed-off-by: Andi Shyti <andi@etezian.org>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We don't set "resp.reserved". Since it's at the end of the struct
that means we don't have to copy it to the user.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix to return a negative error code from the error handling case
instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When creating tunnel QPs for special QP tunneling, look for the
default pkey in the slave's virtual pkey table. If it is present, use
the real pkey index where the default pkey is located.
If the default pkey is not found in the pkey table, use the real pkey
index which is stored at index 0 in the slave's virtual pkey table
(this is the current behavior).
This change is required to support cloud computing, where the
paravirtualized index of the default pkey is moved to index 1 or
higher. The pkey at paravirtualized index 0 is used for the default
IPoIB interface created by the VF.
Its possible for the pkey value at paravirtualized index 0 to be
invalid (zero) at VF probe time (pkey index 0 is mapped to real pkey
index 127, which contains pkey = 0).
At some point after the VF probe, the cloud computing interface at the
hypervisor maps virtual index 0 for the VF to the pkey index
containing the pkey that IPoIB will use in its operation. However,
when the tunnel QP is created, the pkey at the slave's virtual index 0
is still mapped to the invalid pkey index, so tunnel QP creation
fails.
This commit causes the hypervisor to search for the default pkey in
the slave's pkey table -- and this pkey is present in the table (at
index > 0) at tunnel QP creation time, so that the tunnel QP creation
will succeed.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>