If an mlx4 device with default FW (which gives a UAR BAR size of 8 MB)
is used in a system with 64 KB pages, then there are only 8192/64==128
UAR pages available. However, the first 128 UAR pages are reserved
for use with event queue doorbells, so no UAR pages are available to
do anything else with, which means that the driver cannot work.
The current driver fails with a fairly cryptic "Failed to allocate
driver access region, aborting" message in this situation. Fix the
driver to detect the problem earlier and print out a clearer
description of the problem and a suggestion of how to fix it (use a
new firmware image).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for the following operations to mlx4 when device firmware
supports them:
- Send with invalidate and local invalidate send queue work requests;
- Allocate/free fast register MRs;
- Allocate/free fast register MR page lists;
- Fast register MR send queue work requests;
- Local DMA L_Key.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
MTT entries are allocated with a buddy allocator, which just keeps
bitmaps for each level of the buddy table. However, all free space
starts out at the highest order, and small allocations start scanning
from the lowest order. When the lowest order tables have no free
space, this can lead to scanning potentially millions of bits before
finding a free entry at a higher order.
We can avoid this by just keeping a count of how many free entries
each order has, and skipping the bitmap scan when an order is
completely empty. This provides a nice performance boost for a
negligible increase in memory usage.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add ICM_ERROR firmware status code. In mapping to errnos, -ENFILE
seems closest.
This is in preparation for providing more detailed log info using
mlx4_err() in low-level driver when a non-zero status is returned.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a module parameter "enable_qos" to mlx4_core. If this param is
set, enable support for QoS in the INIT_HCA command. By default, the
parameter is set to 0 (disabled).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There was a bug in some versions of the mlx4 driver in
mlx4_alloc_fmr(), which hardcoded the minimum acceptable page_shift to
be 12. However, new ConnectX firmware can support a minimum
page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- so with
old drivers, ib_fmr_alloc() would fail for ULPs using the device
minimum when creating FMRs.
To preserve firmware compatibility with released mlx4 drivers, the
firmware will continue to return 12 as before for log_page_sz in
QUERY_DEV_CAP for these drivers. However, to enable new drivers to
take advantage of the available smaller page size, the mlx4 driver now
first sets the log_pg_sz to the device minimum by setting a
log_page_sz value to 0 via the MOD_STAT_CFG command and then reading
the real minimum via QUERY_DEV_CAP.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for handling the IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK
flag by using the per-multicast group loopback blocking feature of
mlx4 hardware.
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't hard code a test against a minimum page shift of 12, since the
device may support smaller pages. Test against the actual smallest
page size from the device capabilities.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a FMR is unmapped, mlx4 resets the map count to 0, and clears the
upper part of the R_Key which is used as the sequence counter.
This poses a problem for RDS, which uses ib_fmr_unmap as a fence
operation. RDS assumes that after issuing an unmap, the old R_Keys
will be invalid for a "reasonable" period of time. For instance,
Oracle processes uses shared memory buffers allocated from a pool of
buffers. When a process dies, we want to reclaim these buffers -- but
we must make sure there are no pending RDMA operations to/from those
buffers. The only way to achieve that is by using unmap and sync the
TPT.
However, when the sequence count is reset on unmap, there is a high
likelihood that a new mapping will be given the same R_Key that was
issued a few milliseconds ago.
To prevent this, don't reset the sequence count when unmapping a FMR.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Extend the mlx4_cq_resize() API with a way to set the "collapsed" flag
for the CQ being created.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Avoid duplicating code in ethernet and FC modules.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Wrap doorbell, buffer and MTT allocation in helper functions for
ethernet and FC modules to use.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The call to mlx4_MODIFY_CQ() had a typo so that mlx4_cq_resize() was
actually asking the FW to modify a CQ's interrupt moderation rather than
asking it to resize a CQ.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In addition to mlx4_ib, there will be ethernet and FC consumers of
mlx4_core, so move the code for managing kernel doorbells into the
core module to avoid having to duplicate this multiple times.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4 hardware does not support external DDR memory. Moreover, UAR
area (BAR 2) can change depending on FW version.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When detaching the last QP from an MCG entry, we need to make
sure that at any time, there will be no entry with zero number of
QPs which is linked to the list of the MCGs of the corresponding
hash index. So don't write back the MCG entry if we are removing the
last QP; just unlink the entry.
Also, remove an unnecessary MCG read when attaching a QP requires
allocation of a new entry in the AMGM.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
With the advent large clusters which utilize multicore hosts, 64K QPs
is not enough. We should increase the default maximum for QPs to 128K.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX devices support checksum generation and verification of TCP
and UDP packets for UD IPoIB messages. This patch checks if the HCA
supports this and sets the IB_DEVICE_UD_IP_CSUM capability flag if it
does. It implements support for handling the IB_SEND_IP_CSUM send
flag and setting the csum_ok field in receive work completions.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Ali Ayub <ali@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The struct mlx4_interface.event() method was supposed to get an enum
mlx4_dev_event, but the driver code was actually passing in the
hardware enum mlx4_event values. Fix up the callers of
mlx4_dispatch_event() so that they pass in the right type of value,
and fix up the event method in mlx4_ib so that it can handle the enum
mlx4_dev_event values.
This eliminates the need for the subtype parameter to the event
method, so remove it.
This also fixes the sparse warning
drivers/net/mlx4/intf.c:127:48: warning: mixing different enum types
drivers/net/mlx4/intf.c:127:48: int enum mlx4_event versus
drivers/net/mlx4/intf.c:127:48: int enum mlx4_dev_event
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_table_find (for FMR MPTs) requires that ICM memory already be
mapped. Before this fix, FMR allocation depended on ICM memory
already being mapped for the MPT entry. If all currently mapped
entries are taken, the find operation fails (even if the MPT ICM table
still had more entries, which were just not mapped yet).
This fix moves the mpt find operation to fmr_enable, to guarantee that
any required ICM memory mapping has already occurred.
Found by Oren Duer of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 313abe55 ("mlx4_core: For 64-bit systems, vmap() kernel queue
buffers") caused this to pop up on powerpc allyesconfig, looks like a
missing include file:
drivers/net/mlx4/alloc.c: In function 'mlx4_buf_alloc':
drivers/net/mlx4/alloc.c:162: error: implicit declaration of function 'vmap'
drivers/net/mlx4/alloc.c:162: error: 'VM_MAP' undeclared (first use in this function)
drivers/net/mlx4/alloc.c:162: error: (Each undeclared identifier is reported only once
drivers/net/mlx4/alloc.c:162: error: for each function it appears in.)
drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast
drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free':
drivers/net/mlx4/alloc.c:187: error: implicit declaration of function 'vunmap'
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Now that struct mlx4_buf.u is a struct instead of a union because of
the vmap() changes, there's no point in having a struct at all. So
move .direct and .page_list directly into struct mlx4_buf and get rid
of a bunch of unnecessary ".u"s.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since kernel virtual memory is not a problem on 64-bit systems, there
is no reason to use our own 2-layer page mapping scheme for large
kernel queue buffers on such systems. Instead, map the page list to a
single virtually contiguous buffer with vmap(), so that can we access
buffer memory via direct indexing.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The firmware QUERY_ADAPTER command does not return vendor_id,
device_id, and revision_id; eliminate these fields from the query.
Initialize the rev_id field of the mlx4 device via init_node_data (MAD
IFC query), as is done in the query_device verb implementation.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 3d73c288 ("mlx4_core: Fix section mismatches") fixed some of
the section mismatches introduced when error recovery was added, but
there were still more cases of errory recovery code calling into
__devinit code from regular .text. Fix this by getting rid of the
now-incorrect __devinit annotations.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
log_max_eqs is a 4-bit field, not a 3-bit field in the response to the
QUERY_DEV_CAP FW command, so we should mask with 0xf instead of 0x7
when reading it.
Found by Yossi Leybovitch of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When checking the states passed in, mlx4_qp_modify() accidentally checks
cur_state twice rather than checking cur_state and new_state. Fix this
to make sure that both values are in-bounds.
Since these values may be passed in from userspace, this bug results in
userspace being able to trigger an oops.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: stable <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix thinko in commit eaf559bf ("mlx4_core: Don't free special QPs in
QP number bitmap"). The old commit had the logic exactly backwards
and ended up freeing *only* special QPs, which not only left the
original bug in place but also introduced the problem that the QP
number bitmap would get full after a while.
Found by Dotan Barak of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When mlx4_buf_free() is called from the error path of
mlx4_buf_alloc(), it may be passed a buffer structure that does not
have all pages filled in. Add a check for NULL to mlx4_buf_free() so
we avoid passing NULL to dma_free_coherent() (which will crash).
Signed-off-by: Ali Ayoub <ali@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Most drivers need to set length and offset as well, so may as well fold
those three lines into one.
Add sg_assign_page() for those two locations that only needed to set
the page, where the offset/length is set outside of the function context.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
mlx4_core: Increase command timeout for INIT_HCA to 10 seconds
IPoIB/cm: Use common CQ for CM send completions
IB/uverbs: Fix checking of userspace object ownership
IB/mlx4: Sanity check userspace send queue sizes
IPoIB: Rewrite "if (!likely(...))" as "if (unlikely(!(...)))"
IB/ehca: Enable large page MRs by default
IB/ehca: Change meaning of hca_cap_mr_pgsize
IB/ehca: Fix ehca_encode_hwpage_size() and alloc_fmr()
IB/ehca: Fix masking error in {,re}reg_phys_mr()
IB/ehca: Supply QP token for SRQ base QPs
IPoIB: Use round_jiffies() for ah_reap_task
RDMA/cma: Fix deadlock destroying listen requests
RDMA/cma: Add locking around QP accesses
IB/mthca: Avoid alignment traps when writing doorbells
mlx4_core: Kill mlx4_write64_raw()
The current INIT_HCA firmware command timeout is sufficient for the
default number of resources (QPs, CQs, etc) being allocated, but if
the HCA profile is modified to increase the amount of resources, then
a spurious timeout is detected and HCA initialization fails.
Increase the timeout for the INIT_HCA command to 10 seconds, which
also brings it into line with all the other command timeouts.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 3d73c288 ("mlx4_core: Fix section mismatches") introduced a
stupid bug in device init: when some of mlx4_init_one() was split off
into __mlx4_init_one(), the call from the main mlx4_init_one()
function was back to mlx4_init_one() rather than to __mlx4_init_one(),
which leads to an obvious infinite loop if the function is every
called.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit ee49bd93 ("mlx4_core: Reset device when internal error is
detected") introduced some section mismatch problems when
CONFIG_HOTPLUG=n, because the error recovery code tears down and
reinitializes the device after everything is loaded, which ends up
calling into lots of code marked __devinit and __devexit from regular
.text. Fix this by getting rid of these now-incorrect section
markers.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Firmware commands are sent to the HCA by writing multiple words to a
command register block. Access to this block of registers is
serialized with a mutex. However, on large SGI systems writes to the
register block may be reordered within the system interconnect and
reach the HCA in a different order than they were issued (even with
the mutex). Fix this by adding an mmiowb() before dropping the mutex.
This bug was observed with real workloads with the similar FW command
code in the mthca driver, and adding the mmiowb() as in commit
66547550 ("IB/mthca: Use mmiowb() to avoid firmware commands getting
jumbled up") was confirmed to fix the problems, so we should add the
same fix to mlx4.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Increase the number of QPs allowed per multicast group from 8 to 56.
This allows for one QP per core on 16-core systems, which are now
quite common, and allows some space for future growth.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Implement FMRs for mlx4. This is an adaptation of code from mthca.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Write MTT entries directly to ICM from the driver (eliminating use of
WRITE_MTT command). This reduces the number of FW commands needed to
register an MR by at least a factor of 2 and speeds up memory
registration significantly. This code will also be used to implement
FMRs.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Everything that uses caps.reserved_mtts expects it to be a count of MTT
segments, not MTT entries. So convert the value that the FW gives us to
a count of segments.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Taking ilog2(dev->caps.reserved_mtts) to find out the order to pass to
the MTT buddy allocator will do the wrong thing if reserved_mtts is ever
not a power of 2. Be safe and use fls(dev->caps.reserved_mtts - 1).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Enable having ICM tables in coherent memory, and use coherent memory
for the dMPT table. This will allow writing MPT entries for MRs both
via the SW2HW_MPT command and also directly by the driver for FMR
remapping without needing to flush or worry about cacheline boundaries.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
display the following device information under /sys/class/infiniband/mlx4_X:
board_id, fw_ver, hw_rev, hca_type.
This patch makes this information available to userspace utilities
such as ibstat and ibv_devinfo.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The SRC ("scalable RC") transport has been renamed to XRC ("extended
RC"), to avoid having an abbreviation that is so easily confused with an
abbreviation for "source." Update the HCA capability decoding output to
use the new name.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_srq_query() returns a big-endian 16-bit value through an int *,
which screws up sparse checking. Fix this so that a CPU-endian value
is returned.
Signed-off-by: Roland Dreier <rolandd@cisco.com>