InfiniBand HCAs replicate multicast packets back to the QP that sent
them if that QP is attached to the destination multicast group. This
means that IPoIB multicasts are often replicated back to the receive
queue of the interface that generated them. To avoid confusing the
network stack, we drop these duplicates within the IPoIB driver.
However, there's no reason to free the skb that received the duplicate
and then immediately allocate a new skb to post to the receive queue.
We can be more efficient and just repost the same skb.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since NAPI polling is disabled while ipoib_cm_dev_stop() is running,
ipoib_cm_dev_stop() must poll the CQ itself in order to see the
packets draining.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
SM reconfiguration or failover possibly causes a shuffling of the values
in the P_Key table. Right now, IPoIB only queries for the P_Key index
once when it creates the device QP, and hence there are problems if the
index of a P_Key value changes. Fix this by using the PKEY_CHANGE event
to trigger a recheck of the P_Key index.
Signed-off-by: Yosef Etigin <yosefe@voltaire.com>
Acked-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert the IP-over-InfiniBand network device driver over to using
NAPI to handle completions for the main CQ. This covers all receives
as well as datagram mode sends; send completions for connected mode
connections are still handled from interrupt context.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband: (49 commits)
IB: Set class_dev->dev in core for nice device symlink
IB/ehca: Implement modify_port
IB/umad: Clarify documentation of transaction ID
IPoIB/cm: spin_lock_irqsave() -> spin_lock_irq() replacements
IB/mad: Change SMI to use enums rather than magic return codes
IB/umad: Implement GRH handling for sent/received MADs
IB/ipoib: Use ib_init_ah_from_path to initialize ah_attr
IB/sa: Set src_path_bits correctly in ib_init_ah_from_path()
IB/ucm: Simplify ib_ucm_event()
RDMA/ucma: Simplify ucma_get_event()
IB/mthca: Simplify CQ cleaning in mthca_free_qp()
IB/mthca: Fix mthca_write_mtt() on HCAs with hidden memory
IB/mthca: Update HCA firmware revisions
IB/ipath: Fix WC format drift between user and kernel space
IB/ipath: Check that a UD work request's address handle is valid
IB/ipath: Remove duplicate stuff from ipath_verbs.h
IB/ipath: Check reserved memory keys
IB/ipath: Fix unit selection when all CPU affinity bits set
IB/ipath: Don't allow QPs 0 and 1 to be opened multiple times
IB/ipath: Disable IB link earlier in shutdown sequence
...
For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.
This one touches just the most simple case, next will handle the slightly more
"complex" cases.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no point in printing the opcode field in the completion
handling debugging output, since the type of completion is already
printed at the beginning of the line. In fact the opcode field is not
even defined for completions with a status other than success.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The packet length checks in ipoib are broken: we add 4 bytes (IPoIB
encapsulation header) when sending a packet, not 20 bytes (hardware
address length) to each packet. Therefore, if connected mode is
enabled so that the interface MTU is larger than the multicast MTU,
IPoIB may end up trying to send too-long multicast packets. For
example, multicast is broken if a message of size 2048 bytes is sent
on an interface with UD MTU 2048, because 2048 is bigger than the real
limit of 2044 but the code tests against the wrong limit of 2060.
This patch fixes <https://bugs.openfabrics.org/show_bug.cgi?id=418>,
submitted by Scott Weitzenkamp <sweitzen@cisco.com>.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The following patch adds experimental support for IPoIB connected
mode, as defined by the draft from the IETF ipoib working group. The
idea is to increase performance by increasing the MTU from the maximum
of 2K (theoretically 4K) supported by IPoIB on top of UD. With this
code, I'm able to get 800MByte/sec or more with netperf without
options on a Mellanox 4x back-to-back DDR system.
Some notes on code:
1. SRQ is used for scalability to large cluster sizes
2. Only RC connections are used (UC does not support SRQ now)
3. Retry count is set to 0 since spec draft warns against retries
4. Each connection is used for data transfers in only 1 direction, so
each connection is either active(TX) or passive (RX). 2 sides that
want to communicate create 2 connections.
5. Each active (TX) connection has a separate CQ for send completions -
this keeps the code simple without CQ resize and other tricks
6. To detect stale passive side connections (where the remote side is
down), we keep an LRU list of passive connections (updated once per
second per connection) and destroy a connection after it has been
unused for several seconds. The LRU rule makes it possible to avoid
scanning connections that have recently been active.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert IPoIB to use the new DMA mapping functions
for kernel verbs consumers.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When ipoib_ib_dev_flush() is called because of a port event, the
driver needs to rejoin all multicast groups, since the flush will call
ipoib_mcast_dev_flush() (via ipoib_ib_dev_down()). Otherwise no
(non-broadcast) multicast groups will be rejoined until the networking
core calls ->set_multicast_list again, and so multicast reception will
be broken for potentially a long time.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Split up ipoib_ib_handle_wc() into ipoib_ib_handle_rx_wc() and
ipoib_ib_handle_tx_wc() to make the code easier to read. This will
also help implement NAPI in the future.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The comparisons of priv->tx_tail to ah->last_send in ipoib_free_ah()
and ipoib_post_receive() are slightly unsafe, because priv->tx_lock is
not held and hence a stale value of ah->last_send might be used, which
would lead to freeing an AH before the driver was really done with it.
The simple way to fix this is to the optimization of early free from
ipoib_free_ah() and unconditionally queue AHs for reaping, and then
take priv->tx_lock in __ipoib_reap_ah().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When ipoib_stop() is called it first calls netif_stop_queue() to stop
the kernel from passing more packets to the network driver. However,
the completion handler may call netif_wake_queue() re-enabling packet
transfer.
This might result in leaks (we see AH leaks which we think can be
attributed to this bug) as new packets get posted while the interface
is going down.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make IPoIB's send and receive queue sizes tunable via module
parameters ("send_queue_size" and "recv_queue_size"). This allows the
queue sizes to be enlarged to fix disastrously bad performance on some
platforms and workloads, without bloating memory usage when large
queues aren't needed.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch causes the network interface to respond to P_Key change
events correctly. As a result, you'll see a child interface in the
"RUNNING" state (netif_carrier_on()) only when the corresponding P_Key
is configured by the SM. When SM removes a P_Key, the "RUNNING" state
will be disabled for the corresponding network interface. To
implement this, I added IB_EVENT_PKEY_CHANGE event handling. To
prevent flushing the device before the device is open by the "delay
open" mechanism, I added an additional device flag called
IPOIB_FLAG_INITIALIZED.
This also prevents the child network interface from trying to join to
multicast groups until the PKEY is configured. We used to get error
messages like:
ib0.f2f2: couldn't attach QP to multicast group ff12:401b:f2f2:0:0:0:ffff:ffff
in this case. To fix this, I just check IPOIB_FLAG_OPER_UP flag in
ipoib_set_mcast_list().
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_ib_dev_flush() should get passed cpriv->dev, not &cpriv->dev.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move ipoib_ib_dev_flush() to ipoib's workqueue. This keeps it ordered
with respect to other work scheduled by the ipoib driver. This fixes
problems with races, for example:
- ipoib_ib_dev_flush() has started running because of an IB event
- user does ifconfig ib0 down
- ipoib_mcast_stop_thread() gets called twice and waits for the same
completion twice
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If posting receives in ipoib_ib_dev_open() fails, call
ipoib_ib_dev_stop() to move the device's QP back to the RESET state so
that we can try again later.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
semaphore to mutex conversion by Ingo and Arjan's script.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Sanity-checked on real IB hardware ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current handling of multicast groups in IPoIB ends up never
freeing send-only multicast groups. It turns out the logic was much
more complicated than it needed to be; we can fix this bug and
completely kill ipoib_mcast_dev_down() at the same time.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
race condition: ipoib_ib_dev_flush is accessing child list without locks.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't build ipoib_mcast_iter_ functions if CONFIG_INFINIBAND_IPOIB_DEBUG
is not enabled -- their only callers will not be built either.
Also move the prototype for ipoib_open() to ipoib.h to fix a sparse warning.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Minor cleanups: fix a misleading comment, and get rid of attr_mask
variables that are only used to hold constants (just use the constants
directly).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the way IPoIB handles RX packets when it can't allocate a new
receive skbuff. If the allocation of a new receive skb fails, we now
drop the packet we just received and repost the original receive skb.
This means that the receive ring always stays full and we don't have
to monkey around with trying to schedule a refill task for later.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_create_qp() no longer creates IPoIB's QP, so it shouldn't
destroy the QP on failure -- that unwinding happens elsewhere, so the
current code can cause a double free. While we're at it, the
function's name should match what it actually does, so rename it to
ipoib_init_qp().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_restart_task() is always called from within the
single-threaded IPoIB workqueue, so flushing the workqueue from within
the function can lead to a recursion overflow. But since we're
running in a single-threaded workqueue, we're already synchronized
against other items in the workqueue, so just get rid of the flush in
ipoib_mcast_restart_task().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the InfiniBand headers from drivers/infiniband/include to include/rdma.
This allows InfiniBand-using code to live elsewhere, and lets us remove the
ugly EXTRA_CFLAGS include path from the InfiniBand Makefiles.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make some lawyers happy and add copyright notices for people who
forgot to include them when they actually touched the code.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set skb->mac.raw on receive. This fixes crashes when this is
dereferenced, for example by netfilter or when PF_PACKET is used.
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <roland@topspin.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!