Currently the msk->flags bitmask carries both state for the
mptcp_release_cb() - mostly touched under the mptcp data lock
- and others state info touched even outside such lock scope.
As a consequence, msk->flags is always manipulated with
atomic operations.
This change splits such bitmask in two separate fields, so
that we use plain bit operations when touching the
cb-related info.
The MPTCP_PUSH_PENDING bit needs additional care, as it is the
only CB related field currently accessed either under the mptcp
data lock or the mptcp socket lock.
Let's add another mask just for such bit's sake.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can simplify the join list handling leveraging the
mptcp_release_cb(): if we can acquire the msk socket
lock at mptcp_finish_join time, move the new subflow
directly into the conn_list, otherwise place it on join_list and
let the release_cb process such list.
Since pending MPJ connection are now always processed
in a timely way, we can avoid flushing the join list
every time we have to process all the current subflows.
Additionally we can now use the mptcp data lock to protect
the join_list, removing the additional spin lock.
Finally, the MPJ handshake is now always finalized under the
msk socket lock, we can drop the additional synchronization
between mptcp_finish_join() and mptcp_close().
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Verify that, when multiple endpoints are available, subflows
creation proceed even when the first additional subflow creation
fails - due to packet drop on the relevant link
Co-developed-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the MPTCP configuration allows for multiple subflows
creation, and the first additional subflows never reach
the fully established status - e.g. due to packets drop or
reset - the in kernel path manager do not move to the
next subflow.
This patch introduces a new PM helper to cope with MPJ
subflow creation failure and delay and hook it where appropriate.
Such helper triggers additional subflow creation, as needed
and updates the PM subflow counter, if the current one is
closing.
Additionally start all the needed additional subflows
as soon as the MPTCP socket is fully established, so we don't
have to cope with slow MPJ handshake blocking the next subflow
creation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include into the path manager status a bitmap tracking the list
of local endpoints still available - not yet used - for the
relevant mptcp socket.
Keep such map updated at endpoint creation/deletion time, so
that we can easily skip already used endpoint at local address
selection time.
The endpoint used by the initial subflow is lazyly accounted at
subflow creation time: the usage bitmap is be up2date before
endpoint selection and we avoid such unneeded task in some relevant
scenarios - e.g. busy servers accepting incoming subflows but
not creating any additional ones nor annuncing additional addresses.
Overall this allows for fair local endpoints usage in case of
subflow failure.
As a side effect, this patch also enforces that each endpoint
is used at most once for each mptcp connection.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check for all MPJ variant at once, this reduces the number
of conditionals traversed on average and will simplify the
next patch.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since full-mesh endpoint support, the reception of a single ADD_ADDR
option can cause multiple subflows creation. When such option is
accepted we increment 'add_addr_accepted' by one. When we received
a paired RM_ADDR option, we deleted all the relevant subflows,
decrementing 'add_addr_accepted' by one for each of them.
We have a similar issue for 'local_addr_used'
Fix them moving the pm endpoint accounting outside the subflow
traversal.
Fixes: 1a0d6136c5 ("mptcp: local addresses fullmesh")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Performs several disconnect/reconnect on the same socket,
ensuring the overall transfer is succesful.
The new test leverages ioctl(SIOCOUTQ) to ensure all the
pending data is acked before disconnecting.
Additionally order alphabetically the test program arguments list
for better maintainability.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle explicitly AF_UNSPEC in mptcp_stream_connnect() to
allow user-space to disconnect established MPTCP connections
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, msk->subflow will never be deleted during
the whole msk lifetime. We don't need anymore to acquire references to
it in mptcp_stream_accept() and we can use the listener subflow accept
queue to simplify mptcp_poll() for listener socket.
Overall this removes a lock pair and 4 more atomic operations per
accept().
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current mptcp_disconnect() implementation lacks several
steps, we additionally need to reset the msk socket state
and flush the subflow list.
Factor out the needed helper to avoid code duplication.
Additionally ensure that the initial subflow is disposed
only after mptcp_close(), just reset it at disconnect time.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the MPTCP xmit path to add MP_FASTCLOSE suboption
on RST egress packets.
Additionally reorder related options writing to reduce
the number of conditionals required in the fast path.
Co-developed-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After shutdown, for fallback MPTCP sockets, we always have
write_seq == snd_una+1
The above will foul OUTQ ioctl(). Keep snd_una in sync with
write_seq even after shutdown.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Expose FEC per lane block counters via ethtool
2) Trivial fixes/updates/cleanup to mlx5e netdev driver
3) Fix htmldoc build warning
4) Spread mlx5 SFs (sub-functions) to all available CPU cores: Commits 1..5
Shay Drory Says:
================
Before this patchset, mlx5 subfunction shared the same IRQs (MSI-X) with
their peers subfunctions, causing them to use same CPU cores.
In large scale, this is very undesirable, SFs use small number of cpu
cores and all of them will be packed on the same CPU cores, not
utilizing all CPU cores in the system.
In this patchset we want to achieve two things.
a) Spread IRQs used by SFs to all cpu cores
b) Pack less SFs in the same IRQ, will result in multiple IRQs per core.
In this patchset, we spread SFs over all online cpus available to mlx5
irqs in Round-Robin manner. e.g.: Whenever a SF is created, pick the next
CPU core with least number of SF IRQs bound to it, SFs will share IRQs on
the same core until a certain limit, when such limit is reached, we
request a new IRQ and add it to that CPU core IRQ pool, when out of IRQs,
pick any IRQ with least number of SF users.
This enhancement is done in order to achieve a better distribution of
the SFs over all the available CPUs, which reduces application latency,
as shown bellow.
Machine details:
Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 cores.
PCI Express 3 with BW of 126 Gb/s.
ConnectX-5 Ex; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0
x16.
Base line test description:
Single SF on the system. One instance of netperf is running on-top the
SF.
Numbers: latency = 15.136 usec, CPU Util = 35%
Test description:
There are 250 SFs on the system. There are 3 instances of netperf
running, on-top three different SFs, in parallel.
Perf numbers:
# netperf SFs latency(usec) latency CPU utilization
affinity affinity (lower is better) increase %
1 cpu=0 cpu={0} ~23 (app 1-3) 35% 75%
2 cpu=0,2,4 cpu={0} app 1: 21.625 30% 68% (CPU 0)
app 2-3: 16.5 9% 15% (CPU 2,4)
3 cpu=0 cpu={0,2,4} app 1: ~16 7% 84% (CPU 0)
app 2-3: ~17.9 14% 22% (CPU 2,4)
4 cpu=0,2,4 cpu={0,2,4} 15.2 (app 1-3) 0% 33% (CPU 0,2,4)
- The first two entries (#1 and #2) show current state. e.g.: SFs are
using the same CPU. The last two entries (#3 and #4) shows the latency
reduction improvement of this patch. e.g.: SFs are on different CPUs.
- Whenever we use several CPUs, in case there is a different CPU
utilization, write the utilization of each CPU separately.
- Whenever the latency result of the netperf instances were different,
write the latency of each netperf instances separately.
Commands:
- for netperf CPU=0:
$ for i in {1..3}; do taskset -c 0 netperf -H 1${i}.1.1.1 -t TCP_RR -- \
-o RT_LATENCY -r8 & done
- for netperf CPU=0,2,4
$ for i in {1..3}; do taskset -c $(( ($i - 1) * 2 )) netperf -H \
1${i}.1.1.1 -t TCP_RR -- -o RT_LATENCY -r8 & done
================
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmHXh+AACgkQSD+KveBX
+j68fQgAghUX4TFS2JSwa7+XSCtzz7GIu2Xrz8aWTAnydRLlNXuFuuHYcNed6I0l
7DaVOZwHp1tp3tnx3WMGPUU6ujDPEgasaDDblvG2UXix5LPVEHDXY44ittQX8mpC
SC8Yj9mNo6DSfOMUZklFDMbw57XuLJ+HEGnwnrOEEyLX7ruDXGEViUmVBd4IoC3B
F2fJHBkdTJfHWTJRB4pWbZD1dw7WbKd0RyPla3OkoHugEUCKnbjii8cMwNM64Bbp
Pjz/SiShVy+NTotqPzRNjcx7y4tHOXCYt33zt1VlGtdUxs5eCA5jkjHFz0jb12Lu
rvfHaBaU+elMKTw5G/WMGJxZQx0kEQ==
=VBWY
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2022-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-01-06
1) Expose FEC per lane block counters via ethtool
2) Trivial fixes/updates/cleanup to mlx5e netdev driver
3) Fix htmldoc build warning
4) Spread mlx5 SFs (sub-functions) to all available CPU cores: Commits 1..5
Shay Drory Says:
================
Before this patchset, mlx5 subfunction shared the same IRQs (MSI-X) with
their peers subfunctions, causing them to use same CPU cores.
In large scale, this is very undesirable, SFs use small number of cpu
cores and all of them will be packed on the same CPU cores, not
utilizing all CPU cores in the system.
In this patchset we want to achieve two things.
a) Spread IRQs used by SFs to all cpu cores
b) Pack less SFs in the same IRQ, will result in multiple IRQs per core.
In this patchset, we spread SFs over all online cpus available to mlx5
irqs in Round-Robin manner. e.g.: Whenever a SF is created, pick the next
CPU core with least number of SF IRQs bound to it, SFs will share IRQs on
the same core until a certain limit, when such limit is reached, we
request a new IRQ and add it to that CPU core IRQ pool, when out of IRQs,
pick any IRQ with least number of SF users.
This enhancement is done in order to achieve a better distribution of
the SFs over all the available CPUs, which reduces application latency,
as shown bellow.
Machine details:
Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 cores.
PCI Express 3 with BW of 126 Gb/s.
ConnectX-5 Ex; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0
x16.
Base line test description:
Single SF on the system. One instance of netperf is running on-top the
SF.
Numbers: latency = 15.136 usec, CPU Util = 35%
Test description:
There are 250 SFs on the system. There are 3 instances of netperf
running, on-top three different SFs, in parallel.
Perf numbers:
# netperf SFs latency(usec) latency CPU utilization
affinity affinity (lower is better) increase %
1 cpu=0 cpu={0} ~23 (app 1-3) 35% 75%
2 cpu=0,2,4 cpu={0} app 1: 21.625 30% 68% (CPU 0)
app 2-3: 16.5 9% 15% (CPU 2,4)
3 cpu=0 cpu={0,2,4} app 1: ~16 7% 84% (CPU 0)
app 2-3: ~17.9 14% 22% (CPU 2,4)
4 cpu=0,2,4 cpu={0,2,4} 15.2 (app 1-3) 0% 33% (CPU 0,2,4)
- The first two entries (#1 and #2) show current state. e.g.: SFs are
using the same CPU. The last two entries (#3 and #4) shows the latency
reduction improvement of this patch. e.g.: SFs are on different CPUs.
- Whenever we use several CPUs, in case there is a different CPU
utilization, write the utilization of each CPU separately.
- Whenever the latency result of the netperf instances were different,
write the latency of each netperf instances separately.
Commands:
- for netperf CPU=0:
$ for i in {1..3}; do taskset -c 0 netperf -H 1${i}.1.1.1 -t TCP_RR -- \
-o RT_LATENCY -r8 & done
- for netperf CPU=0,2,4
$ for i in {1..3}; do taskset -c $(( ($i - 1) * 2 )) netperf -H \
1${i}.1.1.1 -t TCP_RR -- -o RT_LATENCY -r8 & done
================
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmHXj44ACgkQSD+KveBX
+j4cWggAsenq+AvDHtq2JwolF/IUiNetWiWIrIvnaKLBNcWAL4CvdMU7zNVpjnr5
03xuSn/LnCaCynnF52RSPNWH7RTue2gsVz28pR6cBUMyJ05DL7lxbjpgnu29E/mG
vPx8QR/DZjyWYjPt1d46niYKd2FK6egnlGrVrQ911s3zox35R1HHoWPtk+C1/s8i
iThuZXTcDBPNPPiwtbANENTPBjTabW9we7F8k5iZNzQiTDkxdH349ojlebNv37Zy
Q+BYJJfAxvpMm/kyA4lDAqehKRtbIUqK7n0c8WaF4kxCpA7VKQU7T0osCAa+S7JD
R1+juxG3xiZkToKqqMmAu4x0f3mDOw==
=KRes
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2022-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-01-06
This series provides bug fixes to mlx5 driver.
Please pull and let me know if there is any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This copies a u16 into the high bits of an int, which works on a big
endian system but not on a little endian system.
Fixes: 09572fca72 ("Bluetooth: hci_sock: Add support for BT_{SND,RCV}BUF")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The "opt" variable is a u32, but on some paths only the top bytes
were initialized and the others contained random stack data.
Fixes: a7b75c5a8c ("net: pass a sockptr_t into ->setsockopt")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch fixes the broken LED quirk for Intel legacy ROM devices.
To fix the LED issue that doesn't turn off immediately, the host sends
the SW RFKILL command while shutting down the interface and it puts the
devices in SW RFKILL state.
Once the device is in SW RFKILL state, it can only accept HCI_Reset to
exit from the SW RFKILL state. This patch checks the quirk for broken
LED and sends the HCI_Reset before sending the HCI_Intel_Read_Version
command.
The affected legacy ROM devices are
- 8087:07dc
- 8087:0a2a
- 8087:0aa7
Fixes: ffcba827c0 ("Bluetooth: btintel: Fix the LED is not turning off immediately")
Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2022-01-06
Victor adds restoring of advanced rules after reset.
Wojciech improves usage of switchdev control VSI by utilizing the
device's advanced rules for forwarding.
Christophe Jaillet removes some unneeded calls to zero bitmaps, changes
some bitmap operations that don't need to be atomic, and converts a
kfree() to a more appropriate bitmap_free().
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Use bitmap_free() to free bitmap
ice: Optimize a few bitmap operations
ice: Slightly simply ice_find_free_recp_res_idx
ice: improve switchdev's slow-path
ice: replay advanced rules after reset
====================
Link: https://lore.kernel.org/r/20220106183013.3777622-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel says:
====================
mlxsw: Add Spectrum-4 support
This patchset adds Spectrum-4 support in mlxsw. It builds on top of a
previous patchset merged in commit 10184da916 ("Merge branch
'mlxsw-Spectrum-4-prep'") and makes two additional changes before adding
Spectrum-4 support.
Patchset overview:
Patches #1-#2 add a few Spectrum-4 specific variants of existing ACL
keys. The new variants are needed because the size of certain key
elements (e.g., local port) was increased in Spectrum-4.
Patches #3-#6 are preparations.
Patch #7 implements the Spectrum-4 variant of the Bloom filter hash
function. The Bloom filter is used to optimize ACL lookups by
potentially skipping certain lookups if they are guaranteed not to
match. See additional info in merge commit ae6750e0a5 ("Merge branch
'mlxsw-spectrum_acl-Add-Bloom-filter-support'").
Patch #8 finally adds Spectrum-4 support.
====================
Link: https://lore.kernel.org/r/20220106160652.821176-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend existing driver for Spectrum, Spectrum-2 and Spectrum-3 ASICs
to support Spectrum-4 ASIC as well.
Currently there is no released firmware version for Spectrum-4, so the
driver is not enforcing a minimum version.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Spectrum-4 will calculate hash function for bloom filter differently
from the existing ASICs.
First, two hash functions will be used to calculate 16 bits result.
The final result will be combination of the two results - 6 bits which
are result of CRC-6 will be used as MSB and 10 bits which are result of
CRC-10 will be used as LSB.
Second, while in Spectrum{2,3}, there is a padding in each chunk, so the
chunks use a sequence of whole bytes, in Spectrum-4 there is no padding,
so each chunk use 20 bytes minus 2 bits, so it is necessary to align the
chunks to be without holes.
Add dedicated 'mlxsw_sp_acl_bf_ops' for Spectrum-4 and add the required
tables for CRC calculations.
All the details are documented as part of the code for future use.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.
As preparation for support of Spectrum-4 bloom filter, add 'ops'
structure to allow handling different calculation for different ASICs.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.
As preparation for support of Spectrum-4 bloom filter, rename CRC table
to include "sp2" prefix and "crc16", as next patch will add two additional
tables. In addition, rename all the dedicated functions and defines for
Spectrum-{2,3} to include "sp2" prefix.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
One of the changes is related to the way that the chunks will be build -
without padding.
As preparation for support of Spectrum-4 bloom filter, make
mlxsw_sp_acl_bf_key_encode() more flexible, so it will be able to use it
for Spectrum-4 as well.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, mlxsw_sp_acl_bf_rule_count_index_get() is implemented before
mlxsw_sp_acl_bf_index_get() but is used after it.
Adding a new function for Spectrum-4 would make them further apart still.
Fix by moving them around.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Spectrum-4 ASIC will support more virtual routers and local ports
compared to the existing ASICs. Therefore, the virtual router and local
port ACL key elements need to be increased.
Introduce new key elements for Spectrum-4 to be aligned with the elements
used already for other Spectrum ASICs.
The key blocks layout is the same for Spectrum-4, so use the existing
code for encode_block() and clear_block(), just create separate blocks.
Note that size of `VIRT_ROUTER_MSB` is 4 bits in Spectrum-4,
therefore declare it using `MLXSW_AFK_ELEMENT_INST_U32()`, in order to
be able to set `.avoid_size_check` to true.
Otherwise, `mlxsw_afk_blocks_check()` will fail and warn.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In Spectrum-4, the size of the virtual router ACL key element increased
from 11 bits to 12 bits.
In order to reuse the existing virtual router ACL key element
enumerators for Spectrum-4, rename 'VIRT_ROUTER_8_10' and
'VIRT_ROUTER_0_7' to 'VIRT_ROUTER_MSB' and 'VIRT_ROUTER_LSB',
respectively.
No functional changes intended.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ioana Ciornei says:
====================
dpaa2-eth: small cleanup
These 3 patches are just part of a small cleanup on the dpaa2-eth and
the dpaa2-switch drivers.
In case we are hitting a case in which the fwnode of the root dprc
device we initiate a deferred probe. On the dpaa2-switch side, if we are
on the remove path, make sure that we check for a non-NULL pointer
before accessing the port private structure.
====================
Link: https://lore.kernel.org/r/20220106135905.81923-1-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before accessing the port private structure make sure that there is
still a non-NULL pointer there. A NULL pointer access can happen when we
are on the remove path, some switch ports are unregistered and some are
in the process of unregistering.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We could get into a situation when the fwnode of the parent device is
not yet set because its probe didn't yet finish. When this happens, any
caller of the dpaa2_mac_open() will not have the fwnode available, thus
cause problems at the PHY connect time.
Avoid this by just returning -EPROBE_DEFER from the dpaa2_mac_open when
this happens.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The parent pointer node handler must be declared with a NULL
initializer. Before using it, a check must be performed to make
sure that a valid address has been assigned to it.
Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Refcount leak in ipt_CLUSTERIP rule loading path, from Xin Xiong.
2) Use socat in netfilter selftests, from Hangbin Liu.
3) Skip layer checksum 4 update for IP fragments.
4) Missing allocation of pcpu scratch maps on clone in
nft_set_pipapo, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nft_set_pipapo: allocate pcpu scratch maps on clone
netfilter: nft_payload: do not update layer 4 checksum when mangling fragments
selftests: netfilter: switch to socat for tests using -q option
netfilter: ipt_CLUSTERIP: fix refcount leak in clusterip_tg_check()
====================
Link: https://lore.kernel.org/r/20220106215139.170824-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2022-01-06
We've added 41 non-merge commits during the last 2 day(s) which contain
a total of 36 files changed, 1214 insertions(+), 368 deletions(-).
The main changes are:
1) Various fixes in the verifier, from Kris and Daniel.
2) Fixes in sockmap, from John.
3) bpf_getsockopt fix, from Kuniyuki.
4) INET_POST_BIND fix, from Menglong.
5) arm64 JIT fix for bpf pseudo funcs, from Hou.
6) BPF ISA doc improvements, from Christoph.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
bpf: selftests: Add bind retry for post_bind{4, 6}
bpf: selftests: Use C99 initializers in test_sock.c
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
bpf/selftests: Test bpf_d_path on rdonly_mem.
libbpf: Add documentation for bpf_map batch operations
selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3
xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames
xdp: Move conversion to xdp_frame out of map functions
page_pool: Store the XDP mem id
page_pool: Add callback to init pages when they are allocated
xdp: Allow registering memory model without rxq reference
samples/bpf: xdpsock: Add timestamp for Tx-only operation
samples/bpf: xdpsock: Add time-out for cleaning Tx
samples/bpf: xdpsock: Add sched policy and priority support
samples/bpf: xdpsock: Add cyclic TX operation capability
samples/bpf: xdpsock: Add clockid selection support
samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation
samples/bpf: xdpsock: Add VLAN support for Tx-only operation
libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API
libbpf 1.0: Deprecate bpf_map__is_offload_neutral()
...
====================
Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Menglong Dong says:
====================
From: Menglong Dong <imagedong@tencent.com>
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
The second patch use C99 initializers in test_sock.c
The third patch is the selftests for this modification.
Changes since v4:
- use C99 initializers in test_sock.c before adding the test case
Changes since v3:
- add the third patch which use C99 initializers in test_sock.c
Changes since v2:
- NULL check for sk->sk_prot->put_port
Changes since v1:
- introduce 'put_port' field for 'struct proto'
- add selftests for it
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With previous patch, kernel is able to 'put_port' after sys_bind()
fails. Add the test for that case: rebind another port after
sys_bind() fails. If the bind success, it means previous bind
operation is already undoed.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-4-imagedong@tencent.com
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
This reverts commit 410bd754cd.
The reverted commit had added a retry mechanism to the command entry
index allocation. The previous patch ensures that there is a free
command entry index once the command work handler holds the command
semaphore. Thus the retry mechanism is not needed.
Fixes: 410bd754cd ("net/mlx5: Add retry mechanism to the command entry index allocation")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Avoid a race where command work handler may fail to allocate command
entry index, by holding the command semaphore down till command entry
index is being freed.
Fixes: 410bd754cd ("net/mlx5: Add retry mechanism to the command entry index allocation")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently during NIC profile disablement all VXLAN udp ports offloaded to the
HW are flushed and during its enablement the driver send notification to
the stack to inform the core that the entire UDP tunnel port state has been
lost, uplink representor doesn't have the same behavior which can cause
VXLAN udp ports offload to be in bad state while moving between modes while
VXLAN interface exist.
Fixed by aligning the uplink representor profile behavior to the NIC behavior.
Fixes: 84db661247 ("net/mlx5e: Move set vxlan nic info to profile init")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Even when SF devices are supported, the SF device table allocation
can still fail.
In such case mlx5_sf_dev_supported still reports true, but SF device
table is invalid. This can result in NULL table access.
Hence, fix it by adding NULL table check.
Fixes: 1958fc2f07 ("net/mlx5: SF, Add auxiliary device driver")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tunnel device follows RFC 6040, and during decapsulation inner
ip_ecn might change depending on inner and outer ip_ecn as follows:
+---------+----------------------------------------+
|Arriving | Arriving Outer Header |
| Inner +---------+---------+---------+----------+
| Header | Not-ECT | ECT(0) | ECT(1) | CE |
+---------+---------+---------+---------+----------+
| Not-ECT | Not-ECT | Not-ECT | Not-ECT | <drop> |
| ECT(0) | ECT(0) | ECT(0) | ECT(1) | CE* |
| ECT(1) | ECT(1) | ECT(1) | ECT(1)* | CE* |
| CE | CE | CE | CE | CE |
+---------+---------+---------+---------+----------+
Cells marked above are changed from original inner packet ip_ecn value.
Tc then matches on the modified inner ip_ecn, but hw offload which
matches the inner ip_ecn value before decap, will fail.
Fix that by mapping all the cases of outer and inner ip_ecn matching,
and only supporting cases where we know inner wouldn't be changed by
decap, or in the outer ip_ecn=CE case, inner ip_ecn didn't matter.
Fixes: bcef735c59 ("net/mlx5e: Offload TC matching on tos/ttl for ip tunnels")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This reverts commit 54e1217b90.
Although the NIC doesn't support offload of outer header CSUM, using
gso_partial_features allows offloading the tunnel's segmentation. The
driver relies on the stack CSUM calculation of the outer header. For
this, NETIF_F_GSO_GRE_CSUM must be a member of the device's features.
Fixes: 54e1217b90 ("net/mlx5e: Block offload of outer header csum for GRE tunnel")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This reverts commit 6d6727dddc.
Although the NIC doesn't support offload of outer header CSUM, using
gso_partial_features allows offloading the tunnel's segmentation. The
driver relies on the stack CSUM calculation of the outer header. For
this, NETIF_F_GSO_UDP_TUNNEL_CSUM must be a member of the device's
features.
Fixes: 6d6727dddc ("net/mlx5e: Block offload of outer header csum for UDP tunnels")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Routes with nexthop objects is currently not supported by multipath offload
and any attempts to use it is blocked, however this also block adding SW
routes with nexthop.
Resolve this by returning NOTIFY_DONE instead of an error which will allow such
a route to be created in SW but not offloaded.
This fix also solve an issue which block adding such routes on different devices
due to missing check if the route FIB device is one of multipath devices.
Fixes: 6a87afc072 ("mlx5: Fail attempts to use routes with nexthop objects")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>