Commit Graph

1029617 Commits

Author SHA1 Message Date
Nikolay Aleksandrov
4cdd0d10f3 net: bridge: multicast: check if should use vlan mcast ctx
Add helpers which check if the current bridge/port multicast context
should be used (i.e. they're not disabled) and use them for Rx IGMP/MLD
processing, timers and new group addition. It is important for vlans to
disable processing of timer/packet after the multicast_lock is obtained
if the vlan context doesn't have BR_VLFLAG_MCAST_ENABLED. There are two
cases when that flag is missing:
 - if the vlan is getting destroyed it will be removed and timers will
   be stopped
 - if the vlan mcast snooping is being disabled

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
eb1593a0b4 net: bridge: multicast: use the port group to port context helper
We need to use the new port group to port context helper in places where
we cannot pass down the proper context (i.e. functions that can be
called by timers or outside the packet snooping paths).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
74edfd483d net: bridge: multicast: add helper to get port mcast context from port group
Add br_multicast_pg_to_port_ctx() which returns the proper port multicast
context from either port or vlan based on bridge option and vlan flags.
As the comment inside explains the locking is a bit tricky, we rely on
the fact that BR_VLFLAG_MCAST_ENABLED requires multicast_lock to change
and we also require it to be held to call that helper. If we find the
vlan under rcu and it still has the flag then we can be sure it will be
alive until we unlock multicast_lock which should be enough.
Note that the context might change from vlan to bridge between different
calls to this helper as the mcast vlan knob requires only rtnl so it should
be used carefully and for read-only/check purposes.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
f4b7002a70 net: bridge: add vlan mcast snooping knob
Add a global knob that controls if vlan multicast snooping is enabled.
The proper contexts (vlan or bridge-wide) will be chosen based on the knob
when processing packets and changing bridge device state. Note that
vlans have their individual mcast snooping enabled by default, but this
knob is needed to turn on bridge vlan snooping. It is disabled by
default. To enable the knob vlan filtering must also be enabled, it
doesn't make sense to have vlan mcast snooping without vlan filtering
since that would lead to inconsistencies. Disabling vlan filtering will
also automatically disable vlan mcast snooping.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:20 -07:00
Nikolay Aleksandrov
7b54aaaf53 net: bridge: multicast: add vlan state initialization and control
Add helpers to enable/disable vlan multicast based on its flags, we need
two flags because we need to know if the vlan has multicast enabled
globally (user-controlled) and if it has it enabled on the specific device
(bridge or port). The new private vlan flags are:
 - BR_VLFLAG_MCAST_ENABLED: locally enabled multicast on the device, used
   when removing a vlan, toggling vlan mcast snooping and controlling
   single vlan (kernel-controlled, valid under RTNL and multicast_lock)
 - BR_VLFLAG_GLOBAL_MCAST_ENABLED: globally enabled multicast for the
   vlan, used to control the bridge-wide vlan mcast snooping for a
   single vlan (user-controlled, can be checked under any context)

Bridge vlan contexts are created with multicast snooping enabled by
default to be in line with the current bridge snooping defaults. In
order to actually activate per vlan snooping and context usage a
bridge-wide knob will be added later which will default to disabled.
If that knob is enabled then automatically all vlan snooping will be
enabled. All vlan contexts are initialized with the current bridge
multicast context defaults.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
613d61dbef net: bridge: vlan: add global and per-port multicast context
Add global and per-port vlan multicast context, only initialized but
still not used. No functional changes intended.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
adc47037a7 net: bridge: multicast: use multicast contexts instead of bridge or port
Pass multicast context pointers to multicast functions instead of bridge/port.
This would make it easier later to switch these contexts to their per-vlan
versions. The patch is basically search and replace, no functional changes.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
d3d065c003 net: bridge: multicast: factor out bridge multicast context
Factor out the bridge's global multicast context into a separate
structure which will later be used for per-vlan global context.
No functional changes intended.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Nikolay Aleksandrov
9632233e7d net: bridge: multicast: factor out port multicast context
Factor out the port's multicast context into a separate structure which
will later be shared for per-port,vlan context. No functional changes
intended. We need the structure even if bridge multicast is not defined
to pass down as pointer to forwarding functions.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20 05:41:19 -07:00
Liu Jian
23d2b94043 igmp: Add ip_mc_list lock in ip_check_mc_rcu
I got below panic when doing fuzz test:

Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 4056 Comm: syz-executor.3 Tainted: G    B             5.14.0-rc1-00195-gcff5c4254439-dirty #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack_lvl+0x7a/0x9b
panic+0x2cd/0x5af
end_report.cold+0x5a/0x5a
kasan_report+0xec/0x110
ip_check_mc_rcu+0x556/0x5d0
__mkroute_output+0x895/0x1740
ip_route_output_key_hash_rcu+0x2d0/0x1050
ip_route_output_key_hash+0x182/0x2e0
ip_route_output_flow+0x28/0x130
udp_sendmsg+0x165d/0x2280
udpv6_sendmsg+0x121e/0x24f0
inet6_sendmsg+0xf7/0x140
sock_sendmsg+0xe9/0x180
____sys_sendmsg+0x2b8/0x7a0
___sys_sendmsg+0xf0/0x160
__sys_sendmmsg+0x17e/0x3c0
__x64_sys_sendmmsg+0x9e/0x100
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x462eb9
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8
 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48>
 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3df5af1c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462eb9
RDX: 0000000000000312 RSI: 0000000020001700 RDI: 0000000000000007
RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f3df5af26bc
R13: 00000000004c372d R14: 0000000000700b10 R15: 00000000ffffffff

It is one use-after-free in ip_check_mc_rcu.
In ip_mc_del_src, the ip_sf_list of pmc has been freed under pmc->lock protection.
But access to ip_sf_list in ip_check_mc_rcu is not protected by the lock.

Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-17 12:58:29 -07:00
David S. Miller
ab0441b4a9 Merge branch 'vmxnet3-version-6'
Ronak Doshi says:

====================
vmxnet3: upgrade to version 6

vmxnet3 emulation has recently added several new features which includes
increase in queues supported, remove power of 2 limitation on queues,
add RSS for ESP IPv6, etc. This patch series extends the vmxnet3 driver
to leverage these new features.

Compatibility is maintained using existing vmxnet3 versioning mechanism as
follows:
- new features added to vmxnet3 emulation are associated with new vmxnet3
   version viz. vmxnet3 version 6.
- emulation advertises all the versions it supports to the driver.
- during initialization, vmxnet3 driver picks the highest version number
supported by both the emulation and the driver and configures emulation
to run at that version.

In particular, following changes are introduced:

Patch 1:
  This patch introduces utility macros for vmxnet3 version 6 comparison
  and updates Copyright information.

Patch 2:
  This patch adds support to increase maximum Tx/Rx queues from 8 to 32.

Patch 3:
  This patch removes the limitation of power of 2 on the queues.

Patch 4:
  Uses existing get_rss_hash_opts and set_rss_hash_opts methods to add
  support for ESP IPv6 RSS.

Patch 5:
  This patch reports correct RSS hash type based on the type of RSS
  performed.

Patch 6:
  This patch updates maximum configurable mtu to 9190.

Patch 7:
  With all vmxnet3 version 6 changes incorporated in the vmxnet3 driver,
  with this patch, the driver can configure emulation to run at vmxnet3
  version 6.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:15 -07:00
Ronak Doshi
ce2639ad69 vmxnet3: update to version 6
With all vmxnet3 version 6 changes incorporated in the vmxnet3 driver,
the driver can configure emulation to run at vmxnet3 version 6, provided
the emulation advertises support for version 6.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:14 -07:00
Ronak Doshi
8c5663e461 vmxnet3: increase maximum configurable mtu to 9190
This patch increases the maximum configurable mtu to 9190
to accommodate jumbo packets of overlay traffic.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:14 -07:00
Ronak Doshi
b3973bb400 vmxnet3: set correct hash type based on rss information
As vmxnet3 supports IP/TCP/UDP RSS, this patch sets appropriate
hash type based on the type of RSS performed.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:14 -07:00
Ronak Doshi
79d124bb36 vmxnet3: add support for ESP IPv6 RSS
Vmxnet3 version 4 added support for ESP RSS. However, only IPv4 was
supported. With vmxnet3 version 6, this patch enables RSS for ESP
IPv6 packets as well.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:14 -07:00
Ronak Doshi
15ccf2f4b0 vmxnet3: remove power of 2 limitation on the queues
With version 6, vmxnet3 relaxes the restriction on queues to
be power of two. This is helpful in cases (Edge VM) where
vcpus are less than 8 and device requires more than 4 queues.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:14 -07:00
Ronak Doshi
39f9895a00 vmxnet3: add support for 32 Tx/Rx queues
Currently, vmxnet3 supports maximum of 8 Tx/Rx queues. With increase
in number of vcpus on a VM, to achieve better performance and utilize
idle vcpus, we need to increase the max number of queues supported.

This patch enhances vmxnet3 to support maximum of 32 Tx/Rx queues.
Increasing the Rx queues also increases the probability of distrubuting
the traffic from different flows to different queues with RSS.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:14 -07:00
Ronak Doshi
69dbef0d1c vmxnet3: prepare for version 6 changes
vmxnet3 is currently at version 4 and this patch initiates the
preparation to accommodate changes for upto version 6. Introduced
utility macros for vmxnet3 version 6 comparison and update Copyright
information.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:32:14 -07:00
Xin Long
f4919ff59c tipc: keep the skb in rcv queue until the whole data is read
Currently, when userspace reads a datagram with a buffer that is
smaller than this datagram, the data will be truncated and only
part of it can be received by users. It doesn't seem right that
users don't know the datagram size and have to use a huge buffer
to read it to avoid the truncation.

This patch to fix it by keeping the skb in rcv queue until the
whole data is read by users. Only the last msg of the datagram
will be marked with MSG_EOR, just as TCP/SCTP does.

Note that this will work as above only when MSG_EOR is set in the
flags parameter of recvmsg(), so that it won't break any old user
applications.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:28:09 -07:00
David S. Miller
5242b0c6b5 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/t
nguy/next-queue

Tony Nguyen says:

====================
1GbE Intel Wired LAN Driver Updates 2021-07-16

Vinicius Costa Gomes says:

Add support for steering traffic to specific RX queues using Flex Filters.

As the name implies, Flex Filters are more flexible than using
Layer-2, VLAN or MAC address filters, one of the reasons is that they
allow "AND" operations more easily, e.g. when the user wants to steer
some traffic based on the source MAC address and the packet ethertype.

Future work include adding support for offloading tc-u32 filters to
the hardware.

The series is divided as follows:

Patch 1/5, add the low level primitives for configuring Flex filters.

Patch 2/5 and 3/5, allow ethtool to manage Flex filters.

Patch 4/5, when specifying filters that have multiple predicates, use
Flex filters.

Patch 5/5, Adds support for exposing the i225 LEDs using the LED subsystem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 17:25:13 -07:00
Kurt Kanzenbach
cf8331825a igc: Export LEDs
Each i225 has three LEDs. Export them via the LED class framework.

Each LED is controllable via sysfs. Example:

$ cd /sys/class/leds/igc_led0
$ cat brightness      # Current Mode
$ cat max_brightness  # 15
$ echo 0 > brightness # Mode 0
$ echo 1 > brightness # Mode 1

The brightness field here reflects the different LED modes ranging
from 0 to 15.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-16 14:08:12 -07:00
Kurt Kanzenbach
7374426221 igc: Make flex filter more flexible
Currently flex filters are only used for filters containing user data.
However, it makes sense to utilize them also for filters having
multiple conditions, because that's not supported by the driver at the
moment. Add it.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-16 14:08:03 -07:00
Vinicius Costa Gomes
7991487ecb igc: Allow for Flex Filters to be installed
Allows Flex Filters to be installed.

The previous restriction to the types of filters that can be installed
can now be lifted.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-16 14:07:55 -07:00
Kurt Kanzenbach
2b477d057e igc: Integrate flex filter into ethtool ops
Use the flex filter mechanism to extend the current ethtool filter
operations by intercoperating the user data. This allows to match
eight more bytes within a Ethernet frame in addition to macs, ether
types and vlan.

The matching pattern looks like this:

 * dest_mac [6]
 * src_mac [6]
 * tpid [2]
 * vlan tci [2]
 * ether type [2]
 * user data [8]

This can be used to match Profinet traffic classes by FrameID range.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-16 14:07:33 -07:00
Kurt Kanzenbach
6574631b50 igc: Add possibility to add flex filter
The Intel i225 NIC has the possibility to add flex filters which can
match up to the first 128 byte of a packet. These filters are useful
for all kind of packet matching. One particular use case is Profinet,
as the different traffic classes are distinguished by the frame id
range which cannot be matched by any other means.

Add code to configure and enable flex filters.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-16 14:07:24 -07:00
Voon Weifeng
08041a9af9 net: phy: marvell10g: enable WoL for 88X3310 and 88E2110
Implement Wake-on-LAN feature for 88X3310 and 88E2110.

This is done by enabling WoL interrupt and WoL detection and
configuring MAC address into WoL magic packet registers

Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Ling Pei Lee <pei.lee.ling@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 13:22:12 -07:00
Joakim Zhang
86a176f485 ARM: dts: imx7-mba7: remove un-used "phy-reset-delay" property
Remove un-used "phy-reset-delay" property which found when do dtbs_check
(set additionalProperties: false in fsl,fec.yaml).
Double check current driver and commit history, "phy-reset-delay" never comes
up, so it should be safe to remove it.

$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/fsl,fec.yaml
arch/arm/boot/dts/imx7d-mba7.dt.yaml: ethernet@30be0000: 'phy-reset-delay' does not match any of the regexes: 'pinctrl-[0-9]+'
arch/arm/boot/dts/imx7d-mba7.dt.yaml: ethernet@30bf0000: 'phy-reset-delay' does not match any of the regexes: 'pinctrl-[0-9]+'
/arch/arm/boot/dts/imx7s-mba7.dt.yaml: ethernet@30be0000: 'phy-reset-delay' does not match any of the regexes: 'pinctrl-[0-9]+'

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 12:34:29 -07:00
Joakim Zhang
95740a9a3a ARM: dts: imx35: correct node name for FEC
Correct node name for FEC which found when do dtbs_check.

$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dtbs_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/fsl,fec.yaml
arch/arm/boot/dts/imx35-eukrea-mbimxsd35-baseboard.dt.yaml: fec@50038000: $nodename:0: 'fec@50038000' does not match '^ethernet(@.*)?$'
arch/arm/boot/dts/imx35-pdk.dt.yaml: fec@50038000: $nodename:0: 'fec@50038000' does not match '^ethernet(@.*)?$'

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 12:34:29 -07:00
Joakim Zhang
96e4781b3d dt-bindings: net: fec: convert fsl,*fec bindings to yaml
In order to automate the verification of DT nodes convert fsl-fec.txt to
fsl,fec.yaml, and pass binding check with below command.

$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dt_binding_check DT_SCHEMA_FILES=Documentation/devicetree/bindings/net/fsl,fec.yaml
  DTEX    Documentation/devicetree/bindings/net/fsl,fec.example.dts
  DTC     Documentation/devicetree/bindings/net/fsl,fec.example.dt.yaml
  CHECK   Documentation/devicetree/bindings/net/fsl,fec.example.dt.yaml

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 12:34:29 -07:00
Peilin Ye
d4861fc6be netdevsim: Add multi-queue support
Currently netdevsim only supports a single queue per port, which is
insufficient for testing multi-queue TC schedulers e.g. sch_mq.  Extend
the current sysfs interface so that users can create ports with multiple
queues:

$ echo "[ID] [PORT_COUNT] [NUM_QUEUES]" > /sys/bus/netdevsim/new_device

As an example, echoing "2 4 8" creates 4 ports, with 8 queues per port.
Note, this is compatible with the current interface, with default number
of queues set to 1.  For example, echoing "2 4" creates 4 ports with 1
queue per port; echoing "2" simply creates 1 port with 1 queue.

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 11:17:56 -07:00
Mark Gray
b83d23a2a3 openvswitch: Introduce per-cpu upcall dispatch
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:

* On systems with a large number of vports, there is a correspondingly
large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.

In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:

a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.

The corresponding user space code can be found at:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html

Bugzilla: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 11:06:33 -07:00
Bill Wendling
919d527956 bnx2x: remove unused variable 'cur_data_offset'
Fix the clang build warning:

  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:1862:13: error: variable 'cur_data_offset' set but not used [-Werror,-Wunused-but-set-variable]
        dma_addr_t cur_data_offset;

Signed-off-by: Bill Wendling <morbo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:56:55 -07:00
Christophe JAILLET
a99f030b24 net: switchdev: Simplify 'mlxsw_sp_mc_write_mdb_entry()'
Use 'bitmap_alloc()/bitmap_free()' instead of hand-writing it.
This makes the code less verbose.

Also, use 'bitmap_alloc()' instead of 'bitmap_zalloc()' because the bitmap
is fully overridden by a 'bitmap_copy()' call just after its allocation.

While at it, remove an extra and unneeded space.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:54:18 -07:00
Yajun Deng
f79a3bcb1a net/sched: Remove unnecessary if statement
It has been deal with the 'if (err' statement in rtnetlink_send()
and rtnl_unicast(). so remove unnecessary if statement.

v2: use the raw name rtnetlink_send().

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:46:35 -07:00
Yajun Deng
cfdf0d9ae7 rtnetlink: use nlmsg_notify() in rtnetlink_send()
The netlink_{broadcast, unicast} don't deal with 'if (err > 0' statement
but nlmsg_{multicast, unicast} do. The nlmsg_notify() contains them.
so use nlmsg_notify() instead. so that the caller wouldn't deal with
'if (err > 0' statement.

v2: use nlmsg_notify() will do well.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:46:35 -07:00
Haiyue Wang
63a9192b8f gve: fix the wrong AdminQ buffer overflow check
The 'tail' pointer is also free-running count, so it needs to be masked
as 'adminq_prod_cnt' does, to become an index value of AdminQ buffer.

Fixes: 5cdad90de6 ("gve: Batch AQ commands for creating and destroying queues.")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:41:40 -07:00
David S. Miller
82a1ffe57e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-07-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 45 non-merge commits during the last 15 day(s) which contain
a total of 52 files changed, 3122 insertions(+), 384 deletions(-).

The main changes are:

1) Introduce bpf timers, from Alexei.

2) Add sockmap support for unix datagram socket, from Cong.

3) Fix potential memleak and UAF in the verifier, from He.

4) Add bpf_get_func_ip helper, from Jiri.

5) Improvements to generic XDP mode, from Kumar.

6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 22:40:10 -07:00
Alexei Starovoitov
c50524ec4e Merge branch 'sockmap: add sockmap support for unix datagram socket'
Cong Wang says:

====================

From: Cong Wang <cong.wang@bytedance.com>

This is the last patchset of the original large patchset. In the
previous patchset, a new BPF sockmap program BPF_SK_SKB_VERDICT
was introduced and UDP began to support it too. In this patchset,
we add BPF_SK_SKB_VERDICT support to Unix datagram socket, so that
we can finally splice Unix datagram socket and UDP socket. Please
check each patch description for more details.

To see the big picture, the previous patchsets are available here:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1e0ab70778bd86a90de438cc5e1535c115a7c396
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=89d69c5d0fbcabd8656459bc8b1a476d6f1efee4

and this patchset is available here:
https://github.com/congwang/linux/tree/sockmap3
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
v5: lift socket state check for dgram
    remove ->unhash() case
    add retries for EAGAIN in all test cases
    remove an unused parameter of __unix_dgram_recvmsg()
    rebase on the latest bpf-next

v4: fix af_unix disconnect case
    add unix_unhash()
    split out two small patches
    reduce u->iolock critical section
    remove an unused parameter of __unix_dgram_recvmsg()

v3: fix Kconfig dependency
    make unix_read_sock() static
    fix a UAF in unix_release()
    add a missing header unix_bpf.c

v2: separate out from the original large patchset
    rebase to the latest bpf-next
    clean up unix_read_sock()
    export sock_map_close()
    factor out some helpers in selftests for code reuse
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-15 18:17:51 -07:00
Cong Wang
a2ffda38dc selftests/bpf: Add test cases for redirection between udp and unix
Add two test cases to ensure redirection between udp and unix
work bidirectionally.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-12-xiyou.wangcong@gmail.com
2021-07-15 18:17:51 -07:00
Cong Wang
5ea905dd43 selftests/bpf: Add a test case for unix sockmap
Add a test case to ensure redirection between two AF_UNIX
datagram sockets work.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-11-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
0626bc2ff6 selftests/bpf: Factor out add_to_sockmap()
Factor out a common helper add_to_sockmap() which adds two
sockets into a sockmap.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-10-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
d950625c81 selftests/bpf: Factor out udp_socketpair()
Factor out a common helper udp_socketpair() which creates
a pair of connected UDP sockets.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-9-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
9825d866ce af_unix: Implement unix_dgram_bpf_recvmsg()
We have to implement unix_dgram_bpf_recvmsg() to replace the
original ->recvmsg() to retrieve skmsg from ingress_msg.

AF_UNIX is again special here because the lack of
sk_prot->recvmsg(). I simply add a special case inside
unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-8-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
c63829182c af_unix: Implement ->psock_update_sk_prot()
Now we can implement unix_bpf_update_proto() to update
sk_prot, especially prot->close().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-7-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
c7272e15f0 af_unix: Add a dummy ->close() for sockmap
Unlike af_inet, unix_proto is very different, it does not even
have a ->close(). We have to add a dummy implementation to
satisfy sockmap. Normally it is just a nop, it is introduced only
for sockmap to replace it.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-6-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
83301b5367 af_unix: Set TCP_ESTABLISHED for datagram sockets too
Currently only unix stream socket sets TCP_ESTABLISHED,
datagram socket can set this too when they connect to its
peer socket. At least __ip4_datagram_connect() does the same.

This will be used to determine whether an AF_UNIX datagram
socket can be redirected to in sockmap.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-5-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
29df44fa52 af_unix: Implement ->read_sock() for sockmap
Implement ->read_sock() for AF_UNIX datagram socket, it is
pretty much similar to udp_read_sock().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-4-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
0c48eefae7 sock_map: Lift socket state restriction for datagram sockets
TCP and other connection oriented sockets have accept()
for each incoming connection on the server side, hence
they can just insert those fd's from accept() to sockmap,
which are of course established.

Now with datagram sockets begin to support sockmap and
redirection, the restriction is no longer applicable to
them, as they have no accept(). So we have to lift this
restriction for them. This is fine, because inside
bpf_sk_redirect_map() we still have another socket status
check, sock_map_redirect_allowed(), as a guard.

This also means they do not have to be removed from
sockmap when disconnecting.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com
2021-07-15 18:17:49 -07:00
Cong Wang
17edea21b3 sock_map: Relax config dependency to CONFIG_NET
Currently sock_map still has Kconfig dependency on CONFIG_INET,
but there is no actual functional dependency on it after we
introduce ->psock_update_sk_prot().

We have to extend it to CONFIG_NET now as we are going to
support AF_UNIX.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
2021-07-15 18:17:49 -07:00
Alexei Starovoitov
1554a080e7 Merge branch 'Add bpf_get_func_ip helper'
Jiri Olsa says:

====================
Add bpf_get_func_ip helper that returns IP address of the
caller function for trampoline and krobe programs.

There're 2 specific implementation of the bpf_get_func_ip
helper, one for trampoline progs and one for kprobe/kretprobe
progs.

The trampoline helper call is replaced/inlined by the verifier
with simple move instruction. The kprobe/kretprobe is actual
helper call that returns prepared caller address.

Also available at:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  bpf/get_func_ip

v4 changes:
  - dropped jit/x86 check for get_func_ip tracing check [Alexei]
  - added code to bpf_get_func_ip_tracing [Alexei]
    and tested that it works without inlining [Alexei]
  - changed has_get_func_ip to check_get_func_ip [Andrii]
  - replaced test assert loop with explicit asserts [Andrii]
  - adde bpf_program__attach_kprobe_opts function
    and use it for offset setup [Andrii]
  - used bpf_program__set_autoload(false) for test6 [Andrii]
  - added Masami's ack

v3 changes:
  - resend with Masami in cc and v3 in each patch subject

v2 changes:
  - use kprobe_running to get kprobe instead of cpu var [Masami]
  - added support to add kprobe on function+offset
    and test for that [Alan]
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-15 17:59:27 -07:00