Commit Graph

14785 Commits

Author SHA1 Message Date
David S. Miller
1e5e4acb66 mlx5-updates-2021-04-21
devlink external port attribute for SF (Sub-Function) port flavour
 
 This adds the support to instantiate Sub-Functions on external hosts
 E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
 are now able to spawn new Sub-Functions on the Host server CPU.
 
 Parav Pandit Says:
 ==================
 
 This series introduces and uses external attribute for the SF port to
 indicate that a SF port belongs to an external controller.
 
 This is needed to generate unique phys_port_name when PF and SF numbers
 are overlapping between local and external controllers.
 For example two controllers 0 and 1, both of these controller have a SF.
 having PF number 0, SF number 77. Here, phys_port_name has duplicate
 entry which doesn't have controller number in it.
 
 Hence, add controller number optionally when a SF port is for an
 external controller. This extension is similar to existing PF and VF
 eswitch ports of the external controller.
 
 When a SF is for external controller an example view of external SF
 port and config sequence:
 
 On eswitch system:
 $ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
 
 $ devlink port show
 pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
 pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
   function:
     hw_addr 00:00:00:00:00:00
 
 $ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
 pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
   function:
     hw_addr 00:00:00:00:00:00 state inactive opstate detached
 
 phys_port_name construction:
 $ cat /sys/class/net/eth1/phys_port_name
 c1pf0sf77
 
 Patch summary:
 First 3 patches prepares the eswitch to handle vports in more generic
 way using xarray to lookup vport from its unique vport number.
 Patch-1 returns maximum eswitch ports only when eswitch is enabled
 Patch-2 prepares eswitch to return eswitch max ports from a struct
 Patch-3 uses xarray for vport and representor lookup
 Patch-4 considers SF for an additioanl range of SF vports
 Patch-5 relies on SF hw table to check SF support
 Patch-6 extends SF devlink port attribute for external flag
 Patch-7 stores the per controller SF allocation attributes
 Patch-8 uses SF function id for filtering events
 Patch-9 uses helper for allocation and free
 Patch-10 splits hw table into per controller table and generic one
 Patch-11 extends sf table for additional range
 
 ==================
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmCDz80ACgkQSD+KveBX
 +j4LmggAwS9otYoo639Kmow/wlMZ6yyLsH02zVMFLEJ2AE4VbL73i4iiQ67ZWygL
 yQ8HawPAnythx4RsN/M6/WjSKRpdqTC27C9CpdM78zhXb1vnOrlzba7rYngqmo7N
 5fIkGyjsUGHNqq+15SftK7JYbXFTe1b5RdWawXkQoyBlXTTBamyxD7C5NMpoDots
 /e88Bs8Zy5nVPZqPchIId8TZEKKuO/heTz8ks6q6s/t1MGj7QP+ddxVMgNg00NR5
 OpNTr7YYdpHxpfLSUZgdHaptwwKOx+nou8LdJkIKWPs7SHX6HDggyZJjGBOEWtE7
 qG7oSip4olOTM0w9PZrAewLwSYhq7Q==
 =oqBr
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-04-21

devlink external port attribute for SF (Sub-Function) port flavour

This adds the support to instantiate Sub-Functions on external hosts
E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
are now able to spawn new Sub-Functions on the Host server CPU.

Parav Pandit Says:
==================

This series introduces and uses external attribute for the SF port to
indicate that a SF port belongs to an external controller.

This is needed to generate unique phys_port_name when PF and SF numbers
are overlapping between local and external controllers.
For example two controllers 0 and 1, both of these controller have a SF.
having PF number 0, SF number 77. Here, phys_port_name has duplicate
entry which doesn't have controller number in it.

Hence, add controller number optionally when a SF port is for an
external controller. This extension is similar to existing PF and VF
eswitch ports of the external controller.

When a SF is for external controller an example view of external SF
port and config sequence:

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77

Patch summary:
First 3 patches prepares the eswitch to handle vports in more generic
way using xarray to lookup vport from its unique vport number.
Patch-1 returns maximum eswitch ports only when eswitch is enabled
Patch-2 prepares eswitch to return eswitch max ports from a struct
Patch-3 uses xarray for vport and representor lookup
Patch-4 considers SF for an additioanl range of SF vports
Patch-5 relies on SF hw table to check SF support
Patch-6 extends SF devlink port attribute for external flag
Patch-7 stores the per controller SF allocation attributes
Patch-8 uses SF function id for filtering events
Patch-9 uses helper for allocation and free
Patch-10 splits hw table into per controller table and generic one
Patch-11 extends sf table for additional range

==================

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:31:35 -07:00
Florian Westphal
f7163c4882 netfilter: remove all xt_table anchors from struct net
No longer needed, table pointer arg is now passed via netfilter core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:47 +02:00
Florian Westphal
4c95e0728e netfilter: ebtables: remove the 3 ebtables pointers from struct net
ebtables stores the table internal data (what gets passed to the
ebt_do_table() interpreter) in struct net.

nftables keeps the internal interpreter format in pernet lists
and passes it via the netfilter core infrastructure (priv pointer).

Do the same for ebtables: the nf_hook_ops are duplicated via kmemdup,
then the ops->priv pointer is set to the table that is being registered.

After that, the netfilter core passes this table info to the hookfn.

This allows to remove the pointers from struct net.

Same pattern can be applied to ip/ip6/arptables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Florian Westphal
de8c12110a netfilter: disable defrag once its no longer needed
When I changed defrag hooks to no longer get registered by default I
intentionally made it so that registration can only be un-done by unloading
the nf_defrag_ipv4/6 module.

In hindsight this was too conservative; there is no reason to keep defrag
on while there is no feature dependency anymore.

Moreover, this won't work if user isn't allowed to remove nf_defrag module.

This adds the disable() functions for both ipv4 and ipv6 and calls them
from conntrack, TPROXY and the xtables socket module.

ipvs isn't converted here, it will behave as before this patch and
will need module removal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Florian Westphal
885e8c6824 netfilter: nat: move nf_xfrm_me_harder to where it is used
remove the export and make it static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
David S. Miller
5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Parav Pandit
a1ab3e4554 devlink: Extend SF port attributes to have external attribute
Extended SF port attributes to have optional external flag similar to
PCI PF and VF port attributes.

External atttibute is required to generate unique phys_port_name when PF number
and SF number are overlapping between two controllers similar to SR-IOV
VFs.

When a SF is for external controller an example view of external SF
port and config sequence.

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:53 -07:00
David S. Miller
7679f864a0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-04-23

1) The SPI flow key in struct flowi has no consumers,
   so remove it. From Florian Westphal.

2) Remove stray synchronize_rcu from xfrm_init.
   From Florian Westphal.

3) Use the new exit_pre hook to reset the netlink socket
   on net namespace destruction. From Florian Westphal.

4) Remove an unnecessary get_cpu() in ipcomp, that
   code is always called with BHs off.
   From Sabrina Dubroca.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:37:47 -07:00
David Howells
f105da1a79 afs: Don't truncate iter during data fetch
Don't truncate the iterator to correspond to the actual data size when
fetching the data from the server - rather, pass the length we want to read
to rxrpc.

This will allow the clear-after-read code in future to simply clear the
remaining iterator capacity rather than having to reinitialise the
iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/158861249201.340223.13035445866976590375.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465825061.1377938.14403904452300909320.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588531418.3465195.10712005940763063144.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118148567.1232039.13380313332292947956.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161044610.2537118.17908520793806837792.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340407907.1303470.6501394859511712746.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539551721.286939.14655713136572200716.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653807790.2770958.14034599989374173734.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789090823.6155.15673999934535049102.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:26 +01:00
Tobias Waldekranz
deff710703 net: dsa: Allow default tag protocol to be overridden from DT
Some combinations of tag protocols and Ethernet controllers are
incompatible, and it is hard for the driver to keep track of these.

Therefore, allow the device tree author (typically the board vendor)
to inform the driver of this fact by selecting an alternate protocol
that is known to work.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:51:20 -07:00
David S. Miller
08322284c1 Another set of updates, all over the map:
* set sk_pacing_shift for 802.3->802.11 encap offload
  * some monitor support for 802.11->802.3 decap offload
  * HE (802.11ax) spec updates
  * userspace API for TDLS HE support
  * along with various other small features, cleanups and
    fixups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmB+4y8ACgkQB8qZga/f
 l8SiGw/9Fz3XETnNDYMvyY7ppmWzZ6vofRq307YJiCz1fszEKqwyyzMQOrHA9tg2
 Nasl711egWlVyHTBCN+VCSaTQUjkODsK/5t4XWoxdJ0J3lZkgryVGBJljpl+k4A6
 11qpvwUnO1WCmt0s49V2yU/jWgZ9itHfu9dosu/YIq+NfXUVA7ylKmP3gqfmcCeV
 631z5AnM8/9N8QVMpnk5F2fE57WUXbA+KdVsw0LXMmjXYSsQ9MyTBX/lRDVcaMWV
 7cOtHekkzD0MVfsOoBVvsJl+bybBgEPOfZn2Kt22Rh4JzAch/uUhwRQGzsGxcR3p
 D8W9BABXCU8C5mhP8gcKlOSuH3h7ydKKqrXXNeRO+y5hymOtUSGJxia93m+uQ8qC
 97wootP3cb97/dEzv5cWqw5Pa39uEsny6mQqueD5WcMI9imL98HEo3hrZElbctx8
 s9ZE37WAlZ0zw+cGIsmElZfE2qMqEhjxF3mGFcpXLkk9/Y/1jmypYopkBLJh6KcS
 mIfwk9qWgADbPT5df1A/1388lMkjBRcQGc1SriYxy/olvb70mD8IPPiDSD2kULDt
 Sq2frnOdvjW0Q5DB6jBKzdMudAxY3WP5MlcGDy1iYwEbY6s4lPfQXG48joJpRQFG
 I3zPM6Z+Pimx7vcTd5a+IUyKvDoF+DtxiOu8DGKYT2M5tv3/tpI=
 =b0NQ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2021-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Another set of updates, all over the map:
 * set sk_pacing_shift for 802.3->802.11 encap offload
 * some monitor support for 802.11->802.3 decap offload
 * HE (802.11ax) spec updates
 * userspace API for TDLS HE support
 * along with various other small features, cleanups and
   fixups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:44:04 -07:00
Oleksij Rempel
a71acad90a net: dsa: enable selftest support for all switches by default
Most of generic selftest should be able to work with probably all ethernet
controllers. The DSA switches are not exception, so enable it by default at
least for DSA.

This patch was tested with SJA1105 and AR9331.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:08:02 -07:00
Oleksij Rempel
3e1e58d64c net: add generic selftest support
Port some parts of the stmmac selftest and reuse it as basic generic selftest
library. This patch was tested with following combinations:
- iMX6DL FEC -> AT8035
- iMX6DL FEC -> SJA1105Q switch -> KSZ8081
- iMX6DL FEC -> SJA1105Q switch -> KSZ9031
- AR9331 ag71xx -> AR9331 PHY
- AR9331 ag71xx -> AR9331 switch -> AR9331 PHY

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:08:02 -07:00
David S. Miller
6dd06ec7c1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add vlan match and pop actions to the flowtable offload,
   patches from wenxu.

2) Reduce size of the netns_ct structure, which itself is
   embedded in struct net Make netns_ct a read-mostly structure.
   Patches from Florian Westphal.

3) Add FLOW_OFFLOAD_XMIT_UNSPEC to skip dst check from garbage
   collector path, as required by the tc CT action. From Roi Dayan.

4) VLAN offload fixes for nftables: Allow for matching on both s-vlan
   and c-vlan selectors. Fix match of VLAN id due to incorrect
   byteorder. Add a new routine to properly populate flow dissector
   ethertypes.

5) Missing keys in ip{6}_route_me_harder() results in incorrect
   routes. This includes an update for selftest infra. Patches
   from Ido Schimmel.

6) Add counter hardware offload support through FLOW_CLS_STATS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:49:50 -07:00
Florian Westphal
b07dd26f07 flow: remove spi key from flowi struct
xfrm session decode ipv4 path (but not ipv6) sets this, but there are no
consumers.  Remove it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-04-19 12:25:11 +02:00
Emmanuel Grumbach
f30386a85f mac80211: make ieee80211_vif_to_wdev work when the vif isn't in the driver
This will allow the low level driver to get the wdev during
the add_interface flow.
In order to do that, remove a few checks from there and do
not return NULL for vifs that were not yet added to the
driver. Note that all the current callers of this helper
function assume that the vif already exists:
 - The callers from the drivers already have a vif pointer.
 Before this change, ieee80211_vif_to_wdev would return NULL
 in some cases, but those callers don't even check they
 get a non-NULL pointer from ieee80211_vif_to_wdev.
 - The callers from net/mac80211/cfg.c assume the vif is
 already added to the driver as well.

So, this change has no impact on existing callers of this
helper function.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.6078d3517095.I1907a45f267a62dab052bcc44428aa7a2005ffc9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:03:13 +02:00
Avraham Stern
73807523f9 nl80211/cfg80211: add a flag to negotiate for LMR feedback in NDP ranging
Add a flag that indicates that the ISTA shall indicate support for
LMR feedback in NDP ranging negotiation.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.eff546283504.I2606161e700ac24d94d0b50c8edcdedd4c0395c2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:02:51 +02:00
Joe Perches
623b988f2d cfg80211: constify ieee80211_get_response_rate return
It's not modified so make it const with the eventual goal of moving
data to text for various static struct ieee80211_rate arrays.

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/8b210b5f5972e39eded269b35a1297cf824c4181.camel@perches.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 11:59:33 +02:00
Randy Dunlap
5d9c358d05 cfg80211: fix a few kernel-doc warnings
Fix multiple kernel-doc warnings in cfg80211.h.

cfg80211.h:363: warning: missing initial short description on line:
 * struct ieee80211_sband_iftype_data
cfg80211.h:6743: warning: missing initial short description on line:
 * cfg80211_vendor_cmd_get_sender

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210417060142.1648-1-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 11:59:15 +02:00
Emmanuel Grumbach
810344ed07 cfg80211: fix an htmldoc warning
The htmldoc produces this warning which was introduced
bu the commit below.

include/net/cfg80211.h:6643: warning: expecting prototype for wiphy_rfkill_set_hw_state().
Prototype was for wiphy_rfkill_set_hw_state_reason() instead

Fixes: 6f779a66dc ("cfg80211: allow specifying a reason for hw_rfkill")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20210413113850.59098-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 11:57:35 +02:00
Pablo Neira Ayuso
b72920f6e4 netfilter: nftables: counter hardware offload support
This patch adds the .offload_stats operation to synchronize hardware
stats with the expression data. Update the counter expression to use
this new interface. The hardware stats are retrieved from the netlink
dump path via FLOW_CLS_STATS command to the driver.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:04:49 +02:00
Pablo Neira Ayuso
ff4d90a89d netfilter: nftables_offload: VLAN id needs host byteorder in flow dissector
The flow dissector representation expects the VLAN id in host byteorder.
Add the NFT_OFFLOAD_F_NETWORK2HOST flag to swap the bytes from nft_cmp.

Fixes: a82055af59 ("netfilter: nft_payload: add VLAN offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:02:21 +02:00
Pablo Neira Ayuso
14c20643ef netfilter: nft_payload: fix C-VLAN offload support
- add another struct flow_dissector_key_vlan for C-VLAN
- update layer 3 dependency to allow to match on IPv4/IPv6

Fixes: 89d8fd44ab ("netfilter: nft_payload: add C-VLAN offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:02:21 +02:00
Vladimir Oltean
2c4eca3ef7 net: bridge: switchdev: include local flag in FDB notifications
As explained in bugfix commit 6ab4c3117a ("net: bridge: don't notify
switchdev for local FDB addresses") as well as in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/

the switchdev notifiers for FDB entries managed to have a zero-day bug,
which was that drivers would not know what to do with local FDB entries,
because they were not told that they are local. The bug fix was to
simply not notify them of those addresses.

Let us now add the 'is_local' bit to bridge FDB entries, and make all
drivers ignore these entries by their own choice.

Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:15:45 -07:00
Michael Walle
83216e3988 of: net: pass the dst buffer to of_get_mac_address()
of_get_mac_address() returns a "const void*" pointer to a MAC address.
Lately, support to fetch the MAC address by an NVMEM provider was added.
But this will only work with platform devices. It will not work with
PCI devices (e.g. of an integrated root complex) and esp. not with DSA
ports.

There is an of_* variant of the nvmem binding which works without
devices. The returned data of a nvmem_cell_read() has to be freed after
use. On the other hand the return of_get_mac_address() points to some
static data without a lifetime. The trick for now, was to allocate a
device resource managed buffer which is then returned. This will only
work if we have an actual device.

Change it, so that the caller of of_get_mac_address() has to supply a
buffer where the MAC address is written to. Unfortunately, this will
touch all drivers which use the of_get_mac_address().

Usually the code looks like:

  const char *addr;
  addr = of_get_mac_address(np);
  if (!IS_ERR(addr))
    ether_addr_copy(ndev->dev_addr, addr);

This can then be simply rewritten as:

  of_get_mac_address(np, ndev->dev_addr);

Sometimes is_valid_ether_addr() is used to test the MAC address.
of_get_mac_address() already makes sure, it just returns a valid MAC
address. Thus we can just test its return code. But we have to be
careful if there are still other sources for the MAC address before the
of_get_mac_address(). In this case we have to keep the
is_valid_ether_addr() call.

The following coccinelle patch was used to convert common cases to the
new style. Afterwards, I've manually gone over the drivers and fixed the
return code variable: either used a new one or if one was already
available use that. Mansour Moufid, thanks for that coccinelle patch!

<spml>
@a@
identifier x;
expression y, z;
@@
- x = of_get_mac_address(y);
+ x = of_get_mac_address(y, z);
  <...
- ether_addr_copy(z, x);
  ...>

@@
identifier a.x;
@@
- if (<+... x ...+>) {}

@@
identifier a.x;
@@
  if (<+... x ...+>) {
      ...
  }
- else {}

@@
identifier a.x;
expression e;
@@
- if (<+... x ...+>@e)
-     {}
- else
+ if (!(e))
      {...}

@@
expression x, y, z;
@@
- x = of_get_mac_address(y, z);
+ of_get_mac_address(y, z);
  ... when != x
</spml>

All drivers, except drivers/net/ethernet/aeroflex/greth.c, were
compile-time tested.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:35:02 -07:00
Roi Dayan
78ed0a9bc6 netfilter: flowtable: Add FLOW_OFFLOAD_XMIT_UNSPEC xmit type
It could be xmit type was not set and would default to FLOW_OFFLOAD_XMIT_NEIGH
and in this type the gc expect to have a route info.
Fix that by adding FLOW_OFFLOAD_XMIT_UNSPEC which defaults to 0.

Fixes: 8b9229d158 ("netfilter: flowtable: dst_check() from garbage collector path")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:47:00 +02:00
Florian Westphal
9b1a4d0f91 netfilter: conntrack: convert sysctls to u8
log_invalid sysctl allows values of 0 to 255 inclusive so we no longer
need a range check: the min/max values can be removed.

This also removes all member variables that were moved to net_generic
data in previous patches.

This reduces size of netns_ct struct by one cache line.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
Florian Westphal
c53bd0e966 netfilter: conntrack: move ct counter to net_generic data
Its only needed from slowpath (sysctl, ctnetlink, gc worker) and
when a new conntrack object is allocated.

Furthermore, each write dirties the otherwise read-mostly pernet
data in struct net.ct, which are accessed from packet path.

Move it to the net_generic data.  This makes struct netns_ct
read-mostly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
Florian Westphal
f6f2e580d5 netfilter: conntrack: move expect counter to net_generic data
Creation of a new conntrack entry isn't a frequent operation (compared
to 'ct entry already exists').  Creation of a new entry that is also an
expected (related) connection even less so.

Place this counter in net_generic data.

A followup patch will also move the conntrack count -- this will make
netns_ct a read-mostly structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
Florian Westphal
67f28216ca netfilter: conntrack: move autoassign_helper sysctl to net_generic data
While at it, make it an u8, no need to use an integer for a boolean.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
Florian Westphal
098b5d3565 netfilter: conntrack: move autoassign warning member to net_generic data
Not accessed in fast path, place this is generic_net data instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
wenxu
3e1b0c168f netfilter: flowtable: add vlan match offload support
This patch adds support for vlan_id, vlan_priority and vlan_proto match
for flowtable offload.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 12:59:54 +02:00
Cong Wang
51e0158a54 skmsg: Pass psock pointer to ->psock_update_sk_prot()
Using sk_psock() to retrieve psock pointer from sock requires
RCU read lock, but we already get psock pointer before calling
->psock_update_sk_prot() in both cases, so we can just pass it
without bothering sk_psock().

Fixes: 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Reported-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210407032111.33398-1-xiyou.wangcong@gmail.com
2021-04-12 17:34:27 +02:00
Jakub Kicinski
8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Florian Westphal
b98b33043c net: dccp: use net_generic storage
DCCP is virtually never used, so no need to use space in struct net for it.

Put the pernet ipv4/v6 socket in the dccp ipv4/ipv6 modules instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20210408174502.1625-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 16:34:56 -07:00
David S. Miller
4438669eb7 bluetooth-next pull request for net-next:
- Proper support for BCM4330 and BMC4334
  - Various improvements for firmware download of Intel controllers
  - Update management interface revision to 20
  - Support for AOSP HCI vendor commands
  - Initial Virtio support
 
 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmBvL7UZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKU+eD/4qGiwnwQVia4hjnQ5PZ0c/
 sBmZvzzZE43RNRvy8gAwcVLypLaiIkEYml4ruVdF6RGoqaDA6kRB7D0tdk8EtZc4
 6h72LX0zefnyZNj/Fcl6+dKuq7bSHjdOK0G+oXFmpvYfScPsI6Oq0ntpYOFYwUE0
 cDdeZyb/sW4f6L5AXUgLjD9AIjzlBOJEU5HcRL/uY+E/fH1oe757/IhUDerXjTIh
 9EhY28mrI7KeoEcJ1NIgvjPnRCxPRLAKgrzc8NXkUsLeNZSIywrwjgRfTvCZ+qQf
 g/MmNv+n7yFlUzA7Fu2d8YAgRtgF9nZc45geqOSHqcSZURORSs0CzEoYinqwT7f0
 xdMBi/UoSzJsLksgv+OfXOyoQ4lPNJ47pcH/edmT/ZJ/Ai2yTqUUSODHQ3MLC4Dp
 kQj34thQvfTf5b6+9HwKmBjlfMK0QGPqWjp6dth2bzos/ugurDP+XimAv0urvFz+
 NkES5kCMUs/1z7Yh+Nj1MyU3Wfdwol0wOh+PXBTAw2fxSy6kmKYGyLw7HEdWegB0
 K2w/ZS58g0EGK89ZSiqlZKmXtAd22dQqS6JlZje+9YTtEtRWu+rAF1dddS8vJDjl
 Rq3VSQ1B6Qw0rdGpyfxdgm/sWwq5S1YgQQcAZgVaeo9ZiIhfscVD5v5L0Ol6k9ms
 3eCK/2brsX0IaV61eVnoOw==
 =jiCz
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2021-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - Proper support for BCM4330 and BMC4334
 - Various improvements for firmware download of Intel controllers
 - Update management interface revision to 20
 - Support for AOSP HCI vendor commands
 - Initial Virtio support
====================

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-08 14:19:32 -07:00
Stephen Hemminger
3583a4e8d7 ipv6: report errors for iftoken via netlink extack
Setting iftoken can fail for several different reasons but there
and there was no report to user as to the cause. Add netlink
extended errors to the processing of the request.

This requires adding additional argument through rtnl_af_ops
set_link_af callback.

Reported-by: Hongren Zheng <li@zenithal.me>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-08 13:52:36 -07:00
Vlad Buslov
b3650bf76a net: sched: fix err handler in tcf_action_init()
With recent changes that separated action module load from action
initialization tcf_action_init() function error handling code was modified
to manually release the loaded modules if loading/initialization of any
further action in same batch failed. For the case when all modules
successfully loaded and some of the actions were initialized before one of
them failed in init handler. In this case for all previous actions the
module will be released twice by the error handler: First time by the loop
that manually calls module_put() for all ops, and second time by the action
destroy code that puts the module after destroying the action.

Reproduction:

$ sudo tc actions add action simple sdata \"2\" index 2
$ sudo tc actions add action simple sdata \"1\" index 1 \
                      action simple sdata \"2\" index 2
RTNETLINK answers: File exists
We have an error talking to the kernel
$ sudo tc actions ls action simple
total acts 1

        action order 0: Simple <"2">
         index 2 ref 1 bind 0
$ sudo tc actions flush action simple
$ sudo tc actions ls action simple
$ sudo tc actions add action simple sdata \"2\" index 2
Error: Failed to load TC action module.
We have an error talking to the kernel
$ lsmod | grep simple
act_simple             20480  -1

Fix the issue by modifying module reference counting handling in action
initialization code:

- Get module reference in tcf_idr_create() and put it in tcf_idr_release()
instead of taking over the reference held by the caller.

- Modify users of tcf_action_init_1() to always release the module
reference which they obtain before calling init function instead of
assuming that created action takes over the reference.

- Finally, modify tcf_action_init_1() to not release the module reference
when overwriting existing action as this is no longer necessary since both
upper and lower layers obtain and manage their own module references
independently.

Fixes: d349f99768 ("net_sched: fix RTNL deadlock again caused by request_module()")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-08 13:47:33 -07:00
Vlad Buslov
87c750e8c3 net: sched: fix action overwrite reference counting
Action init code increments reference counter when it changes an action.
This is the desired behavior for cls API which needs to obtain action
reference for every classifier that points to action. However, act API just
needs to change the action and releases the reference before returning.
This sequence breaks when the requested action doesn't exist, which causes
act API init code to create new action with specified index, but action is
still released before returning and is deleted (unless it was referenced
concurrently by cls API).

Reproduction:

$ sudo tc actions ls action gact
$ sudo tc actions change action gact drop index 1
$ sudo tc actions ls action gact

Extend tcf_action_init() to accept 'init_res' array and initialize it with
action->ops->init() result. In tcf_action_add() remove pointers to created
actions from actions array before passing it to tcf_action_put_many().

Fixes: cae422f379 ("net: sched: use reference counting action init")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-08 13:47:33 -07:00
Sriram R
55f8205e7d mac80211: Allow concurrent monitor iface and ethernet rx decap
Some HW/driver can support passing ethernet rx decap frames and
raw 802.11 frames for the monitor interface concurrently and
via separate RX calls to mac80211. Packets going to the monitor
interface(s) would be in 802.11 format and thus not have the
RX_FLAG_8023 set, and 802.11 format monitoring frames should have
RX_FLAG_ONLY_MONITOR set.

Drivers doing such can enable the SUPPORTS_CONC_MON_RX_DECAP to
allow using ethernet decap offload while a monitor interface is
active, currently RX decapsulation offload gets disabled when a
monitor interface is added.

Signed-off-by: Sriram R <srirrama@codeaurora.org>
Link: https://lore.kernel.org/r/1617068116-32253-1-git-send-email-srirrama@codeaurora.org
[add proper documentation, rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-08 15:21:05 +02:00
Emmanuel Grumbach
6f779a66dc cfg80211: allow specifying a reason for hw_rfkill
rfkill now allows to report a reason for the hw_rfkill state.
Allow cfg80211 drivers to specify this reason.
Keep the current API to use the default reason
(RFKILL_HARD_BLOCK_SIGNAL).

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20210322204633.102581-4-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-08 13:05:56 +02:00
Marcel Holtmann
a61d67188f Bluetooth: Allow Microsoft extension to indicate curve validation
Some controllers don't support the Simple Pairing Options feature that
can indicate the support for P-192 and P-256 public key validation.
However they might support the Microsoft vendor extension that can
indicate the validiation capability as well.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2021-04-08 12:26:34 +02:00
Marcel Holtmann
cfa15cca51 Bluetooth: Fix default values for advertising interval
The DISCOV_LE_FAST_ADV_INT_{MIN,MAX} contants are in msec, but then used
later on directly while it is suppose to be N * 0.625 ms according to
the Bluetooth Core specification.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2021-04-08 12:26:34 +02:00
Geliang Tang
30f60bae80 mptcp: use mptcp_addr_info in mptcp_out_options
This patch moved the mptcp_addr_info struct from protocol.h to mptcp.h,
added a new struct mptcp_addr_info member addr in struct mptcp_out_options,
and dropped the original addr, addr6, addr_id and port fields in it. Then
we can use opts->addr to get the adding address from PM directly using
mptcp_pm_add_addr_signal.

Since the port number became big-endian now, use ntohs to convert it
before sending it out with the ADD_ADDR suboption. Also convert it
when passing it to add_addr_generate_hmac or printing it out.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-07 14:09:39 -07:00
David S. Miller
5106efe6ed Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter/IPVS updates for your net-next tree:

1) Simplify log infrastructure modularity: Merge ipv4, ipv6, bridge,
   netdev and ARP families to nf_log_syslog.c. Add module softdeps.
   This fixes a rare deadlock condition that might occur when log
   module autoload is required. From Florian Westphal.

2) Moves part of netfilter related pernet data from struct net to
   net_generic() infrastructure. All of these users can be modules,
   so if they are not loaded there is no need to waste space. Size
   reduction is 7 cachelines on x86_64, also from Florian.

2) Update nftables audit support to report events once per table,
   to get it aligned with iptables. From Richard Guy Briggs.

3) Check for stale routes from the flowtable garbage collector path.
   This is fixing IPv6 which breaks due missing check for the dst_cookie.

4) Add a nfnl_fill_hdr() function to simplify netlink + nfnetlink
   headers setup.

5) Remove documentation on several statified functions.

6) Remove printk on netns creation for the FTP IPVS tracker,
   from Florian Westphal.

7) Remove unnecessary nf_tables_destroy_list_lock spinlock
   initialization, from Yang Yingliang.

7) Remove a duplicated forward declaration in ipset,
   from Wan Jiabing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-06 16:36:41 -07:00
Marcel Holtmann
f67743f9e0 Bluetooth: Add support for reading AOSP vendor capabilities
When drivers indicate support for AOSP vendor extension, initialize them
and read its capabilities.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2021-04-06 14:11:23 -07:00
Florian Westphal
db3685b404 net: remove obsolete members from struct net
all have been moved to generic_net infra. On x86_64, this reduces
struct net size from 70 to 63 cache lines (4480 to 4032 byte).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-06 00:34:53 +02:00
Florian Westphal
1379940bf8 netfilter: conntrack: move ecache dwork to net_generic infra
dwork struct is large (>128 byte) and not needed when conntrack module
is not loaded.

Place it in net_generic data instead.  The struct net dwork member is now
obsolete and will be removed in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-06 00:34:53 +02:00
Florian Westphal
7b5974709f netfilter: conntrack: move sysctl pointer to net_generic infra
No need to keep this in struct net, place it in the net_generic data.
The sysctl pointer is removed from struct net in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-06 00:34:53 +02:00
Florian Westphal
0854db2aae netfilter: nf_tables: use net_generic infra for transaction data
This moves all nf_tables pernet data from struct net to a net_generic
extension, with the exception of the gencursor.

The latter is used in the data path and also outside of the nf_tables
core. All others are only used from the configuration plane.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-06 00:34:52 +02:00
Florian Westphal
8b0adbe3e3 netfilter: nf_defrag_ipv6: use net_generic infra
This allows followup patch to remove these members from struct net.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-06 00:34:51 +02:00
Greg Kroah-Hartman
9594408763 Merge 5.12-rc6 into tty-next
We need the serial/tty fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-05 08:59:21 +02:00
Eric Dumazet
8250666517 tcp: reorder tcp_congestion_ops for better cache locality
Group all the often used fields in the first cache line,
to reduce cache line misses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-02 14:32:27 -07:00
Eric Dumazet
7f040aa322 net: reorganize fields in netns_mib
Order fields to increase locality for most used protocols.

udplite and icmp are moved at the end.

Same for proc_net_devsnmp6 which is not used in fast path.

This potentially saves one cache line miss for typical TCP/UDP over IPv4/IPv6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-02 14:31:44 -07:00
Florian Westphal
dc87efdb1a mptcp: add mptcp reset option support
The MPTCP reset option allows to carry a mptcp-specific error code that
provides more information on the nature of a connection reset.

Reset option data received gets stored in the subflow context so it can
be sent to userspace via the 'subflow closed' netlink event.

When a subflow is closed, the desired error code that should be sent to
the peer is also placed in the subflow context structure.

If a reset is sent before subflow establishment could complete, e.g. on
HMAC failure during an MP_JOIN operation, the mptcp skb extension is
used to store the reset information.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-02 14:21:50 -07:00
David S. Miller
c2bcb4cf02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-01

The following pull-request contains BPF updates for your *net-next* tree.

We've added 68 non-merge commits during the last 7 day(s) which contain
a total of 70 files changed, 2944 insertions(+), 1139 deletions(-).

The main changes are:

1) UDP support for sockmap, from Cong.

2) Verifier merge conflict resolution fix, from Daniel.

3) xsk selftests enhancements, from Maciej.

4) Unstable helpers aka kernel func calling, from Martin.

5) Batches ops for LPM map, from Pedro.

6) Fix race in bpf_get_local_storage, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-02 11:03:07 -07:00
Wan Jiabing
9fadafa46f include: net: Remove repeated struct declaration
struct ctl_table_header is declared twice. One is declared
at 46th line. The blew one is not needed. Remove the duplicate.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-01 15:51:52 -07:00
Cong Wang
2bc793e327 skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
Although these two functions are only used by TCP, they are not
specific to TCP at all, both operate on skmsg and ingress_msg,
so fit in net/core/skmsg.c very well.

And we will need them for non-TCP, so rename and move them to
skmsg.c and export them to modules.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
d7f571188e udp: Implement ->read_sock() for sockmap
This is similar to tcp_read_sock(), except we do not need
to worry about connections, we just need to retrieve skb
from UDP receive queue.

Note, the return value of ->read_sock() is unused in
sk_psock_verdict_data_ready(), and UDP still does not
support splice() due to lack of ->splice_read(), so users
can not reach udp_read_sock() directly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-12-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang
8a59f9d1e3 sock: Introduce sk->sk_prot->psock_update_sk_prot()
Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Eric Dumazet
c609e6aae4 Revert "net: correct sk_acceptq_is_full()"
This reverts commit f211ac1545.

We had similar attempt in the past, and we reverted it.

History:

64a146513f [NET]: Revert incorrect accept queue backlog changes.
8488df894d [NET]: Fix bugs in "Whether sock accept queue is full" checking

I am adding a fat comment so that future attempts will
be much harder.

Fixes: f211ac1545 ("net: correct sk_acceptq_is_full()")
Cc: iuyacan <yacanliu@163.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:58:45 -07:00
Eric Dumazet
0dd39d952f ipv6: move ip6_dst_ops first in netns_ipv6
ip6_dst_ops have cache line alignement.

Moving it at beginning of netns_ipv6
removes a 48 byte hole, and shrinks netns_ipv6
from 12 to 11 cache lines.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
a6175633a2 ipv6: convert elligible sysctls to u8
Convert most sysctls that can fit in a byte.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
1c3289c931 tcp: convert tcp_comp_sack_nr sysctl to u8
tcp_comp_sack_nr max value was already 255.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
7d4b37ebb9 ipv4: convert igmp_link_local_mcast_reports sysctl to u8
This sysctl is a bool, can use less storage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
be205fe6ec ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8
Make room for better packing of netns_ipv4

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
cd04bd0222 ipv4: convert udp_l3mdev_accept sysctl to u8
Reduce footprint of sysctls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:20 -07:00
Eric Dumazet
b2908fac5b ipv4: convert fib_notify_on_flag_change sysctl to u8
Reduce footprint of sysctls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:19 -07:00
Eric Dumazet
490f33c4e7 inet: shrink netns_ipv4 by another cache line
By shuffling around some fields to remove 8 bytes of hole,
we can save one cache line.

pahole result before/after the patch :

/* size: 768, cachelines: 12, members: 139 */
/* sum members: 673, holes: 11, sum holes: 39 */
/* padding: 56 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */

->

/* size: 704, cachelines: 11, members: 139 */
/* sum members: 673, holes: 10, sum holes: 31 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:19 -07:00
Eric Dumazet
1caf8d39c5 inet: shrink inet_timewait_death_row by 48 bytes
struct inet_timewait_death_row uses two cache lines, because we want
tw_count to use a full cache line to avoid false sharing.

Rework its definition and placement in netns_ipv4 so that:

1) We add 60 bytes of padding after tw_count to avoid
  false sharing, knowing that tcp_death_row will
  have ____cacheline_aligned_in_smp attribute.

2) We do not risk padding before tcp_death_row, because
  we move it at the beginning of netns_ipv4, even if new
 fields are added later.

3) We do not waste 48 bytes of padding after it.

Note that I have not changed dccp.

pahole result for struct netns_ipv4 before/after the patch :

/* size: 832, cachelines: 13, members: 139 */
/* sum members: 721, holes: 12, sum holes: 95 */
/* padding: 16 */
/* paddings: 2, sum paddings: 55 */

->

/* size: 768, cachelines: 12, members: 139 */
/* sum members: 673, holes: 11, sum holes: 39 */
/* padding: 56 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:48:19 -07:00
David S. Miller
c9170f1321 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2021-03-31

1) Fix ipv4 pmtu checks for xfrm anf vti interfaces.
   From Eyal Birger.

2) There are situations where the socket passed to
   xfrm_output_resume() is not the same as the one
   attached to the skb. Use the socket passed to
   xfrm_output_resume() to avoid lookup failures
   when xfrm is used with VRFs.
   From Evan Nimmo.

3) Make the xfrm_state_hash_generation sequence counter per
   network namespace because but its write serialization
   lock is also per network namespace. Write protection
   is insufficient otherwise.
   From Ahmed S. Darwish.

4) Fixup sctp featue flags when used with esp offload.
   From Xin Long.

5) xfrm BEET mode doesn't support fragments for inner packets.
   This is a limitation of the protocol, so no fix possible.
   Warn at least to notify the user about that situation.
   From Xin Long.

6) Fix NULL pointer dereference on policy lookup when
   namespaces are uses in combination with esp offload.

7) Fix incorrect transformation on esp offload when
   packets get segmented at layer 3.

8) Fix some user triggered usages of WARN_ONCE in
   the xfrm compat layer.
   From Dmitry Safonov.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:37:51 -07:00
Pablo Neira Ayuso
8b9229d158 netfilter: flowtable: dst_check() from garbage collector path
Move dst_check() to the garbage collector path. Stale routes trigger the
flow entry teardown state which makes affected flows go back to the
classic forwarding path to re-evaluate flow offloading.

IPv6 requires the dst cookie to work, store it in the flow_tuple,
otherwise dst_check() always fails.

Fixes: e5075c0bad ("netfilter: flowtable: call dst_check() to fall back to classic forwarding")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-31 22:34:11 +02:00
Florian Westphal
cefa31a9d4 netfilter: nft_log: perform module load from nf_tables
modprobe calls from the nf_logger_find_get() API causes deadlock in very
special cases because they occur with the nf_tables transaction mutex held.

In the specific case of nf_log, deadlock is via:

 A nf_tables -> transaction mutex -> nft_log -> modprobe -> nf_log_syslog \
	    -> pernet_ops rwsem -> wait for C
 B netlink event -> rtnl_mutex -> nf_tables transaction mutex -> wait for A
 C close() -> ip6mr_sk_done -> rtnl_mutex -> wait for B

Earlier patch added NFLOG/xt_LOG module softdeps to avoid the need to load
the backend module during a transaction.

For nft_log we would have to add a softdep for both nfnetlink_log or
nf_log_syslog, since we do not know in advance which of the two backends
are going to be configured.

This defers the modprobe op until after the transaction mutex is released.

Tested-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-31 22:34:11 +02:00
Florian Westphal
e465cccd0b netfilter: nf_log_common: merge with nf_log_syslog
Remove nf_log_common.  Now that all per-af modules have been merged
there is no longer a need to provide a helper module.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-31 22:34:10 +02:00
Florian Westphal
77ccee96a6 netfilter: nf_log_bridge: merge with nf_log_syslog
Provide bridge log support from nf_log_syslog.

After the merge there is no need to load the "real packet loggers",
all of them now reside in the same module.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-31 22:34:05 +02:00
Paolo Abeni
000ac44da7 udp: fixup csum for GSO receive slow path
When UDP packets generated locally by a socket with UDP_SEGMENT
traverse the following path:

UDP tunnel(xmit) -> veth (segmentation) -> veth (gro) ->
	UDP tunnel (rx) -> UDP socket (no UDP_GRO)

ip_summed will be set to CHECKSUM_PARTIAL at creation time and
such checksum mode will be preserved in the above path up to the
UDP tunnel receive code where we have:

 __iptunnel_pull_header() -> skb_pull_rcsum() ->
skb_postpull_rcsum() -> __skb_postpull_rcsum()

The latter will convert the skb to CHECKSUM_NONE.

The UDP GSO packet will be later segmented as part of the rx socket
receive operation, and will present a CHECKSUM_NONE after segmentation.

Additionally the segmented packets UDP CB still refers to the original
GSO packet len. Overall that causes unexpected/wrong csum validation
errors later in the UDP receive path.

We could possibly address the issue with some additional checks and
csum mangling in the UDP tunnel code. Since the issue affects only
this UDP receive slow path, let's set a suitable csum status there.

Note that SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST packets lacking an UDP
encapsulation present a valid checksum when landing to udp_queue_rcv_skb(),
as the UDP checksum has been validated by the GRO engine.

v2 -> v3:
 - even more verbose commit message and comments

v1 -> v2:
 - restrict the csum update to the packets strictly needing them
 - hopefully clarify the commit message and code comments

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 17:06:49 -07:00
Paolo Abeni
9adc89af72 net: let skb_orphan_partial wake-up waiters.
Currently the mentioned helper can end-up freeing the socket wmem
without waking-up any processes waiting for more write memory.

If the partially orphaned skb is attached to an UDP (or raw) socket,
the lack of wake-up can hang the user-space.

Even for TCP sockets not calling the sk destructor could have bad
effects on TSQ.

Address the issue using skb_orphan to release the sk wmem before
setting the new sock_efree destructor. Additionally bundle the
whole ownership update in a new helper, so that later other
potential users could avoid duplicate code.

v1 -> v2:
 - use skb_orphan() instead of sort of open coding it (Eric)
 - provide an helper for the ownership change (Eric)

Fixes: f6ba8d33cf ("netem: fix skb_orphan_partial()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 13:57:28 -07:00
Andreas Roeseler
504a40113c ipv6: add ipv6_dev_find to stubs
Add ipv6_dev_find to ipv6_stub to allow lookup of net_devices by IPV6
address in net/ipv4/icmp.c.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 13:29:39 -07:00
Andreas Roeseler
f1b8fa9fa5 net: add sysctl for enabling RFC 8335 PROBE messages
Section 8 of RFC 8335 specifies potential security concerns of
responding to PROBE requests, and states that nodes that support PROBE
functionality MUST be able to enable/disable responses and that
responses MUST be disabled by default

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 13:29:39 -07:00
Petr Machata
de1d1ee3e3 nexthop: Rename artifacts related to legacy multipath nexthop groups
After resilient next-hop groups have been added recently, there are two
types of multipath next-hop groups: the legacy "mpath", and the new
"resilient". Calling the legacy next-hop group type "mpath" is unfortunate,
because that describes the fact that a packet could be forwarded in one of
several paths, which is also true for the resilient next-hop groups.

Therefore, to make the naming clearer, rename various artifacts to reflect
the assumptions made. Therefore as of this patch:

- The flag for multipath groups is nh_grp_entry::is_multipath. This
  includes the legacy and resilient groups, as well as any future group
  types that behave as multipath groups.
  Functions that assume this have "mpath" in the name.

- The flag for legacy multipath groups is nh_grp_entry::hash_threshold.
  Functions that assume this have "hthr" in the name.

- The flag for resilient groups is nh_grp_entry::resilient.
  Functions that assume this have "res" in the name.

Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as
well.

UAPI artifacts were obviously left intact.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28 17:53:39 -07:00
Taehee Yoo
63ed8de4be mld: add mc_lock for protecting per-interface mld data
The purpose of this lock is to avoid a bottleneck in the query/report
event handler logic.

By previous patches, almost all mld data is protected by RTNL.
So, the query and report event handler, which is data path logic
acquires RTNL too. Therefore if a lot of query and report events
are received, it uses RTNL for a long time.
So it makes the control-plane bottleneck because of using RTNL.
In order to avoid this bottleneck, mc_lock is added.

mc_lock protect only per-interface mld data and per-interface mld
data is used in the query/report event handler logic.
So, no longer rtnl_lock is needed in the query/report event handler logic.
Therefore bottleneck will be disappeared by mc_lock.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 15:14:56 -07:00
Taehee Yoo
f185de28d9 mld: add new workqueues for process mld events
When query/report packets are received, mld module processes them.
But they are processed under BH context so it couldn't use sleepable
functions. So, in order to switch context, the two workqueues are
added which processes query and report event.

In the struct inet6_dev, mc_{query | report}_queue are added so it
is per-interface queue.
And mc_{query | report}_work are workqueue structure.

When the query or report event is received, skb is queued to proper
queue and worker function is scheduled immediately.
Workqueues and queues are protected by spinlock, which is
mc_{query | report}_lock, and worker functions are protected by RTNL.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 15:14:56 -07:00
Taehee Yoo
88e2ca3080 mld: convert ifmcaddr6 to RCU
The ifmcaddr6 has been protected by inet6_dev->lock(rwlock) so that
the critical section is atomic context. In order to switch this context,
changing locking is needed. The ifmcaddr6 actually already protected by
RTNL So if it's converted to use RCU, its control path context can be
switched to sleepable.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 15:14:56 -07:00
Taehee Yoo
4b200e3989 mld: convert ip6_sf_list to RCU
The ip6_sf_list has been protected by mca_lock(spin_lock) so that the
critical section is atomic context. In order to switch this context,
changing locking is needed. The ip6_sf_list actually already protected
by RTNL So if it's converted to use RCU, its control path context can
be switched to sleepable.
But It doesn't remove mca_lock yet because ifmcaddr6 isn't converted
to RCU yet. So, It's not fully converted to the sleepable context.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 15:14:56 -07:00
Taehee Yoo
882ba1f73c mld: convert ipv6_mc_socklist->sflist to RCU
The sflist has been protected by rwlock so that the critical section
is atomic context.
In order to switch this context, changing locking is needed.
The sflist actually already protected by RTNL So if it's converted
to use RCU, its control path context can be switched to sleepable.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 15:14:56 -07:00
Taehee Yoo
cf2ce339b4 mld: get rid of inet6_dev->mc_lock
The purpose of mc_lock is to protect inet6_dev->mc_tomb.
But mc_tomb is already protected by RTNL and all functions,
which manipulate mc_tomb are called under RTNL.
So, mc_lock is not needed.
Furthermore, it is spinlock so the critical section is atomic.
In order to reduce atomic context, it should be removed.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 15:14:55 -07:00
Taehee Yoo
2d9a93b490 mld: convert from timer to delayed work
mcast.c has several timers for delaying works.
Timer's expire handler is working under atomic context so it can't use
sleepable things such as GFP_KERNEL, mutex, etc.
In order to use sleepable APIs, it converts from timers to delayed work.
But there are some critical sections, which is used by both process
and BH context. So that it still uses spin_lock_bh() and rwlock.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 15:14:55 -07:00
Wan Jiabing
fcb8d0d758 bpf: struct sock is declared twice in bpf_sk_storage header
struct sock has been declared twice, therefore remove the duplicate.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210325070602.858024-1-wanjiabing@vivo.com
2021-03-26 17:43:55 +01:00
Eric Dumazet
3a87571f0f sch_red: fix off-by-one checks in red_check_params()
This fixes following syzbot report:

UBSAN: shift-out-of-bounds in ./include/net/red.h:237:23
shift exponent 32 is too large for 32-bit type 'unsigned int'
CPU: 1 PID: 8418 Comm: syz-executor170 Not tainted 5.12.0-rc4-next-20210324-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
 red_set_parms include/net/red.h:237 [inline]
 choke_change.cold+0x3c/0xc8 net/sched/sch_choke.c:414
 qdisc_create+0x475/0x12f0 net/sched/sch_api.c:1247
 tc_modify_qdisc+0x4c8/0x1a50 net/sched/sch_api.c:1663
 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:674
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x43f039
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffdfa725168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000400488 RCX: 000000000043f039
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000004
RBP: 0000000000403020 R08: 0000000000400488 R09: 0000000000400488
R10: 0000000000400488 R11: 0000000000000246 R12: 00000000004030b0
R13: 0000000000000000 R14: 00000000004ac018 R15: 0000000000400488

Fixes: 8afa10cbe2 ("net_sched: red: Avoid illegal values")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:40:43 -07:00
Eric Dumazet
4ecc1baf36 tcp: convert elligible sysctls to u8
Many tcp sysctls are either bools or small ints that can fit into u8.

Reducing space taken by sysctls can save few cache line misses
when sending/receiving data while cpu caches are empty,
for example after cpu idle period.

This is hard to measure with typical network performance tests,
but after this patch, struct netns_ipv4 has shrunk
by three cache lines.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:39:33 -07:00
Eric Dumazet
2932bcda07 inet: convert tcp_early_demux and udp_early_demux to u8
For these sysctls, their dedicated helpers have
to use proc_dou8vec_minmax().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:39:33 -07:00
Eric Dumazet
1c69dedc8f ipv4: convert ip_forward_update_priority sysctl to u8
This sysctl uses ip_fwd_update_priority() helper,
so the conversion needs to change it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:39:33 -07:00
Eric Dumazet
4b6bbf17d4 ipv4: shrink netns_ipv4 with sysctl conversions
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.

 - icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
 - icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
 - tcp_ecn, tcp_ecn_fallback
 - ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
 - ip_nonlocal_bind, ip_autobind_reuse
 - ip_dynaddr, ip_early_demux, raw_l3mdev_accept
 - nexthop_compat_mode, fwmark_reflect

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:39:33 -07:00
David S. Miller
efd13b71a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 15:31:22 -07:00
Eric Dumazet
d1c5688087 tcp_metrics: tcpm_hash_bucket is strictly local
After commit 098a697b49 ("tcp_metrics: Use a single hash table
for all network namespaces."), tcpm_hash_bucket is local to
net/ipv4/tcp_metrics.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 16:45:40 -07:00
Pablo Neira Ayuso
563ae557dd net: flow_offload: add FLOW_ACTION_PPPOE_PUSH
Add an action to represent the PPPoE hardware offload support that
includes the session ID.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:39 -07:00
Felix Fietkau
26267bf9bb netfilter: flowtable: bridge vlan hardware offload and switchdev
The switch might have already added the VLAN tag through PVID hardware
offload. Keep this extra VLAN in the flowtable but skip it on egress.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:39 -07:00
Pablo Neira Ayuso
73f97025a9 netfilter: nft_flow_offload: use direct xmit if hardware offload is enabled
If there is a forward path to reach an ethernet device and hardware
offload is enabled, then use the direct xmit path.

Moreover, store the real device in the direct xmit path info since
software datapath uses dev_hard_header() to push the layer encapsulation
headers while hardware offload refers to the real device.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:39 -07:00
Pablo Neira Ayuso
4cd91f7c29 netfilter: flowtable: add vlan support
Add the vlan id and protocol to the flow tuple to uniquely identify
flows from the receive path. For the transmit path, dev_hard_header() on
the vlan device push the headers. This patch includes support for two
vlan headers (QinQ) from the ingress path.

Add a generic encap field to the flowtable entry which stores the
protocol and the tag id. This allows to reuse these fields in the PPPoE
support coming in a later patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:39 -07:00
Pablo Neira Ayuso
7a27f6ab41 netfilter: flowtable: use dev_fill_forward_path() to obtain egress device
The egress device in the tuple is obtained from route. Use
dev_fill_forward_path() instead to provide the real egress device for
this flow whenever this is available.

The new FLOW_OFFLOAD_XMIT_DIRECT type uses dev_queue_xmit() to transmit
ethernet frames. Cache the source and destination hardware address to
use dev_queue_xmit() to transfer packets.

The FLOW_OFFLOAD_XMIT_DIRECT replaces FLOW_OFFLOAD_XMIT_NEIGH if
dev_fill_forward_path() finds a direct transmit path.

In case of topology updates, if peer is moved to different bridge port,
the connection will time out, reconnect will result in a new entry with
the correct path. Snooping fdb updates would allow for cleaning up stale
flowtable entries.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:39 -07:00
Pablo Neira Ayuso
c63a7cc4d7 netfilter: flowtable: use dev_fill_forward_path() to obtain ingress device
Obtain the ingress device in the tuple from the route in the reply
direction. Use dev_fill_forward_path() instead to get the real ingress
device for this flow.

Fall back to use the ingress device that the IP forwarding route
provides if:

- dev_fill_forward_path() finds no real ingress device.
- the ingress device that is obtained is not part of the flowtable
  devices.
- this route has a xfrm policy.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:39 -07:00
Pablo Neira Ayuso
5139c0c007 netfilter: flowtable: add xmit path types
Add the xmit_type field that defines the two supported xmit paths in the
flowtable data plane, which are the neighbour and the xfrm xmit paths.
This patch prepares for new flowtable xmit path types to come.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:39 -07:00
Steffen Klassert
b1e3a56070 xfrm: Fix NULL pointer dereference on policy lookup
When xfrm interfaces are used in combination with namespaces
and ESP offload, we get a dst_entry NULL pointer dereference.
This is because we don't have a dst_entry attached in the ESP
offloading case and we need to do a policy lookup before the
namespace transition.

Fix this by expicit checking of skb_dst(skb) before accessing it.

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-24 10:00:24 +01:00
Vladimir Oltean
4f2673b3a2 net: bridge: add helper to replay port and host-joined mdb entries
I have a system with DSA ports, and udhcpcd is configured to bring
interfaces up as soon as they are created.

I create a bridge as follows:

ip link add br0 type bridge

As soon as I create the bridge and udhcpcd brings it up, I also have
avahi which automatically starts sending IPv6 packets to advertise some
local services, and because of that, the br0 bridge joins the following
IPv6 groups due to the code path detailed below:

33:33:ff:6d:c1:9c vid 0
33:33:00:00:00:6a vid 0
33:33:00:00:00:fb vid 0

br_dev_xmit
-> br_multicast_rcv
   -> br_ip6_multicast_add_group
      -> __br_multicast_add_group
         -> br_multicast_host_join
            -> br_mdb_notify

This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
hooked up, and switchdev will attempt to offload the host joined groups
to an empty list of ports. Of course nobody offloads them.

Then when we add a port to br0:

ip link set swp0 master br0

the bridge doesn't replay the host-joined MDB entries from br_add_if,
and eventually the host joined addresses expire, and a switchdev
notification for deleting it is emitted, but surprise, the original
addition was already completely missed.

The strategy to address this problem is to replay the MDB entries (both
the port ones and the host joined ones) when the new port joins the
bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
be populated and only then attached to a bridge that you offload).
However there are 2 possibilities: the addresses can be 'pushed' by the
bridge into the port, or the port can 'pull' them from the bridge.

Considering that in the general case, the new port can be really late to
the party, and there may have been many other switchdev ports that
already received the initial notification, we would like to avoid
delivering duplicate events to them, since they might misbehave. And
currently, the bridge calls the entire switchdev notifier chain, whereas
for replaying it should just call the notifier block of the new guy.
But the bridge doesn't know what is the new guy's notifier block, it
just knows where the switchdev notifier chain is. So for simplification,
we make this a driver-initiated pull for now, and the notifier block is
passed as an argument.

To emulate the calling context for mdb objects (deferred and put on the
blocking notifier chain), we must iterate under RCU protection through
the bridge's mdb entries, queue them, and only call them once we're out
of the RCU read-side critical section.

There was some opportunity for reuse between br_mdb_switchdev_host_port,
br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev
mdb object is created, so a helper was created.

Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23 14:49:05 -07:00
Xie He
65d2dbb300 net: lapb: Make "lapb_t1timer_running" able to detect an already running timer
Problem:

The "lapb_t1timer_running" function in "lapb_timer.c" is used in only
one place: in the "lapb_kick" function in "lapb_out.c". "lapb_kick" calls
"lapb_t1timer_running" to check if the timer is already pending, and if
it is not, schedule it to run.

However, if the timer has already fired and is running, and is waiting to
get the "lapb->lock" lock, "lapb_t1timer_running" will not detect this,
and "lapb_kick" will then schedule a new timer. The old timer will then
abort when it sees a new timer pending.

I think this is not right. The purpose of "lapb_kick" should be ensuring
that the actual work of the timer function is scheduled to be done.
If the timer function is already running but waiting for the lock,
"lapb_kick" should not abort and reschedule it.

Changes made:

I added a new field "t1timer_running" in "struct lapb_cb" for
"lapb_t1timer_running" to use. "t1timer_running" will accurately reflect
whether the actual work of the timer is pending. If the timer has fired
but is still waiting for the lock, "t1timer_running" will still correctly
reflect whether the actual work is waiting to be done.

The old "t1timer_stop" field, whose only responsibility is to ask a timer
(that is already running but waiting for the lock) to abort, is no longer
needed, because the new "t1timer_running" field can fully take over its
responsibility. Therefore "t1timer_stop" is deleted.

"t1timer_running" is not simply a negation of the old "t1timer_stop".
At the end of the timer function, if it does not reschedule itself,
"t1timer_running" is set to false to indicate that the timer is stopped.

For consistency of the code, I also added "t2timer_running" and deleted
"t2timer_stop".

Signed-off-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23 14:14:50 -07:00
David S. Miller
9a255a0635 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

1) Split flowtable workqueues per events, from Oz Shlomo.

2) fall-through warnings for clang, from Gustavo A. R. Silva

3) Remove unused declaration in conntrack, from YueHaibing.

4) Consolidate skb_try_make_writable() in flowtable datapath,
   simplify some of the existing codebase.

5) Call dst_check() to fall back to static classic forwarding path.

6) Update table flags from commit phase.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 17:07:30 -07:00
Archie Pusaka
5c4c8c9544 Bluetooth: verify AMP hci_chan before amp_destroy
hci_chan can be created in 2 places: hci_loglink_complete_evt() if
it is an AMP hci_chan, or l2cap_conn_add() otherwise. In theory,
Only AMP hci_chan should be removed by a call to
hci_disconn_loglink_complete_evt(). However, the controller might mess
up, call that function, and destroy an hci_chan which is not initiated
by hci_loglink_complete_evt().

This patch adds a verification that the destroyed hci_chan must have
been init'd by hci_loglink_complete_evt().

Example crash call trace:
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0xe3/0x144 lib/dump_stack.c:118
 print_address_description+0x67/0x22a mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report mm/kasan/report.c:412 [inline]
 kasan_report+0x251/0x28f mm/kasan/report.c:396
 hci_send_acl+0x3b/0x56e net/bluetooth/hci_core.c:4072
 l2cap_send_cmd+0x5af/0x5c2 net/bluetooth/l2cap_core.c:877
 l2cap_send_move_chan_cfm_icid+0x8e/0xb1 net/bluetooth/l2cap_core.c:4661
 l2cap_move_fail net/bluetooth/l2cap_core.c:5146 [inline]
 l2cap_move_channel_rsp net/bluetooth/l2cap_core.c:5185 [inline]
 l2cap_bredr_sig_cmd net/bluetooth/l2cap_core.c:5464 [inline]
 l2cap_sig_channel net/bluetooth/l2cap_core.c:5799 [inline]
 l2cap_recv_frame+0x1d12/0x51aa net/bluetooth/l2cap_core.c:7023
 l2cap_recv_acldata+0x2ea/0x693 net/bluetooth/l2cap_core.c:7596
 hci_acldata_packet net/bluetooth/hci_core.c:4606 [inline]
 hci_rx_work+0x2bd/0x45e net/bluetooth/hci_core.c:4796
 process_one_work+0x6f8/0xb50 kernel/workqueue.c:2175
 worker_thread+0x4fc/0x670 kernel/workqueue.c:2321
 kthread+0x2f0/0x304 kernel/kthread.c:253
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415

Allocated by task 38:
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0x8d/0x9a mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x102/0x129 mm/slub.c:2787
 kmalloc include/linux/slab.h:515 [inline]
 kzalloc include/linux/slab.h:709 [inline]
 hci_chan_create+0x86/0x26d net/bluetooth/hci_conn.c:1674
 l2cap_conn_add.part.0+0x1c/0x814 net/bluetooth/l2cap_core.c:7062
 l2cap_conn_add net/bluetooth/l2cap_core.c:7059 [inline]
 l2cap_connect_cfm+0x134/0x852 net/bluetooth/l2cap_core.c:7381
 hci_connect_cfm+0x9d/0x122 include/net/bluetooth/hci_core.h:1404
 hci_remote_ext_features_evt net/bluetooth/hci_event.c:4161 [inline]
 hci_event_packet+0x463f/0x72fa net/bluetooth/hci_event.c:5981
 hci_rx_work+0x197/0x45e net/bluetooth/hci_core.c:4791
 process_one_work+0x6f8/0xb50 kernel/workqueue.c:2175
 worker_thread+0x4fc/0x670 kernel/workqueue.c:2321
 kthread+0x2f0/0x304 kernel/kthread.c:253
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415

Freed by task 1732:
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free mm/kasan/kasan.c:521 [inline]
 __kasan_slab_free+0x106/0x128 mm/kasan/kasan.c:493
 slab_free_hook mm/slub.c:1409 [inline]
 slab_free_freelist_hook+0xaa/0xf6 mm/slub.c:1436
 slab_free mm/slub.c:3009 [inline]
 kfree+0x182/0x21e mm/slub.c:3972
 hci_disconn_loglink_complete_evt net/bluetooth/hci_event.c:4891 [inline]
 hci_event_packet+0x6a1c/0x72fa net/bluetooth/hci_event.c:6050
 hci_rx_work+0x197/0x45e net/bluetooth/hci_core.c:4791
 process_one_work+0x6f8/0xb50 kernel/workqueue.c:2175
 worker_thread+0x4fc/0x670 kernel/workqueue.c:2321
 kthread+0x2f0/0x304 kernel/kthread.c:253
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415

The buggy address belongs to the object at ffff8881d7af9180
 which belongs to the cache kmalloc-128 of size 128
The buggy address is located 24 bytes inside of
 128-byte region [ffff8881d7af9180, ffff8881d7af9200)
The buggy address belongs to the page:
page:ffffea00075ebe40 count:1 mapcount:0 mapping:ffff8881da403200 index:0x0
flags: 0x8000000000000200(slab)
raw: 8000000000000200 dead000000000100 dead000000000200 ffff8881da403200
raw: 0000000000000000 0000000080150015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8881d7af9080: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
 ffff8881d7af9100: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff8881d7af9180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
 ffff8881d7af9200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8881d7af9280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reported-by: syzbot+98228e7407314d2d4ba2@syzkaller.appspotmail.com
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-03-22 17:00:09 +01:00
Ahmed S. Darwish
bc8e0adff3 net: xfrm: Use sequence counter with associated spinlock
A sequence counter write section must be serialized or its internal
state can get corrupted. A plain seqcount_t does not contain the
information of which lock must be held to guaranteee write side
serialization.

For xfrm_state_hash_generation, use seqcount_spinlock_t instead of plain
seqcount_t.  This allows to associate the spinlock used for write
serialization with the sequence counter. It thus enables lockdep to
verify that the write serialization lock is indeed held before entering
the sequence counter write section.

If lockdep is disabled, this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-22 07:38:08 +01:00
Ahmed S. Darwish
e88add19f6 net: xfrm: Localize sequence counter per network namespace
A sequence counter write section must be serialized or its internal
state can get corrupted. The "xfrm_state_hash_generation" seqcount is
global, but its write serialization lock (net->xfrm.xfrm_state_lock) is
instantiated per network namespace. The write protection is thus
insufficient.

To provide full protection, localize the sequence counter per network
namespace instead. This should be safe as both the seqcount read and
write sections access data exclusively within the network namespace. It
also lays the foundation for transforming "xfrm_state_hash_generation"
data type from seqcount_t to seqcount_LOCKNAME_t in further commits.

Fixes: b65e3d7be0 ("xfrm: state: add sequence count to detect hash resizes")
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-22 07:35:42 +01:00
Bhaskar Chowdhury
8a2dc6af67 sch_red: Fix a typo
s/recalcultion/recalculation/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 11:51:03 -07:00
Alexander Lobakin
86af2c82c2 gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper
call_gro_receive() is used to limit GRO recursion, but it works only
with callback pointers.
There's a combined version of call_gro_receive() + INDIRECT_CALL_2()
in <net/inet_common.h>, but it doesn't check for IPv6 modularity.
Add a similar new helper to cover both of these. It can and will be
used to avoid retpoline overhead when IP header lies behind another
offloaded proto.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:51:12 -07:00
Alexander Lobakin
e75ec151c1 gro: make net/gro.h self-contained
If some source file includes <net/gro.h>, but doesn't include
<linux/indirect_call_wrapper.h>:

In file included from net/8021q/vlan_core.c:7:
./include/net/gro.h:6:1: warning: data definition has no type or storage class
    6 | INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *,
      | ^~~~~~~~~~~~~~~~~~~~~~~~~
./include/net/gro.h:6:1: error: type defaults to ‘int’ in declaration of ‘INDIRECT_CALLABLE_DECLARE’ [-Werror=implicit-int]

[...]

Include <linux/indirect_call_wrapper.h> directly. It's small and
won't pull lots of dependencies.
Also add some incomplete struct declarations to be fully stacked.

Fixes: 04f00ab227 ("net/core: move gro function declarations to separate header ")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:51:12 -07:00
David S. Miller
84f4aced67 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Several patches to testore use of memory barriers instead of RCU to
   ensure consistent access to ruleset, from Mark Tomlinson.

2) Fix dump of expectation via ctnetlink, from Florian Westphal.

3) GRE helper works for IPv6, from Ludovic Senecaux.

4) Set error on unsupported flowtable flags.

5) Use delayed instead of deferrable workqueue in the flowtable,
   from Yinjun Zhang.

6) Fix spurious EEXIST in case of add-after-delete flowtable in
   the same batch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:19:06 -07:00
Tobias Waldekranz
cc76ce9e8d net: dsa: Add helper to resolve bridge port from DSA port
In order for a driver to be able to query a bridge for information
about itself, e.g. reading out port flags, it has to use a netdev that
is known to the bridge. In the simple case, that is just the netdev
representing the port, e.g. swp0 or swp1 in this example:

   br0
   / \
swp0 swp1

But in the case of an offloaded lag, this will be the bond or team
interface, e.g. bond0 in this example:

     br0
     /
  bond0
   / \
swp0 swp1

Add a helper that hides some of this complexity from the
drivers. Then, redefine dsa_port_offloads_bridge_port using the helper
to avoid double accounting of the set of possible offloaded uppers.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 16:24:06 -07:00
Pablo Neira Ayuso
0ce7cf4127 netfilter: nftables: update table flags from the commit phase
Do not update table flags from the preparation phase. Store the flags
update into the transaction, then update the flags from the commit
phase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 01:35:39 +01:00
Pablo Neira Ayuso
f4401262b9 netfilter: flowtable: fast NAT functions never fail
Simplify existing fast NAT routines by returning void. After the
skb_try_make_writable() call consolidation, these routines cannot ever
fail.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:44:00 +01:00
Pablo Neira Ayuso
4f08f173d0 netfilter: flowtable: move FLOW_OFFLOAD_DIR_MAX away from enumeration
This allows to remove the default case which should not ever happen and
that was added to avoid gcc warnings on unhandled FLOW_OFFLOAD_DIR_MAX
enumeration case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:44:00 +01:00
YueHaibing
d4a96be654 netfilter: conntrack: Remove unused variable declaration
commit e97c3e278e ("tproxy: split off ipv6 defragmentation to a separate
module") left behind this.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:34:19 +01:00
Pablo Neira Ayuso
7b35582cd0 netfilter: nftables: allow to update flowtable flags
Honor flowtable flags from the control update path. Disallow disabling
to toggle hardware offload support though.

Fixes: 8bb69f3b29 ("netfilter: nf_tables: add flowtable offload control plane")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:32:21 +01:00
Álvaro Fernández Rojas
964dbf186e net: dsa: tag_brcm: add support for legacy tags
Add support for legacy Broadcom tags, which are similar to DSA_TAG_PROTO_BRCM.
These tags are used on BCM5325, BCM5365 and BCM63xx switches.

Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-17 12:24:36 -07:00
Martin Willi
3a5ca85707 can: dev: Move device back to init netns on owning netns delete
When a non-initial netns is destroyed, the usual policy is to delete
all virtual network interfaces contained, but move physical interfaces
back to the initial netns. This keeps the physical interface visible
on the system.

CAN devices are somewhat special, as they define rtnl_link_ops even
if they are physical devices. If a CAN interface is moved into a
non-initial netns, destroying that netns lets the interface vanish
instead of moving it back to the initial netns. default_device_exit()
skips CAN interfaces due to having rtnl_link_ops set. Reproducer:

  ip netns add foo
  ip link set can0 netns foo
  ip netns delete foo

WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
Workqueue: netns cleanup_net
[<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
[<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
[<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
[<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
[<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
[<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
[<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
[<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
[<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
[<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)

To properly restore physical CAN devices to the initial netns on owning
netns exit, introduce a flag on rtnl_link_ops that can be set by drivers.
For CAN devices setting this flag, default_device_exit() considers them
non-virtual, applying the usual namespace move.

The issue was introduced in the commit mentioned below, as at that time
CAN devices did not have a dellink() operation.

Fixes: e008b5fc8d ("net: Simplfy default_device_exit and improve batching.")
Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-16 08:40:04 +01:00
Alexander Ovechkin
7233da8669 tcp: relookup sock for RST+ACK packets handled by obsolete req sock
Currently tcp_check_req can be called with obsolete req socket for which big
socket have been already created (because of CPU race or early demux
assigning req socket to multiple packets in gro batch).

Commit e0f9759f53 ("tcp: try to keep packet if SYN_RCV race
is lost") added retry in case when tcp_check_req is called for PSH|ACK packet.
But if client sends RST+ACK immediatly after connection being
established (it is performing healthcheck, for example) retry does not
occur. In that case tcp_check_req tries to close req socket,
leaving big socket active.

Fixes: e0f9759f53 ("tcp: try to keep packet if SYN_RCV race is lost")
Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
Reported-by: Oleg Senin <olegsenin@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-15 14:34:29 -07:00
Luiz Augusto von Dentz
7cf3b1dd6a Bluetooth: L2CAP: Fix not checking for maximum number of DCID
When receiving L2CAP_CREDIT_BASED_CONNECTION_REQ the remote may request
more channels than allowed by the spec (10 octecs = 5 CIDs) so this
checks if the number of channels is bigger than the maximum allowed and
respond with an error.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-03-15 21:08:33 +01:00
Ido Schimmel
07e1a5809b psample: Add additional metadata attributes
Extend psample to report the following attributes when available:

* Output traffic class as a 16-bit value
* Output traffic class occupancy in bytes as a 64-bit value
* End-to-end latency of the packet in nanoseconds resolution
* Software timestamp in nanoseconds resolution (always available)
* Packet's protocol. Needed for packet dissection in user space (always
  available)

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 15:00:43 -07:00
Ido Schimmel
a03e99d39f psample: Encapsulate packet metadata in a struct
Currently, callers of psample_sample_packet() pass three metadata
attributes: Ingress port, egress port and truncated size. Subsequent
patches are going to add more attributes (e.g., egress queue occupancy),
which also need an indication whether they are valid or not.

Encapsulate packet metadata in a struct in order to keep the number of
arguments reasonable.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 15:00:43 -07:00
Alexander Lobakin
f96533cded flow_dissector: constify raw input data argument
Flow Dissector code never modifies the input buffer, neither skb nor
raw data.
Make 'data' argument const for all of the Flow dissector's functions.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:46:32 -07:00
Alexander Lobakin
dac06b32c7 flow_dissector: constify bpf_flow_dissector's data pointers
BPF Flow dissection programs are read-only and don't touch input
buffers.
Mark 'data' and 'data_end' in struct bpf_flow_dissector as const
in preparation for global input constifying.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:46:32 -07:00
Baowen Zheng
2ffe039528 net/sched: act_police: add support for packet-per-second policing
Allow a policer action to enforce a rate-limit based on packets-per-second,
configurable using a packet-per-second rate and burst parameters.

e.g.
tc filter add dev tap1 parent ffff: u32 match \
        u32 0 0 police pkts_rate 3000 pkts_burst 1000

Testing was unable to uncover a performance impact of this change on
existing features.

Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-13 14:18:09 -08:00
Xingfeng Hu
25660156f4 flow_offload: add support for packet-per-second policing
Allow flow_offload API to configure packet-per-second policing using rate
and burst parameters.

Dummy implementations of tcf_police_rate_pkt_ps() and
tcf_police_burst_pkt() are supplied which return 0, the unconfigured state.
This is to facilitate splitting the offload, driver, and TC code portion of
this feature into separate patches with the aim of providing a logical flow
for review. And the implementation of these helpers will be filled out by a
follow-up patch.

Signed-off-by: Xingfeng Hu <xingfeng.hu@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-13 14:18:09 -08:00
Geliang Tang
6445e17af7 mptcp: add rm_list in mptcp_out_options
This patch defined a new struct mptcp_rm_list, the ids field was an
array of the removing address ids, the nr field was the valid number of
removing address ids in the array. The array size was definced as a new
macro MPTCP_RM_IDS_MAX. Changed the member rm_id of struct
mptcp_out_options to rm_list.

In mptcp_established_options_rm_addr, invoked mptcp_pm_rm_addr_signal to
get the rm_list. According the number of addresses in it, calculated
the padded RM_ADDR suboption length. And saved the ids array in struct
mptcp_out_options's rm_list member.

In mptcp_write_options, iterated each address id from struct
mptcp_out_options's rm_list member, set the invalid ones as TCPOPT_NOP,
then filled them into the RM_ADDR suboption.

Changed TCPOLEN_MPTCP_RM_ADDR_BASE from 4 to 3.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 17:47:45 -08:00
liuyacan
f211ac1545 net: correct sk_acceptq_is_full()
The "backlog" argument in listen() specifies
the maximom length of pending connections,
so the accept queue should be considered full
if there are exactly "backlog" elements.

Signed-off-by: liuyacan <yacanliu@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 12:27:14 -08:00
Ido Schimmel
cfc15c1dbb nexthop: Allow reporting activity of nexthop buckets
The kernel periodically checks the idle time of nexthop buckets to
determine if they are idle and can be re-populated with a new nexthop.

When the resilient nexthop group is offloaded to hardware, the kernel
will not see activity on nexthop buckets unless it is reported from
hardware.

Add a function that can be periodically called by device drivers to
report activity on nexthop buckets after querying it from the underlying
device.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 16:13:00 -08:00
Ido Schimmel
56ad5ba344 nexthop: Allow setting "offload" and "trap" indication of nexthop buckets
Add a function that can be called by device drivers to set "offload" or
"trap" indication on nexthop buckets following nexthop notifications and
other changes such as a neighbour becoming invalid.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 16:12:59 -08:00
Ido Schimmel
b8f090d0be nexthop: Add data structures for resilient group notifications
Add data structures that will be used for in-kernel notifications about
addition / deletion of a resilient nexthop group and about changes to a
hash bucket within a resilient group.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 16:12:59 -08:00
Petr Machata
283a72a559 nexthop: Add implementation of resilient next-hop groups
At this moment, there is only one type of next-hop group: an mpath group,
which implements the hash-threshold algorithm.

To select a next hop, hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. While
there will usually be some overlap between the previous and the new
distribution, some traffic flows change the next hop that they resolve to.
That causes problems e.g. as established TCP connections are reset, because
the traffic is forwarded to a server that is not familiar with the
connection.

Resilient hashing is a technique to address the above problem. Resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, and then reads
the next hop that this bucket contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
continuous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore when a next hop is deleted
the buckets that held it are simply reassigned to other next hops. When
weights of next hops in a group are altered, it may be possible to choose a
subset of buckets that are currently not used for forwarding traffic, and
use those to satisfy the new next-hop distribution demands, keeping the
"busy" buckets intact. This way, established flows are ideally kept being
forwarded to the same endpoints through the same paths as before the
next-hop group change.

In a nutshell, the algorithm works as follows. Each next hop has a number
of buckets that it wants to have, according to its weight and the number of
buckets in the hash table. In case of an event that might cause bucket
allocation change, the numbers for individual next hops are updated,
similarly to how ranges are updated for mpath group next hops. Following
that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
next hop that is currently occupying more buckets than it wants (it is
"overweight"), it migrates the buckets to one of the next hops that has
fewer buckets than it wants (it is "underweight"). If, after this, there
are still underweight next hops, another upkeep run is scheduled to a
future time.

Chances are there are not enough "idle" buckets to satisfy the new demands.
The algorithm has knobs to select both what it means for a bucket to be
idle, and for whether and when to forcefully migrate buckets if there keeps
being an insufficient number of idle buckets.

There are three users of the resilient data structures.

- The forwarding code accesses them under RCU, and does not modify them
  except for updating the time a selected bucket was last used.

- Netlink code, running under RTNL, which may modify the data.

- The delayed upkeep code, which may modify the data. This runs unlocked,
  and mutual exclusion between the RTNL code and the delayed upkeep is
  maintained by canceling the delayed work synchronously before the RTNL
  code touches anything. Later it restarts the delayed work if necessary.

The RTNL code has to implement next-hop group replacement, next hop
removal, etc. For removal, the mpath code uses a neat trick of having a
backup next hop group structure, doing the necessary changes offline, and
then RCU-swapping them in. However, the hash tables for resilient hashing
are about an order of magnitude larger than the groups themselves (the size
might be e.g. 4K entries), and it was felt that keeping two of them is an
overkill. Both the primary next-hop group and the spare therefore use the
same resilient table, and writers are careful to keep all references valid
for the forwarding code. The hash table references next-hop group entries
from the next-hop group that is currently in the primary role (i.e. not
spare). During the transition from primary to spare, the table references a
mix of both the primary group and the spare. When a next hop is deleted,
the corresponding buckets are not set to NULL, but instead marked as empty,
so that the pointer is valid and can be used by the forwarding code. The
buckets are then migrated to a new next-hop group entry during upkeep. The
only times that the hash table is invalid is the very beginning and very
end of its lifetime. Between those points, it is always kept valid.

This patch introduces the core support code itself. It does not handle
notifications towards drivers, which are kept as if the group were an mpath
one. It does not handle netlink either. The only bit currently exposed to
user space is the new next-hop group type, and that is currently bounced.
There is therefore no way to actually access this code.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 16:12:59 -08:00
Petr Machata
90e1a9e213 nexthop: Add a dedicated flag for multipath next-hop groups
With the introduction of resilient nexthop groups, there will be two types
of multipath groups: the current hash-threshold "mpath" ones, and resilient
groups. Both are multipath, but to determine the fact, the system needs to
consider two flags. This might prove costly in the datapath. Therefore,
introduce a new flag, that should be set for next-hop groups that have more
than one nexthop, and should be considered multipath.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 16:12:59 -08:00
Julien Massonneau
ee90c6ba34 seg6: add support for IPv4 decapsulation in ipv6_srh_rcv()
As specified in IETF RFC 8754, section 4.3.1.2, if the upper layer
header is IPv4 or IPv6, perform IPv6 decapsulation and resubmit the
decapsulated packet to the IPv4 or IPv6 module.
Only IPv6 decapsulation was implemented. This patch adds support for IPv4
decapsulation.

Link: https://tools.ietf.org/html/rfc8754#section-4.3.1.2
Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 16:09:21 -08:00
Eric Dumazet
e323d865b3 net: sched: validate stab values
iproute2 package is well behaved, but malicious user space can
provide illegal shift values and trigger UBSAN reports.

Add stab parameter to red_check_params() to validate user input.

syzbot reported:

UBSAN: shift-out-of-bounds in ./include/net/red.h:312:18
shift exponent 111 is too large for 64-bit type 'long unsigned int'
CPU: 1 PID: 14662 Comm: syz-executor.3 Not tainted 5.12.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
 red_calc_qavg_from_idle_time include/net/red.h:312 [inline]
 red_calc_qavg include/net/red.h:353 [inline]
 choke_enqueue.cold+0x18/0x3dd net/sched/sch_choke.c:221
 __dev_xmit_skb net/core/dev.c:3837 [inline]
 __dev_queue_xmit+0x1943/0x2e00 net/core/dev.c:4150
 neigh_hh_output include/net/neighbour.h:499 [inline]
 neigh_output include/net/neighbour.h:508 [inline]
 ip6_finish_output2+0x911/0x1700 net/ipv6/ip6_output.c:117
 __ip6_finish_output net/ipv6/ip6_output.c:182 [inline]
 __ip6_finish_output+0x4c1/0xe10 net/ipv6/ip6_output.c:161
 ip6_finish_output+0x35/0x200 net/ipv6/ip6_output.c:192
 NF_HOOK_COND include/linux/netfilter.h:290 [inline]
 ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:215
 dst_output include/net/dst.h:448 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 NF_HOOK include/linux/netfilter.h:295 [inline]
 ip6_xmit+0x127e/0x1eb0 net/ipv6/ip6_output.c:320
 inet6_csk_xmit+0x358/0x630 net/ipv6/inet6_connection_sock.c:135
 dccp_transmit_skb+0x973/0x12c0 net/dccp/output.c:138
 dccp_send_reset+0x21b/0x2b0 net/dccp/output.c:535
 dccp_finish_passive_close net/dccp/proto.c:123 [inline]
 dccp_finish_passive_close+0xed/0x140 net/dccp/proto.c:118
 dccp_terminate_connection net/dccp/proto.c:958 [inline]
 dccp_close+0xb3c/0xe60 net/dccp/proto.c:1028
 inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
 inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478
 __sock_release+0xcd/0x280 net/socket.c:599
 sock_close+0x18/0x20 net/socket.c:1258
 __fput+0x288/0x920 fs/file_table.c:280
 task_work_run+0xdd/0x1a0 kernel/task_work.c:140
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]

Fixes: 8afa10cbe2 ("net_sched: red: Avoid illegal values")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 15:47:52 -08:00
Vladimir Oltean
847cbfc014 net: add a helper to avoid issues with HW TX timestamping and SO_TXTIME
As explained in commit 29d98f54a4 ("net: enetc: allow hardware
timestamping on TX queues with tc-etf enabled"), hardware TX
timestamping requires an skb with skb->tstamp = 0. When a packet is sent
with SO_TXTIME, the skb->skb_mstamp_ns corrupts the value of skb->tstamp,
so the drivers need to explicitly reset skb->tstamp to zero after
consuming the TX time.

Create a helper named skb_txtime_consumed() which does just that. All
drivers which offload TC_SETUP_QDISC_ETF should implement it, and it
would make it easier to assess during review whether they do the right
thing in order to be compatible with hardware timestamping or not.

Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:45:16 -08:00
Wei Wang
28259bac7f ipv6: fix suspecious RCU usage warning
Syzbot reported the suspecious RCU usage in nexthop_fib6_nh() when
called from ipv6_route_seq_show(). The reason is ipv6_route_seq_start()
calls rcu_read_lock_bh(), while nexthop_fib6_nh() calls
rcu_dereference_rtnl().
The fix proposed is to add a variant of nexthop_fib6_nh() to use
rcu_dereference_bh_rtnl() for ipv6_route_seq_show().

The reported trace is as follows:
./include/net/nexthop.h:416 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
2 locks held by syz-executor.0/17895:
     at: seq_read+0x71/0x12a0 fs/seq_file.c:169
     at: seq_file_net include/linux/seq_file_net.h:19 [inline]
     at: ipv6_route_seq_start+0xaf/0x300 net/ipv6/ip6_fib.c:2616

stack backtrace:
CPU: 1 PID: 17895 Comm: syz-executor.0 Not tainted 4.15.0-syzkaller #0
Call Trace:
 [<ffffffff849edf9e>] __dump_stack lib/dump_stack.c:17 [inline]
 [<ffffffff849edf9e>] dump_stack+0xd8/0x147 lib/dump_stack.c:53
 [<ffffffff8480b7fa>] lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5745
 [<ffffffff8459ada6>] nexthop_fib6_nh include/net/nexthop.h:416 [inline]
 [<ffffffff8459ada6>] ipv6_route_native_seq_show net/ipv6/ip6_fib.c:2488 [inline]
 [<ffffffff8459ada6>] ipv6_route_seq_show+0x436/0x7a0 net/ipv6/ip6_fib.c:2673
 [<ffffffff81c556df>] seq_read+0xccf/0x12a0 fs/seq_file.c:276
 [<ffffffff81dbc62c>] proc_reg_read+0x10c/0x1d0 fs/proc/inode.c:231
 [<ffffffff81bc28ae>] do_loop_readv_writev fs/read_write.c:714 [inline]
 [<ffffffff81bc28ae>] do_loop_readv_writev fs/read_write.c:701 [inline]
 [<ffffffff81bc28ae>] do_iter_read+0x49e/0x660 fs/read_write.c:935
 [<ffffffff81bc81ab>] vfs_readv+0xfb/0x170 fs/read_write.c:997
 [<ffffffff81c88847>] kernel_readv fs/splice.c:361 [inline]
 [<ffffffff81c88847>] default_file_splice_read+0x487/0x9c0 fs/splice.c:416
 [<ffffffff81c86189>] do_splice_to+0x129/0x190 fs/splice.c:879
 [<ffffffff81c86f66>] splice_direct_to_actor+0x256/0x890 fs/splice.c:951
 [<ffffffff81c8777d>] do_splice_direct+0x1dd/0x2b0 fs/splice.c:1060
 [<ffffffff81bc4747>] do_sendfile+0x597/0xce0 fs/read_write.c:1459
 [<ffffffff81bca205>] SYSC_sendfile64 fs/read_write.c:1520 [inline]
 [<ffffffff81bca205>] SyS_sendfile64+0x155/0x170 fs/read_write.c:1506
 [<ffffffff81015fcf>] do_syscall_64+0x1ff/0x310 arch/x86/entry/common.c:305
 [<ffffffff84a00076>] entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: f88d8ea67f ("ipv6: Plumb support for nexthop object in a fib6_info")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Petr Machata <petrm@nvidia.com>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:35:23 -08:00
Daniel Borkmann
c4c877b273 net: Consolidate common blackhole dst ops
Move generic blackhole dst ops to the core and use them from both
ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No
functional change otherwise. We need these also in other locations
and having to define them over and over again is not great.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:24:18 -08:00
Jiri Slaby
c2a5a45c02 net: nfc: nci: drop nci_uart_ops::recv_buf
There is noone setting nci_uart_ops::recv_buf, so the default one
(nci_uart_default_recv_buf) is always used. So drop this hook, move
nci_uart_default_recv_buf before the use in nci_uart_tty_receive and
remove unused parameter flags.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Link: https://lore.kernel.org/r/20210302062214.29627-16-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-10 09:34:07 +01:00
David S. Miller
c1acda9807 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-03-09

The following pull-request contains BPF updates for your *net-next* tree.

We've added 90 non-merge commits during the last 17 day(s) which contain
a total of 114 files changed, 5158 insertions(+), 1288 deletions(-).

The main changes are:

1) Faster bpf_redirect_map(), from Björn.

2) skmsg cleanup, from Cong.

3) Support for floating point types in BTF, from Ilya.

4) Documentation for sys_bpf commands, from Joe.

5) Support for sk_lookup in bpf_prog_test_run, form Lorenz.

6) Enable task local storage for tracing programs, from Song.

7) bpf_for_each_map_elem() helper, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-09 18:07:05 -08:00
Björn Töpel
e6a4750ffe bpf, xdp: Make bpf_redirect_map() a map operation
Currently the bpf_redirect_map() implementation dispatches to the
correct map-lookup function via a switch-statement. To avoid the
dispatching, this change adds bpf_redirect_map() as a map
operation. Each map provides its bpf_redirect_map() version, and
correct function is automatically selected by the BPF verifier.

A nice side-effect of the code movement is that the map lookup
functions are now local to the map implementation files, which removes
one additional function call.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-2-bjorn.topel@gmail.com
2021-03-10 01:06:34 +01:00
Daniel Winkler
ff02db13e9 Bluetooth: Allow scannable adv with extended MGMT APIs
An issue was found, where if a bluetooth client requests a broadcast
advertisement with scan response data, it will not be properly
registered with the controller. This is because at the time that the
hci_cp_le_set_scan_param structure is created, the scan response will
not yet have been received since it comes in a second MGMT call. With
empty scan response, the request defaults to a non-scannable PDU type.
On some controllers, the subsequent scan response request will fail due
to incorrect PDU type, and others will succeed and not use the scan
response.

This fix allows the advertising parameters MGMT call to include a flag
to let the kernel know whether a scan response will be coming, so that
the correct PDU type is used in the first place. A bluetoothd change is
also incoming to take advantage of it.

To test this, I created a broadcast advertisement with scan response
data and registered it on the hatch chromebook. Without this change, the
request fails, and with it will succeed.

Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-03-03 18:14:38 +01:00
Abhishek Pandit-Subedi
e5b0ad69c9 Bluetooth: Remove unneeded commands for suspend
During suspend, there are a few scan enable and set event filter
commands that don't need to be sent unless there are actual BR/EDR
devices capable of waking the system. Check the HCI_PSCAN bit before
writing scan enable and use a new dev flag, HCI_EVENT_FILTER_CONFIGURED
to control whether to clear the event filter.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-03-03 17:49:29 +01:00
Evan Nimmo
9ab1265d52 xfrm: Use actual socket sk instead of skb socket for xfrm_output_resume
A situation can occur where the interface bound to the sk is different
to the interface bound to the sk attached to the skb. The interface
bound to the sk is the correct one however this information is lost inside
xfrm_output2 and instead the sk on the skb is used in xfrm_output_resume
instead. This assumes that the sk bound interface and the bound interface
attached to the sk within the skb are the same which can lead to lookup
failures inside ip_route_me_harder resulting in the packet being dropped.

We have an l2tp v3 tunnel with ipsec protection. The tunnel is in the
global VRF however we have an encapsulated dot1q tunnel interface that
is within a different VRF. We also have a mangle rule that marks the
packets causing them to be processed inside ip_route_me_harder.

Prior to commit 31c70d5956 ("l2tp: keep original skb ownership") this
worked fine as the sk attached to the skb was changed from the dot1q
encapsulated interface to the sk for the tunnel which meant the interface
bound to the sk and the interface bound to the skb were identical.
Commit 46d6c5ae95 ("netfilter: use actual socket sk rather than skb sk
when routing harder") fixed some of these issues however a similar
problem existed in the xfrm code.

Fixes: 31c70d5956 ("l2tp: keep original skb ownership")
Signed-off-by: Evan Nimmo <evan.nimmo@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-03 09:32:52 +01:00
Cong Wang
e3526bb92a skmsg: Move sk_redir from TCP_SKB_CB to skb
Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly
does not work for any other non-TCP protocols. We can move them to
skb ext, but it introduces a memory allocation on fast path.

Fortunately, we only need to a word-size to store all the information,
because the flags actually only contains 1 bit so can be just packed
into the lowest bit of the "pointer", which is stored as unsigned
long.

Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is
no longer needed after ->sk_data_ready() so we can just drop it.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-5-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00
Cong Wang
16137b09a6 bpf: Compute data_end dynamically with JIT code
Currently, we compute ->data_end with a compile-time constant
offset of skb. But as Jakub pointed out, we can actually compute
it in eBPF JIT code at run-time, so that we can competely get
rid of ->data_end. This is similar to skb_shinfo(skb) computation
in bpf_convert_shinfo_access().

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-4-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00
Cong Wang
887596095e bpf: Clean up sockmap related Kconfigs
As suggested by John, clean up sockmap related Kconfigs:

Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream
parser, to reflect its name.

Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL
and CONFIG_INET, the latter is still needed at this point because
of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched,
as it is used by non-sockmap cases.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00