Commit Graph

12108 Commits

Author SHA1 Message Date
Julian Anastasov
275411430f ipvs: add assured state for conn templates
cp->state was not used for templates. Add support for state bits
and for the first "assured" bit which indicates that some
connection controlled by this template was established or assured
by the real server. In a followup patch we will use it to drop
templates under SYN attack.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18 11:26:40 +02:00
Julian Anastasov
ec1b28ca96 ipvs: provide just conn to ip_vs_state_name
In preparation for followup patches, provide just the cp
ptr to ip_vs_state_name.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18 11:26:39 +02:00
Yi-Hung Wei
5c789e131c netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search
This patch is originally from Florian Westphal.

This patch does the following 3 main tasks.

1) Add list lock to 'struct nf_conncount_list' so that we can
alter the lists containing the individual connections without holding the
main tree lock.  It would be useful when we only need to add/remove to/from
a list without allocate/remove a node in the tree.  With this change, we
update nft_connlimit accordingly since we longer need to maintain
a list lock in nft_connlimit now.

2) Use RCU for the initial tree search to improve tree look up performance.

3) Add a garbage collection worker. This worker is schedule when there
are excessive tree node that needed to be recycled.

Moreover,the rbnode reclaim logic is moved from search tree to insert tree
to avoid race condition.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18 11:26:37 +02:00
Yi-Hung Wei
976afca1ce netfilter: nf_conncount: Early exit in nf_conncount_lookup() and cleanup
This patch is originally from Florian Westphal.

This patch does the following three tasks.

It applies the same early exit technique for nf_conncount_lookup().

Since now we keep the number of connections in 'struct nf_conncount_list',
we no longer need to return the count in nf_conncount_lookup().

Moreover, we expose the garbage collection function nf_conncount_gc_list()
for nft_connlimit.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18 11:26:34 +02:00
Yi-Hung Wei
cb2b36f5a9 netfilter: nf_conncount: Switch to plain list
Original patch is from Florian Westphal.

This patch switches from hlist to plain list to store the list of
connections with the same filtering key in nf_conncount. With the
plain list, we can insert new connections at the tail, so over time
the beginning of list holds long-running connections and those are
expired, while the newly creates ones are at the end.

Later on, we could probably move checked ones to the end of the list,
so the next run has higher chance to reclaim stale entries in the front.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18 11:26:32 +02:00
Taehee Yoo
26b2f55252 netfilter: nf_tables: fix jumpstack depth validation
The level of struct nft_ctx is updated by nf_tables_check_loops().  That
is used to validate jumpstack depth. But jumpstack validation routine
doesn't update and validate recursively.  So, in some cases, chain depth
can be bigger than the NFT_JUMP_STACK_SIZE.

After this patch, The jumpstack validation routine is located in the
nft_chain_validate(). When new rules or new set elements are added, the
nft_table_validate() is called by the nf_tables_newrule and the
nf_tables_newsetelem. The nft_table_validate() calls the
nft_chain_validate() that visit all their children chains recursively.
So it can update depth of chain certainly.

Reproducer:
   %cat ./test.sh
   #!/bin/bash
   nft add table ip filter
   nft add chain ip filter input { type filter hook input priority 0\; }
   for ((i=0;i<20;i++)); do
	nft add chain ip filter a$i
   done

   nft add rule ip filter input jump a1

   for ((i=0;i<10;i++)); do
	nft add rule ip filter a$i jump a$((i+1))
   done

   for ((i=11;i<19;i++)); do
	nft add rule ip filter a$i jump a$((i+1))
   done

   nft add rule ip filter a10 jump a11

Result:
[  253.931782] WARNING: CPU: 1 PID: 0 at net/netfilter/nf_tables_core.c:186 nft_do_chain+0xacc/0xdf0 [nf_tables]
[  253.931915] Modules linked in: nf_tables nfnetlink ip_tables x_tables
[  253.932153] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-rc3+ #48
[  253.932153] RIP: 0010:nft_do_chain+0xacc/0xdf0 [nf_tables]
[  253.932153] Code: 83 f8 fb 0f 84 c7 00 00 00 e9 d0 00 00 00 83 f8 fd 74 0e 83 f8 ff 0f 84 b4 00 00 00 e9 bd 00 00 00 83 bd 64 fd ff ff 0f 76 09 <0f> 0b 31 c0 e9 bc 02 00 00 44 8b ad 64 fd
[  253.933807] RSP: 0018:ffff88011b807570 EFLAGS: 00010212
[  253.933807] RAX: 00000000fffffffd RBX: ffff88011b807660 RCX: 0000000000000000
[  253.933807] RDX: 0000000000000010 RSI: ffff880112b39d78 RDI: ffff88011b807670
[  253.933807] RBP: ffff88011b807850 R08: ffffed0023700ece R09: ffffed0023700ecd
[  253.933807] R10: ffff88011b80766f R11: ffffed0023700ece R12: ffff88011b807898
[  253.933807] R13: ffff880112b39d80 R14: ffff880112b39d60 R15: dffffc0000000000
[  253.933807] FS:  0000000000000000(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
[  253.933807] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  253.933807] CR2: 00000000014f1008 CR3: 000000006b216000 CR4: 00000000001006e0
[  253.933807] Call Trace:
[  253.933807]  <IRQ>
[  253.933807]  ? sched_clock_cpu+0x132/0x170
[  253.933807]  ? __nft_trace_packet+0x180/0x180 [nf_tables]
[  253.933807]  ? sched_clock_cpu+0x132/0x170
[  253.933807]  ? debug_show_all_locks+0x290/0x290
[  253.933807]  ? __lock_acquire+0x4835/0x4af0
[  253.933807]  ? inet_ehash_locks_alloc+0x1a0/0x1a0
[  253.933807]  ? unwind_next_frame+0x159e/0x1840
[  253.933807]  ? __read_once_size_nocheck.constprop.4+0x5/0x10
[  253.933807]  ? nft_do_chain_ipv4+0x197/0x1e0 [nf_tables]
[  253.933807]  ? nft_do_chain+0x5/0xdf0 [nf_tables]
[  253.933807]  nft_do_chain_ipv4+0x197/0x1e0 [nf_tables]
[  253.933807]  ? nft_do_chain_arp+0xb0/0xb0 [nf_tables]
[  253.933807]  ? __lock_is_held+0x9d/0x130
[  253.933807]  nf_hook_slow+0xc4/0x150
[  253.933807]  ip_local_deliver+0x28b/0x380
[  253.933807]  ? ip_call_ra_chain+0x3e0/0x3e0
[  253.933807]  ? ip_rcv_finish+0x1610/0x1610
[  253.933807]  ip_rcv+0xbcc/0xcc0
[  253.933807]  ? debug_show_all_locks+0x290/0x290
[  253.933807]  ? ip_local_deliver+0x380/0x380
[  253.933807]  ? __lock_is_held+0x9d/0x130
[  253.933807]  ? ip_local_deliver+0x380/0x380
[  253.933807]  __netif_receive_skb_core+0x1c9c/0x2240

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-17 20:48:24 +02:00
Florian Westphal
a0ae2562c6 netfilter: conntrack: remove l3proto abstraction
This unifies ipv4 and ipv6 protocol trackers and removes the l3proto
abstraction.

This gets rid of all l3proto indirect calls and the need to do
a lookup on the function to call for l3 demux.

It increases module size by only a small amount (12kbyte), so this reduces
size because nf_conntrack.ko is useless without either nf_conntrack_ipv4
or nf_conntrack_ipv6 module.

before:
   text    data     bss     dec     hex filename
   7357    1088       0    8445    20fd nf_conntrack_ipv4.ko
   7405    1084       4    8493    212d nf_conntrack_ipv6.ko
  72614   13689     236   86539   1520b nf_conntrack.ko
 19K nf_conntrack_ipv4.ko
 19K nf_conntrack_ipv6.ko
179K nf_conntrack.ko

after:
   text    data     bss     dec     hex filename
  79277   13937     236   93450   16d0a nf_conntrack.ko
  191K nf_conntrack.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-17 15:27:49 +02:00
Hangbin Liu
c7ea20c9da ipv6/mcast: init as INCLUDE when join SSM INCLUDE group
This an IPv6 version patch of "ipv4/igmp: init group mode as INCLUDE when
join source group". From RFC3810, part 6.1:

   If no per-interface state existed for that
   multicast address before the change (i.e., the change consisted of
   creating a new per-interface record), or if no state exists after the
   change (i.e., the change consisted of deleting a per-interface
   record), then the "non-existent" state is considered to have an
   INCLUDE filter mode and an empty source list.

Which means a new multicast group should start with state IN(). Currently,
for MLDv2 SSM JOIN_SOURCE_GROUP mode, we first call ipv6_sock_mc_join(),
then ip6_mc_source(), which will trigger a TO_IN() message instead of
ALLOW().

The issue was exposed by commit a052517a8f ("net/multicast: should not
send source list records when have filter mode change"). Before this change,
we sent both ALLOW(A) and TO_IN(A). Now, we only send TO_IN(A).

Fix it by adding a new parameter to init group mode. Also add some wrapper
functions to avoid changing too much code.

v1 -> v2:
In the first version I only cleared the group change record. But this is not
enough. Because when a new group join, it will init as EXCLUDE and trigger
a filter mode change in ip/ip6_mc_add_src(), which will clear all source
addresses sf_crcount. This will prevent early joined address sending state
change records if multi source addressed joined at the same time.

In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
for IPv4 and IPv6.

There is also a difference between v4 and v6 version. For IPv6, when the
interface goes down and up, we will send correct state change record with
unspecified IPv6 address (::) with function ipv6_mc_up(). But after DAD is
completed, we resend the change record TO_IN() in mld_send_initial_cr().
Fix it by sending ALLOW() for INCLUDE mode in mld_send_initial_cr().

Fixes: a052517a8f ("net/multicast: should not send source list records when have filter mode change")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 11:20:06 -07:00
Florian Westphal
c779e84960 netfilter: conntrack: remove get_timeout() indirection
Not needed, we can have the l4trackers fetch it themselvs.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:55:01 +02:00
Florian Westphal
8b3892ea87 netfilter: conntrack: avoid calls to l4proto invert_tuple
Handle the common cases (tcp, udp, etc). in the core and only
do the indirect call for the protocols that need it (GRE for instance).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:55:00 +02:00
Florian Westphal
6816d931ca netfilter: conntrack: remove get_l4proto indirection from l3 protocol trackers
Handle it in the core instead.

ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this
doesn't create an ipv6 dependency.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:54:59 +02:00
Florian Westphal
d1b6fe9494 netfilter: conntrack: remove invert_tuple indirection from l3 protocol trackers
Its simpler to just handle it directly in nf_ct_invert_tuple().
Also gets rid of need to pass l3proto pointer to resolve_conntrack().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:54:59 +02:00
Florian Westphal
47a91b14de netfilter: conntrack: remove pkt_to_tuple indirection from l3 protocol trackers
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:54:58 +02:00
Florian Westphal
f957be9d34 netfilter: conntrack: remove ctnetlink callbacks from l3 protocol trackers
handle everything from ctnetlink directly.

After all these years we still only support ipv4 and ipv6, so it
seems reasonable to remove l3 protocol tracker support and instead
handle ipv4/ipv6 from a common, always builtin inet tracker.

Step 1: Get rid of all the l3proto->func() calls.

Start with ctnetlink, then move on to packet-path ones.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:54:58 +02:00
Florian Westphal
60e3be94e6 openvswitch: use nf_ct_get_tuplepr, invert_tuplepr
These versions deal with the l3proto/l4proto details internally.
It removes only caller of nf_ct_get_tuple, so make it static.

After this, l3proto->get_l4proto() can be removed in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:51:48 +02:00
Máté Eckl
f286586df6 netfilter: nft_tproxy: Move nf_tproxy_assign_sock() to nf_tproxy.h
This function is also necessary to implement nft tproxy support

Fixes: 45ca4e0cf2 ("netfilter: Libify xt_TPROXY")
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16 17:51:48 +02:00
Boris Pismenny
4799ac81e5 tls: Add rx inline crypto offload
This patch completes the generic infrastructure to offload TLS crypto to a
network device. It enables the kernel to skip decryption and
authentication of some skbs marked as decrypted by the NIC. In the fast
path, all packets received are decrypted by the NIC and the performance
is comparable to plain TCP.

This infrastructure doesn't require a TCP offload engine. Instead, the
NIC only decrypts packets that contain the expected TCP sequence number.
Out-Of-Order TCP packets are provided unmodified. As a result, at the
worst case a received TLS record consists of both plaintext and ciphertext
packets. These partially decrypted records must be reencrypted,
only to be decrypted.

The notable differences between SW KTLS Rx and this offload are as
follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW after HW lost track of TLS record framing in
the TCP stream.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:11 -07:00
Boris Pismenny
39f56e1a78 tls: Split tls_sw_release_resources_rx
This patch splits tls_sw_release_resources_rx into two functions one
which releases all inner software tls structures and another that also
frees the containing structure.

In TLS_DEVICE we will need to release the software structures without
freeeing the containing structure, which contains other information.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:11 -07:00
Boris Pismenny
dafb67f3bb tls: Split decrypt_skb to two functions
Previously, decrypt_skb also updated the TLS context.
Now, decrypt_skb only decrypts the payload using the current context,
while decrypt_skb_update also updates the state.

Later, in the tls_device Rx flow, we will use decrypt_skb directly.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:10 -07:00
Boris Pismenny
d80a1b9d18 tls: Refactor tls_offload variable names
For symmetry, we rename tls_offload_context to
tls_offload_context_tx before we add tls_offload_context_rx.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:12:09 -07:00
David S. Miller
2aa4a3378a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-07-15

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various different arm32 JIT improvements in order to optimize code emission
   and make the JIT code itself more robust, from Russell.

2) Support simultaneous driver and offloaded XDP in order to allow for advanced
   use-cases where some work is offloaded to the NIC and some to the host. Also
   add ability for bpftool to load programs and maps beyond just the cgroup case,
   from Jakub.

3) Add BPF JIT support in nfp for multiplication as well as division. For the
   latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.

4) Add BTF pretty print functionality to bpftool in plain and JSON output
   format, from Okash.

5) Add build and installation to the BPF helper man page into bpftool, from Quentin.

6) Add a TCP BPF callback for listening sockets which is triggered right after
   the socket transitions to TCP_LISTEN state, from Andrey.

7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
   tree and prints all attached programs, from Roman.

8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
   packets, from Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14 18:47:44 -07:00
Yuchung Cheng
a69258f7aa tcp: remove DELAYED ACK events in DCTCP
After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer needed

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13 18:30:19 -07:00
Vlad Buslov
01683a1469 net: sched: refactor flower walk to iterate over idr
Extend struct tcf_walker with additional 'cookie' field. It is intended to
be used by classifier walk implementations to continue iteration directly
from particular filter, instead of iterating 'skip' number of times.

Change flower walk implementation to save filter handle in 'cookie'. Each
time flower walk is called, it looks up filter with saved handle directly
with idr, instead of iterating over filter linked list 'skip' number of
times. This change improves complexity of dumping flower classifier from
quadratic to linearithmic. (assuming idr lookup has logarithmic complexity)

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reported-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13 18:24:27 -07:00
Jakub Kicinski
05296620f6 xdp: factor out common program/flags handling from drivers
Basic operations drivers perform during xdp setup and query can
be moved to helpers in the core.  Encapsulate program and flags
into a structure and add helpers.  Note that the structure is
intended as the "main" program information source in the driver.
Most drivers will additionally place the program pointer in their
fast path or ring structures.

The helpers don't have a huge impact now, but they will
decrease the code duplication when programs can be installed
in HW and driver at the same time.  Encapsulating the basic
operations in helpers will hopefully also reduce the number
of changes to drivers which adopt them.

Helpers could really be static inline, but they depend on
definition of struct netdev_bpf which means they'd have
to be placed in netdevice.h, an already 4500 line header.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13 20:26:35 +02:00
Romuald CARI
811e299f46 ieee802154: add rx LQI from userspace
The Link Quality Indication data exposed by drivers could not be accessed from
userspace. Since this data is per-datagram received, it makes sense to make it
available to userspace application through the ancillary data mechanism in
recvmsg rather than through ioctls. This can be activated using the socket
option WPAN_WANTLQI under SOL_IEEE802154 protocol.

This LQI data is available in the ancillary data buffer under the SOL_IEEE802154
level as the type WPAN_LQI. The value is an unsigned byte indicating the link
quality with values ranging 0-255.

Signed-off-by: Romuald Cari <romuald.cari@devialet.com>
Signed-off-by: Clément Peron <clement.peron@devialet.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2018-07-13 12:18:18 -04:00
Alex Vesker
f6a69885f2 devlink: Add generic parameters region_snapshot
region_snapshot - When set enables capturing region snapshots

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 17:37:13 -07:00
Alex Vesker
d7e5272282 devlink: Add support for creating region snapshots
Each device address region can store multiple snapshots,
each snapshot is identified using a different numerical ID.
This ID is used when deleting a snapshot or showing an address
region specific snapshot. This patch exposes a callback to add
a new snapshot to an address region.
The snapshot will be deleted using the destructor function
when destroying a region or when a snapshot delete command
from devlink user tool.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 17:37:13 -07:00
Alex Vesker
ccadfa444b devlink: Add callback to query for snapshot id before snapshot create
To restrict the driver with the snapshot ID selection a new callback
is introduced for the driver to get the snapshot ID before creating
a new snapshot. This will also allow giving the same ID for multiple
snapshots taken of different regions on the same time.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 17:37:12 -07:00
Alex Vesker
b16ebe925a devlink: Add support for creating and destroying regions
This allows a device to register its supported address regions.
Each address region can be accessed directly for example reading
the snapshots taken of this address space.
Drivers are not limited in the name selection for different regions.
An example of a region-name can be: pci cr-space, register-space.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 17:37:12 -07:00
Davide Caratti
c749cdda90 net/sched: act_skbedit: don't use spinlock in the data path
use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on
act_skbedit configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 14:54:12 -07:00
Arnd Bergmann
cca9bab1b7 tcp: use monotonic timestamps for PAWS
Using get_seconds() for timestamps is deprecated since it can lead
to overflows on 32-bit systems. While the interface generally doesn't
overflow until year 2106, the specific implementation of the TCP PAWS
algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
overflow.

A related problem is that the local timestamps in CLOCK_REALTIME form
lead to unexpected behavior when settimeofday is called to set the system
clock backwards or forwards by more than 24 days.

While the first problem could be solved by using an overflow-safe method
of comparing the timestamps, a nicer solution is to use a monotonic
clocksource with ktime_get_seconds() that simply doesn't overflow (at
least not until 136 years after boot) and that doesn't change during
settimeofday().

To make 32-bit and 64-bit architectures behave the same way here, and
also save a few bytes in the tcp_options_received structure, I'm changing
the type to a 32-bit integer, which is now safe on all architectures.

Finally, the ts_recent_stamp field also (confusingly) gets used to store
a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
This is currently safe, but changing the type to 32-bit requires
some small changes there to keep it working.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 14:50:40 -07:00
Petr Machata
eeed992b77 net: Add lag.h, net_lag_port_dev_txable()
LAG devices (team or bond) recognize for each one of their slave devices
whether LAG traffic is going to be sent through that device. Bond calls
such devices "active", team calls them "txable". When this state
changes, a NETDEV_CHANGELOWERSTATE notification is distributed, together
with a netdev_notifier_changelowerstate_info structure that for LAG
devices includes a tx_enabled flag that refers to the new state. The
notification thus makes it possible to react to the changes in txability
in drivers.

However there's no way to query txability from the outside on demand.
That is problematic namely for mlxsw, which when resolving ERSPAN packet
path, may encounter a LAG device, and needs to determine which of the
slaves it should choose.

To that end, introduce a new function, net_lag_port_dev_txable(), which
determines whether a given slave device is "active" or
"txable" (depending on the flavor of the LAG device). That function then
dispatches to per-LAG-flavor helpers, bond_is_active_slave_dev() resp.
team_port_dev_txable().

Because there currently is no good place where net_lag_port_dev_txable()
should be added, introduce a new header file, lag.h, which should from
now on hold any logic common to both team and bond. (But keep
netif_is_lag_master() together with the rest of netif_is_*_master()
functions).

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11 23:10:19 -07:00
Deepti Raghavan
4929c9428a tcp: expose both send and receive intervals for rate sample
Congestion control algorithms, which access the rate sample
through the tcp_cong_control function, only have access to the maximum
of the send and receive interval, for cases where the acknowledgment
rate may be inaccurate due to ACK compression or decimation. Algorithms
may want to use send rates and receive rates as separate signals.

Signed-off-by: Deepti Raghavan <deeptir@mit.edu>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11 23:01:56 -07:00
Arnd Bergmann
03dc7a35fc ipv6: xfrm: use 64-bit timestamps
get_seconds() is deprecated because it can overflow on 32-bit
architectures.  For the xfrm_state->lastused member, we treat the data
as a 64-bit number already, so we just need to use the right accessor
that works on both 32-bit and 64-bit machines.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-07-11 15:26:35 +02:00
David S. Miller
26420d9ce0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree:

1) Missing module autoloadfor icmp and icmpv6 x_tables matches,
   from Florian Westphal.

2) Possible non-linear access to TCP header from tproxy, from
   Mate Eckl.

3) Do not allow rbtree to be used for single elements, this patch
   moves all set backend into one single module since such thing
   can only happen if hashtable module is explicitly blacklisted,
   which should not ever be done.

4) Reject error and standard targets from nft_compat for sanity
   reasons, they are never used from there.

5) Don't crash on double hashsize module parameter, from Andrey
   Ryabinin.

6) Drop dst on skb before placing it in the fragmentation
   reassembly queue, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09 14:23:13 -07:00
David S. Miller
7f93d12951 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2018-07-07

The following pull-request contains BPF updates for your *net* tree.

Plenty of fixes for different components:

1) A set of critical fixes for sockmap and sockhash, from John Fastabend.

2) fixes for several race conditions in af_xdp, from Magnus Karlsson.

3) hash map refcnt fix, from Mauricio Vasquez.

4) samples/bpf fixes, from Taeung Song.

5) ifup+mtu check for xdp_redirect, from Toshiaki Makita.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 13:06:55 +09:00
Vlad Buslov
90b73b77d0 net: sched: change action API to use array of pointers to actions
Act API used linked list to pass set of actions to functions. It is
intrusive data structure that stores list nodes inside action structure
itself, which means it is not safe to modify such list concurrently.
However, action API doesn't use any linked list specific operations on this
set of actions, so it can be safely refactored into plain pointer array.

Refactor action API to use array of pointers to tc_actions instead of
linked list. Change argument 'actions' type of exported action init,
destroy and dump functions.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov
0190c1d452 net: sched: atomically check-allocate action
Implement function that atomically checks if action exists and either takes
reference to it, or allocates idr slot for action index to prevent
concurrent allocations of actions with same index. Use EBUSY error pointer
to indicate that idr slot is reserved.

Implement cleanup helper function that removes temporary error pointer from
idr. (in case of error between idr allocation and insertion of newly
created action to specified index)

Refactor all action init functions to insert new action to idr using this
API.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov
b409074e66 net: sched: add 'delete' function to action ops
Extend action ops with 'delete' function. Each action type to implements
its own delete function that doesn't depend on rtnl lock.

Implement delete function that is required to delete actions without
holding rtnl lock. Use action API function that atomically deletes action
only if it is still in action idr. This implementation prevents concurrent
threads from deleting same action twice.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:29 +09:00
Vlad Buslov
2a2ea34970 net: sched: implement action API that deletes action by index
Implement new action API function that atomically finds and deletes action
from idr by index. Intended to be used by lockless actions that do not rely
on rtnl lock.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Vlad Buslov
789871bb2a net: sched: implement unlocked action init API
Add additional 'rtnl_held' argument to act API init functions. It is
required to implement actions that need to release rtnl lock before loading
kernel module and reacquire if afterwards.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Vlad Buslov
036bb44327 net: sched: change type of reference and bind counters
Change type of action reference counter to refcount_t.

Change type of action bind counter to atomic_t.
This type is used to allow decrementing bind counter without testing
for 0 result.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
Vlad Buslov
eec94fdb04 net: sched: use rcu for action cookie update
Implement functions to atomically update and free action cookie
using rcu mechanism.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08 12:42:28 +09:00
John Fastabend
0ea488ff8d bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb
In commit

  'bpf: bpf_compute_data uses incorrect cb structure' (8108a77515)

we added the routine bpf_compute_data_end_sk_skb() to compute the
correct data_end values, but this has since been lost. In kernel
v4.14 this was correct and the above patch was applied in it
entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
we lost the piece that renamed bpf_compute_data_pointers to the
new function bpf_compute_data_end_sk_skb. This was done here,

e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

When it conflicted with the following rename patch,

6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")

Finally, after a refactor I thought even the function
bpf_compute_data_end_sk_skb() was no longer needed and it was
erroneously removed.

However, we never reverted the sk_skb_convert_ctx_access() usage of
tcp_skb_cb which had been committed and survived the merge conflict.
Here we fix this by adding back the helper and *_data_end_sk_skb()
usage. Using the bpf_skc_data_end mapping is not correct because it
expects a qdisc_skb_cb object but at the sock layer this is not the
case. Even though it happens to work here because we don't overwrite
any data in-use at the socket layer and the cb structure is cleared
later this has potential to create some subtle issues. But, even
more concretely the filter.c access check uses tcp_skb_cb.

And by some act of chance though,

struct bpf_skb_data_end {
        struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */

        /* XXX 4 bytes hole, try to pack */

        void *                     data_meta;            /*    32     8 */
        void *                     data_end;             /*    40     8 */

        /* size: 48, cachelines: 1, members: 3 */
        /* sum members: 44, holes: 1, sum holes: 4 */
        /* last cacheline: 48 bytes */
};

and then tcp_skb_cb,

struct tcp_skb_cb {
	[...]
                struct {
                        __u32      flags;                /*    24     4 */
                        struct sock * sk_redir;          /*    32     8 */
                        void *     data_end;             /*    40     8 */
                } bpf;                                   /*          24 */
        };

So when we use offset_of() to track down the byte offset we get 40 in
either case and everything continues to work. Fix this mess and use
correct structures its unclear how long this might actually work for
until someone moves the structs around.

Reported-by: Martin KaFai Lau <kafai@fb.com>
Fixes: e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Fixes: 6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-07 15:19:30 -07:00
Davide Caratti
38230a3e0e net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used
the control action in the common member of struct tcf_tunnel_key must be a
valid value, as it can contain the chain index when 'goto chain' is used.
Ensure that the control action can be read as x->tcfa_action, when x is a
pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to
prevent the following command:

 # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
 > $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1

from causing a NULL dereference when a matching packet is received:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
 PGD 80000001097ac067 P4D 80000001097ac067 PUD 103b0a067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 3491 Comm: mausezahn Tainted: G            E     4.18.0-rc2.auguri+ #421
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
 RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
 R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
 R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
 FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
 Call Trace:
  <IRQ>
  fl_classify+0x1ad/0x1c0 [cls_flower]
  ? __update_load_avg_se.isra.47+0x1ca/0x1d0
  ? __update_load_avg_se.isra.47+0x1ca/0x1d0
  ? update_load_avg+0x665/0x690
  ? update_load_avg+0x665/0x690
  ? kmem_cache_alloc+0x38/0x1c0
  tcf_classify+0x89/0x140
  __netif_receive_skb_core+0x5ea/0xb70
  ? enqueue_entity+0xd0/0x270
  ? process_backlog+0x97/0x150
  process_backlog+0x97/0x150
  net_rx_action+0x14b/0x3e0
  __do_softirq+0xde/0x2b4
  do_softirq_own_stack+0x2a/0x40
  </IRQ>
  do_softirq.part.18+0x49/0x50
  __local_bh_enable_ip+0x49/0x50
  __dev_queue_xmit+0x4ab/0x8a0
  ? wait_woken+0x80/0x80
  ? packet_sendmsg+0x38f/0x810
  ? __dev_queue_xmit+0x8a0/0x8a0
  packet_sendmsg+0x38f/0x810
  sock_sendmsg+0x36/0x40
  __sys_sendto+0x10e/0x140
  ? do_vfs_ioctl+0xa4/0x630
  ? syscall_trace_enter+0x1df/0x2e0
  ? __audit_syscall_exit+0x22a/0x290
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fd67e18dc93
 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
 RSP: 002b:00007ffe0189b748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 00000000020ca010 RCX: 00007fd67e18dc93
 RDX: 0000000000000062 RSI: 00000000020ca322 RDI: 0000000000000003
 RBP: 00007ffe0189b780 R08: 00007ffe0189b760 R09: 0000000000000014
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
 R13: 00000000020ca322 R14: 00007ffe0189b760 R15: 0000000000000003
 Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me
  mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca
 CR2: 0000000000000000
 ---[ end trace 1ab8b5b5d4639dfc ]---
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
 RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
 R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
 R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
 FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: 0x11400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: d0f6dd8a91 ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 22:01:08 +09:00
Davide Caratti
11a245e2f7 net/sched: act_csum: fix NULL dereference when 'goto chain' is used
the control action in the common member of struct tcf_csum must be a valid
value, as it can contain the chain index when 'goto chain' is used. Ensure
that the control action can be read as x->tcfa_action, when x is a pointer
to struct tc_action and x->ops->type is TCA_ACT_CSUM, to prevent the
following command:

  # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
  > $tcflags dst_mac $h2mac action csum ip or tcp or udp or sctp goto chain 1

from triggering a NULL pointer dereference when a matching packet is
received.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
 PGD 800000010416b067 P4D 800000010416b067 PUD 1041be067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 3072 Comm: mausezahn Tainted: G            E     4.18.0-rc2.auguri+ #421
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
 RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
 RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
 R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
 R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
 FS:  00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
 Call Trace:
  <IRQ>
  fl_classify+0x1ad/0x1c0 [cls_flower]
  ? arp_rcv+0x121/0x1b0
  ? __x2apic_send_IPI_dest+0x40/0x40
  ? smp_reschedule_interrupt+0x1c/0xd0
  ? reschedule_interrupt+0xf/0x20
  ? reschedule_interrupt+0xa/0x20
  ? device_is_rmrr_locked+0xe/0x50
  ? iommu_should_identity_map+0x49/0xd0
  ? __intel_map_single+0x30/0x140
  ? e1000e_update_rdt_wa.isra.52+0x22/0xb0 [e1000e]
  ? e1000_alloc_rx_buffers+0x233/0x250 [e1000e]
  ? kmem_cache_alloc+0x38/0x1c0
  tcf_classify+0x89/0x140
  __netif_receive_skb_core+0x5ea/0xb70
  ? enqueue_task_fair+0xb6/0x7d0
  ? process_backlog+0x97/0x150
  process_backlog+0x97/0x150
  net_rx_action+0x14b/0x3e0
  __do_softirq+0xde/0x2b4
  do_softirq_own_stack+0x2a/0x40
  </IRQ>
  do_softirq.part.18+0x49/0x50
  __local_bh_enable_ip+0x49/0x50
  __dev_queue_xmit+0x4ab/0x8a0
  ? wait_woken+0x80/0x80
  ? packet_sendmsg+0x38f/0x810
  ? __dev_queue_xmit+0x8a0/0x8a0
  packet_sendmsg+0x38f/0x810
  sock_sendmsg+0x36/0x40
  __sys_sendto+0x10e/0x140
  ? do_vfs_ioctl+0xa4/0x630
  ? syscall_trace_enter+0x1df/0x2e0
  ? __audit_syscall_exit+0x22a/0x290
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f5a45cbec93
 Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
 RSP: 002b:00007ffd0ee6d748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 0000000001161010 RCX: 00007f5a45cbec93
 RDX: 0000000000000062 RSI: 0000000001161322 RDI: 0000000000000003
 RBP: 00007ffd0ee6d780 R08: 00007ffd0ee6d760 R09: 0000000000000014
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
 R13: 0000000001161322 R14: 00007ffd0ee6d760 R15: 0000000000000003
 Modules linked in: act_csum act_gact cls_flower sch_ingress vrf veth act_tunnel_key(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic hp_wmi iTCO_wdt sparse_keymap rfkill mei_wdt iTCO_vendor_support wmi_bmof gpio_ich irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_intel crypto_simd cryptd snd_hda_codec glue_helper snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm pcspkr i2c_i801 snd_timer snd sg lpc_ich soundcore wmi mei_me
  mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ahci libahci crc32c_intel i915 ixgbe serio_raw libata video dca i2c_algo_bit sfc drm_kms_helper syscopyarea mtd sysfillrect mdio sysimgblt fb_sys_fops drm e1000e i2c_core
 CR2: 0000000000000000
 ---[ end trace 3c9e9d1a77df4026 ]---
 RIP: 0010:tcf_action_exec+0xb8/0x100
 Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
 RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
 RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
 RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
 RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
 R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
 R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
 FS:  00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: 0x26400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: 9c5f69bbd7 ("net/sched: act_csum: don't use spinlock in the fast path")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 22:01:08 +09:00
Jianbo Liu
24c590e3b0 net/flow_dissector: Add support for QinQ dissection
Dissect the QinQ packets to get both outer and inner vlan information,
then store to the extended flow keys.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 20:51:53 +09:00
Jianbo Liu
2064c3d4c0 net/flow_dissector: Save vlan ethertype from headers
Change vlan dissector key to save vlan tpid to support both 802.1Q
and 802.1AD ethertype.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 20:51:53 +09:00
Arnd Bergmann
22dd149167 devlink: fix incorrect return statement
A newly added dummy helper function tries to return a failure from a "void"
function:

In file included from include/net/dsa.h:24,
                 from arch/arm/plat-orion/common.c:21:
include/net/devlink.h: In function 'devlink_param_value_changed':
include/net/devlink.h:771:9: error: 'return' with a value, in function returning void [-Werror]
  return -EOPNOTSUPP;

This fixes it by removing the bogus statement.

Fixes: ea601e1709 ("devlink: Add devlink notifications support for params")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 20:06:42 +09:00
Willem de Bruijn
678ca42d68 ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.

The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.

There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.

Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.

After this change the IPv4 and IPv6 logic is the same.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 10:58:49 +09:00
Willem de Bruijn
5fdaa88dfe ipv6: fold sockcm_cookie into ipcm6_cookie
ipcm_cookie includes sockcm_cookie. Do the same for ipcm6_cookie.

This reduces the number of arguments that need to be passed around,
applies ipcm6_init to all cookie fields at once and reduces code
differentiation between ipv4 and ipv6.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 10:58:49 +09:00
Willem de Bruijn
657a066702 sock: sockc cookie initializer
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 10:58:49 +09:00
Willem de Bruijn
b515430ac9 ipv6: ipcm6_cookie initializer
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 10:58:49 +09:00
Willem de Bruijn
351782067b ipv4: ipcm_cookie initializers
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 10:58:49 +09:00
Jaganath Kanakkassery
4d94f95d30 Bluetooth: Use extended LE Connection if supported
This implements extended LE craete connection and enhanced
LE conn complete event if the controller supports.

For now it is as good as legacy LE connection and event as
no new features in the extended connection is handled.

< HCI Command: LE Extended Create Connection (0x08|0x0043) plen 26
        Filter policy: White list is not used (0x00)
        Own address type: Public (0x00)
        Peer address type: Random (0x01)
        Peer address: DB:7E:2E:1D:85:E8 (Static)
        Initiating PHYs: 0x01
        Entry 0: LE 1M
          Scan interval: 60.000 msec (0x0060)
          Scan window: 60.000 msec (0x0060)
          Min connection interval: 50.00 msec (0x0028)
          Max connection interval: 70.00 msec (0x0038)
          Connection latency: 0 (0x0000)
          Supervision timeout: 420 msec (0x002a)
          Min connection length: 0.000 msec (0x0000)
          Max connection length: 0.000 msec (0x0000)
> HCI Event: Command Status (0x0f) plen 4
      LE Extended Create Connection (0x08|0x0043) ncmd 2
        Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 31
      LE Enhanced Connection Complete (0x0a)
        Status: Success (0x00)
        Handle: 3585
        Role: Master (0x00)
        Peer address type: Random (0x01)
        Peer address: DB:7E:2E:1D:85:E8 (Static)
        Local resolvable private address: 00:00:00:00:00:00 (Non-Resolvable)
        Peer resolvable private address: 00:00:00:00:00:00 (Non-Resolvable)
        Connection interval: 67.50 msec (0x0036)
        Connection latency: 0 (0x0000)
        Supervision timeout: 420 msec (0x002a)
        Master clock accuracy: 0x00
@ MGMT Event: Device Connected (0x000b) plen 40
        LE Address: DB:7E:2E:1D:85:E8 (Static)
        Flags: 0x00000000
        Data length: 27
        Name (complete): Designer Mouse
        Appearance: Mouse (0x03c2)
        Flags: 0x05
          LE Limited Discoverable Mode
          BR/EDR Not Supported
        16-bit Service UUIDs (complete): 1 entry
          Human Interface Device (0x1812)

Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06 22:54:03 +02:00
Jaganath Kanakkassery
c215e9397b Bluetooth: Process extended ADV report event
This patch enables Extended ADV report event if extended scanning
is supported in the controller and process the same.

The new features are not handled and for now its as good as
legacy ADV report.

> HCI Event: LE Meta Event (0x3e) plen 53
      LE Extended Advertising Report (0x0d)
        Num reports: 1
        Entry 0
          Event type: 0x0013
            Props: 0x0013
              Connectable
              Scannable
              Use legacy advertising PDUs
            Data status: Complete
          Legacy PDU Type: ADV_IND (0x0013)
          Address type: Random (0x01)
          Address: DB:7E:2E:1A:85:E8 (Static)
          Primary PHY: LE 1M
          Secondary PHY: LE 1M
          SID: 0x00
          TX power: 0 dBm
          RSSI: -90 dBm (0xa6)
          Periodic advertising invteral: 0.00 msec (0x0000)
          Direct address type: Public (0x00)
          Direct address: 00:00:00:00:00:00 (OUI 00-00-00)
          Data length: 0x1b
        0f 09 44 65 73 69 67 6e 65 72 20 4d 6f 75 73 65  ..Designer Mouse
        03 19 c2 03 02 01 05 03 03 12 18                 ...........

Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06 22:43:34 +02:00
Jaganath Kanakkassery
a2344b9e3a Bluetooth: Use extended scanning if controller supports
This implements Set extended scan param and set extended scan enable
commands and use it for start LE scan based on controller support.

The new features added in these commands are setting of new PHY for
scanning and setting of scan duration. Both features are disabled
for now, meaning only 1M PHY is set and scan duration is set to 0
which means that scanning will be done untill scan disable is called.

< HCI Command: LE Set Extended Scan Parameters (0x08|0x0041) plen 8
        Own address type: Random (0x01)
        Filter policy: Accept all advertisement (0x00)
        PHYs: 0x01
        Entry 0: LE 1M
          Type: Active (0x01)
          Interval: 11.250 msec (0x0012)
          Window: 11.250 msec (0x0012)
> HCI Event: Command Complete (0x0e) plen 4
      LE Set Extended Scan Parameters (0x08|0x0041) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Extended Scan Enable (0x08|0x0042) plen 6
        Extended scan: Enabled (0x01)
        Filter duplicates: Enabled (0x01)
        Duration: 0 msec (0x0000)
        Period: 0.00 sec (0x0000)
> HCI Event: Command Complete (0x0e) plen 4
      LE Set Extended Scan Enable (0x08|0x0042) ncmd 2
        Status: Success (0x00)

Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06 22:41:17 +02:00
Pablo Neira Ayuso
e240cd0df4 netfilter: nf_tables: place all set backends in one single module
This patch disallows rbtree with single elements, which is causing
problems with the recent timeout support. Before this patch, you
could opt out individual set representations per module, which is
just adding extra complexity.

Fixes: 8d8540c4f5e0("netfilter: nft_set_rbtree: add timeout support")
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-06 19:31:53 +02:00
Denis Kenzior
a948f71384 nl80211/mac80211: allow non-linear skb in rx_control_port
The current implementation of cfg80211_rx_control_port assumed that the
caller could provide a contiguous region of memory for the control port
frame to be sent up to userspace.  Unfortunately, many drivers produce
non-linear skbs, especially for data frames.  This resulted in userspace
getting notified of control port frames with correct metadata (from
address, port, etc) yet garbage / nonsense contents, resulting in bad
handshakes, disconnections, etc.

mac80211 linearizes skbs containing management frames.  But it didn't
seem worthwhile to do this for control port frames.  Thus the signature
of cfg80211_rx_control_port was changed to take the skb directly.
nl80211 then takes care of obtaining control port frame data directly
from the (linear | non-linear) skb.

The caller is still responsible for freeing the skb,
cfg80211_rx_control_port does not take ownership of it.

Fixes: 6a671a50f8 ("nl80211: Add CMD_CONTROL_PORT_FRAME API")
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
[fix some kernel-doc formatting, add fixes tag]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-07-06 14:34:42 +02:00
Máté Eckl
5711b4e893 netfilter: nf_tproxy: fix possible non-linear access to transport header
This patch fixes a silent out-of-bound read possibility that was present
because of the misuse of this function.

Mostly it was called with a struct udphdr *hp which had only the udphdr
part linearized by the skb_header_pointer, however
nf_tproxy_get_sock_v{4,6} uses it as a tcphdr pointer, so some reads for
tcp specific attributes may be invalid.

Fixes: a583636a83 ("inet: refactor inet[6]_lookup functions to take skb")
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-06 14:32:44 +02:00
Ankit Navik
545f2596b9 Bluetooth: Add HCI command for clear Resolv list
Check for Resolv list supported by controller. So check the supported
commmand first before issuing this command i.e.,HCI_OP_LE_CLEAR_RESOLV_LIST

Before patch:
< HCI Command: LE Read White List... (0x08|0x000f) plen 0  #55 [hci0] 13.338168
> HCI Event: Command Complete (0x0e) plen 5                #56 [hci0] 13.338842
      LE Read White List Size (0x08|0x000f) ncmd 1
        Status: Success (0x00)
        Size: 25
< HCI Command: LE Clear White List (0x08|0x0010) plen 0    #57 [hci0] 13.339029
> HCI Event: Command Complete (0x0e) plen 4                #58 [hci0] 13.339939
      LE Clear White List (0x08|0x0010) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Read Resolving L.. (0x08|0x002a) plen 0  #59 [hci0] 13.340152
> HCI Event: Command Complete (0x0e) plen 5                #60 [hci0] 13.340952
      LE Read Resolving List Size (0x08|0x002a) ncmd 1
        Status: Success (0x00)
        Size: 25
< HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0  #61 [hci0] 13.341180
> HCI Event: Command Complete (0x0e) plen 12               #62 [hci0] 13.341898
      LE Read Maximum Data Length (0x08|0x002f) ncmd 1
        Status: Success (0x00)
        Max TX octets: 251
        Max TX time: 17040
        Max RX octets: 251
        Max RX time: 17040

After patch:
< HCI Command: LE Read White List... (0x08|0x000f) plen 0  #55 [hci0] 28.919131
> HCI Event: Command Complete (0x0e) plen 5                #56 [hci0] 28.920016
      LE Read White List Size (0x08|0x000f) ncmd 1
        Status: Success (0x00)
        Size: 25
< HCI Command: LE Clear White List (0x08|0x0010) plen 0    #57 [hci0] 28.920164
> HCI Event: Command Complete (0x0e) plen 4                #58 [hci0] 28.920873
      LE Clear White List (0x08|0x0010) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Read Resolving L.. (0x08|0x002a) plen 0  #59 [hci0] 28.921109
> HCI Event: Command Complete (0x0e) plen 5                #60 [hci0] 28.922016
      LE Read Resolving List Size (0x08|0x002a) ncmd 1
        Status: Success (0x00)
        Size: 25
< HCI Command: LE Clear Resolving... (0x08|0x0029) plen 0  #61 [hci0] 28.922166
> HCI Event: Command Complete (0x0e) plen 4                #62 [hci0] 28.922872
      LE Clear Resolving List (0x08|0x0029) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0  #63 [hci0] 28.923117
> HCI Event: Command Complete (0x0e) plen 12               #64 [hci0] 28.924030
      LE Read Maximum Data Length (0x08|0x002f) ncmd 1
        Status: Success (0x00)
        Max TX octets: 251
        Max TX time: 17040
        Max RX octets: 251
        Max RX time: 17040

Signed-off-by: Ankit Navik <ankit.p.navik@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06 12:40:08 +02:00
Ankit Navik
cfdb0c2d09 Bluetooth: Store Resolv list size
When the controller supports the Read LE Resolv List size feature, the
maximum list size are read and now stored.

Before patch:
< HCI Command: LE Read White List... (0x08|0x000f) plen 0  #55 [hci0] 17.979791
> HCI Event: Command Complete (0x0e) plen 5                #56 [hci0] 17.980629
      LE Read White List Size (0x08|0x000f) ncmd 1
        Status: Success (0x00)
        Size: 25
< HCI Command: LE Clear White List (0x08|0x0010) plen 0    #57 [hci0] 17.980786
> HCI Event: Command Complete (0x0e) plen 4                #58 [hci0] 17.981627
      LE Clear White List (0x08|0x0010) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0  #59 [hci0] 17.981786
> HCI Event: Command Complete (0x0e) plen 12               #60 [hci0] 17.982636
      LE Read Maximum Data Length (0x08|0x002f) ncmd 1
        Status: Success (0x00)
        Max TX octets: 251
        Max TX time: 17040
        Max RX octets: 251
        Max RX time: 17040

After patch:
< HCI Command: LE Read White List... (0x08|0x000f) plen 0  #55 [hci0] 13.338168
> HCI Event: Command Complete (0x0e) plen 5                #56 [hci0] 13.338842
      LE Read White List Size (0x08|0x000f) ncmd 1
        Status: Success (0x00)
        Size: 25
< HCI Command: LE Clear White List (0x08|0x0010) plen 0    #57 [hci0] 13.339029
> HCI Event: Command Complete (0x0e) plen 4                #58 [hci0] 13.339939
      LE Clear White List (0x08|0x0010) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Read Resolving L.. (0x08|0x002a) plen 0  #59 [hci0] 13.340152
> HCI Event: Command Complete (0x0e) plen 5                #60 [hci0] 13.340952
      LE Read Resolving List Size (0x08|0x002a) ncmd 1
        Status: Success (0x00)
        Size: 25
< HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0  #61 [hci0] 13.341180
> HCI Event: Command Complete (0x0e) plen 12               #62 [hci0] 13.341898
      LE Read Maximum Data Length (0x08|0x002f) ncmd 1
        Status: Success (0x00)
        Max TX octets: 251
        Max TX time: 17040
        Max RX octets: 251
        Max RX time: 17040

Signed-off-by: Ankit Navik <ankit.p.navik@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06 12:40:08 +02:00
Edward Cree
d8269e2cbf net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()
Essentially the same as the ipv4 equivalents.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-06 11:19:07 +09:00
Paul Moore
a9ba23d48d ipv6: make ipv6_renew_options() interrupt/kernel safe
At present the ipv6_renew_options_kern() function ends up calling into
access_ok() which is problematic if done from inside an interrupt as
access_ok() calls WARN_ON_IN_IRQ() on some (all?) architectures
(x86-64 is affected).  Example warning/backtrace is shown below:

 WARNING: CPU: 1 PID: 3144 at lib/usercopy.c:11 _copy_from_user+0x85/0x90
 ...
 Call Trace:
  <IRQ>
  ipv6_renew_option+0xb2/0xf0
  ipv6_renew_options+0x26a/0x340
  ipv6_renew_options_kern+0x2c/0x40
  calipso_req_setattr+0x72/0xe0
  netlbl_req_setattr+0x126/0x1b0
  selinux_netlbl_inet_conn_request+0x80/0x100
  selinux_inet_conn_request+0x6d/0xb0
  security_inet_conn_request+0x32/0x50
  tcp_conn_request+0x35f/0xe00
  ? __lock_acquire+0x250/0x16c0
  ? selinux_socket_sock_rcv_skb+0x1ae/0x210
  ? tcp_rcv_state_process+0x289/0x106b
  tcp_rcv_state_process+0x289/0x106b
  ? tcp_v6_do_rcv+0x1a7/0x3c0
  tcp_v6_do_rcv+0x1a7/0x3c0
  tcp_v6_rcv+0xc82/0xcf0
  ip6_input_finish+0x10d/0x690
  ip6_input+0x45/0x1e0
  ? ip6_rcv_finish+0x1d0/0x1d0
  ipv6_rcv+0x32b/0x880
  ? ip6_make_skb+0x1e0/0x1e0
  __netif_receive_skb_core+0x6f2/0xdf0
  ? process_backlog+0x85/0x250
  ? process_backlog+0x85/0x250
  ? process_backlog+0xec/0x250
  process_backlog+0xec/0x250
  net_rx_action+0x153/0x480
  __do_softirq+0xd9/0x4f7
  do_softirq_own_stack+0x2a/0x40
  </IRQ>
  ...

While not present in the backtrace, ipv6_renew_option() ends up calling
access_ok() via the following chain:

  access_ok()
  _copy_from_user()
  copy_from_user()
  ipv6_renew_option()

The fix presented in this patch is to perform the userspace copy
earlier in the call chain such that it is only called when the option
data is actually coming from userspace; that place is
do_ipv6_setsockopt().  Not only does this solve the problem seen in
the backtrace above, it also allows us to simplify the code quite a
bit by removing ipv6_renew_options_kern() completely.  We also take
this opportunity to cleanup ipv6_renew_options()/ipv6_renew_option()
a small amount as well.

This patch is heavily based on a rough patch by Al Viro.  I've taken
his original patch, converted a kmemdup() call in do_ipv6_setsockopt()
to a memdup_user() call, made better use of the e_inval jump target in
the same function, and cleaned up the use ipv6_renew_option() by
ipv6_renew_options().

CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 20:15:26 +09:00
Vasundhara Volam
f567bcdae2 devlink: Add enable_sriov boolean generic parameter
enable_sriov - Enables Single-Root Input/Output Virtualization(SR-IOV)
characteristic of the device.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 19:58:35 +09:00
Moshe Shemesh
036467c399 devlink: Add generic parameters internal_err_reset and max_macs
Add 2 first generic parameters to devlink configuration parameters set:
internal_err_reset - When set enables reset device on internal errors.
max_macs - max number of MACs per ETH port.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 19:58:35 +09:00
Moshe Shemesh
ea601e1709 devlink: Add devlink notifications support for params
Add devlink_param_notify() function to support devlink param notifications.
Add notification call to devlink param set, register and unregister
functions.
Add devlink_param_value_changed() function to enable the driver notify
devlink on value change. Driver should use this function after value was
changed on any configuration mode part to driverinit.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 19:58:35 +09:00
Moshe Shemesh
ec01aeb180 devlink: Add support for get/set driverinit value
"driverinit" configuration mode value is held by devlink to enable
the driver query the value after reload. Two additional functions
added to help the driver get/set the value from/to devlink:
devlink_param_driverinit_value_set() and
devlink_param_driverinit_value_get().

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 19:58:35 +09:00
Moshe Shemesh
e3b7ca18ad devlink: Add param set command
Add param set command to set value for a parameter.
Value can be set to any of the supported configuration modes.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 19:58:35 +09:00
Moshe Shemesh
eabaef1896 devlink: Add devlink_param register and unregister
Define configuration parameters data structure.
Add functions to register and unregister the driver supported
configuration parameters table.
For each parameter registered, the driver should fill all the parameter's
fields. In case the only supported configuration mode is "driverinit"
the parameter's get()/set() functions are not required and should be set
to NULL, for any other configuration mode, these functions are required
and should be set by the driver.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 19:58:35 +09:00
Jesus Sanchez-Palencia
4b15c70753 net/sched: Make etf report drops on error_queue
Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.

Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.

Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
adding both ee_data and ee_info fields as e.g.:

    ((__u64) serr->ee_data << 32) + serr->ee_info

This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia
88cab77162 net/sched: Add HW offloading capability to ETF
Add infra so etf qdisc supports HW offload of time-based transmission.

For hw offload, the time sorted list is still used, so packets are
dequeued always in order of txtime.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
           map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
	   clockid CLOCK_REALTIME

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The hrtimer used for
packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME
as reference and packets leave the Qdisc "delta" (100000) nanoseconds
before their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as
expected.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
Vinicius Costa Gomes
860b642b9c net/sched: Allow creating a Qdisc watchdog with other clocks
This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be
passed, this allows other time references to be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
Jesus Sanchez-Palencia
bc969a9778 net: ipv4: Hook into time based transmission
Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
Richard Cochran
80b14dee2b net: Add a new socket option for a future transmit time.
This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is a 8-bytes long struct
provided by the uapi header net_tstamp.h defined as:

struct sock_txtime {
	clockid_t 	clockid;
	u32		flags;
};

Note that new fields were added to struct sock by filling a 2-bytes
hole found in the struct. For that reason, neither the struct size or
number of cachelines were altered.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
David Ahern
33bd5ac54d net/ipv6: Revert attempt to simplify route replace and append
NetworkManager likes to manage linklocal prefix routes and does so with
the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route
code and by extension enable multipath routes with device only nexthops.

Revert f34436a430 and these followup patches:
6eba08c362 ("ipv6: Only emit append events for appended routes").
ce45bded64 ("mlxsw: spectrum_router: Align with new route replace logic")
53b562df8c ("mlxsw: spectrum_router: Allow appending to dev-only routes")

Update the fib_tests cases to reflect the old behavior.

Fixes: f34436a430 ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-04 15:22:13 +09:00
Edward Cree
17266ee939 net: ipv4: listified version of ip_rcv
Also involved adding a way to run a netfilter hook over a list of packets.
 Rather than attempting to make netfilter know about lists (which would be
 a major project in itself) we just let it call the regular okfn (in this
 case ip_rcv_finish()) for any packets it steals, and have it give us back
 a list of packets it's synchronously accepted (which normally NF_HOOK
 would automatically call okfn() on, but we want to be able to potentially
 pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
 packet (see nf_hook_entry_hookfn()), but again, changing that can be left
 for future work.

There is potential for out-of-order receives if the netfilter hook ends up
 synchronously stealing packets, as they will be processed before any
 accepts earlier in the list.  However, it was already possible for an
 asynchronous accept to cause out-of-order receives, so presumably this is
 considered OK.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 14:06:20 +09:00
Xin Long
8a9c58d28d sctp: add support for dscp and flowlabel per transport
Like some other per transport params, flowlabel and dscp are added
in transport, asoc and sctp_sock. By default, transport sets its
value from asoc's, and asoc does it from sctp_sock. flowlabel
only works for ipv6 transport.

Other than that they need to be passed down in sctp_xmit, flow4/6
also needs to set them before looking up route in get_dst.

Note that it uses '& 0x100000' to check if flowlabel is set and
'& 0x1' (tos 1st bit is unused) to check if dscp is set by users,
so that they could be set to 0 by sockopt in next patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 11:36:54 +09:00
Xin Long
69b9e1e07d ipv4: add __ip_queue_xmit() that supports tos param
This patch introduces __ip_queue_xmit(), through which the callers
can pass tos param into it without having to set inet->tos. For
ipv6, ip6_xmit() already allows passing tclass parameter.

It's needed when some transport protocol doesn't use inet->tos,
like sctp's per transport dscp, which will be added in next patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 11:36:54 +09:00
Magnus Karlsson
a9744f7ca2 xsk: fix potential race in SKB TX completion code
There is a potential race in the TX completion code for the SKB
case. One process enters the sendmsg code of an AF_XDP socket in order
to send a frame. The execution eventually trickles down to the driver
that is told to send the packet. However, it decides to drop the
packet due to some error condition (e.g., rings full) and frees the
SKB. This will trigger the SKB destructor and a completion will be
sent to the AF_XDP user space through its
single-producer/single-consumer queues.

At the same time a TX interrupt has fired on another core and it
dispatches the TX completion code in the driver. It does its HW
specific things and ends up freeing the SKB associated with the
transmitted packet. This will trigger the SKB destructor and a
completion will be sent to the AF_XDP user space through its
single-producer/single-consumer queues. With a pseudo call stack, it
would look like this:

Core 1:
sendmsg() being called in the application
  netdev_start_xmit()
    Driver entered through ndo_start_xmit
      Driver decides to free the SKB for some reason (e.g., rings full)
        Destructor of SKB called
          xskq_produce_addr() is called to signal completion to user space

Core 2:
TX completion irq
  NAPI loop
    Driver irq handler for TX completions
      Frees the SKB
        Destructor of SKB called
          xskq_produce_addr() is called to signal completion to user space

We now have a violation of the single-producer/single-consumer
principle for our queues as there are two threads trying to produce at
the same time on the same queue.

Fixed by introducing a spin_lock in the destructor. In regards to the
performance, I get around 1.74 Mpps for txonly before and after the
introduction of the spinlock. There is of course some impact due to
the spin lock but it is in the less significant digits that are too
noisy for me to measure. But let us say that the version without the
spin lock got 1.745 Mpps in the best case and the version with 1.735
Mpps in the worst case, then that would mean a maximum drop in
performance of 0.5%.

Fixes: 35fcde7f8d ("xsk: support for Tx")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-02 18:37:12 -07:00
David S. Miller
5cd3da4ba2 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Simple overlapping changes in stmmac driver.

Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-03 10:29:26 +09:00
Linus Torvalds
4e33d7d479 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Verify netlink attributes properly in nf_queue, from Eric Dumazet.

 2) Need to bump memory lock rlimit for test_sockmap bpf test, from
    Yonghong Song.

 3) Fix VLAN handling in lan78xx driver, from Dave Stevenson.

 4) Fix uninitialized read in nf_log, from Jann Horn.

 5) Fix raw command length parsing in mlx5, from Alex Vesker.

 6) Cleanup loopback RDS connections upon netns deletion, from Sowmini
    Varadhan.

 7) Fix regressions in FIB rule matching during create, from Jason A.
    Donenfeld and Roopa Prabhu.

 8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren.

 9) More bpfilter build fixes/adjustments from Masahiro Yamada.

10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper
    Dangaard Brouer.

11) fib_tests.sh file permissions were broken, from Shuah Khan.

12) Make sure BH/preemption is disabled in data path of mac80211, from
    Denis Kenzior.

13) Don't ignore nla_parse_nested() return values in nl80211, from
    Johannes berg.

14) Properly account sock objects ot kmemcg, from Shakeel Butt.

15) Adjustments to setting bpf program permissions to read-only, from
    Daniel Borkmann.

16) TCP Fast Open key endianness was broken, it always took on the host
    endiannness. Whoops. Explicitly make it little endian. From Yuching
    Cheng.

17) Fix prefix route setting for link local addresses in ipv6, from
    David Ahern.

18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva.

19) Various bpf sockmap fixes, from John Fastabend.

20) Use after free for GRO with ESP, from Sabrina Dubroca.

21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from
    Eric Biggers.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
  qede: Adverstise software timestamp caps when PHC is not available.
  qed: Fix use of incorrect size in memcpy call.
  qed: Fix setting of incorrect eswitch mode.
  qed: Limit msix vectors in kdump kernel to the minimum required count.
  ipvlan: call dev_change_flags when ipvlan mode is reset
  ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
  net: fix use-after-free in GRO with ESP
  tcp: prevent bogus FRTO undos with non-SACK flows
  bpf: sockhash, add release routine
  bpf: sockhash fix omitted bucket lock in sock_close
  bpf: sockmap, fix smap_list_map_remove when psock is in many maps
  bpf: sockmap, fix crash when ipv6 sock is added
  net: fib_rules: bring back rule_exists to match rule during add
  hv_netvsc: split sub-channel setup into async and sync
  net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
  atm: zatm: Fix potential Spectre v1
  s390/qeth: consistently re-enable device features
  s390/qeth: don't clobber buffer on async TX completion
  s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
  s390/qeth: fix race when setting MAC address
  ...
2018-07-02 11:18:28 -07:00
Amritha Nambiar
fc9bab24e9 net: Enable Tx queue selection based on Rx queues
This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-02 09:06:24 +09:00
Amritha Nambiar
c6345ce7d3 net: Record receive queue number for a connection
This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-02 09:06:24 +09:00
Amritha Nambiar
755c31cd85 net: sock: Change tx_queue_mapping in sock_common to unsigned short
Change 'skc_tx_queue_mapping' field in sock_common structure from
'int' to 'unsigned short' type with ~0 indicating unset and
other positive queue values being set. This will accommodate adding
a new 'unsigned short' field in sock_common in the next patch for
rx_queue_mapping.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-02 09:06:24 +09:00
David S. Miller
8365da2c05 This round's updates:
* finally some of the promised HE code, but it turns
    out to be small - but everything kept changing, so
    one part I did in the driver was >30 patches for
    what was ultimately <200 lines of code ... similar
    here for this code.
  * improved scan privacy support - can now specify scan
    flags for randomizing the sequence number as well as
    reducing the probe request element content
  * rfkill cleanups
  * a timekeeping cleanup from Arnd
  * various other cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAls2HpsACgkQB8qZga/f
 l8RPuQ//aZbTXc/GkYh0/GAmF4ORHePOHTXTZbMEzPeHQSlUE0nTSieyVtamsyy+
 P+0Ik/lck15Oq/8qabUqDfDY37Fm/OD88jxmoVhjDdgTUcTbIm71n1yS9vDLytuL
 n0Awq2d8xuR2bRkwGgt3Bg0RsCbvqUTa/irrighPiKGqwdVGf7kqGi76hsLrMkx9
 MQsVh1tRJCEvqEfs3yojhPna4AFjl9OoKFh0JjKJmKv5MWY5x4ojYG3kvvnAq2uF
 TIqko4l+R6AR+IzgBsPfzjj8YSJT67Z9IGe8YzId3OcMubpaJqKwrIq0+sYD/9AO
 /FGlK7V/NNge4E7sRPwu+dFzf9tOQAtKE06Icxy7aFknhdv5yGnuT2XaIUt2fv6b
 1jMWMPxY8azBL3H2siDJ17ouRoIJbkw+3o41m3ZCneLebMWjIX/s2Azqiz2lUiU2
 RjZ9Zr0qXdSghK5yD6/iInUBdmNBNq5ubQ8OIAy7fL7linvBAO23iP/G4E7zBikw
 9DtHvrpRx2yA4oYTZiaP0FIEmN/nhVuY7VLdjfLlLBtU9cs9kxOydOVSVB9MeJfE
 c+HiIApuykDxUj5mrd2mo7AkINjUVXKrVZLOH8hqlNvbjJRmcfyR/TOUJzdfeLX+
 0jmji7TMZaaooUEm+KllCnIyUxSmlS25/Ekfm2gdx/rMXXzi/Oo=
 =sNaA
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2018-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Small merge conflict in net/mac80211/scan.c, I preserved
the kcalloc() conversion. -DaveM

Johannes Berg says:

====================
This round's updates:
 * finally some of the promised HE code, but it turns
   out to be small - but everything kept changing, so
   one part I did in the driver was >30 patches for
   what was ultimately <200 lines of code ... similar
   here for this code.
 * improved scan privacy support - can now specify scan
   flags for randomizing the sequence number as well as
   reducing the probe request element content
 * rfkill cleanups
 * a timekeeping cleanup from Arnd
 * various other cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 21:08:12 +09:00
Hans Wippel
1619f77058 net/smc: add pnetid support for SMC-D and ISM
SMC-D relies on PNETIDs to find usable SMC-D/ISM devices for a SMC
connection. This patch adds SMC-D/ISM support to the current PNETID
implementation.

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 20:42:25 +09:00
Hans Wippel
c6ba7c9ba4 net/smc: add base infrastructure for SMC-D and ISM
SMC supports two variants: SMC-R and SMC-D. For data transport, SMC-R
uses RDMA devices, SMC-D uses so-called Internal Shared Memory (ISM)
devices. An ISM device only allows shared memory communication between
SMC instances on the same machine. For example, this allows virtual
machines on the same host to communicate via SMC without RDMA devices.

This patch adds the base infrastructure for SMC-D and ISM devices to
the existing SMC code. It contains the following:

* ISM driver interface:
  This interface allows an ISM driver to register ISM devices in SMC. In
  the process, the driver provides a set of device ops for each device.
  SMC uses these ops to execute SMC specific operations on or transfer
  data over the device.

* Core SMC-D link group, connection, and buffer support:
  Link groups, SMC connections and SMC buffers (in smc_core) are
  extended to support SMC-D.

* SMC type checks:
  Some type checks are added to prevent using SMC-R specific code for
  SMC-D and vice versa.

To actually use SMC-D, additional changes to pnetid, CLC, CDC, etc. are
required. These are added in follow-up patches.

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 20:42:25 +09:00
Ursula Braun
0afff91c6f net/smc: add pnetid support
s390 hardware supports the definition of a so-call Physical NETwork
IDentifier (short PNETID) per network device port. These PNETIDS
can be used to identify network devices that are attached to the same
physical network (broadcast domain).

On s390 try to use the PNETID of the ethernet device port used for
initial connecting, and derive the IB device port used for SMC RDMA
traffic.

On platforms without PNETID support fall back to the existing
solution of a configured pnet table.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 20:42:25 +09:00
Pieter Jansen van Vuuren
256c87c17c net: check tunnel option type in tunnel flags
Check the tunnel option type stored in tunnel flags when creating options
for tunnels. Thereby ensuring we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.

Make sure all users of the infrastructure set correct flags, for the BPF
helper we have to set all bits to keep backward compatibility.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Xin Long
b0e9a2fe3f sctp: add support for SCTP_REUSE_PORT sockopt
This feature is actually already supported by sk->sk_reuse which can be
set by socket level opt SO_REUSEADDR. But it's not working exactly as
RFC6458 demands in section 8.1.27, like:

  - This option only supports one-to-one style SCTP sockets
  - This socket option must not be used after calling bind()
    or sctp_bindx().

Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
work in linux.

To separate it from the socket level version, this patch adds 'reuse' in
sctp_sock and it works pretty much as sk->sk_reuse, but with some extra
setup limitations that are needed when it is being enabled.

"It should be noted that the behavior of the socket-level socket option
to reuse ports and/or addresses for SCTP sockets is unspecified", so it
leaves SO_REUSEADDR as is for the compatibility.

Note that the name SCTP_REUSE_PORT is somewhat confusing, as its
functionality is nearly identical to SO_REUSEADDR, but with some
extra restrictions. Here it uses 'reuse' in sctp_sock instead of
'reuseport'. As for sk->sk_reuseport support for SCTP, it will be
added in another patch.

Thanks to Neil to make this clear.

v1->v2:
  - add sctp_sk->reuse to separate it from the socket level version.
v2->v3:
  - improve changelog according to Marcelo's suggestion.

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 22:20:55 +09:00
Linus Torvalds
a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
Flavio Leitner
f564650106 netfilter: check if the socket netns is correct.
Netfilter assumes that if the socket is present in the skb, then
it can be used because that reference is cleaned up while the skb
is crossing netns.

We want to change that to preserve the socket reference in a future
patch, so this is a preparation updating netfilter to check if the
socket netns matches before use it.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:21:32 +09:00
Roman Mashak
d020d4559d net sched actions: fix coding style in pedit headers
Fix coding style issues in tc pedit headers detected by the
checkpatch script.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
David S. Miller
0901441839 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree:

1) Missing netlink attribute validation in nf_queue, uncovered by KASAN,
   from Eric Dumazet.

2) Use pointer to sysctl table, save us 192 bytes of memory per netns.
   Also from Eric.

3) Possible use-after-free when removing conntrack helper modules due
   to missing synchronize RCU call. From Taehee Yoo.

4) Fix corner case in systcl writes to nf_log that lead to appending
   data to uninitialized buffer, from Jann Horn.

5) Jann Horn says we may indefinitely block other users of nf_log_mutex
   if a userspace access in proc_dostring() blocked e.g. due to a
   userfaultfd.

6) Fix garbage collection race for unconfirmed conntrack entries,
   from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 13:32:44 +09:00
John Hurley
951a8ee6de nfp: reject binding to shared blocks
TC shared blocks allow multiple qdiscs to be grouped together and filters
shared between them. Currently the chains of filters attached to a block
are only flushed when the block is removed. If a qdisc is removed from a
block but the block still exists, flow del messages are not passed to the
callback registered for that qdisc. For the NFP, this presents the
possibility of rules still existing in hw when they should be removed.

Prevent binding to shared blocks until the kernel can send per qdisc del
messages when block unbinds occur.

tcf_block_shared() was not used outside of the core until now, so also
add an empty implementation for builds with CONFIG_NET_CLS=n.

Fixes: 4861738775 ("net: sched: introduce shared filter blocks infrastructure")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-27 10:46:43 +09:00
John Hurley
326367427c net: sched: call reoffload op on block callback reg
Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a
block when a callback tries to register to a block that already has
offloaded rules. If all existing rules cannot be offloaded then the
registration is rejected. This replaces the previous policy of rejecting
such callback registration outright.

On unregistration of a callback, the rules are flushed for that given cb.
The implementation of block sharing in the NFP driver, for example,
duplicates shared rules to all devs bound to a block. This meant that
rules could still exist in hw even after a device is unbound from a block
(assuming the block still remains active).

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:33 +09:00
John Hurley
31533cba43 net: sched: cls_flower: implement offload tcf_proto_op
Add the reoffload tcf_proto_op in flower to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the reoffload function properly maintains this flag, keep a
reference counter for the number of instances of the filter that are in
hardware. Only update the flag when this counter changes from or to 0. Add
a generic helper function to implement this behaviour.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
John Hurley
e56185c78b net: sched: add tcf_proto_op to offload a rule
Create a new tcf_proto_op called 'reoffload' that generates a new offload
message for each node in a tcf_proto. Pointers to the tcf_proto and
whether the offload request is to add or delete the node are included.
Also included is a callback function to send the offload message to and
the option of priv data to go with the cb.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
John Hurley
60513bd82c net: sched: pass extack pointer to block binds and cb registration
Pass the extact struct from a tc qdisc add to the block bind function and,
in turn, to the setup_tc ndo of binding device via the tc_block_offload
struct. Pass this back to any block callback registrations to allow
netlink logging of fails in the bind process.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
David Miller
d4546c2509 net: Convert GRO SKB handling to list_head.
Manage pending per-NAPI GRO packets via list_head.

Return an SKB pointer from the GRO receive handlers.  When GRO receive
handlers return non-NULL, it means that this SKB needs to be completed
at this time and removed from the NAPI queue.

Several operations are greatly simplified by this transformation,
especially timing out the oldest SKB in the list when gro_count
exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
in reverse order.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 11:33:04 +09:00
Florian Westphal
e4db5b61c5 xfrm: policy: remove pcpu policy cache
Kristian Evensen says:
  In a project I am involved in, we are running ipsec (Strongswan) on
  different mt7621-based routers. Each router is configured as an
  initiator and has around ~30 tunnels to different responders (running
  on misc. devices). Before the flow cache was removed (kernel 4.9), we
  got a combined throughput of around 70Mbit/s for all tunnels on one
  router. However, we recently switched to kernel 4.14 (4.14.48), and
  the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
  drop of around 20%. Reverting the flow cache removal restores, as
  expected, performance levels to that of kernel 4.9.

When pcpu xdst exists, it has to be validated first before it can be
used.

A negative hit thus increases cost vs. no-cache.

As number of tunnels increases, hit rate decreases so this pcpu caching
isn't a viable strategy.

Furthermore, the xdst cache also needs to run with BH off, so when
removing this the bh disable/enable pairs can be removed too.

Kristian tested a 4.14.y backport of this change and reported
increased performance:

  In our tests, the throughput reduction has been reduced from around -20%
  to -5%. We also see that the overall throughput is independent of the
  number of tunnels, while before the throughput was reduced as the number
  of tunnels increased.

Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-06-25 17:46:06 +02:00
Steffen Klassert
f203b76d78 xfrm: Add virtual xfrm interfaces
This patch adds support for virtual xfrm interfaces.
Packets that are routed through such an interface
are guaranteed to be IPsec transformed or dropped.
It is a generic virtual interface that ensures IPsec
transformation, no need to know what happens behind
the interface. This means that we can tunnel IPv4 and
IPv6 through the same interface and support all xfrm
modes (tunnel, transport and beet) on it.

Co-developed-by: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Benedict Wong <benedictwong@google.com>
Tested-by: Antony Antony <antony@phenome.org>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2018-06-23 16:07:25 +02:00
Steffen Klassert
7e6526404a xfrm: Add a new lookup key to match xfrm interfaces.
This patch adds the xfrm interface id as a lookup key
for xfrm states and policies. With this we can assign
states and policies to virtual xfrm interfaces.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Benedict Wong <benedictwong@google.com>
Tested-by: Benedict Wong <benedictwong@google.com>
Tested-by: Antony Antony <antony@phenome.org>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2018-06-23 16:07:15 +02:00
Steffen Klassert
d159ce7957 flow: Extend flow informations with xfrm interface id.
Add a new flowi_xfrm structure with informations needed to do
a xfrm lookup. At the moment it keeps the informations about
the new xfrm interface id needed to lookup xfrm interfaces
that are introduced with a followup patch. We need this new
lookup key as other possible keys, like the ifindex is
already part of the xfrm selector and used as a key to
enforce the output device after the transformation in the
policy/state lookup.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Benedict Wong <benedictwong@google.com>
Tested-by: Benedict Wong <benedictwong@google.com>
Tested-by: Antony Antony <antony@phenome.org>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2018-06-23 16:07:05 +02:00
Steffen Klassert
9b42c1f179 xfrm: Extend the output_mark to support input direction and masking.
We already support setting an output mark at the xfrm_state,
unfortunately this does not support the input direction and
masking the marks that will be applied to the skb. This change
adds support applying a masked value in both directions.

The existing XFRMA_OUTPUT_MARK number is reused for this purpose
and as it is now bi-directional, it is renamed to XFRMA_SET_MARK.

An additional XFRMA_SET_MARK_MASK attribute is added for setting the
mask. If the attribute mask not provided, it is set to 0xffffffff,
keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics.

Co-developed-by: Tobias Brunner <tobias@strongswan.org>
Co-developed-by: Eyal Birger <eyal.birger@gmail.com>
Co-developed-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
2018-06-23 16:06:57 +02:00
Eric Dumazet
5424ea2739 netns: get more entropy from net_hash_mix()
struct net are effectively allocated from order-1 pages on x86,
with one object per slab, meaning that the 13 low order bits
of their addresses are zero.

Once shifted by L1_CACHE_SHIFT, this leaves 7 zero-bits,
meaning that net_hash_mix() does not help spreading
objects on various hash tables.

For example, TCP listen table has 32 buckets, meaning that
all netns use the same bucket for port 80 or port 443.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-23 10:59:56 +09:00
Eric Dumazet
cadefe5f58 tcp_bbr: fix bbr pacing rate for internal pacing
This commit makes BBR use only the MSS (without any headers) to
calculate pacing rates when internal TCP-layer pacing is used.

This is necessary to achieve the correct pacing behavior in this case,
since tcp_internal_pacing() uses only the payload length to calculate
pacing delays.

Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:59:22 +09:00
NeilBrown
0eb71a9da5 rhashtable: split rhashtable.h
Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small changes can required a large recompilation.
This makes development painful.

This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code.  rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:43:27 +09:00
Eric Dumazet
9b0a8da8c4 net/ipv6: respect rcu grace period before freeing fib6_info
syzbot reported use after free that is caused by fib6_info being
freed without a proper RCU grace period.

CPU: 0 PID: 1407 Comm: udevd Not tainted 4.17.0+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 __read_once_size include/linux/compiler.h:188 [inline]
 find_rr_leaf net/ipv6/route.c:705 [inline]
 rt6_select net/ipv6/route.c:761 [inline]
 fib6_table_lookup+0x12b7/0x14d0 net/ipv6/route.c:1823
 ip6_pol_route+0x1c2/0x1020 net/ipv6/route.c:1856
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2082
 fib6_rule_lookup+0x211/0x6d0 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2110
 ip6_route_output include/net/ip6_route.h:82 [inline]
 icmpv6_xrlim_allow net/ipv6/icmp.c:211 [inline]
 icmp6_send+0x147c/0x2da0 net/ipv6/icmp.c:535
 icmpv6_send+0x17a/0x300 net/ipv6/ip6_icmp.c:43
 ip6_link_failure+0xa5/0x790 net/ipv6/route.c:2244
 dst_link_failure include/net/dst.h:427 [inline]
 ndisc_error_report+0xd1/0x1c0 net/ipv6/ndisc.c:695
 neigh_invalidate+0x246/0x550 net/core/neighbour.c:892
 neigh_timer_handler+0xaf9/0xde0 net/core/neighbour.c:978
 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1d1/0x200 kernel/softirq.c:404
 exiting_irq arch/x86/include/asm/apic.h:527 [inline]
 smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
 </IRQ>
RIP: 0010:strlen+0x5e/0xa0 lib/string.c:482
Code: 24 00 74 3b 48 bb 00 00 00 00 00 fc ff df 4c 89 e0 48 83 c0 01 48 89 c2 48 89 c1 48 c1 ea 03 83 e1 07 0f b6 14 1a 38 ca 7f 04 <84> d2 75 23 80 38 00 75 de 48 83 c4 08 4c 29 e0 5b 41 5c 5d c3 48
RSP: 0018:ffff8801af117850 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: ffff880197f53bd0 RBX: dffffc0000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81c5b06c RDI: ffff880197f53bc0
RBP: ffff8801af117868 R08: ffff88019a976540 R09: 0000000000000000
R10: ffff88019a976540 R11: 0000000000000000 R12: ffff880197f53bc0
R13: ffff880197f53bc0 R14: ffffffff899e4e90 R15: ffff8801d91c6a00
 strlen include/linux/string.h:267 [inline]
 getname_kernel+0x24/0x370 fs/namei.c:218
 open_exec+0x17/0x70 fs/exec.c:882
 load_elf_binary+0x968/0x5610 fs/binfmt_elf.c:780
 search_binary_handler+0x17d/0x570 fs/exec.c:1653
 exec_binprm fs/exec.c:1695 [inline]
 __do_execve_file.isra.35+0x16fe/0x2710 fs/exec.c:1819
 do_execveat_common fs/exec.c:1866 [inline]
 do_execve fs/exec.c:1883 [inline]
 __do_sys_execve fs/exec.c:1964 [inline]
 __se_sys_execve fs/exec.c:1959 [inline]
 __x64_sys_execve+0x8f/0xc0 fs/exec.c:1959
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f1576a46207
Code: 77 19 f4 48 89 d7 44 89 c0 0f 05 48 3d 00 f0 ff ff 76 e0 f7 d8 64 41 89 01 eb d8 f7 d8 64 41 89 01 eb df b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 00 8c 2d 00 f7 d8 64 89 02
RSP: 002b:00007ffff2784568 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f1576a46207
RDX: 0000000001215b10 RSI: 00007ffff2784660 RDI: 00007ffff2785670
RBP: 0000000000625500 R08: 000000000000589c R09: 000000000000589c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000001215b10
R13: 0000000000000007 R14: 0000000001204250 R15: 0000000000000005

Allocated by task 12188:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
 kmalloc include/linux/slab.h:513 [inline]
 kzalloc include/linux/slab.h:706 [inline]
 fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
 ip6_route_info_create+0x782/0x2b50 net/ipv6/route.c:3013
 ip6_route_add+0x23/0xb0 net/ipv6/route.c:3154
 ipv6_route_ioctl+0x5a5/0x760 net/ipv6/route.c:3660
 inet6_ioctl+0x100/0x1f0 net/ipv6/af_inet6.c:546
 sock_do_ioctl+0xe4/0x3e0 net/socket.c:973
 sock_ioctl+0x30d/0x680 net/socket.c:1097
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x16f0 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 1402:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xd9/0x260 mm/slab.c:3813
 fib6_info_destroy+0x29b/0x350 net/ipv6/ip6_fib.c:207
 fib6_info_release include/net/ip6_fib.h:286 [inline]
 __ip6_del_rt_siblings net/ipv6/route.c:3235 [inline]
 ip6_route_del+0x11c4/0x13b0 net/ipv6/route.c:3316
 ipv6_route_ioctl+0x616/0x760 net/ipv6/route.c:3663
 inet6_ioctl+0x100/0x1f0 net/ipv6/af_inet6.c:546
 sock_do_ioctl+0xe4/0x3e0 net/socket.c:973
 sock_ioctl+0x30d/0x680 net/socket.c:1097
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x16f0 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801b5df2580
 which belongs to the cache kmalloc-256 of size 256
The buggy address is located 8 bytes inside of
 256-byte region [ffff8801b5df2580, ffff8801b5df2680)
The buggy address belongs to the page:
page:ffffea0006d77c80 count:1 mapcount:0 mapping:ffff8801da8007c0 index:0xffff8801b5df2e40
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffffea0006c5cc48 ffffea0007363308 ffff8801da8007c0
raw: ffff8801b5df2e40 ffff8801b5df2080 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801b5df2480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801b5df2500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> ffff8801b5df2580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                      ^
 ffff8801b5df2600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801b5df2680: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb

Fixes: a64efe142f ("net/ipv6: introduce fib6_info struct and helpers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reported-by: syzbot+9e6d75e3edef427ee888@syzkaller.appspotmail.com
Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-20 07:57:23 +09:00
Richard Guy Briggs
f7859590d9 audit: eliminate audit_enabled magic number comparison
Remove comparison of audit_enabled to magic numbers outside of audit.

Related: https://github.com/linux-audit/audit-kernel/issues/86

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-06-19 10:43:55 -04:00
Luca Coelho
41cbb0f5a2 mac80211: add support for HE
Add support for HE in mac80211 conforming with P802.11ax_D1.4.

Johannes: Fix another bug with the buf_size comparison in agg-rx.c.

Signed-off-by: Liad Kaufman <liad.kaufman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Ido Yariv <idox.yariv@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-18 22:40:32 +02:00
Eric Dumazet
9ce7bc036a netfilter: ipv6: nf_defrag: reduce struct net memory waste
It is a waste of memory to use a full "struct netns_sysctl_ipv6"
while only one pointer is really used, considering netns_sysctl_ipv6
keeps growing.

Also, since "struct netns_frags" has cache line alignment,
it is better to move the frags_hdr pointer outside, otherwise
we spend a full cache line for this pointer.

This saves 192 bytes of memory per netns.

Fixes: c038a767cd ("ipv6: add a new namespace for nf_conntrack_reasm")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-18 14:13:25 +02:00
Luca Coelho
95a28eeaf1 radiotap: add structs for HE
Add radiotap structures for HE.

Signed-off-by: Liad Kaufman <liad.kaufman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Ido Yariv <idox.yariv@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2018-06-15 14:04:00 +02:00
Luca Coelho
c4cbaf7973 cfg80211: Add support for HE
Add support for the HE in cfg80211 and also add userspace API to
nl80211 to send rate information out, conforming with P802.11ax_D2.0.

Signed-off-by: Liad Kaufman <liad.kaufman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Ido Yariv <idox.yariv@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2018-06-15 14:03:56 +02:00
Xin Long
9951912200 sctp: define sctp_packet_gso_append to build GSO frames
Now sctp GSO uses skb_gro_receive() to append the data into head
skb frag_list. However it actually only needs very few code from
skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
of its members are not needed here.

This patch is to add sctp_packet_gso_append() to build GSO frames
instead of skb_gro_receive(), and it would avoid many unnecessary
checks and make the code clearer.

Note that sctp will use page frags instead of frag_list to build
GSO frames in another patch. But it may take time, as sctp's GSO
frames may have different size. skb_segment() can only split it
into the frags with the same size, which would break the border
of sctp chunks.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-14 10:25:53 -07:00
Yi-Hung Wei
21ba8847f8 netfilter: nf_conncount: Fix garbage collection with zones
Currently, we use check_hlist() for garbage colleciton. However, we
use the ‘zone’ from the counted entry to query the existence of
existing entries in the hlist. This could be wrong when they are in
different zones, and this patch fixes this issue.

Fixes: e59ea3df3f ("netfilter: xt_connlimit: honor conntrack zone if available")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-12 20:07:07 +02:00
Daniel Borkmann
f6fadff33e tls: fix NULL pointer dereference on poll
While hacking on kTLS, I ran into the following panic from an
unprivileged netserver / netperf TCP session:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  PGD 800000037f378067 P4D 800000037f378067 PUD 3c0e61067 PMD 0
  Oops: 0010 [#1] SMP KASAN PTI
  CPU: 1 PID: 2289 Comm: netserver Not tainted 4.17.0+ #139
  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
  RIP: 0010:          (null)
  Code: Bad RIP value.
  RSP: 0018:ffff88036abcf740 EFLAGS: 00010246
  RAX: dffffc0000000000 RBX: ffff88036f5f6800 RCX: 1ffff1006debed26
  RDX: ffff88036abcf920 RSI: ffff8803cb1a4f00 RDI: ffff8803c258c280
  RBP: ffff8803c258c280 R08: ffff8803c258c280 R09: ffffed006f559d48
  R10: ffff88037aacea43 R11: ffffed006f559d49 R12: ffff8803c258c280
  R13: ffff8803cb1a4f20 R14: 00000000000000db R15: ffffffffc168a350
  FS:  00007f7e631f4700(0000) GS:ffff8803d1c80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffffffffd6 CR3: 00000003ccf64005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? tls_sw_poll+0xa4/0x160 [tls]
   ? sock_poll+0x20a/0x680
   ? do_select+0x77b/0x11a0
   ? poll_schedule_timeout.constprop.12+0x130/0x130
   ? pick_link+0xb00/0xb00
   ? read_word_at_a_time+0x13/0x20
   ? vfs_poll+0x270/0x270
   ? deref_stack_reg+0xad/0xe0
   ? __read_once_size_nocheck.constprop.6+0x10/0x10
  [...]

Debugging further, it turns out that calling into ctx->sk_poll() is
invalid since sk_poll itself is NULL which was saved from the original
TCP socket in order for tls_sw_poll() to invoke it.

Looks like the recent conversion from poll to poll_mask callback started
in 1525242310 ("net: add support for ->poll_mask in proto_ops") missed
to eventually convert kTLS, too: TCP's ->poll was converted over to the
->poll_mask in commit 2c7d3daceb ("net/tcp: convert to ->poll_mask")
and therefore kTLS wrongly saved the ->poll old one which is now NULL.

Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
tcp_poll_mask() as well that is mangled here.

Fixes: 2c7d3daceb ("net/tcp: convert to ->poll_mask")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Watson <davejwatson@fb.com>
Tested-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-11 16:29:54 -07:00
David S. Miller
a08ce73ba0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree:

1) Reject non-null terminated helper names from xt_CT, from Gao Feng.

2) Fix KASAN splat due to out-of-bound access from commit phase, from
   Alexey Kodanev.

3) Missing conntrack hook registration on IPVS FTP helper, from Julian
   Anastasov.

4) Incorrect skbuff allocation size in bridge nft_reject, from Taehee Yoo.

5) Fix inverted check on packet xmit to non-local addresses, also from
   Julian.

6) Fix ebtables alignment compat problems, from Alin Nastac.

7) Hook mask checks are not correct in xt_set, from Serhey Popovych.

8) Fix timeout listing of element in ipsets, from Jozsef.

9) Cap maximum timeout value in ipset, also from Jozsef.

10) Don't allow family option for hash:mac sets, from Florent Fourcot.

11) Restrict ebtables to work with NFPROTO_BRIDGE targets only, this
    Florian.

12) Another bug reported by KASAN in the rbtree set backend, from
    Taehee Yoo.

13) Missing __IPS_MAX_BIT update doesn't include IPS_OFFLOAD_BIT.
    From Gao Feng.

14) Missing initialization of match/target in ebtables, from Florian
    Westphal.

15) Remove useless nft_dup.h file in include path, from C. Labbe.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-11 14:24:32 -07:00
Paolo Abeni
6c206b2009 udp: fix rx queue len reported by diag and proc interface
After commit 6b229cf77d ("udp: add batching to udp_rmem_release()")
the sk_rmem_alloc field does not measure exactly anymore the
receive queue length, because we batch the rmem release. The issue
is really apparent only after commit 0d4a6608f6 ("udp: do rmem bulk
free even if the rx sk queue is empty"): the user space can easily
check for an empty socket with not-0 queue length reported by the 'ss'
tool or the procfs interface.

We need to use a custom UDP helper to report the correct queue length,
taking into account the forward allocation deficit.

Reported-by: trevor.francis@46labs.com
Fixes: 6b229cf77d ("UDP: add batching to udp_rmem_release()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-08 19:55:15 -04:00
Corentin Labbe
d8e87fc6d1 netfilter: remove include/net/netfilter/nft_dup.h
include/net/netfilter/nft_dup.h was introduced in d877f07112 ("netfilter: nf_tables: add nft_dup expression")
but was never user since this date.

Furthermore, the only struct in this file is unused elsewhere.

Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-08 12:42:24 +02:00
Linus Torvalds
1c8c5a9d38 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add Maglev hashing scheduler to IPVS, from Inju Song.

 2) Lots of new TC subsystem tests from Roman Mashak.

 3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

 5) Add ttl inherit support to vxlan, from Hangbin Liu.

 6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

 8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

11) Plumb extack down into fib_rules, from Roopa Prabhu.

12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

13) Add UDP GSO support, from Willem de Bruijn.

14) Add documentation for eBPF helpers, from Quentin Monnet.

15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

24) Add TCP SACK compression, from Eric Dumazet.

25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

28) Add lots of forwarding selftests, from Petr Machata.

29) Add generic network device failover driver, from Sridhar Samudrala.

* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
  strparser: Add __strp_unpause and use it in ktls.
  rxrpc: Fix terminal retransmission connection ID to include the channel
  net: hns3: Optimize PF CMDQ interrupt switching process
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Fix for VF mailbox cannot receiving PF response
  bnx2x: use the right constant
  Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
  net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
  enic: fix UDP rss bits
  netdev-FAQ: clarify DaveM's position for stable backports
  rtnetlink: validate attributes in do_setlink()
  mlxsw: Add extack messages for port_{un, }split failures
  netdevsim: Add extack error message for devlink reload
  devlink: Add extack to reload and port_{un, }split operations
  net: metrics: add proper netlink validation
  ipmr: fix error path when ipmr_new_table fails
  ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
  net: hns3: remove unused hclgevf_cfg_func_mta_filter
  netfilter: provide udp*_lib_lookup for nf_tproxy
  qed*: Utilize FW 8.37.2.0
  ...
2018-06-06 18:39:49 -07:00
Linus Torvalds
8b5c6a3a49 audit/stable-4.18 PR 20180605
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlsXFUEUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQVeRaWujKfIoomg//eRNpc6x9kxTijN670AC2uD0CBTlZ
 2z6mHuJaOhG8bTxjZxQfUBoo6/eZJ2YC1yq6ornGFNzw4sfKsR/j86ujJim2HAmo
 opUhziq3SILGEvjsxfPkREe/wb49jy0AA/WjZqciitB1ig8Hz7xzqi0lpNaEspFh
 QJFB6XXkojWGFGrRzruAVJnPS+pDWoTQR0qafs3JWKnpeinpOdZnl1hPsysAEHt5
 Ag8o4qS/P9xJM0khi7T+jWECmTyT/mtWqEtFcZ0o+JLOgt/EMvNX6DO4ETDiYRD2
 mVChga9x5r78bRgNy2U8IlEWWa76WpcQAEODvhzbijX4RxMAmjsmLE+e+udZSnMZ
 eCITl2f7ExxrL5SwNFC/5h7pAv0RJ+SOC19vcyeV4JDlQNNVjUy/aNKv5baV0aeg
 EmkeobneMWxqHx52aERz8RF1in5pT8gLOYoYnWfNpcDEmjLrwhuZLX2asIzUEqrS
 SoPJ8hxIDCxceHOWIIrz5Dqef7x28Dyi46w3QINC8bSy2RnR/H3q40DRegvXOGiS
 9WcbbwbhnM4Kau413qKicGCvdqTVYdeyZqo7fVelSciD139Vk7pZotyom4MuU25p
 fIyGfXa8/8gkl7fZ+HNkZbba0XWNfAZt//zT095qsp3CkhVnoybwe6OwG1xRqErq
 W7OOQbS7vvN/KGo=
 =10u6
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Another reasonable chunk of audit changes for v4.18, thirteen patches
  in total.

  The thirteen patches can mostly be broken down into one of four
  categories: general bug fixes, accessor functions for audit state
  stored in the task_struct, negative filter matches on executable
  names, and extending the (relatively) new seccomp logging knobs to the
  audit subsystem.

  The main driver for the accessor functions from Richard are the
  changes we're working on to associate audit events with containers,
  but I think they have some standalone value too so I figured it would
  be good to get them in now.

  The seccomp/audit patches from Tyler apply the seccomp logging
  improvements from a few releases ago to audit's seccomp logging;
  starting with this patchset the changes in
  /proc/sys/kernel/seccomp/actions_logged should apply to both the
  standard kernel logging and audit.

  As usual, everything passes the audit-testsuite and it happens to
  merge cleanly with your tree"

[ Heh, except it had trivial merge conflicts with the SELinux tree that
  also came in from Paul   - Linus ]

* tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: Fix wrong task in comparison of session ID
  audit: use existing session info function
  audit: normalize loginuid read access
  audit: use new audit_context access funciton for seccomp_actions_logged
  audit: use inline function to set audit context
  audit: use inline function to get audit context
  audit: convert sessionid unset to a macro
  seccomp: Don't special case audited processes when logging
  seccomp: Audit attempts to modify the actions_logged sysctl
  seccomp: Configurable separator for the actions_logged string
  seccomp: Separate read and write code for actions_logged sysctl
  audit: allow not equal op for audit by executable
  audit: add syscall information to FEATURE_CHANGE records
2018-06-06 16:34:00 -07:00
Doron Roberts-Kedes
7170e6045a strparser: Add __strp_unpause and use it in ktls.
strp_unpause queues strp_work in order to parse any messages that
arrived while the strparser was paused. However, the process invoking
strp_unpause could eagerly parse a buffered message itself if it held
the sock lock.

__strp_unpause is an alternative to strp_pause that avoids the scheduling
overhead that results when a receiving thread unpauses the strparser
and waits for the next message to be delivered by the workqueue thread.

This patch more than doubled the IOPS achieved in a benchmark of NBD
traffic encrypted using ktls.

Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-06 14:07:53 -04:00
David S. Miller
fd129f8941 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-06-05

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a new BPF hook for sendmsg similar to existing hooks for bind and
   connect: "This allows to override source IP (including the case when it's
   set via cmsg(3)) and destination IP:port for unconnected UDP (slow path).
   TCP and connected UDP (fast path) are not affected. This makes UDP support
   complete, that is, connected UDP is handled by connect hooks, unconnected
   by sendmsg ones.", from Andrey.

2) Rework of the AF_XDP API to allow extending it in future for type writer
   model if necessary. In this mode a memory window is passed to hardware
   and multiple frames might be filled into that window instead of just one
   that is the case in the current fixed frame-size model. With the new
   changes made this can be supported without having to add a new descriptor
   format. Also, core bits for the zero-copy support for AF_XDP have been
   merged as agreed upon, where i40e bits will be routed via Jeff later on.
   Various improvements to documentation and sample programs included as
   well, all from Björn and Magnus.

3) Given BPF's flexibility, a new program type has been added to implement
   infrared decoders. Quote: "The kernel IR decoders support the most
   widely used IR protocols, but there are many protocols which are not
   supported. [...] There is a 'long tail' of unsupported IR protocols,
   for which lircd is need to decode the IR. IR encoding is done in such
   a way that some simple circuit can decode it; therefore, BPF is ideal.
   [...] user-space can define a decoder in BPF, attach it to the rc
   device through the lirc chardev.", from Sean.

4) Several improvements and fixes to BPF core, among others, dumping map
   and prog IDs into fdinfo which is a straight forward way to correlate
   BPF objects used by applications, removing an indirect call and therefore
   retpoline in all map lookup/update/delete calls by invoking the callback
   directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper
   for tc BPF programs to have an efficient way of looking up cgroup v2 id
   for policy or other use cases. Fixes to make sure we zero tunnel/xfrm
   state that hasn't been filled, to allow context access wrt pt_regs in
   32 bit archs for tracing, and last but not least various test cases
   for fixes that landed in bpf earlier, from Daniel.

5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with
   a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect
   call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper.

6) Add a new bpf_get_current_cgroup_id() helper that can be used in
   tracing to retrieve the cgroup id from the current process in order
   to allow for e.g. aggregation of container-level events, from Yonghong.

7) Two follow-up fixes for BTF to reject invalid input values and
   related to that also two test cases for BPF kselftests, from Martin.

8) Various API improvements to the bpf_fib_lookup() helper, that is,
   dropping MPLS bits which are not fully hashed out yet, rejecting
   invalid helper flags, returning error for unsupported address
   families as well as renaming flowlabel to flowinfo, from David.

9) Various fixes and improvements to sockmap BPF kselftests in particular
   in proper error detection and data verification, from Prashant.

10) Two arm32 BPF JIT improvements. One is to fix imm range check with
    regards to whether immediate fits into 24 bits, and a naming cleanup
    to get functions related to rsh handling consistent to those handling
    lsh, from Wang.

11) Two compile warning fixes in BPF, one for BTF and a false positive
    to silent gcc in stack_map_get_build_id_offset(), from Arnd.

12) Add missing seg6.h header into tools include infrastructure in order
    to fix compilation of BPF kselftests, from Mathieu.

13) Several formatting cleanups in the BPF UAPI helper description that
    also fix an error during rst2man compilation, from Quentin.

14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is
    not built into the kernel, from Yue.

15) Remove a useless double assignment in dev_map_enqueue(), from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05 12:42:19 -04:00
David Ahern
ac0fc8a1bb devlink: Add extack to reload and port_{un, }split operations
Add extack argument to reload, port_split and port_unsplit operations.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05 12:32:37 -04:00
Maciej Żenczykowski
95358a9553 net-tcp: remove useless tw_timeout field
Tested: 'git grep tw_timeout' comes up empty and it builds :-)

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-05 10:45:24 -04:00
Magnus Karlsson
ac98d8aab6 xsk: wire upp Tx zero-copy functions
Here we add the functionality required to support zero-copy Tx, and
also exposes various zero-copy related functions for the netdevs.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-05 15:48:34 +02:00
Björn Töpel
173d3adb6f xsk: add zero-copy support for Rx
Extend the xsk_rcv to support the new MEM_TYPE_ZERO_COPY memory, and
wireup ndo_bpf call in bind.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-05 15:46:55 +02:00
Björn Töpel
02b55e5657 xdp: add MEM_TYPE_ZERO_COPY
Here, a new type of allocator support is added to the XDP return
API. A zero-copy allocated xdp_buff cannot be converted to an
xdp_frame. Instead is the buff has to be copied. This is not supported
at all in this commit.

Also, an opaque "handle" is added to xdp_buff. This can be used as a
context for the zero-copy allocator implementation.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-05 15:46:26 +02:00
Björn Töpel
8aef7340ae xsk: introduce xdp_umem_page
The xdp_umem_page holds the address for a page. Trade memory for
faster lookup. Later, we'll add DMA address here as well.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-05 15:45:41 +02:00
Björn Töpel
e61e62b9e2 xsk: moved struct xdp_umem definition
Moved struct xdp_umem to xdp_sock.h, in order to prepare for zero-copy
support.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-05 15:45:17 +02:00
Linus Torvalds
408afb8d78 Merge branch 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio updates from Al Viro:
 "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.

  The only thing I'm holding back for a day or so is Adam's aio ioprio -
  his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
  but let it sit in -next for decency sake..."

* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  aio: sanitize the limit checking in io_submit(2)
  aio: fold do_io_submit() into callers
  aio: shift copyin of iocb into io_submit_one()
  aio_read_events_ring(): make a bit more readable
  aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
  aio: take list removal to (some) callers of aio_complete()
  aio: add missing break for the IOCB_CMD_FDSYNC case
  random: convert to ->poll_mask
  timerfd: convert to ->poll_mask
  eventfd: switch to ->poll_mask
  pipe: convert to ->poll_mask
  crypto: af_alg: convert to ->poll_mask
  net/rxrpc: convert to ->poll_mask
  net/iucv: convert to ->poll_mask
  net/phonet: convert to ->poll_mask
  net/nfc: convert to ->poll_mask
  net/caif: convert to ->poll_mask
  net/bluetooth: convert to ->poll_mask
  net/sctp: convert to ->poll_mask
  net/tipc: convert to ->poll_mask
  ...
2018-06-04 13:57:43 -07:00
Michal Kubecek
fa1be7e01e ipv6: omit traffic class when calculating flow hash
Some of the code paths calculating flow hash for IPv6 use flowlabel member
of struct flowi6 which, despite its name, encodes both flow label and
traffic class. If traffic class changes within a TCP connection (as e.g.
ssh does), ECMP route can switch between path. It's also inconsistent with
other code paths where ip6_flowlabel() (returning only flow label) is used
to feed the key.

Use only flow label everywhere, including one place where hash key is set
using ip6_flowinfo().

Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Fixes: f70ea018da ("net: Add functions to get skb->hash based on flow structures")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 13:21:18 -04:00
David S. Miller
a925ab48da Revert "ipv6: omit traffic class when calculating flow hash"
This reverts commit 87ae68c8b4.

Applied the wrong version of this fix, correct version
coming up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 13:20:38 -04:00
Michal Kubecek
87ae68c8b4 ipv6: omit traffic class when calculating flow hash
Some of the code paths calculating flow hash for IPv6 use flowlabel member
of struct flowi6 which, despite its name, encodes both flow label and
traffic class. If traffic class changes within a TCP connection (as e.g.
ssh does), ECMP route can switch between path. It's also incosistent with
other code paths where ip6_flowlabel() (returning only flow label) is used
to feed the key.

Use only flow label everywhere, including one place where hash key is set
using ip6_flowinfo().

Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Fixes: f70ea018da ("net: Add functions to get skb->hash based on flow structures")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 13:18:35 -04:00
Linus Torvalds
cf626b0da7 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
 "Christoph's proc_create_... cleanups series"

* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
  xfs, proc: hide unused xfs procfs helpers
  isdn/gigaset: add back gigaset_procinfo assignment
  proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
  tty: replace ->proc_fops with ->proc_show
  ide: replace ->proc_fops with ->proc_show
  ide: remove ide_driver_proc_write
  isdn: replace ->proc_fops with ->proc_show
  atm: switch to proc_create_seq_private
  atm: simplify procfs code
  bluetooth: switch to proc_create_seq_data
  netfilter/x_tables: switch to proc_create_seq_private
  netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
  neigh: switch to proc_create_seq_data
  hostap: switch to proc_create_{seq,single}_data
  bonding: switch to proc_create_seq_data
  rtc/proc: switch to proc_create_single_data
  drbd: switch to proc_create_single
  resource: switch to proc_create_seq_data
  staging/rtl8192u: simplify procfs code
  jfs: simplify procfs code
  ...
2018-06-04 10:00:01 -07:00
Jesper Dangaard Brouer
73de5717eb xdp: done implementing ndo_xdp_xmit flush flag for all drivers
Removing XDP_XMIT_FLAGS_NONE as all driver now implement
a flush operation in their ndo_xdp_xmit call.  The compiler
will catch if any users of XDP_XMIT_FLAGS_NONE remains.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 08:11:34 -07:00
Jesper Dangaard Brouer
42b3346898 xdp: add flags argument to ndo_xdp_xmit API
This patch only change the API and reject any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.

The plan is to implement flush operation via XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03 08:11:34 -07:00
Florian Westphal
1b2470e59f netfilter: nf_tables: handle chain name lookups via rhltable
If there is a significant amount of chains list search is too slow, so
add an rhlist table for this.

This speeds up ruleset loading: for every new rule we have to check if
the name already exists in current generation.

We need to be able to cope with duplicate chain names in case a transaction
drops the nfnl mutex (for request_module) and the abort of this old
transaction is still pending.

The list is kept -- we need a way to iterate chains even if hash resize is
in progress without missing an entry.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 01:18:37 +02:00
Pablo Neira Ayuso
371ebcbb9e netfilter: nf_tables: add destroy_clone expression
Before this patch, cloned expressions are released via ->destroy. This
is a problem for the new connlimit expression since the ->destroy path
drop a reference on the conntrack modules and it unregisters hooks. The
new ->destroy_clone provides context that this expression is being
released from the packet path, so it is mirroring ->clone(), where
neither module reference is dropped nor hooks need to be unregistered -
because this done from the control plane path from the ->init() path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 00:02:11 +02:00
Pablo Neira Ayuso
79b174ade1 netfilter: nf_tables: garbage collection for stateful expressions
Use garbage collector to schedule removal of elements based of feedback
from expression that this element comes with. Therefore, the garbage
collector is not guided by timeout expirations in this new mode.

The new connlimit expression sets on the NFT_EXPR_GC flag to enable this
behaviour, the dynset expression needs to explicitly enable the garbage
collector via set->ops->gc_init call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 00:02:10 +02:00
Pablo Neira Ayuso
3453c92731 netfilter: nf_tables: pass ctx to nf_tables_expr_destroy()
nft_set_elem_destroy() can be called from call_rcu context. Annotate
netns and table in set object so we can populate the context object.
Moreover, pass context object to nf_tables_set_elem_destroy() from the
commit phase, since it is already available from there.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 00:02:09 +02:00
Pablo Neira Ayuso
5e5cbc7b23 netfilter: nf_conncount: expose connection list interface
This patch provides an interface to maintain the list of connections and
the lookup function to obtain the number of connections in the list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 00:02:08 +02:00
Pablo Neira Ayuso
00bfb3205e netfilter: nf_tables: pass context to object destroy indirection
The new connlimit object needs this to properly deal with conntrack
dependencies.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 00:02:06 +02:00
Máté Eckl
45ca4e0cf2 netfilter: Libify xt_TPROXY
The extracted functions will likely be usefull to implement tproxy
support in nf_tables.

Extrancted functions:
	- nf_tproxy_sk_is_transparent
	- nf_tproxy_laddr4
	- nf_tproxy_handle_time_wait4
	- nf_tproxy_get_sock_v4
	- nf_tproxy_laddr6
	- nf_tproxy_handle_time_wait6
	- nf_tproxy_get_sock_v6

(nf_)tproxy_handle_time_wait6 also needed some refactor as its current
implementation was xtables-specific.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 00:02:05 +02:00
Máté Eckl
8d6e555773 netfilter: Decrease code duplication regarding transparent socket option
There is a function in include/net/netfilter/nf_socket.h to decide if a
socket has IP(V6)_TRANSPARENT socket option set or not. However this
does the same as inet_sk_transparent() in include/net/tcp.h

include/net/tcp.h:1733
/* This helper checks if socket has IP_TRANSPARENT set */
static inline bool inet_sk_transparent(const struct sock *sk)
{
	switch (sk->sk_state) {
	case TCP_TIME_WAIT:
		return inet_twsk(sk)->tw_transparent;
	case TCP_NEW_SYN_RECV:
		return inet_rsk(inet_reqsk(sk))->no_srccheck;
	}
	return inet_sk(sk)->transparent;
}

tproxy_sk_is_transparent has also been refactored to use this function
instead of reimplementing it.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03 00:02:01 +02:00
David S. Miller
1ffdd8e164 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree, the most relevant things in this batch are:

1) Compile masquerade infrastructure into NAT module, from Florian Westphal.
   Same thing with the redirection support.

2) Abort transaction if early initialization of the commit phase fails.
   Also from Florian.

3) Get rid of synchronize_rcu() by using rule array in nf_tables, from
   Florian.

4) Abort nf_tables batch if fatal signal is pending, from Florian.

5) Use .call_rcu nfnetlink from nf_tables to make dumps fully lockless.
   From Florian Westphal.

6) Support to match transparent sockets from nf_tables, from Máté Eckl.

7) Audit support for nf_tables, from Phil Sutter.

8) Validate chain dependencies from commit phase, fall back to fine grain
   validation only in case of errors.

9) Attach dst to skbuff from netfilter flowtable packet path, from
   Jason A. Donenfeld.

10) Use artificial maximum attribute cap to remove VLA from nfnetlink.
    Patch from Kees Cook.

11) Add extension to allow to forward packets through neighbour layer.

12) Add IPv6 conntrack helper support to IPVS, from Julian Anastasov.

13) Add IPv6 FTP conntrack support to IPVS, from Julian Anastasov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-02 09:04:21 -04:00
Julian Anastasov
31875d4970 ipvs: register conntrack hooks for ftp
ip_vs_ftp requires conntrack modules for mangling
of FTP command responses in passive mode.

Make sure the conntrack hooks are registered when
real servers use NAT method in FTP virtual service.
The hooks will be registered while the service is
present.

Fixes: 0c66dc1ea3 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-02 00:55:38 +02:00
Julian Anastasov
d12e12299a ipvs: add ipv6 support to ftp
Add support for FTP commands with extended format (RFC 2428):

- FTP EPRT: IPv4 and IPv6, active mode, similar to PORT
- FTP EPSV: IPv4 and IPv6, passive mode, similar to PASV.
EPSV response usually contains only port but we allow real
server to provide different address

We restrict control and data connection to be from same
address family.

Allow the "(" and ")" to be optional in PASV response.

Also, add ipvsh argument to the pkt_in/pkt_out handlers to better
access the payload after transport header.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-01 14:01:54 +02:00
Pablo Neira Ayuso
a654de8fdc netfilter: nf_tables: fix chain dependency validation
The following ruleset:

 add table ip filter
 add chain ip filter input { type filter hook input priority 4; }
 add chain ip filter ap
 add rule ip filter input jump ap
 add rule ip filter ap masquerade

results in a panic, because the masquerade extension should be rejected
from the filter chain. The existing validation is missing a chain
dependency check when the rule is added to the non-base chain.

This patch fixes the problem by walking down the rules from the
basechains, searching for either immediate or lookup expressions, then
jumping to non-base chains and again walking down the rules to perform
the expression validation, so we make sure the full ruleset graph is
validated. This is done only once from the commit phase, in case of
problem, we abort the transaction and perform fine grain validation for
error reporting. This patch requires 003087911a ("netfilter:
nfnetlink: allow commit to fail") to achieve this behaviour.

This patch also adds a cleanup callback to nfnl batch interface to reset
the validate state from the exit path.

As a result of this patch, nf_tables_check_loops() doesn't use
->validate to check for loops, instead it just checks for immediate
expressions.

Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-01 09:46:22 +02:00
Kees Cook
ccf8dbcd06 rtnetlink: Remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the maximum size expected for all possible types and adds
sanity-checks at both registration and usage to make sure nothing gets
out of sync. This matches the proposed VLA solution for nfnetlink[2]. The
values chosen here were based on finding assignments for .maxtype and
.slave_maxtype and manually counting the enums:

slave_maxtype (max 33):
	IFLA_BRPORT_MAX     33
	IFLA_BOND_SLAVE_MAX  9

maxtype (max 45):
	IFLA_BOND_MAX       28
	IFLA_BR_MAX         45
	__IFLA_CAIF_HSI_MAX  8
	IFLA_CAIF_MAX        4
	IFLA_CAN_MAX        16
	IFLA_GENEVE_MAX     12
	IFLA_GRE_MAX        25
	IFLA_GTP_MAX         5
	IFLA_HSR_MAX         7
	IFLA_IPOIB_MAX       4
	IFLA_IPTUN_MAX      21
	IFLA_IPVLAN_MAX      3
	IFLA_MACSEC_MAX     15
	IFLA_MACVLAN_MAX     7
	IFLA_PPP_MAX         2
	__IFLA_RMNET_MAX     4
	IFLA_VLAN_MAX        6
	IFLA_VRF_MAX         2
	IFLA_VTI_MAX         7
	IFLA_VXLAN_MAX      28
	VETH_INFO_MAX        2
	VXCAN_INFO_MAX       2

This additionally changes maxtype and slave_maxtype fields to unsigned,
since they're only ever using positive values.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
[2] https://patchwork.kernel.org/patch/10439647/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-31 22:48:46 -04:00
Yafang Shao
3d97d88e80 tcp: minor optimization around tcp_hdr() usage in receive path
This is additional to the
commit ea1627c20c ("tcp: minor optimizations around tcp_hdr() usage").
At this point, skb->data is same with tcp_hdr() as tcp header has not
been pulled yet. So use the less expensive one to get the tcp header.

Remove the third parameter of tcp_rcv_established() and put it into
the function body.

Furthermore, the local variables are listed as a reverse christmas tree :)

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-31 13:20:47 -04:00
David Ahern
8308f3ff17 net/ipv6: Add support for specifying metric of connected routes
Add support for IFA_RT_PRIORITY to ipv6 addresses.

If the metric is changed on an existing address then the new route
is inserted before removing the old one. Since the metric is one
of the route keys, the prefix route can not be atomically replaced.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 10:12:45 -04:00
David Ahern
af4d768ad2 net/ipv4: Add support for specifying metric of connected routes
Add support for IFA_RT_PRIORITY to ipv4 addresses.

If the metric is changed on an existing address then the new route
is inserted before removing the old one. Since the metric is one
of the route keys, the prefix route can not be replaced.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 10:12:45 -04:00
David Ahern
e6464b8c63 net/ipv6: Convert ipv6_add_addr to struct ifa6_config
Move config parameters for adding an ipv6 address to a struct. struct
names stem from inet6_rtm_newaddr which is the modern handler for
adding an address.

Start the conversion to ifa6_config with ipv6_add_addr. This is an argument
move only; no functional change intended. Mapping of variable changes:

    addr      -->  cfg->pfx
    peer_addr -->  cfg->peer_pfx
    pfxlen    -->  cfg->plen
    flags     -->  cfg->ifa_flags

scope, valid_lft, prefered_lft have the same names within cfg
(with corrected spelling).

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 10:12:44 -04:00
Jakub Kicinski
47c669a406 net: sched: mq: request stats from offloads
MQ doesn't hold any statistics on its own, however, statistic
from offloads are requested starting from the root, hence MQ
will read the old values for its sums.  Call into the drivers,
because of the additive nature of the stats drivers are aware
of how much "pending updates" they have to children of the MQ.
Since MQ reset its stats on every dump we can simply offset
the stats, predicting how stats of offloaded children will
change.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 09:49:16 -04:00
Jakub Kicinski
f971b13230 net: sched: mq: add simple offload notification
mq offload is trivial, we just need to let the device know
that the root qdisc is mq.  Alternative approach would be
to export qdisc_lookup() and make drivers check the root
type themselves, but notification via ndo_setup_tc is more
in line with other qdiscs.

Note that mq doesn't hold any stats on it's own, it just
adds up stats of its children.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 09:49:16 -04:00
Jakub Kicinski
6172abc1e2 net: sched: add qstats.qlen to qlen
AFAICT struct gnet_stats_queue.qlen is not used in Qdiscs.
It may, however, be useful for offloads to report HW queue
length there.  Add that value to the result of qdisc_qlen_sum().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 09:49:15 -04:00
Florian Westphal
0cbc06b3fa netfilter: nf_tables: remove synchronize_rcu in commit phase
synchronize_rcu() is expensive.

The commit phase currently enforces an unconditional
synchronize_rcu() after incrementing the generation counter.

This is to make sure that a packet always sees a consistent chain, either
nft_do_chain is still using old generation (it will skip the newly added
rules), or the new one (it will skip old ones that might still be linked
into the list).

We could just remove the synchronize_rcu(), it would not cause a crash but
it could cause us to evaluate a rule that was removed and new rule for the
same packet, instead of either-or.

To resolve this, add rule pointer array holding two generations, the
current one and the future generation.

In commit phase, allocate the rule blob and populate it with the rules that
will be active in the new generation.

Then, make this rule blob public, replacing the old generation pointer.

Then the generation counter can be incremented.

nft_do_chain() will either continue to use the current generation
(in case loop was invoked right before increment), or the new one.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-29 14:49:59 +02:00
Paolo Abeni
e9be0e993d net: sched: shrink struct Qdisc
The struct Qdisc has a lot of holes, especially after commit
a53851e2c3 ("net: sched: explicit locking in gso_cpu fallback"),
which as a side effect, moved the fields just after 'busylock'
on a new cacheline.

Since both 'padded' and 'refcnt' are not updated frequently, and
there is a hole before 'gso_skb', we can move such fields there,
saving a cacheline without any performance side effect.

Before this commit:

pahole -C Qdisc net/sche/sch_generic.o
	# ...
        /* size: 384, cachelines: 6, members: 25 */
        /* sum members: 236, holes: 3, sum holes: 92 */
        /* padding: 56 */

After this commit:
pahole -C Qdisc net/sche/sch_generic.o
	# ...
	/* size: 320, cachelines: 5, members: 25 */
	/* sum members: 236, holes: 2, sum holes: 28 */
	/* padding: 56 */

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-28 23:13:39 -04:00
Sridhar Samudrala
cfc80d9a11 net: Introduce net_failover driver
The net_failover driver provides an automated failover mechanism via APIs
to create and destroy a failover master netdev and manages a primary and
standby slave netdevs that get registered via the generic failover
infrastructure.

The failover netdev acts a master device and controls 2 slave devices. The
original paravirtual interface gets registered as 'standby' slave netdev and
a passthru/vf device with the same MAC gets registered as 'primary' slave
netdev. Both 'standby' and 'failover' netdevs are associated with the same
'pci' device. The user accesses the network interface via 'failover' netdev.
The 'failover' netdev chooses 'primary' netdev as default for transmits when
it is available with link up and running.

This can be used by paravirtual drivers to enable an alternate low latency
datapath. It also enables hypervisor controlled live migration of a VM with
direct attached VF by failing over to the paravirtual datapath when the VF
is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-28 22:59:54 -04:00
Sridhar Samudrala
30c8bd5aa8 net: Introduce generic failover module
The failover module provides a generic interface for paravirtual drivers
to register a netdev and a set of ops with a failover instance. The ops
are used as event handlers that get called to handle netdev register/
unregister/link change/name change events on slave pci ethernet devices
with the same mac address as the failover netdev.

This enables paravirtual drivers to use a VF as an accelerated low latency
datapath. It also allows migration of VMs with direct attached VFs by
failing over to the paravirtual datapath when the VF is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-28 22:59:54 -04:00
Máté Eckl
e5a10bb2ac netfilter: add includes to nf_socket.h
These have to be included always when nf_socket.h is included.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-29 00:23:58 +02:00
David S. Miller
5b79c2af66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of easy overlapping changes in the confict
resolutions here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26 19:46:15 -04:00
Christoph Hellwig
f87be89481 net/iucv: convert to ->poll_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
Christoph Hellwig
17112d8081 net/bluetooth: convert to ->poll_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
Christoph Hellwig
568ea88ef9 net/sctp: convert to ->poll_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
Christoph Hellwig
db5051ead6 net: convert datagram_poll users tp ->poll_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-26 09:16:44 +02:00
Christoph Hellwig
2c7d3daceb net/tcp: convert to ->poll_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
Christoph Hellwig
984652dd8b net: remove sock_no_poll
Now that sock_poll handles a NULL ->poll or ->poll_mask there is no need
for a stub.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
Christoph Hellwig
3cafb37633 net: refactor socket_poll
Factor out two busy poll related helpers for late reuse, and remove
a command that isn't very helpful, especially with the __poll_t
annotations in place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
David S. Miller
d7c52fc8bc mlx5e-updates-2018-05-19
This series contains updates for mlx5e netdevice driver with one subject,
 DSCP to priority mapping, in the first patch Huy adds the needed API in
 dcbnl, the second patch adds the needed mlx5 core capability bits for the
 feature, and all other patches are mlx5e (netdev) only changes to add
 support for the feature.
 
 From: Huy Nguyen
 
 Dscp to priority mapping for Ethernet packet:
 
 These patches enable differentiated services code point (dscp) to
 priority mapping for Ethernet packet. Once this feature is
 enabled, the packet is routed to the corresponding priority based on its
 dscp. User can combine this feature with priority flow control (pfc)
 feature to have priority flow control based on the dscp.
 
 Firmware interface:
 Mellanox firmware provides two control knobs for this feature:
   QPTS register allow changing the trust state between dscp and
   pcp mode. The default is pcp mode. Once in dscp mode, firmware will
   route the packet based on its dscp value if the dscp field exists.
 
   QPDPM register allow mapping a specific dscp (0 to 63) to a
   specific priority (0 to 7). By default, all the dscps are mapped to
   priority zero.
 
 Software interface:
 This feature is controlled via application priority TLV. IEEE
 specification P802.1Qcd/D2.1 defines priority selector id 5 for
 application priority TLV. This APP TLV selector defines DSCP to priority
 map. This APP TLV can be sent by the switch or can be set locally using
 software such as lldptool. In mlx5 drivers, we add the support for net
 dcb's getapp and setapp call back. Mlx5 driver only handles the selector
 id 5 application entry (dscp application priority application entry).
 If user sends multiple dscp to priority APP TLV entries on the same
 dscp, the last sent one will take effect. All the previous sent will be
 deleted.
 
 This attribute combined with pfc attribute allows advanced user to
 fine tune the qos setting for specific priority queue. For example,
 user can give dedicated buffer for one or more priorities or user
 can give large buffer to certain priorities.
 
 The dcb buffer configuration will be controlled by lldptool.
 >> lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
       maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6
 >> lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
       sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively
 
 After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to
 choose dcbnl over devlink interface since this feature is intended to set
 port attributes which are governed by the netdev instance of that port, where
 devlink API is more suitable for global ASIC configurations.
 
 The firmware trust state (in QPTS register) is changed based on the
 number of dscp to priority application entries. When the first dscp to
 priority application entry is added by the user, the trust state is
 changed to dscp. When the last dscp to priority application entry is
 deleted by the user, the trust state is changed to pcp.
 
 When the port is in DSCP trust state, the transmit queue is selected
 based on the dscp of the skb.
 
 When the port is in DSCP trust state and vport inline mode is not NONE,
 firmware requires mlx5 driver to copy the IP header to the
 wqe ethernet segment inline header if the skb has it.
 This is done by changing the transmit queue sq's min inline mode to L3.
 Note that the min inline mode of sqs that belong to other features
 such as xdpsq, icosq are not modified.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbBy2YAAoJEEg/ir3gV/o+RkYIAId0h3GqVof41/qCHPuqx+7j
 J/lTJSU+PHDZfVAuXhoQN5lFGsRku/wkqdSZNXxwvbUxVzIoUpxWvlaEv+481Cmj
 +WhVC1QIExGYUs4CCGjgl853kNtGpDrZmk+olY3QaQISHdpcmgs/W13YlEKRyw5q
 nFSa6GLA/3IQXb7eOwUiSW1gYYjn/eM2XFHntTTOeasa9dwb2QDqDQdoYiKdS+yS
 KxqxrlADU4V+CtNlOmZ4QGxOa7suBBr+a6JvNnvviYOKKpYMNQRbhUbgu0pPdKAt
 zbHHS7Twy1ka3s6CKk/QIT7/YWfCOQgWy6LGPCtUuGUA6dy5xsV7i2BtoDwMC7o=
 =EJd2
 -----END PGP SIGNATURE-----

Merge tag 'mlx5e-updates-2018-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5e-updates-2018-05-19

This series contains updates for mlx5e netdevice driver with one subject,
DSCP to priority mapping, in the first patch Huy adds the needed API in
dcbnl, the second patch adds the needed mlx5 core capability bits for the
feature, and all other patches are mlx5e (netdev) only changes to add
support for the feature.

From: Huy Nguyen

Dscp to priority mapping for Ethernet packet:

These patches enable differentiated services code point (dscp) to
priority mapping for Ethernet packet. Once this feature is
enabled, the packet is routed to the corresponding priority based on its
dscp. User can combine this feature with priority flow control (pfc)
feature to have priority flow control based on the dscp.

Firmware interface:
Mellanox firmware provides two control knobs for this feature:
  QPTS register allow changing the trust state between dscp and
  pcp mode. The default is pcp mode. Once in dscp mode, firmware will
  route the packet based on its dscp value if the dscp field exists.

  QPDPM register allow mapping a specific dscp (0 to 63) to a
  specific priority (0 to 7). By default, all the dscps are mapped to
  priority zero.

Software interface:
This feature is controlled via application priority TLV. IEEE
specification P802.1Qcd/D2.1 defines priority selector id 5 for
application priority TLV. This APP TLV selector defines DSCP to priority
map. This APP TLV can be sent by the switch or can be set locally using
software such as lldptool. In mlx5 drivers, we add the support for net
dcb's getapp and setapp call back. Mlx5 driver only handles the selector
id 5 application entry (dscp application priority application entry).
If user sends multiple dscp to priority APP TLV entries on the same
dscp, the last sent one will take effect. All the previous sent will be
deleted.

This attribute combined with pfc attribute allows advanced user to
fine tune the qos setting for specific priority queue. For example,
user can give dedicated buffer for one or more priorities or user
can give large buffer to certain priorities.

The dcb buffer configuration will be controlled by lldptool.
>> lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
      maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6
>> lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
      sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively

After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to
choose dcbnl over devlink interface since this feature is intended to set
port attributes which are governed by the netdev instance of that port, where
devlink API is more suitable for global ASIC configurations.

The firmware trust state (in QPTS register) is changed based on the
number of dscp to priority application entries. When the first dscp to
priority application entry is added by the user, the trust state is
changed to dscp. When the last dscp to priority application entry is
deleted by the user, the trust state is changed to pcp.

When the port is in DSCP trust state, the transmit queue is selected
based on the dscp of the skb.

When the port is in DSCP trust state and vport inline mode is not NONE,
firmware requires mlx5 driver to copy the IP header to the
wqe ethernet segment inline header if the skb has it.
This is done by changing the transmit queue sq's min inline mode to L3.
Note that the min inline mode of sqs that belong to other features
such as xdpsq, icosq are not modified.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25 16:41:23 -04:00
Cong Wang
aaa908ffbe net_sched: switch to rcu_work
Commit 05f0fe6b74 ("RCU, workqueue: Implement rcu_work") introduces
new API's for dispatching work in a RCU callback. Now we can just
switch to the new API's for tc filters. This could get rid of a lot
of code.

Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24 22:56:15 -04:00
David S. Miller
90fed9c946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-05-24

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).

2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.

3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.

4) Jiong Wang adds support for indirect and arithmetic shifts to NFP

5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.

6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing
   to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions.

7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.

8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.

9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24 22:20:51 -04:00
Jesper Dangaard Brouer
389ab7f01a xdp: introduce xdp_return_frame_rx_napi
When sending an xdp_frame through xdp_do_redirect call, then error
cases can happen where the xdp_frame needs to be dropped, and
returning an -errno code isn't sufficient/possible any-longer
(e.g. for cpumap case). This is already fully supported, by simply
calling xdp_return_frame.

This patch is an optimization, which provides xdp_return_frame_rx_napi,
which is a faster variant for these error cases.  It take advantage of
the protection provided by XDP RX running under NAPI protection.

This change is mostly relevant for drivers using the page_pool
allocator as it can take advantage of this. (Tested with mlx5).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24 18:36:15 -07:00
Huy Nguyen
e549f6f9c0 net/dcb: Add dcbnl buffer attribute
In this patch, we add dcbnl buffer attribute to allow user
change the NIC's buffer configuration such as priority
to buffer mapping and buffer size of individual buffer.

This attribute combined with pfc attribute allows advanced user to
fine tune the qos setting for specific priority queue. For example,
user can give dedicated buffer for one or more priorities or user
can give large buffer to certain priorities.

The dcb buffer configuration will be controlled by lldptool.
lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
  maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6
lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
  sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively

After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to
choose dcbnl over devlink interface since this feature is intended to set
port attributes which are governed by the netdev instance of that port, where
devlink API is more suitable for global ASIC configurations.

We present an use case scenario where dcbnl buffer attribute configured
by advance user helps reduce the latency of messages of different sizes.

Scenarios description:
On ConnectX-5, we run latency sensitive traffic with
small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
traffic with large messages sizes 512KB and 1MB. We group small, medium,
and large message sizes to their own pfc enables priorities as follow.
  Priorities 1 & 2 (64B, 256B and 1KB)
  Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
  Priorities 5 & 6 (512KB and 1MB)

By default, ConnectX-5 maps all pfc enabled priorities to a single
lossless fixed buffer size of 50% of total available buffer space. The
other 50% is assigned to lossy buffer. Using dcbnl buffer attribute,
we create three equal size lossless buffers. Each buffer has 25% of total
available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
to lossless  buffer mappings are set as follow.
  Priorities 1 & 2 on lossless buffer #1
  Priorities 3 & 4 on lossless buffer #2
  Priorities 5 & 6 on lossless buffer #3

We observe improvements in latency for small and medium message sizes
as follows. Please note that the large message sizes bandwidth performance is
reduced but the total bandwidth remains the same.
  256B message size (42 % latency reduction)
  4K message size (21% latency reduction)
  64K message size (16% latency reduction)

CC: Ido Schimmel <idosch@idosch.org>
CC: Jakub Kicinski <jakub.kicinski@netronome.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Or Gerlitz <gerlitz.or@gmail.com>
CC: Parav Pandit <parav@mellanox.com>
CC: Aron Silverton <aron.silverton@oracle.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24 14:22:59 -07:00
Mathieu Xhonneux
fe94cc290f bpf: Add IPv6 Segment Routing helpers
The BPF seg6local hook should be powerful enough to enable users to
implement most of the use-cases one could think of. After some thinking,
we figured out that the following actions should be possible on a SRv6
packet, requiring 3 specific helpers :
    - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
    - bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH
                               (to add/delete TLVs)
    - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                           (specifically End.X, End.T, End.B6 and
                            End.B6.Encap)

The specifications of these helpers are provided in the patch (see
include/uapi/linux/bpf.h).

The non-sensitive fields of the SRH are the following : flags, tag and
TLVs. The other fields can not be modified, to maintain the SRH
integrity. Flags, tag and TLVs can easily be modified as their validity
can be checked afterwards via seg6_validate_srh. It is not allowed to
modify the segments directly. If one wants to add segments on the path,
he should stack a new SRH using the End.B6 action via
bpf_lwt_seg6_action.

Growing, shrinking or editing TLVs via the helpers will flag the SRH as
invalid, and it will have to be re-validated before re-entering the IPv6
layer. This flag is stored in a per-CPU buffer, along with the current
header length in bytes.

Storing the SRH len in bytes in the control block is mandatory when using
bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes
boundary). When adding/deleting TLVs within the BPF program, the SRH may
temporary be in an invalid state where its length cannot be rounded to 8
bytes without remainder, hence the need to store the length in bytes
separately. The caller of the BPF program can then ensure that the SRH's
final length is valid using this value. Again, a final SRH modified by a
BPF program which doesn’t respect the 8-bytes boundary will be discarded
as it will be considered as invalid.

Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
available from the LWT BPF IN hook, but not from the seg6local BPF one.
This helper allows to encapsulate a Segment Routing Header (either with
a new outer IPv6 header, or by inlining it directly in the existing IPv6
header) into a non-SRv6 packet. This helper is required if we want to
offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet,
as the BPF seg6local hook only works on traffic already containing a SRH.
This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
the same purpose but with a static SRH per route.

These helpers require CONFIG_IPV6=y (and not =m).

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 11:57:35 +02:00
Mathieu Xhonneux
1c1e761ef1 ipv6: sr: export function lookup_nexthop
The function lookup_nexthop is essential to implement most of the seg6local
actions. As we want to provide a BPF helper allowing to apply some of these
actions on the packet being processed, the helper should be able to call
this function, hence the need to make it public.

Moreover, if one argument is incorrect or if the next hop can not be found,
an error should be returned by the BPF helper so the BPF program can adapt
its processing of the packet (return an error, properly force the drop,
...). This patch hence makes this function return dst->error to indicate a
possible error.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 11:57:35 +02:00
Mathieu Xhonneux
63526e1c80 ipv6: sr: make seg6.h includable without IPv6
include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
not enabled:
   include/net/seg6.h: In function 'seg6_pernet':
>> include/net/seg6.h:52:14: error: 'struct net' has no member named
                                        'ipv6'; did you mean 'ipv4'?
     return net->ipv6.seg6_data;
                 ^~~~
                 ipv4

This commit makes seg6_pernet return NULL if IPv6 is not compiled, hence
allowing seg6.h to be included regardless of the configuration.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 11:57:35 +02:00
David S. Miller
fb83eb93c6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, they are:

1) Remove obsolete nf_log tracing from nf_tables, from Florian Westphal.

2) Add support for map lookups to numgen, random and hash expressions,
   from Laura Garcia.

3) Allow to register nat hooks for iptables and nftables at the same
   time. Patchset from Florian Westpha.

4) Timeout support for rbtree sets.

5) ip6_rpfilter works needs interface for link-local addresses, from
   Vincent Bernat.

6) Add nf_ct_hook and nf_nat_hook structures and use them.

7) Do not drop packets on packets raceing to insert conntrack entries
   into hashes, this is particularly a problem in nfqueue setups.

8) Address fallout from xt_osf separation to nf_osf, patches
   from Florian Westphal and Fernando Mancera.

9) Remove reference to struct nft_af_info, which doesn't exist anymore.
   From Taehee Yoo.

This batch comes with is a conflict between 25fd386e0b ("netfilter:
core: add missing __rcu annotation") in your tree and 2c205dd398
("netfilter: add struct nf_nat_hook and use it") coming in this batch.
This conflict can be solved by leaving the __rcu tag on
__netfilter_net_init() - added by 25fd386e0b - and remove all code
related to nf_nat_decode_session_hook - which is gone after
2c205dd398, as described by:

diff --cc net/netfilter/core.c
index e0ae4aae96f5,206fb2c4c319..168af54db975
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@@ -611,7 -580,13 +611,8 @@@ const struct nf_conntrack_zone nf_ct_zo
  EXPORT_SYMBOL_GPL(nf_ct_zone_dflt);
  #endif /* CONFIG_NF_CONNTRACK */

- static void __net_init __netfilter_net_init(struct nf_hook_entries **e, int max)
 -#ifdef CONFIG_NF_NAT_NEEDED
 -void (*nf_nat_decode_session_hook)(struct sk_buff *, struct flowi *);
 -EXPORT_SYMBOL(nf_nat_decode_session_hook);
 -#endif
 -
+ static void __net_init
+ __netfilter_net_init(struct nf_hook_entries __rcu **e, int max)
  {
  	int h;

I can also merge your net-next tree into nf-next, solve the conflict and
resend the pull request if you prefer so.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 16:37:11 -04:00
David S. Miller
5ee6ad201e For this round, we have various things all over the place, notably
* a fix for a race in aggregation, which I want to let
    bake for a bit longer before sending to stable
  * some new statistics (ACK RSSI, TXQ)
  * TXQ configuration
  * preparations for HE, particularly radiotap
  * replace confusing "country IE" by "country element" since it's
    not referring to Ireland
 
 Note that I merged net-next to get a fix from mac80211 that got
 there via net, to apply one patch that would otherwise conflict.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlsFWnEACgkQB8qZga/f
 l8Ry2xAAmOLiTrZ8VlZIwzXEoIrr2b7VM0jQsbCLGmBDu82EV4aRRtX9ZeZm9PuR
 a6t9kvQFyT0/7tInTfv6I9JlNCZpwT9Mc3Ttw2JQgJ9zm/IYsxmWJ4TtjIz7F+AA
 rqmxdplSCSJUcIVQ/mJ1oINl3p4ewoAv1doxtQx0Ucavb31ROwjVRUX24OJd1SeK
 YOFSjoTLHcCDS5jaTbzAGwI31F3plHG8NKMLlwGtrYMhN2SmaQV2YU+YTPJuiQbt
 EGa3MukngxF7ck+D57CJM+OcLrPF4RiuT6pmJHR8as5Yz5u40bgn3wZu361EcmSy
 wpJKFNsTOJS+nFHS/zMTWiVbB12bBGNWf3rZXUv5yH1TwVf8y8B2p2jrEasmVcjB
 PgwNcylNJYfqd2W439xwt1ChGAzc388U2yyzMtWmnNQeAAUFMthtjhEv2Vnowxf3
 cFvO5okRpVpOP42JB57VZfNoPeeUHnPfrlDl40AwbKUkKeVOom5oJQIi5WMg4nAV
 +MXooiJStZxMsY1PDyQgE06dL40r2HlmaX0DB/UbbWeVAaJ2c4aS3ptApEWrfedY
 FDTL0XhfqejPbK2Au/KX64TTj8ID2bGsundM4ErcilOK3Pu63FMv9b0mziBd8jX1
 6lJE2oIR8w10dFZG4O5itVE8n6PE2Fgx728480Lsjuz56GVxMB0=
 =G98y
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2018-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

For this round, we have various things all over the place, notably
 * a fix for a race in aggregation, which I want to let
   bake for a bit longer before sending to stable
 * some new statistics (ACK RSSI, TXQ)
 * TXQ configuration
 * preparations for HE, particularly radiotap
 * replace confusing "country IE" by "country element" since it's
   not referring to Ireland

Note that I merged net-next to get a fix from mac80211 that got
there via net, to apply one patch that would otherwise conflict.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 15:53:00 -04:00
Roopa Prabhu
404eb77ea7 ipv4: support sport, dport and ip_proto in RTM_GETROUTE
This is a followup to fib rules sport, dport and ipproto
match support. Only supports tcp, udp and icmp for ipproto.
Used by fib rule self tests.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 15:14:12 -04:00
Taehee Yoo
0c6bca7471 netfilter: nf_tables: remove nft_af_info.
The struct nft_af_info was removed.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23 12:16:25 +02:00
Vidyullatha Kanchanapally
7f9a3e150e nl80211: Update ERP info using NL80211_CMD_UPDATE_CONNECT_PARAMS
Use NL80211_CMD_UPDATE_CONNECT_PARAMS to update new ERP information,
Association IEs and the Authentication type to driver / firmware which
will be used in subsequent roamings.

Signed-off-by: Vidyullatha Kanchanapally <vidyullatha@codeaurora.org>
[arend: extended fils-sk kernel doc and added check in wiphy_register()]
Reviewed-by: Jithu Jance <jithu.jance@broadcom.com>
Reviewed-by: Eylon Pedinovsky <eylon.pedinovsky@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-23 11:21:35 +02:00
Arend Van Spriel
e841b7b11e nl80211: add FILS related parameters to ROAM event
In case of FILS shared key offload the parameters can change
upon roaming of which user-space needs to be notified.

Reviewed-by: Jithu Jance <jithu.jance@broadcom.com>
Reviewed-by: Eylon Pedinovsky <eylon.pedinovsky@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-23 11:19:02 +02:00
Arend Van Spriel
76804d28c3 cfg80211: use separate struct for FILS parameters
Put FILS related parameters into their own struct definition so
it can be reused for roam events in subsequent change.

Reviewed-by: Jithu Jance <jithu.jance@broadcom.com>
Reviewed-by: Eylon Pedinovsky <eylon.pedinovsky@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-23 11:07:35 +02:00
Ilan Peer
d4e36e5554 mac80211: Support adding duration for prepare_tx() callback
There are specific cases, such as SAE authentication exchange, that
might require long duration to complete. For such cases, add support
for indicating to the driver the required duration of the prepare_tx()
operation, so the driver would still be able to complete the frame
exchange.

Currently, indicate the duration only for SAE authentication exchange,
as SAE authentication can take up to 2000 msec (as defined in IEEE
P802.11-REVmd D1.0 p. 3504).

As the patch modified the prepare_tx() callback API, also modify
the relevant code in iwlwifi.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-23 11:06:10 +02:00
Johannes Berg
dd8070bff2 Merge remote-tracking branch 'net-next/master' into mac80211-next
Bring in net-next which had pulled in net, so I have the changes
from mac80211 and can apply a patch that would otherwise conflict.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-23 11:05:59 +02:00
Pablo Neira Ayuso
2c205dd398 netfilter: add struct nf_nat_hook and use it
Move decode_session() and parse_nat_setup_hook() indirections to struct
nf_nat_hook structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23 09:26:07 +02:00
Florian Westphal
9971a514ed netfilter: nf_nat: add nat type hooks to nat core
Currently the packet rewrite and instantiation of nat NULL bindings
happens from the protocol specific nat backend.

Invocation occurs either via ip(6)table_nat or the nf_tables nat chain type.

Invocation looks like this (simplified):
NF_HOOK()
   |
   `---iptable_nat
	 |
	 `---> nf_nat_l3proto_ipv4 -> nf_nat_packet
	               |
          new packet? pass skb though iptables nat chain
                       |
		       `---> iptable_nat: ipt_do_table

In nft case, this looks the same (nft_chain_nat_ipv4 instead of
iptable_nat).

This is a problem for two reasons:
1. Can't use iptables nat and nf_tables nat at the same time,
   as the first user adds a nat binding (nf_nat_l3proto_ipv4 adds a
   NULL binding if do_table() did not find a matching nat rule so we
   can detect post-nat tuple collisions).
2. If you use e.g. nft_masq, snat, redir, etc. uses must also register
   an empty base chain so that the nat core gets called fro NF_HOOK()
   to do the reverse translation, which is neither obvious nor user
   friendly.

After this change, the base hook gets registered not from iptable_nat or
nftables nat hooks, but from the l3 nat core.

iptables/nft nat base hooks get registered with the nat core instead:

NF_HOOK()
   |
   `---> nf_nat_l3proto_ipv4 -> nf_nat_packet
		|
         new packet? pass skb through iptables/nftables nat chains
                |
		+-> iptables_nat: ipt_do_table
	        +-> nft nat chain x
	        `-> nft nat chain y

The nat core deals with null bindings and reverse translation.
When no mapping exists, it calls the registered nat lookup hooks until
one creates a new mapping.
If both iptables and nftables nat hooks exist, the first matching
one is used (i.e., higher priority wins).

Also, nft users do not need to create empty nat hooks anymore,
nat core always registers the base hooks that take care of reverse/reply
translation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23 09:14:06 +02:00
Florian Westphal
1cd472bf03 netfilter: nf_nat: add nat hook register functions to nf_nat
This adds the infrastructure to register nat hooks with the nat core
instead of the netfilter core.

nat hooks are used to configure nat bindings.  Such hooks are registered
from ip(6)table_nat or by the nftables core when a nat chain is added.

After next patch, nat hooks will be registered with nf_nat instead of
netfilter core.  This allows to use many nat lookup functions at the
same time while doing the real packet rewrite (nat transformation) in
one place.

This change doesn't convert the intended users yet to ease review.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23 09:14:05 +02:00
Florian Westphal
4e25ceb80b netfilter: nf_tables: allow chain type to override hook register
Will be used in followup patch when nat types no longer
use nf_register_net_hook() but will instead register with the nat core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23 09:14:05 +02:00
Florian Westphal
1f55236bd8 netfilter: nf_nat: move common nat code to nat core
Copy-pasted, both l3 helpers almost use same code here.
Split out the common part into an 'inet' helper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23 09:14:05 +02:00
David Ahern
f34436a430 net/ipv6: Simplify route replace and appending into multipath route
Bring consistency to ipv6 route replace and append semantics.

Remove rt6_qualify_for_ecmp which is just guess work. It fails in 2 cases:
1. can not replace a route with a reject route. Existing code appends
   a new route instead of replacing the existing one.

2. can not have a multipath route where a leg uses a dev only nexthop

Existing use cases affected by this change:
1. adding a route with existing prefix and metric using NLM_F_CREATE
   without NLM_F_APPEND or NLM_F_EXCL (ie., what iproute2 calls
   'prepend'). Existing code auto-determines that the new nexthop can
   be appended to an existing route to create a multipath route. This
   change breaks that by requiring the APPEND flag for the new route
   to be added to an existing one. Instead the prepend just adds another
   route entry.

2. route replace. Existing code replaces first matching multipath route
   if new route is multipath capable and fallback to first matching
   non-ECMP route (reject or dev only route) in case one isn't available.
   New behavior replaces first matching route. (Thanks to Ido for spotting
   this one)

Note: Newer iproute2 is needed to display multipath routes with a dev-only
      nexthop. This is due to a bug in iproute2 and parsing nexthops.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-22 14:44:18 -04:00
Xin Long
644fbdeacf sctp: fix the issue that flags are ignored when using kernel_connect
Now sctp uses inet_dgram_connect as its proto_ops .connect, and the flags
param can't be passed into its proto .connect where this flags is really
needed.

sctp works around it by getting flags from socket file in __sctp_connect.
It works for connecting from userspace, as inherently the user sock has
socket file and it passes f_flags as the flags param into the proto_ops
.connect.

However, the sock created by sock_create_kern doesn't have a socket file,
and it passes the flags (like O_NONBLOCK) by using the flags param in
kernel_connect, which calls proto_ops .connect later.

So to fix it, this patch defines a new proto_ops .connect for sctp,
sctp_inet_connect, which calls __sctp_connect() directly with this
flags param. After this, the sctp's proto .connect can be removed.

Note that sctp_inet_connect doesn't need to do some checks that are not
needed for sctp, which makes thing better than with inet_dgram_connect.

Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-22 13:37:26 -04:00
Johannes Berg
f3a7ca6458 cfg80211: add missing kernel-doc
Add the kernel-doc missed earlier.

Fixes: 52539ca89f ("cfg80211: Expose TXQ stats and parameters to userspace")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-22 11:32:43 +02:00
David Ahern
901731b882 net/ipv6: Add helper to return path MTU based on fib result
Determine path MTU from a FIB lookup result. Logic is based on
ip6_dst_mtu_forward plus lookup of nexthop exception.

Add ip6_dst_mtu_forward to ipv6_stubs to handle access by core
bpf code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-22 10:51:09 +02:00
David Ahern
50d889b178 net/ipv4: Add helper to return path MTU based on fib result
Determine path MTU from a FIB lookup result. Logic is a distillation of
ip_dst_mtu_maybe_forward.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-22 10:51:09 +02:00
David S. Miller
6f6e434aa2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.

TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.

The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.

Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21 16:01:54 -04:00
William Tu
d48f1958ab erspan: set bso bit based on mirrored packet's len
Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
handled.  BSO has 4 possible values:
  00 --> Good frame with no error, or unknown integrity
  11 --> Payload is a Bad Frame with CRC or Alignment Error
  01 --> Payload is a Short Frame
  10 --> Payload is an Oversized Frame

Based the short/oversized definitions in RFC1757, the patch sets
the bso bit based on the mirrored packet's size.

Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-20 18:31:42 -04:00
David S. Miller
2c71ab4bb4 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2018-05-18

Here's the first bluetooth-next pull request for the 4.18 kernel:

 - Refactoring of the btbcm driver
 - New USB IDs for QCA_ROME and LiteOn controllers
 - Buffer overflow fix if the controller sends invalid advertising data length
 - Various cleanups & fixes for Qualcomm controllers

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-20 18:24:22 -04:00
Jiri Pirko
08474c1a9d devlink: introduce a helper to generate physical port names
Each driver implements physical port name generation by itself. However
as devlink has all needed info, it can easily do the job for all its
users. So implement this helper in devlink.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-19 16:30:39 -04:00
Jiri Pirko
5ec1380a21 devlink: extend attrs_set for setting port flavours
Devlink ports can have specific flavour according to the purpose of use.
This patch extend attrs_set so the driver can say which flavour port
has. Initial flavours are:
physical, cpu, dsa
User can query this to see right away what is the purpose of each port.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-19 16:30:39 -04:00
Jiri Pirko
b9ffcbaf56 devlink: introduce devlink_port_attrs_set
Change existing setter for split port information into more generic
attrs setter. Alongside with that, allow to set port number and subport
number for split ports.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-19 16:30:39 -04:00
Eric Dumazet
9c21d2fc41 tcp: add tcp_comp_sack_nr sysctl
This per netns sysctl allows for TCP SACK compression fine-tuning.

This limits number of SACK that can be compressed.
Using 0 disables SACK compression.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 11:40:27 -04:00
Eric Dumazet
6d82aa2420 tcp: add tcp_comp_sack_delay_ns sysctl
This per netns sysctl allows for TCP SACK compression fine-tuning.

Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 11:40:27 -04:00
Eric Dumazet
5d9f4262b7 tcp: add SACK compression
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.

Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.

This patch adds a high resolution timer and tp->compressed_ack counter.

Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms :

	delay = min ( 5 % of RTT, 1 ms)

If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.

When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.

Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.

A new SNMP counter is added in the following patch.

Two other patches add sysctls to allow changing the 1,000,000 and 44
values that this commit hard-coded.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 11:40:27 -04:00
Eric Dumazet
cf0dd20372 tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
Socket can not disappear under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18 11:40:27 -04:00
Björn Töpel
dac09149d9 xsk: clean up SPDX headers
Clean up SPDX-License-Identifier and removing licensing leftovers.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18 16:07:02 +02:00
Johannes Berg
7ea3e110f2 cfg80211: release station info tidstats where needed
This fixes memory leaks in cases where we got the station
info but failed sending it out properly.

Fixes: 8689c051a2 ("cfg80211: dynamically allocate per-tid stats for station info")
Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-18 12:37:55 +02:00
Arend van Spriel
8689c051a2 cfg80211: dynamically allocate per-tid stats for station info
With the addition of TXQ stats in the per-tid statistics the struct
station_info grew significantly. This resulted in stack size warnings
due to the structure itself being above the limit for the warnings.

Add an allocation function that those who want to provide per-tid
stats should use to allocate the tid array, i.e.
struct station_info::pertid.

Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Fixes: 52539ca89f ("cfg80211: Expose TXQ stats and parameters to userspace")
Signed-off-by: Arend van Spriel <aspriel@gmail.com>
[johannes: fix missing BIT() and logic by removing]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-18 11:14:34 +02:00
Loic Poulain
d6ee6ad774 Bluetooth: Add __hci_cmd_send function
This function allows to send a HCI command without expecting any
controller event/response in return. This is allowed for vendor-
specific commands only.

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-05-18 06:37:52 +02:00
Yuchung Cheng
b8fef65a8a tcp: new helper tcp_rack_skb_timeout
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack
to prepare the final RTO change.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 15:41:29 -04:00
Yuchung Cheng
d716bfdb10 tcp: account lost retransmit after timeout
The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).

The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.

This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 15:41:29 -04:00
Yuchung Cheng
6ac06ecd3a tcp: simpler NewReno implementation
This is a rewrite of NewReno loss recovery implementation that is
simpler and standalone for readability and better performance by
using less states.

Note that NewReno refers to RFC6582 as a modification to the fast
recovery algorithm. It is used only if the connection does not
support SACK in Linux. It should not to be confused with the Reno
(AIMD) congestion control.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 15:41:28 -04:00
Yuchung Cheng
20b654dfe1 tcp: support DUPACK threshold in RACK
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.

When the number of packets SACKed is greater or equal to the
threshold, RACK sets the reordering window to zero which would
immediately mark all the unsacked packets below the highest SACKed
sequence lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.

The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by RACK reordering timer. For example
data-center transfers where the RTT is much smaller than a timer
tick, or high RTT path where the default RTT/4 may take too long.

Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.

With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.

There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).

Also the minimum reordering window is reduced from 1 msec to 0
to recover quicker on short RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery to reduce the
reordering window timeouts.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 15:41:28 -04:00
Matt Mullins
8ab6ffba14 tls: don't use stack memory in a scatterlist
scatterlist code expects virt_to_page() to work, which fails with
CONFIG_VMAP_STACK=y.

Fixes: c46234ebb4 ("tls: RX path for ktls")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 14:49:38 -04:00
Paolo Abeni
96009c7d50 sched: replace __QDISC_STATE_RUNNING bit with a spin lock
So that we can use lockdep on it.
The newly introduced sequence lock has the same scope of busylock,
so it shares the same lockdep annotation, but it's only used for
NOLOCK qdiscs.

With this changeset we acquire such lock in the control path around
flushing operation (qdisc reset), to allow more NOLOCK qdisc perf
improvement in the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 12:46:54 -04:00
David S. Miller
b9f672af14 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-05-17

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Provide a new BPF helper for doing a FIB and neighbor lookup
   in the kernel tables from an XDP or tc BPF program. The helper
   provides a fast-path for forwarding packets. The API supports
   IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
   implemented in this initial work, from David (Ahern).

2) Just a tiny diff but huge feature enabled for nfp driver by
   extending the BPF offload beyond a pure host processing offload.
   Offloaded XDP programs are allowed to set the RX queue index and
   thus opening the door for defining a fully programmable RSS/n-tuple
   filter replacement. Once BPF decided on a queue already, the device
   data-path will skip the conventional RSS processing completely,
   from Jakub.

3) The original sockmap implementation was array based similar to
   devmap. However unlike devmap where an ifindex has a 1:1 mapping
   into the map there are use cases with sockets that need to be
   referenced using longer keys. Hence, sockhash map is added reusing
   as much of the sockmap code as possible, from John.

4) Introduce BTF ID. The ID is allocatd through an IDR similar as
   with BPF maps and progs. It also makes BTF accessible to user
   space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
   through BPF_OBJ_GET_INFO_BY_FD, from Martin.

5) Enable BPF stackmap with build_id also in NMI context. Due to the
   up_read() of current->mm->mmap_sem build_id cannot be parsed.
   This work defers the up_read() via a per-cpu irq_work so that
   at least limited support can be enabled, from Song.

6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
   JIT conversion as well as implementation of an optimized 32/64 bit
   immediate load in the arm64 JIT that allows to reduce the number of
   emitted instructions; in case of tested real-world programs they
   were shrinking by three percent, from Daniel.

7) Add ifindex parameter to the libbpf loader in order to enable
   BPF offload support. Right now only iproute2 can load offloaded
   BPF and this will also enable libbpf for direct integration into
   other applications, from David (Beckett).

8) Convert the plain text documentation under Documentation/bpf/ into
   RST format since this is the appropriate standard the kernel is
   moving to for all documentation. Also add an overview README.rst,
   from Jesper.

9) Add __printf verification attribute to the bpf_verifier_vlog()
   helper. Though it uses va_list we can still allow gcc to check
   the format string, from Mathieu.

10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
    is a bash 4.0+ feature which is not guaranteed to be available
    when calling out to shell, therefore use a more portable variant,
    from Joe.

11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
    instead of relying on the gcc built-in, from Björn.

12) Fix a sock hashmap kmalloc warning reported by syzbot when an
    overly large key size is used in hashmap then causing overflows
    in htab->elem_size. Reject bogus attr->key_size early in the
    sock_hash_alloc(), from Yonghong.

13) Ensure in BPF selftests when urandom_read is being linked that
    --build-id is always enabled so that test_stacktrace_build_id[_nmi]
    won't be failing, from Alexei.

14) Add bitsperlong.h as well as errno.h uapi headers into the tools
    header infrastructure which point to one of the arch specific
    uapi headers. This was needed in order to fix a build error on
    some systems for the BPF selftests, from Sirio.

15) Allow for short options to be used in the xdp_monitor BPF sample
    code. And also a bpf.h tools uapi header sync in order to fix a
    selftest build failure. Both from Prashant.

16) More formally clarify the meaning of ID in the direct packet access
    section of the BPF documentation, from Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 22:47:11 -04:00
Paolo Abeni
32f7b44d0f sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
Currently NOLOCK qdiscs pay a measurable overhead to atomically
manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per
packet in the uncontended scenario with packet rate below the
line rate: on packed dequeue and on the next, failing dequeue attempt.

This changeset moves the bit manipulation into the qdisc_run_{begin,end}
helpers, so that the bit is now flipped only once per packet, with
measurable performance improvement in the uncontended scenario.

This also allows simplifying the qdisc teardown code path - since
qdisc_is_running() is now effective for each qdisc type - and avoid a
possible race between qdisc_run() and dev_deactivate_many(), as now
the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy
dequeuing packets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 12:26:25 -04:00
Debabrata Banerjee
e79c105574 bonding: allow use of tx hashing in balance-alb
The rx load balancing provided by balance-alb is not mutually
exclusive with using hashing for tx selection, and should provide a decent
speed increase because this eliminates spinlocks and cache contention.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 12:15:11 -04:00
Christoph Hellwig
c350637227 proc: introduce proc_create_net{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release.  All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Christoph Hellwig
a2dcdee374 net: move seq_file_single_net to <linux/seq_file_net.h>
This helper deals with single_{open,release}_net internals and thus
belongs here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Christoph Hellwig
93cb5a1f58 ipv{4,6}/raw: simplify ѕeq_file code
Pass the hashtable to the proc private data instead of copying
it into the per-file private data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Christoph Hellwig
f455022166 ipv{4,6}/ping: simplify proc file creation
Remove the pointless ping_seq_afinfo indirection and make the code look
like most other protocols.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Christoph Hellwig
37d849bb42 ipv{4,6}/tcp: simplify procfs registration
Avoid most of the afinfo indirections and just call the proc helpers
directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Christoph Hellwig
a3d2599b24 ipv{4,6}/udp{,lite}: simplify proc registration
Remove a couple indirections to make the code look like most other
protocols.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Christoph Hellwig
fddda2b7b5 proc: introduce proc_create_seq{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
argument and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
John Fastabend
e5cd3abcb3 bpf: sockmap, refactor sockmap routines to work with hashmap
This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.

Most the fallout comes from three changes,

  - Pushing bpf programs into an independent structure so we
    can use it from the htab struct in the next patch.
  - Generalizing helpers to use void *key instead of the hardcoded
    u32.
  - Instead of passing map/key through the metadata we now do
    the lookup inline. This avoids storing the key in the metadata
    which will be useful when keys can be longer than 4 bytes. We
    rename the sk pointers to sk_redir at this point as well to
    avoid any confusion between the current sk pointer and the
    redirect pointer sk_redir.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-15 17:19:59 +02:00
Richard Guy Briggs
cdfb6b341f audit: use inline function to get audit context
Recognizing that the audit context is an internal audit value, use an
access function to retrieve the audit context pointer for the task
rather than reaching directly into the task struct to get it.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: merge fuzz in auditsc.c and selinuxfs.c, checkpatch.pl fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-14 17:24:18 -04:00
Marcelo Ricardo Leitner
81c7288b17 sched: cls: enable verbose logging
Currently, when the rule is not to be exclusively executed by the
hardware, extack is not passed along and offloading failures don't
get logged. The idea was that hardware failures are okay because the
rule will get executed in software then and this way it doesn't confuse
unware users.

But this is not helpful in case one needs to understand why a certain
rule failed to get offloaded. Considering it may have been a temporary
failure, like resources exceeded or so, reproducing it later and knowing
that it is triggering the same reason may be challenging.

The ultimate goal is to improve Open vSwitch debuggability when using
flower offloading.

This patch adds a new flag to enable verbose logging. With the flag set,
extack will be passed to the driver, which will be able to log the
error. As the operation itself probably won't fail (not because of this,
at least), current iproute will already log it as a Warning.

The flag is generic, so it can be reused later. No need to restrict it
just for HW offloading. The command line will follow the syntax that
tc-ebpf already uses, tc ... [ verbose ] ... , and extend its meaning.

For example:
# ./tc qdisc add dev p7p1 ingress
# ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
	flower verbose \
	src_mac ed:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
	src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
Warning: TC offload is disabled on net device.
# echo $?
0
# ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
	flower \
	src_mac ff:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
	src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
# echo $?
0

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-14 16:18:27 -04:00
Richard Guy Briggs
f0b752168d audit: convert sessionid unset to a macro
Use a macro, "AUDIT_SID_UNSET", to replace each instance of
initialization and comparison to an audit session ID.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-14 15:56:35 -04:00
David S. Miller
4f6b15c3a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:

1) Fix handling of simultaneous open TCP connection in conntrack,
   from Jozsef Kadlecsik.

2) Insufficient sanitify check of xtables extension names, from
   Florian Westphal.

3) Skip unnecessary synchronize_rcu() call when transaction log
   is already empty, from Florian Westphal.

4) Incorrect destination mac validation in ebt_stp, from Stephen
   Hemminger.

5) xtables module reference counter leak in nft_compat, from
   Florian Westphal.

6) Incorrect connection reference counting logic in IPVS
   one-packet scheduler, from Julian Anastasov.

7) Wrong stats for 32-bits CPU in IPVS, also from Julian.

8) Calm down sparse error in netfilter core, also from Florian.

9) Use nla_strlcpy to fix compilation warning in nfnetlink_acct
   and nfnetlink_cthelper, again from Florian.

10) Missing module alias in icmp and icmp6 xtables extensions,
    from Florian Westphal.

11) Base chain statistics in nf_tables may be unset/null, from Florian.

12) Fix handling of large matchinfo size in nft_compat, this includes
    one preparation for before this fix. From Florian.

13) Fix bogus EBUSY error when deleting chains due to incorrect reference
    counting from the preparation phase of the two-phase commit protocol.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-13 20:28:47 -04:00
David S. Miller
b2d6cee117 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The bpf syscall and selftests conflicts were trivial
overlapping changes.

The r8169 change involved moving the added mdelay from 'net' into a
different function.

A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts.  I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 20:53:22 -04:00
Eric Dumazet
73a6bab5aa tcp: switch pacing timer to softirq based hrtimer
linux-4.16 got support for softirq based hrtimers.
TCP can switch its pacing hrtimer to this variant, since this
avoids going through a tasklet and some atomic operations.

pacing timer logic looks like other (jiffies based) tcp timers.

v2: use hrtimer_try_to_cancel() in tcp_clear_xmit_timers()
    to correctly release reference on socket if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 12:24:37 -04:00
Florian Fainelli
aab9c4067d net: dsa: Plug in PHYLINK support
Add support for PHYLINK within the DSA subsystem in order to support more
complex devices such as pluggable (SFP) and non-pluggable (SFF) modules, 10G
PHYs, and traditional PHYs. Using PHYLINK allows us to drop some amount of
complexity we had while probing fixed and non-fixed PHYs using Device Tree.

Because PHYLINK separates the Ethernet MAC/port configuration into different
stages, we let switch drivers implement those, and for now, we maintain
functionality by calling dsa_slave_adjust_link() during
phylink_mac_link_{up,down} which provides semantically equivalent steps.

Drivers willing to take advantage of PHYLINK should implement the phylink_mac_*
operations that DSA wraps.

We cannot quite remove the adjust_link() callback just yet, because a number of
drivers rely on that for configuring their "CPU" and "DSA" ports, this is done
dsa_port_setup_phy_of() and dsa_port_fixed_link_register_of() still.

Drivers that utilize fixed links for user-facing ports (e.g: bcm_sf2) will need
to implement phylink_mac_ops from now on to preserve functionality, since PHYLINK
*does not* create a phy_device instance for fixed links.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 12:03:06 -04:00
Florian Fainelli
11d8f3ddab net: dsa: Add PHYLINK switch operations
In preparation for adding support for PHYLINK within DSA, define a number of
operations that we will need and that switch drivers can start implementing.
Proper integration with PHYLINK will follow in subsequent patches.

We start selecting PHYLINK (which implies PHYLIB) in net/dsa/Kconfig
such that drivers can be guaranteed that this dependency is properly
taken care of and can start referencing PHYLINK helper functions without
requiring stubs or anything.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 12:03:05 -04:00
Debabrata Banerjee
21706ee8a4 bonding: send learning packets for vlans on slave
There was a regression at some point from the intended functionality of
commit f60c3704e8 ("bonding: Fix alb mode to only use first level
vlans.")

Given the return value vlan_get_encap_level() we need to store the nest
level of the bond device, and then compare the vlan's encap level to
this. Without this, this check always fails and learning packets are
never sent.

In addition, this same commit caused a regression in the behavior of
balance_alb, which requires learning packets be sent for all interfaces
using the slave's mac in order to load balance properly. For vlan's
that have not set a user mac, we can send after checking one bit.
Otherwise we need send the set mac, albeit defeating rx load balancing
for that vlan.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 11:50:41 -04:00
David Ahern
65a2022e89 net/ipv6: Add fib lookup stubs for use in bpf helper
Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table,
a stub to do a lookup in a specific table, fib6_table_lookup, and
a stub for a full route lookup.

The stubs are needed for core bpf code to handle the case when the
IPv6 module is not builtin.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11 00:10:57 +02:00
David Ahern
138118ec96 net/ipv6: Add fib6_lookup
Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules,
but returns a FIB entry, fib6_info, rather than a dst based rt6_info.
fib6_lookup is any where from 140% (MULTIPLE_TABLES config disabled)
to 60% faster than any of the dst based lookup methods (without custom
rules) and 25% faster with custom rules (e.g., l3mdev rule).

Since the lookup function has a completely different signature,
fib6_rule_action is split into 2 paths: the existing one is
renamed __fib6_rule_action and a new one for the fib6_info path
is added. fib6_rule_action decides which to call based on the
lookup_ptr. If it is fib6_table_lookup then the new path is taken.

Caller must hold rcu lock as no reference is taken on the returned
fib entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11 00:10:56 +02:00
David Ahern
1d053da910 net/ipv6: Extract table lookup from ip6_pol_route
ip6_pol_route is used for ingress and egress FIB lookups. Refactor it
moving the table lookup into a separate fib6_table_lookup that can be
invoked separately and export the new function.

ip6_pol_route now calls fib6_table_lookup and uses the result to generate
a dst based rt6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11 00:10:56 +02:00
David Ahern
3b290a31bb net/ipv6: Rename rt6_multipath_select
Rename rt6_multipath_select to fib6_multipath_select and export it.
A later patch wants access to it similar to IPv4's fib_select_path.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11 00:10:56 +02:00
David Ahern
6454743bc1 net/ipv6: Rename fib6_lookup to fib6_node_lookup
Rename fib6_lookup to fib6_node_lookup to better reflect what it
returns. The fib6_lookup name will be used in a later patch for
an IPv6 equivalent to IPv4's fib_lookup.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11 00:10:56 +02:00
Jon Maxwell
0048369055 tcp: Add mark for TIMEWAIT sockets
This version has some suggestions by Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
races.
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement.
- Factorize code as sk_fullsock() check is not necessary.

Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large
increase in Uplink retransmissions caused by the client never receiving
the final ACK to their FINACK - this ACK misses the mark and routes out
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
Then progate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 17:44:52 -04:00
Joe Perches
03bdfc001c net: ipv4: remove define INET_CSK_DEBUG and unnecessary EXPORT_SYMBOL
INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls.

EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2
pr_debug calls and is also unnecessary as the exported string can
be used directly by these calls.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 17:43:55 -04:00
David S. Miller
b2a9643855 We only have a few fixes this time:
* WMM element validation
  * SAE timeout
  * add-BA timeout
  * docbook parsing
  * a few memory leaks in error paths
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlrzTUQACgkQB8qZga/f
 l8QgqA/+KrZlJPuQjT0xZYYj0rXGc4GzeOJiyDqVz/25sGrmDP8pCnabJhC0l10X
 +4edB85XjgPbQ/OZsnRSRaCQDPOuBUiQqCl33u6T/qngHdbxuVoxx8UykAcr/+z1
 bmrokLBMXWijaacMine9ouCjjuMKXaiyPsZ5aobMpFAKFcVxHDK2hnlMp2bhfqVw
 fuCLtMlvc5G5FGKEfLoT0YzFGhOpcz6Zmnuk58pCoVTUTNDOdbKUVI/4TSGHUnaH
 YrS1I+ERIIIRf8NCdUHqvygigHyEYd4WlOtFnmrEEUrBf3YP/frUtf1oUlg77Kdc
 bPIKD76cO8ONaFu45z4/nBhzQQPz6idUBMfk+gYuZ6bNxrFkJCHNoSkz2AXiOxQF
 tThXUoSMnlrqyW+jJ0JdZHZQqU0vwaJvCh96F97ox3WWPoMlty+JtQ0orVJSkUF1
 LHNmKCJfCJ2LfK5xr0tSnhBBeteyE18zaPZMHkEUh7gUkQeov9+wgvtPFdX/pQ0N
 /sZCs4QQLmHn3R1f/XSytSIsEoIEneCKN/pwSc63M6SqqVDOE4+Mue7+Jzjq4xad
 oY+pFIWusxyUQmC4R01/bQzADROp1vJFUS9/4sDmwsIk+RbRxSGP3o2kHwzl+Wc2
 DxxQv2WN8V0L2i+ru7Ck+Q4zAUgxoIHqHWefKpI4MO9liCcBHFQ=
 =mJ4m
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2018-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
We only have a few fixes this time:
 * WMM element validation
 * SAE timeout
 * add-BA timeout
 * docbook parsing
 * a few memory leaks in error paths
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 17:34:50 -04:00
Davidlohr Bueso
a7950ae821 net/sock: Update memalloc_socks static key to modern api
No changes in refcount semantics -- key init is false; replace

static_key_slow_inc|dec   with   static_branch_inc|dec
static_key_false          with   static_branch_unlikely

Added a '_key' suffix to memalloc_socks, for better self
documentation.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 15:13:34 -04:00
Davidlohr Bueso
5263a98f16 net/ipv4: Update ip_tunnel_metadata_cnt static key to modern api
No changes in refcount semantics -- key init is false; replace

static_key_slow_inc|dec   with   static_branch_inc|dec
static_key_false          with   static_branch_unlikely

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 15:13:33 -04:00
Pablo Neira Ayuso
bb7b40aecb netfilter: nf_tables: bogus EBUSY in chain deletions
When removing a rule that jumps to chain and such chain in the same
batch, this bogusly hits EBUSY. Add activate and deactivate operations
to expression that can be called from the preparation and the
commit/abort phases.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-09 10:09:30 +02:00
Alexander Duyck
9a0d41b359 udp: Do not pass checksum as a parameter to GSO segmentation
This patch is meant to allow us to avoid having to recompute the checksum
from scratch and have it passed as a parameter.

Instead of taking that approach we can take advantage of the fact that the
length that was used to compute the existing checksum is included in the
UDP header.

Finally to avoid the need to invert the result we can just call csum16_add
and csum16_sub directly. By doing this we can avoid a number of
instructions in the loop that is handling segmentation.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08 22:30:06 -04:00
Alexander Duyck
b21c034b3d udp: Do not pass MSS as parameter to GSO segmentation
There is no point in passing MSS as a parameter for for the GSO
segmentation call as it is already available via the shared info for the
skb itself.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08 22:30:06 -04:00
Toke Høiland-Jørgensen
52539ca89f cfg80211: Expose TXQ stats and parameters to userspace
This adds support for exporting the mac80211 TXQ stats via nl80211 by
way of a nested TXQ stats attribute, as well as for configuring the
quantum and limits that were previously only changeable through debugfs.

This commit adds just the nl80211 API, a subsequent commit adds support to
mac80211 itself.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-08 13:19:24 +02:00
Paolo Abeni
d869dea664 flow_dissector: do not rely on implicit casts
This change fixes a couple of type mismatch reported by the sparse
tool, explicitly using the requested type for the offending arguments.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08 00:02:41 -04:00
Paolo Abeni
72a338bcc6 net: core: rework basic flow dissection helper
When the core networking needs to detect the transport offset in a given
packet and parse it explicitly, a full-blown flow_keys struct is used for
storage.
This patch introduces a smaller keys store, rework the basic flow dissect
helper to use it, and apply this new helper where possible - namely in
skb_probe_transport_header(). The used flow dissector data structures
are renamed to match more closely the new role.

The above gives ~50% performance improvement in micro benchmarking around
skb_probe_transport_header() and ~30% around eth_get_headlen(), mostly due
to the smaller memset. Small, but measurable improvement is measured also
in macro benchmarking.

v1 -> v2: use the new helper in eth_get_headlen() and skb_get_poff(),
  as per DaveM suggestion

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08 00:02:36 -04:00
David S. Miller
1822f638e8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2018-05-07

1) Always verify length of provided sadb_key to fix a
   slab-out-of-bounds read in pfkey_add. From Kevin Easton.

2) Make sure that all states are really deleted
   before we check that the state lists are empty.
   Otherwise we trigger a warning.

3) Fix MTU handling of the VTI6 interfaces on
   interfamily tunnels. From Stefano Brivio.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07 23:51:30 -04:00
Wolfram Sang
53bc017f72 net: flow_dissector: fix typo 'can by' to 'can be'
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07 23:47:05 -04:00
David S. Miller
01adc4851a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'.  Thanks to Daniel for the merge resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07 23:35:08 -04:00
Balaji Pothunoori
81d5439da8 cfg80211: average ack rssi support for data frames
Average ack rssi will be given to userspace via NL80211 interface
if firmware is capable. Userspace tool ‘iw’ can process this
information and give the output as one of the fields in
‘iw dev wlanX station dump’.

Example output :

localhost ~ #iw dev wlan-5000mhz station dump Station
34:f3:9a:aa:3b:29 (on wlan-5000mhz)
        inactive time:  5370 ms
        rx bytes:       85321
        rx packets:     576
        tx bytes:       14225
        tx packets:     71
        tx retries:     0
        tx failed:      2
        beacon loss:    0
        rx drop misc:   0
        signal:         -54 dBm
        signal avg:     -53 dBm
        tx bitrate:     866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2
        rx bitrate:     866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2
        avg ack signal: -56 dBm
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       short
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        DTIM period:    2
        beacon interval:100
       short preamble: yes
       short slot time:yes
       connected time: 203 seconds

Main use case is to measure the signal strength of a connected station
to AP. Data packet transmit rates and bandwidth used by station can vary
a lot even if the station is at fixed location, especially if the rates
used are multi stream(2stream, 3stream) rates with different bandwidth(20/40/80 Mhz).
These multi stream rates are sensitive and station can use different transmit power
for each of the rate and bandwidth combinations. RSSI measured from these RX packets
on AP will be not stable and can vary a lot with in a short time.
Whereas 802.11 ack frames from station are sent relatively at a constant
rate (6/12/24 Mbps) with constant bandwidth(20 Mhz).
So average rssi of the ack packets is good and more accurate.

Signed-off-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-07 21:37:20 +02:00
Gregory Greenman
03737001e4 mac80211: add api to set CSA counter in mac80211
Sometimes the most updated CSA counter values are known only
to the device. Add an API to pass this data to mac80211.

Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-07 20:50:15 +02:00
Randy Dunlap
d1361b32e6 mac80211: fix kernel-doc "bad line" warning
Fix 88 instances of a kernel-doc warning:
  ../include/net/mac80211.h:2083: warning: bad line:  >

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-wireless@vger.kernel.org
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-07 15:01:52 +02:00
David S. Miller
90278871d4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree, more relevant updates in this batch are:

1) Add Maglev support to IPVS. Moreover, store lastest server weight in
   IPVS since this is needed by maglev, patches from from Inju Song.

2) Preparation works to add iptables flowtable support, patches
   from Felix Fietkau.

3) Hand over flows back to conntrack slow path in case of TCP RST/FIN
   packet is seen via new teardown state, also from Felix.

4) Add support for extended netlink error reporting for nf_tables.

5) Support for larger timeouts that 23 days in nf_tables, patch from
   Florian Westphal.

6) Always set an upper limit to dynamic sets, also from Florian.

7) Allow number generator to make map lookups, from Laura Garcia.

8) Use hash_32() instead of opencode hashing in IPVS, from Vicent Bernat.

9) Extend ip6tables SRH match to support previous, next and last SID,
   from Ahmed Abdelsalam.

10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez.

11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.

12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables,
   from Taehee Yoo.

13) Unify meta bridge with core nft_meta, then make nft_meta built-in.
   Make rt and exthdr built-in too, again from Florian.

14) Missing initialization of tbl->entries in IPVS, from Cong Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-06 21:51:37 -04:00
Florian Westphal
3a2e86f645 netfilter: nf_nat: remove unused ct arg from lookup functions
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-06 23:33:47 +02:00
David Ahern
8fb11a9a8d net/ipv6: rename rt6_next to fib6_next
This slipped through the cracks in the followup set to the fib6_info flip.
Rename rt6_next to fib6_next.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04 19:54:52 -04:00
David S. Miller
a7b15ab887 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes in selftests Makefile.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04 09:58:56 -04:00
Magnus Karlsson
f61459030e xsk: add Tx queue setup and mmap support
Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
a queue, where the user process can pass frames to be transmitted by
the kernel.

The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:24 -07:00
Björn Töpel
fbfc504a24 bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.

Note that a socket that is bound to certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.

A socket can reside in multiple maps.

v3: Fixed race and simplified code.
v2: Removed one indirection in map lookup.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:24 -07:00
Björn Töpel
c497176cb2 xsk: add Rx receive functions and poll support
Here the actual receive functions of AF_XDP are implemented, that in a
later commit, will be called from the XDP layers.

There's one set of functions for the XDP_DRV side and another for
XDP_SKB (generic).

A new XDP API, xdp_return_buff, is also introduced.

Adding xdp_return_buff, which is analogous to xdp_return_frame, but
acts upon an struct xdp_buff. The API will be used by AF_XDP in future
commits.

Support for the poll syscall is also implemented.

v2: xskq_validate_id did not update cons_tail.
    The entries variable was calculated twice in xskq_nb_avail.
    Squashed xdp_return_buff commit.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:24 -07:00
Magnus Karlsson
965a990984 xsk: add support for bind for Rx
Here, the bind syscall is added. Binding an AF_XDP socket, means
associating the socket to an umem, a netdev and a queue index. This
can be done in two ways.

The first way, creating a "socket from scratch". Create the umem using
the XDP_UMEM_REG setsockopt and an associated fill queue with
XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
setsockopt. Call bind passing ifindex and queue index ("channel" in
ethtool speak).

The second way to bind a socket, is simply skipping the
umem/netdev/queue index, and passing another already setup AF_XDP
socket. The new socket will then have the same umem/netdev/queue index
as the parent so it will share the same umem. You must also set the
flags field in the socket address to XDP_SHARED_UMEM.

v2: Use PTR_ERR instead of passing error variable explicitly.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:23 -07:00
Björn Töpel
b9b6b68e8a xsk: add Rx queue setup and mmap support
Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
a queue, where the kernel can pass completed Rx frames from the kernel
to user process.

The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:23 -07:00
Björn Töpel
c0c77d8fb7 xsk: add user memory registration support sockopt
In this commit the base structure of the AF_XDP address family is set
up. Further, we introduce the abilty register a window of user memory
to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
window is viewed by an AF_XDP socket as a set of equally large
frames. After a user memory registration all frames are "owned" by the
user application, and not the kernel.

v2: More robust checks on umem creation and unaccount on error.
    Call set_page_dirty_lock on cleanup.
    Simplified xdp_umem_reg.

Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:23 -07:00
Petr Machata
816a3bed95 switchdev: Add fdb.added_by_user to switchdev notifications
The following patch enables sending notifications also for events on FDB
entries that weren't added by the user. Give the drivers the information
necessary to distinguish between the two origins of FDB entries.

To maintain the current behavior, have switchdev-implementing drivers
bail out on notifications about non-user-added FDB entries. In case of
mlxsw driver, allow a call to mlxsw_sp_span_respin() so that SPAN over
bridge catches up with the changed FDB.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-03 13:46:47 -04:00
Ido Schimmel
30ca22e4a5 ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"
This reverts commit edd7ceb782 ("ipv6: Allow non-gateway ECMP for
IPv6").

Eric reported a division by zero in rt6_multipath_rebalance() which is
caused by above commit that considers identical local routes to be
siblings. The division by zero happens because a nexthop weight is not
set for local routes.

Revert the commit as it does not fix a bug and has side effects.

To reproduce:

# ip -6 address add 2001:db8::1/64 dev dummy0
# ip -6 address add 2001:db8::1/64 dev dummy1

Fixes: edd7ceb782 ("ipv6: Allow non-gateway ECMP for IPv6")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02 16:33:02 -04:00
Dave Watson
c212d2c7fc net/tls: Don't recursively call push_record during tls_write_space callbacks
It is reported that in some cases, write_space may be called in
do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again:

[  660.468802]  ? do_tcp_sendpages+0x8d/0x580
[  660.468826]  ? tls_push_sg+0x74/0x130 [tls]
[  660.468852]  ? tls_push_record+0x24a/0x390 [tls]
[  660.468880]  ? tls_write_space+0x6a/0x80 [tls]
...

tls_push_sg already does a loop over all sending sg's, so ignore
any tls_write_space notifications until we are done sending.
We then have to call the previous write_space to wake up
poll() waiters after we are done with the send loop.

Reported-by: Andre Tomt <andre@tomt.net>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 18:57:52 -04:00
Thomas Winter
edd7ceb782 ipv6: Allow non-gateway ECMP for IPv6
It is valid to have static routes where the nexthop
is an interface not an address such as tunnels.
For IPv4 it was possible to use ECMP on these routes
but not for IPv6.

Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
Cc: David Ahern <dsahern@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 14:23:33 -04:00
Marcelo Ricardo Leitner
6d3e8aa876 sctp: allow sctp_init_cause to return errors
And do so if the skb doesn't have enough space for the payload.
This is a preparation for the next patch.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 12:09:35 -04:00
Ilya Lesokhin
e8f6979981 net/tls: Add generic NIC offload infrastructure
This patch adds a generic infrastructure to offload TLS crypto to a
network device. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path.
Leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same
API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmits the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 09:42:47 -04:00
Boris Pismenny
f66de3ee2c net/tls: Split conf to rx + tx
In TLS inline crypto, we can have one direction in software
and another in hardware. Thus, we split the TLS configuration to separate
structures for receive and transmit.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 09:42:47 -04:00
Ilya Lesokhin
ebf4e808fa net: Add Software fallback infrastructure for socket dependent offloads
With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 09:42:46 -04:00
Ilya Lesokhin
6dac152355 tcp: Add clean acked data hook
Called when a TCP segment is acknowledged.
Could be used by application protocols who hold additional
metadata associated with the stream data.

This is required by TLS device offload to release
metadata associated with acknowledged TLS records.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 09:42:46 -04:00
Marcelo Ricardo Leitner
22d7be267e sctp: remove sctp_transport_pmtu_check
We are now keeping the MTU information synced between asoc, transport
and dst, which makes the check at sctp_packet_config() not needed
anymore. As it was the sole caller to this function, lets remove it.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 14:35:23 -04:00
Marcelo Ricardo Leitner
6ff0f871c2 sctp: introduce sctp_dst_mtu
Which makes sure that the MTU respects the minimum value of
SCTP_DEFAULT_MINSEGMENT and that it is correctly aligned.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 14:35:23 -04:00
Marcelo Ricardo Leitner
2521680e18 sctp: remove sctp_assoc_pending_pmtu
No need for this helper.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 14:35:23 -04:00
Marcelo Ricardo Leitner
2f5e3c9df6 sctp: introduce sctp_assoc_update_frag_point
and avoid the open-coded versions of it.

Now sctp_datamsg_from_user can just re-use asoc->frag_point as it will
always be updated.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 14:35:23 -04:00
Marcelo Ricardo Leitner
feddd6c1af sctp: introduce sctp_mtu_payload
When given a MTU, this function calculates how much payload we can carry
on it. Without a MTU, it calculates the amount of header overhead we
have.

So that when we have extra overhead, like the one added for IP options
on SELinux patches, it is easier to handle it.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 14:35:23 -04:00
Marcelo Ricardo Leitner
c4b2893dae sctp: introduce sctp_assoc_set_pmtu
All changes to asoc PMTU should now go through this wrapper, making it
easier to track them and to do other actions upon it.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 14:35:22 -04:00
Marcelo Ricardo Leitner
c45698f896 sctp: remove old and unused SCTP_MIN_PMTU
This value is not used anywhere in the code. In essence it is a
duplicate of SCTP_DEFAULT_MINSEGMENT, which is used by the stack.

SCTP_MIN_PMTU value makes more sense, but we should not change to it now
as it would risk breaking applications.

So this patch removes SCTP_MIN_PMTU and adjust the comment above it.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 14:35:22 -04:00
Florian Fainelli
cf96357303 net: dsa: Allow providing PHY statistics from CPU port
Implement the same type of ethtool diversion that we have for
ETH_SS_STATS and make it work with ETH_SS_PHY_STATS. This allows
providing PHY level statistics for CPU ports that are directly
connecting to a PHY device.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 11:53:03 -04:00
Florian Fainelli
89f0904834 net: dsa: Pass stringset to ethtool operations
Up until now we largely assumed that we were interested in ETH_SS_STATS
type of strings for all ethtool operations, this is about to change with
the introduction of additional string sets, e.g: ETH_SS_PHY_STATS.
Update all functions to take an appropriate stringset argument and act
on it when it is different than ETH_SS_STATS for now.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 11:53:03 -04:00
Pablo Neira Ayuso
146cd6b5d5 Merge tag 'ipvs-for-v4.18' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
IPVS Updates for v4.18

please consider these IPVS enhancements for v4.18.

* Whitepace cleanup

* Add Maglev hashing algorithm as a IPVS scheduler

  Inju Song says "Implements the Google's Maglev hashing algorithm as a
  IPVS scheduler.  Basically it provides consistent hashing but offers some
  special features about disruption and load balancing.

  1) minimal disruption: when the set of destinations changes,
     a connection will likely be sent to the same destination
     as it was before.

  2) load balancing: each destination will receive an almost
     equal number of connections.

 Seel also: [3.4 Consistent Hasing] in
 https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf
 "

* Fix to correct implementation of Knuth's multiplicative hashing
  which is used in sh/dh/lblc/lblcr algorithms. Instead the
  implementation provided by the hash_32() macro is used.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27 00:16:14 +02:00
Florian Westphal
d0103158cf netfilter: nf_tables: merge exthdr expression into nft core
before:
   text    data     bss     dec     hex filename
   5056     844       0    5900    170c net/netfilter/nft_exthdr.ko
 102456    2316     401  105173   19ad5 net/netfilter/nf_tables.ko

after:
 106410    2392     401  109203   1aa93 net/netfilter/nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27 00:00:56 +02:00
Florian Westphal
ae1bc6a9f3 netfilter: nf_tables: merge rt expression into nft core
before:
   text    data     bss     dec     hex filename
   2657     844       0    3501     dad net/netfilter/nft_rt.ko
 100826    2240     401  103467   1942b net/netfilter/nf_tables.ko
after:
   2657     844       0    3501     dad net/netfilter/nft_rt.ko
 102456    2316     401  105173   19ad5 net/netfilter/nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27 00:00:55 +02:00
Florian Westphal
8a22543c8e netfilter: nf_tables: make meta expression builtin
size net/netfilter/nft_meta.ko
   text    data     bss     dec     hex filename
   5826     936       1    6763    1a6b net/netfilter/nft_meta.ko
  96407    2064     400   98871   18237 net/netfilter/nf_tables.ko

after:
 100826    2240     401  103467   1942b net/netfilter/nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27 00:00:46 +02:00
Willem de Bruijn
2e8de85763 udp: add gso segment cmsg
Allow specifying segment size in the send call.

The new control message performs the same function as socket option
UDP_SEGMENT while avoiding the extra system call.

[ Export udp_cmsg_send for ipv6. -DaveM ]

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:08:51 -04:00
Willem de Bruijn
bec1f6f697 udp: generate gso with UDP_SEGMENT
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.

The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

    tcp tso
     3197 MB/s 54232 msg/s 54232 calls/s
         6,457,754,262      cycles

    tcp gso
     1765 MB/s 29939 msg/s 29939 calls/s
        11,203,021,806      cycles

    tcp without tso/gso *
      739 MB/s 12548 msg/s 12548 calls/s
        11,205,483,630      cycles

    udp
      876 MB/s 14873 msg/s 624666 calls/s
        11,205,777,429      cycles

    udp gso
     2139 MB/s 36282 msg/s 36282 calls/s
        11,204,374,561      cycles

   [*] after reverting commit 0a6b2a1dc2
       ("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

  perf stat -a -C 12 -e cycles \
    ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:08:04 -04:00
Willem de Bruijn
ee80d1ebe5 udp: add udp gso
Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.

UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.

The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).

IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.

[ Export __udp_gso_segment for ipv6. -DaveM ]

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:07:42 -04:00
Willem de Bruijn
1cd7884dfd udp: expose inet cork to udp
UDP segmentation offload needs access to inet_cork in the udp layer.
Pass the struct to ip(6)_make_skb instead of allocating it on the
stack in that function itself.

This patch is a noop otherwise.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:06:46 -04:00
Xin Long
22e15b6f9c sctp: remove the unused sctp_assoc_is_match function
After Commit 4f00878126 ("sctp: apply rhashtable api to send/recv
path"), there's no place using sctp_assoc_is_match, so remove it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 14:09:23 -04:00
Marcelo Ricardo Leitner
47b3ba5175 sctp: fix const parameter violation in sctp_make_sack
sctp_make_sack() make changes to the asoc and this cast is just
bypassing the const attribute. As there is no need to have the const
there, just remove it and fix the violation.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 13:21:26 -04:00
Roopa Prabhu
9ce33e4653 neighbour: support for NTF_EXT_LEARNED flag
This patch extends NTF_EXT_LEARNED support to the neighbour system.
Example use-case: An Ethernet VPN implementation (eg in FRR routing suite)
can use this flag to add dynamic reachable external neigh entires
learned via control plane. The use of neigh NTF_EXT_LEARNED in this
patch is consistent with its use with bridge and vxlan fdb entries.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 13:19:59 -04:00