Commit Graph

677699 Commits

Author SHA1 Message Date
Jiri Pirko
6529eaba33 net: sched: introduce tcf block infrastructure
Currently, the filter chains are directly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chain sharing among qdiscs, there is a need for a common
object that would hold the chains. This introduces such an object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs so that entry into tcf_block processing does not add the
overhead of possible multiple pointer dereferences on the fast path.
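
For illustration, a qdisc might use the new helpers roughly as follows (a
minimal sketch; the private-struct and callback names are hypothetical, and
the helper signatures follow the description above):

  struct my_qdisc_priv {
          struct tcf_block *block;
          struct tcf_proto __rcu *filter_list;  /* kept for the fast path */
  };

  static int my_qdisc_init(struct Qdisc *sch)
  {
          struct my_qdisc_priv *q = qdisc_priv(sch);

          /* obtain a block to hold this qdisc's filter chains */
          return tcf_block_get(&q->block, &q->filter_list);
  }

  static void my_qdisc_destroy(struct Qdisc *sch)
  {
          struct my_qdisc_priv *q = qdisc_priv(sch);

          /* release the block and the chains it holds */
          tcf_block_put(q->block);
  }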

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
87d83093bf net: sched: move tc_classify function to cls_api.c
Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
David S. Miller
c63fbb0b41 Merge branch 'dsa-sort'
Andrew Lunn says:

====================
net: dsa: Sort various lists

As we gain more DSA drivers and tagging protocols, the lists are
getting a bit unruly. Do some sorting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:19:40 -04:00
Andrew Lunn
ec34e93f99 drivers: net: DSA: Sort drivers
With more drivers being added, it is time to sort the drivers to
impose some order.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:19:40 -04:00
Andrew Lunn
eb7b721129 net: dsa: Sort DSA tagging protocol drivers
With more tag protocols being added, regain some order by sorting the
entries in various places.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:19:40 -04:00
David S. Miller
f917174caf Merge branch 'bnxt_en-DCBX-fixes'
Michael Chan says:

====================
bnxt_en: DCBX fixes.

2 bug fixes for the case where the NIC's firmware DCBX agent is enabled.
With these fixes, we will return the proper information to lldpad.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:12:50 -04:00
Michael Chan
f667724b99 bnxt_en: Check status of firmware DCBX agent before setting DCB_CAP_DCBX_HOST.
Otherwise, all the host based DCBX settings from lldpad will fail if the
firmware DCBX agent is running.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:12:49 -04:00
Michael Chan
87fe603274 bnxt_en: Call bnxt_dcb_init() after getting firmware DCBX configuration.
In the current code, bnxt_dcb_init() is called too early, before we
determine if the firmware DCBX agent is running or not.  As a result,
we are not setting the DCB_CAP_DCBX_HOST and DCB_CAP_DCBX_LLD_MANAGED
flags properly to report to DCBNL.
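
A hedged sketch of the intended ordering (the query helper and field names,
other than the DCB_CAP_DCBX_* constants, are assumptions):

  /* first ask the firmware whether its DCBX agent is running, then init */
  fw_agent = bnxt_query_fw_dcbx_agent(bp);
  bp->dcbx_cap = DCB_CAP_DCBX_VER_IEEE;
  if (fw_agent)
          bp->dcbx_cap |= DCB_CAP_DCBX_LLD_MANAGED;
  else
          bp->dcbx_cap |= DCB_CAP_DCBX_HOST;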

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:12:49 -04:00
Eric Dumazet
9142e9007f net: fix compile error in skb_orphan_partial()
If CONFIG_INET is not set, net/core/sock.c cannot compile:

net/core/sock.c: In function ‘skb_orphan_partial’:
net/core/sock.c:1810:2: error: implicit declaration of function
‘skb_is_tcp_pure_ack’ [-Werror=implicit-function-declaration]
  if (skb_is_tcp_pure_ack(skb))
  ^

Fix this by always including <net/tcp.h>
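
The shape of the fix, for reference (a sketch only):

  /* net/core/sock.c */
  #include <net/tcp.h>    /* provides skb_is_tcp_pure_ack() */

  void skb_orphan_partial(struct sk_buff *skb)
  {
          if (skb_is_tcp_pure_ack(skb))
                  return;
          /* remainder of the function unchanged */
  }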

Fixes: f6ba8d33cf ("netem: fix skb_orphan_partial()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:10:13 -04:00
Liam R. Howlett
48078d2dac sparc/ftrace: Fix ftrace graph time measurement
The ftrace function_graph time measurements of a given function are not
accurate compared to those recorded by ftrace using the function
filters.  This change pulls the x86_64 fix from 'commit 722b3c7469
("ftrace/graph: Trace function entry before updating index")' into the
sparc specific prepare_ftrace_return which stops ftrace from
counting interrupted tasks in the time measurement.

Example measurements for select_task_rq_fair running "hackbench 100
process 1000":

              |  tracing/trace_stat/function0  |  function_graph
 Before patch |  2.802 us                      |  4.255 us
 After patch  |  2.749 us                      |  3.094 us

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 12:07:47 -07:00
Orlando Arias
deba804c90 sparc: Fix -Wstringop-overflow warning
Greetings,

GCC 7 introduced the -Wstringop-overflow flag to detect buffer overflows
in calls to string handling functions [1][2]. Due to the way
``empty_zero_page'' is declared in arch/sparc/include/setup.h, this
causes a warning to trigger at compile time in the function mem_init(),
which is subsequently converted to an error. The ensuing patch fixes
this issue and aligns the declaration of empty_zero_page to that of
other architectures. Thank you.

Cheers,
Orlando.

[1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg02308.html
[2] https://gcc.gnu.org/gcc-7/changes.html

Signed-off-by: Orlando Arias <oarias@knights.ucf.edu>

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 12:07:46 -07:00
Nitin Gupta
b6c41cb050 sparc64: Fix mapping of 64k pages with MAP_FIXED
An incorrect huge page alignment check caused mmap failure for 64K pages
when MAP_FIXED is used with an address not aligned to HPAGE_SIZE.

Orabug: 25885991

Fixes: dcd1912d21 ("sparc64: Add 64K page size support")
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 12:07:01 -07:00
Craig Gallek
2423496af3 ipv6: Prevent overrun when parsing v6 header options
The KASAN warning reported below was discovered with a syzkaller
program.  The reproducer is basically:
  int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
  send(s, &one_byte_of_data, 1, MSG_MORE);
  send(s, &more_than_mtu_bytes_data, 2000, 0);

The socket() call sets the nexthdr field of the v6 header to
NEXTHDR_HOP, the first send call primes the payload with a non-zero
byte of data, and the second send call triggers the fragmentation path.

The fragmentation code tries to parse the header options in order
to figure out where to insert the fragment option.  Since nexthdr points
to an invalid option, the calculated size of the network header can be
made much larger than the linear section of the skb, and data
is read outside of it.

This fix makes ip6_find_1stfragopt return an error if it detects an
out-of-bounds read.
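
The shape of the added check is roughly the following (a sketch; the local
variable names are assumptions):

  /* inside the extension header walk of ip6_find_1stfragopt() */
  if (offset + sizeof(struct ipv6_opt_hdr) > packet_len)
          return -EINVAL;  /* option header would lie past the linear data */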

[   42.361487] ==================================================================
[   42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
[   42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789
[   42.366469]
[   42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
[   42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[   42.368824] Call Trace:
[   42.369183]  dump_stack+0xb3/0x10b
[   42.369664]  print_address_description+0x73/0x290
[   42.370325]  kasan_report+0x252/0x370
[   42.370839]  ? ip6_fragment+0x11c8/0x3730
[   42.371396]  check_memory_region+0x13c/0x1a0
[   42.371978]  memcpy+0x23/0x50
[   42.372395]  ip6_fragment+0x11c8/0x3730
[   42.372920]  ? nf_ct_expect_unregister_notifier+0x110/0x110
[   42.373681]  ? ip6_copy_metadata+0x7f0/0x7f0
[   42.374263]  ? ip6_forward+0x2e30/0x2e30
[   42.374803]  ip6_finish_output+0x584/0x990
[   42.375350]  ip6_output+0x1b7/0x690
[   42.375836]  ? ip6_finish_output+0x990/0x990
[   42.376411]  ? ip6_fragment+0x3730/0x3730
[   42.376968]  ip6_local_out+0x95/0x160
[   42.377471]  ip6_send_skb+0xa1/0x330
[   42.377969]  ip6_push_pending_frames+0xb3/0xe0
[   42.378589]  rawv6_sendmsg+0x2051/0x2db0
[   42.379129]  ? rawv6_bind+0x8b0/0x8b0
[   42.379633]  ? _copy_from_user+0x84/0xe0
[   42.380193]  ? debug_check_no_locks_freed+0x290/0x290
[   42.380878]  ? ___sys_sendmsg+0x162/0x930
[   42.381427]  ? rcu_read_lock_sched_held+0xa3/0x120
[   42.382074]  ? sock_has_perm+0x1f6/0x290
[   42.382614]  ? ___sys_sendmsg+0x167/0x930
[   42.383173]  ? lock_downgrade+0x660/0x660
[   42.383727]  inet_sendmsg+0x123/0x500
[   42.384226]  ? inet_sendmsg+0x123/0x500
[   42.384748]  ? inet_recvmsg+0x540/0x540
[   42.385263]  sock_sendmsg+0xca/0x110
[   42.385758]  SYSC_sendto+0x217/0x380
[   42.386249]  ? SYSC_connect+0x310/0x310
[   42.386783]  ? __might_fault+0x110/0x1d0
[   42.387324]  ? lock_downgrade+0x660/0x660
[   42.387880]  ? __fget_light+0xa1/0x1f0
[   42.388403]  ? __fdget+0x18/0x20
[   42.388851]  ? sock_common_setsockopt+0x95/0xd0
[   42.389472]  ? SyS_setsockopt+0x17f/0x260
[   42.390021]  ? entry_SYSCALL_64_fastpath+0x5/0xbe
[   42.390650]  SyS_sendto+0x40/0x50
[   42.391103]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.391731] RIP: 0033:0x7fbbb711e383
[   42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383
[   42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003
[   42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018
[   42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad
[   42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00
[   42.397257]
[   42.397411] Allocated by task 3789:
[   42.397702]  save_stack_trace+0x16/0x20
[   42.398005]  save_stack+0x46/0xd0
[   42.398267]  kasan_kmalloc+0xad/0xe0
[   42.398548]  kasan_slab_alloc+0x12/0x20
[   42.398848]  __kmalloc_node_track_caller+0xcb/0x380
[   42.399224]  __kmalloc_reserve.isra.32+0x41/0xe0
[   42.399654]  __alloc_skb+0xf8/0x580
[   42.400003]  sock_wmalloc+0xab/0xf0
[   42.400346]  __ip6_append_data.isra.41+0x2472/0x33d0
[   42.400813]  ip6_append_data+0x1a8/0x2f0
[   42.401122]  rawv6_sendmsg+0x11ee/0x2db0
[   42.401505]  inet_sendmsg+0x123/0x500
[   42.401860]  sock_sendmsg+0xca/0x110
[   42.402209]  ___sys_sendmsg+0x7cb/0x930
[   42.402582]  __sys_sendmsg+0xd9/0x190
[   42.402941]  SyS_sendmsg+0x2d/0x50
[   42.403273]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.403718]
[   42.403871] Freed by task 1794:
[   42.404146]  save_stack_trace+0x16/0x20
[   42.404515]  save_stack+0x46/0xd0
[   42.404827]  kasan_slab_free+0x72/0xc0
[   42.405167]  kfree+0xe8/0x2b0
[   42.405462]  skb_free_head+0x74/0xb0
[   42.405806]  skb_release_data+0x30e/0x3a0
[   42.406198]  skb_release_all+0x4a/0x60
[   42.406563]  consume_skb+0x113/0x2e0
[   42.406910]  skb_free_datagram+0x1a/0xe0
[   42.407288]  netlink_recvmsg+0x60d/0xe40
[   42.407667]  sock_recvmsg+0xd7/0x110
[   42.408022]  ___sys_recvmsg+0x25c/0x580
[   42.408395]  __sys_recvmsg+0xd6/0x190
[   42.408753]  SyS_recvmsg+0x2d/0x50
[   42.409086]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.409513]
[   42.409665] The buggy address belongs to the object at ffff88000969e780
[   42.409665]  which belongs to the cache kmalloc-512 of size 512
[   42.410846] The buggy address is located 24 bytes inside of
[   42.410846]  512-byte region [ffff88000969e780, ffff88000969e980)
[   42.411941] The buggy address belongs to the page:
[   42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
[   42.413298] flags: 0x100000000008100(slab|head)
[   42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c
[   42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000
[   42.415074] page dumped because: kasan: bad access detected
[   42.415604]
[   42.415757] Memory state around the buggy address:
[   42.416222]  ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   42.416904]  ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   42.418273]                    ^
[   42.418588]  ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   42.419273]  ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   42.419882] ==================================================================

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:55:59 -04:00
Felix Manlunas
0d9a599784 liquidio: fix PF falsely indicating success at setting MAC address of a nonexistent VF
In the function assigned to .ndo_set_vf_mac, check the validity of the
vfidx argument before proceeding to tell the firmware to set the VF MAC
address.
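
The shape of such a check (how the driver tracks the number of allocated VFs
is an assumption here):

  /* in the .ndo_set_vf_mac handler, before talking to firmware */
  if (vfidx < 0 || vfidx >= num_vfs_alloced)
          return -EINVAL;  /* nonexistent VF: fail instead of faking success */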

Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:48:29 -04:00
Rick Farrington
e1e3ce6236 liquidio: fix insmod failure when multiple NICs are plugged in
When multiple liquidio NICs are plugged in, the first insmod of the PF
driver succeeds.  But after an rmmod, a subsequent insmod fails.  The reason
is that during rmmod, the PF driver resets the Octeon of only one of the
NICs; it neglects to reset the Octeons of the other NICs.

Fix the insmod failure by adding the missing Octeon resets at rmmod.  Keep
a per-NIC refcount that indicates the number of active PFs in a given NIC.
When the refcount goes to zero, then reset the Octeon of that NIC.
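
The described bookkeeping can be sketched as follows (the field and reset
helper names are hypothetical):

  /* probe: one more PF is active on this NIC's Octeon */
  atomic_inc(&oct->active_pf_count);

  /* rmmod: reset the Octeon only when the last PF goes away */
  if (atomic_dec_and_test(&oct->active_pf_count))
          octeon_reset_device(oct);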

Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:48:29 -04:00
Vivien Didelot
8b0d3ea555 net: dsa: store CPU port pointer in the tree
A dsa_switch_tree instance holds a dsa_switch pointer and a port index
to identify the switch port to which the CPU is attached.

Now that the DSA layer has a dsa_port structure to hold this data, use
it to point to the switch CPU port.

This patch simply substitutes s/dst->cpu_switch/dst->cpu_dp->ds/ and
s/dst->cpu_port/dst->cpu_dp->index/.
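
In other words, accesses to the CPU port change as follows:

  /* before */
  ds   = dst->cpu_switch;
  port = dst->cpu_port;

  /* after */
  ds   = dst->cpu_dp->ds;
  port = dst->cpu_dp->index;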

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:19:12 -04:00
David S. Miller
631581bf46 Merge branch 'mlxsw-Preparations-for-restructuring'
Jiri Pirko says:

====================
mlxsw: Preparations for restructuring

This patchset doesn't introduce any functional changes and is merely meant
to make the code base more receptive to the upcoming restructuring.

The first six patches mainly shuffle code in order to reduce the scope of
structs that shouldn't be defined in the main driver header. Most of them
will be later expanded, so it makes sense to correctly place them now.

The last patches mostly simplify bridge-related functions, so that they
could be more easily modified later on.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:56 -04:00
Ido Schimmel
45a4a16cdb mlxsw: spectrum: Default ports to non-virtual mode
In virtual mode, packets are classified to FIDs based on their ingress
port and VLAN whereas in non-virtual mode only the VLAN is taken into
account.

Currently ports are initialized to use virtual mode due to the presence
of the PVID vPort. However, we're going to transition ports between both
modes based on the FIDs they use and not merely based on the presence on
a VLAN upper. Therefore, during initialization, no mode will be
explicitly set.

Since the Programmer's Reference Manual (PRM) doesn't specify a default,
explicitly set the port to non-virtual mode and later transition the
port between both modes based on the FIDs it uses.

In a follow-up patchset, this step will be moved to the common FID core
where it logically belongs.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
b02eae9b91 mlxsw: spectrum: Move PVID code to appropriate place
PVID is a port attribute and should therefore reside in the main driver
file and not the switchdev specific one.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
7cbc4277c7 mlxsw: spectrum_switchdev: Don't batch learning operations
We no longer batch VLAN operations, so there's no need to set the
learning state for a range of VLANs.

Use a common function to set the learning state for a Port-VLAN, thereby
making the code saner and more receptive to upcoming changes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
45bfe6b433 mlxsw: spectrum_switchdev: Don't batch STP operations
Simplify the code by using the common function that sets an STP state
for a Port-VLAN and remove the existing one that tries to batch it for
several VLANs.

This will help us in a follow-up patchset to introduce a unified
infrastructure for bridge ports, regardless if the bridge is VLAN-aware
or not.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
fe9ccc785d mlxsw: spectrum_switchdev: Don't batch VLAN operations
switchdev's VLAN object has the ability to describe a range of VLAN IDs,
but this is only used when VLAN operations are done using the SELF flag,
which is something we would like to remove as it allows one to bypass
the bridge driver.

Do VLAN operations on a per-VLAN basis, thereby simplifying the code and
preparing it for refactoring in a follow-up patchset.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
d341e2ce6b mlxsw: spectrum_switchdev: Remove redundant check
Since commit 97c242902c ("switchdev: Execute bridge ndos only for
bridge ports"), the switchdev code checks that the port is bridged, so there
is no need to perform the same check in the driver.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
348b8fc3cf mlxsw: spectrum_router: Initialize RIFs in a separate function
The router interfaces (RIFs) array is currently initialized together
with the general router configuration. However, in a follow-up patchset
we're going to introduce a common RIF core that will require us to
initialize more RIF constructs, so move the RIF initialization to its
own function.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
7e39d1153d mlxsw: spectrum_router: Move FIB notification block to router struct
The FIB notification block logically belongs inside the router specific
struct, so move it there.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
5f9efffbdb mlxsw: spectrum_router: Move RIFs array to its rightful place
The router interfaces (RIFs) array is of no interest to code outside the
routing realm, so declare it inside the router specific struct instead
of the chip-wide one.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
5f6935c6a4 mlxsw: spectrum_switchdev: Reduce scope of bridge struct
Some attributes in the global chip struct are only relevant for bridge
operation, so encapsulate them in their own struct that isn't exposed to
non-bridge code.

This will also help us later, when we add more bridge-specific
attributes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
9011b677e7 mlxsw: spectrum_router: Reduce scope of router struct
In a similar fashion to the previous patch, the router structure
('mlxsw_sp_router') doesn't need to be accessible to anyone but the
router code located in spectrum_router.c.

Make this apparent and reduce its scope by defining it there.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Ido Schimmel
33cbd87cc0 mlxsw: spectrum_buffer: Reduce scope of shared buffer struct
The shared buffer structure ('mlxsw_sp_sb') doesn't need to be
accessible to anyone but the shared buffer code located in
spectrum_buffers.c.

Make this apparent and reduce its scope by defining it there.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 14:06:54 -04:00
Masahiro Yamada
05d8cba4a1 kbuild: skip install/check of headers right under uapi directories
Since commit 61562f981e ("uapi: export all arch specifics
directories"), "make INSTALL_HDR_PATH=$root/usr headers_install"
deletes standard glibc headers and others in $(root)/usr/include.

The cause of the issue is that headers_install now starts descending
from arch/$(hdr-arch)/include/uapi with $(root)/usr/include for its
destination when installing asm headers.  So, headers already there
are assumed to be unwanted.

When headers_install starts descending from include/uapi with
$(root)/usr/include for its destination, it works around the problem
by creating a dummy destination $(root)/usr/include/uapi, but this
is tricky.

The clean way to fix the problem is to skip the headers install/check
in include/uapi and arch/$(hdr-arch)/include/uapi, because we know
there are only sub-directories in uapi directories.  A good side
effect is that the empty destination $(root)/usr/include/uapi will go
away.

I am also removing the trailing slash in the headers_check target to
skip checking in arch/$(hdr-arch)/include/uapi.

Fixes: 61562f981e ("uapi: export all arch specifics directories")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
2017-05-18 02:17:45 +09:00
Ganesh Goudar
29db398418 cxgb4: add new T5 pci device id
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 12:02:27 -04:00
Ganesh Goudar
85eacf3f42 cxgb4: reduce resource allocation in kdump kernel
When is_kdump_kernel() is true, reduce the memory footprint of
cxgb4 by using a single "Queue Set".
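
A minimal sketch of the described gating (the queue-set variable is an
assumption):

  #include <linux/crash_dump.h>

  /* when sizing the queue sets */
  if (is_kdump_kernel())
          num_queue_sets = 1;  /* shrink the footprint in the crash kernel */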

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 12:02:27 -04:00
Ihar Hrachyshka
77d7123342 neighbour: update neigh timestamps iff update is effective
It's a common practice to send gratuitous ARPs after moving an
IP address to another device to speed up healing of a service. To
fulfill service availability constraints, the timing of network peers
updating their caches to point to a new location of an IP address can be
particularly important.

Sometimes neigh_update calls won't touch either lladdr or state, for
example if an update arrives within the locktime interval. The neigh->updated
value is tested by the protocol specific neigh code, which in turn
will influence whether NEIGH_UPDATE_F_OVERRIDE gets set in the
call to neigh_update() or not. As a result, we may effectively ignore
the update request, bailing out of touching the neigh entry, except that
we still bump its timestamps inside neigh_update.

This may be a problem for updates arriving in quick succession. For
example, consider the following scenario:

A service is moved to another device with its IP address. The new device
sends three gratuitous ARP requests into the network with ~1 seconds
interval between them. Just before the first request arrives at one of
the network peer nodes, its neigh entry for the IP address transitions from
STALE to DELAY.  This transition, among other things, updates
neigh->updated. Once the kernel receives the first gratuitous ARP, it
ignores it because its arrival time is inside the locktime interval. The
kernel still bumps neigh->updated. Then the second gratuitous ARP
request arrives, and it's also ignored because it's still in the (new)
locktime interval. Same happens for the third request. The node
eventually heals itself (after delay_first_probe_time seconds since the
initial transition to DELAY state), but it just wasted some time and
requires a new ARP request/reply round trip. This unfortunate behaviour
both puts more load on the network and reduces service availability.

This patch changes neigh_update so that it bumps neigh->updated (as well
as neigh->confirmed) only once we are sure that either the lladdr or the
entry state will change. In the scenario described above, this means that
the second gratuitous ARP request will actually update the entry lladdr.

Ideally, we would update the neigh entry on the very first gratuitous
ARP request. The locktime mechanism is designed to ignore ARP updates in
a short timeframe after a previous ARP update was honoured by the kernel
layer. This would require tracking timestamps for state transitions
separately from timestamps when actual updates are received. This would
probably involve changes to the neighbour struct. Therefore, the patch
doesn't tackle the issue of the first gratuitous ARP being ignored,
leaving it for a follow-up.
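
The intended behaviour inside neigh_update() can be sketched as follows (the
condition variables are placeholders for the actual checks):

  /* bump timestamps only when the update takes effect */
  if (lladdr_changed || new_state != old_state) {
          neigh->updated = jiffies;
          if (new_state & NUD_CONNECTED)
                  neigh->confirmed = jiffies;
  }
  /* otherwise leave neigh->updated and neigh->confirmed untouched */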

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 11:41:38 -04:00
Ihar Hrachyshka
23d268eb24 arp: honour gratuitous ARP _replies_
When arp_accept is 1, gratuitous ARPs are supposed to override matching
entries irrespective of whether they arrive during locktime. This was
implemented in commit 56022a8fdd ("ipv4: arp: update neighbour address
when a gratuitous arp is received and arp_accept is set").

There is a glitch in the patch though. RFC 2002, section 4.6, "ARP,
Proxy ARP, and Gratuitous ARP", defines gratuitous ARPs so that they can
be either of Request or Reply type. Those Reply gratuitous ARPs can be
triggered with standard tooling, for example, arping -A option does just
that.

This patch fixes the glitch, making both Request and Reply flavours of
gratuitous ARPs behave identically.

As per RFC, if gratuitous ARPs are of Reply type, their Target Hardware
Address field should also be set to the link-layer address to which this
cache entry should be updated. The field is present in ARP over Ethernet
but not in IEEE 1394. In this patch, I don't consider any broadcast
ARP replies as gratuitous if the field is not present, to conform to the
standard. It's not clear whether there is such a thing for IEEE 1394 as
a gratuitous ARP reply; until it's cleared up, we will ignore such
broadcasts. Note that they will still update existing ARP cache entries,
assuming they arrive outside the locktime interval.
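
A sketch of the resulting classification in arp_process() (the local variable
names used here are assumptions):

  is_garp = tip == sip && addr_type == RTN_UNICAST;
  if (is_garp && arp->ar_op == htons(ARPOP_REPLY))
          /* a reply is gratuitous only if the Target HW Address is present
           * and matches the sender's link-layer address */
          is_garp = tha && !memcmp(tha, sha, dev->addr_len);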

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 11:32:44 -04:00
Colin Ian King
7e1b9521f5 dm cache: handle kmalloc failure allocating background_tracker struct
Currently there is no kmalloc failure check on the allocation of
the background_tracker struct in btracker_create(), and so a NULL return
will lead to a NULL pointer dereference.  Add a NULL check.

Detected by CoverityScan, CID#1416587 ("Dereference null return value")
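
The shape of the fix (a sketch; the constructor's exact signature may differ):

  struct background_tracker *btracker_create(unsigned max_work)
  {
          struct background_tracker *b = kmalloc(sizeof(*b), GFP_KERNEL);

          if (!b)
                  return NULL;  /* let the caller handle allocation failure */
          /* initialise the tracker fields as before */
          return b;
  }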

Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-17 09:44:53 -04:00
Tin Huynh
83345d51a4 i2c: xgene: Set ACPI_COMPANION_I2C
With ACPI, i2c-core requires the ACPI companion to be set in order for it
to create the slave device.
This patch sets the ACPI companion accordingly.
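
A minimal sketch of the change (the variable names are assumptions):

  /* propagate the ACPI companion from the platform device to the adapter */
  ACPI_COMPANION_SET(&adapter->dev, ACPI_COMPANION(&pdev->dev));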

Signed-off-by: Tin Huynh <tnhuynh@apm.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2017-05-17 09:21:06 +02:00
Thomas Petazzoni
88ad60c23a i2c: mv64xxx: don't override deferred probing when getting irq
There is no reason to use platform_get_irq() for non-DT probing and
irq_of_parse_and_map() for DT probing. Indeed, platform_get_irq()
works fine for both.

In addition, using platform_get_irq() properly returns -EPROBE_DEFER
when the interrupt controller is not yet available, so instead of
inventing our own error code (-ENXIO), return the one provided by
platform_get_irq().
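
The resulting probe logic looks roughly like this (a sketch):

  drv_data->irq = platform_get_irq(pdev, 0);
  if (drv_data->irq < 0)
          return drv_data->irq;  /* may be -EPROBE_DEFER; pass it up as-is */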

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2017-05-16 23:19:00 +02:00
Linus Torvalds
b23afd3848 - fix bad EFI vars iterator usage
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZG0rAAAoJEIly9N/cbcAmfigP/iaKLzuVid4PC3Pzf7hAwmmg
 cWrH8sPMZF6elSSKHguRWqUpyHgK6crsJYnMB8c+4+/upXMorBpX6GFK3CEZG2UO
 1vrVkJ06q0N41qNnJc3TpDV3RxecdKOs26Jtc9gm+ZMScJxZBLXBu3bfwkwE8V/c
 1RFYhFciFg2lZBwIUZb73nAtvBtMdkLACLA4isa1Gn/Q0Ah50s+MFrePNjBm2Xen
 mjhp7lP1w57X6Is3ZlbNwVEWA66qUP3PiyTasA0RiCNfQzJtT71NSCkh3849w1rX
 61cink78ZLtokdWMs510PiiUNiwivpkewftkUj33QAWkCvr7mh5rTleYNDcule3S
 GWHcq7I4HNJSf2aoAUMG6A21KGT1Rg3+EmdH6Ci1/sTOIRkitFdWTjdx/JJb6qH8
 WRmFh+lF3v5eB4pFu5z4sgqBbHFEa3cXsHqYw5rRrJuFJM/FLzVrOPTGCEfaQzj0
 Nnl53MeQ3PNK5djpJtNQ0JFkvnylgbed2E5XUQbqm5nNGWMuviBHNPKhyBroSoEZ
 LHk9EW8NRFUXZMjPEbi+2vPigz4/UAXBcc/MOPqdbZCBughTdq7xicIdil31hPxn
 WeJzhXGs0rVEdzZJaUDYT+yKuTbTBeYwl/ujAGtqSB45lk/VLdZ0vWRz7PL83MwL
 BwOer++6MGmiO0JnkS85
 =bdBo
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:
 "Fix bad EFI vars iterator usage"

* tag 'pstore-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  efi-pstore: Fix read iter after pstore API refactor
2017-05-16 13:29:07 -07:00
Christoph Hellwig
9ad098037d liquidio: use pcie_flr instead of duplicating it
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:59:35 -04:00
Andrew Lunn
1b86f702f8 net: phy: Remove residual magic from PHY drivers
commit fa8cddaf90 ("net phylib: Remove unnecessary condition check in phy")
removed the only place where the PHY flag PHY_HAS_MAGICANEG was
checked. But it left the flag being set in the drivers. Remove the flag.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:58:18 -04:00
Leon Romanovsky
3fdd34c18e bnx2x: Remove open coded carrier check
There is an inline function to test whether the carrier is present,
which makes the open-coded solution redundant.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:48:57 -04:00
Arnd Bergmann
2432a3fb5c mlx5e: add CONFIG_INET dependency
We now reference the arp_tbl, which requires IPv4 support to be
enabled in the kernel, otherwise we get a link error:

drivers/net/built-in.o: In function `mlx5e_tc_update_neigh_used_value':
(.text+0x16afec): undefined reference to `arp_tbl'
drivers/net/built-in.o: In function `mlx5e_rep_neigh_init':
en_rep.c:(.text+0x16c16d): undefined reference to `arp_tbl'
drivers/net/built-in.o: In function `mlx5e_rep_netevent_event':
en_rep.c:(.text+0x16cbb5): undefined reference to `arp_tbl'

This adds a Kconfig dependency for it.

Fixes: 232c001398 ("net/mlx5e: Add support to neighbour update flow")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:48:01 -04:00
Eric Dumazet
218af599fa tcp: internal implementation for pacing
BBR congestion control depends on pacing, and pacing is
currently handled by the sch_fq packet scheduler for performance reasons,
and also because implementing pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or by use of the SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from the TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
Queues limits are not generally hit. Setups with a single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical difference in behavior.
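
The core of the internal pacing can be sketched as arming an hrtimer for the
time the skb should occupy the link at the socket's pacing rate (field and
helper names beyond the standard kernel APIs are approximations):

  /* delay = skb length / pacing rate, in nanoseconds */
  u64 len_ns = (u64)skb->len * NSEC_PER_SEC;

  do_div(len_ns, sk->sk_pacing_rate);
  hrtimer_start(&tcp_sk(sk)->pacing_timer,
                ktime_add_ns(ktime_get(), len_ns),
                HRTIMER_MODE_ABS_PINNED);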

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:43:31 -04:00
David S. Miller
8dfedc5343 Merge branch 'udp-scalability-improvements'
Paolo Abeni says:

====================
udp: scalability improvements

This patch series implements an idea suggested by Eric Dumazet to
reduce the contention on the udp sk_receive_queue lock when the socket is
under flood.

An ancillary queue is added to the udp socket, and the socket always
tries first to read packets from such queue. If it's empty, we splice
the content from sk_receive_queue into the ancillary queue.

The first patch introduces some helpers to keep the udp code small, and the
following two implement the ancillary queue strategy. The code is split
to hopefully help the reviewing process.

The measured overall gain under udp flood is up to 30%, depending on
the numa layout and the number of ingress queues used by the relevant nic.

The performance numbers have been gathered using pktgen as the sender, with
64-byte packets and random src ports, on a host b2b connected to the dut
via a 10Gbps link.

The receiver used the udp_sink program by Jesper [1] and an h/w l4 rx hash on
the ingress nic, so that the number of ingress nic rx queues hit by the udp
traffic could be controlled via ethtool -L.

The udp_sink program was bound to the first idle cpu, to get more
stable numbers.

On a single numa node receiver:

nic rx queues           vanilla                 patched kernel
1                       1820 kpps               1900 kpps
2                       1950 kpps               2500 kpps
16                      1670 kpps               2120 kpps

When using a single nic rx queue, busy polling was also enabled;
otherwise, in the above scenario, the bh processing becomes the bottleneck
and this produces large artifacts in the measured performance (e.g.
improving the udp sink run time decreases the overall tput, since more
action from the scheduler comes into play).

[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

v1 -> v2:
  Patches 1/3 and 2/3 are unchanged, in patch 3/3 the rx_queue_lock_held param
  of udp_rmem_release() is now a bool.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:41:31 -04:00
Paolo Abeni
6dfb4367cd udp: keep the sk_receive_queue held when splicing
On packet reception, when we are forced to splice the
sk_receive_queue, we can keep the related lock held, so
that we can avoid re-acquiring it, if fwd memory
scheduling is required.

v1 -> v2:
  the rx_queue_lock_held param in udp_rmem_release() is
  now a bool

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:41:30 -04:00
Paolo Abeni
2276f58ac5 udp: use a separate rx queue for packet reception
Under udp flood the sk_receive_queue spinlock is heavily contended.
This patch tries to reduce the contention on such lock by adding a
second receive queue to the udp sockets; recvmsg() looks first
in such queue and, only if empty, tries to fetch the data from
sk_receive_queue. The latter is spliced into the newly added
queue every time the receive path has to acquire the
sk_receive_queue lock.

The accounting of forward allocated memory is still protected with
the sk_receive_queue lock, so udp_rmem_release() needs to acquire
both locks when the forward deficit is flushed.

In specific scenarios we can end up acquiring and releasing the
sk_receive_queue lock multiple times; that will be covered by
the next patch.
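
The receive-side strategy can be sketched as follows (names are approximate):

  /* recvmsg() path */
  skb = __skb_dequeue(&udp_sk(sk)->reader_queue);
  if (!skb) {
          /* private queue empty: splice sk_receive_queue into it */
          spin_lock_bh(&sk->sk_receive_queue.lock);
          skb_queue_splice_tail_init(&sk->sk_receive_queue,
                                     &udp_sk(sk)->reader_queue);
          spin_unlock_bh(&sk->sk_receive_queue.lock);
          skb = __skb_dequeue(&udp_sk(sk)->reader_queue);
  }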

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:41:29 -04:00
Paolo Abeni
65101aeca5 net/sock: factor out dequeue/peek with offset code
And update __sk_queue_drop_skb() to work on the specified queue.
This will help the udp protocol to use an additional private
rx queue in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:41:29 -04:00
Mikulas Patocka
13840d3801 dm bufio: make the parameter "retain_bytes" unsigned long
Change the type of the parameter "retain_bytes" from unsigned to
unsigned long, so that on 64-bit machines the user can set more than
4GiB of data to be retained.

Also, change the type of the variable "count" in the function
"__evict_old_buffers" to unsigned long.  The assignment
"count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];"
could overflow on the unsigned long to unsigned conversion, and that could
result in buffers not being freed when they should be.

While at it, avoid division in get_retain_buffers().  Division is slow;
we can change it to a shift because we have precalculated the log2 of the
block size.
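
A sketch of the two changes (the field names used here are assumptions):

  static unsigned long get_retain_buffers(struct dm_bufio_client *c)
  {
          unsigned long retain_bytes = dm_bufio_retain_bytes;

          /* shift by the precalculated log2 of the block size, no division */
          return retain_bytes >> c->block_size_bits;
  }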

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-16 15:12:08 -04:00
David Ahern
f6c5775ff0 net: Improve handling of failures on link and route dumps
In general, rtnetlink dumps do not anticipate failure to dump a single
object (e.g., link or route) on a single pass. As both route and link
objects have grown via more attributes, that is no longer a given.

netlink dumps can handle a failure if the dump function returns an
error; specifically, netlink_dump adds the return code to the response
if it is <= 0 so userspace is notified of the failure. The missing
piece is the rtnetlink dump functions returning the error.

Fix route and link dump functions to return the error if no object is
added to an skb (detected by skb->len == 0). IPv6 route dumps
(rt6_dump_route) already return the error; this patch updates IPv4 and
link dumps. Other dump functions may need to be adjusted as well.
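
The shape of the change in the dump loop (dump_one_object() stands in for the
per-object fill function, e.g. the one used for links):

  err = dump_one_object(skb, obj);
  if (err < 0) {
          if (likely(skb->len))
                  goto out;      /* partial dump: let userspace see it */
          goto out_err;          /* nothing dumped: propagate the error */
  }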

Reported-by: Jan Moskyto Matejka <mq@ucw.cz>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 14:54:11 -04:00
Christoph Hellwig
19a0f7e37c net/smc: Add warning about remote memory exposure
The driver explicitly bypasses APIs to register all memory once a
connection is made, and thus allows remote access to memory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 14:49:43 -04:00