In case IPv6 is compiled as a module, introduce a stub
for ipv6_sock_mc_join and ipv6_sock_mc_drop etc.. It will be used
by vxlan module. Suggested by Ben.
This is an ugly but easy solution for now.
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It will be used by vxlan, and may not be inlined.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes warnings introduced by the qdisc default patch.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, the pfifo_fast queue discipline has been used by default
for all devices. But we have better choices now.
This patch allow setting the default queueing discipline with sysctl.
This allows easy use of better queueing disciplines on all devices
without having to use tc qdisc scripts. It is intended to allow
an easy path for distributions to make fq_codel or sfq the default
qdisc.
This patch also makes pfifo_fast more of a first class qdisc, since
it is now possible to manually override the default and explicitly
use pfifo_fast. The behavior for systems who do not use the sysctl
is unchanged, they still get pfifo_fast
Also removes leftover random # in sysctl net core.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some synopsys ip implementation doesn't support DMA store and forward mode,
such as BF60x. So, set force_thresh_dma_mode to use DMA thresholds only.
Update document and devicetree as well.
Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__GFP_ZERO is an uncommon flag and perhaps is better
not used. static inline dma_zalloc_coherent exists
so convert the uses of dma_alloc_coherent with __GFP_ZERO
to the more common kernel style with zalloc.
Remove memset from the static inline dma_zalloc_coherent
and add just one use of __GFP_ZERO instead.
Trivially reduces the size of the existing uses of
dma_zalloc_coherent.
Realign arguments as appropriate.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
- Uses the new_flow/old_flow separation from FQ_codel
- New flows get an initial credit allowing IW10 without added delay.
- Special FIFO queue for high prio packets (no need for PRIO + FQ)
- Uses a hash table of RB trees to locate the flows at enqueue() time
- Smart on demand gc (at enqueue() time, RB tree lookup evicts old
unused flows)
- Dynamic memory allocations.
- Designed to allow millions of concurrent flows per Qdisc.
- Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
- Single high resolution timer for throttled flows (if any).
- One RB tree to link throttled flows.
- Ability to have a max rate per flow. We might add a socket option
to add per socket limitation.
Attempts have been made to add TCP pacing in TCP stack, but this
seems to add complex code to an already complex stack.
TCP pacing is welcomed for flows having idle times, as the cwnd
permits TCP stack to queue a possibly large number of packets.
This removes the 'slow start after idle' choice, hitting badly
large BDP flows, and applications delivering chunks of data
as video streams.
Nicely spaced packets :
Here interface is 10Gbit, but flow bottleneck is ~20Mbit
cwin is big, yet FQ avoids the typical bursts generated by TCP
(as in netperf TCP_RR -- -r 100000,100000)
15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
TCP timestamps show that most packets from B were queued in the same ms
timeframe (TSval 1159799{3,4}), but FQ managed to send them right
in time to avoid a big burst.
In slow start or steady state, very few packets are throttled [1]
FQ gets a bunch of tunables as :
limit : max number of packets on whole Qdisc (default 10000)
flow_limit : max number of packets per flow (default 100)
quantum : the credit per RR round (default is 2 MTU)
initial_quantum : initial credit for new flows (default is 10 MTU)
maxrate : max per flow rate (default : unlimited)
buckets : number of RB trees (default : 1024) in hash table.
(consumes 8 bytes per bucket)
[no]pacing : disable/enable pacing (default is enable)
All of them can be changed on a live qdisc.
$ tc qd add dev eth0 root fq help
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
[ quantum BYTES ] [ initial_quantum BYTES ]
[ maxrate RATE ] [ buckets NUMBER ]
[ [no]pacing ]
$ tc -s -d qd
qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
backlog 0b 0p requeues 14
511 flows, 511 inactive, 0 throttled
110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
[1] Except if initial srtt is overestimated, as if using
cached srtt in tcp metrics. We'll provide a fix for this issue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently allow for different fanout scheduling policies in pf_packet
such as scheduling by skb's rxhash, round-robin, by cpu, and rollover.
Also allow for a random, equidistributed selection of the socket from the
fanout process group.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new macro netdev_for_each_upper_dev_rcu(dev, upper, iter) iterates
through the dev->upper_dev_list starting from the first element, using
the netdev_upper_get_next_dev_rcu(dev, &iter).
Must be called under RCU read lock.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.
It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.
There are cases when a device can be added several times to the adjacent
list, the simplest would be:
/---- eth0.10 ---\
eth0- --- bond0
\---- eth0.20 ---/
where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.
The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.
I've tested it with thousands vlans/bonds/bridges, everything works ok and
no observable lags even on a huge number of interfaces.
Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:
128*128*2*sizeof(netdev_adjacent) = 1.5MB
but in the real world we usualy have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements RFC6980: Drop fragmented ndisc packets by
default. If a fragmented ndisc packet is received the user is informed
that it is possible to disable the check.
Cc: Fernando Gont <fernando@gont.com.ar>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce cacheline usage from 2 to 1 cacheline for sctp_globals structure. By
reordering elements, we can close gaps and simply achieve the following:
Current situation:
/* size: 80, cachelines: 2, members: 10 */
/* sum members: 57, holes: 4, sum holes: 16 */
/* padding: 7 */
/* last cacheline: 16 bytes */
Afterwards:
/* size: 64, cachelines: 1, members: 10 */
/* padding: 7 */
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesse Gross says:
====================
A number of significant new features and optimizations for net-next/3.12.
Highlights are:
* "Megaflows", an optimization that allows userspace to specify which
flow fields were used to compute the results of the flow lookup.
This allows for a major reduction in flow setups (the major
performance bottleneck in Open vSwitch) without reducing flexibility.
* Converting netlink dump operations to use RCU, allowing for
additional parallelism in userspace.
* Matching and modifying SCTP protocol fields.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract the local TCP stack independant parts of tcp_v6_init_sequence()
and cookie_v6_check() and export them for use by the upcoming IPv6 SYNPROXY
target.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
core with common functions and an address family specific target.
The SYNPROXY receives the connection request from the client, responds with
a SYN/ACK containing a SYN cookie and announcing a zero window and checks
whether the final ACK from the client contains a valid cookie.
It then establishes a connection to the original destination and, if
successful, sends a window update to the client with the window size
announced by the server.
Support for timestamps, SACK, window scaling and MSS options can be
statically configured as target parameters if the features of the server
are known. If timestamps are used, the timestamp value sent back to
the client in the SYN/ACK will be different from the real timestamp of
the server. In order to now break PAWS, the timestamps are translated in
the direction server->client.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Extract the local TCP stack independant parts of tcp_v4_init_sequence()
and cookie_v4_check() and export them for use by the upcoming SYNPROXY
target.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Split out sequence number adjustments from NAT and move them to the conntrack
core to make them usable for SYN proxying. The sequence number adjustment
information is moved to a seperate extend. The extend is added to new
conntracks when a NAT mapping is set up for a connection using a helper.
As a side effect, this saves 24 bytes per connection with NAT in the common
case that a connection does not have a helper assigned.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds support for rewriting SCTP src,dst ports similar to the
functionality already available for TCP/UDP.
Rewriting SCTP ports is expensive due to double-recalculation of the
SCTP checksums; this is performed to ensure that packets traversing OVS
with invalid checksums will continue to the destination with any
checksum corruption intact.
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Conflicts:
drivers/net/wireless/iwlwifi/pcie/trans.c
include/linux/inetdevice.h
The inetdevice.h conflict involves moving the IPV4_DEVCONF values
into a UAPI header, overlapping additions of some new entries.
The iwlwifi conflict is a context overlap.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add wildcarded flow support in kernel datapath.
Wildcarded flow can improve OVS flow set up performance by avoid sending
matching new flows to the user space program. The exact performance boost
will largely dependent on wildcarded flow hit rate.
In case all new flows hits wildcard flows, the flow set up rate is
within 5% of that of linux bridge module.
Pravin has made significant contributions to this patch. Including API
clean ups and bug fixes.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Merge networking fixes from David Miller:
1) Revert Johannes Berg's genetlink locking fix, because it causes
regressions.
Johannes and Pravin Shelar are working on fixing things properly.
2) Do not drop ipv6 ICMP messages without a redirected header option,
they are legal. From Duan Jiong.
3) Missing error return propagation in probing of via-ircc driver.
From Alexey Khoroshilov.
4) Do not clear out broadcast/multicast/unicast/WOL bits in r8169 when
initializing, from Peter Wu.
5) realtek phy driver programs wrong interrupt status bit, from
Giuseppe CAVALLARO.
6) Fix statistics regression in AF_PACKET code, from Willem de Bruijn.
7) Bridge code uses wrong bitmap length, from Toshiaki Makita.
8) SFC driver uses wrong indexes to look up MAC filters, from Ben
Hutchings.
9) Don't pass stack buffers into usb control operations in hso driver,
from Daniel Gimpelevich.
10) Multiple ipv6 fragmentation headers in one packet is illegal and
such packets should be dropped, from Hannes Frederic Sowa.
11) When TCP sockets are "repaired" as part of checkpoint/restart, the
timestamp field of SKBs need to be refreshed otherwise RTOs can be
wildly off. From Andrey Vagin.
12) Fix memcpy args (uses 'address of pointer' instead of 'pointer') in
hostp driver. From Dan Carpenter.
13) nl80211hdr_put() doesn't return an ERR_PTR, but some code believes
it does. From Dan Carpenter.
14) Fix regression in wireless SME disconnects, from Johannes Berg.
15) Don't use a stack buffer for DMA in zd1201 USB wireless driver, from
Jussi Kivilinna.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
ipv4: expose IPV4_DEVCONF
ipv6: handle Redirect ICMP Message with no Redirected Header option
be2net: fix disabling TX in be_close()
Revert "genetlink: fix family dump race"
hso: Fix stack corruption on some architectures
hso: Earlier catch of error condition
sfc: Fix lookup of default RX MAC filters when steered using ethtool
bridge: Use the correct bit length for bitmap functions in the VLAN code
packet: restore packet statistics tp_packets to include drops
net: phy: rtl8211: fix interrupt on status link change
r8169: remember WOL preferences on driver load
via-ircc: don't return zero if via_ircc_open() failed
macvtap: Ignore tap features when VNET_HDR is off
macvtap: Correctly set tap features when IFF_VNET_HDR is disabled.
macvtap: simplify usage of tap_features
tcp: set timestamps for restored skb-s
bnx2x: set VF DMAE when first function has 0 supported VFs
bnx2x: Protect against VFs' ndos when SR-IOV is disabled
bnx2x: prevent VF benign attentions
bnx2x: Consider DCBX remote error
...
make the Freescale ethernet driver get, prepare and enable the FEC clock
during probe(); disable and unprepare the clock upon remove(), put is
done by the devm approach; hold a reference to the clock over the period
of use.
clock lookup is non-fatal as not all platforms provide clock specs in
their device tree; failure to enable specified clocks is fatal.
Signed-off-by: Gerhard Sittig <gsi@denx.de>
Signed-off-by: Anatolij Gustschin <agust@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP sends device configuration (see inet_fill_link_af) as an array
in the netlink information, but the indices in that array are not
exposed to userspace through any current santized header file.
It was available back in 2.6.32 (in /usr/include/linux/sysctl.h)
but was broken by:
commit 02291680ff
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Sun Feb 14 03:25:51 2010 +0000
net ipv4: Decouple ipv4 interface parameters from binary sysctl numbers
Eric was solving the sysctl problem but then the indices were re-exposed
by a later addition of devconf support for IPV4
commit 9f0f7272ac
Author: Thomas Graf <tgraf@infradead.org>
Date: Tue Nov 16 04:32:48 2010 +0000
ipv4: AF_INET link address family
Putting them in /usr/include/linux/ip.h seemed the logical match
for the DEVCONF_ definitions for IPV6 in /usr/include/linux/ip6.h
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
rfc 4861 says the Redirected Header option is optional, so
the kernel should not drop the Redirect Message that has no
Redirected Header option. In this patch, the function
ip6_redirect_no_header() is introduced to deal with that
condition.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Steffen Klassert says:
====================
1) Some constifications, from Mathias Krause.
2) Catch bugs if a hold timer is still active when xfrm_policy_destroy()
is called, from Fan Du.
3) Remove a redundant address family checking, from Fan Du.
4) Make xfrm_state timer monotonic to be independent of system clock changes,
from Fan Du.
5) Remove an outdated comment on returning -EREMOTE in the xfrm_lookup(),
from Rami Rosen.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the updated version of df54d6fa54 ("x86 get_unmapped_area():
use proper mmap base for bottom-up direction") that only randomizes the
mmap base address once.
Signed-off-by: Radu Caragea <sinaelgl@gmail.com>
Reported-and-tested-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Adrian Sendroiu <molecula2788@gmail.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit df54d6fa54.
The commit isn't necessarily wrong, but because it recalculates the
random mmap_base every time, it seems to confuse user memory allocators
that expect contiguous mmap allocations even when the mmap address isn't
specified.
In particular, the MATLAB Java runtime seems to be unhappy. See
https://bugzilla.kernel.org/show_bug.cgi?id=60774
So we'll want to apply the random offset only once, and Radu has a patch
for that. Revert this older commit in order to apply the other one.
Reported-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marc Kleine-Budde says:
====================
this pull-request for net-next consists of a series by Alexander
Shiyan, he cleans up the mcp251x driver. As the first patch touches
arch/arm/mach-pxa, it's acked by Haojian Zhuang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The only thing we may have from tun device is the fprog, whic contains
the number of filter elements and a pointer to (user-space) memory
where the elements are. The program itself may not be available if the
device is persistent and detached.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a small problem with sk-filters on tun devices. Consider
an application doing this sequence of steps:
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
ioctl(fd, TUNATTACHFILTER, &my_filter);
ioctl(fd, TUNSETPERSIST, 1);
close(fd);
At that point the tun0 will remain in the system and will keep in
mind that there should be a socket filter at address '&my_filter'.
If after that we do
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
we most likely receive the -EFAULT error, since tun_attach() would
try to connect the filter back. But (!) if we provide a filter at
address &my_filter, then tun0 will be created and the "new" filter
would be attached, but application may not know about that.
This may create certain problems to anyone using tun-s, but it's
critical problem for c/r -- if we meet a persistent tun device
with a filter in mind, we will not be able to attach to it to dump
its state (flags, owner, address, vnethdr size, etc.).
The proposal is to allow to attach to tun device (with TUNSETIFF)
w/o attaching the filter to the tun-file's socket. After this
attach app may e.g clean the device by dropping the filter, it
doesn't want to have one, or (in case of c/r) get information
about the device with tun ioctls.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tun devices cannot be created with ifidex user wants, but it's
required by checkpoint-restore project.
Long time ago such ability was implemented for rtnl_ops-based
interface for creating links (9c7dafbf net: Allow to create links
with given ifindex), but the only API for creating and managing
tuntap devices is ioctl-based and is evolving with adding new ones
(cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
device).
Following that trend, here's how a new ioctl that sets the ifindex
for device, that _will_ be created by TUNSETIFF ioctl looks like.
So those who want a tuntap device with the ifindex N, should open
the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
If the index N is busy, then the register_netdev will find this out
and the ioctl would be failed with -EBUSY.
If setifindex is not called, then it will be generated as before.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flags is not used by boards, so remove this field from the driver
platform_data.
Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch replaces power callbacks to the regulator API. To improve
the readability of the code, helper for the regulator enable/disable
was added.
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
CPSW driver no longer supports platform register as all the SoCs which has CPSW
are supporting DT only booting, so moving cpsw.h header file from platform
include to drivers/net/ethernet/ti
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the 'register_type' field of the 'sh_eth' driver's platform data is not
used by the driver anymore, it's time to remove it and its initializers from
the SH platform code. Also move *enum* declaring values for this field from
<linux/sh_eth.h> to the local driver's header file as they're only needed
by the driver itself now...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/netfilter/nf_conntrack_proto_tcp.c
The conflict had to do with overlapping changes dealing with
fixing the use of an "s32" to hold the value returned by
NAT_OFFSET().
Pablo Neira Ayuso says:
====================
The following batch contains Netfilter/IPVS updates for your net-next tree.
More specifically, they are:
* Trivial typo fix in xt_addrtype, from Phil Oester.
* Remove net_ratelimit in the conntrack logging for consistency with other
logging subsystem, from Patrick McHardy.
* Remove unneeded includes from the recently added xt_connlabel support, from
Florian Westphal.
* Allow to update conntracks via nfqueue, don't need NFQA_CFG_F_CONNTRACK for
this, from Florian Westphal.
* Remove tproxy core, now that we have socket early demux, from Florian
Westphal.
* A couple of patches to refactor conntrack event reporting to save a good
bunch of lines, from Florian Westphal.
* Fix missing locking in NAT sequence adjustment, it did not manifested in
any known bug so far, from Patrick McHardy.
* Change sequence number adjustment variable to 32 bits, to delay the
possible early overflow in long standing connections, also from Patrick.
* Comestic cleanups for IPVS, from Dragos Foianu.
* Fix possible null dereference in IPVS in the SH scheduler, from Daniel
Borkmann.
* Allow to attach conntrack expectations via nfqueue. Before this patch, you
had to use ctnetlink instead, thus, we save the conntrack lookup.
* Export xt_rpfilter and xt_HMARK header files, from Nicolas Dichtel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch adds vxlan vport type for openvswitch using
vxlan api. So now there is vxlan dependency for openvswitch.
CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch allows more code sharing between vxlan and ovs-vxlan.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch adds data field to vxlan socket and export
vxlan handler api.
vh->data is required to store private data per vxlan handler.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not allowed for an ipv6 packet to contain multiple fragmentation
headers. So discard packets which were already reassembled by
fragmentation logic and send back a parameter problem icmp.
The updates for RFC 6980 will come in later, I have to do a bit more
research here.
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix SKB leak in 8139cp, from Dave Jones.
2) Fix use of *_PAGES interfaces with mlx5 firmware, from Moshe Lazar.
3) RCU conversion of macvtap introduced two races, fixes by Eric
Dumazet
4) Synchronize statistic flows in bnx2x driver to prevent corruption,
from Dmitry Kravkov
5) Undo optimization in IP tunneling, we were using the inner IP header
in some cases to inherit the IP ID, but that isn't correct in some
circumstances. From Pravin B Shelar
6) Use correct struct size when parsing netlink attributes in
rtnl_bridge_getlink(). From Asbjoern Sloth Toennesen
7) Length verifications in tun_get_user() are bogus, from Weiping Pan
and Dan Carpenter
8) Fix bad merge resolution during 3.11 networking development in
openvswitch, albeit a harmless one which added some unreachable
code. From Jesse Gross
9) Wrong size used in flexible array allocation in openvswitch, from
Pravin B Shelar
10) Clear out firmware capability flags the be2net driver isn't ready to
handle yet, from Sarveshwar Bandi
11) Revert DMA mapping error checking addition to cxgb3 driver, it's
buggy. From Alexey Kardashevskiy
12) Fix regression in packet scheduler rate limiting when working with a
link layer of ATM. From Jesper Dangaard Brouer
13) Fix several errors in TCP Cubic congestion control, in particular
overflow errors in timestamp calculations. From Eric Dumazet and
Van Jacobson
14) In ipv6 routing lookups, we need to backtrack if subtree traversal
don't result in a match. From Hannes Frederic Sowa
15) ipgre_header() returns incorrect packet offset. Fix from Timo Teräs
16) Get "low latency" out of the new MIB counter names. From Eliezer
Tamir
17) State check in ndo_dflt_fdb_del() is inverted, from Sridhar
Samudrala
18) Handle TCP Fast Open properly in netfilter conntrack, from Yuchung
Cheng
19) Wrong memcpy length in pcan_usb driver, from Stephane Grosjean
20) Fix dealock in TIPC, from Wang Weidong and Ding Tianhong
21) call_rcu() call to destroy SCTP transport is done too early and
might result in an oops. From Daniel Borkmann
22) Fix races in genetlink family dumps, from Johannes Berg
23) Flags passed into macvlan by the user need to be validated properly,
from Michael S Tsirkin
24) Fix skge build on 32-bit, from Stephen Hemminger
25) Handle malformed TCP headers properly in xt_TCPMSS, from Pablo Neira
Ayuso
26) Fix handling of stacked vlans in vlan_dev_real_dev(), from Nikolay
Aleksandrov
27) Eliminate MTU calculation overflows in esp{4,6}, from Daniel
Borkmann
28) neigh_parms need to be setup before calling the ->ndo_neigh_setup()
method. From Veaceslav Falico
29) Kill out-of-bounds prefetch in fib_trie, from Eric Dumazet
30) Don't dereference MLD query message if the length isn't value in the
bridge multicast code, from Linus Lüssing
31) Fix VXLAN IGMP join regression due to an inverted check, from Cong
Wang
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (70 commits)
net/mlx5_core: Support MANAGE_PAGES and QUERY_PAGES firmware command changes
tun: signedness bug in tun_get_user()
qlcnic: Fix diagnostic interrupt test for 83xx adapters
qlcnic: Fix beacon state return status handling
qlcnic: Fix set driver version command
net: tg3: fix NULL pointer dereference in tg3_io_error_detected and tg3_io_slot_reset
net_sched: restore "linklayer atm" handling
drivers/net/ethernet/via/via-velocity.c: update napi implementation
Revert "cxgb3: Check and handle the dma mapping errors"
be2net: Clear any capability flags that driver is not interested in.
openvswitch: Reset tunnel key between input and output.
openvswitch: Use correct type while allocating flex array.
openvswitch: Fix bad merge resolution.
tun: compare with 0 instead of total_len
rtnetlink: rtnl_bridge_getlink: Call nlmsg_find_attr() with ifinfomsg header
ethernet/arc/arc_emac - fix NAPI "work > weight" warning
ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id.
bnx2x: prevent crash in shutdown flow with CNIC
bnx2x: fix PTE write access error
bnx2x: fix memory leak in VF
...
Ben Tebulin reported:
"Since v3.7.2 on two independent machines a very specific Git
repository fails in 9/10 cases on git-fsck due to an SHA1/memory
failures. This only occurs on a very specific repository and can be
reproduced stably on two independent laptops. Git mailing list ran
out of ideas and for me this looks like some very exotic kernel issue"
and bisected the failure to the backport of commit 53a59fc67f ("mm:
limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
That commit itself is not actually buggy, but what it does is to make it
much more likely to hit the partial TLB invalidation case, since it
introduces a new case in tlb_next_batch() that previously only ever
happened when running out of memory.
The real bug is that the TLB gather virtual memory range setup is subtly
buggered. It was introduced in commit 597e1c3580 ("mm/mmu_gather:
enable tlb flush range in generic mmu_gather"), and the range handling
was already fixed at least once in commit e6c495a96c ("mm: fix the TLB
range flushed when __tlb_remove_page() runs out of slots"), but that fix
was not complete.
The problem with the TLB gather virtual address range is that it isn't
set up by the initial tlb_gather_mmu() initialization (which didn't get
the TLB range information), but it is set up ad-hoc later by the
functions that actually flush the TLB. And so any such case that forgot
to update the TLB range entries would potentially miss TLB invalidates.
Rather than try to figure out exactly which particular ad-hoc range
setup was missing (I personally suspect it's the hugetlb case in
zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
did), this patch just gets rid of the problem at the source: make the
TLB range information available to tlb_gather_mmu(), and initialize it
when initializing all the other tlb gather fields.
This makes the patch larger, but conceptually much simpler. And the end
result is much more understandable; even if you want to play games with
partial ranges when invalidating the TLB contents in chunks, now the
range information is always there, and anybody who doesn't want to
bother with it won't introduce subtle bugs.
Ben verified that this fixes his problem.
Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the previous QUERY_PAGES command version we used one command to get the
required amount of boot, init and post init pages. The new version uses the
op_mod field to specify whether the query is for the required amount of boot,
init or post init pages. In addition the output field size for the required
amount of pages increased from 16 to 32 bits.
In MANAGE_PAGES command the input_num_entries and output_num_entries fields
sizes changed from 16 to 32 bits and the PAS tables offset changed to 0x10.
In the pages request event the num_pages field also changed to 32 bits.
In the HCA-capabilities-layout the size and location of max_qp_mcg field has
been changed to support 24 bits.
This patch isn't compatible with firmware versions < 5; however, it turns out that the
first GA firmware we will publish will not support previous versions so this should be OK.
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 56b765b79 ("htb: improved accuracy at high rates")
broke the "linklayer atm" handling.
tc class add ... htb rate X ceil Y linklayer atm
The linklayer setting is implemented by modifying the rate table
which is send to the kernel. No direct parameter were
transferred to the kernel indicating the linklayer setting.
The commit 56b765b79 ("htb: improved accuracy at high rates")
removed the use of the rate table system.
To keep compatible with older iproute2 utils, this patch detects
the linklayer by parsing the rate table. It also supports future
versions of iproute2 to send this linklayer parameter to the
kernel directly. This is done by using the __reserved field in
struct tc_ratespec, to convey the choosen linklayer option, but
only using the lower 4 bits of this field.
Linklayer detection is limited to speeds below 100Mbit/s, because
at high rates the rtab is gets too inaccurate, so bad that
several fields contain the same values, this resembling the ATM
detect. Fields even start to contain "0" time to send, e.g. at
1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have
been more broken than we first realized.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.
When one of the two netns is removed, the tunnel is destroyed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.
When one of the two netns is removed, the tunnel is destroyed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge a bunch of fixes from Andrew Morton.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
fs/proc/task_mmu.c: fix buffer overflow in add_page_map()
arch: *: Kconfig: add "kernel/Kconfig.freezer" to "arch/*/Kconfig"
ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id()
x86 get_unmapped_area(): use proper mmap base for bottom-up direction
ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page
ocfs2: Revert 40bd62e to avoid regression in extended allocation
drivers/rtc/rtc-stmp3xxx.c: provide timeout for potentially endless loop polling a HW bit
hugetlb: fix lockdep splat caused by pmd sharing
aoe: adjust ref of head for compound page tails
microblaze: fix clone syscall
mm: save soft-dirty bits on file pages
mm: save soft-dirty bits on swapped pages
memcg: don't initialize kmem-cache destroying work for root caches