2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-17 09:43:59 +08:00
Commit Graph

31617 Commits

Author SHA1 Message Date
Matija Glavinic Pecotic
ef2820a735 net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
Implementation of (a)rwnd calculation might lead to severe performance issues
and associations completely stalling. These problems are described and solution
is proposed which improves lksctp's robustness in congestion state.

1) Sudden drop of a_rwnd and incomplete window recovery afterwards

Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
but size of sk_buff, which is blamed against receiver buffer, is not accounted
in rwnd. Theoretically, this should not be the problem as actual size of buffer
is double the amount requested on the socket (SO_RECVBUF). Problem here is
that this will have bad scaling for data which is less then sizeof sk_buff.
E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
of traffic of this size (less then 100B).

An example of sudden drop and incomplete window recovery is given below. Node B
exhibits problematic behavior. Node A initiates association and B is configured
to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
message in 4G (LTE) network). On B data is left in buffer by not reading socket
in userspace.

Lets examine when we will hit pressure state and declare rwnd to be 0 for
scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
chunk is sent in separate sctp packet)

Logic is implemented in sctp_assoc_rwnd_decrease:

socket_buffer (see below) is maximum size which can be held in socket buffer
(sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count)

A simple expression is given for which it will be examined after how many
packets for above stated parameters we enter pressure state:

We start by condition which has to be met in order to enter pressure state:

	socket_buffer < currently_alloced;

currently_alloced is represented as size of sctp packets received so far and not
yet delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in separate packet, we can observe each
chunk also as sctp packet, and what is important here, having its own sk_buff):

	socket_buffer < x*each_sctp_packet;

each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the amount of initially requested size of socket buffer, which is in case
of sctp, twice the a_rwnd requested:

	2*rwnd < x*(payload+sizeof(struc sk_buff));

sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
and each payload size is 43

	20000 < x(43+190);

	x > 20000/233;

	x ~> 84;

After ~84 messages, pressure state is entered and 0 rwnd is advertised while
received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
drop from 6474 to 0, as it will be now shown in example:

IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]

--> Sudden drop

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

At this point, rwnd_press stores current rwnd value so it can be later restored
in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
adding amount which is read from userspace. Let us observe values in above
example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
amount of actual sctp data currently waiting to be delivered to userspace
is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
only for sctp data, which is ~3500. Condition is never met, and when userspace
reads all data, rwnd stays on 3569.

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

--> At this point userspace read everything, rwnd recovered only to 3569

IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

Reproduction is straight forward, it is enough for sender to send packets of
size less then sizeof(struct sk_buff) and receiver keeping them in its buffers.

2) Minute size window for associations sharing the same socket buffer

In case multiple associations share the same socket, and same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop rwnd of other association(s).

Situation will be typically observed as one association suddenly having rwnd
dropped to size of last packet received and never recovering beyond that point.
Different scenarios will lead to it, but all have in common that one of the
associations (let it be association from 1)) nearly depleted socket buffer, and
the other association blames socket buffer just for the amount enough to start
the pressure. This association will enter pressure state, set rwnd_press and
announce 0 rwnd.
When data is read by userspace, similar situation as in 1) will occur, rwnd will
increase just for the size read by userspace but rwnd_press will be high enough
so that association doesn't have enough credit to reach rwnd_press and restore
to previous state. This case is special case of 1), being worse as there is, in
the worst case, only one packet in buffer for which size rwnd will be increased.
Consequence is association which has very low maximum rwnd ('minute size', in
our case down to 43B - size of packet which caused pressure) and as such
unusable.

Scenario happened in the field and labs frequently after congestion state (link
breaks, different probabilities of packet drop, packet reordering) and with
scenario 1) preceding. Here is given a deterministic scenario for reproduction:

>From node A establish two associations on the same socket, with rcvbuf_policy
being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
bring it down close to 0, as in just one more packet would close it. This has as
a consequence that association number 2 is able to receive (at least) one more
packet which will bring it in pressure state. E.g. if association 2 had rwnd of
10000, packet received was 43, and we enter at this point into pressure,
rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
increase for 43, but conditions to restore rwnd to original state, just as in
1), will never be satisfied.

--> Association 1, between A.y and B.12345

IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]

--> Association 2, between A.z and B.12346

IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]

--> Deplete socket buffer by sending messages of size 43B over association 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]

<...>

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]

--> Sudden drop on 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
    association 1 so there is place in buffer for only one more packet

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

--> Socket buffer is almost depleted, but there is space for one more packet,
    send them over association 2, size 43B

IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Immediate drop

IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Read everything from the socket, both association recover up to maximum rwnd
    they are capable of reaching, note that association 1 recovered up to 3698,
    and association 2 recovered only to 43

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

A careful reader might wonder why it is necessary to reproduce 1) prior
reproduction of 2). It is simply easier to observe when to send packet over
association 2 which will push association into the pressure state.

Proposed solution:

Both problems share the same root cause, and that is improper scaling of socket
buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
calculating rwnd is not possible due to fact that there is no linear
relationship between amount of data blamed in increase/decrease with IP packet
in which payload arrived. Even in case such solution would be followed,
complexity of the code would increase. Due to nature of current rwnd handling,
slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
entered is rationale, but it gives false representation to the sender of current
buffer space. Furthermore, it implements additional congestion control mechanism
which is defined on implementation, and not on standard basis.

Proposed solution simplifies whole algorithm having on mind definition from rfc:

o  Receiver Window (rwnd): This gives the sender an indication of the space
   available in the receiver's inbound buffer.

Core of the proposed solution is given with these lines:

sctp_assoc_rwnd_update:
	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
	else
		asoc->rwnd = 0;

We advertise to sender (half of) actual space we have. Half is in the braces
depending whether you would like to observe size of socket buffer as SO_RECVBUF
or twice the amount, i.e. size is the one visible from userspace, that is,
from kernelspace.
In this way sender is given with good approximation of our buffer space,
regardless of the buffer policy - we always advertise what we have. Proposed
solution fixes described problems and removes necessity for rwnd restoration
algorithm. Finally, as proposed solution is simplification, some lines of code,
along with some bytes in struct sctp_association are saved.

Version 2 of the patch addressed comments from Vlad. Name of the function is set
to be more descriptive, and two parts of code are changed, in one removing the
superfluous call to sctp_assoc_rwnd_update since call would not result in update
of rwnd, and the other being reordering of the code in a way that call to
sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2
in a way that existing function is not reordered/copied in line, but it is
correctly called. Thanks Vlad for suggesting.

Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 00:16:56 -05:00
Duan Jiong
cd0f0b95fd ipv4: distinguish EHOSTUNREACH from the ENETUNREACH
since commit 251da413("ipv4: Cache ip_error() routes even when not forwarding."),
the counter IPSTATS_MIB_INADDRERRORS can't work correctly, because the value of
err was always set to ENETUNREACH.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-16 23:45:31 -05:00
Gerrit Renker
09db308053 dccp: re-enable debug macro
dccp tfrc: revert

This reverts 6aee49c558 ("dccp: make local variable static") since
the variable tfrc_debug is referenced by the tfrc_pr_debug(fmt, ...)
macro when TFRC debugging is enabled. If it is enabled, use of the
macro produces a compilation error.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-16 23:45:00 -05:00
Trond Myklebust
e9776d0f4a SUNRPC: Fix a pipe_version reference leak
In gss_alloc_msg(), if the call to gss_encode_v1_msg() fails, we
want to release the reference to the pipe_version that was obtained
earlier in the function.

Fixes: 9d3a2260f0 (SUNRPC: Fix buffer overflow checking in...)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-16 13:28:01 -05:00
Trond Myklebust
9eb2ddb48c SUNRPC: Ensure that gss_auth isn't freed before its upcall messages
Fix a race in which the RPC client is shutting down while the
gss daemon is processing a downcall. If the RPC client manages to
shut down before the gss daemon is done, then the struct gss_auth
used in gss_release_msg() may have already been freed.

Link: http://lkml.kernel.org/r/1392494917.71728.YahooMailNeo@web140002.mail.bf1.yahoo.com
Reported-by: John <da_audiophile@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-16 13:06:06 -05:00
Jarno Rajahalme
42ee19e293 openvswitch: Fix race.
ovs_vport_cmd_dump() did rcu_read_lock() only after getting the
datapath, which could have been deleted in between.  Resolved by
taking rcu_read_lock() before the get_dp() call.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-02-15 17:42:29 -08:00
Jarno Rajahalme
04382a3303 openvswitch: Read tcp flags only then the tranport header is present.
Only the first IP fragment can have a TCP header, check for this.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-15 17:37:45 -08:00
Jiri Pirko
3c7eacfc8a ovs: fix dp check in ovs_dp_reset_user_features
This fixes crash when userspace does "ovs-dpctl add-dp dev" where dev is
existing non-dp netdevice.

Introduced by:
commit 44da5ae5fb
"openvswitch: Drop user features if old user space attempted to create datapath"

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-15 17:24:19 -08:00
FX Le Bail
2b7a79bae2 netfilter: nf_nat_snmp_basic: fix duplicates in if/else branches
The solution was found by Patrick in 2.4 kernel sources.

Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-14 11:37:36 +01:00
Paul Bolle
06efbd6d56 netfilter: nft_meta: fix typo "CONFIG_NET_CLS_ROUTE"
There are two checks for CONFIG_NET_CLS_ROUTE, but the corresponding
Kconfig symbol was dropped in v2.6.39. Since the code guards access to
dst_entry.tclassid it seems CONFIG_IP_ROUTE_CLASSID should be used
instead.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-14 11:37:34 +01:00
Patrick McHardy
ce898ecb5a netfilter: nft_reject_inet: fix unintended fall-through in switch-statatement
For IPv4 packets, we call both IPv4 and IPv6 reject.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-14 11:37:33 +01:00
FX Le Bail
357137a422 ipv4: ipconfig.c: add parentheses in an if statement
Even if the 'time_before' macro expand with parentheses, the look is bad.

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 00:14:23 -05:00
Vijay Subramanian
219e288e89 net: sched: Cleanup PIE comments
Fix incorrect comment reported by Norbert Kiesel. Edit another comment to add
more details. Also add references to algorithm (IETF draft and paper) to top of
file.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
CC: Mythili Prabhu <mysuryan@cisco.com>
CC: Norbert Kiesel <nkiesel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 18:29:58 -05:00
Florian Westphal
fe6cc55f3a net: ip, ipv6: handle gso skbs in forwarding path
Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.

Given:
Host <mtu1500> R1 <mtu1200> R2

Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.

In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.

When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.

This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.

For ipv6, we send out pkt too big error for gso if the individual
segments are too big.

For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.

Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.

However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there.  I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.

Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.

This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small.  Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway.  Also its not like this could not be improved later
once the dust settles.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:17:02 -05:00
Florian Westphal
d206940319 net: core: introduce netif_skb_dev_features
Will be used by upcoming ipv4 forward path change that needs to
determine feature mask using skb->dst->dev instead of skb->dev.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:17:02 -05:00
wangweidong
efb842c45e sctp: optimize the sctp_sysctl_net_register
Here, when the net is init_net, we needn't to kmemdup the ctl_table
again. So add a check for net. Also we can save some memory.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:08:29 -05:00
wangweidong
22a1f5140e sctp: fix a missed .data initialization
As commit 3c68198e75111a90("sctp: Make hmac algorithm selection for
 cookie generation dynamic"), we miss the .data initialization.
If we don't use the net_namespace, the problem that parts of the
sysctl configuration won't be isolation and won't occur.

In sctp_sysctl_net_register(), we register the sysctl for each
net, in the for(), we use the 'table[i].data' as check condition, so
when the 'i' is the index of sctp_hmac_alg, the data is NULL, then
break. So add the .data initialization.

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:08:29 -05:00
Cong Wang
0e0eee2465 net: correct error path in rtnl_newlink()
I saw the following BUG when ->newlink() fails in rtnl_newlink():

[   40.240058] kernel BUG at net/core/dev.c:6438!

this is due to free_netdev() is not supposed to be called before
netdev is completely unregistered, therefore it is not correct
to call free_netdev() here, at least for ops->newlink!=NULL case,
many drivers call it in ->destructor so that rtnl_unlock() will
take care of it, we probably don't need to do anything here.

Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:08:29 -05:00
Erik Hugne
64380a04de tipc: fix message corruption bug for deferred packets
If a packet received on a link is out-of-sequence, it will be
placed on a deferred queue and later reinserted in the receive
path once the preceding packets have been processed. The problem
with this is that it will be subject to the buffer adjustment from
link_recv_buf_validate twice. The second adjustment for 20 bytes
header space will corrupt the packet.

We solve this by tagging the deferred packets and bail out from
receive buffer validation for packets that have already been
subjected to this.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 16:35:05 -05:00
Felix Fietkau
1bf4bbb402 mac80211: send control port protocol frames to the VO queue
Improves reliability of wifi connections with WPA, since authentication
frames are prioritized over normal traffic and also typically exempt
from aggregation.

Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-12 11:26:43 +01:00
Linus Torvalds
16e5a2ed59 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking updates from David Miller:

 1) Fix flexcan build on big endian, from Arnd Bergmann

 2) Correctly attach cpsw to GPIO bitbang MDIO drive, from Stefan Roese

 3) udp_add_offload has to use GFP_ATOMIC since it can be invoked from
    non-sleepable contexts.  From Or Gerlitz

 4) vxlan_gro_receive() does not iterate over all possible flows
    properly, fix also from Or Gerlitz

 5) CAN core doesn't use a proper SKB destructor when it hooks up
    sockets to SKBs.  Fix from Oliver Hartkopp

 6) ip_tunnel_xmit() can use an uninitialized route pointer, fix from
    Eric Dumazet

 7) Fix address family assignment in IPVS, from Michal Kubecek

 8) Fix ath9k build on ARM, from Sujith Manoharan

 9) Make sure fail_over_mac only applies for the correct bonding modes,
    from Ding Tianhong

10) The udp offload code doesn't use RCU correctly, from Shlomo Pongratz

11) Handle gigabit features properly in generic PHY code, from Florian
    Fainelli

12) Don't blindly invoke link operations in
    rtnl_link_get_slave_info_data_size, they are optional.  Fix from
    Fernando Luis Vazquez Cao

13) Add USB IDs for Netgear Aircard 340U, from Bjørn Mork

14) Handle netlink packet padding properly in openvswitch, from Thomas
    Graf

15) Fix oops when deleting chains in nf_tables, from Patrick McHardy

16) Fix RX stalls in xen-netback driver, from Zoltan Kiss

17) Fix deadlock in mac80211 stack, from Emmanuel Grumbach

18) inet_nlmsg_size() forgets to consider ifa_cacheinfo, fix from Geert
    Uytterhoeven

19) tg3_change_mtu() can deadlock, fix from Nithin Sujir

20) Fix regression in setting SCTP local source addresses on accepted
    sockets, caused by some generic ipv6 socket changes.  Fix from
    Matija Glavinic Pecotic

21) IPPROTO_* must be pure defines, otherwise module aliases don't get
    constructed properly.  Fix from Jan Moskyto

22) IPV6 netconsole setup doesn't work properly unless an explicit
    source address is specified, fix from Sabrina Dubroca

23) Use __GFP_NORETRY for high order skb page allocations in
    sock_alloc_send_pskb and skb_page_frag_refill.  From Eric Dumazet

24) Fix a regression added in netconsole over bridging, from Cong Wang

25) TCP uses an artificial offset of 1ms for SRTT, but this doesn't jive
    well with TCP pacing which needs the SRTT to be accurate.  Fix from
    Eric Dumazet

26) Several cases of missing header file includes from Rashika Kheria

27) Add ZTE MF667 device ID to qmi_wwan driver, from Raymond Wanyoike

28) TCP Small Queues doesn't handle nonagle properly in some corner
    cases, fix from Eric Dumazet

29) Remove extraneous read_unlock in bond_enslave, whoops.  From Ding
    Tianhong

30) Fix 9p trans_virtio handling of vmalloc buffers, from Richard Yao

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (136 commits)
  6lowpan: fix lockdep splats
  alx: add missing stats_lock spinlock init
  9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers
  bonding: remove unwanted bond lock for enslave processing
  USB2NET : SR9800 : One chip USB2.0 USB2NET SR9800 Device Driver Support
  tcp: tsq: fix nonagle handling
  bridge: Prevent possible race condition in br_fdb_change_mac_address
  bridge: Properly check if local fdb entry can be deleted when deleting vlan
  bridge: Properly check if local fdb entry can be deleted in br_fdb_delete_by_port
  bridge: Properly check if local fdb entry can be deleted in br_fdb_change_mac_address
  bridge: Fix the way to check if a local fdb entry can be deleted
  bridge: Change local fdb entries whenever mac address of bridge device changes
  bridge: Fix the way to find old local fdb entries in br_fdb_change_mac_address
  bridge: Fix the way to insert new local fdb entries in br_fdb_changeaddr
  bridge: Fix the way to find old local fdb entries in br_fdb_changeaddr
  tcp: correct code comment stating 3 min timeout for FIN_WAIT2, we only do 1 min
  net: vxge: Remove unused device pointer
  net: qmi_wwan: add ZTE MF667
  3c59x: Remove unused pointer in vortex_eisa_cleanup()
  net: fix 'ip rule' iif/oif device rename
  ...
2014-02-11 12:05:55 -08:00
Trond Myklebust
628356791b SUNRPC: Fix potential memory scribble in xprt_free_bc_request()
The call to xprt_free_allocation() will call list_del() on
req->rq_bc_pa_list, which is not attached to a list.
This patch moves the list_del() out of xprt_free_allocation()
and into those callers that need it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-11 13:56:54 -05:00
Trond Myklebust
06ea0bfe6e SUNRPC: Fix races in xs_nospace()
When a send failure occurs due to the socket being out of buffer space,
we call xs_nospace() in order to have the RPC task wait until the
socket has drained enough to make it worth while trying again.
The current patch fixes a race in which the socket is drained before
we get round to setting up the machinery in xs_nospace(), and which
is reported to cause hangs.

Link: http://lkml.kernel.org/r/20140210170315.33dfc621@notabene.brown
Fixes: a9a6b52ee1 (SUNRPC: Don't start the retransmission timer...)
Reported-by: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-11 09:34:30 -05:00
Eytan Lifshitz
c368ddaa9a mac80211: fix memory leak
In case ieee80211_prep_connection() fails to dereference
sdata->vif.chanctx_conf, the function returns and doesn't
free new_sta. fixed.

Signed-off-by: Eytan Lifshitz <eytan.lifshitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-11 12:59:36 +01:00
Arik Nemtsov
32769814d5 mac80211: fix sched_scan restart on recovery
In case we were not suspended, the reconfig function returns without
configuring the scheduled scan.

Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-11 12:59:12 +01:00
Eric Dumazet
20e7c4e80d 6lowpan: fix lockdep splats
When a device ndo_start_xmit() calls again dev_queue_xmit(),
lockdep can complain because dev_queue_xmit() is re-entered and the
spinlocks protecting tx queues share a common lockdep class.

Same issue was fixed for bonding/l2tp/ppp in commits

0daa230302 ("[PATCH] bonding: lockdep annotation")
49ee49202b ("bonding: set qdisc_tx_busylock to avoid LOCKDEP splat")
23d3b8bfb8 ("net: qdisc busylock needs lockdep annotations ")
303c07db48 ("ppp: set qdisc_tx_busylock to avoid LOCKDEP splat ")

Reported-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 17:51:29 -08:00
Richard Yao
b6f52ae2f0 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers
The 9p-virtio transport does zero copy on things larger than 1024 bytes
in size. It accomplishes this by returning the physical addresses of
pages to the virtio-pci device. At present, the translation is usually a
bit shift.

That approach produces an invalid page address when we read/write to
vmalloc buffers, such as those used for Linux kernel modules. Any
attempt to load a Linux kernel module from 9p-virtio produces the
following stack.

[<ffffffff814878ce>] p9_virtio_zc_request+0x45e/0x510
[<ffffffff814814ed>] p9_client_zc_rpc.constprop.16+0xfd/0x4f0
[<ffffffff814839dd>] p9_client_read+0x15d/0x240
[<ffffffff811c8440>] v9fs_fid_readn+0x50/0xa0
[<ffffffff811c84a0>] v9fs_file_readn+0x10/0x20
[<ffffffff811c84e7>] v9fs_file_read+0x37/0x70
[<ffffffff8114e3fb>] vfs_read+0x9b/0x160
[<ffffffff81153571>] kernel_read+0x41/0x60
[<ffffffff810c83ab>] copy_module_from_fd.isra.34+0xfb/0x180

Subsequently, QEMU will die printing:

qemu-system-x86_64: virtio: trying to map MMIO memory

This patch enables 9p-virtio to correctly handle this case. This not
only enables us to load Linux kernel modules off virtfs, but also
enables ZFS file-based vdevs on virtfs to be used without killing QEMU.

Special thanks to both Avi Kivity and Alexander Graf for their
interpretation of QEMU backtraces. Without their guidence, tracking down
this bug would have taken much longer. Also, special thanks to Linus
Torvalds for his insightful explanation of why this should use
is_vmalloc_addr() instead of is_vmalloc_or_module_addr():

https://lkml.org/lkml/2014/2/8/272

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 17:48:54 -08:00
John Ogness
bf06200e73 tcp: tsq: fix nonagle handling
Commit 46d3ceabd8 ("tcp: TCP Small Queues") introduced a possible
regression for applications using TCP_NODELAY.

If TCP session is throttled because of tsq, we should consult
tp->nonagle when TX completion is done and allow us to send additional
segment, especially if this segment is not a full MSS.
Otherwise this segment is sent after an RTO.

[edumazet] : Cooked the changelog, added another fix about testing
sk_wmem_alloc twice because TX completion can happen right before
setting TSQ_THROTTLED bit.

This problem is particularly visible with recent auto corking,
but might also be triggered with low tcp_limit_output_bytes
values or NIC drivers delaying TX completion by hundred of usec,
and very low rtt.

Thomas Glanzmann for example reported an iscsi regression, caused
by tcp auto corking making this bug quite visible.

Fixes: 46d3ceabd8 ("tcp: TCP Small Queues")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thomas Glanzmann <thomas@glanzmann.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 15:23:39 -08:00
Toshiaki Makita
ac4c886883 bridge: Prevent possible race condition in br_fdb_change_mac_address
br_fdb_change_mac_address() calls fdb_insert()/fdb_delete() without
br->hash_lock.

These hash list updates are racy with br_fdb_update()/br_fdb_cleanup().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:34 -08:00
Toshiaki Makita
424bb9c97c bridge: Properly check if local fdb entry can be deleted when deleting vlan
Vlan codes unconditionally delete local fdb entries.
We should consider the possibility that other ports have the same
address and vlan.

Example of problematic case:
  ip link set eth0 address 12:34:56:78:90:ab
  ip link set eth1 address aa:bb:cc:dd:ee:ff
  brctl addif br0 eth0
  brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
  bridge vlan add dev eth0 vid 10
  bridge vlan add dev eth1 vid 10
  bridge vlan add dev br0 vid 10 self
We will have fdb entry such that f->dst == eth0, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, delete eth0 vlan 10.
  bridge vlan del dev eth0 vid 10
In this case, we still need the entry for br0, but it will be deleted.

Note that br0 needs the entry even though its mac address is not set
manually. To delete the entry with proper condition checking,
fdb_delete_local() is suitable to use.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:34 -08:00
Toshiaki Makita
a778e6d1a5 bridge: Properly check if local fdb entry can be deleted in br_fdb_delete_by_port
br_fdb_delete_by_port() doesn't care about vlan and mac address of the
bridge device.

As the check is almost the same as mac address changing, slightly modify
fdb_delete_local() and use it.

Note that we can always set added_by_user to 0 in fdb_delete_local() because
- br_fdb_delete_by_port() calls fdb_delete_local() for local entries
  regardless of its added_by_user. In this case, we have to check if another
  port has the same address and vlan, and if found, we have to create the
  entry (by changing dst). This is kernel-added entry, not user-added.
- br_fdb_changeaddr() doesn't call fdb_delete_local() for user-added entry.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:34 -08:00
Toshiaki Makita
960b589f86 bridge: Properly check if local fdb entry can be deleted in br_fdb_change_mac_address
br_fdb_change_mac_address() doesn't check if the local entry has the
same address as any of bridge ports.
Although I'm not sure when it is beneficial, current implementation allow
the bridge device to receive any mac address of its ports.
To preserve this behavior, we have to check if the mac address of the
entry being deleted is identical to that of any port.

As this check is almost the same as that in br_fdb_changeaddr(), create
a common function fdb_delete_local() and call it from
br_fdb_changeadddr() and br_fdb_change_mac_address().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:33 -08:00
Toshiaki Makita
2b292fb4a5 bridge: Fix the way to check if a local fdb entry can be deleted
We should take into account the followings when deleting a local fdb
entry.

- nbp_vlan_find() can be used only when vid != 0 to check if an entry is
  deletable, because a fdb entry with vid 0 can exist at any time while
  nbp_vlan_find() always return false with vid 0.

  Example of problematic case:
    ip link set eth0 address 12:34:56:78:90:ab
    ip link set eth1 address 12:34:56:78:90:ab
    brctl addif br0 eth0
    brctl addif br0 eth1
    ip link set eth0 address aa:bb:cc:dd:ee:ff
  Then, the fdb entry 12:34:56:78:90:ab will be deleted even though the
  bridge port eth1 still has that address.

- The port to which the bridge device is attached might needs a local entry
  if its mac address is set manually.

  Example of problematic case:
    ip link set eth0 address 12:34:56:78:90:ab
    brctl addif br0 eth0
    ip link set br0 address 12:34:56:78:90:ab
    ip link set eth0 address aa:bb:cc:dd:ee:ff
  Then, the fdb still must have the entry 12:34:56:78:90:ab, but it will be
  deleted.

We can use br->dev->addr_assign_type to check if the address is manually
set or not, but I propose another approach.

Since we delete and insert local entries whenever changing mac address
of the bridge device, we can change dst of the entry to NULL regardless of
addr_assign_type when deleting an entry associated with a certain port,
and if it is found to be unnecessary later, then delete it.
That is, if changing mac address of a port, the entry might be changed
to its dst being NULL first, but is eventually deleted when recalculating
and changing bridge id.

This approach is especially useful when we want to share the code with
deleting vlan in which the bridge device might want such an entry regardless
of addr_assign_type, and makes things easy because we don't have to consider
if mac address of the bridge device will be changed or not at the time we
delete a local entry of a port, which means fdb code will not be bothered
even if the bridge id calculating logic is changed in the future.

Also, this change reduces inconsistent state, where frames whose dst is the
mac address of the bridge, can't reach the bridge because of premature fdb
entry deletion. This change reduces the possibility that the bridge device
replies unreachable mac address to arp requests, which could occur during
the short window between calling del_nbp() and br_stp_recalculate_bridge_id()
in br_del_if(). This will effective after br_fdb_delete_by_port() starts to
use the same code by following patch.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:33 -08:00
Toshiaki Makita
a4b816d8ba bridge: Change local fdb entries whenever mac address of bridge device changes
Vlan code may need fdb change when changing mac address of bridge device
even if it is caused by the mac address changing of a bridge port.

Example configuration:
  ip link set eth0 address 12:34:56:78:90:ab
  ip link set eth1 address aa:bb:cc:dd:ee:ff
  brctl addif br0 eth0
  brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
  bridge vlan add dev br0 vid 10 self
  bridge vlan add dev eth0 vid 10
We will have fdb entry such that f->dst == NULL, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, change the mac address of eth0 to greater value.
  ip link set eth0 address ee:ff:12:34:56:78
Then, mac address of br0 will be recalculated and set to aa:bb:cc:dd:ee:ff.
However, an entry aa:bb:cc:dd:ee:ff will not be created and we will be not
able to communicate using br0 on vlan 10.

Address this issue by deleting and adding local entries whenever
changing the mac address of the bridge device.

If there already exists an entry that has the same address, for example,
in case that br_fdb_changeaddr() has already inserted it,
br_fdb_change_mac_address() will simply fail to insert it and no
duplicated entry will be made, as it was.

This approach also needs br_add_if() to call br_fdb_insert() before
br_stp_recalculate_bridge_id() so that we don't create an entry whose
dst == NULL in this function to preserve previous behavior.

Note that this is a slight change in behavior where the bridge device can
receive the traffic to the new address before calling
br_stp_recalculate_bridge_id() in br_add_if().
However, it is not a problem because we have already the address on the
new port and such a way to insert new one before recalculating bridge id
is taken in br_device_event() as well.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:33 -08:00
Toshiaki Makita
a3ebb7efe7 bridge: Fix the way to find old local fdb entries in br_fdb_change_mac_address
We have been always failed to delete the old entry at
br_fdb_change_mac_address() because br_set_mac_address() updates
dev->dev_addr before calling br_fdb_change_mac_address() and
br_fdb_change_mac_address() uses dev->dev_addr to find the old entry.

That update of dev_addr is completely unnecessary because the same work
is done in br_stp_change_bridge_id() which is called right away after
calling br_fdb_change_mac_address().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:33 -08:00
Toshiaki Makita
2836882fe0 bridge: Fix the way to insert new local fdb entries in br_fdb_changeaddr
Since commit bc9a25d21e ("bridge: Add vlan support for local fdb entries"),
br_fdb_changeaddr() has inserted a new local fdb entry only if it can
find old one. But if we have two ports where they have the same address
or user has deleted a local entry, there will be no entry for one of the
ports.

Example of problematic case:
  ip link set eth0 address aa:bb:cc:dd:ee:ff
  ip link set eth1 address aa:bb:cc:dd:ee:ff
  brctl addif br0 eth0
  brctl addif br0 eth1 # eth1 will not have a local entry due to dup.
  ip link set eth1 address 12:34:56:78:90:ab
Then, the new entry for the address 12:34:56:78:90:ab will not be
created, and the bridge device will not be able to communicate.

Insert new entries regardless of whether we can find old entries or not.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:33 -08:00
Toshiaki Makita
a5642ab474 bridge: Fix the way to find old local fdb entries in br_fdb_changeaddr
br_fdb_changeaddr() assumes that there is at most one local entry per port
per vlan. It used to be true, but since commit 36fd2b63e3 ("bridge: allow
creating/deleting fdb entries via netlink"), it has not been so.
Therefore, the function might fail to search a correct previous address
to be deleted and delete an arbitrary local entry if user has added local
entries manually.

Example of problematic case:
  ip link set eth0 address ee:ff:12:34:56:78
  brctl addif br0 eth0
  bridge fdb add 12:34:56:78:90:ab dev eth0 master
  ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the address 12:34:56:78:90:ab might be deleted instead of
ee:ff:12:34:56:78, the original mac address of eth0.

Address this issue by introducing a new flag, added_by_user, to struct
net_bridge_fdb_entry.

Note that br_fdb_delete_by_port() has to set added_by_user to 0 in cases
like:
  ip link set eth0 address 12:34:56:78:90:ab
  ip link set eth1 address aa:bb:cc:dd:ee:ff
  brctl addif br0 eth0
  bridge fdb add aa:bb:cc:dd:ee:ff dev eth0 master
  brctl addif br0 eth1
  brctl delif br0 eth0
In this case, kernel should delete the user-added entry aa:bb:cc:dd:ee:ff,
but it also should have been added by "brctl addif br0 eth1" originally,
so we don't delete it and treat it a new kernel-created entry.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-10 14:34:33 -08:00
Trond Myklebust
a699d65ec4 SUNRPC: Don't create a gss auth cache unless rpc.gssd is running
An infinite loop is caused when nfs4_establish_lease() fails
with -EACCES. This causes nfs4_handle_reclaim_lease_error()
to sleep a bit and resets the NFS4CLNT_LEASE_EXPIRED bit.
This in turn causes nfs4_state_manager() to try and
reestablished the lease, again, again, again...

The problem is a valid RPCSEC_GSS client is being created when
rpc.gssd is not running.

Link: http://lkml.kernel.org/r/1392066375-16502-1-git-send-email-steved@redhat.com
Fixes: 0ea9de0ea6 (sunrpc: turn warn_gssd() log message into a dprintk())
Reported-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-10 16:49:17 -05:00
Jesper Juhl
b10bd54c05 tcp: correct code comment stating 3 min timeout for FIN_WAIT2, we only do 1 min
As far as I can tell we have used a default of 60 seconds for
FIN_WAIT2 timeout for ages (since 2.x times??).

In any case, the timeout these days is 60 seconds, so the 3 min
comment is wrong (and cost me a few minutes of my life when I was
debugging a FIN_WAIT2 related problem in a userspace application and
checked the kernel source for details).

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 19:14:23 -08:00
Maciej Żenczykowski
946c032e5a net: fix 'ip rule' iif/oif device rename
ip rules with iif/oif references do not update:
(detach/attach) across interface renames.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Willem de Bruijn <willemb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Chris Davis <chrismd@google.com>
CC: Carlo Contavalli <ccontavalli@google.com>

Google-Bug-Id: 12936021
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 19:02:52 -08:00
Christian Engelmayer
310f6fd0ca 6lowpan: Remove unused pointer in lowpan_header_create()
Commit 8df8c56a (6lowpan: Moving generic compression code into 6lowpan_iphc.c)
left pointer 'hdr' unused - remove it.

Detected by Coverity: CID 1164868.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 18:38:10 -08:00
FX Le Bail
d94c1f92bb ipv6: icmp6_send: fix Oops when pinging a not set up IPv6 peer on a sit tunnel
The patch 446fab5933 ("ipv6: enable anycast addresses
as source addresses in ICMPv6 error messages") causes an Oops when pinging a not
set up IPv6 peer on a sit tunnel.

The problem is that ipv6_anycast_destination() uses unconditionally skb_dst(skb),
which is NULL in this case.

The solution is to use instead the ipv6_chk_acast_addr_src() function.

Here are the steps to reproduce it:
modprobe sit
ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249
ip l s sit1 up
ip -6 a a dev sit1 2001🔢:123 remote 2001🔢:121
ping6 2001🔢:121

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 18:12:40 -08:00
Rashika Kheria
e1d83ee673 net: Mark functions as static in net/sunrpc/svc_xprt.c
Mark functions as static in net/sunrpc/svc_xprt.c because they are not
used outside this file.

This eliminates the following warning in net/sunrpc/svc_xprt.c:
net/sunrpc/svc_xprt.c:574:5: warning: no previous prototype for ‘svc_alloc_arg’ [-Wmissing-prototypes]
net/sunrpc/svc_xprt.c:615:18: warning: no previous prototype for ‘svc_get_next_xprt’ [-Wmissing-prototypes]
net/sunrpc/svc_xprt.c:694:6: warning: no previous prototype for ‘svc_add_new_temp_xprt’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:50 -08:00
Rashika Kheria
bd76ed36ba net: Include appropriate header file in netfilter/nft_lookup.c
Include appropriate header file net/netfilter/nf_tables_core.h in
net/netfilter/nft_lookup.c because it has prototype declaration of
functions defined in net/netfilter/nft_lookup.c.

This eliminates the following warning in net/netfilter/nft_lookup.c:
net/netfilter/nft_lookup.c:133:12: warning: no previous prototype for ‘nft_lookup_module_init’ [-Wmissing-prototypes]
net/netfilter/nft_lookup.c:138:6: warning: no previous prototype for ‘nft_lookup_module_exit’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:50 -08:00
Rashika Kheria
535d3ae9c8 net: Move prototype declaration to header file include/net/net_namespace.h from net/ipx/af_ipx.c
Move prototype declaration of function to header file
include/net/net_namespace.h from net/ipx/af_ipx.c because they are used
by more than one file.

This eliminates the following warning in net/ipx/sysctl_net_ipx.c:
net/ipx/sysctl_net_ipx.c:33:6: warning: no previous prototype for ‘ipx_register_sysctl’ [-Wmissing-prototypes]
net/ipx/sysctl_net_ipx.c:38:6: warning: no previous prototype for ‘ipx_unregister_sysctl’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:50 -08:00
Rashika Kheria
7780d8ae4a net: Move prototype declaration to header file include/net/datalink.h from net/ipx/af_ipx.c
Move prototype declarations of function to header file
include/net/datalink.h from net/ipx/af_ipx.c because they are used by
more than one file.

This eliminates the following warning in net/ipx/pe2.c:
net/ipx/pe2.c:20:24: warning: no previous prototype for ‘make_EII_client’ [-Wmissing-prototypes]
net/ipx/pe2.c:32:6: warning: no previous prototype for ‘destroy_EII_client’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:50 -08:00
Rashika Kheria
578efbc19f net: Move prototype declaration to header file include/net/ipx.h from net/ipx/af_ipx.c
Move prototype declaration of functions to header file include/net/ipx.h
from net/ipx/af_ipx.c because they are used by more than one file.

This eliminates the following warning in
net/ipx/ipx_route.c:33:19: warning: no previous prototype for ‘ipxrtr_lookup’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:52:5: warning: no previous prototype for ‘ipxrtr_add_route’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:94:6: warning: no previous prototype for ‘ipxrtr_del_routes’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:149:5: warning: no previous prototype for ‘ipxrtr_route_skb’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:171:5: warning: no previous prototype for ‘ipxrtr_route_packet’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:261:5: warning: no previous prototype for ‘ipxrtr_ioctl’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:50 -08:00
Rashika Kheria
493cc5e5ba net: Move prototype declaration to include/net/ipx.h from net/ipx/ipx_route.c
Move prototype definition of function to header file include/net/ipx.h
from net/ipx/ipx_route.c because they are used by more than one file.

This eliminates the following warning from net/ipx/af_ipx.c:
net/ipx/af_ipx.c:193:23: warning: no previous prototype for ‘ipxitf_find_using_net’ [-Wmissing-prototypes]
net/ipx/af_ipx.c:577:5: warning: no previous prototype for ‘ipxitf_send’ [-Wmissing-prototypes]
net/ipx/af_ipx.c:1219:8: warning: no previous prototype for ‘ipx_cksum’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
Rashika Kheria
ab3301bd96 net: Move prototype declaration to header file include/net/dn.h from net/decnet/af_decnet.c
Move prototype declaration of functions to header file include/net/dn.h
from net/decnet/af_decnet.c because they are used by more than one file.

This eliminates the following warning in net/decnet/af_decnet.c:
net/decnet/sysctl_net_decnet.c:354:6: warning: no previous prototype for ‘dn_register_sysctl’ [-Wmissing-prototypes]
net/decnet/sysctl_net_decnet.c:359:6: warning: no previous prototype for ‘dn_unregister_sysctl’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
Rashika Kheria
f56b8bf6e4 net: Move prototype declaration to appropriate header file from decnet/af_decnet.c
Move prototype declaration of functions to header file include/net/dn_route.h
from net/decnet/af_decnet.c because it is used by more than one file.

This eliminates the following warning in net/decnet/dn_route.c:
net/decnet/dn_route.c:629:5: warning: no previous prototype for ‘dn_route_rcv’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
Rashika Kheria
0a59f3a9fd net: Mark functions as static in core/dev.c
Mark functions as static in core/dev.c because they are not used outside
this file.

This eliminates the following warning in core/dev.c:
net/core/dev.c:2806:5: warning: no previous prototype for ‘__dev_queue_xmit’ [-Wmissing-prototypes]
net/core/dev.c:4640:5: warning: no previous prototype for ‘netdev_adjacent_sysfs_add’ [-Wmissing-prototypes]
net/core/dev.c:4650:6: warning: no previous prototype for ‘netdev_adjacent_sysfs_del’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
Rashika Kheria
02fe72c9ed net: Include appropriate header file in caif/cfsrvl.c
Include appropriate header file net/caif/caif_dev.h in caif/cfsrvl.c
because it has prototype declaration of functions defined in
caif/cfsrvl.c.

This eliminates the following warning in caif/cfsrvl.c:
net/caif/cfsrvl.c:198:6: warning: no previous prototype for ‘caif_free_client’ [-Wmissing-prototypes]
net/caif/cfsrvl.c:208:6: warning: no previous prototype for ‘caif_client_register_refcnt’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
Rashika Kheria
8203274e15 net: Include appropriate header file in caif/caif_dev.c
Include appropriate header file net/caif/caif_dev.h in caif/caif_dev.c
because it has prototype declarations of function defined in
caif/caif_dev.c.

This eliminates the following file in caif/caif_dev.c:
net/caif/caif_dev.c:303:6: warning: no previous prototype for ‘caif_enroll_dev’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
Rashika Kheria
fd944bded5 net: Mark function as static in 9p/client.c
Mark function as static in net/9p/client.c because it is not used
outside this file.

This eliminates the following warning in net/9p/client.c:
net/9p/client.c:207:18: warning: no previous prototype for ‘p9_fcall_alloc’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 17:32:49 -08:00
David S. Miller
872c7e6fd2 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
John W. Linville says:

====================
Please pull this batch of fixes intended for the 3.14 stream!

For the mac80211 bits, Johannes says:

"This is just a collection of small fixes, the commit logs explain the
details. The only thing that isn't strictly a fix is the 5/10 MHz
enabling, I had forgotten this and there's little point in waiting
longer. The patch simply removes the force-disable code that I put in
when there was a problem with the userspace API (that has long been
fixed.)"

For the iwlwifi bits, Emmanuel says:

"I have an important fix that disables A band in case the driver thought
it was enabled, and the firmware disagreed. We ended up making the
firmware unhappy. I also fix the station table in AP mode and fix the
scan while we have BT working.
Johannes removes a static variable that could potentially lead to to
issues on multi-device setups and disables scheduled scan to avoid
issues with old versions of wpa_supplicant.
A small fix from David on scan and a few new device IDs for 7265."

On top of that...

Oleksij Rempel adds a USB ID to the ar5523 driver and changes the
default powersave setting for ath9k_htc to "off", due to observed
stability issues (based on an equivalent ath9k patch).

Stanislaw Gruszka similarly disables powersave for a couple of rt2x00
drivers.  He also fixes a couple of scheduling while atomic issues
in ath9k_htc.

Sujith Manoharan rounds-out the powersave disables with one for ath9k.
He also fixes a build prolem with ath9k on ARM and fixes an ath9k Tx
power calculation.

Finally, Andrea Merello fixes a couple of lingering DMA mapping
problems in the rtl8180 driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 14:21:25 -08:00
David S. Miller
f41f031960 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/nftables/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes, mostly nftables
fixes, most relevantly they are:

* Fix a crash in the h323 conntrack NAT helper due to expectation list
  corruption, from Alexey Dobriyan.

* A couple of RCU race fixes for conntrack, one manifests by hitting BUG_ON
  in nf_nat_setup_info() and the destroy path, patches from Andrey Vagin and
  me.

* Dump direction attribute in nft_ct only if it is set, from Arturo
  Borrero.

* Fix IPVS bug in its own connection tracking system that may lead to
  copying only 4 bytes of the IPv6 address when initializing the
  ip_vs_conn object, from Michal Kubecek.

* Fix -EBUSY errors in nftables when deleting the rules, chain and tables
  in a row due mixture of asynchronous and synchronous object releasing,
  from me.

* Three fixes for the nf_tables set infrastructure when using intervals and
  mappings, from me.

* Four patches to fixing the nf_tables log, reject and ct expressions from
  the new inet table, from Patrick McHardy.

* Fix memory overrun in the map that is used to dynamically allocate names
  from anonymous sets, also from Patrick.

* Fix a potential oops if you dump a set with NFPROTO_UNSPEC and a table
  name, from Patrick McHardy.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09 14:20:00 -08:00
Ilya Dryomov
0ec1d15ec6 libceph: do not dereference a NULL bio pointer
Commit f38a5181d9 ("ceph: Convert to immutable biovecs") introduced
a NULL pointer dereference, which broke rbd in -rc1.  Fix it.

Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-02-07 11:37:07 -08:00
Ilya Dryomov
ff513ace9b libceph: take map_sem for read in handle_reply()
Handling redirect replies requires both map_sem and request_mutex.
Taking map_sem unconditionally near the top of handle_reply() avoids
possible race conditions that arise from releasing request_mutex to be
able to acquire map_sem in redirect reply case.  (Lock ordering is:
map_sem, request_mutex, crush_mutex.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-02-07 10:45:53 -08:00
Ilya Dryomov
0bbfdfe8d2 libceph: factor out logic from ceph_osdc_start_request()
Factor out logic from ceph_osdc_start_request() into a new helper,
__ceph_osdc_start_request().  ceph_osdc_start_request() now amounts to
taking locks and calling __ceph_osdc_start_request().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-02-07 10:45:42 -08:00
John W. Linville
0f96b860bc Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2014-02-07 13:44:14 -05:00
Patrick McHardy
6d8c00d58e netfilter: nf_tables: unininline nft_trace_packet()
It makes no sense to inline a rarely used function meant for debugging
only that is called a total of five times in the main evaluation loop.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-07 17:50:27 +01:00
Pablo Neira Ayuso
62f9c8b40d netfilter: nf_tables: fix loop checking with end interval elements
Fix access to uninitialized data for end interval elements. The
element data part is uninitialized in interval end elements.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-07 17:21:45 +01:00
Pablo Neira Ayuso
2fb91ddbf8 netfilter: nft_rbtree: fix data handling of end interval elements
This patch fixes several things which related to the handling of
end interval elements:

* Chain use underflow with intervals and map: If you add a rule
  using intervals+map that introduces a loop, the error path of the
  rbtree set decrements the chain refcount for each side of the
  interval, leading to a chain use counter underflow.

* Don't copy the data part of the end interval element since, this
  area is uninitialized and this confuses the loop detection code.

* Don't allocate room for the data part of end interval elements
  since this is unused.

So, after this patch the idea is that end interval elements don't
have a data part.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
2014-02-07 14:22:06 +01:00
Pablo Neira Ayuso
bd7fc645da netfilter: nf_tables: do not allow NFT_SET_ELEM_INTERVAL_END flag and data
This combination is not allowed since end interval elements cannot
contain data.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
2014-02-07 14:21:49 +01:00
Eric Dumazet
4a5ab4e224 tcp: remove 1ms offset in srtt computation
TCP pacing depends on an accurate srtt estimation.

Current srtt estimation is using jiffie resolution,
and has an artificial offset of at least 1 ms, which can produce
slowdowns when FQ/pacing is used, especially in DC world,
where typical rtt is below 1 ms.

We are planning a switch to usec resolution for linux-3.15,
but in the meantime, this patch removes the 1 ms offset.

All we need is to have tp->srtt minimal value of 1 to differentiate
the case of srtt being initialized or not, not 8.

The problematic behavior was observed on a 40Gbit testbed,
where 32 concurrent netperf were reaching 12Gbps of aggregate
speed, instead of line speed.

This patch also has the effect of reporting more accurate srtt and send
rates to iproute2 ss command as in :

$ ss -i dst cca2
Netid  State      Recv-Q Send-Q          Local Address:Port
Peer Address:Port
tcp    ESTAB      0      0                10.244.129.1:56984
10.244.129.2:12865
	 cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
463.4Mbps rcv_rtt:1 rcv_space:29200
tcp    ESTAB      0      390960           10.244.129.1:60247
10.244.129.2:50204
	 cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200

Reported-by: Vytautas Valancius <valas@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06 21:28:06 -08:00
Cong Wang
dbe173079a bridge: fix netconsole setup over bridge
Commit 93d8bf9fb8 ("bridge: cleanup netpoll code") introduced
a check in br_netpoll_enable(), but this check is incorrect for
br_netpoll_setup(). This patch moves the code after the check
into __br_netpoll_enable() and calls it in br_netpoll_setup().
For br_add_if(), the check is still needed.

Fixes: 93d8bf9fb8 ("bridge: cleanup netpoll code")
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Tested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06 21:28:06 -08:00
Eric Dumazet
ed98df3361 net: use __GFP_NORETRY for high order allocations
sock_alloc_send_pskb() & sk_page_frag_refill()
have a loop trying high order allocations to prepare
skb with low number of fragments as this increases performance.

Problem is that under memory pressure/fragmentation, this can
trigger OOM while the intent was only to try the high order
allocations, then fallback to order-0 allocations.

We had various reports from unexpected regressions.

According to David, setting __GFP_NORETRY should be fine,
as the asynchronous compaction is still enabled, and this
will prevent OOM from kicking as in :

CFSClientEventm invoked oom-killer: gfp_mask=0x42d0, order=3, oom_adj=0,
oom_score_adj=0, oom_score_badness=2 (enabled),memcg_scoring=disabled
CFSClientEventm

Call Trace:
 [<ffffffff8043766c>] dump_header+0xe1/0x23e
 [<ffffffff80437a02>] oom_kill_process+0x6a/0x323
 [<ffffffff80438443>] out_of_memory+0x4b3/0x50d
 [<ffffffff8043a4a6>] __alloc_pages_may_oom+0xa2/0xc7
 [<ffffffff80236f42>] __alloc_pages_nodemask+0x1002/0x17f0
 [<ffffffff8024bd23>] alloc_pages_current+0x103/0x2b0
 [<ffffffff8028567f>] sk_page_frag_refill+0x8f/0x160
 [<ffffffff80295fa0>] tcp_sendmsg+0x560/0xee0
 [<ffffffff802a5037>] inet_sendmsg+0x67/0x100
 [<ffffffff80283c9c>] __sock_sendmsg_nosec+0x6c/0x90
 [<ffffffff80283e85>] sock_sendmsg+0xc5/0xf0
 [<ffffffff802847b6>] __sys_sendmsg+0x136/0x430
 [<ffffffff80284ec8>] sys_sendmsg+0x88/0x110
 [<ffffffff80711472>] system_call_fastpath+0x16/0x1b
Out of Memory: Kill process 2856 (bash) score 9999 or sacrifice child

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06 21:28:06 -08:00
Sabrina Dubroca
00fe11b3c6 netpoll: fix netconsole IPv6 setup
Currently, to make netconsole start over IPv6, the source address
needs to be specified. Without a source address, netpoll_parse_options
assumes we're setting up over IPv4 and the destination IPv6 address is
rejected.

Check if the IP version has been forced by a source address before
checking for a version mismatch when parsing the destination address.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06 21:28:06 -08:00
Matija Glavinic Pecotic
661dbf3412 net: sctp: fix initialization of local source address on accepted ipv6 sockets
commit 	efe4208f47:
'ipv6: make lookups simpler and faster' broke initialization of local source
address on accepted ipv6 sockets. Before the mentioned commit receive address
was copied along with the contents of ipv6_pinfo in sctp_v6_create_accept_sk.
Now when it is moved, it has to be copied separately.

This also fixes lksctp's ipv6 regression in a sense that test_getname_v6, TC5 -
'getsockname on a connected server socket' now passes.

Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06 21:18:06 -08:00
Geert Uytterhoeven
63b5f152eb ipv4: Fix runtime WARNING in rtmsg_ifa()
On m68k/ARAnyM:

WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99()
Modules linked in:
CPU: 0 PID: 407 Comm: ifconfig Not tainted
3.13.0-atari-09263-g0c71d68014d1 #1378
Stack from 10c4fdf0:
        10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f
        00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080
        c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001
        00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002
        008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90
        eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020
Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c
 [<00024416>] warn_slowpath_null+0x14/0x1a
 [<002501b4>] rtmsg_ifa+0xdc/0xf0
 [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2
 [<0024f9b4>] inet_abc_len+0x0/0x42
 [<00250b52>] inet_insert_ifa+0xc/0x12
 [<00252390>] devinet_ioctl+0x2ae/0x5d6

Adding some debugging code reveals that net_fill_ifaddr() fails in

    put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                              preferred, valid))

nla_put complains:

    lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20

Apparently commit 5c766d642b ("ipv4:
introduce address lifetime") forgot to take into account the addition of
struct ifa_cacheinfo in inet_nlmsg_size(). Hence add it, like is already
done for ipv6.

Suggested-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06 20:02:15 -08:00
Pablo Neira Ayuso
0165d9325d netfilter: nf_tables: fix racy rule deletion
We may lost race if we flush the rule-set (which happens asynchronously
via call_rcu) and we try to remove the table (that userspace assumes
to be empty).

Fix this by recovering synchronous rule and chain deletion. This was
introduced time ago before we had no batch support, and synchronous
rule deletion performance was not good. Now that we have the batch
support, we can just postpone the purge of old rule in a second step
in the commit phase. All object deletions are synchronous after this
patch.

As a side effect, we save memory as we don't need rcu_head per rule
anymore.

Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06 11:46:06 +01:00
Patrick McHardy
b8ecbee67c netfilter: nf_tables: fix log/queue expressions for NFPROTO_INET
The log and queue expressions both store the family during ->init() and
use it to deliver packets. This is wrong when used in NFPROTO_INET since
they should both deliver to the actual AF of the packet, not the dummy
NFPROTO_INET.

Use the family from the hook ops to fix this.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06 11:41:38 +01:00
Johannes Berg
fab57a6cc2 mac80211: fix virtual monitor interface iteration
During channel context assignment, the interface should
be found by interface iteration, so we need to assign the
pointer before the channel context.

Reported-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Tested-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:22 +01:00
Johannes Berg
338f977f4e mac80211: fix fragmentation code, particularly for encryption
The "new" fragmentation code (since my rewrite almost 5 years ago)
erroneously sets skb->len rather than using skb_trim() to adjust
the length of the first fragment after copying out all the others.
This leaves the skb tail pointer pointing to after where the data
originally ended, and thus causes the encryption MIC to be written
at that point, rather than where it belongs: immediately after the
data.

The impact of this is that if software encryption is done, then
 a) encryption doesn't work for the first fragment, the connection
    becomes unusable as the first fragment will never be properly
    verified at the receiver, the MIC is practically guaranteed to
    be wrong
 b) we leak up to 8 bytes of plaintext (!) of the packet out into
    the air

This is only mitigated by the fact that many devices are capable
of doing encryption in hardware, in which case this can't happen
as the tail pointer is irrelevant in that case. Additionally,
fragmentation is not used very frequently and would normally have
to be configured manually.

Fix this by using skb_trim() properly.

Cc: stable@vger.kernel.org
Fixes: 2de8e0d999 ("mac80211: rewrite fragmentation")
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:21 +01:00
Sujith Manoharan
d4c80d9df6 mac80211: Fix IBSS disconnect
Currently, when a station leaves an IBSS network, the
corresponding BSS is not dropped from cfg80211 if there are
other active stations in the network. But, the small
window that is present when trying to determine a station's
status based on IEEE80211_IBSS_MERGE_INTERVAL introduces
a race.

Instead of trying to keep the BSS, always remove it when
leaving an IBSS network. There is not much benefit to retain
the BSS entry since it will be added with a subsequent join
operation.

This fixes an issue where a dangling BSS entry causes ath9k
to wait for a beacon indefinitely.

Cc: <stable@vger.kernel.org>
Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:20 +01:00
Emmanuel Grumbach
0297ea17bf mac80211: release the channel in error path in start_ap
When the driver cannot start the AP or when the assignement
of the beacon goes wrong, we need to unassign the vif.

Cc: stable@vger.kernel.org
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:20 +01:00
Johannes Berg
f9d15d162b cfg80211: send scan results from work queue
Due to the previous commit, when a scan finishes, it is in theory
possible to hit the following sequence:
 1. interface starts being removed
 2. scan is cancelled by driver and cfg80211 is notified
 3. scan done work is scheduled
 4. interface is removed completely, rdev->scan_req is freed,
    event sent to userspace but scan done work remains pending
 5. new scan is requested on another virtual interface
 6. scan done work runs, freeing the still-running scan

To fix this situation, hang on to the scan done message and block
new scans while that is the case, and only send the message from
the work function, regardless of whether the scan_req is already
freed from interface removal. This makes step 5 above impossible
and changes step 6 to be
 5. scan done work runs, sending the scan done message

As this can't work for wext, so we send the message immediately,
but this shouldn't be an issue since we still return -EBUSY.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:19 +01:00
Johannes Berg
a617302c53 cfg80211: fix scan done race
When an interface/wdev is removed, any ongoing scan should be
cancelled by the driver. This will make it call cfg80211, which
only queues a work struct. If interface/wdev removal is quick
enough, this can leave the scan request pending and processed
only after the interface is gone, causing a use-after-free.

Fix this by making sure the scan request is not pending after
the interface is destroyed. We can't flush or cancel the work
item due to locking concerns, but when it'll run it shouldn't
find anything to do. This leaves a potential issue, if a new
scan gets requested before the work runs, it prematurely stops
the running scan, potentially causing another crash. I'll fix
that in the next patch.

This was particularly observed with P2P_DEVICE wdevs, likely
because freeing them is quicker than freeing netdevs.

Reported-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Fixes: 4a58e7c384 ("cfg80211: don't "leak" uncompleted scans")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:19 +01:00
Emmanuel Grumbach
8ffcc704c9 mac80211: avoid deadlock revealed by lockdep
sdata->u.ap.request_smps_work can’t be flushed synchronously
under wdev_lock(wdev) since ieee80211_request_smps_ap_work
itself locks the same lock.
While at it, reset the driver_smps_mode when the ap is
stopped to its default: OFF.

This solves:

======================================================
[ INFO: possible circular locking dependency detected ]
3.12.0-ipeer+ #2 Tainted: G           O
-------------------------------------------------------
rmmod/2867 is trying to acquire lock:
  ((&sdata->u.ap.request_smps_work)){+.+...}, at: [<c105b8d0>] flush_work+0x0/0x90

but task is already holding lock:
  (&wdev->mtx){+.+.+.}, at: [<f9b32626>] cfg80211_stop_ap+0x26/0x230 [cfg80211]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&wdev->mtx){+.+.+.}:
        [<c10aefa9>] lock_acquire+0x79/0xe0
        [<c1607a1a>] mutex_lock_nested+0x4a/0x360
        [<fb06288b>] ieee80211_request_smps_ap_work+0x2b/0x50 [mac80211]
        [<c105cdd8>] process_one_work+0x198/0x450
        [<c105d469>] worker_thread+0xf9/0x320
        [<c10669ff>] kthread+0x9f/0xb0
        [<c1613397>] ret_from_kernel_thread+0x1b/0x28

-> #0 ((&sdata->u.ap.request_smps_work)){+.+...}:
        [<c10ae9df>] __lock_acquire+0x183f/0x1910
        [<c10aefa9>] lock_acquire+0x79/0xe0
        [<c105b917>] flush_work+0x47/0x90
        [<c105d867>] __cancel_work_timer+0x67/0xe0
        [<c105d90f>] cancel_work_sync+0xf/0x20
        [<fb0765cc>] ieee80211_stop_ap+0x8c/0x340 [mac80211]
        [<f9b3268c>] cfg80211_stop_ap+0x8c/0x230 [cfg80211]
        [<f9b0d8f9>] cfg80211_leave+0x79/0x100 [cfg80211]
        [<f9b0da72>] cfg80211_netdev_notifier_call+0xf2/0x4f0 [cfg80211]
        [<c160f2c9>] notifier_call_chain+0x59/0x130
        [<c106c6de>] __raw_notifier_call_chain+0x1e/0x30
        [<c106c70f>] raw_notifier_call_chain+0x1f/0x30
        [<c14f8213>] call_netdevice_notifiers_info+0x33/0x70
        [<c14f8263>] call_netdevice_notifiers+0x13/0x20
        [<c14f82a4>] __dev_close_many+0x34/0xb0
        [<c14f83fe>] dev_close_many+0x6e/0xc0
        [<c14f9c77>] rollback_registered_many+0xa7/0x1f0
        [<c14f9dd4>] unregister_netdevice_many+0x14/0x60
        [<fb06f4d9>] ieee80211_remove_interfaces+0xe9/0x170 [mac80211]
        [<fb055116>] ieee80211_unregister_hw+0x56/0x110 [mac80211]
        [<fa3e9396>] iwl_op_mode_mvm_stop+0x26/0xe0 [iwlmvm]
        [<f9b9d8ca>] _iwl_op_mode_stop+0x3a/0x70 [iwlwifi]
        [<f9b9d96f>] iwl_opmode_deregister+0x6f/0x90 [iwlwifi]
        [<fa405179>] __exit_compat+0xd/0x19 [iwlmvm]
        [<c10b8bf9>] SyS_delete_module+0x179/0x2b0
        [<c1613421>] sysenter_do_call+0x12/0x32

Fixes: 687da13223 ("mac80211: implement SMPS for AP")
Cc: <stable@vger.kernel.org> [3.13]
Reported-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:18 +01:00
Johannes Berg
5a6aa705ff cfg80211: re-enable 5/10 MHz support
Unfortunately I forgot this during the merge window, but the
patch seems small enough to go in as a fix. The userspace API
bug that was the reason for disabling it has long been fixed.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:18 +01:00
Pontus Fuchs
f12cb28930 nl80211: Reset split_start when netlink skb is exhausted
When the netlink skb is exhausted split_start is left set. In the
subsequent retry, with a larger buffer, the dump is continued from the
failing point instead of from the beginning.

This was causing my rt28xx based USB dongle to now show up when
running "iw list" with an old iw version without split dump support.

Cc: stable@vger.kernel.org
Fixes: 3713b4e364 ("nl80211: allow splitting wiphy information in dumps")
Signed-off-by: Pontus Fuchs <pontus.fuchs@gmail.com>
[avoid the entire workaround when state->split is set]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:17 +01:00
Eliad Peller
2f617435c3 mac80211: move roc cookie assignment earlier
ieee80211_start_roc_work() might add a new roc
to existing roc, and tell cfg80211 it has already
started.

However, this might happen before the roc cookie
was set, resulting in REMAIN_ON_CHANNEL (started)
event with null cookie. Consequently, it can make
wpa_supplicant go out of sync.

Fix it by setting the roc cookie earlier.

Cc: stable@vger.kernel.org
Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-02-06 09:55:17 +01:00
Patrick McHardy
05513e9e33 netfilter: nf_tables: add reject module for NFPROTO_INET
Add a reject module for NFPROTO_INET. It does nothing but dispatch
to the AF-specific modules based on the hook family.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06 09:44:18 +01:00
Patrick McHardy
cc4723ca31 netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc parts
Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06 09:44:10 +01:00
Patrick McHardy
64d46806b6 netfilter: nf_tables: add AF specific expression support
For the reject module, we need to add AF-specific implementations to
get rid of incorrect module dependencies. Try to load an AF-specific
module first and fall back to generic modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06 00:05:36 +01:00
Patrick McHardy
51292c0735 netfilter: nft_ct: fix missing NFT_CT_L3PROTOCOL key in validity checks
The key was missing in the list of valid keys, add it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06 00:05:33 +01:00
Patrick McHardy
ec2c993568 netfilter: nf_tables: fix potential oops when dumping sets
Commit c9c8e48597 (netfilter: nf_tables: dump sets in all existing families)
changed nft_ctx_init_from_setattr() to only look up the address family if it
is not NFPROTO_UNSPEC. However if it is NFPROTO_UNSPEC and a table attribute
is given, nftables_afinfo_lookup() will dereference the NULL afi pointer.

Fix by checking for non-NULL afi and also move a check added by that commit
to the proper position.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06 00:04:15 +01:00
Patrick McHardy
53b70287dd netfilter: nf_tables: fix overrun in nf_tables_set_alloc_name()
The map that is used to allocate anonymous sets is indeed
BITS_PER_BYTE * PAGE_SIZE long.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-05 17:46:07 +01:00
Pablo Neira Ayuso
e53376bef2 netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the list, so we fulfill
Eric's golden rule which is that all released objects always have a
refcount that equals zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().

A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.

CPU1                                    CPU2
____nf_conntrack_find()
                                        nf_ct_put()
                                         destroy_conntrack()
                                        ...
                                        init_conntrack
                                         __nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
                                         if (!l4proto->new(ct, skb, dataoff, timeouts))
                                          nf_conntrack_free(ct); (use = 2 !!!)
                                        ...
                                        __nf_conntrack_alloc (set use = 1)
 if (!nf_ct_key_equal(h, tuple, zone))
  nf_ct_put(ct); (use = 0)
   destroy_conntrack()
                                        /* continue to work with CT */

After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():

<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G         C ---------------    2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>]  [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255]  [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255]  [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255]  [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255]  [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255]  [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255]  [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255]  [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255]  [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255]  [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255]  [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255]  [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255]  [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255]  [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255]  [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255]  [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255]  [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255]  [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5

I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Vagin <avagin@parallels.com>
Cc: Florian Westphal <fw@strlen.de>
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-02-05 17:46:06 +01:00
Alexey Dobriyan
829d9315c4 netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report()
Similar bug fixed in SIP module in 3f509c6 ("netfilter: nf_nat_sip: fix
incorrect handling of EBUSY for RTCP expectation").

BUG: unable to handle kernel paging request at 00100104
IP: [<f8214f07>] nf_ct_unlink_expect_report+0x57/0xf0 [nf_conntrack]
...
Call Trace:
  [<c0244bd8>] ? del_timer+0x48/0x70
  [<f8215687>] nf_ct_remove_expectations+0x47/0x60 [nf_conntrack]
  [<f8211c99>] nf_ct_delete_from_lists+0x59/0x90 [nf_conntrack]
  [<f8212e5e>] death_by_timeout+0x14e/0x1c0 [nf_conntrack]
  [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
  [<c024442d>] call_timer_fn+0x1d/0x80
  [<c024461e>] run_timer_softirq+0x18e/0x1a0
  [<f8212d10>] ? nf_conntrack_set_hashsize+0x190/0x190 [nf_conntrack]
  [<c023e6f3>] __do_softirq+0xa3/0x170
  [<c023e650>] ? __local_bh_enable+0x70/0x70
  <IRQ>
  [<c023e587>] ? irq_exit+0x67/0xa0
  [<c0202af6>] ? do_IRQ+0x46/0xb0
  [<c027ad05>] ? clockevents_notify+0x35/0x110
  [<c066ac6c>] ? common_interrupt+0x2c/0x40
  [<c056e3c1>] ? cpuidle_enter_state+0x41/0xf0
  [<c056e6fb>] ? cpuidle_idle_call+0x8b/0x100
  [<c02085f8>] ? arch_cpu_idle+0x8/0x30
  [<c027314b>] ? cpu_idle_loop+0x4b/0x140
  [<c0273258>] ? cpu_startup_entry+0x18/0x20
  [<c066056d>] ? rest_init+0x5d/0x70
  [<c0813ac8>] ? start_kernel+0x2ec/0x2f2
  [<c081364f>] ? repair_env_string+0x5b/0x5b
  [<c0813269>] ? i386_start_kernel+0x33/0x35

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-05 17:46:05 +01:00
Andrey Vagin
c6825c0976 netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Lets look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested to check the confirm bit too. I think it's
right.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))
			 if (!nf_ct_tuple_equal()

I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-05 13:16:18 +01:00
Patrick McHardy
3dd7279fb6 netfilter: nf_tables: fix oops when deleting a chain with references
The following commands trigger an oops:

 # nft -i
 nft> add table filter
 nft> add chain filter input { type filter hook input priority 0; }
 nft> add chain filter test
 nft> add rule filter input jump test
 nft> delete chain filter test

We need to check the chain use counter before allowing destruction since
we might have references from sets or jump rules.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=69341
Reported-by: Matthew Ife <deleriux1@gmail.com>
Tested-by: Matthew Ife <deleriux1@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-05 13:16:17 +01:00
Arturo Borrero
2a53bfb3e0 netfilter: nft_ct: fix unconditional dump of 'dir' attr
We want to make sure that the information that we get from the kernel can
be reinjected without troubles. The kernel shouldn't return an attribute
that is not required, or even prohibited.

Dumping unconditionally NFTA_CT_DIRECTION could lead an application in
userspace to interpret that the attribute was originally set, while it
was not.

Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-05 13:16:17 +01:00
Andy Zhou
c14e0953ca openvswitch: Suppress error messages on megaflow updates
With subfacets, we'd expect megaflow updates message to carry
the original micro flow. If not, EINVAL is returned and kernel
logs an error message.  Now that the user space subfacet layer is
removed, it is expected that flow updates can arrive with a
micro flow other than the original. Change the return code to
EEXIST and remove the kernel error log message.

Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-04 22:32:38 -08:00
Pravin B Shelar
e4c6d75954 openvswitch: Fix ovs_flow_free() ovs-lock assert.
ovs_flow_free() is not called under ovs-lock during packet
execute path (ovs_packet_cmd_execute()). Since packet execute
does not touch flow->mask, there is no need to take that
lock either. So move assert in case where flow->mask is checked.

Found by code inspection.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-04 22:21:45 -08:00
Daniele Di Proietto
45fb9c35b2 openvswitch: Fix ovs_dp_cmd_msg_size()
commit 43d4be9cb5 (openvswitch: Allow user space
to announce ability to accept unaligned Netlink messages) introduced
OVS_DP_ATTR_USER_FEATURES netlink attribute in datapath responses,
but the attribute size was not taken into account in ovs_dp_cmd_msg_size().

Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-04 22:21:23 -08:00
Andy Zhou
e80857cce8 openvswitch: Fix kernel panic on ovs_flow_free
Both mega flow mask's reference counter and per flow table mask list
should only be accessed when holding ovs_mutex() lock. However
this is not true with ovs_flow_table_flush(). The patch fixes this bug.

Reported-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-04 22:21:17 -08:00
Thomas Graf
aea0bb4f8e openvswitch: Pad OVS_PACKET_ATTR_PACKET if linear copy was performed
While the zerocopy method is correctly omitted if user space
does not support unaligned Netlink messages. The attribute is
still not padded correctly as skb_zerocopy() will not ensure
padding and the attribute size is no longer pre calculated
though nla_reserve() which ensured padding previously.

This patch applies appropriate padding if a linear data copy
was performed in skb_zerocopy().

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-04 22:21:11 -08:00
Fernando Luis Vazquez Cao
6049f2530c rtnetlink: fix oops in rtnl_link_get_slave_info_data_size
We should check whether rtnetlink link operations
are defined before calling get_slave_size().

Without this, the following oops can occur when
adding a tap device to OVS.

[   87.839553] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[   87.839595] IP: [<ffffffff813d47c0>] if_nlmsg_size+0xf0/0x220
[...]
[   87.840651] Call Trace:
[   87.840664]  [<ffffffff813d694b>] ? rtmsg_ifinfo+0x2b/0x100
[   87.840688]  [<ffffffff813c8340>] ? __netdev_adjacent_dev_insert+0x150/0x1a0
[   87.840718]  [<ffffffff813d6a50>] ? rtnetlink_event+0x30/0x40
[   87.840742]  [<ffffffff814b4144>] ? notifier_call_chain+0x44/0x70
[   87.840768]  [<ffffffff813c8946>] ? __netdev_upper_dev_link+0x3c6/0x3f0
[   87.840798]  [<ffffffffa0678d6c>] ? netdev_create+0xcc/0x160 [openvswitch]
[   87.840828]  [<ffffffffa06781ea>] ? ovs_vport_add+0x4a/0xd0 [openvswitch]
[   87.840857]  [<ffffffffa0670139>] ? new_vport+0x9/0x50 [openvswitch]
[   87.840884]  [<ffffffffa067279e>] ? ovs_vport_cmd_new+0x11e/0x210 [openvswitch]
[   87.840915]  [<ffffffff813f3efa>] ? genl_family_rcv_msg+0x19a/0x360
[   87.840941]  [<ffffffff813f40c0>] ? genl_family_rcv_msg+0x360/0x360
[   87.840967]  [<ffffffff813f4139>] ? genl_rcv_msg+0x79/0xc0
[   87.840991]  [<ffffffff813b6cf9>] ? __kmalloc_reserve.isra.25+0x29/0x80
[   87.841018]  [<ffffffff813f2389>] ? netlink_rcv_skb+0xa9/0xc0
[   87.841042]  [<ffffffff813f27cf>] ? genl_rcv+0x1f/0x30
[   87.841064]  [<ffffffff813f1988>] ? netlink_unicast+0xe8/0x1e0
[   87.841088]  [<ffffffff813f1d9a>] ? netlink_sendmsg+0x31a/0x750
[   87.841113]  [<ffffffff813aee96>] ? sock_sendmsg+0x86/0xc0
[   87.841136]  [<ffffffff813c960d>] ? __netdev_update_features+0x4d/0x200
[   87.841163]  [<ffffffff813ca94e>] ? ethtool_get_value+0x2e/0x50
[   87.841188]  [<ffffffff813af269>] ? ___sys_sendmsg+0x359/0x370
[   87.841212]  [<ffffffff813da686>] ? dev_ioctl+0x1a6/0x5c0
[   87.841236]  [<ffffffff8109c210>] ? autoremove_wake_function+0x30/0x30
[   87.841264]  [<ffffffff813ac59d>] ? sock_do_ioctl+0x3d/0x50
[   87.841288]  [<ffffffff813aca68>] ? sock_ioctl+0x1e8/0x2c0
[   87.841312]  [<ffffffff811934bf>] ? do_vfs_ioctl+0x2cf/0x4b0
[   87.841335]  [<ffffffff813afeb9>] ? __sys_sendmsg+0x39/0x70
[   87.841362]  [<ffffffff814b86f9>] ? system_call_fastpath+0x16/0x1b
[   87.841386] Code: c0 74 10 48 89 ef ff d0 83 c0 07 83 e0 fc 48 98 49 01 c7 48 89 ef e8 d0 d6 fe ff 48 85 c0 0f 84 df 00 00 00 48 8b 90 08 07 00 00 <48> 8b 8a a8 00 00 00 31 d2 48 85 c9 74 0c 48 89 ee 48 89 c7 ff
[   87.841529] RIP  [<ffffffff813d47c0>] if_nlmsg_size+0xf0/0x220
[   87.841555]  RSP <ffff880221aa5950>
[   87.841569] CR2: 00000000000000a8
[   87.851442] ---[ end trace e42ab217691b4fc2 ]---

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-04 20:28:51 -08:00
Shlomo Pongratz
a664a4f7aa net/ipv4: Use proper RCU APIs for writer-side in udp_offload.c
RCU writer side should use rcu_dereference_protected() and not
rcu_dereference(), fix that. This also removes the "suspicious RCU usage"
warning seen when running with CONFIG_PROVE_RCU.

Also, don't use rcu_assign_pointer/rcu_dereference for pointers
which are invisible beyond the udp offload code.

Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-04 20:01:55 -08:00
Michal Kubecek
2a971354e7 ipvs: fix AF assignment in ip_vs_conn_new()
If a fwmark is passed to ip_vs_conn_new(), it is passed in
vaddr, not daddr. Therefore we should set AF to AF_UNSPEC in
vaddr assignment (like we do in ip_vs_ct_in_get()), otherwise we
may copy only first 4 bytes of an IPv6 address into cp->daddr.

Signed-off-by: Bogdano Arendartchuk <barendartchuk@suse.com>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2014-02-04 21:13:47 +09:00
Eric Dumazet
b045d37bd6 ip_tunnel: fix panic in ip_tunnel_xmit()
Setting rt variable to NULL at the beginning of ip_tunnel_xmit()
missed possible use of this variable as a scratch value.

Also fixes a possible dst leak in tunnel_dst_check() :
If we had to call tunnel_dst_reset(), we forgot to
release the reference on dst.

Merges tunnel_dst_get()/tunnel_dst_check() into
a single tunnel_rtable_get() function for clarity.

Many thanks to Tommi for his report and tests.

Fixes: 7d442fab0a ("ipv4: Cache dst in tunnels")
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-03 13:01:48 -08:00
Ilya Dryomov
c172ec5c8d libceph: fix error handling in ceph_osdc_init()
msgpool_op_reply message pool isn't destroyed if workqueue construction
fails.  Fix it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-02-03 19:20:59 +02:00
Linus Torvalds
8a1f006ad3 NFS client bugfixes for Linux 3.14
Highlights:
 
 - Fix several races in nfs_revalidate_mapping
 - NFSv4.1 slot leakage in the pNFS files driver
 - Stable fix for a slot leak in nfs40_sequence_done
 - Don't reject NFSv4 servers that support ACLs with only ALLOW aces
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS7Bb+AAoJEGcL54qWCgDyDuQP/17nKR5e6MLhixcAbvlcH+pN
 8CGolAM3HmRXDWUW/PkBH3UguG8Tzx1Ex26vIxipPeTSwZabf6194Twj6L97DEGZ
 2SouD158BW1TkAbhEN/alKB/4ZCPos05iXjZkrL7MRff+8FD0UvWR2pBT1F2jQdY
 ZftG76Q72qhZHfH07ZMxM/v4Oy2Ge98RDD35gfuuqMSjHpmN9tiB55PeheW33LVY
 fu6I/JEwmlJpgy2qUcDv7v0V4mDpjC7XbcjjHpMHL8zp/C5Rx/rdgt9OQPlwmjdV
 FD8MWNXLc5TWxIouLDFPVUv3WZPjyu449QHS9Wc95fSqsHcdl4j4SwLAoSvUIdHt
 vDI5PtWhw3WAezbtiuCQnT0xcoIOn5bLjOVP13taDcV9vlZLcFlyOpZ5gVE4/Yju
 zm4sCW2+imDc74ERGa4fukF6QhzzAVmD8RlCJwuNzwCfXiZ36+xSanMYiPoUiwLL
 OVNgymrm0fe7GVFQKWN2D+Vr68OQEmARO+KfA3UzP5rQV+9CU8zSHjbcoRWZ59QG
 VahOS5WDLQSrMp8W37yAHH9IiAWveAAKJJTHlOniRqH90QYPgyW18fTo7YcpW313
 AQGFgr/1n4t27MWRLu5rdoN5v8+kwNi0UV6oboNIPoP1v15NkEMvc7HKFj5M883R
 qEYfe5wqN/eRNj68NT/+
 =B7f0
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights:

   - Fix several races in nfs_revalidate_mapping
   - NFSv4.1 slot leakage in the pNFS files driver
   - Stable fix for a slot leak in nfs40_sequence_done
   - Don't reject NFSv4 servers that support ACLs with only ALLOW aces"

* tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: initialize the ACL support bits to zero.
  NFSv4.1: Cleanup
  NFSv4.1: Clean up nfs41_sequence_done
  NFSv4: Fix a slot leak in nfs40_sequence_done
  NFSv4.1 free slot before resending I/O to MDS
  nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING
  NFS: Fix races in nfs_revalidate_mapping
  sunrpc: turn warn_gssd() log message into a dprintk()
  NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
  nfs: handle servers that support only ALLOW ACE type.
2014-01-31 15:39:07 -08:00
PaX Team
2def2ef2ae x86, x32: Correct invalid use of user timespec in the kernel
The x32 case for the recvmsg() timout handling is broken:

  asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
                                      unsigned int vlen, unsigned int flags,
                                      struct compat_timespec __user *timeout)
  {
          int datagrams;
          struct timespec ktspec;

          if (flags & MSG_CMSG_COMPAT)
                  return -EINVAL;

          if (COMPAT_USE_64BIT_TIME)
                  return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
                                        flags | MSG_CMSG_COMPAT,
                                        (struct timespec *) timeout);
          ...

The timeout pointer parameter is provided by userland (hence the __user
annotation) but for x32 syscalls it's simply cast to a kernel pointer
and is passed to __sys_recvmmsg which will eventually directly
dereference it for both reading and writing.  Other callers to
__sys_recvmmsg properly copy from userland to the kernel first.

The bug was introduced by commit ee4fa23c4b ("compat: Use
COMPAT_USE_64BIT_TIME in net/compat.c") and should affect all kernels
since 3.4 (and perhaps vendor kernels if they backported x32 support
along with this code).

Note that CONFIG_X86_X32_ABI gets enabled at build time and only if
CONFIG_X86_X32 is enabled and ld can build x32 executables.

Other uses of COMPAT_USE_64BIT_TIME seem fine.

This addresses CVE-2014-0038.

Signed-off-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 18:44:13 -08:00
David S. Miller
65b80cae7a linux-can-fixes-for-3.14-20140129
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iEUEABECAAYFAlLpas0ACgkQjTAFq1RaXHN8KwCeM2I+jaknMl9xJIauEG4FHhwj
 UvoAlRLtN1IlYbnhta5iHmFycdyN158=
 =M6Dq
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-3.14-20140129' of git://gitorious.org/linux-can/linux-can

linux-can-fixes-for-3.14-20140129

Marc Kleine-Budde says:

====================
Arnd Bergmann provides a fix for the flexcan driver, enabling compilation on
all combinations of big and little endian on ARM and PowerPc. A patch by Ira W.
Snyder fixes uninitialized variable warnings in the janz-ican3 driver.
Rostislav Lisovy contributes a patch to propagate the SO_PRIORITY of raw
sockets to skbs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-30 16:48:17 -08:00
Oliver Hartkopp
0ae89beb28 can: add destructor for self generated skbs
Self generated skbuffs in net/can/bcm.c are setting a skb->sk reference but
no explicit destructor which is enforced since Linux 3.11 with commit
376c7311bd (net: add a temporary sanity check in skb_orphan()).

This patch adds some helper functions to make sure that a destructor is
properly defined when a sock reference is assigned to a CAN related skb.
To create an unshared skb owned by the original sock a common helper function
has been introduced to replace open coded functions to create CAN echo skbs.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Andre Naujoks <nautsch2@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-30 16:25:49 -08:00
Or Gerlitz
b5aaab12b2 net/ipv4: Use non-atomic allocation of udp offloads structure instance
Since udp_add_offload() can be called from non-sleepable context e.g
under this call tree from the vxlan driver use case:

  vxlan_socket_create() <-- holds the spinlock
  -> vxlan_notify_add_rx_port()
     -> udp_add_offload()  <-- schedules

we should allocate the udp_offloads structure in atomic manner.

Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-30 16:25:48 -08:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Linus Torvalds
d9894c228b Merge branch 'for-3.14' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 - Handle some loose ends from the vfs read delegation support.
   (For example nfsd can stop breaking leases on its own in a
    fewer places where it can now depend on the vfs to.)
 - Make life a little easier for NFSv4-only configurations
   (thanks to Kinglong Mee).
 - Fix some gss-proxy problems (thanks Jeff Layton).
 - miscellaneous bug fixes and cleanup

* 'for-3.14' of git://linux-nfs.org/~bfields/linux: (38 commits)
  nfsd: consider CLAIM_FH when handing out delegation
  nfsd4: fix delegation-unlink/rename race
  nfsd4: delay setting current_fh in open
  nfsd4: minor nfs4_setlease cleanup
  gss_krb5: use lcm from kernel lib
  nfsd4: decrease nfsd4_encode_fattr stack usage
  nfsd: fix encode_entryplus_baggage stack usage
  nfsd4: simplify xdr encoding of nfsv4 names
  nfsd4: encode_rdattr_error cleanup
  nfsd4: nfsd4_encode_fattr cleanup
  minor svcauth_gss.c cleanup
  nfsd4: better VERIFY comment
  nfsd4: break only delegations when appropriate
  NFSD: Fix a memory leak in nfsd4_create_session
  sunrpc: get rid of use_gssp_lock
  sunrpc: fix potential race between setting use_gss_proxy and the upcall rpc_clnt
  sunrpc: don't wait for write before allowing reads from use-gss-proxy file
  nfsd: get rid of unused function definition
  Define op_iattr for nfsd4_open instead using macro
  NFSD: fix compile warning without CONFIG_NFSD_V3
  ...
2014-01-30 10:18:43 -08:00
Linus Torvalds
1d494f36d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Several fixups, of note:

  1) Fix unlock of not held spinlock in RXRPC code, from Alexey
     Khoroshilov.

  2) Call pci_disable_device() from the correct shutdown path in bnx2x
     driver, from Yuval Mintz.

  3) Fix qeth build on s390 for some configurations, from Eugene
     Crosser.

  4) Cure locking bugs in bond_loadbalance_arp_mon(), from Ding
     Tianhong.

  5) Must do netif_napi_add() before registering netdevice in sky2
     driver, from Stanislaw Gruszka.

  6) Fix lost bug fix during merge due to code movement in ieee802154,
     noticed and fixed by the eagle eyed Stephen Rothwell.

  7) Get rid of resource leak in xen-netfront driver, from Annie Li.

  8) Bounds checks in qlcnic driver are off by one, from Manish Chopra.

  9) TPROXY can leak sockets when TCP early demux is enabled, fix from
     Holger Eitzenberger"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
  qeth: fix build of s390 allmodconfig
  bonding: fix locking in bond_loadbalance_arp_mon()
  tun: add device name(iff) field to proc fdinfo entry
  DT: net: davinci_emac: "ti, davinci-no-bd-ram" property is actually optional
  DT: net: davinci_emac: "ti, davinci-rmii-en" property is actually optional
  bnx2x: Fix generic option settings
  net: Fix warning on make htmldocs caused by skbuff.c
  llc: remove noisy WARN from llc_mac_hdr_init
  qlcnic: Fix loopback test failure
  qlcnic: Fix tx timeout.
  qlcnic: Fix initialization of vlan list.
  qlcnic: Correct off-by-one errors in bounds checks
  net: Document promote_secondaries
  net: gre: use icmp_hdr() to get inner ip header
  i40e: Add missing braces to i40e_dcb_need_reconfig()
  xen-netfront: fix resource leak in netfront
  net: 6lowpan: fixup for code movement
  hyperv: Add support for physically discontinuous receive buffer
  sky2: initialize napi before registering device
  net: Fix memory leak if TPROXY used with TCP early demux
  ...
2014-01-29 18:08:37 -08:00
Rostislav Lisovy
bb5ecb0c63 can: Propagate SO_PRIORITY of raw sockets to skbs
This allows controlling certain queueing disciplines by setting the
socket's SO_PRIORITY option.

For example, with the default pfifo_fast queueing discipline, which
provides three priorities, socket priority TC_PRIO_CONTROL means
higher than default and TC_PRIO_BULK means lower than default.

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
Signed-off-by: Michal Sojka <sojkam1@fel.cvut.cz>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2014-01-29 20:23:23 +01:00
Masanari Iida
7fceb4de75 net: Fix warning on make htmldocs caused by skbuff.c
This patch fixed following Warning while executing "make htmldocs".

Warning(/net/core/skbuff.c:2164): No description found for parameter 'from'
Warning(/net/core/skbuff.c:2164): Excess function parameter 'source'
description in 'skb_zerocopy'
Replace "@source" with "@from" fixed the warning.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-28 18:06:06 -08:00
David S. Miller
cd0c75a78d RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAUufhfBOxKuMESys7AQL1bQ/9GswO6rzisJPxnOdC9TXUtO2LXJFzaUjB
 soyYlyd/3/Syx5/EnUxwrBaAoL8CJIVJzO7B2RaWXjnrrQM7pvP32A8mN4GuLKPb
 tTm7/yQi8vTP97lhEvVBYFFEjr0pJAOzDAhtc/7N3b9i+zJ2Rh1kG3ihItlYEqx3
 zBvscaRF7rqozny9bZUVk6DH0q9nLywd8RbSPW4PhCBTeZUqwYIe6Pu6uMWQEPAX
 sNwA5F6ukiAB6/Cz4v4RQtqZrFpZUM+pQhf/hi10k92g4qmTWhPj9XtfsI2glUUx
 brX3pcfaiOtxQidtEwVA8Daicry6gWxt4NDmxzDKmn/8FliaRIWwUBhEn8FXEXhI
 Y63RzQf48KBd4t6Ux/JEI+/oe+RiPe5rYrgoxW1Y1y4QsB015zTYKI2nvwujycfn
 6qxMuu5G7gXq7DFXYyQjS1paQsUQZUmU18i8oVgeNHS60jxwKKSkB/21Pt/xC+aT
 ztxvL0IxfXQS1C5bK67URZgj5xFj/SMKMVic9PNkmnGZJ/CzSUPOiPRF7qAhe/q5
 dH3ZLJmkAFQZYKfezvOCrsqABMXG9Ndvr3UVq0kEQrshEOe8LqVH+d8gkWnbzLL6
 cc+Dat4Tf1hm79z79xA3SvTTs8xNp8ypSQk21G+8b5ecUicV4BKpJl0ieU3WhFSm
 HY7xX9Ihaec=
 =igX7
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-20140126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
RxRPC fixes

Here are some small AF_RXRPC fixes.

 (1) Fix a place where a spinlock is taken conditionally but is released
     unconditionally.

 (2) Fix a double-free that happens when cleaning up on a checksum error.

 (3) Fix handling of CHECKSUM_PARTIAL whilst delivering messages to userspace.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-28 18:04:18 -08:00
Dave Jones
0f1a24c9a9 llc: remove noisy WARN from llc_mac_hdr_init
Sending malformed llc packets triggers this spew, which seems excessive.

WARNING: CPU: 1 PID: 6917 at net/llc/llc_output.c:46 llc_mac_hdr_init+0x85/0x90 [llc]()
device type not supported: 0
CPU: 1 PID: 6917 Comm: trinity-c1 Not tainted 3.13.0+ #95
 0000000000000009 00000000007e257d ffff88009232fbe8 ffffffffac737325
 ffff88009232fc30 ffff88009232fc20 ffffffffac06d28d ffff88020e07f180
 ffff88009232fec0 00000000000000c8 0000000000000000 ffff88009232fe70
Call Trace:
 [<ffffffffac737325>] dump_stack+0x4e/0x7a
 [<ffffffffac06d28d>] warn_slowpath_common+0x7d/0xa0
 [<ffffffffac06d30c>] warn_slowpath_fmt+0x5c/0x80
 [<ffffffffc01736d5>] llc_mac_hdr_init+0x85/0x90 [llc]
 [<ffffffffc0173759>] llc_build_and_send_ui_pkt+0x79/0x90 [llc]
 [<ffffffffc057cdba>] llc_ui_sendmsg+0x23a/0x400 [llc2]
 [<ffffffffac605d8c>] sock_sendmsg+0x9c/0xe0
 [<ffffffffac185a37>] ? might_fault+0x47/0x50
 [<ffffffffac606321>] SYSC_sendto+0x121/0x1c0
 [<ffffffffac011847>] ? syscall_trace_enter+0x207/0x270
 [<ffffffffac6071ce>] SyS_sendto+0xe/0x10
 [<ffffffffac74aaa4>] tracesys+0xdd/0xe2

Until 2009, this was a printk, when it was changed in
bf9ae5386b: "llc: use dev_hard_header".

Let userland figure out what -EINVAL means by itself.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-28 18:01:32 -08:00
Linus Torvalds
d891ea23d5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This is a big batch.  From Ilya we have:

   - rbd support for more than ~250 mapped devices (now uses same scheme
     that SCSI does for device major/minor numbering)
   - crush updates for new mapping behaviors (will be needed for coming
     erasure coding support, among other things)
   - preliminary support for tiered storage pools

  There is also a big series fixing a pile cephfs bugs with clustered
  MDSs from Yan Zheng, ACL support for cephfs from Guangliang Zhao, ceph
  fscache improvements from Li Wang, improved behavior when we get
  ENOSPC from Josh Durgin, some readv/writev improvements from
  Majianpeng, and the usual mix of small cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (76 commits)
  ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
  ceph: fix dout() compile warnings in ceph_filemap_fault()
  libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
  libceph: follow redirect replies from osds
  libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
  libceph: follow {read,write}_tier fields on osd request submission
  libceph: add ceph_pg_pool_by_id()
  libceph: CEPH_OSD_FLAG_* enum update
  libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
  libceph: introduce and start using oid abstraction
  libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
  libceph: move ceph_file_layout helpers to ceph_fs.h
  libceph: start using oloc abstraction
  libceph: dout() is missing a newline
  libceph: add ceph_kv{malloc,free}() and switch to them
  libceph: support CEPH_FEATURE_EXPORT_PEER
  ceph: add imported caps when handling cap export message
  ceph: add open export target session helper
  ceph: remove exported caps when handling cap import message
  ceph: handle session flush message
  ...
2014-01-28 11:02:23 -08:00
Linus Torvalds
2b2b15c32a NFS client updates for Linux 3.14
Highlights include:
 
 - Stable fix for an infinite loop in RPC state machine
 - Stable fix for a use after free situation in the NFSv4 trunking discovery
 - Stable fix for error handling in the NFSv4 trunking discovery
 - Stable fix for the page write update code
 - Stable fix for the NFSv4.1 mount time security negotiation
 - Stable fix for the NFSv4 open code.
 - O_DIRECT locking fixes
 - fix an Oops in the pnfs file commit code
 - RPC layer needs finer grained handling of connection errors
 - More RPC GSS upcall fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS5ozQAAoJEGcL54qWCgDy8EIQAMKYX1E5qOal3oJCzWdHAPNz
 ZSQ7CbA3c66vgJwpxy5Mz4gEtTK1IEzfTX31gLgkCXkyw54As+0lOa/SvoXFUusN
 BdBtskkIcVjhcly56xP2dzWGMsVrS8Vt+nwhsPv1Qaor5El0zXwPv8YE5PuuxJK5
 fyQdFEsywnCHtmFdyBdzsV8qHvAA0rxZTMmd6ZDBPCi9362D+pfp/1ESVOA6O14N
 rMBAbadF0pVM1UNvcvxSQaeqwCNqg5OuYKgyy9rhlH0WiQ6ijvKPrLVwg2pKZ2hj
 DCmwEqmKNEpxIFeOvmgFs/uhOEBx2IOF58xTc0+X81q96yTVm80anG1VTNFX577U
 gO8Ts0K/gWTD8ghxz4vh4/llc4yUv8ep8zB3qdSfL8C217UJIwnshkbPct7P1DTh
 8vpWtUeVJPu6rwcxMQXy0NntNZjRo1aqrv+htvFzPAMicM2KEAp73eOjStefvtr5
 JkdbvhhOR6dLwPrUEXM5FW5ewURegLjLcEqw3tq8kMnH0nEYjWOMBaB+uT0QFXun
 EXNqCpQHmHisem/3lGU+iVPc9lPf3C6tPIgjvoSplKcah1l3phVx6a5ReL22Zx2n
 qB2ePHfqToMjMcWiW3O3sbRpaDb+Br7xI4l8F3oeicvfv7SKB8k1u/w2IIoXKFIa
 FIdD6R0UIPgdnH5c03EC
 =abfY
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - stable fix for an infinite loop in RPC state machine
   - stable fix for a use after free situation in the NFSv4 trunking discovery
   - stable fix for error handling in the NFSv4 trunking discovery
   - stable fix for the page write update code
   - stable fix for the NFSv4.1 mount time security negotiation
   - stable fix for the NFSv4 open code.
   - O_DIRECT locking fixes
   - fix an Oops in the pnfs file commit code
   - RPC layer needs finer grained handling of connection errors
   - more RPC GSS upcall fixes"

* tag 'nfs-for-3.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (30 commits)
  pnfs: Proper delay for NFS4ERR_RECALLCONFLICT in layout_get_done
  pnfs: fix BUG in filelayout_recover_commit_reqs
  nfs4: fix discover_server_trunking use after free
  NFSv4.1: Handle errors correctly in nfs41_walk_client_list
  nfs: always make sure page is up-to-date before extending a write to cover the entire page
  nfs: page cache invalidation for dio
  nfs: take i_mutex during direct I/O reads
  nfs: merge nfs_direct_write into nfs_file_direct_write
  nfs: merge nfs_direct_read into nfs_file_direct_read
  nfs: increment i_dio_count for reads, too
  nfs: defer inode_dio_done call until size update is done
  nfs: fix size updates for aio writes
  nfs4.1: properly handle ENOTSUP in SECINFO_NO_NAME
  NFSv4.1: Fix a race in nfs4_write_inode
  NFSv4.1: Don't trust attributes if a pNFS LAYOUTCOMMIT is outstanding
  point to the right include file in a comment (left over from a9004abc3)
  NFS: dprintk() should not print negative fileids and inode numbers
  nfs: fix dead code of ipv6_addr_scope
  sunrpc: Fix infinite loop in RPC state machine
  SUNRPC: Add tracepoint for socket errors
  ...
2014-01-28 08:46:44 -08:00
Duan Jiong
c0c0c50ff7 net: gre: use icmp_hdr() to get inner ip header
When dealing with icmp messages, the skb->data points the
ip header that triggered the sending of the icmp message.

In gre_cisco_err(), the parse_gre_header() is called, and the
iptunnel_pull_header() is called to pull the skb at the end of
the parse_gre_header(), so the skb->data doesn't point the
inner ip header.

Unfortunately, the ipgre_err still needs those ip addresses in
inner ip header to look up tunnel by ip_tunnel_lookup().

So just use icmp_hdr() to get inner ip header instead of skb->data.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-27 20:38:26 -08:00
Stephen Rothwell
ce60e0c4df net: 6lowpan: fixup for code movement
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-27 16:43:03 -08:00
Holger Eitzenberger
a452ce345d net: Fix memory leak if TPROXY used with TCP early demux
I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):

unreferenced object 0xffff88008cba4a40 (size 1696):
  comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
  hex dump (first 32 bytes):
    0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
    02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
    [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
    [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
    [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
    [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
    [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
    [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
    [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
    [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
    [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
    [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
    [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
    [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
    [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
    [<ffffffff8127c949>] net_rx_action+0xa7/0x200
    [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157

But there are many more, resulting in the machine going OOM after some
days.

From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():

  void tcp_v4_early_demux(struct sk_buff *skb)
  {
    /* ... */

    iph = ip_hdr(skb);
    th = tcp_hdr(skb);

    if (th->doff < sizeof(struct tcphdr) / 4)
        return;

    sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                       iph->saddr, th->source,
                       iph->daddr, ntohs(th->dest),
                       skb->skb_iif);
    if (sk) {
        skb->sk = sk;

where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it.  This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target.  This then results
in the leak I see.

The very same issue seems to be with IPv6, but haven't tested.

Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-27 16:22:11 -08:00
Ilya Dryomov
205ee1187a libceph: follow redirect replies from osds
Follow redirect replies from osds, for details see ceph.git commit
fbbe3ad1220799b7bb00ea30fce581c5eadaf034.

v1 (current) version of redirect reply consists of oloc and oid, which
expands to pool, key, nspace, hash and oid.  However, server-side code
that would populate anything other than pool doesn't exist yet, and
hence this commit adds support for pool redirects only.  To make sure
that future server-side updates don't break us, we decode all fields
and, if any of key, nspace, hash or oid have a non-default value, error
out with "corrupt osd_op_reply ..." message.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:53 +02:00
Ilya Dryomov
3c972c95c6 libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:49 +02:00
Ilya Dryomov
17a13e4028 libceph: follow {read,write}_tier fields on osd request submission
Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and
write_tier for write and read+write ops (aka basic tiering support).
{read,write}_tier are part of pg_pool_t since v9.  This commit bumps
our pg_pool_t decode compat version from v7 to v9, all new fields
except for {read,write}_tier are ignored.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:45 +02:00
Ilya Dryomov
ce7f6a2790 libceph: add ceph_pg_pool_by_id()
"Lookup pool info by ID" function is hidden in osdmap.c.  Expose it to
the rest of libceph.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:40 +02:00
Ilya Dryomov
7c13cb6435 libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:32 +02:00
Ilya Dryomov
4295f2217a libceph: introduce and start using oid abstraction
In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:28 +02:00
Ilya Dryomov
2d0ebc5d59 libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:24 +02:00
Ilya Dryomov
22116525ba libceph: start using oloc abstraction
Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction.  Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool.  This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.

This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:03 +02:00
Sachin Kamat
27d79f3b10 net: ipv4: Use PTR_ERR_OR_ZERO
PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
also include missing err.h header.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-27 12:57:31 -08:00
Jeff Layton
0ea9de0ea6 sunrpc: turn warn_gssd() log message into a dprintk()
The original printk() made sense when the GSSAPI codepaths were called
only when sec=krb5* was explicitly requested. Now however, in many cases
the nfs client will try to acquire GSSAPI credentials by default, even
when it's not requested.

Since we don't have a great mechanism to distinguish between the two
cases, just turn the pr_warn into a dprintk instead. With this change we
can also get rid of the ratelimiting.

We do need to keep the EXPORT_SYMBOL(gssd_running) in place since
auth_gss.ko needs it and sunrpc.ko provides it. We can however,
eliminate the gssd_running call in the nfs code since that's a bit of a
layering violation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-27 15:36:05 -05:00
Florian Westphal
de960aa9ab net: add and use skb_gso_transport_seglen()
This moves part of Eric Dumazets skb_gso_seglen helper from tbf sched to
skbuff core so it may be reused by upcoming ip forwarding path patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-26 22:38:23 -08:00
Linus Torvalds
1c2948380b 9p changes for 3.14 merge window
Included are a new cache model for support of mmap,
 and several cleanups across the filesystem and networking
 portions of the code.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 Comment: GPGTools - http://gpgtools.org
 
 iQIcBAABAgAGBQJS4qT3AAoJEDZk62b0Tg6xiF0P/RI2c75f/5SDbeNtH0usbgRj
 YfC7do4NX0HX0nI8H0bvfai1UU9JLc8M7aEjf5nw19O45phOQcGm/KeGRmMKtAhr
 OixTLXaMkRd3llTGkFv8ZY0W6aaSsGpB3Lzin+lZBwzYqMcqksBqhOTOoS1MSh3F
 u5PyhpuyJmxMkS5ud857PfwIREXOSHF/NIMMs5k9M9kK0zCka7xvl4Kg8zng2RVf
 A3rmKsLEvYuAgnxOq16hsRMgqHwx2833C3VmQKSl/n6SfOCy7cAMNmChIDrnAwtF
 dJosxypiRSkYjgD/YJR3UZofF7IqPgdL4umNmnb2lTHbOpeqNQ1hLB8BotjGpoVX
 pl9lxzz8UzaflwkAdgPsy/GBrbULxQKPLhL1Y0QPedhYh57bqRUEPPJ/HOjyrbOE
 RZXKZXfKbYlbNwc61N+meRC0IJETTjafnJlEzXu2vA+3LxZ3n/uZ7uq7XasVPiUV
 UKTKcvzYMs/PxA47rX81DOzebmphGEZDzw2ONbi4LMwGqeWt6WIpCMLPdGDjq7kl
 jdkpf9DuDr4mDrVP5+cFhzGQYbv9rCGR1zakWSW2H9xqP4Zy+o3kEPstniTMuNS4
 smkLPfpcG0VAKvY3HiVxT62EA4M+38IBAME0ATicE6esrWDyuLtGlke7x+uZoLUF
 mQ7WPimYBR+60liZ3zbQ
 =tCej
 -----END PGP SIGNATURE-----

Merge tag 'for-3.14-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p changes from Eric Van Hensbergen:
 "Included are a new cache model for support of mmap, and several
  cleanups across the filesystem and networking portions of the code"

* tag 'for-3.14-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: update documentation
  9P: introduction of a new cache=mmap model.
  net/9p: remove virtio default hack and set appropriate bits instead
  9p: remove useless 'name' variable and assignment
  9p: fix return value in case in v9fs_fid_xattr_set()
  9p: remove useless variable and assignment
  9p: remove useless assignment
  9p: remove unused 'super_block' struct pointer
  9p: remove never used return variable
  9p: remove unused 'p9_fid' struct pointer
  9p: remove unused 'p9_client' struct pointer
2014-01-26 10:55:41 -08:00
Tim Smith
1ea427359d af_rxrpc: Handle frames delivered from another VM
On input, CHECKSUM_PARTIAL should be treated the same way as
CHECKSUM_UNNECESSARY. See include/linux/skbuff.h

Signed-off-by: Tim Smith <tim@electronghost.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-01-26 11:45:04 +00:00
Tim Smith
24a9981ee9 af_rxrpc: Avoid setting up double-free on checksum error
skb_kill_datagram() does not dequeue the skb when MSG_PEEK is unset.
This leaves a free'd skb on the queue, resulting a double-free later.

Without this, the following oops can occur:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff8154fcf7>] skb_dequeue+0x47/0x70
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: af_rxrpc ...
CPU: 0 PID: 1191 Comm: listen Not tainted 3.12.0+ #4
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8801183536b0 ti: ffff880035c92000 task.ti: ffff880035c92000
RIP: 0010:[<ffffffff8154fcf7>] skb_dequeue+0x47/0x70
RSP: 0018:ffff880035c93db8  EFLAGS: 00010097
RAX: 0000000000000246 RBX: ffff8800d2754b00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8800d254c084
RBP: ffff880035c93dd0 R08: ffff880035c93cf0 R09: ffff8800d968f270
R10: 0000000000000000 R11: 0000000000000293 R12: ffff8800d254c070
R13: ffff8800d254c084 R14: ffff8800cd861240 R15: ffff880119b39720
FS:  00007f37a969d740(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 00000000d4413000 CR4: 00000000000006f0
Stack:
 ffff8800d254c000 ffff8800d254c070 ffff8800d254c2c0 ffff880035c93df8
 ffffffffa041a5b8 ffff8800cd844c80 ffffffffa04385a0 ffff8800cd844cb0
 ffff880035c93e18 ffffffff81546cef ffff8800d45fea00 0000000000000008
Call Trace:
 [<ffffffffa041a5b8>] rxrpc_release+0x128/0x2e0 [af_rxrpc]
 [<ffffffff81546cef>] sock_release+0x1f/0x80
 [<ffffffff81546d62>] sock_close+0x12/0x20
 [<ffffffff811aaba1>] __fput+0xe1/0x230
 [<ffffffff811aad3e>] ____fput+0xe/0x10
 [<ffffffff810862cc>] task_work_run+0xbc/0xe0
 [<ffffffff8106a3be>] do_exit+0x2be/0xa10
 [<ffffffff8116dc47>] ? do_munmap+0x297/0x3b0
 [<ffffffff8106ab8f>] do_group_exit+0x3f/0xa0
 [<ffffffff8106ac04>] SyS_exit_group+0x14/0x20
 [<ffffffff8166b069>] system_call_fastpath+0x16/0x1b


Signed-off-by: Tim Smith <tim@electronghost.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-01-26 11:45:04 +00:00
Alexey Khoroshilov
8f22ba61b5 RxRPC: do not unlock unheld spinlock in rxrpc_connect_exclusive()
If rx->conn is not NULL, rxrpc_connect_exclusive() does not
acquire the transport's client lock, but it still releases it.

The patch adds locking of the spinlock to this path.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-01-26 11:39:51 +00:00
Ilya Dryomov
0b4af2e8c9 libceph: dout() is missing a newline
Add a missing newline to a dout() in __reset_osd().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-01-26 12:35:54 +02:00
Ilya Dryomov
eeb0bed557 libceph: add ceph_kv{malloc,free}() and switch to them
Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.

ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
behaviour:

- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
  and using vmalloc() just as a fallback

- for messages (ceph_msg_new()), from going to vmalloc() for anything
  bigger than a page

- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
  memory

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-26 12:34:23 +02:00
Linus Torvalds
4ba9920e5e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) BPF debugger and asm tool by Daniel Borkmann.

 2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

 3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

 4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match.  From Ben Hutchings.

 5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

 6) Implement auto corking in TCP, from Eric Dumazet.  Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

 7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

 8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

 9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

10) Fix ipv6 router reachability probing, from Jiri Benc.

11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

12) Support tunneling in GRO layer, from Jerry Chu.

13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI.  From Atzm Watanabe.

15) New "Heavy Hitter" qdisc, from Terry Lam.

16) Significantly improve the IPSEC support in pktgen, from Fan Du.

17) Allow ipv4 tunnels to cache routes, just like sockets.  From Tom
    Herbert.

18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

19) Allow openvswitch to mmap'd netlink, from Thomas Graf.

20) Key TCP metrics blobs also by source address, not just destination
    address.  From Christoph Paasch.

21) Support 10G in generic phylib.  From Andy Fleming.

22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided.  From Tom Herbert.

The wireless and netfilter folks have been busy little bees too.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
  net/cxgb4: Fix referencing freed adapter
  ipv6: reallocate addrconf router for ipv6 address when lo device up
  fib_frontend: fix possible NULL pointer dereference
  rtnetlink: remove IFLA_BOND_SLAVE definition
  rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
  qlcnic: update version to 5.3.55
  qlcnic: Enhance logic to calculate msix vectors.
  qlcnic: Refactor interrupt coalescing code for all adapters.
  qlcnic: Update poll controller code path
  qlcnic: Interrupt code cleanup
  qlcnic: Enhance Tx timeout debugging.
  qlcnic: Use bool for rx_mac_learn.
  bonding: fix u64 division
  rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
  sfc: Use the correct maximum TX DMA ring size for SFC9100
  Add Shradha Shah as the sfc driver maintainer.
  net/vxlan: Share RX skb de-marking and checksum checks with ovs
  tulip: cleanup by using ARRAY_SIZE()
  ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
  net/cxgb4: Don't retrieve stats during recovery
  ...
2014-01-25 11:17:34 -08:00
Gao feng
33d99113b1 ipv6: reallocate addrconf router for ipv6 address when lo device up
commit 25fb6ca4ed
"net IPv6 : Fix broken IPv6 routing table after loopback down-up"
allocates addrconf router for ipv6 address when lo device up.
but commit a881ae1f62
"ipv6:don't call addrconf_dst_alloc again when enable lo" breaks
this behavior.

Since the addrconf router is moved to the garbage list when
lo device down, we should release this router and rellocate
a new one for ipv6 address when lo device up.

This patch solves bug 67951 on bugzilla
https://bugzilla.kernel.org/show_bug.cgi?id=67951

change from v1:
use ip6_rt_put to repleace ip6_del_rt, thanks Hannes!
change code style, suggested by Sergei.

CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-24 15:59:38 -08:00
Oliver Hartkopp
a0065f266a fib_frontend: fix possible NULL pointer dereference
The two commits 0115e8e30d (net: remove delay at device dismantle) and
748e2d9396 (net: reinstate rtnl in call_netdevice_notifiers()) silently
removed a NULL pointer check for in_dev since Linux 3.7.

This patch re-introduces this check as it causes crashing the kernel when
setting small mtu values on non-ip capable netdevices.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-24 15:51:26 -08:00
Luis Henriques
c692554bf4 gss_krb5: use lcm from kernel lib
Replace hardcoded lowest common multiple algorithm by the lcm()
function in kernel lib.

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-01-24 15:58:44 -05:00
Linus Torvalds
3aacd625f2 Merge branch 'akpm' (incoming from Andrew)
Merge second patch-bomb from Andrew Morton:
 - various misc bits
 - the rest of MM
 - add generic fixmap.h, use it
 - backlight updates
 - dynamic_debug updates
 - printk() updates
 - checkpatch updates
 - binfmt_elf
 - ramfs
 - init/
 - autofs4
 - drivers/rtc
 - nilfs
 - hfsplus
 - Documentation/
 - coredump
 - procfs
 - fork
 - exec
 - kexec
 - kdump
 - partitions
 - rapidio
 - rbtree
 - userns
 - memstick
 - w1
 - decompressors

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (197 commits)
  lib/decompress_unlz4.c: always set an error return code on failures
  romfs: fix returm err while getting inode in fill_super
  drivers/w1/masters/w1-gpio.c: add strong pullup emulation
  drivers/memstick/host/rtsx_pci_ms.c: fix ms card data transfer bug
  userns: relax the posix_acl_valid() checks
  arch/sh/kernel/dwarf.c: use rbtree postorder iteration helper instead of solution using repeated rb_erase()
  fs-ext3-use-rbtree-postorder-iteration-helper-instead-of-opencoding-fix
  fs/ext3: use rbtree postorder iteration helper instead of opencoding
  fs/jffs2: use rbtree postorder iteration helper instead of opencoding
  fs/ext4: use rbtree postorder iteration helper instead of opencoding
  fs/ubifs: use rbtree postorder iteration helper instead of opencoding
  net/netfilter/ipset/ip_set_hash_netiface.c: use rbtree postorder iteration instead of opencoding
  rbtree/test: test rbtree_postorder_for_each_entry_safe()
  rbtree/test: move rb_node to the middle of the test struct
  rapidio: add modular rapidio core build into powerpc and mips branches
  partitions/efi: complete documentation of gpt kernel param purpose
  kdump: add /sys/kernel/vmcoreinfo ABI documentation
  kdump: fix exported size of vmcoreinfo note
  kexec: add sysctl to disable kexec_load
  fs/exec.c: call arch_pick_mmap_layout() only once
  ...
2014-01-23 19:11:50 -08:00
Linus Torvalds
6dd9158ae8 Merge git://git.infradead.org/users/eparis/audit
Pull audit update from Eric Paris:
 "Again we stayed pretty well contained inside the audit system.
  Venturing out was fixing a couple of function prototypes which were
  inconsistent (didn't hurt anything, but we used the same value as an
  int, uint, u32, and I think even a long in a couple of places).

  We also made a couple of minor changes to when a couple of LSMs called
  the audit system.  We hoped to add aarch64 audit support this go
  round, but it wasn't ready.

  I'm disappearing on vacation on Thursday.  I should have internet
  access, but it'll be spotty.  If anything goes wrong please be sure to
  cc rgb@redhat.com.  He'll make fixing things his top priority"

* git://git.infradead.org/users/eparis/audit: (50 commits)
  audit: whitespace fix in kernel-parameters.txt
  audit: fix location of __net_initdata for audit_net_ops
  audit: remove pr_info for every network namespace
  audit: Modify a set of system calls in audit class definitions
  audit: Convert int limit uses to u32
  audit: Use more current logging style
  audit: Use hex_byte_pack_upper
  audit: correct a type mismatch in audit_syscall_exit()
  audit: reorder AUDIT_TTY_SET arguments
  audit: rework AUDIT_TTY_SET to only grab spin_lock once
  audit: remove needless switch in AUDIT_SET
  audit: use define's for audit version
  audit: documentation of audit= kernel parameter
  audit: wait_for_auditd rework for readability
  audit: update MAINTAINERS
  audit: log task info on feature change
  audit: fix incorrect set of audit_sock
  audit: print error message when fail to create audit socket
  audit: fix dangling keywords in audit_log_set_loginuid() output
  audit: log on errors from filter user rules
  ...
2014-01-23 18:08:10 -08:00
Cody P Schafer
b182837ac1 net/netfilter/ipset/ip_set_hash_netiface.c: use rbtree postorder iteration instead of opencoding
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:03 -08:00
Alex Elder
04f9b74e4d remove extra definitions of U32_MAX
Now that the definition is centralized in <linux/kernel.h>, the
definitions of U32_MAX (and related) elsewhere in the kernel can be
removed.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Sage Weil <sage@inktank.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:55 -08:00
Alex Elder
77719536dc conditionally define U32_MAX
The symbol U32_MAX is defined in several spots.  Change these
definitions to be conditional.  This is in preparation for the next
patch, which centralizes the definition in <linux/kernel.h>.

Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Sage Weil <sage@inktank.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:54 -08:00
Jiri Pirko
813f020c5d rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
This check is not needed because the same check is done before
fill_slave_info is used in rtnl_link_slave_info_fill.
Also, by removing this check, kernel will fillup IFLA_INFO_SLAVE_KIND
even for slaves of masters which does not implement fill_slave_info.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-23 16:21:48 -08:00
Duan Jiong
11c21a307d ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
commit a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach")
clear IPCB in ip_tunnel_xmit()  , or else skb->cb[] may contain garbage from
GSO segmentation layer.

But commit 0e6fbc5b6c621("ip_tunnels: extend iptunnel_xmit()") refactor codes,
and it clear IPCB behind the dst_link_failure().

So clear IPCB in ip_tunnel_xmit() just like commti a622260254ee48("ip_tunnel:
fix kernel panic with icmp_dest_unreach").

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-23 13:23:16 -08:00
Vlad Yasevich
6ef7b8a23a net: Correctly sync addresses from multiple sources to single device
When we have multiple devices attempting to sync the same address
to a single destination, each device should be permitted to sync
it once.  To accomplish this, pass the 'sync_cnt' of the source
address when adding the addresss to the lower device.  'sync_cnt'
tracks how many time a given address has been succefully synced.
This way, we know that if the 'sync_cnt' passed in is 0, we should
sync this address.

Also, turn 'synced' member back into the counter as was originally
done in
   commit 4543fbefe6.
   net: count hw_addr syncs so that unsync works properly.
It tracks how many time a given address has been added via a
'sync' operation.  For every successfull 'sync' the counter is
incremented, and for ever 'unsync', the counter is decremented.
This makes sure that the address will be properly removed from
the the lower device when all the upper devices have removed it.

Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
CC: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
CC: Alexandra N. Kossovsky <Alexandra.Kossovsky@oktetlabs.ru>
CC: Konstantin Ushakov <Konstantin.Ushakov@oktetlabs.ru>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-23 13:06:34 -08:00
Shlomo Pongratz
a1d0cd8ed5 net/udp_offload: Handle static checker complaints
Fixed few issues around using __rcu prefix and rcu_assign_pointer, also
fixed a warning print to use ntohs(port) and not htons(port).

net/ipv4/udp_offload.c:112:9: error: incompatible types in comparison expression (different address spaces)
net/ipv4/udp_offload.c:113:9: error: incompatible types in comparison expression (different address spaces)
net/ipv4/udp_offload.c:176:19: error: incompatible types in comparison expression (different address spaces)

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-23 12:59:08 -08:00
Christoph Paasch
3ad88cf70a tcp: metrics: Handle v6/v4-mapped sockets in tcp-metrics
A socket may be v6/v4-mapped. In that case sk->sk_family is AF_INET6,
but the IP being used is actually an IPv4-address.
Current's tcp-metrics will thus represent it as an IPv6-address:

root@server:~# ip tcp_metrics
::ffff:10.1.1.2 age 22.920sec rtt 18750us rttvar 15000us cwnd 10
10.1.1.2 age 47.970sec rtt 16250us rttvar 10000us cwnd 10

This patch modifies the tcp-metrics so that they are able to handle the
v6/v4-mapped sockets correctly.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-23 12:48:28 -08:00
Linus Torvalds
5ee7a81a9f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
 "This contains a fix for a potential use-after-module-unload bug
  noticed by Al and caching improvements for read-only fuse filesystems
  by Andrew Gallagher"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: support clients that don't implement 'open'
  fuse: don't invalidate attrs when not using atime
  fuse: fix SetPageUptodate() condition in STORE
  fuse: fix pipe_buf_operations
2014-01-23 09:22:58 -08:00
Yann Droneaud
5ae5e991ee 6lowpan: add a license to 6lowpan_iphc module
Since commit 8df8c56a5a, 6lowpan_iphc is a module of its own.

Unfortunately, it lacks some infrastructure to behave like a
good kernel citizen:

  kernel: 6lowpan_iphc: module license 'unspecified' taints kernel.
  kernel: Disabling lock debugging due to kernel taint

This patch adds the basic MODULE_LICENSE(); with GPL license:
the code was copied from net/ieee802154/6lowpan.c which is GPL
and the module exports symbol with EXPORT_SYMBOL_GPL();.

Cc: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:57:17 -08:00
Jiri Pirko
3bad540ed8 bonding: convert netlink to use slave data info api
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:57:17 -08:00
Jiri Pirko
ba7d49b1f0 rtnetlink: provide api for getting and setting slave info
Recent patch
bonding: add netlink attributes to slave link dev (1d3ee88ae0)

Introduced yet another device specific way to access slave information
over rtnetlink. There is one already there for bridge.

This patch introduces generic way to do this, for getting and setting
info as well by extending link_ops. Later on, this new interface will
be used for bridge ports as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:57:05 -08:00
Jiri Pirko
df7dbcbbaf rtnetlink: put "BOND" into nl attribute names which are related to bonding
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:57:05 -08:00
viresh kumar
f618002b0b net/neighbour: queue work on power efficient wq
Workqueue used in neighbour layer have no real dependency of scheduling these on
the cpu which scheduled them.

On a idle system, it is observed that an idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions. This
doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
enabled.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:57:05 -08:00
viresh kumar
906e073f3e net/ipv4: queue work on power efficient wq
Workqueue used in ipv4 layer have no real dependency of scheduling these on the
cpu which scheduled them.

On a idle system, it is observed that an idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions. This
doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
enabled.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:57:05 -08:00
FX Le Bail
7c90cc2d40 ipv6: enable anycast addresses as source addresses for datagrams
This change allows to consider an anycast address valid as source address
when given via an IPV6_PKTINFO or IPV6_2292PKTINFO ancillary data item.
So, when sending a datagram with ancillary data, the unicast and anycast
addresses are handled in the same way.

- Adds ipv6_chk_acast_addr_src() to check if an anycast address is link-local
  on given interface or is global.
- Uses it in ip6_datagram_send_ctl().

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:57:05 -08:00
Toshiaki Makita
bdf4351bbc bridge: Remove unnecessary vlan_put_tag in br_handle_vlan
br_handle_vlan() pushes HW accelerated vlan tag into skbuff when outgoing
port is the bridge device.
This is unnecessary because __netif_receive_skb_core() can handle skbs
with HW accelerated vlan tag. In current implementation,
__netif_receive_skb_core() needs to extract the vlan tag embedded in skb
data. This could cause low network performance especially when receiving
frames at a high frame rate on the bridge device.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:29:27 -08:00
Christoph Paasch
00ca9c5b2b tcp: metrics: Fix rcu-race when deleting multiple entries
In bbf852b96e I introduced the tmlist, which allows to delete
multiple entries from the cache that match a specified destination if no
source-IP is specified.

However, as the cache is an RCU-list, we should not create this tmlist, as
it will change the tcpm_next pointer of the element that will be deleted
and so a thread iterating over the cache's entries while holding the
RCU-lock might get "redirected" to this tmlist.

This patch fixes this, by reverting back to the old behavior prior to
bbf852b96e, which means that we simply change the tcpm_next
pointer of the previous element (pp) to jump over the one we are
deleting.
The difference is that we call kfree_rcu() directly on the cache entry,
which allows us to delete multiple entries from the list.

Fixes: bbf852b96e (tcp: metrics: Delete all entries matching a certain destination)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:26:16 -08:00
Linus Torvalds
bb1281f2aa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual rocket science stuff from trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  neighbour.h: fix comment
  sched: Fix warning on make htmldocs caused by wait.h
  slab: struct kmem_cache is protected by slab_mutex
  doc: Fix typo in USB Gadget Documentation
  of/Kconfig: Spelling s/one/once/
  mkregtable: Fix sscanf handling
  lp5523, lp8501: comment improvements
  thermal: rcar: comment spelling
  treewide: fix comments and printk msgs
  IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart()
  Documentation: update /proc/uptime field description
  Documentation: Fix size parameter for snprintf
  arm: fix comment header and macro name
  asm-generic: uaccess: Spelling s/a ny/any/
  mtd: onenand: fix comment header
  doc: driver-model/platform.txt: fix a typo
  drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text
  doc: Fix typo (acces_process_vm -> access_process_vm)
  treewide: Fix typos in printk
  drivers/gpu/drm/qxl/Kconfig: reformat the help text
  ...
2014-01-22 21:21:55 -08:00
Harry Mason
29824310ce sch_htb: let skb->priority refer to non-leaf class
If the class in skb->priority is not a leaf, apply filters from the
selected class, not the qdisc. This lets netfilter or user space
partially classify the packet.

Signed-off-by: Harry Mason <harry.mason@smoothwall.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 17:39:48 -08:00
Neil Horman
2d36097d26 af_packet: Add Queue mapping mode to af_packet fanout operation
This patch adds a queue mapping mode to the fanout operation of af_packet
sockets.  This allows user space af_packet users to better filter on flows
ingressing and egressing via a specific hardware queue, and avoids the potential
packet reordering that can occur when FANOUT_CPU is being used and irq affinity
varies.

Tested successfully by myself.  applies to net-next

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 17:35:50 -08:00
Miklos Szeredi
28a625cbc2 fuse: fix pipe_buf_operations
Having this struct in module memory could Oops when if the module is
unloaded while the buffer still persists in a pipe.

Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
merge them into nosteal_pipe_buf_ops (this is the same as
default_pipe_buf_ops except stealing the page from the buffer is not
allowed).

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2014-01-22 19:36:57 +01:00
Hannes Frederic Sowa
809fa972fd reciprocal_divide: update/correction of the algorithm
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.

This has been fixed by Eric Dumazet in commit aee636c480
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
where very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:

1) Initialization:

  int l = ceil(log_2 d)
  uword m' = floor((1<<32)*((1<<l)-d)/d)+1
  int sh_1 = min(l,1)
  int sh_2 = max(l-1,0)

2) For q = n/d, all uword:

  uword t = (n*m')>>32
  q = (t+((n-t)>>sh_1))>>sh_2

The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.

Joint work with Daniel Borkmann.

  [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
  [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
  [3] https://gmplib.org/~tege/division-paper.pdf
  [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
  [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
  [6] http://www.agner.org/optimize/asmlib.zip

Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 23:17:20 -08:00
Daniel Borkmann
89770b0a69 net: introduce reciprocal_scale helper and convert users
As David Laight suggests, we shouldn't necessarily call this
reciprocal_divide() when users didn't requested a reciprocal_value();
lets keep the basic idea and call it reciprocal_scale(). More
background information on this topic can be found in [1].

Joint work with Hannes Frederic Sowa.

  [1] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

Suggested-by: David Laight <david.laight@aculab.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 23:17:20 -08:00
Daniel Borkmann
f337db64af random32: add prandom_u32_max and convert open coded users
Many functions have open coded a function that returns a random
number in range [0,N-1]. Under the assumption that we have a PRNG
such as taus113 with being well distributed in [0, ~0U] space,
we can implement such a function as uword t = (n*m')>>32, where
m' is a random number obtained from PRNG, n the right open interval
border and t our resulting random number, with n,m',t in u32 universe.

Lets go with Joe and simply call it prandom_u32_max(), although
technically we have an right open interval endpoint, but that we
have documented. Other users can further be migrated to the new
prandom_u32_max() function later on; for now, we need to make sure
to migrate reciprocal_divide() users for the reciprocal_divide()
follow-up fixup since their function signatures are going to change.

Joint work with Hannes Frederic Sowa.

Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 23:17:20 -08:00
David S. Miller
14e481445d net: Missing change from the ether_addr_copy() fixups.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 22:54:01 -08:00
David S. Miller
9be68c1ae0 net: Fix some fallout from the etner_addr_copy() changes.
net/appletalk/aarp.c: In function ‘__aarp_send_query’:
net/appletalk/aarp.c:137:2: error: implicit declaration of function ‘ether_addr_copy’ [-Werror=implicit-function-declaration]
 ...
net/atm/lec.c: In function ‘send_to_lecd’:
net/atm/lec.c:524:3: warning: passing argument 1 of ‘ether_addr_copy’ from incompatible pointer type [enabled by default]
In file included from net/atm/lec.c:17:0:
include/linux/etherdevice.h:227:20: note: expected ‘u8 *’ but argument is of type ‘unsigned char (*)[6]’
 ...
net/caif/caif_usb.c: In function ‘cfusbl_create’:
net/caif/caif_usb.c:108:2: error: implicit declaration of function ‘ether_addr_copy’ [-Werror=implicit-function-declaration]

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:57:26 -08:00
wangweidong
5bc1d1b4a2 sctp: remove macros sctp_bh_[un]lock_sock
Redefined bh_[un]lock_sock to sctp_bh[un]lock_sock for user
space friendly code which we haven't use in years, so removing them.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:41:36 -08:00
wangweidong
048ed4b626 sctp: remove macros sctp_{lock|release}_sock
Redefined {lock|release}_sock to sctp_{lock|release}_sock for user space friendly
code which we haven't use in years, so removing them.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:41:36 -08:00
wangweidong
387602dfdc sctp: remove macros sctp_write_[un]_lock
Redefined write_[un]lock to sctp_write_[un]lock for user space
friendly code which we haven't use in years, so removing them.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:40:41 -08:00
wangweidong
3c8e43ba9f sctp: remove macros sctp_spin_[un]lock
Redefined spin_[un]lock to sctp_spin_[un]lock for user space friendly
code which we haven't use in years, so removing them.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:40:41 -08:00
wangweidong
79b91130a2 sctp: remove macros sctp_local_bh_{disable|enable}
Redefined local_bh_{disable|enable} to sctp_local_bh_{disable|enable}
for user space friendly code which we haven't use in years, so removing them.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:40:40 -08:00
Joe Perches
d08f161a10 dsa: Use ether_addr_copy
Use ether_addr_copy instead of memcpy(a, b, ETH_ALEN) to
save some cycles on arm and powerpc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:13:05 -08:00
Joe Perches
9ea08b1299 pktgen: Use ether_addr_copy
Use ether_addr_copy instead of memcpy(a, b, ETH_ALEN) to
save some cycles on arm and powerpc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:13:05 -08:00
Joe Perches
c62326abac netpoll: Use ether_addr_copy
Use ether_addr_copy instead of memcpy(a, b, ETH_ALEN) to
save some cycles on arm and powerpc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:13:05 -08:00
Joe Perches
34b2cff4ee caif_usb: Use ether_addr_copy
Use ether_addr_copy instead of memcpy(a, b, ETH_ALEN) to
save some cycles on arm and powerpc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:13:04 -08:00
Joe Perches
116e853f7f atm: Use ether_addr_copy
Use ether_addr_copy instead of memcpy(a, b, ETH_ALEN) to
save some cycles on arm and powerpc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:13:04 -08:00
Joe Perches
90ccb6aa40 appletalk: Use ether_addr_copy
Use ether_addr_copy instead of memcpy(a, b, ETH_ALEN) to
save some cycles on arm and powerpc.

Convert struct aarp_entry.hwaddr[6] to hwaddr[ETH_ALEN].

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:13:04 -08:00
Joe Perches
07fc67befd 8021q: Use ether_addr_copy
Use ether_addr_copy instead of memcpy(a, b, ETH_ALEN) to
save some cycles on arm and powerpc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:13:04 -08:00
Or Gerlitz
e27a2f8395 net: Export gro_find_by_type helpers
Export the gro_find_receive/complete_by_type helpers to they can be invoked
by the gro callbacks of encapsulation protocols such as vxlan.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:05:04 -08:00
Or Gerlitz
b582ef0990 net: Add GRO support for UDP encapsulating protocols
Add GRO handlers for protocols that do UDP encapsulation, with the intent of
being able to coalesce packets which encapsulate packets belonging to
the same TCP session.

For GRO purposes, the destination UDP port takes the role of the ether type
field in the ethernet header or the next protocol in the IP header.

The UDP GRO handler will only attempt to coalesce packets whose destination
port is registered to have gro handler.

Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive
code twice on a packet. This solves the problem of udp encapsulated packets whose
inner VM packet is udp and happen to carry a port which has registered offloads.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:05:04 -08:00
Linus Torvalds
f075e0f699 Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The bulk of changes are cleanups and preparations for the upcoming
  kernfs conversion.

   - cgroup_event mechanism which is and will be used only by memcg is
     moved to memcg.

   - pidlist handling is updated so that it can be served by seq_file.

     Also, the list is not sorted if sane_behavior.  cgroup
     documentation explicitly states that the file is not sorted but it
     has been for quite some time.

   - All cgroup file handling now happens on top of seq_file.  This is
     to prepare for kernfs conversion.  In addition, all operations are
     restructured so that they map 1-1 to kernfs operations.

   - Other cleanups and low-pri fixes"

* 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
  cgroup: trivial style updates
  cgroup: remove stray references to css_id
  doc: cgroups: Fix typo in doc/cgroups
  cgroup: fix fail path in cgroup_load_subsys()
  cgroup: fix missing unlock on error in cgroup_load_subsys()
  cgroup: remove for_each_root_subsys()
  cgroup: implement for_each_css()
  cgroup: factor out cgroup_subsys_state creation into create_css()
  cgroup: combine css handling loops in cgroup_create()
  cgroup: reorder operations in cgroup_create()
  cgroup: make for_each_subsys() useable under cgroup_root_mutex
  cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
  cgroup: unify pidlist and other file handling
  cgroup: replace cftype->read_seq_string() with cftype->seq_show()
  cgroup: attach cgroup_open_file to all cgroup files
  cgroup: generalize cgroup_pidlist_open_file
  cgroup: unify read path so that seq_file is always used
  cgroup: unify cgroup_write_X64() and cgroup_write_string()
  cgroup: remove cftype->read(), ->read_map() and ->write()
  hugetlb_cgroup: convert away from cftype->read()
  ...
2014-01-21 17:51:34 -08:00
Dan Carpenter
08d4d217df rxrpc: out of bound read in debug code
Smatch complains because we are using an untrusted index into the
rxrpc_acks[] array.  It's just a read and it's only in the debug code,
but it's simple enough to add a check and fix it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 17:02:52 -08:00
Yegor Yefremov
2fa053a0a2 8021q: update description
Replace deprecated 'vconfig' tool with 'ip' from 'iproute2'. Add
some beautifications like replacing 'ethernet' with 'Ethernet' and
removing unneeded spaces.

Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 17:01:25 -08:00
Hannes Frederic Sowa
82b276cd2b ipv6: protect protocols not handling ipv4 from v4 connection/bind attempts
Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow
connecting and binding to them. sendmsg logic does already check msg->name
for this but must trust already connected sockets which could be set up
for connection to ipv4 address family.

Per-socket flag ipv6only is of no use here, as it is under users control
by setsockopt.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:59:19 -08:00
FX Le Bail
446fab5933 ipv6: enable anycast addresses as source addresses in ICMPv6 error messages
- Uses ipv6_anycast_destination() in icmp6_send().

Suggested-by: Bill Fink <billfink@mindspring.com>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:53:23 -08:00
Peter Pan(潘卫平)
4d83e17730 tcp: delete redundant calls of tcp_mtup_init()
As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen
sock, we can delete the redundant calls of tcp_mtup_init() in
tcp_{v4,v6}_syn_recv_sock().

Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:52:31 -08:00
Daniel Borkmann
f0d4eb29d1 packet: fix a couple of cppcheck warnings
Doesn't bring much, but also doesn't hurt us to fix 'em:

1) In tpacket_rcv() flush dcache page we can restirct the scope
   for start and end and remove one layer of indent.

2) In tpacket_destruct_skb() we can restirct the scope for ph.

3) In alloc_one_pg_vec_page() we can remove the NULL assignment
   and change spacing a bit.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:51:42 -08:00
Duan Jiong
967680e027 ipv4: remove the useless argument from ip_tunnel_hash()
Since commit c544193214("GRE: Refactor GRE tunneling code")
introduced function ip_tunnel_hash(), the argument itn is no
longer in use, so remove it.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:48:18 -08:00
Sabrina Dubroca
f14fe8a848 net: remove unnecessary initializations in net_dev_init
softnet_data is already set to 0, no need to use memset or initialize
specific fields to 0 or NULL afterwards.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 16:44:45 -08:00
WANG Cong
6e6a50c254 net_sched: act: export tcf_hash_search() instead of tcf_hash_lookup()
So that we will not expose struct tcf_common to modules.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 14:43:16 -08:00
WANG Cong
c779f7af99 net_sched: act: fetch hinfo from a->ops->hinfo
Every action ops has a pointer to hash info, so we don't need to
hard-code it in each module.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 14:43:16 -08:00
Linus Torvalds
8e50966072 GPIO tree bulk changes for v3.14
A big set this merge window, as we have much going on in
 this subsystem. Major changes this time:
 
 - Some core improvements and cleanups to the new GPIO
   descriptor API. This seems to be working now so we can
   start the exodus to this API, moving gradually away from
   the global GPIO numberspace.
 
 - Incremental improvements to the ACPI GPIO core, and move
   the few GPIO ACPI clients we have to the GPIO descriptor
   API right *now* before we go any further. We actually
   managed to contain this *before* we started to litter
   the kernel with yet another hackish global numberspace for
   the ACPI GPIOs, which is a big win.
 
 - The RFkill GPIO driver and all platforms using it have
   been migrated to use the GPIO descriptors rather than
   fixed number assignments. Tegra machine has been migrated
   as part of this.
 
 - New drivers for MOXA ART, Xtensa GPIO32 and SMSC SCH311x.
   Those should be really good examples of how I expect a
   nice GPIO driver to look these days.
 
 - Do away with custom GPIO implementations on a major
   part of the ARM machines: ks8695, lpc32xx, mv78xx0.
   Make a first step towards the same in the horribly
   convoluted Samsung S3C include forest. We expect to
   continue to clean this up as we move forward.
 
 - Flag GPIO lines used for IRQ on adnp, bcm-kona, em,
   intel-mid and lynxpoint.
   This makes the GPIOlib core aware that a certain GPIO line
   is used for IRQs and can then enforce some semantics such
   as disallowing a GPIO line marked as in use for IRQ to be
   switched to output mode.
 
 - Drop all use of irq_set_chip_and_handler_name().
   The name provided in these cases were just unhelpful
   tags like "mux" or "demux".
 
 - Extend the MCP23s08 driver to handle interrupts.
 
 - Minor incremental improvements for rcar, lynxpoint, em
   74x164 and msm drivers.
 
 - Some non-urgent bug fixes here and there, duplicate
   #includes and that usual kind of cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS3i/MAAoJEEEQszewGV1zVB8P/Rjzgx8To0gQPn49M4u/A1Mk
 mAzpUoKa05ILTKBm/bpWPYZPpg9PDqUxOYPsIDEAkc70BKMPTXxrYiE+LSfIzwaJ
 a8IRwOzNL7Iwc+zPNS/GrmRJyxymb4lmMD/fypk/YaumZ6j4Hbo+9R8Zct9gbZ5Q
 ZbKtz6kLhbkbNCc71bVMgk6yacSBx1ak8Xpd12HlW85NgOCoBj7/DI1Lb61x1ImY
 NYpSpmtfGGTkQLtBl5dTLefZOvL1dKSct9TMOsA2jzNqf3zA1YA6XOxPGHK/qtjq
 3s9cN1sIVF/g7sm1+qoKXe0OTQrXHT7SX8BH9/tb3MiKO8ItactlQUJlYNR3WFSN
 zm1PNe5zWr+GWzV0iUrqoMN4XX8nThiFDOxZpOwBTZcUD6qtDFIZp41M3qxwFTbJ
 hCtSQ8gUO1Ce+xtOQYYOwEkRS7FZa1Z+p/lendTFuGDh6DcXy97SrKkTktM4Q98B
 LhqrwUzCdES0ecNDi2+P5y4Fc7M0cMMn9SnFvbSBObLB89TF9uzMIn8jUBCZMvrM
 eAeZlRBYk8F+6F12higaWqZyiBKIEubXo/Z8T0L2KEDm/z/ddJvhQgBKvWlf3rqi
 RToD446rda+RhFBnxLZ3mTui5nZ2WyKTOqhVqeBuriJhE/cTUaQHUBUrbOwx20kE
 Xb9mQ2n3GRk2157n1CLY
 =lW2i
 -----END PGP SIGNATURE-----

Merge tag 'gpio-v3.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull GPIO tree bulk changes from Linus Walleij:
 "A big set this merge window, as we have much going on in this
  subsystem.  The changes to other subsystems (notably a slew of ARM
  machines as I am doing away with their custom APIs) have all been
  ACKed to the extent possible.

  Major changes this time:

   - Some core improvements and cleanups to the new GPIO descriptor API.
     This seems to be working now so we can start the exodus to this
     API, moving gradually away from the global GPIO numberspace.

   - Incremental improvements to the ACPI GPIO core, and move the few
     GPIO ACPI clients we have to the GPIO descriptor API right *now*
     before we go any further.  We actually managed to contain this
     *before* we started to litter the kernel with yet another hackish
     global numberspace for the ACPI GPIOs, which is a big win.

   - The RFkill GPIO driver and all platforms using it have been
     migrated to use the GPIO descriptors rather than fixed number
     assignments.  Tegra machine has been migrated as part of this.

   - New drivers for MOXA ART, Xtensa GPIO32 and SMSC SCH311x.  Those
     should be really good examples of how I expect a nice GPIO driver
     to look these days.

   - Do away with custom GPIO implementations on a major part of the ARM
     machines: ks8695, lpc32xx, mv78xx0.  Make a first step towards the
     same in the horribly convoluted Samsung S3C include forest.  We
     expect to continue to clean this up as we move forward.

   - Flag GPIO lines used for IRQ on adnp, bcm-kona, em, intel-mid and
     lynxpoint.

     This makes the GPIOlib core aware that a certain GPIO line is used
     for IRQs and can then enforce some semantics such as disallowing a
     GPIO line marked as in use for IRQ to be switched to output mode.

   - Drop all use of irq_set_chip_and_handler_name().  The name provided
     in these cases were just unhelpful tags like "mux" or "demux".

   - Extend the MCP23s08 driver to handle interrupts.

   - Minor incremental improvements for rcar, lynxpoint, em 74x164 and
     msm drivers.

   - Some non-urgent bug fixes here and there, duplicate #includes and
     that usual kind of cleanups"

Fix up broken Kconfig file manually to make this all compile.

* tag 'gpio-v3.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (71 commits)
  gpio: mcp23s08: fix casting caused build warning
  gpio: mcp23s08: depend on OF_GPIO
  gpio: mcp23s08: Add irq functionality for i2c chips
  ARM: S5P[v210|c100|64x0]: Fix build error
  gpio: pxa: clamp gpio get value to [0,1]
  ARM: s3c24xx: explicit dependency on <plat/gpio-cfg.h>
  ARM: S3C[24|64]xx: move includes back under <mach/> scope
  Documentation / ACPI: update to GPIO descriptor API
  gpio / ACPI: get rid of acpi_gpio.h
  gpio / ACPI: register to ACPI events automatically
  mmc: sdhci-acpi: convert to use GPIO descriptor API
  ARM: s3c24xx: fix build error
  gpio: f7188x: set can_sleep attribute
  gpio: samsung: Update documentation
  gpio: samsung: Remove hardware.h inclusion
  gpio: xtensa: depend on HAVE_XTENSA_GPIO32
  gpio: clps711x: Enable driver compilation with COMPILE_TEST
  gpio: clps711x: Use of_match_ptr()
  net: rfkill: gpio: convert to descriptor-based GPIO interface
  leds: s3c24xx: Fix build failure
  ...
2014-01-21 10:09:12 -08:00
Linus Torvalds
a0fa1dd3cd Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:

 - Add the initial implementation of SCHED_DEADLINE support: a real-time
   scheduling policy where tasks that meet their deadlines and
   periodically execute their instances in less than their runtime quota
   see real-time scheduling and won't miss any of their deadlines.
   Tasks that go over their quota get delayed (Available to privileged
   users for now)

 - Clean up and fix preempt_enable_no_resched() abuse all around the
   tree

 - Do sched_clock() performance optimizations on x86 and elsewhere

 - Fix and improve auto-NUMA balancing

 - Fix and clean up the idle loop

 - Apply various cleanups and fixes

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  sched: Fix __sched_setscheduler() nice test
  sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
  sched: Fix up attr::sched_priority warning
  sched: Fix up scheduler syscall LTP fails
  sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
  sched/core: Fix htmldocs warnings
  sched/deadline: No need to check p if dl_se is valid
  sched/deadline: Remove unused variables
  sched/deadline: Fix sparse static warnings
  m68k: Fix build warning in mac_via.h
  sched, thermal: Clean up preempt_enable_no_resched() abuse
  sched, net: Fixup busy_loop_us_clock()
  sched, net: Clean up preempt_enable_no_resched() abuse
  sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
  sched/preempt, locking: Rework local_bh_{dis,en}able()
  sched/clock, x86: Avoid a runtime condition in native_sched_clock()
  sched/clock: Fix up clear_sched_clock_stable()
  sched/clock, x86: Use a static_key for sched_clock_stable
  sched/clock: Remove local_irq_disable() from the clocks
  sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
  ...
2014-01-20 10:42:08 -08:00
Weilong Chen
82ef3d5d5f net: fix "queues" uevent between network namespaces
When I create a new namespace with 'ip netns add net0', or add/remove
new links in a namespace with 'ip link add/delete type veth', rx/tx
queues events can be got in all namespaces. That is because rx/tx queue
ktypes do not have namespace support, and their kobj parents are setted to
NULL. This patch is to fix it.

Reported-by: Libo Chen <chenlibo@huawei.com>
Signed-off-by: Libo Chen <chenlibo@huawei.com>
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 20:02:02 -08:00
WANG Cong
671314a5ab net_sched: act: remove capab from struct tc_action_ops
It is not actually implemented.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 19:58:07 -08:00
Jason Wang
9d08dd3d32 net: document accel_priv parameter for __dev_queue_xmit()
To silent "make htmldocs" warning.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 19:55:51 -08:00