Commit 9cd4002942 (Fix race between
neigh_parms_release and neightbl_fill_parms) introduced device
reference counting regressions for several people, see:
http://bugzilla.kernel.org/show_bug.cgi?id=9778
for example.
Signed-off-by: David S. Miller <davem@davemloft.net>
When packets are flood-forwarded to multiple output devices, the
bridge-netfilter code reuses skb->nf_bridge for each clone to store
the bridge port. When queueing packets using NFQUEUE netfilter takes
a reference to skb->nf_bridge->physoutdev, which is overwritten
when the packet is forwarded to the second port. This causes
refcount unterflows for the first device and refcount leaks for all
others. Additionally this provides incorrect data to the iptables
physdev match.
Unshare skb->nf_bridge by copying it if it is shared before assigning
the physoutdev device.
Reported, tested and based on initial patch by
Jan Christoph Nordholz <hesso@pool.math.tu-berlin.de>.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We omit (or delay) sending NSes for known-to-unreachable routers (in
NUD_FAILED state) according to RFC 4191 (Default Router Preferences
and More-Specific Routes). But this is not fully compatible with RFC
4861 (Neighbor Discovery Protocol for IPv6), which does not remember
unreachability of neighbors.
So, let's avoid mixing sending algorithm of RFC 4191 and that of RFC
4861, and make the algorithm more friendly with RFC 4861 if RFC 4191
is disabled.
Issue was found by IPv6 Ready Logo Core Self_Test 1.5.0b2 (by TAHI
Project), and has been tracked down by Mitsuru Chinen
<mitch@linux.vnet.ibm.com>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed "ip route list" was slower than "cat /proc/net/route" on a
machine with a full Internet routing table (214392 entries : Special
thanks to Robert ;) )
This is similar to problem reported in commit
d8c9283089 ("[IPV4] ROUTE: ip_rt_dump()
is unecessary slow")
Fix is to avoid scanning the begining of fz_hash table, but directly
seek to the right offset.
Before patch :
time ip route >/tmp/ROUTE
real 0m1.285s
user 0m0.712s
sys 0m0.436s
After patch
# time ip route >/tmp/ROUTE
real 0m0.835s
user 0m0.692s
sys 0m0.124s
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
http://bugzilla.kernel.org/show_bug.cgi?id=9493
The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.
Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
http://bugzilla.kernel.org/show_bug.cgi?id=9493
The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.
Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When looking for a conflicting connection the !sk->sk_bound_dev_if
check is performed only for live sockets, but not for timewait-ed.
This is not the case for ipv4, for __inet6_lookup_established in
both ipv4 and ipv6 and for other places that check for tw-s.
Was this missed accidentally? If so, then this patch fixes it and
besides makes use if the dif variable declared in the function.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Code inspection turned up that error cases in rfkill_register() do not
call rfkill_led_trigger_unregister() even though we have already
registered.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The bridge code incorrectly causes two POST_ROUTING hook invocations
for DNATed packets that end up on the same bridge device. This
happens because packets with a changed destination address are passed
to dst_output() to make them go through the neighbour output function
again to build a new destination MAC address, before they will continue
through the IP hooks simulated by bridge netfilter.
The resulting hook order is:
PREROUTING (bridge netfilter)
POSTROUTING (dst_output -> ip_output)
FORWARD (bridge netfilter)
POSTROUTING (bridge netfilter)
The deferred hooks used to abort the first POST_ROUTING invocation,
but since the only thing bridge netfilter actually really wants is
a new MAC address, we can avoid going through the IP stack completely
by simply calling the neighbour output function directly.
Tested, reported and lots of data provided by: Damien Thebault <damien.thebault@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the @helper variable that was just obtained.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC2464 says that the next to lowerst order bit of the first octet
of the Interface Identifier is formed by complementing
the Universal/Local bit of the EUI-64. But ip6t_eui64 uses OR not XOR.
Thanks Peter Ivancik for reporing this bug and posting a patch
for it.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow vlans nesting other vlans without lockdep's warnings (max. 2 levels
i.e. parent + child). Thanks to Patrick McHardy for pointing a bug in the
first version of this patch.
Reported-by: Benny Amorsen
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In dn_rt_cache_get_next(), no need to guard seq->private by a
rcu_dereference() since seq is private to the thread running this
function. Reading seq.private once (as guaranted bu rcu_dereference())
or several time if compiler really is dumb enough wont change the
result.
But we miss real spots where rcu_dereference() are needed, both in
dn_rt_cache_get_first() and dn_rt_cache_get_next()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) In tty.c the BUG_ON at line 115 will never be called, because the the
before list_del_init in this same function.
115 BUG_ON(!list_empty(&dev->list));
So move the list_del_init to rfcomm_dev_del
2) The rfcomm_dev_del could be called from diffrent path
(rfcomm_tty_hangup/rfcomm_dev_state_change/rfcomm_release_dev),
So add another BUG_ON when the rfcomm_dev_del is called more than
one time.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bernard Pidoux F6BVP reported:
> When I killall kissattach I can see the following message.
>
> This happens on kernel 2.6.24-rc5 already patched with the 6 previously
> patches I sent recently.
>
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.23.9 #1
> -------------------------------------------------------
> kissattach/2906 is trying to acquire lock:
> (linkfail_lock){-+..}, at: [<d8bd4603>] ax25_link_failed+0x11/0x39 [ax25]
>
> but task is already holding lock:
> (ax25_list_lock){-+..}, at: [<d8bd7c7c>] ax25_device_event+0x38/0x84
> [ax25]
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
...
lockdep is worried about the different order here:
#1 (rose_neigh_list_lock){-+..}:
#3 (ax25_list_lock){-+..}:
#0 (linkfail_lock){-+..}:
#1 (rose_neigh_list_lock){-+..}:
#3 (ax25_list_lock){-+..}:
#0 (linkfail_lock){-+..}:
So, ax25_list_lock could be taken before and after linkfail_lock.
I don't know if this three-thread clutch is very probable (or
possible at all), but it seems another bug reported by Bernard
("[...] system impossible to reboot with linux-2.6.24-rc5")
could have similar source - namely ax25_list_lock held by
ax25_kill_by_device() during ax25_disconnect(). It looks like the
only place which calls ax25_disconnect() this way, so I guess, it
isn't necessary.
This patch is breaking the lock for ax25_disconnect().
Reported-and-tested-by: Bernard Pidoux <f6bvp@free.fr>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfuzz can easily trigger any of those.
move the printk message to the corresponding comment: makes the
intention of the code clear and easy to pick up on an scheduled
removal. as bonus simplify the braces placement.
Signed-off-by: maximilian attems <max@stro.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rt_cache_get_next(), no need to guard seq->private by a
rcu_dereference() since seq is private to the thread running this
function. Reading seq.private once (as guaranted bu rcu_dereference())
or several time if compiler really is dumb enough wont change the
result.
But we miss real spots where rcu_dereference() are needed, both in
rt_cache_get_first() and rt_cache_get_next()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neightbl_fill_parms() is called under the write-locked tbl->lock
and accesses the parms->dev. The negh_parm_release() calls the
dev_put(parms->dev) without this lock. This creates a tiny race window
on which the parms contains potentially stale dev pointer.
To fix this race it's enough to move the dev_put() upper under the
tbl->lock, but note, that the parms are held by neighbors and thus can
live after the neigh_parms_release() is called, so we still can have a
parm with bad dev pointer.
I didn't find where the neigh->parms->dev is accessed, but still think
that putting the dev is to be done in a place, where the parms are
really freed. Am I right with that?
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Al went through the ip_fast_csum callers and found this piece of code
that did not validate the IP header. While root crashing the machine
by sending bogus packets through raw or AF_PACKET sockets isn't that
serious, it is still nice to react gracefully.
This patch ensures that the skb has enough data for an IP header and
that the header length field is valid.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
alg_key_len is the length in bits of the key, not in bytes.
Best way to fix this is to move alg_len() function from net/xfrm/xfrm_user.c
to include/net/xfrm.h, and to use it in xfrm_algo_clone()
alg_len() is renamed to xfrm_alg_len() because of its global exposition.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
lro_mgr->features contains a bitmask of LRO_F_* values which are
defined as power of two, not as bit indexes.
They must be checked with x&LRO_F_FOO, not with test_bit(LRO_F_FOO,&x).
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Andrew Gallatin <gallatin@myri.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both NetLabel and SELinux (other LSMs may grow to use it as well) rely
on the 'iif' field to determine the receiving network interface of
inbound packets. Unfortunately, at present this field is not
preserved across a skb clone operation which can lead to garbage
values if the cloned skb is sent back through the network stack. This
patch corrects this problem by properly copying the 'iif' field in
__skb_clone() and removing the 'iif' field assignment from
skb_act_clone() since it is no longer needed.
Also, while we are here, put the assignments in the same order as the
offsets to reduce cacheline bounces.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This finally adds the code in net_rx_action() to break out of the
->poll()'ing loop when a napi_disable() is found to be pending.
Now, even if a device is being flooded with packets it can be cleanly
brought down.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently mac80211 fails silently when trying to set a nonexistent
rate. Return an error instead.
Signed-Off-By: Andy Lutomirski <luto@myrealbox.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
easy to trigger as user with sfuzz.
irda_create() is quiet on unknown sock->type,
match this behaviour for SOCK_DGRAM unknown protocol
Signed-off-by: maximilian attems <max@stro.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some recent changes completely removed accounting for the FORWARD_TSN
parameter length in the INIT and INIT-ACK chunk. This is wrong and
should be restored.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When processing an unexpected INIT chunk, we do not need to
do any preservation of the old AUTH parameters. In fact,
doing such preservations will nullify AUTH and allow connection
stealing.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The even should be called SCTP_AUTHENTICATION_INDICATION.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent changes for ip command line processing fixed some problems
but unfortunately broke some common usage scenarios. In current
2.6.24-rc6 the following command line results in no IP address
assignment, which is surely a regression:
ip=10.0.2.15::10.0.2.2:255.255.255.0::eth0:off
Please find below a patch that works for all cases I can find.
Signed-off-by: Amos Waterland <apw@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently check that iph->ihl is bounded by the real length and that
the real length is greater than the minimum IP header length. However,
we did not check the caes where iph->ihl is less than the minimum IP
header length.
This breaks because some ip_fast_csum implementations assume that which
is quite reasonable.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When re-naming an interface, the previous secondary address
labels get lost e.g.
$> brctl addbr foo
$> ip addr add 192.168.0.1 dev foo
$> ip addr add 192.168.0.2 dev foo label foo:00
$> ip addr show dev foo | grep inet
inet 192.168.0.1/32 scope global foo
inet 192.168.0.2/32 scope global foo:00
$> ip link set foo name bar
$> ip addr show dev bar | grep inet
inet 192.168.0.1/32 scope global bar
inet 192.168.0.2/32 scope global bar:2
Turns out to be a simple thinko in inetdev_changename() - clearly we
want to look at the address label, rather than the device name, for
a suffix to retain.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In include/net/xfrm.h we find :
#ifdef CONFIG_XFRM_MIGRATE
extern int km_migrate(struct xfrm_selector *sel, u8 dir, u8 type,
struct xfrm_migrate *m, int num_bundles);
...
#endif
We can also guard the function body itself in net/xfrm/xfrm_state.c
with same condition.
(Problem spoted by sparse checker)
make C=2 net/xfrm/xfrm_state.o
...
net/xfrm/xfrm_state.c:1765:5: warning: symbol 'km_migrate' was not declared. Should it be static?
...
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function x25_get_neigh increments a reference count. At the point of
the second goto out, the result of calling x25_get_neigh is only stored in
a local variable, and thus no one outside the function will be able to
decrease the reference count. Thus, x25_neigh_put should be called before
the return in this case.
The problem was found using the following semantic match.
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
type T,T1,T2;
identifier E;
statement S;
expression x1,x2,x3;
int ret;
@@
T E;
...
* if ((E = x25_get_neigh(...)) == NULL)
S
... when != x25_neigh_put(...,(T1)E,...)
when != if (E != NULL) { ... x25_neigh_put(...,(T1)E,...); ...}
when != x1 = (T1)E
when != E = x3;
when any
if (...) {
... when != x25_neigh_put(...,(T2)E,...)
when != if (E != NULL) { ... x25_neigh_put(...,(T2)E,...); ...}
when != x2 = (T2)E
(
* return;
|
* return ret;
)
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of workqueue delay, the put_device could be called before
device_del, so move it to del_conn.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a delayed ACK representing two packets arrives, there are two RTT
samples available, one for each packet. The first (in order of seq
number) will be artificially long due to the delay waiting for the
second packet, the second will trigger the ACK and so will not itself
be delayed.
According to rfc1323, the SRTT used for RTO calculation should use the
first rtt, so receivers echo the timestamp from the first packet in
the delayed ack. For congestion control however, it seems measuring
delayed ack delay is not desirable as it varies independently of
congestion.
The patch below causes seq_rtt and last_ackt to be updated with any
available later packet rtts which should have less (and hopefully
zero) delack delay. The rtt value then gets passed to
ca_ops->pkts_acked().
Where TCP_CONG_RTT_STAMP was set, effort was made to supress RTTs from
within a TSO chunk (!fully_acked), using only the final ACK (which
includes any TSO delay) to generate RTTs. This patch removes these
checks so RTTs are passed for each ACK to ca_ops->pkts_acked().
For non-delay based congestion control (cubic, h-tcp), rtt is
sometimes used for rtt-scaling. In shortening the RTT, this may make
them a little less aggressive. Delay-based schemes (eg vegas, veno,
illinois) should get a cleaner, more accurate congestion signal,
particularly for small cwnds. The congestion control module can
potentially also filter out bad RTTs due to the delayed ack alarm by
looking at the associated cnt which (where delayed acking is in use)
should probably be 1 if the alarm went off or greater if the ACK was
triggered by a packet.
Signed-off-by: Gavin McCullagh <gavin.mccullagh@nuim.ie>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Brownell pointed out a regression in my recent "Fix ip command
line processing" patch. It turns out to be a fairly blatant oversight on
my part whereby ic_enable is never set, and thus autoconfiguration is
never enabled. Clearly my testing was broken :-(
The solution that I have is to set ic_enable to 1 if we hit
ip_auto_config_setup(), which basically means that autoconfiguration is
activated unless told otherwise. I then flip ic_enable to 0 if ip=off,
ip=none, ip=::::::off or ip=::::::none using ic_proto_name();
The incremental patch is below, let me know if a non-incremental version
is prepared, as I did as for the original patch to be reverted pending a
fix.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently the documentation in Documentation/nfsroot.txt was
update to note that in fact ip=off and ip=::::::off as the
latter is ignored and the default (on) is used.
This was certainly a step in the direction of reducing confusion.
But it seems to me that the code ought to be fixed up so that
ip=::::::off actually turns off ip autoconfiguration.
This patch also notes more specifically that ip=on (aka ip=::::::on)
is the default.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some users do "modprobe ip_conntrack hashsize=...". Since we have the
module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the
hashsize parameter is unknown for nf_conntrack_ipv4 however and makes
it fail.
Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4.
Note: the nf_conntrack message in the ringbuffer will display an
incorrect hashsize since nf_conntrack is first pulled in as a
dependency and calculates the size itself, then it gets changed
through a call to nf_conntrack_set_hashsize().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes mac80211 warn (once) when the driver passes up a
frame in which the payload data is not aligned on a four-byte
boundary, with a long comment for people who run into the condition
and need to know what to do.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The station cleanup timer runs every ten seconds, the exact
timing is not relevant at all so it can well run together with
other things to save power.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
[ Regression added by changeset:
cd40b7d398
[NET]: make netlink user -> kernel interface synchronious
-DaveM ]
nl_fib_input re-reuses incoming skb to send the reply. This means that this
packet will be freed twice, namely in:
- netlink_unicast_kernel
- on receive path
Use clone to send as a cure, the caller is responsible for kfree_skb on error.
Thanks to Alexey Dobryan, who originally found the problem.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When used function put_cmsg() to copy kernel information to user
application memory, if the memory length given by user application is
not enough, by the bad length calculate of msg.msg_controllen,
put_cmsg() function may cause the msg.msg_controllen to be a large
value, such as 0xFFFFFFF0, so the following put_cmsg() can also write
data to usr application memory even usr has no valid memory to store
this. This may cause usr application memory overflow.
int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data)
{
struct cmsghdr __user *cm
= (__force struct cmsghdr __user *)msg->msg_control;
struct cmsghdr cmhdr;
int cmlen = CMSG_LEN(len);
~~~~~~~~~~~~~~~~~~~~~
int err;
if (MSG_CMSG_COMPAT & msg->msg_flags)
return put_cmsg_compat(msg, level, type, len, data);
if (cm==NULL || msg->msg_controllen < sizeof(*cm)) {
msg->msg_flags |= MSG_CTRUNC;
return 0; /* XXX: return error? check spec. */
}
if (msg->msg_controllen < cmlen) {
~~~~~~~~~~~~~~~~~~~~~~~~
msg->msg_flags |= MSG_CTRUNC;
cmlen = msg->msg_controllen;
}
cmhdr.cmsg_level = level;
cmhdr.cmsg_type = type;
cmhdr.cmsg_len = cmlen;
err = -EFAULT;
if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
goto out;
if (copy_to_user(CMSG_DATA(cm), data, cmlen - sizeof(struct cmsghdr)))
goto out;
cmlen = CMSG_SPACE(len);
~~~~~~~~~~~~~~~~~~~~~~~~~~~
If MSG_CTRUNC flags is set, msg->msg_controllen is less than
CMSG_SPACE(len), "msg->msg_controllen -= cmlen" will cause unsinged int
type msg->msg_controllen to be a large value.
~~~~~~~~~~~~~~~~~~~~~~~~~~~
msg->msg_control += cmlen;
msg->msg_controllen -= cmlen;
~~~~~~~~~~~~~~~~~~~~~
err = 0;
out:
return err;
}
The same promble exists in put_cmsg_compat(). This patch can fix this
problem.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>