Bugfixes include:
- Fix a typo in commit e092693443 that breaks copy offload
- Fix the connect error propagation in xs_tcp_setup_socket()
- Fix a lock leak in nfs40_walk_client_list
- Verify that pNFS requests lie within the offset range of the layout segment.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZM0YBAAoJEGcL54qWCgDyLUwQALaPEVp00UMdDR0in7MIFKsO
2mgi7pOyn6po3EjxKbGtjAbL4nSlVxdaFpCIGg47YXrl9/95Zjjmyke+iwRdnMsa
ZPyXwfhVRa80fxbOogAverNCnCptoHoG7EzdWuCTcOOxMxR3Ixs7wVJXrs+7ig+r
IdvIAyTsiDYuP5yVp5KkmJCtLGc0Ze20rb7VgdQJfdiLibWvfYCLZ9CgfAQkdAMU
RIlbT0/BG13XDqwh/C2V1vLge0VfpT5p8qbIb/kFyQ0ZJUUiicGGGjp3u/yj0aG9
ljldI34WmQpsy+nCNN4dEgsF461ECvWLwRZnnpN9nv7VurUBpJNUqHLnubvDbzhh
w8QX54ceEWuQAjg96keNuYOhoG53Omle2/Cm+nmiJOmShJbJ0yh4OcB9DYe0gdYa
5YXbKRjPvf/HfdE7PPpvbPG2E211zfvkLdHnFxswggWyGrh23kqlWrpcHpZomGNW
GbJLfIfhyEfBjCPdNJT3Tzvewo2LkcTNLb+3mJhkxOegkdops8vGYA9G2mba3Daj
1HWl1yFAdzlEf2H1Cb8Y2ZrJKHAmaYBKBkKZYUeAcr6EtoxNqnNMP+PEDcVIzPKg
6Jq7DiYwYksK+XDWK9G4QBguKKGLvYtv0MIA3QDX+bBGLo+eFYxc2iaaJefYNdkK
+vdLHclg/YpepLg+Ui21
=P2bm
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Bugfixes include:
- Fix a typo in commit e092693443 ("NFS append COMMIT after
synchronous COPY") that breaks copy offload
- Fix the connect error propagation in xs_tcp_setup_socket()
- Fix a lock leak in nfs40_walk_client_list
- Verify that pNFS requests lie within the offset range of the layout
segment"
* tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Mark unnecessarily extern functions as static
SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
NFSv4.0: Fix a lock leak in nfs40_walk_client_list
pnfs: Fix the check for requests in range of layout segment
xprtrdma: Delete an error message for a failed memory allocation in xprt_rdma_bc_setup()
pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
NFS fix COMMIT after COPY
If you attempt a TCP mount from an host that is unreachable in a way
that triggers an immediate error from kernel_connect(), that error
does not propagate up, instead EAGAIN is reported.
This results in call_connect_status receiving the wrong error.
A case that it easy to demonstrate is to attempt to mount from an
address that results in ENETUNREACH, but first deleting any default
route.
Without this patch, the mount.nfs process is persistently runnable
and is hard to kill. With this patch it exits as it should.
The problem is caused by the fact that xs_tcp_force_close() eventually
calls
xprt_wake_pending_tasks(xprt, -EAGAIN);
which causes an error return of -EAGAIN. so when xs_tcp_setup_sock()
calls
xprt_wake_pending_tasks(xprt, status);
the status is ignored.
Fixes: 4efdd92c92 ("SUNRPC: Remove TCP client connection reset hack")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull networking fixes from David Miller:
1) Fix state pruning in bpf verifier wrt. alignment, from Daniel
Borkmann.
2) Handle non-linear SKBs properly in SCTP ICMP parsing, from Davide
Caratti.
3) Fix bit field definitions for rss_hash_type of descriptors in mlx5
driver, from Jesper Brouer.
4) Defer slave->link updates until bonding is ready to do a full commit
to the new settings, from Nithin Sujir.
5) Properly reference count ipv4 FIB metrics to avoid use after free
situations, from Eric Dumazet and several others including Cong Wang
and Julian Anastasov.
6) Fix races in llc_ui_bind(), from Lin Zhang.
7) Fix regression of ESP UDP encapsulation for TCP packets, from
Steffen Klassert.
8) Fix mdio-octeon driver Kconfig deps, from Randy Dunlap.
9) Fix regression in setting DSCP on ipv6/GRE encapsulation, from Peter
Dawson.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
ipv4: add reference counting to metrics
net: ethernet: ax88796: don't call free_irq without request_irq first
ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets
sctp: fix ICMP processing if skb is non-linear
net: llc: add lock_sock in llc_ui_bind to avoid a race condition
bonding: Don't update slave->link until ready to commit
test_bpf: Add a couple of tests for BPF_JSGE.
bpf: add various verifier test cases
bpf: fix wrong exposure of map_flags into fdinfo for lpm
bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data
bpf: properly reset caller saved regs after helper call and ld_abs/ind
bpf: fix incorrect pruning decision when alignment must be tracked
arp: fixed -Wuninitialized compiler warning
tcp: avoid fastopen API to be used on AF_UNSPEC
net: move somaxconn init from sysctl code
net: fix potential null pointer dereference
geneve: fix fill_info when using collect_metadata
virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
be2net: Fix offload features for Q-in-Q packets
vlan: Fix tcp checksum offloads in Q-in-Q vlans
...
Andrey Konovalov reported crashes in ipv4_mtu()
I could reproduce the issue with KASAN kernels, between
10.246.7.151 and 10.246.7.152 :
1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
2) At the same time run following loop :
while :
do
ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
done
Cong Wang attempted to add back rt->fi in commit
82486aa6f1 ("ipv4: restore rt->fi for reference counting")
but this proved to add some issues that were complex to solve.
Instead, I suggested to add a refcount to the metrics themselves,
being a standalone object (in particular, no reference to other objects)
I tried to make this patch as small as possible to ease its backport,
instead of being super clean. Note that we believe that only ipv4 dst
need to take care of the metric refcount. But if this is wrong,
this patch adds the basic infrastructure to extend this to other
families.
Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
for his efforts on this problem.
Fixes: 2860583fe8 ("ipv4: Kill rt->fi")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fix addresses two problems in the way the DSCP field is formulated
on the encapsulating header of IPv6 tunnels.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195661
1) The IPv6 tunneling code was manipulating the DSCP field of the
encapsulating packet using the 32b flowlabel. Since the flowlabel is
only the lower 20b it was incorrect to assume that the upper 12b
containing the DSCP and ECN fields would remain intact when formulating
the encapsulating header. This fix handles the 'inherit' and
'fixed-value' DSCP cases explicitly using the extant dsfield u8 variable.
2) The use of INET_ECN_encapsulate(0, dsfield) in ip6_tnl_xmit was
incorrect and resulted in the DSCP value always being set to 0.
Commit 90427ef5d2 ("ipv6: fix flow labels when the traffic class
is non-0") caused the regression by masking out the flowlabel
which exposed the incorrect handling of the DSCP portion of the
flowlabel in ip6_tunnel and ip6_gre.
Fixes: 90427ef5d2 ("ipv6: fix flow labels when the traffic class is non-0")
Signed-off-by: Peter Dawson <peter.a.dawson@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sometimes ICMP replies to INIT chunks are ignored by the client, even if
the encapsulated SCTP headers match an open socket. This happens when the
ICMP packet is carried by a paged skb: use skb_header_pointer() to read
packet contents beyond the SCTP header, so that chunk header and initiate
tag are validated correctly.
v2:
- don't use skb_header_pointer() to read the transport header, since
icmp_socket_deliver() already puts these 8 bytes in the linear area.
- change commit message to make specific reference to INIT chunks.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race condition in llc_ui_bind if two or more processes/threads
try to bind a same socket.
If more processes/threads bind a same socket success that will lead to
two problems, one is this action is not what we expected, another is
will lead to kernel in unstable status or oops(in my simple test case,
cause llc2.ko can't unload).
The current code is test SOCK_ZAPPED bit to avoid a process to
bind a same socket twice but that is can't avoid more processes/threads
try to bind a same socket at the same time.
So, add lock_sock in llc_ui_bind like others, such as llc_ui_connect.
Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
messenger patch from Zheng and Luis' fallocate fix.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJZKD/7AAoJEEp/3jgCEfOLu6sIAKEvmDAZxkRiIV9HF36+0jLO
947jIV22sb4FjngcMs0eBdFD4IJrL8QPq1UVYjIyHtnJN4Tbp9VDPfjyWArhr7+k
hjfTcgnTStmwFy1bUXSq7xNusg9qm0Mw5zpY1DJLCdvkwIU0yrN9zusTlIQvlV5G
Kg4Mzvc3EaL/VgUcsGI2lKuVlMt95wb5u1YGt5AG9FjLv1BTBhpX+3/swtvmtzy3
ZpxyujS4YH+RBpHr9AI/+5IJ2xumZB0C6hzOoa/DAyGzjUH7MQJEuD8hjXqMOWQy
L1wqZo7gXrIk3NSEjxrCb7/mE0S915jkKyHjoJbUxBhy1zEZmri9AfEwe9isb0M=
=enjn
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.12-rc3' of git://github.com/ceph/ceph-client
Pul ceph fixes from Ilya Dryomov:
"A bunch of make W=1 and static checker fixups, a RECONNECT_SEQ
messenger patch from Zheng and Luis' fallocate fix"
* tag 'ceph-for-4.12-rc3' of git://github.com/ceph/ceph-client:
ceph: check that the new inode size is within limits in ceph_fallocate()
libceph: cleanup old messages according to reconnect seq
libceph: NULL deref on crush_decode() error path
libceph: fix error handling in process_one_ticket()
libceph: validate blob_struct_v in process_one_ticket()
libceph: drop version variable from ceph_monmap_decode()
libceph: make ceph_msg_data_advance() return void
libceph: use kbasename() and kill ceph_file_part()
The bpf_clone_redirect() still needs to be listed in
bpf_helper_changes_pkt_data() since we call into
bpf_try_make_head_writable() from there, thus we need
to invalidate prior pkt regs as well.
Fixes: 36bbef52c7 ("bpf: direct packet write and access for helpers for clsact progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d472a59c0 ("arp: always override
existing neigh entries with gratuitous ARP") introduced a compiler
warning:
net/ipv4/arp.c:880:35: warning: 'addr_type' may be used uninitialized in
this function [-Wmaybe-uninitialized]
While the code logic seems to be correct and doesn't allow the variable
to be used uninitialized, and the warning is not consistently
reproducible, it's still worth fixing it for other people not to waste
time looking at the warning in case it pops up in the build environment.
Yes, compiler is probably at fault, but we will need to accommodate.
Fixes: 7d472a59c0 ("arp: always override existing neigh entries with gratuitous ARP")
Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fastopen API should be used to perform fastopen operations on the TCP
socket. It does not make sense to use fastopen API to perform disconnect
by calling it with AF_UNSPEC. The fastopen data path is also prone to
race conditions and bugs when using with AF_UNSPEC.
One issue reported and analyzed by Vegard Nossum is as follows:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Thread A: Thread B:
------------------------------------------------------------------------
sendto()
- tcp_sendmsg()
- sk_stream_memory_free() = 0
- goto wait_for_sndbuf
- sk_stream_wait_memory()
- sk_wait_event() // sleep
| sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
| - tcp_sendmsg()
| - tcp_sendmsg_fastopen()
| - __inet_stream_connect()
| - tcp_disconnect() //because of AF_UNSPEC
| - tcp_transmit_skb()// send RST
| - return 0; // no reconnect!
| - sk_stream_wait_connect()
| - sock_error()
| - xchg(&sk->sk_err, 0)
| - return -ECONNRESET
- ... // wake up, see sk->sk_err == 0
- skb_entail() on TCP_CLOSE socket
If the connection is reopened then we will send a brand new SYN packet
after thread A has already queued a buffer. At this point I think the
socket internal state (sequence numbers etc.) becomes messed up.
When the new connection is closed, the FIN-ACK is rejected because the
sequence number is outside the window. The other side tries to
retransmit,
but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which
corrupts the skb data length and hits a BUG() in copy_and_csum_bits().
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
and return EOPNOTSUPP to user if such case happens.
Fixes: cf60af03ca ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default value for somaxconn is set in sysctl_core_net_init(), but this
function is not called when kernel is configured without CONFIG_SYSCTL.
This results in the kernel not being able to accept TCP connections,
because the backlog has zero size. Usually, the user ends up with:
"TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request. Check SNMP counters."
If SYN cookies are not enabled the connection is rejected.
Before ef547f2ac1 (tcp: remove max_qlen_log), the effects were less
severe, because the backlog was always at least eight slots long.
Signed-off-by: Roman Kapl <roman.kapl@sysgo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix the scheduled scan "BUG: scheduling while atomic"
* check mesh address extension flags more strictly
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlkkLTwACgkQa3t4Rpy0
AB3i4hAAmq9+ZqKcHMtObQL0xnzkm8MEHoqL/ed1VOkjyvAnhw5sDMsvN1KaqjY3
XwSVQBkJcyLVTixEjJuD6oKKKSiX6IGWmaawsuSxpgNZRuYv4DpZHx8zvO+a6228
RwExpn1l7ziQS/wvSiKCvvK/SsPB/KGtp5YijpoXeUsWzxUOoJaJGNb6EsyoChPY
tLLboT65aPvDEp3ZHX540XzxR1yJbM9Dwm4cecfjgF8b/jPYAgohul6hRHoO+l5u
O9yW5FO4Z58IhfLqYD1o5Vp2sV5ZODP166pkIQjLs7oDKe4OZSmAMDVrz9TTM33p
4mBMAfIE0G4g0F8Oc+n4xnHury4fiB4OEYIg5SEh6JGfDFWfJtKGlwS2HuyLAh0d
hLzOjCZ1h2yAfxZrNN34chtVmpTXZOiQfDaGxqE+xjsFwWCY8usN0BmZOl0j/mRo
pfAoYnFOj0Co2av9qIjOG1trys1iZn9BrocZXUAqaAoLADhLeHbYv6y1bAv0riuw
xkDkCIoVfdrT9ZvaVlIZA+itKQb6lvDQ+WO4TR5KjjPu8FAHVpwvi8dZW/h12+S8
VE9A1CjfC0ePrFrMcN3uMZmqK3I5QmKhg60JE22zPtkTLnSeTnSq3pzWBjWGZMC7
Aht5HiEoA4rLTYgOR2Eq6tsXu7j4/VEr5b5iwniZCWpu0EQQXN8=
=H9PL
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2017-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just two fixes this time:
* fix the scheduled scan "BUG: scheduling while atomic"
* check mesh address extension flags more strictly
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After sctp changed to use transport hashtable, a transport would be
added into global hashtable when adding the peer to an asoc, then
the asoc can be got by searching the transport in the hashtbale.
The problem is when processing dupcookie in sctp_sf_do_5_2_4_dupcook,
a new asoc would be created. A peer with the same addr and port as
the one in the old asoc might be added into the new asoc, but fail
to be added into the hashtable, as they also belong to the same sk.
It causes that sctp's dupcookie processing can not really work.
Since the new asoc will be freed after copying it's information to
the old asoc, it's more like a temp asoc. So this patch is to fix
it by setting it as a temp asoc to avoid adding it's any transport
into the hashtable and also avoid allocing assoc_id.
An extra thing it has to do is to also alloc stream info for any
temp asoc, as sctp dupcookie process needs it to update old asoc.
But I don't think it would hurt something, as a temp asoc would
always be freed after finishing processing cookie echo packet.
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 3dbcc105d5 ("sctp: alloc stream info when initializing
asoc"), stream and stream.out info are always alloced when creating
an asoc.
So it's not correct to check !asoc->stream before updating stream
info when processing dupcookie, but would be better to check asoc
state instead.
Fixes: 3dbcc105d5 ("sctp: alloc stream info when initializing asoc")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when reopen a connection, use 'reconnect seq' to clean up
messages that have already been received by peer.
Link: http://tracker.ceph.com/issues/18690
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If there is not enough space then ceph_decode_32_safe() does a goto bad.
We need to return an error code in that situation. The current code
returns ERR_PTR(0) which is NULL. The callers are not expecting that
and it results in a NULL dereference.
Fixes: f24e9980eb ("ceph: OSD client")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
None of these are validated in userspace, but since we do validate
reply_struct_v in ceph_x_proc_ticket_reply(), tkt_struct_v (first) and
CephXServiceTicket struct_v (second) in process_one_ticket(), validate
CephXTicketBlob struct_v as well.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It's set but not used: CEPH_FEATURE_MONNAMES feature bit isn't
advertised, which guarantees a v1 MonMap.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Steffen Klassert says:
====================
pull request (net): ipsec 2017-05-23
1) Fix wrong header offset for esp4 udpencap packets.
2) Fix a stack access out of bounds when creating a bundle
with sub policies. From Sabrina Dubroca.
3) Fix slab-out-of-bounds in pfkey due to an incorrect
sadb_x_sec_len calculation.
4) We checked the wrong feature flags when taking down
an interface with IPsec offload enabled.
Fix from Ilan Tayari.
5) Copy the anti replay sequence numbers when doing a state
migration, otherwise we get out of sync with the sequence
numbers. Fix from Antony Antony.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Mostly netfilter bug fixes in here, but we have some bits elsewhere as
well.
1) Don't do SNAT replies for non-NATed connections in IPVS, from
Julian Anastasov.
2) Don't delete conntrack helpers while they are still in use, from
Liping Zhang.
3) Fix zero padding in xtables's xt_data_to_user(), from Willem de
Bruijn.
4) Add proper RCU protection to nf_tables_dump_set() because we
cannot guarantee that we hold the NFNL_SUBSYS_NFTABLES lock. From
Liping Zhang.
5) Initialize rcv_mss in tcp_disconnect(), from Wei Wang.
6) smsc95xx devices can't handle IPV6 checksums fully, so don't
advertise support for offloading them. From Nisar Sayed.
7) Fix out-of-bounds access in __ip6_append_data(), from Eric
Dumazet.
8) Make atl2_probe() propagate the error code properly on failures,
from Alexey Khoroshilov.
9) arp_target[] in bond_check_params() is used uninitialized. This
got changes from a global static to a local variable, which is how
this mistake happened. Fix from Jarod Wilson.
10) Fix fallout from unnecessary NULL check removal in cls_matchall,
from Jiri Pirko. This is definitely brown paper bag territory..."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
net: sched: cls_matchall: fix null pointer dereference
vsock: use new wait API for vsock_stream_sendmsg()
bonding: fix randomly populated arp target array
net: Make IP alignment calulations clearer.
bonding: fix accounting of active ports in 3ad
net: atheros: atl2: don't return zero on failure path in atl2_probe()
ipv6: fix out of bound writes in __ip6_append_data()
bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
smsc95xx: Support only IPv4 TCP/UDP csum offload
arp: always override existing neigh entries with gratuitous ARP
arp: postpone addr_type calculation to as late as possible
arp: decompose is_garp logic into a separate function
arp: fixed error in a comment
tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
netfilter: xtables: fix build failure from COMPAT_XT_ALIGN outside CONFIG_COMPAT
ebtables: arpreply: Add the standard target sanity check
netfilter: nf_tables: revisit chain/object refcounting from elements
netfilter: nf_tables: missing sanitization in data from userspace
netfilter: nf_tables: can't assume lock is acquired when dumping set elems
netfilter: synproxy: fix conntrackd interaction
...
Since the head is guaranteed by the check above to be null, the call_rcu
would explode. Remove the previously logically dead code that was made
logically very much alive and kicking.
Fixes: 985538eee0 ("net/sched: remove redundant null check on head")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As reported by Michal, vsock_stream_sendmsg() could still
sleep at vsock_stream_has_space() after prepare_to_wait():
vsock_stream_has_space
vmci_transport_stream_has_space
vmci_qpair_produce_free_space
qp_lock
qp_acquire_queue_mutex
mutex_lock
Just switch to the new wait API like we did for commit
d9dc8b0f8b ("net: fix sleeping for sk_wait_event()").
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Konovalov and idaifish@gmail.com reported crashes caused by
one skb shared_info being overwritten from __ip6_append_data()
Andrey program lead to following state :
copy -4200 datalen 2000 fraglen 2040
maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200
The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
fraggap, 0); is overwriting skb->head and skb_shared_info
Since we apparently detect this rare condition too late, move the
code earlier to even avoid allocating skb and risking crashes.
Once again, many thanks to Andrey and syzkaller team.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: <idaifish@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 76b91c32dd ("bridge: stp: when using userspace stp stop
kernel hello and hold timers"), bridge would not start hello_timer if
stp_enabled is not KERNEL_STP when br_dev_open.
The problem is even if users set stp_enabled with KERNEL_STP later,
the timer will still not be started. It causes that KERNEL_STP can
not really work. Users have to re-ifup the bridge to avoid this.
This patch is to fix it by starting br->hello_timer when enabling
KERNEL_STP in br_stp_start.
As an improvement, it's also to start hello_timer again only when
br->stp_enabled is KERNEL_STP in br_hello_timer_expired, there is
no reason to start the timer again when it's NO_STP.
Fixes: 76b91c32dd ("bridge: stp: when using userspace stp stop kernel hello and hold timers")
Reported-by: Haidong Li <haili@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when arp_accept is 1, we always override existing neigh
entries with incoming gratuitous ARP replies. Otherwise, we override
them only if new replies satisfy _locktime_ conditional (packets arrive
not earlier than _locktime_ seconds since the last update to the neigh
entry).
The idea behind locktime is to pick the very first (=> close) reply
received in a unicast burst when ARP proxies are used. This helps to
avoid ARP thrashing where Linux would switch back and forth from one
proxy to another.
This logic has nothing to do with gratuitous ARP replies that are
generally not aligned in time when multiple IP address carriers send
them into network.
This patch enforces overriding of existing neigh entries by all incoming
gratuitous ARP packets, irrespective of their time of arrival. This will
make the kernel honour all incoming gratuitous ARP packets.
Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The addr_type retrieval can be costly, so it's worth trying to avoid its
calculation as much as possible. This patch makes it calculated only
for gratuitous ARP packets. This is especially important since later we
may want to move is_garp calculation outside of arp_accept block, at
which point the costly operation will be executed for all setups.
The patch is the result of a discussion in net-dev:
http://marc.info/?l=linux-netdev&m=149506354216994
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code is quite involving already to earn a separate function for
itself. If anything, it helps arp_process readability.
Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the is_garp code deals just with gratuitous ARP packets, not every
unsolicited packet.
This patch is a result of a discussion in netdev:
http://marc.info/?l=linux-netdev&m=149506354216994
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcp_disconnect() is called, inet_csk_delack_init() sets
icsk->icsk_ack.rcv_mss to 0.
This could potentially cause tcp_recvmsg() => tcp_cleanup_rbuf() =>
__tcp_select_window() call path to have division by 0 issue.
So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) When using IPVS in direct-routing mode, normal traffic from the LVS
host to a back-end server is sometimes incorrectly NATed on the way
back into the LVS host. Patch to fix this from Julian Anastasov.
2) Calm down clang compilation warning in ctnetlink due to type
mismatch, from Matthias Kaehlcke.
3) Do not re-setup NAT for conntracks that are already confirmed, this
is fixing a problem that was introduced in the previous nf-next batch.
Patch from Liping Zhang.
4) Do not allow conntrack helper removal from userspace cthelper
infrastructure if already in used. This comes with an initial patch
to introduce nf_conntrack_helper_put() that is required by this fix.
From Liping Zhang.
5) Zero the pad when copying data to userspace, otherwise iptables fails
to remove rules. This is a follow up on the patchset that sorts out
the internal match/target structure pointer leak to userspace. Patch
from the same author, Willem de Bruijn. This also comes with a build
failure when CONFIG_COMPAT is not on, coming in the last patch of
this series.
6) SYNPROXY crashes with conntrack entries that are created via
ctnetlink, more specifically via conntrackd state sync. Patch from
Eric Leblond.
7) RCU safe iteration on set element dumping in nf_tables, from
Liping Zhang.
8) Missing sanitization of immediate date for the bitwise and cmp
expressions in nf_tables.
9) Refcounting logic for chain and objects from set elements does not
integrate into the nf_tables 2-phase commit protocol.
10) Missing sanitization of target verdict in ebtables arpreply target,
from Gao Feng.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJZHx/IAAoJELDendYovxMvzegIAIOyDATZsyLnbDnTunOmYqLJ
n06v50N3KwQ+pegJyz4lHdTryI10/TEUzvuT4v/V9B0sHimNRJcE7ClvRVPEaFrs
4y459kKGXRpXXAvS2r0WIY3NhwP/Num9+duVY5lInJ6caq+/JDm3S1tL2HeQ9gl1
SDuI6IMV3q12Agk6jgbvwd1XBh3wbj8Z6SOx3DAchqY/kbdy6tS4y5CR93mKpjs3
LsVyPvY2IOLWCSrPsdloM4l7lMoVmd/1tt6NfzymepIxQbIS3KWo5AwBsoM0cVfs
KGb4T3+H8uwmpyWjgibsayr31cC7LIulEqLtqZNyycpIZGR5TlZ01KEPSMKn78s=
=Boz3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.12b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Some fixes for the new Xen 9pfs frontend and some minor cleanups"
* tag 'for-linus-4.12b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: make xen_flush_tlb_all() static
xen: cleanup pvh leftovers from pv-only sources
xen/9pfs: p9_trans_xen_init and p9_trans_xen_exit can be static
xen/9pfs: fix return value check in xen_9pfs_front_probe()
During xfrm migration copy replay and preplay sequence numbers
from the previous state.
Here is a tcpdump output showing the problem.
10.0.10.46 is running vanilla kernel, is the IKE/IPsec responder.
After the migration it sent wrong sequence number, reset to 1.
The migration is from 10.0.0.52 to 10.0.0.53.
IP 10.0.0.52.4500 > 10.0.10.46.4500: UDP-encap: ESP(spi=0x43ef462d,seq=0x7cf), length 136
IP 10.0.10.46.4500 > 10.0.0.52.4500: UDP-encap: ESP(spi=0xca1c282d,seq=0x7cf), length 136
IP 10.0.0.52.4500 > 10.0.10.46.4500: UDP-encap: ESP(spi=0x43ef462d,seq=0x7d0), length 136
IP 10.0.10.46.4500 > 10.0.0.52.4500: UDP-encap: ESP(spi=0xca1c282d,seq=0x7d0), length 136
IP 10.0.0.53.4500 > 10.0.10.46.4500: NONESP-encap: isakmp: child_sa inf2[I]
IP 10.0.10.46.4500 > 10.0.0.53.4500: NONESP-encap: isakmp: child_sa inf2[R]
IP 10.0.0.53.4500 > 10.0.10.46.4500: NONESP-encap: isakmp: child_sa inf2[I]
IP 10.0.10.46.4500 > 10.0.0.53.4500: NONESP-encap: isakmp: child_sa inf2[R]
IP 10.0.0.53.4500 > 10.0.10.46.4500: UDP-encap: ESP(spi=0x43ef462d,seq=0x7d1), length 136
NOTE: next sequence is wrong 0x1
IP 10.0.10.46.4500 > 10.0.0.53.4500: UDP-encap: ESP(spi=0xca1c282d,seq=0x1), length 136
IP 10.0.0.53.4500 > 10.0.10.46.4500: UDP-encap: ESP(spi=0x43ef462d,seq=0x7d2), length 136
IP 10.0.10.46.4500 > 10.0.0.53.4500: UDP-encap: ESP(spi=0xca1c282d,seq=0x2), length 136
Signed-off-by: Antony Antony <antony@phenome.org>
Reviewed-by: Richard Guy Briggs <rgb@tricolour.ca>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Fixes the following sparse warnings:
net/9p/trans_xen.c:528:5: warning:
symbol 'p9_trans_xen_init' was not declared. Should it be static?
net/9p/trans_xen.c:540:6: warning:
symbol 'p9_trans_xen_exit' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
In case of error, the function xenbus_read() returns ERR_PTR() and never
returns NULL. The NULL test in the return value check should be replaced
with IS_ERR().
Fixes: 71ebd71921 ("xen/9pfs: connect to the backend")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
SCTP needs fixes similar to 83eaddab43 ("ipv6/dccp: do not inherit
ipv6_mc_list from parent"), otherwise bad things can happen.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the udp memory accounting refactor, we don't need any more
to export the *udp*_queue_rcv_skb(). Make them static and fix
a couple of sparse warnings:
net/ipv4/udp.c:1615:5: warning: symbol 'udp_queue_rcv_skb' was not
declared. Should it be static?
net/ipv6/udp.c:572:5: warning: symbol 'udpv6_queue_rcv_skb' was not
declared. Should it be static?
Fixes: 850cbaddb5 ("udp: use it's own memory accounting schema")
Fixes: c915fe13cb ("udplite: fix NULL pointer dereference")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is allowed to set the default pvid of a bridge to a value
above VLAN_VID_MASK (0xfff). This patch adds a check to br_validate and
returns -EINVAL in case the pvid is out of bounds.
Reproduce by calling:
[root@test ~]# ip l a type bridge
[root@test ~]# ip l a type dummy
[root@test ~]# ip l s bridge0 type bridge vlan_filtering 1
[root@test ~]# ip l s bridge0 type bridge vlan_default_pvid 9999
[root@test ~]# ip l s dummy0 master bridge0
[root@test ~]# bridge vlan
port vlan ids
bridge0 9999 PVID Egress Untagged
dummy0 9999 PVID Egress Untagged
Fixes: 0f963b7592 ("bridge: netlink: add support for default_pvid")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Tobias Jungel <tobias.jungel@bisdn.de>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function x25_init is not properly unregister related resources
on error handler.It is will result in kernel oops if x25_init init
failed, so add properly unregister call on error handler.
Also, i adjust the coding style and make x25_register_sysctl properly
return failure.
Signed-off-by: linzhang <xiaolou4617@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch in the Fixes references COMPAT_XT_ALIGN in the definition
of XT_DATA_TO_USER, outside an #ifdef CONFIG_COMPAT block.
Split XT_DATA_TO_USER into separate compat and non compat variants and
define the first inside an CONFIG_COMPAT block.
This simplifies both variants by removing branches inside the macro.
Fixes: 324318f024 ("netfilter: xtables: zero padding in data_to_user")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Do not use unsigned variables to see if it returns a negative
error or not.
Fixes: 2423496af3 ("ipv6: Prevent overrun when parsing v6 header options")
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_INET is not set, net/core/sock.c can not compile :
net/core/sock.c: In function ‘skb_orphan_partial’:
net/core/sock.c:1810:2: error: implicit declaration of function
‘skb_is_tcp_pure_ack’ [-Werror=implicit-function-declaration]
if (skb_is_tcp_pure_ack(skb))
^
Fix this by always including <net/tcp.h>
Fixes: f6ba8d33cf ("netem: fix skb_orphan_partial()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's a common practice to send gratuitous ARPs after moving an
IP address to another device to speed up healing of a service. To
fulfill service availability constraints, the timing of network peers
updating their caches to point to a new location of an IP address can be
particularly important.
Sometimes neigh_update calls won't touch neither lladdr nor state, for
example if an update arrives in locktime interval. The neigh->updated
value is tested by the protocol specific neigh code, which in turn
will influence whether NEIGH_UPDATE_F_OVERRIDE gets set in the
call to neigh_update() or not. As a result, we may effectively ignore
the update request, bailing out of touching the neigh entry, except that
we still bump its timestamps inside neigh_update.
This may be a problem for updates arriving in quick succession. For
example, consider the following scenario:
A service is moved to another device with its IP address. The new device
sends three gratuitous ARP requests into the network with ~1 seconds
interval between them. Just before the first request arrives to one of
network peer nodes, its neigh entry for the IP address transitions from
STALE to DELAY. This transition, among other things, updates
neigh->updated. Once the kernel receives the first gratuitous ARP, it
ignores it because its arrival time is inside the locktime interval. The
kernel still bumps neigh->updated. Then the second gratuitous ARP
request arrives, and it's also ignored because it's still in the (new)
locktime interval. Same happens for the third request. The node
eventually heals itself (after delay_first_probe_time seconds since the
initial transition to DELAY state), but it just wasted some time and
require a new ARP request/reply round trip. This unfortunate behaviour
both puts more load on the network, as well as reduces service
availability.
This patch changes neigh_update so that it bumps neigh->updated (as well
as neigh->confirmed) only once we are sure that either lladdr or entry
state will change). In the scenario described above, it means that the
second gratuitous ARP request will actually update the entry lladdr.
Ideally, we would update the neigh entry on the very first gratuitous
ARP request. The locktime mechanism is designed to ignore ARP updates in
a short timeframe after a previous ARP update was honoured by the kernel
layer. This would require tracking timestamps for state transitions
separately from timestamps when actual updates are received. This would
probably involve changes in neighbour struct. Therefore, the patch
doesn't tackle the issue of the first gratuitous APR ignored, leaving
it for a follow-up.
Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>