Nicolas Dichtel says:
After commit b87a2f9199 ("netfilter: conntrack: add gc worker to
remove timed-out entries"), netlink conntrack deletion events may be
sent with a huge delay.
Nicolas further points at this line:
goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
and indeed, this isn't optimal at all. Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge. But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.
We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).
So, after some discussion with Nicolas:
1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)
2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries
3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead
4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.
The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.
Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Thomas reports its not possible to attach the H.245 helper:
iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
iptables: No chain/target/match by that name.
xt_CT: No such helper "H.245"
This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.
We should treat UNSPEC as wildcard and ignore the l3num instead.
Reported-by: Thomas Woerner <twoerner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The (percpu) untracked conntrack entries can end up with nonzero connmarks.
The 'untracked' conntrack objects are merely a way to distinguish INVALID
(i.e. protocol connection tracker says payload doesn't meet some
requirements or packet was never seen by the connection tracking code)
from packets that are intentionally not tracked (some icmpv6 types such as
neigh solicitation, or by using 'iptables -j CT --notrack' option).
Untracked conntrack objects are implementation detail, we might as well use
invalid magic address instead to tell INVALID and UNTRACKED apart.
Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.
Reported-by: XU Tianwen <evan.xu.tianwen@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
family.maxattr is the max index for policy[], the size of
ops[] is determined with ARRAY_SIZE().
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NFTA_DUP_SREG_DEV attribute is not a must option, so we should use it
in routing lookup only when the user specify it.
Fixes: d877f07112 ("netfilter: nf_tables: add nft_dup expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When the memory is exhausted, then we will fail to add the NFT_MSG_NEWSET
transaction. In such case, we should destroy the set before we free it.
Fixes: 958bee14d0 ("netfilter: nf_tables: use new transaction infrastructure to handle sets")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86
confuses the compiler to the point where it produces a rather
dubious warning message:
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
struct ip_vs_sync_conn_options opt;
^~~
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
The problem appears to be a combination of a number of factors, including
the __builtin_bswap32 compiler builtin being slightly odd, having a large
amount of code inlined into a single function, and the way that some
functions only get partially inlined here.
I've spent way too much time trying to work out a way to improve the
code, but the best I've come up with is to add an explicit memset
right before the ip_vs_seq structure is first initialized here. When
the compiler works correctly, this has absolutely no effect, but in the
case that produces the warning, the warning disappears.
In the process of analysing this warning, I also noticed that
we use memcpy to copy the larger ip_vs_sync_conn_options structure
over two members of the ip_vs_conn structure. This works because
the layout is identical, but seems error-prone, so I'm changing
this in the process to directly copy the two members. This change
seemed to have no effect on the object code or the warning, but
it deals with the same data, so I kept the two changes together.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 36b701fae1 ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".
This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.
Found by Coverity, CID 1373930.
Note that commit 21a9e0f156 ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()) attempted to address the issue, but
did not address the return type of nft_parse_u32_check.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Cc: Laura Garcia Liebana <nevola@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 36b701fae1 ("netfilter: nf_tables: validate maximum value...")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
on SIP requests, so a fragmented TCP SIP packet from an allow header starting with
INVITE,NOTIFY,OPTIONS,REFER,REGISTER,UPDATE,SUBSCRIBE
Content-Length: 0
will not bet interpreted as an INVITE request. Also Request-URI must start with an alphabetic character.
Confirm with RFC 3261
Request-Line = Method SP Request-URI SP SIP-Version CRLF
Fixes: 30f33e6dee ("[NETFILTER]: nf_conntrack_sip: support method specific request/response handling")
Signed-off-by: Ulrich Weber <ulrich.weber@riverbed.com>
Acked-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Packets may race when create the new element in nft_hash_update:
CPU0 CPU1
lookup_fast - fail lookup_fast - fail
new - ok new - ok
insert - ok insert - fail(EEXIST)
So when race happened, we reuse the existing element. Otherwise,
these *racing* packets will not be handled properly.
Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When nft_expr_clone failed, a series of problems will happen:
1. module refcnt will leak, we call __module_get at the beginning but
we forget to put it back if ops->clone returns fail
2. memory will be leaked, if clone fail, we just return NULL and forget
to free the alloced element
3. set->nelems will become incorrect when set->size is specified. If
clone fail, we should decrease the set->nelems
Now this patch fixes these problems. And fortunately, clone fail will
only happen on counter expression when memory is exhausted.
Fixes: 086f332167 ("netfilter: nf_tables: add clone interface to expression operations")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When CONFIG_NFT_SET_HASH is not enabled and I input the following rule:
"nft add rule filter output flow table test {ip daddr counter }", kernel
panic happened on my system:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
[...]
Call Trace:
[<ffffffffa0590466>] ? nft_dynset_eval+0x56/0x100 [nf_tables]
[<ffffffffa05851bb>] nft_do_chain+0xfb/0x4e0 [nf_tables]
[<ffffffffa0432f01>] ? nf_conntrack_tuple_taken+0x61/0x210 [nf_conntrack]
[<ffffffffa0459ea6>] ? get_unique_tuple+0x136/0x560 [nf_nat]
[<ffffffffa043bca1>] ? __nf_ct_ext_add_length+0x111/0x130 [nf_conntrack]
[<ffffffffa045a357>] ? nf_nat_setup_info+0x87/0x3b0 [nf_nat]
[<ffffffff81761e27>] ? ipt_do_table+0x327/0x610
[<ffffffffa045a6d7>] ? __nf_nat_alloc_null_binding+0x57/0x80 [nf_nat]
[<ffffffffa059f21f>] nft_ipv4_output+0xaf/0xd0 [nf_tables_ipv4]
[<ffffffff81702515>] nf_iterate+0x55/0x60
[<ffffffff81702593>] nf_hook_slow+0x73/0xd0
Because in rbtree type set, ops->update is not implemented. So just keep
it simple, in such case, report -EOPNOTSUPP to the user space.
Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
First bug was added in commit ad6f939ab1 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
Then commit e6afc8ace6 ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now UDP headers are pulled
before skb are put in receive queue.
Fixes: ad6f939ab1 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7303a14750 ("sctp: identify chunks that need to be fragmented
at IP level") made the chunk be fragmented at IP level in the next round
if it's size exceed PMTU.
But there still is another case, PMTU can be updated if transport's dst
expires and transport's pmtu_pending is set in sctp_packet_transmit. If
the new PMTU is less than the chunk, the same issue with that commit can
be triggered.
So we should drop this packet and let it retransmit in another round
where it would be fragmented at IP level.
This patch is to fix it by checking the chunk size after PMTU may be
updated and dropping this packet if it's size exceed PMTU.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@txudriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth 2016-10-21
Here are some more Bluetooth fixes for the 4.9 kernel:
- Fix to btwilink driver probe function return value
- Power management fix to hci_bcm
- Fix to encoding name in scan response data
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of getsockopt handlers in net/sctp/socket.c check len against
sizeof some structure like:
if (len < sizeof(int))
return -EINVAL;
On the first look, the check seems to be correct. But since len is int
and sizeof returns size_t, int gets promoted to unsigned size_t too. So
the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
false.
Fix this in sctp by explicitly checking len < 0 before any getsockopt
handler is called.
Note that sctp_getsockopt_events already handled the negative case.
Since we added the < 0 check elsewhere, this one can be removed.
If not checked, this is the result:
UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
shift exponent 52 is too large for 32-bit type 'int'
CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
0000000000000000 ffff88006d99f2a8 ffffffffb2f7bdea 0000000041b58ab3
ffffffffb4363c14 ffffffffb2f7bcde ffff88006d99f2d0 ffff88006d99f270
0000000000000000 0000000000000000 0000000000000034 ffffffffb5096422
Call Trace:
[<ffffffffb3051498>] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
...
[<ffffffffb273f0e4>] ? kmalloc_order+0x24/0x90
[<ffffffffb27416a4>] ? kmalloc_order_trace+0x24/0x220
[<ffffffffb2819a30>] ? __kmalloc+0x330/0x540
[<ffffffffc18c25f4>] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
[<ffffffffc18d2bcd>] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
[<ffffffffb37c1219>] ? sock_common_getsockopt+0xb9/0x150
[<ffffffffb37be2f5>] ? SyS_getsockopt+0x1a5/0x270
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-sctp@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we'll overflow the integer. This occurs when layer 3 tunneled
packets are handed off to the IPv6 layer.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit a681574c99
("ipv4: disable BH in set_ping_group_range()") because we never
read ping_group_range in BH context (unlike local_port_range).
Then, since we already have a lock for ping_group_range, those
using ip_local_ports.lock for ping_group_range are clearly typos.
We might consider to share a same lock for both ping_group_range
and local_port_range w.r.t. space saving, but that should be for
net-next.
Fixes: a681574c99 ("ipv4: disable BH in set_ping_group_range()")
Fixes: ba6b918ab2 ("ping: move ping_group_range out of CONFIG_SYSCTL")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit bc51dddf98 ("netns: avoid disabling irq for
netns id") as it was found to cause problems with systems running
SELinux/audit, see the mailing list thread below:
* http://marc.info/?t=147694653900002&r=1&w=2
Eventually we should be able to reintroduce this code once we have
rewritten the audit multicast code to queue messages much the same
way we do for unicast messages. A tracking issue for this can be
found below:
* https://github.com/linux-audit/audit-kernel/issues/23
Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Elad Raz <e@eladraz.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Baozeng reported this deadlock case:
CPU0 CPU1
---- ----
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);
Similar to commit 87e9f03159
("ipv4: fix a potential deadlock in mcast getsockopt() path")
this is due to we still have a case, ipv6_sock_mc_close(),
where we acquire sk_lock before rtnl_lock. Close this deadlock
with the similar solution, that is always acquire rtnl lock first.
Fixes: baf606d9c9 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Fix compilation warning in xt_hashlimit on m68k 32-bits, from
Geert Uytterhoeven.
2) Fix wrong timeout in set elements added from packet path via
nft_dynset, from Anders K. Pedersen.
3) Remove obsolete nf_conntrack_events_retry_timeout sysctl
documentation, from Nicolas Dichtel.
4) Ensure proper initialization of log flags via xt_LOG, from
Liping Zhang.
5) Missing alias to autoload ipcomp, also from Liping Zhang.
6) Missing NFTA_HASH_OFFSET attribute validation, again from Liping.
7) Wrong integer type in the new nft_parse_u32_check() function,
from Dan Carpenter.
8) Another wrong integer type declaration in nft_exthdr_init, also
from Dan Carpenter.
9) Fix insufficient mode validation in nft_range.
10) Fix compilation warning in nft_range due to possible uninitialized
value, from Arnd Bergmann.
11) Zero nf_hook_ops allocated via xt_hook_alloc() in x_tables to
calm down kmemcheck, from Florian Westphal.
12) Schedule gc_worker() to run again if GC_MAX_EVICTS quota is reached,
from Nicolas Dichtel.
13) Fix nf_queue() after conversion to single-linked hook list, related
to incorrect bypass flag handling and incorrect hook point of
reinjection.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 4ee3bd4a8c ("ipv4: disable BH when changing ip local port
range") Cong added BH protection in set_local_port_range() but missed
that same fix was needed in set_ping_group_range()
Fixes: b8f1a55639 ("udp: Add function to make source port for UDP tunnels")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Baozeng Ding reported KASAN traces showing uses after free in
udp_lib_get_port() and other related UDP functions.
A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.
I could write a reproducer with two threads doing :
static int sock_fd;
static void *thr1(void *arg)
{
for (;;) {
connect(sock_fd, (const struct sockaddr *)arg,
sizeof(struct sockaddr_in));
}
}
static void *thr2(void *arg)
{
struct sockaddr_in unspec;
for (;;) {
memset(&unspec, 0, sizeof(unspec));
connect(sock_fd, (const struct sockaddr *)&unspec,
sizeof(unspec));
}
}
Problem is that udp_disconnect() could run without holding socket lock,
and this was causing list corruptions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, GRO can do unlimited recursion through the gro_receive
handlers. This was fixed for tunneling protocols by limiting tunnel GRO
to one level with encap_mark, but both VLAN and TEB still have this
problem. Thus, the kernel is vulnerable to a stack overflow, if we
receive a packet composed entirely of VLAN headers.
This patch adds a recursion counter to the GRO layer to prevent stack
overflow. When a gro_receive function hits the recursion limit, GRO is
aborted for this skb and it is processed normally. This recursion
counter is put in the GRO CB, but could be turned into a percpu counter
if we run out of space in the CB.
Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.
Fixes: CVE-2016-7039
Fixes: 9b174d88c2 ("net: Add Transparent Ethernet Bridging GRO support.")
Fixes: 66e5133f19 ("vlan: Add GRO support for non hardware accelerated vlan")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for an underflow of tmp_prefered_lft is always false
because tmp_prefered_lft is unsigned. The intention of the check
was to guard against racing with an update of the
temp_prefered_lft sysctl, potentially resulting in an underflow.
As suggested by David Miller, the best way to prevent the race is
by reading the sysctl variable using READ_ONCE.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Fixes: 76506a986d ("IPv6: fix DESYNC_FACTOR")
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_queue handling is broken since e3b37f11e6 ("netfilter: replace
list_head with single linked list") for two reasons:
1) If the bypass flag is set on, there are no userspace listeners and
we still have more hook entries to iterate over, then jump to the
next hook. Otherwise accept the packet. On nf_reinject() path, the
okfn() needs to be invoked.
2) We should not re-enter the same hook on packet reinjection. If the
packet is accepted, we have to skip the current hook from where the
packet was enqueued, otherwise the packets gets enqueued over and
over again.
This restores the previous list_for_each_entry_continue() behaviour
happening from nf_iterate() that was dealing with these two cases.
This patch introduces a new nf_queue() wrapper function so this fix
becomes simpler.
Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When the maximum evictions number is reached, do not wait 5 seconds before
the next run.
CC: Florian Westphal <fw@strlen.de>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This improves AEN handler for Host Network Controller Driver Status
Change (HNCDSC):
* The channel's lock should be hold when accessing its state.
* Do failover when host driver isn't ready.
* Configure channel when host driver becomes ready.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The issue was found on BCM5718 which has two NCSI channels in one
package: C0 and C1. C0 is in link-up state while C1 is in link-down
state. C0 is chosen as active channel until unplugging and plugging
C0's cable: On unplugging C0's cable, LSC (Link State Change) AEN
packet received on C0 to report link-down event. After that, C1 is
chosen as active channel. LSC AEN for link-up event is lost on C0
when plugging C0's cable back. We lose the network even C0 is usable.
This resolves the issue by recording the (hot) channel that was ever
chosen as active one. The hot channel is chosen to be active one
if none of available channels in link-up state. With this, C0 is still
the active one after unplugging C0's cable. LSC AEN packet received
on C0 when plugging its cable back.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The issue was found on BCM5718 which has two NCSI channels in one
package: C0 and C1. Both of them are connected to different LANs,
means they are in link-up state and C0 is chosen as the active one
until resetting BCM5718 happens as below.
Resetting BCM5718 results in LSC (Link State Change) AEN packet
received on C0, meaning LSC AEN is missed on C1. When LSC AEN packet
received on C0 to report link-down, it fails over to C1 because C1
is in link-up state as software can see. However, C1 is in link-down
state in hardware. It means the link state is out of synchronization
between hardware and software, resulting in inappropriate channel (C1)
selected as active one.
This resolves the issue by sending separate GLS (Get Link Status)
commands to all channels in the package before trying to do failover.
The last link states of all channels in the package are retrieved.
With it, C0 (not C1) is selected as active one as expected.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several if/else statements in the state machine implemented
by switch/case in ncsi_suspend_channel() to avoid duplicated code. It
makes the code a bit hard to be understood.
This drops if/else statements in ncsi_suspend_channel() to improve the
code readability as Joel Stanley suggested. Also, it becomes easy to
add more states in the state machine without affecting current code.
No logical changes introduced by this.
Suggested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stats_update callback is called by NIC drivers doing hardware
offloading of the mirred action. Lastuse is passed as argument
to specify when the stats was actually last updated and is not
always the current time.
Fixes: 9798e6fe4f ('net: act_mirred: allow statistic updates from offloaded actions')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Append maximum of 10 + 1 bytes of name to scan response data.
Complete name is appended only if exists and is <= 10 characters.
Else append short name if exists or shorten complete name if not.
This makes sure name is consistent across multiple advertising
instances.
Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Markus Trippelsdorf reports:
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001e605480)
4055601e0088ffff000000000000000090686d81ffffffff0000000000000000
u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u
^
|RIP: 0010:[<ffffffff8166e561>] [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
[..]
[<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
[<ffffffff8166eaaf>] nf_register_net_hooks+0x3f/0xa0
[<ffffffff816d6715>] ipt_register_table+0xe5/0x110
[..]
This warning is harmless; we copy 'uninitialized' data from the hook ops
but it will not be used.
Long term the structures keeping run-time data should be disentangled
from those only containing config-time data (such as where in the list
to insert a hook), but thats -next material.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since commit b2fb4f54ec ("tcp: uninline tcp_prequeue()") we no longer
access sysctl_tcp_low_latency from a module.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We recently got the following warning after setting up a vlan device on
top of an offloaded bridge and executing 'bridge link':
WARNING: CPU: 0 PID: 18566 at drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:81 mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
[...]
CPU: 0 PID: 18566 Comm: bridge Not tainted 4.8.0-rc7 #1
Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox switch, BIOS 4.6.5 05/21/2015
0000000000000286 00000000e64ab94f ffff880406e6f8f0 ffffffff8135eaa3
0000000000000000 0000000000000000 ffff880406e6f930 ffffffff8108c43b
0000005106e6f988 ffff8803df398840 ffff880403c60108 ffff880406e6f990
Call Trace:
[<ffffffff8135eaa3>] dump_stack+0x63/0x90
[<ffffffff8108c43b>] __warn+0xcb/0xf0
[<ffffffff8108c56d>] warn_slowpath_null+0x1d/0x20
[<ffffffffa01420d5>] mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
[<ffffffffa0142195>] mlxsw_sp_port_attr_get+0xa5/0xb0 [mlxsw_spectrum]
[<ffffffff816f151f>] switchdev_port_attr_get+0x4f/0x140
[<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
[<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
[<ffffffff816f1d6b>] switchdev_port_bridge_getlink+0x5b/0xc0
[<ffffffff816f2680>] ? switchdev_port_fdb_dump+0x90/0x90
[<ffffffff815f5427>] rtnl_bridge_getlink+0xe7/0x190
[<ffffffff8161a1b2>] netlink_dump+0x122/0x290
[<ffffffff8161b0df>] __netlink_dump_start+0x15f/0x190
[<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
[<ffffffff815fab46>] rtnetlink_rcv_msg+0x1a6/0x220
[<ffffffff81208118>] ? __kmalloc_node_track_caller+0x208/0x2c0
[<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
[<ffffffff815fa9a0>] ? rtnl_newlink+0x890/0x890
[<ffffffff8161cf54>] netlink_rcv_skb+0xa4/0xc0
[<ffffffff815f56f8>] rtnetlink_rcv+0x28/0x30
[<ffffffff8161c92c>] netlink_unicast+0x18c/0x240
[<ffffffff8161ccdb>] netlink_sendmsg+0x2fb/0x3a0
[<ffffffff815c5a48>] sock_sendmsg+0x38/0x50
[<ffffffff815c6031>] SYSC_sendto+0x101/0x190
[<ffffffff815c7111>] ? __sys_recvmsg+0x51/0x90
[<ffffffff815c6b6e>] SyS_sendto+0xe/0x10
[<ffffffff817017f2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
The problem is that the 8021q module propagates the call to
ndo_bridge_getlink() via switchdev ops, but the switch driver doesn't
recognize the netdev, as it's not offloaded.
While we can ignore calls being made to non-bridge ports inside the
driver, a better fix would be to push this check up to the switchdev
layer.
Note that these ndos can be called for non-bridged netdev, but this only
happens in certain PF drivers which don't call the corresponding
switchdev functions anyway.
Fixes: 99f44bb352 ("mlxsw: spectrum: Enable L3 interfaces on top of bridge devices")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Tamir Winetroub <tamirw@mellanox.com>
Tested-by: Tamir Winetroub <tamirw@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes a panic when calling eth_get_headlen(). Noticed on i40e driver.
Fixes: d5709f7ab7 ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
Signed-off-by: Eric Garver <e@erig.me>
Reviewed-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
reuseport_add_sock() is not used from a module,
no need to export it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Satish reported a problem with the perm multicast router ports not getting
reenabled after some series of events, in particular if it happens that the
multicast snooping has been disabled and the port goes to disabled state
then it will be deleted from the router port list, but if it moves into
non-disabled state it will not be re-added because the mcast snooping is
still disabled, and enabling snooping later does nothing.
Here are the steps to reproduce, setup br0 with snooping enabled and eth1
added as a perm router (multicast_router = 2):
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
<empty>
At this point we have mcast enabled and eth1 as a perm router (value = 2)
but it is not in the router list which is incorrect.
After this change:
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
router ports on br0: eth1
Note: we can directly do br_multicast_enable_port for all because the
querier timer already has checks for the port state and will simply
expire if it's in blocking/disabled. See the comment added by
commit 9aa6638216 ("bridge: multicast: add a comment to
br_port_state_selection about blocking state")
Fixes: 561f1103a2 ("bridge: Add multicast_snooping sysfs toggle")
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The newly added nft_range_eval() function handles the two possible
nft range operations, but as the compiler warning points out,
any unexpected value would lead to the 'mismatch' variable being
used without being initialized:
net/netfilter/nft_range.c: In function 'nft_range_eval':
net/netfilter/nft_range.c:45:5: error: 'mismatch' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This removes the variable in question and instead moves the
condition into the switch itself, which is potentially more
efficient than adding a bogus 'default' clause as in my
first approach, and is nicer than using the 'uninitialized_var'
macro.
Fixes: 0f3cd9b369 ("netfilter: nf_tables: add range expression")
Link: http://patchwork.ozlabs.org/patch/677114/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove the unused but set variable icsk in listening_get_next to fix the
following GCC warning when building with 'W=1':
net/ipv4/tcp_ipv4.c: In function ‘listening_get_next’:
net/ipv4/tcp_ipv4.c:1890:31: warning: variable ‘icsk’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unused but set variable dev in ip_do_fragment to fix the
following GCC warning when building with 'W=1':
net/ipv4/ip_output.c: In function ‘ip_do_fragment’:
net/ipv4/ip_output.c:541:21: warning: variable ‘dev’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unused but set variable master_dev in check_local_dest to fix
the following GCC warning when building with 'W=1':
net/hsr/hsr_forward.c: In function ‘check_local_dest’:
net/hsr/hsr_forward.c:303:21: warning: variable ‘master_dev’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
from stack removal fix that crashes things when VMAP
stack is used in conjunction with software crypto.
Aside from that, we have:
* a fix for AP_VLAN usage with the nl80211 frame command
* two fixes (and two preparation patches) for A-MSDU, one
to discard group-addressed (multicast) and unexpected
4-address A-MSDUs, the other to validate A-MSDU inner
MAC addresses properly to prevent controlled port bypass
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYBcgKAAoJEGt7eEactAAdhUkP/jMVQbLMZ1Jcc9+lsPVGUIga
I9GeQ4lcnD+4ASeJUhTtemC1IMNL4zMVqaIxbznDXKP7rZRrODVvCPk2TYIw9c5S
rzF/TRierMFttLu3xY757nAsYg6T7F03JdOQ3SKIb3xOD8pXCWQoVRN14ldroRno
4stOAtDrpD5wvK2JhlWv1EYlxGVLqLcakZt/BwgDX/cJGkAx49Q/s29FUnesB9Ep
sCH5chffeQskOL9CrSwboNmucgt4HGQORc4UL/KtPOEBtyfu/LCXEKSqAKVyQZtZ
OerouOHWqQE5lT2K6qD/KKFW4lV2t1h+xzqsvZk4ZR5o3s+PAGai6D/wf+JgY9Hk
uor9ju/e0htcI9m0aFdHDnltV0OOwIhR2bxWTuBBUkyFVtdQQY+1MRTTtuunWIB4
SDYv6LrNL/0HAIuTlPQH99rnsFNnRZCtTpdbT7GRckAMeWMvy19bF2ZB1FXuSn+h
5dxIo0qkw8nv4Y9wQ6QmgOcSzYyidUrCgLTO516qXVAKY0kl/u4q/zPr0Fmx/qfY
oxspelDv0qd2NMQwJ/AmwjAjkQBulv5DVLu+cDXdOMkc/EbhzWyvetcHiNukxjHn
mukCBxTlLoDLug2LFkAPIddEutj+VUEefkf/pD/js8uYuyd9ZnPjiIh6fG25il9a
cHbMYtANt2EnZjwI9Z74
=T6t1
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
This is relatively small, mostly to get the SG/crypto
from stack removal fix that crashes things when VMAP
stack is used in conjunction with software crypto.
Aside from that, we have:
* a fix for AP_VLAN usage with the nl80211 frame command
* two fixes (and two preparation patches) for A-MSDU, one
to discard group-addressed (multicast) and unexpected
4-address A-MSDUs, the other to validate A-MSDU inner
MAC addresses properly to prevent controlled port bypass
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use nft_parse_u32_check() to make sure we don't get a value over the
unsigned 8-bit integer. Moreover, make sure this value doesn't go over
the two supported range comparison modes.
Fixes: 9286c2eb1fda ("netfilter: nft_range: validate operation netlink attribute")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
"err" needs to be signed for the error handling to work.
Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We don't want to allow negatives here.
Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Missing the nla_policy description will also miss the validation check
in kernel.
Fixes: 70ca767ea1 ("netfilter: nft_hash: Add hash offset value")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise, user cannot add related rules if xt_ipcomp.ko is not loaded:
# iptables -A OUTPUT -p 108 -m ipcomp --ipcompspi 1
iptables: No chain/target/match by that name.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>