Commit Graph

487 Commits

Author SHA1 Message Date
Florian Westphal
23014011ba netfilter: conntrack: support a fixed size of 128 distinct labels
The conntrack label extension is currently variable-sized, e.g. if
only 2 labels are used by iptables rules then the labels->bits[] array
will only contain one element.

We track size of each label storage area in the 'words' member.

But in nftables and openvswitch we always have to ask for worst-case
since we don't know what bit will be used at configuration time.

As most arches are 64bit we need to allocate 24 bytes in this case:

struct nf_conn_labels {
    u8            words;   /*     0     1 */
    /* XXX 7 bytes hole, try to pack */
    long unsigned bits[2]; /*     8     24 */

Make bits a fixed size and drop the words member, it simplifies
the code and only increases memory requirements on x86 when
less than 64bit labels are required.

We still only allocate the extension if its needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-22 17:04:55 +02:00
David S. Miller
ee58b57100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:03:36 -04:00
Samuel Gauthier
d913d3a763 openvswitch: fix conntrack netlink event delivery
Only the first and last netlink message for a particular conntrack are
actually sent. The first message is sent through nf_conntrack_confirm when
the conntrack is committed. The last one is sent when the conntrack is
destroyed on timeout. The other conntrack state change messages are not
advertised.

When the conntrack subsystem is used from netfilter, nf_conntrack_confirm
is called for each packet, from the postrouting hook, which in turn calls
nf_ct_deliver_cached_events to send the state change netlink messages.

This commit fixes the problem by calling nf_ct_deliver_cached_events in the
non-commit case as well.

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
CC: Joe Stringer <joestringer@nicira.com>
CC: Justin Pettit <jpettit@nicira.com>
CC: Andy Zhou <azhou@nicira.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-29 08:13:59 -04:00
Jarno Rajahalme
7d904c7bcd openvswitch: Only set mark and labels with a commit flag.
Only set conntrack mark or labels when the commit flag is specified.
This makes sure we can not set them before the connection has been
persisted, as in that case the mark and labels would be lost in an
event of an userspace upcall.

OVS userspace already requires the commit flag to accept setting
ct_mark and/or ct_labels.  Validate for this in the kernel API.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 11:55:51 -04:00
Jarno Rajahalme
1c1779fa54 openvswitch: Set mark and labels before confirming.
Set conntrack mark and labels right before committing so that
the initial conntrack NEW event has the mark and labels.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 11:55:51 -04:00
William Tu
b95e5928fc openvswitch: Add packet len info to upcall.
The commit f2a4d086ed ("openvswitch: Add packet truncation support.")
introduces packet truncation before sending to userspace upcall receiver.
This patch passes up the skb->len before truncation so that the upcall
receiver knows the original packet size. Potentially this will be used
by sFlow, where OVS translates sFlow config header=N to a sample action,
truncating packet to N byte in kernel datapath. Thus, only N bytes instead
of full-packet size is copied from kernel to userspace, saving the
kernel-to-userspace bandwidth.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:34:39 -04:00
William Tu
f2a4d086ed openvswitch: Add packet truncation support.
The patch adds a new OVS action, OVS_ACTION_ATTR_TRUNC, in order to
truncate packets. A 'max_len' is added for setting up the maximum
packet size, and a 'cutlen' field is to record the number of bytes
to trim the packet when the packet is outputting to a port, or when
the packet is sent to userspace.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 17:58:03 -07:00
Zhang Shengju
684ff4ef5e ovs: set name assign type of internal port
Set name_assign_type of internal port to NET_NAME_USER.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-02 18:05:47 -04:00
Simon Horman
bc7cc5999f openvswitch: update checksum in {push,pop}_mpls
In the case of CHECKSUM_COMPLETE the skb checksum should be updated in
{push,pop}_mpls() as they the type in the ethernet header.

As suggested by Pravin Shelar.

Cc: Pravin Shelar <pshelar@nicira.com>
Fixes: 25cd9ba0ab ("openvswitch: Add basic MPLS support to kernel")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-31 13:51:42 -07:00
David S. Miller
909b27f706 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.

The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.

Conflicts:
	drivers/net/wireless/intel/iwlwifi/mvm/tx.c
	net/ipv4/ip_gre.c
	net/netfilter/nf_conntrack_core.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-15 13:32:48 -04:00
Joe Stringer
16ec3d4fbb openvswitch: Fix cached ct with helper.
When using conntrack helpers from OVS, a common configuration is to
perform a lookup without specifying a helper, then go through a
firewalling policy, only to decide to attach a helper afterwards.

In this case, the initial lookup will cause a ct entry to be attached to
the skb, then the later commit with helper should attach the helper and
confirm the connection. However, the helper attachment has been missing.
If the user has enabled automatic helper attachment, then this issue
will be masked as it will be applied in init_conntrack(). It is also
masked if the action is executed from ovs_packet_cmd_execute() as that
will construct a fresh skb.

This patch fixes the issue by making an explicit call to try to assign
the helper if there is a discrepancy between the action's helper and the
current skb->nfct.

Fixes: cae3a26275 ("openvswitch: Allow attaching helpers to ct action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 15:14:56 -04:00
David S. Miller
e8ed77dfa9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following large patchset contains Netfilter updates for your
net-next tree. My initial intention was to send you this in two goes but
when I looked back twice I already had this burden on top of me.

Several updates for IPVS from Marco Angaroni:

1) Allow SIP connections originating from real-servers to be load
   balanced by the SIP persistence engine as is already implemented
   in the other direction.

2) Release connections immediately for One-packet-scheduling (OPS)
   in IPVS, instead of making it via timer and rcu callback.

3) Skip deleting conntracks for each one packet in OPS, and don't call
   nf_conntrack_alter_reply() since no reply is expected.

4) Enable drop on exhaustion for OPS + SIP persistence.

Miscelaneous conntrack updates from Florian Westphal, including fix for
hash resize:

5) Move conntrack generation counter out of conntrack pernet structure
   since this is only used by the init_ns to allow hash resizing.

6) Use get_random_once() from packet path to collect hash random seed
    instead of our compound.

7) Don't disable BH from ____nf_conntrack_find() for statistics,
   use NF_CT_STAT_INC_ATOMIC() instead.

8) Fix lookup race during conntrack hash resizing.

9) Introduce clash resolution on conntrack insertion for connectionless
   protocol.

Then, Florian's netns rework to get rid of per-netns conntrack table,
thus we use one single table for them all. There was consensus on this
change during the NFWS 2015 and, on top of that, it has recently been
pointed as a source of multiple problems from unpriviledged netns:

11) Use a single conntrack hashtable for all namespaces. Include netns
    in object comparisons and make it part of the hash calculation.
    Adapt early_drop() to consider netns.

12) Use single expectation and NAT hashtable for all namespaces.

13) Use a single slab cache for all namespaces for conntrack objects.

14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet
    conntrack counter tells us the table is empty (ie. equals zero).

Fixes for nf_tables interval set element handling, support to set
conntrack connlabels and allow set names up to 32 bytes.

15) Parse element flags from element deletion path and pass it up to the
    backend set implementation.

16) Allow adjacent intervals in the rbtree set type for dynamic interval
    updates.

17) Add support to set connlabel from nf_tables, from Florian Westphal.

18) Allow set names up to 32 bytes in nf_tables.

Several x_tables fixes and updates:

19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch
    from Andrzej Hajda.

And finally, miscelaneous netfilter updates such as:

20) Disable automatic helper assignment by default. Note this proc knob
    was introduced by a900689264 ("netfilter: nf_ct_helper: allow to
    disable automatic helper assignment") 4 years ago to start moving
    towards explicit conntrack helper configuration via iptables CT
    target.

21) Get rid of obsolete and inconsistent debugging instrumentation
    in x_tables.

22) Remove unnecessary check for null after ip6_route_output().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09 15:02:58 -04:00
Pablo Neira Ayuso
3b78155b1b openvswitch: __nf_ct_l{3,4}proto_find() always return a valid pointer
If the protocol is not natively supported, this assigns generic protocol
tracker so we can always assume a valid pointer after these calls.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Joe Stringer <joe@ovn.org>
2016-05-05 16:39:50 +02:00
Nicolas Dichtel
66c7a5ee1a ovs: align nlattr properly when needed
I also fix commit 8b32ab9e6ef1: use nla_total_size_64bit() for
OVS_FLOW_ATTR_USED in ovs_flow_cmd_msg_size().

Fixes: 8b32ab9e6ef1 ("ovs: use nla_put_u64_64bit()")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:48 -04:00
Nicolas Dichtel
0238b7204b ovs: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:09:10 -04:00
David S. Miller
11afbff861 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:

1) Ensure we don't go over the ruleset blob boundaries in
   mark_source_chains().

2) Validate that target jumps land on an existing xt_entry. This extra
   sanitization comes with a performance penalty when loading the ruleset.

3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.

4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.

5) Make sure the minimal possible target size in x_tables.

6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.

7) Check that standard target size is valid.

8) More sanitization to ensure that the target_offset field is correct.

9) Add xt_check_entry_match() to validate that matches are well-formed.

10-12) Three patch to reduce the number of parameters in
    translate_compat_table() for {arp,ip,ip6}tables by using a container
    structure.

13) No need to return value from xt_compat_match_from_user(), so make
    it void.

14) Consolidate translate_table() so it can be used by compat code too.

15) Remove obsolete check for compat code, so we keep consistent with
    what was already removed in the native layout code (back in 2007).

16) Get rid of target jump validation from mark_source_chains(),
    obsoleted by #2.

17) Introduce xt_copy_counters_from_user() to consolidate counter
    copying, and use it from {arp,ip,ip6}tables.

18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
    functions.

19) Move nf_connlabel_match() to xt_connlabel.

20) Skip event notification if connlabel did not change.

21) Update of nf_connlabels_get() to make the upcoming nft connlabel
    support easier.

23) Remove spinlock to read protocol state field in conntrack.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 00:12:08 -04:00
Nicolas Dichtel
b46f6ded90 libnl: nla_put_be64(): align on a 64-bit area
nla_data() is now aligned on a 64-bit area.

A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
This function is removed in the next patch.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 20:13:24 -04:00
David S. Miller
1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Simon Horman
b4f70527f0 openvswitch: use flow protocol when recalculating ipv6 checksums
When using masked actions the ipv6_proto field of an action
to set IPv6 fields may be zero rather than the prevailing protocol
which will result in skipping checksum recalculation.

This patch resolves the problem by relying on the protocol
in the flow key rather than that in the set field action.

Fixes: 83d2b9ba1a ("net: openvswitch: Support masked set actions.")
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 15:28:47 -04:00
Joe Stringer
49e261a8a2 openvswitch: Orphan skbs before IPv6 defrag
This is the IPv6 counterpart to commit 8282f27449 ("inet: frag: Always
orphan skbs inside ip_defrag()").

Prior to commit 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free
clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
cloned (implicitly orphaning) prior to queueing for reassembly. As such,
when the IPv6 message is eventually reassembled, the skb->sk for all
fragments would be NULL. After that commit was introduced, rather than
cloning, the original skbs were queued directly without orphaning. The
end result is that all frags except for the first and last may have a
socket attached.

This commit explicitly orphans such skbs during nf_ct_frag6_gather() to
prevent BUG_ON(skb->sk) during a later call to ip6_fragment().

kernel BUG at net/ipv6/ip6_output.c:631!
[...]
Call Trace:
 <IRQ>
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffffa042c7c0>] ? do_output.isra.28+0x1b0/0x1b0 [openvswitch]
 [<ffffffff810bb8a2>] ? __lock_is_held+0x52/0x70
 [<ffffffffa042c587>] ovs_fragment+0x1f7/0x280 [openvswitch]
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
 [<ffffffff81697ea0>] ? dst_discard_out+0x20/0x20
 [<ffffffff81697e80>] ? dst_ifdown+0x80/0x80
 [<ffffffffa042c703>] do_output.isra.28+0xf3/0x1b0 [openvswitch]
 [<ffffffffa042d279>] do_execute_actions+0x709/0x12c0 [openvswitch]
 [<ffffffffa04340a4>] ? ovs_flow_stats_update+0x74/0x1e0 [openvswitch]
 [<ffffffffa04340d1>] ? ovs_flow_stats_update+0xa1/0x1e0 [openvswitch]
 [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa042de75>] ovs_execute_actions+0x45/0x120 [openvswitch]
 [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
 [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa042def4>] ovs_execute_actions+0xc4/0x120 [openvswitch]
 [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
 [<ffffffffa04337f2>] ? key_extract+0x442/0xc10 [openvswitch]
 [<ffffffffa043b26d>] ovs_vport_receive+0x5d/0xb0 [openvswitch]
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
 [<ffffffffa043c11d>] internal_dev_xmit+0x6d/0x150 [openvswitch]
 [<ffffffffa043c0b5>] ? internal_dev_xmit+0x5/0x150 [openvswitch]
 [<ffffffff8168fb5f>] dev_hard_start_xmit+0x2df/0x660
 [<ffffffff8168f5ea>] ? validate_xmit_skb.isra.105.part.106+0x1a/0x2b0
 [<ffffffff81690925>] __dev_queue_xmit+0x8f5/0x950
 [<ffffffff81690080>] ? __dev_queue_xmit+0x50/0x950
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff81690990>] dev_queue_xmit+0x10/0x20
 [<ffffffff8169a418>] neigh_resolve_output+0x178/0x220
 [<ffffffff81752759>] ? ip6_finish_output2+0x219/0x7b0
 [<ffffffff81752759>] ip6_finish_output2+0x219/0x7b0
 [<ffffffff817525a5>] ? ip6_finish_output2+0x65/0x7b0
 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
 [<ffffffff817796cc>] icmpv6_push_pending_frames+0xac/0xe0
 [<ffffffff8177a4be>] icmpv6_echo_reply+0x42e/0x500
 [<ffffffff8177acbf>] icmpv6_rcv+0x4cf/0x580
 [<ffffffff81755ac7>] ip6_input_finish+0x1a7/0x690
 [<ffffffff81755925>] ? ip6_input_finish+0x5/0x690
 [<ffffffff817567a0>] ip6_input+0x30/0xa0
 [<ffffffff81755920>] ? ip6_rcv_finish+0x1a0/0x1a0
 [<ffffffff817557ce>] ip6_rcv_finish+0x4e/0x1a0
 [<ffffffff8175640f>] ipv6_rcv+0x45f/0x7c0
 [<ffffffff81755fe6>] ? ipv6_rcv+0x36/0x7c0
 [<ffffffff81755780>] ? ip6_make_skb+0x1c0/0x1c0
 [<ffffffff8168b649>] __netif_receive_skb_core+0x229/0xb80
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff8168c07f>] ? process_backlog+0x6f/0x230
 [<ffffffff8168bfb6>] __netif_receive_skb+0x16/0x70
 [<ffffffff8168c088>] process_backlog+0x78/0x230
 [<ffffffff8168c0ed>] ? process_backlog+0xdd/0x230
 [<ffffffff8168db43>] net_rx_action+0x203/0x480
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff817c156e>] __do_softirq+0xde/0x49f
 [<ffffffff81752768>] ? ip6_finish_output2+0x228/0x7b0
 [<ffffffff817c070c>] do_softirq_own_stack+0x1c/0x30
 <EOI>
 [<ffffffff8106f88b>] do_softirq.part.18+0x3b/0x40
 [<ffffffff8106f946>] __local_bh_enable_ip+0xb6/0xc0
 [<ffffffff81752791>] ip6_finish_output2+0x251/0x7b0
 [<ffffffff81754af1>] ? ip6_fragment+0xba1/0xc50
 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
 [<ffffffff81778558>] rawv6_sendmsg+0xa28/0xe30
 [<ffffffff81719097>] ? inet_sendmsg+0xc7/0x1d0
 [<ffffffff817190d6>] inet_sendmsg+0x106/0x1d0
 [<ffffffff81718fd5>] ? inet_sendmsg+0x5/0x1d0
 [<ffffffff8166d078>] sock_sendmsg+0x38/0x50
 [<ffffffff8166d4d6>] SYSC_sendto+0xf6/0x170
 [<ffffffff8100201b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
 [<ffffffff8166e38e>] SyS_sendto+0xe/0x10
 [<ffffffff817bebe5>] entry_SYSCALL_64_fastpath+0x18/0xa8
Code: 06 48 83 3f 00 75 26 48 8b 87 d8 00 00 00 2b 87 d0 00 00 00 48 39 d0 72 14 8b 87 e4 00 00 00 83 f8 01 75 09 48 83 7f 18 00 74 9a <0f> 0b 41 8b 86 cc 00 00 00 49 8#
RIP  [<ffffffff8175468a>] ip6_fragment+0x73a/0xc50
 RSP <ffff880072803120>

Fixes: 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone
operations")
Reported-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 13:42:05 -04:00
Florian Westphal
adff6c6560 netfilter: connlabels: change nf_connlabels_get bit arg to 'highest used'
nf_connlabel_set() takes the bit number that we would like to set.
nf_connlabels_get() however took the number of bits that we want to
support.

So e.g. nf_connlabels_get(32) support bits 0 to 31, but not 32.
This changes nf_connlabels_get() to take the highest bit that we want
to set.

Callers then don't have to cope with a potential integer wrap
when using nf_connlabels_get(bit + 1) anymore.

Current callers are fine, this change is only to make folloup
nft ct label set support simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-18 20:39:48 +02:00
Phil Sutter
4272cc51a6 openvswitch: Convert to using IFF_NO_QUEUE
Cc: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 22:02:14 -04:00
David S. Miller
0c84ea17ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for you net tree,
they are:

1) There was a race condition between parallel save/swap and delete,
   which resulted a kernel crash due to the increase ref for save, swap,
   wrong ref decrease operations. Reported and fixed by Vishwanath Pai.

2) OVS should call into CT NAT for packets of new expected connections only
   when the conntrack state is persisted with the 'commit' option to the
   OVS CT action. From Jarno Rajahalme.

3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann.

4) Early validation of entry->target_offset to make sure it doesn't take us
   out from the blob, from Florian Westphal.

5) Again early validation of entry->next_offset to make sure it doesn't take
   out from the blob, also from Florian.

6) Check that entry->target_offset is always of of sizeof(struct xt_entry)
   for unconditional entries, when checking both from check_underflow()
   and when checking for loops in mark_source_chains(), again from
   Florian.

7) Fix inconsistent behaviour in nfnetlink_queue when
   NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer
   overrun, we have to reinject the packet as the user expects.

8) Enforce nul-terminated table names from getsockopt GET_ENTRIES
   requests.

9) Don't assume skb->sk is set from nft_bridge_reject and synproxy,
   this fixes a recent update of the code to namespaceify
   ip_default_ttl, patch from Liping Zhang.

This batch comes with four patches to validate x_tables blobs coming
from userspace. CONFIG_USERNS exposes the x_tables interface to
unpriviledged users and to be honest this interface never received the
attention for this move away from the CAP_NET_ADMIN domain. Florian is
working on another round with more patches with more sanity checks, so
expect a bit more Netfilter fixes in this development cycle than usual.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28 15:38:59 -04:00
Arnd Bergmann
99b7248e2a openvswitch: call only into reachable nf-nat code
The openvswitch code has gained support for calling into the
nf-nat-ipv4/ipv6 modules, however those can be loadable modules
in a configuration in which openvswitch is built-in, leading
to link errors:

net/built-in.o: In function `__ovs_ct_lookup':
:(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
:(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'

The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
but NF_NAT is set to 'y' if any of the symbols selecting
it are built-in, but the link error happens when any of them
are modular.

A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
to be useful in practice, but the driver currently only handles
IPv6 being optional.

This patch improves the Kconfig dependency so that openvswitch
cannot be built-in if either of the two other symbols are set
to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
with two "if (IS_ENABLED())" checks that should catch all corner
cases also make the code more readable.

The same #ifdef exists ovs_ct_nat_to_attr(), where it does not
cause a link error, but for consistency I'm changing it the same
way.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 05752523e5 ("openvswitch: Interface with NAT.")
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:58:59 +02:00
Jarno Rajahalme
5745b0be05 openvswitch: Fix checking for new expected connections.
OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action.  The test for this condition is doubly wrong, as the CT
status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
than the mask (IPS_EXPECTED), and due to the wrong assumption that the
expected bit would apply only for the first (i.e., 'new') packet of a
connection, while in fact the expected bit remains on for the lifetime of
an expected connection.  The 'ctinfo' value IP_CT_RELATED derived from
the ct status can be used instead, as it is only ever applicable to
the 'new' packets of the expected connection.

Fixes: 05752523e5 ('openvswitch: Interface with NAT.')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:58:51 +02:00
Haishuang Yan
ac71b46efd openvswitch: Use proper buffer size in nla_memcpy
For the input parameter count, it's better to use the size
of destination buffer size, as nla_memcpy would take into
account the length of the source netlink attribute when
a data is copied from an attribute.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28 11:37:14 -04:00
Linus Torvalds
1200b6809d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support more Realtek wireless chips, from Jes Sorenson.

   2) New BPF types for per-cpu hash and arrap maps, from Alexei
      Starovoitov.

   3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

   4) Allow the use of SO_REUSEPORT in order to do per-thread processing
   of incoming TCP/UDP connections.  The muxing can be done using a
   BPF program which hashes the incoming packet.  From Craig Gallek.

   5) Add a multiplexer for TCP streams, to provide a messaged based
      interface.  BPF programs can be used to determine the message
      boundaries.  From Tom Herbert.

   6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

   7) Avoid factorial complexity when taking down an inetdev interface
      with lots of configured addresses.  We were doing things like
      traversing the entire address less for each address removed, and
      flushing the entire netfilter conntrack table for every address as
      well.

   8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

   9) Allow offloading u32 classifiers to hardware, and implement for
      ixgbe, from John Fastabend.

  10) Allow configuring IRQ coalescing parameters on a per-queue basis,
      from Kan Liang.

  11) Extend ethtool so that larger link mode masks can be supported.
      From David Decotigny.

  12) Introduce devlink, which can be used to configure port link types
      (ethernet vs Infiniband, etc.), port splitting, and switch device
      level attributes as a whole.  From Jiri Pirko.

  13) Hardware offload support for flower classifiers, from Amir Vadai.

  14) Add "Local Checksum Offload".  Basically, for a tunneled packet
      the checksum of the outer header is 'constant' (because with the
      checksum field filled into the inner protocol header, the payload
      of the outer frame checksums to 'zero'), and we can take advantage
      of that in various ways.  From Edward Cree"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
  bonding: fix bond_get_stats()
  net: bcmgenet: fix dma api length mismatch
  net/mlx4_core: Fix backward compatibility on VFs
  phy: mdio-thunder: Fix some Kconfig typos
  lan78xx: add ndo_get_stats64
  lan78xx: handle statistics counter rollover
  RDS: TCP: Remove unused constant
  RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
  net: smc911x: convert pxa dma to dmaengine
  team: remove duplicate set of flag IFF_MULTICAST
  bonding: remove duplicate set of flag IFF_MULTICAST
  net: fix a comment typo
  ethernet: micrel: fix some error codes
  ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
  bpf, dst: add and use dst_tclassid helper
  bpf: make skb->tc_classid also readable
  net: mvneta: bm: clarify dependencies
  cls_bpf: reset class and reuse major in da
  ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
  ldmvsw: Add ldmvsw.c driver code
  ...
2016-03-19 10:05:34 -07:00
Daniel Borkmann
fca5fdf67d ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
eBPF defines this as BPF_TUNLEN_MAX and OVS just uses the hard-coded
value inside struct sw_flow_key. Thus, add and use IP_TUNNEL_OPTS_MAX
for this, which makes the code a bit more generic and allows to remove
BPF_TUNLEN_MAX from eBPF code.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18 19:38:46 -04:00
Simon Horman
fe3a5f6c79 openvswitch: allow output of MPLS packets on tunnel vports
Currently output of MPLS packets on tunnel vports is not allowed by Open
vSwitch. This is because historically encapsulation was done in such a way
that the inner_protocol field of the skb needed to hold the inner protocol
for both MPLS and tunnel encapsulation in order for GSO segmentation to be
performed correctly.

Since b2acd1dc39 ("openvswitch: Use regular GRE net_device instead of
vport") Open vSwitch makes use of lwt to output to tunnel netdevs which
perform encapsulation. As no drivers expose support for MPLS offloads this
means that GSO packets are segmented in software by validate_xmit_skb(),
which is called from __dev_queue_xmit(), before tunnel encapsulation occurs.
This means that the inner protocol of MPLS is no longer needed by the time
encapsulation occurs and the contention on the inner_protocol field of the
skb no longer occurs.

Thus it is now safe to output MPLS to tunnel vports.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18 18:26:38 -04:00
Wu Fengguang
e014e84685 ovs: internal_set_rx_headroom() can be static
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18 17:50:36 -04:00
Linus Torvalds
49dc2b7173 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  drivers/rtc: broken link fix
  drm/i915 Fix typos in i915_gem_fence.c
  Docs: fix missing word in REPORTING-BUGS
  lib+mm: fix few spelling mistakes
  MAINTAINERS: add git URL for APM driver
  treewide: Fix typo in printk
2016-03-17 21:38:27 -07:00
David S. Miller
1cdba55055 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS updates for net-next

The following patchset contains Netfilter/IPVS fixes and OVS NAT
support, more specifically this batch is composed of:

1) Fix a crash in ipset when performing a parallel flush/dump with
   set:list type, from Jozsef Kadlecsik.

2) Make sure NFACCT_FILTER_* netlink attributes are in place before
   accessing them, from Phil Turnbull.

3) Check return error code from ip_vs_fill_iph_skb_off() in IPVS SIP
   helper, from Arnd Bergmann.

4) Add workaround to IPVS to reschedule existing connections to new
   destination server by dropping the packet and wait for retransmission
   of TCP syn packet, from Julian Anastasov.

5) Allow connection rescheduling in IPVS when in CLOSE state, also
   from Julian.

6) Fix wrong offset of SIP Call-ID in IPVS helper, from Marco Angaroni.

7) Validate IPSET_ATTR_ETHER netlink attribute length, from Jozsef.

8) Check match/targetinfo netlink attribute size in nft_compat,
   patch from Florian Westphal.

9) Check for integer overflow on 32-bit systems in x_tables, from
   Florian Westphal.

Several patches from Jarno Rajahalme to prepare the introduction of
NAT support to OVS based on the Netfilter infrastructure:

10) Schedule IP_CT_NEW_REPLY definition for removal in
    nf_conntrack_common.h.

11) Simplify checksumming recalculation in nf_nat.

12) Add comments to the openvswitch conntrack code, from Jarno.

13) Update the CT state key only after successful nf_conntrack_in()
    invocation.

14) Find existing conntrack entry after upcall.

15) Handle NF_REPEAT case due to templates in nf_conntrack_in().

16) Call the conntrack helper functions once the conntrack has been
    confirmed.

17) And finally, add the NAT interface to OVS.

The batch closes with:

18) Cleanup to use spin_unlock_wait() instead of
    spin_lock()/spin_unlock(), from Nicholas Mc Guire.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 22:10:25 -04:00
Jarno Rajahalme
05752523e5 openvswitch: Interface with NAT.
Extend OVS conntrack interface to cover NAT.  New nested
OVS_CT_ATTR_NAT attribute may be used to include NAT with a CT action.
A bare OVS_CT_ATTR_NAT only mangles existing and expected connections.
If OVS_NAT_ATTR_SRC or OVS_NAT_ATTR_DST is included within the nested
attributes, new (non-committed/non-confirmed) connections are mangled
according to the rest of the nested attributes.

The corresponding OVS userspace patch series includes test cases (in
tests/system-traffic.at) that also serve as example uses.

This work extends on a branch by Thomas Graf at
https://github.com/tgraf/ovs/tree/nat.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:29 +01:00
Jarno Rajahalme
28b6e0c1ac openvswitch: Delay conntrack helper call for new connections.
There is no need to help connections that are not confirmed, so we can
delay helping new connections to the time when they are confirmed.
This change is needed for NAT support, and having this as a separate
patch will make the following NAT patch a bit easier to review.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:29 +01:00
Jarno Rajahalme
5b6b929376 openvswitch: Handle NF_REPEAT in conntrack action.
Repeat the nf_conntrack_in() call when it returns NF_REPEAT.  This
avoids dropping a SYN packet re-opening an existing TCP connection.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:29 +01:00
Jarno Rajahalme
289f225349 openvswitch: Find existing conntrack entry after upcall.
Add a new function ovs_ct_find_existing() to find an existing
conntrack entry for which this packet was already applied to.  This is
only to be called when there is evidence that the packet was already
tracked and committed, but we lost the ct reference due to an
userspace upcall.

ovs_ct_find_existing() is called from skb_nfct_cached(), which can now
hide the fact that the ct reference may have been lost due to an
upcall.  This allows ovs_ct_commit() to be simplified.

This patch is needed by later "openvswitch: Interface with NAT" patch,
as we need to be able to pass the packet through NAT using the
original ct reference also after the reference is lost after an
upcall.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:28 +01:00
Jarno Rajahalme
394e910e90 openvswitch: Update the CT state key only after nf_conntrack_in().
Only a successful nf_conntrack_in() call can effect a connection state
change, so it suffices to update the key only after the
nf_conntrack_in() returns.

This change is needed for the later NAT patches.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:28 +01:00
Jarno Rajahalme
9f13ded8d3 openvswitch: Add commentary to conntrack.c
This makes the code easier to understand and the following patches
more focused.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:27 +01:00
Jarno Rajahalme
bfa3f9d7f3 netfilter: Remove IP_CT_NEW_REPLY definition.
Remove the definition of IP_CT_NEW_REPLY from the kernel as it does
not make sense.  This allows the definition of IP_CT_NUMBER to be
simplified as well.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:27 +01:00
Samuel Gauthier
6f15cdbf8a ovs: allow nl 'flow set' to use ufid without flow key
When we want to change a flow using netlink, we have to identify it to
be able to perform a lookup. Both the flow key and unique flow ID
(ufid) are valid identifiers, but we always have to specify the flow
key in the netlink message. When both attributes are there, the ufid
is used. The flow key is used to validate the actions provided by
the userland.

This commit allows to use the ufid without having to provide the flow
key, as it is already done in the netlink 'flow get' and 'flow del'
path. The flow key remains mandatory when an action is provided.

Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 22:18:26 -04:00
Paolo Abeni
3a927bc7cf ovs: propagate per dp max headroom to all vports
This patch implements bookkeeping support to compute the maximum
headroom for all the devices in each datapath. When said value
changes, the underlying devs are notified via the
ndo_set_rx_headroom method.

This also increases the internal vports xmit performance.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 15:54:30 -05:00
David S. Miller
b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Daniel Borkmann
6b83d28a55 net: use skb_postpush_rcsum instead of own implementations
Replace individual implementations with the recently introduced
skb_postpush_rcsum() helper.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tom Herbert <tom@herbertland.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 23:43:10 -05:00
Paolo Abeni
c868ee7063 lwt: fix rx checksum setting for lwt devices tunneling over ipv6
the commit 35e2d1152b ("tunnels: Allow IPv6 UDP checksums to be
correctly controlled.") changed the default xmit checksum setting
for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not
set into external UDP header.
This commit changes the rx checksum setting for both lwt vxlan/geneve
devices created by openvswitch accordingly, so that lwt over ipv6
tunnel pairs are again able to communicate with default values.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 15:39:30 -05:00
Florian Westphal
263ea09084 Revert "genl: Add genlmsg_new_unicast() for unicast message allocation"
This reverts commit bb9b18fb55 ("genl: Add genlmsg_new_unicast() for
unicast message allocation")'.

Nothing wrong with it; its no longer needed since this was only for
mmapped netlink support.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:19 -05:00
Florian Westphal
551ddc057e openvswitch: Revert: "Enable memory mapped Netlink i/o"
revert commit 795449d8b8 ("openvswitch: Enable memory mapped Netlink i/o").
Following the mmaped netlink removal this code can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:18 -05:00
Paolo Abeni
d71785ffc7 net: add dst_cache to ovs vxlan lwtunnel
In case of UDP traffic with datagram length
below MTU this give about 2% performance increase
when tunneling over ipv4 and about 60% when tunneling
over ipv6

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:48 -05:00
Masanari Iida
fc4fa6e112 treewide: Fix typo in printk
This patch fix spelling typos found in printk and Kconfig.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-02-15 11:18:22 +01:00
Tycho Andersen
4a92602aa1 openvswitch: allow management from inside user namespaces
Operations with the GENL_ADMIN_PERM flag fail permissions checks because
this flag means we call netlink_capable, which uses the init user ns.

Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
which should be allowed inside a user namespace.

The motivation for this is to be able to run openvswitch in unprivileged
containers. I've tested this and it seems to work, but I really have no
idea about the security consequences of this patch, so thoughts would be
much appreciated.

v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one
    massive one

Reported-by: James Page <james.page@canonical.com>
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
CC: Eric Biederman <ebiederm@xmission.com>
CC: Pravin Shelar <pshelar@ovn.org>
CC: Justin Pettit <jpettit@nicira.com>
CC: "David S. Miller" <davem@davemloft.net>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 09:53:19 -05:00
David Wragg
7e059158d5 vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets.  4.3 introduced netdevs corresponding
to tunnel vports.  These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated.  The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.

Instead, set the MTU on openvswitch-created netdevs to be the relevant
maximum (i.e. the maximum IP packet size minus any relevant overhead),
effectively restoring the behaviour prior to 4.3.

Signed-off-by: David Wragg <david@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-10 05:50:03 -05:00