Commit Graph

7435 Commits

Author SHA1 Message Date
Eric Dumazet
3b24d854cb tcp/dccp: do not touch listener sk_refcnt under synflood
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :

On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:20 -04:00
Eric Dumazet
2d331915a0 tcp/dccp: use rcu locking in inet_diag_find_one_icsk()
RX packet processing holds rcu_read_lock(), so we can remove
pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
if inet_diag also holds rcu before calling them.

This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment
refcount on the found listener.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:19 -04:00
Eric Dumazet
ca065d0cf8 udp: no longer use SLAB_DESTROY_BY_RCU
Tom Herbert would like not touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a grace
period before freeing a socket. UDP sockets are not short lived in the
high usage case, so the added cost of call_rcu() should not be a concern.

This actually removes a lot of complexity in UDP stack.

Multicast receives no longer need to hold a bucket spinlock.

Note that ip early demux still needs to take a reference on the socket.

Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
but this might be changed later.

Performance for a single UDP socket receiving flood traffic from
many RX queues/cpus.

Simple udp_rx using simple recvfrom() loop :
438 kpps instead of 374 kpps : 17 % increase of the peak rate.

v2: Addressed Willem de Bruijn feedback in multicast handling
 - keep early demux break in __udp4_lib_demux_lookup()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:19 -04:00
Soheil Hassas Yeganeh
c14ac9451c sock: enable timestamping using control messages
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh
24025c465f ipv4: process socket-level control messages in IPv4
Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh
6b084928ba tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.

To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:29 -04:00
Haishuang Yan
7822ce73e6 netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address
Since nla_get_in_addr and nla_put_in_addr were implemented,
so use them appropriately.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02 20:15:58 -04:00
Yuchung Cheng
2349262397 tcp: remove cwnd moderation after recovery
For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.

This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.

On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.

By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.

v2 changes: revised commit message on bogus ACKs and burst, and
            missing signature

Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02 20:11:43 -04:00
Alexander Duyck
c3483384ee gro: Allow tunnel stacking in the case of FOU/GUE
This patch should fix the issues seen with a recent fix to prevent
tunnel-in-tunnel frames from being generated with GRO.  The fix itself is
correct for now as long as we do not add any devices that support
NETIF_F_GSO_GRE_CSUM.  When such a device is added it could have the
potential to mess things up due to the fact that the outer transport header
points to the outer UDP header and not the GRE header as would be expected.

Fixes: fac8e0f579 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30 16:02:33 -04:00
Liping Zhang
29421198c3 netfilter: ipv4: fix NULL dereference
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
use sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk is always exist, so when it is NULL, oops will happen.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:29 +02:00
Pablo Neira Ayuso
b301f25387 netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES
Make sure the table names via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.

Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:24 +02:00
Florian Westphal
54d83fc74a netfilter: x_tables: fix unconditional helper
Ben Hawkes says:

 In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
 is possible for a user-supplied ipt_entry structure to have a large
 next_offset field. This field is not bounds checked prior to writing a
 counter value at the supplied offset.

Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so its supposed to return
an absolute verdict of either ACCEPT or DROP.

However, the function conditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.

However, an unconditional rule must also not be using any matches
(no -m args).

The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeeded to the next (not-existent) rule.

Unify this so that all the callers have same idea of 'unconditional rule'.

Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:15 +02:00
Florian Westphal
6e94e0cfb0 netfilter: x_tables: make sure e->next_offset covers remaining blob size
Otherwise this function may read data beyond the ruleset blob.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:08 +02:00
Florian Westphal
bdf533de69 netfilter: x_tables: validate e->target_offset early
We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28 17:59:04 +02:00
Quentin Armitage
995096a0a4 Fix returned tc and hoplimit values for route with IPv6 encapsulation
For a route with IPv6 encapsulation, the traffic class and hop limit
values are interchanged when returned to userspace by the kernel.
For example, see below.

># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
># ip route show table 1000
192.168.0.1  encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2  scope link

Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-27 22:35:02 -04:00
Alexander Duyck
5197f3499c net: Reset encap_level to avoid resetting features on inner IP headers
This patch corrects an oversight in which we were allowing the encap_level
value to pass from the outer headers to the inner headers.  As a result we
were incorrectly identifying UDP or GRE tunnels as also making use of ipip
or sit when the second header actually represented a tunnel encapsulated in
either a UDP or GRE tunnel which already had the features masked.

Fixes: 7644345622 ("net: Move GSO csum into SKB_GSO_CB")
Reported-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-23 14:19:08 -04:00
Lance Richardson
4cfc86f3da ipv4: initialize flowi4_flags before calling fib_lookup()
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
before calling fib_lookup(), which means fib_table_lookup() is
using non-deterministic data at this line:

	if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {

Fix by initializing the entire fl4 structure, which will prevent
similar issues as fields are added in the future by ensuring that
all fields are initialized to zero unless explicitly initialized
to another value.

Fixes: 58189ca7b2 ("net: Fix vti use case with oif in dst lookups")
Suggested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22 15:59:23 -04:00
Paolo Abeni
ad0ea1989c ipv4: fix broadcast packets reception
Currently, ingress ipv4 broadcast datagrams are dropped since,
in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on
bcast packets.

This patch addresses the issue, invoking ip_check_mc_rcu()
only for mcast packets.

Fixes: 6e54030932 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22 15:53:50 -04:00
Deepa Dinamani
3ba9d300c9 net: ipv4: Fix truncated timestamp returned by inet_current_timestamp()
The millisecond timestamps returned by the function is
converted to network byte order by making a call to htons().
htons() only returns __be16 while __be32 is required here.

This was identified by the sparse warning from the buildbot:
net/ipv4/af_inet.c:1405:16: sparse: incorrect type in return
			    expression (different base types)
net/ipv4/af_inet.c:1405:16: expected restricted __be32
net/ipv4/af_inet.c:1405:16: got restricted __be16 [usertype] <noident>

Change the function to use htonl() to return the correct __be32 type
instead so that the millisecond value doesn't get truncated.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Fixes: 822c868532 ("net: ipv4: Convert IP network timestamps to be y2038 safe")
Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot]
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-21 22:56:38 -04:00
Jesse Gross
a09a4c8dd1 tunnels: Remove encapsulation offloads on decap.
If a packet is either locally encapsulated or processed through GRO
it is marked with the offloads that it requires. However, when it is
decapsulated these tunnel offload indications are not removed. This
means that if we receive an encapsulated TCP packet, aggregate it with
GRO, decapsulate, and retransmit the resulting frame on a NIC that does
not support encapsulation, we won't be able to take advantage of hardware
offloads even though it is just a simple TCP packet at this point.

This fixes the problem by stripping off encapsulation offload indications
when packets are decapsulated.

The performance impacts of this bug are significant. In a test where a
Geneve encapsulated TCP stream is sent to a hypervisor, GRO'ed, decapsulated,
and bridged to a VM performance is improved by 60% (5Gbps->8Gbps) as a
result of avoiding unnecessary segmentation at the VM tap interface.

Reported-by: Ramu Ramamurthy <sramamur@linux.vnet.ibm.com>
Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 16:33:40 -04:00
Jesse Gross
fac8e0f579 tunnels: Don't apply GRO to multiple layers of encapsulation.
When drivers express support for TSO of encapsulated packets, they
only mean that they can do it for one layer of encapsulation.
Supporting additional levels would mean updating, at a minimum,
more IP length fields and they are unaware of this.

No encapsulation device expresses support for handling offloaded
encapsulated packets, so we won't generate these types of frames
in the transmit path. However, GRO doesn't have a check for
multiple levels of encapsulation and will attempt to build them.

UDP tunnel GRO actually does prevent this situation but it only
handles multiple UDP tunnels stacked on top of each other. This
generalizes that solution to prevent any kind of tunnel stacking
that would cause problems.

Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 16:33:40 -04:00
Jesse Gross
b8cba75bdf ipip: Properly mark ipip GRO packets as encapsulated.
ipip encapsulated packets can be merged together by GRO but the result
does not have the proper GSO type set or even marked as being
encapsulated at all. Later retransmission of these packets will likely
fail if the device does not support ipip offloads. This is similar to
the issue resolved in IPv6 sit in feec0cb3
("ipv6: gro: support sit protocol").

Reported-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
Fixes: 9667e9bb ("ipip: Add gro callbacks to ipip offload")
Tested-by: Patrick Boutilier <boutilpj@ednet.ns.ca>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 16:33:39 -04:00
Linus Torvalds
1200b6809d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support more Realtek wireless chips, from Jes Sorenson.

   2) New BPF types for per-cpu hash and arrap maps, from Alexei
      Starovoitov.

   3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

   4) Allow the use of SO_REUSEPORT in order to do per-thread processing
   of incoming TCP/UDP connections.  The muxing can be done using a
   BPF program which hashes the incoming packet.  From Craig Gallek.

   5) Add a multiplexer for TCP streams, to provide a messaged based
      interface.  BPF programs can be used to determine the message
      boundaries.  From Tom Herbert.

   6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

   7) Avoid factorial complexity when taking down an inetdev interface
      with lots of configured addresses.  We were doing things like
      traversing the entire address less for each address removed, and
      flushing the entire netfilter conntrack table for every address as
      well.

   8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

   9) Allow offloading u32 classifiers to hardware, and implement for
      ixgbe, from John Fastabend.

  10) Allow configuring IRQ coalescing parameters on a per-queue basis,
      from Kan Liang.

  11) Extend ethtool so that larger link mode masks can be supported.
      From David Decotigny.

  12) Introduce devlink, which can be used to configure port link types
      (ethernet vs Infiniband, etc.), port splitting, and switch device
      level attributes as a whole.  From Jiri Pirko.

  13) Hardware offload support for flower classifiers, from Amir Vadai.

  14) Add "Local Checksum Offload".  Basically, for a tunneled packet
      the checksum of the outer header is 'constant' (because with the
      checksum field filled into the inner protocol header, the payload
      of the outer frame checksums to 'zero'), and we can take advantage
      of that in various ways.  From Edward Cree"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
  bonding: fix bond_get_stats()
  net: bcmgenet: fix dma api length mismatch
  net/mlx4_core: Fix backward compatibility on VFs
  phy: mdio-thunder: Fix some Kconfig typos
  lan78xx: add ndo_get_stats64
  lan78xx: handle statistics counter rollover
  RDS: TCP: Remove unused constant
  RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
  net: smc911x: convert pxa dma to dmaengine
  team: remove duplicate set of flag IFF_MULTICAST
  bonding: remove duplicate set of flag IFF_MULTICAST
  net: fix a comment typo
  ethernet: micrel: fix some error codes
  ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
  bpf, dst: add and use dst_tclassid helper
  bpf: make skb->tc_classid also readable
  net: mvneta: bm: clarify dependencies
  cls_bpf: reset class and reuse major in da
  ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
  ldmvsw: Add ldmvsw.c driver code
  ...
2016-03-19 10:05:34 -07:00
Daniel Borkmann
fca5fdf67d ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
eBPF defines this as BPF_TUNLEN_MAX and OVS just uses the hard-coded
value inside struct sw_flow_key. Thus, add and use IP_TUNNEL_OPTS_MAX
for this, which makes the code a bit more generic and allows to remove
BPF_TUNLEN_MAX from eBPF code.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18 19:38:46 -04:00
Eric Dumazet
e316ea62e3 tcp/dccp: remove obsolete WARN_ON() in icmp handlers
Now SYN_RECV request sockets are installed in ehash table, an ICMP
handler can find a request socket while another cpu handles an incoming
packet transforming this SYN_RECV request socket into an ESTABLISHED
socket.

We need to remove the now obsolete WARN_ON(req->sk), since req->sk
is set when a new child is created and added into listener accept queue.

If this race happens, the ICMP will do nothing special.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ben Lazarus <blazarus@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-17 21:06:40 -04:00
Linus Torvalds
70477371dc Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 4.6:

  API:
   - Convert remaining crypto_hash users to shash or ahash, also convert
     blkcipher/ablkcipher users to skcipher.
   - Remove crypto_hash interface.
   - Remove crypto_pcomp interface.
   - Add crypto engine for async cipher drivers.
   - Add akcipher documentation.
   - Add skcipher documentation.

  Algorithms:
   - Rename crypto/crc32 to avoid name clash with lib/crc32.
   - Fix bug in keywrap where we zero the wrong pointer.

  Drivers:
   - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
   - Add PIC32 hwrng driver.
   - Support BCM6368 in bcm63xx hwrng driver.
   - Pack structs for 32-bit compat users in qat.
   - Use crypto engine in omap-aes.
   - Add support for sama5d2x SoCs in atmel-sha.
   - Make atmel-sha available again.
   - Make sahara hashing available again.
   - Make ccp hashing available again.
   - Make sha1-mb available again.
   - Add support for multiple devices in ccp.
   - Improve DMA performance in caam.
   - Add hashing support to rockchip"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: qat - remove redundant arbiter configuration
  crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
  crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
  crypto: qat - Change the definition of icp_qat_uof_regtype
  hwrng: exynos - use __maybe_unused to hide pm functions
  crypto: ccp - Add abstraction for device-specific calls
  crypto: ccp - CCP versioning support
  crypto: ccp - Support for multiple CCPs
  crypto: ccp - Remove check for x86 family and model
  crypto: ccp - memset request context to zero during import
  lib/mpi: use "static inline" instead of "extern inline"
  lib/mpi: avoid assembler warning
  hwrng: bcm63xx - fix non device tree compatibility
  crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
  crypto: qat - The AE id should be less than the maximal AE number
  lib/mpi: Endianness fix
  crypto: rockchip - add hash support for crypto engine in rk3288
  crypto: xts - fix compile errors
  crypto: doc - add skcipher API documentation
  crypto: doc - update AEAD AD handling
  ...
2016-03-17 11:22:54 -07:00
Peter Zijlstra
25528213fe tags: Fix DEFINE_PER_CPU expansions
$ make tags
  GEN     tags
ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

Which are all the result of the DEFINE_PER_CPU pattern:

  scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
  scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'

The below cures them. All except the workqueue one are within reasonable
distance of the 80 char limit. TJ do you have any preference on how to
fix the wq one, or shall we just not care its too long?

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
David S. Miller
1cdba55055 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS updates for net-next

The following patchset contains Netfilter/IPVS fixes and OVS NAT
support, more specifically this batch is composed of:

1) Fix a crash in ipset when performing a parallel flush/dump with
   set:list type, from Jozsef Kadlecsik.

2) Make sure NFACCT_FILTER_* netlink attributes are in place before
   accessing them, from Phil Turnbull.

3) Check return error code from ip_vs_fill_iph_skb_off() in IPVS SIP
   helper, from Arnd Bergmann.

4) Add workaround to IPVS to reschedule existing connections to new
   destination server by dropping the packet and wait for retransmission
   of TCP syn packet, from Julian Anastasov.

5) Allow connection rescheduling in IPVS when in CLOSE state, also
   from Julian.

6) Fix wrong offset of SIP Call-ID in IPVS helper, from Marco Angaroni.

7) Validate IPSET_ATTR_ETHER netlink attribute length, from Jozsef.

8) Check match/targetinfo netlink attribute size in nft_compat,
   patch from Florian Westphal.

9) Check for integer overflow on 32-bit systems in x_tables, from
   Florian Westphal.

Several patches from Jarno Rajahalme to prepare the introduction of
NAT support to OVS based on the Netfilter infrastructure:

10) Schedule IP_CT_NEW_REPLY definition for removal in
    nf_conntrack_common.h.

11) Simplify checksumming recalculation in nf_nat.

12) Add comments to the openvswitch conntrack code, from Jarno.

13) Update the CT state key only after successful nf_conntrack_in()
    invocation.

14) Find existing conntrack entry after upcall.

15) Handle NF_REPEAT case due to templates in nf_conntrack_in().

16) Call the conntrack helper functions once the conntrack has been
    confirmed.

17) And finally, add the NAT interface to OVS.

The batch closes with:

18) Cleanup to use spin_unlock_wait() instead of
    spin_lock()/spin_unlock(), from Nicholas Mc Guire.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 22:10:25 -04:00
Eric Dumazet
acffb584cd net: diag: add a scheduling point in inet_diag_dump_icsk()
On loaded TCP servers, looking at millions of sockets can hold
cpu for many seconds, if the lookup condition is very narrow.

(eg : ss dst 1.2.3.4 )

Better add a cond_resched() to allow other processes to access
the cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 19:38:09 -04:00
Jarno Rajahalme
264619055b netfilter: Allow calling into nat helper without skb_dst.
NAT checksum recalculation code assumes existence of skb_dst, which
becomes a problem for a later patch in the series ("openvswitch:
Interface with NAT.").  Simplify this by removing the check on
skb_dst, as the checksum will be dealt with later in the stack.

Suggested-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14 23:47:27 +01:00
Martin KaFai Lau
a44d6eacda tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In
Per RFC4898, they count segments sent/received
containing a positive length data segment (that includes
retransmission segments carrying data).  Unlike
tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
carrying no data (e.g. pure ack).

The patch also updates the segs_in in tcp_fastopen_add_skb()
so that segs_in >= data_segs_in property is kept.

Together with retransmission data, tcpi_data_segs_out
gives a better signal on the rxmit rate.

v6: Rebase on the latest net-next

v5: Eric pointed out that checking skb->len is still needed in
tcp_fastopen_add_skb() because skb can carry a FIN without data.
Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
helper is used.  Comment is added to the fastopen case to explain why
segs_in has to be reset and tcp_segs_in() has to be called before
__skb_pull().

v4: Add comment to the changes in tcp_fastopen_add_skb()
and also add remark on this case in the commit message.

v3: Add const modifier to the skb parameter in tcp_segs_in()

v2: Rework based on recent fix by Eric:
commit a9d99ce28e ("tcp: fix tcpi_segs_in after connection establishment")

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 14:55:26 -04:00
Alexander Duyck
0833482495 GSO/UDP: Use skb->len instead of udph->len to determine length of original skb
It is possible for tunnels to end up generating IP or IPv6 datagrams that
are larger than 64K and expecting to be segmented.  As such we need to deal
with length values greater than 64K.  In order to accommodate this we need
to update the code to work with a 32b length value instead of a 16b one.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 23:55:14 -04:00
David S. Miller
fbd40ea018 ipv4: Don't do expensive useless work during inetdev destroy.
When an inetdev is destroyed, every address assigned to the interface
is removed.  And in this scenerio we do two pointless things which can
be very expensive if the number of assigned interfaces is large:

1) Address promotion.  We are deleting all addresses, so there is no
   point in doing this.

2) A full nf conntrack table purge for every address.  We only need to
   do this once, as is already caught by the existing
   masq_dev_notifier so masq_inet_event() can skip this.

Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
2016-03-13 23:28:35 -04:00
Zhang Shengju
136ba622de netconf: add macro to represent all attributes
This patch adds macro NETCONFA_ALL to represent all type of netconf
attributes for IPv4 and IPv6.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 21:54:44 -04:00
Alexander Duyck
c194cf93c1 gro: Defer clearing of flush bit in tunnel paths
This patch updates the GRO handlers for GRE, VXLAN, GENEVE, and FOU so that
we do not clear the flush bit until after we have called the next level GRO
handler.  Previously this was being cleared before parsing through the list
of frames, however this resulted in several paths where either the bit
needed to be reset but wasn't as in the case of FOU, or cases where it was
being set as in GENEVE.  By just deferring the clearing of the bit until
after the next level protocol has been parsed we can avoid any unnecessary
bit twiddling and avoid bugs.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 15:01:00 -04:00
Tom Herbert
473bd239b8 tcp: Add tcp_inq to get available receive bytes on socket
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 16:36:14 -05:00
David S. Miller
4c38cd61ae Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next tree,
they are:

1) Remove useless debug message when deleting IPVS service, from
   Yannick Brosseau.

2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
   several spots of the IPVS code, from Arnd Bergmann.

3) Add prandom_u32 support to nft_meta, from Florian Westphal.

4) Remove unused variable in xt_osf, from Sudip Mukherjee.

5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
   since fixing af_packet defragmentation issues, from Joe Stringer.

6) On-demand hook registration for iptables from netns. Instead of
   registering the hooks for every available netns whenever we need
   one of the support tables, we register this on the specific netns
   that needs it, patchset from Florian Westphal.

7) Add missing port range selection to nf_tables masquerading support.

BTW, just for the record, there is a typo in the description of
5f6c253ebe ("netfilter: bridge: register hooks only when bridge
interface is added") that refers to the cluster match as deprecated, but
it is actually the CLUSTERIP target (which registers hooks
inconditionally) the one that is scheduled for removal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 14:25:20 -05:00
Daniel Borkmann
db3c6139e6 bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit
The assumptions from commit 0c1d70af92 ("net: use dst_cache for vxlan
device"), 468dfffcd7 ("geneve: add dst caching support") and 3c1cb4d260
("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
when ip_tunnel_info is used is unfortunately not always valid as assumed.

While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
with different remote dsts, tos, etc, therefore they cannot be assumed as
packet independent.

Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
would reuse the same entry that was first created on the initial route
lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
a different one.

Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
in vxlan needs to be handeled differently in this context as it is currently
inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
helper is added for the three tunnel cases, which checks if we can use dst
cache.

Fixes: 0c1d70af92 ("net: use dst_cache for vxlan device")
Fixes: 468dfffcd7 ("geneve: add dst caching support")
Fixes: 3c1cb4d260 ("net/ipv4: add dst cache support for gre lwtunnels")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:58:47 -05:00
David S. Miller
810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Eric Dumazet
a9d99ce28e tcp: fix tcpi_segs_in after connection establishment
If final packet (ACK) of 3WHS is lost, it appears we do not properly
account the following incoming segment into tcpi_segs_in

While we are at it, starts segs_in with one, to count the SYN packet.

We do not yet count number of SYN we received for a request sock, we
might add this someday.

packetdrill script showing proper behavior after fix :

// Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.020 < P. 1:1001(1000) ack 1 win 32792

   +0 accept(3, ..., ...) = 4

+.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%

Fixes: 2efd055c53 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 15:47:13 -05:00
Zhang Shengju
8dfd329fbc arp: correct return value of arp_rcv
Currently, arp_rcv() always return zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.

v1->v2:
If sanity check is failed, call kfree_skb() instead of consume_skb(), then
return the correct return value.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 14:55:04 -05:00
Benjamin Poirier
1837b2e2bc mld, igmp: Fix reserved tailroom calculation
The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
	reserved_tailroom = skb_tailroom - mtu
else:
	reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 15:41:07 -05:00
Eric Engestrom
a9d562358b net/ipv4: remove left over dead code
8cc785f6f4 ("net: ipv4: make the ping
/proc code AF-independent") removed the code using it, but renamed this
variable instead of removing it.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 15:00:55 -05:00
Pablo Neira Ayuso
8a6bf5da1a netfilter: nft_masq: support port range
Complete masquerading support by allowing port range selection.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:27 +01:00
Florian Westphal
b9e69e1273 netfilter: xtables: don't hook tables by default
delay hook registration until the table is being requested inside a
namespace.

Historically, a particular table (iptables mangle, ip6tables filter, etc)
was registered on module load.

When netns support was added to iptables only the ip/ip6tables ruleset was
made namespace aware, not the actual hook points.

This means f.e. that when ipt_filter table/module is loaded on a system,
then each namespace on that system has an (empty) iptables filter ruleset.

In other words, if a namespace sends a packet, such skb is 'caught' by
netfilter machinery and fed to hooking points for that table (i.e. INPUT,
FORWARD, etc).

Thanks to Eric Biederman, hooks are no longer global, but per namespace.

This means that we can avoid allocation of empty ruleset in a namespace and
defer hook registration until we need the functionality.

We register a tables hook entry points ONLY in the initial namespace.
When an iptables get/setockopt is issued inside a given namespace, we check
if the table is found in the per-namespace list.

If not, we attempt to find it in the initial namespace, and, if found,
create an empty default table in the requesting namespace and register the
needed hooks.

Hook points are destroyed only once namespace is deleted, there is no
'usage count' (it makes no sense since there is no 'remove table' operation
in xtables api).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:24 +01:00
Florian Westphal
a67dd266ad netfilter: xtables: prepare for on-demand hook register
This change prepares for upcoming on-demand xtables hook registration.

We change the protoypes of the register/unregister functions.
A followup patch will then add nf_hook_register/unregister calls
to the iptables one.

Once a hook is registered packets will be picked up, so all assignments
of the form

net->ipv4.iptable_$table = new_table

have to be moved to ip(6)t_register_table, else we can see NULL
net->ipv4.iptable_$table later.

This patch doesn't change functionality; without this the actual change
simply gets too big.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:23 +01:00
Joe Stringer
5f547391f5 netfilter: nf_defrag_ipv4: Drop redundant ip_send_check()
Since commit 0848f6428b ("inet: frags: fix defragmented packet's IP
header for af_packet"), ip_send_check() would be called twice for
defragmentation that occurs from netfilter ipv4 defrag hooks. Remove the
extra call.

Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02 20:05:22 +01:00
WANG Cong
64d4e3431e net: remove skb_sender_cpu_clear()
After commit 52bd2d62ce ("net: better skb->sender_cpu and skb->napi_id cohabitation")
skb_sender_cpu_clear() becomes empty and can be removed.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 17:36:47 -05:00
Deepa Dinamani
b1b270d863 net: ipv4: tcp_probe: Replace timespec with timespec64
TCP probe log timestamps use struct timespec which is
not y2038 safe. Even though timespec might be good enough here
as it is used to represent delta time, the plan is to get rid
of all uses of timespec in the kernel.
Replace with struct timespec64 which is y2038 safe.

Prints still use unsigned long format and type.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 17:18:44 -05:00
Deepa Dinamani
822c868532 net: ipv4: Convert IP network timestamps to be y2038 safe
ICMP timestamp messages and IP source route options require
timestamps to be in milliseconds modulo 24 hours from
midnight UT format.

Add inet_current_timestamp() function to support this. The function
returns the required timestamp in network byte order.

Timestamp calculation is also changed to call ktime_get_real_ts64()
which uses struct timespec64. struct timespec64 is y2038 safe.
Previously it called getnstimeofday() which uses struct timespec.
struct timespec is not y2038 safe.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 17:18:44 -05:00
Alexander Duyck
2246387662 GSO: Provide software checksum of tunneled UDP fragmentation offload
On reviewing the code I realized that GRE and UDP tunnels could cause a
kernel panic if we used GSO to segment a large UDP frame that was sent
through the tunnel with an outer checksum and hardware offloads were not
available.

In order to correct this we need to update the feature flags that are
passed to the skb_segment function so that in the event of UDP
fragmentation being requested for the inner header the segmentation
function will correctly generate the checksum for the payload if we cannot
segment the outer header.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26 14:23:35 -05:00
David Lamparter
17b693cdd8 net: l3mdev: prefer VRF master for source address selection
When selecting an address in context of a VRF, the vrf master should be
preferred for address selection.  If it isn't, the user has a hard time
getting the system to select to their preference - the code will pick
the address off the first in-VRF interface it can find, which on a
router could well be a non-routable address.

Signed-off-by: David Lamparter <equinox@diac24.net>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
[dsa: Fixed comment style and removed extra blank link ]
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26 14:22:26 -05:00
David Ahern
3f2fb9a834 net: l3mdev: address selection should only consider devices in L3 domain
David Lamparter noted a use case where the source address selection fails
to pick an address from a VRF interface - unnumbered interfaces.

Relevant commands from his script:
    ip addr add 9.9.9.9/32 dev lo
    ip link set lo up

    ip link add name vrf0 type vrf table 101
    ip rule add oif vrf0 table 101
    ip rule add iif vrf0 table 101
    ip link set vrf0 up
    ip addr add 10.0.0.3/32 dev vrf0

    ip link add name dummy2 type dummy
    ip link set dummy2 master vrf0 up

    --> note dummy2 has no address - unnumbered device

    ip route add 10.2.2.2/32 dev dummy2 table 101
    ip neigh add 10.2.2.2 dev dummy2 lladdr 02:00:00:00:00:02

    tcpdump -ni dummy2 &

And using ping instead of his socat example:
    $ ping -I vrf0 -c1 10.2.2.2
    ping: Warning: source address might be selected on device other than vrf0.
    PING 10.2.2.2 (10.2.2.2) from 9.9.9.9 vrf0: 56(84) bytes of data.

>From tcpdump:
    12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, length 64

Note the source address is from lo and is not a VRF local address. With
this patch:

    $ ping -I vrf0 -c1 10.2.2.2
    PING 10.2.2.2 (10.2.2.2) from 10.0.0.3 vrf0: 56(84) bytes of data.

>From tcpdump:
    12:59:25.096426 IP 10.0.0.3 > 10.2.2.2: ICMP echo request, id 2113, seq 1, length 64

Now the source address comes from vrf0.

The ipv4 function for selecting source address takes a const argument.
Removing the const requires touching a lot of places, so instead
l3mdev_master_ifindex_rcu is changed to take a const argument and then
do the typecast to non-const as required by netdev_master_upper_dev_get_rcu.
This is similar to what l3mdev_fib_table_rcu does.

IPv6 for unnumbered interfaces appears to be selecting the addresses
properly.

Cc: David Lamparter <david@opensourcerouting.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26 14:22:26 -05:00
Hannes Frederic Sowa
a8c4a2522a ipv4: only create late gso-skb if skb is already set up with CHECKSUM_PARTIAL
Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
skbs down. This can easily happen with UDP+IPv4 sockets with the first
MSG_MORE write smaller than the MTU, second write is a sendfile.

Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
were we calculate the checksum manually during copying.

Commit d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
sockets") started to exposes this bug.

Fixes: d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
Reported-by: Jiri Benc <jbenc@redhat.com>
Cc: Jiri Benc <jbenc@redhat.com>
Reported-by: Wakko Warner <wakko@animx.eu.org>
Cc: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-24 14:11:34 -05:00
Craig Gallek
e5fbfc1c2d soreuseport: fix merge conflict in tcp bind
One of the validation checks for the new array-based TCP SO_REUSEPORT
validation was unintentionally dropped in ea8add2b19.  This adds it back.

Lack of this check allows the user to allocate multiple sock_reuseport
structures (leaking all but the first).

Fixes: ea8add2b19 ("tcp/dccp: better use of ephemeral ports in bind()")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-24 13:38:18 -05:00
Bernie Harris
5146d1f151 tunnel: Clear IPCB(skb)->opt before dst_link_failure called
IPCB may contain data from previous layers (in the observed case the
qdisc layer). In the observed scenario, the data was misinterpreted as
ip header options, which later caused the ihl to be set to an invalid
value (<5). This resulted in an infinite loop in the mips implementation
of ip_fast_csum.

This patch clears IPCB(skb)->opt before dst_link_failure can be called for
various types of tunnels. This change only applies to encapsulated ipv4
packets.

The code introduced in 11c21a30 which clears all of IPCB has been removed
to be consistent with these changes, and instead the opt field is cleared
unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to
SIT, GRE, and IPIP tunnels.

The relevant vti, l2tp, and pptp functions already contain similar code for
clearing the IPCB.

Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 19:11:56 -05:00
Konstantin Khlebnikov
9bdfb3b79e tcp: convert cached rtt from usec to jiffies when feeding initial rto
Currently it's converted into msecs, thus HZ=1000 intact.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 18:28:46 -05:00
David S. Miller
b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Anton Protopopov
a97eb33ff2 rtnl: RTM_GETNETCONF: fix wrong return value
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 15:33:46 -05:00
Jiri Benc
d13b161c2c gre: clear IFF_TX_SKB_SHARING
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
as it modifies the skb on xmit.

Also, clean up whitespace in ipgre_tap_setup when we're already touching it.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:43:48 -05:00
Jiri Benc
7f290c9435 iptunnel: scrub packet in iptunnel_pull_header
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:54 -05:00
Eric Dumazet
7716682cc5 tcp/dccp: fix another race at listener dismantle
Ilya reported following lockdep splat:

kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0

To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.

We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.

Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:35:51 -05:00
Xin Long
deed49df73 route: check and remove route cache when we get route
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.

Fix this issue by checking  and removing route cache when we get route.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:31:36 -05:00
Insu Yun
1eea84b74c tcp: correctly crypto_alloc_hash return check
crypto_alloc_hash never returns NULL

Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 22:23:04 -05:00
Ben Hutchings
7bbf3cae65 ipv4: Remove inet_lro library
There are no longer any in-tree drivers that use it.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 16:15:46 -05:00
Nikolay Borisov
52a773d645 net: Export ip fragment sysctl to unprivileged users
Now that all the ip fragmentation related sysctls are namespaceified
there is no reason to hide them anymore from "root" users inside
containers.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov
0fbf4cb27e ipv4: namespacify ip fragment max dist sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov
e21145a987 ipv4: namespacify ip_early_demux sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov
287b7f38fd ipv4: Namespacify ip_dynaddr sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov
dcd87999d4 igmp: net: Move igmp namespace init to correct file
When igmp related sysctl were namespacified their initializatin was
erroneously put into the tcp socket namespace constructor. This
patch moves the relevant code into the igmp namespace constructor to
keep things consistent.

Also sprinkle some #ifdefs to silence warnings

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Nikolay Borisov
fa50d974d1 ipv4: Namespaceify ip_default_ttl sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Eric Dumazet
cd9b266095 tcp: add tcpi_min_rtt and tcpi_notsent_bytes to tcp_info
tcpi_min_rtt reports the minimal rtt observed by TCP stack for the flow,
in usec unit. Might be ~0U if not yet known.

tcpi_notsent_bytes reports the amount of bytes in the write queue that
were not yet sent.

This is done in a single patch to not add a temporary 32bit padding hole
in tcp_info.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:27:35 -05:00
Eric Dumazet
729235554d tcp: md5: release request socket instead of listener
If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.

The bug was added for IPv4 only.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:24:06 -05:00
Paolo Abeni
3c1cb4d260 net/ipv4: add dst cache support for gre lwtunnels
In case of UDP traffic with datagram length below MTU this
gives about 4% performance increase

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:49 -05:00
Paolo Abeni
e09acddf87 ip_tunnel: replace dst_cache with generic implementation
The current ip_tunnel cache implementation is prone to a race
that will cause the wrong dst to be cached on cuncurrent dst cache
miss and ip tunnel update via netlink.

Replacing with the generic implementation fix the issue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:48 -05:00
Eric Dumazet
372022830b tcp: do not set rtt_min to 1
There are some cases where rtt_us derives from deltas of jiffies,
instead of using usec timestamps.

Since we want to track minimal rtt, better to assume a delta of 0 jiffie
might be in fact be very close to 1 jiffie.

It is kind of sad jiffies_to_usecs(1) calls a function instead of simply
using a constant.

Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 16:07:13 -05:00
Eric Dumazet
919483096b ipv4: fix memory leaks in ip_cmsg_send() callers
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.

Callers are responsible for the freeing.

Many thanks to Dmitry for the report and diagnostic.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-13 05:57:39 -05:00
Edward Cree
6fa79666e2 net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloads
All users now pass false, so we can remove it, and remove the code that
 was conditional upon it.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:16 -05:00
Edward Cree
53936107ba net: gre: Implement LCO for GRE over IPv4
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:16 -05:00
Edward Cree
06f622926d fou: enable LCO in FOU and GUE
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:16 -05:00
Edward Cree
d75f1306d9 net: udp: always set up for CHECKSUM_PARTIAL offload
If the dst device doesn't support it, it'll get fixed up later anyway
 by validate_xmit_skb().  Also, this allows us to take advantage of LCO
 to avoid summing the payload multiple times.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:15 -05:00
Edward Cree
179bc67f69 net: local checksum offload for encapsulation
The arithmetic properties of the ones-complement checksum mean that a
 correctly checksummed inner packet, including its checksum, has a ones
 complement sum depending only on whatever value was used to initialise
 the checksum field before checksumming (in the case of TCP and UDP,
 this is the ones complement sum of the pseudo header, complemented).
Consequently, if we are going to offload the inner checksum with
 CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
 packed data not covered by the inner checksum, and the initial value of
 the inner checksum field.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:15 -05:00
Eric Dumazet
ea8add2b19 tcp/dccp: better use of ephemeral ports in bind()
Implement strategy used in __inet_hash_connect() in opposite way :

Try to find a candidate using odd ports, then fallback to even ports.

We no longer disable BH for whole traversal, but one bucket at a time.
We also use cond_resched() to yield cpu to other tasks if needed.

I removed one indentation level and tried to mirror the loop we have
in __inet_hash_connect() and variable names to ease code maintenance.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:28:32 -05:00
Eric Dumazet
1580ab63fc tcp/dccp: better use of ephemeral ports in connect()
In commit 07f4c90062 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()"), I added a very simple heuristic, so that we got better
chances to use even ports, and allow bind() users to have more available
slots.

It gave nice results, but with more than 200,000 TCP sessions on a typical
server, the ~30,000 ephemeral ports are still a rare resource.

I chose to go a step further, by looking at all even ports, and if none
was available, fallback to odd ports.

The companion patch does the same in bind(), but in opposite way.

I've seen exec times of up to 30ms on busy servers, so I no longer
disable BH for the whole traversal, but only for each hash bucket.
I also call cond_resched() to be gentle to other tasks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:28:32 -05:00
Nikolay Borisov
165094afce igmp: Namespacify igmp_qrv sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 09:59:22 -05:00
Nikolay Borisov
87a8a2ae65 igmp: Namespaceify igmp_llm_reports sysctl knob
This was initially introduced in df2cf4a78e ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array, however it was never implemented to be
namespace aware. Fix this by changing the code accordingly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 09:59:22 -05:00
Nikolay Borisov
166b6b2d6f igmp: Namespaceify igmp_max_msf sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 09:59:22 -05:00
Nikolay Borisov
815c527007 igmp: Namespaceify igmp_max_memberships sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 09:59:22 -05:00
Alexander Duyck
dbef491ebe udp: Use uh->len instead of skb->len to compute checksum in segmentation
The segmentation code was having to do a bunch of work to pull the
skb->len and strip the udp header offset before the value could be used to
adjust the checksum.  Instead of doing all this work we can just use the
value that goes into uh->len since that is the correct value with the
correct byte order that we need anyway.  By using this value we can save
ourselves a bunch of pain as there is no need to do multiple byte swaps.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:34 -05:00
Alexander Duyck
fdaefd62fd udp: Clean up the use of flags in UDP segmentation offload
This patch goes though and cleans up the logic related to several of the
control flags used in UDP segmentation.  Specifically the use of dont_encap
isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and
if it isn't set then we don't need to update the internal headers.  As such
we can just drop that value.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:34 -05:00
Alexander Duyck
3872035241 gre: Use inner_proto to obtain inner header protocol
Instead of parsing headers to determine the inner protocol we can just pull
the value from inner_proto.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:34 -05:00
Alexander Duyck
2e598af713 gre: Use GSO flags to determine csum need instead of GRE flags
This patch updates the gre checksum path to follow something much closer to
the UDP checksum path.  By doing this we can avoid needing to do as much
header inspection and can just make use of the fields we were already
reading in the sk_buff structure.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:34 -05:00
Alexander Duyck
ddff00d420 net: Move skb_has_shared_frag check out of GRE code and into segmentation
The call skb_has_shared_frag is used in the GRE path and skb_checksum_help
to verify that no frags can be modified by an external entity.  This check
really doesn't belong in the GRE path but in the skb_segment function
itself.  This way any protocol that might be segmented will be performing
this check before attempting to offload a checksum to software.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:34 -05:00
Alexander Duyck
08b64fcca9 net: Store checksum result for offloaded GSO checksums
This patch makes it so that we can offload the checksums for a packet up
to a certain point and then begin computing the checksums via software.
Setting this up is fairly straight forward as all we need to do is reset
the values stored in csum and csum_start for the GSO context block.

One complication for this is remote checksum offload.  In order to allow
the inner checksums to be offloaded while computing the outer checksum
manually we needed to have some way of indicating that the offload wasn't
real.  In order to do that I replaced CHECKSUM_PARTIAL with
CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer
header while skipping computing checksums for the inner headers.  We clean
up the ip_summed flag and set it to either CHECKSUM_PARTIAL or
CHECKSUM_NONE once we hand the packet off to the next lower level.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:33 -05:00
Alexander Duyck
7fbeffed77 net: Update remote checksum segmentation to support use of GSO checksum
This patch addresses two main issues.

First in the case of remote checksum offload we were avoiding dealing with
scatter-gather issues.  As a result it would be possible to assemble a
series of frames that used frags instead of being linearized as they should
have if remote checksum offload was enabled.

Second I have updated the code so that we now let GSO take care of doing
the checksum on the data itself and drop the special case that was added
for remote checksum offload.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:33 -05:00
Alexander Duyck
bef3c6c937 net: Drop unecessary enc_features variable from tunnel segmentation functions
The enc_features variable isn't necessary since features isn't used
anywhere after we create enc_features so instead just use a destructive AND
on features itself and save ourselves the variable declaration.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 08:55:33 -05:00
Johannes Berg
97daf33145 ipv4: add option to drop gratuitous ARP packets
In certain 802.11 wireless deployments, there will be ARP proxies
that use knowledge of the network to correctly answer requests.
To prevent gratuitous ARP frames on the shared medium from being
a problem, on such deployments wireless needs to drop them.

Enable this by providing an option called "drop_gratuitous_arp".

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 04:27:35 -05:00
Johannes Berg
12b74dfadb ipv4: add option to drop unicast encapsulated in L2 multicast
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv4 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.

Additionally, enabling this option provides compliance with a SHOULD
clause of RFC 1122.

Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 04:27:35 -05:00
Craig Gallek
c125e80b88 soreuseport: fast reuseport TCP socket selection
This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access.  This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet.  Previously, every socket in the group
needed to be considered before handing off the incoming packet.

This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 03:54:15 -05:00
Craig Gallek
a583636a83 inet: refactor inet[6]_lookup functions to take skb
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups.  Doing so with a BPF filter will require access to the
skb in question.  This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 03:54:14 -05:00
Craig Gallek
086c653f58 sock: struct proto hash function may error
In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 03:54:14 -05:00
David Wragg
7e059158d5 vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets.  4.3 introduced netdevs corresponding
to tunnel vports.  These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated.  The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.

Instead, set the MTU on openvswitch-created netdevs to be the relevant
maximum (i.e. the maximum IP packet size minus any relevant overhead),
effectively restoring the behaviour prior to 4.3.

Signed-off-by: David Wragg <david@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-10 05:50:03 -05:00
Hans Westgaard Ry
5f74f82ea3 net:Add sysctl_max_skb_frags
Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09 04:28:06 -05:00
Eric Dumazet
9cf7490360 tcp: do not drop syn_recv on all icmp reports
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.

This is of course incorrect.

A specific list of ICMP messages should be able to drop a SYN_RECV.

For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09 04:15:37 -05:00
David S. Miller
0aca737d46 tcp: Fix syncookies sysctl default.
Unintentionally the default was changed to zero, fix
that.

Fixes: 12ed8244ed ("ipv4: Namespaceify tcp syncookies sysctl knob")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08 04:24:33 -05:00
Nikolay Borisov
4979f2d9f7 ipv4: Namespaceify tcp_notsent_lowat sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:36:11 -05:00
Nikolay Borisov
1e579caa18 ipv4: Namespaceify tcp_fin_timeout sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:36:11 -05:00
Nikolay Borisov
c402d9beff ipv4: Namespaceify tcp_orphan_retries sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:11 -05:00
Nikolay Borisov
c6214a97c8 ipv4: Namespaceify tcp_retries2 sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:11 -05:00
Nikolay Borisov
ae5c3f406c ipv4: Namespaceify tcp_retries1 sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Nikolay Borisov
1043e25ff9 ipv4: Namespaceify tcp reordering sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Nikolay Borisov
12ed8244ed ipv4: Namespaceify tcp syncookies sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Nikolay Borisov
7c083ecb3b ipv4: Namespaceify tcp synack retries sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Nikolay Borisov
6fa2516630 ipv4: Namespaceify tcp syn retries sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Yuchung Cheng
d452e6caf8 tcp: tcp_cong_control helper
Refactor and consolidate cwnd and rate updates into a new function
tcp_cong_control().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:09:51 -05:00
Yuchung Cheng
2d14a4def4 tcp: make congestion control more robust against reordering
This change enables congestion control to update cwnd based on
not only packet cumulatively acked but also packets delivered
out-of-order. This makes congestion control robust against packet
reordering because it may raise cwnd as long as packets are being
delivered once reordering has been detected (i.e., it only cares
the amount of packets delivered, not the ordering among them).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:09:51 -05:00
Yuchung Cheng
3ebd887105 tcp: refactor pkts acked accounting
A small refactoring that gets number of packets cumulatively acked
from tcp_clean_rtx_queue() directly.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:09:51 -05:00
Yuchung Cheng
ddf1af6fa0 tcp: new delivery accounting
This patch changes the accounting of how many packets are
newly acked or sacked when the sender receives an ACK.

The current approach basically computes

   newly_acked_sacked = (prior_packets - prior_sacked) -
                        (tp->packets_out - tp->sacked_out)

   where prior_packets and prior_sacked out are snapshot
   at the beginning of the ACK processing.

The new approach tracks the delivery information via a new
TCP state variable "delivered" which monotically increases
as new packets are delivered in order or out-of-order.

The reason for this change is that the current approach is
brittle that produces negative or inaccurate estimate.

   1) For non-SACK connections, an ACK that advances the SND.UNA
   could reset the DUPACK counters (tp->sacked_out) in
   tcp_process_loss() or tcp_fastretrans_alert(). This inflates
   the inflight suddenly and causes under-estimate or even
   negative estimate. Here is a real example:

                   before   after (processing ACK)
   packets_out     75       73
   sacked_out      23        0
   ca state        Loss     Open

   The old approach computes (75-23) - (73 - 0) = -21 delivered
   while the new approach computes 1 delivered since it
   considers the 2nd-24th packets are delivered OOO.

   2) MSS change would re-count packets_out and sacked_out so
   the estimate is in-accurate and can even become negative.
   E.g., the inflight is doubled when MSS is halved.

   3) Spurious retransmission signaled by DSACK is not accounted

The new approach is simpler and more robust. For SACK connections,
tp->delivered increments as packets are being acked or sacked in
SACK and ACK processing.

For non-sack connections, it's done in tcp_remove_reno_sacks() and
tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered
is incremented by the number of packets ACKed (less the current
number of DUPACKs received plus one packet hole).  Upon receiving
a DUPACK, tp->delivered is incremented assuming one out-of-order
packet is delivered.

Upon receiving a DSACK, tp->delivered is incremtened assuming one
retransmission is delivered in tcp_sacktag_write_queue().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:09:51 -05:00
Yuchung Cheng
31ba0c1072 tcp: move cwnd reduction after recovery state procesing
Currently the cwnd is reduced and increased in various different
places. The reduction happens in various places in the recovery
state processing (tcp_fastretrans_alert) while the increase
happens afterward.

A better sequence is to identify lost packets and update
the congestion control state (icsk_ca_state) first. Then base
on the new state, up/down the cwnd in one central place. It's
more clear to reason cwnd changes.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:09:50 -05:00
Yuchung Cheng
e662ca40de tcp: retransmit after recovery processing and congestion control
The retransmission and F-RTO transmission currently happen inside
recovery state processing (tcp_fastretrans_alert) but before
congestion control.  This refactoring moves the logic after both
s.t. we can determine how much to send (cwnd) before deciding what to
send.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:09:50 -05:00
Eric Dumazet
e3e17b773b tcp: fastopen: call tcp_fin() if FIN present in SYNACK
When we acknowledge a FIN, it is not enough to ack the sequence number
and queue the skb into receive queue. We also have to call tcp_fin()
to properly update socket state and send proper poll() notifications.

It seems we also had the problem if we received a SYN packet with the
FIN flag set, but it does not seem an urgent issue, as no known
implementation can do that.

Fixes: 61d2bcae99 ("tcp: fastopen: accept data/FIN present in SYNACK message")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 16:49:58 -05:00
Eric Dumazet
9d691539ee tcp: do not enqueue skb with SYN flag
If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
places in socket receive queue, then we can remove the test that
tcp_recvmsg() has to perform in fast path.

All we have to do is to adjust SEQ in the slow path.

For the moment, we place an unlikely() and output a message
if we find an skb having SYN flag set.
Goal would be to get rid of the test completely.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:11:59 -05:00
Eric Dumazet
61d2bcae99 tcp: fastopen: accept data/FIN present in SYNACK message
RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message
MAY include data and/or FIN

This patch adds support for the client side :

If we receive a SYNACK with payload or FIN, queue the skb instead
of ignoring it.

Since we already support the same for SYN, we refactor the existing
code and reuse it. Note we need to clone the skb, so this operation
might fail under memory pressure.

Sara Dickinson pointed out FreeBSD server Fast Open implementation
was planned to generate such SYNACK in the future.

The server side might be implemented on linux later.

Reported-by: Sara Dickinson <sara@sinodun.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:11:59 -05:00
Linus Torvalds
34229b2774 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "This looks like a lot but it's a mixture of regression fixes as well
  as fixes for longer standing issues.

   1) Fix on-channel cancellation in mac80211, from Johannes Berg.

   2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables
      module, from Eric Dumazet.

   3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric
      Dumazet.

   4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is
      bound, from Craig Gallek.

   5) GRO key comparisons don't take lightweight tunnels into account,
      from Jesse Gross.

   6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric
      Dumazet.

   7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we
      register them, otherwise the NEWLINK netlink message is missing
      the proper attributes.  From Thadeu Lima de Souza Cascardo.

   8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido
      Schimmel

   9) Handle fragments properly in ipv4 easly socket demux, from Eric
      Dumazet.

  10) Don't ignore the ifindex key specifier on ipv6 output route
      lookups, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
  tcp: avoid cwnd undo after receiving ECN
  irda: fix a potential use-after-free in ircomm_param_request
  net: tg3: avoid uninitialized variable warning
  net: nb8800: avoid uninitialized variable warning
  net: vxge: avoid unused function warnings
  net: bgmac: clarify CONFIG_BCMA dependency
  net: hp100: remove unnecessary #ifdefs
  net: davinci_cpdma: use dma_addr_t for DMA address
  ipv6/udp: use sticky pktinfo egress ifindex on connect()
  ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
  netlink: not trim skb for mmaped socket when dump
  vxlan: fix a out of bounds access in __vxlan_find_mac
  net: dsa: mv88e6xxx: fix port VLAN maps
  fib_trie: Fix shift by 32 in fib_table_lookup
  net: moxart: use correct accessors for DMA memory
  ipv4: ipconfig: avoid unused ic_proto_used symbol
  bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
  bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
  bnxt_en: Ring free response from close path should use completion ring
  net_sched: drr: check for NULL pointer in drr_dequeue
  ...
2016-02-01 15:56:08 -08:00
Yuchung Cheng
99b4dd9f24 tcp: avoid cwnd undo after receiving ECN
RFC 4015 section 3.4 says the TCP sender MUST refrain from
reversing the congestion control state when the ACK signals
congestion through the ECN-Echo flag. Currently we may not
always do that when prior_ssthresh is reset upon receiving
ACKs with ECE marks. This patch fixes that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 23:03:56 -08:00
Alexander Duyck
a5829f536b fib_trie: Fix shift by 32 in fib_table_lookup
The fib_table_lookup function had a shift by 32 that triggered a UBSAN
warning.  This was due to the fact that I had placed the shift first and
then followed it with the check for the suffix length to ignore the
undefined behavior.  If we reorder this so that we verify the suffix is
less than 32 before shifting the value we can avoid the issue.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 19:41:00 -08:00
Arnd Bergmann
52b79e2bdf ipv4: ipconfig: avoid unused ic_proto_used symbol
When CONFIG_PROC_FS, CONFIG_IP_PNP_BOOTP, CONFIG_IP_PNP_DHCP and
CONFIG_IP_PNP_RARP are all disabled, we get a warning about the
ic_proto_used variable being unused:

net/ipv4/ipconfig.c:146:12: error: 'ic_proto_used' defined but not used [-Werror=unused-variable]

This avoids the warning, by making the definition conditional on
whether a dynamic IP configuration protocol is configured. If not,
we know that the value is always zero, so we can optimize away the
variable and all code that depends on it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 19:39:09 -08:00
Eric Dumazet
63e51b6a24 ipv4: early demux should be aware of fragments
We should not assume a valid protocol header is present,
as this is not the case for IPv4 fragments.

Lets avoid extra cache line misses and potential bugs
if we actually find a socket and incorrectly uses its dst.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 15:14:20 -05:00
Eric Dumazet
ff5d749772 tcp: beware of alignments in tcp_get_info()
With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.

It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.

Current iproute2 package does not trigger this particular issue.

Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-28 22:49:30 -08:00
Neal Cardwell
d88270eef4 tcp: fix tcp_mark_head_lost to check skb len before fragmenting
This commit fixes a corner case in tcp_mark_head_lost() which was
causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.

tcp_mark_head_lost() was assuming that if a packet has
tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of
M*mss bytes, for any M < N. But with the tricky way TCP pcounts are
maintained, this is not always true.

For example, suppose the sender sends 4 1-byte packets and have the
last 3 packet sacked. It will merge the last 3 packets in the write
queue into an skb with pcount = 3 and len = 3 bytes. If another
recovery happens after a sack reneging event, tcp_mark_head_lost()
may attempt to split the skb assuming it has more than 2*MSS bytes.

This sounds very counterintuitive, but as the commit description for
the related commit c0638c247f ("tcp: don't fragment SACKed skbs in
tcp_mark_head_lost()") notes, this is because tcp_shifted_skb()
coalesces adjacent regions of SACKed skbs, and when doing this it
preserves the sum of their packet counts in order to reflect the
real-world dynamics on the wire. The c0638c247f commit tried to
avoid problems by not fragmenting SACKed skbs, since SACKed skbs are
where the non-proportionality between pcount and skb->len/mss is known
to be possible. However, that commit did not handle the case where
during a reneging event one of these weird SACKed skbs becomes an
un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.

The fix is to simply mark the entire skb lost when this happens.
This makes the recovery slightly more aggressive in such corner
cases before we detect reordering. But once we detect reordering
this code path is by-passed because FACK is disabled.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-28 16:02:48 -08:00
Joe Stringer
8282f27449 inet: frag: Always orphan skbs inside ip_defrag()
Later parts of the stack (including fragmentation) expect that there is
never a socket attached to frag in a frag_list, however this invariant
was not enforced on all defrag paths. This could lead to the
BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
end of this commit message.

While the call could be added to openvswitch to fix this particular
error, the head and tail of the frags list are already orphaned
indirectly inside ip_defrag(), so it seems like the remaining fragments
should all be orphaned in all circumstances.

kernel BUG at net/ipv4/ip_output.c:586!
[...]
Call Trace:
 <IRQ>
 [<ffffffffa0205270>] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
 [<ffffffffa02167a7>] ovs_fragment+0xcc/0x214 [openvswitch]
 [<ffffffff81667830>] ? dst_discard_out+0x20/0x20
 [<ffffffff81667810>] ? dst_ifdown+0x80/0x80
 [<ffffffffa0212072>] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
 [<ffffffff810e0ba5>] ? mod_timer_pending+0x65/0x210
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffffa03205a2>] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffffa02051a3>] do_output.isra.29+0xe3/0x1b0 [openvswitch]
 [<ffffffffa0206411>] do_execute_actions+0xe11/0x11f0 [openvswitch]
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffffa0206822>] ovs_execute_actions+0x32/0xd0 [openvswitch]
 [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffffa02068a2>] ovs_execute_actions+0xb2/0xd0 [openvswitch]
 [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
 [<ffffffffa0215019>] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
 [<ffffffffa0213a1d>] ovs_vport_receive+0x5d/0xa0 [openvswitch]
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
 [<ffffffffa02148fc>] internal_dev_xmit+0x6c/0x140 [openvswitch]
 [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
 [<ffffffff81660299>] dev_hard_start_xmit+0x2b9/0x5e0
 [<ffffffff8165fc21>] ? netif_skb_features+0xd1/0x1f0
 [<ffffffff81660f20>] __dev_queue_xmit+0x800/0x930
 [<ffffffff81660770>] ? __dev_queue_xmit+0x50/0x930
 [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
 [<ffffffff81669876>] ? neigh_resolve_output+0x106/0x220
 [<ffffffff81661060>] dev_queue_xmit+0x10/0x20
 [<ffffffff816698e8>] neigh_resolve_output+0x178/0x220
 [<ffffffff816a8e6f>] ? ip_finish_output2+0x1ff/0x590
 [<ffffffff816a8e6f>] ip_finish_output2+0x1ff/0x590
 [<ffffffff816a8cee>] ? ip_finish_output2+0x7e/0x590
 [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
 [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
 [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
 [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
 [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
 [<ffffffff816ab4c0>] ip_output+0x70/0x110
 [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
 [<ffffffff816aa9f9>] ip_local_out+0x39/0x70
 [<ffffffff816abf89>] ip_send_skb+0x19/0x40
 [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
 [<ffffffff816df21a>] icmp_push_reply+0xea/0x120
 [<ffffffff816df93d>] icmp_reply.constprop.23+0x1ed/0x230
 [<ffffffff816df9ce>] icmp_echo.part.21+0x4e/0x50
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffff810d5f9e>] ? rcu_read_lock_held+0x5e/0x70
 [<ffffffff816dfa06>] icmp_echo+0x36/0x70
 [<ffffffff816e0d11>] icmp_rcv+0x271/0x450
 [<ffffffff816a4ca7>] ip_local_deliver_finish+0x127/0x3a0
 [<ffffffff816a4bc1>] ? ip_local_deliver_finish+0x41/0x3a0
 [<ffffffff816a5160>] ip_local_deliver+0x60/0xd0
 [<ffffffff816a4b80>] ? ip_rcv_finish+0x560/0x560
 [<ffffffff816a46fd>] ip_rcv_finish+0xdd/0x560
 [<ffffffff816a5453>] ip_rcv+0x283/0x3e0
 [<ffffffff810b6302>] ? match_held_lock+0x192/0x200
 [<ffffffff816a4620>] ? inet_del_offload+0x40/0x40
 [<ffffffff8165d062>] __netif_receive_skb_core+0x392/0xae0
 [<ffffffff8165e68e>] ? process_backlog+0x8e/0x230
 [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
 [<ffffffff8165d7c8>] __netif_receive_skb+0x18/0x60
 [<ffffffff8165e678>] process_backlog+0x78/0x230
 [<ffffffff8165e6dd>] ? process_backlog+0xdd/0x230
 [<ffffffff8165e355>] net_rx_action+0x155/0x400
 [<ffffffff8106b48c>] __do_softirq+0xcc/0x420
 [<ffffffff816a8e87>] ? ip_finish_output2+0x217/0x590
 [<ffffffff8178e78c>] do_softirq_own_stack+0x1c/0x30
 <EOI>
 [<ffffffff8106b88e>] do_softirq+0x4e/0x60
 [<ffffffff8106b948>] __local_bh_enable_ip+0xa8/0xb0
 [<ffffffff816a8eb0>] ip_finish_output2+0x240/0x590
 [<ffffffff816a9a31>] ? ip_do_fragment+0x831/0x8a0
 [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
 [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
 [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
 [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
 [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
 [<ffffffff816ab4c0>] ip_output+0x70/0x110
 [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
 [<ffffffff816aa9f9>] ip_local_out+0x39/0x70
 [<ffffffff816abf89>] ip_send_skb+0x19/0x40
 [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
 [<ffffffff816d55d3>] raw_sendmsg+0x7d3/0xc30
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffff816e7557>] ? inet_sendmsg+0xc7/0x1d0
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffff816e759a>] inet_sendmsg+0x10a/0x1d0
 [<ffffffff816e7495>] ? inet_sendmsg+0x5/0x1d0
 [<ffffffff8163e398>] sock_sendmsg+0x38/0x50
 [<ffffffff8163ec5f>] ___sys_sendmsg+0x25f/0x270
 [<ffffffff811aadad>] ? handle_mm_fault+0x8dd/0x1320
 [<ffffffff8178c147>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffff810529b2>] ? __do_page_fault+0x1e2/0x460
 [<ffffffff81204886>] ? __fget_light+0x66/0x90
 [<ffffffff8163f8e2>] __sys_sendmsg+0x42/0x80
 [<ffffffff8163f932>] SyS_sendmsg+0x12/0x20
 [<ffffffff8178cb17>] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
RIP  [<ffffffff816a9a92>] ip_do_fragment+0x892/0x8a0
 RSP <ffff88006d603170>

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-28 16:00:46 -08:00
Herbert Xu
cf80e0e47e tcp: Use ahash
This patch replaces uses of the long obsolete hash interface with
ahash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
2016-01-27 20:36:18 +08:00
Thomas Egerer
32b6170ca5 ipv4+ipv6: Make INET*_ESP select CRYPTO_ECHAINIV
The ESP algorithms using CBC mode require echainiv. Hence INET*_ESP have
to select CRYPTO_ECHAINIV in order to work properly. This solves the
issues caused by a misconfiguration as described in [1].
The original approach, patching crypto/Kconfig was turned down by
Herbert Xu [2].

[1] https://lists.strongswan.org/pipermail/users/2015-December/009074.html
[2] http://marc.info/?l=linux-crypto-vger&m=145224655809562&w=2

Signed-off-by: Thomas Egerer <hakke_007@gmx.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-25 10:45:41 -08:00
Tetsuo Handa
1d5cfdb076 tree wide: use kvfree() than conditional kfree()/vfree()
There are many locations that do

  if (memory_was_allocated_by_vmalloc)
    vfree(ptr);
  else
    kfree(ptr);

but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr().  Unless callers have special reasons, we can
replace this branch with kvfree().  Please check and reply if you found
problems.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Boris Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22 17:02:18 -08:00
Eric Dumazet
e62a123b8e tcp: fix NULL deref in tcp_v4_send_ack()
Neal reported crashes with this stack trace :

 RIP: 0010:[<ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f
...
 CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
...
  [<ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4
  [<ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0
  [<ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500
  [<ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330
  [<ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30
  [<ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90
  [<ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0
  [<ffffffff8c17623e>] sock_aio_read+0xee/0x110
  [<ffffffff8c066fcf>] do_sync_read+0x6f/0xa0
  [<ffffffff8c0673a1>] SyS_read+0x1e1/0x290
  [<ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b

The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.

Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.

Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).

IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
pointer from the socket if provided.

Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-21 11:20:14 -08:00
Eric Dumazet
7c1306723e net: diag: support v4mapped sockets in inet_diag_find_one_icsk()
Lorenzo reported that we could not properly find v4mapped sockets
in inet_diag_find_one_icsk(). This patch fixes the issue.

Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-20 18:51:31 -08:00
Vladimir Davydov
d55f90bfab net: drop tcp_memcontrol.c
tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
and mem_cgroup->tcp_mem init/destroy stuff.  This doesn't belong to
network subsys.  Let's move it to memcontrol.c.  This also allows us to
reuse generic code for handling legacy memcg files.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Johannes Weiner
489c2a20a4 mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM
Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
interface. This also makes legacy-only code sections stand out better.

[arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Johannes Weiner
6d378dac7c mm: memcontrol: drop unused @css argument in memcg_init_kmem
This series adds accounting of the historical "kmem" memory consumers to
the cgroup2 memory controller.

These consumers include the dentry cache, the inode cache, kernel stack
pages, and a few others that are pointed out in patch 7/8.  The
footprint of these consumers is directly tied to userspace activity in
common workloads, and so they have to be part of the minimally viable
configuration in order to present a complete feature to our users.

The cgroup2 interface of the memory controller is far from complete, but
this series, along with the socket memory accounting series, provides
the final semantic changes for the existing memory knobs in the cgroup2
interface, which is scheduled for initial release in the next merge
window.

This patch (of 8):

Remove unused css argument frmo memcg_init_kmem()

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Eric Dumazet
ed0dfffd7d udp: fix potential infinite loop in SO_REUSEPORT logic
Using a combination of connected and un-connected sockets, Dmitry
was able to trigger soft lockups with his fuzzer.

The problem is that sockets in the SO_REUSEPORT array might have
different scores.

Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and
bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into
the soreuseport array, with a score which is smaller than the score of
first socket sk1 found in hash table (I am speaking of the regular UDP
hash table), if sk1 had the connect() done, giving a +8 to its score.

hash bucket [X] -> sk1 -> sk2 -> NULL

sk1 score = 14  (because it did a connect())
sk2 score = 6

SO_REUSEPORT fast selection is an optimization. If it turns out the
score of the selected socket does not match score of first socket, just
fallback to old SO_REUSEPORT logic instead of trying to be too smart.

Normal SO_REUSEPORT users do not mix different kind of sockets, as this
mechanism is used for load balance traffic.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Craig Gallek <kraigatgoog@gmail.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-19 13:52:25 -05:00
Linus Torvalds
4e5448a31d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A quick set of bug fixes after there initial networking merge:

  1) Netlink multicast group storage allocator only was tested with
     nr_groups equal to 1, make it work for other values too.  From
     Matti Vaittinen.

  2) Check build_skb() return value in macb and hip04_eth drivers, from
     Weidong Wang.

  3) Don't leak x25_asy on x25_asy_open() failure.

  4) More DMA map/unmap fixes in 3c59x from Neil Horman.

  5) Don't clobber IP skb control block during GSO segmentation, from
     Konstantin Khlebnikov.

  6) ECN helpers for ipv6 don't fixup the checksum, from Eric Dumazet.

  7) Fix SKB segment utilization estimation in xen-netback, from David
     Vrabel.

  8) Fix lockdep splat in bridge addrlist handling, from Nikolay
     Aleksandrov"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  bgmac: Fix reversed test of build_skb() return value.
  bridge: fix lockdep addr_list_lock false positive splat
  net: smsc: Add support h8300
  xen-netback: free queues after freeing the net device
  xen-netback: delete NAPI instance when queue fails to initialize
  xen-netback: use skb to determine number of required guest Rx requests
  net: sctp: Move sequence start handling into sctp_transport_get_idx()
  ipv6: update skb->csum when CE mark is propagated
  net: phy: turn carrier off on phy attach
  net: macb: clear interrupts when disabling them
  sctp: support to lookup with ep+paddr in transport rhashtable
  net: hns: fixes no syscon error when init mdio
  dts: hisi: fixes no syscon fault when init mdio
  net: preserve IP control block during GSO segmentation
  fsl/fman: Delete one function call "put_device" in dtsec_config()
  hip04_eth: fix missing error handle for build_skb failed
  3c59x: fix another page map/single unmap imbalance
  3c59x: balance page maps and unmaps
  x25_asy: Free x25_asy on x25_asy_open() failure.
  mlxsw: fix SWITCHDEV_OBJ_ID_PORT_MDB
  ...
2016-01-15 13:33:12 -08:00
Konstantin Khlebnikov
9207f9d45b net: preserve IP control block during GSO segmentation
Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.

This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-15 14:35:24 -05:00
Johannes Weiner
ef12947c9c mm: memcontrol: switch to the updated jump-label API
According to <linux/jump_label.h> the direct use of struct static_key is
deprecated.  Update the socket and slab accounting code accordingly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reported-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
80e95fe0fd mm: memcontrol: generalize the socket accounting jump label
The unified hierarchy memory controller is going to use this jump label
as well to control the networking callbacks.  Move it to the memory
controller code and give it a more generic name.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
baac50bbc3 net: tcp_memcontrol: simplify linkage between socket and page counter
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future.  Remove the indirection and link
sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
e805605c72 net: tcp_memcontrol: sanitize tcp memory accounting callbacks
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit.  This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds.  After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.

Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable.  However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either.  However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchroneously when packets are processed).  When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately.  Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
80f23124f5 net: tcp_memcontrol: simplify the per-memcg limit access
tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
but it only ever sets these entries to the value of the memory_allocated
page_counter limit.  Use the latter directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
af95d7df40 net: tcp_memcontrol: remove dead per-memcg count of allocated sockets
The number of allocated sockets is used for calculations in the soft
limit phase, where packets are accepted but the socket is under memory
pressure.
 Since there is no soft limit phase in tcp_memcontrol, and memory
pressure is only entered when packets are already dropped, this is
actually dead code.  Remove it.

As this is the last user of parent_cg_proto(), remove that too.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
3d596f7b90 net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label
Move the jump-label from sock_update_memcg() and sock_release_memcg() to
the callsite, and so eliminate those function calls when socket
accounting is not enabled.

This also eliminates the need for dummy functions because the calls will
be optimized away if the Kconfig options are not enabled.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vladimir Davydov
9ee11ba425 memcg: do not allow to disable tcp accounting after limit is set
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
is never cleared while the latter can be cleared by unsetting the limit.
This allows to disable tcp socket accounting for new sockets after it
was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
guaranteeing that memcg_socket_limit_enabled static key will be
decremented on memcg destruction.

This functionality looks dubious, because it is not clear what a use
case would be.  By enabling tcp accounting a user accepts the price.  If
they then find the performance degradation unacceptable, they can always
restart their workload with tcp accounting disabled.  It does not seem
there is any need to flip it while the workload is running.

Besides, it contradicts to how kmem accounting API works: writing
whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
cgroup in question, after which it cannot be disabled.  Therefore one
might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
enables socket accounting w/o limiting it, which might be useful by
itself, but it isn't true.

Since this API peculiarity is not documented anywhere, I propose to drop
it.  This will allow to simplify the code by dropping cg_proto->flags.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00