Commit Graph

727 Commits

Author SHA1 Message Date
Eric Dumazet
03f45c883c tcp: avoid extra wakeups for SO_RCVLOWAT users
SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
generated when enough bytes are available in receive queue, after
David change (commit c7004482e8 "tcp: Respect SO_RCVLOWAT in tcp_poll().")

But TCP still calls sk->sk_data_ready() for each chunk added in receive
queue, meaning thread is awaken, and goes back to sleep shortly after.

Tested:

tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB

-> Should get ~2 wakeups (c-switches) per MB, regardless of how many
(tiny or big) packets were received.

High speed (mostly full size GRO packets)

received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
  cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches

received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
  cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches

Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)

received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
  cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Eric Dumazet
d1361840f8 tcp: fix SO_RCVLOWAT and RCVBUF autotuning
Applications might use SO_RCVLOWAT on TCP socket hoping to receive
one [E]POLLIN event only when a given amount of bytes are ready in socket
receive queue.

Problem is that receive autotuning is not aware of this constraint,
meaning sk_rcvbuf might be too small to allow all bytes to be stored.

Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
can override the default setsockopt(SO_RCVLOWAT) behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Eric Dumazet
dcb8c9b437 tcp_bbr: better deal with suboptimal GSO (II)
This is second part of dealing with suboptimal device gso parameters.
In first patch (350c9f484b "tcp_bbr: better deal with suboptimal GSO")
we dealt with devices having low gso_max_segs

Some devices lower gso_max_size from 64KB to 16 KB (r8152 is an example)

In order to probe an optimal cwnd, we want BBR being not sensitive
to whatever GSO constraint a device can have.

This patch removes tso_segs_goal() CC callback in favor of
min_tso_segs() for CC wanting to override sysctl_tcp_min_tso_segs

Next patch will remove bbr->tso_segs_goal since it does not have
to be persistent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 21:44:28 -05:00
Eric Dumazet
e0f9759f53 tcp: try to keep packet if SYN_RCV race is lost
배석진 reported that in some situations, packets for a given 5-tuple
end up being processed by different CPUS.

This involves RPS, and fragmentation.

배석진 is seeing packet drops when a SYN_RECV request socket is
moved into ESTABLISH state. Other states are protected by socket lock.

This is caused by a CPU losing the race, and simply not caring enough.

Since this seems to occur frequently, we can do better and perform
a second lookup.

Note that all needed memory barriers are already in the existing code,
thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
and reqsk_put(). The second lookup must find the new socket,
unless it has already been accepted and closed by another cpu.

Note that the fragmentation could be avoided in the first place by
use of a correct TCP MSS option in the SYN{ACK} packet, but this
does not mean we can not be more robust.

Many thanks to 배석진 for a very detailed analysis.

Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:21:45 -05:00
John Fastabend
1aa12bdf1b bpf: sockmap, add sock close() hook to remove socks
The selftests test_maps program was leaving dangling BPF sockmap
programs around because not all psock elements were removed from
the map. The elements in turn hold a reference on the BPF program
they are attached to causing BPF programs to stay open even after
test_maps has completed.

The original intent was that sk_state_change() would be called
when TCP socks went through TCP_CLOSE state. However, because
socks may be in SOCK_DEAD state or the sock may be a listening
socket the event is not always triggered.

To resolve this use the ULP infrastructure and register our own
proto close() handler. This fixes the above case.

Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-06 11:39:32 +01:00
John Fastabend
b11a632c44 net: add a UID to use for ULP socket assignment
Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.

For sockmap, which is added in the next patches, a ULP is used to
hook into TCP sockets close state. In this case the ULP being added
is done at map insert time and the ULP is known and done on the kernel
side. In this case the named lookup is not needed. Because we don't
want to expose psock internals to user space socket options a user
visible flag is also added. For TLS this is set for BPF it will be
cleared.

Alos remove pr_notice, user gets an error code back and should check
that rather than rely on logs.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-06 11:39:31 +01:00
Linus Torvalds
b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Linus Torvalds
168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Lawrence Brakmo
de525be2ca bpf: Support passing args to sock_ops bpf function
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reusues the reply union, so the bpf_sock_ops structures are not
increased in size.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00
Lawrence Brakmo
b73042b8a2 bpf: Add write access to tcp_sock and sock fields
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25 16:41:14 -08:00
Yuchung Cheng
e42866031f tcp: avoid min RTT bloat by skipping RTT from delayed-ACK in BBR
A persistent connection may send tiny amount of data (e.g. health-check)
for a long period of time. BBR's windowed min RTT filter may only see
RTT samples from delayed ACKs causing BBR to grossly over-estimate
the path delay depending how much the ACK was delayed at the receiver.

This patch skips RTT samples that are likely coming from delayed ACKs. Note
that it is possible the sender never obtains a valid measure to set the
min RTT. In this case BBR will continue to set cwnd to initial window
which seems fine because the connection is thin stream.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 15:39:30 -05:00
Yuchung Cheng
7268586baa tcp: pause Fast Open globally after third consecutive timeout
Prior to this patch, active Fast Open is paused on a specific
destination IP address if the previous connections to the
IP address have experienced recurring timeouts . But recent
experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
browsers indicate the isssue is often caused by broken middle-boxes
sitting close to the client. Therefore it is much better user
experience if Fast Open is disabled out-right globally to avoid
experiencing further timeouts on connections toward other
destinations.

This patch changes the destination-IP disablement to global
disablement if a connection experiencing recurring timeouts
or aborts due to timeout.  Repeated incidents would still
exponentially increase the pause time, starting from an hour.
This is extremely conservative but an unfortunate compromise to
minimize bad experience due to broken middle-boxes.

Reported-by: Dragana Damjanovic <ddamjanovic@mozilla.com>
Reported-by: Patrick McManus <mcmanus@ducksong.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 15:51:12 -05:00
David S. Miller
51e18a453f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict was two parallel additions of include files to sch_generic.c,
no biggie.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-09 22:09:55 -05:00
David S. Miller
62cd277039 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2017-12-07

The following pull-request contains BPF updates for your net-next tree.

The main changes are:

1) Detailed documentation of BPF development process from Daniel.

2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops
   from Lawrence.

3) Minor follow up for bpf_skb_set_tunnel_key() from William.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 10:48:25 -05:00
Yousuk Seung
d4761754b4 tcp: invalidate rate samples during SACK reneging
Mark tcp_sock during a SACK reneging event and invalidate rate samples
while marked. Such rate samples may overestimate bw by including packets
that were SACKed before reneging.

< ack 6001 win 10000 sack 7001:38001
< ack 7001 win 0 sack 8001:38001 // Reneg detected
> seq 7001:8001 // RTO, SACK cleared.
< ack 38001 win 10000

In above example the rate sample taken after the last ack will count
7001-38001 as delivered while the actual delivery rate likely could
be much lower i.e. 7001-8001.

This patch adds a new field tcp_sock.sack_reneg and marks it when we
declare SACK reneging and entering TCP_CA_Loss, and unmarks it after
the last rate sample was taken before moving back to TCP_CA_Open. This
patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
is set.

Fixes: b9f64820fb ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 10:07:02 -05:00
Lawrence Brakmo
f19397a5c6 bpf: Add access to snd_cwnd and others in sock_ops
Adds read access to snd_cwnd and srtt_us fields of tcp_sock. Since these
fields are only valid if the socket associated with the sock_ops program
call is a full socket, the field is_fullsock is also added to the
bpf_sock_ops struct. If the socket is not a full socket, reading these
fields returns 0.

Note that in most cases it will not be necessary to check is_fullsock to
know if there is a full socket. The context of the call, as specified by
the 'op' field, can sometimes determine whether there is a full socket.

The struct bpf_sock_ops has the following fields added:

  __u32 is_fullsock;      /* Some TCP fields are only valid if
                           * there is a full socket. If not, the
                           * fields read as zero.
			   */
  __u32 snd_cwnd;
  __u32 srtt_us;          /* Averaged RTT << 3 in usecs */

There is a new macro, SOCK_OPS_GET_TCP32(NAME), to make it easier to add
read access to more 32 bit tcp_sock fields.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-05 14:55:32 +01:00
David Ahern
b4d1605a8e tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()
After this fix : ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()"),
socket lookups happen while skb->cb[] has not been mangled yet by TCP.

Fixes: a04a480d43 ("net: Require exact match for TCP socket lookups if dif is l3mdev")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-03 12:39:15 -05:00
Al Viro
ade994f4f6 net: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:04 -05:00
Neal Cardwell
ed66dfaf23 tcp: when scheduling TLP, time of RTO should account for current ACK
Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.

After the following fix:

df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:

t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets

The response before df92c8394e that was:
  -> schedule a TLP for now + RTO_interval

The response after df92c8394e is:
  -> schedule a TLP for t=0 + RTO_interval

Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.

Before df92c8394e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:

    1) Near the top of tcp_ack(), switch from TLP timer to RTO
       at write_queue_head->paket_tx_time + RTO_interval:
            if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                   tcp_rearm_rto(sk);

    2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
            if (flag & FLAG_ACKED) {
                   tcp_rearm_rto(sk);

    3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
       to TLP at now + RTO_interval:
            if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                   tcp_schedule_loss_probe(sk);

In df92c8394e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.

Fixes: df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-19 12:25:26 +09:00
Eric Dumazet
50895b9de1 tcp: highest_sack fix
syzbot easily found a regression added in our latest patches [1]

No longer set tp->highest_sack to the head of the send queue since
this is not logical and error prone.

Only sack processing should maintain the pointer to an skb from rtx queue.

We might in the future only remember the sequence instead of a pointer to skb,
since rb-tree should allow a fast lookup.

[1]
BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
BUG: KASAN: use-after-free in tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
Read of size 4 at addr ffff8801c154faa8 by task syz-executor4/12860

CPU: 0 PID: 12860 Comm: syz-executor4 Not tainted 4.14.0-next-20171113+ #41
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
 tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
 tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
 tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
 tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x124/0x360 net/core/sock.c:2264
 release_sock+0xa4/0x2a0 net/core/sock.c:2778
 tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
 __sys_sendmsg+0xe5/0x210 net/socket.c:2082
 SYSC_sendmsg net/socket.c:2093 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2089
 entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x452879
RSP: 002b:00007fc9761bfbe8 EFLAGS: 00000212 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000758020 RCX: 0000000000452879
RDX: 0000000000000000 RSI: 0000000020917fc8 RDI: 0000000000000015
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000212 R12: 00000000006ee3a0
R13: 00000000ffffffff R14: 00007fc9761c06d4 R15: 0000000000000000

Allocated by task 12860:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc_node+0x144/0x760 mm/slab.c:3638
 __alloc_skb+0xf1/0x780 net/core/skbuff.c:193
 alloc_skb_fclone include/linux/skbuff.h:1023 [inline]
 sk_stream_alloc_skb+0x11d/0x900 net/ipv4/tcp.c:870
 tcp_sendmsg_locked+0x1341/0x3b80 net/ipv4/tcp.c:1299
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1461
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 SYSC_sendto+0x358/0x5a0 net/socket.c:1749
 SyS_sendto+0x40/0x50 net/socket.c:1717
 entry_SYSCALL_64_fastpath+0x1f/0x96

Freed by task 12860:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3492 [inline]
 kmem_cache_free+0x77/0x280 mm/slab.c:3750
 kfree_skbmem+0xdd/0x1d0 net/core/skbuff.c:603
 __kfree_skb+0x1d/0x20 net/core/skbuff.c:642
 sk_wmem_free_skb include/net/sock.h:1419 [inline]
 tcp_rtx_queue_unlink_and_free include/net/tcp.h:1682 [inline]
 tcp_clean_rtx_queue net/ipv4/tcp_input.c:3111 [inline]
 tcp_ack+0x1b17/0x4fd0 net/ipv4/tcp_input.c:3593
 tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
 tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x124/0x360 net/core/sock.c:2264
 release_sock+0xa4/0x2a0 net/core/sock.c:2778
 tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
 __sys_sendmsg+0xe5/0x210 net/socket.c:2082
 SYSC_sendmsg net/socket.c:2093 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2089
 entry_SYSCALL_64_fastpath+0x1f/0x96

The buggy address belongs to the object at ffff8801c154fa80
 which belongs to the cache skbuff_fclone_cache of size 456
The buggy address is located 40 bytes inside of
 456-byte region [ffff8801c154fa80, ffff8801c154fc48)
The buggy address belongs to the page:
page:ffffea00070553c0 count:1 mapcount:0 mapping:ffff8801c154f080 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffff8801c154f080 0000000000000000 0000000100000006
raw: ffffea00070a5a20 ffffea0006a18360 ffff8801d9ca0500 0000000000000000
page dumped because: kasan: bad access detected

Fixes: 737ff31456 ("tcp: use sequence distance to detect reordering")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-15 19:48:42 +09:00
Stephen Hemminger
6670e15244 tcp: Namespace-ify sysctl_tcp_default_congestion_control
Make default TCP default congestion control to a per namespace
value. This changes default congestion control to a pointer to congestion ops
(rather than implicit as first element of available lsit).

The congestion control setting of new namespaces is inherited
from the current setting of the root namespace.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-15 14:09:52 +09:00
Yuchung Cheng
713bafea92 tcp: retire FACK loss detection
FACK loss detection has been disabled by default and the
successor RACK subsumed FACK and can handle reordering better.
This patch removes FACK to simplify TCP loss recovery.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 18:53:16 +09:00
Eric Dumazet
356d1833b6 tcp: Namespace-ify sysctl_tcp_rmem and sysctl_tcp_wmem
Note that when a new netns is created, it inherits its
sysctl_tcp_rmem and sysctl_tcp_wmem from initial netns.

This change is needed so that we can refine TCP rcvbuf autotuning,
to take RTT into consideration.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 14:34:58 +09:00
Priyaranjan Jha
1f2556916d tcp: higher throughput under reordering with adaptive RACK reordering wnd
Currently TCP RACK loss detection does not work well if packets are
being reordered beyond its static reordering window (min_rtt/4).Under
such reordering it may falsely trigger loss recoveries and reduce TCP
throughput significantly.

This patch improves that by increasing and reducing the reordering
window based on DSACK, which is now supported in major TCP implementations.
It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.

- If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
  by srtt), since there is possibility that spurious retransmission was
  due to reordering delay longer than reo_wnd.

- Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
  no. of successful recoveries (accounts for full DSACK-based loss
  recovery undo). After that, reset it to default (min_rtt/4).

- At max, reo_wnd is incremented only once per rtt. So that the new
  DSACK on which we are reacting, is due to the spurious retx (approx)
  after the reo_wnd has been updated last time.

- reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
  absolute value to account for change in rtt.

In our internal testing, we observed significant increase in throughput,
in scenarios where reordering exceeds min_rtt/4 (previous static value).

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:15:42 +09:00
David S. Miller
ed29668d1a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Smooth Cong Wang's bug fix into 'net-next'.  Basically put
the bulk of the tcf_block_put() logic from 'net' into
tcf_block_put_ext(), but after the offload unbind.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 15:23:39 +09:00
Eric Dumazet
2b7cda9c35 tcp: fix tcp_mtu_probe() vs highest_sack
Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.

Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.

If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.

Bad things would then happen, including infinite loops.

This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.

Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack regardless of whatever
condition, since keeping a stale pointer to freed skb is a recipe
for disaster.

Fixes: a47e5a988a ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 21:18:34 +09:00
David S. Miller
e1ea2f9856 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts here.

NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
nfp_fl_output() needed some adjustments because the code block is in
an else block now.

Parallel additions to net/pkt_cls.h and net/sch_generic.h

A bug fix in __tcp_retransmit_skb() conflicted with some of
the rbtree changes in net-next.

The tc action RCU callback fixes in 'net' had some overlap with some
of the recent tcf_block reworking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-30 21:09:24 +09:00
John Fastabend
8108a77515 bpf: bpf_compute_data uses incorrect cb structure
SK_SKB program types use bpf_compute_data to store the end of the
packet data. However, bpf_compute_data assumes the cb is stored in the
qdisc layer format. But, for SK_SKB this is the wrong layer of the
stack for this type.

It happens to work (sort of!) because in most cases nothing happens
to be overwritten today. This is very fragile and error prone.
Fortunately, we have another hole in tcp_skb_cb we can use so lets
put the data_end value there.

Note, SK_SKB program types do not use data_meta, they are failed by
sk_skb_is_valid_access().

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 11:18:48 +09:00
Alexei Starovoitov
5b52a4c3ac tcp: remove unnecessary include
two extra #include are not necessary in tcp.h
Remove them.

Fixes: 40304b2a15 ("bpf: BPF support for sock_ops")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:27:33 +09:00
Eric Dumazet
c26e91f8b9 tcp: Namespace-ify sysctl_tcp_pacing_ca_ratio
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:39 +09:00
Eric Dumazet
23a7102a2d tcp: Namespace-ify sysctl_tcp_pacing_ss_ratio
Also remove an obsolete comment about TCP pacing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:39 +09:00
Eric Dumazet
4170ba6b58 tcp: Namespace-ify sysctl_tcp_invalid_ratelimit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:39 +09:00
Eric Dumazet
790f00e19f tcp: Namespace-ify sysctl_tcp_autocorking
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:39 +09:00
Eric Dumazet
bd23970429 tcp: Namespace-ify sysctl_tcp_min_rtt_wlen
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:39 +09:00
Eric Dumazet
26e9596e5b tcp: Namespace-ify sysctl_tcp_min_tso_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:38 +09:00
Eric Dumazet
b530b68148 tcp: Namespace-ify sysctl_tcp_challenge_ack_limit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:38 +09:00
Eric Dumazet
9184d8bb44 tcp: Namespace-ify sysctl_tcp_limit_output_bytes
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:38 +09:00
Eric Dumazet
ceef9ab6be tcp: Namespace-ify sysctl_tcp_workaround_signed_windows
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:38 +09:00
Eric Dumazet
d06a990458 tcp: Namespace-ify sysctl_tcp_tso_win_divisor
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:38 +09:00
Eric Dumazet
4540c0cf98 tcp: Namespace-ify sysctl_tcp_moderate_rcvbuf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:38 +09:00
Eric Dumazet
ec36e416f0 tcp: Namespace-ify sysctl_tcp_nometrics_save
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:24:38 +09:00
Eric Dumazet
af9b69a7a6 tcp: Namespace-ify sysctl_tcp_frto
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:43 +09:00
Eric Dumazet
94f0893e0c tcp: Namespace-ify sysctl_tcp_adv_win_scale
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:43 +09:00
Eric Dumazet
0c12654ac6 tcp: Namespace-ify sysctl_tcp_app_win
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:43 +09:00
Eric Dumazet
6496f6bde0 tcp: Namespace-ify sysctl_tcp_dsack
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:43 +09:00
Eric Dumazet
c6e2180359 tcp: Namespace-ify sysctl_tcp_max_reordering
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:43 +09:00
Eric Dumazet
773d4bb96c tcp: remove stale sysctl_tcp_reordering
This extern is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:43 +09:00
Eric Dumazet
0bc65a28ae tcp: Namespace-ify sysctl_tcp_fack
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
65c9410cf5 tcp: Namespace-ify sysctl_tcp_abort_on_overflow
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
625357aa17 tcp: Namespace-ify sysctl_tcp_rfc1337
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
3f4c7c6f6a tcp: Namespace-ify sysctl_tcp_stdurg
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
e0a1e5b519 tcp: Namespace-ify sysctl_tcp_retrans_collapse
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
b510f0d23a tcp: Namespace-ify sysctl_tcp_slow_start_after_idle
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
2c04ac8ae0 tcp: Namespace-ify sysctl_tcp_thin_linear_timeouts
Note that sysctl_tcp_thin_dupack was not used, I deleted it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
e20223f196 tcp: Namespace-ify sysctl_tcp_recovery
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Eric Dumazet
2ae21cf527 tcp: Namespace-ify sysctl_tcp_early_retrans
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-27 16:35:42 +09:00
Ursula Braun
60e2a77807 tcp: TCP experimental option for SMC
The SMC protocol [1] relies on the use of a new TCP experimental
option [2, 3]. With this option, SMC capabilities are exchanged
between peers during the TCP three way handshake. This patch adds
support for this experimental option to TCP.

References:
[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
[2] Shared Use of TCP Experimental Options RFC 6994:
    https://tools.ietf.org/rfc/rfc6994.txt
[3] IANA ExID SMCR:
http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-exids

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26 18:00:29 +09:00
Christoph Paasch
71c02379c7 tcp: Configure TFO without cookie per socket and/or per route
We already allow to enable TFO without a cookie by using the
fastopen-sysctl and setting it to TFO_SERVER_COOKIE_NOT_REQD (or
TFO_CLIENT_NO_COOKIE).
This is safe to do in certain environments where we know that there
isn't a malicous host (aka., data-centers) or when the
application-protocol already provides an authentication mechanism in the
first flight of data.

A server however might be providing multiple services or talking to both
sides (public Internet and data-center). So, this server would want to
enable cookie-less TFO for certain services and/or for connections that
go to the data-center.

This patch exposes a socket-option and a per-route attribute to enable such
fine-grained configurations.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-24 18:48:08 +09:00
David S. Miller
f8ddadc4db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
There were quite a few overlapping sets of changes here.

Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.

Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly.  If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.

In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().

Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.

The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 13:39:14 +01:00
Yuchung Cheng
1fba70e5b6 tcp: socket option to set TCP fast open key
New socket option TCP_FASTOPEN_KEY to allow different keys per
listener.  The listener by default uses the global key until the
socket option is set.  The key is a 16 bytes long binary data. This
option has no effect on regular non-listener TCP sockets.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:21:36 +01:00
John Fastabend
34f79502bb bpf: avoid preempt enable/disable in sockmap using tcp_skb_cb region
SK_SKB BPF programs are run from the socket/tcp context but early in
the stack before much of the TCP metadata is needed in tcp_skb_cb. So
we can use some unused fields to place BPF metadata needed for SK_SKB
programs when implementing the redirect function.

This allows us to drop the preempt disable logic. It does however
require an API change so sk_redirect_map() has been updated to
additionally provide ctx_ptr to skb. Note, we do however continue to
disable/enable preemption around actual BPF program running to account
for map updates.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:01:29 +01:00
Eric Dumazet
437d2762ba tcp: remove obsolete helpers
Remove three inline helpers that are no longer needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12 09:47:08 -07:00
Eric Dumazet
4a269818a7 tcp: fix tcp_unlink_write_queue()
Yury reported crash with this signature :

[  554.034021] [<ffff80003ccd5a58>] 0xffff80003ccd5a58
[  554.034156] [<ffff00000888fd34>] skb_release_all+0x14/0x30
[  554.034288] [<ffff00000888fd64>] __kfree_skb+0x14/0x28
[  554.034409] [<ffff0000088ece6c>] tcp_sendmsg_locked+0x4dc/0xcc8
[  554.034541] [<ffff0000088ed68c>] tcp_sendmsg+0x34/0x58
[  554.034659] [<ffff000008919fd4>] inet_sendmsg+0x2c/0xf8
[  554.034783] [<ffff0000088842e8>] sock_sendmsg+0x18/0x30
[  554.034928] [<ffff0000088861fc>] SyS_sendto+0x84/0xf8

Problem is that skb->destructor contains garbage, and this is
because I accidentally removed tcp_skb_tsorted_anchor_cleanup()
from tcp_unlink_write_queue()

This would trigger with a write(fd, <invalid_memory>, len) attempt,
and we will add to packetdrill this capability to avoid future
regressions.

Fixes: 75c119afe1 ("tcp: implement rb-tree based retransmit queue")
Reported-by: Yury Norov <ynorov@caviumnetworks.com>
Tested-by: Yury Norov <ynorov@caviumnetworks.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 13:41:24 -07:00
Eric Dumazet
75c119afe1 tcp: implement rb-tree based retransmit queue
Using a linear list to store all skbs in write queue has been okay
for quite a while : O(N) is not too bad when N < 500.

Things get messy when N is the order of 100,000 : Modern TCP stacks
want 10Gbit+ of throughput even with 200 ms RTT flows.

40 ns per cache line miss means a full scan can use 4 ms,
blowing away CPU caches.

SACK processing often can use various hints to avoid parsing
whole retransmit queue. But with high packet losses and/or high
reordering, hints no longer work.

Sender has to process thousands of unfriendly SACK, accumulating
a huge socket backlog, burning a cpu and massively dropping packets.

Using an rb-tree for retransmit queue has been avoided for years
because it added complexity and overhead, but now is the time
to be more resistant and say no to quadratic behavior.

1) RTX queue is no longer part of the write queue : already sent skbs
are stored in one rb-tree.

2) Since reaching the head of write queue no longer needs
sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue

Tested:

 On receiver :
 netem on ingress : delay 150ms 200us loss 1
 GRO disabled to force stress and SACK storms.

for f in `seq 1 10`
do
 ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

Before patch :

323.87
351.48
339.59
338.62
306.72
204.07
304.93
291.88
202.47
176.88
   2840

After patch:

1700.83
2207.98
2070.17
1544.26
2114.76
2124.89
1693.14
1080.91
2216.82
1299.94
  18053

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:54 +01:00
Eric Dumazet
ac3f09ba3e tcp: uninline tcp_write_queue_purge()
Since the upcoming rtx rbtree will add some extra code,
it is time to not inline this fat function anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
Eric Dumazet
e2080072ed tcp: new list for sent but unacked skbs for RACK recovery
This patch adds a new queue (list) that tracks the sent but not yet
acked or SACKed skbs for a TCP connection. The list is chronologically
ordered by skb->skb_mstamp (the head is the oldest sent skb).

This list will be used to optimize TCP Rack recovery, which checks
an skb's timestamp to judge if it has been lost and needs to be
retransmitted. Since TCP write queue is ordered by sequence instead
of sent time, RACK has to scan over the write queue to catch all
eligible packets to detect lost retransmission, and iterates through
SACKed skbs repeatedly.

Special cares for rare events:
1. TCP repair fakes skb transmission so the send queue needs adjusted
2. SACK reneging would require re-inserting SACKed skbs into the
   send queue. For now I believe it's not worth the complexity to
   make RACK work perfectly on SACK reneging, so we do nothing here.
3. Fast Open: currently for non-TFO, send-queue correctly queues
   the pure SYN packet. For TFO which queues a pure SYN and
   then a data packet, send-queue only queues the data packet but
   not the pure SYN due to the structure of TFO code. This is okay
   because the SYN receiver would never respond with a SACK on a
   missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).

In order to not grow sk_buff, we use an union for the new list and
_skb_refdst/destructor fields. This is a bit complicated because
we need to make sure _skb_refdst and destructor are properly zeroed
before skb is cloned/copied at transmit, and before being freed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 21:24:47 -07:00
Wei Wang
27204aaa9d tcp: uniform the set up of sockets after successful connection
Currently in the TCP code, the initialization sequence for cached
metrics, congestion control, BPF, etc, after successful connection
is very inconsistent. This introduces inconsistent bevhavior and is
prone to bugs. The current call sequence is as follows:

(1) for active case (tcp_finish_connect() case):
        tcp_mtup_init(sk);
        icsk->icsk_af_ops->rebuild_header(sk);
        tcp_init_metrics(sk);
        tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
        tcp_init_congestion_control(sk);
        tcp_init_buffer_space(sk);

(2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
        icsk->icsk_af_ops->rebuild_header(sk);
        tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
        tcp_init_congestion_control(sk);
        tcp_mtup_init(sk);
        tcp_init_buffer_space(sk);
        tcp_init_metrics(sk);

(3) for TFO passive case (tcp_fastopen_create_child()):
        inet_csk(child)->icsk_af_ops->rebuild_header(child);
        tcp_init_congestion_control(child);
        tcp_mtup_init(child);
        tcp_init_metrics(child);
        tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
        tcp_init_buffer_space(child);

This commit uniforms the above functions to have the following sequence:
        tcp_mtup_init(sk);
        icsk->icsk_af_ops->rebuild_header(sk);
        tcp_init_metrics(sk);
        tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
        tcp_init_congestion_control(sk);
        tcp_init_buffer_space(sk);
This sequence is the same as the (1) active case. We pick this sequence
because this order correctly allows BPF to override the settings
including congestion control module and initial cwnd, etc from
the route, and then allows the CC module to see those settings.

Suggested-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 21:10:16 -07:00
David S. Miller
53954cf8c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 18:19:22 -07:00
Haishuang Yan
4371384856 ipv4: Namespaceify tcp_fastopen_key knob
Different namespace application might require different tcp_fastopen_key
independently of the host.

David Miller pointed out there is a leak without releasing the context
of tcp_fastopen_key during netns teardown. So add the release action in
exit_batch path.

Tested:
1. Container namespace:
# cat /proc/sys/net/ipv4/tcp_fastopen_key:
2817fff2-f803cf97-eadfd1f3-78c0992b

cookie key in tcp syn packets:
Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: 1e5dd82a8c492ca9

2. Host:
# cat /proc/sys/net/ipv4/tcp_fastopen_key:
107d7c5f-68eb2ac7-02fb06e6-ed341702

cookie key in tcp syn packets:
Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: e213c02bf0afbc8a

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 17:55:54 -07:00
Haishuang Yan
dd000598a3 ipv4: Remove the 'publish' logic in tcp_fastopen_init_key_once
The 'publish' logic is not necessary after commit dfea2aa654 ("tcp:
Do not call tcp_fastopen_reset_cipher from interrupt context"), because
in tcp_fastopen_cookie_gen,it wouldn't call tcp_fastopen_init_key_once.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 17:55:54 -07:00
Haishuang Yan
e1cfcbe82b ipv4: Namespaceify tcp_fastopen knob
Different namespace application might require enable TCP Fast Open
feature independently of the host.

This patch series continues making more of the TCP Fast Open related
sysctl knobs be per net-namespace.

Reported-by: Luca BRUNO <lucab@debian.org>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 17:55:54 -07:00
Paolo Abeni
7487449c86 IPv4: early demux can return an error code
Currently no error is emitted, but this infrastructure will
used by the next patch to allow source address validation
for mcast sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 03:55:47 +01:00
David S. Miller
1f8d31d189 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-23 10:16:53 -07:00
Eric Dumazet
bffa72cf7f net: sk_buff rbnode reorg
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-19 15:20:22 -07:00
Yuchung Cheng
4c7124413a tcp: remove two unused functions
remove tcp_may_send_now and tcp_snd_test that are no longer used

Fixes: 840a3cbe89 ("tcp: remove forward retransmit feature")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-18 17:26:11 -07:00
David S. Miller
6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Florian Westphal
31770e34e4 tcp: Revert "tcp: remove header prediction"
This reverts commit 45f119bf93.

Eric Dumazet says:
  We found at Google a significant regression caused by
  45f119bf93 tcp: remove header prediction

  In typical RPC  (TCP_RR), when a TCP socket receives data, we now call
  tcp_ack() while we used to not call it.

  This touches enough cache lines to cause a slowdown.

so problem does not seem to be HP removal itself but the tcp_ack()
call.  Therefore, it might be possible to remove HP after all, provided
one finds a way to elide tcp_ack for most cases.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 11:20:09 -07:00
Florian Westphal
c1d2b4c3e2 tcp: Revert "tcp: remove CA_ACK_SLOWPATH"
This change was a followup to the header prediction removal,
so first revert this as a prerequisite to back out hp removal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 11:20:08 -07:00
Sabrina Dubroca
ebfa00c574 tcp: fix refcnt leak with ebpf congestion control
There are a few bugs around refcnt handling in the new BPF congestion
control setsockopt:

 - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
   cannot get a reference on it. This would lead to a use after free,
   since that ca is going away soon.

 - Changing the congestion control case doesn't release the refcnt on
   the previous ca.

 - In the reinit case, we first leak a reference on the old ca, then we
   call tcp_reinit_congestion_control on the ca that we have just
   assigned, leading to deinitializing the wrong ca (->release of the
   new ca on the old ca's data) and releasing the refcount on the ca
   that we actually want to use.

This is visible by building (for example) BIC as a module and setting
net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
samples/bpf.

This patch fixes the refcount issues, and moves reinit back into tcp
core to avoid passing a ca pointer back to BPF.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25 17:16:27 -07:00
Mike Maloney
98aaa913b4 tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg
When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
timestamp corresponding to the highest sequence number data returned.

Previously the skb->tstamp is overwritten when a TCP packet is placed
in the out of order queue.  While the packet is in the ooo queue, save the
timestamp in the TCB_SKB_CB.  This space is shared with the gso_*
options which are only used on the tx path, and a previously unused 4
byte hole.

When skbs are coalesced either in the sk_receive_queue or the
out_of_order_queue always choose the timestamp of the appended skb to
maintain the invariant of returning the timestamp of the last byte in
the recvmsg buffer.

Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 20:30:47 -07:00
Tonghao Zhang
1119936927 tcp: Remove the unused parameter for tcp_try_fastopen.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:16:12 -07:00
David S. Miller
3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
David Ahern
4297a0ef08 net: ipv6: add second dif to inet6 socket lookups
Add a second device index, sdif, to inet6 socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after tcp_v4_rcv
tcp_v4_sdif is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 11:39:22 -07:00
David Ahern
3fa6f616a7 net: ipv4: add second dif to inet socket lookups
Add a second device index, sdif, to inet socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after the cb move
in  tcp_v4_rcv the tcp_v4_sdif helper is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 11:39:21 -07:00
Paolo Abeni
91ed1e666a ip/options: explicitly provide net ns to __ip_options_echo()
__ip_options_echo() uses the current network namespace, and
currently retrives it via skb->dst->dev.

This commit adds an explicit 'net' argument to __ip_options_echo()
and update all the call sites to provide it, usually via a simpler
sock_net().

After this change, __ip_options_echo() no more needs to access
skb->dst and we can drop a couple of hack to preserve such
info in the rx path.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:51:12 -07:00
Neal Cardwell
e1a10ef7fa tcp: introduce tcp_rto_delta_us() helper for xmit timer fix
Pure refactor. This helper will be required in the xmit timer fix
later in the patch series. (Because the TLP logic will want to make
this calculation.)

Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:38:30 -07:00
Tom Herbert
306b13eb3c proto_ops: Add locked held versions of sendmsg and sendpage
Add new proto_ops sendmsg_locked and sendpage_locked that can be
called when the socket lock is already held. Correspondingly, add
kernel_sendmsg_locked and kernel_sendpage_locked as front end
functions.

These functions will be used in zero proxy so that we can take
the socket lock in a ULP sendmsg/sendpage and then directly call the
backend transport proto_ops functions.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 15:26:18 -07:00
Florian Westphal
573aeb0492 tcp: remove CA_ACK_SLOWPATH
re-indent tcp_ack, and remove CA_ACK_SLOWPATH; it is always set now.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:50 -07:00
Florian Westphal
45f119bf93 tcp: remove header prediction
Like prequeue, I am not sure this is overly useful nowadays.

If we receive a train of packets, GRO will aggregate them if the
headers are the same (HP predates GRO by several years) so we don't
get a per-packet benefit, only a per-aggregated-packet one.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Florian Westphal
b6690b1438 tcp: remove low_latency sysctl
Was only checked by the removed prequeue code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Florian Westphal
e7942d0633 tcp: remove prequeue support
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Matvejchikov Ilya
e42e24c3cc tcp: remove redundant argument from tcp_rcv_established()
The last (4th) argument of tcp_rcv_established() is redundant as it
always equals to skb->len and the skb itself is always passed as 2th
agrument. There is no reason to have it.

Signed-off-by: Ilya V. Matveychikov <matvejchikov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 17:28:12 -07:00
Yuchung Cheng
bb4d991a28 tcp: adjust tail loss probe timeout
This patch adjusts the timeout formula to schedule the TCP loss probe
(TLP). The previous formula uses 2*SRTT or 1.5*RTT + DelayACKMax if
only one packet is in flight. It keeps a lower bound of 10 msec which
is too large for short RTT connections (e.g. within a data-center).
The new formula = 2*RTT + (inflight == 1 ? 200ms : 2ticks) which
performs better for short and fast connections.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-19 16:14:10 -07:00
Lawrence Brakmo
91b5b21c7c bpf: Add support for changing congestion control
Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:14 -07:00
Lawrence Brakmo
13d3b1ebe2 bpf: Support for setting initial receive window
This patch adds suppport for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
Lawrence Brakmo
8550f328f4 bpf: Support for per connection SYN/SYN-ACK RTOs
This patch adds support for setting a per connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
Lawrence Brakmo
40304b2a15 bpf: BPF support for sock_ops
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code.  For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.

Two new corresponding structs (one for the kernel one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte horder */
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
Ivan Delalande
8917a777be tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix
Replace first padding in the tcp_md5sig structure with a new flag field
and address prefix length so it can be specified when configuring a new
key for TCP MD5 signature. The tcpm_flags field will only be used if the
socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.

Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 13:51:34 -04:00
Ivan Delalande
6797318e62 tcp: md5: add an address prefix for key lookup
This allows the keys used for TCP MD5 signature to be used for whole
range of addresses, specified with a prefix length, instead of only one
address as it currently is.

Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 13:50:55 -04:00
Dave Watson
e3b5616a34 tcp: export do_tcp_sendpages and tcp_rate_check_app_limited functions
Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to
sendpages while the socket is already locked.

tcp_sendpage is exported, but requires the socket lock to not be held already.

Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 12:12:40 -04:00