Per RFC4898, they count segments sent/received
containing a positive length data segment (that includes
retransmission segments carrying data). Unlike
tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
carrying no data (e.g. pure ack).
The patch also updates the segs_in in tcp_fastopen_add_skb()
so that segs_in >= data_segs_in property is kept.
Together with retransmission data, tcpi_data_segs_out
gives a better signal on the rxmit rate.
v6: Rebase on the latest net-next
v5: Eric pointed out that checking skb->len is still needed in
tcp_fastopen_add_skb() because skb can carry a FIN without data.
Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
helper is used. Comment is added to the fastopen case to explain why
segs_in has to be reset and tcp_segs_in() has to be called before
__skb_pull().
v4: Add comment to the changes in tcp_fastopen_add_skb()
and also add remark on this case in the commit message.
v3: Add const modifier to the skb parameter in tcp_segs_in()
v2: Rework based on recent fix by Eric:
commit a9d99ce28e ("tcp: fix tcpi_segs_in after connection establishment")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
tcpi_min_rtt reports the minimal rtt observed by TCP stack for the flow,
in usec unit. Might be ~0U if not yet known.
tcpi_notsent_bytes reports the amount of bytes in the write queue that
were not yet sent.
This is done in a single patch to not add a temporary 32bit padding hole
in tcp_info.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
places in socket receive queue, then we can remove the test that
tcp_recvmsg() has to perform in fast path.
All we have to do is to adjust SEQ in the slow path.
For the moment, we place an unlikely() and output a message
if we find an skb having SYN flag set.
Goal would be to get rid of the test completely.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.
It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.
Current iproute2 package does not trigger this particular issue.
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces uses of the long obsolete hash interface with
ahash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
Move the jump-label from sock_update_memcg() and sock_release_memcg() to
the callsite, and so eliminate those function calls when socket
accounting is not enabled.
This also eliminates the need for dummy functions because the calls will
be optimized away if the Kconfig options are not enabled.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When closing a listen socket, tcp_abort currently calls
tcp_done without clearing the request queue. If the socket has a
child socket that is established but not yet accepted, the child
socket is then left without a parent, causing a leak.
Fix this by setting the socket state to TCP_CLOSE and calling
inet_csk_listen_stop with the socket lock held, like tcp_close
does.
Tested using net_test. With this patch, calling SOCK_DESTROY on a
listen socket that has an established but not yet accepted child
socket results in the parent and the child being closed, such
that they no longer appear in sock_diag dumps.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support for SYN_RECV request sockets to tcp_abort()
is quite easy after our tcp listener rewrite.
Note that we also need to better handle listeners, or we might
leak not yet accepted children, because of a missing
inet_csk_listen_stop() call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to
determine if checksum offload can be performed. This check currently
does not take the IP protocol into account for devices that advertise
only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
function to check capabilities for checksum offload with a socket
called sk_check_csum_caps. This function checks for specific IPv4 or
IPv6 offload support based on the family of the socket.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.
This patch:
- Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
NETIF_F_ALL_CSUM is being used as a mask).
- Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup to make following patch easier to
review.
Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()
To ease backports, we rename both constants.
Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some functions access TCP sockets without holding a lock and
might output non consistent data, depending on compiler and or
architecture.
tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...
Introduce sk_state_load() and sk_state_store() to fix the issues,
and more clearly document where this lack of locking is happening.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.
Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.
Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code
The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.
The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While auditing TCP stack for upcoming 'lockless' listener changes,
I found I had to change fastopen_init_queue() to properly init the object
before publishing it.
Otherwise an other cpu could try to lock the spinlock before it gets
properly initialized.
Instead of adding appropriate barriers, just remove dynamic memory
allocations :
- Structure is 28 bytes on 64bit arches. Using additional 8 bytes
for holding a pointer seems overkill.
- Two listeners can share same cache line and performance would suffer.
If we really want to save few bytes, we would instead dynamically allocate
whole struct request_sock_queue in the future.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.
With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.
Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.
As Neal pointed out, we also need to perform the check
if/when receive window opens.
Tested:
Following packetdrill test demonstrates the problem
// Test of slow start after idle
`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
+0 write(4, ..., 26000) = 26000
+0 > . 1:5001(5000) ack 1
+0 > . 5001:10001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
+.100 < . 1:1(0) ack 10001 win 511
+0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0 > . 10001:20001(10000) ack 1
+0 > P. 20001:26001(6000) ack 1
+.100 < . 1:1(0) ack 26001 win 511
+0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0 > . 26001:31001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 > . 31001:36001(5000) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.
sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.
Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
Reported-by: Enrico Scholz <rh-bugzilla@ensc.de>
Reported-by: Dan Searle <dan@censornet.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/main.c
net/packet/af_packet.c
Both conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This get_info handler will simply dispatch to the appropriate
existing inet protocol handler.
This patch also includes a new netlink attribute
(INET_DIAG_PROTOCOL). This attribute is currently only used
for multicast messages. Without this attribute, there is no
way of knowing the IP protocol used by the socket information
being broadcast. This attribute is not necessary in the 'dump'
variant of this protocol (though it could easily be added)
because dump requests are issued for specific family/protocol
pairs.
Tested: ss -E (note, the -E option has not yet been merged into
the upstream version of ss).
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.
AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
use a callback to make the locking and unlocking configureable.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.
We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.
[ 523.722504] ======================================================
[ 523.728706] [ INFO: possible circular locking dependency detected ]
[ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[ 523.739202] -------------------------------------------------------
[ 523.745474] ss/18032 is trying to acquire lock:
[ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
[ 523.758129]
[ 523.758129] but task is already holding lock:
[ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
[ 523.774661]
[ 523.774661] which lock already depends on the new lock.
[ 523.774661]
[ 523.782850]
[ 523.782850] the existing dependency chain (in reverse order) is:
[ 523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270
[ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
[ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
[ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
[ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
[ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
[ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
[ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
[ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
[ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
[ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420
Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks the total number of inbound and outbound segments on a
TCP socket. One may use this number to have an idea on connection
quality when compared against the retransmissions.
RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
These are a 32bit field each and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->segs_out was placed near tp->snd_nxt for good data
locality and minimal performance impact, while tp->segs_in was placed
near tp->bytes_received for the same reason.
Join work with Eric Dumazet.
Note that received SYN are accounted on the listener, but sent SYNACK
are not accounted.
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently rely on the setting of SOCK_NOSPACE in the write()
path to ensure that we wake up any epoll edge trigger waiters when
acks return to free space in the write queue. However, if we fail
to allocate even a single skb in the write queue, we could end up
waiting indefinitely.
Fix this by explicitly issuing a wakeup when we detect the condition
of an empty write queue and a return value of -EAGAIN. This allows
userspace to re-try as we expect this to be a temporary failure.
I've tested this approach by artificially making
sk_stream_alloc_skb() return NULL periodically. In that case,
epoll edge trigger waiters will hang indefinitely in epoll_wait()
without this patch.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.
It turns out there are other cases where we want to force memory
schedule :
tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the getsockopt() requesting the cached contents of the syn
packet headers will fail silently if the caller uses a buffer that is
too small to contain the requested data. Rather than fail silently and
discard the headers, getsockopt() should return an error and report the
required size to hold the data.
Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Allowing tcp to use ~19% of physical memory is way too much,
and allowed bugs to be hidden. Add to this that some drivers use a full
page per incoming frame, so real cost can be twice the advertized one.
Reduce tcp_mem by 50 % as a first step to sanity.
tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under memory pressure, tcp_sendmsg() can fail to queue a packet
while no packet is present in write queue. If we return -EAGAIN
with no packet in write queue, no ACK packet will ever come
to raise EPOLLOUT.
We need to allow one skb per TCP socket, and make sure that
tcp sockets can release their forward allocations under pressure.
This is a followup to commit 790ba4566c ("tcp: set SOCK_NOSPACE
under memory pressure")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows a server application to get the TCP SYN headers for
its passive connections. This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.
Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.
The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.
TCP_SAVED_SYN is read once, it frees the saved SYN headers.
The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.
Original patch was written by Tom Herbert, I changed it to not hold
a full skb (and associated dst and conntracking reference).
We have used such patch for about 3 years at Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some Congestion Control modules can provide per flow information,
but current way to get this information is to use netlink.
Like TCP_INFO, let's add TCP_CC_INFO so that applications can
issue a getsockopt() if they have a socket file descriptor,
instead of playing complex netlink games.
Sample usage would be :
union tcp_cc_info info;
socklen_t len = sizeof(info);
if (getsockopt(fd, SOL_TCP, TCP_CC_INFO, &info, &len) == -1)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks total number of payload bytes received on a TCP socket.
This is the sum of all changes done to tp->rcv_nxt
RFC4898 named this : tcpEStatsAppHCThruOctetsReceived
This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->bytes_received was placed near tp->rcv_nxt for
best data locality and minimal performance impact.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks total number of bytes acked for a TCP socket.
This is the sum of all changes done to tp->snd_una, and allows
for precise tracking of delivered data.
RFC4898 named this : tcpEStatsAppHCThruOctetsAcked
This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->bytes_acked was placed near tp->snd_una for
best data locality and minimal performance impact.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that we either see that the buffer has write space
in tcp_poll() or that we perform a wakeup from the input
side. Did not run into any actual problem here, but thought
that we should make things explicit.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_get_info() can be called without holding socket lock,
so any socket fields can change under us.
Use READ_ONCE() to fetch sk_pacing_rate and sk_max_pacing_rate
Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
Overlapping changes in macb driver, mostly fixes and cleanups
in 'net' overlapping with the integration of at91_ether into
macb in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
With some mss values, it is possible tcp_xmit_size_goal() puts
one segment more in TSO packet than tcp_tso_autosize().
We send then one TSO packet followed by one single MSS.
It is not a serious bug, but we can do slightly better, especially
for drivers using netif_set_gso_max_size() to lower gso_max_size.
Using same formula avoids these corner cases and makes
tcp_xmit_size_goal() a bit faster.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 605ad7f184 ("tcp: refine TSO autosizing")
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.
Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As part of an effort to move skb->dropcount to skb->cb[] use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
patch is actually smaller than it seems to be - most of it is unindenting
the inner loop body in tcp_sendmsg() itself...
the bit in tcp_input.c is going to get reverted very soon - that's what
memcpy_from_msg() will become, but not in this commit; let's keep it
reasonably contained...
There's one potentially subtle change here: in case of short copy from
userland, mainline tcp_send_syn_data() discards the skb it has allocated
and falls back to normal path, where we'll send as much as possible after
rereading the same data again. This patch trims SYN+data skb instead -
that way we don't need to copy from the same place twice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 95bd09eb27 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)
At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).
This is bad because :
- It sends small TSO packets even in Slow Start where rate quickly
increases.
- It tends to make socket write queue very big, increasing tcp_ack()
processing time, but also increasing memory needs, not necessarily
accounted for, as fast clones overhead is currently ignored.
- Lower GRO efficiency and more ACK packets.
Servers with a lot of small lived connections suffer from this.
Lets instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.
Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.
As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on same goal.
Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
contains number of mss that we can put in GSO packet, and is not
related to the autosizing goal anymore.
Tested:
40 ms rtt link
nstat >/dev/null
netperf -H remote -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
Before patch :
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/s
87380 2000000 2000000 0.36 44.22
IpInReceives 600 0.0
IpOutRequests 599 0.0
TcpOutSegs 1397 0.0
IpExtOutOctets 2033249 0.0
After patch :
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 2000000 2000000 0.36 44.27
IpInReceives 221 0.0
IpOutRequests 232 0.0
TcpOutSegs 1397 0.0
IpExtOutOctets 2013953 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets.
If the socket is of family AF_INET6, call ipv6_recv_error instead
of ip_recv_error.
This change is more complex than a single branch due to the loadable
ipv6 module. It reuses a pre-existing indirect function call from
ping. The ping code is safe to call, because it is part of the core
ipv6 module and always present when AF_INET6 sockets are active.
Fixes: 4ed2d765 (net-timestamp: TCP timestamping)
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6)
to ip_recv_error.
Signed-off-by: David S. Miller <davem@davemloft.net>
This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".
When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.
Having a helper like this means there will be less places to touch
during that transformation.
Based upon descriptions and patch from Al Viro.
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu tcp_md5sig_pool contains memory blobs that ultimately
go through sg_set_buf().
-> sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
This requires that whole area is in a physically contiguous portion
of memory. And that @buf is not backed by vmalloc().
Given that alloc_percpu() can use vmalloc() areas, this does not
fit the requirements.
Replace alloc_percpu() by a static DEFINE_PER_CPU() as tcp_md5sig_pool
is small anyway, there is no gain to dynamically allocate it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 765cf9976e ("tcp: md5: remove one indirection level in tcp_md5sig_pool")
Reported-by: Crestez Dan Leonard <cdleonard@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.
This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().
This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"
* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
Pull networking updates from David Miller:
"Most notable changes in here:
1) By far the biggest accomplishment, thanks to a large range of
contributors, is the addition of multi-send for transmit. This is
the result of discussions back in Chicago, and the hard work of
several individuals.
Now, when the ->ndo_start_xmit() method of a driver sees
skb->xmit_more as true, it can choose to defer the doorbell
telling the driver to start processing the new TX queue entires.
skb->xmit_more means that the generic networking is guaranteed to
call the driver immediately with another SKB to send.
There is logic added to the qdisc layer to dequeue multiple
packets at a time, and the handling mis-predicted offloads in
software is now done with no locks held.
Finally, pktgen is extended to have a "burst" parameter that can
be used to test a multi-send implementation.
Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
virtio_net
Adding support is almost trivial, so export more drivers to
support this optimization soon.
I want to thank, in no particular or implied order, Jesper
Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
David Tat, Hannes Frederic Sowa, and Rusty Russell.
2) PTP and timestamping support in bnx2x, from Michal Kalderon.
3) Allow adjusting the rx_copybreak threshold for a driver via
ethtool, and add rx_copybreak support to enic driver. From
Govindarajulu Varadarajan.
4) Significant enhancements to the generic PHY layer and the bcm7xxx
driver in particular (EEE support, auto power down, etc.) from
Florian Fainelli.
5) Allow raw buffers to be used for flow dissection, allowing drivers
to determine the optimal "linear pull" size for devices that DMA
into pools of pages. The objective is to get exactly the
necessary amount of headers into the linear SKB area pre-pulled,
but no more. The new interface drivers use is eth_get_headlen().
From WANG Cong, with driver conversions (several had their own
by-hand duplicated implementations) by Alexander Duyck and Eric
Dumazet.
6) Support checksumming more smoothly and efficiently for
encapsulations, and add "foo over UDP" facility. From Tom
Herbert.
7) Add Broadcom SF2 switch driver to DSA layer, from Florian
Fainelli.
8) eBPF now can load programs via a system call and has an extensive
testsuite. Alexei Starovoitov and Daniel Borkmann.
9) Major overhaul of the packet scheduler to use RCU in several major
areas such as the classifiers and rate estimators. From John
Fastabend.
10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
Duyck.
11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
Dumazet.
12) Add Datacenter TCP congestion control algorithm support, From
Florian Westphal.
13) Reorganize sk_buff so that __copy_skb_header() is significantly
faster. From Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
netlabel: directly return netlbl_unlabel_genl_init()
net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
net: description of dma_cookie cause make xmldocs warning
cxgb4: clean up a type issue
cxgb4: potential shift wrapping bug
i40e: skb->xmit_more support
net: fs_enet: Add NAPI TX
net: fs_enet: Remove non NAPI RX
r8169:add support for RTL8168EP
net_sched: copy exts->type in tcf_exts_change()
wimax: convert printk to pr_foo()
af_unix: remove 0 assignment on static
ipv6: Do not warn for informational ICMP messages, regardless of type.
Update Intel Ethernet Driver maintainers list
bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
tipc: fix bug in multicast congestion handling
net: better IFF_XMIT_DST_RELEASE support
net/mlx4_en: remove NETDEV_TX_BUSY
3c59x: fix bad split of cpu_to_le32(pci_map_single())
net: bcmgenet: fix Tx ring priority programming
...
1/ Step down as dmaengine maintainer see commit 08223d80df "dmaengine
maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit
7787380336 "net_dma: mark broken"), without reports of performance
regression.
3/ Miscellaneous fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUKDLKAAoJEB7SkWpmfYgC7wwP/iNHqRjf1suMUTBIF3P6Hgbe
VCUwh0IkuujMPDG46WRn6cYzarRxVPLoGaLHLPszgjI6pmGPVv19wqeDOlUxtcmr
0iQWEWv/zqseaAIW+4gj/WYCyMgKil49EUBJKCZCfNmIaad+e0pr8f0uE5yOkHPM
tqWoZERu9A4dlXGr1TjeOZVzdnPrCt92MrLDN6ZZ6tMuJaEc5PauaLxKTeGy5fYj
UB+k1xJQzECbsYfpB+uCVYl5/qPO1rNyuBYS8THCsW+JYmrbbfH2kkF2lo2FaUpO
8Yd50FtzXHKWwAt7BzfIwU2M7x0wRmryrC/xsQi6M+WmVeHYvvHUIpzaA66xRZ5x
fCy3Fu8sEnmnmboAbh2v2c5uTycqRl2xPzbpLAuxglloXIxzi3ckp6ESF/Z4SldH
oxIoEievN7lah3vKgvlHZYcWDzrYr8EKf/EzFe9RqDBQDKtzDzre1H9Uivr387Vm
uFUcGHYG/GXuX47C7EUsMtaSW2UEoR2ytw/HR6CKFPTVXwAzEO6kA9vg0EqL0iIq
2wVLgavlZuwegmaUBgnr+bgVZMvVN7OU7fAIRVe5xNO6itrPKvheSlQthmRiiq9C
uzOu4PS6PexqzHUNPCcJpCsj+lawmCSrE0bxtPzTA/CQInVgWs219V9+W5Gn/0YA
EARN9k6ueX9PZPQrPQLm
=BBBv
-----END PGP SIGNATURE-----
Merge tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine
Pull dmaengine updates from Dan Williams:
"Even though this has fixes marked for -stable, given the size and the
needed conflict resolutions this is 3.18-rc1/merge-window material.
These patches have been languishing in my tree for a long while. The
fact that I do not have the time to do proper/prompt maintenance of
this tree is a primary factor in the decision to step down as
dmaengine maintainer. That and the fact that the bulk of drivers/dma/
activity is going through Vinod these days.
The net_dma removal has not been in -next. It has developed simple
conflicts against mainline and net-next (for-3.18).
Continuing thanks to Vinod for staying on top of drivers/dma/.
Summary:
1/ Step down as dmaengine maintainer see commit 08223d80df
"dmaengine maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13
(commit 7787380336 "net_dma: mark broken"), without reports of
performance regression.
3/ Miscellaneous fixes"
* tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
net: make tcp_cleanup_rbuf private
net_dma: revert 'copied_early'
net_dma: simple removal
dmaengine maintainer update
dmatest: prevent memory leakage on error path in thread
ioat: Use time_before_jiffies()
dmaengine: fix xor sources continuation
dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
drivers: dma: Include appropriate header file in dca.c
drivers: dma: Mark functions as static in dma_v3.c
dma: mv_xor: Add DMA API error checks
ioat/dca: Use dev_is_pci() to check whether it is pci device
Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).
But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html
This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, an 100ms RTT
connection will timeout after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.
In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously only applies
when the socket has pending transmission.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split assignment and initialization from one into two functions.
This is required by followup patches that add Datacenter TCP
(DCTCP) congestion control algorithm - we need to be able to
determine if the connection is moderated by DCTCP before the
3WHS has finished.
As we walk the available congestion control list during the
assignment, we are always guaranteed to have Reno present as
it's fixed compiled-in. Therefore, since we're doing the
early assignment, we don't have a real use for the Reno alias
tcp_init_congestion_ops anymore and can thus remove it.
Actual usage of the congestion control operations are being
made after the 3WHS has finished, in some cases however we
can access get_info() via diag if implemented, therefore we
need to zero out the private area for those modules.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our goal is to access no more than one cache line access per skb in
a write or receive queue when doing the various walks.
After recent TCP_SKB_CB() reorganizations, it is almost done.
Last part is tcp_skb_pcount() which currently uses
skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
3 cache lines in current kernel (skb->head, skb->end, and
shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)
This very simple patch reuses space currently taken by tcp_tw_isn
only in input path, as tcp_skb_pcount is only needed for skb stored in
write queue.
This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
to get SKBTX_ACK_TSTAMP, which seems possible.
This also speeds up all sack processing in general.
This speeds up tcp_sendmsg() because it no longer has to access/dirty
shinfo.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_dma was the only external user so this can become local to tcp.c
again.
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.
This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.
Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():
https://lkml.org/lkml/2014/9/3/177
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
While profiling TCP stack, I noticed one useless atomic operation
in tcp_sendmsg(), caused by skb_header_release().
It turns out all current skb_header_release() users have a fresh skb,
that no other user can see, so we can avoid one atomic operation.
Introduce __skb_header_release() to clearly document this.
This gave me a 1.5 % improvement on TCP_RR workload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags,
which is only used in output path.
tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue,
and its unfortunate because this bit is located in a cache line right before
the payload.
We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags.
This patch does so, and avoids the cache line miss in tcp_recvmsg()
Following patches will
- allow a segment with FIN being coalesced in tcp_try_coalesce()
- simplify tcp_collapse() by not copying the headers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.
We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Replace uses of get_cpu_var for address calculation through this_cpu_ptr.
Cc: netdev@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A set of small fixes pointed out just after the merge:
- make tcp_tx_timestamp static
- make tcp_gso_tstamp static
- use before() to compare TCP seqno, instead of cast to u64
- add tstamp to tx_flags in GSO, instead of overwrite tx_flags
- record skb_shinfo(skb)->tskey for all timestamps, also HW.
- optimization in tcp_tx_timestamp:
call sock_tx_timestamp only if a tstamp option is set.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP timestamping extends SO_TIMESTAMPING to bytestreams.
Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.
The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.
This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.
This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.
If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.
The implementation supports a single timestamp per packet. It silenty
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.
Implementation details:
- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.
- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When in repair-mode and TCP_RECV_QUEUE is set, we end up calling
tcp_push with mss_now being 0. If data is in the send-queue and
tcp_set_skb_tso_segs gets called, we crash because it will divide by
mss_now:
[ 347.151939] divide error: 0000 [#1] SMP
[ 347.152907] Modules linked in:
[ 347.152907] CPU: 1 PID: 1123 Comm: packetdrill Not tainted 3.16.0-rc2 #4
[ 347.152907] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 347.152907] task: f5b88540 ti: f3c82000 task.ti: f3c82000
[ 347.152907] EIP: 0060:[<c1601359>] EFLAGS: 00210246 CPU: 1
[ 347.152907] EIP is at tcp_set_skb_tso_segs+0x49/0xa0
[ 347.152907] EAX: 00000b67 EBX: f5acd080 ECX: 00000000 EDX: 00000000
[ 347.152907] ESI: f5a28f40 EDI: f3c88f00 EBP: f3c83d10 ESP: f3c83d00
[ 347.152907] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 347.152907] CR0: 80050033 CR2: 083158b0 CR3: 35146000 CR4: 000006b0
[ 347.152907] Stack:
[ 347.152907] c167f9d9 f5acd080 000005b4 00000002 f3c83d20 c16013e6 f3c88f00 f5acd080
[ 347.152907] f3c83da0 c1603b5a f3c83d38 c10a0188 00000000 00000000 f3c83d84 c10acc85
[ 347.152907] c1ad5ec0 00000000 00000000 c1ad679c 010003e0 00000000 00000000 f3c88fc8
[ 347.152907] Call Trace:
[ 347.152907] [<c167f9d9>] ? apic_timer_interrupt+0x2d/0x34
[ 347.152907] [<c16013e6>] tcp_init_tso_segs+0x36/0x50
[ 347.152907] [<c1603b5a>] tcp_write_xmit+0x7a/0xbf0
[ 347.152907] [<c10a0188>] ? up+0x28/0x40
[ 347.152907] [<c10acc85>] ? console_unlock+0x295/0x480
[ 347.152907] [<c10ad24f>] ? vprintk_emit+0x1ef/0x4b0
[ 347.152907] [<c1605716>] __tcp_push_pending_frames+0x36/0xd0
[ 347.152907] [<c15f4860>] tcp_push+0xf0/0x120
[ 347.152907] [<c15f7641>] tcp_sendmsg+0xf1/0xbf0
[ 347.152907] [<c116d920>] ? kmem_cache_free+0xf0/0x120
[ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40
[ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40
[ 347.152907] [<c114f0f0>] ? do_wp_page+0x3e0/0x850
[ 347.152907] [<c161c36a>] inet_sendmsg+0x4a/0xb0
[ 347.152907] [<c1150269>] ? handle_mm_fault+0x709/0xfb0
[ 347.152907] [<c15a006b>] sock_aio_write+0xbb/0xd0
[ 347.152907] [<c1180b79>] do_sync_write+0x69/0xa0
[ 347.152907] [<c1181023>] vfs_write+0x123/0x160
[ 347.152907] [<c1181d55>] SyS_write+0x55/0xb0
[ 347.152907] [<c167f0d8>] sysenter_do_call+0x12/0x28
This can easily be reproduced with the following packetdrill-script (the
"magic" with netem, sk_pacing and limit_output_bytes is done to prevent
the kernel from pushing all segments, because hitting the limit without
doing this is not so easy with packetdrill):
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1460>
+0 > S. 0:0(0) ack 1 <mss 1460>
+0.1 < . 1:1(0) ack 1 win 65000
+0 accept(3, ..., ...) = 4
// This forces that not all segments of the snd-queue will be pushed
+0 `tc qdisc add dev tun0 root netem delay 10ms`
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=2`
+0 setsockopt(4, SOL_SOCKET, 47, [2], 4) = 0
+0 write(4,...,10000) = 10000
+0 write(4,...,10000) = 10000
// Set tcp-repair stuff, particularly TCP_RECV_QUEUE
+0 setsockopt(4, SOL_TCP, 19, [1], 4) = 0
+0 setsockopt(4, SOL_TCP, 20, [1], 4) = 0
// This now will make the write push the remaining segments
+0 setsockopt(4, SOL_SOCKET, 47, [20000], 4) = 0
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=130000`
// Now we will crash
+0 write(4,...,1000) = 1000
This happens since ec34232575 (tcp: fix retransmission in repair
mode). Prior to that, the call to tcp_push was prevented by a check for
tp->repair.
The patch fixes it, by adding the new goto-label out_nopush. When exiting
tcp_sendmsg and a push is not required, which is the case for tp->repair,
we go to this label.
When repairing and calling send() with TCP_RECV_QUEUE, the data is
actually put in the receive-queue. So, no push is required because no
data has been added to the send-queue.
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Fixes: ec34232575 (tcp: fix retransmission in repair mode)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a TCP_FASTOPEN socket option to get a max backlog on its
listener to getsockopt().
Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Upcoming congestion controls for TCP require usec resolution for RTT
estimations. Millisecond resolution is simply not enough these days.
FQ/pacing in DC environments also require this change for finer control
and removal of bimodal behavior due to the current hack in
tcp_update_pacing_rate() for 'small rtt'
TCP_CONG_RTT_STAMP is no longer needed.
As Julian Anastasov pointed out, we need to keep user compatibility :
tcp_metrics used to export RTT and RTTVAR in msec resolution,
so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
to use the new attributes if provided by the kernel.
In this example ss command displays a srtt of 32 usecs (10Gbit link)
lpk51:~# ./ss -i dst lpk52
Netid State Recv-Q Send-Q Local Address:Port Peer
Address:Port
tcp ESTAB 0 1 10.246.11.51:42959
10.246.11.52:64614
cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
cwnd:10 send
3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
Updated iproute2 ip command displays :
lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
10.246.11.51
Old binary displays :
lpk51:~# ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
10.246.11.51
With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Larry Brakmo <brakmo@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes two bugs in fastopen :
1) The tcp_sendmsg(..., @size) argument was ignored.
Code was relying on user not fooling the kernel with iovec mismatches
2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5
allocations, which are likely to fail when memory gets fragmented.
Fixes: 783237e8da ("net-tcp: Fast Open client - sending SYN-data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new fields to struct tcp_info, to report sk_pacing_rate
and sk_max_pacing_rate to monitoring applications, as ss from iproute2.
User exported fields are 64bit, even if kernel is currently using 32bit
fields.
lpaa5:~# ss -i
..
skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic
wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send
13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15
rcv_space:29200
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As far as I can tell we have used a default of 60 seconds for
FIN_WAIT2 timeout for ages (since 2.x times??).
In any case, the timeout these days is 60 seconds, so the 3 min
comment is wrong (and cost me a few minutes of my life when I was
debugging a FIN_WAIT2 related problem in a userspace application and
checked the kernel source for details).
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) BPF debugger and asm tool by Daniel Borkmann.
2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.
3) Correct reciprocal_divide and update users, from Hannes Frederic
Sowa and Daniel Borkmann.
4) Currently we only have a "set" operation for the hw timestamp socket
ioctl, add a "get" operation to match. From Ben Hutchings.
5) Add better trace events for debugging driver datapath problems, also
from Ben Hutchings.
6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
have a small send and a previous packet is already in the qdisc or
device queue, defer until TX completion or we get more data.
7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.
8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
Borkmann.
9) Share IP header compression code between Bluetooth and IEEE802154
layers, from Jukka Rissanen.
10) Fix ipv6 router reachability probing, from Jiri Benc.
11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.
12) Support tunneling in GRO layer, from Jerry Chu.
13) Allow bonding to be configured fully using netlink, from Scott
Feldman.
14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
already get the TCI. From Atzm Watanabe.
15) New "Heavy Hitter" qdisc, from Terry Lam.
16) Significantly improve the IPSEC support in pktgen, from Fan Du.
17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
Herbert.
18) Add Proportional Integral Enhanced packet scheduler, from Vijay
Subramanian.
19) Allow openvswitch to mmap'd netlink, from Thomas Graf.
20) Key TCP metrics blobs also by source address, not just destination
address. From Christoph Paasch.
21) Support 10G in generic phylib. From Andy Fleming.
22) Try to short-circuit GRO flow compares using device provided RX
hash, if provided. From Tom Herbert.
The wireless and netfilter folks have been busy little bees too.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
net/cxgb4: Fix referencing freed adapter
ipv6: reallocate addrconf router for ipv6 address when lo device up
fib_frontend: fix possible NULL pointer dereference
rtnetlink: remove IFLA_BOND_SLAVE definition
rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
qlcnic: update version to 5.3.55
qlcnic: Enhance logic to calculate msix vectors.
qlcnic: Refactor interrupt coalescing code for all adapters.
qlcnic: Update poll controller code path
qlcnic: Interrupt code cleanup
qlcnic: Enhance Tx timeout debugging.
qlcnic: Use bool for rx_mac_learn.
bonding: fix u64 division
rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
sfc: Use the correct maximum TX DMA ring size for SFC9100
Add Shradha Shah as the sfc driver maintainer.
net/vxlan: Share RX skb de-marking and checksum checks with ovs
tulip: cleanup by using ARRAY_SIZE()
ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
net/cxgb4: Don't retrieve stats during recovery
...
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement due to known more preempt_count 'refs'.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
TCP out_of_order_queue lock is not used, as queue manipulation
happens with socket lock held and we therefore use the lockless
skb queue routines (as __skb_queue_head())
We can use __skb_queue_head_init() instead of skb_queue_head_init()
to make this more consistent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem noticed a TCP_RR regression caused by TCP autocorking
on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be
right above RTT between hosts.
We can receive a ACK for a packet still in NIC TX ring buffer or in a
softnet completion queue.
Fix this by always pushing the skb if it is at the head of write queue.
Also, as TX completion is lockless, it's safer to perform sk_wmem_alloc
test after setting TSQ_THROTTLED.
erd:~# MIB="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
erd:~# ./netperf -H remote -t TCP_RR -- -o $MIB | tail -n 1
(repeat 3 times)
Before patch :
18,1049.87,41004,39631,6295.47
17,239.52,40804,48,2912.79
18,348.40,40877,54,3573.39
After patch :
18,22.84,4606,38,16.39
17,21.56,2871,36,13.51
17,22.46,2705,37,11.83
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: f54b311142 ("tcp: auto corking")
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detected skb was under used
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in term of cpu usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull slave-dmaengine changes from Vinod Koul:
"This brings for slave dmaengine:
- Change dma notification flag to DMA_COMPLETE from DMA_SUCCESS as
dmaengine can only transfer and not verify validaty of dma
transfers
- Bunch of fixes across drivers:
- cppi41 driver fixes from Daniel
- 8 channel freescale dma engine support and updated bindings from
Hongbo
- msx-dma fixes and cleanup by Markus
- DMAengine updates from Dan:
- Bartlomiej and Dan finalized a rework of the dma address unmap
implementation.
- In the course of testing 1/ a collection of enhancements to
dmatest fell out. Notably basic performance statistics, and
fixed / enhanced test control through new module parameters
'run', 'wait', 'noverify', and 'verbose'. Thanks to Andriy and
Linus [Walleij] for their review.
- Testing the raid related corner cases of 1/ triggered bugs in
the recently added 16-source operation support in the ioatdma
driver.
- Some minor fixes / cleanups to mv_xor and ioatdma"
* 'next' of git://git.infradead.org/users/vkoul/slave-dma: (99 commits)
dma: mv_xor: Fix mis-usage of mmio 'base' and 'high_base' registers
dma: mv_xor: Remove unneeded NULL address check
ioat: fix ioat3_irq_reinit
ioat: kill msix_single_vector support
raid6test: add new corner case for ioatdma driver
ioatdma: clean up sed pool kmem_cache
ioatdma: fix selection of 16 vs 8 source path
ioatdma: fix sed pool selection
ioatdma: Fix bug in selftest after removal of DMA_MEMSET.
dmatest: verbose mode
dmatest: convert to dmaengine_unmap_data
dmatest: add a 'wait' parameter
dmatest: add basic performance metrics
dmatest: add support for skipping verification and random data setup
dmatest: use pseudo random numbers
dmatest: support xor-only, or pq-only channels in tests
dmatest: restore ability to start test at module load and init
dmatest: cleanup redundant "dmatest: " prefixes
dmatest: replace stored results mechanism, with uniform messages
Revert "dmatest: append verify result to results"
...
After commit c9eeec26e3 ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.
802.11 AMPDU requires a fair amount of queueing to be effective.
This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.
It also remove the use of this sysctl while building skb stored
in write queue, as TSO autosizing does the right thing anyway.
Users with well behaving NICS and correct qdisc (like sch_fq),
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.
This new usage of sysctl_tcp_limit_output_bytes permits each driver
authors to check how their driver performs when/if the value is set
to a minimum of 4KB.
Normally, line rate for a single TCP flow should be possible,
but some drivers rely on timers to perform TX completion and
too long TX completion delays prevent reaching full throughput.
Fixes: c9eeec26e3 ("tcp: TSQ can use a dynamic limit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sujith Manoharan <sujith@msujith.org>
Reported-by: Arnaud Ebalard <arno@natisbad.org>
Tested-by: Sujith Manoharan <sujith@msujith.org>
Cc: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
The code that is implemented is per memory cgroup not per netns, and
having per netns bits is just confusing. Remove the per netns bits to
make it easier to see what is really going on.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
net/bridge/br_multicast.c
net/ipv6/sit.c
The conflicts were minor:
1) sit.c changes overlap with change to ip_tunnel_xmit() signature.
2) br_multicast.c had an overlap between computing max_delay using
msecs_to_jiffies and turning MLDV2_MRC() into an inline function
with a name using lowercase instead of uppercase letters.
3) stmmac had two overlapping changes, one which conditionally allocated
and hooked up a dma_cfg based upon the presence of the pbl OF property,
and another one handling store-and-forward DMA made. The latter of
which should not go into the new of_find_property() basic block.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/iwlwifi/pcie/trans.c
include/linux/inetdevice.h
The inetdevice.h conflict involves moving the IPV4_DEVCONF values
into a UAPI header, overlapping additions of some new entries.
The iwlwifi conflict is a context overlap.
Signed-off-by: David S. Miller <davem@davemloft.net>
When the repair mode is turned off, the write queue seqs are
updated so that the whole queue is considered to be 'already sent.
The "when" field must be set for such skb. It's used in tcp_rearm_rto
for example. If the "when" field isn't set, the retransmit timeout can
be calculated incorrectly and a tcp connected can stop for two minutes
(TCP_RTO_MAX).
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>