linux-next

mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-23 12:43:55 +08:00

Author	SHA1	Message	Date
David S. Miller	2397849baa	[PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. Since it's guarenteed that we will access the inetpeer if we're trying to do timewait recycling and TCP options were enabled on the connection, just cache the peer in the timewait socket. In the future, inetpeer lookups will be context dependent (per routing realm), and this helps facilitate that as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-09 14:56:12 -07:00
David S. Miller	4670fd819e	tcp: Get rid of inetpeer special cases. The get_peer method TCP uses is full of special cases that make no sense accommodating, and it also gets in the way of doing more reasonable things here. First of all, if the socket doesn't have a usable cached route, there is no sense in trying to optimize timewait recycling. Likewise for the case where we have IP options, such as SRR enabled, that make the IP header destination address (and thus the destination address of the route key) differ from that of the connection's destination address. Just return a NULL peer in these cases, and thus we're also able to get rid of the clumsy inetpeer release logic. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-09 01:25:47 -07:00
Eric Dumazet	a2a385d627	tcp: bool conversions bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-17 14:59:59 -04:00
Pavel Emelyanov	292e8d8c85	tcp: Move rcvq sending to tcp_input.c It actually works on the input queue and will use its read mem routines, thus it's better to have in in the tcp_input.c file. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-10 23:24:35 -04:00
Eric Dumazet	bd14b1b2e2	tcp: be more strict before accepting ECN negociation It appears some networks play bad games with the two bits reserved for ECN. This can trigger false congestion notifications and very slow transferts. Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can disable TCP ECN negociation if it happens we receive mangled CT bits in the SYN packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Perry Lorier <perryl@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Wilmer van der Gaast <wilmer@google.com> Cc: Ankur Jain <jankur@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Dave Täht <dave.taht@bufferbloat.net> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-04 12:05:27 -04:00
Eric Dumazet	b081f85c29	net: implement tcp coalescing in tcp_queue_rcv() Extend tcp coalescing implementing it from tcp_queue_rcv(), the main receiver function when application is not blocked in recvmsg(). Function tcp_queue_rcv() is moved a bit to allow its call from tcp_data_queue() This gives good results especially if GRO could not kick, and if skb head is a fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 21:11:11 -04:00
Yuchung Cheng	750ea2bafa	tcp: early retransmit: delayed fast retransmit Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2). Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay. If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform fast retransmit. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 20:56:10 -04:00
Yuchung Cheng	eed530b6c6	tcp: early retransmit This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-02 20:56:10 -04:00
Eric Dumazet	6746960140	ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing Quoting Tore Anderson from : https://bugzilla.kernel.org/show_bug.cgi?id=42572 When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment size does not take into account the size of the IPv6 Fragmentation header that needs to be included in outbound packets, causing every transmitted TCP segment to be fragmented across two IPv6 packets, the latter of which will only contain 8 bytes of actual payload. RTAX_FEATURE_ALLFRAG is typically set on a route in response to receving a ICMPv6 Packet Too Big message indicating a Path MTU of less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an IPv6 packet is sent to an IPv4 destination through a stateless translator. Any ICMPv4 Need To Fragment packets originated from the IPv4 part of the path will be translated to ICMPv6 PTB which may then indicate an MTU of less than 1280. The Linux kernel refuses to reduce the effective MTU to anything below 1280 bytes, instead it sets it to exactly 1280 bytes, and RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header), instead of 1232 (additionally taking into account the 8 bytes required by the IPv6 Fragmentation extension header). This in turn results in rather inefficient transmission, as every transmitted TCP segment now is split in two fragments containing 1232+8 bytes of payload. After this patch, all the outgoing packets that includes a Fragmentation header all are "atomic" or "non-fragmented" fragments, i.e., they both have Offset=0 and More Fragments=0. With help from David S. Miller Reported-by: Tore Anderson <tore@fud.no> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Tested-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-27 00:03:34 -04:00
Neal Cardwell	900f65d361	tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() This commit moves the (substantial) common code shared between tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family independent function, tcp_init_sock(). Centralizing this functionality should help avoid drift issues, e.g. where the IPv4 side is updated without a corresponding update to IPv6. There was already some drift: IPv4 initialized snd_cwnd to TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to 2 (in this case it should not matter, since snd_cwnd is also initialized in tcp_init_metrics(), but the general risks and maintenance overhead remain). When diffing the old and new code, note that new tcp_init_sock() function uses the order of steps from the tcp_v4_init_sock() implementation (the order is slightly different in tcp_v6_init_sock()). Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-21 16:36:42 -04:00
Pavel Emelyanov	ee9952831c	tcp: Initial repair mode This includes (according the the previous description): * TCP_REPAIR sockoption This one just puts the socket in/out of the repair mode. Allowed for CAP_NET_ADMIN and for closed/establised sockets only. When repair mode is turned off and the socket happens to be in the established state the window probe is sent to the peer to 'unlock' the connection. * TCP_REPAIR_QUEUE sockoption This one sets the queue which we're about to repair. The 'no-queue' is set by default. * TCP_QUEUE_SEQ socoption Sets the write_seq/rcv_nxt of a selected repaired queue. Allowed for TCP_CLOSE-d sockets only. When the socket changes its state the other seq-s are changed by the kernel according to the protocol rules (most of the existing code is actually reused). * Ability to forcibly bind a socket to a port The sk->sk_reuse is set to SK_FORCE_REUSE. * Immediate connect modification The connect syscall initializes the connection, then directly jumps to the code which finalizes it. * Silent close modification The close just aborts the connection (similar to SO_LINGER with 0 time) but without sending any FIN/RST-s to peer. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-21 15:52:25 -04:00
Pavel Emelyanov	370816aef0	tcp: Move code around This is just the preparation patch, which makes the needed for TCP repair code ready for use. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-21 15:52:25 -04:00
Neal Cardwell	f4f9f6e75d	tcp: restore formatting of macros for tcp_skb_cb sacked field Commit `b82d1bb4` inadvertendly placed unrelated new code between TCPCB_EVER_RETRANS and TCPCB_RETRANS and the other macros that refer to the sacked field in the struct tcp_skb_cb (probably because there was a misleading empty line there). This commit fixes up the formatting so that all macros related to the sacked field are adjacent again. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-16 14:38:16 -04:00
Eric Dumazet	95c9617472	net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-15 12:44:40 -04:00
Eric Dumazet	fd4f2cead6	tcp: RFC6298 supersedes RFC2988bis Updates some comments to track RFC6298 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-14 15:24:26 -04:00
Linus Torvalds	ed2d265d12	The following text was taken from the original review request: "[RFC - PATCH 0/7] consolidation of BUG support code." https://lkml.org/lkml/2012/1/26/525 -- The changes shown here are to unify linux's BUG support under the one <linux/bug.h> file. Due to historical reasons, we have some BUG code in bug.h and some in kernel.h -- i.e. the support for BUILD_BUG in linux/kernel.h predates the addition of linux/bug.h, but old code in kernel.h wasn't moved to bug.h at that time. As a band-aid, kernel.h was including <asm/bug.h> to pseudo link them. This has caused confusion[1] and general yuck/WTF[2] reactions. Here is an example that violates the principle of least surprise: CC lib/string.o lib/string.c: In function 'strlcat': lib/string.c:225:2: error: implicit declaration of function 'BUILD_BUG_ON' make[2]: *** [lib/string.o] Error 1 $ $ grep linux/bug.h lib/string.c #include <linux/bug.h> $ We've included <linux/bug.h> for the BUG infrastructure and yet we still get a compile fail! [We've not kernel.h for BUILD_BUG_ON.] Ugh - very confusing for someone who is new to kernel development. With the above in mind, the goals of this changeset are: 1) find and fix any include/.h files that were relying on the implicit presence of BUG code. 2) find and fix any C files that were consuming kernel.h and hence relying on implicitly getting some/all BUG code. 3) Move the BUG related code living in kernel.h to <linux/bug.h> 4) remove the asm/bug.h from kernel.h to finally break the chain. During development, the order was more like 3-4, build-test, 1-2. But to ensure that git history for bisect doesn't get needless build failures introduced, the commits have been reorderd to fix the problem areas in advance. [1] https://lkml.org/lkml/2012/1/3/90 [2] https://lkml.org/lkml/2012/1/17/414 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPbNwpAAoJEOvOhAQsB9HWrqYP/A0t9VB0nK6e42F0OR2P14MZ GJFtf1B++wwioIrx+KSWSRfSur1C5FKhDbxLR3I/pvkAYl4+T4JvRdMG6xJwxyip CC1kVQQNDjWVVqzjz2x6rYkOffx6dUlw/ERyIyk+OzP+1HzRIsIrugMqbzGLlX0X y0v2Tbd0G6xg1DV8lcRdp95eIzcGuUvdb2iY2LGadWZczEOeSXx64Jz3QCFxg3aL LFU4oovsg8Nb7MRJmqDvHK/oQf5vaTm9WSrS0pvVte0msSQRn8LStYdWC0G9BPCS GwL86h/eLXlUXQlC5GpgWg1QQt5i2QpjBFcVBIG0IT5SgEPMx+gXyiqZva2KwbHu LKicjKtfnzPitQnyEV/N6JyV1fb1U6/MsB7ebU5nCCzt9Gr7MYbjZ44peNeprAtu HMvJ/BNnRr4Ha6nPQNu952AdASPKkxmeXFUwBL1zUbLkOX/bK/vy1ujlcdkFxCD7 fP3t7hghYa737IHk0ehUOhrE4H67hvxTSCKioLUAy/YeN1IcfH/iOQiCBQVLWmoS AqYV6ou9cqgdYoyila2UeAqegb+8xyubPIHt+lebcaKxs5aGsTg+r3vq5juMDAPs iwSVYUDcIw9dHer1lJfo7QCy3QUTRDTxh+LB9VlHXQICgeCK02sLBOi9hbEr4/H8 Ko9g8J3BMxcMkXLHT9ud =PYQT -----END PGP SIGNATURE----- Merge tag 'bug-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull <linux/bug.h> cleanup from Paul Gortmaker: "The changes shown here are to unify linux's BUG support under the one <linux/bug.h> file. Due to historical reasons, we have some BUG code in bug.h and some in kernel.h -- i.e. the support for BUILD_BUG in linux/kernel.h predates the addition of linux/bug.h, but old code in kernel.h wasn't moved to bug.h at that time. As a band-aid, kernel.h was including <asm/bug.h> to pseudo link them. This has caused confusion[1] and general yuck/WTF[2] reactions. Here is an example that violates the principle of least surprise: CC lib/string.o lib/string.c: In function 'strlcat': lib/string.c:225:2: error: implicit declaration of function 'BUILD_BUG_ON' make[2]: ** [lib/string.o] Error 1 $ $ grep linux/bug.h lib/string.c #include <linux/bug.h> $ We've included <linux/bug.h> for the BUG infrastructure and yet we still get a compile fail! [We've not kernel.h for BUILD_BUG_ON.] Ugh - very confusing for someone who is new to kernel development. With the above in mind, the goals of this changeset are: 1) find and fix any include/.h files that were relying on the implicit presence of BUG code. 2) find and fix any C files that were consuming kernel.h and hence relying on implicitly getting some/all BUG code. 3) Move the BUG related code living in kernel.h to <linux/bug.h> 4) remove the asm/bug.h from kernel.h to finally break the chain. During development, the order was more like 3-4, build-test, 1-2. But to ensure that git history for bisect doesn't get needless build failures introduced, the commits have been reorderd to fix the problem areas in advance. [1] https://lkml.org/lkml/2012/1/3/90 [2] https://lkml.org/lkml/2012/1/17/414" Fix up conflicts (new radeon file, reiserfs header cleanups) as per Paul and linux-next. tag 'bug-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: kernel.h: doesn't explicitly use bug.h, so don't include it. bug: consolidate BUILD_BUG_ON with other bug code BUG: headers with BUG/BUG_ON etc. need linux/bug.h bug.h: add include of it to various implicit C users lib: fix implicit users of kernel.h for TAINT_WARN spinlock: macroize assert_spin_locked to avoid bug.h dependency x86: relocate get/set debugreg fcns to include/asm/debugreg.	2012-03-24 10:08:39 -07:00
Paul Gortmaker	187f1882b5	BUG: headers with BUG/BUG_ON etc. need linux/bug.h If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any other BUG variant in a static inline (i.e. not in a #define) then that header really should be including <linux/bug.h> and not just expecting it to be implicitly present. We can make this change risk-free, since if the files using these headers didn't have exposure to linux/bug.h already, they would have been causing compile failures/warnings. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-03-04 17:54:34 -05:00
David S. Miller	b4017c5368	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/tg3.c Conflicts in the statistics regression bug fix from 'net', but happily Matt Carlson originally posted the fix against 'net-next' so I used that to resolve this. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-03-01 17:57:40 -05:00
Neal Cardwell	ecb9719236	tcp: fix comment for tp->highest_sack There was an off-by-one error in the comments describing the highest_sack field in struct tcp_sock. The comments previously claimed that it was the "start sequence of the highest skb with SACKed bit". This commit fixes the comments to note that it is the "start sequence of the skb just after the highest skb with SACKed bit". Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-02-28 15:06:33 -05:00
David S. Miller	dd48dc34fe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2012-02-04 16:39:32 -05:00
Arun Sharma	efcdbf24fd	net: Disambiguate kernel message Some of our machines were reporting: TCP: too many of orphaned sockets even when the number of orphaned sockets was well below the limit. We print a different message depending on whether we're out of TCP memory or there are too many orphaned sockets. Also move the check out of line and cleanup the messages that were printed. Signed-off-by: Arun Sharma <asharma@fb.com> Suggested-by: Mohan Srinivasan <mohan@fb.com> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: David Miller <davem@davemloft.net> Cc: Glauber Costa <glommer@parallels.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-02-01 14:41:50 -05:00
Eric Dumazet	a8afca0329	tcp: md5: protects md5sig_info with RCU This patch makes sure we use appropriate memory barriers before publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from tcp_v4_send_reset() without holding socket lock (upcoming patch from Shawn Lu) Note we also need to respect rcu grace period before its freeing, since we can free socket without this grace period thanks to SLAB_DESTROY_BY_RCU Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Shawn Lu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-02-01 02:11:47 -05:00
Eric Dumazet	a915da9b69	tcp: md5: rcu conversion In order to be able to support proper RST messages for TCP MD5 flows, we need to allow access to MD5 keys without locking listener socket. This conversion is a nice cleanup, and shrinks size of timewait sockets by 80 bytes. IPv6 code reuses generic code found in IPv4 instead of duplicating it. Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Shawn Lu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-31 12:14:00 -05:00
Eric Dumazet	a2d91241a8	tcp: md5: remove obsolete md5_add() method We no longer use md5_add() method from struct tcp_sock_af_ops Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-31 12:13:59 -05:00
Glauber Costa	4acb41903b	net/tcp: Fix tcp memory limits initialization when !CONFIG_SYSCTL sysctl_tcp_mem() initialization was moved to sysctl_tcp_ipv4.c in commit `3dc43e3e4d`, since it became a per-ns value. That code, however, will never run when CONFIG_SYSCTL is disabled, leading to bogus values on those fields - causing hung TCP sockets. This patch fixes it by keeping an initialization code in tcp_init(). It will be overwritten by the first net namespace init if CONFIG_SYSCTL is compiled in, and do the right thing if it is compiled out. It is also named properly as tcp_init_mem(), to properly signal its non-sysctl side effect on TCP limits. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: David S. Miller <davem@davemloft.net> Link: http://lkml.kernel.org/r/4F22D05A.8030604@parallels.com [ renamed the function, tidied up the changelog a bit ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-30 12:41:06 -05:00
Vijay Subramanian	ab56222a32	tcp: Replace constants with #define macros to record the state of SACK/FACK and DSACK for better readability and maintenance. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-21 01:03:23 -05:00
Glauber Costa	3dc43e3e4d	per-netns ipv4 sysctl_tcp_mem This patch allows each namespace to independently set up its levels for tcp memory pressure thresholds. This patch alone does not buy much: we need to make this values per group of process somehow. This is achieved in the patches that follows in this patchset. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 19:04:11 -05:00
Glauber Costa	180d8cd942	foundations of per-cgroup memory pressure controlling. This patch replaces all uses of struct sock fields' memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem to acessor macros. Those macros can either receive a socket argument, or a mem_cgroup argument, depending on the context they live in. Since we're only doing a macro wrapping here, no performance impact at all is expected in the case where we don't have cgroups disabled. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 19:04:10 -05:00
Eric Dumazet	dfd56b8b38	net: use IS_ENABLED(CONFIG_IPV6) Instead of testing defined(CONFIG_IPV6) \|\| defined(CONFIG_IPV6_MODULE) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-11 18:25:16 -05:00
Neal Cardwell	6b5a5c0dbb	tcp: do not scale TSO segment size with reordering degree Since 2005 (`c1b4a7e695`) tcp_tso_should_defer has been using tcp_max_burst() as a target limit for deciding how large to make outgoing TSO packets when not using sysctl_tcp_tso_win_divisor. But since 2008 (`dd9e0dda66`) tcp_max_burst() returns the reordering degree. We should not have tcp_tso_should_defer attempt to build larger segments just because there is more reordering. This commit splits the notion of deferral size used in TSO from the notion of burst size used in cwnd moderation, and returns the TSO deferral limit to its original value. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-29 00:29:41 -05:00
Michał Mirosław	c8f44affb7	net: introduce and use netdev_features_t for device features sets v2: add couple missing conversions in drivers split unexporting netdev_fix_features() implemented %pNF convert sock::sk_route_(no?)caps Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-16 17:43:10 -05:00
Arjan van de Ven	73cb88ecb9	net: make the tcp and udp file_operations for the /proc stuff const the tcp and udp code creates a set of struct file_operations at runtime while it can also be done at compile time, with the added benefit of then having these file operations be const. the trickiest part was to get the "THIS_MODULE" reference right; the naive method of declaring a struct in the place of registration would not work for this reason. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-01 17:56:14 -04:00
Flavio Leitner	78d81d15b7	TCP: remove TCP_DEBUG It was enabled by default and the messages guarded by the define are useful. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 17:36:08 -04:00
Eric Dumazet	318cf7aaa0	tcp: md5: add more const attributes Now tcp_md5_hash_header() has a const tcphdr argument, we can add more const attributes to callers. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 02:46:04 -04:00
Eric Dumazet	ca35a0ef85	tcp: md5: dont write skb head in tcp_md5_hash_header() tcp_md5_hash_header() writes into skb header a temporary zero value, this might confuse other users of this area. Since tcphdr is small (20 bytes), copy it in a temporary variable and make the change in the copy. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 01:52:35 -04:00
Eric Dumazet	cf533ea53e	tcp: add const qualifiers where possible Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow compiler to optimize code further. For example, is it legal to temporary write a null cksum into tcphdr in tcp_md5_hash_header() ? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-21 05:22:42 -04:00
Eric Dumazet	4de075e043	tcp: rename tcp_skb_cb flags Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and maintenance. Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 13:25:05 -04:00
Eric Dumazet	b82d1bb4fd	tcp: unalias tcp_skb_cb flags and ip_dsfield struct tcp_skb_cb contains a "flags" field containing either tcp flags or IP dsfield depending on context (input or output path) Introduce ip_dsfield to make the difference clear and ease maintenance. If later we want to save space, we can union flags/ip_dsfield Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 02:20:08 -04:00
Eric Dumazet	7a269ffad7	tcp: ECN blackhole should not force quickack mode While playing with a new ADSL box at home, I discovered that ECN blackhole can trigger suboptimal quickack mode on linux : We send one ACK for each incoming data frame, without any delay and eventual piggyback. This is because TCP_ECN_check_ce() considers that if no ECT is seen on a segment, this is because this segment was a retransmit. Refine this heuristic and apply it only if we seen ECT in a previous segment, to detect ECN blackhole at IP level. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> CC: Jerry Chu <hkchu@google.com> CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> CC: Jim Gettys <jg@freedesktop.org> CC: Dave Taht <dave.taht@gmail.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 00:58:44 -04:00
David S. Miller	8decf86879	Merge branch 'master' of github.com:davem330/net Conflicts: MAINTAINERS drivers/net/Kconfig drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c drivers/net/ethernet/broadcom/tg3.c drivers/net/wireless/iwlwifi/iwl-pci.c drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c drivers/net/wireless/rt2x00/rt2800usb.c drivers/net/wireless/wl12xx/main.c	2011-09-22 03:23:13 -04:00
Eric Dumazet	e05c82d366	tcp: fix build error if !CONFIG_SYN_COOKIES commit `946cedccbd` (tcp: Change possible SYN flooding messages) added a build error if CONFIG_SYN_COOKIES=n Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-18 21:48:01 -04:00
Eric Dumazet	765cf9976e	tcp: md5: remove one indirection level in tcp_md5sig_pool tcp_md5sig_pool is currently an 'array' (a percpu object) of pointers to struct tcp_md5sig_pool. Only the pointers are NUMA aware, but objects themselves are all allocated on a single node. Remove this extra indirection to get proper percpu memory (NUMA aware) and make code simpler. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-17 01:15:46 -04:00
Eric Dumazet	946cedccbd	tcp: Change possible SYN flooding messages "Possible SYN flooding on port xxxx " messages can fill logs on servers. Change logic to log the message only once per listener, and add two new SNMP counters to track : TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client TCPReqQFullDrop : number of times a SYN request was dropped because syncookies were not enabled. Based on a prior patch from Tom Herbert, and suggestions from David. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 14:49:43 -04:00
Jerry Chu	9ad7c049f0	tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side This patch lowers the default initRTO from 3secs to 1sec per RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet has been retransmitted, AND the TCP timestamp option is not on. It also adds support to take RTT sample during 3WHS on the passive open side, just like its active open counterpart, and uses it, if valid, to seed the initRTO for the data transmission phase. The patch also resets ssthresh to its initial default at the beginning of the data transmission phase, and reduces cwnd to 1 if there has been MORE THAN ONE retransmission during 3WHS per RFC5681. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-08 17:05:30 -07:00
Shan Wei	089c34827e	tcp: Remove debug macro of TCP_CHECK_TIMER Now, TCP_CHECK_TIMER is not used for debuging, it does nothing. And, it has been there for several years, maybe 6 years. Remove it to keep code clearer. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-20 11:10:14 -08:00
David S. Miller	7eb38527c4	tcp: Add reference to initial CWND ietf draft. Suggested by Alexander Zimmermann Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-05 18:13:45 -08:00
David S. Miller	442b9635c5	tcp: Increase the initial congestion window to 10. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Nandita Dukkipati <nanditad@google.com>	2011-02-02 20:48:47 -08:00
Michał Mirosław	04ed3e741d	net: change netdev->features to u32 Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:32:47 -08:00
Nandita Dukkipati	356f039822	TCP: increase default initial receive window. This patch changes the default initial receive window to 10 mss (defined constant). The default window is limited to the maximum of 101460 and 2mss (when mss > 1460). draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends increasing TCP's initial congestion window to 10 mss or about 15KB. Leading up to this proposal were several large-scale live Internet experiments with an initial congestion window of 10 mss (IW10), where we showed that the average latency of HTTP responses improved by approximately 10%. This was accompanied by a slight increase in retransmission rate (0.5%), most of which is coming from applications opening multiple simultaneous connections. To understand the extreme worst case scenarios, and fairness issues (IW10 versus IW3), we further conducted controlled testbed experiments. We came away finding minimal negative impact even under low link bandwidths (dial-ups) and small buffers. These results are extremely encouraging to adopting IW10. However, an initial congestion window of 10 mss is useless unless a TCP receiver advertises an initial receive window of at least 10 mss. Fortunately, in the large-scale Internet experiments we found that most widely used operating systems advertised large initial receive windows of 64KB, allowing us to experiment with a wide range of initial congestion windows. Linux systems were among the few exceptions that advertised a small receive window of 6KB. The purpose of this patch is to fix this shortcoming. References: 1. A comprehensive list of all IW10 references to date. http://code.google.com/speed/protocols/tcpm-IW10.html 2. Paper describing results from large-scale Internet experiments with IW10. http://ccr.sigcomm.org/drupal/?q=node/621 3. Controlled testbed experiments under worst case scenarios and a fairness study. http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf 4. Raw test data from testbed experiments (Linux senders/receivers) with initial congestion and receive windows of both 10 mss. http://research.csc.ncsu.edu/netsrv/?q=content/iw10 5. Internet-Draft. Increasing TCP's Initial Window. https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/ Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-20 21:33:00 -08:00
Shan Wei	4c306a9291	net: kill unused macros These macros never be used, so remove them. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-19 21:59:35 -08:00
Eric Dumazet	bc2ce894e1	tcp: relax tcp_paws_check() Some windows versions have wrong RFC1323 implementations, with SYN and SYNACKS messages containing zero tcp timestamps. We relaxed in commit `fc1ad92dfc` the passive connection case (Windows connects to a linux machine), but the reverse case (linux connects to a Windows machine) has an analogue problem when tsvals from windows machine are 'negative' (high order bit set) : PAWS triggers and we drops incoming messages. Fix this by making zero ts_recent value special, allowing frame to be processed. Based on a report and initial patch from Dmitiy Balakin Bugzilla reference : https://bugzilla.kernel.org/show_bug.cgi?id=24842 Reported-by: dmitriy.balakin@nicneiron.ru Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 14:08:34 -08:00
Shan Wei	dca9b2404a	net: kill unused macros from head file These macros have been defined for several years since v2.6.12-rc2（tracing by git）, but never be used. So remove them. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-02 13:27:33 -08:00
David S. Miller	ccb7c410dd	timewait_sock: Create and use getpeer op. The only thing AF-specific about remembering the timestamp for a time-wait TCP socket is getting the peer. Abstract that behind a new timewait_sock_ops vector. Support for real IPV6 sockets is not filled in yet, but curiously this makes timewait recycling start to work for v4-mapped ipv6 sockets. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-01 18:09:13 -08:00
David S. Miller	3f419d2d48	inet: Turn ->remember_stamp into ->get_peer in connection AF ops. Then we can make a completely generic tcp_remember_stamp() that uses ->get_peer() as a helper, minimizing the AF specific code and minimizing the eventual code duplication when we implement the ipv6 side of TW recycling. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-30 12:28:06 -08:00
Eric Dumazet	8d987e5c75	net: avoid limits overflow Robin Holt tried to boot a 16TB machine and found some limits were reached : sysctl_tcp_mem[2], sysctl_udp_mem[2] We can switch infrastructure to use long "instead" of "int", now atomic_long_t primitives are available for free. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Robin Holt <holt@sgi.com> Reviewed-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-10 12:12:00 -08:00
stephen hemminger	1b9f409293	tcp: tcp_enter_quickack_mode can be static Function only used in tcp_input.c Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-29 19:45:36 -07:00
David S. Miller	e40051d134	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/qlcnic/qlcnic_init.c net/ipv4/ip_output.c	2010-09-27 01:03:03 -07:00
Alexey Kuznetsov	01f83d6984	tcp: Prevent overzealous packetization by SWS logic. If peer uses tiny MSS (say, 75 bytes) and similarly tiny advertised window, the SWS logic will packetize to half the MSS unnecessarily. This causes problems with some embedded devices. However for large MSS devices we do want to half-MSS packetize otherwise we never get enough packets into the pipe for things like fast retransmit and recovery to work. Be careful also to handle the case where MSS > window, otherwise we'll never send until the probe timer. Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 12:01:44 -07:00
David S. Miller	e548833df8	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/mac80211/main.c	2010-09-09 22:27:33 -07:00
Gerrit Renker	3d5b99ae82	TCP: update initial windows according to RFC 5681 This updates the use of larger initial windows, as originally specified in RFC 3390, to use the newer IW values specified in RFC 5681, section 3.1. The changes made in RFC 5681 are: a) the setting now is more clearly specified in units of segments (as the comments by John Heffner emphasized, this was not very clear in RFC 3390); b) for connections with 1095 < SMSS <= 2190 there is now a change: - RFC 3390 says that IW <= 4380, - RFC 5681 says that IW = 3 * SMSS <= 6570. Since RFC 3390 is older and "only" proposed standard, whereas the newer RFC 5681 is already draft standard, it seems preferable to use the newer IW variant. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-30 13:50:44 -07:00
Gerrit Renker	22b71c8f4f	tcp/dccp: Consolidate common code for RFC 3390 conversion This patch consolidates initial-window code common to TCP and CCID-2: * TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and * CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-30 13:45:26 -07:00
David S. Miller	ad1af0fedb	tcp: Combat per-cpu skew in orphan tests. As reported by Anton Blanchard when we use percpu_counter_read_positive() to make our orphan socket limit checks, the check can be off by up to num_cpus_online() * batch (which is 32 by default) which on a 128 cpu machine can be as large as the default orphan limit itself. Fix this by doing the full expensive sum check if the optimized check triggers. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2010-08-25 02:27:49 -07:00
Eric Dumazet	f86586fa48	tcp: sizeof struct tcp_skb_cb is 44 Correct comment stating sizeof(struct tcp_skb_cb) is 36 or 40, since its 44 bytes, since commit `951dbc8ac7` ([IPV6]: Move nextheader offset to the IP6CB). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-15 21:41:00 -07:00
Changli Gao	7ba4291007	inet, inet6: make tcp_sendmsg() and tcp_sendpage() through inet_sendmsg() and inet_sendpage() a new boolean flag no_autobind is added to structure proto to avoid the autobind calls when the protocol is TCP. Then sock_rps_record_flow() is called int the TCP's sendmsg() and sendpage() pathes. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/inet_common.h \| 4 ++++ include/net/sock.h \| 1 + include/net/tcp.h \| 8 ++++---- net/ipv4/af_inet.c \| 15 +++++++++------ net/ipv4/tcp.c \| 11 +++++------ net/ipv4/tcp_ipv4.c \| 3 +++ net/ipv6/af_inet6.c \| 8 ++++---- net/ipv6/tcp_ipv6.c \| 3 +++ 8 files changed, 33 insertions(+), 20 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-12 20:21:46 -07:00
Changli Gao	53d3176b28	net: cleanups remove useless blanks. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/inet_common.h \| 55 ++++------- include/net/tcp.h \| 222 +++++++++++++++++----------------------------- include/net/udp.h \| 38 +++---- 3 files changed, 123 insertions(+), 192 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-12 20:21:45 -07:00
Florian Westphal	172d69e63c	syncookies: add support for ECN Allows use of ECN when syncookies are in effect by encoding ecn_ok into the syn-ack tcp timestamp. While at it, remove a uneeded #ifdef CONFIG_SYN_COOKIES. With CONFIG_SYN_COOKIES=nm want_cookie is ifdef'd to 0 and gcc removes the "if (0)". Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-26 22:00:03 -07:00
Florian Westphal	8c76368174	syncookies: check decoded options against sysctl settings Discard the ACK if we find options that do not match current sysctl settings. Previously it was possible to create a connection with sack, wscale, etc. enabled even if the feature was disabled via sysctl. Also remove an unneeded call to tcp_sack_reset() in cookie_check_timestamp: Both call sites (cookie_v4_check, cookie_v6_check) zero "struct tcp_options_received", hand it to tcp_parse_options() (which does not change tcp_opt->num_sacks/dsack) and then call cookie_check_timestamp(). Even if num_sacks/dsacks were changed, the structure is allocated on the stack and after cookie_check_timestamp returns only a few selected members are copied to the inet_request_sock. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-16 14:42:15 -07:00
Changli Gao	a3433f35a5	tcp: unify tcp flag macros unify tcp flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST, TCPHDR_PSH, TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR. TCBCB_FLAG_* are replaced with the corresponding TCPHDR_*. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/tcp.h \| 24 ++++++------- net/ipv4/tcp.c \| 8 ++-- net/ipv4/tcp_input.c \| 2 - net/ipv4/tcp_output.c \| 59 ++++++++++++++++----------------- net/netfilter/nf_conntrack_proto_tcp.c \| 32 ++++++----------- net/netfilter/xt_TCPMSS.c \| 4 -- 6 files changed, 58 insertions(+), 71 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-15 11:56:19 -07:00
Tom Herbert	a8b690f98b	tcp: Fix slowness in read /proc/net/tcp This patch address a serious performance issue in reading the TCP sockets table (/proc/net/tcp). Reading the full table is done by a number of sequential read operations. At each read operation, a seek is done to find the last socket that was previously read. This seek operation requires that the sockets in the table need to be counted up to the current file position, and to count each of these requires taking a lock for each non-empty bucket. The whole algorithm is O(n^2). The fix is to cache the last bucket value, offset within the bucket, and the file position returned by the last read operation. On the next sequential read, the bucket and offset are used to find the last read socket immediately without needing ot scan the previous buckets the table. This algorithm t read the whole table is O(n). The improvement offered by this patch is easily show by performing cat'ing /proc/net/tcp on a machine with a lot of connections. With about 182K connections in the table, I see the following: - Without patch time cat /proc/net/tcp > /dev/null real 1m56.729s user 0m0.214s sys 1m56.344s - With patch time cat /proc/net/tcp > /dev/null real 0m0.894s user 0m0.290s sys 0m0.594s Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-07 00:43:42 -07:00
David S. Miller	6811d58fc1	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: include/linux/if_link.h	2010-05-16 22:26:58 -07:00
Eric Dumazet	35790c0421	tcp: fix MD5 (RFC2385) support TCP MD5 support uses percpu data for temporary storage. It currently disables preemption so that same storage cannot be reclaimed by another thread on same cpu. We also have to make sure a softirq handler wont try to use also same context. Various bug reports demonstrated corruptions. Fix is to disable preemption and BH. Reported-by: Bhaskar Dutta <bhaskie@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-16 00:34:04 -07:00
Flavio Leitner	6c37e5de45	TCP: avoid to send keepalive probes if receiving data RFC 1122 says the following: ... Keep-alive packets MUST only be sent when no data or acknowledgement packets have been received for the connection within an interval. ... The acknowledgement packet is reseting the keepalive timer but the data packet isn't. This patch fixes it by checking the timestamp of the last received data packet too when the keepalive timer expires. Signed-off-by: Flavio Leitner <fleitner@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-27 12:53:25 -07:00
Tom Herbert	aa2ea0586d	tcp: fix outsegs stat for TSO segments Account for TSO segments of an skb in TCP_MIB_OUTSEGS counter. Without doing this, the counter can be off by orders of magnitude from the actual number of segments sent. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-22 16:00:00 -07:00
Eric Dumazet	aa39514516	net: sk_sleep() helper Define a new function to return the waitqueue of a "struct sock". static inline wait_queue_head_t sk_sleep(struct sock sk) { return sk->sk_sleep; } Change all read occurrences of sk_sleep by a call to this function. Needed for a future RCU conversion. sk_sleep wont be a field directly available. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-20 16:37:13 -07:00
Herbert Xu	bb29624614	inet: Remove unused send_check length argument inet: Remove unused send_check length argument This patch removes the unused length argument from the send_check function in struct inet_connection_sock_af_ops. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Yinghai <yinghai.lu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-11 15:29:09 -07:00
Mike Galbraith	c839d30a41	net: add scheduler sync hint to tcp_prequeue(). Decreases the odds wakee will suffer from frequent cache misses. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-03-04 00:53:51 -08:00
Andreas Petlund	7e38017557	net: TCP thin dupack This patch enables fast retransmissions after one dupACK for TCP if the stream is identified as thin. This will reduce latencies for thin streams that are not able to trigger fast retransmissions due to high packet interarrival time. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin. Signed-off-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-18 15:43:09 -08:00
Andreas Petlund	36e31b0af5	net: TCP thin linear timeouts This patch will make TCP use only linear timeouts if the stream is thin. This will help to avoid the very high latencies that thin stream suffer because of exponential backoff. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin. A maximum of 6 linear timeouts is tried before exponential backoff is resumed. Signed-off-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-18 15:43:08 -08:00
Andreas Petlund	5aa4b32fc8	net: TCP thin-stream detection Inline function to dynamically detect thin streams based on the number of packets in flight. Used to dynamically trigger thin-stream mechanisms if enabled by ioctl or sysctl. Signed-off-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-18 15:43:07 -08:00
Tejun Heo	7d720c3e4f	percpu: add __percpu sparse annotations to net Add __percpu sparse annotations to net. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. The macro and type tricks around snmp stats make things a bit interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All snmp_mib_() users which used to cast the argument to (void ) are updated to cast it to (void __percpu *). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-16 23:05:38 -08:00
Octavian Purdila	72659ecce6	tcp: account SYN-ACK timeouts & retransmissions Currently we don't increment SYN-ACK timeouts & retransmissions although we do increment the same stats for SYN. We seem to have lost the SYN-ACK accounting with the introduction of tcp_syn_recv_timer (commit 2248761e in the netdev-vger-cvs tree). This patch fixes this issue. In the process we also rename the v4/v6 syn/ack retransmit functions for clarity. We also add a new request_socket operations (syn_ack_timeout) so we can keep code in inet_connection_sock.c protocol agnostic. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-01-17 19:09:39 -08:00
laurent chavey	31d12926e3	net: Add rtnetlink init_rcvwnd to set the TCP initial receive window Add rtnetlink init_rcvwnd to set the TCP initial receive window size advertised by passive and active TCP connections. The current Linux TCP implementation limits the advertised TCP initial receive window to the one prescribed by slow start. For short lived TCP connections used for transaction type of traffic (i.e. http requests), bounding the advertised TCP initial receive window results in increased latency to complete the transaction. Support for setting initial congestion window is already supported using rtnetlink init_cwnd, but the feature is useless without the ability to set a larger TCP initial receive window. The rtnetlink init_rcvwnd allows increasing the TCP initial receive window, allowing TCP connection to advertise larger TCP receive window than the ones bounded by slow start. Signed-off-by: Laurent Chavey <chavey@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-23 14:13:30 -08:00
Krishna Kumar	12d50c46dc	tcp: Remove check in __tcp_push_pending_frames tcp_push checks tcp_send_head and calls __tcp_push_pending_frames, which again checks tcp_send_head, and this unnecessary check is done for every other caller of __tcp_push_pending_frames. Remove tcp_send_head check in __tcp_push_pending_frames and add the check to tcp_push_pending_frames. Other functions call __tcp_push_pending_frames only when tcp_send_head would evaluate to true. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-23 14:13:28 -08:00
David S. Miller	bb5b7c1126	tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. It creates a regression, triggering badness for SYN_RECV sockets, for example: [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293 [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000 [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32) [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR> CR: 24002442 XER: 00000000 [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000 This is likely caused by the change in the 'estab' parameter passed to tcp_parse_options() when invoked by the functions in net/ipv4/tcp_minisocks.c But even if that is fixed, the ->conn_request() changes made in this patch series is fundamentally wrong. They try to use the listening socket's 'dst' to probe the route settings. The listening socket doesn't even have a route, and you can't get the right route (the child request one) until much later after we setup all of the state, and it must be done by hand. This stuff really isn't ready, so the best thing to do is a full revert. This reverts the following commits: `f55017a93f` `022c3f7d82` `1aba721eba` `cda42ebd67` `345cda2fd6` `dc343475ed` `05eaade278` `6a2a2d6bf8` Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-15 20:56:42 -08:00
David S. Miller	501706565b	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: include/net/tcp.h	2009-12-11 17:12:17 -08:00
Linus Torvalds	4ef58d4e2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...	2009-12-09 19:43:33 -08:00
Damian Lukowski	2f7de5710a	tcp: Stalling connections: Move timeout calculation routine This patch moves retransmits_timed_out() from include/net/tcp.h to tcp_timer.c, where it is used. Reported-by: Frederic Leroy <fredo@starox.org> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-08 20:56:11 -08:00
Damian Lukowski	07f29bc5bb	tcp: Stalling connections: Fix timeout calculation routine This patch fixes a problem in the TCP connection timeout calculation. Currently, timeout decisions are made on the basis of the current tcp_time_stamp and retrans_stamp, which is usually set at the first retransmission. However, if the retransmission fails in tcp_retransmit_skb(), retrans_stamp is not updated and remains zero. This leads to wrong decisions in retransmits_timed_out() if tcp_time_stamp is larger than the specified timeout, which is very likely. In this case, the TCP connection dies after the first attempted (and unsuccessful) retransmission. With this patch, tcp_skb_cb->when is used instead, when retrans_stamp is not available. This bug has been introduced together with retransmits_timed_out() in 2.6.32, as the number of retransmissions has been used for timeout decisions before. The corresponding commit was `6fa12c8503` (Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.). Thanks to Ilpo Järvinen for code suggestions and Frederic Leroy for testing. Reported-by: Frederic Leroy <fredo@starox.org> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-08 20:56:11 -08:00
André Goddard Rosa	af901ca181	tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re\|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over\|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:55 +01:00
Ilpo Järvinen	8818a9d884	tcp: clear hints to avoid a stale one (nfs only affected?) Eric Dumazet mentioned in a context of another problem: "Well, it seems NFS reuses its socket, so maybe we miss some cleaning as spotted in this old patch" I've not check under which conditions that actually happens but if true, we need to make sure we don't accidently leave stale hints behind when the write queue had to be purged (whether reusing with NFS can actually happen if purging took place is something I'm not sure of). ...At least it compiles. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-02 22:24:02 -08:00
William Allen Simpson	4957faade1	TCPCT part 1g: Responder Cookie => Initiator Parse incoming TCP_COOKIE option(s). Calculate <SYN,ACK> TCP_COOKIE option. Send optional <SYN,ACK> data. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1f: Initiator Cookie => Responder Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-02 22:07:26 -08:00
William Allen Simpson	435cf559f0	TCPCT part 1d: define TCP cookie option, extend existing struct's Data structures are carefully composed to require minimal additions. For example, the struct tcp_options_received cookie_plus variable fits between existing 16-bit and 8-bit variables, requiring no additional space (taking alignment into consideration). There are no additions to tcp_request_sock, and only 1 pointer in tcp_sock. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 The principle difference is using a TCP option to carry the cookie nonce, instead of a user configured offset in the data. This is more flexible and less subject to user configuration error. Such a cookie option has been suggested for many years, and is also useful without SYN data, allowing several related concepts to use the same extension option. "Re: SYN floods (was: does history repeat itself?)", September 9, 1996. http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html "Re: what a new TCP header might look like", May 12, 1998. ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail These functions will also be used in subsequent patches that implement additional features. Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-02 22:07:25 -08:00
William Allen Simpson	519855c508	TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS Define sysctl (tcp_cookie_size) to turn on and off the cookie option default globally, instead of a compiled configuration option. Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant data values, retrieving variable cookie values, and other facilities. Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h, near its corresponding struct tcp_options_received (prior to changes). This is a straightforward re-implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 These functions will also be used in subsequent patches that implement additional features. Requires: net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-02 22:07:24 -08:00
William Allen Simpson	da5c78c826	TCPCT part 1b: generate Responder Cookie secret Define (missing) hash message size for SHA1. Define hashing size constants specific to TCP cookies. Add new function: tcp_cookie_generator(). Maintain global secret values for tcp_cookie_generator(). This is a significantly revised implementation of earlier (15-year-old) Photuris [RFC-2522] code for the KA9Q cooperative multitasking platform. Linux RCU technique appears to be well-suited to this application, though neither of the circular queue items are freed. These functions will also be used in subsequent patches that implement additional features. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-02 22:07:23 -08:00
William Allen Simpson	e6b4d11367	TCPCT part 1a: add request_values parameter for sending SYNACK Add optional function parameters associated with sending SYNACK. These parameters are not needed after sending SYNACK, and are not used for retransmission. Avoids extending struct tcp_request_sock, and avoids allocating kernel memory. Also affects DCCP as it uses common struct request_sock_ops, but this parameter is currently reserved for future use. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-02 22:07:23 -08:00
William Allen Simpson	bee7ca9ec0	net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED Define two symbols needed in both kernel and user space. Remove old (somewhat incorrect) kernel variant that wasn't used in most cases. Default should apply to both RMSS and SMSS (RFC2581). Replace numeric constants with defined symbols. Stand-alone patch, originally developed for TCPCT. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-11-13 20:38:48 -08:00
Eric Dumazet	fd2c3ef761	net: cleanup include/net This cleanup patch puts struct/union/enum opening braces, in first line to ease grep games. struct something { becomes : struct something { Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-11-04 05:06:25 -08:00
Gilad Ben-Yossef	022c3f7d82	Allow tcp_parse_options to consult dst entry We need tcp_parse_options to be aware of dst_entry to take into account per dst_entry TCP options settings Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:28:41 -07:00
David S. Miller	b7058842c9	net: Make setsockopt() optlen be unsigned. This provides safety against negative optlen at the type level instead of depending upon (sometimes non-trivial) checks against this sprinkled all over the the place, in each and every implementation. Based upon work done by Arjan van de Ven and feedback from Linus Torvalds. Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-30 16:12:20 -07:00
Ilpo Järvinen	0b6a05c1db	tcp: fix ssthresh u16 leftover It was once upon time so that snd_sthresh was a 16-bit quantity. ...That has not been true for long period of time. I run across some ancient compares which still seem to trust such legacy. Put all that magic into a single place, I hopefully found all of them. Compile tested, though linking of allyesconfig is ridiculous nowadays it seems. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-15 01:30:10 -07:00
Wu Fengguang	aa1330766c	tcp: replace hard coded GFP_KERNEL with sk_allocation This fixed a lockdep warning which appeared when doing stress memory tests over NFS: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock mount_root => nfs_root_data => tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim David raised a concern that if the allocation fails in tcp_send_fin(), and it's GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting for the allocation to succeed. But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could loop endlessly under memory pressure. CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: David S. Miller <davem@davemloft.net> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-02 23:45:45 -07:00
Damian Lukowski	5152fc7de3	RTO connection timeout: coding style fixes and comments This patch affects the retransmits_timed_out() function. Changes: 1) Variables have more meaningful names 2) retransmits_timed_out() has an introductionary comment. 3) Small coding style changes. Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 17:40:47 -07:00
Damian Lukowski	6fa12c8503	Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value. RFC 1122 specifies two threshold values R1 and R2 for connection timeouts, which may represent a number of allowed retransmissions or a timeout value. Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds in number of allowed retransmissions. For any desired threshold R2 (by means of time) one can specify tcp_retries2 (by means of number of retransmissions) such that TCP will not time out earlier than R2. This is the case, because the RTO schedule follows a fixed pattern, namely exponential backoff. However, the RTO behaviour is not predictable any more if RTO backoffs can be reverted, as it is the case in the draft "Make TCP more Robust to Long Connectivity Disruptions" (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd). In the worst case TCP would time out a connection after 3.2 seconds, if the initial RTO equaled MIN_RTO and each backoff has been reverted. This patch introduces a function retransmits_timed_out(N), which calculates the timeout of a TCP connection, assuming an initial RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions. Whenever timeout decisions are made by comparing the retransmission counter to some value N, this function can be used, instead. The meaning of tcp_retries2 will be changed, as many more RTO retransmissions can occur than the value indicates. However, it yields a timeout which is similar to the one of an unpatched, exponentially backing off TCP in the same scenario. As no application could rely on an RTO greater than MIN_RTO, there should be no risk of a regression. Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 02:45:47 -07:00
Damian Lukowski	f1ecd5d9e7	Revert Backoff [v3]: Revert RTO on ICMP destination unreachable Here, an ICMP host/network unreachable message, whose payload fits to TCP's SND.UNA, is taken as an indication that the RTO retransmission has not been lost due to congestion, but because of a route failure somewhere along the path. With true congestion, a router won't trigger such a message and the patched TCP will operate as standard TCP. This patch reverts one RTO backoff, if an ICMP host/network unreachable message, whose payload fits to TCP's SND.UNA, arrives. Based on the new RTO, the retransmission timer is reset to reflect the remaining time, or - if the revert clocked out the timer - a retransmission is sent out immediately. Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if there have been retransmissions and reversible backoffs, already. Changes from v2: 1) Renaming of skb in tcp_v4_err() moved to another patch. 2) Reintroduced tcp_bound_rto() and __tcp_set_rto(). 3) Fixed code comments. Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 02:45:42 -07:00
Eric Dumazet	df19a62677	tcp: keepalive cleanups Introduce keepalive_probes(tp) helper, and use it, like keepalive_time_when(tp) and keepalive_intvl_when(tp) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-28 23:48:54 -07:00
John Dykstra	e3afe7b75e	tcp: Fix MD5 signature checking on IPv4 mapped sockets Fix MD5 signature checking so that an IPv4 active open to an IPv6 socket can succeed. In particular, use the correct address family's signature generation function for the SYN/ACK. Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-07-20 07:49:07 -07:00
David S. Miller	22f6dacdfc	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: include/net/tcp.h	2009-05-08 02:48:30 -07:00
Eric Dumazet	7aedec2ad5	tcp: tcp_prequeue() can use keyed wakeups We can avoid waking up tasks not interested in receive notifications, using wake_up_interruptible_poll() instead of wake_up_interruptible() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-05-07 14:52:28 -07:00
Eric Dumazet	f5f8d86b23	tcp: tcp_prequeue() cleanup Small cleanup patch to reduce line lengths, before a change in tcp_prequeue(). Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-05-07 14:52:26 -07:00
Satoru SATOH	0c266898b4	tcp: Fix tcp_prequeue() to get correct rto_min value tcp_prequeue() refers to the constant value (TCP_RTO_MIN) regardless of the actual value might be tuned. The following patches fix this and make tcp_prequeue get the actual value returns from tcp_rto_min(). Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-05-04 11:11:01 -07:00
Florian Westphal	a0f82f64e2	syncookies: remove last_synq_overflow from struct tcp_sock last_synq_overflow eats 4 or 8 bytes in struct tcp_sock, even though it is only used when a listening sockets syn queue is full. We can (ab)use rx_opt.ts_recent_stamp to store the same information; it is not used otherwise as long as a socket is in listen state. Move linger2 around to avoid splitting struct mtu_probe across cacheline boundary on 32 bit arches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-04-20 02:25:26 -07:00
Ilpo Järvinen	797108d134	tcp: add helper for counter tweaking due mid-wq change We need full-scale adjustment to fix a TCP miscount in the next patch, so just move it into a helper and call for that from the other places. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-04-02 16:31:44 -07:00
Ilpo Järvinen	0c54b85f28	tcp: simplify tcp_current_mss There's very little need for most of the callsites to get tp->xmit_goal_size updated. That will cost us divide as is, so slice the function in two. Also, the only users of the tp->xmit_goal_size are directly behind tcp_current_mss(), so there's no need to store that variable into tcp_sock at all! The drop of xmit_goal_size currently leaves 16-bit hole and some reorganization would again be necessary to change that (but I'm aiming to fill that hole with u16 xmit_goal_size_segs to cache the results of the remaining divide to get that tso on regression). Bring xmit_goal_size parts into tcp.c Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Evgeniy Polyakov <zbr@ioremap.net> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-03-15 20:09:54 -07:00
Ilpo Järvinen	c887e6d2d9	tcp: consolidate paws check Wow, it was quite tricky to merge that stream of negations but I think I finally got it right: check & replace_ts_recent: (s32)(rcv_tsval - ts_recent) >= 0 => 0 (s32)(ts_recent - rcv_tsval) <= 0 => 0 discard: (s32)(ts_recent - rcv_tsval) > TCP_PAWS_WINDOW => 1 (s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW => 0 I toggled the return values of tcp_paws_check around since the old encoding added yet-another negation making tracking of truth-values really complicated. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-03-15 20:09:52 -07:00
Hantzis Fotis	ee7537b63a	tcp: tcp_init_wl / tcp_update_wl argument cleanup The above functions from include/net/tcp.h have been defined with an argument that they never use. The argument is 'u32 ack' which is never used inside the function body, and thus it can be removed. The rest of the patch involves the necessary changes to the function callers of the above two functions. Signed-off-by: Hantzis Fotis <xantzis@ceid.upatras.gr> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-03-02 22:42:02 -08:00
Ilpo Järvinen	cabeccbd17	tcp: kill eff_sacks "cache", the sole user can calculate itself Also fixes insignificant bug that would cause sending of stale SACK block (would occur in some corner cases). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-03-02 03:00:16 -08:00
Ilpo Järvinen	758ce5c8d1	tcp: add helper for AI algorithm It seems that implementation in yeah was inconsistent to what other did as it would increase cwnd one ack earlier than the others do. Size benefits: bictcp_cong_avoid \| -36 tcp_cong_avoid_ai \| +52 bictcp_cong_avoid \| -34 tcp_scalable_cong_avoid \| -36 tcp_veno_cong_avoid \| -12 tcp_yeah_cong_avoid \| -38 = -104 bytes total Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-03-02 03:00:15 -08:00
Herbert Xu	bf296b125b	tcp: Add GRO support This patch adds the TCP-specific portion of GRO. The criterion for merging is extremely strict (the TCP header must match exactly apart from the checksum) so as to allow refragmentation. Otherwise this is pretty much identical to LRO, except that we support the merging of ECN packets. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-12-15 23:43:36 -08:00
Eric Dumazet	dd24c00191	net: Use a percpu_counter for orphan_count Instead of using one atomic_t per protocol, use a percpu_counter for "orphan_count", to reduce cache line contention on heavy duty network servers. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 21:17:14 -08:00
Eric Dumazet	1748376b66	net: Use a percpu_counter for sockets_allocated Instead of using one atomic_t per protocol, use a percpu_counter for "sockets_allocated", to reduce cache line contention on heavy duty network servers. Note : We revert commit (`248969ae31` net: af_unix can make unix_nr_socks visbile in /proc), since it is not anymore used after sock_prot_inuse_add() addition Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 21:16:35 -08:00
Ilpo Järvinen	8eecaba900	tcp: tcp_limit_reno_sacked can become static Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 13:45:29 -08:00
Ilpo Järvinen	832d11c5cd	tcp: Try to restore large SKBs while SACK processing During SACK processing, most of the benefits of TSO are eaten by the SACK blocks that one-by-one fragment SKBs to MSS sized chunks. Then we're in problems when cleanup work for them has to be done when a large cumulative ACK comes. Try to return back to pre-split state already while more and more SACK info gets discovered by combining newly discovered SACK areas with the previous skb if that's SACKed as well. This approach has a number of benefits: 1) The processing overhead is spread more equally over the RTT 2) Write queue has less skbs to process (affect everything which has to walk in the queue past the sacked areas) 3) Write queue is consistent whole the time, so no other parts of TCP has to be aware of this (this was not the case with some other approach that was, well, quite intrusive all around). 4) Clean_rtx_queue can release most of the pages using single put_page instead of previous PAGE_SIZE/mss+1 calls In case a hole is fully filled by the new SACK block, we attempt to combine the next skb too which allows construction of skbs that are even larger than what tso split them to and it handles hole per on every nth patterns that often occur during slow start overshoot pretty nicely. Though this to be really useful also a retransmission would have to get lost since cumulative ACKs advance one hole at a time in the most typical case. TODO: handle upwards only merging. That should be rather easy when segment is fully sacked but I'm leaving that as future work item (it won't make very large difference anyway since this current approach already covers quite a lot of normal cases). I was earlier thinking of some sophisticated way of tracking timestamps of the first and the last segment but later on realized that it won't be that necessary at all to store the timestamp of the last segment. The cases that can occur are basically either: 1) ambiguous => no sensible measurement can be taken anyway 2) non-ambiguous is due to reordering => having the timestamp of the last segment there is just skewing things more off than does some good since the ack got triggered by one of the holes (besides some substle issues that would make determining right hole/skb even harder problem). Anyway, it has nothing to do with this change then. I choose to route some abnormal looking cases with goto noop, some could be handled differently (eg., by stopping the walking at that skb but again). In general, they either shouldn't happen at all or are rare enough to make no difference in practice. In theory this change (as whole) could cause some macroscale regression (global) because of cache misses that are taken over the round-trip time but it gets very likely better because of much less (local) cache misses per other write queue walkers and the big recovery clearing cumulative ack. Worth to note that these benefits would be very easy to get also without TSO/GSO being on as long as the data is in pages so that we can merge them. Currently I won't let that happen because DSACK splitting at fragment that would mess up pcounts due to sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets avoided, we have some conditions that can be made less strict. TODO: I will probably have to convert the excessive pointer passing to struct sacktag_state... :-) My testing revealed that considerable amount of skbs couldn't be shifted because they were cloned (most likely still awaiting tx reclaim)... [The rest is considering future work instead since I got repeatably EFAULT to tcpdump's recvfrom when I added pskb_expand_head to deal with clones, so I separated that into another, later patch] ...To counter that, I gave up on the fifth advantage: 5) When growing previous SACK block, less allocs for new skbs are done, basically a new alloc is needed only when new hole is detected and when the previous skb runs out of frags space ...which now only happens of if reclaim is fast enough to dispose the clone before the SACK block comes in (the window is RTT long), otherwise we'll have to alloc some. With clones being handled I got these numbers (will be somewhat worse without that), taken with fine-grained mibs: TCPSackShifted 398 TCPSackMerged 877 TCPSackShiftFallback 320 TCPSACKCOLLAPSEFALLBACKGSO 0 TCPSACKCOLLAPSEFALLBACKSKBBITS 0 TCPSACKCOLLAPSEFALLBACKSKBDATA 0 TCPSACKCOLLAPSEFALLBACKBELOW 0 TCPSACKCOLLAPSEFALLBACKFIRST 1 TCPSACKCOLLAPSEFALLBACKPREVBITS 318 TCPSACKCOLLAPSEFALLBACKMSS 1 TCPSACKCOLLAPSEFALLBACKNOHEAD 0 TCPSACKCOLLAPSEFALLBACKSHIFT 0 TCPSACKCOLLAPSENOOPSEQ 0 TCPSACKCOLLAPSENOOPSMALLPCOUNT 0 TCPSACKCOLLAPSENOOPSMALLLEN 0 TCPSACKCOLLAPSEHOLE 12 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-24 21:20:15 -08:00
Ilpo Järvinen	e1aa680fa4	tcp: move tcp_simple_retransmit to tcp_input Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-24 21:11:55 -08:00
Petr Tesarik	38a7ddffa4	tcp: remove an unnecessary field in struct tcp_skb_cb The urg_ptr field is not used anywhere and is merely confusing. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-13 22:44:11 -08:00
Peter Zijlstra	c57943a1c9	net: wrap sk->sk_backlog_rcv() Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the generic sk_backlog_rcv behaviour. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-07 14:18:42 -07:00
KOVACS Krisztian	a3116ac5c2	tcp: Port redirection support for TCP Current TCP code relies on the local port of the listening socket being the same as the destination address of the incoming connection. Port redirection used by many transparent proxying techniques obviously breaks this, so we have to store the original destination port address. This patch extends struct inet_request_sock and stores the incoming destination port value there. It also modifies the handshake code to use that value as the source port when sending reply packets. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-01 07:46:49 -07:00
David S. Miller	cd07a8ea0d	tcp: Use SKB queue handling interfaces instead of by-hand versions. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-23 00:50:13 -07:00
David S. Miller	d258b4914b	tcp: Use skb_queue_is_last() instead of by-hand version. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-23 00:34:37 -07:00
David S. Miller	43f59c8939	net: Remove __skb_insert() calls outside of skbuff internals. This minor cleanup simplifies later changes which will convert struct sk_buff and friends over to using struct list_head. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-21 21:28:51 -07:00
Ilpo Järvinen	ef9da47c7c	tcp: don't clear retransmit_skb_hint when not necessary Most importantly avoid doing it with cumulative ACK. Not clearing means that we no longer need n^2 processing in resolution of each fast recovery. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-20 21:25:15 -07:00
Ilpo Järvinen	0e1c54c2a4	tcp: reorganize retransmit code loops Both loops are quite similar, so they can be combined with little effort. As a result, forward_skb_hint becomes obsolete as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-20 21:24:21 -07:00
Ilpo Järvinen	006f582c73	tcp: convert retransmit_cnt_hint to seqno Main benefit in this is that we can then freely point the retransmit_skb_hint to anywhere we want to because there's no longer need to know what would be the count changes involve, and since this is really used only as a terminator, unnecessary work is one time walk at most, and if some retransmissions are necessary after that point later on, the walk is not full waste of time anyway. Since retransmit_high must be kept valid, all lost markers must ensure that. Now I also have learned how those "holes" in the rexmittable skbs can appear, mtu probe does them. So I removed the misleading comment as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-20 21:20:20 -07:00
Ilpo Järvinen	64edc2736e	tcp: Partial hint clearing has again become meaningless Ie., the difference between partial and all clearing doesn't exists anymore since the SACK optimizations got dropped by an sacktag rewrite. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-20 21:18:32 -07:00
Gerrit Renker	410e27a49b	This reverts "Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/dccp_exp" as it accentally contained the wrong set of patches. These will be submitted separately. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-09 13:27:22 +02:00
Gerrit Renker	6224877b2c	tcp/dccp: Consolidate common code for RFC 3390 conversion This patch consolidates the code common to TCP and CCID-2: * TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and * CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:39 +02:00
Adam Langley	33ad798c92	tcp: options clean up This should fix the following bugs: * Connections with MD5 signatures produce invalid packets whenever SACK options are included * MD5 signatures are counted twice in the MSS calculations Behaviour changes: * A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK This is because we can't fit any SACK blocks in a packet with MD5 + TS options. There was discussion about disabling SACK rather than TS in order to fit in better with old, buggy kernels, but that was deemed to be unnecessary. * SYNs with MD5 don't include a TS option See above. Additionally, it removes a bunch of duplicated logic for calculating options, which should help avoid these sort of issues in the future. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 00:04:31 -07:00
Adam Langley	49a72dfb88	tcp: Fix MD5 signatures for non-linear skbs Currently, the MD5 code assumes that the SKBs are linear and, in the case that they aren't, happily goes off and hashes off the end of the SKB and into random memory. Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy Polyakov. Also includes a couple of missed route_caps from Stephen's patch in [2]. [1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2 [2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2 Signed-off-by: Adam Langley <agl@imperialviolet.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 00:01:42 -07:00
Pavel Emelyanov	57ef42d59d	mib: put tcp statistics on struct net Proc temporary uses stats from init_net. BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:02:08 -07:00
Pavel Emelyanov	de0744af1f	mib: add net to NET_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:31:16 -07:00
Pavel Emelyanov	5c52ba170f	sock: add net to prot->enter_memory_pressure callback The tcp_enter_memory_pressure calls NET_INC_STATS, but doesn't have where to get the net from. I decided to add a sk argument, not the net itself, only to factor all the required sock_net(sk) calls inside the enter_memory_pressure callback itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:28:10 -07:00
Pavel Emelyanov	cf1100a7a4	mib: add net to TCP_ADD_STATS_USER Now we're done with the TCP_XXX_STATS macros. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:27:38 -07:00
Pavel Emelyanov	74688e487a	mib: add net to TCP_DEC_STATS Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:22:46 -07:00
Pavel Emelyanov	63231bddf6	mib: add net to TCP_INC_STATS_BH Same as before - the sock is always there to get the net from, but there are also some places with the net already saved on the stack. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:22:25 -07:00
Pavel Emelyanov	81cc8a75d9	mib: add net to TCP_INC_STATS Fortunately (almost) all the TCP code has a sock to get the net from :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:22:04 -07:00
Pavel Emelyanov	a9c19329ec	tcp: add net to tcp_mib_init This one sets TCP MIBs after zeroing them, and thus requires the net. The existing single caller can use init_net (temporarily). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:21:42 -07:00
Pavel Emelyanov	f10f84314d	mib: drop unused TCP_XXX_STATS macros TCP_INC_STATS_USER and TCP_ADD_STATS_BH are currently unused. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:21:20 -07:00
Brian Haley	7d06b2e053	net: change proto destroy method to return void Change struct proto destroy function pointer to return void. Noticed by Al Viro. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-14 17:04:49 -07:00
David S. Miller	4ae127d1b6	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/smc911x.c	2008-06-13 20:52:39 -07:00
David S. Miller	ec0a196626	tcp: Revert 'process defer accept as established' changes. This reverts two changesets, `ec3c0982a2` ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and the follow-on bug fix `9ae27e0adb` ("tcp: Fix slab corruption with ipv6 and tcp6fuzz"). This change causes several problems, first reported by Ingo Molnar as a distcc-over-loopback regression where connections were getting stuck. Ilpo Järvinen first spotted the locking problems. The new function added by this code, tcp_defer_accept_check(), only has the child socket locked, yet it is modifying state of the parent listening socket. Fixing that is non-trivial at best, because we can't simply just grab the parent listening socket lock at this point, because it would create an ABBA deadlock. The normal ordering is parent listening socket --> child socket, but this code path would require the reverse lock ordering. Next is a problem noticed by Vitaliy Gusev, he noted: ---------------------------------------- >--- a/net/ipv4/tcp_timer.c >+++ b/net/ipv4/tcp_timer.c >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data) > goto death; > } > >+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) { >+ tcp_send_active_reset(sk, GFP_ATOMIC); >+ goto death; Here socket sk is not attached to listening socket's request queue. tcp_done() will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should release this sk) as socket is not DEAD. Therefore socket sk will be lost for freeing. ---------------------------------------- Finally, Alexey Kuznetsov argues that there might not even be any real value or advantage to these new semantics even if we fix all of the bugs: ---------------------------------------- Hiding from accept() sockets with only out-of-order data only is the only thing which is impossible with old approach. Is this really so valuable? My opinion: no, this is nothing but a new loophole to consume memory without control. ---------------------------------------- So revert this thing for now. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-12 16:34:35 -07:00
YOSHIFUJI Hideaki	9501f97229	tcp md5sig: Let the caller pass appropriate key for tcp_v{4,6}_do_calc_md5_hash(). As we do for other socket/timewait-socket specific parameters, let the callers pass appropriate arguments to tcp_v{4,6}_do_calc_md5_hash(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>	2008-06-12 03:46:30 +09:00
YOSHIFUJI Hideaki	8d26d76dd4	tcp md5sig: Share most of hash calcucaltion bits between IPv4 and IPv6. We can share most part of the hash calculation code because the only difference between IPv4 and IPv6 is their pseudo headers. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>	2008-06-12 02:38:20 +09:00
YOSHIFUJI Hideaki	076fb72233	tcp md5sig: Remove redundant protocol argument. Protocol is always TCP, so remove useless protocol argument. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>	2008-06-12 02:38:19 +09:00
YOSHIFUJI Hideaki	7d5d5525bd	tcp md5sig: Share MD5 Signature option parser between IPv4 and IPv6. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>	2008-06-12 02:38:18 +09:00
Rami Rosen	45d465bc23	ipv4: Remove unused declaration from include/net/tcp.h. - The tcp_unhash() method in /include/net/tcp.h is no more needed, as the unhash method in tcp_prot structure is now inet_unhash (instead of tcp_unhash in the past); see tcp_prot structure in net/ipv4/tcp_ipv4.c. - So, this patch removes tcp_unhash() declaration from include/net/tcp.h Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-10 12:37:42 -07:00
John Heffner	dd9e0dda66	[TCP]: Increase the max_burst threshold from 3 to tp->reordering. This change is necessary to allow cwnd to grow during persistent reordering. Cwnd moderation is applied when in the disorder state and an ack that fills the hole comes in. If the hole was greater than 3 packets, but less than tp->reordering, cwnd will shrink when it should not have. Signed-off-by: John Heffner <jheffner@napa.(none)> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-16 02:29:56 -07:00
David S. Miller	df39e8ba56	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/ehea/ehea_main.c drivers/net/wireless/iwlwifi/Kconfig drivers/net/wireless/rt2x00/rt61pci.c net/ipv4/inet_timewait_sock.c net/ipv6/raw.c net/mac80211/ieee80211_sta.c	2008-04-14 02:30:23 -07:00
Gerrit Renker	7de6c03336	[SKB]: __skb_append = __skb_queue_after This expresses __skb_append in terms of __skb_queue_after, exploiting that __skb_append(old, new, list) = __skb_queue_after(list, old, new). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-14 00:05:09 -07:00
Denis V. Lunev	5f4472c5a6	[TCP]: Remove owner from tcp_seq_afinfo. Move it to tcp_seq_afinfo->seq_fops as should be. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-13 22:13:53 -07:00
Denis V. Lunev	68fcadd16c	[TCP]: Place file operations directly into tcp_seq_afinfo. No need to have separate never-used variable. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-13 22:13:30 -07:00
Denis V. Lunev	9427c4b36b	[TCP]: Move seq_ops from tcp_iter_state to tcp_seq_afinfo. No need to create seq_operations for each instance of 'netstat'. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-13 22:12:13 -07:00
Denis V. Lunev	a4146b1b2c	[TCP]: Replace struct net on tcp_iter_state with seq_net_private. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-13 22:11:14 -07:00
Florian Westphal	4dfc281702	[Syncookies]: Add support for TCP options via timestamps. Allow the use of SACK and window scaling when syncookies are used and the client supports tcp timestamps. Options are encoded into the timestamp sent in the syn-ack and restored from the timestamp echo when the ack is received. Based on earlier work by Glenn Griffin. This patch avoids increasing the size of structs by encoding TCP options into the least significant bits of the timestamp and by not using any 'timestamp offset'. The downside is that the timestamp sent in the packet after the synack will increase by several seconds. changes since v1: don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c and have ipv6/syncookies.c use it. Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if () Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-10 03:12:40 -07:00
Ilpo Järvinen	882bebaaca	[TCP]: tcp_simple_retransmit can cause S+L This fixes Bugzilla #10384 tcp_simple_retransmit does L increment without any checking whatsoever for overflowing S+L when Reno is in use. The simplest scenario I can currently think of is rather complex in practice (there might be some more straightforward cases though). Ie., if mss is reduced during mtu probing, it may end up marking everything lost and if some duplicate ACKs arrived prior to that sacked_out will be non-zero as well, leading to S+L > packets_out, tcp_clean_rtx_queue on the next cumulative ACK or tcp_fastretrans_alert on the next duplicate ACK will fix the S counter. More straightforward (but questionable) solution would be to just call tcp_reset_reno_sack() in tcp_simple_retransmit but it would negatively impact the probe's retransmission, ie., the retransmissions would not occur if some duplicate ACKs had arrived. So I had to add reno sacked_out reseting to CA_Loss state when the first cumulative ACK arrives (this stale sacked_out might actually be the explanation for the reports of left_out overflows in kernel prior to 2.6.23 and S+L overflow reports of 2.6.24). However, this alone won't be enough to fix kernel before 2.6.24 because it is building on top of the commit `1b6d427bb7` ([TCP]: Reduce sacked_out with reno when purging write_queue) to keep the sacked_out from overflowing. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-07 22:33:07 -07:00
Florian Westphal	2051f11fb8	[TCP]: Shrink syncookie_secret by 8 byte. the first u32 copied from syncookie_secret is overwritten by the minute-counter four lines below. After adjusting the destination address, the size of syncookie_secret can be reduced accordingly. AFAICS, the only other user of syncookie_secret[] is the ipv6 syncookie support. Because ipv6 syncookies only grab 44 bytes from syncookie_secret[], this shouldn't affect them in any way. With fixes from Glenn Griffin. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Glenn Griffin <ggriffin.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-23 22:21:28 -07:00
Patrick McManus	ec3c0982a2	[TCP]: TCP_DEFER_ACCEPT updates - process as established Change TCP_DEFER_ACCEPT implementation so that it transitions a connection to ESTABLISHED after handshake is complete instead of leaving it in SYN-RECV until some data arrvies. Place connection in accept queue when first data packet arrives from slow path. Benefits: - established connection is now reset if it never makes it to the accept queue - diagnostic state of established matches with the packet traces showing completed handshake - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be enforced with reasonable accuracy instead of rounding up to next exponential back-off of syn-ack retry. Signed-off-by: Patrick McManus <mcmanus@ducksong.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-21 16:33:01 -07:00
Daniel Lezcano	6f8b13bcb3	[NETNS][IPV6] tcp6 - make proc per namespace Make the proc for tcp6 to be per namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-21 04:14:45 -07:00
Daniel Lezcano	f40c8174d3	[NETNS][IPV4] tcp - make proc handle the network namespaces This patch, like udp proc, makes the proc functions to take care of which namespace the socket belongs. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-21 04:13:54 -07:00
Glenn Griffin	c6aefafb7e	[TCP]: Add IPv6 support to TCP SYN cookies Updated to incorporate Eric's suggestion of using a per cpu buffer rather than allocating on the stack. Just a two line change, but will resend in it's entirety. Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>	2008-03-04 15:18:21 +09:00
Denis V. Lunev	9b0f976f27	[INET]: Remove struct net_proto_family* from _init calls. struct net_proto_family* is not used in icmp[v6]_init, ndisc_init, igmp_init and tcp_v4_init. Remove it. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-02-29 11:13:15 -08:00
Ilpo Järvinen	cea14e0ed6	[TCP]: Uninline tcp_is_cwnd_limited net/ipv4/tcp_cong.c: tcp_reno_cong_avoid \| -65 1 function changed, 65 bytes removed, diff: -65 net/ipv4/arp.c: arp_ignore \| -5 1 function changed, 5 bytes removed, diff: -5 net/ipv4/tcp_bic.c: bictcp_cong_avoid \| -57 1 function changed, 57 bytes removed, diff: -57 net/ipv4/tcp_cubic.c: bictcp_cong_avoid \| -61 1 function changed, 61 bytes removed, diff: -61 net/ipv4/tcp_highspeed.c: hstcp_cong_avoid \| -63 1 function changed, 63 bytes removed, diff: -63 net/ipv4/tcp_hybla.c: hybla_cong_avoid \| -85 1 function changed, 85 bytes removed, diff: -85 net/ipv4/tcp_htcp.c: htcp_cong_avoid \| -57 1 function changed, 57 bytes removed, diff: -57 net/ipv4/tcp_veno.c: tcp_veno_cong_avoid \| -52 1 function changed, 52 bytes removed, diff: -52 net/ipv4/tcp_scalable.c: tcp_scalable_cong_avoid \| -61 1 function changed, 61 bytes removed, diff: -61 net/ipv4/tcp_yeah.c: tcp_yeah_cong_avoid \| -75 1 function changed, 75 bytes removed, diff: -75 net/ipv4/tcp_illinois.c: tcp_illinois_cong_avoid \| -54 1 function changed, 54 bytes removed, diff: -54 net/dccp/ccids/ccid3.c: ccid3_update_send_interval \| -7 ccid3_hc_tx_packet_recv \| +7 2 functions changed, 7 bytes added, 7 bytes removed, diff: +0 net/ipv4/tcp_cong.c: tcp_is_cwnd_limited \| +88 1 function changed, 88 bytes added, diff: +88 built-in.o: 14 functions changed, 95 bytes added, 642 bytes removed, diff: -547 ...Again some gcc artifacts visible as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:48 -08:00
Ilpo Järvinen	490d504693	[TCP]: Uninline tcp_set_state net/ipv4/tcp.c: tcp_close_state \| -226 tcp_done \| -145 tcp_close \| -564 tcp_disconnect \| -141 4 functions changed, 1076 bytes removed, diff: -1076 net/ipv4/tcp_input.c: tcp_fin \| -86 tcp_rcv_state_process \| -164 2 functions changed, 250 bytes removed, diff: -250 net/ipv4/tcp_ipv4.c: tcp_v4_connect \| -209 1 function changed, 209 bytes removed, diff: -209 net/ipv4/arp.c: arp_ignore \| +5 1 function changed, 5 bytes added, diff: +5 net/ipv6/tcp_ipv6.c: tcp_v6_connect \| -158 1 function changed, 158 bytes removed, diff: -158 net/sunrpc/xprtsock.c: xs_sendpages \| -2 1 function changed, 2 bytes removed, diff: -2 net/dccp/ccids/ccid3.c: ccid3_update_send_interval \| +7 1 function changed, 7 bytes added, diff: +7 net/ipv4/tcp.c: tcp_set_state \| +238 1 function changed, 238 bytes added, diff: +238 built-in.o: 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445 I've no explanation why some unrelated changes seem to occur consistently as well (arp_ignore, ccid3_update_send_interval; I checked the arp_ignore asm and it seems to be due to some reordered of operation order causing some extra opcodes to be generated). Still, the benefits are pretty obvious from the codiff's results. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:47 -08:00
Ilpo Järvinen	4828e7f49a	[TCP]: Remove TCPCB_URG & TCPCB_AT_TAIL as unnecessary The snd_up check should be enough. I suspect this has been there to provide a minor optimization in clean_rtx_queue which used to have a small if (!->sacked) block which could skip snd_up check among the other work. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:00:23 -08:00
Ilpo Järvinen	90840defab	[TCP]: Introduce tcp_wnd_end() to reduce line lengths Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:00:22 -08:00
Hideo Aoki	3ab224be6d	[NET] CORE: Introducing new memory accounting interface. This patch introduces new memory accounting functions for each network protocol. Most of them are renamed from memory accounting functions for stream protocols. At the same time, some stream memory accounting functions are removed since other functions do same thing. Renaming: sk_stream_free_skb() -> sk_wmem_free_skb() __sk_stream_mem_reclaim() -> __sk_mem_reclaim() sk_stream_mem_reclaim() -> sk_mem_reclaim() sk_stream_mem_schedule -> __sk_mem_schedule() sk_stream_pages() -> sk_mem_pages() sk_stream_rmem_schedule() -> sk_rmem_schedule() sk_stream_wmem_schedule() -> sk_wmem_schedule() sk_charge_skb() -> sk_mem_charge() Removeing sk_stream_rfree(): consolidates into sock_rfree() sk_stream_set_owner_r(): consolidates into skb_set_owner_r() sk_stream_mem_schedule() The following functions are added. sk_has_account(): check if the protocol supports accounting sk_mem_uncharge(): do the opposite of sk_mem_charge() In addition, to achieve consolidation, updating sk_wmem_queued is removed from sk_mem_charge(). Next, to consolidate memory accounting functions, this patch adds memory accounting calls to network core functions. Moreover, present memory accounting call is renamed to new accounting call. Finally we replace present memory accounting calls with new interface in TCP and SCTP. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:00:18 -08:00
YOSHIFUJI Hideaki	9cb5734e5b	[TCP]: Convert several length variable to unsigned. Several length variables cannot be negative, so convert int to unsigned int. This also allows us to do sane shift operations on those variables. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:59:56 -08:00
Ilpo Järvinen	6859d49475	[TCP]: Abstract tp->highest_sack accessing & point to next skb Pointing to the next skb is necessary to avoid referencing already SACKed skbs which will soon be on a separate list. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:55:46 -08:00
Ilpo Järvinen	234b686070	[TCP]: Add tcp_for_write_queue_from_safe and use it in mtu_probe Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:55:43 -08:00
Ilpo Järvinen	c3a05c6050	[TCP]: Cong.ctrl modules: remove unused good_ack from cong_avoid Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:55:41 -08:00
Ilpo Järvinen	8512430e55	[TCP]: Move FRTO checks out from write queue abstraction funcs Better place exists in update_send_head (other non-queue related adjustments are done there as well) which is the only caller of tcp_advance_send_head (now that the bogus call from mtu_probe is gone). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:55:05 -08:00
Ilpo Järvinen	68f8353b48	[TCP]: Rewrite SACK block processing & sack_recv_cache use Key points of this patch are: - In case new SACK information is advance only type, no skb processing below previously discovered highest point is done - Optimize cases below highest point too since there's no need to always go up to highest point (which is very likely still present in that SACK), this is not entirely true though because I'm dropping the fastpath_skb_hint which could previously optimize those cases even better. Whether that's significant, I'm not too sure. Currently it will provide skipping by walking. Combined with RB-tree, all skipping would become fast too regardless of window size (can be done incrementally later). Previously a number of cases in TCP SACK processing fails to take advantage of costly stored information in sack_recv_cache, most importantly, expected events such as cumulative ACK and new hole ACKs. Processing on such ACKs result in rather long walks building up latencies (which easily gets nasty when window is huge). Those latencies are often completely unnecessary compared with the amount of _new_ information received, usually for cumulative ACK there's no new information at all, yet TCP walks whole queue unnecessary potentially taking a number of costly cache misses on the way, etc.! Since the inclusion of highest_sack, there's a lot information that is very likely redundant (SACK fastpath hint stuff, fackets_out, highest_sack), though there's no ultimate guarantee that they'll remain the same whole the time (in all unearthly scenarios). Take advantage of this knowledge here and drop fastpath hint and use direct access to highest SACKed skb as a replacement. Effectively "special cased" fastpath is dropped. This change adds some complexity to introduce better coveraged "fastpath", though the added complexity should make TCP behave more cache friendly. The current ACK's SACK blocks are compared against each cached block individially and only ranges that are new are then scanned by the high constant walk. For other parts of write queue, even when in previously known part of the SACK blocks, a faster skip function is used (if necessary at all). In addition, whenever possible, TCP fast-forwards to highest_sack skb that was made available by an earlier patch. In typical case, no other things but this fast-forward and mandatory markings after that occur making the access pattern quite similar to the former fastpath "special case". DSACKs are special case that must always be walked. The local to recv_sack_cache copying could be more intelligent w.r.t DSACKs which are likely to be there only once but that is left to a separate patch. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:54:07 -08:00
Ilpo Järvinen	a47e5a988a	[TCP]: Convert highest_sack to sk_buff to allow direct access It is going to replace the sack fastpath hint quite soon... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:54:03 -08:00
Jens Axboe	9c55e01c0c	[TCP]: Splice receive support. Support for network splice receive. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:53:31 -08:00
Ilpo J�rvinen	6e42141009	[TCP] MTUprobe: fix potential sk_send_head corruption When the abstraction functions got added, conversion here was made incorrectly. As a result, the skb may end up pointing to skb which got included to the probe skb and then was freed. For it to trigger, however, skb_transmit must fail sending as well. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 23:24:09 -08:00
Chuck Lever	c1bd24b768	[TCP]: Remove unneeded implicit type cast when calling tcp_minshall_update() The tcp_minshall_update() function is called in exactly one place, and is passed an unsigned integer for the mss_len argument. Make the sign of the argument match the sign of the passed variable in order to eliminate an unneeded implicit type cast and a mixed sign comparison in tcp_minshall_update(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-23 21:27:55 -07:00
David S. Miller	0800f17026	[TCP]: Minor coding style fixup. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:52:13 -07:00
Ilpo Järvinen	b76892051c	[TCP]: Avoid clearing sacktag hint in trivial situations There's no reason to clear the sacktag skb hint when small part of the rexmit queue changes. Account changes (if any) instead when fragmenting/collapsing. RTO/FRTO do not touch SACKED_ACKED bits so no need to discard SACK tag hint at all. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:52:12 -07:00
Ilpo Järvinen	5af4ec236f	[TCP]: clear_all_retrans_hints prefixed by tcp_ In addition, fix its function comment spacing. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>	2007-10-10 16:52:09 -07:00
Ilpo Järvinen	6ff03ac355	[TCP]: tcp_packets_out_inc to tcp_output.c (no callers elsewhere) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:48:28 -07:00
Ilpo Järvinen	e9144bd8da	[TCP]: Remove unnecessary wrapper tcp_packets_out_dec Makes caller side more obvious, there's no need to have a wrapper for this oneliner! Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:48:27 -07:00
Ilpo Järvinen	e60402d0a9	[TCP]: Move sack_ok access to obviously named funcs & cleanup Previously code had IsReno/IsFack defined as macros that were local to tcp_input.c though sack_ok field has user elsewhere too for the same purpose. This changes them to static inlines as preferred according the current coding style and unifies the access to sack_ok across multiple files. Magic bitops of sack_ok for FACK and DSACK are also abstracted to functions with appropriate names. Note: - One sack_ok = 1 remains but that's self explanary, i.e., it enables sack - Couple of !IsReno cases are changed to tcp_is_sack - There were no users for IsDSack => I dropped it Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:48:00 -07:00
Ilpo Järvinen	b9c4595bc4	[TCP]: Don't panic if S+L skb is detected BUG_ON is an overkill. In fact, I was mislead by BUG_TRAP severity (equals to WARN_ON) which is much lower than BUG_ON's (that panics). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:47:59 -07:00
Ilpo Järvinen	1b6d427bb7	[TCP]: Reduce sacked_out with reno when purging write_queue Previously TCP had a transitional state during which reno counted segments that are already below the current window into sacked_out, which is now prevented. In addition, re-try now the unconditional S+L skb catching. This approach conservatively calls just remove_sack and leaves reset_sack() calls alone. The best solution to the whole problem would be to first calculate the new sacked_out fully (this patch does not move reno_sack_reset calls from original sites and thus does not implement this). However, that would require very invasive change to fastretrans_alert (perhaps even slicing it to two halves). Alternatively, all callers of tcp_packets_in_flight (i.e., users that depend on sacked_out) should be postponed until the new sacked_out has been calculated but it isn't any simpler alternative. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:47:58 -07:00
Ilpo Järvinen	005903bc3a	[TCP]: Left out sync->verify (the new meaning of it) & definify Left_out was dropped a while ago, thus leaving verifying consistency of the "left out" as only task for the function in question. Thus make it's name more appropriate. In addition, it is intentionally converted to #define instead of static inline because the location of the invariant failure is the most important thing to have if this ever triggers. I think it would have been helpful e.g. in this case where the location of the failure point had to be based on some quesswork: http://lkml.org/lkml/2007/5/2/464 ...Luckily the guesswork seems to have proved to be correct. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:47:57 -07:00
Ilpo Järvinen	83ae40885f	[TCP]: Add tcp_left_out(tp) "back" to get cleaner looking lines tp->left_out got removed but nothing came to replace it back then (users just did addition by themselves), so add function for users now. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:47:56 -07:00
Ilpo Järvinen	b5860bbac7	[TCP]: Tighten tcp_sock's belt, drop left_out It is easily calculable when needed and user are not that many after all. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:47:55 -07:00
Ilpo Järvinen	af610b4ca1	[TCP]: Add tcp_dec_pcount_approx int variant Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:47:54 -07:00
Ilpo Järvinen	bdf1ee5d3b	[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it No other users exist for tcp_ecn.h. Very few things remain in tcp.h, for most TCP ECN functions callers reside within a single .c file and can be placed there. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:47:54 -07:00
David S. Miller	f8ab18d2d9	[TCP]: Fix MD5 signature handling on big-endian. Based upon a report and initial patch by Peter Lieven. tcp4_md5sig_key and tcp6_md5sig_key need to start with the exact same members as tcp_md5sig_key. Because they are both cast to that type by tcp_v{4,6}_md5_do_lookup(). Unfortunately tcp{4,6}_md5sig_key use a u16 for the key length instead of a u8, which is what tcp_md5sig_key uses. This just so happens to work by accident on little-endian, but on big-endian it doesn't. Instead of casting, just place tcp_md5sig_key as the first member of the address-family specific structures, adjust the access sites, and kill off the ugly casts. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-09-28 15:18:35 -07:00
David S. Miller	3516ffb0fe	[TCP]: Invoke tcp_sendmsg() directly, do not use inet_sendmsg(). As discovered by Evegniy Polyakov, if we try to sendmsg after a connection reset, we can do incredibly stupid things. The core issue is that inet_sendmsg() tries to autobind the socket, but we should never do that for TCP. Instead we should just go straight into TCP's sendmsg() code which will do all of the necessary state and pending socket error checks. TCP's sendpage already directly vectors to tcp_sendpage(), so this merely brings sendmsg() in line with that. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-08-02 19:42:28 -07:00
Stephen Hemminger	30cfd0baf0	[TCP]: congestion control API pass RTT in microseconds This patch changes the API for the callback that is done after an ACK is received. It solves a couple of issues: * Some congestion controls want higher resolution value of RTT (controlled by TCP_CONG_RTT_SAMPLE flag). These don't really want a ktime, but all compute a RTT in microseconds. * Other congestion control could use RTT at jiffies resolution. To keep API consistent the units should be the same for both cases, just the resolution should change. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-31 02:27:57 -07:00
Stephen Hemminger	16751347a0	[TCP]: remove unused argument to cong_avoid op None of the existing TCP congestion controls use the rtt value pased in the ca_ops->cong_avoid interface. Which is lucky because seq_rtt could have been -1 when handling a duplicate ack. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-18 01:46:58 -07:00
Pavel Emelianov	e4fd5da39f	[TCP]: Consolidate checking for tcp orphan count being too big. tcp_out_of_resources() and tcp_close() perform the same checking of number of orphan sockets. Move this code into common place. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-31 01:23:34 -07:00
Ilpo Järvinen	0ec96822d5	[TCP]: Use S+L catcher only with SACK for now TCP has a transitional state when SACK is not in use during which this invariant is temporarily broken. Without SACK, tcp_clean_rtx_queue does not decrement sacked_out. Therefore calls to tcp_sync_left_out before sacked_out is again corrected by tcp_fastretrans_alert can trigger this trap as sacked_out still has couple of segments that are already out of window. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-03 03:30:34 -07:00
Ilpo Järvinen	d551e4541d	[TCP] FRTO: RFC4138 allows Nagle override when new data must be sent This is a corner case where less than MSS sized new data thingie is awaiting in the send queue. For F-RTO to work correctly, a new data segment must be sent at certain point or F-RTO cannot be used at all. RFC4138 allows overriding of Nagle at that point. Implementation uses frto_counter states 2 and 3 to distinguish when Nagle override is needed. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-30 00:58:16 -07:00
Ilpo Järvinen	34588b4c04	[TCP]: Catch skb with S+L bugs earlier SACKED_ACKED and LOST are mutually exclusive with SACK, thus having their sum larger than packets_out is bug with SACK. Eventually these bugs trigger traps in the tcp_clean_rtx_queue with SACK but it's much more informative to do this here. Non-SACK TCP, however, could get more than packets_out duplicate ACKs which each increment sacked_out, so it makes sense to do this kind of limitting for non-SACK TCP but not for SACK enabled one. Perhaps the author had the opposite in mind but did the logic accidently wrong way around? Anyway, the sacked_out incrementer code for non-SACK already deals this issue before calling sync_left_out so this trapping can be done unconditionally. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-30 00:57:33 -07:00
Stephen Hemminger	164891aadf	[TCP]: Congestion control API update. Do some simple changes to make congestion control API faster/cleaner. * use ktime_t rather than timeval * merge rtt sampling into existing ack callback this means one indirect call versus two per ack. * use flags bits to store options/settings Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:29:45 -07:00
Ilpo Järvinen	9e412ba763	[TCP]: Sed magic converts func(sk, tp, ...) -> func(sk, ...) This is (mostly) automated change using magic: sed -e '/struct sock \sk/ N' -e '/struct sock \sk/ N' -e '/struct sock \sk/ N' -e '/struct sock \sk/ N' -e 's\|struct sock \sk,[\n\t ]struct tcp_sock \tp$[^{]\n{\n$\| struct sock \sk\1\tstruct tcp_sock tp = tcp_sk(sk);\n\|g' -e 's\|struct sock \sk, struct tcp_sock \tp\| struct sock \*sk\|g' -e 's\|sk, tp$[^-]$\|sk\1\|g' Fixed four unused variable (tp) warnings that were introduced. In addition, manually added newlines after local variables and tweaked function arguments positioning. $ gcc --version gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1) ... $ codiff -fV built-in.o.old built-in.o.new net/ipv4/route.c: rt_cache_flush \| +14 1 function changed, 14 bytes added net/ipv4/tcp.c: tcp_setsockopt \| -5 tcp_sendpage \| -25 tcp_sendmsg \| -16 3 functions changed, 46 bytes removed net/ipv4/tcp_input.c: tcp_try_undo_recovery \| +3 tcp_try_undo_dsack \| +2 tcp_mark_head_lost \| -12 tcp_ack \| -15 tcp_event_data_recv \| -32 tcp_rcv_state_process \| -10 tcp_rcv_established \| +1 7 functions changed, 6 bytes added, 69 bytes removed, diff: -63 net/ipv4/tcp_output.c: update_send_head \| -9 tcp_transmit_skb \| +19 tcp_cwnd_validate \| +1 tcp_write_wakeup \| -17 __tcp_push_pending_frames \| -25 tcp_push_one \| -8 tcp_send_fin \| -4 7 functions changed, 20 bytes added, 63 bytes removed, diff: -43 built-in.o.new: 18 functions changed, 40 bytes added, 178 bytes removed, diff: -138 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:29:34 -07:00
Andi Kleen	4ac02bab77	[TCP]: Uninline tcp_done(). The function is quite big and has several call sites and nothing to collapse by compiler optimization on inlining. Besides it's nicer to read in a in .c file. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:29:25 -07:00
Herbert Xu	604763722c	[NET]: Treat CHECKSUM_PARTIAL as CHECKSUM_UNNECESSARY When a transmitted packet is looped back directly, CHECKSUM_PARTIAL maps to the semantics of CHECKSUM_UNNECESSARY. Therefore we should treat it as such in the stack. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:28:43 -07:00
Arnaldo Carvalho de Melo	aa8223c7bb	[SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:26 -07:00
David S. Miller	fe067e8ab5	[TCP]: Abstract out all write queue operations. This allows the write queue implementation to be changed, for example, to one which allows fast interval searching. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:02 -07:00
James Morris	9d729f72dc	[NET]: Convert xtime.tv_sec to get_seconds() Where appropriate, convert references to xtime.tv_sec to the get_seconds() helper function. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:23:32 -07:00
Ilpo Järvinen	3cfe3baaf0	[TCP]: Add two new spurious RTO responses to FRTO New sysctl tcp_frto_response is added to select amongst these responses: - Rate halving based; reuses CA_CWR state (default) - Very conservative; used to be the only one available (=1) - Undo cwr; undoes ssthresh and cwnd reductions (=2) The response with rate halving requires a new parameter to tcp_enter_cwr because FRTO has already reduced ssthresh and doing a second reduction there has to be prevented. In addition, to keep things nice on 80 cols screen, a local variable was added. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:23:23 -07:00
John Heffner	886236c124	[TCP]: Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:23:19 -07:00
Ilpo Järvinen	46d0de4ed9	[TCP] FRTO: Entry is allowed only during (New)Reno like recovery This interpretation comes from RFC4138: "If the sender implements some loss recovery algorithm other than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD NOT be entered when earlier fast recovery is underway." I think the RFC means to say (especially in the light of Appendix B) that ...recovery is underway (not just fast recovery) or was underway when it was interrupted by an earlier (F-)RTO that hasn't yet been resolved (snd_una has not advanced enough). Thus, my interpretation is that whenever TCP has ever retransmitted other than head, basic version cannot be used because then the order assumptions which are used as FRTO basis do not hold. NewReno has only the head segment retransmitted at a time. Therefore, walk up to the segment that has not been SACKed, if that segment is not retransmitted nor anything before it, we know for sure, that nothing after the non-SACKed segment should be either. This assumption is valid because TCPCB_EVER_RETRANS does not leave holes but each non-SACKed segment is rexmitted in-order. Check for retrans_out > 1 avoids more expensive walk through the skb list, as we can know the result beforehand: F-RTO will not be allowed. SACKed skb can turn into non-SACked only in the extremely rare case of SACK reneging, in this case we might fail to detect retransmissions if there were them for any other than head. To get rid of that feature, whole rexmit queue would have to be walked (always) or FRTO should be prevented when SACK reneging happens. Of course RTO should still trigger after reneging which makes this issue even less likely to show up. And as long as the response is as conservative as it's now, nothing bad happens even then. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:23:12 -07:00
Ilpo Järvinen	bdaae17da8	[TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c In addition, removed inline. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:23:02 -07:00
Frederik Deweerdt	ba7808eac1	[TCP]: remove tcp header from tcp_v4_check (take #2 ) The tcphdr struct passed to tcp_v4_check is not used, the following patch removes it from the parameter list. This adds the netfilter modifications missing in the patch I sent for rc3-mm1. Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-02-08 12:38:44 -08:00
Gerrit Renker	0d630cc0a6	[TCP]: Use old definition of before This reverts the new (unambiguous) definition of the TCP `before' relation. As pointed out in an example by Herbert Xu, there is existing code which implicitly requires the old definition in order to work correctly. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-01-04 12:25:16 -08:00
Gerrit Renker	9a036b9c33	[TCP]: Fix ambiguity in the `before' relation. While looking at DCCP sequence numbers, I stumbled over a problem with the following definition of before in tcp.h: static inline int before(__u32 seq1, __u32 seq2) { return (__s32)(seq1-seq2) < 0; } Problem: This definition suffers from an an ambiguity, i.e. always before(a, (a + 2^31) % 2^32)) = 1 before((a + 2^31) % 2^32), a) = 1 In text: when the difference between a and b amounts to 2^31, a is always considered `before' b, the function can not decide. The reason is that implicitly 0 is `before' 1 ... 2^31-1 ... 2^31 Solution: There is a simple fix, by defining before in such a way that 0 is no longer `before' 2^31, i.e. 0 `before' 1 ... 2^31-1 By not using the middle between 0 and 2^32, before can be made unambiguous. This is achieved by testing whether seq2-seq1 > 0 (using signed 32-bit arithmetic). I attach a patch to codify this. Also the `after' relation is basically a redefinition of `before', it is now defined as a macro after before. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-22 11:12:01 -08:00
Al Viro	8e5200f540	[NET]: Fix assorted misannotations (from md5 and udplite merges). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:27:16 -08:00
Al Viro	b51655b958	[NET]: Annotate __skb_checksum_complete() and friends. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:23:38 -08:00
Al Viro	6b11687ef0	[NET]: Annotate csum_tcpudp_magic() callers in net/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:23:29 -08:00
YOSHIFUJI Hideaki	cfb6eeb4c8	[TCP]: MD5 Signature Option (RFC2385) support. Based on implementation by Rick Payne. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:22:39 -08:00
Stephen Hemminger	ce7bc3bf15	[TCP]: Restrict congestion control choices. Allow normal users to only choose among a restricted set of congestion control choices. The default is reno and what ever has been configured as default. But the policy can be changed by administrator at any time. For example, to allow any choice: cp /proc/sys/net/ipv4/tcp_available_congestion_control \ /proc/sys/net/ipv4/tcp_allowed_congestion_control Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:21:49 -08:00
Stephen Hemminger	3ff825b28d	[TCP]: Add tcp_available_congestion_control sysctl. Create /proc/sys/net/ipv4/tcp_available_congestion_control that reflects currently available TCP choices. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:21:48 -08:00
Eric Dumazet	72a3effaf6	[NET]: Size listen hash tables using backlog hint We currently allocate a fixed size (TCP_SYNQ_HSIZE=512) slots hash table for each LISTEN socket, regardless of various parameters (listen backlog for example) On x86_64, this means order-1 allocations (might fail), even for 'small' sockets, expecting few connections. On the contrary, a huge server wanting a backlog of 50000 is slowed down a bit because of this fixed limit. This patch makes the sizing of listen hash table a dynamic parameter, depending of : - net.core.somaxconn tunable (default is 128) - net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128) - backlog value given by user application (2nd parameter of listen()) For large allocations (bigger than PAGE_SIZE), we use vmalloc() instead of kmalloc(). We still limit memory allocation with the two existing tunables (somaxconn & tcp_max_syn_backlog). So for standard setups, this patch actually reduce RAM usage. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:21:44 -08:00
Wei Yongjun	3687b1dc6f	[TCP]: SNMPv2 tcpAttemptFails counter error Refer to RFC2012, tcpAttemptFails is defined as following: tcpAttemptFails OBJECT-TYPE SYNTAX Counter32 MAX-ACCESS read-only STATUS current DESCRIPTION "The number of times TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state." ::= { tcp 7 } When I lookup into RFC793, I found that the state change should occured under following condition: 1. SYN-SENT -> CLOSED a) Received ACK,RST segment when SYN-SENT state. 2. SYN-RCVD -> CLOSED b) Received SYN segment when SYN-RCVD state(came from LISTEN). c) Received RST segment when SYN-RCVD state(came from SYN-SENT). d) Received SYN segment when SYN-RCVD state(came from SYN-SENT). 3. SYN-RCVD -> LISTEN e) Received RST segment when SYN-RCVD state(came from LISTEN). In my test, those direct state transition can not be counted to tcpAttemptFails. Signed-off-by: Wei Yongjun <yjwei@nanjing-fnst.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-08-02 13:38:19 -07:00
Herbert Xu	a430a43d08	[NET] gso: Fix up GSO packets with broken checksums Certain subsystems in the stack (e.g., netfilter) can break the partial checksum on GSO packets. Until they're fixed, this patch allows this to work by recomputing the partial checksums through the GSO mechanism. Once they've all been converted to update the partial checksum instead of clearing it, this workaround can be removed. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-07-08 13:34:56 -07:00
Herbert Xu	bcd7611117	[NET]: Generalise TSO-specific bits from skb_setup_caps This patch generalises the TSO-specific bits from sk_setup_caps by adding the sk_gso_type member to struct sock. This makes sk_setup_caps generic so that it can be used by TCPv6 or UFO. The only catch is that whoever uses this must provide a GSO implementation for their protocol which I think is a fair deal :) For now UFO continues to live without a GSO implementation which is OK since it doesn't use the sock caps field at the moment. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-30 14:12:08 -07:00
Herbert Xu	576a30eb64	[NET]: Added GSO header verification When GSO packets come from an untrusted source (e.g., a Xen guest domain), we need to verify the header integrity before passing it to the hardware. Since the first step in GSO is to verify the header, we can reuse that code by adding a new bit to gso_type: SKB_GSO_DODGY. Packets with this bit set can only be fed directly to devices with the corresponding bit NETIF_F_GSO_ROBUST. If the device doesn't have that bit, then the skb is fed to the GSO engine which will allow the packet to be sent to the hardware if it passes the header check. This patch changes the sg flag to a full features flag. The same method can be used to implement TSO ECN support. We simply have to mark packets with CWR set with SKB_GSO_ECN so that only hardware with a corresponding NETIF_F_TSO_ECN can accept them. The GSO engine can either fully segment the packet, or segment the first MTU and pass the rest to the hardware for further segmentation. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-29 16:57:53 -07:00
Herbert Xu	f4c50d990d	[NET]: Add software TSOv4 This patch adds the GSO implementation for IPv4 TCP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-23 02:07:33 -07:00
Herbert Xu	7967168cef	[NET]: Merge TSO/UFO fields in sk_buff Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not going to scale if we add any more segmentation methods (e.g., DCCP). So let's merge them. They were used to tell the protocol of a packet. This function has been subsumed by the new gso_type field. This is essentially a set of netdev feature bits (shifted by 16 bits) that are required to process a specific skb. As such it's easy to tell whether a given device can process a GSO skb: you just have to and the gso_type field and the netdev's features field. I've made gso_type a conjunction. The idea is that you have a base type (e.g., SKB_GSO_TCPV4) that can be modified further to support new features. For example, if we add a hardware TSO type that supports ECN, they would declare NETIF_F_TSO \| NETIF_F_TSO_ECN. All TSO packets with CWR set would have a gso_type of SKB_GSO_TCPV4 \| SKB_GSO_TCPV4_ECN while all other TSO packets would be SKB_GSO_TCPV4. This means that only the CWR packets need to be emulated in software. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-23 02:07:29 -07:00
Linus Torvalds	cee4cca740	Merge git://git.infradead.org/hdrcleanup-2.6 * git://git.infradead.org/hdrcleanup-2.6: (63 commits) [S390] __FD_foo definitions. Switch to __s32 types in joystick.h instead of C99 types for consistency. Add <sys/types.h> to headers included for userspace in <linux/input.h> Move inclusion of <linux/compat.h> out of user scope in asm-x86_64/mtrr.h Remove struct fddi_statistics from user view in <linux/if_fddi.h> Move user-visible parts of drivers/s390/crypto/z90crypt.h to include/asm-s390 Revert include/media changes: Mauro says those ioctls are only used in-kernel(!) Include <linux/types.h> and use __uXX types in <linux/cramfs_fs.h> Use __uXX types in <linux/i2o_dev.h>, include <linux/ioctl.h> too Remove private struct dx_hash_info from public view in <linux/ext3_fs.h> Include <linux/types.h> and use __uXX types in <linux/affs_hardblocks.h> Use __uXX types in <linux/divert.h> for struct divert_blk et al. Use __u32 for elf_addr_t in <asm-powerpc/elf.h>, not u32. It's user-visible. Remove PPP_FCS from user view in <linux/ppp_defs.h>, remove __P mess entirely Use __uXX types in user-visible structures in <linux/nbd.h> Don't use 'u32' in user-visible struct ip_conntrack_old_tuple. Use __uXX types for S390 DASD volume label definitions which are user-visible S390 BIODASDREADCMB ioctl should use __u64 not u64 type. Remove unneeded inclusion of <linux/time.h> from <linux/ufs_fs.h> Fix private integer types used in V4L2 ioctls. ... Manually resolve conflict in include/linux/mtd/physmap.h	2006-06-20 15:10:08 -07:00
David S. Miller	35089bb203	[TCP]: Add tcp_slow_start_after_idle sysctl. A lot of people have asked for a way to disable tcp_cwnd_restart(), and it seems reasonable to add a sysctl to do that. Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-17 21:30:53 -07:00
Stephen Hemminger	72dc5b9225	[TCP]: Minimum congestion window consolidation. Many of the TCP congestion methods all just use ssthresh as the minimum congestion window on decrease. Rather than duplicating the code, just have that be the default if that handle in the ops structure is not set. Minor behaviour change to TCP compound. It probably wants to use this (ssthresh) as lower bound, rather than ssthresh/2 because the latter causes undershoot on loss. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-17 21:29:29 -07:00
Chris Leech	9593782585	[I/OAT]: Add a sysctl for tuning the I/OAT offloaded I/O threshold Any socket recv of less than this ammount will not be offloaded Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-17 21:25:54 -07:00
Chris Leech	0e4b4992b8	[I/OAT]: Rename cleanup_rbuf to tcp_cleanup_rbuf and make non-static Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-17 21:25:50 -07:00
Chris Leech	97fc2f0848	[I/OAT]: Structure changes for TCP recv offload to I/OAT Adds an async_wait_queue and some additional fields to tcp_sock, and a dma_cookie_t to sk_buff. Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-17 21:25:48 -07:00
David Woodhouse	62c4f0a2d5	Don't include linux/config.h from anywhere else in include/ Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2006-04-26 12:56:16 +01:00
David S. Miller	0803dbed7a	[TCP]: Kill unused extern decl for tcp_v4_hash_connecting() Noticed by Alan Menegotto. Signed-off-by: David S. Miller <davem@davemloft.net>	2006-03-31 02:25:46 -08:00
Dmitry Mishin	3fdadf7d27	[NET]: {get\|set}sockopt compatibility layer This patch extends {get\|set}sockopt compatibility layer in order to move protocol specific parts to their place and avoid huge universal net/compat.c file in the future. Signed-off-by: Dmitry Mishin <dim@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-03-20 22:45:21 -08:00
Rick Jones	15d99e02ba	[TCP]: sysctl to allow TCP window > 32767 sans wscale Back in the dark ages, we had to be conservative and only allow 15-bit window fields if the window scale option was not negotiated. Some ancient stacks used a signed 16-bit quantity for the window field of the TCP header and would get confused. Those days are long gone, so we can use the full 16-bits by default now. There is a sysctl added so that we can still interact with such old stacks Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-03-20 22:40:29 -08:00
John Heffner	5d424d5a67	[TCP]: MTU probing Implementation of packetization layer path mtu discovery for TCP, based on the internet-draft currently found at <http://www.ietf.org/internet-drafts/draft-ietf-pmtud-method-05.txt>. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-03-20 17:53:41 -08:00
Stephen Hemminger	40efc6fa17	[TCP]: less inline's TCP inline usage cleanup: * get rid of inline in several places * replace __inline__ with inline where possible * move functions used in one file out of tcp.h * let compiler decide on used once cases On x86_64: text data bss dec hex filename 3594701 648348 567400 4810449 4966d1 vmlinux.orig 3593133 648580 567400 4809113 496199 vmlinux On sparc64: text data bss dec hex filename 2538278 406152 530392 3474822 350586 vmlinux.ORIG `2536382` 406384 530392 3473158 34ff06 vmlinux Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-03 16:03:49 -08:00
Arnaldo Carvalho de Melo	8639a11e23	[TCP]: Don't use __constant_htonl for a non const arg Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-03 13:11:22 -08:00
Arnaldo Carvalho de Melo	6d6ee43e0b	[TWSK]: Introduce struct timewait_sock_ops So that we can share several timewait sockets related functions and make the timewait mini sockets infrastructure closer to the request mini sockets one. Next changesets will take advantage of this, moving more code out of TCP and DCCP v4 and v6 to common infrastructure. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-03 13:10:54 -08:00
Arnaldo Carvalho de Melo	8292a17a39	[ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_ops And move it to struct inet_connection_sock. DCCP will use it in the upcoming changesets. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-03 13:10:38 -08:00
Stephen Hemminger	31f3426904	[TCP]: More spelling fixes. From Joe Perches Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-11-15 15:17:10 -08:00
Stephen Hemminger	6a438bbe68	[TCP]: speed up SACK processing Use "hints" to speed up the SACK processing. Various forms of this have been used by TCP developers (Web100, STCP, BIC) to avoid the 2x linear search of outstanding segments. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-11-10 17:14:59 -08:00
Stephen Hemminger	caa20d9abe	[TCP]: spelling fixes Minor spelling fixes for TCP code. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-11-10 17:13:47 -08:00

... 3 4 5 6 7 ...

498 Commits