# SPDX-License-Identifier: GPL-2.0-only
#
# IP configuration
#
config IP_MULTICAST
	bool "IP: multicasting"
	help
	  This is code for addressing several networked computers at once,
	  enlarging your kernel by about 2 KB. You need multicasting if you
	  intend to participate in the MBONE, a high bandwidth network on top
	  of the Internet which carries audio and video broadcasts. More
	  information about the MBONE is on the WWW at
	  <http://www.savetz.com/mbone/>. For most people, it's safe to say N.

config IP_ADVANCED_ROUTER
	bool "IP: advanced router"
	help
	  If you intend to run your Linux box mostly as a router, i.e. as a
	  computer that forwards and redistributes network packets, say Y; you
	  will then be presented with several options that allow more precise
	  control over the routing process.

	  The answer to this question won't directly affect the kernel:
	  answering N will just cause the configurator to skip all the
	  questions about advanced routing.

	  Note that your box can only act as a router if you enable IP
	  forwarding in your kernel; you can do that by saying Y to "/proc
	  file system support" and "Sysctl support" below and executing the
	  line

	  echo "1" > /proc/sys/net/ipv4/ip_forward

	  at boot time after the /proc file system has been mounted.

	  If you turn on IP forwarding, you should consider the rp_filter, which
	  automatically rejects incoming packets if the routing table entry
	  for their source address doesn't match the network interface they're
	  arriving on. This has security advantages because it prevents
	  IP spoofing; however, it can pose problems if you use
	  asymmetric routing (packets from you to a host take a different path
	  than packets from that host to you) or if you operate a non-routing
	  host which has several IP addresses on different interfaces. To turn
	  rp_filter on, use:

	  echo 1 > /proc/sys/net/ipv4/conf/<device>/rp_filter
	   or
	  echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

	  Note that some distributions enable it in startup scripts.
	  For details about rp_filter strict and loose mode read
	  <file:Documentation/networking/ip-sysctl.rst>.

	  If unsure, say N here.
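
# Illustrative note, not part of the original help text: a sketch of the
# two settings above made persistent, assuming a distribution that applies
# /etc/sysctl.conf (or a file under /etc/sysctl.d/) at boot:
#
#   net.ipv4.ip_forward = 1
#   net.ipv4.conf.all.rp_filter = 1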

config IP_FIB_TRIE_STATS
	bool "FIB TRIE statistics"
	depends on IP_ADVANCED_ROUTER
	help
	  Keep track of statistics on the structure of the FIB TRIE table.
	  Useful for testing and measuring TRIE performance.

config IP_MULTIPLE_TABLES
	bool "IP: policy routing"
	depends on IP_ADVANCED_ROUTER
	select FIB_RULES
	help
	  Normally, a router decides what to do with a received packet based
	  solely on the packet's final destination address. If you say Y here,
	  the Linux router will also be able to take the packet's source
	  address into account. Furthermore, the TOS (Type-Of-Service) field
	  of the packet can be used for routing decisions as well.

	  If you need more information, see the Linux Advanced
	  Routing and Traffic Control documentation at
	  <http://lartc.org/howto/lartc.rpdb.html>

	  If unsure, say N.
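
# Illustrative note, not part of the original help text: a sketch of
# source-based policy routing with iproute2 once this option is enabled.
# The prefix, gateway address and table number 100 are placeholder
# assumptions:
#
#   ip rule add from 192.0.2.0/24 table 100
#   ip route add default via 198.51.100.1 table 100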

config IP_ROUTE_MULTIPATH
	bool "IP: equal cost multipath"
	depends on IP_ADVANCED_ROUTER
	help
	  Normally, the routing tables specify a single action to be taken in
	  a deterministic manner for a given packet. If you say Y here,
	  however, it becomes possible to attach several actions to a packet
	  pattern, in effect specifying several alternative paths to travel
	  for those packets. The router considers all these paths to be of
	  equal "cost" and chooses one of them in a non-deterministic fashion
	  if a matching packet arrives.
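
# Illustrative note, not part of the original help text: a sketch of a
# route with two equal-weight next hops, using iproute2. The gateway
# addresses and device names are placeholder assumptions:
#
#   ip route add default \
#       nexthop via 192.0.2.1 dev eth0 weight 1 \
#       nexthop via 198.51.100.1 dev eth1 weight 1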

config IP_ROUTE_VERBOSE
	bool "IP: verbose route monitoring"
	depends on IP_ADVANCED_ROUTER
	help
	  If you say Y here, which is recommended, then the kernel will print
	  verbose messages regarding the routing, for example warnings about
	  received packets which look strange and could be evidence of an
	  attack or a misconfigured system somewhere. The information is
	  handled by the klogd daemon which is responsible for kernel messages
	  ("man klogd").

config IP_ROUTE_CLASSID
	bool

config IP_PNP
	bool "IP: kernel level autoconfiguration"
	help
	  This enables automatic configuration of IP addresses of devices and
	  of the routing table during kernel boot, based on either information
	  supplied on the kernel command line or by BOOTP or RARP protocols.
	  You need to say Y only for diskless machines requiring network
	  access to boot (in which case you want to say Y to "Root file system
	  on NFS" as well), because all other machines configure the network
	  in their startup scripts.

config IP_PNP_DHCP
	bool "IP: DHCP support"
	depends on IP_PNP
	help
	  If you want your Linux box to mount its whole root file system (the
	  one containing the directory /) from some other computer over the
	  net via NFS and you want the IP address of your computer to be
	  discovered automatically at boot time using the DHCP protocol (a
	  special protocol designed for doing this job), say Y here. In case
	  the boot ROM of your network card was designed for booting Linux and
	  does DHCP itself, providing all necessary information on the kernel
	  command line, you can say N here.

	  If unsure, say Y. Note that if you want to use DHCP, a DHCP server
	  must be operating on your network. Read
	  <file:Documentation/admin-guide/nfs/nfsroot.rst> for details.

config IP_PNP_BOOTP
	bool "IP: BOOTP support"
	depends on IP_PNP
	help
	  If you want your Linux box to mount its whole root file system (the
	  one containing the directory /) from some other computer over the
	  net via NFS and you want the IP address of your computer to be
	  discovered automatically at boot time using the BOOTP protocol (a
	  special protocol designed for doing this job), say Y here. In case
	  the boot ROM of your network card was designed for booting Linux and
	  does BOOTP itself, providing all necessary information on the kernel
	  command line, you can say N here. If unsure, say Y. Note that if you
	  want to use BOOTP, a BOOTP server must be operating on your network.
	  Read <file:Documentation/admin-guide/nfs/nfsroot.rst> for details.

config IP_PNP_RARP
	bool "IP: RARP support"
	depends on IP_PNP
	help
	  If you want your Linux box to mount its whole root file system (the
	  one containing the directory /) from some other computer over the
	  net via NFS and you want the IP address of your computer to be
	  discovered automatically at boot time using the RARP protocol (an
	  older protocol which is being obsoleted by BOOTP and DHCP), say Y
	  here. Note that if you want to use RARP, a RARP server must be
	  operating on your network. Read
	  <file:Documentation/admin-guide/nfs/nfsroot.rst> for details.

config NET_IPIP
	tristate "IP: tunneling"
	select INET_TUNNEL
	select NET_IP_TUNNEL
	help
	  Tunneling means encapsulating data of one protocol type within
	  another protocol and sending it over a channel that understands the
	  encapsulating protocol. This particular tunneling driver implements
	  encapsulation of IP within IP, which sounds kind of pointless, but
	  can be useful if you want to make your (or some other) machine
	  appear on a different network than it physically is, or to use
	  mobile-IP facilities (allowing laptops to seamlessly move between
	  networks without changing their IP addresses).

	  Saying Y to this option will produce two modules ( = code which can
	  be inserted in and removed from the running kernel whenever you
	  want). Most people won't need this and can say N.

config NET_IPGRE_DEMUX
	tristate "IP: GRE demultiplexer"
	help
	  This is a helper module that demultiplexes GRE packets based on the
	  GRE version field. Required by the ip_gre and pptp modules.

config NET_IP_TUNNEL
	tristate
	select DST_CACHE
	select GRO_CELLS
	default n

config NET_IPGRE
	tristate "IP: GRE tunnels over IP"
	depends on (IPV6 || IPV6=n) && NET_IPGRE_DEMUX
	select NET_IP_TUNNEL
	help
	  Tunneling means encapsulating data of one protocol type within
	  another protocol and sending it over a channel that understands the
	  encapsulating protocol. This particular tunneling driver implements
	  GRE (Generic Routing Encapsulation) and at this time allows
	  encapsulation of IPv4 or IPv6 over existing IPv4 infrastructure.
	  This driver is useful if the other endpoint is a Cisco router: Cisco
	  likes GRE much better than the other Linux tunneling driver ("IP
	  tunneling" above). In addition, GRE allows multicast redistribution
	  through the tunnel.
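
# Illustrative note, not part of the original help text: a sketch of
# creating a GRE tunnel device with iproute2 once this driver is built.
# The device name gre1 and both endpoint addresses are placeholder
# assumptions:
#
#   ip tunnel add gre1 mode gre local 192.0.2.1 remote 198.51.100.1 ttl 255
#   ip link set gre1 up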

config NET_IPGRE_BROADCAST
	bool "IP: broadcast GRE over IP"
	depends on IP_MULTICAST && NET_IPGRE
	help
	  One application of GRE/IP is to construct a broadcast WAN (Wide Area
	  Network), which looks like a normal Ethernet LAN (Local Area
	  Network), but can be distributed all over the Internet. If you want
	  to do that, say Y here and to "IP multicast routing" below.

config IP_MROUTE_COMMON
	bool
	depends on IP_MROUTE || IPV6_MROUTE

config IP_MROUTE
	bool "IP: multicast routing"
	depends on IP_MULTICAST
	select IP_MROUTE_COMMON
	help
	  This is used if you want your machine to act as a router for IP
	  packets that have several destination addresses. It is needed on the
	  MBONE, a high bandwidth network on top of the Internet which carries
	  audio and video broadcasts. In order to do that, you would most
	  likely run the program mrouted. If you haven't heard about it, you
	  don't need it.

config IP_MROUTE_MULTIPLE_TABLES
	bool "IP: multicast policy routing"
	depends on IP_MROUTE && IP_ADVANCED_ROUTER
	select FIB_RULES
	help
	  Normally, a multicast router runs a userspace daemon and decides
	  what to do with a multicast packet based on the source and
	  destination addresses. If you say Y here, the multicast router
	  will also be able to take interfaces and packet marks into
	  account and run multiple instances of userspace daemons
	  simultaneously, each one handling a single table.

	  If unsure, say N.

config IP_PIMSM_V1
	bool "IP: PIM-SM version 1 support"
	depends on IP_MROUTE
	help
	  Kernel side support for Sparse Mode PIM (Protocol Independent
	  Multicast) version 1. This multicast routing protocol is used widely
	  because Cisco supports it. You need special software to use it
	  (pimd-v1). Please see <http://netweb.usc.edu/pim/> for more
	  information about PIM.

	  Say Y if you want to use PIM-SM v1. Note that you can say N here if
	  you just want to use Dense Mode PIM.

config IP_PIMSM_V2
	bool "IP: PIM-SM version 2 support"
	depends on IP_MROUTE
	help
	  Kernel side support for Sparse Mode PIM version 2. In order to use
	  this, you need an experimental routing daemon supporting it (pimd or
	  gated-5). This routing protocol is not used widely, so say N unless
	  you want to play with it.

config SYN_COOKIES
	bool "IP: TCP syncookie support"
	help
	  Normal TCP/IP networking is open to an attack known as "SYN
	  flooding". This denial-of-service attack prevents legitimate remote
	  users from being able to connect to your computer during an ongoing
	  attack and requires very little work from the attacker, who can
	  operate from anywhere on the Internet.

	  SYN cookies provide protection against this type of attack. If you
	  say Y here, the TCP/IP stack will use a cryptographic challenge
	  protocol known as "SYN cookies" to enable legitimate users to
	  continue to connect, even when your machine is under attack. There
	  is no need for the legitimate users to change their TCP/IP software;
	  SYN cookies work transparently to them. For technical information
	  about SYN cookies, check out <http://cr.yp.to/syncookies.html>.

	  If you are SYN flooded, the source address reported by the kernel is
	  likely to have been forged by the attacker; it is only reported as
	  an aid in tracing the packets to their actual source and should not
	  be taken as absolute truth.

	  SYN cookies may prevent correct error reporting on clients when the
	  server is really overloaded. If this happens frequently, it is
	  better to turn them off.

	  If you say Y here, you can disable SYN cookies at run time by
	  saying Y to "/proc file system support" and
	  "Sysctl support" below and executing the command

	  echo 0 > /proc/sys/net/ipv4/tcp_syncookies

	  after the /proc file system has been mounted.

	  If unsure, say N.

config NET_IPVTI
	tristate "Virtual (secure) IP: tunneling"
	depends on IPV6 || IPV6=n
	select INET_TUNNEL
	select NET_IP_TUNNEL
	select XFRM
	help
	  Tunneling means encapsulating data of one protocol type within
	  another protocol and sending it over a channel that understands the
	  encapsulating protocol. This can be used with xfrm mode tunnel to give
	  the notion of a secure tunnel for IPsec and then run a routing
	  protocol on top.

config NET_UDP_TUNNEL
	tristate
	select NET_IP_TUNNEL
	default n

config NET_FOU
	tristate "IP: Foo (IP protocols) over UDP"
	select XFRM
	select NET_UDP_TUNNEL
	help
	  Foo over UDP allows any IP protocol to be directly encapsulated
	  over UDP, including tunnels (IPIP, GRE, SIT). By encapsulating in
	  UDP, network mechanisms and optimizations for UDP (such as ECMP
	  and RSS) can be leveraged to provide better service.

config NET_FOU_IP_TUNNELS
	bool "IP: FOU encapsulation of IP tunnels"
	depends on NET_IPIP || NET_IPGRE || IPV6_SIT
	select NET_FOU
	help
	  Allow configuration of FOU or GUE encapsulation for IP tunnels.
	  When this option is enabled, IP tunnels can be configured to use
	  FOU or GUE encapsulation.

config INET_AH
	tristate "IP: AH transformation"
	select XFRM_AH
	help
	  Support for IPsec AH (Authentication Header).

	  AH can be used with various authentication algorithms. Besides
	  enabling AH support itself, this option enables the generic
	  implementations of the algorithms that RFC 8221 lists as MUST be
	  implemented. If you need any other algorithms, you'll need to enable
	  them in the crypto API. You should also enable accelerated
	  implementations of any needed algorithms when available.

	  If unsure, say Y.

config INET_ESP
	tristate "IP: ESP transformation"
	select XFRM_ESP
	help
	  Support for IPsec ESP (Encapsulating Security Payload).

	  ESP can be used with various encryption and authentication algorithms.
	  Besides enabling ESP support itself, this option enables the generic
	  implementations of the algorithms that RFC 8221 lists as MUST be
	  implemented. If you need any other algorithms, you'll need to enable
	  them in the crypto API. You should also enable accelerated
	  implementations of any needed algorithms when available.

	  If unsure, say Y.

config INET_ESP_OFFLOAD
	tristate "IP: ESP transformation offload"
	depends on INET_ESP
	select XFRM_OFFLOAD
	default n
	help
	  Support for ESP transformation offload. This makes sense
	  only if this system really does IPsec and wants to do it
	  with high throughput. A typical desktop system does not
	  need it, even if it does IPsec.

	  If unsure, say N.

config INET_ESPINTCP
	bool "IP: ESP in TCP encapsulation (RFC 8229)"
	depends on XFRM && INET_ESP
	select STREAM_PARSER
	select NET_SOCK_MSG
	select XFRM_ESPINTCP
	help
	  Support for RFC 8229 encapsulation of ESP and IKE over
	  TCP/IPv4 sockets.

	  If unsure, say N.

config INET_IPCOMP
	tristate "IP: IPComp transformation"
	select INET_XFRM_TUNNEL
	select XFRM_IPCOMP
	help
	  Support for IP Payload Compression Protocol (IPComp) (RFC3173),
	  typically needed for IPsec.

	  If unsure, say Y.

config INET_XFRM_TUNNEL
	tristate
	select INET_TUNNEL
	default n

config INET_TUNNEL
	tristate
	default n

config INET_DIAG
	tristate "INET: socket monitoring interface"
	default y
	help
	  Support for INET (TCP, DCCP, etc.) socket monitoring interface used by
	  native Linux tools such as ss. ss is included in iproute2, currently
	  downloadable at:

	    http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2

	  If unsure, say Y.

config INET_TCP_DIAG
	depends on INET_DIAG
	def_tristate INET_DIAG

config INET_UDP_DIAG
	tristate "UDP: socket monitoring interface"
	depends on INET_DIAG && (IPV6 || IPV6=n)
	default n
	help
	  Support for UDP socket monitoring interface used by the ss tool.
	  If unsure, say Y.

config INET_RAW_DIAG
	tristate "RAW: socket monitoring interface"
	depends on INET_DIAG && (IPV6 || IPV6=n)
	default n
	help
	  Support for RAW socket monitoring interface used by the ss tool.
	  If unsure, say Y.

config INET_DIAG_DESTROY
	bool "INET: allow privileged process to administratively close sockets"
	depends on INET_DIAG
	default n
	help
	  Provides a SOCK_DESTROY operation that allows privileged processes
	  (e.g., a connection manager or a network administration tool such as
	  ss) to close sockets opened by other processes. Closing a socket in
	  this way interrupts any blocking read/write/connect operations on
	  the socket and causes future socket calls to behave as if the socket
	  had been disconnected.
	  If unsure, say N.

menuconfig TCP_CONG_ADVANCED
	bool "TCP: advanced congestion control"
	help
	  Support for selection of various TCP congestion control
	  modules.

	  Nearly all users can safely say no here, and a safe default
	  selection will be made (CUBIC with new Reno as a fallback).

	  If unsure, say N.
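
# Illustrative note, not part of the original help text: a sketch of how
# the congestion control algorithm can be inspected and switched at run
# time, assuming sysctl support and that the chosen module (cubic here,
# as an example) is built in or loaded:
#
#   sysctl net.ipv4.tcp_available_congestion_control
#   sysctl -w net.ipv4.tcp_congestion_control=cubic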

if TCP_CONG_ADVANCED

config TCP_CONG_BIC
	tristate "Binary Increase Congestion (BIC) control"
	default m
	help
	  BIC-TCP is a sender-side only change that ensures a linear RTT
	  fairness under large windows while offering both scalability and
	  bounded TCP-friendliness. The protocol combines two schemes
	  called additive increase and binary search increase. When the
	  congestion window is large, additive increase with a large
	  increment ensures linear RTT fairness as well as good
	  scalability. Under small congestion windows, binary search
	  increase provides TCP friendliness.
	  See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/

config TCP_CONG_CUBIC
	tristate "CUBIC TCP"
	default y
	help
	  This is version 2.0 of BIC-TCP which uses a cubic growth function
	  among other techniques.
	  See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/cubic-paper.pdf

config TCP_CONG_WESTWOOD
	tristate "TCP Westwood+"
	default m
	help
	  TCP Westwood+ is a sender-side only modification of the TCP Reno
	  protocol stack that optimizes the performance of TCP congestion
	  control. It is based on end-to-end bandwidth estimation to set
	  congestion window and slow start threshold after a congestion
	  episode. Using this estimation, TCP Westwood+ adaptively sets a
	  slow start threshold and a congestion window which takes into
	  account the bandwidth used at the time congestion is experienced.
	  TCP Westwood+ significantly increases fairness with respect to TCP
	  Reno in wired networks and throughput over wireless links.

config TCP_CONG_HTCP
	tristate "H-TCP"
	default m
	help
	  H-TCP is a sender-side only modification of the TCP Reno
	  protocol stack that optimizes the performance of TCP
	  congestion control for high speed network links. It uses a
	  modeswitch to change the alpha and beta parameters of TCP Reno
	  based on network conditions and in a way so as to be fair with
	  other Reno and H-TCP flows.

config TCP_CONG_HSTCP
	tristate "High Speed TCP"
	default n
	help
	  Sally Floyd's High Speed TCP (RFC 3649) congestion control.
	  A modification to TCP's congestion control mechanism for use
	  with large congestion windows. A table indicates how much to
	  increase the congestion window by when an ACK is received.
	  For more detail see http://www.icir.org/floyd/hstcp.html
|
2005-06-24 03:24:58 +08:00
|
|
|
|
[TCP]: Add TCP Hybla congestion control module.
TCP Hybla congestion avoidance.
- "In heterogeneous networks, TCP connections that incorporate a
terrestrial or satellite radio link are greatly disadvantaged with
respect to entirely wired connections, because of their longer round
trip times (RTTs). To cope with this problem, a new TCP proposal, the
TCP Hybla, is presented and discussed in the paper[1]. It stems from an
analytical evaluation of the congestion window dynamics in the TCP
standard versions (Tahoe, Reno, NewReno), which suggests the necessary
modifications to remove the performance dependence on RTT.[...]"[1]
[1]: Carlo Caini, Rosario Firrincieli, "TCP Hybla: a TCP enhancement for
heterogeneous networks",
International Journal of Satellite Communications and Networking
Volume 22, Issue 5, Pages 547-566. September 2004.
Signed-off-by: Daniele Lacamera (root at danielinux.net)
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
config TCP_CONG_HYBLA
	tristate "TCP-Hybla congestion control algorithm"
	default n
	help
	  TCP-Hybla is a sender-side only change that eliminates penalization of
	  long-RTT, large-bandwidth connections, such as those involving
	  satellite legs, especially when sharing a common bottleneck with
	  normal terrestrial connections.

config TCP_CONG_VEGAS
	tristate "TCP Vegas"
	default n
	help
	  TCP Vegas is a sender-side only change to TCP that anticipates
	  the onset of congestion by estimating the bandwidth. TCP Vegas
	  adjusts the sending rate by modifying the congestion
	  window. TCP Vegas should provide less packet loss, but it is
	  not as aggressive as TCP Reno.

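To make the idea of anticipating congestion from a bandwidth estimate
more concrete, here is a minimal sketch of the classic Vegas decision
rule: compare the expected rate (window over the smallest RTT seen)
with the actual rate (window over the current RTT) and nudge the window
by one segment. The function name and the thresholds alpha = 2 and
beta = 4 segments are assumptions for illustration, not taken from the
help text above.

```python
def vegas_adjust(cwnd: int, base_rtt: float, rtt: float,
                 alpha: int = 2, beta: int = 4) -> int:
    """Return the next congestion window (in segments) for one RTT."""
    expected = cwnd / base_rtt   # rate if nothing were queued
    actual = cwnd / rtt          # rate actually achieved this RTT
    # Estimated number of this flow's segments sitting in queues.
    diff = (expected - actual) * base_rtt
    if diff < alpha:
        return cwnd + 1          # path looks underused: grow linearly
    if diff > beta:
        return max(2, cwnd - 1)  # queue is building: back off early
    return cwnd                  # inside the target band: hold steady


print(vegas_adjust(20, 0.010, 0.010))  # no queueing delay observed
print(vegas_adjust(20, 0.010, 0.020))  # RTT has doubled
```

Because the window shrinks as soon as delay grows, before any loss, a
flow using this rule yields to loss-based flows such as Reno, which is
the "not as aggressive" trade-off the help text mentions.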
config TCP_CONG_NV
	tristate "TCP NV"
	default n
	help
	  TCP NV is a follow-up to TCP Vegas. It has been modified to deal with
	  10G networks, measurement noise introduced by LRO, GRO and interrupt
	  coalescence. In addition, it will decrease its cwnd multiplicatively
	  instead of linearly.

	  Note that in general congestion avoidance (cwnd decreased when the
	  number of queued packets grows) cannot coexist with congestion control
	  (cwnd decreased only when there is packet loss) due to fairness issues.
	  One scenario when they can coexist safely is when the RTTs of the CA
	  flows are much smaller than the RTTs of the CC flows.

	  For further details see http://www.brakmo.org/networking/tcp-nv/

config TCP_CONG_SCALABLE
	tristate "Scalable TCP"
	default n
	help
	  Scalable TCP is a sender-side only change to TCP which uses a
	  MIMD (multiplicative increase, multiplicative decrease) congestion
	  control algorithm. It has some nice scaling properties, though it
	  is known to have fairness issues.
	  See http://www.deneholme.net/tom/scalable/

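The MIMD behaviour can be sketched in a few lines. The constants below
(add 0.01 segments per ACK, keep 7/8 of the window on loss) follow the
values proposed in the Scalable TCP paper; treat them, the function
names, and the floor of two segments as assumptions for this example
rather than a description of the kernel module.

```python
AI_PER_ACK = 0.01    # window increase per ACK, in segments (assumed)
MD_FACTOR = 0.875    # fraction of the window kept after a loss (assumed)


def on_ack(cwnd: float) -> float:
    """Grow the window by a fixed amount for every ACK received."""
    return cwnd + AI_PER_ACK


def on_loss(cwnd: float) -> float:
    """Shrink the window by a fixed fraction when a loss is detected."""
    return max(2.0, cwnd * MD_FACTOR)


# One RTT delivers roughly cwnd ACKs, so per-RTT growth is about 1%
# of the window regardless of its size: a multiplicative increase.
cwnd = 100.0
for _ in range(100):
    cwnd = on_ack(cwnd)
print(round(cwnd, 6))
print(on_loss(100.0))
```

Since both the increase and the decrease scale with the window,
recovery time after a loss does not depend on the link speed. The same
property means two flows with different windows keep their ratio, which
is one way to see the fairness issue noted in the help text.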
config TCP_CONG_LP
	tristate "TCP Low Priority"
	default n
	help
	  TCP Low Priority (TCP-LP) is a distributed algorithm whose goal is
	  to utilize only the excess network bandwidth as compared to the
	  ``fair share`` of bandwidth as targeted by TCP.
	  See http://www-ece.rice.edu/networks/TCP-LP/

config TCP_CONG_VENO
	tristate "TCP Veno"
	default n
	help
	  TCP Veno is a sender-side only enhancement of TCP to obtain better
	  throughput over wireless networks. TCP Veno makes use of state
	  distinguishing to circumvent the difficult judgment of the packet loss
	  type. TCP Veno reduces the congestion window less in response to
	  random packet loss.
	  See <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1177186>

config TCP_CONG_YEAH
	tristate "YeAH TCP"
	select TCP_CONG_VEGAS
	default n
	help
	  YeAH-TCP is a sender-side high-speed enabled TCP congestion control
	  algorithm, which uses a mixed loss/delay approach to compute the
	  congestion window. Its design goals target high efficiency,
	  internal, RTT and Reno fairness, and resilience to link loss, while
	  keeping the load on network elements as low as possible.

	  For further details look here:
	    http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf

config TCP_CONG_ILLINOIS
	tristate "TCP Illinois"
	default n
	help
	  TCP-Illinois is a sender-side modification of TCP Reno for
	  high speed long delay links. It uses round-trip-time to
	  adjust the alpha and beta parameters to achieve a higher average
	  throughput and maintain fairness.

	  For further details see:
	    http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html

net: tcp: add DCTCP congestion control algorithm
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are, e.g.,
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:
W := (1 - (alpha / 2)) * W
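
The two update rules above can be written out as a small sketch. It
mirrors the formulas in this message only; the function names are
hypothetical, the default g = 1/16 is the fixed gain mentioned later in
this message, and floating point is used here purely for readability.

```python
def update_alpha(alpha: float, marked_acks: int, total_acks: int,
                 g: float = 1 / 16) -> float:
    """Once per RTT: alpha := (1 - g) * alpha + g * F, with F = X / Y."""
    if total_acks <= 0:
        return alpha                    # nothing observed, keep estimate
    f = marked_acks / total_acks        # fraction of marked ACKs
    return (1 - g) * alpha + g * f


def reduce_window(cwnd: float, alpha: float) -> float:
    """On congestion: W := (1 - alpha / 2) * W."""
    return (1 - alpha / 2) * cwnd


# Mild congestion: 1 of 10 ACKs marked in each of five RTTs.
alpha = 0.0
for _ in range(5):
    alpha = update_alpha(alpha, marked_acks=1, total_acks=10)
print(alpha, reduce_window(100.0, alpha))

# Persistent congestion (alpha = 1) halves the window, as Reno would.
print(reduce_window(100.0, 1.0))
```

With few marks alpha stays small and the window is trimmed only
slightly; with every ACK marked alpha tends to 1 and the reduction
approaches a halving.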
The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law, and thus act in
*proportion* to the extent of congestion, not its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. As a result,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:
This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.
The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).
This test was repeated multiple times. The results below are from a
single test; other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)
For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):
            CUBIC      DCTCP
  Total     3227       25
  Mean      169.842    1.316
  Median    183        1
  Max       207        5
  Min       123        0
  Stddev    28.991     1.600
Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are dwarfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.
2) Throughput (per flow in Mbps):
              CUBIC      DCTCP
  Mean        521.684    521.895
  Median      464        523
  Max         776        527
  Min         403        519
  Stddev      105.891    2.601
  Fairness    0.962      0.999
Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.
Mean throughput is nearly identical because even though cubic flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.
3) Latency (in ms):
            CUBIC      DCTCP
  Mean      4.0088     0.04219
  Median    4.055      0.0395
  Max       4.2        0.085
  Min       3.32       0.028
  Stddev    0.1666     0.01064
Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.
The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
which incurs the maximum queue latency (more buffer memory will lead
to high latency.) DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.
  CUBIC                          DCTCP
  Seconds   Flow 1   Flow 2      Seconds   Flow 1   Flow 2
  0         9.93     0           0         9.92     0
  0.5       9.87     0           0.5       9.86     0
  1         8.73     2.25        1         6.46     4.88
  1.5       7.29     2.8         1.5       4.9      4.99
  2         6.96     3.1         2         4.92     4.94
  2.5       6.67     3.34        2.5       4.93     5
  3         6.39     3.57        3         4.92     4.99
  3.5       6.24     3.75        3.5       4.94     4.74
  4         6        3.94        4         5.34     4.71
  4.5       5.88     4.09        4.5       4.99     4.97
  5         5.27     4.98        5         4.83     5.01
  5.5       4.93     5.04        5.5       4.89     4.99
  6         4.9      4.99        6         4.92     5.04
  6.5       4.93     5.1         6.5       4.91     4.97
  7         4.28     5.8         7         4.97     4.97
  7.5       4.62     4.91        7.5       4.99     4.82
  8         5.05     4.45        8         5.16     4.76
  8.5       5.93     4.09        8.5       4.94     4.98
  9         5.73     4.2         9         4.92     5.02
  9.5       5.62     4.32        9.5       4.87     5.03
  10        6.12     3.2         10        4.91     5.01
  10.5      6.91     3.11        10.5      4.87     5.04
  11        8.48     0           11        8.49     4.94
  11.5      9.87     0           11.5      9.9      0
SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.
  Competing Flows                  1 |    2 |    4 |    8 | 16
  ------------------------------------------------------------
  Mean Connection Probability      1 | 0.67 | 0.45 | 0.28 |  0
  Median Connection Probability    1 | 0.65 | 0.45 | 0.25 |  0
As the number of competing flows moves beyond 1, the connection
probability drops rapidly.
Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
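
For a feel of the bound K > 1/7 * C * RTT, the sketch below evaluates
its right-hand side in packets. The function name, the 1500-byte packet
size, and the 100 microsecond RTT in the example are assumptions chosen
for illustration; none of them comes from this message.

```python
def min_marking_threshold_packets(link_bps: float, rtt_s: float,
                                  packet_bytes: int = 1500) -> float:
    """Evaluate (C * RTT) / 7, expressed in packets.

    link_bps:     link capacity C in bits per second
    rtt_s:        round-trip time in seconds (assumed value)
    packet_bytes: packet size used to convert bytes to packets (assumed)
    """
    bdp_packets = link_bps * rtt_s / (8 * packet_bytes)
    return bdp_packets / 7


# 10 Gbps link with an assumed RTT of 100 microseconds.
print(round(min_marking_threshold_packets(10e9, 100e-6), 2))
```

Under these assumed inputs the bound comes out at about 12 packets, so
the 65 packets quoted above for 10Gbps sits well above it. A longer RTT
raises the bound in proportion.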
In above tests, for each switch port, traffic was segregated into two
queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
For more details, we refer you to the paper [2], section 3.
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
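
Although no application changes are needed, an application that wants
to opt a single connection in (rather than change the system-wide
sysctl) can use the Linux TCP_CONGESTION socket option. The following
is a minimal sketch; the helper name is hypothetical, and the fallback
value 13 is the Linux option number, used only when the Python build
does not export the constant.

```python
import socket

# Linux socket option for selecting a congestion control algorithm.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)


def try_set_congestion_control(sock: socket.socket, name: str) -> bool:
    """Ask the kernel to use the named algorithm for this socket.

    Returns False instead of raising when the algorithm is unknown,
    not loaded, or not permitted for the calling user.
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION,
                        name.encode("ascii"))
    except OSError:
        return False
    return True


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
if try_set_congestion_control(sock, "dctcp"):
    print("socket will use dctcp")
else:
    print("dctcp not available here; keeping the system default")
sock.close()
```

Both endpoints still need DCTCP and the switches still need ECN
marking; selecting the algorithm on one socket does not change that.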
Changes in the CA framework code are minimal, and the DCTCP algorithm
operates on mechanisms that are already available in most silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the option that it can be chosen carefully
to a different value by the user.
In case DCTCP is being used and ECN support on peer site is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
ss {-4,-6} -t -i diag interface:
... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
reordering:101 rcv_space:29200
... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
325.5Mbps rcv_rtt:1.5 rcv_space:29200
More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
config TCP_CONG_DCTCP
	tristate "DataCenter TCP (DCTCP)"
	default n
	help
	  DCTCP leverages Explicit Congestion Notification (ECN) in the network to
	  provide multi-bit feedback to the end hosts. It is designed to provide:

	  - High burst tolerance (incast due to partition/aggregate),
	  - Low latency (short flows, queries),
	  - High throughput (continuous data updates, large file transfers) with
	    commodity, shallow-buffered switches.
net: tcp: add DCTCP congestion control algorithm

This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which was first published at SIGCOMM 2010 [2], with a
follow-up analysis at SIGMETRICS 2011 [3] (and, more recently, as an
informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are, e.g.,
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages, e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows, e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to meet the following three requirements:

  * High burst tolerance (incast due to partition/aggregate)
  * Low latency (short flows, queries)
  * High throughput (continuous data updates, large file
    transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are marked when the internal queue length
exceeds a threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so once per RTT, F and alpha are updated as follows:

  F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
  alpha := (1 - g) * alpha + g * F, where g is a smoothing constant

The resulting alpha (iow: probability that switch queue is congested)
is then used to adaptively decrease the congestion window W:

  W := (1 - (alpha / 2)) * W
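
The per-RTT update and window reduction above can be sketched as follows.
This is a minimal floating-point illustration, not the kernel code; the
function names are hypothetical, and the default g = 1/16 is the gain
value this message quotes from the paper.

```python
def update_alpha(alpha, marked_acks, total_acks, g=1.0 / 16):
    """One per-RTT update of the moving average alpha."""
    # F is the fraction of ACKs that carried a congestion mark this RTT.
    # With no ACKs at all, treat the fraction as zero (an assumption).
    f = marked_acks / total_acks if total_acks else 0.0
    return (1 - g) * alpha + g * f


def reduce_window(cwnd, alpha):
    """Shrink the congestion window in proportion to alpha."""
    return (1 - alpha / 2) * cwnd


if __name__ == "__main__":
    alpha = 0.0
    # Ten RTTs in which a quarter of the ACKs are marked.
    for _ in range(10):
        alpha = update_alpha(alpha, marked_acks=25, total_acks=100)
    print("alpha:", alpha, "new cwnd:", reduce_window(100, alpha))
```

With alpha at 1 the window is halved, as with a conventional ECN
response; with alpha near 0 the window is barely reduced.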
DCTCP uses ECN both for marking packets on the switch side and for
receiving those marks on the host side.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4].
It thereby derives multibit feedback from the information present in
the single-bit sequence of marks in its control law, and thus acts in
*proportion* to the extent of congestion, not its *presence*.

Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. As a result,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, so DCTCP largely eliminates TCP incast collapse problems.

The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress test and analysis in a data center. Below is
a short summary of our TCP incast tests with iperf, compared to CUBIC:

This test measured DCTCP throughput and latency and compared them with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.

The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).

This test was repeated multiple times. The results below are from a
single test; other tests are similar. (DCTCP results were extremely
consistent; CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)

For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):

            CUBIC            DCTCP
  Total     3227             25
  Mean      169.842          1.316
  Median    183              1
  Max       207              5
  Min       123              0
  Stddev    28.991           1.600

Timeout data was taken by measuring the net change in the "other TCP
timeouts" counter reported by netstat -s. As a result, the timeout
measurements above are not restricted to the test traffic, and we
believe that it is likely that all of the "DCTCP timeouts" are actually
timeouts for non-test traffic. We report them nevertheless. CUBIC will
also include some non-test timeouts, but they are dwarfed by bona fide
test traffic timeouts for CUBIC. Clearly DCTCP does an excellent job of
preventing TCP timeouts. DCTCP reduces timeouts by at least two orders
of magnitude and may well have eliminated them in this scenario.
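
As a quick arithmetic check of the timeout figures, the per-flow means
and the overall reduction factor can be recomputed from the totals and
the 19 senders stated in the test description. This is only a sketch;
the helper name is hypothetical.

```python
SENDERS = 19  # number of senders given in the test description


def per_flow_mean(total_timeouts, flows=SENDERS):
    """Average number of timeouts per flow."""
    return total_timeouts / flows


cubic_mean = per_flow_mean(3227)
dctcp_mean = per_flow_mean(25)
reduction_factor = 3227 / 25

if __name__ == "__main__":
    print(round(cubic_mean, 3), round(dctcp_mean, 3), round(reduction_factor, 1))
```

The recomputed means agree with the "Mean" row of the table, and the
ratio of the totals is above 100, which is consistent with the "two
orders of magnitude" statement.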

2) Throughput (per flow in Mbps):

            CUBIC            DCTCP
  Mean      521.684          521.895
  Median    464              523
  Max       776              527
  Min       403              519
  Stddev    105.891          2.601
  Fairness  0.962            0.999

Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. With CUBIC, many flows
experience TCP timeouts, which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean, predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.

Mean throughput is nearly identical because even though CUBIC flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best-case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. In situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed-out flows and throughput will drop dramatically.
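
This message does not say how the "Fairness" row was computed. Assuming
it is Jain's fairness index (an assumption, not confirmed by the text),
the index can be sketched as below, either from per-flow throughputs or
from the summary statistics, treating "Stddev" as a sample standard
deviation over 19 flows (also an assumption). Function names are
hypothetical.

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(throughputs)
    total = sum(throughputs)
    return total * total / (n * sum(x * x for x in throughputs))


def jain_from_summary(mean, sample_stddev, n):
    """Same index rebuilt from mean, sample stddev and flow count."""
    # Convert the sample variance to a population variance first.
    pop_var = sample_stddev ** 2 * (n - 1) / n
    return mean ** 2 / (mean ** 2 + pop_var)


if __name__ == "__main__":
    print("CUBIC:", jain_from_summary(521.684, 105.891, 19))
    print("DCTCP:", jain_from_summary(521.895, 2.601, 19))
```

Under these assumptions the CUBIC summary reproduces the reported 0.962.
The DCTCP summary gives a value just below 1, slightly above the
reported 0.999, so the table may have been truncated or computed from
the raw per-flow data in a different way.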

3) Latency (in ms):

            CUBIC            DCTCP
  Mean      4.0088           0.04219
  Median    4.055            0.0395
  Max       4.2              0.085
  Min       3.32             0.028
  Stddev    0.1666           0.01064

Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping measurements between the single
sender/receiver pair.

The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer,
which incurs the maximum queue latency (more buffer memory leads to
higher latency). DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP, even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.

  CUBIC                      DCTCP
  Seconds  Flow 1  Flow 2    Seconds  Flow 1  Flow 2
  0        9.93    0         0        9.92    0
  0.5      9.87    0         0.5      9.86    0
  1        8.73    2.25      1        6.46    4.88
  1.5      7.29    2.8       1.5      4.9     4.99
  2        6.96    3.1       2        4.92    4.94
  2.5      6.67    3.34      2.5      4.93    5
  3        6.39    3.57      3        4.92    4.99
  3.5      6.24    3.75      3.5      4.94    4.74
  4        6       3.94      4        5.34    4.71
  4.5      5.88    4.09      4.5      4.99    4.97
  5        5.27    4.98      5        4.83    5.01
  5.5      4.93    5.04      5.5      4.89    4.99
  6        4.9     4.99      6        4.92    5.04
  6.5      4.93    5.1       6.5      4.91    4.97
  7        4.28    5.8       7        4.97    4.97
  7.5      4.62    4.91      7.5      4.99    4.82
  8        5.05    4.45      8        5.16    4.76
  8.5      5.93    4.09      8.5      4.94    4.98
  9        5.73    4.2       9        4.92    5.02
  9.5      5.62    4.32      9.5      4.87    5.03
  10       6.12    3.2       10       4.91    5.01
  10.5     6.91    3.11      10.5     4.87    5.04
  11       8.48    0         11       8.49    4.94
  11.5     9.87    0         11.5     9.9     0
SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.

  Competing Flows                 1 |    2 |    4 |    8 |   16
                                  ------------------------------
  Mean Connection Probability     1 | 0.67 | 0.45 | 0.28 |    0
  Median Connection Probability   1 | 0.65 | 0.45 | 0.25 |    0

As the number of competing flows moves beyond 1, the connection
probability drops rapidly.

Enabling DCTCP with this patch requires the following steps:

DCTCP must be running on both the sender and the receiver side in your
data center, i.e.:

  sysctl -w net.ipv4.tcp_congestion_control=dctcp

Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is, e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
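
The bound K > 1/7 * C * RTT can be evaluated with a few lines of
arithmetic. The sketch below is illustrative only: the 100 microsecond
RTT is a hypothetical value, not a figure from this message, and the
1500-byte packet size is an assumption chosen because it matches the
"20 packets (30KB)" figure. Function names are hypothetical.

```python
def min_marking_threshold_bytes(link_bps, rtt_seconds):
    """Lower bound on K in bytes: (1/7) * C * RTT."""
    bandwidth_delay_bytes = link_bps * rtt_seconds / 8
    return bandwidth_delay_bytes / 7


def packets_to_bytes(packets, packet_size=1500):
    """Convert a threshold in packets to bytes (assumed packet size)."""
    return packets * packet_size


if __name__ == "__main__":
    # Hypothetical RTT of 100 microseconds on a 10 Gbps link.
    print(min_marking_threshold_bytes(10e9, 100e-6))
    print(packets_to_bytes(20), packets_to_bytes(65))
```

With the assumed packet size, 20 packets is 30000 bytes and 65 packets
is 97500 bytes, in line with the 30KB and ~100KB figures quoted above.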

In the above tests, for each switch port, traffic was segregated into
two queues. Any packet with a DSCP of 0x01 (or, equivalently, a TOS of
0x04) was placed into the DCTCP queue. All other packets were placed
into the default drop-tail queue. For the DCTCP queue, RED/ECN marking
was enabled, here with a marking threshold of 75 KB. For more details,
we refer you to Section 3 of the paper [2].
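
The DSCP and TOS values above are related through the layout of the
IP TOS byte, where the DSCP occupies the upper six bits and the ECN
field the lower two. The sketch below shows that relationship; the
helper names are hypothetical. It also shows why the value 0x6 passed
to ping -Q in the latency test selects DSCP 0x01 together with an
ECN-capable codepoint.

```python
# ECN field codepoints as defined in RFC 3168.
ECN_NOT_ECT = 0b00
ECN_ECT1 = 0b01
ECN_ECT0 = 0b10
ECN_CE = 0b11


def tos_byte(dscp, ecn=ECN_NOT_ECT):
    """Combine a 6-bit DSCP and a 2-bit ECN field into a TOS byte."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn < 4:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def split_tos(tos):
    """Split a TOS byte back into (dscp, ecn)."""
    return tos >> 2, tos & 0b11


if __name__ == "__main__":
    print(hex(tos_byte(0x01)))            # DSCP 0x01, not ECN-capable
    print(hex(tos_byte(0x01, ECN_ECT0)))  # DSCP 0x01 with ECT(0)
```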

There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.

Changes in the CA framework code are minimal, and the DCTCP algorithm
operates on mechanisms that are already available in most silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the user the option to carefully choose a
different value.
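
A gain of 1/16 corresponds to a right shift by 4 bits, which suggests
an integer form of the alpha update. The following is a sketch under
assumptions that this message does not state: that alpha is stored as
an integer scaled so that 1024 represents 1.0, and that the marked
fraction is measured in bytes. All names are hypothetical, and this is
not presented as the module's actual code.

```python
ALPHA_SCALE_BITS = 10               # assumption: 1024 represents alpha = 1.0
MAX_ALPHA = 1 << ALPHA_SCALE_BITS


def update_alpha_fixed(alpha, ecn_bytes, total_bytes, shift_g=4):
    """Integer version of alpha := (1 - g) * alpha + g * F, g = 2**-shift_g."""
    # (1 - g) * alpha, computed as alpha minus alpha / 2**shift_g.
    alpha -= alpha >> shift_g
    if total_bytes > 0 and ecn_bytes > 0:
        # g * F in scaled units: F * 2**(ALPHA_SCALE_BITS - shift_g).
        alpha += (ecn_bytes << (ALPHA_SCALE_BITS - shift_g)) // total_bytes
    return min(alpha, MAX_ALPHA)


if __name__ == "__main__":
    alpha = 0
    for _ in range(5):
        alpha = update_alpha_fixed(alpha, ecn_bytes=500, total_bytes=1000)
    print(alpha, alpha / MAX_ALPHA)
```

Using a shift keeps the update free of divisions by the gain, at the
cost of restricting g to powers of two.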

In case DCTCP is being used and ECN support on the peer is off,
DCTCP falls back after the three-way handshake (3WHS) to operate in
normal TCP Reno mode.

ss {-4,-6} -t -i diag interface:

  ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
      ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
      send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
      reordering:101 rcv_space:29200

  ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
      cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
      325.5Mbps rcv_rtt:1.5 rcv_space:29200

More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-27 04:37:36 +08:00
|
|
|
|
2019-11-21 21:28:35 +08:00
All switches in the data center network running DCTCP must support
ECN marking and be configured for marking when reaching defined switch
buffer thresholds. The default ECN marking threshold heuristic for
DCTCP on switches is 20 packets (30KB) at 1Gbps, and 65 packets
(~100KB) at 10Gbps, but might need further careful tweaking.
2014-09-27 04:37:36 +08:00
|
|
|
|
2019-11-21 21:28:35 +08:00
For further details see:
  http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
net: tcp: add DCTCP congestion control algorithm
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are i.e.
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:
W := (1 - (alpha / 2)) * W
The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law. And thus act in
*proportion* to the extent of congestion, not its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. Resulting,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:
This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.
The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).
This test was repeated multiple times. Below shows the results from a
single test. Other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)
For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):
CUBIC DCTCP
Total 3227 25
Mean 169.842 1.316
Median 183 1
Max 207 5
Min 123 0
Stddev 28.991 1.600
Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are drawfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.
2) Throughput (per flow in Mbps):
CUBIC DCTCP
Mean 521.684 521.895
Median 464 523
Max 776 527
Min 403 519
Stddev 105.891 2.601
Fairness 0.962 0.999
Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.
Mean throughput is nearly identical because even though cubic flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.
3) Latency (in ms):
CUBIC DCTCP
Mean 4.0088 0.04219
Median 4.055 0.0395
Max 4.2 0.085
Min 3.32 0.028
Stddev 0.1666 0.01064
Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.
The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
which incurs the maximum queue latency (more buffer memory will lead
to high latency.) DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.
         CUBIC                     DCTCP
Seconds  Flow 1  Flow 2   Seconds  Flow 1  Flow 2
0        9.93    0        0        9.92    0
0.5      9.87    0        0.5      9.86    0
1        8.73    2.25     1        6.46    4.88
1.5      7.29    2.8      1.5      4.9     4.99
2        6.96    3.1      2        4.92    4.94
2.5      6.67    3.34     2.5      4.93    5
3        6.39    3.57     3        4.92    4.99
3.5      6.24    3.75     3.5      4.94    4.74
4        6       3.94     4        5.34    4.71
4.5      5.88    4.09     4.5      4.99    4.97
5        5.27    4.98     5        4.83    5.01
5.5      4.93    5.04     5.5      4.89    4.99
6        4.9     4.99     6        4.92    5.04
6.5      4.93    5.1      6.5      4.91    4.97
7        4.28    5.8      7        4.97    4.97
7.5      4.62    4.91     7.5      4.99    4.82
8        5.05    4.45     8        5.16    4.76
8.5      5.93    4.09     8.5      4.94    4.98
9        5.73    4.2      9        4.92    5.02
9.5      5.62    4.32     9.5      4.87    5.03
10       6.12    3.2      10       4.91    5.01
10.5     6.91    3.11     10.5     4.87    5.04
11       8.48    0        11       8.49    4.94
11.5     9.87    0        11.5     9.9     0
SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.
Competing Flows                   1 |    2 |    4 |    8 |   16
---------------------------------------------------------------
Mean Connection Probability       1 | 0.67 | 0.45 | 0.28 |    0
Median Connection Probability     1 | 0.65 | 0.45 | 0.25 |    0
As the number of competing flows moves beyond 1, the connection
probability drops rapidly.
Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default heuristic for the ECN
marking threshold (K) on the switch is, e.g., 20 packets (30KB) at
1Gbps and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
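As a rough illustration of the bound K > 1/7 * C * RTT quoted above, the
following sketch converts a link speed and a round-trip time into a lower
bound on K in packets. The function name, the 1500-byte packet size (which
matches 20 packets being about 30KB) and the 100 microsecond RTT in the
usage lines are assumptions made for the example; they are not taken from
the tests described here.

```python
def min_marking_threshold_packets(link_bps, rtt_seconds, packet_bytes=1500):
    """Lower bound on K in packets: K > (C * RTT) / 7, with C in packets/s.

    packet_bytes=1500 is an assumed packet size for this illustration.
    """
    packets_per_second = link_bps / (8 * packet_bytes)
    return packets_per_second * rtt_seconds / 7


if __name__ == "__main__":
    # Hypothetical data center RTT of 100 microseconds at two link speeds.
    for name, bps in (("1Gbps", 1e9), ("10Gbps", 10e9)):
        bound = min_marking_threshold_packets(bps, 100e-6)
        print(f"{name}: K must exceed about {bound:.2f} packets")
```

The bound scales linearly with both link speed and RTT, so the threshold
actually configured on a switch depends on the RTT of the deployment.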
In the above tests, for each switch port, traffic was segregated into
two queues. For any packet with a DSCP of 0x01 - or equivalently a TOS
of 0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here with a marking threshold of 75 KB.
For more details, we refer you to section 3 of the paper [2].
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
Changes in the CA framework code are minimal, and the DCTCP algorithm
operates on mechanisms that are already available in most silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave open the option for the user to carefully
choose a different value.
In case DCTCP is being used and ECN support on the peer side is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
ss {-4,-6} -t -i diag interface:
... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
reordering:101 rcv_space:29200
... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
325.5Mbps rcv_rtt:1.5 rcv_space:29200
More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-27 04:37:36 +08:00
config TCP_CONG_CDG
	tristate "CAIA Delay-Gradient (CDG)"
	default n
	help
	  CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
	  the TCP sender in order to:

	  o Use the delay gradient as a congestion signal.
	  o Back off with an average probability that is independent of the RTT.
	  o Coexist with flows that use loss-based congestion control.
	  o Tolerate packet loss unrelated to congestion.

	  For further details see:
	    D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
	    delay gradients." In Networking 2011. Preprint: http://goo.gl/No3vdg

tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".
BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.
BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.
The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.
In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.
While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.
In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.
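The windowed max-filter and min-filter described above can be sketched as
follows. This is a minimal illustration using a monotonic deque, not the
implementation used by this commit; the class name, the window lengths and
the sample values in the usage lines are all assumptions made for the
example.

```python
import operator
from collections import deque


class WindowedFilter:
    """Track the best sample seen within the last `window` time units.

    `better(a, b)` returns True when sample a should replace sample b:
    operator.ge gives a max filter, operator.le gives a min filter.
    """

    def __init__(self, window, better):
        self.window = window
        self.better = better
        self.samples = deque()  # (time, value); the best value is at the left

    def update(self, now, value):
        # Drop older samples that the new sample dominates.
        while self.samples and self.better(value, self.samples[-1][1]):
            self.samples.pop()
        self.samples.append((now, value))
        # Expire samples that have aged out of the window.
        while self.samples[0][0] <= now - self.window:
            self.samples.popleft()
        return self.samples[0][1]


if __name__ == "__main__":
    # Hypothetical windows: 10 round trips for bandwidth, 10 seconds for RTT.
    max_bw = WindowedFilter(window=10, better=operator.ge)
    min_rtt = WindowedFilter(window=10, better=operator.le)
    for round_no, bw_mbps in enumerate([5, 3, 8, 6]):
        print("bandwidth estimate:", max_bw.update(round_no, bw_mbps))
    for t, rtt_ms in [(0, 40), (1, 50), (2, 30)]:
        print("propagation delay estimate:", min_rtt.update(t, rtt_ms))
```

The max filter ignores delivery-rate samples taken while the flow was not
filling the pipe, and the min filter ignores RTT samples inflated by
queueing, which is why the two estimates are filtered in opposite
directions.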
Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.
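A small worked sketch of the two controls described above, under stated
assumptions: the function names, the cwnd gain of 2, the 1448-byte MSS and
the 4-packet floor are illustrative choices, not values given in this
commit message. The path used in the usage lines (10 Mbps, 40 ms) is the
last-mile example quoted later in this message.

```python
import math


def bdp_bytes(bottleneck_bw_bps, min_rtt_s):
    """Bandwidth-delay product of the modeled pipe, in bytes."""
    return bottleneck_bw_bps * min_rtt_s / 8


def pacing_rate_bps(bottleneck_bw_bps, pacing_gain):
    """Primary control: pace at a gain times the estimated bandwidth."""
    return pacing_gain * bottleneck_bw_bps


def cwnd_packets(bottleneck_bw_bps, min_rtt_s, cwnd_gain=2.0, mss=1448,
                 min_cwnd=4):
    """Secondary control: a small multiple of the BDP, in packets.

    cwnd_gain, mss and min_cwnd are assumed values for this sketch.
    """
    target = cwnd_gain * bdp_bytes(bottleneck_bw_bps, min_rtt_s) / mss
    return max(min_cwnd, math.ceil(target))


if __name__ == "__main__":
    bw, rtt = 10e6, 0.040
    print("BDP (bytes):", bdp_bytes(bw, rtt))
    print("pacing rate at gain 1.25 (bps):", pacing_rate_bps(bw, 1.25))
    print("cwnd (packets):", cwnd_packets(bw, rtt))
```

Because the cwnd is capped at a multiple of the BDP, the data in flight
beyond one BDP, and therefore the queue the flow can build at the
bottleneck, stays bounded even when the pacing rate overshoots.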
When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.
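The STARTUP exit condition described above ("the estimated bandwidth has
stopped growing") could be sketched like this. The class name, the 25%
growth threshold and the three-round patience are assumptions chosen for
illustration; the commit message itself does not state them.

```python
class FullPipeDetector:
    """Decide when the bandwidth estimate has stopped growing."""

    def __init__(self, growth_threshold=1.25, rounds_without_growth=3):
        # Both defaults are assumed values for this sketch.
        self.growth_threshold = growth_threshold
        self.rounds_needed = rounds_without_growth
        self.baseline_bw = 0.0
        self.stalled_rounds = 0

    def on_round_end(self, estimated_bw):
        """Return True once the pipe is estimated to be full."""
        if estimated_bw >= self.baseline_bw * self.growth_threshold:
            # Still growing: remember the new level and reset the count.
            self.baseline_bw = estimated_bw
            self.stalled_rounds = 0
            return False
        self.stalled_rounds += 1
        return self.stalled_rounds >= self.rounds_needed


if __name__ == "__main__":
    detector = FullPipeDetector()
    # Bandwidth doubles each round, then plateaus near 86 (arbitrary units).
    for bw in [10, 20, 40, 80, 85, 86, 86]:
        if detector.on_round_end(bw):
            print("pipe estimated full at", bw, "- leave STARTUP, enter DRAIN")
```

Requiring several stalled rounds rather than one makes the decision less
sensitive to a single noisy delivery-rate sample.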
Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that was created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).
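The PROBE_BW cycle described above (pace faster, pace slower, then cruise)
can be pictured as a repeating sequence of pacing gains. The eight-phase
tuple below is an assumption made for illustration, as are the names; the
commit message only describes the three kinds of phase, not their gains or
durations.

```python
from itertools import cycle, islice

# Assumed gain cycle: one probing phase, one draining phase, six cruising.
PROBE_BW_GAINS = (1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)


def pacing_schedule(estimated_bw_bps, phases):
    """Pacing rates for the next `phases` phases of the gain cycle."""
    gains = islice(cycle(PROBE_BW_GAINS), phases)
    return [gain * estimated_bw_bps for gain in gains]


if __name__ == "__main__":
    for rate in pacing_schedule(10e6, 8):
        print(f"{rate / 1e6:.2f} Mbps")
```

With these assumed gains the average over one full cycle is 1.0, so the
long-run sending rate matches the estimated bandwidth while still leaving
room to discover newly available capacity.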
BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale. Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.
Example performance results, to illustrate the difference between BBR
and CUBIC:
Resilience to random loss (e.g. from shallow buffers):
Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
Low latency with the bloated buffers common in today's last-mile links:
Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
buffer. Both fully utilize the bottleneck bandwidth, but BBR
achieves this with a median RTT 25x lower (43 ms instead of 1.09
secs).
Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.
Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:
https://groups.google.com/forum/#!forum/bbr-dev
NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
enabled, since pacing is integral to the BBR design and
implementation. BBR without pacing would not function properly, and
may incur unnecessarily high packet loss rates.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 11:39:23 +08:00
config TCP_CONG_BBR
	tristate "BBR TCP"
	default n
	help
	  BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to
	  maximize network utilization and minimize queues. It builds an explicit
	  model of the bottleneck delivery rate and path round-trip
	  propagation delay. It tolerates packet loss and delay unrelated to
	  congestion. It can operate over LAN, WAN, cellular, wifi, or cable
	  modem links. It can coexist with flows that use loss-based congestion
	  control, and can operate with shallow buffers, deep buffers,
	  bufferbloat, policers, or AQM schemes that do not provide a delay
	  signal. It requires the fq ("Fair Queue") pacing packet scheduler.

choice
	prompt "Default TCP congestion control"
	default DEFAULT_CUBIC
	help
	  Select the TCP congestion control that will be used by default
	  for all connections.

	config DEFAULT_BIC
		bool "Bic" if TCP_CONG_BIC=y

	config DEFAULT_CUBIC
		bool "Cubic" if TCP_CONG_CUBIC=y

	config DEFAULT_HTCP
		bool "Htcp" if TCP_CONG_HTCP=y

	config DEFAULT_HYBLA
		bool "Hybla" if TCP_CONG_HYBLA=y

	config DEFAULT_VEGAS
		bool "Vegas" if TCP_CONG_VEGAS=y

	config DEFAULT_VENO
		bool "Veno" if TCP_CONG_VENO=y

	config DEFAULT_WESTWOOD
		bool "Westwood" if TCP_CONG_WESTWOOD=y

net: tcp: add DCTCP congestion control algorithm
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which was first published at SIGCOMM 2010 [2],
with a follow-up analysis at SIGMETRICS 2011 [3] (and, more
recently, as an informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are, e.g.,
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages, e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows, e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to meet the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are marked when the internal queue length
exceeds a threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F and alpha are updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (in other words, the estimated probability that
the switch queue is congested) is then used to adaptively decrease
the congestion window W:
W := (1 - (alpha / 2)) * W
DCTCP uses ECN as the means both for marking packets on the switch
side and for receiving those marks at the end hosts.
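The update rules above can be sketched in a few lines. This is an
illustrative floating-point model, not the code added by this commit: the
function names and the two-packet window floor are assumptions, while the
gain g = 1/16 is the constant from the paper mentioned elsewhere in this
message.

```python
def update_alpha(alpha, marked_acks, total_acks, g=1.0 / 16):
    """One per-RTT update: alpha := (1 - g) * alpha + g * F, with F = X / Y."""
    if total_acks == 0:
        return alpha  # nothing was ACKed in this RTT; keep the old estimate
    fraction = marked_acks / total_acks
    return (1.0 - g) * alpha + g * fraction


def reduce_window(cwnd, alpha, min_cwnd=2):
    """Apply W := (1 - alpha / 2) * W; min_cwnd is an assumed floor."""
    return max(min_cwnd, int(cwnd * (1.0 - alpha / 2.0)))


if __name__ == "__main__":
    alpha = 0.0
    # Ten RTTs in which a quarter of the ACKs carry a congestion mark.
    for _ in range(10):
        alpha = update_alpha(alpha, marked_acks=25, total_acks=100)
    print(f"alpha after 10 RTTs: {alpha:.4f}")
    print("cwnd 100 ->", reduce_window(100, alpha))
```

When every ACK is marked, alpha tends to 1 and the window is halved as in
conventional TCP; when only a few ACKs are marked, alpha stays small and
the reduction is correspondingly gentle.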
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4].
It can thus derive multibit feedback from the information present in
the single-bit sequence of marks in its control law, and act in
*proportion* to the extent of congestion, not its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. As a result,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress test and analysis in a data center. Below is
a short summary of our TCP incast tests with iperf, compared to CUBIC:
This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.
The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).
This test was repeated multiple times. The results below are from a
single test; other tests are similar. (DCTCP results were extremely
consistent; CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)
For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):
        CUBIC    DCTCP
Total   3227     25
Mean    169.842  1.316
Median  183      1
Max     207      5
Min     123      0
Stddev  28.991   1.600
Timeout data is taken by measuring the net change in the netstat -s
"other TCP timeouts" counter. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are dwarfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.
2) Throughput (per flow in Mbps):
          CUBIC    DCTCP
Mean      521.684  521.895
Median    464      523
Max       776      527
Min       403      519
Stddev    105.891  2.601
Fairness  0.962    0.999
Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.
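The message does not define the Fairness row in the throughput table. A minimal sketch, assuming it is Jain's fairness index over the per-flow throughputs (that reading is an assumption, and `jain_fairness` is a hypothetical helper name):

```python
def jain_fairness(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(throughputs)
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero throughput")
    return (total * total) / (n * squares)


# Illustrative inputs only, not the measured per-flow data.
equal_share = jain_fairness([520.0] * 19)              # all 19 flows identical
one_flow_starved = jain_fairness([520.0] * 18 + [0.0])  # one flow gets nothing
```

With identical flows the index is 1; starving one of 19 flows lowers it to 18/19. Under this definition a larger spread in per-flow throughput always lowers the index, which matches the direction of the CUBIC and DCTCP values reported above.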
Mean throughput is nearly identical because even though CUBIC flows
suffer TCP timeouts, other flows step in and fill the unused
bandwidth. Note that this test is something of a best-case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. In situations where the receiver issues
requests and then waits for all flows to complete, flows cannot fill
in for timed-out flows and throughput will drop dramatically.
3) Latency (in ms):
          CUBIC     DCTCP
Mean      4.0088    0.04219
Median    4.055     0.0395
Max       4.2       0.085
Min       3.32      0.028
Stddev    0.1666    0.01064
Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping measurements between that single
sender-receiver pair.
The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer,
which incurs the maximum queue latency (more buffer memory leads to
higher latency). DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a reduction in latency of two
orders of magnitude with DCTCP, even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.
         CUBIC                       DCTCP
Seconds  Flow 1  Flow 2     Seconds  Flow 1  Flow 2
0        9.93    0          0        9.92    0
0.5      9.87    0          0.5      9.86    0
1        8.73    2.25       1        6.46    4.88
1.5      7.29    2.8        1.5      4.9     4.99
2        6.96    3.1        2        4.92    4.94
2.5      6.67    3.34       2.5      4.93    5
3        6.39    3.57       3        4.92    4.99
3.5      6.24    3.75       3.5      4.94    4.74
4        6       3.94       4        5.34    4.71
4.5      5.88    4.09       4.5      4.99    4.97
5        5.27    4.98       5        4.83    5.01
5.5      4.93    5.04       5.5      4.89    4.99
6        4.9     4.99       6        4.92    5.04
6.5      4.93    5.1        6.5      4.91    4.97
7        4.28    5.8        7        4.97    4.97
7.5      4.62    4.91       7.5      4.99    4.82
8        5.05    4.45       8        5.16    4.76
8.5      5.93    4.09       8.5      4.94    4.98
9        5.73    4.2        9        4.92    5.02
9.5      5.62    4.32       9.5      4.87    5.03
10       6.12    3.2        10       4.91    5.01
10.5     6.91    3.11       10.5     4.87    5.04
11       8.48    0          11       8.49    4.94
11.5     9.87    0          11.5     9.9     0
SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.
Competing Flows                  1 |    2 |    4 |    8 |   16
                                 ------------------------------
Mean Connection Probability      1 | 0.67 | 0.45 | 0.28 |    0
Median Connection Probability    1 | 0.65 | 0.45 | 0.25 |    0
As the number of competing flows moves beyond 1, the connection
probability drops rapidly.
Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default heuristic for the ECN
marking threshold (K) on the switch is, e.g., 20 packets (30KB) at
1Gbps and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
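The bound K > 1/7 * C * RTT can be worked through numerically. The sketch below is illustrative only: the 100 microsecond RTT and the 1500-byte packet size are assumptions, not values from this message, and `min_marking_threshold` is a hypothetical helper name.

```python
import math


def min_marking_threshold(capacity_bps, rtt_s, packet_bytes=1500):
    """Lower bound from K > (1/7) * C * RTT, returned as (bytes, packets)."""
    bdp_bytes = capacity_bps * rtt_s / 8.0   # bandwidth-delay product in bytes
    k_bytes = bdp_bytes / 7.0                # the bound itself, in bytes
    # Smallest whole number of packets strictly above the bound.
    k_packets = math.floor(k_bytes / packet_bytes) + 1
    return k_bytes, k_packets


# Hypothetical link: 10 Gbps with a 100 microsecond RTT.
k_bytes, k_packets = min_marking_threshold(10e9, 100e-6)
```

For this hypothetical link the bound is roughly 17.9 KB, or 12 full-size packets. The 20-packet and 65-packet figures quoted above are recommended defaults and leave margin above such a bound; they are not derived from these assumed inputs.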
In the above tests, for each switch port, traffic was segregated into
two queues. Any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - was placed into the DCTCP queue. All other packets were placed
into the default drop-tail queue. For the DCTCP queue, RED/ECN marking
was enabled, here with a marking threshold of 75 KB. For more details,
we refer you to section 3 of the paper [2].
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
Changes in the CA framework code are minimal, and the DCTCP algorithm
operates on mechanisms that are already available in most silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave open the option for the user to carefully
choose a different value.
If DCTCP is in use and ECN support on the peer side is off, DCTCP
falls back after the 3WHS to operate in normal TCP Reno mode.
ss {-4,-6} -t -i diag interface:
... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
reordering:101 rcv_space:29200
... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
325.5Mbps rcv_rtt:1.5 rcv_space:29200
More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-27 04:37:36 +08:00
|
|
|
config DEFAULT_DCTCP
|
|
|
|
bool "DCTCP" if TCP_CONG_DCTCP=y
|
|
|
|
|
2015-06-11 01:08:17 +08:00
|
|
|
config DEFAULT_CDG
|
|
|
|
bool "CDG" if TCP_CONG_CDG=y
|
|
|
|
|
tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".
BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.
BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.
The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately outdated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.
In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.
While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.
In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.
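The two windowed filters can be sketched as follows. This is a simplified illustration, not the kernel's implementation: the `WindowedFilter` class is hypothetical, and the window lengths (10 round trips for bandwidth, 10 seconds for RTT) are assumptions that this message does not state.

```python
from collections import deque


class WindowedFilter:
    """Keeps (time, value) samples and reports the best one in a sliding window."""

    def __init__(self, window, better):
        self.window = window    # window length, in the caller's time unit
        self.better = better    # max for bandwidth, min for RTT
        self.samples = deque()

    def update(self, now, value):
        self.samples.append((now, value))
        # Drop samples that are older than the window.
        while now - self.samples[0][0] > self.window:
            self.samples.popleft()
        return self.better(v for _, v in self.samples)


bw_filter = WindowedFilter(window=10, better=max)     # time unit: round trips
rtt_filter = WindowedFilter(window=10.0, better=min)  # time unit: seconds

# Delivery-rate samples (arbitrary units), one per round trip.
bw_estimates = [bw_filter.update(rnd, bw)
                for rnd, bw in enumerate([5.0, 9.0, 7.0, 6.0])]
# RTT samples in seconds, with their arrival times.
rtt_estimates = [rtt_filter.update(t, rtt)
                 for t, rtt in [(0.0, 0.050), (1.0, 0.042), (2.0, 0.047)]]
```

The max-filter holds on to the highest recent delivery rate, so a momentary dip does not lower the bandwidth estimate; the min-filter holds on to the lowest recent RTT, so queueing delay does not inflate the propagation-delay estimate. Both forget a sample once it ages out of the window.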
Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.
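A rough sketch of how the two controls follow from the model. The cwnd gain of 2, the 4-packet floor, and the 1448-byte MSS are assumptions chosen for illustration, and `bbr_controls` is a hypothetical helper, not a kernel function.

```python
import math


def bbr_controls(btl_bw_bps, min_rtt_s, pacing_gain, cwnd_gain=2.0, mss_bytes=1448):
    """Return (pacing rate in bits/s, cwnd in packets) from the path model."""
    # Primary control: pace at a gain-scaled multiple of the bandwidth estimate.
    pacing_rate_bps = pacing_gain * btl_bw_bps
    # Secondary control: cap data in flight at a small multiple of the BDP.
    bdp_bytes = btl_bw_bps * min_rtt_s / 8.0
    cwnd_packets = max(4, math.ceil(cwnd_gain * bdp_bytes / mss_bytes))
    return pacing_rate_bps, cwnd_packets


# Hypothetical path: 10 Mbps bottleneck, 40 ms minimum RTT, cruising (gain 1.0).
pacing_rate, cwnd = bbr_controls(10e6, 0.040, pacing_gain=1.0)
```

For this hypothetical path the BDP is 50 KB, so a gain of 2 gives a cwnd of 70 packets. That bounds the standing queue to about one BDP even if the pacing rate briefly exceeds what the bottleneck can carry.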
When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.
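The "bandwidth has stopped growing" test can be sketched as below. The message gives no thresholds; the 25% growth factor and the three-round patience used here are assumptions for illustration, and `rounds_until_pipe_full` is a hypothetical name.

```python
def rounds_until_pipe_full(bw_per_round, growth=1.25, patience=3):
    """Return the index of the round at which the pipe is judged full, or None."""
    full_bw = 0.0        # best bandwidth seen at the last significant increase
    stalled_rounds = 0   # consecutive rounds without significant growth
    for round_index, bw in enumerate(bw_per_round):
        if bw >= full_bw * growth:
            full_bw = bw           # still growing: record a new baseline
            stalled_rounds = 0
            continue
        stalled_rounds += 1
        if stalled_rounds >= patience:
            return round_index     # plateau confirmed: leave STARTUP
    return None


# Bandwidth estimate doubles each round, then plateaus near 8 units.
exit_round = rounds_until_pipe_full([1.0, 2.0, 4.0, 8.0, 8.1, 8.2, 8.1])
```

Requiring several stalled rounds before exiting keeps a single noisy sample from ending the exponential search early, and the decision depends only on the bandwidth estimate, not on loss or delay.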
Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that was created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).
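The probe, drain, cruise cycle of PROBE_BW can be pictured as a repeating list of pacing gains. The specific gains and the eight-phase cycle length below are assumptions for illustration; this message does not give numbers, and `pacing_schedule` is a hypothetical helper.

```python
from itertools import cycle, islice

# Assumed gains: one probing phase, one draining phase, then six cruising phases.
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


def pacing_schedule(btl_bw_bps, phases):
    """Pacing rate for each of the next `phases` phases of the gain cycle."""
    return [gain * btl_bw_bps for gain in islice(cycle(PROBE_BW_GAINS), phases)]


schedule = pacing_schedule(100.0, 10)
average_gain = sum(PROBE_BW_GAINS) / len(PROBE_BW_GAINS)
```

With these assumed gains the average over a full cycle is exactly 1.0, so the flow sends at the estimated bottleneck bandwidth on average while still periodically testing whether more bandwidth is available.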
BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale. Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.
Example performance results, to illustrate the difference between BBR
and CUBIC:
Resilience to random loss (e.g. from shallow buffers):
Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
Low latency with the bloated buffers common in today's last-mile links:
Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
buffer. Both fully utilize the bottleneck bandwidth, but BBR
achieves this with a median RTT 25x lower (43 ms instead of 1.09
secs).
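The scale of the bufferbloat example can be checked with a short calculation. The 1500-byte packet size is an assumption (the message does not give one), and `full_buffer_delay_s` is a hypothetical helper.

```python
def full_buffer_delay_s(buffer_packets, packet_bytes, bottleneck_bps):
    """Time to drain a full bottleneck buffer at the bottleneck rate."""
    return buffer_packets * packet_bytes * 8 / bottleneck_bps


# 1000-packet buffer of 1500-byte packets in front of a 10 Mbps bottleneck.
queue_delay = full_buffer_delay_s(1000, 1500, 10e6)
rtt_when_full = 0.040 + queue_delay   # base RTT plus queueing delay
```

Under that assumption a full buffer adds about 1.2 seconds of queueing delay on top of the 40 ms base RTT. A loss-based sender that keeps the buffer mostly full would therefore see an RTT on the order of a second, which is the same order as the 1.09 s median reported for CUBIC above.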
Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.
Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:
https://groups.google.com/forum/#!forum/bbr-dev
NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
enabled, since pacing is integral to the BBR design and
implementation. BBR without pacing would not function properly, and
may incur unnecessarily high packet loss rates.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 11:39:23 +08:00
|
|
|
config DEFAULT_BBR
|
|
|
|
bool "BBR" if TCP_CONG_BBR=y
|
|
|
|
|
2006-09-25 11:11:58 +08:00
|
|
|
config DEFAULT_RENO
|
|
|
|
bool "Reno"
|
|
|
|
endchoice
|
|
|
|
|
|
|
|
endif
|
2005-06-24 03:23:25 +08:00
|
|
|
|
2006-09-25 11:13:03 +08:00
|
|
|
config TCP_CONG_CUBIC
|
2005-06-27 06:20:20 +08:00
|
|
|
tristate
|
2005-06-25 09:07:51 +08:00
|
|
|
depends on !TCP_CONG_ADVANCED
|
|
|
|
default y
|
|
|
|
|
2006-09-25 11:11:58 +08:00
|
|
|
config DEFAULT_TCP_CONG
|
|
|
|
string
|
|
|
|
default "bic" if DEFAULT_BIC
|
|
|
|
default "cubic" if DEFAULT_CUBIC
|
|
|
|
default "htcp" if DEFAULT_HTCP
|
2010-03-11 17:57:27 +08:00
|
|
|
default "hybla" if DEFAULT_HYBLA
|
2006-09-25 11:11:58 +08:00
|
|
|
default "vegas" if DEFAULT_VEGAS
|
|
|
|
default "westwood" if DEFAULT_WESTWOOD
|
2010-03-11 17:57:28 +08:00
|
|
|
default "veno" if DEFAULT_VENO
|
2006-09-25 11:11:58 +08:00
|
|
|
default "reno" if DEFAULT_RENO
|
net: tcp: add DCTCP congestion control algorithm
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which was first published at SIGCOMM 2010 [2], with
follow-up analysis at SIGMETRICS 2011 [3] (and, more recently, as an
informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are, e.g.,
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to meet the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are marked when the internal queue length
exceeds a threshold K (K is chosen so that enough headroom for
marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (in other words, the estimated probability that
the switch queue is congested) is then used to adaptively decrease
the congestion window W:
W := (1 - (alpha / 2)) * W
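The two update rules above can be sketched as one per-RTT step. This is a floating-point illustration, not the kernel's fixed-point code. The default g = 1/16 is the gain this commit message cites from the paper; applying the window reduction only in RTTs that saw at least one mark is an assumption of this sketch, and `dctcp_update` is a hypothetical name.

```python
def dctcp_update(alpha, cwnd, marked_acks, total_acks, g=1.0 / 16):
    """One per-RTT step: returns (new_alpha, new_cwnd)."""
    # F := X / Y, the fraction of ACKs that carried a congestion mark.
    fraction = marked_acks / total_acks if total_acks else 0.0
    # alpha := (1 - g) * alpha + g * F
    alpha = (1.0 - g) * alpha + g * fraction
    if marked_acks:
        # W := (1 - alpha / 2) * W
        cwnd = (1.0 - alpha / 2.0) * cwnd
    return alpha, cwnd


# Mild congestion: half the ACKs marked, starting from alpha = 0.
mild_alpha, mild_cwnd = dctcp_update(0.0, 100.0, marked_acks=8, total_acks=16)
# Persistent congestion: alpha already 1 and every ACK marked.
heavy_alpha, heavy_cwnd = dctcp_update(1.0, 100.0, marked_acks=10, total_acks=10)
```

With mild congestion the window shrinks by under 2% in this step, while persistent full marking halves it. That is the same reduction a classic ECN or loss response would apply, but it is reached only when alpha approaches 1.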
DCTCP uses ECN both for marking packets on the switch side and for
conveying those marks back to the sender.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4];
it can thus derive multibit feedback from the information present in
the single-bit sequence of marks in its control law, and act in
*proportion* to the extent of congestion, not merely its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. As a result,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
2014-09-27 04:37:36 +08:00
|
|
|
default "dctcp" if DEFAULT_DCTCP
|
2015-06-11 01:08:17 +08:00
|
|
|
default "cdg" if DEFAULT_CDG
|
2016-11-25 22:05:26 +08:00
|
|
|
default "bbr" if DEFAULT_BBR
|
2006-09-25 11:13:03 +08:00
|
|
|
default "cubic"
|
2006-09-25 11:11:58 +08:00
|
|
|
|
2006-11-15 11:07:45 +08:00
|
|
|
config TCP_MD5SIG
|
2012-10-03 02:19:48 +08:00
|
|
|
bool "TCP: MD5 Signature Option support (RFC2385)"
|
2006-11-15 11:07:45 +08:00
|
|
|
select CRYPTO
|
|
|
|
select CRYPTO_MD5
|
2020-06-14 00:50:22 +08:00
|
|
|
help
|
2007-05-09 13:12:20 +08:00
|
|
|
RFC2385 specifies a method of giving MD5 protection to TCP sessions.
|
2006-11-15 11:07:45 +08:00
|
|
|
Its main (only?) use is to protect BGP sessions between core routers
|
|
|
|
on the Internet.
|
|
|
|
|
|
|
|
If unsure, say N.
|