linux/net
Daniel Borkmann aa4a83ee8b net: sctp: fix suboptimal edge-case on non-active active/retrans path selection
In SCTP, selection of active (T.ACT) and retransmission (T.RET)
transports is being done whenever transport control operations
(UP, DOWN, PF, ...) are engaged through sctp_assoc_control_transport().

Commits 4c47af4d5e ("net: sctp: rework multihoming retransmission
path selection to rfc4960") and a7288c4dd5 ("net: sctp: improve
sctp_select_active_and_retran_path selection") have both improved
it towards a more fine-grained and optimal path selection.

Currently, the selection algorithm for T.ACT and T.RET is as follows:

1) Elect the two most recently used ACTIVE transports T1, T2 for
   T.ACT, T.RET, where T.ACT<-T1 and T1 is most recently used
2) In case primary path T.PRI not in {T1, T2} but ACTIVE, set
   T.ACT<-T.PRI and T.RET<-T1
3) If only T1 is ACTIVE from the set, set T.ACT<-T1 and T.RET<-T1
4) If none is ACTIVE, set T.ACT<-best(T.PRI, T.RET, T3) where
   T3 is the most recently used (if avail) in PF, set T.RET<-T.PRI

Prior to above commits, 4) was simply a camp on T.ACT<-T.PRI and
T.RET<-T.PRI, ignoring possible paths in PF. Camping on T.PRI is
still slightly suboptimal as it can lead to the following scenario:

Setup:
        <A>                                <B>
    T1: p1p1 (10.0.10.10) <==>  .'`)  <==> p1p1 (10.0.10.12)  <= T.PRI
    T2: p1p2 (10.0.10.20) <==> (_ . ) <==> p1p2 (10.0.10.22)

    net.sctp.rto_min = 1000
    net.sctp.path_max_retrans = 2
    net.sctp.pf_retrans = 0
    net.sctp.hb_interval = 1000

T.PRI is permanently down, T2 is put briefly into PF state (e.g. due to
link flapping). Here, the first time transmission is sent over PF path
T2 as it's the only non-INACTIVE path, but the retransmitted data-chunks
are sent over the INACTIVE path T1 (T.PRI), which is not good.

After the patch, it's choosing better transports in both cases by
modifying step 4):

4) If none is ACTIVE, set T.ACT_new<-best(T.ACT_old, T3) where T3 is
   the most recently used (if avail) in PF, set T.RET<-T.ACT_new

This will still select a best possible path in PF if available (which
can also include T.PRI/T.RET), and set both T.ACT/T.RET to it.

In case sctp_assoc_control_transport() *just* put T.ACT_old into INACTIVE
as it transitioned from ACTIVE->PF->INACTIVE and stays in INACTIVE just
for a very short while before going back ACTIVE, it will guarantee that
this path will be reselected for T.ACT/T.RET since T3 (PF) is not
available.

Previously, this was not possible, as we would only select between T.PRI
and T.RET, and a possible T3 would be NULL due to the fact that we have
just transitioned T3 in sctp_assoc_control_transport() from PF->INACTIVE
and would select a suboptimal path when T.PRI/T.RET have worse properties.

In the case that T.ACT_old permanently went to INACTIVE during this
transition and there's no PF path available, plus T.PRI and T.RET are
INACTIVE as well, we would now camp on T.ACT_old, but if everything is
being INACTIVE there's really not much we can do except hoping for a
successful HB to bring one of the transports back up again and, thus
cause a new selection through sctp_assoc_control_transport().

Now both tests work fine:

Case 1:

 1. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

 2. T1 S(ACTIVE) T.ACT, T.RET
    T2 S(PF)

 3. T1 S(ACTIVE) T.ACT, T.RET
    T2 S(INACTIVE)

 5. T1 S(PF) T.ACT, T.RET
    T2 S(INACTIVE)

[ 5.1 T1 S(INACTIVE) T.ACT, T.RET
      T2 S(INACTIVE) ]

 6. T1 S(ACTIVE) T.ACT, T.RET
    T2 S(INACTIVE)

 7. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

Case 2:

 1. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

 2. T1 S(PF)
    T2 S(ACTIVE) T.ACT, T.RET

 3. T1 S(INACTIVE)
    T2 S(ACTIVE) T.ACT, T.RET

 5. T1 S(INACTIVE)
    T2 S(PF) T.ACT, T.RET

[ 5.1 T1 S(INACTIVE)
      T2 S(INACTIVE) T.ACT, T.RET ]

 6. T1 S(INACTIVE)
    T2 S(ACTIVE) T.ACT, T.RET

 7. T1 S(ACTIVE) T.ACT
    T2 S(ACTIVE) T.RET

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 11:31:30 -07:00
..
6lowpan 6lowpan: Allow 6LoWPAN to be modular 2014-08-07 11:44:18 -07:00
9p 9P: remove unnecessary break after return 2014-07-15 16:27:00 -07:00
802 net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
8021q net: Always untag vlan-tagged traffic on input. 2014-08-11 12:16:51 -07:00
appletalk Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-07-16 14:09:34 -07:00
atm lec: Use rtnl lock/unlock when updating MTU 2014-08-21 16:31:23 -07:00
ax25 net: Fix use after free by removing length arg from sk_data_ready callbacks. 2014-04-11 16:15:36 -04:00
batman-adv batman-adv: Fix parameter order of hlist_add_behind 2014-08-16 19:19:08 -07:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-08-06 09:38:14 -07:00
bridge net: Always untag vlan-tagged traffic on input. 2014-08-11 12:16:51 -07:00
caif caif: remove unnecessary break after goto 2014-07-15 16:27:01 -07:00
can can: add hash based access to single EFF frame filters 2014-05-19 09:38:24 +02:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2014-08-13 17:43:29 -06:00
core Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-08-13 18:27:40 -06:00
dcb dcbnl : Fix misleading dcb_app->priority explanation 2014-07-30 17:21:05 -07:00
dccp inet: move ipv6only in sock_common 2014-07-01 23:46:21 -07:00
decnet net: Split sk_no_check into sk_no_check_{rx,tx} 2014-05-23 16:28:53 -04:00
dns_resolver Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2014-08-06 08:06:39 -07:00
dsa net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
ethernet net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
hsr net/hsr: Remove left-over never-true conditional code. 2014-07-11 15:04:40 -07:00
ieee802154 inet: frags: use kmem_cache for inet_frag_queue 2014-08-02 15:31:31 -07:00
ipv4 tcp: fix ssthresh and undo for consecutive short FRTO episodes 2014-08-14 14:38:55 -07:00
ipv6 net: ipv6: fib: don't sleep inside atomic lock 2014-08-22 10:54:49 -07:00
ipx net: Split sk_no_check into sk_no_check_{rx,tx} 2014-05-23 16:28:53 -04:00
irda irda: Fix rd_frame control field initialization in irlap_send_rd_frame() 2014-08-13 20:05:52 -07:00
iucv af_iucv: avoid path quiesce of severed path in shutdown() 2014-07-21 20:21:40 -07:00
key af_key: remove unnecessary break after return 2014-07-15 16:27:00 -07:00
l2tp net: use inet6_iif instead of IP6CB()->iif 2014-07-31 22:37:06 -07:00
lapb
llc
mac80211 Merge tag 'master-2014-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next 2014-07-28 17:36:25 -07:00
mac802154 net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
mpls gre: Call gso_make_checksum 2014-06-04 22:46:38 -07:00
netfilter netfilter: nf_tables: fix error return code 2014-08-08 16:47:29 +02:00
netlabel Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-08-06 09:38:14 -07:00
netlink netlink: Annotate RCU locking for seq_file walker 2014-08-14 15:13:40 -07:00
netrom net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
nfc Merge tag 'master-2014-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next 2014-08-05 13:18:20 -07:00
openvswitch openvswitch: fix panic with multiple vlan headers 2014-08-22 11:24:04 -07:00
packet packet: handle too big packets for PACKET_V3 2014-08-21 16:44:28 -07:00
phonet net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-06-12 14:27:40 -07:00
rfkill Merge git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next 2014-05-22 13:58:36 -04:00
rose net: set name_assign_type in alloc_netdev() 2014-07-15 16:12:48 -07:00
rxrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-08-06 09:38:14 -07:00
sched cbq: now_rt removal 2014-08-19 10:58:44 -07:00
sctp net: sctp: fix suboptimal edge-case on non-active active/retrans path selection 2014-08-22 11:31:30 -07:00
sunrpc NFS client updates for Linux 3.17 2014-08-13 18:13:19 -06:00
tipc tipc: Fix build. 2014-08-19 11:16:38 -07:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-06-12 14:27:40 -07:00
vmw_vsock vsock: Make transport the proto owner 2014-05-05 13:13:50 -04:00
wimax
wireless Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless 2014-07-25 10:22:36 -04:00
x25 net: Fix use after free by removing length arg from sk_data_ready callbacks. 2014-04-11 16:15:36 -04:00
xfrm list: fix order of arguments for hlist_add_after(_rcu) 2014-08-06 18:01:24 -07:00
compat.c net: sendmsg: fix NULL pointer dereference 2014-07-29 12:20:22 -07:00
Kconfig 6lowpan: introduce new net/6lowpan directory 2014-07-12 01:53:30 +02:00
Makefile 6lowpan: introduce new net/6lowpan directory 2014-07-12 01:53:30 +02:00
nonet.c
socket.c net-timestamp: sock_tx_timestamp() fix 2014-08-06 12:38:07 -07:00
sysctl_net.c