2005-04-17 06:20:36 +08:00
/*
 * originally based on the dummy device.
 *
 * Copyright 1999, Thomas Davis, tadavis@lbl.gov.
 * Licensed under the GPL. Based on dummy.c, and eql.c devices.
 *
 * bonding.c: an Ethernet Bonding driver
 *
 * This is useful to talk to Cisco EtherChannel compatible equipment:
 *	Cisco 5500
 *	Sun Trunking (Solaris)
 *	Alteon AceDirector Trunks
 *	Linux Bonding
 *	and probably many L2 switches ...
 *
 * How it works:
 *    ifconfig bond0 ipaddress netmask up
 *	will set up a network device, with an ip address. No mac address
 *	will be assigned at this time. The hw mac address will come from
 *	the first slave bonded to the channel. All slaves will then use
 *	this hw mac address.
 *
 *    ifconfig bond0 down
 *	will release all slaves, marking them as down.
 *
 *    ifenslave bond0 eth0
 *	will attach eth0 to bond0 as a slave. eth0 hw mac address will either
 *	a: be used as initial mac address
 *	b: if a hw mac address already is there, eth0's hw mac address
 *	   will then be set from bond0.
 *
 */
# include <linux/kernel.h>
# include <linux/module.h>
# include <linux/types.h>
# include <linux/fcntl.h>
# include <linux/interrupt.h>
# include <linux/ptrace.h>
# include <linux/ioport.h>
# include <linux/in.h>
2005-06-27 05:54:11 +08:00
# include <net/ip.h>
2005-04-17 06:20:36 +08:00
# include <linux/ip.h>
bonding: symmetric ICMP transmit
A bond with layer2+3 or layer3+4 hashing uses the IP addresses and the ports
to balance packets between slaves. With some network errors, we receive an ICMP
error packet from the remote host or a router. If sent by a router, the source IP
can differ from that of the remote host. Additionally, the ICMP protocol has no
port numbers, so a layer3+4 bond will get a different hash than the previous one.
These two conditions could let the packet go through a different interface than
the other packets of the same flow:
# tcpdump -qltnni veth0 |sed 's/^/0: /' &
# tcpdump -qltnni veth1 |sed 's/^/1: /' &
# hping3 -2 192.168.0.2 -p 9
0: IP 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.2252 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.2253 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.2254 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
An ICMP error packet contains the header of the packet which caused the network
error, so inspect it and match the flow against it. This lets us send the ICMP
via the same interface as the previous packet in the flow.
Move the IP and port dissect code into a generic function bond_flow_ip() and if
we are dissecting an ICMP error packet, call it again with the adjusted offset.
# hping3 -2 192.168.0.2 -p 9
1: IP 192.168.0.1.1224 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.1225 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.1226 > 192.168.0.2.9: UDP, length 0
0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.1227 > 192.168.0.2.9: UDP, length 0
0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
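The idea can be illustrated with a minimal user-space sketch. It is not the driver's code: `struct toy_flow`, `struct toy_pkt`, `toy_flow_hash()` and `toy_select_slave()` are hypothetical names, and the XOR-and-fold hash is a simplified stand-in chosen for the illustration. The only point shown is that hashing the header quoted inside an ICMP error yields the same slave as the flow that triggered it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified flow key; not the driver's struct flow_keys. */
struct toy_flow {
	uint32_t saddr, daddr;	/* IPv4 addresses, host byte order */
	uint16_t sport, dport;	/* L4 ports, 0 when the protocol has none */
};

struct toy_pkt {
	struct toy_flow outer;	/* headers of the packet itself */
	int is_icmp_error;	/* nonzero for e.g. "port unreachable" */
	struct toy_flow inner;	/* offending packet's headers quoted in the ICMP payload */
};

/* Toy layer3+4 hash: XOR of addresses and ports, then folded. */
static uint32_t toy_flow_hash(const struct toy_flow *f)
{
	uint32_t hash = (uint32_t)(f->sport ^ f->dport);

	hash ^= f->saddr ^ f->daddr;
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash;
}

/* For ICMP errors, hash the quoted inner header instead of the outer one. */
static unsigned int toy_select_slave(const struct toy_pkt *pkt,
				     unsigned int slave_cnt)
{
	const struct toy_flow *f = pkt->is_icmp_error ? &pkt->inner : &pkt->outer;

	assert(slave_cnt > 0);
	return toy_flow_hash(f) % slave_cnt;
}
```

With this model, an ICMP error sent by a router with an unrelated source address still maps to the slave used by the UDP flow it quotes, because the outer header is ignored for error packets.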
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 19:10:37 +08:00
# include <linux/icmp.h>
# include <linux/icmpv6.h>
2005-06-27 05:54:11 +08:00
# include <linux/tcp.h>
# include <linux/udp.h>
2005-04-17 06:20:36 +08:00
# include <linux/slab.h>
# include <linux/string.h>
# include <linux/init.h>
# include <linux/timer.h>
# include <linux/socket.h>
# include <linux/ctype.h>
# include <linux/inet.h>
# include <linux/bitops.h>
2009-06-13 03:02:48 +08:00
# include <linux/io.h>
2005-04-17 06:20:36 +08:00
# include <asm/dma.h>
2009-06-13 03:02:48 +08:00
# include <linux/uaccess.h>
2005-04-17 06:20:36 +08:00
# include <linux/errno.h>
# include <linux/netdevice.h>
# include <linux/inetdevice.h>
bonding: Improve IGMP join processing
In active-backup mode, the current bonding code duplicates IGMP
traffic to all slaves, so that switches are up to date in case of a
failover from an active to a backup interface. If bonding then fails
back to the original active interface, it is likely that the "active
slave" switch's IGMP forwarding for the port will be out of date until
some event occurs to refresh the switch (e.g., a membership query).
This patch alters the behavior of bonding to no longer flood
IGMP to all ports, and to issue IGMP JOINs to the newly active port at
the time of a failover. This ensures that switches are kept up to date
for all cases.
"GOELLESCH Niels" <niels.goellesch@eurocontrol.int> originally
reported this problem, and included a patch. His original patch was
modified by Jay Vosburgh to additionally remove the existing IGMP flood
behavior, use RCU, streamline code paths, fix trailing white space, and
adjust for style.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-03-01 09:03:37 +08:00
# include <linux/igmp.h>
2005-04-17 06:20:36 +08:00
# include <linux/etherdevice.h>
# include <linux/skbuff.h>
# include <net/sock.h>
# include <linux/rtnetlink.h>
# include <linux/smp.h>
# include <linux/if_ether.h>
# include <net/arp.h>
# include <linux/mii.h>
# include <linux/ethtool.h>
# include <linux/if_vlan.h>
# include <linux/if_bonding.h>
2007-12-07 15:40:33 +08:00
# include <linux/jiffies.h>
2010-10-14 00:01:50 +08:00
# include <linux/preempt.h>
2005-06-27 05:52:20 +08:00
# include <net/route.h>
2007-09-12 18:01:34 +08:00
# include <net/net_namespace.h>
2009-10-29 22:18:26 +08:00
# include <net/netns/generic.h>
2012-06-12 14:03:51 +08:00
# include <net/pkt_sched.h>
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
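The ordering requirement mentioned above (publishing curr_active_slave so that readers on weakly ordered architectures never see a half-initialized slave) can be sketched in user space with C11 atomics. This is only an analogy under stated assumptions: release/acquire stands in for rcu_assign_pointer/rcu_dereference, all `toy_` names are hypothetical, and the sketch does not implement grace periods or deferred freeing, which are the other half of RCU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_slave {
	int id;
	int link_up;
};

/* Shared pointer read by the transmit path; NULL means "no active slave". */
static _Atomic(struct toy_slave *) toy_curr_active_slave;

/* Writer side: fully initialize *s first, then publish it with release order. */
static void toy_publish_active(struct toy_slave *s)
{
	assert(s == NULL || s->link_up);	/* only publish usable slaves */
	atomic_store_explicit(&toy_curr_active_slave, s, memory_order_release);
}

/* Reader side: one acquire load, then use only the local copy. */
static int toy_xmit_slave_id(void)
{
	struct toy_slave *s = atomic_load_explicit(&toy_curr_active_slave,
						   memory_order_acquire);

	return s ? s->id : -1;
}
```

The reader takes a single snapshot of the pointer and never re-reads the shared variable, which is the property the converted transmit paths rely on.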
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2. Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover
while doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 22:54:51 +08:00
# include <linux/rculist.h>
2015-05-12 20:56:07 +08:00
# include <net/flow_dissector.h>
2014-11-11 02:27:49 +08:00
# include <net/bonding.h>
# include <net/bond_3ad.h>
# include <net/bond_alb.h>
2005-04-17 06:20:36 +08:00
2015-04-26 20:55:57 +08:00
# include "bonding_priv.h"
2005-04-17 06:20:36 +08:00
/*---------------------------- Module parameters ----------------------------*/
/* monitor all links that often (in milliseconds). <=0 disables monitoring */
static int max_bonds = BOND_DEFAULT_MAX_BONDS ;
2010-06-02 16:40:18 +08:00
static int tx_queues = BOND_DEFAULT_TX_QUEUES ;
2011-04-26 23:25:52 +08:00
static int num_peer_notif = 1 ;
2014-01-22 21:53:31 +08:00
static int miimon ;
2009-06-13 03:02:48 +08:00
static int updelay ;
static int downdelay ;
2005-04-17 06:20:36 +08:00
static int use_carrier = 1 ;
2009-06-13 03:02:48 +08:00
static char * mode ;
static char * primary ;
2009-09-25 11:28:09 +08:00
static char * primary_reselect ;
2009-06-13 03:02:48 +08:00
static char * lacp_rate ;
2011-06-22 17:54:39 +08:00
static int min_links ;
2009-06-13 03:02:48 +08:00
static char * ad_select ;
static char * xmit_hash_policy ;
2014-01-22 21:53:23 +08:00
static int arp_interval ;
2009-06-13 03:02:48 +08:00
static char * arp_ip_target [ BOND_MAX_ARP_TARGETS ] ;
static char * arp_validate ;
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of a bond (hybrid port). If the bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
The same applies to any other configuration in which we need to have access
to all arp_ip_targets.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
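Step 3 above can be sketched as follows. This is a simplified user-space model, not the driver's implementation: `struct toy_slave`, `toy_time_before()` and `toy_slave_last_rx()` are hypothetical names, timestamps are plain tick counters standing in for jiffies, and the number of configured targets is passed explicitly as an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ARP_TARGETS 16	/* stand-in for BOND_MAX_ARP_TARGETS */

struct toy_slave {
	unsigned long last_rx;	/* last time anything was received */
	unsigned long target_last_arp_rx[TOY_MAX_ARP_TARGETS];
};

/* Wrap-safe "a is earlier than b" for a free-running tick counter. */
static int toy_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * all_targets == 0: behave as before and report the plain last_rx.
 * all_targets != 0: report the oldest per-target ARP time, so a single
 * silent target makes the slave look stale.
 */
static unsigned long toy_slave_last_rx(const struct toy_slave *slave,
				       size_t n_targets, int all_targets)
{
	unsigned long oldest;
	size_t i;

	if (!all_targets || n_targets == 0)
		return slave->last_rx;

	assert(n_targets <= TOY_MAX_ARP_TARGETS);
	oldest = slave->target_last_arp_rx[0];
	for (i = 1; i < n_targets; i++)
		if (toy_time_before(slave->target_last_arp_rx[i], oldest))
			oldest = slave->target_last_arp_rx[i];
	return oldest;
}
```

Returning the oldest timestamp means the usual "has this slave received anything recently" check fails as soon as any one target stops answering, which is the "all" semantics described above.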
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 17:49:34 +08:00
static char * arp_all_targets ;
2009-06-13 03:02:48 +08:00
static char * fail_over_mac ;
2013-07-23 15:25:47 +08:00
static int all_slaves_active ;
2009-06-13 03:02:44 +08:00
static struct bond_params bonding_defaults ;
2010-10-05 22:23:59 +08:00
static int resend_igmp = BOND_DEFAULT_RESEND_IGMP ;
2013-11-05 20:51:41 +08:00
static int packets_per_slave = 1 ;
2013-12-21 14:40:12 +08:00
static int lp_interval = BOND_ALB_DEFAULT_LP_INTERVAL ;
2005-04-17 06:20:36 +08:00
module_param ( max_bonds , int , 0 ) ;
MODULE_PARM_DESC ( max_bonds , " Max number of bonded devices " ) ;
2010-06-02 16:40:18 +08:00
module_param ( tx_queues , int , 0 ) ;
MODULE_PARM_DESC ( tx_queues , " Max number of transmit queues (default = 16) " ) ;
2011-04-26 23:25:52 +08:00
module_param_named ( num_grat_arp , num_peer_notif , int , 0644 ) ;
2011-05-25 12:41:59 +08:00
MODULE_PARM_DESC ( num_grat_arp , " Number of peer notifications to send on "
" failover event (alias of num_unsol_na) " ) ;
2011-04-26 23:25:52 +08:00
module_param_named ( num_unsol_na , num_peer_notif , int , 0644 ) ;
2011-05-25 12:41:59 +08:00
MODULE_PARM_DESC ( num_unsol_na , " Number of peer notifications to send on "
" failover event (alias of num_grat_arp) " ) ;
2005-04-17 06:20:36 +08:00
module_param ( miimon , int , 0 ) ;
MODULE_PARM_DESC ( miimon , " Link check interval in milliseconds " ) ;
module_param ( updelay , int , 0 ) ;
MODULE_PARM_DESC ( updelay , " Delay before considering link up, in milliseconds " ) ;
module_param ( downdelay , int , 0 ) ;
2005-11-10 02:35:03 +08:00
MODULE_PARM_DESC ( downdelay , " Delay before considering link down, "
" in milliseconds " ) ;
2005-04-17 06:20:36 +08:00
module_param ( use_carrier , int , 0 ) ;
2005-11-10 02:35:03 +08:00
MODULE_PARM_DESC ( use_carrier , " Use netif_carrier_ok (vs MII ioctls) in miimon; "
2018-05-17 02:02:13 +08:00
" 0 for off, 1 for on (default) " ) ;
2005-04-17 06:20:36 +08:00
module_param ( mode , charp , 0 ) ;
2011-05-25 12:41:59 +08:00
MODULE_PARM_DESC ( mode , " Mode of operation; 0 for balance-rr, "
2005-11-10 02:35:03 +08:00
" 1 for active-backup, 2 for balance-xor, "
" 3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, "
" 6 for balance-alb " ) ;
2005-04-17 06:20:36 +08:00
module_param ( primary , charp , 0 ) ;
MODULE_PARM_DESC ( primary , " Primary network device to use " ) ;
2009-09-25 11:28:09 +08:00
module_param ( primary_reselect , charp , 0 ) ;
MODULE_PARM_DESC ( primary_reselect , " Reselect primary slave "
" once it comes up; "
" 0 for always (default), "
" 1 for only if speed of primary is "
" better, "
" 2 for only on active slave "
" failure " ) ;
2005-04-17 06:20:36 +08:00
module_param ( lacp_rate , charp , 0 ) ;
2011-05-25 12:41:59 +08:00
MODULE_PARM_DESC ( lacp_rate , " LACPDU tx rate to request from 802.3ad partner; "
" 0 for slow, 1 for fast " ) ;
2008-11-05 09:51:16 +08:00
module_param ( ad_select , charp , 0 ) ;
2016-08-09 21:36:04 +08:00
MODULE_PARM_DESC ( ad_select , " 802.3ad aggregation selection logic; "
2011-05-25 12:41:59 +08:00
" 0 for stable (default), 1 for bandwidth, "
" 2 for count " ) ;
2011-06-22 17:54:39 +08:00
module_param ( min_links , int , 0 ) ;
MODULE_PARM_DESC ( min_links , " Minimum number of available links before turning on carrier " ) ;
2005-06-27 05:54:11 +08:00
module_param ( xmit_hash_policy , charp , 0 ) ;
2018-05-15 02:48:09 +08:00
MODULE_PARM_DESC ( xmit_hash_policy , " balance-alb, balance-tlb, balance-xor, 802.3ad hashing method; "
2011-05-25 12:41:59 +08:00
" 0 for layer 2 (default), 1 for layer 3+4, "
2013-10-02 19:39:25 +08:00
" 2 for layer 2+3, 3 for encap layer 2+3, "
" 4 for encap layer 3+4 " ) ;
2005-04-17 06:20:36 +08:00
module_param ( arp_interval , int , 0 ) ;
MODULE_PARM_DESC ( arp_interval , " arp interval in milliseconds " ) ;
module_param_array ( arp_ip_target , charp , NULL , 0 ) ;
MODULE_PARM_DESC ( arp_ip_target , " arp targets in n.n.n.n form " ) ;
2006-09-23 12:54:53 +08:00
module_param ( arp_validate , charp , 0 ) ;
2011-05-25 12:41:59 +08:00
MODULE_PARM_DESC ( arp_validate , " validate src/dst of ARP probes; "
" 0 for none (default), 1 for active, "
" 2 for backup, 3 for all " ) ;
2013-06-24 17:49:34 +08:00
module_param ( arp_all_targets , charp , 0 ) ;
MODULE_PARM_DESC ( arp_all_targets , " fail on any/all arp targets timeout; 0 for any (default), 1 for all " ) ;
2008-05-18 12:10:14 +08:00
module_param ( fail_over_mac , charp , 0 ) ;
2011-05-25 12:41:59 +08:00
MODULE_PARM_DESC ( fail_over_mac , " For active-backup, do not set all slaves to "
" the same MAC; 0 for none (default), "
" 1 for active, 2 for follow " ) ;
2010-06-02 16:39:21 +08:00
module_param ( all_slaves_active , int , 0 ) ;
2014-09-09 17:07:55 +08:00
MODULE_PARM_DESC ( all_slaves_active , " Keep all frames received on an interface "
2011-05-25 12:41:59 +08:00
" by setting active flag for all slaves; "
2010-06-02 16:39:21 +08:00
" 0 for never (default), 1 for always. " ) ;
2010-10-05 22:23:59 +08:00
module_param ( resend_igmp , int , 0 ) ;
2011-05-25 12:41:59 +08:00
MODULE_PARM_DESC ( resend_igmp , " Number of IGMP membership reports to send on "
" link failure " ) ;
2013-11-05 20:51:41 +08:00
module_param ( packets_per_slave , int , 0 ) ;
MODULE_PARM_DESC ( packets_per_slave , " Packets to send per slave in balance-rr "
" mode; 0 for a random slave, 1 packet per "
" slave (default), >1 packets per slave. " ) ;
2013-12-21 14:40:12 +08:00
module_param ( lp_interval , uint , 0 ) ;
MODULE_PARM_DESC ( lp_interval , " The number of seconds between instances where "
" the bonding driver sends learning packets to "
" each slaves peer switch. The default is 1. " ) ;
2005-04-17 06:20:36 +08:00
/*----------------------------- Global variables ----------------------------*/
2010-10-14 00:01:50 +08:00
# ifdef CONFIG_NET_POLL_CONTROLLER
net: Convert netpoll blocking api in bonding driver to be a counter
A while back I made some changes to enable netpoll in the bonding driver. Among
them was a per-cpu flag that indicated we were in a path that held locks which
could cause the netpoll path to block during tx, and as such the tx path
should queue the frame for later use. This appears to have given rise to a
regression. If one of those paths on which we hold the per-cpu flag yields the
cpu, it's possible for us to come back on a different cpu, leading to us clearing
a different flag than we set. This results in odd netpoll drops, and BUG
backtraces appearing in the log, as we check to make sure that we only clear set
bits, and only set clear bits. I had thought briefly about changing the
offending paths so that they wouldn't sleep, but looking at my original work
more closely, it doesn't appear that a per-cpu flag is warranted. We already
gate the checking of this flag on IFF_IN_NETPOLL, so we don't hit this in the
normal tx case anyway. And practically speaking, the normal use case for
netpoll is to only have one client anyway, so we're not going to erroneously
queue netpoll frames when it's actually safe to do so. As such, let's just
convert that per-cpu flag to an atomic counter. It fixes the rescheduling bugs,
is equivalent from a performance perspective and actually eliminates some code
in the process.
Tested by the reporter and myself, successfully
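The counter approach can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the driver's helpers: the `toy_` names are hypothetical and `atomic_int` stands in for the kernel's atomic_t. Because the counter is global, a path that blocks on one CPU and unblocks on another still balances correctly.

```c
#include <assert.h>
#include <stdatomic.h>

/* Zero-initialized static atomic: netpoll tx starts out unblocked. */
static atomic_int toy_netpoll_block_tx;

/* Called on entry to a path that holds locks the netpoll tx path needs. */
static void toy_block_netpoll_tx(void)
{
	atomic_fetch_add(&toy_netpoll_block_tx, 1);
}

/* Called on exit from such a path, possibly on a different CPU. */
static void toy_unblock_netpoll_tx(void)
{
	int prev = atomic_fetch_sub(&toy_netpoll_block_tx, 1);

	assert(prev > 0);	/* every unblock must pair with an earlier block */
	(void)prev;
}

/* The tx path queues the frame for later while this returns nonzero. */
static int toy_is_netpoll_tx_blocked(void)
{
	return atomic_load(&toy_netpoll_block_tx) != 0;
}
```

Nested blocking sections simply raise the count further, so tx stays blocked until the last section has been left.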
Reported-by: Liang Zheng <lzheng@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-06 17:05:50 +08:00
atomic_t netpoll_block_tx = ATOMIC_INIT ( 0 ) ;
2010-10-14 00:01:50 +08:00
# endif
netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into a zero-based array and
thus is an unsigned entity. Using a negative value is an out-of-bounds
access by definition.
2)
On x86_64, unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added to or subtracted from pointers
are preferred to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
	void f(long *p, int i)
	{
		g(p[i]);
	}
roughly translates to
	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call	g
MOVSX is a 3-byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero-extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a seemingly random artefact of code generation with the register
allocator being used differently. gcc decides that some variable
needs to live in the new r8+ registers and every access now requires a REX
prefix. Or it is shifted into r12, so the [r12+0] addressing mode has to be
used, which is longer than [r8].
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function                           old     new   delta
nfsd4_lock                        3886    3959     +73
tipc_link_build_proto_msg         1096    1140     +44
mac80211_hwsim_new_radio          2776    2808     +32
tipc_mon_rcv                      1032    1058     +26
svcauth_gss_legacy_init           1413    1429     +16
tipc_bcbase_select_primary         379     392     +13
nfsd4_exchange_id                 1247    1260     +13
nfsd4_setclientid_confirm          782     793     +11
...
put_client_renew_locked            494     480     -14
ip_set_sockfn_get                  730     716     -14
geneve_sock_add                    829     813     -16
nfsd4_sequence_done                721     703     -18
nlmclnt_lookup_host                708     686     -22
nfsd4_lockt                       1085    1063     -22
nfs_get_client                    1077    1050     -27
tcf_bpf_init                      1106    1076     -30
nfsd4_encode_fattr                5997    5930     -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17 09:58:21 +08:00
unsigned int bond_net_id __read_mostly ;
2005-04-17 06:20:36 +08:00
bonding: balance ICMP echoes in layer3+4 mode
The bonding uses the L4 ports to balance flows between slaves. As the ICMP
protocol has no ports, those packets are sent all to the same device:
# tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
# tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64
But some ICMP packets have an Identifier field which is
used to match packets within sessions. Let's use this value in the hash
function to balance these packets between bond slaves:
# ping -qc1 192.168.0.2
0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64
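A minimal sketch of the effect, assuming a toy XOR-and-fold hash rather than the driver's actual one: `toy_icmp_echo_slave()` is a hypothetical helper in which the ICMP Identifier takes the place that the L4 ports would occupy for TCP or UDP. Addresses are in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Pick a slave for an ICMP echo packet, mixing the Identifier into the hash. */
static unsigned int toy_icmp_echo_slave(uint32_t saddr, uint32_t daddr,
					uint16_t icmp_id,
					unsigned int slave_cnt)
{
	uint32_t hash = icmp_id;	/* identifier stands in for the ports */

	assert(slave_cnt > 0);
	hash ^= saddr ^ daddr;
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash % slave_cnt;
}
```

In this model, two ping sessions between the same pair of hosts can land on different slaves, while a request and its reply (same Identifier, swapped addresses) hash identically because the addresses are combined with XOR.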
Also, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:
# ping -q 192.168.1.2 -c1
0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
# ping -q 192.168.1.2 -c1
1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 21:50:53 +08:00
static const struct flow_dissector_key flow_keys_bonding_keys [ ] = {
{
. key_id = FLOW_DISSECTOR_KEY_CONTROL ,
. offset = offsetof ( struct flow_keys , control ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_BASIC ,
. offset = offsetof ( struct flow_keys , basic ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_IPV4_ADDRS ,
. offset = offsetof ( struct flow_keys , addrs . v4addrs ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_IPV6_ADDRS ,
. offset = offsetof ( struct flow_keys , addrs . v6addrs ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_TIPC ,
. offset = offsetof ( struct flow_keys , addrs . tipckey ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_PORTS ,
. offset = offsetof ( struct flow_keys , ports ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_ICMP ,
. offset = offsetof ( struct flow_keys , icmp ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_VLAN ,
. offset = offsetof ( struct flow_keys , vlan ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_FLOW_LABEL ,
. offset = offsetof ( struct flow_keys , tags ) ,
} ,
{
. key_id = FLOW_DISSECTOR_KEY_GRE_KEYID ,
. offset = offsetof ( struct flow_keys , keyid ) ,
} ,
} ;
static struct flow_dissector flow_keys_bonding __read_mostly ;
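How a key table of this shape is typically consumed can be sketched in user space. Everything prefixed `toy_` is hypothetical and much smaller than the kernel structures; the sketch only shows the general technique, assuming a dissector that records which keys are wanted and where each one lives inside the caller's container.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_key_id { TOY_KEY_ADDRS, TOY_KEY_PORTS, TOY_KEY_MAX };

/* Caller-defined container that receives the dissected fields. */
struct toy_flow_keys {
	uint32_t addrs[2];
	uint32_t ports;
};

/* One table entry: which key, and where it lives inside the container. */
struct toy_dissector_key {
	enum toy_key_id key_id;
	size_t offset;
};

struct toy_dissector {
	unsigned int used_keys;		/* bitmask of requested keys */
	size_t offset[TOY_KEY_MAX];	/* per-key offset into the container */
};

/* Build the lookup state from a key table. */
static void toy_dissector_init(struct toy_dissector *d,
			       const struct toy_dissector_key *keys, size_t n)
{
	size_t i;

	memset(d, 0, sizeof(*d));
	for (i = 0; i < n; i++) {
		assert(keys[i].key_id < TOY_KEY_MAX);
		d->used_keys |= 1u << keys[i].key_id;
		d->offset[keys[i].key_id] = keys[i].offset;
	}
}

/* Return where a key should be written, or NULL if it was not requested. */
static void *toy_key_target(const struct toy_dissector *d,
			    enum toy_key_id id, void *container)
{
	if (!(d->used_keys & (1u << id)))
		return NULL;
	return (char *)container + d->offset[id];
}
```

The offsetof-based table keeps the dissector independent of the container layout, so each user can request only the fields it wants to hash on.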
2005-04-17 06:20:36 +08:00
/*-------------------------- Forward declarations ---------------------------*/
2009-06-13 03:02:52 +08:00
static int bond_init ( struct net_device * bond_dev ) ;
2009-10-29 22:18:24 +08:00
static void bond_uninit ( struct net_device * bond_dev ) ;
2017-01-07 11:12:52 +08:00
static void bond_get_stats ( struct net_device * bond_dev ,
struct rtnl_link_stats64 * stats ) ;
2014-10-05 08:45:01 +08:00
static void bond_slave_arr_handler ( struct work_struct * work ) ;
bonding: Fix ARP monitor validation
The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.
This logic can fail if there is no currently active slave. In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave. The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.
This can arise if the bonding slaves and switch are working, but
the ARP target is not responding. When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave. This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.
Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.
This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.
Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave really works")
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-03 05:35:56 +08:00
static bool bond_time_in_interval ( struct bonding * bond , unsigned long last_act ,
int mod ) ;
2018-09-25 05:40:11 +08:00
static void bond_netdev_notify_work ( struct work_struct * work ) ;
2005-04-17 06:20:36 +08:00
/*---------------------------- General routines -----------------------------*/
2011-03-07 05:58:46 +08:00
const char * bond_mode_name ( int mode )
2005-04-17 06:20:36 +08:00
{
2008-12-10 15:08:09 +08:00
static const char *names[] = {
[BOND_MODE_ROUNDROBIN] = "load balancing (round-robin)",
[BOND_MODE_ACTIVEBACKUP] = "fault-tolerance (active-backup)",
[BOND_MODE_XOR] = "load balancing (xor)",
[BOND_MODE_BROADCAST] = "fault-tolerance (broadcast)",
2009-06-13 03:02:48 +08:00
[BOND_MODE_8023AD] = "IEEE 802.3ad Dynamic link aggregation",
2008-12-10 15:08:09 +08:00
[BOND_MODE_TLB] = "transmit load balancing",
[BOND_MODE_ALB] = "adaptive load balancing",
};
2013-07-24 14:53:26 +08:00
if (mode < BOND_MODE_ROUNDROBIN || mode > BOND_MODE_ALB)
2005-04-17 06:20:36 +08:00
return "unknown";
2008-12-10 15:08:09 +08:00
return names[mode];
2005-04-17 06:20:36 +08:00
}
/*---------------------------------- VLAN -----------------------------------*/
/**
* bond_dev_queue_xmit - Prepare skb for xmit .
2009-06-13 03:02:48 +08:00
*
2005-04-17 06:20:36 +08:00
* @ bond : bond device that got this skb for tx .
* @ skb : hw accel VLAN tagged skb to transmit
* @ slave_dev : slave that is supposed to xmit this skbuff
*/
2014-01-02 09:13:09 +08:00
void bond_dev_queue_xmit ( struct bonding * bond , struct sk_buff * skb ,
2009-06-13 03:02:48 +08:00
struct net_device * slave_dev )
2005-04-17 06:20:36 +08:00
{
2010-12-13 16:19:28 +08:00
skb - > dev = slave_dev ;
2011-06-03 18:35:52 +08:00
2012-06-12 14:03:51 +08:00
BUILD_BUG_ON ( sizeof ( skb - > queue_mapping ) ! =
2012-07-20 10:28:49 +08:00
sizeof ( qdisc_skb_cb ( skb ) - > slave_dev_queue_mapping ) ) ;
2018-05-11 17:53:11 +08:00
skb_set_queue_mapping ( skb , qdisc_skb_cb ( skb ) - > slave_dev_queue_mapping ) ;
2011-06-03 18:35:52 +08:00
2012-08-10 09:24:45 +08:00
if ( unlikely ( netpoll_tx_running ( bond - > dev ) ) )
2011-02-18 07:43:32 +08:00
bond_netpoll_send_skb ( bond_get_slave_by_dev ( bond , slave_dev ) , skb ) ;
2011-02-18 07:43:33 +08:00
else
2010-05-06 15:48:51 +08:00
dev_queue_xmit ( skb ) ;
2005-04-17 06:20:36 +08:00
}
2014-09-15 23:19:34 +08:00
/* In the following 2 functions, bond_vlan_rx_add_vid and bond_vlan_rx_kill_vid,
2011-07-20 12:54:46 +08:00
* We don ' t protect the slave list iteration with a lock because :
2005-04-17 06:20:36 +08:00
* a . This operation is performed in IOCTL context ,
* b . The operation is protected by the RTNL semaphore in the 8021 q code ,
* c . Holding a lock with BH disabled while directly calling a base driver
* entry point is generally a BAD idea .
2009-06-13 03:02:48 +08:00
*
2005-04-17 06:20:36 +08:00
* The design of synchronization / protection for this operation in the 8021 q
* module is good for one or more VLAN devices over a single physical device
* and cannot be extended for a teaming solution like bonding , so there is a
* potential race condition here where a net device from the vlan group might
* be referenced ( either by a base driver or the 8021 q code ) while it is being
* removed from the system . However , it turns out we ' re not making matters
* worse , and if it works for regular VLAN usage it will work here too .
*/
/**
* bond_vlan_rx_add_vid - Propagates adding an id to slaves
* @ bond_dev : bonding net device that got called
* @ vid : vlan id being added
*/
2013-04-19 10:04:28 +08:00
static int bond_vlan_rx_add_vid ( struct net_device * bond_dev ,
__be16 proto , u16 vid )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:13 +08:00
struct slave * slave , * rollback_slave ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2013-08-01 22:54:47 +08:00
int res ;
2005-04-17 06:20:36 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2013-04-19 10:04:28 +08:00
res = vlan_vid_add ( slave - > dev , proto , vid ) ;
2011-12-08 12:11:17 +08:00
if ( res )
goto unwind ;
2005-04-17 06:20:36 +08:00
}
2011-12-09 08:52:37 +08:00
return 0 ;
2011-12-08 12:11:17 +08:00
unwind :
2013-09-25 15:20:13 +08:00
/* unwind to the slave that failed */
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , rollback_slave , iter ) {
2013-09-25 15:20:13 +08:00
if ( rollback_slave = = slave )
break ;
vlan_vid_del ( rollback_slave - > dev , proto , vid ) ;
}
2011-12-08 12:11:17 +08:00
return res ;
2005-04-17 06:20:36 +08:00
}
/**
* bond_vlan_rx_kill_vid - Propagates deleting an id to slaves
* @ bond_dev : bonding net device that got called
* @ vid : vlan id being removed
*/
2013-04-19 10:04:28 +08:00
static int bond_vlan_rx_kill_vid ( struct net_device * bond_dev ,
__be16 proto , u16 vid )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2005-04-17 06:20:36 +08:00
struct slave * slave ;
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter )
2013-04-19 10:04:28 +08:00
vlan_vid_del ( slave - > dev , proto , vid ) ;
2005-04-17 06:20:36 +08:00
2013-08-29 05:25:15 +08:00
if ( bond_is_lb ( bond ) )
bond_alb_clear_vlan ( bond , vid ) ;
2011-12-09 08:52:37 +08:00
return 0 ;
2005-04-17 06:20:36 +08:00
}
/*------------------------------- Link status -------------------------------*/
2014-09-15 23:19:34 +08:00
/* Set the carrier state for the master according to the state of its
2006-03-28 05:27:43 +08:00
* slaves . If any slaves are up , the master is up . In 802.3 ad mode ,
* do special 802.3 ad magic .
*
* Returns zero if carrier state does not change , nonzero if it does .
*/
2015-01-26 14:16:57 +08:00
int bond_set_carrier ( struct bonding * bond )
2006-03-28 05:27:43 +08:00
{
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2006-03-28 05:27:43 +08:00
struct slave * slave ;
2013-09-25 15:20:21 +08:00
if ( ! bond_has_slaves ( bond ) )
2006-03-28 05:27:43 +08:00
goto down ;
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD )
2006-03-28 05:27:43 +08:00
return bond_3ad_set_carrier ( bond ) ;
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2006-03-28 05:27:43 +08:00
if ( slave - > link = = BOND_LINK_UP ) {
if ( ! netif_carrier_ok ( bond - > dev ) ) {
netif_carrier_on ( bond - > dev ) ;
return 1 ;
}
return 0 ;
}
}
down :
if ( netif_carrier_ok ( bond - > dev ) ) {
netif_carrier_off ( bond - > dev ) ;
return 1 ;
}
return 0 ;
}
2014-09-15 23:19:34 +08:00
/* Get link speed and duplex from the slave's base driver
2005-04-17 06:20:36 +08:00
* using ethtool . If for some reason the call fails or the
bonding: update speed/duplex for NETDEV_CHANGE
Zheng Liang (lzheng@redhat.com) found a bug: when bonding is configured with
the ARP monitor, the bonding driver sometimes cannot get the speed and duplex
from its slaves and assumes them to be 100Mb/sec and Full; see
/proc/net/bonding/bond0.
The problem does not occur when miimon is used.
(Take igb for example.)
The reason is that after dev_open() in bond_enslave(),
bond_update_speed_duplex() calls igb_get_settings(), but that function runs
ethtool_cmd_speed_set(ecmd, -1); ecmd->duplex = -1;
because igb gets an error value for status.
So even though dev_open() has been called, the device is not really ready to
report its settings.
It may only be safe to call igb_get_settings() after this message shows up:
"igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX".
So I prefer to update the speed and duplex for a slave when it receives a
NETDEV_CHANGE/NETDEV_UP event.
Changelog
V2:
1 remove the "fake 100/Full" logic in bond_update_speed_duplex(),
set speed and duplex to -1 when it gets error value of speed and duplex.
2 delete the warning in bond_enslave() if bond_update_speed_duplex() returns
error.
3 make bond_info_show_slave() handle bad values of speed and duplex.
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-01 01:20:48 +08:00
* values are invalid , set speed and duplex to - 1 ,
2017-03-28 02:37:35 +08:00
* and return . Return 1 if speed or duplex settings are
* UNKNOWN ; 0 otherwise .
2005-04-17 06:20:36 +08:00
*/
2017-03-28 02:37:35 +08:00
static int bond_update_speed_duplex ( struct slave * slave )
2005-04-17 06:20:36 +08:00
{
struct net_device * slave_dev = slave - > dev ;
2016-02-25 02:58:02 +08:00
struct ethtool_link_ksettings ecmd ;
2007-08-01 05:00:02 +08:00
int res ;
2005-04-17 06:20:36 +08:00
2011-11-04 16:21:38 +08:00
slave - > speed = SPEED_UNKNOWN ;
slave - > duplex = DUPLEX_UNKNOWN ;
2005-04-17 06:20:36 +08:00
2016-02-25 02:58:02 +08:00
res = __ethtool_get_link_ksettings ( slave_dev , & ecmd ) ;
2007-08-01 05:00:02 +08:00
if ( res < 0 )
2017-03-28 02:37:35 +08:00
return 1 ;
2016-02-25 02:58:02 +08:00
if ( ecmd . base . speed = = 0 | | ecmd . base . speed = = ( ( __u32 ) - 1 ) )
2017-03-28 02:37:35 +08:00
return 1 ;
2016-02-25 02:58:02 +08:00
switch ( ecmd . base . duplex ) {
2005-04-17 06:20:36 +08:00
case DUPLEX_FULL :
case DUPLEX_HALF :
break ;
default :
2017-03-28 02:37:35 +08:00
return 1 ;
2005-04-17 06:20:36 +08:00
}
2016-02-25 02:58:02 +08:00
slave - > speed = ecmd . base . speed ;
slave - > duplex = ecmd . base . duplex ;
2005-04-17 06:20:36 +08:00
2017-03-28 02:37:35 +08:00
return 0 ;
2005-04-17 06:20:36 +08:00
}
2014-01-17 14:57:49 +08:00
const char * bond_slave_link_status ( s8 link )
{
switch ( link ) {
case BOND_LINK_UP :
return " up " ;
case BOND_LINK_FAIL :
return " going down " ;
case BOND_LINK_DOWN :
return " down " ;
case BOND_LINK_BACK :
return " going back " ;
default :
return " unknown " ;
}
}
2014-09-15 23:19:34 +08:00
/* if <dev> supports MII link status reporting, check its link status.
2005-04-17 06:20:36 +08:00
*
* We either do MII / ETHTOOL ioctls , or check netif_carrier_ok ( ) ,
2009-06-13 03:02:48 +08:00
* depending upon the setting of the use_carrier parameter .
2005-04-17 06:20:36 +08:00
*
* Return either BMSR_LSTATUS , meaning that the link is up ( or we
* can ' t tell and just pretend it is ) , or 0 , meaning that the link is
* down .
*
* If reporting is non - zero , instead of faking link up , return - 1 if
* both ETHTOOL and MII ioctls fail ( meaning the device does not
* support them ) . If use_carrier is set , return whatever it says .
* It ' d be nice if there was a good way to tell if a driver supports
* netif_carrier , but there really isn ' t .
*/
2009-06-13 03:02:48 +08:00
static int bond_check_dev_link ( struct bonding * bond ,
struct net_device * slave_dev , int reporting )
2005-04-17 06:20:36 +08:00
{
2008-11-20 13:56:05 +08:00
const struct net_device_ops * slave_ops = slave_dev - > netdev_ops ;
2009-10-29 13:23:54 +08:00
int ( * ioctl ) ( struct net_device * , struct ifreq * , int ) ;
2005-04-17 06:20:36 +08:00
struct ifreq ifr ;
struct mii_ioctl_data * mii ;
2009-08-28 20:05:15 +08:00
if ( ! reporting & & ! netif_running ( slave_dev ) )
return 0 ;
2008-11-20 13:56:05 +08:00
if ( bond - > params . use_carrier )
2018-05-17 02:02:13 +08:00
return netif_carrier_ok ( slave_dev ) ? BMSR_LSTATUS : 0 ;
2005-04-17 06:20:36 +08:00
2009-04-24 09:58:23 +08:00
/* Try to get link status using Ethtool first. */
2012-12-07 14:15:32 +08:00
if ( slave_dev - > ethtool_ops - > get_link )
return slave_dev - > ethtool_ops - > get_link ( slave_dev ) ?
BMSR_LSTATUS : 0 ;
2009-04-24 09:58:23 +08:00
2009-06-13 03:02:48 +08:00
/* Ethtool can't be used, fallback to MII ioctls. */
2008-11-20 13:56:05 +08:00
ioctl = slave_ops - > ndo_do_ioctl ;
2005-04-17 06:20:36 +08:00
if ( ioctl ) {
2014-09-15 23:19:34 +08:00
/* TODO: set pointer to correct ioctl on a per team member
* basis to make this more efficient. That is, once
* we determine the correct ioctl , we will always
* call it and not the others for that team
* member .
*/
/* We cannot assume that SIOCGMIIPHY will also read a
2005-04-17 06:20:36 +08:00
* register ; not all network drivers ( e . g . , e100 )
* support that .
*/
/* Yes, the mii is overlaid on the ifreq.ifr_ifru */
strncpy ( ifr . ifr_name , slave_dev - > name , IFNAMSIZ ) ;
mii = if_mii ( & ifr ) ;
2016-09-04 07:37:25 +08:00
if ( ioctl ( slave_dev , & ifr , SIOCGMIIPHY ) = = 0 ) {
2005-04-17 06:20:36 +08:00
mii - > reg_num = MII_BMSR ;
2016-09-04 07:37:25 +08:00
if ( ioctl ( slave_dev , & ifr , SIOCGMIIREG ) = = 0 )
2009-06-13 03:02:48 +08:00
return mii - > val_out & BMSR_LSTATUS ;
2005-04-17 06:20:36 +08:00
}
}
2014-09-15 23:19:34 +08:00
/* If reporting, report that either there's no dev->do_ioctl,
2007-08-01 05:00:02 +08:00
* or both SIOCGMIIREG and get_link failed ( meaning that we
2005-04-17 06:20:36 +08:00
* cannot report link status ) . If not reporting , pretend
* we ' re ok .
*/
2009-06-13 03:02:48 +08:00
return reporting ? - 1 : BMSR_LSTATUS ;
2005-04-17 06:20:36 +08:00
}
/*----------------------------- Multicast list ------------------------------*/
2014-09-15 23:19:34 +08:00
/* Push the promiscuity flag down to appropriate slaves */
2008-07-15 11:51:36 +08:00
static int bond_set_promiscuity ( struct bonding * bond , int inc )
2005-04-17 06:20:36 +08:00
{
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2008-07-15 11:51:36 +08:00
int err = 0 ;
2013-09-25 15:20:14 +08:00
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) ) {
2014-07-17 00:32:01 +08:00
struct slave * curr_active = rtnl_dereference ( bond - > curr_active_slave ) ;
2014-07-15 21:56:55 +08:00
if ( curr_active )
err = dev_set_promiscuity ( curr_active - > dev , inc ) ;
2005-04-17 06:20:36 +08:00
} else {
struct slave * slave ;
2013-08-01 22:54:47 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2008-07-15 11:51:36 +08:00
err = dev_set_promiscuity ( slave - > dev , inc ) ;
if ( err )
return err ;
2005-04-17 06:20:36 +08:00
}
}
2008-07-15 11:51:36 +08:00
return err ;
2005-04-17 06:20:36 +08:00
}
2014-09-15 23:19:34 +08:00
/* Push the allmulti flag down to all slaves */
2008-07-15 11:51:36 +08:00
static int bond_set_allmulti ( struct bonding * bond , int inc )
2005-04-17 06:20:36 +08:00
{
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2008-07-15 11:51:36 +08:00
int err = 0 ;
2013-09-25 15:20:14 +08:00
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) ) {
2014-07-17 00:32:01 +08:00
struct slave * curr_active = rtnl_dereference ( bond - > curr_active_slave ) ;
2014-07-15 21:56:55 +08:00
if ( curr_active )
err = dev_set_allmulti ( curr_active - > dev , inc ) ;
2005-04-17 06:20:36 +08:00
} else {
struct slave * slave ;
2013-08-01 22:54:47 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2008-07-15 11:51:36 +08:00
err = dev_set_allmulti ( slave - > dev , inc ) ;
if ( err )
return err ;
2005-04-17 06:20:36 +08:00
}
}
2008-07-15 11:51:36 +08:00
return err ;
2005-04-17 06:20:36 +08:00
}
2014-09-15 23:19:34 +08:00
/* Retrieve the list of registered multicast addresses for the bonding
2010-10-05 22:23:57 +08:00
* device and retransmit an IGMP JOIN request to the current active
* slave .
*/
2013-12-13 10:20:26 +08:00
static void bond_resend_igmp_join_requests_delayed ( struct work_struct * work )
2010-10-05 22:23:57 +08:00
{
2013-12-13 10:20:26 +08:00
struct bonding * bond = container_of ( work , struct bonding ,
mcast_work . work ) ;
2013-07-20 18:13:53 +08:00
if ( ! rtnl_trylock ( ) ) {
2013-08-01 17:51:42 +08:00
queue_delayed_work ( bond - > wq , & bond - > mcast_work , 1 ) ;
2013-07-20 18:13:53 +08:00
return ;
2010-10-05 22:23:57 +08:00
}
2013-07-20 18:13:53 +08:00
call_netdevice_notifiers ( NETDEV_RESEND_IGMP , bond - > dev ) ;
2010-10-05 22:23:57 +08:00
2013-06-12 06:07:02 +08:00
if ( bond - > igmp_retrans > 1 ) {
bond - > igmp_retrans - - ;
2010-10-05 22:23:59 +08:00
queue_delayed_work ( bond - > wq , & bond - > mcast_work , HZ / 5 ) ;
2013-06-12 06:07:02 +08:00
}
2013-12-13 10:20:26 +08:00
rtnl_unlock ( ) ;
2010-10-05 22:23:57 +08:00
}
2014-09-15 23:19:34 +08:00
/* Flush bond's hardware addresses from slave */
2013-05-31 19:57:30 +08:00
static void bond_hw_addr_flush ( struct net_device * bond_dev ,
2009-06-13 03:02:48 +08:00
struct net_device * slave_dev )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 06:20:36 +08:00
2013-05-31 19:57:30 +08:00
dev_uc_unsync ( slave_dev , bond_dev ) ;
dev_mc_unsync ( slave_dev , bond_dev ) ;
2005-04-17 06:20:36 +08:00
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
2005-04-17 06:20:36 +08:00
/* del lacpdu mc addr from mc list */
u8 lacpdu_multicast [ ETH_ALEN ] = MULTICAST_LACPDU_ADDR ;
2010-04-02 05:22:57 +08:00
dev_mc_del ( slave_dev , lacpdu_multicast ) ;
2005-04-17 06:20:36 +08:00
}
}
/*--------------------------- Active slave change ---------------------------*/
2013-05-31 19:57:30 +08:00
/* Update the hardware address list and promisc/allmulti for the new and
2014-05-16 03:39:54 +08:00
* old active slaves (if any). Modes that are not using primary keep all
* slaves up to date at all times; only the modes that use primary need to call
2013-05-31 19:57:30 +08:00
* this function to swap these settings during a failover .
2005-04-17 06:20:36 +08:00
*/
2013-05-31 19:57:30 +08:00
static void bond_hw_addr_swap ( struct bonding * bond , struct slave * new_active ,
struct slave * old_active )
2005-04-17 06:20:36 +08:00
{
if ( old_active ) {
2009-06-13 03:02:48 +08:00
if ( bond - > dev - > flags & IFF_PROMISC )
2005-04-17 06:20:36 +08:00
dev_set_promiscuity ( old_active - > dev , - 1 ) ;
2009-06-13 03:02:48 +08:00
if ( bond - > dev - > flags & IFF_ALLMULTI )
2005-04-17 06:20:36 +08:00
dev_set_allmulti ( old_active - > dev , - 1 ) ;
2013-05-31 19:57:30 +08:00
bond_hw_addr_flush ( bond - > dev , old_active - > dev ) ;
2005-04-17 06:20:36 +08:00
}
if ( new_active ) {
2008-07-15 11:51:36 +08:00
/* FIXME: Signal errors upstream. */
2009-06-13 03:02:48 +08:00
if ( bond - > dev - > flags & IFF_PROMISC )
2005-04-17 06:20:36 +08:00
dev_set_promiscuity ( new_active - > dev , 1 ) ;
2009-06-13 03:02:48 +08:00
if ( bond - > dev - > flags & IFF_ALLMULTI )
2005-04-17 06:20:36 +08:00
dev_set_allmulti ( new_active - > dev , 1 ) ;
2013-04-18 15:33:38 +08:00
netif_addr_lock_bh ( bond - > dev ) ;
2013-05-31 19:57:30 +08:00
dev_uc_sync ( new_active - > dev , bond - > dev ) ;
dev_mc_sync ( new_active - > dev , bond - > dev ) ;
2013-04-18 15:33:38 +08:00
netif_addr_unlock_bh ( bond - > dev ) ;
2005-04-17 06:20:36 +08:00
}
}
2013-06-26 23:13:39 +08:00
/**
* bond_set_dev_addr - clone slave ' s address to bond
* @ bond_dev : bond net device
* @ slave_dev : slave net device
*
* Should be called with RTNL held .
*/
2018-12-13 19:54:44 +08:00
static int bond_set_dev_addr ( struct net_device * bond_dev ,
struct net_device * slave_dev )
2013-06-26 23:13:39 +08:00
{
2018-12-13 19:54:46 +08:00
int err ;
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " bond_dev=%p slave_dev=%p slave_dev->addr_len=%d \n " ,
bond_dev , slave_dev , slave_dev - > addr_len ) ;
2018-12-13 19:54:46 +08:00
err = dev_pre_changeaddr_notify ( bond_dev , slave_dev - > dev_addr , NULL ) ;
if ( err )
return err ;
2013-06-26 23:13:39 +08:00
memcpy ( bond_dev - > dev_addr , slave_dev - > dev_addr , slave_dev - > addr_len ) ;
bond_dev - > addr_assign_type = NET_ADDR_STOLEN ;
call_netdevice_notifiers ( NETDEV_CHANGEADDR , bond_dev ) ;
2018-12-13 19:54:44 +08:00
return 0 ;
2013-06-26 23:13:39 +08:00
}
bonding: correct the MAC address for "follow" fail_over_mac policy
The "follow" fail_over_mac policy is useful for multiport devices that
either become confused or incur a performance penalty when multiple
ports are programmed with the same MAC address. However, duplicate MAC
addresses can still occur under this policy through the following steps:
1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
bond0 has the same MAC address as eth0; call it MAC1.
2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
eth1 is the backup; eth1 has MAC2.
3) ifconfig eth0 down
eth1 becomes the active slave and the bond swaps the MACs of eth0 and eth1,
so eth1 has MAC1 and eth0 has MAC2.
4) ifconfig eth1 down
there is no active slave; eth1 still has MAC1 and eth0 has MAC2.
5) ifconfig eth0 up
eth0 becomes the active slave again and the bond sets eth0 to MAC1.
This is wrong: if eth1 is then brought up, eth0 and eth1 have the same
MAC address, which breaks this policy for ACTIVE_BACKUP mode.
This patch fixes the problem by finding the old active slave and
swapping the MAC addresses before changing the active slave.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
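The five steps in the commit message above can be replayed in a small userspace model. This is only a sketch under stated assumptions: MAC addresses are reduced to integers, and `slave_model`, `bond_model`, `find_old_active`, `follow_failover` and `replay_ends_with_duplicate` are hypothetical names that mirror the idea of bond_get_old_active rather than the driver's actual code paths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct slave and struct bonding. */
struct slave_model { int mac; };
struct bond_model { int mac; struct slave_model *slaves; size_t n; };

/* Find a slave other than new_active that still carries the bond's MAC. */
static struct slave_model *find_old_active(struct bond_model *b,
					   struct slave_model *new_active)
{
	for (size_t i = 0; i < b->n; i++) {
		struct slave_model *s = &b->slaves[i];

		if (s != new_active && s->mac == b->mac)
			return s;
	}
	return NULL;
}

/* "follow" policy: swap MACs with the old active slave if one is known,
 * otherwise give the new active slave the bond's MAC. With use_fix set,
 * a missing old_active is looked up by MAC first.
 */
static void follow_failover(struct bond_model *b,
			    struct slave_model *new_active,
			    struct slave_model *old_active, int use_fix)
{
	assert(b != NULL);
	if (!new_active)
		return;		/* going to no active slave: do nothing */
	if (!old_active && use_fix)
		old_active = find_old_active(b, new_active);
	if (old_active) {
		int tmp = new_active->mac;

		new_active->mac = old_active->mac;
		old_active->mac = tmp;
	} else {
		new_active->mac = b->mac;
	}
}

/* Replay steps 1-5; returns 1 if eth0 and eth1 end up with the same MAC. */
static int replay_ends_with_duplicate(int use_fix)
{
	enum { MAC1 = 1, MAC2 = 2 };
	struct slave_model eth[2] = { { MAC1 }, { MAC2 } };	/* steps 1, 2 */
	struct bond_model bond = { MAC1, eth, 2 };

	follow_failover(&bond, &eth[1], &eth[0], use_fix);	/* step 3 */
	follow_failover(&bond, NULL, &eth[1], use_fix);		/* step 4 */
	follow_failover(&bond, &eth[0], NULL, use_fix);		/* step 5 */
	return eth[0].mac == eth[1].mac;
}
```

Calling `replay_ends_with_duplicate(0)` models the behaviour before the patch, and `replay_ends_with_duplicate(1)` models the lookup of the old active slave that the patch introduces.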
2015-07-16 16:30:02 +08:00
static struct slave * bond_get_old_active ( struct bonding * bond ,
struct slave * new_active )
{
struct slave * slave ;
struct list_head * iter ;
bond_for_each_slave ( bond , slave , iter ) {
if ( slave = = new_active )
continue ;
if ( ether_addr_equal ( bond - > dev - > dev_addr , slave - > dev - > dev_addr ) )
return slave ;
}
return NULL ;
}
2014-09-15 23:19:34 +08:00
/* bond_do_fail_over_mac
2008-05-18 12:10:14 +08:00
*
* Perform special MAC address swapping for fail_over_mac settings
*
2014-09-12 04:49:24 +08:00
* Called with RTNL
2008-05-18 12:10:14 +08:00
*/
static void bond_do_fail_over_mac ( struct bonding * bond ,
struct slave * new_active ,
struct slave * old_active )
{
bonding: attempt to better support longer hw addresses
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e7731522, so the alb code is where most of the changes are.
One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over devices with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.
Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0
Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0
Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
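The size argument in the commit message above can be checked from userspace. The sketch below is an illustration only: `hw_addr_copy_sketch`, `fits_in_sockaddr`, `fits_in_sockaddr_storage` and the `SKETCH_` constants are hypothetical names, the constants simply restate the 20 and 32 octet figures from the text, and the real bond_hw_addr_copy may differ in detail.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

#define SKETCH_INFINIBAND_ALEN	20	/* restates INFINIBAND_ALEN from the text */
#define SKETCH_MAX_ADDR_LEN	32	/* restates MAX_ADDR_LEN from the text */

/* Length-aware copy, in the spirit of a helper that takes a length
 * argument instead of assuming a 6-octet Ethernet address.
 */
static void hw_addr_copy_sketch(unsigned char *dst, const unsigned char *src,
				unsigned int len)
{
	assert(len <= SKETCH_MAX_ADDR_LEN);
	memcpy(dst, src, len);
}

/* Does an address of len octets fit in struct sockaddr's sa_data? */
static int fits_in_sockaddr(unsigned int len)
{
	return len <= sizeof(((struct sockaddr *)0)->sa_data);
}

/* Does it fit in struct sockaddr_storage after the family field? */
static int fits_in_sockaddr_storage(unsigned int len)
{
	return len <= sizeof(struct sockaddr_storage) -
		      sizeof(((struct sockaddr_storage *)0)->ss_family);
}
```

On a typical Linux system a 6-octet Ethernet address fits in `struct sockaddr`, while a 20-octet IPoIB address only fits in `struct sockaddr_storage`, which is the motivation given for the conversion.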
2017-04-05 05:32:42 +08:00
u8 tmp_mac [ MAX_ADDR_LEN ] ;
struct sockaddr_storage ss ;
2008-05-18 12:10:14 +08:00
int rv ;
switch ( bond - > params . fail_over_mac ) {
case BOND_FOM_ACTIVE :
2018-12-13 19:54:44 +08:00
if ( new_active ) {
rv = bond_set_dev_addr ( bond - > dev , new_active - > dev ) ;
if ( rv )
2019-06-07 22:59:29 +08:00
slave_err ( bond - > dev , new_active - > dev , " Error %d setting bond MAC from slave \n " ,
- rv ) ;
2018-12-13 19:54:44 +08:00
}
2008-05-18 12:10:14 +08:00
break ;
case BOND_FOM_FOLLOW :
2014-09-15 23:19:34 +08:00
/* if new_active && old_active, swap them
2008-05-18 12:10:14 +08:00
* if just old_active , do nothing ( going to no active slave )
* if just new_active , set new_active to bond ' s MAC
*/
if ( ! new_active )
return ;
2015-07-16 16:30:02 +08:00
if ( ! old_active )
old_active = bond_get_old_active ( bond , new_active ) ;
2008-05-18 12:10:14 +08:00
if ( old_active ) {
2017-04-05 05:32:42 +08:00
bond_hw_addr_copy ( tmp_mac , new_active - > dev - > dev_addr ,
new_active - > dev - > addr_len ) ;
bond_hw_addr_copy ( ss . __data ,
old_active - > dev - > dev_addr ,
old_active - > dev - > addr_len ) ;
ss . ss_family = new_active - > dev - > type ;
2008-05-18 12:10:14 +08:00
} else {
2017-04-05 05:32:42 +08:00
bond_hw_addr_copy ( ss . __data , bond - > dev - > dev_addr ,
bond - > dev - > addr_len ) ;
ss . ss_family = bond - > dev - > type ;
2008-05-18 12:10:14 +08:00
}
bonding: attempt to better support longer hw addresses
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e7731522, so the alb code is where most of the changes are.
One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over devices with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.
Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0
Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0
Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 05:32:42 +08:00
rv = dev_set_mac_address ( new_active - > dev ,
2018-12-13 19:54:30 +08:00
( struct sockaddr * ) & ss , NULL ) ;
2008-05-18 12:10:14 +08:00
if ( rv ) {
2019-06-07 22:59:29 +08:00
slave_err ( bond - > dev , new_active - > dev , " Error %d setting MAC of new active slave \n " ,
- rv ) ;
2008-05-18 12:10:14 +08:00
goto out ;
}
if ( ! old_active )
goto out ;
2017-04-05 05:32:42 +08:00
bond_hw_addr_copy ( ss . __data , tmp_mac ,
new_active - > dev - > addr_len ) ;
ss . ss_family = old_active - > dev - > type ;
2008-05-18 12:10:14 +08:00
2017-04-05 05:32:42 +08:00
rv = dev_set_mac_address ( old_active - > dev ,
2018-12-13 19:54:30 +08:00
( struct sockaddr * ) & ss , NULL ) ;
2008-05-18 12:10:14 +08:00
if ( rv )
2019-06-07 22:59:29 +08:00
slave_err ( bond - > dev , old_active - > dev , " Error %d setting MAC of old active slave \n " ,
- rv ) ;
2008-05-18 12:10:14 +08:00
out :
break ;
default :
2014-07-16 01:35:58 +08:00
netdev_err ( bond - > dev , " bond_do_fail_over_mac impossible: bad policy %d \n " ,
bond - > params . fail_over_mac ) ;
2008-05-18 12:10:14 +08:00
break ;
}
}
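The commit message quoted earlier in bond_do_fail_over_mac() contrasts struct sockaddr, whose payload holds only 14 octets, with struct sockaddr_storage. What follows is a minimal user-space sketch of that idea, not the kernel code. The names sketch_hw_addr_copy() and sketch_store_hw_addr() and the SKETCH_* constants are hypothetical, and the payload is written at the offset of sa_data because user-space headers do not expose the kernel's __data member.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical mirrors of the kernel constants named in the commit message. */
#define SKETCH_INFINIBAND_ALEN 20
#define SKETCH_MAX_ADDR_LEN    32

/* Length-aware copy: a user-space stand-in for the idea behind bond_hw_addr_copy(). */
static void sketch_hw_addr_copy(unsigned char *dst, const unsigned char *src,
                                size_t len)
{
	memcpy(dst, src, len);
}

/*
 * Store a hardware address of up to SKETCH_MAX_ADDR_LEN octets in a
 * sockaddr_storage. Returns 0 on success, -1 if the address is too long.
 */
static int sketch_store_hw_addr(struct sockaddr_storage *ss,
                                unsigned short family,
                                const unsigned char *addr, size_t len)
{
	const size_t off = offsetof(struct sockaddr, sa_data);

	if (len > SKETCH_MAX_ADDR_LEN || off + len > sizeof(*ss))
		return -1;

	memset(ss, 0, sizeof(*ss));
	ss->ss_family = family;
	sketch_hw_addr_copy((unsigned char *)ss + off, addr, len);
	return 0;
}
```

A 20-octet IPoIB address would not fit in the sa_data member of a plain struct sockaddr declared on the stack, which is the overflow the commit message warns about; the larger storage type leaves room for it.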
2015-07-07 17:34:50 +08:00
static struct slave * bond_choose_primary_or_current ( struct bonding * bond )
2009-09-25 11:28:09 +08:00
{
2014-09-10 05:17:00 +08:00
struct slave * prim = rtnl_dereference ( bond - > primary_slave ) ;
2014-09-12 04:49:24 +08:00
struct slave * curr = rtnl_dereference ( bond - > curr_active_slave ) ;
2009-09-25 11:28:09 +08:00
2015-07-07 17:34:50 +08:00
if ( ! prim | | prim - > link ! = BOND_LINK_UP ) {
if ( ! curr | | curr - > link ! = BOND_LINK_UP )
return NULL ;
return curr ;
}
2009-09-25 11:28:09 +08:00
if ( bond - > force_primary ) {
bond - > force_primary = false ;
2015-07-07 17:34:50 +08:00
return prim ;
}
if ( ! curr | | curr - > link ! = BOND_LINK_UP )
return prim ;
/* At this point, prim and curr are both up */
switch ( bond - > params . primary_reselect ) {
case BOND_PRI_RESELECT_ALWAYS :
return prim ;
case BOND_PRI_RESELECT_BETTER :
if ( prim - > speed < curr - > speed )
return curr ;
if ( prim - > speed = = curr - > speed & & prim - > duplex < = curr - > duplex )
return curr ;
return prim ;
case BOND_PRI_RESELECT_FAILURE :
return curr ;
default :
netdev_err ( bond - > dev , " impossible primary_reselect %d \n " ,
bond - > params . primary_reselect ) ;
return curr ;
2009-09-25 11:28:09 +08:00
}
}
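The decision order in bond_choose_primary_or_current() can be exercised outside the kernel. The following is a simplified sketch under stated assumptions: the struct, enum, and function names are hypothetical, RCU and RTNL are ignored, and the error-logging default branch is folded into the "keep current" case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_link { SKETCH_LINK_DOWN, SKETCH_LINK_UP };

enum sketch_reselect {
	SKETCH_RESELECT_ALWAYS,
	SKETCH_RESELECT_BETTER,
	SKETCH_RESELECT_FAILURE
};

struct sketch_slave {
	enum sketch_link link;
	int speed;
	int duplex;
};

/* Return the slave that should be active, or NULL if neither candidate is up. */
static const struct sketch_slave *
sketch_choose(const struct sketch_slave *prim, const struct sketch_slave *curr,
              bool *force_primary, enum sketch_reselect policy)
{
	bool curr_up = curr && curr->link == SKETCH_LINK_UP;

	/* Primary unusable: fall back to the current slave if it is up. */
	if (!prim || prim->link != SKETCH_LINK_UP)
		return curr_up ? curr : NULL;

	/* A one-shot request to prefer the primary overrides the policy. */
	if (*force_primary) {
		*force_primary = false;
		return prim;
	}

	if (!curr_up)
		return prim;

	/* Both are up: the reselect policy decides. */
	switch (policy) {
	case SKETCH_RESELECT_ALWAYS:
		return prim;
	case SKETCH_RESELECT_BETTER:
		if (prim->speed < curr->speed)
			return curr;
		if (prim->speed == curr->speed && prim->duplex <= curr->duplex)
			return curr;
		return prim;
	case SKETCH_RESELECT_FAILURE:
	default:
		return curr;
	}
}
```

Note that under the "better" policy a primary that merely ties the current slave on speed and duplex does not take over, which avoids a needless failover.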
2008-05-18 12:10:14 +08:00
2005-04-17 06:20:36 +08:00
/**
2015-07-07 17:34:50 +08:00
* bond_find_best_slave - select the best available slave to be the active one
2005-04-17 06:20:36 +08:00
* @ bond : our bonding struct
*/
static struct slave * bond_find_best_slave ( struct bonding * bond )
{
2015-07-07 17:34:50 +08:00
struct slave * slave , * bestslave = NULL ;
2013-09-25 15:20:18 +08:00
struct list_head * iter ;
2005-04-17 06:20:36 +08:00
int mintime = bond - > params . updelay ;
2015-07-07 17:34:50 +08:00
slave = bond_choose_primary_or_current ( bond ) ;
if ( slave )
return slave ;
2005-04-17 06:20:36 +08:00
2013-09-25 15:20:18 +08:00
bond_for_each_slave ( bond , slave , iter ) {
if ( slave - > link = = BOND_LINK_UP )
return slave ;
2014-05-16 03:39:57 +08:00
if ( slave - > link = = BOND_LINK_BACK & & bond_slave_is_up ( slave ) & &
2013-09-25 15:20:18 +08:00
slave - > delay < mintime ) {
mintime = slave - > delay ;
bestslave = slave ;
2005-04-17 06:20:36 +08:00
}
}
return bestslave ;
}
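The fallback loop in bond_find_best_slave() can be sketched in user space as follows. This covers only the scan that runs after bond_choose_primary_or_current() returns NULL; the types and names are hypothetical, and the slave list is modelled as a plain array rather than the kernel's list iterator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_link_state { SKETCH_DOWN, SKETCH_BACK, SKETCH_UP };

struct sketch_candidate {
	enum sketch_link_state link;
	bool carrier_up;   /* stands in for bond_slave_is_up() */
	int delay;         /* remaining updelay ticks */
};

/*
 * Return the first slave whose link is up. Otherwise return the slave that is
 * coming back with the smallest remaining delay below updelay, or NULL.
 */
static const struct sketch_candidate *
sketch_find_best(const struct sketch_candidate *slaves, size_t n, int updelay)
{
	const struct sketch_candidate *best = NULL;
	int mintime = updelay;

	for (size_t i = 0; i < n; i++) {
		const struct sketch_candidate *s = &slaves[i];

		if (s->link == SKETCH_UP)
			return s;
		if (s->link == SKETCH_BACK && s->carrier_up &&
		    s->delay < mintime) {
			mintime = s->delay;
			best = s;
		}
	}
	return best;
}
```

Because the comparison is strict, a slave whose remaining delay equals updelay is never picked early.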
2011-04-26 23:25:52 +08:00
static bool bond_should_notify_peers ( struct bonding * bond )
{
bonding: rebuild the lock use for bond_mii_monitor()
bond_mii_monitor() still uses the bond lock to protect the bond slave list,
which has no effect. There are two ways to fix the problem: move RTNL to the
top of the function, or use RCU to protect the bond slave list. According to
Jay Vosburgh, taking RTNL for the whole monitor, which runs about 10 times
per second, would be a big performance loss, so I took the advice and used
RCU to protect the bond slave list.
bond_has_slave() is not protected by anything. Nothing bad happens if the
slave list changes, unless the bond has been freed, and that cannot happen
while the monitor is running because the bond is closed before it is freed.
The peer notification reads curr_active_slave, so dereference it once to
make sure we keep accessing the same slave even if curr_active_slave
changes. rcu_dereference() must be used inside a read-side critical
section, and bond_change_active_slave() calls this code without holding
RCU, so do the peer-notify check under rcu_read_lock(), which will be
nested when called from the monitor.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-13 10:19:39 +08:00
struct slave * slave ;
rcu_read_lock ( ) ;
slave = rcu_dereference ( bond - > curr_active_slave ) ;
rcu_read_unlock ( ) ;
2011-04-26 23:25:52 +08:00
2014-07-16 01:35:58 +08:00
netdev_dbg ( bond - > dev , " bond_should_notify_peers: slave %s \n " ,
slave ? slave - > dev - > name : " NULL " ) ;
2011-04-26 23:25:52 +08:00
if ( ! slave | | ! bond - > send_peer_notif | |
bonding: add an option to specify a delay between peer notifications
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.
Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.
When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:
20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.
iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 01:43:54 +08:00
bond - > send_peer_notif %
max ( 1 , bond - > params . peer_notif_delay ) ! = 0 | |
2015-08-11 22:57:23 +08:00
! netif_carrier_ok ( bond - > dev ) | |
2011-04-26 23:25:52 +08:00
test_bit ( __LINK_STATE_LINKWATCH_PENDING , & slave - > dev - > state ) )
return false ;
return true ;
}
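The modulo test in bond_should_notify_peers() spaces peer notifications out over monitor ticks. The sketch below is a user-space model under stated assumptions: the names are hypothetical, peer_notif_delay is taken to be expressed in monitor ticks, and the counter is assumed to drop by one on every tick, as the peer_notif_delay commit message describes.

```c
#include <assert.h>
#include <stdbool.h>

/* max(1, delay): a delay of 0 means "notify on every tick". */
static unsigned int sketch_step(unsigned int peer_notif_delay)
{
	return peer_notif_delay > 0 ? peer_notif_delay : 1;
}

/* Mirrors the conditions checked by bond_should_notify_peers(). */
static bool sketch_should_notify(bool have_active, unsigned int send_peer_notif,
                                 unsigned int peer_notif_delay,
                                 bool carrier_ok, bool linkwatch_pending)
{
	if (!have_active || send_peer_notif == 0 ||
	    send_peer_notif % sketch_step(peer_notif_delay) != 0 ||
	    !carrier_ok || linkwatch_pending)
		return false;
	return true;
}

/*
 * Count how many notifications fire when the counter starts at
 * num_peer_notif * max(1, peer_notif_delay) and drops by one per tick.
 */
static unsigned int sketch_count_notifications(unsigned int num_peer_notif,
                                               unsigned int peer_notif_delay)
{
	unsigned int counter = num_peer_notif * sketch_step(peer_notif_delay);
	unsigned int fired = 0;

	while (counter > 0) {
		if (sketch_should_notify(true, counter, peer_notif_delay,
		                         true, false))
			fired++;
		counter--;
	}
	return fired;
}
```

With a step of 5 ticks and a 100 ms monitor interval, consecutive notifications in this model are 500 ms apart, which is consistent with the spacing shown in the commit message's capture.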
2005-04-17 06:20:36 +08:00
/**
* bond_change_active_slave - change the active slave into the specified one
* @ bond : our bonding struct
* @ new_active : the new slave to make the active one
*
* Set the new slave to the bond ' s settings and unset them on the old
* curr_active_slave .
* Settings include flags , mc - list , promiscuity , allmulti , etc .
*
* If @ new_active ' s link state is % BOND_LINK_BACK we ' ll set it to % BOND_LINK_UP ,
* because it is apparently the best available slave we have , even though its
* updelay hasn ' t timed out yet .
*
2014-09-12 04:49:24 +08:00
* Caller must hold RTNL .
2005-04-17 06:20:36 +08:00
*/
2005-11-10 02:35:51 +08:00
void bond_change_active_slave ( struct bonding * bond , struct slave * new_active )
2005-04-17 06:20:36 +08:00
{
2014-07-15 21:56:55 +08:00
struct slave * old_active ;
2014-09-12 04:49:24 +08:00
ASSERT_RTNL ( ) ;
old_active = rtnl_dereference ( bond - > curr_active_slave ) ;
2005-04-17 06:20:36 +08:00
2009-06-13 03:02:48 +08:00
if ( old_active = = new_active )
2005-04-17 06:20:36 +08:00
return ;
if ( new_active ) {
2014-02-18 14:48:46 +08:00
new_active - > last_link_up = jiffies ;
2008-05-18 12:10:13 +08:00
2005-04-17 06:20:36 +08:00
if ( new_active - > link = = BOND_LINK_BACK ) {
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) ) {
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , new_active - > dev , " making interface the new active one %d ms earlier \n " ,
( bond - > params . updelay - new_active - > delay ) * bond - > params . miimon ) ;
2005-04-17 06:20:36 +08:00
}
new_active - > delay = 0 ;
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( new_active , BOND_LINK_UP ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD )
2005-04-17 06:20:36 +08:00
bond_3ad_handle_link_change ( new_active , BOND_LINK_UP ) ;
2008-12-10 15:07:13 +08:00
if ( bond_is_lb ( bond ) )
2005-04-17 06:20:36 +08:00
bond_alb_handle_link_change ( bond , new_active , BOND_LINK_UP ) ;
} else {
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) ) {
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , new_active - > dev , " making interface the new active one \n " ) ;
2005-04-17 06:20:36 +08:00
}
}
}
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) )
2013-05-31 19:57:30 +08:00
bond_hw_addr_swap ( bond , new_active , old_active ) ;
2005-04-17 06:20:36 +08:00
2008-12-10 15:07:13 +08:00
if ( bond_is_lb ( bond ) ) {
2005-04-17 06:20:36 +08:00
bond_alb_handle_active_change ( bond , new_active ) ;
2006-02-22 08:36:44 +08:00
if ( old_active )
bonding: Fix RTNL: assertion failed at net/core/rtnetlink.c for 802.3ad mode
The problem was introduced by the commit 1d3ee88ae0d
(bonding: add netlink attributes to slave link dev).
The bond_set_active_slave() and bond_set_backup_slave()
will use rtmsg_ifinfo to send slave's states, so these
two functions should be called in RTNL.
In 802.3ad mode, acquiring RTNL for the __enable_port and
__disable_port cases is difficult, as those calls generally
already hold the state machine lock, and cannot unconditionally
call rtnl_lock because either they already hold RTNL (for calls
via bond_3ad_unbind_slave) or due to the potential for deadlock
with bond_3ad_adapter_speed_changed, bond_3ad_adapter_duplex_changed,
bond_3ad_link_change, or bond_3ad_update_lacp_rate. All four of
those are called with RTNL held, and acquire the state machine lock
second. The calling contexts for __enable_port and __disable_port
already hold the state machine lock, and may or may not need RTNL.
Following Jay's opinion, I don't think it is a problem if the slave does
not send the notify message synchronously when its status changes. The
state machine normally runs every 100 ms, so sending the notify message
at the end of the state machine run, if the slave's state changed,
should be better.
I fixed the problem in these steps:
1). Add a new function bond_set_slave_state() which changes the slave's
state and calls rtmsg_ifinfo() according to the input parameter
named notify.
2). Add a new slave field called should_notify. If the slave's state
changed and no notification has been sent yet, the field is set to 1; if
the slave's state then changes again, the field is set to 0, which indicates
that the slave's state has been restored and nobody needs to be notified.
3). __enable_port and __disable_port should not call rtmsg_ifinfo
under the state machine lock. Any change in the slave's state sets
a flag in the slave, which indicates that rtmsg_ifinfo should be
called at the end of the state machine.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( old_active ,
BOND_SLAVE_NOTIFY_NOW ) ;
2006-02-22 08:36:44 +08:00
if ( new_active )
2014-02-26 11:05:22 +08:00
bond_set_slave_active_flags ( new_active ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
} else {
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2. Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover
while doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 22:54:51 +08:00
rcu_assign_pointer ( bond - > curr_active_slave , new_active ) ;
2005-04-17 06:20:36 +08:00
}
2005-06-27 05:52:20 +08:00
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_ACTIVEBACKUP ) {
2009-06-13 03:02:48 +08:00
if ( old_active )
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( old_active ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-06-27 05:52:20 +08:00
if ( new_active ) {
2011-04-26 23:25:52 +08:00
bool should_notify_peers = false ;
2014-02-26 11:05:22 +08:00
bond_set_slave_active_flags ( new_active ,
BOND_SLAVE_NOTIFY_NOW ) ;
2007-10-10 10:43:39 +08:00
2008-06-14 09:12:01 +08:00
if ( bond - > params . fail_over_mac )
bond_do_fail_over_mac ( bond , new_active ,
old_active ) ;
2008-05-18 12:10:14 +08:00
2011-04-26 23:25:52 +08:00
if ( netif_running ( bond - > dev ) ) {
bond - > send_peer_notif =
2019-07-03 01:43:54 +08:00
bond - > params . num_peer_notif *
max ( 1 , bond - > params . peer_notif_delay ) ;
2011-04-26 23:25:52 +08:00
should_notify_peers =
bond_should_notify_peers ( bond ) ;
}
2012-08-10 06:14:57 +08:00
call_netdevice_notifiers ( NETDEV_BONDING_FAILOVER , bond - > dev ) ;
2019-07-03 01:43:54 +08:00
if ( should_notify_peers ) {
bond - > send_peer_notif - - ;
2012-08-10 06:14:57 +08:00
call_netdevice_notifiers ( NETDEV_NOTIFY_PEERS ,
bond - > dev ) ;
2019-07-03 01:43:54 +08:00
}
2008-05-18 12:10:12 +08:00
}
2005-06-27 05:52:20 +08:00
}
2010-03-25 22:49:05 +08:00
2010-10-05 22:23:57 +08:00
/* resend IGMP joins since active slave has changed or
2011-05-25 16:38:58 +08:00
* all were sent on curr_active_slave .
* resend only if bond is brought up with the affected
2014-09-15 23:19:34 +08:00
* bonding modes and the retransmission is enabled
*/
2011-05-25 16:38:58 +08:00
if ( netif_running ( bond - > dev ) & & ( bond - > params . resend_igmp > 0 ) & &
2014-05-16 03:39:54 +08:00
( ( bond_uses_primary ( bond ) & & new_active ) | |
2014-05-16 03:39:55 +08:00
BOND_MODE ( bond ) = = BOND_MODE_ROUNDROBIN ) ) {
2010-10-05 22:23:59 +08:00
bond - > igmp_retrans = bond - > params . resend_igmp ;
2013-08-01 17:51:42 +08:00
queue_delayed_work ( bond - > wq , & bond - > mcast_work , 1 ) ;
2010-03-25 22:49:05 +08:00
}
2005-04-17 06:20:36 +08:00
}
/**
* bond_select_active_slave - select a new active slave , if needed
* @ bond : our bonding struct
*
2009-06-13 03:02:48 +08:00
* This function should be called when one of the following occurs :
2005-04-17 06:20:36 +08:00
* - The old curr_active_slave has been released or lost its link .
* - The primary_slave has got its link back .
* - A slave has got its link back and there ' s no old curr_active_slave .
*
2014-09-12 04:49:24 +08:00
* Caller must hold RTNL .
2005-04-17 06:20:36 +08:00
*/
2005-11-10 02:35:51 +08:00
void bond_select_active_slave ( struct bonding * bond )
2005-04-17 06:20:36 +08:00
{
struct slave * best_slave ;
2006-03-28 05:27:43 +08:00
int rv ;
2005-04-17 06:20:36 +08:00
2014-09-15 23:19:35 +08:00
ASSERT_RTNL ( ) ;
2005-04-17 06:20:36 +08:00
best_slave = bond_find_best_slave ( bond ) ;
2014-09-12 04:49:24 +08:00
if ( best_slave ! = rtnl_dereference ( bond - > curr_active_slave ) ) {
2005-04-17 06:20:36 +08:00
bond_change_active_slave ( bond , best_slave ) ;
2006-03-28 05:27:43 +08:00
rv = bond_set_carrier ( bond ) ;
if ( ! rv )
return ;
2016-02-03 10:02:32 +08:00
if ( netif_carrier_ok ( bond - > dev ) )
2019-07-02 01:48:51 +08:00
netdev_info ( bond - > dev , " active interface up! \n " ) ;
2016-02-03 10:02:32 +08:00
else
2014-07-16 01:35:58 +08:00
netdev_info ( bond - > dev , " now running without any active interface! \n " ) ;
2005-04-17 06:20:36 +08:00
}
}
2010-05-06 15:48:51 +08:00
# ifdef CONFIG_NET_POLL_CONTROLLER
2011-02-18 07:43:32 +08:00
static inline int slave_enable_netpoll ( struct slave * slave )
2010-05-06 15:48:51 +08:00
{
2011-02-18 07:43:32 +08:00
struct netpoll * np ;
int err = 0 ;
2010-05-06 15:48:51 +08:00
2014-03-28 06:36:38 +08:00
np = kzalloc ( sizeof ( * np ) , GFP_KERNEL ) ;
2011-02-18 07:43:32 +08:00
err = - ENOMEM ;
if ( ! np )
goto out ;
2014-03-28 06:36:38 +08:00
err = __netpoll_setup ( np , slave - > dev ) ;
2011-02-18 07:43:32 +08:00
if ( err ) {
kfree ( np ) ;
goto out ;
2010-05-06 15:48:51 +08:00
}
2011-02-18 07:43:32 +08:00
slave - > np = np ;
out:
	return err;
}

static inline void slave_disable_netpoll(struct slave *slave)
{
	struct netpoll *np = slave->np;

	if (!np)
		return;

	slave->np = NULL;
2018-10-18 23:18:26 +08:00
__netpoll_free ( np ) ;
2011-02-18 07:43:32 +08:00
}
2010-05-06 15:48:51 +08:00
static void bond_poll_controller ( struct net_device * bond_dev )
{
2015-03-05 13:57:52 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
struct slave * slave = NULL ;
struct list_head * iter ;
struct ad_info ad_info ;
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD )
if ( bond_3ad_get_active_agg_info ( bond , & ad_info ) )
return ;
bond_for_each_slave_rcu ( bond , slave , iter ) {
2018-09-22 06:27:39 +08:00
if ( ! bond_slave_is_up ( slave ) )
2015-03-05 13:57:52 +08:00
continue ;
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
struct aggregator * agg =
SLAVE_AD_INFO ( slave ) - > port . aggregator ;
if ( agg & &
agg - > aggregator_identifier ! = ad_info . aggregator_id )
continue ;
}
2018-09-22 06:27:39 +08:00
netpoll_poll_dev ( slave - > dev ) ;
2015-03-05 13:57:52 +08:00
}
2011-02-18 07:43:32 +08:00
}
2013-07-23 15:25:27 +08:00
static void bond_netpoll_cleanup ( struct net_device * bond_dev )
2011-02-18 07:43:32 +08:00
{
2013-07-23 15:25:27 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2010-10-14 00:01:49 +08:00
struct slave * slave ;
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter )
2014-05-16 03:39:57 +08:00
if ( bond_slave_is_up ( slave ) )
2011-02-18 07:43:32 +08:00
slave_disable_netpoll ( slave ) ;
2010-05-06 15:48:51 +08:00
}
2011-02-18 07:43:32 +08:00
2014-03-28 06:36:38 +08:00
static int bond_netpoll_setup ( struct net_device * dev , struct netpoll_info * ni )
2011-02-18 07:43:32 +08:00
{
struct bonding * bond = netdev_priv ( dev ) ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2010-05-06 15:48:51 +08:00
struct slave * slave ;
2013-08-01 22:54:47 +08:00
int err = 0 ;
2010-05-06 15:48:51 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2011-02-18 07:43:32 +08:00
err = slave_enable_netpoll ( slave ) ;
if ( err ) {
2013-07-23 15:25:27 +08:00
bond_netpoll_cleanup ( dev ) ;
2011-02-18 07:43:32 +08:00
break ;
2010-05-06 15:48:51 +08:00
}
}
2011-02-18 07:43:32 +08:00
return err ;
2010-05-06 15:48:51 +08:00
}
2011-02-18 07:43:32 +08:00
# else
static inline int slave_enable_netpoll(struct slave *slave)
{
	return 0;
}
static inline void slave_disable_netpoll(struct slave *slave)
{
}
2010-05-06 15:48:51 +08:00
static void bond_netpoll_cleanup ( struct net_device * bond_dev )
{
}
# endif
2005-04-17 06:20:36 +08:00
/*---------------------------------- IOCTL ----------------------------------*/
2011-11-15 23:29:55 +08:00
static netdev_features_t bond_fix_features ( struct net_device * dev ,
2013-09-02 19:51:41 +08:00
netdev_features_t features )
2005-08-23 13:34:53 +08:00
{
2011-05-07 11:22:17 +08:00
struct bonding * bond = netdev_priv ( dev ) ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2011-11-15 23:29:55 +08:00
netdev_features_t mask ;
2013-09-02 19:51:41 +08:00
struct slave * slave ;
2008-10-23 16:11:29 +08:00
2015-05-11 00:48:07 +08:00
mask = features ;
2015-01-30 14:40:16 +08:00
2008-10-23 16:11:29 +08:00
features & = ~ NETIF_F_ONE_FOR_ALL ;
2011-05-07 11:22:17 +08:00
features | = NETIF_F_ALL_FOR_ALL ;
2007-08-11 06:47:58 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2008-10-23 16:11:29 +08:00
features = netdev_increment_features ( features ,
slave - > dev - > features ,
2011-05-07 11:22:17 +08:00
mask ) ;
}
2013-05-16 15:34:53 +08:00
features = netdev_add_tso_features ( features , mask ) ;
2011-05-07 11:22:17 +08:00
return features ;
}
2015-12-15 03:19:43 +08:00
# define BOND_VLAN_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \
2011-07-13 22:10:29 +08:00
NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
NETIF_F_HIGHDMA | NETIF_F_LRO )
2011-05-07 11:22:17 +08:00
2015-12-15 03:19:43 +08:00
# define BOND_ENC_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \
NETIF_F_RXCSUM | NETIF_F_ALL_TSO )
2014-06-17 21:11:09 +08:00
2019-06-04 06:36:46 +08:00
# define BOND_MPLS_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \
NETIF_F_ALL_TSO )
2011-05-07 11:22:17 +08:00
static void bond_compute_features ( struct bonding * bond )
{
2014-10-06 09:38:35 +08:00
unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
IFF_XMIT_DST_RELEASE_PERM ;
2011-11-15 23:29:55 +08:00
netdev_features_t vlan_features = BOND_VLAN_FEATURES ;
2014-06-17 21:11:09 +08:00
netdev_features_t enc_features = BOND_ENC_FEATURES ;
2019-06-04 06:36:46 +08:00
netdev_features_t mpls_features = BOND_MPLS_FEATURES ;
2013-09-25 15:20:14 +08:00
struct net_device * bond_dev = bond - > dev ;
struct list_head * iter ;
struct slave * slave ;
2011-05-07 11:22:17 +08:00
unsigned short max_hard_header_len = ETH_HLEN ;
2012-11-21 12:35:03 +08:00
unsigned int gso_max_size = GSO_MAX_SIZE ;
u16 gso_max_segs = GSO_MAX_SEGS ;
2011-05-07 11:22:17 +08:00
2013-09-25 15:20:21 +08:00
if ( ! bond_has_slaves ( bond ) )
2011-05-07 11:22:17 +08:00
goto done ;
2014-05-20 14:29:35 +08:00
vlan_features & = NETIF_F_ALL_FOR_ALL ;
2019-06-04 06:36:46 +08:00
mpls_features & = NETIF_F_ALL_FOR_ALL ;
2011-05-07 11:22:17 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2009-08-28 20:05:12 +08:00
vlan_features = netdev_increment_features ( vlan_features ,
2011-05-07 11:22:17 +08:00
slave - > dev - > vlan_features , BOND_VLAN_FEATURES ) ;
2014-06-17 21:11:09 +08:00
enc_features = netdev_increment_features ( enc_features ,
slave - > dev - > hw_enc_features ,
BOND_ENC_FEATURES ) ;
2019-06-04 06:36:46 +08:00
mpls_features = netdev_increment_features ( mpls_features ,
slave - > dev - > mpls_features ,
BOND_MPLS_FEATURES ) ;
2012-07-17 20:19:48 +08:00
dst_release_flag & = slave - > dev - > priv_flags ;
2006-09-23 12:53:39 +08:00
if ( slave - > dev - > hard_header_len > max_hard_header_len )
max_hard_header_len = slave - > dev - > hard_header_len ;
2012-11-21 12:35:03 +08:00
gso_max_size = min ( gso_max_size , slave - > dev - > gso_max_size ) ;
gso_max_segs = min ( gso_max_segs , slave - > dev - > gso_max_segs ) ;
2006-09-23 12:53:39 +08:00
}
2017-04-28 01:29:34 +08:00
bond_dev - > hard_header_len = max_hard_header_len ;
2005-08-23 13:34:53 +08:00
2008-10-23 16:11:29 +08:00
done :
2011-05-07 11:22:17 +08:00
bond_dev - > vlan_features = vlan_features ;
2018-05-22 23:34:40 +08:00
bond_dev - > hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
bonding: Add vlan tx offload to hw_enc_features
As commit 30d8177e8ac7 ("bonding: Always enable vlan tx offload")
said, we should always enable bonding's vlan tx offload and pass the
vlan packets to the slave devices with the vlan tci, letting them handle
the vlan implementation.
Now if an encapsulation protocol like VXLAN is used, skb->encapsulation
may be set, and the packet is then passed to a vlan device which is based
on the bonding device. However, in netif_skb_features(), the check of
hw_enc_features:
	if (skb->encapsulation)
		features &= dev->hw_enc_features;
clears NETIF_F_HW_VLAN_CTAG_TX/NETIF_F_HW_VLAN_STAG_TX. This results
in the same issue as in commit 30d8177e8ac7, like this:
vlan_dev_hard_start_xmit
  -->dev_queue_xmit
    -->validate_xmit_skb
      -->netif_skb_features //NETIF_F_HW_VLAN_CTAG_TX is cleared
      -->validate_xmit_vlan
        -->__vlan_hwaccel_push_inside //skb->tci is cleared
...
  --> bond_start_xmit
    --> bond_xmit_hash //BOND_XMIT_POLICY_ENCAP34
      --> __skb_flow_dissect // nhoff point to IP header
        --> case htons(ETH_P_8021Q)
            // skb_vlan_tag_present is false, so
            vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
            //vlan point to ip header wrongly
Fixes: b2a103e6d0af ("bonding: convert to ndo_fix_features")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
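The masking effect described in this commit message can be illustrated with a small userspace sketch. The bit values, `struct fake_dev` and `skb_features()` are hypothetical stand-ins chosen for illustration; only the `features &= dev->hw_enc_features` step is taken from the message itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions; the real NETIF_F_* values differ. */
#define F_SG		(1u << 0)
#define F_HW_CSUM	(1u << 1)
#define F_VLAN_CTAG_TX	(1u << 2)
#define F_VLAN_STAG_TX	(1u << 3)

/* Hypothetical device with only the two masks that matter here. */
struct fake_dev {
	uint32_t features;
	uint32_t hw_enc_features;
};

/* Features usable for one packet: encapsulated packets are limited
 * to what hw_enc_features advertises, as quoted from the message. */
static uint32_t skb_features(const struct fake_dev *dev, bool encapsulation)
{
	uint32_t features = dev->features;

	if (encapsulation)
		features &= dev->hw_enc_features;
	return features;
}
```

If `hw_enc_features` lacks the two VLAN TX bits, they are cleared for every encapsulated packet even though the device advertises them in `features`. Adding the bits to `hw_enc_features`, as this commit does for the bond device, keeps them set.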
2019-08-07 10:19:59 +08:00
NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_STAG_TX |
2018-05-22 23:34:40 +08:00
NETIF_F_GSO_UDP_L4 ;
2019-06-04 06:36:46 +08:00
bond_dev - > mpls_features = mpls_features ;
2012-11-21 12:35:03 +08:00
bond_dev - > gso_max_segs = gso_max_segs ;
netif_set_gso_max_size ( bond_dev , gso_max_size ) ;
2005-08-23 13:34:53 +08:00
2014-10-06 09:38:35 +08:00
bond_dev - > priv_flags & = ~ IFF_XMIT_DST_RELEASE ;
if ( ( bond_dev - > priv_flags & IFF_XMIT_DST_RELEASE_PERM ) & &
dst_release_flag = = ( IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM ) )
bond_dev - > priv_flags | = IFF_XMIT_DST_RELEASE ;
2012-07-17 20:19:48 +08:00
2011-05-07 11:22:17 +08:00
netdev_change_features ( bond_dev ) ;
2005-08-23 13:34:53 +08:00
}
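The two feature functions above fold per-slave feature masks into one mask for the bond. The following is a simplified userspace sketch of that folding, under stated assumptions: the bit values and the membership of the one-for-all and all-for-all classes are hypothetical, and the checksum special cases and TSO handling of the real helpers are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits and classes (not the real NETIF_F_* sets). */
#define F_SG		(1u << 0)
#define F_HIGHDMA	(1u << 1)
#define F_FSO		(1u << 2)
#define F_OTHER		(1u << 3)

#define ONE_FOR_ALL	(F_SG | F_HIGHDMA)	/* set if any slave has it */
#define ALL_FOR_ALL	(F_FSO)			/* kept only if every slave has it */

/* Fold one slave's features into the running mask. */
static uint32_t increment_features(uint32_t all, uint32_t one, uint32_t mask)
{
	all |= one & ONE_FOR_ALL & mask;
	all &= one | ~ALL_FOR_ALL;
	return all;
}

/* Same shape as bond_fix_features(): start from the requested set,
 * clear the one-for-all bits, set the all-for-all bits, then fold. */
static uint32_t fix_features(uint32_t requested, const uint32_t *slaves,
			     size_t n)
{
	uint32_t mask = requested;
	uint32_t features = requested;
	size_t i;

	features &= ~ONE_FOR_ALL;
	features |= ALL_FOR_ALL;

	for (i = 0; i < n; i++)
		features = increment_features(features, slaves[i], mask);

	return features;
}
```

In this model a one-for-all bit appears as soon as a single slave offers it (and it was requested), while an all-for-all bit is dropped by the first slave that lacks it.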
2007-10-10 10:43:38 +08:00
static void bond_setup_by_slave ( struct net_device * bond_dev ,
struct net_device * slave_dev )
{
2008-11-21 12:14:53 +08:00
bond_dev - > header_ops = slave_dev - > header_ops ;
2007-10-10 10:43:38 +08:00
bond_dev - > type = slave_dev - > type ;
bond_dev - > hard_header_len = slave_dev - > hard_header_len ;
bond_dev - > addr_len = slave_dev - > addr_len ;
memcpy ( bond_dev - > broadcast , slave_dev - > broadcast ,
slave_dev - > addr_len ) ;
}
2011-02-23 17:05:42 +08:00
/* On bonding slaves other than the currently active slave, suppress
2011-04-19 11:48:16 +08:00
* duplicates except for alb non - mcast / bcast .
2011-02-23 17:05:42 +08:00
*/
static bool bond_should_deliver_exact_match ( struct sk_buff * skb ,
2011-03-16 16:45:23 +08:00
struct slave * slave ,
struct bonding * bond )
2011-02-23 17:05:42 +08:00
{
2011-03-16 16:46:43 +08:00
if ( bond_is_slave_inactive ( slave ) ) {
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_ALB & &
2011-02-23 17:05:42 +08:00
skb - > pkt_type ! = PACKET_BROADCAST & &
skb - > pkt_type ! = PACKET_MULTICAST )
return false ;
return true ;
}
return false ;
}
2011-03-12 11:14:39 +08:00
static rx_handler_result_t bond_handle_frame ( struct sk_buff * * pskb )
2011-02-23 17:05:42 +08:00
{
2011-03-12 11:14:39 +08:00
struct sk_buff * skb = * pskb ;
2011-03-12 11:14:35 +08:00
struct slave * slave ;
2011-03-16 16:45:23 +08:00
struct bonding * bond ;
2012-06-12 03:23:07 +08:00
int ( * recv_probe ) ( const struct sk_buff * , struct bonding * ,
struct slave * ) ;
2012-05-09 09:01:40 +08:00
int ret = RX_HANDLER_ANOTHER ;
2011-02-23 17:05:42 +08:00
2011-03-12 11:14:39 +08:00
skb = skb_share_check ( skb , GFP_ATOMIC ) ;
if ( unlikely ( ! skb ) )
return RX_HANDLER_CONSUMED ;
* pskb = skb ;
2011-02-23 17:05:42 +08:00
2011-03-22 10:38:12 +08:00
slave = bond_slave_get_rcu ( skb - > dev ) ;
bond = slave - > bond ;
2011-03-16 16:45:23 +08:00
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
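For readers unfamiliar with the accessors this commit message refers to, here is a userspace approximation. It is an assumption-laden sketch: the macros below are simplified stand-ins built on a volatile cast and the GCC/Clang `__typeof__` extension, not the kernel definitions, and `struct fake_bond`, `twice()` and `run_probe()` are hypothetical names that mimic the `recv_probe` read in bond_handle_frame().

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel accessors (GCC/Clang only). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

typedef int (*probe_fn)(int);

/* Hypothetical object holding a handler that may be changed concurrently. */
struct fake_bond {
	probe_fn recv_probe;
};

/* Hypothetical probe used in the example. */
static int twice(int v)
{
	return 2 * v;
}

/* Load the pointer exactly once, then test and call the local copy,
 * so the check and the call cannot observe two different values. */
static int run_probe(struct fake_bond *bond, int arg)
{
	probe_fn fn = READ_ONCE(bond->recv_probe);

	if (fn)
		return fn(arg);
	return -1;
}
```

The point of the pattern is that `fn` is a private snapshot: a writer using `WRITE_ONCE(bond->recv_probe, NULL)` between the test and the call cannot turn the call into a NULL dereference.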
2017-10-24 05:07:29 +08:00
recv_probe = READ_ONCE ( bond - > recv_probe ) ;
2011-10-13 00:04:29 +08:00
if ( recv_probe ) {
2012-06-12 03:23:07 +08:00
ret = recv_probe ( skb , bond , slave ) ;
if ( ret = = RX_HANDLER_CONSUMED ) {
consume_skb ( skb ) ;
return ret ;
2011-04-19 11:48:16 +08:00
}
}
2019-02-19 00:55:28 +08:00
/*
 * For packets determined by bond_should_deliver_exact_match() call to
 * be suppressed we want to make an exception for link-local packets.
 * This is necessary for e.g. LLDP daemons to be able to monitor
 * inactive slave links without being forced to bind to them
 * explicitly.
 *
 * At the same time, packets that are passed to the bonding master
 * (including link-local ones) can have their originating interface
 * determined via PACKET_ORIGDEV socket option.
2018-09-25 05:39:42 +08:00
*/
2019-02-19 00:55:28 +08:00
if ( bond_should_deliver_exact_match ( skb , slave , bond ) ) {
if ( is_link_local_ether_addr ( eth_hdr ( skb ) - > h_dest ) )
return RX_HANDLER_PASS ;
2011-03-12 11:14:39 +08:00
return RX_HANDLER_EXACT ;
2019-02-19 00:55:28 +08:00
}
2011-02-23 17:05:42 +08:00
2011-03-22 10:38:12 +08:00
skb - > dev = bond - > dev ;
2011-02-23 17:05:42 +08:00
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_ALB & &
2020-02-20 16:00:07 +08:00
netif_is_bridge_port ( bond - > dev ) & &
2011-02-23 17:05:42 +08:00
skb - > pkt_type = = PACKET_HOST ) {
2011-03-03 05:07:14 +08:00
if ( unlikely ( skb_cow_head ( skb ,
skb - > data - skb_mac_header ( skb ) ) ) ) {
kfree_skb ( skb ) ;
2011-03-12 11:14:39 +08:00
return RX_HANDLER_CONSUMED ;
2011-03-03 05:07:14 +08:00
}
bonding: attempt to better support longer hw addresses
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e7731522, so the alb code is where most of the changes are.
One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over device with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.
Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0
Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0
Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
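The size argument made in this commit message can be checked with a small userspace sketch. The names `hw_addr_copy_sketch()`, `fits_in_sockaddr()` and the `SKETCH_*` constants are hypothetical; the sketch assumes a platform where `struct sockaddr` has a fixed-size `sa_data` array (14 bytes on Linux with glibc), which is what the message refers to.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Lengths quoted in the commit message, under sketch-local names. */
#define SKETCH_ETH_ALEN		6
#define SKETCH_INFINIBAND_ALEN	20
#define SKETCH_MAX_ADDR_LEN	32

/* Length-aware copy, in the spirit of bond_hw_addr_copy(). */
static void hw_addr_copy_sketch(unsigned char *dst, const unsigned char *src,
				size_t len)
{
	memcpy(dst, src, len);
}

/* Does an address of len octets fit in the data area of struct sockaddr? */
static int fits_in_sockaddr(size_t len)
{
	return len <= sizeof(((struct sockaddr *)0)->sa_data);
}

/* Same question for struct sockaddr_storage, minus the family field. */
static int fits_in_sockaddr_storage(size_t len)
{
	return len <= sizeof(struct sockaddr_storage) -
		      offsetof(struct sockaddr, sa_data);
}
```

Under that assumption a 6-octet Ethernet address fits in `struct sockaddr`, a 20-octet Infiniband address does not, and anything up to 32 octets fits in `struct sockaddr_storage`.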
2017-04-05 05:32:42 +08:00
bond_hw_addr_copy ( eth_hdr ( skb ) - > h_dest , bond - > dev - > dev_addr ,
bond - > dev - > addr_len ) ;
2011-02-23 17:05:42 +08:00
}
2012-05-09 09:01:40 +08:00
return ret ;
2011-02-23 17:05:42 +08:00
}
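The delivery decision made by bond_should_deliver_exact_match() and the link-local exception in bond_handle_frame() can be summarised in a standalone sketch. All type and function names below are hypothetical, the recv_probe call and the ALB destination rewrite are omitted, and the link-local test assumes the reserved 01:80:c2:00:00:0X range.

```c
#include <stdbool.h>

enum pkt_type { PKT_HOST, PKT_BROADCAST, PKT_MULTICAST };
enum rx_result { RX_ANOTHER, RX_EXACT, RX_PASS };

/* Mirror of the check above: suppress on inactive slaves, except
 * unicast traffic in ALB mode. */
static bool should_deliver_exact_match(bool slave_inactive, bool mode_alb,
				       enum pkt_type type)
{
	if (slave_inactive) {
		if (mode_alb && type != PKT_BROADCAST && type != PKT_MULTICAST)
			return false;
		return true;
	}
	return false;
}

/* Assumed link-local range: 01:80:c2:00:00:00 to 01:80:c2:00:00:0f. */
static bool is_link_local(const unsigned char dest[6])
{
	return dest[0] == 0x01 && dest[1] == 0x80 && dest[2] == 0xc2 &&
	       dest[3] == 0x00 && dest[4] == 0x00 && (dest[5] & 0xf0) == 0x00;
}

/* Combined decision for one received frame. */
static enum rx_result classify(bool slave_inactive, bool mode_alb,
			       enum pkt_type type, const unsigned char dest[6])
{
	if (should_deliver_exact_match(slave_inactive, mode_alb, type)) {
		if (is_link_local(dest))
			return RX_PASS;
		return RX_EXACT;
	}
	return RX_ANOTHER;
}
```

In this model an LLDP frame arriving on an inactive slave is passed up unchanged, other frames on that slave are delivered only to exact-match listeners, and frames on the active slave continue on to the bond device.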
2015-12-03 19:12:14 +08:00
static enum netdev_lag_tx_type bond_lag_tx_type ( struct bonding * bond )
2013-01-04 06:49:01 +08:00
{
2015-12-03 19:12:14 +08:00
	switch (BOND_MODE(bond)) {
	case BOND_MODE_ROUNDROBIN:
		return NETDEV_LAG_TX_TYPE_ROUNDROBIN;
	case BOND_MODE_ACTIVEBACKUP:
		return NETDEV_LAG_TX_TYPE_ACTIVEBACKUP;
	case BOND_MODE_BROADCAST:
		return NETDEV_LAG_TX_TYPE_BROADCAST;
	case BOND_MODE_XOR:
	case BOND_MODE_8023AD:
		return NETDEV_LAG_TX_TYPE_HASH;
	default:
		return NETDEV_LAG_TX_TYPE_UNKNOWN;
	}
}
2018-05-24 10:22:52 +08:00
static enum netdev_lag_hash bond_lag_hash_type(struct bonding *bond,
					       enum netdev_lag_tx_type type)
{
	if (type != NETDEV_LAG_TX_TYPE_HASH)
		return NETDEV_LAG_HASH_NONE;

	switch (bond->params.xmit_policy) {
	case BOND_XMIT_POLICY_LAYER2:
		return NETDEV_LAG_HASH_L2;
	case BOND_XMIT_POLICY_LAYER34:
		return NETDEV_LAG_HASH_L34;
	case BOND_XMIT_POLICY_LAYER23:
		return NETDEV_LAG_HASH_L23;
	case BOND_XMIT_POLICY_ENCAP23:
		return NETDEV_LAG_HASH_E23;
	case BOND_XMIT_POLICY_ENCAP34:
		return NETDEV_LAG_HASH_E34;
	default:
		return NETDEV_LAG_HASH_UNKNOWN;
	}
}
2017-10-05 08:48:47 +08:00
static int bond_master_upper_dev_link ( struct bonding * bond , struct slave * slave ,
struct netlink_ext_ack * extack )
2015-12-03 19:12:14 +08:00
{
struct netdev_lag_upper_info lag_upper_info ;
2018-05-24 10:22:52 +08:00
enum netdev_lag_tx_type type ;
2013-01-04 06:49:01 +08:00
2018-05-24 10:22:52 +08:00
type = bond_lag_tx_type ( bond ) ;
lag_upper_info . tx_type = type ;
lag_upper_info . hash_type = bond_lag_hash_type ( bond , type ) ;
2017-10-24 13:54:18 +08:00
return netdev_master_upper_dev_link ( slave - > dev , bond - > dev , slave ,
& lag_upper_info , extack ) ;
2013-01-04 06:49:01 +08:00
}
2015-12-03 19:12:14 +08:00
static void bond_upper_dev_unlink ( struct bonding * bond , struct slave * slave )
2013-01-04 06:49:01 +08:00
{
2015-12-03 19:12:14 +08:00
netdev_upper_dev_unlink ( slave - > dev , bond - > dev ) ;
slave - > dev - > flags & = ~ IFF_SLAVE ;
2013-01-04 06:49:01 +08:00
}
2014-05-12 15:08:43 +08:00
static struct slave * bond_alloc_slave ( struct bonding * bond )
{
struct slave * slave = NULL ;
2016-02-03 10:02:32 +08:00
slave = kzalloc ( sizeof ( * slave ) , GFP_KERNEL ) ;
2014-05-12 15:08:43 +08:00
if ( ! slave )
return NULL ;
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
2014-05-12 15:08:43 +08:00
SLAVE_AD_INFO ( slave ) = kzalloc ( sizeof ( struct ad_slave_info ) ,
GFP_KERNEL ) ;
if ( ! SLAVE_AD_INFO ( slave ) ) {
kfree ( slave ) ;
return NULL ;
}
}
2018-09-25 05:40:11 +08:00
INIT_DELAYED_WORK ( & slave - > notify_work , bond_netdev_notify_work ) ;
2014-05-12 15:08:43 +08:00
return slave ;
}
static void bond_free_slave ( struct slave * slave )
{
struct bonding * bond = bond_get_bond_by_slave ( slave ) ;
2018-09-25 05:40:11 +08:00
cancel_delayed_work_sync ( & slave - > notify_work ) ;
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD )
2014-05-12 15:08:43 +08:00
kfree ( SLAVE_AD_INFO ( slave ) ) ;
kfree ( slave ) ;
}
2015-02-03 22:48:30 +08:00
static void bond_fill_ifbond(struct bonding *bond, struct ifbond *info)
{
	info->bond_mode = BOND_MODE(bond);
	info->miimon = bond->params.miimon;
	info->num_slaves = bond->slave_cnt;
}

static void bond_fill_ifslave(struct slave *slave, struct ifslave *info)
{
	strcpy(info->slave_name, slave->dev->name);
	info->link = slave->link;
	info->state = bond_slave_state(slave);
	info->link_failure_count = slave->link_failure_count;
}
2015-02-03 22:48:31 +08:00
static void bond_netdev_notify_work ( struct work_struct * _work )
{
2018-09-25 05:40:11 +08:00
struct slave * slave = container_of ( _work , struct slave ,
notify_work . work ) ;
if ( rtnl_trylock ( ) ) {
struct netdev_bonding_info binfo ;
2015-02-03 22:48:31 +08:00
2018-09-25 05:40:11 +08:00
bond_fill_ifslave ( slave , & binfo . slave ) ;
bond_fill_ifbond ( slave - > bond , & binfo . master ) ;
netdev_bonding_info_change ( slave - > dev , & binfo ) ;
rtnl_unlock ( ) ;
} else {
queue_delayed_work ( slave - > bond - > wq , & slave - > notify_work , 1 ) ;
}
2015-02-03 22:48:31 +08:00
}
void bond_queue_slave_event ( struct slave * slave )
{
2018-09-25 05:40:11 +08:00
queue_delayed_work ( slave - > bond - > wq , & slave - > notify_work , 0 ) ;
2015-02-03 22:48:31 +08:00
}
2015-12-03 19:12:20 +08:00
void bond_lower_state_changed(struct slave *slave)
{
	struct netdev_lag_lower_state_info info;

	info.link_up = slave->link == BOND_LINK_UP ||
		       slave->link == BOND_LINK_FAIL;
	info.tx_enabled = bond_is_active_slave(slave);
	netdev_lower_state_changed(slave->dev, &info);
}
2005-04-17 06:20:36 +08:00
/* enslave device <slave> to bond device <master> */
2017-10-05 08:48:46 +08:00
int bond_enslave ( struct net_device * bond_dev , struct net_device * slave_dev ,
struct netlink_ext_ack * extack )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2008-11-20 13:56:05 +08:00
const struct net_device_ops * slave_ops = slave_dev - > netdev_ops ;
2013-09-25 15:20:25 +08:00
struct slave * new_slave = NULL , * prev_slave ;
2017-04-05 05:32:42 +08:00
struct sockaddr_storage ss ;
2005-04-17 06:20:36 +08:00
int link_reporting ;
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of a bond (hybrid port). If the bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
The same applies to any other configuration in which we need to have
access to all arp_ip_targets.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
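Step 3 of the commit message above (returning the oldest per-target receive time when arp_all_targets is set) can be sketched as follows. This is a hypothetical userspace model: `struct fake_slave`, `oldest_target_rx()` and `last_rx_sketch()` are illustrative names, the target count is passed explicitly, and plain `<` is used where kernel code would need a wrap-safe jiffies comparison.

```c
#include <stddef.h>

#define SKETCH_MAX_TARGETS 16	/* hypothetical array size */

/* Hypothetical slave state: overall last rx plus one stamp per target. */
struct fake_slave {
	unsigned long last_rx;
	unsigned long target_last_arp_rx[SKETCH_MAX_TARGETS];
};

/* Oldest time at which any configured target was last heard from. */
static unsigned long oldest_target_rx(const struct fake_slave *s,
				      size_t ntargets)
{
	unsigned long ret = s->target_last_arp_rx[0];
	size_t i;

	for (i = 1; i < ntargets && i < SKETCH_MAX_TARGETS; i++)
		if (s->target_last_arp_rx[i] < ret)
			ret = s->target_last_arp_rx[i];
	return ret;
}

/* "all" mode: the slave is only as fresh as its stalest target.
 * "any" mode (the default): behave as before and use last_rx. */
static unsigned long last_rx_sketch(const struct fake_slave *s,
				    size_t ntargets, int all_targets)
{
	if (all_targets && ntargets > 0)
		return oldest_target_rx(s, ntargets);
	return s->last_rx;
}
```

Because the "all" mode reports the stalest target, a single unreachable arp_ip_target is enough to make the slave look stale to the ARP monitor, which is the behaviour the new option asks for.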
2013-06-24 17:49:34 +08:00
int res = 0 , i ;
2005-04-17 06:20:36 +08:00
2012-12-07 14:15:32 +08:00
if ( ! bond - > params . use_carrier & &
slave_dev - > ethtool_ops - > get_link = = NULL & &
slave_ops - > ndo_do_ioctl = = NULL ) {
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " no link monitoring support \n " ) ;
2005-04-17 06:20:36 +08:00
}
2016-09-02 13:18:34 +08:00
/* already in-use? */
if ( netdev_is_rx_handler_busy ( slave_dev ) ) {
2017-10-05 08:48:49 +08:00
NL_SET_ERR_MSG ( extack , " Device is in use and cannot be enslaved " ) ;
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev ,
" Error: Device is in use and cannot be enslaved \n " ) ;
2005-04-17 06:20:36 +08:00
return - EBUSY ;
}
2014-02-27 01:20:13 +08:00
if ( bond_dev = = slave_dev ) {
2017-10-05 08:48:49 +08:00
NL_SET_ERR_MSG ( extack , " Cannot enslave bond to itself. " ) ;
2014-07-16 01:35:58 +08:00
netdev_err ( bond_dev , " cannot enslave bond to itself. \n " ) ;
2014-02-27 01:20:13 +08:00
return - EPERM ;
}
2005-04-17 06:20:36 +08:00
/* vlan challenged mutual exclusion */
/* no need to lock since we're protected by rtnl_lock */
if ( slave_dev - > features & NETIF_F_VLAN_CHALLENGED ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " is NETIF_F_VLAN_CHALLENGED \n " ) ;
2012-10-14 12:30:56 +08:00
if ( vlan_uses_dev ( bond_dev ) ) {
2017-10-05 08:48:49 +08:00
NL_SET_ERR_MSG ( extack , " Can not enslave VLAN challenged device to VLAN enabled bond " ) ;
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " Error: cannot enslave VLAN challenged slave on VLAN enabled bond \n " ) ;
2005-04-17 06:20:36 +08:00
return - EPERM ;
} else {
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " enslaved VLAN challenged slave. Adding VLANs will be blocked as long as it is part of bond. \n " ) ;
2005-04-17 06:20:36 +08:00
}
} else {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " is !NETIF_F_VLAN_CHALLENGED \n " ) ;
2005-04-17 06:20:36 +08:00
}
2014-09-15 23:19:34 +08:00
/* Old ifenslave binaries are no longer supported. These can
2009-06-13 03:02:48 +08:00
* be identified with moderate accuracy by the state of the slave :
2005-09-27 07:11:50 +08:00
* the current ifenslave will set the interface down prior to
* enslaving it ; the old ifenslave will not .
*/
2015-12-03 18:00:55 +08:00
if ( slave_dev - > flags & IFF_UP ) {
2017-10-05 08:48:49 +08:00
NL_SET_ERR_MSG ( extack , " Device can not be enslaved while up " ) ;
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " slave is up - this may be due to an out of date ifenslave \n " ) ;
2016-02-09 18:37:46 +08:00
return - EPERM ;
2005-09-27 07:11:50 +08:00
}
2005-04-17 06:20:36 +08:00
2007-10-10 10:43:38 +08:00
/* set bonding device ether type by slave - bonding netdevices are
* created with ether_setup , so when the slave type is not ARPHRD_ETHER
* there is a need to override some of the type dependent attribs / funcs .
*
* bond ether type mutual exclusion - don ' t allow slaves of dissimilar
* ether type ( eg ARPHRD_ETHER and ARPHRD_INFINIBAND ) share the same bond
*/
2013-09-25 15:20:21 +08:00
if ( ! bond_has_slaves ( bond ) ) {
2009-07-15 12:56:31 +08:00
if ( bond_dev - > type ! = slave_dev - > type ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " change device type from %d to %d \n " ,
bond_dev - > type , slave_dev - > type ) ;
2009-09-15 17:37:40 +08:00
2012-08-10 06:14:57 +08:00
res = call_netdevice_notifiers ( NETDEV_PRE_TYPE_CHANGE ,
bond_dev ) ;
2010-03-10 18:29:35 +08:00
res = notifier_to_errno ( res ) ;
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " refused to change device type \n " ) ;
2016-02-09 18:37:46 +08:00
return - EBUSY ;
2010-03-10 18:29:35 +08:00
}
2009-09-15 17:37:40 +08:00
2010-03-19 12:00:23 +08:00
/* Flush unicast and multicast addresses */
2010-04-02 05:22:09 +08:00
dev_uc_flush ( bond_dev ) ;
2010-04-02 05:22:57 +08:00
dev_mc_flush ( bond_dev ) ;
2010-03-19 12:00:23 +08:00
2009-07-15 12:56:31 +08:00
if ( slave_dev - > type ! = ARPHRD_ETHER )
bond_setup_by_slave ( bond_dev , slave_dev ) ;
2011-07-26 14:05:38 +08:00
else {
2009-07-15 12:56:31 +08:00
ether_setup ( bond_dev ) ;
2011-07-26 14:05:38 +08:00
bond_dev - > priv_flags & = ~ IFF_TX_SKB_SHARING ;
}
2009-09-15 17:37:40 +08:00
2012-08-10 06:14:57 +08:00
call_netdevice_notifiers ( NETDEV_POST_TYPE_CHANGE ,
bond_dev ) ;
2009-07-15 12:56:31 +08:00
}
2007-10-10 10:43:38 +08:00
} else if ( bond_dev - > type ! = slave_dev - > type ) {
2017-10-05 08:48:49 +08:00
NL_SET_ERR_MSG ( extack , " Device type is different from other slaves " ) ;
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " ether type (%d) is different from other slaves (%d), can not enslave it \n " ,
slave_dev - > type , bond_dev - > type ) ;
2016-02-09 18:37:46 +08:00
return - EINVAL ;
2007-10-10 10:43:38 +08:00
}
2016-07-21 16:52:55 +08:00
if ( slave_dev - > type = = ARPHRD_INFINIBAND & &
BOND_MODE ( bond ) ! = BOND_MODE_ACTIVEBACKUP ) {
2017-10-05 08:48:49 +08:00
NL_SET_ERR_MSG ( extack , " Only active-backup mode is supported for infiniband slaves " ) ;
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " Type (%d) supports only active-backup mode \n " ,
slave_dev - > type ) ;
2016-07-21 16:52:55 +08:00
res = - EOPNOTSUPP ;
goto err_undo_flags ;
}
if ( ! slave_ops - > ndo_set_mac_address | |
slave_dev - > type = = ARPHRD_INFINIBAND ) {
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " The slave device specified does not support setting the MAC address \n " ) ;
2014-07-15 19:26:01 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_ACTIVEBACKUP & &
bond - > params . fail_over_mac ! = BOND_FOM_ACTIVE ) {
if ( ! bond_has_slaves ( bond ) ) {
2014-01-25 13:00:29 +08:00
bond - > params . fail_over_mac = BOND_FOM_ACTIVE ;
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " Setting fail_over_mac to active for active-backup mode \n " ) ;
2014-07-15 19:26:01 +08:00
} else {
2017-10-05 08:48:49 +08:00
NL_SET_ERR_MSG ( extack , " Slave device does not support setting the MAC address, but fail_over_mac is not set to active " ) ;
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " The slave device specified does not support setting the MAC address, but fail_over_mac is not set to active \n " ) ;
2014-07-15 19:26:01 +08:00
res = - EOPNOTSUPP ;
goto err_undo_flags ;
2014-01-25 13:00:29 +08:00
}
2007-10-10 10:43:39 +08:00
}
2005-04-17 06:20:36 +08:00
}
2011-05-20 05:39:10 +08:00
call_netdevice_notifiers ( NETDEV_JOIN , slave_dev ) ;
2010-05-19 09:14:29 +08:00
/* If this is the first slave, then we need to set the master's hardware
2014-09-15 23:19:34 +08:00
* address to be the same as the slave ' s .
*/
2013-09-25 15:20:21 +08:00
if ( ! bond_has_slaves ( bond ) & &
2018-12-13 19:54:44 +08:00
bond - > dev - > addr_assign_type = = NET_ADDR_RANDOM ) {
res = bond_set_dev_addr ( bond - > dev , slave_dev ) ;
if ( res )
goto err_undo_flags ;
}
2010-05-19 09:14:29 +08:00
2014-05-12 15:08:43 +08:00
new_slave = bond_alloc_slave ( bond ) ;
2005-04-17 06:20:36 +08:00
if ( ! new_slave ) {
res = - ENOMEM ;
goto err_undo_flags ;
}
2014-05-12 15:08:43 +08:00
2014-05-21 23:42:00 +08:00
new_slave - > bond = bond ;
new_slave - > dev = slave_dev ;
2014-09-15 23:19:34 +08:00
/* Set the new_slave's queue_id to be zero. Queue ID mapping
2010-06-02 16:40:18 +08:00
* is set via sysfs or module option if desired .
*/
new_slave - > queue_id = 0 ;
2010-05-18 13:42:40 +08:00
/* Save slave's original mtu and then set it to match the bond */
new_slave - > original_mtu = slave_dev - > mtu ;
res = dev_set_mtu ( slave_dev , bond - > dev - > mtu ) ;
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " Error %d calling dev_set_mtu \n " , res ) ;
2010-05-18 13:42:40 +08:00
goto err_free ;
}
2014-09-15 23:19:34 +08:00
/* Save slave's original ("permanent") mac address for modes
2005-09-27 07:11:50 +08:00
* that need it , and for restoring it upon release , and then
* set it to the master ' s address
*/
bonding: attempt to better support longer hw addresses
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e7731522, so the alb code is where most of the changes are.
One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over device with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.
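A minimal user-space sketch of the length-aware copy described above. The
names hw_addr_copy and hw_addr_fits_sockaddr are hypothetical and are not the
driver's helpers; the constants are local stand-ins for the sizes quoted in
this message (6, 20, 32 and 14 octets), not values pulled from kernel headers.

```c
#include <stddef.h>
#include <string.h>

/* Sizes quoted in the commit message; local stand-ins, not kernel headers. */
#define SKETCH_ETH_ALEN        6   /* Ethernet hardware address length */
#define SKETCH_INFINIBAND_ALEN 20  /* IPoIB hardware address length */
#define SKETCH_MAX_ADDR_LEN    32  /* largest address the network core allows */
#define SKETCH_SA_DATA_LEN     14  /* address room in a plain struct sockaddr */

/* Hypothetical helper: copy len octets of hardware address.
 * Returns 0 on success, -1 if len exceeds the largest supported address.
 */
static int hw_addr_copy(unsigned char *dst, const unsigned char *src,
			size_t len)
{
	if (len > SKETCH_MAX_ADDR_LEN)
		return -1;
	memcpy(dst, src, len);
	return 0;
}

/* Hypothetical helper: nonzero when an address of len octets fits in the
 * 14-octet data area of a plain struct sockaddr.
 */
static int hw_addr_fits_sockaddr(size_t len)
{
	return len <= SKETCH_SA_DATA_LEN;
}
```

With these sizes a 6-octet Ethernet address fits in the smaller structure,
while a 20-octet Infiniband address does not, which is the overflow risk the
paragraph above refers to.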
Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0
Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0
Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 05:32:42 +08:00
bond_hw_addr_copy ( new_slave - > perm_hwaddr , slave_dev - > dev_addr ,
slave_dev - > addr_len ) ;
2005-04-17 06:20:36 +08:00
2014-01-25 13:00:29 +08:00
if ( ! bond - > params . fail_over_mac | |
2014-05-16 03:39:55 +08:00
BOND_MODE ( bond ) ! = BOND_MODE_ACTIVEBACKUP ) {
2014-09-15 23:19:34 +08:00
/* Set slave to master's mac address. The application already
2007-10-10 10:43:39 +08:00
* set the master ' s mac address to that of the first slave
*/
2017-04-05 05:32:42 +08:00
memcpy ( ss . __data , bond_dev - > dev_addr , bond_dev - > addr_len ) ;
ss . ss_family = slave_dev - > type ;
2018-12-13 19:54:30 +08:00
res = dev_set_mac_address ( slave_dev , ( struct sockaddr * ) & ss ,
extack ) ;
2007-10-10 10:43:39 +08:00
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " Error %d calling set_mac_address \n " , res ) ;
2010-05-18 13:42:40 +08:00
goto err_restore_mtu ;
2007-10-10 10:43:39 +08:00
}
2005-09-27 07:11:50 +08:00
}
2005-04-17 06:20:36 +08:00
2016-01-11 21:28:43 +08:00
/* set slave flag before open to prevent IPv6 addrconf */
slave_dev - > flags | = IFF_SLAVE ;
2005-09-27 07:11:50 +08:00
/* open the slave since the application closed it */
2018-12-07 01:05:36 +08:00
res = dev_open ( slave_dev , extack ) ;
2005-09-27 07:11:50 +08:00
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " Opening slave failed \n " ) ;
2013-09-25 15:20:10 +08:00
goto err_restore_mac ;
2005-04-17 06:20:36 +08:00
}
2006-09-23 12:54:10 +08:00
slave_dev - > priv_flags | = IFF_BONDING ;
2014-09-29 10:34:37 +08:00
/* initialize slave stats */
dev_get_stats ( new_slave - > dev , & new_slave - > slave_stats ) ;
2005-04-17 06:20:36 +08:00
2008-12-10 15:07:13 +08:00
if ( bond_is_lb ( bond ) ) {
2005-04-17 06:20:36 +08:00
/* bond_alb_init_slave() must be called before all other stages since
* it might fail and we do not want to have to undo everything
*/
res = bond_alb_init_slave ( bond , new_slave ) ;
2009-06-13 03:02:48 +08:00
if ( res )
2008-05-03 09:06:02 +08:00
goto err_close ;
2005-04-17 06:20:36 +08:00
}
2013-08-23 10:45:07 +08:00
res = vlan_vids_add_by_dev ( slave_dev , bond_dev ) ;
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_err ( bond_dev , slave_dev , " Couldn't add bond vlan ids \n " ) ;
2018-03-26 01:16:46 +08:00
goto err_close ;
2013-08-06 18:40:15 +08:00
}
2005-04-17 06:20:36 +08:00
2013-09-25 15:20:25 +08:00
prev_slave = bond_last_slave ( bond ) ;
2005-04-17 06:20:36 +08:00
new_slave - > delay = 0 ;
new_slave - > link_failure_count = 0 ;
2017-08-10 12:41:44 +08:00
if ( bond_update_speed_duplex ( new_slave ) & &
bond_needs_speed_duplex ( bond ) )
2017-04-04 09:38:39 +08:00
new_slave - > link = BOND_LINK_DOWN ;
2013-03-12 14:31:32 +08:00
2014-02-18 14:48:47 +08:00
new_slave - > last_rx = jiffies -
2012-04-17 10:02:06 +08:00
( msecs_to_jiffies ( bond - > params . arp_interval ) + 1 ) ;
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of a bond (hybrid port). If the bond and the vlans have ips
assigned and their peers are monitored via arp_ip_target, then in case of
switch misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way, we will be able to switch
to another slave.
Any other configuration that needs access to all arp_ip_targets benefits
in the same way.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
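The three internal steps above can be sketched in user space roughly as
follows. This is an illustration under stated assumptions, not the driver's
code: struct sketch_slave, sketch_time_before() and sketch_slave_last_rx()
are hypothetical names, the array bound of 16 is assumed, and the "any"
branch simply returns the slave's ordinary last_rx.

```c
#include <stddef.h>

#define SKETCH_MAX_ARP_TARGETS 16  /* assumed bound, for illustration only */

/* Hypothetical, simplified stand-in for the per-slave state. */
struct sketch_slave {
	unsigned long last_rx;  /* last receive time of any kind, in ticks */
	/* last time an ARP from target i was validated on this slave */
	unsigned long target_last_arp_rx[SKETCH_MAX_ARP_TARGETS];
};

/* Wraparound-tolerant "a is earlier than b" for tick counters. */
static int sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* all_targets == 0: behave as before and report the plain last_rx.
 * all_targets != 0: report the oldest per-target time, so a single silent
 * target makes the whole slave look stale.
 */
static unsigned long sketch_slave_last_rx(const struct sketch_slave *s,
					  size_t n_targets, int all_targets)
{
	unsigned long oldest;
	size_t i;

	if (!all_targets || n_targets == 0)
		return s->last_rx;

	oldest = s->target_last_arp_rx[0];
	for (i = 1; i < n_targets && i < SKETCH_MAX_ARP_TARGETS; i++)
		if (sketch_time_before(s->target_last_arp_rx[i], oldest))
			oldest = s->target_last_arp_rx[i];
	return oldest;
}
```

Taking the oldest timestamp is what turns "any target answers" into "all
targets answer": the slave is only as fresh as its least recently heard
target.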
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 17:49:34 +08:00
for ( i = 0 ; i < BOND_MAX_ARP_TARGETS ; i + + )
2014-02-18 14:48:47 +08:00
new_slave - > target_last_arp_rx [ i ] = new_slave - > last_rx ;
2006-09-23 12:54:53 +08:00
2005-04-17 06:20:36 +08:00
if ( bond - > params . miimon & & ! bond - > params . use_carrier ) {
link_reporting = bond_check_dev_link ( bond , slave_dev , 1 ) ;
if ( ( link_reporting = = - 1 ) & & ! bond - > params . arp_interval ) {
2014-09-15 23:19:34 +08:00
/* miimon is set but a bonded network driver
2005-04-17 06:20:36 +08:00
* does not support ETHTOOL / MII and
* arp_interval is not set . Note : if
* use_carrier is enabled , we will never go
* here ( because netif_carrier is always
* supported ) ; thus , we don ' t need to change
* the messages for netif_carrier .
*/
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " MII and ETHTOOL support not available for slave, and arp_interval/arp_ip_target module parameters not specified, thus bonding will not detect link failures! see bonding.txt for details \n " ) ;
2005-04-17 06:20:36 +08:00
} else if ( link_reporting = = - 1 ) {
/* unable to get link status using mii/ethtool */
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " can't get link status from slave; the network driver associated with this interface does not support MII or ETHTOOL link status reporting, thus miimon has no effect on this interface \n " ) ;
2005-04-17 06:20:36 +08:00
}
}
/* check for initial state */
2016-07-05 17:09:47 +08:00
new_slave - > link = BOND_LINK_NOCHANGE ;
2012-04-17 10:02:06 +08:00
if ( bond - > params . miimon ) {
if ( bond_check_dev_link ( bond , slave_dev , 0 ) = = BMSR_LSTATUS ) {
if ( bond - > params . updelay ) {
2015-02-03 22:48:30 +08:00
bond_set_slave_link_state ( new_slave ,
2015-12-03 19:12:19 +08:00
BOND_LINK_BACK ,
BOND_SLAVE_NOTIFY_NOW ) ;
2012-04-17 10:02:06 +08:00
new_slave - > delay = bond - > params . updelay ;
} else {
2015-02-03 22:48:30 +08:00
bond_set_slave_link_state ( new_slave ,
2015-12-03 19:12:19 +08:00
BOND_LINK_UP ,
BOND_SLAVE_NOTIFY_NOW ) ;
2012-04-17 10:02:06 +08:00
}
2005-04-17 06:20:36 +08:00
} else {
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( new_slave , BOND_LINK_DOWN ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
}
2012-04-17 10:02:06 +08:00
} else if ( bond - > params . arp_interval ) {
2015-02-03 22:48:30 +08:00
bond_set_slave_link_state ( new_slave ,
( netif_carrier_ok ( slave_dev ) ?
2015-12-03 19:12:19 +08:00
BOND_LINK_UP : BOND_LINK_DOWN ) ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
} else {
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( new_slave , BOND_LINK_UP ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
}
2012-04-17 10:02:06 +08:00
if ( new_slave - > link ! = BOND_LINK_DOWN )
2014-02-18 14:48:46 +08:00
new_slave - > last_link_up = jiffies ;
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " Initial state of slave is BOND_LINK_%s \n " ,
new_slave - > link = = BOND_LINK_DOWN ? " DOWN " :
( new_slave - > link = = BOND_LINK_UP ? " UP " : " BACK " ) ) ;
2012-04-17 10:02:06 +08:00
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) & & bond - > params . primary [ 0 ] ) {
2005-04-17 06:20:36 +08:00
/* if there is a primary slave, remember it */
2009-09-25 11:28:09 +08:00
if ( strcmp ( bond - > params . primary , new_slave - > dev - > name ) = = 0 ) {
2014-09-10 05:17:00 +08:00
rcu_assign_pointer ( bond - > primary_slave , new_slave ) ;
2009-09-25 11:28:09 +08:00
bond - > force_primary = true ;
}
2005-04-17 06:20:36 +08:00
}
2014-05-16 03:39:55 +08:00
switch ( BOND_MODE ( bond ) ) {
2005-04-17 06:20:36 +08:00
case BOND_MODE_ACTIVEBACKUP :
bonding: Fix RTNL: assertion failed at net/core/rtnetlink.c for 802.3ad mode
The problem was introduced by the commit 1d3ee88ae0d
(bonding: add netlink attributes to slave link dev).
The bond_set_active_slave() and bond_set_backup_slave()
will use rtmsg_ifinfo to send slave's states, so these
two functions should be called in RTNL.
In 802.3ad mode, acquiring RTNL for the __enable_port and
__disable_port cases is difficult, as those calls generally
already hold the state machine lock, and cannot unconditionally
call rtnl_lock because either they already hold RTNL (for calls
via bond_3ad_unbind_slave) or due to the potential for deadlock
with bond_3ad_adapter_speed_changed, bond_3ad_adapter_duplex_changed,
bond_3ad_link_change, or bond_3ad_update_lacp_rate. All four of
those are called with RTNL held, and acquire the state machine lock
second. The calling contexts for __enable_port and __disable_port
already hold the state machine lock, and may or may not need RTNL.
Following Jay's opinion, it is not a problem if the slave does not send
the notify message synchronously when its status changes. The state
machine normally runs every 100 ms, so sending the notify message at the
end of the state machine run, if the slave's state changed, should be
better.
I fix the problem through these steps:
1). Add a new function bond_set_slave_state() which changes the slave's
state and calls rtmsg_ifinfo() depending on an input parameter called
notify.
2). Add a new slave parameter called should_notify. If the slave's state
changed and no notification has been sent yet, the parameter is set to 1;
if the slave's state then changes again, it is set back to 0, indicating
that the state has been restored and there is no need to notify anyone.
3). __enable_port and __disable_port should not call rtmsg_ifinfo under
the state machine lock. Any change in the slave's state sets a flag in
the slave, indicating that rtmsg_ifinfo should be called at the end of
the state machine run.
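A minimal user-space sketch of the deferred-notification scheme in steps 1
to 3. All names here (struct sketch_slave, sketch_set_slave_state(),
sketch_flush_notify()) are hypothetical, and a plain counter stands in for
the rtmsg_ifinfo() call; locking is left out entirely.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-slave state. */
struct sketch_slave {
	int backup;         /* current state: 1 = backup, 0 = active */
	int should_notify;  /* 1 while a notification is still owed */
	int notifications;  /* notifications sent; stands in for rtmsg_ifinfo() */
};

/* Change the state; notify immediately or only record that a notification
 * is owed. A second deferred change restores the original state, so the
 * owed notification is cancelled.
 */
static void sketch_set_slave_state(struct sketch_slave *s, int state,
				   bool notify_now)
{
	if (s->backup == state)
		return;  /* nothing changed, nothing to report */

	s->backup = state;
	if (notify_now) {
		s->notifications++;
		s->should_notify = 0;
	} else {
		s->should_notify = !s->should_notify;
	}
}

/* Called at the end of a state machine run, outside its lock. */
static void sketch_flush_notify(struct sketch_slave *s)
{
	if (s->should_notify) {
		s->notifications++;
		s->should_notify = 0;
	}
}
```

The toggle in the deferred branch relies on the state having only two
values: two changes without a flush in between always return the slave to
the state that was last reported.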
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( new_slave ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
break ;
case BOND_MODE_8023AD :
/* in 802.3ad mode, the internal mechanism
* will activate the slaves in the selected
* aggregator
*/
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( new_slave , BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
/* if this is the first slave */
bonding: correctly verify for the first slave in bond_enslave
Commit 1f718f0f4f97145f4072d2d72dcf85069ca7226d ("bonding: populate
neighbour's private on enslave") moved the actual 'linking' to the end of
the function, so that, once linked, the slave is ready to be used and is no
longer in the process of being enslaved.
However, 802.3ad checked whether this is the first slave with
if (bond_first_slave(bond) == new_slave)
which broke once the linking moved to the end: for the first slave,
bond_first_slave(bond) returns NULL.
Fix this by verifying if the prev_slave, that equals bond_last_slave(), is
actually populated - if it is - then it's not the first slave, and vice
versa.
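The check described above can be sketched in user space as follows. The
types and functions (struct sketch_bond, sketch_last_slave(),
sketch_enslave()) are hypothetical and use a fixed array instead of the
driver's slave list; only the ordering of "look up the previous slave" and
"link the new one" is the point.

```c
#include <stddef.h>

#define SKETCH_MAX_SLAVES 8  /* arbitrary bound for the sketch */

struct sketch_slave {
	int ad_id;  /* stand-in for the per-slave 802.3ad id */
};

struct sketch_bond {
	struct sketch_slave *slaves[SKETCH_MAX_SLAVES];
	size_t count;
};

/* Last linked slave, or NULL when the bond has no slaves yet. */
static struct sketch_slave *sketch_last_slave(const struct sketch_bond *b)
{
	return b->count ? b->slaves[b->count - 1] : NULL;
}

static int sketch_enslave(struct sketch_bond *b, struct sketch_slave *new_slave)
{
	struct sketch_slave *prev;

	if (b->count >= SKETCH_MAX_SLAVES)
		return -1;

	/* Sample the previous slave before linking: NULL means "first slave". */
	prev = sketch_last_slave(b);
	new_slave->ad_id = prev ? prev->ad_id + 1 : 1;

	/* Linking happens last, once the slave is fully set up. */
	b->slaves[b->count++] = new_slave;
	return 0;
}
```

Comparing the new slave against the first linked slave cannot work with
this ordering, because nothing is linked yet when the id is assigned.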
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-27 21:10:57 +08:00
if ( ! prev_slave ) {
2014-05-12 15:08:43 +08:00
SLAVE_AD_INFO ( new_slave ) - > id = 1 ;
2005-04-17 06:20:36 +08:00
/* Initialize AD with the number of times that the AD timer is called in 1 second
* can be called only after the mac address of the bond is set
*/
2011-06-09 05:19:02 +08:00
bond_3ad_initialize ( bond , 1000 / AD_TIMER_INTERVAL ) ;
2005-04-17 06:20:36 +08:00
} else {
2014-05-12 15:08:43 +08:00
SLAVE_AD_INFO ( new_slave ) - > id =
SLAVE_AD_INFO ( prev_slave ) - > id + 1 ;
2005-04-17 06:20:36 +08:00
}
bond_3ad_bind_slave ( new_slave ) ;
break ;
case BOND_MODE_TLB :
case BOND_MODE_ALB :
2011-03-12 11:14:37 +08:00
bond_set_active_slave ( new_slave ) ;
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( new_slave , BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
break ;
default :
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " This slave is always active in trunk mode \n " ) ;
2005-04-17 06:20:36 +08:00
/* always active in trunk mode */
2011-03-12 11:14:37 +08:00
bond_set_active_slave ( new_slave ) ;
2005-04-17 06:20:36 +08:00
/* In trunking mode there is little meaning to curr_active_slave
* anyway ( it holds no special properties of the bond device ) ,
* so we can change it without calling change_active_interface ( )
*/
2014-07-15 21:56:55 +08:00
if ( ! rcu_access_pointer ( bond - > curr_active_slave ) & &
new_slave - > link = = BOND_LINK_UP )
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2 Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover
while doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
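The publish/dereference pairing that the message relies on can be illustrated in userspace with C11 atomics. This is only an analogy under stated assumptions: the release store stands in for rcu_assign_pointer, the acquire load is a stronger stand-in for rcu_dereference, and nothing here models grace periods or synchronize_rcu. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the kernel structures. */
struct demo_slave {
	int id;
	int link_up;
};

struct demo_bond {
	_Atomic(struct demo_slave *) curr_active_slave;
};

/* Writer side: the slave must be fully initialised before this call.
 * The release store keeps those initialising writes ordered before the
 * pointer becomes visible to readers.
 */
void demo_publish_active(struct demo_bond *bond, struct demo_slave *slave)
{
	assert(bond != NULL);
	atomic_store_explicit(&bond->curr_active_slave, slave,
			      memory_order_release);
}

/* Reader side: load the pointer once and use only that snapshot. */
struct demo_slave *demo_read_active(struct demo_bond *bond)
{
	assert(bond != NULL);
	return atomic_load_explicit(&bond->curr_active_slave,
				    memory_order_acquire);
}

/* A transmit-path style lookup: id of the active slave, or -1 if none. */
int demo_xmit_slave_id(struct demo_bond *bond)
{
	struct demo_slave *s = demo_read_active(bond);

	return s ? s->id : -1;
}
```

The reader never takes a lock; it either sees no active slave or a fully initialised one, which is the property the barrier remark in the message is about.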
2013-08-01 22:54:51 +08:00
rcu_assign_pointer ( bond - > curr_active_slave , new_slave ) ;
2009-06-13 03:02:48 +08:00
2005-04-17 06:20:36 +08:00
break ;
} /* switch(bond_mode) */
2010-05-06 15:48:51 +08:00
# ifdef CONFIG_NET_POLL_CONTROLLER
2018-04-22 19:11:50 +08:00
if ( bond - > dev - > npinfo ) {
2011-02-18 07:43:32 +08:00
if ( slave_enable_netpoll ( new_slave ) ) {
2019-06-07 22:59:29 +08:00
slave_info ( bond_dev , slave_dev , " master_dev is using netpoll, but new slave device does not support netpoll \n " ) ;
2011-02-18 07:43:32 +08:00
res = - EBUSY ;
2011-12-31 21:26:46 +08:00
goto err_detach ;
2011-02-18 07:43:32 +08:00
}
2010-05-06 15:48:51 +08:00
}
# endif
2011-02-18 07:43:32 +08:00
2014-11-13 14:54:50 +08:00
if ( ! ( bond_dev - > features & NETIF_F_LRO ) )
dev_disable_lro ( slave_dev ) ;
2011-03-22 10:38:12 +08:00
res = netdev_rx_handler_register ( slave_dev , bond_handle_frame ,
new_slave ) ;
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " Error %d calling netdev_rx_handler_register \n " , res ) ;
2013-09-25 15:20:32 +08:00
goto err_detach ;
2011-03-22 10:38:12 +08:00
}
2017-10-05 08:48:47 +08:00
res = bond_master_upper_dev_link ( bond , new_slave , extack ) ;
2013-09-25 15:20:10 +08:00
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " Error %d calling bond_master_upper_dev_link \n " , res ) ;
2013-09-25 15:20:10 +08:00
goto err_unregister ;
}
2014-01-17 14:57:49 +08:00
res = bond_sysfs_slave_add ( new_slave ) ;
if ( res ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " Error %d calling bond_sysfs_slave_add \n " , res ) ;
2014-01-17 14:57:49 +08:00
goto err_upper_unlink ;
}
2018-03-26 01:16:46 +08:00
/* If the mode uses primary, then the following is handled by
* bond_change_active_slave ( ) .
*/
if ( ! bond_uses_primary ( bond ) ) {
/* set promiscuity level to new slave */
if ( bond_dev - > flags & IFF_PROMISC ) {
res = dev_set_promiscuity ( slave_dev , 1 ) ;
if ( res )
goto err_sysfs_del ;
}
/* set allmulti level to new slave */
if ( bond_dev - > flags & IFF_ALLMULTI ) {
res = dev_set_allmulti ( slave_dev , 1 ) ;
2018-03-26 01:16:47 +08:00
if ( res ) {
if ( bond_dev - > flags & IFF_PROMISC )
dev_set_promiscuity ( slave_dev , - 1 ) ;
2018-03-26 01:16:46 +08:00
goto err_sysfs_del ;
2018-03-26 01:16:47 +08:00
}
2018-03-26 01:16:46 +08:00
}
netif_addr_lock_bh ( bond_dev ) ;
dev_mc_sync_multiple ( slave_dev , bond_dev ) ;
dev_uc_sync_multiple ( slave_dev , bond_dev ) ;
netif_addr_unlock_bh ( bond_dev ) ;
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
/* add lacpdu mc addr to mc list */
u8 lacpdu_multicast [ ETH_ALEN ] = MULTICAST_LACPDU_ADDR ;
dev_mc_add ( slave_dev , lacpdu_multicast ) ;
}
}
2013-10-21 17:48:30 +08:00
bond - > slave_cnt + + ;
bond_compute_features ( bond ) ;
bond_set_carrier ( bond ) ;
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) ) {
2014-02-12 12:06:40 +08:00
block_netpoll_tx ( ) ;
2013-10-21 17:48:30 +08:00
bond_select_active_slave ( bond ) ;
2014-02-12 12:06:40 +08:00
unblock_netpoll_tx ( ) ;
2013-10-21 17:48:30 +08:00
}
2013-09-25 15:20:10 +08:00
2018-05-15 02:48:09 +08:00
if ( bond_mode_can_use_xmit_hash ( bond ) )
2014-10-05 08:45:01 +08:00
bond_update_slave_arr ( bond , NULL ) ;
2018-05-10 07:32:11 +08:00
2019-06-07 22:59:29 +08:00
slave_info ( bond_dev , slave_dev , " Enslaving as %s interface with %s link \n " ,
bond_is_active_slave ( new_slave ) ? " an active " : " a backup " ,
new_slave - > link ! = BOND_LINK_DOWN ? " an up " : " a down " ) ;
2005-04-17 06:20:36 +08:00
/* enslave is successful */
2015-02-03 22:48:31 +08:00
bond_queue_slave_event ( new_slave ) ;
2005-04-17 06:20:36 +08:00
return 0 ;
/* Undo stages on error */
2018-03-26 01:16:46 +08:00
err_sysfs_del :
bond_sysfs_slave_del ( new_slave ) ;
2014-01-17 14:57:49 +08:00
err_upper_unlink :
2015-12-03 19:12:14 +08:00
bond_upper_dev_unlink ( bond , new_slave ) ;
2014-01-17 14:57:49 +08:00
2013-09-25 15:20:10 +08:00
err_unregister :
netdev_rx_handler_unregister ( slave_dev ) ;
2011-12-31 21:26:46 +08:00
err_detach :
2013-08-06 18:40:15 +08:00
vlan_vids_del_by_dev ( slave_dev , bond_dev ) ;
2014-09-10 05:17:00 +08:00
if ( rcu_access_pointer ( bond - > primary_slave ) = = new_slave )
RCU_INIT_POINTER ( bond - > primary_slave , NULL ) ;
2014-07-15 21:56:55 +08:00
if ( rcu_access_pointer ( bond - > curr_active_slave ) = = new_slave ) {
2014-02-12 12:06:40 +08:00
block_netpoll_tx ( ) ;
2013-12-13 10:20:07 +08:00
bond_change_active_slave ( bond , NULL ) ;
2013-04-18 15:33:36 +08:00
bond_select_active_slave ( bond ) ;
2014-02-12 12:06:40 +08:00
unblock_netpoll_tx ( ) ;
2013-04-18 15:33:36 +08:00
}
2014-09-10 05:17:00 +08:00
/* either primary_slave or curr_active_slave might've changed */
synchronize_rcu ( ) ;
2013-04-18 15:33:37 +08:00
slave_disable_netpoll ( new_slave ) ;
2011-12-31 21:26:46 +08:00
2005-04-17 06:20:36 +08:00
err_close :
2019-10-22 02:47:52 +08:00
if ( ! netif_is_bond_master ( slave_dev ) )
slave_dev - > priv_flags & = ~ IFF_BONDING ;
2005-04-17 06:20:36 +08:00
dev_close ( slave_dev ) ;
err_restore_mac :
2016-01-11 21:28:43 +08:00
slave_dev - > flags & = ~ IFF_SLAVE ;
2014-01-25 13:00:29 +08:00
if ( ! bond - > params . fail_over_mac | |
2014-05-16 03:39:55 +08:00
BOND_MODE ( bond ) ! = BOND_MODE_ACTIVEBACKUP ) {
2008-05-18 12:10:14 +08:00
/* XXX TODO - fom follow mode needs to change master's
* MAC if this slave ' s MAC is in use by the bond , or at
* least print a warning .
*/
bonding: attempt to better support longer hw addresses
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e7731522, so the alb code is where most of the changes are.
One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over devices with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.
Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0
Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0
Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
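A length-aware copy of the kind described above can be sketched as follows. This is an illustration only: the DEMO_* constants restate the sizes quoted in the message, struct demo_addr_storage is a hypothetical stand-in for struct sockaddr_storage, and the bounds check is an addition for the sketch that the three-argument bond_hw_addr_copy call in this file does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_INFINIBAND_ALEN 20 /* IPoIB address length quoted above */
#define DEMO_MAX_ADDR_LEN    32 /* network core upper bound quoted above */
#define DEMO_SA_DATA_LEN     14 /* address room in struct sockaddr quoted above */

/* Hypothetical holder large enough for any hardware address. */
struct demo_addr_storage {
	unsigned short family;
	unsigned char data[DEMO_MAX_ADDR_LEN];
};

/* Copy len octets of hardware address into dst.
 * Returns 0 on success, -1 if the address does not fit in dst.
 */
int demo_hw_addr_copy(unsigned char *dst, size_t dst_size,
		      const unsigned char *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	if (len > dst_size)
		return -1;
	memcpy(dst, src, len);
	return 0;
}
```

With a 20-octet address, the copy into the 32-octet holder succeeds, while the same copy into a 14-octet buffer is refused. That 14-octet case is the overflow the message attributes to a struct sockaddr declared on the stack.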
2017-04-05 05:32:42 +08:00
bond_hw_addr_copy ( ss . __data , new_slave - > perm_hwaddr ,
new_slave - > dev - > addr_len ) ;
ss . ss_family = slave_dev - > type ;
2018-12-13 19:54:30 +08:00
dev_set_mac_address ( slave_dev , ( struct sockaddr * ) & ss , NULL ) ;
2007-10-10 10:43:39 +08:00
}
2005-04-17 06:20:36 +08:00
2010-05-18 13:42:40 +08:00
err_restore_mtu :
dev_set_mtu ( slave_dev , new_slave - > original_mtu ) ;
2005-04-17 06:20:36 +08:00
err_free :
2014-05-12 15:08:43 +08:00
bond_free_slave ( new_slave ) ;
2005-04-17 06:20:36 +08:00
err_undo_flags :
2013-06-12 06:07:01 +08:00
/* Enslave of first slave has failed and we need to fix master's mac */
2015-07-16 04:57:01 +08:00
if ( ! bond_has_slaves ( bond ) ) {
if ( ether_addr_equal_64bits ( bond_dev - > dev_addr ,
slave_dev - > dev_addr ) )
eth_hw_addr_random ( bond_dev ) ;
if ( bond_dev - > type ! = ARPHRD_ETHER ) {
bonding: fix panic on non-ARPHRD_ETHER enslave failure
Since commit 7d5cd2ce5292, when bond_enslave fails on devices that
are not ARPHRD_ETHER, if needed, it resets the bonding device back to
ARPHRD_ETHER by calling ether_setup.
Unfortunately, ether_setup clobbers dev->flags, clearing IFF_UP
if the bond device is up, leaving it in a quasi-down state without
having actually gone through dev_close. For bonding, if any periodic
work queue items are active (miimon, arp_interval, etc), those will
remain running, as they are stopped by bond_close. At this point, if
the bonding module is unloaded or the bond is deleted, the system will
panic when the work function is called.
This panic is resolved by calling dev_close on the bond itself
prior to calling ether_setup.
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Fixes: 7d5cd2ce5292 ("bonding: correctly handle bonding type change on enslave failure")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
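Why the ordering matters can be shown with a toy model. It assumes, as the message states, that the setup routine overwrites the flags wholesale and that periodic work is only stopped by a real close; the model additionally assumes that close is a no-op on a device whose flags say it is already down. All demo_* names are hypothetical and none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_IFF_UP 0x1u

/* Hypothetical device model: a flags word plus one periodic work item. */
struct demo_dev {
	unsigned int flags;
	bool work_running;
};

void demo_open(struct demo_dev *d)
{
	assert(d != 0);
	d->flags |= DEMO_IFF_UP;
	d->work_running = true;
}

/* Close only does anything if the device believes it is up. */
void demo_close(struct demo_dev *d)
{
	if (!(d->flags & DEMO_IFF_UP))
		return;
	d->work_running = false;
	d->flags &= ~DEMO_IFF_UP;
}

/* Models a setup routine that resets the flags wholesale. */
void demo_setup(struct demo_dev *d)
{
	d->flags = 0;
}

/* Before the fix: the flags are clobbered while the work still runs. */
void demo_reset_type_buggy(struct demo_dev *d)
{
	demo_setup(d);
}

/* After the fix: close first, then reset. */
void demo_reset_type_fixed(struct demo_dev *d)
{
	demo_close(d);
	demo_setup(d);
}
```

In the buggy order a later close sees the device as already down and leaves the work item running, which is the quasi-down state the message describes.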
2015-11-07 09:23:23 +08:00
dev_close ( bond_dev ) ;
2015-07-16 04:57:01 +08:00
ether_setup ( bond_dev ) ;
bond_dev - > flags | = IFF_MASTER ;
bond_dev - > priv_flags & = ~ IFF_TX_SKB_SHARING ;
}
}
2009-06-13 03:02:48 +08:00
2005-04-17 06:20:36 +08:00
return res ;
}
2014-09-15 23:19:34 +08:00
/* Try to release the slave device <slave> from the bond device <master>
2005-04-17 06:20:36 +08:00
* It is legal to access curr_active_slave without a lock because the whole function
2014-09-12 04:49:28 +08:00
* is RTNL - locked . If " all " is true it means that the function is being called
2013-02-18 22:09:42 +08:00
* while destroying a bond interface and all slaves are being released .
2005-04-17 06:20:36 +08:00
*
* The rules for slave state should be :
* for Active / Backup :
* Active stays on ; all backups go down
* for Bonded connections :
* The first up interface should be left on and all others downed .
*/
2013-02-18 22:09:42 +08:00
static int __bond_release_one ( struct net_device * bond_dev ,
struct net_device * slave_dev ,
2017-07-07 06:01:57 +08:00
bool all , bool unregister )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 06:20:36 +08:00
struct slave * slave , * oldcurrent ;
2017-04-05 05:32:42 +08:00
struct sockaddr_storage ss ;
bonding: Fix broken promiscuity reference counting issue
Recently grabbed this report:
https://bugzilla.redhat.com/show_bug.cgi?id=1005567
It describes an issue in which the bonding driver, with an attached vlan, encountered the
following error when bond0 was taken down and back up:
dummy1: promiscuity touches roof, set promiscuity failed. promiscuity feature of
device might be broken.
The error occurs because, during __bond_release_one, if we release our last
slave, we take on a random mac address and issue a NETDEV_CHANGEADDR
notification. With an attached vlan, the vlan may see that the vlan and bond
mac addresses were in sync, but no longer are. This triggers a call to dev_uc_add
and dev_set_rx_mode, which enables IFF_PROMISC on the bond device. Then, when
we complete __bond_release_one, we use the current state of the bond flags to
determine if we should decrement the promiscuity of the releasing slave. But
since the bond changed promiscuity state during the release operation, we
incorrectly decrement the slave promisc count when it wasn't in promiscuous mode
to begin with, causing the above error.
The fix is pretty simple: just cache the bonding flags at the start of the function
and use those when determining the need to set promiscuity.
This is also needed for the ALLMULTI flag.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Mark Wu <wudxw@linux.vnet.ibm.com>
CC: "David S. Miller" <davem@davemloft.net>
Reported-by: Mark Wu <wudxw@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
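The reference-counting problem and the cached-flags fix can be reproduced in a small model. It is a sketch under simplifying assumptions: the slave's promiscuity is a plain counter, and demo_changeaddr_side_effect stands in for the whole NETDEV_CHANGEADDR notifier chain that ends up setting IFF_PROMISC on the bond. All demo_* names are hypothetical.

```c
#include <assert.h>

#define DEMO_IFF_PROMISC 0x100u

/* Hypothetical, simplified models of the bond and a slave. */
struct demo_bond {
	unsigned int flags;
};

struct demo_slave {
	int promiscuity; /* reference count */
};

/* On enslave, the slave inherits promiscuity only if the bond has it. */
void demo_enslave(struct demo_bond *b, struct demo_slave *s)
{
	assert(b != 0 && s != 0);
	if (b->flags & DEMO_IFF_PROMISC)
		s->promiscuity++;
}

/* Stands in for the notifier that turns promiscuity on in the bond. */
void demo_changeaddr_side_effect(struct demo_bond *b)
{
	b->flags |= DEMO_IFF_PROMISC;
}

/* Before the fix: decides from the flags as they are after the notifier. */
void demo_release_buggy(struct demo_bond *b, struct demo_slave *s)
{
	demo_changeaddr_side_effect(b);
	if (b->flags & DEMO_IFF_PROMISC)
		s->promiscuity--;
}

/* After the fix: decides from the flags cached on entry. */
void demo_release_fixed(struct demo_bond *b, struct demo_slave *s)
{
	unsigned int old_flags = b->flags;

	demo_changeaddr_side_effect(b);
	if (old_flags & DEMO_IFF_PROMISC)
		s->promiscuity--;
}
```

Starting from a non-promiscuous bond, the buggy release drives the slave's count to -1, while the fixed release leaves it at 0.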
2013-09-28 00:22:15 +08:00
int old_flags = bond_dev - > flags ;
2011-11-15 23:29:55 +08:00
netdev_features_t old_features = bond_dev - > features ;
2005-04-17 06:20:36 +08:00
/* slave is not a slave or master is not master of this slave */
if ( ! ( slave_dev - > flags & IFF_SLAVE ) | |
2013-01-04 06:49:01 +08:00
! netdev_has_upper_dev ( slave_dev , bond_dev ) ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " cannot release slave \n " ) ;
2005-04-17 06:20:36 +08:00
return - EINVAL ;
}
2010-10-14 00:01:50 +08:00
block_netpoll_tx ( ) ;
2005-04-17 06:20:36 +08:00
slave = bond_get_slave_by_dev ( bond , slave_dev ) ;
if ( ! slave ) {
/* not a slave of this bond */
2019-06-07 22:59:29 +08:00
slave_info ( bond_dev , slave_dev , " interface not enslaved \n " ) ;
2010-10-14 00:01:50 +08:00
unblock_netpoll_tx ( ) ;
2005-04-17 06:20:36 +08:00
return - EINVAL ;
}
2015-12-03 19:12:21 +08:00
bond_set_slave_inactive_flags ( slave , BOND_SLAVE_NOTIFY_NOW ) ;
2014-01-17 14:57:49 +08:00
bond_sysfs_slave_del ( slave ) ;
2014-09-29 10:34:37 +08:00
/* recompute stats just before removing the slave */
bond_get_stats ( bond - > dev , & bond - > bond_stats ) ;
2015-12-03 19:12:14 +08:00
bond_upper_dev_unlink ( bond , slave ) ;
2011-03-22 10:38:12 +08:00
/* unregister rx_handler early so bond_handle_frame wouldn't be called
* for this slave anymore .
*/
netdev_rx_handler_unregister ( slave_dev ) ;
2014-09-12 04:49:27 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD )
2005-04-17 06:20:36 +08:00
bond_3ad_unbind_slave ( slave ) ;
2018-05-15 02:48:09 +08:00
if ( bond_mode_can_use_xmit_hash ( bond ) )
2014-10-05 08:45:01 +08:00
bond_update_slave_arr ( bond , slave ) ;
2019-06-07 22:59:29 +08:00
slave_info ( bond_dev , slave_dev , " Releasing %s interface \n " ,
bond_is_active_slave ( slave ) ? " active " : " backup " ) ;
2005-04-17 06:20:36 +08:00
2014-07-15 21:56:55 +08:00
oldcurrent = rcu_access_pointer ( bond - > curr_active_slave ) ;
2005-04-17 06:20:36 +08:00
2014-07-15 21:56:56 +08:00
RCU_INIT_POINTER ( bond - > current_arp_slave , NULL ) ;
2005-04-17 06:20:36 +08:00
2014-01-25 13:00:29 +08:00
if ( ! all & & ( ! bond - > params . fail_over_mac | |
2014-05-16 03:39:55 +08:00
BOND_MODE ( bond ) ! = BOND_MODE_ACTIVEBACKUP ) ) {
2014-01-02 09:13:16 +08:00
if ( ether_addr_equal_64bits ( bond_dev - > dev_addr , slave - > perm_hwaddr ) & &
2013-09-25 15:20:21 +08:00
bond_has_slaves ( bond ) )
2019-06-07 22:59:29 +08:00
slave_warn ( bond_dev , slave_dev , " the permanent HWaddr of slave - %pM - is still in use by bond - set the HWaddr of slave to a different address to avoid conflicts \n " ,
slave - > perm_hwaddr ) ;
2013-08-01 22:54:47 +08:00
}
2014-09-10 05:17:00 +08:00
if ( rtnl_dereference ( bond - > primary_slave ) = = slave )
RCU_INIT_POINTER ( bond - > primary_slave , NULL ) ;
2005-04-17 06:20:36 +08:00
2014-09-12 04:49:24 +08:00
if ( oldcurrent = = slave )
2005-04-17 06:20:36 +08:00
bond_change_active_slave ( bond , NULL ) ;
2008-12-10 15:07:13 +08:00
if ( bond_is_lb ( bond ) ) {
2005-04-17 06:20:36 +08:00
/* Must be called only after the slave has been
* detached from the list and the curr_active_slave
* has been cleared ( if our_slave = = old_current ) ,
* but before a new active slave is selected .
*/
bond_alb_deinit_slave ( bond , slave ) ;
}
2013-02-18 22:09:42 +08:00
if ( all ) {
2013-12-10 07:19:53 +08:00
RCU_INIT_POINTER ( bond - > curr_active_slave , NULL ) ;
2013-02-18 22:09:42 +08:00
} else if ( oldcurrent = = slave ) {
2014-09-15 23:19:34 +08:00
/* Note that we hold RTNL over this sequence, so there
2007-10-18 08:37:49 +08:00
* is no concern that another slave add / remove event
* will interfere .
*/
2005-04-17 06:20:36 +08:00
bond_select_active_slave ( bond ) ;
2007-10-18 08:37:49 +08:00
}
2013-09-25 15:20:21 +08:00
if ( ! bond_has_slaves ( bond ) ) {
2006-03-28 05:27:43 +08:00
bond_set_carrier ( bond ) ;
2013-01-30 18:08:11 +08:00
eth_hw_addr_random ( bond_dev ) ;
2005-04-17 06:20:36 +08:00
}
2010-10-14 00:01:50 +08:00
unblock_netpoll_tx ( ) ;
2013-08-01 22:54:51 +08:00
synchronize_rcu ( ) ;
2014-02-26 21:20:30 +08:00
bond - > slave_cnt - - ;
2005-04-17 06:20:36 +08:00
2013-09-25 15:20:21 +08:00
if ( ! bond_has_slaves ( bond ) ) {
2012-04-04 06:56:19 +08:00
call_netdevice_notifiers ( NETDEV_CHANGEADDR , bond - > dev ) ;
2013-03-06 15:10:32 +08:00
call_netdevice_notifiers ( NETDEV_RELEASE , bond - > dev ) ;
}
2012-04-04 06:56:19 +08:00
2011-05-07 11:22:17 +08:00
bond_compute_features ( bond ) ;
if ( ! ( bond_dev - > features & NETIF_F_VLAN_CHALLENGED ) & &
( old_features & NETIF_F_VLAN_CHALLENGED ) )
2019-06-07 22:59:29 +08:00
slave_info ( bond_dev , slave_dev , " last VLAN challenged slave left bond - VLAN blocking is removed \n " ) ;
2011-05-07 11:22:17 +08:00
2013-08-06 18:40:15 +08:00
vlan_vids_del_by_dev ( slave_dev , bond_dev ) ;
2005-04-17 06:20:36 +08:00
2014-09-15 23:19:34 +08:00
/* If the mode uses primary, then this case was handled above by
2013-05-31 19:57:30 +08:00
* bond_change_active_slave ( . . . , NULL )
2005-04-17 06:20:36 +08:00
*/
2014-05-16 03:39:54 +08:00
if ( ! bond_uses_primary ( bond ) ) {
2013-09-28 00:22:15 +08:00
/* unset promiscuity level from slave
* NOTE : The NETDEV_CHANGEADDR call above may change the value
* of the IFF_PROMISC flag in the bond_dev , but we need the
* value of that flag before that change , as that was the value
* when this slave was attached , so we cache at the start of the
* function and use it here . Same goes for ALLMULTI below
*/
if ( old_flags & IFF_PROMISC )
2005-04-17 06:20:36 +08:00
dev_set_promiscuity ( slave_dev , - 1 ) ;
/* unset allmulti level from slave */
2013-09-28 00:22:15 +08:00
if ( old_flags & IFF_ALLMULTI )
2005-04-17 06:20:36 +08:00
dev_set_allmulti ( slave_dev , - 1 ) ;
2013-05-31 19:57:30 +08:00
bond_hw_addr_flush ( bond_dev , slave_dev ) ;
2005-04-17 06:20:36 +08:00
}
2011-02-18 07:43:32 +08:00
slave_disable_netpoll ( slave ) ;
2010-05-06 15:48:51 +08:00
2005-04-17 06:20:36 +08:00
/* close slave before restoring its mac address */
dev_close ( slave_dev ) ;
2014-01-25 13:00:29 +08:00
if ( bond - > params . fail_over_mac ! = BOND_FOM_ACTIVE | |
2014-05-16 03:39:55 +08:00
BOND_MODE ( bond ) ! = BOND_MODE_ACTIVEBACKUP ) {
2007-10-10 10:43:39 +08:00
/* restore original ("permanent") mac address */
2017-04-05 05:32:42 +08:00
bond_hw_addr_copy ( ss . __data , slave - > perm_hwaddr ,
slave - > dev - > addr_len ) ;
ss . ss_family = slave_dev - > type ;
2018-12-13 19:54:30 +08:00
dev_set_mac_address ( slave_dev , ( struct sockaddr * ) & ss , NULL ) ;
2007-10-10 10:43:39 +08:00
}
2005-04-17 06:20:36 +08:00
2017-07-07 06:01:57 +08:00
if ( unregister )
__dev_set_mtu ( slave_dev , slave - > original_mtu ) ;
else
dev_set_mtu ( slave_dev , slave - > original_mtu ) ;
2010-05-18 13:42:40 +08:00
2019-10-22 02:47:52 +08:00
if ( ! netif_is_bond_master ( slave_dev ) )
slave_dev - > priv_flags & = ~ IFF_BONDING ;
2005-04-17 06:20:36 +08:00
2014-05-12 15:08:43 +08:00
bond_free_slave ( slave ) ;
2005-04-17 06:20:36 +08:00
2014-09-15 23:19:34 +08:00
return 0 ;
2005-04-17 06:20:36 +08:00
}
2013-02-18 22:09:42 +08:00
/* A wrapper used because of ndo_del_slave */
int bond_release ( struct net_device * bond_dev , struct net_device * slave_dev )
{
2017-07-07 06:01:57 +08:00
return __bond_release_one ( bond_dev , slave_dev , false , false ) ;
2013-02-18 22:09:42 +08:00
}
2014-09-15 23:19:34 +08:00
/* First release a slave and then destroy the bond if no more slaves are left.
* Must be under rtnl_lock when this function is called .
*/
2019-06-07 22:59:29 +08:00
static int bond_release_and_destroy ( struct net_device * bond_dev ,
struct net_device * slave_dev )
2007-10-10 10:43:43 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2007-10-10 10:43:43 +08:00
int ret ;
2017-07-07 06:01:57 +08:00
ret = __bond_release_one ( bond_dev , slave_dev , false , true ) ;
2013-09-25 15:20:21 +08:00
if ( ret = = 0 & & ! bond_has_slaves ( bond ) ) {
2011-02-18 07:43:32 +08:00
bond_dev - > priv_flags | = IFF_DISABLE_NETPOLL ;
2019-06-07 22:59:29 +08:00
netdev_info ( bond_dev , " Destroying bond \n " ) ;
2015-07-16 03:52:51 +08:00
bond_remove_proc_entry ( bond ) ;
2009-06-13 03:02:47 +08:00
unregister_netdevice ( bond_dev ) ;
2007-10-10 10:43:43 +08:00
}
return ret ;
}
2017-02-03 12:46:21 +08:00
static void bond_info_query ( struct net_device * bond_dev , struct ifbond * info )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2015-02-03 22:48:30 +08:00
bond_fill_ifbond ( bond , info ) ;
2005-04-17 06:20:36 +08:00
}
static int bond_slave_info_query ( struct net_device * bond_dev , struct ifslave * info )
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2013-08-01 22:54:47 +08:00
int i = 0 , res = - ENODEV ;
2005-04-17 06:20:36 +08:00
struct slave * slave ;
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2013-08-01 22:54:47 +08:00
if ( i + + = = ( int ) info - > slave_id ) {
2009-04-23 11:39:04 +08:00
res = 0 ;
2015-02-03 22:48:30 +08:00
bond_fill_ifslave ( slave , info ) ;
2005-04-17 06:20:36 +08:00
break ;
}
}
2009-04-23 11:39:04 +08:00
return res ;
2005-04-17 06:20:36 +08:00
}
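The lookup above walks the slave list, counts entries, and reports -ENODEV when the requested index is past the end. A minimal user-space sketch of that pattern follows. `model_slave_info` and `slave_info_query` are hypothetical names, and a plain array stands in for the driver's slave list.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the ifslave request/response structure. */
struct model_slave_info {
	int slave_id; /* index requested by the caller */
	int link;     /* filled in when the index is found */
};

/* Return 0 and fill 'info' if slave_id is in range, else -ENODEV. */
static int slave_info_query(const int *links, size_t n,
			    struct model_slave_info *info)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if ((int)i == info->slave_id) {
			info->link = links[i];
			return 0;
		}
	}
	return -ENODEV;
}
```

A caller can therefore enumerate slaves by increasing `slave_id` until the query fails.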
/*-------------------------------- Monitoring -------------------------------*/
2014-07-15 21:56:55 +08:00
/* called with rcu_read_lock() */
2008-07-03 09:21:58 +08:00
static int bond_miimon_inspect ( struct bonding * bond )
{
2013-08-01 22:54:47 +08:00
int link_state , commit = 0 ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2008-07-03 09:21:58 +08:00
struct slave * slave ;
2009-04-24 11:57:29 +08:00
bool ignore_updelay ;
2014-07-15 21:56:55 +08:00
ignore_updelay = ! rcu_dereference ( bond - > curr_active_slave ) ;
2005-04-17 06:20:36 +08:00
bonding: rebuild the lock use for bond_mii_monitor()
bond_mii_monitor() still uses the bond lock to protect the bond slave
list, which has no effect. There are two ways to fix the problem: move
RTNL to the top of the function, or add RCU to protect the bond slave
list. According to Jay Vosburgh, taking RTNL ten times per second to
protect the whole monitor is a truly big performance loss, so I take
his advice and use RCU to protect the bond slave list.
bond_has_slaves() is not protected by anything. Nothing bad happens if
the slave list changes, unless the bond was freed, but that cannot
happen before the monitor runs, because the bond is closed before it
is freed.
The peer notification for the bond reads curr_active_slave, so
dereference the slave to make sure we access the same slave if
curr_active_slave changes. Because the RCU dereference needs a
read-side critical section and bond_change_active_slave() calls it
without holding RCU, add the peer notification inside rcu_read_lock(),
nested in the monitor.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-13 10:19:39 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_NOCHANGE ) ;
2005-04-17 06:20:36 +08:00
2008-07-03 09:21:58 +08:00
link_state = bond_check_dev_link ( bond , slave - > dev , 0 ) ;
2005-04-17 06:20:36 +08:00
switch ( slave - > link ) {
2008-07-03 09:21:58 +08:00
case BOND_LINK_UP :
if ( link_state )
continue ;
2005-04-17 06:20:36 +08:00
2017-03-28 02:37:33 +08:00
bond_propose_link_state ( slave , BOND_LINK_FAIL ) ;
2017-07-26 00:44:25 +08:00
commit + + ;
2008-07-03 09:21:58 +08:00
slave - > delay = bond - > params . downdelay ;
if ( slave - > delay ) {
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status down for %sinterface, disabling it in %d ms \n " ,
( BOND_MODE ( bond ) = =
BOND_MODE_ACTIVEBACKUP ) ?
( bond_is_active_slave ( slave ) ?
" active " : " backup " ) : " " ,
bond - > params . downdelay * bond - > params . miimon ) ;
2005-04-17 06:20:36 +08:00
}
2008-07-03 09:21:58 +08:00
/*FALLTHRU*/
case BOND_LINK_FAIL :
if ( link_state ) {
2014-09-15 23:19:34 +08:00
/* recovered before downdelay expired */
2017-03-28 02:37:33 +08:00
bond_propose_link_state ( slave , BOND_LINK_UP ) ;
2014-02-18 14:48:46 +08:00
slave - > last_link_up = jiffies ;
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status up again after %d ms \n " ,
( bond - > params . downdelay - slave - > delay ) *
bond - > params . miimon ) ;
2017-04-12 13:36:00 +08:00
commit + + ;
2008-07-03 09:21:58 +08:00
continue ;
2005-04-17 06:20:36 +08:00
}
2008-07-03 09:21:58 +08:00
if ( slave - > delay < = 0 ) {
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_DOWN ) ;
2008-07-03 09:21:58 +08:00
commit + + ;
continue ;
2005-04-17 06:20:36 +08:00
}
2008-07-03 09:21:58 +08:00
slave - > delay - - ;
break ;
case BOND_LINK_DOWN :
if ( ! link_state )
continue ;
2017-03-28 02:37:33 +08:00
bond_propose_link_state ( slave , BOND_LINK_BACK ) ;
2017-07-26 00:44:25 +08:00
commit + + ;
2008-07-03 09:21:58 +08:00
slave - > delay = bond - > params . updelay ;
if ( slave - > delay ) {
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status up, enabling it in %d ms \n " ,
ignore_updelay ? 0 :
bond - > params . updelay *
bond - > params . miimon ) ;
2008-07-03 09:21:58 +08:00
}
/*FALLTHRU*/
case BOND_LINK_BACK :
if ( ! link_state ) {
2017-03-28 02:37:33 +08:00
bond_propose_link_state ( slave , BOND_LINK_DOWN ) ;
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status down again after %d ms \n " ,
( bond - > params . updelay - slave - > delay ) *
bond - > params . miimon ) ;
2017-04-12 13:36:00 +08:00
commit + + ;
2008-07-03 09:21:58 +08:00
continue ;
}
2009-04-24 11:57:29 +08:00
if ( ignore_updelay )
slave - > delay = 0 ;
2008-07-03 09:21:58 +08:00
if ( slave - > delay < = 0 ) {
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_UP ) ;
2008-07-03 09:21:58 +08:00
commit + + ;
2009-04-24 11:57:29 +08:00
ignore_updelay = false ;
2008-07-03 09:21:58 +08:00
continue ;
2005-04-17 06:20:36 +08:00
}
2008-07-03 09:21:58 +08:00
slave - > delay - - ;
2005-04-17 06:20:36 +08:00
break ;
2008-07-03 09:21:58 +08:00
}
}
2005-04-17 06:20:36 +08:00
2008-07-03 09:21:58 +08:00
return commit ;
}
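The inspection pass above is a four-state machine (UP, FAIL, DOWN, BACK) in which `downdelay` and `updelay` count monitor ticks before a transition completes. The following user-space sketch models one slave under simplifying assumptions: the names are hypothetical, logging is dropped, and the `ignore_updelay` shortcut is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's BOND_LINK_* values. */
enum link_state { LINK_UP, LINK_FAIL, LINK_DOWN, LINK_BACK, LINK_NOCHANGE };

struct model_slave {
	enum link_state link;     /* committed state */
	enum link_state proposed; /* proposal from the latest inspection */
	int delay;                /* ticks left before the transition completes */
};

/* One inspection pass for one slave; returns the number of proposals made. */
static int inspect_tick(struct model_slave *s, bool carrier,
			int updelay, int downdelay)
{
	int commit = 0;

	s->proposed = LINK_NOCHANGE;

	switch (s->link) {
	case LINK_UP:
		if (carrier)
			return 0;
		s->proposed = LINK_FAIL;
		commit++;
		s->delay = downdelay;
		/* fall through */
	case LINK_FAIL:
		if (carrier) {
			/* recovered before downdelay expired */
			s->proposed = LINK_UP;
			return commit + 1;
		}
		if (s->delay <= 0) {
			s->proposed = LINK_DOWN;
			return commit + 1;
		}
		s->delay--;
		break;
	case LINK_DOWN:
		if (!carrier)
			return 0;
		s->proposed = LINK_BACK;
		commit++;
		s->delay = updelay;
		/* fall through */
	case LINK_BACK:
		if (!carrier) {
			s->proposed = LINK_DOWN;
			return commit + 1;
		}
		if (s->delay <= 0) {
			s->proposed = LINK_UP;
			return commit + 1;
		}
		s->delay--;
		break;
	default:
		break;
	}
	return commit;
}

/* Apply the proposal, as a separate commit step would. */
static void apply_proposal(struct model_slave *s)
{
	if (s->proposed != LINK_NOCHANGE)
		s->link = s->proposed;
}
```

In this model, a slave that loses carrier with `downdelay` set to 2 is proposed as FAIL on the first tick and as DOWN two ticks later, which matches the "disabling it in %d ms" arithmetic of `downdelay * miimon`.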
2005-04-17 06:20:36 +08:00
2018-05-17 10:09:23 +08:00
static void bond_miimon_link_change ( struct bonding * bond ,
struct slave * slave ,
char link )
{
switch ( BOND_MODE ( bond ) ) {
case BOND_MODE_8023AD :
bond_3ad_handle_link_change ( slave , link ) ;
break ;
case BOND_MODE_TLB :
case BOND_MODE_ALB :
bond_alb_handle_link_change ( bond , slave , link ) ;
break ;
case BOND_MODE_XOR :
bond_update_slave_arr ( bond , NULL ) ;
break ;
}
}
2008-07-03 09:21:58 +08:00
static void bond_miimon_commit ( struct bonding * bond )
{
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2014-09-10 05:17:00 +08:00
struct slave * slave , * primary ;
2008-07-03 09:21:58 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2019-11-02 12:56:42 +08:00
switch ( slave - > link_new_state ) {
2008-07-03 09:21:58 +08:00
case BOND_LINK_NOCHANGE :
2019-07-17 06:25:10 +08:00
/* For 802.3ad mode, check current slave speed and
* duplex again in case its port was disabled after
* invalid speed / duplex reporting but recovered before
* link monitoring could make a decision on the actual
* link status
*/
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD & &
slave - > link = = BOND_LINK_UP )
bond_3ad_adapter_speed_duplex_changed ( slave ) ;
2008-07-03 09:21:58 +08:00
continue ;
2005-04-17 06:20:36 +08:00
2008-07-03 09:21:58 +08:00
case BOND_LINK_UP :
2017-08-10 12:41:44 +08:00
if ( bond_update_speed_duplex ( slave ) & &
bond_needs_speed_duplex ( bond ) ) {
2017-04-04 09:38:39 +08:00
slave - > link = BOND_LINK_DOWN ;
2017-08-12 06:36:55 +08:00
if ( net_ratelimit ( ) )
2019-06-07 22:59:29 +08:00
slave_warn ( bond - > dev , slave - > dev ,
" failed to get link speed/duplex \n " ) ;
2017-03-28 02:37:37 +08:00
continue ;
}
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( slave , BOND_LINK_UP ,
BOND_SLAVE_NOTIFY_NOW ) ;
2014-02-18 14:48:46 +08:00
slave - > last_link_up = jiffies ;
2008-07-03 09:21:58 +08:00
2014-09-10 05:17:00 +08:00
primary = rtnl_dereference ( bond - > primary_slave ) ;
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
2008-07-03 09:21:58 +08:00
/* prevent it from being the active one */
2011-03-12 11:14:37 +08:00
bond_set_backup_slave ( slave ) ;
2014-05-16 03:39:55 +08:00
} else if ( BOND_MODE ( bond ) ! = BOND_MODE_ACTIVEBACKUP ) {
2008-07-03 09:21:58 +08:00
/* make it immediately active */
2011-03-12 11:14:37 +08:00
bond_set_active_slave ( slave ) ;
2005-04-17 06:20:36 +08:00
}
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status definitely up, %u Mbps %s duplex \n " ,
slave - > speed = = SPEED_UNKNOWN ? 0 : slave - > speed ,
slave - > duplex ? " full " : " half " ) ;
2005-04-17 06:20:36 +08:00
2018-05-17 10:09:23 +08:00
bond_miimon_link_change ( bond , slave , BOND_LINK_UP ) ;
2014-10-05 08:45:01 +08:00
2014-09-10 05:17:00 +08:00
if ( ! bond - > curr_active_slave | | slave = = primary )
2008-07-03 09:21:58 +08:00
goto do_failover ;
2005-04-17 06:20:36 +08:00
2008-07-03 09:21:58 +08:00
continue ;
2007-10-18 08:37:49 +08:00
2008-07-03 09:21:58 +08:00
case BOND_LINK_DOWN :
2008-10-31 08:41:14 +08:00
if ( slave - > link_failure_count < UINT_MAX )
slave - > link_failure_count + + ;
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( slave , BOND_LINK_DOWN ,
BOND_SLAVE_NOTIFY_NOW ) ;
2005-04-17 06:20:36 +08:00
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_ACTIVEBACKUP | |
BOND_MODE ( bond ) = = BOND_MODE_8023AD )
bonding: Fix RTNL: assertion failed at net/core/rtnetlink.c for 802.3ad mode
The problem was introduced by the commit 1d3ee88ae0d
(bonding: add netlink attributes to slave link dev).
The bond_set_active_slave() and bond_set_backup_slave()
will use rtmsg_ifinfo to send slave's states, so these
two functions should be called in RTNL.
In 802.3ad mode, acquiring RTNL for the __enable_port and
__disable_port cases is difficult, as those calls generally
already hold the state machine lock, and cannot unconditionally
call rtnl_lock because either they already hold RTNL (for calls
via bond_3ad_unbind_slave) or due to the potential for deadlock
with bond_3ad_adapter_speed_changed, bond_3ad_adapter_duplex_changed,
bond_3ad_link_change, or bond_3ad_update_lacp_rate. All four of
those are called with RTNL held, and acquire the state machine lock
second. The calling contexts for __enable_port and __disable_port
already hold the state machine lock, and may or may not need RTNL.
As Jay suggested, it is not a problem if the slave does not send the
notify message synchronously when its status changes. The state
machine normally runs every 100 ms, so sending the notify message at
the end of the state machine, if the slave's state changed, is better.
I fix the problem through these steps:
1). Add a new function, bond_set_slave_state(), which changes the
slave's state and calls rtmsg_ifinfo() according to its notify
parameter.
2). Add a new slave field called should_notify. If the slave's state
changes and no notification has been sent yet, the field is set to 1. If
the state then changes again, the field is set to 0, indicating that
the slave's state has been restored and nobody needs to be notified.
3). __enable_port and __disable_port should not call rtmsg_ifinfo
under the state machine lock. Any change in the state of a slave sets
a flag in the slave, indicating that rtmsg_ifinfo should be called at
the end of the state machine.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( slave ,
BOND_SLAVE_NOTIFY_NOW ) ;
2008-07-03 09:21:58 +08:00
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status definitely down, disabling slave \n " ) ;
2008-07-03 09:21:58 +08:00
2018-05-17 10:09:23 +08:00
bond_miimon_link_change ( bond , slave , BOND_LINK_DOWN ) ;
2014-10-05 08:45:01 +08:00
2014-07-15 21:56:55 +08:00
if ( slave = = rcu_access_pointer ( bond - > curr_active_slave ) )
2008-07-03 09:21:58 +08:00
goto do_failover ;
continue ;
default :
2019-06-07 22:59:29 +08:00
slave_err ( bond - > dev , slave - > dev , " invalid new link %d on slave \n " ,
2019-11-02 12:56:42 +08:00
slave - > link_new_state ) ;
bond_propose_link_state ( slave , BOND_LINK_NOCHANGE ) ;
2008-07-03 09:21:58 +08:00
continue ;
}
do_failover :
2010-10-14 00:01:50 +08:00
block_netpoll_tx ( ) ;
2008-07-03 09:21:58 +08:00
bond_select_active_slave ( bond ) ;
2010-10-14 00:01:50 +08:00
unblock_netpoll_tx ( ) ;
2008-07-03 09:21:58 +08:00
}
bond_set_carrier ( bond ) ;
2005-04-17 06:20:36 +08:00
}
2014-09-15 23:19:34 +08:00
/* bond_mii_monitor
2007-10-18 08:37:48 +08:00
*
* Really a wrapper that splits the mii monitor into two phases : an
2008-07-03 09:21:58 +08:00
* inspection , then ( if inspection indicates something needs to be done )
* an acquisition of appropriate locks followed by a commit phase to
* implement whatever link state changes are indicated .
2007-10-18 08:37:48 +08:00
*/
2013-12-31 02:43:41 +08:00
static void bond_mii_monitor ( struct work_struct * work )
2007-10-18 08:37:48 +08:00
{
struct bonding * bond = container_of ( work , struct bonding ,
mii_work . work ) ;
2011-04-26 23:25:52 +08:00
bool should_notify_peers = false ;
bonding: add an option to specify a delay between peer notifications
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.
Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.
When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:
20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.
iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 01:43:54 +08:00
bool commit ;
2013-10-28 12:11:22 +08:00
unsigned long delay ;
2017-03-28 02:37:33 +08:00
struct slave * slave ;
struct list_head * iter ;
2007-10-18 08:37:48 +08:00
2013-10-28 12:11:22 +08:00
delay = msecs_to_jiffies ( bond - > params . miimon ) ;
if ( ! bond_has_slaves ( bond ) )
2008-07-03 09:21:58 +08:00
goto re_arm ;
2008-06-14 09:12:03 +08:00
2013-12-13 10:19:39 +08:00
rcu_read_lock ( ) ;
2011-04-26 23:25:52 +08:00
should_notify_peers = bond_should_notify_peers ( bond ) ;
2019-07-03 01:43:54 +08:00
commit = ! ! bond_miimon_inspect ( bond ) ;
if ( bond - > send_peer_notif ) {
rcu_read_unlock ( ) ;
if ( rtnl_trylock ( ) ) {
bond - > send_peer_notif - - ;
rtnl_unlock ( ) ;
}
} else {
2013-12-13 10:19:39 +08:00
rcu_read_unlock ( ) ;
2019-07-03 01:43:54 +08:00
}
2008-07-03 09:21:58 +08:00
2019-07-03 01:43:54 +08:00
if ( commit ) {
2013-10-28 12:11:22 +08:00
/* Race avoidance with bond_close cancel of workqueue */
if ( ! rtnl_trylock ( ) ) {
delay = 1 ;
should_notify_peers = false ;
goto re_arm ;
}
2013-10-24 11:09:03 +08:00
2017-03-28 02:37:33 +08:00
bond_for_each_slave ( bond , slave , iter ) {
bond_commit_link_state ( slave , BOND_SLAVE_NOTIFY_LATER ) ;
}
2013-10-28 12:11:22 +08:00
bond_miimon_commit ( bond ) ;
rtnl_unlock ( ) ; /* might sleep, hold no other locks */
2019-07-03 01:43:54 +08:00
}
2007-10-18 08:37:48 +08:00
2008-07-03 09:21:58 +08:00
re_arm :
bonding: eliminate bond_close race conditions
This patch resolves two sets of race conditions.
Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> reported the
first, as follows:
The bond_close() calls cancel_delayed_work() to cancel delayed works.
It, however, cannot cancel works that were already queued in workqueue.
The bond_open() initializes work->data, and process_one_work() refers
get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
work->data has been initialized. Thus, a panic occurs.
He included a patch that converted the cancel_delayed_work calls
in bond_close to flush_delayed_work_sync, which eliminated the above
problem.
His patch is incorporated, at least in principle, into this
patch. In this patch, we use cancel_delayed_work_sync in place of
flush_delayed_work_sync, and also convert bond_uninit in addition to
bond_close.
This conversion to _sync, however, opens new races between
bond_close and three periodically executing workqueue functions:
bond_mii_monitor, bond_alb_monitor and bond_activebackup_arp_mon.
The race occurs because bond_close and bond_uninit are always
called with RTNL held, and these workqueue functions may acquire RTNL to
perform failover-related activities. If bond_close or bond_uninit is
waiting in cancel_delayed_work_sync, deadlock occurs.
These deadlocks are resolved by having the workqueue functions
acquire RTNL conditionally. If the rtnl_trylock() fails, the functions
reschedule and return immediately. For the cases that are attempting to
perform link failover, a delay of 1 is used; for the other cases, the
normal interval is used (as those activities are not as time critical).
Additionally, the bond_mii_monitor function now stores the delay
in a variable (mimicking the structure of activebackup_arp_mon).
Lastly, all of the above renders the kill_timers sentinel moot,
and therefore it has been removed.
Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-28 23:42:50 +08:00
if ( bond - > params . miimon )
2013-10-28 12:11:22 +08:00
queue_delayed_work ( bond - > wq , & bond - > mii_work , delay ) ;
if ( should_notify_peers ) {
if ( ! rtnl_trylock ( ) )
return ;
call_netdevice_notifiers ( NETDEV_NOTIFY_PEERS , bond - > dev ) ;
rtnl_unlock ( ) ;
}
2007-10-18 08:37:48 +08:00
}
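The monitor above never blocks on RTNL: if the trylock fails while a commit is pending, it re-arms itself with a delay of 1 and returns, which avoids the deadlock with a close path that holds the lock while waiting for the work item to finish. The sketch below illustrates that shape in user space. It assumes a C11 `atomic_flag` as a stand-in for the global lock, and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a global lock such as RTNL. */
static atomic_flag big_lock = ATOMIC_FLAG_INIT;

static bool big_trylock(void)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set(&big_lock);
}

static void big_unlock(void)
{
	atomic_flag_clear(&big_lock);
}

struct monitor {
	unsigned int interval; /* normal polling interval, in ticks */
	int committed;         /* number of commit phases that ran */
};

/*
 * One monitor pass. 'pending' says whether the inspection phase found
 * work to do. Returns the delay to use when re-arming the timer.
 */
static unsigned int monitor_pass(struct monitor *m, bool pending)
{
	unsigned int delay = m->interval;

	if (pending) {
		if (!big_trylock())
			return 1; /* lock is busy: retry on the next tick */
		m->committed++;
		big_unlock();
	}
	return delay;
}
```

The short retry delay applies only when a commit is pending; passes with nothing to commit keep the normal interval.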
2005-06-27 05:52:20 +08:00
2016-10-18 10:15:45 +08:00
static int bond_upper_dev_walk ( struct net_device * upper , void * data )
{
__be32 ip = * ( ( __be32 * ) data ) ;
return ip = = bond_confirm_addr ( upper , 0 , ip ) ;
}
2013-08-29 05:25:11 +08:00
static bool bond_has_this_ip ( struct bonding * bond , __be32 ip )
2006-09-23 12:54:53 +08:00
{
2013-08-29 05:25:11 +08:00
bool ret = false ;
2006-09-23 12:54:53 +08:00
2012-03-23 00:14:29 +08:00
if ( ip = = bond_confirm_addr ( bond - > dev , 0 , ip ) )
2013-08-29 05:25:11 +08:00
return true ;
2006-09-23 12:54:53 +08:00
2013-08-29 05:25:11 +08:00
rcu_read_lock ( ) ;
2016-10-18 10:15:45 +08:00
if ( netdev_walk_all_upper_dev_rcu ( bond - > dev , bond_upper_dev_walk , & ip ) )
ret = true ;
2013-08-29 05:25:11 +08:00
rcu_read_unlock ( ) ;
2006-09-23 12:54:53 +08:00
2013-08-29 05:25:11 +08:00
return ret ;
2006-09-23 12:54:53 +08:00
}
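The address check above tests the bond device itself first and then visits every device stacked on top of it, stopping as soon as the callback returns nonzero. A simplified sketch of that callback-walk pattern follows. It assumes a single chain of upper devices rather than the kernel's adjacency lists, and `model_dev`, `walk_uppers`, `addr_matches`, and `has_this_addr` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct model_dev {
	uint32_t addr;           /* address configured on this device, 0 if none */
	struct model_dev *upper; /* next device stacked above, or NULL */
};

/* Call fn on each upper device; stop and return at the first nonzero result. */
static int walk_uppers(struct model_dev *dev,
		       int (*fn)(struct model_dev *upper, void *data),
		       void *data)
{
	struct model_dev *u;

	for (u = dev->upper; u; u = u->upper) {
		int ret = fn(u, data);

		if (ret)
			return ret;
	}
	return 0;
}

/* Callback: nonzero when the upper device owns the address in 'data'. */
static int addr_matches(struct model_dev *upper, void *data)
{
	uint32_t ip = *(uint32_t *)data;

	return upper->addr == ip;
}

static int has_this_addr(struct model_dev *dev, uint32_t ip)
{
	if (dev->addr == ip)
		return 1;
	return walk_uppers(dev, addr_matches, &ip) != 0;
}
```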
2014-09-15 23:19:34 +08:00
/* We go to the (large) trouble of VLAN tagging ARP frames because
2005-06-27 05:52:20 +08:00
* switches in VLAN mode ( especially if ports are configured as
* " native " to a VLAN ) might not pass non - tagged frames .
*/
2019-06-07 22:59:29 +08:00
static void bond_arp_send ( struct slave * slave , int arp_op , __be32 dest_ip ,
__be32 src_ip , struct bond_vlan_tag * tags )
2005-06-27 05:52:20 +08:00
{
struct sk_buff * skb ;
2014-07-17 23:02:23 +08:00
struct bond_vlan_tag * outer_tag = tags ;
2019-06-07 22:59:29 +08:00
struct net_device * slave_dev = slave - > dev ;
struct net_device * bond_dev = slave - > bond - > dev ;
2005-06-27 05:52:20 +08:00
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " arp %d on slave: dst %pI4 src %pI4 \n " ,
arp_op , & dest_ip , & src_ip ) ;
2009-06-13 03:02:48 +08:00
2005-06-27 05:52:20 +08:00
skb = arp_create ( arp_op , ETH_P_ARP , dest_ip , slave_dev , src_ip ,
NULL , slave_dev - > dev_addr , NULL ) ;
if ( ! skb ) {
2014-03-25 17:44:44 +08:00
net_err_ratelimited ( " ARP packet allocation failed \n " ) ;
2005-06-27 05:52:20 +08:00
return ;
}
bonding: support QinQ for bond arp interval
The bond sends ARP requests to check that a slave is active. If a VLAN device
is stacked on the bond, it sets the VLAN tag in the skb to reach the VLAN group, but the
bond could only send an skb with the 802.1q protocol; QinQ was not supported.
So add an outer tag for the lower VLAN and an inner tag for the upper VLAN to support QinQ.
The new skb consists of two VLAN tags, like this:
dst mac | src mac | outer vlan tag | inner vlan tag | data | .....
If QinQ is not needed, the inner VLAN tag can be set to 0 and the outer VLAN tag
used as a normal VLAN group.
Configure the bond for QinQ using "ip link"; the test log follows:
ip link add link bond0 bond0.20 type vlan proto 802.1ad id 20
ip link add link bond0.20 bond0.20.200 type vlan proto 802.1q id 200
ifconfig bond0.20 11.11.20.36/24
ifconfig bond0.20.200 11.11.200.36/24
echo +11.11.200.37 > /sys/class/net/bond0/bonding/arp_ip_target
90:e2:ba:07:4a:5c (oui Unknown) > Broadcast, ethertype 802.1Q-QinQ (0x88a8),length 50: vlan 20, p 0,ethertype 802.1Q, vlan 200, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 11.11.200.37 tell 11.11.200.36, length 28
90:e2:ba:06:f9:86 (oui Unknown) > 90:e2:ba:07:4a:5c (oui Unknown), ethertype 802.1Q-QinQ (0x88a8), length 50: vlan 20, p 0, ethertype 802.1Q, vlan 200, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 11.11.200.37 is-at 90:e2:ba:06:f9:86 (oui Unknown), length 28
v1->v2: remove the comment "TODO: QinQ?".
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-25 17:44:43 +08:00
2014-07-17 23:02:23 +08:00
if ( ! tags | | tags - > vlan_proto = = VLAN_N_VID )
goto xmit ;
tags + + ;
2014-05-17 05:20:38 +08:00
/* Go through all the tags backwards and add them to the packet */
2014-07-17 23:02:23 +08:00
while ( tags - > vlan_proto ! = VLAN_N_VID ) {
if ( ! tags - > vlan_id ) {
tags + + ;
2014-05-17 05:20:38 +08:00
continue ;
2014-07-17 23:02:23 +08:00
}
2014-05-17 05:20:38 +08:00
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " inner tag: proto %X vid %X \n " ,
ntohs ( tags - > vlan_proto ) , tags - > vlan_id ) ;
2014-11-19 21:04:58 +08:00
skb = vlan_insert_tag_set_proto ( skb , tags - > vlan_proto ,
tags - > vlan_id ) ;
2014-05-17 05:20:38 +08:00
if ( ! skb ) {
net_err_ratelimited ( " failed to insert inner VLAN tag \n " ) ;
return ;
}
2014-07-17 23:02:23 +08:00
tags + + ;
2014-05-17 05:20:38 +08:00
}
/* Set the outer tag */
2014-07-17 23:02:23 +08:00
if ( outer_tag - > vlan_id ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " outer tag: proto %X vid %X \n " ,
ntohs ( outer_tag - > vlan_proto ) , outer_tag - > vlan_id ) ;
2014-11-19 21:04:57 +08:00
__vlan_hwaccel_put_tag ( skb , outer_tag - > vlan_proto ,
outer_tag - > vlan_id ) ;
2005-06-27 05:52:20 +08:00
}
2014-07-17 23:02:23 +08:00
xmit :
2005-06-27 05:52:20 +08:00
arp_xmit ( skb ) ;
}
2014-05-17 05:20:38 +08:00
/* Validate the device path between the @start_dev and the @end_dev.
* The path is valid if the @end_dev is reachable through device
* stacking .
* When the path is validated , collect any vlan information in the
* path .
*/
2014-07-17 23:02:23 +08:00
struct bond_vlan_tag * bond_verify_device_path ( struct net_device * start_dev ,
struct net_device * end_dev ,
int level )
2014-05-17 05:20:38 +08:00
{
2014-07-17 23:02:23 +08:00
struct bond_vlan_tag * tags ;
2014-05-17 05:20:38 +08:00
struct net_device * upper ;
struct list_head * iter ;
2014-07-17 23:02:23 +08:00
if ( start_dev = = end_dev ) {
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a, b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kcalloc(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
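The point of the two-argument form is that the count and the element size are multiplied with an overflow check instead of being multiplied by the caller. A minimal userspace sketch of that idea follows; checked_calloc() and unchecked_product() are hypothetical names used only for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the two-factor allocator form:
 * returns a zeroed array of n elements of `size` bytes each, or NULL
 * when n * size would not fit in size_t.
 */
static void *checked_calloc(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the product would wrap around */
	return calloc(n, size);
}

/*
 * The open-coded form computes the product before the allocator sees
 * it, so a wrapped (too small) size is passed on silently.
 */
static size_t unchecked_product(size_t n, size_t size)
{
	return n * size;
}
```

With n = SIZE_MAX / 2 + 1 and size = 4 the open-coded product wraps to 0, while the checked form reports failure by returning NULL.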
2018-06-13 05:03:40 +08:00
tags = kcalloc ( level + 1 , sizeof ( * tags ) , GFP_ATOMIC ) ;
2014-07-17 23:02:23 +08:00
if ( ! tags )
return ERR_PTR ( - ENOMEM ) ;
tags [ level ] . vlan_proto = VLAN_N_VID ;
return tags ;
}
2014-05-17 05:20:38 +08:00
netdev_for_each_upper_dev_rcu ( start_dev , upper , iter ) {
2014-07-17 23:02:23 +08:00
tags = bond_verify_device_path ( upper , end_dev , level + 1 ) ;
if ( IS_ERR_OR_NULL ( tags ) ) {
if ( IS_ERR ( tags ) )
return tags ;
continue ;
2014-05-17 05:20:38 +08:00
}
2014-07-17 23:02:23 +08:00
if ( is_vlan_dev ( upper ) ) {
tags [ level ] . vlan_proto = vlan_dev_vlan_proto ( upper ) ;
tags [ level ] . vlan_id = vlan_dev_vlan_id ( upper ) ;
}
return tags ;
2014-05-17 05:20:38 +08:00
}
2014-07-17 23:02:23 +08:00
return NULL ;
2014-05-17 05:20:38 +08:00
}
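The recursion above allocates the tag array only once the end device is found, sized by the depth reached, and fills one slot per level while unwinding. A simplified userspace sketch of that pattern follows. struct dev_sketch, verify_path_sketch() and TAG_END are hypothetical stand-ins (TAG_END plays the role of the VLAN_N_VID sentinel), and unlike the driver the sketch reports both "no path" and "out of memory" as NULL.

```c
#include <assert.h>
#include <stdlib.h>

#define TAG_END   4096	/* sentinel proto value, stand-in for VLAN_N_VID */
#define MAX_UPPER 4

struct tag_sketch {
	unsigned short proto;
	unsigned short vid;
};

struct dev_sketch {
	int is_vlan;
	unsigned short proto, vid;
	struct dev_sketch *upper[MAX_UPPER];	/* devices stacked on top */
	int n_upper;
};

/*
 * Walk upwards from start until end is reached. The array is allocated
 * at the deepest level (level + 1 slots, last one is the sentinel) and
 * each caller fills in its own slot on the way back.
 */
static struct tag_sketch *verify_path_sketch(struct dev_sketch *start,
					     struct dev_sketch *end,
					     int level)
{
	struct tag_sketch *tags;
	int i;

	assert(start && end && level >= 0);

	if (start == end) {
		tags = calloc((size_t)level + 1, sizeof(*tags));
		if (!tags)
			return NULL;
		tags[level].proto = TAG_END;
		return tags;
	}

	for (i = 0; i < start->n_upper; i++) {
		struct dev_sketch *up = start->upper[i];

		tags = verify_path_sketch(up, end, level + 1);
		if (!tags)
			continue;	/* end not reachable through this upper */
		if (up->is_vlan) {
			tags[level].proto = up->proto;
			tags[level].vid = up->vid;
		}
		return tags;
	}
	return NULL;
}
```

For a stack bond -> vlan 20 -> vlan 200, slot 0 ends up holding the tag closest to the bond (the outer tag), slot 1 the inner tag, and slot 2 the sentinel. The caller owns the returned array and has to free it.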
2005-06-27 05:52:20 +08:00
2005-04-17 06:20:36 +08:00
static void bond_arp_send_all ( struct bonding * bond , struct slave * slave )
{
2005-06-27 05:52:20 +08:00
struct rtable * rt ;
2014-07-17 23:02:23 +08:00
struct bond_vlan_tag * tags ;
2013-08-29 05:25:10 +08:00
__be32 * targets = bond - > params . arp_targets , addr ;
bonding: support QinQ for bond arp interval
The bond sends ARP requests to check that a slave is active. If the ARP target is
reached through a vlan device on top of the bond, the bond sets the vlan tag in the
skb so that the request reaches the right vlan group, but it could only send an skb
with a single 802.1q tag; QinQ was not supported.
So add an outer tag for the lower vlan device and an inner tag for the upper vlan
device to support QinQ.
The new skb will consist of two vlan tags, like this:
dst mac | src mac | outer vlan tag | inner vlan tag | data | .....
If QinQ is not needed, the inner vlan tag can be set to 0 and the outer vlan tag
is then used as a normal vlan group.
Configuring the bond for QinQ with "ip link", followed by the test log:
ip link add link bond0 bond0.20 type vlan proto 802.1ad id 20
ip link add link bond0.20 bond0.20.200 type vlan proto 802.1q id 200
ifconfig bond0.20 11.11.20.36/24
ifconfig bond0.20.200 11.11.200.36/24
echo +11.11.200.37 > /sys/class/net/bond0/bonding/arp_ip_target
90:e2:ba:07:4a:5c (oui Unknown) > Broadcast, ethertype 802.1Q-QinQ (0x88a8),length 50: vlan 20, p 0,ethertype 802.1Q, vlan 200, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 11.11.200.37 tell 11.11.200.36, length 28
90:e2:ba:06:f9:86 (oui Unknown) > 90:e2:ba:07:4a:5c (oui Unknown), ethertype 802.1Q-QinQ (0x88a8), length 50: vlan 20, p 0, ethertype 802.1Q, vlan 200, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 11.11.200.37 is-at 90:e2:ba:06:f9:86 (oui Unknown), length 28
v1->v2: remove the comment "TODO: QinQ?".
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
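The frame layout described above (dst mac | src mac | outer vlan tag | inner vlan tag | data) can be sketched in userspace as follows. This only illustrates the byte layout; it is not how the driver builds the skb, and build_qinq_header() and put_be16() are hypothetical helpers. The priority bits of both tags are left at 0, as in the capture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_ETH_ALEN	6
#define SK_P_8021AD	0x88A8	/* outer (service) tag ethertype */
#define SK_P_8021Q	0x8100	/* inner (customer) tag ethertype */
#define SK_P_ARP	0x0806

/* Store a 16-bit value in network byte order, return the new offset. */
static size_t put_be16(uint8_t *buf, size_t off, uint16_t v)
{
	buf[off] = (uint8_t)(v >> 8);
	buf[off + 1] = (uint8_t)(v & 0xff);
	return off + 2;
}

/*
 * Write dst | src | outer tag | inner tag | ARP ethertype into buf and
 * return the header length. buf must hold at least 22 bytes.
 */
static size_t build_qinq_header(uint8_t *buf, const uint8_t *dst,
				const uint8_t *src, uint16_t outer_vid,
				uint16_t inner_vid)
{
	size_t off = 0;

	assert(buf && dst && src);

	memcpy(buf + off, dst, SK_ETH_ALEN);
	off += SK_ETH_ALEN;
	memcpy(buf + off, src, SK_ETH_ALEN);
	off += SK_ETH_ALEN;
	off = put_be16(buf, off, SK_P_8021AD);
	off = put_be16(buf, off, outer_vid & 0x0fff);	/* TCI, priority 0 */
	off = put_be16(buf, off, SK_P_8021Q);
	off = put_be16(buf, off, inner_vid & 0x0fff);	/* TCI, priority 0 */
	off = put_be16(buf, off, SK_P_ARP);
	return off;
}
```

With outer vid 20 and inner vid 200 the header is 22 bytes long; adding the 28-byte ARP payload gives 50 bytes, which appears consistent with the "length 50" reported in the test log.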
2014-03-25 17:44:43 +08:00
int i ;
2005-04-17 06:20:36 +08:00
2013-08-29 05:25:10 +08:00
for ( i = 0 ; i < BOND_MAX_ARP_TARGETS & & targets [ i ] ; i + + ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond - > dev , slave - > dev , " %s: target %pI4 \n " ,
__func__ , & targets [ i ] ) ;
2014-07-17 23:02:23 +08:00
tags = NULL ;
2005-06-27 05:52:20 +08:00
2013-08-29 05:25:10 +08:00
/* Find out through which dev the packet should go */
2011-03-12 13:00:52 +08:00
rt = ip_route_output ( dev_net ( bond - > dev ) , targets [ i ] , 0 ,
RTO_ONLINK , 0 ) ;
2011-03-03 06:31:35 +08:00
if ( IS_ERR ( rt ) ) {
2014-02-28 19:39:19 +08:00
/* there's no route to target - try to send arp
* probe to generate any traffic ( arp_validate = 0 )
*/
2014-03-25 17:44:44 +08:00
if ( bond - > params . arp_validate )
net_warn_ratelimited ( " %s: no route to arp_ip_target %pI4 and arp_validate is set \n " ,
bond - > dev - > name ,
& targets [ i ] ) ;
2019-06-07 22:59:29 +08:00
bond_arp_send ( slave , ARPOP_REQUEST , targets [ i ] ,
2014-05-17 05:20:38 +08:00
0 , tags ) ;
2005-06-27 05:52:20 +08:00
continue ;
}
2013-08-29 05:25:10 +08:00
/* bond device itself */
if ( rt - > dst . dev = = bond - > dev )
goto found ;
rcu_read_lock ( ) ;
2014-07-17 23:02:23 +08:00
tags = bond_verify_device_path ( bond - > dev , rt - > dst . dev , 0 ) ;
2013-08-29 05:25:10 +08:00
rcu_read_unlock ( ) ;
2005-06-27 05:52:20 +08:00
2014-07-17 23:02:23 +08:00
if ( ! IS_ERR_OR_NULL ( tags ) )
2014-05-17 05:20:38 +08:00
goto found ;
2013-08-29 05:25:10 +08:00
/* Not our device - skip */
2019-06-07 22:59:29 +08:00
slave_dbg ( bond - > dev , slave - > dev , " no path to arp_ip_target %pI4 via rt.dev %s \n " ,
2014-07-16 01:35:58 +08:00
& targets [ i ] , rt - > dst . dev ? rt - > dst . dev - > name : " NULL " ) ;
2013-08-29 05:25:16 +08:00
2005-09-15 05:52:09 +08:00
ip_rt_put ( rt ) ;
2013-08-29 05:25:10 +08:00
continue ;
found :
addr = bond_confirm_addr ( rt - > dst . dev , targets [ i ] , 0 ) ;
ip_rt_put ( rt ) ;
2019-06-07 22:59:29 +08:00
bond_arp_send ( slave , ARPOP_REQUEST , targets [ i ] , addr , tags ) ;
2014-07-25 20:21:21 +08:00
kfree ( tags ) ;
2005-06-27 05:52:20 +08:00
}
}
2007-08-23 08:06:58 +08:00
static void bond_validate_arp ( struct bonding * bond , struct slave * slave , __be32 sip , __be32 tip )
2006-09-23 12:54:53 +08:00
{
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.
One real world example might be:
vlans on top of bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
Any other configuration that needs access to all arp_ip_targets benefits
in the same way.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
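Step 3 above can be sketched as follows: with arp_all_targets set to "all", the effective last-receive time of a slave is the oldest of its per-target timestamps, so one silent target is enough to make the slave look stale. This is a simplified userspace illustration; struct slave_sketch, sketch_slave_last_rx() and SKETCH_MAX_TARGETS are hypothetical names, and the real helper in the driver may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_TARGETS 16	/* stand-in for BOND_MAX_ARP_TARGETS */

struct slave_sketch {
	unsigned long last_rx;
	unsigned long target_last_arp_rx[SKETCH_MAX_TARGETS];
};

/* Wrap-safe "a is earlier than b" comparison for tick counters. */
static int sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * targets is a zero-terminated list of at most SKETCH_MAX_TARGETS
 * addresses. In "any" mode the plain last_rx is returned; in "all"
 * mode the oldest per-target timestamp is returned.
 */
static unsigned long sketch_slave_last_rx(const struct slave_sketch *s,
					  const uint32_t *targets,
					  int all_targets)
{
	unsigned long oldest;
	int i;

	assert(s && targets);

	if (!all_targets || !targets[0])
		return s->last_rx;

	oldest = s->target_last_arp_rx[0];
	for (i = 1; i < SKETCH_MAX_TARGETS && targets[i]; i++)
		if (sketch_time_before(s->target_last_arp_rx[i], oldest))
			oldest = s->target_last_arp_rx[i];
	return oldest;
}
```

With three targets last heard at ticks 100, 50 and 80, "any" mode reports the slave's own last_rx while "all" mode reports 50, the target that has been silent the longest.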
2013-06-24 17:49:34 +08:00
int i ;
2013-06-24 17:49:29 +08:00
if ( ! sip | | ! bond_has_this_ip ( bond , tip ) ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond - > dev , slave - > dev , " %s: sip %pI4 tip %pI4 not found \n " ,
__func__ , & sip , & tip ) ;
2013-06-24 17:49:29 +08:00
return ;
}
2006-09-23 12:54:53 +08:00
2013-06-24 17:49:34 +08:00
i = bond_get_targets_ip ( bond - > params . arp_targets , sip ) ;
if ( i = = - 1 ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond - > dev , slave - > dev , " %s: sip %pI4 not found in targets \n " ,
__func__ , & sip ) ;
2013-06-24 17:49:29 +08:00
return ;
2006-09-23 12:54:53 +08:00
}
2014-02-18 14:48:47 +08:00
slave - > last_rx = jiffies ;
2013-06-24 17:49:34 +08:00
slave - > target_last_arp_rx [ i ] = jiffies ;
2006-09-23 12:54:53 +08:00
}
2013-09-07 06:00:26 +08:00
int bond_arp_rcv ( const struct sk_buff * skb , struct bonding * bond ,
struct slave * slave )
2006-09-23 12:54:53 +08:00
{
2012-06-12 03:23:07 +08:00
struct arphdr * arp = ( struct arphdr * ) skb - > data ;
bonding: Fix ARP monitor validation
The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.
This logic can fail if there is no currently active slave. In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave. The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.
This can arise if the bonding slaves and switch are working, but
the ARP target is not responding. When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave. This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.
Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.
This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.
Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave really works")
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
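The acceptance rules described above, conditions (a) and (b) plus the case added by this patch, can be written as a small decision function. The sketch below reduces the driver state to booleans; struct rx_ctx_sketch and classify_arp_sketch() are hypothetical names. The assumption that case (b) validates with source and target swapped comes from the description that the backup slave sees the original broadcast request rather than a reply.

```c
#include <assert.h>
#include <stdbool.h>

enum arp_trust_sketch {
	ARP_IGNORE,		/* do not validate this ARP */
	ARP_VALIDATE,		/* validate as received (sip, tip) */
	ARP_VALIDATE_SWAPPED	/* validate with sip/tip swapped */
};

struct rx_ctx_sketch {
	bool rx_slave_active;	  /* (a) receiving slave is "active" */
	bool have_active_slave;	  /* there is a currently active slave */
	bool active_got_arp;	  /* ... which got an ARP since going active */
	bool have_arp_slave;	  /* there is a current ARP slave */
	bool is_reply;		  /* the packet is an ARP reply */
	bool probe_sent_recently; /* ARP slave probed in the last interval */
};

static enum arp_trust_sketch
classify_arp_sketch(const struct rx_ctx_sketch *c)
{
	assert(c);

	if (c->rx_slave_active)
		return ARP_VALIDATE;			/* case (a) */
	if (c->have_active_slave && c->active_got_arp)
		return ARP_VALIDATE_SWAPPED;		/* case (b) */
	if (c->have_arp_slave && c->is_reply && c->probe_sent_recently)
		return ARP_VALIDATE;			/* added by this patch */
	return ARP_IGNORE;
}
```

The third branch covers the failure described above: with no active slave, a reply that the switch delivers to a slave other than the one that sent the probe is still accepted, as long as a probe went out during the previous arp_interval.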
2016-02-03 05:35:56 +08:00
struct slave * curr_active_slave , * curr_arp_slave ;
2006-09-23 12:54:53 +08:00
unsigned char * arp_ptr ;
2007-08-23 08:06:58 +08:00
__be32 sip , tip ;
2017-09-27 04:12:28 +08:00
int is_arp = skb - > protocol = = __cpu_to_be16 ( ETH_P_ARP ) ;
unsigned int alen ;
2006-09-23 12:54:53 +08:00
2014-02-18 14:48:42 +08:00
if ( ! slave_do_arp_validate ( bond , slave ) ) {
2014-05-07 22:10:20 +08:00
if ( ( slave_do_arp_validate_only ( bond ) & & is_arp ) | |
! slave_do_arp_validate_only ( bond ) )
2014-02-18 14:48:47 +08:00
slave - > last_rx = jiffies ;
2012-05-09 09:01:40 +08:00
return RX_HANDLER_ANOTHER ;
2014-02-18 14:48:42 +08:00
} else if ( ! is_arp ) {
return RX_HANDLER_ANOTHER ;
}
2013-06-24 17:49:31 +08:00
2012-06-12 03:23:07 +08:00
alen = arp_hdr_len ( bond - > dev ) ;
2006-09-23 12:54:53 +08:00
2019-06-07 22:59:29 +08:00
slave_dbg ( bond - > dev , slave - > dev , " %s: skb->dev %s \n " ,
__func__ , skb - > dev - > name ) ;
2006-09-23 12:54:53 +08:00
2012-06-12 03:23:07 +08:00
if ( alen > skb_headlen ( skb ) ) {
arp = kmalloc ( alen , GFP_ATOMIC ) ;
if ( ! arp )
goto out_unlock ;
if ( skb_copy_bits ( skb , 0 , arp , alen ) < 0 )
goto out_unlock ;
}
2006-09-23 12:54:53 +08:00
2011-04-19 11:48:16 +08:00
if ( arp - > ar_hln ! = bond - > dev - > addr_len | |
2006-09-23 12:54:53 +08:00
skb - > pkt_type = = PACKET_OTHERHOST | |
skb - > pkt_type = = PACKET_LOOPBACK | |
arp - > ar_hrd ! = htons ( ARPHRD_ETHER ) | |
arp - > ar_pro ! = htons ( ETH_P_IP ) | |
arp - > ar_pln ! = 4 )
goto out_unlock ;
arp_ptr = ( unsigned char * ) ( arp + 1 ) ;
2011-04-19 11:48:16 +08:00
arp_ptr + = bond - > dev - > addr_len ;
2006-09-23 12:54:53 +08:00
memcpy ( & sip , arp_ptr , 4 ) ;
2011-04-19 11:48:16 +08:00
arp_ptr + = 4 + bond - > dev - > addr_len ;
2006-09-23 12:54:53 +08:00
memcpy ( & tip , arp_ptr , 4 ) ;
2019-06-07 22:59:29 +08:00
slave_dbg ( bond - > dev , slave - > dev , " %s: %s/%d av %d sv %d sip %pI4 tip %pI4 \n " ,
__func__ , slave - > dev - > name , bond_slave_state ( slave ) ,
bond - > params . arp_validate , slave_do_arp_validate ( bond , slave ) ,
& sip , & tip ) ;
2006-09-23 12:54:53 +08:00
2014-02-20 19:07:57 +08:00
curr_active_slave = rcu_dereference ( bond - > curr_active_slave ) ;
2016-02-03 05:35:56 +08:00
curr_arp_slave = rcu_dereference ( bond - > current_arp_slave ) ;
2014-02-20 19:07:57 +08:00
2016-02-03 05:35:56 +08:00
/* We 'trust' the received ARP enough to validate it if:
bonding: don't trust arp requests unless active slave really works
Currently, if we receive any arp packet on a backup slave in active-backup
mode and arp_validate enabled, we suppose that it's an arp request, swap
source/target ip and try to validate it. This optimization gives us
virtually no downtime in the most common situation (active and backup
slaves are in the same broadcast domain and the active slave failed).
However, if we can't reach the arp_ip_target(s), we end up in an endless
loop of reselecting slaves, because we receive our arp requests, sent by
the active slave, and think that backup slaves are up, thus selecting them
as active and, again, sending arp requests, which fool our backup slaves.
Fix this by not validating the swapped arp packets if the current active
slave didn't receive any arp reply after it was selected as active. This
way we will only accept arp requests if we know that the current active
slave can actually reach arp_ip_target.
v3->v4:
Obey the 80-column limit and make checkpatch.pl happy, per Sergei's suggestion.
v1->v3:
No change.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 17:49:32 +08:00
*
2016-02-03 05:35:56 +08:00
* ( a ) the slave receiving the ARP is active ( which includes the
* current ARP slave , if any ) , or
*
* ( b ) the receiving slave isn ' t active , but there is a currently
* active slave and it received valid arp reply ( s ) after it became
* the currently active slave , or
*
* ( c ) there is an ARP slave that sent an ARP during the prior ARP
* interval , and we receive an ARP reply on any slave . We accept
* these because switch FDB update delays may deliver the ARP
* reply to a slave other than the sender of the ARP request .
*
* Note : for ( b ) , backup slaves are receiving the broadcast ARP
* request , not a reply . This request passes from the sending
* slave through the L2 switch ( es ) to the receiving slave . Since
* this is checking the request , sip / tip are swapped for
* validation .
*
* This is done to avoid endless looping when we can ' t reach the
2013-06-24 17:49:32 +08:00
* arp_ip_target and fool ourselves with our own arp requests .
2006-09-23 12:54:53 +08:00
*/
2011-03-12 11:14:37 +08:00
if ( bond_is_active_slave ( slave ) )
2006-09-23 12:54:53 +08:00
bond_validate_arp ( bond , slave , sip , tip ) ;
2014-02-20 19:07:57 +08:00
else if ( curr_active_slave & &
time_after ( slave_last_rx ( bond , curr_active_slave ) ,
curr_active_slave - > last_link_up ) )
2006-09-23 12:54:53 +08:00
bond_validate_arp ( bond , slave , tip , sip ) ;
2016-02-03 05:35:56 +08:00
else if ( curr_arp_slave & & ( arp - > ar_op = = htons ( ARPOP_REPLY ) ) & &
bond_time_in_interval ( bond ,
dev_trans_start ( curr_arp_slave - > dev ) , 1 ) )
bond_validate_arp ( bond , slave , sip , tip ) ;
2006-09-23 12:54:53 +08:00
out_unlock :
2012-06-12 03:23:07 +08:00
if ( arp ! = ( struct arphdr * ) skb - > data )
kfree ( arp ) ;
2012-05-09 09:01:40 +08:00
return RX_HANDLER_ANOTHER ;
2006-09-23 12:54:53 +08:00
}
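The three acceptance conditions (a), (b) and (c) described in the comment above can be sketched as a pure decision function. This is a minimal user-space illustration, not the driver's code: struct arp_rx_ctx, its boolean fields, and arp_rx_verdict() are hypothetical names that stand in for the checks made with bond_is_active_slave(), slave_last_rx(), last_link_up and bond_time_in_interval().

```c
#include <stdbool.h>

/* Hypothetical outcome of the trust check for one received ARP. */
enum arp_verdict {
	ARP_IGNORE,            /* none of (a), (b), (c) holds */
	ARP_VALIDATE_DIRECT,   /* validate with sip/tip as received */
	ARP_VALIDATE_SWAPPED   /* validate with sip/tip swapped */
};

/* Hypothetical snapshot of the state the real code reads from the bond. */
struct arp_rx_ctx {
	bool rx_slave_active;               /* receiving slave is active */
	bool have_active_slave;             /* curr_active_slave != NULL */
	bool active_rx_since_link_up;       /* active slave got a reply since link up */
	bool have_arp_slave;                /* curr_arp_slave != NULL */
	bool is_arp_reply;                  /* opcode is ARPOP_REPLY */
	bool arp_slave_tx_in_prev_interval; /* ARP slave sent a probe recently */
};

static enum arp_verdict arp_rx_verdict(const struct arp_rx_ctx *c)
{
	/* (a) the receiving slave is active */
	if (c->rx_slave_active)
		return ARP_VALIDATE_DIRECT;

	/* (b) a backup slave sees our own broadcast request, so swap sip/tip */
	if (c->have_active_slave && c->active_rx_since_link_up)
		return ARP_VALIDATE_SWAPPED;

	/* (c) a reply landed on another slave because of switch FDB delay */
	if (c->have_arp_slave && c->is_arp_reply &&
	    c->arp_slave_tx_in_prev_interval)
		return ARP_VALIDATE_DIRECT;

	return ARP_IGNORE;
}
```

The order of the tests matters: as in the if / else if chain above, (c) is only consulted when neither (a) nor (b) applies.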
2013-08-03 09:50:36 +08:00
/* function to verify if we're in the arp_interval timeslice, returns true if
* ( last_act - arp_interval ) < = jiffies < = ( last_act + mod * arp_interval +
* arp_interval / 2 ) . the arp_interval / 2 is needed for really fast networks .
*/
static bool bond_time_in_interval ( struct bonding * bond , unsigned long last_act ,
int mod )
{
int delta_in_ticks = msecs_to_jiffies ( bond - > params . arp_interval ) ;
return time_in_range ( jiffies ,
last_act - delta_in_ticks ,
last_act + mod * delta_in_ticks + delta_in_ticks / 2 ) ;
}
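The window test in bond_time_in_interval() can be tried outside the kernel with a small sketch. The helpers below are hypothetical user-space stand-ins modelled on the kernel's time_after_eq() and time_in_range() macros; ticks_t and in_arp_window() are names invented for this illustration, and the interval is passed in ticks rather than read from bond->params.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a jiffies value. */
typedef unsigned long ticks_t;

/* Wraparound-safe "a is at or after b": compare the signed difference. */
static bool ticks_after_eq(ticks_t a, ticks_t b)
{
	return (long)(a - b) >= 0;
}

/* True when lo <= now <= hi, even if the counter wrapped in between. */
static bool ticks_in_range(ticks_t now, ticks_t lo, ticks_t hi)
{
	return ticks_after_eq(now, lo) && ticks_after_eq(hi, now);
}

/* Same window as bond_time_in_interval():
 * [last_act - interval, last_act + mod * interval + interval / 2]
 */
static bool in_arp_window(ticks_t now, ticks_t last_act,
			  ticks_t interval, int mod)
{
	return ticks_in_range(now,
			      last_act - interval,
			      last_act + (ticks_t)mod * interval + interval / 2);
}
```

With last_act = 1000, interval = 100 and mod = 1, the window is [900, 1150]. The signed-difference comparison is what keeps the test correct when the tick counter wraps around.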
2014-09-15 23:19:34 +08:00
/* This function is called regularly to monitor each slave's link
2005-04-17 06:20:36 +08:00
* ensuring that traffic is being sent and received when arp monitoring
* is used in load - balancing mode . if the adapter has been dormant , then an
* arp is transmitted to generate traffic . see activebackup_arp_monitor for
* arp monitoring in active backup mode .
*/
2017-03-09 02:55:51 +08:00
static void bond_loadbalance_arp_mon ( struct bonding * bond )
2005-04-17 06:20:36 +08:00
{
struct slave * slave , * oldcurrent ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2014-01-28 11:48:53 +08:00
int do_failover = 0 , slave_state_changed = 0 ;
2005-04-17 06:20:36 +08:00
2013-10-28 12:11:22 +08:00
if ( ! bond_has_slaves ( bond ) )
2005-04-17 06:20:36 +08:00
goto re_arm ;
2013-12-13 10:19:50 +08:00
rcu_read_lock ( ) ;
2014-07-15 21:56:55 +08:00
oldcurrent = rcu_dereference ( bond - > curr_active_slave ) ;
2005-04-17 06:20:36 +08:00
/* see if any of the previous devices are up now (i.e. they have
* xmt and rcv traffic ) . the curr_active_slave does not come into
2014-02-18 14:48:46 +08:00
* the picture unless it is null . also , slave - > last_link_up is not
* needed here because we send an arp on each slave and give a slave
* as long as it needs to get the tx / rx within the delta .
2005-04-17 06:20:36 +08:00
* TODO : what about up / down delay in arp mode ? it wasn ' t here before
* so it can wait
*/
2013-12-13 10:19:50 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2010-09-02 13:45:54 +08:00
unsigned long trans_start = dev_trans_start ( slave - > dev ) ;
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_NOCHANGE ) ;
bonding: Don't update slave->link until ready to commit
In the loadbalance arp monitoring scheme, when a slave link change is
detected, the slave->link is immediately updated and slave_state_changed
is set. Later down the function, the rtnl_lock is acquired and the
changes are committed, updating the bond link state.
However, the acquisition of the rtnl_lock can fail. The next time the
monitor runs, since slave->link is already updated, it determines that
link is unchanged. This results in the bond link state permanently out
of sync with the slave link.
This patch modifies bond_loadbalance_arp_mon() to handle link changes
identical to bond_ab_arp_{inspect/commit}(). The new link state is
maintained in slave->new_link until we're ready to commit at which point
it's copied into slave->link.
NOTE: miimon_{inspect/commit}() has a more complex state machine
requiring the use of the bond_{propose,commit}_link_state() functions
which maintains the intermediate state in slave->link_new_state. The arp
monitors don't require that.
Testing: This bug is very easy to reproduce with the following steps.
1. In a loop, toggle a slave link of a bond slave interface.
2. In a separate loop, do ifconfig up/down of an unrelated interface to
create contention for rtnl_lock.
Within a few iterations, the bond link goes out of sync with the slave
link.
Signed-off-by: Nithin Nayak Sujir <nsujir@tintri.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 10:45:17 +08:00
2005-04-17 06:20:36 +08:00
if ( slave - > link ! = BOND_LINK_UP ) {
2013-08-03 09:50:36 +08:00
if ( bond_time_in_interval ( bond , trans_start , 1 ) & &
2014-02-18 14:48:47 +08:00
bond_time_in_interval ( bond , slave - > last_rx , 1 ) ) {
2005-04-17 06:20:36 +08:00
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_UP ) ;
2014-01-28 11:48:53 +08:00
slave_state_changed = 1 ;
2005-04-17 06:20:36 +08:00
/* primary_slave has no meaning in round-robin
* mode . the window of a slave being up and
* curr_active_slave being null after enslaving
* is closed .
*/
if ( ! oldcurrent ) {
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status definitely up \n " ) ;
2005-04-17 06:20:36 +08:00
do_failover = 1 ;
} else {
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " interface is now up \n " ) ;
2005-04-17 06:20:36 +08:00
}
}
} else {
/* slave->link == BOND_LINK_UP */
/* not all switches will respond to an arp request
* when the source ip is 0 , so don ' t take the link down
* if we don ' t know our ip yet
*/
2013-08-03 09:50:36 +08:00
if ( ! bond_time_in_interval ( bond , trans_start , 2 ) | |
2014-02-18 14:48:47 +08:00
! bond_time_in_interval ( bond , slave - > last_rx , 2 ) ) {
2005-04-17 06:20:36 +08:00
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_DOWN ) ;
2014-01-28 11:48:53 +08:00
slave_state_changed = 1 ;
2005-04-17 06:20:36 +08:00
2009-06-13 03:02:48 +08:00
if ( slave - > link_failure_count < UINT_MAX )
2005-04-17 06:20:36 +08:00
slave - > link_failure_count + + ;
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " interface is now down \n " ) ;
2005-04-17 06:20:36 +08:00
2009-06-13 03:02:48 +08:00
if ( slave = = oldcurrent )
2005-04-17 06:20:36 +08:00
do_failover = 1 ;
}
}
/* note: if switch is in round-robin mode, all links
* must tx arp to ensure all links rx an arp - otherwise
* links may oscillate or not come up at all ; if switch is
* in something like xor mode , there is nothing we can
* do - all replies will be rx ' ed on same link causing slaves
* to be unstable during low / no traffic periods
*/
2014-05-16 03:39:57 +08:00
if ( bond_slave_is_up ( slave ) )
2005-04-17 06:20:36 +08:00
bond_arp_send_all ( bond , slave ) ;
}
2013-12-13 10:19:50 +08:00
rcu_read_unlock ( ) ;
2014-01-28 11:48:53 +08:00
if ( do_failover | | slave_state_changed ) {
2013-12-13 10:19:50 +08:00
if ( ! rtnl_trylock ( ) )
goto re_arm ;
2005-04-17 06:20:36 +08:00
2017-05-25 10:45:17 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2019-11-02 12:56:42 +08:00
if ( slave - > link_new_state ! = BOND_LINK_NOCHANGE )
slave - > link = slave - > link_new_state ;
2017-05-25 10:45:17 +08:00
}
2014-01-28 11:48:53 +08:00
if ( slave_state_changed ) {
bond_slave_state_change ( bond ) ;
2014-10-05 08:45:01 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_XOR )
bond_update_slave_arr ( bond , NULL ) ;
2014-11-18 22:14:44 +08:00
}
if ( do_failover ) {
2014-01-28 11:48:53 +08:00
block_netpoll_tx ( ) ;
bond_select_active_slave ( bond ) ;
unblock_netpoll_tx ( ) ;
}
2013-12-13 10:19:50 +08:00
rtnl_unlock ( ) ;
2005-04-17 06:20:36 +08:00
}
re_arm :
bonding: eliminate bond_close race conditions
This patch resolves two sets of race conditions.
Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> reported the
first, as follows:
The bond_close() calls cancel_delayed_work() to cancel delayed works.
It, however, cannot cancel works that were already queued in workqueue.
The bond_open() initializes work->data, and process_one_work() refers to
get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
work->data has been initialized. Thus, a panic occurs.
He included a patch that converted the cancel_delayed_work calls
in bond_close to flush_delayed_work_sync, which eliminated the above
problem.
His patch is incorporated, at least in principle, into this
patch. In this patch, we use cancel_delayed_work_sync in place of
flush_delayed_work_sync, and also convert bond_uninit in addition to
bond_close.
This conversion to _sync, however, opens new races between
bond_close and three periodically executing workqueue functions:
bond_mii_monitor, bond_alb_monitor and bond_activebackup_arp_mon.
The race occurs because bond_close and bond_uninit are always
called with RTNL held, and these workqueue functions may acquire RTNL to
perform failover-related activities. If bond_close or bond_uninit is
waiting in cancel_delayed_work_sync, deadlock occurs.
These deadlocks are resolved by having the workqueue functions
acquire RTNL conditionally. If the rtnl_trylock() fails, the functions
reschedule and return immediately. For the cases that are attempting to
perform link failover, a delay of 1 is used; for the other cases, the
normal interval is used (as those activities are not as time critical).
Additionally, the bond_mii_monitor function now stores the delay
in a variable (mimicking the structure of activebackup_arp_mon).
Lastly, all of the above renders the kill_timers sentinel moot,
and therefore it has been removed.
Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-28 23:42:50 +08:00
if ( bond - > params . arp_interval )
2013-08-03 09:50:36 +08:00
queue_delayed_work ( bond - > wq , & bond - > arp_work ,
msecs_to_jiffies ( bond - > params . arp_interval ) ) ;
2005-04-17 06:20:36 +08:00
}
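The propose-then-commit scheme used by bond_loadbalance_arp_mon() can be illustrated with a toy model. This is a sketch under stated assumptions, not the driver's code: struct toy_slave, the enum values, toy_inspect() and toy_commit() are hypothetical, and the lock_acquired flag stands in for the result of rtnl_trylock().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical link states; the numeric values are arbitrary here. */
enum link_state { LINK_NOCHANGE, LINK_UP, LINK_DOWN };

struct toy_slave {
	enum link_state link;           /* committed state */
	enum link_state link_new_state; /* proposal from the inspect pass */
};

/* Inspect pass: record what was observed without touching ->link.
 * Returns the number of slaves whose state should change.
 */
static int toy_inspect(struct toy_slave *s, size_t n, const bool *alive)
{
	int changes = 0;

	for (size_t i = 0; i < n; i++) {
		enum link_state seen = alive[i] ? LINK_UP : LINK_DOWN;

		s[i].link_new_state = LINK_NOCHANGE;
		if (seen != s[i].link) {
			s[i].link_new_state = seen;
			changes++;
		}
	}
	return changes;
}

/* Commit pass: copies proposals into ->link only when the lock was
 * obtained. Returns false when the caller has to retry on a later run.
 */
static bool toy_commit(struct toy_slave *s, size_t n, bool lock_acquired)
{
	if (!lock_acquired)
		return false;

	for (size_t i = 0; i < n; i++)
		if (s[i].link_new_state != LINK_NOCHANGE)
			s[i].link = s[i].link_new_state;
	return true;
}
```

Because a failed lock attempt leaves ->link untouched, the next inspect pass sees the same difference again and proposes the same change, which is the property the "Don't update slave->link until ready to commit" change relies on.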
2014-09-15 23:19:34 +08:00
/* Called to inspect slaves for active-backup mode ARP monitor link state
2019-11-02 12:56:42 +08:00
* changes . Sets proposed link state in slaves to specify what action
* should take place for the slave . Returns 0 if no changes are found , > 0
* if changes to link states must be committed .
2008-05-18 12:10:13 +08:00
*
2014-09-12 04:49:28 +08:00
* Called with rcu_read_lock held .
2005-04-17 06:20:36 +08:00
*/
2013-08-03 09:50:36 +08:00
static int bond_ab_arp_inspect ( struct bonding * bond )
2005-04-17 06:20:36 +08:00
{
2013-08-03 09:50:35 +08:00
unsigned long trans_start , last_rx ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2013-08-01 22:54:47 +08:00
struct slave * slave ;
int commit = 0 ;
2012-08-30 20:02:47 +08:00
2013-12-13 10:20:02 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_NOCHANGE ) ;
2013-08-03 09:50:35 +08:00
last_rx = slave_last_rx ( bond , slave ) ;
2005-04-17 06:20:36 +08:00
2008-05-18 12:10:13 +08:00
if ( slave - > link ! = BOND_LINK_UP ) {
2013-08-03 09:50:36 +08:00
if ( bond_time_in_interval ( bond , last_rx , 1 ) ) {
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_UP ) ;
2008-05-18 12:10:13 +08:00
commit + + ;
}
continue ;
}
2005-04-17 06:20:36 +08:00
2014-09-15 23:19:34 +08:00
/* Give slaves 2*delta after being enslaved or made
2008-05-18 12:10:13 +08:00
* active . This avoids bouncing , as the last receive
* times need a full ARP monitor cycle to be updated .
*/
2014-02-18 14:48:46 +08:00
if ( bond_time_in_interval ( bond , slave - > last_link_up , 2 ) )
2008-05-18 12:10:13 +08:00
continue ;
2014-09-15 23:19:34 +08:00
/* Backup slave is down if:
2008-05-18 12:10:13 +08:00
* - No current_arp_slave AND
* - more than 3 * delta since last receive AND
* - the bond has an IP address
*
* Note : a non - null current_arp_slave indicates
* the curr_active_slave went down and we are
* searching for a new one ; under this condition
* we only take the curr_active_slave down - this
* gives each slave a chance to tx / rx traffic
* before being taken out
*/
2011-03-12 11:14:37 +08:00
if ( ! bond_is_active_slave ( slave ) & &
2014-07-15 21:56:56 +08:00
! rcu_access_pointer ( bond - > current_arp_slave ) & &
2013-08-03 09:50:36 +08:00
! bond_time_in_interval ( bond , last_rx , 3 ) ) {
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_DOWN ) ;
2008-05-18 12:10:13 +08:00
commit + + ;
}
2014-09-15 23:19:34 +08:00
/* Active slave is down if:
2008-05-18 12:10:13 +08:00
* - more than 2 * delta since transmitting OR
* - ( more than 2 * delta since receive AND
* the bond has an IP address )
*/
2010-09-02 13:45:54 +08:00
trans_start = dev_trans_start ( slave - > dev ) ;
2011-03-12 11:14:37 +08:00
if ( bond_is_active_slave ( slave ) & &
2013-08-03 09:50:36 +08:00
( ! bond_time_in_interval ( bond , trans_start , 2 ) | |
! bond_time_in_interval ( bond , last_rx , 2 ) ) ) {
2019-11-02 12:56:42 +08:00
bond_propose_link_state ( slave , BOND_LINK_DOWN ) ;
2008-05-18 12:10:13 +08:00
commit + + ;
}
2005-04-17 06:20:36 +08:00
}
2008-05-18 12:10:13 +08:00
return commit ;
}
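The per-slave rules applied by bond_ab_arp_inspect() can be condensed into a small decision function. This is a minimal sketch that follows the conditions in the code as shown (which do not test for an IP address): struct ab_inputs, its fields, and ab_inspect_one() are hypothetical names, and each boolean stands for one bond_time_in_interval() or state check.

```c
#include <stdbool.h>

/* Hypothetical result of inspecting one slave. */
enum proposal { PROPOSE_NOCHANGE, PROPOSE_UP, PROPOSE_DOWN };

/* Hypothetical snapshot of the checks made for one slave. */
struct ab_inputs {
	bool link_up;          /* slave->link == BOND_LINK_UP */
	bool rx_within_1;      /* last_rx inside the 1 * delta window */
	bool rx_within_2;      /* last_rx inside the 2 * delta window */
	bool rx_within_3;      /* last_rx inside the 3 * delta window */
	bool tx_within_2;      /* trans_start inside the 2 * delta window */
	bool link_up_within_2; /* last_link_up inside the 2 * delta window */
	bool is_active;        /* bond_is_active_slave(slave) */
	bool have_arp_slave;   /* current_arp_slave != NULL */
};

static enum proposal ab_inspect_one(const struct ab_inputs *in)
{
	/* A slave that is not up comes up after recent receive traffic. */
	if (!in->link_up)
		return in->rx_within_1 ? PROPOSE_UP : PROPOSE_NOCHANGE;

	/* Grace period after being enslaved or made active. */
	if (in->link_up_within_2)
		return PROPOSE_NOCHANGE;

	/* Backup slave: down only when no ARP slave search is running. */
	if (!in->is_active && !in->have_arp_slave && !in->rx_within_3)
		return PROPOSE_DOWN;

	/* Active slave: down when either tx or rx has gone quiet. */
	if (in->is_active && (!in->tx_within_2 || !in->rx_within_2))
		return PROPOSE_DOWN;

	return PROPOSE_NOCHANGE;
}
```

In the driver the two down checks are evaluated one after the other rather than as early returns; since they are mutually exclusive on is_active, the outcome for a single slave is the same.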
2005-04-17 06:20:36 +08:00
2014-09-15 23:19:34 +08:00
/* Called to commit link state changes noted by inspection step of
2008-05-18 12:10:13 +08:00
* active - backup mode ARP monitor .
*
2013-12-13 10:20:02 +08:00
* Called with RTNL held .
2008-05-18 12:10:13 +08:00
*/
2013-08-03 09:50:36 +08:00
static void bond_ab_arp_commit ( struct bonding * bond )
2008-05-18 12:10:13 +08:00
{
2010-09-02 13:45:54 +08:00
unsigned long trans_start ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2013-08-01 22:54:47 +08:00
struct slave * slave ;
2005-04-17 06:20:36 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2019-11-02 12:56:42 +08:00
switch ( slave - > link_new_state ) {
2008-05-18 12:10:13 +08:00
case BOND_LINK_NOCHANGE :
continue ;
2006-03-28 05:27:43 +08:00
2008-05-18 12:10:13 +08:00
case BOND_LINK_UP :
2010-09-02 13:45:54 +08:00
trans_start = dev_trans_start ( slave - > dev ) ;
2014-07-15 21:56:55 +08:00
if ( rtnl_dereference ( bond - > curr_active_slave ) ! = slave | |
( ! rtnl_dereference ( bond - > curr_active_slave ) & &
2013-08-03 09:50:36 +08:00
bond_time_in_interval ( bond , trans_start , 1 ) ) ) {
2014-07-15 21:56:56 +08:00
struct slave * current_arp_slave ;
current_arp_slave = rtnl_dereference ( bond - > current_arp_slave ) ;
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( slave , BOND_LINK_UP ,
BOND_SLAVE_NOTIFY_NOW ) ;
2014-07-15 21:56:56 +08:00
if ( current_arp_slave ) {
2012-04-05 11:47:43 +08:00
bond_set_slave_inactive_flags (
2014-07-15 21:56:56 +08:00
current_arp_slave ,
bonding: Fix RTNL: assertion failed at net/core/rtnetlink.c for 802.3ad mode
The problem was introduced by the commit 1d3ee88ae0d
(bonding: add netlink attributes to slave link dev).
The bond_set_active_slave() and bond_set_backup_slave()
will use rtmsg_ifinfo to send slave's states, so these
two functions should be called in RTNL.
In 802.3ad mode, acquiring RTNL for the __enable_port and
__disable_port cases is difficult, as those calls generally
already hold the state machine lock, and cannot unconditionally
call rtnl_lock because either they already hold RTNL (for calls
via bond_3ad_unbind_slave) or due to the potential for deadlock
with bond_3ad_adapter_speed_changed, bond_3ad_adapter_duplex_changed,
bond_3ad_link_change, or bond_3ad_update_lacp_rate. All four of
those are called with RTNL held, and acquire the state machine lock
second. The calling contexts for __enable_port and __disable_port
already hold the state machine lock, and may or may not need RTNL.
Following Jay's suggestion, it is not a problem if the slave does not send
the notify message synchronously when its status changes. The state
machine normally runs every 100 ms, so sending the notify message at the
end of the state machine run, if the slave's state changed, is the better
approach.
I fix the problem through these steps:
1). Add a new function bond_set_slave_state(), which changes the slave's
state and calls rtmsg_ifinfo() according to the input parameter named
notify.
2). Add a new slave parameter called should_notify. If the slave's state
changed and no notification has been sent yet, the parameter is set to 1;
if the slave's state then changes again, the parameter is set to 0, which
indicates that the slave's state has been restored and nobody needs to be
notified.
3). __enable_port and __disable_port should not call rtmsg_ifinfo under
the state machine lock. Any change in the state of a slave sets a flag in
the slave, indicating that rtmsg_ifinfo should be called at the end of the
state machine.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 11:05:22 +08:00
BOND_SLAVE_NOTIFY_NOW ) ;
2014-07-15 21:56:56 +08:00
RCU_INIT_POINTER ( bond - > current_arp_slave , NULL ) ;
2012-04-05 11:47:43 +08:00
}
2008-05-18 12:10:13 +08:00
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status definitely up \n " ) ;
2008-05-18 12:10:13 +08:00
2014-07-15 21:56:55 +08:00
if ( ! rtnl_dereference ( bond - > curr_active_slave ) | |
2014-09-10 05:17:00 +08:00
slave = = rtnl_dereference ( bond - > primary_slave ) )
2009-08-31 19:09:38 +08:00
goto do_failover ;
2005-04-17 06:20:36 +08:00
2009-08-31 19:09:38 +08:00
}
2005-04-17 06:20:36 +08:00
2009-08-31 19:09:38 +08:00
continue ;
2005-04-17 06:20:36 +08:00
2008-05-18 12:10:13 +08:00
case BOND_LINK_DOWN :
if ( slave - > link_failure_count < UINT_MAX )
slave - > link_failure_count + + ;
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( slave , BOND_LINK_DOWN ,
BOND_SLAVE_NOTIFY_NOW ) ;
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( slave ,
BOND_SLAVE_NOTIFY_NOW ) ;
2008-05-18 12:10:13 +08:00
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " link status definitely down, disabling slave \n " ) ;
2008-05-18 12:10:13 +08:00
2014-07-15 21:56:55 +08:00
if ( slave = = rtnl_dereference ( bond - > curr_active_slave ) ) {
2014-07-15 21:56:56 +08:00
RCU_INIT_POINTER ( bond - > current_arp_slave , NULL ) ;
2009-08-31 19:09:38 +08:00
goto do_failover ;
2005-04-17 06:20:36 +08:00
}
2009-08-31 19:09:38 +08:00
continue ;
2008-05-18 12:10:13 +08:00
default :
2019-11-02 12:56:42 +08:00
slave_err ( bond - > dev , slave - > dev ,
" impossible: link_new_state %d on slave \n " ,
slave - > link_new_state ) ;
2009-08-31 19:09:38 +08:00
continue ;
2005-04-17 06:20:36 +08:00
}
2009-08-31 19:09:38 +08:00
do_failover :
2010-10-14 00:01:50 +08:00
block_netpoll_tx ( ) ;
2009-08-31 19:09:38 +08:00
bond_select_active_slave ( bond ) ;
2010-10-14 00:01:50 +08:00
unblock_netpoll_tx ( ) ;
2008-05-18 12:10:13 +08:00
}
2005-04-17 06:20:36 +08:00
2008-05-18 12:10:13 +08:00
bond_set_carrier ( bond ) ;
}
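The difference between BOND_SLAVE_NOTIFY_NOW and BOND_SLAVE_NOTIFY_LATER, as described in the "Fix RTNL" change, can be sketched with a toy model of the should_notify flag. The names struct toy_notify_slave, toy_set_state() and toy_flush() are hypothetical, and the notifications counter stands in for the rtmsg_ifinfo() call.

```c
#include <stdbool.h>

/* Hypothetical slave with a deferred-notification flag. */
struct toy_notify_slave {
	bool backup;        /* current state: backup or active */
	bool should_notify; /* a notification is owed */
	int notifications;  /* stand-in for messages actually sent */
};

/* Change the state; notify immediately or remember that one is owed. */
static void toy_set_state(struct toy_notify_slave *s, bool backup,
			  bool notify_now)
{
	if (s->backup == backup)
		return;

	s->backup = backup;
	if (notify_now) {
		s->notifications++;
		s->should_notify = false;
	} else {
		/* A second deferred change restores the old state, so the
		 * pending notification is cancelled instead of doubled.
		 */
		s->should_notify = !s->should_notify;
	}
}

/* Called later, in a context where sending is allowed. */
static void toy_flush(struct toy_notify_slave *s)
{
	if (s->should_notify) {
		s->notifications++;
		s->should_notify = false;
	}
}
```

In this model a slave that flips to backup and back again with deferred notification ends up owing nothing, which matches step 2 of the commit message.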
2005-04-17 06:20:36 +08:00
2014-09-15 23:19:34 +08:00
/* Send ARP probes for active-backup mode ARP monitor.
2014-02-26 11:05:23 +08:00
*
2014-09-12 04:49:28 +08:00
* Called with rcu_read_lock held .
2008-05-18 12:10:13 +08:00
*/
2014-01-27 21:37:32 +08:00
static bool bond_ab_arp_probe ( struct bonding * bond )
2008-05-18 12:10:13 +08:00
{
2013-12-13 10:20:02 +08:00
struct slave * slave , * before = NULL , * new_slave = NULL ,
2014-02-26 11:05:23 +08:00
* curr_arp_slave = rcu_dereference ( bond - > current_arp_slave ) ,
* curr_active_slave = rcu_dereference ( bond - > curr_active_slave ) ;
2013-09-25 15:20:19 +08:00
struct list_head * iter ;
bool found = false ;
2014-02-26 11:05:23 +08:00
bool should_notify_rtnl = BOND_SLAVE_NOTIFY_LATER ;
2014-01-27 21:37:32 +08:00
2014-01-27 21:37:31 +08:00
if ( curr_arp_slave & & curr_active_slave )
2014-07-16 01:35:58 +08:00
netdev_info ( bond - > dev , " PROBE: c_arp %s && cas %s BAD \n " ,
curr_arp_slave - > dev - > name ,
curr_active_slave - > dev - > name ) ;
2005-04-17 06:20:36 +08:00
2014-01-27 21:37:31 +08:00
if ( curr_active_slave ) {
bond_arp_send_all ( bond , curr_active_slave ) ;
2014-02-26 11:05:23 +08:00
return should_notify_rtnl ;
2008-05-18 12:10:13 +08:00
}
2005-04-17 06:20:36 +08:00
2008-05-18 12:10:13 +08:00
/* if we don't have a curr_active_slave, search for the next available
* backup slave from the current_arp_slave and make it the candidate
* for becoming the curr_active_slave
*/
2005-04-17 06:20:36 +08:00
2013-12-13 10:20:02 +08:00
if ( ! curr_arp_slave ) {
2014-02-26 11:05:23 +08:00
curr_arp_slave = bond_first_slave_rcu ( bond ) ;
if ( ! curr_arp_slave )
return should_notify_rtnl ;
2008-05-18 12:10:13 +08:00
}
2005-04-17 06:20:36 +08:00
2014-02-26 11:05:23 +08:00
bond_set_slave_inactive_flags ( curr_arp_slave , BOND_SLAVE_NOTIFY_LATER ) ;
2007-10-18 08:37:49 +08:00
2014-02-26 11:05:23 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2014-05-16 03:39:57 +08:00
if ( ! found & & ! before & & bond_slave_is_up ( slave ) )
2013-09-25 15:20:19 +08:00
before = slave ;
2005-04-17 06:20:36 +08:00
2014-05-16 03:39:57 +08:00
if ( found & & ! new_slave & & bond_slave_is_up ( slave ) )
2013-09-25 15:20:19 +08:00
new_slave = slave ;
2008-05-18 12:10:13 +08:00
/* if the link state is up at this point, we
* mark it down - this can happen if we have
* simultaneous link failures and
* reselect_active_interface doesn ' t make this
* one the current slave so it is still marked
* up when it is actually down
2005-04-17 06:20:36 +08:00
*/
2014-05-16 03:39:57 +08:00
if ( ! bond_slave_is_up ( slave ) & & slave - > link = = BOND_LINK_UP ) {
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( slave , BOND_LINK_DOWN ,
BOND_SLAVE_NOTIFY_LATER ) ;
2008-05-18 12:10:13 +08:00
if ( slave - > link_failure_count < UINT_MAX )
slave - > link_failure_count + + ;
2005-04-17 06:20:36 +08:00
bonding: Fix RTNL: assertion failed at net/core/rtnetlink.c for 802.3ad mode
The problem was introduced by the commit 1d3ee88ae0d
(bonding: add netlink attributes to slave link dev).
The bond_set_active_slave() and bond_set_backup_slave()
will use rtmsg_ifinfo to send slave's states, so these
two functions should be called in RTNL.
In 802.3ad mode, acquiring RTNL for the __enable_port and
__disable_port cases is difficult, as those calls generally
already hold the state machine lock, and cannot unconditionally
call rtnl_lock because either they already hold RTNL (for calls
via bond_3ad_unbind_slave) or due to the potential for deadlock
with bond_3ad_adapter_speed_changed, bond_3ad_adapter_duplex_changed,
bond_3ad_link_change, or bond_3ad_update_lacp_rate. All four of
those are called with RTNL held, and acquire the state machine lock
second. The calling contexts for __enable_port and __disable_port
already hold the state machine lock, and may or may not need RTNL.
Following Jay's opinion, I don't think it is a problem if the slave does
not send the notify message synchronously when its status changes. The
state machine normally runs every 100 ms, so sending the notify message
at the end of the state machine run, if the slave's state changed, should
be better.
I fix the problem through these steps:
1). Add a new function bond_set_slave_state() which changes the slave's
state and calls rtmsg_ifinfo() according to the input parameter called
notify.
2). Add a new slave field called should_notify. If the slave's state
changed and has not been notified yet, the field is set to 1; if the
slave's state then changes again, the field is set back to 0, which
indicates that the slave's state has been restored and nobody needs to
be notified.
3). __enable_port and __disable_port should not call rtmsg_ifinfo under
the state machine lock. Any change in the state of a slave sets a flag in
the slave, which indicates that rtmsg_ifinfo should be called at the end
of the state machine run.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 11:05:22 +08:00
bond_set_slave_inactive_flags ( slave ,
2014-02-26 11:05:23 +08:00
BOND_SLAVE_NOTIFY_LATER ) ;
2008-05-18 12:10:13 +08:00
2019-06-07 22:59:29 +08:00
slave_info ( bond - > dev , slave - > dev , " backup interface is now down \n " ) ;
2005-04-17 06:20:36 +08:00
}
2013-12-13 10:20:02 +08:00
if ( slave = = curr_arp_slave )
2013-09-25 15:20:19 +08:00
found = true ;
2008-05-18 12:10:13 +08:00
}
2013-09-25 15:20:19 +08:00
if ( ! new_slave & & before )
new_slave = before ;
2014-02-26 11:05:23 +08:00
if ( ! new_slave )
goto check_state ;
2013-09-25 15:20:19 +08:00
2015-12-03 19:12:19 +08:00
bond_set_slave_link_state ( new_slave , BOND_LINK_BACK ,
BOND_SLAVE_NOTIFY_LATER ) ;
2014-02-26 11:05:23 +08:00
bond_set_slave_active_flags ( new_slave , BOND_SLAVE_NOTIFY_LATER ) ;
2013-09-25 15:20:19 +08:00
bond_arp_send_all ( bond , new_slave ) ;
2014-02-18 14:48:46 +08:00
new_slave - > last_link_up = jiffies ;
2013-12-13 10:20:02 +08:00
rcu_assign_pointer ( bond - > current_arp_slave , new_slave ) ;
2014-01-27 21:37:32 +08:00
2014-02-26 11:05:23 +08:00
check_state :
bond_for_each_slave_rcu ( bond , slave , iter ) {
2015-12-03 19:12:19 +08:00
if ( slave - > should_notify | | slave - > should_notify_link ) {
2014-02-26 11:05:23 +08:00
should_notify_rtnl = BOND_SLAVE_NOTIFY_NOW ;
break ;
}
}
return should_notify_rtnl ;
2008-05-18 12:10:13 +08:00
}
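The candidate search in bond_ab_arp_probe() above is a single pass with wrap-around: it prefers the first usable slave after the current ARP slave and otherwise falls back to the first usable slave at or before it. The following is a minimal user-space sketch of that bookkeeping only. `struct probe_slave` and `probe_next()` are hypothetical names, the `up` flag stands in for bond_slave_is_up(), and the link-state fixups done in the real loop are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct slave: only the "is up" flag matters here. */
struct probe_slave {
	bool up;
};

/*
 * Return the index of the slave to probe next, or -1 if no slave is up.
 * Mirrors the before/new_slave/found variables of the loop above:
 *   before    - first up slave seen at or before the current one
 *   new_slave - first up slave seen after the current one
 */
static int probe_next(const struct probe_slave *slaves, size_t n, size_t cur)
{
	int before = -1, new_slave = -1;
	bool found = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!found && before < 0 && slaves[i].up)
			before = (int)i;
		if (found && new_slave < 0 && slaves[i].up)
			new_slave = (int)i;
		if (i == cur)
			found = true;
	}

	/* Wrap around when nothing after the current slave is up. */
	return new_slave >= 0 ? new_slave : before;
}
```

Because `found` is only set at the end of the iteration that visits the current slave, the current slave itself can be chosen again when it is the only one that is up.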
2005-04-17 06:20:36 +08:00
2017-03-09 02:55:51 +08:00
static void bond_activebackup_arp_mon ( struct bonding * bond )
2008-05-18 12:10:13 +08:00
{
2014-02-26 11:05:23 +08:00
bool should_notify_peers = false ;
bool should_notify_rtnl = false ;
2013-10-28 12:11:22 +08:00
int delta_in_ticks ;
2005-04-17 06:20:36 +08:00
2013-10-28 12:11:22 +08:00
delta_in_ticks = msecs_to_jiffies ( bond - > params . arp_interval ) ;
if ( ! bond_has_slaves ( bond ) )
2008-05-18 12:10:13 +08:00
goto re_arm ;
2013-12-13 10:20:02 +08:00
rcu_read_lock ( ) ;
2014-02-26 11:05:23 +08:00
2011-04-26 23:25:52 +08:00
should_notify_peers = bond_should_notify_peers ( bond ) ;
2014-02-26 11:05:23 +08:00
if ( bond_ab_arp_inspect ( bond ) ) {
rcu_read_unlock ( ) ;
2013-10-28 12:11:22 +08:00
/* Race avoidance with bond_close flush of workqueue */
if ( ! rtnl_trylock ( ) ) {
delta_in_ticks = 1 ;
should_notify_peers = false ;
goto re_arm ;
}
2008-05-18 12:10:13 +08:00
2013-10-28 12:11:22 +08:00
bond_ab_arp_commit ( bond ) ;
2014-02-26 11:05:23 +08:00
2013-10-28 12:11:22 +08:00
rtnl_unlock ( ) ;
2014-02-26 11:05:23 +08:00
rcu_read_lock ( ) ;
2013-10-28 12:11:22 +08:00
}
2014-02-26 11:05:23 +08:00
should_notify_rtnl = bond_ab_arp_probe ( bond ) ;
rcu_read_unlock ( ) ;
2011-04-26 23:25:52 +08:00
2013-10-24 11:09:25 +08:00
re_arm :
if ( bond - > params . arp_interval )
2013-10-28 12:11:22 +08:00
queue_delayed_work ( bond - > wq , & bond - > arp_work , delta_in_ticks ) ;
2014-02-26 11:05:23 +08:00
if ( should_notify_peers | | should_notify_rtnl ) {
2013-10-28 12:11:22 +08:00
if ( ! rtnl_trylock ( ) )
return ;
2014-02-26 11:05:23 +08:00
if ( should_notify_peers )
call_netdevice_notifiers ( NETDEV_NOTIFY_PEERS ,
bond - > dev ) ;
2015-12-03 19:12:19 +08:00
if ( should_notify_rtnl ) {
2014-02-26 11:05:23 +08:00
bond_slave_state_notify ( bond ) ;
2015-12-03 19:12:19 +08:00
bond_slave_link_notify ( bond ) ;
}
2014-02-26 11:05:23 +08:00
2013-10-28 12:11:22 +08:00
rtnl_unlock ( ) ;
}
2005-04-17 06:20:36 +08:00
}
2017-03-09 02:55:51 +08:00
static void bond_arp_monitor ( struct work_struct * work )
{
struct bonding * bond = container_of ( work , struct bonding ,
arp_work . work ) ;
if ( BOND_MODE ( bond ) = = BOND_MODE_ACTIVEBACKUP )
bond_activebackup_arp_mon ( bond ) ;
else
bond_loadbalance_arp_mon ( bond ) ;
}
2005-04-17 06:20:36 +08:00
/*-------------------------- netdev event handling --------------------------*/
2014-09-15 23:19:34 +08:00
/* Change device name */
2005-04-17 06:20:36 +08:00
static int bond_event_changename ( struct bonding * bond )
{
bond_remove_proc_entry ( bond ) ;
bond_create_proc_entry ( bond ) ;
2009-06-13 03:02:46 +08:00
2010-12-09 23:17:13 +08:00
bond_debug_reregister ( bond ) ;
2005-04-17 06:20:36 +08:00
return NOTIFY_DONE ;
}
2009-06-13 03:02:48 +08:00
static int bond_master_netdev_event ( unsigned long event ,
struct net_device * bond_dev )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * event_bond = netdev_priv ( bond_dev ) ;
2005-04-17 06:20:36 +08:00
2019-06-07 22:59:29 +08:00
netdev_dbg ( bond_dev , " %s called \n " , __func__ ) ;
2005-04-17 06:20:36 +08:00
switch ( event ) {
case NETDEV_CHANGENAME :
return bond_event_changename ( event_bond ) ;
2012-07-09 18:51:45 +08:00
case NETDEV_UNREGISTER :
bond_remove_proc_entry ( event_bond ) ;
break ;
case NETDEV_REGISTER :
bond_create_proc_entry ( event_bond ) ;
break ;
2005-04-17 06:20:36 +08:00
default :
break ;
}
return NOTIFY_DONE ;
}
2009-06-13 03:02:48 +08:00
static int bond_slave_netdev_event ( unsigned long event ,
struct net_device * slave_dev )
2005-04-17 06:20:36 +08:00
{
2014-09-10 05:17:00 +08:00
struct slave * slave = bond_slave_get_rtnl ( slave_dev ) , * primary ;
2013-04-11 17:18:55 +08:00
struct bonding * bond ;
struct net_device * bond_dev ;
2005-04-17 06:20:36 +08:00
2013-04-11 17:18:55 +08:00
/* A netdev event can be generated while enslaving a device
* before netdev_rx_handler_register is called in which case
* slave will be NULL
*/
2019-06-07 22:59:29 +08:00
if ( ! slave ) {
netdev_dbg ( slave_dev , " %s called on NULL slave \n " , __func__ ) ;
2013-04-11 17:18:55 +08:00
return NOTIFY_DONE ;
2019-06-07 22:59:29 +08:00
}
2013-04-11 17:18:55 +08:00
bond_dev = slave - > bond - > dev ;
bond = slave - > bond ;
2014-09-10 05:17:00 +08:00
primary = rtnl_dereference ( bond - > primary_slave ) ;
2013-04-11 17:18:55 +08:00
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " %s called \n " , __func__ ) ;
2005-04-17 06:20:36 +08:00
switch ( event ) {
case NETDEV_UNREGISTER :
2013-06-26 23:13:37 +08:00
if ( bond_dev - > type ! = ARPHRD_ETHER )
2013-01-04 06:49:01 +08:00
bond_release_and_destroy ( bond_dev , slave_dev ) ;
else
2017-07-07 06:01:57 +08:00
__bond_release_one ( bond_dev , slave_dev , false , true ) ;
2005-04-17 06:20:36 +08:00
break ;
bonding:update speed/duplex for NETDEV_CHANGE
Zheng Liang (lzheng@redhat.com) found a bug: when bonding is configured with
the arp monitor, the bonding driver sometimes cannot get the speed and duplex
from its slaves and assumes them to be 100Mb/sec and Full; see
/proc/net/bonding/bond0.
There is no such problem when miimon is used.
(Take igb for example)
The reason is that after dev_open() in bond_enslave(),
bond_update_speed_duplex() calls igb_get_settings(), but that function
runs ethtool_cmd_speed_set(ecmd, -1); ecmd->duplex = -1;
because igb gets an error value of status.
So even though dev_open() has been called, the device is not really ready to
report its settings.
It may only be safe to call igb_get_settings() after this message shows up:
"igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX".
So I prefer to update the speed and duplex for a slave when it receives a
NETDEV_CHANGE/NETDEV_UP event.
Changelog
V2:
1 remove the "fake 100/Full" logic in bond_update_speed_duplex(),
set speed and duplex to -1 when it gets error value of speed and duplex.
2 delete the warning in bond_enslave() if bond_update_speed_duplex() returns
error.
3 make bond_info_show_slave() handle bad values of speed and duplex.
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-01 01:20:48 +08:00
case NETDEV_UP :
2005-04-17 06:20:36 +08:00
case NETDEV_CHANGE :
2017-09-28 09:03:49 +08:00
/* For 802.3ad mode only:
* Getting invalid Speed / Duplex values here will put slave
bonding/802.3ad: fix slave link initialization transition states
Once in a while, with just the right timing, 802.3ad slaves will fail to
properly initialize, winding up in a weird state, with a partner system
mac address of 00:00:00:00:00:00. This started happening after a fix to
properly track link_failure_count tracking, where an 802.3ad slave that
reported itself as link up in the miimon code, but wasn't able to get a
valid speed/duplex, started getting set to BOND_LINK_FAIL instead of
BOND_LINK_DOWN. That was the proper thing to do for the general "my link
went down" case, but has created a link initialization race that can put
the interface in this odd state.
The simple fix is to instead set the slave link to BOND_LINK_DOWN again,
if the link has never been up (last_link_up == 0), so the link state
doesn't bounce from BOND_LINK_DOWN to BOND_LINK_FAIL -- it hasn't failed
in this case, it simply hasn't been up yet, and this prevents the
unnecessary state change from DOWN to FAIL and getting stuck in an init
failure w/o a partner mac.
Fixes: ea53abfab960 ("bonding/802.3ad: fix link_failure_count tracking")
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Tested-by: Heesoon Kim <Heesoon.Kim@stratus.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 21:49:28 +08:00
* in weird state . Mark it as link - fail if the link was
* previously up or link - down if it hasn ' t yet come up , and
* let link - monitoring ( miimon ) set it right when correct
* speeds / duplex are available .
2017-09-28 09:03:49 +08:00
*/
if ( bond_update_speed_duplex ( slave ) & &
2019-05-24 21:49:28 +08:00
BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
if ( slave - > last_link_up )
slave - > link = BOND_LINK_FAIL ;
else
slave - > link = BOND_LINK_DOWN ;
}
2017-09-28 09:03:49 +08:00
2015-11-01 03:45:11 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD )
bond_3ad_adapter_speed_duplex_changed ( slave ) ;
2015-02-20 02:13:25 +08:00
/* Fallthrough */
case NETDEV_DOWN :
2014-10-05 08:45:01 +08:00
/* Refresh slave-array if applicable!
* If the setup does not use miimon or arpmon ( mode - specific ! ) ,
* then these events will not cause the slave - array to be
* refreshed . This will cause xmit to use a slave that is not
* usable . Avoid such a situation by refreshing the array at these
* events . If these ( miimon / arpmon ) parameters are configured
* then array gets refreshed twice and that should be fine !
*/
2018-05-15 02:48:09 +08:00
if ( bond_mode_can_use_xmit_hash ( bond ) )
2014-10-05 08:45:01 +08:00
bond_update_slave_arr ( bond , NULL ) ;
2005-04-17 06:20:36 +08:00
break ;
case NETDEV_CHANGEMTU :
2014-09-15 23:19:34 +08:00
/* TODO: Should slaves be allowed to
2005-04-17 06:20:36 +08:00
* independently alter their MTU ? For
* an active - backup bond , slaves need
* not be the same type of device , so
* MTUs may vary . For other modes ,
* slaves arguably should have the
* same MTUs . To do this , we ' d need to
* take over the slave ' s change_mtu
* function for the duration of their
* servitude .
*/
break ;
case NETDEV_CHANGENAME :
bonding: handle slave's name change with primary_slave logic
Currently, if a slave's name changes, we just pass it by. However, if the
slave is the current primary_slave, then we end up using a slave whose
name != params.primary as primary_slave. And vice versa, if we don't have
a primary_slave but have params.primary set, we will not detect a new
primary_slave.
Fix this by catching the NETDEV_CHANGENAME event and setting primary_slave
accordingly. Also, if the primary_slave was changed, issue a reselection of
the active slave, because the priorities have changed.
Reported-by: Ding Tianhong <dingtianhong@huawei.com>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 09:04:29 +08:00
/* we don't care if we don't have primary set */
2014-05-16 03:39:54 +08:00
if ( ! bond_uses_primary ( bond ) | |
2014-01-16 09:04:29 +08:00
! bond - > params . primary [ 0 ] )
break ;
2014-09-10 05:17:00 +08:00
if ( slave = = primary ) {
2014-01-16 09:04:29 +08:00
/* slave's name changed - he's no longer primary */
2014-09-10 05:17:00 +08:00
RCU_INIT_POINTER ( bond - > primary_slave , NULL ) ;
2014-01-16 09:04:29 +08:00
} else if ( ! strcmp ( slave_dev - > name , bond - > params . primary ) ) {
/* we have a new primary slave */
2014-09-10 05:17:00 +08:00
rcu_assign_pointer ( bond - > primary_slave , slave ) ;
2014-01-16 09:04:29 +08:00
} else { /* we didn't change primary - exit */
break ;
}
2014-07-16 01:35:58 +08:00
netdev_info ( bond - > dev , " Primary slave changed to %s, reselecting active slave \n " ,
2014-09-10 05:17:00 +08:00
primary ? slave_dev - > name : " none " ) ;
2014-02-12 12:06:40 +08:00
block_netpoll_tx ( ) ;
2014-01-16 09:04:29 +08:00
bond_select_active_slave ( bond ) ;
2014-02-12 12:06:40 +08:00
unblock_netpoll_tx ( ) ;
2005-04-17 06:20:36 +08:00
break ;
2005-08-23 13:34:53 +08:00
case NETDEV_FEAT_CHANGE :
bond_compute_features ( bond ) ;
break ;
2013-07-20 18:13:53 +08:00
case NETDEV_RESEND_IGMP :
/* Propagate to master device */
call_netdevice_notifiers ( event , slave - > bond - > dev ) ;
break ;
2005-04-17 06:20:36 +08:00
default :
break ;
}
return NOTIFY_DONE ;
}
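The NETDEV_CHANGENAME branch of bond_slave_netdev_event() above reduces to a three-way decision: the renamed slave was the primary (clear it), the new name matches the configured primary name (adopt it), or nothing changes. A minimal user-space sketch of just that decision follows. All `ren_*` and `REN_*` names are hypothetical, `primary_name` stands in for params.primary, and the bond_uses_primary() mode check, the RCU pointer updates and the active-slave reselection are deliberately omitted.

```c
#include <stddef.h>
#include <string.h>

#define REN_IFNAMSIZ 16

/* Hypothetical stand-ins for the slave and bond structures. */
struct ren_slave {
	char name[REN_IFNAMSIZ];
};

struct ren_bond {
	char primary_name[REN_IFNAMSIZ];	/* configured primary name */
	struct ren_slave *primary;		/* current primary slave, may be NULL */
};

enum ren_result {
	REN_PRIMARY_UNCHANGED,
	REN_PRIMARY_CLEARED,
	REN_PRIMARY_SET,
};

/* Call after slave->name has already been updated to the new name. */
static enum ren_result ren_slave_renamed(struct ren_bond *bond,
					 struct ren_slave *slave)
{
	/* No primary configured: a rename cannot affect the primary. */
	if (!bond->primary_name[0])
		return REN_PRIMARY_UNCHANGED;

	if (slave == bond->primary) {
		/* The primary was renamed, so it no longer matches. */
		bond->primary = NULL;
		return REN_PRIMARY_CLEARED;
	}
	if (!strcmp(slave->name, bond->primary_name)) {
		/* Another slave now carries the configured primary name. */
		bond->primary = slave;
		return REN_PRIMARY_SET;
	}
	return REN_PRIMARY_UNCHANGED;
}
```

In this sketch a caller would trigger a reselection of the active slave whenever the result is not `REN_PRIMARY_UNCHANGED`, which corresponds to the two branches above that do not `break` early.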
2014-09-15 23:19:34 +08:00
/* bond_netdev_event: handle netdev notifier chain events.
2005-04-17 06:20:36 +08:00
*
* This function receives events for the netdev chain . The caller ( an
[PATCH] Notifier chain update: API changes
The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
We noticed that notifier chains in the kernel fall into two basic usage
classes:
"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;
"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.
We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.
With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)
There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)
Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent than calling a chain.
Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.
ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain
BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c sleep_notifier_list
drivers/macintosh/via-pmu68k.c sleep_notifier_list
drivers/macintosh/windfarm_core.c wf_client_list
drivers/usb/core/notify.c usb_notifier_list
drivers/video/fbmem.c fb_notifier_list
kernel/cpu.c cpu_chain
kernel/module.c module_notify_list
kernel/profile.c munmap_notifier
kernel/profile.c task_exit_notifier
kernel/sys.c reboot_notifier_list
net/core/dev.c netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain
It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)
The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.
[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 17:16:30 +08:00
* ioctl handler calling blocking_notifier_call_chain ) holds the necessary
2005-04-17 06:20:36 +08:00
* locks for us to safely manipulate the slave devices ( RTNL lock ,
* dev_probe_lock ) .
*/
2009-06-13 03:02:48 +08:00
static int bond_netdev_event ( struct notifier_block * this ,
unsigned long event , void * ptr )
2005-04-17 06:20:36 +08:00
{
2013-05-28 09:30:21 +08:00
struct net_device * event_dev = netdev_notifier_info_to_dev ( ptr ) ;
2005-04-17 06:20:36 +08:00
2019-06-07 22:59:26 +08:00
netdev_dbg ( event_dev , " %s received %s \n " ,
__func__ , netdev_cmd_to_name ( event ) ) ;
2005-04-17 06:20:36 +08:00
2006-09-23 12:54:10 +08:00
if ( ! ( event_dev - > priv_flags & IFF_BONDING ) )
return NOTIFY_DONE ;
2005-04-17 06:20:36 +08:00
if ( event_dev - > flags & IFF_MASTER ) {
2019-04-12 21:04:10 +08:00
int ret ;
ret = bond_master_netdev_event ( event , event_dev ) ;
if ( ret ! = NOTIFY_DONE )
return ret ;
2005-04-17 06:20:36 +08:00
}
2019-06-07 22:59:29 +08:00
if ( event_dev - > flags & IFF_SLAVE )
2005-04-17 06:20:36 +08:00
return bond_slave_netdev_event ( event , event_dev ) ;
return NOTIFY_DONE ;
}
static struct notifier_block bond_netdev_notifier = {
. notifier_call = bond_netdev_event ,
} ;
2005-06-27 05:54:11 +08:00
/*---------------------------- Hashing Policies -----------------------------*/
2013-10-02 19:39:25 +08:00
/* L2 hash helper */
static inline u32 bond_eth_hash ( struct sk_buff * skb )
bonding: support for IPv6 transmit hashing
Currently the "bonding" driver does not support load balancing outgoing
traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
are currently supported; this patch adds transmit hashing for IPv6 (and
TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
bonding driver. In addition, bounds checking has been added to all
transmit hashing functions.
The algorithm chosen (xor'ing the bottom three quads of the source and
destination addresses together, then xor'ing each byte of that result into
the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
was selected after testing almost 400,000 unique IPv6 addresses harvested
from server logs. This algorithm had the most even distribution for both
big- and little-endian architectures while still using few instructions. Its
behavior also attempts to closely match that of the IPv4 algorithm.
The IPv6 flow label was intentionally not included in the hash as it appears
to be unset in the vast majority of IPv6 traffic sampled, and the current
algorithm not using the flow label already offers a very even distribution.
Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
ie, they are not balanced based on layer 4 information. Additionally,
IPv6 packets with intermediate headers are not balanced based on layer
4 information. In practice these intermediate headers are not common and
this should not cause any problems, and the alternative (a packet-parsing
loop and look-up table) seemed slow and complicated for little gain.
Tested-by: John Eaglesham <linux@8192.net>
Signed-off-by: John Eaglesham <linux@8192.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
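The algorithm described in the commit message above (XOR the bottom three 32-bit words of the source and destination addresses, fold every byte of the result into the bottom byte, then XOR in the last MAC bytes) can be sketched in user-space C as follows. This illustrates the commit message only and is not necessarily what the current hashing code does. The function name, the argument layout and the final modulo by the slave count are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the IPv6 hash from the commit message.
 * saddr/daddr hold the four 32-bit words of each address; word 0 is
 * skipped ("bottom three quads"). slave_count must be non-zero.
 */
static uint32_t demo_ipv6_l23_hash(const uint32_t saddr[4],
				   const uint32_t daddr[4],
				   uint8_t src_mac_last, uint8_t dst_mac_last,
				   uint32_t slave_count)
{
	uint32_t h = (saddr[1] ^ daddr[1]) ^
		     (saddr[2] ^ daddr[2]) ^
		     (saddr[3] ^ daddr[3]);

	/* Fold every byte of the result into the bottom byte. */
	h ^= (h >> 24) ^ (h >> 16) ^ (h >> 8);

	/* Mix in the last MAC bytes, then map onto a slave index (assumed). */
	return (h ^ dst_mac_last ^ src_mac_last) % slave_count;
}
```

Since only XOR is used, swapping source and destination yields the same value, so both directions of a flow map to the same index in this sketch.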
2012-08-22 04:43:35 +08:00
{
2014-07-17 14:16:25 +08:00
struct ethhdr * ep , hdr_tmp ;
2012-08-22 04:43:35 +08:00
2014-07-17 14:16:25 +08:00
ep = skb_header_pointer ( skb , 0 , sizeof ( hdr_tmp ) , & hdr_tmp ) ;
if ( ep )
return ep - > h_dest [ 5 ] ^ ep - > h_source [ 5 ] ^ ep - > h_proto ;
2012-08-22 04:43:35 +08:00
return 0 ;
}
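The L2 helper above reduces a frame to the XOR of the last byte of each MAC address and the EtherType. A minimal user-space sketch of that reduction follows; `struct eth_header` and `l2_hash()` are hypothetical stand-ins for `struct ethhdr` and `bond_eth_hash()`, and the bounds check done by `skb_header_pointer()` is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct ethhdr. */
struct eth_header {
	uint8_t dest[6];	/* destination MAC address */
	uint8_t source[6];	/* source MAC address */
	uint16_t proto;		/* EtherType, used exactly as stored */
};

/* XOR the last byte of both MAC addresses with the EtherType. */
static uint32_t l2_hash(const struct eth_header *eh)
{
	assert(eh != NULL);
	return (uint32_t)(eh->dest[5] ^ eh->source[5] ^ eh->proto);
}
```

Because XOR is commutative, swapping source and destination yields the same value, so both directions of a conversation map to the same hash in this sketch.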
bonding: symmetric ICMP transmit
A bond with layer2+3 or layer3+4 hashing uses the IP addresses and the ports
to balance packets between slaves. With some network errors, we receive an ICMP
error packet from the remote host or a router. If sent by a router, the source IP
can differ from that of the remote host. Additionally the ICMP protocol has no port
numbers, so a layer3+4 bond will compute a different hash than for the original flow.
These two conditions could let the packet go through a different interface than
the other packets of the same flow:
# tcpdump -qltnni veth0 |sed 's/^/0: /' &
# tcpdump -qltnni veth1 |sed 's/^/1: /' &
# hping3 -2 192.168.0.2 -p 9
0: IP 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.2252 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.2253 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.2254 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
An ICMP error packet contains the header of the packet which caused the network
error, so inspect it and match the flow against it, so we can send the ICMP via
the same interface as the previous packet in the flow.
Move the IP and port dissect code into a generic function bond_flow_ip() and if
we are dissecting an ICMP error packet, call it again with the adjusted offset.
# hping3 -2 192.168.0.2 -p 9
1: IP 192.168.0.1.1224 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
1: IP 192.168.0.1.1225 > 192.168.0.2.9: UDP, length 0
1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.1226 > 192.168.0.2.9: UDP, length 0
0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
0: IP 192.168.0.1.1227 > 192.168.0.2.9: UDP, length 0
0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 19:10:37 +08:00
static bool bond_flow_ip ( struct sk_buff * skb , struct flow_keys * fk ,
int * noff , int * proto , bool l34 )
{
const struct ipv6hdr * iph6 ;
const struct iphdr * iph ;
if ( skb - > protocol = = htons ( ETH_P_IP ) ) {
if ( unlikely ( ! pskb_may_pull ( skb , * noff + sizeof ( * iph ) ) ) )
return false ;
iph = ( const struct iphdr * ) ( skb - > data + * noff ) ;
iph_to_flow_copy_v4addrs ( fk , iph ) ;
* noff + = iph - > ihl < < 2 ;
if ( ! ip_is_fragment ( iph ) )
* proto = iph - > protocol ;
} else if ( skb - > protocol = = htons ( ETH_P_IPV6 ) ) {
if ( unlikely ( ! pskb_may_pull ( skb , * noff + sizeof ( * iph6 ) ) ) )
return false ;
iph6 = ( const struct ipv6hdr * ) ( skb - > data + * noff ) ;
iph_to_flow_copy_v6addrs ( fk , iph6 ) ;
* noff + = sizeof ( * iph6 ) ;
* proto = iph6 - > nexthdr ;
} else {
return false ;
}
if ( l34 & & * proto > = 0 )
fk - > ports . ports = skb_flow_get_ports ( skb , * noff , * proto ) ;
return true ;
}
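For the IPv4 branch of bond_flow_ip(), the two details that matter are the header length (`ihl << 2`) and the fragment test that decides whether layer 4 data may be used. The sketch below shows both on a raw byte buffer. The function names are hypothetical, and the `ihl < 5` rejection is an added assumption that is not present in the driver code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the offset of the L4 header for an IPv4 header starting at
 * buf + off, or -1 if the buffer is too short or ihl is invalid.
 */
static int ipv4_l4_offset(const uint8_t *buf, size_t len, size_t off)
{
	unsigned int ihl;

	assert(buf != NULL);
	if (off > len || len - off < 20)
		return -1;
	ihl = buf[off] & 0x0f;		/* header length in 32-bit words */
	if (ihl < 5)			/* assumption: reject bogus lengths */
		return -1;
	return (int)(off + (ihl << 2));	/* words to bytes */
}

/* Non-zero if the MF flag is set or the fragment offset is non-zero. */
static int ipv4_is_fragment(const uint8_t *buf, size_t off)
{
	uint16_t frag = (uint16_t)((buf[off + 6] << 8) | buf[off + 7]);

	return (frag & 0x3fff) != 0;	/* MF (0x2000) | offset (0x1fff) */
}
```

A packet with only the DF bit set is not a fragment, so its ports remain usable for layer3+4 hashing.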
2013-10-02 19:39:25 +08:00
/* Extract the appropriate headers based on bond's xmit policy */
static bool bond_flow_dissect ( struct bonding * bond , struct sk_buff * skb ,
struct flow_keys * fk )
2007-12-07 15:40:34 +08:00
{
2019-11-15 19:10:37 +08:00
bool l34 = bond - > params . xmit_policy = = BOND_XMIT_POLICY_LAYER34 ;
2013-10-02 19:39:25 +08:00
int noff , proto = - 1 ;
2012-08-22 04:43:35 +08:00
bonding: balance ICMP echoes in layer3+4 mode
Bonding uses the L4 ports to balance flows between slaves. As the ICMP
protocol has no ports, those packets are all sent to the same device:
# tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
# tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64
But some ICMP packets have an Identifier field which is
used to match packets within sessions. Let's use this value in the hash
function to balance these packets between bond slaves:
# ping -qc1 192.168.0.2
0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64
Also, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:
# ping -q 192.168.1.2 -c1
0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
# ping -q 192.168.1.2 -c1
1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 21:50:53 +08:00
if ( bond - > params . xmit_policy > BOND_XMIT_POLICY_LAYER23 ) {
memset ( fk , 0 , sizeof ( * fk ) ) ;
return __skb_flow_dissect ( NULL , skb , & flow_keys_bonding ,
fk , NULL , 0 , 0 , 0 , 0 ) ;
}
2013-10-02 19:39:25 +08:00
2015-05-12 20:56:16 +08:00
fk - > ports . ports = 0 ;
2019-10-29 21:50:53 +08:00
memset ( & fk - > icmp , 0 , sizeof ( fk - > icmp ) ) ;
2013-10-02 19:39:25 +08:00
noff = skb_network_offset ( skb ) ;
2019-11-15 19:10:37 +08:00
if ( ! bond_flow_ip ( skb , fk , & noff , & proto , l34 ) )
2013-10-02 19:39:25 +08:00
return false ;
2019-11-15 19:10:37 +08:00
/* ICMP error packets contain at least 8 bytes of the header
* of the packet which generated the error . Use this information
* to correlate ICMP error packets with the flow which
* generated the error .
*/
if ( proto = = IPPROTO_ICMP | | proto = = IPPROTO_ICMPV6 ) {
skb_flow_get_icmp_tci ( skb , & fk - > icmp , skb - > data ,
skb_transport_offset ( skb ) ,
skb_headlen ( skb ) ) ;
if ( proto = = IPPROTO_ICMP ) {
if ( ! icmp_is_err ( fk - > icmp . type ) )
return true ;
noff + = sizeof ( struct icmphdr ) ;
} else if ( proto = = IPPROTO_ICMPV6 ) {
if ( ! icmpv6_is_err ( fk - > icmp . type ) )
return true ;
noff + = sizeof ( struct icmp6hdr ) ;
}
return bond_flow_ip ( skb , fk , & noff , & proto , l34 ) ;
2019-10-29 21:50:53 +08:00
}
2007-12-07 15:40:34 +08:00
2013-10-02 19:39:25 +08:00
return true ;
2007-12-07 15:40:34 +08:00
}
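The second bond_flow_ip() call in bond_flow_dissect() re-reads addresses and ports from the header embedded in an ICMP error. The following user-space sketch illustrates the idea for IPv4 only. `struct flow`, `parse_ipv4()` and `dissect()` are hypothetical names, values are held in host byte order for readability, and fragments and IPv6 are deliberately ignored, so this is not the driver's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct flow {			/* hypothetical, not struct flow_keys */
	uint32_t src, dst;	/* host byte order for readability */
	uint16_t sport, dport;
};

static uint32_t rd32(const uint8_t *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | p[3];
}

static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] << 8 | p[1]);
}

/* Destination unreachable, source quench, redirect, time exceeded,
 * parameter problem. */
static int icmp4_is_error(uint8_t type)
{
	return type == 3 || type == 4 || type == 5 || type == 11 || type == 12;
}

/* Parse one IPv4 header at *off, then advance *off to the L4 header. */
static int parse_ipv4(const uint8_t *buf, size_t len, size_t *off,
		      struct flow *fl, uint8_t *proto)
{
	size_t ihl;

	if (*off > len || len - *off < 20)
		return -1;
	ihl = (size_t)(buf[*off] & 0x0f) * 4;
	if (ihl < 20 || len - *off < ihl)
		return -1;
	fl->src = rd32(buf + *off + 12);
	fl->dst = rd32(buf + *off + 16);
	*proto = buf[*off + 9];
	*off += ihl;
	/* TCP (6) and UDP (17) both start with source and destination port. */
	if ((*proto == 6 || *proto == 17) && len - *off >= 4) {
		fl->sport = rd16(buf + *off);
		fl->dport = rd16(buf + *off + 2);
	}
	return 0;
}

/* Fill fl from the outer header, or from the embedded one for ICMP errors. */
static int dissect(const uint8_t *buf, size_t len, struct flow *fl)
{
	size_t off = 0;
	uint8_t proto = 0;

	assert(buf != NULL && fl != NULL);
	fl->src = fl->dst = 0;
	fl->sport = fl->dport = 0;
	if (parse_ipv4(buf, len, &off, fl, &proto) < 0)
		return -1;
	if (proto != 1)			/* not ICMP: outer header is the flow */
		return 0;
	if (len - off < 8 || !icmp4_is_error(buf[off]))
		return 0;
	off += 8;			/* skip the 8-byte ICMP header */
	return parse_ipv4(buf, len, &off, fl, &proto);
}
```

For a "port unreachable" reply, the resulting `struct flow` carries the addresses and ports of the UDP packet that triggered the error, which is what lets the error be hashed like the original flow.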
2013-10-02 19:39:25 +08:00
/**
* bond_xmit_hash - generate a hash value based on the xmit policy
* @ bond : bonding device
* @ skb : buffer to use for headers
*
* This function will extract the necessary headers from the skb buffer and use
* them to generate a hash based on the xmit_policy set in the bonding device
2005-06-27 05:54:11 +08:00
*/
2014-04-23 07:30:15 +08:00
u32 bond_xmit_hash ( struct bonding * bond , struct sk_buff * skb )
2005-06-27 05:54:11 +08:00
{
2013-10-02 19:39:25 +08:00
struct flow_keys flow ;
u32 hash ;
2013-04-16 01:03:24 +08:00
2015-09-16 06:24:28 +08:00
if ( bond - > params . xmit_policy = = BOND_XMIT_POLICY_ENCAP34 & &
skb - > l4_hash )
return skb - > hash ;
2013-10-02 19:39:25 +08:00
if ( bond - > params . xmit_policy = = BOND_XMIT_POLICY_LAYER2 | |
! bond_flow_dissect ( bond , skb , & flow ) )
2014-04-23 07:30:15 +08:00
return bond_eth_hash ( skb ) ;
2005-06-27 05:54:11 +08:00
2013-10-02 19:39:25 +08:00
if ( bond - > params . xmit_policy = = BOND_XMIT_POLICY_LAYER23 | |
2019-10-29 21:50:53 +08:00
bond - > params . xmit_policy = = BOND_XMIT_POLICY_ENCAP23 ) {
2013-10-02 19:39:25 +08:00
hash = bond_eth_hash ( skb ) ;
2019-10-29 21:50:53 +08:00
} else {
if ( flow . icmp . id )
memcpy ( & hash , & flow . icmp , sizeof ( hash ) ) ;
else
memcpy ( & hash , & flow . ports . ports , sizeof ( hash ) ) ;
}
2015-06-05 00:16:40 +08:00
hash ^ = ( __force u32 ) flow_get_u32_dst ( & flow ) ^
( __force u32 ) flow_get_u32_src ( & flow ) ;
2013-10-02 19:39:25 +08:00
hash ^ = ( hash > > 16 ) ;
hash ^ = ( hash > > 8 ) ;
2017-11-06 09:01:57 +08:00
return hash > > 1 ;
2005-06-27 05:54:11 +08:00
}
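The final mixing steps of bond_xmit_hash() can be tried outside the kernel. In this sketch `fold_hash()` mirrors the XOR and shift sequence above, with `seed` standing for whichever initial value was chosen (L2 hash, ICMP id, or ports). `pick_slave()` is a hypothetical helper: reducing the hash modulo the slave count is an assumption about how callers use the value and is not shown in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Mix the addresses into the seed and fold the high bits downwards. */
static uint32_t fold_hash(uint32_t seed, uint32_t src, uint32_t dst)
{
	uint32_t hash = seed ^ dst ^ src;

	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash >> 1;
}

/* Assumed caller-side step: map the hash onto one of count slaves. */
static unsigned int pick_slave(uint32_t hash, unsigned int count)
{
	assert(count > 0);
	return hash % count;
}
```

Since the addresses enter only through XOR, swapping `src` and `dst` leaves the result unchanged for a given seed.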
2005-04-17 06:20:36 +08:00
/*-------------------------- Device entry points ----------------------------*/
2017-04-21 03:49:24 +08:00
void bond_work_init_all ( struct bonding * bond )
2012-11-29 09:31:31 +08:00
{
INIT_DELAYED_WORK ( & bond - > mcast_work ,
bond_resend_igmp_join_requests_delayed ) ;
INIT_DELAYED_WORK ( & bond - > alb_work , bond_alb_monitor ) ;
INIT_DELAYED_WORK ( & bond - > mii_work , bond_mii_monitor ) ;
2017-03-09 02:55:51 +08:00
INIT_DELAYED_WORK ( & bond - > arp_work , bond_arp_monitor ) ;
2012-11-29 09:31:31 +08:00
INIT_DELAYED_WORK ( & bond - > ad_work , bond_3ad_state_machine_handler ) ;
2014-10-05 08:45:01 +08:00
INIT_DELAYED_WORK ( & bond - > slave_arr_work , bond_slave_arr_handler ) ;
2012-11-29 09:31:31 +08:00
}
static void bond_work_cancel_all ( struct bonding * bond )
{
cancel_delayed_work_sync ( & bond - > mii_work ) ;
cancel_delayed_work_sync ( & bond - > arp_work ) ;
cancel_delayed_work_sync ( & bond - > alb_work ) ;
cancel_delayed_work_sync ( & bond - > ad_work ) ;
cancel_delayed_work_sync ( & bond - > mcast_work ) ;
2014-10-05 08:45:01 +08:00
cancel_delayed_work_sync ( & bond - > slave_arr_work ) ;
2012-11-29 09:31:31 +08:00
}
2005-04-17 06:20:36 +08:00
static int bond_open ( struct net_device * bond_dev )
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
bonding:reset backup and inactive flag of slave
Eduard Sinelnikov (eduard.sinelnikov@gmail.com) found that if we change
bonding mode from active backup to round robin, some slaves still keep the
"backup" flag, and won't transmit packets.
As Jay Vosburgh (fubar@us.ibm.com) pointed out, we can work around that by
removing the bond_is_active_slave() check, because the "backup" flag is only
meaningful for active backup mode.
But if we simply ignore the bond_is_active_slave() check,
the transmission will work fine, but we can't maintain the correct value of the
"backup" flag for each slave, though it is meaningless for modes other than
active backup.
I'd like to reset "backup" and "inactive" flag in bond_open,
thus we can keep the correct value of them.
As for bond_is_active_slave(), I'd like to prepare another patch to handle it.
V2:
Use C style comment.
Move read_lock(&bond->curr_slave_lock).
Replace restore with reset, for active backup mode, it means "restore",
but for other modes, it means "reset".
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-15 23:57:35 +08:00
	struct slave *slave;
2005-04-17 06:20:36 +08:00
2011-08-15 23:57:35 +08:00
	/* reset slave->backup and slave->inactive */
2013-09-25 15:20:21 +08:00
	if (bond_has_slaves(bond)) {
2013-09-25 15:20:14 +08:00
		bond_for_each_slave(bond, slave, iter) {
2014-07-15 21:56:55 +08:00
			if (bond_uses_primary(bond) &&
			    slave != rcu_access_pointer(bond->curr_active_slave)) {
bonding: Fix RTNL: assertion failed at net/core/rtnetlink.c for 802.3ad mode
The problem was introduced by the commit 1d3ee88ae0d
(bonding: add netlink attributes to slave link dev).
The bond_set_active_slave() and bond_set_backup_slave()
will use rtmsg_ifinfo to send slave's states, so these
two functions should be called in RTNL.
In 802.3ad mode, acquiring RTNL for the __enable_port and
__disable_port cases is difficult, as those calls generally
already hold the state machine lock, and cannot unconditionally
call rtnl_lock because either they already hold RTNL (for calls
via bond_3ad_unbind_slave) or due to the potential for deadlock
with bond_3ad_adapter_speed_changed, bond_3ad_adapter_duplex_changed,
bond_3ad_link_change, or bond_3ad_update_lacp_rate. All four of
those are called with RTNL held, and acquire the state machine lock
second. The calling contexts for __enable_port and __disable_port
already hold the state machine lock, and may or may not need RTNL.
According to the Jay's opinion, I don't think it is a problem that
the slave don't send notify message synchronously when the status
changed, normally the state machine is running every 100 ms, send
the notify message at the end of the state machine if the slave's
state changed should be better.
I fix the problem through these steps:
1). add a new function bond_set_slave_state() which could change
the slave's state and call rtmsg_ifinfo() according to the input
parameters called notify.
2). Add a new slave parameter which called should_notify, if the slave's state
changed and don't notify yet, the parameter will be set to 1, and then if
the slave's state changed again, the param will be set to 0, it indicate that
the slave's state has been restored, no need to notify any one.
3). the __enable_port and __disable_port should not call rtmsg_ifinfo
in the state machine lock, any change in the state of slave could
set a flag in the slave, it will indicated that an rtmsg_ifinfo
should be called at the end of the state machine.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-26 11:05:22 +08:00
				bond_set_slave_inactive_flags(slave,
							      BOND_SLAVE_NOTIFY_NOW);
2015-01-26 14:16:58 +08:00
			} else if (BOND_MODE(bond) != BOND_MODE_8023AD) {
2014-02-26 11:05:22 +08:00
				bond_set_slave_active_flags(slave,
							    BOND_SLAVE_NOTIFY_NOW);
2011-08-15 23:57:35 +08:00
			}
		}
	}
2008-12-10 15:07:13 +08:00
	if (bond_is_lb(bond)) {
2005-04-17 06:20:36 +08:00
		/* bond_alb_initialize must be called before the timer
		 * is started.
		 */
2014-05-16 03:39:55 +08:00
		if (bond_alb_initialize(bond, (BOND_MODE(bond) == BOND_MODE_ALB)))
2010-01-26 07:34:15 +08:00
			return -ENOMEM;
2018-05-15 02:48:09 +08:00
		if (bond->params.tlb_dynamic_lb || BOND_MODE(bond) == BOND_MODE_ALB)
2014-04-23 07:30:22 +08:00
			queue_delayed_work(bond->wq, &bond->alb_work, 0);
2005-04-17 06:20:36 +08:00
	}
2012-11-29 09:31:31 +08:00
	if (bond->params.miimon) /* link check interval, in milliseconds. */
2007-10-18 08:37:45 +08:00
		queue_delayed_work(bond->wq, &bond->mii_work, 0);
2005-04-17 06:20:36 +08:00
	if (bond->params.arp_interval) { /* arp interval, in milliseconds. */
2007-10-18 08:37:45 +08:00
		queue_delayed_work(bond->wq, &bond->arp_work, 0);
2014-02-18 14:48:39 +08:00
		bond->recv_probe = bond_arp_rcv;
2005-04-17 06:20:36 +08:00
	}
2014-05-16 03:39:55 +08:00
	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
2007-10-18 08:37:45 +08:00
		queue_delayed_work(bond->wq, &bond->ad_work, 0);
2005-04-17 06:20:36 +08:00
		/* register to receive LACPDUs */
2011-04-19 11:48:16 +08:00
		bond->recv_probe = bond_3ad_lacpdu_recv;
2008-11-05 09:51:16 +08:00
		bond_3ad_initiate_agg_selection(bond, 1);
2005-04-17 06:20:36 +08:00
	}
2018-05-15 02:48:09 +08:00
	if (bond_mode_can_use_xmit_hash(bond))
2014-10-05 08:45:01 +08:00
		bond_update_slave_arr(bond, NULL);
2005-04-17 06:20:36 +08:00
	return 0;
}
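The flag reset at the top of bond_open() reduces to a three-way decision per slave: mark it inactive, mark it active, or leave it untouched. The sketch below restates that decision as a pure function in user space. The enum names, `reset_action`, and `uses_primary` are hypothetical, and the set of modes that `uses_primary` accepts (active-backup, TLB, ALB) is an assumption about what bond_uses_primary() covers, not something shown in this file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode and action names, for illustration only. */
enum mode {
	MODE_ROUNDROBIN, MODE_ACTIVEBACKUP, MODE_XOR, MODE_BROADCAST,
	MODE_8023AD, MODE_TLB, MODE_ALB
};

enum flag_action { SET_INACTIVE, SET_ACTIVE, LEAVE_ALONE };

/* Assumption: these are the modes that track a current active slave. */
bool uses_primary(enum mode m)
{
	return m == MODE_ACTIVEBACKUP || m == MODE_TLB || m == MODE_ALB;
}

/* Same branch order as the loop in bond_open(). */
enum flag_action reset_action(enum mode m, bool is_curr_active)
{
	if (uses_primary(m) && !is_curr_active)
		return SET_INACTIVE;
	if (m != MODE_8023AD)
		return SET_ACTIVE;
	return LEAVE_ALONE;	/* 802.3ad leaves the flags untouched here */
}

void reset_action_demo(void)
{
	assert(reset_action(MODE_ACTIVEBACKUP, false) == SET_INACTIVE);
	assert(reset_action(MODE_ACTIVEBACKUP, true) == SET_ACTIVE);
	assert(reset_action(MODE_ROUNDROBIN, false) == SET_ACTIVE);
	assert(reset_action(MODE_8023AD, false) == LEAVE_ALONE);
}
```

Writing the rule as a function with no side effects makes the mode-change case from the commit message easy to check: a slave left over as "backup" from active-backup mode maps to SET_ACTIVE once the mode is round-robin.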

static int bond_close(struct net_device *bond_dev)
{
2008-11-13 15:37:49 +08:00
	struct bonding *bond = netdev_priv(bond_dev);
2005-04-17 06:20:36 +08:00
2012-11-29 09:31:31 +08:00
	bond_work_cancel_all(bond);
2013-09-02 19:51:38 +08:00
	bond->send_peer_notif = 0;
2013-09-02 19:51:39 +08:00
	if (bond_is_lb(bond))
2005-04-17 06:20:36 +08:00
		bond_alb_deinitialize(bond);
2011-04-19 11:48:16 +08:00
	bond->recv_probe = NULL;
2005-04-17 06:20:36 +08:00
	return 0;
}
2016-03-18 08:23:36 +08:00
/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
 * that some drivers can provide 32 bit values only.
 */
static void bond_fold_stats(struct rtnl_link_stats64 *_res,
			    const struct rtnl_link_stats64 *_new,
			    const struct rtnl_link_stats64 *_old)
{
	const u64 *new = (const u64 *)_new;
	const u64 *old = (const u64 *)_old;
	u64 *res = (u64 *)_res;
	int i;

	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
		u64 nv = new[i];
		u64 ov = old[i];
2017-03-30 01:45:44 +08:00
		s64 delta = nv - ov;
2016-03-18 08:23:36 +08:00
		/* detects if this particular field is 32bit only */
		if (((nv | ov) >> 32) == 0)
2017-03-30 01:45:44 +08:00
			delta = (s64)(s32)((u32)nv - (u32)ov);
		/* filter anomalies, some drivers reset their stats
		 * at down/up events.
		 */
		if (delta > 0)
			res[i] += delta;
2016-03-18 08:23:36 +08:00
	}
}
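The folding loop above adds, per counter, the positive difference between two snapshots, and treats a counter whose two snapshots both fit in 32 bits as a 32-bit counter so that a wrap still produces a small positive delta. A user-space sketch of the same arithmetic on plain arrays follows. The name `fold_counters` and its array-based signature are assumptions for illustration; the kernel function operates on struct rtnl_link_stats64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of the folding loop: add the positive per-counter
 * delta between two snapshots (cur, prev) into an accumulator (res).
 */
void fold_counters(uint64_t *res, const uint64_t *cur,
		   const uint64_t *prev, size_t n)
{
	size_t i;

	assert(res && cur && prev);

	for (i = 0; i < n; i++) {
		uint64_t nv = cur[i];
		uint64_t ov = prev[i];
		int64_t delta = (int64_t)(nv - ov);

		/* Both snapshots fit in 32 bits: compute the difference in
		 * 32-bit arithmetic so that a wrapped counter still yields
		 * a small delta.
		 */
		if (((nv | ov) >> 32) == 0)
			delta = (int64_t)(int32_t)((uint32_t)nv - (uint32_t)ov);

		/* Ignore non-positive deltas, e.g. after a counter reset. */
		if (delta > 0)
			res[i] += (uint64_t)delta;
	}
}
```

For example, with a previous value of 0xFFFFFFF0 and a current value of 0x10, the 32-bit subtraction gives 0x20, so 32 is added to the accumulator. With a previous value of 1000 and a current value of 10 the delta is negative and nothing is added.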
bonding: fix lockdep warning in bond_get_stats()
In the "struct bonding", there is stats_lock.
This lock protects "bond_stats" in the "struct bonding".
bond_stats is updated in the bond_get_stats() and this function would be
executed concurrently. So, the lock is needed.
Bonding interfaces would be nested.
So, either stats_lock should use dynamic lockdep class key or stats_lock
should be used by spin_lock_nested(). In the current code, stats_lock is
using a dynamic lockdep class key.
But there is no updating stats_lock_key routine So, lockdep warning
will occur.
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ip link set bond0 master bond1
ip link set bond0 nomaster
ip link set bond1 master bond0
Splat looks like:
[ 38.420603][ T957] 5.5.0+ #394 Not tainted
[ 38.421074][ T957] ------------------------------------------------------
[ 38.421837][ T957] ip/957 is trying to acquire lock:
[ 38.422399][ T957] ffff888063262cd8 (&bond->stats_lock_key#2){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.423528][ T957]
[ 38.423528][ T957] but task is already holding lock:
[ 38.424526][ T957] ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.426075][ T957]
[ 38.426075][ T957] which lock already depends on the new lock.
[ 38.426075][ T957]
[ 38.428536][ T957]
[ 38.428536][ T957] the existing dependency chain (in reverse order) is:
[ 38.429475][ T957]
[ 38.429475][ T957] -> #1 (&bond->stats_lock_key){+.+.}:
[ 38.430273][ T957] _raw_spin_lock+0x30/0x70
[ 38.430812][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ 38.431451][ T957] dev_get_stats+0x1ec/0x270
[ 38.432088][ T957] bond_get_stats+0x1a5/0x4d0 [bonding]
[ 38.432767][ T957] dev_get_stats+0x1ec/0x270
[ 38.433322][ T957] rtnl_fill_stats+0x44/0xbe0
[ 38.433866][ T957] rtnl_fill_ifinfo+0xeb2/0x3720
[ 38.434474][ T957] rtmsg_ifinfo_build_skb+0xca/0x170
[ 38.435081][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0
[ 38.436848][ T957] rtnetlink_event+0xcd/0x120
[ 38.437455][ T957] notifier_call_chain+0x90/0x160
[ 38.438067][ T957] netdev_change_features+0x74/0xa0
[ 38.438708][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[ 38.439522][ T957] bond_enslave+0x3639/0x47b0 [bonding]
[ 38.440225][ T957] do_setlink+0xaab/0x2ef0
[ 38.440786][ T957] __rtnl_newlink+0x9c5/0x1270
[ 38.441463][ T957] rtnl_newlink+0x65/0x90
[ 38.442075][ T957] rtnetlink_rcv_msg+0x4a8/0x890
[ 38.442774][ T957] netlink_rcv_skb+0x121/0x350
[ 38.443451][ T957] netlink_unicast+0x42e/0x610
[ 38.444282][ T957] netlink_sendmsg+0x65a/0xb90
[ 38.444992][ T957] ____sys_sendmsg+0x5ce/0x7a0
[ 38.445679][ T957] ___sys_sendmsg+0x10f/0x1b0
[ 38.446365][ T957] __sys_sendmsg+0xc6/0x150
[ 38.447007][ T957] do_syscall_64+0x99/0x4f0
[ 38.447668][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 38.448538][ T957]
[ 38.448538][ T957] -> #0 (&bond->stats_lock_key#2){+.+.}:
[ 38.449554][ T957] __lock_acquire+0x2d8d/0x3de0
[ 38.450148][ T957] lock_acquire+0x164/0x3b0
[ 38.450711][ T957] _raw_spin_lock+0x30/0x70
[ 38.451292][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ 38.451950][ T957] dev_get_stats+0x1ec/0x270
[ 38.452425][ T957] bond_get_stats+0x1a5/0x4d0 [bonding]
[ 38.453362][ T957] dev_get_stats+0x1ec/0x270
[ 38.453825][ T957] rtnl_fill_stats+0x44/0xbe0
[ 38.454390][ T957] rtnl_fill_ifinfo+0xeb2/0x3720
[ 38.456257][ T957] rtmsg_ifinfo_build_skb+0xca/0x170
[ 38.456998][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0
[ 38.459351][ T957] rtnetlink_event+0xcd/0x120
[ 38.460086][ T957] notifier_call_chain+0x90/0x160
[ 38.460829][ T957] netdev_change_features+0x74/0xa0
[ 38.461752][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[ 38.462705][ T957] bond_enslave+0x3639/0x47b0 [bonding]
[ 38.463476][ T957] do_setlink+0xaab/0x2ef0
[ 38.464141][ T957] __rtnl_newlink+0x9c5/0x1270
[ 38.464897][ T957] rtnl_newlink+0x65/0x90
[ 38.465522][ T957] rtnetlink_rcv_msg+0x4a8/0x890
[ 38.466215][ T957] netlink_rcv_skb+0x121/0x350
[ 38.466895][ T957] netlink_unicast+0x42e/0x610
[ 38.467583][ T957] netlink_sendmsg+0x65a/0xb90
[ 38.468285][ T957] ____sys_sendmsg+0x5ce/0x7a0
[ 38.469202][ T957] ___sys_sendmsg+0x10f/0x1b0
[ 38.469884][ T957] __sys_sendmsg+0xc6/0x150
[ 38.470587][ T957] do_syscall_64+0x99/0x4f0
[ 38.471245][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 38.472093][ T957]
[ 38.472093][ T957] other info that might help us debug this:
[ 38.472093][ T957]
[ 38.473438][ T957] Possible unsafe locking scenario:
[ 38.473438][ T957]
[ 38.474898][ T957] CPU0 CPU1
[ 38.476234][ T957] ---- ----
[ 38.480171][ T957] lock(&bond->stats_lock_key);
[ 38.480808][ T957] lock(&bond->stats_lock_key#2);
[ 38.481791][ T957] lock(&bond->stats_lock_key);
[ 38.482754][ T957] lock(&bond->stats_lock_key#2);
[ 38.483416][ T957]
[ 38.483416][ T957] *** DEADLOCK ***
[ 38.483416][ T957]
[ 38.484505][ T957] 3 locks held by ip/957:
[ 38.485048][ T957] #0: ffffffffbccf6230 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x457/0x890
[ 38.486198][ T957] #1: ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.487625][ T957] #2: ffffffffbc9254c0 (rcu_read_lock){....}, at: bond_get_stats+0x5/0x4d0 [bonding]
[ 38.488897][ T957]
[ 38.488897][ T957] stack backtrace:
[ 38.489646][ T957] CPU: 1 PID: 957 Comm: ip Not tainted 5.5.0+ #394
[ 38.490497][ T957] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 38.492810][ T957] Call Trace:
[ 38.493219][ T957] dump_stack+0x96/0xdb
[ 38.493709][ T957] check_noncircular+0x371/0x450
[ 38.494344][ T957] ? lookup_address+0x60/0x60
[ 38.494923][ T957] ? print_circular_bug.isra.35+0x310/0x310
[ 38.495699][ T957] ? hlock_class+0x130/0x130
[ 38.496334][ T957] ? __lock_acquire+0x2d8d/0x3de0
[ 38.496979][ T957] __lock_acquire+0x2d8d/0x3de0
[ 38.497607][ T957] ? register_lock_class+0x14d0/0x14d0
[ 38.498333][ T957] ? check_chain_key+0x236/0x5d0
[ 38.499003][ T957] lock_acquire+0x164/0x3b0
[ 38.499800][ T957] ? bond_get_stats+0x90/0x4d0 [bonding]
[ 38.500706][ T957] _raw_spin_lock+0x30/0x70
[ 38.501435][ T957] ? bond_get_stats+0x90/0x4d0 [bonding]
[ 38.502311][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ ... ]
But, there is another problem.
The dynamic lockdep class key is protected by RTNL, but bond_get_stats()
would be called outside of RTNL.
So, it would use an invalid dynamic lockdep class key.
In order to fix this issue, stats_lock uses spin_lock_nested() instead of
a dynamic lockdep key.
The bond_get_stats() calls bond_get_lowest_level_rcu() to get the correct
nest level value, which will be used by spin_lock_nested().
The "dev->lower_level" indicates lower nest level value, but this value
is invalid outside of RTNL.
So, bond_get_lowest_level_rcu() returns valid lower nest level value in
the RCU critical section.
bond_get_lowest_level_rcu() will be work only when LOCKDEP is enabled.
Fixes: 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-15 18:50:40 +08:00
#ifdef CONFIG_LOCKDEP
static int bond_get_lowest_level_rcu(struct net_device *dev)
{
	struct net_device *ldev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
	struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
	int cur = 0, max = 0;

	now = dev;
	iter = &dev->adj_list.lower;

	while (1) {
		next = NULL;
		while (1) {
			ldev = netdev_next_lower_dev_rcu(now, &iter);
			if (!ldev)
				break;

			next = ldev;
			niter = &ldev->adj_list.lower;
			dev_stack[cur] = now;
			iter_stack[cur++] = iter;
			if (max <= cur)
				max = cur;
			break;
		}

		if (!next) {
			if (!cur)
				return max;
			next = dev_stack[--cur];
			niter = iter_stack[cur];
		}

		now = next;
		iter = niter;
	}

	return max;
}
#endif
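bond_get_lowest_level_rcu() walks the lower-device graph depth first with an explicit stack rather than recursion, and records the deepest level reached. The sketch below shows the same technique in user space on a small tree. `struct node`, `MAX_DEPTH`, `lowest_level`, and the fixed-size `lower` array are all hypothetical stand-ins; the kernel code iterates adjacency lists under RCU instead.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8	/* hypothetical bound, stands in for MAX_NEST_DEV */

/* Hypothetical device node with up to four lower devices. */
struct node {
	struct node *lower[4];
	size_t n_lower;
};

/* Iterative depth-first walk; returns the deepest level below root,
 * or -1 if the tree is deeper than the explicit stack allows.
 */
int lowest_level(struct node *root)
{
	struct node *node_stack[MAX_DEPTH + 1];
	size_t idx_stack[MAX_DEPTH + 1];
	struct node *now = root;
	size_t idx = 0;
	int cur = 0, max = 0;

	assert(root);

	for (;;) {
		if (idx < now->n_lower) {
			/* Descend: remember where to resume in the parent. */
			if (cur > MAX_DEPTH)
				return -1;
			node_stack[cur] = now;
			idx_stack[cur] = idx + 1;
			cur++;
			if (cur > max)
				max = cur;
			now = now->lower[idx];
			idx = 0;
		} else {
			/* No more lower devices here: go back up. */
			if (cur == 0)
				return max;
			cur--;
			now = node_stack[cur];
			idx = idx_stack[cur];
		}
	}
}
```

For a stack such as bond1 over bond0 over two Ethernet devices, the walk reports 2 for bond1, 1 for bond0, and 0 for a device with nothing below it. In the driver, a value of this kind is what gets passed to spin_lock_nested() as the subclass.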
2017-01-07 11:12:52 +08:00
static void bond_get_stats(struct net_device *bond_dev,
			   struct rtnl_link_stats64 *stats)
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
	struct bonding *bond = netdev_priv(bond_dev);
2010-07-08 05:58:56 +08:00
	struct rtnl_link_stats64 temp;
2013-09-25 15:20:14 +08:00
	struct list_head *iter;
2005-04-17 06:20:36 +08:00
	struct slave *slave;
2020-02-15 18:50:40 +08:00
	int nest_level = 0;
2005-04-17 06:20:36 +08:00
2016-03-18 08:23:36 +08:00
	rcu_read_lock();
bonding: fix lockdep warning in bond_get_stats()
In the "struct bonding", there is stats_lock.
This lock protects "bond_stats" in the "struct bonding".
bond_stats is updated in the bond_get_stats() and this function would be
executed concurrently. So, the lock is needed.
Bonding interfaces would be nested.
So, either stats_lock should use dynamic lockdep class key or stats_lock
should be used by spin_lock_nested(). In the current code, stats_lock is
using a dynamic lockdep class key.
But there is no updating stats_lock_key routine So, lockdep warning
will occur.
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ip link set bond0 master bond1
ip link set bond0 nomaster
ip link set bond1 master bond0
Splat looks like:
[ 38.420603][ T957] 5.5.0+ #394 Not tainted
[ 38.421074][ T957] ------------------------------------------------------
[ 38.421837][ T957] ip/957 is trying to acquire lock:
[ 38.422399][ T957] ffff888063262cd8 (&bond->stats_lock_key#2){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.423528][ T957]
[ 38.423528][ T957] but task is already holding lock:
[ 38.424526][ T957] ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.426075][ T957]
[ 38.426075][ T957] which lock already depends on the new lock.
[ 38.426075][ T957]
[ 38.428536][ T957]
[ 38.428536][ T957] the existing dependency chain (in reverse order) is:
[ 38.429475][ T957]
[ 38.429475][ T957] -> #1 (&bond->stats_lock_key){+.+.}:
[ 38.430273][ T957] _raw_spin_lock+0x30/0x70
[ 38.430812][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ 38.431451][ T957] dev_get_stats+0x1ec/0x270
[ 38.432088][ T957] bond_get_stats+0x1a5/0x4d0 [bonding]
[ 38.432767][ T957] dev_get_stats+0x1ec/0x270
[ 38.433322][ T957] rtnl_fill_stats+0x44/0xbe0
[ 38.433866][ T957] rtnl_fill_ifinfo+0xeb2/0x3720
[ 38.434474][ T957] rtmsg_ifinfo_build_skb+0xca/0x170
[ 38.435081][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0
[ 38.436848][ T957] rtnetlink_event+0xcd/0x120
[ 38.437455][ T957] notifier_call_chain+0x90/0x160
[ 38.438067][ T957] netdev_change_features+0x74/0xa0
[ 38.438708][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[ 38.439522][ T957] bond_enslave+0x3639/0x47b0 [bonding]
[ 38.440225][ T957] do_setlink+0xaab/0x2ef0
[ 38.440786][ T957] __rtnl_newlink+0x9c5/0x1270
[ 38.441463][ T957] rtnl_newlink+0x65/0x90
[ 38.442075][ T957] rtnetlink_rcv_msg+0x4a8/0x890
[ 38.442774][ T957] netlink_rcv_skb+0x121/0x350
[ 38.443451][ T957] netlink_unicast+0x42e/0x610
[ 38.444282][ T957] netlink_sendmsg+0x65a/0xb90
[ 38.444992][ T957] ____sys_sendmsg+0x5ce/0x7a0
[ 38.445679][ T957] ___sys_sendmsg+0x10f/0x1b0
[ 38.446365][ T957] __sys_sendmsg+0xc6/0x150
[ 38.447007][ T957] do_syscall_64+0x99/0x4f0
[ 38.447668][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 38.448538][ T957]
[ 38.448538][ T957] -> #0 (&bond->stats_lock_key#2){+.+.}:
[ 38.449554][ T957] __lock_acquire+0x2d8d/0x3de0
[ 38.450148][ T957] lock_acquire+0x164/0x3b0
[ 38.450711][ T957] _raw_spin_lock+0x30/0x70
[ 38.451292][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ 38.451950][ T957] dev_get_stats+0x1ec/0x270
[ 38.452425][ T957] bond_get_stats+0x1a5/0x4d0 [bonding]
[ 38.453362][ T957] dev_get_stats+0x1ec/0x270
[ 38.453825][ T957] rtnl_fill_stats+0x44/0xbe0
[ 38.454390][ T957] rtnl_fill_ifinfo+0xeb2/0x3720
[ 38.456257][ T957] rtmsg_ifinfo_build_skb+0xca/0x170
[ 38.456998][ T957] rtmsg_ifinfo_event.part.33+0x1b/0xb0
[ 38.459351][ T957] rtnetlink_event+0xcd/0x120
[ 38.460086][ T957] notifier_call_chain+0x90/0x160
[ 38.460829][ T957] netdev_change_features+0x74/0xa0
[ 38.461752][ T957] bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[ 38.462705][ T957] bond_enslave+0x3639/0x47b0 [bonding]
[ 38.463476][ T957] do_setlink+0xaab/0x2ef0
[ 38.464141][ T957] __rtnl_newlink+0x9c5/0x1270
[ 38.464897][ T957] rtnl_newlink+0x65/0x90
[ 38.465522][ T957] rtnetlink_rcv_msg+0x4a8/0x890
[ 38.466215][ T957] netlink_rcv_skb+0x121/0x350
[ 38.466895][ T957] netlink_unicast+0x42e/0x610
[ 38.467583][ T957] netlink_sendmsg+0x65a/0xb90
[ 38.468285][ T957] ____sys_sendmsg+0x5ce/0x7a0
[ 38.469202][ T957] ___sys_sendmsg+0x10f/0x1b0
[ 38.469884][ T957] __sys_sendmsg+0xc6/0x150
[ 38.470587][ T957] do_syscall_64+0x99/0x4f0
[ 38.471245][ T957] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 38.472093][ T957]
[ 38.472093][ T957] other info that might help us debug this:
[ 38.472093][ T957]
[ 38.473438][ T957] Possible unsafe locking scenario:
[ 38.473438][ T957]
[ 38.474898][ T957] CPU0 CPU1
[ 38.476234][ T957] ---- ----
[ 38.480171][ T957] lock(&bond->stats_lock_key);
[ 38.480808][ T957] lock(&bond->stats_lock_key#2);
[ 38.481791][ T957] lock(&bond->stats_lock_key);
[ 38.482754][ T957] lock(&bond->stats_lock_key#2);
[ 38.483416][ T957]
[ 38.483416][ T957] *** DEADLOCK ***
[ 38.483416][ T957]
[ 38.484505][ T957] 3 locks held by ip/957:
[ 38.485048][ T957] #0: ffffffffbccf6230 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x457/0x890
[ 38.486198][ T957] #1: ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[ 38.487625][ T957] #2: ffffffffbc9254c0 (rcu_read_lock){....}, at: bond_get_stats+0x5/0x4d0 [bonding]
[ 38.488897][ T957]
[ 38.488897][ T957] stack backtrace:
[ 38.489646][ T957] CPU: 1 PID: 957 Comm: ip Not tainted 5.5.0+ #394
[ 38.490497][ T957] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 38.492810][ T957] Call Trace:
[ 38.493219][ T957] dump_stack+0x96/0xdb
[ 38.493709][ T957] check_noncircular+0x371/0x450
[ 38.494344][ T957] ? lookup_address+0x60/0x60
[ 38.494923][ T957] ? print_circular_bug.isra.35+0x310/0x310
[ 38.495699][ T957] ? hlock_class+0x130/0x130
[ 38.496334][ T957] ? __lock_acquire+0x2d8d/0x3de0
[ 38.496979][ T957] __lock_acquire+0x2d8d/0x3de0
[ 38.497607][ T957] ? register_lock_class+0x14d0/0x14d0
[ 38.498333][ T957] ? check_chain_key+0x236/0x5d0
[ 38.499003][ T957] lock_acquire+0x164/0x3b0
[ 38.499800][ T957] ? bond_get_stats+0x90/0x4d0 [bonding]
[ 38.500706][ T957] _raw_spin_lock+0x30/0x70
[ 38.501435][ T957] ? bond_get_stats+0x90/0x4d0 [bonding]
[ 38.502311][ T957] bond_get_stats+0x90/0x4d0 [bonding]
[ ... ]
But, there is another problem.
The dynamic lockdep class key is protected by RTNL, but bond_get_stats()
would be called outside of RTNL.
So, it would use an invalid dynamic lockdep class key.
In order to fix this issue, stats_lock uses spin_lock_nested() instead of
a dynamic lockdep key.
The bond_get_stats() calls bond_get_lowest_level_rcu() to get the correct
nest level value, which will be used by spin_lock_nested().
The "dev->lower_level" indicates lower nest level value, but this value
is invalid outside of RTNL.
So, bond_get_lowest_level_rcu() returns valid lower nest level value in
the RCU critical section.
bond_get_lowest_level_rcu() will work only when LOCKDEP is enabled.
Fixes: 089bca2caed0 ("bonding: use dynamic lockdep key instead of subclass")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
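The nest-level idea described in the commit message above can be sketched in userspace as a toy model. This is not the kernel implementation of bond_get_lowest_level_rcu(). It only shows how a device stacked on top of other devices ends up with a larger level than anything below it, which is the kind of value passed to spin_lock_nested() as the subclass. The names struct toy_dev, MAX_LOWER and toy_lowest_level() are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOWER 4 /* hypothetical limit, chosen only for this sketch */

/* Hypothetical stand-in for a stacked network device. */
struct toy_dev {
	const char *name;
	struct toy_dev *lower[MAX_LOWER]; /* devices stacked below this one */
	size_t n_lower;
};

/* Depth of the deepest chain of lower devices; 0 when nothing is below. */
int toy_lowest_level(const struct toy_dev *dev)
{
	int max = 0;

	assert(dev != NULL);
	for (size_t i = 0; i < dev->n_lower; i++) {
		int level = 1 + toy_lowest_level(dev->lower[i]);

		if (level > max)
			max = level;
	}
	return max;
}
```

With bond1 stacked on bond0, the two devices get different levels, so a lock validator keyed on that level can tell the outer stats_lock from the inner one.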
2020-02-15 18:50:40 +08:00
# ifdef CONFIG_LOCKDEP
nest_level = bond_get_lowest_level_rcu ( bond_dev ) ;
# endif
spin_lock_nested ( & bond - > stats_lock , nest_level ) ;
memcpy ( stats , & bond - > bond_stats , sizeof ( * stats ) ) ;
2016-03-18 08:23:36 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
const struct rtnl_link_stats64 * new =
2010-07-08 05:58:56 +08:00
dev_get_stats ( slave - > dev , & temp ) ;
2016-03-18 08:23:36 +08:00
bond_fold_stats ( stats , new , & slave - > slave_stats ) ;
2014-09-29 10:34:37 +08:00
/* save off the slave stats for the next run */
2016-03-18 08:23:36 +08:00
memcpy ( & slave - > slave_stats , new , sizeof ( * new ) ) ;
2014-09-29 10:34:37 +08:00
}
2016-03-18 08:23:36 +08:00
2014-09-29 10:34:37 +08:00
memcpy ( & bond - > bond_stats , stats , sizeof ( * stats ) ) ;
2016-03-18 08:23:36 +08:00
spin_unlock ( & bond - > stats_lock ) ;
2020-02-15 18:50:40 +08:00
rcu_read_unlock ( ) ;
2005-04-17 06:20:36 +08:00
}
static int bond_do_ioctl ( struct net_device * bond_dev , struct ifreq * ifr , int cmd )
{
2013-10-18 23:43:36 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 06:20:36 +08:00
struct net_device * slave_dev = NULL ;
struct ifbond k_binfo ;
struct ifbond __user * u_binfo = NULL ;
struct ifslave k_sinfo ;
struct ifslave __user * u_sinfo = NULL ;
struct mii_ioctl_data * mii = NULL ;
2014-01-22 21:53:35 +08:00
struct bond_opt_value newval ;
2013-02-01 00:31:00 +08:00
struct net * net ;
2005-04-17 06:20:36 +08:00
int res = 0 ;
2014-07-16 01:35:58 +08:00
netdev_dbg ( bond_dev , " bond_ioctl: cmd=%d \n " , cmd ) ;
2005-04-17 06:20:36 +08:00
switch ( cmd ) {
case SIOCGMIIPHY :
mii = if_mii ( ifr ) ;
2009-06-13 03:02:48 +08:00
if ( ! mii )
2005-04-17 06:20:36 +08:00
return - EINVAL ;
2009-06-13 03:02:48 +08:00
2005-04-17 06:20:36 +08:00
mii - > phy_id = 0 ;
/* Fall Through */
case SIOCGMIIREG :
2014-09-15 23:19:34 +08:00
/* We do this again just in case we were called by SIOCGMIIREG
2005-04-17 06:20:36 +08:00
* instead of SIOCGMIIPHY .
*/
mii = if_mii ( ifr ) ;
2009-06-13 03:02:48 +08:00
if ( ! mii )
2005-04-17 06:20:36 +08:00
return - EINVAL ;
2009-06-13 03:02:48 +08:00
2005-04-17 06:20:36 +08:00
if ( mii - > reg_num = = 1 ) {
mii - > val_out = 0 ;
2009-06-13 03:02:48 +08:00
if ( netif_carrier_ok ( bond - > dev ) )
2005-04-17 06:20:36 +08:00
mii - > val_out = BMSR_LSTATUS ;
}
return 0 ;
case BOND_INFO_QUERY_OLD :
case SIOCBONDINFOQUERY :
u_binfo = ( struct ifbond __user * ) ifr - > ifr_data ;
2009-06-13 03:02:48 +08:00
if ( copy_from_user ( & k_binfo , u_binfo , sizeof ( ifbond ) ) )
2005-04-17 06:20:36 +08:00
return - EFAULT ;
2017-02-03 12:46:21 +08:00
bond_info_query ( bond_dev , & k_binfo ) ;
if ( copy_to_user ( u_binfo , & k_binfo , sizeof ( ifbond ) ) )
2009-06-13 03:02:48 +08:00
return - EFAULT ;
2005-04-17 06:20:36 +08:00
2017-02-03 12:46:21 +08:00
return 0 ;
2005-04-17 06:20:36 +08:00
case BOND_SLAVE_INFO_QUERY_OLD :
case SIOCBONDSLAVEINFOQUERY :
u_sinfo = ( struct ifslave __user * ) ifr - > ifr_data ;
2009-06-13 03:02:48 +08:00
if ( copy_from_user ( & k_sinfo , u_sinfo , sizeof ( ifslave ) ) )
2005-04-17 06:20:36 +08:00
return - EFAULT ;
res = bond_slave_info_query ( bond_dev , & k_sinfo ) ;
2009-06-13 03:02:48 +08:00
if ( res = = 0 & &
copy_to_user ( u_sinfo , & k_sinfo , sizeof ( ifslave ) ) )
return - EFAULT ;
2005-04-17 06:20:36 +08:00
return res ;
default :
break ;
}
2013-02-01 00:31:00 +08:00
net = dev_net ( bond_dev ) ;
if ( ! ns_capable ( net - > user_ns , CAP_NET_ADMIN ) )
2005-04-17 06:20:36 +08:00
return - EPERM ;
2014-01-15 10:23:37 +08:00
slave_dev = __dev_get_by_name ( net , ifr - > ifr_slave ) ;
2005-04-17 06:20:36 +08:00
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave_dev , " slave_dev=%p: \n " , slave_dev ) ;
2005-04-17 06:20:36 +08:00
2009-06-13 03:02:48 +08:00
if ( ! slave_dev )
2014-01-15 10:23:37 +08:00
return - ENODEV ;
2005-04-17 06:20:36 +08:00
2014-01-15 10:23:37 +08:00
switch ( cmd ) {
case BOND_ENSLAVE_OLD :
case SIOCBONDENSLAVE :
2017-10-05 08:48:46 +08:00
res = bond_enslave ( bond_dev , slave_dev , NULL ) ;
2014-01-15 10:23:37 +08:00
break ;
case BOND_RELEASE_OLD :
case SIOCBONDRELEASE :
res = bond_release ( bond_dev , slave_dev ) ;
2020-02-15 18:50:08 +08:00
if ( ! res )
netdev_update_lockdep_key ( slave_dev ) ;
2014-01-15 10:23:37 +08:00
break ;
case BOND_SETHWADDR_OLD :
case SIOCBONDSETHWADDR :
2018-12-13 19:54:44 +08:00
res = bond_set_dev_addr ( bond_dev , slave_dev ) ;
2014-01-15 10:23:37 +08:00
break ;
case BOND_CHANGE_ACTIVE_OLD :
case SIOCBONDCHANGEACTIVE :
2014-01-22 21:53:35 +08:00
bond_opt_initstr ( & newval , slave_dev - > name ) ;
2017-05-27 22:14:35 +08:00
res = __bond_opt_set_notify ( bond , BOND_OPT_ACTIVE_SLAVE ,
& newval ) ;
2014-01-15 10:23:37 +08:00
break ;
default :
res = - EOPNOTSUPP ;
2005-04-17 06:20:36 +08:00
}
return res ;
}
2011-08-16 11:15:04 +08:00
static void bond_change_rx_flags ( struct net_device * bond_dev , int change )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 06:20:36 +08:00
2011-08-16 11:15:04 +08:00
if ( change & IFF_PROMISC )
bond_set_promiscuity ( bond ,
bond_dev - > flags & IFF_PROMISC ? 1 : - 1 ) ;
2009-06-13 03:02:48 +08:00
2011-08-16 11:15:04 +08:00
if ( change & IFF_ALLMULTI )
bond_set_allmulti ( bond ,
bond_dev - > flags & IFF_ALLMULTI ? 1 : - 1 ) ;
}
2005-04-17 06:20:36 +08:00
2013-05-31 19:57:30 +08:00
static void bond_set_rx_mode ( struct net_device * bond_dev )
2011-08-16 11:15:04 +08:00
{
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2013-05-31 19:57:30 +08:00
struct slave * slave ;
2005-04-17 06:20:36 +08:00
2013-09-29 03:18:56 +08:00
rcu_read_lock ( ) ;
2014-05-16 03:39:54 +08:00
if ( bond_uses_primary ( bond ) ) {
2013-09-29 03:18:56 +08:00
slave = rcu_dereference ( bond - > curr_active_slave ) ;
2013-05-31 19:57:30 +08:00
if ( slave ) {
dev_uc_sync ( slave - > dev , bond_dev ) ;
dev_mc_sync ( slave - > dev , bond_dev ) ;
}
} else {
2013-09-29 03:18:56 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2013-05-31 19:57:30 +08:00
dev_uc_sync_multiple ( slave - > dev , bond_dev ) ;
dev_mc_sync_multiple ( slave - > dev , bond_dev ) ;
}
2005-04-17 06:20:36 +08:00
}
2013-09-29 03:18:56 +08:00
rcu_read_unlock ( ) ;
2005-04-17 06:20:36 +08:00
}
2012-04-04 06:56:20 +08:00
static int bond_neigh_init ( struct neighbour * n )
2008-11-21 12:14:53 +08:00
{
2012-04-04 06:56:20 +08:00
struct bonding * bond = netdev_priv ( n - > dev ) ;
const struct net_device_ops * slave_ops ;
struct neigh_parms parms ;
2013-08-01 22:54:47 +08:00
struct slave * slave ;
2019-12-08 06:10:34 +08:00
int ret = 0 ;
2012-04-04 06:56:20 +08:00
2019-12-08 06:10:34 +08:00
rcu_read_lock ( ) ;
slave = bond_first_slave_rcu ( bond ) ;
2012-04-04 06:56:20 +08:00
if ( ! slave )
2019-12-08 06:10:34 +08:00
goto out ;
2012-04-04 06:56:20 +08:00
slave_ops = slave - > dev - > netdev_ops ;
if ( ! slave_ops - > ndo_neigh_setup )
2019-12-08 06:10:34 +08:00
goto out ;
2012-04-04 06:56:20 +08:00
2019-12-08 06:10:34 +08:00
/* TODO: find another way [1] to implement this.
* Passing a zeroed structure is fragile ,
* but at least we do not pass garbage .
*
* [ 1 ] One way would be that ndo_neigh_setup ( ) never touches
* struct neigh_parms , but propagates the new neigh_setup ( )
* back to ___neigh_create ( ) / neigh_parms_alloc ( )
*/
memset ( & parms , 0 , sizeof ( parms ) ) ;
2012-04-04 06:56:20 +08:00
ret = slave_ops - > ndo_neigh_setup ( slave - > dev , & parms ) ;
2019-12-08 06:10:34 +08:00
if ( ret )
goto out ;
2012-04-04 06:56:20 +08:00
2019-12-08 06:10:34 +08:00
if ( parms . neigh_setup )
ret = parms . neigh_setup ( n ) ;
out :
rcu_read_unlock ( ) ;
return ret ;
2012-04-04 06:56:20 +08:00
}
2014-09-15 23:19:34 +08:00
/* The bonding ndo_neigh_setup is called at init time before any
2012-04-04 06:56:20 +08:00
* slave exists . So we must declare proxy setup function which will
* be used at run time to resolve the actual slave neigh param setup .
2013-08-03 01:07:39 +08:00
*
* It ' s also called by master devices ( such as vlans ) to setup their
* underlying devices . In that case - do nothing , we ' re already set up from
* our init .
2012-04-04 06:56:20 +08:00
*/
static int bond_neigh_setup ( struct net_device * dev ,
struct neigh_parms * parms )
{
2013-08-03 01:07:39 +08:00
/* modify only our neigh_parms */
if ( parms - > dev = = dev )
parms - > neigh_setup = bond_neigh_init ;
2008-11-21 12:14:53 +08:00
return 0 ;
}
2014-09-15 23:19:34 +08:00
/* Change the MTU of all of a master's slaves to match the master */
2005-04-17 06:20:36 +08:00
static int bond_change_mtu ( struct net_device * bond_dev , int new_mtu )
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:13 +08:00
struct slave * slave , * rollback_slave ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2005-04-17 06:20:36 +08:00
int res = 0 ;
2014-07-16 01:35:58 +08:00
netdev_dbg ( bond_dev , " bond=%p, new_mtu=%d \n " , bond , new_mtu ) ;
2005-04-17 06:20:36 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave - > dev , " s %p c_m %p \n " ,
2014-07-16 01:35:58 +08:00
slave , slave - > dev - > netdev_ops - > ndo_change_mtu ) ;
2005-11-10 02:36:50 +08:00
2005-04-17 06:20:36 +08:00
res = dev_set_mtu ( slave - > dev , new_mtu ) ;
if ( res ) {
/* If we failed to set the slave's mtu to the new value
* we must abort the operation even in ACTIVE_BACKUP
* mode , because if we allow the backup slaves to have
* different mtu values than the active slave we ' ll
* need to change their mtu when doing a failover . That
* means changing their mtu from timer context , which
* is probably not a good idea .
*/
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave - > dev , " err %d setting mtu to %d \n " ,
res , new_mtu ) ;
2005-04-17 06:20:36 +08:00
goto unwind ;
}
}
bond_dev - > mtu = new_mtu ;
return 0 ;
unwind :
/* unwind from head to the slave that failed */
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , rollback_slave , iter ) {
2005-04-17 06:20:36 +08:00
int tmp_res ;
2013-09-25 15:20:13 +08:00
if ( rollback_slave = = slave )
break ;
tmp_res = dev_set_mtu ( rollback_slave - > dev , bond_dev - > mtu ) ;
2019-06-07 22:59:29 +08:00
if ( tmp_res )
slave_dbg ( bond_dev , rollback_slave - > dev , " unwind err %d \n " ,
tmp_res ) ;
2005-04-17 06:20:36 +08:00
}
return res ;
}
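The all-or-nothing pattern used by bond_change_mtu() above (apply the new value to each slave in turn, and on the first failure restore the old value on the slaves already changed) can be sketched in plain userspace C. This is an illustration only: struct toy_port, toy_port_set_mtu() and toy_change_mtu_all() are hypothetical names, and the failure rule (a per-port maximum) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical port: refuses any MTU above max_mtu. */
struct toy_port {
	int mtu;
	int max_mtu;
};

int toy_port_set_mtu(struct toy_port *port, int mtu)
{
	if (mtu > port->max_mtu)
		return -1;
	port->mtu = mtu;
	return 0;
}

/* Set new_mtu on every port, or restore the old value on the ports already changed. */
int toy_change_mtu_all(struct toy_port *ports, size_t n,
		       int *master_mtu, int new_mtu)
{
	size_t i, j;
	int res = 0;

	assert(ports != NULL && master_mtu != NULL);
	for (i = 0; i < n; i++) {
		res = toy_port_set_mtu(&ports[i], new_mtu);
		if (res)
			goto unwind;
	}
	*master_mtu = new_mtu;
	return 0;

unwind:
	/* roll back from the head up to, but not including, the port that failed */
	for (j = 0; j < i; j++)
		(void)toy_port_set_mtu(&ports[j], *master_mtu);
	return res;
}
```

The master value is updated only after every port has accepted the change, so it can serve as the rollback value if any port refuses.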
2014-09-15 23:19:34 +08:00
/* Change HW address
2005-04-17 06:20:36 +08:00
*
* Note that many devices must be down to change the HW address , and
* downing the master releases all slaves . We can make bonds full of
* bonding devices to test this , however .
*/
static int bond_set_mac_address ( struct net_device * bond_dev , void * addr )
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:13 +08:00
struct slave * slave , * rollback_slave ;
bonding: attempt to better support longer hw addresses
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e7731522, so the alb code is where most of the changes are.
One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over device with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.
Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0
Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0
Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
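The storage-size argument in the commit message above can be checked with a small userspace sketch. It is not the driver's bond_hw_addr_copy(). The helper toy_hw_addr_copy() and the constant TOY_INFINIBAND_ALEN are hypothetical names for this example, and the explicit destination-size check is an addition of the sketch. It illustrates why a 20-octet address does not fit in the sa_data field of struct sockaddr but does fit in struct sockaddr_storage.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

#define TOY_INFINIBAND_ALEN 20 /* address length quoted in the commit text */

/* Copy len bytes of hardware address; 0 on success, -1 if dst is too small. */
int toy_hw_addr_copy(unsigned char *dst, size_t dst_size,
		     const unsigned char *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	if (len > dst_size)
		return -1; /* would overflow, as with a stack struct sockaddr */
	memcpy(dst, src, len);
	return 0;
}
```

A caller would pass sizeof(sa.sa_data) for a struct sockaddr destination, or the space left after ss_family for a struct sockaddr_storage destination, so the longer address is rejected in the first case and accepted in the second.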
2017-04-05 05:32:42 +08:00
struct sockaddr_storage * ss = addr , tmp_ss ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2005-04-17 06:20:36 +08:00
int res = 0 ;
2014-05-16 03:39:55 +08:00
if ( BOND_MODE ( bond ) = = BOND_MODE_ALB )
2008-11-20 13:56:05 +08:00
return bond_alb_set_mac_address ( bond_dev , addr ) ;
2019-06-07 22:59:29 +08:00
netdev_dbg ( bond_dev , " %s: bond=%p \n " , __func__ , bond ) ;
2005-04-17 06:20:36 +08:00
2013-05-31 19:57:31 +08:00
/* If fail_over_mac is enabled, do nothing and return success.
* Returning an error causes ifenslave to fail .
2007-10-10 10:57:24 +08:00
*/
2014-01-25 13:00:57 +08:00
if ( bond - > params . fail_over_mac & &
2014-05-16 03:39:55 +08:00
BOND_MODE ( bond ) = = BOND_MODE_ACTIVEBACKUP )
2007-10-10 10:57:24 +08:00
return 0 ;
2007-10-10 10:43:39 +08:00
2017-04-05 05:32:42 +08:00
if ( ! is_valid_ether_addr ( ss - > __data ) )
2005-04-17 06:20:36 +08:00
return - EADDRNOTAVAIL ;
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave - > dev , " %s: slave=%p \n " ,
__func__ , slave ) ;
2018-12-13 19:54:30 +08:00
res = dev_set_mac_address ( slave - > dev , addr , NULL ) ;
2005-04-17 06:20:36 +08:00
if ( res ) {
/* TODO: consider downing the slave
* and retry ?
* User should expect communications
* breakage anyway until ARP finishes
* updating , so . . .
*/
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , slave - > dev , " %s: err %d \n " ,
__func__ , res ) ;
2005-04-17 06:20:36 +08:00
goto unwind ;
}
}
/* success */
2017-04-05 05:32:42 +08:00
memcpy ( bond_dev - > dev_addr , ss - > __data , bond_dev - > addr_len ) ;
2005-04-17 06:20:36 +08:00
return 0 ;
unwind :
2017-04-05 05:32:42 +08:00
memcpy ( tmp_ss . __data , bond_dev - > dev_addr , bond_dev - > addr_len ) ;
tmp_ss . ss_family = bond_dev - > type ;
2005-04-17 06:20:36 +08:00
/* unwind from head to the slave that failed */
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , rollback_slave , iter ) {
2005-04-17 06:20:36 +08:00
int tmp_res ;
2013-09-25 15:20:13 +08:00
if ( rollback_slave = = slave )
break ;
2017-04-05 05:32:42 +08:00
tmp_res = dev_set_mac_address ( rollback_slave - > dev ,
2018-12-13 19:54:30 +08:00
( struct sockaddr * ) & tmp_ss , NULL ) ;
2005-04-17 06:20:36 +08:00
if ( tmp_res ) {
2019-06-07 22:59:29 +08:00
slave_dbg ( bond_dev , rollback_slave - > dev , " %s: unwind err %d \n " ,
__func__ , tmp_res ) ;
2005-04-17 06:20:36 +08:00
}
}
return res ;
}
2013-08-01 22:54:50 +08:00
/**
* bond_xmit_slave_id - transmit skb through slave with slave_id
* @ bond : bonding device that is transmitting
* @ skb : buffer to transmit
* @ slave_id : slave id up to slave_cnt - 1 through which to transmit
*
* This function tries to transmit through slave with slave_id but in case
* it fails , it tries to find the first available slave for transmission .
* The skb is consumed in all cases , thus the function is void .
*/
2013-12-30 03:41:25 +08:00
static void bond_xmit_slave_id ( struct bonding * bond , struct sk_buff * skb , int slave_id )
2013-08-01 22:54:50 +08:00
{
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2013-08-01 22:54:50 +08:00
struct slave * slave ;
int i = slave_id ;
/* Here we start from the slave with slave_id */
2013-09-25 15:20:14 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2013-08-01 22:54:50 +08:00
if ( - - i < 0 ) {
2014-05-16 03:39:58 +08:00
if ( bond_slave_can_tx ( slave ) ) {
2013-08-01 22:54:50 +08:00
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
return ;
}
}
}
/* Here we start from the first slave up to slave_id */
i = slave_id ;
2013-09-25 15:20:14 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2013-08-01 22:54:50 +08:00
if ( - - i < 0 )
break ;
2014-05-16 03:39:58 +08:00
if ( bond_slave_can_tx ( slave ) ) {
2013-08-01 22:54:50 +08:00
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
return ;
}
}
/* no slave that can tx has been found */
2014-11-01 02:47:54 +08:00
bond_tx_drop ( bond - > dev , skb ) ;
2013-08-01 22:54:50 +08:00
}
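The two-pass search in bond_xmit_slave_id() can be sketched in user space as follows. This is a minimal illustration, not the driver code: the `can_tx` array and the name `pick_tx_index` are hypothetical stand-ins for the slave list and bond_slave_can_tx(), and a return value of -1 stands for the bond_tx_drop() case.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the slave list: can_tx[i] says whether entry i
 * may transmit. Returns the chosen index, or -1 if the packet would be
 * dropped because no entry can transmit.
 */
static int pick_tx_index(const bool *can_tx, int n, int slave_id)
{
	int i;

	assert(n >= 0 && slave_id >= 0);

	/* First pass: the requested entry and everything after it. */
	for (i = slave_id; i < n; i++)
		if (can_tx[i])
			return i;

	/* Second pass: wrap around, from the head up to (not including) slave_id. */
	for (i = 0; i < slave_id && i < n; i++)
		if (can_tx[i])
			return i;

	return -1;
}
```

With three entries where only the first can transmit, asking for index 1 falls through the first pass and returns 0 from the second pass.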
2013-11-05 20:51:41 +08:00
/**
* bond_rr_gen_slave_id - generate slave id based on packets_per_slave
* @ bond : bonding device to use
*
* Based on the value of the bonding device ' s packets_per_slave parameter
* this function generates a slave id , which is usually used as the next
* slave to transmit through .
*/
static u32 bond_rr_gen_slave_id ( struct bonding * bond )
{
u32 slave_id ;
reciprocal_divide: update/correction of the algorithm
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.
This has been fixed by Eric Dumazet in commit aee636c4809fa5
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using the Granlund and Montgomery method, so that
future use is also safe and free of non-obvious side effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
were very large. This did not seem to be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d8d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:
1) Initialization:
int l = ceil(log_2 d)
uword m' = floor((1<<32)*((1<<l)-d)/d)+1
int sh_1 = min(l,1)
int sh_2 = max(l-1,0)
2) For q = n/d, all uword:
uword t = (n*m')>>32
q = (t+((n-t)>>sh_1))>>sh_2
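The pseudocode above can be sketched in portable C as follows. This is a user-space illustration under stated assumptions, not the kernel implementation: the names `struct recip`, `recip_init`, `recip_div` and `ceil_log2_u32` are hypothetical, and `uword` is taken to be a 32-bit unsigned integer.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed constants for dividing by one fixed divisor d. */
struct recip {
	uint32_t m;   /* magic multiplier m' */
	uint8_t sh1;  /* min(l, 1) */
	uint8_t sh2;  /* max(l - 1, 0) */
};

/* ceil(log2(d)) for d >= 1: number of bits needed to represent d - 1. */
static int ceil_log2_u32(uint32_t d)
{
	int l = 0;
	uint32_t v = d - 1;

	while (v) {
		l++;
		v >>= 1;
	}
	return l;
}

/* Step 1: initialization, done once per divisor. */
static struct recip recip_init(uint32_t d)
{
	struct recip r;
	uint64_t m;
	int l;

	assert(d != 0);
	l = ceil_log2_u32(d);
	/* floor(2^32 * (2^l - d) / d) + 1; fits in 64 bits because l <= 32 */
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);
	r.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);
	return r;
}

/* Step 2: q = n / d using one multiply, one add and two shifts. */
static uint32_t recip_div(uint32_t n, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)n * r.m) >> 32);

	return (t + ((n - t) >> r.sh1)) >> r.sh2;
}
```

For d = 1 this yields l = 0, m' = 1 and both shifts 0, so recip_div(n, ...) returns n, which is the case the old implementation got wrong.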
The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.
Joint work with Daniel Borkmann.
[1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
[2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
[3] https://gmplib.org/~tege/division-paper.pdf
[4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
[5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
[6] http://www.agner.org/optimize/asmlib.zip
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 09:29:41 +08:00
struct reciprocal_value reciprocal_packets_per_slave ;
int packets_per_slave = bond - > params . packets_per_slave ;
2013-11-05 20:51:41 +08:00
switch ( packets_per_slave ) {
case 0 :
slave_id = prandom_u32 ( ) ;
break ;
case 1 :
slave_id = bond - > rr_tx_counter ;
break ;
default :
2014-01-22 09:29:41 +08:00
reciprocal_packets_per_slave =
bond - > params . reciprocal_packets_per_slave ;
2013-11-05 20:51:41 +08:00
slave_id = reciprocal_divide ( bond - > rr_tx_counter ,
2014-01-22 09:29:41 +08:00
reciprocal_packets_per_slave ) ;
2013-11-05 20:51:41 +08:00
break ;
}
bond - > rr_tx_counter + + ;
return slave_id ;
}
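How packets_per_slave turns the transmit counter into a slave index can be sketched as below. This is a simplified illustration, not the driver code: `rr_pick` is a hypothetical name, plain division stands in for reciprocal_divide(), the final modulo is the step the caller applies with slave_cnt, and the random case (packets_per_slave == 0) is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * counter: value of a free-running per-bond transmit counter.
 * packets_per_slave: must be >= 1 here (the random case, 0, is left out).
 * slave_cnt: number of slaves, must be >= 1.
 */
static uint32_t rr_pick(uint32_t counter, uint32_t packets_per_slave,
			uint32_t slave_cnt)
{
	assert(packets_per_slave >= 1 && slave_cnt >= 1);
	/* Plain division stands in for reciprocal_divide(). */
	return (counter / packets_per_slave) % slave_cnt;
}
```

With packets_per_slave = 3 and two slaves, counters 0 to 2 map to index 0, counters 3 to 5 map to index 1, and counter 6 wraps back to index 0.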
2018-05-11 17:53:10 +08:00
static netdev_tx_t bond_xmit_roundrobin ( struct sk_buff * skb ,
struct net_device * bond_dev )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-08-01 22:54:50 +08:00
struct slave * slave ;
2019-07-02 11:40:24 +08:00
int slave_cnt ;
2013-11-05 20:51:41 +08:00
u32 slave_id ;
2005-04-17 06:20:36 +08:00
2013-11-05 20:51:41 +08:00
/* Start with the curr_active_slave that joined the bond as the
2010-03-25 22:49:05 +08:00
* default for sending IGMP traffic . For failover purposes one
* needs to maintain some consistency for the interface that will
* send the join / membership reports . The curr_active_slave found
* will send all of this type of traffic .
2007-10-18 08:37:47 +08:00
*/
2019-07-02 11:40:24 +08:00
if ( skb - > protocol = = htons ( ETH_P_IP ) ) {
int noff = skb_network_offset ( skb ) ;
struct iphdr * iph ;
2014-09-12 23:38:18 +08:00
2019-07-02 11:40:24 +08:00
if ( unlikely ( ! pskb_may_pull ( skb , noff + sizeof ( * iph ) ) ) )
goto non_igmp ;
iph = ip_hdr ( skb ) ;
if ( iph - > protocol = = IPPROTO_IGMP ) {
slave = rcu_dereference ( bond - > curr_active_slave ) ;
if ( slave )
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
else
bond_xmit_slave_id ( bond , skb , 0 ) ;
return NETDEV_TX_OK ;
2014-09-12 23:38:18 +08:00
}
2005-04-17 06:20:36 +08:00
}
2011-05-07 09:48:02 +08:00
2019-07-02 11:40:24 +08:00
non_igmp :
slave_cnt = READ_ONCE ( bond - > slave_cnt ) ;
if ( likely ( slave_cnt ) ) {
slave_id = bond_rr_gen_slave_id ( bond ) ;
bond_xmit_slave_id ( bond , skb , slave_id % slave_cnt ) ;
} else {
bond_tx_drop ( bond_dev , skb ) ;
}
2009-07-06 10:23:38 +08:00
return NETDEV_TX_OK ;
2005-04-17 06:20:36 +08:00
}
2014-09-15 23:19:34 +08:00
/* In active-backup mode, we know that bond->curr_active_slave is always valid if
2005-04-17 06:20:36 +08:00
* the bond has a usable interface .
*/
2018-05-11 17:53:10 +08:00
static netdev_tx_t bond_xmit_activebackup ( struct sk_buff * skb ,
struct net_device * bond_dev )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-08-01 22:54:48 +08:00
struct slave * slave ;
2005-04-17 06:20:36 +08:00
bonding: initial RCU conversion
This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.
1. Active-backup mode
1.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
in bonding
- new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
in bonding
1.2. Bandwidth measurements
- old bonding: 16.1 gbps consistently
- new bonding: 17.5 gbps consistently
2. Round-robin mode
2.1 Perf recording while doing iperf -P 4
- old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
in bonding
- new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
in bonding
2.2 Bandwidth measurements
- old bonding: 8 gbps (variable due to packet reorderings)
- new bonding: 10 gbps (variable due to packet reorderings)
Of course the latency has improved in all converted modes, and moreover
while doing enslave/release (since it doesn't affect tx anymore).
Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 22:54:51 +08:00
slave = rcu_dereference ( bond - > curr_active_slave ) ;
2013-08-01 22:54:48 +08:00
if ( slave )
2013-08-01 22:54:50 +08:00
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
else
2014-11-01 02:47:54 +08:00
bond_tx_drop ( bond_dev , skb ) ;
2011-05-07 09:48:02 +08:00
2009-07-06 10:23:38 +08:00
return NETDEV_TX_OK ;
2005-04-17 06:20:36 +08:00
}
2014-10-05 08:45:01 +08:00
/* Use this to update slave_array when (a) it's not appropriate to update
* slave_array right away ( note that update_slave_array ( ) may sleep )
* and / or ( b ) RTNL is not held .
2005-04-17 06:20:36 +08:00
*/
2014-10-05 08:45:01 +08:00
void bond_slave_arr_work_rearm ( struct bonding * bond , unsigned long delay )
2005-04-17 06:20:36 +08:00
{
2014-10-05 08:45:01 +08:00
queue_delayed_work ( bond - > wq , & bond - > slave_arr_work , delay ) ;
}
2005-04-17 06:20:36 +08:00
2014-10-05 08:45:01 +08:00
/* Slave array work handler. Holds only RTNL */
static void bond_slave_arr_handler ( struct work_struct * work )
{
struct bonding * bond = container_of ( work , struct bonding ,
slave_arr_work . work ) ;
int ret ;
if ( ! rtnl_trylock ( ) )
goto err ;
ret = bond_update_slave_arr ( bond , NULL ) ;
rtnl_unlock ( ) ;
if ( ret ) {
pr_warn_ratelimited ( " Failed to update slave array from WT \n " ) ;
goto err ;
}
return ;
err :
bond_slave_arr_work_rearm ( bond , 1 ) ;
}
/* Build the usable slaves array in control path for modes that use xmit-hash
* to determine the slave interface -
* ( a ) BOND_MODE_8023AD
* ( b ) BOND_MODE_XOR
2018-05-15 02:48:09 +08:00
* ( c ) ( BOND_MODE_TLB | | BOND_MODE_ALB ) & & tlb_dynamic_lb = = 0
2014-10-05 08:45:01 +08:00
*
* The caller is expected to hold RTNL only and NO other lock !
*/
int bond_update_slave_arr ( struct bonding * bond , struct slave * skipslave )
{
struct slave * slave ;
struct list_head * iter ;
struct bond_up_slave * new_arr , * old_arr ;
int agg_id = 0 ;
int ret = 0 ;
# ifdef CONFIG_LOCKDEP
WARN_ON ( lockdep_is_held ( & bond - > mode_lock ) ) ;
# endif
new_arr = kzalloc ( offsetof ( struct bond_up_slave , arr [ bond - > slave_cnt ] ) ,
GFP_KERNEL ) ;
if ( ! new_arr ) {
ret = - ENOMEM ;
pr_err ( " Failed to build slave-array. \n " ) ;
goto out ;
}
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
struct ad_info ad_info ;
if ( bond_3ad_get_active_agg_info ( bond , & ad_info ) ) {
pr_debug ( " bond_3ad_get_active_agg_info failed \n " ) ;
kfree_rcu ( new_arr , rcu ) ;
/* No active aggregator means it's not safe to use
* the previous array .
*/
old_arr = rtnl_dereference ( bond - > slave_arr ) ;
if ( old_arr ) {
RCU_INIT_POINTER ( bond - > slave_arr , NULL ) ;
kfree_rcu ( old_arr , rcu ) ;
}
goto out ;
}
agg_id = ad_info . aggregator_id ;
}
bond_for_each_slave ( bond , slave , iter ) {
if ( BOND_MODE ( bond ) = = BOND_MODE_8023AD ) {
struct aggregator * agg ;
agg = SLAVE_AD_INFO ( slave ) - > port . aggregator ;
if ( ! agg | | agg - > aggregator_identifier ! = agg_id )
continue ;
}
if ( ! bond_slave_can_tx ( slave ) )
continue ;
if ( skipslave = = slave )
continue ;
2018-05-15 02:48:09 +08:00
2019-06-07 22:59:29 +08:00
slave_dbg ( bond - > dev , slave - > dev , " Adding slave to tx hash array[%d] \n " ,
new_arr - > count ) ;
2018-05-15 02:48:09 +08:00
2014-10-05 08:45:01 +08:00
new_arr - > arr [ new_arr - > count + + ] = slave ;
}
old_arr = rtnl_dereference ( bond - > slave_arr ) ;
rcu_assign_pointer ( bond - > slave_arr , new_arr ) ;
if ( old_arr )
kfree_rcu ( old_arr , rcu ) ;
out :
if ( ret ! = 0 & & skipslave ) {
int idx ;
/* Rare situation where caller has asked to skip a specific
* slave but allocation failed ( most likely ! ) . BTW this is
* only possible when the call is initiated from
* __bond_release_one ( ) . In this situation ; overwrite the
* skipslave entry in the array with the last entry from the
* array to avoid a situation where the xmit path may choose
* this to - be - skipped slave to send a packet out .
*/
old_arr = rtnl_dereference ( bond - > slave_arr ) ;
2019-10-08 06:43:01 +08:00
for ( idx = 0 ; old_arr ! = NULL & & idx < old_arr - > count ; idx + + ) {
2014-10-05 08:45:01 +08:00
if ( skipslave = = old_arr - > arr [ idx ] ) {
old_arr - > arr [ idx ] =
old_arr - > arr [ old_arr - > count - 1 ] ;
old_arr - > count - - ;
break ;
}
}
}
return ret ;
}
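The error path at the end of bond_update_slave_arr() removes one entry from the old array by overwriting it with the last entry and shrinking the count. A minimal sketch of that unordered removal, assuming a plain int array in place of the slave pointers (`swap_remove` is a hypothetical name):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Remove the first occurrence of victim from arr[0..*count) in O(1) extra
 * work by moving the last element into its slot. Element order is not
 * preserved. Returns true if an element was removed.
 */
static bool swap_remove(int *arr, size_t *count, int victim)
{
	size_t idx;

	assert(arr != NULL && count != NULL);
	for (idx = 0; idx < *count; idx++) {
		if (arr[idx] == victim) {
			arr[idx] = arr[*count - 1];
			(*count)--;
			return true;
		}
	}
	return false;
}
```

The trade-off is that the order of the remaining entries changes, which is acceptable here because the array is only indexed by a hash value.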
/* Use this Xmit function for 3AD as well as XOR modes. The current
* usable slave array is formed in the control path . The xmit function
* just calculates hash and sends the packet out .
*/
2018-05-11 17:53:10 +08:00
static netdev_tx_t bond_3ad_xor_xmit ( struct sk_buff * skb ,
struct net_device * dev )
2014-10-05 08:45:01 +08:00
{
struct bonding * bond = netdev_priv ( dev ) ;
struct slave * slave ;
struct bond_up_slave * slaves ;
unsigned int count ;
slaves = rcu_dereference ( bond - > slave_arr ) ;
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 05:07:29 +08:00
count = slaves ? READ_ONCE ( slaves - > count ) : 0 ;
2014-10-05 08:45:01 +08:00
if ( likely ( count ) ) {
slave = slaves - > arr [ bond_xmit_hash ( bond , skb ) % count ] ;
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
} else {
2014-11-01 02:47:54 +08:00
bond_tx_drop ( dev , skb ) ;
2014-10-05 08:45:01 +08:00
}
2011-05-07 09:48:02 +08:00
2009-07-06 10:23:38 +08:00
return NETDEV_TX_OK ;
2005-04-17 06:20:36 +08:00
}
2013-08-01 22:54:49 +08:00
/* in broadcast mode, we send everything to all usable interfaces. */
2018-05-11 17:53:10 +08:00
static netdev_tx_t bond_xmit_broadcast ( struct sk_buff * skb ,
struct net_device * bond_dev )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-08-01 22:54:49 +08:00
struct slave * slave = NULL ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2005-04-17 06:20:36 +08:00
2013-09-25 15:20:14 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2013-08-01 22:54:49 +08:00
if ( bond_is_last_slave ( bond , slave ) )
break ;
2014-05-16 03:39:57 +08:00
if ( bond_slave_is_up ( slave ) & & slave - > link = = BOND_LINK_UP ) {
2013-08-01 22:54:49 +08:00
struct sk_buff * skb2 = skb_clone ( skb , GFP_ATOMIC ) ;
2005-04-17 06:20:36 +08:00
2013-08-01 22:54:49 +08:00
if ( ! skb2 ) {
2014-03-25 17:00:10 +08:00
net_err_ratelimited ( " %s: Error: %s: skb_clone() failed \n " ,
bond_dev - > name , __func__ ) ;
2013-08-01 22:54:49 +08:00
continue ;
2005-04-17 06:20:36 +08:00
}
2013-08-01 22:54:49 +08:00
bond_dev_queue_xmit ( bond , skb2 , slave - > dev ) ;
2005-04-17 06:20:36 +08:00
}
}
2014-05-16 03:39:57 +08:00
if ( slave & & bond_slave_is_up ( slave ) & & slave - > link = = BOND_LINK_UP )
2013-08-01 22:54:49 +08:00
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
else
2014-11-01 02:47:54 +08:00
bond_tx_drop ( bond_dev , skb ) ;
2009-06-13 03:02:48 +08:00
2009-07-06 10:23:38 +08:00
return NETDEV_TX_OK ;
2005-04-17 06:20:36 +08:00
}
/*------------------------- Device initialization ---------------------------*/
2014-09-15 23:19:34 +08:00
/* Lookup the slave that corresponds to a qid */
2010-06-02 16:40:18 +08:00
static inline int bond_slave_override ( struct bonding * bond ,
struct sk_buff * skb )
{
struct slave * slave = NULL ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2010-06-02 16:40:18 +08:00
2018-05-11 17:53:11 +08:00
if ( ! skb_rx_queue_recorded ( skb ) )
2011-05-07 09:48:02 +08:00
return 1 ;
2010-06-02 16:40:18 +08:00
/* Find out if any slaves have the same mapping as this skb. */
2014-01-02 09:13:06 +08:00
bond_for_each_slave_rcu ( bond , slave , iter ) {
2018-05-11 17:53:11 +08:00
if ( slave - > queue_id = = skb_get_queue_mapping ( skb ) ) {
2015-03-29 19:20:25 +08:00
if ( bond_slave_is_up ( slave ) & &
slave - > link = = BOND_LINK_UP ) {
2014-01-02 09:13:06 +08:00
bond_dev_queue_xmit ( bond , skb , slave - > dev ) ;
return 0 ;
}
/* If the slave isn't UP, use default transmit policy. */
2010-06-02 16:40:18 +08:00
break ;
}
}
2014-01-02 09:13:06 +08:00
return 1 ;
2010-06-02 16:40:18 +08:00
}
2011-06-03 18:35:52 +08:00
2014-01-10 16:18:26 +08:00
static u16 bond_select_queue ( struct net_device * dev , struct sk_buff * skb ,
2019-03-20 18:02:06 +08:00
struct net_device * sb_dev )
2010-06-02 16:40:18 +08:00
{
2014-09-15 23:19:34 +08:00
/* This helper function exists to help dev_pick_tx get the correct
2011-03-14 14:22:04 +08:00
* destination queue . Using a helper function skips a call to
2010-06-02 16:40:18 +08:00
* skb_tx_hash and will put the skbs in the queue we expect on their
* way down to the bonding driver .
*/
2011-03-14 14:22:04 +08:00
u16 txq = skb_rx_queue_recorded ( skb ) ? skb_get_rx_queue ( skb ) : 0 ;
2014-09-15 23:19:34 +08:00
/* Save the original txq to restore before passing to the driver */
2018-05-11 17:53:11 +08:00
qdisc_skb_cb ( skb ) - > slave_dev_queue_mapping = skb_get_queue_mapping ( skb ) ;
2011-06-03 18:35:52 +08:00
2011-03-14 14:22:04 +08:00
if ( unlikely ( txq > = dev - > real_num_tx_queues ) ) {
2011-04-13 23:22:29 +08:00
do {
2011-03-14 14:22:04 +08:00
txq - = dev - > real_num_tx_queues ;
2011-04-13 23:22:29 +08:00
} while ( txq > = dev - > real_num_tx_queues ) ;
2011-03-14 14:22:04 +08:00
}
return txq ;
2010-06-02 16:40:18 +08:00
}
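The loop in bond_select_queue() folds an out-of-range queue index back into range by repeated subtraction. A small sketch, assuming 16-bit queue numbers as in the function above (`wrap_txq` is a hypothetical name), shows that the result equals txq modulo the queue count:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fold txq into [0, real_num_tx_queues) by repeated subtraction.
 * The result is the same as txq % real_num_tx_queues.
 */
static uint16_t wrap_txq(uint16_t txq, uint16_t real_num_tx_queues)
{
	assert(real_num_tx_queues != 0);
	while (txq >= real_num_tx_queues)
		txq -= real_num_tx_queues;
	return txq;
}
```

When txq is already in range the loop body never runs, so the common case costs a single comparison.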
2011-05-07 09:48:02 +08:00
static netdev_tx_t __bond_start_xmit ( struct sk_buff * skb , struct net_device * dev )
2008-11-21 12:14:53 +08:00
{
2010-06-02 16:40:18 +08:00
struct bonding * bond = netdev_priv ( dev ) ;
2014-05-16 03:39:52 +08:00
if ( bond_should_override_tx_queue ( bond ) & &
! bond_slave_override ( bond , skb ) )
return NETDEV_TX_OK ;
2008-11-21 12:14:53 +08:00
2014-05-16 03:39:55 +08:00
switch ( BOND_MODE ( bond ) ) {
2008-11-21 12:14:53 +08:00
case BOND_MODE_ROUNDROBIN :
return bond_xmit_roundrobin ( skb , dev ) ;
case BOND_MODE_ACTIVEBACKUP :
return bond_xmit_activebackup ( skb , dev ) ;
2014-10-05 08:45:01 +08:00
case BOND_MODE_8023AD :
2008-11-21 12:14:53 +08:00
case BOND_MODE_XOR :
2014-10-05 08:45:01 +08:00
return bond_3ad_xor_xmit ( skb , dev ) ;
2008-11-21 12:14:53 +08:00
case BOND_MODE_BROADCAST :
return bond_xmit_broadcast ( skb , dev ) ;
case BOND_MODE_ALB :
return bond_alb_xmit ( skb , dev ) ;
2014-04-23 07:30:20 +08:00
case BOND_MODE_TLB :
return bond_tlb_xmit ( skb , dev ) ;
2008-11-21 12:14:53 +08:00
default :
/* Should never happen, mode already checked */
2014-07-16 01:35:58 +08:00
netdev_err ( dev , " Unknown bonding mode %d \n " , BOND_MODE ( bond ) ) ;
2008-11-21 12:14:53 +08:00
WARN_ON_ONCE ( 1 ) ;
2014-11-01 02:47:54 +08:00
bond_tx_drop ( dev , skb ) ;
2008-11-21 12:14:53 +08:00
return NETDEV_TX_OK ;
}
}
2011-05-07 09:48:02 +08:00
static netdev_tx_t bond_start_xmit ( struct sk_buff * skb , struct net_device * dev )
{
struct bonding * bond = netdev_priv ( dev ) ;
netdev_tx_t ret = NETDEV_TX_OK ;
2014-09-15 23:19:34 +08:00
/* If we risk deadlock from transmitting this in the
2011-05-07 09:48:02 +08:00
* netpoll path , tell netpoll to queue the frame for later tx
*/
2014-03-25 17:00:09 +08:00
if ( unlikely ( is_netpoll_tx_blocked ( dev ) ) )
2011-05-07 09:48:02 +08:00
return NETDEV_TX_BUSY ;
2013-08-01 22:54:51 +08:00
rcu_read_lock ( ) ;
2013-09-25 15:20:21 +08:00
if ( bond_has_slaves ( bond ) )
2011-05-07 09:48:02 +08:00
ret = __bond_start_xmit ( skb , dev ) ;
else
2014-11-01 02:47:54 +08:00
bond_tx_drop ( dev , skb ) ;
2013-08-01 22:54:51 +08:00
rcu_read_unlock ( ) ;
2011-05-07 09:48:02 +08:00
return ret ;
}
2008-11-21 12:14:53 +08:00
2016-10-26 00:41:31 +08:00
static int bond_ethtool_get_link_ksettings ( struct net_device * bond_dev ,
struct ethtool_link_ksettings * cmd )
2013-04-16 22:46:00 +08:00
{
struct bonding * bond = netdev_priv ( bond_dev ) ;
unsigned long speed = 0 ;
2013-09-25 15:20:14 +08:00
struct list_head * iter ;
2013-08-01 22:54:47 +08:00
struct slave * slave ;
2013-04-16 22:46:00 +08:00
2016-10-26 00:41:31 +08:00
cmd - > base . duplex = DUPLEX_UNKNOWN ;
cmd - > base . port = PORT_OTHER ;
2013-04-16 22:46:00 +08:00
2014-05-16 03:39:59 +08:00
/* Since bond_slave_can_tx returns false for all inactive or down slaves, we
2013-04-16 22:46:00 +08:00
* do not need to check mode . Though link speed might not represent
* the true receive or transmit bandwidth ( not all modes are symmetric )
* this is an accurate maximum .
*/
2013-09-25 15:20:14 +08:00
bond_for_each_slave ( bond , slave , iter ) {
2014-05-16 03:39:59 +08:00
if ( bond_slave_can_tx ( slave ) ) {
2013-04-16 22:46:00 +08:00
if ( slave - > speed ! = SPEED_UNKNOWN )
speed + = slave - > speed ;
2016-10-26 00:41:31 +08:00
if ( cmd - > base . duplex = = DUPLEX_UNKNOWN & &
2013-04-16 22:46:00 +08:00
slave - > duplex ! = DUPLEX_UNKNOWN )
2016-10-26 00:41:31 +08:00
cmd - > base . duplex = slave - > duplex ;
2013-04-16 22:46:00 +08:00
}
}
2016-10-26 00:41:31 +08:00
cmd - > base . speed = speed ? : SPEED_UNKNOWN ;
2013-08-01 22:54:47 +08:00
2013-04-16 22:46:00 +08:00
return 0 ;
}
2005-09-27 07:11:50 +08:00
static void bond_ethtool_get_drvinfo ( struct net_device * bond_dev ,
2013-01-06 08:44:26 +08:00
struct ethtool_drvinfo * drvinfo )
2005-09-27 07:11:50 +08:00
{
2013-01-06 08:44:26 +08:00
strlcpy ( drvinfo - > driver , DRV_NAME , sizeof ( drvinfo - > driver ) ) ;
strlcpy ( drvinfo - > version , DRV_VERSION , sizeof ( drvinfo - > version ) ) ;
snprintf ( drvinfo - > fw_version , sizeof ( drvinfo - > fw_version ) , " %d " ,
BOND_ABI_VERSION ) ;
2005-09-27 07:11:50 +08:00
}
2006-09-14 02:30:00 +08:00
static const struct ethtool_ops bond_ethtool_ops = {
2005-09-27 07:11:50 +08:00
. get_drvinfo = bond_ethtool_get_drvinfo ,
2008-09-14 09:17:09 +08:00
. get_link = ethtool_op_get_link ,
2016-10-26 00:41:31 +08:00
. get_link_ksettings = bond_ethtool_get_link_ksettings ,
2005-08-23 13:34:53 +08:00
} ;
2008-11-20 13:56:05 +08:00
static const struct net_device_ops bond_netdev_ops = {
2009-06-13 03:02:52 +08:00
. ndo_init = bond_init ,
2009-06-13 03:02:47 +08:00
. ndo_uninit = bond_uninit ,
2008-11-20 13:56:05 +08:00
. ndo_open = bond_open ,
. ndo_stop = bond_close ,
2008-11-21 12:14:53 +08:00
. ndo_start_xmit = bond_start_xmit ,
2010-06-02 16:40:18 +08:00
. ndo_select_queue = bond_select_queue ,
2010-06-08 15:19:54 +08:00
. ndo_get_stats64 = bond_get_stats ,
2008-11-20 13:56:05 +08:00
. ndo_do_ioctl = bond_do_ioctl ,
2011-08-16 11:15:04 +08:00
. ndo_change_rx_flags = bond_change_rx_flags ,
2013-05-31 19:57:30 +08:00
. ndo_set_rx_mode = bond_set_rx_mode ,
2008-11-20 13:56:05 +08:00
. ndo_change_mtu = bond_change_mtu ,
2011-07-20 12:54:46 +08:00
. ndo_set_mac_address = bond_set_mac_address ,
2008-11-21 12:14:53 +08:00
. ndo_neigh_setup = bond_neigh_setup ,
2011-07-20 12:54:46 +08:00
. ndo_vlan_rx_add_vid = bond_vlan_rx_add_vid ,
2008-11-20 13:56:05 +08:00
. ndo_vlan_rx_kill_vid = bond_vlan_rx_kill_vid ,
2010-05-06 15:48:51 +08:00
# ifdef CONFIG_NET_POLL_CONTROLLER
2011-02-18 07:43:32 +08:00
. ndo_netpoll_setup = bond_netpoll_setup ,
2010-05-06 15:48:51 +08:00
. ndo_netpoll_cleanup = bond_netpoll_cleanup ,
. ndo_poll_controller = bond_poll_controller ,
# endif
2011-02-13 17:33:01 +08:00
. ndo_add_slave = bond_enslave ,
. ndo_del_slave = bond_release ,
2011-05-07 11:22:17 +08:00
. ndo_fix_features = bond_fix_features ,
2015-03-27 13:31:14 +08:00
. ndo_features_check = passthru_features_check ,
2008-11-20 13:56:05 +08:00
} ;
2013-02-18 22:59:23 +08:00
static const struct device_type bond_type = {
. name = " bond " ,
} ;
2010-04-01 05:30:52 +08:00
static void bond_destructor ( struct net_device * bond_dev )
{
struct bonding * bond = netdev_priv ( bond_dev ) ;
if ( bond - > wq )
destroy_workqueue ( bond - > wq ) ;
}
2013-10-18 23:43:33 +08:00
void bond_setup ( struct net_device * bond_dev )
2005-04-17 06:20:36 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2005-04-17 06:20:36 +08:00
2014-09-12 04:49:25 +08:00
spin_lock_init ( & bond - > mode_lock ) ;
2009-06-13 03:02:44 +08:00
bond - > params = bonding_defaults ;
2005-04-17 06:20:36 +08:00
/* Initialize pointers */
bond - > dev = bond_dev ;
/* Initialize the device entry points */
2009-06-13 03:02:52 +08:00
ether_setup ( bond_dev ) ;
2017-03-03 04:24:36 +08:00
bond_dev - > max_mtu = ETH_MAX_MTU ;
2008-11-20 13:56:05 +08:00
bond_dev - > netdev_ops = & bond_netdev_ops ;
2005-08-23 13:34:53 +08:00
bond_dev - > ethtool_ops = & bond_ethtool_ops ;
2005-04-17 06:20:36 +08:00
net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate resources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally also does a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
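The split described in this commit message can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct toy_netdev` and every `toy_*` function are hypothetical stand-ins, not kernel APIs, and only the two fields the message introduces (`priv_destructor` and `needs_free_netdev`) are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net_device; only the two fields the
 * commit message introduces are modelled. */
struct toy_netdev {
	void (*priv_destructor)(struct toy_netdev *dev);
	bool needs_free_netdev;
	void *priv;		/* private resource set up at init time */
	int uninit_calls;	/* how often the toy ndo_uninit ran */
};

static void toy_ndo_uninit(struct toy_netdev *dev)
{
	dev->uninit_calls++;
}

/* Frees private state only; it never frees the device itself. */
static void toy_priv_destructor(struct toy_netdev *dev)
{
	free(dev->priv);
	dev->priv = NULL;
}

/* Error path of registration after init succeeded: release private
 * state, but leave freeing the device to the caller. */
static void toy_register_failed(struct toy_netdev *dev)
{
	toy_ndo_uninit(dev);
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
}

/* End of unregistration: release private state, then free the device
 * only if the driver asked for it. Returns true if it was freed. */
static bool toy_unregister(struct toy_netdev *dev)
{
	bool freed = dev->needs_free_netdev;

	toy_ndo_uninit(dev);
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	if (freed)
		free(dev);
	return freed;
}

/* Allocates a toy device and sets the two fields the way a driver
 * setup routine would. */
static struct toy_netdev *toy_alloc(void)
{
	struct toy_netdev *dev = calloc(1, sizeof(*dev));

	assert(dev);
	dev->priv = malloc(16);
	dev->priv_destructor = toy_priv_destructor;
	dev->needs_free_netdev = true;
	return dev;
}
```

In this model the failed-registration path can safely run the destructor because the destructor no longer frees the device, so the caller's own free does not become a double free.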
2017-05-09 00:52:56 +08:00
bond_dev - > needs_free_netdev = true ;
bond_dev - > priv_destructor = bond_destructor ;
2005-04-17 06:20:36 +08:00
2013-02-18 22:59:23 +08:00
SET_NETDEV_DEVTYPE ( bond_dev , & bond_type ) ;
2005-04-17 06:20:36 +08:00
/* Initialize the device options */
2016-03-16 17:59:15 +08:00
bond_dev - > flags | = IFF_MASTER ;
2015-08-18 16:30:39 +08:00
bond_dev - > priv_flags | = IFF_BONDING | IFF_UNICAST_FLT | IFF_NO_QUEUE ;
2011-07-26 14:05:38 +08:00
bond_dev - > priv_flags & = ~ ( IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING ) ;
2009-06-13 03:02:52 +08:00
2014-09-15 23:19:34 +08:00
/* don't acquire bond device's netif_tx_lock when transmitting */
2005-04-17 06:20:36 +08:00
bond_dev - > features | = NETIF_F_LLTX ;
/* By default, we declare the bond to be fully
* VLAN hardware accelerated capable . Special
* care is taken in the various xmit functions
* when there are slaves that are not hw accel
* capable
*/
2014-01-22 17:16:30 +08:00
/* Don't allow bond devices to change network namespaces. */
bond_dev - > features | = NETIF_F_NETNS_LOCAL ;
2011-05-07 11:22:17 +08:00
bond_dev - > hw_features = BOND_VLAN_FEATURES |
2013-04-19 10:04:27 +08:00
NETIF_F_HW_VLAN_CTAG_RX |
NETIF_F_HW_VLAN_CTAG_FILTER ;
2011-05-07 11:22:17 +08:00
2018-05-22 23:34:40 +08:00
bond_dev - > hw_features | = NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4 ;
2011-05-07 11:22:17 +08:00
bond_dev - > features | = bond_dev - > hw_features ;
2019-06-26 16:08:44 +08:00
bond_dev - > features | = NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX ;
2005-04-17 06:20:36 +08:00
}
2014-09-15 23:19:34 +08:00
/* Destroy a bonding device.
* Must be under rtnl_lock when this function is called .
*/
2009-10-29 22:18:24 +08:00
static void bond_uninit ( struct net_device * bond_dev )
2008-10-31 08:41:15 +08:00
{
2008-11-13 15:37:49 +08:00
struct bonding * bond = netdev_priv ( bond_dev ) ;
2013-09-25 15:20:15 +08:00
struct list_head * iter ;
struct slave * slave ;
2014-10-05 08:45:01 +08:00
struct bond_up_slave * arr ;
2008-10-31 08:41:15 +08:00
2010-05-06 15:48:51 +08:00
bond_netpoll_cleanup ( bond_dev ) ;
2009-10-29 22:18:24 +08:00
/* Release the bonded slaves */
2013-09-25 15:20:15 +08:00
bond_for_each_slave ( bond , slave , iter )
2017-07-07 06:01:57 +08:00
__bond_release_one ( bond_dev , slave - > dev , true , true ) ;
2014-07-16 01:35:58 +08:00
netdev_info ( bond_dev , " Released all slaves \n " ) ;
2009-10-29 22:18:24 +08:00
2014-10-05 08:45:01 +08:00
arr = rtnl_dereference ( bond - > slave_arr ) ;
if ( arr ) {
RCU_INIT_POINTER ( bond - > slave_arr , NULL ) ;
kfree_rcu ( arr , rcu ) ;
}
2008-10-31 08:41:15 +08:00
list_del ( & bond - > bond_list ) ;
2019-10-22 02:47:53 +08:00
lockdep_unregister_key ( & bond - > stats_lock_key ) ;
2010-12-09 23:17:13 +08:00
bond_debug_unregister ( bond ) ;
2008-10-31 08:41:15 +08:00
}
2005-04-17 06:20:36 +08:00
/*------------------------- Module initialization ---------------------------*/
static int bond_check_params ( struct bond_params * params )
{
2013-05-18 09:18:30 +08:00
int arp_validate_value , fail_over_mac_value , primary_reselect_value , i ;
2014-03-05 08:36:44 +08:00
struct bond_opt_value newval ;
const struct bond_opt_value * valptr ;
bonding: fix randomly populated arp target array
In commit dc9c4d0fe023, the arp_target array moved from a static global
to a local variable. By the nature of static globals, the array used to
be initialized to all 0. At present, it's full of random data, which
gets interpreted as arp_target values when none have actually been
specified. Systems end up booting with spew along these lines:
[ 32.161783] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[ 32.168475] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[ 32.175089] 8021q: adding VLAN 0 to HW filter on device lacp0
[ 32.193091] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[ 32.204892] lacp0: Setting MII monitoring interval to 100
[ 32.211071] lacp0: Removing ARP target 216.124.228.17
[ 32.216824] lacp0: Removing ARP target 218.160.255.255
[ 32.222646] lacp0: Removing ARP target 185.170.136.184
[ 32.228496] lacp0: invalid ARP target 255.255.255.255 specified for removal
[ 32.236294] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
[ 32.243987] lacp0: Removing ARP target 56.125.228.17
[ 32.249625] lacp0: Removing ARP target 218.160.255.255
[ 32.255432] lacp0: Removing ARP target 15.157.233.184
[ 32.261165] lacp0: invalid ARP target 255.255.255.255 specified for removal
[ 32.268939] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
[ 32.276632] lacp0: Removing ARP target 16.0.0.0
[ 32.281755] lacp0: Removing ARP target 218.160.255.255
[ 32.287567] lacp0: Removing ARP target 72.125.228.17
[ 32.293165] lacp0: Removing ARP target 218.160.255.255
[ 32.298970] lacp0: Removing ARP target 8.125.228.17
[ 32.304458] lacp0: Removing ARP target 218.160.255.255
None of these were actually specified as ARP targets, and the driver does
seem to clean up the mess okay, but it's rather noisy and confusing, leaks
values to userspace, and the 255.255.255.255 spew shows up even when debug
prints are disabled.
The fix: just zero out arp_target at init time.
While we're in here, init arp_all_targets_value in the right place.
Fixes: dc9c4d0fe023 ("bonding: reduce scope of some global variables")
CC: Mahesh Bandewar <maheshb@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
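Why the zeroing matters can be shown with a small userspace sketch. It assumes that a zero entry marks the end of the used part of the target array; `toy_find_target`, `toy_count_targets` and `TOY_MAX_ARP_TARGETS` are hypothetical names for illustration, not the driver's own helpers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_ARP_TARGETS 16	/* assumed size, for illustration only */

/* Returns the index of ip in targets, or -1 if it is absent. A zero
 * entry marks the end of the used part of the array, which is only
 * reliable when the array started out zeroed. */
static int toy_find_target(const uint32_t *targets, uint32_t ip)
{
	int i;

	assert(targets);
	for (i = 0; i < TOY_MAX_ARP_TARGETS && targets[i]; i++)
		if (targets[i] == ip)
			return i;
	return -1;
}

/* Counts the entries in use, again relying on the zero terminator. */
static int toy_count_targets(const uint32_t *targets)
{
	int n = 0;

	assert(targets);
	while (n < TOY_MAX_ARP_TARGETS && targets[n])
		n++;
	return n;
}
```

With a local array declared as `uint32_t t[TOY_MAX_ARP_TARGETS] = { 0 };` both helpers see an empty list. Without the initializer the contents are indeterminate, so stale stack values would be treated as configured targets.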
2017-05-20 02:46:46 +08:00
int arp_all_targets_value = 0 ;
2015-05-09 15:01:55 +08:00
u16 ad_actor_sys_prio = 0 ;
2015-05-09 15:01:57 +08:00
u16 ad_user_port_key = 0 ;
2017-05-20 02:46:46 +08:00
__be32 arp_target [ BOND_MAX_ARP_TARGETS ] = { 0 } ;
2017-03-09 02:56:02 +08:00
int arp_ip_count ;
int bond_mode = BOND_MODE_ROUNDROBIN ;
int xmit_hashtype = BOND_XMIT_POLICY_LAYER2 ;
int lacp_fast = 0 ;
2017-09-12 20:10:05 +08:00
int tlb_dynamic_lb ;
2006-09-23 12:54:53 +08:00
2014-09-15 23:19:34 +08:00
/* Convert string parameters. */
2005-04-17 06:20:36 +08:00
if ( mode ) {
2014-01-22 21:53:17 +08:00
bond_opt_initstr ( & newval , mode ) ;
valptr = bond_opt_parse ( bond_opt_get ( BOND_OPT_MODE ) , & newval ) ;
if ( ! valptr ) {
pr_err ( " Error: Invalid bonding mode \" %s \" \n " , mode ) ;
2005-04-17 06:20:36 +08:00
return - EINVAL ;
}
2014-01-22 21:53:17 +08:00
bond_mode = valptr - > value ;
2005-04-17 06:20:36 +08:00
}
2005-06-27 05:54:11 +08:00
if ( xmit_hash_policy ) {
2018-05-15 02:48:09 +08:00
if ( bond_mode = = BOND_MODE_ROUNDROBIN | |
bond_mode = = BOND_MODE_ACTIVEBACKUP | |
bond_mode = = BOND_MODE_BROADCAST ) {
2009-12-14 12:06:07 +08:00
pr_info ( " xmit_hash_policy param is irrelevant in mode %s \n " ,
2014-02-16 08:01:45 +08:00
bond_mode_name ( bond_mode ) ) ;
2005-06-27 05:54:11 +08:00
} else {
2014-01-22 21:53:19 +08:00
bond_opt_initstr ( & newval , xmit_hash_policy ) ;
valptr = bond_opt_parse ( bond_opt_get ( BOND_OPT_XMIT_HASH ) ,
& newval ) ;
if ( ! valptr ) {
2009-12-14 12:06:07 +08:00
pr_err ( " Error: Invalid xmit_hash_policy \" %s \" \n " ,
2005-06-27 05:54:11 +08:00
xmit_hash_policy ) ;
return - EINVAL ;
}
2014-01-22 21:53:19 +08:00
xmit_hashtype = valptr - > value ;
2005-06-27 05:54:11 +08:00
}
}
2005-04-17 06:20:36 +08:00
if ( lacp_rate ) {
if ( bond_mode ! = BOND_MODE_8023AD ) {
2009-12-14 12:06:07 +08:00
pr_info ( " lacp_rate param is irrelevant in mode %s \n " ,
bond_mode_name ( bond_mode ) ) ;
2005-04-17 06:20:36 +08:00
} else {
2014-01-22 21:53:27 +08:00
bond_opt_initstr ( & newval , lacp_rate ) ;
valptr = bond_opt_parse ( bond_opt_get ( BOND_OPT_LACP_RATE ) ,
& newval ) ;
if ( ! valptr ) {
2009-12-14 12:06:07 +08:00
pr_err ( " Error: Invalid lacp rate \" %s \" \n " ,
2014-01-22 21:53:27 +08:00
lacp_rate ) ;
2005-04-17 06:20:36 +08:00
return - EINVAL ;
}
2014-01-22 21:53:27 +08:00
lacp_fast = valptr - > value ;
2005-04-17 06:20:36 +08:00
}
}
2008-11-05 09:51:16 +08:00
if ( ad_select ) {
2014-07-13 15:47:47 +08:00
bond_opt_initstr ( & newval , ad_select ) ;
2014-01-22 21:53:29 +08:00
valptr = bond_opt_parse ( bond_opt_get ( BOND_OPT_AD_SELECT ) ,
& newval ) ;
if ( ! valptr ) {
pr_err ( " Error: Invalid ad_select \" %s \" \n " , ad_select ) ;
2008-11-05 09:51:16 +08:00
return - EINVAL ;
}
2014-01-22 21:53:29 +08:00
params - > ad_select = valptr - > value ;
if ( bond_mode ! = BOND_MODE_8023AD )
2014-02-16 07:57:04 +08:00
pr_warn ( " ad_select param only affects 802.3ad mode \n " ) ;
2008-11-05 09:51:16 +08:00
} else {
params - > ad_select = BOND_AD_STABLE ;
}
2009-08-28 21:18:34 +08:00
if ( max_bonds < 0 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: max_bonds (%d) not in range %d-%d, so it was reset to BOND_DEFAULT_MAX_BONDS (%d) \n " ,
max_bonds , 0 , INT_MAX , BOND_DEFAULT_MAX_BONDS ) ;
2005-04-17 06:20:36 +08:00
max_bonds = BOND_DEFAULT_MAX_BONDS ;
}
if ( miimon < 0 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: miimon module parameter (%d), not in range 0-%d, so it was reset to 0 \n " ,
miimon , INT_MAX ) ;
2014-01-22 21:53:31 +08:00
miimon = 0 ;
2005-04-17 06:20:36 +08:00
}
if ( updelay < 0 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: updelay module parameter (%d), not in range 0-%d, so it was reset to 0 \n " ,
updelay , INT_MAX ) ;
2005-04-17 06:20:36 +08:00
updelay = 0 ;
}
if ( downdelay < 0 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: downdelay module parameter (%d), not in range 0-%d, so it was reset to 0 \n " ,
downdelay , INT_MAX ) ;
2005-04-17 06:20:36 +08:00
downdelay = 0 ;
}
2018-05-17 02:02:13 +08:00
if ( ( use_carrier ! = 0 ) & & ( use_carrier ! = 1 ) ) {
pr_warn ( " Warning: use_carrier module parameter (%d), not of valid value (0/1), so it was set to 1 \n " ,
2014-02-16 07:57:04 +08:00
use_carrier ) ;
2005-04-17 06:20:36 +08:00
use_carrier = 1 ;
}
2011-04-26 23:25:52 +08:00
if ( num_peer_notif < 0 | | num_peer_notif > 255 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1 \n " ,
num_peer_notif ) ;
2011-04-26 23:25:52 +08:00
num_peer_notif = 1 ;
}
2013-12-21 14:40:17 +08:00
/* reset values for 802.3ad/TLB/ALB */
2014-05-16 03:39:53 +08:00
if ( ! bond_mode_uses_arp ( bond_mode ) ) {
2005-04-17 06:20:36 +08:00
if ( ! miimon ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: miimon must be specified, otherwise bonding will not detect link failure, speed and duplex which are essential for 802.3ad operation \n " ) ;
pr_warn ( " Forcing miimon to 100msec \n " ) ;
bonding: disable arp and enable mii monitoring when bond change to no uses arp mode
ARP monitoring is not supported in 802.3ad mode, yet the mode could
still be changed to 802.3ad from ab mode while ARP monitoring was
running, which is incorrect.
So add a check for 802.3ad in bonding_store_mode to fix the problem,
and add a new macro BOND_NO_USES_ARP() to simplify the code.
v2: following Dan Williams's suggestion, the bond mode is the most
important bond option, so it should override any of the other sub-options.
When the mode is changed, conflicting values should therefore be cleared
or reset; otherwise the user has to perform extra operations to fix up
the configuration. ARP monitoring is disabled and MII monitoring is
enabled when the bond mode is changed to AB, TB and 8023AD if the arp
interval is set.
v3: following Nik's suggestion, the default value of miimon should have
a name, since it is used in several places, and bond_store_arp_interval()
can use the macro BOND_NO_USES_ARP to simplify the code.
Suggested-by: Dan Williams <dcbw@redhat.com>
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
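The "mode overrides sub-options" rule from the v2 note might look roughly like the following userspace sketch. All `toy_*` names are hypothetical. The set of modes without ARP support is an assumption taken from the "reset values for 802.3ad/TLB/ALB" comment, and the default of 100 comes from the "Forcing miimon to 100msec" warning.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DEFAULT_MIIMON 100	/* ms, per the "Forcing miimon to 100msec" warning */

enum toy_mode { TOY_ROUNDROBIN, TOY_ACTIVEBACKUP, TOY_8023AD, TOY_TLB, TOY_ALB };

struct toy_params {
	enum toy_mode mode;
	int miimon;		/* MII polling interval in ms, 0 = off */
	int arp_interval;	/* ARP polling interval in ms, 0 = off */
};

/* Assumption: 802.3ad, TLB and ALB are the modes that cannot use ARP
 * monitoring. */
static bool toy_mode_uses_arp(enum toy_mode mode)
{
	return mode != TOY_8023AD && mode != TOY_TLB && mode != TOY_ALB;
}

/* The mode wins over sub-options: switching to a mode without ARP
 * support turns ARP monitoring off and MII monitoring on. */
static void toy_set_mode(struct toy_params *p, enum toy_mode mode)
{
	assert(p);
	if (!toy_mode_uses_arp(mode) && p->arp_interval) {
		p->arp_interval = 0;
		p->miimon = TOY_DEFAULT_MIIMON;
	}
	p->mode = mode;
}
```

Switching a toy bond with `arp_interval = 1000` to `TOY_8023AD` leaves it with `arp_interval == 0` and `miimon == 100`, while switching to `TOY_ROUNDROBIN` leaves both values untouched.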
2013-11-22 22:28:43 +08:00
miimon = BOND_DEFAULT_MIIMON ;
2005-04-17 06:20:36 +08:00
}
}
2010-06-02 16:40:18 +08:00
if ( tx_queues < 1 | | tx_queues > 255 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: tx_queues (%d) should be between 1 and 255, resetting to %d \n " ,
tx_queues , BOND_DEFAULT_TX_QUEUES ) ;
2010-06-02 16:40:18 +08:00
tx_queues = BOND_DEFAULT_TX_QUEUES ;
}
2010-06-02 16:39:21 +08:00
if ( ( all_slaves_active ! = 0 ) & & ( all_slaves_active ! = 1 ) ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: all_slaves_active module parameter (%d), not of valid value (0/1), so it was set to 0 \n " ,
all_slaves_active ) ;
2010-06-02 16:39:21 +08:00
all_slaves_active = 0 ;
}
2010-10-05 22:23:59 +08:00
if ( resend_igmp < 0 | | resend_igmp > 255 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: resend_igmp (%d) should be between 0 and 255, resetting to %d \n " ,
resend_igmp , BOND_DEFAULT_RESEND_IGMP ) ;
2010-10-05 22:23:59 +08:00
resend_igmp = BOND_DEFAULT_RESEND_IGMP ;
}
2014-01-22 21:53:18 +08:00
bond_opt_initval ( & newval , packets_per_slave ) ;
if ( ! bond_opt_parse ( bond_opt_get ( BOND_OPT_PACKETS_PER_SLAVE ) , & newval ) ) {
2013-11-05 20:51:41 +08:00
pr_warn ( " Warning: packets_per_slave (%d) should be between 0 and %u resetting to 1 \n " ,
packets_per_slave , USHRT_MAX ) ;
packets_per_slave = 1 ;
}
2005-04-17 06:20:36 +08:00
if ( bond_mode = = BOND_MODE_ALB ) {
2009-12-14 12:06:07 +08:00
pr_notice ( " In ALB mode you might experience client disconnections upon reconnection of a link if the bonding module updelay parameter (%d msec) is incompatible with the forwarding delay time of the switch \n " ,
updelay ) ;
2005-04-17 06:20:36 +08:00
}
if ( ! miimon ) {
if ( updelay | | downdelay ) {
/* just warn the user the up/down delay will have
* no effect since miimon is zero . . .
*/
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: miimon module parameter not set and updelay (%d) or downdelay (%d) module parameter is set; updelay and downdelay have no effect unless miimon is set \n " ,
updelay , downdelay ) ;
2005-04-17 06:20:36 +08:00
}
} else {
/* don't allow arp monitoring */
if ( arp_interval ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: miimon (%d) and arp_interval (%d) can't be used simultaneously, disabling ARP monitoring \n " ,
miimon , arp_interval ) ;
2005-04-17 06:20:36 +08:00
arp_interval = 0 ;
}
if ( ( updelay % miimon ) ! = 0 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: updelay (%d) is not a multiple of miimon (%d), updelay rounded to %d ms \n " ,
updelay , miimon , ( updelay / miimon ) * miimon ) ;
2005-04-17 06:20:36 +08:00
}
updelay / = miimon ;
if ( ( downdelay % miimon ) ! = 0 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: downdelay (%d) is not a multiple of miimon (%d), downdelay rounded to %d ms \n " ,
downdelay , miimon ,
( downdelay / miimon ) * miimon ) ;
2005-04-17 06:20:36 +08:00
}
downdelay / = miimon ;
}
if ( arp_interval < 0 ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: arp_interval module parameter (%d), not in range 0-%d, so it was reset to 0 \n " ,
arp_interval , INT_MAX ) ;
2014-01-22 21:53:23 +08:00
arp_interval = 0 ;
2005-04-17 06:20:36 +08:00
}
2013-05-18 09:18:30 +08:00
for ( arp_ip_count = 0 , i = 0 ;
( arp_ip_count < BOND_MAX_ARP_TARGETS ) & & arp_ip_target [ i ] ; i + + ) {
2013-12-04 18:59:31 +08:00
__be32 ip ;
2014-09-15 23:19:34 +08:00
/* not a complete check, but good enough to catch mistakes */
2013-12-04 18:59:31 +08:00
if ( ! in4_pton ( arp_ip_target [ i ] , - 1 , ( u8 * ) & ip , - 1 , NULL ) | |
2014-05-16 03:39:56 +08:00
! bond_is_ip_target_ok ( ip ) ) {
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: bad arp_ip_target module parameter (%s), ARP monitoring will not be performed \n " ,
arp_ip_target [ i ] ) ;
2005-04-17 06:20:36 +08:00
arp_interval = 0 ;
} else {
2013-06-24 17:49:30 +08:00
if ( bond_get_targets_ip ( arp_target , ip ) = = - 1 )
arp_target [ arp_ip_count + + ] = ip ;
else
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: duplicate address %pI4 in arp_ip_target, skipping \n " ,
& ip ) ;
2005-04-17 06:20:36 +08:00
}
}
if ( arp_interval & & ! arp_ip_count ) {
/* don't allow arping if no arp_ip_target given... */
2014-02-16 07:57:04 +08:00
pr_warn ( " Warning: arp_interval module parameter (%d) specified without providing an arp_ip_target parameter, arp_interval was reset to 0 \n " ,
arp_interval ) ;
2005-04-17 06:20:36 +08:00
arp_interval = 0 ;
}
2006-09-23 12:54:53 +08:00
if ( arp_validate ) {
if ( ! arp_interval ) {
2009-12-14 12:06:07 +08:00
pr_err ( " arp_validate requires arp_interval \n " ) ;
2006-09-23 12:54:53 +08:00
return - EINVAL ;
}
2014-01-22 21:53:20 +08:00
bond_opt_initstr ( & newval , arp_validate ) ;
valptr = bond_opt_parse ( bond_opt_get ( BOND_OPT_ARP_VALIDATE ) ,
& newval ) ;
if ( ! valptr ) {
2009-12-14 12:06:07 +08:00
pr_err ( " Error: invalid arp_validate \" %s \" \n " ,
2014-01-22 21:53:20 +08:00
arp_validate ) ;
2006-09-23 12:54:53 +08:00
return - EINVAL ;
}
2014-01-22 21:53:20 +08:00
arp_validate_value = valptr - > value ;
} else {
2006-09-23 12:54:53 +08:00
arp_validate_value = 0 ;
2014-01-22 21:53:20 +08:00
}
2006-09-23 12:54:53 +08:00
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All such situations, obviously, rely on the idea that we need a *completely*
functional network, with all interfaces/addresses working correctly.
One real-world example might be:
vlans on top of a bond (hybrid port). If the bond and the vlans have ips
assigned and we have their peers monitored via arp_ip_target, then in case
of switch misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way, we will be able to switch
to another slave.
Any other configuration that needs access to all arp_ip_targets benefits
in the same way.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 17:49:34 +08:00
	if (arp_all_targets) {
2014-01-22 21:53:21 +08:00
		bond_opt_initstr(&newval, arp_all_targets);
		valptr = bond_opt_parse(bond_opt_get(BOND_OPT_ARP_ALL_TARGETS),
					&newval);
		if (!valptr) {
bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.
All such situations rely on the idea that we need a *completely*
functional network, with all interfaces/addresses working correctly.
One real-world example might be:
vlans on top of a bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.
The same applies to any other configuration that needs access to all
arp_ip_targets.
This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:
0 or any (the default) - which works exactly as it's working now -
the slave is up if any of the arp_ip_targets are up.
1 or all - the slave is up if all of the arp_ip_targets are up.
This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).
Internally it's done through:
1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
last time we've received arp from bond->params.arp_targets[i] on this
slave.
2) If we successfully validate an arp from bond->params.arp_targets[i] in
bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
current jiffies value.
3) When getting slave's last_rx via slave_last_rx(), we return the oldest
time when we've received an arp from any address in
bond->params.arp_targets[].
If the value of arp_all_targets == 0 - we still work the same way as
before.
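The three steps above can be sketched in user-space C. This is only a model under stated assumptions: `slave_model`, `tick_before()` and `model_last_rx()` are hypothetical names, the value of `MAX_TARGETS` is assumed, and a target slot holding 0 is assumed to mean "unused". It is not the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TARGETS 16	/* stand-in for BOND_MAX_ARP_TARGETS (value assumed) */

/* Wrap-safe "a is earlier than b" for a free-running tick counter. */
static int tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical, reduced view of the per-slave receive timestamps. */
struct slave_model {
	unsigned long last_rx;				/* last receive of any kind */
	unsigned long target_last_arp_rx[MAX_TARGETS];	/* last validated arp per target */
};

/*
 * With all_targets == 0 the plain last_rx is returned (old behaviour).
 * Otherwise the oldest per-target timestamp is returned, so a single
 * silent target makes the whole slave look stale.
 */
static unsigned long model_last_rx(const struct slave_model *s,
				   const unsigned int *targets,
				   size_t ntargets, int all_targets)
{
	unsigned long oldest = 0;
	int found = 0;
	size_t i;

	assert(ntargets <= MAX_TARGETS);
	if (!all_targets)
		return s->last_rx;

	for (i = 0; i < ntargets; i++) {
		if (targets[i] == 0)	/* unused slot (assumption) */
			continue;
		if (!found || tick_before(s->target_last_arp_rx[i], oldest)) {
			oldest = s->target_last_arp_rx[i];
			found = 1;
		}
	}
	return found ? oldest : s->last_rx;
}
```

A caller would compare the returned value against the current tick count to decide whether the slave is still considered up.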
Also, update the documentation to reflect the new parameter.
v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.
v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.
Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.
v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 17:49:34 +08:00
			pr_err("Error: invalid arp_all_targets_value \"%s\"\n",
			       arp_all_targets);
			arp_all_targets_value = 0;
2014-01-22 21:53:21 +08:00
		} else {
			arp_all_targets_value = valptr->value;
2013-06-24 17:49:34 +08:00
		}
	}
2005-04-17 06:20:36 +08:00
	if (miimon) {
2009-12-14 12:06:07 +08:00
		pr_info("MII link monitoring set to %d ms\n", miimon);
2005-04-17 06:20:36 +08:00
	} else if (arp_interval) {
2014-01-22 21:53:20 +08:00
		valptr = bond_opt_get_val(BOND_OPT_ARP_VALIDATE,
					  arp_validate_value);
2009-12-14 12:06:07 +08:00
		pr_info("ARP monitoring set to %d ms, validate %s, with %d target(s):",
2014-01-22 21:53:20 +08:00
			arp_interval, valptr->string, arp_ip_count);
2005-04-17 06:20:36 +08:00
		for (i = 0; i < arp_ip_count; i++)
2014-02-16 08:01:45 +08:00
			pr_cont(" %s", arp_ip_target[i]);
2005-04-17 06:20:36 +08:00
2014-02-16 08:01:45 +08:00
		pr_cont("\n");
2005-04-17 06:20:36 +08:00
2008-06-14 09:12:04 +08:00
	} else if (max_bonds) {
2005-04-17 06:20:36 +08:00
		/* miimon and arp_interval not set, we need one so things
		 * work as expected, see bonding.txt for details
		 */
2014-02-16 08:01:45 +08:00
		pr_debug("Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details\n");
2005-04-17 06:20:36 +08:00
	}
2014-05-16 03:39:54 +08:00
	if (primary && !bond_mode_uses_primary(bond_mode)) {
2005-04-17 06:20:36 +08:00
		/* currently, using a primary only makes sense
		 * in active backup, TLB or ALB modes
		 */
2014-02-16 07:57:04 +08:00
		pr_warn("Warning: %s primary device specified but has no effect in %s mode\n",
			primary, bond_mode_name(bond_mode));
2005-04-17 06:20:36 +08:00
		primary = NULL;
	}
2009-09-25 11:28:09 +08:00
	if (primary && primary_reselect) {
2014-01-22 21:53:33 +08:00
		bond_opt_initstr(&newval, primary_reselect);
		valptr = bond_opt_parse(bond_opt_get(BOND_OPT_PRIMARY_RESELECT),
					&newval);
		if (!valptr) {
2009-12-14 12:06:07 +08:00
			pr_err("Error: Invalid primary_reselect \"%s\"\n",
2014-01-22 21:53:33 +08:00
			       primary_reselect);
2009-09-25 11:28:09 +08:00
			return -EINVAL;
		}
2014-01-22 21:53:33 +08:00
		primary_reselect_value = valptr->value;
2009-09-25 11:28:09 +08:00
	} else {
		primary_reselect_value = BOND_PRI_RESELECT_ALWAYS;
	}
2008-05-18 12:10:14 +08:00
	if (fail_over_mac) {
2014-01-22 21:53:22 +08:00
		bond_opt_initstr(&newval, fail_over_mac);
		valptr = bond_opt_parse(bond_opt_get(BOND_OPT_FAIL_OVER_MAC),
					&newval);
		if (!valptr) {
2009-12-14 12:06:07 +08:00
			pr_err("Error: invalid fail_over_mac \"%s\"\n",
2014-01-22 21:53:22 +08:00
			       fail_over_mac);
2008-05-18 12:10:14 +08:00
			return -EINVAL;
		}
2014-01-22 21:53:22 +08:00
		fail_over_mac_value = valptr->value;
2008-05-18 12:10:14 +08:00
		if (bond_mode != BOND_MODE_ACTIVEBACKUP)
2014-02-16 07:57:04 +08:00
			pr_warn("Warning: fail_over_mac only affects active-backup mode\n");
2008-05-18 12:10:14 +08:00
	} else {
		fail_over_mac_value = BOND_FOM_NONE;
	}
2007-10-10 10:57:24 +08:00
2015-05-09 15:01:55 +08:00
	bond_opt_initstr(&newval, "default");
	valptr = bond_opt_parse(
			bond_opt_get(BOND_OPT_AD_ACTOR_SYS_PRIO),
			&newval);
	if (!valptr) {
		pr_err("Error: No ad_actor_sys_prio default value\n");
		return -EINVAL;
	}
	ad_actor_sys_prio = valptr->value;
2015-05-09 15:01:57 +08:00
	valptr = bond_opt_parse(bond_opt_get(BOND_OPT_AD_USER_PORT_KEY),
				&newval);
	if (!valptr) {
		pr_err("Error: No ad_user_port_key default value\n");
		return -EINVAL;
	}
	ad_user_port_key = valptr->value;
2017-09-12 20:10:05 +08:00
	bond_opt_initstr(&newval, "default");
	valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB), &newval);
	if (!valptr) {
		pr_err("Error: No tlb_dynamic_lb default value\n");
		return -EINVAL;
2017-03-09 02:55:56 +08:00
	}
2017-09-12 20:10:05 +08:00
	tlb_dynamic_lb = valptr->value;
2017-03-09 02:55:56 +08:00
2013-12-21 14:40:12 +08:00
	if (lp_interval == 0) {
2014-02-16 07:57:04 +08:00
		pr_warn("Warning: lp_interval must be between 1 and %d, so it was reset to %d\n",
			INT_MAX, BOND_ALB_DEFAULT_LP_INTERVAL);
2013-12-21 14:40:12 +08:00
		lp_interval = BOND_ALB_DEFAULT_LP_INTERVAL;
	}
2005-04-17 06:20:36 +08:00
	/* fill params struct with the proper values */
	params->mode = bond_mode;
2005-06-27 05:54:11 +08:00
	params->xmit_policy = xmit_hashtype;
2005-04-17 06:20:36 +08:00
	params->miimon = miimon;
2011-04-26 23:25:52 +08:00
	params->num_peer_notif = num_peer_notif;
2005-04-17 06:20:36 +08:00
	params->arp_interval = arp_interval;
2006-09-23 12:54:53 +08:00
	params->arp_validate = arp_validate_value;
2013-06-24 17:49:34 +08:00
	params->arp_all_targets = arp_all_targets_value;
2005-04-17 06:20:36 +08:00
	params->updelay = updelay;
	params->downdelay = downdelay;
bonding: add an option to specify a delay between peer notifications
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.
Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.
When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:
20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.
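The timing shown in the capture above can be modelled with a small user-space sketch. It assumes the delay is stored as a whole number of monitor ticks and that a counter is decremented on every tick, with a notification sent whenever the counter is a multiple of the tick delay. `delay_to_ticks()` and `simulate_peer_notif()` are hypothetical names for illustration, not driver functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: express a delay in ms as whole monitor ticks. */
static int delay_to_ticks(int delay_ms, int miimon_ms)
{
	int ticks;

	assert(miimon_ms > 0);
	ticks = delay_ms / miimon_ms;	/* rounds down to a multiple of miimon */
	return ticks > 0 ? ticks : 1;
}

/*
 * Simulate the monitor ticking every miimon_ms and record the time (in ms)
 * of each notification in out[]. Returns how many were recorded.
 */
static size_t simulate_peer_notif(int num_notif, int delay_ms, int miimon_ms,
				  int *out, size_t out_len)
{
	int ticks = delay_to_ticks(delay_ms, miimon_ms);
	int counter = num_notif * ticks;
	int now_ms = 0;
	size_t n = 0;

	while (counter > 0) {
		/* notify only once every `ticks` monitor runs */
		if (counter % ticks == 0 && n < out_len)
			out[n++] = now_ms;
		counter--;		/* decremented on every tick */
		now_ms += miimon_ms;
	}
	return n;
}
```

With `miimon` at 100 and a delay of 500, four requested notifications come out at 0, 500, 1000 and 1500 ms in this model, which matches the roughly 500 ms spacing in the capture.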
iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 01:43:54 +08:00
	params->peer_notif_delay = 0;
2005-04-17 06:20:36 +08:00
	params->use_carrier = use_carrier;
	params->lacp_fast = lacp_fast;
	params->primary[0] = 0;
2009-09-25 11:28:09 +08:00
	params->primary_reselect = primary_reselect_value;
2008-05-18 12:10:14 +08:00
	params->fail_over_mac = fail_over_mac_value;
2010-06-02 16:40:18 +08:00
	params->tx_queues = tx_queues;
2010-06-02 16:39:21 +08:00
	params->all_slaves_active = all_slaves_active;
2010-10-05 22:23:59 +08:00
	params->resend_igmp = resend_igmp;
2011-06-22 17:54:39 +08:00
	params->min_links = min_links;
2013-12-21 14:40:12 +08:00
	params->lp_interval = lp_interval;
reciprocal_divide: update/correction of the algorithm
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.
This has been fixed by Eric Dumazet in commit aee636c4809fa5
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
were very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d8d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:
1) Initialization:
int l = ceil(log_2 d)
uword m' = floor((1<<32)*((1<<l)-d)/d)+1
int sh_1 = min(l,1)
int sh_2 = max(l-1,0)
2) For q = n/d, all uword:
uword t = (n*m')>>32
q = (t+((n-t)>>sh_1))>>sh_2
The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.
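The pseudocode above translates fairly directly into portable C. The following is a user-space sketch with hypothetical names (`struct recip`, `recip_value()`, `recip_divide()`); it follows the two steps of the pseudocode and is not a copy of the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed multiplier and shifts for one divisor. */
struct recip {
	uint32_t m;
	uint8_t sh1, sh2;
};

/* l = ceil(log_2 d) for d >= 1, computed without floating point. */
static int ceil_log2_u32(uint32_t d)
{
	int l = 0;

	while (l < 32 && (UINT64_C(1) << l) < d)
		l++;
	return l;
}

/* Step 1: initialization for a fixed divisor d. */
static struct recip recip_value(uint32_t d)
{
	struct recip r;
	uint64_t m;
	int l;

	assert(d != 0);
	l = ceil_log2_u32(d);
	/* m' = floor(2^32 * (2^l - d) / d) + 1; the product fits in 64 bits */
	m = (UINT64_C(1) << 32) * ((UINT64_C(1) << l) - d) / d + 1;
	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);	/* min(l, 1) */
	r.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);	/* max(l - 1, 0) */
	return r;
}

/* Step 2: q = n / d using one multiply and two shifts. */
static uint32_t recip_divide(uint32_t n, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)n * r.m) >> 32);

	return (t + ((n - t) >> r.sh1)) >> r.sh2;
}
```

The divisor is processed once with `recip_value()`, and the result is reused for every later `recip_divide()` call. Division by 1, which the message lists as a problem of the old implementation, yields `n` unchanged here.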
Joint work with Daniel Borkmann.
[1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
[2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
[3] https://gmplib.org/~tege/division-paper.pdf
[4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
[5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
[6] http://www.agner.org/optimize/asmlib.zip
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 09:29:41 +08:00
	params->packets_per_slave = packets_per_slave;
2017-03-09 02:55:56 +08:00
	params->tlb_dynamic_lb = tlb_dynamic_lb;
2015-05-09 15:01:55 +08:00
	params->ad_actor_sys_prio = ad_actor_sys_prio;
2015-05-09 15:01:56 +08:00
	eth_zero_addr(params->ad_actor_system);
2015-05-09 15:01:57 +08:00
	params->ad_user_port_key = ad_user_port_key;
2014-01-22 09:29:41 +08:00
	if (packets_per_slave > 0) {
		params->reciprocal_packets_per_slave =
			reciprocal_value(packets_per_slave);
	} else {
		/* reciprocal_packets_per_slave is unused if
		 * packets_per_slave is 0 or 1, just initialize it
		 */
		params->reciprocal_packets_per_slave =
			(struct reciprocal_value) { 0 };
	}
2005-04-17 06:20:36 +08:00
	if (primary) {
		strncpy(params->primary, primary, IFNAMSIZ);
		params->primary[IFNAMSIZ - 1] = 0;
	}
	memcpy(params->arp_targets, arp_target, sizeof(arp_target));
	return 0;
}
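The function above repeatedly turns a user-supplied string into a numeric option value and either falls back to a default or fails. A reduced user-space sketch of that lookup pattern follows. The table contents ("any" = 0, "all" = 1) come from the arp_all_targets description in this file; `struct opt_value`, `opt_parse()` and `opt_value_or()` are hypothetical names, not the driver's API.

```c
#include <stdlib.h>
#include <string.h>

/* One accepted spelling of an option and the value it maps to. */
struct opt_value {
	const char *string;
	int value;
};

/* Table for arp_all_targets; a NULL string terminates the table. */
static const struct opt_value arp_all_targets_tbl[] = {
	{ "any", 0 },
	{ "all", 1 },
	{ NULL, -1 },
};

/*
 * Accept either the symbolic name or its decimal value.
 * Returns the matching table entry, or NULL if nothing matches.
 */
static const struct opt_value *opt_parse(const struct opt_value *tbl,
					 const char *s)
{
	char *end;
	long num;
	int is_num;

	if (!s || !*s)
		return NULL;

	num = strtol(s, &end, 10);
	is_num = (*end == '\0');	/* the whole string was a number */

	for (; tbl->string; tbl++) {
		if (is_num ? tbl->value == num : strcmp(tbl->string, s) == 0)
			return tbl;
	}
	return NULL;
}

/* Parse s, falling back to a default instead of failing. */
static int opt_value_or(const struct opt_value *tbl, const char *s,
			int fallback)
{
	const struct opt_value *v = opt_parse(tbl, s);

	return v ? v->value : fallback;
}
```

The fallback variant mirrors how an invalid arp_all_targets is reset to 0 above, while options such as primary_reselect would check the `opt_parse()` result and return an error instead.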
2014-09-15 23:19:34 +08:00
/* Called from registration process */
2009-06-13 03:02:52 +08:00
static int bond_init(struct net_device *bond_dev)
{
	struct bonding *bond = netdev_priv(bond_dev);
2009-10-29 22:18:26 +08:00
	struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
2009-06-13 03:02:52 +08:00
2014-07-16 01:35:58 +08:00
	netdev_dbg(bond_dev, "Begin bond_init\n");
2009-06-13 03:02:52 +08:00
bonding: Remove deprecated create_singlethread_workqueue
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "wq" queues multiple work items viz
&bond->mcast_work, &nnw->work, &bond->mii_work, &bond->arp_work,
&bond->alb_work, &bond->mii_work, &bond->ad_work, &bond->slave_arr_work
which require strict execution ordering. Hence, an ordered dedicated
workqueue has been used.
Since it is a network driver, WQ_MEM_RECLAIM has been set to
ensure forward progress under memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-31 00:32:01 +08:00
	bond->wq = alloc_ordered_workqueue(bond_dev->name, WQ_MEM_RECLAIM);
2009-06-13 03:02:52 +08:00
	if (!bond->wq)
		return -ENOMEM;
2019-10-22 02:47:53 +08:00
	spin_lock_init(&bond->stats_lock);
	lockdep_register_key(&bond->stats_lock_key);
	lockdep_set_class(&bond->stats_lock, &bond->stats_lock_key);
2009-06-13 03:02:52 +08:00
2009-10-29 22:18:26 +08:00
	list_add_tail(&bond->bond_list, &bn->dev_list);
2009-06-13 03:02:52 +08:00
2009-10-29 22:18:22 +08:00
	bond_prepare_sysfs_group(bond);
2010-04-02 05:22:57 +08:00
2010-12-09 23:17:13 +08:00
	bond_debug_register(bond);
2013-01-30 18:08:11 +08:00
	/* Ensure valid dev_addr */
	if (is_zero_ether_addr(bond_dev->dev_addr) &&
2013-06-26 23:13:38 +08:00
	    bond_dev->addr_assign_type == NET_ADDR_PERM)
2013-01-30 18:08:11 +08:00
		eth_hw_addr_random(bond_dev);
2009-06-13 03:02:52 +08:00
	return 0;
}
2013-10-18 23:43:33 +08:00
unsigned int bond_get_num_tx_queues(void)
2011-08-10 14:09:44 +08:00
{
2012-04-11 02:34:43 +08:00
	return tx_queues;
2011-08-10 14:09:44 +08:00
}
2005-11-10 02:36:04 +08:00
/* Create a new bond based on the specified name and bonding parameters.
2007-01-20 10:15:31 +08:00
 * If name is NULL, obtain a suitable "bond%d" name for us.
2005-11-10 02:36:04 +08:00
 * Caller must NOT hold rtnl_lock; we need to release it here before we
 * set up our sysfs entries.
 */
2009-10-29 22:18:26 +08:00
int bond_create(struct net *net, const char *name)
2005-11-10 02:36:04 +08:00
{
	struct net_device *bond_dev;
2015-04-30 02:24:23 +08:00
	struct bonding *bond;
	struct alb_bond_info *bond_info;
2005-11-10 02:36:04 +08:00
	int res;
	rtnl_lock();
2008-01-18 08:25:02 +08:00
2011-04-30 09:21:32 +08:00
	bond_dev = alloc_netdev_mq(sizeof(struct bonding),
net: set name_assign_type in alloc_netdev()
Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert
all users to pass NET_NAME_UNKNOWN.
Coccinelle patch:
@@
expression sizeof_priv, name, setup, txqs, rxqs, count;
@@
(
-alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs)
+alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs)
|
-alloc_netdev_mq(sizeof_priv, name, setup, count)
+alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count)
|
-alloc_netdev(sizeof_priv, name, setup)
+alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup)
)
v9: move comments here from the wrong commit
Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 22:37:24 +08:00
				   name ? name : "bond%d", NET_NAME_UNKNOWN,
2011-04-30 09:21:32 +08:00
				   bond_setup, tx_queues);
2005-11-10 02:36:04 +08:00
	if (!bond_dev) {
2009-12-14 12:06:07 +08:00
		pr_err("%s: eek! can't alloc netdev!\n", name);
2010-04-01 05:30:52 +08:00
		rtnl_unlock();
		return -ENOMEM;
2005-11-10 02:36:04 +08:00
	}
2015-04-30 02:24:23 +08:00
	/*
	 * Initialize rx_hashtbl_used_head to RLB_NULL_INDEX.
	 * It is set to 0 by default which is wrong.
	 */
	bond = netdev_priv(bond_dev);
	bond_info = &(BOND_ALB_INFO(bond));
	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
2009-10-29 22:18:26 +08:00
	dev_net_set(bond_dev, net);
2009-10-29 22:18:25 +08:00
	bond_dev->rtnl_link_ops = &bond_link_ops;
2005-11-10 02:36:04 +08:00
	res = register_netdevice(bond_dev);
2006-11-09 11:51:01 +08:00
2011-03-14 14:22:05 +08:00
	netif_carrier_off(bond_dev);
2017-03-09 02:55:54 +08:00
	bond_work_init_all(bond);
2009-06-13 03:02:46 +08:00
	rtnl_unlock();
2010-04-01 05:30:52 +08:00
	if (res < 0)
net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate resources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally also does a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
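The split described above can be illustrated with a toy user-space model. Everything here is hypothetical (`struct model_dev`, `model_register()`, `model_unregister()`, and the call counters); it only mirrors the rules stated in the message: a late registration failure runs uninit and the private destructor but leaves freeing to the caller, while unregistration frees only when `needs_free_netdev` is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a net device; the counters record what was invoked. */
struct model_dev {
	bool needs_free_netdev;
	void (*uninit)(struct model_dev *dev);
	void (*priv_destructor)(struct model_dev *dev);
	int uninit_calls;
	int priv_destructor_calls;
	int free_calls;		/* stands in for free_netdev() */
};

static void count_uninit(struct model_dev *dev)
{
	dev->uninit_calls++;
}

static void count_priv_destructor(struct model_dev *dev)
{
	dev->priv_destructor_calls++;
}

/* Registration that may fail after the init step has already succeeded. */
static int model_register(struct model_dev *dev, bool fail_after_init)
{
	assert(dev != NULL);
	if (!fail_after_init)
		return 0;

	/* Release private state, but never free the device itself here:
	 * the caller still owns it and will free it on error.
	 */
	if (dev->uninit)
		dev->uninit(dev);
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	return -1;
}

/* Unregistration: private teardown first, then the optional free. */
static void model_unregister(struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->uninit)
		dev->uninit(dev);
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	if (dev->needs_free_netdev)
		dev->free_calls++;
}
```

In this model the private resources are released exactly once on either path, and the device itself is never freed twice.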
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-09 00:52:56 +08:00
		free_netdev(bond_dev);
2009-10-29 22:18:23 +08:00
	return res;
2005-11-10 02:36:04 +08:00
}
2010-01-17 11:35:32 +08:00
static int __net_init bond_net_init(struct net *net)
2009-10-29 22:18:26 +08:00
{
2009-11-29 23:46:04 +08:00
	struct bond_net *bn = net_generic(net, bond_net_id);
2009-10-29 22:18:26 +08:00
	bn->net = net;
	INIT_LIST_HEAD(&bn->dev_list);
	bond_create_proc_dir(bn);
2011-10-13 05:56:25 +08:00
	bond_create_sysfs(bn);
2013-06-24 17:49:29 +08:00
2009-11-29 23:46:04 +08:00
	return 0;
2009-10-29 22:18:26 +08:00
}
2010-01-17 11:35:32 +08:00
static void __net_exit bond_net_exit(struct net *net)
2009-10-29 22:18:26 +08:00
{
2009-11-29 23:46:04 +08:00
	struct bond_net *bn = net_generic(net, bond_net_id);
2013-04-06 08:54:38 +08:00
	struct bonding *bond, *tmp_bond;
	LIST_HEAD(list);
2009-10-29 22:18:26 +08:00
2011-10-13 05:56:25 +08:00
	bond_destroy_sysfs(bn);
2013-04-06 08:54:38 +08:00
	/* Kill off any bonds created after unregistering bond rtnl ops */
	rtnl_lock();
	list_for_each_entry_safe(bond, tmp_bond, &bn->dev_list, bond_list)
		unregister_netdevice_queue(bond->dev, &list);
	unregister_netdevice_many(&list);
	rtnl_unlock();
2014-07-17 18:04:08 +08:00
	bond_destroy_proc_dir(bn);
2009-10-29 22:18:26 +08:00
}
static struct pernet_operations bond_net_ops = {
	.init = bond_net_init,
	.exit = bond_net_exit,
2009-11-29 23:46:04 +08:00
	.id   = &bond_net_id,
	.size = sizeof(struct bond_net),
2009-10-29 22:18:26 +08:00
};
2005-04-17 06:20:36 +08:00
static int __init bonding_init ( void )
{
int i ;
int res ;
2011-03-07 05:58:46 +08:00
pr_info ( " %s " , bond_version ) ;
2005-04-17 06:20:36 +08:00
2005-11-10 02:36:04 +08:00
res = bond_check_params ( & bonding_defaults ) ;
2009-06-13 03:02:48 +08:00
if ( res )
2005-11-10 02:36:04 +08:00
goto out ;
2005-04-17 06:20:36 +08:00
2009-11-29 23:46:04 +08:00
res = register_pernet_subsys ( & bond_net_ops ) ;
2009-10-29 22:18:26 +08:00
if ( res )
goto out ;
2008-01-18 08:25:02 +08:00
2013-10-18 23:43:33 +08:00
res = bond_netlink_init ( ) ;
2009-10-29 22:18:25 +08:00
if ( res )
2009-10-30 07:58:54 +08:00
goto err_link ;
2009-10-29 22:18:25 +08:00
2010-12-09 23:17:13 +08:00
bond_create_debugfs ( ) ;
2005-04-17 06:20:36 +08:00
for ( i = 0 ; i < max_bonds ; i + + ) {
2009-10-29 22:18:26 +08:00
res = bond_create ( & init_net , NULL ) ;
2005-11-10 02:36:04 +08:00
if ( res )
goto err ;
2005-04-17 06:20:36 +08:00
}
bonding: balance ICMP echoes in layer3+4 mode
The bonding driver uses the L4 ports to balance flows between slaves. As the
ICMP protocol has no ports, those packets are all sent to the same device:
# tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
# tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64
However, some ICMP packets carry an Identifier field which is used to
match packets within a session. Let's use this value in the hash
function to balance these packets between bond slaves:
# ping -qc1 192.168.0.2
0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
# ping -qc1 192.168.0.2
1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64
Also, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:
# ping -q 192.168.1.2 -c1
0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
# ping -q 192.168.1.2 -c1
1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
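The effect of mixing the ICMP Identifier into the hash can be shown with a deliberately simplified sketch. This is not the bonding hash, which goes through the flow dissector initialized just below: `struct toy_flow`, `toy_hash()` and `toy_pick_slave()` are hypothetical names, and the XOR-and-modulo hash is an assumption chosen only to keep the example short. It does show the two properties visible in the captures above: a request and its reply land on the same slave, and different Identifier values can land on different slaves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced flow description for the illustration. */
struct toy_flow {
    uint32_t saddr;     /* IPv4 source address */
    uint32_t daddr;     /* IPv4 destination address */
    uint16_t icmp_id;   /* Identifier of an ICMP echo request/reply */
};

/*
 * Toy hash: XOR of both addresses and the ICMP Identifier.
 * XOR is commutative, so swapping saddr and daddr (request vs. reply)
 * yields the same value.
 */
static uint32_t toy_hash(const struct toy_flow *f)
{
    return f->saddr ^ f->daddr ^ f->icmp_id;
}

/* Map the hash onto one of n_slaves devices. */
static unsigned int toy_pick_slave(const struct toy_flow *f,
                                   unsigned int n_slaves)
{
    assert(n_slaves > 0);
    return toy_hash(f) % n_slaves;
}
```

Without the `icmp_id` term, every echo between the same two hosts would hash identically, which is the behaviour shown in the first capture of the commit message.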
2019-10-29 21:50:53 +08:00
skb_flow_dissector_init ( & flow_keys_bonding ,
flow_keys_bonding_keys ,
ARRAY_SIZE ( flow_keys_bonding_keys ) ) ;
2005-04-17 06:20:36 +08:00
register_netdevice_notifier ( & bond_netdev_notifier ) ;
2005-11-10 02:36:04 +08:00
out :
2005-04-17 06:20:36 +08:00
return res ;
2009-10-29 22:18:25 +08:00
err :
2014-04-09 18:52:59 +08:00
bond_destroy_debugfs ( ) ;
2013-10-18 23:43:33 +08:00
bond_netlink_fini ( ) ;
2009-10-30 07:58:54 +08:00
err_link :
2009-11-29 23:46:04 +08:00
unregister_pernet_subsys ( & bond_net_ops ) ;
2009-10-29 22:18:25 +08:00
goto out ;
2005-11-10 02:36:04 +08:00
2005-04-17 06:20:36 +08:00
}
static void __exit bonding_exit ( void )
{
unregister_netdevice_notifier ( & bond_netdev_notifier ) ;
2010-12-09 23:17:13 +08:00
bond_destroy_debugfs ( ) ;
2008-05-03 08:49:39 +08:00
2013-10-18 23:43:33 +08:00
bond_netlink_fini ( ) ;
2013-04-06 08:54:37 +08:00
unregister_pernet_subsys ( & bond_net_ops ) ;
2010-10-14 00:01:50 +08:00
# ifdef CONFIG_NET_POLL_CONTROLLER
2014-09-15 23:19:34 +08:00
/* Make sure we don't have an imbalance on our netpoll blocking */
net: Convert netpoll blocking api in bonding driver to be a counter
A while back I made some changes to enable netpoll in the bonding driver. Among
them was a per-cpu flag that indicated we were in a path that held locks which
could cause the netpoll path to block during tx, and as such the tx path
should queue the frame for later use. This appears to have given rise to a
regression. If one of those paths on which we hold the per-cpu flag yields the
cpu, it's possible for us to come back on a different cpu, leading to us clearing
a different flag than we set. This results in odd netpoll drops, and BUG
backtraces appearing in the log, as we check to make sure that we only clear set
bits, and only set clear bits. I had thought briefly about changing the
offending paths so that they wouldn't sleep, but looking at my original work
more closely, it doesn't appear that a per-cpu flag is warranted. We already
gate the checking of this flag on IFF_IN_NETPOLL, so we don't hit this in the
normal tx case anyway. And practically speaking, the normal use case for
netpoll is to only have one client anyway, so we're not going to erroneously
queue netpoll frames when it's actually safe to do so. As such, let's just
convert that per-cpu flag to an atomic counter. It fixes the rescheduling bugs,
is equivalent from a performance perspective and actually eliminates some code
in the process.
Tested successfully by the reporter and myself.
Reported-by: Liang Zheng <lzheng@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
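The counter approach from this commit message can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the driver's code: the `toy_` names are hypothetical, and `assert()` stands in for the kernel's warning on imbalance. The point is that a single shared counter stays balanced even if the blocking and unblocking calls run on different CPUs, which a per-cpu flag cannot guarantee.

```c
#include <assert.h>
#include <stdatomic.h>

/* One counter shared by all CPUs; zero means netpoll tx is not blocked. */
static atomic_int toy_netpoll_block_tx;

/* Enter a section during which netpoll tx must be queued. */
static void toy_block_netpoll_tx(void)
{
    atomic_fetch_add(&toy_netpoll_block_tx, 1);
}

/* Leave the section; may run on a different CPU than the matching block. */
static void toy_unblock_netpoll_tx(void)
{
    int prev = atomic_fetch_sub(&toy_netpoll_block_tx, 1);

    /* An unblock without a matching block would be an imbalance. */
    assert(prev > 0);
}

/* Nonzero while at least one blocking section is active. */
static int toy_is_netpoll_tx_blocked(void)
{
    return atomic_load(&toy_netpoll_block_tx) != 0;
}
```

The WARN_ON below checks the same invariant at module exit: once every section has been left, the counter must be back at zero.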
2010-12-06 17:05:50 +08:00
WARN_ON ( atomic_read ( & netpoll_block_tx ) ) ;
2010-10-14 00:01:50 +08:00
# endif
2005-04-17 06:20:36 +08:00
}
module_init ( bonding_init ) ;
module_exit ( bonding_exit ) ;
MODULE_LICENSE ( " GPL " ) ;
MODULE_VERSION ( DRV_VERSION ) ;
MODULE_DESCRIPTION ( DRV_DESCRIPTION " , v " DRV_VERSION ) ;
MODULE_AUTHOR ( " Thomas Davis, tadavis@lbl.gov and many others " ) ;