linux-next/Documentation/networking/netdevices.txt


Network Devices, the Kernel, and You!


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device allocation rules
==================================
Network device structures need to persist even after module is unloaded and
must be allocated with kmalloc.  If device has registered successfully,
it will be freed on last use by free_netdev.  This is required to handle the
pathologic case cleanly (example: rmmod mydriver </sys/class/net/myeth/mtu )

There are routines in net_init.c to handle the common cases of
alloc_etherdev, alloc_netdev.  These reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(dev->priv) then it is up to the module exit handler to free that.

MTU
===
Each network device has a Maximum Transfer Unit. The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the mtu. The MTU does not include link layer header overhead, so
for example on Ethernet if the standard MTU is 1500 bytes used, the
actual skb will contain up to 1514 bytes because of the Ethernet
header. Devices should allow for the 4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule.  The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as mechanism to size receive
buffers, but the device should allow packets with VLAN header. With
standard Ethernet mtu of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag).  The device may either:
drop, truncate, or pass up oversize packets, but dropping oversize
packets is preferred.


struct net_device synchronization rules
=======================================
dev->open:
	Synchronization: rtnl_lock() semaphore.
	Context: process

dev->stop:
	Synchronization: rtnl_lock() semaphore.
	Context: process
	Note1: netif_running() is guaranteed false
	Note2: dev->poll() is guaranteed to be stopped

dev->do_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

dev->get_stats:
	Synchronization: dev_base_lock rwlock.
	Context: nominally process, but don't sleep inside an rwlock

dev->hard_start_xmit:
	Synchronization: netif_tx_lock spinlock.

	When the driver sets NETIF_F_LLTX in dev->features this will be
	called without holding netif_tx_lock. In this case the driver
	has to lock by itself when needed. It is recommended to use a try lock
	for this and return NETDEV_TX_LOCKED when the spin lock fails.
	The locking there should also properly protect against 
	set_multicast_list.

	Context: Process with BHs disabled or BH (timer),
	         will be called with interrupts disabled by netconsole.

	Return codes: 
	o NETDEV_TX_OK everything ok. 
	o NETDEV_TX_BUSY Cannot transmit packet, try later 
	  Usually a bug, means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.
	o NETDEV_TX_LOCKED Locking failed, please retry quickly.
	  Only valid when NETIF_F_LLTX is set.

dev->tx_timeout:
	Synchronization: netif_tx_lock spinlock.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

dev->set_multicast_list:
	Synchronization: netif_tx_lock spinlock.
	Context: BHs disabled

struct napi_struct synchronization rules
========================================
napi->poll:
	Synchronization: NAPI_STATE_SCHED bit in napi->state.  Device
		driver's dev->close method will invoke napi_disable() on
		all NAPI instances which will do a sleeping poll on the
		NAPI_STATE_SCHED napi->state bit, waiting for all pending
		NAPI activity to cease.
	Context: softirq
	         will be called with interrupts disabled by netconsole.
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00
			`Network Devices, the Kernel, and You!`


			`Introduction`
			`============`
			`The following is a random collection of documentation regarding`
			`network devices.`

			`struct net_device allocation rules`
			`==================================`
			`Network device structures need to persist even after module is unloaded and`
			`must be allocated with kmalloc. If device has registered successfully,`
			`it will be freed on last use by free_netdev. This is required to handle the`
			`pathologic case cleanly (example: rmmod mydriver </sys/class/net/myeth/mtu )`

			`There are routines in net_init.c to handle the common cases of`
			`alloc_etherdev, alloc_netdev. These reserve extra space for driver`
			`private data which gets freed when the network device is freed. If`
			`separately allocated data is attached to the network device`
			`(dev->priv) then it is up to the module exit handler to free that.`

[NET]: netdevice mtu assumptions documentation Document the expectations about device MTU handling. The documentation about oversize packet handling is probably too loose. IMHO devices should drop oversize packets for robustness, but many devices allow it now. For example, if you set mtu to 1200 bytes, most ether devices will allow a 1500 byte frame in. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> 2007-07-08 14:03:44 +08:00			`MTU`
			`===`
			`Each network device has a Maximum Transfer Unit. The MTU does not`
			`include any link layer protocol overhead. Upper layer protocols must`
			`not pass a socket buffer (skb) to a device to transmit with more data`
			`than the mtu. The MTU does not include link layer header overhead, so`
			`for example on Ethernet if the standard MTU is 1500 bytes used, the`
			`actual skb will contain up to 1514 bytes because of the Ethernet`
			`header. Devices should allow for the 4 byte VLAN header as well.`

			`Segmentation Offload (GSO, TSO) is an exception to this rule. The`
			`upper layer protocol may pass a large socket buffer to the device`
			`transmit routine, and the device will break that up into separate`
			`packets based on the current MTU.`

			`MTU is symmetrical and applies both to receive and transmit. A device`
			`must be able to receive at least the maximum size packet allowed by`
			`the MTU. A network device may use the MTU as mechanism to size receive`
			`buffers, but the device should allow packets with VLAN header. With`
			`standard Ethernet mtu of 1500 bytes, the device should allow up to`
			`1518 byte packets (1500 + 14 header + 4 tag). The device may either:`
			`drop, truncate, or pass up oversize packets, but dropping oversize`
			`packets is preferred.`


Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`struct net_device synchronization rules`
			`=======================================`
			`dev->open:`
			`Synchronization: rtnl_lock() semaphore.`
			`Context: process`

			`dev->stop:`
			`Synchronization: rtnl_lock() semaphore.`
			`Context: process`
			`Note1: netif_running() is guaranteed false`
			`Note2: dev->poll() is guaranteed to be stopped`

			`dev->do_ioctl:`
			`Synchronization: rtnl_lock() semaphore.`
			`Context: process`

			`dev->get_stats:`
			`Synchronization: dev_base_lock rwlock.`
			`Context: nominally process, but don't sleep inside an rwlock`

			`dev->hard_start_xmit:`
[NET]: Add netif_tx_lock Various drivers use xmit_lock internally to synchronise with their transmission routines. They do so without setting xmit_lock_owner. This is fine as long as netpoll is not in use. With netpoll it is possible for deadlocks to occur if xmit_lock_owner isn't set. This is because if a printk occurs while xmit_lock is held and xmit_lock_owner is not set can cause netpoll to attempt to take xmit_lock recursively. While it is possible to resolve this by getting netpoll to use trylock, it is suboptimal because netpoll's sole objective is to maximise the chance of getting the printk out on the wire. So delaying or dropping the message is to be avoided as much as possible. So the only alternative is to always set xmit_lock_owner. The following patch does this by introducing the netif_tx_lock family of functions that take care of setting/unsetting xmit_lock_owner. I renamed xmit_lock to _xmit_lock to indicate that it should not be used directly. I didn't provide irq versions of the netif_tx_lock functions since xmit_lock is meant to be a BH-disabling lock. This is pretty much a straight text substitution except for a small bug fix in winbond. It currently uses netif_stop_queue/spin_unlock_wait to stop transmission. This is unsafe as an IRQ can potentially wake up the queue. So it is safer to use netif_tx_disable. The hamradio bits used spin_lock_irq but it is unnecessary as xmit_lock must never be taken in an IRQ handler. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> 2006-06-10 03:20:56 +08:00			`Synchronization: netif_tx_lock spinlock.`
[NET]: netdevice locking assumptions documentation Update the documentation about locking assumptions. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> 2007-07-08 13:59:14 +08:00
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`When the driver sets NETIF_F_LLTX in dev->features this will be`
[NET]: Add netif_tx_lock Various drivers use xmit_lock internally to synchronise with their transmission routines. They do so without setting xmit_lock_owner. This is fine as long as netpoll is not in use. With netpoll it is possible for deadlocks to occur if xmit_lock_owner isn't set. This is because if a printk occurs while xmit_lock is held and xmit_lock_owner is not set can cause netpoll to attempt to take xmit_lock recursively. While it is possible to resolve this by getting netpoll to use trylock, it is suboptimal because netpoll's sole objective is to maximise the chance of getting the printk out on the wire. So delaying or dropping the message is to be avoided as much as possible. So the only alternative is to always set xmit_lock_owner. The following patch does this by introducing the netif_tx_lock family of functions that take care of setting/unsetting xmit_lock_owner. I renamed xmit_lock to _xmit_lock to indicate that it should not be used directly. I didn't provide irq versions of the netif_tx_lock functions since xmit_lock is meant to be a BH-disabling lock. This is pretty much a straight text substitution except for a small bug fix in winbond. It currently uses netif_stop_queue/spin_unlock_wait to stop transmission. This is unsafe as an IRQ can potentially wake up the queue. So it is safer to use netif_tx_disable. The hamradio bits used spin_lock_irq but it is unnecessary as xmit_lock must never be taken in an IRQ handler. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> 2006-06-10 03:20:56 +08:00			`called without holding netif_tx_lock. In this case the driver`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`has to lock by itself when needed. It is recommended to use a try lock`
[NET]: netdevice locking assumptions documentation Update the documentation about locking assumptions. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> 2007-07-08 13:59:14 +08:00			`for this and return NETDEV_TX_LOCKED when the spin lock fails.`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`The locking there should also properly protect against`
[NET]: netdevice locking assumptions documentation Update the documentation about locking assumptions. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> 2007-07-08 13:59:14 +08:00			`set_multicast_list.`

			`Context: Process with BHs disabled or BH (timer),`
			`will be called with interrupts disabled by netconsole.`

Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`Return codes:`
			`o NETDEV_TX_OK everything ok.`
			`o NETDEV_TX_BUSY Cannot transmit packet, try later`
			`Usually a bug, means queue start/stop flow control is broken in`
			`the driver. Note: the driver must NOT put the skb in its DMA ring.`
			`o NETDEV_TX_LOCKED Locking failed, please retry quickly.`
			`Only valid when NETIF_F_LLTX is set.`

			`dev->tx_timeout:`
[NET]: Add netif_tx_lock Various drivers use xmit_lock internally to synchronise with their transmission routines. They do so without setting xmit_lock_owner. This is fine as long as netpoll is not in use. With netpoll it is possible for deadlocks to occur if xmit_lock_owner isn't set. This is because if a printk occurs while xmit_lock is held and xmit_lock_owner is not set can cause netpoll to attempt to take xmit_lock recursively. While it is possible to resolve this by getting netpoll to use trylock, it is suboptimal because netpoll's sole objective is to maximise the chance of getting the printk out on the wire. So delaying or dropping the message is to be avoided as much as possible. So the only alternative is to always set xmit_lock_owner. The following patch does this by introducing the netif_tx_lock family of functions that take care of setting/unsetting xmit_lock_owner. I renamed xmit_lock to _xmit_lock to indicate that it should not be used directly. I didn't provide irq versions of the netif_tx_lock functions since xmit_lock is meant to be a BH-disabling lock. This is pretty much a straight text substitution except for a small bug fix in winbond. It currently uses netif_stop_queue/spin_unlock_wait to stop transmission. This is unsafe as an IRQ can potentially wake up the queue. So it is safer to use netif_tx_disable. The hamradio bits used spin_lock_irq but it is unnecessary as xmit_lock must never be taken in an IRQ handler. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> 2006-06-10 03:20:56 +08:00			`Synchronization: netif_tx_lock spinlock.`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`Context: BHs disabled`
			`Notes: netif_queue_stopped() is guaranteed true`

			`dev->set_multicast_list:`
[NET]: Add netif_tx_lock Various drivers use xmit_lock internally to synchronise with their transmission routines. They do so without setting xmit_lock_owner. This is fine as long as netpoll is not in use. With netpoll it is possible for deadlocks to occur if xmit_lock_owner isn't set. This is because if a printk occurs while xmit_lock is held and xmit_lock_owner is not set can cause netpoll to attempt to take xmit_lock recursively. While it is possible to resolve this by getting netpoll to use trylock, it is suboptimal because netpoll's sole objective is to maximise the chance of getting the printk out on the wire. So delaying or dropping the message is to be avoided as much as possible. So the only alternative is to always set xmit_lock_owner. The following patch does this by introducing the netif_tx_lock family of functions that take care of setting/unsetting xmit_lock_owner. I renamed xmit_lock to _xmit_lock to indicate that it should not be used directly. I didn't provide irq versions of the netif_tx_lock functions since xmit_lock is meant to be a BH-disabling lock. This is pretty much a straight text substitution except for a small bug fix in winbond. It currently uses netif_stop_queue/spin_unlock_wait to stop transmission. This is unsafe as an IRQ can potentially wake up the queue. So it is safer to use netif_tx_disable. The hamradio bits used spin_lock_irq but it is unnecessary as xmit_lock must never be taken in an IRQ handler. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> 2006-06-10 03:20:56 +08:00			`Synchronization: netif_tx_lock spinlock.`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`Context: BHs disabled`

[NET]: Make NAPI polling independent of struct net_device objects. Several devices have multiple independant RX queues per net device, and some have a single interrupt doorbell for several queues. In either case, it's easier to support layouts like that if the structure representing the poll is independant from the net device itself. The signature of the ->poll() call back goes from: int foo_poll(struct net_device dev, int budget) to int foo_poll(struct napi_struct napi, int budget) The caller is returned the number of RX packets processed (or the number of "NAPI credits" consumed if you want to get abstract). The callee no longer messes around bumping dev->quota, budget, etc. because that is all handled in the caller upon return. The napi_struct is to be embedded in the device driver private data structures. Furthermore, it is the driver's responsibility to disable all NAPI instances in it's ->stop() device close handler. Since the napi_struct is privatized into the driver's private data structures, only the driver knows how to get at all of the napi_struct instances it may have per-device. With lots of help and suggestions from Rusty Russell, Roland Dreier, Michael Chan, Jeff Garzik, and Jamal Hadi Salim. Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra, Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan. [ Ported to current tree and all drivers converted. Integrated Stephen's follow-on kerneldoc additions, and restored poll_list handling to the old style to fix mutual exclusion issues. -DaveM ] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> 2007-10-04 07:41:36 +08:00			`struct napi_struct synchronization rules`
			`========================================`
			`napi->poll:`
			`Synchronization: NAPI_STATE_SCHED bit in napi->state. Device`
			`driver's dev->close method will invoke napi_disable() on`
			`all NAPI instances which will do a sleeping poll on the`
			`NAPI_STATE_SCHED napi->state bit, waiting for all pending`
			`NAPI activity to cease.`
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! 2005-04-17 06:20:36 +08:00			`Context: softirq`
[NET]: netdevice locking assumptions documentation Update the documentation about locking assumptions. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> 2007-07-08 13:59:14 +08:00			`will be called with interrupts disabled by netconsole.`