Commit Graph

206 Commits

Author SHA1 Message Date
Dave Taht
112f9cb656 netem: convert to qdisc_watchdog_schedule_ns
Upgrade the internal netem scheduler to use nanoseconds rather than
ticks throughout.

Convert to and from the std "ticks" userspace api automatically,
while allowing for finer grained scheduling to take place.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:15:47 +09:00
Eric Dumazet
18a4c0eab2 net: add rb_to_skb() and other rb tree helpers
Geeralize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
Eric Dumazet
3aa605f28b sch_netem: faster rb tree removal
While running TCP tests involving netem storing millions of packets,
I had the idea to speed up tfifo_reset() and did experiments.

I tried the rbtree_postorder_for_each_entry_safe() method that is
used in skb_rbtree_purge() but discovered it was slower than the
current tfifo_reset() method.

I measured time taken to release skbs with three occupation levels :
10^4, 10^5 and 10^6 skbs with three methods :

1) (current 'naive' method)

	while ((p = rb_first(&q->t_root))) {
		struct sk_buff *skb = netem_rb_to_skb(p);

		rb_erase(p, &q->t_root);
		rtnl_kfree_skbs(skb, skb);
	}

2) Use rb_next() instead of rb_first() in the loop :

	p = rb_first(&q->t_root);
	while (p) {
		struct sk_buff *skb = netem_rb_to_skb(p);

		p = rb_next(p);
		rb_erase(&skb->rbnode, &q->t_root);
		rtnl_kfree_skbs(skb, skb);
	}

3) "optimized" method using rbtree_postorder_for_each_entry_safe()

	struct sk_buff *skb, *next;

	rbtree_postorder_for_each_entry_safe(skb, next,
					     &q->t_root, rbnode) {
               rtnl_kfree_skbs(skb, skb);
	}
	q->t_root = RB_ROOT;

Results :

method_1:while (rb_first()) rb_erase() 10000 skbs in 690378 ns (69 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...}  10000 skbs in 541846 ns (54 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 10000 skbs in 868307 ns (86 ns per skb)

method_1:while (rb_first()) rb_erase() 99996 skbs in 7804021 ns (78 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...}  100000 skbs in 5942456 ns (59 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 100000 skbs in 11584940 ns (115 ns per skb)

method_1:while (rb_first()) rb_erase() 1000000 skbs in 108577838 ns (108 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...}  1000000 skbs in 82619635 ns (82 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 1000000 skbs in 127328743 ns (127 ns per skb)

Method 2) is simply faster, probably because it maintains a smaller
working size set.

Note that this is the method we use in tcp_ofo_queue() already.

I will also change skb_rbtree_purge() in a second patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 20:31:32 -07:00
Eric Dumazet
bffa72cf7f net: sk_buff rbnode reorg
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-19 15:20:22 -07:00
David S. Miller
6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Nikolay Aleksandrov
634576a184 sch_netem: avoid null pointer deref on init failure
netem can fail in ->init due to missing options (either not supplied by
user-space or used as a default qdisc) causing a timer->base null
pointer deref in its ->destroy() and ->reset() callbacks.

Reproduce:
$ sysctl net.core.default_qdisc=netem
$ ip l set ethX up

Crash log:
[ 1814.846943] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1814.847181] IP: hrtimer_active+0x17/0x8a
[ 1814.847270] PGD 59c34067
[ 1814.847271] P4D 59c34067
[ 1814.847337] PUD 37374067
[ 1814.847403] PMD 0
[ 1814.847468]
[ 1814.847582] Oops: 0000 [#1] SMP
[ 1814.847655] Modules linked in: sch_netem(O) sch_fq_codel(O)
[ 1814.847761] CPU: 3 PID: 1573 Comm: ip Tainted: G           O 4.13.0-rc6+ #62
[ 1814.847884] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 1814.848043] task: ffff88003723a700 task.stack: ffff88005adc8000
[ 1814.848235] RIP: 0010:hrtimer_active+0x17/0x8a
[ 1814.848407] RSP: 0018:ffff88005adcb590 EFLAGS: 00010246
[ 1814.848590] RAX: 0000000000000000 RBX: ffff880058e359d8 RCX: 0000000000000000
[ 1814.848793] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880058e359d8
[ 1814.848998] RBP: ffff88005adcb5b0 R08: 00000000014080c0 R09: 00000000ffffffff
[ 1814.849204] R10: ffff88005adcb660 R11: 0000000000000020 R12: 0000000000000000
[ 1814.849410] R13: ffff880058e359d8 R14: 00000000ffffffff R15: 0000000000000001
[ 1814.849616] FS:  00007f733bbca740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[ 1814.849919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1814.850107] CR2: 0000000000000000 CR3: 0000000059f0d000 CR4: 00000000000406e0
[ 1814.850313] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1814.850518] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1814.850723] Call Trace:
[ 1814.850875]  hrtimer_try_to_cancel+0x1a/0x93
[ 1814.851047]  hrtimer_cancel+0x15/0x20
[ 1814.851211]  qdisc_watchdog_cancel+0x12/0x14
[ 1814.851383]  netem_reset+0xe6/0xed [sch_netem]
[ 1814.851561]  qdisc_destroy+0x8b/0xe5
[ 1814.851723]  qdisc_create_dflt+0x86/0x94
[ 1814.851890]  ? dev_activate+0x129/0x129
[ 1814.852057]  attach_one_default_qdisc+0x36/0x63
[ 1814.852232]  netdev_for_each_tx_queue+0x3d/0x48
[ 1814.852406]  dev_activate+0x4b/0x129
[ 1814.852569]  __dev_open+0xe7/0x104
[ 1814.852730]  __dev_change_flags+0xc6/0x15c
[ 1814.852899]  dev_change_flags+0x25/0x59
[ 1814.853064]  do_setlink+0x30c/0xb3f
[ 1814.853228]  ? check_chain_key+0xb0/0xfd
[ 1814.853396]  ? check_chain_key+0xb0/0xfd
[ 1814.853565]  rtnl_newlink+0x3a4/0x729
[ 1814.853728]  ? rtnl_newlink+0x117/0x729
[ 1814.853905]  ? ns_capable_common+0xd/0xb1
[ 1814.854072]  ? ns_capable+0x13/0x15
[ 1814.854234]  rtnetlink_rcv_msg+0x188/0x197
[ 1814.854404]  ? rcu_read_unlock+0x3e/0x5f
[ 1814.854572]  ? rtnl_newlink+0x729/0x729
[ 1814.854737]  netlink_rcv_skb+0x6c/0xce
[ 1814.854902]  rtnetlink_rcv+0x23/0x2a
[ 1814.855064]  netlink_unicast+0x103/0x181
[ 1814.855230]  netlink_sendmsg+0x326/0x337
[ 1814.855398]  sock_sendmsg_nosec+0x14/0x3f
[ 1814.855584]  sock_sendmsg+0x29/0x2e
[ 1814.855747]  ___sys_sendmsg+0x209/0x28b
[ 1814.855912]  ? do_raw_spin_unlock+0xcd/0xf8
[ 1814.856082]  ? _raw_spin_unlock+0x27/0x31
[ 1814.856251]  ? __handle_mm_fault+0x651/0xdb1
[ 1814.856421]  ? check_chain_key+0xb0/0xfd
[ 1814.856592]  __sys_sendmsg+0x45/0x63
[ 1814.856755]  ? __sys_sendmsg+0x45/0x63
[ 1814.856923]  SyS_sendmsg+0x19/0x1b
[ 1814.857083]  entry_SYSCALL_64_fastpath+0x23/0xc2
[ 1814.857256] RIP: 0033:0x7f733b2dd690
[ 1814.857419] RSP: 002b:00007ffe1d3387d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1814.858238] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f733b2dd690
[ 1814.858445] RDX: 0000000000000000 RSI: 00007ffe1d338820 RDI: 0000000000000003
[ 1814.858651] RBP: ffff88005adcbf98 R08: 0000000000000001 R09: 0000000000000003
[ 1814.858856] R10: 00007ffe1d3385a0 R11: 0000000000000246 R12: 0000000000000002
[ 1814.859060] R13: 000000000066f1a0 R14: 00007ffe1d3408d0 R15: 0000000000000000
[ 1814.859267]  ? trace_hardirqs_off_caller+0xa7/0xcf
[ 1814.859446] Code: 10 55 48 89 c7 48 89 e5 e8 45 a1 fb ff 31 c0 5d c3
31 c0 c3 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 49 89 fd 49 8b
45 30 <4c> 8b 20 41 8b 5c 24 38 31 c9 31 d2 48 c7 c7 50 8e 1d 82 41 89
[ 1814.860022] RIP: hrtimer_active+0x17/0x8a RSP: ffff88005adcb590
[ 1814.860214] CR2: 0000000000000000

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba4 ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
WANG Cong
143976ce99 net_sched: remove tc class reference counting
For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:

1) For class modification and dumping paths, we already hold RTNL lock,
   so all of these ->get(),->change(),->put() are atomic.

2) For filter bindiing/unbinding, we use other reference counter than
   this one, and they should have RTNL lock too.

3) For ->qlen_notify(), it is special because it is called on ->enqueue()
   path, but we already hold qdisc tree lock there, and we hold this
   tree lock when graft or delete the class too, so it should not be gone
   or changed until we release the tree lock.

Therefore, this patch removes ->get() and ->put(), but:

1) Adds a new ->find() to find the pointer to a class by classid, no
   refcnt.

2) Move the original class destroy upon the last refcnt into ->delete(),
   right after releasing tree lock. This is fine because the class is
   already removed from hash when holding the lock.

For those who also use ->put() as ->unbind(), just rename them to reflect
this change.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25 17:19:10 -07:00
Michal Hocko
752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Johannes Berg
fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Nik Unger
5080f39e8c netem: apply correct delay when rate throttling
I recently reported on the netem list that iperf network benchmarks
show unexpected results when a bandwidth throttling rate has been
configured for netem. Specifically:

1) The measured link bandwidth *increases* when a higher delay is added
2) The measured link bandwidth appears higher than the specified limit
3) The measured link bandwidth for the same very slow settings varies significantly across
  machines

The issue can be reproduced by using tc to configure netem with a
512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
veth pair between network namespaces, and then using iperf (or any
other network benchmarking tool) to test throughput. Complete detailed
instructions are in the original email chain here:
https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html

There appear to be two underlying bugs causing these effects:

- The first issue causes long delays when the rate is slow and no
  delay is configured (e.g., "rate 512kbit"). This is because SKBs are
  not orphaned when no delay is configured, so orphaning does not
  occur until *after* the rate-induced delay has been applied. For
  this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
  dramatically increases the measured bandwidth.

- The second issue is that rate-induced delays are not correctly
  applied, allowing SKB delays to occur in parallel. The indended
  approach is to compute the delay for an SKB and to add this delay to
  the end of the current queue. However, the code does not detect
  existing SKBs in the queue due to improperly testing sch->q.qlen,
  which is nonzero even when packets exist only in the
  rbtree. Consequently, new SKBs do not wait for the current queue to
  empty. When packet delays vary significantly (e.g., if packet sizes
  are different), then this also causes unintended reordering.

I modified the code to expect a delay (and orphan the SKB) when a rate
is configured. I also added some defensive tests that correctly find
the latest scheduled delivery time, even if it is (unexpectedly) for a
packet in sch->q. I have tested these changes on the latest kernel
(4.11.0-rc1+) and the iperf / ping test results are as expected.

Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:14:06 -07:00
Willem de Bruijn
bc31c905e9 net-tc: convert tc_from to tc_from_ingress and tc_redirected
The tc_from field fulfills two roles. It encodes whether a packet was
redirected by an act_mirred device and, if so, whether act_mirred was
called on ingress or egress. Split it into separate fields.

The information is needed by the special IFB loop, where packets are
taken out of the normal path by act_mirred, forwarded to IFB, then
reinjected at their original location (ingress or egress) by IFB.

The IFB device cannot use skb->tc_at_ingress, because that may have
been overwritten as the packet travels from act_mirred to ifb_xmit,
when it passes through tc_classify on the IFB egress path. Cache this
value in skb->tc_from_ingress.

That field is valid only if a packet arriving at ifb_xmit came from
act_mirred. Other packets can be crafted to reach ifb_xmit. These
must be dropped. Set tc_redirected on redirection and drop all packets
that do not have this bit set.

Both fields are set only on cloned skbs in tc actions, so original
packet sources do not have to clear the bit when reusing packets
(notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 20:58:52 -05:00
Willem de Bruijn
a5135bcfba net-tc: convert tc_verd to integer bitfields
Extract the remaining two fields from tc_verd and remove the __u16
completely. TC_AT and TC_FROM are converted to equivalent two-bit
integer fields tc_at and tc_from. Where possible, use existing
helper skb_at_tc_ingress when reading tc_at. Introduce helper
skb_reset_tc to clear fields.

Not documenting tc_from and tc_at, because they will be replaced
with single bit fields in follow-on patches.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 20:58:52 -05:00
Thomas Gleixner
2456e85535 ktime: Get rid of the union
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Geliang Tang
7f7cd56c33 net_sched: sch_netem: use rb_entry()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20 14:22:48 -05:00
Florian Westphal
48da34b7a7 sched: add and use qdisc_skb_head helpers
This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head.

Its similar to the skb_buff_head api, but does not use skb->prev pointers.

Qdiscs will commonly enqueue at the tail of a list and dequeue at head.
While skb_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of next element.

The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
ed760cb8aa sched: replace __skb_dequeue with __qdisc_dequeue_head
After previous patch these functions are identical.
Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

Next patch will then make __qdisc_dequeue_head handle
single-linked list instead of strcut sk_buff_head argument.

Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
97d0678f91 sched: don't use skb queue helpers
A followup change will replace the sk_buff_head in the qdisc
struct with a slightly different list.

Use of the sk_buff_head helpers will thus cause compiler
warnings.

Open-code these accesses in an extra change to ease review.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
David S. Miller
ee58b57100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:03:36 -04:00
Eric Dumazet
8a6e9c6703 net_sched: netem: do not call qdisc_drop() with a NULL skb
If skb_unshare() fails, we call qdisc_drop() with a NULL skb, which
is no longer supported.

Fixes: 520ac30f45 ("net_sched: drop packets after root qdisc lock is released")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-29 08:02:24 -04:00
Eric Dumazet
520ac30f45 net_sched: drop packets after root qdisc lock is released
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.

Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.

Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.

This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.

I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Eric Dumazet
21de12ee55 netem: fix a use after free
If the packet was dropped by lower qdisc, then we must not
access it later.

Save qdisc_pkt_len(skb) in a temp variable.

Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 15:07:44 -04:00
Eric Dumazet
2f08a9a162 net_sched: sch_netem: defer skb freeing
rtnl_kfree_skbs() can be used in tfifo_reset()

It would be nice if we could iterate through rb tree instead
of removing one skb at a time, and build a single skb chain.
But this is left for a future patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
45f50bed1d net_sched: remove generic throttled management
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.

I already removed it for sch_fq in commit f2600cf02b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.

When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.

Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.

This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:21 -07:00
Eric Dumazet
42117927ca net_sched: netem: remove qdisc_is_throttled() use
Looks like it is only there as some optimization attempt.

Since __QDISC_STATE_THROTTLED set/unset is way too expensive,
and netem is the last user, just remove this check.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:21 -07:00
Florian Westphal
a09ceb0e08 sched: remove qdisc->drop
after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
more callers of ->drop() outside of other ->drop functions, i.e.
nothing calls them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 23:58:52 -07:00
Florian Westphal
c3a173d7db sched: remove qdisc_rehape_fail
After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail
is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 23:58:51 -07:00
David S. Miller
cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Neil Horman
6071bd1aa1 netem: Segment GSO packets on enqueue
This was recently reported to me, and reproduced on the latest net kernel,
when attempting to run netperf from a host that had a netem qdisc attached
to the egress interface:

[  788.073771] ---------------------[ cut here ]---------------------------
[  788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[  788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[  788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[  788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G        W
------------   3.10.0-327.el7.x86_64 #1
[  788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[  788.542260]  ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
ffffffff816351f1
[  788.576332]  ffff880437c036a8 ffffffff8107b200 ffff880633e74200
ffff880231674000
[  788.611943]  0000000000000001 0000000000000003 0000000000000000
ffff880437c03710
[  788.647241] Call Trace:
[  788.658817]  <IRQ>  [<ffffffff816351f1>] dump_stack+0x19/0x1b
[  788.686193]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[  788.713803]  [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[  788.741314]  [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
[  788.767018]  [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
[  788.796117]  [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
[  788.823392]  [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
[  788.854487]  [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
[  788.880870]  [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
...

The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).

The solution I think is to simply segment the skb in a simmilar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
the first segment, and enqueue the remaining ones.

tested successfully by myself on the latest net kernel, to which this applies

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netem@lists.linux-foundation.org
CC: eric.dumazet@gmail.com
CC: stephen@networkplumber.org
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:33:14 -04:00
Nicolas Dichtel
2a51c1e8ec sched: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:09:09 -04:00
WANG Cong
2ccccf5fb4 net_sched: update hierarchical backlog too
When the bottom qdisc decides to, for example, drop some packet,
it calls qdisc_tree_decrease_qlen() to update the queue length
for all its ancestors, we need to update the backlog too to
keep the stats on root qdisc accurate.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-29 17:02:33 -05:00
WANG Cong
86a7996cc8 net_sched: introduce qdisc_replace() helper
Remove nearly duplicated code and prepare for the following patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-29 17:02:33 -05:00
Eric Dumazet
b396cca6fa net: sched: deprecate enqueue_root()
Only left enqueue_root() user is netem, and it looks not necessary :

qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 14:17:32 -04:00
Beshay, Joseph
0ad2a83659 netem: Fixes byte backlog accounting for the first of two chained netem instances
Fixes byte backlog accounting for the first of two chained netem instances.
Bytes backlog reported now corresponds to the number of queued packets.

When two netem instances are chained, for instance to apply rate and queue
limitation followed by packet delay, the number of backlogged bytes reported
by the first netem instance is wrong. It reports the sum of bytes in the queues
of the first and second netem. The first netem reports the correct number of
backlogged packets but not bytes. This is shown in the example below.

Consider a chain of two netem schedulers created using the following commands:

$ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
$ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms

Start an iperf session to send packets out on the specified interface and
monitor the backlog using tc:

$ tc -s qdisc show dev veth2

Output using unpatched netem:
	qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
	 Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
	 backlog 172694b 73p requeues 0
	qdisc netem 2: parent 1: limit 1000 delay 50.0ms
	 Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
	 backlog 63588b 42p requeues 0

The interface used to produce this output has an MTU of 1500. The output for
backlogged bytes behind netem 1 is 172694b. This value is not correct. Consider
the total number of sent bytes and packets. By dividing the number of sent
bytes by the number of sent packets, we get an average packet size of ~=1504.
If we divide the number of backlogged bytes by packets, we get ~=2365. This is
due to the first netem incorrectly counting the 63588b which are in netem 2's
queue as being in its own queue. To verify this is the case, we subtract them
from the reported value and divide by the number of packets as follows:
	172694 - 63588 = 109106 bytes actualled backlogged in netem 1
	109106 / 73 packets ~= 1494 bytes (which matches our MTU)

The root cause is that the byte accounting is not done at the
same time with packet accounting. The solution is to update the backlog value
every time the packet queue is updated.

Signed-off-by: Joseph D Beshay <joseph.beshay@utdallas.edu>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 18:34:24 -04:00
Eric Dumazet
56b174256b net: add rbnode to struct sk_buff
Yaogong replaces TCP out of order receive queue by an RB tree.

As netem already does a private skb->{next/prev/tstamp} union
with a 'struct rb_node', lets do this in a cleaner way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 16:13:03 -05:00
John Fastabend
25331d6ce4 net: sched: implement qstat helper routines
This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:02:26 -04:00
WANG Cong
4cb28970a2 net: use the new API kvfree()
It is available since v3.15-rc5.

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-05 00:49:51 -07:00
Yang Yingliang
3fbac2a87e sch_netem: replace magic numbers with enumerate in get_loss_clg
Replace two magic numbers which intialize clgstate::state.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 16:55:54 -05:00
Yang Yingliang
c045a734da sch_netem: replace magic numbers with enumerate in GE model
Replace some magic numbers which describe states of GE model
loss generator with enumerate.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 00:15:20 -05:00
Yang Yingliang
49545a7775 sch_netem: change some func's param from "struct Qdisc *" to "struct netem_sched_data *"
In netem_change(), we have already get "struct netem_sched_data *q".
Replace params of get_correlation() and other similar functions with
"struct netem_sched_data *q".

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 00:15:20 -05:00
Yang Yingliang
54a4b05cd2 sch_netem: return errcode before setting params
get_dist_table() and get_loss_clg() may be failed. These
two functions should be called after setting the members
of qdisc_priv(sch), or it will break the old settings while
either of them is failed.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 00:15:20 -05:00
Hannes Frederic Sowa
809fa972fd reciprocal_divide: update/correction of the algorithm
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.

This has been fixed by Eric Dumazet in commit aee636c480
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
where very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:

1) Initialization:

  int l = ceil(log_2 d)
  uword m' = floor((1<<32)*((1<<l)-d)/d)+1
  int sh_1 = min(l,1)
  int sh_2 = max(l-1,0)

2) For q = n/d, all uword:

  uword t = (n*m')>>32
  q = (t+((n-t)>>sh_1))>>sh_2

The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.

Joint work with Daniel Borkmann.

  [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
  [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
  [3] https://gmplib.org/~tege/division-paper.pdf
  [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
  [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
  [6] http://www.agner.org/optimize/asmlib.zip

Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 23:17:20 -08:00
Yang Yingliang
a6e2fe17eb sch_netem: replace magic numbers with enumerate
Replace some magic numbers which describe states of 4-state model
loss generator with enumerate.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 17:17:34 -08:00
Aruna-Hewapathirane
63862b5bef net: replace macros net_random and net_srandom with direct calls to prandom
This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.

Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 15:15:25 -08:00
Yang Yingliang
6a031f67c8 sch_netem: support of 64bit rates
Add a new attribute to support 64bit rates so that
tc can use them to break the 32bit limit.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 14:31:44 -05:00
Yang Yingliang
8cfd88d6d7 sch_netem: more precise length of packets
With TSO/GSO/GRO packets, skb->len doesn't represent
a precise amount of bytes on wire.
This patch replace skb->len with qdisc_pkt_len(skb)
which is more precise.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 14:31:43 -05:00
Yang Yingliang
833fa74386 net_sched: add space around '>' and before '('
Spaces required around that '>' (ctx:VxV) and
before the open parenthesis '('.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:44:51 -05:00
stephen hemminger
eff7979f00 netem: fix gemodel loss generator
Patch from developers of the alternative loss models, downloaded from:
   http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG

 "in case 2, of the switch we change the direction of the inequality to
  net_random()>clg->a3, because clg->a3 is h in the GE model and when h
  is 0 all packets will be lost."

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:49:29 -05:00
stephen hemminger
ab6c27be81 netem: fix loss 4 state model
Patch from developers of the alternative loss models, downloaded from:
   http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG

 "In the case 1 of the switch statement in the if conditions we
   need to add clg->a4 to clg->a1, according to the model."

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:49:28 -05:00
stephen hemminger
7c2781fa92 netem: missing break in ge loss generator
There is a missing break statement in the Gilbert Elliot loss model
generator which makes state machine behave incorrectly.

Reported-by: Martin Burri <martin.burri@ch.abb.com
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:49:28 -05:00
Hagen Paul Pfeifer
4a3ad7b3ea netem: markov loss model transition fix
The transition from markov state "3 => lost packets within a burst
period" to "1 => successfully transmitted packets within a gap period"
has no *additional* loss event. The loss already happen for transition
from 1 -> 3, this additional loss will make things go wild.

E.g. transition probabilities:

p13:   10%
p31:  100%

Expected:

Ploss = p13 / (p13 + p31)
Ploss = ~9.09%

... but it isn't. Even worse: we get a double loss - each time.
So simple don't return true to indicate loss, rather break and return
false.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Stefano Salsano <stefano.salsano@uniroma2.it>
Cc: Fabio Ludovici <fabio.ludovici@yahoo.it>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25 19:03:39 -04:00
stephen hemminger
ff704050f2 netem: free skb's in tree on reset
Netem can leak memory because packets get stored in red-black
tree and it is not cleared on reset.

Reported by: Сергеев Сергей <adron@yapic.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-11 17:30:38 -04:00
stephen hemminger
638a52b801 netem: update backlog after drop
When packet is dropped from rb-tree netem the backlog statistic should
also be updated.

Reported-by: Сергеев Сергей <adron@yapic.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-11 17:29:29 -04:00
Eric Dumazet
f2f872f927 netem: Introduce skb_orphan_partial() helper
Commit 547669d483 ("tcp: xps: fix reordering issues") added
unexpected reorders in case netem is used in a MQ setup for high
performance test bed.

ETH=eth0
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 32`
do
 tc qd add dev $ETH parent 1:$i netem delay 100ms
done

As all tcp packets are orphaned by netem, TCP stack believes it can
set skb->ooo_okay on all packets.

In order to allow producers to send more packets, we want to
keep sk_wmem_alloc from reaching sk_sndbuf limit.

We can do that by accounting one byte per skb in netem queues,
so that TCP stack is not fooled too much.

Tested:

With above MQ/netem setup, scaling number of concurrent flows gives
linear results and no reorders/retransmits

lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
 do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
n:1 198.46
n:10 2002.69
n:20 4000.98
n:30 6006.35
n:40 8020.93
n:50 10032.3
n:60 12081.9
n:70 13971.3
n:80 16009.7
n:90 17117.3
n:100 17425.5

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 14:59:49 -07:00
Eric Dumazet
36b7bfe09b netem: fix possible NULL deref in netem_dequeue()
commit aec0a40a6f ("netem: use rb tree to implement the time queue")
added a regression if a child qdisc is attached to netem, as we perform
a NULL dereference.

Fix this by adding a temporary variable to cache
netem_skb_cb(skb)->time_to_send.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 16:52:10 -07:00
Eric Dumazet
aec0a40a6f netem: use rb tree to implement the time queue
Following typical setup to implement a ~100 ms RTT and big
amount of reorders has very poor performance because netem
implements the time queue using a linked list.
-----------------------------------------------------------
ETH=eth0
IFB=ifb0
modprobe ifb
ip link set dev $IFB up
tc qdisc add dev $ETH ingress 2>/dev/null
tc filter add dev $ETH parent ffff: \
   protocol ip u32 match u32 0 0 flowid 1:1 action mirred egress \
   redirect dev $IFB
ethtool -K $ETH gro off tso off gso off
tc qdisc add dev $IFB root netem delay 50ms 10ms limit 100000
tc qd add dev $ETH root netem delay 50ms limit 100000
---------------------------------------------------------

Switch netem time queue to a rb tree, so this kind of setup can work at
high speed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01 18:07:15 -07:00
Johannes Naab
a13d310471 netem: fix delay calculation in rate extension
The delay calculation with the rate extension introduces in v3.3 does
not properly work, if other packets are still queued for transmission.
For the delay calculation to work, both delay types (latency and delay
introduces by rate limitation) have to be handled differently. The
latency delay for a packet can overlap with the delay of other packets.
The delay introduced by the rate however is separate, and can only
start, once all other rate-introduced delays finished.

Latency delay is from same distribution for each packet, rate delay
depends on the packet size.

.: latency delay
-: rate delay
x: additional delay we have to wait since another packet is currently
   transmitted

  .....----                    Packet 1
    .....xx------              Packet 2
               .....------     Packet 3
    ^^^^^
    latency stacks
         ^^
         rate delay doesn't stack
               ^^
               latency stacks

  -----> time

When a packet is enqueued, we first consider the latency delay. If other
packets are already queued, we can reduce the latency delay until the
last packet in the queue is send, however the latency delay cannot be
<0, since this would mean that the rate is overcommitted.  The new
reference point is the time at which the last packet will be send. To
find the time, when the packet should be send, the rate introduces delay
has to be added on top of that.

Signed-off-by: Johannes Naab <jn@stusta.de>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:43:02 -05:00
Eric Dumazet
5a308f40bf netem: refine early skb orphaning
netem does an early orphaning of skbs. Doing so breaks TCP Small Queue
or any mechanism relying on socket sk_wmem_alloc feedback.

Ideally, we should perform this orphaning after the rate module and
before the delay module, to mimic what happens on a real link :

skb orphaning is indeed normally done at TX completion, before the
transit on the link.

+-------+   +--------+  +---------------+  +-----------------+
+ Qdisc +---> Device +--> TX completion +--> links / hops    +->
+       +   +  xmit  +  + skb orphaning +  + propagation     +
+-------+   +--------+  +---------------+  +-----------------+
      < rate limiting >                  < delay, drops, reorders >

If netem is used without delay feature (drops, reorders, rate
limiting), then we should avoid early skb orphaning, to keep pressure
on sockets as long as packets are still in qdisc queue.

Ideally, netem should be refactored to implement delay module
as the last stage. Current algorithm merges the two phases
(rate limiting + delay) so its not correct.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Mark Gordon <msg@google.com>
Cc: Andreas Terzis <aterzis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16 23:08:33 -07:00
Eric Dumazet
960fb66e52 netem: add limitation to reordered packets
Fix two netem bugs :

1) When a frame was dropped by tfifo_enqueue(), drop counter
   was incremented twice.

2) When reordering is triggered, we enqueue a packet without
   checking queue limit. This can OOM pretty fast when this
   is repeated enough, since skbs are orphaned, no socket limit
   can help in this situation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mark Gordon <msg@google.com>
Cc: Andreas Terzis <aterzis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-09 00:01:49 -07:00
David S. Miller
0d6c4a2e46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/param.c
	drivers/net/wireless/iwlwifi/iwl-agn-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans.h

Resolved the iwlwifi conflict with mainline using 3-way diff posted
by John Linville and Stephen Rothwell.  In 'net' we added a bug
fix to make iwlwifi report a more accurate skb->truesize but this
conflicted with RX path changes that happened meanwhile in net-next.

In e1000e a conflict arose in the validation code for settings of
adapter->itr.  'net-next' had more sophisticated logic so that
logic was used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-07 23:35:40 -04:00
Eric Dumazet
116a0fc31c netem: fix possible skb leak
skb_checksum_help(skb) can return an error, we must free skb in this
case. qdisc_drop(skb, sch) can also be feeded with a NULL skb (if
skb_unshare() failed), so lets use this generic helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 13:40:48 -04:00
Eric Dumazet
e4ae004b84 netem: add ECN capability
Add ECN (Explicit Congestion Notification) marking capability to netem

tc qdisc add dev eth0 root netem drop 0.5 ecn

Instead of dropping packets, try to ECN mark them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 09:39:48 -04:00
David S. Miller
1b34ec43c9 pkt_sched: Stop using NLA_PUT*().
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-01 18:11:37 -04:00
Eric Dumazet
cd961c2ca9 netem: fix dequeue
commit 50612537e9 (netem: fix classful handling) added two errors in
netem_dequeue()

1) After checking skb at the head of tfifo queue for time constraints,
   it dequeues tail skb, thus adding unwanted reordering.

2) qdisc stats are updated twice per packet
   (one when packet dequeued from tfifo, once when delivered)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-19 18:57:50 -05:00
David S. Miller
16bda13d90 net: Make qdisc_skb_cb upper size bound explicit.
Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
of other data structures.

This is intended to be used by IPoIB so that it can remember
addressing information stored at hard_header_ops->create() time that
it can fetch when the packet gets to the transmit routine.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-09 13:50:34 -05:00
Vijay Subramanian
a42b4799c6 netem: Fix off-by-one bug in reordering
With netem reordering, a gap of N is supposed to reorder every Nth packet with
given reorder probability.  However, the code currently skips N packets and
reorders every (N+1)th packet.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-22 15:08:44 -05:00
Hagen Paul Pfeifer
eb10192447 net_sched: Bug in netem reordering
Not now, but it looks you are correct. q->qdisc is NULL until another
additional qdisc is attached (beside tfifo). See 50612537e9.
The following patch should work.

From: Hagen Paul Pfeifer <hagen@jauu.net>

netem: catch NULL pointer by updating the real qdisc statistic

Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05 13:27:39 -05:00
Eric Dumazet
50612537e9 netem: fix classful handling
Commit 10f6dfcfde (Revert "sch_netem: Remove classful functionality")
reintroduced classful functionality to netem, but broke basic netem
behavior :

netem uses an t(ime)fifo queue, and store timestamps in skb->cb[]

If qdisc is changed, time constraints are not respected and other qdisc
can destroy skb->cb[] and block netem at dequeue time.

Fix this by always using internal tfifo, and optionally attach a child
qdisc to netem (or a tree of qdiscs)

Example of use :

DEV=eth3
tc qdisc del dev $DEV root
tc qdisc add dev $DEV root handle 30: est 1sec 8sec netem delay 20ms 10ms
tc qdisc add dev $DEV handle 40:0 parent 30:0 tbf \
	burst 20480 limit 20480 mtu 1514 rate 32000bps

qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms  10.0ms
 Sent 190792 bytes 413 pkt (dropped 0, overlimits 0 requeues 0)
 rate 18416bit 3pps backlog 0b 0p requeues 0
qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
 Sent 190792 bytes 413 pkt (dropped 6, overlimits 10 requeues 0)
 backlog 0b 5p requeues 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-30 17:12:23 -05:00
David S. Miller
7f8e3234c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2011-12-30 13:04:14 -05:00
Eric Dumazet
bb52c7acf8 netem: dont call vfree() under spinlock and BH disabled
commit 6373a9a286 (netem: use vmalloc for distribution table) added a
regression, since vfree() is called while holding a spinlock and BH
being disabled.

Fix this by doing the pointers swap in critical section, and freeing
after spinlock release.

Also add __GFP_NOWARN to the kmalloc() try, since we fallback to
vmalloc().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-24 16:08:50 -05:00
stephen hemminger
2494654d48 netem: loss model API sizes
The new netem loss model is configured with nested netlink messages.
This code is being overly strict about sizes, and is easily confused
by padding (or possible future expansion). Also message
for gemodel is incorrect.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23 16:51:18 -05:00
Hagen Paul Pfeifer
90b41a1cd4 netem: add cell concept to simulate special MAC behavior
This extension can be used to simulate special link layer
characteristics. Simulate because packet data is not modified, only the
calculation base is changed to delay a packet based on the original
packet size and artificial cell information.

packet_overhead can be used to simulate a link layer header compression
scheme (e.g. set packet_overhead to -20) or with a positive
packet_overhead value an additional MAC header can be simulated. It is
also possible to "replace" the 14 byte Ethernet header with something
else.

cell_size and cell_overhead can be used to simulate link layer schemes,
based on cells, like some TDMA schemes. Another application area are MAC
schemes using a link layer fragmentation with a (small) header each.
Cell size is the maximum amount of data bytes within one cell. Cell
overhead is an additional variable to change the per-cell-overhead
(e.g.  5 byte header per fragment).

Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
cell overhead 5 byte):

  tc qdisc add dev eth0 root netem rate 5kbit 20 100 5

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:44:48 -05:00
Eric Dumazet
fc33cc7242 netem: fix build error on 32bit arches
ERROR: "__udivdi3" [net/sched/sch_netem.ko] undefined!

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01 11:40:19 -05:00
Hagen Paul Pfeifer
7bc0f28c7a netem: rate extension
Currently netem is not in the ability to emulate channel bandwidth. Only static
delay (and optional random jitter) can be configured.

To emulate the channel rate the token bucket filter (sch_tbf) can be used.  But
TBF has some major emulation flaws. The buffer (token bucket depth/rate) cannot
be 0. Also the idea behind TBF is that the credit (token in buckets) fills if
no packet is transmitted. So that there is always a "positive" credit for new
packets. In real life this behavior contradicts the law of nature where
nothing can travel faster as speed of light. E.g.: on an emulated 1000 byte/s
link a small IPv4/TCP SYN packet with ~50 byte require ~0.05 seconds - not 0
seconds.

Netem is an excellent place to implement a rate limiting feature: static
delay is already implemented, tfifo already has time information and the
user can skip TBF configuration completely.

This patch implement rate feature which can be configured via tc. e.g:

	tc qdisc add dev eth0 root netem rate 10kbit

To emulate a link of 5000byte/s and add an additional static delay of 10ms:

	tc qdisc add dev eth0 root netem delay 10ms rate 5KBps

Note: similar to TBF the rate extension is bounded to the kernel timing
system. Depending on the architecture timer granularity, higher rates (e.g.
10mbit/s and higher) tend to transmission bursts. Also note: further queues
living in network adaptors; see ethtool(8).

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@drr.davemloft.net>
2011-11-30 23:18:35 -05:00
Alexey Dobriyan
b7f080cfe2 net: remove mm.h inclusion from netdevice.h
Remove linux/mm.h inclusion from netdevice.h -- it's unused (I've checked manually).

To prevent mm.h inclusion via other channels also extract "enum dma_data_direction"
definition into separate header. This tiny piece is what gluing netdevice.h with mm.h
via "netdevice.h => dmaengine.h => dma-mapping.h => scatterlist.h => mm.h".
Removal of mm.h from scatterlist.h was tried and was found not feasible
on most archs, so the link was cutoff earlier.

Hope people are OK with tiny include file.

Note, that mm_types.h is still dragged in, but it is a separate story.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 19:17:20 -07:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
David S. Miller
78776d3f2b sch_netem: Need to include vmalloc.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:48:13 -08:00
stephen hemminger
250a65f782 netem: update version and cleanup
Get rid of debug message that are not useful, and enable
the log messages in case of error.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:56 -08:00
stephen hemminger
661b79725f netem: revised correlated loss generator
This is a patch originated with Stefano Salsano and Fabio Ludovici.
It provides several alternative loss models for use with netem.
This patch adds two state machine based loss models.

See: http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:56 -08:00
stephen hemminger
10f6dfcfde Revert "sch_netem: Remove classful functionality"
Many users have wanted the old functionality that was lost
to be able to use pfifo as inner qdisc for netem. The reason that
netem could not be classful with the older API was because of the
limitations of the old dequeue/requeue interface; now that qdisc API has
a peek function, there is no longer a problem with using any
inner qdisc's.

This reverts commit 0220146411.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:55 -08:00
stephen hemminger
df173bda26 netem: define NETEM_DIST_MAX
Rather than magic constant in code, expose the maximum size of
packet distribution table in API. In iproute2, q_netem defines
MAX_DIST as 16K already.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:54 -08:00
stephen hemminger
6373a9a286 netem: use vmalloc for distribution table
The netem probability table can be large (up to 64K bytes)
which may be too large to allocate in one contiguous chunk.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:54 -08:00
stephen hemminger
861d7f745f netem: cleanup dump code
Use nla_put_nested to update netlink attribute value.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:53 -08:00
David S. Miller
5bdc22a565 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/sched/sch_hfsc.c
	net/sched/sch_htb.c
	net/sched/sch_tbf.c
2011-01-24 14:09:35 -08:00
Eric Dumazet
9190b3b320 net_sched: accurate bytes/packets stats/rates
In commit 44b8288308 (net_sched: pfifo_head_drop problem), we fixed
a problem with pfifo_head drops that incorrectly decreased
sch->bstats.bytes and sch->bstats.packets

Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
previously enqueued packet, and bstats cannot be changed, so
bstats/rates are not accurate (over estimated)

This patch changes the qdisc_bstats updates to be done at dequeue() time
instead of enqueue() time. bstats counters no longer account for dropped
frames, and rates are more correct, since enqueue() bursts dont have
effect on dequeue() rate.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 23:31:33 -08:00
Eric Dumazet
fd245a4adb net_sched: move TCQ_F_THROTTLED flag
In commit 3711210576 (net: QDISC_STATE_RUNNING dont need atomic bit
ops) I moved QDISC_STATE_RUNNING flag to __state container, located in
the cache line containing qdisc lock and often dirtied fields.

I now move TCQ_F_THROTTLED bit too, so that we let first cache line read
mostly, and shared by all cpus. This should speedup HTB/CBQ for example.

Not using test_bit()/__clear_bit()/__test_and_set_bit allows to use an
"unsigned int" for __state container, reducing by 8 bytes Qdisc size.

Introduce helpers to hide implementation details.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 16:59:32 -08:00
Eric Dumazet
cc7ec456f8 net_sched: cleanups
Cleanup net/sched code to current CodingStyle and practices.

Reduce inline abuse

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19 23:31:12 -08:00
Eric Dumazet
bfe0d0298f net_sched: factorize qdisc stats handling
HTB takes into account skb is segmented in stats updates.
Generalize this to all schedulers.

They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packets

Add bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.

Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no
stab is setup on qdisc.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-10 16:07:54 -08:00
Changli Gao
3511c9132f net_sched: remove the unused parameter of qdisc_create_dflt()
The first parameter dev isn't in use in qdisc_create_dflt().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 03:09:47 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Joe Perches
f64f9e7192 net: Move && and || to end of previous line
Not including net/atm/

Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29 16:55:45 -08:00
Jarek Poplawski
8caf153974 net: sch_netem: Fix an inconsistency in ingress netem timestamps.
Alex Sidorenko reported:

"while experimenting with 'netem' we have found some strange behaviour. It
seemed that ingress delay as measured by 'ping' command shows up on some
hosts but not on others.

After some investigation I have found that the problem is that skbuff->tstamp
field value depends on whether there are any packet sniffers enabled. That
is:

- if any ptype_all handler is registered, the tstamp field is as expected
- if there are no ptype_all handlers, the tstamp field does not show the delay"

This patch prevents unnecessary update of tstamp in dev_queue_xmit_nit()
on ingress path (with act_mirred) adding a check, so minimal overhead on
the fast path, but only when sniffers etc. are active.

Since netem at ingress seems to logically emulate a network before a host,
tstamp is zeroed to trigger the update and pretend delays are from the
outside.

Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Tested-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-20 02:14:59 -07:00
Jarek Poplawski
05a8c1cbfe pkt_sched: Remove smp_wmb() in qdisc_watchdog()
While implementing a TCQ_F_THROTTLED flag there was used an smp_wmb()
in qdisc_watchdog(), but since this flag is practically used only in
sch_netem(), and since it's not even clear what reordering is avoided
here (TCQ_F_THROTTLED vs. __QDISC_STATE_SCHED?) it seems the barrier
could be safely removed.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-22 19:44:13 -08:00
David S. Miller
eb14f01959 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/e1000e/ich8lan.c
2008-12-15 20:03:50 -08:00
Jesper Dangaard Brouer
eb9b851b98 SCHED: netem: Correct documentation comment in code.
The netem simulator is no longer limited by Linux timer resolution HZ.
Not since Patrick McHardy changed the QoS system to use hrtimer.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15 00:39:17 -08:00
Patrick McHardy
b94c8afcba pkt_sched: remove unnecessary xchg() in packet schedulers
The use of xchg() hasn't been necessary since 2.2.something when proper
locking was added to packet schedulers.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 04:11:36 -08:00
Jarek Poplawski
f30ab418a1 pkt_sched: Remove qdisc->ops->requeue() etc.
After implementing qdisc->ops->peek() and changing sch_netem into
classless qdisc there are no more qdisc->ops->requeue() users. This
patch removes this method with its wrappers (qdisc_requeue()), and
also unused qdisc->requeue structure. There are a few minor fixes of
warnings (htb_enqueue()) and comments btw.

The idea to kill ->requeue() and a similar patch were first developed
by David S. Miller.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-13 22:56:30 -08:00
Stephen Hemminger
265eb67fb4 netem: eliminate unneeded return values
All these individual parsing functions never return an error,
so they can be void.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 21:13:26 -08:00
Jarek Poplawski
8ba25dad0a sch_netem: Replace ->requeue() method with open code
After removing netem classful functionality we are sure its inner
qdisc is tfifo, so we can replace qdisc->ops->requeue() method with
open code. After this patch there are no more ops->requeue() users.

The idea of this patch is by Patrick McHardy.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-02 00:36:03 -07:00
Jarek Poplawski
0220146411 sch_netem: Remove classful functionality
Patrick McHardy noticed that: "a lot of the functionality of netem
requires the inner tfifo anyways and rate-limiting is usually done
on top of netem. So I would suggest so either hard-wire the tfifo
qdisc or at least make the assumption that inner qdiscs are
work-conserving.", and later: "- a lot of other qdiscs still don't
work as inner qdiscs of netem [...]".

So, according to his suggestion, this patch removes classful options
of netem. The main reason of this change is to remove ops->requeue()
method, which is currently used only by netem.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-02 00:35:24 -07:00
Jarek Poplawski
77be155cba pkt_sched: Add peek emulation for non-work-conserving qdiscs.
This patch adds qdisc_peek_dequeued() wrapper to emulate peek method
with qdisc->dequeue() and storing "peeked" skb in qdisc->gso_skb until
dequeuing. This is mainly for compatibility reasons not to break some
strange configs because peeking is expected for non-work-conserving
parent qdiscs to query work-conserving child qdiscs.

This implementation requires using qdisc_dequeue_peeked() wrapper
instead of directly calling qdisc->dequeue() for all qdiscs ever
querried with qdisc->ops->peek() or qdisc_peek_dequeued().

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:47:01 -07:00