2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-02 10:43:57 +08:00
Commit Graph

1226 Commits

Author SHA1 Message Date
David S. Miller
54e5c4def0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_alb.c
	drivers/net/ethernet/altera/altera_msgdma.c
	drivers/net/ethernet/altera/altera_sgdma.c
	net/ipv6/xfrm6_output.c

Several cases of overlapping changes.

The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.

In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.

Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-24 00:32:30 -04:00
Daniel Borkmann
b1fcd35cf5 net: filter: let unattached filters use sock_fprog_kern
The sk_unattached_filter_create() API is used by BPF filters that
are not directly attached or related to sockets, and are used in
team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
internal managment of obtaining filter blocks and thus already
have them in kernel memory and set up before calling into
sk_unattached_filter_create(). As a result, due to __user annotation
in sock_fprog, sparse triggers false positives (incorrect type in
assignment [different address space]) when filters are set up before
passing them to sk_unattached_filter_create(). Therefore, let
sk_unattached_filter_create() API use sock_fprog_kern to overcome
this issue.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-23 16:48:05 -04:00
Cong Wang
bf63ac73b3 net_sched: fix an oops in tcindex filter
Kelly reported the following crash:

        IP: [<ffffffff817a993d>] tcf_action_exec+0x46/0x90
        PGD 3009067 PUD 300c067 PMD 11ff30067 PTE 800000011634b060
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 1 PID: 639 Comm: dhclient Not tainted 3.15.0-rc4+ #342
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff8801169ecd00 ti: ffff8800d21b8000 task.ti: ffff8800d21b8000
        RIP: 0010:[<ffffffff817a993d>]  [<ffffffff817a993d>] tcf_action_exec+0x46/0x90
        RSP: 0018:ffff8800d21b9b90  EFLAGS: 00010283
        RAX: 00000000ffffffff RBX: ffff88011634b8e8 RCX: ffff8800cf7133d8
        RDX: ffff88011634b900 RSI: ffff8800cf7133e0 RDI: ffff8800d210f840
        RBP: ffff8800d21b9bb0 R08: ffffffff8287bf60 R09: 0000000000000001
        R10: ffff8800d2b22b24 R11: 0000000000000001 R12: ffff8800d210f840
        R13: ffff8800d21b9c50 R14: ffff8800cf7133e0 R15: ffff8800cad433d8
        FS:  00007f49723e1840(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffff88011634b8f0 CR3: 00000000ce469000 CR4: 00000000000006e0
        Stack:
         ffff8800d2170188 ffff8800d210f840 ffff8800d2171b90 0000000000000000
         ffff8800d21b9be8 ffffffff817c55bb ffff8800d21b9c50 ffff8800d2171b90
         ffff8800d210f840 ffff8800d21b0300 ffff8800d21b9c50 ffff8800d21b9c18
        Call Trace:
         [<ffffffff817c55bb>] tcindex_classify+0x88/0x9b
         [<ffffffff817a7f7d>] tc_classify_compat+0x3e/0x7b
         [<ffffffff817a7fdf>] tc_classify+0x25/0x9f
         [<ffffffff817b0e68>] htb_enqueue+0x55/0x27a
         [<ffffffff817b6c2e>] dsmark_enqueue+0x165/0x1a4
         [<ffffffff81775642>] __dev_queue_xmit+0x35e/0x536
         [<ffffffff8177582a>] dev_queue_xmit+0x10/0x12
         [<ffffffff818f8ecd>] packet_sendmsg+0xb26/0xb9a
         [<ffffffff810b1507>] ? __lock_acquire+0x3ae/0xdf3
         [<ffffffff8175cf08>] __sock_sendmsg_nosec+0x25/0x27
         [<ffffffff8175d916>] sock_aio_write+0xd0/0xe7
         [<ffffffff8117d6b8>] do_sync_write+0x59/0x78
         [<ffffffff8117d84d>] vfs_write+0xb5/0x10a
         [<ffffffff8117d96a>] SyS_write+0x49/0x7f
         [<ffffffff8198e212>] system_call_fastpath+0x16/0x1b

This is because we memcpy struct tcindex_filter_result which contains
struct tcf_exts, obviously struct list_head can not be simply copied.
This is a regression introduced by commit 33be627159
(net_sched: act: use standard struct list_head).

It's not very easy to fix it as the code is a mess:

       if (old_r)
               memcpy(&cr, r, sizeof(cr));
       else {
               memset(&cr, 0, sizeof(cr));
               tcf_exts_init(&cr.exts, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
       }
       ...
       tcf_exts_change(tp, &cr.exts, &e);
       ...
       memcpy(r, &cr, sizeof(cr));

the above code should equal to:

        tcindex_filter_result_init(&cr);
        if (old_r)
               cr.res = r->res;
        ...
        if (old_r)
               tcf_exts_change(tp, &r->exts, &e);
        else
               tcf_exts_change(tp, &cr.exts, &e);
        ...
        r->res = cr.res;

after this change, since there is no need to copy struct tcf_exts.

And it also fixes other places zero'ing struct's contains struct tcf_exts.

Fixes: commit 33be627159 (net_sched: act: use standard struct list_head)
Reported-by: Kelly Anderson <kelly@xilka.com>
Tested-by: Kelly Anderson <kelly@xilka.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-21 16:47:13 -04:00
Yang Yingliang
b2ce49e737 sch_hhf: fix comparison of qlen and limit
When I use the following command, eth0 cannot send any packets.
 #tc qdisc add dev eth0 root handle 1: hhf limit 1

Because qlen need be smaller than limit, all packets were dropped.
Fix this by qlen *<=* limit.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-12 14:55:21 -04:00
David S. Miller
5f013c9bc7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/altera/altera_sgdma.c
	net/netlink/af_netlink.c
	net/sched/cls_api.c
	net/sched/sch_api.c

The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces.  These were simple transformations from
netlink_capable to netlink_ns_capable.

The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-12 13:19:14 -04:00
John Fastabend
f6a082fed1 net: sched: lock imbalance in hhf qdisc
hhf_change() takes the sch_tree_lock and releases it but misses the
error cases. Fix the missed case here.

To reproduce try a command like this,

# tc qdisc change dev p3p2 root hhf quantum 40960 non_hh_weight 300000

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-04 19:41:45 -04:00
Stéphane Graber
4e8bbb819d net: Allow tc changes in user namespaces
This switches a few remaining capable(CAP_NET_ADMIN) to ns_capable so
that root in a user namespace may set tc rules inside that namespace.

Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-02 17:43:25 -04:00
Cong Wang
a49eb42a34 sched, act: allow to clear all actions as well
When we change the list of action on a given filter, currently we don't
change it to empty. This is a bug, we should allow to change to whatever
users given.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-27 23:42:39 -04:00
Cong Wang
2f7ef2f879 sched, cls: check if we could overwrite actions when changing a filter
When actions are attached to a filter, they are a part of the filter
itself, so when changing a filter we should allow to overwrite the actions
inside as well.

In my specific case, when I tried to _append_ a new action to an existing
filter which already has an action, I got EEXIST since kernel refused
to overwrite the existing one in kernel.

This patch checks if we are changing the filter checking NLM_F_CREATE flag
(Sigh, filters don't use NLM_F_REPLACE...) and then passes the boolean down
to actions. This fixes the problem above.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-27 23:42:39 -04:00
Eric W. Biederman
90f62cf30a net: Use netlink_ns_capable to verify the permisions of netlink messages
It is possible by passing a netlink socket to a more privileged
executable and then to fool that executable into writing to the socket
data that happens to be valid netlink message to do something that
privileged executable did not intend to do.

To keep this from happening replace bare capable and ns_capable calls
with netlink_capable, netlink_net_calls and netlink_ns_capable calls.
Which act the same as the previous calls except they verify that the
opener of the socket had the desired permissions as well.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24 13:44:54 -04:00
david decotigny
2d3b479df4 net-sysfs: expose number of carrier on/off changes
This allows to monitor carrier on/off transitions and detect link
flapping issues:
 - new /sys/class/net/X/carrier_changes
 - new rtnetlink IFLA_CARRIER_CHANGES (getlink)

Tested:
  - grep . /sys/class/net/*/carrier_changes
    + ip link set dev X down/up
    + plug/unplug cable
  - updated iproute2: prints IFLA_CARRIER_CHANGES
  - iproute2 20121211-2 (debian): unchanged behavior

Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 16:24:52 -04:00
Eric Dumazet
d37d8ac17d net: sched: use no more than one page in struct fw_head
In commit b4e9b520ca ("[NET_SCHED]: Add mask support to fwmark
classifier") Patrick added an u32 field in fw_head, making it slightly
bigger than one page.

Lets use 256 slots to make fw_hash() more straight forward, and move
@mask to the beginning of the structure as we often use a small number
of skb->mark. @mask and first hash buckets share the same cache line.

This brings back the memory usage to less than 4000 bytes, and permits
John to add a rcu_head at the end of the structure later without any
worry.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-18 14:17:55 -04:00
David S. Miller
85dcce7a73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/r8152.c
	drivers/net/xen-netback/netback.c

Both the r8152 and netback conflicts were simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14 22:31:55 -04:00
Yang Yingliang
d59b7d8059 net_sched: return nla_nest_end() instead of skb->len
nla_nest_end() already has return skb->len, so replace
return skb->len with return nla_nest_end instead().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-13 15:39:20 -04:00
Eric Dumazet
fba373d2bb pkt_sched: add cond_resched() to class and qdisc dump
We have seen delays of more than 50ms in class or qdisc dumps, in case
device is under high TX stress, even with the prior 4KB per skb limit.

Add cond_resched() to give a chance to higher prio tasks to get cpu.

Signed-off-by; Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 23:54:23 -04:00
Eric Dumazet
15dc36ebbb pkt_sched: do not use rcu in tc_dump_qdisc()
Like all rtnetlink dump operations, we hold RTNL in tc_dump_qdisc(),
so we do not need to use rcu protection to protect list of netdevices.

This will allow preemption to occur, thus reducing latencies.
Following patch adds explicit cond_resched() calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 23:54:23 -04:00
Eric Dumazet
2818fa0fa0 pkt_sched: fq: do not hold qdisc lock while allocating memory
Resizing fq hash table allocates memory while holding qdisc spinlock,
with BH disabled.

This is definitely not good, as allocation might sleep.

We can drop the lock and get it when needed, we hold RTNL so no other
changes can happen at the same time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 16:17:52 -04:00
Eric Dumazet
37314363cd pkt_sched: move the sanity test in qdisc_list_add()
The WARN_ON(root == &noop_qdisc)) added in qdisc_list_add()
can trigger in normal conditions when devices are not up.
It should be done only right before the list_add_tail() call.

Fixes: e57a784d8c ("pkt_sched: set root qdisc before change() in attach_default_qdiscs()")
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Mirco Tischler <mt-ml@gmx.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 15:44:21 -04:00
Eric Dumazet
2d8d40afd1 pkt_sched: fq: do not hold qdisc lock while allocating memory
Resizing fq hash table allocates memory while holding qdisc spinlock,
with BH disabled.

This is definitely not good, as allocation might sleep.

We can drop the lock and get it when needed, we hold RTNL so no other
changes can happen at the same time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-08 19:09:10 -05:00
Eric Dumazet
6f542efcbc net_sched: htb: do not acquire qdisc lock in dump operations
htb_dump() and htb_dump_class() do not strictly need to acquire
qdisc lock to fetch qdisc and/or class parameters.

We hold RTNL and no changes can occur.

This reduces by 50% qdisc lock pressure while doing tc qdisc|class dump
operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 17:24:54 -05:00
David S. Miller
67ddc87f16 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/recv.c
	drivers/net/wireless/mwifiex/pcie.c
	net/ipv6/sit.c

The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.

The two wireless conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:32:02 -05:00
Hiroaki SHIMODA
a135e598c4 sch_tbf: Remove holes in struct tbf_sched_data.
On x86_64 we have 3 holes in struct tbf_sched_data.

The member peak_present can be replaced with peak.rate_bytes_ps,
because peak.rate_bytes_ps is set only when peak is specified in
tbf_change(). tbf_peak_present() is introduced to test
peak.rate_bytes_ps.

The member max_size is moved to fill 32bit hole.

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03 15:43:47 -05:00
Hiroaki SHIMODA
724b9e1d75 sch_tbf: Fix potential memory leak in tbf_change().
The allocated child qdisc is not freed in error conditions.
Defer the allocation after user configuration turns out to be
valid and acceptable.

Fixes: cc106e441a ("net: sched: tbf: fix the calculation of max_size")
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27 12:53:50 -05:00
David S. Miller
1e8d6421cf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_3ad.h
	drivers/net/bonding/bond_main.c

Two minor conflicts in bonding, both of which were overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-19 01:24:22 -05:00
Yang Yingliang
3fbac2a87e sch_netem: replace magic numbers with enumerate in get_loss_clg
Replace two magic numbers which intialize clgstate::state.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 16:55:54 -05:00
Yang Yingliang
c045a734da sch_netem: replace magic numbers with enumerate in GE model
Replace some magic numbers which describe states of GE model
loss generator with enumerate.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 00:15:20 -05:00
Yang Yingliang
49545a7775 sch_netem: change some func's param from "struct Qdisc *" to "struct netem_sched_data *"
In netem_change(), we have already get "struct netem_sched_data *q".
Replace params of get_correlation() and other similar functions with
"struct netem_sched_data *q".

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 00:15:20 -05:00
Yang Yingliang
54a4b05cd2 sch_netem: return errcode before setting params
get_dist_table() and get_loss_clg() may be failed. These
two functions should be called after setting the members
of qdisc_priv(sch), or it will break the old settings while
either of them is failed.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 00:15:20 -05:00
Vijay Subramanian
219e288e89 net: sched: Cleanup PIE comments
Fix incorrect comment reported by Norbert Kiesel. Edit another comment to add
more details. Also add references to algorithm (IETF draft and paper) to top of
file.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
CC: Mythili Prabhu <mysuryan@cisco.com>
CC: Norbert Kiesel <nkiesel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 18:29:58 -05:00
WANG Cong
03701d6ebd net_sched: act: clean up tca_action_flush()
We could allocate tc_action on stack in tca_action_flush(),
since it is not large.

Also, we could use create_a() in tcf_action_get_1().

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-12 19:23:32 -05:00
WANG Cong
55334a5db5 net_sched: act: refuse to remove bound action outside
When an action is bonnd to a filter, there is no point to
remove it outside. Currently we just silently decrease the refcnt,
we should reject this explicitly with EPERM.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-12 19:23:32 -05:00
WANG Cong
4f1e9d8949 net_sched: act: move tcf_hashinfo_init() into tcf_register_action()
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-12 19:23:32 -05:00
WANG Cong
a5b5c958ff net_sched: act: refactor cleanup ops
For bindcnt and refcnt etc., they are common for all actions,
not need to repeat such operations for their own, they can be unified
now. Actions just need to do its specific cleanup if needed.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-12 19:23:32 -05:00
WANG Cong
86062033fe net_sched: act: hide struct tcf_common from API
Now we can totally hide it from modules. tcf_hash_*() API's
will operate on struct tc_action, modules don't need to care about
the details.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-12 19:23:32 -05:00
Florian Westphal
de960aa9ab net: add and use skb_gso_transport_seglen()
This moves part of Eric Dumazets skb_gso_seglen helper from tbf sched to
skbuff core so it may be reused by upcoming ip forwarding path patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-26 22:38:23 -08:00
Harry Mason
29824310ce sch_htb: let skb->priority refer to non-leaf class
If the class in skb->priority is not a leaf, apply filters from the
selected class, not the qdisc. This lets netfilter or user space
partially classify the packet.

Signed-off-by: Harry Mason <harry.mason@smoothwall.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 17:39:48 -08:00
Hannes Frederic Sowa
809fa972fd reciprocal_divide: update/correction of the algorithm
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.

This has been fixed by Eric Dumazet in commit aee636c480
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
where very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:

1) Initialization:

  int l = ceil(log_2 d)
  uword m' = floor((1<<32)*((1<<l)-d)/d)+1
  int sh_1 = min(l,1)
  int sh_2 = max(l-1,0)

2) For q = n/d, all uword:

  uword t = (n*m')>>32
  q = (t+((n-t)>>sh_1))>>sh_2

The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.

Joint work with Daniel Borkmann.

  [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
  [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
  [3] https://gmplib.org/~tege/division-paper.pdf
  [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
  [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
  [6] http://www.agner.org/optimize/asmlib.zip

Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 23:17:20 -08:00
Daniel Borkmann
f337db64af random32: add prandom_u32_max and convert open coded users
Many functions have open coded a function that returns a random
number in range [0,N-1]. Under the assumption that we have a PRNG
such as taus113 with being well distributed in [0, ~0U] space,
we can implement such a function as uword t = (n*m')>>32, where
m' is a random number obtained from PRNG, n the right open interval
border and t our resulting random number, with n,m',t in u32 universe.

Lets go with Joe and simply call it prandom_u32_max(), although
technically we have an right open interval endpoint, but that we
have documented. Other users can further be migrated to the new
prandom_u32_max() function later on; for now, we need to make sure
to migrate reciprocal_divide() users for the reciprocal_divide()
follow-up fixup since their function signatures are going to change.

Joint work with Hannes Frederic Sowa.

Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 23:17:20 -08:00
WANG Cong
6e6a50c254 net_sched: act: export tcf_hash_search() instead of tcf_hash_lookup()
So that we will not expose struct tcf_common to modules.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 14:43:16 -08:00
WANG Cong
c779f7af99 net_sched: act: fetch hinfo from a->ops->hinfo
Every action ops has a pointer to hash info, so we don't need to
hard-code it in each module.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 14:43:16 -08:00
WANG Cong
671314a5ab net_sched: act: remove capab from struct tc_action_ops
It is not actually implemented.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 19:58:07 -08:00
Yang Yingliang
a6e2fe17eb sch_netem: replace magic numbers with enumerate
Replace some magic numbers which describe states of 4-state model
loss generator with enumerate.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 17:17:34 -08:00
Wei Yongjun
722e47d792 net_sched: fix error return code in fw_change_attrs()
The error code was not set if change indev fail, so the error
condition wasn't reflected in the return value. Fix to return a
negative error code from this error handling case instead of 0.

Fixes: 2519a602c2 ('net_sched: optimize tcf_match_indev()')
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:12:03 -08:00
WANG Cong
6c80563c2f net_sched: act: pick a different type for act_xt
In tcf_register_action() we check either ->type or ->kind to see if
there is an existing action registered, but ipt action registers two
actions with same type but different kinds. They should have different
types too.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:24:11 -08:00
WANG Cong
fb1d598d48 net_sched: act: use tcf_hash_release() in net/sched/act_police.c
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:21:55 -08:00
Aruna-Hewapathirane
63862b5bef net: replace macros net_random and net_srandom with direct calls to prandom
This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.

Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 15:15:25 -08:00
David S. Miller
0a379e21c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-01-14 14:42:42 -08:00
WANG Cong
7eb8896df0 net_sched: act: remove struct tcf_act_hdr
It is not necessary at all.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:50:15 -08:00
WANG Cong
a8701a6c7a net_sched: avoid casting void pointer
tp->root is a void* pointer, no need to cast it.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:50:15 -08:00
WANG Cong
2519a602c2 net_sched: optimize tcf_match_indev()
tcf_match_indev() is called in fast path, it is not wise to
search for a netdev by ifindex and then compare by its name,
just compare the ifindex.

Also, dev->name could be changed by user-space, therefore
the match would be always fail, but dev->ifindex could
be consistent.

BTW, this will also save some bytes from the core struct of u32.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:50:15 -08:00
WANG Cong
832d1d5bfa net_sched: add struct net pointer to tcf_proto_ops->dump
It will be needed by the next patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:50:14 -08:00
WANG Cong
a56e19538d net_sched: act: clean up notification functions
Refactor tcf_add_notify() and factor out tcf_del_notify().

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:50:14 -08:00
WANG Cong
ddafd34f41 net_sched: act: move idx_gen into struct tcf_hashinfo
There is no need to store the index separatedly
since tcf_hashinfo is allocated statically too.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:50:14 -08:00
Terry Lam
6c76a07a71 HHF qdisc: fix jiffies-time conversion.
This is to be compatible with the use of "get_time" (i.e. default
time unit in us) in iproute2 patch for HHF as requested by Stephen.

Signed-off-by: Terry Lam <vtlam@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:20:39 -08:00
Jason Wang
f663dd9aaf net: core: explicitly select a txq before doing l2 forwarding
Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The
will cause several issues:

- NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan
  instead of lower device which misses the necessary txq synchronization for
  lower device such as txq stopping or frozen required by dev watchdog or
  control path.
- dev_hard_start_xmit() was called with NULL txq which bypasses the net device
  watchdog.
- dev_hard_start_xmit() does not check txq everywhere which will lead a crash
  when tso is disabled for lower device.

Fix this by explicitly introducing a new param for .ndo_select_queue() for just
selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also
extended to accept this parameter and dev_queue_xmit_accel() was used to do l2
forwarding transmission.

With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need
to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep
a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of
dev_queue_xmit() to do the transmission.

In the future, it was also required for macvtap l2 forwarding support since it
provides a necessary synchronization method.

Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: e1000-devel@lists.sourceforge.net
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10 13:23:08 -05:00
David S. Miller
56a4342dfe Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
	net/ipv6/ip6_tunnel.c
	net/ipv6/ip6_vti.c

ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.

qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 17:37:45 -05:00
Jamal Hadi Salim
805c1f4aed net_sched: act: action flushing missaccounting
action flushing missaccounting
Account only for deleted actions

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:46:56 -05:00
Jamal Hadi Salim
63acd6807c net_sched: Remove unnecessary checks for act->ops
Remove unnecessary checks for act->ops
(suggested by Eric Dumazet).

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 16:46:32 -05:00
Vijay Subramanian
d4b36210c2 net: pkt_sched: PIE AQM scheme
Proportional Integral controller Enhanced (PIE) is a scheduler to address the
bufferbloat problem.

>From the IETF draft below:
" Bufferbloat is a phenomenon where excess buffers in the network cause high
latency and jitter. As more and more interactive applications (e.g. voice over
IP, real time video streaming and financial transactions) run in the Internet,
high latency and jitter degrade application performance. There is a pressing
need to design intelligent queue management schemes that can control latency and
jitter; and hence provide desirable quality of service to users.

We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software.  "

Many thanks to Dave Taht for extensive feedback, reviews, testing and
suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
suggestions.  Naeem Khademi and Dave Taht independently contributed to ECN
support.

For more information, please see technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.

Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00

All relevant code, documents and test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.

For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu
(mysuryan@cisco.com)

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06 15:13:01 -05:00
David S. Miller
855404efae Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next tree,
they are:

* Add full port randomization support. Some crazy researchers found a way
  to reconstruct the secure ephemeral ports that are allocated in random mode
  by sending off-path bursts of UDP packets to overrun the socket buffer of
  the DNS resolver to trigger retransmissions, then if the timing for the
  DNS resolution done by a client is larger than usual, then they conclude
  that the port that received the burst of UDP packets is the one that was
  opened. It seems a bit aggressive method to me but it seems to work for
  them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a
  new NAT mode to fully randomize ports using prandom.

* Add a new classifier to x_tables based on the socket net_cls set via
  cgroups. These includes two patches to prepare the field as requested by
  Zefan Li. Also from Daniel Borkmann.

* Use prandom instead of get_random_bytes in several locations of the
  netfilter code, from Florian Westphal.

* Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack
  mark, also from Florian Westphal.

* Fix compilation warning due to unused variable in IPVS, from Geert
  Uytterhoeven.

* Add support for UID/GID via nfnetlink_queue, from Valentina Giusti.

* Add IPComp extension to x_tables, from Fan Du.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05 20:18:50 -05:00
Daniel Borkmann
fe1217c4f3 net: net_cls: move cgroupfs classid handling into core
Zefan Li requested [1] to perform the following cleanup/refactoring:

- Split cgroupfs classid handling into net core to better express a
  possible more generic use.

- Disable module support for cgroupfs bits as the majority of other
  cgroupfs subsystems do not have that, and seems to be not wished
  from cgroup side. Zefan probably might want to follow-up for netprio
  later on.

- By this, code can be further reduced which previously took care of
  functionality built when compiled as module.

cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
that we are consistent with {netclassid,netprio}_cgroup naming that is
under net/core/ as suggested by Zefan.

No change in functionality, but only code refactoring that is being
done here.

 [1] http://patchwork.ozlabs.org/patch/304825/

Suggested-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:41 +01:00
Cong Wang
c1ddf295f5 net: revert "sched classifier: make cgroup table local"
This reverts commit de6fb288b1.
Otherwise we got:

net/sched/cls_cgroup.c:106:29: error: static declaration of ‘net_cls_subsys’ follows non-static declaration
 static struct cgroup_subsys net_cls_subsys = {
                             ^
In file included from include/linux/cgroup.h:654:0,
                 from net/sched/cls_cgroup.c:18:
include/linux/cgroup_subsys.h:35:29: note: previous declaration of ‘net_cls_subsys’ was here
 SUBSYS(net_cls)
                             ^
make[2]: *** [net/sched/cls_cgroup.o] Error 1

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:02:12 -05:00
stephen hemminger
de6fb288b1 sched classifier: make cgroup table local
Doesn't need to be global.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
stephen hemminger
9c75f4029c sched action: make local function static
No need to export functions only used in one file.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
Zhi Yong Wu
21eb218989 net, sch: fix the typo in register_qdisc()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 16:44:10 -05:00
Yang Yingliang
6a031f67c8 sch_netem: support of 64bit rates
Add a new attribute to support 64bit rates so that
tc can use them to break the 32bit limit.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 14:31:44 -05:00
Yang Yingliang
8cfd88d6d7 sch_netem: more precise length of packets
With TSO/GSO/GRO packets, skb->len doesn't represent
a precise amount of bytes on wire.
This patch replace skb->len with qdisc_pkt_len(skb)
which is more precise.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 14:31:43 -05:00
Yang Yingliang
c76f2a2c4c sch_dsmark: use correct func name in print messages
In dsmark_drop(), the function name printed by pr_debug
is "dsmark_reset", correct it to "dsmark_drop" by using
__func__ .

BTW, replace the other function names with __func__ .

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 13:50:57 -05:00
Yang Yingliang
a071d27241 sch_htb: use /* comments
Do not use C99 // comments and correct a spelling typo.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 13:50:57 -05:00
Yang Yingliang
c17988a90f net_sched: replace pr_warning with pr_warn
Prefer pr_warn(... to pr_warning(...

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 13:50:56 -05:00
Jamal Hadi Salim
1a29321ed0 net_sched: act: Dont increment refcnt on replace
This is a bug fix. The existing code tries to kill many
 birds with one stone: Handling binding of actions to
 filters, new actions and replacing of action
 attributes. A simple test case to illustrate:

XXXX
 moja@fe1:~$ sudo tc actions add action drop index 12
 moja@fe1:~$ actions get action gact index 12
 action order 1: gact action drop
  random type none pass val 0
  index 12 ref 1 bind 0
 moja@fe1:~$ sudo tc actions replace action ok index 12
 moja@fe1:~$ actions get action gact index 12
 action order 1: gact action drop
  random type none pass val 0
  index 12 ref 2 bind 0
XXXX

The above shows the refcounf being wrongly incremented on replace.
There are more complex scenarios with binding of actions to filters
that i am leaving out that didnt work as well...

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-27 12:50:00 -05:00
Yang Yingliang
2e04ad424b sch_tbf: add TBF_BURST/TBF_PBURST attribute
When we set burst to 1514 with low rate in userspace,
the kernel get a value of burst that less than 1514,
which doesn't work.

Because it may make some loss when transform burst
to buffer in userspace. This makes burst lose some
bytes, when the kernel transform the buffer back to
burst.

This patch adds two new attributes to support sending
burst/mtu to kernel directly to avoid the loss.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:54:22 -05:00
stephen hemminger
c49fa257ba hhf: make qdisc ops static
This module shouldn't be randomly exporting symbols

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:29:35 -05:00
Eric Dumazet
a792866ad2 net_sched: fix regression in tc_action_ops
list_for_each_entry(a, &act_base, head) doesn't
exit with a = NULL if we reached the end of the list.

tcf_unregister_action(), tc_lookup_action_n() and tc_lookup_action()
need fixes.

Remove tc_lookup_action_id() as its unused and not worth 'fixing'

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 1f747c26c4 ("net_sched: convert tc_action_ops to use struct list_head")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-20 17:06:27 -05:00
Eric Dumazet
dcd7608134 net_sched: fix a regression in tcf_proto_lookup_ops()
list_for_each_entry(t, &tcf_proto_base, head) doesn't
exit with t = NULL if we reached the end of the list.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 3627287463 ("net_sched: convert tcf_proto_ops to use struct
list_head")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-20 17:06:27 -05:00
WANG Cong
568a153a22 net_sched: fix a regression in tc actions
This patch fixes:
1) pass mask rather than size to tcf_hashinfo_init()
2) the cleanup should be in reversed order in mirred_cleanup_module()

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 369ba56787 ("net_sched: init struct tcf_hashinfo at register time")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-20 17:06:27 -05:00
Yang Yingliang
79c11f2e3f sch_cbq: remove unnecessary null pointer check
It already has a NULL pointer check of rtab in qdisc_put_rtab().
Remove the check outside of qdisc_put_rtab().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 15:06:55 -05:00
Yang Yingliang
3b69a4c9be act_police: remove unnecessary null pointer check
It already has a NULL pointer check of rtab in qdisc_put_rtab().
Remove the check outside of qdisc_put_rtab().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 15:06:55 -05:00
Terry Lam
10239edf86 net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc
This patch implements the first size-based qdisc that attempts to
differentiate between small flows and heavy-hitters.  The goal is to
catch the heavy-hitters and move them to a separate queue with less
priority so that bulk traffic does not affect the latency of critical
traffic.  Currently "less priority" means less weight (2:1 in
particular) in a Weighted Deficit Round Robin (WDRR) scheduler.

In essence, this patch addresses the "delay-bloat" problem due to
bloated buffers. In some systems, large queues may be necessary for
obtaining CPU efficiency, or due to the presence of unresponsive
traffic like UDP, or just a large number of connections with each
having a small amount of outstanding traffic. In these circumstances,
HHF aims to reduce the HoL blocking for latency sensitive traffic,
while not impacting the queues built up by bulk traffic.  HHF can also
be used in conjunction with other AQM mechanisms such as CoDel.

To capture heavy-hitters, we implement the "multi-stage filter" design
in the following paper:
C. Estan and G. Varghese, "New Directions in Traffic Measurement and
Accounting", in ACM SIGCOMM, 2002.

Some configurable qdisc settings through 'tc':
- hhf_reset_timeout: period to reset counter values in the multi-stage
                     filter (default 40ms)
- hhf_admit_bytes:   threshold to classify heavy-hitters
                     (default 128KB)
- hhf_evict_timeout: threshold to evict idle heavy-hitters
                     (default 1s)
- hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
                     non-heavy-hitters (default 2)
- hh_flows_limit:    max number of heavy-hitter flow entries
                     (default 2048)

Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
reflects the bandwidth of heavy-hitters that we attempt to capture
(25Mbps with the above default settings).

The false negative rate (heavy-hitter flows getting away unclassified)
is zero by the design of the multi-stage filter algorithm.
With 100 heavy-hitter flows, using four hashes and 4000 counters yields
a false positive rate (non-heavy-hitters mistakenly classified as
heavy-hitters) of less than 1e-4.

Signed-off-by: Terry Lam <vtlam@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 14:48:42 -05:00
David S. Miller
143c905494 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/i40e/i40e_main.c
	drivers/net/macvtap.c

Both minor merge hassles, simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:42:06 -05:00
WANG Cong
3627287463 net_sched: convert tcf_proto_ops to use struct list_head
We don't need to maintain our own singly linked list code.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:08 -05:00
WANG Cong
1f747c26c4 net_sched: convert tc_action_ops to use struct list_head
We don't need to maintain our own singly linked list code.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong
89819dc01f net_sched: convert tcf_hashinfo to hlist and use spinlock
So that we don't need to play with singly linked list,
and since the code is not on hot path, we can use spinlock
instead of rwlock.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong
369ba56787 net_sched: init struct tcf_hashinfo at register time
It looks weird to store the lock out of the struct but
still points to a static variable. Just move them into the struct.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong
5da57f422d net_sched: cls: refactor out struct tcf_ext_map
These information can be saved in tcf_exts, and this will
simplify the code.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong
33be627159 net_sched: act: use standard struct list_head
Currently actions are chained by a singly linked list,
therefore it is a bit hard to add and remove a specific
entry. Convert it to struct list_head so that in the
latter patch we can remove an action without finding
its head.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong
d84231d3a2 net_sched: remove get_stats from tc_action_ops
It is not used.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
Tom Herbert
3958afa1b2 net: Change skb_get_rxhash to skb_get_hash
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 16:36:21 -05:00
Eric Dumazet
c3bd85495a pkt_sched: fq: more robust memory allocation
This patch brings NUMA support and automatic fallback to vmalloc()
in case kmalloc() failed to allocate FQ hash table.

NUMA support depends on XPS being setup for the device before
qdisc allocation. After a XPS change, it might be worth creating
qdisc hierarchy again.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:25:20 -05:00
Eric Dumazet
e57a784d8c pkt_sched: set root qdisc before change() in attach_default_qdiscs()
After commit 95dc19299f ("pkt_sched: give visibility to mq slave
qdiscs") we call disc_list_add() while the device qdisc might be
the noop_qdisc one.

This shows up as duplicates in "tc qdisc show", as all inactive devices
point to noop_qdisc.

Fix this by setting dev->qdisc to the new qdisc before calling
ops->change() in attach_default_qdiscs()

Add a WARN_ON_ONCE() to catch any future similar problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-14 01:20:06 -05:00
Yang Yingliang
d55d282e6a sch_tbf: use do_div() for 64-bit divide
It's doing a 64-bit divide which is not supported
on 32-bit architectures in psched_ns_t2l(). The
correct way to do this is to use do_div().

It's introduced by commit cc106e441a
("net: sched: tbf: fix the calculation of max_size")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 22:53:26 -05:00
Yang Yingliang
6b1dd85601 sch_htb: remove unnecessary NULL pointer judgment
It already has a NULL pointer judgment of rtab in qdisc_put_rtab().
Remove the judgment outside of qdisc_put_rtab().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 15:30:17 -05:00
Yang Yingliang
1598f7cb47 net: sched: htb: fix the calculation of quantum
Now, 32bit rates may be not the true rate.
So use rate_bytes_ps which is from
max(rate32, rate64) to calcualte quantum.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 15:08:41 -05:00
Yang Yingliang
cc106e441a net: sched: tbf: fix the calculation of max_size
Current max_size is caluated from rate table. Now, the rate table
has been replaced and it's wrong to caculate max_size based on this
rate table. It can lead wrong calculation of max_size.

The burst in kernel may be lower than user asked, because burst may gets
some loss when transform it to buffer(E.g. "burst 40kb rate 30mbit/s")
and it seems we cannot avoid this loss. Burst's value(max_size) based on
rate table may be equal user asked. If a packet's length is max_size, this
packet will be stalled in tbf_dequeue() because its length is above the
burst in kernel so that it cannot get enough tokens. The max_size guards
against enqueuing packet sizes above q->buffer "time" in tbf_enqueue().

To make consistent with the calculation of tokens, this patch add a helper
psched_ns_t2l() to calculate burst(max_size) directly to fix this problem.

After this fix, we can support to using 64bit rates to calculate burst as well.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 15:08:41 -05:00
Yang Yingliang
4f8f61eb43 net_sched: expand control flow of macro SKIP_NONLOCAL
SKIP_NONLOCAL hides the control flow. The control flow should be
inlined and expanded explicitly in code so that someone who reads
it can tell the control flow can be changed by the statement.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 12:29:26 -05:00
Yang Yingliang
fa08943b97 net_sched: sfq: put sfq_unlink in a do - while loop
Macros with multiple statements should be enclosed in a do - while loop

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:44:52 -05:00
Yang Yingliang
833fa74386 net_sched: add space around '>' and before '('
Spaces required around that '>' (ctx:VxV) and
before the open parenthesis '('.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:44:51 -05:00
Yang Yingliang
82d567c266 net_sched: change "foo* bar" to "foo *bar"
"foo* bar" or "foo * bar" should be "foo *bar".

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:44:51 -05:00
Yang Yingliang
1fab9abc56 net_sched: cls_bpf: use tabs to do indent
Code indent should use tabs where possible

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:44:51 -05:00
Yang Yingliang
17569faedf net_sched: remove unnecessary parentheses while return
return is not a function, parentheses are not required.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:44:51 -05:00