Commit Graph

400976 Commits

Author SHA1 Message Date
Daniel Borkmann
97203abe6b net: ipvs: sctp: do not recalc sctp csum when ports didn't change
Unlike UDP or TCP, we do not take the pseudo-header into
account in SCTP checksums. So in case port mapping is the
very same, we do not need to recalculate the whole SCTP
checksum in software, which is very expensive.

Also, as in TCP, take into account the case where a private
helper mangled the packet. In that case, we also need to
recalculate the checksum even if the ports are the same.
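
A minimal userspace sketch of the decision described above. The function name and its parameters are hypothetical and are not taken from the IPVS code; `payload_mangled` stands in for a private helper having modified the packet.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the actual IPVS code.
 * The SCTP checksum covers only the SCTP packet itself (no pseudo-header),
 * so rewriting IP addresses alone leaves it valid. A recalculation is only
 * needed when the port in the SCTP header changed or the payload was mangled.
 */
static bool sctp_csum_needs_recalc(uint16_t old_port, uint16_t new_port,
                                   bool payload_mangled)
{
    return old_port != new_port || payload_mangled;
}
```

With identical ports and an untouched payload the function returns false, which is the case where the expensive software checksum can be skipped.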

Thanks for feedback regarding skb->ip_summed checks from
Julian Anastasov; here's a discussion on these checks for
snat and dnat:

* For snat_handler(), we can see CHECKSUM_PARTIAL from
  virtual devices, and from LOCAL_OUT, otherwise it
  should be CHECKSUM_UNNECESSARY. In general, in snat it
  is more complex. skb contains the original route and
  ip_vs_route_me_harder() can change the route after
  snat_handler. So, for locally generated replies from
  the local server we cannot preserve the CHECKSUM_PARTIAL
  mode. It is a chicken-and-egg dilemma: snat_handler
  needs the device after rerouting (to check for
  NETIF_F_SCTP_CSUM), while ip_route_me_harder() wants
  the snat_handler() to put the new saddr for proper
  rerouting.

* For dnat_handler(), we should not see CHECKSUM_COMPLETE
  for SCTP, in fact the small set of drivers that support
  SCTP offloading return CHECKSUM_UNNECESSARY on correctly
  received SCTP csum. We can see CHECKSUM_PARTIAL from
  local stack or received from virtual drivers. The idea is
  that SCTP decides to avoid csum calculation if hardware
  supports offloading. IPVS can change the device after
  rerouting to real server but we can preserve the
  CHECKSUM_PARTIAL mode if the new device supports
  offloading too. This works because skb dst is changed
  before dnat_handler and we see the new device. So, checks
  in the 'if' part will decide whether it is ok to keep
  CHECKSUM_PARTIAL for the output. If the packet came with
  CHECKSUM_NONE, we are dealing with an unknown checksum. As we
  recalculate the sum for the IP header in all cases, it should
  be safe to use CHECKSUM_UNNECESSARY. We can forward a wrong
  checksum in this case (without cp->app). In case of
  CHECKSUM_UNNECESSARY, the csum was valid on receive.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-30 09:48:16 +09:00
Daniel Borkmann
6e7cd27c0f net: ipvs: sctp: add missing verdict assignments in sctp_conn_schedule
If skb_header_pointer() fails, we need to assign a verdict, NF_DROP
in this case; otherwise we would leave the verdict from
conn_schedule() uninitialized when returning.
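
The pattern can be sketched in userspace C as follows. `conn_schedule_sketch()` is a hypothetical stand-in, not the real sctp_conn_schedule(); a NULL `hdr` models skb_header_pointer() failing, and the enum values are assumed to mirror the kernel's NF_DROP and NF_ACCEPT.

```c
#include <stddef.h>

/* Assumed to mirror the kernel's verdict values. */
enum { NF_DROP = 0, NF_ACCEPT = 1 };

/*
 * Hypothetical stand-in for a scheduling callback.
 * Returns 0 when processing must stop and 1 when it may continue.
 * Every return path writes *verdict, so the caller never reads an
 * uninitialized value.
 */
static int conn_schedule_sketch(const void *hdr, int *verdict)
{
    if (hdr == NULL) {
        *verdict = NF_DROP; /* the kind of assignment the fix adds */
        return 0;
    }
    *verdict = NF_ACCEPT;
    return 1;
}
```

A caller that passes a NULL header gets a return value of 0 together with a well-defined NF_DROP verdict.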

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-28 19:00:49 +09:00
Florian Westphal
6b8dbcf2c4 bridge: netfilter: orphan skb before invoking ip netfilter hooks
Pekka Pietikäinen reports an xt_socket behavioural change after commit
00028aa37098 (netfilter: xt_socket: use IP early demux).

The reason is that xt_socket no longer does an unconditional sk lookup -
it re-uses the existing skb->sk if possible, assuming ->sk was set by
ip early demux.

However, when netfilter is invoked via bridge, this can cause 'bogus'
sockets to be examined by the match, e.g. a 'tun' device socket.

bridge netfilter should orphan the skb just like the routing path
before invoking ipv4/ipv6 netfilter hooks to avoid this.

Reported-and-tested-by: Pekka Pietikäinen <pp@ee.oulu.fi>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-27 21:44:33 +01:00
Michael Opdenacker
1e56555ee1 netfilter: ipset: remove duplicate define
This patch removes a duplicate define from
net/netfilter/ipset/ip_set_hash_gen.h

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-10-27 19:24:45 +01:00
Jozsef Kadlecsik
dc476e7c8e netfilter:ipset: Fix memory allocation for bitmap:port
During the restructuring of the bitmap type creation in ipset, a
wrong (too large) memory allocation was copied for the bitmap:port
type (netfilter bugzilla id #859).

Reported-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2013-10-27 19:24:39 +01:00
Stanislav Fomichev
f2020b27be netfilter: ip6t_REJECT: skip checksum verification for outgoing ipv6 packets
Don't verify checksum for outgoing packets because checksum calculation
may be done by the device.

Without this patch:
$ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset
$ time telnet ipv6.google.com 80
Trying 2a00:1450:4010:c03::67...
telnet: Unable to connect to remote host: Connection timed out

real    0m7.201s
user    0m0.000s
sys     0m0.000s

With the patch applied:
$ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset
$ time telnet ipv6.google.com 80
Trying 2a00:1450:4010:c03::67...
telnet: Unable to connect to remote host: Connection refused

real    0m0.085s
user    0m0.000s
sys     0m0.000s

Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-23 11:20:00 +02:00
Jozsef Kadlecsik
1a869205c7 netfilter: ipset: The unnamed union initialization may lead to compilation error
It should be possible to initialize the unnamed union directly, but
unfortunately that is not the case:

/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c: In
function 'hash_netnet4_kadt':
/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c:141:
error: unknown field 'cidr' specified in initializer

Reported-by: Husnu Demir <hdemir@metu.edu.tr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22 10:14:01 +02:00
Jozsef Kadlecsik
93302880d8 netfilter: ipset: Use netlink callback dump args only
Instead of cb->data, use callback dump args only and introduce symbolic
names instead of plain numbers for accessing the argument members.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-22 10:13:59 +02:00
Gao feng
d86946d2c5 netfilter: ipt_CLUSTERIP: use proper net namespace to operate CLUSTERIP
We can now allow users in non-init net namespaces to operate
ipt_CLUSTERIP.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:48:47 +02:00
Gao feng
f58d786601 netfilter: ipt_CLUSTERIP: create proc entry under proper ipt_CLUSTERIP directory
Create proc entries under the ipt_CLUSTERIP directory of proper
net namespace.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:48:25 +02:00
Gao feng
b5ef0f85bf netfilter: ipt_CLUSTERIP: add parameter net in clusterip_config_find_get
In order to find clusterip_config in the net namespace.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:48:16 +02:00
Gao feng
f1e8077f49 netfilter: ipt_CLUSTERIP: make clusterip_lock per net namespace
This lock protects the per net namespace clusterip_configs,
so it should be per net namespace too.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:48:07 +02:00
Gao feng
26a89e4354 netfilter: ipt_CLUSTERIP: make clusterip_list per net namespace
clusterip_configs should be per net namespace, so that operating
a cluster in one net namespace won't affect other net
namespaces. Right now, only the clusterip_configs of the
init net namespace can be operated.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:47:58 +02:00
Gao feng
ce4ff76c15 netfilter: ipt_CLUSTERIP: make proc directory per net namespace
Create a /proc/net/ipt_CLUSTERIP directory per net namespace.
Right now, entries can only be created under the ipt_CLUSTERIP
directory in the init net namespace.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:47:49 +02:00
Eric Dumazet
1a8bf6eeef netfilter: xt_socket: use sock_gen_put()
TCP listener refactoring, part 7 :

Use sock_gen_put() instead of xt_socket_put_sk() for future
SYN_RECV support.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-17 10:27:25 +02:00
Alexander Frolkin
1255ce5f10 ipvs: improved SH fallback strategy
Improve the SH fallback realserver selection strategy.

With sh and sh-fallback, if a realserver is down, this attempts to
distribute the traffic that would have gone to that server evenly
among the remaining servers.

Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-15 10:54:50 +09:00
Julian Anastasov
9e4e948a3e ipvs: avoid rcu_barrier during netns cleanup
commit 578bc3ef1e ("ipvs: reorganize dest trash") added
rcu_barrier() on cleanup to wait for dest users and schedulers
like LBLC and LBLCR to put their last dest reference.
Using rcu_barrier with many namespaces is problematic.

Trying to fix it by freeing dest with kfree_rcu is not
a solution: RCU callbacks can run in parallel and execution
order is random.

Fix it by creating new function ip_vs_dest_put_and_free()
which is heavier than ip_vs_dest_put(). We will use it just
for schedulers like LBLC, LBLCR that can delay their dest
release.

By default, a dest's reference count is above 0 while it is present
in a service, and it is 0 when deleted but still in the trash list.
Change the dest trash code to use ip_vs_dest_put_and_free(),
so that refcnt -1 can be used for freeing. As a result,
such checks remain in the slow path and the rcu_barrier() from
netns cleanup can be removed.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-15 10:36:01 +09:00
Julian Anastasov
120c9794a3 ipvs: fix the IPVS_CMD_ATTR_MAX definition
It was wrong (too big), but the problem is harmless.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-15 10:36:01 +09:00
David S. Miller
58308451e9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:

====================
This series contains updates to i40e only.

Alex provides the majority of the patches against i40e, where he
cleans up the Tx and Rx queues and aligns the code with the known
good Tx/Rx queue code in the ixgbe driver.

Anjali provides an i40e patch to update link events to not print to
the log until the device is administratively up.

Catherine provides a patch to update the driver version.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 15:29:44 -04:00
Eric Dumazet
b44084c2c8 inet: rename ir_loc_port to ir_num
In commit 634fb979e8 ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have the same byte order:

skc_dport is __be16 (network order), but skc_num is __u16 (host order)

So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16

Let's rename ir_loc_port to ireq->ir_num (by analogy with inet->inet_num),
and perform the appropriate htons/ntohs conversions.
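
A minimal userspace sketch of the byte-order split described above. The struct and function names are hypothetical; only the convention (remote port kept in network order, local port kept in host order) comes from the message.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical miniature of the two port fields. */
struct req_ports_sketch {
    uint16_t rmt_port_be; /* network byte order, like skc_dport */
    uint16_t num;         /* host byte order, like skc_num */
};

/* Fill the fields from values taken off the wire (network order). */
static void req_ports_from_wire(struct req_ports_sketch *r,
                                uint16_t wire_local, uint16_t wire_remote)
{
    r->num = ntohs(wire_local);   /* convert once, store in host order */
    r->rmt_port_be = wire_remote; /* stored as received */
}

/* Produce the local port in network order, e.g. for writing a header. */
static uint16_t req_local_port_to_wire(const struct req_ports_sketch *r)
{
    return htons(r->num);
}
```

Keeping the conversion at these two boundaries is what lets a checker such as sparse tell the host-order field apart from the network-order one.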

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 14:37:35 -04:00
Catherine Sullivan
d04795d663 i40e: Bump version
Update the version number of the driver.

Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 23:25:33 -07:00
Alexander Duyck
980e9b1186 i40e: Add support for 64 bit netstats
This change brings support for 64 bit netstats to the driver. Previously
the stats were 64 bit but highly racy due to the fact that 64 bit
transactions are not atomic on 32 bit systems.  This change makes it so
that the 64 bit byte and packet stats are reliable on all architectures.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 23:16:27 -07:00
Alexander Duyck
9f65e15b4f i40e: Move rings from pointer to array to array of pointers
Allocate the queue pairs individually instead of as a group.  This
allows for much easier queue management as it is possible to dynamically
resize the queues without having to free and allocate the entire block.

Ease statistic collection by treating Tx/Rx queue pairs as a single
unit.  Each pair is allocated together and starts with a Tx queue and
ends with an Rx queue.  By ordering them this way it is possible to know
the Rx offset based on a pointer to the Tx queue.
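
A minimal userspace sketch of the pairing idea. `ring_sketch`, `alloc_ring_pair()` and `rx_of()` are hypothetical names, and calloc() stands in for a kernel allocation; this is an illustration under those assumptions, not the driver's code.

```c
#include <stdlib.h>

/* Hypothetical, much reduced ring descriptor. */
struct ring_sketch {
    int queue_index;
    int is_tx;
};

/*
 * Allocate one queue pair as a single two-element block:
 * element 0 is the Tx ring, element 1 is the Rx ring.
 * Returns a pointer to the Tx ring, or NULL on allocation failure.
 */
static struct ring_sketch *alloc_ring_pair(int queue_index)
{
    struct ring_sketch *tx = calloc(2, sizeof(*tx));

    if (tx == NULL)
        return NULL;
    tx[0].queue_index = queue_index;
    tx[0].is_tx = 1;
    tx[1].queue_index = queue_index;
    tx[1].is_tx = 0;
    return tx;
}

/* Because of the fixed ordering, the Rx ring is always at Tx + 1. */
static struct ring_sketch *rx_of(struct ring_sketch *tx)
{
    return tx + 1;
}
```

Each pair can be freed or reallocated on its own with a single free() of the Tx pointer, which is the flexibility the message describes.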

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 22:48:35 -07:00
Alexander Duyck
cd0b6fa656 i40e: Replace ring container array with linked list
This replaces the ring container array with a linked list.  The idea is
to make the logic much easier to deal with since this will allow us to
call a simple helper function from the q_vectors to go through the
entire list.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 22:39:30 -07:00
Alexander Duyck
493fb30011 i40e: Move q_vectors from pointer to array to array of pointers
Allocate the q_vectors individually. The advantage to this is that it
allows for easier freeing and allocation.  In addition it makes it so
that we could do node specific allocations at some point in the future.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 22:32:10 -07:00
Alexander Duyck
a114d0a6ac i40e: Split bytes and packets from Rx/Tx stats
This makes it so that the Tx and Rx byte and packet counts are
separated from the rest of the statistics.  This allows for better
isolation of these stats when we move them into the 64 bit statistics.

Simplify things by re-ordering how the stats display in ethtool.
Instead of displaying all of the Tx queues as a block, followed by all
the Rx queues, the new order is Tx[0], Rx[0], Tx[1], Rx[1], ..., Tx[n],
Rx[n].  This reduces the loops and cleans up the display for testing
purposes since it is very easy to verify if flow director is doing the
right thing as the Tx and Rx queue pair are shown in pairs.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 22:21:39 -07:00
Alexander Duyck
7070ce0a64 i40e: Add support for Tx byte queue limits
Implement BQL (byte queue limit) support in i40e.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 22:13:08 -07:00
Alexander Duyck
b194130627 i40e: Drop dead code and flags from Tx hotpath
Drop Tx flag and TXSW which is tested but never set.

As a result of this change we can drop a complicated check that always
resulted in the final result of i40e_tx_csum being equal to the
CHECKSUM_PARTIAL value.  As such we can replace the entire function call
with just a check for skb->ip_summed == CHECKSUM_PARTIAL.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 22:06:02 -07:00
Alexander Duyck
a5e9c57264 i40e: clean up Tx fast path
Sync the fast path for i40e_tx_map and i40e_clean_tx_irq so that they
are similar to igb and ixgbe.

- Only update the Tx descriptor ring in tx_map
- Make skb mapping always on the first buffer in the chain
- Drop the use of MAPPED_AS_PAGE Tx flag
- Only store flags on the first buffer_info structure

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 21:58:04 -07:00
Alexander Duyck
fc4ac67bc9 i40e: Do not directly increment Tx next_to_use
Avoid directly incrementing next_to_use for multiple reasons.  The main
reason being that if we directly increment it then it can attain a state
where it is equal to the ring count.  Technically this is a state it
should not be able to reach but the way this is written it now can.

This patch pulls the value off into a register and then increments it
and writes back either the value or 0 depending on if the value is equal
to the ring count.
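
The increment described above can be sketched in userspace C as follows. The struct and function names are hypothetical and the real ring has many more fields.

```c
#include <stdint.h>

/* Hypothetical, reduced Tx ring state. */
struct tx_ring_sketch {
    uint16_t count;       /* number of descriptors in the ring */
    uint16_t next_to_use; /* index of the next free descriptor */
};

/*
 * Advance next_to_use by one descriptor.
 * The index is copied into a local, incremented there, and written back
 * once, so the stored value is never equal to count.
 */
static void tx_ring_advance(struct tx_ring_sketch *ring)
{
    uint16_t i = ring->next_to_use;

    i++;
    ring->next_to_use = (i < ring->count) ? i : 0;
}
```

For a ring of 4 descriptors with next_to_use at 3, one call stores 0 rather than passing through the out-of-range value 4.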

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 21:22:01 -07:00
Alexander Duyck
35a1e2ad17 i40e: Cleanup Tx buffer info layout
- drop the mapped_as_page u8 from the Tx buffer info as it was unused
- use the DMA unmap accessors for Tx DMA
- replace checks of DMA with checks of the unmap length to verify if an
  unmap is needed
- update the Tx buffer layout to make it consistent with igb, ixgbe

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 21:14:41 -07:00
Eric Dumazet
ba537427d7 tcp: use ACCESS_ONCE() in tcp_update_pacing_rate()
sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct compiler
to not do stupid things, like using the memory location
as a temporary variable.
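
A minimal userspace sketch of the idea. `ACCESS_ONCE_SKETCH` approximates the kernel macro using the GCC/Clang `__typeof__` extension; the struct, the function and its parameters are hypothetical and the rate computation itself is left out.

```c
#include <stdint.h>

/* Userspace approximation of ACCESS_ONCE(); needs GCC or Clang. */
#define ACCESS_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical, reduced socket state. */
struct sock_sketch {
    uint32_t pacing_rate; /* may be read by another context at any time */
};

/*
 * Clamp the freshly computed rate in a local variable, then publish it
 * through a volatile access, so the compiler cannot use the shared field
 * as scratch space for intermediate values.
 */
static void update_pacing_rate_sketch(struct sock_sketch *sk,
                                      uint64_t rate, uint32_t max_rate)
{
    if (rate > max_rate)
        rate = max_rate;
    ACCESS_ONCE_SKETCH(sk->pacing_rate) = (uint32_t)rate;
}
```

A concurrent reader then sees either the old or the new rate, never a half-computed intermediate.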

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
Eric Dumazet
634fb979e8 inet: includes a sock_common in request_sock
TCP listener refactoring, part 5 :

We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.

This patch includes the needed struct sock_common in front
of struct request_sock

This means there is no more inet6_request_sock IPv6 specific
structure.

Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.

loc_port   -> ir_loc_port
loc_addr   -> ir_loc_addr
rmt_addr   -> ir_rmt_addr
rmt_port   -> ir_rmt_port
iif        -> ir_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
Eric Dumazet
8a29111c7c net: gro: allow to build full sized skb
skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.

It's relatively easy to extend the skb using frag_list to allow
more frags to be appended into the last sk_buff.

This still builds very efficient skbs, and allows reaching 45 MSS per
skb.

(45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buff)
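
The figures above can be checked with a little arithmetic. The MSS of 1448 bytes is an assumption inferred from 24616 / 17 and is not stated in the message; `max_full_segments()` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed MSS in bytes, inferred from 24616 / 17. */
#define MSS_SKETCH 1448u

/* Number of whole MSS-sized segments that fit under a byte limit. */
static unsigned int max_full_segments(uint32_t limit_bytes, uint32_t mss)
{
    return mss ? limit_bytes / mss : 0;
}
```

Under that assumption, 17 segments occupy 17 * 1448 = 24616 bytes, and `max_full_segments(65536, MSS_SKETCH)` evaluates to 45, which matches the 45 MSS quoted for a full sized skb.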

High speed TCP flows benefit from this extension by lowering TCP stack
cpu usage (less packets stored in receive queue, less ACK packets
processed)

Forwarding setups could be hurt, as such skbs will need to be
linearized, although it's not a new problem, as GRO could already
provide skbs with a frag_list.

We could make the 65536 bytes threshold a tunable to mitigate this.

(First time we need to linearize skb in skb_needs_linearize(), we could
lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
baker.zhang
4c60f1d67f fib_trie: only calc for the un-first node
This is an enhancement.

For the first node in fib_trie, newpos is 0 and bit is 1.
pos only needs to be calculated for a leaf or a node with an unmatched key.

Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
Gao feng
5c70ef85a2 veth: allow to setup multicast address for veth device
We can only set up a multicast address for a network device when
net_device_ops->ndo_set_rx_mode is not null.

Some configurations need to add a multicast address to a net
device, such as the netfilter cluster match module.

Add a fake ndo_set_rx_mode function to allow this operation.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
Alexander Duyck
c304fdac6c i40e: Drop unused completed stat
The Tx "completed" stat was part of the original rewrite for detecting
Tx hangs.  However some time ago in ixgbe I determined that we could
just use the packets stat instead.  Since then this stat was
removed from ixgbe and it serves no purpose in i40e so it can be
dropped.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 21:05:10 -07:00
Anjali Singhai
6d779b41f7 i40e: Link code updates
Link events should not print to the log until the device is
administratively up.

Signed-off-by: Anjali Singhai <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-09 20:46:28 -07:00
Ajit Khaparde
b68656b22f be2net: change the driver version number to 4.9.224.0
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 15:50:51 -04:00
Ajit Khaparde
461ae37922 be2net: Display RoCE specific counters in ethtool -S
SkyHawk-R can support RoCE. Add code to display RoCE specific
counters maintained in hardware.

Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 15:50:51 -04:00
Ajit Khaparde
61000861e8 be2net: Call version 2 of GET_STATS ioctl for Skyhawk-R
Move to version 2 of the GET_STATS command, as SkyHawk-R supports
a higher number of rings.

Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 15:50:51 -04:00
Yuval Mintz
3d7d562ca4 bnx2x: Add ndo_get_phys_port_id support
Each network interface (either PF or VF) is identified by its port's MAC id.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 14:55:13 -04:00
Eric Dumazet
c2bb06db59 net: fix build errors if ipv6 is disabled
CONFIG_IPV6=n is still a valid choice ;)

It appears we can remove dead code.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 13:04:03 -04:00
Eric Dumazet
f69b923a75 udp: fix a typo in __udp4_lib_mcast_demux_lookup
At this point sk might contain garbage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 01:51:57 -04:00
Eric Dumazet
efe4208f47 ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now it is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counterpart, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 00:01:25 -04:00
Eric Dumazet
05dbc7b594 tcp/dccp: remove twchain
TCP listener refactoring, part 3 :

Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.

The current inet_ehash_bucket contains two chains, one for ESTABLISHED
(and friend states) sockets, another for TIME_WAIT sockets only.

As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.

If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.

[ INET_TW_MATCH() is no longer needed ]

I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()

This way, SYN_RECV pseudo sockets will be supported the same way.

A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].

Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()

Before patch :

dmesg | grep "TCP established"

TCP established hash table entries: 524288 (order: 11, 8388608 bytes)

After patch :

TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
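
The halving shown in the dmesg output follows from the bucket layout. A minimal sketch, assuming a 64-bit machine where each chain head is a single 8-byte pointer; all type and function names here are hypothetical.

```c
#include <stddef.h>

/* Hypothetical chain head: a single pointer. */
struct chain_head_sketch {
    void *first;
};

/* Bucket layout before the patch: two chains per bucket. */
struct bucket_before_sketch {
    struct chain_head_sketch chain;
    struct chain_head_sketch twchain;
};

/* Bucket layout after the patch: one chain per bucket. */
struct bucket_after_sketch {
    struct chain_head_sketch chain;
};

/* Total bytes used by a table of `entries` buckets. */
static size_t table_bytes(size_t entries, size_t bucket_size)
{
    return entries * bucket_size;
}
```

With 524288 entries, 16-byte buckets give 8388608 bytes and 8-byte buckets give 4194304 bytes, the two sizes reported above.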

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 23:19:24 -04:00
David S. Miller
53af53ae83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	include/linux/netdevice.h
	net/core/sock.c

Trivial merge issues.

Removal of "extern" from function declarations in netdevice.h
at the same time as "const" was added to an argument.

Two parallel line additions in net/core/sock.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 23:07:53 -04:00
David S. Miller
9684d7b0da Merge branch 'sfc-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc
Ben Hutchings says:

====================
Some more fixes for EF10 support; hopefully the last lot:

1. Fixes for reading statistics, from Edward Cree and Jon Cooper.
2. Addition of ethtool statistics for packets dropped by the hardware
before they were associated with a specific function, from Edward Cree.
3. Only bind to functions that are in control of their associated port,
as the driver currently assumes this is the case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 21:56:09 -04:00
Eric Dumazet
7eec4174ff pkt_sched: fq: fix non TCP flows pacing
Steinar reported FQ pacing was not working for UDP flows.

It looks like the initial sk->sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited)

Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.
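
A minimal userspace sketch of why an unlimited default is convenient. The function name is hypothetical, and treating a rate of 0 as "no delay" is a choice made for this sketch only, not something the message specifies.

```c
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/*
 * Delay in nanoseconds to wait after sending `len` bytes at `rate`
 * bytes per second. UINT32_MAX (~0U) means "unlimited": no pacing and
 * no division. A rate of 0 is guarded explicitly to avoid dividing by
 * zero.
 */
static uint64_t pacing_delay_ns(uint32_t len, uint32_t rate)
{
    if (rate == UINT32_MAX || rate == 0)
        return 0;
    return (uint64_t)len * NSEC_PER_SEC_SKETCH / rate;
}
```

A flow that never sets a rate keeps the ~0U default and is simply not paced, while a flow with a real rate gets a delay proportional to the packet length.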

Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 21:54:01 -04:00
David S. Miller
b343ca84b4 Revert "veth: Showing peer of veth type dev in ip link (kernel side)"
This reverts commit 612c337306.

As per Stephen Hemminger, the layout of the netlink attribute
is not implemented correctly so revert this for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 21:52:03 -04:00