Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
represented by a tcp_timewait_sock, we rate limit dupacks in response
to incoming packets (a) with TCP timestamps that fail PAWS checks, or
(b) with sequence numbers that are out of the acceptable window.
We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that in state ESTABLISHED, where the connection is represented
by a tcp_sock, we rate limit dupacks in response to incoming packets
(a) with TCP timestamps that fail PAWS checks, or (b) with sequence
numbers or ACK numbers that are out of the acceptable window.
We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.
There is already a similar (although global) rate-limiting mechanism
for "challenge ACKs". When deciding whether to send a challence ACK,
we first consult the new per-connection rate limit, and then the
global rate limit.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the SYN_RECV state, where the TCP connection is represented by
tcp_request_sock, we now rate-limit SYNACKs in response to a client's
retransmitted SYNs: we do not send a SYNACK in response to client SYN
if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
since we last sent a SYNACK in response to a client's retransmitted
SYN.
This allows the vast majority of legitimate client connections to
proceed unimpeded, even for the most aggressive platforms, iOS and
MacOS, which actually retransmit SYNs 1-second intervals for several
times in a row. They use SYN RTO timeouts following the progression:
1,1,1,1,1,2,4,8,16,32.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.
This patch includes:
- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending
The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.
We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to thus elicit a dupack in
response.
We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.
The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing analogous
knob for ICMP, icmp_ratelimit.
The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
OVS userspace already probes the openvswitch kernel module for
OVS_ACTION_ATTR_SET_MASKED support. This patch adds the kernel module
implementation of masked set actions.
The existing set action sets many fields at once. When only a subset
of the IP header fields, for example, should be modified, all the IP
fields need to be exact matched so that the other field values can be
copied to the set action. A masked set action allows modification of
an arbitrary subset of the supported header bits without requiring the
rest to be matched.
Masked set action is now supported for all writeable key types, except
for the tunnel key. The set tunnel action is an exception as any
input tunnel info is cleared before action processing starts, so there
is no tunnel info to mask.
The kernel module converts all (non-tunnel) set actions to masked set
actions. This makes action processing more uniform, and results in
less branching and duplicating the action processing code. When
returning actions to userspace, the fully masked set actions are
converted back to normal set actions. We use a kernel internal action
code to be able to tell the userspace provided and converted masked
set actions apart.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second NFC pull request for 3.20.
It brings:
- NCI NFCEE (NFC Execution Environment, typically an embedded or
external secure element) discovery and enabling/disabling support.
In order to communicate with an NFCEE, we also added NCI's logical
connections support to the NCI stack.
- HCI over NCI protocol support. Some secure elements only understand
HCI and thus we need to send them HCI frames when they're part of
an NCI chipset.
- NFC_EVT_TRANSACTION userspace API addition. Whenever an application
running on a secure element needs to notify its host counterpart,
we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
NFC netlink socket.
- Secure element and HCI transaction event support for the st21nfcb
chipset.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU06ozAAoJEIqAPN1PVmxKdtkP/2CeQ0+xJnJWgyh4xjkVxfPb
wPYrz3cHeHdnVX36mnIdRoQZRjxh/Ef/vgK8RsXohcldzHA9D8GjxRpj0GrkVGQ9
xoDkiHsFDdsZHeezSrlnXNnObvZrkeMsmnUDJxfwLtcX1nzFKzBgW2GaDiNjUazC
r5Ip4YIfoCaVLq5MSqYRqJ6uhplXd/ePdtPoo54L/E81y4ptQi8pUsQBy4K3GrFn
FbXoemcymhi9AVBGl+OlppZ93t6bdOV1D5tcOwpytAbfa3jp2oUnJPZay7EoWruR
aj8l5v3P+SSphAJwaGSbny0L83tTS7sq+N4QG+VxxdMkw2YjpmxgN3AGguNBtPGG
SUThpZV6QhrkAOnwzP2BJnh68WSU1eJrGvk8zUWW0SanA8k2jaHO/VzXCA3/ZkC4
IFcXl4Jr6R3iUxDFgS/6MB5E9EGedzFTacsWoHVyZPtaDAQp4WVdAIRQ4QKgJCLw
1yRmCeVunTsrs6O0aenU1xHinPlnx+tkuiNh4H64APQYTz54WwXH1/S04OL1SkqI
mTzUvFgWqJvSIOuU19g0HBw+xZ/9vvX8LLkm9kSTbe7tcmFH5CI7ZCN/hNDHvMSd
sjNYOyag/SAHJ01tTKcSPAgWO49bHWPodI04zP0lK0gMLjKvRknVMdTgIDfu49HN
2da9E23iBmNNgG79iVHN
=8caD
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-3.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
NFC: 3.20 second pull request
This is the second NFC pull request for 3.20.
It brings:
- NCI NFCEE (NFC Execution Environment, typically an embedded or
external secure element) discovery and enabling/disabling support.
In order to communicate with an NFCEE, we also added NCI's logical
connections support to the NCI stack.
- HCI over NCI protocol support. Some secure elements only understand
HCI and thus we need to send them HCI frames when they're part of
an NCI chipset.
- NFC_EVT_TRANSACTION userspace API addition. Whenever an application
running on a secure element needs to notify its host counterpart,
we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
NFC netlink socket.
- Secure element and HCI transaction event support for the st21nfcb
chipset.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Stretch ACKs can kill performance with Reno and CUBIC congestion
control, largely due to LRO and GRO. Fix from Neal Cardwell.
2) Fix userland breakage because we accidently emit zero length netlink
messages from the bridging code. From Roopa Prabhu.
3) Carry handling in generic csum_tcpudp_nofold is broken, fix from
Karl Beldan.
4) Remove bogus dev_set_net() calls from CAIF driver, from Nicolas
Dichtel.
5) Make sure PPP deflation never returns a length greater then the
output buffer, otherwise we overflow and trigger skb_over_panic().
Fix from Florian Westphal.
6) COSA driver needs VIRT_TO_BUS Kconfig dependencies, from Arnd
Bergmann.
7) Don't increase route cached MTU on datagram too big ICMPs. From Li
Wei.
8) Fix error path leaks in nf_tables, from Pablo Neira Ayuso.
9) Fix bitmask handling regression in netlink that broke things like
acpi userland tools. From Pablo Neira Ayuso.
10) Wrong header pointer passed to param_type2af() in SCTP code, from
Saran Maruti Ramanara.
11) Stacked vlans not handled correctly by vlan_get_protocol(), from
Toshiaki Makita.
12) Add missing DMA memory barrier to xgene driver, from Iyappan
Subramanian.
13) Fix crash in rate estimators, from Eric Dumazet.
14) We've been adding various workarounds, one after another, for the
change which added the per-net tcp_sock. It was meant to reduce
socket contention but added lots of problems.
Reduce this instead to a proper per-cpu socket and that rids us of
all the daemons.
From Eric Dumazet.
15) Fix memory corruption and OOPS in mlx4 driver, from Jack
Morgenstein.
16) When we disabled UFO in the virtio_net device, it introduces some
serious performance regressions. The orignal problem was IPV6
fragment ID generation, so fix that properly instead. From Vlad
Yasevich.
17) sr9700 driver build breaks on xtensa because it defines macros with
the same name as those used by the arch code. Use more unique
names. From Chen Gang.
18) Fix endianness in new virio 1.0 mode of the vhost net driver, from
Michael S Tsirkin.
19) Several sysctls were setting the maxlen attribute incorrectly, from
Sasha Levin.
20) Don't accept an FQ scheduler quantum of zero, that leads to crashes.
From Kenneth Klette Jonassen.
21) Fix dumping of non-existing actions in the packet scheduler
classifier. From Ignacy Gawędzki.
22) Return the write work_done value when doing TX work in the qlcnic
driver.
23) ip6gre_err accesses the info field with the wrong endianness, from
Sabrina Dubroca.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits)
sit: fix some __be16/u16 mismatches
ipv6: fix sparse errors in ip6_make_flowlabel()
net: remove some sparse warnings
flow_keys: n_proto type should be __be16
ip6_gre: fix endianness errors in ip6gre_err
qlcnic: Fix NAPI poll routine for Tx completion
amd-xgbe: Set RSS enablement based on hardware features
amd-xgbe: Adjust for zero-based traffic class count
cls_api.c: Fix dumping of non-existing actions' stats.
pkt_sched: fq: avoid hang when quantum 0
net: rds: use correct size for max unacked packets and bytes
vhost/net: fix up num_buffers endian-ness
gianfar: correct the bad expression while writing bit-pattern
net: usb: sr9700: Use 'SR_' prefix for the common register macros
Revert "drivers/net: Disable UFO through virtio"
Revert "drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets"
ipv6: Select fragment id during UFO segmentation if not set.
xen-netback: stop the guest rx thread after a fatal error
net/mlx4_core: Fix kernel Oops (mem corruption) when working with more than 80 VFs
isdn: off by one in connect_res()
...
include/net/ipv6.h:713:22: warning: incorrect type in assignment (different base types)
include/net/ipv6.h:713:22: expected restricted __be32 [usertype] hash
include/net/ipv6.h:713:22: got unsigned int
include/net/ipv6.h:719:25: warning: restricted __be32 degrades to integer
include/net/ipv6.h:719:22: warning: invalid assignment: ^=
include/net/ipv6.h:719:22: left side has type restricted __be32
include/net/ipv6.h:719:22: right side has type unsigned int
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(struct flow_keys)->n_proto is in network order, use
proper type for this.
Fixes following sparse errors :
net/core/flow_dissector.c:139:39: warning: incorrect type in assignment (different base types)
net/core/flow_dissector.c:139:39: expected unsigned short [unsigned] [usertype] n_proto
net/core/flow_dissector.c:139:39: got restricted __be16 [assigned] [usertype] proto
net/core/flow_dissector.c:237:23: warning: incorrect type in assignment (different base types)
net/core/flow_dissector.c:237:23: expected unsigned short [unsigned] [usertype] n_proto
net/core/flow_dissector.c:237:23: got restricted __be16 [assigned] [usertype] proto
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: e0f31d8498 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
Signed-off-by: David S. Miller <davem@davemloft.net>
FQ has a fast path for skb attached to a socket, as it does not
have to compute a flow hash. But for other packets, FQ being non
stochastic means that hosts exposed to random Internet traffic
can allocate million of flows structure (104 bytes each) pretty
easily. Not only host can OOM, but lookup in RB trees can take
too much cpu and memory resources.
This patch adds a new attribute, orphan_mask, that is adding
possibility of having a stochastic hash for orphaned skb.
Its default value is 1024 slots, to mimic SFQ behavior.
Note: This does not apply to locally generated TCP traffic,
and no locally generated traffic will share a flow structure
with another perfect or stochastic flow.
This patch also handles the specific case of SYNACK messages:
They are attached to the listener socket, and therefore all map
to a single hash bucket. If listener have set SO_MAX_PACING_RATE,
hoping to have new accepted socket inherit this rate, SYNACK
might be paced and even dropped.
This is very similar to an internal patch Google have used more
than one year.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we added pacing to TCP, we decided to let sch_fq take care
of actual pacing.
All TCP had to do was to compute sk->pacing_rate using simple formula:
sk->pacing_rate = 2 * cwnd * mss / rtt
It works well for senders (bulk flows), but not very well for receivers
or even RPC :
cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
can end up pacing ACK packets, slowing down the sender.
Really, only the sender should pace, according to its own logic.
Instead of adding a new bit in skb, or call yet another flow
dissection, we tweak skb->truesize to a small value (2), and
we instruct sch_fq to use new helper and not pace pure ack.
Note this also helps TCP small queue, as ack packets present
in qdisc/NIC do not prevent sending a data packet (RPC workload)
This helps to reduce tx completion overhead, ack packets can use regular
sock_wfree() instead of tcp_wfree() which is a bit more expensive.
This has no impact in the case packets are sent to loopback interface,
as we do not coalesce ack packets (were we would detect skb->truesize
lie)
In case netem (with a delay) is used, skb_orphan_partial() also sets
skb->truesize to 1.
This patch is a combination of two patches we used for about one year at
Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some existing rhashtable users get too intimate with it by walking
the buckets directly. This prevents us from easily changing the
internals of rhashtable.
This patch adds the helpers rhashtable_walk_init/exit/start/next/stop
which will replace these custom walkers.
They are meant to be usable for both procfs seq_file walks as well
as walking by a netlink dump. The iterator structure should fit
inside a netlink dump cb structure, with at least one element to
spare.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Supply interface functions to bond and unbond ports of a mlx4 internal
interfaces. Example for such an interface is the one registered by the
mlx4 IB driver under RoCE.
There are
1. Functions to go in/out to/from bonded mode
2. Function to remap virtual ports to physical ports
The bond_mutex prevents simultaneous access to data that keep status of
the device in bonded mode.
The upper mlx4 interface marks to the mlx4 core module that they
want to be subject for such bonding by setting the MLX4_INTFF_BONDING
flag. Interface which goes to/from bonded mode is re-created.
The mlx4 Ethernet driver does not set this flag when registering the
interface, the IB driver does.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the hardware interface required for port aggregation.
1. Disable RX port check on receive - don't perform a validity check
that matches to QP's port and the port where the packet is received.
2. Virtual to physical port remap - configure virtual to physical port
mapping. Port remap capability for virtual functions.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use notifier chain to dispatch an event upon a change in slave state.
Event is dispatched with slave specific info.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move slave state changes to a helper function, this is a pre-step for adding
functionality of dispatching an event when this helper is called.
This commit doesn't add new functionality.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add event which provides an indication on a change in the state
of a bonding slave. The event handler should cast the pointer to the
appropriate type (struct netdev_bonding_info) in order to get the
full info about the slave.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJU0NEaAAoJEDBSmw7B7bqr2DAP/3nbk6WB2Lz4Vwi8fh9C4y6X
4ZzN2v7NHaimC+Wpxg3wP2wCdX2VG3ZWwF3yLm6qgVHGnE35RFLxURen1eqNeW77
Yf5gvZ066nCGEQ8l8J6YK9vrLX4qp5c4lyE00bbxpZTA4Qq71SgTg+rmGC1be8uX
vacfaLwfDWffuOOnjBAPfanj7f4AQaUEY2uN1WkBFC7iEeOtPcWVkHAFVIeGjJfQ
vfgQJcwOjgWjYbwZdQQi7Aj+k1Yzda4pg1yEWn3CkZ8zyeCYGs1gk2ovoBgPgXBP
yT9ypIdFsN242VJvy7nkFnCKA8mhKyltMQ1Xjs0Q9lAxWdaq9U+iqt5cUn4jxVIG
T9Vi3PCbx/nVOqcfR81dBTQ3uDU5AyosPsmh2YTxi5lpRBrsjNY2FtaKE0sm2Om3
wRiSPOdPrXBeEnU0KssI9e6euXgS4JQV78Naq85OWZDd2yZ1fT5U2fi8y4drRGlz
rucbbobdVhQch5L4FStPz1uW5pNuJrhekXeZIE8MruTNg2A2oBAK3ApO7hxn68sE
RnbAnkxVLwgedC9042JF5eiS1PDIU46w4e782j/+XskVKMEakqd23iJycx3tmgHZ
cxDi/qKZ2RCE74YsT61o/9ErSVPvCfNPL3+CVov918jidQQg2WHfKm/jFlIxHIxk
4wBP7p2VGlENMgw/R8GJ
=wOaR
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Last round of updates for net-next:
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-02-03
Here's what's likely the last bluetooth-next pull request for 3.20.
Notable changes include:
- xHCI workaround + a new id for the ath3k driver
- Several new ids for the btusb driver
- Support for new Intel Bluetooth controllers
- Minor cleanups to ieee802154 code
- Nested sleep warning fix in socket accept() code path
- Fixes for Out of Band pairing handling
- Support for LE scan restarting for HCI_QUIRK_STRICT_DUPLICATE_FILTER
- Improvements to data we expose through debugfs
- Proper handling of Hardware Error HCI events
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds skb_remcsum_process and skb_gro_remcsum_process to
perform the appropriate adjustments to the skb when receiving
remote checksum offload.
Updated vxlan and gue to use these functions.
Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
not see any change in performance.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A typical qdisc setup is the following :
bond0 : bonding device, using HTB hierarchy
eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc
XPS allows to spread packets on specific tx queues, based on the cpu
doing the send.
Problem is that dequeues from bond0 qdisc can happen on random cpus,
due to the fact that qdisc_run() can dequeue a batch of packets.
CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
CPUB -> dequeue packet P1 from bond0
enqueue packet on eth1/eth2
CPUC -> dequeue packet P2 from bond0
enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
get_xps_queue() then might select wrong queue for P1, since current cpu
might be different than CPUA.
P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)
Effect of this bug is TCP reorders, and more generally not optimal
TX queue placement. (A victim bulk flow can be migrated to the wrong TX
queue for a while)
To fix this, we have to record sender cpu number the first time
dev_queue_xmit() is called for one tx skb.
We can union napi_id (used on receive path) and sender_cpu,
granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
this union idea)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
conn_info is currently allocated only after nfcee_discovery_ntf
which is not generic enough for logical connection other than
NFCEE. The corresponding conn_info is now created in
nci_core_conn_create_rsp().
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
For consistency sake change nci_core_conn_create_rsp structure
credits field to credits_cnt.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The current implementation limits nci_core_conn_create_req()
to only manage NCI_DESTINATION_NFCEE.
Add new parameters to nci_core_conn_create() to support all
destination types described in the NCI specification.
Because there are some parameters with variable size dynamic
buffer allocation is needed.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The NCI_STATIC_RF_CONN_ID logical connection is the most used
connection. Keeping it directly accessible in the nci_dev
structure will simplify and optimize the access.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
If the IPv6 fragment id has not been set and we perform
fragmentation due to UFO, select a new fragment id.
We now consider a fragment id of 0 as unset and if id selection
process returns 0 (after all the pertrubations), we set it to
0x80000000, thus giving us ample space not to create collisions
with the next packet we may have to fragment.
When doing UFO integrity checking, we also select the
fragment id if it has not be set yet. This is stored into
the skb_shinfo() thus allowing UFO to function correclty.
This patch also removes duplicate fragment id generation code
and moves ipv6_select_ident() into the header as it may be
used during GSO.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
it has just verified that it asks no more than the length of the
first segment of iovec.
And with that the last user of stuff in lib/iovec.c is gone.
RIP.
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With that, all ->sendmsg() instances are converted to iov_iter primitives
and are agnostic wrt the kind of iov_iter they are working with.
So's the last remaining ->recvmsg() instance that wasn't kind-agnostic yet.
All ->sendmsg() and ->recvmsg() advance ->msg_iter by the amount actually
copied and none of them modifies the underlying iovec, etc.
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
That takes care of the majority of ->sendmsg() instances - most of them
via memcpy_to_msg() or assorted getfrag() callbacks. One place where we
still keep memcpy_fromiovecend() is tipc - there we potentially read the
same data over and over; separate patch, that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
patch is actually smaller than it seems to be - most of it is unindenting
the inner loop body in tcp_sendmsg() itself...
the bit in tcp_input.c is going to get reverted very soon - that's what
memcpy_from_msg() will become, but not in this commit; let's keep it
reasonably contained...
There's one potentially subtle change here: in case of short copy from
userland, mainline tcp_send_syn_data() discards the skb it has allocated
and falls back to normal path, where we'll send as much as possible after
rereading the same data again. This patch trims SYN+data skb instead -
that way we don't need to copy from the same place twice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert skb_add_data() to iov_iter; allows to get rid of the explicit
messing with iovec in its only caller - skb_add_data() will keep advancing
->msg_iter for us, so there's no need to similate that manually.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit de966c5928 (net/mlx4_core: Support more than 64 VFs) was meant to
allow up to 126 VFs. However, due to leaving MLX4_MFUNC_MAX too low, using
more than 80 VFs resulted in memory corruptions (and Oopses) when more than
80 VFs were requested. In addition, the number of slaves was left too high.
This commit fixes these issues.
Fixes: de966c5928 ("net/mlx4_core: Support more than 64 VFs")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Validate hooks for nf_tables NAT expressions, otherwise users can
crash the kernel when using them from the wrong hook. We already
got one user trapped on this when configuring masquerading.
2) Fix a BUG splat in nf_tables with CONFIG_DEBUG_PREEMPT=y. Reported
by Andreas Schultz.
3) Avoid unnecessary reroute of traffic in the local input path
in IPVS that triggers a crash in in xfrm. Reported by Florian
Wiessner and fixes by Julian Anastasov.
4) Fix memory and module refcount leak from the error path of
nf_tables_newchain().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is very similar to
commit 1c32c5ad6f
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue Mar 1 02:36:47 2011 +0000
inet: Add ip_make_skb and ip_finish_skb
It adds IPv6 version of the helpers ip6_make_skb and ip6_finish_skb.
The job of ip6_make_skb is to collect messages into an ipv6 packet
and poplulate ipv6 eader. The job of ip6_finish_skb is to transmit
the generated skb. Together they replicated the job of
ip6_push_pending_frames() while also provide the capability to be
called independently. This will be needed to add lockless UDP sendmsg
support.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull IPv6 cork initialization into its own function that
can be re-used. IPv6 specific cork data did not have an
explicit data structure. This patch creats eone so that
just ipv6 cork data can be as arguemts. Also, since
IPv6 tries to save the flow label into inet_cork_full
tructure, pass the full cork.
Adjust ip6_cork_release() to take cork data structures.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.
Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.
The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.
No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes
(v1 -> v2)
- test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
(rfc -> v1)
- document the sysctl in Documentation/sysctl/net.txt
- fix access control race: read .._OPT_TSONLY only once,
use same value for permission check and skb generation.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.
Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow on patch that allows administrators to
only allow tx timestamping if it does not loop payload or metadata.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes (rfc -> v1)
- add documentation
- remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: David S. Miller <davem@davemloft.net>
The NFCC sends an NCI_OP_RF_NFCEE_ACTION_NTF notification
to the host (DH) to let it know that for example an RF
transaction with a payment reader is done.
For now the notification handler is empty.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NFC_EVT_TRANSACTION is sent through netlink in order for a
specific application running on a secure element to notify
userspace of an event. Typically the secure element application
counterpart on the host could interpret that event and act
upon it.
Forwarded information contains:
- SE host generating the event
- Application IDentifier doing the operation
- Applications parameters
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
According to the NCI specification, one can use HCI over NCI
to talk with specific NFCEE. The HCI network is viewed as one
logical NFCEE.
This is needed to support secure element running HCI only
firmwares embedded on an NCI capable chipset, like e.g. the
st21nfcb.
There is some duplication between this piece of code and the
HCI core code, but the latter would need to be abstracted even
more to be able to use NCI as a logical transport for HCP packets.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to communicate with an NFCEE, we need to open a logical
connection to it, by sending the NCI_OP_CORE_CONN_CREATE_CMD
command to the NFCC. It's left up to the drivers to decide when
to close an already opened logical connection.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NFCEEs can be enabled or disabled by sending the
NCI_OP_NFCEE_MODE_SET_CMD command to the NFCC. This patch
provides an API for drivers to enable and disable e.g. their
NCI discoveredd secure elements.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NFCEEs (NFC Execution Environment) have to be explicitly
discovered by sending the NCI_OP_NFCEE_DISCOVER_CMD
command. The NFCC will respond to this command by telling
us how many NFCEEs are connected to it. Then the NFCC sends
a notification command for each and every NFCEE connected.
Here we implement support for sending
NCI_OP_NFCEE_DISCOVER_CMD command, receiving the response
and the potential notifications.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>