Conflicts:
net/netfilter/core.c
net/netfilter/nf_tables_netdev.c
Resolve two conflicts before pull request for David's net-next tree:
1) Between c73c248490 ("netfilter: nf_tables_netdev: remove redundant
ip_hdr assignment") from the net tree and commit ddc8b6027a
("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").
2) Between e8bffe0cf9 ("net: Add _nf_(un)register_hooks symbols") and
Aaron Conole's patches to replace list_head with single linked list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_log is used by both nftables and iptables, so use XT_LOG_XXX macros
here is not appropriate. Replace them with NF_LOG_XXX.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NFTA_LOG_FLAGS attribute is already supported, but the related
NF_LOG_XXX flags are not exposed to the userspace. So we cannot
explicitly enable log flags to log uid, tcp sequence, ip options
and so on, i.e. such rule "nft add rule filter output log uid"
is not supported yet.
So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In
order to keep consistent with other modules, change NF_LOG_MASK to
refer to all supported log flags. On the other hand, add a new
NF_LOG_DEFAULT_MASK to refer to the original default log flags.
Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP
and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the
userspace.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Inverse ranges != [a,b] are not currently possible because rules are
composites of && operations, and we need to express this:
data < a || data > b
This patch adds a new range expression. Positive ranges can be already
through two cmp expressions:
cmp(sreg, data, >=)
cmp(sreg, data, <=)
This new range expression provides an alternative way to express this.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The introduction of TCP_NEW_SYN_RECV state, and the addition of request
sockets to the ehash table seems to have broken the --transparent option
of the socket match for IPv6 (around commit a9407000).
Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
listener, the --transparent option tries to match on the no_srccheck flag
of the request socket.
Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
by copying the transparent flag of the listener socket. This effectively
causes '-m socket --transparent' not match on the ACK packet sent by the
client in a TCP handshake.
Based on the suggestion from Eric Dumazet, this change moves the code
initializing no_srccheck to tcp_conn_request(), rendering the above
scenario working again.
Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
Signed-off-by: Alex Badics <alex.badics@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fabian reports a possible conntrack memory leak (could not reproduce so
far), however, one minor issue can be easily resolved:
> cat /proc/net/nf_conntrack | wc -l = 5
> 4 minutes required to clean up the table.
We should not report those timed-out entries to the user in first place.
And instead of just skipping those timed-out entries while iterating over
the table we can also zap them (we already do this during ctnetlink
walks, but I forgot about the /proc interface).
Fixes: f330a7fdbe ("netfilter: conntrack: get rid of conntrack timer")
Reported-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Create a new revision for the hashlimit iptables extension module. Rev 2
will support higher pps of upto 1 million, Version 1 supports only 10k.
To support this we have to increase the size of the variables avg and
burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
and xt_hashlimit_mtinfo2 and also create newer versions of all the
functions for match, checkentry and destroy.
Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
similar in both rev1 and rev2 with only minor changes, so I have split
those functions and moved all the common code to a *_common function.
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I am planning to add a revision 2 for the hashlimit xtables module to
support higher packets per second rates. This patch renames all the
functions and variables related to revision 1 by adding _v1 at the
end of the names.
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is
specified, report EINVAL to the userspace. This validation check was
already done at nft_ct_get_init, but we missed it in nft_ct_set_init.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, if the user want to match ct l3proto, we must specify the
direction, for example:
# nft add rule filter input ct original l3proto ipv4
^^^^^^^^
Otherwise, error message will be reported:
# nft add rule filter input ct l3proto ipv4
nft add rule filter input ct l3proto ipv4
<cmdline>:1:1-38: Error: Could not process rule: Invalid argument
add rule filter input ct l3proto ipv4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Actually, there's no need to require NFTA_CT_DIRECTION attr, because
ct l3proto and protocol are unrelated to direction.
And for compatibility, even if the user specify the NFTA_CT_DIRECTION
attr, do not report error, just skip it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It is valid that the TCP RST packet which does not set ack flag, and bytes
of ack number are zero. But current seqadj codes would adjust the "0" ack
to invalid ack number. Actually seqadj need to check the ack flag before
adjust it for these RST packets.
The following is my test case
client is 10.26.98.245, and add one iptable rule:
iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
--connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
tcp-reset
This iptables rule could generate on TCP RST without ack flag.
server:10.172.135.55
Enable the synproxy with seqadjust by the following iptables rules
iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
-m tcp --syn -j CT --notrack
iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
--mss 1460
iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT
The following is my test result.
1. packet trace on client
root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
nop,wscale 7], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
options [nop,nop,TS val 452367885 ecr 15643479], length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
options [nop,nop,TS val 15643479 ecr 452367885], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
win 0, length 0
2. seqadj log on server
[62873.867319] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.867644] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.869040] Adjusting sequence number from 3695959830->3695959830,
ack from 0->55618628
To summarize, it is clear that the seqadj codes adjust the 0 ack when receive
one TCP RST packet without ack.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.
In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAV+cFjvSw1s6N8H32AQIRyBAAkedosPtgW2JBsCEgvYcfQuSg36gBc0Km
ozAcbzbWA50P+WSdIq+GhcnQLo3OWRsaGKp6vIrgDtJjgKZXQJyOwzlMm6xFf0A+
Z7NtnWHuSZ90xcxVHCqXRqdNGqwmve8VTNWUbQmZpKIc6d6wNiO3rsg/WFTshroz
FSMf67PsOf5i1hCR8x+3dPlSdkIjH/hWh3wyEmmnPjDPq49twnKkaklLylf9V9NO
DGuW+/+wxkht3ylPRxGD7IQ1t0eA9iyU59Zk7SizuEmzrUNQrqQ3RM8ZsBe/cAA8
VEr+RVOvVoxWUYnZbYEvwfwuKQIWhX5ptt6VLQH7sLJlR8AiMIV6/K47ZqK7ILoI
xBlkQAxI5K4SdFkOux7JPZ31yrIQMTl5xAUkbZxQOFsIPh237mtvGatUIKYcKyi6
uURrPtm13C1uqBZ+sfiUcjq82YdnLkliTUvUncarjZjUAddncQaSOwxcXZGCxThE
24JAzBAJfWLn0760SzNDZytcWSU4tpLVI3X+FpujtaF447DfjzJ/NdyBXWXbmOyY
ubX3pk0UOHVj5kMr7WXK5c+txOI2qIH4aV8cYZpVsFAXp63WjqgfRvRbNhICUA86
3v2M/qLgAEFR3rYh9i+mm0D74aug0+NwST/QDtf5dYFiH53+BteEpQqd/kcYxt8m
6bWwKxzUQG0=
=PTOG
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160924' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Implement slow-start and other bits
This set of patches implements the RxRPC slow-start feature for AF_RXRPC to
improve performance and handling of occasional packet loss. This is more or
less the same as TCP slow start [RFC 5681]. Firstly, there are some ACK
generation improvements:
(1) Send ACKs regularly to apprise the peer of our state so that they can do
congestion management of their own.
(2) Send an ACK when we fill in a hole in the buffer so that the peer can
find out that we did this thus forestalling retransmission.
(3) Note the final DATA packet's serial number in the final ACK for
correlation purposes.
and a couple of bug fixes:
(4) Reinitialise the ACK state and clear the ACK and resend timers upon
entering the client reply reception phase to kill off any pending probe
ACKs.
(5) Delay the resend timer to allow for nsec->jiffies conversion errors.
and then there's the slow-start pieces:
(6) Summarise an ACK.
(7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to
try and find out what happened to it.
(8) Implement the slow start feature.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate a sparse endianness mismatch warning, use nla_get_be32() to
extract a __be32 value instead of nla_get_u32().
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement RxRPC slow-start, which is similar to RFC 5681 for TCP. A
tracepoint is added to log the state of the congestion management algorithm
and the decisions it makes.
Notes:
(1) Since we send fixed-size DATA packets (apart from the final packet in
each phase), counters and calculations are in terms of packets rather
than bytes.
(2) The ACK packet carries the equivalent of TCP SACK.
(3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
suited to SACK of a small number of packets. It seems that, almost
inevitably, by the time three 'duplicate' ACKs have been seen, we have
narrowed the loss down to one or two missing packets, and the
FLIGHT_SIZE calculation ends up as 2.
(4) In rxrpc_resend(), if there was no data that apparently needed
retransmission, we transmit a PING ACK to ask the peer to tell us what
its Rx window state is.
Signed-off-by: David Howells <dhowells@redhat.com>
If we've sent all the request data in a client call but haven't seen any
sign of the reply data yet, schedule an ACK to be sent to the server to
find out if the reply data got lost.
If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
demand a response to find out whether we need to retransmit.
If the server says it has received all of the data, we send an IDLE ACK to
tell the server that we haven't received anything in the receive phase as
yet.
To make this work, a non-immediate PING ACK must carry a delay. I've chosen
the same as the IDLE ACK for the moment.
Signed-off-by: David Howells <dhowells@redhat.com>
Generate a summary of the Tx buffer packet state when an ACK is received
for use in a later patch that does congestion management.
Signed-off-by: David Howells <dhowells@redhat.com>
When determining the resend timer value, we have a value in nsec but the
timer is in jiffies which may be a million or more times more coarse.
nsecs_to_jiffies() rounds down - which means that the resend timeout
expressed as jiffies is very likely earlier than the one expressed as
nanoseconds from which it was derived.
The problem is that rxrpc_resend() gets triggered by the timer, but can't
then find anything to resend yet. It sets the timer again - but gets
kicked off immediately again and again until the nanosecond-based expiry
time is reached and we actually retransmit.
Fix this by adding 1 to the jiffies-based resend_at value to counteract the
rounding and make sure that the timer happens after the nanosecond-based
expiry is passed.
Alternatives would be to adjust the timestamp on the packets to align
with the jiffie scale or to switch back to using jiffie-timestamps.
Signed-off-by: David Howells <dhowells@redhat.com>
Clear the ACK reason, ACK timer and resend timer when entering the client
reply phase when the first DATA packet is received. New ACKs will be
proposed once the data is queued.
The resend timer is no longer relevant and we need to cancel ACKs scheduled
to probe for a lost reply.
Signed-off-by: David Howells <dhowells@redhat.com>
Send an immediate ACK if we fill in a hole in the buffer left by an
out-of-sequence packet. This may allow the congestion management in the peer
to avoid a retransmission if packets got reordered on the wire.
Signed-off-by: David Howells <dhowells@redhat.com>
This commit adds an upfront check for sane values to be passed when
registering a netfilter hook. This will be used in a future patch for a
simplified hook list traversal.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call. This is just a cleanup, as the locking
code gracefully handles this situation.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This commit ensures that the rcu read-side lock is held while the
ingress hook is called. This ensures that a call to nf_hook_slow (and
ultimately nf_ingress) will be read protected.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.
The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.
The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.
It's used only in the recursion cases of br_netfilter. It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The origin codes perform two condition checks with dst_mtu(skb_dst(skb))
and in_mtu. And the last statement is "min(dst_mtu(skb_dst(skb)),
in_mtu) - minlen". It may let reader think about how about the result.
Would it be negative.
Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new
variable, then only perform one condition check, and it is more readable.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Send an ACK if we haven't sent one for the last two packets we've received.
This keeps the other end apprised of where we've got to - which is
important if they're doing slow-start.
We do this in recvmsg so that we can dispatch a packet directly without the
need to wake up the background thread.
This should possibly be made configurable in future.
Signed-off-by: David Howells <dhowells@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAV+VA9/Sw1s6N8H32AQJufQ//bP1Aq6Ej+3km0o+S1w2lEl7kc1Gnown+
PLKecg73rtMLEXDCJPKrh/OjGBX1GFaPC5HXWF/4TRav3/qjsi2/P7MSIYbt8ZAk
0kVcdpIU1RWrPFt0tdhxTDSgPMhRWaKJTFV0CwQgnCqeyrVX44syx42B941Wsgcs
N/6ANRr1qFLTwT1LKD0EhRWxnds1YI9WPgF5huaAn5RHpymCynwxsxyTDdU+NL0j
YE75o5rjUzAZzMk5xrTjO8yd7xjUfjO7xT/CzjjVfXB4TOBmYExfOrLNZX0vOPPr
LBOj3LTLwklWtx5/RJGe3A4y1+yfiUYGTu90ArARHUxP6AMwnxztvjuy2MQvBOgP
xTlHcEOPoHrxxfIGdJH/0NRl1zgn6IF2lFOH3EgRcw8hzMdFp46AXV74ckst1vOQ
LsqTbhndG2vkVFV8uZCZO7om0HRbpxbvy6+hEakdZ6k9kJA3amezS6Q8wiXxGhAE
0wBns3XI/75Zd7QyXRkcHl3iUqS5uW0dKmAqitIlgf3CUsr4cZFlHO8IDD3rkTFD
WbnvFhaL9Ee5IQjbsHgyzqPSI7sUJUYG1xBAz8wIWnjD0bGu5b/QzgdBSCVIFJ3e
bNvpe1B6HEHj0EyIj1wK2H8311Y4lPNeJlOj7BDUqqRIZW8vviG26MTGML6GnJwU
L35wa/s4ebo=
=+8HJ
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160923' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Bug fixes and tracepoints
Here are a bunch of bug fixes:
(1) Need to set the timestamp on a Tx packet before queueing it to avoid
trouble with the retransmission function.
(2) Don't send an ACK at the end of the service reply transmission; it's
the responsibility of the client to send an ACK to close the call.
The service can resend the last DATA packet or send a PING ACK.
(3) Wake sendmsg() on abnormal call termination.
(4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.
(5) Use before_eq() & co. to compare serial numbers (which may wrap).
(6) Start the resend timer on DATA packet transmission.
(7) Don't accidentally cancel a retransmission upon receiving a NACK.
(8) Fix the call timer setting function to deal with timeouts that are now
or past.
(9) Don't use a flag to communicate the presence of the last packet in the
Tx buffer from sendmsg to the input routines where ACK and DATA
reception is handled. The problem is that there's a window between
queueing the last packet for transmission and setting the flag in
which ACKs or reply DATA packets can arrive, causing apparent state
machine violation issues.
Instead use the annotation buffer to mark the last packet and pick up
and set the flag in the input routines.
(10) Don't call the tx_ack tracepoint and don't allocate a serial number if
someone else nicked the ACK we were about to transmit.
There are also new tracepoints and one altered tracepoint used to track
down the above bugs:
(11) Call timer tracepoint.
(12) Data Tx tracepoint (and adjustments to ACK tracepoint).
(13) Injected Rx packet loss tracepoint.
(14) Ack proposal tracepoint.
(15) Retransmission selection tracepoint.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2016-09-23
Only two patches this time:
1) Fix a comment reference to struct xfrm_replay_state_esn.
From Richard Guy Briggs.
2) Convert xfrm_state_lookup to rcu, we don't need the
xfrm_state_lock anymore in the input path.
From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
the ability for user-space application to specify it for the VF, as an
option to support 802.1ad.
We adjusted IP Link tool to support this option.
For future use cases, the new UAPI supports multiple vlans. For now we
limit the list size to a single vlan in kernel.
Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
compatibility with older versions of IP Link tool.
Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
We kept 802.1Q as the drivers' default vlan protocol.
Suitable ip link tool command examples:
Set vf vlan protocol 802.1ad:
ip link set eth0 vf 1 vlan 100 proto 802.1ad
Set vf to VST (802.1Q) mode:
ip link set eth0 vf 1 vlan 100 proto 802.1Q
Or by omitting the new parameter
ip link set eth0 vf 1 vlan 100
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a tracepoint to log in rxrpc_resend() which packets will be
retransmitted. Note that if a positive ACK comes in whilst we have dropped
the lock to retransmit another packet, the actual retransmission may not
happen, though some of the effects will (such as altering the congestion
management).
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint to log proposed ACKs, including whether the proposal is
used to update a pending ACK or is discarded in favour of an easlier,
higher priority ACK.
Whilst we're at it, get rid of the rxrpc_acks() function and access the
name array directly. We do, however, need to validate the ACK reason
number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
array.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint to log transmission of DATA packets (including loss
injection).
Adjust the ACK transmission tracepoint to include the packet serial number
and to line this up with the DATA transmission display.
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_send_call_packet() is invoking the tx_ack tracepoint before it checks
whether there's an ACK to transmit (another thread may jump in and transmit
it).
Fix this by only invoking the tracepoint if we get a valid ACK to transmit.
Further, only allocate a serial number if we're going to actually transmit
something.
Signed-off-by: David Howells <dhowells@redhat.com>
When the last packet of data to be transmitted on a call is queued, tx_top
is set and then the RXRPC_CALL_TX_LAST flag is set. Unfortunately, this
leaves a race in the ACK processing side of things because the flag affects
the interpretation of tx_top and also allows us to start receiving reply
data before we've finished transmitting.
To fix this, make the following changes:
(1) rxrpc_queue_packet() now sets a marker in the annotation buffer
instead of setting the RXRPC_CALL_TX_LAST flag.
(2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
same context as the routines that use it.
(3) rxrpc_end_tx_phase() is simplified to just shift the call state.
The Tx window must have been rotated before calling to discard the
last packet.
(4) rxrpc_receiving_reply() is added to handle the arrival of the first
DATA packet of a reply to a client call (which is an implicit ACK of
the Tx phase).
(5) The last part of rxrpc_input_ack() is reordered to perform Tx
rotation, then soft-ACK application and then to end the phase if we've
rotated the last packet. In the event of a terminal ACK, the soft-ACK
application will be skipped as nAcks should be 0.
(6) rxrpc_input_ackall() now has to rotate as well as ending the phase.
In addition:
(7) Alter the transmit tracepoint to log the rotation of the last packet.
(8) Remove the no-longer relevant queue_reqack tracepoint note. The
ACK-REQUESTED packet header flag is now set as needed when we actually
transmit the packet and may vary by retransmission.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the call timer in the following ways:
(1) If call->resend_at or call->ack_at are before or equal to the current
time, then ignore that timeout.
(2) If call->expire_at is before or equal to the current time, then don't
set the timer at all (possibly we should queue the call).
(3) Don't skip modifying the timer if timer_pending() is true. This
indicates that the timer is working, not that it has expired and is
running/waiting to run its expiry handler.
Also call rxrpc_set_timer() to start the call timer going rather than
calling add_timer().
Signed-off-by: David Howells <dhowells@redhat.com>
When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet,
it updates the Tx packet annotations in the annotation buffer. If a
soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted
states and that is fine; but if the soft-ACK is an NACK, we overwrite the
to-be-retransmitted with a nak - which isn't.
Instead, we need to let any scheduled retransmission stand if the packet
was NAK'd.
Note that we don't reissue a resend if the annotation is in the
to-be-retransmitted state because someone else must've scheduled the
resend already.
Signed-off-by: David Howells <dhowells@redhat.com>
When a DATA packet has its initial transmission, we may need to start or
adjust the resend timer. Without this we end up relying on being sent a
NACK to initiate the resend.
Signed-off-by: David Howells <dhowells@redhat.com>
before_eq() and friends should be used to compare serial numbers (when not
checking for (non)equality) rather than casting to int, subtracting and
checking the result.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a small helper that complements 36bbef52c7 ("bpf: direct packet
write and access for helpers for clsact progs") for invalidating the
current skb->hash after mangling on headers via direct packet write.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same motivation as in commit 80b48c4457 ("bpf: don't use raw processor
id in generic helper"), but this time for XDP typed programs. Thus, allow
for preemption checks when we have DEBUG_PREEMPT enabled, and otherwise
use the raw variant.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use skb_to_full_sk() helper introduced in commit bd5eb35f16
("xfrm: take care of request sockets") as otherwise we miss tcp synack
messages, since ownership is on request socket and therefore it would
miss the sk_fullsock() check. Use skb_to_full_sk() as also done similarly
in the bpf_get_cgroup_classid() helper via 2309236c13 ("cls_cgroup:
get sk_classid only from full sockets") fix to not let this fall through.
Fixes: 4a482f34af ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today the DSA drivers are in charge of flushing the MAC addresses
associated to a port when its STP state changes from Learning or
Forwarding, to Disabled or Blocking or Listening.
This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a void helper to set the STP state of a port, checking first if the
required routine is provided by the driver.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't send an IDLE ACK at the end of the transmission of the response to a
service call. The service end resends DATA packets until the client sends an
ACK that hard-acks all the send data. At that point, the call is complete.
Signed-off-by: David Howells <dhowells@redhat.com>
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.
If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.
Signed-off-by: David Howells <dhowells@redhat.com>