net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.
net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'. Thanks to David
Ahern for the heads up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a missing call to rxrpc_put_peer() on the main path through the
rxrpc_error_report() function. This manifests itself as a ref leak
whenever an ICMP packet or other error comes in.
In commit f334430316, the hand-off of the ref to a work item was removed
and was not replaced with a put.
Fixes: f334430316 ("rxrpc: Fix error distribution")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc_input_packet() function and its call tree was built around the
assumption that data_ready() handler called from UDP to inform a kernel
service that there is data to be had was non-reentrant. This means that
certain locking could be dispensed with.
This, however, turns out not to be the case with a multi-queue network card
that can deliver packets to multiple cpus simultaneously. Each of those
cpus can be in the rxrpc_input_packet() function at the same time.
Fix by adding or changing some structure members:
(1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
(2) Make conn->service_id into a 32-bit variable so that it can be
cmpxchg'd on all arches.
(3) Add call->input_lock to serialise access to the Rx/Tx state. Note
that although the Rx and Tx states are (almost) entirely separate,
there's no point completing the separation and having separate locks
since it's a bi-phasal RPC protocol rather than a bi-direction
streaming protocol. Data transmission and data reception do not take
place simultaneously on any particular call.
and making the following functional changes:
(1) In rxrpc_input_data(), hold call->input_lock around the core to
prevent simultaneous producing of packets into the Rx ring and
updating of tracking state for a particular call.
(2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
The bit test and bit clear can then be combined. No further locking
is needed here.
(3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
the ACK packet. The superseded ACK check is then done both before and
after the lock is taken.
The handing of ackinfo data is split, parsing before the lock is taken
and processing with it held. This is keyed on rxMTU being non-zero.
Congestion management is also done within the locked section.
(4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
rotation. The ACKALL packet carries no information and is only really
useful after all packets have been transmitted since it's imprecise.
(5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
prevent calls being simultaneously implicitly ended on two cpus and
also to prevent any races with incoming call setup.
(6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
on a connection. It is only permitted to happen once for a
connection.
(7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
rx->incoming_lock to see if someone else set up the call, connection
or peer whilst we were getting there. We can't trust the values from
the earlier routing check unless we pin refs on them - which we want
to avoid.
Further, we need to allow for an incoming call to have its state
changed on another CPU between us making it live and us adjusting it
because the conn is now in the RXRPC_CONN_SERVICE state.
(8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
to the RTT buffer. Don't need to lock around setting peer->rtt.
For reference, the inventory of state-accessing or state-altering functions
used by the packet input procedure is:
> rxrpc_input_packet()
* PACKET CHECKING
* ROUTING
> rxrpc_post_packet_to_local()
> rxrpc_find_connection_rcu() - uses RCU
> rxrpc_lookup_peer_rcu() - uses RCU
> rxrpc_find_service_conn_rcu() - uses RCU
> idr_find() - uses RCU
* CONNECTION-LEVEL PROCESSING
- Service upgrade
- Can only happen once per conn
! Changed to use cmpxchg
> rxrpc_post_packet_to_conn()
- Setting conn->hi_serial
- Probably safe not using locks
- Maybe use cmpxchg
* CALL-LEVEL PROCESSING
> Old-call checking
> rxrpc_input_implicit_end_call()
> rxrpc_call_completed()
> rxrpc_queue_call()
! Need to take rx->incoming_lock
> __rxrpc_disconnect_call()
> rxrpc_notify_socket()
> rxrpc_new_incoming_call()
- Uses rx->incoming_lock for the entire process
- Might be able to drop this earlier in favour of the call lock
> rxrpc_incoming_call()
! Conflicts with rxrpc_input_implicit_end_call()
> rxrpc_send_ping()
- Don't need locks to check rtt state
> rxrpc_propose_ACK
* PACKET DISTRIBUTION
> rxrpc_input_call_packet()
> rxrpc_input_data()
* QUEUE DATA PACKET ON CALL
> rxrpc_reduce_call_timer()
- Uses timer_reduce()
! Needs call->input_lock()
> rxrpc_receiving_reply()
! Needs locking around ack state
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_proto_abort()
> rxrpc_input_dup_data()
- Fills the Rx buffer
- rxrpc_propose_ACK()
- rxrpc_notify_socket()
> rxrpc_input_ack()
* APPLY ACK PACKET TO CALL AND DISCARD PACKET
> rxrpc_input_ping_response()
- Probably doesn't need any extra locking
! Need READ_ONCE() on call->ping_serial
> rxrpc_input_check_for_lost_ack()
- Takes call->lock to consult Tx buffer
> rxrpc_peer_add_rtt()
! Needs to take a lock (peer->rtt_input_lock)
! Could perhaps manage with cmpxchg() and xadd() instead
> rxrpc_input_requested_ack
- Consults Tx buffer
! Probably needs a lock
> rxrpc_peer_add_rtt()
> rxrpc_propose_ack()
> rxrpc_input_ackinfo()
- Changes call->tx_winsize
! Use cmpxchg to handle change
! Should perhaps track serial number
- Uses peer->lock to record MTU specification changes
> rxrpc_proto_abort()
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_input_soft_acks()
- Consults the Tx buffer
> rxrpc_congestion_management()
- Modifies the Tx annotations
! Needs call->input_lock()
> rxrpc_queue_call()
> rxrpc_input_abort()
* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
> rxrpc_set_call_completion()
> rxrpc_notify_socket()
> rxrpc_input_ackall()
* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
! Need to take call->input_lock
> rxrpc_rotate_tx_window()
> rxrpc_end_tx_phase()
> rxrpc_reject_packet()
There are some functions used by the above that queue the packet, after
which the procedure is terminated:
- rxrpc_post_packet_to_local()
- local->event_queue is an sk_buff_head
- local->processor is a work_struct
- rxrpc_post_packet_to_conn()
- conn->rx_queue is an sk_buff_head
- conn->processor is a work_struct
- rxrpc_reject_packet()
- local->reject_queue is an sk_buff_head
- local->processor is a work_struct
And some that offload processing to process context:
- rxrpc_notify_socket()
- Uses RCU lock
- Uses call->notify_lock to call call->notify_rx
- Uses call->recvmsg_lock to queue recvmsg side
- rxrpc_queue_call()
- call->processor is a work_struct
- rxrpc_propose_ACK()
- Uses call->lock to wrap __rxrpc_propose_ACK()
And a bunch that complete a call, all of which use call->state_lock to
protect the call state:
- rxrpc_call_completed()
- rxrpc_set_call_completion()
- rxrpc_abort_call()
- rxrpc_proto_abort()
- Also uses rxrpc_queue_call()
Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
AF_RXRPC opens an IPv6 socket through which to send and receive network
packets, both IPv6 and IPv4. It currently turns AF_INET addresses into
AF_INET-as-AF_INET6 addresses based on an assumption that this was
necessary; on further inspection of the code, however, it turns out that
the IPv6 code just farms packets aimed at AF_INET addresses out to the IPv4
code.
Fix AF_RXRPC to use AF_INET addresses directly when given them.
Fixes: 7b674e390e ("rxrpc: Fix IPv6 support")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix error distribution by immediately delivering the errors to all the
affected calls rather than deferring them to a worker thread. The problem
with the latter is that retries and things can happen in the meantime when we
want to stop that sooner.
To this end:
(1) Stop the error distributor from removing calls from the error_targets
list so that peer->lock isn't needed to synchronise against other adds
and removals.
(2) Require the peer's error_targets list to be accessed with RCU, thereby
avoiding the need to take peer->lock over distribution.
(3) Don't attempt to affect a call's state if it is already marked complete.
Signed-off-by: David Howells <dhowells@redhat.com>
AF_RXRPC has a keepalive message generator that generates a message for a
peer ~20s after the last transmission to that peer to keep firewall ports
open. The implementation is incorrect in the following ways:
(1) It mixes up ktime_t and time64_t types.
(2) It uses ktime_get_real(), the output of which may jump forward or
backward due to adjustments to the time of day.
(3) If the current time jumps forward too much or jumps backwards, the
generator function will crank the base of the time ring round one slot
at a time (ie. a 1s period) until it catches up, spewing out VERSION
packets as it goes.
Fix the problem by:
(1) Only using time64_t. There's no need for sub-second resolution.
(2) Use ktime_get_seconds() rather than ktime_get_real() so that time
isn't perceived to go backwards.
(3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
parts:
(a) The "worker" function that manages the buckets and the timer.
(b) The "dispatch" function that takes the pending peers and
potentially transmits a keepalive packet before putting them back
in the ring into the slot appropriate to the revised last-Tx time.
(4) Taking everything that's pending out of the ring and splicing it into
a temporary collector list for processing.
In the case that there's been a significant jump forward, the ring
gets entirely emptied and then the time base can be warped forward
before the peers are processed.
The warping can't happen if the ring isn't empty because the slot a
peer is in is keepalive-time dependent, relative to the base time.
(5) Limit the number of iterations of the bucket array when scanning it.
(6) Set the timer to skip any empty slots as there's no point waking up if
there's nothing to do yet.
This can be triggered by an incoming call from a server after a reboot with
AF_RXRPC and AFS built into the kernel causing a peer record to be set up
before userspace is started. The system clock is then adjusted by
userspace, thereby potentially causing the keepalive generator to have a
meltdown - which leads to a message like:
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
...
Workqueue: krxrpcd rxrpc_peer_keepalive_worker
EIP: lock_acquire+0x69/0x80
...
Call Trace:
? rxrpc_peer_keepalive_worker+0x5e/0x350
? _raw_spin_lock_bh+0x29/0x60
? rxrpc_peer_keepalive_worker+0x5e/0x350
? rxrpc_peer_keepalive_worker+0x5e/0x350
? __lock_acquire+0x3d3/0x870
? process_one_work+0x110/0x340
? process_one_work+0x166/0x340
? process_one_work+0x110/0x340
? worker_thread+0x39/0x3c0
? kthread+0xdb/0x110
? cancel_delayed_work+0x90/0x90
? kthread_stop+0x70/0x70
? ret_from_fork+0x19/0x24
Fixes: ace45bec6d ("rxrpc: Fix firewall route keepalive")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the firewall route keepalive part of AF_RXRPC which is currently
function incorrectly by replying to VERSION REPLY packets from the server
with VERSION REQUEST packets.
Instead, send VERSION REPLY packets to the peers of service connections to
act as keep-alives 20s after the latest packet was transmitted to that
peer.
Also, just discard VERSION REPLY packets rather than replying to them.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix IPv6 support in AF_RXRPC in the following ways:
(1) When extracting the address from a received IPv4 packet, if the local
transport socket is open for IPv6 then fill out the sockaddr_rxrpc
struct for an IPv4-mapped-to-IPv6 AF_INET6 transport address instead
of an AF_INET one.
(2) When sending CHALLENGE or RESPONSE packets, the transport length needs
to be set from the sockaddr_rxrpc::transport_len field rather than
sizeof() on the IPv4 transport address.
(3) When processing an IPv4 ICMP packet received by an IPv6 socket, set up
the address correctly before searching for the affected peer.
Signed-off-by: David Howells <dhowells@redhat.com>
Use negative error codes in struct rxrpc_call::error because that's what
the kernel normally deals with and to make the code consistent. We only
turn them positive when transcribing into a cmsg for userspace recvmsg.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a function to track the average RTT for a peer. Sources of RTT data
will be added in subsequent patches.
The RTT data will be useful in the future for determining resend timeouts
and for handling the slow-start part of the Rx protocol.
Also add a pair of tracepoints, one to log transmissions to elicit a
response for RTT purposes and one to log responses that contribute RTT
data.
Signed-off-by: David Howells <dhowells@redhat.com>
Improve sk_buff tracing within AF_RXRPC by the following means:
(1) Use an enum to note the event type rather than plain integers and use
an array of event names rather than a big multi ?: list.
(2) Distinguish Rx from Tx packets and account them separately. This
requires the call phase to be tracked so that we know what we might
find in rxtx_buffer[].
(3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the
event type.
(4) A pair of 'rotate' events are added to indicate packets that are about
to be rotated out of the Rx and Tx windows.
(5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for
packet loss injection recording.
Signed-off-by: David Howells <dhowells@redhat.com>
Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it.
This is then made conditional on CONFIG_IPV6.
Without this, the following can be seen:
net/built-in.o: In function `rxrpc_init_peer':
>> peer_object.c:(.text+0x18c3c8): undefined reference to `ip6_route_output_flags'
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IPv6 support to AF_RXRPC. With this, AF_RXRPC sockets can be created:
service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);
instead of:
service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
The AFS filesystem doesn't support IPv6 at the moment, though, since that
requires upgrades to some of the RPC calls.
Note that a good portion of this patch is replacing "%pI4:%u" in print
statements with "%pISpc" which is able to handle both protocols and print
the port.
Signed-off-by: David Howells <dhowells@redhat.com>
Rewrite the data and ack handling code such that:
(1) Parsing of received ACK and ABORT packets and the distribution and the
filing of DATA packets happens entirely within the data_ready context
called from the UDP socket. This allows us to process and discard ACK
and ABORT packets much more quickly (they're no longer stashed on a
queue for a background thread to process).
(2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
keep track of the offset and length of the content of each packet in
the sk_buff metadata. This means we don't do any allocation in the
receive path.
(3) Jumbo DATA packet parsing is now done in data_ready context. Rather
than cloning the packet once for each subpacket and pulling/trimming
it, we file the packet multiple times with an annotation for each
indicating which subpacket is there. From that we can directly
calculate the offset and length.
(4) A call's receive queue can be accessed without taking locks (memory
barriers do have to be used, though).
(5) Incoming calls are set up from preallocated resources and immediately
made live. They can than have packets queued upon them and ACKs
generated. If insufficient resources exist, DATA packet #1 is given a
BUSY reply and other DATA packets are discarded).
(6) sk_buffs no longer take a ref on their parent call.
To make this work, the following changes are made:
(1) Each call's receive buffer is now a circular buffer of sk_buff
pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
between the call and the socket. This permits each sk_buff to be in
the buffer multiple times. The receive buffer is reused for the
transmit buffer.
(2) A circular buffer of annotations (rxtx_annotations) is kept parallel
to the data buffer. Transmission phase annotations indicate whether a
buffered packet has been ACK'd or not and whether it needs
retransmission.
Receive phase annotations indicate whether a slot holds a whole packet
or a jumbo subpacket and, if the latter, which subpacket. They also
note whether the packet has been decrypted in place.
(3) DATA packet window tracking is much simplified. Each phase has just
two numbers representing the window (rx_hard_ack/rx_top and
tx_hard_ack/tx_top).
The hard_ack number is the sequence number before base of the window,
representing the last packet the other side says it has consumed.
hard_ack starts from 0 and the first packet is sequence number 1.
The top number is the sequence number of the highest-numbered packet
residing in the buffer. Packets between hard_ack+1 and top are
soft-ACK'd to indicate they've been received, but not yet consumed.
Four macros, before(), before_eq(), after() and after_eq() are added
to compare sequence numbers within the window. This allows for the
top of the window to wrap when the hard-ack sequence number gets close
to the limit.
Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
to indicate when rx_top and tx_top point at the packets with the
LAST_PACKET bit set, indicating the end of the phase.
(4) Calls are queued on the socket 'receive queue' rather than packets.
This means that we don't need have to invent dummy packets to queue to
indicate abnormal/terminal states and we don't have to keep metadata
packets (such as ABORTs) around
(5) The offset and length of a (sub)packet's content are now passed to
the verify_packet security op. This is currently expected to decrypt
the packet in place and validate it.
However, there's now nowhere to store the revised offset and length of
the actual data within the decrypted blob (there may be a header and
padding to skip) because an sk_buff may represent multiple packets, so
a locate_data security op is added to retrieve these details from the
sk_buff content when needed.
(6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
individually secured and needs to be individually decrypted. The code
to do this is broken out into rxrpc_recvmsg_data() and shared with the
kernel API. It now iterates over the call's receive buffer rather
than walking the socket receive queue.
Additional changes:
(1) The timers are condensed to a single timer that is set for the soonest
of three timeouts (delayed ACK generation, DATA retransmission and
call lifespan).
(2) Transmission of ACK and ABORT packets is effected immediately from
process-context socket ops/kernel API calls that cause them instead of
them being punted off to a background work item. The data_ready
handler still has to defer to the background, though.
(3) A shutdown op is added to the AF_RXRPC socket so that the AFS
filesystem can shut down the socket and flush its own work items
before closing the socket to deal with any in-progress service calls.
Future additional changes that will need to be considered:
(1) Make sure that a call doesn't hog the front of the queue by receiving
data from the network as fast as userspace is consuming it to the
exclusion of other calls.
(2) Transmit delayed ACKs from within recvmsg() when we've consumed
sufficiently more packets to avoid the background work item needing to
run.
Signed-off-by: David Howells <dhowells@redhat.com>
Condense the terminal states of a call state machine to a single state,
plus a separate completion type value. The value is then set, along with
error and abort code values, only when the call is transitioned to the
completion state.
Helpers are provided to simplify this.
Signed-off-by: David Howells <dhowells@redhat.com>
Use the peer record to distribute network errors rather than the transport
object (which I want to get rid of). An error from a particular peer
terminates all calls on that peer.
For future consideration:
(1) For ICMP-induced errors it might be worth trying to extract the RxRPC
header from the offending packet, if one is returned attached to the
ICMP packet, to better direct the error.
This may be overkill, though, since an ICMP packet would be expected
to be relating to the destination port, machine or network. RxRPC
ABORT and BUSY packets give notice at RxRPC level.
(2) To also abort connection-level communications (such as CHALLENGE
packets) where indicted by an error - but that requires some revamping
of the connection event handling first.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't assume anything about the address in an ICMP packet in
rxrpc_error_report() as the address may not be IPv4 in future, especially
since we're just printing these details.
Signed-off-by: David Howells <dhowells@redhat.com>
Break MTU determination from ICMP out into its own function to reduce the
complexity of the error report handler.
Signed-off-by: David Howells <dhowells@redhat.com>
Rename rxrpc_UDP_error_report() to rxrpc_error_report() as it might get
called for something other than UDP.
Signed-off-by: David Howells <dhowells@redhat.com>
Rework peer object handling to use a hash table instead of a flat list and
to use RCU. Peer objects are no longer destroyed by passing them to a
workqueue to process, but rather are just passed to the RCU garbage
collector as kfree'able objects.
The hash function uses the local endpoint plus all the components of the
remote address, except for the RxRPC service ID. Peers thus represent a
UDP port on the remote machine as contacted by a UDP port on this machine.
The RCU read lock is used to handle non-creating lookups so that they can
be called from bottom half context in the sk_error_report handler without
having to lock the hash table against modification.
rxrpc_lookup_peer_rcu() *does* take a reference on the peer object as in
the future, this will be passed to a work item for error distribution in
the error_report path and this function will cease being used in the
data_ready path.
Creating lookups are done under spinlock rather than mutex as they might be
set up due to an external stimulus if the local endpoint is a server.
Captured network error messages (ICMP) are handled with respect to this
struct and MTU size and RTT are cached here.
Signed-off-by: David Howells <dhowells@redhat.com>
Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix.
This will aid splitting those files by making easier to come up with new
names.
Note that the not all files are simply renamed from ar-X.c to X.c. The
following exceptions are made:
(*) ar-call.c -> call_object.c
ar-ack.c -> call_event.c
call_object.c is going to contain the core of the call object
handling. Call event handling is all going to be in call_event.c.
(*) ar-accept.c -> call_accept.c
Incoming call handling is going to be here.
(*) ar-connection.c -> conn_object.c
ar-connevent.c -> conn_event.c
The former file is going to have the basic connection object handling,
but there will likely be some differentiation between client
connections and service connections in additional files later. The
latter file will have all the connection-level event handling.
(*) ar-local.c -> local_object.c
This will have the local endpoint object handling code. The local
endpoint event handling code will later be split out into
local_event.c.
(*) ar-peer.c -> peer_object.c
This will have the peer endpoint object handling code. Peer event
handling code will be placed in peer_event.c (for the moment, there is
none).
(*) ar-error.c -> peer_event.c
This will become the peer event handling code, though for the moment
it's actually driven from the local endpoint's perspective.
Note that I haven't renamed ar-transport.c to transport_object.c as the
intention is to delete it when the rxrpc_transport struct is excised.
The only file that actually has its contents changed is net/rxrpc/Makefile.
net/rxrpc/ar-internal.h will need its section marker comments updating, but
I'll do that in a separate patch to make it easier for git to follow the
history across the rename. I may also want to rename ar-internal.h at some
point - but that would mean updating all the #includes and I'd rather do
that in a separate step.
Signed-off-by: David Howells <dhowells@redhat.com.