mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-27 22:53:55 +08:00
Commit Graph

783389 Commits

Author SHA1 Message Date
Sabrina Dubroca
af7d6cce53 net: ipv4: update fnhe_pmtu when first hop's MTU changes
Since commit 5aad1de5ea ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.

As Stefano described in commit e9fa1495d7 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
 - if the local MTU is now lower than the PMTU, that PMTU is now
   incorrect
 - if the local MTU was the lowest value in the path, and is increased,
   we might discover a higher PMTU

Similarly to what commit e9fa1495d7 did for IPv6, update PMTU in those
cases.

If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.

To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.
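
A rough sketch of how a NETDEV_CHANGEMTU consumer can read the old MTU after
this change (field and helper names follow the description above; the
fib_sync_mtu() callee is shown purely for illustration):

	static int fib_netdev_event(struct notifier_block *nb,
				    unsigned long event, void *ptr)
	{
		struct net_device *dev = netdev_notifier_info_to_dev(ptr);
		struct netdev_notifier_info_ext *info_ext = ptr;

		if (event == NETDEV_CHANGEMTU)
			/* dev->mtu already holds the new value here; the old
			 * value travels in the extended notifier info */
			fib_sync_mtu(dev, info_ext->ext.mtu);
		return NOTIFY_DONE;
	}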

Fixes: 5aad1de5ea ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:44:46 -07:00
Mike Rapoport
7abab7b9b4 net/ipv6: stop leaking percpu memory in fib6 info
The fib6_info_alloc() function allocates percpu memory to hold per CPU
pointers to rt6_info, but this memory is never freed. Fix it.
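
The fix amounts to releasing the per-CPU pointer array in the fib6_info
destroy path, roughly as follows (a heavily simplified sketch; the other
cleanup steps are elided):

	/* called via call_rcu() once the last reference is gone */
	void fib6_info_destroy_rcu(struct rcu_head *head)
	{
		struct fib6_info *f6i = container_of(head, struct fib6_info, rcu);

		/* ... release cached per-CPU routes, exception bucket, etc. ... */

		/* free the per-CPU rt6_info pointers from fib6_info_alloc() */
		free_percpu(f6i->rt6i_pcpu);
		kfree(f6i);
	}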

Fixes: a64efe142f ("net/ipv6: introduce fib6_info struct and helpers")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:42:54 -07:00
David S. Miller
49b538e79b RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAW7vWy/u3V2unywtrAQJkeA//UTmUmv9bL8k38XvnnHQ82iTyIjvZ40md
 9XbNagwjkxuDv2eqW1rTxk5+Po596Nb4s/cBvpyAABjCA+zoT0/fklK5PqsSKFla
 ibAHlrBiv/1FlSUSTf4dvAfVVszRAItCP4OhKTU+0fkX3Hq75T7Lty1h92Yk/ijx
 ccqSV9K/nB4WN9bbg+sDngu3RAVR3VJ5RhVSmrnE9xd3M9ihz2RHacJ6GWNQf5PV
 j8cKuP3qY/2K245hb4sGn1i9OO6/WVVQhnojYA+jR8Vg4+8mY0Y++MLsW8ywlsuo
 xlrDCVPBaR6YqjnlDoOXIXYZ8EbEsmy5FP51zfEZJqeKJcVPLavQyoku0xJ8fHgr
 oK/3DBUkwgi0hiq4dC1CrmXZRSbBQqGTEJMA6AKh2ApXb9BkmhmqRGMx+2nqocwF
 TjAQmCu0FJTAYdwJuqzv5SRFjocjrHGcYDpQ27CgNqSiiDlh8CZ1ej/mbV+o8pUQ
 tOrpnRCLY8Nl55AgJbl2260mKcpcKQcRUwggKAOzqRjXwcy9q3KKm8H149qejJSl
 AQihDkPVqBZrv5ODKR0z88e5cVF9ad+vBPd+2TrRVGtTWhm/q8N/0Hh0M2sYDSc6
 vLklYfwI4U5U1/naJC9WCkW+ybj5ekVNHgW91ICvQVNG0TH2zfE7sUGMQn6lRZL8
 jfpLTVbtKfI=
 =q6E6
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20181008' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fix packet reception code

Here is a set of patches that prepares for and fixes problems in rxrpc's
packet reception code.  The serious problems are:

 (A) There's a window between binding the socket and setting the data_ready
     hook in which packets can find their way into the UDP socket's receive
     queues.

 (B) The skb_recv_udp() will return an error (and clear the error state) if
     there was an error on the Tx side.  rxrpc doesn't handle this.

 (C) The rxrpc data_ready handler doesn't fully drain the UDP receive
     queue.

 (D) The rxrpc data_ready handler assumes it is called in a non-reentrant
     state.

The second patch fixes (A) - (C); the third patch renders (B) and (C)
non-issues by using the encap_rcv hook instead of data_ready - and the
final patch fixes (D).  That last one is the most complex.

The preparatory patches are:

 (1) Fix some places that are doing things in the wrong net namespace.

 (2) Stop taking the rcu read lock as it's held by the IP input routine in
     the call chain.

 (3) Only end the Tx phase if *we* rotated the final packet out of the Tx
     buffer.

 (4) Don't assume that the call state won't change after dropping the
     call_state lock.

 (5) Only take receive window and MTU size parameters from an ACK packet if
     it's the latest ACK packet.

 (6) Record connection-level abort information correctly.

 (7) Fix a trace line.

And then there are three main patches - note that these are mixed in with
the preparatory patches somewhat:

 (1) Fix the setup window (A), skb_recv_udp() error check (B) and packet
     drainage (C).

 (2) Switch to using the encap_rcv instead of data_ready to cut out the
     effects of the UDP read queues and get the packets delivered directly.

 (3) Add more locking into the various packet input paths to defend against
     re-entrance (D).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:27:38 -07:00
Ka-Cheong Poon
9a4890bd6d rds: RDS (tcp) hangs on sendto() to unresponsive address
In rds_send_mprds_hash(), if the calculated hash value is non-zero and
the MPRDS connections are not yet up, it will wait.  But it should not
wait if the send is non-blocking.  In this case, it should just use the
base c_path for sending the message.
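
A sketch of the resulting hash selection (parameter and field names are
approximate, based on the description above):

	static int rds_send_mprds_hash(struct rds_sock *rs,
				       struct rds_connection *conn, int nonblock)
	{
		int hash;

		if (conn->c_npaths == 0)
			hash = RDS_MPATH_HASH(rs, RDS_MPATH_WORKERS);
		else
			hash = RDS_MPATH_HASH(rs, conn->c_npaths);
		if (conn->c_npaths == 0 && hash != 0) {
			rds_send_ping(conn, 0);
			if (conn->c_npaths == 0) {
				/* MP-RDS is not up yet: a non-blocking send must
				 * not sleep here, so fall back to the base c_path */
				if (nonblock)
					return 0;
				if (wait_event_interruptible(conn->c_hs_waitq,
							     conn->c_npaths != 0))
					hash = 0;
			}
			if (conn->c_npaths == 1)
				hash = 0;
		}
		return hash;
	}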

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:19:52 -07:00
Eric Dumazet
52b5d6f5dc net: make skb_partial_csum_set() more robust against overflows
syzbot managed to crash in skb_checksum_help() [1] :

        BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));

Root cause is the following check in skb_partial_csum_set()

	if (unlikely(start > skb_headlen(skb)) ||
	    unlikely((int)start + off > skb_headlen(skb) - 2))
		return false;

If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff
and the check fails to detect that ((int)start + off) is off the limit,
since the compare is unsigned.

When we fix that, then the first condition (start > skb_headlen(skb))
becomes obsolete.

Then we should also check that (skb_headroom(skb) + start) won't
overflow the 16-bit field.
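
The resulting checks can be done in 32-bit arithmetic, roughly along these
lines (a sketch of the idea, not necessarily the exact patch):

	bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off)
	{
		u32 csum_start = skb_headroom(skb) + (u32)start;
		u32 csum_end = (u32)start + (u32)off + sizeof(__sum16);

		/* both sums are computed as u32, so neither can wrap */
		if (unlikely(csum_start > U16_MAX ||
			     csum_end > skb_headlen(skb)))
			return false;

		skb->ip_summed = CHECKSUM_PARTIAL;
		skb->csum_start = csum_start;
		skb->csum_offset = off;
		skb_set_transport_header(skb, start);
		return true;
	}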

[1]
kernel BUG at net/core/dev.c:2880!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 7330 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #253
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:skb_checksum_help+0x9e3/0xbb0 net/core/dev.c:2880
Code: 85 00 ff ff ff 48 c1 e8 03 42 80 3c 28 00 0f 84 09 fb ff ff 48 8b bd 00 ff ff ff e8 97 a8 b9 fb e9 f8 fa ff ff e8 2d 09 76 fb <0f> 0b 48 8b bd 28 ff ff ff e8 1f a8 b9 fb e9 b1 f6 ff ff 48 89 cf
RSP: 0018:ffff8801d83a6f60 EFLAGS: 00010293
RAX: ffff8801b9834380 RBX: ffff8801b9f8d8c0 RCX: ffffffff8608c6d7
RDX: 0000000000000000 RSI: ffffffff8608cc63 RDI: 0000000000000006
RBP: ffff8801d83a7068 R08: ffff8801b9834380 R09: 0000000000000000
R10: ffff8801d83a76d8 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000010001 R14: 000000000000ffff R15: 00000000000000a8
FS:  00007f1a66db5700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7d77f091b0 CR3: 00000001ba252000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 skb_csum_hwoffload_help+0x8f/0xe0 net/core/dev.c:3269
 validate_xmit_skb+0xa2a/0xf30 net/core/dev.c:3312
 __dev_queue_xmit+0xc2f/0x3950 net/core/dev.c:3797
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
 packet_snd net/packet/af_packet.c:2928 [inline]
 packet_sendmsg+0x422d/0x64c0 net/packet/af_packet.c:2953

Fixes: 5ff8dda303 ("net: Ensure partial checksum offset is inside the skb head")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 10:21:31 -07:00
David S. Miller
8b79f41043 Merge branch 'devlink-param-type-string-fixes'
Moshe Shemesh says:

====================
devlink param type string fixes

This patchset fixes devlink param infrastructure for string param type.

The devlink param infrastructure doesn't handle copying the string data
correctly.  The first two patches fix it and the third patch adds helper
function to safely copy string value without exceeding
DEVLINK_PARAM_MAX_STRING_VALUE.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 10:19:10 -07:00
Moshe Shemesh
bde74ad10e devlink: Add helper function for safely copying string param
Devlink string param buffer is allocated at the size of
DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure
this size is not exceeded.
Rename DEVLINK_PARAM_MAX_STRING_VALUE to
__DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by
devlink only. Drivers should use the helper function instead, which
verifies that the allowed length is not exceeded.
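
The helper is essentially a bounded copy plus a warning, along these lines
(a sketch of the described behaviour):

	void devlink_param_value_str_fill(union devlink_param_value *dst_val,
					  const char *src)
	{
		size_t len;

		/* never write past the devlink-owned string buffer */
		len = strlcpy(dst_val->vstr, src,
			      __DEVLINK_PARAM_MAX_STRING_VALUE);
		WARN_ON(len >= __DEVLINK_PARAM_MAX_STRING_VALUE);
	}
	EXPORT_SYMBOL_GPL(devlink_param_value_str_fill);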

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 10:19:10 -07:00
Moshe Shemesh
1276534c98 devlink: Fix param cmode driverinit for string type
The driverinit configuration mode value is held by devlink so that the
driver can fetch it after a reload command. In case the param type is
string, devlink should copy the value from the driver's string buffer to
the devlink string buffer on devlink_param_driverinit_value_set() and
vice-versa on devlink_param_driverinit_value_get().

Fixes: ec01aeb180 ("devlink: Add support for get/set driverinit value")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 10:19:10 -07:00
Moshe Shemesh
f355cfcdb2 devlink: Fix param set handling for string type
In case the devlink param type is string, devlink needs to copy the string
value it got from the input into devlink_param_value.

Fixes: e3b7ca18ad ("devlink: Add param set command")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 10:19:10 -07:00
David S. Miller
4cf34c0cf6 Merge branch 'ena-fixes'
Arthur Kiyanovski says:

====================
minor bug fixes for ENA Ethernet driver

Arthur Kiyanovski (4):
  net: ena: fix warning in rmmod caused by double iounmap
  net: ena: fix rare bug when failed restart/resume is followed by
    driver removal
  net: ena: fix NULL dereference due to untimely napi initialization
  net: ena: fix auto casting to boolean
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09 10:49:50 -07:00
Arthur Kiyanovski
248ab77342 net: ena: fix auto casting to boolean
Eliminate potential auto casting compilation error.

Fixes: 1738cd3ed3 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09 10:49:49 -07:00
Arthur Kiyanovski
78a55d05de net: ena: fix NULL dereference due to untimely napi initialization
napi poll functions should be initialized before running request_irq(),
to handle a rare condition where a pending interrupt causes the ISR to
fire immediately while the poll function has not been set yet, leading to
a NULL dereference.
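
In outline, the fix is an ordering change in the driver's setup path
(function names here are illustrative of the ena driver, not verbatim):

	/* register the napi poll handlers first ... */
	ena_init_napi(adapter);

	/* ... and only then request the IRQs, so a pending interrupt cannot
	 * run the ISR against a not-yet-initialized poll function */
	rc = ena_request_io_irq(adapter);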

Fixes: 1738cd3ed3 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09 10:49:49 -07:00
Arthur Kiyanovski
d7703ddbd7 net: ena: fix rare bug when failed restart/resume is followed by driver removal
In a rare scenario when ena_device_restore() fails, followed by device
remove, an FLR will not be issued. In this case, the device will keep
sending asynchronous AENQ keep-alive events, even after driver removal,
leading to memory corruption.

Fixes: 8c5c7abdeb ("net: ena: add power management ops to the ENA driver")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09 10:49:49 -07:00
Arthur Kiyanovski
d79c3888bd net: ena: fix warning in rmmod caused by double iounmap
Memory mapped with devm_ioremap is automatically freed when the driver
is disconnected from the device. Therefore there is no need to
explicitly call devm_iounmap.

Fixes: 0857d92f71 ("net: ena: add missing unmap bars on device removal")
Fixes: 411838e7b4 ("net: ena: fix rare kernel crash when bar memory remap fails")
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09 10:49:49 -07:00
David Howells
c1e15b4944 rxrpc: Fix the packet reception routine
The rxrpc_input_packet() function and its call tree was built around the
assumption that data_ready() handler called from UDP to inform a kernel
service that there is data to be had was non-reentrant.  This means that
certain locking could be dispensed with.

This, however, turns out not to be the case with a multi-queue network card
that can deliver packets to multiple cpus simultaneously.  Each of those
cpus can be in the rxrpc_input_packet() function at the same time.

Fix by adding or changing some structure members:

 (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.

 (2) Make conn->service_id into a 32-bit variable so that it can be
     cmpxchg'd on all arches.

 (3) Add call->input_lock to serialise access to the Rx/Tx state.  Note
     that although the Rx and Tx states are (almost) entirely separate,
     there's no point completing the separation and having separate locks
     since it's a bi-phasal RPC protocol rather than a bi-directional
     streaming protocol.  Data transmission and data reception do not take
     place simultaneously on any particular call.

and making the following functional changes:

 (1) In rxrpc_input_data(), hold call->input_lock around the core to
     prevent simultaneous producing of packets into the Rx ring and
     updating of tracking state for a particular call.

 (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
     check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
     The bit test and bit clear can then be combined.  No further locking
     is needed here.

 (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
     the ACK packet.  The superseded ACK check is then done both before and
     after the lock is taken.

     The handling of ackinfo data is split, parsing before the lock is taken
     and processing with it held.  This is keyed on rxMTU being non-zero.

     Congestion management is also done within the locked section.

 (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
     rotation.  The ACKALL packet carries no information and is only really
     useful after all packets have been transmitted since it's imprecise.

 (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
     prevent calls being simultaneously implicitly ended on two cpus and
     also to prevent any races with incoming call setup.

 (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
     on a connection.  It is only permitted to happen once for a
     connection.

 (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
     rx->incoming_lock to see if someone else set up the call, connection
     or peer whilst we were getting there.  We can't trust the values from
     the earlier routing check unless we pin refs on them - which we want
     to avoid.

     Further, we need to allow for an incoming call to have its state
     changed on another CPU between us making it live and us adjusting it
     because the conn is now in the RXRPC_CONN_SERVICE state.

 (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
     to the RTT buffer.  Don't need to lock around setting peer->rtt.

For reference, the inventory of state-accessing or state-altering functions
used by the packet input procedure is:

> rxrpc_input_packet()
  * PACKET CHECKING

  * ROUTING
    > rxrpc_post_packet_to_local()
    > rxrpc_find_connection_rcu() - uses RCU
      > rxrpc_lookup_peer_rcu() - uses RCU
      > rxrpc_find_service_conn_rcu() - uses RCU
      > idr_find() - uses RCU

  * CONNECTION-LEVEL PROCESSING
    - Service upgrade
      - Can only happen once per conn
      ! Changed to use cmpxchg
    > rxrpc_post_packet_to_conn()
    - Setting conn->hi_serial
      - Probably safe not using locks
      - Maybe use cmpxchg

  * CALL-LEVEL PROCESSING
    > Old-call checking
      > rxrpc_input_implicit_end_call()
        > rxrpc_call_completed()
	> rxrpc_queue_call()
	! Need to take rx->incoming_lock
	> __rxrpc_disconnect_call()
	> rxrpc_notify_socket()
    > rxrpc_new_incoming_call()
      - Uses rx->incoming_lock for the entire process
        - Might be able to drop this earlier in favour of the call lock
      > rxrpc_incoming_call()
      	! Conflicts with rxrpc_input_implicit_end_call()
    > rxrpc_send_ping()
      - Don't need locks to check rtt state
      > rxrpc_propose_ACK

  * PACKET DISTRIBUTION
    > rxrpc_input_call_packet()
      > rxrpc_input_data()
	* QUEUE DATA PACKET ON CALL
	> rxrpc_reduce_call_timer()
	  - Uses timer_reduce()
	! Needs call->input_lock()
	> rxrpc_receiving_reply()
	  ! Needs locking around ack state
	  > rxrpc_rotate_tx_window()
	  > rxrpc_end_tx_phase()
	> rxrpc_proto_abort()
	> rxrpc_input_dup_data()
	- Fills the Rx buffer
	- rxrpc_propose_ACK()
	- rxrpc_notify_socket()

      > rxrpc_input_ack()
	* APPLY ACK PACKET TO CALL AND DISCARD PACKET
	> rxrpc_input_ping_response()
	  - Probably doesn't need any extra locking
	  ! Need READ_ONCE() on call->ping_serial
	  > rxrpc_input_check_for_lost_ack()
	    - Takes call->lock to consult Tx buffer
	  > rxrpc_peer_add_rtt()
	    ! Needs to take a lock (peer->rtt_input_lock)
	    ! Could perhaps manage with cmpxchg() and xadd() instead
	> rxrpc_input_requested_ack
	  - Consults Tx buffer
	    ! Probably needs a lock
	  > rxrpc_peer_add_rtt()
	> rxrpc_propose_ack()
	> rxrpc_input_ackinfo()
	  - Changes call->tx_winsize
	    ! Use cmpxchg to handle change
	    ! Should perhaps track serial number
	  - Uses peer->lock to record MTU specification changes
	> rxrpc_proto_abort()
	! Need to take call->input_lock
	> rxrpc_rotate_tx_window()
	> rxrpc_end_tx_phase()
	> rxrpc_input_soft_acks()
	- Consults the Tx buffer
	> rxrpc_congestion_management()
	  - Modifies the Tx annotations
	  ! Needs call->input_lock()
	  > rxrpc_queue_call()

      > rxrpc_input_abort()
	* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
	> rxrpc_set_call_completion()
	> rxrpc_notify_socket()

      > rxrpc_input_ackall()
	* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
	! Need to take call->input_lock
	> rxrpc_rotate_tx_window()
	> rxrpc_end_tx_phase()

    > rxrpc_reject_packet()

There are some functions used by the above that queue the packet, after
which the procedure is terminated:

 - rxrpc_post_packet_to_local()
   - local->event_queue is an sk_buff_head
   - local->processor is a work_struct
 - rxrpc_post_packet_to_conn()
   - conn->rx_queue is an sk_buff_head
   - conn->processor is a work_struct
 - rxrpc_reject_packet()
   - local->reject_queue is an sk_buff_head
   - local->processor is a work_struct

And some that offload processing to process context:

 - rxrpc_notify_socket()
   - Uses RCU lock
   - Uses call->notify_lock to call call->notify_rx
   - Uses call->recvmsg_lock to queue recvmsg side
 - rxrpc_queue_call()
   - call->processor is a work_struct
 - rxrpc_propose_ACK()
   - Uses call->lock to wrap __rxrpc_propose_ACK()

And a bunch that complete a call, all of which use call->state_lock to
protect the call state:

 - rxrpc_call_completed()
 - rxrpc_set_call_completion()
 - rxrpc_abort_call()
 - rxrpc_proto_abort()
   - Also uses rxrpc_queue_call()

Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 22:42:04 +01:00
David Howells
4e2abd3c05 rxrpc: Fix the rxrpc_tx_packet trace line
Fix the rxrpc_tx_packet trace line by storing the where parameter.

Fixes: 4764c0da69 ("rxrpc: Trace packet transmission")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 22:42:04 +01:00
David Howells
647530924f rxrpc: Fix connection-level abort handling
Fix connection-level abort handling to cache the abort and error codes
properly so that a new incoming call can be properly aborted if it races
with the parent connection being aborted by another CPU.

The abort_code and error parameters can then be dropped from
rxrpc_abort_calls().

Fixes: f5c17aaeb2 ("rxrpc: Calls should only have one terminal state")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 22:42:04 +01:00
David Howells
298bc15b20 rxrpc: Only take the rwind and mtu values from latest ACK
Move the out-of-order and duplicate ACK packet check to before the call to
rxrpc_input_ackinfo() so that the receive window size and MTU size are only
checked in the latest ACK packet and don't regress.

Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 22:42:04 +01:00
David Howells
dfe9952246 rxrpc: Carry call state out of locked section in rxrpc_rotate_tx_window()
Carry the call state out of the locked section in rxrpc_rotate_tx_window()
rather than sampling it afterwards.  This is only used to select tracepoint
data, but could have changed by the time we do the tracepoint.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 15:57:36 +01:00
David Howells
c479d5f2c2 rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window()
We should only call the function to end a call's Tx phase if we rotated the
marked-last packet out of the transmission buffer.

Make rxrpc_rotate_tx_window() return an indication of whether it just
rotated the packet marked as the last out of the transmit buffer, carrying
the information out of the locked section in that function.

We can then check the return value instead of examining RXRPC_CALL_TX_LAST.

Fixes: 70790dbe3f ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 15:55:20 +01:00
David Howells
bfd2821117 rxrpc: Don't need to take the RCU read lock in the packet receiver
We don't need to take the RCU read lock in the rxrpc packet receive
function because it's held further up the stack in the IP input routine
around the UDP receive routines.

Fix this by dropping the RCU read lock calls from rxrpc_input_packet().
This simplifies the code.

Fixes: 70790dbe3f ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 15:45:56 +01:00
David Howells
5271953cad rxrpc: Use the UDP encap_rcv hook
Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
in which a packet is placed onto the UDP receive queue and then immediately
removed again by rxrpc.  Going via the queue in this manner seems like it
should be unnecessary.

This does, however, require the invention of a value to place in encap_type
as that's one of the conditions to switch packets out to the encap_rcv
hook.  Possibly the value doesn't actually matter for anything other than
sockopts on the UDP socket, which aren't accessible outside of rxrpc
anyway.
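
In outline, the UDP socket created by rxrpc is switched over like this (a
sketch; UDP_ENCAP_RXRPC stands for the newly invented encap_type value
mentioned above):

	struct sock *usk = local->socket->sk;

	/* deliver datagrams straight into rxrpc instead of queuing them
	 * on the UDP receive queue */
	udp_sk(usk)->encap_type = UDP_ENCAP_RXRPC;
	udp_sk(usk)->encap_rcv = rxrpc_input_packet;
	udp_encap_enable();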

This seems to cut a bit of time out of the time elapsed between each
sk_buff being timestamped and turning up in rxrpc (the final number in the
following trace excerpts).  I measured this by making the rxrpc_rx_packet
trace point print the time elapsed between the skb being timestamped and
the current time (in ns), e.g.:

	... 424.278721: rxrpc_rx_packet: ...  ACK 25026

So doing a 512MiB DIO read from my test server, with an unmodified kernel:

	N       min     max     sum		mean    stddev
	27605   2626    7581    7.83992e+07     2840.04 181.029

and with the patch applied:

	N       min     max     sum		mean    stddev
	27547   1895    12165   6.77461e+07     2459.29 255.02

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-08 15:45:18 +01:00
David S. Miller
e2a322a0c8 Merge branch 'net-smc-userspace-breakage-fixes'
Eugene Syromiatnikov says:

====================
net/smc: userspace breakage fixes

These two patches correct some userspace-affecting issues introduced
during the 4.19 development cycle, specifically:
 * The new structure "struct smcd_diag_dmbinfo" has been defined in a way
   that would lead to a different layout of the structure on most 32-bit
   ABIs compared with the layout on 64-bit ABIs;
 * One of the commits renamed a UAPI-exposed field.

Changes since v1:
 * Managed not to forget to add --cover-letter.
 * Commit ID format in commit message has been changed in accordance
   with Sergei Shtylyov's recommendations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-07 21:06:28 -07:00
Eugene Syromiatnikov
d4f0006a08 net/smc: retain old name for diag_mode field
Commit c601171d7a ("net/smc: provide smc mode in smc_diag.c") changed
the name of diag_fallback field of struct smc_diag_msg structure
to diag_mode.  However, this structure is a part of UAPI, and this change
breaks user space applications that use it ([1], for example).  Since
the new name is more suitable, convert the field to a union that provides
access to the data via both the new and the old name.

[1] https://gitlab.com/strace/strace/blob/v4.24/netlink_smc_diag.c#L165
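
The resulting UAPI layout keeps both names valid, roughly (a sketch of the
relevant part of struct smc_diag_msg):

	struct smc_diag_msg {
		__u8	diag_family;
		__u8	diag_state;
		union {
			__u8	diag_mode;
			__u8	diag_fallback;	/* old name, kept for compat */
		};
		__u8	diag_shutdown;
		/* ... remaining fields unchanged ... */
	};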

Fixes: c601171d7a ("net/smc: provide smc mode in smc_diag.c")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-07 21:06:28 -07:00
Eugene Syromiatnikov
a21048c8ec net/smc: use __aligned_u64 for 64-bit smc_diag fields
Commit 4b1b7d3b30 ("net/smc: add SMC-D diag support") introduced a
new UAPI-exposed structure, struct smcd_diag_dmbinfo.  However,
it's not usable by compat binaries, as it has a different layout there.
Probably, the most straightforward fix that will avoid similar issues
in the future is to use __aligned_u64 for 64-bit fields.
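
A sketch of the intended layout (field comments approximate), where
__aligned_u64 forces 8-byte alignment even on 32-bit ABIs whose natural
64-bit alignment is only 4 bytes:

	struct smcd_diag_dmbinfo {		/* SMC-D socket internals */
		__u32		linkid;		/* link identifier */
		__aligned_u64	peer_gid;	/* peer GID */
		__aligned_u64	my_gid;		/* own GID */
		__aligned_u64	token;		/* token of DMB */
		__aligned_u64	peer_token;	/* token of remote DMBE */
	};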

Fixes: 4b1b7d3b30 ("net/smc: add SMC-D diag support")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-07 21:06:28 -07:00
Al Viro
6d4c407744 net: sched: cls_u32: fix hnode refcounting
cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references
via ->hlist and via ->tp_root together.  u32_destroy() drops the former
and, in case when there had been links, leaves the sucker on the list.
As a result, there's nothing to protect it from getting freed once links
are dropped.
That also makes the "is it busy" check incapable of catching the root
hnode - it *is* busy (there's a reference from tp), but we don't see it as
something separate.  "Is it our root?" check partially covers that, but
the problem exists for others' roots as well.

AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
include oopsen, that is) would be this:
        * count tp->root and tp_c->hlist as separate references.  I.e.
have u32_init() set refcount to 2, not 1.
	* in u32_destroy() we always drop the former;
in u32_destroy_hnode() - the latter.

	That way we have *all* references contributing to refcount.  List
removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
and in u32_destroy() in case of tc_u_common going away, along with
everything reachable from it.  IOW, that way we know that
u32_destroy_key() won't free something still on the list (or pointed to by
someone's ->root).

Reproducer:

tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \
u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \
u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 100 \
handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \
plus 0 eat match ip protocol 6 ff
tc filter delete dev eth0 parent ffff: protocol ip prio 200
tc filter change dev eth0 parent ffff: protocol ip prio 100 \
handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \
eat match ip protocol 6 ff
tc filter delete dev eth0 parent ffff: protocol ip prio 100

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-07 21:02:37 -07:00
Jiri Kosina
7e823644b6 udp: Unbreak modules that rely on external __skb_recv_udp() availability
Commit 2276f58ac5 ("udp: use a separate rx queue for packet reception")
turned static inline __skb_recv_udp() from being a trivial helper around
__skb_recv_datagram() into a UDP-specific implementation, making it
EXPORT_SYMBOL_GPL() at the same time.

There are external modules that got broken by __skb_recv_udp() not being
visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().

Rationale (one of those) why this is actually "technically correct" thing
to do: __skb_recv_udp() used to be an inline wrapper around
__skb_recv_datagram(), which itself (still, and correctly so, I believe)
is EXPORT_SYMBOL().
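
The change itself is a one-line export adjustment in net/ipv4/udp.c
(sketched):

	/* was EXPORT_SYMBOL_GPL(__skb_recv_udp); relaxed so that non-GPL
	 * modules can link against it again */
	EXPORT_SYMBOL(__skb_recv_udp);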

Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-07 20:33:10 -07:00
Greg Kroah-Hartman
c1d84a1b42 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Dave writes:
  "Networking fixes:

  1) Fix truncation of 32-bit right shift in bpf, from Jann Horn.

  2) Fix memory leak in wireless wext compat, from Stefan Seyfried.

  3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao.

  4) Need to cancel pending work when unbinding in smsc75xx otherwise
     we oops, also from Yu Zhao.

  5) Don't allow enslaving a team device to itself, from Ido Schimmel.

  6) Fix backwards compat with older userspace for rtnetlink FDB dumps.
     From Mauricio Faria.

  7) Add validation of tc policy netlink attributes, from David Ahern.

  8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
  ipv6: take rcu lock in rawv6_send_hdrinc()
  net: sched: Add policy validation for tc attributes
  rtnetlink: fix rtnl_fdb_dump() for ndmsg header
  yam: fix a missing-check bug
  net: bpfilter: Fix type cast and pointer warnings
  net: cxgb3_main: fix a missing-check bug
  bpf: 32-bit RSH verification must truncate input before the ALU op
  net: phy: phylink: fix SFP interface autodetection
  be2net: don't flip hw_features when VXLANs are added/deleted
  net/packet: fix packet drop as of virtio gso
  net: dsa: b53: Keep CPU port as tagged in all VLANs
  openvswitch: load NAT helper
  bnxt_en: get the reduced max_irqs by the ones used by RDMA
  bnxt_en: free hwrm resources, if driver probe fails.
  bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request
  bnxt_en: Fix VNIC reservations on the PF.
  team: Forbid enslaving team device to itself
  net/usb: cancel pending work when unbinding smsc75xx
  mlxsw: spectrum: Delete RIF when VLAN device is removed
  ...
2018-10-06 02:11:30 -07:00
Greg Kroah-Hartman
091a1eaa0e Merge branch 'akpm'
* akpm:
  mm: madvise(MADV_DODUMP): allow hugetlbfs pages
  ocfs2: fix locking for res->tracking and dlm->tracking_list
  mm/vmscan.c: fix int overflow in callers of do_shrink_slab()
  mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
  mm/vmstat.c: fix outdated vmstat_text
  proc: restrict kernel stack dumps to root
  mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes
  mm/migrate.c: split only transparent huge pages when allocation fails
  ipc/shm.c: use ERR_CAST() for shm_lock() error return
  mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
  mm, thp: fix mlocking THP page with migration enabled
  ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
  hugetlb: take PMD sharing into account when flushing tlb/caches
  mm: migration: fix migration of huge PMD shared pages
2018-10-05 16:33:03 -07:00
Daniel Black
d41aa52523 mm: madvise(MADV_DODUMP): allow hugetlbfs pages
Reproducer, assuming 2M of hugetlbfs available:

Hugetlbfs mounted, size=2M and option user=testuser

  # mount | grep ^hugetlbfs
  hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
  # sysctl vm.nr_hugepages=1
  vm.nr_hugepages = 1
  # grep Huge /proc/meminfo
  AnonHugePages:         0 kB
  ShmemHugePages:        0 kB
  HugePages_Total:       1
  HugePages_Free:        1
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  Hugetlb:            2048 kB

Code:

  #include <sys/mman.h>
  #include <stddef.h>
  #define SIZE 2*1024*1024
  int main()
  {
    void *ptr;
    ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
    madvise(ptr, SIZE, MADV_DONTDUMP);
    madvise(ptr, SIZE, MADV_DODUMP);
  }

Compile and strace:

  mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
  madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
  madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

hugetlbfs pages have VM_DONTEXPAND set in VmFlags, like driver pages,
based on the author's testing with analysis from Florian Weimer[1].

The inclusion of VM_DONTEXPAND into the VM_SPECIAL definition was a
consequence of the widespread use of VM_DONTEXPAND in device drivers.

A consequence of [2] is that VM_DONTEXPAND-marked pages are unable to be
marked DODUMP.

A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
memory for a while and later request madvise(MADV_DODUMP) on the same
memory.  We correct this omission by allowing madvise(MADV_DODUMP) on
hugetlbfs pages.

[1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
[2] commit 0103bd16fb ("mm: prepare VM_DONTDUMP for using in drivers")

Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
Link: https://lists.launchpad.net/maria-discuss/msg05245.html
Fixes: 0103bd16fb ("mm: prepare VM_DONTDUMP for using in drivers")
Reported-by: Kenneth Penza <kpenza@gmail.com>
Signed-off-by: Daniel Black <daniel@linux.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Ashish Samant
cbe355f57c ocfs2: fix locking for res->tracking and dlm->tracking_list
In dlm_init_lockres() we access and modify res->tracking and
dlm->tracking_list without holding dlm->track_lock.  This can cause list
corruptions and can end up in kernel panic.

Fix this by locking res->tracking and dlm->tracking_list with
dlm->track_lock instead of dlm->spinlock.

Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Kirill Tkhai
b8e57efa2c mm/vmscan.c: fix int overflow in callers of do_shrink_slab()
do_shrink_slab() returns an unsigned long value, and placing it into an
int variable cuts the high bytes off.  Then we compare ret with 0xfffffffe
(since SHRINK_EMPTY is converted to ret's type).

Thus a large number of objects returned by do_shrink_slab() may be
interpreted as SHRINK_EMPTY if the low bytes of the value are equal to
0xfffffffe.  Fix that by declaring ret as unsigned long in these
functions.
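
A sketch of the corrected caller pattern (simplified):

	unsigned long ret;	/* was: int ret; the high bits must survive */

	ret = do_shrink_slab(&sc, shrinker, priority);
	if (ret == SHRINK_EMPTY)
		ret = 0;
	freed += ret;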

Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Jann Horn
58bc4c34d2 mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
5dd0b16cda ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
the kernel unconditional to reduce #ifdef soup, but (either to avoid
showing dummy zero counters to userspace, or because that code was missed)
didn't update the vmstat_array, meaning that all following counters would
be shown with incorrect values.

This only affects kernel builds with
CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
Fixes: 5dd0b16cda ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Jann Horn
28e2c4bb99 mm/vmstat.c: fix outdated vmstat_text
7a9cdebdcc ("mm: get rid of vmacache_flush_all() entirely") removed the
VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
entry in vmstat_text.  This causes an out-of-bounds access in
vmstat_show().

Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
is probably very rare.

Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
Fixes: 7a9cdebdcc ("mm: get rid of vmacache_flush_all() entirely")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Jann Horn
f8a00cef17 proc: restrict kernel stack dumps to root
Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU.  That means that
the stack can change under the stack walker.  The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers.  This can cause exposure of
kernel stack contents to userspace.

Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents.  See the added comment for a longer
rationale.

There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails.  Therefore, I believe
that this change is unlikely to break things.  In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.

Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Anshuman Khandual
20916d4636 mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes
ARM64 architecture also supports 32MB and 512MB HugeTLB page sizes.  This
just adds mmap() system call argument encoding for them.
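
The encoding is simply log2 of the page size (2^25 = 32MB, 2^29 = 512MB)
placed in the MAP_HUGE bit field, so the additions look roughly like this
(sketch):

	#define HUGETLB_FLAG_ENCODE_32MB	(25 << HUGETLB_FLAG_ENCODE_SHIFT)
	#define HUGETLB_FLAG_ENCODE_512MB	(29 << HUGETLB_FLAG_ENCODE_SHIFT)

	#define MAP_HUGE_32MB	HUGETLB_FLAG_ENCODE_32MB
	#define MAP_HUGE_512MB	HUGETLB_FLAG_ENCODE_512MB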

Link: http://lkml.kernel.org/r/1537841300-6979-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Anshuman Khandual
e6112fc300 mm/migrate.c: split only transparent huge pages when allocation fails
split_huge_page_to_list() fails on HugeTLB pages.  I was experimenting
with moving 32MB contig HugeTLB pages on arm64 (with a debug patch
applied) and hit the following stack trace when the kernel crashed.

[ 3732.462797] Call trace:
[ 3732.462835]  split_huge_page_to_list+0x3b0/0x858
[ 3732.462913]  migrate_pages+0x728/0xc20
[ 3732.462999]  soft_offline_page+0x448/0x8b0
[ 3732.463097]  __arm64_sys_madvise+0x724/0x850
[ 3732.463197]  el0_svc_handler+0x74/0x110
[ 3732.463297]  el0_svc+0x8/0xc
[ 3732.463347] Code: d1000400 f90b0e60 f2fbd5a2 a94982a1 (f9000420)

When unmap_and_move[_huge_page]() fails due to lack of memory, the
splitting should happen only for transparent huge pages, not for HugeTLB
pages.  PageTransHuge() returns true for both THP and HugeTLB pages.
Hence the conditional check should test the PageHuge() flag to make sure
that the given page is not a HugeTLB one.
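
A sketch of the adjusted check in the migration retry path (simplified):

	/* a HugeTLB page also reports PageTransHuge(), but only a real
	 * THP may be split here before retrying its base pages */
	if (PageTransHuge(page) && !PageHuge(page)) {
		lock_page(page);
		rc = split_huge_page_to_list(page, from);
		unlock_page(page);
	}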

Link: http://lkml.kernel.org/r/1537798495-4996-1-git-send-email-anshuman.khandual@arm.com
Fixes: 94723aafb9 ("mm: unclutter THP migration")
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Kees Cook
59cf0a9339 ipc/shm.c: use ERR_CAST() for shm_lock() error return
This uses ERR_CAST() instead of an open-coded cast, as it is casting
across structure pointers, which upsets __randomize_layout:

ipc/shm.c: In function `shm_lock':
ipc/shm.c:209:9: note: randstruct: casting between randomized structure pointer types (ssa): `struct shmid_kernel' and `struct kern_ipc_perm'

  return (void *)ipcp;
         ^~~~~~~~~~~~

Link: http://lkml.kernel.org/r/20180919180722.GA15073@beast
Fixes: 82061c57ce ("ipc: drop ipc_lock()")
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
YueHaibing
5189686457 mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
get_user_pages_fast() will return negative value if no pages were pinned,
then be converted to a unsigned, which is compared to zero, giving the
wrong result.
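
A sketch of the kind of change described (variable names here stand in for
the benchmark's locals):

	int nr;		/* keep it signed: get_user_pages_fast() may return -errno */

	nr = get_user_pages_fast(addr, nr_pages, write, pages + i);
	if (nr <= 0)
		break;
	i += nr;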

Link: http://lkml.kernel.org/r/20180921095015.26088-1-yuehaibing@huawei.com
Fixes: 09e35a4a1c ("mm/gup_benchmark: handle gup failures")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Kirill A. Shutemov
e125fe405a mm, thp: fix mlocking THP page with migration enabled
A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

Introduction of THP migration breaks[1] the rules around mlocking THP
pages.  If we have a single PMD mapping of the page in an mlocked VMA, the
page will get mlocked, regardless of PTE mappings of the page.

For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
remove_migration_pmd().

Anon THP pages can only be shared between processes via fork().  Mlocked
page can only be shared if parent mlocked it before forking, otherwise CoW
will be triggered on mlock().

For Anon-THP, we can fix the issue by munlocking the page on removing PTE
migration entry for the page.  PTEs for the page will always come after
mlocked PMD: rmap walks VMAs from oldest to newest.

Test-case:

	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <linux/mempolicy.h>
	#include <numaif.h>

	int main(void)
	{
	        unsigned long nodemask = 4;
	        void *addr;

		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

	        if (fork()) {
			wait(NULL);
			return 0;
	        }

	        mlock(addr, 4UL << 10);
	        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
	                &nodemask, 4, MPOL_MF_MOVE);

	        return 0;
	}

[1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
Fixes: 616b837153 ("mm: thp: enable thp migration in generic path")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Larry Chen
69eb7765b9 ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
is dirty.  When a page has not been written back, it is still in dirty
state.  If ocfs2_duplicate_clusters_by_page() is called against the dirty
page, the crash happens.

To fix this bug, we could just unlock the page and wait until it is no
longer dirty.

The following is the backtrace:

kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
[exception RIP: ocfs2_duplicate_clusters_by_page+822]
__ocfs2_move_extent+0x80/0x450 [ocfs2]
? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
__ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
ocfs2_move_extents+0x180/0x3b0 [ocfs2]
? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
ocfs2_ioctl+0x253/0x640 [ocfs2]
do_vfs_ioctl+0x90/0x5f0
SyS_ioctl+0x74/0x80
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Once we find the page is dirty, we do not wait until it's clean; rather,
we use write_one_page() to write it back.
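
A sketch of the dirty-page handling described above (simplified; the retry
label stands in for re-taking and re-checking the page):

	if (PageDirty(page)) {
		/* write_one_page() writes the locked page back and unlocks
		 * it on return, so just retry with a clean page afterwards */
		ret = write_one_page(page);
		if (ret)
			break;
		goto retry;
	}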

Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
[lchen@suse.com: update comments]
  Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Larry Chen <lchen@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Mike Kravetz
dff11abe28 hugetlb: take PMD sharing into account when flushing tlb/caches
When fixing an issue with PMD sharing and migration, it was discovered via
code inspection that other callers of huge_pmd_unshare potentially have an
issue with cache and tlb flushing.

Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst
case ranges for mmu notifiers.  Ensure that this range is flushed if
huge_pmd_unshare succeeds and unmaps a PUD_SIZE area.
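
A sketch of the pattern applied to the flush/notifier ranges (simplified):

	unsigned long mmun_start = start, mmun_end = end;

	/* a shared PMD spans a full PUD_SIZE area, so widen the range that
	 * mmu notifiers and the TLB flush will cover if unsharing happens */
	adjust_range_if_pmd_sharing_possible(vma, &mmun_start, &mmun_end);
	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);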

Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Mike Kravetz
017b1660df mm: migration: fix migration of huge PMD shared pages
The page migration code employs try_to_unmap() to try and unmap the source
page.  This is accomplished by using rmap_walk to find all vmas where the
page is mapped.  This search stops when page mapcount is zero.  For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings.  Shared mappings are tracked via the reference count of the PMD
page.  Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors.  DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages.  A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page.  After this, flush caches and TLB.

mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked.  Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.

Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
  Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes: 39dde65c99 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Greg Kroah-Hartman
5943a9bbbb pci-v4.19-fixes-3
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlu3zDkUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vxnHA/7BrOMAbvRkC9p4dDVM9y+kx9cyBB3
 B6Pq2o7G/2740frHrCx5wJkgh4VSXJF8PI8N6eBmpvUylcm6sNLHrEYq5Ui/QSdo
 /Q2CP+4bWZHTjRGLdOTJ4NYzCPA2NGvFsY78B8W/qaPSYoik/7FJ1Pg8diZA1TG6
 +HA4IIv/QeBPgGTYVtSBKirSdbapDCaa1g/XYPTKp8kWdB/KWmRZ1tkjk1Lds7Fd
 alYa/HIHowAsFwJg+yNh8PgVLz5IAuygeP098qTKxYvGgai5zKF1YorU/7bewJWx
 S/IyIa+/uPHGDgHrU+6foYrivehtqeqdZ4SENAE4zUX5J5xkdH1woYXvOBAGt6/N
 sv2KClNX1cCSBRfyyV1QO/ti0RfGUOl709NQyHDWVe23cqb1Qyi9nVT9DIgIwrbe
 v6+i10AfBMKMFXm7uit5y7sIykd8cYbK6EokDpHd87V72qYUU8VNYcmcOIMDj4LY
 fReC4COWCLiNf2pkYawfex1u+3mCOBr52ShlxzVVZ4+7/ft6kcKsPJnA34Ew/sue
 xdb35nyRdgh0P9M+jtExVyMsesauPAi0a9p7AI2DElOr6qo2vkxniiLXEalD5UQ6
 MTYUqm4LnV2ECnidabtPsyqR+77KRlQIB4lhYkqtTWLgIfj48Oaii50KP7K48ZxR
 9kc70QSxH5FtR3g=
 =0P87
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.19-fixes-3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Bjorn writes:
  "PCI fixes for v4.19:

   - Reprogram bridge prefetch registers to fix NVIDIA and Radeon issues
     after suspend/resume (Daniel Drake)

   - Fix mvebu I/O mapping creation sequence (Thomas Petazzoni)

   - Fix minor MAINTAINERS file match issue (Bjorn Helgaas)"

* tag 'pci-v4.19-fixes-3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  PCI: mvebu: Fix PCI I/O mapping creation sequence
  MAINTAINERS: Remove obsolete drivers/pci pattern from ACPI section
  PCI: Reprogram bridge prefetch registers on resume
2018-10-05 16:11:16 -07:00
Greg Kroah-Hartman
b98d6cb80b - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
4.19 merge window.
 
 - Fix leak and dangling pointer in DM multipath's scsi_dh related code.
 
 - A couple stable@ fixes for DM cache's resize support.
 
 - A DM raid fix to remove "const" from decipher_sync_action()'s return
   type.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbt7RQAAoJEMUj8QotnQNac2MIAK0p2Ndy7ErItA+qIlmBodsO
 juWGdnhBXTHQ9qoeIuamqA66g7LZ8oxnaiIQX8tqzt+GyzxgcjH+9C+Nn7qf8VLF
 4SkbI8/8CxBKmsKv5G5OI2tCfdfTXop9rOTg+3tF6BcTxXzPGeY+5mImsMN4KI/S
 1+hnq2xMtScdTDLyN0qKxs0+e8YdP7IJn2QAscLVNik0wdWn+7Hfp7Hg7tXcupyF
 WQ1dkY8kaGCUid167jAkCmY4etXHIfAb81EWxFtGb/GNGlFmc57t7HGa5mVaO3qv
 QXgp9u27q7v/wqzDHO68+mzAfiBHMFmJHa4uRg7gR1LrE9/0y3CDWC3zMYDL3Es=
 =Av3x
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Mike writes:
  "device mapper fixes

   - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
     4.19 merge window.

   - Fix leak and dangling pointer in DM multipath's scsi_dh related code.

   - A couple stable@ fixes for DM cache's resize support.

   - A DM raid fix to remove "const" from decipher_sync_action()'s return
     type."

* tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix resize crash if user doesn't reload cache table
  dm cache metadata: ignore hints array being too small during resize
  dm raid: remove bogus const from decipher_sync_action() return type
  dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer
  dm thin metadata: fix __udivdi3 undefined on 32-bit
2018-10-05 16:09:56 -07:00
Greg Kroah-Hartman
3830711f3e A single GPIO fix:
Free the last used descriptor, an off by one error.
 This is tagged for stable as well.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbt0SyAAoJEEEQszewGV1zv/4P/RMh6vgvl9Cl3YSuYCIxv0V8
 MzSeSdCcbo3WPV2YOtzfwuTfsa00TmHsUaAJ3tBidNxv/UElb041MJYhCHu2BKzl
 UXs3M1iK7CSt5hoeXSPTRjdoR/W0pBRI3iFTKkrkqCG+0aFxoE3HCmy6QVmLKfgB
 kajgmLrKCKq0CfT9xgbrw7Jm/qdauCceuH7rdOGplA8zc8Ply66jwagTPmD3lrGd
 RFCU7NPWJK+3SDj8f62/sr4Qvi1PghugDOOlzwbDp3ZSpzsLWxAqFaK2N1ae1UT5
 0RPbLEsVTcBsqYpVpVEwlwASkRu+vIUqRFkcUobOCcIU998R54H3PLI1G9XDPpkM
 i1LJPJY/Nm6I5jqU6Nguw4siimoMPFdklzleFrEPXV5GJibTGWgGwFItz4FL44SJ
 4jGop6Tl7E+H+lQ9mAbp5/oNm3JpgzU5wTTjRN1IHlkprLe0umcFyEO0PioKGGe2
 Rd4OeCpSd3ELTXCS9X9EGbKqamdybocjdJX9AA9qp0s+zshk9tIO4nzahRSzEtJi
 t2ClCfN6ApTNgpUiavfPrGGYub7XOf4IJlBk/7hQ238wn0uwuDO5FzTUQs/UUfXf
 ++RtTMQkv0MVH5smY4g6zcDY2xkZHOk+tc2G5V9uipGOdc2s0exk7tBWeZgTXt33
 0hc9Hv42j5hf/b7aApBm
 =PKfb
 -----END PGP SIGNATURE-----

Merge tag 'gpio-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Linus writes:
  "A single GPIO fix:
   Free the last used descriptor, an off by one error.
   This is tagged for stable as well."

* tag 'gpio-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpiolib: Free the last requested descriptor
2018-10-05 16:09:11 -07:00
Greg Kroah-Hartman
5aebc7d278 Power management fix for 4.19-rc7
Fix a bug that may cause runtime PM to misbehave for some devices
 after a failing or aborted system suspend which is nasty enough for
 an -rc7 time frame fix.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJbtzD9AAoJEILEb/54YlRxRCEP/jBooTFH6SKEY0EIvzpda6yV
 Pi3q8sMkcGJMm8/HRqRqpg6N03eUK/2sdoqQf2fRDoJfWa/NcZ3Vcq9gElN8JSgp
 vTYUYVANsD+je8NbWyV6WrwGRedlebw4nkeS+/uAbpEPwKNJv94H0YZ71JVZa/lE
 Mx6C+gClepKz3So4JKcfn0NNyLlAl6iCJOB/8WRHZIj17WJBsMSV+I4mfjexlbQD
 AloPeVYtsGSSsi/Y0aKbjcpE1i49zlFoS6Vqj4rfkXqbwGEaFEoQutIviO+hYsH6
 3DDUtoCF3w0haJJNaeAbB2XB+uVeIzxp4N5eq9mEWwZhKwkm37NsDWmV6D1oCQXl
 2nhDmj2gCypMRQr3/QvAareW3FxHrJqyEEcgnSdM7EX0Sc3EmXZF5T8teGUtZP2s
 w8byBKnTbvER9X6THZajlYym0owR5LybjXAtvb2Y9s5KCauw9Dw6n14dlPnZAHQo
 O5eqcDnQaQhfznU+ngYFa8RBdKKJPbORrma0yKRhs5S8CvdQci2nx4cwkaHw/pPR
 GSSV6yrL5ogEPLISXvITrg0iglHizxPonbzcEcfKljEd1+2Ai/oJZpM90BveaOvV
 t1IBY3eHFefW4tZnToS4fRy/WtamPIqOXFuIrnbU4KKeQWv7Yreb2NbPmM/okzr6
 SxKdPMUUmeyI2jxA03r2
 =wko2
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Rafael writes:
  "Power management fix for 4.19-rc7

   Fix a bug that may cause runtime PM to misbehave for some devices
   after a failing or aborted system suspend which is nasty enough for
   an -rc7 time frame fix."

* tag 'pm-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / core: Clear the direct_complete flag on errors
2018-10-05 16:08:12 -07:00
Greg Kroah-Hartman
31d099085d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Ingo writes:
  "perf fixes:
    - fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver
    - fix a CPU event enumeration bug in the x86 AMD PMU driver
    - fix a perf ring-buffer corruption bug when using tracepoints
    - fix a PMU unregister locking bug"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events
  perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX
  perf/ring_buffer: Prevent concurent ring buffer access
  perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0
  perf/core: Fix perf_pmu_unregister() locking
2018-10-05 16:07:13 -07:00
Greg Kroah-Hartman
247373b5dd Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Ingo writes:
  "x86 fixes:

   Misc fixes:

    - fix various vDSO bugs: asm constraints and retpolines
    - add vDSO test units to make sure they never re-appear
    - fix UV platform TSC initialization bug
    - fix build warning on Clang"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Fix vDSO syscall fallback asm constraint regression
  x86/cpu/amd: Remove unnecessary parentheses
  x86/vdso: Only enable vDSO retpolines when enabled and supported
  x86/tsc: Fix UV TSC initialization
  x86/platform/uv: Provide is_early_uv_system()
  selftests/x86: Add clock_gettime() tests to test_vdso
  x86/vdso: Fix asm constraints on vDSO syscall fallbacks
2018-10-05 15:40:57 -07:00
Greg Kroah-Hartman
8be673735e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Ingo writes:
  "scheduler fixes:

   These fixes address a rather involved performance regression between
   v4.17->v4.19 in the sched/numa auto-balancing code. Since distros
   really need this fix we accelerated it to sched/urgent for a faster
   upstream merge.

   NUMA scheduling and balancing performance is now largely back to
   v4.17 levels, without reintroducing the NUMA placement bugs that
   v4.18 and v4.19 fixed.

   Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for
   reporting, testing, re-testing and solving this rather complex set of
   bugs."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
  mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
  sched/numa: Avoid task migration for small NUMA improvement
  mm/migrate: Use spin_trylock() while resetting rate limit
  sched/numa: Limit the conditions where scan period is reset
  sched/numa: Reset scan rate whenever task moves across nodes
  sched/numa: Pass destination CPU as a parameter to migrate_task_rq
  sched/numa: Stop multiple tasks from moving to the CPU at the same time
2018-10-05 15:39:38 -07:00