Commit Graph

65187 Commits

Author SHA1 Message Date
Aurelien Aptel
2f58967979 cifs: multichannel: try to rebind when reconnecting a channel
first steps in trying to make channels properly reconnect.

* add cifs_ses_find_chan() function to find the enclosing cifs_chan
  struct it belongs to
* while we have the session lock and are redoing negprot and
  sess.setup in smb2_reconnect() redo the binding of channels.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Aurelien Aptel
8eec79540d cifs: multichannel: use pointer for binding channel
Add a cifs_chan pointer in struct cifs_ses that points to the channel
currently being bound if ses->binding is true.

Previously it was always the channel past the established count.

This will make reconnecting (and rebinding) a channel easier later on.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Steve French
edb1613536 smb3: remove static checker warning
Remove static checker warning pointed out by Dan Carpenter:

The patch feeaec621c09: "cifs: multichannel: move channel selection
above transport layer" from Apr 24, 2020, leads to the following
static checker warning:

        fs/cifs/smb2pdu.c:149 smb2_hdr_assemble()
        error: we previously assumed 'tcon->ses' could be null (see line 133)

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
CC: Aurelien Aptel <aptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Aurelien Aptel
352d96f3ac cifs: multichannel: move channel selection above transport layer
Move the channel (TCP_Server_Info*) selection from the tranport
layer to higher in the call stack so that:

- credit handling is done with the server that will actually be used
  to send.
  * ->wait_mtu_credit
  * ->set_credits / set_credits
  * ->add_credits / add_credits
  * add_credits_and_wake_if

- potential reconnection (smb2_reconnect) done when initializing a
  request is checked and done with the server that will actually be
  used to send.

To do this:

- remove the cifs_pick_channel() call out of compound_send_recv()

- select channel and pass it down by adding a cifs_pick_channel(ses)
  call in:
  - smb311_posix_mkdir
  - SMB2_open
  - SMB2_ioctl
  - __SMB2_close
  - query_info
  - SMB2_change_notify
  - SMB2_flush
  - smb2_async_readv  (if none provided in context param)
  - SMB2_read         (if none provided in context param)
  - smb2_async_writev (if none provided in context param)
  - SMB2_write        (if none provided in context param)
  - SMB2_query_directory
  - send_set_info
  - SMB2_oplock_break
  - SMB311_posix_qfs_info
  - SMB2_QFS_info
  - SMB2_QFS_attr
  - smb2_lockv
  - SMB2_lease_break
    - smb2_compound_op
  - smb2_set_ea
  - smb2_ioctl_query_info
  - smb2_query_dir_first
  - smb2_query_info_comound
  - smb2_query_symlink
  - cifs_writepages
  - cifs_write_from_iter
  - cifs_send_async_read
  - cifs_read
  - cifs_readpages

- add TCP_Server_Info *server param argument to:
  - cifs_send_recv
  - compound_send_recv
  - SMB2_open_init
  - SMB2_query_info_init
  - SMB2_set_info_init
  - SMB2_close_init
  - SMB2_ioctl_init
  - smb2_iotcl_req_init
  - SMB2_query_directory_init
  - SMB2_notify_init
  - SMB2_flush_init
  - build_qfs_info_req
  - smb2_hdr_assemble
  - smb2_reconnect
  - fill_small_buf
  - smb2_plain_req_init
  - __smb2_plain_req_init

The read/write codepath is different than the rest as it is using
pages, io iterators and async calls. To deal with those we add a
server pointer in the cifs_writedata/cifs_readdata/cifs_io_parms
context struct and set it in:

- cifs_writepages      (wdata)
- cifs_write_from_iter (wdata)
- cifs_readpages       (rdata)
- cifs_send_async_read (rdata)

The [rw]data->server pointer is eventually copied to
cifs_io_parms->server to pass it down to SMB2_read/SMB2_write.
If SMB2_read/SMB2_write is called from a different place that doesn't
set the server field it will pick a channel.

Some places do not pick a channel and just use ses->server or
cifs_ses_server(ses). All cifs_ses_server(ses) calls are in codepaths
involving negprot/sess.setup.

- SMB2_negotiate         (binding channel)
- SMB2_sess_alloc_buffer (binding channel)
- SMB2_echo              (uses provided one)
- SMB2_logoff            (uses master)
- SMB2_tdis              (uses master)

(list not exhaustive)

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Aurelien Aptel
7c06514afd cifs: multichannel: always zero struct cifs_io_parms
SMB2_read/SMB2_write check and use cifs_io_parms->server, which might
be uninitialized memory.

This change makes all callers zero-initialize the struct.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-04 13:50:55 -05:00
Kenneth D'souza
8e84a61a9c cifs: dump Security Type info in DebugData
Currently the end user is unaware with what sec type the
cifs share is mounted if no sec=<type> option is parsed.
With this patch one can easily check from DebugData.

Example:
1) Name: x.x.x.x Uses: 1 Capability: 0x8001f3fc	Session Status: 1 Security type: RawNTLMSSP

Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Aurelien Aptel <aaptel@suse.com>
2020-06-04 13:50:38 -05:00
Sahitya Tummala
e78790f84a f2fs: fix retry logic in f2fs_write_cache_pages()
In case a compressed file is getting overwritten, the current retry
logic doesn't include the current page to be retried now as it sets
the new start index as 0 and new end index as writeback_index - 1.
This causes the corresponding cluster to be uncompressed and written
as normal pages without compression. Fix this by allowing writeback to
be retried for the current page as well (in case of compressed page
getting retried due to index mismatch with cluster index). So that
this cluster can be written compressed in case of overwrite.

Also, align f2fs_write_cache_pages() according to the change -
<64081362e8ff>("mm/page-writeback.c: fix range_cyclic writeback vs
writepages deadlock").

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-04 11:45:09 -07:00
Jens Axboe
dddb3e26f6 io_uring: re-set iov base/len for buffer select retry
We already have the buffer selected, but we should set the iter list
again.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:45:29 -06:00
Pavel Begunkov
d2b6f48b69 io_uring: move send/recv IOPOLL check into prep
Fail recv/send in case of IORING_SETUP_IOPOLL earlier during prep,
so it'd be done only once. Removes duplication as well

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:14:19 -06:00
Pavel Begunkov
ec65fea5a8 io_uring: deduplicate io_openat{,2}_prep()
io_openat_prep() and io_openat2_prep() are identical except for how
struct open_how is built. Deduplicate it with a helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:14:19 -06:00
Pavel Begunkov
25e72d1012 io_uring: do build_open_how() only once
build_open_how() is just adjusting open_flags/mode. Do it once during
prep. It looks better than storing raw values for the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:14:19 -06:00
Pavel Begunkov
3232dd02af io_uring: fix {SQ,IO}POLL with unsupported opcodes
IORING_SETUP_IOPOLL is defined only for read/write, other opcodes should
be disallowed, otherwise it'll get an error as below. Also refuse
open/close with SQPOLL, as the polling thread wouldn't know which file
table to use.

RIP: 0010:io_iopoll_getevents+0x111/0x5a0
Call Trace:
 ? _raw_spin_unlock_irqrestore+0x24/0x40
 ? do_send_sig_info+0x64/0x90
 io_iopoll_reap_events.part.0+0x5e/0xa0
 io_ring_ctx_wait_and_kill+0x132/0x1c0
 io_uring_release+0x20/0x30
 __fput+0xcd/0x230
 ____fput+0xe/0x10
 task_work_run+0x67/0xa0
 do_exit+0x353/0xb10
 ? handle_mm_fault+0xd4/0x200
 ? syscall_trace_enter+0x18c/0x2c0
 do_group_exit+0x43/0xa0
 __x64_sys_exit_group+0x18/0x20
 do_syscall_64+0x60/0x1e0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: allow provide/remove buffers and files update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:13:53 -06:00
David Howells
8409f67b64 afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
Adjust the fileserver rotation algorithm so that if we've tried all the
addresses on a server (cumulatively over multiple operations) until we've
run out of untried addresses, immediately reprobe all that server's
interfaces and retry the op at least once before we move onto the next
server.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
32275d3f75 afs: Show more a bit more server state in /proc/net/afs/servers
Display more information about the state of a server record, including the
flags, rtt and break counter plus the probe state for each server in
/proc/net/afs/servers.

Rearrange the server flags a bit to make them easier to read at a glance in
the proc file.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
f3c130e6e6 afs: Don't use probe running state to make decisions outside probe code
Don't use the running state for fileserver probes to make decisions about
which server to use as the state is cleared at the start of a probe and
also intermediate values might be misleading.

Instead, add a separate 'latest known' rtt in the afs_server struct and a
flag to indicate if the server is known to be responding and update these
as and when we know what to change them to.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
f11a016a85 afs: Fix afs_statfs() to not let the values go below zero
Fix afs_statfs() so that the value for f_bavail and f_bfree don't go
"negative" if the number of blocks in use by a volume exceeds the max quota
for that volume.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:58 +01:00
David Howells
3c4c4075fc afs: Fix the by-UUID server tree to allow servers with the same UUID
Whilst it shouldn't happen, it is possible for multiple fileservers to
share a UUID, particularly if an entire cell has been duplicated, UUIDs and
all.  In such a case, it's not necessarily possible to map the effect of
the CB.InitCallBackState3 incoming RPC to a specific server unambiguously
by UUID and thus to a specific cell.

Indeed, there's a problem whereby multiple server records may need to
occupy the same spot in the rb_tree rooted in the afs_net struct.

Fix this by allowing servers to form a list, with the head of the list in
the tree.  When the front entry in the list is removed, the second in the
list just replaces it.  afs_init_callback_state() then just goes down the
line, poking each server in the list.

This means that some servers will be unnecessarily poked, unfortunately.
An alternative would be to route by call parameters.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
2020-06-04 15:37:57 +01:00
David Howells
20325960f8 afs: Reorganise volume and server trees to be rooted on the cell
Reorganise afs_volume objects such that they're in a tree keyed on volume
ID, rooted at on an afs_cell object rather than being in multiple trees,
each of which is rooted on an afs_server object.

afs_server structs become per-cell and acquire a pointer to the cell.

The process of breaking a callback then starts with finding the server by
its network address, following that to the cell and then looking up each
volume ID in the volume tree.

This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web
and allows those structs and the code for maintaining them to be simplified
or removed.

It does make a couple of things a bit more tricky, though:

 (1) Operations now start with a volume, not a server, so there can be more
     than one answer as to whether or not the server we'll end up using
     supports the FS.InlineBulkStatus RPC.

 (2) CB RPC operations that specify the server UUID.  There's still a tree
     of servers by UUID on the afs_net struct, but the UUIDs in it aren't
     guaranteed unique.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
cca37d45d5 afs: Add a tracepoint to track the lifetime of the afs_volume struct
Add a tracepoint to track the lifetime of the afs_volume struct.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
6dfdf5369c afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
YFS Volume Location servers have an operation by which the cell name may be
queried.  Use this to find out what a YFS server thinks the canonical cell
name should be.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
6ef350b184 afs: Detect cell aliases 2 - Cells with no root volumes
Implement the second phase of cell alias detection.  This part handles
alias detection for cells that don't have root.cell volumes and so we have
to find some other volume or fileserver to query.

We take the first volume from each such cell and attempt to look it up in
the new cell.  If found, we compare the records, if they are the same, we
judge the cell names to be aliases.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
8a070a9648 afs: Detect cell aliases 1 - Cells with root volumes
Put in the first phase of cell alias detection.  This part handles alias
detection for cells that have root.cell volumes (which is expected to be
likely).

When a cell becomes newly active, it is probed for its root.cell volume,
and if it has one, this volume is compared against other root.cell volumes
to find out if the list of fileserver UUIDs have any in common - and if
that's the case, do the address lists of those fileservers have any
addresses in common.  If they do, the new cell is adjudged to be an alias
of the old cell and the old cell is used instead.

Comparing is aided by the server list in struct afs_server_list being
sorted in UUID order and the addresses in the fileserver address lists
being sorted in address order.

The cell then retains the afs_volume object for the root.cell volume, even
if it's not mounted for future alias checking.

This necessary because:

 (1) Whilst fileservers have UUIDs that are meant to be globally unique, in
     practice they are not because cells get cloned without changing the
     UUIDs - so afs_server records need to be per cell.

 (2) Sometimes the DNS is used to make cell aliases - but if we don't know
     they're the same, we may end up with multiple superblocks and multiple
     afs_server records for the same thing, impairing our ability to
     deliver callback notifications of third party changes

 (3) The fileserver RPC API doesn't contain the cell name, so it can't tell
     us which cell it's notifying and can't see that a change made to to
     one cell should notify the same client that's also accessed as the
     other cell.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
c3e9f88826 afs: Implement client support for the YFSVL.GetCellName RPC op
Implement client support for the YFSVL.GetCellName RPC operation by which
YFS permits the canonical cell name to be queried from a VL server.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
194d28cf19 afs: Retain more of the VLDB record for alias detection
Save more bits from the volume location database record obtained for a
server so that we can use this information in cell alias detection.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
3120c170ef afs: Fix handling of CB.ProbeUuid cache manager op
The AFS filesystem driver is handling the CB.ProbeUuid request incorrectly.
The UUID presented in the request is that of the cache manager, not the
fileserver, so afs_deliver_cb_probe_uuid() shouldn't be using that UUID to
look up the server.

Fix this by looking up the server by address instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:57 +01:00
David Howells
44746355cc afs: Don't get epoch from a server because it may be ambiguous
Don't get the epoch from a server, particularly one that we're looking up
by UUID, as UUIDs may be ambiguous and may map to more than one server - so
we can't draw any conclusions from it.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:56 +01:00
David Howells
e49c7b2f6d afs: Build an abstraction around an "operation" concept
Turn the afs_operation struct into the main way that most fileserver
operations are managed.  Various things are added to the struct, including
the following:

 (1) All the parameters and results of the relevant operations are moved
     into it, removing corresponding fields from the afs_call struct.
     afs_call gets a pointer to the op.

 (2) The target volume is made the main focus of the operation, rather than
     the target vnode(s), and a bunch of op->vnode->volume are made
     op->volume instead.

 (3) Two vnode records are defined (op->file[]) for the vnode(s) involved
     in most operations.  The vnode record (struct afs_vnode_param)
     contains:

	- The vnode pointer.

	- The fid of the vnode to be included in the parameters or that was
          returned in the reply (eg. FS.MakeDir).

	- The status and callback information that may be returned in the
     	  reply about the vnode.

	- Callback break and data version tracking for detecting
          simultaneous third-parth changes.

 (4) Pointers to dentries to be updated with new inodes.

 (5) An operations table pointer.  The table includes pointers to functions
     for issuing AFS and YFS-variant RPCs, handling the success and abort
     of an operation and handling post-I/O-lock local editing of a
     directory.

To make this work, the following function restructuring is made:

 (A) The rotation loop that issues calls to fileservers that can be found
     in each function that wants to issue an RPC (such as afs_mkdir()) is
     extracted out into common code, in a new file called fs_operation.c.

 (B) The rotation loops, such as the one in afs_mkdir(), are replaced with
     a much smaller piece of code that allocates an operation, sets the
     parameters and then calls out to the common code to do the actual
     work.

 (C) The code for handling the success and failure of an operation are
     moved into operation functions (as (5) above) and these are called
     from the core code at appropriate times.

 (D) The pseudo inode getting stuff used by the dynamic root code is moved
     over into dynroot.c.

 (E) struct afs_iget_data is absorbed into the operation struct and
     afs_iget() expects to be given an op pointer and a vnode record.

 (F) Point (E) doesn't work for the root dir of a volume, but we know the
     FID in advance (it's always vnode 1, unique 1), so a separate inode
     getter, afs_root_iget(), is provided to special-case that.

 (G) The inode status init/update functions now also take an op and a vnode
     record.

 (H) The RPC marshalling functions now, for the most part, just take an
     afs_operation struct as their only argument.  All the data they need
     is held there.  The result delivery functions write their answers
     there as well.

 (I) The call is attached to the operation and then the operation core does
     the waiting.

And then the new operation code is, for the moment, made to just initialise
the operation, get the appropriate vnode I/O locks and do the same rotation
loop as before.

This lays the foundation for the following changes in the future:

 (*) Overhauling the rotation (again).

 (*) Support for asynchronous I/O, where the fileserver rotation must be
     done asynchronously also.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04 15:37:17 +01:00
Miklos Szeredi
74c6e384e9 ovl: make oip->index bool
ovl_get_inode() uses oip->index as a bool value, not as a pointer.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
b778e1ee1a ovl: only pass ->ki_flags to ovl_iocb_to_rwf()
Next patch will want to pass a modified set of flags, so...

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
df820f8de4 ovl: make private mounts longterm
Overlayfs is using clone_private_mount() to create internal mounts for
underlying layers.  These are used for operations requiring a path, such as
dentry_open().

Since these private mounts are not in any namespace they are treated as
short term, "detached" mounts and mntput() involves taking the global
mount_lock, which can result in serious cacheline pingpong.

Make these private mounts longterm instead, which trade the penalty on
mntput() for a slightly longer shutdown time due to an added RCU grace
period when putting these mounts.

Introduce a new helper kern_unmount_many() that can take care of multiple
longterm mounts with a single RCU grace period.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
b8e42a651b ovl: get rid of redundant members in struct ovl_fs
ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be
used instead of a separate ->upperdir_trap.

Split the lowerdir option early to get the number of layers, then allocate
the ->layers array, and finally fill the upper and lower layers, as before.

Get rid of path_put_init() in ovl_lower_dir(), since the only caller will
take care of that.

[Colin Ian King] Fix null pointer dereference on null stack pointer on
error return found by Coverity.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Miklos Szeredi
08f4c7c86d ovl: add accessor for ofs->upper_mnt
Next patch will remove ofs->upper_mnt, so add an accessor function for this
field.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Yuxuan Shui
520da69d26 ovl: initialize error in ovl_copy_xattr
In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
xattrs, the copy loop will terminate without assigning anything to the
error variable, thus returning an uninitialized value.

If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
value is put into a pointer by ERR_PTR(), causing potential invalid memory
accesses down the line.

This commit initialize error with 0. This is the correct value because when
there's no xattr to copy, because all xattrs are private, ovl_copy_xattr
should succeed.

This bug is discovered with the help of INIT_STACK_ALL and clang.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
Fixes: 0956254a2d ("ovl: don't copy up opaqueness")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-04 10:48:19 +02:00
Steve French
e80ddeb2f7 smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K
We were not checking to see if ioctl requests asked for more than
64K (ie when CIFSMaxBufSize was > 64K) so when setting larger
CIFSMaxBufSize then ioctls would fail with invalid parameter errors.
When requests ask for more than 64K in MaxOutputResponse then we
need to ask for more than 1 credit.

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-04 01:13:41 -05:00
Steve French
1ee0e6d47d smb3: default to minimum of two channels when multichannel specified
When "multichannel" is specified on mount, make sure to default to
at least two channels.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-06-04 01:13:37 -05:00
Linus Torvalds
ee01c4d72a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "More mm/ work, plenty more to come

  Subsystems affected by this patch series: slub, memcg, gup, kasan,
  pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
  thp, mmap, kconfig"

* akpm: (131 commits)
  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  riscv: support DEBUG_WX
  mm: add DEBUG_WX support
  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
  powerpc/mm: drop platform defined pmd_mknotpresent()
  mm: thp: don't need to drain lru cache when splitting and mlocking THP
  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
  sparc32: register memory occupied by kernel as memblock.memory
  include/linux/memblock.h: fix minor typo and unclear comment
  mm, mempolicy: fix up gup usage in lookup_node
  tools/vm/page_owner_sort.c: filter out unneeded line
  mm: swap: memcg: fix memcg stats for huge pages
  mm: swap: fix vmstats for huge pages
  mm: vmscan: limit the range of LRU type balancing
  mm: vmscan: reclaim writepage is IO cost
  mm: vmscan: determine anon/file pressure balance at the reclaim root
  mm: balance LRU lists based on relative thrashing
  mm: only count actual rotations as LRU reclaim cost
  ...
2020-06-03 20:24:15 -07:00
Jan Kara
6b8ed62008 ext4: avoid unnecessary transaction starts during writeback
ext4_writepages() currently works in a loop like:
  start a transaction
  scan inode for pages to write
  map and submit these pages
  stop the transaction

This loop results in starting transaction once more than is needed
because in the last iteration we start a transaction only to scan the
inode and find there are no pages to write. This can be significant
increase in number of transaction starts for single-extent files or
files that have all blocks already mapped. Furthermore we already know
from previous iteration whether there are more pages to write or not. So
propagate the information from mpage_prepare_extent_to_map() and avoid
unnecessary looping in case there are no more pages to write.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200525081215.29451-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:56 -04:00
Jens Axboe
6e014c621e ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
Running with some debug patches to detect illegal blocking triggered the
extend/unaligned condition in ext4. If ext4 needs to extend the file (and
hence go to buffered IO), or if the app is doing unaligned IO, then ext4
asks the iomap code to wait for IO completion. If the caller asked for
no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN
instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
ba98890393 ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
access_ok just checks we are fed a proper user pointer.  We also do that
in copy_to_user itself, so no need to do this early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-10-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
c7d216e8c4 fs: remove the access_ok() check in ioctl_fiemap
access_ok just checks we are fed a proper user pointer.  We also do that
in copy_to_user itself, so no need to do this early.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-9-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
45dd052e67 fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is
handled once instead of duplicated, but can still be done under fs locks,
like xfs/iomap intended with its duplicate handling.  Also make sure the
error value of filemap_write_and_wait is propagated to user space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
cddf8a2c4a fs: move fiemap range validation into the file systems instances
Replace fiemap_check_flags with a fiemap_prep helper that also takes the
inode and mapped range, and performs the sanity check and truncation
previously done in fiemap_check_range.  This way the validation is inside
the file system itself and thus properly works for the stacked overlayfs
case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
2732881894 iomap: fix the iomap_fiemap prototype
iomap_fiemap should take u64 start and len arguments, just like the
->fiemap prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
10c5db2864 fs: move the fiemap definitions out of fs.h
No need to pull the fiemap definitions into almost every file in the
kernel build.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Christoph Hellwig
44ebcd06bb fs: mark __generic_block_fiemap static
There is no caller left outside of ioctl.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Christoph Hellwig
da565e792b ext4: remove the call to fiemap_check_flags in ext4_fiemap
iomap_fiemap already calls fiemap_check_flags first thing, so this
additional check is redundant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Christoph Hellwig
03a5ed24c9 ext4: split _ext4_fiemap
The fiemap and EXT4_IOC_GET_ES_CACHE cases share almost no code, so split
them into entirely separate functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200523073016.2944131-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Christoph Hellwig
328e24ae14 ext4: fix fiemap size checks for bitmap files
Add an extra validation of the len parameter, as for ext4 some files
might have smaller file size limits than others.  This also means the
redundant size check in ext4_ioctl_get_es_cache can go away, as all
size checking is done in the shared fiemap handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Ritesh Harjani
175efa81fe ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
ext4 supports max number of logical blocks in a file to be 0xffffffff.
(This is since ext4_extent's ee_block is __le32).
This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting
from 0 logical offset). This patch fixes this.

The issue was seen when ext4 moved to iomap_fiemap API and when
overlayfs was mounted on top of ext4. Since overlayfs was missing
filemap_check_ranges(), so it could pass a arbitrary huge length which
lead to overflow of map.m_len logic.

This patch fixes that.

Fixes: d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework")
Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Jonathan Grant
9f364e1d95 add comment for ext4_dir_entry_2 file_type member
Signed-off-by: Jonathan Grant <jg@jguk.org>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/ad3290d5-86af-99c1-f9d5-cd1bab710429@jguk.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:54 -04:00
Jan Kara
14ff628630 jbd2: avoid leaking transaction credits when unreserving handle
When reserved transaction handle is unused, we subtract its reserved
credits in __jbd2_journal_unreserve_handle() called from
jbd2_journal_stop(). However this function forgets to remove reserved
credits from transaction->t_outstanding_credits and thus the transaction
space that was reserved remains effectively leaked. The leaked
transaction space can be quite significant in some cases and leads to
unnecessarily small transactions and thus reducing throughput of the
journalling machinery. E.g. fsmark workload creating lots of 4k files
was observed to have about 20% lower throughput due to this when ext4 is
mounted with dioread_nolock mount option.

Subtract reserved credits from t_outstanding_credits as well.

CC: stable@vger.kernel.org
Fixes: 8f7d89f368 ("jbd2: transaction reservation support")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Jan Kara
dfcd4489e2 ext4: drop ext4_journal_free_reserved()
Remove ext4_journal_free_reserved() function. It is never used.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200520133119.1383-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
993778306e ext4: mballoc: use lock for checking free blocks while retrying
Currently while doing block allocation grp->bb_free may be getting
modified if discard is happening in parallel.
For e.g. consider a case where there are lot of threads who have
preallocated lot of blocks and there is a thread which is trying
to discard all of this group's PA. Now it could happen that
we see all of those group's bb_free is zero and fail the allocation
while there is sufficient space if we free up all the PA.

So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists).
So during retry attempt to allocate blocks we will use ext4_lock_group()
for checking if the group is good or not.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
8ef123fe02 ext4: mballoc: refactor ext4_mb_good_group()
ext4_mb_good_group() definition was changed some time back
and now it even initializes the buddy cache (via ext4_mb_init_group()),
if in case the EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
under a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held
and again called after loading the bitmap, only this time with
ext4_lock_group() held.
But still this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock() which should be
called when without holding ext4_lock_group().
Also in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involves grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
07b5b8e1ac ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider for a case where we have less number of groups
(for e.g. just group 0),
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call for ext4_mb_discard_group_preallocations()).
But that is wrong, since 2nd thread should have waited for 1st thread
to release all the PAs and should have retried for allocation.
Since 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can be well argued that why don't just check for grp->bb_free to
see if there are any free blocks to be allocated. So here are the two
concerns which were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for allocation logic (say for e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result into infinte looping since grp->bb_free is
   non-zero.

2. Also before preallocation was clubbed with block allocation with the
   same ext4_lock_group() held, there were lot of races where grp->bb_free
   could not be reliably relied upon.
Due to above, this patch considers discard_pa_seq logic to determine if
we should retry for block allocation. Say if there are are n threads
trying for block allocation and none of those could allocate or discard
any of the blocks, then all of those n threads will fail the block
allocation and return -ENOSPC error. (Since the seq counter for all of
those will match as no block allocation/discard was done during that
duration).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
cf5e2ca6c9 ext4: mballoc: refactor ext4_mb_discard_preallocations()
Implement ext4_mb_discard_preallocations_should_retry()
which we will need in later patches to add more logic
like check for sequence number match to see if we should
retry for block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
53f86b170d ext4: mballoc: add blocks to PA list under same spinlock after allocating blocks
ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list
of every group to discard the group's PA to free up the space if
allocation request fails. Consider below race:-

Process A  				Process B

1. allocate blocks
					1. Fails block allocation from
					     ext4_mb_regular_allocator()
   ext4_lock_group()
	allocated blocks
	more than ac_o_ex.fe_len
   ext4_unlock_group()
					2. Scans the
					   grp->bb_prealloc_list (under
					   ext4_lock_group()) and
					   find nothing and thus return
					   -ENOSPC.

2. Add the additional blocks to PA list

   ext4_lock_group()
   	add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

Above race could be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time with block allocation when
ext4_lock_group() was still held.
With this discard-PA will know if there are actually any blocks which
could be freed from the PA

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Eric Biggers
de8ff14cab ext4: add casefold flag to EXT4_INODE_* flags
No one currently needs EXT4_INODE_CASEFOLD, but add it to keep the
EXT4_INODE_* definitions in sync with the EXT4_*_FL definitions.

Also make it clearer that the casefold flag is only for directories.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200510215252.87833-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Eric Whitney
8ad8d71003 ext4: rework map struct instantiation in ext4_ext_map_blocks()
The path performing block allocations in ext4_ext_map_blocks() contains
code trimming the length of a new extent that is repeated later
in the function.  This code is both redundant and unnecessary as the
exact length of the new extent has already been calculated.  Rewrite the
instantiation of the map struct in this case to use the available
values, avoiding the overhead of unnecessary conversions and improving
clarity.  Add another map struct instantiation tailored specifically to
the separate case for an existing written extent.  Remove an old comment
that no longer appears applicable to the current code.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200510155805.18808-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
70aa1554b0 ext4: make ext_debug() implementation to use pr_debug()
ext_debug() msgs could be helpful, provided those could be enabled
without recompiling kernel and also if we could selectively enable
only required prints for case by case debugging.

So make ext_debug() implementation use pr_debug().
Also change ext_debug() to be defined with CONFIG_EXT4_DEBUG.
So EXT_DEBUG macro now mostly remain for below 3 functions.
ext4_ext_show_path/leaf/move() (whose print msgs use ext_debug()
which again could be dynamically enabled using pr_debug())

This also changes the ext_debug() to take inode as a parameter
to add inode no. in all of it's msgs.
Prints additional info like process name / pid, superblock id etc.
This also removes any explicit function names passed in ext_debug().
Since ext_debug() on it's own prints file, func and line no.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
d3df14535f ext4: mballoc: make mb_debug() implementation to use pr_debug()
mb_debug() msg had only 1 control level for all type of msgs.
And if we enable mballoc_debug then all of those msgs would be enabled.
Instead of adding multiple debug levels for mb_debug() msgs, use
pr_debug() with which we could have finer control to print msgs at all
of different levels (i.e. at file, func, line no.).

Also add process name/pid, superblk id, and other info in mb_debug()
msg. This also kills the mballoc_debug module parameter, since it is
not needed any more.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
8ec2d31b27 ext4: replace EXT_DEBUG with __maybe_unused in ext4_ext_handle_unwritten_extents()
Replace EXT_DEBUG with __maybe_unused from inside
ext4_ext_handle_unwritten_extents() function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/ae335b94506cd9db9d2648c1f4dd25a80f9f3ce2.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
ec8c60be96 ext4: improve ext_debug() msg in case of block allocation failure
ext4_map_blocks() has ext_debug msg early at the start of function.
We also get ext_debug msg if we could allocate a block from
ext4_ext_map_blocks(). But there is no ext_debug() msg in case of
block allocation failure. So add one along with error code.

Also add more info in ext_debug() msg like how many blocks were allocated
v/s how many were requested in ext4_ext_map_blocks().

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1610ec2aa932396be00f9d552fe29da473ead176.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
6db0746189 ext4: use BIT() macro for BH_** state bits
Simply use BIT() macro for all BH_** state bits instead of open
coding it.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/57667689f51a3f9dba2fcef7d3425187fa3ba69f.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
9bee5779ee ext4: balloc: use task_pid_nr() helper
Use task_pid_nr() function instead of current->pid.
There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/4b58403e15e9c8deb34a1b93deb3fc9cd153ab84.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
eb2b8ebb86 ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK
Make sure to check for e4b->bd_info->bb_bitmap == NULL, in
mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr
dereference. Similar to how we do this in other ifdef DOUBLE_CHECK
functions.

Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap()
fails. We should simply mark grp->bb_bitmap as NULL if above happens.
In fact ext4_read_block_bitmap() may even return an error in case of resize
ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may trigger
this).

Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
a345021553 ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function
This patch implemets mb_group_bb_bitmap_alloc() and
mb_group_bb_bitmap_free() function to remove #ifdef DOUBLE_CHECK macro
and it's related code from inside
ext4_mb_add_groupinfo()/ext4_mb_release().

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
4fca8f0779 ext4: mballoc: make ext4_mb_use_preallocated() return type as bool
Change return type of function ext4_mb_use_preallocated() to bool to
better reflect what this function can return.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
f283529aba ext4: mballoc: simplify error handling in ext4_init_mballoc()
This patch simplifies error handling logic in ext4_init_mballoc(),
by adding all the cleanups at one place at the end of that function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
004379d0b0 ext4: mballoc: fix few other format specifier in mb_debug()
Fix few other format specifiers in mb_debug() msgs.
As such no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
36bad4233c ext4: mballoc: correct the mb_debug() format specifier for pa_len var
pa->pa_len is an integer. Fix all of the format specifier used in
mb_debug() for pa_len to %d instead of %u.

As such no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
bbc4ec77e9 ext4: mballoc: add more mb_debug() msgs
This patch adds some more debugging mb_debug() msgs to help improve
mballoc code debugging.
Other than adding more mb_debug() msgs at few more places,
there should be no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
e68cf40c0d ext4: mballoc: refactor ext4_mb_show_ac()
This factors out ext4_mb_show_pa() function to show all the group's
preallocation info. This could be useful info to be added in later
patches.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8f07d890b0038dcc935e9c10e6043ec9f3792721.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
212da3ec6f ext4: mballoc: print bb_free info even when it is 0
Improve the debugging msg by also printing even if bb_free is 0.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c894f1d1d30f86ae38f4e3a861949665b6dc61cd.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Theodore Ts'o
73c384c0cd ext4: avoid ext4_error()'s caused by ENOMEM in the truncate path
We can't fail in the truncate path without requiring an fsck.
Add work around for this by using a combination of retry loops
and the __GFP_NOFAIL flag.

From: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Anna Pendleton <pendleton@google.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200507175028.15061-1-pendleton@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Eric Biggers
08adf452e6 ext4: fix race between ext4_sync_parent() and rename()
'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is
broken because without d_lock, d_parent can be concurrently changed due
to a rename().  Then if the old directory is immediately deleted, old
d_parent->inode can be NULL.  That causes a NULL dereference in igrab().

To fix this, use dget_parent() to safely grab a reference to the parent
dentry, which pins the inode.  This also eliminates the need to use
d_find_any_alias() other than for the initial inode, as we no longer
throw away the dentry at each step.

This is an extremely hard race to hit, but it is possible.  Adding a
udelay() in between the reads of ->d_parent and its ->d_inode makes it
reproducible on a no-journal filesystem using the following program:

    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        if (fork()) {
            for (;;) {
                mkdir("dir1", 0700);
                int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC);
                write(fd, "X", 1);
                close(fd);
            }
        } else {
            mkdir("dir2", 0700);
            for (;;) {
                rename("dir1/file", "dir2/file");
                rmdir("dir1");
            }
        }
    }

Fixes: d59729f4e7 ("ext4: fix races in ext4_sync_parent()")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Christophe JAILLET
80dd4978dd ext4: fix a typo in a comment
s/extnets/extents/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20200503200647.154701-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
be809e1274 ext4: clean up ext4_ext_convert_to_initialized() error handling
If ext4_ext_convert_to_initialized() fails when called within
ext4_ext_handle_unwritten_extents(), immediately error out through the
exit point at function end.  Fix the error handling in the event
ext4_ext_convert_to_initialized() returns 0, which it shouldn't do when
converting an existing extent.  The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-5-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
779e26517b ext4: clean up GET_BLOCKS_PRE_IO error handling
If the call to ext4_split_convert_extents() fails in the
EXT4_GET_BLOCKS_PRE_IO case within ext4_ext_handle_unwritten_extents(),
error out through the exit point at function end rather than jumping
through an intermediate point.  Fix the error handling in the event
ext4_split_convert_extents() returns 0, which it shouldn't do when
splitting an existing extent.  The current code returns the passed in
value of allocated (which is likely non-zero) while failing to set
m_flags, m_pblk, and m_len.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-4-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
bee6cf00c7 ext4: remove redundant GET_BLOCKS_CONVERT code
Remove the redundant code assigning values to ext4_map_blocks components
in ext4_ext_handle_unwritten_extents() for the EXT4_GET_BLOCKS_CONVERT
case, using the code at the function exit instead.  Clean up and reorder
that code to eliminate more redundancy and improve readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200430185320.23001-3-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Eric Whitney
ee802f8689 ext4: remove dead GET_BLOCKS_ZERO code
There's no call to ext4_map_blocks() in the current ext4 code with a
flags argument that combines EXT4_GET_BLOCKS_CONVERT and
EXT4_GET_BLOCKS_ZERO.  Remove the code that corresponds to this case
from ext4_ext_handle_unwritten_extents().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200430185320.23001-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Harshad Shirwadkar
b60ca3343e ext4: don't ignore return values from ext4_ext_dirty()
Don't ignore return values from ext4_ext_dirty, since the errors
indicate valid failures below Ext4.  In all of the other instances of
ext4_ext_dirty calls, the error return value is handled in some
way. This patch makes those remaining couple of places to handle
ext4_ext_dirty errors as well. In case of ext4_split_extent_at(), the
ignorance of return value is intentional. The reason is that we are
already in error path and there isn't much we can do if ext4_ext_dirty
returns error. This patch adds a comment for that case explaining why
we ignore the return value.

In the longer run, we probably should
make sure that errors from other mark_dirty routines are handled as
well.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200427013438.219117-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Harshad Shirwadkar
4209ae12b1 ext4: handle ext4_mark_inode_dirty errors
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
value may lead ext4 to ignore real failures that would result in
corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
as soon as possible and return errors to the caller whenever
appropriate.

One of the possible scnearios when this bug could affected is that
while creating a new inode, its directory entry gets added
successfully but while writing the inode itself mark_inode_dirty
returns error which is ignored. This would result in inconsistency
that the directory entry points to a non-existent inode.

Ran gce-xfstests smoke tests and verified that there were no
regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:50 -04:00
Jeffle Xu
8418897f1b ext4: fix error pointer dereference
Don't pass error pointers to brelse().

commit 7159a986b4 ("ext4: fix some error pointer dereferences") has fixed
some cases, fix the remaining one case.

Once ext4_xattr_block_find()->ext4_sb_bread() failed, error pointer is
stored in @bs->bh, which will be passed to brelse() in the cleanup
routine of ext4_xattr_set_handle(). This will then cause a NULL panic
crash in __brelse().

BUG: unable to handle kernel NULL pointer dereference at 000000000000005b
RIP: 0010:__brelse+0x1b/0x50
Call Trace:
 ext4_xattr_set_handle+0x163/0x5d0
 ext4_xattr_set+0x95/0x110
 __vfs_setxattr+0x6b/0x80
 __vfs_setxattr_noperm+0x68/0x1b0
 vfs_setxattr+0xa0/0xb0
 setxattr+0x12c/0x1a0
 path_setxattr+0x8d/0xc0
 __x64_sys_setxattr+0x27/0x30
 do_syscall_64+0x60/0x250
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this case, @bs->bh stores '-EIO' actually.

Fixes: fb265c9cb4 ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: stable@kernel.org # 2.6.19
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Jan Kara
ceff86fdda ext4: Avoid freeing inodes on dirty list
When we are evicting inode with journalled data, we may race with
transaction commit in the following way:

CPU0					CPU1
jbd2_journal_commit_transaction()	evict(inode)
					  inode_io_list_del()
					  inode_wait_for_writeback()
  process BJ_Forget list
    __jbd2_journal_insert_checkpoint()
    __jbd2_journal_refile_buffer()
      __jbd2_journal_unfile_buffer()
        if (test_clear_buffer_jbddirty(bh))
          mark_buffer_dirty(bh)
	    __mark_inode_dirty(inode)
					  ext4_evict_inode(inode)
					    frees the inode

This results in use-after-free issues in the writeback code (or
the assertion added in the previous commit triggering).

Fix the problem by removing inode from writeback lists once all the page
cache is evicted and so inode cannot be added to writeback lists again.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Jan Kara
4301efa4c7 writeback: Export inode_io_list_del()
Ext4 needs to remove inode from writeback lists after it is out of
visibility of its journalling machinery (which can still dirty the
inode). Export inode_io_list_del() for it.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Xiyu Yang
3bbd0ef260 ext4: fix buffer_head refcnt leak when ext4_iget() fails
ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a
reference of the specified buffer_head object to "bitmap_bh" with
increased refcnt.

When ext4_orphan_get() returns, local variable "bitmap_bh" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
ext4_orphan_get(). When ext4_iget() fails, the function forgets to
decrease the refcnt increased by ext4_read_inode_bitmap(), causing a
refcnt leak.

Fix this issue by calling brelse() when ext4_iget() fails.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Harshad Shirwadkar
c36a71b4e3 ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned
(-1) resulting in illegal memory accesses. Although there is no
consistent repro, we see that generic/019 sometimes crashes because of
this bug.

Ran gce-xfstests smoke and verified that there were no regressions.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-06-03 23:16:49 -04:00
Jason Yan
39c0ae163f ext4: remove unnecessary comparisons to bool
Fix the following coccicheck warning:

fs/ext4/extents_status.c:1057:5-28: WARNING: Comparison to bool
fs/ext4/inode.c:2314:18-24: WARNING: Comparison to bool

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200420042918.19459-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:49 -04:00
Eric Whitney
9e52484c71 ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag
The eofblocks code was removed in the 5.7 release by "ext4: remove
EOFBLOCKS_FL and associated code" (4337ecd1fe).  The ext4_map_blocks()
flag used to trigger it can now be removed as well.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:48 -04:00
Carlos Guerrero Álvarez
6b6aeffc93 ext4: fix a style issue in fs/ext4/acl.c
Fixed an if statement where braces were not needed.

Link: https://lore.kernel.org/r/20200416141456.1089-1-carlosteniswarrior@gmail.com
Signed-off-by: Carlos Guerrero Álvarez <carlosteniswarrior@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-06-03 23:16:48 -04:00
Shijie Hu
8859025315 hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
In a 32-bit program, running on arm64 architecture.  When the address
space below mmap base is completely exhausted, shmat() for huge pages will
return ENOMEM, but shmat() for normal pages can still success on no-legacy
mode.  This seems not fair.

For normal pages, the calling trace of get_unmapped_area() is:

	=> mm->get_unmapped_area()
	if on legacy mode,
		=> arch_get_unmapped_area()
			=> vm_unmapped_area()
	if on no-legacy mode,
		=> arch_get_unmapped_area_topdown()
			=> vm_unmapped_area()

For huge pages, the calling trace of get_unmapped_area() is:

	=> file->f_op->get_unmapped_area()
		=> hugetlb_get_unmapped_area()
			=> vm_unmapped_area()

To solve this issue, we only need to make hugetlb_get_unmapped_area() take
the same way as mm->get_unmapped_area().  Add *bottomup() and *topdown()
for hugetlbfs, and check current mm->get_unmapped_area() to decide which
one to use.  If mm->get_unmapped_area is equal to
arch_get_unmapped_area_topdown(), hugetlb_get_unmapped_area() calls
topdown routine, otherwise calls bottomup routine.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Shijie Hu <hushijie3@huawei.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: ChenGang <cg.chen@huawei.com>
Cc: Chen Jie <chenjie6@huawei.com>
Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:49 -07:00
Johannes Weiner
6058eaec81 mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
They're the same function, and for the purpose of all callers they are
equivalent to lru_cache_add().

[akpm@linux-foundation.org: fix it for local_lock changes]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Linus Torvalds
ae03c53d00 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "Christoph's assorted splice cleanups"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: rename pipe_buf ->steal to ->try_steal
  fs: make the pipe_buf_operations ->confirm operation optional
  fs: make the pipe_buf_operations ->steal operation optional
  trace: remove tracing_pipe_buf_ops
  pipe: merge anon_pipe_buf*_ops
  fs: simplify do_splice_from
  fs: simplify do_splice_to
2020-06-03 15:52:19 -07:00
Al Viro
8861fd576e binfmt_flat: don't use __put_user()
... and check the return value

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:58:53 -04:00
Al Viro
0abb013e2e binfmt_elf_fdpic: don't use __... uaccess primitives
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:58:11 -04:00
Al Viro
646e84deb4 binfmt_elf: don't bother with __{put,copy_to}_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:56:47 -04:00
Linus Torvalds
e7c93cbfe9 threads-v5.8
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXtYhfgAKCRCRxhvAZXjc
 oghSAP9uVX3vxYtEtNvu9WtEn1uYZcSKZoF1YrcgY7UfSmna0gEAruzyZcai4CJL
 WKv+4aRq2oYk+hsqZDycAxIsEgWvNg8=
 =ZWj3
 -----END PGP SIGNATURE-----

Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread updates from Christian Brauner:
 "We have been discussing using pidfds to attach to namespaces for quite
  a while and the patches have in one form or another already existed
  for about a year. But I wanted to wait to see how the general api
  would be received and adopted.

  This contains the changes to make it possible to use pidfds to attach
  to the namespaces of a process, i.e. they can be passed as the first
  argument to the setns() syscall.

  When only a single namespace type is specified the semantics are
  equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
  equals setns(pidfd, CLONE_NEWNET).

  However, when a pidfd is passed, multiple namespace flags can be
  specified in the second setns() argument and setns() will attach the
  caller to all the specified namespaces all at once or to none of them.

  Specifying 0 is not valid together with a pidfd. Here are just two
  obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

  Allowing to also attach subsets of namespaces supports various
  use-cases where callers setns to a subset of namespaces to retain
  privilege, perform an action and then re-attach another subset of
  namespaces.

  Apart from significantly reducing the number of syscalls needed to
  attach to all currently supported namespaces (eight "open+setns"
  sequences vs just a single "setns()"), this also allows atomic setns
  to a set of namespaces, i.e. either attaching to all namespaces
  succeeds or we fail without having changed anything.

  This is centered around a new internal struct nsset which holds all
  information necessary for a task to switch to a new set of namespaces
  atomically. Fwiw, with this change a pidfd becomes the only token
  needed to interact with a container. I'm expecting this to be
  picked-up by util-linux for nsenter rather soon.

  Associated with this change is a shiny new test-suite dedicated to
  setns() (for pidfds and nsfds alike)"

* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests/pidfd: add pidfd setns tests
  nsproxy: attach to namespaces via pidfds
  nsproxy: add struct nsset
2020-06-03 13:12:57 -07:00
J. Bruce Fields
c25bf185e5 nfsd: safer handling of corrupted c_type
This can only happen if there's a bug somewhere, so let's make it a WARN
not a printk.  Also, I think it's safest to ignore the corruption rather
than trying to fix it by removing a cache entry.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-06-03 11:12:32 -04:00
Chengguang Xu
1434a65ea6 ovl: drop negative dentry in upper layer
Negative dentries of upper layer are useless after construction of
overlayfs' own dentry and may keep in the memory long time even after
unmount of overlayfs instance. This patch tries to drop unnecessary
negative dentry of upper layer to effectively reclaim memory.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-03 09:45:22 +02:00
Miklos Szeredi
05acefb487 ovl: check permission to open real file
Call inode_permission() on real inode before opening regular file on one of
the underlying layers.

In some cases ovl_permission() already checks access to an underlying file,
but it misses the metacopy case, and possibly other ones as well.

Removing the redundant permission check from ovl_permission() should be
considered later.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-03 09:45:22 +02:00
Miklos Szeredi
292f902a40 ovl: call secutiry hook in ovl_real_ioctl()
Verify LSM permissions for underlying file, since vfs_ioctl() doesn't do
it.

[Stephen Rothwell] export security_file_ioctl

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-03 09:45:18 +02:00
Linus Torvalds
d6f9469a03 Changes since last update:
- Convert to use the new mount apis;
 
  - Some random cleanup patches.
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXtbfOhUcaHNpYW5na2Fv
 QHJlZGhhdC5jb20ACgkQOTcx3B+15gTvZgD6Ap8mYxRaW7Qta+HEyFuyRrxWZ/XZ
 pq/hYiouGosDdaMBAOUNl8pGlPX54T+Y9VZv0wV0Dp4pan6NApdgtL9fIQUE
 =QhQh
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "The most interesting part is the new mount api conversion, which is
  actually a old patch already pending for several cycles. And the
  others are recent trivial cleanups here.

  Summary:

   - Convert to use the new mount apis

   - Some random cleanup patches"

* tag 'erofs-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: suppress false positive last_block warning
  erofs: convert to use the new mount fs_context api
  erofs: code cleanup by removing ifdef macro surrounding
2020-06-02 20:16:55 -07:00
Linus Torvalds
cadf32234b Replace zero-length array in JFS
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAl7WbbAACgkQNqiEXrVA
 jGRBvg//aff/q785MyIKlv2icnWJTFIfkFhdKBfD36OI5Obur3ko17gJiRQ7BvIl
 07yBTOpcqYyq4pThFXA8TfiZyRxKv0faHeFYmtmEum6gZNtHPmpvfy1JFr12vjHZ
 SWSpylW6nprQJa61KwGHcRqTGuWe0sco6BBAQ/E015wJLKrtIpNCH5H3x3eDv8we
 nBPSHSu1BsJFTcf94ECZhjbsiLgLgTaYVFUU8tw7eTp7EkZz9xq6wYmhZ6e5Z6Ki
 AGbIQ/c16Nk5B3/8A1bU0pm7BHz9Zx1ucNy4KZqyFl2fWEmQiv1ofa3cfLI+ZAF2
 /TFzLW+rpvesc+Byk7PgzRrLNXAMBk+botzdlN8JPaPDFYcm6whnRKstmjYGAzCI
 /bpTI9wyEPzlfWtt5h2ws1KaP5HNl1OLd3sJCzfwH+nKaeJpdBBSaZRQ0yZWIsg5
 hzt9WnJ10351sbLbAA4ybsOlWJvhPtSo8jSiGUGzu7aFyfa7H82nslWHuFjeI3F0
 qgcA5j6LJ8qAviMnhxX5jrqIMNZzyf+P5/oJvXjfSeWft1yD6EUqRqb7i0FREjh6
 qzxiX6xhRSh8AIr3XlToG3W7KULQkBJKwZumSTQ2X03yJRhelqKAs3LvvyXoZuG3
 oFvGgiGNph0VSTTPdOSrZZ3tOCDiONXfHHOFpghk4qKdWgFoWtg=
 =yNwD
 -----END PGP SIGNATURE-----

Merge tag 'jfs-5.8' of git://github.com/kleikamp/linux-shaggy

Pull JFS update from David Kleikamp:
 "Replace zero-length array in JFS"

* tag 'jfs-5.8' of git://github.com/kleikamp/linux-shaggy:
  jfs: Replace zero-length array with flexible-array member
2020-06-02 20:11:35 -07:00
Linus Torvalds
f3cdc8ae11 for-5.8-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7U50AACgkQxWXV+ddt
 WDtK1g//RXeNsTguYQr1N9R5eUPThjLEI0+4J0l4SYfCPU8Ou3C7nqpOEJJQgm8F
 ezZE+16cWi9U5uGueOc+w0rfyz4AuIXKgzoz+c0/GG2+yV5jp6DsAMbWqojAb96L
 V/N3HxEzR66jqwgVUBE/x5okb2SyY7//B1l/O0amc66XDO7KTMImpIwThere6zWZ
 o2SNpYpHAPQeUYJQx8h+FAW3w1CxrCZmnifazU9Jqe9J7QeQLg7rbUlJDV38jySm
 ZOA8ohKN9U1gPZy+dTU3kdyyuBIq1etkIaSPJANyTo5TczPKiC0IMg75cXtS4ae/
 NSxhccMpSIjVMcIHARzSFGYKNP3sGNRsmaTUg/2Cx/9GoHOhYMiCAVc8qtBBpwJO
 UI0siexrCe64RuTBMRRc128GdFv7IjmSImcdi8xaR62bCcUiNdEa3zvjRe/9tOEH
 ET7Z85oBnKpSzpC3MdhSUU4dtHY5XLawP8z3oUU1VSzSWM2DVjlHf79/VzbOfp18
 miCVpt94lCn/gUX7el6qcnbuvMAjDyeC6HmfD+TwzQgGwyV6TLgKN9lRXeH/Oy6/
 VgjGQSavGHMll3zIGURmrBCXKudjJg0J+IP4wN1TimmSEMfwKH+7tnekQd8y5qlF
 eXEIqlWNykKeDzEnmV9QJy+/cV83hVWM/mUslcTx39tLN/3B/Us=
 =qTt8
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Highlights:

   - speedup dead root detection during orphan cleanup, eg. when there
     are many deleted subvolumes waiting to be cleaned, the trees are
     now looked up in radix tree instead of a O(N^2) search

   - snapshot creation with inherited qgroup will mark the qgroup
     inconsistent, requires a rescan

   - send will emit file capabilities after chown, this produces a
     stream that does not need postprocessing to set the capabilities
     again

   - direct io ported to iomap infrastructure, cleaned up and simplified
     code, notably removing last use of struct buffer_head in btrfs code

  Core changes:

   - factor out backreference iteration, to be used by ordinary
     backreferences and relocation code

   - improved global block reserve utilization
      * better logic to serialize requests
      * increased maximum available for unlink
      * improved handling on large pages (64K)

   - direct io cleanups and fixes
      * simplify layering, where cloned bios were unnecessarily created
        for some cases
      * error handling fixes (submit, endio)
      * remove repair worker thread, used to avoid deadlocks during
        repair

   - refactored block group reading code, preparatory work for new type
     of block group storage that should improve mount time on large
     filesystems

  Cleanups:

   - cleaned up (and slightly sped up) set/get helpers for metadata data
     structure members

   - root bit REF_COWS got renamed to SHAREABLE to reflect the that the
     blocks of the tree get shared either among subvolumes or with the
     relocation trees

  Fixes:

   - when subvolume deletion fails due to ENOSPC, the filesystem is not
     turned read-only

   - device scan deals with devices from other filesystems that changed
     ownership due to overwrite (mkfs)

   - fix a race between scrub and block group removal/allocation

   - fix long standing bug of a runaway balance operation, printing the
     same line to the syslog, caused by a stale status bit on a reloc
     tree that prevented progress

   - fix corrupt log due to concurrent fsync of inodes with shared
     extents

   - fix space underflow for NODATACOW and buffered writes when it for
     some reason needs to fallback to COW mode"

* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
  btrfs: fix space_info bytes_may_use underflow during space cache writeout
  btrfs: fix space_info bytes_may_use underflow after nocow buffered write
  btrfs: fix wrong file range cleanup after an error filling dealloc range
  btrfs: remove redundant local variable in read_block_for_search
  btrfs: open code key_search
  btrfs: split btrfs_direct_IO to read and write part
  btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  fs: remove dio_end_io()
  btrfs: switch to iomap_dio_rw() for dio
  iomap: remove lockdep_assert_held()
  iomap: add a filesystem hook for direct I/O bio submission
  fs: export generic_file_buffered_read()
  btrfs: turn space cache writeout failure messages into debug messages
  btrfs: include error on messages about failure to write space/inode caches
  btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
  btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
  btrfs: make checksum item extension more efficient
  btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
  btrfs: unexport btrfs_compress_set_level()
  btrfs: simplify iget helpers
  ...
2020-06-02 19:59:25 -07:00
Linus Torvalds
8eeae5bae1 (More) new code for 5.8:
- Introduce DONTCACHE flags for dentries and inodes.  This hint will
   cause the VFS to drop the associated objects immediately after the
   last put, so that we can change the file access mode (DAX or page
   cache) on the fly.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl68FowACgkQ+H93GTRK
 tOtNzA/9FkXXQYAlTWK/toHfJV8DQT/Kx1fvf8Ng0EphBUQa/rNzlcMzFg7Gw5Cs
 Rzis96+xj4q//iseLZN5LLxaoxqT2Qipza0GWCMJpQG/4wTWM0Ar7BnG/Vc87lUV
 F0mXnILZOUMFzr8Zj9q4ka6UGRTDSXXtwNXqBuPpIZyVbMQvPtXHhM3lWV5RUQwm
 fznBxDAEGoVXiyID2OrZD5tS4BMd16uFWAWLjWphpcy18zfC7zp0+0MQik4v/9oi
 54pZdtPT9/dQOu/BI8tfLP45XzZ6f++gXy2p/G96dy7ism1u40ML77ojEkadVVFe
 Bf7t+EswNxrx/em/ugWbcJDtrxttSqU47g2AXsbJJB2+aHCih6Cfid41lMyRvlhR
 d4cumoteX7IF/PpT3YaKHWQBo5OxHK0a2CBPd6czrCBw5yXrEUagdmw1XQ//bw5e
 FRCg4eMcEW0UgINvBCHWdWRx6VaL8ngMMsflVJ/lY7FeVvM10ZYRFzJoryoebSPm
 /yWcoHFsTPC8K0nWVmbwPazVE19I0g4y6Wiw39YvZDzZRzM9PcQI4DBxQcab+Va/
 FPfXEXkpz0GiC6zjs/QfkPtg60GI1IG5Um4JUzdv6ce1P0p1rGcu5WiNYearahE7
 7V/44WGIEAd4NP7R0JPTI0Fqv7v6uuDzMoCp7YDn8gE4FCJTt6M=
 =ebl3
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull DAX updates part two from Darrick Wong:
 "This time around, we're hoisting the DONTCACHE flag from XFS into the
  VFS so that we can make the incore DAX mode changes become effective
  sooner.

  We can't change the file data access mode on a live inode because we
  don't have a safe way to change the file ops pointers. The incore
  state change becomes effective at inode loading time, which can happen
  if the inode is evicted. Therefore, we're making it so that
  filesystems can ask the VFS to evict the inode as soon as the last
  holder drops.

  The per-fs changes to make this call this will be in subsequent pull
  requests from Ted and myself.

  Summary:

   - Introduce DONTCACHE flags for dentries and inodes. This hint will
     cause the VFS to drop the associated objects immediately after the
     last put, so that we can change the file access mode (DAX or page
     cache) on the fly"

* tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs: Introduce DCACHE_DONTCACHE
  fs: Lift XFS_IDONTCACHE to the VFS layer
2020-06-02 19:48:41 -07:00
Linus Torvalds
96ed320d52 New code for 5.8:
- Clean up io_is_direct.
 - Add a new statx flag to indicate when file data access is being done
   via DAX (as opposed to the page cache).
 - Update the documentation for how system administrators and application
   programmers can take advantage of the (still experimental DAX) feature.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl6wO7UACgkQ+H93GTRK
 tOtEaw//eShC2YE0S+GS7ihQ3x71PJa4Is0VZOIpTHl01aqMSegwB3QbDQbVUhn1
 TzLhUw4pZsz3R9GbUOrfHOYRt+aSP2t0WNhIulDeBp41CYJQSaFt85KnfM9hoBOi
 VYssum3Lu7/6ReKrDD/mumzWYkts+JDCuXRmt7nOQeZJVNXOCBBbvN354V4/IKLY
 wB4Wnaq3f3gYniXYW/23aCX+kocaOIUZtK6aFKyeD0KvfP5toDlpw1cBVMoM9CmO
 bmEy8vKf4lgFZDLeDMqmWOecMgEH5h0baN5Psu13WuDCiCd6maBl0KpxVpVlwsep
 yVz6mMbZjmLOJ2lqyw+lZb+XicD+K3yRVSTGKxV3VbuRjeX9tjVG5Im13VesNvJB
 WWJq/CkOU8W0Zs7Q5RbUDGbFFWDJSI/OStAU+UeuWvL9Gndv7hqv6H904qbPPtEu
 4m4Y34ARzrEaKpkABKKwQ53cLClNxmmgUN9N3cXK3mk8idlX4zM3j6+HJYUxXTO+
 fBjhOlyUy2KaWmzZoJp28QvaU4iegGmMSuRnQ9HAvXmdxUA2K6+wjS6LCZGh04vz
 z7SbzTBlo2kvsKdRMwJ306s2QA0/HvmKHHLI+p8OQANce9hjhE3XdJayzhitd0fk
 k0D/y8OY+fbCSgI8C4g66lA8Zf2sos/ulD0QTTNBPfU2rRWkKUc=
 =rS19
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull DAX updates part one from Darrick Wong:
 "After many years of LKML-wrangling about how to enable programs to
  query and influence the file data access mode (DAX) when a filesystem
  resides on storage devices such as persistent memory, Ira Weiny has
  emerged with a proposed set of standard behaviors that has not been
  shot down by anyone! We're more or less standardizing on the current
  XFS behavior and adapting ext4 to do the same.

  This is the first of a handful pull requests that will make ext4 and
  XFS present a consistent interface for user programs that care about
  DAX. We add a statx attribute that programs can check to see if DAX is
  enabled on a particular file. Then, we update the DAX documentation to
  spell out the user-visible behaviors that filesystems will guarantee
  (until the next storage industry shakeup). The on-disk inode flag has
  been in XFS for a few years now.

  Summary:

   - Clean up io_is_direct.

   - Add a new statx flag to indicate when file data access is being
     done via DAX (as opposed to the page cache).

   - Update the documentation for how system administrators and
     application programmers can take advantage of the (still
     experimental DAX) feature"

Link: https://lore.kernel.org/lkml/20200505002016.1085071-1-ira.weiny@intel.com/

* tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  Documentation/dax: Update Usage section
  fs/stat: Define DAX statx attribute
  fs: Remove unneeded IS_DAX() check in io_is_direct()
2020-06-02 19:45:12 -07:00
Linus Torvalds
16d91548d1 New code for 5.8:
- Various cleanups to remove dead code, unnecessary conditionals,
       asserts, etc.
     - Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
       redundantly.
     - Tighten up our dmesg logging to ensure that everything is prefixed
       with 'XFS' for easier grepping.
     - Kill a bunch of typedefs.
     - Refactor the deferred ops code to reduce indirect function calls.
     - Increase type-safety with the deferred ops code.
     - Make the DAX mount options a tri-state.
     - Fix some error handling problems in the inode flush code and clean up
       other inode flush warts.
     - Refactor log recovery so that each log item recovery functions now live
       with the other log item processing code.
     - Fix some SPDX forms.
     - Fix quota counter corruption if the fs crashes after running
       quotacheck but before any dquots get logged.
     - Don't fail metadata verification on zero-entry attr leaf blocks, since
       they're just part of the disk format now due to a historic lack of log
       atomicity.
     - Don't allow SWAPEXT between files with different [ugp]id when quotas
       are enabled.
     - Refactor inode fork reading and verification to run directly from the
       inode-from-disk function.  This means that we now actually guarantee
       that _iget'ted inodes are totally verified and ready to go.
     - Move the incore inode fork format and extent counts to the ifork
       structure.
     - Scalability improvements by reducing cacheline pingponging in
       struct xfs_mount.
     - More scalability improvements by removing m_active_trans from the
       hot path.
     - Fix inode counter update sanity checking to run /only/ on debug
       kernels.
     - Fix longstanding inconsistency in what error code we return when a
       program hits project quota limits (ENOSPC).
     - Fix group quota returning the wrong error code when a program hits
       group quota limits.
     - Fix per-type quota limits and grace periods for group and project
       quotas so that they actually work.
     - Allow extension of individual grace periods.
     - Refactor the non-reclaim inode radix tree walking code to remove a
       bunch of stupid little functions and straighten out the
       inconsistent naming schemes.
     - Fix a bug in speculative preallocation where we measured a new
       allocation based on the last extent mapping in the file instead of
       looking farther for the last contiguous space allocation.
     - Force delalloc writes to unwritten extents.  This closes a
       stale disk contents exposure vector if the system goes down before
       the write completes.
     - More lockdep whackamole.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7OjhgACgkQ+H93GTRK
 tOuGeBAApuP9ohtvrJT9FW7U+OrRsK3lw/3R+MEYpJu8GKLpGbJ6j+SKrTHxxLvu
 Rp63YLIlHBOz2rNa4brm/wW8gGJIGXOnGpuiGq0Irl01xEmwqmjOLfLcYkYhno1E
 i+rG0PiKYZeo/xhLtTKGl+NAwHHxmbOmxUtYHnbinHtPzDyYLQ0wff+oUkmQ7ydg
 bMYFMXohoJ3Pc5UjmUrCuJj1cvYOUwl0P4LGKiq5Zud61AkBCSskEpk+oo5xFcEX
 JJc1xkn5MPi+oGpSYqhnSZ6aSjwp53/i44O9volp5vCRXXv1eLVni2u/ScZ85L72
 HXxoDyuZOUupirIfMBQFHsazDGPGyFIqtPhGlXoTJjrwX+ymimY6CU/0e+Xu9DEu
 krlxajfUssH30zyG2q/2TaxslU35CROH6hVBXFe0Y5cEEsOIf2aOpErUhhw2YyS7
 onN9gb2NBBQdYtHqIMwsbhcgq60g5H6JfGriB5dJimXXLmpuTfAREGCY2AqIoB1x
 +8QFod0WwsMn6FYhi/UpZjC9qp/WTvojBUEt8Ci3ketUFwO1CLf9qm6Hj71RL3fs
 fCEDHx/ZMMft7Bdbf36lICoMAhF/KfNcRn1PsQdpW4LY1Aml/7qjFNZthSVRDW+E
 rhzNu+RIzGEQsSemBvccRaaTP3HFqN+qPATu2K0sALaa1LRFxzQ=
 =/NYc
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "Most of the changes this cycle are refactoring of existing code in
  preparation for things landing in the future.

  We also fixed various problems and deficiencies in the quota
  implementation, and (I hope) the last of the stale read vectors by
  forcing write allocations to go through the unwritten state until the
  write completes.

  Summary:

   - Various cleanups to remove dead code, unnecessary conditionals,
     asserts, etc.

   - Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
     redundantly.

   - Tighten up our dmesg logging to ensure that everything is prefixed
     with 'XFS' for easier grepping.

   - Kill a bunch of typedefs.

   - Refactor the deferred ops code to reduce indirect function calls.

   - Increase type-safety with the deferred ops code.

   - Make the DAX mount options a tri-state.

   - Fix some error handling problems in the inode flush code and clean
     up other inode flush warts.

   - Refactor log recovery so that each log item recovery functions now
     live with the other log item processing code.

   - Fix some SPDX forms.

   - Fix quota counter corruption if the fs crashes after running
     quotacheck but before any dquots get logged.

   - Don't fail metadata verification on zero-entry attr leaf blocks,
     since they're just part of the disk format now due to a historic
     lack of log atomicity.

   - Don't allow SWAPEXT between files with different [ugp]id when
     quotas are enabled.

   - Refactor inode fork reading and verification to run directly from
     the inode-from-disk function. This means that we now actually
     guarantee that _iget'ted inodes are totally verified and ready to
     go.

   - Move the incore inode fork format and extent counts to the ifork
     structure.

   - Scalability improvements by reducing cacheline pingponging in
     struct xfs_mount.

   - More scalability improvements by removing m_active_trans from the
     hot path.

   - Fix inode counter update sanity checking to run /only/ on debug
     kernels.

   - Fix longstanding inconsistency in what error code we return when a
     program hits project quota limits (ENOSPC).

   - Fix group quota returning the wrong error code when a program hits
     group quota limits.

   - Fix per-type quota limits and grace periods for group and project
     quotas so that they actually work.

   - Allow extension of individual grace periods.

   - Refactor the non-reclaim inode radix tree walking code to remove a
     bunch of stupid little functions and straighten out the
     inconsistent naming schemes.

   - Fix a bug in speculative preallocation where we measured a new
     allocation based on the last extent mapping in the file instead of
     looking farther for the last contiguous space allocation.

   - Force delalloc writes to unwritten extents. This closes a stale
     disk contents exposure vector if the system goes down before the
     write completes.

   - More lockdep whackamole"

* tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (129 commits)
  xfs: more lockdep whackamole with kmem_alloc*
  xfs: force writes to delalloc regions to unwritten
  xfs: refactor xfs_iomap_prealloc_size
  xfs: measure all contiguous previous extents for prealloc size
  xfs: don't fail unwritten extent conversion on writeback due to edquot
  xfs: rearrange xfs_inode_walk_ag parameters
  xfs: straighten out all the naming around incore inode tree walks
  xfs: move xfs_inode_ag_iterator to be closer to the perag walking code
  xfs: use bool for done in xfs_inode_ag_walk
  xfs: fix inode ag walk predicate function return values
  xfs: refactor eofb matching into a single helper
  xfs: remove __xfs_icache_free_eofblocks
  xfs: remove flags argument from xfs_inode_ag_walk
  xfs: remove xfs_inode_ag_iterator_flags
  xfs: remove unused xfs_inode_ag_iterator function
  xfs: replace open-coded XFS_ICI_NO_TAG
  xfs: move eofblocks conversion function to xfs_ioctl.c
  xfs: allow individual quota grace period extension
  xfs: per-type quota timers and warn limits
  xfs: switch xfs_get_defquota to take explicit type
  ...
2020-06-02 19:21:40 -07:00
Jens Axboe
fd2206e4e9 io_uring: disallow close of ring itself
A previous commit enabled this functionality, which also enabled O_PATH
to work correctly with io_uring. But we can't safely close the ring
itself, as the file handle isn't reference counted inside
io_uring_enter(). Instead of jumping through hoops to enable ring
closure, add a "soft" ->needs_file option, ->needs_file_no_error. This
enables O_PATH file descriptors to work, but still catches the case of
trying to close the ring itself.

Reported-by: Jann Horn <jannh@google.com>
Fixes: 904fbcb115 ("io_uring: remove 'fd is io_uring' from close path")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-02 17:22:24 -06:00
Linus Torvalds
1ee08de1e2 for-5.8/io_uring-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VP+kQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuK4D/0XsSG/Yirbba1rrbqw/qpw9xcAs9oyN0tS
 8SmmGN27ghrkVSsGBXNcG+PSTu3pkkLjYZ6TQtKamrya9G+lRAsKRsQ+Yq+7Qv4e
 N6lCUlLJ99KqTMtwvIoxSpA1tz3ENHucOw2cJrw3kd9G0kil7GvDkIOBasd+kmwn
 ak+mnMJZzRhqSM7M5lKQOk8l92gKBHGbPy4xKb0st3dQkYptDvit0KcNSAuevtOp
 sRZpdbXaT3FA6xa5iEgggI6vZQGVmK1EaGoQqZ8vgVo75aovkjZyQWWiFVVOlEqr
 QjUCCQuixcbMRbZjgpojqva5nmLhFVhLCfoSH2XgttEQZhmTwypdRwM2/IlxV5q2
 xCofrDkhYOfIgHkuP6p68ukIPIfQ+4jotvsmXZ/HeD/xbx3TRyJRZadISr6wiuLm
 7zRXWaGCYomUIPJOOrpBQ9FsCglkaN63oB6VGuGKTg3g7kE2QrZ2/aGuexP+FAdh
 OrA8BlzxZzpqMKhjQVKOl9r6FU928MZn8nIAkMdQ/Ia1mOpb4rrPo4qCdf+tbhPO
 pmKtQPQjbszQ3UfTgShvfvDk43BeRim1DxZPFTauSu1FMpqWBCwQgXMynPFrf5TR
 HXF61G+jw5swDW6uJgW7bXdm7hHr15vRqQr54MgGS+T0OOa1df9MR0dJB5CGklfI
 ycLU6AAT+A==
 =A/qA
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "A relatively quiet round, mostly just fixes and code improvements. In
particular:

   - Make statx just use the generic statx handler, instead of open
     coding it. We don't need that anymore, as we always call it async
     safe (Bijan)

   - Enable closing of the ring itself. Also fixes O_PATH closure (me)

   - Properly name completion members (me)

   - Batch reap of dead file registrations (me)

   - Allow IORING_OP_POLL with double waitqueues (me)

   - Add tee(2) support (Pavel)

   - Remove double off read (Pavel)

   - Fix overflow cancellations (Pavel)

   - Improve CQ timeouts (Pavel)

   - Async defer drain fixes (Pavel)

   - Add support for enabling/disabling notifications on a registered
     eventfd (Stefano)

   - Remove dead state parameter (Xiaoguang)

   - Disable SQPOLL submit on dying ctx (Xiaoguang)

   - Various code cleanups"

* tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits)
  io_uring: fix overflowed reqs cancellation
  io_uring: off timeouts based only on completions
  io_uring: move timeouts flushing to a helper
  statx: hide interfaces no longer used by io_uring
  io_uring: call statx directly
  statx: allow system call to be invoked from io_uring
  io_uring: add io_statx structure
  io_uring: get rid of manual punting in io_close
  io_uring: separate DRAIN flushing into a cold path
  io_uring: don't re-read sqe->off in timeout_prep()
  io_uring: simplify io_timeout locking
  io_uring: fix flush req->refs underflow
  io_uring: don't submit sqes when ctx->refs is dying
  io_uring: async task poll trigger cleanup
  io_uring: add tee(2) support
  splice: export do_tee()
  io_uring: don't repeat valid flag list
  io_uring: rename io_file_put()
  io_uring: remove req->needs_fixed_files
  io_uring: cleanup io_poll_remove_one() logic
  ...
2020-06-02 15:42:50 -07:00
Linus Torvalds
bce159d734 for-5.8/drivers-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ
 JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl
 /M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj
 +WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX
 KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj
 FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42
 Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL
 xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8
 z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY
 UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh
 Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0
 QcTxsXXXIw==
 =H+/Z
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "On top of the core changes, here are the block driver changes for this
  merge window:

   - NVMe changes:
        - NVMe over Fibre Channel protocol updates, which also reach
          over to drivers/scsi/lpfc (James Smart)
        - namespace revalidation support on the target (Anthony
          Iliopoulos)
        - gcc zero length array fix (Arnd Bergmann)
        - nvmet cleanups (Chaitanya Kulkarni)
        - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
        - use a SRQ per completion vector (Max Gurtovoy)
        - fix handling of runtime changes to the queue count (Weiping
          Zhang)
        - t10 protection information support for nvme-rdma and
          nvmet-rdma (Israel Rukshin and Max Gurtovoy)
        - target side AEN improvements (Chaitanya Kulkarni)
        - various fixes and minor improvements all over, icluding the
          nvme part of the lpfc driver"

   - Floppy code cleanup series (Willy, Denis)

   - Floppy contention fix (Jiri)

   - Loop CONFIGURE support (Martijn)

   - bcache fixes/improvements (Coly, Joe, Colin)

   - q->queuedata cleanups (Christoph)

   - Get rid of ioctl_by_bdev (Christoph, Stefan)

   - md/raid5 allocation fixes (Coly)

   - zero length array fixes (Gustavo)

   - swim3 task state fix (Xu)"

* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
  bcache: configure the asynchronous registertion to be experimental
  bcache: asynchronous devices registration
  bcache: fix refcount underflow in bcache_device_free()
  bcache: Convert pr_<level> uses to a more typical style
  bcache: remove redundant variables i and n
  lpfc: Fix return value in __lpfc_nvme_ls_abort
  lpfc: fix axchg pointer reference after free and double frees
  lpfc: Fix pointer checks and comments in LS receive refactoring
  nvme: set dma alignment to qword
  nvmet: cleanups the loop in nvmet_async_events_process
  nvmet: fix memory leak when removing namespaces and controllers concurrently
  nvmet-rdma: add metadata/T10-PI support
  nvmet: add metadata support for block devices
  nvmet: add metadata/T10-PI support
  nvme: add Metadata Capabilities enumerations
  nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: add metadata characteristics for a namespace
  nvme-rdma: add metadata/T10-PI support
  nvme-rdma: introduce nvme_rdma_sgl structure
  ...
2020-06-02 15:37:03 -07:00
Linus Torvalds
750a02ab8d for-5.8/block-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VOwMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoR7EADAlz3TCkb4wwuHytTBDrm6gVDdsJ9zUfQW
 Cl2ASLtufA8PWZUCEI3vhFyOe6P5e+ZZ0O2HjljSevmHyogCaRYXFYVfbWKcQKuk
 AcxiTgnYNevh8KbGLfJY1WL4eXsY+C3QUGivg35cCgrx+kr9oDaHMeqA9Tm1plyM
 FSprDBoSmHPqRxiV/1gnr8uXLX6K7i/fHzwmKgySMhavum7Ma8W3wdAGebzvQwrO
 SbFSuJVgz06e4B1Fzr/wSvVNUE/qW/KqfGuQKIp7VQFIywbgG7TgRMHjE1FSnpnh
 gn+BfL+O5gc0sTvcOTGOE0SRWWwLx961WNg8Azq08l3fzsxLA6h8/AnoDf3i+QMA
 rHmLpWZIic2xPSvjaFHX3/V9ITyGYeAMpAR77EL+4ivWrKv5JrBhnSLDt1fKILdg
 5elxm7RDI+C4nCP4xuTlVCy5gCd6gwjgytKj+NUWhNq1WiGAD0B54SSiV+SbCSH6
 Om2f5trcxz8E4pqWcf0k3LjFapVKRNV8v/+TmVkCdRPBl3y9P0h0wFTkkcEquqnJ
 y7Yq6efdWviRCnX5w/r/yj0qBuk4xo5hMVsPmlthCWtnBm+xZQ6LwMRcq4HQgZgR
 2SYNscZ3OFMekHssH7DvY4DAy1J+n83ims+KzbScbLg2zCZjh/scQuv38R5Eh9WZ
 rCS8c+T7Ig==
 =HYf4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Core block changes that have been queued up for this release:

   - Remove dead blk-throttle and blk-wbt code (Guoqing)

   - Include pid in blktrace note traces (Jan)

   - Don't spew I/O errors on wouldblock termination (me)

   - Zone append addition (Johannes, Keith, Damien)

   - IO accounting improvements (Konstantin, Christoph)

   - blk-mq hardware map update improvements (Ming)

   - Scheduler dispatch improvement (Salman)

   - Inline block encryption support (Satya)

   - Request map fixes and improvements (Weiping)

   - blk-iocost tweaks (Tejun)

   - Fix for timeout failing with error injection (Keith)

   - Queue re-run fixes (Douglas)

   - CPU hotplug improvements (Christoph)

   - Queue entry/exit improvements (Christoph)

   - Move DMA drain handling to the few drivers that use it (Christoph)

   - Partition handling cleanups (Christoph)"

* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
  block: mark bio_wouldblock_error() bio with BIO_QUIET
  blk-wbt: rename __wbt_update_limits to wbt_update_limits
  blk-wbt: remove wbt_update_limits
  blk-throttle: remove tg_drain_bios
  blk-throttle: remove blk_throtl_drain
  null_blk: force complete for timeout request
  blk-mq: drain I/O when all CPUs in a hctx are offline
  blk-mq: add blk_mq_all_tag_iter
  blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  blk-mq: use BLK_MQ_NO_TAG in more places
  blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  blk-mq: move more request initialization to blk_mq_rq_ctx_init
  blk-mq: simplify the blk_mq_get_request calling convention
  blk-mq: remove the bio argument to ->prepare_request
  nvme: force complete cancelled requests
  blk-mq: blk-mq: provide forced completion method
  block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
  block: blk-crypto-fallback: remove redundant initialization of variable err
  block: reduce part_stat_lock() scope
  block: use __this_cpu_add() instead of access by smp_processor_id()
  ...
2020-06-02 15:29:19 -07:00
Miklos Szeredi
56230d9567 ovl: verify permissions in ovl_path_open()
Check permission before opening a real file.

ovl_path_open() is used by readdir and copy-up routines.

ovl_permission() theoretically already checked copy up permissions, but it
doesn't hurt to re-do these checks during the actual copy-up.

For directory reading ovl_permission() only checks access to topmost
underlying layer.  Readdir on a merged directory accesses layers below the
topmost one as well.  Permission wasn't checked for these layers.

Note: modifying ovl_permission() to perform this check would be far more
complex and hence more bug prone.  The result is less precise permissions
returned in access(2).  If this turns out to be an issue, we can revisit
this bug.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:26 +02:00
Miklos Szeredi
48bd024b8a ovl: switch to mounter creds in readdir
In preparation for more permission checking, override credentials for
directory operations on the underlying filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Miklos Szeredi
130fdbc3d1 ovl: pass correct flags for opening real directory
The three instances of ovl_path_open() in overlayfs/readdir.c do three
different things:

 - pass f_flags from overlay file
 - pass O_RDONLY | O_DIRECTORY
 - pass just O_RDONLY

The value of f_flags can be (other than O_RDONLY):

O_WRONLY	- not possible for a directory
O_RDWR		- not possible for a directory
O_CREAT		- masked out by dentry_open()
O_EXCL		- masked out by dentry_open()
O_NOCTTY	- masked out by dentry_open()
O_TRUNC		- masked out by dentry_open()
O_APPEND	- no effect on directory ops
O_NDELAY	- no effect on directory ops
O_NONBLOCK	- no effect on directory ops
__O_SYNC	- no effect on directory ops
O_DSYNC		- no effect on directory ops
FASYNC		- no effect on directory ops
O_DIRECT	- no effect on directory ops
O_LARGEFILE	- ?
O_DIRECTORY	- only affects lookup
O_NOFOLLOW	- only affects lookup
O_NOATIME	- overlay sets this unconditionally in ovl_path_open()
O_CLOEXEC	- only affects fd allocation
O_PATH		- no effect on directory ops
__O_TMPFILE	- not possible for a directory


Fon non-merge directories we use the underlying filesystem's iterate; in
this case honor O_LARGEFILE from the original file to make sure that open
doesn't get rejected.

For merge directories it's safe to pass O_LARGEFILE unconditionally since
userspace will only see the artificial offsets created by overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
21d8d66abf ovl: fix redirect traversal on metacopy dentries
Amir pointed me to metacopy test cases in unionmount-testsuite and I
decided to run "./run --ov=10 --meta" and it failed while running test
"rename-mass-5.py".

Problem is w.r.t absolute redirect traversal on intermediate metacopy
dentry.  We do not store intermediate metacopy dentries and also skip
current loop/layer and move onto lookup in next layer.  But at the end of
loop, we have logic to reset "poe" and layer index if currnently looked up
dentry has absolute redirect.  We skip all that and that means lookup in
next layer will fail.

Following is simple test case to reproduce this.

- mkdir -p lower upper work merged lower/a lower/b
- touch lower/a/foo.txt
- mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,metacopy=on none merged

# Following will create absolute redirect "/a/foo.txt" on upper/b/bar.txt.
- mv merged/a/foo.txt merged/b/bar.txt

# unmount overlay and use upper as lower layer (lower2) for next mount.
- umount merged
- mv upper lower2
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged

# Force a metacopy copy-up
- chown bin:bin merged/b/bar.txt

# unmount overlay and use upper as lower layer (lower3) for next mount.
- umount merged
- mv upper lower3
- rm -rf work; mkdir -p upper work
- mount -t overlay -o lowerdir=lower3:lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged

# ls merged/b/bar.txt
ls: cannot access 'bar.txt': Input/output error

Intermediate lower layer (lower2) has metacopy dentry b/bar.txt with
absolute redirect "/a/foo.txt".  We skipped redirect processing at the end
of loop which sets poe to roe and sets the appropriate next lower layer
index.  And that means lookup failed in next layer.

Fix this by continuing the loop for any intermediate dentries.  We still do
not save these at lower stack.  With this fix applied unionmount-testsuite,
"./run --ov-10 --meta" now passes.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
28166ab3c8 ovl: initialize OVL_UPPERDATA in ovl_lookup()
Currently ovl_get_inode() initializes OVL_UPPERDATA flag and for that it
has to call ovl_check_metacopy_xattr() and check if metacopy xattr is
present or not.

yangerkun reported sometimes underlying filesystem might return -EIO and in
that case error handling path does not cleanup properly leading to various
warnings.

Run generic/461 with ext4 upper/lower layer sometimes may trigger the bug
as below(linux 4.19):

[  551.001349] overlayfs: failed to get metacopy (-5)
[  551.003464] overlayfs: failed to get inode (-5)
[  551.004243] overlayfs: cleanup of 'd44/fd51' failed (-5)
[  551.004941] overlayfs: failed to get origin (-5)
[  551.005199] ------------[ cut here ]------------
[  551.006697] WARNING: CPU: 3 PID: 24674 at fs/inode.c:1528 iput+0x33b/0x400
...
[  551.027219] Call Trace:
[  551.027623]  ovl_create_object+0x13f/0x170
[  551.028268]  ovl_create+0x27/0x30
[  551.028799]  path_openat+0x1a35/0x1ea0
[  551.029377]  do_filp_open+0xad/0x160
[  551.029944]  ? vfs_writev+0xe9/0x170
[  551.030499]  ? page_counter_try_charge+0x77/0x120
[  551.031245]  ? __alloc_fd+0x160/0x2a0
[  551.031832]  ? do_sys_open+0x189/0x340
[  551.032417]  ? get_unused_fd_flags+0x34/0x40
[  551.033081]  do_sys_open+0x189/0x340
[  551.033632]  __x64_sys_creat+0x24/0x30
[  551.034219]  do_syscall_64+0xd5/0x430
[  551.034800]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

One solution is to improve error handling and call iget_failed() if error
is encountered.  Amir thinks that this path is little intricate and there
is not real need to check and initialize OVL_UPPERDATA in ovl_get_inode().
Instead caller of ovl_get_inode() can initialize this state.  And this will
avoid double checking of metacopy xattr lookup in ovl_lookup() and
ovl_get_inode().

OVL_UPPERDATA is inode flag.  So I was little concerned that initializing
it outside ovl_get_inode() might have some races.  But this is one way
transition.  That is once a file has been fully copied up, it can't go back
to metacopy file again.  And that seems to help avoid races.  So as of now
I can't see any races w.r.t OVL_UPPERDATA being set wrongly.  So move
settingof OVL_UPPERDATA inside the callers of ovl_get_inode().
ovl_obtain_alias() already does it.  So only two callers now left are
ovl_lookup() and ovl_instantiate().

Reported-by: yangerkun <yangerkun@huawei.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
6815f479ca ovl: use only uppermetacopy state in ovl_lookup()
Currently we use a variable "metacopy" which signifies that dentry could be
either uppermetacopy or lowermetacopy.  Amir suggested that we can move
code around and use d.metacopy in such a way that we don't need
lowermetacopy and just can do away with uppermetacopy.

So this patch replaces "metacopy" with "uppermetacopy".

It also moves some code little higher to keep reading little simpler.

Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Vivek Goyal
59fb20138a ovl: simplify setting of origin for index lookup
overlayfs can keep index of copied up files and directories and it seems to
serve two primary puroposes.  For regular files, it avoids breaking lower
hardlinks over copy up.  For directories it seems to be used for various
error checks.

During ovl_lookup(), we lookup for index using lower dentry in many a
cases.  That lower dentry is called "origin" and following is a summary of
current logic.

If there is no upperdentry, always lookup for index using lower dentry.
For regular files it helps avoiding breaking hard links over copyup and for
directories it seems to be just error checks.

If there is an upperdentry, then there are 3 possible cases.

 - For directories, lower dentry is found using two ways.  One is regular
  path based lookup in lower layers and second is using ORIGIN xattr on
  upper dentry.  First verify that path based lookup lower dentry matches
  the one pointed by upper ORIGIN xattr.  If yes, use this verified origin
  for index lookup.

 - For regular files (non-metacopy), there is no path based lookup in lower
  layers as lookup stops once we find upper dentry.  So there is no origin
  verification.  If there is ORIGIN xattr present on upper, use that to
  lookup index otherwise don't.

 - For regular metacopy files, again lower dentry is found using path based
  lookup as well as ORIGIN xattr on upper.  Path based lookup is continued
  in this case to find lower data dentry for metacopy upper.  So like
  directories we only use verified origin.  If ORIGIN xattr is not present
  (Either because lower did not support file handles or because this is
  hardlink copied up with index=off), then don't use path lookup based
  lower dentry as origin.  This is same as regular non-metacopy file case.

Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Amir Goldstein
522f6e6cba ovl: fix out of bounds access warning in ovl_check_fb_len()
syzbot reported out of bounds memory access from open_by_handle_at()
with a crafted file handle that looks like this:

  { .handle_bytes = 2, .handle_type = OVL_FILEID_V1 }

handle_bytes gets rounded down to 0 and we end up calling:
  ovl_check_fh_len(fh, 0) => ovl_check_fb_len(fh + 3, -3)

But fh buffer is only 2 bytes long, so accessing struct ovl_fb at
fh + 3 is illegal.

Fixes: cbe7fba8ed ("ovl: make sure that real fid is 32bit aligned in memory")
Reported-and-tested-by: syzbot+61958888b1c60361a791@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-06-02 22:20:25 +02:00
Linus Torvalds
355ba37d75 Power management updates for 5.8-rc1
- Rework the system-wide PM driver flags to make them easier to
    understand and use and update their documentation (Rafael Wysocki,
    Alan Stern).
 
  - Allow cpuidle governors to be switched at run time regardless of
    the kernel configuration and update the related documentation
    accordingly (Hanjun Guo).
 
  - Improve the resume device handling in the user space hibernarion
    interface code (Domenico Andreoli).
 
  - Document the intel-speed-select sysfs interface (Srinivas
    Pandruvada).
 
  - Make the ACPI code handing suspend to idle print more debug
    messages to help diagnose issues with it (Rafael Wysocki).
 
  - Fix a helper routine in the cpufreq core and correct a typo in
    the struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
    Wenhu).
 
  - Update cpufreq drivers:
 
    * Make the intel_pstate driver start in the passive mode by
      default on systems without HWP (Rafael Wysocki).
 
    * Add i.MX7ULP support to the imx-cpufreq-dt driver and add
      i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).
 
    * Convert the qoriq cpufreq driver to a platform one, make the
      platform code create a suitable device object for it and add
      platform dependencies to it (Mian Yousaf Kaukab, Geert
      Uytterhoeven).
 
    * Fix wrong compatible binding in the qcom driver (Ansuel Smith).
 
    * Build the omap driver by default for ARCH_OMAP2PLUS (Anders
      Roxell).
 
    * Add r8a7742 SoC support to the dt cpufreq driver (Lad Prabhakar).
 
  - Update cpuidle core and drivers:
 
    * Fix three reference count leaks in error code paths in the
      cpuidle core (Qiushi Wu).
 
    * Convert Qualcomm SPM to a generic cpuidle driver (Stephan
      Gerhold).
 
    * Fix up the execution order when entering a domain idle state in
      the PSCI driver (Ulf Hansson).
 
  - Fix a reference counting issue related to clock management and
    clean up two oddities in the PM-runtime framework (Rafael Wysocki,
    Andy Shevchenko).
 
  - Add ElkhartLake support to the Intel RAPL power capping driver
    and remove an unused local MSR definition from it (Jacob Pan,
    Sumeet Pawnikar).
 
  - Update devfreq core and drivers:
 
    * Replace strncpy() with strscpy() in the devfreq core and use
      lockdep asserts instead of manual checks for a locked mutex in
      it (Dmitry Osipenko, Krzysztof Kozlowski).
 
    * Add a generic imx bus scaling driver and make it register an
      interconnect device (Leonard Crestez, Gustavo A. R. Silva).
 
    * Make the cpufreq notifier in the tegra30 driver take boosting
      into account and delete an unuseful error message from that
      driver (Dmitry Osipenko, Markus Elfring).
 
  - Remove unneeded semicolon from the cpupower code (Zou Wei).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl7VGjwSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx46gP/jGAXlddFEQswi6qUT3Cff0A9mb8CdcX
 dyKrjX4xxo/wtBIAwSN4achxrgse//ayo2dYTzWRDd31W9Azbv+5F+46XsDRz4hL
 pH29u/E66NMtFWnHCmt78NEJn0FzSa0YBC43ZzwFwKktCK9skYIpGN2z6iuXUBSX
 Q5GHqop3zvDsdKQFBGL62xvUw/AmOTPG7ohIZvqWBN2mbOqEqMcoFHT+aUF/NbLj
 +i14dvTH767eDZGRVASmXWQyljjaRWm+SIw4+m8zT1D1Y3d5IFObuMN+9RQl1Tif
 BYjkgJ2oDDMhCJLW7TBuJB+g7exiyaSQds3nMr2ZR+eZbJipICjU4eehNEKIUopU
 DM17tHQfnwZfS/7YbCx3vYQwLkNq37AJyXS9uqCAIFM+0n4xN4/mIVmgWYISLDTs
 1v9olFxtwMRNpjGGQWPJAO7ebB8Zz9qhQv7pIkSQEfwp93/SzvlVf4vvruTeFN9J
 qqG60cDumXWAm+s43eQHJNn5nOd5ocWv0FBpo/cxqKbzxFVWwdB42Cm0SY+rK2ID
 uHdnc2DJcK2c78UVbz3Cmk4272foJt2zxchqjFXXAZPLrOsFfzmti4B28VxGxjmP
 LG3MhH5sdbF4yl/1aSC1Bnrt+PV9Lus6ut/VKhjwIpw8cqiXgpwSbMoDoaBd9UMQ
 ubGz2rplGAtB
 =APdj
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These rework the system-wide PM driver flags, make runtime switching
  of cpuidle governors easier, improve the user space hibernation
  interface code, add intel-speed-select interface documentation, add
  more debug messages to the ACPI code handling suspend to idle, update
  the cpufreq core and drivers, fix a minor issue in the cpuidle core
  and update two cpuidle drivers, improve the PM-runtime framework,
  update the Intel RAPL power capping driver, update devfreq core and
  drivers, and clean up the cpupower utility.

  Specifics:

   - Rework the system-wide PM driver flags to make them easier to
     understand and use and update their documentation (Rafael Wysocki,
     Alan Stern).

   - Allow cpuidle governors to be switched at run time regardless of
     the kernel configuration and update the related documentation
     accordingly (Hanjun Guo).

   - Improve the resume device handling in the user space hibernarion
     interface code (Domenico Andreoli).

   - Document the intel-speed-select sysfs interface (Srinivas
     Pandruvada).

   - Make the ACPI code handing suspend to idle print more debug
     messages to help diagnose issues with it (Rafael Wysocki).

   - Fix a helper routine in the cpufreq core and correct a typo in the
     struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
     Wenhu).

   - Update cpufreq drivers:

      - Make the intel_pstate driver start in the passive mode by
        default on systems without HWP (Rafael Wysocki).

      - Add i.MX7ULP support to the imx-cpufreq-dt driver and add
        i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).

      - Convert the qoriq cpufreq driver to a platform one, make the
        platform code create a suitable device object for it and add
        platform dependencies to it (Mian Yousaf Kaukab, Geert
        Uytterhoeven).

      - Fix wrong compatible binding in the qcom driver (Ansuel Smith).

      - Build the omap driver by default for ARCH_OMAP2PLUS (Anders
        Roxell).

      - Add r8a7742 SoC support to the dt cpufreq driver (Lad
        Prabhakar).

   - Update cpuidle core and drivers:

      - Fix three reference count leaks in error code paths in the
        cpuidle core (Qiushi Wu).

      - Convert Qualcomm SPM to a generic cpuidle driver (Stephan
        Gerhold).

      - Fix up the execution order when entering a domain idle state in
        the PSCI driver (Ulf Hansson).

   - Fix a reference counting issue related to clock management and
     clean up two oddities in the PM-runtime framework (Rafael Wysocki,
     Andy Shevchenko).

   - Add ElkhartLake support to the Intel RAPL power capping driver and
     remove an unused local MSR definition from it (Jacob Pan, Sumeet
     Pawnikar).

   - Update devfreq core and drivers:

      - Replace strncpy() with strscpy() in the devfreq core and use
        lockdep asserts instead of manual checks for a locked mutex in
        it (Dmitry Osipenko, Krzysztof Kozlowski).

      - Add a generic imx bus scaling driver and make it register an
        interconnect device (Leonard Crestez, Gustavo A. R. Silva).

      - Make the cpufreq notifier in the tegra30 driver take boosting
        into account and delete an unuseful error message from that
        driver (Dmitry Osipenko, Markus Elfring).

   - Remove unneeded semicolon from the cpupower code (Zou Wei)"

* tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (51 commits)
  cpuidle: Fix three reference count leaks
  PM: runtime: Replace pm_runtime_callbacks_present()
  PM / devfreq: Use lockdep asserts instead of manual checks for locked mutex
  PM / devfreq: imx-bus: Fix inconsistent IS_ERR and PTR_ERR
  PM / devfreq: Replace strncpy with strscpy
  PM / devfreq: imx: Register interconnect device
  PM / devfreq: Add generic imx bus scaling driver
  PM / devfreq: tegra30: Delete an error message in tegra_devfreq_probe()
  PM / devfreq: tegra30: Make CPUFreq notifier to take into account boosting
  PM: hibernate: Restrict writes to the resume device
  PM: runtime: clk: Fix clk_pm_runtime_get() error path
  cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driver
  ACPI: EC: PM: s2idle: Extend GPE dispatching debug message
  ACPI: PM: s2idle: Print type of wakeup debug messages
  powercap: RAPL: remove unused local MSR define
  PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend()
  Documentation: admin-guide: pm: Document intel-speed-select
  PM: hibernate: Split off snapshot dev option
  PM: hibernate: Incorporate concurrency handling
  Documentation: ABI: make current_governer_ro as a candidate for removal
  ...
2020-06-02 13:17:23 -07:00
Bob Peterson
1a0b00d15d gfs2: Only do glock put in gfs2_create_inode for free inodes
Before this patch, the error path of function gfs2_create_inode would
always calls gfs2_glock_put for the inode glock. That's good for inodes
that are free. But after they've been added to the vfs inodes, errors
will cause the inode to be evicted, and the evict will do the glock
put for us. If we do a glock put again, we can try to free the glock
while there are still references to it, e.g. revokes pending for
the transaction that created it.

This patch adds a check: if (free_vfs_inode) before the put, thus
solving the problem.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-02 21:23:55 +02:00
Linus Torvalds
94709049fb Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A few little subsystems and a start of a lot of MM patches.

  Subsystems affected by this patch series: squashfs, ocfs2, parisc,
  vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
  swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
  kasan: move kasan_report() into report.c
  mm/mm_init.c: report kasan-tag information stored in page->flags
  ubsan: entirely disable alignment checks under UBSAN_TRAP
  kasan: fix clang compilation warning due to stack protector
  x86/mm: remove vmalloc faulting
  mm: remove vmalloc_sync_(un)mappings()
  x86/mm/32: implement arch_sync_kernel_mappings()
  x86/mm/64: implement arch_sync_kernel_mappings()
  mm/ioremap: track which page-table levels were modified
  mm/vmalloc: track which page-table levels were modified
  mm: add functions to track page directory modifications
  s390: use __vmalloc_node in stack_alloc
  powerpc: use __vmalloc_node in alloc_vm_stack
  arm64: use __vmalloc_node in arch_alloc_vmap_stack
  mm: remove vmalloc_user_node_flags
  mm: switch the test_vmalloc module to use __vmalloc_node
  mm: remove __vmalloc_node_flags_caller
  mm: remove both instances of __vmalloc_node_flags
  mm: remove the prot argument to __vmalloc_node
  mm: remove the pgprot argument to __vmalloc
  ...
2020-06-02 12:21:36 -07:00
Christoph Hellwig
88dca4ca5a mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Christoph Hellwig
d4efd79a81 mm: remove the prot argument from vm_map_ram
This is always PAGE_KERNEL - for long term mappings with other properties
vmap should be used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-19-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Huang Ying
c94b6923fa /proc/PID/smaps: Add PMD migration entry parsing
Now, when reading /proc/PID/smaps, the PMD migration entry in page table
is simply ignored.  To improve the accuracy of /proc/PID/smaps, its
parsing and processing is added.

To test the patch, we run pmbench to eat 400 MB memory in background,
then run /usr/bin/migratepages and `cat /proc/PID/smaps` every second.
The issue as follows can be reproduced within 60 seconds.

Before the patch, for the fully populated 400 MB anonymous VMA, some THP
pages under migration may be lost as below.

  7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
  Size:             409600 kB
  KernelPageSize:        4 kB
  MMUPageSize:           4 kB
  Rss:              407552 kB
  Pss:              407552 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:    407552 kB
  Referenced:       301056 kB
  Anonymous:        407552 kB
  LazyFree:              0 kB
  AnonHugePages:    405504 kB
  ShmemPmdMapped:        0 kB
  FilePmdMapped:        0 kB
  Shared_Hugetlb:        0 kB
  Private_Hugetlb:       0 kB
  Swap:                  0 kB
  SwapPss:               0 kB
  Locked:                0 kB
  THPeligible:		1
  VmFlags: rd wr mr mw me ac

After the patch, it will be always,

  7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
  Size:             409600 kB
  KernelPageSize:        4 kB
  MMUPageSize:           4 kB
  Rss:              409600 kB
  Pss:              409600 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:    409600 kB
  Referenced:       294912 kB
  Anonymous:        409600 kB
  LazyFree:              0 kB
  AnonHugePages:    407552 kB
  ShmemPmdMapped:        0 kB
  FilePmdMapped:        0 kB
  Shared_Hugetlb:        0 kB
  Private_Hugetlb:       0 kB
  Swap:                  0 kB
  SwapPss:               0 kB
  Locked:                0 kB
  THPeligible:		1
  VmFlags: rd wr mr mw me ac

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Link: http://lkml.kernel.org/r/20200403123059.1846960-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:10 -07:00
NeilBrown
8d92890bd6 mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds.  If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count.  This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g.  releasepage() used to
send a COMMIT).  However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a8
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.  With this
patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero.  static trace points and warning printks which mentioned this
counter no longer report it.

[akpm@linux-foundation.org: re-layout comment]
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Acked-by: Michal Hocko <mhocko@suse.com>	[mm]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
NeilBrown
a37b0715dd mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.  The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks.  I don't know what is wanted for realtime.

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Guoqing Jiang
4c42be38c2 orangefs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in orangefs.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Link: http://lkml.kernel.org/r/20200517214718.468-9-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Guoqing Jiang
14ed109e3f ntfs: replace attach_page_buffers with attach_page_private
Call the new function since attach_page_buffers will be removed.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Link: http://lkml.kernel.org/r/20200517214718.468-8-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
58aeb73196 iomap: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in iomap.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Link: http://lkml.kernel.org/r/20200517214718.468-7-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
7128cf9a25 f2fs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in f2fs.h.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Link: http://lkml.kernel.org/r/20200517214718.468-6-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
45dcfc2732 fs/buffer.c: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in buffer.c.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200517214718.468-5-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Guoqing Jiang
d1b89bc042 btrfs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the
code in btrfs.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: http://lkml.kernel.org/r/20200517214718.468-4-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
9d24a13a93 iomap: convert from readpages to readahead
Use the new readahead operation in iomap.  Convert XFS and ZoneFS to use
it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
76a0294eb1 fuse: convert from readpages to readahead
Implement the new readahead operation in fuse by using __readahead_batch()
to fill the array of pages in fuse_args_pages directly.  This lets us
inline fuse_readpages_fill() into fuse_readahead().

[willy@infradead.org: build fix]
  Link: http://lkml.kernel.org/r/20200415025938.GB5820@bombadil.infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-25-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
e20a769364 f2fs: pass the inode to f2fs_mpage_readpages
This function now only uses the mapping argument to look up the inode, and
both callers already have the inode, so just pass the inode instead of the
mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-24-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
2332319625 f2fs: convert from readpages to readahead
Use the new readahead operation in f2fs

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-23-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
a07f624bd6 ext4: pass the inode to ext4_mpage_readpages
This function now only uses the mapping argument to look up the inode, and
both callers already have the inode, so just pass the inode instead of the
mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-22-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
6311f91f76 ext4: convert from readpages to readahead
Use the new readahead operation in ext4

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-21-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
0615090c50 erofs: convert compressed files from readpages to readahead
Use the new readahead operation in erofs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Gao Xiang <gaoxiang25@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-20-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
0c07a9f91e erofs: convert uncompressed files from readpages to readahead
Use the new readahead operation in erofs

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Gao Xiang <gaoxiang25@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-19-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
ba206a026f btrfs: convert from readpages to readahead
Implement the new readahead method in btrfs using the new
readahead_page_batch() function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-18-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
d4388340ae fs: convert mpage_readpages to mpage_readahead
Implement the new readahead aop and convert all callers (block_dev,
exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
reiserfs & udf).

The callers are all trivial except for GFS2 & OCFS2.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
2c684234d3 mm: add page_cache_readahead_unbounded
ext4 and f2fs have duplicated the guts of the readahead code so they can
read past i_size.  Instead, separate out the guts of the readahead code
so they can call it directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Jeff Layton
485e9605c0 fs/buffer.c: record blockdev write errors in super_block that it backs
When syncing out a block device (a'la __sync_blockdev), any error
encountered will only be recorded in the bd_inode's mapping.  When the
blockdev contains a filesystem however, we'd like to also record the
error in the super_block that's stored there.

Make mark_buffer_write_io_error also record the error in the
corresponding super_block when a writeback error occurs and the block
device contains a mounted superblock.

Since superblocks are RCU freed, hold the rcu_read_lock to ensure that
the superblock doesn't go away while we're marking it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-3-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Jeff Layton
735e4ae5ba vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back.  It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect.  If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error.  syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures.  Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor.  This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though.  This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Gang He
912f655d78 ocfs2: mount shared volume without ha stack
Usually we create and use a ocfs2 shared volume on the top of ha stack.
For pcmk based ha stack, which includes DLM, corosync and pacemaker
services.

The customers complained they could not mount existent ocfs2 volume in
the single node without ha stack, e.g.  single node backup/restore
scenario.

Like this case, the customers just want to access the data from the
existent ocfs2 volume quickly, but do not want to restart or setup ha
stack.

Then, I'd like to add a mount option "nocluster", if the users use this
option to mount a ocfs2 shared volume, the whole mount will not depend
on the ha related services.  the command will mount the existent ocfs2
volume directly (like local mount), for avoiding setup the ha stack.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200423053300.22661-1-ghe@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Jules Irenge
8f745e62a1 ocfs2: add missing annotation for dlm_empty_lockres()
Sparse reports a warning at dlm_empty_lockres()

  warning: context imbalance in dlm_purge_lockres() - unexpected unlock

The root cause is the missing annotation at dlm_purge_lockres()

Add the missing __must_hold(&dlm->spinlock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200403160505.2832-4-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Philippe Liard
93e72b3c61 squashfs: migrate from ll_rw_block usage to BIO
ll_rw_block() function has been deprecated in favor of BIO which appears
to come with large performance improvements.

This patch decreases boot time by close to 40% when using squashfs for
the root file-system.  This is observed at least in the context of
starting an Android VM on Chrome OS using crosvm.  The patch was tested
on 4.19 as well as master.

This patch is largely based on Adrien Schildknecht's patch that was
originally sent as https://lkml.org/lkml/2017/9/22/814 though with some
significant changes and simplifications while also taking Phillip
Lougher's feedback into account, around preserving support for
FILE_CACHE in particular.

[akpm@linux-foundation.org: fix build error reported by Randy]
  Link: http://lkml.kernel.org/r/319997c2-5fc8-f889-2ea3-d913308a7c1f@infradead.org
Signed-off-by: Philippe Liard <pliard@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Daniel Rosenberg <drosen@google.com>
Link: https://chromium.googlesource.com/chromiumos/platform/crosvm
Link: http://lkml.kernel.org/r/20191106074238.186023-1-pliard@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Bob Peterson
ea22eee4e6 gfs2: Allow lock_nolock mount to specify jid=X
Before this patch, a simple typo accidentally added \n to the jid=
string for lock_nolock mounts. This made it impossible to mount a
gfs2 file system with a journal other than journal0. Thus:

mount -tgfs2 -o hostdata="jid=1" <device> <mount pt>

Resulted in:
mount: wrong fs type, bad option, bad superblock on <device>

In most cases this is not a problem. However, for debugging and
testing purposes we sometimes want to test the integrity of other
journals. This patch removes the unnecessary \n and thus allows
lock_nolock users to specify an alternate journal.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-02 19:45:05 +02:00
Bob Peterson
bbae10fac2 gfs2: Don't ignore inode write errors during inode_go_sync
Before for this patch, function inode_go_sync ignored io errors
during inode_go_sync, overwriting them with metadata write errors:

		error = filemap_fdatawait(mapping);
		mapping_set_error(mapping, error);
	}
	error = filemap_fdatawait(metamapping);
	...
	return error;

So any errors returned by the inode write would be forgotten if the
metadata write succeeded. This patch still does both writes, but
only sets error if it's still zero. That way, any errors will be
reported by to the caller, do_xmote, which will take appropriate
action and report the error.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-02 19:45:05 +02:00
Aurelien Aptel
5f68ea4aa9 cifs: multichannel: move channel selection in function
This commit moves channel picking code in separate function.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-02 09:58:41 -05:00
Steve French
bbbf9eafbf cifs: fix minor typos in comments and log messages
Fix four minor typos in comments and log messages

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-02 09:58:30 -05:00
Steve French
3563a6f468 smb3: minor update to compression header definitions
MS-SMB2 specification was updated in March.  Make minor additions
and corrections to compression related definitions in smb2pdu.h

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-02 09:58:17 -05:00
Linus Torvalds
f359287765 Merge branch 'from-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted patches from Miklos.

  An interesting part here is /proc/mounts stuff..."

The "/proc/mounts stuff" is using a cursor for keeeping the location
data while traversing the mount listing.

Also probably worth noting is the addition of faccessat2(), which takes
an additional set of flags to specify how the lookup is done
(AT_EACCESS, AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH).

* 'from-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: add faccessat2 syscall
  vfs: don't parse "silent" option
  vfs: don't parse "posixacl" option
  vfs: don't parse forbidden flags
  statx: add mount_root
  statx: add mount ID
  statx: don't clear STATX_ATIME on SB_RDONLY
  uapi: deprecate STATX_ALL
  utimensat: AT_EMPTY_PATH support
  vfs: split out access_override_creds()
  proc/mounts: add cursor
  aio: fix async fsync creds
  vfs: allow unprivileged whiteout creation
2020-06-01 16:44:06 -07:00
Linus Torvalds
8b39a57e96 Merge branch 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/coredump updates from Al Viro:
 "set_fs() removal in coredump-related area - mostly Christoph's
  stuff..."

* 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
  binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
  binfmt_elf: remove the set_fs in fill_siginfo_note
  signal: refactor copy_siginfo_to_user32
  powerpc/spufs: simplify spufs core dumping
  powerpc/spufs: stop using access_ok
  powerpc/spufs: fix copy_to_user while atomic
2020-06-01 16:21:46 -07:00
Linus Torvalds
062ea674ae Merge branch 'uaccess.__copy_to_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/__copy_to_user updates from Al Viro:
 "Getting rid of __copy_to_user() callers - stuff that doesn't fit into
  other series"

* 'uaccess.__copy_to_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dlmfs: convert dlmfs_file_read() to copy_to_user()
  esas2r: don't bother with __copy_to_user()
2020-06-01 16:19:59 -07:00
Linus Torvalds
56446efab9 Merge branch 'uaccess.__copy_from_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/__copy_from_user updates from Al Viro:
 "Getting rid of __copy_from_user() callers - patches that don't fit
  into other series"

* 'uaccess.__copy_from_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pstore: switch to copy_from_user()
  firewire: switch ioctl_queue_iso to use of copy_from_user()
2020-06-01 16:18:46 -07:00
Linus Torvalds
e148a8f948 Merge branch 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/readdir updates from Al Viro:
 "Finishing the conversion of readdir.c to unsafe_... API.

  This includes the uaccess_{read,write}_begin series by Christophe
  Leroy"

* 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  readdir.c: get rid of the last __put_user(), drop now-useless access_ok()
  readdir.c: get compat_filldir() more or less in sync with filldir()
  switch readdir(2) to unsafe_copy_dirent_name()
  drm/i915/gem: Replace user_access_begin by user_write_access_begin
  uaccess: Selectively open read or write user access
  uaccess: Add user_read_access_begin/end and user_write_access_begin/end
2020-06-01 16:11:38 -07:00
Linus Torvalds
e0cd920687 Merge branch 'uaccess.access_ok' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/access_ok updates from Al Viro:
 "Removals of trivially pointless access_ok() calls.

  Note: the fiemap stuff was removed from the series, since they are
  duplicates with part of ext4 series carried in Ted's tree"

* 'uaccess.access_ok' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vmci_host: get rid of pointless access_ok()
  hfi1: get rid of pointless access_ok()
  usb: get rid of pointless access_ok() calls
  lpfc_debugfs: get rid of pointless access_ok()
  efi_test: get rid of pointless access_ok()
  drm_read(): get rid of pointless access_ok()
  via-pmu: don't bother with access_ok()
  drivers/crypto/ccp/sev-dev.c: get rid of pointless access_ok()
  omapfb: get rid of pointless access_ok() calls
  amifb: get rid of pointless access_ok() calls
  drivers/fpga/dfl-afu-dma-region.c: get rid of pointless access_ok()
  drivers/fpga/dfl-fme-pr.c: get rid of pointless access_ok()
  cm4000_cs.c cmm_ioctl(): get rid of pointless access_ok()
  nvram: drop useless access_ok()
  n_hdlc_tty_read(): remove pointless access_ok()
  tomoyo_write_control(): get rid of pointless access_ok()
  btrfs_ioctl_send(): don't bother with access_ok()
  fat_dir_ioctl(): hadn't needed that access_ok() for more than a decade...
  dlmfs_file_write(): get rid of pointless access_ok()
2020-06-01 16:09:43 -07:00
Linus Torvalds
b23c4771ff A fair amount of stuff this time around, dominated by yet another massive
set from Mauro toward the completion of the RST conversion.  I *really*
 hope we are getting close to the end of this.  Meanwhile, those patches
 reach pretty far afield to update document references around the tree;
 there should be no actual code changes there.  There will be, alas, more of
 the usual trivial merge conflicts.
 
 Beyond that we have more translations, improvements to the sphinx
 scripting, a number of additions to the sysctl documentation, and lots of
 fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl7VId8PHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yq/gH/iaDgirQZV6UZ2v9sfwQNYolNpf2sKAuOZjd
 bPFB7WJoMQbKwQEvYrAUL2+5zPOcLYuIfzyOfo1BV1py+EyKbACcKjI4AedxfJF7
 +NchmOBhlEqmEhzx2U08HRc4/8J223WG17fJRVsV3p+opJySexSFeQucfOciX5NR
 RUCxweWWyg/FgyqjkyMMTtsePqZPmcT5dWTlVXISlbWzcv5NFhuJXnSrw8Sfzcmm
 SJMzqItv3O+CabnKQ8kMLV2PozXTMfjeWH47ZUK0Y8/8PP9+cvqwFzZ0UDQJ1Xaz
 oyW/TqmunaXhfMsMFeFGSwtfgwRHvXdxkQdtwNHvo1dV4dzTvDw=
 =fDC/
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.8' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "A fair amount of stuff this time around, dominated by yet another
  massive set from Mauro toward the completion of the RST conversion. I
  *really* hope we are getting close to the end of this. Meanwhile,
  those patches reach pretty far afield to update document references
  around the tree; there should be no actual code changes there. There
  will be, alas, more of the usual trivial merge conflicts.

  Beyond that we have more translations, improvements to the sphinx
  scripting, a number of additions to the sysctl documentation, and lots
  of fixes"

* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
  Documentation: fixes to the maintainer-entry-profile template
  zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
  tracing: Fix events.rst section numbering
  docs: acpi: fix old http link and improve document format
  docs: filesystems: add info about efivars content
  Documentation: LSM: Correct the basic LSM description
  mailmap: change email for Ricardo Ribalda
  docs: sysctl/kernel: document unaligned controls
  Documentation: admin-guide: update bug-hunting.rst
  docs: sysctl/kernel: document ngroups_max
  nvdimm: fixes to maintainter-entry-profile
  Documentation/features: Correct RISC-V kprobes support entry
  Documentation/features: Refresh the arch support status files
  Revert "docs: sysctl/kernel: document ngroups_max"
  docs: move locking-specific documents to locking/
  docs: move digsig docs to the security book
  docs: move the kref doc into the core-api book
  docs: add IRQ documentation at the core-api book
  docs: debugging-via-ohci1394.txt: add it to the core-api book
  docs: fix references for ipmi.rst file
  ...
2020-06-01 15:45:27 -07:00
Linus Torvalds
533b220f7b arm64 updates for 5.8
- Branch Target Identification (BTI)
 	* Support for ARMv8.5-BTI in both user- and kernel-space. This
 	  allows branch targets to limit the types of branch from which
 	  they can be called and additionally prevents branching to
 	  arbitrary code, although kernel support requires a very recent
 	  toolchain.
 
 	* Function annotation via SYM_FUNC_START() so that assembly
 	  functions are wrapped with the relevant "landing pad"
 	  instructions.
 
 	* BPF and vDSO updates to use the new instructions.
 
 	* Addition of a new HWCAP and exposure of BTI capability to
 	  userspace via ID register emulation, along with ELF loader
 	  support for the BTI feature in .note.gnu.property.
 
 	* Non-critical fixes to CFI unwind annotations in the sigreturn
 	  trampoline.
 
 - Shadow Call Stack (SCS)
 	* Support for Clang's Shadow Call Stack feature, which reserves
 	  platform register x18 to point at a separate stack for each
 	  task that holds only return addresses. This protects function
 	  return control flow from buffer overruns on the main stack.
 
 	* Save/restore of x18 across problematic boundaries (user-mode,
 	  hypervisor, EFI, suspend, etc).
 
 	* Core support for SCS, should other architectures want to use it
 	  too.
 
 	* SCS overflow checking on context-switch as part of the existing
 	  stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
 
 - CPU feature detection
 	* Removed numerous "SANITY CHECK" errors when running on a system
 	  with mismatched AArch32 support at EL1. This is primarily a
 	  concern for KVM, which disabled support for 32-bit guests on
 	  such a system.
 
 	* Addition of new ID registers and fields as the architecture has
 	  been extended.
 
 - Perf and PMU drivers
 	* Minor fixes and cleanups to system PMU drivers.
 
 - Hardware errata
 	* Unify KVM workarounds for VHE and nVHE configurations.
 
 	* Sort vendor errata entries in Kconfig.
 
 - Secure Monitor Call Calling Convention (SMCCC)
 	* Update to the latest specification from Arm (v1.2).
 
 	* Allow PSCI code to query the SMCCC version.
 
 - Software Delegated Exception Interface (SDEI)
 	* Unexport a bunch of unused symbols.
 
 	* Minor fixes to handling of firmware data.
 
 - Pointer authentication
 	* Add support for dumping the kernel PAC mask in vmcoreinfo so
 	  that the stack can be unwound by tools such as kdump.
 
 	* Simplification of key initialisation during CPU bringup.
 
 - BPF backend
 	* Improve immediate generation for logical and add/sub
 	  instructions.
 
 - vDSO
 	- Minor fixes to the linker flags for consistency with other
 	  architectures and support for LLVM's unwinder.
 
 	- Clean up logic to initialise and map the vDSO into userspace.
 
 - ACPI
 	- Work around for an ambiguity in the IORT specification relating
 	  to the "num_ids" field.
 
 	- Support _DMA method for all named components rather than only
 	  PCIe root complexes.
 
 	- Minor other IORT-related fixes.
 
 - Miscellaneous
 	* Initialise debug traps early for KGDB and fix KDB cacheflushing
 	  deadlock.
 
 	* Minor tweaks to early boot state (documentation update, set
 	  TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
 
 	* Refactoring and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
 yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
 jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
 JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
 HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
 znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
 =w3qi
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "A sizeable pile of arm64 updates for 5.8.

  Summary below, but the big two features are support for Branch Target
  Identification and Clang's Shadow Call stack. The latter is currently
  arm64-only, but the high-level parts are all in core code so it could
  easily be adopted by other architectures pending toolchain support

  Branch Target Identification (BTI):

   - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
     branch targets to limit the types of branch from which they can be
     called and additionally prevents branching to arbitrary code,
     although kernel support requires a very recent toolchain.

   - Function annotation via SYM_FUNC_START() so that assembly functions
     are wrapped with the relevant "landing pad" instructions.

   - BPF and vDSO updates to use the new instructions.

   - Addition of a new HWCAP and exposure of BTI capability to userspace
     via ID register emulation, along with ELF loader support for the
     BTI feature in .note.gnu.property.

   - Non-critical fixes to CFI unwind annotations in the sigreturn
     trampoline.

  Shadow Call Stack (SCS):

   - Support for Clang's Shadow Call Stack feature, which reserves
     platform register x18 to point at a separate stack for each task
     that holds only return addresses. This protects function return
     control flow from buffer overruns on the main stack.

   - Save/restore of x18 across problematic boundaries (user-mode,
     hypervisor, EFI, suspend, etc).

   - Core support for SCS, should other architectures want to use it
     too.

   - SCS overflow checking on context-switch as part of the existing
     stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

  CPU feature detection:

   - Removed numerous "SANITY CHECK" errors when running on a system
     with mismatched AArch32 support at EL1. This is primarily a concern
     for KVM, which disabled support for 32-bit guests on such a system.

   - Addition of new ID registers and fields as the architecture has
     been extended.

  Perf and PMU drivers:

   - Minor fixes and cleanups to system PMU drivers.

  Hardware errata:

   - Unify KVM workarounds for VHE and nVHE configurations.

   - Sort vendor errata entries in Kconfig.

  Secure Monitor Call Calling Convention (SMCCC):

   - Update to the latest specification from Arm (v1.2).

   - Allow PSCI code to query the SMCCC version.

  Software Delegated Exception Interface (SDEI):

   - Unexport a bunch of unused symbols.

   - Minor fixes to handling of firmware data.

  Pointer authentication:

   - Add support for dumping the kernel PAC mask in vmcoreinfo so that
     the stack can be unwound by tools such as kdump.

   - Simplification of key initialisation during CPU bringup.

  BPF backend:

   - Improve immediate generation for logical and add/sub instructions.

  vDSO:

   - Minor fixes to the linker flags for consistency with other
     architectures and support for LLVM's unwinder.

   - Clean up logic to initialise and map the vDSO into userspace.

  ACPI:

   - Work around for an ambiguity in the IORT specification relating to
     the "num_ids" field.

   - Support _DMA method for all named components rather than only PCIe
     root complexes.

   - Minor other IORT-related fixes.

  Miscellaneous:

   - Initialise debug traps early for KGDB and fix KDB cacheflushing
     deadlock.

   - Minor tweaks to early boot state (documentation update, set
     TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

   - Refactoring and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
  KVM: arm64: Check advertised Stage-2 page size capability
  arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
  ACPI/IORT: Remove the unused __get_pci_rid()
  arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
  arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
  arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
  arm64/cpufeature: Introduce ID_MMFR5 CPU register
  arm64/cpufeature: Introduce ID_DFR1 CPU register
  arm64/cpufeature: Introduce ID_PFR2 CPU register
  arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
  arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
  arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
  arm64: mm: Add asid_gen_match() helper
  firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
  arm64: vdso: Fix CFI directives in sigreturn trampoline
  arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
  ...
2020-06-01 15:18:27 -07:00
J. Bruce Fields
027690c75e nfsd4: make drc_slab global, not per-net
I made every global per-network-namespace instead.  But perhaps doing
that to this slab was a step too far.

The kmem_cache_create call in our net init method also seems to be
responsible for this lockdep warning:

[   45.163710] Unable to find swap-space signature
[   45.375718] trinity-c1 (855): attempted to duplicate a private mapping with mremap.  This is not supported.
[   46.055744] futex_wake_op: trinity-c1 tries to shift op by -209; fix this program
[   51.011723]
[   51.013378] ======================================================
[   51.013875] WARNING: possible circular locking dependency detected
[   51.014378] 5.2.0-rc2 #1 Not tainted
[   51.014672] ------------------------------------------------------
[   51.015182] trinity-c2/886 is trying to acquire lock:
[   51.015593] 000000005405f099 (slab_mutex){+.+.}, at: slab_attr_store+0xa2/0x130
[   51.016190]
[   51.016190] but task is already holding lock:
[   51.016652] 00000000ac662005 (kn->count#43){++++}, at: kernfs_fop_write+0x286/0x500
[   51.017266]
[   51.017266] which lock already depends on the new lock.
[   51.017266]
[   51.017909]
[   51.017909] the existing dependency chain (in reverse order) is:
[   51.018497]
[   51.018497] -> #1 (kn->count#43){++++}:
[   51.018956]        __lock_acquire+0x7cf/0x1a20
[   51.019317]        lock_acquire+0x17d/0x390
[   51.019658]        __kernfs_remove+0x892/0xae0
[   51.020020]        kernfs_remove_by_name_ns+0x78/0x110
[   51.020435]        sysfs_remove_link+0x55/0xb0
[   51.020832]        sysfs_slab_add+0xc1/0x3e0
[   51.021332]        __kmem_cache_create+0x155/0x200
[   51.021720]        create_cache+0xf5/0x320
[   51.022054]        kmem_cache_create_usercopy+0x179/0x320
[   51.022486]        kmem_cache_create+0x1a/0x30
[   51.022867]        nfsd_reply_cache_init+0x278/0x560
[   51.023266]        nfsd_init_net+0x20f/0x5e0
[   51.023623]        ops_init+0xcb/0x4b0
[   51.023928]        setup_net+0x2fe/0x670
[   51.024315]        copy_net_ns+0x30a/0x3f0
[   51.024653]        create_new_namespaces+0x3c5/0x820
[   51.025257]        unshare_nsproxy_namespaces+0xd1/0x240
[   51.025881]        ksys_unshare+0x506/0x9c0
[   51.026381]        __x64_sys_unshare+0x3a/0x50
[   51.026937]        do_syscall_64+0x110/0x10b0
[   51.027509]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   51.028175]
[   51.028175] -> #0 (slab_mutex){+.+.}:
[   51.028817]        validate_chain+0x1c51/0x2cc0
[   51.029422]        __lock_acquire+0x7cf/0x1a20
[   51.029947]        lock_acquire+0x17d/0x390
[   51.030438]        __mutex_lock+0x100/0xfa0
[   51.030995]        mutex_lock_nested+0x27/0x30
[   51.031516]        slab_attr_store+0xa2/0x130
[   51.032020]        sysfs_kf_write+0x11d/0x180
[   51.032529]        kernfs_fop_write+0x32a/0x500
[   51.033056]        do_loop_readv_writev+0x21d/0x310
[   51.033627]        do_iter_write+0x2e5/0x380
[   51.034148]        vfs_writev+0x170/0x310
[   51.034616]        do_pwritev+0x13e/0x160
[   51.035100]        __x64_sys_pwritev+0xa3/0x110
[   51.035633]        do_syscall_64+0x110/0x10b0
[   51.036200]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   51.036924]
[   51.036924] other info that might help us debug this:
[   51.036924]
[   51.037876]  Possible unsafe locking scenario:
[   51.037876]
[   51.038556]        CPU0                    CPU1
[   51.039130]        ----                    ----
[   51.039676]   lock(kn->count#43);
[   51.040084]                                lock(slab_mutex);
[   51.040597]                                lock(kn->count#43);
[   51.041062]   lock(slab_mutex);
[   51.041320]
[   51.041320]  *** DEADLOCK ***
[   51.041320]
[   51.041793] 3 locks held by trinity-c2/886:
[   51.042128]  #0: 000000001f55e152 (sb_writers#5){.+.+}, at: vfs_writev+0x2b9/0x310
[   51.042739]  #1: 00000000c7d6c034 (&of->mutex){+.+.}, at: kernfs_fop_write+0x25b/0x500
[   51.043400]  #2: 00000000ac662005 (kn->count#43){++++}, at: kernfs_fop_write+0x286/0x500

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 3ba75830ce "drc containerization"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-06-01 17:44:45 -04:00
Linus Torvalds
17e0a7cb6a Misc cleanups, with an emphasis on removing obsolete/dead code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VLcQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iFnhAArGBqco3C2RPQugv7UDDbKEaMvxOGrc5B
 kwnyOS/k/yeIkfhT9u11oBuLcaj/Zgw8YCjFyRfaNsorRqnytLyZzZ6PvdCCE3YU
 X3DVYgulcdAQnM4bS2e3Kt9ciJvFxB27XNm0AfuyLMUxMqCD+iIO4gJ6TuQNBYy3
 dfUMfB1R9OUDW13GCrASe+p1Dw76uaqVngdFWJhnC8Rm49E6gFXq7CLQp5Cka81I
 KZeJ8I6ug9p3gqhOIXdi+S6g5CM5jf86Wkk7dOHwHFH7CceFb3FIz7z0n1je4Wgd
 L5rYX7+PwfNeZ73GIuvEBN+agJH2K0H/KmnlWNWeZHzc+J12MeruSdSMBIkBOEpn
 iSbYAOmDpQLzBjTdZjC8bDqTZf472WrTh4VwN9NxHLucjdC+IqGoTAvnyyEOmZ5o
 R7sv7Q++316CVwRhYVXbzwZcqtiinCDE1EkP5nKTo9z3z0kMF5+ce/k7wn5sgZIk
 zJq3LXtaToiDoDRAPGxcvFPts9MdC0EI1aKTIjaK/n6i2h/SpJfrTKgANWaldYTe
 XJIqlSB43saqf5YAQ3/sY+wnpCRBmmCU+sfKja4C8bH7RuggI3mZS19uhFs0Qctq
 Yx5bIXVSBAIqjJtgzQ0WAAZ5LrCpNNyAzb35ZYefQlGyJlx1URKXVBmxa6S99biU
 KiYX7Dk5uhQ=
 =0ZQd
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, with an emphasis on removing obsolete/dead code"

* tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/spinlock: Remove obsolete ticket spinlock macros and types
  x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit
  x86/apb_timer: Drop unused declaration and macro
  x86/apb_timer: Drop unused TSC calibration
  x86/io_apic: Remove unused function mp_init_irq_at_boot()
  x86/mm: Stop printing BRK addresses
  x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()
  x86/nmi: Remove edac.h include leftover
  mm: Remove MPX leftovers
  x86/mm/mmap: Fix -Wmissing-prototypes warnings
  x86/early_printk: Remove unused includes
  crash_dump: Remove no longer used saved_max_pfn
  x86/smpboot: Remove the last ICPU() macro
2020-06-01 13:47:10 -07:00
Anders Roxell
d194e12b3e fs: ext4: default KUNIT_* fragments to KUNIT_ALL_TESTS
This makes it easier to enable all KUnit fragments.

Adding 'if !KUNIT_ALL_TESTS' so individual tests can not be turned off.
Therefore if KUNIT_ALL_TESTS is enabled that will hide the prompt in
menuconfig.

Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2020-06-01 14:24:58 -06:00
Linus Torvalds
60056060be The biggest change to core locking facilities in this cycle is the introduction
of local_lock_t - this primitive comes from the -rt project and identifies
 CPU-local locking dependencies normally handled opaquely beind preempt_disable()
 or local_irq_save/disable() critical sections.
 
 The generated code on mainline kernels doesn't change as a result, but still there
 are benefits: improved debugging and better documentation of data structure
 accesses.
 
 The new local_lock_t primitives are introduced and then utilized in a couple of
 kernel subsystems. No change in functionality is intended.
 
 There's also other smaller changes and cleanups.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VAogRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h67BAAusYb44jJyZUE74rmaLnJr0c6j7eJ6twT
 8LKRwxb21Y35DMuX6M5ewmvnHiLFYmjL728z+y8O+SP8vb4PSJBX/75X+wsawIJB
 cjHdxonyynVVC4zcbdrc37FsrOiVoKLbbZcpqRzHksKkCq2PHbFVxBNvEaKHZCWW
 1jnq0MRy9wEJtW9EThDWPLD+OPWhBvocUFYJH4fiqCIaDiip/E16fz3i+yMPt545
 Jz4Ibnsq+G5Ehm1N2AkaZuK9V9nYv85E7Z/UNiK4mkDOApE6OMS+q3d86BhqgPg5
 g/HL3HNXAtIY74tBYAac5tAQglT+283LuTpEPt9BEjNM7QxKg/ecXO7lwtn7Boku
 dACMqeuMHbLyru8uhbun/VBx1gca7HIhW1cvXO5OoR7o78fHpEFivjJ0B0OuSYAI
 y+/DsA41OlkWSEnboUs+zTQgFatqxQPke92xpGOJtjVVZRYHRqxcPtw9WFmoVqWA
 HeczDQLcSUhqbKSfr6X9BO2u3qxys5BzmImTKMqXEQ4d8Kk0QXbJgGYGfS8+ASey
 Am/jwUP3Cvzs99NxLH5gECKRSuTx3rY7nRGaIBYa+Ui575bdSF8sVAF13riB2mBp
 NJq2Pw0D36WcX7ecaC2Fk2ezkphbeuAr8E7gh/Mt/oVxjrfwRGfPMrnIwKygUydw
 1W5x+WZ+WsY=
 =TBTY
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "The biggest change to core locking facilities in this cycle is the
  introduction of local_lock_t - this primitive comes from the -rt
  project and identifies CPU-local locking dependencies normally handled
  opaquely beind preempt_disable() or local_irq_save/disable() critical
  sections.

  The generated code on mainline kernels doesn't change as a result, but
  still there are benefits: improved debugging and better documentation
  of data structure accesses.

  The new local_lock_t primitives are introduced and then utilized in a
  couple of kernel subsystems. No change in functionality is intended.

  There's also other smaller changes and cleanups"

* tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  zram: Use local lock to protect per-CPU data
  zram: Allocate struct zcomp_strm as per-CPU memory
  connector/cn_proc: Protect send_msg() with a local lock
  squashfs: Make use of local lock in multi_cpu decompressor
  mm/swap: Use local_lock for protection
  radix-tree: Use local_lock for protection
  locking: Introduce local_lock()
  locking/lockdep: Replace zero-length array with flexible-array
  locking/rtmutex: Remove unused rt_mutex_cmpxchg_relaxed()
2020-06-01 13:03:31 -07:00
Linus Torvalds
4d67829e11 fsverity updates for 5.8
Fix kerneldoc warnings and some coding style inconsistencies.
 This mirrors the similar cleanups being done in fs/crypto/.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXtSdTBQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK8m/AP9+n5FpIxE2X6aYTVLweKIQ2bqfO/5K
 5WyPlW5zdMEDyQD+OT8bjqVTDxTI0/c+MBOidwvJF6kUyZyVze3M0pE7OQg=
 =b+RP
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity updates from Eric Biggers:
 "Fix kerneldoc warnings and some coding style inconsistencies.

  This mirrors the similar cleanups being done in fs/crypto/"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fs-verity: remove unnecessary extern keywords
  fs-verity: fix all kerneldoc warnings
2020-06-01 12:11:56 -07:00
Linus Torvalds
afdb0f2ec5 fscrypt updates for 5.8
- Add the IV_INO_LBLK_32 encryption policy flag which modifies the
   encryption to be optimized for eMMC inline encryption hardware.
 
 - Make the test_dummy_encryption mount option for ext4 and f2fs support
   v2 encryption policies.
 
 - Fix kerneldoc warnings and some coding style inconsistencies.
 
 There will be merge conflicts with the ext4 and f2fs trees due to the
 test_dummy_encryption change, but the resolutions are straightforward.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXtScMBQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKxC6AP0eOEkMrc9e10YftdN6xsyRjvqiPyFg
 oMjuU+SvQ+/sVgEAo0mBFITnl75ZGb8PyqXCNMDAy6uHaxcEjVGufx5q2QE=
 =dbxy
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:

 - Add the IV_INO_LBLK_32 encryption policy flag which modifies the
   encryption to be optimized for eMMC inline encryption hardware.

 - Make the test_dummy_encryption mount option for ext4 and f2fs support
   v2 encryption policies.

 - Fix kerneldoc warnings and some coding style inconsistencies.

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: add support for IV_INO_LBLK_32 policies
  fscrypt: make test_dummy_encryption use v2 by default
  fscrypt: support test_dummy_encryption=v2
  fscrypt: add fscrypt_add_test_dummy_key()
  linux/parser.h: add include guards
  fscrypt: remove unnecessary extern keywords
  fscrypt: name all function parameters
  fscrypt: fix all kerneldoc warnings
2020-06-01 12:10:17 -07:00
Linus Torvalds
829f3b9401 Fixes and new features for pstore
- refactor pstore locking for safer module unloading (Kees Cook)
 - remove orphaned records from pstorefs when backend unloaded (Kees Cook)
 - refactor dump_oops parameter into max_reason (Pavel Tatashin)
 - introduce pstore/zone for common code for contiguous storage (WeiXiong Liao)
 - introduce pstore/blk for block device backend (WeiXiong Liao)
 - introduce mtd backend (WeiXiong Liao)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl7UbYYWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpkgD/9/09OkJIWydwk2lr2T89HW5fSF
 5uBT0a309/QDUpnV9yhcRsrESEicnvbtaGxD0kuYIInkiW/2cj1l689EkyRjUmy9
 q3z4GzLqOlC7qvd7LUPFNGHmllBb09H/CxmXDxRP3aynB9oHzdpNQdPcpLBDA00r
 0byp/AE48dFbKIhtT0QxpGUYZFOlyc7XVAaOkED4bmu148gx8q7MU1AxFgbx0Feb
 9iPV0r6XYMgXJZ3sn/3PJsxF0V/giDSJ8ui2xsYRjCE408zVIYLdDs2e8dz+2yW6
 +3Lyankgo+ofZc4XYExTYgn3WjhPFi+pjVRUaj+BcyTk9SLNIj2WmZdmcLMuzanh
 BaUurmED7ffTtlsH4PhQgn8/OY4FX2PO2MwUHwlU+87Y8YDiW0lpzTq5H822OO8p
 QQ8awql/6lLCJuyzuWIciVUsS65MCPxsZ4+LSiMZzyYpWu1sxrEY8ic3agzCgsA0
 0i+4nZFlLG+Aap/oiKpegenkIyAunn2tDXAyFJFH6qLOiZJ78iRuws3XZqjCElhJ
 XqvyDJIfjkJhWUb++ckeqX7ThOR4CPSnwba/7GHv7NrQWuk3Cn+GQ80oxydXUY6b
 2/4eYjq0wtvf9NeuJ4/LYNXotLR/bq9zS0zqwTWG50v+RPmuC3bNJB+RmF7fCiCG
 jo1Sd1LMeTQ7bnULpA==
 =7s1u
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "Fixes and new features for pstore.

  This is a pretty big set of changes (relative to past pstore pulls),
  but it has been in -next for a while. The biggest change here is the
  ability to support a block device as a pstore backend, which has been
  desired for a while. A lot of additional fixes and refactorings are
  also included, mostly in support of the new features.

   - refactor pstore locking for safer module unloading (Kees Cook)

   - remove orphaned records from pstorefs when backend unloaded (Kees
     Cook)

   - refactor dump_oops parameter into max_reason (Pavel Tatashin)

   - introduce pstore/zone for common code for contiguous storage
     (WeiXiong Liao)

   - introduce pstore/blk for block device backend (WeiXiong Liao)

   - introduce mtd backend (WeiXiong Liao)"

* tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (35 commits)
  mtd: Support kmsg dumper based on pstore/blk
  pstore/blk: Introduce "best_effort" mode
  pstore/blk: Support non-block storage devices
  pstore/blk: Provide way to query pstore configuration
  pstore/zone: Provide way to skip "broken" zone for MTD devices
  Documentation: Add details for pstore/blk
  pstore/zone,blk: Add ftrace frontend support
  pstore/zone,blk: Add console frontend support
  pstore/zone,blk: Add support for pmsg frontend
  pstore/blk: Introduce backend for block devices
  pstore/zone: Introduce common layer to manage storage zones
  ramoops: Add "max-reason" optional field to ramoops DT node
  pstore/ram: Introduce max_reason and convert dump_oops
  pstore/platform: Pass max_reason to kmesg dump
  printk: Introduce kmsg_dump_reason_str()
  printk: honor the max_reason field in kmsg_dumper
  printk: Collapse shutdown types into a single dump reason
  pstore/ftrace: Provide ftrace log merging routine
  pstore/ram: Refactor ftrace buffer merging
  pstore/ram: Refactor DT size parsing
  ...
2020-06-01 12:07:34 -07:00
Linus Torvalds
81e8c10dac Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Introduce crypto_shash_tfm_digest() and use it wherever possible.
   - Fix use-after-free and race in crypto_spawn_alg.
   - Add support for parallel and batch requests to crypto_engine.

  Algorithms:
   - Update jitter RNG for SP800-90B compliance.
   - Always use jitter RNG as seed in drbg.

  Drivers:
   - Add Arm CryptoCell driver cctrng.
   - Add support for SEV-ES to the PSP driver in ccp"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
  crypto: hisilicon - fix driver compatibility issue with different versions of devices
  crypto: engine - do not requeue in case of fatal error
  crypto: cavium/nitrox - Fix a typo in a comment
  crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
  crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
  crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
  crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
  crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
  crypto: hisilicon/qm - add debugfs to the QM state machine
  crypto: hisilicon/qm - add debugfs for QM
  crypto: stm32/crc32 - protect from concurrent accesses
  crypto: stm32/crc32 - don't sleep in runtime pm
  crypto: stm32/crc32 - fix multi-instance
  crypto: stm32/crc32 - fix run-time self test issue.
  crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
  crypto: hisilicon/zip - Use temporary sqe when doing work
  crypto: hisilicon - add device error report through abnormal irq
  crypto: hisilicon - remove codes of directly report device errors through MSI
  crypto: hisilicon - QM memory management optimization
  crypto: hisilicon - unify initial value assignment into QM
  ...
2020-06-01 12:00:10 -07:00
Arnd Bergmann
37744feebc sh: remove sh5 support
sh5 never became a product and has probably never really worked.

Remove it by recursively deleting all associated Kconfig options
and all corresponding files.

Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rich Felker <dalias@libc.org>
2020-06-01 14:48:52 -04:00
Rafael J. Wysocki
be6018a44c Merge branches 'pm-core' and 'pm-sleep'
* pm-core:
  PM: runtime: Replace pm_runtime_callbacks_present()
  PM: runtime: clk: Fix clk_pm_runtime_get() error path
  PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend()

* pm-sleep:
  PM: hibernate: Restrict writes to the resume device
  PM: hibernate: Split off snapshot dev option
  PM: hibernate: Incorporate concurrency handling
  PM: sleep: Helpful edits for devices.rst documentation
  Documentation: PM: sleep: Update driver flags documentation
  PM: sleep: core: Rename DPM_FLAG_LEAVE_SUSPENDED
  PM: sleep: core: Rename DPM_FLAG_NEVER_SKIP
  PM: sleep: core: Rename dev_pm_smart_suspend_and_suspended()
  PM: sleep: core: Rename dev_pm_may_skip_resume()
  PM: sleep: core: Rework the power.may_skip_resume handling
  PM: sleep: core: Do not skip callbacks in the resume phase
  PM: sleep: core: Fold functions into their callers
  PM: sleep: core: Simplify the SMART_SUSPEND flag handling
2020-06-01 15:19:08 +02:00
Xiubo Li
e64f44a884 ceph: skip checking caps when session reconnecting and releasing reqs
It make no sense to check the caps when reconnecting to mds. And
for the async dirop caps, they will be put by its _cb() function,
so when releasing the requests, it will make no sense too.

URL: https://tracker.ceph.com/issues/45635
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Xiubo Li
ea8412b284 ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock
send_mds_reconnect takes the s_mutex while the mdsc->mutex is already
held. That inverts the locking order documented in mds_client.h. Drop
the mdsc->mutex, acquire the s_mutex and then reacquire the mdsc->mutex
to prevent a deadlock.

URL: https://tracker.ceph.com/issues/45609
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Luis Henriques
878dabb641 ceph: don't return -ESTALE if there's still an open file
Similarly to commit 03f219041f ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at.  The issue has been detected by
xfstest generic/467, when doing:

 - name_to_handle_at("/cephfs/myfile")
 - open("/cephfs/myfile")
 - unlink("/cephfs/myfile")
 - sync; sync;
 - drop caches
 - open_by_handle_at()

The call to open_by_handle_at should not fail because the file hasn't been
deleted yet (only unlinked) and we do have a valid handle to it.  -ESTALE
shall be returned only if i_nlink is 0 *and* i_count is 1.

This patch also makes sure we have LINK caps before checking i_nlink.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Luis Henriques
dffdcd7145 ceph: allow rename operation under different quota realms
Returning -EXDEV when trying to 'mv' files/directories from different
quota realms results in copy+unlink operations instead of the faster
CEPH_MDS_OP_RENAME.  This will occur even when there aren't any quotas
set in the destination directory, or if there's enough space left for
the new file(s).

This patch adds a new helper function to be called on rename operations
which will allow these operations if they can be executed.  This patch
mimics userland fuse client commit b8954e5734b3 ("client:
optimize rename operation under different quota root").

Since ceph_quota_is_same_realm() is now called only from this new
helper, make it static.

URL: https://tracker.ceph.com/issues/44791
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Luis Henriques
daa668fbac ceph: normalize 'delta' parameter usage in check_quota_exceeded
Function check_quota_exceeded() uses delta parameter only for the
QUOTA_CHECK_MAX_BYTES_OP operation.  Using this parameter also for
MAX_FILES will makes the code cleaner and will be required to support
cross-quota-tree renames.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Jeff Layton
829ad4db95 ceph: ceph_kick_flushing_caps needs the s_mutex
The mdsc->cap_dirty_lock is not held while walking the list in
ceph_kick_flushing_caps, which is not safe.

ceph_early_kick_flushing_caps does something similar, but the
s_mutex is held while it's called and I think that guards against
changes to the list.

Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
and add some clarifying comments.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Jeff Layton
d67c72e6cc ceph: request expedited service on session's last cap flush
When flushing a lot of caps to the MDS's at once (e.g. for syncfs),
we can end up waiting a substantial amount of time for MDS replies, due
to the fact that it may delay some of them so that it can batch them up
together in a single journal transaction. This can lead to stalls when
calling sync or syncfs.

What we'd really like to do is request expedited service on the _last_
cap we're flushing back to the server. If the CHECK_CAPS_FLUSH flag is
set on the request and the current inode was the last one on the
session->s_cap_dirty list, then mark the request with
CEPH_CLIENT_CAPS_SYNC.

Note that this heuristic is not perfect. New inodes can race onto the
list after we've started flushing, but it does seem to fix some common
use cases.

URL: https://tracker.ceph.com/issues/44744
Reported-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
1cf03a68e7 ceph: convert mdsc->cap_dirty to a per-session list
This is a per-sb list now, but that makes it difficult to tell when
the cap is the last dirty one associated with the session. Switch
this to be a per-session list, but continue using the
mdsc->cap_dirty_lock to protect the lists.

This list is only ever walked in ceph_flush_dirty_caps, so change that
to walk the sessions array and then flush the caps for inodes on each
session's list.

If the auth cap ever changes while the inode has dirty caps, then
move the inode to the appropriate session for the new auth_cap. Also,
ensure that we never remove an auth cap while the inode is still on the
s_cap_dirty list.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Yan, Zheng
6f05b30ea0 ceph: reset i_requested_max_size if file write is not wanted
write can stuck at waiting for larger max_size in following sequence of
events:

- client opens a file and writes to position 'A' (larger than unit of
  max size increment)
- client closes the file handle and updates wanted caps (not wanting
  file write caps)
- client opens and truncates the file, writes to position 'A' again.

At the 1st event, client set inode's requested_max_size to 'A'. At the
2nd event, mds removes client's writable range, but client does not reset
requested_max_size. At the 3rd event, client does not request max size
because requested_max_size is already larger than 'A'.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
88828190f0 ceph: throw a warning if we destroy session with mutex still locked
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
dc3da0461c ceph: fix potential race in ceph_check_caps
Nothing ensures that session will still be valid by the time we
dereference the pointer. Take and put a reference.

In principle, we should always be able to get a reference here, but
throw a warning if that's ever not the case.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
4fb5dda39c ceph: document what protects i_dirty_item and i_flushing_item
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
7833323363 ceph: don't take i_ceph_lock in handle_cap_import
Just take it before calling it. This means we have to do a couple of
minor in-memory operations under the spinlock now, but those shouldn't
be an issue.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
7391fba267 ceph: don't release i_ceph_lock in handle_cap_trunc
There's no reason to do this here. Just have the caller handle it.
Also, add a lockdep assertion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
d7dbfb4f2b ceph: add comments for handle_cap_flush_ack logic
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
681ac63488 ceph: split up __finish_cap_flush
This function takes a mdsc argument or ci argument, but if both are
passed in, it ignores the ci arg. Fortunately, nothing does that, but
there's no good reason to have the same function handle both cases.

Also, get rid of some branches and just use |= to set the wake_* vals.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
0a454bdd50 ceph: reorganize __send_cap for less spinlock abuse
Get rid of the __releases annotation by breaking it up into two
functions: __prep_cap which is done under the spinlock and __send_cap
that is done outside it. Add new fields to cap_msg_args for the wake
boolean and old_xattr_buf pointer.

Nothing checks the return value from __send_cap, so make it void
return.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
70c948206f ceph: add metadata perf metric support
Add a new "r_ended" field to struct ceph_mds_request and use that to
maintain the average latency of MDS requests.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
97e27aaa9a ceph: add read/write latency metric support
Calculate the latency for OSD read requests. Add a new r_end_stamp
field to struct ceph_osd_request that will hold the time of that
the reply was received. Use that to calculate the RTT for each call,
and divide the sum of those by number of calls to get averate RTT.

Keep a tally of RTT for OSD writes and number of calls to track average
latency of OSD writes.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
1af16d547f ceph: add caps perf metric for each superblock
Count hits and misses in the caps cache. If the client has all of
the necessary caps when a task needs references, then it's counted
as a hit. Any other situation is a miss.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
f9009efac4 ceph: add dentry lease metric support
For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.

For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.

The output looks like:

item          total           miss            hit
-------------------------------------------------
d_lease       11              7               141

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Steve French
adbb2dafe7 cifs: minor fix to two debug messages
Joe Perches pointed out that we were missing a newline
at the end of two debug messages

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Joe Perches
a0a3036b81 cifs: Standardize logging output
Use pr_fmt to standardize all logging for fs/cifs.

Some logging output had no CIFS: specific prefix.

Now all output has one of three prefixes:

o CIFS:
o CIFS: VFS:
o Root-CIFS:

Miscellanea:

o Convert printks to pr_<level>
o Neaten macro definitions
o Remove embedded CIFS: prefixes from formats
o Convert "illegal" to "invalid"
o Coalesce formats
o Add missing '\n' format terminations
o Consolidate multiple cifs_dbg continuations into single calls
o More consistent use of upper case first word output logging
o Multiline statement argument alignment and wrapping

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Steve French
82e9367c43 smb3: Add new parm "nodelete"
In order to handle workloads where it is important to make sure that
a buggy app did not delete content on the drive, the new mount option
"nodelete" allows standard permission checks on the server to work,
but prevents on the client any attempts to unlink a file or delete
a directory on that mount point.  This can be helpful when running
a little understood app on a network mount that contains important
content that should not be deleted.

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-01 00:10:18 -05:00
Ronnie Sahlberg
b2ca6c2c9e cifs: move some variables off the stack in smb2_ioctl_query_info
Move some large data structures off the stack and into dynamically
allocated memory in the function smb2_ioctl_query_info

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Ronnie Sahlberg
a7d5c29462 cifs: reduce stack use in smb2_compound_op
Move a lot of structures and arrays off the stack and into a dynamically
allocated structure instead.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Paulo Alcantara
baf3f08ef4 cifs: get rid of unused parameter in reconn_setup_dfs_targets()
The target iterator parameter "it" is not used in
reconn_setup_dfs_targets(), so just remove it.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Paulo Alcantara
e4af35fa55 cifs: handle hostnames that resolve to same ip in failover
In order to support reconnect to hostnames that resolve to same ip
address, besides relying on the currently set hostname to match DFS
targets, attempt to resolve the targets and then match their addresses
with the reconnected server ip address.

For instance, if we have two hostnames "FOO" and "BAR", and both
resolve to the same ip address, we would be able to handle failover in
DFS paths like

    \\FOO\dfs\link1 -> [ \BAZ\share2 (*), \BAR\share1 ]
    \\FOO\dfs\link2 -> [ \BAZ\share2 (*), \FOO\share1 ]

so when "BAZ" is no longer accessible, link1 and link2 would get
reconnected despite having different target hostnames.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Paulo Alcantara
aaa3aef34d cifs: set up next DFS target before generic_ip_connect()
If we mount a very specific DFS link

    \\FS0.FOO.COM\dfs\link -> \FS0\share1, \FS1\share2

where its target list contains NB names ("FS0" & "FS1") rather than
FQDN ones ("FS0.FOO.COM" & "FS1.FOO.COM"), we end up connecting to
\FOO\share1 but server->hostname will have "FOO.COM".  The reason is
because both "FS0" and "FS0.FOO.COM" resolve to same IP address and
they share same TCP server connection, but "FS0.FOO.COM" was the first
hostname set -- which is OK.

However, if the echo thread timeouts and we still have a good
connection to "FS0", in cifs_reconnect()

    rc = generic_ip_connect(server) -> success
    if (rc) {
            ...
            reconn_inval_dfs_target(server, cifs_sb, &tgt_list,
	                            &tgt_it);
            ...
     }
     ...

it successfully reconnects to "FS0" server but does not set up next
DFS target - which should be the same target server "\FS0\share1" -
and server->hostname remains set to "FS0.FOO.COM" rather than "FS0",
as reconn_inval_dfs_target() would have it set to "FS0" if called
earlier.

Finally, in __smb2_reconnect(), the reconnect of tcons would fail
because tcon->ses->server->hostname (FS0.FOO.COM) does not match DFS
target's hostname (FS0).

Fix that by calling reconn_inval_dfs_target() before
generic_ip_connect() so server->hostname will get updated correctly
prior to reconnecting its tcons in __smb2_reconnect().

With "cifs: handle hostnames that resolve to same ip in failover"
patch

    - The above problem would not occur.
    - We could save an DNS query to find out that they both resolve to
      the same ip address.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Colin Ian King
136a5dc330 cifs: remove redundant initialization of variable rc
The variable rc is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:18 -05:00
Kenneth D'souza
8fd6e1d694 cifs: handle "nolease" option for vers=1.0
The "nolease" mount option is only supported for SMB2+ mounts.
Fail with appropriate error message if vers=1.0 option is passed.

Signed-off-by: Kenneth D'souza <kdsouza@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-01 00:10:17 -05:00
Kees Cook
f8feafeaee pstore/blk: Introduce "best_effort" mode
In order to use arbitrary block devices as a pstore backend, provide a
new module param named "best_effort", which will allow using any block
device, even if it has not provided a panic_write callback.

Link: https://lore.kernel.org/lkml/20200511233229.27745-12-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-31 19:49:01 -07:00
WeiXiong Liao
7dcb7848ba pstore/blk: Support non-block storage devices
Add support for non-block devices (e.g. MTD). A non-block driver calls
pstore_blk_register_device() to register iself.

In addition, pstore/zone is updated to handle non-block devices,
where an erase must be done before a write. Without this, there is no
way to remove records stored to an MTD.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-10-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-31 19:49:00 -07:00
WeiXiong Liao
1525fb3bb6 pstore/blk: Provide way to query pstore configuration
In order to configure itself, the MTD backend needs to be able to query
the current pstore configuration. Introduce pstore_blk_get_config() for
this purpose.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-9-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-31 19:49:00 -07:00
WeiXiong Liao
335426c6dc pstore/zone: Provide way to skip "broken" zone for MTD devices
One requirement to support MTD devices in pstore/zone is having a
way to declare certain regions as broken. Add this support to
pstore/zone.

The MTD driver should return -ENOMSG when encountering a bad region,
which tells pstore/zone to skip and try the next one.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-8-keescook@chromium.org/
Co-developed-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: //lore.kernel.org/lkml/20200512173801.222666-1-colin.king@canonical.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-31 19:48:56 -07:00
David S. Miller
1806c13dc2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31 17:48:46 -07:00
David Howells
a310082f6d afs: Rename struct afs_fs_cursor to afs_operation
As a prelude to implementing asynchronous fileserver operations in the afs
filesystem, rename struct afs_fs_cursor to afs_operation.

This struct is going to form the core of the operation management and is
going to acquire more members in later.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:52 +01:00
David Howells
7126ead910 afs: Remove the error argument from afs_protocol_error()
Remove the error argument from afs_protocol_error() as it's always
-EBADMSG.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:52 +01:00
David Howells
38355eec6a afs: Set error flag rather than return error from file status decode
Set a flag in the call struct to indicate an unmarshalling error rather
than return and handle an error from the decoding of file statuses.  This
flag is checked on a successful return from the delivery function.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
8230fd8217 afs: Make callback processing more efficient.
afs_vol_interest objects represent the volume IDs currently being accessed
from a fileserver.  These hold lists of afs_cb_interest objects that
repesent the superblocks using that volume ID on that server.

When a callback notification from the server telling of a modification by
another client arrives, the volume ID specified in the notification is
looked up in the server's afs_vol_interest list.  Through the
afs_cb_interest list, the relevant superblocks can be iterated over and the
specific inode looked up and marked in each one.

Make the following efficiency improvements:

 (1) Hold rcu_read_lock() over the entire processing rather than locking it
     each time.

 (2) Do all the callbacks for each vid together rather than individually.
     Each volume then only needs to be looked up once.

 (3) afs_vol_interest objects are now stored in an rb_tree rather than a
     flat list to reduce the lookup step count.

 (4) afs_vol_interest lookup is now done with RCU, but because it's in an
     rb_tree which may rotate under us, a seqlock is used so that if it
     changes during the walk, we repeat the walk with a lock held.

With this and the preceding patch which adds RCU-based lookups in the inode
cache, target volumes/vnodes can be taken without the need to take any
locks, except on the target itself.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
6d043a5782 afs: Show more information in /proc/net/afs/servers
Show more information in /proc/net/afs/servers to make it easier to see
what's going on with the server probing.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
f6cbb368bc afs: Actively poll fileservers to maintain NAT or firewall openings
When an AFS client accesses a file, it receives a limited-duration callback
promise that the server will notify it if another client changes a file.
This callback duration can be a few hours in length.

If a client mounts a volume and then an application prevents it from being
unmounted, say by chdir'ing into it, but then does nothing for some time,
the rxrpc_peer record will expire and rxrpc-level keepalive will cease.

If there is NAT or a firewall between the client and the server, the route
back for the server may close after a comparatively short duration, meaning
that attempts by the server to notify the client may then bounce.

The client, however, may (so far as it knows) still have a valid unexpired
promise and will then rely on its cached data and will not see changes made
on the server by a third party until it incidentally rechecks the status or
the promise needs renewal.

To deal with this, the client needs to regularly probe the server.  This
has two effects: firstly, it keeps a route open back for the server, and
secondly, it causes the server to disgorge any notifications that got
queued up because they couldn't be sent.

Fix this by adding a mechanism to emit regular probes.

Two levels of probing are made available: Under normal circumstances the
'slow' queue will be used for a fileserver - this just probes the preferred
address once every 5 mins or so; however, if server fails to respond to any
probes, the server will shift to the 'fast' queue from which all its
interfaces will be probed every 30s.  When it finally responds, the record
will switch back to the slow queue.

Further notes:

 (1) Probing is now no longer driven from the fileserver rotation
     algorithm.

 (2) Probes are dispatched to all interfaces on a fileserver when that an
     afs_server object is set up to record it.

 (3) The afs_server object is removed from the probe queues when we start
     to probe it.  afs_is_probing_server() returns true if it's not listed
     - ie. it's undergoing probing.

 (4) The afs_server object is added back on to the probe queue when the
     final outstanding probe completes, but the probed_at time is set when
     we're about to launch a probe so that it's not dependent on the probe
     duration.

 (5) The timer and the work item added for this must be handed a count on
     net->servers_outstanding, which they hand on or release.  This makes
     sure that network namespace cleanup waits for them.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
977e5f8ed0 afs: Split the usage count on struct afs_server
Split the usage count on the afs_server struct to have an active count that
registers who's actually using it separately from the reference count on
the object.

This allows a future patch to dispatch polling probes without advancing the
"unuse" time into the future each time we emit a probe, which would
otherwise prevent unused server records from expiring.

Included in this:

 (1) The latter part of afs_destroy_server() in which the RCU destruction
     of afs_server objects is invoked and the outstanding server count is
     decremented is split out into __afs_put_server().

 (2) afs_put_server() now calls __afs_put_server() rather then setting the
     management timer.

 (3) The calls begun by afs_fs_give_up_all_callbacks() and
     afs_fs_get_capabilities() can now take a ref on the server record, so
     afs_destroy_server() can just drop its ref and needn't wait for the
     completion of these calls.  They'll put the ref when they're done.

 (4) Because of (3), afs_fs_probe_done() no longer needs to wake up
     afs_destroy_server() with server->probe_outstanding.

 (5) afs_gc_servers can be simplified.  It only needs to check if
     server->active is 0 rather than playing games with the refcount.

 (6) afs_manage_servers() can propose a server for gc if usage == 0 rather
     than if ref == 1.  The gc is effected by (5).

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
8100680592 afs: Use the serverUnique field in the UVLDB record to reduce rpc ops
The U-version VLDB volume record retrieved by the VL.GetEntryByNameU rpc op
carries a change counter (the serverUnique field) for each fileserver
listed in the record as backing that volume.  This is incremented whenever
the registration details for a fileserver change (such as its address
list).  Note that the same value will be seen in all UVLDB records that
refer to that fileserver.

This should be checked before calling the VL server to re-query the address
list for a fileserver.  If it's the same, there's no point doing the query.

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
13fcc6356a afs: Always include dir in bulk status fetch from afs_do_lookup()
When a lookup is done in an AFS directory, the filesystem will speculate
and fetch up to 49 other statuses for files in the same directory and fetch
those as well, turning them into inodes or updating inodes that already
exist.

However, occasionally, a callback break might go missing due to NAT timing
out, but the afs filesystem doesn't then realise that the directory is not
up to date.

Alleviate this by using one of the status slots to check the directory in
which the lookup is being done.

Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Suggested-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
3f19b2ab97 vfs, afs, ext4: Make the inode hash table RCU searchable
Make the inode hash table RCU searchable so that searches that want to
access or modify an inode without taking a ref on that inode can do so
without taking the inode hash table lock.

The main thing this requires is some RCU annotation on the list
manipulation operations.  Inodes are already freed by RCU in most cases.

Users of this interface must take care as the inode may be still under
construction or may be being torn down around them.

There are at least three instances where this can be of use:

 (1) Testing whether the inode number iunique() is going to return is
     currently unique (the iunique_lock is still held).

 (2) Ext4 date stamp updating.

 (3) AFS callback breaking.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
cc: linux-ext4@vger.kernel.org
cc: linux-afs@lists.infradead.org
2020-05-31 15:19:44 +01:00
WeiXiong Liao
649304c936 Documentation: Add details for pstore/blk
Add details on using pstore/blk, the new backend of pstore to record
dumps to block devices, in Documentation/admin-guide/pstore-blk.rst

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-7-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
WeiXiong Liao
34327e9fd2 pstore/zone,blk: Add ftrace frontend support
Support backend for ftrace. To enable ftrace backend, just make
ftrace_size be greater than 0 and a multiple of 4096.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-6-keescook@chromium.org/
Co-developed-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/lkml/20200512170719.221514-1-colin.king@canonical.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
WeiXiong Liao
cc9c4d1b55 pstore/zone,blk: Add console frontend support
Support backend for console. To enable console backend, just make
console_size be greater than 0 and a multiple of 4096.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-5-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
WeiXiong Liao
0dc068265a pstore/zone,blk: Add support for pmsg frontend
Add pmsg support to pstore/blk (through pstore/zone). To enable, pmsg_size
must be greater than 0 and a multiple of 4096.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-4-keescook@chromium.org/
Co-developed-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/lkml/20200512171932.222102-1-colin.king@canonical.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
WeiXiong Liao
17639f67c1 pstore/blk: Introduce backend for block devices
pstore/blk is similar to pstore/ram, but uses a block device as the
storage rather than persistent ram.

The pstore/blk backend solves two common use-cases that used to preclude
using pstore/ram:
- not all devices have a battery that could be used to persist
  regular RAM across power failures.
- most embedded intelligent equipment have no persistent ram, which
  increases costs, instead preferring cheaper solutions, like block
  devices.

pstore/blk provides separate configurations for the end user and for the
block drivers. User configuration determines how pstore/blk operates, such
as record sizes, max kmsg dump reasons, etc. These can be set by Kconfig
and/or module parameters, but module parameter have priority over Kconfig.
Driver configuration covers all the details about the target block device,
such as total size of the device and how to perform read/write operations.
These are provided by block drivers, calling pstore_register_blkdev(),
including an optional panic_write callback used to bypass regular IO
APIs in an effort to avoid potentially destabilized kernel code during
a panic.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-3-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
WeiXiong Liao
d26c3321fe pstore/zone: Introduce common layer to manage storage zones
Implement a common set of APIs needed to support pstore storage zones,
based on how ramoops is designed. This will be used by pstore/blk with
the intention of migrating pstore/ram in the future.

Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-2-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Kees Cook
791205e3ec pstore/ram: Introduce max_reason and convert dump_oops
Now that pstore_register() can correctly pass max_reason to the kmesg
dump facility, introduce a new "max_reason" module parameter and
"max-reason" Device Tree field.

The "dump_oops" module parameter and "dump-oops" Device
Tree field are now considered deprecated, but are now automatically
converted to their corresponding max_reason values when present, though
the new max_reason setting has precedence.

For struct ramoops_platform_data, the "dump_oops" member is entirely
replaced by a new "max_reason" member, with the only existing user
updated in place.

Additionally remove the "reason" filter logic from ramoops_pstore_write(),
as that is not specifically needed anymore, though technically
this is a change in behavior for any ramoops users also setting the
printk.always_kmsg_dump boot param, which will cause ramoops to behave as
if max_reason was set to KMSG_DUMP_MAX.

Co-developed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Pavel Tatashin
3524e688b8 pstore/platform: Pass max_reason to kmesg dump
Add a new member to struct pstore_info for passing information about
kmesg dump maximum reason. This allows a finer control of what kmesg
dumps are sent to pstore storage backends.

Those backends that do not explicitly set this field (keeping it equal to
0), get the default behavior: store only Oopses and Panics, or everything
if the printk.always_kmsg_dump boot param is set.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-5-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Kees Cook
fb13cb8a04 printk: Introduce kmsg_dump_reason_str()
The pstore subsystem already had a private version of this function.
With the coming addition of the pstore/zone driver, this needs to be
shared. As it really should live with printk, move it there instead.

Link: https://lore.kernel.org/lkml/20200515184434.8470-4-keescook@chromium.org/
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Kees Cook
6d3cf962dd printk: Collapse shutdown types into a single dump reason
To turn the KMSG_DUMP_* reasons into a more ordered list, collapse
the redundant KMSG_DUMP_(RESTART|HALT|POWEROFF) reasons into
KMSG_DUMP_SHUTDOWN. The current users already don't meaningfully
distinguish between them, so there's no need to, as discussed here:
https://lore.kernel.org/lkml/CA+CK2bAPv5u1ih5y9t5FUnTyximtFCtDYXJCpuyjOyHNOkRdqw@mail.gmail.com/

Link: https://lore.kernel.org/lkml/20200515184434.8470-2-keescook@chromium.org/
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Kees Cook
16a583079e pstore/ftrace: Provide ftrace log merging routine
Move the ftrace log merging logic out of pstore/ram into pstore/ftrace
so other backends can use it, like pstore/zone.

Link: https://lore.kernel.org/lkml/20200510202436.63222-7-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
df9bf19d88 pstore/ram: Refactor ftrace buffer merging
This changes the ftrace record merging code to be agnostic of
pstore/ram, as the first step to making it available as a generic
routine for other backends to use, such as pstore/zone.

Link: https://lore.kernel.org/lkml/20200510202436.63222-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
26961d76ff pstore/ram: Refactor DT size parsing
Refactor device tree size parsing routines to be able to pass a non-zero
default value for providing a configurable default for the coming
"max_reason" field. Also rename the helpers, since we're not always
parsing a size -- we're parsing a u32 and making sure it's not greater
than INT_MAX.

Link: https://lore.kernel.org/lkml/20200506211523.15077-4-keescook@chromium.org/
Link: https://lore.kernel.org/lkml/20200521205223.175957-1-tyhicks@linux.microsoft.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
f858b57f7d pstore/ram: Adjust module param permissions to reflect reality
A couple module parameters had 0600 permissions, but changing them would
have no impact on ramoops, so switch these to 0400 to reflect reality.

Link: https://lore.kernel.org/lkml/20200506211523.15077-7-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
d973f7d83d pstore/platform: Move module params after declarations
It is easier to see how module params are used if they're near the
variables they use.

Link: https://lore.kernel.org/lkml/20200510202436.63222-4-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
d195c39052 pstore/platform: Use backend name for console registration
If the pstore backend changes, there's no indication in the logs what
the console is (it always says "pstore"). Instead, pass through the
active backend's name. (Also adjust the selftest to match.)

Link: https://lore.kernel.org/lkml/20200510202436.63222-5-keescook@chromium.org/
Link: https://lore.kernel.org/lkml/20200526135429.GQ12456@shao2-debian
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
563ca40ddf pstore/platform: Switch pstore_info::name to const
In order to more cleanly pass around backend names, make the "name" member
const. This means the module param needs to be dynamic (technically, it
was before, so this actually cleans up a minor memory leak if a backend
was specified and then gets unloaded.)

Link: https://lore.kernel.org/lkml/20200510202436.63222-3-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
b7753fc7f6 pstore: Make sure console capturing will restart
The CON_ENABLED flag gets cleared during unregister_console(), so make
sure we already reset the console flags before calling register_console(),
otherwise unloading and reloading a pstore backend will not restart
console logging.

Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
609e28bb13 pstore: Remove filesystem records when backend is unregistered
If a backend was unloaded without having first removed all its
associated records in pstorefs, subsequent removals would crash while
attempting to call into the now missing backend. Add automatic removal
from the tree in pstore_unregister(), so that no references to the
backend remain.

Reported-by: Luis Henriques <lhenriques@suse.com>
Link: https://lore.kernel.org/lkml/87o8yrmv69.fsf@suse.com
Link: https://lore.kernel.org/lkml/20200506152114.50375-11-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
78c83c828c pstore: Do not leave timer disabled for next backend
The pstore.update_ms value was being disabled during pstore_unregister(),
which would cause any prior value to go unnoticed on the next
pstore_register(). Instead, just let del_timer() stop the timer, which
was always sufficient. This additionally refactors the timer reset code
and allows the timer to be enabled if the module parameter is changed
away from the default.

Link: https://lore.kernel.org/lkml/20200506152114.50375-10-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:02 -07:00
Kees Cook
27e5041a87 pstore: Add locking around superblock changes
Nothing was protecting changes to the pstorefs superblock. Add locking
and refactor away is_pstore_mounted(), instead using a helper to add a
way to safely lock the pstorefs root inode during filesystem changes.

Link: https://lore.kernel.org/lkml/20200506152114.50375-9-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:33:46 -07:00
Chao Yu
ca7f76e680 f2fs: fix wrong discard space
Under heavy fsstress, we may triggle panic while issuing discard,
because __check_sit_bitmap() detects that discard command may earse
valid data blocks, the root cause is as below race stack described,
since we removed lock when flushing quota data, quota data writeback
may race with write_checkpoint(), so that it causes inconsistency in
between cached discard entry and segment bitmap.

- f2fs_write_checkpoint
 - block_operations
  - set_sbi_flag(sbi, SBI_QUOTA_SKIP_FLUSH)
 - f2fs_flush_sit_entries
  - add_discard_addrs
   - __set_bit_le(i, (void *)de->discard_map);
						- f2fs_write_data_pages
						 - f2fs_write_single_data_page
						   : inode is quota one, cp_rwsem won't be locked
						  - f2fs_do_write_data_page
						   - f2fs_allocate_data_block
						    - f2fs_wait_discard_bio
						      : discard entry has not been added yet.
						    - update_sit_entry
 - f2fs_clear_prefree_segments
  - f2fs_issue_discard
  : add discard entry

In order to fix this, this patch uses node_write to serialize
f2fs_allocate_data_block and checkpoint.

Fixes: 435cbab95e ("f2fs: fix quota_sync failure due to f2fs_lock_op")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-30 08:17:52 -07:00
Pavel Begunkov
7b53d59859 io_uring: fix overflowed reqs cancellation
Overflowed requests in io_uring_cancel_files() should be shed only of
inflight and overflowed refs. All other left references are owned by
someone else.

If refcount_sub_and_test() fails, it will go further and put put extra
ref, don't do that. Also, don't need to do io_wq_cancel_work()
for overflowed reqs, they will be let go shortly anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30 07:38:32 -06:00
Pavel Begunkov
bfe68a2219 io_uring: off timeouts based only on completions
Offset timeouts wait not for sqe->off non-timeout CQEs, but rather
sqe->off + number of prior inflight requests. Wait exactly for
sqe->off non-timeout completions

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30 07:38:18 -06:00
Pavel Begunkov
360428f8c0 io_uring: move timeouts flushing to a helper
Separate flushing offset timeouts io_commit_cqring() by moving it into a
helper. Just a preparation, makes following patches clearer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30 07:38:17 -06:00
Ira Weiny
e4f9ba20d3 fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()
Because of the separation of FS_XFLAG_DAX from S_DAX and the delayed
setting of S_DAX, data invalidation no longer needs to happen when
FS_XFLAG_DAX is changed.

Change xfs_ioctl_setattr_dax_invalidate() to be
xfs_ioctl_dax_check_set_cache() and alter the code to reflect the new
functionality.

Furthermore, we no longer need the locking so we remove the join_flags
logic.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29 20:13:20 -07:00
Ira Weiny
b5df628432 fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()
The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are
nearly identical.  The only difference is that *_to_linux() is called after
inode setup and disallows changing the DAX flag.

Combining them can be done with a flag which indicates if this is the initial
setup to allow the DAX flag to be properly set only at init time.

So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags()
directly.

While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and
use xfs_ip2xflags() to ensure future diflags are included correctly.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29 20:13:20 -07:00
Ira Weiny
f7bf743714 fs/xfs: Create function xfs_inode_should_enable_dax()
xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.

Change the use of xfs_inode_supports_dax() to reflect only if the inode
and underlying storage support dax.

Add a new function xfs_inode_should_enable_dax() which reflects if the
inode should be enabled for DAX.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29 20:13:20 -07:00
Ira Weiny
02beb2686f fs/xfs: Make DAX mount option a tri-state
As agreed upon[1].  We make the dax mount option a tri-state.  '-o dax'
continues to operate the same.  We add 'always', 'never', and 'inode'
(default).

[1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29 20:13:20 -07:00
Ira Weiny
fd58c62b65 fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS
In prep for the new tri-state mount option which then introduces
XFS_MOUNT_DAX_NEVER.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29 20:13:19 -07:00
Ira Weiny
c5ec874e55 fs/xfs: Remove unnecessary initialization of i_rwsem
An earlier call of xfs_reinit_inode() from xfs_iget_cache_hit() already
handles initialization of i_rwsem.

Doing so again is unneeded.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-29 20:13:19 -07:00
Eric W. Biederman
56305aa9b6 exec: Compute file based creds only once
Move the computation of creds from prepare_binfmt into begin_new_exec
so that the creds need only be computed once.  This is just code
reorganization no semantic changes of any kind are made.

Moving the computation is safe.  I have looked through the kernel and
verified none of the binfmts look at bprm->cred directly, and that
there are no helpers that look at bprm->cred indirectly.  Which means
that it is not a problem to compute the bprm->cred later in the
execution flow as it is not used until it becomes current->cred.

A new function bprm_creds_from_file is added to contain the work that
needs to be done.  bprm_creds_from_file first computes which file
bprm->executable or most likely bprm->file that the bprm->creds
will be computed from.

The funciton bprm_fill_uid is updated to receive the file instead of
accessing bprm->file.  The now unnecessary work needed to reset the
bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid.
A small comment to document that bprm_fill_uid now only deals with the
work to handle suid and sgid files.  The default case is already
heandled by prepare_exec_creds.

The function security_bprm_repopulate_creds is renamed
security_bprm_creds_from_file and now is explicitly passed the file
from which to compute the creds.  The documentation of the
bprm_creds_from_file security hook is updated to explain when the hook
is called and what it needs to do.  The file is passed from
cap_bprm_creds_from_file into get_file_caps so that the caps are
computed for the appropriate file.  The now unnecessary work in
cap_bprm_creds_from_file to reset the ambient capabilites has been
removed.  A small comment to document that the work of
cap_bprm_creds_from_file is to read capabilities from the files
secureity attribute and derive capabilities from the fact the
user had uid 0 has been added.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-29 22:00:54 -05:00
Eric W. Biederman
a7868323c2 exec: Add a per bprm->file version of per_clear
There is a small bug in the code that recomputes parts of bprm->cred
for every bprm->file.  The code never recomputes the part of
clear_dangerous_personality_flags it is responsible for.

Which means that in practice if someone creates a sgid script
the interpreter will not be able to use any of:
	READ_IMPLIES_EXEC
	ADDR_NO_RANDOMIZE
	ADDR_COMPAT_LAYOUT
	MMAP_PAGE_ZERO.

This accentially clearing of personality flags probably does
not matter in practice because no one has complained
but it does make the code more difficult to understand.

Further remaining bug compatible prevents the recomputation from being
removed and replaced by simply computing bprm->cred once from the
final bprm->file.

Making this change removes the last behavior difference between
computing bprm->creds from the final file and recomputing
bprm->cred several times.  Which allows this behavior change
to be justified for it's own reasons, and for any but hunts
looking into why the behavior changed to wind up here instead
of in the code that will follow that computes bprm->cred
from the final bprm->file.

This small logic bug appears to have existed since the code
started clearing dangerous personality bits.

History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-29 21:06:48 -05:00
Al Viro
7e71609f64 pselect6() and friends: take handling the combined 6th/7th args into helper
... and use unsafe_get_user(), while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29 19:10:42 -04:00
Linus Torvalds
e2fce151d2 Cache tiering and cap handling fixups, both marked for stable.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl7RKm4THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi6BJB/4pz7N1K3sqs3OXHsHHnMnpTmxV5lU3
 4pXDivwESypxJKBDZ96qgSNMGgL9XpxChfA/LCYVy92LvIbjr9vrUh9386Q2arqw
 nRe4kTiN7Y8HkLb47GmqzCQdxgGVC35OZJZQzdM5y9rVEH9nbEUHWhsvCHYUR8Cb
 Ndm7hT6QzLRTQzlUhu0lPfLc84R0Hl5aFJNkA7enbXL7s9yfTYRf9+zcl+8VOI09
 X01OOxsOVNoQUzhTn2Y+SDFLr5N7CNtW7UN17S6sCiiA0XgodxeWmnxl2aaVMG+z
 VbsXQPr9ma4gYaD7BjzqaPEQqpgoTrmNqPkrzSzZbFHRc+GC3S5PiLwU
 =TOVq
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Cache tiering and cap handling fixups, both marked for stable"

* tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-client:
  ceph: flush release queue when handling caps for unknown inode
  libceph: ignore pool overlay and cache logic on redirects
2020-05-29 13:59:54 -07:00
Linus Torvalds
835e36b119 Fix the previous, flawed gfs2_find_jhead commit
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl7RI6UUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpV8A/8DxTfUVzH+S8fXS6nEfJ2Q8soLeGa
 JE8ZalUmcc8G6R/hPekZbcV4NVN03PlfSMh6Jnr5o5Zz6mDsksC2Lh+i0legsm2Y
 /QPj5N/vnNbEANBtz2BBMRl8VRWyqh9wBP0UuErv+bw39EyUNRVKvRkw04gxYMdw
 kpHl8EFICsIqWcXM+1dzWTTVFlz4dXRFgglMOoFYBdx45H1uUUNx5FiU2WRsS107
 WLsWQ3znEK6iqmYfG0KLkmIuEQKUodfQ4IJX5BkrNyck+1UbSQkWJFlBsMzLOiMX
 XmmnSyGmfk8FOvb1NXk7BzlZBPSF1xt55QIeLjd0sWIyEAnqx4lTz/CRA7WCBkXo
 qCLD2EaUi1RQUNItGjnq2hPmtv6hlA2zusvh5kC2I6ojTJaYcU5Sr0jFARzONbCE
 dKJLmh3RoVA63tt4lFF7DYqWI+AXt1j50aq4CbV0GzoGYaQ4UHMtlyTUFEiVtoGO
 A4tYI23UCvJe0ozduCbSkAv8o9zHEyboIBMDlPbASKLtLMkwTxOQgJ+vyjljHgS+
 vDWvRw7auosQoLxF31bhyJlYWirNUz0SNKVveGawpBQxfXNr3CKubiDORJwZIp5s
 vZLvQ0f4CqC0sx/25cnqDRFRJaNZcxvEXNBUMuN9v2713IzU5+WzRlmykxj17yfb
 B4gBml3MurCwPFA=
 =rKvo
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.7-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:
 "Fix the previous, flawed gfs2_find_jhead commit"

* tag 'gfs2-v5.7-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Even more gfs2_find_jhead fixes
2020-05-29 13:58:13 -07:00
John Hubbard
0df5564577 orangefs: convert get_user_pages() --> pin_user_pages()
This code was using get_user_pages*(), in a "Case 1" scenario
(Direct IO), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/

Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2020-05-29 16:25:04 -04:00
Colin Ian King
22ce85611f orangefs: remove redundant assignment to variable ret
The variable ret is being initialized with a value that is
never read and it is being updated later with a new value. The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2020-05-29 16:24:57 -04:00
Christoph Hellwig
c0425a4249 net: add a new bind_add method
The SCTP protocol allows to bind multiple address to a socket.  That
feature is currently only exposed as a socket option.  Add a bind_add
method struct proto that allows to bind additional addresses, and
switch the dlm code to use the method instead of going through the
socket option from kernel space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 13:10:39 -07:00
Christoph Hellwig
40ef92c6ec sctp: add sctp_sock_set_nodelay
Add a helper to directly set the SCTP_NODELAY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 13:10:39 -07:00
Andreas Gruenbacher
20be493b78 gfs2: Even more gfs2_find_jhead fixes
Fix several issues in the previous gfs2_find_jhead fix:
* When updating @blocks_submitted, @block refers to the first block block not
  submitted yet, not the last block submitted, so fix an off-by-one error.
* We want to ensure that @blocks_submitted is far enough ahead of @blocks_read
  to guarantee that there is in-flight I/O.  Otherwise, we'll eventually end up
  waiting for pages that haven't been submitted, yet.
* It's much easier to compare the number of blocks added with the number of
  blocks submitted to limit the maximum bio size.
* Even with bio chaining, we can keep adding blocks until we reach the maximum
  bio size, as long as we stop at a page boundary.  This simplifies the logic.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2020-05-29 17:00:24 +02:00
Yufen Yu
cc23402c1c fs: fix indentation in deactivate_super()
Fix the breaked indent in deactive_super().

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29 10:35:25 -04:00
Nikolay Borisov
5ad05cc8e0 vfs: Remove duplicated d_mountpoint check in __is_local_mountpoint
This function acts as an out-of-line helper for is_local_mountpoint
is only called after the latter verifies the dentry is not a mountpoint.
There's no semantic changes and the resulting object code is smaller:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-26 (-26)
Function                                     old     new   delta
__is_local_mountpoint                        147     121     -26
Total: Before=34161, After=34135, chg -0.08%

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29 10:35:24 -04:00
Gao Xiang
34f853b849 erofs: suppress false positive last_block warning
As Andrew mentioned, some rare specific gcc versions could report
last_block uninitialized warning. Actually last_block doesn't need
to be uninitialized first from its implementation due to bio == NULL
condition. After a bio is allocated, last_block will be assigned
then.

The detailed analysis is in this thread [1]. So let's silence those
confusing gccs simply.

[1] https://lore.kernel.org/r/20200421072839.GA13867@hsiangkao-HP-ZHAN-66-Pro-G1

Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200528084844.23359-1-hsiangkao@redhat.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-05-29 18:58:13 +08:00
Chao Yu
f57a3fe449 erofs: convert to use the new mount fs_context api
Convert the erofs to use new internal mount API as the old one will
be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200529104836.17843-1-hsiangkao@redhat.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-05-29 18:57:58 +08:00
Liao Pingfang
5626de1e96 reiserfs: Replace kmalloc with kcalloc in the comment
Use kcalloc instead of kmalloc in the comment according to
the previous kcalloc() call.

Link: https://lore.kernel.org/r/1590714150-15895-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-29 08:51:39 +02:00
Ira Weiny
b383a73f2b fs/ext4: Introduce DAX inode flag
Add a flag ([EXT4|FS]_DAX_FL) to preserve FS_XFLAG_DAX in the ext4
inode.

Set the flag to be user visible and changeable.  Set the flag to be
inherited.  Allow applications to change the flag at any time except if
it conflicts with the set of mutually exclusive flags (Currently VERITY,
ENCRYPT, JOURNAL_DATA).

Furthermore, restrict setting any of the exclusive flags if DAX is set.

While conceptually possible, we do not allow setting EXT4_DAX_FL while
at the same time clearing exclusion flags (or vice versa) for 2 reasons:

	1) The DAX flag does not take effect immediately which
	   introduces quite a bit of complexity
	2) There is no clear use case for being this flexible

Finally, on regular files, flag the inode to not be cached to facilitate
changing S_DAX on the next creation of the inode.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-9-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
fcebc7949c fs/ext4: Remove jflag variable
The jflag variable serves almost no purpose.  Remove it.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-8-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
9cb20f94af fs/ext4: Make DAX mount option a tri-state
We add 'always', 'never', and 'inode' (default).  '-o dax' continues to
operate the same which is equivalent to 'always'.  This new
functionality is limited to ext4 only.

Specifically we introduce a 2nd DAX mount flag EXT4_MOUNT2_DAX_NEVER and set
it and EXT4_MOUNT_DAX_ALWAYS appropriately for the mode.

We also force EXT4_MOUNT2_DAX_NEVER if !CONFIG_FS_DAX.

Finally, EXT4_MOUNT2_DAX_INODE is used solely to detect if the user
specified that option for printing.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-7-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
043546e46d fs/ext4: Only change S_DAX on inode load
To prevent complications with in memory inodes we only set S_DAX on
inode load.  FS_XFLAG_DAX can be changed at any time and S_DAX will
change after inode eviction and reload.

Add init bool to ext4_set_inode_flags() to indicate if the inode is
being newly initialized.

Assert that S_DAX is not set on an inode which is just being loaded.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-6-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
a8ab6d3885 fs/ext4: Update ext4_should_use_dax()
S_DAX should only be enabled when the underlying block device supports
dax.

Cache the underlying support for DAX in the super block and modify
ext4_should_use_dax() to check for device support prior to the over
riding mount option.

While we are at it change the function to ext4_should_enable_dax() as
this better reflects the ask as well as matches xfs.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-5-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
fc626fe322 fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
In prep for the new tri-state mount option which then introduces
EXT4_MOUNT_DAX_NEVER.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-4-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
6c0d077ff8 fs/ext4: Disallow verity if inode is DAX
Verity and DAX are incompatible.  Changing the DAX mode due to a verity
flag change is wrong without a corresponding address_space_operations
update.

Make the 2 options mutually exclusive by returning an error if DAX was
set first.

(Setting DAX is already disabled if Verity is set first.)

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Link: https://lore.kernel.org/r/20200528150003.828793-3-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:47 -04:00
Ira Weiny
ff694ab60c fs/ext4: Narrow scope of DAX check in setflags
When preventing DAX and journaling on an inode.  Use the effective DAX
check rather than the mount option.

This will be required to support per inode DAX flags.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20200528150003.828793-2-ira.weiny@intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28 22:09:46 -04:00
Xiyu Yang
a4abc6b12e nfsd: Fix svc_xprt refcnt leak when setup callback client failed
nfsd4_process_cb_update() invokes svc_xprt_get(), which increases the
refcount of the "c->cn_xprt".

The reference counting issue happens in one exception handling path of
nfsd4_process_cb_update(). When setup callback client failed, the
function forgets to decrease the refcnt increased by svc_xprt_get(),
causing a refcnt leak.

Fix this issue by calling svc_xprt_put() when setup callback client
failed.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-28 18:15:00 -04:00
Linus Torvalds
75caf310d1 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "5 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  include/asm-generic/topology.h: guard cpumask_of_node() macro argument
  fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()
  mm: remove VM_BUG_ON(PageSlab()) from page_mapcount()
  mm,thp: stop leaking unreleased file pages
  mm/z3fold: silence kmemleak false positives of slots
2020-05-28 13:04:25 -07:00
Chao Yu
dc35d73a42 f2fs: compress: don't compress any datas after cp stop
While compressed data writeback, we need to drop dirty pages like we did
for non-compressed pages if cp stops, however it's not needed to compress
any data in such case, so let's detect cp stop condition in
cluster_may_compress() to avoid redundant compressing and let following
f2fs_write_raw_pages() drops dirty pages correctly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28 12:00:43 -07:00
Chao Yu
47d0d7d764 f2fs: remove unneeded return value of __insert_discard_tree()
We never use return value of __insert_discard_tree(), so remove it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28 12:00:43 -07:00
Chao Yu
84597b1f9b f2fs: fix wrong value of tracepoint parameter
In f2fs_lookup(), we should set @err correctly before printing it
in tracepoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28 12:00:43 -07:00
Daeho Jeong
fd6126484c f2fs: protect new segment allocation in expand_inode_data
Found a new segemnt allocation without f2fs_lock_op() in
expand_inode_data(). So, when we do fallocate() for a pinned file
and trigger checkpoint very frequently and simultaneously. F2FS gets
stuck in the below code of do_checkpoint() forever.

  f2fs_sync_meta_pages(sbi, META, LONG_MAX, FS_CP_META_IO);
  /* Wait for all dirty meta pages to be submitted for IO */
                                                <= if fallocate() here,
  f2fs_wait_on_all_pages(sbi, F2FS_DIRTY_META); <= it'll wait forever.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-05-28 12:00:43 -07:00
Alexander Potapenko
1d605416fb fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()
KMSAN reported uninitialized data being written to disk when dumping
core.  As a result, several kilobytes of kmalloc memory may be written
to the core file and then read by a non-privileged user.

Reported-by: sam <sunhaoyl@outlook.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com
Link: https://github.com/google/kmsan/issues/76
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-28 11:35:40 -07:00
Christoph Hellwig
298cd88a66 rxrpc: add rxrpc_sock_set_min_security_level
Add a helper to directly set the RXRPC_MIN_SECURITY_LEVEL sockopt from
kernel space without going through a fake uaccess.

Thanks to David Howells for the documentation updates.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:46 -07:00
Christoph Hellwig
c488aeadcb tcp: add tcp_sock_set_user_timeout
Add a helper to directly set the TCP_USER_TIMEOUT sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
12abc5ee78 tcp: add tcp_sock_set_nodelay
Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess.  Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
db10538a4b tcp: add tcp_sock_set_cork
Add a helper to directly set the TCP_CORK sockopt from kernel space
without going through a fake uaccess.  Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
26cfabf9cd net: add sock_set_rcvbuf
Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
ce3d9544ce net: add sock_set_keepalive
Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
76ee0785f4 net: add sock_set_sndtimeo
Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel
space without going through a fake uaccess.  The interface is
simplified to only pass the seconds value, as that is the only
thing needed at the moment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
b58f0e8f38 net: add sock_set_reuseaddr
Add a helper to directly set the SO_REUSEADDR sockopt from kernel space
without going through a fake uaccess.

For this the iscsi target now has to formally depend on inet to avoid
a mostly theoretical compile failure.  For actual operation it already
did depend on having ipv4 or ipv6 support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Will Deacon
082af5ec50 Merge branch 'for-next/scs' into for-next/core
Support for Clang's Shadow Call Stack in the kernel
(Sami Tolvanen and Will Deacon)
* for-next/scs:
  arm64: entry-ftrace.S: Update comment to indicate that x18 is live
  scs: Move DEFINE_SCS macro into core code
  scs: Remove references to asm/scs.h from core code
  scs: Move scs_overflow_check() out of architecture code
  arm64: scs: Use 'scs_sp' register alias for x18
  scs: Move accounting into alloc/free functions
  arm64: scs: Store absolute SCS stack pointer value in thread_info
  efi/libstub: Disable Shadow Call Stack
  arm64: scs: Add shadow stacks for SDEI
  arm64: Implement Shadow Call Stack
  arm64: Disable SCS for hypervisor code
  arm64: vdso: Disable Shadow Call Stack
  arm64: efi: Restore register x18 if it was corrupted
  arm64: Preserve register x18 when CPU is suspended
  arm64: Reserve register x18 from general allocation with SCS
  scs: Disable when function graph tracing is enabled
  scs: Add support for stack usage debugging
  scs: Add page accounting for shadow call stack allocations
  scs: Add support for Clang's Shadow Call Stack (SCS)
2020-05-28 18:03:40 +01:00
Filipe Manana
2166e5edce btrfs: fix space_info bytes_may_use underflow during space cache writeout
We always preallocate a data extent for writing a free space cache, which
causes writeback to always try the nocow path first, since the free space
inode has the prealloc bit set in its flags.

However if the block group that contains the data extent for the space
cache has been turned to RO mode due to a running scrub or balance for
example, we have to fallback to the cow path. In that case once a new data
extent is allocated we end up calling btrfs_add_reserved_bytes(), which
decrements the counter named bytes_may_use from the data space_info object
with the expection that this counter was previously incremented with the
same amount (the size of the data extent).

However when we started writeout of the space cache at cache_save_setup(),
we incremented the value of the bytes_may_use counter through a call to
btrfs_check_data_free_space() and then decremented it through a call to
btrfs_prealloc_file_range_trans() immediately after. So when starting the
writeback if we fallback to cow mode we have to increment the counter
bytes_may_use of the data space_info again to compensate for the extent
allocation done by the cow path.

When this issue happens we are incorrectly decrementing the bytes_may_use
counter and when its current value is smaller then the amount we try to
subtract we end up with the following warning:

 ------------[ cut here ]------------
 WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
 Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
 CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
 Workqueue: writeback wb_workfn (flush-btrfs-1591)
 RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
 Code: ff ff 48 (...)
 RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
 RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
 RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
 R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
 R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
 FS:  0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  find_free_extent+0x4a0/0x16c0 [btrfs]
  btrfs_reserve_extent+0x91/0x180 [btrfs]
  cow_file_range+0x12d/0x490 [btrfs]
  btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
  writepage_delalloc+0xe8/0x150 [btrfs]
  __extent_writepage+0xe8/0x4c0 [btrfs]
  extent_write_cache_pages+0x237/0x530 [btrfs]
  extent_writepages+0x44/0xa0 [btrfs]
  do_writepages+0x23/0x80
  __writeback_single_inode+0x59/0x700
  writeback_sb_inodes+0x267/0x5f0
  __writeback_inodes_wb+0x87/0xe0
  wb_writeback+0x382/0x590
  ? wb_workfn+0x4a2/0x6c0
  wb_workfn+0x4a2/0x6c0
  process_one_work+0x26d/0x6a0
  worker_thread+0x4f/0x3e0
  ? process_one_work+0x6a0/0x6a0
  kthread+0x103/0x140
  ? kthread_create_worker_on_cpu+0x70/0x70
  ret_from_fork+0x3a/0x50
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
 softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace bd7c03622e0b0a52 ]---
 ------------[ cut here ]------------

So fix this by incrementing the bytes_may_use counter of the data
space_info when we fallback to the cow path. If the cow path is successful
the counter is decremented after extent allocation (by
btrfs_add_reserved_bytes()), if it fails it ends up being decremented as
well when clearing the delalloc range (extent_clear_unlock_delalloc()).

This could be triggered sporadically by the test case btrfs/061 from
fstests.

Fixes: 82d5902d9c ("Btrfs: Support reading/writing on disk free ino cache")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:53 +02:00
Filipe Manana
467dc47ea9 btrfs: fix space_info bytes_may_use underflow after nocow buffered write
When doing a buffered write we always try to reserve data space for it,
even when the file has the NOCOW bit set or the write falls into a file
range covered by a prealloc extent. This is done both because it is
expensive to check if we can do a nocow write (checking if an extent is
shared through reflinks or if there's a hole in the range for example),
and because when writeback starts we might actually need to fallback to
COW mode (for example the block group containing the target extents was
turned into RO mode due to a scrub or balance).

When we are unable to reserve data space we check if we can do a nocow
write, and if we can, we proceed with dirtying the pages and setting up
the range for delalloc. In this case the bytes_may_use counter of the
data space_info object is not incremented, unlike in the case where we
are able to reserve data space (done through btrfs_check_data_free_space()
which calls btrfs_alloc_data_chunk_ondemand()).

Later when running delalloc we attempt to start writeback in nocow mode
but we might revert back to cow mode, for example because in the meanwhile
a block group was turned into RO mode by a scrub or relocation. The cow
path after successfully allocating an extent ends up calling
btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
the data space_info object to have been incremented before - but we did
not do it when the buffered write started, since there was not enough
available data space. So btrfs_add_reserved_bytes() ends up decrementing
the bytes_may_use counter anyway, and when the counter's current value
is smaller then the size of the allocated extent we get a stack trace
like the following:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
 Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
 CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
 Workqueue: writeback wb_workfn (flush-btrfs-1754)
 RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
 Code: ff ff 48 (...)
 RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
 RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
 RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
 R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
 R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
 FS:  0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  find_free_extent+0x4a0/0x16c0 [btrfs]
  btrfs_reserve_extent+0x91/0x180 [btrfs]
  cow_file_range+0x12d/0x490 [btrfs]
  run_delalloc_nocow+0x341/0xa40 [btrfs]
  btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
  writepage_delalloc+0xe8/0x150 [btrfs]
  __extent_writepage+0xe8/0x4c0 [btrfs]
  extent_write_cache_pages+0x237/0x530 [btrfs]
  ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
  extent_writepages+0x44/0xa0 [btrfs]
  do_writepages+0x23/0x80
  __writeback_single_inode+0x59/0x700
  writeback_sb_inodes+0x267/0x5f0
  __writeback_inodes_wb+0x87/0xe0
  wb_writeback+0x382/0x590
  ? wb_workfn+0x4a2/0x6c0
  wb_workfn+0x4a2/0x6c0
  process_one_work+0x26d/0x6a0
  worker_thread+0x4f/0x3e0
  ? process_one_work+0x6a0/0x6a0
  kthread+0x103/0x140
  ? kthread_create_worker_on_cpu+0x70/0x70
  ret_from_fork+0x3a/0x50
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
 softirqs last  enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace f9f6ef8ec4cd8ec9 ]---

So to fix this, when falling back into cow mode check if space was not
reserved, by testing for the bit EXTENT_NORESERVE in the respective file
range, and if not, increment the bytes_may_use counter for the data
space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
that if the cow path fails it decrements the bytes_may_use counter when
clearing the delalloc range (through the btrfs_clear_delalloc_extent()
callback).

Fixes: 7ee9e4405f ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:53 +02:00
Filipe Manana
e2c8e92d11 btrfs: fix wrong file range cleanup after an error filling dealloc range
If an error happens while running dellaloc in COW mode for a range, we can
end up calling extent_clear_unlock_delalloc() for a range that goes beyond
our range's end offset by 1 byte, which affects 1 extra page. This results
in clearing bits and doing page operations (such as a page unlock) outside
our target range.

Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
offset, instead of an exclusive end offset, at cow_file_range().

Fixes: a315e68f6e ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:53 +02:00
Nikolay Borisov
213ff4b72a btrfs: remove redundant local variable in read_block_for_search
The local 'b' variable is only used to directly read values from passed
extent buffer. So eliminate  it and directly use the input parameter.
Furthermore this shrinks the size of the following functions:

./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-73 (-73)
Function                                     old     new   delta
read_block_for_search.isra                   876     871      -5
push_node_left                              1112    1044     -68
Total: Before=50348, After=50275, chg -0.14%

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:52 +02:00
Nikolay Borisov
995e9a166b btrfs: open code key_search
This function wraps the optimisation implemented by d7396f0735
("Btrfs: optimize key searches in btrfs_search_slot") however this
optimisation is really used in only one place - btrfs_search_slot.

Just open code the optimisation and also add a comment explaining how it
works since it's not clear just by looking at the code - the key point
here is it depends on an internal invariant that BTRFS' btree provides,
namely intermediate pointers always contain the key at slot0 at the
child node. So in the case of exact match we can safely assume that the
given key will always be in slot 0 on lower levels.

Furthermore this results in a reduction of btrfs_search_slot's size:

./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-75 (-75)
Function                                     old     new   delta
btrfs_search_slot                           2783    2708     -75
Total: Before=50423, After=50348, chg -0.15%

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:52 +02:00
Christoph Hellwig
d8f3e73587 btrfs: split btrfs_direct_IO to read and write part
The read and write versions don't have anything in common except for the
call to iomap_dio_rw.  So split this function, and merge each half into
its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:52 +02:00
Goldwyn Rodrigues
5f008163a5 btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
Since we now perform direct reads using i_rwsem, we can remove this
inode flag used to co-ordinate unlocked reads.

The truncate call takes i_rwsem. This means it is correctly synchronized
with concurrent direct reads.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:52 +02:00
Goldwyn Rodrigues
b75b7ca7c2 fs: remove dio_end_io()
Since we removed the last user of dio_end_io(), remove the helper
function dio_end_io().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:51 +02:00
Goldwyn Rodrigues
a43a67a2d7 btrfs: switch to iomap_dio_rw() for dio
Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.

Use iomap->iomap_end() to check for failed or incomplete direct I/O:
 - for writes, call __endio_write_update_ordered()
 - for reads, unlock extents

btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.

This patch removes last use of struct buffer_head from btrfs.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-28 14:01:02 +02:00
Julia Cartwright
fd56200a16 squashfs: Make use of local lock in multi_cpu decompressor
The squashfs multi CPU decompressor makes use of get_cpu_ptr() to
acquire a pointer to per-CPU data. get_cpu_ptr() implicitly disables
preemption which serializes the access to the per-CPU data.

But decompression can take quite some time depending on the size. The
observed preempt disabled times in real world scenarios went up to 8ms,
causing massive wakeup latencies. This happens on all CPUs as the
decompression is fully parallelized.

Replace the implicit preemption control with an explicit local lock.
This allows RT kernels to substitute it with a real per CPU lock, which
serializes the access but keeps the code section preemptible. On non RT
kernels this maps to preempt_disable() as before, i.e. no functional
change.

[ bigeasy: Use local_lock(), patch description]

Reported-by: Alexander Stein <alexander.stein@systec-electronic.com>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Alexander Stein <alexander.stein@systec-electronic.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200527201119.1692513-5-bigeasy@linutronix.de
2020-05-28 10:31:10 +02:00