The channel value for requesting server remote invalidating local memory
registration should be 0x00000002
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Update reading the EA using increasingly larger buffer sizes
until the response will fit in the buffer, or we exceed the
(arbitrary) maximum set to 64kb.
Without this change, a user is able to add more and more EAs using
setfattr until the point where the total space of all EAs exceed 2kb
at which point the user can no longer list the EAs at all
and getfattr will abort with an error.
The same issue still exists for EAs in SMB1.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
SMB3.1.1 is most secure and recent dialect. Fixup labels and lengths
for sMB3.1.1 signing and encryption.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Currently we try to defer completion of async DIO to the process context
in case there are any mapped pages associated with the inode so that we
can invalidate the pages when the IO completes. However the check is racy
and the pages can be mapped afterwards. If this happens we might end up
calling invalidate_inode_pages2_range() in dio_complete() in interrupt
context which could sleep. This can be reproduced by generic/451.
Fix this by passing the information whether we can or can't invalidate
to the dio_complete(). Thanks Eryu Guan for reporting this and Jan Kara
for suggesting a fix.
Fixes: 332391a993 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mount i_version flag is not enabled in the new sb_flags. This patch
adds the missing SB_I_VERSION flag.
Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal
superblock flags"
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The last cleanup introduced two harmless warnings:
fs/xfs/xfs_fsmap.c:480:1: warning: '__xfs_getfsmap_rtdev' defined but not used
fs/xfs/xfs_fsmap.c:372:1: warning: 'xfs_getfsmap_rtdev_rtbitmap_helper' defined but not used
This moves those two functions as well.
Fixes: bb9c2e5433 ("xfs: move more RT specific code under CONFIG_XFS_RT")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The writeback rework in commit fbcc025613 ("xfs: Introduce
writeback context for writepages") introduced a subtle change in
behavior with regard to the block mapping used across the
->writepages() sequence. The previous xfs_cluster_write() code would
only flush pages up to EOF at the time of the writepage, thus
ensuring that any pages due to file-extending writes would be
handled on a separate cycle and with a new, updated block mapping.
The updated code establishes a block mapping in xfs_writepage_map()
that could extend beyond EOF if the file has post-eof preallocation.
Because we now use the generic writeback infrastructure and pass the
cached mapping to each writepage call, there is no implicit EOF
limit in place. If eofblocks trimming occurs during ->writepages(),
any post-eof portion of the cached mapping becomes invalid. The
eofblocks code has no means to serialize against writeback because
there are no pages associated with post-eof blocks. Therefore if an
eofblocks trim occurs and is followed by a file-extending buffered
write, not only has the mapping become invalid, but we could end up
writing a page to disk based on the invalid mapping.
Consider the following sequence of events:
- A buffered write creates a delalloc extent and post-eof
speculative preallocation.
- Writeback starts and on the first writepage cycle, the delalloc
extent is converted to real blocks (including the post-eof blocks)
and the mapping is cached.
- The file is closed and xfs_release() trims post-eof blocks. The
cached writeback mapping is now invalid.
- Another buffered write appends the file with a delalloc extent.
- The concurrent writeback cycle picks up the just written page
because the writeback range end is LLONG_MAX. xfs_writepage_map()
attributes it to the (now invalid) cached mapping and writes the
data to an incorrect location on disk (and where the file offset is
still backed by a delalloc extent).
This problem is reproduced by xfstests test generic/464, which
triggers racing writes, appends, open/closes and writeback requests.
To address this problem, trim the mapping used during writeback to
within EOF when the mapping is validated. This ensures the mapping
is revalidated for any pages encountered beyond EOF as of the time
the current mapping was cached or last validated.
Reported-by: Eryu Guan <eguan@redhat.com>
Diagnosed-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit 332391a993 ("fs: Fix page cache inconsistency when mixing
buffered and AIO DIO") moved page cache invalidation from
iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
path, but before the dio->end_io() call, and it re-introdued the bug
fixed by commit c771c14baa ("iomap: invalidate page caches should
be after iomap_dio_complete() in direct write").
I found this because fstests generic/418 started failing on XFS with
v4.14-rc3 kernel, which is the regression test for this specific
bug.
So similarly, fix it by moving dio->end_io() (which does the
unwritten extent conversion) before page cache invalidation, to make
sure next buffer read reads the final real allocations not unwritten
extents. I also add some comments about why should end_io() go first
in case we get it wrong again in the future.
Note that, there's no such problem in the non-iomap based direct
write path, because we didn't remove the page cache invalidation
after the ->direct_IO() in generic_file_direct_write() call, but I
decided to fix dio_complete() too so we don't leave a landmine
there, also be consistent with iomap_dio_complete().
Fixes: 332391a993 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Recently we've had warnings arise from the vm handing us pages
without bufferheads attached to them. This should not ever occur
in XFS, but we don't defend against it properly if it does. The only
place where we remove bufferheads from a page is in
xfs_vm_releasepage(), but we can't tell the difference here between
"page is dirty so don't release" and "page is dirty but is being
invalidated so release it".
In some places that are invalidating pages ask for pages to be
released and follow up afterward calling ->releasepage by checking
whether the page was dirty and then aborting the invalidation. This
is a possible vector for releasing buffers from a page but then
leaving it in the mapping, so we really do need to avoid dirty pages
in xfs_vm_releasepage().
To differentiate between invalidated pages and normal pages, we need
to clear the page dirty flag when invalidating the pages. This can
be done through xfs_vm_invalidatepage(), and will result
xfs_vm_releasepage() seeing the page as clean which matches the
bufferhead state on the page after calling block_invalidatepage().
Hence we can re-add the page dirty check in xfs_vm_releasepage to
catch the case where we might be releasing a page that is actually
dirty and so should not have the bufferheads on it removed. This
will remove one possible vector of "dirty page with no bufferheads"
and so help narrow down the search for the root cause of that
problem.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
inode->i_private is assigned by a Node pointer only after registering a
new binary format, so it could be NULL if inode was created by
bm_fill_super() (or iput() was called by the error path in
bm_register_write()), and this could result in NULL pointer dereference
when evicting such an inode. e.g. mount binfmt_misc filesystem then
umount it immediately:
mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
umount /proc/sys/fs/binfmt_misc
will result in
BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
...
Call Trace:
evict+0xd3/0x1a0
iput+0x17d/0x1d0
dentry_unlink_inode+0xb9/0xf0
__dentry_kill+0xc7/0x170
shrink_dentry_list+0x122/0x280
shrink_dcache_parent+0x39/0x90
do_one_tree+0x12/0x40
shrink_dcache_for_umount+0x2d/0x90
generic_shutdown_super+0x1f/0x120
kill_litter_super+0x29/0x40
deactivate_locked_super+0x43/0x70
deactivate_super+0x45/0x60
cleanup_mnt+0x3f/0x70
__cleanup_mnt+0x12/0x20
task_work_run+0x86/0xa0
exit_to_usermode_loop+0x6d/0x99
syscall_return_slowpath+0xba/0xf0
entry_SYSCALL_64_fastpath+0xa3/0xa
Fix it by making sure Node (e) is not NULL.
Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
Fixes: 83f918274e ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we
call clean_buffers() after unlocking the page we've written. Introduce
a new clean_page_buffers() which cleans all buffers associated with a
page and call it from within bdev_write_page().
[akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix a stale kernel memory exposure when logging inodes.
- Fix some build problems with CONFIG_XFS_RT=n
- Don't change inode mode if the acl write fails, leaving the file totally
inaccessible.
- Fix a dangling pointer problem when removing an attr fork under memory
pressure.
- Don't crash while trying to invalidate a null buffer associated with a
corrupt metadata pointer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZ3lPiAAoJEPh/dxk0SrTrfuMP/Axy7VSX71tE/eXPOmzxCVZD
w4/usqO+OsQj+q8o+rwwuX9hz0VGF8kWZJOdgGdXpYT7pWqPmcf88wbThheTetLF
fjevusqva0Ds+U4AE7DCNWSKQQRhu2jDgnhQXTv1hdYhWIF59qGwioIijbEvb72I
0QW+/uV9yXmODjWL6KfRh9zRT9N4npMtszukScONwJr9t0/5ub8H03H/ktv8T9oi
C3ljEWwyMk5lEYH8p6tpta8EbY0mrIZgo+kj33PU5s9rHvcrTGtyPNqidREUm1fL
X3+STMytcDQFAcZdBBXHN0nFMwa8ADTrVvKmEgaR8OsXmOmrlcPn7HfVVlWrY31w
X3awJ0b0+IXUrsbbQOPeqgTo5hIkMDkMOga5AP/rqpx1yCCOrlMHaRPXB2NxNcVw
dyTj6IpKybhsQ4GkcqmFcgnxPPaogNpYlp6SXV5Dm+8zEJdIQNUuci/EGsNz7UcV
msxNlJJkxczXOew6JzCyw45wTnJCxduX7Y1xrOTLaDfa9pkWO2zQBXukCJNIqVIq
35Q4P4JVYtmwQr8XkkX9tiqU0gBWTCTG9KjmTCMm5MYkutEYM0uTNR5Jvyiobl7L
Nn+RydssVw7ssnNfgsLhzQHPElUivRdYoYFSBa2DQp6ViILrefqQegd5INAjK63W
7vnHVZyJMHPM0YFoiX8w
=6Yvh
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix a stale kernel memory exposure when logging inodes.
- Fix some build problems with CONFIG_XFS_RT=n
- Don't change inode mode if the acl write fails, leaving the file
totally inaccessible.
- Fix a dangling pointer problem when removing an attr fork under
memory pressure.
- Don't crash while trying to invalidate a null buffer associated with
a corrupt metadata pointer.
* tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: handle error if xfs_btree_get_bufs fails
xfs: reinit btree pointer on attr tree inactivation walk
xfs: Fix bool initialization/comparison
xfs: don't change inode mode if ACL update fails
xfs: move more RT specific code under CONFIG_XFS_RT
xfs: Don't log uninitialised fields in inode structures
Pull quota fix from Jan Kara:
"A fix for a regression in handling of quota grace times and warnings"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations
In eCryptfs, we failed to verify that the authentication token keys are
not revoked before dereferencing their payloads, which is problematic
because the payload of a revoked key is NULL. request_key() *does* skip
revoked keys, but there is still a window where the key can be revoked
before we acquire the key semaphore.
Fix it by updating ecryptfs_get_key_payload_data() to return
-EKEYREVOKED if the key payload is NULL. For completeness we check this
for "encrypted" keys as well as "user" keys, although encrypted keys
cannot be revoked currently.
Alternatively we could use key_validate(), but since we'll also need to
fix ecryptfs_get_key_payload_data() to validate the payload length, it
seems appropriate to just check the payload pointer.
Fixes: 237fead619 ("[PATCH] ecryptfs: fs/Makefile and fs/Kconfig")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v2.6.19+]
Cc: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When an fscrypt-encrypted file is opened, we request the file's master
key from the keyrings service as a logon key, then access its payload.
However, a revoked key has a NULL payload, and we failed to check for
this. request_key() *does* skip revoked keys, but there is still a
window where the key can be revoked before we acquire its semaphore.
Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.
Fixes: 88bd6ccdcd ("ext4 crypto: add encryption key management facilities")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v4.1+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When the file /proc/fs/fscache/objects (available with
CONFIG_FSCACHE_OBJECT_LIST=y) is opened, we request a user key with
description "fscache:objlist", then access its payload. However, a
revoked key has a NULL payload, and we failed to check for this.
request_key() *does* skip revoked keys, but there is still a window
where the key can be revoked before we access its payload.
Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.
Fixes: 4fbf4291aa ("FS-Cache: Allow the current state of all objects to be dumped")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v2.6.32+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Jason reported that a corrupted filesystem failed to replay
the log with a metadata block out of bounds warning:
XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000
_xfs_buf_find() and xfs_btree_get_bufs() return NULL if
that happens, and then when xfs_alloc_fix_freelist() calls
xfs_trans_binval() on that NULL bp, we oops with:
BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
We don't handle _xfs_buf_find errors very well, every
caller higher up the stack gets to guess at why it failed.
But we should at least handle it somehow, so return
EFSCORRUPTED here.
Reported-by: Jason L Tibbitts III <tibbs@math.uh.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_attr3_root_inactive() walks the attr fork tree to invalidate the
associated blocks. xfs_attr3_node_inactive() recursively descends
from internal blocks to leaf blocks, caching block address values
along the way to revisit parent blocks, locate the next entry and
descend down that branch of the tree.
The code that attempts to reread the parent block is unsafe because
it assumes that the local xfs_da_node_entry pointer remains valid
after an xfs_trans_brelse() and re-read of the parent buffer. Under
heavy memory pressure, it is possible that the buffer has been
reclaimed and reallocated by the time the parent block is reread.
This means that 'btree' can point to an invalid memory address, lead
to a random/garbage value for child_fsb and cause the subsequent
read of the attr fork to go off the rails and return a NULL buffer
for an attr fork offset that is most likely not allocated.
Note that this problem can be manufactured by setting
XFS_ATTR_BTREE_REF to 0 to prevent LRU caching of attr buffers,
creating a file with a multi-level attr fork and removing it to
trigger inactivation.
To address this problem, reinit the node/btree pointers to the
parent buffer after it has been re-read. This ensures btree points
to a valid record and allows the walk to proceed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we get ENOSPC half way through setting the ACL, the inode mode
can still be changed even though the ACL does not exist. Reorder the
operation to only change the mode of the inode if the ACL is set
correctly.
Whilst this does not fix the problem with crash consistency (that requires
attribute addition to be a deferred op) it does prevent ENOSPC and other
non-fatal errors setting an xattr to be handled sanely.
This fixes xfstests generic/449.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Various utility functions and interfaces that iterate internal
devices try to reference the realtime device even when RT support is
not compiled into the kernel.
Make sure this code is excluded from the CONFIG_XFS_RT=n build,
and where appropriate stub functions to return fatal errors if
they ever get called when RT support is not present.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Prevent kmemcheck from throwing warnings about reading uninitialised
memory when formatting inodes into the incore log buffer. There are
several issues here - we don't always log all the fields in the
inode log format item, and we never log the inode the
di_next_unlinked field.
In the case of the inode log format item, this is exacerbated
by the old xfs_inode_log_format structure padding issue. Hence make
the padded, 64 bit aligned version of the structure the one we always
use for formatting the log and get rid of the 64 bit variant. This
means we'll always log the 64-bit version and so recovery only needs
to convert from the unpadded 32 bit version from older 32 bit
kernels.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit 77469c3f57 prevented setting the page as uptodate when we wrote
the right amount of data, fix that.
Fixes: 77469c3f57 ("9p: saner ->write_end() on failing copy into non-uptodate page")
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Alexander Levin <alexander.levin@verizon.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Fairly old DIO bug caught by Andreas (3.10+) and several slightly
younger blk_rq_map_user_iov() bugs, both on map and copy codepaths
(Vitaly and me)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
bio_copy_user_iov(): don't ignore ->iov_offset
more bio_map_user_iov() leak fixes
fix unbalanced page refcounting in bio_map_user_iov
direct-io: Prevent NULL pointer access in submit_page_section
In the code added to function submit_page_section by commit b1058b981,
sdio->bio can currently be NULL when calling dio_bio_submit. This then
leads to a NULL pointer access in dio_bio_submit, so check for a NULL
bio in submit_page_section before trying to submit it instead.
Fixes xfstest generic/250 on gfs2.
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
file. (I was weirdly flattered by the idea that lots of random people
suddenly seemed to think Jeff and I were VFS experts. Turns out it was
just a typo.)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZ3RyyAAoJECebzXlCjuG+JvQP/RkwFqMZJHDjhSDhj/cr/t2o
ciK5Xche1A4E5vaaPVV17w6OwIYTNhnQawwBtNw88GaqDUEELVyFZFzNtRm44Bv1
27RLOahPTT6bmHl/cd+uNpgpXs9svuNF6x4C5SUmKTm4kFdLBP7khjdcnFhwFi2y
OerDFj4XmPsUDqW8dv7a7XktRf1klMvhbRh80r9TR5JW+h4IYQIYNevue9CABpUm
4vvv4kAyxo8oodslCMQ5OyWpG4NDDsFADtlLn++9tzUl7y5j6TQyIYfeYDH3XOru
5Ara5pkuxloS1Fu4EtEInF3iLAjMZkJD+QgHFhf2/mLMzQhZZzpbnFYPhrgyQONv
wR3u7DaH2t/JbYtlSnKQpLEG0hv2hSBQ33G4ysKUHXrhnF5DC9N59epcA2X34++B
DSwyc2wgxNfr8OGPyaNNw/kcBJyahNvsxlpTxZfTnvc0p4M1dzr1mxl/zsGC2b3v
Ei1Y+u5JU2d/jmzeTOLCGtc59UyAoswdVzNa8SNYad1Tu5eAr81uooCPUvj77lTj
GWQa9wYSOxt+Ld295dtzagqx+hQFdVKa+QTzfaZuPHeuUWmhQLGgalWXCxlVKtuF
SGfAfutikQ4zbfAEz9PuNoThywfppiWbE74pfHRDkteL5+o2JQBLOSo6V6Ow0xV6
O4cOvwV5X/RExbOoZlx1
=yj7E
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fix from Bruce Fields:
"One fix for a 4.14 regression, and one minor fix to the MAINTAINERs
file. (I was weirdly flattered by the idea that lots of random people
suddenly seemed to think Jeff and I were VFS experts. Turns out it was
just a typo)"
* tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux:
nfsd4: define nfsd4_secinfo_no_name_release()
MAINTAINERS: associate linux/fs.h with VFS instead of file locking
This contains one bug fix which causes a kernel panic during fstrim introduced
in 4.14-rc1.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlnc+awACgkQQBSofoJI
UNIQLg/9HB/NikmBxVtkDtwrTKpVEPK5AYRHOvoa9k6twGkU6pB8FE0cd2PstwlZ
tAwRstyt8W9nGzF5BPY+WAyVs9ybc26wIqNo13cnzwXbc0/cc4pTy8lzeiFQdQrK
JIzz2lHNt0b5euCsEEAsnwK+rTb5DPUMKm8JkBUQ8f94oxIHLWvg7Um9FBppTw7s
JNOJ8/ymzQVNlWu7VxFaVwfUPbEhK7gtpSWjO65fiprQ0JjwXLEr65356XU2XW8x
lhQkByPMfMv1ZyGSNr3m4Hih0M6250slNHzwrZDxTdH7NDJmy1DfcPiM+epMWZMa
4uT+2hsxhTCqDQbIEvP9jv+KVHV7AG9ldCD04a0RD+XoNKDVLKlzSMFWVcWE/d0H
jSaDrMZj+taseF72x/efP8P/RrTbzqYsqBoAkoByibOXvBf7U8vsLK4NuG7agoL4
EUXDMuVJDB5d8LJRSYt0lPI5R+lhRVlVuint7a9T09yiLyCeR0wGf+eoH9C9Y4V8
t/mEM9azBi9l7T0yraVfqnh+SPzwwlxYOLQeZTi0bf3uqmBOeKb0OvfOiwboOnaZ
5Rl6jYD/hgZAowXpbohRjqPJhMoLMabsTJ4kHj6uJcQDhvTqDpamm9g9Afsiyr6z
xPYo09iHHlWA/iSiV7VSnbZu8hr59bchVt86r77fy/4YH3DXOcM=
=fAsG
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fix from Jaegeuk Kim:
"This contains one bug fix which causes a kernel panic during fstrim
introduced in 4.14-rc1"
* tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix potential panic during fstrim
Eryu has reported that since commit 7b9ca4c61b "quota: Reduce
contention on dq_data_lock" test generic/233 occasionally fails. This is
caused by the fact that since that commit we don't generate warning and
set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set
(these are for example some metadata allocations in ext4). We need these
allocations to behave regularly wrt warning generation and grace time
setting so fix the code to return to the original behavior.
Reported-and-tested-by: Eryu Guan <eguan@redhat.com>
CC: stable@vger.kernel.org
Fixes: 7b9ca4c61b
Signed-off-by: Jan Kara <jack@suse.cz>
Pull btrfs fixes from David Sterba:
"Two more fixes for bugs introduced in 4.13.
The sector_t problem with 32bit architecture and !LBDAF config seems
serious but the number of affected deployments is hopefully low.
The clashing status bits could lead to a confusing in-memory state of
the whole-filesystem operations if used with the quota override sysfs
knob"
* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix overlap of fs_info::flags values
btrfs: avoid overflow when sector_t is 32 bit
Pull overlayfs fixes from Miklos Szeredi:
"Fix a regression in 4.14 and one in 4.13. The latter is a case when
Docker is doing something it really shouldn't and gets away with it.
We now print a warning instead of erroring out.
There are also fixes to several error paths"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix regression caused by exclusive upper/work dir protection
ovl: fix missing unlock_rename() in ovl_do_copy_up()
ovl: fix dentry leak in ovl_indexdir_cleanup()
ovl: fix dput() of ERR_PTR in ovl_cleanup_index()
ovl: fix error value printed in ovl_lookup_index()
ovl: fix may_write_real() for overlayfs directories
Commit 34b1744c91 ("nfsd4: define ->op_release for compound ops")
defined a couple ->op_release functions and run them if necessary.
But there's a problem with that is that it reused
nfsd4_secinfo_release() as the op_release of OP_SECINFO_NO_NAME, and
caused a leak on struct nfsd4_secinfo_no_name in
nfsd4_encode_secinfo_no_name(), because there's no .si_exp field in
struct nfsd4_secinfo_no_name.
I found this because I was unable to umount an ext4 partition after
exporting it via NFS & run fsstress on the nfs mount. A simplified
reproducer would be:
# mount a local-fs device at /mnt/test, and export it via NFS with
# fsid=0 export option (this is required)
mount /dev/sda5 /mnt/test
echo "/mnt/test *(rw,no_root_squash,fsid=0)" >> /etc/exports
service nfs restart
# locally mount the nfs export with all default, note that I have
# nfsv4.1 configured as the default nfs version, because of the
# fsid export option, v4 mount would fail and fall back to v3
mount localhost:/mnt/test /mnt/nfs
# try to umount the underlying device, but got EBUSY
umount /mnt/nfs
service nfs stop
umount /mnt/test <=== EBUSY here
Fixed it by defining a separate nfsd4_secinfo_no_name_release()
function as the op_release method of OP_SECINFO_NO_NAME that
releases the correct nfsd4_secinfo_no_name structure.
Fixes: 34b1744c91 ("nfsd4: define ->op_release for compound ops")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.
Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:
Terminal 1:
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
Terminal 2:
unshare -m
Terminal 1:
umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
mount: /root/overlay-testing/merged: none already mounted or mount point
busy
To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.
Documentation was updated to reflect this change.
Fixes: 2cac0c00a6 ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Euan Kemp <euank@euank.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.
Fixes: ("fd210b7d67ee ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
index dentry was not released when breaking out of the loop
due to index verification error.
Fixes: 415543d5c6 ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.
This fixes a regression in xfstest generic/158.
Fixes: 7c6893e3c9 ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since we can now use a lock stateid or a delegation stateid, that
differs from the context stateid, we need to change the test in
nfs4_layoutget_handle_exception() to take this into account.
This fixes an infinite layoutget loop in the NFS client whereby
it keeps retrying the initial layoutget using the same broken
stateid.
Fixes: 70d2f7b1ea ("pNFS: Use the standard I/O stateid when...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Merge misc fixes from Andrew Morton:
"A lot of stuff, sorry about that. A week on a beach, then a bunch of
time catching up then more time letting it bake in -next. Shan't do
that again!"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (51 commits)
include/linux/fs.h: fix comment about struct address_space
checkpatch: fix ignoring cover-letter logic
m32r: fix build failure
lib/ratelimit.c: use deferred printk() version
kernel/params.c: improve STANDARD_PARAM_DEF readability
kernel/params.c: fix an overflow in param_attr_show
kernel/params.c: fix the maximum length in param_get_string
mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
kernel/kcmp.c: drop branch leftover typo
memremap: add scheduling point to devm_memremap_pages
mm, page_alloc: add scheduling point to memmap_init_zone
mm, memory_hotplug: add scheduling point to __add_pages
lib/idr.c: fix comment for idr_replace()
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
lib/lz4: make arrays static const, reduces object code size
exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
...
Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.
First, BTRFS_FS_EXCL_OP was set to 14.
commit 171938e528 ("btrfs: track exclusive filesystem operation in flags")
Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.
commit f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.
Fixes: f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba <dsterba@suse.com>
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.
In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).
I added a cast to u64 to avoid overflow.
Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
security_inode_getsecurity() provides the text string value
of a security attribute. It does not provide a "secctx".
The code in xattr_getsecurity() that calls security_inode_getsecurity()
and then calls security_release_secctx() happened to work because
SElinux and Smack treat the attribute and the secctx the same way.
It fails for cap_inode_getsecurity(), because that module has no
secctx that ever needs releasing. It turns out that Smack is the
one that's doing things wrong by not allocating memory when instructed
to do so by the "alloc" parameter.
The fix is simple enough. Change the security_release_secctx() to
kfree() because it isn't a secctx being returned by
security_inode_getsecurity(). Change Smack to allocate the string when
told to do so.
Note: this also fixes memory leaks for LSMs which implement
inode_getsecurity but not release_secctx, such as capabilities.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert. Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since the CoW fork exists as a secondary data structure to the data
fork, we must always swap cow forks during swapext. We also need to
swap the extent counts and reset the cowblocks tags.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
After the previous change "fmt" can't go away, we can kill
iname/iname_addr and use fmt->interpreter.
Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_misc_binary() makes a local copy of fmt->interpreter under
entries_lock to avoid the race with kill_node() but this is not enough;
the whole Node can be freed after we drop entries_lock, not only the
->interpreter string.
Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
this Node.
Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: <tdhooge@llnl.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
have a bug which should not be silently ignored.
Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To ensure that load_misc_binary() can't use the partially destroyed
Node, see also the next patch.
The current logic looks wrong in any case, once we close interp_file it
doesn't make any sense to delay kfree(inode->i_private), this Node is no
longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
not racy (they are), load_misc_binary() should not try to reopen
->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.
And I can't understand why do we use filp_close(), not fput().
Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kill_node() nullifies/checks Node->dentry to avoid double free. This
complicates the next changes and this is very confusing:
- we do not need to check dentry != NULL under entries_lock,
kill_node() is always called under inode_lock(d_inode(root)) and we
rely on this inode_lock() anyway, without this lock the
MISC_FMT_OPEN_FILE cleanup could race with itself.
- if kill_inode() was already called and ->dentry == NULL we should not
even try to close e->interp_file.
We can change bm_entry_write() to simply check !list_empty(list) before
kill_node. Again, we rely on inode_lock(), in particular it saves us
from the race with bm_status_write(), another caller of kill_node().
Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "exec: binfmt_misc: fix use-after-free, kill
iname[BINPRM_BUF_SIZE]".
It looks like this code was always wrong, then commit 948b701a60
("binfmt_misc: add persistent opened binary handler for containers")
added more problems.
This patch (of 6):
load_script() can simply use i_name instead, it points into bprm->buf[]
and nobody can change this memory until we call prepare_binprm().
The only complication is that we need to also change the signature of
bprm_change_interp() but this change looks good too.
While at it, do whitespace/style cleanups.
NOTE: the real motivation for this change is that people want to
increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
this looks more complicated because afaics it is very buggy.
Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading the event from the uffd, we put it on a temporary
fork_event list to detect if we can still access it after releasing and
retaking the event_wqh.lock.
If fork aborts and removes the event from the fork_event all is fine as
long as we're still in the userfault read context and fork_event head is
still alive.
We've to put the event allocated in the fork kernel stack, back from
fork_event list-head to the event_wqh head, before returning from
userfaultfd_ctx_read, because the fork_event head lifetime is limited to
the userfaultfd_ctx_read stack lifetime.
Forgetting to move the event back to its event_wqh place then results in
__remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
userfaultfd_event_wait_completion to remove it from a head that has been
already freed from the reader stack.
This could only happen if resolve_userfault_fork failed (for example if
there are no file descriptors available to allocate the fork uffd). If
it succeeded it was put back correctly.
Furthermore, after find_userfault_evt receives a fork event, the forked
userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
be released by the fork thread as soon as the event_wqh.lock is
released. Taking a reference on the fork_nctx before dropping the lock
prevents an use after free in resolve_userfault_fork().
If the fork side aborted and it already released everything, we still
try to succeed resolve_userfault_fork(), if possible.
Fixes: 893e26e61d ("userfaultfd: non-cooperative: Add fork() event")
Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Ju Hyung Park reported:
"When 'fstrim' is called for manual trim, a BUG() can be triggered
randomly with this patch.
I'm seeing this issue on both x86 Desktop and arm64 Android phone.
On x86 Desktop, this was caused during Ubuntu boot-up. I have a
cronjob installed which calls 'fstrim -v /' during boot. On arm64
Android, this was caused during GC looping with 1ms gc_min_sleep_time
& gc_max_sleep_time."
Root cause of this issue is that f2fs_wait_discard_bios can only be
used by f2fs_put_super, because during put_super there must be no
other referrers, so it can ignore discard entry's reference count
when removing the entry, otherwise in other caller we will hit bug_on
in __remove_discard_cmd as there may be other issuer added reference
count in discard entry.
Thread A Thread B
- issue_discard_thread
- f2fs_ioc_fitrim
- f2fs_trim_fs
- f2fs_wait_discard_bios
- __issue_discard_cmd
- __submit_discard_cmd
- __wait_discard_cmd
- dc->ref++
- __wait_one_discard_bio
- __wait_discard_cmd
- __remove_discard_cmd
- f2fs_bug_on(sbi, dc->ref)
Fixes: 969d1b180d
Reported-by: Ju Hyung Park <qkrwngud825@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
previous commit 5d37ca14 "ceph: send LSSNAP request to auth mds
of directory inode" is buggy. It makes __choose_mds() choose mds
base on hash of '.snap' dentry.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
commit 3ae0bebc "ceph: queue cap snap only when snap realm's
context changes" introduced a regression: we may not call
queue_realm_cap_snaps() for newly created snap realm. This
regression allows unflushed snapshot data to be overwritten.
Link: http://tracker.ceph.com/issues/21483
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Michael Sterrett reports a NULL pointer dereference on NFSv3 mounts when
CONFIG_NFS_V4 is not set because the NFS UOC rpc_wait_queue has not been
initialized. Move the initialization of the queue out of the CONFIG_NFS_V4
conditional setion.
Fixes: 7d6ddf88c4 ("NFS: Add an iocounter wait function for async RPC tasks")
Cc: stable@vger.kernel.org # 4.11+
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_idmap_get_desc() can't actually return zero. But if it did then
we would return ERR_PTR(0) which is NULL and the caller,
nfs_idmap_get_key(), doesn't expect that so it leads to a NULL pointer
dereference.
I've cleaned this up by changing the "<=" to "<" so it's more clear that
we don't return ERR_PTR(0).
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The units of RPC_MAX_AUTH_SIZE is bytes, not 4-byte words. This causes
the client to request a larger-than-necessary session replay slot size.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull scheduler fixes from Thomas Gleixner:
"The scheduler pull request comes with the following updates:
- Prevent a divide by zero issue by validating the input value of
sysctl_sched_time_avg
- Make task state printing consistent all over the place and have
explicit state characters for IDLE and PARKED so they wont be
displayed as 'D' state which confuses tools"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/sysctl: Check user input value of sysctl_sched_time_avg
sched/debug: Add explicit TASK_PARKED printing
sched/debug: Ignore TASK_IDLE for SysRq-W
sched/debug: Add explicit TASK_IDLE printing
sched/tracing: Use common task-state helpers
sched/tracing: Fix trace_sched_switch task-state printing
sched/debug: Remove unused variable
sched/debug: Convert TASK_state to hex
sched/debug: Implement consistent task-state printing
Pull btrfs fixes from David Sterba:
"We've collected a bunch of isolated fixes, for crashes, user-visible
behaviour or missing bits from other subsystem cleanups from the past.
The overall number is not small but I was not able to make it
significantly smaller. Most of the patches are supposed to go to
stable"
* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: log csums for all modified extents
Btrfs: fix unexpected result when dio reading corrupted blocks
btrfs: Report error on removing qgroup if del_qgroup_item fails
Btrfs: skip checksum when reading compressed data if some IO have failed
Btrfs: fix kernel oops while reading compressed data
Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
Btrfs: do not backup tree roots when fsync
btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs: prevent to set invalid default subvolid
Btrfs: send: fix error number for unknown inode types
btrfs: fix NULL pointer dereference from free_reloc_roots()
btrfs: finish ordered extent cleaning if no progress is found
btrfs: clear ordered flag on cleaning up ordered extents
Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
Btrfs: do not reset bio->bi_ops while writing bio
Btrfs: use the new helper wbc_to_write_flags
Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
its own print state because it will not in fact get woken by regular
wakeups and is a long-term state.
This requires moving TASK_PARKED into the TASK_REPORT mask, and since
that latter needs to be a contiguous bitmask, we need to shuffle the
bits around a bit.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Markus reported that kthreads that idle using TASK_IDLE instead of
TASK_INTERRUPTIBLE are reported in as TASK_UNINTERRUPTIBLE and things
like htop mark those red.
This is undesirable, so add an explicit state for TASK_IDLE.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently get_task_state() and task_state_to_char() report different
states, create a number of common helpers and unify the reported state
space.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- fix various problems with the copy-on-write extent maps getting freed
at the wrong time
- fix printk format specifier problems
- report zeroing operation outcomes instead of dropping them on the
floor
- fix some crashes when dio operations partially fail
- fix a race condition between unwritten extent conversion & dio read
- fix some incorrect tests in the inode log item processing
- correct the delayed allocation space reservations on rmap filesystems
- fix some problems checking for dax support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZypYxAAoJEPh/dxk0SrTrJ3YQAJFWUCp194an+yuvgOY+MuyL
PG/vAA3DyJjYbwIsqUE//dlp9nrarccAXcxPITWlLdGZ//qHbXO2MguO3KIQ4iG8
qmsA+tXetVoYZYxYZLQ0KjX/XJTaAXY64xKTFxMMTTKUoxPygJRUF/FPfFFcTtaq
Q/ULikS5mhtW7/mQCfXBvtqM5ZD61A9vQRjDL5jRdrDbz49TQqtskp/7F6SEHLxU
fTCGhN7Ys4MQ4fmtUc+EUh0LPX8oAKIIKiGz3zUqrk/FgNYI2NqnTYvflfN8L9UE
t+k+4CGrON+dzrau4HrvZaYbfIPhRaJUM4QzFcDIPoaBZOt6DpBI0dEKm9FD7Hw/
vUvBs0M9asqYycH3PopFHugF+SxW8g7g+5TD8S9rg3j33PZahSNm3gt5gYb1Kiij
3TZPirst6OeQuEjWX6L5LAruAtqtEXtHL7o4dGn5LdQkJ0EIdKXMd9YGz0F/trTK
Grqf2Mep/Q8nccMTksaj94X5AhmM4znYmbAnbS/+QfYTgLk92GJltxoKTB6roW/N
fJ5azjyzGsr4BWdgakK3aA9glaQWGh3PY8Up2VLeEdjwcy3zyscnpZP2PSvt+l9X
pmMDpMTvQD0E6e5246itB69Il1NXTEoG/t9Hlx/2x9g0R2hjK6CRXXrwPnz9zYkI
7wFz5B5LmJ27vFGTCxo5
=7ptY
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fix various problems with the copy-on-write extent maps getting freed
at the wrong time
- fix printk format specifier problems
- report zeroing operation outcomes instead of dropping them on the
floor
- fix some crashes when dio operations partially fail
- fix a race condition between unwritten extent conversion & dio read
- fix some incorrect tests in the inode log item processing
- correct the delayed allocation space reservations on rmap filesystems
- fix some problems checking for dax support
* tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: revert "xfs: factor rmap btree size into the indlen calculations"
xfs: Capture state of the right inode in xfs_iflush_done
xfs: perag initialization should only touch m_ag_max_usable for AG 0
xfs: update i_size after unwritten conversion in dio completion
iomap_dio_rw: Allocate AIO completion queue before submitting dio
xfs: validate bdev support for DAX inode flag
xfs: remove redundant re-initialization of total_nr_pages
xfs: Output warning message when discard option was enabled even though the device does not support discard
xfs: report zeroed or not correctly in xfs_zero_range()
xfs: kill meaningless variable 'zero'
fs/xfs: Use %pS printk format for direct addresses
xfs: evict CoW fork extents when performing finsert/fcollapse
xfs: don't unconditionally clear the reflink flag on zero-block files
Pull quota and isofs fixes from Jan Kara:
"Two quota fixes (fallout of the quota locking changes) and an isofs
build fix"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Fix quota corruption with generic/232 test
isofs: fix build regression
quota: add missing lock into __dquot_transfer()
Eric has reported that since commit d2faa41516 "quota: Do not acquire
dqio_sem for dquot overwrites in v2 format" test generic/232
occasionally fails due to quota information being incorrect. Indeed that
commit was too eager to remove dqio_sem completely from the path that
just overwrites quota structure with updated information. Although that
is innocent on its own, another process that inserts new quota structure
to the same block can perform read-modify-write cycle of that block thus
effectively discarding quota information update if they race in a wrong
way.
Fix the problem by acquiring dqio_sem for reading for overwrites of
quota structure. Note that it *is* possible to completely avoid taking
dqio_sem in the overwrite path however that will require modifying path
inserting / deleting quota structures to avoid RMW cycles of the full
block and for now it is not clear whether it is worth the hassle.
Fixes: d2faa41516
Reported-and-tested-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In generic_file_llseek_size, return -ENXIO for negative offsets as well
as offsets beyond EOF. This affects filesystems which don't implement
SEEK_HOLE / SEEK_DATA internally, possibly because they don't support
holes.
Fixes xfstest generic/448.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit fd26a88093 we added a worst case estimate for rmapbt blocks
needed to satisfy the block mapping request. Since then, we added the
ability to reserve enough space in each AG such that we should never run
out of blocks to grow the rmapbt, which makes this calculation
unnecessary. Revert the commit because it makes the extra delalloc
indlen accounting unnecessary and incorrect.
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
My previous patch: d3a304b629 check for
XFS_LI_FAILED flag xfs_iflush done, so the failed item can be properly
resubmitted.
In the loop scanning other inodes being completed, it should check the
current item for the XFS_LI_FAILED, and not the initial one.
The state of the initial inode is checked after the loop ends
Kudos to Eric for catching this.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We call __xfs_ag_resv_init to make a per-AG reservation for each AG.
This makes the reservation per-AG, not per-filesystem. Therefore, it
is incorrect to adjust m_ag_max_usable for each AG. Adjust it only
when we're reserving AG 0's blocks so that we only do it once per fs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Since commit d531d91d69 ("xfs: always use unwritten extents for
direct I/O writes"), we start allocating unwritten extents for all
direct writes to allow appending aio in XFS.
But for dio writes that could extend file size we update the in-core
inode size first, then convert the unwritten extents to real
allocations at dio completion time in xfs_dio_write_end_io(). Thus a
racing direct read could see the new i_size and find the unwritten
extents first and read zeros instead of actual data, if the direct
writer also takes a shared iolock.
Fix it by updating the in-core inode size after the unwritten extent
conversion. To do this, introduce a new boolean argument to
xfs_iomap_write_unwritten() to tell if we want to update in-core
i_size or not.
Suggested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Executing xfs/104 test in a loop on Linux-v4.13 kernel on a ppc64
machine can cause the following NULL pointer dereference,
.queue_work_on+0x4c/0x80
.iomap_dio_bio_end_io+0xbc/0x1f0
.bio_endio+0x118/0x1f0
.blk_update_request+0xd0/0x470
.blk_mq_end_request+0x24/0xc0
.lo_complete_rq+0x40/0xe0
.__blk_mq_complete_request_remote+0x28/0x40
.flush_smp_call_function_queue+0xc4/0x1e0
.smp_ipi_demux_relaxed+0x8c/0x100
.icp_hv_ipi_action+0x54/0xa0
.__handle_irq_event_percpu+0x84/0x2c0
.handle_irq_event_percpu+0x28/0x80
.handle_percpu_irq+0x78/0xc0
.generic_handle_irq+0x40/0x70
.__do_irq+0x88/0x200
.call_do_irq+0x14/0x24
.do_IRQ+0x84/0x130
This occurs due to the following sequence of events,
1. Allocate dio for Direct I/O write.
2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
- Assume that we have submitted atleast one bio. Hence iomap_dio->ref value
will be >= 2.
- If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
break out of the loop and since the 'ret' value is a negative number we
end up not allocating memory for super_block->s_dio_done_wq.
3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
submitted and here the code ends up dereferencing the NULL pointer stored
at super_block->s_dio_done_wq.
This commit fixes the bug by allocating memory for
super_block->s_dio_done_wq before iomap_apply() is invoked.
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently only the blocksize is checked, but we should really be calling
bdev_dax_supported() which also tests to make sure we can get a
struct dax_device and that the dax_direct_access() path is working.
This is the same check that we do for the "-o dax" mount option in
xfs_fs_fill_super().
This does not fix the race issues that caused the XFS DAX inode option to
be disabled, so that option will still be disabled. If/when we re-enable
it, though, I think we will want this issue to have been fixed. I also do
think that we want to fix this in stable kernels.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Amir reported a bug discovered by his cleaned up version of my
dm-log-writes xfstests where we were missing csums at certain replay
points. This is because fsx was doing an msync(), which essentially
fsync()'s a specific range of a file. We will log all modified extents,
but only search for the checksums in the range we are being asked to
sync. We cannot simply log the extents in the range we're being asked
because we are logging the inode item as it is currently, which if it
has had a i_size update before the msync means we will miss extents when
replaying. We could possibly get around this by marking the inode with
the transaction that extended the i_size to see if we have this case,
but this would be racy and we'd have to lock the whole range of the
inode to make sure we didn't have an ordered extent outside of our range
that was in the middle of completing.
Fix this simply by keeping track of the modified extents range and
logging the csums for the entire range of extents that we are logging.
This makes the xfstest pass.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
commit 4246a0b63b ("block: add a bi_error field to struct bio")
changed the logic of how dio read endio reports errors.
For single stripe dio read, %bio->bi_status reflects the error before
verifying checksum, and now we're updating it when data block matches
with its checksum, while in the mismatching case, %bio->bi_status is
not updated to relfect that.
When some blocks in a file have been corrupted on disk, reading such a
file ends up with
1) checksum errors are reported in kernel log
2) read(2) returns successfully with some content being 0x01.
In order to fix it, we need to report its checksum mismatch error to
the upper layer (dio layer in this case) as well.
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously, we were calling del_qgroup_item, and ignoring the return code
resulting in a potential to have divergent in-memory state without an
error. Perhaps, it makes sense to handle this error code, and put the
filesystem into a read only, or similar state.
This patch only adds reporting of the error if the error is fatal,
(any error other than qgroup not found).
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently even if the underlying disk reports failure on IO,
compressed read endio still gets to verify checksum and reports it as
a checksum error.
In fact, if some IO have failed during reading a compressed data
extent , there's no way the checksum could match, therefore, we can
skip that in order to return error quickly to the upper layer.
Please note that we need to do this after recording the failed mirror
index so that read-repair in the upper layer's endio can work
properly.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel oops happens at
kernel BUG at fs/btrfs/extent_io.c:2104!
...
RIP: clean_io_failure+0x263/0x2a0 [btrfs]
It's showing that read-repair code is using an improper mirror index.
This is due to the fact that compression read's endio hasn't recorded
the failed mirror index in %cb->orig_bio.
With this, btrfs's read-repair can work properly on reading compressed
data.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Paul Jones <paul@pauljones.id.au>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This seems to be a leftover of commit cf8cddd38b ("btrfs: don't
abuse REQ_OP_* flags for btrfs_map_block").
It should use btrfs_op() helper to provide one of 'enum btrfs_map_op'
types.
Fixes: cf8cddd38b ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, "btrfs quota enable" would fail after "btrfs quota disable" on
the first time with syslog output "qgroup_rescan_init failed with -22", but
it would succeed on the second time.
When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be
set in fs_info->flags in btrfs_quota_disable(), but it will not be droppd
in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
because quota_root has already been freed. If "quota enable" is called
after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag
would be dropped in the btrfs_run_qgroups() since quota_root is not NULL.
This leads to the failure of "quota enable" on the first time.
BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
context and is equivalent to whether quota_root is NULL or not.
btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
place.
So, let's remove BTRFS_FS_QUOTA_DISABLING flag.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cmp_data_prepare() (almost) always returns 0 i.e. ignoring errors
from gather_extent_pages(). While the pages are freed by
btrfs_cmp_data_free(), cmp->num_pages still has > 0. Then,
btrfs_extent_same() try to access the already freed pages causing faults
(or violates PageLocked assertion).
This patch just return the error as is so that the caller stop the process.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: f441460202 ("btrfs: fix deadlock with extent-same and readpage")
Cc: <stable@vger.kernel.org> # 4.2
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
`btrfs sub set-default` succeeds to set an ID which isn't corresponding to any
fs/file tree. If such the bad ID is set to a filesystem, we can't mount this
filesystem without specifying `subvol` or `subvolid` mount options.
Fixes: 6ef5ed0d38 ("Btrfs: add ioctl and incompat flag to set the default mount subvol")
Cc: <stable@vger.kernel.org>
Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ENOTSUPP should not be returned to the user program.
(cf. include/linux/errno.h)
Therefore, EOPNOTSUPP is used instead of ENOTSUPP.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereference reloc_root->node, causing
the system BUG.
Fixes: 6bdf131fac ("Btrfs: don't leak reloc root nodes on error")
Cc: <stable@vger.kernel.org> # 4.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with direct IO path, because before
the function is called, it's ensured that there are ordered extents filling
whole the range. It's not the case, however, when it's called from
run_delalloc_range(): it is possible to have error in the midle of the loop
in e.g. run_delalloc_nocow(), so that there exisits the range not covered
by any ordered extents. By cleaning such "uncomplete" range,
__endio_write_update_ordered() stucks at offset where there're no ordered
extents.
Since the ordered extents are created from head to tail, we can stop the
search if there are no offset progress.
Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 524272607e ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).
BUG: Bad page state in process btrfs pfn:3fa787
page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping: (null) index:0xd
flags: 0x8000000000002008(uptodate|private_2)
raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff
raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)
This patch clears the flag same as other places calling
btrfs_dec_test_ordered_pending() for every page in the specified range.
Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info->super_copy->{node,sector}size are little-endian, but the ioctl
should return the values in native endianness. Use the cached values in
btrfs_fs_info instead. Found with sparse.
Fixes: 80a773fbfc ("btrfs: retrieve more info from FS_INFO ioctl")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().
This remove this unnecessary bio->bi_opf setting.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.
Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Variable total_nr_pages is being initialized and then updated with
the same value, this latter assignment is redundant and can be
removed. Cleans up clang build warning:
Value stored to 'total_nr_pages' during its initialization is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In order to using discard function, it is necessary that not only xfs
is mounted with discard option, but also the discard function is
supported by the device. Current code doesn't output any message when
users mount with discard option on unsupported device, so it is
difficult to notice that it was not enabled actually.
This patch adds the warning message to notice that discard option is
not enabled due to unsupported device when the filesystem is mounted.
Changes in v2 (Suggested by Brian Foster):
- Move the unsupported device check into xfs_fs_fill_super().
- Clear the discard flag when device is unsupported.
Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The 'did_zero' param of xfs_zero_range() was not passed to
iomap_zero_range() correctly. This was introduced by commit
7bb41db3ea ("xfs: handle 64-bit length in xfs_iozero"), and found
by code inspection.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_file_aio_write_checks(), variable 'zero' is there only to
satisfy xfs_zero_eof(), the result of it is ignored. Now, with
iomap_zero_range() based xfs_zero_eof(), we can safely pass NULL as
the last param of it and kill 'zero'.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the %pS instead of the %pF printk format specifier for printing symbols
from direct addresses. This is needed for the ia64, ppc64 and parisc64
architectures.
Signed-off-by: Helge Deller <deller@gmx.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we perform an finsert/fcollapse operation, cancel all the CoW
extents for the affected file offset range so that they don't end up
pointing to the wrong blocks.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we have speculative cow preallocations hanging around in the cow
fork, don't let a truncate operation clear the reflink flag because if
we do then there's a chance we'll forget to free those extents when we
destroy the incore inode.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>