atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable fsnotify_group.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When fsnotify_add_mark_locked() fails it cleans up the mark it was
adding. Since the mark is already visible in group's list, we should
protect update of mark->flags with mark->lock. I'm not aware of any real
issues this could cause (since we also hold group->mark_mutex) but
better be safe and obey locking rules properly.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fsnotify_add_mark_locked() can fail but we do not check its return
value. This didn't matter before commit 9dd813c15b "fsnotify: Move
mark list head from object into dedicated structure" as none of possible
failures could happen for dnotify but after that commit -ENOMEM can be
returned. Handle this error properly in fcntl_dirnotify() as
otherwise we just hit BUG_ON(dn_mark->dn) in dnotify_free_mark().
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reported-by: syzkaller
Fixes: 9dd813c15b
Signed-off-by: Jan Kara <jack@suse.cz>
Several function prototypes for the set/get functions defined by
module_param_call() have a slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:
@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@
module_param_call(_name, _set_func, _get_func, _arg, _mode);
@fix_set_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _set_func(
-_val_type _val
+const char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
@fix_get_prototype
depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@
int _get_func(
-_val_type _val
+char * _val
,
-_param_type _param
+const struct kernel_param * _param
) { ... }
Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:
drivers/platform/x86/thinkpad_acpi.c
fs/lockd/svc.c
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
On the following call path:
gfs2_setattr -> setattr_prepare -> ... ->
cap_inode_killpriv -> ... ->
gfs2_xattr_set
the glock is locked in gfs2_setattr, so check for recursive locking in
gfs2_xattr_set as gfs2_xattr_get already does. While at it, get rid of
need_unlock in gfs2_xattr_get.
Fixes xfstest generic/093.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add support for the STATX_ATTR_ flags in statx. (Compression,
encryption, and the nodump flag are not supported by gfs2.)
Partially fixes xfstest generic/424.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Switch to a simple array for mapping between the FS_*_FL and GFS_DIF_*
flags. Clarify how the mapping between FS_JOURNAL_DATA_FL and the
filesystem flags works. The GFS2_DIF_SYSTEM flag cannot be set from
user space, so remove it from GFS2_FLAGS_USER_SET. Fail with -EINVAL
when trying to set flags that are not supported instead of silently
ignoring those flags.
Partially fixes xfstest generic/424.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Make sure that changing xattrs marks the corresponding inode dirty so
that a subsequent fsync will sync those changes to disk. We set
I_DIRTY_SYNC as well as I_DIRTY_DATASYNC so that both fsync and
fdatasync will sync xattr changes: xattrs can contain information
critical to how the data can be accessed, so we don't want fdatasync
to skip them.
Fixes xfstest generic/066.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch fixes a deadlock caused when the jdata flag is set for
inodes that are already on the ordered write list. Since it is
on the ordered write list, log_flush calls gfs2_ordered_write which
calls filemap_fdatawrite. But since the inode had the jdata flag
set, that calls gfs2_jdata_writepages, which tries to start a new
transaction. A new transaction cannot be started because it tries
to acquire the log_flush rwsem which is already locked by the log
flush operation.
The bottom line is: We cannot switch an inode from ordered to jdata
until we eliminate any ordered data pages (via log flush) or any
log_flush operation afterward will create the circular dependency
above. So we need to flush the log before setting the diskflags to
switch the file mode, then we need to remove the inode from the
ordered writes list.
Before this patch, the log flush was done for jdata->ordered, but
that's wrong. If we're going from jdata to ordered, we don't need
to call gfs2_log_flush because the call to filemap_fdatawrite will
do it for us:
filemap_fdatawrite() -> __filemap_fdatawrite_range()
__filemap_fdatawrite_range() -> do_writepages()
do_writepages() -> gfs2_jdata_writepages()
gfs2_jdata_writepages() -> gfs2_log_flush()
This patch modifies function do_gfs2_set_flags so that if a file
has its jdata flag set, and it's already on the ordered write list,
the log will be flushed and it will be removed from the list
before setting the flag.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Abhijith Das <adas@redhat.com>
In function gfs2_write_inode, starting with patch a9185b41a4, we
only flush the log and call filemap_fdatawait if we're passed in a
wbc sync_mode of WB_SYNC_ALL. We also need to do these things if
we're evicting a jdata inode, because we might have jdata pages
still attached to bufdata descriptors that need to be revoked, but
by the time it gets to evict() it's too late to start a new
transaction. This patch changes it to treat jdata inodes as if
WB_SYNC_ALL had been specified.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Abhijith Das <adas@redhat.com>
This patch switches GFS2's implementation of fiemap from the old
block_map code to the new iomap interface.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch implements iomap for block mapping, and switches the
block_map function to use it under the covers.
The additional IOMAP_F_BOUNDARY iomap flag indicates when iomap has
reached a "metadata boundary" and fetching the next mapping is likely to
incur an additional I/O. This flag is used for setting the bh buffer
boundary flag.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch eliminates height parameters from function gfs2_bmap_alloc.
Function find_metapath determines the metapath's "find height", also
known as the desired height. Function lookup_metapath determines the
metapath's "actual height", previously known as starting height or
sheight. Function gfs2_bmap_alloc now gets both height values from
the metapath. This simplification was done as a step toward switching
the block_map functions to using iomap. The bh_map responsibilities
are also removed from function gfs2_bmap_alloc for the same reason.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
JFS does not set SB_I_VERSION and doesn't use the i_version counter
internally. Just remove this increment.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Several conflicts here.
NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
nfp_fl_output() needed some adjustments because the code block is in
an else block now.
Parallel additions to net/pkt_cls.h and net/sch_generic.h
A bug fix in __tcp_retransmit_skb() conflicted with some of
the rbtree changes in net-next.
The tc action RCU callback fixes in 'net' had some overlap with some
of the recent tcf_block reworking.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h.
Let's unifiy the code base to always use the symbolic constants. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix missing change from commit f8f84b2dfd
("btrfs: index check-integrity state hash by a dev_t").
Function btrfsic_dev_state_hashtable_lookup uses dev_t to generate hashval
when look in up a btrfsic_dev_state in hash table. So when we add a
btrfsic_dev_state into the hash table, it should also use dev_t.
Reproducer of this bug:
Use MOUNT_OPTIONS="-o check_int" when running xfstest, device can not be
mounted successfully. So xfstest can not run.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When one of the device is missing, bbio_error() takes care of setting
the error status. And if its only IO that is pending in that stripe, it
fails to check the status of the other IO at %bbio_error before setting
the error %bi_status for the %orig_bio. Fix this by checking if
%bbio->error has exceeded the %bbio->max_errors.
Reproducer as below fdatasync error is seen intermittently.
mount -o degraded /dev/sdc /btrfs
dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync
dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error
The reason for the intermittences of the problem is because
the following conditions have to be met, which depends on timing:
In btrfs_map_bio()
- the RAID1 the missing device has to be at %dev_nr = 1
In bbio_error()
. before bbio_error() is called the bio of the not-missing
device at %dev_nr = 0 must be completed so that the below
condition is true
if (atomic_dec_and_test(&bbio->stripes_pending)) {
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A cleanup patch, use need_full_stripe() to replace the open code.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Code cleanup for better understanding:
Variable needs_unlock to be called extent_locked to show state as
opposed to action. Changed the type to int, to reduce code in the
critical path.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At few places we could use BLK_STS_OK and BLK_STS_NOSUPP.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Satoru Taekeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ dropped first hunk btrfs_endio_direct_read ]
Signed-off-by: David Sterba <dsterba@suse.com>
These are useful for debugging problems where we mess with
trans->block_rsv to make sure we're not screwing something up.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can get this from the ref we've passed in.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just excessive information in the ref_head, and makes the code
complicated. It is a relic from when we had the heads and the refs in
the same tree, which is no longer the case. With this removal I've
cleaned up a bunch of the cruft around this old assumption as well.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do a couple different cleanup operations on the ref head. We adjust
counters, we'll free any reserved space if we didn't end up using the
ref, and we clear the pending csum bytes. Move all these disparate
things into cleanup_ref_head and clean up the logic in
__btrfs_run_delayed_refs so that it handles the !ref case a lot cleaner,
as well as making run_one_delayed_ref() only deal with real refs and not
the ref head.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only use this logic if our ref isn't a ref_head, so move it up into
the if (ref) case since we know that this is a normal ref and not a
delayed ref head.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move this code out to a helper function to further simplivy
__btrfs_run_delayed_refs.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the extent_op cleanup for an empty head ref to a helper function to
help simplify __btrfs_run_delayed_refs.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify the error handling in __btrfs_run_delayed_refs by breaking out
the code used to return a head back to the delayed_refs tree for
processing into a helper function.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were only doing btrfs_check_space_for_delayed_refs() if the metadata
space was full, ie we couldn't allocate chunks. This assumes we'll be
able to allocate chunks during transaction commit, but since nothing
does a LIMIT flush during the transaction commit this won't actually
happen unless we happen to run shy of actual space. We already take
into account a full fs in btrfs_check_space_for_delayed_refs() so just
kill this extra check to make sure we're ending the transaction when we
need to.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were having corruption issues that were tied back to problems with
the extent tree. In order to track them down I built this tool to try
and find the culprit, which was pretty successful. If you compile with
this tool on it will live verify every ref update that the fs makes and
make sure it is consistent and valid. I've run this through with
xfstests and haven't gotten any false positives. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update error messages, add fixup from Dan Carpenter to handle errors
of read_tree_block ]
Signed-off-by: David Sterba <dsterba@suse.com>
We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead. This will be used in
a subsequent patch.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the infrastructure for turning ref verify on and off for a
mount, to be used by a later patch.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhnance btrfs_print_mod_info to print if ref-verify is compiled in ]
Signed-off-by: David Sterba <dsterba@suse.com>
The use of sector_t in the callchain of submit_extent_page is not
necessary. Switch to u64 and rename the variable and use byte units
instead of 512b, ie. dropping the >> 9 shifts and avoiding the
con(tro)versions of sector_t.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to remove sector_t and will use 'offset', so this patch
frees the name.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The use of sector_t is not necessry, it's just for a warning. Switch to
u64 and rename the variable and use byte units instead of 512b, ie.
dropping the >> 9 shifts. The messages are adjusted as well.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass in a pointer in our send arg struct, this means the struct size
doesn't match with 32bit user space and 64bit kernel space. Fix this by
adding a compat mode and doing the appropriate conversion.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move structure to the beginning, next to receive 32bit compat ]
Signed-off-by: David Sterba <dsterba@suse.com>
When device is missing without the -o degraded option then its an error
so report it as an error instead of a warning. And when -o degraded
option is provided, log the missing device as warning.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch error to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
EIO is only for the IO failure to the device, avoid it. Use ENOENT as
that's the closest error code describing what happened.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
add_missing_dev() can return device pointer so that IS_ERR/PTR_ERR can
be used to check for the actual error that occurred in the function.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ minor error message adjustment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove variables 'start' and 'end', which are set but never used.
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We now get a harmless compile-time on 32-bit architectures:
fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
fs/btrfs/tree-checker.c:189:70: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'unsigned int' [-Werror=format=]
This changes the format string to use %zu instead of %lu for size_t.
Fixes: c1f6520bf360 ("btrfs: tree-checker: Enhance output for check_extent_data_item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have the combo of flushing twice, which can make sure IO
have started since the second flush will wait for page lock which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.
Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.
However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By setting compression for a defrag task, the task will start IO at
the end of defrag.
After the combo of filemap_flush(), we've already made sure that
dirty pages have made progress via async compress thread because the
second filemap_flush() will wait for page lock, which won't be
unlocked until those pages have been marked as writeback and ordered
extents have been queued.
And this is for per-inode defrag, it's not helpful to wait on a global
%async_delalloc_pages and %nr_async_submits from fs_info.
Although waiting on %nr_async_submits means that all bios are
submitted down to per-device schedule IO lists, it doesn't wait for
their completions, thus users still need to do fsync/sync to make sure
the data is on disk. While with this change, it makes sure that pages
are marked with writeback bits and will be submitted asynchronously
shortly, therefore, the behavior of defrag option '-c' remains unchanged.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was intended to congest higher layers to not send bios, but as
1) the congested bit has been taken by writeback
Async bios come from buffered writes and DIO writes.
For DIO writes, we want to submit them ASAP, while for buffered writes,
writeback uses balance_dirty_pages() to throttle how much dirty pages we
can have.
2) and no one is waiting for %nr_async_bios down to zero,
Historically, it was introduced along with changes which let
checksumming workload spread accross different cpus. And at that time,
pdflush was used instead of per-bdi flushing, perhaps pdflush did not
have the necessary information for writeback to do throttling.
We can safely remove them now.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ additional explanation from mails, removed unused variable 'limit' ]
Signed-off-by: David Sterba <dsterba@suse.com>
Output the invalid member name and its bad value, along with its
expected value range or alignment.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Output the bad value and expected good value (or its alignment).
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ unindent long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
Enhance the output to print:
1) the eason
2) the ad value, if reason is not sufficient
3) good value (range)
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording, unidented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
Use inline function to replace macro since we don't need
stringification.
(Macro still exists until all callers get updated)
And add more info about the error, and replace EIO with EUCLEAN.
For nr_items error, report if it's too large or too small, and output
the valid value range.
For node block pointer, added a new alignment checker.
For key order, also output the next key to make the problem more
obvious.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments, unindented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove dead assigment of num_bytes.
Also as num_bytes only used in the will_compress block as copy of
total_in just replace that with total_in and drop num_bytes entirely.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit a53f4f8e9c ("btrfs: Don't call btrfs_start_transaction() on
frozen fs to avoid deadlock.") started using internal calls and we
replace them with more suitable ones.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct names for sysfs are generated only based on the
attribute names. This means that attribute names cannot be reused in
multiple places throughout the complete btrfs sysfs hierarchy.
E.g. allocation/data/total_bytes and allocation/data/single/total_bytes
result in the same struct name btrfs_attr_total_bytes. A workaround for
this case was made in the past by ad hoc creating an extra macro
wrapper, BTRFS_RAID_ATTR, that inserts some extra text in the struct
name.
Instead of polluting sysfs.h with such kind of extra macro definitions,
and only doing so when there are collisions, use a prefix which gets
inserted in the struct name, so we keep everything nicely grouped
together by default.
Current collections of attributes are:
* (the toplevel, empty prefix)
* allocation
* space_info
* raid
* features
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_transaction_commit fails it will proceed to call
cleanup_transaction, which in turn already does btrfs_abort_transaction.
So let's remove the unnecessary code duplication. Also let's be explicit
about handling failure of btrfs_uuid_tree_add by calling
btrfs_end_transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_udpate_root can fail and it aborts the transaction, the correct
way to handle an aborted transaction is to explicitly end with
btrfs_end_transaction. Even now the code is correct since
btrfs_commit_transaction would handle an aborted transaction but this is
more of an implementation detail. So let's be explicit in handling
failure in btrfs_update_root.
Furthermore btrfs_commit_transaction can also fail and by ignoring it's
return value we could have left the in-memory copy of the root item in
an inconsistent state. So capture the error value which allows us to
correctly revert the RO/RW flags in case of commit failure.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_new_device() calls btrfs_attach_transaction() to
commit sys chunks, and it should error out if it fails.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of BUG_ON return error to the caller. And handle the fail
condition by calling the abort transaction and going through the
error path.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When new device is being added to seed FS, seed FS is marked writable,
but when we fail to bring in the new device, we missed to undo the
writable part. This patch fixes it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_CSUM checker is a relatively easy one, only needs to check:
1) Objectid
Fixed to BTRFS_EXTENT_CSUM_OBJECTID
2) Key offset alignment
Must be aligned to sectorsize
3) Item size alignedment
Must be aligned to csum size
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add extra checks for item with EXTENT_DATA type. This checks the
following thing:
0) Key offset
All key offsets must be aligned to sectorsize.
Inline extent must have 0 for key offset.
1) Item size
Uncompressed inline file extent size must match item size.
(Compressed inline file extent has no information about its on-disk size.)
Regular/preallocated file extent size must be a fixed value.
2) Every member of regular file extent item
Including alignment for bytenr and offset, possible value for
compression/encryption/type.
3) Type/compression/encode must be one of the valid values.
This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
BTRFS_COMPRESS_TYPES does ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.
Normally only the last item may be the victim, but adding such check is
never a bad idea anyway.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current check_leaf() function does a good job checking key order and
item offset/size.
However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.
So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.
And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.
This makes later item/key offset checks and item size checks easier to
be implemented.
Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate
error.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Was added in:
c8b978188c
"Btrfs: Add zlib compression support"
Survive to near time (from 08.10.2008).
Because 'start' checked for zero before branch, so it's safe to remove
that subtraction.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay reported that generic/273 was failing currently with ENOSPC.
Turns out this is because we get to the point where the outstanding
reservations are greater than the pinned space on the fs. This is a
mistake, previously we used the current reservation amount in
may_commit_transaction, not the entire outstanding reservation amount.
Fix this to find the minimum byte size needed to make progress in
flushing, and pass that into may_commit_transaction. From there we can
make a smarter decision on whether to commit the transaction or not.
This fixes the failure in generic/273.
From Nikolai, IOW: when we go to the final stage of deciding whether to
do trans commit, instead of passing all the reservations from all
tickets we just pass the reservation for the current ticket. Otherwise,
in case all reservations exceed pinned space, then we don't commit
transaction and fail prematurely. Before we passed num_bytes from
flush_space, where num_bytes was the sum of all pending reserverations,
but now all we do is take the first ticket and commit the trans if we
can satisfy that.
Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
[ added Nikolai's comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
By analyzing the perf on btrfs send, we found it take large amount of
cpu time on page_cache_sync_readahead. This effort can be reduced after
switching to asynchronous one. Overall performance gain on HDD and SSD
were 9 and 15 percent if simply send a large file.
Signed-off-by: Kuanling Huang <peterh@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The local bio_list may have pending bios when doing cleanup, it can
end up with memory leak if they don't get freed.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't populate the read-only array types on the stack, instead make
it static const. Makes the object code smaller by nearly 60 bytes:
Before:
text data bss dec hex filename
90536 6552 64 97152 17b80 fs/btrfs/ioctl.o
After:
text data bss dec hex filename
90414 6616 64 97094 17b46 fs/btrfs/ioctl.o
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Forward the correct return value -ENOMEM from btrfsic_dev_state_alloc()
too.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
We've seen the following backtrace stack in ftrace or dmesg log,
kworker/u16:10-4244 [000] 241942.480955: function: btrfs_put_ordered_extent
kworker/u16:10-4244 [000] 241942.480956: kernel_stack: <stack trace>
=> finish_ordered_fn (ffffffffa0384475)
=> btrfs_scrubparity_helper (ffffffffa03ca577) <-----"incorrect"
=> btrfs_freespace_write_helper (ffffffffa03ca98e) <-----"correct"
=> process_one_work (ffffffff81117b2f)
=> worker_thread (ffffffff81118c2a)
=> kthread (ffffffff81121de0)
=> ret_from_fork (ffffffff81d7087a)
btrfs_freespace_write_helper is actually calling normal_worker_helper
instead of btrfs_scrubparity_helper, so somehow kernel has parsed the
incorrect function address while unwinding the stack,
btrfs_scrubparity_helper really shouldn't be shown up.
It's caused by compiler doing inline for our helper function, adding a
noinline tag can fix that.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.
We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.
Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.
1) while log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), in which %sync_writers is increased
by one.
2) in the meantime, some writeback operations may happen upon btrfs's
metadata inode, so these writes go synchronously, too.
However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.
This removes the bio_flags related stuff for writing log-tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have started plug in btrfs_write_and_wait_marked_extents() but the
generated IOs actually go to device's schedule IO list where the work
is doing in another task, thus the started plug doesn't make any
sense.
And since we wait for IOs immediately after writing meta blocks, it's
the same case as writing log tree, doing sync submit can merge more
IOs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are checks on fs_info in __btrfs_panic to avoid dereferencing a
null fs_info, however, there is a call to btrfs_crit that may also
dereference a null fs_info. Fix this by adding a check to see if fs_info
is null and only print the s_id if fs_info is non-null.
Detected by CoverityScan CID#401973 ("Dereference after null check")
Fixes: efe120a067 ("Btrfs: convert printk to btrfs_ and fix BTRFS prefix")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of variable 'can_recover' is never used after being set, thus
it should be removed, as it was never used since the first commit
68a7342c51 ("Btrfs: cleanup orphaned root orphan item").
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If 'btrfs_alloc_path()' fails, we must free the resources already
allocated, as done in the other error handling paths in this function.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__link_block_group is called from only 2 places and at each call site the
space_info being passed is the same as the space info assigned to the passed
cache struct. Let's remove the redundant argument and make the function
reference the space_info from the passed block_group_cache. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ renamed to link_block_group ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the code executes add_extent_mapping and if it is successful
it links the new mapping, it then proceeds to unlock the extent mapping
tree and check for failure and handle them. Instead, rework the code to
only perform a single check if add_extent_mapping has failed and handle
it, otherwise the code continues in a linear fashion. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced by 5a5f79b570 ("Btrfs: allow unaligned DIO") and never
used. The buffered fallback from unaligned DIO works as expected.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_changed_cb_t represents the signature of the callback being passed
to btrfs_compare_trees. Currently there is only one such callback,
namely changed_cb in send.c. This function doesn't really uses the first
2 parameters, i.e. the roots. Since there are not other functions
implementing the btrfs_changed_cb_t let's remove the unused parameters
from the prototype and implementation.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iterate_dir_item:found_key - introduced in 31db9f7c23 ("Btrfs:
introduce BTRFS_IOC_SEND for btrfs send/receive"), yet never used.
record_ref:num - ditto
This is a first pass with the low-hanging fruit. There are still quite a
few unsued parameters in some function which have to abide by a callback
interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Src was initially part of 31ff1cd25d ("Btrfs: Copy into the log tree in
big batches"), however 16e7549f04 ("Btrfs: incompatible format change
to remove hole extents") changed parameters passed to copy_items which
made the src variable redundant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we submit direct writes, if the inode is flagged with nodatasum,
there's no benefit to submit asynchronously, because
a) we don't have to calculate checksum across processors,
b) and direct IO has started a plug, but async submit makes us queue
IO on each device's scheduled IO list instead of DIO's plug list, so
that IOs get much less merges in general.
Lets use sync submit for nodatasum inodes.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After mapping block with BTRFS_MAP_WRITE, parities have been sorted to
the end position, so this search can start from the first parity
stripe.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied changelog as a comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We didn't copy fsid to struct super_block.s_uuid so Overlay disables
index feature with btrfs as the lower FS.
kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.
Fix this by publishing the fsid through struct super_block.s_uuid.
[ dsterba: I think that setting s_uuid is the last missing bit. Overlay
needs the file handle encoding support from the lower filesystem, which
is supported. Filling the whole filesystem id is correct, the subvolume
id is encoded in the file handle buffer from inside btrfs_encode_fh. ]
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These aren't used outside of volumes.c.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some static functions are needlessly forward declared. Let's remove those
declarations since they add no value.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both wait_for_commit() and wait_for_writer() are checking the
condition out of the mutex lock.
This refactors code a bit to be lock safe.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the
same job.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're still going to wait after schedule(), we don't have to do
finish_wait() to remove our %wait_queue_entry since prepare_to_wait()
won't add the same %wait_queue_entry twice.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block layer has a limit on plug, ie. BLK_MAX_REQUEST_COUNT == 16, so
we don't gain benefits by batching 64 bios here.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
And fix tcon leak in error path.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: David Disseldorp <ddiss@samba.org>
This patch adds support for online resizing on bigalloc file system by
implementing EXT4_IOC_RESIZE_FS ioctl. Old resize interfaces (add
block groups and extend last block group) are left untouched. Tests
performed with cluster sizes of 1, 2, 4 and 8 blocks (of size 4k) per
cluster. I will add these tests to xfstests.
Signed-off-by: Harshad Shirwadkar <harshads@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQGcBAABAgAGBQJZ81hKAAoJEIosvXAHck9R7K4L/R4vPpYn19s/xPUf0fUYMOWO
JOIghfeCmmfCd2kTZF+fcDRNBpGnJjjs4ZPloxIbF7bQF0VbjrkToxthF6f9aYIJ
gt0jH1ntGUvraDpkZelTAGRj1BZou2IBzJF3Or1sL83ZX76fyXm9cJUx8Y+l2Mlx
BJMOL0Au38oRKOGnGk3GPtrflgNxe+6cTpNhLmVa9CBNDMQYjobrALgGPpbGf5h3
6l1i0IxMXuxeXjqFva0GKCjTsQSON44gNNHQoggIfHvE3nBVpSZLCwNwrVHOfd8q
4FlEXPzr3ME4WzASWqw1kAX+aij2NqbaLgDs7USkn4mUheIvZcHhC0LPVZJrZ1b4
2c3RHkOV0aZQunPJyq5vtO9B4TJC6MLcHS46iwQ6lao9hTVT8OqV7R40qmzQyt06
KwdIGObEm76J7u4lgVlAsapVKgPLOPuObQKhYUdNvRarorNaxgtadnHOhXM20G3S
PE23XvNJYwKl6SJbP97ih8Uq0//7vNCxd/khdnq4FQ==
=8LHe
-----END PGP SIGNATURE-----
Merge tag '4.14-smb3-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Various SMB3 fixes for 4.14 and stable"
* tag '4.14-smb3-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6:
SMB3: Validate negotiate request must always be signed
SMB: fix validate negotiate info uninitialised memory use
SMB: fix leak of validate negotiate info response buffer
CIFS: Fix NULL pointer deref on SMB2_tcon() failure
CIFS: do not send invalid input buffer on QUERY_INFO requests
cifs: Select all required crypto modules
CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE
cifs: handle large EA requests more gracefully in smb2+
Fix encryption labels and lengths for SMB3.1.1
Pull overlayfs fixes from Miklos Szeredi:
"Fix several issues, most of them introduced in the last release"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: do not cleanup unsupported index entries
ovl: handle ENOENT on index lookup
ovl: fix EIO from lookup of non-indexed upper
ovl: Return -ENOMEM if an allocation fails ovl_lookup()
ovl: add NULL check in ovl_alloc_inode
Pull fuse fix from Miklos Szeredi:
"This fixes a longstanding bug, which can be triggered by interrupting
a directory reading syscall"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix READDIRPLUS skipping an entry
Fix an unused variable warning on non-DEBUG builds introduced by
commit 7561d27e90 ("xfs: buffer lru reference count error injection
tag").
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we're done checking all the records/keys in a btree block, compute
the low and high key of the block and compare them to the associated key
in the parent btree block.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Abort an dir/attr btree operation if the attr btree has obvious problems
like loops back to the root or pointers don't point down the tree.
Found by fuzzing btree[0].before to zero in xfs/402, which livelocks on
the cycle in the attr btree.
Apply the same checks to xfs_da3_node_lookup_int.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we're iterating the attribute list and we can't find our previous
location based off the attribute cursor, we'll instead walk down the
attribute btree from the root trying to find where we left off. Move
this code into a separate function for later cleanups.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Make sure the log stripe unit is sane before proceeding with mounting.
AFAICT this means that logsunit has to be 0, 1, or a multiple of the fs
block size. Found this by setting the LSB of logsunit in xfs/350 and
watching the system crash as soon as we try to write to the log.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Log recovery of v4 filesystems does not use buffer verifiers because
log recovery historically can result in transient buffer corruption
when target buffers might be ahead of the log after a crash. v5
filesystems work around this problem with metadata LSN ordering.
While this log recovery verifier behavior is necessary on v4 supers,
it can result in leaving buffers around in the LRU without verifiers
attached for a significant amount of time. This leads to use of
unverified buffers while the filesystem is in active use, long after
recovery has completed.
To address this problem, drain all buffers from the LRU as a final
step of the log mount sequence. Note that this is done
unconditionally to provide a consistently clean cache footprint,
regardless of superblock version or log state. As a side effect,
this ensures that all cache resident, unverified buffers are
reclaimed after log recovery and therefore must be recreated with
verifiers on subsequent use.
Reported-by: Darrick Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It is possible for mkfs to format very small filesystems with too
small of an internal log with respect to the various minimum size
and block count requirements. If this occurs when the log happens to
be smaller than the scan window used for cycle verification and the
scan wraps the end of the log, the start_blk calculation in
xlog_find_head() underflows and leads to an attempt to scan an
invalid range of log blocks. This results in log recovery failure
and a failed mount.
Since there may be filesystems out in the wild with this kind of
geometry, we cannot simply refuse to mount. Instead, cap the scan
window for cycle verification to the size of the physical log. This
ensures that the cycle verification proceeds as expected when the
scan wraps the end of the log.
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
mkfs has a historical problem where it can format very small
filesystems with too small of a physical log. Under certain
conditions, log recovery of an associated filesystem can end up
passing garbage parameter values to some of the cycle and log record
verification functions due to bugs in log recovery not dealing with
such filesystems properly. This results in attempts to read from
bogus/underflowed log block addresses.
Since the buffer read may ultimately succeed, log recovery can
proceed with bogus data and otherwise go off the rails and crash.
One example of this is a negative last_blk being passed to
xlog_find_verify_log_record() causing us to skip the loop, pass a
NULL head pointer to xlog_header_check_mount() and crash.
Improve the xlog buffer verification to address this problem. We
already verify xlog buffer length, so update this mechanism to also
sanity check for a valid log relative block address and otherwise
return an error. Pass a fixed, valid log block address from
xlog_get_bp() since the target address will be validated when the
buffer is read. This ensures that any bogus log block address/length
calculations lead to graceful mount failure rather than risking a
crash or worse if recovery proceeds with bogus data.
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This helper looks up the last extent the covers space before the passed
in block number. This is useful for truncate and similar operations that
operate backwards over the extent list. For xfs_bunmapi it also is
a slight optimization as we can return early if there are not extents
at or below the end of the to be truncated range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iread_extents is just a trivial wrapper, there is no good reason
to keep the two separate.
[darrick: minor fixups having left xfs_bmbt_validate_extent intact]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Look at the return value of xfs_iext_get_extent instead of figuring out
the extent count first and looping up to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rewrite xfs_bmap_insert_extents so that we don't rely on extent indices
except for iterating over them. Not being able to iterate to the previous
extent or finding the extent that stop_fsb is in are sufficient exit
conditions, and we don't need to do any extent count games given that:
a) we already flushed all delalloc extents past our start offset
before doing the operation
b) xfs_iext_count() includes delalloc extents anyway
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rewrite xfs_bmap_collapse_extents so that we don't rely on extent indices
except for iterating over them. Not being able to iterate to the next
extent is a sufficient exit condition, and we don't need to do any extent
count games given that:
a) we already flushed all delalloc extents past our start offset
before doing the operation
b) xfs_iext_count() includes delalloc extents anyway
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This way the caller gets the proper updated extent returned in got.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead do the actual left and right shift work in the callers, and just
keep a helper to update the bmap and rmap btrees as well as the in-core
extent list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Have a separate helper for insert vs collapse, as this prepares us for
simplifying the code in the next patches.
Also changed the done output argument to a bool intead of int for both
new functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The define was always set to 1, which means looping until we reach is
was dead code from the start.
Also remove an initialization of next_fsb for the done case that doesn't
fit the new code flow - it was never checked by the caller in the done
case to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The code is sufficiently different for the insert vs collapse cases both
in xfs_shift_file_space itself and the callers that untangling them will
make life a lot easier down the road.
We still keep a common helper for flushing all data and COW state to get
the inode into the right shape for shifting the extents around.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can simply use the i_rdev field in the Linux inode and just convert
to and from the XFS dev_t when reading or logging/writing the inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove the dead code dealing with the UUID fork format that was never
implemented in Linux (and neither in IRIX as far as I know).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of looping over all extents in some debug-only helper just
insert trace points into the loops that already exist in the calling
functions.
Also split the xfs_extlist trace point into one each for reading and
writing extents from disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iext_update_extent already has basically all the information needed
to centralize the bmap pre/post tracing. We just need to pass inode +
bmap state instead of the inode fork pointer to get all trace annotations.
In addition to covering all the existing trace points this gives us
tracing coverage for the extent shifting operations for free.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we use xfs_iext_insert this is already covered by the tracing
in that function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We already have all the information about the fork a=D1=95 well as additional
tracing information, so pass that to xfs_iext_remove().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This creates the right initial bmap state from the passed in inode
fork enum.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Perform some quick sanity testing of the disk quota information.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Perform simple tests of the realtime bitmap and summary.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Scrub parent pointers, sort of. For directories, we can ride the
'..' entry up to the parent to confirm that there's at most one
dentry that points back to this directory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create the infrastructure to scrub symbolic link data.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Scrub the hash tree, keys, and values in an extended attribute structure.
Refactor the attribute code to use the transaction if the caller supplied
one to avoid buffer deadocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the free space information in a directory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Scrub the hash tree and all the entries in a directory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Provide a way to check the shape and scrub the hashes and records
in a directory or extended attribute btree. These are helper functions
for the directory & attribute scrubbers in subsequent patches.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[fengguang: remove unneeded variable to store return value]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Scrub an individual inode's block mappings to make sure they make sense.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Plumb in the pieces necessary to check the refcount btree. If rmap is
available, check the reference count by performing an interval query
against the rmapbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the reverse mapping records to make sure that the contents
make sense.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the records of the inode btrees to make sure that the values
make sense given the inode records themselves.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the extent records free space btrees to ensure that the values
look sane.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a forgotten check to the AGI verifier, then wire up the scrub
infrastructure to check the AGI contents.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Check the block references in the AGF and AGFL headers to make sure
they make sense.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Ensure that the geometry presented in the backup superblocks matches
the primary superblock so that repair can recover the filesystem if
that primary gets corrupted.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add some helpers to enable us to lock an AG's headers, create btree
cursors for all btrees in that allocation group, and clean up
afterwards.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add to the btree scrubber the ability to check that the keys and
records are in the right order and actually call out to our record
iterator to do actual checking of the records.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a function that can check the shape of a btree -- each block
passes basic inspection and all the pointers look ok. In the next patch
we'll add the ability to check the actual keys and records stored within
the btree. Add some helper functions so that we report detailed scrub
errors in a uniform manner in dmesg. These are helper functions for
subsequent patches.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create helper functions and tracepoints to deal with errors while
scrubbing a metadata btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create helper functions to record crc and corruption problems, and
deal with any other runtime errors that arise.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a probe scrubber with id 0. This will be used by xfs_scrub to
probe the kernel's abilities to scrub (and repair) the metadata. We do
this by validating the ioctl inputs from userspace, preparing the
filesystem for a scrub (or a repair) operation, and immediately
returning to userspace. Userspace can use the returned errno and
structure state to decide (in broad terms) if scrub/repair are
supported by the running kernel.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create structures needed to hold scrubbing context and dispatch incoming
commands to the individual scrubbers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create an ioctl that can be used to scrub internal filesystem metadata.
The new ioctl takes the metadata type, an (optional) AG number, an
(optional) inode number and generation, and a flags argument. This will
be used by the upcoming XFS online scrub tool.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create some helper functions to check that inode pointers point to
somewhere within the filesystem and not at the static AG metadata.
Move xfs_internal_inum and create a directory inode check function.
We will use these functions in scrub and elsewhere.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor the btree block header checks to have an internal function that
returns the address of the failing check without logging errors. The
scrubber will call the internal function, while the external version
will maintain the current logging behavior.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor the btree pointer checks so that we can call them from the
scrub code without logging errors to dmesg. Preserve the existing error
reporting for regular operations.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create some helper functions to check that a block pointer points
within the filesystem (or AG) and doesn't point at static metadata.
We will use this for scrub.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
For an XFS_IGET_INCORE iget operation, if the inode isn't in the cache,
return ENODATA so that we don't confuse it with the pre-existing ENOENT
cases (inode is in cache, but freed).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
XFS uses a fixed reference count for certain types of buffers in the
internal LRU cache. These reference counts dictate how aggressively
certain buffers are reclaimed vs. others. While the reference counts
implements priority across different buffer types, all buffers
(other than uncached buffers) are typically cached for at least one
reclaim cycle.
We've had at least one bug recently that has been hidden by a
released buffer sitting around in the LRU. Users hitting the problem
were able to reproduce under enough memory pressure to cause
aggressive reclaim in a particular window of time.
To support future xfstests cases, add an error injection tag to
hardcode the buffer reference count to zero. When enabled, this
bypasses caching of associated buffers and facilitates test cases
that depend on this behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The child buffer read in xfs_attr3_node_inactive() should never
reach a hole in the attr fork. If this occurs, it is likely due to a
bug. Prior to commit cd87d867 ("xfs: don't crash on unexpected holes
in dir/attr btrees"), this would result in a crash. Now that the
crash has been fixed, this is a silent failure.
Pass -1 to xfs_da3_node_read() from xfs_da3_node_inactive() to
indicate that reading from a hole is an error. This logs an error to
syslog and fails the inode inactivation, leaving the inode on the AG
unlinked list until removed by xfs_repair (or log recovery). Also
update the subsequent code to reflect that the read now returns a
non-NULL buffer or an error.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A umount hang is possible when a race occurs between the umount
process and the xfsaild kthread. The following sequences outline
the race:
xfsaild: kthread_should_stop()
=> return false, so xfsaild continue
umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags)
=> by kthread_stop()
umount: wake_up_process()
=> because xfsaild is still running, so 0 is returned
xfsaild: __set_current_state(TASK_INTERRUPTIBLE)
xfsaild: schedule()
=> now, xfsaild will wait indefinitely
umount: wait_for_completion()
=> and umount will hang
To fix that, we need to check kthread_should_stop() after we set
the task state, so the xfsaild will either see the stop bit and
exit or the task state is reset to runnable by wake_up_process()
such that it isn't scheduled out indefinitely and detects the stop
bit at the next iteration.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Unused after the big bmap refactor.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Unused after the big bmap refactor.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We only use xfs_bmbt_lookup_ge to look up the first bmap record in an
inode, so replace xfs_bmbt_lookup_ge with a special purpose helper that
is a bit more descriptive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we've massaged the callers into the right form we can always
pass the actual extent record instead of the individual fields.
As an additional benefit the btree cursor will now be prepoulated with
the correct extent state instead of having to fix it up later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we've massaged the callers into the right form we can always
pass the actual extent record instead of the individual fields.
With that xfs_bmbt_disk_set_allf can go away, and xfs_bmbt_disk_set_all
can be merged into the former implementation of xfs_bmbt_disk_set_allf.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use xfs_iext_get_extent to find, and xfs_iext_update_extent to update
entries in the in-core extent list. This isolates the function from
the detailed layout of the extent list, and generally makes the code
a lot more readable.
Also get rid of the oldext and newext variables as using the extent
records is a lot more descriptive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Account for all changes to the delalloc reservation in da_new, and use a
single call xfs_mod_fdblocks to reserve/free blocks, including always
checking for an error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use xfs_iext_get_extent to find, and xfs_iext_update_extent to update
entries in the in-core extent list. This isolates the function from
the detailed layout of the extent list, and generally makes the code
a lot more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use xfs_iext_update_extent to update entries in the in-core extent list.
This isolates the function from the detailed layout of the extent list,
and generally makes the code a lot more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use xfs_iext_get_extent to find, and xfs_iext_update_extent to update
entries in the in-core extent list. This isolates the function from
the detailed layout of the extent list, and generally makes the code
a lot more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use xfs_iext_update_extent to update entries in the in-core extent list.
This isolates the function from the detailed layout of the extent list,
and generally makes the code a lot more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the same defines as the other extent add and delete helpers, which
both improves code readability and trace point output.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the _FILLING values to match the usage in the xfs_bmap_add_extent_*
helpers. No change in behavior, just better naming in the code and
tracepoint output.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
And remove the delalloc code from xfs_bmap_del_extent, which gets renamed
to xfs_bmap_del_extent_real to fit the naming scheme used by the other
xfs_bmap_{add,del}_extent_* routines.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rename the bno variable that's used as the end of the range in
__xfs_bunmapi to end, which better describes it. Additionally change
the start variable which takes the initial value of bno to be the
function parameter itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The XFS_BTCUR_BPRV_WASDEL flag is supposed to indicate that we are
converting a delayed allocation to a real one, which isn't the case
in xfs_bunmapi. Setting it could theoretically lead to misaccounting
here, but it's unlikely that we ever hit it in practice.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This avoids exposure to details of the extent list implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There was one spot in xfs_bmap_add_extent_unwritten_real that didn't use the
passed in new extent state but always converted to normal, leading to wrong
behavior when converting from normal to unwritten.
Only found by code inspection, it seems like this code path to move partial
extent from written to unwritten while merging it with the next extent is
rarely exercised.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of passing in a formatter callback allocate the bmap buffer
in the caller and process the entries there. Additionally replace
the in-kernel buffer with a new much smaller structure, and unify
the implementation of the different ioctls in a single function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently getbmap uses xfs_bmapi_read to query the extent map, and then
fixes up various bits that are eventually reported to userspace.
This patch instead rewrites it to use xfs_iext_lookup_extent and
xfs_iext_get_extent to iteratively process the extent map. This not
only avoids the need to allocate a map for the returned xfs_bmbt_irec
structures but also greatly simplified the code.
There are two intentional behavior changes compared to the old code:
- the current code reports unwritten extents that don't directly border
a written one as unwritten even when not passing the BMV_IF_PREALLOC
option, contrary to the documentation. The new code requires the
BMV_IF_PREALLOC flag to report the unwrittent extent bit.
- The new code does never merges consecutive extents, unlike the old
code that sometimes does it based on the boundaries of the
xfs_bmapi_read calls. Note that the extent merging behavior was
entirely undocumented.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
According to MS-SMB2 3.2.55 validate_negotiate request must
always be signed. Some Windows can fail the request if you send it unsigned
See kernel bugzilla bug 197311
CC: Stable <stable@vger.kernel.org>
Acked-by: Ronnie Sahlberg <lsahlber.redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
There are several assignments to variables that are redundant
as the values are never read when the variables are updated later
and so the redundant statements can be safely removed.
Cleans up clang warnings:
fs/f2fs/segment.c:923:19: warning: Value stored to 'p' during its initialization is never read
fs/f2fs/segment.c:2060:2: warning: Value stored to 'hint' is never read
fs/f2fs/segment.c:2353:2: warning: Value stored to 'start_block' is never read
fs/f2fs/segment.c:2354:2: warning: Value stored to 'end_block' is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All uses of timespec are deprecated, and this one is not particularly
useful, as the documented method for converting seconds to jiffies
is to multiply by 'HZ'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There may be extreme case as below:
For one section contains one segment, and there are total 100 segments
with 10% over-privision ratio in f2fs partition, fggc_threshold will
be rounded down to 460 instead of 460.8 as below caclulation:
sbi->fggc_threshold = div_u64((u64)(main_count - ovp_count) *
BLKS_PER_SEC(sbi), (main_count - resv_count));
If section usage is as:
60 segments which contain 460 valid blocks
40 segments which contain 462 valid blocks
As valid block number in all sections is large than fggc_threshold, so
none of them will be chosen as candidate due to incorrect fggc_threshold.
Let's just soften the term of choosing foreground GC candidates.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 4f31d26b0c.
It turns out that we need to report error number if nothing was written.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes obosolete parameter for truncate_xattr_node.
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If some abnormal users try lots of atomic write operations, f2fs is able to
produce pinned pages in the main memory which affects system performance.
This patch limits that as 20% over total memory size, and if f2fs reaches
to the limit, it will drop all the inmemory pages.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to update ctx->pos correctly when hitting hole in
directory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, for large directory, we just do readahead only once in
readdir(), readdir()'s performance may drop when traversing latter
blocks. In order to avoid this, relocate readahead codes to covering
all traverse flow.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch follows ext4 to allow readdir() in large empty directory to
be interrupted. Referenced commit of ext4: 1f60fbe727 ("ext4: allow
readdir()'s of large empty directories to be interrupted").
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's skip entire non-exist area to speed up truncate_hole by
using get_next_page_offset.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there's some data written through inline data or dentry, we need to shouw
st_blocks. This fixes reporting zero blocks even though there is small written
data.
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid link file for quotacheck]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When doing fault injection test, f2fs_evict_inode() didn't remove gdirty_list
which incurs a kernel panic due to wrong pointer access.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch avoids dropping crypto key in f2fs_drop_inode, so we can guarantee
it happens only at evict_inode.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
last_disk_size could be wrong due to concurrently updating, so using
i_sem semaphore to make last_disk_size updating exclusive to fix this
issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ->umount, once we drop remained discard entries, we should not
set CP_TRIMMED_FLAG with another checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__submit_discard_cmd may lead long latency due to exhaustion of I/O
request resource in block layer, so issuing all discard under cmd_lock
may lead to hangtask, in order to avoid that, let's reduce it's coverage.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are many different scenarios such as fstrim, umount, urgent or
background where we will issue discards, actually, they need use
different policy in aspect of io aware, discard granularity, delay
interval and so on. But now they just share one common discard policy,
so there will be race when changing policy in between these scenarios,
the interference of changing discard policy will be very serious.
This patch changes to split discard policy for different scenarios.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch wraps scattered optional parameters into discard policy as
below, later, with it we expect that we can adjust these parameters with
proper strategy in different scenario.
struct discard_policy {
unsigned int min_interval; /* used for candidates exist */
unsigned int max_interval; /* used for candidates not exist */
unsigned int max_requests; /* # of discards issued per round */
unsigned int io_aware_gran; /* minimum granularity discard not be aware of I/O */
bool io_aware; /* issue discard in idle time */
bool sync; /* submit discard with REQ_SYNC flag */
};
This patch doesn't change any logic of codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fstrim intends to trim invalid blocks of filesystem only with specified
range and granularity, but actually, it will issue all previous cached
discard commands which may be out-of-range and be with unmatched
granularity, it's unneeded.
In order to fix above issues, this patch introduces new helps to support
to issue and wait discard in range and adds a new fstrim_list for tracking
in-flight discard from ->fstrim.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
An undersize validate negotiate info server response causes the client
to use uninitialised memory for struct validate_negotiate_info_rsp
comparisons of Dialect, SecurityMode and/or Capabilities members.
Link: https://bugzilla.samba.org/show_bug.cgi?id=13092
Fixes: 7db0a6efdc ("SMB3: Work around mount failure when using SMB3 dialect to Macs")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Fixes: ff1c038add ("Check SMB3 dialects against downgrade attacks")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
If SendReceive2() fails rsp is set to NULL but is dereferenced in the
error handling code.
Cc: stable@vger.kernel.org
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
query_info() doesn't use the InputBuffer field of the QUERY_INFO
request, therefore according to [MS-SMB2] it must:
a) set the InputBufferOffset to 0
b) send a zero-length InputBuffer
Doing a) is trivial but b) is a bit more tricky.
The packet is allocated according to it's StructureSize, which takes
into account an extra 1 byte buffer which we don't need
here. StructureSize fields must have constant values no matter the
actual length of the whole packet so we can't just edit that constant.
Both the NetBIOS-over-TCP message length ("rfc1002 length") L and the
iovec length L' have to be updated. Since L' is computed from L we
just update L by decrementing it by one.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Some dependencies were lost when CIFS_SMB2 was merged into CIFS.
Fixes: 2a38e12053 ("[SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred")
Signed-off-by: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Marios Titas running a Haskell program noticed a problem with fuse's
readdirplus: when it is interrupted by a signal, it skips one directory
entry.
The reason is that fuse erronously updates ctx->pos after a failed
dir_emit().
The issue originates from the patch adding readdirplus support.
Reported-by: Jakob Unterwurzacher <jakobunt@gmail.com>
Tested-by: Marios Titas <redneb@gmx.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 0b05b18381 ("fuse: implement NFS-like readdirplus support")
Cc: <stable@vger.kernel.org> # v3.9
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the dcache code and comments to use
{READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-4-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sparse warns:
fs/ceph/caps.c:2042:9: warning: context imbalance in 'try_flush_caps' - wrong count at exit
We need to exit this function with the lock unlocked, but a couple of
cases leave it locked.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With index=on, ovl_indexdir_cleanup() tries to cleanup invalid index
entries (e.g. bad index name). This behavior could result in cleaning of
entries created by newer kernels and is therefore undesirable.
Instead, abort mount if such entries are encountered. We still cleanup
'stale' entries and 'orphan' entries, both those cases can be a result
of offline changes to lower and upper dirs.
When encoutering an index entry of type directory or whiteout, kernel
was supposed to fallback to read-only mount, but the fill_super()
operation returns EROFS in this case instead of returning success with
read-only mount flag, so mount fails when encoutering directory or
whiteout index entries. Bless this behavior by returning -EINVAL on
directory and whiteout index entries as we do for all unsupported index
entries.
Fixes: 61b674710c ("ovl: do not cleanup directory and whiteout index..")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Treat ENOENT from index entry lookup the same way as treating a returned
negative dentry. Apparently, either could be returned if file is not
found, depending on the underlying file system.
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Commit fbaf94ee3c ("ovl: don't set origin on broken lower hardlink")
attempt to avoid the condition of non-indexed upper inode with lower
hardlink as origin. If this condition is found, lookup returns EIO.
The protection of commit mentioned above does not cover the case of lower
that is not a hardlink when it is copied up (with either index=off/on)
and then lower is hardlinked while overlay is offline.
Changes to lower layer while overlayfs is offline should not result in
unexpected behavior, so a permanent EIO error after creating a link in
lower layer should not be considered as correct behavior.
This fix replaces EIO error with success in cases where upper has origin
but no index is found, or index is found that does not match upper
inode. In those cases, lookup will not fail and the returned overlay inode
will be hashed by upper inode instead of by lower origin inode.
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit fd2421f544 ("fs/9p: When doing inode lookup compare qid details
and inode mode bits.") transformed v9fs_qid_iget() to use iget5_locked()
instead of iget_locked(). However, the test() callback is not checking
fid.path at all, which means that a lookup in the inode cache can now
accidentally locate a completely wrong inode from the same inode hash
bucket if the other fields (qid.type and qid.version) match.
Fixes: fd2421f544 ("fs/9p: When doing inode lookup compare qid details and inode mode bits.")
Cc: stable@vger.kernel.org
Reviewed-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Tuomas Tynkkynen <tuomas@tuxera.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Apparently our current rwsem code doesn't like doing the trylock, then
lock for real scheme. So change our read/write methods to just do the
trylock for the RWF_NOWAIT case. This fixes a ~25% regression in
AIM7.
Fixes: 91f9943e ("fs: support RWF_NOWAIT for buffered reads")
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There were quite a few overlapping sets of changes here.
Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.
Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly. If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.
In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().
Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.
The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull vfs fixes from Al Viro:
"MS_I_VERSION fixes - Mimi's fix + missing bits picked from Matthew
(his patch contained a duplicate of the fs/namespace.c fix as well,
but by that point the original fix had already been applied)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Convert fs/*/* to SB_I_VERSION
vfs: fix mounting a filesystem with i_version
Pull key handling fixes from James Morris:
"This includes a fix for the capabilities code from Colin King, and a
set of further fixes for the keys subsystem. From David:
- Fix a bunch of places where kernel drivers may access revoked
user-type keys and don't do it correctly.
- Fix some ecryptfs bits.
- Fix big_key to require CONFIG_CRYPTO.
- Fix a couple of bugs in the asymmetric key type.
- Fix a race between updating and finding negative keys.
- Prevent add_key() from updating uninstantiated keys.
- Make loading of key flags and expiry time atomic when not holding
locks"
* 'fixes-v4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
commoncap: move assignment of fs_ns to avoid null pointer dereference
pkcs7: Prevent NULL pointer dereference, since sinfo is not always set.
KEYS: load key flags and expiry time atomically in proc_keys_show()
KEYS: Load key expiry time atomically in keyring_search_iterator()
KEYS: load key flags and expiry time atomically in key_validate()
KEYS: don't let add_key() update an uninstantiated key
KEYS: Fix race between updating and finding a negative key
KEYS: checking the input id parameters before finding asymmetric key
KEYS: Fix the wrong index when checking the existence of second id
security/keys: BIG_KEY requires CONFIG_CRYPTO
ecryptfs: fix dereference of NULL user_key_payload
fscrypt: fix dereference of NULL user_key_payload
lib/digsig: fix dereference of NULL user_key_payload
FS-Cache: fix dereference of NULL user_key_payload
KEYS: encrypted: fix dereference of NULL user_key_payload
This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.
This new command allows processes to register their intent to use the
private expedited command. This affects how the expedited private
command introduced in 4.14-rc is meant to be used, and should be merged
before 4.14 final.
Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches. Several potential algorithms are much less painful
if the user register intent to use this functionality early on, for
example, before the process spawns the second thread. Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The error code is missing here so it means we return ERR_PTR(0) or NULL.
The other error paths all return an error code so this probably should
as well.
Fixes: 02b69b284c ("ovl: lookup redirects")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This was detected by fault injection test
Signed-off-by: Hirofumi Nakagawa <nklabs@gmail.com>
Fixes: 13cf199d00 ("ovl: allocate an ovl_inode struct")
Cc: <stable@vger.kernel.org> # v4.13
Make config_item_type structures const as they are either passed to a
function having the argument as const or stored in the const "ci_type"
field of a config_item structure.
Done using Coccinelle.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Make these structures const as they are either passed to the functions
having the argument as const or stored as a reference in the "ci_type"
const field of a config_item structure.
Done using Coccinelle.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The ci_type field of the config_item structure do not modify the fields
of the config_item_type structure it points to. And the other pointers
initialized with ci_type do not modify the fields as well.
So, make the ci_type field and the pointers initialized with ci_type
as const.
Make the struct config_item_type *type function argument of functions
config_{item/group}_init_type_name const as the argument in both the
functions is only stored in the ci_type field of a config_item structure
which is now made const.
Make the argument of configfs_register_default_group const as it is
only passed to the argument of the function config_group_init_type_name
which is now const.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Introduce a helper function which prepares to look up the given dentry
in the given directory. If the directory is encrypted, it handles
loading the directory's encryption key, setting the dentry's ->d_op to
fscrypt_d_ops, and setting DCACHE_ENCRYPTED_WITH_KEY if the directory's
encryption key is available.
Note: once all filesystems switch over to this, we'll be able to move
fscrypt_d_ops and fscrypt_set_encrypted_dentry() to fscrypt_private.h.
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Introduce a helper function which prepares to rename a file into a
possibly encrypted directory. It handles loading the encryption keys
for the source and target directories if needed, and it handles
enforcing that if the target directory (and the source directory for a
cross-rename) is encrypted, then the file being moved into the directory
has the same encryption policy as its containing directory.
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Introduce a helper function which prepares to link an inode into a
possibly-encrypted directory. It handles setting up the target
directory's encryption key, then verifying that the link won't violate
the constraint that all files in an encrypted directory tree use the
same encryption policy.
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add a helper function which prepares to open a regular file which may be
encrypted. It handles setting up the file's encryption key, then
checking that the file's encryption policy matches that of its parent
directory (if the parent directory is encrypted). It may be set as the
->open() method or it can be called from another ->open() method.
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In the case where a filesystem has been configured without encryption
support, there is no longer any need to initialize ->s_cop at all, since
none of the methods are ever called.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that all callers of fscrypt_operations.is_encrypted() have been
switched to IS_ENCRYPTED(), remove ->is_encrypted().
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
IS_ENCRYPTED() now gives the same information as
i_sb->s_cop->is_encrypted() but is more efficient, since IS_ENCRYPTED()
is just a simple flag check. Prepare to remove ->is_encrypted() by
switching all callers to IS_ENCRYPTED().
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Introduce a flag S_ENCRYPTED which can be set in ->i_flags to indicate
that the inode is encrypted using the fscrypt (fs/crypto/) mechanism.
Checking this flag will give the same information that
inode->i_sb->s_cop->is_encrypted(inode) currently does, but will be more
efficient. This will be useful for adding higher-level helper functions
for filesystems to use. For example we'll be able to replace this:
if (ext4_encrypted_inode(inode)) {
ret = fscrypt_get_encryption_info(inode);
if (ret)
return ret;
if (!fscrypt_has_encryption_key(inode))
return -ENOKEY;
}
with this:
ret = fscrypt_require_key(inode);
if (ret)
return ret;
... since we'll be able to retain the fast path for unencrypted files as
a single flag check, using an inline function. This wasn't possible
before because we'd have had to frequently call through the
->i_sb->s_cop->is_encrypted function pointer, even when the encryption
support was disabled or not being used.
Note: we don't define S_ENCRYPTED to 0 if CONFIG_FS_ENCRYPTION is
disabled because we want to continue to return an error if an encrypted
file is accessed without encryption support, rather than pretending that
it is unencrypted.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Filesystems have to include different header files based on whether they
are compiled with encryption support or not. That's nasty and messy.
Instead, rationalise the headers so we have a single include fscrypt.h
and let it decide what internal implementation to include based on the
__FS_HAS_ENCRYPTION define. Filesystems set __FS_HAS_ENCRYPTION to 1
before including linux/fscrypt.h if they are built with encryption
support. Otherwise, they must set __FS_HAS_ENCRYPTION to 0.
Add guards to prevent fscrypt_supp.h and fscrypt_notsupp.h from being
directly included by filesystems.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[EB: use 1 and 0 rather than defined/undefined]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[AV: in addition to the fix in previous commit]
Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- fix some more CONFIG_XFS_RT related build problems
- fix data loss when writeback at eof races eofblocks gc and loses
- invalidate page cache after fs finishes a dio write
- remove dirty page state when invalidating pages so releasepage does
the right thing when handed a dirty page
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZ5jqbAAoJEPh/dxk0SrTrtfMP/jcQ6lTDcpnQ7XEP2fg2dXjx
2+z8uI7Mjr5wo2qfIWHc8nZHZ+8KRak4U28rTlrXkeVbJ79x3Z+SzeipP76dGHXB
u9MD7uacTD6BDT7R8/bux7g7KrPATVJYJiT3PRHZ5ysUT6i9KnREdbaKpgOwhMcI
Ivd9ROZHx62CmZhsbfLzD+Ccy9/mGBR5OmT8nQlsuD8cEcFU5u1afaJ2/YlCjNLN
c16Q8dhGXed7tjduiYCzsxDiewJMzSfcGdyk6yCwXdR3zcI3RdhXUN5FRH0R9GB2
xxG1n5Q4qgtgODGgcPUl9WG8mfhVvEcuZGioxChQrxCEcaHt1Waop0fOixLy9J3Q
lUn4qjA5S+VBqa6XsKCSCkiZdDtncSedvMRQYef09q8DGAouwAtN/Z3BVM24oyWU
k5888Gt4EHZK6V3lz3qPMmGFxfuPL6GeyEvIYUezpVIYsmp0sLQTeNFUW+XC7fb/
tOBNom4ARHFmSb5da7uwJvesNZBVFSpFQtxkcx1OL0rhTqlKIfPP61dLznKhqUTL
2NhaFjnznYenSEK2CsP+V3CtQrCxywdqDNnOEgTgKJbWPpsYMX63z/Cmtm0A7Qdz
BAbGc+OSBLqelwsWNnNzTWPHk33SKxtIxGTe8gKbKbrzbR7mxyJxHKEwpZvWIqh+
8eTdgJb1wgJyqtBsTSHN
=UY00
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fix some more CONFIG_XFS_RT related build problems
- fix data loss when writeback at eof races eofblocks gc and loses
- invalidate page cache after fs finishes a dio write
- remove dirty page state when invalidating pages so releasepage does
the right thing when handed a dirty page
* tag 'xfs-4.14-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: move two more RT specific functions into CONFIG_XFS_RT
xfs: trim writepage mapping to within eof
fs: invalidate page cache after end_io() in dio completion
xfs: cancel dirty pages on invalidation
Pull block fixes from Jens Axboe:
"Three small fixes:
- A fix for skd, it was using kfree() to free a structure allocate
with kmem_cache_alloc().
- Stable fix for nbd, fixing a regression using the normal ioctl
based tools.
- Fix for a previous fix in this series, that fixed up
inconsistencies between buffered and direct IO"
* 'for-linus' of git://git.kernel.dk/linux-block:
fs: Avoid invalidation in interrupt context in dio_complete()
nbd: don't set the device size until we're connected
skd: Use kmem_cache_free
Help the user to find the appropriate mount option to continue mounting
the file system on a read-only device if the journal requires recovery.
Signed-off-by: Simon Ruderich <simon@ruderich.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The channel value for requesting server remote invalidating local memory
registration should be 0x00000002
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Update reading the EA using increasingly larger buffer sizes
until the response will fit in the buffer, or we exceed the
(arbitrary) maximum set to 64kb.
Without this change, a user is able to add more and more EAs using
setfattr until the point where the total space of all EAs exceed 2kb
at which point the user can no longer list the EAs at all
and getfattr will abort with an error.
The same issue still exists for EAs in SMB1.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
SMB3.1.1 is most secure and recent dialect. Fixup labels and lengths
for sMB3.1.1 signing and encryption.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: linux-ext4@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Make AF_RXRPC accept MSG_WAITALL as a flag to sendmsg() to tell it to
ignore signals whilst loading up the message queue, provided progress is
being made in emptying the queue at the other side.
Progress is defined as the base of the transmit window having being
advanced within 2 RTT periods. If the period is exceeded with no progress,
sendmsg() will return anyway, indicating how much data has been copied, if
any.
Once the supplied buffer is entirely decanted, the sendmsg() will return.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide support for a kernel service to make use of the service upgrade
facility. This involves:
(1) Pass an upgrade request flag to rxrpc_kernel_begin_call().
(2) Make rxrpc_kernel_recv_data() return the call's current service ID so
that the caller can detect service upgrade and see what the service
was upgraded to.
Signed-off-by: David Howells <dhowells@redhat.com>
Currently we try to defer completion of async DIO to the process context
in case there are any mapped pages associated with the inode so that we
can invalidate the pages when the IO completes. However the check is racy
and the pages can be mapped afterwards. If this happens we might end up
calling invalidate_inode_pages2_range() in dio_complete() in interrupt
context which could sleep. This can be reproduced by generic/451.
Fix this by passing the information whether we can or can't invalidate
to the dio_complete(). Thanks Eryu Guan for reporting this and Jan Kara
for suggesting a fix.
Fixes: 332391a993 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix some warnings that appear when compiling with -Wconversion.
A sub-optimal choice of variable type leads to warnings about
conversion in both directions between unsigned and signed.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix problems noted in compilion with -Wformat=2 -Wformat-signedness.
In particular, a mismatch between the signedness of a value and the
signedness of its format specifier can result in unsigned values being
printed as negative numbers, e.g.:
Partition (0 type 1511) starts at physical 460, block length -1779968542
...which occurs when mounting a large (> 1 TiB) UDF partition.
Changes since V1:
* Fixed additional issues noted in udf_bitmap_free_blocks(),
udf_get_fileident(), udf_show_options()
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Large (> 1 TiB) UDF filesystems appear subject to several problems when
mounted on 64-bit systems:
* readdir() can fail on a directory containing File Identifiers residing
above 0x7FFFFFFF. This manifests as a 'ls' command failing with EIO.
* FIBMAP on a file block located above 0x7FFFFFFF can return a negative
value. The low 32 bits are correct, but applications that don't mask the
high 32 bits of the result can perform incorrectly.
Per suggestion by Jan Kara, introduce a udf_pblk_t type for representation
of UDF block addresses. Ultimately, all driver functions that manipulate
UDF block addresses should use this type; for now, deployment is limited
to functions with actual or potential sign extension issues.
Changes to udf_readdir() and udf_block_map() address the issues noted
above; other changes address potential similar issues uncovered during
audit of the driver code.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The mount i_version flag is not enabled in the new sb_flags. This patch
adds the missing SB_I_VERSION flag.
Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal
superblock flags"
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The last cleanup introduced two harmless warnings:
fs/xfs/xfs_fsmap.c:480:1: warning: '__xfs_getfsmap_rtdev' defined but not used
fs/xfs/xfs_fsmap.c:372:1: warning: 'xfs_getfsmap_rtdev_rtbitmap_helper' defined but not used
This moves those two functions as well.
Fixes: bb9c2e5433 ("xfs: move more RT specific code under CONFIG_XFS_RT")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The writeback rework in commit fbcc025613 ("xfs: Introduce
writeback context for writepages") introduced a subtle change in
behavior with regard to the block mapping used across the
->writepages() sequence. The previous xfs_cluster_write() code would
only flush pages up to EOF at the time of the writepage, thus
ensuring that any pages due to file-extending writes would be
handled on a separate cycle and with a new, updated block mapping.
The updated code establishes a block mapping in xfs_writepage_map()
that could extend beyond EOF if the file has post-eof preallocation.
Because we now use the generic writeback infrastructure and pass the
cached mapping to each writepage call, there is no implicit EOF
limit in place. If eofblocks trimming occurs during ->writepages(),
any post-eof portion of the cached mapping becomes invalid. The
eofblocks code has no means to serialize against writeback because
there are no pages associated with post-eof blocks. Therefore if an
eofblocks trim occurs and is followed by a file-extending buffered
write, not only has the mapping become invalid, but we could end up
writing a page to disk based on the invalid mapping.
Consider the following sequence of events:
- A buffered write creates a delalloc extent and post-eof
speculative preallocation.
- Writeback starts and on the first writepage cycle, the delalloc
extent is converted to real blocks (including the post-eof blocks)
and the mapping is cached.
- The file is closed and xfs_release() trims post-eof blocks. The
cached writeback mapping is now invalid.
- Another buffered write appends the file with a delalloc extent.
- The concurrent writeback cycle picks up the just written page
because the writeback range end is LLONG_MAX. xfs_writepage_map()
attributes it to the (now invalid) cached mapping and writes the
data to an incorrect location on disk (and where the file offset is
still backed by a delalloc extent).
This problem is reproduced by xfstests test generic/464, which
triggers racing writes, appends, open/closes and writeback requests.
To address this problem, trim the mapping used during writeback to
within EOF when the mapping is validated. This ensures the mapping
is revalidated for any pages encountered beyond EOF as of the time
the current mapping was cached or last validated.
Reported-by: Eryu Guan <eguan@redhat.com>
Diagnosed-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit 332391a993 ("fs: Fix page cache inconsistency when mixing
buffered and AIO DIO") moved page cache invalidation from
iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
path, but before the dio->end_io() call, and it re-introdued the bug
fixed by commit c771c14baa ("iomap: invalidate page caches should
be after iomap_dio_complete() in direct write").
I found this because fstests generic/418 started failing on XFS with
v4.14-rc3 kernel, which is the regression test for this specific
bug.
So similarly, fix it by moving dio->end_io() (which does the
unwritten extent conversion) before page cache invalidation, to make
sure next buffer read reads the final real allocations not unwritten
extents. I also add some comments about why should end_io() go first
in case we get it wrong again in the future.
Note that, there's no such problem in the non-iomap based direct
write path, because we didn't remove the page cache invalidation
after the ->direct_IO() in generic_file_direct_write() call, but I
decided to fix dio_complete() too so we don't leave a landmine
there, also be consistent with iomap_dio_complete().
Fixes: 332391a993 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Recently we've had warnings arise from the vm handing us pages
without bufferheads attached to them. This should not ever occur
in XFS, but we don't defend against it properly if it does. The only
place where we remove bufferheads from a page is in
xfs_vm_releasepage(), but we can't tell the difference here between
"page is dirty so don't release" and "page is dirty but is being
invalidated so release it".
In some places that are invalidating pages ask for pages to be
released and follow up afterward calling ->releasepage by checking
whether the page was dirty and then aborting the invalidation. This
is a possible vector for releasing buffers from a page but then
leaving it in the mapping, so we really do need to avoid dirty pages
in xfs_vm_releasepage().
To differentiate between invalidated pages and normal pages, we need
to clear the page dirty flag when invalidating the pages. This can
be done through xfs_vm_invalidatepage(), and will result
xfs_vm_releasepage() seeing the page as clean which matches the
bufferhead state on the page after calling block_invalidatepage().
Hence we can re-add the page dirty check in xfs_vm_releasepage to
catch the case where we might be releasing a page that is actually
dirty and so should not have the bufferheads on it removed. This
will remove one possible vector of "dirty page with no bufferheads"
and so help narrow down the search for the root cause of that
problem.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit f5a73672d1 ("NFS: allow close-to-open cache semantics to
apply to root of NFS filesystem") added a call to
__nfs_revalidate_inode() to nfs_opendir to as the lookup
process wouldn't reliable do this.
Subsequent commit a3fbbde70a ("VFS: we need to set LOOKUP_JUMPED
on mountpoint crossing") make this unnecessary. So remove the
unnecessary code.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For correct close-to-open semantics, NFS must validate
the change attribute of a directory (or file) on open.
Since commit ecf3d1f1aa ("vfs: kill FS_REVAL_DOT by adding a
d_weak_revalidate dentry op"), open() of "." or a path ending ".." is
not revalidated reliably (except when that direct is a mount point).
Prior to that commit, "." was revalidated using nfs_lookup_revalidate()
which checks the LOOKUP_OPEN flag and forces revalidation if the flag is
set.
Since that commit, nfs_weak_revalidate() is used for NFSv3 (which
ignores the flags) and nothing is used for NFSv4.
This is fixed by using nfs_lookup_verify_inode() in
nfs_weak_revalidate(). This does the revalidation exactly when needed.
Also, add a definition of .d_weak_revalidate for NFSv4.
The incorrect behavior is easily demonstrated by running "echo *" in
some non-mountpoint NFS directory while watching network traffic.
Without this patch, "echo *" sometimes doesn't produce any traffic.
With the patch it always does.
Fixes: ecf3d1f1aa ("vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op")
cc: stable@vger.kernel.org (3.9+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The NFS_ACCESS_* flags aren't a 1:1 mapping to the MAY_* flags, so
checking for MAY_WHATEVER might have surprising results in
nfs*_proc_access(). Let's simplify this check when determining which
bits to ask for, and do it in a generic place instead of copying code
for each NFS version.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Passing the NFS v4 flags into the v3 code seems weird to me, even if
they are defined to the same values. This patch adds in generic flags
to help me feel better
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When session starts beyond offset 2^31 the arithmetics in
udf_check_vsd() would overflow. Make sure the computation is done in
large enough type.
Reported-by: Cezary Sliwa <sliwa@ifpan.edu.pl>
Signed-off-by: Jan Kara <jack@suse.cz>
Update documentation, pointer to latest tools, appoint myself as
maintainer. Given it's been unloved for so long, I don't expect anyone
will protest.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When cramfs in physical memory is used then we have the opportunity
to map files directly from ROM, directly into user space, saving on
RAM usage. This gives us Execute-In-Place (XIP) support.
For a file to be mmap()-able, the map area has to correspond to a range
of uncompressed and contiguous blocks, and in the MMU case it also has
to be page aligned. A version of mkcramfs with appropriate support is
necessary to create such a filesystem image.
In the MMU case it may happen for a vma structure to extend beyond the
actual file size. This is notably the case in binfmt_elf.c:elf_map().
Or the file's last block is shared with other files and cannot be mapped
as is. Rather than refusing to mmap it, we do a "mixed" map and let the
regular fault handler populate the unmapped area with RAM-backed pages.
In practice the unmapped area is seldom accessed so page faults might
never occur before this area is discarded.
In the non-MMU case it is the get_unmapped_area method that is responsible
for providing the address where the actual data can be found. No mapping
is necessary of course.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Two new capabilities are introduced here:
- The ability to store some blocks uncompressed.
- The ability to locate blocks anywhere.
Those capabilities can be used independently, but the combination
opens the possibility for execute-in-place (XIP) of program text segments
that must remain uncompressed, and in the MMU case, must have a specific
alignment. It is even possible to still have the writable data segments
from the same file compressed as they have to be copied into RAM anyway.
This is achieved by giving special meanings to some unused block pointer
bits while remaining compatible with legacy cramfs images.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Small embedded systems typically execute the kernel code in place (XIP)
directly from flash to save on precious RAM usage. This patch adds to
cramfs the ability to consume filesystem data directly from flash as
well. Cramfs is particularly well suited to this feature as it is very
simple with low RAM usage, and with this feature it is possible to use
it with no block device support and consequently even lower RAM usage.
This patch was inspired by a similar patch from Shane Nay dated 17 years
ago that used to be very popular in embedded circles but never made it
into mainline. This is a cleaned-up implementation that uses far fewer
ifdef's and gets the actual memory location for the filesystem image
via MTD at run time. In the context of small IoT deployments, this
functionality has become relevant and useful again.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
inode->i_private is assigned by a Node pointer only after registering a
new binary format, so it could be NULL if inode was created by
bm_fill_super() (or iput() was called by the error path in
bm_register_write()), and this could result in NULL pointer dereference
when evicting such an inode. e.g. mount binfmt_misc filesystem then
umount it immediately:
mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
umount /proc/sys/fs/binfmt_misc
will result in
BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
...
Call Trace:
evict+0xd3/0x1a0
iput+0x17d/0x1d0
dentry_unlink_inode+0xb9/0xf0
__dentry_kill+0xc7/0x170
shrink_dentry_list+0x122/0x280
shrink_dcache_parent+0x39/0x90
do_one_tree+0x12/0x40
shrink_dcache_for_umount+0x2d/0x90
generic_shutdown_super+0x1f/0x120
kill_litter_super+0x29/0x40
deactivate_locked_super+0x43/0x70
deactivate_super+0x45/0x60
cleanup_mnt+0x3f/0x70
__cleanup_mnt+0x12/0x20
task_work_run+0x86/0xa0
exit_to_usermode_loop+0x6d/0x99
syscall_return_slowpath+0xba/0xf0
entry_SYSCALL_64_fastpath+0xa3/0xa
Fix it by making sure Node (e) is not NULL.
Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
Fixes: 83f918274e ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we
call clean_buffers() after unlocking the page we've written. Introduce
a new clean_page_buffers() which cleans all buffers associated with a
page and call it from within bdev_write_page().
[akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix a stale kernel memory exposure when logging inodes.
- Fix some build problems with CONFIG_XFS_RT=n
- Don't change inode mode if the acl write fails, leaving the file totally
inaccessible.
- Fix a dangling pointer problem when removing an attr fork under memory
pressure.
- Don't crash while trying to invalidate a null buffer associated with a
corrupt metadata pointer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZ3lPiAAoJEPh/dxk0SrTrfuMP/Axy7VSX71tE/eXPOmzxCVZD
w4/usqO+OsQj+q8o+rwwuX9hz0VGF8kWZJOdgGdXpYT7pWqPmcf88wbThheTetLF
fjevusqva0Ds+U4AE7DCNWSKQQRhu2jDgnhQXTv1hdYhWIF59qGwioIijbEvb72I
0QW+/uV9yXmODjWL6KfRh9zRT9N4npMtszukScONwJr9t0/5ub8H03H/ktv8T9oi
C3ljEWwyMk5lEYH8p6tpta8EbY0mrIZgo+kj33PU5s9rHvcrTGtyPNqidREUm1fL
X3+STMytcDQFAcZdBBXHN0nFMwa8ADTrVvKmEgaR8OsXmOmrlcPn7HfVVlWrY31w
X3awJ0b0+IXUrsbbQOPeqgTo5hIkMDkMOga5AP/rqpx1yCCOrlMHaRPXB2NxNcVw
dyTj6IpKybhsQ4GkcqmFcgnxPPaogNpYlp6SXV5Dm+8zEJdIQNUuci/EGsNz7UcV
msxNlJJkxczXOew6JzCyw45wTnJCxduX7Y1xrOTLaDfa9pkWO2zQBXukCJNIqVIq
35Q4P4JVYtmwQr8XkkX9tiqU0gBWTCTG9KjmTCMm5MYkutEYM0uTNR5Jvyiobl7L
Nn+RydssVw7ssnNfgsLhzQHPElUivRdYoYFSBa2DQp6ViILrefqQegd5INAjK63W
7vnHVZyJMHPM0YFoiX8w
=6Yvh
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix a stale kernel memory exposure when logging inodes.
- Fix some build problems with CONFIG_XFS_RT=n
- Don't change inode mode if the acl write fails, leaving the file
totally inaccessible.
- Fix a dangling pointer problem when removing an attr fork under
memory pressure.
- Don't crash while trying to invalidate a null buffer associated with
a corrupt metadata pointer.
* tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: handle error if xfs_btree_get_bufs fails
xfs: reinit btree pointer on attr tree inactivation walk
xfs: Fix bool initialization/comparison
xfs: don't change inode mode if ACL update fails
xfs: move more RT specific code under CONFIG_XFS_RT
xfs: Don't log uninitialised fields in inode structures
Pull quota fix from Jan Kara:
"A fix for a regression in handling of quota grace times and warnings"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations
In eCryptfs, we failed to verify that the authentication token keys are
not revoked before dereferencing their payloads, which is problematic
because the payload of a revoked key is NULL. request_key() *does* skip
revoked keys, but there is still a window where the key can be revoked
before we acquire the key semaphore.
Fix it by updating ecryptfs_get_key_payload_data() to return
-EKEYREVOKED if the key payload is NULL. For completeness we check this
for "encrypted" keys as well as "user" keys, although encrypted keys
cannot be revoked currently.
Alternatively we could use key_validate(), but since we'll also need to
fix ecryptfs_get_key_payload_data() to validate the payload length, it
seems appropriate to just check the payload pointer.
Fixes: 237fead619 ("[PATCH] ecryptfs: fs/Makefile and fs/Kconfig")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v2.6.19+]
Cc: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When an fscrypt-encrypted file is opened, we request the file's master
key from the keyrings service as a logon key, then access its payload.
However, a revoked key has a NULL payload, and we failed to check for
this. request_key() *does* skip revoked keys, but there is still a
window where the key can be revoked before we acquire its semaphore.
Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.
Fixes: 88bd6ccdcd ("ext4 crypto: add encryption key management facilities")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v4.1+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When the file /proc/fs/fscache/objects (available with
CONFIG_FSCACHE_OBJECT_LIST=y) is opened, we request a user key with
description "fscache:objlist", then access its payload. However, a
revoked key has a NULL payload, and we failed to check for this.
request_key() *does* skip revoked keys, but there is still a window
where the key can be revoked before we access its payload.
Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.
Fixes: 4fbf4291aa ("FS-Cache: Allow the current state of all objects to be dumped")
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org> [v2.6.32+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
The following commit:
commit 9b7365fc1c ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR
interface support")
added several defines related to extended attributes to ext4.h. They were
added within an #ifndef FS_IOC_FSGETXATTR block with the comment:
/* Until the uapi changes get merged for project quota... */
Those uapi changes were merged by this commit:
commit 334e580a6f ("fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR
promotion")
so all the definitions needed by ext4 are available in
include/uapi/linux/fs.h. Remove the duplicates from ext4.h.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
This helper, in the spirit of ext4_should_dioread_nolock() et al., replaces
the complex conditional in ext4_set_inode_flags().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
We prevent DAX from being used on inodes which are using ext4's built in
encryption via a check in ext4_set_inode_flags(). We do have what appears
to be an unsafe transition of S_DAX in ext4_set_context(), though, where
S_DAX can get disabled without us doing a proper writeback + invalidate.
There are also issues with mm-level races when changing the value of S_DAX,
as well as issues with the VM_MIXEDMAP flag:
https://www.spinics.net/lists/linux-xfs/msg09859.html
I actually think we are safe in this case because of the following:
1) You can't encrypt an existing file. Encryption can only be set on an
empty directory, with new inodes in that directory being created with
encryption turned on, so I don't think it's possible to turn encryption on
for a file that has open DAX mmaps or outstanding I/Os.
2) There is no way to turn encryption off on a given file. Once an inode
is encrypted, it stays encrypted for the life of that inode, so we don't
have to worry about the case where we turn encryption off and S_DAX
suddenly turns on.
3) The only way we end up in ext4_set_context() to turn on encryption is
when we are creating a new file in the encrypted directory. This happens
as part of ext4_create() before the inode has been allowed to do any I/O.
Here's the call tree:
ext4_create()
__ext4_new_inode()
ext4_set_inode_flags() // sets S_DAX
fscrypt_inherit_context()
fscrypt_get_encryption_info();
ext4_set_context() // sets EXT4_INODE_ENCRYPT, clears S_DAX
So, I actually think it's safe to transition S_DAX in ext4_set_context()
without any locking, writebacks or invalidations. I've added a
WARN_ON_ONCE() sanity check to make sure that we are notified if we ever
encounter a case where we are encrypting an inode that already has data,
in which case we need to add code to safely transition S_DAX.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
The current code has the potential for data corruption when changing an
inode's journaling mode, as that can result in a subsequent unsafe change
in S_DAX.
I've captured an instance of this data corruption in the following fstest:
https://patchwork.kernel.org/patch/9948377/
Prevent this data corruption from happening by disallowing changes to the
journaling mode if the '-o dax' mount option was used. This means that for
a given filesystem we could have a mix of inodes using either DAX or
data journaling, but whatever state the inodes are in will be held for the
duration of the mount.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
If an inode has inline data it is currently prevented from using DAX by a
check in ext4_set_inode_flags(). When the inode grows inline data via
ext4_create_inline_data() or removes its inline data via
ext4_destroy_inline_data_nolock(), the value of S_DAX can change.
Currently these changes are unsafe because we don't hold off page faults
and I/O, write back dirty radix tree entries and invalidate all mappings.
There are also issues with mm-level races when changing the value of S_DAX,
as well as issues with the VM_MIXEDMAP flag:
https://www.spinics.net/lists/linux-xfs/msg09859.html
The unsafe transition of S_DAX can reliably cause data corruption, as shown
by the following fstest:
https://patchwork.kernel.org/patch/9948381/
Fix this issue by preventing the DAX mount option from being used on
filesystems that were created to support inline data. Inline data is an
option given to mkfs.ext4.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org
The elf_fdpic code shows a harmless warning when built with MMU disabled,
I ran into this now that fdpic is available on ARM randconfig builds
since commit 50b2b2e691 ("ARM: add ELF_FDPIC support").
fs/binfmt_elf_fdpic.c: In function 'elf_fdpic_dump_segments':
fs/binfmt_elf_fdpic.c:1501:17: error: unused variable 'addr' [-Werror=unused-variable]
This adds another #ifdef around the variable declaration to shut up
the warning.
Fixes: e6c1baa9b5 ("convert the rest of binfmt_elf_fdpic to dump_emit()")
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch fixes some spelling typos found in Kconfig files.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
1) Ignoring return value from iov_iter_zero() is wrong
for iovec-backed case as well as for pipes - it can fail.
2) Failure to fault destination pages in 25Mb into a 50Mb iovec
should not act as if nothing in the area had been read, nevermind
that the first 25Mb might have *already* been read by that point.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's only one caller of destroy_super() left now. Fold it there,
and replace those list_lru_destroy() calls with checks that they
had already been done (as they should have, when we were dropping
the last active reference).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Used for disposal of super_block instances that had never been reachable
via any shared data structures. No need for RCU delay in there, everything
can be called directly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jason reported that a corrupted filesystem failed to replay
the log with a metadata block out of bounds warning:
XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000
_xfs_buf_find() and xfs_btree_get_bufs() return NULL if
that happens, and then when xfs_alloc_fix_freelist() calls
xfs_trans_binval() on that NULL bp, we oops with:
BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
We don't handle _xfs_buf_find errors very well, every
caller higher up the stack gets to guess at why it failed.
But we should at least handle it somehow, so return
EFSCORRUPTED here.
Reported-by: Jason L Tibbitts III <tibbs@math.uh.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_attr3_root_inactive() walks the attr fork tree to invalidate the
associated blocks. xfs_attr3_node_inactive() recursively descends
from internal blocks to leaf blocks, caching block address values
along the way to revisit parent blocks, locate the next entry and
descend down that branch of the tree.
The code that attempts to reread the parent block is unsafe because
it assumes that the local xfs_da_node_entry pointer remains valid
after an xfs_trans_brelse() and re-read of the parent buffer. Under
heavy memory pressure, it is possible that the buffer has been
reclaimed and reallocated by the time the parent block is reread.
This means that 'btree' can point to an invalid memory address, lead
to a random/garbage value for child_fsb and cause the subsequent
read of the attr fork to go off the rails and return a NULL buffer
for an attr fork offset that is most likely not allocated.
Note that this problem can be manufactured by setting
XFS_ATTR_BTREE_REF to 0 to prevent LRU caching of attr buffers,
creating a file with a multi-level attr fork and removing it to
trigger inactivation.
To address this problem, reinit the node/btree pointers to the
parent buffer after it has been re-read. This ensures btree points
to a valid record and allows the walk to proceed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we get ENOSPC half way through setting the ACL, the inode mode
can still be changed even though the ACL does not exist. Reorder the
operation to only change the mode of the inode if the ACL is set
correctly.
Whilst this does not fix the problem with crash consistency (that requires
attribute addition to be a deferred op) it does prevent ENOSPC and other
non-fatal errors setting an xattr to be handled sanely.
This fixes xfstests generic/449.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Various utility functions and interfaces that iterate internal
devices try to reference the realtime device even when RT support is
not compiled into the kernel.
Make sure this code is excluded from the CONFIG_XFS_RT=n build,
and where appropriate stub functions to return fatal errors if
they ever get called when RT support is not present.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Prevent kmemcheck from throwing warnings about reading uninitialised
memory when formatting inodes into the incore log buffer. There are
several issues here - we don't always log all the fields in the
inode log format item, and we never log the inode the
di_next_unlinked field.
In the case of the inode log format item, this is exacerbated
by the old xfs_inode_log_format structure padding issue. Hence make
the padded, 64 bit aligned version of the structure the one we always
use for formatting the log and get rid of the 64 bit variant. This
means we'll always log the 64-bit version and so recovery only needs
to convert from the unpadded 32 bit version from older 32 bit
kernels.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit 77469c3f57 prevented setting the page as uptodate when we wrote
the right amount of data, fix that.
Fixes: 77469c3f57 ("9p: saner ->write_end() on failing copy into non-uptodate page")
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Alexander Levin <alexander.levin@verizon.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Fairly old DIO bug caught by Andreas (3.10+) and several slightly
younger blk_rq_map_user_iov() bugs, both on map and copy codepaths
(Vitaly and me)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
bio_copy_user_iov(): don't ignore ->iov_offset
more bio_map_user_iov() leak fixes
fix unbalanced page refcounting in bio_map_user_iov
direct-io: Prevent NULL pointer access in submit_page_section
match_int() used in mount option parsing can allocate memory using
GFP_KERNEL and thus sleep. Avoid parsing mount options with sbi->s_lock
held.
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of parsing mount options directly into the superblock (and
restoring options in case of error), parse the options into a dedicated
structure and only copy everything when we know we can safely switch
options. This will allow us to simplify locking and do option parsing
without holding sb->s_lock.
Signed-off-by: Jan Kara <jack@suse.cz>
In the code added to function submit_page_section by commit b1058b981,
sdio->bio can currently be NULL when calling dio_bio_submit. This then
leads to a NULL pointer access in dio_bio_submit, so check for a NULL
bio in submit_page_section before trying to submit it instead.
Fixes xfstest generic/250 on gfs2.
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
file. (I was weirdly flattered by the idea that lots of random people
suddenly seemed to think Jeff and I were VFS experts. Turns out it was
just a typo.)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZ3RyyAAoJECebzXlCjuG+JvQP/RkwFqMZJHDjhSDhj/cr/t2o
ciK5Xche1A4E5vaaPVV17w6OwIYTNhnQawwBtNw88GaqDUEELVyFZFzNtRm44Bv1
27RLOahPTT6bmHl/cd+uNpgpXs9svuNF6x4C5SUmKTm4kFdLBP7khjdcnFhwFi2y
OerDFj4XmPsUDqW8dv7a7XktRf1klMvhbRh80r9TR5JW+h4IYQIYNevue9CABpUm
4vvv4kAyxo8oodslCMQ5OyWpG4NDDsFADtlLn++9tzUl7y5j6TQyIYfeYDH3XOru
5Ara5pkuxloS1Fu4EtEInF3iLAjMZkJD+QgHFhf2/mLMzQhZZzpbnFYPhrgyQONv
wR3u7DaH2t/JbYtlSnKQpLEG0hv2hSBQ33G4ysKUHXrhnF5DC9N59epcA2X34++B
DSwyc2wgxNfr8OGPyaNNw/kcBJyahNvsxlpTxZfTnvc0p4M1dzr1mxl/zsGC2b3v
Ei1Y+u5JU2d/jmzeTOLCGtc59UyAoswdVzNa8SNYad1Tu5eAr81uooCPUvj77lTj
GWQa9wYSOxt+Ld295dtzagqx+hQFdVKa+QTzfaZuPHeuUWmhQLGgalWXCxlVKtuF
SGfAfutikQ4zbfAEz9PuNoThywfppiWbE74pfHRDkteL5+o2JQBLOSo6V6Ow0xV6
O4cOvwV5X/RExbOoZlx1
=yj7E
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fix from Bruce Fields:
"One fix for a 4.14 regression, and one minor fix to the MAINTAINERs
file. (I was weirdly flattered by the idea that lots of random people
suddenly seemed to think Jeff and I were VFS experts. Turns out it was
just a typo)"
* tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux:
nfsd4: define nfsd4_secinfo_no_name_release()
MAINTAINERS: associate linux/fs.h with VFS instead of file locking
If f2fs manages multiple devices, in checkpoint, we need to issue flush
in those devices which contain dirty data/node in their cache before
we write checkpoint region, otherwise, filesystem metadata could be
corrupted if hitting SPO after checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When multiple device feature is enabled, during ->fsync we will issue
flush in all devices to make sure node/data of the file being persisted
into storage. But some flushes of device could be unneeded as file's
data may be not writebacked into those devices. So this patch adds and
manage bitmap per inode in global cache to indicate which device is
dirty and it needs to issue flush during ->fsync, hence, we could improve
performance of fsync in scenario of multiple device.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to stat size of ino management cache with all type instead of
orphan ino type.
Fixes: 652be55162 ("f2fs: show # of orphan inodes")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we failed to issue flush in ->fsync, we need to keep FI_UPDATE_WRITE
flag to make sure triggering flush in next ->fsync.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Fan Li reported, there is no user traversing nid_list[ALLOC_NID_LIST]
which is used for tracking preallocated nids. Let's drop it, and only
track preallocated nids in free_nid_root radix-tree.
Reported-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In FI_NO_PREALLOC cases, direct I/O path may allocate blocks for an
inode but keep its inline data flag. This inconsistency may trigger
vfs clear_inode nrpages bug_on when evicting the inode. We should
convert inline data first in this case.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Keep in line with the other Linux file system implementations
since page_cache_sync_readahead supports NULL file pointer,
and thus we can readahead data by f2fs itself without file opening
(something like the btrfs behavior).
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit ba38c27eb9 ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_xattr_block to clean up redundant codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit ba38c27eb9 ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs duplicating from read_all_xattrs, which leaves
lots of similar codes in between them, so introduce new help
read_inline_xattr to clean up redundant codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 2683446646 ("f2fs: reuse nids more aggressively") tries to
reuse nids as many as possilbe, in order to mitigate producing obsolete
node pages in page cache.
But acutally, before we reuse the nids and related node page cache,
we will always invalidate that node page, so there will be not any
obsolete node pages in cache.
Let's just revert previous commit, so that nm_i::next_scan_nid can be
increased ascendingly, making __build_free_nids traverses all NAT pages
more easily, finally, free nid bitmap cache can be enabled as soon as
possible.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit b9cd20619e.
That patch causes much fewer node segments (which can be used for SSR)
than before, and in the corner case (e.g. create and delete *.txt files in
one same directory, there will be very few node segments but many data
segments), if the reserved free segments are all used up during gc, then
the write_checkpoint can still flush dentry pages to data ssr segments,
but will probably fail to flush node pages to node ssr segments, since
there are not enough node ssr segments left (the left ones are all
full).
So revert this patch to give a fair chance to let node segments remain
for SSR, which provides more robustness for corner cases.
Conflicts:
fs/f2fs/gc.c
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This contains one bug fix which causes a kernel panic during fstrim introduced
in 4.14-rc1.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlnc+awACgkQQBSofoJI
UNIQLg/9HB/NikmBxVtkDtwrTKpVEPK5AYRHOvoa9k6twGkU6pB8FE0cd2PstwlZ
tAwRstyt8W9nGzF5BPY+WAyVs9ybc26wIqNo13cnzwXbc0/cc4pTy8lzeiFQdQrK
JIzz2lHNt0b5euCsEEAsnwK+rTb5DPUMKm8JkBUQ8f94oxIHLWvg7Um9FBppTw7s
JNOJ8/ymzQVNlWu7VxFaVwfUPbEhK7gtpSWjO65fiprQ0JjwXLEr65356XU2XW8x
lhQkByPMfMv1ZyGSNr3m4Hih0M6250slNHzwrZDxTdH7NDJmy1DfcPiM+epMWZMa
4uT+2hsxhTCqDQbIEvP9jv+KVHV7AG9ldCD04a0RD+XoNKDVLKlzSMFWVcWE/d0H
jSaDrMZj+taseF72x/efP8P/RrTbzqYsqBoAkoByibOXvBf7U8vsLK4NuG7agoL4
EUXDMuVJDB5d8LJRSYt0lPI5R+lhRVlVuint7a9T09yiLyCeR0wGf+eoH9C9Y4V8
t/mEM9azBi9l7T0yraVfqnh+SPzwwlxYOLQeZTi0bf3uqmBOeKb0OvfOiwboOnaZ
5Rl6jYD/hgZAowXpbohRjqPJhMoLMabsTJ4kHj6uJcQDhvTqDpamm9g9Afsiyr6z
xPYo09iHHlWA/iSiV7VSnbZu8hr59bchVt86r77fy/4YH3DXOcM=
=fAsG
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fix from Jaegeuk Kim:
"This contains one bug fix which causes a kernel panic during fstrim
introduced in 4.14-rc1"
* tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix potential panic during fstrim
Eryu has reported that since commit 7b9ca4c61b "quota: Reduce
contention on dq_data_lock" test generic/233 occasionally fails. This is
caused by the fact that since that commit we don't generate warning and
set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set
(these are for example some metadata allocations in ext4). We need these
allocations to behave regularly wrt warning generation and grace time
setting so fix the code to return to the original behavior.
Reported-and-tested-by: Eryu Guan <eguan@redhat.com>
CC: stable@vger.kernel.org
Fixes: 7b9ca4c61b
Signed-off-by: Jan Kara <jack@suse.cz>
Since commit 925a6efb8f ("Btrfs: stop using
try_to_writeback_inodes_sb_nr to flush delalloc") this function hasn't
been used outside so stop exporting it.
In addition we merge it into try_to_writeback_inodes_sb() which is the
only caller. Also change return type of try_to_writeback_inodes_sb to
void as the only user ext4 doesn't care.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The fanotify interface allows user space daemons to make access
control decisions. Under common criteria requirements, we need to
optionally record decisions based on policy. This patch adds a bit mask,
FAN_AUDIT, that a user space daemon can 'or' into the response decision
which will tell the kernel that it made a decision and record it.
It would be used something like this in user space code:
response.response = FAN_DENY | FAN_AUDIT;
write(fd, &response, sizeof(struct fanotify_response));
When the syscall ends, the audit system will record the decision as a
AUDIT_FANOTIFY auxiliary record to denote that the reason this event
occurred is the result of an access control decision from fanotify
rather than DAC or MAC policy.
A sample event looks like this:
type=PATH msg=audit(1504310584.332:290): item=0 name="./evil-ls"
inode=1319561 dev=fc:03 mode=0100755 ouid=1000 ogid=1000 rdev=00:00
obj=unconfined_u:object_r:user_home_t:s0 nametype=NORMAL
type=CWD msg=audit(1504310584.332:290): cwd="/home/sgrubb"
type=SYSCALL msg=audit(1504310584.332:290): arch=c000003e syscall=2
success=no exit=-1 a0=32cb3fca90 a1=0 a2=43 a3=8 items=1 ppid=901
pid=959 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000
fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts1 ses=3 comm="bash"
exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:
s0-s0:c0.c1023 key=(null)
type=FANOTIFY msg=audit(1504310584.332:290): resp=2
Prior to using the audit flag, the developer needs to call
fanotify_init or'ing in FAN_ENABLE_AUDIT to ensure that the kernel
supports auditing. The calling process must also have the CAP_AUDIT_WRITE
capability.
Signed-off-by: sgrubb <sgrubb@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Steve requested better names for the new task-state helper functions.
So introduce the concept of task-state index for the printing and
rename __get_task_state() to task_state_index() and
__task_state_to_char() to task_index_to_char().
Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170929115016.pzlqc7ss3ccystyg@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This function was only for debugging. It would be
called in a condition that should not happen, and
should probably have been removed from the final
version of the original commit.
Remove it because it does mutex lock under spin lock.
Signed-off-by: David Teigland <teigland@redhat.com>
If there are pending writes subject to delayed allocation, then i_size
will show size after the writes have completed, while i_disksize
contains the value of i_size on the disk (since the writes have not
been persisted to disk).
If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
after the fallocate(2) is between i_size and i_disksize, then after a
crash, if a journal commit has resulted in the changes made by the
fallocate() call to be persisted after a crash, but the delayed
allocation write has not resolved itself, i_size would not be updated,
and this would cause the following e2fsck complaint:
Inode 12, end of extent exceeds allowed value
(logical block 33, physical block 33441, len 7)
This can only take place on a sparse file, where the fallocate(2) call
is allocating blocks in a range which is before a pending delayed
allocation write which is extending i_size. Since this situation is
quite rare, and the window in which the crash must take place is
typically < 30 seconds, in practice this condition will rarely happen.
Nevertheless, it can be triggered in testing, and in particular by
xfstests generic/456.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org
Pull btrfs fixes from David Sterba:
"Two more fixes for bugs introduced in 4.13.
The sector_t problem with 32bit architecture and !LBDAF config seems
serious but the number of affected deployments is hopefully low.
The clashing status bits could lead to a confusing in-memory state of
the whole-filesystem operations if used with the quota override sysfs
knob"
* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix overlap of fs_info::flags values
btrfs: avoid overflow when sector_t is 32 bit
Pull overlayfs fixes from Miklos Szeredi:
"Fix a regression in 4.14 and one in 4.13. The latter is a case when
Docker is doing something it really shouldn't and gets away with it.
We now print a warning instead of erroring out.
There are also fixes to several error paths"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix regression caused by exclusive upper/work dir protection
ovl: fix missing unlock_rename() in ovl_do_copy_up()
ovl: fix dentry leak in ovl_indexdir_cleanup()
ovl: fix dput() of ERR_PTR in ovl_cleanup_index()
ovl: fix error value printed in ovl_lookup_index()
ovl: fix may_write_real() for overlayfs directories
Commit 34b1744c91 ("nfsd4: define ->op_release for compound ops")
defined a couple ->op_release functions and run them if necessary.
But there's a problem with that is that it reused
nfsd4_secinfo_release() as the op_release of OP_SECINFO_NO_NAME, and
caused a leak on struct nfsd4_secinfo_no_name in
nfsd4_encode_secinfo_no_name(), because there's no .si_exp field in
struct nfsd4_secinfo_no_name.
I found this because I was unable to umount an ext4 partition after
exporting it via NFS & run fsstress on the nfs mount. A simplified
reproducer would be:
# mount a local-fs device at /mnt/test, and export it via NFS with
# fsid=0 export option (this is required)
mount /dev/sda5 /mnt/test
echo "/mnt/test *(rw,no_root_squash,fsid=0)" >> /etc/exports
service nfs restart
# locally mount the nfs export with all default, note that I have
# nfsv4.1 configured as the default nfs version, because of the
# fsid export option, v4 mount would fail and fall back to v3
mount localhost:/mnt/test /mnt/nfs
# try to umount the underlying device, but got EBUSY
umount /mnt/nfs
service nfs stop
umount /mnt/test <=== EBUSY here
Fixed it by defining a separate nfsd4_secinfo_no_name_release()
function as the op_release method of OP_SECINFO_NO_NAME that
releases the correct nfsd4_secinfo_no_name structure.
Fixes: 34b1744c91 ("nfsd4: define ->op_release for compound ops")
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Using the ARRAY_SIZE macro improves the readability of the code.
Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
(sizeof(E)@p /sizeof(*E))
|
(sizeof(E)@p /sizeof(E[...]))
|
(sizeof(E)@p /sizeof(T))
)
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.
Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:
Terminal 1:
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
Terminal 2:
unshare -m
Terminal 1:
umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
mount: /root/overlay-testing/merged: none already mounted or mount point
busy
To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.
Documentation was updated to reflect this change.
Fixes: 2cac0c00a6 ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Euan Kemp <euank@euank.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.
Fixes: ("fd210b7d67ee ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
index dentry was not released when breaking out of the loop
due to index verification error.
Fixes: 415543d5c6 ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.
This fixes a regression in xfstest generic/158.
Fixes: 7c6893e3c9 ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of granting client's full requests until we hit our DRC size
limit and then failing CREATE_SESSIONs (and hence mounts) completely,
start granting clients smaller slot tables as we approach the limit.
The factor chosen here is pretty much arbitrary.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
An NFSv4.1+ client negotiates the size of its duplicate reply cache size
in the initial CREATE_SESSION request. The server preallocates the
memory for the duplicate reply cache to ensure that we'll never fail to
record the response to a nonidempotent operation.
To prevent a few CREATE_SESSIONs from consuming all of memory we set an
upper limit based on nr_free_buffer_pages(). 1/2^10 has been too
limiting in practice; 1/2^7 is still less than one percent.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
These checks should have already be done centrally in
nfsd4_proc_compound, the checks in each individual operation are
unnecessary.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since we can now use a lock stateid or a delegation stateid, that
differs from the context stateid, we need to change the test in
nfs4_layoutget_handle_exception() to take this into account.
This fixes an infinite layoutget loop in the NFS client whereby
it keeps retrying the initial layoutget using the same broken
stateid.
Fixes: 70d2f7b1ea ("pNFS: Use the standard I/O stateid when...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Handle start-all writeback like we do periodic or kupdate
style writeback - by marking the bdi_writeback as needing a full
flush, and simply waking the thread. This eliminates the need to
allocate and queue a specific work item just for this purpose.
After this change, we truly only ever have one of them running at
any point in time. We mark the need to start all flushes, and the
writeback thread will clear it once it has processed the request.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc fixes from Andrew Morton:
"A lot of stuff, sorry about that. A week on a beach, then a bunch of
time catching up then more time letting it bake in -next. Shan't do
that again!"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (51 commits)
include/linux/fs.h: fix comment about struct address_space
checkpatch: fix ignoring cover-letter logic
m32r: fix build failure
lib/ratelimit.c: use deferred printk() version
kernel/params.c: improve STANDARD_PARAM_DEF readability
kernel/params.c: fix an overflow in param_attr_show
kernel/params.c: fix the maximum length in param_get_string
mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
kernel/kcmp.c: drop branch leftover typo
memremap: add scheduling point to devm_memremap_pages
mm, page_alloc: add scheduling point to memmap_init_zone
mm, memory_hotplug: add scheduling point to __add_pages
lib/idr.c: fix comment for idr_replace()
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
lib/lz4: make arrays static const, reduces object code size
exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
...
Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.
First, BTRFS_FS_EXCL_OP was set to 14.
commit 171938e528 ("btrfs: track exclusive filesystem operation in flags")
Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.
commit f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.
Fixes: f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba <dsterba@suse.com>
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.
In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).
I added a cast to u64 to avoid overflow.
Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
security_inode_getsecurity() provides the text string value
of a security attribute. It does not provide a "secctx".
The code in xattr_getsecurity() that calls security_inode_getsecurity()
and then calls security_release_secctx() happened to work because
SElinux and Smack treat the attribute and the secctx the same way.
It fails for cap_inode_getsecurity(), because that module has no
secctx that ever needs releasing. It turns out that Smack is the
one that's doing things wrong by not allocating memory when instructed
to do so by the "alloc" parameter.
The fix is simple enough. Change the security_release_secctx() to
kfree() because it isn't a secctx being returned by
security_inode_getsecurity(). Change Smack to allocate the string when
told to do so.
Note: this also fixes memory leaks for LSMs which implement
inode_getsecurity but not release_secctx, such as capabilities.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert. Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since the CoW fork exists as a secondary data structure to the data
fork, we must always swap cow forks during swapext. We also need to
swap the extent counts and reset the cowblocks tags.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
After the previous change "fmt" can't go away, we can kill
iname/iname_addr and use fmt->interpreter.
Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_misc_binary() makes a local copy of fmt->interpreter under
entries_lock to avoid the race with kill_node() but this is not enough;
the whole Node can be freed after we drop entries_lock, not only the
->interpreter string.
Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
this Node.
Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: <tdhooge@llnl.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
have a bug which should not be silently ignored.
Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To ensure that load_misc_binary() can't use the partially destroyed
Node, see also the next patch.
The current logic looks wrong in any case, once we close interp_file it
doesn't make any sense to delay kfree(inode->i_private), this Node is no
longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
not racy (they are), load_misc_binary() should not try to reopen
->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.
And I can't understand why do we use filp_close(), not fput().
Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kill_node() nullifies/checks Node->dentry to avoid double free. This
complicates the next changes and this is very confusing:
- we do not need to check dentry != NULL under entries_lock,
kill_node() is always called under inode_lock(d_inode(root)) and we
rely on this inode_lock() anyway, without this lock the
MISC_FMT_OPEN_FILE cleanup could race with itself.
- if kill_inode() was already called and ->dentry == NULL we should not
even try to close e->interp_file.
We can change bm_entry_write() to simply check !list_empty(list) before
kill_node. Again, we rely on inode_lock(), in particular it saves us
from the race with bm_status_write(), another caller of kill_node().
Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Woodard <woodard@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Travis Gummels <tgummels@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "exec: binfmt_misc: fix use-after-free, kill
iname[BINPRM_BUF_SIZE]".
It looks like this code was always wrong, then commit 948b701a60
("binfmt_misc: add persistent opened binary handler for containers")
added more problems.
This patch (of 6):
load_script() can simply use i_name instead, it points into bprm->buf[]
and nobody can change this memory until we call prepare_binprm().
The only complication is that we need to also change the signature of
bprm_change_interp() but this change looks good too.
While at it, do whitespace/style cleanups.
NOTE: the real motivation for this change is that people want to
increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
this looks more complicated because afaics it is very buggy.
Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading the event from the uffd, we put it on a temporary
fork_event list to detect if we can still access it after releasing and
retaking the event_wqh.lock.
If fork aborts and removes the event from the fork_event all is fine as
long as we're still in the userfault read context and fork_event head is
still alive.
We've to put the event allocated in the fork kernel stack, back from
fork_event list-head to the event_wqh head, before returning from
userfaultfd_ctx_read, because the fork_event head lifetime is limited to
the userfaultfd_ctx_read stack lifetime.
Forgetting to move the event back to its event_wqh place then results in
__remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
userfaultfd_event_wait_completion to remove it from a head that has been
already freed from the reader stack.
This could only happen if resolve_userfault_fork failed (for example if
there are no file descriptors available to allocate the fork uffd). If
it succeeded it was put back correctly.
Furthermore, after find_userfault_evt receives a fork event, the forked
userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
be released by the fork thread as soon as the event_wqh.lock is
released. Taking a reference on the fork_nctx before dropping the lock
prevents an use after free in resolve_userfault_fork().
If the fork side aborted and it already released everything, we still
try to succeed resolve_userfault_fork(), if possible.
Fixes: 893e26e61d ("userfaultfd: non-cooperative: Add fork() event")
Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Ju Hyung Park reported:
"When 'fstrim' is called for manual trim, a BUG() can be triggered
randomly with this patch.
I'm seeing this issue on both x86 Desktop and arm64 Android phone.
On x86 Desktop, this was caused during Ubuntu boot-up. I have a
cronjob installed which calls 'fstrim -v /' during boot. On arm64
Android, this was caused during GC looping with 1ms gc_min_sleep_time
& gc_max_sleep_time."
Root cause of this issue is that f2fs_wait_discard_bios can only be
used by f2fs_put_super, because during put_super there must be no
other referrers, so it can ignore discard entry's reference count
when removing the entry, otherwise in other caller we will hit bug_on
in __remove_discard_cmd as there may be other issuer added reference
count in discard entry.
Thread A Thread B
- issue_discard_thread
- f2fs_ioc_fitrim
- f2fs_trim_fs
- f2fs_wait_discard_bios
- __issue_discard_cmd
- __submit_discard_cmd
- __wait_discard_cmd
- dc->ref++
- __wait_one_discard_bio
- __wait_discard_cmd
- __remove_discard_cmd
- f2fs_bug_on(sbi, dc->ref)
Fixes: 969d1b180d
Reported-by: Ju Hyung Park <qkrwngud825@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:
1) We consume a ton of memory just allocating writeback work items.
We've seen as much as 600 million of these writeback work items
pending. That's a lot of memory to pointlessly hold hostage,
while the box is under memory pressure.
2) We spend so much time processing these work items, that we
introduce a softlockup in writeback processing. This is because
each of the writeback work items don't end up doing any work (it's
hard when you have millions of identical ones coming in to the
flush machinery), so we just sit in a tight loop pulling work
items and deleting/freeing them.
Fix this by adding a 'start_all' bit to the writeback structure, and
set that when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, then we
simply ignore it.
This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have no external callers of wb_start_writeback(), we
can shuffle the passing in of 'nr_pages'. Everybody passes in 0
at this point, so just kill the argument and move the dirty
count retrieval to that function.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't have any callers outside of fs-writeback.c anymore,
make it private.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similar to wakeup_flusher_threads(), except that we only wake
up the flusher threads on the specified backing device.
No functional changes in this patch.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're writing back the full range of dirty pages on the devices,
there's no point in making this special and not do normal range
cyclic writeback.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Everybody is passing in 0 now, let's get rid of the argument.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.
Since this is also the last user of free_more_memory(), kill
it off completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use it for find_or_create_page(), which means that it
cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensure that it cannot fail.
After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This series provides the needed changes to suport the ELF_FDPIC binary
format on ARM. Both MMU and non-MMU systems are supported. This format
has many advantages over the BFLT format used on MMU-less systems, such
as being real ELF that can be parsed by standard tools, can support
shared dynamic libs, etc.
previous commit 5d37ca14 "ceph: send LSSNAP request to auth mds
of directory inode" is buggy. It makes __choose_mds() choose mds
base on hash of '.snap' dentry.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
commit 3ae0bebc "ceph: queue cap snap only when snap realm's
context changes" introduced a regression: we may not call
queue_realm_cap_snaps() for newly created snap realm. This
regression allows unflushed snapshot data to be overwritten.
Link: http://tracker.ceph.com/issues/21483
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Michael Sterrett reports a NULL pointer dereference on NFSv3 mounts when
CONFIG_NFS_V4 is not set because the NFS UOC rpc_wait_queue has not been
initialized. Move the initialization of the queue out of the CONFIG_NFS_V4
conditional setion.
Fixes: 7d6ddf88c4 ("NFS: Add an iocounter wait function for async RPC tasks")
Cc: stable@vger.kernel.org # 4.11+
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_idmap_get_desc() can't actually return zero. But if it did then
we would return ERR_PTR(0) which is NULL and the caller,
nfs_idmap_get_key(), doesn't expect that so it leads to a NULL pointer
dereference.
I've cleaned this up by changing the "<=" to "<" so it's more clear that
we don't return ERR_PTR(0).
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The units of RPC_MAX_AUTH_SIZE is bytes, not 4-byte words. This causes
the client to request a larger-than-necessary session replay slot size.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Now that we no longer try to reserve metadata blocks for delayed
allocations (which tended to overestimate the required number of
blocks significantly), we really don't need retry allocations when the
disk is very full as aggressively any more.
The only time when it makes sense to retry an allocation is if we have
freshly deleted blocks that will only become available after a
transaction commit. And if we lose that race, it's not worth it to
try more than once.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Switch to the iomap_seek_hole and iomap_seek_data helpers for
implementing lseek SEEK_HOLE / SEEK_DATA, and remove all the code that
isn't needed any more.
Note that with this patch ext4 will now always depend on the iomap code
instead of only when CONFIG_DAX is enabled, and it requires adding a
call into the extent status tree for iomap_begin as well to properly
deal with delalloc extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
[More fixes and cleanups by Andreas]
Report inline data as a IOMAP_F_DATA_INLINE mapping. This allows to use
iomap_seek_hole and iomap_seek_data in ext4_llseek and makes switching
to iomap_fiemap in ext4_fiemap easier.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Add a new IOMAP_F_DATA_INLINE flag to indicate that a mapping is in a
disk area that contains data as well as metadata. In iomap_fiemap, map
this flag to FIEMAP_EXTENT_DATA_INLINE.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Replace iomap->blkno, the sector number, with iomap->addr, the disk
offset in bytes. For invalid disk offsets, use the special value
IOMAP_NULL_ADDR instead of IOMAP_NULL_BLOCK.
This allows to use iomap for mappings which are not block aligned, such
as inline data on ext4.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> # iomap, xfs
Reviewed-by: Jan Kara <jack@suse.cz>
Pull scheduler fixes from Thomas Gleixner:
"The scheduler pull request comes with the following updates:
- Prevent a divide by zero issue by validating the input value of
sysctl_sched_time_avg
- Make task state printing consistent all over the place and have
explicit state characters for IDLE and PARKED so they wont be
displayed as 'D' state which confuses tools"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/sysctl: Check user input value of sysctl_sched_time_avg
sched/debug: Add explicit TASK_PARKED printing
sched/debug: Ignore TASK_IDLE for SysRq-W
sched/debug: Add explicit TASK_IDLE printing
sched/tracing: Use common task-state helpers
sched/tracing: Fix trace_sched_switch task-state printing
sched/debug: Remove unused variable
sched/debug: Convert TASK_state to hex
sched/debug: Implement consistent task-state printing
Pull btrfs fixes from David Sterba:
"We've collected a bunch of isolated fixes, for crashes, user-visible
behaviour or missing bits from other subsystem cleanups from the past.
The overall number is not small but I was not able to make it
significantly smaller. Most of the patches are supposed to go to
stable"
* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: log csums for all modified extents
Btrfs: fix unexpected result when dio reading corrupted blocks
btrfs: Report error on removing qgroup if del_qgroup_item fails
Btrfs: skip checksum when reading compressed data if some IO have failed
Btrfs: fix kernel oops while reading compressed data
Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
Btrfs: do not backup tree roots when fsync
btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs: prevent to set invalid default subvolid
Btrfs: send: fix error number for unknown inode types
btrfs: fix NULL pointer dereference from free_reloc_roots()
btrfs: finish ordered extent cleaning if no progress is found
btrfs: clear ordered flag on cleaning up ordered extents
Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
Btrfs: do not reset bio->bi_ops while writing bio
Btrfs: use the new helper wbc_to_write_flags
Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
its own print state because it will not in fact get woken by regular
wakeups and is a long-term state.
This requires moving TASK_PARKED into the TASK_REPORT mask, and since
that latter needs to be a contiguous bitmask, we need to shuffle the
bits around a bit.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Markus reported that kthreads that idle using TASK_IDLE instead of
TASK_INTERRUPTIBLE are reported in as TASK_UNINTERRUPTIBLE and things
like htop mark those red.
This is undesirable, so add an explicit state for TASK_IDLE.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently get_task_state() and task_state_to_char() report different
states, create a number of common helpers and unify the reported state
space.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- fix various problems with the copy-on-write extent maps getting freed
at the wrong time
- fix printk format specifier problems
- report zeroing operation outcomes instead of dropping them on the
floor
- fix some crashes when dio operations partially fail
- fix a race condition between unwritten extent conversion & dio read
- fix some incorrect tests in the inode log item processing
- correct the delayed allocation space reservations on rmap filesystems
- fix some problems checking for dax support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZypYxAAoJEPh/dxk0SrTrJ3YQAJFWUCp194an+yuvgOY+MuyL
PG/vAA3DyJjYbwIsqUE//dlp9nrarccAXcxPITWlLdGZ//qHbXO2MguO3KIQ4iG8
qmsA+tXetVoYZYxYZLQ0KjX/XJTaAXY64xKTFxMMTTKUoxPygJRUF/FPfFFcTtaq
Q/ULikS5mhtW7/mQCfXBvtqM5ZD61A9vQRjDL5jRdrDbz49TQqtskp/7F6SEHLxU
fTCGhN7Ys4MQ4fmtUc+EUh0LPX8oAKIIKiGz3zUqrk/FgNYI2NqnTYvflfN8L9UE
t+k+4CGrON+dzrau4HrvZaYbfIPhRaJUM4QzFcDIPoaBZOt6DpBI0dEKm9FD7Hw/
vUvBs0M9asqYycH3PopFHugF+SxW8g7g+5TD8S9rg3j33PZahSNm3gt5gYb1Kiij
3TZPirst6OeQuEjWX6L5LAruAtqtEXtHL7o4dGn5LdQkJ0EIdKXMd9YGz0F/trTK
Grqf2Mep/Q8nccMTksaj94X5AhmM4znYmbAnbS/+QfYTgLk92GJltxoKTB6roW/N
fJ5azjyzGsr4BWdgakK3aA9glaQWGh3PY8Up2VLeEdjwcy3zyscnpZP2PSvt+l9X
pmMDpMTvQD0E6e5246itB69Il1NXTEoG/t9Hlx/2x9g0R2hjK6CRXXrwPnz9zYkI
7wFz5B5LmJ27vFGTCxo5
=7ptY
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fix various problems with the copy-on-write extent maps getting freed
at the wrong time
- fix printk format specifier problems
- report zeroing operation outcomes instead of dropping them on the
floor
- fix some crashes when dio operations partially fail
- fix a race condition between unwritten extent conversion & dio read
- fix some incorrect tests in the inode log item processing
- correct the delayed allocation space reservations on rmap filesystems
- fix some problems checking for dax support
* tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: revert "xfs: factor rmap btree size into the indlen calculations"
xfs: Capture state of the right inode in xfs_iflush_done
xfs: perag initialization should only touch m_ag_max_usable for AG 0
xfs: update i_size after unwritten conversion in dio completion
iomap_dio_rw: Allocate AIO completion queue before submitting dio
xfs: validate bdev support for DAX inode flag
xfs: remove redundant re-initialization of total_nr_pages
xfs: Output warning message when discard option was enabled even though the device does not support discard
xfs: report zeroed or not correctly in xfs_zero_range()
xfs: kill meaningless variable 'zero'
fs/xfs: Use %pS printk format for direct addresses
xfs: evict CoW fork extents when performing finsert/fcollapse
xfs: don't unconditionally clear the reflink flag on zero-block files
Pull quota and isofs fixes from Jan Kara:
"Two quota fixes (fallout of the quota locking changes) and an isofs
build fix"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Fix quota corruption with generic/232 test
isofs: fix build regression
quota: add missing lock into __dquot_transfer()
Eric has reported that since commit d2faa41516 "quota: Do not acquire
dqio_sem for dquot overwrites in v2 format" test generic/232
occasionally fails due to quota information being incorrect. Indeed that
commit was too eager to remove dqio_sem completely from the path that
just overwrites quota structure with updated information. Although that
is innocent on its own, another process that inserts new quota structure
to the same block can perform read-modify-write cycle of that block thus
effectively discarding quota information update if they race in a wrong
way.
Fix the problem by acquiring dqio_sem for reading for overwrites of
quota structure. Note that it *is* possible to completely avoid taking
dqio_sem in the overwrite path however that will require modifying path
inserting / deleting quota structures to avoid RMW cycles of the full
block and for now it is not clear whether it is worth the hassle.
Fixes: d2faa41516
Reported-and-tested-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In generic_file_llseek_size, return -ENXIO for negative offsets as well
as offsets beyond EOF. This affects filesystems which don't implement
SEEK_HOLE / SEEK_DATA internally, possibly because they don't support
holes.
Fixes xfstest generic/448.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit fd26a88093 we added a worst case estimate for rmapbt blocks
needed to satisfy the block mapping request. Since then, we added the
ability to reserve enough space in each AG such that we should never run
out of blocks to grow the rmapbt, which makes this calculation
unnecessary. Revert the commit because it makes the extra delalloc
indlen accounting unnecessary and incorrect.
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
My previous patch: d3a304b629 check for
XFS_LI_FAILED flag xfs_iflush done, so the failed item can be properly
resubmitted.
In the loop scanning other inodes being completed, it should check the
current item for the XFS_LI_FAILED, and not the initial one.
The state of the initial inode is checked after the loop ends
Kudos to Eric for catching this.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We call __xfs_ag_resv_init to make a per-AG reservation for each AG.
This makes the reservation per-AG, not per-filesystem. Therefore, it
is incorrect to adjust m_ag_max_usable for each AG. Adjust it only
when we're reserving AG 0's blocks so that we only do it once per fs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Since commit d531d91d69 ("xfs: always use unwritten extents for
direct I/O writes"), we start allocating unwritten extents for all
direct writes to allow appending aio in XFS.
But for dio writes that could extend file size we update the in-core
inode size first, then convert the unwritten extents to real
allocations at dio completion time in xfs_dio_write_end_io(). Thus a
racing direct read could see the new i_size and find the unwritten
extents first and read zeros instead of actual data, if the direct
writer also takes a shared iolock.
Fix it by updating the in-core inode size after the unwritten extent
conversion. To do this, introduce a new boolean argument to
xfs_iomap_write_unwritten() to tell if we want to update in-core
i_size or not.
Suggested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Executing xfs/104 test in a loop on Linux-v4.13 kernel on a ppc64
machine can cause the following NULL pointer dereference,
.queue_work_on+0x4c/0x80
.iomap_dio_bio_end_io+0xbc/0x1f0
.bio_endio+0x118/0x1f0
.blk_update_request+0xd0/0x470
.blk_mq_end_request+0x24/0xc0
.lo_complete_rq+0x40/0xe0
.__blk_mq_complete_request_remote+0x28/0x40
.flush_smp_call_function_queue+0xc4/0x1e0
.smp_ipi_demux_relaxed+0x8c/0x100
.icp_hv_ipi_action+0x54/0xa0
.__handle_irq_event_percpu+0x84/0x2c0
.handle_irq_event_percpu+0x28/0x80
.handle_percpu_irq+0x78/0xc0
.generic_handle_irq+0x40/0x70
.__do_irq+0x88/0x200
.call_do_irq+0x14/0x24
.do_IRQ+0x84/0x130
This occurs due to the following sequence of events,
1. Allocate dio for Direct I/O write.
2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
- Assume that we have submitted atleast one bio. Hence iomap_dio->ref value
will be >= 2.
- If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
break out of the loop and since the 'ret' value is a negative number we
end up not allocating memory for super_block->s_dio_done_wq.
3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
submitted and here the code ends up dereferencing the NULL pointer stored
at super_block->s_dio_done_wq.
This commit fixes the bug by allocating memory for
super_block->s_dio_done_wq before iomap_apply() is invoked.
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently only the blocksize is checked, but we should really be calling
bdev_dax_supported() which also tests to make sure we can get a
struct dax_device and that the dax_direct_access() path is working.
This is the same check that we do for the "-o dax" mount option in
xfs_fs_fill_super().
This does not fix the race issues that caused the XFS DAX inode option to
be disabled, so that option will still be disabled. If/when we re-enable
it, though, I think we will want this issue to have been fixed. I also do
think that we want to fix this in stable kernels.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since __state_in_grace return only true/false, make it return bool
instead of int.
Same change for the two user of it, locks_in_grace/opens_in_grace
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit c87fb4a378 ("lockd: NLM grace period shouldn't block NFSv4 opens")
made the locks_in_grace() comment be in the wrong place.
This patch move this comment just at the right place.
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fix the following warning
fs/nfs_common/grace.c:66:1: warning: no previous prototype for function '__state_in_grace' [-Wmissing-prototypes]
by adding the missing static.
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Amir reported a bug discovered by his cleaned up version of my
dm-log-writes xfstests where we were missing csums at certain replay
points. This is because fsx was doing an msync(), which essentially
fsync()'s a specific range of a file. We will log all modified extents,
but only search for the checksums in the range we are being asked to
sync. We cannot simply log the extents in the range we're being asked
because we are logging the inode item as it is currently, which if it
has had a i_size update before the msync means we will miss extents when
replaying. We could possibly get around this by marking the inode with
the transaction that extended the i_size to see if we have this case,
but this would be racy and we'd have to lock the whole range of the
inode to make sure we didn't have an ordered extent outside of our range
that was in the middle of completing.
Fix this simply by keeping track of the modified extents range and
logging the csums for the entire range of extents that we are logging.
This makes the xfstest pass.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
commit 4246a0b63b ("block: add a bi_error field to struct bio")
changed the logic of how dio read endio reports errors.
For single stripe dio read, %bio->bi_status reflects the error before
verifying checksum, and now we're updating it when data block matches
with its checksum, while in the mismatching case, %bio->bi_status is
not updated to relfect that.
When some blocks in a file have been corrupted on disk, reading such a
file ends up with
1) checksum errors are reported in kernel log
2) read(2) returns successfully with some content being 0x01.
In order to fix it, we need to report its checksum mismatch error to
the upper layer (dio layer in this case) as well.
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously, we were calling del_qgroup_item, and ignoring the return code
resulting in a potential to have divergent in-memory state without an
error. Perhaps, it makes sense to handle this error code, and put the
filesystem into a read only, or similar state.
This patch only adds reporting of the error if the error is fatal,
(any error other than qgroup not found).
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently even if the underlying disk reports failure on IO,
compressed read endio still gets to verify checksum and reports it as
a checksum error.
In fact, if some IO have failed during reading a compressed data
extent , there's no way the checksum could match, therefore, we can
skip that in order to return error quickly to the upper layer.
Please note that we need to do this after recording the failed mirror
index so that read-repair in the upper layer's endio can work
properly.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel oops happens at
kernel BUG at fs/btrfs/extent_io.c:2104!
...
RIP: clean_io_failure+0x263/0x2a0 [btrfs]
It's showing that read-repair code is using an improper mirror index.
This is due to the fact that compression read's endio hasn't recorded
the failed mirror index in %cb->orig_bio.
With this, btrfs's read-repair can work properly on reading compressed
data.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Paul Jones <paul@pauljones.id.au>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This seems to be a leftover of commit cf8cddd38b ("btrfs: don't
abuse REQ_OP_* flags for btrfs_map_block").
It should use btrfs_op() helper to provide one of 'enum btrfs_map_op'
types.
Fixes: cf8cddd38b ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, "btrfs quota enable" would fail after "btrfs quota disable" on
the first time with syslog output "qgroup_rescan_init failed with -22", but
it would succeed on the second time.
When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be
set in fs_info->flags in btrfs_quota_disable(), but it will not be droppd
in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
because quota_root has already been freed. If "quota enable" is called
after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag
would be dropped in the btrfs_run_qgroups() since quota_root is not NULL.
This leads to the failure of "quota enable" on the first time.
BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
context and is equivalent to whether quota_root is NULL or not.
btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
place.
So, let's remove BTRFS_FS_QUOTA_DISABLING flag.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cmp_data_prepare() (almost) always returns 0 i.e. ignoring errors
from gather_extent_pages(). While the pages are freed by
btrfs_cmp_data_free(), cmp->num_pages still has > 0. Then,
btrfs_extent_same() try to access the already freed pages causing faults
(or violates PageLocked assertion).
This patch just return the error as is so that the caller stop the process.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: f441460202 ("btrfs: fix deadlock with extent-same and readpage")
Cc: <stable@vger.kernel.org> # 4.2
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
`btrfs sub set-default` succeeds to set an ID which isn't corresponding to any
fs/file tree. If such the bad ID is set to a filesystem, we can't mount this
filesystem without specifying `subvol` or `subvolid` mount options.
Fixes: 6ef5ed0d38 ("Btrfs: add ioctl and incompat flag to set the default mount subvol")
Cc: <stable@vger.kernel.org>
Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ENOTSUPP should not be returned to the user program.
(cf. include/linux/errno.h)
Therefore, EOPNOTSUPP is used instead of ENOTSUPP.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereference reloc_root->node, causing
the system BUG.
Fixes: 6bdf131fac ("Btrfs: don't leak reloc root nodes on error")
Cc: <stable@vger.kernel.org> # 4.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with direct IO path, because before
the function is called, it's ensured that there are ordered extents filling
whole the range. It's not the case, however, when it's called from
run_delalloc_range(): it is possible to have error in the midle of the loop
in e.g. run_delalloc_nocow(), so that there exisits the range not covered
by any ordered extents. By cleaning such "uncomplete" range,
__endio_write_update_ordered() stucks at offset where there're no ordered
extents.
Since the ordered extents are created from head to tail, we can stop the
search if there are no offset progress.
Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 524272607e ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup
submitted ordered extents. However, it does not clear the ordered bit
(Private2) of corresponding pages. Thus, the following BUG occurs from
free_pages_check_bad() (on btrfs/125 with nospace_cache).
BUG: Bad page state in process btrfs pfn:3fa787
page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping: (null) index:0xd
flags: 0x8000000000002008(uptodate|private_2)
raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff
raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x2000(private_2)
This patch clears the flag same as other places calling
btrfs_dec_test_ordered_pending() for every page in the specified range.
Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info->super_copy->{node,sector}size are little-endian, but the ioctl
should return the values in native endianness. Use the cached values in
btrfs_fs_info instead. Found with sparse.
Fixes: 80a773fbfc ("btrfs: retrieve more info from FS_INFO ioctl")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().
This remove this unnecessary bio->bi_opf setting.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.
Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Variable total_nr_pages is being initialized and then updated with
the same value, this latter assignment is redundant and can be
removed. Cleans up clang build warning:
Value stored to 'total_nr_pages' during its initialization is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In order to using discard function, it is necessary that not only xfs
is mounted with discard option, but also the discard function is
supported by the device. Current code doesn't output any message when
users mount with discard option on unsupported device, so it is
difficult to notice that it was not enabled actually.
This patch adds the warning message to notice that discard option is
not enabled due to unsupported device when the filesystem is mounted.
Changes in v2 (Suggested by Brian Foster):
- Move the unsupported device check into xfs_fs_fill_super().
- Clear the discard flag when device is unsupported.
Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The 'did_zero' param of xfs_zero_range() was not passed to
iomap_zero_range() correctly. This was introduced by commit
7bb41db3ea ("xfs: handle 64-bit length in xfs_iozero"), and found
by code inspection.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_file_aio_write_checks(), variable 'zero' is there only to
satisfy xfs_zero_eof(), the result of it is ignored. Now, with
iomap_zero_range() based xfs_zero_eof(), we can safely pass NULL as
the last param of it and kill 'zero'.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the %pS instead of the %pF printk format specifier for printing symbols
from direct addresses. This is needed for the ia64, ppc64 and parisc64
architectures.
Signed-off-by: Helge Deller <deller@gmx.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we perform an finsert/fcollapse operation, cancel all the CoW
extents for the affected file offset range so that they don't end up
pointing to the wrong blocks.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we have speculative cow preallocations hanging around in the cow
fork, don't let a truncate operation clear the reflink flag because if
we do then there's a chance we'll forget to free those extents when we
destroy the incore inode.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull block fixes from Jens Axboe:
- Two sets of NVMe pull requests from Christoph:
- Fixes for the Fibre Channel host/target to fix spec compliance
- Allow a zero keep alive timeout
- Make the debug printk for broken SGLs work better
- Fix queue zeroing during initialization
- Set of RDMA and FC fixes
- Target div-by-zero fix
- bsg double-free fix.
- ndb unknown ioctl fix from Josef.
- Buffered vs O_DIRECT page cache inconsistency fix. Has been floating
around for a long time, well reviewed. From Lukas.
- brd overflow fix from Mikulas.
- Fix for a loop regression in this merge window, where using a union
for two members of the loop_cmd turned out to be a really bad idea.
From Omar.
- Fix for an iostat regression fix in this series, using the wrong API
to get at the block queue. From Shaohua.
- Fix for a potential blktrace delection deadlock. From Waiman.
* 'for-linus' of git://git.kernel.dk/linux-block: (30 commits)
nvme-fcloop: fix port deletes and callbacks
nvmet-fc: sync header templates with comments
nvmet-fc: ensure target queue id within range.
nvmet-fc: on port remove call put outside lock
nvme-rdma: don't fully stop the controller in error recovery
nvme-rdma: give up reconnect if state change fails
nvme-core: Use nvme_wq to queue async events and fw activation
nvme: fix sqhd reference when admin queue connect fails
block: fix a crash caused by wrong API
fs: Fix page cache inconsistency when mixing buffered and AIO DIO
nvmet: implement valid sqhd values in completions
nvme-fabrics: Allow 0 as KATO value
nvme: allow timed-out ios to retry
nvme: stop aer posting if controller state not live
nvme-pci: Print invalid SGL only once
nvme-pci: initialize queue memory before interrupts
nvmet-fc: fix failing max io queue connections
nvme-fc: use transport-specific sgl format
nvme: add transport SGL definitions
nvme.h: remove FC transport-specific error values
...
This tag is meant for pulling a patch called "gfs2: Fix
debugfs glocks dump" which fixes a regression introduced
by commit 88ffbf3e03. The regression caused the glock
dump in debugfs to not report all the glocks, which makes
debugging extremely difficult.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZyUI8AAoJENeLYdPf93o7iq4IAKhb9wJ8kmpu7LZ5k6Fl8BCy
GFztPe2bKsFG8cul1o1gZx8c/GWORaCHe3ZDI6pxl16/E+AvWoA1pKbBLYB1GSvD
90a7/m6+hx02ZXR/MHxBUQLWYXBtBrVMVcZDCmFMHWYCRUIiX2etPZL8wOXeJLTl
lNCSGdd1+3y6IJbthaIKTt1ctzsR8ZqV4QN786d2C3L9dxZ63FnAV43p3rUBzBLX
B5uT5LTmdWSLRqe0A9rnrPga/BfEnA8GDtIYUMic9Yz0Hq2a3vEnCC3P3Myp0DJZ
PGposwqL/emRhXkC4+ICrGsTOIy1BzwMXLF47GQaB/k+2Rd3/l9r/hU5ESjQOgA=
=taQL
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Bob Peterson:
"GFS2: Fix an old regression in GFS2's debugfs interface
This fixes a regression introduced by commit 88ffbf3e03 ("GFS2: Use
resizable hash table for glocks"). The regression caused the glock dump
in debugfs to not report all the glocks, which makes debugging
extremely difficult"
* tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix debugfs glocks dump
Call schedule() here could make the thread miss wake
up from kthread_stop(), so it is better to recheck
kthread_should_stop() before call schedule(), a symptom
happened when I run indefinite test (which mostly created
clustered raid1, assemble it in other nodes, then stop
them) of clustered raid.
$ ps aux|grep md|grep D
root 4211 0.0 0.0 19760 2220 ? Ds 02:58 0:00 mdadm -Ssq
$ cat /proc/4211/stack
kthread_stop+0x4d/0x150
dlm_recoverd_stop+0x15/0x20 [dlm]
dlm_release_lockspace+0x2ab/0x460 [dlm]
leave+0xbf/0x150 [md_cluster]
md_cluster_stop+0x18/0x30 [md_mod]
bitmap_free+0x12e/0x140 [md_mod]
bitmap_destroy+0x7f/0x90 [md_mod]
__md_stop+0x21/0xa0 [md_mod]
do_md_stop+0x15f/0x5c0 [md_mod]
md_ioctl+0xa65/0x18a0 [md_mod]
blkdev_ioctl+0x49e/0x8d0
block_ioctl+0x41/0x50
do_vfs_ioctl+0x96/0x5b0
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x1e/0xad
This maybe not resolve the issue completely since the
KTHREAD_SHOULD_STOP flag could be set between "break"
and "schedule", but at least the chance for the symptom
happen could be reduce a lot (The indefinite test runs
more than 20 hours without problem and it happens easily
without the change).
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The writequeue and writequeue_lock member of othercon was not initialized.
If lowcomms_state_change() is called from network layer, othercon->swork
may be scheduled. In this case, send_to_sock() will generate a NULL pointer
reference. We avoid this problem by correctly initializing writequeue and
writequeue_lock member of othercon.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
When an error occurs in kernel_recvmsg or kernel_sendpage and
close_connection is called and receive work is already scheduled,
receive work is canceled. In that case, the receive work will not
be scheduled forever after reconnection, because CF_READ_PENDING
flag is established.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
In the current implementation, we think that exclusion control between
processing to set the callback function to the connection structure and
processing to refer to the connection structure from the callback function
was not enough. We fix them.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
dlm_cb_seq is 64 bits. If dlm_cb_seq overflows and returns to 0,
dlm_rem_lkb_callback() will not work properly.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
The sk member of the socket generated by sock_create_kern() is overwritten
by ops->accept(). So the previous sk will not be released.
We use kernel_accept() instead of sock_create_kern() and ops->accept().
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
When the DLM_LKF_NODLCKWT flag was set, even if conversion deadlock
was detected, the caller of can_be_granted() was unknown.
We change the behavior of can_be_granted() and change it to detect
conversion deadlock regardless of whether the DLM_LKF_NODLCKWT flag
is set or not. And depending on whether the DLM_LKF_NODLCKWT flag
is set or not, we change the behavior at the caller of can_be_granted().
This fix has no effect except when using DLM_LKF_NODLCKWT flag.
Currently, ocfs2 uses the DLM_LKF_NODLCKWT flag and does not expect a
cancel operation from conversion deadlock when calling dlm_lock().
ocfs2 is implemented to perform a cancel operation by requesting
BASTs (callback).
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
If reconnection fails while executing dlm_lowcomms_stop,
dlm_send will not stop.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
CF_WRITE_PENDING flag has been reanimated to make dlm_send stop properly
when running dlm_lowcomms_stop.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
When dlm_recoverd_stop() is called between kthread_should_stop() and
set_task_state(TASK_INTERRUPTIBLE), dlm_recoverd will not wake up.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
If an error occurs in the sending / receiving process, if othercon
exists, sending / receiving processing using othercon may also result
in an error. We fix to pre-close othercon as well.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
If a node sends a DLM_RCOM_STATUS command and an error occurs on the
receiving side, the DLM_RCOM_STATUS_REPLY response may not be returned.
We retransmitted the DLM_RCOM_STATUS command so that we do not wait for
an infinite response.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
In the current implementation, we think that exclusion control
for othercon in tcp_accept_from_sock() and sctp_accept_from_sock()
was not enough. We fix them.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
When kernel_sendpage(in send_to_sock) and kernel_recvmsg
(in receive_from_sock) return error, close_connection may works at the
same time. At that time, they may wait for each other by cancel_work_sync.
Signed-off-by: Tadashi Miyauchi <miayuchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
save_cb argument is not used. We remove them.
Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
In a previous patch I noted that accept() often copies the struct
sock (sk) which overwrites the sock callbacks. However, in testing
we discovered that the dlm connection structures (con) are sometimes
deleted and recreated as connections come and go, and since they're
zeroed out by kmem_cache_zalloc, the saved callback pointers are
also initialized to zero. But with today's DLM code, the callbacks
are only saved when a socket is added.
During recovery testing, we discovered a common situation in which
the new con is initialized to zero, then a socket is added after
accept(). In this case, the sock's saved values are all NULL, but
the saved values are wiped out, due to accept(). Therefore, we
don't have a known good copy of the callbacks from which we can
restore.
Since the struct sock callbacks are always good after listen(),
this patch saves the known good values after listen(). These good
values are then used for subsequent restores.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
Before this patch, there was a flag in the con structure that was
used to determine whether or not a connect was needed. The bit was
set here and there, and cleared here and there, so it left some
race conditions: the bit was set, work was queued, then the worker
cleared the bit, allowing someone else to set it while the worker
ran. For the most part, this worked okay, but we got into trouble
if connections were lost and it needed to reconnect.
This patch eliminates the flag in favor of simply checking if we
actually have a sock pointer while protected by the mutex.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp>
Signed-off-by: David Teigland <teigland@redhat.com>
Three-entry POSIX ACLs can be stored in the file mode permission bits,
with no need to store them in extended attributes. When a process sets
such a minimal ACL, the kernel updates the file mode like chmod does,
and removes any existing extended attributes for that ACL. Make sure
the ctime is always updated in that case.
Fixes xfstest generic/307.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When inodes are read from disk, GFS2 will only update in-memory atimes
older than the on-disk atimes; this prevents atimes from going
backwards. The atimes of newly allocated inodes are initialized to 0.
This means that when an atime is explicitly set to a negative value,
this value will not persist.
Fix by setting the atime of newly allocated inodes to the lowest
possible value instead of 0.
Fixes xfstest generic/258.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The FS_IOC_SETFLAGS ioctl is supposed to update the inode ctime.
Fixes xfstests generic/277.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add a comment about the logical block size for directories. Rename
"bsize" in gfs2_block_map to "factor". Fix a typo in the description of
metaptr1.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The switch to rhashtables (commit 88ffbf3e03) broke the debugfs glock
dump (/sys/kernel/debug/gfs2/<device>/glocks) for dumps bigger than a
single buffer: the right function for restarting an rhashtable iteration
from the beginning of the hash table is rhashtable_walk_enter;
rhashtable_walk_stop + rhashtable_walk_start will just resume from the
current position.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # v4.3+
Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised it does pose a thread
for a data integrity, is unexpected and should be fixed.
Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later before the completion in dio_complete() invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.
This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on inode that has pages mapped. However
since the consensus is that this is ill-advised practice the performance
implication should not be a problem.
This was based on proposal from Jeff Moyer, thanks!
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQGcBAABAgAGBQJZxb1HAAoJEIosvXAHck9RH48L/AnYLWFHp+9RK+elSTAFJkOn
seLJXoQTJtOQ9XA1Fz4wp5jiGTIJgx2Yq3Vj7hl8rQHNE6MvJxTfHxNSMwfrLUkS
HznrXpagSsKFcAHTyHojL3ZZCOzRFOE5c08xpfJ9cGsL3boJGfh+cXawP/mXBpIR
6IP5Om97PMH5QZhfoVkDJT+/L1ZKYdBDPoX56RH33l5w0h5ctHz24Zfk6K1e/NRw
/UMZ6di1FRxD5sZu+cQiKMymw/XohKOXkz48hj/x1wWZ0rkpxb4qlSRuQ5se9sKq
AwpRWacZkN7Vhwl4hLzQV914tLSkT2Z/pTKTZ7Xn0UBxXIpU/hYq/TuzS6E74rcs
n2H0FOdeaSPW7lE9Z0j0wWP3lNad2ii6qUE+BJxxaNTVa1Z5lOH2iNrmmp0HUi1G
FnKBiYdws2uJQDYQBFlKyrZY14LvRB3MAXKV94qRr1KWoIqFLrg+4O0mf/TcvzIV
fr8YT147UuK5FhSfLKJMJLQXYSGUA/M4QhdClFe+Sg==
=FS3F
-----END PGP SIGNATURE-----
Merge tag '4.14-smb3-fixes-from-recent-test-events-for-stable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Various SMB3 fixes for stable and security improvements from the
recently completed SMB3/Samba test events
* tag '4.14-smb3-fixes-from-recent-test-events-for-stable' of git://git.samba.org/sfrench/cifs-2.6:
SMB3: Don't ignore O_SYNC/O_DSYNC and O_DIRECT flags
SMB3: handle new statx fields
SMB: Validate negotiate (to protect against downgrade) even if signing off
cifs: release auth_key.response for reconnect.
cifs: release cifs root_cred after exit_cifs
CIFS: make arrays static const, reduces object code size
[SMB3] Update session and share information displayed for debugging SMB2/SMB3
cifs: show 'soft' in the mount options for hard mounts
SMB3: Warn user if trying to sign connection that authenticated as guest
SMB3: Fix endian warning
Fix SMB3.1.1 guest authentication to Samba
release and a rare NULL dereference in create_session_open_msg().
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJZxQ4tAAoJEEp/3jgCEfOLnrEH+wXHfoLTTFPRfHtvp3E1X0j0
FGBYg/AuOejMOxXrP824XURjaPSV5GiNJA6g62GXf31Va7lmYfI5u+70XbdVluEm
j1e8fT/qNfCizKq+dv8LyiPse6QDi+0siuxbLo7sMmJYODzrhKFWpbPsXabafCBc
XFc3sk6J6wPoZynPsd2QtxXhrZpmQ7Xq+Us2u+pA7DZvtiZCRn0unqUPWV0vFl0j
7cPqPBX5LOSJoLNHmffXMQBqAWq1mErNs8islm+DIJvookhdlc0J5Qsu2a8lGrkP
57Sln1nr1vNJ2f2qUj2feOm02+Vcy2HFBuLL82kFAGJAVAvlDH7tiFlO3WdFvx4=
=b0uS
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.14-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two small but important fixes: RADOS semantic change in upcoming v12.2.1
release and a rare NULL dereference in create_session_open_msg()"
* tag 'ceph-for-4.14-rc2' of git://github.com/ceph/ceph-client:
ceph: avoid panic in create_session_open_msg() if utsname() returns NULL
libceph: don't allow bidirectional swap of pg-upmap-items
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
We weren't returning the creation time or the two easily supported
attributes (ENCRYPTED or COMPRESSED) for the getattr call to
allow statx to return these fields.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>\
Acked-by: Jeff Layton <jlayton@poochiereds.net>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
As long as signing is supported (ie not a guest user connection) and
connection is SMB3 or SMB3.02, then validate negotiate (protect
against man in the middle downgrade attacks). We had been doing this
only when signing was required, not when signing was just enabled,
but this more closely matches recommended SMB3 behavior and is
better security. Suggested by Metze.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeremy Allison <jra@samba.org>
Acked-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
There is a race that cause cifs reconnect in cifs_mount,
- cifs_mount
- cifs_get_tcp_session
- [ start thread cifs_demultiplex_thread
- cifs_read_from_socket: -ECONNABORTED
- DELAY_WORK smb2_reconnect_server ]
- cifs_setup_session
- [ smb2_reconnect_server ]
auth_key.response was allocated in cifs_setup_session, and
will release when the session destoried. So when session re-
connect, auth_key.response should be check and released.
Tested with my system:
CIFS VFS: Free previous auth_key.response = ffff8800320bbf80
A simple auth_key.response allocation call trace:
- cifs_setup_session
- SMB2_sess_setup
- SMB2_sess_auth_rawntlmssp_authenticate
- build_ntlmssp_auth_blob
- setup_ntlmv2_rsp
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
memory leak was found by kmemleak. exit_cifs_spnego
should be called before cifs module removed, or
cifs root_cred will not be released.
kmemleak report:
unreferenced object 0xffff880070a3ce40 (size 192):
backtrace:
kmemleak_alloc+0x4a/0xa0
kmem_cache_alloc+0xc7/0x1d0
prepare_kernel_cred+0x20/0x120
init_cifs_spnego+0x2d/0x170 [cifs]
0xffffffffc07801f3
do_one_initcall+0x51/0x1b0
do_init_module+0x60/0x1fd
load_module+0x161e/0x1b60
SYSC_finit_module+0xa9/0x100
SyS_finit_module+0xe/0x10
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Don't populate the read-only arrays types[] on the stack, instead make
them both static const. Makes the object code smaller by over 200 bytes:
Before:
text data bss dec hex filename
111503 37696 448 149647 2488f fs/cifs/file.o
After:
text data bss dec hex filename
111140 37856 448 149444 247c4 fs/cifs/file.o
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
We were not displaying some key fields (session status and capabilities and
whether guest authenticated) for SMB2/SMB3 session in /proc/fs/cifs/DebugData.
This is needed for real world triage of problems with the (now much more
common) SMB3 mounts.
Signed-off-by: Steve French <smfrench@gmail.com>
It can be confusing if user ends up authenticated as guest but they
requested signing (server will return error validating signed packets)
so add log message for this.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Multi-dialect negotiate patch had a minor endian error.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org> # 4.13+
Pull si_code fix from Eric Biederman:
"When sorting out the si_code ambiguity fcntl I accidentally overshot
and included SIGPOLL as well. Ooops! This is my trivial fix for that.
Vince Weaver caught this when it landed in your tree with his
perf_event_tests many of which started failing because the si_code
changed"
Quoth Vince Weaver:
"I've tested with this patch applied and can confirm all of my tests
now pass again"
Fixes: d08477aa97 ("fcntl: Don't use ambiguous SIG_POLL si_codes")
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
fcntl: Don't set si_code to SI_SIGIO when sig == SIGPOLL
Samba rejects SMB3.1.1 dialect (vers=3.1.1) negotiate requests from
the kernel client due to the two byte pad at the end of the negotiate
contexts.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
struct timespec is not y2038 safe. Use y2038 safe
struct timespec64 to represent timeouts.
The system call interface itself will be changed as
part of different series.
Timeouts will not really need more than 32 bits.
But, replacing these with timespec64 helps verification
of a y2038 safe kernel by getting rid of timespec
internally.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: linux-aio@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Usage of these apis and their compat versions makes
the syscalls: select family of syscalls and their
compat implementations simpler.
This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
utsname() can return NULL while process is exiting. Kernel releases
file locks during process exits. We send request to mds when releasing
file lock. So it's possible that we open mds session while process is
exiting. utsname() is called in create_session_open_msg().
Link: http://tracker.ceph.com/issues/21275
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[idryomov@gmail.com: drop utsname.h include from mds_client.c]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQGcBAABAgAGBQJZwBUiAAoJEIosvXAHck9RbygL/0jG69MaHf1HSZdZQDmHQZFk
9U3M8vhufxWx4bkM574Hm3TJJMZ4X5GNLB9oqOKdHa6Ciz2XRqwsGkIuqWuGCFmo
vwBtHGwz3iGVaOb1LMzqxRKG/W1LJgGFZe57ZwGhwglDa6rMn46ygT11e8fH0+g1
OI1OhUxDqLE8EDLHcfAaTlrRze8tNnYiEsRYU7qx6k4yeh5r3o9UMU3dqdMyEaw2
pM2xLyBp8pZDcfCMrZwSoSFS4zt8eD5C9l7TDsoixxhChf2cVGA1nfbQUPYHfVd4
tV4MmpHZtD9Uay4+L+tUAv1lAmuM93AMoltLXk34RojiPLjWzr5MSqdo2L0bEjYe
SaItonZhOA/Y+A6VpvRKQBxOmSdgNhcyenUsQE8ybifPAksqIGlzc3Ba9iGLcVrp
Izz4WARBrDAXC+FTt9N6kfVSN3QGIrYeZ2uwYmZCadK7GHu6O2s3wtYXa/0mewXG
9s7s26kQwvJUAIcK3xu61HiJBx0jCW5DjKeejoaXxw==
=sn5Y
-----END PGP SIGNATURE-----
Merge tag '4.14-smb3-multidialect-support-and-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Convert default dialect to smb2.1 or later to allow connecting to
Windows 7 for example, also includes some fixes for stable"
* tag '4.14-smb3-multidialect-support-and-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6:
Update version of cifs module
cifs: hide unused functions
SMB3: Add support for multidialect negotiate (SMB2.1 and later)
CIFS/SMB3: Update documentation to reflect SMB3 and various changes
cifs: check rsp for NULL before dereferencing in SMB2_open
When fixing things to avoid ambiguous cases I had a thinko
and included SIGPOLL/SIGIO in with all of the other signals
that have signal specific si_codes. Which is completely wrong.
Fix that.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The new isofs_show_options() function fails to build when CONFIG_NLS
is disabled:
fs/isofs/inode.c: In function 'isofs_show_options':
fs/isofs/inode.c:518:44: error: 'CONFIG_NLS_DEFAULT' undeclared (first use in this function)
fs/isofs/inode.c:518:44: note: each undeclared identifier is reported only once for each function it appears in
This adds a check for CONFIG_JOLIET (which selects NLS), matching
the other uses of the iocharset handling in this file.
Fixes: 6fecb86a44f5 ("isofs: Implement show_options")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Lock dq_dqb_lock around dquot_decr_inodes()
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 7b9ca4c61b ("quota: Reduce contention on dq_data_lock")
Signed-off-by: Jan Kara <jack@suse.cz>
The newly added SMB2+ attribute support causes unused function
warnings when CONFIG_CIFS_XATTR is disabled:
fs/cifs/smb2ops.c:563:1: error: 'smb2_set_ea' defined but not used [-Werror=unused-function]
smb2_set_ea(const unsigned int xid, struct cifs_tcon *tcon,
fs/cifs/smb2ops.c:513:1: error: 'smb2_query_eas' defined but not used [-Werror=unused-function]
smb2_query_eas(const unsigned int xid, struct cifs_tcon *tcon,
This adds another #ifdef around the affected functions.
Fixes: 5517554e43 ("cifs: Add support for writing attributes on SMB2+")
Fixes: 95907fea4f ("cifs: Add support for reading attributes on SMB2+")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steve French <smfrench@gmail.com>
With the need to discourage use of less secure dialect, SMB1 (CIFS),
we temporarily upgraded the dialect to SMB3 in 4.13, but since there
are various servers which only support SMB2.1 (2.1 is more secure
than CIFS/SMB1) but not optimal for a default dialect - add support
for multidialect negotiation. cifs.ko will now request SMB2.1
or later (ie SMB2.1 or SMB3.0, SMB3.02) and the server will
pick the latest most secure one it can support.
In addition since we are sending multidialect negotiate, add
support for secure negotiate to validate that a man in the
middle didn't downgrade us.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> # 4.13+
Pull misc fixes from Thomas Gleixner:
- A fix for a user space regression in /proc/$PID/stat
- A couple of objtool fixes:
~ Plug a memory leak
~ Avoid accessing empty sections which upsets certain binutil
versions
~ Prevent corrupting the obj file when section sizes did not change
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/proc: Report eip/esp in /prod/PID/stat for coredumping
objtool: Fix object file corruption
objtool: Do not retrieve data from empty sections
objtool: Fix memory leak in elf_create_rela_section()
Commit 0a1eb2d474 ("fs/proc: Stop reporting eip and esp in
/proc/PID/stat") stopped reporting eip/esp because it is
racy and dangerous for executing tasks. The comment adds:
As far as I know, there are no use programs that make any
material use of these fields, so just get rid of them.
However, existing userspace core-dump-handler applications (for
example, minicoredumper) are using these fields since they
provide an excellent cross-platform interface to these valuable
pointers. So that commit introduced a user space visible
regression.
Partially revert the change and make the readout possible for
tasks with the proper permissions and only if the target task
has the PF_DUMPCORE flag set.
Fixes: 0a1eb2d474 ("fs/proc: Stop reporting eip and esp in> /proc/PID/stat")
Reported-by: Marco Felsch <marco.felsch@preh.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: stable@vger.kernel.org
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/87poatfwg6.fsf@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ACL patch, I realized that Orangefs ACL code was busted, not just in the
kernel module, but in the server as well. I've been working on the
code in the server mostly, but here's one kernel patch, there
will be more.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZu8QoAAoJEM9EDqnrzg2+cJIP/2A8cCLk5HZf/DU0R9AUTyZx
+vZF0xdbesUuOkOAw5phYor31bToD8XsPKY7z/OlUjXJR09V2DukCX4YRDyrOibJ
nTUxWk1JnCArUeefIkD8N/4EtuQ7n8mdhAeln4//vjzGKXB/BgmTNhXe3+8RTj/t
IIVV+T99aNth4JD9K7Uux/vtoZA7kSvIbPQHRzFRr38GORJOkjB8b3mluxwjDkS/
S2DJND/mneQSPeh7VylKGSPHTqQcv9eg83/muyEoaWcd94QT2pZx9ZEYQrZ1hKus
1Nk1xmaJXqbJ1V+9qKJT9cxiDwE3uz5TQ6JSeB91ca50DcO/Up9EIdPMF0P13J21
0mh5/OuiffVdBYYPIWOe2KdpXOw9aBQqQyNA2MZdk1hotW0o/FlxNx4qtuVeQpqo
f3U0hQBQQD86nLylw6QDu5sD8Bxxx4ihM5szuHn9YlStYUgPbBdPZHsFTk2LAmp8
71UobSnSQGaOop6pfAJXW2y7m790BwYQhK7vSozQtTLMNU7EzPItIT6oQM7pjS3M
1Kh6a5XXwH+imbiaeLRBsfNA293eDuRhz9wsZeLnCV0Tt34bp+UG0FMfP+gceDQn
4hFEPnzWVAQpCyJBq4AMCH/fitawQiLcqgPBjiMVOl6pnznd5MK5C4T9XfaNf5R6
t2JWF/PS+plOa02uq4Fu
=wR7F
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.14-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Some cleanups and a big bug fix for ACLs.
When I was reviewing Jan Kara's ACL patch, I realized that Orangefs
ACL code was busted, not just in the kernel module, but in the server
as well. I've been working on the code in the server mostly, but
here's one kernel patch, there will be more"
* tag 'for-linus-4.14-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: Adjust three checks for null pointers
orangefs: Use kcalloc() in orangefs_prepare_cdm_array()
orangefs: Delete error messages for a failed memory allocation in five functions
orangefs: constify xattr_handler structure
orangefs: don't call filemap_write_and_wait from fsync
orangefs: off by ones in xattr size checks
orangefs: documentation clean up
orangefs: react properly to posix_acl_update_mode's aftermath.
orangefs: Don't clear SGID when inheriting ACLs
This patch constifies the path argument to kernel_read_file_from_path().
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hightlights include:
bugfixes:
- Various changes relating to reporting IO errors.
- pnfs: Use the standard I/O stateid when calling LAYOUTGET
Features:
- Add static NFS I/O tracepoints for debugging
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZuuXZAAoJEGcL54qWCgDy1MAQAIMJR7SZhc6i5PpGXDLx+lRp
HiNldIlW/TXMcHKQ/35cIkt+KPy49EJiB54USZkdiKeiGqMUKtrGgxLVqLY6GELV
gORRHxYY4SkNCWSDqwkpM+4XCQ56Tqrh6PxPmiOd+zsKAiqWzL5WFydbdVZcN3Dr
e/lVHktefKZ3ks0DJ+qbYY2MCVqJ8HLlsQ7ZhvGzNNXHHhmWZHAL2LtXdAZ3e5Wt
fkEuZLhSSVT52VFreS8Z+SZr8Q5kbaNrihT+rNuE3IDg+HqJ0+P99ZbABt0ZIzwI
9Om8R7WKhDeZQKOVHN6UvLX6p04U56VsM43PLdenQ9JeMTjgjSCA47s2rYl/b9GM
hPgp51urhj2k7CrjVE9oMOdgMBO6lCpeQT4K4PAtQeeTWb00sPOGr2PGoPb6r1Pa
UqybzwdKZxpv+/jIiZffd+GYRoETmDW4UnlXq0aE2IMxXUWVI+liJuKVXk/zT5cz
N/n4rJrJTEJFSOMd0UF8TfbnsR+OgNDo76HWOBWbJ2JZ46qmCufZAOgXhN7XtC0O
Kbatgsbi9lRBRBT80Agdr3OOoT2mzuzaY+VwrfiV2vkLOsdhmW039oUUws4V91ii
TksS9JFBtJP30m/qxEUDSisrQpLrEcPaT4zYY3tFHywL2Dn1t300iTPeW+LXX0c7
aJfE4XNfYgs7BEHgl1Q3
=p93H
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull more NFS client updates from Trond Myklebust:
"Hightlights include:
Bugfixes:
- Various changes relating to reporting IO errors.
- pnfs: Use the standard I/O stateid when calling LAYOUTGET
Features:
- Add static NFS I/O tracepoints for debugging"
* tag 'nfs-for-4.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: various changes relating to reporting IO errors.
NFS: Add static NFS I/O tracepoints
pNFS: Use the standard I/O stateid when calling LAYOUTGET
Pull misc leftovers from Al Viro.
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the __user misannotations in asm-generic get_user/put_user
fput: Don't reinvent the wheel but use existing llist API
namespace.c: Don't reinvent the wheel but use existing llist API
Pull nowait read support from Al Viro:
"Support IOCB_NOWAIT for buffered reads and block devices"
* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
block_dev: support RFW_NOWAIT on block device nodes
fs: support RWF_NOWAIT for buffered reads
fs: support IOCB_NOWAIT in generic_file_buffered_read
fs: pass iocb to do_generic_file_read
Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).
This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like
list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')
sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
-e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
-e 's/\<MS_NODEV\>/SB_NODEV/g' \
-e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
-e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
-e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
-e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
-e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
-e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
-e 's/\<MS_SILENT\>/SB_SILENT/g' \
-e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
-e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
-e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
-e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
$list
and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
Pull more set_fs removal from Al Viro:
"Christoph's 'use kernel_read and friends rather than open-coding
set_fs()' series"
* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: unexport vfs_readv and vfs_writev
fs: unexport vfs_read and vfs_write
fs: unexport __vfs_read/__vfs_write
lustre: switch to kernel_write
gadget/f_mass_storage: stop messing with the address limit
mconsole: switch to kernel_read
btrfs: switch write_buf to kernel_write
net/9p: switch p9_fd_read to kernel_write
mm/nommu: switch do_mmap_private to kernel_read
serial2002: switch serial2002_tty_write to kernel_{read/write}
fs: make the buf argument to __kernel_write a void pointer
fs: fix kernel_write prototype
fs: fix kernel_read prototype
fs: move kernel_read to fs/read_write.c
fs: move kernel_write to fs/read_write.c
autofs4: switch autofs4_write to __kernel_write
ashmem: switch to ->read_iter
Pull ipc compat cleanup and 64-bit time_t from Al Viro:
"IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa
Dinamani"
* 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
utimes: Make utimes y2038 safe
ipc: shm: Make shmid_kernel timestamps y2038 safe
ipc: sem: Make sem_array timestamps y2038 safe
ipc: msg: Make msg_queue timestamps y2038 safe
ipc: mqueue: Replace timespec with timespec64
ipc: Make sys_semtimedop() y2038 safe
get rid of SYSVIPC_COMPAT on ia64
semtimedop(): move compat to native
shmat(2): move compat to native
msgrcv(2), msgsnd(2): move compat to native
ipc(2): move compat to native
ipc: make use of compat ipc_perm helpers
semctl(): move compat to native
semctl(): separate all layout-dependent copyin/copyout
msgctl(): move compat to native
msgctl(): split the actual work from copyin/copyout
ipc: move compat shmctl to native
shmctl: split the work from copyin/copyout
Pull zstd support from Chris Mason:
"Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while. After talking with Dave Sterba, Herbert
and Phillip, we decided to send the whole thing in as one pull
request.
zstd is a big win in speed over zlib and in compression ratio over
lzo, and the compression team here at FB has gotten great results
using it in production. Nick will continue to update the kernel side
with new improvements from the open source zstd userland code.
Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:
I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB
of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel
Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using
`silesia.tar` [3], which is 211,988,480 B large. Run the following
commands for the benchmark:
sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test
The time is reported by the time of the userland `cp`.
The MB/s is computed with
1,536,217,008 B / time(buffer size, hash)
which includes the time to copy from userland.
The Adjusted MB/s is computed with
1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).
The memory reported is the amount of memory the compressor
requests.
| Method | Size (B) | Time (s) | Ratio | MB/s | Adj MB/s | Mem (MB) |
|----------|----------|----------|-------|---------|----------|----------|
| none | 11988480 | 0.100 | 1 | 2119.88 | - | - |
| zstd -1 | 73645762 | 1.044 | 2.878 | 203.05 | 224.56 | 1.23 |
| zstd -3 | 66988878 | 1.761 | 3.165 | 120.38 | 127.63 | 2.47 |
| zstd -5 | 65001259 | 2.563 | 3.261 | 82.71 | 86.07 | 2.86 |
| zstd -10 | 60165346 | 13.242 | 3.523 | 16.01 | 16.13 | 13.22 |
| zstd -15 | 58009756 | 47.601 | 3.654 | 4.45 | 4.46 | 21.61 |
| zstd -19 | 54014593 | 102.835 | 3.925 | 2.06 | 2.06 | 60.15 |
| zlib -1 | 77260026 | 2.895 | 2.744 | 73.23 | 75.85 | 0.27 |
| zlib -3 | 72972206 | 4.116 | 2.905 | 51.50 | 52.79 | 0.27 |
| zlib -6 | 68190360 | 9.633 | 3.109 | 22.01 | 22.24 | 0.27 |
| zlib -9 | 67613382 | 22.554 | 3.135 | 9.40 | 9.44 | 0.27 |
I benchmarked zstd decompression using the same method on the same
machine. The benchmark file is located in the upstream zstd repo
under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The
memory reported is the amount of memory required to decompress
data compressed with the given compression level. If you know the
maximum size of your input, you can reduce the memory usage of
decompression irrespective of the compression level.
| Method | Time (s) | MB/s | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none | 0.025 | 8479.54 | - | - |
| zstd -1 | 0.358 | 592.15 | 636.60 | 0.84 |
| zstd -3 | 0.396 | 535.32 | 571.40 | 1.46 |
| zstd -5 | 0.396 | 535.32 | 571.40 | 1.46 |
| zstd -10 | 0.374 | 566.81 | 607.42 | 2.51 |
| zstd -15 | 0.379 | 559.34 | 598.84 | 4.61 |
| zstd -19 | 0.412 | 514.54 | 547.77 | 8.80 |
| zlib -1 | 0.940 | 225.52 | 231.68 | 0.04 |
| zlib -3 | 0.883 | 240.08 | 247.07 | 0.04 |
| zlib -6 | 0.844 | 251.17 | 258.84 | 0.04 |
| zlib -9 | 0.837 | 253.27 | 287.64 | 0.04 |
I ran a long series of tests and benchmarks on the btrfs side and the
gains are very similar to the core benchmarks Nick ran"
* 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
squashfs: Add zstd support
btrfs: Add zstd support
lib: Add zstd modules
lib: Add xxhash module
- Constify a few variables in DM core and DM integrity
- Add bufio optimization and checksum failure accounting to DM integrity
- Fix DM integrity to avoid checking integrity of failed reads
- Fix DM integrity to use init_completion
- A couple DM log-writes target fixes
- Simplify DAX flushing by eliminating the unnecessary flush abstraction
that was stood up for DM's use.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZuo8UAAoJEMUj8QotnQNa5BEIANO4mHh1nrzEbH72a4RCLgxV
H1Pk1zZx/W1bhOOmcRRhxCSM85dPgsCegc5EmpwLZEMavQrP9UZblHcYOUsyIx7W
S/lWa+soOq/5N2OveROc4WdoWVs50UFmc1+BcClc4YrEe+15XC3R0VMkjX2b/hUL
o2eYhPjpMlgaorMtRRU6MAooo2fBRQ9m05aPeVgd35fxibrE7PZm+EYW09wa0STi
9ufuDXJf8+TtFP/38BD41LbUEskuHUZTSDeAJ+3DBaTtfEZcZYxsst4P9JangsHx
jqqqI9aYzFD2a27fl9WLhCvm40YFiKp5nwzED0RZjzWxVa/jTShX7a49BdzTTfw=
=rkSB
-----END PGP SIGNATURE-----
Merge tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Some request-based DM core and DM multipath fixes and cleanups
- Constify a few variables in DM core and DM integrity
- Add bufio optimization and checksum failure accounting to DM
integrity
- Fix DM integrity to avoid checking integrity of failed reads
- Fix DM integrity to use init_completion
- A couple DM log-writes target fixes
- Simplify DAX flushing by eliminating the unnecessary flush
abstraction that was stood up for DM's use.
* tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dax: remove the pmem_dax_ops->flush abstraction
dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
dm integrity: make blk_integrity_profile structure const
dm integrity: do not check integrity for failed read operations
dm log writes: fix >512b sectorsize support
dm log writes: don't use all the cpu while waiting to log blocks
dm ioctl: constify ioctl lookup table
dm: constify argument arrays
dm integrity: count and display checksum failures
dm integrity: optimize writing dm-bufio buffers that are partially changed
dm rq: do not update rq partially in each ending bio
dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
dm mpath: complain about unsupported __multipath_map_bio() return values
dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The script “checkpatch.pl” pointed information out like the following.
Comparison to NULL could be written !…
Thus fix affected source code places.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kcalloc".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data structure by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Omit an extra message for a memory allocation failure in these functions.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>