->atomic_open() can be given an in-lookup dentry *or* a negative one
found in dcache. Use d_in_lookup() to tell one from another, rather
than d_unhashed().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In orangefs_inode_getxattr(), an fsuid is written to dmesg. The kuid is
converted to a userspace uid via from_kuid(current_user_ns(), [...]), but
since dmesg is global, init_user_ns should be used here instead.
In copy_attributes_from_inode(), op_alloc() and fill_default_sys_attrs(),
upcall structures are populated with uids/gids that have been mapped into
the caller's namespace. However, those upcall structures are read by
another process (the userspace filesystem driver), and that process might
be running in another namespace. This effectively lets any user spoof its
uid and gid as seen by the userspace filesystem driver.
To fix the second issue, I just construct the opcall structures with
init_user_ns uids/gids and require the filesystem server to run in the
init namespace. Since orangefs is full of global state anyway (as the error
message in DUMP_DEVICE_ERROR explains, there can only be one userspace
orangefs filesystem driver at once), that shouldn't be a problem.
[
Why does orangefs even exist in the kernel if everything does upcalls into
userspace? What does orangefs do that couldn't be done with the FUSE
interface? If there is no good answer to those questions, I'd prefer to see
orangefs kicked out of the kernel. Can that be done for something that
shipped in a release?
According to commit f7ab093f74 ("Orangefs: kernel client part 1"), they
even already have a FUSE daemon, and the only rational reason (apart from
"but most of our users report preferring to use our kernel module instead")
given for not wanting to use FUSE is one "in-the-works" feature that could
probably be integated into FUSE instead.
]
This patch has been compile-tested.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Mike,
On Fri, Jun 3, 2016 at 9:44 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> We use the return value in this one line you changed, our userspace code gets
> ill when we send it (-ENOMEM +1) as a key length...
ah, my mistake. Here's a fixed version.
Thanks,
Andreas
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Orangefs has a catch-all xattr handler that effectively does what the
trusted handler does already.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
The ORANGEFS_XATTR_INDEX_ defines are unused; the ORANGEFS_XATTR_NAME_
defines only obfuscate the code.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
If ext4_fill_super() fails early, it's possible for ext4_evict_inode()
to call ext4_should_journal_data() before superblock options and flags
are fully set up. In that case, the iput() on the journal inode can
end up causing a BUG().
Work around this problem by reordering the tests so we only call
ext4_should_journal_data() after we know it's not the journal inode.
Fixes: 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data")
Fixes: 2b405bfa84 ("ext4: fix data=journal fast mount/umount hang")
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Right now when a new overlay inode is created, we initialize overlay
inode's ->i_mode from underlying inode ->i_mode but we retain only
file type bits (S_IFMT) and discard permission bits.
This patch changes it and retains permission bits too. This should allow
overlay to do permission checks on overlay inode itself in task context.
[SzM] It also fixes clearing suid/sgid bits on write.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Before 4bacc9c923 ("overlayfs: Make f_path...") file->f_path pointed to
the underlying file, hence suid/sgid removal on write worked fine.
After that patch file->f_path pointed to the overlay file, and the file
mode bits weren't copied to overlay_inode->i_mode. So the suid/sgid
removal simply stopped working.
The fix is to copy the mode bits, but then ovl_setattr() needs to clear
ATTR_MODE to avoid the BUG() in notify_change(). So do this first, then in
the next patch copy the mode.
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
If there are racing calls to ext4_commit_super() it's possible for
another writeback of the superblock to result in the buffer being
marked with an error after we check if the buffer is marked as having
a write error and the buffer up-to-date flag is set again. If that
happens mark_buffer_dirty() can end up throwing a WARN_ON_ONCE.
Fix this by moving this check to write before we call
write_buffer_dirty(), and keeping the buffer locked during this whole
sequence.
Signed-off-by: Pranay Kr. Srivastava <pranjas@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 06bd3c36a7 (ext4: fix data exposure after a crash) uncovered a
deadlock in ext4_writepages() which was previously much harder to hit.
After this commit xfstest generic/130 reproduces the deadlock on small
filesystems.
The problem happens when ext4_do_update_inode() sets LARGE_FILE feature
and marks current inode handle as synchronous. That subsequently results
in ext4_journal_stop() called from ext4_writepages() to block waiting for
transaction commit while still holding page locks, reference to io_end,
and some prepared bio in mpd structure each of which can possibly block
transaction commit from completing and thus results in deadlock.
Fix the problem by releasing page locks, io_end reference, and
submitting prepared bio before calling ext4_journal_stop().
[ Changed to defer the call to ext4_journal_stop() only if the handle
is synchronous. --tytso ]
Reported-and-tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
ext4_dx_csum_verify() returns the success return value in two checksum
verification failure cases. We need to set the return values to zero
as failure like ext4_dirent_csum_verify() returning zero when failing
to find a checksum dirent at the tail.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
We temporally change checksum fields in buffers of some types of
metadata into '0' for verifying the checksum values. By doing this
without locking the buffer, some metadata's checksums, which are
being committed or written back to the storage, could be damaged.
In our test, several metadata blocks were found with damaged metadata
checksum value during recovery process. When we only verify the
checksum value, we have to avoid modifying checksum fields directly.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull fuse fix from Miklos Szeredi:
"This makes sure userspace filesystems are not broken by the parallel
lookups and readdir feature"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: serialize dirops by default
Pull overlayfs fixes from Miklos Szeredi:
"This contains fixes for a dentry leak, a regression in 4.6 noticed by
Docker users and missing write access checking in truncate"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: warn instead of error if d_type is not supported
ovl: get_write_access() in truncate
ovl: fix dentry leak for default_permissions
overlay needs underlying fs to support d_type. Recently I put in a
patch in to detect this condition and started failing mount if
underlying fs did not support d_type.
But this breaks existing configurations over kernel upgrade. Those who
are running docker (partially broken configuration) with xfs not
supporting d_type, are surprised that after kernel upgrade docker does
not run anymore.
https://github.com/docker/docker/issues/22937#issuecomment-229881315
So instead of erroring out, detect broken configuration and warn
about it. This should allow existing docker setups to continue
working after kernel upgrade.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 45aebeaf4f ("ovl: Ensure upper filesystem supports d_type")
Cc: <stable@vger.kernel.org> 4.6
Pull vfs fixes from Al Viro:
"Tmpfs readdir throughput regression fix (this cycle) + some -stable
fodder all over the place.
One missing bit is Miklos' tonight locks.c fix - NFS folks had already
grabbed that one by the time I woke up ;-)"
[ The locks.c fix came through the nfsd tree just moments ago ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
namespace: update event counter when umounting a deleted dentry
9p: use file_dentry()
ceph: fix d_obtain_alias() misuses
lockless next_positive()
libfs.c: new helper - next_positive()
dcache_{readdir,dir_lseek}(): don't bother with nested ->d_lock
leases on overlayfs.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXdr1UAAoJECebzXlCjuG+QlsQAJiZTmio6k9tupN5+iKsZNL3
m919ooj8GYsXxlC0OTFfi09dUi1yeF8MEE3j9egk/+qxEtsdmKdEOIy0RVdcLSfd
HeGjXgLh79hVcGxgyBP+pdax2XhZ3RVisg8F5gTw2GPo+FPFZfrEuO5h7ctn+t45
MCQ+4yqYqzEhYnoPyo5XKh5Aj6wBWiaTzg3/jSe6uSuSfuBfyaMaBPq7l7ayGra/
5El+tu61o/SrJ41N2EayWSj/bOFJE92LIuGOh8NdfANuuP70JhxlwgVSldah3CCQ
6PymXAVcjw0+gJ00mokzKfTJW5FPfasxckMHaOvcFsSONy4rlmrwwqUr9C2AFzTE
wQGIzibCDYOSI8uF+//Oe+dh8JWp2TF8rfJcmKLyJMIcCq/Xl6cNx1qrq/oSWvuk
WKOv1otJrPeT31h/s5iLjr/E/Po1eX+d2ySJdvUHYu/5aZwFgWPnVXwJk0s9bLow
auZU85tPnuz+tbS2pWEK+el7LMgDBdzraVRogMdH1c+m3+G9pr53EzmpYovkZ2X8
duVJ2Leslyya347TnJAgEY47Fbeu26JaoeIChGVhKcEyCENlcqAJWaGVECrxvs3y
p/Y2MYMkO8YrCz5wQXPiLFiG4rAc+jSIn1Q+vRGl2Pkel0y7AgJNNMFtANjMCSIO
pg6BqUjOyKt8cXVy4UW7
=K0mr
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7-3' of git://linux-nfs.org/~bfields/linux
Pull lockd/locks fixes from Bruce Fields:
"One fix for lockd soft lookups in an error path, and one fix for file
leases on overlayfs"
* tag 'nfsd-4.7-3' of git://linux-nfs.org/~bfields/linux:
locks: use file_inode()
lockd: unregister notifier blocks if the service fails to come up completely
Pull libnvdimm fixes from Dan Williams:
"1/ Two regression fixes since v4.6: one for the byte order of a sysfs
attribute (bz121161) and another for QEMU 2.6's NVDIMM _DSM (ACPI
Device Specific Method) implementation that gets tripped up by new
auto-probing behavior in the NFIT driver.
2/ A fix tagged for -stable that stops the kernel from
clobbering/ignoring changes to the configuration of a 'pfn'
instance ("struct page" driver). For example changing the
alignment from 2M to 1G may silently revert to 2M if that value is
currently stored on media.
3/ A fix from Eric for an xfstests failure in dax. It is not
currently tagged for -stable since it requires an 8-exabyte file
system to trigger, and there appear to be no user visible side
effects"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nfit: fix format interface code byte order
dax: fix offset overflow in dax_io
acpi, nfit: fix acpi_check_dsm() vs zero functions implemented
libnvdimm, pfn, dax: fix initialization vs autodetect for mode + alignment
(Another one for the f_path debacle.)
ltp fcntl33 testcase caused an Oops in selinux_file_send_sigiotask.
The reason is that generic_add_lease() used filp->f_path.dentry->inode
while all the others use file_inode(). This makes a difference for files
opened on overlayfs since the former will point to the overlay inode the
latter to the underlying inode.
So generic_add_lease() added the lease to the overlay inode and
generic_delete_lease() removed it from the underlying inode. When the file
was released the lease remained on the overlay inode's lock list, resulting
in use after free.
Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 7eaceaccab removed ->sync_page, but a few mentions of it still
existed in documentation and comments,
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lockless_dereference() was added which can be used in place of
hard-coding smp_read_barrier_depends().
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- m_start() in fs/namespace.c expects that ns->event is incremented each
time a mount added or removed from ns->list.
- umount_tree() removes items from the list but does not increment event
counter, expecting that it's done before the function is called.
- There are some codepaths that call umount_tree() without updating
"event" counter. e.g. from __detach_mounts().
- When this happens m_start may reuse a cached mount structure that no
longer belongs to ns->list (i.e. use after free which usually leads
to infinite loop).
This change fixes the above problem by incrementing global event counter
before invoking umount_tree().
Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
Cc: stable@vger.kernel.org
Signed-off-by: Andrey Ulanov <andreyu@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
v9fs may be used as lower layer of overlayfs and accessing f_path.dentry
can lead to a crash. In this case it's a NULL pointer dereference in
p9_fid_create().
Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.
Reported-by: Alessio Igor Bogani <alessioigorbogani@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Alessio Igor Bogani <alessioigorbogani@gmail.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Update posix_acl_valid to verify that an acl is within a user namespace.
Update the callers of posix_acl_valid to pass in an appropriate
user namespace. For posix_acl_xattr_set and v9fs_xattr_set_acl pass in
inode->i_sb->s_user_ns to posix_acl_valid. For md_unpack_acl pass in
&init_user_ns as no inode or superblock is in sight.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If the lockd service fails to start up then we need to be sure that the
notifier blocks are not registered, otherwise a subsequent start of the
service could cause the same notifier to be registered twice, leading to
soft lockups.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 0751ddf77b "lockd: Register callbacks on the inetaddr_chain..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix up nfs4_do_handle_exception() so that it can check if the operation
that received the NFS4ERR_BAD_STATEID was using a defunct delegation.
Apply that to the case of SETATTR, which will currently return EIO
in some cases where this happens.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a pNFS client sets hdr->pgio_done_cb, then we should not overwrite that
in nfs4_proc_read_setup()
Fixes: 75bf47ebf6 ("pNFS/flexfile: Fix erroneous fall back to...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
An extent with lblock = 4294967295 and len = 1 will pass the
ext4_valid_extent() test:
ext4_lblk_t last = lblock + len - 1;
if (len == 0 || lblock > last)
return 0;
since last = 4294967295 + 1 - 1 = 4294967295. This would later trigger
the BUG_ON(es->es_lblk + es->es_len < es->es_lblk) in ext4_es_end().
We can simplify it by removing the - 1 altogether and changing the test
to use lblock + len <= lblock, since now if len = 0, then lblock + 0 ==
lblock and it fails, and if len > 0 then lblock + len > lblock in order
to pass (i.e. it doesn't overflow).
Fixes: 5946d0893 ("ext4: check for overlapping extents in ext4_valid_extent_entries()")
Fixes: 2f974865f ("ext4: check for zero length extent explicitly")
Cc: Eryu Guan <guaneryu@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The jbd2 journal stores the commit time in 64-bit seconds and 32-bit
nanoseconds, which avoids an overflow in 2038, but it gets the numbers
from current_kernel_time(), which uses 'long' seconds on 32-bit
architectures.
This simply changes the code to call current_kernel_time64() so
we use 64-bit seconds consistently.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
So far we were tracking only dependency on transaction commit due to
starting a new handle (which may require commit to start a new
transaction). Now add tracking also for other cases where we wait for
transaction commit. This way lockdep can catch deadlocks e. g. because we
call jbd2_journal_stop() for a synchronous handle with some locks held
which rank below transaction start.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently lockdep map is tracked in each journal handle. To be able to
expand lockdep support to cover also other cases where we depend on
transaction commit and where handle is not available, move lockdep map
into struct journal_s. Since this makes the lockdep map shared for all
handles, we have to use rwsem_acquire_read() for acquisitions now.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The transaction the handle references is free to commit once we've
decremented t_updates counter. Move the lockdep instrumentation to that
place. Currently it was a bit later which did not really matter but
subsequent improvements to lockdep instrumentation would cause false
positives with it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
While sending the blocking directIO in fuse, the write request is broken
into sub-requests, each of default size 128k and all the requests are sent
in non-blocking background mode if async_dio mode is supported by libfuse.
The process which issue the write wait for the completion of all the
sub-requests. Sending multiple requests parallely gives a chance to perform
parallel writes in the user space fuse implementation if it is
multi-threaded and hence improves the performance.
When there is a size extending aio dio write, we switch to blocking mode so
that we can properly update the size of the file after completion of the
writes. However, in this situation all the sub-requests are sent in
serialized manner where the next request is sent only after receiving the
reply of the current request. Hence the multi-threaded user space
implementation is not utilized properly.
This patch changes the size extending aio dio behavior to exactly follow
blocking dio. For multi threaded fuse implementation having 10 threads and
using buffer size of 64MB to perform async directIO, we are getting double
the speed.
Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Negotiate with userspace filesystems whether they support parallel readdir
and lookup. Disable parallelism by default for fear of breaking fuse
filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 9902af79c0 ("parallel lookups: actual switch to rwsem")
Fixes: d9b3dbdcfd ("fuse: switch to ->iterate_shared()")
The simple_write_to_buffer() already increments the @ppos on success,
see fs/libfs.c simple_write_to_buffer() comment:
"
On success, the number of bytes written is returned and the offset @ppos
advanced by this number, or negative value is returned on error.
"
If the configfs_write_bin_file() is invoked with @count smaller than the
total length of the written binary file, it will be invoked multiple times.
Since configfs_write_bin_file() increments @ppos on success, after calling
simple_write_to_buffer(), the @ppos is incremented twice.
Subsequent invocation of configfs_write_bin_file() will result in the next
piece of data being written to the offset twice as long as the length of
the previous write, thus creating buffer with "holes" in it.
The simple testcase using DTO follows:
$ mkdir /sys/kernel/config/device-tree/overlays/1
$ dd bs=1 if=foo.dtbo of=/sys/kernel/config/device-tree/overlays/1/dtbo
Without this patch, the testcase will result in twice as big buffer in the
kernel, which is then passed to the cfs_overlay_item_dtbo_write() .
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
The two methods essentially do the same: find the real dentry/inode
belonging to an overlay dentry. The difference is in the usage:
vfs_open() uses ->d_select_inode() and expects the function to perform
copy-up if necessary based on the open flags argument.
file_dentry() uses ->d_real() passing in the overlay dentry as well as the
underlying inode.
vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real()
with a zero inode would have worked just as well here.
This patch merges the functionality of ->d_select_inode() into ->d_real()
by adding an 'open_flags' argument to the latter.
[Al Viro] Make the signature of d_real() match that of ->d_real() again.
And constify the inode argument, while we are at it.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Stable bugfixes:
- Fix _cancel_empty_pagelist
- Fix a double page unlock
- Make nfs_atomic_open() call d_drop() on all ->open_context() errors.
- Fix another OPEN_DOWNGRADE bug
Other bugfixes:
- Ensure we handle delegation errors in nfs4_proc_layoutget()
- Layout stateids start out as being invalid
- Add sparse lock annotations for pnfs_find_alloc_layout
- Handle bad delegation stateids in nfs4_layoutget_handle_exception
- Fix up O_DIRECT results
- Fix potential use after free of state in nfs4_do_reclaim.
- Mark the layout stateid invalid when all segments are removed
- Don't let readdirplus revalidate an inode that was marked as stale
- Fix potential race in nfs_fhget()
- Fix an unused variable warning
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXdDWkAAoJENfLVL+wpUDrpsUP/1F2zu12r/Zkv3ZEFhShcpQc
2N1TRD9X7Lruod2pUD95qqjjdw+/vu3LjcyJljrasRaJENijvZ2GQhKkB7xPODlu
qxZcmnQsH+WmpmJKcqAByAW1czcNGMoMHnt4tV0gG21NH+XUb92fgn+aGeIJDVrK
Hcd9d8TfnFWO70ZgTUXW/hv0CXwu4MEJhN2JfF4lolbxUkmjLHHLoxSDDm0AdXGC
EE8f0V9/7xurvOeLe5bQOQXfZPedBydsLNXa1ZacMGKgmBUoRNxJ5yCpPUtcTVBx
HwbiY+WDFQ7MdKTzUQqqbnrIqKw8Hu4SugIV/vHRqR+Lhc6u29YGOqdU4d2G8IKW
Nh8MBqS+dDefCkL3TJoE7MpjhP3EOO6HXnv5FjMZLOuu2X2o+Sz3+DkhYCq6pj/g
fFh480vZfZYaTsfDf1ttvVN8kIvQ+1Uk3LK6aC2EVwgPrv+0OIRu36F0JQQimxOp
EbYDlhVk7mzH/ZQ31GmPPbSIk+3sm2V58lqXnUMovoqPFPiN3xZDuBcnlyFrrzaI
NjOvsdVxkOdHWbYZyBQzj16Vo651EYbAAUEwsud70N8C3aCgkTxCZ30Q0v+KqqxU
pP5kz3zYUdkXQeHxE6T0iXG9fGcv/nGS21hTfT01YJCK7v67K8TYRNMrOEVURVgk
LSD/CZJXJHVJn1Vr4F7o
=IGNO
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"Stable bugfixes:
- Fix _cancel_empty_pagelist
- Fix a double page unlock
- Make nfs_atomic_open() call d_drop() on all ->open_context() errors.
- Fix another OPEN_DOWNGRADE bug
Other bugfixes:
- Ensure we handle delegation errors in nfs4_proc_layoutget()
- Layout stateids start out as being invalid
- Add sparse lock annotations for pnfs_find_alloc_layout
- Handle bad delegation stateids in nfs4_layoutget_handle_exception
- Fix up O_DIRECT results
- Fix potential use after free of state in nfs4_do_reclaim.
- Mark the layout stateid invalid when all segments are removed
- Don't let readdirplus revalidate an inode that was marked as stale
- Fix potential race in nfs_fhget()
- Fix an unused variable warning"
* tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Fix another OPEN_DOWNGRADE bug
make nfs_atomic_open() call d_drop() on all ->open_context() errors.
NFS: Fix an unused variable warning
NFS: Fix potential race in nfs_fhget()
NFS: Don't let readdirplus revalidate an inode that was marked as stale
NFSv4.1/pnfs: Mark the layout stateid invalid when all segments are removed
NFS: Fix a double page unlock
pnfs_nfs: fix _cancel_empty_pagelist
nfs4: Fix potential use after free of state in nfs4_do_reclaim.
NFS: Fix up O_DIRECT results
NFS/pnfs: handle bad delegation stateids in nfs4_layoutget_handle_exception
NFSv4.1/pnfs: Add sparse lock annotations for pnfs_find_alloc_layout
NFSv4.1/pnfs: Layout stateids start out as being invalid
NFSv4.1/pnfs: Ensure we handle delegation errors in nfs4_proc_layoutget()
When truncating a file we should check write access on the underlying
inode. And we should do so on the lower file as well (before copy-up) for
consistency.
Original patch and test case by Aihua Zhang.
- - >o >o - - test.c - - >o >o - -
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
int ret;
ret = truncate(argv[0], 4096);
if (ret != -1) {
fprintf(stderr, "truncate(argv[0]) should have failed\n");
return 1;
}
if (errno != ETXTBSY) {
perror("truncate(argv[0])");
return 1;
}
return 0;
}
- - >o >o - - >o >o - - >o >o - -
Reported-by: Aihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
When using the 'default_permissions' mount option, ovl_permission() on
non-directories was missing a dput(alias), resulting in "BUG Dentry still
in use".
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8d3095f4ad ("ovl: default permissions")
Cc: <stable@vger.kernel.org> # v4.5+
Olga Kornievskaia reports that the following test fails to trigger
an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE.
fd0 = open(foo, RDRW) -- should be open on the wire for "both"
fd1 = open(foo, RDONLY) -- should be open on the wire for "read"
close(fd0) -- should trigger an open_downgrade
read(fd1)
close(fd1)
The issue is that we're missing a check for whether or not the current
state transitioned from an O_RDWR state as opposed to having transitioned
from a combination of O_RDONLY and O_WRONLY.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: cd9288ffae ("NFSv4: Fix another bug in the close/open_downgrade code")
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The CEC ioctls didn't have compat32 support, so they returned -ENOTTY
when used in a 32 bit application on a 64 bit kernel.
Since all the CEC ioctls are 32-bit compatible adding support for this
API is trivial.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Add checks to notify_change to verify that uid and gid changes
will map into the superblock's user namespace. If they do not
fail with -EOVERFLOW.
This is mandatory so that fileystems don't have to even think
of dealing with ia_uid and ia_gid that
--EWB Moved the test from inode_change_ok to notify_change
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This isn't functionally apparent for some reason, but
when we test io at extreme offsets at the end of the loff_t
rang, such as in fstests xfs/071, the calculation of
"max" in dax_io() can be wrong due to pos + size overflowing.
For example,
# xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file
enters dax_io with:
start 0x7ffffffffffff000
end 0x7ffffffffffff200
and the rounded up "size" variable is 0x1000. This yields:
pos + size 0x8000000000000000 (overflows loff_t)
end 0x7ffffffffffff200
Due to the overflow, the min() function picks the wrong
value for the "max" variable, and when we send (max - pos)
into i.e. copy_from_iter_pmem() it is also the wrong value.
This somehow(tm) gets magically absorbed without incident,
probably because iter->count is correct. But it seems best
to fix it up properly by comparing the two values as
unsigned.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull cifs fixes from Steve French:
"Various small cifs/smb3 fixes, include some for stable, and some from
the recent SMB3 test event"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
File names with trailing period or space need special case conversion
Fix reconnect to not defer smb3 session reconnect long after socket reconnect
cifs: check hash calculating succeeded
cifs: dynamic allocation of ntlmssp blob
cifs: use CIFS_MAX_DOMAINNAME_LEN when converting the domain name
cifs: stuff the fl_owner into "pid" field in the lock request
When gfs2 attempts to write a page to a file that is being truncated,
and notices that the page is completely outside of the file size, it
tries to invalidate it. However, this may require a transaction for
journaled data files to revoke any buffers from the page on the active
items list. Unfortunately, this can happen inside a log flush, where a
transaction cannot be started. Also, gfs2 may need to be able to remove
the buffer from the ail1 list before it can finish the log flush.
To deal with this, when writing a page of a file with data journalling
enabled gfs2 now skips the check to see if the write is outside the file
size, and simply writes it anyway. This situation can only occur when
the truncate code still has the file locked exclusively, and hasn't
marked this block as free in the metadata (which happens later in
truc_dealloc). After gfs2 writes this page out, the truncation code
will shortly invalidate it and write out any revokes if necessary.
To do this, gfs2 now implements its own version of block_write_full_page
without the check, and calls the newly exported __block_write_full_page.
It also no longer calls gfs2_writepage_common from gfs2_jdata_writepage.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
gfs2 needs to be able to skip the check to see if a page is outside of
the file size when writing it out. gfs2 can get into a situation where
it needs to flush its in-memory log to disk while a truncate is in
progress. If the file being trucated has data journaling enabled, it is
possible that there are data blocks in the log that are past the end of
the file. gfs can't finish the log flush without either writing these
blocks out or revoking them. Otherwise, if the node crashed, it could
overwrite subsequent changes made by other nodes in the cluster when
it's journal was replayed.
Unfortunately, there is no way to add log entries to the log during a
flush. So gfs2 simply writes out the page instead. This situation can
only occur when the truncate code still has the file locked exclusively,
and hasn't marked this block as free in the metadata (which happens
later in truc_dealloc). After gfs2 writes this page out, the truncation
code will shortly invalidate it and write out any revokes if necessary.
In order to make this work, gfs2 needs to be able to skip the check for
writes outside the file size. Since the check exists in
block_write_full_page, this patch exports __block_write_full_page, which
doesn't have the check.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Make the code more readable by cleaning up the different ways of
initializing lock holders and checking for initialized lock holders:
mark lock holders as uninitialized by setting the holder's glock to NULL
(gfs2_holder_mark_uninitialized) instead of zeroing out the entire
object or using a separate flag. Recognize initialized holders by their
non-NULL glock (gfs2_holder_initialized). Don't zero out holder objects
which are immeditiately initialized via gfs2_holder_init or
gfs2_glock_nq_init.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Commit ff34245d switched from iget5_locked to iget_locked among other
things, but iget_locked doesn't work for filesystems larger than 2^32
blocks on 32-bit systems. Switch back to iget5_locked. Filesystems
larger than 2^32 blocks are unrealistic to work well on 32-bit systems,
so this is mostly a code cleanliness fix.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Now that gfs2_lookup_by_inum only takes the inode glock for new inodes
(and not for cached inodes anymore), there no longer is a need to
optimize the cached-inode case in gfs2_get_dentry or delete_work_func,
and gfs2_ilookup can be removed.
In addition, gfs2_get_dentry wasn't checking the GFS2_DIF_SYSTEM flag in
i_diskflags in the gfs2_ilookup case (see gfs2_lookup_by_inum); this
inconsistency goes away as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The current gfs2_lookup_by_inum takes the glock of a presumed inode
identified by block number, verifies that the block is indeed an inode,
and then instantiates and reads the new inode via gfs2_inode_lookup.
However, instantiating a new inode may block on freeing a previous
instance of that inode (__wait_on_freeing_inode), and freeing an inode
requires to take the glock already held, leading to lock inversion and
deadlock.
Fix this by first instantiating the new inode, then verifying that the
block is an inode (if required), and then reading in the new inode, all
in gfs2_inode_lookup.
If the block we are looking for is not an inode, we discard the new
inode via iget_failed, which marks inodes as bad and unhashes them.
Other tasks waiting on that inode will get back a bad inode back from
ilookup or iget_locked; in that case, retry the lookup.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code"
unconditional d_drop() after the ->open_context() had been removed. It had
been correct for success cases (there ->open_context() itself had been doing
dcache manipulations), but not for error ones. Only one of those (ENOENT)
got a compensatory d_drop() added in that commit, but in fact it should've
been done for all errors. As it is, the case of O_CREAT non-exclusive open
on a hashed negative dentry racing with e.g. symlink creation from another
client ended up with ->open_context() getting an error and proceeding to
call nfs_lookup(). On a hashed dentry, which would've instantly triggered
BUG_ON() in d_materialise_unique() (or, these days, its equivalent in
d_splice_alias()).
Cc: stable@vger.kernel.org # v3.10+
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Also, if we are going to issue the barrier, we should do this after we
write out the parent directories if necessary.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
If there are no pending blocks to be released after a commit, forcing
a journal commit has no hope of helping. It's possible that a commit
had just completed, so if there are now free blocks available for
allocation, it's worth retrying the commit.
Reported-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull btrfs fixes part 2 from Chris Mason:
"This has one patch from Omar to bring iterate_shared back to btrfs.
We have a tree of work we queue up for directory items and it doesn't
lend itself well to shared access. While we're cleaning it up, Omar
has changed things to use an exclusive lock when there are delayed
items"
* 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
Pull btrfs fixes from Chris Mason:
"I have a two part pull this time because one of the patches Dave
Sterba collected needed to be against v4.7-rc2 or higher (we used
rc4). I try to make my for-linus-xx branch testable on top of the
last major so we can hand fixes to people on the list more easily, so
I've split this pull in two.
This first part has some fixes and two performance improvements that
we've been testing for some time.
Josef's two performance fixes are most notable. The transid tracking
patch makes a big improvement on pretty much every workload"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: Force stripesize to the value of sectorsize
btrfs: fix disk_i_size update bug when fallocate() fails
Btrfs: fix error handling in map_private_extent_buffer
Btrfs: fix error return code in btrfs_init_test_fs()
Btrfs: don't do nocow check unless we have to
btrfs: fix deadlock in delayed_ref_async_start
Btrfs: track transid for delayed ref flushing
Commit fe742fd4f9 ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.
This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.
Tested with xfstests and by doing a parallel kernel build:
while make tinyconfig && make -j4 && git clean dqfx; do
:
done
along with a bunch of parallel finds in another shell:
while true; do
for ((i=0; i<4; i++)); do
find . >/dev/null &
done
wait
done
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Merge misc fixes from Andrew Morton:
"Two weeks worth of fixes here"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
autofs: don't get stuck in a loop if vfs_write() returns an error
mm/page_owner: avoid null pointer dereference
tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
fs/nilfs2: fix potential underflow in call to crc32_le
oom, suspend: fix oom_reaper vs. oom_killer_disable race
ocfs2: disable BUG assertions in reading blocks
mm, compaction: abort free scanner if split fails
mm: prevent KASAN false positives in kmemleak
mm/hugetlb: clear compound_mapcount when freeing gigantic pages
mm/swap.c: flush lru pvecs on compound page arrival
memcg: css_alloc should return an ERR_PTR value on error
memcg: mem_cgroup_migrate() may be called with irq disabled
hugetlb: fix nr_pmds accounting with shared page tables
Revert "mm: disable fault around on emulated access bit architecture"
Revert "mm: make faultaround produce old ptes"
mailmap: add Boris Brezillon's email
mailmap: add Antoine Tenart's email
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
mm: mempool: kasan: don't poot mempool objects in quarantine
...
__vfs_write() returns a negative value in a error case.
Link: http://lkml.kernel.org/r/20160616083108.6278.65815.stgit@pluto.themaw.net
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The value `bytes' comes from the filesystem which is about to be
mounted. We cannot trust that the value is always in the range we
expect it to be.
Check its value before using it to calculate the length for the crc32_le
call. It value must be larger (or equal) sumoff + 4.
This fixes a kernel bug when accidentially mounting an image file which
had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a
s_bytes value of 1. This caused an underflow when substracting sumoff +
4 (20) in the call to crc32_le.
BUG: unable to handle kernel paging request at ffff88021e600000
IP: crc32_le+0x36/0x100
...
Call Trace:
nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
nilfs_load_super_block+0x142/0x300 [nilfs2]
init_nilfs+0x60/0x390 [nilfs2]
nilfs_mount+0x302/0x520 [nilfs2]
mount_fs+0x38/0x160
vfs_kern_mount+0x67/0x110
do_mount+0x269/0xe00
SyS_mount+0x9f/0x100
entry_SYSCALL_64_fastpath+0x16/0x71
Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to some high-load testing, these two BUG assertions were
encountered, this led system panic. Actually, there were some
discussions about removing these two BUG() assertions, it would not
bring any side effect.
Then, I did the the following changes,
1) use the existing macro CATCH_BH_JBD_RACES to wrap BUG() in the
ocfs2_read_blocks_sync function like before.
2) disable the macro CATCH_BH_JBD_RACES in Makefile by default.
Link: http://lkml.kernel.org/r/1466574294-26863-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jbd2_alloc is explicit about its allocation preferences wrt. the
allocation size. Sub page allocations go to the slab allocator and
larger are using either the page allocator or vmalloc. This is all good
but the logic is unnecessarily complex.
1) as per Ted, the vmalloc fallback is a left-over:
: jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so
: the code path that calls vmalloc() should never get called. When we
: conveted jbd2_alloc() to suppor sub-page size allocations in commit
: d2eecb0393, there was an assumption that it could be called with a size
: greater than PAGE_SIZE, but that's certaily not true today.
Moreover vmalloc allocation might even lead to a deadlock because the
callers expect GFP_NOFS context while vmalloc is GFP_KERNEL.
2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored
since the flag was introduced.
Let's simplify the code flow and use the slab allocator for sub-page
requests and the page allocator for others. Even though order > 0 is
not currently used as per above leave that option open.
Link: http://lkml.kernel.org/r/1464599699-30131-18-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
POSIX allows files with trailing spaces or a trailing period but
SMB3 does not, so convert these using the normal Services For Mac
mapping as we do for other reserved characters such as
: < > | ? *
This is similar to what Macs do for the same problem over SMB3.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Pavel Shilovsky <pshilovsky@samba.org>
Azure server blocks clients that open a socket and don't do anything on it.
In our reconnect scenarios, we can reconnect the tcp session and
detect the socket is available but we defer the negprot and SMB3 session
setup and tree connect reconnection until the next i/o is requested, but
this looks suspicous to some servers who expect SMB3 negprog and session
setup soon after a socket is created.
In the echo thread, reconnect SMB3 sessions and tree connections
that are disconnected. A later patch will replay persistent (and
resilient) handle opens.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Pavel Shilovsky <pshilovsky@samba.org>
Use set_posix_acl, which includes proper permission checks, instead of
calling ->set_acl directly. Without this anyone may be able to grant
themselves permissions to a file by setting the ACL.
Lock the inode to make the new checks atomic with respect to set_acl.
(Also, nfsd was the only caller of set_acl not locking the inode, so I
suspect this may fix other races.)
This also simplifies the code, and ensures our ACLs are checked by
posix_acl_valid.
The permission checks and the inode locking were lost with commit
4ac7249e, which changed nfsd to use the set_acl inode operation directly
instead of going through xattr handlers.
Reported-by: David Sinquin <david@sinquin.eu>
[agreunba@redhat.com: use set_posix_acl]
Fixes: 4ac7249e
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Factor out part of posix_acl_xattr_set into a common function that takes
a posix_acl, which nfsd can also call.
The prototype already exists in include/linux/posix_acl.h.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If we don't set the mode correctly in nfs_init_locked(), then there is
potential for a race with a second call to nfs_fhget that will cause
inode aliasing.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
According to RFC5661, section 12.5.3. the layout stateid is no longer
valid once the client no longer holds any layout segments. Ensure that
we mark it invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit 0bcbf039f6, nfs_readpage_release() has been used to
unlock the page in the read code.
Fixes: 0bcbf039f6 ("nfs: handle request add failure properly")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
pnfs_generic_commit_cancel_empty_pagelist calls nfs_commitdata_release,
but that is wrong: nfs_commitdata_release puts the open context, something
that isn't valid until nfs_init_commit is called, which is never the case
when pnfs_generic_commit_cancel_empty_pagelist is called.
This was introduced in "nfs: avoid race that crashes nfs_init_commit".
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
if we read or wrote something, we must report it
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We must call nfs4_handle_exception() on BAD_STATEID errors. The only
exception is if the stateid argument turns out to be a layout stateid
that is declared invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
nfs4_handle_exception() relies on the caller setting the 'inode' field
in the struct nfs4_exception argument when the error applies to a
delegation.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem. Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.
This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.
This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.
As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.
On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.
As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
"bogus directory-entries per block" was reported for what was instead
bogus number of directory entries. The message also mismatched the
argument passed to printk(), which was sbi->dir_entries.
Fix this by replacing the message with "bogus number of directory
entries". printk() argument was kept unchanged.
Signed-off-by: Zheng Lv <lv.zheng.2015@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
1. Fixes the possibility of losing data upon a power cut when UBI tries
to recover from a write error.
2. Fixes page migration on UBIFS. It turned out that the default page
migration function is not suitable for UBIFS.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJXa5tEAAoJEEtJtSqsAOnW090P/RcQjIfVf2g3r8VRp38OQPbb
MTd4sD/rnyt5Eq0QYUPWG5xcYK2BWI1PwpdB81JvW5hxnXPgG8DpVIxjzt/7Xgnp
QheYe9tMfgYjDntz1rzGVa/uHSAldP9V4czgczrBW/0lwnRsZ6mLY1RA9Oz0hRdG
cp53I8CSD0DPyqU0XkgzLkzVUstmySwQ5i46C0kQEnlRcytReOLgcjSrXXn+/Zih
yZxhtDQSCKmQAfVmERggPXVHo8jFtVfej52ja7RFcMA2uXvXqljOBNCyLUYPdYka
XdQEKsXRLl69ktFUXwZwPAYAW23I8+PMpsoljHDVc0hF25p8omp3D+7HE18SsMSv
6RNnUwz+PDbiFApyoTu0SBgHN/OO9o6rjNNoRIInoKpk0NvWmrMQOo6BIFsX4yq1
0dOVJiKXVoFuo75Yw9mOKdrV/Z5P1TvgdTBj6g03aUM9vcX1Gz6+1xKkvcXGgh02
8qFDZdZ5L87TlpMkvtWO87Ir0ssrfjxpvxR8pPsxxqvxbfUuVmss4ILuh9AVSVk+
d1zrz30+JZzTbIrky/7R31i6Bx2+reYdTKiPIkST9sF5WblUPSeyUoKq1OlNRYxj
n+0Q8S5Tm/6AHXUOQFxurbXU+D7G7TaL/CsBeepvV/AqJb07+vBxUuGFH1rDbmLB
r5dTfOXn3iNEmmNyrhgN
=EDeX
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.7-rc5' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
"This contains fixes for two critical bugs in UBI and UBIFS:
- fix the possibility of losing data upon a power cut when UBI tries
to recover from a write error
- fix page migration on UBIFS. It turned out that the default page
migration function is not suitable for UBIFS"
* tag 'upstream-4.7-rc5' of git://git.infradead.org/linux-ubifs:
UBIFS: Implement ->migratepage()
mm: Export migrate_page_move_mapping and migrate_page_copy
ubi: Make recover_peb power cut aware
gpio: make library immune to error pointers
gpio: make sure gpiod_to_irq() returns negative on NULL desc
gpio: 104-idi-48: Fix missing spin_lock_init for ack_lock
calc_lanman_hash() could return -ENOMEM or other errors, we should check
that everything went fine before using the calculated key.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
In sess_auth_rawntlmssp_authenticate(), the ntlmssp blob is allocated
statically and its size is an "empirical" 5*sizeof(struct
_AUTHENTICATE_MESSAGE) (320B on x86_64). I don't know where this value
comes from or if it was ever appropriate, but it is currently
insufficient: the user and domain name in UTF16 could take 1kB by
themselves. Because of that, build_ntlmssp_auth_blob() might corrupt
memory (out-of-bounds write). The size of ntlmssp_blob in
SMB2_sess_setup() is too small too (sizeof(struct _NEGOTIATE_MESSAGE)
+ 500).
This patch allocates the blob dynamically in
build_ntlmssp_auth_blob().
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Currently in build_ntlmssp_auth_blob(), when converting the domain
name to UTF16, CIFS_MAX_USERNAME_LEN limit is used. It should be
CIFS_MAX_DOMAINNAME_LEN. This patch fixes this.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Right now, we send the tgid cross the wire. What we really want to send
though is a hashed fl_owner_t since samba treats this field as a generic
lockowner.
It turns out that because we enforce and release locks locally before
they are ever sent to the server, this patch makes no difference in
behavior. Still, setting OFD locks on the server using the process
pid seems wrong, so I think this patch still makes sense.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilovsky@samba.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Now that SB_I_NODEV controls the nodev behavior devpts can just clear
this flag during mount. Simplifying the code and making it easier
to audit how the code works. While still preserving the invariant
that s_iflags is only modified during mount.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace the implict setting of MNT_NODEV on mounts that happen with
just user namespace permissions with an implicit setting of SB_I_NODEV
in s_iflags. The visibility of the implicit MNT_NODEV has caused
problems in the past.
With this change the fragile case where an implicit MNT_NODEV needs to
be preserved in do_remount is removed. Using SB_I_NODEV is much less
fragile as s_iflags are set during the original mount and never
changed.
In do_new_mount with the implicit setting of MNT_NODEV gone, the only
code that can affect mnt_flags is fs_fully_visible so simplify the if
statement and reduce the indentation of the code to make that clear.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Verify all filesystems that we check in mount_too_revealing set
SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags. That is true for today
and it should remain true in the future.
Remove the now unnecessary checks from mnt_already_visibile that
ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are
preserved. Making the code shorter and easier to read.
Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible
MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures the many current
systems where proc and sysfs are mounted with "nosuid, nodev, noexec"
and several slightly buggy container applications don't bother to
set those flags continue to work.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV. Use this new function in all of the
places where MNT_NODEV was previously tested.
Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong. With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The cgroup filesystem is in the same boat as sysfs. No one ever
permits executables of any kind on the cgroup filesystem, and there is
no reasonable future case to support executables in the future.
Therefore move the setting of SB_I_NOEXEC which makes the code proof
against future mistakes of accidentally creating executables from
sysfs to kernfs itself. Making the code simpler and covering the
sysfs, cgroup, and cgroup2 filesystems.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Allowing a filesystem to be mounted by other than root in the initial
user namespace is a filesystem property not a mount namespace property
and as such should be checked in filesystem specific code. Move the
FS_USERNS_MOUNT test into super.c:sget_userns().
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Start marking filesystems with a user namespace owner, s_user_ns. In
this change this is only used for permission checks of who may mount a
filesystem. Ultimately s_user_ns will be used for translating ids and
checking capabilities for filesystems mounted from user namespaces.
The default policy for setting s_user_ns is implemented in sget(),
which arranges for s_user_ns to be set to current_user_ns() and to
ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
user_ns.
The guts of sget are split out into another function sget_userns().
The function sget_userns calls alloc_super with the specified user
namespace or it verifies the existing superblock that was found
has the expected user namespace, and fails with EBUSY when it is not.
This failing prevents users with the wrong privileges mounting a
filesystem.
The reason for the split of sget_userns from sget is that in some
cases such as mount_ns and kernfs_mount_ns a different policy for
permission checking of mounts and setting s_user_ns is necessary, and
the existence of sget_userns() allows those policies to be
implemented.
The helper mount_ns is expected to be used for filesystems such as
proc and mqueuefs which present per namespace information. The
function mount_ns is modified to call sget_userns instead of sget to
ensure the user namespace owner of the namespace whose information is
presented by the filesystem is used on the superblock.
For sysfs and cgroup the appropriate permission checks are already in
place, and kernfs_mount_ns is modified to call sget_userns so that
the init_user_ns is the only user namespace used.
For the cgroup filesystem cgroup namespace mounts are bind mounts of a
subset of the full cgroup filesystem and as such s_user_ns must be the
same for all of them as there is only a single superblock.
Mounts of sysfs that vary based on the network namespace could in principle
change s_user_ns but it keeps the analysis and implementation of kernfs
simpler if that is not supported, and at present there appear to be no
benefits from supporting a different s_user_ns on any sysfs mount.
Getting the details of setting s_user_ns correct has been
a long process. Thanks to Pavel Tikhorirorv who spotted a leak
in sget_userns. Thanks to Seth Forshee who has kept the work alive.
Thanks-to: Seth Forshee <seth.forshee@canonical.com>
Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Move the call of get_pid_ns, the call of proc_parse_options, and
the setting of s_iflags into proc_fill_super so that mount_ns
can be used.
Convert proc_mount to call mount_ns and remove the now unnecessary
code.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Reviewed-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.
Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.
Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.
Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace the call of fs_fully_visible in do_new_mount from before the
new superblock is allocated with a call of mount_too_revealing after
the superblock is allocated. This winds up being a much better location
for maintainability of the code.
The first change this enables is the replacement of FS_USERNS_VISIBLE
with SB_I_USERNS_VISIBLE. Moving the flag from struct filesystem_type
to sb_iflags on the superblock.
Unfortunately mount_too_revealing fundamentally needs to touch
mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
times. If the mnt_flags did not need to be touched the code
could be easily moved into the filesystem specific mount code.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.
This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When doing truncate operation, btrfs_setsize() will first call
truncate_setsize() to set new inode->i_size, but if later
btrfs_truncate() fails, btrfs_setsize() will call
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
inmemory inode size, now bug occurs. It's because for truncate
case btrfs_ordered_update_i_size() directly uses inode->i_size
to update BTRFS_I(inode)->disk_i_size, indeed we should use the
"offset" argument to update disk_i_size. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.
And below test case can reveal this bug:
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint
echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
i=$((i + 1))
dst_offset=$((blocksize * i))
xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
testfile > /dev/null
done
sync
truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit
In this case, truncate operation will fail for enospc reason and
"du -sh testfile" returns value greater than 0, but testfile's
size is 0, we need to reflect correct inode->i_size.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
than pagesize,
2. when it detects something insane.
The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.
Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Fix to return a negative error code from the kern_mount() error handling
case instead of 0(ret is set to 0 by register_filesystem), as done
elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.
The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@
- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow. So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence
fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
--end_fsync=1
we get the worst case scenario where we have to fall back on to doing the check
anyway.
Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
We get twice the throughput, half of the runtime, and half of the average
latency. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
[ PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
"Btrfs: track transid for delayed ref flushing" was deadlocking on
btrfs_attach_transaction because its not safe to call from the async
delayed ref start code. This commit brings back btrfs_join_transaction
instead and checks for a blocked commit.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling. This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore. So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling. If the transaction we
get is different then we can just bail out. This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes). Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
During page migrations UBIFS might get confused
and the following assert triggers:
[ 213.480000] UBIFS assert failed in ubifs_set_page_dirty at 1451 (pid 436)
[ 213.490000] CPU: 0 PID: 436 Comm: drm-stress-test Not tainted 4.4.4-00176-geaa802524636-dirty #1008
[ 213.490000] Hardware name: Allwinner sun4i/sun5i Families
[ 213.490000] [<c0015e70>] (unwind_backtrace) from [<c0012cdc>] (show_stack+0x10/0x14)
[ 213.490000] [<c0012cdc>] (show_stack) from [<c02ad834>] (dump_stack+0x8c/0xa0)
[ 213.490000] [<c02ad834>] (dump_stack) from [<c0236ee8>] (ubifs_set_page_dirty+0x44/0x50)
[ 213.490000] [<c0236ee8>] (ubifs_set_page_dirty) from [<c00fa0bc>] (try_to_unmap_one+0x10c/0x3a8)
[ 213.490000] [<c00fa0bc>] (try_to_unmap_one) from [<c00fadb4>] (rmap_walk+0xb4/0x290)
[ 213.490000] [<c00fadb4>] (rmap_walk) from [<c00fb1bc>] (try_to_unmap+0x64/0x80)
[ 213.490000] [<c00fb1bc>] (try_to_unmap) from [<c010dc28>] (migrate_pages+0x328/0x7a0)
[ 213.490000] [<c010dc28>] (migrate_pages) from [<c00d0cb0>] (alloc_contig_range+0x168/0x2f4)
[ 213.490000] [<c00d0cb0>] (alloc_contig_range) from [<c010ec00>] (cma_alloc+0x170/0x2c0)
[ 213.490000] [<c010ec00>] (cma_alloc) from [<c001a958>] (__alloc_from_contiguous+0x38/0xd8)
[ 213.490000] [<c001a958>] (__alloc_from_contiguous) from [<c001ad44>] (__dma_alloc+0x23c/0x274)
[ 213.490000] [<c001ad44>] (__dma_alloc) from [<c001ae08>] (arm_dma_alloc+0x54/0x5c)
[ 213.490000] [<c001ae08>] (arm_dma_alloc) from [<c035cecc>] (drm_gem_cma_create+0xb8/0xf0)
[ 213.490000] [<c035cecc>] (drm_gem_cma_create) from [<c035cf20>] (drm_gem_cma_create_with_handle+0x1c/0xe8)
[ 213.490000] [<c035cf20>] (drm_gem_cma_create_with_handle) from [<c035d088>] (drm_gem_cma_dumb_create+0x3c/0x48)
[ 213.490000] [<c035d088>] (drm_gem_cma_dumb_create) from [<c0341ed8>] (drm_ioctl+0x12c/0x444)
[ 213.490000] [<c0341ed8>] (drm_ioctl) from [<c0121adc>] (do_vfs_ioctl+0x3f4/0x614)
[ 213.490000] [<c0121adc>] (do_vfs_ioctl) from [<c0121d30>] (SyS_ioctl+0x34/0x5c)
[ 213.490000] [<c0121d30>] (SyS_ioctl) from [<c000f2c0>] (ret_fast_syscall+0x0/0x34)
UBIFS is using PagePrivate() which can have different meanings across
filesystems. Therefore the generic page migration code cannot handle this
case correctly.
We have to implement our own migration function which basically does a
plain copy but also duplicates the page private flag.
UBIFS is not a block device filesystem and cannot use buffer_migrate_page().
Cc: stable@vger.kernel.org
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[rw: Massaged changelog, build fixes, etc...]
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Christoph Hellwig <hch@lst.de>
Pull userns fix from Eric Biederman:
"This contains just a single small patch that fixes a tiny hole in the
logic of allowing unprivileged mounting of proc and sysfs.
In practice I don't think anyone is affected because having MNT_RDONLY
clear in mnt->mnt_flags but MS_RDONLY set in sb->s_flags is very weird
for a filesystem, and weirder for proc and sysfs. However if it
happens let's handle it correctly and then no one has to to worry
about this crazy case"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: Account for MS_RDONLY in fs_fully_visible
While COMMIT has the potential to free up a lot of memory that is being
taken by unstable writes, it isn't guaranteed to free up this particular
page. Also, calling fsync() on the server is expensive and so we want to
do it in a more controlled fashion, rather than have it triggered at
random by the VM.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Unless the user is using file locking, we must assume close-to-open
cache consistency when the file is open for writing. Adjust the
caching algorithm so that it does not clear the cache on out-of-order
writes and/or attribute revalidations.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This config option can be used to disable the
LOG_INFO recovery messages.
Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Create a common function to calculate the maximum height of a per-AG
btree. This will eventually be used by the rmapbt and refcountbt
code to calculate appropriate maxlevels values for each. This is
important because the verifiers and the transaction block
reservations depend on accurate estimates of how many blocks are
needed to satisfy a btree split.
We were mistakenly using the max bnobt height for all the btrees,
which creates a dangerous situation since the larger records and
keys in an rmapbt make it very possible that the rmapbt will be
taller than the bnobt and so we can run out of transaction block
reservation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In struct xfs_bmap_free, convert the open-coded free extent list to
a regular list, then use list_sort to sort it prior to processing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Break up xfs_free_extent() into a helper that fixes the freelist.
This helper will be used subsequently to ensure the freelist during
deferred rmap processing.
[darrick: refactor to put this at the head of the patchset]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This is already in xfsprogs' libxfs, so port it to the kernel.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Currently we don't check the error_tag when someone's trying to set up
error injection testing. If userspace passes in a value we don't know
about, send back an error. This will help xfstests to _notrun a test
that uses error injection to test things like log replay.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create a second buf_trylock tracepoint so that we can distinguish
between a successful and a failed trylock. With this piece, we can
use a script to look at the ftrace output to detect buffer deadlocks.
[dchinner: update to if/else as per hch's suggestion]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Some of the directory/attr structures contain variable-length objects,
so the enclosing structure doesn't have a meaningful fixed size at
compile time. We can check the offsets of the members before the
variable-length member, so do those.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_reserve_blocks() is responsible to update the XFS reserved block
pool count at mount time or based on user request. When the caller
requests to increase the reserve pool, blocks must be allocated from
the global counters such that they are no longer available for
general purpose use. If the requested reserve pool size is too
large, XFS reserves what blocks are available. The implementation
requires looking at the percpu counters and making an educated guess
as to how many blocks to try and allocate from xfs_mod_fdblocks(),
which can return -ENOSPC if the guess was not accurate due to
counters being modified in parallel.
xfs_reserve_blocks() retries the guess in this scenario until the
allocation succeeds or it is determined that there is no space
available in the fs. While not easily reproducible in the current
form, the retry code doesn't actually work correctly if
xfs_mod_fdblocks() actually fails. The problem is that the percpu
calculations use the m_resblks counter to determine how many blocks
to allocate, but unconditionally update m_resblks before the block
allocation has actually succeeded. Therefore, if xfs_mod_fdblocks()
fails, the code jumps to the retry label and uses the already
updated m_resblks value to determine how many blocks to try and
allocate. If the percpu counters previously suggested that the
entire request was available, fdblocks_delta could end up set to 0.
In that case, m_resblks is updated to the requested value, yet no
blocks have been reserved at all.
Refactor xfs_reserve_blocks() to use an explicit loop and make the
code easier to follow. Since we have to drop the spinlock across the
xfs_mod_fdblocks() call, use a delta value for m_resblks as well and
only apply the delta once allocation succeeds.
[dchinner: convert to do {} while() loop]
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The filesystem quiesce sequence performs the operations necessary to
drain all background work, push pending transactions through the log
infrastructure and wait on I/O resulting from the final AIL push. We
have had reports of remount,ro hangs in xfs_log_quiesce() ->
xfs_wait_buftarg(), however, and some instrumentation code to detect
transaction commits at this point in the quiesce sequence has inculpated
the eofblocks background scanner as a cause.
While higher level remount code generally prevents user modifications by
the time the filesystem has made it to xfs_log_quiesce(), the background
scanner may still be alive and can perform pending work at any time. If
this occurs between the xfs_log_force() and xfs_wait_buftarg() calls
within xfs_log_quiesce(), this can lead to an indefinite lockup in
xfs_wait_buftarg().
To prevent this problem, cancel the background eofblocks scan worker
during the remount read-only quiesce sequence. This suspends background
trimming when a filesystem is remounted read-only. This is only done in
the remount path because the freeze codepath has already locked out new
transactions by the time the filesystem attempts to quiesce (and thus
waiting on an active work item could deadlock). Kick the eofblocks
worker to pick up where it left off once an fs is remounted back to
read-write.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Instead punch the whole first, and the use the our zeroing helper
to punch out the edge blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We now skip holes in it, so no need to have the caller do it as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We'll want to use this code for large offsets now that we're
skipping holes and unwritten extents efficiently. Also rename it to
xfs_zero_range to be a bit more descriptive, and tell the caller if
we actually did any zeroing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Note that this removes support for the untested FIEMAP_FLAG_XATTR. It
could be added relatively easily with iomap ops for the attr fork, but
without test coverage I don't feel safe doing this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Convert XFS to use the new iomap based multipage write path. This involves
implementing the ->iomap_begin and ->iomap_end methods, and switching the
buffered file write, page_mkwrite and xfs_iozero paths to the new iomap
helpers.
With this change __xfs_get_blocks will never be used for buffered writes,
and the code handling them can be removed.
Based on earlier code from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Currently zeroing out blocks and waiting for writeout is a bit of a mess in
truncate. This patch gives it a clear order in preparation for the iomap
path:
(1) we first wait for any direct I/O to complete to prevent any races
for it
(2) we then perform the actual zeroing, and only use the truncate_page
helpers for truncating down. The truncate up case already is
handled by the separate call to xfs_zero_eof.
(3) only then we write back dirty data, as zeroing block may cause
dirty pages when using either xfs_zero_eof or the new iomap
infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
And ensure it works for RT subvolume files an set the block device,
both of which will be needed to be able to use the function in the
buffered write path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add a simple fiemap implementation based on iomap_ops, partially based
on a previous implementation from Bob Peterson <rpeterso@redhat.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This avoid needing a separate inefficient get_block based DAX zero_range
implementation in file systems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add infrastructure for multipage buffered writes. This is implemented
using an main iterator that applies an actor function to a range that
can be written.
This infrastucture is used to implement a buffered write helper, one
to zero file ranges and one to implement the ->page_mkwrite VM
operations. All of them borrow a fair amount of code from fs/buffers.
for now by using an internal version of __block_write_begin that
gets passed an iomap and builds the corresponding buffer head.
The file system is gets a set of paired ->iomap_begin and ->iomap_end
calls which allow it to map/reserve a range and get a notification
once the write code is finished with it.
Based on earlier code from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Return nth positive child after given or NULL if there's
less than n left. dcache_readdir() and dcache_dir_lseek()
switched to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure that directory is locked shared in dcache_dir_lseek();
for dcache_readdir() it's already tru, and that's enough to make
simple_positive() stable.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs fixes from Al Viro:
"A couple more of d_walk()/d_subdirs reordering fixes (stable fodder;
ought to solve that crap for good) and a fix for a brown paperbag bug
in d_alloc_parallel() (this cycle)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix idiotic braino in d_alloc_parallel()
autofs races
much milder d_walk() race
Noticed some minor spelling errors when looking through the code.
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Check for d_unhashed() while searching in in-lookup hash was absolutely
wrong. Worse, it masked a deadlock on dget() done under bitlock that
nests inside ->d_lock. Thanks to J. R. Okajima for spotting it.
Spotted-by: "J. R. Okajima" <hooanon05g@gmail.com>
Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull UDF fixes and a reiserfs fix from Jan Kara:
"A couple of udf fixes (most notably a bug in parsing UDF partitions
which led to inability to mount recent Windows installation media) and
a reiserfs fix for handling kstrdup failure"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: check kstrdup failure
udf: Use correct partition reference number for metadata
udf: Use IS_ERR when loading metadata mirror file entry
udf: Don't BUG on missing metadata partition descriptor
The quota subsystem has two formats, the old v1 format using architecture
specific time_t values on the on-disk format, while the v2 format
(introduced in Linux 2.5.16 and 2.4.22) uses fixed 64-bit little-endian.
While there is no future for the v1 format beyond y2038, the v2 format
is almost there on 32-bit architectures, as both the user interface
and the on-disk format use 64-bit timestamps, just not the time_t
inbetween.
This changes the internal representation to use time64_t, which will
end up doing the right thing everywhere for v2 format.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Here are a small number of debugfs, ISA, and one driver core fix for 4.7-rc4.
All of these resolve reported issues. The ISA ones have spent the least
amount of time in linux-next, sorry about that, I didn't realize they
were regressions that needed to get in now (thanks to Thorsten for the
prodding!) but they do all pass the 0-day bot tests. The others have
been in linux-next for a while now.
Full details about them are in the shortlog below.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAldlbeQACgkQMUfUDdst+ymnFACfaWhEKA/84jwNNHiim92diJrY
zYsAoLOmpBw68yL6qTSZbcWJF4Flb6Xk
=N8M2
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are a small number of debugfs, ISA, and one driver core fix for
4.7-rc4.
All of these resolve reported issues. The ISA ones have spent the
least amount of time in linux-next, sorry about that, I didn't realize
they were regressions that needed to get in now (thanks to Thorsten
for the prodding!) but they do all pass the 0-day bot tests. The
others have been in linux-next for a while now.
Full details about them are in the shortlog below"
* tag 'driver-core-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
isa: Dummy isa_register_driver should return error code
isa: Call isa_bus_init before dependent ISA bus drivers register
watchdog: ebc-c384_wdt: Allow build for X86_64
iio: stx104: Allow build for X86_64
gpio: Allow PC/104 devices on X86_64
isa: Allow ISA-style drivers on modern systems
base: make module_create_drivers_dir race-free
debugfs: open_proxy_open(): avoid double fops release
debugfs: full_proxy_open(): free proxy on ->open() failure
kernel/kcov: unproxify debugfs file's fops
Pull btrfs fixes from Chris Mason:
"The most user visible change here is a fix for our recent superblock
validation checks that were causing problems on non-4k pagesized
systems"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize
btrfs: remove build fixup for qgroup_account_snapshot
btrfs: use new error message helper in qgroup_account_snapshot
btrfs: avoid blocking open_ctree from cleaner_kthread
Btrfs: don't BUG_ON() in btrfs_orphan_add
btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
Btrfs: check if extent buffer is aligned to sectorsize
btrfs: Use correct format specifier
Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
restricting stripesize to be equal to sectorsize would cause super block
validation to return an error on architectures where PAGE_SIZE is not
equal to 4096.
Hence as a workaround, this commit allows stripesize to be set to 4096
bytes.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced in 2c1984f244 ("btrfs: build fixup for
qgroup_account_snapshot") as temporary bisectability build fixup.
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes a problem introduced in commit 2f3165ecf1
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".
open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.
Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.
When filesystems are mounted read-only and later remounted read-write,
open_ctree did not set fs_info->open and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread get confused by the common case of "/" filesystem
read-only mount followed by read-write remount.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The test for !trans->blocks_used in btrfs_abort_transaction is
insufficient to determine whether it's safe to drop the transaction
handle on the floor. btrfs_cow_block, informed by should_cow_block,
can return blocks that have already been CoW'd in the current
transaction. trans->blocks_used is only incremented for new block
allocations. If an operation overlaps the blocks in the current
transaction entirely and must abort the transaction, we'll happily
let it clean up the trans handle even though it may have modified
the blocks and will commit an incomplete operation.
In the long-term, I'd like to do closer tracking of when the fs
is actually modified so we can still recover as gracefully as possible,
but that approach will need some discussion. In the short term,
since this is the only code using trans->blocks_used, let's just
switch it to a bool indicating whether any blocks were used and set
it when should_cow_block returns false.
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.
This adds a warning to let us quickly know what was happening.
Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Component mirror_num of struct btrfsic_block is defined
as unsigned int. Use %u as format specifier.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In gfs2_init_inode_once, initialize inode->i_iopen_gh.gh_gl to NULL:
otherwise, when gfs2_inode_lookup fails, the iopen glock holder can
remain unset and iget_failed can end up accessing random memory.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXYsiKAAoJECebzXlCjuG+PK8P/jGBS+h7Zf4GWOOsWN5cbEs/
8VTn83KXFp/feFhGikUIcAZQJRynDK+tD9Vh2FznC2zKDTLFPS0mAvL3tZyQhUO2
nEWaCUOFR+sB3aTPlMGwxbGc7NHNQg1hKqKgqcLEEqtozxhFQye3WW0MZNfFCiUZ
qpq2tK1OGGhJVIp7wWSa8+B2nGFMuasPaGM2OVJrebip49yTG/tT3rwKxKMoB8kS
i8BwNejoP1KRD6LqvpgdV1ESzkdyokDxKXCrdY/j2lMdp2YRe+cWmX239ojjvm8G
n9Ow8DYCefuiKiF6iCLZfxpX8dcmVJvT6g+k+9V63A4YCyuGhy/CneA3MO4QyLhq
yfe2zviJ2kZVz+1Ih3v9kD7ZkyK1hjrxXx/VPrI5CBIXE5eVXin2ZDvTCSoV491h
g1zscPc9Thgk6gKXsvkaVOXxLHBoUzXeSRbNqVXXZfjl+s4TXLNJ0lcaBYkzh74/
SypiFeNHjsjNpJYz5GptlbMUpaEoeyH0Y+OiH8d5Jf8hCcQ+CLjKgKSuCH5zrypt
Lx3U5QWHTT3IXH4QS/njcTSfSDu7BUip4RTLzw6C/ZJf7hd6SS4Xv72J6ZmeDSmg
146MpAYty8HB04KQWpYx0DGI7UEPlubfRHSF9XzsSitbRtNGr6xvIug8fkKBlXDB
aHtr+/gI7UvrmnXnlGdD
=aNkl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
"Oleg Drokin found and fixed races in the nfsd4 state code that go back
to the big nfs4_lock_state removal around 3.17 (but that were also
probably hard to reproduce before client changes in 3.20 allowed the
client to perform parallel opens).
Also fix a 4.1 backchannel crash due to rpc multipath changes in 4.6.
Trond acked the client-side rpc fixes going through my tree"
* tag 'nfsd-4.7-1' of git://linux-nfs.org/~bfields/linux:
nfsd: Make init_open_stateid() a bit more whole
nfsd: Extend the mutex holding region around in nfsd4_process_open2()
nfsd: Always lock state exclusively.
rpc: share one xps between all backchannels
nfsd4/rpc: move backchannel create logic into rpc code
SUNRPC: fix xprt leak on xps allocation failure
nfsd: Fix NFSD_MDS_PR_KEY on 32-bit by adding ULL postfix
Pull overlayfs fixes from Miklos Szeredi:
"This contains two regression fixes: one for the xattr API update and
one for using the mounter's creds in file creation in overlayfs.
There's also a fix for a bug in handling hard linked AF_UNIX sockets
that's been there from day one. This fix is overlayfs only despite
the fact that it touches code outside the overlay filesystem: d_real()
is an identity function for all except overlay dentries"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix uid/gid when creating over whiteout
ovl: xattr filter fix
af_unix: fix hard linked sockets on overlay
vfs: add d_real_inode() helper
Move the state selection logic inside from the caller,
always making it return correct stp to use.
Signed-off-by: J . Bruce Fields <bfields@fieldses.org>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
To avoid racing entry into nfs4_get_vfs_file().
Make init_open_stateid() return with locked stateid to be unlocked
by the caller.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It used to be the case that state had an rwlock that was locked for write
by downgrades, but for read for upgrades (opens). Well, the problem is
if there are two competing opens for the same state, they step on
each other toes potentially leading to leaking file descriptors
from the state structure, since access mode is a bitmap only set once.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If dotdot directory is corrupted, its slot may be ocupied by another
file. In this case, dentry[1] is not the parent directory. Rename and
cross-rename will update the inode in dentry[1] incorrectly. This
patch finds dotdot dentry by name.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
[Jaegeuk Kim: remove wron bug_on]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an attribute revalidation fails, then we already know that we'll
zap the access cache. If, OTOH, the inode isn't changing, there should
be no need to eject access calls just because they are old.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Also simplify the logic a bit.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Trond Myklebust <trondmy@primarydata.com>
Fix a regression when creating a file over a whiteout. The new
file/directory needs to use the current fsuid/fsgid, not the ones from the
mounter's credentials.
The refcounting is a bit tricky: prepare_creds() sets an original refcount,
override_creds() gets one more, which revert_cred() drops. So
1) we need to expicitly put the mounter's credentials when overriding
with the updated one
2) we need to put the original ref to the updated creds (and this can
safely be done before revert_creds(), since we'll still have the ref
from override_creds()).
Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Fixes: 3fe6e52f06 ("ovl: override creds with the ones from the superblock mounter")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Debugfs' open_proxy_open(), the ->open() installed at all inodes created
through debugfs_create_file_unsafe(),
- grabs a reference to the original file_operations instance passed to
debugfs_create_file_unsafe() via fops_get(),
- installs it at the file's ->f_op by means of replace_fops()
- and calls fops_put() on it.
Since the semantics of replace_fops() are such that the reference's
ownership is transferred, the subsequent fops_put() will result in a double
release when the file is eventually closed.
Currently, this is not an issue since fops_put() basically does a
module_put() on the file_operations' ->owner only and there don't exist any
modules calling debugfs_create_file_unsafe() yet. This is expected to
change in the future though, c.f. commit c646880814 ("debugfs: add
support for self-protecting attribute file fops").
Remove the call to fops_put() from open_proxy_open().
Fixes: 9fd4dcece4 ("debugfs: prevent access to possibly dead
file_operations at file open")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Debugfs' full_proxy_open(), the ->open() installed at all inodes created
through debugfs_create_file(),
- grabs a reference to the original struct file_operations instance passed
to debugfs_create_file(),
- dynamically allocates a proxy struct file_operations instance wrapping
the original
- and installs this at the file's ->f_op.
Afterwards, it calls the original ->open() and passes its return value back
to the VFS layer.
Now, if that return value indicates failure, the VFS layer won't ever call
->release() and thus, neither the reference to the original file_operations
nor the memory for the proxy file_operations will get released, i.e. both
are leaked.
Upon failure of the original fops' ->open(), undo the proxy installation.
That is:
- Set the struct file ->f_op to what it had been when full_proxy_open()
was entered.
- Drop the reference to the original file_operations.
- Free the memory holding the proxy file_operations.
Fixes: 49d200deaa ("debugfs: prevent access to removed files' private
data")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In rare cases it is possible for s_flags & MS_RDONLY to be set but
MNT_READONLY to be clear. This starting combination can cause
fs_fully_visible to fail to ensure that the new mount is readonly.
Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
is set on the source filesystem of the mount.
In general both MS_RDONLY and MNT_READONLY are set at the same for
mounts so I don't expect any programs to care. Nor do I expect
MS_RDONLY to be set on proc or sysfs in the initial user namespace,
which further decreases the likelyhood of problems.
Which means this change should only affect system configurations by
paranoid sysadmins who should welcome the additional protection
as it keeps people from wriggling out of their policies.
Cc: stable@vger.kernel.org
Fixes: 8c6cf9cc82 ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ramoops is one of the remaining places where ARM vendors still rely on
board-specific shims. Device Tree lets us replace those shims with
generic code.
These bindings mirror the ramoops module parameters, with two small
differences:
(1) dump_oops becomes an optional "no-dump-oops" property, since ramoops
sets dump_oops=1 by default.
(2) mem_type=1 becomes the more self-explanatory "unbuffered" property.
Signed-off-by: Greg Hackmann <ghackmann@google.com>
[fixed platform_get_drvdata() crash, thanks to Brian Norris]
[switched from u64 to u32 to simplify code, various whitespace fixes]
[use dev_of_node() to gain code-elimination for CONFIG_OF=n]
Signed-off-by: Kees Cook <keescook@chromium.org>
On 32-bit:
fs/nfsd/blocklayout.c: In function ‘nfsd4_block_get_device_info_scsi’:
fs/nfsd/blocklayout.c:337: warning: integer constant is too large for ‘long’ type
fs/nfsd/blocklayout.c:344: warning: integer constant is too large for ‘long’ type
fs/nfsd/blocklayout.c: In function ‘nfsd4_scsi_fence_client’:
fs/nfsd/blocklayout.c:385: warning: integer constant is too large for ‘long’ type
Add the missing "ULL" postfix to 64-bit constant NFSD_MDS_PR_KEY to fix
this.
Fixes: f99d4fbdae ("nfsd: add SCSI layout support")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
mkdir sync_dirty_inode
- init_inode_metadata
- lock_page(node)
- make_empty_dir
- filemap_fdatawrite()
- do_writepages
- lock_page(data)
- write_page(data)
- lock_page(node)
- f2fs_init_acl
- error
- truncate_inode_pages
- lock_page(data)
So, we don't need to truncate data pages in this error case, which will
be done by f2fs_evict_inode.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This mount option is to enable original log-structured filesystem forcefully.
So, there should be no random writes for main area.
Especially, this supports host-managed SMR device.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there were outstanding writes then chalk up the unexpected change
attribute on the server to them.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This change fixes also a buffer overflow which was caused by
accessing address space beyond mapped page
Signed-off-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
There is nothing worse than just allocated inode without being
initialized _once().
Signed-off-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Every successful mount two structs vxfs_fsh were not released.
Signed-off-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
* make autofs4_expire_indirect() skip the dentries being in process of
expiry
* do *not* mess with list_move(); making sure that dentry with
AUTOFS_INF_EXPIRING are not picked for expiry is enough.
* do not remove NO_RCU when we set EXPIRING, don't bother with smp_mb()
there. Clear it at the same time we clear EXPIRING. Makes a bunch of
tests simpler.
* rename NO_RCU to WANT_EXPIRE, which is what it really is.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Noe that we're mixing in the parent pointer earlier, we
don't need to use hash_32() to mix its bits. Instead, we can
just take the msbits of the hash value directly.
For those applications which use the partial_name_hash(),
move the multiply to end_name_hash.
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.
A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.
Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Limit the socket incoming call backlog queue size so that a remote client
can't pump in sufficient new calls that the server runs out of memory. Note
that this is partially theoretical at the moment since whilst the number of
calls is limited, the number of packets trying to set up new calls is not.
This will be addressed in a later patch.
If the caller of listen() specifies a backlog INT_MAX, then they get the
current maximum; anything else greater than max_backlog or anything
negative incurs EINVAL.
The limit on the maximum queue size can be set by:
echo N >/proc/sys/net/rxrpc/max_backlog
where 4<=N<=32.
Further, set the default backlog to 0, requiring listen() to be called
before we start actually queueing new calls. Whilst this kind of is a
change in the UAPI, the caller can't actually *accept* new calls anyway
unless they've first called listen() to put the socket into the LISTENING
state - thus the aforementioned new calls would otherwise just sit there,
eating up kernel memory. (Note that sockets that don't have a non-zero
service ID bound don't get incoming calls anyway.)
Given that the default backlog is now 0, make the AFS filesystem call
kernel_listen() to set the maximum backlog for itself.
Possible improvements include:
(1) Trimming a too-large backlog to max_backlog when listen is called.
(2) Trimming the backlog value whenever the value is used so that changes
to max_backlog are applied to an open socket automatically. Note that
the AFS filesystem opens one socket and keeps it open for extended
periods, so would miss out on changes to max_backlog.
(3) Having a separate setting for the AFS filesystem.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull btrfs fixes from Chris Mason:
"Has some fixes and some new self tests for btrfs. The self tests are
usually disabled in the .config file (unless you're doing btrfs dev
work), and this bunch is meant to find problems with the 64K page size
patches.
Jeff has a patch to help people see if they are using the hardware
assist crc32c module, which really helps us nail down problems when
people ask why crcs are using so much CPU.
Otherwise, it's small fixes"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
Btrfs: self-tests: Use macros instead of constants and add missing newline
Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
btrfs: advertise which crc32c implementation is being used at module load
Btrfs: add validadtion checks for chunk loading
Btrfs: add more validation checks for superblock
Btrfs: clear uptodate flags of pages in sys_array eb
Btrfs: self-tests: Support non-4k page size
Btrfs: Fix integer overflow when calculating bytes_per_bitmap
Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
Btrfs: end transaction if we abort when creating uuid root
btrfs: Use __u64 in exported linux/btrfs.h.
Merge filesystem stacking fixes from Jann Horn.
* emailed patches from Jann Horn <jannh@google.com>:
sched: panic on corrupted stack end
ecryptfs: forbid opening files without mmap handler
proc: prevent stacking filesystems on top
This prevents users from triggering a stack overflow through a recursive
invocation of pagefault handling that involves mapping procfs files into
virtual memory.
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This prevents stacking filesystems (ecryptfs and overlayfs) from using
procfs as lower filesystem. There is too much magic going on inside
procfs, and there is no good reason to stack stuff on top of procfs.
(For example, procfs does access checks in VFS open handlers, and
ecryptfs by design calls open handlers from a kernel thread that doesn't
drop privileges or so.)
Signed-off-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d_walk() relies upon the tree not getting rearranged under it without
rename_lock being touched. And we do grab rename_lock around the
places that change the tree topology. Unfortunately, branch reordering
is just as bad from d_walk() POV and we have two places that do it
without touching rename_lock - one in handling of cursors (for ramfs-style
directories) and another in autofs. autofs one is a separate story; this
commit deals with the cursors.
* mark cursor dentries explicitly at allocation time
* make __dentry_kill() leave ->d_child.next pointing to the next
non-cursor sibling, making sure that it won't be moved around unnoticed
before the parent is relocked on ascend-to-parent path in d_walk().
* make d_walk() skip cursors explicitly; strictly speaking it's
not necessary (all callbacks we pass to d_walk() are no-ops on cursors),
but it makes analysis easier.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before this patch, function read_rindex_entry would set a rgrp
glock's gl_object pointer to itself before inserting the rgrp into
the rgrp rbtree. The problem is: if another process was also reading
the rgrp in, and had already inserted its newly created rgrp, then
the second call to read_rindex_entry would overwrite that value,
then return a bad return code to the caller. Later, other functions
would reference the now-freed rgrp memory by way of gl_object.
In some cases, that could result in gfs2_rgrp_brelse being called
twice for the same rgrp: once for the failed attempt and once for
the "real" rgrp release. Eventually the kernel would panic.
There are also a number of other things that could go wrong when
a kernel module is accessing freed storage. For example, this could
result in rgrp corruption because the fake rgrp would point to a
fake bitmap in memory too, causing gfs2_inplace_reserve to search
some random memory for free blocks, and find some, since we were
never setting rgd->rd_bits to NULL before freeing it.
This patch fixes the problem by not setting gl_object until we
have successfully inserted the rgrp into the rbtree. Also, it sets
rd_bits to NULL as it frees them, which will ensure any accidental
access to the wrong rgrp will result in a kernel panic rather than
file system corruption, which is preferred.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
timerfd gives processes a way to set wake alarms, but unlike timers made using
timer_create, timerfds don't check whether the process has CAP_WAKE_ALARM
before setting alarm-time timers. CAP_WAKE_ALARM is supposed to gate this
behavior and so it makes sense that we should deny permission to create such
timerfds if the process doesn't have this capability.
Signed-off-by: Eric Caruso <ejcaruso@google.com>
Cc: Todd Poynor <toddpoynor@google.com>
Link: http://lkml.kernel.org/r/1465427339-96209-1-git-send-email-ejcaruso@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
BIO_MAX_PAGES is used as maximum count of bvecs, so
replace BIO_MAX_SECTORS with BIO_MAX_PAGES since
BIO_MAX_SECTORS is to be removed.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was missed from my last patchset.
This patch has ext4 crypto code use the bio op helper
to set the operation. The operation (discard, write, writesame,
etc) is now defined seperately from the other REQ bits. They
still share the bi_rw field to save space, so we use these
helpers so modules do not have to worry about setting/overwriting
info.
Jens, I am not sure how you handle patches on top of patches
in the next branches. If you merge patches that fix issues
in previous patches in next, then this patch could be part
of
commit 95fe6c1a20
Author: Mike Christie <mchristi@redhat.com>
Date: Sun Jun 5 14:31:48 2016 -0500
block, fs, mm, drivers: use bio set/get op accessors
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In f2fs, we don't need to keep block plugging for NODE and DATA writes, since
we already merged bios as much as possible.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is a data race between allocate_data_block() and f2fs_sbumit_page_mbio(),
which incur unnecessary reversed bio submission.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull vfs fixes from Al Viro:
"Fixes for crap of assorted ages: EOPENSTALE one is 4.2+, autofs one is
4.6, d_walk - 3.2+.
The atomic_open() and coredump ones are regressions from this window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coredump: fix dumping through pipes
fix a regression in atomic_open()
fix d_walk()/non-delayed __d_free() race
autofs braino fix for do_last()
fix EOPENSTALE bug in do_last()
The offset in the core file used to be tracked with ->written field of
the coredump_params structure. The field was retired in favour of
file->f_pos.
However, ->f_pos is not maintained for pipes which leads to breakage.
Restore explicit tracking of the offset in coredump_params. Introduce
->pos field for this purpose since ->written was already reused.
Fixes: a008393951 ("get rid of coredump_params->written").
Reported-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
open("/foo/no_such_file", O_RDONLY | O_CREAT) on should fail with
EACCES when /foo is not writable; failing with ENOENT is obviously
wrong. That got broken by a braino introduced when moving the
creat_error logics from atomic_open() to lookup_open(). Easy to
fix, fortunately.
Spotted-by: "Yan, Zheng" <ukernel@gmail.com>
Tested-by: "Yan, Zheng" <ukernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ascend-to-parent logics in d_walk() depends on all encountered child
dentries not getting freed without an RCU delay. Unfortunately, in
quite a few cases it is not true, with hard-to-hit oopsable race as
the result.
Fortunately, the fix is simiple; right now the rule is "if it ever
been hashed, freeing must be delayed" and changing it to "if it
ever had a parent, freeing must be delayed" closes that hole and
covers all cases the old rule used to cover. Moreover, pipes and
sockets remain _not_ covered, so we do not introduce RCU delay in
the cases which are the reason for having that delay conditional
in the first place.
Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry())
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have ocfs2
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have nilfs
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have the mpage code
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have gfs2
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have xfs
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have gfs2
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have f2fs
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We no longer pass in a bitmap of rq_flag_bits bits to __btrfs_map_block.
It will always be a REQ_OP, or the btrfs specific REQ_GET_READ_MIRRORS,
so this drops the bit tests.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This should be the easier cases to convert btrfs to
bio_set_op_attrs/bio_op.
They are mostly just cut and replace type of changes.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch has btrfs's submit_one_bio users set the bio op using
bio_set_op_attrs and get the op using bio_op.
The next patches will continue to convert btrfs,
so submit_bio_hook and merge_bio_hook
related code will be modified to take only the bio. I did
not do it in this patch to try and keep it smaller.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch has the dio code use a REQ_OP for the op and rq_flag_bits
for bi_rw flags. To set/get the op it uses the bio_set_op_attrs/bio_op
accssors.
It also begins to convert btrfs's dio_submit_t because of the dio
submit_io callout use. The next patches will completely convert
this code and the reset of the btrfs code paths.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op
These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has submit_bh users pass in the operation and flags separately,
so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that
is submitted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
FI_DIRTY_INODE flag is not covered by inode page lock, so it can be unset
at any time like below.
Thread #1 Thread #2
- lock_page(ipage)
- update i_fields
- update i_size/i_blocks/and so on
- set FI_DIRTY_INODE
- reset FI_DIRTY_INODE
- set_page_dirty(ipage)
In this case, we can lose the latest i_field information.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
MNT_LOCKED implies on a child mount implies the child is locked to the
parent. So while looping through the children the children should be
tested (not their parent).
Typically an unshare of a mount namespace locks all mounts together
making both the parent and the slave as locked but there are a few
corner cases where other things work.
Cc: stable@vger.kernel.org
Fixes: ceeb0e5d39 ("vfs: Ignore unlocked mounts in fs_fully_visible")
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add this trivial missing error handling.
Cc: stable@vger.kernel.org
Fixes: 1b852bceb0 ("mnt: Refactor the logic for mounting sysfs and proc in a user namespace")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In __test_eb_bitmaps(), we write random data to a bitmap. Then copy
the bitmap to another bitmap that resides inside an extent buffer.
Later we verify the values of corresponding bits in the bitmap and the
bitmap inside the extent buffer. However, extent_buffer_test_bit()
reads in byte granularity while test_bit() reads in unsigned long
granularity. Hence we end up comparing wrong bits on big-endian
systems such as ppc64. This commit fixes the issue by reading the
bitmap in byte granularity.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To test all possible sectorsizes, this commit adds a sectorsize
array. This commit executes the tests for all possible sectorsizes and
nodesizes.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On ppc64, PAGE_SIZE is 64k which is same as BTRFS_MAX_METADATA_BLOCKSIZE.
In such a scenario, we will never be able to have an extent buffer
containing more than one page. Hence in such cases this commit does not
execute the page straddling tests.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
a) ovl_need_xattr_filter() is wrong, we can have multiple lower layers
overlaid, all of which (except the lowest one) honouring the
"trusted.overlay.opaque" xattr. So need to filter everything except the
bottom and the pure-upper layer.
b) we no longer can assume that inode is attached to dentry in
get/setxattr.
This patch unconditionally filters private xattrs to fix both of the above.
Performance impact for get/removexattrs is likely in the noise.
For listxattrs it might be measurable in pathological cases, but I very
much hope nobody cares. If they do, we'll fix it then.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b96809173e ("security_d_instantiate(): move to the point prior to attaching dentry to inode")
Since several architectures support hardware-accelerated crc32c
calculation, it would be nice to confirm that btrfs is actually using it.
We can see an elevated use count for the module, but it doesn't actually
show who the users are. This patch simply prints the name of the driver
after successfully initializing the shash.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ added a helper and used in module load-time message ]
Signed-off-by: David Sterba <dsterba@suse.com>
To prevent fuzzed filesystem images from panic the whole system,
we need various validation checks to refuse to mount such an image
if btrfs finds any invalid value during loading chunks, including
both sys_array and regular chunks.
Note that these checks may not be sufficient to cover all corner cases,
feel free to add more checks.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds validation checks for super_total_bytes, super_bytes_used and
super_stripesize, super_num_devices.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We set uptodate flag to pages in the temporary sys_array eb,
but do not clear the flag after free eb. As the special
btree inode may still hold a reference on those pages, the
uptodate flag can remain alive in them.
If btrfs_super_chunk_root has been intentionally changed to the
offset of this sys_array eb, reading chunk_root will read content
of sys_array and it will skip our beautiful checks in
btree_readpage_end_io_hook() because of
"pages of eb are uptodate => eb is uptodate"
This adds the 'clear uptodate' part to force it to read from disk.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The /dev/ptmx device node is changed to lookup the directory entry "pts"
in the same directory as the /dev/ptmx device node was opened in. If
there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx
uses that filesystem. Otherwise the open of /dev/ptmx fails.
The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
userspace can now safely depend on each mount of devpts creating a new
instance of the filesystem.
Each mount of devpts is now a separate and equal filesystem.
Reserved ttys are now available to all instances of devpts where the
mounter is in the initial mount namespace.
A new vfs helper path_pts is introduced that finds a directory entry
named "pts" in the directory of the passed in path, and changes the
passed in path to point to it. The helper path_pts uses a function
path_parent_directory that was factored out of follow_dotdot.
In the implementation of devpts:
- devpts_mnt is killed as it is no longer meaningful if all mounts of
devpts are equal.
- pts_sb_from_inode is replaced by just inode->i_sb as all cached
inodes in the tty layer are now from the devpts filesystem.
- devpts_add_ref is rolled into the new function devpts_ptmx. And the
unnecessary inode hold is removed.
- devpts_del_ref is renamed devpts_release and reduced to just a
deacrivate_super.
- The newinstance mount option continues to be accepted but is now
ignored.
In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as
they are never used.
Documentation/filesystems/devices.txt is updated to describe the current
situation.
This has been verified to work properly on openwrt-15.05, centos5,
centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3,
ubuntu-15.10, fedora23, magia-5, mint-17.3, opensuse-42.1,
slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. With the
caveat that on centos6 and on slackware-14.1 that there wind up being
two instances of the devpts filesystem mounted on /dev/pts, the lower
copy does not end up getting used.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg KH <greg@kroah.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's an analogue of commit 7500c38a (fix the braino in "namei:
massage lookup_slow() to be usable by lookup_one_len_unlocked()").
The same problem (->lookup()-returned unhashed negative dentry
just might be an autofs one with ->d_manage() that would wait
until the daemon makes it positive) applies in do_last() - we
need to do follow_managed() first.
Fortunately, remaining callers of follow_managed() are OK - only
autofs has that weirdness (negative dentry that does not mean
an instant -ENOENT)) and autofs never has its negative dentries
hashed, so we can't pick one from a dcache lookup.
->d_manage() is a bloody mess ;-/
Cc: stable@vger.kernel.org # v4.6
Spotted-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fixes from Chris Mason:
"The important part of this pull is Filipe's set of fixes for btrfs
device replacement. Filipe fixed a few issues seen on the list and a
number he found on his own"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
Btrfs: fix race between device replace and read repair
Btrfs: fix race between device replace and discard
Btrfs: fix race between device replace and chunk allocation
Btrfs: fix race setting block group back to RW mode during device replace
Btrfs: fix unprotected assignment of the left cursor for device replace
Btrfs: fix race setting block group readonly during device replace
Btrfs: fix race between device replace and block group removal
Btrfs: fix race between readahead and device replace/removal
EOPENSTALE occuring at the last component of a trailing symlink ends up
with do_last() retrying its lookup. After the symlink body has been
discarded. The thing is, all this retry_lookup logics in there is not
needed at all - the upper layers will do the right thing if we simply
return that -EOPENSTALE as we would with any other error. Trying to
microoptimize in do_last() is a lot of headache for no good reason.
Cc: stable@vger.kernel.org # v4.2+
Tested-by: Oleg Drokin <green@linuxhacker.ru>
Reviewed-and-Tested-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When dealing with inline extents, btrfs_get_extent will incorrectly try
to insert a duplicate extent_map. The dup hits -EEXIST from
add_extent_map, but then we try to merge with the existing one and end
up trying to insert a zero length extent_map.
This actually works most of the time, except when there are extent maps
past the end of the inline extent. rocksdb will trigger this sometimes
because it preallocates an extent and then truncates down.
Josef made a script to trigger with xfs_io:
#!/bin/bash
xfs_io -f -c "pwrite 0 1000" inline
xfs_io -c "falloc -k 4k 1M" inline
xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
xfs_io -c "fadvise -d 0 1000" inline
cat inline
You'll get EIOs trying to read inline after this because add_extent_map
is returning EEXIST
Signed-off-by: Chris Mason <clm@fb.com>
Previously, f2fs_write_data_pages() calls __f2fs_writepage() which calls
f2fs_write_data_page().
If f2fs_write_data_page() returns AOP_WRITEPAGE_ACTIVATE, __f2fs_writepage()
calls mapping_set_error(). But, this should not happen at every time, since
sometimes f2fs_write_data_page() tries to skip writing pages without error.
For example, volatile_write() gives EIO all the time, as Shuoran Liu pointed
out.
Reported-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we can report an error to f2fs_lookup given by f2fs_find_entry.
Suggested-by: He YunLei <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit aaf9607516 ("f2fs: check node page
contents all the time") pointed out that "sometimes it was reported that
its contents was missing", so it checks the page's mapping and contents.
When "nid != nid_of_node(page)", ERR_PTR(-EIO) will be returned to the
caller. However, commit e1c51b9f1d ("f2fs:
clean up node page updating flow") moves "nid != nid_of_node(page)" test
to "f2fs_bug_on(sbi, nid != nid_of_node(page))", this will return a
wrong page to the caller when F2FS_CHECK_FS is off when "sometimes it
was reported that its contents was missing" happens.
This patch restores to check node page contents all the time, and
returns the errno to make the caller known something is wrong and avoid
to use the page. This patch also moves f2fs_bug_on to its proper location.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is no cold page, we don't need to do a loop to flush dirty
data pages.
On /dev/pmem0,
1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync
Before : 1.1 GB/s
After : 1.2 GB/s
2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048
Before : 2.2 GB/s
After : 2.3 GB/s
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For data pages, let's try to flush as much as possible in background.
On /dev/pmem0,
1. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048 conv=fsync
Before : 800 MB/s
After : 1.1 GB/s
2. dd if=/dev/zero of=/mnt/test/testfile bs=1M count=2048
Before : 1.3 GB/s
After : 2.2 GB/s
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we get ENOMEM or EIO in f2fs_find_entry, we should stop right away.
Otherwise, for example, we can get duplicate directory entry by ->chash and
->clevel.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes writepages lock.
We can improve multi-threading performance.
tiobench, 32 threads, 4KB write per fsync on SSD
Before: 25.88 MB/s
After: 28.03 MB/s
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If flush commands do not incur any congestion, we don't need to throw that to
dispatching queue which causes unnecessary latency.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch reduces to call them across the whole tree.
- sync_inode_page()
- update_inode_page()
- update_inode()
- f2fs_write_inode()
Instead, checkpoint will flush all the dirty inode metadata before syncing
node pages.
Note that, this is doable, since we call mark_inode_dirty_sync() for all
inode's field change which needs to update on-disk inode as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch calls mark_inode_dirty_sync() for the following on-disk inode
changes.
-> largest
-> ctime/mtime/atime
-> i_current_depth
-> i_xattr_nid
-> i_pino
-> i_advise
-> i_flags
-> i_mode
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_i_links_write() to call mark_inode_dirty_sync() when
changing inode->i_links.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_i_blocks_write() to call mark_inode_dirty_sync() when
changing inode->i_blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ee1d267423 ("pstore: add pstore unregister") I added:
.owner = THIS_MODULE,
in both pstore_fs_type and pstore_file_operations to increase a reference
count when pstore filesystem is mounted and pstore file is opened.
But, it's repetitive. There is no need to increase the opened reference
count. We only need to increase the mounted reference count. When a file
is opened, the filesystem can't be unmounted. Hence the pstore module
can't be unloaded either.
So I drop the opened reference count in this patch.
Fixes: ee1d267423 ("pstore: add pstore unregister")
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Like zlib compression in pstore, this patch added lzo and lz4
compression support so that users can have more options and better
compression ratio.
The original code treats the compressed data together with the
uncompressed ECC correction notice by using zlib decompress. The
ECC correction notice is missing in the decompression process. The
treatment also makes lzo and lz4 not working. So I treat them
separately by using pstore_decompress() to treat the compressed
data, and memcpy() to treat the uncompressed ECC correction notice.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On ppc64, bytes_per_bitmap will be (65536*8*65536). Hence append UL to
fix integer overflow.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a ppc64 machine using 64K as the block size, assume that the RB
tree at btrfs_free_space_ctl->free_space_offset contains following
two entries:
1. A bitmap entry having an offset value of 0 and having the bits
corresponding to the address range [128M+512K, 128M+768K] set.
2. An extent entry corresponding to the address range
[128M-256K, 128M-128K]
In such a scenario, test_check_exists() invoked for checking the
existence of address range [128M+768K, 256M] can lead to an
infinite loop as explained below:
- Checking for the extent entry fails.
- Checking for a bitmap entry results in the free space info in
range [128M+512K, 128M+768K] beng returned.
- rb_prev(info) returns NULL because the bitmap entry starting from
offset 0 comes first in the RB tree.
- current_node = bitmap node.
- while (current_node)
tmp = rb_next(bitmap_node);/*tmp is extent based free space entry*/
Since extent based free space entry's last address is smaller
than the address being searched for (i.e. 128M+768K) we
incorrectly again obtain the extent node as the "next right node"
of the RB tree and thus end up looping infinitely.
This patch fixes the issue by checking the "tmp" variable which point
to the most recently searched free space node.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several issues in fscache revalidation code.
- In ceph_revalidate_work(), fscache_invalidate() is called when
fscache_check_consistency() return 0. This is complete wrong
because 0 means cache is valid.
- Handle_cap_grant() calls ceph_queue_revalidate() if client
already has CAP_FILE_CACHE. This code is confusing. Client
should revalidate the cache each time it got CAP_FILE_CACHE
anew.
- In Handle_cap_grant(), fscache_invalidate() is called if MDS
revokes CAP_FILE_CACHE. This is inconsistency with the case
that inode get evicted. In the later case, the cache is not
discarded. Client may use the cache when inode is reloaded.
This patch moves the fscache revalidation into ceph_get_caps().
Client revalidates the cache after it gets CAP_FILE_CACHE.
i_rdcache_gen should keep constance while CAP_FILE_CACHE is
used. If i_fscache_gen is not equal to i_rdcache_gen, client
needs to check cache's consistency.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
All other filesystems do not add dirty pages to fscache. They all
disable fscache when inode is opened for write. Only ceph adds
dirty pages to fscache, but the code is buggy.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
__fscache_check_consistency() calls check_consistency() callback
and return the callback's return value. But the return type of
check_consistency() is bool. So __fscache_check_consistency()
return 1 if the cache is inconsistent. This is inconsistent with
the document.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
When we have a lot of metadata to flush from the AIL, the buffer
list can get very long. The current submission code tries to batch
submission to optimise IO order of the metadata (i.e. ascending
block order) to maximise block layer merging or IO to adjacent
metadata blocks.
Unfortunately, the method used can result in long lock times
occurring as buffers locked early on in the buffer list might not be
dispatched until the end of the IO licst processing. This is because
sorting does not occur util after the buffer list has been processed
and the buffers that are going to be submitted are locked. Hence
when the buffer list is several thousand buffers long, the lock hold
times before IO dispatch can be significant.
To fix this, sort the buffer list before we start trying to lock and
submit buffers. This means we can now submit buffers immediately
after they are locked, allowing merging to occur immediately on the
plug and dispatch to occur as quickly as possible. This means there
is minimal delay between locking the buffer and IO submission
occuring, hence reducing the worst case lock hold times seen during
delayed write buffer IO submission signficantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
And the same for XFS_IOC_THAW. Just because we now have a common
version of the ioctl we still need to provide the old name for it
for anyone using those.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Al Viro noticed that xfs_lock_inodes should be static, and
that led to ... a few more.
These are just the easy ones, others require moving functions
higher in source files, so that's not done here to keep
this review simple.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The static checker reports that after commit 8d99fe92fe ("xfs: fix
efi/efd error handling to avoid fs shutdown hangs"), the code has been
reworked such that error == -EFSCORRUPTED is not possible in this
codepath.
Remove the spurious error check and just use SHUTDOWN_META_IO_ERROR
unconditionally.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Multi-block buffers are logged based on buffer offset in
xfs_trans_log_buf(). xfs_buf_item_log() ultimately walks each mapping in
the buffer and marks the associated range to be logged in the
xfs_buf_log_format bitmap for that mapping. This code is broken,
however, in that it marks the actual buffer offsets of the associated
range in each bitmap rather than shifting to the byte range for that
particular mapping.
For example, on a 4k fsb fs, buffer offset 4096 refers to the first byte
of the second mapping in the buffer. This means byte 0 of the second log
format bitmap should be tagged as dirty. Instead, the current code marks
byte offset 4096 of the second log format bitmap, which is invalid and
potentially out of range of the mapping.
As a result of this, the log item format code invoked at transaction
commit time is not be able to correctly identify what parts of the
buffer to copy into log vectors. This can lead to NULL log vector
pointer dereferences in CIL push context if the item format code was not
able to locate any dirty ranges at all. This crash has been reproduced
on a 4k FSB filesystem using 16k directory blocks where an unlink
operation happened not to log anything in the first block of the
mapping. The logged offsets were all over 4k, marked as such in the
subsequent log format mappings, and thus left the transaction with an
xfs_log_item that is marked DIRTY but without any logged regions.
Further, even when the logged regions are marked correctly in the buffer
log format bitmaps, the format code doesn't copy the correct ranges of
the buffer into the log. This means that any logged region beyond the
first block of a multi-block buffer is subject to corruption after a
crash and log recovery sequence. This is due to a failure to convert the
mapping bm_len field from basic blocks to bytes in the buffer offset
tracking code in xfs_buf_item_format().
Update xfs_buf_item_log() to convert buffer offsets to segment relative
offsets when logging multi-block buffers. This ensures that the modified
regions of a buffer are logged correctly and avoids the aforementioned
crash. Also update xfs_buf_item_format() to correctly track the source
offset into the buffer for the log vector formatting code. This ensures
that the correct data is copied into the log.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This driver predates those methods and was trying to be clever allocating
it's own private data. Switch to the generic scheme used by other file
systems.
Based on an earlier patch from Krzysztof Błaszkowski <kb@sysmikro.com.pl>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
To support VxFS filesystems from HP-UX on x86 systems we need to
implement byte swapping, and to keep support for Unixware filesystems
it needs to be the complicated dual-endian kind ala sysvfs.
To do this properly we have to split the on disk and in-core inode
so that we can keep the in-core one in native endianness. All other
structures are byteswapped on demand.
Signed-off-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
[hch: make spare happy]
Signed-off-by: Christoph Hellwig <hch@lst.de>
We still need to call btrfs_end_transaction if we call btrfs_abort_transaction,
otherwise we hang and make me super grumpy. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code is duplicate between compression is enabled or not.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
The commit f0e2efcfd2 ("pstore: do not use message compression
without lock") added a check to 'is_locked' to avoid breakage in
concurrent accesses. But it has a side-effect of disabling compression
on normal path since 'is_locked' variable is not set. As normal path
always takes the lock, it should be initialized to 1.
This also makes the unlock code a bit simpler.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
While none of the "fragile" pstore backends unregister yet, if they
ever did, the unregistering code for the non-dump targets might get
confused. This adds a check for fragile backends on unregister.
Signed-off-by: Kees Cook <keescook@chromium.org>
While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".
So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
While we are finishing a device replace operation, we can make a discard
operation (fs mounted with -o discard) do an invalid memory access like
the one reported by the following trace:
[ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
[ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
[ 3206.388595] RIP: 0010:[<ffffffff8124d233>] [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
[ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246
[ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
[ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
[ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
[ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
[ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
[ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
[ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
[ 3206.388595] Stack:
[ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000
[ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
[ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
[ 3206.388595] Call Trace:
[ 3206.388595] [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
[ 3206.388595] [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
[ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
[ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
[ 3206.388595] [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
[ 3206.388595] [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
[ 3206.388595] [<ffffffff811a8417>] do_fsync+0x31/0x4a
[ 3206.388595] [<ffffffff811a8637>] SyS_fsync+0x10/0x14
[ 3206.388595] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 3206.388595] [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
[ 3206.388595] [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
This happens because when we call btrfs_map_block() from
btrfs_discard_extent() to get a btrfs_bio structure, the device replace
operation has not finished yet, but before we use the device of one of the
stripes from the returned btrfs_bio structure, the device object is freed.
This is illustrated by the following diagram.
CPU 1 CPU 2
btrfs_dev_replace_start()
(...)
btrfs_dev_replace_finishing()
btrfs_start_transaction()
btrfs_commit_transaction()
(...)
btrfs_sync_file()
btrfs_start_transaction()
(...)
btrfs_commit_transaction()
btrfs_finish_extent_commit()
btrfs_discard_extent()
btrfs_map_block()
--> returns a struct btrfs_bio
with a stripe that has a
device field pointing to
source device of the replace
operation (the device that
is being replaced)
mutex_lock(&uuid_mutex)
mutex_lock(&fs_info->fs_devices->device_list_mutex)
mutex_lock(&fs_info->chunk_mutex)
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates the mapping tree and for each
extent map that has a stripe pointing to
the source device, it updates the stripe
to point to the target device instead
btrfs_rm_dev_replace_blocked()
--> waits for fs_info->bio_counter to go down to 0
btrfs_rm_dev_replace_remove_srcdev()
--> removes source device from the list of devices
mutex_unlock(&fs_info->chunk_mutex)
mutex_unlock(&fs_info->fs_devices->device_list_mutex)
mutex_unlock(&uuid_mutex)
btrfs_rm_dev_replace_free_srcdev()
--> frees the source device
--> iterates over all stripes
of the returned struct
btrfs_bio
--> for each stripe it
dereferences its device
pointer
--> it ends up finding a
pointer to the device
used as the source
device for the replace
operation and that was
already freed
So fix this by surrounding the call to btrfs_map_block(), and the code
that uses the returned struct btrfs_bio, with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
the finishing phase of the device replace operation blocks until the
the bio counter decreases to zero before it frees the source device.
This is the same approach we do at btrfs_map_bio() for example.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
For the benefit of every single caller, take osdc instead of map.
Also, now that osdc->osdmap can't ever be NULL, drop the check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.
However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.
For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:
[ extent bg A ] [ hole, unallocated space ] [extent bg B ]
3Gb 4Gb 5Gb
While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:
CPU 1 CPU 2
<we are at transaction N>
scrub_enumerate_chunks()
--> searches the device tree for
extents belonging to the source
device using the device tree's
commit root
--> 1st iteration finds extent belonging to
block group A
--> sets block group A to RO mode
(btrfs_inc_block_group_ro)
--> sets cursor left to found_key.offset
which is 3Gb
--> scrub_chunk() starts
copies all allocated extents from
block group's A stripe at source
device into target device
btrfs_alloc_chunk()
--> allocates device extent
in the range [4Gb, 5Gb[
from the source device for
a new block group C
extent allocated from block
group C for a direct IO,
buffered write or btree node/leaf
extent is written to, perhaps
in response to a writepages()
call from the VM or directly
through direct IO
the write is made only against
the source device and not against
the target device because the
extent's offset is in the interval
[4Gb, 5Gb[ which is larger then
the value of cursor_left (3Gb)
--> scrub_chunks() finishes
--> updates left cursor from 3Gb to
4Gb
--> btrfs_dec_block_group_ro() sets
block group A back to RW mode
<we are still at transaction N>
--> 2nd iteration finds extent belonging to
block group B - it did not find the new
extent in the range [4Gb, 5Gb[ for block
group C because we are using the device
tree's commit root or even because the
block group's items are not all yet
inserted in the respective btrees, that is,
the block group is still attached to some
transaction handle's new_bgs list and
btrfs_create_pending_block_groups() was
not called yet against that transaction
handle, so the device extent items were
not yet inserted into the devices tree
<we are still at transaction N>
--> so we end not copying anything from the newly
allocated device extent from the source device
to the target device
So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
After it finishes processing a device extent, the device replace code sets
back the block group to RW mode and then after that it sets the left cursor
to match the logical end address of the block group, so that future writes
into extents belonging to the block group go both the source (old) and
target (new) devices. However from the moment we turn the block group
back to RW mode we have a short time window, that lasts until we update
the left cursor's value, where extents can be allocated from the block
group and written to, in which case they will not be copied/written to
the target (new) device. Fix this by updating the left cursor's value
before turning the block group back to RW mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However just before we set
the block group to readonly mode some concurrent task might have already
allocated an extent from it or decided it could perform a nocow write
into one of its extents, which can make the device replace process to
miss copying an extent since it uses the extent tree's commit root to
search for extents and only once it finishes searching for all extents
belonging to the block group it does set the left cursor to the logical
end address of the block group - this is a problem if the respective
ordered extents finish while we are searching for extents using the
extent tree's commit root and no transaction commit happens while we
are iterating the tree, since it's the delayed references created by the
ordered extents (when they complete) that insert the extent items into
the extent tree (using the non-commit root of course).
Example:
CPU 1 CPU 2
btrfs_dev_replace_start()
btrfs_scrub_dev()
scrub_enumerate_chunks()
--> finds device extent belonging
to block group X
<transaction N starts>
starts buffered write
against some inode
writepages is run against
that inode forcing dellaloc
to run
btrfs_writepages()
extent_writepages()
extent_write_cache_pages()
__extent_writepage()
writepage_delalloc()
run_delalloc_range()
cow_file_range()
btrfs_reserve_extent()
--> allocates an extent
from block group X
(which is not yet
in RO mode)
btrfs_add_ordered_extent()
--> creates ordered extent Y
flush_epd_write_bio()
--> bio against the extent from
block group X is submitted
btrfs_inc_block_group_ro(bg X)
--> sets block group X to readonly
scrub_chunk(bg X)
scrub_stripe(device extent from srcdev)
--> keeps searching for extent items
belonging to the block group using
the extent tree's commit root
--> it never blocks due to
fs_info->scrub_pause_req as no
one tries to commit transaction N
--> copies all extents found from the
source device into the target device
--> finishes search loop
bio completes
ordered extent Y completes
and creates delayed data
reference which will add an
extent item to the extent
tree when run (typically
at transaction commit time)
--> so the task doing the
scrub/device replace
at CPU 1 misses this
and does not copy this
extent into the new/target
device
btrfs_dec_block_group_ro(bg X)
--> turns block group X back to RW mode
dev_replace->cursor_left is set to the
logical end offset of block group X
So fix this by waiting for all cow and nocow writes after setting a block
group to readonly mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
When it's finishing, the device replace code iterates all extent maps
representing block group and for each one that has a stripe that refers
to the source device, it replaces its device with the target device.
However when it replaces the source device with the target device it,
the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
only after its ID is changed to match the one from the source device.
This leads to races with the chunk removal code that can temporarly see
a device with an ID of 0ULL and then attempt to use that ID to remove
items from the device tree and fail, causing a transaction abort:
[ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
[ 9238.594377] ------------[ cut here ]------------
[ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594403] BTRFS: Transaction aborted (error 1)
[ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
oppy [last unloaded: btrfs]
[ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 9238.594421] 0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
[ 9238.594422] 0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
[ 9238.594423] 0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
[ 9238.594424] Call Trace:
[ 9238.594428] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[ 9238.594430] [<ffffffff81052b14>] __warn+0xc2/0xdd
[ 9238.594432] [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
[ 9238.594434] [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
[ 9238.594450] [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594452] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[ 9238.594464] [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
[ 9238.594476] [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
[ 9238.594489] [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
[ 9238.594490] [<ffffffff8106f403>] kthread+0xd4/0xdc
[ 9238.594494] [<ffffffff8149e242>] ret_from_fork+0x22/0x40
[ 9238.594495] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
[ 9238.594496] ---[ end trace 183efbe50275f059 ]---
The sequence of steps leading to this is like the following:
CPU 1 CPU 2
btrfs_dev_replace_finishing()
at this point
dev_replace->tgtdev->devid ==
BTRFS_DEV_REPLACE_DEVID (0ULL)
...
btrfs_start_transaction()
btrfs_commit_transaction()
btrfs_delete_unused_bgs()
btrfs_remove_chunk()
looks up for the extent map
corresponding to the chunk
lock_chunks() (chunk_mutex)
check_system_chunk()
unlock_chunks() (chunk_mutex)
locks fs_info->chunk_mutex
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
iterates over all stripes from
the extent map
--> calls btrfs_free_dev_extent()
passing it the target device
that still has an ID of 0ULL
--> btrfs_free_dev_extent() fails
--> aborts current transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid (which is necessarily > 0)
frees the srcdev
unlocks fs_info->chunk_mutex
So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map. This is similar to the race between device
replace and block group creation that was fixed by commit 50460e3718
("Btrfs: fix race when finishing dev replace leading to transaction abort").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
There is a lot of duplication between dentry_unlink_inode() and dentry_iput().
The only real difference is that dentry_unlink_inode() bumps ->d_seq and
dentry_iput() doesn't. The argument of the latter is known to have been
unhashed, so anybody who might've found it in RCU lookup would already be
doomed to a ->d_seq mismatch. And we want to avoid pointless smp_rmb() there.
This patch makes dentry_unlink_inode() bump ->d_seq only for hashed dentries.
It's safe (d_delete() calls that sucker only if we are holding the only
reference to dentry, so rehash is not going to happen) and it allows
to use dentry_unlink_inode() in __dentry_kill() and get rid of dentry_iput().
The interesting question here is profiling; it *is* a hot path, and extra
conditional jumps in there might or might not be painful.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's not needed for file_operations of inodes located on fs defined
in the hosting module and for file_operations that go into procfs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fsnotify_d_move()/__fsnotify_d_instantiate()/__fsnotify_update_dcache_flags()
are identical to each other, regardless of the config.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The self-test was updated to cover zero-length strings; the function
needs to be updated, too.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Fixes: fcfd2fbf22 ("fs/namei.c: Add hashlen_string() function")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original name was simply hash_string(), but that conflicted with a
function with that name in drivers/base/power/trace.c, and I decided
that calling it "hashlen_" was better anyway.
But you have to do it in two places.
[ This caused build errors for architectures that don't define
CONFIG_DCACHE_WORD_ACCESS - Linus ]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: fcfd2fbf22 ("fs/namei.c: Add hashlen_string() function")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The HPFS filesystem used generic_show_options to produce string that is
displayed in /proc/mounts. However, there is a problem that the options
may disappear after remount. If we mount the filesystem with option1
and then remount it with option2, /proc/mounts should show both option1
and option2, however it only shows option2 because the whole option
string is replaced with replace_mount_options in hpfs_remount_fs.
To fix this bug, implement the hpfs_show_options function that prints
options that are currently selected.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c8f33d0bec ("affs: kstrdup() memory handling") checks if the
kstrdup function returns NULL due to out-of-memory condition.
However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.
This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.
The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).
Fixes: c8f33d0bec ("affs: kstrdup() memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ce657611ba ("hpfs: kstrdup() out of memory handling") checks if
the kstrdup function returns NULL due to out-of-memory condition.
However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.
This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.
The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).
Fixes: ce657611ba ("hpfs: kstrdup() out of memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various builds (such as i386:allmodconfig) fail with
fs/binfmt_aout.c:133:2: error: expected identifier or '(' before 'return'
fs/binfmt_aout.c:134:1: error: expected identifier or '(' before '}' token
[ Oops. My bad, I had stupidly thought that "allmodconfig" covered this
on x86-64 too, but it obviously doesn't. Egg on my face. - Linus ]
Fixes: 5d22fc25d4 ("mm: remove more IS_ERR_VALUE abuses")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull string hash improvements from George Spelvin:
"This series does several related things:
- Makes the dcache hash (fs/namei.c) useful for general kernel use.
(Thanks to Bruce for noticing the zero-length corner case)
- Converts the string hashes in <linux/sunrpc/svcauth.h> to use the
above.
- Avoids 64-bit multiplies in hash_64() on 32-bit platforms. Two
32-bit multiplies will do well enough.
- Rids the world of the bad hash multipliers in hash_32.
This finishes the job started in commit 689de1d6ca ("Minimal
fix-up of bad hashing behavior of hash_64()")
The vast majority of Linux architectures have hardware support for
32x32-bit multiply and so derive no benefit from "simplified"
multipliers.
The few processors that do not (68000, h8/300 and some models of
Microblaze) have arch-specific implementations added. Those
patches are last in the series.
- Overhauls the dcache hash mixing.
The patch in commit 0fed3ac866 ("namei: Improve hash mixing if
CONFIG_DCACHE_WORD_ACCESS") was an off-the-cuff suggestion.
Replaced with a much more careful design that's simultaneously
faster and better. (My own invention, as there was noting suitable
in the literature I could find. Comments welcome!)
- Modify the hash_name() loop to skip the initial HASH_MIX(). This
would let us salt the hash if we ever wanted to.
- Sort out partial_name_hash().
The hash function is declared as using a long state, even though
it's truncated to 32 bits at the end and the extra internal state
contributes nothing to the result. And some callers do odd things:
- fs/hfs/string.c only allocates 32 bits of state
- fs/hfsplus/unicode.c uses it to hash 16-bit unicode symbols not bytes
- Modify bytemask_from_count to handle inputs of 1..sizeof(long)
rather than 0..sizeof(long)-1. This would simplify users other
than full_name_hash"
Special thanks to Bruce Fields for testing and finding bugs in v1. (I
learned some humbling lessons about "obviously correct" code.)
On the arch-specific front, the m68k assembly has been tested in a
standalone test harness, I've been in contact with the Microblaze
maintainers who mostly don't care, as the hardware multiplier is never
omitted in real-world applications, and I haven't heard anything from
the H8/300 world"
* 'hash' of git://ftp.sciencehorizons.net/linux:
h8300: Add <asm/hash.h>
microblaze: Add <asm/hash.h>
m68k: Add <asm/hash.h>
<linux/hash.h>: Add support for architecture-specific functions
fs/namei.c: Improve dcache hash function
Eliminate bad hash multipliers from hash_32() and hash_64()
Change hash_64() return value to 32 bits
<linux/sunrpc/svcauth.h>: Define hash_str() in terms of hashlen_string()
fs/namei.c: Add hashlen_string() function
Pull out string hash to <linux/stringhash.h>
This is just the infrastructure; there are no users yet.
This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares
the existence of <asm/hash.h>.
That file may define its own versions of various functions, and define
HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones.
Included is a self-test (in lib/test_hash.c) that verifies the basics.
It is NOT in general required that the arch-specific functions compute
the same thing as the generic, but if a HAVE_* symbol is defined with
the value 1, then equality is tested.
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Philippe De Muyter <phdm@macq.eu>
Cc: linux-m68k@lists.linux-m68k.org
Cc: Alistair Francis <alistai@xilinx.com>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: uclinux-h8-devel@lists.sourceforge.jp
Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.
Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.
There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.
One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.
The key insights in this design are:
1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.
I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):
x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;
Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.
(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.
The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.
The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.
(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)
Special thanks to Bruce fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.
[checkpatch.pl formatting complaints noted and respectfully disagreed with.]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Tested-by: J. Bruce Fields <bfields@redhat.com>
We'd like to make more use of the highly-optimized dcache hash functions
throughout the kernel, rather than have every subsystem create its own,
and a function that hashes basic null-terminated strings is required
for that.
(The name is to emphasize that it returns both hash and length.)
It's actually useful in the dcache itself, specifically d_alloc_name().
Other uses in the next patch.
full_name_hash() is also tweaked to make it more generally useful:
1) Take a "char *" rather than "unsigned char *" argument, to
be consistent with hash_name().
2) Handle zero-length inputs. If we want more callers, we don't want
to make them worry about corner cases.
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
improvements of UBI and UBIFS.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJXSJl6AAoJEEtJtSqsAOnWbbAP/2ls5KpGsRSoOP4NsKo5+8k9
GsZ8hb0iN+UGDxMB5Jlj6O1ISNPz/8o7iGBuWa5OWFdh4Nn38sA1Qv996Rg3Ca3O
8KYEGAD7POWANfxCTmyQX/L8AsOP62B3diFktPyGrtYmLsVzx/AFN9/nM27ticdF
PQpUtLwvL/m7/oE6ymFCB4x34+DkTa7Jx4e+chVlF61CUipGc5I60VVX1Do+/DWt
S37ajjIxMJx0BO7Zs6vrcH9uGFSbvSpVBr9/sC16t0Fa1/vAa9pXieg3y5XZmtOl
+6NlwUxXVu5IjHkYYGZP8jsrA7bKJlflgCo/w6WKC6V0KECnBx+oSG9uMjDVPBSM
rc4VF4nevnNLNqFGBRWhxiYrRUCdB1TcWVDOzM16peNeUNJiWj/+4PbsiJQBwECH
ZnE/dBrjU7v+J8UmHsUJSveW6/5PluJf1loDzrip7gk5QD13wdpPo3VOycOj1vMD
iS6H/TBjtcEzWWh7qcBbtgkbHQH0g+w9YSzVy7ODFweuoq+4qCZ1davqzzZaRv3q
hylor1gea3b4GnH1EDjmXyQOOuIf2rLdJhou8WoZhmy2q3FrnE+89/CCOCMkeQMQ
qQr1lAbxD7A+8bYwSZ7Nvv6lrBUrat59DV6FHFc69MdkGT7J6z85ZOyd3lxRvJc/
XiZZkKhrYSKgV/iLFhHW
=L16y
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
"This contains mostly cleanups and minor improvements of UBI and UBIFS"
* tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs:
ubifs: ubifs_dump_inode: Fix dumping field bulk_read
UBI: Fix static volume checks when Fastmap is used
UBI: Set free_count to zero before walking through erase list
UBI: Silence an unintialized variable warning
UBI: Clean up return in ubi_remove_volume()
UBI: Modify wrong comment in ubi_leb_map function.
UBI: Don't read back all data in ubi_eba_copy_leb()
UBI: Add ro-mode sysfs attribute
Older versions of gcc don't understand named initializers inside a
anonymous structure or union member. It can be worked around by adding
the bracin gin the initializer for the anonymous member.
Without this, gcc 4.4.4 will fail the build with
CC fs/nfs/nfs4state.o
fs/nfs/nfs4state.c:69: error: unknown field ‘data’ specified in initializer
fs/nfs/nfs4state.c:69: warning: missing braces around initializer
fs/nfs/nfs4state.c:69: warning: (near initialization for ‘zero_stateid.<anonymous>.data’)
make[2]: *** [fs/nfs/nfs4state.o] Error 1
introduced in commit 93b717fd81 ("NFSv4: Label stateids with the type")
Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:
- update docs
- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged
- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking
smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
we'd hashed the new dentry and attached it to inode, we need ->setxattr()
instances getting the inode as an explicit argument rather than obtaining
it from dentry.
Similar change for ->getxattr() had been done in commit ce23e64. Unlike
->getxattr() (which is used by both selinux and smack instances of
->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
it got missed back then.
Reported-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull overlayfs update from Miklos Szeredi:
"The meat of this is a change to use the mounter's credentials for
operations that require elevated privileges (such as whiteout
creation). This fixes behavior under user namespaces as well as being
a nice cleanup"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: Do d_type check only if work dir creation was successful
ovl: update documentation
ovl: override creds with the ones from the superblock mounter
Pull btrfs cleanups and fixes from Chris Mason:
"We have another round of fixes and a few cleanups.
I have a fix for short returns from btrfs_copy_from_user, which
finally nails down a very hard to find regression we added in v4.6.
Dave is pushing around gfp parameters, mostly to cleanup internal apis
and make it a little more consistent.
The rest are smaller fixes, and one speelling fixup patch"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
Btrfs: fix handling of faults from btrfs_copy_from_user
btrfs: fix string and comment grammatical issues and typos
btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
Btrfs: fix unexpected return value of fiemap
Btrfs: free sys_array eb as soon as possible
btrfs: sink gfp parameter to convert_extent_bit
btrfs: make state preallocation more speculative in __set_extent_bit
btrfs: untangle gotos a bit in convert_extent_bit
btrfs: untangle gotos a bit in __clear_extent_bit
btrfs: untangle gotos a bit in __set_extent_bit
btrfs: sink gfp parameter to set_record_extent_bits
btrfs: sink gfp parameter to set_extent_new
btrfs: sink gfp parameter to set_extent_defrag
btrfs: sink gfp parameter to set_extent_delalloc
btrfs: sink gfp parameter to clear_extent_dirty
btrfs: sink gfp parameter to clear_record_extent_bits
btrfs: sink gfp parameter to clear_extent_bits
btrfs: sink gfp parameter to set_extent_bits
btrfs: make find_workspace warn if there are no workspaces
btrfs: make find_workspace always succeed
...
The do_brk() and vm_brk() return value was "unsigned long" and returned
the starting address on success, and an error value on failure. The
reasons are entirely historical, and go back to it basically behaving
like the mmap() interface does.
However, nobody actually wanted that interface, and it causes totally
pointless IS_ERR_VALUE() confusion.
What every single caller actually wants is just the simpler integer
return of zero for success and negative error number on failure.
So just convert to that much clearer and more common calling convention,
and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.
However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.
This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.
Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.
I was using this definition for testing:
#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.
I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.
[ Updated the 9p parts as per Al Viro - Linus ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
not allowed to allocate blocks(get_more_blocks() sets 'create' to 0
before calling get_block() callback), if it's a sparse file, direct
writes fall back to buffered writes to avoid stale data exposure from
concurrent buffered read. But there're two cases that can result in
stale data exposure are not correctly detected.
1. The detection for "writing inside i_size" is not sufficient,
writes can be treated as "extending writes" wrongly. For example,
direct write 1FSB (file system block) to a 1FSB sparse file on
ext2/3/4, starting from offset 0, in this case it's writing inside
i_size, but 'create' is non-zero, because 'block_in_file' and
'(i_size_read(inode) >> blkbits' are both zero.
2. Direct writes starting from or beyong i_size (not inside i_size)
also could trigger block allocation and expose stale data. For
example, consider a sparse file with i_size of 2k, and a write to
offset 2k or 3k into the file, with a filesystem block size of 4k.
(Thanks to Jeff Moyer for pointing this case out in his review.)
The first problem can be demostrated by running ltp-aiodio test ADSP045
many times. When testing on extN filesystems, I see test failures
occasionally, buffered read could read non-zero (stale) data.
ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
dio_sparse 0 TINFO : Dirtying free blocks
dio_sparse 0 TINFO : Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse 0 TINFO : Killing childrens(s)
dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally
The second problem can also be reproduced easily by a hacked dio_sparse
program, which accepts an option to specify the write offset.
What we should really do is to disable block allocation for writes that
could result in filling holes inside i_size.
Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two new messages are added to support negotiating hb timeout. Stop
nodes frmo talking an old version to mount as they will cause the
negotiation to fail.
Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hr_last_timeout_start should be set as the last time where hb is
still OK. When hb write timeout, hung time will be (jiffies -
hr_last_timeout_start).
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes io error is returned when storage is down for a while. Like
for iscsi device, stroage is made offline when session timeout, and this
will make all io return -EIO. For this case, nodes shouldn't do
negotiate timeout but should fence self. So let nodes fence self when
o2hb_do_disk_heartbeat return an error, this is the same behavior with
o2hb without negotiate timer.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This message is used to re-queue write timeout timer and negotiate timer
when all nodes suffer a write hung to storage, this makes node not fence
self if storage down.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This message is sent to master node when non-master nodes's negotiate
timer expired. Master node records these nodes in a bitmap which is
used to do write timeout timer re-queue decision.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series of patches is to fix the issue that when storage down, all
nodes will fence self due to write timeout.
With this patch set, all nodes will keep going until storage back
online, except if the following issue happens, then all nodes will do as
before to fence self.
1. io error got
2. network between nodes down
3. nodes panic
This patch (of 6):
When storage down, all nodes will fence self due to write timeout. The
negotiate timer is designed to avoid this, with it node will wait until
storage up again.
Negotiate timer working in the following way:
1. The timer expires before write timeout timer, its timeout is half
of write timeout now. It is re-queued along with write timeout timer.
If expires, it will send NEGO_TIMEOUT message to master node(node with
lowest node number). This message does nothing but marks a bit in a
bitmap recording which nodes are negotiating timeout on master node.
2. If storage down, nodes will send this message to master node, then
when master node finds its bitmap including all online nodes, it sends
NEGO_APPROVL message to all nodes one by one, this message will
re-queue write timeout timer and negotiate timer. For any node doesn't
receive this message or meets some issue when handling this message, it
will be fenced. If storage up at any time, o2hb_thread will run and
re-queue all the timer, nothing will be affected by these two steps.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>