Commit Graph

41328 Commits

Author SHA1 Message Date
Al Viro
0a73d0a204 9p: don't leave a half-initialized inode sitting around
Cc: stable@vger.kernel.org # all branches
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-12 11:22:05 -04:00
Filipe Manana
cffc3374e5 Btrfs: fix order by which delayed references are run
When we have an extent that got N references removed and N new references
added in the same transaction, we must run the insertion of the references
first because otherwise the last removed reference will remove the extent
item from the extent tree, resulting in a failure for the insertions.

This is a regression introduced in the 4.2-rc1 release and this fix just
brings back the behaviour of selecting reference additions before any
reference removals.

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create prealloc extent covering range [160K, 620K[
  $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo

  # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
  # space that immediately follows it. This creates a new extent of 40K that spans
  # the range [620K, 660K[.
  $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now 2 back references to the prealloc extent in our
  # extent tree. Both are for our file offset 160K and one relates to a file
  # extent item with a data offset of 0 and a length of 380K, while the other
  # relates to a file extent item with a data offset of 380K and a length of 80K.

  # Make sure everything done so far is durably persisted (all back references are
  # in the extent tree, etc).
  sync

  # Now clone all extents of our file that cover the offset 160K up to its eof
  # (660K at this point) into itself at offset 2M. This leaves a hole in the file
  # covering the range [660K, 2M[. The prealloc extent will now be referenced by
  # the file twice, once for offset 160K and once for offset 2M. The 40K extent
  # that follows the prealloc extent will also be referenced twice by our file,
  # once for offset 620K and once for offset 2M + 460K.
  $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
	$SCRATCH_MNT/foo

  # Now create one new extent in our file with a size of 100Kb. It will span the
  # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
  # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
  $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now (in memory) 4 back references to the prealloc
  # extent.
  #
  # Two of them are for file offset 160K, related to file extent items
  # matching the file offsets 160K and 540K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
  #
  # The other two references are for file offset 2M, related to file extent items
  # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 389K and 80K respectively.
  #
  # The 40K extent has 2 back references, one for file offset 620K and the other
  # for file offset 2M + 460K.
  #
  # The 100K extent has a single back reference and it relates to file offset 3M.

  # Now clone our 100K extent into offset 600K. That offset covers the last 20K
  # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
  # offset 660K.
  $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # At this point there's only one reference to the 40K extent, at file offset
  # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
  # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
  # file offset 3M and a new one for file offset 600K).

  # Now fsync our file to make all its new data and metadata updates are durably
  # persisted and present if a power failure/crash happens after a successful
  # fsync and before the next transaction commit.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file contents.
  # During log replay, the btrfs delayed references implementation used to run the
  # deletion of back references before the addition of new back references, which
  # made the addition fail as it didn't find the key in the extent tree that it
  # was looking for. The failure triggered by this test was related to the 40K
  # extent, which got 1 reference dropped and 1 reference added during the fsync
  # log replay - when running the delayed references at transaction commit time,
  # btrfs was applying the deletion before the insertion, resulting in a failure
  # of the insertion that ended up turning the fs into read-only mode.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

This issue turned the filesystem into read-only mode (current transaction
aborted) and produced the following traces:

  [ 8247.578385] ------------[ cut here ]------------
  [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
  (...)
  [ 8247.601697] Call Trace:
  [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.605488]  [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.608226]  [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.617061]  [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
  [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
  [ 8247.624366]  [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
  [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
  [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.628531]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  (...)
  [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---

  [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
  [ 8247.728954] BTRFS: Transaction aborted (error -5)
  (...)
  [ 8247.760866] Call Trace:
  [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.767582]  [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
  [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.773664]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
  [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
  [ 8247.777427]  [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
  [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
  [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
  [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
  [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
  (...)
  [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
  [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
  [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)

Fixes: c6fc245499 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Acked-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-07-11 22:36:44 +01:00
Filipe Manana
d3efe08400 Btrfs: fix list transaction->pending_ordered corruption
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.

This produces the following warning on a kernel with linked list
debugging enabled:

[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
[109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
[109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
[109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
[109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
[109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
[109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
[109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
[109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---

On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().

Cc: stable@vger.kernel.org
Fixes: 50d9aa99bd ("Btrfs: make sure logged extents complete in the current transaction V3"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2015-07-11 22:35:05 +01:00
Filipe Manana
497b4050e0 Btrfs: fix memory leak in the extent_same ioctl
We were allocating memory with memdup_user() but we were never releasing
that memory. This affected pretty much every call to the ioctl, whether
it deduplicated extents or not.

This issue was reported on IRC by Julian Taylor and on the mailing list
by Marcel Ritter, credit goes to them for finding the issue.

Reported-by: Julian Taylor <jtaylor.debian@googlemail.com>
Reported-by: Marcel Ritter <ritter.marcel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-07-11 22:34:26 +01:00
Filipe Manana
c1aa45759e Btrfs: fix shrinking truncate when the no_holes feature is enabled
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.

This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.

Example reproducer:

  $ mkfs.btrfs -O no-holes -f /dev/sdd
  $ mount /dev/sdd /mnt

  # Create our test file with some data and durably persist it.
  $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
  $ sync

  # Append some data to the file, increasing its size, and leave a hole
  # between the old size and the start offset if the following write. So
  # our file gets a hole in the range [128Kb, 256Kb[.
  $ xfs_io -c "truncate 160K" /mnt/foo

  # We expect to see our file with a size of 160Kb, with the first 128Kb
  # of data all having the value 0xaa and the remaining 32Kb of data all
  # having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0500000

  # Now cleanly unmount and mount again the filesystem.
  $ umount /mnt
  $ mount /dev/sdd /mnt

  # We expect to get the same result as before, a file with a size of
  # 160Kb, with the first 128Kb of data all having the value 0xaa and the
  # remaining 32Kb of data all having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000

In the example above the file size/data do not match what they were before
the remount.

Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-11 22:33:14 +01:00
Linus Torvalds
31b7a57c9e Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of fixes.  Most of the commits are from Filipe
  (fsync, the inode allocation cache and a few others).  Mark kicked in
  a series fixing corners in the extent sharing ioctls, and everyone
  else fixed up on assorted other problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong check for btrfs_force_chunk_alloc()
  Btrfs: fix warning of bytes_may_use
  Btrfs: fix hang when failing to submit bio of directIO
  Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
  Btrfs: fix memory corruption on failure to submit bio for direct IO
  btrfs: don't update mtime/ctime on deduped inodes
  btrfs: allow dedupe of same inode
  btrfs: fix deadlock with extent-same and readpage
  btrfs: pass unaligned length to btrfs_cmp_data()
  Btrfs: fix fsync after truncate when no_holes feature is enabled
  Btrfs: fix fsync xattr loss in the fast fsync path
  Btrfs: fix fsync data loss after append write
  Btrfs: fix crash on close_ctree() if cleaner starts new transaction
  Btrfs: fix race between caching kthread and returning inode to inode cache
  Btrfs: use kmem_cache_free when freeing entry in inode cache
  Btrfs: fix race between balance and unused block group deletion
  btrfs: add error handling for scrub_workers_get()
  btrfs: cleanup noused initialization of dev in btrfs_end_bio()
  btrfs: qgroup: allow user to clear the limitation on qgroup
2015-07-11 10:26:34 -07:00
Joe Perches
a28e4b2b18 hpfs: hpfs_error: Remove static buffer, use vsprintf extension %pV instead
Removing unnecessary static buffers is good.
Use the vsprintf %pV extension instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Cc: stable@vger.kernel.org      # v2.6.36+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09 13:35:31 -07:00
Sanidhya Kashyap
ce657611ba hpfs: kstrdup() out of memory handling
There is a possibility of nothing being allocated to the new_opts in
case of memory pressure, therefore return ENOMEM for such case.

Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09 13:35:31 -07:00
Firo Yang
d7b04097c2 hpfs: Remove unessary cast
Avoid a pointless kmem_cache_alloc() return value cast in
fs/hpfs/super.c::hpfs_alloc_inode()

Signed-off-by: Firo Yang <firogm@gmail.com>
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09 13:35:31 -07:00
Mikulas Patocka
a27b5b97d6 hpfs: add fstrim support
This patch adds support for fstrim to the HPFS filesystem.

Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09 13:35:30 -07:00
Mikulas Patocka
9abea2d64c ioctl_compat: handle FITRIM
The FITRIM ioctl has the same arguments on 32-bit and 64-bit
architectures, so we can add it to the list of compatible ioctls and
drop it from compat_ioctl method of various filesystems.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ted Ts'o <tytso@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09 11:42:21 -07:00
Linus Torvalds
1c4c7159ed Bug fixes (all for stable kernels) for ext4:
* address corner cases for indirect blocks->extent migration
   * fix reserved block accounting invalidate_page when
 	page_size != block_size (i.e., ppc or 1k block size file systems)
   * fix deadlocks when a memcg is under heavy memory pressure
   * fix fencepost error in lazytime optimization
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVmW27AAoJEPL5WVaVDYGjmEkIAJsGHVIKur1Kp//FhejSB/wI
 B0d+UuQt5kdAE3lNxC7lHO1NqIhvnS7eBho+52LG8V4JDRrzTbE1GdbsBhAIk6FW
 CcsQvsHAI99QJMdqOCachu/+nhCwIINGkxmbumhNaZoJPn6wmGQzCA3Cn5qmnGnK
 Ctbk6li1HuMXyzbbvxCLfaD/xCUs1NCdufEnRU44i0U4OfaYNpiAhddeGIQ8WMEQ
 G14l2JvhIfye6fG8lnCzfacFvnT9zvvSGfRO3ZQjC4Az1EogIUbhCPLvq0ebDbPp
 i4eRfrSRdXmMojqmW/knET8skXQVZVnD7LWuvkue+n47UbTH2c0roTbp4l76W+U=
 =x8Cc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Bug fixes (all for stable kernels) for ext4:

   - address corner cases for indirect blocks->extent migration

   - fix reserved block accounting invalidate_page when
     page_size != block_size (i.e., ppc or 1k block size file systems)

   - fix deadlocks when a memcg is under heavy memory pressure

   - fix fencepost error in lazytime optimization"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: replace open coded nofail allocation in ext4_free_blocks()
  ext4: correctly migrate a file with a hole at the beginning
  ext4: be more strict when migrating to non-extent based file
  ext4: fix reservation release on invalidatepage for delalloc fs
  ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
  bufferhead: Add _gfp version for sb_getblk()
  ext4: fix fencepost error in lazytime optimization
2015-07-05 16:24:54 -07:00
Michal Hocko
7444a072c3 ext4: replace open coded nofail allocation in ext4_free_blocks()
ext4_free_blocks is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to find out never-fail requirement and
cannot help in any way.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-05 12:33:44 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Matthew Wilcox
43c3dd08da dax: bdev_direct_access() may sleep
The brd driver is the only in-tree driver that may sleep currently.
After some discussion on linux-fsdevel, we decided that any driver
may choose to sleep in its ->direct_access method.  To ensure that all
callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04 15:56:57 -04:00
Matthew Wilcox
bbab37ddc2 block: Add support for DAX reads/writes to block devices
If a block device supports the ->direct_access methods, bypass the normal
DIO path and use DAX to go straight to memcpy() instead of allocating
a DIO and a BIO.

Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04 15:56:57 -04:00
Matthew Wilcox
872eb127e3 dax: Use copy_from_iter_nocache
When userspace does a write, there's no need for the written data to
pollute the CPU cache.  This matches the original XIP code.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04 15:56:56 -04:00
Linus Torvalds
22a093b2fb Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Debug info and other statistics fixes and related enhancements"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix numa balancing stats in /proc/pid/sched
  sched/numa: Show numa_group ID in /proc/sched_debug task listings
  sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
  sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
  sched/stat: Simplify the sched_info accounting dependency
2015-07-04 08:56:53 -07:00
Naveen N. Rao
5968cecedd sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
Expand /proc/pid/schedstat output:

 - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels.

 - dump all zeroes on kernels that are booted with the 'nodelayacct'
   option, which boot option disables delay accounting on
   CONFIG_TASK_DELAY_ACCT=y kernels.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: a.p.zijlstra@chello.nl
Cc: ricklind@us.ibm.com
Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-04 10:04:31 +02:00
Eryu Guan
8974fec7d7 ext4: correctly migrate a file with a hole at the beginning
Currently ext4_ind_migrate() doesn't correctly handle a file which
contains a hole at the beginning of the file.  This caused the migration
to be done incorrectly, and then if there is a subsequent following
delayed allocation write to the "hole", this would reclaim the same data
blocks again and results in fs corruption.

  # assmuing 4k block size ext4, with delalloc enabled
  # skip the first block and write to the second block
  xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

  # converting to indirect-mapped file, which would move the data blocks
  # to the beginning of the file, but extent status cache still marks
  # that region as a hole
  chattr -e /mnt/ext4/testfile

  # delayed allocation writes to the "hole", reclaim the same data block
  # again, results in i_blocks corruption
  xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
  umount /mnt/ext4
  e2fsck -nf /dev/sda6
  ...
  Inode 53, i_blocks is 16, should be 8.  Fix? no
  ...

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-04 00:03:44 -04:00
Eryu Guan
d6f123a929 ext4: be more strict when migrating to non-extent based file
Currently the check in ext4_ind_migrate() is not enough before doing the
real conversion:

a) delayed allocated extents could bypass the check on eh->eh_entries
   and eh->eh_depth

This can be demonstrated by this script

  xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile

where testfile has two extents but still be converted to non-extent
based file format.

b) only extent length is checked but not the offset, which would result
   in data lose (delalloc) or fs corruption (nodelalloc), because
   non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
   blocks

This can be demostrated by

  xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile
  sync

If delalloc is enabled, dmesg prints
  EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
  EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
  EXT4-fs (dm-4): This should not happen!! Data will be lost

If delalloc is disabled, e2fsck -nf shows corruption
  Inode 53, i_size is 5497558142976, should be 4096.  Fix? no

Fix the two issues by

a) forcing all delayed allocation blocks to be allocated before checking
   eh->eh_depth and eh->eh_entries
b) limiting the last logical block of the extent is within direct map

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-03 23:56:50 -04:00
Lukas Czerner
9705acd63b ext4: fix reservation release on invalidatepage for delalloc fs
On delalloc enabled file system on invalidatepage operation
in ext4_da_page_release_reservation() we want to clear the delayed
buffer and remove the extent covering the delayed buffer from the extent
status tree.

However currently there is a bug where on the systems with page size >
block size we will always remove extents from the start of the page
regardless where the actual delayed buffers are positioned in the page.
This leads to the errors like this:

EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocks

This however can cause data loss on writeback time if the file system is
in ENOSPC condition because we're releasing reservation for someones
else delayed buffer.

Fix this by only removing extents that corresponds to the part of the
page we want to invalidate.

This problem is reproducible by the following fio receipt (however I was
only able to reproduce it with fio-2.1 or older.

[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2015-07-03 21:13:55 -04:00
Linus Torvalds
0cbee99269 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace updates from Eric Biederman:
 "Long ago and far away when user namespaces where young it was realized
  that allowing fresh mounts of proc and sysfs with only user namespace
  permissions could violate the basic rule that only root gets to decide
  if proc or sysfs should be mounted at all.

  Some hacks were put in place to reduce the worst of the damage could
  be done, and the common sense rule was adopted that fresh mounts of
  proc and sysfs should allow no more than bind mounts of proc and
  sysfs.  Unfortunately that rule has not been fully enforced.

  There are two kinds of gaps in that enforcement.  Only filesystems
  mounted on empty directories of proc and sysfs should be ignored but
  the test for empty directories was insufficient.  So in my tree
  directories on proc, sysctl and sysfs that will always be empty are
  created specially.  Every other technique is imperfect as an ordinary
  directory can have entries added even after a readdir returns and
  shows that the directory is empty.  Special creation of directories
  for mount points makes the code in the kernel a smidge clearer about
  it's purpose.  I asked container developers from the various container
  projects to help test this and no holes were found in the set of mount
  points on proc and sysfs that are created specially.

  This set of changes also starts enforcing the mount flags of fresh
  mounts of proc and sysfs are consistent with the existing mount of
  proc and sysfs.  I expected this to be the boring part of the work but
  unfortunately unprivileged userspace winds up mounting fresh copies of
  proc and sysfs with noexec and nosuid clear when root set those flags
  on the previous mount of proc and sysfs.  So for now only the atime,
  read-only and nodev attributes which userspace happens to keep
  consistent are enforced.  Dealing with the noexec and nosuid
  attributes remains for another time.

  This set of changes also addresses an issue with how open file
  descriptors from /proc/<pid>/ns/* are displayed.  Recently readlink of
  /proc/<pid>/fd has been triggering a WARN_ON that has not been
  meaningful since it was added (as all of the code in the kernel was
  converted) and is not now actively wrong.

  There is also a short list of issues that have not been fixed yet that
  I will mention briefly.

  It is possible to rename a directory from below to above a bind mount.
  At which point any directory pointers below the renamed directory can
  be walked up to the root directory of the filesystem.  With user
  namespaces enabled a bind mount of the bind mount can be created
  allowing the user to pick a directory whose children they can rename
  to outside of the bind mount.  This is challenging to fix and doubly
  so because all obvious solutions must touch code that is in the
  performance part of pathname resolution.

  As mentioned above there is also a question of how to ensure that
  developers by accident or with purpose do not introduce exectuable
  files on sysfs and proc and in doing so introduce security regressions
  in the current userspace that will not be immediately obvious and as
  such are likely to require breaking userspace in painful ways once
  they are recognized"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  vfs: Remove incorrect debugging WARN in prepend_path
  mnt: Update fs_fully_visible to test for permanently empty directories
  sysfs: Create mountpoints with sysfs_create_mount_point
  sysfs: Add support for permanently empty directories to serve as mount points.
  kernfs: Add support for always empty directories.
  proc: Allow creating permanently empty directories that serve as mount points
  sysctl: Allow creating permanently empty directories that serve as mountpoints.
  fs: Add helper functions for permanently empty directories.
  vfs: Ignore unlocked mounts in fs_fully_visible
  mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
  mnt: Refactor the logic for mounting sysfs and proc in a user namespace
2015-07-03 15:20:57 -07:00
Linus Torvalds
0c76c6ba24 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "We have a pile of bug fixes from Ilya, including a few patches that
  sync up the CRUSH code with the latest from userspace.

  There is also a long series from Zheng that fixes various issues with
  snapshots, inline data, and directory fsync, some simplification and
  improvement in the cap release code, and a rework of the caching of
  directory contents.

  To top it off there are a few small fixes and cleanups from Benoit and
  Hong"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
  rbd: use GFP_NOIO in rbd_obj_request_create()
  crush: fix a bug in tree bucket decode
  libceph: Fix ceph_tcp_sendpage()'s more boolean usage
  libceph: Remove spurious kunmap() of the zero page
  rbd: queue_depth map option
  rbd: store rbd_options in rbd_device
  rbd: terminate rbd_opts_tokens with Opt_err
  ceph: fix ceph_writepages_start()
  rbd: bump queue_max_segments
  ceph: rework dcache readdir
  crush: sync up with userspace
  crush: fix crash from invalid 'take' argument
  ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL
  ceph: pre-allocate data structure that tracks caps flushing
  ceph: re-send flushing caps (which are revoked) in reconnect stage
  ceph: send TID of the oldest pending caps flush to MDS
  ceph: track pending caps flushing globally
  ceph: track pending caps flushing accurately
  libceph: fix wrong name "Ceph filesystem for Linux"
  ceph: fix directory fsync
  ...
2015-07-02 11:35:00 -07:00
Linus Torvalds
8688d9540c NFS client updates for Linux 4.2
Highlights include:
 
 Stable patches:
 - Fix a crash in the NFSv4 file locking code.
 - Fix an fsync() regression, where we were failing to retry I/O in some
   circumstances.
 - Fix an infinite loop in NFSv4.0 OPEN stateid recovery
 - Fix a memory leak when an attempted pnfs fails.
 - Fix a memory leak in the backchannel code
 - Large hostnames were not supported correctly in NFSv4.1
 - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O.
 - Fix a couple of credential issues in pNFS/flexfiles
 
 Bugfixes + cleanups:
 - Open flag sanity checks in the NFSv4 atomic open codepath
 - More NFSv4 delegation related bugfixes
 - Various NFSv4.1 backchannel bugfixes and cleanups
 - Fix the NFS swap socket code
 - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code
 - Fix a UDP transport deadlock issue
 
 Features:
 - More RDMA client transport improvements
 - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVlWQgAAoJEGcL54qWCgDyXtcP/2Y3HJ9xu5qU3Bo/jzCAw4E1
 jPPMSFAz4kqy/LGoslyc1cNDEiKGzJYWU8TtCGI3KAyNxb6n3pT1mEE1tvIsSdis
 D8bpV13M452PPpZYrBawIf4+OuohXmuYHpFiVNSpLbH3Uo7dthvFFnbqCGaGlnqY
 rXYZHAnx637OGBcJsT4AXCUz12ILvxMYRnqwW6Xn+j9JmwR1coQX3v8W8e7SMf6i
 J+zOny7Uetjrg1U9C9uQB6ZvIoxUMo9QOVmtGCwsBl8lM3fLmzaQfcUf9fm76pMT
 yTrKJs4jBLvVf00bRHFDv9EHWCy97oqCkeQEw1EY2lnxp/lmM5SiI4zQqjbf0QTW
 5VQScT1MK6xwHoUbuI/sYdXXR8KGDVT1xCFFHUNcg69CvgqdgWslPQY7xLJMvUJZ
 vBWfWDd8ppdCw2ZVX4ae/bnhfc+/mVh4wRPF7tgVAjT0pobBV9xMOeMkF4mo76Wa
 pvo/nTRMt68hpESVSvq9dYEMVhy5haqFhPrSbyAGOpT4SE2V3RCCZQfhu15TMKdW
 BdvItG+mdAVPbIHqhx7vRdAudcOEZKyxbFA+l3E5FyCAXLV7XS3M8CEl3P1w7gmm
 Ccr8DW9abKFJf1RAKdX3stexIoJLGTwciSMR5smsbup/xNcx/fRgx2f1w31JMPxb
 kG3Izfk25w9uGSsbR39D
 =AREr
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a crash in the NFSv4 file locking code.
   - Fix an fsync() regression, where we were failing to retry I/O in
     some circumstances.
   - Fix an infinite loop in NFSv4.0 OPEN stateid recovery
   - Fix a memory leak when an attempted pnfs fails.
   - Fix a memory leak in the backchannel code
   - Large hostnames were not supported correctly in NFSv4.1
   - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O.
   - Fix a couple of credential issues in pNFS/flexfiles

  Bugfixes + cleanups:
   - Open flag sanity checks in the NFSv4 atomic open codepath
   - More NFSv4 delegation related bugfixes
   - Various NFSv4.1 backchannel bugfixes and cleanups
   - Fix the NFS swap socket code
   - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code
   - Fix a UDP transport deadlock issue

  Features:
   - More RDMA client transport improvements
   - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles"

* tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (87 commits)
  nfs: Remove invalid tk_pid from debug message
  nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh
  nfs: Drop bad comment in nfs41_walk_client_list()
  nfs: Remove unneeded micro checking of CONFIG_PROC_FS
  nfs: Don't setting FILE_CREATED flags always
  nfs: Use remove_proc_subtree() instead remove_proc_entry()
  nfs: Remove unused argument in nfs_server_set_fsinfo()
  nfs: Fix a memory leak when meeting an unsupported state protect
  nfs: take extra reference to fl->fl_file when running a LOCKU operation
  NFSv4: When returning a delegation, don't reclaim an incompatible open mode.
  NFSv4.2: LAYOUTSTATS is optional to implement
  NFSv4.2: Fix up a decoding error in layoutstats
  pNFS/flexfiles: Fix the reset of struct pgio_header when resending
  pNFS/flexfiles: Turn off layoutcommit for servers that don't need it
  pnfs/flexfiles: protect ktime manipulation with mirror lock
  nfs: provide pnfs_report_layoutstat when NFS42 is disabled
  nfs: verify open flags before allowing open
  nfs: always update creds in mirror, even when we have an already connected ds
  nfs: fix potential credential leak in ff_layout_update_mirror_cred
  pnfs/flexfiles: report layoutstat regularly
  ...
2015-07-02 11:32:23 -07:00
Linus Torvalds
320cd413fa Merge branch 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This relaxes the requirements on the lower layer filesystem: now ones
  that implement .d_revalidate, such as NFS, can be used.

  Upper layer filesystems still has the "no .d_revalidate" requirement.

  Also a bad interaction with jffs2 locking has been fixed"

* 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: lookup whiteouts outside iterate_dir()
  ovl: allow distributed fs as lower layer
  ovl: don't traverse automount points
2015-07-02 11:23:00 -07:00
Linus Torvalds
a7ba4bf5e7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "This is the start of improving fuse scalability.

  An input queue and a processing queue is split out from the monolithic
  fuse connection, each of those having their own spinlock.  The end of
  the patchset adds the ability to clone a fuse connection.  This means,
  that instead of having to read/write requests/answers on a single fuse
  device fd, the fuse daemon can have multiple distinct file descriptors
  open.  Each of those can be used to receive requests and send answers,
  currently the only constraint is that a request must be answered on
  the same fd as it was read from.

  This can be extended further to allow binding a device clone to a
  specific CPU or NUMA node.

  Based on a patchset by Srinivas Eeda and Ashish Samant.  Thanks to
  Ashish for the review of this series"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (40 commits)
  fuse: update MAINTAINERS entry
  fuse: separate pqueue for clones
  fuse: introduce per-instance fuse_dev structure
  fuse: device fd clone
  fuse: abort: no fc->lock needed for request ending
  fuse: no fc->lock for pqueue parts
  fuse: no fc->lock in request_end()
  fuse: cleanup request_end()
  fuse: request_end(): do once
  fuse: add req flag for private list
  fuse: pqueue locking
  fuse: abort: group pqueue accesses
  fuse: cleanup fuse_dev_do_read()
  fuse: move list_del_init() from request_end() into callers
  fuse: duplicate ->connected in pqueue
  fuse: separate out processing queue
  fuse: simplify request_wait()
  fuse: no fc->lock for iqueue parts
  fuse: allow interrupt queuing without fc->lock
  fuse: iqueue locking
  ...
2015-07-02 11:21:26 -07:00
Linus Torvalds
9d90f03531 Replace module_init with appropriate alternate initcall in non modules.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVkO6nAAoJEOvOhAQsB9HWpHMP/Aknc+lmX2dZeIn96gdkP+UK
 1qL24C5oq2sm/9yTZLdoXbyApLaaTbAJHS9O4kolaOU6uOs3JrgtXqL1697PVp1R
 qV4f4DOzXmmEHaE2oO21afAri3tXIVQNqA2NQl2TmKfwz0Atu01Vj5RJPu/ZOBPl
 dONXcFnE6nO2p7AEFRP/GfDZwkng4xALyZPhwL7tJDAeGaBpqG/n2hCuq+Szn9g8
 wjTFACBdad/mRrYsL6YsWZ1e+LKI8vsArQbdPTam+jPaEUlK7yjFReFKCJVzL2JP
 xfQoTcCgFztzTUV0JTGR9sqeYA3WH9AkJOFDxNE/eIili4xiTh789WbEpHLVECSX
 1LsW025I3DkRWBPT4L+9ZP805ha71kNXDFc5N3XJkzrCYaFvD2BgsUzxi6FXj7aC
 9lEVKt6xO04FFG5SwTKnO0f8PEhPemZH3BDnVvjBDWQYLjUcPSNz7bfyHUhif0G5
 ulOGVB0ncJJF9iP8PyZs1RA/F8kKxXWnhYMIHzvl0f0vLUA7rAKsACnhBgq8s9ZQ
 uM5YjzU91Z/4pe5C2E5MmQIZ84b79ZPsee1lF0GJdjK5W3PDvnCjIdXfQ5M/f3S8
 76cssXWNhS78/P+19YqirLeb0u7Zw0jf73m9t9ywRgcByWfY5ZUDm0DFpQnWKkoR
 QY/aFO/yHKTO3VHj8Ril
 =KDJO
 -----END PGP SIGNATURE-----

Merge tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux

Pull module_init replacement part two from Paul Gortmaker:
 "Replace module_init with appropriate alternate initcall in non
  modules.

  This series converts non-modular code that is using the module_init()
  call to hook itself into the system to instead use one of our
  alternate priority initcalls.

  Unlike the previous series that used device_initcall and hence was a
  runtime no-op, these commits change to one of the alternate initcalls,
  because (a) we have them and (b) it seems like the right thing to do.

  For example, it would seem logical to use arch_initcall for arch
  specific setup code and fs_initcall for filesystem setup code.

  This does mean however, that changes in the init ordering will be
  taking place, and so there is a small risk that some kind of implicit
  init ordering issue may lie uncovered.  But I think it is still better
  to give these ones sensible priorities than to just assign them all to
  device_initcall in order to exactly preserve the old ordering.

  Thad said, we have already made similar changes in core kernel code in
  commit c96d6660dc ("kernel: audit/fix non-modular users of
  module_init in core code") without any regressions reported, so this
  type of change isn't without precedent.  It has also got the same
  local testing and linux-next coverage as all the other pull requests
  that I'm sending for this merge window have got.

  Once again, there is an unused module_exit function removal that shows
  up as an outlier upon casual inspection of the diffstat"

* tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
  x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
  x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
  mm/page_owner.c: use late_initcall to hook in enabling
  lib/list_sort: use late_initcall to hook in self tests
  arm: use subsys_initcall in non-modular pl320 IPC code
  powerpc: don't use module_init for non-modular core hugetlb code
  powerpc: use subsys_initcall for Freescale Local Bus
  x86: don't use module_init for non-modular core bootflag code
  netfilter: don't use module_init/exit in core IPV4 code
  fs/notify: don't use module_init for non-modular inotify_user code
  mm: replace module_init usages with subsys_initcall in nommu.c
2015-07-02 10:36:29 -07:00
Nikolay Borisov
c45653c341 ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
deadlocks in the page writeback path.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-02 01:34:07 -04:00
Theodore Ts'o
0f0ff9a9f3 ext4: fix fencepost error in lazytime optimization
Commit 8f4d855839: "ext4: fix lazytime optimization" was not a
complete fix.  In the case where the inode number is a multiple of 16,
and we could still end up updating an inode with dirty timestamps
written to the wrong inode on disk.  Oops.

This can be easily reproduced by using generic/005 with a file system
with metadata_csum and lazytime enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-01 23:37:46 -04:00
Linus Torvalds
2d01eedf1d Merge branch 'akpm' (patches from Andrew)
Merge third patchbomb from Andrew Morton:

 - the rest of MM

 - scripts/gdb updates

 - ipc/ updates

 - lib/ updates

 - MAINTAINERS updates

 - various other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (67 commits)
  genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
  genalloc: rename dev_get_gen_pool() to gen_pool_get()
  x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  MAINTAINERS: add zpool
  MAINTAINERS: BCACHE: Kent Overstreet has changed email address
  MAINTAINERS: move Jens Osterkamp to CREDITS
  MAINTAINERS: remove unused nbd.h pattern
  MAINTAINERS: update brcm gpio filename pattern
  MAINTAINERS: update brcm dts pattern
  MAINTAINERS: update sound soc intel patterns
  MAINTAINERS: remove website for paride
  MAINTAINERS: update Emulex ocrdma email addresses
  bcache: use kvfree() in various places
  libcxgbi: use kvfree() in cxgbi_free_big_mem()
  target: use kvfree() in session alloc and free
  IB/ehca: use kvfree() in ipz_queue_{cd}tor()
  drm/nouveau/gem: use kvfree() in u_free()
  drm: use kvfree() in drm_free_large()
  cxgb4: use kvfree() in t4_free_mem()
  cxgb3: use kvfree() in cxgb_free_mem()
  ...
2015-07-01 17:47:51 -07:00
Shilong Wang
9689457b5b Btrfs: fix wrong check for btrfs_force_chunk_alloc()
btrfs_force_chunk_alloc() return 1 for allocation chunk successfully.
This problem exists since commit c87f08ca4.

With this patch, we might fix some enospc problems for balances.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:22 -07:00
Liu Bo
ddba1bfc23 Btrfs: fix warning of bytes_may_use
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().

Test generic/019 produces some disk failures so sumbit dio will get errors,
in which case, btrfs_direct_IO() goes to the error handling and free
bytes_may_use, but the problem is that bytes_may_use has been free'd
during get_block().

This adds a runtime flag to show if we've gone through get_block(), if so,
don't do the cleanup work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:21 -07:00
Liu Bo
ad9ee2053f Btrfs: fix hang when failing to submit bio of directIO
The hang is uncoverd by generic/019.

btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
an error, thus those added ordered extents will never get processed, which
block processes that waiting for them via btrfs_start_ordered_extent().

This fixes the above, and meanwhile finish_ordered_fn will do the space
accounting work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:20 -07:00
Filipe Manana
9c6429d96d Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
The comment was not correct about the part where it says the endio
callback of the bio might have not yet been called - update it
to mention that by that time the endio callback execution might
still be in progress only.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:19 -07:00
Filipe Manana
61de718fce Btrfs: fix memory corruption on failure to submit bio for direct IO
If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:

[161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
[161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
[161779.860636] PGD 34d818067 PUD 0
[161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[161779.860636] Call Trace:
[161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
[161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
[161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
[161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
[161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
[161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
[161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
(...)

We were also not freeing the btrfs_dio_private we allocated previously,
which kmemleak reported with the following trace in its sysfs file:

unreferenced object 0xffff8803f553bf80 (size 96):
  comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
  hex dump (first 32 bytes):
    88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
    00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81161ffe>] create_object+0x172/0x29a
    [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
    [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
    [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
    [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
    [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
    [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
    [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
    [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
    [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
    [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
    [<ffffffff81165da9>] vfs_write+0xa0/0xe4
    [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
    [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
    [<ffffffffffffffff>] 0xffffffffffffffff

For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.

So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:18 -07:00
Mark Fasheh
1c919a5e13 btrfs: don't update mtime/ctime on deduped inodes
One issue users have reported is that dedupe changes mtime on files,
resulting in tools like rsync thinking that their contents have changed when
in fact the data is exactly the same. We also skip the ctime update as no
user-visible metadata changes here and we want dedupe to be transparent to
the user.

Clone still wants time changes, so we special case this in the code.

This was tested with the btrfs-extent-same tool.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:17 -07:00
Mark Fasheh
0efa9f48c7 btrfs: allow dedupe of same inode
clone() supports cloning within an inode so extent-same can do
the same now. This patch fixes up the locking in extent-same to
know about the single-inode case. In addition to that, we add a
check for overlapping ranges, which clone does not allow.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:15 -07:00
Mark Fasheh
f441460202 btrfs: fix deadlock with extent-same and readpage
->readpage() does page_lock() before extent_lock(), we do the opposite in
extent-same. We want to reverse the order in btrfs_extent_same() but it's
not quite straightforward since the page locks are taken inside btrfs_cmp_data().

So I split btrfs_cmp_data() into 3 parts with a small context structure that
is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
pages needed (taking page lock as required) and puts them on our context
structure. At this point, we are safe to lock the extent range. Afterwards,
we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
to clean up our context.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:14 -07:00
Mark Fasheh
207910ddee btrfs: pass unaligned length to btrfs_cmp_data()
In the case that we dedupe the tail of a file, we might expand the dedupe
len out to the end of our last block. We don't want to compare data past
i_size however, so pass the original length to btrfs_cmp_data().

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:13 -07:00
Filipe Manana
a89ca6f24f Btrfs: fix fsync after truncate when no_holes feature is enabled
When we have the no_holes feature enabled, if a we truncate a file to a
smaller size, truncate it again but to a size greater than or equals to
its original size and fsync it, the log tree will not have any information
about the hole covering the range [truncate_1_offset, new_file_size[.
Which means if the fsync log is replayed, the file will remain with the
state it had before both truncate operations.

Without the no_holes feature this does not happen, since when the inode
is logged (full sync flag is set) it will find in the fs/subvol tree a
leaf with a generation matching the current transaction id that has an
explicit extent item representing the hole.

Fix this by adding an explicit extent item representing a hole between
the last extent and the inode's i_size if we are doing a full sync.

The issue is easy to reproduce with the following test case for fstests:

  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey

  # This test was motivated by an issue found in btrfs when the btrfs
  # no-holes feature is enabled (introduced in kernel 3.14). So enable
  # the feature if the fs being tested is btrfs.
  if [ $FSTYP == "btrfs" ]; then
      _require_btrfs_fs_feature "no_holes"
      _require_btrfs_mkfs_feature "no-holes"
      MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
  fi

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test files and make sure everything is durably persisted.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K"         \
                  -c "pwrite -S 0xbb 64K 61K"       \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K"         \
                  -c "pwrite -S 0xff 64K 61K"       \
                  $SCRATCH_MNT/bar | _filter_xfs_io
  sync

  # Now truncate our file foo to a smaller size (64Kb) and then truncate
  # it to the size it had before the shrinking truncate (125Kb). Then
  # fsync our file. If a power failure happens after the fsync, we expect
  # our file to have a size of 125Kb, with the first 64Kb of data having
  # the value 0xaa and the second 61Kb of data having the value 0x00.
  $XFS_IO_PROG -c "truncate 64K" \
               -c "truncate 125K" \
               -c "fsync" \
               $SCRATCH_MNT/foo

  # Do something similar to our file bar, but the first truncation sets
  # the file size to 0 and the second truncation expands the size to the
  # double of what it was initially.
  $XFS_IO_PROG -c "truncate 0" \
               -c "truncate 253K" \
               -c "fsync" \
               $SCRATCH_MNT/bar

  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # We expect foo to have a size of 125Kb, the first 64Kb of data all
  # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
  # with value 0x00).
  echo "File foo content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  # We expect bar to have a size of 253Kb and no extents (any byte read
  # from bar has the value 0x00).
  echo "File bar content after log replay:"
  od -t x1 $SCRATCH_MNT/bar

  status=0
  exit

The expected file contents in the golden output are:

  File foo content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0372000
  File bar content after log replay:
  0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0772000

Without this fix, their contents are:

  File foo content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0372000
  File bar content after log replay:
  0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
  *
  0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0772000

A test case submission for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:12 -07:00
Linus Torvalds
02201e3f1b Minor merge needed, due to function move.
Main excitement here is Peter Zijlstra's lockless rbtree optimization to
 speed module address lookup.  He found some abusers of the module lock
 doing that too.
 
 A little bit of parameter work here too; including Dan Streetman's breaking
 up the big param mutex so writing a parameter can load another module (yeah,
 really).  Unfortunately that broke the usual suspects, !CONFIG_MODULES and
 !CONFIG_SYSFS, so those fixes were appended too.
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVkgKHAAoJENkgDmzRrbjxQpwQAJVmBN6jF3SnwbQXv9vRixjH
 58V33sb1G1RW+kXxQ3/e8jLX/4VaN479CufruXQp+IJWXsN/CH0lbC3k8m7u50d7
 b1Zeqd/Yrh79rkc11b0X1698uGCSMlzz+V54Z0QOTEEX+nSu2ZZvccFS4UaHkn3z
 rqDo00lb7rxQz8U25qro2OZrG6D3ub2q20TkWUB8EO4AOHkPn8KWP2r429Axrr0K
 wlDWDTTt8/IsvPbuPf3T15RAhq1avkMXWn9nDXDjyWbpLfTn8NFnWmtesgY7Jl4t
 GjbXC5WYekX3w2ZDB9KaT/DAMQ1a7RbMXNSz4RX4VbzDl+yYeSLmIh2G9fZb1PbB
 PsIxrOgy4BquOWsJPm+zeFPSC3q9Cfu219L4AmxSjiZxC3dlosg5rIB892Mjoyv4
 qxmg6oiqtc4Jxv+Gl9lRFVOqyHZrTC5IJ+xgfv1EyP6kKMUKLlDZtxZAuQxpUyxR
 HZLq220RYnYSvkWauikq4M8fqFM8bdt6hLJnv7bVqllseROk9stCvjSiE3A9szH5
 OgtOfYV5GhOeb8pCZqJKlGDw+RoJ21jtNCgOr6DgkNKV9CX/kL/Puwv8gnA0B0eh
 dxCeB7f/gcLl7Cg3Z3gVVcGlgak6JWrLf5ITAJhBZ8Lv+AtL2DKmwEWS/iIMRmek
 tLdh/a9GiCitqS0bT7GE
 =tWPQ
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Main excitement here is Peter Zijlstra's lockless rbtree optimization
  to speed module address lookup.  He found some abusers of the module
  lock doing that too.

  A little bit of parameter work here too; including Dan Streetman's
  breaking up the big param mutex so writing a parameter can load
  another module (yeah, really).  Unfortunately that broke the usual
  suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
  appended too"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
  modules: only use mod->param_lock if CONFIG_MODULES
  param: fix module param locks when !CONFIG_SYSFS.
  rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
  module: add per-module param_lock
  module: make perm const
  params: suppress unused variable error, warn once just in case code changes.
  modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
  kernel/module.c: avoid ifdefs for sig_enforce declaration
  kernel/workqueue.c: remove ifdefs over wq_power_efficient
  kernel/params.c: export param_ops_bool_enable_only
  kernel/params.c: generalize bool_enable_only
  kernel/module.c: use generic module param operaters for sig_enforce
  kernel/params: constify struct kernel_param_ops uses
  sysfs: tightened sysfs permission checks
  module: Rework module_addr_{min,max}
  module: Use __module_address() for module_address_lookup()
  module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
  module: Optimize __module_address() using a latched RB-tree
  rbtree: Implement generic latch_tree
  seqlock: Introduce raw_read_seqcount_latch()
  ...
2015-07-01 10:49:25 -07:00
Eric W. Biederman
93e3bce628 vfs: Remove incorrect debugging WARN in prepend_path
The warning message in prepend_path is unclear and outdated.  It was
added as a warning that the mechanism for generating names of pseudo
files had been removed from prepend_path and d_dname should be used
instead.  Unfortunately the warning reads like a general warning,
making it unclear what to do with it.

Remove the warning.  The transition it was added to warn about is long
over, and I added code several years ago which in rare cases causes
the warning to fire on legitimate code, and the warning is now firing
and scaring people for no good reason.

Cc: stable@vger.kernel.org
Reported-by: Ivan Delalande <colona@arista.com>
Reported-by: Omar Sandoval <osandov@osandov.com>
Fixes: f48cfddc67 ("vfs: In d_path don't call d_dname on a mount point")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:51 -05:00
Eric W. Biederman
7236c85e1b mnt: Update fs_fully_visible to test for permanently empty directories
fs_fully_visible attempts to make fresh mounts of proc and sysfs give
the mounter no more access to proc and sysfs than if they could have
by creating a bind mount.  One aspect of proc and sysfs that makes
this particularly tricky is that there are other filesystems that
typically mount on top of proc and sysfs.  As those filesystems are
mounted on empty directories in practice it is safe to ignore them.
However testing to ensure filesystems are mounted on empty directories
has not been something the in kernel data structures have supported so
the current test for an empty directory which checks to see
if nlink <= 2 is a bit lacking.

proc and sysfs have recently been modified to use the new empty_dir
infrastructure to create all of their dedicated mount points.  Instead
of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a
directory is empty, test for is_empty_dir_inode(inode).  That small
change guaranteess mounts found on proc and sysfs really are safe to
ignore, because the directories are not only empty but nothing can
ever be added to them.  This guarantees there is nothing to worry
about when mounting proc and sysfs.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:49 -05:00
Eric W. Biederman
f9bb48825a sysfs: Create mountpoints with sysfs_create_mount_point
This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.

The mount points converted and their filesystems are:
/sys/hypervisor/s390/       s390_hypfs
/sys/kernel/config/         configfs
/sys/kernel/debug/          debugfs
/sys/firmware/efi/efivars/  efivarfs
/sys/fs/fuse/connections/   fusectl
/sys/fs/pstore/             pstore
/sys/kernel/tracing/        tracefs
/sys/fs/cgroup/             cgroup
/sys/kernel/security/       securityfs
/sys/fs/selinux/            selinuxfs
/sys/fs/smackfs/            smackfs

Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:47 -05:00
Eric W. Biederman
87d2846fcf sysfs: Add support for permanently empty directories to serve as mount points.
Add two functions sysfs_create_mount_point and
sysfs_remove_mount_point that hang a permanently empty directory off
of a kobject or remove a permanently emptpy directory hanging from a
kobject.  Export these new functions so modular filesystems can use
them.

Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:45 -05:00
Eric W. Biederman
ea015218f2 kernfs: Add support for always empty directories.
Add a new function kernfs_create_empty_dir that can be used to create
directory that can not be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:43 -05:00
Eric W. Biederman
eb6d38d542 proc: Allow creating permanently empty directories that serve as mount points
Add a new function proc_create_mount_point that when used to creates a
directory that can not be added to.

Add a new function is_empty_pde to test if a function is a mount
point.

Update the code to use make_empty_dir_inode when reporting
a permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:41 -05:00
Eric W. Biederman
f9bd6733d3 sysctl: Allow creating permanently empty directories that serve as mountpoints.
Add a magic sysctl table sysctl_mount_point that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:39 -05:00
Eric W. Biederman
fbabfd0f4e fs: Add helper functions for permanently empty directories.
To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories.  Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permantently empty
directories that proc and sysfs can use when they create mount points
for filesystems and fs_fully_visible can use to test for permanently
empty directories to ensure that nothing will be gained by mounting a
fresh copy of proc or sysfs.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:37 -05:00
Eric W. Biederman
ceeb0e5d39 vfs: Ignore unlocked mounts in fs_fully_visible
Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked can always be unmounted so considering them adds hassle
but no security benefit.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:35 -05:00
Kinglong Mee
b4839ebe21 nfs: Remove invalid tk_pid from debug message
Before rpc_run_task(), tk_pid is uninitiated as 0 always.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:31:25 -04:00
Kinglong Mee
d356a7d1e7 nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh
NFS_ATTR_FATTR_V4_REFERRAL is only set in nfs4_proc_lookup_common.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:31:22 -04:00
Kinglong Mee
bc4da2a2be nfs: Drop bad comment in nfs41_walk_client_list()
Commit 7b1f1fd184 "NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_list"
have change the logical of the list_for_each_entry().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:30:59 -04:00
Kinglong Mee
cd738ee985 nfs: Remove unneeded micro checking of CONFIG_PROC_FS
Have checking CONFIG_PROC_FS in include/linux/sunrpc/stats.h.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:30:45 -04:00
Kinglong Mee
2785110d2e nfs: Don't setting FILE_CREATED flags always
Commit 5bc2afc2b5 "NFSv4: Honour the 'opened' parameter in the atomic_open()
 filesystem method" have support the opened arguments now.

Also,
Commit 03da633aa7 "atomic_open: take care of EEXIST in no-open case with
 O_CREAT|O_EXCL in fs/namei.c" have change vfs's logical.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:30:21 -04:00
Kinglong Mee
6a062a3687 nfs: Use remove_proc_subtree() instead remove_proc_entry()
Thanks for Al Viro's comments of killing proc_fs_nfs completely.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:29:41 -04:00
Kinglong Mee
71f81e51ee nfs: Remove unused argument in nfs_server_set_fsinfo()
Commit e38eb6506f "NFS: set_pnfs_layoutdriver() from nfs4_proc_fsinfo()"
have remove the using of mntfh from nfs_server_set_fsinfo().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:29:10 -04:00
Kinglong Mee
6b55970b0f nfs: Fix a memory leak when meeting an unsupported state protect
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:29:00 -04:00
Jeff Layton
db2efec0ca nfs: take extra reference to fl->fl_file when running a LOCKU operation
Jean reported another crash, similar to the one fixed by feaff8e5b2:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
    IP: [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache vmw_vsock_vmci_transport vsock cfg80211 rfkill coretemp crct10dif_pclmul ppdev vmw_balloon crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr vmxnet3 parport_pc i2c_piix4 microcode serio_raw parport nfsd floppy vmw_vmci acpi_cpufreq auth_rpcgss shpchp nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih ata_generic mptbase i2c_core pata_acpi
    CPU: 0 PID: 329 Comm: kworker/0:1H Not tainted 4.1.0-rc7+ #2
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
    Workqueue: rpciod rpc_async_schedule [sunrpc]
    30ec000
    RIP: 0010:[<ffffffff8124ef7f>]  [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
    RSP: 0018:ffff8802330efc08  EFLAGS: 00010296
    RAX: ffff8802330efc58 RBX: ffff880097187c80 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
    RBP: ffff8802330efc18 R08: ffff88023fc173d8 R09: 3038b7bf00000000
    R10: 00002f1a02000000 R11: 3038b7bf00000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8802337a2300 R15: 0000000000000020
    FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000148 CR3: 000000003680f000 CR4: 00000000001407f0
    Stack:
     ffff880097187c80 ffff880097187cd8 ffff8802330efc98 ffffffff81250281
     ffff8802330efc68 ffffffffa013e7df ffff8802330efc98 0000000000000246
     ffff8801f6901c00 ffff880233d2b8d8 ffff8802330efc58 ffff8802330efc58
    Call Trace:
     [<ffffffff81250281>] __posix_lock_file+0x31/0x5e0
     [<ffffffffa013e7df>] ? rpc_wake_up_task_queue_locked.part.35+0xcf/0x240 [sunrpc]
     [<ffffffff8125088b>] posix_lock_file_wait+0x3b/0xd0
     [<ffffffffa03890b2>] ? nfs41_wake_and_assign_slot+0x32/0x40 [nfsv4]
     [<ffffffffa0365808>] ? nfs41_sequence_done+0xd8/0x300 [nfsv4]
     [<ffffffffa0367525>] do_vfs_lock+0x35/0x40 [nfsv4]
     [<ffffffffa03690c1>] nfs4_locku_done+0x81/0x120 [nfsv4]
     [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
     [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
     [<ffffffffa013e33c>] rpc_exit_task+0x2c/0x90 [sunrpc]
     [<ffffffffa0134400>] ? call_refreshresult+0x170/0x170 [sunrpc]
     [<ffffffffa013ece4>] __rpc_execute+0x84/0x410 [sunrpc]
     [<ffffffffa013f085>] rpc_async_schedule+0x15/0x20 [sunrpc]
     [<ffffffff810add67>] process_one_work+0x147/0x400
     [<ffffffff810ae42b>] worker_thread+0x11b/0x460
     [<ffffffff810ae310>] ? rescuer_thread+0x2f0/0x2f0
     [<ffffffff810b35d9>] kthread+0xc9/0xe0
     [<ffffffff81010000>] ? perf_trace_xen_mmu_set_pmd+0xa0/0x160
     [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170
     [<ffffffff8173c222>] ret_from_fork+0x42/0x70
     [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170
    Code: a5 81 e8 85 75 e4 ff c6 05 31 ee aa 00 01 eb 98 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 <48> 8b 9f 48 01 00 00 48 85 db 74 08 48 89 d8 5b 41 5c 5d c3 83
    RIP  [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
     RSP <ffff8802330efc08>
    CR2: 0000000000000148
    ---[ end trace 64484f16250de7ef ]---

The problem is almost exactly the same as the one fixed by feaff8e5b2.
We must take a reference to the struct file when running the LOCKU
compound to prevent the final fput from running until the operation is
complete.

Reported-by: Jean Spector <jean@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:27:27 -04:00
Miklos Szeredi
c3696046be fuse: separate pqueue for clones
Make each fuse device clone refer to a separate processing queue.  The only
constraint on userspace code is that the request answer must be written to
the same device clone as it was read off.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:09 +02:00
Miklos Szeredi
cc080e9e9b fuse: introduce per-instance fuse_dev structure
Allow fuse device clones to refer to be distinguished.  This patch just
adds the infrastructure by associating a separate "struct fuse_dev" with
each clone.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:08 +02:00
Miklos Szeredi
00c570f4ba fuse: device fd clone
Allow an open fuse device to be "cloned".  Userspace can create a clone by:

      newfd = open("/dev/fuse", O_RDWR)
      ioctl(newfd, FUSE_DEV_IOC_CLONE, &oldfd);

At this point newfd will refer to the same fuse connection as oldfd.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:08 +02:00
Miklos Szeredi
ee314a870e fuse: abort: no fc->lock needed for request ending
In fuse_abort_conn() when all requests are on private lists we no longer
need fc->lock protection.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:08 +02:00
Miklos Szeredi
46c34a348b fuse: no fc->lock for pqueue parts
Remove fc->lock protection from processing queue members, now protected by
fpq->lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:07 +02:00
Miklos Szeredi
efe2800fac fuse: no fc->lock in request_end()
No longer need to call request_end() with the connection lock held.  We
still protect the background counters and queue with fc->lock, so acquire
it if necessary.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:07 +02:00
Miklos Szeredi
1e6881c36e fuse: cleanup request_end()
Now that we atomically test having already done everything we no longer
need other protection.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:07 +02:00
Miklos Szeredi
365ae710df fuse: request_end(): do once
When the connection is aborted it is possible that request_end() will be
called twice.  Use atomic test and set to do the actual ending only once.

test_and_set_bit() also provides the necessary barrier semantics so no
explicit smp_wmb() is necessary.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:06 +02:00
Miklos Szeredi
77cd9d488b fuse: add req flag for private list
When an unlocked request is aborted, it is moved from fpq->io to a private
list.  Then, after unlocking fpq->lock, the private list is processed and
the requests are finished off.

To protect the private list, we need to mark the request with a flag, so if
in the meantime the request is unlocked the list is not corrupted.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:06 +02:00
Miklos Szeredi
45a91cb1a4 fuse: pqueue locking
Add a fpq->lock for protecting members of struct fuse_pqueue and FR_LOCKED
request flag.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:06 +02:00
Miklos Szeredi
24b4d33d46 fuse: abort: group pqueue accesses
Rearrange fuse_abort_conn() so that processing queue accesses are grouped
together.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:05 +02:00
Miklos Szeredi
82cbdcd320 fuse: cleanup fuse_dev_do_read()
- locked list_add() + list_del_init() cancel out

 - common handling of case when request is ended here in the read phase

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:05 +02:00
Miklos Szeredi
f377cb799e fuse: move list_del_init() from request_end() into callers
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:04 +02:00
Miklos Szeredi
e96edd94d0 fuse: duplicate ->connected in pqueue
This will allow checking ->connected just with the processing queue lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:04 +02:00
Miklos Szeredi
3a2b5b9cd9 fuse: separate out processing queue
This is just two fields: fc->io and fc->processing.

This patch just rearranges the fields, no functional change.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:04 +02:00
Miklos Szeredi
5250921bb0 fuse: simplify request_wait()
wait_event_interruptible_exclusive_locked() will do everything
request_wait() does, so replace it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:03 +02:00
Miklos Szeredi
fd22d62ed0 fuse: no fc->lock for iqueue parts
Remove fc->lock protection from input queue members, now protected by
fiq->waitq.lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:03 +02:00
Miklos Szeredi
8f7bb368db fuse: allow interrupt queuing without fc->lock
Interrupt is only queued after the request has been sent to userspace.
This is either done in request_wait_answer() or fuse_dev_do_read()
depending on which state the request is in at the time of the interrupt.
If it's not yet sent, then queuing the interrupt is postponed until the
request is read.  Otherwise (the request has already been read and is
waiting for an answer) the interrupt is queued immedidately.

We want to call queue_interrupt() without fc->lock protection, in which
case there can be a race between the two functions:

 - neither of them queue the interrupt (thinking the other one has already
   done it).

 - both of them queue the interrupt

The first one is prevented by adding memory barriers, the second is
prevented by checking (under fiq->waitq.lock) if the interrupt has already
been queued.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:03 +02:00
Miklos Szeredi
4ce6081260 fuse: iqueue locking
Use fiq->waitq.lock for protecting members of struct fuse_iqueue and
FR_PENDING request flag, previously protected by fc->lock.

Following patches will remove fc->lock protection from these members.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:02 +02:00
Miklos Szeredi
ef75925886 fuse: dev read: split list_move
Different lists will need different locks.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:02 +02:00
Miklos Szeredi
8c91189a2a fuse: abort: group iqueue accesses
Rearrange fuse_abort_conn() so that input queue accesses are grouped
together.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:02 +02:00
Miklos Szeredi
e16714d875 fuse: duplicate ->connected in iqueue
This will allow checking ->connected just with the input queue lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:01 +02:00
Miklos Szeredi
f88996a933 fuse: separate out input queue
The input queue contains normal requests (fc->pending), forgets
(fc->forget_*) and interrupts (fc->interrupts).  There's also fc->waitq and
fc->fasync for waking up the readers of the fuse device when a request is
available.

The fc->reqctr is also moved to the input queue (assigned to the request
when the request is added to the input queue.

This patch just rearranges the fields, no functional change.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:01 +02:00
Miklos Szeredi
33e14b4dfd fuse: req state use flags
Use flags for representing the state in fuse_req.  This is needed since
req->list will be protected by different locks in different states, hence
we'll want the state itself to be split into distinct bits, each protected
with the relevant lock in that state.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:01 +02:00
Miklos Szeredi
7a3b2c7547 fuse: simplify req states
FUSE_REQ_INIT is actually the same state as FUSE_REQ_PENDING and
FUSE_REQ_READING and FUSE_REQ_WRITING can be merged into a common
FUSE_REQ_IO state.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:00 +02:00
Miklos Szeredi
c47752673a fuse: don't hold lock over request_wait_answer()
Only hold fc->lock over sections of request_wait_answer() that actually
need it.  If wait_event_interruptible() returns zero, it means that the
request finished.  Need to add memory barriers, though, to make sure that
all relevant data in the request is synchronized.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:00 +02:00
Miklos Szeredi
7d2e0a099c fuse: simplify unique ctr
Since it's a 64bit counter, it's never gonna wrap around.  Remove code
dealing with that possibility.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:00 +02:00
Miklos Szeredi
41f982747e fuse: rework abort
Splice fc->pending and fc->processing lists into a common kill list while
holding fc->lock.

By the time we release fc->lock, pending and processing lists are empty and
the io list contains only locked requests.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:59 +02:00
Miklos Szeredi
b716d42538 fuse: fold helpers into abort
Fold end_io_requests() and end_queued_requests() into fuse_abort_conn().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:59 +02:00
Miklos Szeredi
dc00809a53 fuse: use per req lock for lock/unlock_request()
Reuse req->waitq.lock for protecting FR_ABORTED and FR_LOCKED flags.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:58 +02:00
Miklos Szeredi
825d6d3395 fuse: req use bitops
Finer grained locking will mean there's no single lock to protect
modification of bitfileds in fuse_req.

So move to using bitops.  Can use the non-atomic variants for those which
happen while the request definitely has only one reference.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:58 +02:00
Miklos Szeredi
0d8e84b043 fuse: simplify request abort
- don't end the request while req->locked is true

 - make unlock_request() return an error if the connection was aborted

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:58 +02:00
Miklos Szeredi
ccd0a0bd16 fuse: call fuse_abort_conn() in dev release
fuse_abort_conn() does all the work done by fuse_dev_release() and more.
"More" consists of:

	end_io_requests(fc);
	wake_up_all(&fc->waitq);
	kill_fasync(&fc->fasync, SIGIO, POLL_IN);

All of which should be no-op (WARN_ON's added).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:57 +02:00
Miklos Szeredi
f0139aa819 fuse: fold fuse_request_send_nowait() into single caller
And the same with fuse_request_send_nowait_locked().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:57 +02:00
Miklos Szeredi
de15522646 fuse: check conn_error earlier
fc->conn_error is set once in FUSE_INIT reply and never cleared.  Check it
in request allocation, there's no sense in doing all the preparation if
sending will surely fail.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:57 +02:00
Miklos Szeredi
5437f24172 fuse: account as waiting before queuing for background
Move accounting of fc->num_waiting to the point where the request actually
starts waiting.  This is earlier than the current queue_request() for
background requests, since they might be waiting on the fc->bg_queue before
being queued on fc->pending.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:56 +02:00
Miklos Szeredi
73e0e73844 fuse: reset waiting
Reset req->waiting in fuse_put_request().  This is needed for correct
accounting in fc->num_waiting for reserved requests.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:25:56 +02:00
Miklos Szeredi
42dc6211c5 fuse: fix background request if not connected
request_end() expects fc->num_background and fc->active_background to have
been incremented, which is not the case in fuse_request_send_nowait()
failure path.  So instead just call the ->end() callback (which is actually
set by all callers).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:56 +02:00
Miklos Szeredi
0ad0b3255a fuse: initialize fc->release before calling it
fc->release is called from fuse_conn_put() which was used in the error
cleanup before fc->release was initialized.

[Jeremiah Mahler <jmmahler@gmail.com>: assign fc->release after calling
fuse_conn_init(fc) instead of before.]

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fixes: a325f9b922 ("fuse: update fuse_conn_init() and separate out fuse_conn_kill()")
Cc: <stable@vger.kernel.org> #v2.6.31+
2015-07-01 16:25:55 +02:00
Eric Dumazet
5ba97d2832 fs/file.c: __fget() and dup2() atomicity rules
__fget() does lockless fetch of pointer from the descriptor
table, attempts to grab a reference and treats "it was already
zero" as "it's already gone from the table, we just hadn't
seen the store, let's fail".  Unfortunately, that breaks the
atomicity of dup2() - __fget() might see the old pointer,
notice that it's been already dropped and treat that as
"it's closed".  What we should be getting is either the
old file or new one, depending whether we come before or after
dup2().

Dmitry had following test failing sometimes :

int fd;
void *Thread(void *x) {
  char buf;
  int n = read(fd, &buf, 1);
  if (n != 1)
    exit(printf("read failed: n=%d errno=%d\n", n, errno));
  return 0;
}

int main()
{
  fd = open("/dev/urandom", O_RDONLY);
  int fd2 = open("/dev/urandom", O_RDONLY);
  if (fd == -1 || fd2 == -1)
    exit(printf("open failed\n"));
  pthread_t th;
  pthread_create(&th, 0, Thread, 0);
  if (dup2(fd2, fd) == -1)
    exit(printf("dup2 failed\n"));
  pthread_join(th, 0);
  if (close(fd) == -1)
    exit(printf("close failed\n"));
  if (close(fd2) == -1)
    exit(printf("close failed\n"));
  printf("DONE\n");
  return 0;
}

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-01 02:31:08 -04:00