Commit Graph

47953 Commits

Author SHA1 Message Date
Andrea Arcangeli
8474901a33 userfaultfd: convert BUG() to WARN_ON_ONCE()
Avoid BUG_ON()s and only WARN instead.  This is just a cleanup, it can't
make any runtime difference.  This BUG_ON has never triggered and cannot
trigger.

Link: http://lkml.kernel.org/r/20161216144821.5183-4-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:28 -08:00
Andrea Arcangeli
a4605a61d6 userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP
Minor comment correction.

Link: http://lkml.kernel.org/r/20161216144821.5183-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:28 -08:00
Cong Wang
b5c66bab72 9p: fix a potential acl leak
posix_acl_update_mode() could possibly clear 'acl', if so we leak the
memory pointed by 'acl'.  Save this pointer before calling
posix_acl_update_mode() and release the memory if 'acl' really gets
cleared.

Link: http://lkml.kernel.org/r/1486678332-2430-1-git-send-email-xiyou.wangcong@gmail.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Mark Salyzyn <salyzyn@android.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Eric Ren
b891fa5024 ocfs2: fix deadlock issue when taking inode lock at vfs entry points
Commit 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
results in a deadlock, as the author "Tariq Saeed" realized shortly
after the patch was merged.  The discussion happened here

  https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html

The reason why taking cluster inode lock at vfs entry points opens up a
self deadlock window, is explained in the previous patch of this series.

So far, we have seen two different code paths that have this issue.

1. do_sys_open
     may_open
      inode_permission
       ocfs2_permission
        ocfs2_inode_lock() <=== take PR
         generic_permission
          get_acl
           ocfs2_iop_get_acl
            ocfs2_inode_lock() <=== take PR

2. fchmod|fchmodat
    chmod_common
     notify_change
      ocfs2_setattr <=== take EX
       posix_acl_chmod
        get_acl
         ocfs2_iop_get_acl <=== take PR
        ocfs2_iop_set_acl <=== take EX

Fixes them by adding the tracking logic (in the previous patch) for these
funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
ocfs2_setattr().

Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Eric Ren
439a36b8ef ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
We are in the situation that we have to avoid recursive cluster locking,
but there is no way to check if a cluster lock has been taken by a precess
already.

Mostly, we can avoid recursive locking by writing code carefully.
However, we found that it's very hard to handle the routines that are
invoked directly by vfs code.  For instance:

  const struct inode_operations ocfs2_file_iops = {
      .permission     = ocfs2_permission,
      .get_acl        = ocfs2_iop_get_acl,
      .set_acl        = ocfs2_iop_set_acl,
  };

Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):

  do_sys_open
   may_open
    inode_permission
     ocfs2_permission
      ocfs2_inode_lock() <=== first time
       generic_permission
        get_acl
         ocfs2_iop_get_acl
  	ocfs2_inode_lock() <=== recursive one

A deadlock will occur if a remote EX request comes in between two of
ocfs2_inode_lock().  Briefly describe how the deadlock is formed:

On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of
the remote EX lock request.  Another hand, the recursive cluster lock
(the second one) will be blocked in in __ocfs2_cluster_lock() because of
OCFS2_LOCK_BLOCKED.  But, the downconvert never complete, why? because
there is no chance for the first cluster lock on this node to be
unlocked - we block ourselves in the code path.

The idea to fix this issue is mostly taken from gfs2 code.

1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
   of the processes' pid who has taken the cluster lock of this lock
   resource;

2. introduce a new flag for ocfs2_inode_lock_full:
   OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for
   us if we've got cluster lock.

3. export a helper: ocfs2_is_locked_by_me() is used to check if we have
   got the cluster lock in the upper code path.

The tracking logic should be used by some of the ocfs2 vfs's callbacks,
to solve the recursive locking issue cuased by the fact that vfs
routines can call into each other.

The performance penalty of processing the holder list should only be
seen at a few cases where the tracking logic is used, such as get/set
acl.

You may ask what if the first time we got a PR lock, and the second time
we want a EX lock? fortunately, this case never happens in the real
world, as far as I can see, including permission check,
(get|set)_(acl|attr), and the gfs2 code also do so.

[sfr@canb.auug.org.au remove some inlines]
Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Dave Jiang
f42003917b mm, dax: change pmd_fault() to take only vmf parameter
pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct.  Remove the
additional parameter and simplify pmd_fault() and friends.

Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Dave Jiang
d8a849e1bc mm, dax: make pmd_fault() and friends be the same as fault()
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.

[dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
  Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Ross Zwisler
27a7ffaccd dax: add tracepoints to dax_pmd_insert_mapping()
Add tracepoints to dax_pmd_insert_mapping(), following the same logging
conventions as the tracepoints in dax_iomap_pmd_fault().

Here is an example PMD fault showing the new tracepoints:

big-1504  [001] ....   326.960743: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003

big-1504  [001] ....   326.960753: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400

big-1504  [001] ....   326.960981: dax_pmd_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10505000 length 0x200000 pfn 0x100600 DEV|MAP radix_entry 0xc000e

big-1504  [001] ....   326.960986: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE

Link: http://lkml.kernel.org/r/1484085142-2297-6-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Ross Zwisler
653b2ea339 dax: add tracepoints to dax_pmd_load_hole()
Add tracepoints to dax_pmd_load_hole(), following the same logging
conventions as the tracepoints in dax_iomap_pmd_fault().

Here is an example PMD fault showing the new tracepoints:

read_big-1478  [004] ....   238.242188: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003

read_big-1478  [004] ....   238.242191: dax_pmd_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400

read_big-1478  [004] ....   238.242390: dax_pmd_load_hole: dev 259:0 ino 0x1003 shared address 0x10400000 zero_page ffffea0002c20000 radix_entry 0x1e

read_big-1478  [004] ....   238.242392: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 NOPAGE

Link: http://lkml.kernel.org/r/1484085142-2297-5-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Ross Zwisler
282a8e0391 dax: add tracepoint infrastructure, PMD tracing
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems.  Create a tracepoint header for FS DAX and add the first DAX
tracepoints to the PMD fault handler.  This allows the tracing for DAX to
be done in the same way as the filesystem tracing so that developers can
look at them together and get a coherent idea of what the system is doing.

I added both an entry and exit tracepoint because future patches will add
tracepoints to child functions of dax_iomap_pmd_fault() like
dax_pmd_load_hole() and dax_pmd_insert_mapping().  We want those messages
to be wrapped by the parent function tracepoints so the code flow is more
easily understood.  Having entry and exit tracepoints for faults also
allows us to easily see what filesystems functions were called during the
fault.  These filesystem functions get executed via iomap_begin() and
iomap_end() calls, for example, and will have their own tracepoints.

For PMD faults we primarily want to understand the type of mapping, the
fault flags, the faulting address and whether it fell back to 4k faults.
If it fell back to 4k faults the tracepoints should let us understand why.

I named the new tracepoint header file "fs_dax.h" to allow for device DAX
to have its own separate tracing header in the same directory at some
point.

Here is an example output for these events from a successful PMD fault:

  big-1441  [005] ....    32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003

  big-1441  [005] ....    32.582776: dax_pmd_fault: dev 259:0 ino 0x1003
  shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400

  big-1441  [005] ....    32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003
  shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE

Link: http://lkml.kernel.org/r/1484085142-2297-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Liu Bo
6288d6eabc Btrfs: use the correct type when creating cow dio extent
'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch
'Btrfs: specify a new ordered extent type for create_io_em',
but it missed the directIO cow case.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-02-22 15:55:03 -08:00
Filipe Manana
b1517622f2 Btrfs: fix deadlock between dedup on same file and starting writeback
If we are deduping two ranges of the same file we need to make sure that
we lock all pages in ascending order, that is, lock first the pages from
the range with lower offset and then the pages from the other range, as
otherwise we can deadlock with a concurrent task that is starting delalloc
(writeback). Example trace:

[74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds.
[74073.053889]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
[74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74073.056696] kworker/u32:10  D    0 17997      2 0x00000000
[74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176)
[74073.061370]  ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240
[74073.064784]  ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000
[74073.068386]  ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240
[74073.071712] Call Trace:
[74073.072884]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
[74073.075395]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
[74073.077511]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
[74073.079440]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
[74073.081637]  [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14
[74073.083809]  [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197
[74073.086314]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
[74073.100654]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
[74073.102619]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
[74073.104771]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
[74073.106969]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
[74073.108954]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
[74073.110981]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
[74073.112833]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
[74073.115010]  [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs]
[74073.116999]  [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs]
[74073.119243]  [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs]
[74073.121636]  [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs]
[74073.124229]  [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs]
[74073.126372]  [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs]
[74073.129371]  [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs]
[74073.131440]  [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs]
[74073.134303]  [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1
[74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
[74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
[74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
[74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
[74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
[74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
[74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
[74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
[74073.143911]  [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae
[74073.145787]  [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7
[74073.147452]  [<ffffffff811b60cd>] wb_workfn+0x194/0x37d
[74073.149084]  [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d
[74073.150726]  [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4
[74073.152694]  [<ffffffff8106cf96>] process_one_work+0x273/0x4e4
[74073.154452]  [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
[74073.156138]  [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
[74073.157837]  [<ffffffff81072a81>] kthread+0xd5/0xdd
[74073.159339]  [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
[74073.161088]  [<ffffffff814c6257>] ret_from_fork+0x27/0x40
[74073.162680] INFO: lockdep is turned off.
[74073.163855] INFO: task do-dedup:30264 blocked for more than 120 seconds.
[74073.181180]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
[74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74073.185296] fdm-stress      D    0 30264  29974 0x00000000
[74073.186810]  ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00
[74073.188998]  ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000
[74073.191070]  ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00
[74073.193286] Call Trace:
[74073.193990]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
[74073.195418]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
[74073.196796]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
[74073.198163]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
[74073.199621]  [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf
[74073.201100]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
[74073.202686]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
[74073.204051]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
[74073.205585]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
[74073.207123]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
[74073.208238]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
[74073.208871]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
[74073.209430]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
[74073.210101]  [<ffffffff8112b800>] lock_page+0x2f/0x32
[74073.210636]  [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153
[74073.211270]  [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs]
[74073.212166]  [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs]
[74073.213257]  [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221
[74073.214086]  [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600
[74073.214767]  [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d
[74073.215619]  [<ffffffff811a7953>] ? __fget+0x6b/0x77
[74073.216338]  [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79
[74073.217149]  [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad
[74073.218102]  [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14
[74073.218968]  [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa
[74073.219938] INFO: lockdep is turned off.

What happened was the following:

      CPU 1                                       CPU 2

                                             btrfs_dedupe_file_range()
                                               --> using same inode as source
                                                   and target
                                               --> src range is [768K, 1Mb[
                                               --> dst range is [0, 256K[
                                              btrfs_cmp_data_prepare()
                                               --> calls gather_extent_pages()
                                                   for range [768K, 1Mb[ and
                                                   locks all pages in that range

 do_writepages()
  btrfs_writepages()
   extent_writepages()
    extent_write_cache_pages()
     __extent_writepage()
      writepage_delalloc()
       find_lock_delalloc_range()
         --> finds range [0, 1Mb[
         lock_delalloc_pages()
          --> locks all pages in the
              range [0, 768K[
          --> tries to lock page at
              offset 768K
                --> deadlock

                                               --> calls gather_extent_pages()
                                                   to lock pages in the range
                                                   [0, 256K[
                                                    --> deadlock, task at CPU 1
                                                        already locked that
                                                        range and it's trying
                                                        to lock the range we
                                                        locked previously

So fix this by making sure that during a dedup we always lock first the
pages from the range with lower offset.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-02-22 15:55:02 -08:00
Linus Torvalds
b2064617c7 driver core patches for 4.11-rc1
Here is the "small" driver core patches for 4.11-rc1.
 
 Not much here, some firmware documentation and self-test updates, a
 debugfs code formatting issue, and a new feature for call_usermodehelper
 to make it more robust on systems that want to lock it down in a more
 secure way.
 
 All of these have been linux-next for a while now with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWK2jKg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymCEACgozYuqZZ/TUGW0P3xVNi7fbfUWCEAn3nYExrc
 XgevqeYOSKp2We6X/2JX
 =aZ+5
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "small" driver core patches for 4.11-rc1.

  Not much here, some firmware documentation and self-test updates, a
  debugfs code formatting issue, and a new feature for call_usermodehelper
  to make it more robust on systems that want to lock it down in a more
  secure way.

  All of these have been linux-next for a while now with no reported
  issues"

* tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kernfs: handle null pointers while printing node name and path
  Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()
  Make static usermode helper binaries constant
  kmod: make usermodehelper path a const string
  firmware: revamp firmware documentation
  selftests: firmware: send expected errors to /dev/null
  selftests: firmware: only modprobe if driver is missing
  platform: Print the resource range if device failed to claim
  kref: prefer atomic_inc_not_zero to atomic_add_unless
  debugfs: improve formatting of debugfs_real_fops()
2017-02-22 11:44:32 -08:00
Linus Torvalds
3051bf36c2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
      Varadhan.

   2) Simplify classifier state on sk_buff in order to shrink it a bit.
      From Willem de Bruijn.

   3) Introduce SIPHASH and it's usage for secure sequence numbers and
      syncookies. From Jason A. Donenfeld.

   4) Reduce CPU usage for ICMP replies we are going to limit or
      suppress, from Jesper Dangaard Brouer.

   5) Introduce Shared Memory Communications socket layer, from Ursula
      Braun.

   6) Add RACK loss detection and allow it to actually trigger fast
      recovery instead of just assisting after other algorithms have
      triggered it. From Yuchung Cheng.

   7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.

   8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.

   9) Export MPLS packet stats via netlink, from Robert Shearman.

  10) Significantly improve inet port bind conflict handling, especially
      when an application is restarted and changes it's setting of
      reuseport. From Josef Bacik.

  11) Implement TX batching in vhost_net, from Jason Wang.

  12) Extend the dummy device so that VF (virtual function) features,
      such as configuration, can be more easily tested. From Phil
      Sutter.

  13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
      Dumazet.

  14) Add new bpf MAP, implementing a longest prefix match trie. From
      Daniel Mack.

  15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.

  16) Add new aquantia driver, from David VomLehn.

  17) Add bpf tracepoints, from Daniel Borkmann.

  18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
      Florian Fainelli.

  19) Remove custom busy polling in many drivers, it is done in the core
      networking since 4.5 times. From Eric Dumazet.

  20) Support XDP adjust_head in virtio_net, from John Fastabend.

  21) Fix several major holes in neighbour entry confirmation, from
      Julian Anastasov.

  22) Add XDP support to bnxt_en driver, from Michael Chan.

  23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.

  24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.

  25) Support GRO in IPSEC protocols, from Steffen Klassert"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
  Revert "ath10k: Search SMBIOS for OEM board file extension"
  net: socket: fix recvmmsg not returning error from sock_error
  bnxt_en: use eth_hw_addr_random()
  bpf: fix unlocking of jited image when module ronx not set
  arch: add ARCH_HAS_SET_MEMORY config
  net: napi_watchdog() can use napi_schedule_irqoff()
  tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
  net/hsr: use eth_hw_addr_random()
  net: mvpp2: enable building on 64-bit platforms
  net: mvpp2: switch to build_skb() in the RX path
  net: mvpp2: simplify MVPP2_PRS_RI_* definitions
  net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
  net: mvpp2: remove unused register definitions
  net: mvpp2: simplify mvpp2_bm_bufs_add()
  net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
  net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
  net: mvpp2: release reference to txq_cpu[] entry after unmapping
  net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
  net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
  net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
  ...
2017-02-22 10:15:09 -08:00
Linus Torvalds
6d1dd93ea0 Minor changes to pstore tree:
- update MAINTAINERS with current git repo, add more files.
 - move prz allocation checks into the walker
 - initialize flags correctly (by accident spinlock was technically ok)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJYrJo2AAoJEIly9N/cbcAm9ooP/jQcPPA4PrvC+U8Kv4JsPCbA
 AKiYvLpDr5F6f+8yUCyemrNj6dPdIvl0FJTCrPD2EZC6GEBXvy87umRIxzqweSi8
 E/MCD9PXJoXZZR4Auy9/O2tVd38BxLXqZAd9fVAouDEJUiNPWX2Zvmna9vEGK5Yh
 FAJ+GL1iHtH17smrN33DmXu4ro6gl+RVLzqYZc5MoQQcAar4c+IWlZyIuHLUNjF2
 FNLTJETyAsiCZQ/zvKgJW8gEWVo/qpSHelFEw/xTjcw2oAnp+H1821bd0UmzSlH6
 B4Veyv8DfB2Rw+7eAdFetxQDuvuKLZbrecqSr6h6zSftDC3oXyM+Bx+f3v6RL8mt
 3ijv4tToV27odBVJzIfPlXL9HN87/2P3UxVJt2e7X96BEyd/AXRXSJFOilD3tud6
 Zc6qanC9Agik+URZPRuQzfGK0csdlsqu6wMmOso5qo2fvQ6TsHATtZiwQZEh2i5r
 8gOZn7s1EYt5N1XXcTS5damCORnE0i4TvCryfJmR84WKwHSDe43fVnHloU/FXTjt
 EZM2ZIwpBMzpWh+glLl4+ARkEZMq8gMmNk1vCa+w0F6B2g+ipM3G9GqpWUiUKeKJ
 /LRQj8RPzIpeRe7z6wxGkDet5XZY/PM/eXriUXOz0f0Mw7WQiJdhOLsUO0kz0s6S
 /MWo2b1BHU32uJ41xqca
 =lSPN
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "Minor changes to pstore tree:

   - update MAINTAINERS with current git repo, add more files.

   - move prz allocation checks into the walker

   - initialize flags correctly (by accident spinlock was technically
     ok)"

* tag 'pstore-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  MAINTAINERS: Adjust pstore git repo URI, add files
  pstore: Check for prz allocation in walker
  pstore: Correctly initialize spinlock and flags
2017-02-21 17:51:37 -08:00
Linus Torvalds
c9341ee0af Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
 "Highlights:

   - major AppArmor update: policy namespaces & lots of fixes

   - add /sys/kernel/security/lsm node for easy detection of loaded LSMs

   - SELinux cgroupfs labeling support

   - SELinux context mounts on tmpfs, ramfs, devpts within user
     namespaces

   - improved TPM 2.0 support"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits)
  tpm: declare tpm2_get_pcr_allocation() as static
  tpm: Fix expected number of response bytes of TPM1.2 PCR Extend
  tpm xen: drop unneeded chip variable
  tpm: fix misspelled "facilitate" in module parameter description
  tpm_tis: fix the error handling of init_tis()
  KEYS: Use memzero_explicit() for secret data
  KEYS: Fix an error code in request_master_key()
  sign-file: fix build error in sign-file.c with libressl
  selinux: allow changing labels for cgroupfs
  selinux: fix off-by-one in setprocattr
  tpm: silence an array overflow warning
  tpm: fix the type of owned field in cap_t
  tpm: add securityfs support for TPM 2.0 firmware event log
  tpm: enhance read_log_of() to support Physical TPM event log
  tpm: enhance TPM 2.0 PCR extend to support multiple banks
  tpm: implement TPM 2.0 capability to get active PCR banks
  tpm: fix RC value check in tpm2_seal_trusted
  tpm_tis: fix iTPM probe via probe_itpm() function
  tpm: Begin the process to deprecate user_read_timer
  tpm: remove tpm_read_index and tpm_write_index from tpm.h
  ...
2017-02-21 12:49:56 -08:00
Jan Kara
cccd9fb9ec block: Revalidate i_bdev reference in bd_aquire()
When a device gets removed, block device inode unhashed so that it is not
used anymore (bdget() will not find it anymore). Later when a new device
gets created with the same device number, we create new block device
inode. However there may be file system device inodes whose i_bdev still
points to the original block device inode and thus we get two active
block device inodes for the same device. They will share the same
gendisk so the only visible differences will be that page caches will
not be coherent and BDIs will be different (the old block device inode
still points to unregistered BDI).

Fix the problem by checking in bd_acquire() whether i_bdev still points
to active block device inode and re-lookup the block device if not. That
way any open of a block device happening after the old device has been
removed will get correct block device inode.

Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-21 12:51:54 -07:00
Eric W. Biederman
ace0c791e6 proc/sysctl: Don't grab i_lock under sysctl_lock.
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> This patch has locking problem. I've got lockdep splat under LTP.
>
> [ 6633.115456] ======================================================
> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G             L
> [ 6633.115584] -------------------------------------------------------
> [ 6633.115627] ksm02/284980 is trying to acquire lock:
> [ 6633.115659]  (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
> [ 6633.115834] but task is already holding lock:
> [ 6633.115882]  (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
> [ 6633.116026] which lock already depends on the new lock.
> [ 6633.116026]
> [ 6633.116080]
> [ 6633.116080] the existing dependency chain (in reverse order) is:
> [ 6633.116117]
> -> #2 (sysctl_lock){+.+...}:
> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>
> d_lock nests inside i_lock
> sysctl_lock nests inside d_lock in d_compare
>
> This patch adds i_lock nesting inside sysctl_lock.

Al Viro <viro@ZenIV.linux.org.uk> replied:
> Once ->unregistering is set, you can drop sysctl_lock just fine.  So I'd
> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
> drop sysctl_lock() before it and regain after.  Make sure that no inodes
> are added to the list ones ->unregistering has been set and use RCU list
> primitives for modifying the inode list, with sysctl_lock still used to
> serialize its modifications.
>
> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
> igrab() is safe there.  Since we don't drop inode reference until after we'd
> passed beyond it in the list, list_for_each_entry_rcu() should be fine.

I agree with Al Viro's analsysis of the situtation.

Fixes: d6cffbbe9a ("proc/sysctl: prune stale dentries during unregistering")
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-22 08:34:53 +13:00
Linus Torvalds
772c8f6f3b for-4.11/linus-merge-signed
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3
 J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm
 eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h
 r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK
 bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl
 DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz
 vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv
 Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn
 o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ
 cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE
 YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN
 7ut0IYYEPaYUX5xFn1K6
 =z7AL
 -----END PGP SIGNATURE-----

Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:

 - blk-mq scheduling framework from me and Omar, with a port of the
   deadline scheduler for this framework. A port of BFQ from Paolo is in
   the works, and should be ready for 4.12.

 - Various fixups and improvements to the above scheduling framework
   from Omar, Paolo, Bart, me, others.

 - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
   This allows us to export more information that helps debug hangs or
   performance issues, without cluttering or abusing the sysfs API.

 - Fixes for the sbitmap code, the scalable bitmap code that was
   migrated from blk-mq, from Omar.

 - Removal of the BLOCK_PC support in struct request, and refactoring of
   carrying SCSI payloads in the block layer. This cleans up the code
   nicely, and enables us to kill the SCSI specific parts of struct
   request, shrinking it down nicely. From Christoph mainly, with help
   from Hannes.

 - Support for ranged discard requests and discard merging, also from
   Christoph.

 - Support for OPAL in the block layer, and for NVMe as well. Mainly
   from Scott Bauer, with fixes/updates from various others folks.

 - Error code fixup for gdrom from Christophe.

 - cciss pci irq allocation cleanup from Christoph.

 - Making the cdrom device operations read only, from Kees Cook.

 - Fixes for duplicate bdi registrations and bdi/queue life time
   problems from Jan and Dan.

 - Set of fixes and updates for lightnvm, from Matias and Javier.

 - A few fixes for nbd from Josef, using idr to name devices and a
   workqueue deadlock fix on receive. Also marks Josef as the current
   maintainer of nbd.

 - Fix from Josef, overwriting queue settings when the number of
   hardware queues is updated for a blk-mq device.

 - NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
   aborted, if we didn't end up aborting it.

 - SG gap merging fix from Ming Lei for block.

 - Loop fix also from Ming, fixing a race and crash between setting loop
   status and IO.

 - Two block race fixes from Tahsin, fixing request list iteration and
   fixing a race between device registration and udev device add
   notifiations.

 - Double free fix from cgroup writeback, from Tejun.

 - Another double free fix in blkcg, from Hou Tao.

 - Partition overflow fix for EFI from Alden Tondettar.

* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
  nvme: Check for Security send/recv support before issuing commands.
  block/sed-opal: allocate struct opal_dev dynamically
  block/sed-opal: tone down not supported warnings
  block: don't defer flushes on blk-mq + scheduling
  blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
  blk-mq: don't special case flush inserts for blk-mq-sched
  blk-mq-sched: don't add flushes to the head of requeue queue
  blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
  block: do not allow updates through sysfs until registration completes
  lightnvm: set default lun range when no luns are specified
  lightnvm: fix off-by-one error on target initialization
  Maintainers: Modify SED list from nvme to block
  Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
  uapi: sed-opal fix IOW for activate lsp to use correct struct
  cdrom: Make device operations read-only
  elevator: fix loading wrong elevator type for blk-mq devices
  cciss: switch to pci_irq_alloc_vectors
  block/loop: fix race between I/O and set_status
  blk-mq-sched: don't hold queue_lock when calling exit_icq
  block: set make_request_fn manually in blk_mq_update_nr_hw_queues
  ...
2017-02-21 10:57:33 -08:00
Linus Torvalds
9763dd6f81 We've got eight GFS2 patches for this merge window:
1. Andy Price submitted a patch to make gfs2_write_full_page a
    static function.
 2. Dan Carpenter submitted a patch to fix a ERR_PTR thinko.
 
 I've also got a few patches, three of which fix bugs related to
 deleting very large files, which cause GFS2 to run out of
 journal space:
 
 3. The first one prevents GFS2 delete operation from requesting too
    much journal space.
 4. The second one fixes a problem whereby GFS2 can hang because it
    wasn't taking journal space demand into its calculations.
 5. The third one wakes up IO waiters when a flush is done to restart
    processes stuck waiting for journal space to become available.
 
 The other three patches are a performance improvement related to
 spin_lock contention between multiple writers:
 
 6. The "tr_touched" variable was switched to a flag to be more atomic
    and eliminate the possibility of some races.
 7. Function meta_lo_add was moved inline with its only caller to make
    the code more readable and efficient.
 8. Contention on the gfs2_log_lock spinlock was greatly reduced by
    avoiding the lock altogether in cases where we don't really need
    it: buffers that already appear in the appropriate metadata list
    for the journal. Many thanks to Steve Whitehouse for the ideas and
    principles behind these patches.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYrEEEAAoJENeLYdPf93o7bjoIAIqPG/EAzi+idgMWDPQa9Eit
 53dPy16snkrbWwtaK6spSWlH6bGYuHeanXORYon9bvtVjKYaa4NQclGihN2IE6uB
 O8zT+MGwP45LDhNplVJpumaOALZ9ZDqQSe+3tHeNK3FhNirLyiIjSqrHt/7Yi1qi
 fPLlT4Jx0TBo5rhvEGa7Yg01WhWVtnmVSMqJXj/7ZtC50s1aPyDUikdNIDfDCN2X
 LxfKGDXuk6p63VQ6JKqYSBVATCR0/bbKfkuk/kBUTYLoHoapImxB8d0HgIdsh1Mv
 9PlbZnnNW8k5oapuhVxjl0T5G0JsQgCkPb/wlte+ryOCjBoc2L2fCUV5qc0QxWc=
 =xQyl
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 updates from Robert Peterson:
 "We've got eight GFS2 patches for this merge window:

   - Andy Price submitted a patch to make gfs2_write_full_page a static
     function.

   - Dan Carpenter submitted a patch to fix a ERR_PTR thinko.

  Three patches fix bugs related to deleting very large files, which
  cause GFS2 to run out of journal space:

   - The first one prevents GFS2 delete operation from requesting too
     much journal space.

   - The second one fixes a problem whereby GFS2 can hang because it
     wasn't taking journal space demand into its calculations.

   - The third one wakes up IO waiters when a flush is done to restart
     processes stuck waiting for journal space to become available.

  The final three patches are a performance improvement related to
  spin_lock contention between multiple writers:

   - The "tr_touched" variable was switched to a flag to be more atomic
     and eliminate the possibility of some races.

   - Function meta_lo_add was moved inline with its only caller to make
     the code more readable and efficient.

   - Contention on the gfs2_log_lock spinlock was greatly reduced by
     avoiding the lock altogether in cases where we don't really need
     it: buffers that already appear in the appropriate metadata list
     for the journal. Many thanks to Steve Whitehouse for the ideas and
     principles behind these patches"

* tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Make gfs2_write_full_page static
  GFS2: Reduce contention on gfs2_log_lock
  GFS2: Inline function meta_lo_add
  GFS2: Switch tr_touched to flag in transaction
  GFS2: Wake up io waiters whenever a flush is done
  GFS2: Made logd daemon take into account log demand
  GFS2: Limit number of transaction blocks requested for truncates
  GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
2017-02-21 07:46:34 -08:00
Linus Torvalds
70fcf5c339 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes and cleanups from Jan Kara:
 "Several small UDF fixes and cleanups and a small cleanup of fanotify
  code"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: simplify the code of fanotify_merge
  udf: simplify udf_ioctl()
  udf: fix ioctl errors
  udf: allow implicit blocksize specification during mount
  udf: check partition reference in udf_read_inode()
  udf: atomically read inode size
  udf: merge module informations in super.c
  udf: remove next_epos from udf_update_extent_cache()
  udf: Factor out trimming of crtime
  udf: remove empty condition
  udf: remove unneeded line break
  udf: merge bh free
  udf: use pointer for kernel_long_ad argument
  udf: use __packed instead of __attribute__ ((packed))
  udf: Make stat on symlink report symlink length as st_size
  fs/udf: make #ifdef UDF_PREALLOCATE unconditional
  fs: udf: Replace CURRENT_TIME with current_time()
2017-02-21 07:44:03 -08:00
Linus Torvalds
2bfe01eff4 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS/SMB3 updates from Steve French:
 "Includes support for a critical SMB3 security feature: per-share
  encryption from Pavel, and a cleanup from Jean Delvare.

  Will have another cifs/smb3 merge next week"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Allow to switch on encryption with seal mount option
  CIFS: Add capability to decrypt big read responses
  CIFS: Decrypt and process small encrypted packets
  CIFS: Add copy into pages callback for a read operation
  CIFS: Add mid handle callback
  CIFS: Add transform header handling callbacks
  CIFS: Encrypt SMB3 requests before sending
  CIFS: Enable encryption during session setup phase
  CIFS: Add capability to transform requests before sending
  CIFS: Separate RFC1001 length processing for SMB2 read
  CIFS: Separate SMB2 sync header processing
  CIFS: Send RFC1001 length in a separate iov
  CIFS: Make send_cancel take rqst as argument
  CIFS: Make SendReceive2() takes resp iov
  CIFS: Separate SMB2 header structure
  CIFS: Fix splice read for non-cached files
  cifs: Add soft dependencies
  cifs: Only select the required crypto modules
  cifs: Simplify SMB2 and SMB311 dependencies
2017-02-20 18:38:47 -08:00
Linus Torvalds
cab7076a18 For this cycle we add support for the shutdown ioctl, which is
primarily used for testing, but which can be useful on production
 systems when a scratch volume is being destroyed and the data on it
 doesn't need to be saved.  This found (and we fixed) a number of bugs
 with ext4's recovery to corrupted file system --- the bugs increased
 the amount of data that could be potentially lost, and in the case of
 the inline data feature, could cause the kernel to BUG.
 
 Also included are a number of other bug fixes, including in ext4's
 fscrypt, DAX, inline data support.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirXesACgkQ8vlZVpUN
 gaMOzQf8Ct6uPatV+m855oR4dAbZr2+lY4A4C+vHDzBtSMkPRyLX8cuo8XcwfTIm
 vPVyDnL6EPyhXPxxfItu+92wAq1m5mVpKo57d0Ft5lw0rHxNtJTgVSRzsQ7VDRjj
 5qMHW2K7Bk7EjzTeW3SF8/3+hqpzkAvRtNCntcomk5h08+cWMC8JSnn1kqw+naIn
 EcbrC72GZb8JUELogVXC2vU58lp50SSBdr3l005jqKc5BvljMvdJ0Izn/3RVyU7u
 q7vtynhe2ScFcHe/UzL1QgmQOy32tJpbS0NHalW47aw3Ynmn4cSX0YhhT9FDjRNQ
 VOOfo1m1sAg166x0E+Nn7FeghTSSyA==
 =cPIf
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "For this cycle we add support for the shutdown ioctl, which is
  primarily used for testing, but which can be useful on production
  systems when a scratch volume is being destroyed and the data on it
  doesn't need to be saved.

  This found (and we fixed) a number of bugs with ext4's recovery to
  corrupted file system --- the bugs increased the amount of data that
  could be potentially lost, and in the case of the inline data feature,
  could cause the kernel to BUG.

  Also included are a number of other bug fixes, including in ext4's
  fscrypt, DAX, inline data support"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
  ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
  ext4: fix fencepost in s_first_meta_bg validation
  ext4: don't BUG when truncating encrypted inodes on the orphan list
  ext4: do not use stripe_width if it is not set
  ext4: fix stripe-unaligned allocations
  dax: assert that i_rwsem is held exclusive for writes
  ext4: fix DAX write locking
  ext4: add EXT4_IOC_GOINGDOWN ioctl
  ext4: add shutdown bit and check for it
  ext4: rename s_resize_flags to s_ext4_flags
  ext4: return EROFS if device is r/o and journal replay is needed
  ext4: preserve the needs_recovery flag when the journal is aborted
  jbd2: don't leak modified metadata buffers on an aborted journal
  ext4: fix inline data error paths
  ext4: move halfmd4 into hash.c directly
  ext4: fix use-after-iput when fscrypt contexts are inconsistent
  jbd2: fix use after free in kjournald2()
  ext4: fix data corruption in data=journal mode
  ext4: trim allocation requests to group size
  ext4: replace BUG_ON with WARN_ON in mb_find_extent()
  ...
2017-02-20 18:24:39 -08:00
Linus Torvalds
6c24337f22 Various cleanups for the file system encryption feature.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirP6wACgkQ8vlZVpUN
 gaMwpQgApR67CxzlstxYjZpWPAqC8McJ2FBDX+mCOle5Vkc1WQDklwr0oCfQThTj
 eDSFRhNfIvyPh0DJ589PxBCsWOqN5h6Si7hD5ZinomVNI+IL89OytaU5EV2OpWaW
 iKdJgO9Tm8U7LuY6FOIoVdX57kUXVdkWoj61rC056B1SNhnNiVeofi7lYDM8Ix4q
 IGSQ9W24iQKmCk4hCwgObhJBRK9RnlOH0GLUmpMaS+jnfnj/uePwdxWEFsPuCOob
 8acAJ49lr55kjIw79E0BAyWxhEZ2aiArHk8PaWynT/DyNq3ftcapPlpftoeba8vo
 glBJRX70QxPvt0iHEp0ykfExkhWhFA==
 =Joki
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Various cleanups for the file system encryption feature"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: constify struct fscrypt_operations
  fscrypt: properly declare on-stack completion
  fscrypt: split supp and notsupp declarations into their own headers
  fscrypt: remove redundant assignment of res
  fscrypt: make fscrypt_operations.key_prefix a string
  fscrypt: remove unused 'mode' member of fscrypt_ctx
  ext4: don't allow encrypted operations without keys
  fscrypt: make test_dummy_encryption require a keyring key
  fscrypt: factor out bio specific functions
  fscrypt: pass up error codes from ->get_context()
  fscrypt: remove user-triggerable warning messages
  fscrypt: use EEXIST when file already uses different policy
  fscrypt: use ENOTDIR when setting encryption policy on nondirectory
  fscrypt: use ENOKEY when file cannot be created w/o key
2017-02-20 18:22:31 -08:00
Linus Torvalds
42e1b14b6e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Implement wraparound-safe refcount_t and kref_t types based on
     generic atomic primitives (Peter Zijlstra)

   - Improve and fix the ww_mutex code (Nicolai Hähnle)

   - Add self-tests to the ww_mutex code (Chris Wilson)

   - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
     Bueso)

   - Micro-optimize the current-task logic all around the core kernel
     (Davidlohr Bueso)

   - Tidy up after recent optimizations: remove stale code and APIs,
     clean up the code (Waiman Long)

   - ... plus misc fixes, updates and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  fork: Fix task_struct alignment
  locking/spinlock/debug: Remove spinlock lockup detection code
  lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
  lkdtm: Convert to refcount_t testing
  kref: Implement 'struct kref' using refcount_t
  refcount_t: Introduce a special purpose refcount type
  sched/wake_q: Clarify queue reinit comment
  sched/wait, rcuwait: Fix typo in comment
  locking/mutex: Fix lockdep_assert_held() fail
  locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
  locking/rwsem: Reinit wake_q after use
  locking/rwsem: Remove unnecessary atomic_long_t casts
  jump_labels: Move header guard #endif down where it belongs
  locking/atomic, kref: Implement kref_put_lock()
  locking/ww_mutex: Turn off __must_check for now
  locking/atomic, kref: Avoid more abuse
  locking/atomic, kref: Use kref_get_unless_zero() more
  locking/atomic, kref: Kill kref_sub()
  locking/atomic, kref: Add kref_read()
  locking/atomic, kref: Add KREF_INIT()
  ...
2017-02-20 13:23:30 -08:00
Linus Torvalds
828cad8ea0 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this (fairly busy) cycle were:

   - There was a class of scheduler bugs related to forgetting to update
     the rq-clock timestamp which can cause weird and hard to debug
     problems, so there's a new debug facility for this: which uncovered
     a whole lot of bugs which convinced us that we want to keep the
     debug facility.

     (Peter Zijlstra, Matt Fleming)

   - Various cputime related updates: eliminate cputime and use u64
     nanoseconds directly, simplify and improve the arch interfaces,
     implement delayed accounting more widely, etc. - (Frederic
     Weisbecker)

   - Move code around for better structure plus cleanups (Ingo Molnar)

   - Move IO schedule accounting deeper into the scheduler plus related
     changes to improve the situation (Tejun Heo)

   - ... plus a round of sched/rt and sched/deadline fixes, plus other
     fixes, updats and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
  sched/core: Remove unlikely() annotation from sched_move_task()
  sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
  sched/topology: Split out scheduler topology code from core.c into topology.c
  sched/core: Remove unnecessary #include headers
  sched/rq_clock: Consolidate the ordering of the rq_clock methods
  delayacct: Include <uapi/linux/taskstats.h>
  sched/core: Clean up comments
  sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
  sched/clock: Add dummy clear_sched_clock_stable() stub function
  sched/cputime: Remove generic asm headers
  sched/cputime: Remove unused nsec_to_cputime()
  s390, sched/cputime: Remove unused cputime definitions
  powerpc, sched/cputime: Remove unused cputime definitions
  s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
  ia64, sched/cputime: Remove unused cputime definitions
  ia64: Convert vtime to use nsec units directly
  ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
  sched/cputime: Remove jiffies based cputime
  sched/cputime, vtime: Return nsecs instead of cputime_t to account
  sched/cputime: Complete nsec conversion of tick based accounting
  ...
2017-02-20 12:52:55 -08:00
Theodore Ts'o
e9be2ac7c0 ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
It's very likely the file system independent ioctl name will be
FS_IOC_SHUTDOWN, so let's use the same name for the ext4 ioctl name.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-20 15:34:59 -05:00
Linus Torvalds
20dcfe1b7d Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Nothing exciting, just the usual pile of fixes, updates and cleanups:

   - A bunch of clocksource driver updates

   - Removal of CONFIG_TIMER_STATS and the related /proc file

   - More posix timer slim down work

   - A scalability enhancement in the tick broadcast code

   - Math cleanups"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  hrtimer: Catch invalid clockids again
  math64, tile: Fix build failure
  clocksource/drivers/arm_arch_timer:: Mark cyclecounter __ro_after_init
  timerfd: Protect the might cancel mechanism proper
  timer_list: Remove useless cast when printing
  time: Remove CONFIG_TIMER_STATS
  clocksource/drivers/arm_arch_timer: Work around Hisilicon erratum 161010101
  clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure
  clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter
  clocksource/drivers/arm_arch_timer: Add dt binding for hisilicon-161010101 erratum
  clocksource/drivers/ostm: Add renesas-ostm timer driver
  clocksource/drivers/ostm: Document renesas-ostm timer DT bindings
  clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock
  clocksource/drivers/gemini: Add driver for the Cortina Gemini
  clocksource: add DT bindings for Cortina Gemini
  clockevents: Add a clkevt-of mechanism like clksrc-of
  tick/broadcast: Reduce lock cacheline contention
  timers: Omit POSIX timer stuff from task_struct when disabled
  x86/timer: Make delay() work during early bootup
  delay: Add explanation of udelay() inaccuracy
  ...
2017-02-20 10:06:32 -08:00
Christoph Hellwig
8d242e932f xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG
XFS_ALLOCTYPE_ANY_AG  was only used for the RT allocator and is unused
now, and XFS_ALLOCTYPE_START_AG has been unused for a while.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-17 20:32:10 -08:00
Christoph Hellwig
089ec2f875 xfs: simplify xfs_rtallocate_extent
We can deduce the allocation type from the bno argument, and do the
return without prod much simpler internally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix the macro for the non-rt build]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-17 16:52:52 -08:00
Jens Axboe
818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Herbert Xu
98687f426b gfs2: Use rhashtable walk interface in glock_hash_walk
The function glock_hash_walk walks the rhashtable by hand.  This
is broken because if it catches the hash table in the middle of
a rehash, then it will miss entries.

This patch replaces the manual walk by using the rhashtable walk
interface.

Fixes: 88ffbf3e03 ("GFS2: Use resizable hash table for glocks")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 12:28:22 -05:00
Jeff Mahoney
71367b3fa7 btrfs: use btrfs_debug instead of pr_debug in transaction abort
Commit e5d6b12fe1 (Btrfs: don't WARN() in btrfs_transaction_abort() for
IO errors) added a pr_debug call to be printed when a transaction is
aborted with -EIO instead of WARN.  btrfs_debug prints which file system
the message is associated with so let's use that instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
21e75ffe3c btrfs: btrfs_truncate_free_space_cache always allocates path
btrfs_truncate_free_space_cache always allocates a btrfs_path structure
but only uses it when the caller passes a block group.  Let's move the
allocation and free into the conditional.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
77ab86bf1c btrfs: free-space-cache, clean up unnecessary root arguments
The free space cache APIs accept a root but always use the tree root.

Also, btrfs_truncate_free_space_cache accepts a root AND an inode but
the inode always points to the root anyway, so let's just pass the inode.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
5e00f1939f btrfs: convert btrfs_inc_block_group_ro to accept fs_info
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree.  Let's convert
to passing an fs_info and using the extent root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
0c9ab349c2 btrfs: flush_space always takes fs_info->fs_root
We don't need to pass a root to flush_space since it always uses
the fs_root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
Jeff Mahoney
87bde3cdfc btrfs: pass fs_info to (more) routines that are only called with extent_root
Outside of interactions with qgroups, the roots passed in extent-tree.c
are usually passed to ensure that we don't do refcounts on log trees or
to get the allocation profile for an allocation request.  Otherwise, it
operates on the extent root.  This patch converts some more routines in
extent-tree.c that are always called with the extent root to accept
an fs_info instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
Qu Wenruo
fb235dc06f btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
Just as Filipe pointed out, the most time consuming parts of qgroup are
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents().
Which both call btrfs_find_all_roots() to get old_roots and new_roots
ulist.

What makes things worse is, we're calling that expensive
btrfs_find_all_roots() at transaction committing time with
TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction.

Such behavior is necessary for @new_roots search as current
btrfs_find_all_roots() can't do it correctly so we do call it just
before switch commit roots.

However for @old_roots search, it's not necessary as such search is
based on commit_root, so it will always be correct and we can move it
out of transaction committing.

This patch moves the @old_roots search part out of
commit_transaction(), so in theory we can half the time qgroup time
consumption at commit_transaction().

But please note that, this won't speedup qgroup overall, the total time
consumption is still the same, just reduce the performance stall.

Cc: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
15b34517a6 btrfs: remove unused parameter from adjust_slots_upwards
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
0e8d931a82 btrfs: remove unused parameters from __btrfs_write_out_cache
Both unused after the call to update_cache_item has been moved to
__btrfs_wait_cache_io.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
7bf1a15912 btrfs: remove unused parameter from cleanup_write_cache_enospc
bitmap_list is unused since the io_ctl framework.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
d75eefdf96 btrfs: remove unused parameter from __add_inode_ref
Unused since the helper has been split, eb used in the caller.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
4a0ab9d711 btrfs: remove unused parameter from clone_copy_inline_extent
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
1a287cfea1 btrfs: remove unused parameters from btrfs_cmp_data
After the page locking has been reworked, we get all pages prepared via
cmp_pages.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
eeac44cb49 btrfs: remove unused parameter from __add_inline_refs
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
e5987e1319 btrfs: remove unused parameters from scrub_setup_wr_ctx
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
61d7e4cb11 btrfs: remove unused parameter from create_snapshot
The name parameters have never been used, as the name is passed via the
dentry.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
e4a4dce72e btrfs: remove unused parameter from init_first_rw_device
The 'device' used to be added in that function, but now it's done by the
caller.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
David Sterba
72b468c8da btrfs: remove unused parameter from __btrfs_alloc_chunk
We grab fs_info from other parameters.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
56e033a787 btrfs: remove unused parameter from btrfs_fill_super
Never used for anything meaningful since we have our own superblock
filler.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
4242b64a4c btrfs: remove unused parameter from extent_write_cache_pages
The 'tree' was used to call locking hook that does not exist anymore.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
df9f628e3d btrfs: remove unused parameter from add_pending_csums
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
3d4b9496e8 btrfs: remove unused parameter from update_nr_written
The logic has been updated in "Btrfs: make mapping->writeback_index
point to the last written page" (a91326679f) and page is not
needed anymore.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
c2df8bb43f btrfs: remove unused parameter from submit_extent_page
This used to hold number of maximum pages to allocate, but this is now
limited by BIO_MAX_PAGES. The local are now unused and removed as well.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:53 +01:00
David Sterba
f1e3026192 btrfs: remove unused parameter from tree_move_next_or_upnext
Not needed.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
ab6a43e122 btrfs: remove unused parameter from tree_move_down
Never needed.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
3d3a126a81 btrfs: remove unused parameter from btrfs_check_super_valid
None of the checks need to know the ro/rw status as they're all not
changing the superblock. Moreover, we can access the sb flags directly
if we'd need to decide by the ro/rw status.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
8b74c03e3c btrfs: remove unused parameter from btrfs_prepare_extent_commit
Added but never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
7775c8184e btrfs: remove unused parameter from btrfs_subvolume_release_metadata
Unused since qgroup refactoring that split data and metadata accounting,
the btrfs_qgroup_free helper.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
66cb7ddbf2 btrfs: remove unused parameter from __push_leaf_left
Unused since long ago.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
1e47eef223 btrfs: remove unused parameter from __push_leaf_right
Unused since long ago.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
eece6a9cf6 btrfs: merge two superblock writing helpers
write_all_supers and write_ctree_super are almost equal, the parameter
'trans' is unused so we can drop it and have just one helper.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
b75f506243 btrfs: remove unused parameter from write_dev_supers
The barriers are handled by the caller.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
4961e2930f btrfs: remove unused parameter from split_item
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
7c302b49dd btrfs: remove unused parameter from clean_tree_block
Added but never needed.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
e27f62652b btrfs: remove unused parameter from check_async_write
Added but never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
David Sterba
cda79c545e btrfs: remove unused parameter from read_block_for_search
Never used in that function.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
6655bc3de1 btrfs: ulist: rename ulist_fini to ulist_release
Change the name so it matches the naming we already use eg. for
btrfs_path.

Suggested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
4ae8553c2d btrfs: remove pointless rcu protection from btrfs_qgroup_inherit
There was never need for RCU protection around reading nodesize or other
fairly constant filesystem data.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
0b08e1f4f7 btrfs: qgroups: opencode qgroup_free helper
The helper name is not too helpful and is just wrapping a simple call.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
9ea6e2b548 btrfs: remove unnecessary mutex lock in qgroup_account_snapshot
The quota status used to be tracked as a variable, so the mutex was
needed (until "Btrfs: add a flags field to btrfs_fs_info" afcdd129e0).
Since the status is a bit modified atomically and we don't hold the
mutex beyond the check, we can drop it.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
81353d50f5 btrfs: check quota status earlier and don't do unnecessary frees
Status of quotas should be the first check in
btrfs_qgroup_account_extent and we can return immediatelly, no need to
do no-op ulist frees.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:50 +01:00
David Sterba
53d3235995 btrfs: embed extent_changeset::range_changed to the structure
We can embed range_changed to the extent changeset to address following
problems:

- no need to allocate ulist dynamically, we also get rid of the GFP_NOFS
  for free
- fix lack of allocation failure checking in btrfs_qgroup_reserve_data

The stack consuption where extent_changeset is used slightly increases:

before: 16
after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40

Which is bearable.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
9d03793386 btrfs: ulist: make the finalization function public
Make ulist_fini externally visible so the ulist API is complete.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
025db916aa btrfs: qgroups: make __del_qgroup_relation static
Internal helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
1d4805386e btrfs: make space cache inode readahead failure nonfatal
We do a readahead of the free space cache inode to speed things up but
the failure is not fatal, like in other readahead cases. Proper reads
would need to happen anyway and any errors would be caught there.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
6602caf149 btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation
Qgroup relations are added/deleted from ioctl, we hold the high level
qgroup lock, no deadlocks or recursion from the allocation possible
here.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
52bf8e7aea btrfs: use GFP_KERNEL in btrfs_quota_enable
We don't need to use GFP_NOFS here as this is called from ioctls an the
only lock held is the subvol_sem, which is of a high level and protects
creation/renames/deletion and is never held in the writeout paths.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
323b88f4ab btrfs: use GFP_KERNEL in btrfs_read_qgroup_config
The qgroup config is read during mount, we do not have to use NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:49 +01:00
David Sterba
23269bf5ea btrfs: use GFP_KERNEL in create_snapshot
We don't need to use GFP_NOFS here as this is called from ioctls an the
only lock held is the subvol_sem, which is of a high level and protects
creation/renames/deletion and is never held in the writeout paths.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
1af4a0aaa5 Btrfs: specify a new ordered extent type for create_io_em
As 0 refers to an existing type BTRFS_ORDERED_IO_DONE, this specifies a
new type 'REGULAR' for regular IO.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
6f9994dbab Btrfs: create a helper to create em for IO
We have similar codes to create and insert extent mapping around IO path,
this merges them into a single helper.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
4136135b08 Btrfs: use helper to get used bytes of space_info
This uses a helper instead of open code around used byte of space_info
everywhere.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
0c9b36e0d7 Btrfs: try to avoid acquiring free space ctl's lock
We don't need to take the lock if the block group has not been cached.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Qu Wenruo
6f6b643e44 btrfs: Better csum error message for data csum mismatch
The original csum error message only outputs inode number, offset, check
sum and expected check sum.

However no root objectid is outputted, which sometimes makes debugging
quite painful under multi-subvolume case (including relocation).

Also the checksum output is decimal, which seldom makes sense for
users/developers and is hard to read in most time.

This patch will add root objectid, which will be %lld for rootid larger
than LAST_FREE_OBJECTID, and hex csum output for better readability.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Takafumi Kubota
fe01aa6538 Btrfs: add another missing end_page_writeback on submit_extent_page failure
If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.

__extent_writepage_io, that is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this will cause the hang at filemap_fdatawait_range,
because it waits the writeback bit to be cleared from the failed page.
So, we have to call end_page_writeback to clear the writeback bit.

For reproducing the hang, we inject a fault like

   if (should_failtest()) { // I define should_failtest()
        bio = NULL;
   }
   else {
        bio = btrfs_bio_alloc(...);
   }

in submit_extent_page.

We should also check whether page has the bit before end_page_writeback,
to avoid the conflict against the other end_page_writeback in bio_endio.
Thus, we add PageWriteback checks not only in __extent_writepage_io,
but also in write_one_eb too, because it misses the check.

Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:47 +01:00
David Sterba
66bbc1c0c0 btrfs: remove unused ulist members
Commit "btrfs: ulist: Add ulist_del() function" (d4b8040459)
removed some debugging code but left the structure defintions.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:47 +01:00
Liu Bo
76c0021db8 Btrfs: use helper to simplify lock/unlock pages
Since we have a helper to set page bits, let lock_delalloc_pages and
__unlock_for_delalloc use it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:47 +01:00
Liu Bo
da2c7009f6 btrfs: teach __process_pages_contig about PAGE_LOCK operation
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changes to the helper separated from the following patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:35 +01:00
Christoph Hellwig
410d17f67e xfs: tune down agno asserts in the bmap code
In various places we currently assert that xfs_bmap_btalloc allocates
from the same as the firstblock value passed in, unless it's either
NULLAGNO or the dop_low flag is set.  But the reflink code does not
fully follow this convention as it passes in firstblock purely as
a hint for the allocator without actually having previous allocations
in the transaction, and without having a minleft check on the current
AG, leading to the assert firing on a very full and heavily used
file system.  As even the reflink code only allocates from equal or
higher AGs for now we can simply the check to always allow for equal
or higher AGs.

Note that we need to eventually split the two meanings of the firstblock
value.  At that point we can also allow the reflink code to allocate
from any AG instead of limiting it in any way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:39 -08:00
Chandan Rajendra
8ee9fdbebc xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment
On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace,

XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026

kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048
DEBUG_PAGEALLOC
NUMA
pSeries
Modules linked in:
CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58
task: c000000102606d80 task.stack: c0000001026b8000
NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290
REGS: c0000001026bb090 TRAP: 0700   Not tainted  (4.10.0-rc5)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>
CR: 28004428  XER: 00000000
CFAR: c0000000004ef180 SOFTE: 1
GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea
GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0
GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990
GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000
GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0
GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001
GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000
GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0
NIP [c0000000004ef798] .assfail+0x28/0x30
LR [c0000000004ef798] .assfail+0x28/0x30
Call Trace:
[c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable)
[c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0
[c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480
[c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90
[c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820
[c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320
[c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610
[c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0
[c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790
[c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410
[c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230
[c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120
[c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc
Instruction dump:
4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78
38600000 f8010010 f821ff91 4bfff94d <0fe00000> 60000000 7c0802a6 7c892378

When block size is larger than inode cluster size, the call to
XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
would have set xfs_sb->sb_inoalignmt to 0. This causes
xfs_ialloc_cluster_alignment() to return 0.  Due to this
args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned
equivalent of -1 assigned to it. This later causes alloc_len in
xfs_alloc_space_available() to have a value of 0. In such a scenario
when args.total is also 0, the assert statement "ASSERT(args->maxlen >
0);" fails.

This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in
xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb().

Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:38 -08:00
Brian Foster
48af96ab92 xfs: don't reserve blocks for right shift transactions
The block reservation for the transaction allocated in
xfs_shift_file_space() is an artifact of the original collapse range
support. It exists to handle the case where a collapse range occurs,
the initial extent is left shifted into a location that forms a
contiguous boundary with the previous extent and thus the extents
are merged. This code was subsequently refactored and reused for
insert range (right shift) support.

If an insert range occurs under low free space conditions, the
extent at the starting offset is split before the first shift
transaction is allocated. If the block reservation fails, this
leaves separate, but contiguous extents around in the inode. While
not a fatal problem, this is unexpected and will flag a warning on
subsequent insert range operations on the inode. This problem has
been reproduce intermittently by generic/270 running against a
ramdisk device.

Since right shift does not create new extent boundaries in the
inode, a block reservation for extent merge is unnecessary. Update
xfs_shift_file_space() to conditionally reserve fs blocks for left
shift transactions only. This avoids the warning reproduced by
generic/270.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:13 -08:00
Arnd Bergmann
353fe445f5 xfs: fix len comparison in xfs_extent_busy_trim
The length is now passed by reference, so the assertion has to be updated
to match the other changes, as pointed out by this W=1 warning:

fs/xfs/xfs_extent_busy.c: In function 'xfs_extent_busy_trim':
fs/xfs/xfs_extent_busy.c:356:13: error: ordered comparison of pointer with integer zero [-Werror=extra]

Fixes: ebf5587261 ("xfs: improve handling of busy extents in the low-level allocator")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:12 -08:00
Darrick J. Wong
93aaead52a xfs: fix uninitialized variable in _reflink_convert_cow
Fix an uninitialize variable.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:12 -08:00
Brian Foster
75d65361cf xfs: split indlen reservations fairly when under reserved
Certain workoads that punch holes into speculative preallocation can
cause delalloc indirect reservation splits when the delalloc extent is
split in two. If further splits occur, an already short-handed extent
can be split into two in a manner that leaves zero indirect blocks for
one of the two new extents. This occurs because the shortage is large
enough that the xfs_bmap_split_indlen() algorithm completely drains the
requested indlen of one of the extents before it honors the existing
reservation.

This ultimately results in a warning from xfs_bmap_del_extent(). This
has been observed during file copies of large, sparse files using 'cp
--sparse=always.'

To avoid this problem, update xfs_bmap_split_indlen() to explicitly
apply the reservation shortage fairly between both extents. This smooths
out the overall indlen shortage and defers the situation where we end up
with a delalloc extent with zero indlen reservation to extreme
circumstances.

Reported-by: Patrick Dung <mpatdung@gmail.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:12 -08:00
Brian Foster
0e339ef855 xfs: handle indlen shortage on delalloc extent merge
When a delalloc extent is created, it can be merged with pre-existing,
contiguous, delalloc extents. When this occurs,
xfs_bmap_add_extent_hole_delay() merges the extents along with the
associated indirect block reservations. The expectation here is that the
combined worst case indlen reservation is always less than or equal to
the indlen reservation for the individual extents.

This is not always the case, however, as existing extents can less than
the expected indlen reservation if the extent was previously split due
to a hole punch. If a new extent merges with such an extent, the total
indlen requirement may be larger than the sum of the indlen reservations
held by both extents.

xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
reservation is always available and assigns it to the merged extent
without consideration for the indlen held by the pre-existing extent. As
a result, the subsequent xfs_mod_fdblocks() call can attempt an
unintentional allocation rather than a free (indicated by an ASSERT()
failure). Further, if the allocation happens to fail in this context,
the failure goes unhandled and creates a filesystem wide block
accounting inconsistency.

Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
indlen reservation assigned to the merged extent to the sum of the
indlen reservations held by each of the individual extents.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:12 -08:00
Brian Foster
9dbddd7b0c xfs: resurrect debug mode drop buffered writes mechanism
A debug mode write failure mechanism was introduced to XFS in commit
801cc4e17a ("xfs: debug mode forced buffered write failure") to
facilitate targeted testing of delalloc indirect reservation management
from userspace. This code was subsequently rendered ineffective by the
move to iomap based buffered writes in commit 68a9f5e700 ("xfs:
implement iomap based buffered write path"). This likely went unnoticed
because the associated userspace code had not made it into xfstests.

Resurrect this mechanism to facilitate effective indlen reservation
testing from xfstests. The move to iomap based buffered writes relocated
the hook this mechanism needs to return write failure from XFS to
generic code. The failure trigger must remain in XFS. Given that
limitation, convert this from a write failure mechanism to one that
simply drops writes without returning failure to userspace. Rename all
"fail_writes" references to "drop_writes" to illustrate the point. This
is more hacky than preferred, but still triggers the XFS error handling
behavior required to drive the indlen tests. This is only available in
DEBUG mode and for testing purposes only.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:19:15 -08:00
Brian Foster
fa7f138ac4 xfs: clear delalloc and cache on buffered write failure
The buffered write failure handling code in
xfs_file_iomap_end_delalloc() has a couple minor problems. First, if
written == 0, start_fsb is not rounded down and it fails to kill off a
delalloc block if the start offset is block unaligned. This results in a
lingering delalloc block and broken delalloc block accounting detected
at unmount time. Fix this by rounding down start_fsb in the unlikely
event that written == 0.

Second, it is possible for a failed overwrite of a delalloc extent to
leave dirty pagecache around over a hole in the file. This is because is
possible to hit ->iomap_end() on write failure before the iomap code has
attempted to allocate pagecache, and thus has no need to clean it up. If
the targeted delalloc extent was successfully written by a previous
write, however, then it does still have dirty pages when ->iomap_end()
punches out the underlying blocks. This ultimately results in writeback
over a hole. To fix this problem, unconditionally punch out the
pagecache from XFS before the associated delalloc range.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:19:12 -08:00
David S. Miller
3f64116a83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-16 19:34:01 -05:00