Commit Graph

41281 Commits

Author SHA1 Message Date
Eric W. Biederman
87d2846fcf sysfs: Add support for permanently empty directories to serve as mount points.
Add two functions sysfs_create_mount_point and
sysfs_remove_mount_point that hang a permanently empty directory off
of a kobject or remove a permanently emptpy directory hanging from a
kobject.  Export these new functions so modular filesystems can use
them.

Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:45 -05:00
Eric W. Biederman
ea015218f2 kernfs: Add support for always empty directories.
Add a new function kernfs_create_empty_dir that can be used to create
directory that can not be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:43 -05:00
Eric W. Biederman
eb6d38d542 proc: Allow creating permanently empty directories that serve as mount points
Add a new function proc_create_mount_point that when used to creates a
directory that can not be added to.

Add a new function is_empty_pde to test if a function is a mount
point.

Update the code to use make_empty_dir_inode when reporting
a permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:41 -05:00
Eric W. Biederman
f9bd6733d3 sysctl: Allow creating permanently empty directories that serve as mountpoints.
Add a magic sysctl table sysctl_mount_point that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:39 -05:00
Eric W. Biederman
fbabfd0f4e fs: Add helper functions for permanently empty directories.
To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories.  Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permantently empty
directories that proc and sysfs can use when they create mount points
for filesystems and fs_fully_visible can use to test for permanently
empty directories to ensure that nothing will be gained by mounting a
fresh copy of proc or sysfs.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:37 -05:00
Eric W. Biederman
ceeb0e5d39 vfs: Ignore unlocked mounts in fs_fully_visible
Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked can always be unmounted so considering them adds hassle
but no security benefit.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:35 -05:00
Kinglong Mee
b4839ebe21 nfs: Remove invalid tk_pid from debug message
Before rpc_run_task(), tk_pid is uninitiated as 0 always.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:31:25 -04:00
Kinglong Mee
d356a7d1e7 nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh
NFS_ATTR_FATTR_V4_REFERRAL is only set in nfs4_proc_lookup_common.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:31:22 -04:00
Kinglong Mee
bc4da2a2be nfs: Drop bad comment in nfs41_walk_client_list()
Commit 7b1f1fd184 "NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_list"
have change the logical of the list_for_each_entry().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:30:59 -04:00
Kinglong Mee
cd738ee985 nfs: Remove unneeded micro checking of CONFIG_PROC_FS
Have checking CONFIG_PROC_FS in include/linux/sunrpc/stats.h.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:30:45 -04:00
Kinglong Mee
2785110d2e nfs: Don't setting FILE_CREATED flags always
Commit 5bc2afc2b5 "NFSv4: Honour the 'opened' parameter in the atomic_open()
 filesystem method" have support the opened arguments now.

Also,
Commit 03da633aa7 "atomic_open: take care of EEXIST in no-open case with
 O_CREAT|O_EXCL in fs/namei.c" have change vfs's logical.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:30:21 -04:00
Kinglong Mee
6a062a3687 nfs: Use remove_proc_subtree() instead remove_proc_entry()
Thanks for Al Viro's comments of killing proc_fs_nfs completely.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:29:41 -04:00
Kinglong Mee
71f81e51ee nfs: Remove unused argument in nfs_server_set_fsinfo()
Commit e38eb6506f "NFS: set_pnfs_layoutdriver() from nfs4_proc_fsinfo()"
have remove the using of mntfh from nfs_server_set_fsinfo().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:29:10 -04:00
Kinglong Mee
6b55970b0f nfs: Fix a memory leak when meeting an unsupported state protect
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:29:00 -04:00
Jeff Layton
db2efec0ca nfs: take extra reference to fl->fl_file when running a LOCKU operation
Jean reported another crash, similar to the one fixed by feaff8e5b2:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
    IP: [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache vmw_vsock_vmci_transport vsock cfg80211 rfkill coretemp crct10dif_pclmul ppdev vmw_balloon crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr vmxnet3 parport_pc i2c_piix4 microcode serio_raw parport nfsd floppy vmw_vmci acpi_cpufreq auth_rpcgss shpchp nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih ata_generic mptbase i2c_core pata_acpi
    CPU: 0 PID: 329 Comm: kworker/0:1H Not tainted 4.1.0-rc7+ #2
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
    Workqueue: rpciod rpc_async_schedule [sunrpc]
    30ec000
    RIP: 0010:[<ffffffff8124ef7f>]  [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
    RSP: 0018:ffff8802330efc08  EFLAGS: 00010296
    RAX: ffff8802330efc58 RBX: ffff880097187c80 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
    RBP: ffff8802330efc18 R08: ffff88023fc173d8 R09: 3038b7bf00000000
    R10: 00002f1a02000000 R11: 3038b7bf00000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8802337a2300 R15: 0000000000000020
    FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000148 CR3: 000000003680f000 CR4: 00000000001407f0
    Stack:
     ffff880097187c80 ffff880097187cd8 ffff8802330efc98 ffffffff81250281
     ffff8802330efc68 ffffffffa013e7df ffff8802330efc98 0000000000000246
     ffff8801f6901c00 ffff880233d2b8d8 ffff8802330efc58 ffff8802330efc58
    Call Trace:
     [<ffffffff81250281>] __posix_lock_file+0x31/0x5e0
     [<ffffffffa013e7df>] ? rpc_wake_up_task_queue_locked.part.35+0xcf/0x240 [sunrpc]
     [<ffffffff8125088b>] posix_lock_file_wait+0x3b/0xd0
     [<ffffffffa03890b2>] ? nfs41_wake_and_assign_slot+0x32/0x40 [nfsv4]
     [<ffffffffa0365808>] ? nfs41_sequence_done+0xd8/0x300 [nfsv4]
     [<ffffffffa0367525>] do_vfs_lock+0x35/0x40 [nfsv4]
     [<ffffffffa03690c1>] nfs4_locku_done+0x81/0x120 [nfsv4]
     [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
     [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
     [<ffffffffa013e33c>] rpc_exit_task+0x2c/0x90 [sunrpc]
     [<ffffffffa0134400>] ? call_refreshresult+0x170/0x170 [sunrpc]
     [<ffffffffa013ece4>] __rpc_execute+0x84/0x410 [sunrpc]
     [<ffffffffa013f085>] rpc_async_schedule+0x15/0x20 [sunrpc]
     [<ffffffff810add67>] process_one_work+0x147/0x400
     [<ffffffff810ae42b>] worker_thread+0x11b/0x460
     [<ffffffff810ae310>] ? rescuer_thread+0x2f0/0x2f0
     [<ffffffff810b35d9>] kthread+0xc9/0xe0
     [<ffffffff81010000>] ? perf_trace_xen_mmu_set_pmd+0xa0/0x160
     [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170
     [<ffffffff8173c222>] ret_from_fork+0x42/0x70
     [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170
    Code: a5 81 e8 85 75 e4 ff c6 05 31 ee aa 00 01 eb 98 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 <48> 8b 9f 48 01 00 00 48 85 db 74 08 48 89 d8 5b 41 5c 5d c3 83
    RIP  [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
     RSP <ffff8802330efc08>
    CR2: 0000000000000148
    ---[ end trace 64484f16250de7ef ]---

The problem is almost exactly the same as the one fixed by feaff8e5b2.
We must take a reference to the struct file when running the LOCKU
compound to prevent the final fput from running until the operation is
complete.

Reported-by: Jean Spector <jean@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-01 11:27:27 -04:00
Miklos Szeredi
c3696046be fuse: separate pqueue for clones
Make each fuse device clone refer to a separate processing queue.  The only
constraint on userspace code is that the request answer must be written to
the same device clone as it was read off.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:09 +02:00
Miklos Szeredi
cc080e9e9b fuse: introduce per-instance fuse_dev structure
Allow fuse device clones to refer to be distinguished.  This patch just
adds the infrastructure by associating a separate "struct fuse_dev" with
each clone.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:08 +02:00
Miklos Szeredi
00c570f4ba fuse: device fd clone
Allow an open fuse device to be "cloned".  Userspace can create a clone by:

      newfd = open("/dev/fuse", O_RDWR)
      ioctl(newfd, FUSE_DEV_IOC_CLONE, &oldfd);

At this point newfd will refer to the same fuse connection as oldfd.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:08 +02:00
Miklos Szeredi
ee314a870e fuse: abort: no fc->lock needed for request ending
In fuse_abort_conn() when all requests are on private lists we no longer
need fc->lock protection.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:08 +02:00
Miklos Szeredi
46c34a348b fuse: no fc->lock for pqueue parts
Remove fc->lock protection from processing queue members, now protected by
fpq->lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:07 +02:00
Miklos Szeredi
efe2800fac fuse: no fc->lock in request_end()
No longer need to call request_end() with the connection lock held.  We
still protect the background counters and queue with fc->lock, so acquire
it if necessary.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:07 +02:00
Miklos Szeredi
1e6881c36e fuse: cleanup request_end()
Now that we atomically test having already done everything we no longer
need other protection.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:07 +02:00
Miklos Szeredi
365ae710df fuse: request_end(): do once
When the connection is aborted it is possible that request_end() will be
called twice.  Use atomic test and set to do the actual ending only once.

test_and_set_bit() also provides the necessary barrier semantics so no
explicit smp_wmb() is necessary.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:06 +02:00
Miklos Szeredi
77cd9d488b fuse: add req flag for private list
When an unlocked request is aborted, it is moved from fpq->io to a private
list.  Then, after unlocking fpq->lock, the private list is processed and
the requests are finished off.

To protect the private list, we need to mark the request with a flag, so if
in the meantime the request is unlocked the list is not corrupted.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:06 +02:00
Miklos Szeredi
45a91cb1a4 fuse: pqueue locking
Add a fpq->lock for protecting members of struct fuse_pqueue and FR_LOCKED
request flag.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:06 +02:00
Miklos Szeredi
24b4d33d46 fuse: abort: group pqueue accesses
Rearrange fuse_abort_conn() so that processing queue accesses are grouped
together.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:05 +02:00
Miklos Szeredi
82cbdcd320 fuse: cleanup fuse_dev_do_read()
- locked list_add() + list_del_init() cancel out

 - common handling of case when request is ended here in the read phase

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:05 +02:00
Miklos Szeredi
f377cb799e fuse: move list_del_init() from request_end() into callers
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:04 +02:00
Miklos Szeredi
e96edd94d0 fuse: duplicate ->connected in pqueue
This will allow checking ->connected just with the processing queue lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:04 +02:00
Miklos Szeredi
3a2b5b9cd9 fuse: separate out processing queue
This is just two fields: fc->io and fc->processing.

This patch just rearranges the fields, no functional change.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:04 +02:00
Miklos Szeredi
5250921bb0 fuse: simplify request_wait()
wait_event_interruptible_exclusive_locked() will do everything
request_wait() does, so replace it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:03 +02:00
Miklos Szeredi
fd22d62ed0 fuse: no fc->lock for iqueue parts
Remove fc->lock protection from input queue members, now protected by
fiq->waitq.lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:03 +02:00
Miklos Szeredi
8f7bb368db fuse: allow interrupt queuing without fc->lock
Interrupt is only queued after the request has been sent to userspace.
This is either done in request_wait_answer() or fuse_dev_do_read()
depending on which state the request is in at the time of the interrupt.
If it's not yet sent, then queuing the interrupt is postponed until the
request is read.  Otherwise (the request has already been read and is
waiting for an answer) the interrupt is queued immedidately.

We want to call queue_interrupt() without fc->lock protection, in which
case there can be a race between the two functions:

 - neither of them queue the interrupt (thinking the other one has already
   done it).

 - both of them queue the interrupt

The first one is prevented by adding memory barriers, the second is
prevented by checking (under fiq->waitq.lock) if the interrupt has already
been queued.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:03 +02:00
Miklos Szeredi
4ce6081260 fuse: iqueue locking
Use fiq->waitq.lock for protecting members of struct fuse_iqueue and
FR_PENDING request flag, previously protected by fc->lock.

Following patches will remove fc->lock protection from these members.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:02 +02:00
Miklos Szeredi
ef75925886 fuse: dev read: split list_move
Different lists will need different locks.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:02 +02:00
Miklos Szeredi
8c91189a2a fuse: abort: group iqueue accesses
Rearrange fuse_abort_conn() so that input queue accesses are grouped
together.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:02 +02:00
Miklos Szeredi
e16714d875 fuse: duplicate ->connected in iqueue
This will allow checking ->connected just with the input queue lock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:01 +02:00
Miklos Szeredi
f88996a933 fuse: separate out input queue
The input queue contains normal requests (fc->pending), forgets
(fc->forget_*) and interrupts (fc->interrupts).  There's also fc->waitq and
fc->fasync for waking up the readers of the fuse device when a request is
available.

The fc->reqctr is also moved to the input queue (assigned to the request
when the request is added to the input queue.

This patch just rearranges the fields, no functional change.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:01 +02:00
Miklos Szeredi
33e14b4dfd fuse: req state use flags
Use flags for representing the state in fuse_req.  This is needed since
req->list will be protected by different locks in different states, hence
we'll want the state itself to be split into distinct bits, each protected
with the relevant lock in that state.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:01 +02:00
Miklos Szeredi
7a3b2c7547 fuse: simplify req states
FUSE_REQ_INIT is actually the same state as FUSE_REQ_PENDING and
FUSE_REQ_READING and FUSE_REQ_WRITING can be merged into a common
FUSE_REQ_IO state.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:00 +02:00
Miklos Szeredi
c47752673a fuse: don't hold lock over request_wait_answer()
Only hold fc->lock over sections of request_wait_answer() that actually
need it.  If wait_event_interruptible() returns zero, it means that the
request finished.  Need to add memory barriers, though, to make sure that
all relevant data in the request is synchronized.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:26:00 +02:00
Miklos Szeredi
7d2e0a099c fuse: simplify unique ctr
Since it's a 64bit counter, it's never gonna wrap around.  Remove code
dealing with that possibility.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:26:00 +02:00
Miklos Szeredi
41f982747e fuse: rework abort
Splice fc->pending and fc->processing lists into a common kill list while
holding fc->lock.

By the time we release fc->lock, pending and processing lists are empty and
the io list contains only locked requests.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:59 +02:00
Miklos Szeredi
b716d42538 fuse: fold helpers into abort
Fold end_io_requests() and end_queued_requests() into fuse_abort_conn().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:59 +02:00
Miklos Szeredi
dc00809a53 fuse: use per req lock for lock/unlock_request()
Reuse req->waitq.lock for protecting FR_ABORTED and FR_LOCKED flags.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:58 +02:00
Miklos Szeredi
825d6d3395 fuse: req use bitops
Finer grained locking will mean there's no single lock to protect
modification of bitfileds in fuse_req.

So move to using bitops.  Can use the non-atomic variants for those which
happen while the request definitely has only one reference.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:58 +02:00
Miklos Szeredi
0d8e84b043 fuse: simplify request abort
- don't end the request while req->locked is true

 - make unlock_request() return an error if the connection was aborted

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:58 +02:00
Miklos Szeredi
ccd0a0bd16 fuse: call fuse_abort_conn() in dev release
fuse_abort_conn() does all the work done by fuse_dev_release() and more.
"More" consists of:

	end_io_requests(fc);
	wake_up_all(&fc->waitq);
	kill_fasync(&fc->fasync, SIGIO, POLL_IN);

All of which should be no-op (WARN_ON's added).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:57 +02:00
Miklos Szeredi
f0139aa819 fuse: fold fuse_request_send_nowait() into single caller
And the same with fuse_request_send_nowait_locked().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:57 +02:00
Miklos Szeredi
de15522646 fuse: check conn_error earlier
fc->conn_error is set once in FUSE_INIT reply and never cleared.  Check it
in request allocation, there's no sense in doing all the preparation if
sending will surely fail.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:57 +02:00
Miklos Szeredi
5437f24172 fuse: account as waiting before queuing for background
Move accounting of fc->num_waiting to the point where the request actually
starts waiting.  This is earlier than the current queue_request() for
background requests, since they might be waiting on the fc->bg_queue before
being queued on fc->pending.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:56 +02:00
Miklos Szeredi
73e0e73844 fuse: reset waiting
Reset req->waiting in fuse_put_request().  This is needed for correct
accounting in fc->num_waiting for reserved requests.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-07-01 16:25:56 +02:00
Miklos Szeredi
42dc6211c5 fuse: fix background request if not connected
request_end() expects fc->num_background and fc->active_background to have
been incremented, which is not the case in fuse_request_send_nowait()
failure path.  So instead just call the ->end() callback (which is actually
set by all callers).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
2015-07-01 16:25:56 +02:00
Miklos Szeredi
0ad0b3255a fuse: initialize fc->release before calling it
fc->release is called from fuse_conn_put() which was used in the error
cleanup before fc->release was initialized.

[Jeremiah Mahler <jmmahler@gmail.com>: assign fc->release after calling
fuse_conn_init(fc) instead of before.]

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fixes: a325f9b922 ("fuse: update fuse_conn_init() and separate out fuse_conn_kill()")
Cc: <stable@vger.kernel.org> #v2.6.31+
2015-07-01 16:25:55 +02:00
Eric Dumazet
5ba97d2832 fs/file.c: __fget() and dup2() atomicity rules
__fget() does lockless fetch of pointer from the descriptor
table, attempts to grab a reference and treats "it was already
zero" as "it's already gone from the table, we just hadn't
seen the store, let's fail".  Unfortunately, that breaks the
atomicity of dup2() - __fget() might see the old pointer,
notice that it's been already dropped and treat that as
"it's closed".  What we should be getting is either the
old file or new one, depending whether we come before or after
dup2().

Dmitry had following test failing sometimes :

int fd;
void *Thread(void *x) {
  char buf;
  int n = read(fd, &buf, 1);
  if (n != 1)
    exit(printf("read failed: n=%d errno=%d\n", n, errno));
  return 0;
}

int main()
{
  fd = open("/dev/urandom", O_RDONLY);
  int fd2 = open("/dev/urandom", O_RDONLY);
  if (fd == -1 || fd2 == -1)
    exit(printf("open failed\n"));
  pthread_t th;
  pthread_create(&th, 0, Thread, 0);
  if (dup2(fd2, fd) == -1)
    exit(printf("dup2 failed\n"));
  pthread_join(th, 0);
  if (close(fd) == -1)
    exit(printf("close failed\n"));
  if (close(fd2) == -1)
    exit(printf("close failed\n"));
  printf("DONE\n");
  return 0;
}

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-01 02:31:08 -04:00
Eric Dumazet
8a81252b77 fs/file.c: don't acquire files->file_lock in fd_install()
Mateusz Guzik reported :

 Currently obtaining a new file descriptor results in locking fdtable
 twice - once in order to reserve a slot and second time to fill it.

Holding the spinlock in __fd_install() is needed in case a resize is
done, or to prevent a resize.

Mateusz provided an RFC patch and a micro benchmark :
  http://people.redhat.com/~mguzik/pipebench.c

A resize is an unlikely operation in a process lifetime,
as table size is at least doubled at every resize.

We can use RCU instead of the spinlock.

__fd_install() must wait if a resize is in progress.

The resize must block new __fd_install() callers from starting,
and wait that ongoing install are finished (synchronize_sched())

resize should be attempted by a single thread to not waste resources.

rcu_sched variant is used, as __fd_install() and expand_fdtable() run
from process context.

It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R)
CPU E5-2696 v2 @ 2.50GHz

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Mateusz Guzik <mguzik@redhat.com>
Tested-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-01 02:30:09 -04:00
Wang YanQing
1af95de6f0 fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
Execution of get_anon_bdev concurrently and preemptive kernel all
could bring race condition, it isn't enough to check dev against
its upper limitation with equality operator only.

This patch fix it.

Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-01 01:50:06 -04:00
Linus Torvalds
94521ca3df Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS/SMB3 updates from Steve French:
 "Includes two bug fixes, as well as (minimal) support for the new
  protocol dialect (SMB3.1.1), and support for two ioctls including
  reflink (duplicate extents) file copy and set integrity"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Unset CIFS_MOUNT_POSIX_PATHS flag when following dfs mounts
  Update negotiate protocol for SMB3.11 dialect
  Add ioctl to set integrity
  Add Get/Set Integrity Information structure definitions
  Add reflink copy over SMB3.11 with new FSCTL_DUPLICATE_EXTENTS
  Add SMB3.11 mount option synonym for new dialect
  add struct FILE_STANDARD_INFO
  Make dialect negotiation warning message easier to read
  Add defines and structs for smb3.1 dialect
  Allow parsing vers=3.11 on cifs mount
  client MUST ignore EncryptionKeyLength if CAP_EXTENDED_SECURITY is set
2015-06-30 21:40:07 -07:00
Carlos Maiolino
2adc376c55 vfs: avoid creation of inode number 0 in get_next_ino
currently, get_next_ino() is able to create inodes with inode number = 0.
This have a bad impact in the filesystems relying in this function to generate
inode numbers.

While there is no problem at all in having inodes with number 0, userspace tools
which handle file management tasks can have problems handling these files, like
for example, the impossiblity of users to delete these files, since glibc will
ignore them. So, I believe the best way is kernel to avoid creating them.

This problem has been raised previously, but the old thread didn't have any
other update for a year+, and I've seen too many users hitting the same issue
regarding the impossibility to delete files while using filesystems relying on
this function. So, I'm starting the thread again, with the same patch
that I believe is enough to address this problem.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-30 23:59:49 -04:00
Linus Torvalds
68b4449d79 xfs: update for 4.2-rc1
This update contains:
 
 o A new sparse on-disk inode record format to allow small extents to
   be used for inode allocation when free space is fragmented.
 o DAX support. This includes minor changes to the DAX core code to
   fix problems with lock ordering and bufferhead mapping abuse.
 o transaction commit interface cleanup
 o removal of various unnecessary XFS specific type definitions
 o cleanup and optimisation of freelist preparation before allocation
 o various minor cleanups
 o bug fixes for
 	- transaction reservation leaks
 	- incorrect inode logging in unwritten extent conversion
 	- mmap lock vs freeze ordering
 	- remote symlink mishandling
 	- attribute fork removal issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJVkhI0AAoJEK3oKUf0dfod45MQAJCOEkNduBdlfPvTCMPjj/7z
 vzcfDdzgKwhpPTMXSDRvw4zDPt3C2FLMBJqxtPpC4sKGKG/8G0kFvw8bDtBag1m9
 ru5nI5LaQ6LC5RcU40zxBx1s/L8qYvyfUlxeoOT5lSwN9c6ENGOCQ3bUk4pSKaee
 pWDplag9LbfQomW2GHtxd8agMUZEYx0R1vgfv88V8xgPka8CvQo81XUgkb4PcDZV
 ugR+wDUsvwMS01aLYBmRFkMXuExNuCJVwtvdTJS+ZWGHzyTpulFoANUW6QT24gAM
 eP4yRXN4bv9vXrXpg8JkF25DHsfw4HBwNEL17ZvoB8t3oJp1/NYaH8ce1jS0+I8i
 NCtaO+qUqDSTGQZKgmeDPwCciQp54ra9LEdmIJFxpZxiBof9g/tIYEFgRklyFLwR
 GZU6Io6VpBa1oTGlC4D1cmG6bdcnhMB9MGVVCbqnB5mRRDKCmVgCyJwusd1pi7Re
 G4O6KkFt21O7+fP13VsjP57KoaJzsIgZ/+H3Ff/fJOJ33AKYTRCmwi8+IMi2n5JI
 zz+V0AIBQZAx9dlVyENnxufh9eJYcnwta0lUSLCCo91fZKxbo3ktK1kVHNZP5EGs
 IMFM1Ka6hibY20rWlR3GH0dfyP5/yNcvNgTMYPKjj9SVjTar1aSfF2rGpkqYXYyH
 D4FICbtDgtOc2ClfpI2k
 =3x+W
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pul xfs updates from Dave Chinner:
 "There's a couple of small API changes to the core DAX code which
  required small changes to the ext2 and ext4 code bases, but otherwise
  everything is within the XFS codebase.

  This update contains:

   - A new sparse on-disk inode record format to allow small extents to
     be used for inode allocation when free space is fragmented.

   - DAX support.  This includes minor changes to the DAX core code to
     fix problems with lock ordering and bufferhead mapping abuse.

   - transaction commit interface cleanup

   - removal of various unnecessary XFS specific type definitions

   - cleanup and optimisation of freelist preparation before allocation

   - various minor cleanups

   - bug fixes for
	- transaction reservation leaks
	- incorrect inode logging in unwritten extent conversion
	- mmap lock vs freeze ordering
	- remote symlink mishandling
	- attribute fork removal issues"

* tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
  xfs: don't truncate attribute extents if no extents exist
  xfs: clean up XFS_MIN_FREELIST macros
  xfs: sanitise error handling in xfs_alloc_fix_freelist
  xfs: factor out free space extent length check
  xfs: xfs_alloc_fix_freelist() can use incore perag structures
  xfs: remove xfs_caddr_t
  xfs: use void pointers in log validation helpers
  xfs: return a void pointer from xfs_buf_offset
  xfs: remove inst_t
  xfs: remove __psint_t and __psunsigned_t
  xfs: fix remote symlinks on V5/CRC filesystems
  xfs: fix xfs_log_done interface
  xfs: saner xfs_trans_commit interface
  xfs: remove the flags argument to xfs_trans_cancel
  xfs: pass a boolean flag to xfs_trans_free_items
  xfs: switch remaining xfs_trans_dup users to xfs_trans_roll
  xfs: check min blks for random debug mode sparse allocations
  xfs: fix sparse inodes 32-bit compile failure
  xfs: add initial DAX support
  xfs: add DAX IO path support
  ...
2015-06-30 20:16:08 -07:00
Linus Torvalds
043cd04950 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "Outside of our usual batch of fixes, this integrates the subvolume
  quota updates that Qu Wenruo from Fujitsu has been working on for a
  few releases now.  He gets an extra gold star for making btrfs smaller
  this time, and fixing a number of quota corners in the process.

  Dave Sterba tested and integrated Anand Jain's sysfs improvements.
  Outside of exporting a symbol (ack'd by Greg) these are all internal
  to btrfs and it's mostly cleanups and fixes.  Anand also attached some
  of our sysfs objects to our internal device management structs instead
  of an object off the super block.  It will make device management
  easier overall and it's a better fit for how the sysfs files are used.
  None of the existing sysfs files are moved around.

  Thanks for all the fixes everyone"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits)
  btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
  Btrfs: Check if kobject is initialized before put
  lib: export symbol kobject_move()
  Btrfs: sysfs: add support to show replacing target in the sysfs
  Btrfs: free the stale device
  Btrfs: use received_uuid of parent during send
  Btrfs: fix use-after-free in btrfs_replay_log
  btrfs: wait for delayed iputs on no space
  btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup.
  btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
  btrfs: ulist: Add ulist_del() function.
  btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
  btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
  btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
  btrfs: qgroup: Switch rescan to new mechanism.
  btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents().
  btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots().
  btrfs: qgroup: Add new function to record old_roots.
  btrfs: qgroup: Record possible quota-related extent for qgroup.
  btrfs: qgroup: Add function qgroup_update_counters().
  ...
2015-06-30 20:07:45 -07:00
Linus Torvalds
43baed34bc Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull more block layer patches from Jens Axboe:
 "A few later arrivers that I didn't fold into the first pull request,
  so we had a chance to run some testing.  This contains:

   - NVMe:
        - Set of fixes from Keith
        - 4.4 and earlier gcc build fix from Andrew

   - small set of xen-blk{back,front} fixes from Bob Liu.

   - warnings fix for bogus inline statement in I_BDEV() from Geert.

   - error code fixup for SG_IO ioctl from Paolo Bonzini"

* 'for-linus' of git://git.kernel.dk/linux-block:
  drivers/block/nvme-core.c: fix build with gcc-4.4.4
  bdi: Remove "inline" keyword from exported I_BDEV() implementation
  block: fix bogus EFAULT error from SG_IO ioctl
  NVMe: Fix filesystem deadlock on removal
  NVMe: Failed controller initialization fixes
  NVMe: Unify controller probe and resume
  NVMe: Don't use fake status on cancelled command
  NVMe: Fix device cleanup on initialization failure
  drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
  xen/block: add multi-page ring support
  driver: xen-blkfront: move talk_to_blkback to a more suitable place
  drivers: xen-blkback: delay pending_req allocation to connect_ring
2015-06-30 19:46:34 -07:00
Josh Triplett
9ce71148b0 devpts: if initialization failed, don't crash when opening /dev/ptmx
If devpts failed to initialize, it would store an ERR_PTR in the global
devpts_mnt.  A subsequent open of /dev/ptmx would call devpts_new_index,
which would dereference devpts_mnt and crash.

Avoid storing invalid values in devpts_mnt; leave it NULL instead.  Make
both devpts_new_index and devpts_pty_new fail gracefully with ENODEV in
that case, which then becomes the return value to the userspace open call
on /dev/ptmx.

[akpm@linux-foundation.org: remove unneeded static]
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:58 -07:00
Fabian Frederick
196a4f82bd fs/affs/symlink.c: remove unneeded err variable
err is only assigned to -EIO.  Return that value at the end of fail
context.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:57 -07:00
Fabian Frederick
4709187ef2 fs/affs/amigaffs.c: remove unneeded initialization
bh is initialized unconditionally in affs_remove_link()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:57 -07:00
Fabian Frederick
78f444f673 fs/affs/inode.c: remove unneeded initialization
bh is initialized unconditionally in affs_add_entry()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:57 -07:00
Firo Yang
d96f184532 fs/adfs: remove unneeded cast
kmem_cache_alloc() returns void*.

Signed-off-by: Firo Yang <firogm@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:57 -07:00
Yann Droneaud
460b865e53 fs: document seq_open()'s usage of file->private_data
seq_open() stores its struct seq_file in file->private_data, thus it must
not be modified by user of seq_file.

Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:57 -07:00
Yann Droneaud
189f9841de fs: allocate structure unconditionally in seq_open()
Since patch described below, from v2.6.15-rc1, seq_open() could use a
struct seq_file already allocated by the caller if the pointer to the
structure is stored in file->private_data before calling the function.

    Commit 1abe77b0fc
    Author: Al Viro <viro@zeniv.linux.org.uk>
    Date:   Mon Nov 7 17:15:34 2005 -0500

        [PATCH] allow callers of seq_open do allocation themselves

        Allow caller of seq_open() to kmalloc() seq_file + whatever else they
        want and set ->private_data to it.  seq_open() will then abstain from
        doing allocation itself.

As there's no more use for such feature, as it could be easily replaced by
calls to seq_open_private() (see commit 39699037a5 ("[FS] seq_file:
Introduce the seq_open_private()")) and seq_release_private() (see
v2.6.0-test3), support for this uncommon feature can be removed from
seq_open().

Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:57 -07:00
Yann Droneaud
ede1bf0dcf fs: use seq_open_private() for proc_mounts
A patchset to remove support for passing pre-allocated struct seq_file to
seq_open().  Such feature is undocumented and prone to error.

In particular, if seq_release() is used in release handler, it will
kfree() a pointer which was not allocated by seq_open().

So this patchset drops support for pre-allocated struct seq_file: it's
only of use in proc_namespace.c and can be easily replaced by using
seq_open_private()/seq_release_private().

Additionally, it documents the use of file->private_data to hold pointer
to struct seq_file by seq_open().

This patch (of 3):

Since patch described below, from v2.6.15-rc1, seq_open() could use a
struct seq_file already allocated by the caller if the pointer to the
structure is stored in file->private_data before calling the function.

    Commit 1abe77b0fc
    Author: Al Viro <viro@zeniv.linux.org.uk>
    Date:   Mon Nov 7 17:15:34 2005 -0500

        [PATCH] allow callers of seq_open do allocation themselves

        Allow caller of seq_open() to kmalloc() seq_file + whatever else they
        want and set ->private_data to it.  seq_open() will then abstain from
        doing allocation itself.

Such behavior is only used by mounts_open_common().

In order to drop support for such uncommon feature, proc_mounts is
converted to use seq_open_private(), which take care of allocating the
proc_mounts structure, making it available through ->private in struct
seq_file.

Conversely, proc_mounts is converted to use seq_release_private(), in
order to release the private structure allocated by seq_open_private().

Then, ->private is used directly instead of proc_mounts() macro to access
to the proc_mounts structure.

Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:44:56 -07:00
Filipe Manana
36283bf777 Btrfs: fix fsync xattr loss in the fast fsync path
After commit 4f764e5153 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.

This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree and therefore at log replay time, which makes the replay code delete
the xattr from the fs/subvol tree as it thinks that xattr was deleted
prior to the last fsync.

Fix this by always logging all xattrs, which is the simplest and most
reliable way to detect deleted xattrs and replay the deletes at log replay
time.

This issue is reproducible with the following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey
  . ./common/attr

  # real QA test starts here

  # We create a lot of xattrs for a single file. Only btrfs and xfs are currently
  # able to store such a large mount of xattrs per file, other filesystems such
  # as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
  # less than 1000 xattrs with very small values.
  _supported_fs btrfs xfs
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_attrs
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and make sure everything is
  # durably persisted.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add many small xattrs to our file.
  # We create such a large amount because it's needed to trigger the issue found
  # in btrfs - we need to have an amount that causes the fs to have at least 3
  # btree leafs with xattrs stored in them, and it must work on any leaf size
  # (maximum leaf/node size is 64Kb).
  num_xattrs=2000
  for ((i = 1; i <= $num_xattrs; i++)); do
      name="user.attr_$(printf "%04d" $i)"
      $SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
  done

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now update our file's data and fsync the file.
  # After a successful fsync, if the fsync log/journal is replayed we expect to
  # see all the xattrs we added before with the same values (and the updated file
  # data of course). Btrfs used to delete some of these xattrs when it replayed
  # its fsync log/journal.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again and mount. This makes the fs replay its fsync log.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  echo "File xattrs after crash and log replay:"
  for ((i = 1; i <= $num_xattrs; i++)); do
      name="user.attr_$(printf "%04d" $i)"
      echo -n "$name="
      $GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
      echo
  done

  status=0
  exit

The golden output expects all xattrs to be available, and with the correct
values, after the fsync log is replayed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:47 -07:00
Filipe Manana
e4545de5b0 Btrfs: fix fsync data loss after append write
If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.

This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.

Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).

This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
          _cleanup_flakey
          rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
          # Simulate a crash/power loss.
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
          # Allow writes again and mount. This makes the fs replay its fsync log.
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and then fsync it.
  # The fsync here is only needed to trigger the issue in btrfs, as it causes the
  # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                  -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a hard link to our file.
  # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
  # which is a necessary condition to trigger the issue.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now append more data to our file, increasing its size, and fsync the file.
  # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
  # write path did not update the inode item in the btree nor the delayed inode
  # item (in memory struture) in the current transaction (created by the fsync
  # handler), the fsync did not record the inode's new i_size in the fsync
  # log/journal. This made the data unavailable after the fsync log/journal is
  # replayed.
  $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  echo "File content after fsync and before crash:"
  od -t x1 $SCRATCH_MNT/foo

  _crash_and_mount

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected file output before and after the crash/power failure expects the
appended data to be available, which is:

  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0200000

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:47 -07:00
Filipe Manana
da288d280d Btrfs: fix crash on close_ctree() if cleaner starts new transaction
Often when running fstests btrfs/079 I was running into the following
trace during umount on one of my qemu/kvm test vms:

[ 8245.682441] WARNING: CPU: 8 PID: 25064 at fs/btrfs/extent-tree.c:138 btrfs_put_block_group+0x51/0x69 [btrfs]()
[ 8245.685039] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
[ 8245.693860] CPU: 8 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[ 8245.695081] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[ 8245.697583]  0000000000000009 ffff88020d047ce8 ffffffff8145eec7 ffffffff81095dce
[ 8245.699234]  0000000000000000 ffff88020d047d28 ffffffff8104b399 0000000000000028
[ 8245.700995]  ffffffffa04db07b ffff8801c6036c00 ffff8801c6036d68 ffff880202eb40b0
[ 8245.702510] Call Trace:
[ 8245.703006]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[ 8245.705393]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[ 8245.706569]  [<ffffffff8104b399>] warn_slowpath_common+0xa1/0xbb
[ 8245.707747]  [<ffffffffa04db07b>] ? btrfs_put_block_group+0x51/0x69 [btrfs]
[ 8245.709101]  [<ffffffff8104b456>] warn_slowpath_null+0x1a/0x1c
[ 8245.710274]  [<ffffffffa04db07b>] btrfs_put_block_group+0x51/0x69 [btrfs]
[ 8245.711823]  [<ffffffffa04e3473>] btrfs_free_block_groups+0x145/0x322 [btrfs]
[ 8245.713251]  [<ffffffffa04ef31a>] close_ctree+0x1ef/0x325 [btrfs]
[ 8245.714448]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
[ 8245.715539]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
[ 8245.716835]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
[ 8245.718015]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
[ 8245.719101]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 8245.720316]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
[ 8245.721517]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
[ 8245.722581]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
[ 8245.723538]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
[ 8245.724572]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
[ 8245.725598]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
[ 8245.726892]  [<ffffffff814651ac>] int_signal+0x12/0x17
[ 8245.737887] ---[ end trace a01d038397e99b92 ]---
[ 8245.769363] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 8245.770737] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
[ 8245.772641] CPU: 2 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[ 8245.772641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[ 8245.772641] task: ffff880013005810 ti: ffff88020d044000 task.ti: ffff88020d044000
[ 8245.772641] RIP: 0010:[<ffffffffa051c8e6>]  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
[ 8245.772641] RSP: 0018:ffff88020d0478b8  EFLAGS: 00010202
[ 8245.772641] RAX: 0000000000000004 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffa0581488
[ 8245.772641] RDX: 0000000000000000 RSI: ffff880194b7bf48 RDI: ffff880144b6a7a0
[ 8245.772641] RBP: ffff88020d0478d8 R08: 0000000000000000 R09: 000000000000ffff
[ 8245.772641] R10: 0000000000000004 R11: 0000000000000005 R12: ffff880194b7bf48
[ 8245.772641] R13: ffff880194b7bf48 R14: 0000000000000410 R15: 0000000000000000
[ 8245.772641] FS:  00007f991e77d840(0000) GS:ffff88023e280000(0000) knlGS:0000000000000000
[ 8245.772641] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 8245.772641] CR2: 00007fbbd325ee68 CR3: 000000021de8e000 CR4: 00000000000006e0
[ 8245.772641] Stack:
[ 8245.772641]  ffff880194b7bf00 ffff880202eb4000 ffff880194b7bf48 0000000000000410
[ 8245.772641]  ffff88020d047958 ffffffffa04ec6d5 ffff8801629b2ee8 0000000082987570
[ 8245.772641]  0000000000a5813f 0000000000000001 ffff880013006100 0000000000000002
[ 8245.772641] Call Trace:
[ 8245.772641]  [<ffffffffa04ec6d5>] btrfs_wq_submit_bio+0xe1/0x17b [btrfs]
[ 8245.772641]  [<ffffffff81086bff>] ? check_irq_usage+0x76/0x87
[ 8245.772641]  [<ffffffffa04ec825>] btree_submit_bio_hook+0xb6/0xd9 [btrfs]
[ 8245.772641]  [<ffffffffa04ebb7c>] ? btree_csum_one_bio+0xad/0xad [btrfs]
[ 8245.772641]  [<ffffffffa04eb1a6>] ? btree_io_failed_hook+0x5e/0x5e [btrfs]
[ 8245.772641]  [<ffffffffa050a6e7>] submit_one_bio+0x8c/0xc7 [btrfs]
[ 8245.772641]  [<ffffffffa050d75b>] submit_extent_page.isra.18+0x9d/0x186 [btrfs]
[ 8245.772641]  [<ffffffffa050d95b>] write_one_eb+0x117/0x1ae [btrfs]
[ 8245.772641]  [<ffffffffa050a79b>] ? end_extent_buffer_writeback+0x21/0x21 [btrfs]
[ 8245.772641]  [<ffffffffa0510510>] btree_write_cache_pages+0x2ab/0x385 [btrfs]
[ 8245.772641]  [<ffffffffa04eb2b8>] btree_writepages+0x23/0x5c [btrfs]
[ 8245.772641]  [<ffffffff8111c661>] do_writepages+0x23/0x2c
[ 8245.772641]  [<ffffffff81189cd4>] __writeback_single_inode+0xda/0x5bd
[ 8245.772641]  [<ffffffff8118aa60>] ? writeback_single_inode+0x2b/0x173
[ 8245.772641]  [<ffffffff8118aafd>] writeback_single_inode+0xc8/0x173
[ 8245.772641]  [<ffffffff8118ac95>] write_inode_now+0x8a/0x95
[ 8245.772641]  [<ffffffff81247bf0>] ? _atomic_dec_and_lock+0x30/0x4e
[ 8245.772641]  [<ffffffff8117cc5e>] iput+0x17d/0x26a
[ 8245.772641]  [<ffffffffa04ef355>] close_ctree+0x22a/0x325 [btrfs]
[ 8245.772641]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
[ 8245.772641]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
[ 8245.772641]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
[ 8245.772641]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
[ 8245.772641]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 8245.772641]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
[ 8245.772641]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
[ 8245.772641]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
[ 8245.772641]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
[ 8245.772641]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
[ 8245.772641]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
[ 8245.772641]  [<ffffffff814651ac>] int_signal+0x12/0x17
[ 8245.772641] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 5c ff 74 04 f0 ff 43 50 49 83 7c 24 08 00 74 2c 4c 8d 6b
[ 8245.772641] RIP  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
[ 8245.772641]  RSP <ffff88020d0478b8>
[ 8245.845040] ---[ end trace a01d038397e99b93 ]---

For logical reasons such as the phase of the moon, this happened more
often with "-o inode_cache" than without any mount options.

After some debugging it turned out to be simple to understand what was
happening:

1) close_ctree() is called;

2) It then stops the transaction kthread, which commits the current
   transaction;

3) It asks the cleaner kthread to stop, which is currently running
   btrfs_delete_unused_bgs();

4) btrfs_delete_unused_bgs() finds an unused block group, starts a new
   transaction, deletes the block group, which implies COWing some
   tree nodes and leafs and dirtying their respective pages, and then
   finally it ends the transaction it started, without committing it;

5) The cleaner kthread stops;

6) close_ctree() releases (from memory) the block group objects, which
   produces the warning in the trace pasted above;

7) Then it invalidates all pages of the btree inode, by calling
   invalidate_inode_pages2(), which waits for any pages under writeback,
   and releases any non-dirty pages;

8) All work queues are destroyed (waiting first for their current tasks
   to finish execution);

9) A final iput() is called against the btree inode;

10) This iput triggers a writeback of the btree inode because it still
    has dirty pages;

11) This starts the whole chain of callbacks for the btree inode until
    it eventually reaches btrfs_wq_submit_bio() where it leads to a
    NULL pointer dereference because the work queues were already
    destroyed.

Fix this by making the cleaner commit any transaction that it started
after the transaction kthread was stopped.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Filipe Manana
ae9d8f1711 Btrfs: fix race between caching kthread and returning inode to inode cache
While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info->commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.

This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.

This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:

[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[132408.505075]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[132408.505075]  [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
[132408.505075]  [<ffffffff81155733>] __cache_free+0xe2/0x4b6
[132408.505075]  [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075]  [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
[132408.505075]  [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
[132408.505075]  [<ffffffff811563a1>] ? kfree+0xb6/0x14e
[132408.505075]  [<ffffffff811563d0>] kfree+0xe5/0x14e
[132408.505075]  [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075]  [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075]  [<ffffffff8106698f>] kthread+0xef/0xf7
[132408.505075]  [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075]  [<ffffffff814653d2>] ret_from_fork+0x42/0x70
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!

Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.

Fixes: 1c70d8fb4d ("Btrfs: fix inode caching vs tree log")
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Filipe Manana
c3f4a1685b Btrfs: use kmem_cache_free when freeing entry in inode cache
The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:

  /*
   * (...)
   *
   * Don't free memory not originally allocated by kmalloc()
   * or you will run into trouble.
   */

So better be safe and use kmem_cache_free().

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Filipe Manana
67c5e7d464 Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:

[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)

The sequence of steps leading to this are:

           CPU 0                                         CPU 1

  btrfs_balance()
    btrfs_relocate_chunk()

      btrfs_relocate_block_group(bg X)
        btrfs_lookup_block_group(bg X)

                                               cleaner_kthread
                                                  locks fs_info->cleaner_mutex

                                                  btrfs_delete_unused_bgs()
                                                    finds bg X, which became
                                                    unused in the previous
                                                    transaction

                                                    checks bg X ->ro == 0,
                                                    so it proceeds
        sets bg X ->ro to 1
        (btrfs_set_block_group_ro(bg X))

        blocks on fs_info->cleaner_mutex
                                                    btrfs_remove_chunk(bg X)
                                                  unlocks fs_info->cleaner_mutex

        acquires fs_info->cleaner_mutex
        relocate_block_group()
          --> does nothing, no extents found in
              the extent tree from bg X
        unlocks fs_info->cleaner_mutex

      btrfs_relocate_block_group(bg X) returns

    btrfs_remove_chunk(bg X)
       extent map not found
          --> ASSERT(0)

Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.

This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:

    while true; do btrfs balance start -dusage=0 <mountpoint> ; done

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Zhao Lei
e82afc52ab btrfs: add error handling for scrub_workers_get()
Although it is a rare case, we'd better free previous allocated
memory on error.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:03 -07:00
Zhao Lei
65f5333875 btrfs: cleanup noused initialization of dev in btrfs_end_bio()
It is introduced by:
 c404e0dc2c
 Btrfs: fix use-after-free in the finishing procedure of the device replace

But seems no relationship with that bug, this patch revirt these
code block for cleanup.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:02 -07:00
Yang Dongsheng
fe7599079b btrfs: qgroup: allow user to clear the limitation on qgroup
Currently, we can only set a limitation on a qgroup, but we
can not clear it.

This patch provide a choice to user to clear a limitation on
qgroup by passing a value of CLEAR_VALUE(-1) to kernel.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:00 -07:00
Sachin Prabhu
1dfd18d057 cifs: Unset CIFS_MOUNT_POSIX_PATHS flag when following dfs mounts
In a dfs setup where the client transitions from a server which supports
posix paths to a server which doesn't support posix paths, the flag
CIFS_MOUNT_POSIX_PATHS is not reset. This leads to the wrong directory
separator being used causing smb commands to fail.

Consider the following case where a dfs share on a samba server points
to a share on windows smb server.
 # mount -t cifs -o .. //vm140-31/dfsroot/testwin/
 # ls -l /mnt; touch /mnt/a
 total 0
 touch: cannot touch ‘/mnt/a’: No such file or directory

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-29 14:50:22 -05:00
Linus Torvalds
88793e5c77 The libnvdimm sub-system introduces, in addition to the libnvdimm-core,
4 drivers / enabling modules:
 
 NFIT:
 Instantiates an "nvdimm bus" with the core and registers memory devices
 (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware Interface
 table).  After registering NVDIMMs the NFIT driver then registers
 "region" devices.  A libnvdimm-region defines an access mode and the
 boundaries of persistent memory media.  A region may span multiple
 NVDIMMs that are interleaved by the hardware memory controller.  In
 turn, a libnvdimm-region can be carved into a "namespace" device and
 bound to the PMEM or BLK driver which will attach a Linux block device
 (disk) interface to the memory.
 
 PMEM:
 Initially merged in v4.1 this driver for contiguous spans of persistent
 memory address ranges is re-worked to drive PMEM-namespaces emitted by
 the libnvdimm-core.  In this update the PMEM driver, on x86, gains the
 ability to assert that writes to persistent memory have been flushed all
 the way through the caches and buffers in the platform to persistent
 media.  See memcpy_to_pmem() and wmb_pmem().
 
 BLK:
 This new driver enables access to persistent memory media through "Block
 Data Windows" as defined by the NFIT.  The primary difference of this
 driver to PMEM is that only a small window of persistent memory is
 mapped into system address space at any given point in time.  Per-NVDIMM
 windows are reprogrammed at run time, per-I/O, to access different
 portions of the media.  BLK-mode, by definition, does not support DAX.
 
 BTT:
 This is a library, optionally consumed by either PMEM or BLK, that
 converts a byte-accessible namespace into a disk with atomic sector
 update semantics (prevents sector tearing on crash or power loss).  The
 sinister aspect of sector tearing is that most applications do not know
 they have a atomic sector dependency.  At least today's disk's rarely
 ever tear sectors and if they do one almost certainly gets a CRC error
 on access.  NVDIMMs will always tear and always silently.  Until an
 application is audited to be robust in the presence of sector-tearing
 the usage of BTT is recommended.
 
 Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
 Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
 Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
 Wysocki, and Bob Moore.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVjZGBAAoJEB7SkWpmfYgC4fkP/j+k6HmSRNU/yRYPyo7CAWvj
 3P5P1i6R6nMZZbjQrQArAXaIyLlFk4sEQDYsciR6dmslhhFZAkR2eFwVO5rBOyx3
 QN0yxEpyjJbroRFUrV/BLaFK4cq2oyJAFFHs0u7/pLHBJ4MDMqfRKAMtlnBxEkTE
 LFcqXapSlvWitSbjMdIBWKFEvncaiJ2mdsFqT4aZqclBBTj00eWQvEG9WxleJLdv
 +tj7qR/vGcwOb12X5UrbQXgwtMYos7A6IzhHbqwQL8IrOcJ6YB8NopJUpLDd7ZVq
 KAzX6ZYMzNueN4uvv6aDfqDRLyVL7qoxM9XIjGF5R8SV9sF2LMspm1FBpfowo1GT
 h2QMr0ky1nHVT32yspBCpE9zW/mubRIDtXxEmZZ53DIc4N6Dy9jFaNVmhoWtTAqG
 b9pndFnjUzzieCjX5pCvo2M5U6N0AQwsnq76/CasiWyhSa9DNKOg8MVDRg0rbxb0
 UvK0v8JwOCIRcfO3qiKcx+02nKPtjCtHSPqGkFKPySRvAdb+3g6YR26CxTb3VmnF
 etowLiKU7HHalLvqGFOlDoQG6viWes9Zl+ZeANBOCVa6rL2O7ZnXJtYgXf1wDQee
 fzgKB78BcDjXH4jHobbp/WBANQGN/GF34lse8yHa7Ym+28uEihDvSD1wyNLnefmo
 7PJBbN5M5qP5tD0aO7SZ
 =VtWG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm

Pull libnvdimm subsystem from Dan Williams:
 "The libnvdimm sub-system introduces, in addition to the
  libnvdimm-core, 4 drivers / enabling modules:

  NFIT:
    Instantiates an "nvdimm bus" with the core and registers memory
    devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
    Interface table).

    After registering NVDIMMs the NFIT driver then registers "region"
    devices.  A libnvdimm-region defines an access mode and the
    boundaries of persistent memory media.  A region may span multiple
    NVDIMMs that are interleaved by the hardware memory controller.  In
    turn, a libnvdimm-region can be carved into a "namespace" device and
    bound to the PMEM or BLK driver which will attach a Linux block
    device (disk) interface to the memory.

  PMEM:
    Initially merged in v4.1 this driver for contiguous spans of
    persistent memory address ranges is re-worked to drive
    PMEM-namespaces emitted by the libnvdimm-core.

    In this update the PMEM driver, on x86, gains the ability to assert
    that writes to persistent memory have been flushed all the way
    through the caches and buffers in the platform to persistent media.
    See memcpy_to_pmem() and wmb_pmem().

  BLK:
    This new driver enables access to persistent memory media through
    "Block Data Windows" as defined by the NFIT.  The primary difference
    of this driver to PMEM is that only a small window of persistent
    memory is mapped into system address space at any given point in
    time.

    Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
    different portions of the media.  BLK-mode, by definition, does not
    support DAX.

  BTT:
    This is a library, optionally consumed by either PMEM or BLK, that
    converts a byte-accessible namespace into a disk with atomic sector
    update semantics (prevents sector tearing on crash or power loss).

    The sinister aspect of sector tearing is that most applications do
    not know they have a atomic sector dependency.  At least today's
    disk's rarely ever tear sectors and if they do one almost certainly
    gets a CRC error on access.  NVDIMMs will always tear and always
    silently.  Until an application is audited to be robust in the
    presence of sector-tearing the usage of BTT is recommended.

  Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
  Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
  Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
  Wysocki, and Bob Moore"

* tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
  arch, x86: pmem api for ensuring durability of persistent memory updates
  libnvdimm: Add sysfs numa_node to NVDIMM devices
  libnvdimm: Set numa_node to NVDIMM devices
  acpi: Add acpi_map_pxm_to_online_node()
  libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
  pmem: flag pmem block devices as non-rotational
  libnvdimm: enable iostat
  pmem: make_request cleanups
  libnvdimm, pmem: fix up max_hw_sectors
  libnvdimm, blk: add support for blk integrity
  libnvdimm, btt: add support for blk integrity
  fs/block_dev.c: skip rw_page if bdev has integrity
  libnvdimm: Non-Volatile Devices
  tools/testing/nvdimm: libnvdimm unit test infrastructure
  libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  nd_btt: atomic sector updates
  libnvdimm: infrastructure for btt devices
  libnvdimm: write blk label set
  libnvdimm: write pmem label set
  libnvdimm: blk labels and namespace instantiation
  ...
2015-06-29 10:34:42 -07:00
Al Viro
06d7137e5c namei: make set_root_rcu() return void
The only caller that cares about its return value can just
as easily pick it from nd->root_seq itself.  We used to just
calculate it and return to caller, but these days we are
storing it in nd->root_seq in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-29 12:07:04 -04:00
NeilBrown
39f897fdbd NFSv4: When returning a delegation, don't reclaim an incompatible open mode.
It is possible to have an active open with one mode, and a delegation
for the same file with a different mode.
In particular, a WR_ONLY open and an RD_ONLY delegation.
This happens if a WR_ONLY open is followed by a RD_ONLY open which
provides a delegation, but is then close.

When returning the delegation, we currently try to claim opens for
every open type (n_rdwr, n_rdonly, n_wronly).  As there is no harm
in claiming an open for a mode that we already have, this is often
simplest.

However if the delegation only provides a subset of the modes that we
currently have open, this will produce an error from the server.

So when claiming open modes prior to returning a delegation, skip the
open request if the mode is not covered by the delegation - the open_stateid
must already cover that mode, so there is nothing to do.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-29 09:17:13 -04:00
Steve French
ebb3a9d4ba Update negotiate protocol for SMB3.11 dialect
Send negotiate contexts when SMB3.11 dialect is negotiated
(ie the preauth and the encryption contexts) and
Initialize SMB3.11 preauth negotiate context salt to random bytes

Followon patch will update session setup and tree connect

Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-28 21:15:48 -05:00
Steve French
b3152e2c7a Add ioctl to set integrity
set integrity increases reliability of files stored on SMB3 servers.
Add ioctl to allow setting this on files on SMB3 and later mounts.

Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-28 21:15:45 -05:00
Steve French
9d1b06602e Add Get/Set Integrity Information structure definitions
Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-28 21:15:41 -05:00
Steve French
02b1666544 Add reflink copy over SMB3.11 with new FSCTL_DUPLICATE_EXTENTS
Getting fantastic copy performance with cp --reflink over SMB3.11
 using the new FSCTL_DUPLICATE_EXTENTS.

 This FSCTL was added in the SMB3.11 dialect (testing was
 against REFS file system) so have put it as a 3.11 protocol
 specific operation ("vers=3.1.1" on the mount).  Tested at
 the SMB3 plugfest in Redmond.

 It depends on the new FS Attribute (BLOCK_REFCOUNTING) which
 is used to advertise support for the ability to do this ioctl
 (if you can support multiple files pointing to the same block
 than this refcounting ability or equivalent is needed to
 support the new reflink-like duplicate extent SMB3 ioctl.

Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-28 21:15:38 -05:00
Linus Torvalds
21dc2e6c6d Merge branch 'for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:

 - remove hppfs ("HonePot ProcFS")

 - initial support for musl libc

 - uaccess cleanup

 - random cleanups and bug fixes all over the place

* 'for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (21 commits)
  um: Don't pollute kernel namespace with uapi
  um: Include sys/types.h for makedev(), major(), minor()
  um: Do not use stdin and stdout identifiers for struct members
  um: Do not use __ptr_t type for stack_t's .ss pointer
  um: Fix mconsole dependency
  um: Handle tracehook_report_syscall_entry() result
  um: Remove copy&paste code from init.h
  um: Stop abusing __KERNEL__
  um: Catch unprotected user memory access
  um: Fix warning in setup_signal_stack_si()
  um: Rework uaccess code
  um: Add uaccess.h to ldt.c
  um: Add uaccess.h to syscalls_64.c
  um: Add asm/elf.h to vma.c
  um: Cleanup mem_32/64.c headers
  um: Remove hppfs
  um: Move syscall() declaration into os.h
  um: kernel: ksyms: Export symbol syscall() for fixing modpost issue
  um/os-Linux: Use char[] for syscall_stub declarations
  um: Use char[] for linker script address declarations
  ...
2015-06-28 13:55:08 -07:00
Steve French
aab1893d5f Add SMB3.11 mount option synonym for new dialect
Most people think of SMB 3.1.1 as SMB version 3.11 so add synonym
for "vers=3.1.1" of "vers=3.11" on mount.

Also make sure that unlike SMB3.0 and 3.02 we don't send
validate negotiate on mount (it is handled by negotiate contexts) -
add list of SMB3.11 specific functions (distinct from 3.0 dialect).

Signed-off-by: Steve French <steve.french@primarydata.com>w
2015-06-27 20:28:11 -07:00
Steve French
80bc83c360 add struct FILE_STANDARD_INFO
Signed-off-by: Gregor Beck <gbeck@sernet.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
2015-06-27 20:25:56 -07:00
Steve French
f799d6234b Make dialect negotiation warning message easier to read
Dialect version and minor version are easier to read in hex

Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-27 20:28:49 -07:00
Steve French
eed0e1753c Add defines and structs for smb3.1 dialect
Add new structures and defines for SMB3.11 negotiate, session setup and tcon

See MS-SMB2-diff.pdf section 2.2.3 for additional protocol documentation.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-27 20:23:59 -07:00
Steve French
5f7fbf733c Allow parsing vers=3.11 on cifs mount
Parses and recognizes "vers=3.1.1" on cifs mount and allows sending
0x0311 as a new CIFS/SMB3 dialect. Subsequent patches will add
the new negotiate contexts and updated session setup

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2015-06-27 20:23:32 -07:00
Noel Power
f291095f34 client MUST ignore EncryptionKeyLength if CAP_EXTENDED_SECURITY is set
[MS-SMB] 2.2.4.5.2.1 states:

"ChallengeLength (1 byte): When the CAP_EXTENDED_SECURITY bit is set,
 the server MUST set this value to zero and clients MUST ignore this
 value."

Signed-off-by: Noel Power <noel.power@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2015-06-27 20:26:00 -07:00
Linus Torvalds
e22619a29f Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "The main change in this kernel is Casey's generalized LSM stacking
  work, which removes the hard-coding of Capabilities and Yama stacking,
  allowing multiple arbitrary "small" LSMs to be stacked with a default
  monolithic module (e.g.  SELinux, Smack, AppArmor).

  See
        https://lwn.net/Articles/636056/

  This will allow smaller, simpler LSMs to be incorporated into the
  mainline kernel and arbitrarily stacked by users.  Also, this is a
  useful cleanup of the LSM code in its own right"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
  tpm, tpm_crb: fix le64_to_cpu conversions in crb_acpi_add()
  vTPM: set virtual device before passing to ibmvtpm_reset_crq
  tpm_ibmvtpm: remove unneccessary message level.
  ima: update builtin policies
  ima: extend "mask" policy matching support
  ima: add support for new "euid" policy condition
  ima: fix ima_show_template_data_ascii()
  Smack: freeing an error pointer in smk_write_revoke_subj()
  selinux: fix setting of security labels on NFS
  selinux: Remove unused permission definitions
  selinux: enable genfscon labeling for sysfs and pstore files
  selinux: enable per-file labeling for debugfs files.
  selinux: update netlink socket classes
  signals: don't abuse __flush_signals() in selinux_bprm_committed_creds()
  selinux: Print 'sclass' as string when unrecognized netlink message occurs
  Smack: allow multiple labels in onlycap
  Smack: fix seq operations in smackfs
  ima: pass iint to ima_add_violation()
  ima: wrap event related data to the new ima_event_data structure
  integrity: add validity checks for 'path' parameter
  ...
2015-06-27 13:26:03 -07:00
Geert Uytterhoeven
ff5053f666 bdi: Remove "inline" keyword from exported I_BDEV() implementation
With gcc 3.4.6/4.1.2/4.2.4 (not with 4.4.7/4.6.4/4.8.4):

      CC      fs/block_dev.o
    include/linux/fs.h:804: warning: ‘I_BDEV’ declared inline after being called
    include/linux/fs.h:804: warning: previous declaration of ‘I_BDEV’ was here

Commit a212b105b0 ("bdi: make inode_to_bdi() inline") added a
caller of I_BDEV() in a header file, exposing the bogus "inline" on the
exported implementation.

Drop the "inline" keyword to fix this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-27 11:44:12 -06:00
Linus Torvalds
d2c3ac7e7e Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "A relatively quiet cycle, with a mix of cleanup and smaller bugfixes"

* 'for-4.2' of git://linux-nfs.org/~bfields/linux: (24 commits)
  sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
  nfsd: wrap too long lines in nfsd4_encode_read
  nfsd: fput rd_file from XDR encode context
  nfsd: take struct file setup fully into nfs4_preprocess_stateid_op
  nfsd: refactor nfs4_preprocess_stateid_op
  nfsd: clean up raparams handling
  nfsd: use swap() in sort_pacl_range()
  rpcrdma: Merge svcrdma and xprtrdma modules into one
  svcrdma: Add a separate "max data segs macro for svcrdma
  svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL
  svcrdma: Keep rpcrdma_msg fields in network byte-order
  svcrdma: Fix byte-swapping in svc_rdma_sendto.c
  nfsd: Update callback sequnce id only CB_SEQUENCE success
  nfsd: Reset cb_status in nfsd4_cb_prepare() at retrying
  svcrdma: Remove svc_rdma_xdr_decode_deferred_req()
  SUNRPC: Move EXPORT_SYMBOL for svc_process
  uapi/nfs: Add NFSv4.1 ACL definitions
  nfsd: Remove dead declarations
  nfsd: work around a gcc-5.1 warning
  nfsd: Checking for acl support does not require fetching any acls
  ...
2015-06-27 10:14:39 -07:00
Linus Torvalds
546fac6073 GFS2: merge window
Here is a list of patches we've accumulated for GFS2 for the current upstream
 merge window. We have a good mixture this time. Here are some of the features:
 
 1. Fix a problem with RO mounts writing to the journal.
 2. Further improvements to quotas on GFS2.
 3. Added support for rename2 and RENAME_EXCHANGE on GFS2.
 4. Increase performance by making glock lru_list less of a bottleneck.
 5. Increase performance by avoiding unnecessary buffer_head releases.
 6. Increase performance by using average glock round trip time from all CPUs.
 7. Fixes for some compiler warnings and minor white space issues.
 8. Other misc. bug fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEbBAABAgAGBQJVjFEXAAoJENeLYdPf93o7zqoH926XV0oddQCsGCYg6gq7OL+c
 /4q2y9x31hkv5XbPTNcahqR6UsK8JcbcdZD+XAqTftL4Q789FDAdWbYS++45qz8D
 YuUFgZd2bc75Ge2qqacgEdv85YRLtws1fRnI6DUpjfN1qJ9kJXX+gk+BSve6rh6V
 Qyv8a4DRIw1En5fwt0R6yIg0LI/ywPhYeVlxo6WUoK8fiL/i3eNd57Jgv5YQ6ly+
 ZyqH8w1m0kE4IkIiTFgmIpvepiWLBCA3mPOfHfE3QxbDKpXe4uFsMdnxghmP5bFN
 3H9syJQNFqjs+ooKma/fE2VmpxQ5/5lAs0/ms+ECW3GvBseTll8Iln7y4NwgAQ==
 =ITwD
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-merge-window' of git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 updates from Bob Peterson:
 "Here are the patches we've accumulated for GFS2 for the current
  upstream merge window.  We have a good mixture this time.  Here are
  some of the features:

   - Fix a problem with RO mounts writing to the journal.

   - Further improvements to quotas on GFS2.

   - Added support for rename2 and RENAME_EXCHANGE on GFS2.

   - Increase performance by making glock lru_list less of a bottleneck.

   - Increase performance by avoiding unnecessary buffer_head releases.

   - Increase performance by using average glock round trip time from all CPUs.

   - Fixes for some compiler warnings and minor white space issues.

   - Other misc bug fixes"

* tag 'gfs2-merge-window' of git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Don't brelse rgrp buffer_heads every allocation
  GFS2: Don't add all glocks to the lru
  gfs2: Don't support fallocate on jdata	files
  gfs2: s64 cast for negative quota value
  gfs2: limit quota log messages
  gfs2: fix quota updates on block boundaries
  gfs2: fix shadow warning in gfs2_rbm_find()
  gfs2: kerneldoc warning fixes
  gfs2: convert simple_str to kstr
  GFS2: make sure S_NOSEC flag isn't overwritten
  GFS2: add support for rename2 and RENAME_EXCHANGE
  gfs2: handle NULL rgd in set_rgrp_preferences
  GFS2: inode.c: indent with TABs, not spaces
  GFS2: mark the journal idle to fix ro mounts
  GFS2: Average in only non-zero round-trip times for congestion stats
  GFS2: Use average srttb value in congestion calculations
2015-06-27 09:47:46 -07:00
Linus Torvalds
ebeaa8ddb3 Revert "jbd2: speedup jbd2_journal_dirty_metadata()"
This reverts commit 2143c1965a.

This commit seems to be the cause of the following jbd2 assertion
failure:

   ------------[ cut here ]------------
   kernel BUG at fs/jbd2/transaction.c:1325!
   invalid opcode: 0000 [#1] SMP
   Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
   CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679b411 #1
   Hardware name:                  /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
   task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
   RIP: jbd2_journal_dirty_metadata+0x237/0x290
   Call Trace:
     __ext4_handle_dirty_metadata+0x43/0x1f0
     ext4_handle_dirty_dirent_node+0xde/0x160
     ? jbd2_journal_get_write_access+0x36/0x50
     ext4_delete_entry+0x112/0x160
     ? __ext4_journal_start_sb+0x52/0xb0
     ext4_unlink+0xfa/0x260
     vfs_unlink+0xec/0x190
     do_unlinkat+0x24a/0x270
     SyS_unlink+0x11/0x20
     entry_SYSCALL_64_fastpath+0x12/0x6a
   ---[ end trace ae033ebde8d080b4 ]---

which is not easily reproducible (I've seen it just once, and then Ted
was able to reproduce it once).  Revert it while Ted and Jan try to
figure out what is wrong.

Cc: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-27 09:41:50 -07:00
Trond Myklebust
6c5a0d8915 NFSv4.2: LAYOUTSTATS is optional to implement
Make it so, by checking the return value for NFS4ERR_MOTSUPP and
caching the information as a server capability.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-27 11:48:58 -04:00