Commit Graph

35156 Commits

Author SHA1 Message Date
Al Viro
c7999c3627 reduce m_start() cost...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:09 -04:00
Al Viro
f2ebb3a921 smarter propagate_mnt()
The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporarily) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M_{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticeably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:08 -04:00
Al Viro
38129a13e6 switch mnt_hash to hlist
fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating.  Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.

[fix for dumb braino folded]

Spotted by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:51 -04:00
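
The termination argument in 38129a13e6 can be illustrated with a small userspace sketch. This is not the kernel's hlist or list_head code; `struct node`, `walk_terminated()` and `walk_circular()` are hypothetical names, and `max_steps` exists only so the demonstration can report a runaway walk instead of spinning forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type with a single forward pointer. */
struct node {
	struct node *next;
};

/*
 * Walk a NULL-terminated chain starting at pos.  Returns the number of
 * nodes visited, or -1 if more than max_steps nodes were seen.  Even if
 * pos has been moved onto a different chain, that chain also ends in
 * NULL, so the walk stops.
 */
static int walk_terminated(const struct node *pos, int max_steps)
{
	int steps = 0;

	assert(max_steps >= 0);
	for (; pos; pos = pos->next) {
		if (++steps > max_steps)
			return -1;
	}
	return steps;
}

/*
 * Walk a circular list that is considered finished when we get back to
 * head.  If pos now sits on a different ring, head is never reached;
 * the step limit turns that endless loop into a -1 return.
 */
static int walk_circular(const struct node *head, const struct node *pos,
			 int max_steps)
{
	int steps = 0;

	assert(max_steps >= 0);
	for (; pos != head; pos = pos->next) {
		if (++steps > max_steps)
			return -1;
	}
	return steps;
}
```

Calling `walk_circular()` with the head of one ring and a position on another ring returns -1, while `walk_terminated()` returns a finite count no matter which chain the position ended up on.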
Al Viro
0b1b901b5a don't bother with propagate_mnt() unless the target is shared
If the dest_mnt is not shared, propagate_mnt() does nothing -
there are no mounts to propagate to and thus no copies to create.
Might as well not bother calling it in that case.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:50 -04:00
Al Viro
1d6a32acd7 keep shadowed vfsmounts together
preparation to switching mnt_hash to hlist

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:50 -04:00
Al Viro
0818bf27c0 resizable namespace.c hashes
* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-30 19:18:49 -04:00
Sasha Levin
d9060742fb ocfs2: check if cluster name exists before deref
Commit c74a3bdd9b ("ocfs2: add clustername to cluster connection") is
trying to strlcpy a string which was explicitly passed as NULL in the
very same patch, triggering a NULL ptr deref.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: strlcpy (lib/string.c:388 lib/string.c:151)
  CPU: 19 PID: 19426 Comm: trinity-c19 Tainted: G        W     3.14.0-rc7-next-20140325-sasha-00014-g9476368-dirty #274
  RIP:  strlcpy (lib/string.c:388 lib/string.c:151)
  Call Trace:
   ocfs2_cluster_connect (fs/ocfs2/stackglue.c:350)
   ocfs2_cluster_connect_agnostic (fs/ocfs2/stackglue.c:396)
   user_dlm_register (fs/ocfs2/dlmfs/userdlm.c:679)
   dlmfs_mkdir (fs/ocfs2/dlmfs/dlmfs.c:503)
   vfs_mkdir (fs/namei.c:3467)
   SyS_mkdirat (fs/namei.c:3488 fs/namei.c:3472)
   tracesys (arch/x86/kernel/entry_64.S:749)

akpm: this patch probably disables the feature.  A temporary thing to
avoid trivial oopses.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-28 13:56:58 -07:00
Jan Kara
75c5a52da3 vfs: Allocate anon_inode_inode in anon_inode_init()
Currently we allocate anon_inode_inode in anon_inodefs_mount. This is
somewhat fragile as if that function ever gets called again, it will
overwrite anon_inode_inode pointer. So move the initialization of
anon_inode_inode to anon_inode_init().

Signed-off-by: Jan Kara <jack@suse.cz>
[ Further simplified on suggestion from Dave Jones ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-27 09:52:54 -07:00
Linus Torvalds
fce7fc79c8 fs: remove now stale label in anon_inode_init()
The previous commit removed the register_filesystem() call and the
associated error handling, but left the label for the error path that no
longer exists.  Remove that too.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-25 17:43:34 -07:00
Jan Kara
d6f2589ad5 fs: Avoid userspace mounting anon_inodefs filesystem
anon_inodefs filesystem is a kernel internal filesystem userspace
shouldn't mess with. Remove registration of it so userspace cannot
even try to mount it (which would fail anyway because the filesystem is
MS_NOUSER).

This fixes an oops triggered by trinity when it tried mounting
anon_inodefs, which overwrote the anon_inode_inode pointer while another
CPU was in anon_inode_getfile() between ihold() and d_instantiate(),
effectively creating a dentry pointing to an inode without holding a
reference to it.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-25 17:42:16 -07:00
Linus Torvalds
632b06aa28 Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux
Pull nfsd fix from Bruce Fields:
 "J R Okajima sent this early and I was just slow to pass it along,
  apologies.  Fortunately it's a simple fix"

* 'nfsd-next' of git://linux-nfs.org/~bfields/linux:
  nfsd: fix lost nfserrno() call in nfsd_setattr()
2014-03-25 15:24:11 -07:00
Al Viro
b37199e626 rcuwalk: recheck mount_lock after mountpoint crossing attempts
We can get false negative from __lookup_mnt() if an unrelated vfsmount
gets moved.  In that case legitimize_mnt() is guaranteed to fail,
and we will fall back to non-RCU walk... unless we end up running
into a hard error on a filesystem object we wouldn't have reached
if not for that false negative.  IOW, delaying that check until
the end of pathname resolution is wrong - we should recheck right
after we attempt to cross the mountpoint.  We don't need to recheck
unless we see d_mountpoint() being true - in that case even if
we have just raced with mount/umount, we can simply go on as if
we'd come at the moment when the sucker wasn't a mountpoint; if we
run into a hard error as the result, it was a legitimate outcome.
__lookup_mnt() returning NULL is different in that respect, since
it might've happened due to operation on completely unrelated
mountpoint.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-23 00:32:55 -04:00
Al Viro
e825196d48 make prepend_name() work correctly when called with negative *buflen
In all callchains leading to prepend_name(), the value left in *buflen
is eventually discarded unused if prepend_name() has returned a negative.
So we are free to do what prepend() does, and subtract from *buflen
*before* checking for underflow (which turns into checking the sign
of subtraction result, of course).

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-23 00:28:40 -04:00
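
The "subtract first, then check the sign" ordering described in e825196d48 can be sketched in userspace as follows. `prepend_sketch()` is a hypothetical stand-in, not the kernel's prepend() or prepend_name(); it assumes the buffer is filled from the end backwards and that `*buflen` holds the space still free.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Copy name in front of *buffer.  *buflen is decremented before the
 * check, so a value that is already negative on entry simply stays
 * negative and nothing is written.
 */
static int prepend_sketch(char **buffer, int *buflen,
			  const char *name, int namelen)
{
	assert(namelen >= 0);

	*buflen -= namelen;		/* subtract first ... */
	if (*buflen < 0)		/* ... the sign reports underflow */
		return -ENAMETOOLONG;

	*buffer -= namelen;
	memcpy(*buffer, name, (size_t)namelen);
	return 0;
}
```

Once a call has failed, later calls keep failing without touching the buffer, which is safe as long as callers discard `*buflen` after an error, as the commit message states they do.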
Eric Biggers
99aea68134 vfs: Don't let __fdget_pos() get FMODE_PATH files
Commit bd2a31d522 ("get rid of fget_light()") introduced the
__fdget_pos() function, which returns the resulting file pointer and
fdput flags combined in an 'unsigned long'.  However, it also changed the
behavior to return files with FMODE_PATH set, which shouldn't happen
because read(), write(), lseek(), etc. aren't allowed on such files.
This commit restores the old behavior.

This regression actually had no effect on read() and write() since
FMODE_READ and FMODE_WRITE are not set on file descriptors opened with
O_PATH, but it did cause lseek() on a file descriptor opened with O_PATH
to fail with ESPIPE rather than EBADF.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-23 00:03:12 -04:00
Eric Biggers
d7a15f8d07 vfs: atomic f_pos access in llseek()
Commit 9c225f2655 ("vfs: atomic f_pos accesses as per POSIX") changed
several system calls to use fdget_pos() instead of fdget(), but missed
sys_llseek().  Fix it.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-23 00:03:12 -04:00
Linus Torvalds
33807f4f0d Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "A fix for the problem which Al spotted in cifs_writev and a followup
  (noticed when fixing CVE-2014-0069) patch to ensure that cifs never
  sends more than the smb frame length over the socket (as we saw with
  that cifs_iovec_write problem that Jeff fixed last month)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: mask off top byte in get_rfc1002_length()
  cifs: sanity check length of data to send before sending
  CIFS: Fix wrong pos argument of cifs_find_lock_conflict
2014-03-11 11:53:42 -07:00
Linus Torvalds
8712a00514 Merge branch 'akpm' (patches from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "Nine fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  cris: convert ffs from an object-like macro to a function-like macro
  hfsplus: add HFSX subfolder count support
  tools/testing/selftests/ipc/msgque.c: handle msgget failure return correctly
  MAINTAINERS: blackfin: add git repository
  revert "kallsyms: fix absolute addresses for kASLR"
  mm/Kconfig: fix URL for zsmalloc benchmark
  fs/proc/base.c: fix GPF in /proc/$PID/map_files
  mm/compaction: break out of loop on !PageBuddy in isolate_freepages_block
  mm: fix GFP_THISNODE callers and clarify
2014-03-10 17:26:36 -07:00
Sergei Antonov
d7d673a591 hfsplus: add HFSX subfolder count support
Adds support for HFSX 'HasFolderCount' flag and a corresponding
'folderCount' field in folder records.  (For reference see
HFS_FOLDERCOUNT and kHFSHasFolderCountBit/kHFSHasFolderCountMask in
Apple's source code.)

Ignoring subfolder count leads to fs errors found by Mac:

  ...
  Checking catalog hierarchy.
  HasFolderCount flag needs to be set (id = 105)
  (It should be 0x10 instead of 0)
  Incorrect folder count in a directory (id = 2)
  (It should be 7 instead of 6)
  ...

Steps to reproduce:
 Format with "newfs_hfs -s /dev/diskXXX".
 Mount in Linux.
 Create a new directory in root.
 Unmount.
 Run "fsck_hfs /dev/diskXXX".

The patch handles directory creation, deletion, and rename.

Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10 17:26:21 -07:00
Artem Fetishev
70335abb26 fs/proc/base.c: fix GPF in /proc/$PID/map_files
The expected logic of proc_map_files_get_link() is either to return 0
and initialize 'path' or return an error and leave 'path' uninitialized.

By the time dname_to_vma_addr() returns 0 the corresponding vma may
already be gone.  In this case the path is not initialized but the
return value is still 0.  This results in 'general protection fault'
inside d_path().

Steps to reproduce:

  CONFIG_CHECKPOINT_RESTORE=y

    fd = open(...);
    while (1) {
        mmap(fd, ...);
        munmap(fd, ...);
    }

  ls -la /proc/$PID/map_files

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=68991

Signed-off-by: Artem Fetishev <artem_fetishev@epam.com>
Signed-off-by: Aleksandr Terekhov <aleksandr_terekhov@epam.com>
Reported-by: <wiebittewas@gmail.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10 17:26:20 -07:00
Linus Torvalds
e6a4b6f5ea Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

Clean up file table accesses (get rid of fget_light() in favor of the
fdget() interface), add proper file position locking.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  get rid of fget_light()
  sockfd_lookup_light(): switch to fdget^W^Waway from fget_light
  vfs: atomic f_pos accesses as per POSIX
  ocfs2 syncs the wrong range...
2014-03-10 12:57:26 -07:00
Al Viro
bd2a31d522 get rid of fget_light()
instead of returning the flags by reference, we can just have the
low-level primitive return those in lower bits of unsigned long,
with struct file * derived from the rest.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:42 -04:00
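
The idea in bd2a31d522 of returning flags in the low bits of one word, with the pointer derived from the rest, can be sketched in userspace. Everything here is illustrative: `struct file_sketch`, `SK_FLAG_MASK` and the helper names are hypothetical, and the sketch uses `uintptr_t` where the commit message speaks of `unsigned long`. It assumes the pointed-to type is at least 4-byte aligned, so two low bits are always zero in a valid pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real object; only its alignment matters. */
struct file_sketch {
	long pos;
};

_Static_assert(_Alignof(struct file_sketch) >= 4,
	       "two low pointer bits must be free");

#define SK_FLAG_MASK ((uintptr_t)3)

/* Combine a pointer and up to two flag bits into one word. */
static uintptr_t pack_file(struct file_sketch *f, unsigned int flags)
{
	assert(((uintptr_t)f & SK_FLAG_MASK) == 0);
	return (uintptr_t)f | ((uintptr_t)flags & SK_FLAG_MASK);
}

/* Recover the pointer by clearing the flag bits. */
static struct file_sketch *unpack_file(uintptr_t v)
{
	return (struct file_sketch *)(v & ~SK_FLAG_MASK);
}

/* Recover the flags from the low bits. */
static unsigned int unpack_flags(uintptr_t v)
{
	return (unsigned int)(v & SK_FLAG_MASK);
}
```

Returning a single word avoids the by-reference out-parameter the commit message mentions; the cost is that callers must always go through the unpack helpers.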
Linus Torvalds
9c225f2655 vfs: atomic f_pos accesses as per POSIX
Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each other's data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

 "2.9.7 Thread Interactions with Regular File Operations

  All of the following functions shall be atomic with respect to each
  other in the effects specified in POSIX.1-2008 when they operate on
  regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.

Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:41 -04:00
Al Viro
1b56e98990 ocfs2 syncs the wrong range...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:43:32 -04:00
Linus Torvalds
fe9ea91cde NFS client bugfixes for Linux 3.14
Highlights include:
 
 - Fix another nfs4_sequence corruptor in RELEASE_LOCKOWNER
 - Fix an Oopsable delegation callback race
 - Fix another bad stateid infinite loop
 - Fail the data server I/O if the stateid represents a lost lock
 - Fix an Oopsable sunrpc trace event

Merge tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - Fix another nfs4_sequence corruptor in RELEASE_LOCKOWNER
   - Fix an Oopsable delegation callback race
   - Fix another bad stateid infinite loop
   - Fail the data server I/O if the stateid represents a lost lock
   - Fix an Oopsable sunrpc trace event"

* tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Fix oops when trace sunrpc_task events in nfs client
  NFSv4: Fail the truncate() if the lock/open stateid is invalid
  NFSv4.1 Fail data server I/O if stateid represents a lost lock
  NFSv4: Fix the return value of nfs4_select_rw_stateid
  NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateid
  NFS: Fix a delegation callback race
  NFSv4: Fix another nfs4_sequence corruptor
2014-03-09 19:17:39 -07:00
Linus Torvalds
2a75184d52 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Small collection of fixes for 3.14-rc. It contains:

   - Three minor updates to blk-mq from Christoph.

   - Reduce number of unaligned (< 4kb) in-flight writes on mtip32xx to
     two.  From Micron.

   - Make the blk-mq CPU notify spinlock raw, since it can't be a
     sleeper spinlock on RT.  From Mike Galbraith.

   - Drop now bogus BUG_ON() for bio iteration with blk integrity.  From
     Nic Bellinger.

   - Properly propagate the SYNC flag on requests. From Shaohua"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: add REQ_SYNC early
  rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
  bio-integrity: Drop bio_integrity_verify BUG_ON in post bip->bip_iter world
  blk-mq: support partial I/O completions
  blk-mq: merge blk_mq_insert_request and blk_mq_run_request
  blk-mq: remove blk_mq_alloc_rq
  mtip32xx: Reduce the number of unaligned writes to 2
2014-03-07 09:59:44 -08:00
Trond Myklebust
0418dae105 NFSv4: Fail the truncate() if the lock/open stateid is invalid
If the open stateid could not be recovered, or the file locks were lost,
then we should fail the truncate() operation altogether.

Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-03-05 11:55:25 -05:00
Andy Adamson
869a9d375d NFSv4.1 Fail data server I/O if stateid represents a lost lock
Signed-off-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-03-05 11:55:24 -05:00
Trond Myklebust
927864cd92 NFSv4: Fix the return value of nfs4_select_rw_stateid
In commit 5521abfdcf (NFSv4: Resend the READ/WRITE RPC call
if a stateid change causes an error), we overloaded the return value of
nfs4_select_rw_stateid() to cause it to return -EWOULDBLOCK if an RPC
call is outstanding that would cause the NFSv4 lock or open stateid
to change.
That is all redundant when we actually copy the stateid used in the
read/write RPC call that failed, and check that against the current
stateid. It is doubly so, when we consider that in the NFSv4.1 case,
we also set the stateid's seqid to the special value '0', which means
'match the current valid stateid'.

Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-03-05 11:55:24 -05:00
Trond Myklebust
e1253be0ec NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateid
When nfs4_set_rw_stateid() fails by returning EIO to indicate that
the stateid is completely invalid, then it makes no sense to have it
trigger a retry of the READ or WRITE operation. Instead, we should just
have it fall through and attempt a recovery.

This fixes an infinite loop in which the client keeps replaying the same
bad stateid back to the server.

Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-03-05 11:55:06 -05:00
Vyacheslav Dubeyko
bd2c003532 hfsplus: fix remount issue
The current implementation of the HFS+ driver has a small issue with the
remount option: you are unable to remount from RO mode into RW mode by
means of the command "mount -o remount,rw /dev/loop0 /mnt/hfsplus".
Trying to execute the following sequence of commands results in an
error message:

  mount /dev/loop0 /mnt/hfsplus
  mount -o remount,ro /dev/loop0 /mnt/hfsplus
  mount -o remount,rw /dev/loop0 /mnt/hfsplus

  mount: you must specify the filesystem type

  mount -t hfsplus -o remount,rw /dev/loop0 /mnt/hfsplus

  mount: /mnt/hfsplus not mounted or bad option

The reason for the issue is a failure of the mount syscall:

  mount("/dev/loop0", "/mnt/hfsplus", 0x2282a60, MS_MGC_VAL|MS_REMOUNT, NULL) = -1 EINVAL (Invalid argument)

Namely, the hfsplus_parse_options_remount() method receives an empty
"input" argument and returns false in that case.  As a result,
hfsplus_remount() returns the -EINVAL error code.

This patch fixes the issue by returning true for the case of an empty
"input" argument in the hfsplus_parse_options_remount() method.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:49 -08:00
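
The behaviour change in bd2c003532 (an empty option string is accepted rather than rejected) can be sketched as below. This is not the driver's parser: `parse_remount_sketch()` is a hypothetical function, and recognising only a `force` token is an assumption made to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Parse a comma-separated option string.  A NULL or empty string means
 * "no options given" and is treated as success, which is the point of
 * the fix.  Unknown tokens are rejected.
 */
static bool parse_remount_sketch(const char *input, bool *force)
{
	assert(force != NULL);

	if (!input)			/* nothing to parse is not an error */
		return true;

	while (*input) {
		size_t n = strcspn(input, ",");

		if (n == 5 && strncmp(input, "force", 5) == 0)
			*force = true;
		else if (n != 0)
			return false;	/* unknown option */

		input += n;
		if (*input == ',')
			input++;
	}
	return true;
}
```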
Jan Kara
15c34a7606 ocfs2: fix quota file corruption
Global quota files are accessed from different nodes.  Thus we cannot
cache offset of quota structure in the quota file after we drop our node
reference count to it because after that moment quota structure may be
freed and reallocated elsewhere by a different node resulting in
corruption of quota file.

Fix the problem by clearing dq_off when we are releasing dquot structure.
We also remove the DQ_READ_B handling because it is useless -
DQ_ACTIVE_B is set iff DQ_READ_B is set.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:48 -08:00
David Rientjes
668f9abbd4 mm: close PageTail race
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
Trond Myklebust
755a48a7a4 NFS: Fix a delegation callback race
The clean-up in commit 36281caa83 ended up removing a NULL pointer check
that is needed in order to prevent an Oops in
nfs_async_inode_return_delegation().

Reported-by: "Yan, Zheng" <zheng.z.yan@intel.com>
Link: http://lkml.kernel.org/r/5313E9F6.2020405@intel.com
Fixes: 36281caa83 (NFSv4: Further clean-ups of delegation stateid validation)
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-03-02 22:03:12 -05:00
Linus Torvalds
3751c97036 Driver core fix for 3.14-rc5
Here is a single sysfs fix for 3.14-rc5.  It fixes a reported problem
 with the namespace code in sysfs.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'driver-core-3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull sysfs fix from Greg KH:
 "Here is a single sysfs fix for 3.14-rc5.  It fixes a reported problem
  with the namespace code in sysfs"

* tag 'driver-core-3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  sysfs: fix namespace refcnt leak
2014-03-02 15:13:41 -08:00
Trond Myklebust
b7e63a1079 NFSv4: Fix another nfs4_sequence corruptor
nfs4_release_lockowner needs to set the rpc_message reply to point to
the nfs4_sequence_res in order to avoid another Oopsable situation
in nfs41_assign_slot.

Fixes: fbd4bfd1d9 (NFS: Add nfs4_sequence calls for RELEASE_LOCKOWNER)
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-03-01 13:51:53 -06:00
Jeff Layton
dca1c8d17a cifs: mask off top byte in get_rfc1002_length()
The rfc1002 length actually includes a type byte, which we aren't
masking off. In most cases, it's not a problem since the
RFC1002_SESSION_MESSAGE type is 0, but when doing a RFC1002 session
establishment, the type is non-zero and that throws off the returned
length.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-28 14:01:14 -06:00
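
The masking described in dca1c8d17a can be shown with a portable userspace sketch. `rfc1002_length_sketch()` is a hypothetical helper, not the kernel's get_rfc1002_length(); it assumes the header is four bytes in network byte order, with the type in the top byte and the length in the low 24 bits, as the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read a 4-byte big-endian header and return only the length part.
 * Without the mask, a non-zero type byte would inflate the result.
 */
static uint32_t rfc1002_length_sketch(const unsigned char *hdr)
{
	uint32_t word;

	assert(hdr != NULL);
	word = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
	       ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];

	return word & 0x00ffffffu;	/* drop the type byte */
}
```

For a session message (type 0) the mask changes nothing; for a header whose first byte is non-zero, only the low three bytes survive.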
Linus Torvalds
8d7531825c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull filesystem fixes from Jan Kara:
 "Notification, writeback, udf, quota fixes

  The notification patches are (with one exception) a fallout of my
  fsnotify rework which went into -rc1 (I've extended LTP to cover these
  corner cases to avoid similar breakage in future).

  The UDF patch is a nasty data corruption Al has recently reported,
  the revert of the writeback patch is due to possibility of violating
  sync(2) guarantees, and a quota bug can lead to corruption of quota
  files in ocfs2"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Allocate overflow events with proper type
  fanotify: Handle overflow in case of permission events
  fsnotify: Fix detection whether overflow event is queued
  Revert "writeback: do not sync data dirtied after sync start"
  quota: Fix race between dqput() and dquot_scan_active()
  udf: Fix data corruption on file type conversion
  inotify: Fix reporting of cookies for inotify events
2014-02-27 10:37:22 -08:00
Li Zefan
fed95bab8d sysfs: fix namespace refcnt leak
As mount() and kill_sb() are not a one-to-one match, we shouldn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Make the new argument as second-to-last arg, suggested by Tejun.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
 ---
 fs/kernfs/mount.c      | 8 +++++++-
 fs/sysfs/mount.c       | 5 +++--
 include/linux/kernfs.h | 9 +++++----
 3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-25 07:37:52 -08:00
Jan Kara
ff57cd5863 fsnotify: Allocate overflow events with proper type
Commit 7053aee26a "fsnotify: do not share events between notification
groups" used overflow event statically allocated in a group with the
size of the generic notification event. This causes problems because
some code looks at type-specific parts of the event structure, gets
confused by the random data it sees there, and crashes.

Fix the problem by allocating overflow event with type corresponding to
the group type so code cannot get confused.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-25 11:18:06 +01:00
Jan Kara
482ef06c5e fanotify: Handle overflow in case of permission events
If the event queue overflows when we are handling permission event, we
will never get response from userspace. So we must avoid waiting for it.
Change fsnotify_add_notify_event() to return whether overflow has
happened so that we can detect it in fanotify_handle_event() and act
accordingly.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-25 11:17:58 +01:00
Jan Kara
2513190a92 fsnotify: Fix detection whether overflow event is queued
Currently we don't initialize an event's list head when we remove it
from the event list. Thus detection of whether the overflow event is
already queued doesn't work. Fix it by always initializing the list head
when deleting an event from a list.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-25 11:17:52 +01:00
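
Why re-initializing the list head matters in 2513190a92 can be seen in a minimal doubly linked list. This is a userspace sketch with hypothetical names (`struct list_node`, `node_unqueued()`, `del_init()`), not the kernel's list API: a node that points to itself counts as "not queued", so removal has to restore that state for the check to work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_node {
	struct list_node *prev, *next;
};

/* A node pointing at itself is on no list. */
static void node_init(struct list_node *n)
{
	n->prev = n->next = n;
}

static bool node_unqueued(const struct list_node *n)
{
	return n->next == n;
}

/* Insert n just before head, i.e. at the tail of the queue. */
static void add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Unlink n and make it point at itself again.  Without the final
 * node_init(), n would keep stale neighbours and node_unqueued() would
 * report it as still queued.
 */
static void del_init(struct list_node *n)
{
	assert(n->prev != NULL && n->next != NULL);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}
```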
Jeff Layton
a26054d184 cifs: sanity check length of data to send before sending
We had a bug discovered recently where an upper layer function
(cifs_iovec_write) could pass down a smb_rqst with an invalid amount of
data in it. The length of the SMB frame would be correct, but the rqst
struct would cause smb_send_rqst to send nearly 4GB of data.

This should never be the case. Add some sanity checking to the beginning
of smb_send_rqst that ensures that the amount of data we're going to
send agrees with the length in the RFC1002 header. If it doesn't, WARN()
and return -EIO to the upper layers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-23 20:55:07 -06:00
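
The check described in a26054d184 can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the code in smb_send_rqst(): struct segment and check_request_length() are hypothetical, and the header is assumed to be four bytes at the start of the first segment, consisting of a type byte followed by a 24-bit big-endian length that counts the bytes after those four. The sketch only returns -EIO; the WARN() mentioned in the commit is not modeled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one buffer of a request. */
struct segment {
    const uint8_t *base;
    size_t len;
};

/* Assumed layout: hdr[0] is the type, hdr[1..3] a big-endian length. */
static uint32_t frame_length(const uint8_t *hdr)
{
    return ((uint32_t)hdr[1] << 16) | ((uint32_t)hdr[2] << 8) | hdr[3];
}

/*
 * Return 0 when the bytes we are about to send match the length announced
 * in the header, -EIO otherwise.
 */
static int check_request_length(const struct segment *segs, size_t nsegs)
{
    size_t total = 0;

    if (nsegs == 0 || segs[0].len < 4)
        return -EIO; /* no room for the 4-byte header */

    for (size_t i = 0; i < nsegs; i++)
        total += segs[i].len;

    /* The header length does not count the 4 header bytes themselves. */
    if (total - 4 != frame_length(segs[0].base))
        return -EIO;

    return 0;
}
```

Running the check before any data is written means a malformed request is rejected up front instead of partway through a send.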
Pavel Shilovsky
6b1168e161 CIFS: Fix wrong pos argument of cifs_find_lock_conflict
and use generic_file_aio_write rather than __generic_file_aio_write
in cifs_writev.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-23 20:54:50 -06:00
Linus Torvalds
645ceee885 Merge branch 'xfs-fixes-for-3.14-rc4' of git://oss.sgi.com/xfs/xfs
Pull xfs fixes from Dave Chinner:
 "This is the first pull request I've had to do for you, so I'm still
  sorting things out.  The reason I'm sending this and not Ben should be
  obvious from the first commit below - SGI has stepped down from the
  XFS maintainership role.  As such, I'd like to take another
  opportunity to thank them for their many years of effort maintaining
  XFS and supporting the XFS community that they developed from the
  ground up.

  So I haven't had time to work things like signed tags into my
  workflows yet, so this is just a repo branch I'm asking you to pull
  from.  And yes, I named the branch -rc4 because I wanted the fixes in
  rc4, not because the branch was for merging into -rc3.  Probably not
  right, either.

  Anyway, I should have everything sorted out by the time the next merge
  window comes around.  If there's anything that you don't like in the
  pull req, feel free to flame me unmercifully.

  The changes are fixes for recent regressions and important thinkos in
  verification code:

        - a log vector buffer alignment issue on ia32
        - timestamps on truncate got mangled
        - primary superblock CRC validation fixes and error message
          sanitisation"

* 'xfs-fixes-for-3.14-rc4' of git://oss.sgi.com/xfs/xfs:
  xfs: limit superblock corruption errors to actual corruption
  xfs: skip verification on initial "guess" superblock read
  MAINTAINERS: SGI no longer maintaining XFS
  xfs: xfs_sb_read_verify() doesn't flag bad crcs on primary sb
  xfs: ensure correct log item buffer alignment
  xfs: ensure correct timestamp updates from truncate
2014-02-22 08:26:01 -08:00
Jan Kara
0dc83bd30b Revert "writeback: do not sync data dirtied after sync start"
This reverts commit c4a391b53a. Dave Chinner <david@fromorbit.com>
has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.

CC: stable@vger.kernel.org # >= 3.13
Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-22 02:02:28 +01:00
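
The failure mode behind the revert in 0dc83bd30b can be shown with a toy model. This is a sketch under simplifying assumptions: struct inode_model, sync_should_write() and redirty() are hypothetical, timestamps are plain integers compared directly (the kernel compares jiffies with wraparound-safe helpers), and the filter stands in for the reverted "skip data dirtied after sync start" rule.

```c
#include <stdbool.h>

/* Hypothetical model of the fields that matter for the example. */
struct inode_model {
    unsigned long dirtied_when; /* time the inode was (re)queued as dirty */
    bool dirty;
};

/* Reverted rule: sync only writes inodes dirtied before it started. */
static bool sync_should_write(const struct inode_model *ino,
                              unsigned long sync_start)
{
    return ino->dirty && ino->dirtied_when <= sync_start;
}

/* Requeueing overwrites the timestamp with the current time. */
static void redirty(struct inode_model *ino, unsigned long now)
{
    ino->dirtied_when = now;
}
```

An inode dirtied at time 10 passes the filter for a sync that starts at time 20. If it is requeued at time 25 while that sync is running, the filter rejects it even though its data was dirty before the sync started, which is how inodes can be left out of sync(2).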
Nicholas Bellinger
eec70897d8 bio-integrity: Drop bio_integrity_verify BUG_ON in post bip->bip_iter world
Given that bip->bip_iter.bi_size is decremented after bio_advance() ->
bio_integrity_advance() is called, the BUG_ON() in bio_integrity_verify()
ends up tripping in v3.14-rc1 code with the advent of immutable biovecs
in:

commit d57a5f7c66
Author: Kent Overstreet <kmo@daterainc.com>
Date:   Sat Nov 23 17:20:16 2013 -0800

    bio-integrity: Convert to bvec_iter

Given that there is no easy way to ascertain the original bi_size
value, go ahead and drop this BUG_ON().

Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reported-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-21 15:56:36 -08:00
Jan Kara
1362f4ea20 quota: Fix race between dqput() and dquot_scan_active()
Currently the last dqput() can race with dquot_scan_active(), causing it to
call the callback for an already deactivated dquot. The race is as follows:

CPU1					CPU2
  dqput()
    spin_lock(&dq_list_lock);
    if (atomic_read(&dquot->dq_count) > 1) {
     - not taken
    if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
      spin_unlock(&dq_list_lock);
      ->release_dquot(dquot);
        if (atomic_read(&dquot->dq_count) > 1)
         - not taken
					  dquot_scan_active()
					    spin_lock(&dq_list_lock);
					    if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
					     - not taken
					    atomic_inc(&dquot->dq_count);
					    spin_unlock(&dq_list_lock);
        - proceeds to release dquot
					    ret = fn(dquot, priv);
					     - called for inactive dquot

Fix the problem by making sure that a possible ->release_dquot() call has
finished by the time we call the callback, and that new calls to it notice
the reference dquot_scan_active() has taken and bail out.

CC: stable@vger.kernel.org # >= 2.6.29
Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-20 21:57:04 +01:00
Jan Kara
09ebb17ab4 udf: Fix data corruption on file type conversion
UDF has two types of files - files with data stored in the inode (ICB in
UDF terminology) and files with data stored in external data blocks. We
convert a file from the in-inode format to the external format in
udf_file_aio_write() when we find out the data won't fit into the inode any
longer. However, the following race between two O_APPEND writes can happen:

CPU1					CPU2
udf_file_aio_write()			udf_file_aio_write()
  down_write(&iinfo->i_data_sem);
  checks that i_size + count1 fits within inode
    => no need to convert
  up_write(&iinfo->i_data_sem);
					  down_write(&iinfo->i_data_sem);
					  checks that i_size + count2 fits
					    within inode => no need to convert
					  up_write(&iinfo->i_data_sem);
  generic_file_aio_write()
    - extends file by count1 bytes
					  generic_file_aio_write()
					    - extends file by count2 bytes

Clearly if count1 + count2 doesn't fit into the inode, we overwrite
kernel buffers beyond inode, possibly corrupting the filesystem as well.

Fix the problem by acquiring i_mutex before checking whether write fits
into the inode and using __generic_file_aio_write() afterwards which
puts check and write into one critical section.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-20 21:56:00 +01:00
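
The race in 09ebb17ab4 and its fix can be replayed deterministically in a single thread. This is a sketch, not UDF code: struct file_state, INLINE_CAPACITY and the helpers are hypothetical, and no real lock is taken. append_atomic() stands for the region the fix places under one lock, where the fit check, the conversion and the extension happen as one step.

```c
#include <stdbool.h>
#include <stddef.h>

#define INLINE_CAPACITY 100 /* hypothetical number of bytes that fit in the inode */

/* Hypothetical model of a file that starts with its data inside the inode. */
struct file_state {
    size_t size;
    bool external; /* true once converted to external data blocks */
};

/* Check step: does a write of count bytes need no conversion? */
static bool fits_inline(const struct file_state *f, size_t count)
{
    return f->external || f->size + count <= INLINE_CAPACITY;
}

/* Write step: extend the file by count bytes. */
static void extend(struct file_state *f, size_t count)
{
    f->size += count;
}

/* Fixed scheme: check, convert if needed, and extend in one critical section. */
static void append_atomic(struct file_state *f, size_t count)
{
    if (!fits_inline(f, count))
        f->external = true;
    extend(f, count);
}

/* In-inode data larger than the inode means we wrote past its end. */
static bool corrupted(const struct file_state *f)
{
    return !f->external && f->size > INLINE_CAPACITY;
}
```

Calling fits_inline() for two 30-byte writes on a 60-byte file and only then calling extend() twice reproduces the interleaving from the diagram and ends with corrupted() returning true. Two append_atomic() calls on the same starting state convert the file on the second write instead.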
Linus Torvalds
6a4d07f85b Merge branch 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Quite a few fixes this time.

  Three locking fixes, all marked for -stable.  A couple error path
  fixes and some misc fixes.  Hugh found a bug in memcg offlining
  sequence and we thought we could fix that from cgroup core side but
  that turned out to be insufficient and got reverted.  A different fix
  has been applied to -mm"

* 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: update cgroup_enable_task_cg_lists() to grab siglock
  Revert "cgroup: use an ordered workqueue for cgroup destruction"
  cgroup: protect modifications to cgroup_idr with cgroup_mutex
  cgroup: fix locking in cgroup_cfts_commit()
  cgroup: fix error return from cgroup_create()
  cgroup: fix error return value in cgroup_mount()
  cgroup: use an ordered workqueue for cgroup destruction
  nfs: include xattr.h from fs/nfs/nfs3proc.c
  cpuset: update MAINTAINERS entry
  arm, pm, vmpressure: add missing slab.h includes
2014-02-20 12:01:09 -08:00
Linus Torvalds
e95003c3f9 Merge tag 'nfs-for-3.14-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include stable fixes for the following bugs:

   - General performance regression due to NFS_INO_INVALID_LABEL being
     set when the server doesn't support labeled NFS
   - Hang in the RPC code due to a socket out-of-buffer race
   - Infinite loop when trying to establish the NFSv4 lease
   - Use-after-free bug in the RPCSEC gss code.
   - nfs4_select_rw_stateid is returning with a non-zero error value on
     success

  Other bug fixes:

  - Potential memory scribble in the RPC bi-directional RPC code
  - Pipe version reference leak
  - Use the correct net namespace in the new NFSv4 migration code"

* tag 'nfs-for-3.14-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS fix error return in nfs4_select_rw_stateid
  NFSv4: Use the correct net namespace in nfs4_update_server
  SUNRPC: Fix a pipe_version reference leak
  SUNRPC: Ensure that gss_auth isn't freed before its upcall messages
  SUNRPC: Fix potential memory scribble in xprt_free_bc_request()
  SUNRPC: Fix races in xs_nospace()
  SUNRPC: Don't create a gss auth cache unless rpc.gssd is running
  NFS: Do not set NFS_INO_INVALID_LABEL unless server supports labeled NFS
2014-02-19 12:13:02 -08:00