Commit Graph

1046 Commits

Author SHA1 Message Date
Yan, Zheng
4819301287 ceph: don't grabs open file reference for aborted request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:25 -07:00
Yan, Zheng
ab866549b3 ceph: drop extra open file reference in ceph_atomic_open()
ceph_atomic_open() calls ceph_open() after receiving the MDS reply.
ceph_open() grabs an extra open file reference. (The open request
already holds an open file reference)

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:23 -07:00
Yan, Zheng
54008399dc ceph: preallocate buffer for readdir reply
Preallocate buffer for readdir reply. Limit number of entries in
readdir reply according to the buffer size.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:22 -07:00
Yan, Zheng
cc48c3e85f ceph: don't include ceph.{file,dir}.layout vxattr in listxattr()
This avoids 'cp -a' modifying layout of new files/directories.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:21 -07:00
Yan, Zheng
1e5c6649ff ceph: check buffer size in ceph_vxattrcb_layout()
If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:19 -07:00
Yan, Zheng
00bd8edb86 ceph: fix null pointer dereference in discard_cap_releases()
send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:17 -07:00
Fabian Frederick
5f75ce5781 ceph: Remove get/set acl on symlinks
Remove unsupported symlink operations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-04-04 21:07:14 -07:00
Yan, Zheng
d9ffc4f770 ceph: set mds_wanted when MDS reply changes a cap to auth cap
When adjusting caps client wants, MDS does not record caps that are
not allowed. For non-auth MDS, it does not record WR caps. So when
a MDS reply changes a non-auth cap to auth cap, client needs to set
cap's mds_wanted according to the reply.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:12 -07:00
Yan, Zheng
eb13e832f8 ceph: use fl->fl_file as owner identifier of flock and posix lock
flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).

The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:11 -07:00
Yan, Zheng
eb70c0ce4e ceph: forbid mandatory file lock
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:09 -07:00
Yan, Zheng
0e8e95d6d7 ceph: use fl->fl_type to decide flock operation
VFS does not directly pass flock's operation code to filesystem's
flock callback. It translates the operation code to the form how
posix lock's parameters are presented.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:08 -07:00
Yan, Zheng
8c93cd610c ceph: update i_max_size even if inode version does not change
handle following sequence of events:
 - client releases a inode with i_max_size > 0. The release message
   is queued. (is not sent to the auth MDS)
 - a 'lookup' request reply from non-auth MDS returns the same inode.
 - client opens the inode in write mode. The version of inode trace
   in 'open' request reply is equal to the cached inode's version.
 - client requests new max size. The MDS ignores the request because
   it does not affect client's write range

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:06 -07:00
Yan, Zheng
a255060451 ceph: make sure write caps are registered with auth MDS
Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:05 -07:00
Yan, Zheng
c137a32a40 ceph: print inode number for LOOKUPINO request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:54 +08:00
Yan, Zheng
19913b4eac ceph: add get_name() NFS export callback
Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
8996f4f23d ceph: fix ceph_fh_to_parent()
ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field
of struct ceph_nfs_confh. This is wrong, it should return dentry that
corresponds to the 'parent_ino' field.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
9017c2ec78 ceph: add get_parent() NFS export callback
The callback uses LOOKUPPARENT MDS request to find parent.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yan, Zheng
4f32b42dca ceph: simplify ceph_fh_to_dentry()
MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
f1fc4fee3b ceph: fscache: Wait for completion of object initialization
The object store limit needs to be updated after writing,
and this can be done provided the corresponding object has already
been initialized. Current object initialization is done asynchrously,
which introduce a race if a file is opened, then immediately followed
by a writing, the initialization may have not completed, the code will
reach the ASSERT in fscache_submit_exclusive_op() to cause kernel
bug.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
32d3e148dd ceph: fscache: Update object store limit after file writing
Synchronize object->store_limit[_l] with new inode->i_size after file writing.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Yunchuan Wen
020c4bddc0 ceph: fscache: add an interface to synchronize object store limit
Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
2014-04-03 10:33:53 +08:00
Sage Weil
4b58c9b19b ceph: do not set r_old_dentry_dir on link()
This is racy--we do not know whather d_parent has changed out from
underneath us because i_mutex is not held on the source inode's directory.

Also, taking this reference is useless.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:53 +08:00
Sage Weil
844d87c332 ceph: do not assume r_old_dentry[_dir] always set together
Do not assume that r_old_dentry implies that r_old_dentry_dir is also
true.  Separate out the ref cleanup and make the debugs dump behave when
it is NULL.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:53 +08:00
Sage Weil
752c8bdcfe ceph: do not chain inode updates to parent fsync
The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.

Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Sage Weil
180061a58c ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename()
This is just old_dir; no reason to abuse the dcache pointers.

Reported-by: Al Viro <viro.zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Yan, Zheng
15289dc85b ceph: let MDS adjust readdir 'frag'
If readdir 'frag' is adjusted, readdir 'offset' should be reset.
Otherwise some dentries may be lost when readdir and fragmenting
directory happen at the some.

Another way to fix this issue is let MDS adjust readdir 'frag'.
The code that handles MDS reply reset the readdir 'offset' if
the readdir reply is different than the requested one.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-03 10:33:52 +08:00
Yan, Zheng
dcd3cc05e5 ceph: fix reset_readdir()
When changing readdir postion, fi->next_offset should be set to 0
if the new postion is not in the first dirfrag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:52 +08:00
Yan, Zheng
f049420607 ceph: fix ceph_dir_llseek()
Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for
directory. For a fragmented directory, offset (frag_t, off) can be
larger than inode->i_sb->s_maxbytes.

At the very beginning of ceph_dir_llseek(), local variable old_offset
is initialized to parameter offset. This doesn't make sense neither.
Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:52 +08:00
Al Viro
aec605f429 ceph_aio_write(): switch to generic_perform_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:37 -04:00
Al Viro
fcacafd269 kill the 5th argument of generic_file_buffered_write()
same story - it's &iocb->ki_pos in all cases

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:34 -04:00
Yan, Zheng
4d5f5df673 ceph: fix __dcache_readdir()
If directory is fragmented, readdir() read its dirfrags one by one.
After reading all dirfrags, the corresponding dentries are sorted in
(frag_t, off) order in the dcache. If dentries of a directory are all
cached, __dcache_readdir() can use the cached dentries to satisfy
readdir syscall. But when checking if a given dentry is after the
position of readdir, __dcache_readdir() compares numerical value of
frag_t directly. This is wrong, it should use ceph_frag_compare().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:13 -08:00
Sage Weil
45195e42c7 ceph: add acl, noacl options for cephfs mount
Make the 'acl' option dependent on having ACL support compiled in.  Make
the 'noacl' option work even without it so that one can always ask it to
be off and not error out on mount when it is not supported.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-02-17 12:37:12 -08:00
Guangliang Zhao
c969d9bf91 ceph: make ceph_forget_all_cached_acls() static inline
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-02-17 12:37:12 -08:00
Yan, Zheng
b20a95a0dd ceph: add missing init_acl() for mkdir() and atomic_open()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:11 -08:00
Yan, Zheng
7a92d64760 ceph: fix ceph_set_acl()
If acl is equivalent to file mode permission bits, ceph_set_acl()
needs to remove any existing acl xattr. Use __ceph_setxattr() to
handle both setting and removing acl xattr cases, it doesn't return
-ENODATA when there is no acl xattr.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:11 -08:00
Yan, Zheng
524186ace6 ceph: fix ceph_removexattr()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:10 -08:00
Yan, Zheng
bcdfeb2eb4 ceph: remove xattr when null value is given to setxattr()
For the setxattr request, introduce a new flag CEPH_XATTR_REMOVE
to distinguish null value case from the zero-length value case.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:09 -08:00
Yan, Zheng
fbc0b970dd ceph: properly handle XATTR_CREATE and XATTR_REPLACE
return -EEXIST if XATTR_CREATE is set and xattr alread exists.
return -ENODATA if XATTR_REPLACE is set but xattr does not exist.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:05 -08:00
Sage Weil
77516dc92a ceph: fix missing dput in ceph_set_acl
Add matching dput() for d_find_alias().  Move d_find_alias() down a bit
at Julia's suggestion.

[ Introduced by commit 72466d0b92: "ceph: fix posix ACL hooks" ]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-31 08:14:06 -08:00
Christoph Hellwig
7585823619 ceph: simplify ceph_{get,init}_acl
- ->get_acl only gets called after we checked for a cached ACL, so no
   need to call get_cached_acl again.
 - no need to check IS_POSIXACL in ->get_acl, without that it should
   never get set as all the callers that set it already have the check.
 - you should be able to use the full posix_acl_create in CEPH

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-30 19:26:17 -08:00
Peter Rosin
32d35d44d0 ceph: remove duplicate declaration of ceph_setattr
Signed-off-by: Peter Rosin <peda@lysator.liu.se>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-30 08:38:00 -08:00
Sage Weil
72466d0b92 ceph: fix posix ACL hooks
The merge of commit 7221fe4c2e ("ceph: add acl for cephfs") raced with
upstream changes in the generic POSIX ACL code (eg commit 2aeccbe957
"fs: add generic xattr_acl handlers" and others).

Some of the fallout was fixed in commit 4db658ea0c ("ceph: Fix up after
semantic merge conflict"), but it was incomplete: the set_acl
inode_operation wasn't getting set, and the prototype needed to be
adjusted a bit (it doesn't take a dentry anymore).

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29 16:05:28 -08:00
Linus Torvalds
4db658ea0c ceph: Fix up after semantic merge conflict
The previous ceph-client merge resulted in ceph not even building,
because there was a merge conflict that wasn't visible as an actual data
conflict: commit 7221fe4c2e ("ceph: add acl for cephfs") added support
for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change
a lot of the POSIX ACL helper functions to be much more helpful to
filesystems (see for example commits 2aeccbe957 "fs: add generic
xattr_acl handlers", 5bf3258fd2 "fs: make posix_acl_chmod more useful"
and 37bc15392a "fs: make posix_acl_create more useful")

The reason this conflict wasn't obvious was many-fold: because it was a
semantic conflict rather than a data conflict, it wasn't visible in the
git merge as a conflict.  And because the VFS tree hadn't been in
linux-next, people hadn't become aware of it that way.  And because I
was at jury duty this morning, I was using my laptop and as a result not
doing constant "allmodconfig" builds.

Anyway, this fixes the build and generally removes a fair chunk of the
Ceph POSIX ACL support code, since the improved helpers seem to match
really well for Ceph too.  But I don't actually have any way to *test*
the end result, and I was really hoping for some ACK's for this.  Oh,
well.

Not compiling certainly doesn't make things easier to test, so I'm
committing this without the acks after having waited for four hours...
Plus it's what I would have done for the merge had I noticed the
semantic conflict..

Reported-by: Dave Jones <davej@redhat.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Guangliang Zhao <lucienchao@gmail.com>
Cc: Li Wang <li.wang@ubuntykylin.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-28 18:06:18 -08:00
Ilya Dryomov
125d725c92 ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
Use min_t(size_t, ...) instead of plain min(), which does strict type
checking, to avoid compile warning on i386.

Cc: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-28 09:57:21 -08:00
Ilya Dryomov
37b52fe608 ceph: fix dout() compile warnings in ceph_filemap_fault()
PAGE_CACHE_SIZE is unsigned long on all architectures, however size_t
is either unsigned int or unsigned long.  Rather than change format
strings, cast PAGE_CACHE_SIZE to size_t to be in line with dout()s in
ceph_page_mkwrite().

Cc: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-28 09:57:06 -08:00
Ilya Dryomov
7c13cb6435 libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:32 +02:00
Yan, Zheng
11df2dfb61 ceph: add imported caps when handling cap export message
Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.

This allow us to handle situation that the importer MDS crashes and
the cap import message is missing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 16:30:31 +08:00
Yan, Zheng
5d72d13c42 ceph: add open export target session helper
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 16:30:30 +08:00
Yan, Zheng
4ee6a914ed ceph: remove exported caps when handling cap import message
Version 3 cap import message includes the ID of the exported
caps. It allow us to remove the exported caps if we still haven't
received the corresponding cap export message.

We remove the exported caps because they are stale, keeping them
can compromise consistence.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 16:30:28 +08:00
Yan, Zheng
186e4f7a4b ceph: handle session flush message
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:33 +08:00
Yan, Zheng
9215aeea62 ceph: check inode caps in ceph_d_revalidate
Some inodes in readdir reply may have no caps. Getattr mds request
for these inodes can return -ESTALE. The fix is consider dentry that
links to inode with no caps as invalid. Invalid dentry causes a
lookup request to send to the mds, the MDS will send caps back.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:33 +08:00
Yan, Zheng
ca18bede04 ceph: handle -ESTALE reply
Send requests that operate on path to directory's auth MDS if
mode == USE_AUTH_MDS. Always retry using the auth MDS if got
-ESTALE reply from non-auth MDS. Also clean up the code that
handles auth MDS change.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:33 +08:00
Yan, Zheng
979abfdd5c ceph: fix trim caps
- don't trim auth cap if there are flusing caps
- don't trim auth cap if any 'write' cap is wanted
- allow trimming non-auth cap even if the inode is dirty

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:32 +08:00
Yan, Zheng
9563f88c1f ceph: fix cache revoke race
handle following sequence of events:

- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
  increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
  so it does nothing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:32 +08:00
Yan, Zheng
d1b87809fb ceph: use ceph_seq_cmp() to compare migrate_seq
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:32 +08:00
Yan, Zheng
4fe59789ad ceph: handle cap export race in try_flush_caps()
auth cap may change after releasing the i_ceph_lock

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:32 +08:00
J. Bruce Fields
fc12c80aa5 ceph: trivial comment fix
"disconnected" is too easily confused with "DCACHE_DISCONNECTED".  I
think "unhashed" is the more precise term here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-16 16:03:50 -08:00
Ilya Dryomov
12b4629a9f libceph: all features fields must be u64
In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64.  (ceph.git has ~40 feature bits at this
point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:08 +02:00
Li Wang
183028052b ceph fscache: Uncaching no data page from fscache in readpage()
Currently, if one new page allocated into fscache in readpage(), however,
with no data read into due to error encountered during reading from OSDs,
the slot in fscache is not uncached. This patch fixes this.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>
2013-12-31 20:32:03 +02:00
Li Wang
3f42bc4bea ceph fscache: Introduce a routine for uncaching single no data page from fscache
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>
2013-12-31 20:32:02 +02:00
Guangliang Zhao
7221fe4c2e ceph: add acl for cephfs
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Li Wang <li.wang@ubuntykylin.com>
Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>
2013-12-31 20:32:01 +02:00
Yan, Zheng
61f6881621 ceph: check caps in filemap_fault and page_mkwrite
Adds cap check to the page fault handler. The check prevents page
fault handler from adding new page to the page cache while Fcb caps
are being revoked. This solves Fc revoking hang in multiple clients
mmap IO workload.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:00 +02:00
Libo Chen
aa8b60e077 fs: ceph: new helper: file_inode(file)
Signed-off-by: Libo Chen <clbchenlibo.chen@huawei.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13 09:13:29 -08:00
Li Wang
f36132a75a ceph: Clean up if error occurred in finish_read()
Clean up if error occurred rather than going through normal process

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13 09:13:28 -08:00
majianpeng
8eb4efb091 ceph: implement readv/preadv for sync operation
For readv/preadv sync-operatoin, ceph only do the first iov.
Now implement this.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-13 09:13:17 -08:00
majianpeng
e8344e6689 ceph: Implement writev/pwritev for sync operation.
For writev/pwritev sync-operatoin, ceph only do the first iov.

I divided the write-sync-operation into two functions. One for
direct-write, other for none-direct-sync-write. This is because for
none-direct-sync-write we can merge iovs to one. But for direct-write,
we can't merge iovs.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13 09:13:17 -08:00
Yan, Zheng
9f12bd119e ceph: drop unconnected inodes
Positve dentry and corresponding inode are always accompanied in MDS reply.
So no need to keep inode in the cache after dropping all its aliases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-13 09:13:16 -08:00
Li Wang
56f91aad69 ceph: Avoid data inconsistency due to d-cache aliasing in readpage()
If the length of data to be read in readpage() is exactly
PAGE_CACHE_SIZE, the original code does not flush d-cache
for data consistency after finishing reading. This patches fixes
this.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13 09:11:38 -08:00
Yan, Zheng
86b58d1313 ceph: initialize inode before instantiating dentry
commit b18825a7c8 (Put a small type field into struct dentry::d_flags)
put a type field into struct dentry::d_flags. __d_instantiate() set the
field by checking inode->i_mode. So we should initialize inode before
instantiating dentry when handling mds reply.

Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-13 09:11:36 -08:00
Linus Torvalds
4f9e5df211 Merge branch 'for-linus-bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph bug-fixes from Sage Weil:
 "These include a couple fixes to the new fscache code that went in
  during the last cycle (which will need to go stable@ shortly as well),
  a couple client-side directory fragmentation fixes, a fix for a race
  in the cap release queuing path, and a couple race fixes in the
  request abort and resend code.

  Obviously some of this could have gone into 3.12 final, but I
  preferred to overtest rather than send things in for a late -rc, and
  then my travel schedule intervened"

* 'for-linus-bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: allocate non-zero page to fscache in readpage()
  ceph: wake up 'safe' waiters when unregistering request
  ceph: cleanup aborted requests when re-sending requests.
  ceph: handle race between cap reconnect and cap release
  ceph: set caps count after composing cap reconnect message
  ceph: queue cap release in __ceph_remove_cap()
  ceph: handle frag mismatch between readdir request and reply
  ceph: remove outdated frag information
  ceph: hung on ceph fscache invalidate in some cases
2013-11-26 18:02:46 -08:00
Li Wang
ff638b7df5 ceph: allocate non-zero page to fscache in readpage()
ceph_osdc_readpages() returns number of bytes read, currently,
the code only allocate full-zero page into fscache, this patch
fixes this.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:07 -08:00
Yan, Zheng
fc55d2c944 ceph: wake up 'safe' waiters when unregistering request
We also need to wake up 'safe' waiters if error occurs or request
aborted. Otherwise sync(2)/fsync(2) may hang forever.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:05 -08:00
Yan, Zheng
eb1b8af33c ceph: cleanup aborted requests when re-sending requests.
Aborted requests usually get cleared when the reply is received.
If MDS crashes, no reply will be received. So we need to cleanup
aborted requests when re-sending requests.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:04 -08:00
Yan, Zheng
99a9c273b9 ceph: handle race between cap reconnect and cap release
When a cap get released while composing the cap reconnect message.
We should skip queuing the release message if the cap hasn't been
added to the cap reconnect message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:02 -08:00
Yan, Zheng
44c99757fa ceph: set caps count after composing cap reconnect message
It's possible that some caps get released while composing the cap
reconnect message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:01 -08:00
Yan, Zheng
a096b09aee ceph: queue cap release in __ceph_remove_cap()
call __queue_cap_release() in __ceph_remove_cap(), this avoids
acquiring s_cap_lock twice.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:00:59 -08:00
Yan, Zheng
81c6aea527 ceph: handle frag mismatch between readdir request and reply
If client has outdated directory fragments information, it may request
readdir an non-existent directory fragment. In this case, the MDS finds
an approximate directory fragment and sends its contents back to the
client. When receiving a reply with fragment that is different than the
requested one, the client need to reset the 'readdir offset'.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-30 14:49:53 -07:00
Yan, Zheng
53e879a485 ceph: remove outdated frag information
If directory fragments change, fill_inode() inserts new frags into
the fragtree, but it does not remove outdated frags from the fragtree.
This patch fixes it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-30 14:49:28 -07:00
David Howells
94d30ae90a FS-Cache: Provide the ability to enable/disable cookies
Provide the ability to enable and disable fscache cookies.  A disabled cookie
will reject or ignore further requests to:

	Acquire a child cookie
	Invalidate and update backing objects
	Check the consistency of a backing object
	Allocate storage for backing page
	Read backing pages
	Write to backing pages

but still allows:

	Checks/waits on the completion of already in-progress objects
	Uncaching of pages
	Relinquishment of cookies

Two new operations are provided:

 (1) Disable a cookie:

	void fscache_disable_cookie(struct fscache_cookie *cookie,
				    bool invalidate);

     If the cookie is not already disabled, this locks the cookie against other
     dis/enablement ops, marks the cookie as being disabled, discards or
     invalidates any backing objects and waits for cessation of activity on any
     associated object.

     This is a wrapper around a chunk split out of fscache_relinquish_cookie(),
     but it reinitialises the cookie such that it can be reenabled.

     All possible failures are handled internally.  The caller should consider
     calling fscache_uncache_all_inode_pages() afterwards to make sure all page
     markings are cleared up.

 (2) Enable a cookie:

	void fscache_enable_cookie(struct fscache_cookie *cookie,
				   bool (*can_enable)(void *data),
				   void *data)

     If the cookie is not already enabled, this locks the cookie against other
     dis/enablement ops, invokes can_enable() and, if the cookie is not an
     index cookie, will begin the procedure of acquiring backing objects.

     The optional can_enable() function is passed the data argument and returns
     a ruling as to whether or not enablement should actually be permitted to
     begin.

     All possible failures are handled internally.  The cookie will only be
     marked as enabled if provisional backing objects are allocated.

A later patch will introduce these to NFS.  Cookie enablement during nfs_open()
is then contingent on i_writecount <= 0.  can_enable() checks for a race
between open(O_RDONLY) and open(O_WRONLY/O_RDWR).  This simplifies NFS's cookie
handling and allows us to get rid of open(O_RDONLY) accidentally introducing
caching to an inode that's open for writing already.

One operation has its API modified:

 (3) Acquire a cookie.

	struct fscache_cookie *fscache_acquire_cookie(
		struct fscache_cookie *parent,
		const struct fscache_cookie_def *def,
		void *netfs_data,
		bool enable);

     This now has an additional argument that indicates whether the requested
     cookie should be enabled by default.  It doesn't need the can_enable()
     function because the caller must prevent multiple calls for the same netfs
     object and it doesn't need to take the enablement lock because no one else
     can get at the cookie before this returns.

Signed-off-by: David Howells <dhowells@redhat.com
2013-09-27 18:40:25 +01:00
Milosz Tanski
ffc79664d1 ceph: hung on ceph fscache invalidate in some cases
In some cases I'm on my ceph client cluster I'm seeing hunk kernel tasks in
the invalidate page code path. This is due to the fact that we don't check if
the page is marked as cache before calling fscache_wait_on_page_write().

This is the log from the hang

INFO: task XXXXXX:12034 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ...
Call Trace:
[<ffffffff81568d09>] schedule+0x29/0x70
[<ffffffffa01d4cbd>] __fscache_wait_on_page_write+0x6d/0xb0 [fscache]
[<ffffffff81083520>] ? add_wait_queue+0x60/0x60
[<ffffffffa029a3e9>] ceph_invalidate_fscache_page+0x29/0x50 [ceph]
[<ffffffffa027df00>] ceph_invalidatepage+0x70/0x190 [ceph]
[<ffffffff8112656f>] ? delete_from_page_cache+0x5f/0x70
[<ffffffff81133cab>] truncate_inode_page+0x8b/0x90
[<ffffffff81133ded>] truncate_inode_pages_range.part.12+0x13d/0x620
[<ffffffff8113431d>] truncate_inode_pages_range+0x4d/0x60
[<ffffffff811343b5>] truncate_inode_pages+0x15/0x20
[<ffffffff8119bbf6>] evict+0x1a6/0x1b0
[<ffffffff8119c3f3>] iput+0x103/0x190
 ...

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-25 18:20:14 -07:00
Yan, Zheng
a8d436f015 ceph: use d_invalidate() to invalidate aliases
d_invalidate() is the standard VFS method to invalidate dentry.
compare to d_delete(), it also try shrinking children dentries.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-06 12:55:29 -07:00
Yan, Zheng
ed284c49f6 ceph: remove ceph_lookup_inode()
commit 6f60f889 (ceph: fix freeing inode vs removing session caps race)
introduced ceph_lookup_inode(). But there is already a ceph_find_inode()
which provides similar function. So remove ceph_lookup_inode(), use
ceph_find_inode() instead.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <alex.elder@linary.org>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-09-06 12:55:09 -07:00
Milosz Tanski
971f0bdeaa ceph: trivial buildbot warnings fix
The linux-next build bot found a three of warnings, this addresses all of them.

 * non-ANSI function declaration of function 'ceph_fscache_register' and
   'ceph_fscache_unregister'
 * symbol 'ceph_cache_netfs' was not declared, now it's extern in the header.
 * warning: "pr_fmt" redefined

Signed-off-by: Milosz Tanski <milosz@adfin.com>
2013-09-06 16:50:12 +00:00
Milosz Tanski
e81568eb18 ceph: Do not do invalidate if the filesystem is mounted nofsc
Previously we would always try to enqueue work even if the filesystem is not
mounted with fscache enabled (or the file has no cookie). In the case of the
filesystem mouned nofsc (but with fscache compiled in) this would lead to a
crash.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
2013-09-06 16:50:12 +00:00
Milosz Tanski
d4d3aa38d6 ceph: page still marked private_2
Previous patch that allowed us to cleanup most of the issues with pages marked
as private_2 when calling ceph_readpages. However, there seams to be a case in
the error case clean up in start read that still trigers this from time to
time. I've only seen this one a couple times.

BUG: Bad page state in process petabucket  pfn:335b82
page:ffffea000cd6e080 count:0 mapcount:0 mapping:          (null) index:0x0
page flags: 0x200000000001000(private_2)
Call Trace:
 [<ffffffff81563442>] dump_stack+0x46/0x58
 [<ffffffff8112c7f7>] bad_page+0xc7/0x120
 [<ffffffff8112cd9e>] free_pages_prepare+0x10e/0x120
 [<ffffffff8112e580>] free_hot_cold_page+0x40/0x160
 [<ffffffff81132427>] __put_single_page+0x27/0x30
 [<ffffffff81132d95>] put_page+0x25/0x40
 [<ffffffffa02cb409>] ceph_readpages+0x2e9/0x6f0 [ceph]
 [<ffffffff811313cf>] __do_page_cache_readahead+0x1af/0x260

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-09-06 16:50:12 +00:00
Milosz Tanski
9b8dd1e8a5 ceph: ceph_readpage_to_fscache didn't check if marked
Previously ceph_readpage_to_fscache did not call if page was marked as cached
before calling fscache_write_page resulting in a BUG inside of fscache.

FS-Cache: Assertion failed
------------[ cut here ]------------
kernel BUG at fs/fscache/page.c:874!
invalid opcode: 0000 [#1] SMP
Call Trace:
 [<ffffffffa02e6566>] __ceph_readpage_to_fscache+0x66/0x80 [ceph]
 [<ffffffffa02caf84>] readpage_nounlock+0x124/0x210 [ceph]
 [<ffffffffa02cb08d>] ceph_readpage+0x1d/0x40 [ceph]
 [<ffffffff81126db6>] generic_file_aio_read+0x1f6/0x700
 [<ffffffffa02c6fcc>] ceph_aio_read+0x5fc/0xab0 [ceph]

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-09-06 16:50:12 +00:00
Milosz Tanski
76be778b3a ceph: clean PgPrivate2 on returning from readpages
In some cases the ceph readapages code code bails without filling all the pages
already marked by fscache. When we return back to readahead code this causes
a BUG.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
2013-09-06 16:50:11 +00:00
Milosz Tanski
99ccbd229c ceph: use fscache as a local presisent cache
Adding support for fscache to the Ceph filesystem. This would bring it to on
par with some of the other network filesystems in Linux (like NFS, AFS, etc...)

In order to mount the filesystem with fscache the 'fsc' mount option must be
passed.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-09-06 16:50:11 +00:00
Sha Zhengju
7d6e1f5461 ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
Following we will begin to add memcg dirty page accounting around
__set_page_dirty_{buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
avoid exporting those details to filesystems.

Since vfs set_page_dirty() should be called under page lock, here we don't need elaborate
codes to handle racy anymore, and two WARN_ON() are added to detect such exceptions.
Thanks very much for Sage and Yan Zheng's coaching!

I tested it in a two server's ceph environment that one is client and the other is
mds/osd/mon, and run the following fsx test from xfstests:

  ./fsx   1MB -N 50000 -p 10000 -l 1048576
  ./fsx  10MB -N 50000 -p 10000 -l 10485760
  ./fsx 100MB -N 50000 -p 10000 -l 104857600

The fsx does lots of mmap-read/mmap-write/truncate operations and the tests completed
successfully without triggering any of WARN_ON.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-27 16:29:44 -07:00
majianpeng
ee7289bfad ceph: allow sync_read/write return partial successed size of read/write.
For sync_read/write, it may do multi stripe operations.If one of those
met erro, we return the former successed size rather than a error value.
There is a exception for write-operation met -EOLDSNAPC.If this occur,we
retry the whole write again.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-27 12:28:46 -07:00
majianpeng
02ae66d8b2 ceph: fix bugs about handling short-read for sync read mode.
cephfs . show_layout
>layyout.data_pool:     0
>layout.object_size:   4194304
>layout.stripe_unit:   4194304
>layout.stripe_count:  1

TestA:
>dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
>dd if=/dev/urandom of=test bs=1M count=2 seek=4  oflag=direct
>dd if=test of=/dev/null bs=6M count=1 iflag=direct
The messages from func striped_read are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph:           file.c:350  : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
ceph:           file.c:381  : zero tail 4194304
ceph:           file.c:390  : striped_read returns 6291456
The hole of file is from 2M--4M.But actualy it zero the last 4M include
the last 2M area which isn't a hole.
Using this patch, the messages are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph:           file.c:358  :  zero gap 2097152 to 4194304
ceph:           file.c:350  : striped_read 4194304~2097152 (read 4194304) got 2097152
ceph:           file.c:384  : striped_read returns 6291456

TestB:
>echo majianpeng > test
>dd if=test of=/dev/null bs=2M count=1 iflag=direct
The messages are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph:           file.c:350  : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
ceph:           file.c:390  : striped_read returns 11
For this case,it did once more striped_read.It's no meaningless.
Using this patch, the message are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph:           file.c:384  : striped_read returns 11

Big thanks to Yan Zheng for the patch.

Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-27 12:28:45 -07:00
Li Wang
e907574323 ceph: remove useless variable revoked_rdcache
Cleanup in handle_cap_grant().

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-27 12:28:44 -07:00
Sage Weil
b314a90d8f ceph: fix fallocate division
We need to use do_div to divide by a 64-bit value.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-08-27 12:26:29 -07:00
Li Wang
ad7a60de88 ceph: punch hole support
This patch implements fallocate and punch hole support for Ceph kernel client.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
2013-08-15 11:12:17 -07:00
Yan, Zheng
3871cbb9a4 ceph: fix request max size
ceph_check_caps() requests new max size only when there is Fw cap.
If we call check_max_size() while there is no Fw cap. It updates
i_wanted_max_size and calls ceph_check_caps(), but ceph_check_caps()
does nothing. Later when Fw cap is issued, we call check_max_size()
again. But i_wanted_max_size is equal to 'endoff' at this time, so
check_max_size() doesn't call ceph_check_caps() and we end up with
waiting for the new max size forever.

The fix is duplicate ceph_check_caps()'s "request max size" code in
check_max_size(), and make try_get_cap_refs() wait for the Fw cap
before retry requesting new max size.

This patch also removes the "endoff > (inode->i_size << 1)" check
in check_max_size(). It's useless because there is no corresponding
logic in ceph_check_caps().

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-15 11:12:11 -07:00
Yan, Zheng
b0d7c22310 ceph: introduce i_truncate_mutex
I encountered below deadlock when running fsstress

wmtruncate work      truncate                 MDS
---------------  ------------------  --------------------------
                   lock i_mutex
                                      <- truncate file
lock i_mutex (blocked)
                                      <- revoking Fcb (filelock to MIX)
                   send request ->
                                         handle request (xlock filelock)

At the initial time, there are some dirty pages in the page cache.
When the kclient receives the truncate message, it reduces inode size
and creates some 'out of i_size' dirty pages. wmtruncate work can't
truncate these dirty pages because it's blocked by the i_mutex. Later
when the kclient receives the cap message that revokes Fcb caps, It
can't flush all dirty pages because writepages() only flushes dirty
pages within the inode size.

When the MDS handles the 'truncate' request from kclient, it waits
for the filelock to become stable. But the filelock is stuck in
unstable state because it can't finish revoking kclient's Fcb caps.

The truncate pagecache locking has already caused lots of trouble
for use. I think it's time simplify it by introducing a new mutex.
We use the new mutex to prevent concurrent truncate_inode_pages().
There is no need to worry about race between buffered write and
truncate_inode_pages(), because our "get caps" mechanism prevents
them from concurrent execution.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-15 11:12:06 -07:00
Milosz Tanski
b150f5c1c7 ceph: cleanup the logic in ceph_invalidatepage
The invalidatepage code bails if it encounters a non-zero page offset. The
current logic that does is non-obvious with multiple if statements.

This should be logically and functionally equivalent.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-15 11:12:02 -07:00
Sage Weil
ee3e542fec Merge remote-tracking branch 'linus/master' into testing 2013-08-15 11:11:45 -07:00
Milosz Tanski
fe2a801b50 ceph: Remove bogus check in invalidatepage
The early bug checks are moot because the VMA layer ensures those things.

1. It will not call invalidatepage unless PagePrivate (or PagePrivate2) are set
2. It will not call invalidatepage without taking a PageLock first.
3. Guantrees that the inode page is mapped.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:58 -07:00
Sage Weil
2f75e9e179 ceph: replace hold_mutex flag with goto
All of the early exit paths need to drop the mutex; it is only the normal
path through the function that does not.  Skip the unlock in that case
with a goto out_unlocked.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-09 17:55:48 -07:00
majianpeng
0e5dd45ce4 ceph: Move the place for EOLDSNAPC handle in ceph_aio_write to easily understand
Only for ceph_sync_write, the osd can return EOLDSNAPC.so move the
related codes after the call ceph_sync_write.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:43 -07:00
Yan, Zheng
6f60f88947 ceph: fix freeing inode vs removing session caps race
remove_session_caps() uses iterate_session_caps() to remove caps,
but iterate_session_caps() skips inodes that are being deleted.
So session->s_nr_caps can be non-zero after iterate_session_caps()
return.

We can fix the issue by waiting until deletions are complete.
__wait_on_freeing_inode() is designed for the job, but it is not
exported, so we use lookup inode function to access it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-09 17:55:32 -07:00
majianpeng
2fbcbff1d6 ceph: Add check returned value on func ceph_calc_ceph_pg.
Func ceph_calc_ceph_pg maybe failed.So add check for returned value.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:21 -07:00
majianpeng
7ab9b38070 ceph: Don't use ceph-sync-mode for synchronous-fs.
Sending reads and writes through the sync read/write paths bypasses the
page cache, which is not expected or generally a good idea.  Removing
the write check is safe as there is a conditional vfs_fsync_range() later
in ceph_aio_write that already checks for the same flag (via
IS_SYNC(inode)).

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:18 -07:00
Dan Carpenter
688bac461b ceph: cleanup types in striped_read()
We pass in a u64 value for "len" and then immediately truncate away the
upper 32 bits.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-08-09 17:55:15 -07:00
Yan, Zheng
ca20c99191 ceph: trim deleted inode
The MDS uses caps message to notify clients about deleted inode.
when receiving a such message, invalidate any alias of the inode.
This makes the kernel release the inode ASAP.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:10 -07:00
Yan, Zheng
85ce127a9a ceph: wake up writer if vmtruncate work get blocked
To write data, the writer first acquires the i_mutex, then try getting
caps. The writer may sleep while holding the i_mutex. If the MDS revokes
Fb cap in this case, vmtruncate work can't do its job because i_mutex
is locked. We should wake up the writer and let it truncate the pages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:54:33 -07:00
Yan, Zheng
ad88f23f42 ceph: drop CAP_LINK_SHARED when sending "link" request to MDS
To handle "link" request, the MDS need to xlock inode's linklock,
which requires revoking any CAP_LINK_SHARED.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:54:32 -07:00
Nathaniel Yazdani
c338c07c51 ceph: fix null pointer dereference
When register_session() is given an out-of-range argument for mds,
ceph_mdsmap_get_addr() will return a null pointer, which would be given to
ceph_con_open() & be dereferenced, causing a kernel oops. This fixes bug #4685
in the Ceph bug tracker <http://tracker.ceph.com/issues/4685>.

Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:52:58 -07:00
majianpeng
494ddd11be ceph: Don't forget the 'up_read(&osdc->map_sem)' if met error.
CC: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:49:39 -07:00
Linus Torvalds
9a5889ae1c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is some follow-on RBD cleanup after the last window's code drop,
  a series from Yan fixing multi-mds behavior in cephfs, and then a
  sprinkling of bug fixes all around.  Some warnings, sleeping while
  atomic, a null dereference, and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (36 commits)
  libceph: fix invalid unsigned->signed conversion for timespec encoding
  libceph: call r_unsafe_callback when unsafe reply is received
  ceph: fix race between cap issue and revoke
  ceph: fix cap revoke race
  ceph: fix pending vmtruncate race
  ceph: avoid accessing invalid memory
  libceph: Fix NULL pointer dereference in auth client code
  ceph: Reconstruct the func ceph_reserve_caps.
  ceph: Free mdsc if alloc mdsc->mdsmap failed.
  ceph: remove sb_start/end_write in ceph_aio_write.
  ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.
  ceph: fix sleeping function called from invalid context.
  ceph: move inode to proper flushing list when auth MDS changes
  rbd: fix a couple warnings
  ceph: clear migrate seq when MDS restarts
  ceph: check migrate seq before changing auth cap
  ceph: fix race between page writeback and truncate
  ceph: reset iov_len when discarding cap release messages
  ceph: fix cap release race
  libceph: fix truncate size calculation
  ...
2013-07-09 12:39:10 -07:00
Al Viro
84d08fa888 helper for reading ->d_count
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-05 18:59:33 +04:00
Yan, Zheng
6ee6b95373 ceph: fix race between cap issue and revoke
If we receive new caps from the auth MDS and the non-auth MDS is
revoking the newly issued caps, we should release the caps from
the non-auth MDS. The scenario is filelock's state changes from
SYNC to LOCK. Non-auth MDS revokes Fc cap, the client gets Fc cap
from the auth MDS at the same time.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:57 -07:00
Yan, Zheng
b1530f5704 ceph: fix cap revoke race
If caps are been revoking by the auth MDS, don't consider them as
issued even they are still issued by non-auth MDS. The non-auth
MDS should also be revoking/exporting these caps, the client just
hasn't received the cap revoke/export message.

The race I encountered is: When caps are exporting to new MDS, the
client receives cap import message and cap revoke message from the
new MDS, then receives cap export message from the old MDS. When
the client receives cap revoke message from the new MDS, the revoking
caps are still issued by the old MDS, so the client does nothing.
Later when the cap export message is received, the client removes
the caps issued by the old MDS. (Another way to fix the race is
calling ceph_check_caps() in handle_cap_export())

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:57 -07:00
Yan, Zheng
b415bf4f9f ceph: fix pending vmtruncate race
The locking order for pending vmtruncate is wrong, it can lead to
following race:

        write                  wmtruncate work
------------------------    ----------------------
lock i_mutex
check i_truncate_pending   check i_truncate_pending
truncate_inode_pages()     lock i_mutex (blocked)
copy data to page cache
unlock i_mutex
                           truncate_inode_pages()

The fix is take i_mutex before calling __ceph_do_pending_vmtruncate()

Fixes: http://tracker.ceph.com/issues/5453
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:56 -07:00
Sasha Levin
5446429630 ceph: avoid accessing invalid memory
when mounting ceph with a dev name that starts with a slash, ceph
would attempt to access the character before that slash. Since we
don't actually own that byte of memory, we would trigger an
invalid access:

[   43.499934] BUG: unable to handle kernel paging request at ffff880fa3a97fff
[   43.500984] IP: [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
[   43.501491] PGD 743b067 PUD 10283c4067 PMD 10282a6067 PTE 8000000fa3a97060
[   43.502301] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   43.503006] Dumping ftrace buffer:
[   43.503596]    (ftrace buffer empty)
[   43.504046] CPU: 0 PID: 10879 Comm: mount Tainted: G        W    3.10.0-sasha #1129
[   43.504851] task: ffff880fa625b000 ti: ffff880fa3412000 task.ti: ffff880fa3412000
[   43.505608] RIP: 0010:[<ffffffff818f3884>]  [<ffffffff818f3884>] parse_mount_options$
[   43.506552] RSP: 0018:ffff880fa3413d08  EFLAGS: 00010286
[   43.507133] RAX: ffff880fa3a98000 RBX: ffff880fa3a98000 RCX: 0000000000000000
[   43.507893] RDX: ffff880fa3a98001 RSI: 000000000000002f RDI: ffff880fa3a98000
[   43.508610] RBP: ffff880fa3413d58 R08: 0000000000001f99 R09: ffff880fa3fe64c0
[   43.509426] R10: ffff880fa3413d98 R11: ffff880fa38710d8 R12: ffff880fa3413da0
[   43.509792] R13: ffff880fa3a97fff R14: 0000000000000000 R15: ffff880fa3413d90
[   43.509792] FS:  00007fa9c48757e0(0000) GS:ffff880fd2600000(0000) knlGS:000000000000$
[   43.509792] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   43.509792] CR2: ffff880fa3a97fff CR3: 0000000fa3bb9000 CR4: 00000000000006b0
[   43.509792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   43.509792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   43.509792] Stack:
[   43.509792]  0000e5180000000e ffffffff85ca1900 ffff880fa38710d8 ffff880fa3413d98
[   43.509792]  0000000000000120 0000000000000000 ffff880fa3a98000 0000000000000000
[   43.509792]  ffffffff85cf32a0 0000000000000000 ffff880fa3413dc8 ffffffff818f3c72
[   43.509792] Call Trace:
[   43.509792]  [<ffffffff818f3c72>] ceph_mount+0xa2/0x390
[   43.509792]  [<ffffffff81226314>] ? pcpu_alloc+0x334/0x3c0
[   43.509792]  [<ffffffff81282f8d>] mount_fs+0x8d/0x1a0
[   43.509792]  [<ffffffff812263d0>] ? __alloc_percpu+0x10/0x20
[   43.509792]  [<ffffffff8129f799>] vfs_kern_mount+0x79/0x100
[   43.509792]  [<ffffffff812a224d>] do_new_mount+0xcd/0x1c0
[   43.509792]  [<ffffffff812a2e8d>] do_mount+0x15d/0x210
[   43.509792]  [<ffffffff81220e55>] ? strndup_user+0x45/0x60
[   43.509792]  [<ffffffff812a2fdd>] SyS_mount+0x9d/0xe0
[   43.509792]  [<ffffffff83fd816c>] tracesys+0xdd/0xe2
[   43.509792] Code: 4c 8b 5d c0 74 0a 48 8d 50 01 49 89 14 24 eb 17 31 c0 48 83 c9 ff $
[   43.509792] RIP  [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
[   43.509792]  RSP <ffff880fa3413d08>
[   43.509792] CR2: ffff880fa3a97fff
[   43.509792] ---[ end trace 22469cd81e93af51 ]---

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Sage Weil <sage@inktan.com>
2013-07-03 15:32:55 -07:00
majianpeng
93faca6ef4 ceph: Reconstruct the func ceph_reserve_caps.
Drop ignored return value.  Fix allocation failure case to not leak.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:54 -07:00
majianpeng
fb3101b6f0 ceph: Free mdsc if alloc mdsc->mdsmap failed.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:53 -07:00
Jianpeng Ma
0405a1499d ceph: remove sb_start/end_write in ceph_aio_write.
Either in vfs_write or io_submit,it call file_start/end_write.
The different between file_start/end_write and sb_start/end_write is
file_ only handle regular file.But i think in ceph_aio_write,it only
for regular file.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Acked-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-07-03 15:32:52 -07:00
majianpeng
c62988ec09 ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:52 -07:00
majianpeng
a1dc193733 ceph: fix sleeping function called from invalid context.
[ 1121.231883] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
[ 1121.231935] in_atomic(): 1, irqs_disabled(): 0, pid: 9831, name: mv
[ 1121.231971] 1 lock held by mv/9831:
[ 1121.231973]  #0:  (&(&ci->i_ceph_lock)->rlock){+.+...},at:[<ffffffffa02bbd38>] ceph_getxattr+0x58/0x1d0 [ceph]
[ 1121.231998] CPU: 3 PID: 9831 Comm: mv Not tainted 3.10.0-rc6+ #215
[ 1121.232000] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
[ 1121.232027]  ffff88006d355a80 ffff880092f69ce0 ffffffff8168348c ffff880092f69cf8
[ 1121.232045]  ffffffff81070435 ffff88006d355a20 ffff880092f69d20 ffffffff816899ba
[ 1121.232052]  0000000300000004 ffff8800b76911d0 ffff88006d355a20 ffff880092f69d68
[ 1121.232056] Call Trace:
[ 1121.232062]  [<ffffffff8168348c>] dump_stack+0x19/0x1b
[ 1121.232067]  [<ffffffff81070435>] __might_sleep+0xe5/0x110
[ 1121.232071]  [<ffffffff816899ba>] down_read+0x2a/0x98
[ 1121.232080]  [<ffffffffa02baf70>] ceph_vxattrcb_layout+0x60/0xf0 [ceph]
[ 1121.232088]  [<ffffffffa02bbd7f>] ceph_getxattr+0x9f/0x1d0 [ceph]
[ 1121.232093]  [<ffffffff81188d28>] vfs_getxattr+0xa8/0xd0
[ 1121.232097]  [<ffffffff8118900b>] getxattr+0xab/0x1c0
[ 1121.232100]  [<ffffffff811704f2>] ? final_putname+0x22/0x50
[ 1121.232104]  [<ffffffff81155f80>] ? kmem_cache_free+0xb0/0x260
[ 1121.232107]  [<ffffffff811704f2>] ? final_putname+0x22/0x50
[ 1121.232110]  [<ffffffff8109e63d>] ? trace_hardirqs_on+0xd/0x10
[ 1121.232114]  [<ffffffff816957a7>] ? sysret_check+0x1b/0x56
[ 1121.232120]  [<ffffffff81189c9c>] SyS_fgetxattr+0x6c/0xc0
[ 1121.232125]  [<ffffffff81695782>] system_call_fastpath+0x16/0x1b
[ 1121.232129] BUG: scheduling while atomic: mv/9831/0x10000002
[ 1121.232154] 1 lock held by mv/9831:
[ 1121.232156]  #0:  (&(&ci->i_ceph_lock)->rlock){+.+...}, at:
[<ffffffffa02bbd38>] ceph_getxattr+0x58/0x1d0 [ceph]

I think move the ci->i_ceph_lock down is safe because we can't free
ceph_inode_info at there.

CC: stable@vger.kernel.org  # 3.8+
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:51 -07:00
Yan, Zheng
005c46970e ceph: move inode to proper flushing list when auth MDS changes
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:50 -07:00
Yan, Zheng
667ca05cd9 ceph: clear migrate seq when MDS restarts
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:49 -07:00
Yan, Zheng
b8c2f3ae2d ceph: check migrate seq before changing auth cap
We may receive old request reply from the exporter MDS after receiving
the importer MDS' cap import message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:48 -07:00
Yan, Zheng
fc2744aa12 ceph: fix race between page writeback and truncate
The client can receive truncate request from MDS at any time.
So the page writeback code need to get i_size, truncate_seq and
truncate_size atomically

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:47 -07:00
Yan, Zheng
3803da4963 ceph: reset iov_len when discarding cap release messages
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:47 -07:00
Yan, Zheng
bb137f84d1 ceph: fix cap release race
ceph_encode_inode_release() can race with ceph_open() and release
caps wanted by open files. So it should call __ceph_caps_wanted()
to get the wanted caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:46 -07:00
Linus Torvalds
790eac5640 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
 "Assorted f_pos race fixes, making do_splice_direct() safe to call with
  i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
  ->d_hash/->d_compare calling conventions changes from Linus, misc
  stuff all over the place."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  Document ->tmpfile()
  ext4: ->tmpfile() support
  vfs: export lseek_execute() to modules
  lseek_execute() doesn't need an inode passed to it
  block_dev: switch to fixed_size_llseek()
  cpqphp_sysfs: switch to fixed_size_llseek()
  tile-srom: switch to fixed_size_llseek()
  proc_powerpc: switch to fixed_size_llseek()
  ubi/cdev: switch to fixed_size_llseek()
  pci/proc: switch to fixed_size_llseek()
  isapnp: switch to fixed_size_llseek()
  lpfc: switch to fixed_size_llseek()
  locks: give the blocked_hash its own spinlock
  locks: add a new "lm_owner_key" lock operation
  locks: turn the blocked_list into a hashtable
  locks: convert fl_link to a hlist_node
  locks: avoid taking global lock if possible when waking up blocked waiters
  locks: protect most of the file_lock handling with i_lock
  locks: encapsulate the fl_link list handling
  locks: make "added" in __posix_lock_file a bool
  ...
2013-07-03 09:10:19 -07:00
Jie Liu
46a1c2c7ae vfs: export lseek_execute() to modules
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().

To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-03 16:23:27 +04:00
Linus Torvalds
9e239bb939 Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
 block size is smaller than the page size (i.e., file systems 1k blocks
 on x86, or more interestingly file systems with 4k blocks on Power or
 ia64 systems.)
 
 In the cleanup category, the ext4's punch hole implementation was
 significantly improved by Lukas Czerner, and now supports bigalloc
 file systems.  In addition, Jan Kara significantly cleaned up the
 write submission code path.  We also improved error checking and added
 a few sanity checks.
 
 In the optimizations category, two major optimizations deserve
 mention.  The first is that ext4_writepages() is now used for
 nodelalloc and ext3 compatibility mode.  This allows writes to be
 submitted much more efficiently as a single bio request, instead of
 being sent as individual 4k writes into the block layer (which then
 relied on the elevator code to coalesce the requests in the block
 queue).  Secondly, the extent cache shrink mechanism, which was
 introduce in 3.9, no longer has a scalability bottleneck caused by the
 i_es_lru spinlock.  Other optimizations include some changes to reduce
 CPU usage and to avoid issuing empty commits unnecessarily.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
 yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
 NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
 rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
 ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
 zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
 U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
 vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
 +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
 XrddRLGlk4Hf+2o7/D7B
 =SwaI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduce in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Dan Carpenter
6af8652849 ceph: tidy ceph_mdsmap_decode() a little
I introduced a new temporary variable "info" instead of
"m->m_info[mds]".  Also I reversed the if condition and pulled
everything in one indent level.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:02 -07:00
Emil Goode
c213b50b7d ceph: improve error handling in ceph_mdsmap_decode
This patch makes the following improvements to the error handling
in the ceph_mdsmap_decode function:

- Add a NULL check for return value from kcalloc
- Make use of the variable err

Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-01 09:52:02 -07:00
Jim Schutt
4d1bf79aff ceph: fix up comment for ceph_count_locks() as to which lock to hold
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:01 -07:00
Jeff Layton
1c8c601a8c locks: protect most of the file_lock handling with i_lock
Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.

->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.

Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.

For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the  blocked_list is done without dropping the
lock in between.

On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.

With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.

Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:42 +04:00
Al Viro
77acfa29e1 [readdir] convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:41 +04:00
Linus Torvalds
8d7a8fe2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a pair of fixes for double-frees in the recent bundle for
  3.10, a couple of fixes for long-standing bugs (sleep while atomic and
  an endianness fix), and a locking fix that can be triggered when osds
  are going down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup in rbd_add()
  rbd: don't destroy ceph_opts in rbd_add()
  ceph: ceph_pagelist_append might sleep while atomic
  ceph: add cpu_to_le32() calls when encoding a reconnect capability
  libceph: must hold mutex for reset_changed_osds()
2013-06-12 08:28:19 -07:00
Lukas Czerner
569d39fc3e ceph: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in ceph_invalidatepage().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
2013-05-21 23:58:48 -04:00
Lukas Czerner
d47992f86b mm: change invalidatepage prototype to accept length
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.

Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
2013-05-21 23:17:23 -04:00
Jim Schutt
39be95e9c8 ceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc()
while holding a lock, but it's spoiled because ceph_pagelist_addpage()
always calls kmap(), which might sleep.  Here's the result:

[13439.295457] ceph: mds0 reconnect start
[13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
[13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
    . . .
[13439.376225] Call Trace:
[13439.378757]  [<ffffffff81076f4c>] __might_sleep+0xfc/0x110
[13439.384353]  [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph]
[13439.391491]  [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph]
[13439.398035]  [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50
[13439.403775]  [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20
[13439.409277]  [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph]
[13439.415622]  [<ffffffff81196748>] ? igrab+0x28/0x70
[13439.420610]  [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph]
[13439.427584]  [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph]
[13439.434499]  [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
[13439.441646]  [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph]
[13439.448363]  [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
[13439.455250]  [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph]
[13439.461534]  [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
[13439.468432]  [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10
[13439.473934]  [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph]
[13439.480464]  [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph]
[13439.486492]  [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph]
[13439.493190]  [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph]
    . . .
[13439.587132] ceph: mds0 reconnect success
[13490.720032] ceph: mds0 caps stale
[13501.235257] ceph: mds0 recovery completed
[13501.300419] ceph: mds0 caps renewed

Fix it up by encoding locks into a buffer first, and when the number
of encoded locks is stable, copy that into a ceph_pagelist.

[elder@inktank.com: abbreviated the stack info a bit.]

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-17 12:45:48 -05:00
Jim Schutt
c420276a53 ceph: add cpu_to_le32() calls when encoding a reconnect capability
In his review, Alex Elder mentioned that he hadn't checked that
num_fcntl_locks and num_flock_locks were properly decoded on the
server side, from a le32 over-the-wire type to a cpu type.
I checked, and AFAICS it is done; those interested can consult
    Locker::_do_cap_update()
in src/mds/Locker.cc and src/include/encoding.h in the Ceph server
code (git://github.com/ceph/ceph).

I also checked the server side for flock_len decoding, and I believe
that also happens correctly, by virtue of having been declared
__le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-17 12:45:43 -05:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Alex Elder
812164f8c3 ceph: use ceph_create_snap_context()
Now that we have a library routine to create snap contexts, use it.

This is part of:
    http://tracker.ceph.com/issues/4857

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:09 -07:00
Alex Elder
406e2c9f92 libceph: kill off osd data write_request parameters
In the incremental move toward supporting distinct data items in an
osd request some of the functions had "write_request" parameters to
indicate, basically, whether the data belonged to in_data or the
out_data.  Now that we maintain the data fields in the op structure
there is no need to indicate the direction, so get rid of the
"write_request" parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:58 -07:00
Randy Dunlap
ac7f29bf2e ceph: fix printk format warnings in file.c
Fix printk format warnings by using %zd for 'ssize_t' variables:

fs/ceph/file.c:751:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]
fs/ceph/file.c:762:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc:	ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 21:18:57 -07:00
Yan, Zheng
1ac0fc8adf ceph: fix race between writepages and truncate
ceph_writepages_start() reads inode->i_size in two places. It can get
different values between successive read, because truncate can change
inode->i_size at any time. The race can lead to mismatch between data
length of osd request and pages marked as writeback. When osd request
finishes, it clear writeback page according to its data length. So
some pages can be left in writeback state forever. The fix is only
read inode->i_size once, save its value to a local variable and use
the local variable when i_size is needed.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:18:55 -07:00
Yan, Zheng
03d254edeb ceph: apply write checks in ceph_aio_write
copy write checks in __generic_file_aio_write to ceph_aio_write.
To make these checks cover sync write path.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:18:54 -07:00
Yan, Zheng
37505d5768 ceph: take i_mutex before getting Fw cap
There is deadlock as illustrated bellow. The fix is taking i_mutex
before getting Fw cap reference.

      write                    truncate                 MDS
---------------------     --------------------      --------------
get Fw cap
                          lock i_mutex
lock i_mutex (blocked)
                          request setattr.size  ->
                                                <-   revoke Fw cap

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:18:53 -07:00
Alex Elder
26be88087a libceph: change how "safe" callback is used
An osd request currently has two callbacks.  They inform the
initiator of the request when we've received confirmation for the
target osd that a request was received, and when the osd indicates
all changes described by the request are durable.

The only time the second callback is used is in the ceph file system
for a synchronous write.  There's a race that makes some handling of
this case unsafe.  This patch addresses this problem.  The error
handling for this callback is also kind of gross, and this patch
changes that as well.

In ceph_sync_write(), if a safe callback is requested we want to add
the request on the ceph inode's unsafe items list.  Because items on
this list must have their tid set (by ceph_osd_start_request()), the
request added *after* the call to that function returns.  The
problem with this is that there's a race between starting the
request and adding it to the unsafe items list; the request may
already be complete before ceph_sync_write() even begins to put it
on the list.

To address this, we change the way the "safe" callback is used.
Rather than just calling it when the request is "safe", we use it to
notify the initiator the bounds (start and end) of the period during
which the request is *unsafe*.  So the initiator gets notified just
before the request gets sent to the osd (when it is "unsafe"), and
again when it's known the results are durable (it's no longer
unsafe).  The first call will get made in __send_request(), just
before the request message gets sent to the messenger for the first
time.  That function is only called by __send_queued(), which is
always called with the osd client's request mutex held.

We then have this callback function insert the request on the ceph
inode's unsafe list when we're told the request is unsafe.  This
will avoid the race because this call will be made under protection
of the osd client's request mutex.  It also nicely groups the setup
and cleanup of the state associated with managing unsafe requests.

The name of the "safe" callback field is changed to "unsafe" to
better reflect its new purpose.  It has a Boolean "unsafe" parameter
to indicate whether the request is becoming unsafe or is now safe.
Because the "msg" parameter wasn't used, we drop that.

This resolves the original problem reportedin:
    http://tracker.ceph.com/issues/4706

Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:18:52 -07:00
Alex Elder
7d7d51ce14 ceph: let osd client clean up for interrupted request
In ceph_sync_write(), if a safe callback is supplied with a request,
and an error is returned by ceph_osdc_wait_request(), a block of
code is executed to remove the request from the unsafe writes list
and drop references to capabilities acquired just prior to a call to
ceph_osdc_wait_request().

The only function used for this callback is sync_write_commit(),
and it does *exactly* what that block of error handling code does.

Now in ceph_osdc_wait_request(), if an error occurs (due to an
interupt during a wait_for_completion_interruptible() call),
complete_request() gets called, and that calls the request's
safe_callback method if it's defined.

So this means that this cleanup activity gets called twice in this
case, which is erroneous (and in fact leads to a crash).

Fix this by just letting the osd client handle the cleanup in
the event of an interrupt.

This resolves one problem mentioned in:
    http://tracker.ceph.com/issues/4706

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-01 21:18:51 -07:00
Yan, Zheng
0b93267252 ceph: fix symlink inode operations
add getattr/setattr and xattrs related methods.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:18:50 -07:00