linux-next

mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-28 07:04:00 +08:00

Author	SHA1	Message	Date
Alex Elder	2480882611	ceph: kill ceph_osdc_writepages() "flags" parameter There is only one caller of ceph_osdc_writepages(), and it always passes 0 as its "flags" argument. Get rid of that argument and replace its use in ceph_osdc_writepages() with 0. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-18 12:19:35 -06:00
Alex Elder	fbf8685fb1	ceph: kill ceph_osdc_writepages() "dosync" parameter There is only one caller of ceph_osdc_writepages(), and it always passes 0 as its "dosync" argument. Get rid of that argument and replace its use in ceph_osdc_writepages() with 0. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-18 12:19:28 -06:00
Alex Elder	87f979d390	ceph: kill ceph_osdc_writepages() "nofail" parameter There is only one caller of ceph_osdc_writepages(), and it always passes the value true as its "nofail" argument. Get rid of that argument and replace its use in ceph_osdc_writepages() with the constant value true. This and a number of cleanup patches that follow resolve: http://tracker.ceph.com/issues/4126 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-18 12:19:22 -06:00
Sage Weil	695b711933	ceph: implement hidden per-field ceph..layout. vxattrs Allow individual fields of the layout to be fetched via getxattr. The ceph.dir.layout.* vxattr with "disappear" if the exists_cb indicates there no dir layout set. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:26:17 -08:00
Sage Weil	1f08f2b056	ceph: add ceph.dir.layout vxattr This virtual xattr will only appear when there is a dir layout policy set on the directory. It can be set via setxattr and removed via removexattr (implemented by the MDS). Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:26:13 -08:00
Sage Weil	32ab0bd78d	ceph: change ceph.file.layout.* implementation, content Implement a new method to generate the ceph.file.layout vxattr using the new framework. Use 'stripe_unit' instead of 'chunk_size'. Include pool name, either as a string or as an integer. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:26:10 -08:00
Sage Weil	b65917dd27	ceph: fix listxattr handling for vxattrs Only include vxattrs in the result if they are not hidden and exist (as determined by the exists_cb callback). Note that the buffer size we return when 0 is passed in always includes vxattrs that might exist, forming an upper bound. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:26:06 -08:00
Sage Weil	0bee82fb4b	ceph: fix getxattr vxattr handling Change the vxattr handling for getxattr so that vxattrs are checked prior to any xattr content, and never after. Enforce vxattr existence via the exists_cb callback. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:26:03 -08:00
Sage Weil	f36e447296	ceph: add exists_cb to vxattr struct Allow for a callback to dynamically determine if a vxattr exists for the given inode. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:25:58 -08:00
Sage Weil	d421acb1ad	ceph: pass ceph.* removexattrs through to MDS If we do not explicitly recognized a vxattr (e.g., as readonly), pass the request through to the MDS and deal with it there. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:25:55 -08:00
Sage Weil	3adf654ddb	ceph: pass unhandled ceph.* setxattrs through to MDS If we do not specifically understand a setxattr on a ceph.* virtual xattr, send it through to the MDS. This allows us to implement new functionality via the MDS without direct support on the client side. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:25:51 -08:00
Sage Weil	8860147a01	ceph: support hidden vxattrs Add ability to flag virtual xattrs as hidden, such that you can getxattr them but they do not appear in listxattr. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:25:47 -08:00
Sage Weil	39b648d9ec	ceph: remove 'ceph.layout' virtual xattr This has been deprecated since v3.3, `114fc474`. Kill it. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>	2013-02-13 18:25:43 -08:00
Eric W. Biederman	bd2bae6a66	ceph: Convert kuids and kgids before printing them. Before printing kuid and kgids values convert them into the initial user namespace. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:27 -08:00
Eric W. Biederman	ff3d004662	ceph: Convert struct ceph_mds_request to use kuid_t and kgid_t Hold the uid and gid for a pending ceph mds request using the types kuid_t and kgid_t. When a request message is finally created convert the kuid_t and kgid_t values into uids and gids in the initial user namespace. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:26 -08:00
Eric W. Biederman	ab871b903e	ceph: Translate inode uid and gid attributes to/from kuids and kgids. - In fill_inode() transate uids and gids in the initial user namespace into kuids and kgids stored in inode->i_uid and inode->i_gid. - In ceph_setattr() if they have changed convert inode->i_uid and inode->i_gid into initial user namespace uids and gids for transmission. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:25 -08:00
Eric W. Biederman	05cb11c17e	ceph: Translate between uid and gids in cap messages and kuids and kgids - Make the uid and gid arguments of send_cap_msg() used to compose ceph_mds_caps messages of type kuid_t and kgid_t. - Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg() through variables of type kuid_t and kgid_t. - Modify struct ceph_cap_snap to store uids and gids in types kuid_t and kgid_t. This allows capturing inode->i_uid and inode->i_gid in ceph_queue_cap_snap() without loss and pssing them to __ceph_flush_snaps() where they are removed from struct ceph_cap_snap and passed to send_cap_msg(). - In handle_cap_grant translate uid and gids in the initial user namespace stored in struct ceph_mds_cap into kuids and kgids before setting inode->i_uid and inode->i_gid. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:24 -08:00
Alex Elder	969e5aa3b0	Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing	2013-01-30 07:54:34 -06:00
Alex Elder	e8afad656c	libceph: pass length to ceph_calc_file_object_mapping() ceph_calc_file_object_mapping() takes (among other things) a "file" offset and length, and based on the layout, determines the object number ("bno") backing the affected portion of the file's data and the offset into that object where the desired range begins. It also computes the size that should be used for the request--either the amount requested or something less if that would exceed the end of the object. This patch changes the input length parameter in this function so it is used only for input. That is, the argument will be passed by value rather than by address, so the value provided won't get updated by the function. The value would only get updated if the length would surpass the current object, and in that case the value it got updated to would be exactly that returned in *oxlen. Only one of the two callers is affected by this change. Update ceph_calc_raw_layout() so it records any updated value. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:04 -06:00
Yan, Zheng	390306c38d	ceph: check mds_wanted for imported cap The MDS may have incorrect wanted caps after importing caps. So the client should check the value mds has and send cap update if necessary. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:38 -06:00
Yan, Zheng	66f58691c5	ceph: allocate cap_release message when receiving cap import When client wants to release an imported cap, it's possible there is no reserved cap_release message in corresponding mds session. so __queue_cap_release causes kernel panic. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:38 -06:00
Yan, Zheng	395c312b9c	ceph: allow revoking duplicated caps issued by non-auth MDS Allow revoking duplicated caps issued by non-auth MDS if these caps are also issued by auth MDS. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:38 -06:00
Yan, Zheng	8a92a119b2	ceph: move dirty inode to migrating list when clearing auth caps Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:37 -06:00
Sam Lang	6e8575faa8	ceph: Check for created flag in response from mds The mds now sends back a created inode if the create request performed the create. If the file already existed, no inode is returned in the reply. This allows ceph to set the created flag in atomic_open so that permissions are properly checked in the case that the file wasn't created by the create call to the mds. To ensure compability with previous kernels, a feature for sending back the inode in the create reply was added, so that the mds will only send back the inode if the client indicates it supports the feature. Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:36 -06:00
Sam Lang	79aec9844d	ceph: Check for err on mds request in atomic_open The error returned by ceph_mdsc_do_request includes errors sending the request, errors on timeout, or any errors coming from the mds. If ceph_mdsc_do_request returns an error, the reply struct will most likely be bogus. We need to bail out and propogate the error instead of overwriting it. Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-01-17 12:42:36 -06:00
Kees Cook	1b6a78a522	fs/ceph: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Sage Weil <sage@inktank.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Sage Weil <sage@inktank.com>	2013-01-11 11:39:04 -08:00
Linus Torvalds	40889e8d9f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph update from Sage Weil: "There are a few different groups of commits here. The largest is Alex's ongoing work to enable the coming RBD features (cloning, striping). There is some cleanup in libceph that goes along with it. Cyril and David have fixed some problems with NFS reexport (leaking dentries and page locks), and there is a batch of patches from Yan fixing problems with the fs client when running against a clustered MDS. There are a few bug fixes mixed in for good measure, many of which will be going to the stable trees once they're upstream. My apologies for the late pull. There is still a gremlin in the rbd map/unmap code and I was hoping to include the fix for that as well, but we haven't been able to confirm the fix is correct yet; I'll send that in a separate pull once it's nailed down." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits) rbd: get rid of rbd_{get,put}_dev() libceph: register request before unregister linger libceph: don't use rb_init_node() in ceph_osdc_alloc_request() libceph: init event->node in ceph_osdc_create_event() libceph: init osd->o_node in create_osd() libceph: report connection fault with warning libceph: socket can close in any connection state rbd: don't use ENOTSUPP rbd: remove linger unconditionally rbd: get rid of RBD_MAX_SEG_NAME_LEN libceph: avoid using freed osd in __kick_osd_requests() ceph: don't reference req after put rbd: do not allow remove of mounted-on image libceph: Unlock unprocessed pages in start_read() error path ceph: call handle_cap_grant() for cap import message ceph: Fix __ceph_do_pending_vmtruncate ceph: Don't add dirty inode to dirty list if caps is in migration ceph: Fix infinite loop in __wake_requests ceph: Don't update i_max_size when handling non-auth cap bdi_register: add __printf verification, fix arg mismatch ...	2012-12-20 14:00:13 -08:00
Cyril Roelandt	f6af75dac3	ceph: fix dentry reference leak in ceph_encode_fh() dput() was not called in the error path. Signed-off-by: Cyril Roelandt <tipecaml@gmail.com> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-18 15:02:11 -08:00
Andrew Morton	965c8e59cf	lseek: the "whence" argument is called "whence" But the kernel decided to call it "origin" instead. Fix most of the sites. Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:12 -08:00
David Zafman	8884d53dd6	libceph: Unlock unprocessed pages in start_read() error path Function start_read() can get an error before processing all pages. It must not only release the remaining pages, but unlock them too. This fixes http://tracker.newdream.net/issues/3370 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-12-13 08:13:09 -06:00
Yan, Zheng	0e5e1774a9	ceph: call handle_cap_grant() for cap import message If client sends cap message that requests new max size during exporting caps, the exporting MDS will drop the message quietly. So the client may wait for the reply that updates the max size forever. call handle_cap_grant() for cap import message can avoid this issue. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-12-13 08:13:08 -06:00
Yan, Zheng	a85f50b6ef	ceph: Fix __ceph_do_pending_vmtruncate we should set i_truncate_pending to 0 after page cache is truncated to i_truncate_size Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-12-13 08:13:08 -06:00
Yan, Zheng	0685235ffd	ceph: Don't add dirty inode to dirty list if caps is in migration Add dirty inode to cap_dirty_migrating list instead, this can avoid ceph_flush_dirty_caps() entering infinite loop. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-12-13 08:13:08 -06:00
Yan, Zheng	ed75ec2cd1	ceph: Fix infinite loop in __wake_requests __wake_requests() will enter infinite loop if we use it to wake requests in the session->s_waiting list. __wake_requests() deletes requests from the list and __do_request() adds requests back to the list. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-12-13 08:13:07 -06:00
Yan, Zheng	5e62ad3015	ceph: Don't update i_max_size when handling non-auth cap The cap from non-auth mds doesn't have a meaningful max_size value. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-12-13 08:13:07 -06:00
Joe Perches	d2cc4dde92	bdi_register: add __printf verification, fix arg mismatch __printf is useful to verify format and arguments. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-12-13 08:13:07 -06:00
Sage Weil	83aff95eb9	libceph: remove 'osdtimeout' option This would reset a connection with any OSD that had an outstanding request that was taking more than N seconds. The idea was that if the OSD was buggy, the client could compensate by resending the request. In reality, this only served to hide server bugs, and we haven't actually seen such a bug in quite a while. Moreover, the userspace client code never did this. More importantly, often the request is taking a long time because the OSD is trying to recover, or overloaded, and killing the connection and retrying would only make the situation worse by giving the OSD more work to do. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-12-13 08:13:06 -06:00
Cyril Roelandt	cfc84c9f73	ceph: fix dentry reference leak in ceph_encode_fh(). dput() was not called in the error path. Signed-off-by: Cyril Roelandt <tipecaml@gmail.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-12-13 08:13:06 -06:00
Sage Weil	22cddde104	ceph: Fix i_size update race ceph_aio_write() has an optimization that marks cap EPH_CAP_FILE_WR dirty before data is copied to page cache and inode size is updated. If ceph_check_caps() flushes the dirty cap before the inode size is updated, MDS can miss the new inode size. The fix is move ceph_{get,put}_cap_refs() into ceph_write_{begin,end}() and call __ceph_mark_dirty_caps() after inode size is updated. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-11-05 11:07:23 -08:00
Yan, Zheng	4d1d0534f5	ceph: Hold caps_list_lock when adjusting caps_{use, total}_count Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-11-04 03:08:24 -08:00
Linus Torvalds	35fd3dc58d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes form Sage Weil: "There are two fixes in the messenger code, one that can trigger a NULL dereference, and one that error in refcounting (extra put). There is also a trivial fix that in the fs client code that is triggered by NFS reexport." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: fix dentry reference leak in encode_fh() libceph: avoid NULL kref_put when osd reset races with alloc_msg rbd: reset BACKOFF if unable to re-queue	2012-10-29 08:49:25 -07:00
David Zafman	52eb5a900a	ceph: fix dentry reference leak in encode_fh() Call to d_find_alias() needs a corresponding dput() This fixes http://tracker.newdream.net/issues/3271 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-10-29 08:17:10 -07:00
David Zafman	b000056a5a	ceph: Fix NULL ptr crash in strlen() set_request_path_attr() checks for NULL ptr before calling strlen() This fixes http://tracker.newdream.net/issues/3404 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-10-26 16:35:07 -05:00
David Zafman	0f9831a893	ceph: fix dentry reference leak in encode_fh() Call to d_find_alias() needs a corresponding dput() This fixes http://tracker.newdream.net/issues/3271 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-10-26 16:34:53 -05:00
Hugh Dickins	35c2a7f490	tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(), u64 inum = fid->raw[2]; which is unhelpfully reported as at the end of shmem_alloc_inode(): BUG: unable to handle kernel paging request at ffff880061cd3000 IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Call Trace: [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0 [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0 [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10 [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6 Right, tmpfs is being stupid to access fid->raw[2] before validating that fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may fall at the end of a page, and the next page not be present. But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and could oops in the same way: add the missing fh_len checks to those. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Sage Weil <sage@inktank.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-09 23:33:55 -04:00
Konstantin Khlebnikov	0b173bc4da	mm: kill vma flag VM_CAN_NONLINEAR Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:17 +09:00
Linus Torvalds	7035cdf36d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "The bulk of this pull is a series from Alex that refactors and cleans up the RBD code to lay the groundwork for supporting the new image format and evolving feature set. There are also some cleanups in libceph, and for ceph there's fixed validation of file striping layouts and a bugfix in the code handling a shrinking MDS cluster." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits) ceph: avoid 32-bit page index overflow ceph: return EIO on invalid layout on GET_DATALOC ioctl rbd: BUG on invalid layout ceph: propagate layout error on osd request creation libceph: check for invalid mapping ceph: convert to use le32_add_cpu() ceph: Fix oops when handling mdsmap that decreases max_mds rbd: update remaining header fields for v2 rbd: get snapshot name for a v2 image rbd: get the snapshot context for a v2 image rbd: get image features for a v2 image rbd: get the object prefix for a v2 rbd image rbd: add code to get the size of a v2 rbd image rbd: lay out header probe infrastructure rbd: encapsulate code that gets snapshot info rbd: add an rbd features field rbd: don't use index in __rbd_add_snap_dev() rbd: kill create_snap sysfs entry rbd: define rbd_dev_image_id() rbd: define some new format constants ...	2012-10-08 06:38:18 +09:00
Alex Elder	6285bc2312	ceph: avoid 32-bit page index overflow A pgoff_t is defined (by default) to have type (unsigned long). On architectures such as i686 that's a 32-bit type. The ceph address space code was attempting to produce 64 bit offsets by shifting a page's index by PAGE_CACHE_SHIFT, but the result was not what was desired because the shift occurred before the result got promoted to 64 bits. Fix this by converting all uses of page->index used in this way to use the page_offset() macro, which ensures the 64-bit result has the intended value. This fixes http://tracker.newdream.net/issues/3112 Reported-by: Mohamed Pakkeer <pakkeer.mohideen@realimage.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-10-03 10:51:18 -05:00
Sage Weil	457712a0bc	ceph: return EIO on invalid layout on GET_DATALOC ioctl If the user calls GET_DATALOC on a file with an invalid (e.g., zeroed) layout, return EIO to userland. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-10-03 10:51:17 -05:00
Kirill A. Shutemov	8c0a853770	fs: push rcu_barrier() from deactivate_locked_super() to filesystems There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Sage Weil	6816282dab	ceph: propagate layout error on osd request creation If we are creating an osd request and get an invalid layout, return an EINVAL to the caller. We switch up the return to have an error code instead of NULL implying -ENOMEM. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-10-01 17:20:00 -05:00
Wei Yongjun	b905a7f8b7	ceph: convert to use le32_add_cpu() Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu(). dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 14:30:54 -05:00
Yan, Zheng	3e8f43a089	ceph: Fix oops when handling mdsmap that decreases max_mds When i >= newmap->m_max_mds, ceph_mdsmap_get_addr(newmap, i) return NULL. Passing NULL to memcmp() triggers oops. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 14:30:54 -05:00
Alex Elder	c98f533c94	ceph: let path portion of mount "device" be optional A recent change to /sbin/mountall causes any trailing '/' character in the "device" (or fs_spec) field in /etc/fstab to be stripped. As a result, an entry for a ceph mount that intends to mount the root of the name space ends up with now path portion, and the ceph mount option processing code rejects this. That is, an entry in /etc/fstab like: cephserver:port:/ /mnt ceph defaults 0 0 provides to the ceph code just "cephserver:port:" as the "device," and that gets rejected. Although this is a bug in /sbin/mountall, we can have the ceph mount code support an empty/nonexistent path, interpreting it to mean the root of the name space. RFC 5952 offers recommendations for how to express IPv6 addresses, and recommends the usage found in RFC 3986 (which specifies the format for URI's) for representing both IPv4 and IPv6 addresses that include port numbers. (See in particular the definition of "authority" found in the Appendix of RFC 3986.) According to those standards, no host specification will ever contain a '/' character. As a result, it is sufficient to scan a provided "device" from an /etc/fstab entry for the first '/' character, and if it's found, treat that as the beginning of the path. If no '/' character is present, we can treat the entire string as the monitor host specification(s), and assume the path to be the root of the name space. We'll still require a ':' to separate the host portion from the (possibly empty) path portion. This means that we can more formally define how ceph will interpret the "device" it's provided when processing a mount request: "device" will look like: <server_spec>[,<server_spec>...]:[<path>] where <server_spec> is <ip>[:<port>] <path> is optional, but if present must begin with '/' This addresses http://tracker.newdream.net/issues/2919 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>	2012-10-01 14:30:49 -05:00
Al Viro	2744c171db	ceph: don't abuse d_delete() on failure exits Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 22:20:20 -04:00
Sage Weil	45f2e081f5	ceph: avoid divide by zero in __validate_layout() If "l->stripe_unit" is zero the the mod on the next line will cause a divide by zero bug. This comes from the copy_from_user() in ceph_ioctl_set_layout_policy(). Passing 0 is valid, though (it means "do not change") so avoid the % check in that case. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-08-21 15:55:28 -07:00
Sage Weil	6c5e50fa61	ceph: tolerate (and warn on) extraneous dentry from mds If the MDS gives us a dentry and we weren't prepared to handle it, WARN_ON_ONCE instead of crashing. Reported-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-08-21 15:55:25 -07:00
Sage Weil	d1c338a509	libceph: delay debugfs initialization until we learn global_id The debugfs directory includes the cluster fsid and our unique global_id. We need to delay the initialization of the debug entry until we have learned both the fsid and our global_id from the monitor or else the second client can't create its debugfs entry and will fail (and multiple client instances aren't properly reflected in debugfs). Reported by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-08-20 10:03:15 -07:00
Sage Weil	5ef50c3bec	ceph: simplify+fix atomic_open The initial ->atomic_open op was carried over from the old intent code, which was incomplete and didn't really work. Replace it with a fresh method. In particular: * always attempt to do an atomic open+lookup, both for the create case and for lookups of existing files. * fix symlink handling by returning 1 to the VFS so that we can follow the link to its destination. This fixes a longstanding ceph bug (#2392). Signed-off-by: Sage Weil <sage@inktank.com>	2012-08-02 09:11:19 -07:00
Linus Torvalds	a0e881b7c1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is not dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...	2012-08-01 10:26:23 -07:00
Linus Torvalds	cc8362b1f6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "Lots of stuff this time around: - lots of cleanup and refactoring in the libceph messenger code, and many hard to hit races and bugs closed as a result. - lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in preparation for the layering functionality that will be coming in 3.7. - some misc rbd cleanups from Josh Durgin that are finally going upstream - support for CRUSH tunables (used by newer clusters to improve the data placement) - some cleanup in our use of d_parent that Al brought up a while back - a random collection of fixes across the tree There is another patch coming that fixes up our ->atomic_open() behavior, but I'm going to hammer on it a bit more before sending it." Fix up conflicts due to commits that were already committed earlier in drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits) rbd: create rbd_refresh_helper() rbd: return obj version in __rbd_refresh_header() rbd: fixes in rbd_header_from_disk() rbd: always pass ops array to rbd_req_sync_op() rbd: pass null version pointer in add_snap() rbd: make rbd_create_rw_ops() return a pointer rbd: have __rbd_add_snap_dev() return a pointer libceph: recheck con state after allocating incoming message libceph: change ceph_con_in_msg_alloc convention to be less weird libceph: avoid dropping con mutex before fault libceph: verify state after retaking con lock after dispatch libceph: revoke mon_client messages on session restart libceph: fix handling of immediate socket connect failure ceph: update MAINTAINERS file libceph: be less chatty about stray replies libceph: clear all flags on con_close libceph: clean up con flags libceph: replace connection state bits with states libceph: drop unnecessary CLOSED check in socket state change callback libceph: close socket directly from ceph_con_close() ...	2012-07-31 14:35:28 -07:00
Alex Elder	aa711ee340	ceph: define snap counts as u32 everywhere There are two structures in which a count of snapshots are maintained: struct ceph_snap_context { ... u32 num_snaps; ... } and struct ceph_snap_realm { ... u32 num_prior_parent_snaps; /* had prior to parent_since */ ... u32 num_snaps; ... } These fields never take on negative values (e.g., to hold special meaning), and so are really inherently unsigned. Furthermore they take their value from over-the-wire or on-disk formatted 32-bit values. So change their definition to have type u32, and change some spots elsewhere in the code to account for this change. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2012-07-30 18:15:47 -07:00
Alan Cox	21ec6ffa46	ceph: fix potential double free We re-run the loop but we don't re-set the attrs pointer back to NULL. Signed-off-by: Alan Cox <alan@linux.intel.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:15:35 -07:00
Sage Weil	a53aab645c	ceph: close old con before reopening on mds reconnect When we detect a mds session reset, close the old ceph_connection before reopening it. This ensures we clean up the old socket properly and keep the ceph_connection state correct. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 18:15:32 -07:00
Sage Weil	1fe60e51a3	libceph: move feature bits to separate header This is simply cleanup that will keep things more closely synced with the userland code. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 16:23:22 -07:00
Jan Kara	3ca9c3bd8a	ceph: Push file_update_time() into ceph_page_mkwrite() CC: Sage Weil <sage@newdream.net> CC: ceph-devel@vger.kernel.org Acked-by: Sage Weil <sage@newdream.net> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-31 01:02:45 +04:00
Sage Weil	8842b3be96	ceph: clean up useless d_parent checks d_parent is never NULL, and IS_ROOT() is the proper way to check for a (non-self-referential) parent. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 09:29:54 -07:00
David Howells	9249e17fe0	VFS: Pass mount flags to sget() Pass mount flags to sget() so that it can use them in initialising a new superblock before the set function is called. They could also be passed to the compare function. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:38:34 +04:00
Al Viro	ebfc3b49a7	don't pass nameidata to ->create() boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:47 +04:00
Al Viro	00cd8dd3bf	stop passing nameidata to ->lookup() Just the flags; only NFS cares even about that, but there are legitimate uses for such argument. And getting rid of that completely would require splitting ->lookup() into a couple of methods (at least), so let's leave that alone for now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:32 +04:00
Al Viro	0b728e1911	stop passing nameidata * to ->d_revalidate() Just the lookup flags. Die, bastard, die... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:14 +04:00
Al Viro	e45198a6ac	make finish_no_open() return int namely, 1 ;-) That's what we want to return from ->atomic_open() instances after finish_no_open(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:45 +04:00
Al Viro	30d9049474	kill struct opendata Just pass struct file . Methods are happier that way... There's no need to return struct file from finish_open() now, so let it return int. Next: saner prototypes for parts in namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:39 +04:00
Al Viro	d95852777b	make ->atomic_open() return int Change of calling conventions: old new NULL 1 file 0 ERR_PTR(-ve) -ve Caller knows that struct file *; no need to return it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:35 +04:00
Al Viro	47237687d7	->atomic_open() prototype change - pass int * instead of bool * ... and let finish_open() report having opened the file via that sucker. Next step: don't modify od->filp at all. [AV: FILE_CREATE was already used by cifs; Miklos' fix folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:31 +04:00
Miklos Szeredi	2d83bde9a1	ceph: implement i_op->atomic_open() Add an ->atomic_open implementation which replaces the atomic lookup+open+create operation implemented via ->lookup and ->create operations. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:16 +04:00
Miklos Szeredi	3819219b59	ceph: remove unused arg from ceph_lookup_open() What was the purpose of this? Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:16 +04:00
Sage Weil	b7a9e5dd40	libceph: set peer name on con_open, not init The peer name may change on each open attempt, even when the connection is reused. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-05 21:14:35 -07:00
Yan, Zheng	61600ef848	ceph: check PG_Private flag before accessing page->private I got lots of NULL pointer dereference Oops when compiling kernel on ceph. The bug is because the kernel page migration routine replaces some pages in the page cache with new pages, these new pages' private can be non-zero. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com> (cherry picked from commit `28c0254ede`)	2012-06-20 07:43:48 -05:00
Sage Weil	9a64e8e0ac	Linux 3.5-rc1 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQEcBAABAgAGBQJPyr4LAAoJEHm+PkMAQRiGhvMH/1uXaDmJiiyAtMhC9kQbLclK 5RpUOV+ukRrPXBJhwWGEZvC9G/DiWAfZ/19Ee6qTGZbA46yxkgZklqO+bw7fuOLH dPf4MNXdhgOgbs0KkVAk6aXIYzIU836pcYg+LcapG8E8SZp3SWbJzrVbUPFwPM+m Sv11ZcpJfM2HH9wFRdKErUOiZHsMY+LZHcw0nx+BObytjgzBbzHNkpF57F714TUO QplYpIToO3XtGhIM1yRDxww+2zFlVNsCZ8IC57EDbLb8BMZWuyZoFgWZqLAnrU0u vy7CHLledMSvs855juJ9JxGo/EDnfwJpCnjmcp8BY+h4b5T/k5mGK6d9aeXYRf4= =CcWn -----END PGP SIGNATURE----- Merge tag 'v3.5-rc1' Linux 3.5-rc1 Conflicts: net/ceph/messenger.c	2012-06-15 12:32:04 -07:00
Alex Elder	1bfd89f4e6	libceph: fully initialize connection in con_init() Move the initialization of a ceph connection's private pointer, operations vector pointer, and peer name information into ceph_con_init(). Rearrange the arguments so the connection pointer is first. Hide the byte-swapping of the peer entity number inside ceph_con_init() Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-06 09:23:54 -05:00
Linus Torvalds	1193755ac6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs changes from Al Viro. "A lot of misc stuff. The obvious groups: * Miklos' atomic_open series; kills the damn abuse of ->d_revalidate() by NFS, which was the major stumbling block for all work in that area. * ripping security_file_mmap() and dealing with deadlocks in the area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in general. * ->encode_fh() switched to saner API; insane fake dentry in mm/cleancache.c gone. * assorted annotations in fs (endianness, __user) * parts of Artem's ->s_dirty work (jff2 and reiserfs parts) * ->update_time() work from Josef. * other bits and pieces all over the place. Normally it would've been in two or three pull requests, but signal.git stuff had eaten a lot of time during this cycle ;-/" Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the 'truncate_range' inode method was removed by the VM changes, the VFS update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due to sparse fix added twice, with other changes nearby). * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits) nfs: don't open in ->d_revalidate vfs: retry last component if opening stale dentry vfs: nameidata_to_filp(): don't throw away file on error vfs: nameidata_to_filp(): inline __dentry_open() vfs: do_dentry_open(): don't put filp vfs: split __dentry_open() vfs: do_last() common post lookup vfs: do_last(): add audit_inode before open vfs: do_last(): only return EISDIR for O_CREAT vfs: do_last(): check LOOKUP_DIRECTORY vfs: do_last(): make ENOENT exit RCU safe vfs: make follow_link check RCU safe vfs: do_last(): use inode variable vfs: do_last(): inline walk_component() vfs: do_last(): make exit RCU safe vfs: split do_lookup() Btrfs: move over to use ->update_time fs: introduce inode operation ->update_time reiserfs: get rid of resierfs_sync_super reiserfs: mark the superblock as dirty a bit later ...	2012-06-01 10:34:35 -07:00
Alex Elder	15d9882c33	libceph: embed ceph messenger structure in ceph_client A ceph client has a pointer to a ceph messenger structure in it. There is always exactly one ceph messenger for a ceph client, so there is no need to allocate it separate from the ceph client structure. Switch the ceph_client structure to embed its ceph_messenger structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-01 08:37:56 -05:00
Yan, Zheng	28c0254ede	ceph: check PG_Private flag before accessing page->private I got lots of NULL pointer dereference Oops when compiling kernel on ceph. The bug is because the kernel page migration routine replaces some pages in the page cache with new pages, these new pages' private can be non-zero. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-06-01 08:06:29 -05:00
Linus Torvalds	08615d7d85	Merge branch 'akpm' (Andrew's patch-bomb) Merge misc patches from Andrew Morton: - the "misc" tree - stuff from all over the map - checkpatch updates - fatfs - kmod changes - procfs - cpumask - UML - kexec - mqueue - rapidio - pidns - some checkpoint-restore feature work. Reluctantly. Most of it delayed a release. I'm still rather worried that we don't have a clear roadmap to completion for this work. * emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches) kconfig: update compression algorithm info c/r: prctl: add ability to set new mm_struct::exe_file c/r: prctl: extend PR_SET_MM to set up more mm_struct entries c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat syscalls, x86: add __NR_kcmp syscall fs, proc: introduce /proc/<pid>/task/<tid>/children entry sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() eventfd: change int to __u64 in eventfd_signal() fs/nls: add Apple NLS pidns: make killed children autoreap pidns: use task_active_pid_ns in do_notify_parent rapidio/tsi721: add DMA engine support rapidio: add DMA engine support for RIO data transfers ipc/mqueue: add rbtree node caching support tools/selftests: add mq_perf_tests ipc/mqueue: strengthen checks on mqueue creation ipc/mqueue: correct mq_attr_ok test ipc/mqueue: improve performance of send/recv selftests: add mq_open_tests ...	2012-05-31 18:10:18 -07:00
Xi Wang	a3860c1c5d	introduce SIZE_MAX ULONG_MAX is often used to check for integer overflow when calculating allocation size. While ULONG_MAX happens to work on most systems, there is no guarantee that `size_t' must be the same size as `long'. This patch introduces SIZE_MAX, the maximum value of `size_t', to improve portability and readability for allocation size validation. Signed-off-by: Xi Wang <xi.wang@gmail.com> Acked-by: Alex Elder <elder@dreamhost.com> Cc: David Airlie <airlied@linux.ie> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-05-31 17:49:26 -07:00
Linus Torvalds	af56e0aa35	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "There are some updates and cleanups to the CRUSH placement code, a bug fix with incremental maps, several cleanups and fixes from Josh Durgin in the RBD block device code, a series of cleanups and bug fixes from Alex Elder in the messenger code, and some miscellaneous bounds checking and gfp cleanups/fixes." Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the networking people preferring "unsigned int" over just "unsigned". * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits) libceph: fix pg_temp updates libceph: avoid unregistering osd request when not registered ceph: add auth buf in prepare_write_connect() ceph: rename prepare_connect_authorizer() ceph: return pointer from prepare_connect_authorizer() ceph: use info returned by get_authorizer ceph: have get_authorizer methods return pointers ceph: ensure auth ops are defined before use ceph: messenger: reduce args to create_authorizer ceph: define ceph_auth_handshake type ceph: messenger: check return from get_authorizer ceph: messenger: rework prepare_connect_authorizer() ceph: messenger: check prepare_write_connect() result ceph: don't set WRITE_PENDING too early ceph: drop msgr argument from prepare_write_connect() ceph: messenger: send banner in process_connect() ceph: messenger: reset connection kvec caller libceph: don't reset kvec in prepare_write_banner() ceph: ignore preferred_osd field ceph: fully initialize new layout ...	2012-05-30 11:17:19 -07:00
Sage Weil	c862868bb4	ceph: move encode_fh to new API Use parent_inode has a flag for whether nfsd wants a connectable fh, but generate one opportunistically so that we can take advantage of the additional info in there. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-05-29 23:28:33 -04:00
Al Viro	b0b0382bb4	->encode_fh() API change pass inode + parent's inode or NULL instead of dentry + bool saying whether we want the parent or not. NOTE: that needs ceph fix folded in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-05-29 23:28:33 -04:00
Alex Elder	8f43fb5389	ceph: use info returned by get_authorizer Rather than passing a bunch of arguments to be filled in with the content of the ceph_auth_handshake buffer now returned by the get_authorizer method, just use the returned information in the caller, and drop the unnecessary arguments. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:13 -05:00
Alex Elder	a3530df33e	ceph: have get_authorizer methods return pointers Have the get_authorizer auth_client method return a ceph_auth pointer rather than an integer, pointer-encoding any returned error value. This is to pave the way for making use of the returned value in an upcoming patch. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:13 -05:00
Alex Elder	a255651d4c	ceph: ensure auth ops are defined before use In the create_authorizer method for both the mds and osd clients, the auth_client->ops pointer is blindly dereferenced. There is no obvious guarantee that this pointer has been assigned. And furthermore, even if the ops pointer is non-null there is definitely no guarantee that the create_authorizer or destroy_authorizer methods are defined. Add checks in both routines to make sure they are defined (non-null) before use. Add similar checks in a few other spots in these files while we're at it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:13 -05:00
Alex Elder	74f1869f76	ceph: messenger: reduce args to create_authorizer Make use of the new ceph_auth_handshake structure in order to reduce the number of arguments passed to the create_authorizor method in ceph_auth_client_ops. Use a local variable of that type as a shorthand in the get_authorizer method definitions. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	6c4a19158b	ceph: define ceph_auth_handshake type The definitions for the ceph_mds_session and ceph_osd both contain five fields related only to "authorizers." Encapsulate those fields into their own struct type, allowing for better isolation in some upcoming patches. Fix the #includes in "linux/ceph/osd_client.h" to lay out their more complete canonical path. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Sage Weil	c047be0934	ceph: ignore preferred_osd field Old users may not expect EINVAL, and there is no clear user-visibile behavior change now that we ignore it. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-05-16 14:28:28 -05:00
Sage Weil	702aeb1f88	ceph: fully initialize new layout When we are setting a new layout, fully initialize the structure: - zero it out - always set preferred_osd to -1 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-05-16 14:28:27 -05:00
Sage Weil	e49bf4c51c	ceph: refactor SETLAYOUT and SETDIRLAYOUT ioctl checks into common helper Both of these methods perform similar checks; move that code to a helper so that we can ensure the checks are consistent. Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-05-07 15:34:35 -07:00
Sage Weil	3469ac1aa3	ceph: drop support for preferred_osd pgs This was an ill-conceived feature that has been removed from Ceph. Do this gracefully: - reject attempts to specify a preferred_osd via the ioctl - stop exposing this information via virtual xattrs - always fill in -1 for requests, in case we talk to an older server - don't calculate preferred_osd placements/pgids Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-05-07 15:33:36 -07:00
Linus Torvalds	56b59b429b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates for 3.4-rc1 from Sage Weil: "Alex has been busy. There are a range of rbd and libceph cleanups, especially surrounding device setup and teardown, and a few critical fixes in that code. There are more cleanups in the messenger code, virtual xattrs, a fix for CRC calculation/checks, and lots of other miscellaneous stuff. There's a patch from Amon Ott to make inos behave a bit better on 32-bit boxes, some decode check fixes from Xi Wang, and network throttling fix from Jim Schutt, and a couple RBD fixes from Josh Durgin. No new functionality, just a lot of cleanup and bug fixing." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits) rbd: move snap_rwsem to the device, rename to header_rwsem ceph: fix three bugs, two in ceph_vxattrcb_file_layout() libceph: isolate kmap() call in write_partial_msg_pages() libceph: rename "page_shift" variable to something sensible libceph: get rid of zero_page_address libceph: only call kernel_sendpage() via helper libceph: use kernel_sendpage() for sending zeroes libceph: fix inverted crc option logic libceph: some simple changes libceph: small refactor in write_partial_kvec() libceph: do crc calculations outside loop libceph: separate CRC calculation from byte swapping libceph: use "do" in CRC-related Boolean variables ceph: ensure Boolean options support both senses libceph: a few small changes libceph: make ceph_tcp_connect() return int libceph: encapsulate some messenger cleanup code libceph: make ceph_msgr_wq private libceph: encapsulate connection kvec operations libceph: move prepare_write_banner() ...	2012-03-28 10:01:29 -07:00
Alex Elder	3489b42a72	ceph: fix three bugs, two in ceph_vxattrcb_file_layout() In ceph_vxattrcb_file_layout(), there is a check to determine whether a preferred PG should be formatted into the output buffer. That check assumes that a preferred PG number of 0 indicates "no preference," but that is wrong. No preference is indicated by a negative (specifically, -1) PG number. In addition, if that condition yields true, the preferred value is formatted into a sized buffer, but the size consumed by the earlier snprintf() call is not accounted for, opening up the possibilty of a buffer overrun. Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of the time displayed did not include leading 0's, which led to erroneous (sub-second portion of) time values being shown. This fixes these three issues: http://tracker.newdream.net/issues/2155 http://tracker.newdream.net/issues/2156 http://tracker.newdream.net/issues/2157 Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:52 -05:00
Alex Elder	cffaba15cd	ceph: ensure Boolean options support both senses Many ceph-related Boolean options offer the ability to both enable and disable a feature. For all those that don't offer this, add a new option so that they do. Note that ceph_show_options()--which reports mount options currently in effect--only reports the option if it is different from the default value. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	ee57741c52	rbd: make ceph_parse_options() return a pointer ceph_parse_options() takes the address of a pointer as an argument and uses it to return the address of an allocated structure if successful. With this interface is not evident at call sites that the pointer is always initialized. Change the interface to return the address instead (or a pointer-coded error code) to make the validity of the returned pointer obvious. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:47 -05:00
Alex Elder	18fa8b3fea	ceph: make ceph_setxattr() and ceph_removexattr() more alike This patch just rearranges a few bits of code to make more portions of ceph_setxattr() and ceph_removexattr() identical. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Alex Elder	3ce6cd1233	ceph: avoid repeatedly computing the size of constant vxattr names All names defined in the directory and file virtual extended attribute tables are constant, and the size of each is known at compile time. So there's no need to compute their length every time any file's attribute is listed. Record the length of each string and use it when needed to determine the space need to represent them. In addition, compute the aggregate size of strings in each table just once at initialization time. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Alex Elder	aa4066ed7b	ceph: encode type in vxattr callback routines The names of the callback functions used for virtual extended attributes are based only on the last component of the attribute name. Because of the way these are defined, this precludes allowing a single (lowest) attribute name for different callbacks, dependent on the type of file being operated on. (For example, it might be nice to support both "ceph.dir.layout" and "ceph.file.layout".) Just change the callback names to avoid this problem. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Alex Elder	881a5fa200	ceph: drop "_cb" from name of struct ceph_vxattr_cb A struct ceph_vxattr_cb does not represent a callback at all, but rather a virtual extended attribute itself. Drop the "_cb" suffix from its name to reflect that. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Alex Elder	eb78808446	ceph: use macros to normalize vxattr table definitions Entries in the ceph virtual extended attribute tables all follow a distinct pattern in their definition. Enforce this pattern through the use of a macro. Also, a null name field signals the end of the table, so make that be the first field in the ceph_vxattr_cb structure. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Alex Elder	2289190719	ceph: use a symbolic name for "ceph." extended attribute namespace Use symbolic constants to define the top-level prefix for "ceph." extended attribute names. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Alex Elder	06476a69d8	ceph: pass inode rather than table to ceph_match_vxattr() All callers of ceph_match_vxattr() determine what to pass as the first argument by calling ceph_inode_vxattrs(inode). Just do that inside ceph_match_vxattr() itself, changing it to take an inode rather than the vxattr pointer as its first argument. Also ensure the function works correctly for an empty table (i.e., containing only a terminating null entry). Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Alex Elder	b829c1954d	ceph: don't null-terminate xattr values For some reason, ceph_setxattr() allocates an extra byte in which a '\0' is stored past the end of an extended attribute value. This is not needed, and is potentially misleading, so get rid of it. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Xi Wang	80834312a4	ceph: fix overflow check in build_snap_context() The overflow check for a + n * b should be (n > (ULONG_MAX - a) / b), rather than (n > ULONG_MAX / b - a). Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Xi Wang	810339ec2f	ceph: avoid panic with mismatched symlink sizes in fill_inode() Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size do not match. Also use kstrndup rather than kmalloc/memcpy. Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-by: Alex Elder <elder@dreamhost.com>	2012-03-22 10:47:45 -05:00
Amon Ott	a661fc5611	ceph: use 2 instead of 1 as fallback for 32-bit inode number The root directory of the Ceph mount has inode number 1, so falling back to 1 always creates a collision. 2 is unused on my test systems and seems less likely to collide. Signed-off-by: Amon Ott <ao@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Alex Elder	1ce208a6ce	ceph: don't reset s_cap_ttl to zero Avoid the need to check for a special zero s_cap_ttl value by just using (jiffies - 1) as the value assigned to indicate "sometime in the past." Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Al Viro	48fde701af	switch open-coded instances of d_make_root() to new helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:35 -04:00
Linus Torvalds	6c073a7ee2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix safety of rbd_put_client() rbd: fix a memory leak in rbd_get_client() ceph: create a new session lock to avoid lock inversion ceph: fix length validation in parse_reply_info() ceph: initialize client debugfs outside of monc->mutex ceph: change "ceph.layout" xattr to be "ceph.file.layout"	2012-02-02 15:47:33 -08:00
Alex Elder	d8fb02abdc	ceph: create a new session lock to avoid lock inversion Lockdep was reporting a possible circular lock dependency in dentry_lease_is_valid(). That function needs to sample the session's s_cap_gen and and s_cap_ttl fields coherently, but needs to do so while holding a dentry lock. The s_cap_lock field was being used to protect the two fields, but that can't be taken while holding a lock on a dentry within the session. In most cases, the s_cap_gen and s_cap_ttl fields only get operated on separately. But in three cases they need to be updated together. Implement a new lock to protect the spots updating both fields atomically is required. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-02-02 12:49:19 -08:00
Xi Wang	32852a81bc	ceph: fix length validation in parse_reply_info() "len" is read from network and thus needs validation. Otherwise, given a bogus "len" value, p+len could be an out-of-bounds pointer, which is used in further parsing. Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-02-02 12:49:11 -08:00
Alex Elder	114fc47492	ceph: change "ceph.layout" xattr to be "ceph.file.layout" The virtual extended attribute named "ceph.layout" is meaningful only for regular files. Change its name to be "ceph.file.layout" to more directly reflect that in the ceph xattr namespace. Preserve the old "ceph.layout" name for the time being (until we decide it's safe to get rid of it entirely). Add a missing initializer for "readonly" in the terminating entry. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-02-02 12:48:52 -08:00
Linus Torvalds	1a52bb0b68	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: ensure prealloc_blob is in place when removing xattr rbd: initialize snap_rwsem in rbd_add() ceph: enable/disable dentry complete flags via mount option vfs: export symbol d_find_any_alias() ceph: always initialize the dentry in open_root_dentry() libceph: remove useless return value for osd_client __send_request() ceph: avoid iput() while holding spinlock in ceph_dir_fsync ceph: avoid useless dget/dput in encode_fh ceph: dereference pointer after checking for NULL crush: fix force for non-root TAKE ceph: remove unnecessary d_fsdata conditional checks ceph: Use kmemdup rather than duplicating its implementation Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs always initialize the dentry in open_root_dentry)	2012-01-13 10:29:21 -08:00
Alex Elder	83eb26af0d	ceph: ensure prealloc_blob is in place when removing xattr In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes are marked dirty, all attributes recorded in its rb_tree index are formatted into a "blob" buffer. The target buffer is recorded in ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and be of sufficient size to hold the attributes. The extended attributes are marked dirty in two cases: when a new attribute is added to the inode; or when one is removed. In the former case work is done to ensure the prealloc_blob buffer is properly set up, but in the latter it is not. Change the logic in ceph_removexattr() so it matches what is done in ceph_setxattr(). Note that this is done in a way that keeps the two blocks of code nearly identical, in anticipation of a subsequent patch that encapsulates some of this logic into one or more helper routines. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-12 11:00:51 -08:00
Sage Weil	a40dc6cc2e	ceph: enable/disable dentry complete flags via mount option Enable/disable use of the dentry dir 'complete' flag via a mount option. This lets the admin control whether ceph uses the dcache to satisfy negative lookups or readdir when it has the entire directory contents in its cache. This is purely a performance optimization; correctness is guaranteed whether it is enabled or not. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-12 11:00:40 -08:00
Alex Elder	d46cfba536	ceph: always initialize the dentry in open_root_dentry() When open_root_dentry() gets a dentry via d_obtain_alias() it does not get initialized. If the dentry obtained came from the cache, this is OK. But if not, the result is an improperly initialized dentry. To fix this, call ceph_init_dentry() regardless of which path produced the dentry. That function returns immediately for a dentry that is already initialized, it is safe to use either way. (Credit to Sage, who suggested this fix.) Signed-off-by: Alex Elder <aelder@sgi.com>	2012-01-11 16:28:25 -08:00
Sage Weil	2ff179e650	ceph: avoid iput() while holding spinlock in ceph_dir_fsync ceph_mdsc_put_request() can call iput(), which can sleep. Don't do that. Fixes: #1812 Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-10 08:57:02 -08:00
Sage Weil	ee6b1baf67	ceph: avoid useless dget/dput in encode_fh Nothing we do here sleeps, so just do it under d_lock and avoid the dget/ dput entirely. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-10 08:57:00 -08:00
Yehuda Sadeh	b8cd952b51	ceph: dereference pointer after checking for NULL moved dereference after BUG_ON Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2012-01-10 08:56:59 -08:00
Sage Weil	3d8eb7a94e	ceph: remove unnecessary d_fsdata conditional checks We now set d_fsdata unconditionally on all dentries prior to setting up the d_ops, so all of these checks are unnecessary. Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-10 08:56:56 -08:00
Al Viro	3c5184ef12	ceph: d_alloc_root() may fail ... and ceph_init_dentry(NULL) will oops Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-09 16:36:12 -05:00
Al Viro	34c80b1d93	vfs: switch ->show_options() to struct dentry * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-06 23:19:54 -05:00
Al Viro	5706b27dea	ceph: propagate umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:55:16 -05:00
Al Viro	dba19c6064	get rid of open-coded S_ISREG(), etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:55:12 -05:00
Al Viro	1a67aafb5f	switch ->mknod() to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:54 -05:00
Al Viro	4acdaf27eb	switch ->create() to umode_t vfs_create() ignores everything outside of 16bit subset of its mode argument; switching it to umode_t is obviously equivalent and it's the only caller of the method Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	18bb1db3e7	switch vfs_mkdir() and ->mkdir() to umode_t vfs_mkdir() gets int, but immediately drops everything that might not fit into umode_t and that's the only caller of ->mkdir()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	6b520e0565	vfs: fix the stupidity with i_dentry in inode destructors Seeing that just about every destructor got that INIT_LIST_HEAD() copied into it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once(); the cost of taking it into inode_init_always() will be negligible for pipes and sockets and negative for everything else. Not to mention the removal of boilerplate code from ->destroy_inode() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:52:40 -05:00
Sage Weil	a4d46363ce	ceph: disable use of dcache for readdir etc. Ceph attempts to use the dcache to satisfy negative lookups and readdir when the entire directory contents are in cache. Disable this behavior until lingering bugs in this code are shaken out; we'll re-enable these hooks once things are fully stable. Signed-off-by: Sage Weil <sage@newdream.net>	2011-12-29 08:05:14 -08:00
Yehuda Sadeh	9d5a09e659	ceph: add missing spin_unlock at ceph_mdsc_build_path() one of the paths was missing spin_unlock Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2011-12-13 11:59:53 -08:00
Sage Weil	6a82c47aa8	ceph: fix SEEK_CUR, SEEK_SET regression Commit `06222e491e` got the if wrong so that it always evaluates as true. This is semantically harmless, but makes SEEK_CUR and SEEK_SET needlessly query the server. Rewrite the if to explicitly enumerate the cases we DO need a valid i_size to make this code less fragile. Reported-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-12-13 09:19:26 -08:00
Sage Weil	be655596b3	ceph: use i_ceph_lock instead of i_lock We have been using i_lock to protect all kinds of data structures in the ceph_inode_info struct, including lists of inodes that we need to iterate over while avoiding races with inode destruction. That requires grabbing a reference to the inode with the list lock protected, but igrab() now takes i_lock to check the inode flags. Changing the list lock ordering would be a painful process. However, using a ceph-specific i_ceph_lock in the ceph inode instead of i_lock is a simple mechanical change and avoids the ordering constraints imposed by igrab(). Reported-by: Amon Ott <a.ott@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>	2011-12-07 10:46:44 -08:00
Sage Weil	2151937d7c	ceph: fix rasize reporting by ceph_show_options Fix typo. Reported-by: mowang da <whooya.xxl@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-12-02 09:27:54 -08:00
Sage Weil	774ac21da7	ceph: initialize root dentry Set up d_fsdata on the root dentry. This fixes a NULL pointer dereference in ceph_d_prune on umount. It also means we can eventually strip out all of the conditional checks on d_fsdata because it is now set unconditionally (prior to setting up the d_ops). Fix the ceph_d_prune debug print while we're here. Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-11 09:50:17 -08:00
Sage Weil	15a2015fbc	ceph: fix iput race when queueing inode work If we queue a work item that calls iput(), make sure we ihold() before attempting to queue work. Otherwise our queued work might miraculously run before we notice the queue_work() succeeded and call ihold(), allowing the inode to be destroyed. That is, instead of if (queue_work(...)) ihold(); we need to do ihold(); if (!queue_work(...)) iput(); Reported-by: Amon Ott <a.ott@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-05 22:06:31 -07:00
H Hartley Sweeten	0c6d4b4e22	ceph/super.c: quiet sparse noise Quiet the sparse noise: warning: symbol 'create_fs_client' was not declared. Should it be static? warning: symbol 'destroy_fs_client' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Sage Weil <sage@newdream.net> ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-05 21:10:12 -07:00
H Hartley Sweeten	7fd7d101ff	ceph/mds_client.c: quiet sparse noise Quiet the following sparse noise: warning: symbol 'get_nonsnap_parent' was not declared. Should it be static? warning: symbol 'done_closing_sessions' was not declared. Should it be static? Local functions don't need external visability. Make them static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Sage Weil <sage@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-05 21:10:11 -07:00
Sage Weil	c6ffe10015	ceph: use new D_COMPLETE dentry flag We used to use a flag on the directory inode to track whether the dcache contents for a directory were a complete cached copy. Switch to a dentry flag CEPH_D_COMPLETE that is safely updated by ->d_prune(). Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-05 21:10:10 -07:00
Sage Weil	b58dc4100b	ceph: clear parent D_COMPLETE flag when on dentry prune When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag on the parent dentry. Do this for the live and snapshotted namespaces. Do not bother for the .snap dir contents, since we do not cache that. Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-03 09:23:49 -07:00
Miklos Szeredi	bfe8684869	filesystems: add set_nlink() Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:43 +01:00
Sage Weil	3395734067	libceph: fix double-free of page vector ceph_release_page_vector() kfrees the vector; we shouldn't do it here too. Reported-by: Jeff Wu <cpwu@tnsoft.com.cn> Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:17 -07:00
Amon Ott	3310f7541f	ceph: fix 32-bit ino numbers Fix 32-bit ino generation to not always be 1. Signed-off-by: Amon Ott <a.ott@m-privacy.de>	2011-10-25 16:10:17 -07:00
Greg Farnum	a35eca958a	ceph: let the set_layout ioctl set single traits Previously we were validating the passed-in stripe unit, object size, and stripe count against each other (and not testing most other stuff). Instead, make sure that the composed previous layout and new values are valid, and only send the new values to the MDS. This lets users change the pool without setting the whole layout, for instance. Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>	2011-10-25 16:10:16 -07:00
Sage Weil	83eaea22bd	Revert "ceph: don't truncate dirty pages in invalidate work thread" This reverts commit `c9af9fb68e`. We need to block and truncate all pages in order to reliably invalidate them. Otherwise, we could: - have some uptodate pages in the cache - queue an invalidate - write(2) locks some pages - invalidate_work skips them - write(2) only overwrites part of the page - page now dirty and uptodate -> partial leakage of invalidated data It's not entirely clear why we started skipping locked pages in the first place. I just ran this through fsx and didn't see any problems. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Noah Watkins	80db8bea6a	ceph: replace leading spaces with tabs Trivial formatting fix. Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Sage Weil	b61c27636f	libceph: don't complain on msgpool alloc failures The pool allocation failures are masked by the pool; there is no need to spam the console about them. (That's the whole point of having the pool in the first place.) Mark msg allocations whose failure is safely handled as such. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	6ab00d465a	libceph: create messenger with client This simplifies the init/shutdown paths, and makes client->msgr available during the rest of the setup process. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	6a8ea4706a	ceph: document ioctls ...after some prodding by Christoph. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	0d66a487c1	ceph: implement (optional) max read size The 'rsize' mount option limits the maximum size of an individual read(ahead) operation that is sent off to an OSD. This is distinct from 'rasize', which controls the size of the readahead window. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	83817e35cb	ceph: rename rsize -> rasize It controls readahead. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	7c272194e6	ceph: make readpages fully async When we get a ->readpages() aop, submit async reads for all page ranges in the provided page list. Lock the pages immediately, so that VFS/MM will block until the reads complete. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:14 -07:00
Linus Torvalds	0d20fbbe82	Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client * 'for-linus' of git://ceph.newdream.net/git/ceph-client: libceph: fix leak of osd structs during shutdown ceph: fix memory leak ceph: fix encoding of ino only (not relative) paths libceph: fix msgpool	2011-09-09 15:48:34 -07:00
Noah Watkins	259a187ade	ceph: fix memory leak kfree does not clean up indirect allocations in ceph_fs_client and ceph_options (e.g. snapdir_name). Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-08-22 13:06:59 -07:00
Sage Weil	795858dbd2	ceph: fix encoding of ino only (not relative) paths A 'path' consists of a starting ino and relative component. Encode even when there is no relative component. This is primarily needed by the NFS reexport code. Signed-off-by: Sage Weil <sage@newdream.net>	2011-08-15 13:03:56 -07:00
Linus Torvalds	ba5b56cb3e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) ceph: document unlocked d_parent accesses ceph: explicitly reference rename old_dentry parent dir in request ceph: document locking for ceph_set_dentry_offset ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug ceph: protect d_parent access in ceph_d_revalidate ceph: protect access to d_parent ceph: handle racing calls to ceph_init_dentry ceph: set dir complete frag after adding capability rbd: set blk_queue request sizes to object size ceph: set up readahead size when rsize is not passed rbd: cancel watch request when releasing the device ceph: ignore lease mask ceph: fix ceph_lookup_open intent usage ceph: only link open operations to directory unsafe list if O_CREAT\|O_TRUNC ceph: fix bad parent_inode calc in ceph_lookup_open ceph: avoid carrying Fw cap during write into page cache libceph: don't time out osd requests that haven't been received ceph: report f_bfree based on kb_avail rather than diffing. ceph: only queue capsnap if caps are dirty ceph: fix snap writeback when racing with writes ...	2011-07-26 13:38:50 -07:00
Sage Weil	d79698da32	ceph: document unlocked d_parent accesses For the most part we don't care about racing with rename when directing MDS requests; either the old or new parent is fine. Document that, and do some minor cleanup. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:31:26 -07:00
Sage Weil	41b02e1f9b	ceph: explicitly reference rename old_dentry parent dir in request We carry a pin on the parent directory for the rename source and dest dentries. For the source it's r_locked_dir; we need to explicitly reference the old_dentry parent as well, since the dentry's d_parent may change between when the request was created and pinned and when it is freed. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:31:14 -07:00
Sage Weil	4f17726452	ceph: document locking for ceph_set_dentry_offset Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:31:08 -07:00
Sage Weil	e5f86dc377	ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug Have caller pass in a safely-obtained reference to the parent directory for calculating a dentry's hash valud. While we're here, simpify the flow through ceph_encode_fh() so that there is a single exit point and cleanup. Also fix a bug with the dentry hash calculation: calculate the hash for the dentry we were given, not its parent. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:55 -07:00
Sage Weil	bf1c6aca96	ceph: protect d_parent access in ceph_d_revalidate Protect d_parent with d_lock. Carry a reference. Simplify the flow so that there is a single exit point and cleanup. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:43 -07:00
Sage Weil	5f21c96dd5	ceph: protect access to d_parent d_parent is protected by d_lock: use it when looking up a dentry's parent directory inode. Also take a reference and drop it in the caller to avoid a use-after-free. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:29 -07:00
Sage Weil	48d0cbd124	ceph: handle racing calls to ceph_init_dentry The ->lookup() and prepopulate_readdir() callers are working with unhashed dentries, so we don't have to worry. The export.c callers, though, need to initialize something they got back from d_obtain_alias() and are potentially racing with other callers. Make sure we don't return unless the dentry is properly initialized (by us or someone else). Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:15 -07:00
Sage Weil	dfabbed6fd	ceph: set dir complete frag after adding capability Curretly ceph_add_cap clears the complete bit if we are newly issued the FILE_SHARED cap, which is normally the case for a newly issue cap on a new directory. That means we clear the just-set bit. Move the check that sets the flag to after the cap is added/updated. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:02 -07:00
Yehuda Sadeh	e985222743	ceph: set up readahead size when rsize is not passed This should improve the default read performance, as without it readahead is practically disabled. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2011-07-26 11:29:14 -07:00
Sage Weil	2f90b852e3	ceph: ignore lease mask The lease mask is no longer used (and it changed a while back). Instead, use a non-zero duration to indicate that there is a lease being issued. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:28:25 -07:00
Sage Weil	468640e32c	ceph: fix ceph_lookup_open intent usage We weren't properly calling lookup_instantiate_filp when setting up the lookup intent, which could lead to file leakage on errors. So: - use separate helper for the hidden snapdir translation, immediately following the mds request - use ceph_finish_lookup for the final dentry/return value dance in the exit path - lookup_instantiate_filp on success Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:28:11 -07:00
Sage Weil	9bae113a08	ceph: only link open operations to directory unsafe list if O_CREAT\|O_TRUNC We only need to put these on the directory unsafe list if they have side effects that fsync(2) should flush out. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:27:59 -07:00
Sage Weil	acda765788	ceph: fix bad parent_inode calc in ceph_lookup_open We were always getting NULL here because the intent file f_dentry is always NULL at this point, which means we were always passing NULL to ceph_mdsc_do_request. In reality, this was fine, since this isn't currently ever a write operation that needs to get strung on the dir's unsafe list. Use the dir explicitly, and only pass it if this open has side-effects that a dir fsync should flush. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:27:48 -07:00
Sage Weil	d8de9ab63a	ceph: avoid carrying Fw cap during write into page cache The generic_file_aio_write call may block on balance_dirty_pages while we flush data to the OSDs. If we hold a reference to the FILE_WR cap during that interval revocation by the MDS (e.g., to do a stat(2)) may be very slow. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:27:34 -07:00
Greg Farnum	8f04d42276	ceph: report f_bfree based on kb_avail rather than diffing. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>	2011-07-26 11:27:06 -07:00
Sage Weil	e77dc3e9c0	ceph: only queue capsnap if caps are dirty We used to go into this branch if i_wrbuffer_ref_head was non-zero. This was an ancient check from before we were careful about dealing with all kinds of caps (and not just dirty pages). It is cleaner to only queue a capsnap if there is an actual dirty cap. If we are racing with... something...we will end up here with ci->i_wrbuffer_refs but no dirty caps. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:41 -07:00
Sage Weil	af0ed569d7	ceph: fix snap writeback when racing with writes There are two problems that come up when we try to queue a capsnap while a write is in progress: - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap with dirty == 0. That will crash later in __ceph_flush_snaps(). Or on the FILE_WR cap if a write is in progress. - We may not have i_head_snapc set, which causes problems pretty quickly. Look to the snaprealm in this case. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:31 -07:00
Sage Weil	9cfa1098dc	ceph: use flag bit for at_end readdir flag This saves us a word of memory per file. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:18 -07:00
Sage Weil	4918b6d140	ceph: add F_SYNC file flag to force sync (non-O_DIRECT) io This allows us to force IO through the sync path which you normally only get when multiple clients are reading/writing to the same file or by mounting with -o sync. Among other things, this lets test programs verify correctness with a single mount. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:07 -07:00
Sage Weil	252c6728de	ceph: add flags field to file_info Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:25:27 -07:00
Josef Bacik	02c24a8218	fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:59 -04:00
Josef Bacik	06222e491e	fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases we just return -EINVAL, in others we do the normal generic thing, and in others we're simply making sure that the properly due-dilligence is done. For example in NFS/CIFS we need to make sure the file size is update properly for the SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself that is all we have to do. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:58 -04:00
Al Viro	b85fd6bdc9	don't open-code parent_ino() in assorted ->readdir() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:54 -04:00
Al Viro	a127e0af59	ceph: LOOKUP_OPEN is set only when it's the last component Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:59 -04:00
Al Viro	8a5e929dd2	don't transliterate lower bits of ->intent.open.flags to FMODE_... ->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:52 -04:00
Al Viro	10556cb21a	->permission() sanitizing: don't pass flags to ->permission() not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:24 -04:00
Al Viro	2830ba7f34	->permission() sanitizing: don't pass flags to generic_permission() redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:22 -04:00
Al Viro	178ea73521	kill check_acl callback of generic_permission() its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:16 -04:00
Al Viro	1b71fe2efa	ceph analog of cifs build_path_from_dentry() race fix ... unfortunately, cifs bug got copied. Fix is essentially the same. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-16 23:43:58 -04:00
Sage Weil	d7f124f129	ceph: fix sync and dio writes across stripe boundaries We were iterating across stripe boundaries properly, but not moving the write buffer pointer forward. This caused us to rewrite the same data after the break. Fix by adjusting the data pointer forward, and recalculating the io and buffer alignment after the break. Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-13 16:26:22 -07:00
Sage Weil	773e9b4426	ceph: fix page alignment corrections dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1 dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1 Reported-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-13 16:26:10 -07:00
Sage Weil	0c1f91f271	ceph: unwind canceled flock state If we request a lock and then abort (e.g., ^C), we need to send a matching unlock request to the MDS to unwind our lock attempt to avoid indefinitely blocking other clients. Reported-by: Brian Chrisman <brchrisman@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:36:45 -07:00
Sage Weil	0e98728fa3	ceph: fix ENOENT logic in striped_read Getting ENOENT is equivalent to reading 0 bytes. Make that correction before setting up the hit_stripe and was_short flags. Fixes the following case: dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0 dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct Reported-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:34:16 -07:00
Sage Weil	c3cd62839a	ceph: fix short sync reads from the OSD If we get a short read from the OSD because the object is small, we need to zero the remainder of the buffer. For O_DIRECT reads, the attempted range is not trimmed to i_size by the VFS, so we were actually looping indefinitely. Fix by trimming by i_size, and the unconditionally zeroing the trailing range. Reported-by: Jeff Wu <cpwu@tnsoft.com.cn> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:34:14 -07:00
Sage Weil	70b666c3b4	ceph: use ihold when we already have an inode ref We should use ihold whenever we already have a stable inode ref, even when we aren't holding i_lock. This avoids adding new and unnecessary locking dependencies. Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:34:11 -07:00
Sage Weil	db3540522e	ceph: fix cap flush race reentrancy In `e9964c10` we change cap flushing to do a delicate dance because some inodes on the cap_dirty list could be in a migrating state (got EXPORT but not IMPORT) in which we couldn't actually flush and move from dirty->flushing, breaking the while (!empty) { process first } loop structure. It worked for a single sync thread, but was not reentrant and triggered infinite loops when multiple syncers came along. Instead, move inodes with dirty to a separate cap_dirty_migrating list when in the limbo export-but-no-import state, allowing us to go back to the simple loop structure (which was reentrant). This is cleaner and more robust. Audited the cap_dirty users and this looks fine: list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we have dirty caps (which list we're on is irrelevant) and list_del_init() calls still do the right thing. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-24 11:52:12 -07:00
Sage Weil	45e3d3eeb6	ceph: avoid inode lookup on nfs fh reconnect If we get the inode from the MDS, we have a reference in req; don't do a fresh lookup. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-24 11:52:06 -07:00
Sage Weil	3c454cf216	ceph: use LOOKUPINO to make unconnected nfs fh more reliable If we are unable to locate an inode by ino, ask the MDS using the new LOOKUPINO command. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-24 11:52:05 -07:00
Sage Weil	9d6fcb081a	ceph: check return value for start_request in writepages Since we pass the nofail arg, we should never get an error; BUG if we do. (And fix the function to not return an error if __map_request fails.) Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:05 -07:00
Sage Weil	6b4a3b517a	ceph: remove useless check rc is only ever 0 or negative in this method. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:05 -07:00
Sage Weil	da39822c65	ceph: fix broken comparison in readdir loop Both off and fi->offset are unsigned, so the difference is always >= 0. Compare them directly instead of the sign of the difference. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:04 -07:00
Sage Weil	3540303f87	ceph: fix rare potential cap leak If we grab new_cap, retake the lock, and find we already have a cap now for the given mds, release new_cap. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:03 -07:00
Sage Weil	ae59808301	ceph: use snprintf for dirstat content We allocate a buffer for rstats if the dirstat option is enabled. Use snprintf. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:02 -07:00
Sage Weil	1b36698577	libceph: remove unused variable Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:24:17 -07:00
Sage Weil	3b66378034	ceph: take reference on mds request r_unsafe_dir We put ourselves on an inode list for the parent directory of metadata operations so that an fsync on the directory will wait for metadata updates to commit to disk. We weren't holding a reference to that directory, however, and under certain workloads (fsstress in this case) the directory can go away. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:20:07 -07:00
Henry C Chang	d3d0720d4a	ceph: do not use i_wrbuffer_ref as refcount for Fb cap We increments i_wrbuffer_ref when taking the Fb cap. This breaks the dirty page accounting and causes looping in __ceph_do_pending_vmtruncate, and ceph client hangs. This bug can be reproduced occasionally by running blogbench. Add a new field i_wb_ref to inode and dedicate it to Fb reference counting. Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-11 10:44:48 -07:00
Henry C Chang	a26a185d27	ceph: fix list_add in ceph_put_snap_realm Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-11 10:44:36 -07:00
Henry C Chang	7d8e18a69d	ceph: print debug message before put mds session The mds session, s, could be freed during ceph_put_mds_session. Move dout before ceph_put_mds_session. Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-11 10:44:34 -07:00
Sage Weil	fca65b4ad7	ceph: do not call __mark_dirty_inode under i_lock The __mark_dirty_inode helper now takes i_lock as of `250df6ed`. Fix the one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the flags value so that the callers can do it outside of i_lock. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-04 12:56:45 -07:00
Henry C Chang	8c71897be2	ceph: handle ceph_osdc_new_request failure in ceph_writepages_start We should unlock the page and return -ENOMEM if ceph_osdc_new_request failed. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-03 09:28:12 -07:00
Sage Weil	3772d26d87	ceph: use ihold() when i_lock is held See `0444d76ae6`. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-03 09:28:08 -07:00
Linus Torvalds	42933bac11	Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings	2011-04-07 11:14:49 -07:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Linus Torvalds	50f3515828	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: Create a new key type "ceph". libceph: Get secret from the kernel keys api when mounting with key=NAME. ceph: Move secret key parsing earlier. libceph: fix null dereference when unregistering linger requests ceph: unlock on error in ceph_osdc_start_request() ceph: fix possible NULL pointer dereference ceph: flush msgr_wq during mds_client shutdown	2011-03-30 09:46:09 -07:00
Tommi Virtanen	8323c3aa74	ceph: Move secret key parsing earlier. This makes the base64 logic be contained in mount option parsing, and prepares us for replacing the homebew key management with the kernel key retention service. Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-29 12:11:16 -07:00
Dave Chinner	0444d76ae6	fs: don't use igrab() while holding i_lock Fix the incorrect use of igrab() inside the i_lock in NFS and Ceph‥ If we are already holding the i_lock, we have a reference to the inode so we can safely use ihold() to gain an extra reference. This avoids hangs due to lock recursion on the i_lock now that the inode_lock is gone and igrab() uses the i_lock itself. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: Ryan Mallon <ryan@bluewatersys.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-29 07:50:34 -07:00
Sage Weil	ef550f6f4f	ceph: flush msgr_wq during mds_client shutdown The release method for mds connections uses a backpointer to the mds_client, so we need to flush the workqueue of any pending work (and ceph_connection references) prior to freeing the mds_client. This fixes an oops easily triggered under UML by while true ; do mount ... ; umount ... ; done Also fix an outdated comment: the flush in ceph_destroy_client only flushes OSD connections out. This bug is basically an artifact of the ceph -> ceph+libceph conversion. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-25 13:27:48 -07:00
Sage Weil	147851d2dc	ceph: rename dentry_release -> d_release, fix comment Just for consistency's sake. Fix obsolete comment too. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:26 -07:00
Henry C Chang	49bcb93236	ceph: add request to the tail of unsafe write list In sync_write_wait(), we assume that the newest request is at the tail of unsafe write list. We should maintain the semantics here. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:25 -07:00
Henry C Chang	78a255654f	ceph: remove request from unsafe list if it is canceled/timed out This fixes the list corruption warning like this: ------------[ cut here ]------------ WARNING: at lib/list_debug.c:30 __list_add+0x68/0x81() Hardware name: X8DTU list_add corruption. prev->next should be next (ffff880618931250), but was (null). (prev=ffff880c188b9130). Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs ceph libceph libcrc32c sunrpc ipv6 fuse igb i2c_i801 ioatdma i2c_core iTCO_wdt iTCO_vendor_support joydev dca serio_raw usb_storage [last unloaded: scsi_wait_scan] Pid: 10977, comm: smbd Tainted: G W 2.6.32.23-170.Elaster.xendom0.fc12.x86_64 #1 Call Trace: [<ffffffff8105753c>] warn_slowpath_common+0x7c/0x94 [<ffffffff810575ab>] warn_slowpath_fmt+0x41/0x43 [<ffffffff812351a3>] __list_add+0x68/0x81 [<ffffffffa014799d>] ceph_aio_write+0x614/0x8a2 [ceph] [<ffffffff8111d2a0>] do_sync_write+0xe8/0x125 [<ffffffff81075a1f>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811f21ec>] ? selinux_file_permission+0x5c/0xb3 [<ffffffff811e8521>] ? security_file_permission+0x16/0x18 [<ffffffff8111d864>] vfs_write+0xae/0x10b [<ffffffff8111d91b>] sys_pwrite64+0x5a/0x76 [<ffffffff81012d32>] system_call_fastpath+0x16/0x1b ---[ end trace 08573eb9f07ff6f4 ]--- Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:24 -07:00
Sage Weil	80456f8672	ceph: move readahead default to fs/ceph from libceph Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:23 -07:00
Yehuda Sadeh	ad1fee96cb	ceph: add ino32 mount option The ino32 mount option forces the ceph fs to report 32 bit ino values. This is useful for 64 bit kernels with 32 bit userspace. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2011-03-21 12:24:22 -07:00
Sage Weil	21f3b5f1bb	ceph: remove debugfs debug cruft Whoops! Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:20 -07:00
Sage Weil	09adc80c61	ceph: preserve I_COMPLETE across rename d_move puts the renamed dentry at the end of d_subdirs, screwing with our cached dentry directory offsets. We were just clearing I_COMPLETE to avoid any possibility of trouble. However, assigning the renamed dentry an offset at the end of the directory (to match it's new d_subdirs position) is sufficient to maintain correct behavior and hold onto I_COMPLETE. This is especially important for workloads like rsync, which renames files into place. Before, we would lose I_COMPLETE and do MDS lookups for each file. With this patch we only talk to the MDS on create and rename. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-15 09:14:03 -07:00
Al Viro	0eb980e317	ceph: fix d_revalidate oopsen on NFS exports can't blindly check nd->flags in ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-10 03:44:05 -05:00
Sage Weil	455cec0abf	ceph: no .snap inside of snapped namespace Otherwise you can do things like # mkdir .snap/foo # cd .snap/foo/.snap # ls <badness> Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-04 12:25:09 -08:00
Sage Weil	16a8b70a5a	ceph: do not clear I_COMPLETE from d_release First, this was racy anyway: d_release isn't called until well after the dentry is unhashed. Second, this runs afoul of the recent dcache change that clears d_parent prior to calling d_release (`949854d0`), causing a NULL pointer dereference. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-03 10:09:52 -08:00
Sage Weil	b545cc1505	ceph: do not set I_COMPLETE Do not set the I_COMPLETE flag on directories until we resolve races with dcache pruning. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-03 10:09:51 -08:00
Sage Weil	9bde178d05	Revert "ceph: keep reference to parent inode on ceph_dentry" This reverts commit `97d79b403e`. This fails to account for d_parent changes due to rename or disconnected dentries due to submounts or NFS reexports. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-03 10:09:50 -08:00
Linus Torvalds	8bd89ca220	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: keep reference to parent inode on ceph_dentry ceph: queue cap_snaps once per realm libceph: fix socket write error handling libceph: fix socket read error handling	2011-02-21 15:01:38 -08:00
Yehuda Sadeh	97d79b403e	ceph: keep reference to parent inode on ceph_dentry When creating a new dentry we now hold a reference to the parent inode in the ceph_dentry. This is required due to the new RCU changes from `949854d0`, which set dentry->d_parent to NULL in d_kill before calling the ->release() callback. If/when that behavior is changed, we can revert this hack. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-02-19 19:59:14 -08:00
Sage Weil	e8e1ba96b2	ceph: queue cap_snaps once per realm We were forming a dirty list, and then queueing cap_snaps for each realm _and_ its children, regardless of whether the children were already in the dirty list. This meant we did it twice for some realms. Which in turn meant we corrupted mdsc->snap_flush_list when the cap_snap was re-added to the list it was already on, and could trigger an infinite loop. We were also using recursion to do reach all the children, a no-no when stack is limited. Instead, (re)queue any children on the dirty list, avoiding processing anything twice and avoiding any recursion. Signed-off-by: Sage Weil <sage@newdream.net>	2011-02-04 20:45:58 -08:00
Linus Torvalds	b12ece7d85	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: avoid picking MDS that is not active ceph: avoid immediate cap check after import ceph: fix flushing of caps vs cap import ceph: fix erroneous cap flush to non-auth mds ceph: fix cap_wanted_delay_{min,max} mount option initialization ceph: fix xattr rbtree search ceph: fix getattr on directory when using norbytes	2011-01-28 12:12:58 +10:00
Sage Weil	d66bbd441c	ceph: avoid picking MDS that is not active Ignore replication or auth frag data if it indicates an MDS that is not active. This can happen if the MDS shuts down and the client has stale data about the namespace distribution across the MDS cluster. If that's the case, fall back to directing the request based on the auth cap (which should always be accurate). Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-25 08:16:37 -08:00
Sage Weil	7e57b81c76	ceph: avoid immediate cap check after import The NODELAY flag avoids the heuristics that delay cap (issued/wanted) release. There's no reason for that after we import a cap, and it kills whatever benefit we get from those delays. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-19 09:23:26 -08:00
Sage Weil	088b3f5e9e	ceph: fix flushing of caps vs cap import If we are mid-flush and a cap is migrated to another node, we need to resend the cap flush message to the new MDS, and do so with the original flush_seq to avoid leaking across a sync boundary. Previously we didn't redo the flush (we only flushed newly dirty data), which would cause a later sync to hang forever. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-19 09:23:25 -08:00
Sage Weil	24be0c4810	ceph: fix erroneous cap flush to non-auth mds The int flushing is global and not clear on each iteration of the loop, which can cause a second flush of caps to any MDSs with ids greater than the auth. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-19 09:23:24 -08:00
Sage Weil	50aac4fec5	ceph: fix cap_wanted_delay_{min,max} mount option initialization These were initialized to 0 instead of the default, fallout from the RBD refactor in `3d14c5d2b6`. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-19 09:23:22 -08:00
Sage Weil	17db143fc0	ceph: fix xattr rbtree search Fix xattr name comparison in rbtree search for strings that share a prefix. The name argument is null terminated, but the xattr name is not, so we need to use strncmp, but that means adjusting for the case where name is a prefix of xattr->name. The corresponding case in __set_xattr() already handles this properly (although in that case name is also not null terminated). Reported-by: Sergiy Kibrik <sakib@meta.ua> Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-13 15:50:11 -08:00
Yehuda Sadeh	1c1266bb91	ceph: fix getattr on directory when using norbytes The norbytes mount option was broken, and when doing getattr on a directory it return the rbytes instead of the number of entities. This commit fixes it. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-13 15:50:06 -08:00
Linus Torvalds	a170315420	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix cleanup when trying to mount inexistent image net/ceph: make ceph_msgr_wq non-reentrant ceph: fsc->*_wq's aren't used in memory reclaim path ceph: Always free allocated memory in osdmap_decode() ceph: Makefile: Remove unnessary code ceph: associate requests with opening sessions ceph: drop redundant r_mds field ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS ceph: add dir_layout to inode	2011-01-13 10:25:24 -08:00
Tejun Heo	01e6acc4ea	ceph: fsc->_wq's aren't used in memory reclaim path fsc->_wq's aren't depended upon during memory reclaim. Convert to alloc_workqueue() w/o WQ_MEM_RECLAIM. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Sage Weil <sage@newdream.net> Cc: ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:14 -08:00
Tracey Dent	582c86e690	ceph: Makefile: Remove unnessary code Remove the if and else conditional because the code is in mainline and there is no need in it being there. Also, Changed Makefile to use <modules>-y instead of <modules>-objs because -objs is deprecated and not mentioned in Documentation/kbuild/makefiles.txt. Signed-off-by: Tracey Dent <tdent48227@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:13 -08:00
Sage Weil	dc69e2e9fc	ceph: associate requests with opening sessions Associate request with sessions that aren't yep open. This makes the debugfs mdsc request list more informative. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:13 -08:00
Sage Weil	4af25fdda6	ceph: drop redundant r_mds field The r_mds field is redundant, since we can find the same information at r_session->s_mds, and when r_session is NULL then r_mds is meaningless. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:13 -08:00
Sage Weil	14303d20f3	ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS This implements the DIRLAYOUTHASH protocol feature, which passes the dir layout over the wire from the MDS. This gives the client knowledge of the correct hash function to use for mapping dentries among dir fragments. Note that if this feature is _not_ present on the client but is on the MDS, the client may misdirect requests. This will result in a forward and degrade performance. It may also result in inaccurate NFS filehandle generation, which will prevent fh resolution when the inode is not present in the client cache and the parent directories have been fragmented. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:13 -08:00
Sage Weil	6c0f3af72c	ceph: add dir_layout to inode Add a ceph_dir_layout to the inode, and calculate dentry hash values based on the parent directory's specified dir_hash function. This is needed because the old default Linux dcache hash function is extremely week and leads to a poor distribution of files among dir fragments. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:12 -08:00
Nick Piggin	b74c79e993	fs: provide rcu-walk aware permission i_ops Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:29 +11:00
Nick Piggin	34286d6662	fs: rcu-walk aware d_revalidate method Require filesystems be aware of .d_revalidate being called in rcu-walk mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning -ECHILD from all implementations. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:29 +11:00
Nick Piggin	fb045adb99	fs: dcache reduce branches in lookup path Reduce some branches and memory accesses in dcache lookup by adding dentry flags to indicate common d_ops are set, rather than having to check them. This saves a pointer memory access (dentry->d_op) in common path lookup situations, and saves another pointer load and branch in cases where we have d_op but not the particular operation. Patched with: git grep -E '[.>]([[:space:]])d_op([[:space:]])=' \| xargs sed -e 's/$[^\t ]$->d_op = $.$;/d_set_d_op(\1, \2);/' -e 's/$[^\t ]$\.d_op = $.$;/d_set_d_op(\&\1, \2);/' -i Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:28 +11:00
Nick Piggin	fa0d7e3de6	fs: icache RCU free inodes RCU free the struct inode. This will allow: - Subsequent store-free path walking patch. The inode must be consulted for permissions when walking, so an RCU inode reference is a must. - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking. - Could remove some nested trylock loops in dcache code - Could potentially simplify things a bit in VM land. Do not need to take the page lock to follow page->mapping. The downsides of this is the performance cost of using RCU. In a simple creat/unlink microbenchmark, performance drops by about 10% due to inability to reuse cache-hot slab objects. As iterations increase and RCU freeing starts kicking over, this increases to about 20%. In cases where inode lifetimes are longer (ie. many inodes may be allocated during the average life span of a single inode), a lot of this cache reuse is not applicable, so the regression caused by this patch is smaller. The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU, however this adds some complexity to list walking and store-free path walking, so I prefer to implement this at a later date, if it is shown to be a win in real situations. I haven't found a regression in any non-micro benchmark so I doubt it will be a problem. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:26 +11:00
Nick Piggin	b5c84bf6f6	fs: dcache remove dcache_lock dcache_lock no longer protects anything. remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:23 +11:00
Nick Piggin	2fd6b7f507	fs: dcache scale subdirs Protect d_subdirs and d_child with d_lock, except in filesystems that aren't using dcache_lock for these anyway (eg. using i_mutex). Note: if we change the locking rule in future so that ->d_child protection is provided only with ->d_parent->d_lock, it may allow us to reduce some locking. But it would be an exception to an otherwise regular locking scheme, so we'd have to see some good results. Probably not worthwhile. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:21 +11:00
Nick Piggin	da5029563a	fs: dcache scale d_unhashed Protect d_unhashed(dentry) condition with d_lock. This means keeping DCACHE_UNHASHED bit in synch with hash manipulations. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:21 +11:00
Nick Piggin	b7ab39f631	fs: dcache scale dentry refcount Make d_count non-atomic and protect it with d_lock. This allows us to ensure a 0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when we start protecting many other dentry members with d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:21 +11:00
Henry C Chang	b6aa5901c7	ceph: mark user pages dirty on direct-io reads For read operation, we have to set the argument _write_ of get_user_pages to 1 since we will write data to pages. Also, we need to SetPageDirty before releasing these pages. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-17 09:54:40 -08:00
Sage Weil	92cf765237	ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that d_parent is defined. It isn't for those callers, so check! Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-17 09:53:48 -08:00
Henry C Chang	ab226e21ad	ceph: fix direct-io on non-page-aligned buffers The user buffer may be 512-byte aligned, not page-aligned. We were assuming the buffer was page-aligned and only accounting for non-page-aligned io offsets. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-15 20:46:16 -08:00
Sage Weil	1cd275f609	ceph: fix ioctl magic The ioctl magic was inadvertently changed in `571dba52`. Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-06 09:45:22 -08:00
Herb Shiu	a5b10629ed	ceph: Behave better when handling file lock replies. Fill in the local lock with response data if appropriate, and don't call posix_lock_file when reading locks. Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com> Acked-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-01 14:22:34 -08:00
Herb Shiu	637ae8d547	ceph: pass lock information by struct file_lock instead of as individual params. Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com> Acked-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-01 14:22:34 -08:00
Herb Shiu	25933abdd8	ceph: Handle file locks in replies from the MDS. Previously the kernel client incorrectly assumed everything was a directory. Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com> Acked-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-01 14:22:27 -08:00
Sage Weil	884ea89276	ceph: avoid possible null deref in readdir after dir llseek last may be NULL, but we dereference it in the else branch without checking. Normally it doesn't trigger because last == NULL when fpos == 2, but it could happen on a newly opened dir if the user seeks forward. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-01 14:15:31 -08:00
Linus Torvalds	76db8ac45f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: fix readdir EOVERFLOW on 32-bit archs ceph: fix frag offset for non-leftmost frags ceph: fix dangling pointer ceph: explicitly specify page alignment in network messages ceph: make page alignment explicit in osd interface ceph: fix comment, remove extraneous args ceph: fix update of ctime from MDS ceph: fix version check on racing inode updates ceph: fix uid/gid on resent mds requests ceph: fix rdcache_gen usage and invalidate ceph: re-request max_size if cap auth changes ceph: only let auth caps update max_size ceph: fix open for write on clustered mds ceph: fix bad pointer dereference in ceph_fill_trace ceph: fix small seq message skipping Revert "ceph: update issue_seq on cap grant"	2010-11-19 15:32:22 -08:00
Sage Weil	3105c19c45	ceph: fix readdir EOVERFLOW on 32-bit archs One of the readdir filldir_t callers was passing the raw ceph 64-bit ino instead of the hashed 32-bit one, producing an EOVERFLOW in the filler callback. Fix this by calling the ceph_vino_to_ino() helper to do the conversion. Reported-by: Jan Smets <jan.smets@alcatel-lucent.com> Tested-by: Jan Smets <jan.smets@alcatel-lucent.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-18 09:15:07 -08:00
Arnd Bergmann	451a3c24b0	BKL: remove extraneous #include <smp_lock.h> The big kernel lock has been removed from all these files at some point, leaving only the #include. Remove this too as a cleanup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-11-17 08:59:32 -08:00
Sage Weil	7b88dadc13	ceph: fix frag offset for non-leftmost frags We start at offset 2 for the leftmost frag, and 0 for subsequent frags. When we reach the end (rightmost), we go back to 2. This fixes readdir on fragmented (large) directories. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-11 16:48:59 -08:00
Sage Weil	a1629c3b24	ceph: fix dangling pointer Clear fi->last_name when it's freed. The only caller is rewinddir() (or equivalent lseek). Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-11 15:24:06 -08:00
Sage Weil	b7495fc2ff	ceph: make page alignment explicit in osd interface We used to infer alignment of IOs within a page based on the file offset, which assumed they matched. This broke with direct IO that was not aligned to pages (e.g., 512-byte aligned IO). We were also trusting the alignment specified in the OSD reply, which could have been adjusted by the server. Explicitly specify the page alignment when setting up OSD IO requests. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-09 12:43:12 -08:00
Sage Weil	e98b6fed84	ceph: fix comment, remove extraneous args The offset/length arguments aren't used. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-09 12:24:53 -08:00
Sage Weil	d8672d64b8	ceph: fix update of ctime from MDS The client can have a newer ctime than the MDS due to AUTH_EXCL and XATTR_EXCL caps as well; update the check in ceph_fill_file_time appropriately. This fixes cases where ctime/mtime goes backward under the right sequence of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that goes to the MDS). Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-08 09:24:34 -08:00
Sage Weil	8bd59e0188	ceph: fix version check on racing inode updates We may get updates on the same inode from multiple MDSs; generally we only pay attention if the update is newer than what we already have. The exception is when an MDS sense unstable information, in which case we always update. The old > check got this wrong when our version was odd (e.g. 3) and the reply version was even (e.g. 2): the older stale (v2) info would be applied. Fixed and clarified the comment. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-08 09:23:12 -08:00
Sage Weil	cb4276cca4	ceph: fix uid/gid on resent mds requests MDS requests can be rebuilt and resent in non-process context, but were filling in uid/gid from current_fsuid/gid. Put that information in the request struct on request setup. This fixes incorrect (and root) uid/gid getting set for requests that are forwarded between MDSs, usually due to metadata migrations. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-08 07:29:05 -08:00
Sage Weil	cd045cb42a	ceph: fix rdcache_gen usage and invalidate We used to use rdcache_gen to indicate whether we "might" have cached pages. Now we just look at the mapping to determine that. However, some old behavior remains from that transition. First, rdcache_gen == 0 no longer means we have no pages. That can happen at any time (presumably when we carry FILE_CACHE). We should not reset it to zero, and we should not check that it is zero. That means that the only purpose for rdcache_revoking is to resolve races between new issues of FILE_CACHE and an async invalidate. If they are equal, we should invalidate. On success, we decrement rdcache_revoking, so that it is no longer equal to rdcache_gen. Similarly, if we success in doing a sync invalidate, set revoking = gen - 1. (This is a small optimization to avoid doing unnecessary invalidate work and does not affect correctness.) Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-08 07:29:05 -08:00
Sage Weil	feb4cc9bb4	ceph: re-request max_size if cap auth changes If the auth cap migrates to another MDS, clear requested_max_size so that we resend any pending max_size increase requests. This fixes potential hangs on writes that extend a file and race with an cap migration between MDSs. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-07 09:39:23 -08:00
Sage Weil	912a9b0319	ceph: only let auth caps update max_size Only the auth MDS has a meaningful max_size value for us, so only update it in fill_inode if we're being issued an auth cap. Otherwise, a random stat result from a non-auth MDS can clobber a meaningful max_size, get the client<->mds cap state out of sync, and make writes hang. Specifically, even if the client re-requests a larger max_size (which it will), the MDS won't respond because as far as it knows we already have a sufficiently large value. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-07 09:39:21 -08:00
Sage Weil	7421ab8041	ceph: fix open for write on clustered mds Normally when we open a file we already have a cap, and simply update the wanted set. However, if we open a file for write, but don't have an auth cap, that doesn't work; we need to open a new cap with the auth MDS. Only reuse existing caps if we are opening for read or the existing cap is auth. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-07 09:07:15 -08:00
Sage Weil	d8b16b3d1c	ceph: fix bad pointer dereference in ceph_fill_trace We dereference *in a few lines down, but only set it on rename. It is apparently pretty rare for this to trigger, but I have been hitting it with a clustered MDSs. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-07 08:40:43 -08:00
Al Viro	a7f9fb205a	convert ceph Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-10-29 04:17:18 -04:00
Sage Weil	2f56f56ad9	Revert "ceph: update issue_seq on cap grant" This reverts commit `d91f2438d8`. The intent of issue_seq is to distinguish between mds->client messages that (re)create the cap and those that do not, which means we should _only_ be updating that value in the create paths. By updating it in handle_cap_grant, we reset it to zero, which then breaks release. The larger question is what workload/problem made me think it should be updated here... Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-27 21:05:54 -07:00
Wu Fengguang	1b430beee5	writeback: remove nonblocking/encountered_congestion references This removes more dead code that was somehow missed by commit `0d99519efe` (writeback: remove unused nonblocking and congestion checks). There are no behavior change except for the removal of two entries from one of the ext4 tracing interface. The nonblocking checks in ->writepages are no longer used because the flusher now prefer to block on get_request_wait() than to skip inodes on IO congestion. The latter will lead to more seeky IO. The nonblocking checks in ->writepage are no longer used because it's redundant with the WB_SYNC_NONE check. We no long set ->nonblocking in VM page out and page migration, because a) it's effectively redundant with WB_SYNC_NONE in current code b) it's old semantic of "Don't get stuck on request queues" is mis-behavior: that would skip some dirty inodes on congestion and page out others, which is unfair in terms of LRU age. Inspired by Christoph Hellwig. Thanks! Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: David Howells <dhowells@redhat.com> Cc: Sage Weil <sage@newdream.net> Cc: Steve French <sfrench@samba.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-26 16:52:05 -07:00
Sage Weil	efa4c1206e	ceph: do not carry i_lock for readdir from dcache We were taking dcache_lock inside of i_lock, which introduces a dependency not found elsewhere in the kernel, complicationg the vfs locking scalability work. Since we don't actually need it here anyway, remove it. We only need i_lock to test for the I_COMPLETE flag, so be careful to do so without dcache_lock held. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:27 -07:00
Julia Lawall	61413c2f59	fs/ceph/xattr.c: Use kmemdup Convert a sequence of kmalloc and memcpy to use kmemdup. The semantic patch that performs this transformation is: (http://coccinelle.lip6.fr/) // <smpl> @@ expression a,flag,len; expression arg,e1,e2; statement S; @@ a = - $kmalloc\\|kzalloc$(len,flag) + kmemdup(arg,len,flag) <... when != a if (a == NULL \|\| ...) S ...> - memcpy(a,arg,len+1); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:26 -07:00
Greg Farnum	571dba52a3	ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:23 -07:00
Randy Dunlap	6f453ed6c0	ceph: fix debugfs warnings Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning: fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-10-20 15:38:21 -07:00
Sage Weil	496e59553c	ceph: switch from BKL to lock_flocks() Switch from using the BKL explicitly to the new lock_flocks() interface. Eventually this will turn into a spinlock. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:18 -07:00
Greg Farnum	fca4451acf	ceph: preallocate flock state without locks held When the lock_kernel() turns into lock_flocks() and a spinlock, we won't be able to do allocations with the lock held. Preallocate space without the lock, and retry if the lock state changes out from underneath us. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:17 -07:00
Sage Weil	18a38193ef	ceph: use mapping->nrpages to determine if mapping is empty This is simpler and faster. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:15 -07:00
Sage Weil	93afd449aa	ceph: only invalidate on check_caps if we actually have pages The i_rdcache_gen value only implies we MAY have cached pages; actually check the mapping to see if it's worth bothering with an invalidate. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:15 -07:00
Sage Weil	4c32f5dda5	ceph: do not hide .snap in root directory Snaps in the root directory are now supported by the MDS, and harmless on older versions. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:14 -07:00
Yehuda Sadeh	3d14c5d2b6	ceph: factor out libceph from Ceph file system This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:37:28 -07:00
Yehuda Sadeh	ae1533b62b	ceph-rbd: osdc support for osd call and rollback operations This will be used for rbd snapshots administration. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-10-20 15:37:25 -07:00
Yehuda Sadeh	68b4476b0b	ceph: messenger and osdc changes for rbd Allow the messenger to send/receive data in a bio. This is added so that we wouldn't need to copy the data into pages or some other buffer when doing IO for an rbd block device. We can now have trailing variable sized data for osd ops. Also osd ops encoding is more modular. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:37:18 -07:00
Yehuda Sadeh	3499e8a5d4	ceph: refactor osdc requests creation functions The osd requests creation are being decoupled from the vino parameter, allowing clients using the osd to use other arbitrary object names that are not necessarily vino based. Also, calc_raw_layout now takes a snap id. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:36:01 -07:00
Yehuda Sadeh	7669a2c95e	ceph: lookup pool in osdmap by name Implement a pool lookup by name. This will be used by rbd. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:35:36 -07:00
Sage Weil	d91f2438d8	ceph: update issue_seq on cap grant We need to update the issue_seq on any grant operation, be it via an MDS reply or a separate grant message. The update in the grant path was missing. This broke cap release for inodes in which the MDS sent an explicit grant message that was not soon after followed by a successful MDS reply on the same inode. Also fix the signedness on seq locals. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-07 08:01:50 -07:00
Greg Farnum	21b559de56	ceph: send cap release message early on failed revoke. If an MDS tries to revoke caps that we don't have, we want to send releases early since they probably contain the caps message the MDS is looking for. Previously, we only sent the messages if we didn't have the inode either. But in a multi-mds system we can retain the inode after dropping all caps for a single MDS. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-07 08:00:24 -07:00
Aneesh Kumar K.V	bba0cd0e3d	ceph: Update max_len with minimum required size encode_fh on error should update max_len with minimum required size, so that caller can redo the call with the reallocated buffer. This is required with open by handle patch series Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-07 08:00:24 -07:00

... 4 5 6 7 8 ...

987 Commits