Commit Graph

993 Commits

Author SHA1 Message Date
Michal Hocko
19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Ilya Dryomov
14bb211d32 rbd: support updating the lock cookie without releasing the lock
As we no longer release the lock before potentially raising BLACKLISTED
in rbd_reregister_watch(), the "either locked or blacklisted" assert in
rbd_queue_workfn() needs to go: we can be both locked and blacklisted
at that point now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 09:19:23 +02:00
Jeff Layton
58eb7932ae libceph: add an epoch_barrier field to struct ceph_osd_client
Cephfs can get cap update requests that contain a new epoch barrier in
them. When that happens we want to pause all OSD traffic until the right
map epoch arrives.

Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.

If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.

If we get a map with a full pool, or at quota condition, then set the
barrier to the current epoch value.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton
fc36d0a42c libceph: abort already submitted but abortable requests when map or pool goes full
When a Ceph volume hits capacity, a flag is set in the OSD map to
indicate that, and a new map is sprayed around the cluster. With cephfs
we want it to shut down any abortable requests that are in progress with
an -ENOSPC error as they'd just hang otherwise.

Add a new ceph_osdc_abort_on_full helper function to handle this. It
will first check whether there is an out-of-space condition in the
cluster and then walk the tree and abort any request that has
r_abort_on_full set with a -ENOSPC error. Call this new function
directly whenever we get a new OSD map.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton
a1f4020aab libceph: allow requests to return immediately on full conditions if caller wishes
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton
aa26d662b9 libceph: remove req->r_replay_version
Nothing uses this anymore with the removal of the ack vs. commit code.
Remove the field and just encode zeroes into place in the request
encoding.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Elena Reshetova
0e1a5ee657 libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova
02113a0f14 libceph: convert ceph_osd.o_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova
06dfa96399 libceph: convert ceph_snap_context.nref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov
d6a3408a77 libceph: supported_features module parameter
Add a readonly, exported to sysfs module parameter so that userspace
can generate meaningful error messages.  It's a bit funky, but there is
no other libceph-specific place.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov
74da4a0f57 libceph, ceph: always advertise all supported features
No reason to hide CephFS-specific features in the rbd case.  Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov
633ee407b9 libceph: force GFP_NOIO for socket allocations
sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    [<ffffffff816dd629>] schedule+0x29/0x70
    [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
    [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
    [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
    [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
    [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
    [<ffffffff81086335>] flush_work+0x165/0x250
    [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
    [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
    [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
    [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
    [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
    [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
    [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
    [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
    [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
    [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
    [<ffffffff8115af70>] shrink_slab+0x100/0x140
    [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
    [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
    [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
    [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
    [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
    [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
    [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
    [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
    [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
    [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
    [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
    [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
    [<ffffffff811d8566>] alloc_inode+0x26/0xa0
    [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
    [<ffffffff815b933e>] sock_alloc+0x1e/0x80
    [<ffffffff815ba855>] __sock_create+0x95/0x220
    [<ffffffff815baa04>] sock_create_kern+0x24/0x30
    [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
    [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
    [<ffffffff81084c19>] process_one_work+0x159/0x4f0
    [<ffffffff8108561b>] worker_thread+0x11b/0x530
    [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
    [<ffffffff8108b6f9>] kthread+0xc9/0xe0
    [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
    [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
    [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.

Cc: stable@vger.kernel.org # 3.10+, needs backporting
Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov <wintchester@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-03-23 12:03:36 +01:00
Ilya Dryomov
7cc5e38f2f libceph: osd_request_timeout option
osd_request_timeout specifies how many seconds to wait for a response
from OSDs before returning -ETIMEDOUT from an OSD request.  0 (default)
means no limit.

osd_request_timeout is osdkeepalive-precise -- in-flight requests are
swept through every osdkeepalive seconds.  With ack vs commit behaviour
gone, abort_request() is really simple.

This is based on a patch from Artur Molchanov <artur.molchanov@synesis.ru>.

Tested-by: Artur Molchanov <artur.molchanov@synesis.ru>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-07 14:30:38 +01:00
Ilya Dryomov
b581a5854e libceph: don't set weight to IN when OSD is destroyed
Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs.  Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c532869 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-07 14:30:38 +01:00
Ilya Dryomov
9afd30dbc8 libceph: fix crush_decode() for older maps
Older (shorter) CRUSH maps too need to be finalized.

Fixes: 66a0e2d579 ("crush: remove mutable part of CRUSH map")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-03-07 14:30:37 +01:00
Linus Torvalds
1827adb11a Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull sched.h split-up from Ingo Molnar:
 "The point of these changes is to significantly reduce the
  <linux/sched.h> header footprint, to speed up the kernel build and to
  have a cleaner header structure.

  After these changes the new <linux/sched.h>'s typical preprocessed
  size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
  lines), which is around 40% faster to build on typical configs.

  Not much changed from the last version (-v2) posted three weeks ago: I
  eliminated quirks, backmerged fixes plus I rebased it to an upstream
  SHA1 from yesterday that includes most changes queued up in -next plus
  all sched.h changes that were pending from Andrew.

  I've re-tested the series both on x86 and on cross-arch defconfigs,
  and did a bisectability test at a number of random points.

  I tried to test as many build configurations as possible, but some
  build breakage is probably still left - but it should be mostly
  limited to architectures that have no cross-compiler binaries
  available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-03 10:16:38 -08:00
Linus Torvalds
69fd110eb6 Merge branch 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs sendmsg updates from Al Viro:
 "More sendmsg work.

  This is a fairly separate isolated stuff (there's a continuation
  around lustre, but that one was too late to soak in -next), thus the
  separate pull request"

* 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ncpfs: switch to sock_sendmsg()
  ncpfs: don't mess with manually advancing iovec on send
  ncpfs: sendmsg does *not* bugger iovec these days
  ceph_tcp_sendpage(): use ITER_BVEC sendmsg
  afs_send_pages(): use ITER_BVEC
  rds: remove dead code
  ceph: switch to sock_recvmsg()
  usbip_recv(): switch to sock_recvmsg()
  iscsi_target: deal with short writes on the tx side
  [nbd] pass iov_iter to nbd_xmit()
  [nbd] switch sock_xmit() to sock_{send,recv}msg()
  [drbd] use sock_sendmsg()
2017-03-02 15:16:38 -08:00
Ingo Molnar
5b3cc15aff sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h>
Update the .c files that depend on these APIs.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:33 +01:00
Linus Torvalds
b2deee2dc0 This time around we have:
- support for rbd data-pool feature, which enables rbd images on
   erasure-coded pools (myself).  CEPH_PG_MAX_SIZE has been bumped to
   allow erasure-coded profiles with k+m up to 32.
 
 - a patch for ceph_d_revalidate() performance regression introduced in
   4.9, along with some cleanups in the area (Jeff Layton)
 
 - a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff Layton)
 
 - buffered reads are now processed in rsize windows instead of rasize
   windows (Andreas Gerstmayr).  The new default for rsize mount option
   is 64M.
 
 - ack vs commit distinction is gone, greatly simplifying ->fsync() and
   MOSDOpReply handling code (myself)
 
 Also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
 computations are still serialized though) and several minor fixes and
 cleanups all over.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYtY0rAAoJEEp/3jgCEfOLQioH/36QKsalquY1FCdJnJve9qj0
 q19OohamIedhv76AYvXhJzBBHlVwerjicE51/bSzuUhxV+ApdATrPPcLC22oLd3i
 h0R9NAUMYjiris1yN/Z9JRiPCSdsxvHuRycsUMRSRbxZhnyP9XdTxFD1A+fLfisU
 Z4osyTzadabVL5Um9maRBbAtXCWh3d9JZzPa5xIvWTEO4CWWk87GtEIIQDcgx+Y6
 8ZSMmrVFDNtskUp9js+LnFYW7/xBsEXyqgsqKaecf5uQqwu1WKRXSKtv9PUmGAIb
 HBrlUdV1PQaCzTYtaoztJshNdYcphM5L7gePzxRG0nXrTNsq8J5eCzI8en5qS8w=
 =CPL/
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "This time around we have:

   - support for rbd data-pool feature, which enables rbd images on
     erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to
     allow erasure-coded profiles with k+m up to 32.

   - a patch for ceph_d_revalidate() performance regression introduced
     in 4.9, along with some cleanups in the area (Jeff Layton)

   - a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff
     Layton)

   - buffered reads are now processed in rsize windows instead of rasize
     windows (Andreas Gerstmayr). The new default for rsize mount option
     is 64M.

   - ack vs commit distinction is gone, greatly simplifying ->fsync()
     and MOSDOpReply handling code (myself)

  ... also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
  computations are still serialized though) and several minor fixes and
  cleanups all over"

* tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client: (52 commits)
  libceph, rbd, ceph: WRITE | ONDISK -> WRITE
  libceph: get rid of ack vs commit
  ceph: remove special ack vs commit behavior
  ceph: tidy some white space in get_nonsnap_parent()
  crush: fix dprintk compilation
  crush: do is_out test only if we do not collide
  ceph: remove req from unsafe list when unregistering it
  rbd: constify device_type structure
  rbd: kill obj_request->object_name and rbd_segment_name_cache
  rbd: store and use obj_request->object_no
  rbd: RBD_V{1,2}_DATA_FORMAT macros
  rbd: factor out __rbd_osd_req_create()
  rbd: set offset and length outside of rbd_obj_request_create()
  rbd: support for data-pool feature
  rbd: introduce rbd_init_layout()
  rbd: use rbd_obj_bytes() more
  rbd: remove now unused rbd_obj_request_wait() and helpers
  rbd: switch rbd_obj_method_sync() to ceph_osdc_call()
  libceph: pass reply buffer length through ceph_osdc_call()
  rbd: do away with obj_request in rbd_obj_read_sync()
  ...
2017-02-28 15:36:09 -08:00
Ilya Dryomov
54ea0046b6 libceph, rbd, ceph: WRITE | ONDISK -> WRITE
CEPH_OSD_FLAG_ONDISK is set in account_request().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24 19:04:57 +01:00
Ilya Dryomov
b18b9550e4 libceph: get rid of ack vs commit
- CEPH_OSD_FLAG_ACK shouldn't be set anymore, so assert on it
- remove support for handling ack replies (OSDs will send ack replies
  only if clients request them)
- drop the "do lingering callbacks under osd->lock" logic from
  handle_reply() -- lreq->lock is sufficient in all three cases

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24 19:04:57 +01:00
Ilya Dryomov
7ba0487cca crush: fix dprintk compilation
The syntax error was not noticed because dprintk is a macro
and the code is discarded by default.

Reflects ceph.git commit f29b840c64a933b2cb13e3da6f3d785effd73a57.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-23 22:22:02 +01:00
Ilya Dryomov
98ba6af728 crush: do is_out test only if we do not collide
The is_out() test may require an additional hashing operation, so we
should skip it whenever possible.

Reflects ceph.git commit db107cc7f15cf2481894add325dc93e33479f529.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-23 22:22:02 +01:00
Linus Torvalds
42e1b14b6e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Implement wraparound-safe refcount_t and kref_t types based on
     generic atomic primitives (Peter Zijlstra)

   - Improve and fix the ww_mutex code (Nicolai Hähnle)

   - Add self-tests to the ww_mutex code (Chris Wilson)

   - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
     Bueso)

   - Micro-optimize the current-task logic all around the core kernel
     (Davidlohr Bueso)

   - Tidy up after recent optimizations: remove stale code and APIs,
     clean up the code (Waiman Long)

   - ... plus misc fixes, updates and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  fork: Fix task_struct alignment
  locking/spinlock/debug: Remove spinlock lockup detection code
  lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
  lkdtm: Convert to refcount_t testing
  kref: Implement 'struct kref' using refcount_t
  refcount_t: Introduce a special purpose refcount type
  sched/wake_q: Clarify queue reinit comment
  sched/wait, rcuwait: Fix typo in comment
  locking/mutex: Fix lockdep_assert_held() fail
  locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
  locking/rwsem: Reinit wake_q after use
  locking/rwsem: Remove unnecessary atomic_long_t casts
  jump_labels: Move header guard #endif down where it belongs
  locking/atomic, kref: Implement kref_put_lock()
  locking/ww_mutex: Turn off __must_check for now
  locking/atomic, kref: Avoid more abuse
  locking/atomic, kref: Use kref_get_unless_zero() more
  locking/atomic, kref: Kill kref_sub()
  locking/atomic, kref: Add kref_read()
  locking/atomic, kref: Add KREF_INIT()
  ...
2017-02-20 13:23:30 -08:00
Ilya Dryomov
2544a02090 libceph: pass reply buffer length through ceph_osdc_call()
To spare checking for "this reply fits into a page, but does it fit
into my buffer?" in some callers, osd_req_op_cls_response_data_pages()
needs to know how big it is.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20 12:16:13 +01:00
Ilya Dryomov
ef9324bb11 libceph: don't go through with the mapping if the PG is too wide
With EC overwrites maturing, the kernel client will be getting exposed
to potentially very wide EC pools.  While "min(pi->size, X)" works fine
when the cluster is stable and happy, truncating OSD sets interferes
with resend logic (ceph_is_new_interval(), etc).  Abort the mapping if
the pool is too wide, assigning the request to the homeless session.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-20 12:16:11 +01:00
Ilya Dryomov
743efcffff crush: merge working data and scratch
Much like Arlo Guthrie, I decided that one big pile is better than two
little piles.

Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:11 +01:00
Ilya Dryomov
66a0e2d579 crush: remove mutable part of CRUSH map
Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.

Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:11 +01:00
Ilya Dryomov
1b6a78b5b9 libceph: add osdmap_set_crush() helper
Simplify osdmap_decode() and osdmap_apply_incremental() a bit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:11 +01:00
Stafford Horne
19def166f3 libceph: remove unneeded stddef.h include
This was causing a build failure for openrisc when using musl and
gcc 5.4.0 since the file is not available in the toolchain.

It doesnt seem this is needed and removing it does not cause any build
warnings for me.

Signed-off-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:10 +01:00
Yan, Zheng
d641df819d ceph: update readpages osd request according to size of pages
add_to_page_cache_lru() can fails, so the actual pages to read
can be smaller than the initial size of osd request. We need to
update osd request size in that case.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-02-20 12:16:07 +01:00
Ilya Dryomov
7fea24c6d4 libceph: include linux/sched.h into crypto.c directly
Currently crypto.c gets linux/sched.h indirectly through linux/slab.h
from linux/kasan.h.  Include it directly for memalloc_noio_*() inlines.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:06 +01:00
Ilya Dryomov
124f930b8c libceph: make sure ceph_aes_crypt() IV is aligned
... otherwise the crypto stack will align it for us with a GFP_ATOMIC
allocation and a memcpy() -- see skcipher_walk_first().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-01-18 17:58:45 +01:00
Peter Zijlstra
2c935bc572 locking/atomic, kref: Add kref_read()
Since we need to change the implementation, stop exposing internals.

Provide kref_read() to read the current reference count; typically
used for debug messages.

Kills two anti-patterns:

	atomic_read(&kref->refcount)
	kref->refcount.counter

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:18 +01:00
Al Viro
61ff6e9b45 ceph_tcp_sendpage(): use ITER_BVEC sendmsg
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-26 21:35:41 -05:00
Al Viro
100803a84d ceph: switch to sock_recvmsg()
... and use ITER_BVEC instead of playing with kmap()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-26 21:35:38 -05:00
Ilya Dryomov
45ee2c1d66 libceph: remove now unused finish_request() wrapper
Kill the wrapper and rename __finish_request() to finish_request().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-14 22:39:08 +01:00
Ilya Dryomov
c297eb4269 libceph: always signal completion when done
r_safe_completion is currently, and has always been, signaled only if
on-disk ack was requested.  It's there for fsync and syncfs, which wait
for in-flight writes to flush - all data write requests set ONDISK.

However, the pool perm check code introduced in 4.2 sends a write
request with only ACK set.  An unfortunately timed syncfs can then hang
forever: r_safe_completion won't be signaled because only an unsafe
reply was requested.

We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
that is somewhat incomplete and yet another special case.  Instead,
rename this completion to r_done_completion and always signal it when
the OSD client is done with the request, whether unsafe, safe, or
error.  This is a bit cleaner and helps with the cancellation code.

Reported-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-14 22:39:08 +01:00
Tobias Klauser
f6c0d1a3ed crush: include mapper.h in mapper.c
Include linux/crush/mapper.h in crush/mapper.c to get the prototypes of
crush_find_rule and crush_do_rule which are defined there. This fixes
the following GCC warnings when building with 'W=1':

  net/ceph/crush/mapper.c:40:5: warning: no previous prototype for ‘crush_find_rule’ [-Wmissing-prototypes]
  net/ceph/crush/mapper.c:793:5: warning: no previous prototype for ‘crush_do_rule’ [-Wmissing-prototypes]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
[idryomov@gmail.com: corresponding !__KERNEL__ include]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-12 23:54:26 +01:00
Ilya Dryomov
b3bbd3f2ab libceph: no need to drop con->mutex for ->get_authorizer()
->get_authorizer(), ->verify_authorizer_reply(), ->sign_message() and
->check_message_signature() shouldn't be doing anything with or on the
connection (like closing it or sending messages).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov
0dde584882 libceph: drop len argument of *verify_authorizer_reply()
The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply.  Nothing sensible can be passed from the
messenger layer anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov
5c056fdc5b libceph: verify authorize reply on connect
After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the the messenger.

AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1b ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").

The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down.  Pass 0 to facilitate backporting, and kill
it in the next commit.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov
5418d0a2c8 libceph: no need for GFP_NOFS in ceph_monc_init()
It's called during inital setup, when everything should be allocated
with GFP_KERNEL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov
7af3ea189a libceph: stop allocating a new cipher on every crypto request
This is useless and more importantly not allowed on the writeback path,
because crypto_alloc_skcipher() allocates memory with GFP_KERNEL, which
can recurse back into the filesystem:

    kworker/9:3     D ffff92303f318180     0 20732      2 0x00000080
    Workqueue: ceph-msgr ceph_con_workfn [libceph]
     ffff923035dd4480 ffff923038f8a0c0 0000000000000001 000000009eb27318
     ffff92269eb28000 ffff92269eb27338 ffff923036b145ac ffff923035dd4480
     00000000ffffffff ffff923036b145b0 ffffffff951eb4e1 ffff923036b145a8
    Call Trace:
     [<ffffffff951eb4e1>] ? schedule+0x31/0x80
     [<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
     [<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
     [<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
     [<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
     [<ffffffff94d92ba5>] ? move_active_pages_to_lru+0x125/0x270
     [<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
     [<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
     [<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
     [<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
     [<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
     [<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
     [<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
     [<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
     [<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
     [<ffffffff94ddf42d>] ? cache_grow_begin+0x9d/0x560
     [<ffffffff94ddfb88>] ? fallback_alloc+0x148/0x1c0
     [<ffffffff94ed84e7>] ? __crypto_alloc_tfm+0x37/0x130
     [<ffffffff94de09db>] ? __kmalloc+0x1eb/0x580
     [<ffffffffc09fe2db>] ? crush_choose_firstn+0x3eb/0x470 [libceph]
     [<ffffffff94ed84e7>] ? __crypto_alloc_tfm+0x37/0x130
     [<ffffffff94ed9c19>] ? crypto_spawn_tfm+0x39/0x60
     [<ffffffffc08b30a3>] ? crypto_cbc_init_tfm+0x23/0x40 [cbc]
     [<ffffffff94ed857c>] ? __crypto_alloc_tfm+0xcc/0x130
     [<ffffffff94edcc23>] ? crypto_skcipher_init_tfm+0x113/0x180
     [<ffffffff94ed7cc3>] ? crypto_create_tfm+0x43/0xb0
     [<ffffffff94ed83b0>] ? crypto_larval_lookup+0x150/0x150
     [<ffffffff94ed7da2>] ? crypto_alloc_tfm+0x72/0x120
     [<ffffffffc0a01dd7>] ? ceph_aes_encrypt2+0x67/0x400 [libceph]
     [<ffffffffc09fd264>] ? ceph_pg_to_up_acting_osds+0x84/0x5b0 [libceph]
     [<ffffffff950d40a0>] ? release_sock+0x40/0x90
     [<ffffffff95139f94>] ? tcp_recvmsg+0x4b4/0xae0
     [<ffffffffc0a02714>] ? ceph_encrypt2+0x54/0xc0 [libceph]
     [<ffffffffc0a02b4d>] ? ceph_x_encrypt+0x5d/0x90 [libceph]
     [<ffffffffc0a02bdf>] ? calcu_signature+0x5f/0x90 [libceph]
     [<ffffffffc0a02ef5>] ? ceph_x_sign_message+0x35/0x50 [libceph]
     [<ffffffffc09e948c>] ? prepare_write_message_footer+0x5c/0xa0 [libceph]
     [<ffffffffc09ecd18>] ? ceph_con_workfn+0x2258/0x2dd0 [libceph]
     [<ffffffffc09e9903>] ? queue_con_delay+0x33/0xd0 [libceph]
     [<ffffffffc09f68ed>] ? __submit_request+0x20d/0x2f0 [libceph]
     [<ffffffffc09f6ef8>] ? ceph_osdc_start_request+0x28/0x30 [libceph]
     [<ffffffffc0b52603>] ? rbd_queue_workfn+0x2f3/0x350 [rbd]
     [<ffffffff94c94ec0>] ? process_one_work+0x160/0x410
     [<ffffffff94c951bd>] ? worker_thread+0x4d/0x480
     [<ffffffff94c95170>] ? process_one_work+0x410/0x410
     [<ffffffff94c9af8d>] ? kthread+0xcd/0xf0
     [<ffffffff951efb2f>] ? ret_from_fork+0x1f/0x40
     [<ffffffff94c9aec0>] ? kthread_create_on_node+0x190/0x190

Allocating the cipher along with the key fixes the issue - as long the
key doesn't change, a single cipher context can be used concurrently in
multiple requests.

We still can't take that GFP_KERNEL allocation though.  Both
ceph_crypto_key_clone() and ceph_crypto_key_decode() are called from
GFP_NOFS context, so resort to memalloc_noio_{save,restore}() here.

Reported-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:20 +01:00
Ilya Dryomov
6db2304aab libceph: uninline ceph_crypto_key_destroy()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:20 +01:00
Ilya Dryomov
2b1e1a7cd0 libceph: remove now unused ceph_*{en,de}crypt*() functions
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:20 +01:00
Ilya Dryomov
e15fd0a11d libceph: switch ceph_x_decrypt() to ceph_crypt()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
d03857c63b libceph: switch ceph_x_encrypt() to ceph_crypt()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
4eb4517ce7 libceph: tweak calcu_signature() a little
- replace an ad-hoc array with a struct
- rename to calc_signature() for consistency

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
7882a26d2e libceph: rename and align ceph_x_authorizer::reply_buf
It's going to be used as a temporary buffer for in-place en/decryption
with ceph_crypt() instead of on-stack buffers, so rename to enc_buf.
Ensure alignment to avoid GFP_ATOMIC allocations in the crypto stack.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
a45f795c65 libceph: introduce ceph_crypt() for in-place en/decryption
Starting with 4.9, kernel stacks may be vmalloced and therefore not
guaranteed to be physically contiguous; the new CONFIG_VMAP_STACK
option is enabled by default on x86.  This makes it invalid to use
on-stack buffers with the crypto scatterlist API, as sg_set_buf()
expects a logical address and won't work with vmalloced addresses.

There isn't a different (e.g. kvec-based) crypto API we could switch
net/ceph/crypto.c to and the current scatterlist.h API isn't getting
updated to accommodate this use case.  Allocating a new header and
padding for each operation is a non-starter, so do the en/decryption
in-place on a single pre-assembled (header + data + padding) heap
buffer.  This is explicitly supported by the crypto API:

    "... the caller may provide the same scatter/gather list for the
     plaintext and cipher text. After the completion of the cipher
     operation, the plaintext data is replaced with the ciphertext data
     in case of an encryption and vice versa for a decryption."

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
55d9cc834f libceph: introduce ceph_x_encrypt_offset()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
462e650451 libceph: old_key in process_one_ticket() is redundant
Since commit 0a990e7093 ("ceph: clean up service ticket decoding"),
th->session_key isn't assigned until everything is decoded.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
36721ece1e libceph: ceph_x_encrypt_buflen() takes in_len
Pass what's going to be encrypted - that's msg_b, not ticket_blob.
ceph_x_encrypt_buflen() returns the upper bound, so this doesn't change
the maxlen calculation, but makes it a bit clearer.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov
264048afab libceph: initialize last_linger_id with a large integer
osdc->last_linger_id is a counter for lreq->linger_id, which is used
for watch cookies.  Starting with a large integer should ease the task
of telling apart kernel and userspace clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:08 +01:00
Yan, Zheng
3890dce1d3 libceph: fix legacy layout decode with pool 0
If your data pool was pool 0, ceph_file_layout_from_legacy()
transform that to -1 unconditionally, which broke upgrades.
We only want do that for a fully zeroed ceph_file_layout,
so that it still maps to a file_layout_t.  If any fields
are set, though, we trust the fl_pgpool to be a valid pool.

Fixes: 7627151ea3 ("libceph: define new ceph_file_layout structure")
Link: http://tracker.ceph.com/issues/17825
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:08 +01:00
Lorenzo Stoakes
c164154f66 mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-18 14:13:37 -07:00
Ilya Dryomov
64f77566e1 crush: remove redundant local variable
Remove extra x1 variable, it's just temporary placeholder that
clutters the code unnecessarily.

Reflects ceph.git commit 0d19408d91dd747340d70287b4ef9efd89e95c6b.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-05 23:02:10 +02:00
Ilya Dryomov
74a5293832 crush: don't normalize input of crush_ln iteratively
Use __builtin_clz() supported by GCC and Clang to figure out
how many bits we should shift instead of shifting by a bit
in a loop until the value gets normalized. Improves performance
of this function by up to 3x in worst-case scenario and overall
straw2 performance by ~10%.

Reflects ceph.git commit 110de33ca497d94fc4737e5154d3fe781fa84a0a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-05 23:02:04 +02:00
Ilya Dryomov
464691bd52 libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
A static bug finder (EBA) on Linux 4.7:

    Double lock in net/ceph/auth.c
    second lock at 108: mutex_lock(& ac->mutex); [ceph_auth_build_hello]
    after calling from 263: ret = ceph_auth_build_hello(ac, msg_buf, msg_len);
    if ! ac->protocol -> true at 262
    first lock at 261: mutex_lock(& ac->mutex); [ceph_build_auth]

ceph_auth_build_hello() is never called, because the protocol is always
initialized, whether we are checking existing tickets (in delayed_work())
or getting new ones after invalidation (in invalidate_authorizer()).

Reported-by: Iago Abal <iari@itu.dk>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03 16:13:50 +02:00
Ilya Dryomov
fdc723e77b libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03 16:13:50 +02:00
Ilya Dryomov
005a07bf0a rbd: add 'client_addr' sysfs rbd device attribute
Export client addr/nonce, so userspace can check if a image is being
blacklisted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
[idryomov@gmail.com: ceph_client_addr(), endianess fix]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-08-24 23:49:16 +02:00
Ilya Dryomov
ed95b21a4b rbd: support for exclusive-lock feature
Add basic support for RBD_FEATURE_EXCLUSIVE_LOCK feature.  Maintenance
operations (resize, snapshot create, etc) are offloaded to librbd via
returning -EOPNOTSUPP - librbd should request the lock and execute the
operation.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
2016-08-24 23:49:16 +02:00
Ilya Dryomov
99d1694310 rbd: retry watch re-registration periodically
Revamp watch code to support retrying watch re-registration:

- add rbd_dev->watch_state for more robust errcb handling
- store watch cookie separately to avoid dereferencing watch_handle
  which is set to NULL on unwatch
- move re-register code into a delayed work and retry re-registration
  every second, unless the client is blacklisted

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
2016-08-24 23:49:16 +02:00
Ilya Dryomov
033268a5f0 libceph: rename ceph_client_id() -> ceph_client_gid()
It's gid / global_id in other places.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:16 +02:00
Douglas Fuller
6305a3b415 libceph: support for blacklisting clients
Reuse ceph_mon_generic_request infrastructure for sending monitor
commands.  In particular, add support for 'blacklist add' to prevent
other, non-responsive clients from making further updates.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller
d4ed4a5305 libceph: support for lock.lock_info
Add an interface for the Ceph OSD lock.lock_info method and associated
data structures.

Based heavily on code by Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller
f66241cb99 libceph: support for advisory locking on RADOS objects
This patch adds support for rados lock, unlock and break lock.

Based heavily on code by Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller
428a715811 libceph: add ceph_osdc_call() single-page helper
Add a convenience function to osd_client to send Ceph OSD
'class' ops. The interface assumes that the request and
reply data each consist of single pages.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller
a4ed38d7a1 libceph: support for CEPH_OSD_OP_LIST_WATCHERS
Add support for this Ceph OSD op, needed to support the RBD exclusive
lock feature.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Ilya Dryomov
f01d5cb24e libceph: rename ceph_entity_name_encode() -> ceph_auth_entity_name_encode()
Clear up EntityName vs entity_name_t confusion.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Wei Yongjun
864364a29c libceph: using kfree_rcu() to simplify the code
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.

Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-08-08 21:41:42 +02:00
Wei Yongjun
f52ec33cbd libceph: make cancel_generic_request() static
Fixes the following sparse warning:

net/ceph/mon_client.c:577:6: warning:
 symbol 'cancel_generic_request' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-08-08 21:41:42 +02:00
Wei Yongjun
c22e853a2e libceph: fix return value check in alloc_msg_with_page_vector()
In case of error, the function ceph_alloc_page_vector() returns
ERR_PTR() and never returns NULL. The NULL test in the return value
check should be replaced with IS_ERR().

Fixes: 1907920324 ('libceph: support for sending notifies')
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-08-08 21:41:41 +02:00
Yan, Zheng
0cabbd94ff libceph: fsmap.user subscription support
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 03:00:40 +02:00
Yan, Zheng
cd08e0a274 libceph: make sure redirect does not change namespace
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:37 +02:00
Yan, Zheng
30c156d995 libceph: rados pool namespace support
Add pool namesapce pointer to struct ceph_file_layout and struct
ceph_object_locator. Pool namespace is used by when mapping object
to PG, it's also used when composing OSD request.

The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:37 +02:00
Yan, Zheng
51e9273796 libceph: introduce reference counted string
The data structure is for storing namesapce string. It allows namespace
string to be shared between cephfs inodes with same layout. This data
structure can also be referenced by OSD request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:37 +02:00
Yan, Zheng
7627151ea3 libceph: define new ceph_file_layout structure
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:36 +02:00
Ilya Dryomov
b2aa5d0bc8 libceph: fix some missing includes
- decode.h needs slab.h for kmalloc()
- osd_client.h needs msgpool.h for struct ceph_msgpool
- msgpool.h doesn't need messenger.h

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:35 +02:00
Ilya Dryomov
930c532869 libceph: apply new_state before new_up_client on incrementals
Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like e.g.

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code

2087    for (i = 0; i < pg->pg_temp.len; i++) {
2088            if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
2089                    if (ceph_can_shift_osds(pi))
2090                            continue;
2091
2092                    temp->osds[temp->size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed

Fixes: http://tracker.ceph.com/issues/14901

Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc: libceph: set 'exists' flag for newly up osd
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2016-07-22 15:17:40 +02:00
Ilya Dryomov
4a3262b17c libceph: use %s instead of %pE in dout()s
Commit d30291b985 ("libceph: variable-sized ceph_object_id") changed
dout()s in what is now encode_request() and ceph_object_locator_to_pg()
to use %pE, mostly to document that, although all rbd and cephfs object
names are NULL-terminated strings, ceph_object_id will handle any RADOS
object name, including the one containing NULs, just fine.

However, it turns out that vbin_printf() can't handle anything but ints
and %s - all %p suffixes are ignored.  The buffer %p** points to isn't
recorded, resulting in trash in the messages if the buffer had been
reused by the time bstr_printf() got to it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-30 23:00:23 +02:00
Ilya Dryomov
dc045a9168 libceph: put request only if it's done in handle_reply()
handle_reply() may be called twice on the same request: on ack and then
on commit.  This occurs on btrfs-formatted OSDs or if cephfs sync write
path is triggered - CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK.

handle_reply() handles this with the help of done_request().

Fixes: 5aea3dcd50 ("libceph: a major OSD client update")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-30 23:00:23 +02:00
Ilya Dryomov
b7ec35b304 libceph: change ceph_osdmap_flag() to take osdc
For the benefit of every single caller, take osdc instead of map.
Also, now that osdc->osdmap can't ever be NULL, drop the check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-30 23:00:22 +02:00
Yan, Zheng
0e76abf21e libceph: make ceph_osdc_wait_request() uninterruptible
Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is use killale
wait function in ceph_osdc_wait_request().

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:40 +02:00
Zhang Zhuoyu
3b33f692c8 ceph: make logical calculation functions return bool
This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.

No functional change.

Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
2016-05-26 01:15:39 +02:00
Ilya Dryomov
737cc81ead libceph: support for subscribing to "mdsmap.<id>" maps
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:30 +02:00
Ilya Dryomov
7cca78c9dc libceph: replace ceph_monc_request_next_osdmap()
... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:30 +02:00
Ilya Dryomov
b4f3479569 libceph: take osdc->lock in osdmap_show() and dump flags in hex
There is now about a dozen CEPH_OSDMAP_* flags.  This is a debugging
interface, so just dump in hex instead of spelling each flag out.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:29 +02:00
Ilya Dryomov
4609245e26 libceph: pool deletion detection
This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:29 +02:00
Ilya Dryomov
d0b19705e9 libceph: async MON client generic requests
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion.  Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version().  ceph_monc_do_statfs() is
switched over and remains sync.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:29 +02:00
Ilya Dryomov
b07d3c4bd7 libceph: support for checking on status of watch
Implement ceph_osdc_watch_check() to be able to check on status of
watch.  Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:28 +02:00
Ilya Dryomov
1907920324 libceph: support for sending notifies
Implement ceph_osdc_notify() for sending notifies.

Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:28 +02:00
Ilya Dryomov
922dab6134 libceph, rbd: ceph_osd_linger_request, watch/notify v2
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol.  As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed.  watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().

A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request.  ceph_osd_event has been merged into
ceph_osd_linger_request.

All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.

ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.

Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:02 +02:00
Ilya Dryomov
42b0696527 libceph: wait_request_timeout()
The unwatch timeout is currently implemented in rbd.  With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:14:06 +02:00
Ilya Dryomov
3540bfdb30 libceph: request_init() and request_release_checks()
These are going to be used by request_reinit() code.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:14:05 +02:00
Ilya Dryomov
5aea3dcd50 libceph: a major OSD client update
This is a major sync up, up to ~Jewel.  The highlights are:

- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support

The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished.  This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:14:03 +02:00
Ilya Dryomov
9dd2845ccb libceph: protect osdc->osd_lru list with a spinlock
OSD client is getting moved from the big per-client lock to a set of
per-session locks.  The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:12:30 +02:00
Ilya Dryomov
7a28f59bf9 libceph: allocate ceph_osd with GFP_NOFAIL
create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.

If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:12:29 +02:00
Ilya Dryomov
0247a0cf3e libceph: osd_init() and osd_cleanup()
These are going to be used by homeless OSD sessions code.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:12:29 +02:00
Ilya Dryomov
42c1b12403 libceph: handle_one_map()
Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message.  This sets up the scene for the updated OSD
client.

Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.

We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:12:27 +02:00
Ilya Dryomov
e5253a7bde libceph: allocate dummy osdmap in ceph_osdc_init()
This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:29 +02:00
Ilya Dryomov
fbca963532 libceph: schedule tick from ceph_osdc_init()
Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:28 +02:00
Ilya Dryomov
b37ee1b9b8 libceph: move schedule_delayed_work() in ceph_osdc_init()
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails.  Call schedule_delayed_work() after everything is
setup, just before returning.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:28 +02:00
Ilya Dryomov
fe5da05e97 libceph: redo callbacks and factor out MOSDOpReply decoding
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
very confusing.  Redo this so that only one of them is called:

    ->r_unsafe_callback(true), on ack
    ->r_unsafe_callback(false), on commit

or

    ->r_callback, on ack|commit

Decode everything in decode_MOSDOpReply() to reduce clutter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:28 +02:00
Ilya Dryomov
85e084feb4 libceph: drop msg argument from ceph_osdc_callback_t
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success.  This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:27 +02:00
Ilya Dryomov
bb873b5391 libceph: switch to calc_target(), part 2
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target.  Encoding now happens within ceph_osdc_start_request().

Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded.  If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.

Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op().  While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:27 +02:00
Ilya Dryomov
a66dd38309 libceph: switch to calc_target(), part 1
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.

ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid.  This is fixed in
"libceph: switch to calc_target(), part 2".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:26 +02:00
Ilya Dryomov
63244fa123 libceph: introduce ceph_osd_request_target, calc_target()
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:26 +02:00
Ilya Dryomov
04812acf57 libceph: pi->min_size, pi->last_force_request_resend
Add and decode pi->min_size and pi->last_force_request_resend.  These
are going to be used by calc_target().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:26 +02:00
Ilya Dryomov
f984cb76cc libceph: make pgid_cmp() global
calc_target() code is going to need to know how to compare PGs.  Take
lhs and rhs pgid by const * while at it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:25 +02:00
Ilya Dryomov
f81f16339a libceph: rename ceph_calc_pg_primary()
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:25 +02:00
Ilya Dryomov
6f3bfd45cd libceph: ceph_osds, ceph_pg_to_up_acting_osds()
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes.  This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around.  Introduce and switch to ceph_osds to help with that.

Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:25 +02:00
Ilya Dryomov
d9591f5e28 libceph: rename ceph_oloc_oid_to_pg()
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:24 +02:00
Ilya Dryomov
fcd00b68bb libceph: DEFINE_RB_FUNCS macro
Given

    struct foo {
        u64 id;
        struct rb_node bar_node;
    };

generate insert_bar(), erase_bar() and lookup_bar() functions with

    DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)

The key is assumed to be an integer (u64, int, etc), compared with
< and >.  nodefld has to be initialized with RB_CLEAR_NODE().

Start using it for MDS, MON and OSD requests and OSD sessions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:23 +02:00
Ilya Dryomov
42a2c09f2b libceph: open-code remove_{all,old}_osds()
They are called only once, from ceph_osdc_stop() and
handle_osds_timeout() respectively.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:23 +02:00
Ilya Dryomov
0c0a8de13f libceph: nuke unused fields and functions
Either unused or useless:

    osdmap->mkfs_epoch
    osd->o_marked_for_keepalive
    monc->num_generic_requests
    osdc->map_waiters
    osdc->last_requested_map
    osdc->timeout_tid

    osd_req_op_cls_response_data()

    osdmap_apply_incremental() @msgr arg

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:23 +02:00
Ilya Dryomov
d30291b985 libceph: variable-sized ceph_object_id
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
expect one - long rbd image names:

- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"

We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh.  (A format 2 header name is
a small system-generated string.)

Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string.  Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:22 +02:00
Ilya Dryomov
711da55d36 libceph: change how osd_op_reply message size is calculated
For a message pool message, preallocate a page, just like we do for
osd_op.  For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:21 +02:00
Ilya Dryomov
13d1ad16d0 libceph: move message allocation out of ceph_osdc_alloc_request()
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed.  Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:

    req = ceph_osdc_alloc_request(...);
    <fill in req->r_base_oid>
    <fill in req->r_base_oloc>
    ceph_osdc_alloc_messages(req);

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:21 +02:00
Ilya Dryomov
841272825b libceph: grab snapc in ceph_osdc_alloc_request()
ceph_osdc_build_request() is going away.  Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:21 +02:00
Ilya Dryomov
3ed97d6345 libceph: make ceph_osdc_put_request() accept NULL
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:20 +02:00
Ilya Dryomov
6c1ea260f8 libceph: make authorizer destruction independent of ceph_auth_client
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
    [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
    [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
    [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
    [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
    [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
    [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
    [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
    [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
    [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
    [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
    [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
    [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
    [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
          ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services.  By the time
we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session.  Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op.  Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.

Fixes: http://tracker.ceph.com/issues/15447

Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-04-25 20:54:13 +02:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Linus Torvalds
d5a38f6e46 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is quite a bit here, including some overdue refactoring and
  cleanup on the mon_client and osd_client code from Ilya, scattered
  writeback support for CephFS and a pile of bug fixes from Zheng, and a
  few random cleanups and fixes from others"

[ I already decided not to pull this because of it having been rebased
  recently, but ended up changing my mind after all.  Next time I'll
  really hold people to it.  Oh well.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
  libceph: use KMEM_CACHE macro
  ceph: use kmem_cache_zalloc
  rbd: use KMEM_CACHE macro
  ceph: use lookup request to revalidate dentry
  ceph: kill ceph_get_dentry_parent_inode()
  ceph: fix security xattr deadlock
  ceph: don't request vxattrs from MDS
  ceph: fix mounting same fs multiple times
  ceph: remove unnecessary NULL check
  ceph: avoid updating directory inode's i_size accidentally
  ceph: fix race during filling readdir cache
  libceph: use sizeof_footer() more
  ceph: kill ceph_empty_snapc
  ceph: fix a wrong comparison
  ceph: replace CURRENT_TIME by current_fs_time()
  ceph: scattered page writeback
  libceph: add helper that duplicates last extent operation
  libceph: enable large, variable-sized OSD requests
  libceph: osdc->req_mempool should be backed by a slab pool
  libceph: make r_request msg_size calculation clearer
  ...
2016-03-26 15:53:16 -07:00
Geliang Tang
5ee61e95b6 libceph: use KMEM_CACHE macro
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:57 +01:00
Ilya Dryomov
89f081730c libceph: use sizeof_footer() more
Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke().  Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-03-25 18:51:53 +01:00
Yan, Zheng
2c63f49a72 libceph: add helper that duplicates last extent operation
This helper duplicates last extent operation in OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scatterd page writeback, which adds nonconsecutive dirty
pages to single OSD request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov
3f1af42ad0 libceph: enable large, variable-sized OSD requests
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests.  The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once.  ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc().  req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov
9e767adbd3 libceph: osdc->req_mempool should be backed by a slab pool
ceph_osd_request_cache was introduced a long time ago.  Also, osd_req
is about to get a flexible array member, which ceph_osd_request_cache
is going to be aware of.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov
ae458f5a17 libceph: make r_request msg_size calculation clearer
Although msg_size is calculated correctly, the terms are grouped in
a misleading way - snaps appears to not have room for a u32 length.
Move calculation closer to its use and regroup terms.

No functional change.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:42 +01:00
Yan, Zheng
7665d85b73 libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op
This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:42 +01:00
Ilya Dryomov
de2aa102ea libceph: rename ceph_osd_req_op::payload_len to indata_len
Follow userspace nomenclature on this - the next commit adds
outdata_len.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:41 +01:00
Ilya Dryomov
b5d91704f5 libceph: behave in mon_fault() if cur_mon < 0
This can happen if __close_session() in ceph_monc_stop() races with
a connection reset.  We need to ignore such faults, otherwise it's
likely we would take !hunting, call __schedule_delayed() and end up
with delayed_work() executing on invalid memory, among other things.

The (two!) con->private tests are useless, as nothing ever clears
con->private.  Nuke them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov
bee3a37c47 libceph: reschedule tick in mon_fault()
Doing __schedule_delayed() in the hunting branch is pointless, as the
tick will have already been scheduled by then.

What we need to do instead is *reschedule* it in the !hunting branch,
after reopen_session() changes hunt_mult, which affects the delay.
This helps with spacing out connection attempts and avoiding things
like two back-to-back attempts followed by a longer period of waiting
around.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov
1752b50ca2 libceph: introduce and switch to reopen_session()
hunting is now set in __open_session() and cleared in finish_hunting(),
instead of all around.  The "session lost" message is printed not only
on connection resets, but also on keepalive timeouts.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov
168b9090c7 libceph: monc hunt rate is 3s with backoff up to 30s
Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multipler by 50%.

Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov
58d81b1294 libceph: monc ping rate is 10s
Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.

Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov
0e04dc26cc libceph: pick a different monitor when reconnecting
Don't try to reconnect to the same monitor when we fail to establish
a session within a timeout or it's lost.

For that, pick_new_mon() needs to see the old value of cur_mon, so
don't clear it in __close_session() - all calls to __close_session()
but one are followed by __open_session() anyway.  __open_session() is
only called when a new session needs to be established, so the "already
open?" branch, which is now in the way, is simply dropped.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Ilya Dryomov
82dcabad75 libceph: revamp subs code, switch to SUBSCRIBE2 protocol
It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime".  To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs.  Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0).  SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Ilya Dryomov
0f9af169a1 libceph: decouple hunting and subs management
Coupling hunting state with subscribe state is not a good idea.  Clear
hunting when we complete the authentication handshake.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Ilya Dryomov
02ac956c42 libceph: move debugfs initialization into __ceph_open_session()
Our debugfs dir name is a concatenation of cluster fsid and client
unique ID ("global_id").  It used to be the case that we learned
global_id first, nowadays we always learn fsid first - the monmap is
sent before any auth replies are.  ceph_debugfs_client_init() call in
ceph_monc_handle_map() is therefore never executed and can be removed.

Its counterpart in handle_auth_reply() doesn't really belong there
either: having to do monc->client and unlocking early to work around
lockdep is a testament to that.  Move it into __ceph_open_session(),
where it can be called unconditionally.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:37 +01:00
Linus Torvalds
643ad15d47 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 protection key support from Ingo Molnar:
 "This tree adds support for a new memory protection hardware feature
  that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

  There's a background article at LWN.net:

      https://lwn.net/Articles/643797/

  The gist is that protection keys allow the encoding of
  user-controllable permission masks in the pte.  So instead of having a
  fixed protection mask in the pte (which needs a system call to change
  and works on a per page basis), the user can map a (handful of)
  protection mask variants and can change the masks runtime relatively
  cheaply, without having to change every single page in the affected
  virtual memory range.

  This allows the dynamic switching of the protection bits of large
  amounts of virtual memory, via user-space instructions.  It also
  allows more precise control of MMU permission bits: for example the
  executable bit is separate from the read bit (see more about that
  below).

  This tree adds the MM infrastructure and low level x86 glue needed for
  that, plus it adds a high level API to make use of protection keys -
  if a user-space application calls:

        mmap(..., PROT_EXEC);

  or

        mprotect(ptr, sz, PROT_EXEC);

  (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
  this special case, and will set a special protection key on this
  memory range.  It also sets the appropriate bits in the Protection
  Keys User Rights (PKRU) register so that the memory becomes unreadable
  and unwritable.

  So using protection keys the kernel is able to implement 'true'
  PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
  PROT_READ as well.  Unreadable executable mappings have security
  advantages: they cannot be read via information leaks to figure out
  ASLR details, nor can they be scanned for ROP gadgets - and they
  cannot be used by exploits for data purposes either.

  We know about no user-space code that relies on pure PROT_EXEC
  mappings today, but binary loaders could start making use of this new
  feature to map binaries and libraries in a more secure fashion.

  There is other pending pkeys work that offers more high level system
  call APIs to manage protection keys - but those are not part of this
  pull request.

  Right now there's a Kconfig that controls this feature
  (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
  (like most x86 CPU feature enablement code that has no runtime
  overhead), but it's not user-configurable at the moment.  If there's
  any serious problem with this then we can make it configurable and/or
  flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
  mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
  x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
  mm/core, x86/mm/pkeys: Add execute-only protection keys support
  x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  x86/mm/pkeys: Allow kernel to modify user pkey rights register
  x86/fpu: Allow setting of XSAVE state
  x86/mm: Factor out LDT init from context init
  mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  x86/mm/pkeys: Add Kconfig prompt to existing config option
  x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  x86/mm/pkeys: Dump PKRU with other kernel registers
  mm/core, x86/mm/pkeys: Differentiate instruction fetches
  x86/mm/pkeys: Optimize fault handling in access_error()
  mm/core: Do not enforce PKEY permissions on remote mm access
  um, pkeys: Add UML arch_*_access_permitted() methods
  mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  x86/mm/gup: Simplify get_user_pages() PTE bit handling
  ...
2016-03-20 19:08:56 -07:00
Linus Torvalds
70477371dc Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 4.6:

  API:
   - Convert remaining crypto_hash users to shash or ahash, also convert
     blkcipher/ablkcipher users to skcipher.
   - Remove crypto_hash interface.
   - Remove crypto_pcomp interface.
   - Add crypto engine for async cipher drivers.
   - Add akcipher documentation.
   - Add skcipher documentation.

  Algorithms:
   - Rename crypto/crc32 to avoid name clash with lib/crc32.
   - Fix bug in keywrap where we zero the wrong pointer.

  Drivers:
   - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
   - Add PIC32 hwrng driver.
   - Support BCM6368 in bcm63xx hwrng driver.
   - Pack structs for 32-bit compat users in qat.
   - Use crypto engine in omap-aes.
   - Add support for sama5d2x SoCs in atmel-sha.
   - Make atmel-sha available again.
   - Make sahara hashing available again.
   - Make ccp hashing available again.
   - Make sha1-mb available again.
   - Add support for multiple devices in ccp.
   - Improve DMA performance in caam.
   - Add hashing support to rockchip"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: qat - remove redundant arbiter configuration
  crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
  crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
  crypto: qat - Change the definition of icp_qat_uof_regtype
  hwrng: exynos - use __maybe_unused to hide pm functions
  crypto: ccp - Add abstraction for device-specific calls
  crypto: ccp - CCP versioning support
  crypto: ccp - Support for multiple CCPs
  crypto: ccp - Remove check for x86 family and model
  crypto: ccp - memset request context to zero during import
  lib/mpi: use "static inline" instead of "extern inline"
  lib/mpi: avoid assembler warning
  hwrng: bcm63xx - fix non device tree compatibility
  crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
  crypto: qat - The AE id should be less than the maximal AE number
  lib/mpi: Endianness fix
  crypto: rockchip - add hash support for crypto engine in rk3288
  crypto: xts - fix compile errors
  crypto: doc - add skcipher API documentation
  crypto: doc - update AEAD AD handling
  ...
2016-03-17 11:22:54 -07:00
Ilya Dryomov
cd8140c673 libceph: don't spam dmesg with stray reply warnings
Commit d15f9d694b ("libceph: check data_len in ->alloc_msg()")
mistakenly bumped the log level on the "tid %llu unknown, skipping"
message.  Turn it back into a dout() - stray replies are perfectly
normal when OSDs flap, crash, get killed for testing purposes, etc.

Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-02-24 20:28:51 +01:00
Ilya Dryomov
dbc0d3caff libceph: use the right footer size when skipping a message
ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13.
Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated.

Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-02-24 20:28:40 +01:00
Ilya Dryomov
e7a88e82fe libceph: don't bail early from try_read() when skipping a message
The contract between try_read() and try_write() is that when called
each processes as much data as possible.  When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more.  try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.

Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari <Varada.Kari@sandisk.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Varada Kari <Varada.Kari@sandisk.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-02-24 20:28:31 +01:00
Dave Hansen
d4edcf0d56 mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm
We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called.  For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)

This patch switches all callers of:

	get_user_pages()
	get_user_pages_unlocked()
	get_user_pages_locked()

to stop passing tsk/mm so they will no longer see the warnings.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-16 10:11:12 +01:00
Ilya Dryomov
b0b31a8ffe libceph: MOSDOpReply v7 encoding
Empty request_redirect_t (struct ceph_request_redirect in the kernel
client) is now encoded with a bool.  NEW_OSDOPREPLY_ENCODING feature
bit overlaps with already supported CRUSH_TUNABLES5.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04 18:26:08 +01:00
Ilya Dryomov
b9b519b78c crush: decode and initialize chooseleaf_stable
Also add missing \n while at it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04 18:25:58 +01:00
Ilya Dryomov
dc6ae6d8e7 crush: add chooseleaf_stable tunable
Add a tunable to fix the bug that chooseleaf may cause unnecessary pg
migrations when some device fails.

Reflects ceph.git commit fdb3f664448e80d984470f32f04e2e6f03ab52ec.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04 18:25:55 +01:00
Ilya Dryomov
56a4f3091d crush: ensure take bucket value is valid
Ensure that the take argument is a valid bucket ID before indexing the
buckets array.

Reflects ceph.git commit 93ec538e8a667699876b72459b8ad78966d89c61.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04 18:25:50 +01:00
Ilya Dryomov
f224a6915f crush: ensure bucket id is valid before indexing buckets array
We were indexing the buckets array without verifying the index was
within the [0,max_buckets) range.  This could happen because
a multistep rule does not have enough buckets and has CRUSH_ITEM_NONE
for an intermediate result, which would feed in CRUSH_ITEM_NONE and
make us crash.

Reflects ceph.git commit 976a24a326da8931e689ee22fce35feab5b67b76.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04 18:25:23 +01:00
Ilya Dryomov
e9f6452e0e libceph: Remove unnecessary ivsize variables
This patch removes the unnecessary ivsize variabls as they always
have the value of AES_BLOCK_SIZE.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27 20:36:25 +08:00
Herbert Xu
e59dd982d3 libceph: Use skcipher
This patch replaces uses of blkcipher with skcipher.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27 20:36:05 +08:00
Ilya Dryomov
7e01726a68 libceph: remove outdated comment
MClientMount{,Ack} are long gone.  The receipt of bare monmap doesn't
actually indicate a mount success as we are yet to authenticate at that
point in time.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-01-21 19:36:09 +01:00
Ilya Dryomov
f6cdb2928d libceph: kill off ceph_x_ticket_handler::validity
With it gone, no need to preserve ceph_timespec in process_one_ticket()
either.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21 19:36:09 +01:00
Ilya Dryomov
187d131dd9 libceph: invalidate AUTH in addition to a service ticket
If we fault due to authentication, we invalidate the service ticket we
have and request a new one - the idea being that if a service rejected
our authorizer, it must have expired, despite mon_client's attempts at
periodic renewal.  (The other possibility is that our ticket is too new
and the service hasn't gotten it yet, in which case invalidating isn't
necessary but doesn't hurt.)

Invalidating just the service ticket is not enough, though.  If we
assume a failure on mon_client's part to renew a service ticket, we
have to assume the same for the AUTH ticket.  If our AUTH ticket is
bad, we won't get any service tickets no matter how hard we try, so
invalidate AUTH ticket along with the service ticket.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21 19:36:09 +01:00
Ilya Dryomov
6abe097db5 libceph: fix authorizer invalidation, take 2
Back in 2013, commit 4b8e8b5d78 ("libceph: fix authorizer
invalidation") tried to fix authorizer invalidation issues by clearing
validity field.  However, nothing ever consults this field, so it
doesn't force us to request any new secrets in any way and therefore we
never get out of the exponential backoff mode:

    [  129.973812] libceph: osd2 192.168.122.1:6810 connect authorization failure
    [  130.706785] libceph: osd2 192.168.122.1:6810 connect authorization failure
    [  131.710088] libceph: osd2 192.168.122.1:6810 connect authorization failure
    [  133.708321] libceph: osd2 192.168.122.1:6810 connect authorization failure
    [  137.706598] libceph: osd2 192.168.122.1:6810 connect authorization failure
    ...

AFAICT this was the case at the time 4b8e8b5d78 was merged, too.

Using timespec solely as a bool isn't nice, so introduce a new have_key
flag, specifically for this purpose.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21 19:36:08 +01:00
Ilya Dryomov
f6330cc1f0 libceph: clear messenger auth_retry flag if we fault
Commit 20e55c4cc7 ("libceph: clear messenger auth_retry flag when we
authenticate") got us only half way there.  We clear the flag if the
second attempt succeeds, but it also needs to be cleared if that
attempt fails, to allow for the exponential backoff to kick in.
Otherwise, if ->should_authenticate() thinks our keys are valid, we
will busy loop, incrementing auth_retry to no avail:

    process_connect ffff880079a63830 got BADAUTHORIZER attempt 1
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 2
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 3
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 4
    process_connect ffff880079a63830 got BADAUTHORIZER attempt 5
    ...

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21 19:36:08 +01:00
Ilya Dryomov
67645d7619 libceph: fix ceph_msg_revoke()
There are a number of problems with revoking a "was sending" message:

(1) We never make any attempt to revoke data - only kvecs contibute to
con->out_skip.  However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle.  If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion.  The effects vary, the most common one
is the eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and
treated by the OSD as a bunch of data bytes.  This is what Matt ran
into.

(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong.  If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke.  Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.

(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one between
con->out_skip is set in ceph_msg_revoke() and before it's cleared in
write_partial_skip().

Fixing (1) and (3) is trivial.  The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called.  That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs.  Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.

Cc: stable@vger.kernel.org # 4.0+
Reported-by: Matt Conner <matt.conner@keepertech.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Matt Conner <matt.conner@keepertech.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21 19:36:08 +01:00
Geliang Tang
10bcee149f libceph: use list_for_each_entry_safe
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
[idryomov@gmail.com: nuke call to list_splice_init() as well]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-01-21 19:36:08 +01:00
Geliang Tang
17ddc49b9c libceph: use list_next_entry instead of list_entry_next
list_next_entry has been defined in list.h, so I replace list_entry_next
with it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-01-21 19:36:07 +01:00
Linus Torvalds
ca4ba96e02 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There are several patches from Ilya fixing RBD allocation lifecycle
  issues, a series adding a nocephx_sign_messages option (and associated
  bug fixes/cleanups), several patches from Zheng improving the
  (directory) fsync behavior, a big improvement in IO for direct-io
  requests when striping is enabled from Caifeng, and several other
  small fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: clear msg->con in ceph_msg_release() only
  libceph: add nocephx_sign_messages option
  libceph: stop duplicating client fields in messenger
  libceph: drop authorizer check from cephx msg signing routines
  libceph: msg signing callouts don't need con argument
  libceph: evaluate osd_req_op_data() arguments only once
  ceph: make fsync() wait unsafe requests that created/modified inode
  ceph: add request to i_unsafe_dirops when getting unsafe reply
  libceph: introduce ceph_x_authorizer_cleanup()
  ceph: don't invalidate page cache when inode is no longer used
  rbd: remove duplicate calls to rbd_dev_mapping_clear()
  rbd: set device_type::release instead of device::release
  rbd: don't free rbd_dev outside of the release callback
  rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
  libceph: use local variable cursor instead of &msg->cursor
  libceph: remove con argument in handle_reply()
  ceph: combine as many iovec as possile into one OSD request
  ceph: fix message length computation
  ceph: fix a comment typo
  rbd: drop null test before destroy functions
2015-11-13 09:24:40 -08:00
Linus Torvalds
1873499e13 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem update from James Morris:
 "This is mostly maintenance updates across the subsystem, with a
  notable update for TPM 2.0, and addition of Jarkko Sakkinen as a
  maintainer of that"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits)
  apparmor: clarify CRYPTO dependency
  selinux: Use a kmem_cache for allocation struct file_security_struct
  selinux: ioctl_has_perm should be static
  selinux: use sprintf return value
  selinux: use kstrdup() in security_get_bools()
  selinux: use kmemdup in security_sid_to_context_core()
  selinux: remove pointless cast in selinux_inode_setsecurity()
  selinux: introduce security_context_str_to_sid
  selinux: do not check open perm on ftruncate call
  selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default
  KEYS: Merge the type-specific data with the payload data
  KEYS: Provide a script to extract a module signature
  KEYS: Provide a script to extract the sys cert list from a vmlinux file
  keys: Be more consistent in selection of union members used
  certs: add .gitignore to stop git nagging about x509_certificate_list
  KEYS: use kvfree() in add_key
  Smack: limited capability for changing process label
  TPM: remove unnecessary little endian conversion
  vTPM: support little endian guests
  char: Drop owner assignment from i2c_driver
  ...
2015-11-05 15:32:38 -08:00
Ilya Dryomov
583d0fef75 libceph: clear msg->con in ceph_msg_release() only
The following bit in ceph_msg_revoke_incoming() is unsafe:

    struct ceph_connection *con = msg->con;
    if (!con)
            return;
    mutex_lock(&con->mutex);
    <more msg->con use>

There is nothing preventing con from getting destroyed right after
msg->con test.  One easy way to reproduce this is to disable message
signing only on the server side and try to map an image.  The system
will go into a

    libceph: read_partial_message ffff880073f0ab68 signature check failed
    libceph: osd0 192.168.255.155:6801 bad crc/signature
    libceph: read_partial_message ffff880073f0ab68 signature check failed
    libceph: osd0 192.168.255.155:6801 bad crc/signature

loop which has to be interrupted with Ctrl-C.  Hit Ctrl-C and you are
likely to end up with a random GP fault if the reset handler executes
"within" ceph_msg_revoke_incoming():

                     <yet another reply w/o a signature>
                                   ...
          <Ctrl-C>
    rbd_obj_request_end
      ceph_osdc_cancel_request
        __unregister_request
          ceph_osdc_put_request
            ceph_msg_revoke_incoming
                                   ...
                                osd_reset
                                  __kick_osd_requests
                                    __reset_osd
                                      remove_osd
                                        ceph_con_close
                                          reset_connection
                                            <clear con->in_msg->con>
                                            <put con ref>
                                              put_osd
                                                <free osd/con>
              <msg->con use> <-- !!!

If ceph_msg_revoke_incoming() executes "before" the reset handler,
osd/con will be leaked because ceph_msg_revoke_incoming() clears
con->in_msg but doesn't put con ref, while reset_connection() only puts
con ref if con->in_msg != NULL.

The current msg->con scheme was introduced by commits 38941f8031
("libceph: have messages point to their connection") and 92ce034b5a
("libceph: have messages take a connection reference"), which defined
when messages get associated with a connection and when that
association goes away.  Part of the problem is that this association is
supposed to go away in much too many places; closing this race entirely
requires either a rework of the existing or an addition of a new layer
of synchronization.

In lieu of that, we can make it *much* less likely to hit by
disassociating messages only on their destruction and resend through
a different connection.  This makes the code simpler and is probably
a good thing to do regardless - this patch adds a msg_con_set() helper
which is is called from only three places: ceph_con_send() and
ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear
it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:46 +01:00
Ilya Dryomov
a51983e4dd libceph: add nocephx_sign_messages option
Support for message signing was merged into 3.19, along with
nocephx_require_signatures option.  But, all that option does is allow
the kernel client to talk to clusters that don't support MSG_AUTH
feature bit.  That's pretty useless, given that it's been supported
since bobtail.

Meanwhile, if one disables message signing on the server side with
"cephx sign messages = false", it becomes impossible to use the kernel
client since it expects messages to be signed if MSG_AUTH was
negotiated.  Add nocephx_sign_messages option to support this use case.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:46 +01:00
Ilya Dryomov
859bff51dc libceph: stop duplicating client fields in messenger
supported_features and required_features serve no purpose at all, while
nocrc and tcp_nodelay belong to ceph_options::flags.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:46 +01:00
Ilya Dryomov
4199b8eec3 libceph: drop authorizer check from cephx msg signing routines
I don't see a way for auth->authorizer to be NULL in
ceph_x_sign_message() or ceph_x_check_message_signature().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:46 +01:00
Ilya Dryomov
79dbd1baa6 libceph: msg signing callouts don't need con argument
We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set.  We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:45 +01:00
Ioana Ciornei
8a703a383d libceph: evaluate osd_req_op_data() arguments only once
This patch changes the osd_req_op_data() macro to not evaluate
arguments more than once in order to follow the kernel coding style.

Signed-off-by: Ioana Ciornei <ciorneiioana@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
[idryomov@gmail.com: changelog, formatting]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:36:49 +01:00
Ilya Dryomov
cbf99a11fb libceph: introduce ceph_x_authorizer_cleanup()
Commit ae385eaf24 ("libceph: store session key in cephx authorizer")
introduced ceph_x_authorizer::session_key, but didn't update all the
exit/error paths.  Introduce ceph_x_authorizer_cleanup() to encapsulate
ceph_x_authorizer cleanup and switch to it.  This fixes ceph_x_destroy(),
which currently always leaks key and ceph_x_build_authorizer() error
paths.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00
Shraddha Barke
343128ce91 libceph: use local variable cursor instead of &msg->cursor
Use local variable cursor in place of &msg->cursor in
read_partial_msg_data() and write_partial_msg_data().

Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:36:47 +01:00
Shraddha Barke
70cf052d0c libceph: remove con argument in handle_reply()
Since handle_reply() does not use its con argument, remove it.

Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:36:47 +01:00
David Howells
146aa8b145 KEYS: Merge the type-specific data with the payload data
Merge the type-specific data with the payload data into one four-word chunk
as it seems pointless to keep them separate.

Use user_key_payload() for accessing the payloads of overloaded
user-defined keys.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
cc: ecryptfs@vger.kernel.org
cc: linux-ext4@vger.kernel.org
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-devel@lists.sourceforge.net
2015-10-21 15:18:36 +01:00
Ilya Dryomov
e30b7577bf rbd: use writefull op for object size writes
This covers only the simplest case - an object size sized write, but
it's still useful in tiering setups when EC is used for the base tier
as writefull op can be proxied, saving an object promotion.

Even though updating ceph_osdc_new_request() to allow writefull should
just be a matter of fixing an assert, I didn't do it because its only
user is cephfs.  All other sites were updated.

Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-10-16 16:49:01 +02:00
Ilya Dryomov
7f61f54565 libceph: don't access invalid memory in keepalive2 path
This

    struct ceph_timespec ceph_ts;
    ...
    con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);

wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet
ceph_ts becomes invalid on return from prepare_write_keepalive().  As
a result, we send out bogus keepalive2 stamps.  Fix this by encoding
into a ceph_timespec member, similar to how acks are read and written.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2015-09-17 20:14:15 +03:00
Linus Torvalds
e013f74b60 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few fixes for snapshot behavior with CephFS and support
  for the new keepalive protocol from Zheng, a libceph fix that affects
  both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
  and several small fixes and cleanups from Jianpeng and others"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: improve readahead for file holes
  ceph: get inode size for each append write
  libceph: check data_len in ->alloc_msg()
  libceph: use keepalive2 to verify the mon session is alive
  rbd: plug rbd_dev->header.object_prefix memory leak
  rbd: fix double free on rbd_dev->header_name
  libceph: set 'exists' flag for newly up osd
  ceph: cleanup use of ceph_msg_get
  ceph: no need to get parent inode in ceph_open
  ceph: remove the useless judgement
  ceph: remove redundant test of head->safe and silence static analysis warnings
  ceph: fix queuing inode to mdsdir's snaprealm
  libceph: rename con_work() to ceph_con_workfn()
  libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
  libceph: remove the unused macro AES_KEY_SIZE
  ceph: invalidate dirty pages after forced umount
  ceph: EIO all operations after forced umount
2015-09-11 12:33:03 -07:00
Ilya Dryomov
d15f9d694b libceph: check data_len in ->alloc_msg()
Only ->alloc_msg() should check data_len of the incoming message
against the preallocated ceph_msg, doing it in the messenger is not
right.  The contract is that either ->alloc_msg() returns a ceph_msg
which will fit all of the portions of the incoming message, or it
returns NULL and possibly sets skip, signaling whether NULL is due to
an -ENOMEM.  ->alloc_msg() should be the only place where we make the
skip/no-skip decision.

I stumbled upon this while looking at con/osd ref counting.  Right now,
if we get a non-extent message with a larger data portion than we are
prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip
it in the messenger, we don't put the con/osd ref acquired in
ceph_con_in_msg_alloc() (which is normally put in process_message()),
so this also fixes a memory leak.

An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't
corrupt random memory should a buggy ->alloc_msg() return an unfit
ceph_msg.

While at it, I changed the "unknown tid" dout() to a pr_warn() to make
sure all skips are seen and unified format strings.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-09-09 09:52:17 +03:00
Yan, Zheng
8b9558aab8 libceph: use keepalive2 to verify the mon session is alive
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-09-08 23:14:30 +03:00
Yan, Zheng
6dd74e44dc libceph: set 'exists' flag for newly up osd
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-09-08 23:14:29 +03:00
Ilya Dryomov
6893162215 libceph: rename con_work() to ceph_con_workfn()
Even though it's static, con_work(), being a work func, shows up in
various stacktraces a lot.  Prefix it with ceph_.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-09-08 23:14:28 +03:00
Benoît Canet
d920ff6fc7 libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
ceph_msgr_slab_init may fail due to a temporary ENOMEM.

Delay a bit the initialization of zero_page in ceph_msgr_init and
reorder its cleanup in _ceph_msgr_exit so it's done in reverse
order of setup.

BUG_ON() will not suffer to be postponed in case it is triggered.

Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-09-08 23:14:28 +03:00
Nicholas Krause
b79b23682a libceph: remove the unused macro AES_KEY_SIZE
This removes the no longer used macro AES_KEY_SIZE as no functions use
this macro anymore and thus this macro can be removed due it no longer
being required.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-09-08 23:14:28 +03:00
Kees Cook
a068acf2ee fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Ilya Dryomov
c44bd69c0c libceph: treat sockaddr_storage with uninitialized family as blank
addr_is_blank() should return true if family is neither AF_INET nor
AF_INET6.  This is what its counterpart entity_addr_t::is_blank_ip() is
doing and it is the right thing to do: in process_banner() we check if
our address is blank and if it is "learn" it from our peer.  As it is,
we never learn our address and always send out a blank one.  This goes
way back to ceph.git commit dd732cbfc1c9 ("use sockaddr_storage; and
some ipv6 support groundwork") from 2009.

While at at, do not open-code ipv6_addr_any() and use INADDR_ANY
constant instead of 0.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-07-09 20:30:34 +03:00
Ilya Dryomov
757856d2b9 libceph: enable ceph in a non-default network namespace
Grab a reference on a network namespace of the 'rbd map' (in case of
rbd) or 'mount' (in case of ceph) process and use that to open sockets
instead of always using init_net and bailing if network namespace is
anything but init_net.  Be careful to not share struct ceph_client
instances between different namespaces and don't add any code in the
!CONFIG_NET_NS case.

This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-07-09 20:30:34 +03:00
Linus Torvalds
0c76c6ba24 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "We have a pile of bug fixes from Ilya, including a few patches that
  sync up the CRUSH code with the latest from userspace.

  There is also a long series from Zheng that fixes various issues with
  snapshots, inline data, and directory fsync, some simplification and
  improvement in the cap release code, and a rework of the caching of
  directory contents.

  To top it off there are a few small fixes and cleanups from Benoit and
  Hong"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
  rbd: use GFP_NOIO in rbd_obj_request_create()
  crush: fix a bug in tree bucket decode
  libceph: Fix ceph_tcp_sendpage()'s more boolean usage
  libceph: Remove spurious kunmap() of the zero page
  rbd: queue_depth map option
  rbd: store rbd_options in rbd_device
  rbd: terminate rbd_opts_tokens with Opt_err
  ceph: fix ceph_writepages_start()
  rbd: bump queue_max_segments
  ceph: rework dcache readdir
  crush: sync up with userspace
  crush: fix crash from invalid 'take' argument
  ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL
  ceph: pre-allocate data structure that tracks caps flushing
  ceph: re-send flushing caps (which are revoked) in reconnect stage
  ceph: send TID of the oldest pending caps flush to MDS
  ceph: track pending caps flushing globally
  ceph: track pending caps flushing accurately
  libceph: fix wrong name "Ceph filesystem for Linux"
  ceph: fix directory fsync
  ...
2015-07-02 11:35:00 -07:00
Ilya Dryomov
82cd003a77 crush: fix a bug in tree bucket decode
struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
should be used.  -Wconversion catches this, but I guess it went
unnoticed in all the noise it spews.  The actual problem (at least for
common crushmaps) isn't the u32 -> u8 truncation though - it's the
advancement by 4 bytes instead of 1 in the crushmap buffer.

Fixes: http://tracker.ceph.com/issues/2759

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2015-07-01 00:46:35 +03:00
Benoît Canet
c2cfa19400 libceph: Fix ceph_tcp_sendpage()'s more boolean usage
From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h:

bool    last_piece;     /* current is last piece */

In ceph_msg_data_next():

*last_piece = cursor->last_piece;

A call to ceph_msg_data_next() is followed by:

ret = ceph_tcp_sendpage(con->sock, page, page_offset,
                        length, last_piece);

while ceph_tcp_sendpage() is:

static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
                             int offset, size_t size, bool more)

The logic is inverted: correct it.

Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-06-29 20:03:46 +03:00
Benoît Canet
6ba8edc0bc libceph: Remove spurious kunmap() of the zero page
ceph_tcp_sendpage already does the work of mapping/unmapping
the zero page if needed.

Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-06-25 18:30:56 +03:00
Ilya Dryomov
b459be739f crush: sync up with userspace
.. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff
between kernel and userspace").  This fixes a bunch of recently pulled
coding style issues and makes includes a bit cleaner.

A patch "crush:Make the function crush_ln static" from Nicholas Krause
<xerofoify@gmail.com> is folded in as crush_ln() has been made static
in userspace as well.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-06-25 11:49:31 +03:00
Ilya Dryomov
8f529795ba crush: fix crash from invalid 'take' argument
Verify that the 'take' argument is a valid device or bucket.
Otherwise ignore it (do not add the value to the working vector).

Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-06-25 11:49:31 +03:00
Hong Zhiguo
6c13a6bb55 libceph: fix wrong name "Ceph filesystem for Linux"
modinfo libceph prints the module name "Ceph filesystem for Linux",
which is same as the real fs module ceph. It's confusing.

Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-06-25 11:49:30 +03:00
Ilya Dryomov
216639dd50 libceph: a couple tweaks for wait loops
- return -ETIMEDOUT instead of -EIO in case of timeout
- wait_event_interruptible_timeout() returns time left until timeout
  and since it can be almost LONG_MAX we had better assign it to long

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:29 +03:00
Ilya Dryomov
a319bf56a6 libceph: store timeouts in jiffies, verify user input
There are currently three libceph-level timeouts that the user can
specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
these are in seconds and no checking is done on user input: negative
values are accepted, we multiply them all by HZ which may or may not
overflow, arbitrarily large jiffies then get added together, etc.

There is also a bug in the way mount_timeout=0 is handled.  It's
supposed to mean "infinite timeout", but that's not how wait.h APIs
treat it and so __ceph_open_session() for example will busy loop
without much chance of being interrupted if none of ceph-mons are
there.

Fix all this by verifying user input, storing timeouts capped by
msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
helper for all user-specified waits to handle infinite timeouts
correctly.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:29 +03:00
Ilya Dryomov
b01da6a08c libceph: use kvfree() instead of open-coding it
This one sneaked in through vfs tree with commit 2b777c9dd9
("ceph_sync_read: stop poking into iov_iter guts").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-06-25 11:49:28 +03:00
Yan, Zheng
144cba1493 libceph: allow setting osd_req_op's flags
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:27 +03:00
Yan, Zheng
66ba609f7b libceph: properly release STAT request's raw_data_in
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:27 +03:00
David S. Miller
dda922c831 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/amd-xgbe-phy.c
	drivers/net/wireless/iwlwifi/Kconfig
	include/net/mac80211.h

iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.

The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01 22:51:30 -07:00