2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-25 13:43:55 +08:00
Commit Graph

4197 Commits

Author SHA1 Message Date
Kent Overstreet
4550dd6c6b block: Immutable bio vecs
This adds a mechanism by which we can advance a bio by an arbitrary
number of bytes without modifying the biovec: bio->bi_iter.bi_bvec_done
indicates the number of bytes completed in the current bvec.

Various driver code still needs to be updated to not refer to the bvec
directly before we can use this for interesting things, like efficient
bio splitting.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
2013-11-23 22:33:49 -08:00
Kent Overstreet
7988613b0e block: Convert bio_for_each_segment() to bvec_iter
More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
2013-11-23 22:33:49 -08:00
Kent Overstreet
a4ad39b1d1 block: Convert bio_iovec() to bvec_iter
For immutable biovecs, we'll be introducing a new bio_iovec() that uses
our new bvec iterator to construct a biovec, taking into account
bvec_iter->bi_bvec_done - this patch updates existing users for the new
usage.

Some of the existing users really do need a pointer into the bvec array
- those uses are all going to be removed, but we'll need the
functionality from immutable to remove them - so for now rename the
existing bio_iovec() -> __bio_iovec(), and it'll be removed in a couple
patches.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23 22:33:49 -08:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Yuanhan Liu
044c8d4b15 kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
Remove CONFIG_USE_GENERIC_SMP_HELPERS left by commit 0a06ff068f
("kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS").

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Shaohua Li
f02b9ac35a virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
It isn't safe to call it without holding the vblk->vq_lock.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>

Fixed another condition of virtqueue_kick() not holding the lock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-19 19:00:45 -07:00
Michael Opdenacker
481e5bad84 NVMe: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-11-18 17:13:54 -05:00
Haiyan Hu
b80d5ccca3 NVMe: Avoid shift operation when writing cq head doorbell
Changes the type of dev->db_stride to unsigned and changes the value
stored there to be 1 << the current value. Then there is less
calculation to be done at completion time.

Signed-off-by: Haiyan Hu <huhaiyan@huawei.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-11-18 17:10:51 -05:00
Linus Torvalds
f412f2c60b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull second round of block driver updates from Jens Axboe:
 "As mentioned in the original pull request, the bcache bits were pulled
  because of their dependency on the immutable bio vecs.  Kent re-did
  this part and resubmitted it, so here's the 2nd round of (mostly)
  driver updates for 3.13.  It contains:

 - The bcache work from Kent.

 - Conversion of virtio-blk to blk-mq.  This removes the bio and request
   path, and substitutes with the blk-mq path instead.  The end result
   almost 200 deleted lines.  Patch is acked by Asias and Christoph, who
   both did a bunch of testing.

 - A removal of bootmem.h include from Grygorii Strashko, part of a
   larger series of his killing the dependency on that header file.

 - Removal of __cpuinit from blk-mq from Paul Gortmaker"

* 'for-linus' of git://git.kernel.dk/linux-block: (56 commits)
  virtio_blk: blk-mq support
  blk-mq: remove newly added instances of __cpuinit
  bcache: defensively handle format strings
  bcache: Bypass torture test
  bcache: Delete some slower inline asm
  bcache: Use ida for bcache block dev minor
  bcache: Fix sysfs splat on shutdown with flash only devs
  bcache: Better full stripe scanning
  bcache: Have btree_split() insert into parent directly
  bcache: Move spinlock into struct time_stats
  bcache: Kill sequential_merge option
  bcache: Kill bch_next_recurse_key()
  bcache: Avoid deadlocking in garbage collection
  bcache: Incremental gc
  bcache: Add make_btree_freeing_key()
  bcache: Add btree_node_write_sync()
  bcache: PRECEDING_KEY()
  bcache: bch_(btree|extent)_ptr_invalid()
  bcache: Don't bother with bucket refcount for btree node allocations
  bcache: Debug code improvements
  ...
2013-11-15 16:33:41 -08:00
Linus Torvalds
b746f9c794 Nothing really exciting: some groundwork for changing virtio endian, and
some robustness fixes for broken virtio devices, plus minor tweaks.
 
 [vs last pull request: added the virtio-scsi broken vq escape patch, which
 I somehow lost.]
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSgDsJAAoJENkgDmzRrbjxEE4P/jXqZHS/HdlxW9k0BjKKlEIF
 PdtCoP3UhWTdskXvy2pD8m6nYn214MEJYUIa4HFlIEZsdxhuexzQHY19Ynkjagyv
 57sRsUUm5fYQLIL7IUh2DUD1VU38hUFinno/y333szzvCj9qITDA/QABsiWxK8NO
 dq+Lmeixgrhc5yN9iryW+gZV+hekJIZ4LsU5ejSaJucKblzXUH8qIbmSthG7RTYJ
 tr4J7xTTXbhxY4CoC5Dpx2hvsFkvzaAIvI4Nr1mDjfq5cR8BaYvnC89U1IbhdAey
 p1AbZE58JLrY+Z8K8LBRGV2KjO8qSZ6R47hbZ9nAnodJYB7sZLyj6jUe1q+/htuC
 Dh9Xm9O4eW2xNaFk20dYeIF4UU5/HzdsbvG/IlH8x4sm8/K706ocYyAOHlzYUg2T
 k7gltrgDzDokMgb2R44gwnr4oaJ2q8Gne6JXswlPEv2eRs6vNnA5Xhc0rEHGkU6C
 gYn1vNFN6yx0vf2syG/Ce5pZtMxGpefKQkHzzWdq8FKr1B9s54dDuf2hls7J8A9t
 OQT1gE33yURSelf4Kh4k9zWXaWk/Ohv9l2R1cqpALnJ4/+q0fP5t7HdK500S7aax
 DxLeFeqvsBw7nlWgsGxQmt+fjITQFHhcDiwst0ehnt6RbDEW7XPIguz0K/gyhxYG
 +UNbl/5Gr64jnUX3YCzm
 =vY2L
 -----END PGP SIGNATURE-----

Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull virtio updates from Rusty Russell:
 "Nothing really exciting: some groundwork for changing virtio endian,
  and some robustness fixes for broken virtio devices, plus minor
  tweaks"

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  virtio_scsi: verify if queue is broken after virtqueue_get_buf()
  x86, asmlinkage, lguest: Pass in globals into assembler statement
  virtio: mmio: fix signature checking for BE guests
  virtio_ring: adapt to notify() returning bool
  virtio_net: verify if queue is broken after virtqueue_get_buf()
  virtio_console: verify if queue is broken after virtqueue_get_buf()
  virtio_blk: verify if queue is broken after virtqueue_get_buf()
  virtio_ring: add new function virtqueue_is_broken()
  virtio_test: verify if virtqueue_kick() succeeded
  virtio_net: verify if virtqueue_kick() succeeded
  virtio_ring: let virtqueue_{kick()/notify()} return a bool
  virtio_ring: change host notification API
  virtio_config: remove virtio_config_val
  virtio: use size-based config accessors.
  virtio_config: introduce size-based accessors.
  virtio_ring: plug kmemleak false positive.
  virtio: pm: use CONFIG_PM_SLEEP instead of CONFIG_PM
2013-11-15 13:28:47 +09:00
Wolfram Sang
16735d022f tree-wide: use reinit_completion instead of INIT_COMPLETION
Use this new function to make code more comprehensible, since we are
reinitialzing the completion, not initializing.

[akpm@linux-foundation.org: linux-next resyncs]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org> (personally at LCE13)
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:21 +09:00
Jens Axboe
1cf7e9c68f virtio_blk: blk-mq support
Switch virtio-blk from the dual support for old-style requests and bios
to use the block-multiqueue.

Acked-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2013-11-14 08:40:44 -07:00
Jens Axboe
1355b37f11 Merge branch 'for-3.13/post-mq-drivers' into for-linus 2013-11-14 08:29:01 -07:00
Linus Torvalds
5eea9be8b2 Merge branch 'for-3.13/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for 3.13.  As with the core pull
  request just sent out, this was rebased on top of the core branch
  again after the immutable series was pulled.  This also means that
  bcache gets to sit the initial pull over.  I will send a second driver
  pull request in the merge window to get those fixes in, once they have
  been rebased and tested on top of the non-immutable stack.

  This pull request contains:

   - Add support for the sTec Kronos pci-e flash card from sTec.  Also
     has various cleanups for this driver, from myself, Bart, Mike
     Snizter, and Wei Yongjun.

   - Add surprise removal support for the micron mtip32xx driver from
     Micron.

   - Floppy documentation fix from Ben Harris.

   - debugfs bug fix for pktcdvd from Dan Carpenter.

   - Fix for the mtip32xx driver stack usage in the debugfs path,
     dynamically allocating those buffers instead.  From David Milburn.

   - Disable cpqarray in Kconfig.  The plan is to remove it on request
     of HP, but lets disable it for a few revisions just to see if
     anyone yells.

   - drbd fixes from Lars Ellenberg and Philipp Reisner.

   - Elevator switch fix for the s390 block driver from Heiko Carstens.

   - loop crash fix on IO to unassigned device from Mikulas Patocka.

   - A series of bug fixes for the IBM rsxx pci-e flash driver from
     Philip J Kelleher.

   - cciss probe fix from Stephen Cameron.

   - Xen block front/back fixes from Roger Pau Monne and Vegard Nossum"

* 'for-3.13/drivers' of git://git.kernel.dk/linux-block: (41 commits)
  floppy: Correct documentation of driver options when used as a module.
  pktcdvd: debugfs functions return NULL on error
  xen-blkfront: restore the non-persistent data path
  skd: fix formatting in skd_s1120.h
  skd: reorder construct/destruct code
  skd: cleanup skd_do_inq_page_da()
  skd: remove SKD_OMIT_FROM_SRC_DIST ifdefs
  skd: remove redundant skdev->pdev assignment from skd_pci_probe()
  skd: use <asm/unaligned.h>
  skd: remove SCSI subsystem specific includes
  skd: register block device only if some devices are present
  skd: fix error messages in skd_init()
  skd: fix error paths in skd_init()
  skd: fix unregister_blkdev() placement
  skd: more removal of bio-based code
  skd: cleanup the skd_*() function block wrapping
  skd: rip out bio path
  skd: fix error return code in skd_pci_probe()
  s390/dasd: hold request queue sysfs lock when calling elevator_init()
  cciss: return 0 from driver probe function on success, not 1
  ...
2013-11-14 12:13:05 +09:00
Linus Torvalds
0910c0bdf7 Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block
Pull block IO core updates from Jens Axboe:
 "This is the pull request for the core changes in the block layer for
  3.13.  It contains:

   - The new blk-mq request interface.

     This is a new and more scalable queueing model that marries the
     best part of the request based interface we currently have (which
     is fully featured, but scales poorly) and the bio based "interface"
     which the new drivers for high IOPS devices end up using because
     it's much faster than the request based one.

     The bio interface has no block layer support, since it taps into
     the stack much earlier.  This means that drivers end up having to
     implement a lot of functionality on their own, like tagging,
     timeout handling, requeue, etc.  The blk-mq interface provides all
     these.  Some drivers even provide a switch to select bio or rq and
     has code to handle both, since things like merging only works in
     the rq model and hence is faster for some workloads.  This is a
     huge mess.  Conversion of these drivers nets us a substantial code
     reduction.  Initial results on converting SCSI to this model even
     shows an 8x improvement on single queue devices.  So while the
     model was intended to work on the newer multiqueue devices, it has
     substantial improvements for "classic" hardware as well.  This code
     has gone through extensive testing and development, it's now ready
     to go.  A pull request is coming to convert virtio-blk to this
     model will be will be coming as well, with more drivers scheduled
     for 3.14 conversion.

   - Two blktrace fixes from Jan and Chen Gang.

   - A plug merge fix from Alireza Haghdoost.

   - Conversion of __get_cpu_var() from Christoph Lameter.

   - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

   - A fix for a race between request completion and the timeout
     handling from Jeff Moyer.  This is what caused the merge conflict
     with blk-mq/core, in case you are looking at that.

   - A dm stacking fix from Mike Snitzer.

   - A code consolidation fix and duplicated code removal from Kent
     Overstreet.

   - A handful of block bug fixes from Mikulas Patocka, fixing a loop
     crash and memory corruption on blk cg.

   - Elevator switch bug fix from Tomoki Sekiyama.

  A heads-up that I had to rebase this branch.  Initially the immutable
  bio_vecs had been queued up for inclusion, but a week later, it became
  clear that it wasn't fully cooked yet.  So the decision was made to
  pull this out and postpone it until 3.14.  It was a straight forward
  rebase, just pruning out the immutable series and the later fixes of
  problems with it.  The rest of the patches applied directly and no
  further changes were made"

* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: Do not call sector_div() with a 64-bit divisor
  kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
  block: Consolidate duplicated bio_trim() implementations
  block: Use rw_copy_check_uvector()
  block: Enable sysfs nomerge control for I/O requests in the plug list
  block: properly stack underlying max_segment_size to DM device
  elevator: acquire q->sysfs_lock in elevator_change()
  elevator: Fix a race in elevator switching and md device initialization
  block: Replace __get_cpu_var uses
  bdi: test bdi_init failure
  block: fix a probe argument to blk_register_region
  loop: fix crash if blk_alloc_queue fails
  blk-core: Fix memory corruption if blkcg_init_queue fails
  block: fix race between request completion and timeout handling
  blktrace: Send BLK_TN_PROCESS events to all running traces
  blk-mq: don't disallow request merges for req->special being set
  blk-mq: mq plug list breakage
  blk-mq: fix for flush deadlock
  ...
2013-11-14 12:08:14 +09:00
Linus Torvalds
8ceafbfa91 Merge branch 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm
Pull DMA mask updates from Russell King:
 "This series cleans up the handling of DMA masks in a lot of drivers,
  fixing some bugs as we go.

  Some of the more serious errors include:
   - drivers which only set their coherent DMA mask if the attempt to
     set the streaming mask fails.
   - drivers which test for a NULL dma mask pointer, and then set the
     dma mask pointer to a location in their module .data section -
     which will cause problems if the module is reloaded.

  To counter these, I have introduced two helper functions:
   - dma_set_mask_and_coherent() takes care of setting both the
     streaming and coherent masks at the same time, with the correct
     error handling as specified by the API.
   - dma_coerce_mask_and_coherent() which resolves the problem of
     drivers forcefully setting DMA masks.  This is more a marker for
     future work to further clean these locations up - the code which
     creates the devices really should be initialising these, but to fix
     that in one go along with this change could potentially be very
     disruptive.

  The last thing this series does is prise away some of Linux's addition
  to "DMA addresses are physical addresses and RAM always starts at
  zero".  We have ARM LPAE systems where all system memory is above 4GB
  physical, hence having DMA masks interpreted by (eg) the block layers
  as describing physical addresses in the range 0..DMAMASK fails on
  these platforms.  Santosh Shilimkar addresses this in this series; the
  patches were copied to the appropriate people multiple times but were
  ignored.

  Fixing this also gets rid of some ARM weirdness in the setup of the
  max*pfn variables, and brings ARM into line with every other Linux
  architecture as far as those go"

* 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm: (52 commits)
  ARM: 7805/1: mm: change max*pfn to include the physical offset of memory
  ARM: 7797/1: mmc: Use dma_max_pfn(dev) helper for bounce_limit calculations
  ARM: 7796/1: scsi: Use dma_max_pfn(dev) helper for bounce_limit calculations
  ARM: 7795/1: mm: dma-mapping: Add dma_max_pfn(dev) helper function
  ARM: 7794/1: block: Rename parameter dma_mask to max_addr for blk_queue_bounce_limit()
  ARM: DMA-API: better handing of DMA masks for coherent allocations
  ARM: 7857/1: dma: imx-sdma: setup dma mask
  DMA-API: firmware/google/gsmi.c: avoid direct access to DMA masks
  DMA-API: dcdbas: update DMA mask handing
  DMA-API: dma: edma.c: no need to explicitly initialize DMA masks
  DMA-API: usb: musb: use platform_device_register_full() to avoid directly messing with dma masks
  DMA-API: crypto: remove last references to 'static struct device *dev'
  DMA-API: crypto: fix ixp4xx crypto platform device support
  DMA-API: others: use dma_set_coherent_mask()
  DMA-API: staging: use dma_set_coherent_mask()
  DMA-API: usb: use new dma_coerce_mask_and_coherent()
  DMA-API: usb: use dma_set_coherent_mask()
  DMA-API: parport: parport_pc.c: use dma_coerce_mask_and_coherent()
  DMA-API: net: octeon: use dma_coerce_mask_and_coherent()
  DMA-API: net: nxp/lpc_eth: use dma_coerce_mask_and_coherent()
  ...
2013-11-14 07:55:21 +09:00
Dan Carpenter
49c2856af7 pktcdvd: debugfs functions return NULL on error
My static checker complains correctly that this is potential NULL
dereference because debugfs functions return NULL on error.  They return
an ERR_PTR if they are configured out.

We don't need to check for ERR_PTR because if debugfs is stubbed out the
dummy functions won't complain about that.  We don't need to check the
values before calling debugfs_remove() because that accepts ERR_PTRs and
NULL pointers.

We don't need to set pkt->dfs_f_info to NULL in pkt_debugfs_dev_new()
because it was initialized with kzalloc() so I have removed that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-11-08 09:10:31 -07:00
Roger Pau Monne
bfe11d6de1 xen-blkfront: restore the non-persistent data path
When persistent grants were added they were always used, even if the
backend doesn't have this feature (there's no harm in always using the
same set of pages). This restores the old data path when the backend
doesn't have persistent grants, removing the burden of doing a memcpy
when it is not actually needed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Fix up whitespace issues]
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
f1a3c61913 skd: fix formatting in skd_s1120.h
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
542d7b0015 skd: reorder construct/destruct code
Reorder placement of skd_construct(), skd_cons_sg_list(), skd_destruct()
and skd_free_sg_list() functions. Then remove no longer needed function
prototypes.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
fec23f6311 skd: cleanup skd_do_inq_page_da()
skdev->pdev and skdev->pdev->bus are always different than NULL in
skd_do_inq_page_da() so simplify the code accordingly.

Also cache skdev->pdev value in pdev variable while at it.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
7f74e5e944 skd: remove SKD_OMIT_FROM_SRC_DIST ifdefs
SKD_OMIT_FROM_SRC_DIST is never defined.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
ebedd16dab skd: remove redundant skdev->pdev assignment from skd_pci_probe()
skdev->pdev is set to pdev twice in skd_pci_probe(), first time
through skd_construct() call and the second time directly in
the function. Remove the second assignment as it is not needed.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
4ca90b5332 skd: use <asm/unaligned.h>
Use <asm/unaligned.h> instead of <asm-generic/unaligned.h>.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
43504631d0 skd: remove SCSI subsystem specific includes
This is not a SCSI host driver so remove SCSI subsystem specific
includes.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
b8df6647c2 skd: register block device only if some devices are present
Register block device in skd_pci_probe() instead of in skd_init() so it
is registered only if some devices are present (currently it is always
registered when the driver is loaded). Please note that this change
depends on the fact that register_blkdev(0, ...) never returns 0.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
fbed149ab3 skd: fix error messages in skd_init()
* change priority level from KERN_INFO to KERN_ERR
* add "skd: " prefix
* do minor CodingStyle fixes

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
2b91c55222 skd: fix error paths in skd_init()
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:30 -07:00
Bartlomiej Zolnierkiewicz
a073ae953e skd: fix unregister_blkdev() placement
register_blkdev() is called before pci_register_driver() in skd_init()
so unregister_blkdev() should be called after pci_unregister_driver()
in skd_exit(). Fix it.

Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Mike Snitzer
38d4a1bb99 skd: more removal of bio-based code
Remove skd_flush_cmd structure and skd_flush_slab.
Remove skd_end_request wrapper around skd_end_request_blk.
Remove skd_requeue_request, use blk_requeue_request directly.
Cleanup some comments (remove "bio" info) and whitespace.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Jens Axboe
6a5ec65b9a skd: cleanup the skd_*() function block wrapping
Just call the block functions directly, don't wrap them
in skd helpers. With only one queueing model enabled, there's
no point in doing that.

Also kill the ->start_time and ->bio from the skd_request_context,
we don't use those anymore.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Jens Axboe
fcd37eb3c1 skd: rip out bio path
The skd driver has a selectable rq or bio based queueing model.
For 3.14, we want to turn this into a single blk-mq interface
instead. With the immutable biovecs being merged in 3.13, the
bio model would need patches to even work. So rip it out, with
a conversion pending for blk-mq in the next release.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Wei Yongjun
1762b57fcb skd: fix error return code in skd_pci_probe()
Fix to return -ENOMEM in the skd construct error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Stephen M. Cameron
b88fac630b cciss: return 0 from driver probe function on success, not 1
A return value of 1 is interpreted as an error

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
rchinthekindi
2e44b42718 skd: Replaced custom debug PRINTKs with pr_debug
Replaced DPRINTK() and VPRINTK() with pr_debug().

Signed-off-by: Ramprasad C <ramprasad.chinthekindi@hgst.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Akhil Bhansali
f721bb0dbd skd: Fix checkpatch ERRORS and removed unused functions
This patch fixes checkpatch.pl errors for assignment in if condition.
It also removes unused readq / readl function calls.

As Andrew had disabled the compilation of drivers for 32 bit,
I have modified format specifiers in few VPRINTKs to avoid warnings
during 64 bit compilation.

Signed-off-by: Akhil Bhansali <abhansali@stec-inc.com>
Reviewed-by: Ramprasad Chinthekindi <rchinthekindi@stec-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Philip J Kelleher
8c49a77ca4 rsxx: Fix possible kernel panic with invalid config.
This patch fixes a possible Kernel Panic on driver load if
the configuration on the card is messed up or not yet set.
The driver could possible give a 32 bit unsigned all Fs to
the kernel as the device's block size.

Now we only write the block size to the kernel if the
configuration from the card is valid.

Also, driver version is being updated.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Philip J Kelleher
e35f38bf73 rsxx: Disallow discards from being unmapped.
This patch fixes a bug in which discards were always
calling pci_unmap_page. Discards should never call the
pci_unmap_page function call because they are never mapped.

This caused a race condition on PowerPC systems when issuing
discards, writes, and reads all at the same time. The
pci_map_page function would eventually map logical address
0 for a read or write. Discards are always assigned a DMA
address of 0 because they are never mapped. So if
pci_map_page mapped address 0 for a DMA and a discard was
"unmapped" then the address would be freed and would cause
an EEH event to occur when Hardware accesses the address.

This was injected/uncovered in commit:
b347f9cf0bc8d42ee95ba1d3837fd93045ab336b

The pci_dma_mapping_error function declares -1 a DMA_ERROR
not 0 like initially thought So before we would never unmap
discards because they were considered NULL.

This patch should fall on top of commit id:
fc1967bb08a6184ed44ef990e1dd4389901b809c

Also, the driver version is being up dated.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Lars Ellenberg
35f47ef1a1 drbd: avoid to shrink max_bio_size due to peer re-configuration
For a long time, the receiving side has spread "too large" incoming
requests over multiple bios.  No need to shrink our max_bio_size
(max_hw_sectors) if the peer is reconfigured to use a different storage.

The problem manifests itself if we are not the top of the device stack
(DRBD is used a LVM PV).

A hardware reconfiguration on the peer may cause the supported
max_bio_size to shrink, and the connection handshake would now
unnecessarily shrink the max_bio_size on the active node.

There is no way to notify upper layers that they have to "re-stack"
their limits. So they won't notice at all, and may keep submitting bios
that are suddenly considered "too large for device".

We already check for compatibility and ignore changes on the peer,
the code only was masked out unless we have a fully established connection.
We just need to allow it a bit earlier during the handshake.

Also consider max_hw_sectors in our merge bvec function, just in case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:29 -07:00
Lars Ellenberg
d2da5b0cb5 drbd: fix decoding of bitmap vli rle for device sizes > 64 TB
Symptoms: disconnect after bitmap exchange due to
bitmap overflow (e:49731075554) while decoding bm RLE packet

In the decoding step of the variable length integer run length encoding
there was potentially an uncatched bitshift by wordsize (variable >> 64).

The result of which is "undefined" :(
(only "sometimes" the result is the desired 0)

Fix: don't do any bit shift magic for shift == 64, just assign.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philipp Reisner
57737adc96 drbd: Fix adding of new minors with freshly created meta data
Online adding of new minors with freshly created meta data
to an resource with an established connection failed, with a
wrong state transition on one side on one side of the new minor.

Freshly created meta-data has a la_size (last agreed size) of 0.
When we online add such devices, the code wrongly got into
the code path for resyncing new storage that was added while
the disk was detached.

Fixed that by making the GREW from ZERO a special case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philipp Reisner
b874d231e1 drbd: Fix an connection drop issue after enabling allow-two-primaries
Since drbd-8.4.0 it is possible to change the allow-two-primaries
network option while the connection is established.

The sequence code used to partially order packets from the
data socket with packets from the meta-data socket, still assued
that the allow-two-primaries option is constant while the
connection is established.

I.e.
On a node that has the RESOLVE_CONFLICTS bits set, after enabling
allow-two-primaries, when receiving the next data packet it timed out
while waiting for the necessary packets on the data socket to arrive
(wait_for_and_update_peer_seq() function).

Fixed that by always tracking the sequence number, but only waiting
for it if allow-two-primaries is set.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Lars Ellenberg
69babf05cb drbd: fix NULL pointer deref in module init error path
If we want to iterate over the (as of yet still empty) list in the
cleanup path, we need to initialize the list before the first goto fail.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Jens Axboe
7badfb1c34 block: disable cpqarray in Kconfig
Mike writes:

"cpqarray hasn't been used in over 12 years. It's doubtful that anyone
 still uses the board. It's time the driver was removed from the mainline
 kernel.  The only updates these days are minor and mostly done by people
 outside of HP."

If nobody yells, we'll remove it from the kernel tree completely
for 3.15.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Akhil Bhansali
e67f86b31a Add support for sTec's pci-e flash card Kronos
Signed-off-by: Akhil Bhansali <abhansali@stec-inc.com>
Signed-off-by: Ramprasad Chinthekindi <rchinthekindi@stec-inc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Folded patch, contributions to clean up this driver from:

Jens Axboe
Dan Carpenter
Andrew Morton

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philip J Kelleher
0317cd6de8 rsxx: Kernel Panic caused by mapping Discards
This fixes a kernel panic injected by commit id
8d26750143341831bc312f61c5ed141eeb75b8d0 where discards
are getting mapped through the pci_map_page function call.

The driver will now start verifying that a dma is not a
discard before issuing a the pci_map_page function call.

Also, we are updating the driver version.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
David Milburn
c8afd0dcbd mtip32xx: dynamically allocate buffer in debugfs functions
Dynamically allocate buf to prevent warnings:

drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_device_status’:
drivers/block/mtip32xx/mtip32xx.c:2823: warning: the frame size of 1056 bytes is larger than 1024 bytes
drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_registers’:
drivers/block/mtip32xx/mtip32xx.c:2894: warning: the frame size of 1056 bytes is larger than 1024 bytes
drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_flags’:
drivers/block/mtip32xx/mtip32xx.c:2917: warning: the frame size of 1056 bytes is larger than 1024 bytes

Signed-off-by: David Milburn <dmilburn@redhat.com>
Acked-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Asai Thambi S P
8f8b899563 mtip32xx: Add SRSI support
This patch add support for SRSI(Surprise Removal Surprise Insertion).

Approach:
---------
Surprise Removal:
-----------------
On surprise removal of the device, gendisk, request queue, device index, sysfs
entries, etc are retained as long as device is in use - mounted filesystem,
device opened by an application, etc. The service thread breaks out of the main
while loop, waits for pci remove to exit, and then waits for device to become
free. When there no holders of the device, service thread cleans up the block
and device related stuff and returns.

Surprise Insertion:
-------------------
No change, this scenario follows the normal pci probe() function flow.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philip J Kelleher
1b21f5b2ad rsxx: Moving pci_map_page to prevent overflow.
The pci_map_page function has been moved into our
issued workqueue to prevent an us running out of
mappable addresses on non-HWWD PCIe x8 slots. The
maximum amount that can possible be mapped at one
time now is: 255 dmas X 4 dma channels X 4096 Bytes.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philip J Kelleher
e5feab229f rsxx: Handling failed pci_map_page on PowerPC and double free.
The rsxx driver was not checking the correct value during a
pci_map_page failure. Fixing this also uncovered a
double free if the bio was returned before it was
broken up into indiviadual 4k dmas, that is also
fixed here.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Mikulas Patocka
ef7e7c82e0 loop: fix crash when using unassigned loop device
When the loop module is loaded, it creates 8 loop devices /dev/loop[0-7].
The devices have no request routine and thus, when they are used without
being assigned, a crash happens.

For example, these commands cause crash (assuming there are no used loop
devices):

Kernel Fault: Code=26 regs=000000007f420980 (Addr=0000000000000010)
CPU: 1 PID: 50 Comm: kworker/1:1 Not tainted 3.11.0 #1
Workqueue: ksnaphd do_metadata [dm_snapshot]
task: 000000007fcf4078 ti: 000000007f420000 task.ti: 000000007f420000
[  116.319988]
     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001001111111100001111 Not tainted
r00-03  000000ff0804ff0f 00000000408bf5d0 00000000402d8204 000000007b7ff6c0
r04-07  00000000408a95d0 000000007f420950 000000007b7ff6c0 000000007d06c930
r08-11  000000007f4205c0 0000000000000001 000000007f4205c0 000000007f4204b8
r12-15  0000000000000010 0000000000000000 0000000000000000 0000000000000000
r16-19  000000001108dd48 000000004061cd7c 000000007d859800 000000000800000f
r20-23  0000000000000000 0000000000000008 0000000000000000 0000000000000000
r24-27  00000000ffffffff 000000007b7ff6c0 000000007d859800 00000000408a95d0
r28-31  0000000000000000 000000007f420950 000000007f420980 000000007f4208e8
sr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000303000
sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  117.549988]
IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d82fc 00000000402d8300
 IIR: 53820020    ISR: 0000000000000000  IOR: 0000000000000010
 CPU:        1   CR30: 000000007f420000 CR31: ffffffffffffffff
 ORIG_R28: 0000000000000001
 IAOQ[0]: generic_make_request+0x11c/0x1a0
 IAOQ[1]: generic_make_request+0x120/0x1a0
 RP(r2): generic_make_request+0x24/0x1a0
Backtrace:
 [<00000000402d83f0>] submit_bio+0x70/0x140
 [<0000000011087c4c>] dispatch_io+0x234/0x478 [dm_mod]
 [<0000000011087f44>] sync_io+0xb4/0x190 [dm_mod]
 [<00000000110883bc>] dm_io+0x2c4/0x310 [dm_mod]
 [<00000000110bfcd0>] do_metadata+0x28/0xb0 [dm_snapshot]
 [<00000000401591d8>] process_one_work+0x160/0x460
 [<0000000040159bc0>] worker_thread+0x300/0x478
 [<0000000040161a70>] kthread+0x118/0x128
 [<0000000040104020>] end_fault_vector+0x20/0x28
 [<0000000040177220>] task_tick_fair+0x420/0x4d0
 [<00000000401aa048>] invoke_rcu_core+0x50/0x60
 [<00000000401ad5b8>] rcu_check_callbacks+0x210/0x8d8
 [<000000004014aaa0>] update_process_times+0xa8/0xc0
 [<00000000401ab86c>] rcu_process_callbacks+0x4b4/0x598
 [<0000000040142408>] __do_softirq+0x250/0x2c0
 [<00000000401789d0>] find_busiest_group+0x3c0/0xc70
[  119.379988]
Kernel panic - not syncing: Kernel Fault
Rebooting in 1 seconds..

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Vegard Nossum
ea5ec76d76 xen/blkback: fix reference counting
If the permission check fails, we drop a reference to the blkif without
having taken it in the first place. The bug was introduced in commit
604c499cbb (xen/blkback: Check device
permissions before allowing OP_DISCARD).

Cc: stable@vger.kernel.org
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Roger Pau Monne
c47206e25f xen-blkfront: improve aproximation of required grants per request
Improve the calculation of required grants to process a request by
using nr_phys_segments instead of always assuming a request is going
to use all posible segments.

nr_phys_segments contains the number of scatter-gather DMA addr+len
pairs, which is basically what we put at every granted page.
for_each_sg iterates over the DMA addr+len pairs and uses a grant
page for each of them.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Roger Pau Monne
fbe363c476 xen-blkfront: revoke foreign access for grants not mapped by the backend
There's no need to keep the foreign access in a grant if it is not
persistently mapped by the backend. This allows us to free grants that
are not mapped by the backend, thus preventing blkfront from hoarding
all grants.

The main effect of this is that blkfront will only persistently map
the same grants as the backend, and it will always try to use grants
that are already mapped by the backend. Also the number of persistent
grants in blkfront is the same as in blkback (and is controlled by the
value in blkback).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Matt Wilson <msw@amazon.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Michael Opdenacker
370d668668 mg_disk: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Jens Axboe
e37459b8e2 Merge branch 'blk-mq/core' into for-3.13/core
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/blk-timeout.c
2013-11-08 09:08:12 -07:00
Kent Overstreet
6678d83f18 block: Consolidate duplicated bio_trim() implementations
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:02:31 -07:00
Mikulas Patocka
a207f59376 block: fix a probe argument to blk_register_region
The probe function is supposed to return NULL on failure (as we can see in
kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;

However, in loop and brd, it returns negative error from ERR_PTR.

This causes a crash if we simulate disk allocation failure and run
less -f /dev/loop0 because the negative number is interpreted as a pointer:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4
IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450
PGD 23c677067 PUD 23d6d1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
CPU: 1 PID: 6831 Comm: less Tainted: P        W  O 3.10.15-devel #18
Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000
RIP: 0010:[<ffffffff8118b188>]  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP: 0018:ffff88023e47dbd8  EFLAGS: 00010286
RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658
R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640
FS:  00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
 ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
 ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
Call Trace:
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c365>] blkdev_open+0x65/0x80
 [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
 [<ffffffff8114d214>] finish_open+0x34/0x50
 [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
 [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
 [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0
 [<ffffffff8115f4f0>] do_filp_open+0x40/0x90
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
 [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0
 [<ffffffff8114e559>] SyS_open+0x19/0x20
 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5
a4 07 00 44 89
RIP  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
 RSP <ffff88023e47dbd8>
CR2: 00000000000002b4
---[ end trace bb7f32dbf02398dc ]---

The brd change should be backported to stable kernels starting with 2.6.25.
The loop change should be backported to stable kernels starting with 2.6.22.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org	# 2.6.22+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 08:59:39 -07:00
Mikulas Patocka
3ec981e30f loop: fix crash if blk_alloc_queue fails
loop: fix crash if blk_alloc_queue fails

If blk_alloc_queue fails, loop_add cleans up, but it doesn't clean up the
identifier allocated with idr_alloc. That causes crash on module unload in
idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); where we attempt to
remove non-existed device with that id.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000380
IP: [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
PGD 43d399067 PUD 43d0ad067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop(-) dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_powersave spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc lm85 hwmon_vid snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq ohci_hcd freq_table tg3 ehci_pci mperf ehci_hcd kvm_amd kvm sata_svw serverworks libphy libata ide_core k10temp usbcore hwmon microcode ptp pcspkr pps_core e100 skge mii usb_common i2c_piix4 floppy evdev rtc_cmos i2c_core processor but!
 ton unix
CPU: 7 PID: 2735 Comm: rmmod Tainted: G        W    3.10.15-devel #15
Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
task: ffff88043d38e780 ti: ffff88043d21e000 task.ti: ffff88043d21e000
RIP: 0010:[<ffffffff812057c9>]  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
RSP: 0018:ffff88043d21fe10  EFLAGS: 00010282
RAX: ffffffffa05102e0 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88043ea82800 RDI: 0000000000000000
RBP: ffff88043d21fe48 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000ff
R13: 0000000000000080 R14: 0000000000000000 R15: ffff88043ea82800
FS:  00007ff646534700(0000) GS:ffff880447000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000380 CR3: 000000043e9bf000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffffffff8100aba4 0000000000000092 ffff88043d21fe48 ffff88043ea82800
 00000000000000ff ffff88043d21fe98 0000000000000000 ffff88043d21fe60
 ffffffffa05102b4 0000000000000000 ffff88043d21fe70 ffffffffa05102ec
Call Trace:
 [<ffffffff8100aba4>] ? native_sched_clock+0x24/0x80
 [<ffffffffa05102b4>] loop_remove+0x14/0x40 [loop]
 [<ffffffffa05102ec>] loop_exit_cb+0xc/0x10 [loop]
 [<ffffffff81217b74>] idr_for_each+0x104/0x190
 [<ffffffffa05102e0>] ? loop_remove+0x40/0x40 [loop]
 [<ffffffff8109adc5>] ? trace_hardirqs_on_caller+0x105/0x1d0
 [<ffffffffa05135dc>] loop_exit+0x34/0xa58 [loop]
 [<ffffffff810a98ea>] SyS_delete_module+0x13a/0x260
 [<ffffffff81221d5e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: f0 4c 8b 6d f8 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 4c 8d af 80 00 00 00 41 54 53 48 89 fb 48 83 ec 18 <48> 83 bf 80 03 00
00 00 74 4d e8 98 fe ff ff 31 f6 48 c7 c7 20
RIP  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
 RSP <ffff88043d21fe10>
CR2: 0000000000000380
---[ end trace 64ec069ec70f1309 ]---

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org	# 3.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 08:59:26 -07:00
Heinz Graalfs
7f03b17d5c virtio_blk: verify if queue is broken after virtqueue_get_buf()
In case virtqueue_get_buf() returned with a NULL pointer verify if the
virtqueue is broken in order to leave while loop.

Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-10-29 11:28:17 +10:30
Olof Johansson
43d93947a5 Via Paul Walmsley <paul@pwsan.com>:
Move some of the OMAP2+ CM and System Control Module direct
 register accesses into CM- and System Control
 Module-specific "drivers" underneath arch/arm/mach-omap2/.  This
 is a prerequisite for moving this code out of arch/arm/mach-omap2/ into
 drivers/.
 
 Basic test logs are available here:
 
 http://www.pwsan.com/omap/testlogs/cm_scm_cleanup_a_v3.13/20131019101809/
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSZAZXAAoJEBvUPslcq6VzCWQQAKH4Rj0izwbbLkgBAeeaQz5K
 oJgPJ6UPLOJ2uLIUauCKUSR6+nktrCTfV8P+J4DhCc6OiGrKBXJhSETPgaTbWsNw
 Bd577pmuvXSfNFXUaLwCgkSmafJ1pi6d7kEx/7ZW3TziVE/aUxyeHkrMtWJHrjTP
 28tJVieOxLlO5iK06DfmGcCpLUBKJKtgGRo0h/oqMhLAaN5S8//lyVYgdsto7oCN
 /bes6OpuVVdKiSr78V4rCVtR5Lij5+lVrT8HDiw2BA0V3bYcI7+CVlWBPZ3mYkuy
 oAJDcn9whNyfWS+SsaTIjy6nHsgQkhEJnhrQW3k2skVZobRtWDv7U5LiTjsUhb3o
 pjyWD8zZ7jqrkgyLsai6dm1zsljMQXsIQwH5h++HdCRhtNOXd6bVQZy0KqkpLu0y
 Bhpt8/edh4Bdc305oB05/Y9Uxr7Gr8M377chVZx+JD3rxIDjRRyOJcRIhd27WZEf
 HSMLpO/ayUXWdDuTlKW0IEnImx3PrxT913cnjIY589FhfdahfGQoft4sWDeiQLAX
 +zVYZljeY+GxbUWO6aY4m2PfVN9p/Hwal58NZZgj59wq9iHUuJErK11X7rj+2vwN
 +20IS8sikz6Iym84iC0T+omUeFVY0Zo004DVvpPB+D1C2LpwdI1c6kTz4DYT1EBP
 pvs8Wihkk7xQxQn0rBGP
 =L37r
 -----END PGP SIGNATURE-----

Merge tag 'omap-for-v3.13/cm-scm-cleanup-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into next/cleanup

From Paul Walmsley <paul@pwsan.com> via Tony Lindgren:

Move some of the OMAP2+ CM and System Control Module direct
register accesses into CM- and System Control
Module-specific "drivers" underneath arch/arm/mach-omap2/.  This
is a prerequisite for moving this code out of arch/arm/mach-omap2/ into
drivers/.

Basic test logs are available here:

http://www.pwsan.com/omap/testlogs/cm_scm_cleanup_a_v3.13/20131019101809/

* tag 'omap-for-v3.13/cm-scm-cleanup-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP3: control: add API for setting IVA bootmode
  ARM: OMAP3: CM/control: move CM scratchpad save to CM driver
  ARM: OMAP3: McBSP: do not access CM register directly
  ARM: OMAP3: clock: add API to enable/disable autoidle for a single clock
  ARM: OMAP2: CM/PM: remove direct register accesses outside CM code
  + Linux 3.12-rc4

Signed-off-by: Olof Johansson <olof@lixom.net>
2013-10-28 14:39:03 -07:00
Jens Axboe
f2298c0403 null_blk: multi queue aware block test driver
A driver that simply completes IO it receives, it does no
transfers. Written to fascilitate testing of the blk-mq code.
It supports various module options to use either bio queueing,
rq queueing, or mq mode.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:56:00 +01:00
Jens Axboe
5953316dbf block: make rq->cmd_flags be 64-bit
We have officially run out of flags in a 32-bit space. Extend it
to 64-bit even on 32-bit archs.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-25 11:55:59 +01:00
Rusty Russell
855e0c5288 virtio: use size-based config accessors.
This lets the transport do endian conversion if necessary, and insulates
the drivers from the difference.

Most drivers can use the simple helpers virtio_cread() and virtio_cwrite().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-10-17 10:55:37 +10:30
Olof Johansson
ec01791d0f This deletes the Shark SA110-based sub-architecture from
the kernel.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSQuSSAAoJEEEQszewGV1zi00QALIJQt28Ir03WoxpbfLev2WP
 2s/f3zwW76XkNhWhh7FNG+GCl7VUcLcLSXod/nYE7wvbVtx9DVGzpWRbzqmNW/Ti
 5eqkk9EQso1G8jEseBiKaWsO7JxKkwa+nEYkGaqbm9K5Qk9iVlQ1hTF/m9/L64Dq
 I4BazqHJuU8g7sJhNaUaVvJLX5YXEeYTaTl43XoEEHy24HDqnvT2sIkiEj1WXLGE
 EmsOHVYLeJ4z7kZK8Dl0ZW/YFzrP3UAM25KObRYJXKAwGZhbTNreujF0RDdg5CBY
 6etMU7h7W90KkCGXzbgt4AkHwBDHM6EGb81ZiKOWZBGlrQRQFyAas4e8UTgEQh1F
 +I9PKNY/m94F9+RMpV9a35hNHNUj/+XskmXmoIFnCgSsrgYDIoSzZqV+v3iVDHRb
 oeX4R6S5ZbmxT3QC0n1xdctGn8BbgzGpAUnvfcNHG/IypNjCuGk9Femu7CQS2Pka
 R3BuTcizcppbITaFeaLLorRkC8U/GPuh5JmMUpyJh/+aq+wiWDw9oBEVfNclHB67
 Qo44EAN1WPv9M+EyQAhzUndJIFj0Ib4wUM7ILjHeXJLVvwQfpE1dQURDs//6Wwgp
 /hALpZqa5Y9oxPJO3Fgmmmci6q0P3+EFX5PT8GO2ecGWhhDlCSfLWc/lEtYrjzxG
 t77iKH/DdzX5nfrr4AQQ
 =r6/u
 -----END PGP SIGNATURE-----

Merge tag 'del-shark-for-v3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson into next/cleanup

From Linus Walleij:

This deletes the Shark SA110-based sub-architecture from
the kernel.

* tag 'del-shark-for-v3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-stericsson:
  video: drop code for ARCH_SHARK in cyber2000fb
  input: i8042: drop dependency on ARCH_SHARK
  ide: drop dependency on ARCH_SHARK
  block: drop dependency on ARCH_SHARK
  MAINTAINERS: delete Shark machine entry
  ARM: delete mach-shark

Signed-off-by: Olof Johansson <olof@lixom.net>
2013-09-25 21:55:46 -07:00
Linus Walleij
3c5710f6a2 block: drop dependency on ARCH_SHARK
With this machine deleted, there is no need to maintain the
MFM block driver for its hard disk either.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2013-09-25 15:25:17 +02:00
Dan Carpenter
58f09e00ae cciss: fix info leak in cciss_ioctl32_passthru()
The arg64 struct has a hole after ->buf_size which isn't cleared.  Or if
any of the calls to copy_from_user() fail then that would cause an
information leak as well.

This was assigned CVE-2013-2147.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:26 -07:00
Dan Carpenter
627aad1c01 cpqarray: fix info leak in ida_locked_ioctl()
The pciinfo struct has a two byte hole after ->dev_fn so stack
information could be leaked to the user.

This was assigned CVE-2013-2147.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:26 -07:00
Aaron Lu
8910700039 virtio: pm: use CONFIG_PM_SLEEP instead of CONFIG_PM
The freeze and restore functions defined in virtio drivers are used
for suspend and hibernate, so CONFIG_PM_SLEEP is more appropriate than
CONFIG_PM. This patch replace all CONFIG_PM with CONFIG_PM_SLEEP for
virtio drivers that implement freeze and restore callbacks.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-09-23 15:45:58 +09:30
Russell King
052d0efada DMA-API: block: nvme-core: replace dma_set_mask()+dma_set_coherent_mask() with new helper
Replace the following sequence:

	dma_set_mask(dev, mask);
	dma_set_coherent_mask(dev, mask);

with a call to the new helper dma_set_mask_and_coherent().

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2013-09-21 21:02:24 +01:00
Linus Torvalds
e9ff04dd94 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "These fix several bugs with RBD from 3.11 that didn't get tested in
  time for the merge window: some error handling, a use-after-free, and
  a sequencing issue when unmapping and image races with a notify
  operation.

  There is also a patch fixing a problem with the new ceph + fscache
  code that just went in"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  fscache: check consistency does not decrement refcount
  rbd: fix error handling from rbd_snap_name()
  rbd: ignore unmapped snapshots that no longer exist
  rbd: fix use-after free of rbd_dev->disk
  rbd: make rbd_obj_notify_ack() synchronous
  rbd: complete notifies before cleaning up osd_client and rbd_dev
  libceph: add function to ensure notifies are complete
2013-09-19 12:50:37 -05:00
Martin Schwidefsky
0244ad004a Remove GENERIC_HARDIRQ config option
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13 15:09:52 +02:00
Joe Perches
666dc7c90a pktcdvd: fix defective misuses of pkt_<level>
Fix thinkos where pkt_<level> needs a valid pktcdvd_device * and the
pointer is known to be NULL.

Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> (go smatch!)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:34 -07:00
Joe Perches
f3ded788bb pktcdvd: add struct pktcdvd_device * to pkt_dump_sense()
Allow the device name to be emitted with pkt_err when logging the sense
data.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:33 -07:00
Joe Perches
0c075d64df pktcdvd: convert pr_info to pkt_info
Add a new pkt_info macro to prefix the name to the logging output.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:33 -07:00
Joe Perches
ca73dabc3d pktcdvd: convert pr_notice to pkt_notice
Add a new pkt_notice macro to prefix the name to the logging output.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
fa63c0ab81 pktcdvd: add struct pktcdvd_device.name to pr_err logging where possible
Add a new pkt_err macro to prefix the name to the logging output.  Convert
pr_err where there is a non-null struct pktcdvd_device.

Includes improvements from Andy Shevchenko.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
844aa79743 pktcdvd: add struct pktcdvd_device * to pkt_dbg
Add pd->name to output for these debugging messages.

Remove normally compiled out pkt_dbg(2, ...) function entry tracing
equivalents as it's better done via the function tracer.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
cd3f2cd05c pktcdvd: consolidate DPRINTK and VPRINTK macros
Use the more common pkt_dbg(level, fmt, ...) form.

These messages are emitted at KERN_NOTICE.

Always emit function name with pkt_dbg(2, ...) uses and remove the
sometimes abbreviated embedded function name.

This form always verifies the format and arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:32 -07:00
Joe Perches
99481334bc pktcdvd: convert printk to pr_<level>
Use a more current logging style and add messages levels to the logging
messages.

Simplify pkt_dump_sense by using %*ph and adding a simple function to emit
the sense string.

Includes improvements from Andy Shevchenko and Dan Carpenter.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:30 -07:00
Joe Perches
5323fb770b pktcdvd: convert ZONE macro to static function get_zone()
Macros should be converted to functions where feasible to
verify arguments and the like.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:30 -07:00
Ed Cashin
fea1b13973 aoe: do not BUG if memory pressure prevented debugfs file creation
If the system has trouble allocating memory for the creation of the aoe
debugfs directory or of a file inside it, the debugfs member of an aoedev
can be NULL.

Do not treat a NULL debugfs pointer as a BUG on aoedev shutdown, avoiding
the user impact of an unecessary panic.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:28 -07:00
Andy Shevchenko
e0ec360597 aoe: suppress compiler warnings
This patch fixes following compiler warnings:

  drivers/block/aoe/aoecmd.c: In function `aoecmd_ata_rw':
  drivers/block/aoe/aoecmd.c:383:17: warning: variable `t' set but not used [-Wunused-but-set-variable]
    struct aoetgt *t;
                   ^
  drivers/block/aoe/aoecmd.c: In function `resend':
  drivers/block/aoe/aoecmd.c:488:21: warning: variable `ah' set but not used [-Wunused-but-set-variable]
    struct aoe_atahdr *ah;
                       ^

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:27 -07:00
Andy Shevchenko
a88c1f0cac aoe: remove custom implementation of kbasename()
In the kernel we have a nice helper that may be used here. This patch
substitutes the custom implementation by the native function call.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:26 -07:00
Ed Cashin
896dcd9a64 aoe: update internal version number to 85
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:26 -07:00
Ed Cashin
ec345120c5 aoe: update copyright date
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:25 -07:00
Ed Cashin
2256c1c51e aoe: fill in per-AoE-target information for debugfs file
This information is presented in a compact format that has evolved for
easy routine scanning by expert humans, mostly developers and support
technicians helping to troubleshoot or test AoE-based systems.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:25 -07:00
Ed Cashin
1cf94797c2 aoe: provide file operations for debugfs files
The place holder in the file contents is filled out in the following
patch.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:24 -07:00
Ed Cashin
e8866cf2b9 aoe: add AoE-target files to debugfs
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:23 -07:00
Ed Cashin
190519cd30 aoe: create and destroy debugfs directory for aoe
This series adds the debugging information that the coraid.com-distributed
aoe driver exports via sysfs, but instead of sysfs, it uses debugfs.

With these patches applied, even without AoE targets on the network, KEDR
reports new possible memory leaks, but these are from callers outside the
aoe driver that have used aoe_devnode to get the name of the character
devices through the aoe_class->devnode callback, and I believe they're
responsible for freeing that memory.

This patch:

Create and destroy the debugfs directory.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:22 -07:00
Jingoo Han
c07303c0af drivers/block/swim.c: remove unnecessary platform_set_drvdata()
The driver core clears the driver data to NULL after device_release or
on probe failure.  Thus, it is not needed to manually clear the device
driver data to NULL.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:59 -07:00
Mike Miller
e7b18ede44 cciss: set max scatter gather entries to 32 on P600
At one time we used to set the maximum number of scatter gather elements
on all Smart Array controllers to 32.  At some point in time the
firmware began to write the "appropriate" value for each controller into
the config table.  The cciss driver would then read that and set
h->maxsgentries.

        h->maxsgentries = readl(&(h->cfgtable->MaxSGElements);

On the P600 that value is 544.  Under some workloads a significant
performance reduction may result.  This patch forces the P600 to use
only 32 scatter gather elements.  Other controllers are not affected.

Signed-off-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Dwight (Bud) Brown <bubrown@redhat.com>
Signed-off-by: Tomas Henzl <thenzl@redhat.com>
Acked-by: Stephen M. Cameron <steve.cameron@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:59 -07:00
Jingoo Han
c86db975c8 drivers/block/mg_disk.c: make mg_times_out() static
mg_times_out() is used only in this file.  Fix the following sparse
warning:

  drivers/block/mg_disk.c:639:6: warning: symbol 'mg_times_out' was not declared. Should it be static?

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:59 -07:00
Jingoo Han
bb8e0e84b3 block: replace strict_strtoul() with kstrtoul()
The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete.  Thus, kstrtoul() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:56 -07:00
Josh Durgin
da6a6b6397 rbd: fix error handling from rbd_snap_name()
rbd_snap_name() calls rbd_dev_v{1,2}_snap_name() depending on the
format of the image. The format 1 version returns NULL on error, which
is handled by the caller. The format 2 version returns an ERR_PTR,
which the caller of rbd_snap_name() does not expect.

Fortunately this is unlikely to occur in practice because
rbd_snap_id_by_name() is called before rbd_snap_name(). This would hit
similar errors to rbd_snap_name() (like the snapshot not existing) and
return early, so rbd_snap_name() would not hit an error unless the
snapshot was removed between the two calls or memory was exhausted.

Use an ERR_PTR in rbd_dev_v1_snap_name() so that the specific error
can be propagated, and it is consistent with rbd_dev_v2_snap_name().
Handle the ERR_PTR in the only rbd_snap_name() caller.

Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:44 -07:00
Josh Durgin
efadc98aab rbd: ignore unmapped snapshots that no longer exist
This prevents erroring out while adding a device when a snapshot
unrelated to the current mapping is deleted between reading the
snapshot context and reading the snapshot names. If the mapped
snapshot name is not found an error still occurs as usual.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:32 -07:00
Josh Durgin
9875201e10 rbd: fix use-after free of rbd_dev->disk
Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.

To fix this, check whether RBD_DEV_FLAG_REMOVING is set before
updating the disk size in rbd_dev_refresh(). In order to prevent a
race where rbd_dev_refresh() is already revalidating the disk when
rbd_remove() is called, move the call to rbd_bus_del_dev() after the
watch is unregistered and all notifies are complete. It's safe to
defer deleting this structure because no new requests can be submitted
once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be
opened.

Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:25 -07:00
Josh Durgin
20e0af67ce rbd: make rbd_obj_notify_ack() synchronous
The only user of rbd_obj_notify_ack() is rbd_watch_cb(). It used
asynchronously with no tracking of when the notify ack completes, so
it may still be in progress when the osd_client is shut down.  This
results in a BUG() since the osd client assumes no requests are in
flight when it stops. Since all notifies are flushed before the
osd_client is stopped, waiting for the notify ack to complete before
returning from the watch callback ensures there are no notify acks in
flight during shutdown.

Rename rbd_obj_notify_ack() to rbd_obj_notify_ack_sync() to reflect
its new synchronous nature.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:16:02 -07:00
Josh Durgin
9abc59908e rbd: complete notifies before cleaning up osd_client and rbd_dev
To ensure rbd_dev is not used after it's released, flush all pending
notify callbacks before calling rbd_dev_image_release(). No new
notifies can be added to the queue at this point because the watch has
already be unregistered with the osd_client.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:15:57 -07:00
Linus Torvalds
6cccc7d301 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This includes both the first pile of Ceph patches (which I sent to
  torvalds@vger, sigh) and a few new patches that add support for
  fscache for Ceph.  That includes a few fscache core fixes that David
  Howells asked go through the Ceph tree.  (Thanks go to Milosz Tanski
  for putting this feature together)

  This first batch of patches (included here) had (has) several
  important RBD bug fixes, hole punch support, several different
  cleanups in the page cache interactions, improvements in the truncate
  code (new truncate mutex to avoid shenanigans with i_mutex), and a
  series of fixes in the synchronous striping read/write code.

  On top of that is a random collection of small fixes all across the
  tree (error code checks and error path cleanup, obsolete wq flags,
  etc)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
  ceph: use d_invalidate() to invalidate aliases
  ceph: remove ceph_lookup_inode()
  ceph: trivial buildbot warnings fix
  ceph: Do not do invalidate if the filesystem is mounted nofsc
  ceph: page still marked private_2
  ceph: ceph_readpage_to_fscache didn't check if marked
  ceph: clean PgPrivate2 on returning from readpages
  ceph: use fscache as a local presisent cache
  fscache: Netfs function for cleanup post readpages
  FS-Cache: Fix heading in documentation
  CacheFiles: Implement interface to check cache consistency
  FS-Cache: Add interface to check consistency of a cached object
  rbd: fix null dereference in dout
  rbd: fix buffer size for writes to images with snapshots
  libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
  rbd: fix I/O error propagation for reads
  ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  ceph: allow sync_read/write return partial successed size of read/write.
  ceph: fix bugs about handling short-read for sync read mode.
  ceph: remove useless variable revoked_rdcache
  ...
2013-09-09 09:13:22 -07:00
Linus Torvalds
b409624ad5 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVM Express driver update from Matthew Wilcox.

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Merge issue on character device bring-up
  NVMe: Handle ioremap failure
  NVMe: Add pci suspend/resume driver callbacks
  NVMe: Use normal shutdown
  NVMe: Separate controller init from disk discovery
  NVMe: Separate queue alloc/free from create/delete
  NVMe: Group pci related actions in functions
  NVMe: Disk stats for read/write commands only
  NVMe: Bring up cdev on set feature failure
  NVMe: Fix checkpatch issues
  NVMe: Namespace IDs are unsigned
  NVMe: Update nvme_id_power_state with latest spec
  NVMe: Split header file into user-visible and kernel-visible pieces
  NVMe: Call nvme_process_cq from submission path
  NVMe: Remove "process_cq did something" message
  NVMe: Return correct value from interrupt handler
  NVMe: Disk IO statistics
  NVMe: Restructure MSI / MSI-X setup
  NVMe: Use kzalloc instead of kmalloc+memset
2013-09-07 20:19:02 -07:00
Keith Busch
d82e8bfdef NVMe: Merge issue on character device bring-up
A recent patch made it possible to bring up the character handle when the
device is responsive but not accepting a set-features command. Another
recent patch moved the initialization that requires we move where the
checks for this condition occur. This patch merges these two ideas so
it works much as before.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-06 16:26:58 -04:00
Linus Torvalds
2e515bf096 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "The usual trivial updates all over the tree -- mostly typo fixes and
  documentation updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
  doc: Documentation/cputopology.txt fix typo
  treewide: Convert retrun typos to return
  Fix comment typo for init_cma_reserved_pageblock
  Documentation/trace: Correcting and extending tracepoint documentation
  mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
  power: Documentation: Update s2ram link
  doc: fix a typo in Documentation/00-INDEX
  Documentation/printk-formats.txt: No casts needed for u64/s64
  doc: Fix typo "is is" in Documentations
  treewide: Fix printks with 0x%#
  zram: doc fixes
  Documentation/kmemcheck: update kmemcheck documentation
  doc: documentation/hwspinlock.txt fix typo
  PM / Hibernate: add section for resume options
  doc: filesystems : Fix typo in Documentations/filesystems
  scsi/megaraid fixed several typos in comments
  ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
  treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
  page_isolation: Fix a comment typo in test_pages_isolated()
  doc: fix a typo about irq affinity
  ...
2013-09-06 09:36:28 -07:00
Josh Durgin
c35455791c rbd: fix null dereference in dout
The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
the dout() always derefences it. Move this to another dout() protected
by a check that order is non-NULL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:08:51 -07:00
Josh Durgin
03507db631 rbd: fix buffer size for writes to images with snapshots
rbd_osd_req_create() needs to know the snapshot context size to create
a buffer large enough to send it with the message front. It gets this
from the img_request, which was not set for the obj_request yet. This
resulted in trying to write past the end of the front payload, hitting
this BUG:

libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len);

Fix this by associating the obj_request with its img_request
immediately after it's created, before the osd request is created.

Fixes: http://tracker.ceph.com/issues/5760
Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:08:46 -07:00
Josh Durgin
17c1cc1d92 rbd: fix I/O error propagation for reads
When a request returns an error, the driver needs to report the entire
extent of the request as completed.  Writes already did this, since
they always set xferred = length, but reads were skipping that step if
an error other than -ENOENT occurred.  Instead, rbd would end up
passing 0 xferred to blk_end_request(), which would always report
needing more data.  This resulted in an assert failing when more data
was required by the block layer, but all the object requests were
done:

[ 1868.719077] rbd: obj_request read result -108 xferred 0
[ 1868.719077]
[ 1868.719518] end_request: I/O error, dev rbd1, sector 0
[ 1868.719739]
[ 1868.719739] Assertion failure in rbd_img_obj_callback() at line 1736:
[ 1868.719739]
[ 1868.719739]   rbd_assert(more ^ (which == img_request->obj_request_count));

Without this assert, reads that hit errors would hang forever, since
the block layer considered them incomplete.

Fixes: http://tracker.ceph.com/issues/5647
CC: stable@vger.kernel.org  # v3.10
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-09-03 22:06:10 -07:00
Keith Busch
9d713c2bfb NVMe: Handle ioremap failure
Decrement the number of queues required for doorbell remapping until
the memory is successfully mapped for that size.

Additional checks are done so that we don't call free_irq if it has
already been freed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:44:25 -04:00
Keith Busch
cd63894630 NVMe: Add pci suspend/resume driver callbacks
Used for going in and out of low power states. Resuming reuses the IO
queues from the previous initialization, freeing any allocated queues
that are no longer usable.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:44:16 -04:00
Keith Busch
1894d8f16a NVMe: Use normal shutdown
The NVMe spec recommends using the shutdown normal sequence when safely
taking the controller offline instead of hitting CC.EN on the next
start-up to reset the controller. The spec recommends a minimum of 1
second for the shutdown complete. This patch waits 2 seconds to be on
the safe side.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:40:32 -04:00
Keith Busch
f0b50732a9 NVMe: Separate controller init from disk discovery
This combines the controller initialization into one function, removing
IO queue setup from namespace discovery, and creates symetric functions
for device removal. The controller start and shutdown functions can now
be called from resume/suspend context as well as probe/remove.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:40:28 -04:00
Keith Busch
2240427425 NVMe: Separate queue alloc/free from create/delete
This separates nvme queue allocation from creation, and queue deletion
from freeing. This is so that we may in the future temporarily disable
queues and reuse the same memory when bringing them back online, like
coming back from suspend state.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:39:28 -04:00
Keith Busch
0877cb0d28 NVMe: Group pci related actions in functions
This will make it easier to reuse these outside probe/remove.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:39:25 -04:00
Keith Busch
9e59d091b0 NVMe: Disk stats for read/write commands only
Flush and discard requests would previously mess up the accounting.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:33:35 -04:00
Keith Busch
7e03b12406 NVMe: Bring up cdev on set feature failure
This patch creates the character device as long as a device's admin queues
are usable so a user has an opprotunity to perform administration tasks.
A device may be in a state that does not allow IO and setting the queue
count feature in such a state returns an error. Previously the driver
would bail and the controller would be unusable.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2013-09-03 16:32:26 -04:00
Keith Busch
1b56749e54 NVMe: Fix checkpatch issues
Signed-off-by: Keith Busch <keith.busch@intel.com>
2013-09-03 16:32:26 -04:00
Matthew Wilcox
c3bfe7176c NVMe: Namespace IDs are unsigned
The 'Number of Namespaces' read from the device was being treated as
signed, which would cause us to not scan any namespaces for a device
with more than 2 billion namespaces.  That led to noticing that the
namespace ID was also being treated as signed, which could lead to the
result from NVME_IOCTL_ID being treated as an error code.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:32:26 -04:00
Linus Torvalds
542a086ac7 Driver core patches for 3.12-rc1
Here's the big driver core pull request for 3.12-rc1.
 
 Lots of tiny changes here fixing up the way sysfs attributes are
 created, to try to make drivers simpler, and fix a whole class race
 conditions with creations of device attributes after the device was
 announced to userspace.
 
 All the various pieces are acked by the different subsystem maintainers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.21 (GNU/Linux)
 
 iEYEABECAAYFAlIlIPcACgkQMUfUDdst+ynUMwCaAnITsxyDXYQ4DqEsz8EcOtMk
 718AoLrgnUZs3B+70AT34DVktg4HSThk
 =USl9
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core patches from Greg KH:
 "Here's the big driver core pull request for 3.12-rc1.

  Lots of tiny changes here fixing up the way sysfs attributes are
  created, to try to make drivers simpler, and fix a whole class race
  conditions with creations of device attributes after the device was
  announced to userspace.

  All the various pieces are acked by the different subsystem
  maintainers"

* tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
  firmware loader: fix pending_fw_head list corruption
  drivers/base/memory.c: introduce help macro to_memory_block
  dynamic debug: line queries failing due to uninitialized local variable
  sysfs: sysfs_create_groups returns a value.
  debugfs: provide debugfs_create_x64() when disabled
  rbd: convert bus code to use bus_groups
  firmware: dcdbas: use binary attribute groups
  sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
  driver core: add #include <linux/sysfs.h> to core files.
  HID: convert bus code to use dev_groups
  Input: serio: convert bus code to use drv_groups
  Input: gameport: convert bus code to use drv_groups
  driver core: firmware: use __ATTR_RW()
  driver core: core: use DEVICE_ATTR_RO
  driver core: bus: use DRIVER_ATTR_WO()
  driver core: create write-only attribute macros for devices and drivers
  sysfs: create __ATTR_WO()
  driver-core: platform: convert bus code to use dev_groups
  workqueue: convert bus code to use dev_groups
  MEI: convert bus code to use dev_groups
  ...
2013-09-03 11:37:15 -07:00
Greg Kroah-Hartman
b15a21ddda rbd: convert bus code to use bus_groups
The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead.  This converts the RBD bus code to use the
correct field.

Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Acked-by: Alex Elder <elder@linaro.org>
Cc: <ceph-devel@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-27 22:07:49 -07:00
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Sage Weil
ee3e542fec Merge remote-tracking branch 'linus/master' into testing 2013-08-15 11:11:45 -07:00
Ed Cashin
fb32975d1b aoe: adjust ref of head for compound page tails
Fix a BUG which can trigger when direct-IO is used with AOE.

As discussed previously, the fact that some users of the block layer
provide bios that point to pages with a zero _count means that it is not
OK for the network layer to do a put_page on the skb frags during an
skb_linearize, so the aoe driver gets a reference to pages in bios and
puts the reference before ending the bio.  And because it cannot use
get_page on a page with a zero _count, it manipulates the value
directly.

It is not OK to increment the _count of a compound page tail, though,
since the VM layer will VM_BUG_ON a non-zero _count.  Block users that
do direct I/O can result in the aoe driver seeing compound page tails in
bios.  In that case, the same logic works as long as the head of the
compound page is used instead of the tails.  This patch handles compound
pages and does not BUG.

It relies on the block layer user leaving the relationship between the
page tail and its head alone for the duration between the submission of
the bio and its completion, whether successful or not.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:48 -07:00
Jingoo Han
a158073c43 block: rbd: use NULL instead of 0
The local variables such as 'bio_list', and 'pages' are pointers;
thus, use NULL instead of 0 to fix the following sparse warnings.

drivers/block/rbd.c:2166:32: warning: Using plain integer as NULL pointer
drivers/block/rbd.c:2168:31: warning: Using plain integer as NULL pointer

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:40 -07:00
Linus Torvalds
d4c90b1b9f Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver bits from Jens Axboe:
 "As I mentioned in the core block pull request, due to real life
  circumstances the driver pull request would be late.  Now it looks
  like -rc2 late...  On the plus side, apart form the rsxx update, these
  are all things that I could argue could go in later in the cycle as
  they are fixes and not features.  So even though things are late, it's
  not ALL bad.

  The pull request contains:

   - Updates to bcache, all bug fixes, from Kent.

   - A pile of drbd bug fixes (no big features this time!).

   - xen blk front/back fixes.

   - rsxx driver updates, some of them deferred form 3.10.  So should be
     well cooked by now"

* 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
  bcache: Allocation kthread fixes
  bcache: Fix GC_SECTORS_USED() calculation
  bcache: Journal replay fix
  bcache: Shutdown fix
  bcache: Fix a sysfs splat on shutdown
  bcache: Advertise that flushes are supported
  bcache: check for allocation failures
  bcache: Fix a dumb race
  bcache: Use standard utility code
  bcache: Update email address
  bcache: Delete fuzz tester
  bcache: Document shrinker reserve better
  bcache: FUA fixes
  drbd: Allow online change of al-stripes and al-stripe-size
  drbd: Constants should be UPPERCASE
  drbd: Ignore the exit code of a fence-peer handler if it returns too late
  drbd: Fix rcu_read_lock balance on error path
  drbd: fix error return code in drbd_init()
  drbd: Do not sleep inside rcu
  bcache: Refresh usage docs
  ...
2013-07-22 19:02:52 -07:00
Linus Torvalds
72c1c2f1be Merge branch 'next' of git://git.monstr.eu/linux-2.6-microblaze
Pull microblaze update from Michal Simek:
 "This Microblaze merge window is quite minimal.

  I have also added to my branch one xilinx systemace sparse fix because
  haven't got any reply from block maintainer."

* 'next' of git://git.monstr.eu/linux-2.6-microblaze:
  xilinx systemace: Fix sparse warnings
  microblaze: Move __NR_syscalls from uapi
  microblaze: Enable KGDB in defconfig
  microblaze: Don't mark arch_kgdb_ops as const.
2013-07-10 10:16:07 -07:00
Michal Simek
4937a269cb xilinx systemace: Fix sparse warnings
Fix sysace sparse warnings.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2013-07-10 07:47:12 +02:00
Linus Torvalds
9a5889ae1c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is some follow-on RBD cleanup after the last window's code drop,
  a series from Yan fixing multi-mds behavior in cephfs, and then a
  sprinkling of bug fixes all around.  Some warnings, sleeping while
  atomic, a null dereference, and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (36 commits)
  libceph: fix invalid unsigned->signed conversion for timespec encoding
  libceph: call r_unsafe_callback when unsafe reply is received
  ceph: fix race between cap issue and revoke
  ceph: fix cap revoke race
  ceph: fix pending vmtruncate race
  ceph: avoid accessing invalid memory
  libceph: Fix NULL pointer dereference in auth client code
  ceph: Reconstruct the func ceph_reserve_caps.
  ceph: Free mdsc if alloc mdsc->mdsmap failed.
  ceph: remove sb_start/end_write in ceph_aio_write.
  ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.
  ceph: fix sleeping function called from invalid context.
  ceph: move inode to proper flushing list when auth MDS changes
  rbd: fix a couple warnings
  ceph: clear migrate seq when MDS restarts
  ceph: check migrate seq before changing auth cap
  ceph: fix race between page writeback and truncate
  ceph: reset iov_len when discarding cap release messages
  ceph: fix cap release race
  libceph: fix truncate size calculation
  ...
2013-07-09 12:39:10 -07:00
Linus Torvalds
7f0ef0267e Merge branch 'akpm' (updates from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 - various misc bits
 - I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
   distracted.  There has been quite a bit of activity.
 - About half the MM queue
 - Some backlight bits
 - Various lib/ updates
 - checkpatch updates
 - zillions more little rtc patches
 - ptrace
 - signals
 - exec
 - procfs
 - rapidio
 - nbd
 - aoe
 - pps
 - memstick
 - tools/testing/selftests updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits)
  tools/testing/selftests: don't assume the x bit is set on scripts
  selftests: add .gitignore for kcmp
  selftests: fix clean target in kcmp Makefile
  selftests: add .gitignore for vm
  selftests: add hugetlbfstest
  self-test: fix make clean
  selftests: exit 1 on failure
  kernel/resource.c: remove the unneeded assignment in function __find_resource
  aio: fix wrong comment in aio_complete()
  drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
  drivers/memstick/host/r592.c: convert to module_pci_driver
  drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
  pps-gpio: add device-tree binding and support
  drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
  drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
  drivers/parport/share.c: use kzalloc
  Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
  aoe: update internal version number to v83
  aoe: update copyright date
  aoe: perform I/O completions in parallel
  ...
2013-07-03 17:12:13 -07:00
Ed Cashin
94ac11833f aoe: update internal version number to v83
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Ed Cashin
ca47bbd93c aoe: update copyright date
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Ed Cashin
8030d34397 aoe: perform I/O completions in parallel
Some users have a large AoE target while others like to use many AoE
targets at the same time.  In the latter case, there is an opportunity to
greatly improve aggregate throughput by allowing different threads to
complete the I/O associated with each target.  For 36 targets, 4 KiB read
throughput roughly doubles, for example, with these changes in place.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Paul Clements
c378f70adb nbd: correct disconnect behavior
Currently, when a disconnect is requested by the user (via NBD_DISCONNECT
ioctl) the return from NBD_DO_IT is undefined (it is usually one of
several error codes).  This means that nbd-client does not know if a
manual disconnect was performed or whether a network error occurred.
Because of this, nbd-client's persist mode (which tries to reconnect after
error, but not after manual disconnect) does not always work correctly.

This change fixes this by causing NBD_DO_IT to always return 0 if a user
requests a disconnect.  This means that nbd-client can correctly either
persist the connection (if an error occurred) or disconnect (if the user
requested it).

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Acked-by: Rob Landley <rob@landley.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Michal Belczyk
9532f149ee nbd: remove bogus BUG_ON in NBD_CLEAR_QUE
The NBD_CLEAR_QUE ioctl has been deprecated for quite some time (its job
is now done by two other ioctls).  We should stop trying to make bogus
assertions in it.  Also, user-level code should remove calls to
NBD_CLEAR_QUE, ASAP.

Signed-off-by: Michal Belczyk <belczyk@bsd.krakow.pl>
Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:05 -07:00
Kees Cook
f170168b9a drivers: avoid parsing names as kthread_run() format strings
Calling kthread_run with a single name parameter causes it to be handled
as a format string. Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:41 -07:00
Kees Cook
ffc8b30866 block: do not pass disk names as format strings
Disk names may contain arbitrary strings, so they must not be
interpreted as format strings.  It seems that only md allows arbitrary
strings to be used for disk names, but this could allow for a local
memory corruption from uid 0 into ring 0.

CVE-2013-2851

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:25 -07:00
Sage Weil
e976cad0f0 rbd: fix a couple warnings
gcc isn't quite smart enough and generates these warnings:

drivers/block/rbd.c: In function 'rbd_img_request_fill':
drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here
drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized]

even though they are initialized for their respective code paths.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:50 -07:00
Alex Elder
d552c6191b rbd: take a little credit
Add a name to the list of authors.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:44 -07:00
Alex Elder
cfbf6377b6 rbd: use rwsem to protect header updates
Updating an image header needs to be protected to ensure it's
done consistently.  However distinct headers can be updated
concurrently without a problem.  Instead of using the global
control lock to serialize headder updates, just rely on the header
semaphore.  (It's already used, this just moves it out to cover
a broader section of the code.)

That leaves the control mutex protecting only the creation of rbd
clients, so rename it.

This resolves:
    http://tracker.ceph.com/issues/5222

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:43 -07:00
Alex Elder
1ba0f1e797 rbd: don't hold ctl_mutex to get/put device
When an rbd device is first getting mapped, its device registration
is protected the control mutex.  There is no need to do that though,
because the device has already been assigned an id that's guaranteed
to be unique.

An unmap of an rbd device won't proceed if the device has a non-zero
open count or is already being unmapped.  So there's no need to hold
the control mutex in that case either.

Finally, an rbd device can't be opened if it is being removed, and
it won't go away if there is a non-zero open count.  So here too
there's no need to hold the control mutex while getting or putting a
reference to an rbd device's Linux device structure.

Drop the mutex calls in these cases.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:42 -07:00
Alex Elder
82a442d239 rbd: protect against concurrent unmaps
Make sure two concurrent unmap operations on the same rbd device
won't collide, by only proceeding with the removal and cleanup of a
device if is not already underway.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:41 -07:00
Alex Elder
751cc0e3cf rbd: set removing flag while holding list lock
When unmapping a device, its id is supplied, and that is used to
look up which rbd device should be unmapped.  Looking up the
device involves searching the rbd device list while holding
a spinlock that protects access to that list.

Currently all of this is done under protection of the control lock,
but that protection is going away soon.  To ensure the rbd_dev is
still valid (still on the list) while setting its REMOVING flag, do
so while still holding the list lock.  To do so, get rid of
__rbd_get_dev(), and open code what it did in the one place it
was used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:41 -07:00
Alex Elder
08f75463c1 rbd: protect against duplicate client creation
If more than one rbd image has the same ceph cluster configuration
(same options, same set of monitors, same keys) they normally share
a single rbd client.

When an image is getting mapped, rbd looks to see if an existing
client can be used, and creates a new one if not.

The lookup and creation are not done under a common lock though, so
mapping two images concurrently could lead to duplicate clients
getting set up needlessly.  This isn't a major problem, but it's
wasteful and different from what's intended.

This patch fixes that by using the control mutex to protect
both the lookup and (if needed) creation of the client.  It
was previously used just when creating.

This resolves:
    http://tracker.ceph.com/issues/3094

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:39 -07:00
Alex Elder
3b5cf2a2f1 rbd: clean up a few things in the refresh path
This includes a few relatively small fixes I found while examining
the code that refreshes image information.

This resolves:
    http://tracker.ceph.com/issues/5040

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:38 -07:00
Alex Elder
e215605417 rbd: flush dcache after zeroing page data
Neither zero_bio_chain() nor zero_pages() contains a call to flush
caches after zeroing a portion of a page.  This can cause problems
on architectures that have caches that allow virtual address
aliasing.

This resolves:
    http://tracker.ceph.com/issues/4777

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-03 15:32:37 -07:00
Linus Torvalds
baa6f82093 Simple warning fix for module sections. If too late to pull, no big deal.
Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0NO+AAoJENkgDmzRrbjxS0YP/RoI5WguUDVx8tIdS29pwe9L
 Z5U0uIDUBZkNftlvvkdqdGjU1MeejxoMIs6cMz6ScY6YEI7li1AzuNApRadflnMx
 Do9u999NylI6q2oCLrt9y/eliJ/p5hkeCbhDrGdCgB7rPCjEAp/+Eu9Kq7FO8AYr
 Qu0rpLjvJ/5c5lsLAmUn95lZZ/uc/EPHasfZDiR8eGHRoAvPeyJ5qR06MpHFcl33
 2EKwR8RCYP9yUdqWnSKlC4LTuIpNK/BjEpHcQK2UPSn49vR8B3gVz5VXNP/7nIYQ
 hCr3VGoEOBoAlsD38IuKycXQ7vmBTU2hq7bQPq7Z6wyH96Sb6xFmVqT1Dm7uAI4+
 3OU69ugQb7WM3hQjeRRFvn5WjPqu8THwJKU7vhEf+xA09+v09QFjbHoEFabrFDlo
 MqVVhelkT6Mo8bBrxl8sZc/CG+aSl/kLQfyXWIcR2MQEAlfYKOYuDz+5TXtMZHaR
 SwlxgE2wbPP54v27yQWfBWY4d7dtJ+OX25GW1yzJdsNUFg7fQ9zzMnrv1prsUlog
 bxs4P0JcBOMLZx2IWm0R2kmbRFYjPgj6Kr3VnrIhZGesG1cwQB2QNvEPEkCtl6kq
 8HdX8Tua3dpfKAhDxnlztpVcJUmTuLn95jMWjW1KhCcjOqqJbaElq9ubplP2eMtr
 7zlfzFq9RsBKG0+J98Gt
 =vZYj
 -----END PGP SIGNATURE-----
mergetag object b3087e48ce
 type commit
 tag virtio-next-for-linus
 tagger Rusty Russell <rusty@rustcorp.com.au> 1372639977 +0930
 
 Was away, but it's all trivial and been sitting in linux-next.  So if you don't
 pull, no electrons will be harmed.
 
 Thanks,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0NMjAAoJENkgDmzRrbjxf3EQALoIAuQAjiEcHTGkLd/hGMJk
 mDiN14CXNH0siKASlCfcAxudJxKfIRM1tYzMFV1e7DtC7ZAH43m/SYBmXC4hRpIz
 nHsv57xtUg8bGZgb2OZKX/L0rqiE5wxtK7ZWGIPh3aE4uZKif+YvSBNJniO/HRTl
 UVDbrp+sbZzbT8NIHQ8d2vdtfG7rCNRkVaaETxk1wguu2TGJkyt0xuU3zT54We0L
 M8rJuHyTplgEMuDUydmNi9uNPZJZZY7QqoZ7s+ubXELo2FcFeyqWP2oHZnywESok
 fWOfhgdgIgmTED/eXS9wGwB7+K6azKg2chA23QZ91ncX/Ts141sEAd9ezwuQZc+5
 VS/Lo2mRvWbEcG78uYGUz+ukltZv0Og6uOvfSfCrALcVA4pYNFJTjbRu+Io+qAml
 082byeBUT/eurZr7xE30b4aTvZV2bYfqFkVynjF9ugNvTTjMiruw8uAzep6DD1gl
 38XRwtrk6MHLAFuxsyNV1FgunIWB6zvXNjMTUUVDLAQCbZwxZArr9gDd4zUG9VTS
 EbA9TIulSMMLUve5zZQyByqRLlzmzDVghMOZVbYUOQN4uNxbAPoXZ87tpiRhk8Oc
 XW+LRABjwZOjiQ9SDWRxxb8p2WTVyjuwIOTE0pLCuEBw6RFmjGga+PzH9rDygJnF
 zEIbVDRT9Jj7rOtxSabE
 =p6z7
 -----END PGP SIGNATURE-----

Merge tags 'modules-next-for-linus' and 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull trivial module and virtio fixes from Rusty Russell.

Apparently these were meant for 3.10, but came in after the release.

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost.c: Add .text.unlikely to TEXT_SECTIONS

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  virtio: remove virtqueue_add_buf().
  lguest: rename i386_head.S
  virtio_blk: Add missing 'static' qualifiers
  virtio: console: Add emergency writeonly register to config space
  virtio_pci: better macro exported in uapi
2013-07-03 13:09:06 -07:00
Linus Torvalds
0e97456ab5 Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven.

* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k/q40: Enable PC parallel port in defconfig
  m68k/q40: Undefine insl/outsl before redefining them
  m68k/uaccess: Fix asm constraints for userspace access
  swim: Release memory region after incorrect return/goto
  m68k/irq: Vector ints need a valid interrupt handler
  m68k/math-emu: unsigned issue, 'unsigned long' will never be less than zero
  m68k: remove CONFIG_EARLY_PRINTK dependency on CONFIG_EMBEDDED, default to n
  m68k/sun3: remove inline marking of EXPORT_SYMBOL functions
  [SCSI] a3000: use module_platform_driver_probe()
  [SCSI] a4000t: use module_platform_driver_probe()
  m68k: Remove inline strcpy() and strcat() implementations
2013-07-03 11:11:23 -07:00
Linus Torvalds
63580e51bb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS patches (part 1) from Al Viro:
 "The major change in this pile is ->readdir() replacement with
  ->iterate(), dealing with ->f_pos races in ->readdir() instances for
  good.

  There's a lot more, but I'd prefer to split the pull request into
  several stages and this is the first obvious cutoff point."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (67 commits)
  [readdir] constify ->actor
  [readdir] ->readdir() is gone
  [readdir] convert ecryptfs
  [readdir] convert coda
  [readdir] convert ocfs2
  [readdir] convert fatfs
  [readdir] convert xfs
  [readdir] convert btrfs
  [readdir] convert hostfs
  [readdir] convert afs
  [readdir] convert ncpfs
  [readdir] convert hfsplus
  [readdir] convert hfs
  [readdir] convert befs
  [readdir] convert cifs
  [readdir] convert freevxfs
  [readdir] convert fuse
  [readdir] convert hpfs
  reiserfs: switch reiserfs_readdir_dentry to inode
  reiserfs: is_privroot_deh() needs only directory inode, actually
  ...
2013-07-02 09:28:37 -07:00
Jens Axboe
5f0e5afa0d Linux 3.10-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQEbBAABAgAGBQJRxf9cAAoJEHm+PkMAQRiGMWkH911xM4gRmFgE7SqVW4F4AWBm
 ngcqMqNy9IdqKfibORUUDvVfEa5gjD5ai2quIKpfQiaukbpQJ696H90ijuAkajLn
 DQBrN243s0pzhhc/quWINnWxsFQ613JjdUMUMaD7e9A1aKjYzWrPGt/tSjrFXGCP
 tArTupVzc/iOmnEQDKiROI/Nokq44QJ36aTGPM7n08xMtpKmkCXM+9/UosBteB0O
 HVI33dmjwz7i55fI53XAWyuZCE+gSEnA4z8spJ9LfXso2W14V+roc+GuL6OyeeTI
 pCn/+4niVPb4B0ROZlpyVmdZjbPPcMMEK5o+BSJI68SH6LHZTQh2iVuqYfpSyA==
 =uUH5
 -----END PGP SIGNATURE-----

Merge tag 'v3.10-rc7' into for-3.11/drivers

Linux 3.10-rc7

Pull this in early to avoid doing it with the bcache merge,
since there are a number of changes to bcache between my old
base (3.10-rc1) and the new pull request.
2013-07-02 08:31:48 +02:00
Alex Elder
912c317d46 rbd: drop original request earlier for existence check
The reference to the original request dropped at the end of
rbd_img_obj_exists_callback() corresponds to the reference taken
in rbd_img_obj_exists_submit() to account for the stat request
referring to it.  Move the put of that reference up right after
clearing that pointer to make its purpose more obvious.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-07-01 09:52:02 -07:00
Geert Uytterhoeven
491205a8b4 rbd: Use min_t() to fix comparison of distinct pointer types warning
drivers/block/rbd.c: In function ‘zero_pages’:
drivers/block/rbd.c:1102: warning: comparison of distinct pointer types lacks a cast

Remove the hackish casts and use min_t() to fix this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:01 -07:00
Linus Torvalds
bd2931b5cf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This is a recently spotted regression in the snapshot behavior...

  It turns out several tests weren't being run in the nightlies so this
  took a while to spot"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: send snapshot context with writes
2013-06-29 10:31:15 -07:00
Al Viro
83a8761142 move linux/loop.h to drivers/block
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:45 +04:00
Philipp Reisner
d752b26960 drbd: Allow online change of al-stripes and al-stripe-size
Allow to change the AL layout with an resize operation. For that
the reisze command gets two new fields: al_stripes and al_stripe_size.

In order to make the operation crash save:
1) Lock out all IO and MD-IO
2) Write the super block with MDF_PRIMARY_IND clear
3) write the bitmap to the new location (all zeros, since
   we allow only while connected)
4) Initialize the new AL-area
5) Write the super block with the restored MDF_PRIMARY_IND.
6) Unfreeze all IO

Since the AL-layout has no influence on the protocol, this operation
needs to be beforemed on both sides of a resource (if intended).

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Philipp Reisner
e96c96333f drbd: Constants should be UPPERCASE
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Philipp Reisner
28e448bb30 drbd: Ignore the exit code of a fence-peer handler if it returns too late
In case the connection was established and lost again before
the a fence-peer handler returns, ignore the exit code of this
instance. (And use the exit code of the later started instance)

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Andreas Gruenbacher
f9eb7bf424 drbd: Fix rcu_read_lock balance on error path
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Wei Yongjun
6110d70bdf drbd: fix error return code in drbd_init()
Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Andreas Gruenbacher
26ea8f9239 drbd: Do not sleep inside rcu
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Jens Axboe
f35546e072 Merge branch 'stable/for-jens-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers
Konrad writes:

It has the 'feature-max-indirect-segments' implemented in both backend
and frontend. The current problem with the backend and frontend is that the
segment size is limited to 11 pages. It means we can at most squeeze in 44kB per
request. The ring can hold 32 (next power of two below 36) requests, meaning we
can do 1.4M of outstanding requests. Nowadays that is not enough.

The problem in the past was addressed in two ways - but neither one went upstream.
The first solution to this proposed by Justin from Spectralogic was to negotiate
the segment size.  This means that the ‘struct blkif_sring_entry’ is now a variable size.
It can expand from 112 bytes (cover 11 pages of data - 44kB) to 1580 bytes
(256 pages of data - so 1MB). It is a simple extension by just making the array in the
request expand from 11 to a variable size negotiated. But it had limits: this extension
still limits the number of segments per request to 255 (as the total number must be
specified in the request, which only has an 8-bit field for that purpose).

The other solution (from Intel - Ronghui) was to create one extra ring that only has the
‘struct blkif_request_segment’ in them. The ‘struct blkif_request’ would be changed to have
an index in said ‘segment ring’. There is only one segment ring. This means that the size of
the initial ring is still the same. The requests would point to the segment and enumerate out
how many of the indexes it wants to use. The limit is of course the size of the segment.
If one assumes a one-page segment this means we can in one request cover ~4MB.

Those patches were posted as RFC and the author never followed up on the ideas on changing
it to be a bit more flexible.

There is yet another mechanism that could be employed  (which these patches implement) - and it
borrows from VirtIO protocol. And that is the ‘indirect descriptors’. This very similar to
what Intel suggests, but with a twist. The twist is to negotiate how many of these
'segment' pages (aka indirect descriptor pages) we want to support (in reality we negotiate
how many entries in the segment we want to cover, and we module the number if it is
bigger than the segment size).

This means that with the existing 36 slots in the ring (single page) we can cover:
32 slots * each blkif_request_indirect covers: 512 * 4096 ~= 64M. Since we ample space
in the blkif_request_indirect to span more than one indirect page, that number (64M)
can be also multiplied by eight = 512MB.

Roger Pau Monne took the idea and implemented them in these patches. They work
great and the corner cases (migration between backends with and without this extension)
work nicely. The backend has a limit right now off how many indirect entries
it can handle: one indirect page, and at maximum 256 entries (out of 512 - so  50% of the page
is used). That comes out to 32 slots * 256 entries in a indirect page * 1 indirect page
per request * 4096 = 32MB.

This is a conservative number that can change in the future. Right now it strikes
a good balance between giving excellent performance, memory usage in the backend, and
balancing the needs of many guests.

In the patchset there is also the split of the blkback structure to be per-VBD.
This means that the spinlock contention we had with many guests trying to do I/O and
all the blkback threads hitting the same lock has been eliminated.

Also there are bug-fixes to deal with oddly sized sectors, insane amounts on
th ring, and also a security fix (posted earlier).
2013-06-28 16:01:14 +02:00
Josh Durgin
d2d1f17a0d rbd: send snapshot context with writes
Sending the right snapshot context with each write is required for
snapshots to work. Due to the ordering of calls, the snapshot context
is never set for any requests. This causes writes to the current
version of the image to be reflected in all snapshots, which are
supposed to be read-only.

This happens because rbd_osd_req_format_write() sets the snapshot
context based on obj_request->img_request. At this point, however,
obj_request->img_request has not been set yet, to the snapshot context
is set to NULL. Fix this by moving rbd_img_obj_request_add(), which
sets obj_request->img_request, before the osd request formatting
calls.

This resolves:
    http://tracker.ceph.com/issues/5465

Reported-by: Karol Jurak <karol.jurak@gmail.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-06-27 05:55:29 -07:00
Linus Torvalds
78750f1908 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This fixes another problem with using v2 images on 3.10 due to the
  order in which fields are read from the image header.

  Hopefully this is the last one"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fetch object order before using it
2013-06-26 08:47:46 -10:00
Josh Durgin
1617e40c1e rbd: fetch object order before using it
rbd_dev_v2_header_onetime() fetches striping information, and
checks whether the image can be read by compariing the stripe unit
to the object size. It determines the object size by shifting
the object order, which is 0 at this point since it has not been
read yet. Move the call to get the image size and object order
before rbd_dev_v2_header_onetime() so it is set before use.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-25 12:27:31 -07:00
Roger Pau Monne
1e0f7a21b2 xen-blkback: check the number of iovecs before allocating a bios
With the introduction of indirect segments we can receive requests
with a number of segments bigger than the maximum number of allowed
iovecs in a bios, so make sure that blkback doesn't try to allocate a
bios with more iovecs than BIO_MAX_PAGES

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-25 10:00:58 -04:00
Matthew Wilcox
7d8224574c NVMe: Call nvme_process_cq from submission path
Since we have the queue locked, it makes sense to check if there are
any completion queue entries on the queue before we release the lock.
If there are, it may save an interrupt and reduce latency for the I/Os
that happened to complete.  This happens fairly often for some workloads.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 13:57:27 -04:00
Matthew Wilcox
bc57a0f7a4 NVMe: Remove "process_cq did something" message
I was originally intending to log the fact that the kthread had done
some work since it might help us find interrupt handling problems, but
that hasn't been done yet, and spamming the logs with this message is
just rude.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 13:57:08 -04:00
Joe Perches
957d6bf665 swim: Release memory region after incorrect return/goto
The code uses

	return foo;
	goto err_type;

when instead the form should have been

	ret = foo;
	goto err_type;

Here this causes a useful release_mem_region to be skipped.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Laurent Vivier <Laurent@Vivier.EU>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2013-06-24 19:44:19 +02:00
Matthew Wilcox
e9539f4752 NVMe: Return correct value from interrupt handler
The interrupt handler currently reports whether it found any new
completion queue entries.  If the completion queue is primarily being
processed by a method other than the interrupt handler, it may return
IRQ_NONE so often that Linux thinks that the interrupt is being falsely
triggered.

To solve this problem, report whether any completion queue entries have
been seen since the last interrupt was received for this queue.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 11:54:20 -04:00
Roger Pau Monne
294caaf29c xen-blkfront: set blk_queue_max_hw_sectors correctly
Now that indirect segments are enabled blk_queue_max_hw_sectors must
be set to match the maximum number of sectors we can handle in a
request.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-21 15:58:55 -04:00
Roger Pau Monne
2d9105433f xen-blkback: workaround compiler bug in gcc 4.1
The code generat with gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)
creates an unbound loop for the second foreach_grant_safe loop in
purge_persistent_gnt.

The workaround is to avoid having this second loop and instead
perform all the work inside the first loop by adding a new variable,
clean_used, that will be set when all the desired persistent grants
have been removed and we need to iterate over the remaining ones to
remove the WAS_ACTIVE flag.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Tom O'Neill <toneill@vmem.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-21 15:58:44 -04:00
Linus Torvalds
7ecba6f2f3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fix from Sage Weil:
 "This fixes a problem preventing the kernel and userland librbd
  libraries from sharing data with the new format 2 images"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: use the correct length for format 2 object names
2013-06-21 06:27:40 -10:00
Keith Busch
6198221fa0 NVMe: Disk IO statistics
Add io stats accounting for bio requests so nvme block devices show
useful disk stats.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-20 12:06:35 -04:00
Matthew Wilcox
063a8096f3 NVMe: Restructure MSI / MSI-X setup
The current code copies 'nr_io_queues' into 'q_count', modifies
'nr_io_queues' during MSI-X setup, then resets 'nr_io_queues' for
MSI setup.  Instead, copy 'nr_io_queues' into 'vecs' and modify 'vecs'
during both MSI-X and MSI setup.

This lets us simplify the for-loops that set up MSI-X and MSI, and opens
the possibility of using more I/O queues than we have interrupt vectors,
should future benchmarking prove that to be a useful feature.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-20 11:09:23 -04:00
Tushar Behera
03ea83e9a3 NVMe: Use kzalloc instead of kmalloc+memset
Use kzalloc instead of kmalloc and a susbsequent memset.

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-19 13:24:27 -04:00
Philip J Kelleher
36f988e978 rsxx: Adding in debugfs entries.
Adding debugfs entries to help with debugging and testing and
testing code.

pci_regs:
       	This entry will spit out all of the data stored on the BAR.

stats:
       	This entry will display all of the driver stats for each
       	DMA channel.

cram:
	This will allow read/write ability to the CRAM address space
	on our adapter's CPU.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
62302508f2 rsxx: Fixes incorrect stats calculation.
Fixing incorrect stats calculation during read retries.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
b8b225da13 rsxx: Adding EEH check inside cregs timeout.
Unfortunaly, our CPU register path does not do any kind of
EEH error checking. So to fix this issue, an ioread32 was
added to the CPU register timeout code. This way, the
driver can check to see if the timeout was caused by an EEH
error or not. This is a dummy read.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
3eb8dcafb5 rsxx: Adapter address space sanity check.
Adding a sanity check to guarentee that DMAs outside of the device's
address space will be errored out right away.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:10 +02:00
Philip J Kelleher
66bc600363 rsxx: Fixes DLPAR add kernel panic if partition still mounted.
A kernel panic would occur on a DLPAR add if there was a partition
still mounted during the DLPAR remove. This bug fix will allow the
user to unmount the partition and bring the driver back into a
good state after the DLPAR add.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
f730e3dc6d rsxx: Changing the adapter name to the official name.
Changing the adapter name from FlashSystem-80 to the official
name: Flash Adapter 900GB Full Height.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
fb065cd9e0 rsxx: Adding in sync_start module paramenter.
Before, the partition table would have to be reread because our
card was attached before it transistioned out of it's 'starting'
state.

This change will cause the driver to wait to attach the device
until the adapter is ready.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
7b379cc378 rsxx: Allow block size to be determined by configuration.
Previously, the block size was determined by whether or not
our Hardware could handle 512 byte accesses. Now, all of our
Hardware can handle 512 and 4096 block sizes.

This fix allows it to be user configurable.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
31a70bb444 rsxx: Fixes soft-lockup issues during DMAs.
The workqueue mechanism has been reworked to prevent soft
lockup issues from occuring by adding in mutex sychronization.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
0ab4743ebc rsxx: Restructured DMA cancel scheme.
Before, DMAs would never be cancelled if there was a data stall
or an EEH Permenant failure which would cause an unrecoverable
I/O hang.

The DMA cancellation mechanism has been modified to fix
these issues and allows DMAs to be cancelled during the
above mentioned events.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Philip J Kelleher
a3299ab185 rsxx: Individual workqueues for interruptible events.
Giving all interrupt based events their own workqueue to complete
tasks on. This fixes a bug that would cause creg commands to timeout
if too many are issued at once.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-19 13:52:09 +02:00
Konrad Rzeszutek Wilk
8e3f875554 xen/blkback: Check for insane amounts of request on the ring (v6).
Check that the ring does not have an insane amount of requests
(more than there could fit on the ring).

If we detect this case we will stop processing the requests
and wait until the XenBus disconnects the ring.

The existing check RING_REQUEST_CONS_OVERFLOW which checks for how
many responses we have created in the past (rsp_prod_pvt) vs
requests consumed (req_cons) and whether said difference is greater or
equal to the size of the ring, does not catch this case.

Wha the condition does check if there is a need to process more
as we still have a backlog of responses to finish. Note that both
of those values (rsp_prod_pvt and req_cons) are not exposed on the
shared ring.

To understand this problem a mini crash course in ring protocol
response/request updates is in place.

There are four entries: req_prod and rsp_prod; req_event and rsp_event
to track the ring entries. We are only concerned about the first two -
which set the tone of this bug.

The req_prod is a value incremented by frontend for each request put
on the ring. Conversely the rsp_prod is a value incremented by the backend
for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when
pushing the responses on the ring).  Both values can
wrap and are modulo the size of the ring (in block case that is 32).
Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details.

The culprit here is that if the difference between the
req_prod and req_cons is greater than the ring size we have a problem.
Fortunately for us, the '__do_block_io_op' loop:

	rc = blk_rings->common.req_cons;
	rp = blk_rings->common.sring->req_prod;

	while (rc != rp) {

		..
		blk_rings->common.req_cons = ++rc; /* before make_response() */

	}

will loop up to the point when rc == rp. The macros inside of the
loop (RING_GET_REQUEST) is smart and is indexing based on the modulo
of the ring size. If the frontend has provided a bogus req_prod value
we will loop until the 'rc == rp' - which means we could be processing
already processed requests (or responses) often.

The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is
b/c it only tracks how many responses we have internally produced
and whether we would should process more. The astute reader will
notice that the macro RING_REQUEST_CONS_OVERFLOW provides two
arguments - more on this later.

For example, if we were to enter this function with these values:

       	blk_rings->common.sring->req_prod =  X+31415 (X is the value from
		the last time __do_block_io_op was called).
        blk_rings->common.req_cons = X
        blk_rings->common.rsp_prod_pvt = X

The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons)
is doing:

	req_cons - rsp_prod_pvt >= 32

Which is,
	X - X >= 32 or 0 >= 32

And that is false, so we continue on looping (this bug).

If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp
instead (sring->req_prod) of rc, the this macro can do the check:

     req_prod - rsp_prov_pvt >= 32

Which is,
       X + 31415 - X >= 32 , or 31415 >= 32

which is true, so we can error out and break out of the function.

Unfortunatly the difference between rsp_prov_pvt and req_prod can be
at 32 (which would error out in the macro). This condition exists when
the backend is lagging behind with the responses and still has not finished
responding to all of them (so make_response has not been called), and
the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able
to use said macro.

Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does
a simple check of:

    req_prod - rsp_prod_pvt > RING_SIZE

And with the X values from above:

   X + 31415 - X > 32

Returns true. Also not that if the ring is full (which is where
the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the
same condition:

   X + 32 - X > 32

Which is false.

Lets use that macro.
Note that in v5 of this patchset the macro was different - we used an
earlier version.

Cc: stable@vger.kernel.org
[v1: Move the check outside the loop]
[v2: Add a pr_warn as suggested by David]
[v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
[v4: Move wake_up after kthread_stop as suggested by Jan]
[v5: Use RING_REQUEST_PROD_OVERFLOW instead]
[v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

gadsa
2013-06-17 15:17:16 -04:00
Josh Durgin
3a96d5cd7b rbd: use the correct length for format 2 object names
Format 2 objects use 16 characters for the object name suffix to be
able to express the full 64-bit range of object numbers. Format 1
images only use 12 characters for this. Using 12-character names for
format 2 caused userspace and kernel rbd clients to read differently
named objects, which made an image written by one client look empty to
the other client.

CC: stable@vger.kernel.org  # 3.9+
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-13 08:46:15 -07:00
Linus Torvalds
b2cc9c19e4 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Outside of bcache (which really isn't super big), these are all
  few-liners.  There are a few important fixes in here:

   - Fix blk pm sleeping when holding the queue lock

   - A small collection of bcache fixes that have been done and tested
     since bcache was included in this merge window.

   - A fix for a raid5 regression introduced with the bio changes.

   - Two important fixes for mtip32xx, fixing an oops and potential data
     corruption (or hang) due to wrong bio iteration on stacked devices."

* 'for-linus' of git://git.kernel.dk/linux-block:
  scatterlist: sg_set_buf() argument must be in linear mapping
  raid5: Initialize bi_vcnt
  pktcdvd: silence static checker warning
  block: remove refs to XD disks from documentation
  blkpm: avoid sleep when holding queue lock
  mtip32xx: Correctly handle bio->bi_idx != 0 conditions
  mtip32xx: Fix NULL pointer dereference during module unload
  bcache: Fix error handling in init code
  bcache: clarify free/available/unused space
  bcache: drop "select CLOSURES"
  bcache: Fix incompatible pointer type warning
2013-06-12 16:42:39 -07:00
Linus Torvalds
a568fa1c91 Merge branch 'akpm' (updates from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "Bunch of fixes and one little addition to math64.h"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
  include/linux/math64.h: add div64_ul()
  mm: memcontrol: fix lockless reclaim hierarchy iterator
  frontswap: fix incorrect zeroing and allocation size for frontswap_map
  kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
  mm: migration: add migrate_entry_wait_huge()
  ocfs2: add missing lockres put in dlm_mig_lockres_handler
  mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
  drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
  aio: fix io_destroy() regression by using call_rcu()
  rtc-at91rm9200: use shadow IMR on at91sam9x5
  rtc-at91rm9200: add shadow interrupt mask
  rtc-at91rm9200: refactor interrupt-register handling
  rtc-at91rm9200: add configuration support
  rtc-at91rm9200: add match-table compile guard
  fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
  swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
  drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
  cciss: fix broken mutex usage in ioctl
  audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
  drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
  ...
2013-06-12 16:29:53 -07:00
Stephen M. Cameron
03f47e888d cciss: fix broken mutex usage in ioctl
If a new logical drive is added and the CCISS_REGNEWD ioctl is invoked
(as is normal with the Array Configuration Utility) the process will
hang as below.  It attempts to acquire the same mutex twice, once in
do_ioctl() and once in cciss_unlocked_open().  The BKL was recursive,
the mutex isn't.

  Linux version 3.10.0-rc2 (scameron@localhost.localdomain) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri May 24 14:32:12 CDT 2013
  [...]
  acu             D 0000000000000001     0  3246   3191 0x00000080
  Call Trace:
    schedule+0x29/0x70
    schedule_preempt_disabled+0xe/0x10
    __mutex_lock_slowpath+0x17b/0x220
    mutex_lock+0x2b/0x50
    cciss_unlocked_open+0x2f/0x110 [cciss]
    __blkdev_get+0xd3/0x470
    blkdev_get+0x5c/0x1e0
    register_disk+0x182/0x1a0
    add_disk+0x17c/0x310
    cciss_add_disk+0x13a/0x170 [cciss]
    cciss_update_drive_info+0x39b/0x480 [cciss]
    rebuild_lun_table+0x258/0x370 [cciss]
    cciss_ioctl+0x34f/0x470 [cciss]
    do_ioctl+0x49/0x70 [cciss]
    __blkdev_driver_ioctl+0x28/0x30
    blkdev_ioctl+0x200/0x7b0
    block_ioctl+0x3c/0x40
    do_vfs_ioctl+0x89/0x350
    SyS_ioctl+0xa1/0xb0
    system_call_fastpath+0x16/0x1b

This mutex usage was added into the ioctl path when the big kernel lock
was removed.  As it turns out, these paths are all thread safe anyway
(or can easily be made so) and we don't want ioctl() to be single
threaded in any case.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Miller <mike.miller@hp.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:45 -07:00
Linus Torvalds
8d7a8fe2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a pair of fixes for double-frees in the recent bundle for
  3.10, a couple of fixes for long-standing bugs (sleep while atomic and
  an endianness fix), and a locking fix that can be triggered when osds
  are going down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup in rbd_add()
  rbd: don't destroy ceph_opts in rbd_add()
  ceph: ceph_pagelist_append might sleep while atomic
  ceph: add cpu_to_le32() calls when encoding a reconnect capability
  libceph: must hold mutex for reset_changed_osds()
2013-06-12 08:28:19 -07:00
Linus Torvalds
77293e215e Merge branch 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvme
Pull NVMe fixes from Matthew Wilcox.

* 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Add MSI support
  NVMe: Use dma_set_mask() correctly
  Return the result from user admin command IOCTL even in case of failure
  NVMe: Do not cancel command multiple times
  NVMe: fix error return code in nvme_submit_bio_queue()
  NVMe: check for integer overflow in nvme_map_user_pages()
  MAINTAINERS: update NVM EXPRESS DRIVER file list
  NVMe: Fix a signedness bug in nvme_trans_modesel_get_mp
  NVMe: Remove redundant version.h header include
2013-06-11 23:07:21 -07:00
Konrad Rzeszutek Wilk
604c499cbb xen/blkback: Check device permissions before allowing OP_DISCARD
We need to make sure that the device is not RO or that
the request is not past the number of sectors we want to
issue the DISCARD operation for.

This fixes CVE-2013-2140.

Cc: stable@vger.kernel.org
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
[v1: Made it pr_warn instead of pr_debug]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-07 17:05:55 -04:00
Stefan Bader
7c4d7d710f xen/blkback: Use physical sector size for setup
Currently xen-blkback passes the logical sector size over xenbus and
xen-blkfront sets up the paravirt disk with that logical block size.
But newer drives usually have the logical sector size set to 512 for
compatibility reasons and would show the actual sector size only in
physical sector size.
This results in the device being partitioned and accessed in dom0 with
the correct sector size, but the guest thinks 512 bytes is the correct
block size. And that results in poor performance.

To fix this, blkback gets modified to pass also physical-sector-size
over xenbus and blkfront to use both values to set up the paravirt
disk. I did not just change the passed in sector-size because I am
not sure having a bigger logical sector size than the physical one
is valid (and that would happen if a newer dom0 kernel hits an older
domU kernel). Also this way a domU set up before should still be
accessible (just some tools might detect the unaligned setup).

[v2: Make xenbus write failure non-fatal]
[v3: Use xenbus_scanf instead of xenbus_gather]
[v4: Rebased against segment changes]

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-07 17:05:53 -04:00
Konrad Rzeszutek Wilk
2d5dc3ba85 xen-blkfront: Introduce a 'max' module parameter to alter the amount of indirect segments.
The max module parameter (by default 32) is the maximum number of
segments that the frontend will negotiate with the backend for indirect
descriptors.  Higher value means more potential throughput but more
memory usage. The backend picks the minimum of the frontend and its
default backend value.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-04 15:58:35 -04:00
Ramachandra Rao Gajula
fa08a39664 NVMe: Add MSI support
Some devices only have support for MSI, not MSI-X.  While MSI is more
limited, it still provides better performance than line-based interrupts.

Signed-off-by: Ramachandra Gajula <rama@fastorsystems.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-31 11:45:52 -04:00
Jens Axboe
b02383ea20 Merge branch 'for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-linus
Jiri writes:

please pull from

  git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block.git for-jens

to receive one pktcdvd fix. It fixes a highly theoretical issue with using.
pktcdvd to work with media that'd be larger than 2TB :) But it's a correct.
fix and makes static checkers shut up about improperly cleaning upper.
32bits.
2013-05-29 21:21:23 +02:00
Dan Carpenter
104529661b pktcdvd: silence static checker warning
Static checkers complain about widening the binary "not" operations here
because sectors are u64 and "(pd)->settings.size" is unsigned int.
It unintentionally clears the high 32 bits of the sector.  This means
that the driver won't work for devices with over 2TB of space.  Since
this is a DVD drive, we're unlikely to reach that limit, but we may as
well silence the warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-29 15:36:22 +02:00
Matthew Wilcox
cf9f123b38 NVMe: Use dma_set_mask() correctly
In some circumstances setting a 64-bit DMA mask can fail, as explained
in Documentation/DMA-API-HOWTO.txt.  Use the recommended code sequence
to set a 32-bit DMA mask if setting a 64-bit DMA mask fails.

Reported-by: Chayan Biswas <Chayan.Biswas@sandisk.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-28 16:46:46 -04:00
Brian Behlendorf
dfd20b2b17 drivers/block/brd.c: fix brd_lookup_page() race
The index on the page must be set before it is inserted in the radix
tree.  Otherwise there is a small race which can occur during lookup
where the page can be found with the incorrect index.  This will trigger
the BUG_ON() in brd_lookup_page().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Chris Wedgwood <cw@f00f.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:52 -07:00
Gernot Vormayr
585dc0c2f6 drivers/block/xsysace.c: fix id with missing port-number
If the port number is missing from the device-tree the device gets named
xs` instead of xsa.  This fixes the check for missing ids.

Tested on ml507 board.

Signed-off-by: Gernot Vormayr <gvormayr@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:50 -07:00
Chayan Biswas
cf90bc4830 Return the result from user admin command IOCTL even in case of failure
We copy the result to user if the command is completed from the
controller even if it completes with failure (non-zero) status.
A return status of < 0 indicates the command was not completed
by the controller. The user application may expect the error code
in the result field in case of failure.

Signed-off-by: Chayan Biswas <Chayan.Biswas@sandisk.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-23 13:38:59 -04:00
Jonghwan Choi
2a647bfe1b virtio_blk: Add missing 'static' qualifiers
Add missing 'static' qualifiers

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-05-20 12:09:23 +09:30
Alex Elder
3abef3b358 rbd: fix cleanup in rbd_add()
Bjorn Helgaas pointed out that a recent commit introduced a
use-after-free condition in an error path for rbd_add().
He correctly stated:

    I think b536f69a3a "rbd: set up devices only for mapped images"
    introduced a use-after-free error in rbd_add():
	...
    If rbd_dev_device_setup() returns an error, we call
    rbd_dev_image_release(), which ultimately kfrees rbd_dev.
    Then we call rbd_dev_destroy(), which references fields in
    the already-freed rbd_dev struct before kfreeing it again.

The simple fix is to return the error code after the call to
rbd_dev_image_release().

Closer examination revealed that there's no need to clean up
rbd_opts in that function, so fix that too.

Update some other comments that have also become out of date.

Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17 12:50:10 -05:00
Alex Elder
7262cfca43 rbd: don't destroy ceph_opts in rbd_add()
Whether rbd_client_create() successfully creates a new client or
not, it takes responsibility for getting the ceph_opts structure
it's passed destroyed.  If successful, the structure becomes
associated with the created client; if not, rbd_client_create()
will destroy it.

Previously, rbd_get_client() would call ceph_destroy_options()
if rbd_get_client() failed, and that meant it got called twice.
That led freeing various pointers more than once, which is never a
good idea.

This resolves:
    http://tracker.ceph.com/issues/4559

Cc: stable@vger.kernel.org # 3.8+
Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-17 12:50:03 -05:00
Keith Busch
053ab702cc NVMe: Do not cancel command multiple times
Cancelling an already cancelled command does not do anything, so check
the command context before cancelling it, continuing if had already been
cancelled so we do not log the same problem every second if a device
stops responding.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:18:38 -04:00
Wei Yongjun
1287dabd34 NVMe: fix error return code in nvme_submit_bio_queue()
nvme_submit_flush_data() might overwrite the initialisation of the
return value with 0, so move the -ENOMEM setting close to the usage.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:13:18 -04:00
Dan Carpenter
5460fc0310 NVMe: check for integer overflow in nvme_map_user_pages()
You need to have CAP_SYS_ADMIN to trigger this overflow but it makes the
static checkers complain so we should fix it.  The worry is that
"length" comes from copy_from_user() so we need to check that "length +
offset" can't overflow.

I also changed the min_t() cast to be unsigned instead of signed.  Now
that we cap "length" to INT_MAX it doesn't make a difference, but it's a
little easier for reviewers to know that large values aren't cast to
negative.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:11:03 -04:00
Vishal Verma
710a143dd8 NVMe: Fix a signedness bug in nvme_trans_modesel_get_mp
nvme_trans_modesel_get_mp() was defined with a unsigned return
type, but can return signed values.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:10:38 -04:00
Sachin Kamat
76e0310c2d NVMe: Remove redundant version.h header include
version.h header inclusion is not necessary as detected by
checkversion.pl.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:10:05 -04:00
Linus Torvalds
109c3c0292 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "Yes, this is a much larger pull than I would like after -rc1.  There
  are a few things included:

   - a few fixes for leaks and incorrect assertions
   - a few patches fixing behavior when mapped images are resized
   - handling for cloned/layered images that are flattened out from
     underneath the client

  The last bit was non-trivial, and there is some code movement and
  associated cleanup mixed in.  This was ready and was meant to go in
  last week but I missed the boat on Friday.  My only excuse is that I
  was waiting for an all clear from the testing and there were many
  other shiny things to distract me.

  Strictly speaking, handling the flatten case isn't a regression and
  could wait, so if you like we can try to pull the series apart, but
  Alex and I would much prefer to have it all in as it is a case real
  users will hit with 3.10."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (33 commits)
  rbd: re-submit flattened write request (part 2)
  rbd: re-submit write request for flattened clone
  rbd: re-submit read request for flattened clone
  rbd: detect when clone image is flattened
  rbd: reference count parent requests
  rbd: define parent image request routines
  rbd: define rbd_dev_unparent()
  rbd: don't release write request until necessary
  rbd: get parent info on refresh
  rbd: ignore zero-overlap parent
  rbd: support reading parent page data for writes
  rbd: fix parent request size assumption
  libceph: init sent and completed when starting
  rbd: kill rbd_img_request_get()
  rbd: only set up watch for mapped images
  rbd: set mapping read-only flag in rbd_add()
  rbd: support reading parent page data
  rbd: fix an incorrect assertion condition
  rbd: define rbd_dev_v2_header_info()
  rbd: get rid of trivial v1 header wrappers
  ...
2013-05-15 13:36:19 -07:00
Sam Bradshaw
093c959307 mtip32xx: Correctly handle bio->bi_idx != 0 conditions
Stacking drivers may append bvecs to existing bio's, resulting
in non-zero bi_idx conditions.  This patch counts the loops of
bio_for_each_segment() rather than inheriting the bi_idx value
to pass as a segment count to the hardware submission routine.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-15 10:09:05 +02:00
Sam Bradshaw
974a51a245 mtip32xx: Fix NULL pointer dereference during module unload
An open file-handle to one or more of the driver exported debugfs
nodes causes raciness in recursive removal during module unload;
sometimes a stale parent dentry is dereferenced when more than 1
pci device is present.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-15 10:04:34 +02:00
Alex Elder
638f5abed3 rbd: re-submit flattened write request (part 2)
Add code to rbd_img_obj_exists_callback() to detect when a clone's
parent image has disappeared, and re-submit the original write
request in that case.

Kill off some redundant assertions.

This completes the resolution for:
    http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:46 -05:00
Alex Elder
bbea1c1a31 rbd: re-submit write request for flattened clone
Add code to rbd_img_parent_read_full_callback() to detect when a
clone's parent image has disappeared, and re-submit the original
write request in that case.  (See the previous commit for more
reasoning about why this is appropriate.)

Rename some variables in rbd_img_obj_parent_read_full_callback()
to match the convention used in the previous patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
02c74fbad9 rbd: re-submit read request for flattened clone
If a clone image gets flattened while a parent read request is
underway, the original rbd object request needs to be resubmitted.

The reason is that by the time we get the response to the parent
read request, the data read from the parent may be out of date.
In other words, we could see this sequence of events:

    rbd client                      parent image/osd
    ----------                      ----------------
    original object ENOENT;
        issue parent read
                                    respond to parent read
                                    child image flattened
    original image header refresh
             <--- original object written independently here
    parent read response received

Add code to rbd_img_parent_read_callback() to detect when a clone's
parent image has disappeared (as evidenced by its parent overlap
becoming 0), and re-submit the original read request in that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
392a9dad7e rbd: detect when clone image is flattened
A format 2 clone image can be the subject of a "flatten" operation,
during which all of its data gets "copied up" from its parent image,
leaving the image fully populated.  Once this is complete, the
clone's association with the parent is abolished.

Since this can occur when a clone is mapped, we need to detect when
it has occurred and handle it accordingly.  We know an image has
been flattened when we know it at one time had a parent, but we have
learned (via a "get_parent" object class method call) it no longer
has one.

There might be in-flight requests at the point we learn an image has
been flattened, so we can't simply clean up parent data structures
right away.  Instead, we'll drop the initial parent reference when
the parent has disappeared (rather than when the image gets
destroyed), which will allow the last in-flight reference to clean
things up when it's complete.

We leverage the fact that a zero parent overlap renders an image
effectively unlayered.  We set the overlap to 0 at the point we
detect the clone image has flattened, which allows the unlayered
behavior to take effect immediately, while keeping other parent
structures in place until in-flight requests to complete.

This and the next few patches resolve:
    http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:45 -05:00
Alex Elder
a2acd00e79 rbd: reference count parent requests
Keep a reference count for uses of the parent information for an rbd
device.

An initial reference is set in rbd_img_request_create() if the
target image has a parent (with non-zero overlap).  Each image
request for an image with a non-zero parent overlap gets another
reference when it's created, and that reference is dropped when the
request is destroyed.

The initial reference is dropped when the image gets torn down.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:44 -05:00
Alex Elder
e93f315235 rbd: define parent image request routines
Define rbd_parent_request_create() and rbd_parent_request_destroy()
to handle the creation of parent image requests submitted for
layered image objects.  For simplicity, let rbd_img_request_put()
handle dropping the reference to any image request (parent or not),
and call whichever destructor is appropriate on the last put.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:44 -05:00
Alex Elder
fb65d2284c rbd: define rbd_dev_unparent()
Define rbd_dev_unparent() to encapsulate cleaning up parent data
structures from a layered rbd image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:43 -05:00
Alex Elder
8785b1d487 rbd: don't release write request until necessary
Previously when a layered write was going to involve a copyup
request, the original osd request was released before submitting the
parent full-object read.  The osd request for the copyup would then
be allocated in rbd_img_obj_parent_read_full_callback().

Shortly we will be handling the event of mapped layered images
getting flattened, and when that occurs we need to resubmit the
original request.  We therefore don't want to release the osd
request until we really konw we're going to replace it--in the
callback function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:43 -05:00
Alex Elder
642a25375f rbd: get parent info on refresh
Get parent info for format 2 images on every refresh (rather than
just during the initial probe).  This will be needed to detect the
disappearance of the parent image in the event a mapped image
becomes unlayered (i.e., flattened).  Avoid leaking the previous
parent spec on the second and subsequent times this information is
requested by dropping the previous one (if any) before updating it.
(Also, extract the pool id into a local variable before assigning
it into the parent spec.)

Switch to using a non-zero parent overlap value rather than the
existence of a parent (a non-null parent_spec pointer) to determine
whether to mark a request layered.  It will soon be possible for
a layered image to become unlayered while a request is in flight.

This means that the layered flag for an image request indicates that
there was a non-zero parent overlap at the time the image request
was created.  The parent overlap can change thereafter, which may
lead to special handling at request submission or completion time.

This and the next several patches are related to:
    http://tracker.ceph.com/issues/3763

NOTE:
If an error occurs while refreshing the parent info (i.e.,
requesting it after initial probe), the old parent info will
persist.  This is not really correct, and is a scenario that needs
to be addressed.  For now we'll assert that the failure mode is
unlikely, but the issue has been documented in tracker issue 5040.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 15:06:33 -05:00
Alex Elder
70cf49cfc7 rbd: ignore zero-overlap parent
An rbd clone image that has an overlap with its parent of 0 is
effectively not a layered image at all.  Detect this case and treat
such an image as non-layered.  Issue a warning to be sure the user
knows what's going on.

This resolves:
    http://tracker.ceph.com/issues/5028

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:12:41 -05:00
Alex Elder
b91f09f17b rbd: support reading parent page data for writes
Currently, rbd_img_obj_parent_read_full() assumes the incoming
object request contains bio data.  But if a layered image is part of
a multi-layer stack of images it will result in read requests of
page data to parent images.

This is handling the same kind of issue as was resolved by this
commit:
    5b2ab72d  rbd: support reading parent page data

This resolves:
    http://tracker.ceph.com/issues/5027

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:12:40 -05:00
Alex Elder
ebda6408f2 rbd: fix parent request size assumption
The code that reads object data from the parent for a copyup on
write request currently assumes that the size of that request is the
size of a "full" object from the original target image.

That is not necessarily the case.  The parent overlap could reduce
the request size below that.  To fix that assumption we need to
record the number of pages in the copyup_pages array, for both an
image request and an object request.  Rename a local variable in
rbd_img_obj_parent_read_full_callback() to reflect we're recording
the length of the parent read request, not the size of the target
object.

This resolves:
    http://tracker.ceph.com/issues/5038

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-13 14:09:01 -05:00
Linus Torvalds
2d4fe27850 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver update from Matthew Wilcox:
 "Lots of exciting new features in the NVM Express driver this time,
  including support for emulating SCSI commands, discard support and the
  ability to submit per-sector metadata with I/Os.

  It's still mostly bugfixes though!"

* git://git.infradead.org/users/willy/linux-nvme: (27 commits)
  NVMe: Use user defined admin ioctl timeout
  NVMe: Simplify Firmware Activate code slightly
  NVMe: Only clear the enable bit when disabling controller
  NVMe: Wait for device to acknowledge shutdown
  NVMe: Schedule timeout for sync commands
  NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
  NVMe: Device specific stripe size handling
  NVMe: Split non-mergeable bio requests
  NVMe: Remove dead code in nvme_dev_add
  NVMe: Check for NULL memory in nvme_dev_add
  NVMe: Fix error clean-up on nvme_alloc_queue
  NVMe: Free admin queue on request_irq error
  NVMe: Add scsi unmap to SG_IO
  NVMe: queue usage fixes in nvme-scsi
  NVMe: Set TASK_INTERRUPTIBLE before processing queues
  NVMe: Add a character device for each nvme device
  NVMe: Fix endian-related problems in user I/O submission path
  NVMe: Fix I/O cancellation status on big-endian machines
  NVMe: Fix sparse warnings in scsi emulation
  NVMe: Don't fail initialisation unnecessarily
  ...
2013-05-09 16:35:00 -07:00
Keith Busch
94f370cab6 NVMe: Use user defined admin ioctl timeout
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-09 16:03:50 -04:00
Alex Elder
c48f3f86e2 rbd: kill rbd_img_request_get()
Get rid of rbd_img_request_get(), because it isn't used, and maybe
won't ever be needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:17:00 -05:00
Alex Elder
1f3ef78861 rbd: only set up watch for mapped images
Any changes to parent images are immaterial to any mapped clone.
So there is no need to have a watch event registered on header
objects except for the header object of an image that is mapped.
In fact, a watch request is a write operation, and we may only
have read access to a parent image.

We can't set up the watch request until we know the name of the
header object though.  So pass a flag to rbd_dev_image_probe() to
indicate whether this probe is for a mapping or for a parent image.

Change the second parameter to rbd_dev_header_watch_sync() be
Boolean while we're at it.

This resolves:
    http://tracker.ceph.com/issues/4941

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:55 -05:00
Alex Elder
7ce4eef7b5 rbd: set mapping read-only flag in rbd_add()
The rbd_dev->mapping field for a parent image is not meaningful.
Since rbd_image_probe() is used both for images being mapped and
their parents, it doesn't make sense to set that flag in that
function.

So move the setting of the mapping.read_only flag out of
rbd_dev_image_probe() and into rbd_add() instead.

This resolves:
    http://tracker.ceph.com/issues/4940

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:50 -05:00
Alex Elder
5b2ab72d36 rbd: support reading parent page data
Currently, rbd_img_parent_read() assumes the incoming object request
contains bio data.  But if a layered image is part of a multi-layer
stack of images it will result in read requests of page data to parent
images.

Fortunately, it's not hard to add support for page data.

This resolves:
    http://tracker.ceph.com/issues/4939

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:25 -05:00
Alex Elder
91c6febb38 rbd: fix an incorrect assertion condition
In rbd_img_obj_parent_read_full_callback() there is an assertion
intended to verify the size of the image request for a full parent
read was the size of the original request's target object.  But
assertion was looking at the parent image order rather than the
original one, and these values can differ.

Fix that.

This resolves:
    http://tracker.ceph.com/issues/4938

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 20:16:10 -05:00
Alex Elder
2df3fac758 rbd: define rbd_dev_v2_header_info()
This rearranges rbd_dev_v2_refresh() so it works more like
rbd_dev_v1_header_info().  While format 1 images need to read the
whole header object to get any information, format 2 can collect
almost all information selectively.  So the one-time initialization
will remain in a separate function--based on rbd_dev_v2_probe().

Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have
it call rbd_dev_v2_header_onetime() if it's being called for the
first time for the given rbd device.

Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and
remove the image size and snapshot context calls it held in
common with the refresh function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:52 -05:00
Alex Elder
99a41ebcee rbd: get rid of trivial v1 header wrappers
Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and
rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls
in their place.

Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to
be more generic (it will better reflect what happens with format 2
images).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:46 -05:00
Alex Elder
30d60ba2f2 rbd: simplify rbd_dev_v1_probe()
An rbd_dev structure's fields are all zero-filled for an initial
probe, so there's no need to explicitly zero the parent_spec
and parent_overlap fields in rbd_dev_v1_probe().  Removing these
assignments makes rbd_dev_v1_probe() *almost* trivial.

Move the dout() message that announces discovery of an image into
rbd_dev_image_probe(), generalize to support images in either format
and only show it if an image is fully discovered.

This highlights that are some unnecessary cleanups in the error
path for rbd_dev_v1_probe(), so they can be removed.

Now rbd_dev_v1_probe() *is* a trivial wrapper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:41 -05:00
Alex Elder
662518b128 rbd: update in-core header directly
Now that rbd_header_from_disk() only fills in one-time fields once,
we can extend it slightly so it releases the other fields before
replacing their values.  This way there's no need to pass a
temporary buffer and then copy all the results in.  Just use the rbd
device header structure in rbd_header_from_disk() so its values get
updated directly.

Note that this means we need to take the header semaphore at the
point we update things.  So pass the rbd_dev rather than the address
of its header as its first argument to rbd_header_from_disk(), and
have it return an error code.

As a result, rbd_dev_v1_header_read() does all the work,
rbd_read_header() becomes unnecessary, and rbd_dev_v1_refresh()
becomes a very simple wrapper.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:37 -05:00
Alex Elder
bb23e37acb rbd: refactor rbd_header_from_disk()
This rearranges rbd_header_from_disk so that it:
    - allocates the snapshot context right away
    - keeps results in local variables, not changing the passed-in
      header until it's known we'll succeed
    - does initialization of set-once fields in a header only if
      they have not already been set

The last point is moot at the moment, because rbd_read_header()
(the only caller) always supplies a zero-filled header buffer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:33 -05:00
Alex Elder
46578dcdca rbd: zero format 1 header structure earlier
The passed-in header structure is zeroed in rbd_header_from_disk().
Instead, have the caller do it.  Note that there are two callers,
rbd_dev_v1_refresh() and rbd_dev_v1_probe().  The latter already has
a zeroed header structure so zeroing it isn't necessary there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:28 -05:00
Alex Elder
f35a4dee14 rbd: set the mapping size and features later
Defer setting the size and features fields of a mapped image until
after the Linux disk structure is set up.  Set the capacity of the
disk after that.

Rearrange the definition of rbd_image_header, separating the fields
that are set only once from those that can be updated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 17:00:00 -05:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Matthew Wilcox
ab3ea5bf37 NVMe: Simplify Firmware Activate code slightly
Add definitions for the three Firmware Activate actions, and change the
SCSI translation code to construct the command into a temporary variable
instead of translating the endianness back-and-forth.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
2013-05-08 09:55:05 -04:00
Matthew Wilcox
44af146a84 NVMe: Only clear the enable bit when disabling controller
Many of the bits in the Controller Configuration register may only be
modified when the Enable bit is clear.  Clearing them at the same time
as the Enable bit might be OK, but let's play it safe and only touch the
Enable bit.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:54:31 -04:00
Matthew Wilcox
ba47e3865e NVMe: Wait for device to acknowledge shutdown
A recent update to the specification makes it clear that the host
is expected to wait for the device to acknowledge the Enable bit
transitioning to 0 as well as waiting for the device to acknowledge a
transition to 1.

Reported-by: Khosrow Panah <Khosrow.Panah@idt.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:53:49 -04:00
Alex Elder
51344a38ba rbd: always set read-only flag in rbd_add()
Hold off setting the read-only flag in rbd_add() for an image being
mapped until we have successfully probed the image.  At that point
we know whether it's a snapshot mapping or not, so we can set the
read-only flag in that one place rather than doing so (for
snapshots) in rbd_dev_mapping_set().  To do this, pass a flag to the
image probe routine indicating whether we want a read-only mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:12 -05:00
Alex Elder
6d80b130d5 rbd: kill rbd_dev_clear_mapping()
This function is a duplicate of rbd_dev_mapping_clear(), and was
added by mistake.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:12 -05:00
Alex Elder
8f4b7d9821 rbd: don't look up snapshot id in rbd_dev_mapping_set()
Currently rbd_dev_mapping_set() looks up the snapshot id for the
snapshot whose name is found in the rbd device's spec structure.

That function gets called by rbd_dev_device_setup(), which is
called by rbd_add() *after* rbd_dev_image_probe().  If the
image probe succeeds, the rbd device's spec will already have
been updated to include names and ids for all fields.

Therefore there's no need to look up the snapshot id in
rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:11 -05:00
Alex Elder
c734b79655 rbd: don't print warning if not mapping a parent
The presence of the LAYERING bit in an rbd image's feature mask does
not guarantee the image actually has a parent image.  Currently that
bit is set only when a clone (i.e., image with a parent) is created,
but it is (currently) not cleared if that clone gets flattened back
into a "normal" image.  A "parent_id" query will leave the
parent_spec for the image being mapped a null pointer, but will not
return an error.

Currently, whenever an image with the LAYERED feature gets mapped, a
warning about the use of layered images gets printed.  But we don't
want to do this for a flattened image, so print the warning only
if we find there is a parent spec after the probe.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:48:11 -05:00
Roger Pau Monne
b7649158a0 xen-blkfront: use a different scatterlist for each request
In blkif_queue_request blkfront iterates over the scatterlist in order
to set the segments of the request, and in blkif_completion blkfront
iterates over the raw request, which makes it hard to know the exact
position of the source and destination memory positions.

This can be solved by allocating a scatterlist for each request, that
will be keep until the request is finished, allowing us to copy the
data back to the original memory without having to iterate over the
raw request.

Oracle-Bug: 16660413 - LARGE ASYNCHRONOUS READS APPEAR BROKEN ON 2.6.39-400
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-and-Tested-by: Anne Milicia <anne.milicia@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-08 08:46:51 -04:00
Alex Elder
29334ba49c rbd: kill rbd_update_mapping_size()
Since rbd_update_mapping_size() is now a trivial wrapper, just open
code it in its two callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:45:39 -05:00
Alex Elder
00a653e216 rbd: update capacity in rbd_dev_refresh()
When a mapped image changes size, we change the capacity recorded
for the Linux disk associated with it, in rbd_update_mapping_size().
That function is called in two places--the format 1 and format 2
refresh routines.

There is no need to set the capacity while holding the header
semaphore.  Instead, do it in the common rbd_dev_refresh(), using
the logic that's already there to initiate disk revalidation.

Add handling in the request function, just in case a request
that exceeds the capacity of the device comes in (perhaps one
that was started before a refresh shrunk the device).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:45:30 -05:00
Alex Elder
e627db085e rbd: revalidate only for mapping size changes
This commit:
    d98df63e rbd: revalidate_disk upon rbd resize
instituted a call to revalidate_disk() to notify interested parties
that a mapped image has changed size.  This works well, as long as
the the rbd device doesn't map a snapshot.

A snapshot will never change size.  However, the base image the
snapshot is associated with can, and it can do so while the snapshot
is mapped.

The problem is that the test for the size is looking at the size of
the base image, not the size of the mapped snapshot.  This patch
corrects that.

Update the warning message shown in the event of error, and move
it into the callers.

This resolves:
    http://tracker.ceph.com/issues/4911

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-08 07:40:48 -05:00