Adds all controller information to attribute list exposed to sysfs, and
appends the reset_controller attribute to it. The nvme device is created
with this attribute list, so driver no long manages its attributes.
Reported-by: Sujith Pandel <sujithpshankar@gmail.com>
Cc: Sujith Pandel <sujithpshankar@ gmail.com>
Cc: David Milburn <dmilburn@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't need to shutdown a controller for a reset. A controller in a
shutdown state may take longer to become ready than one that was simply
disabled. This patch has the driver shut down a controller only if the
device is about to be powered off or being removed. When taking the
controller down for a reset reason, the controller will be disabled
instead.
Function names have been updated in this patch to reflect their changed
semantics.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The nvme driver deletes IO queues asynchronously since this operation
may potentially take an undesirable amount of time with a large number
of queues if done serially.
The driver used to manage coordinating asynchronous deletions. This
patch simplifies that by leveraging the block layer rather than using
kthread workers and chaining more complicated callbacks.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
NVMe submits all commands through the block layer now. This means we
can let requests queue at the blk-mq hardware context since there is no
path that bypasses this anymore so we don't need to freeze the queues
anymore. The driver can simply stop the h/w queues from running during
a reset instead.
This also fixes a WARN in percpu_ref_reinit when the queue was unfrozen
with requeued requests.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A negative status has the "do not retry" bit set, which makes it not
retryable. Use a fake status that can potentially be retried on reset.
An aborted command's status is overridden by the timeout handler so
that it won't be retried, which is necessary to keep initialization from
getting into a reset loop.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The tag set queue depth needs to be one less than the h/w queue depth
so we don't wrap the circular buffer. This conforms to the specification
defined "Full Queue" condition.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Translation SCSI commands to NVMe commands is rather pointless in general
as applications must not expext to be able to use SCSI commands on a
generic block device.
Make the huge translation layer optional and hope no one will ever enable
it in the future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make sure we synchronize access to the namespaces list and grab a reference
to the namespace before doing I/O. Make sure to reject the ioctl if multiple
namespaces are present as it's entirely unsafe, and warn when using it even
with a single namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently traversal and modification of ctrl->namespaces happens completely
unsynchronized, which can be fixed by the addition of a simple mutex.
Note: nvme_dev_ioctl will be handled in the next patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Nothing pci specific about them and We'll need them exported
in other transports too.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Exposes the NGUID, EUI-64, and NSID to sysfs entries under the disk's
kobject.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Requests enabling pcie aer support. Shuts down the controller on error
detected with io frozen state prior to requesting slot reset; resumes
controller after reset completes.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Merge the two per-request structures in the nvme driver into a single
one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We need to move freeing of resources to the ->complete handler to ensure
they are also freed when we cancel the command.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that all commands are executed as block layer requests we can remove the
internal completion in the NVMe driver. Note that we can simply call
blk_mq_complete_request to abort commands as the block layer will protect
against double copletions internally.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
AEN requests are different from other requests in that they don't time out
or can easily be cancelled. Because of that we should not use the blk-mq
infrastructure but just special case them in the completion path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
And remove the now unused nvme_submit_cmd helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We'll need them in other places later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The number in tag_set->queue depth includes the reserved tags.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We no longer require the two-pass setup for block integrity.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't want to allow new references to open on a device that is
removed. This ties the lifetime of these handles to the physical device's
presence rather than to the open reference count.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Removes all usage of the global work queue so work can't be
scheduled on two different work queues, and removes nvme's work queue
singlethreadedness so controllers can be driven in parallel.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: keep the dead controller removal on the system workqueue to avoid
deadlocks]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The NVMe 1.1 specification provides an identify mode to return a
list of active namespaces. This is more efficient to discover which
namespace identifiers are active on a controller, providing potentially
significant improvement in scan time for controllers with sparesly
populated namespaces.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: add quirk for the broken Qemu Identify implementation. To be relaxed
later]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no lock to sychronize access to the abort_limit field of
struct nvme_ctrl, so switch it to an atomic_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Compared to the kthread this gives us multiple call prevention for free.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we're using two work queues we're always going to run into races where
one item is tearing down what the other one is initializing. So insted
merge the two work queues, and let the old probe_work also tear the
controller down first if it was alive. Together with the better detection
of the probe path using a flag this gives us a properly serialized
reset/probe path that also doesn't accidentally trigger when two commands
time out and the second one tries to reset the controller while the first
reset is still in progress.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Otherwise we're never going to complete a command when it is restarted just
after we completed all other outstanding commands in nvme_clear_queue.
The controller must be disabled prior to completing a presumed lost
command, do this by directly shutting down the controller before
queueing the reset work, and return EH_HANDLED from the timeout handler
after we shut the controller down.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: split and rebase]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't delete the controller from dev_list before queuing a reset, instead
just check for it being reset in the polling kthread. This allows to remove
the dev_list_lock in various places, and in addition we can simply rely on
checking the queue_work return value to see if we could reset a controller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To properly document how we are using a negative Linux error value to
communicate request cancellations inside the driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We want to be able to return bettern error values frmo nvme_timeout, which
is significantly easier if the two functions are merged. Also clean up and
reduce the printk spew so that we only get one message per abort.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is nothing it protects, but it makes lockdep unhappy in many different
ways.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Without this we can easily get bad derferences on nvmeq->d_db when the nvme
kthread tries to poll the CQs for controllers that are in half initialized
state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Half initialized queues due to kernel error returns or timeout are still a
good reason to give up on initializing a controller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The "|" operator has higher precedence than "?:" so this didn't work as
intended. I had previously fixed this bug, but it we copied the older
unfixed version when we moved the function between files.
Fixes: 1673f1f08c ('nvme: move block_device_operations and ns/ctrl freeing to common code')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The nvme_user_cmd function was recently moved around from one file
to another, which made a warning reappear that I had fixed before
at some point:
drivers/nvme/host/core.c: In function 'nvme_user_cmd':
drivers/nvme/host/core.c:424:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
This applies the same workaround that we have elsewhere in the
driver with an extra type cast to uintptr_t.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 1673f1f08c ("nvme: move block_device_operations and ns/ctrl freeing to common code")
Link: https://lkml.org/lkml/2015/10/9/611
Signed-off-by: Jens Axboe <axboe@fb.com>
Looks like I didn't test with CONFIG_NVM enabled, and neither did
the build bot.
Most of this is really weird crazy shit in the lighnvm support, though.
Struct nvme_ns is a structure for the NVM I/O command set, and it has
no business poking into it. Second this commit:
commit 47b3115ae7
Author: Wenwei Tao <ww.tao0320@gmail.com>
Date: Fri Nov 20 13:47:55 2015 +0100
nvme: lightnvm: use admin queues for admin cmds
Does even more crazy stuff. If a function gets a request_queue parameter
passed it'd better use that and not look for another one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch moves the blk_integrity_payload definition outside the
CONFIG_BLK_DEV_INTERITY dependency and provides empty function
implementations when the kernel configuration disables integrity
extensions. This simplifies drivers that make use of these to map user
data so they don't need to repeat the same configuration checks.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Updated by Jens to pass an error pointer return from
bio_integrity_alloc(), otherwise if CONFIG_BLK_DEV_INTEGRITY isn't
set, we return a weird ENOMEM from __nvme_submit_user_cmd()
if a meta buffer is set.
Signed-off-by: Jens Axboe <axboe@fb.com>
Split out a helper that just issues the Set Features and interprets the
result which can go to common code, and document why we are ignoring
non-timeout error returns in the PCIe driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
For this we need to add a proper controller init routine and a list of
all controllers that is in addition to the list of PCIe controllers,
which stays in pci.c. Note that we remove the sysfs device when the
last reference to a controller is dropped now - the old code would have
kept it around longer, which doesn't make much sense.
This requires a new ->reset_ctrl operation to implement controleller
resets, and a new ->write_reg32 operation that is required to implement
subsystem resets. We also now store caches copied of the NVMe compliance
version and the flag if a controller is attached to a subsystem or not in
the generic controller structure now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixes for pr merge]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The namespace scanning code has been mostly generic already, we just
need to store a pointer to the tagset in the nvme_ctrl structure, and
add a method to check if a controller is I/O incapable. The latter
will hopefully be replaced by a proper controller state machine soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixed pr conflicts]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We want to record the identify and CAP values even if no I/O queue
is available.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
And add the 64-bit register read operation for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove the calculation of all the bits written into the CC register into
nvme_enable_ctrl, so that they can be moved into the core NVMe driver in
the future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add an enum for all workarounds not in the spec and identify the affected
controllers at probe time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This moves the block_device_operations over to common code mostly
as-is. The only change is that the ns and ctrl refcounting got some
small refcounting to have wrappers around the kref_put operations.
A new free_ctrl operation is added to allow the PCI driver to free
it's ressources on the final drop.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Moved the integrity and pr changes due to merge conflict]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the integrity API to pass through metadata from userspace. For PI
enabled devices this means that we now validate the reftag, which seems
like an unintentional ommission in the old code.
Thanks to Keith Busch for testing and fixes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Skip metadata setup on admin commands]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>