linux/drivers/accel/habanalabs
Koby Elbaz a6685b573c habanalabs: block soft-reset on an unusable device
A device whose status is malfunction can't be used.
In such a case certain reset types are not supported, e.g.,
all kinds of soft-resets (compute reset, inference soft-reset),
and reset upon device release.

A hard-reset is the only way for an unusable device to change its
status. No other reset flow may put the device into a reset
procedure, since that might ultimately cause the device to change its
status unintentionally and become operational again.

Such a scenario recently occurred when a user requested a hard-reset
while another user's heavy workload was ongoing (the reset request
was queued).
Since the workload couldn't finish within the reset's timeout limit,
the reset failed and set the device status to malfunction.
Eventually, when the user released the FD, an unsuccessful soft-reset
occurred, followed by an additional hard-reset that changed the
ASIC's status back to operational.
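
The policy described above can be sketched as a small predicate. This is a
minimal illustration, not the driver's code: the enum, the flag macros, the
struct and the function name are all hypothetical stand-ins chosen for this
example, and the real driver's types and flags may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the driver's real names. */
enum dev_status { DEV_OPERATIONAL, DEV_IN_RESET, DEV_MALFUNCTION };

#define RESET_HARD             (1u << 0) /* full hard-reset requested */
#define RESET_FROM_DEV_RELEASE (1u << 1) /* reset triggered by closing the FD */

struct device_sketch {
	enum dev_status status;
};

/*
 * Returns true if a reset with the given flags may start.
 * On a malfunctioning (unusable) device, only a hard-reset that was not
 * triggered by device release is let through; soft-resets and
 * release-triggered resets are refused.
 */
bool reset_allowed(const struct device_sketch *dev, unsigned int flags)
{
	assert(dev != NULL);

	if (dev->status != DEV_MALFUNCTION)
		return true;

	if (!(flags & RESET_HARD))
		return false; /* any kind of soft-reset */

	if (flags & RESET_FROM_DEV_RELEASE)
		return false; /* reset upon device release */

	return true;
}
```

Under these assumptions, the release-time soft-reset from the scenario above
would be refused while the status is malfunction, so it could not fail and
escalate into a hard-reset that flips the status back to operational.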

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-01-26 11:52:13 +02:00
common habanalabs: block soft-reset on an unusable device 2023-01-26 11:52:13 +02:00
gaudi habanalabs: add set engines masks ASIC function 2023-01-26 11:52:11 +02:00
gaudi2 habanalabs/gaudi2: print page fault axi transaction id 2023-01-26 11:52:13 +02:00
goya habanalabs: add set engines masks ASIC function 2023-01-26 11:52:11 +02:00
include habanalabs: Replace zero-length arrays with flexible-array members 2023-01-26 11:52:12 +02:00
Kconfig habanalabs: move driver to accel subsystem 2023-01-26 11:52:10 +02:00
Makefile habanalabs: move driver to accel subsystem 2023-01-26 11:52:10 +02:00