If the EEPROM is not burned, the firmware sets default EEPROM values.
As this is not valid in production, the driver should fail to load upon
any EEPROM error reported by the firmware.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The engines idle mask was increased to be an array of 4 u64 entries.
To make the print of this mask more readable, remove the "0x" prefix,
and zero-pad each u64 to 16 hex digits if either it isn't zero or if any
of the higher-order u64's is not zero.
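
The formatting rule can be sketched in user-space C as follows. This is
only an illustration, not the driver's code: format_idle_mask() is a
hypothetical name, mask[0] is assumed to be the lowest-order word, and
skipping leading zero words (or printing a single "0" for an all-zero
mask) is an assumption about how unpadded words are treated.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define IDLE_MASK_WORDS 4 /* the mask is an array of 4 u64 entries */

/*
 * Format the mask as hex without a "0x" prefix, highest-order word first.
 * A word is printed, zero-padded to 16 hex digits, if it is non-zero or if
 * any higher-order word is non-zero.
 * Assumptions: mask[0] is the lowest-order word, leading zero words are
 * skipped, and an all-zero mask is printed as "0".
 * Returns the number of characters written, or -1 if 'buf' is too small.
 */
static int format_idle_mask(const uint64_t mask[IDLE_MASK_WORDS],
                            char *buf, size_t len)
{
    size_t written = 0;
    int seen_nonzero = 0;
    int i;

    if (len < 2)
        return -1;

    for (i = IDLE_MASK_WORDS - 1; i >= 0; i--) {
        int n;

        if (!mask[i] && !seen_nonzero)
            continue; /* zero word with no non-zero word above it */

        n = snprintf(buf + written, len - written, "%016" PRIx64, mask[i]);
        if (n < 0 || (size_t)n >= len - written)
            return -1;
        written += (size_t)n;
        seen_nonzero = 1;
    }

    if (!written) {
        buf[0] = '0';
        buf[1] = '\0';
        written = 1;
    }
    return (int)written;
}
```

With this sketch, a mask whose two lowest words are 0x2 and 0x1 is
rendered as two 16-digit groups, the higher-order word first.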
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reserved encapsulated signals which were not released hold the context
refcount, leading to a failure when killing the user process on device
reset or device fini.
Release these leftover signals in the CS roll-back process.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
hl_encaps_handle_do_release_sob() can be called only when the last
reference to the context object is released and hl_ctx_do_release() is
initiated, and therefore it shouldn't call hl_ctx_put().
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Failing to kill a user process during a hard reset can be due to a
reference to the user context which isn't released.
To make it easier to understand if this is the reason for the failure
and not something else, add a print of the context refcount value.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This is similar to RMWREG32, but the given 'val' is already shifted
according to the mask.
This allows several 'ORed' vals and masks to be set at once.
The patch also fixes incorrect uses of RMWREG32 by replacing
them with RMWREG32_SHIFTED.
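
The difference between the two forms can be sketched on a plain 32-bit
value instead of a real register. The helper names rmw32() and
rmw32_shifted() and the FIELD_*_MASK layout are hypothetical and exist
only for this illustration; they are not the driver's macros.

```c
#include <stdint.h>

/* Hypothetical field layout of a 32-bit register, for illustration only. */
#define FIELD_A_MASK 0x000000F0u /* bits 7:4 */
#define FIELD_B_MASK 0x00000F00u /* bits 11:8 */

/* Index of the lowest set bit of a non-zero mask. */
static unsigned int lowest_set_bit(uint32_t mask)
{
    unsigned int shift = 0;

    while (!(mask & 1u)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* 'val' is already shifted into position according to 'mask'. */
static uint32_t rmw32_shifted(uint32_t reg, uint32_t val, uint32_t mask)
{
    return (reg & ~mask) | (val & mask);
}

/* 'val' is unshifted; it is moved to the field position before the update. */
static uint32_t rmw32(uint32_t reg, uint32_t val, uint32_t mask)
{
    return rmw32_shifted(reg, val << lowest_set_bit(mask), mask);
}
```

With the shifted form, two fields can be updated in one read-modify-write
by ORing both the pre-shifted values and the masks, e.g.
rmw32_shifted(reg, (0x3u << 4) | (0x5u << 8), FIELD_A_MASK | FIELD_B_MASK).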
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
When the new CPUCP opcodes are not supported and a CPUCP packet fails,
the return value is the F/W error response, which is a positive value.
If this packet is sent from an IOCTL and the positive value is used, the
IOCTL will not be considered as unsuccessful.
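
A minimal sketch of the idea, assuming that any positive value is a F/W
error code and that it should be mapped to a negative errno before it
reaches the IOCTL path. The helper name normalize_packet_rc() and the
choice of -EIO are assumptions for illustration, not necessarily what the
driver does.

```c
#include <errno.h>

/*
 * Hypothetical helper: make the result of sending a packet safe to return
 * from an IOCTL handler, where only negative values signal failure.
 */
static int normalize_packet_rc(int rc)
{
    /* Assumption: positive values are F/W error codes; -EIO is illustrative. */
    if (rc > 0)
        return -EIO;

    /* Zero (success) and negative errno values pass through unchanged. */
    return rc;
}
```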
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The clang-analyzer reported a warning: "Value stored to
'cq_size_addr' is never read".
The cq_size register of dcore0 is not being zeroed using
gaudi2_memset_device_lbw(), along with the other cq_* registers,
even though the corresponding cq_size_addr variable is set.
Signed-off-by: Marco Pagani <marpagan@redhat.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The clang-analyzer reported a warning: "Value stored to 'rc' is never
read".
The return value check for the first hl_fw_dynamic_send_clear_cmd() call
in hl_fw_dynamic_send_protocol_cmd() appears to be missing.
Signed-off-by: Marco Pagani <marpagan@redhat.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Increase the size of the busy engines mask in 'struct hl_info_hw_idle',
for future ASICs with more than 128 engines.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Currently the scrubbing mechanism uses the EDMA engines by directly
setting the engine core registers to scrub a chunk of memory.
Due to a sporadic failure with this mechanism, it was decided to
initiate the engines via their QMANs using LIN-DMA packets.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Processes that use our device are likely to use other devices, such as
remote storage, at the same time.
In case our device is removed and a user process is still using the
device, we need to kill the user process. However, if that process
has a thread waiting for i/o to complete on remote storage, for example,
the process won't terminate.
Let's give it enough time to terminate before giving up.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
schedule_hard_reset can be true only if we didn't do a hard reset.
Therefore, there is no point in checking it when hard_reset is true.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
If the device file is released while a context is still held, it won't
be possible to reopen it until the context is eventually released.
If that doesn't happen, only a device reset will revert it back to an
operational state, i.e. one has to wait for a CS timeout or an error, or
for an external intervention that injects a reset via sysfs.
At this stage, after the device was released by the user, the context is
held either because of CSs that were left running on the device and are
no longer relevant, or due to missing cleanup steps on the user side.
All of this is in any case handled in the device reset flow, so initiate
the reset at this point instead of waiting for it.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Due to a H/W issue in the LBW path to the PCIE_DBI MSI-X doorbell, there
were false sporadic error responses in SM when it was configured to
write there, and hence no reset was done as part of handling the
relevant event.
Now that the virtual MSI-X doorbell is used, such errors in SM are not
expected and reset shouldn't be skipped.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
User should use the virtual MSI-X doorbell to generate interrupts from
the device, so there is no need to enable entries in the MSIX_GW table.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Firmware 1.7 is the first official firmware, so there is no need to
check if we are running a version below it.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add missing le32_to_cpu() conversions, and use %d for the value
returned from atomic_read().
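
For illustration, the effect of a little-endian-to-CPU conversion can be
sketched in portable user-space C. le32_to_host() is a hypothetical
stand-in for the kernel's le32_to_cpu(); it assembles the value from its
bytes so that the result is the same on any host endianness.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for le32_to_cpu(): interpret the bytes
 * of 'le' as a little-endian 32-bit value, regardless of host endianness.
 */
static uint32_t le32_to_host(uint32_t le)
{
    uint8_t b[4];

    memcpy(b, &le, sizeof(b));
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}
```

As for the format specifier, atomic_read() returns an int, which is why
%d is the matching choice.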
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Each time a page fault happens, besides capturing its data, also notify
the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As power and thermal envelope events are purely informative and do not
indicate an error, reduce their print level to info.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Some ASICs haven't yet implemented this functionality and so the
ioctl call should fail and the user should be notified of the reason.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This is needed to allow adding more data to the lkd_fw_comms_desc
structure.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Each time a RAZWI (read-only zero, write ignored) event happens, besides
capturing its data, also notify the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Due to binning, Gaudi2 does not always support fp32.
Add support for the event that is raised when the user uses fp32 on
such a device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Each time a page fault happens, besides capturing its data, also notify
the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Creating the event queue workqueue using alloc_workqueue made it run in
multi-threaded mode, which caused events to be dumped and reported to
the user in parallel, so logs with multiple events came out of order.
Fix it by creating the event queue workqueue as a single-threaded
workqueue.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Each time a RAZWI (read-only zero, write ignored) event happens, besides
capturing its data, also notify the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add support for Gaudi2 Device with PCI revision 2.
Functionality is exactly the same as revision 1; the only difference
is the device name exposed to the user.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As Gaudi2 has a single PCI id, the secured asic type is redundant.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In order to know if the driver catches PCI errors correctly, we need to
print a warning for each error.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
hl_access_sram_dram_region() uses a region base which is set within the
hl_set_dram_bar() function. However, for SRAM access this function is
not called, and we end up with a wrong value of region base and with a
bad calculated address.
Fix it by initializing the region base value independently of whether
hl_set_dram_bar() is called or not.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
To avoid memory corruption in kernel memory while using timestamp
registration nodes, zero the kernel buffer memory when it is allocated.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The consecutive error mechanism prevents a device reset loop from being
triggered by h/w issues by moving the device into an unavailable state.
When the user may be the cause of the error, an unavailable state
will prevent the user from running workloads.
This commit prevents entering the consecutive error state when a user
context is enabled.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Use graceful hard reset when detecting a CS timeout that requires a
device reset.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Use graceful hard reset for F/W events on Gaudi2 device that require a
device reset.
While at it, do a small refactor of the checks and function calls,
to simplify it and to avoid code duplication.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Use graceful hard reset for F/W events on Gaudi device that require a
device reset.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add an option to control the timeout value for the driver's watchdog
of the reset process. The timeout represents the amount of time the user
has to close the process after getting a device reset notification from
the driver.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Calling hl_device_reset() for a hard reset leads to an almost
immediate device reset and to killing the user process.
For resets that follow errors, this removes the option to debug the
errors on both the device side and the user application side.
This patch adds a 'graceful hard reset' option and a new
hl_device_cond_reset() function.
Under some conditions, mainly if there is no user process or if the
process is not registered to driver notifications, this function will
execute a hard reset as usual.
Otherwise, the reset will be postponed and a notification will be sent
to the user, to allow performing post-error actions and then releasing
the device, after which the reset will take place.
If the device is not released by the user within a defined time, a
watchdog work will execute the reset in any case.
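
The decision flow described above can be sketched as follows. This is a
simplified user-space model, not the driver's implementation: struct
reset_ctx, enum reset_action, cond_reset_decide() and
postponed_reset_should_fire() are hypothetical names, and only the two
main conditions named in this message are modeled.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the decision depends on. */
struct reset_ctx {
    bool user_process_open;      /* a user process currently holds the device */
    bool user_registered_notify; /* it registered for driver notifications */
};

enum reset_action {
    RESET_NOW,       /* execute the hard reset as usual */
    RESET_POSTPONED  /* notify the user and wait for release or watchdog */
};

/* Decide between an immediate reset and a graceful (postponed) one. */
static enum reset_action cond_reset_decide(const struct reset_ctx *ctx)
{
    if (!ctx->user_process_open || !ctx->user_registered_notify)
        return RESET_NOW;

    return RESET_POSTPONED;
}

/*
 * A postponed reset takes place once the user releases the device, or once
 * the watchdog timeout expires, whichever comes first.
 */
static bool postponed_reset_should_fire(bool device_released,
                                        unsigned int elapsed_sec,
                                        unsigned int timeout_sec)
{
    return device_released || elapsed_sec >= timeout_sec;
}
```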
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Currently there is no verification whether the divisor is legal.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As there are 2 types of user mappings, pmmu and hmmu, calculate
only the relevant mappings for the requested type.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The virtual MSI-X doorbell is supported now in F/W, so all
configurations to access the PCIE_DBI MSI-X doorbell can be removed.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Up until now, the use-case in the driver was that the HBM is accessed
using the HBM BAR; because the BAR sometimes cannot cover the whole HBM,
we needed to set the BAR to another HBM offset.
Now we need to access other PCI memory regions that can be covered by
the HBM BAR.
To address that, allow the caller to determine whether the HBM BAR
needs to be set, regardless of the PCI memory region.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The code uses the pointer for trace purposes (without actually
dereferencing it) but still gets a static analysis warning.
This patch eliminates the warning.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
NIC ARCs need to have access to CBU_EARLY_BRESP, hence we unsecure
those registers.
Signed-off-by: Dilip Puri <dilipp@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The event notifier mechanism should not raise an empty
event (event equals zero).
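
A minimal sketch of such a guard, using a hypothetical struct
event_notifier and notify_event() helper rather than the driver's actual
notifier code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical notifier state, for illustration only. */
struct event_notifier {
    uint64_t pending_mask; /* events accumulated for the user */
    unsigned int raised;   /* number of times the user was signaled */
};

/* Record and raise an event; returns false if nothing was raised. */
static bool notify_event(struct event_notifier *n, uint64_t event_mask)
{
    /* An empty event (mask equals zero) must not be raised. */
    if (!event_mask)
        return false;

    n->pending_mask |= event_mask;
    n->raised++;
    return true;
}
```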
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add a function to calculate the possible engines that caused a
RAZWI (read-only zero, write ignored), from a given router id or
module index.
When getting a RAZWI via PSOC IP, first the router id is calculated
and then the possible engines that caused the RAZWI are calculated.
There is a possibility that the RAZWI initiator is not an engine. In
that case, it will not be included in the possible engines as it
doesn't have an engine id.
RAZWI information is captured when receiving an event from an engine or
via PSOC IP.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
'struct hl_device_reset_work' is used as a wrapper for the reset work
and its parameters, including the reset workqueue on which it runs.
In a future commit, another reset related work with similar parameters
is going to be added, but it won't use the reset workqueue.
As in any case there is a single reset workqueue, and to allow the reuse
of this structure, move the reset workqueue to 'struct hl_device'.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>