Commit Graph

282 Commits

Author SHA1 Message Date
Ofir Bitton
fa58b59493 accel/habanalabs: modify pci health check
Today we read PCI VENDOR-ID in order to make sure PCI link is
healthy. Apparently the VENDOR-ID might be stored on host and
hence, when we read it we might not access the PCI bus.
In order to make sure PCI health check is reliable, we will start
checking the DEVICE-ID instead.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:32 +02:00
Tomer Tayar
c517068349 accel/habanalabs: keep explicit size of reserved memory for FW
The reserved memory for FW is currently saved in an ASIC property in
units of MB, just like the value that comes from FW.
Except the fact that it is not clear from the property's name, it means
also that a calculation to actual size is required everywhere that it is
used.
Modify the property to hold the size in bytes.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:27 +02:00
Tomer Tayar
db45bbdd02 accel/habanalabs: handle reserved memory request when working with full FW
Currently the reserved memory request from FW is handled when running
with preboot only, but this request is relevant also when running with
full FW.
Modify to always handle this reservation request.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:24 +02:00
Ofir Bitton
5b6658eb7c accel/habanalabs/hwmon: rate limit errors user can generate
Fetching sensor data can fail due to various reasons. In order
not to pollute the kernel log, those error prints must be
rate limited.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:19 +02:00
Ofir Bitton
3bf6ef981f accel/habanalabs/gaudi2: drain event lacks rd/wr indication
Due to a H/W issue, AXI drain event does not include a read/write
indication, hence we remove this print.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:16 +02:00
Dani Liberman
fd8d2fa066 accel/habanalabs: fix error print
The unmasking is for event and it can be other event than RAZWI.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:12 +02:00
Tal Risin
c8c062e967 accel/habanalabs: initialize maybe-uninitialized variables
Prevent static analysis warning.

Signed-off-by: Tal Risin <trisin@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:08 +02:00
Avri Kehat
0b105a2a72 accel/habanalabs: fix debugfs files permissions
debugfs files are created with permissions that don't align
with the access requirements.

Signed-off-by: Avri Kehat <akehat@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:03 +02:00
Tomer Tayar
e855869bec accel/habanalabs: fix glbl error cause handling
The glbl error cause handling has a wrong assumption that all error
bits are consecutive.
Fix the handling to check all relevant error bits per ASIC.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:47:00 +02:00
Tomer Tayar
c1e89ae455 accel/habanalabs/gaudi2: check extended errors according to PCIe addr_dec interrupt info
The FW interrupt info for a PCIe addr_dec event is set correctly, so
check for either global errors or razwi according to the indications
there.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:46:55 +02:00
Tomer Tayar
7159813c91 accel/habanalabs: modify print for skip loading linux FW to debug log
Skip loading a linux FW image into the device with the current supported
ASICs is done for test purposes only.
Moreover, for future supported ASICs it is possible that there won't be
a need to load such an image.
The print in such a case is therefore not needed in most cases, so
replace the used dev_info() with dev_dbg().

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:46:52 +02:00
Farah Kassabri
c14e5cd3ed accel/habanalabs: remove hop size from asic properties
The hop size related properties is a MMU properties and not
asic properties.
As for PMMU and HMMU we could have different sizes.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:46:40 +02:00
Erick Archer
9e263c5042 accel/habanalabs: use kcalloc() instead of kzalloc()
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the purpose specific kcalloc() function instead of the argument
size * count in the kzalloc() function.

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/162

Signed-off-by: Erick Archer <erick.archer@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:41 +02:00
Colin Ian King
5ae8b6b774 accel/habanalabs/goya: remove redundant assignment to pointer 'input'
The pointer input is assigned a value that is not read, it is
being re-assigned again later with the same value. Resolve this
by moving the declaration to input into the if block.

Cleans up clang scan build warning:
warning: Value stored to 'input' during its initialization is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Tomer Tayar
01f8cd0faf accel/habanalabs/gaudi2: fail memory memset when failing to copy QM packet to device
gaudi2_memset_memory_chunk_using_edma_qm() calls the access_dev_mem()
ASIC function, but ignores its return value.
Add this missing check.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Dani Liberman
731d320e68 accel/habanalabs: remove call to deprecated function
In newer kernel versions, irq_set_affinity_hint() is deprecated.
Instead, use the newer version which is irq_set_affinity_and_hint().

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Malkoot Khan
8a5be2b62b accel/habanalabs: Remove unnecessary braces from if statement
The coding style in the Linux kernel prefers not to use
braces for single-statement if conditions.
This patch removes the unnecessary braces from an if statement
in the file drivers/accel/habanalabs/common/command_submission.c,
which also resolves a coding style warning.

Signed-off-by: Malkoot Khan <engr.mkhan1990@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Farah Kassabri
f728c17fc9 accel/habanalabs/gaudi2: move HMMU page tables to device memory
Currently the HMMU page tables reside in the host memory,
which will cause host access from the device for every page walk.
This can affect PCIe bandwidth in certain scenarios.

To prevent that problem, HMMU page tables will be moved to the device
memory so the miss transaction will read the hops from there instead of
going to the host.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Tomer Tayar
246d8b6cfb accel/habanalabs: abort device reset for consecutive heartbeat failures
The mechanism of aborting device reset for consecutive fatal errors is
currently only for fatal errors that are reported by FW.
A non-responsive FW and consecutive heartbeat failures is also
considered fatal, so add them as well to this mechanism to avoid
recurring device reset in such a case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Tomer Tayar
d0df8a35a7 accel/habanalabs: fix DRAM BAR base address calculation
When the DRAM region size in the BAR is not a power of 2, calculating
the corresponding BAR base address should be done using the offset from
the DRAM start address, and not using directly the DRAM address.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Koby Elbaz
8c075401f2 accel/habanalabs: increase HL_MAX_STR to 64 bytes to avoid warnings
Fix a warning of a buffer overflow:
‘snprintf’ output between 38 and 47 bytes into a destination of size 32

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Dani Liberman
e91c37f194 accel/habanalabs/gaudi2: add interrupt affinity for user interrupts
User interrupts are MSIx interrupts coming from Gaudi2, that have
specific range of IDs and are assigned to the sole use of the user
process that opened the Gaudi2 device (reminder: there can be only
a single user process running on Gaudi2 at any given time).

The interrupts are allocated and managed by the driver and therefore,
the user expects the driver to initialize them properly, which also
includes setting the affinity to the related CPU cores of the
device's NUMA node to get maximum performance.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2024-02-26 09:30:40 +02:00
Linus Torvalds
cf65598d59 drm-next for 6.8:
new drivers:
 - imagination - new driver for Imagination Technologies GPU
 - xe - new driver for Intel GPUs using core drm concepts
 
 core:
 - add CLOSE_FB ioctl
 - remove old UMS ioctls
 - increase max objects to accomodate AMD color mgmt
 
 encoder:
 - create per-encoder debugfs directory
 
 edid:
 - split out drm_eld
 - SAD helpers
 - drop edid_firmware module parameter
 
 format-helper:
 - cache format conversion buffers
 
 sched:
 - move from kthread to workqueue
 - rename some internals
 - implement dynamic job-flow control
 
 gpuvm:
 - provide more features to handle GEM objects
 
 client:
 - don't acquire module reference
 
 displayport:
 - add mst path property documentation
 
 fdinfo:
 - alignment fix
 
 dma-buf:
 - add fence timestamp helper
 - add fence deadline support
 
 bridge:
 - transparent aux-bridge for DP/USB-C
 - lt8912b: add suspend/resume support and power regulator support
 
 panel:
 - edp: AUO B116XTN02, BOE NT116WHM-N21,836X2, NV116WHM-N49
 - chromebook panel support
 - elida-kd35t133: rework pm
 - powkiddy RK2023 panel
 - himax-hx8394: drop prepare/unprepare and shutdown logic
 - BOE BP101WX1-100, Powkiddy X55, Ampire AM8001280G
 - Evervision VGG644804, SDC ATNA45AF01
 - nv3052c: register docs, init sequence fixes, fascontek FS035VG158
 - st7701: Anbernic RG-ARC support
 - r63353 panel controller
 - Ilitek ILI9805 panel controller
 - AUO G156HAN04.0
 
 simplefb:
 - support memory regions
 - support power domains
 
 amdgpu:
 - add new 64-bit sequence number infrastructure
 - add AMD specific color management
 - ACPI WBRF support for RF interference handling
 - GPUVM updates
 - RAS updates
 - DCN 3.5 updates
 - Rework PCIe link speed handling
 - Document GPU reset types
 - DMUB fixes
 - eDP fixes
 - NBIO 7.9/7.11 updates
 - SubVP updates
 - XGMI PCIe state dumping for aqua vanjaram
 - GFX11 golden register updates
 - enable tunnelling on high pri compute
 
 amdkfd:
 - Migrate TLB flushing logic to amdgpu
 - Trap handler fixes
 - Fix restore workers handling on suspend/resume
 - Fix possible memory leak in pqm_uninit()
 - support import/export of dma-bufs using GEM handles
 
 radeon:
 - fix possible overflows in command buffer checking
 - check for errors in ring_lock
 
 i915:
 - reorg display code for reuse in xe driver
 - fdinfo memory stats printing
 - DP MST bandwidth mgmt improvements
 - DP panel replay enabling
 - MTL C20 phy state verification
 - MTL DP DSC fractional bpp support
 - Audio fastset support
 - use dma_fence interfaces instead of i915_sw_fence
 - Separate gem and display code
 - AUX register macro refactoring
 - Separate display module/device parameters
 - Move display capabilities debugfs under display
 - Makefile cleanups
 - Register cleanups
 - Move display lock inits under display/
 - VLV/CHV DPIO PHY register and interface refactoring
 - DSI VBT sequence refactoring
 - C10/C20 PHY PLL hardware readout
 - DPLL code cleanups
 - Cleanup PXP plane protection checks
 - Improve display debug msgs
 - PSR selective fetch fixes/improvements
 - DP MST fixes
 - Xe2LPD FBC restrictions removed
 - DGFX uses direct VBT pin mapping
 - more MTL WAs
 - fix MTL eDP bug
 - eliminate use of kmap_atomic
 
 habanalabs:
 - sysfs entry to identify a device minor id with debugfs path
 - sysfs entry to expose device module id
 - add signed device info retrieval through INFO ioctl
 - add Gaudi2C device support
 - pcie reset prepare/done hooks
 
 msm:
 - Add support for SDM670, SM8650
 - Handle the CFG interconnect to fix the obscure hangs / timeouts
 - Kconfig fix for QMP dependency
 - use managed allocators
 - DPU: SDM670, SM8650 support
 - DPU: Enable SmartDMA on SM8350 and SM8450
 - DP: enable runtime PM support
 - GPU: add metadata UAPI
 - GPU: move devcoredumps to GPU device
 - GPU: convert to drm_exec
 
 ivpu:
 - update FW API
 - new debugfs file
 - a new NOP job submission test mode
 - improve suspend/resume
 - PM improvements
 - MMU PT optimizations
 - firmware profile frequency support
 - support for uncached buffers
 - switch to gem shmem helpers
 - replace kthread with threaded irqs
 
 rockchip:
 - rk3066_hdmi: convert to atomic
 - vop2: support nv20 and nv30
 - rk3588 support
 
 mediatek:
 - use devm_platform_ioremap_resource
 - stop using iommu_present
 - MT8188 VDOSYS1 display support
 
 panfrost:
 - PM improvements
 - improve interrupt handling as poweroff
 
 qaic:
 - allow to run with single MSI
 - support host/device time sync
 - switch to persistent DRM devices
 
 exynos:
 - fix potential error pointer dereference
 - fix wrong error checking
 - add missing call to drm_atomic_helper_shutdown
 
 omapdrm:
 - dma-fence lockdep annotation fix
 
 tidss:
 - dma-fence lockdep annotation fix
 - support for AM62A7
 
 v3d:
 - BCM2712 - rpi5 support
 - fdinfo + gputop support
 - uapi for CPU job handling
 
 virtio-gpu:
 - add context debug name
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmWeLcQACgkQDHTzWXnE
 hr54zg//dtPiG9nRA3OeoQh/pTmbFO26uhS8OluLiXhcX/7T/c1e6ck4dA3De5kB
 wgaqVH6/TFuMgiBbEqZSFuQM6k2X3HLCgHcCRpiz7iGse2GODLtFiUE/E4XFPrSP
 VhycI64and9XLBmxW87yGdmezVXxo6KZNX4nYabgZ7SD83/2w+ub6rxiAvd0KfSO
 gFmaOrujOIYBjFYFtKLZIYLH4Jzsy81bP0REBzEnAiWYV5qHdsXfvVgwuOU+3G/B
 BAVUUf++SU046QeD3HPEuOp3AqgazF4uNHQH5QL0UD2144uGWsk0LA4OZBnU0qhd
 oM4Oxu9V+TXvRfYhHwiQKeVleifcZBijndqiF7rlrTnNqS4YYOCPxuXzMlZO9aEJ
 6wQL/0JX8d5G6lXsweoBzNC76jeU/gspd1DvyaTFt7I8l8YqWvR5V8l8KRf2s14R
 +CwwujoqMMVmhZ4WhB+FgZTiWw5PaWoMM9ijVFOv8QhXOz21rj718NPdBspvdJK3
 Lo3obSO5p4lqgkMEuINBEXzkHjcSyOmMe1fG4Et8Wr+IrEBr1gfG9E4Twr+3/k3s
 9Ok9nOPykbYmt4gfJp/RDNCWBr8QGZKznP6Nq8EFfIqhEkXOHQo9wtsofVUhyW7P
 qEkCYcYkRa89KFp4Lep6lgDT5O7I+32eRmbRg716qRm9nn3Vj3Y=
 =nuw0
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2024-01-10' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "This contains two major new drivers:

   - imagination is a first driver for Imagination Technologies devices,
     it only covers very specific devices, but there is hope to grow it

   - xe is a reboot of the i915 GPU (shares display) side using a more
     upstream focused development model, and trying to maximise code
     sharing. It's not enabled for any hw by default, and will hopefully
     get switched on for Intel's Lunarlake.

  This also drops a bunch of the old UMS ioctls. It's been dead long
  enough.

  amdgpu has a bunch of new color management code that is being used in
  the Steam Deck.

  amdgpu also has a new ACPI WBRF interaction to help avoid radio
  interference.

  Otherwise it's the usual lots of changes in lots of places.

  Detailed summary:

  new drivers:
   - imagination - new driver for Imagination Technologies GPU
   - xe - new driver for Intel GPUs using core drm concepts

  core:
   - add CLOSE_FB ioctl
   - remove old UMS ioctls
   - increase max objects to accomodate AMD color mgmt

  encoder:
   - create per-encoder debugfs directory

  edid:
   - split out drm_eld
   - SAD helpers
   - drop edid_firmware module parameter

  format-helper:
   - cache format conversion buffers

  sched:
   - move from kthread to workqueue
   - rename some internals
   - implement dynamic job-flow control

  gpuvm:
   - provide more features to handle GEM objects

  client:
   - don't acquire module reference

  displayport:
   - add mst path property documentation

  fdinfo:
   - alignment fix

  dma-buf:
   - add fence timestamp helper
   - add fence deadline support

  bridge:
   - transparent aux-bridge for DP/USB-C
   - lt8912b: add suspend/resume support and power regulator support

  panel:
   - edp: AUO B116XTN02, BOE NT116WHM-N21,836X2, NV116WHM-N49
   - chromebook panel support
   - elida-kd35t133: rework pm
   - powkiddy RK2023 panel
   - himax-hx8394: drop prepare/unprepare and shutdown logic
   - BOE BP101WX1-100, Powkiddy X55, Ampire AM8001280G
   - Evervision VGG644804, SDC ATNA45AF01
   - nv3052c: register docs, init sequence fixes, fascontek FS035VG158
   - st7701: Anbernic RG-ARC support
   - r63353 panel controller
   - Ilitek ILI9805 panel controller
   - AUO G156HAN04.0

  simplefb:
   - support memory regions
   - support power domains

  amdgpu:
   - add new 64-bit sequence number infrastructure
   - add AMD specific color management
   - ACPI WBRF support for RF interference handling
   - GPUVM updates
   - RAS updates
   - DCN 3.5 updates
   - Rework PCIe link speed handling
   - Document GPU reset types
   - DMUB fixes
   - eDP fixes
   - NBIO 7.9/7.11 updates
   - SubVP updates
   - XGMI PCIe state dumping for aqua vanjaram
   - GFX11 golden register updates
   - enable tunnelling on high pri compute

  amdkfd:
   - Migrate TLB flushing logic to amdgpu
   - Trap handler fixes
   - Fix restore workers handling on suspend/resume
   - Fix possible memory leak in pqm_uninit()
   - support import/export of dma-bufs using GEM handles

  radeon:
   - fix possible overflows in command buffer checking
   - check for errors in ring_lock

  i915:
   - reorg display code for reuse in xe driver
   - fdinfo memory stats printing
   - DP MST bandwidth mgmt improvements
   - DP panel replay enabling
   - MTL C20 phy state verification
   - MTL DP DSC fractional bpp support
   - Audio fastset support
   - use dma_fence interfaces instead of i915_sw_fence
   - Separate gem and display code
   - AUX register macro refactoring
   - Separate display module/device parameters
   - Move display capabilities debugfs under display
   - Makefile cleanups
   - Register cleanups
   - Move display lock inits under display/
   - VLV/CHV DPIO PHY register and interface refactoring
   - DSI VBT sequence refactoring
   - C10/C20 PHY PLL hardware readout
   - DPLL code cleanups
   - Cleanup PXP plane protection checks
   - Improve display debug msgs
   - PSR selective fetch fixes/improvements
   - DP MST fixes
   - Xe2LPD FBC restrictions removed
   - DGFX uses direct VBT pin mapping
   - more MTL WAs
   - fix MTL eDP bug
   - eliminate use of kmap_atomic

  habanalabs:
   - sysfs entry to identify a device minor id with debugfs path
   - sysfs entry to expose device module id
   - add signed device info retrieval through INFO ioctl
   - add Gaudi2C device support
   - pcie reset prepare/done hooks

  msm:
   - Add support for SDM670, SM8650
   - Handle the CFG interconnect to fix the obscure hangs / timeouts
   - Kconfig fix for QMP dependency
   - use managed allocators
   - DPU: SDM670, SM8650 support
   - DPU: Enable SmartDMA on SM8350 and SM8450
   - DP: enable runtime PM support
   - GPU: add metadata UAPI
   - GPU: move devcoredumps to GPU device
   - GPU: convert to drm_exec

  ivpu:
   - update FW API
   - new debugfs file
   - a new NOP job submission test mode
   - improve suspend/resume
   - PM improvements
   - MMU PT optimizations
   - firmware profile frequency support
   - support for uncached buffers
   - switch to gem shmem helpers
   - replace kthread with threaded irqs

  rockchip:
   - rk3066_hdmi: convert to atomic
   - vop2: support nv20 and nv30
   - rk3588 support

  mediatek:
   - use devm_platform_ioremap_resource
   - stop using iommu_present
   - MT8188 VDOSYS1 display support

  panfrost:
   - PM improvements
   - improve interrupt handling as poweroff

  qaic:
   - allow to run with single MSI
   - support host/device time sync
   - switch to persistent DRM devices

  exynos:
   - fix potential error pointer dereference
   - fix wrong error checking
   - add missing call to drm_atomic_helper_shutdown

  omapdrm:
   - dma-fence lockdep annotation fix

  tidss:
   - dma-fence lockdep annotation fix
   - support for AM62A7

  v3d:
   - BCM2712 - rpi5 support
   - fdinfo + gputop support
   - uapi for CPU job handling

  virtio-gpu:
   - add context debug name"

* tag 'drm-next-2024-01-10' of git://anongit.freedesktop.org/drm/drm: (2340 commits)
  drm/amd/display: Allow z8/z10 from driver
  drm/amd/display: fix bandwidth validation failure on DCN 2.1
  drm/amdgpu: apply the RV2 system aperture fix to RN/CZN as well
  drm/amd/display: Move fixpt_from_s3132 to amdgpu_dm
  drm/amd/display: Fix recent checkpatch errors in amdgpu_dm
  Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
  drm/amd/display: avoid stringop-overflow warnings for dp_decide_lane_settings()
  drm/amd/display: Fix power_helpers.c codestyle
  drm/amd/display: Fix hdcp_log.h codestyle
  drm/amd/display: Fix hdcp2_execution.c codestyle
  drm/amd/display: Fix hdcp_psp.h codestyle
  drm/amd/display: Fix freesync.c codestyle
  drm/amd/display: Fix hdcp_psp.c codestyle
  drm/amd/display: Fix hdcp1_execution.c codestyle
  drm/amd/pm/smu7: fix a memleak in smu7_hwmgr_backend_init
  drm/amdkfd: Fix iterator used outside loop in 'kfd_add_peer_prop()'
  drm/amdgpu: Drop 'fence' check in 'to_amdgpu_amdkfd_fence()'
  drm/amdkfd: Confirm list is non-empty before utilizing list_first_entry in kfd_topology.c
  drm/amdgpu: Fix '*fw' from request_firmware() not released in 'amdgpu_ucode_request()'
  drm/amdgpu: Fix variable 'mca_funcs' dereferenced before NULL check in 'amdgpu_mca_smu_get_mca_entry()'
  ...
2024-01-12 11:32:19 -08:00
Xingyuan Mo
a9f07790a4 accel/habanalabs: fix information leak in sec_attest_info()
This function may copy the pad0 field of struct hl_info_sec_attest to user
mode which has not been initialized, resulting in leakage of kernel heap
data to user mode. To prevent this, use kzalloc() to allocate and zero out
the buffer, which can also eliminate other uninitialized holes, if any.

Fixes: 0c88760f8f ("habanalabs/gaudi2: add secured attestation info uapi")
Signed-off-by: Xingyuan Mo <hdthky0@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Tomer Tayar
bc5f15abcf accel/habanalabs/gaudi2: avoid overriding existing undefined opcode data
Part of the undefined opcode data is updated in
gaudi2_handle_qman_err_generic() and some in
handle_lower_qman_data_on_err().
However, the 'write_enable' flag is checked only in
gaudi2_handle_qman_err_generic(), and information of more than a single
error can be mixed there.

Moreover, handle_lower_qman_data_on_err() is called only for the lower
QMAN, so for an error in the upper QMAN there is only a partial info.

Move all the data update to be done in a single place, protected by the
'write_enable' flag.
As mainly the lower QMAN's info is interesting, avoid saving the partial
info for the upper QMAN.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Tomer Tayar
aa5cea38ce accel/habanalabs: add parent_device sysfs attribute
The device debugfs directory was modified to be named as the
device-name.
This name is the parent device name, i.e. either the PCI address in case
of an ASIC, or the simulator device name in case of a simulator.

This change makes it more difficult for a user to access the debugfs
directory for a specific accel device, because he can't just use the
accel minor id, but he needs to do more device-dependent operations to
get the device name.

To make it easier to get this name, add a 'parent_device' sysfs
attribute that the user can read using the minor id before accessing
debugfs.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Tomer Tayar
565ee78840 accel/habanalabs/gaudi2: add zero padding when printing QM CP instruction
QM instructions are in multiples of 64 bits and the command type is in
the upper bits of first QWORD.
To make it clearer that an undefined command is due to a type of 0x0,
always print all 64 bits and add a zero padding if needed.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:44 +02:00
Ariel Suller
d980e1ced9 accel/habanalabs: report 3 instances of Infineon second stage
Infineon controller second stage has 3 instances that their version
need to be reported by driver.

Signed-off-by: Ariel Suller <asuller@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Moti Haimovski
7259eb7b53 accel/habanalabs/gaudi2: add signed dev info uAPI
User will provide a nonce via the INFO ioctl, and will retrieve
the signed device info generated using given nonce.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Tomer Tayar
5bc155cfea accel/habanalabs/gaudi2: use correct registers to dump QM CQ info
The QM CQ PTR_LO/PTR_HI/TSIZE registers are for pushing a CQ entry, and
although they are updated by HW even when descriptors are fetched by PQ
and CB addresses are fed into CQ, the correct registers to use when
dumping the CQ info are the ones with the _STS suffix.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Dani Liberman
47a552863d accel/habanalabs: expose module id through sysfs
Module ID exposes the physical location of the device in the server,
from the pov of the devices in regard to how they are connected by
internal fabric.

This information is already exposed in our INFO ioctl, but there are
utilities and scripts running in data-center which are already
accessing sysfs for topology information and it is easier for them
to continue getting that information from sysfs instead of opening
a file descriptor.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Dani Liberman
c9f9d0e3d0 accel/habanalabs: print error code when mapping fails
Failure to map is considered a non-trivial error and we need to notify
the user about it.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Tomer Tayar
ae303d885d accel/habanalabs/gaudi2: get the correct QM CQ info upon an error
Upon a QM error, the address/size from both the CQ and the ARC_CQ are
printed, although the instruction that led to the error was received
from only one of them.

Moreover, in case of a QM undefined opcode, only one of these
address/size sets will be captured based on the value of ARC_CQ_PTR.
However, this value can be non-zero even if currently the CQ is used, in
case the CQ/ARC_CQ are alternately used.

Under the assumption of having a stop-on-error configuration, modify to
use CP_STS.CUR_CQ field to get the relevant CQ for the QM error.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Tomer Tayar
4b0b1fbc77 accel/habanalabs: set hard reset flag if graceful reset is skipped
hl_device_cond_reset() might be called with the hard reset flag unset,
because a compute reset upon device release as part of a graceful reset
is valid.
If the conditions for graceful reset are not met, hl_device_reset() will
be called for an immediate reset. In this case a compute reset is not
valid, so it will be replaced with a hard reset together with a debug
message about it.
This message might be confusing, as it implies that a compute reset was
requested when it shouldn't. To prevent this confusion, set the hard
reset flag in hl_device_cond_reset() if going to an immediate reset.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Ofir Bitton
571cdb6e3b accel/habanalabs: remove 'get temperature' debug print
The print was added long back for a specific debug and can
now be removed.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Dafna Hirschfeld
0ec3467796 accel/habanalabs/gaudi2: fix undef opcode reporting
currently the undefined opcode event bit in set only for lower cp and
only if 'write_enable' is true. It should be set anyway and for all
streams in order to report that event to userspace.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Farah Kassabri
d1958dce5a accel/habanalabs: fix EQ heartbeat mechanism
Stop rescheduling another heartbeat check when EQ heartbeat check fails
as it generates confusing logs in dmesg that the heartbeat fails.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Oded Gabbay
42422993cf accel/habanalabs: add support for Gaudi2C device
Gaudi2 with PCI revision ID with the value of '3' represents Gaudi2C
device and should be detected and initialized as Gaudi2.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:43 +02:00
Farah Kassabri
e8bc0c1b1b accel/habanalabs: add log when eq event is not received
Add error log when no eq event is received from FW,
to cover a scenario when FW is stuck for some reason.
In such case driver will not receive neither the eq error interrupt
or the eq heartbeat event, and will just initiate a reset without
indication in the dmesg about the reason.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
Tomer Tayar
c648548233 accel/habanalabs/gaudi2: assume hard-reset by FW upon PCIe AXI drain
When a PCIe AXI drain event happens, it is possible that the driver
cannot access the device through PCIe, and therefore cannot send a
hard-reset request to FW.
Starting from FW version 1.13, FW will initiate a hard-reset in such
a case without waiting for a reset request from the driver.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
Farah Kassabri
fbc2a09e09 accel/habanalabs: update device boot error check
Use a predefined mask which set the device critical boot errors.
Driver will fail and stop its loading, only upon detecting at least
one of those errors defined in this mask.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
farah kassabri
f64fa33260 accel/habanalabs: add pcie reset prepare/done hooks
When working on a bare-metal system, if FLR will happen the firmware
will handle it and driver will have no knowledge of it, and this will
cause two issues:

1.The driver will be in operational state while it should be in reset.
  This will cause the heartbeat mechanism to keep sending messages to FW
  while pci device is in reset. Eventually heartbeat will fail and
  the device will end up in non-operational state.

2. After FW handles the FLR, and due to the reset it'll go back to
   preboot stage, and driver need to perform hard reset in order to
   load the boot fit binary.

This patch will add reset_prepare hook that will set the device to
be in disabled state, so it'll be not operational, and also
reset_done hook which will be called after the actual FLR handling,
then it will perform hard reset.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-12-19 11:09:42 +02:00
Christian Brauner
3652117f85 eventfd: simplify eventfd_signal()
Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c7 ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com>  # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28 14:08:38 +01:00
Oded Gabbay
4db74c0fde accel/habanalabs/gaudi2: fix spmu mask creation
event_types_num received from the user can be 0. In that case, the
event_mask should be 0.

In addition, to create a correct mask we need to match the number
of event types to the bit location such that bit 0 represents a single
event type, bit 1 represents 2 types and so on.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:24 +03:00
Tomer Tayar
0426e03126 accel/habanalabs/gaudi2: perform hard-reset upon PCIe AXI drain event
Non-completed transactions from PCIe towards the device are handled by
the AXI drain mechanism. This handling is in the PCIe level, but the
transactions are still there in the device consuming some queues
entries, and therefore the device must be reset.
Modify to perform hard-reset upon PCIe AXI drain events.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
farah kassabri
84190b92cc accel/habanalabs: fix bug in decoder wait for cs completion
The decoder interrupts are handled in the interrupt context
same as all user interrupts.
In such case, the wait list should be protected by
spin_lock_irqsave in order to avoid deadlock that might happen
with the user submission flow.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
Dafna Hirschfeld
2ba0236f5b accel/habanalabs: remove wrong doc for init_phys_pg_pack_from_userptr
The function does not pin the pages so remove that from the inline doc.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
Arnd Bergmann
c1805bf36a accel/habanalabs: add missing debugfs function stubs
Two function stubs were removed in an earlier commit but are now needed
again:

drivers/accel/habanalabs/common/device.c: In function 'hl_device_init':
drivers/accel/habanalabs/common/device.c:2231:14: error: implicit declaration of function 'hl_debugfs_device_init'; did you mean 'drm_debugfs_dev_init'? [-Werror=implicit-function-declaration]
 2231 |         rc = hl_debugfs_device_init(hdev);
drivers/accel/habanalabs/common/device.c:2367:9: error: implicit declaration of function 'hl_debugfs_device_fini'; did you mean 'hl_debugfs_remove_file'? [-Werror=implicit-function-declaration]
 2367 |         hl_debugfs_device_fini(hdev);
      |         ^~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-10-09 12:37:24 +03:00
Oded Gabbay
1630d14f8d accel/habanalabs: minor cosmetic update to habanalabs.h
- Update copyright years
- Align fields in struct hl_userptr
- Fix comments

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:24 +03:00
Oded Gabbay
4355f2c322 accel/habanalabs/gaudi: remove define used for simulator
We don't support simulator in upstream.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>
2023-10-09 12:37:24 +03:00