Commit Graph

995 Commits

Author SHA1 Message Date
Christophe JAILLET
b9dd6fbd15 drm/amdkfd: Use bitmap_zalloc() when applicable
'doorbell_bitmap' and 'queue_slot_bitmap' are bitmaps. So use
'bitmap_zalloc()' to simplify code, improve the semantic and avoid some
open-coded arithmetic in allocator arguments.

Also change the corresponding 'kfree()' into 'bitmap_free()' to keep
consistency.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:02:27 -05:00
Philip Yang
2e4477282c drm/amdkfd: simplify drain retry fault
unmap range always increase atomic svms->drain_pagefaults to simplify
both parent range and child range unmap, page fault handle ignores the
retry fault if svms->drain_pagefaults is set to speed up interrupt
handling. svm_range_drain_retry_fault restart draining if another
range unmap from cpu.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24 14:06:53 -05:00
Philip Yang
7ad153db58 drm/amdkfd: handle VMA remove race
VMA may be removed before unmap notifier callback, and deferred list
work remove range, return success for this special case as we are
handling stale retry fault.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24 14:06:53 -05:00
Philip Yang
a0c55ecee1 drm/amdkfd: process exit and retry fault race
kfd_process_wq_release drain retry fault to ensure no retry fault comes
after removing kfd process from the hash table, otherwise svm page fault
handler will fail to recover the fault and dump GPU vm fault log.

Refactor deferred list work to get_task_mm and take mmap write lock
to handle all ranges, and avoid mm is gone while inserting mmu notifier.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24 14:06:53 -05:00
Amber Lin
a0e7e140b5 drm/amdkfd: Remove unused entries in table
Remove unused entries in kfd_device_info table: num_xgmi_sdma_engines
and num_sdma_queues_per_engine. They are calculated in
kfd_get_num_sdma_engines and kfd_get_num_xgmi_sdma_engines instead.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:59:05 -05:00
Amber Lin
ee2f17f4d0 drm/amdkfd: Retrieve SDMA numbers from amdgpu
Instead of hard coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the number of toal SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: add kfd_ prefix to non static function names

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:46:03 -05:00
shaoyunl
c96cb65989 drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again
In SRIOV configuration, the reset may failed to bring asic back to normal but stop cpsch
already been called, the start_cpsch will not be called since there is no resume in this
case.  When reset been triggered again, driver should avoid to do uninitialization again.

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:45:02 -05:00
Graham Sider
7eb0502ac0 drm/amdkfd: replace asic_family with asic_type
asic_family was a duplicate of asic_type, both of type amd_asic_type.
Replace all instances of device_info->asic_family with adev->asic_type
and remove asic_family from device_info.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 17:10:01 -05:00
Graham Sider
046e674b96 drm/amdkfd: convert misc checks to IP version checking
Switch to IP version checking instead of asic_type on various KFD
version checks.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 17:09:46 -05:00
Graham Sider
e4804a39ba drm/amdkfd: convert switches to IP version checking
Converts KFD switch statements to use IP version checking instead
of asic_type.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 17:09:36 -05:00
Graham Sider
dd0ae064e7 drm/amdkfd: convert KFD_IS_SOC to IP version checking
Defined as GC HWIP >= IP_VERSION(9, 0, 1).

Also defines KFD_GC_VERSION to return GC HWIP version.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 17:09:28 -05:00
Graham Sider
02274fc0f6 drm/amdkfd: replace trivial funcs with direct access
These get funcs simply return an adev field. Replace funcs/calls with
direct field accesses instead.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:11 -05:00
Felix Kuehling
b5f5738480 drm/amdkfd: Add sysfs bitfields and enums to uAPI
These bits are de-facto part of the uAPI, so declare them in a uAPI header.

The corresponding bit-fields and enums in user mode are defined in
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/master/include/hsakmttypes.h

HSA_CAP_...           -> HSA_CAPABILITY
HSA_MEM_HEAP_TYPE_... -> HSA_HEAPTYPE
HSA_MEM_FLAGS_...     -> HSA_MEMORYPROPERTY
HSA_CACHE_TYPE_...    -> HsaCacheType
HSA_IOLINK_TYPE_...   -> HSA_IOLINKTYPE
HSA_IOLINK_FLAGS_...  -> HSA_LINKPROPERTY

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:03 -05:00
Graham Sider
b5d1d755c1 drm/amdkfd: remove kgd_dev declaration and initialization
Completes removal of kgd_dev. Direct references to amdgpu_device objects
should now be used instead.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:03 -05:00
Graham Sider
56c5977eae drm/amdkfd: replace/remove remaining kgd_dev references
Remove get_amdgpu_device and other remaining kgd_dev references aside
from declaration/kfd struct entry and initialization.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:02 -05:00
Graham Sider
dff63da93e drm/amdkfd: replace kgd_dev in gpuvm amdgpu_amdkfd funcs
Modified definitions:

- amdgpu_amdkfd_gpuvm_acquire_process_vm
- amdgpu_amdkfd_gpuvm_release_process_vm
- amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu
- amdgpu_amdkfd_gpuvm_free_memory_of_gpu
- amdgpu_amdkfd_gpuvm_map_memory_to_gpu
- amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu
- amdgpu_amdkfd_gpuvm_sync_memory
- amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel
- amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel
- amdgpu_amdkfd_gpuvm_get_vm_fault_info
- amdgpu_amdkfd_gpuvm_import_dmabuf
- amdgpu_amdkfd_get_tile_config

Removed:

- get_amdgpu_device

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:02 -05:00
Graham Sider
574c4183ef drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcs
Modified definitions:

- amdgpu_amdkfd_get_fw_version
- amdgpu_amdkfd_get_local_mem_info
- amdgpu_amdkfd_get_gpu_clock_counter
- amdgpu_amdkfd_get_max_engine_clock_in_mhz
- amdgpu_amdkfd_get_cu_info
- amdgpu_amdkfd_get_dmabuf_info
- amdgpu_amdkfd_get_vram_usage
- amdgpu_amdkfd_get_hive_id
- amdgpu_amdkfd_get_unique_id
- amdgpu_amdkfd_get_mmio_remap_phys_addr
- amdgpu_amdkfd_get_num_gws
- amdgpu_amdkfd_get_asic_rev_id
- amdgpu_amdkfd_get_noretry
- amdgpu_amdkfd_get_xgmi_hops_count
- amdgpu_amdkfd_get_xgmi_bandwidth_mbytes
- amdgpu_amdkfd_get_pcie_bandwidth_mbytes

Also replaces kfd_device_by_kgd with kfd_device_by_adev, now
searching via adev rather than kgd.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:02 -05:00
Graham Sider
6bfc7c7e17 drm/amdkfd: replace kgd_dev in various amgpu_amdkfd funcs
Modified definitions:

- amdgpu_amdkfd_submit_ib
- amdgpu_amdkfd_set_compute_idle
- amdgpu_amdkfd_have_atomics_support
- amdgpu_amdkfd_flush_gpu_tlb_pasid
- amdgpu_amdkfd_flush_gpu_tlb_pasid
- amdgpu_amdkfd_gpu_reset
- amdgpu_amdkfd_alloc_gtt_mem
- amdgpu_amdkfd_free_gtt_mem
- amdgpu_amdkfd_alloc_gws
- amdgpu_amdkfd_free_gws
- amdgpu_amdkfd_ras_poison_consumption_handler

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:01 -05:00
Graham Sider
3356c38dc1 drm/amdkfd: replace kgd_dev in various kfd2kgd funcs
Modified definitions:

- program_sh_mem_settings
- set_pasid_vmid_mapping
- init_interrupts
- address_watch_disable
- address_watch_execute
- wave_control_execute
- address_watch_get_offset
- get_atc_vmid_pasid_mapping_info
- set_scratch_backing_va
- set_vm_context_page_table_base
- read_vmid_from_vmfault_reg
- get_cu_occupancy
- program_trap_handler_settings

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:01 -05:00
Graham Sider
420185fdad drm/amdkfd: replace kgd_dev in hqd/mqd kfd2kgd funcs
Modified definitions:

- hqd_load
- hiq_mqd_load
- hqd_sdma_load
- hqd_dump
- hqd_sdma_dump
- hqd_is_occupied
- hqd_destroy
- hqd_sdma_is_occupied
- hqd_sdma_destroy

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:01 -05:00
Graham Sider
c6c5744638 drm/amdkfd: add amdgpu_device entry to kfd_dev
Patch series to remove kgd_dev struct and replace all instances with
amdgpu_device objects.

amdgpu_device needs to be declared in kgd_kfd_interface.h to be visible
to kfd2kgd_calls.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:57:59 -05:00
Linus Torvalds
304ac8032d drm next/fixes for 5.16-rc1
bridge:
 - HPD improvments for lt9611uxc
 - eDP aux-bus support for ps8640
 - LVDS data-mapping selection support
 
 ttm:
 - remove huge page functionality (needs reworking)
 - fix a race condition during BO eviction
 
 panels:
 - add some new panels
 
 fbdev:
 - fix double-free
 - remove unused scrolling acceleration
 - CONFIG_FB dep improvements
 
 locking:
 - improve contended locking logging
 - naming collision fix
 
 dma-buf:
 - add dma_resv_for_each_fence iterator
 - fix fence refcounting bug
 - name locking fixesA
 
 prime:
 - fix object references during mmap
 
 nouveau:
 - various code style changes
 - refcount fix
 - device removal fixes
 - protect client list with a mutex
 - fix CE0 address calculation
 
 i915:
 - DP rates related fixes
 - Revert disabling dual eDP that was causing state readout problems
 - put the cdclk vtables in const data
 - Fix DVO port type for older platforms
 - Fix blankscreen by turning DP++ TMDS output buffers on encoder->shutdown
 - CCS FBs related fixes
 - Fix recursive lock in GuC submission
 - Revert guc_id from i915_request tracepoint
 - Build fix around dmabuf
 
 amdgpu:
 - GPU reset fix
 - Aldebaran fix
 - Yellow Carp fixes
 - DCN2.1 DMCUB fix
 - IOMMU regression fix for Picasso
 - DSC display fixes
 - BPC display calculation fixes
 - Other misc display fixes
 - Don't allow partial copy from user for DC debugfs
 - SRIOV fixes
 - GFX9 CSB pin count fix
 - Various IP version check fixes
 - DP 2.0 fixes
 - Limit DCN1 MPO fix to DCN1
 
 amdkfd:
 - SVM fixes
 - Fix gfx version for renoir
 - Reset fixes
 
 udl:
 - timeout fix
 
 imx:
 - circular locking fix
 
 virtio:
 - NULL ptr deref fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmGN3YwACgkQDHTzWXnE
 hr6aZQ/+Pobf1VE7V3wPUcopxccJYmgBvG/uY8EDyjA8qaxHs2pQqGN2IooOGxr6
 F8G1N94Hem/PCDn3T8JI2Tqw5z4sy4UwLahEWISurFCen1IMAfA7hYfutp9X3O7X
 8h7b+PgkvVruEAHF7z0kqnWGPHmcro29cIHNkXVRjnJuz+Gmn1XRfo6Jj65n6D7u
 NfMeU4/lWRR3767oJQzTqyAYtGxsKaZT3/tBD5WggZBzEKC7hqhAl8EUoOLWwojo
 fDqwiEpLXpraPRIQH8trkXVHhzPeLAmG916WwS8JG3CEk9mUQ+I7Jshhd8cw+bsQ
 XPuk3OBfU9mtuiGgNzrLP3xXJZs/QN3EkpKZWLefTnJY+C4BgiP2RifTnghmwV31
 6/7Pr83CX/cn3BRd7r0xaeBZYvVYBZmwoZcsZFJBM8SVjd/ofKUfAmCzZZKheio2
 5qa6bj9DQoyjEoFAULh23plcX6hvATGP7wzfRTnJ9AlAJ0KyEjVJ3r0qE6jHMDc/
 uzcTAnKIWCxt9kSgE5qwLQtxLBaBpr/iOniZbCqGkPjiZeMzqP/ug1AKVP7kk39x
 FxZVT8ZOKk8Xt4iLZx8jmHi2KKheXYZi9LqieoTrJd44qMXDOmR9DCtQX9FZuWJS
 EJAlMj6sCowAZdODPZMVpoMc3Gti9nZ2Fpu7mLrRcMk1gKfjKwo=
 =qMNk
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drm

Pull more drm updates from Dave Airlie:
 "I missed a drm-misc-next pull for the main pull last week. It wasn't
  that major and isn't the bulk of this at all. This has a bunch of
  fixes all over, a lot for amdgpu and i915.

  bridge:
   - HPD improvments for lt9611uxc
   - eDP aux-bus support for ps8640
   - LVDS data-mapping selection support

  ttm:
   - remove huge page functionality (needs reworking)
   - fix a race condition during BO eviction

  panels:
   - add some new panels

  fbdev:
   - fix double-free
   - remove unused scrolling acceleration
   - CONFIG_FB dep improvements

  locking:
   - improve contended locking logging
   - naming collision fix

  dma-buf:
   - add dma_resv_for_each_fence iterator
   - fix fence refcounting bug
   - name locking fixesA

  prime:
   - fix object references during mmap

  nouveau:
   - various code style changes
   - refcount fix
   - device removal fixes
   - protect client list with a mutex
   - fix CE0 address calculation

  i915:
   - DP rates related fixes
   - Revert disabling dual eDP that was causing state readout problems
   - put the cdclk vtables in const data
   - Fix DVO port type for older platforms
   - Fix blankscreen by turning DP++ TMDS output buffers on encoder->shutdown
   - CCS FBs related fixes
   - Fix recursive lock in GuC submission
   - Revert guc_id from i915_request tracepoint
   - Build fix around dmabuf

  amdgpu:
   - GPU reset fix
   - Aldebaran fix
   - Yellow Carp fixes
   - DCN2.1 DMCUB fix
   - IOMMU regression fix for Picasso
   - DSC display fixes
   - BPC display calculation fixes
   - Other misc display fixes
   - Don't allow partial copy from user for DC debugfs
   - SRIOV fixes
   - GFX9 CSB pin count fix
   - Various IP version check fixes
   - DP 2.0 fixes
   - Limit DCN1 MPO fix to DCN1

  amdkfd:
   - SVM fixes
   - Fix gfx version for renoir
   - Reset fixes

  udl:
   - timeout fix

  imx:
   - circular locking fix

  virtio:
   - NULL ptr deref fix"

* tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drm: (126 commits)
  drm/ttm: Double check mem_type of BO while eviction
  drm/amdgpu: add missed support for UVD IP_VERSION(3, 0, 64)
  drm/amdgpu: drop jpeg IP initialization in SRIOV case
  drm/amd/display: reject both non-zero src_x and src_y only for DCN1x
  drm/amd/display: Add callbacks for DMUB HPD IRQ notifications
  drm/amd/display: Don't lock connection_mutex for DMUB HPD
  drm/amd/display: Add comment where CONFIG_DRM_AMD_DC_DCN macro ends
  drm/amdkfd: Fix retry fault drain race conditions
  drm/amdkfd: lower the VAs base offset to 8KB
  drm/amd/display: fix exit from amdgpu_dm_atomic_check() abruptly
  drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
  drm/amdgpu: fix uvd crash on Polaris12 during driver unloading
  drm/i915/adlp/fb: Prevent the mapping of redundant trailing padding NULL pages
  drm/i915/fb: Fix rounding error in subsampled plane size calculation
  drm/i915/hdmi: Turn DP++ TMDS output buffers back on in encoder->shutdown()
  drm/locking: fix __stack_depot_* name conflict
  drm/virtio: Fix NULL dereference error in virtio_gpu_poll
  drm/amdgpu: fix SI handling in amdgpu_device_asic_has_dc_support()
  drm/amdgpu: Fix dangling kfd_bo pointer for shared BOs
  drm/amd/amdkfd: Don't sent command to HWS on kfd reset
  ...
2021-11-12 12:11:07 -08:00
Alistair Popple
ab09243aa9 mm/migrate.c: remove MIGRATE_PFN_LOCKED
MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
source page was already locked during migrate_vma_collect().  If it
wasn't then the a second attempt is made to lock the page.  However if
the first attempt failed it's unlikely a second attempt will succeed,
and the retry adds complexity.  So clean this up by removing the retry
and MIGRATE_PFN_LOCKED flag.

Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
set, but nothing actually checks that.

Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-11 09:34:35 -08:00
Felix Kuehling
a44fe9ee05 drm/amdkfd: Fix retry fault drain race conditions
The check for whether to drain retry faults must be under the mmap write
lock to serialize with munmap notifier callbacks.

We were also missing checks on child ranges. To fix that, simplify the
logic by using a flag rather than checking on each prange. That also
allows draining less freqeuntly when many ranges are unmapped at once.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Alex Sierra <Alex.Sierra@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-09 17:08:00 -05:00
Alex Sierra
3aac6aa630 drm/amdkfd: lower the VAs base offset to 8KB
The low 16MB of virtual address space are currently reserved for kernel
mode allocations mapped into user virtual address space. This causes
conflicts with HMM/SVM mappings at low virtual addresses. We tried to
move those kernel mode allocations to the upper half of the 64-bit
virtual address space for GFX9, which is naturally reserved for kernel
use. However, TBA (trap handler code) has problems to access addresses
in the high virtual space. We have decided to set this to 8KB of the
lower address space as a temporary fix, while investigate TBA address
problem. It is very unlikely for user space to map memory at this low
region.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-09 17:08:00 -05:00
shaoyunl
b8c20c74ab drm/amd/amdkfd: Don't sent command to HWS on kfd reset
When kfd need to be reset, sent command to HWS might cause hang and get unnecessary timeout.
This change try not to touch HW in pre_reset and keep queues to be in the evicted state
when the reset is done, so they are not put back on the runlist. These queues will be destroied
on process termination.

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-05 14:12:38 -04:00
Alex Sierra
a6283010e2 drm/amdkfd: avoid recursive lock in migrations back to RAM
[Why]:
When we call hmm_range_fault to map memory after a migration, we don't
expect memory to be migrated again as a result of hmm_range_fault. The
driver ensures that all memory is in GPU-accessible locations so that
no migration should be needed. However, there is one corner case where
hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE
back to system memory due to a write-fault when a system memory page in
the same range was mapped read-only (e.g. COW). Ranges with individual
pages in different locations are usually the result of failed page
migrations (e.g. page lock contention). The unexpected migration back
to system memory causes a deadlock from recursive locking in our
driver.

[How]:
Creating a task reference new member under svm_range_list struct.
Setting this with "current" reference, right before the hmm_range_fault
is called. This member is checked against "current" reference at
svm_migrate_to_ram callback function. If equal, the migration will be
ignored.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-05 14:10:58 -04:00
Graham Sider
cc22b92761 drm/amdkfd: update gfx target version for Renoir
Previously Renoir compiler gfx target version was forced to Raven.
Update driver side for completeness.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03 12:22:07 -04:00
Felix Kuehling
740a451b07 drm/amdkfd: Handle incomplete migration to system memory
If some pages fail to migrate to system memory, don't update
prange->actual_loc = 0. This prevents endless CPU page faults after
partial migration failures due to contested page locks.

Migration to RAM must be complete during migrations from VRAM to VRAM and
during evictions. Implement retry and fail if the migration to RAM fails.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03 12:22:07 -04:00
Felix Kuehling
12fcf0a7da drm/amdkfd: Avoid thrashing of stack and heap
Stack and heap pages tend to be shared by many small allocations.
Concurrent access by CPU and GPU is therefore likely, which can lead to
thrashing. Avoid this by setting the preferred location to system memory.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03 12:22:07 -04:00
Felix Kuehling
297753a06a drm/amdkfd: Fix SVM_ATTR_PREFERRED_LOC
The preferred location should be used as the migration destination
whenever it is accessible by the faulting GPU. System memory is always
accessible. Peer memory is accessible if it's in the same XGMI hive.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03 12:22:07 -04:00
Lang Yu
7c695a2c54 drm/amdkfd: Remove cu mask from struct queue_properties(v2)
Actually, cu_mask has been copied to mqd memory and
does't have to persist in queue_properties. Remove it
from queue_properties.

And use struct mqd_update_info to store such properties,
then pass it to update queue operation.

v2:
* Rename pqm_update_queue to pqm_update_queue_properties.
* Rename struct queue_update_info to struct mqd_update_info.
* Rename pqm_set_cu_mask to pqm_update_mqd.

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28 14:26:13 -04:00
Lang Yu
c6e559eb3b drm/amdkfd: Add an optional argument into update queue operation(v2)
Currently, queue is updated with data in queue_properties.
And all allocated resource in queue_properties will not
be freed until the queue is destroyed.

But some properties(e.g., cu mask) bring some memory
management headaches(e.g., memory leak) and make code
complex. Actually they have been copied to mqd and
don't have to persist in queue_properties.

Add an argument into update queue to pass such properties,
then we can remove them from queue_properties.

v2: Don't use void *.

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28 14:26:13 -04:00
Lang Yu
68df0f195a drm/amdkfd: Separate pinned BOs destruction from general routine
Currently, all kfd BOs use same destruction routine. But pinned
BOs are not unpinned properly. Separate them from general routine.

v2 (Felix):
Add safeguard to prevent user space from freeing signal BO.
Kunmap signal BO in the event of setting event page error.
Just kunmap signal BO to avoid duplicating the code.

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28 14:26:12 -04:00
Philip Yang
33c6bd989d drm/amdkfd: debug message to count successfully migrated pages
Not all migrate.cpages returned from migrate_vma_setup can be migrated,
for example non anonymous page, or out of device memory. So after
migrate_vma_pages returns, add debug message to count pages are
successfully migrated which has MIGRATE_PFN_VALID and
MIGRATE_PFN_MIGRATE flag set.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-21 23:38:35 -04:00
Philip Yang
75fa98d6e4 drm/amdkfd: clarify the origin of cpages returned by migration functions
cpages is only updated by migrate_vma_setup. So capture its value at
that point to clarify the significance of the number. The next patch
will add counting of actually migrated pages after migrate_vma_pages for
debug purposes.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-21 23:38:29 -04:00
Alex Deucher
0f3d2b6804 drm/amdkfd: protect raven_device_info with KFD_SUPPORT_IOMMU_V2
raven_device_info is not used when KFD_SUPPORT_IOMMU_V2 is not
set.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-20 11:43:57 -04:00
Alex Deucher
18f12604f5 drm/amdkfd: protect hawaii_device_info with CONFIG_DRM_AMDGPU_CIK
hawaii_device_info is not used when CONFIG_DRM_AMDGPU_CIK is not
set.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-20 11:43:56 -04:00
Jonathan Kim
d5edb56fbc drm/amdkfd: map gpu hive id to xgmi connected cpu
ROCr needs to be able to identify all devices that have direct access to
fine grain memory, which should include CPUs that are connected to GPUs
over xGMI. The GPU hive ID can be mapped onto the CPU hive ID since the
CPU is part of the hive.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-19 17:14:40 -04:00
Philip Yang
43fc10c187 drm/amdkfd: unregistered svm range not overlap with TTM range
When creating unregistered new svm range to recover retry fault, avoid
new svm range to overlap with ranges or userptr ranges managed by TTM,
otherwise svm migration will trigger TTM or userptr eviction, to evict
user queues unexpectedly.

Change helper amdgpu_ttm_tt_affect_userptr to return userptr which is
inside the range. Add helper svm_range_check_vm_userptr to scan all
userptr of the vm, and return overlap userptr bo start, last.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 22:20:13 -04:00
Yifan Zhang
6f4b590aae drm/amdkfd: fix resume error when iommu disabled in Picasso
When IOMMU disabled in sbios and kfd in iommuv2 path,
IOMMU resume failure blocks system resume. Don't allow kfd to
use iommu v2 when iommu is disabled.

Reported-by: youling <youling257@gmail.com>
Tested-by: youling <youling257@gmail.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:16:02 -04:00
Yifan Zhang
afd18180c0 drm/amdkfd: fix boot failure when iommu is disabled in Picasso.
When IOMMU disabled in sbios and kfd in iommuv2 path, iommuv2
init will fail. But this failure should not block amdgpu driver init.

Reported-by: youling <youling257@gmail.com>
Tested-by: youling <youling257@gmail.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:15:52 -04:00
Philip Yang
ca432dcc27 drm/amdkfd: handle svm partial migration cpages 0
migrate_vma_setup may return cpages 0, means 0 page can be migrated,
treat this as error case to skip the rest of vma migration steps.

Change svm_migrate_vma_to_vram and svm_migrate_vma_to_ram to return the
number of pages migrated successfully or error code. The caller add up
all the successful migration pages and update prange->actual_loc only if
the total migrated pages is not 0.

This also removes the warning message "VRAM BO missing during
validation" if migration cpages is 0.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:15:46 -04:00
Philip Yang
a273bc9937 drm/amdkfd: ratelimited svm debug messages
No function change, use pr_debug_ratelimited to avoid per page debug
message overflowing dmesg buf and console log.

use dev_err to show error message from unexpected situation, to provide
clue to help debug without enabling dynamic debug log. Define dev_fmt to
output function name in error message.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:15:40 -04:00
Alex Sierra
7e3fb209d5 amd/amdkfd: remove svms declaration to avoid werror
svm_range_list svms declaration removed to avoid werror when
CONFIG_HSA_AMD_SVM is not enabled.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:14:34 -04:00
Yifan Zhang
9c152f54d9 drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails
[ RUN      ] KFDSVMRangeTest.PartialUnmapSysMemTest
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:245: Failure
Value of: (hsaKmtAllocMemory(m_Node, m_Size, m_Flags, &m_pBuf))
  Actual: 1
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:248: Failure
Value of: (hsaKmtMapMemoryToGPUNodes(m_pBuf, m_Size, __null, mapFlags, 1, &m_Node))
  Actual: 1
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:306: Failure
Expected: ((void *)__null) != (ptr), actual: NULL vs NULL
Segmentation fault (core dumped)
[          ] Profile: Full Test
[          ] HW capabilities: 0x9

kernel log:

[  102.029150]  ret_from_fork+0x22/0x30
[  102.029158] ---[ end trace 15c34e782714f9a3 ]---
[ 3613.603598] amdgpu: Address: 0x7f7149ccc000 already allocated by SVM
[ 3613.610620] show_signal_msg: 27 callbacks suppressed

These is race with deferred actions from previous memory map
changes (e.g. munmap).Flush pending deffered work to avoid such case.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:14:34 -04:00
Yifan Zhang
6bdfc37b5c drm/amdkfd: export svm_range_list_lock_and_flush_work
export svm_range_list_lock_and_flush_work to make other kfd parts be
able to sync svm_range_list.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:14:34 -04:00
Alex Sierra
71cbfeb381 drm/amdkfd: avoid conflicting address mappings
[Why]
Avoid conflict with address ranges mapped by SVM
mechanism that try to be allocated again through
ioctl_alloc in the same process. And viceversa.

[How]
For ioctl_alloc_memory_of_gpu allocations
Check if the address range passed into ioctl memory
alloc does not exist already in the kfd_process
svms->objects interval tree.

For SVM allocations
Look for the address range into the interval tree VA from
the VM inside of each pdds used in a kfd_process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:14:34 -04:00
Alex Sierra
ec6abe831a drm/amdkfd: rm BO resv on validation to avoid deadlock
This fix the deadlock with the BO reservations during SVM_BO evictions
while allocations in VRAM are concurrently performed. More specific,
while the ttm waits for the fence to be signaled (ttm_bo_wait), it
already has the BO reserved. In parallel, the restore worker might be
running, prefetching memory to VRAM. This also requires to reserve the
BO, but blocks the mmap semaphore first. The deadlock happens when the
SVM_BO eviction worker kicks in and waits for the mmap semaphore held
in restore worker. Preventing signal the fence back, causing the
deadlock until the ttm times out.

We don't need to hold the BO reservation anymore during validation
and mapping. Now the physical addresses are taken from hmm_range_fault.
We also take migrate_mutex to prevent range migration while
validate_and_map update GPU page table.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-08 13:20:31 -04:00
Yifan Zhang
499f4d38ec drm/amdkfd: remove redundant iommu cleanup code
kfd_resume doesn't involve iommu operation, remove
redundant iommu cleanup code.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-05 12:22:15 -04:00