korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2025-01-15 02:05:16 +08:00

Author	SHA1	Message	Date
Christophe JAILLET	b9dd6fbd15	drm/amdkfd: Use bitmap_zalloc() when applicable 'doorbell_bitmap' and 'queue_slot_bitmap' are bitmaps. So use 'bitmap_zalloc()' to simplify code, improve the semantic and avoid some open-coded arithmetic in allocator arguments. Also change the corresponding 'kfree()' into 'bitmap_free()' to keep consistency. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-12-01 16:02:27 -05:00
Philip Yang	2e4477282c	drm/amdkfd: simplify drain retry fault unmap range always increase atomic svms->drain_pagefaults to simplify both parent range and child range unmap, page fault handle ignores the retry fault if svms->drain_pagefaults is set to speed up interrupt handling. svm_range_drain_retry_fault restart draining if another range unmap from cpu. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-24 14:06:53 -05:00
Philip Yang	7ad153db58	drm/amdkfd: handle VMA remove race VMA may be removed before unmap notifier callback, and deferred list work remove range, return success for this special case as we are handling stale retry fault. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-24 14:06:53 -05:00
Philip Yang	a0c55ecee1	drm/amdkfd: process exit and retry fault race kfd_process_wq_release drain retry fault to ensure no retry fault comes after removing kfd process from the hash table, otherwise svm page fault handler will fail to recover the fault and dump GPU vm fault log. Refactor deferred list work to get_task_mm and take mmap write lock to handle all ranges, and avoid mm is gone while inserting mmu notifier. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-24 14:06:53 -05:00
Amber Lin	a0e7e140b5	drm/amdkfd: Remove unused entries in table Remove unused entries in kfd_device_info table: num_xgmi_sdma_engines and num_sdma_queues_per_engine. They are calculated in kfd_get_num_sdma_engines and kfd_get_num_xgmi_sdma_engines instead. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Graham Sider <Graham.Sider@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-22 14:59:05 -05:00
Amber Lin	ee2f17f4d0	drm/amdkfd: Retrieve SDMA numbers from amdgpu Instead of hard coding the number of sdma engines and the number of sdma_xgmi engines in the device_info table, get the number of total SDMA instances from amdgpu. The first two engines are sdma engines and the rest are sdma-xgmi engines unless the ASIC doesn't support XGMI. v2: add kfd_ prefix to non static function names Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-22 14:46:03 -05:00
shaoyunl	c96cb65989	drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again In SRIOV configuration, the reset may fail to bring the ASIC back to normal, but stop cpsch has already been called; start_cpsch will not be called since there is no resume in this case. When a reset is triggered again, the driver should avoid doing the uninitialization again. Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-22 14:45:02 -05:00
Graham Sider	7eb0502ac0	drm/amdkfd: replace asic_family with asic_type asic_family was a duplicate of asic_type, both of type amd_asic_type. Replace all instances of device_info->asic_family with adev->asic_type and remove asic_family from device_info. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 17:10:01 -05:00
Graham Sider	046e674b96	drm/amdkfd: convert misc checks to IP version checking Switch to IP version checking instead of asic_type on various KFD version checks. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 17:09:46 -05:00
Graham Sider	e4804a39ba	drm/amdkfd: convert switches to IP version checking Converts KFD switch statements to use IP version checking instead of asic_type. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 17:09:36 -05:00
Graham Sider	dd0ae064e7	drm/amdkfd: convert KFD_IS_SOC to IP version checking Defined as GC HWIP >= IP_VERSION(9, 0, 1). Also defines KFD_GC_VERSION to return GC HWIP version. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 17:09:28 -05:00
Graham Sider	02274fc0f6	drm/amdkfd: replace trivial funcs with direct access These get funcs simply return an adev field. Replace funcs/calls with direct field accesses instead. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:11 -05:00
Felix Kuehling	b5f5738480	drm/amdkfd: Add sysfs bitfields and enums to uAPI These bits are de-facto part of the uAPI, so declare them in a uAPI header. The corresponding bit-fields and enums in user mode are defined in https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/master/include/hsakmttypes.h HSA_CAP_... -> HSA_CAPABILITY HSA_MEM_HEAP_TYPE_... -> HSA_HEAPTYPE HSA_MEM_FLAGS_... -> HSA_MEMORYPROPERTY HSA_CACHE_TYPE_... -> HsaCacheType HSA_IOLINK_TYPE_... -> HSA_IOLINKTYPE HSA_IOLINK_FLAGS_... -> HSA_LINKPROPERTY Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:03 -05:00
Graham Sider	b5d1d755c1	drm/amdkfd: remove kgd_dev declaration and initialization Completes removal of kgd_dev. Direct references to amdgpu_device objects should now be used instead. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:03 -05:00
Graham Sider	56c5977eae	drm/amdkfd: replace/remove remaining kgd_dev references Remove get_amdgpu_device and other remaining kgd_dev references aside from declaration/kfd struct entry and initialization. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:02 -05:00
Graham Sider	dff63da93e	drm/amdkfd: replace kgd_dev in gpuvm amdgpu_amdkfd funcs Modified definitions: - amdgpu_amdkfd_gpuvm_acquire_process_vm - amdgpu_amdkfd_gpuvm_release_process_vm - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu - amdgpu_amdkfd_gpuvm_free_memory_of_gpu - amdgpu_amdkfd_gpuvm_map_memory_to_gpu - amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu - amdgpu_amdkfd_gpuvm_sync_memory - amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel - amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel - amdgpu_amdkfd_gpuvm_get_vm_fault_info - amdgpu_amdkfd_gpuvm_import_dmabuf - amdgpu_amdkfd_get_tile_config Removed: - get_amdgpu_device Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:02 -05:00
Graham Sider	574c4183ef	drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcs Modified definitions: - amdgpu_amdkfd_get_fw_version - amdgpu_amdkfd_get_local_mem_info - amdgpu_amdkfd_get_gpu_clock_counter - amdgpu_amdkfd_get_max_engine_clock_in_mhz - amdgpu_amdkfd_get_cu_info - amdgpu_amdkfd_get_dmabuf_info - amdgpu_amdkfd_get_vram_usage - amdgpu_amdkfd_get_hive_id - amdgpu_amdkfd_get_unique_id - amdgpu_amdkfd_get_mmio_remap_phys_addr - amdgpu_amdkfd_get_num_gws - amdgpu_amdkfd_get_asic_rev_id - amdgpu_amdkfd_get_noretry - amdgpu_amdkfd_get_xgmi_hops_count - amdgpu_amdkfd_get_xgmi_bandwidth_mbytes - amdgpu_amdkfd_get_pcie_bandwidth_mbytes Also replaces kfd_device_by_kgd with kfd_device_by_adev, now searching via adev rather than kgd. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:02 -05:00
Graham Sider	6bfc7c7e17	drm/amdkfd: replace kgd_dev in various amgpu_amdkfd funcs Modified definitions: - amdgpu_amdkfd_submit_ib - amdgpu_amdkfd_set_compute_idle - amdgpu_amdkfd_have_atomics_support - amdgpu_amdkfd_flush_gpu_tlb_pasid - amdgpu_amdkfd_flush_gpu_tlb_pasid - amdgpu_amdkfd_gpu_reset - amdgpu_amdkfd_alloc_gtt_mem - amdgpu_amdkfd_free_gtt_mem - amdgpu_amdkfd_alloc_gws - amdgpu_amdkfd_free_gws - amdgpu_amdkfd_ras_poison_consumption_handler Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:01 -05:00
Graham Sider	3356c38dc1	drm/amdkfd: replace kgd_dev in various kfd2kgd funcs Modified definitions: - program_sh_mem_settings - set_pasid_vmid_mapping - init_interrupts - address_watch_disable - address_watch_execute - wave_control_execute - address_watch_get_offset - get_atc_vmid_pasid_mapping_info - set_scratch_backing_va - set_vm_context_page_table_base - read_vmid_from_vmfault_reg - get_cu_occupancy - program_trap_handler_settings Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:01 -05:00
Graham Sider	420185fdad	drm/amdkfd: replace kgd_dev in hqd/mqd kfd2kgd funcs Modified definitions: - hqd_load - hiq_mqd_load - hqd_sdma_load - hqd_dump - hqd_sdma_dump - hqd_is_occupied - hqd_destroy - hqd_sdma_is_occupied - hqd_sdma_destroy Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:58:01 -05:00
Graham Sider	c6c5744638	drm/amdkfd: add amdgpu_device entry to kfd_dev Patch series to remove kgd_dev struct and replace all instances with amdgpu_device objects. amdgpu_device needs to be declared in kgd_kfd_interface.h to be visible to kfd2kgd_calls. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-17 16:57:59 -05:00
Linus Torvalds	304ac8032d	Merge tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drm Pull more drm updates from Dave Airlie: "I missed a drm-misc-next pull for the main pull last week. It wasn't that major and isn't the bulk of this at all. This has a bunch of fixes all over, a lot for amdgpu and i915. bridge: - HPD improvements for lt9611uxc - eDP aux-bus support for ps8640 - LVDS data-mapping selection support ttm: - remove huge page functionality (needs reworking) - fix a race condition during BO eviction panels: - add some new panels fbdev: - fix double-free - remove unused scrolling acceleration - CONFIG_FB dep improvements locking: - improve contended locking logging - naming collision fix dma-buf: - add dma_resv_for_each_fence iterator - fix fence refcounting bug - name locking fixes prime: - fix object references during mmap nouveau: - various code style changes - refcount fix - device removal fixes - protect client list with a mutex - fix CE0 address calculation i915: - DP rates related fixes - Revert disabling dual eDP that was causing state readout problems - put the cdclk vtables in const data - Fix DVO port type for older platforms - Fix blankscreen by turning DP++ TMDS output buffers on encoder->shutdown - CCS FBs related fixes - Fix recursive lock in GuC submission - Revert guc_id from i915_request tracepoint - Build fix around dmabuf amdgpu: - GPU reset fix - Aldebaran fix - Yellow Carp fixes - DCN2.1 DMCUB fix - IOMMU regression fix for Picasso - DSC display fixes - BPC display calculation fixes - Other misc display fixes - Don't allow partial copy from user for DC debugfs - SRIOV fixes - GFX9 CSB pin count fix - Various IP version check fixes - DP 2.0 fixes - Limit DCN1 MPO fix to DCN1 amdkfd: - SVM fixes - Fix gfx version for renoir - Reset fixes udl: - timeout fix imx: - circular locking fix virtio: - NULL ptr deref fix" * tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drm: (126 commits) drm/ttm: Double check mem_type of BO while eviction drm/amdgpu: add missed support for UVD IP_VERSION(3, 0, 64) drm/amdgpu: drop jpeg IP initialization in SRIOV case drm/amd/display: reject both non-zero src_x and src_y only for DCN1x drm/amd/display: Add callbacks for DMUB HPD IRQ notifications drm/amd/display: Don't lock connection_mutex for DMUB HPD drm/amd/display: Add comment where CONFIG_DRM_AMD_DC_DCN macro ends drm/amdkfd: Fix retry fault drain race conditions drm/amdkfd: lower the VAs base offset to 8KB drm/amd/display: fix exit from amdgpu_dm_atomic_check() abruptly drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov drm/amdgpu: fix uvd crash on Polaris12 during driver unloading drm/i915/adlp/fb: Prevent the mapping of redundant trailing padding NULL pages drm/i915/fb: Fix rounding error in subsampled plane size calculation drm/i915/hdmi: Turn DP++ TMDS output buffers back on in encoder->shutdown() drm/locking: fix __stack_depot_* name conflict drm/virtio: Fix NULL dereference error in virtio_gpu_poll drm/amdgpu: fix SI handling in amdgpu_device_asic_has_dc_support() drm/amdgpu: Fix dangling kfd_bo pointer for shared BOs drm/amd/amdkfd: Don't sent command to HWS on kfd reset ...	2021-11-12 12:11:07 -08:00
Alistair Popple	ab09243aa9	mm/migrate.c: remove MIGRATE_PFN_LOCKED MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a source page was already locked during migrate_vma_collect(). If it wasn't then a second attempt is made to lock the page. However if the first attempt failed it's unlikely a second attempt will succeed, and the retry adds complexity. So clean this up by removing the retry and MIGRATE_PFN_LOCKED flag. Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set, but nothing actually checks that. Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-11 09:34:35 -08:00
Felix Kuehling	a44fe9ee05	drm/amdkfd: Fix retry fault drain race conditions The check for whether to drain retry faults must be under the mmap write lock to serialize with munmap notifier callbacks. We were also missing checks on child ranges. To fix that, simplify the logic by using a flag rather than checking on each prange. That also allows draining less frequently when many ranges are unmapped at once. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Philip Yang <Philip.Yang@amd.com> Tested-by: Alex Sierra <Alex.Sierra@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-09 17:08:00 -05:00
Alex Sierra	3aac6aa630	drm/amdkfd: lower the VAs base offset to 8KB The low 16MB of virtual address space are currently reserved for kernel mode allocations mapped into user virtual address space. This causes conflicts with HMM/SVM mappings at low virtual addresses. We tried to move those kernel mode allocations to the upper half of the 64-bit virtual address space for GFX9, which is naturally reserved for kernel use. However, TBA (trap handler code) has problems accessing addresses in the high virtual space. We have decided to set this to 8KB of the lower address space as a temporary fix, while we investigate the TBA address problem. It is very unlikely for user space to map memory at this low region. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-09 17:08:00 -05:00
shaoyunl	b8c20c74ab	drm/amd/amdkfd: Don't sent command to HWS on kfd reset When kfd needs to be reset, sending commands to HWS might cause a hang and an unnecessary timeout. This change tries not to touch HW in pre_reset and keeps queues in the evicted state when the reset is done, so they are not put back on the runlist. These queues will be destroyed on process termination. Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-05 14:12:38 -04:00
Alex Sierra	a6283010e2	drm/amdkfd: avoid recursive lock in migrations back to RAM [Why]: When we call hmm_range_fault to map memory after a migration, we don't expect memory to be migrated again as a result of hmm_range_fault. The driver ensures that all memory is in GPU-accessible locations so that no migration should be needed. However, there is one corner case where hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE back to system memory due to a write-fault when a system memory page in the same range was mapped read-only (e.g. COW). Ranges with individual pages in different locations are usually the result of failed page migrations (e.g. page lock contention). The unexpected migration back to system memory causes a deadlock from recursive locking in our driver. [How]: Creating a task reference new member under svm_range_list struct. Setting this with "current" reference, right before the hmm_range_fault is called. This member is checked against "current" reference at svm_migrate_to_ram callback function. If equal, the migration will be ignored. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-05 14:10:58 -04:00
Graham Sider	cc22b92761	drm/amdkfd: update gfx target version for Renoir Previously Renoir compiler gfx target version was forced to Raven. Update driver side for completeness. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-03 12:22:07 -04:00
Felix Kuehling	740a451b07	drm/amdkfd: Handle incomplete migration to system memory If some pages fail to migrate to system memory, don't update prange->actual_loc = 0. This prevents endless CPU page faults after partial migration failures due to contested page locks. Migration to RAM must be complete during migrations from VRAM to VRAM and during evictions. Implement retry and fail if the migration to RAM fails. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-03 12:22:07 -04:00
Felix Kuehling	12fcf0a7da	drm/amdkfd: Avoid thrashing of stack and heap Stack and heap pages tend to be shared by many small allocations. Concurrent access by CPU and GPU is therefore likely, which can lead to thrashing. Avoid this by setting the preferred location to system memory. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-03 12:22:07 -04:00
Felix Kuehling	297753a06a	drm/amdkfd: Fix SVM_ATTR_PREFERRED_LOC The preferred location should be used as the migration destination whenever it is accessible by the faulting GPU. System memory is always accessible. Peer memory is accessible if it's in the same XGMI hive. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-11-03 12:22:07 -04:00
Lang Yu	7c695a2c54	drm/amdkfd: Remove cu mask from struct queue_properties(v2) Actually, cu_mask has been copied to mqd memory and doesn't have to persist in queue_properties. Remove it from queue_properties. And use struct mqd_update_info to store such properties, then pass it to update queue operation. v2: * Rename pqm_update_queue to pqm_update_queue_properties. * Rename struct queue_update_info to struct mqd_update_info. * Rename pqm_set_cu_mask to pqm_update_mqd. Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-28 14:26:13 -04:00
Lang Yu	c6e559eb3b	drm/amdkfd: Add an optional argument into update queue operation(v2) Currently, queue is updated with data in queue_properties. And all allocated resource in queue_properties will not be freed until the queue is destroyed. But some properties(e.g., cu mask) bring some memory management headaches(e.g., memory leak) and make code complex. Actually they have been copied to mqd and don't have to persist in queue_properties. Add an argument into update queue to pass such properties, then we can remove them from queue_properties. v2: Don't use void *. Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-28 14:26:13 -04:00
Lang Yu	68df0f195a	drm/amdkfd: Separate pinned BOs destruction from general routine Currently, all kfd BOs use same destruction routine. But pinned BOs are not unpinned properly. Separate them from general routine. v2 (Felix): Add safeguard to prevent user space from freeing signal BO. Kunmap signal BO in the event of setting event page error. Just kunmap signal BO to avoid duplicating the code. Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-28 14:26:12 -04:00
Philip Yang	33c6bd989d	drm/amdkfd: debug message to count successfully migrated pages Not all migrate.cpages returned from migrate_vma_setup can be migrated, for example non anonymous page, or out of device memory. So after migrate_vma_pages returns, add debug message to count pages are successfully migrated which has MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-21 23:38:35 -04:00
Philip Yang	75fa98d6e4	drm/amdkfd: clarify the origin of cpages returned by migration functions cpages is only updated by migrate_vma_setup. So capture its value at that point to clarify the significance of the number. The next patch will add counting of actually migrated pages after migrate_vma_pages for debug purposes. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-21 23:38:29 -04:00
Alex Deucher	0f3d2b6804	drm/amdkfd: protect raven_device_info with KFD_SUPPORT_IOMMU_V2 raven_device_info is not used when KFD_SUPPORT_IOMMU_V2 is not set. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-20 11:43:57 -04:00
Alex Deucher	18f12604f5	drm/amdkfd: protect hawaii_device_info with CONFIG_DRM_AMDGPU_CIK hawaii_device_info is not used when CONFIG_DRM_AMDGPU_CIK is not set. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-20 11:43:56 -04:00
Jonathan Kim	d5edb56fbc	drm/amdkfd: map gpu hive id to xgmi connected cpu ROCr needs to be able to identify all devices that have direct access to fine grain memory, which should include CPUs that are connected to GPUs over xGMI. The GPU hive ID can be mapped onto the CPU hive ID since the CPU is part of the hive. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-19 17:14:40 -04:00
Philip Yang	43fc10c187	drm/amdkfd: unregistered svm range not overlap with TTM range When creating unregistered new svm range to recover retry fault, avoid new svm range to overlap with ranges or userptr ranges managed by TTM, otherwise svm migration will trigger TTM or userptr eviction, to evict user queues unexpectedly. Change helper amdgpu_ttm_tt_affect_userptr to return userptr which is inside the range. Add helper svm_range_check_vm_userptr to scan all userptr of the vm, and return overlap userptr bo start, last. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 22:20:13 -04:00
Yifan Zhang	6f4b590aae	drm/amdkfd: fix resume error when iommu disabled in Picasso When IOMMU disabled in sbios and kfd in iommuv2 path, IOMMU resume failure blocks system resume. Don't allow kfd to use iommu v2 when iommu is disabled. Reported-by: youling <youling257@gmail.com> Tested-by: youling <youling257@gmail.com> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:16:02 -04:00
Yifan Zhang	afd18180c0	drm/amdkfd: fix boot failure when iommu is disabled in Picasso. When IOMMU disabled in sbios and kfd in iommuv2 path, iommuv2 init will fail. But this failure should not block amdgpu driver init. Reported-by: youling <youling257@gmail.com> Tested-by: youling <youling257@gmail.com> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:15:52 -04:00
Philip Yang	ca432dcc27	drm/amdkfd: handle svm partial migration cpages 0 migrate_vma_setup may return cpages 0, means 0 page can be migrated, treat this as error case to skip the rest of vma migration steps. Change svm_migrate_vma_to_vram and svm_migrate_vma_to_ram to return the number of pages migrated successfully or error code. The caller add up all the successful migration pages and update prange->actual_loc only if the total migrated pages is not 0. This also removes the warning message "VRAM BO missing during validation" if migration cpages is 0. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:15:46 -04:00
Philip Yang	a273bc9937	drm/amdkfd: ratelimited svm debug messages No function change, use pr_debug_ratelimited to avoid per page debug message overflowing dmesg buf and console log. use dev_err to show error message from unexpected situation, to provide clue to help debug without enabling dynamic debug log. Define dev_fmt to output function name in error message. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:15:40 -04:00
Alex Sierra	7e3fb209d5	amd/amdkfd: remove svms declaration to avoid werror svm_range_list svms declaration removed to avoid werror when CONFIG_HSA_AMD_SVM is not enabled. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:14:34 -04:00
Yifan Zhang	9c152f54d9	drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails [ RUN ] KFDSVMRangeTest.PartialUnmapSysMemTest /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:245: Failure Value of: (hsaKmtAllocMemory(m_Node, m_Size, m_Flags, &m_pBuf)) Actual: 1 Expected: HSAKMT_STATUS_SUCCESS Which is: 0 /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:248: Failure Value of: (hsaKmtMapMemoryToGPUNodes(m_pBuf, m_Size, __null, mapFlags, 1, &m_Node)) Actual: 1 Expected: HSAKMT_STATUS_SUCCESS Which is: 0 /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:306: Failure Expected: ((void *)__null) != (ptr), actual: NULL vs NULL Segmentation fault (core dumped) [ ] Profile: Full Test [ ] HW capabilities: 0x9 kernel log: [ 102.029150] ret_from_fork+0x22/0x30 [ 102.029158] ---[ end trace 15c34e782714f9a3 ]--- [ 3613.603598] amdgpu: Address: 0x7f7149ccc000 already allocated by SVM [ 3613.610620] show_signal_msg: 27 callbacks suppressed There is a race with deferred actions from previous memory map changes (e.g. munmap). Flush pending deferred work to avoid such a case. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:14:34 -04:00
Yifan Zhang	6bdfc37b5c	drm/amdkfd: export svm_range_list_lock_and_flush_work export svm_range_list_lock_and_flush_work to make other kfd parts be able to sync svm_range_list. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:14:34 -04:00
Alex Sierra	71cbfeb381	drm/amdkfd: avoid conflicting address mappings [Why] Avoid conflict with address ranges mapped by SVM mechanism that try to be allocated again through ioctl_alloc in the same process. And vice versa. [How] For ioctl_alloc_memory_of_gpu allocations Check if the address range passed into ioctl memory alloc does not exist already in the kfd_process svms->objects interval tree. For SVM allocations Look for the address range into the interval tree VA from the VM inside of each pdds used in a kfd_process. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-13 14:14:34 -04:00
Alex Sierra	ec6abe831a	drm/amdkfd: rm BO resv on validation to avoid deadlock This fixes the deadlock with the BO reservations during SVM_BO evictions while allocations in VRAM are concurrently performed. More specifically, while the ttm waits for the fence to be signaled (ttm_bo_wait), it already has the BO reserved. In parallel, the restore worker might be running, prefetching memory to VRAM. This also requires reserving the BO, but blocks the mmap semaphore first. The deadlock happens when the SVM_BO eviction worker kicks in and waits for the mmap semaphore held in restore worker. This prevents the fence from being signaled, causing the deadlock until the ttm times out. We don't need to hold the BO reservation anymore during validation and mapping. Now the physical addresses are taken from hmm_range_fault. We also take migrate_mutex to prevent range migration while validate_and_map updates the GPU page table. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-08 13:20:31 -04:00
Yifan Zhang	499f4d38ec	drm/amdkfd: remove redundant iommu cleanup code kfd_resume doesn't involve iommu operation, remove redundant iommu cleanup code. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Tested-by: James Zhu <James.Zhu@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2021-10-05 12:22:15 -04:00

995 Commits