linux-pci.vger.kernel.org archive mirror
* [PATCH v7 00/16] RFC Support hot device unplug in amdgpu
@ 2021-05-12 14:26 Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 01/16] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
                   ` (15 more replies)
  0 siblings, 16 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

Until now, removing a card, either physically (e.g. an eGPU with a
Thunderbolt connection) or by emulation through sysfs
(/sys/bus/pci/devices/device_id/remove), would cause random crashes in
user apps. The random crashes were mostly due to an app having mapped a
device-backed BO into its address space and still trying to access the
BO while the backing device was gone.
To address this first problem, Christian suggested fixing the handling
of mapped memory in the clients when the device goes away by forcibly
unmapping all buffers the user processes have, i.e. clearing their
respective VMAs mapping the device BOs. When the VMAs then try to fill
in the page tables again, we check in the fault handler whether the
device has been removed and, if so, return an error. This generates a
SIGBUS to the application, which can then cleanly terminate. This was
indeed done, but it in turn created a problem of kernel OOPSes: while
the app was terminating because of the SIGBUS, it would trigger a
use-after-free in the driver by accessing device structures that had
already been released by the pci remove sequence. This was handled by
introducing a 'flush' sequence during device removal, where we wait for
the drm file reference to drop to 0, meaning all user clients directly
using this device have terminated.
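
For illustration, the fault-handler check described above boils down to
something like the following minimal sketch (my_bo_vm_fault is a
made-up name, not code from this series):

#include <drm/drm_drv.h>        /* drm_dev_is_unplugged() */
#include <drm/ttm/ttm_bo_api.h> /* ttm_bo_vm_fault() */

static vm_fault_t my_bo_vm_fault(struct vm_fault *vmf)
{
        struct ttm_buffer_object *bo = vmf->vma->vm_private_data;

        /* Backing device is gone: refuse to refill the page tables so
         * the faulting application receives SIGBUS and can terminate.
         */
        if (drm_dev_is_unplugged(bo->base.dev))
                return VM_FAULT_SIGBUS;

        /* Normal fault path while the device is present. */
        return ttm_bo_vm_fault(vmf);
}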

v2:
Based on discussions on the mailing list with Daniel and Pekka [1], and
on the document produced by Pekka from those discussions [2], the whole
approach of returning SIGBUS and waiting for all user clients with CPU
mappings of device BOs to die was dropped. Instead, as the document
suggests, the device structures are kept alive until the last reference
to the device is dropped by a user client, and in the meantime all
existing and new CPU mappings of the BOs belonging to the device,
directly or through dma-buf import, are rerouted to a per user process
dummy rw page. Also, I skipped the 'Requirements for KMS UAPI' section
of [2] since I am trying to get the minimal set of requirements that
still gives a useful solution to work, namely the 'Requirements for
Render and Cross-Device UAPI' section, so my test case is removing a
secondary device, which is render only and is not involved in KMS.

v3:
More updates following comments on v2, such as removing the loop to
find the DRM file when rerouting page faults to the dummy page, getting
rid of unnecessary sysfs handling refactoring, and moving the prevention
of GPU recovery post device unplug from amdgpu to the scheduler layer.
On top of that, added unplug support for IOMMU-enabled systems.

v4:
Drop the last sysfs hack and use a sysfs default attribute.
Guard against write accesses after device removal to avoid modifying
released memory.
Update dummy page handling to on-demand allocation and release through
the drm managed framework.
Add a return value to the scheduler job timeout (TO) handler (by Luben
Tuikov) and use it in amdgpu to prevent GPU recovery post device unplug.
Also rebase on top of drm-misc-next instead of amd-staging-drm-next.

v5:
The most significant change in this series is the improved protection
against the kernel driver accessing MMIO ranges that were allocated for
the device once the device is gone. To do this, first a patch
'drm/amdgpu: Unmap all MMIO mappings' is introduced. This patch unmaps
all MMIO mapped into the kernel address space in the form of BARs and
kernel BOs with CPU visible VRAM mappings. This helped discover multiple
such access points, because a page fault would be immediately generated
on access. Most of them were solved by moving HW fini code into the
pci_remove stage (patch 'drm/amdgpu: Add early fini callback'), and for
some that were harder to unwind, drm_dev_enter/exit scoping was used. In
addition, all the IOCTLs and all background work and timers are now
protected with drm_dev_enter/exit at their root, so that after
drm_dev_unplug is finished none of them run anymore and the pci_remove
thread is the only executing thread which might touch the HW. To prevent
deadlocks in such a case against threads stuck on various HW or SW
fences, the patches 'drm/amdgpu: Finalise device fences on device
remove' and 'drm/amdgpu: Add rw_sem to pushing job into sched queue'
take care of force signaling all such existing fences and rejecting any
newly added ones.
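
For reference, the drm_dev_enter/exit scoping at the root of an IOCTL
or worker looks roughly like this (a sketch only; amdgpu_example_ioctl
is a made-up name):

int amdgpu_example_ioctl(struct drm_device *dev, void *data,
                         struct drm_file *filp)
{
        int idx, r = 0;

        /* drm_dev_enter() fails once drm_dev_unplug() has completed,
         * so no new HW access can start after the remove sequence.
         */
        if (!drm_dev_enter(dev, &idx))
                return -ENODEV;

        /* ... code that may touch registers/MMIO goes here ... */

        drm_dev_exit(idx);
        return r;
}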

v6:
Drop using drm_dev_enter/exit in conjunction with signaling HW fences
before setting drm_dev_unplug. We need to devise a more robust
cross-DRM approach to the problem of dma fence waits falling inside
drm_dev_enter/exit scopes -> moved to TODO.

v7:
Small cosmetic changes after v6 comments.
Added back the patch which invalidates MMIO mappings in the driver
(register, doorbell and VRAM). While the waterproof protection against
MMIO accesses from v5 was dropped until a more generic approach is
developed, I do believe it's better to cause a kernel panic once such an
access happens and then fix it than to let those accesses go unnoticed.
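
Conceptually, the invalidation is just tearing down the kernel
mappings in the remove path, e.g. for the register BAR (a sketch, not
the exact patch hunk):

        /* Any late register access now faults loudly instead of
         * silently writing through a stale mapping.
         */
        iounmap(adev->rmmio);
        adev->rmmio = NULL;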

With these patches I am able to gracefully remove the secondary card
using the sysfs remove hook while glxgears is running off of the
secondary card (DRI_PRIME=1), without kernel oopses or hangs, and to
keep working with the primary card or soft reset the device without
hangs or oopses.
Also, as per Daniel's comment, I added 3 tests to the IGT
core_hotunplug test suite [4]: remove the device while commands are
submitted, while a BO is exported, and while a fence is exported (not
pushed yet).
It is also now possible to plug the device back in after unplug.
Some users can already successfully use these patches with eGPU
boxes [3].

TODOs for follow-up work:
Convert amdgpu code to use devm (for HW stuff) and drmm (for SW stuff
and allocations) (Daniel).
Add support for the 'Requirements for KMS UAPI' section of [2] -
unplugging the primary, display-connected card.
Annotate drm_dev_enter/exit against dma_fence waits as a first step in
deciding where to use drm_dev_enter/exit in the device unplug code.

[1] - Discussions during v6 of the patchset https://lore.kernel.org/amd-gfx/20210510163625.407105-1-andrey.grodzovsky@amd.com/#r
[2] - drm/doc: device hot-unplug for userspace https://www.spinics.net/lists/dri-devel/msg259755.html
[3] - Related gitlab ticket https://gitlab.freedesktop.org/drm/amd/-/issues/1081
[4] - Related IGT tests https://gitlab.freedesktop.org/agrodzov/igt-gpu-tools/-/commits/master

Andrey Grodzovsky (16):
  drm/ttm: Remap all page faults to per process dummy page.
  drm/amdgpu: Split amdgpu_device_fini into early and late
  drm/amdkfd: Split kfd suspend from device exit
  drm/amdgpu: Add early fini callback
  drm/amdgpu: Handle IOMMU enabled case.
  drm/amdgpu: Remap all page faults to per process dummy page.
  PCI: Add support for dev_groups to struct pci_driver
  drm/amdgpu: Convert driver sysfs attributes to static attributes
  drm/amdgpu: Guard against write accesses after device removal
  drm/sched: Make timeout timer rearm conditional.
  drm/amdgpu: Prevent any job recoveries after device is unplugged.
  drm/amdgpu: Fix hang on device removal.
  drm/scheduler: Fix hang when sched_entity released
  drm/amd/display: Remove superfluous drm_mode_config_cleanup
  drm/amdgpu: Verify DMA operations from device are done
  drm/amdgpu: Unmap all MMIO mappings

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c  |  17 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 121 +++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  26 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  31 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h      |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       |   9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c   |  25 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c        |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c       |  31 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h       |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  19 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c       |  12 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  63 +++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h       |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  25 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c       |  31 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c       |  11 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c       |  22 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   7 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  |  14 +-
 drivers/gpu/drm/amd/amdgpu/cik_ih.c           |   3 +-
 drivers/gpu/drm/amd/amdgpu/cz_ih.c            |   3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        |   1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c         |   1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c         |   1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c         |   1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         |   1 -
 drivers/gpu/drm/amd/amdgpu/iceland_ih.c       |   3 +-
 drivers/gpu/drm/amd/amdgpu/navi10_ih.c        |   6 +-
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c        |  44 +++----
 drivers/gpu/drm/amd/amdgpu/psp_v12_0.c        |   8 +-
 drivers/gpu/drm/amd/amdgpu/psp_v3_1.c         |   8 +-
 drivers/gpu/drm/amd/amdgpu/si_ih.c            |   3 +-
 drivers/gpu/drm/amd/amdgpu/tonga_ih.c         |   3 +-
 drivers/gpu/drm/amd/amdgpu/vce_v4_0.c         |  26 ++--
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c         |  22 ++--
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   6 +-
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   6 +-
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |   3 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  13 +-
 drivers/gpu/drm/amd/include/amd_shared.h      |   2 +
 drivers/gpu/drm/scheduler/sched_entity.c      |   3 +-
 drivers/gpu/drm/scheduler/sched_main.c        |  35 ++++-
 drivers/gpu/drm/ttm/ttm_bo_vm.c               |  54 +++++++-
 drivers/pci/pci-driver.c                      |   1 +
 include/drm/ttm/ttm_bo_api.h                  |   2 +
 include/linux/pci.h                           |   3 +
 54 files changed, 514 insertions(+), 254 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v7 01/16] drm/ttm: Remap all page faults to per process dummy page.
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 02/16] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

On device removal, reroute all CPU mappings to a dummy page.

v3:
Remove loop to find DRM file and instead access it
by vma->vm_file->private_data. Move dummy page installation
into a separate function.

v4:
Map the entire BO's VA space to an on-demand allocated dummy page
on the first fault for that BO.

v5: Remove duplicate return.

v6: Polish ttm_bo_vm_dummy_page, remove superfluous code.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 54 ++++++++++++++++++++++++++++++++-
 include/drm/ttm/ttm_bo_api.h    |  2 ++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index b31b18058965..7ff9fd551357 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -34,6 +34,8 @@
 #include <drm/ttm/ttm_bo_driver.h>
 #include <drm/ttm/ttm_placement.h>
 #include <drm/drm_vma_manager.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_managed.h>
 #include <linux/mm.h>
 #include <linux/pfn_t.h>
 #include <linux/rbtree.h>
@@ -380,19 +382,69 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
 }
 EXPORT_SYMBOL(ttm_bo_vm_fault_reserved);
 
+static void ttm_bo_release_dummy_page(struct drm_device *dev, void *res)
+{
+	struct page *dummy_page = (struct page *)res;
+
+	__free_page(dummy_page);
+}
+
+vm_fault_t ttm_bo_vm_dummy_page(struct vm_fault *vmf, pgprot_t prot)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct ttm_buffer_object *bo = vma->vm_private_data;
+	struct drm_device *ddev = bo->base.dev;
+	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long address;
+	unsigned long pfn;
+	struct page *page;
+
+	/* Allocate new dummy page to map all the VA range in this VMA to it*/
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+
+	/* Set the page to be freed using drmm release action */
+	if (drmm_add_action_or_reset(ddev, ttm_bo_release_dummy_page, page))
+		return VM_FAULT_OOM;
+
+	pfn = page_to_pfn(page);
+
+	/* Prefault the entire VMA range right away to avoid further faults */
+	for (address = vma->vm_start; address < vma->vm_end; address += PAGE_SIZE) {
+
+		if (vma->vm_flags & VM_MIXEDMAP)
+			ret = vmf_insert_mixed_prot(vma, address,
+						    __pfn_to_pfn_t(pfn, PFN_DEV),
+						    prot);
+		else
+			ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(ttm_bo_vm_dummy_page);
+
 vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	pgprot_t prot;
 	struct ttm_buffer_object *bo = vma->vm_private_data;
+	struct drm_device *ddev = bo->base.dev;
 	vm_fault_t ret;
+	int idx;
 
 	ret = ttm_bo_vm_reserve(bo, vmf);
 	if (ret)
 		return ret;
 
 	prot = vma->vm_page_prot;
-	ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT, 1);
+	if (drm_dev_enter(ddev, &idx)) {
+		ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT, 1);
+		drm_dev_exit(idx);
+	} else {
+		ret = ttm_bo_vm_dummy_page(vmf, prot);
+	}
 	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
 		return ret;
 
diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index 639521880c29..254ede97f8e3 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -620,4 +620,6 @@ int ttm_bo_vm_access(struct vm_area_struct *vma, unsigned long addr,
 		     void *buf, int len, int write);
 bool ttm_bo_delayed_delete(struct ttm_device *bdev, bool remove_all);
 
+vm_fault_t ttm_bo_vm_dummy_page(struct vm_fault *vmf, pgprot_t prot);
+
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v7 02/16] drm/amdgpu: Split amdgpu_device_fini into early and late
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 01/16] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 03/16] drm/amdkfd: Split kfd suspend from device exit Andrey Grodzovsky
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König, Alex Deucher

Some of the work in amdgpu_device_fini, such as disabling HW
interrupts and finalizing pending fences, must be done right away in
pci_remove, while most of the work related to finalizing and
releasing driver data structures can be deferred until the
drm_driver.release hook is called, i.e. when the last device
reference is dropped.

v4: Change functions prefix early->hw and late->sw

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 +++++++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  7 ++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 15 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 26 +++++++++++++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h    |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 12 +++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  3 ++-
 drivers/gpu/drm/amd/amdgpu/cik_ih.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/cz_ih.c         |  2 +-
 drivers/gpu/drm/amd/amdgpu/iceland_ih.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/navi10_ih.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/si_ih.c         |  2 +-
 drivers/gpu/drm/amd/amdgpu/tonga_ih.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c     |  2 +-
 17 files changed, 79 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 380801b59b07..d830a541ba89 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1099,7 +1099,9 @@ static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_device *bdev)
 
 int amdgpu_device_init(struct amdgpu_device *adev,
 		       uint32_t flags);
-void amdgpu_device_fini(struct amdgpu_device *adev);
+void amdgpu_device_fini_hw(struct amdgpu_device *adev);
+void amdgpu_device_fini_sw(struct amdgpu_device *adev);
+
 int amdgpu_gpu_wait_for_idle(struct amdgpu_device *adev);
 
 void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
@@ -1319,6 +1321,8 @@ void amdgpu_driver_lastclose_kms(struct drm_device *dev);
 int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv);
 void amdgpu_driver_postclose_kms(struct drm_device *dev,
 				 struct drm_file *file_priv);
+void amdgpu_driver_release_kms(struct drm_device *dev);
+
 int amdgpu_device_ip_suspend(struct amdgpu_device *adev);
 int amdgpu_device_suspend(struct drm_device *dev, bool fbcon);
 int amdgpu_device_resume(struct drm_device *dev, bool fbcon);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b4ad1c055c70..3760ce7d8ff8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3648,15 +3648,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
  * Tear down the driver info (all asics).
  * Called at driver shutdown.
  */
-void amdgpu_device_fini(struct amdgpu_device *adev)
+void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 {
 	dev_info(adev->dev, "amdgpu: finishing device.\n");
 	flush_delayed_work(&adev->delayed_init_work);
 	ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
 	adev->shutdown = true;
 
-	kfree(adev->pci_state);
-
 	/* make sure IB test finished before entering exclusive mode
 	 * to avoid preemption on IB test
 	 * */
@@ -3673,11 +3671,24 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 		else
 			drm_atomic_helper_shutdown(adev_to_drm(adev));
 	}
-	amdgpu_fence_driver_fini(adev);
+	amdgpu_fence_driver_fini_hw(adev);
+
 	if (adev->pm_sysfs_en)
 		amdgpu_pm_sysfs_fini(adev);
+	if (adev->ucode_sysfs_en)
+		amdgpu_ucode_sysfs_fini(adev);
+	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
+
+
 	amdgpu_fbdev_fini(adev);
+
+	amdgpu_irq_fini_hw(adev);
+}
+
+void amdgpu_device_fini_sw(struct amdgpu_device *adev)
+{
 	amdgpu_device_ip_fini(adev);
+	amdgpu_fence_driver_fini_sw(adev);
 	release_firmware(adev->firmware.gpu_info_fw);
 	adev->firmware.gpu_info_fw = NULL;
 	adev->accel_working = false;
@@ -3703,14 +3714,13 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 	adev->rmmio = NULL;
 	amdgpu_device_doorbell_fini(adev);
 
-	if (adev->ucode_sysfs_en)
-		amdgpu_ucode_sysfs_fini(adev);
-
-	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
 	if (IS_ENABLED(CONFIG_PERF_EVENTS))
 		amdgpu_pmu_fini(adev);
 	if (adev->mman.discovery_bin)
 		amdgpu_discovery_fini(adev);
+
+	kfree(adev->pci_state);
+
 }
 
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6cf573293823..5ebed4c7d9c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1311,14 +1311,10 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 
-#ifdef MODULE
-	if (THIS_MODULE->state != MODULE_STATE_GOING)
-#endif
-		DRM_ERROR("Hotplug removal is not supported\n");
 	drm_dev_unplug(dev);
 	amdgpu_driver_unload_kms(dev);
+
 	pci_disable_device(pdev);
-	pci_set_drvdata(pdev, NULL);
 }
 
 static void
@@ -1748,6 +1744,7 @@ static const struct drm_driver amdgpu_kms_driver = {
 	.dumb_create = amdgpu_mode_dumb_create,
 	.dumb_map_offset = amdgpu_mode_dumb_mmap,
 	.fops = &amdgpu_driver_kms_fops,
+	.release = &amdgpu_driver_release_kms,
 
 	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
 	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea46859618..1ffb36bd0b19 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -523,7 +523,7 @@ int amdgpu_fence_driver_init(struct amdgpu_device *adev)
  *
  * Tear down the fence driver for all possible rings (all asics).
  */
-void amdgpu_fence_driver_fini(struct amdgpu_device *adev)
+void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
 {
 	unsigned i, j;
 	int r;
@@ -545,6 +545,19 @@ void amdgpu_fence_driver_fini(struct amdgpu_device *adev)
 				       ring->fence_drv.irq_type);
 
 		del_timer_sync(&ring->fence_drv.fallback_timer);
+	}
+}
+
+void amdgpu_fence_driver_fini_sw(struct amdgpu_device *adev)
+{
+	unsigned int i, j;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->fence_drv.initialized)
+			continue;
+
 		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
 			dma_fence_put(ring->fence_drv.fences[j]);
 		kfree(ring->fence_drv.fences);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 90f50561b43a..233b64dab94b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -49,6 +49,7 @@
 #include <drm/drm_irq.h>
 #include <drm/drm_vblank.h>
 #include <drm/amdgpu_drm.h>
+#include <drm/drm_drv.h>
 #include "amdgpu.h"
 #include "amdgpu_ih.h"
 #include "atom.h"
@@ -348,6 +349,20 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
 	return 0;
 }
 
+
+void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
+{
+	if (adev->irq.installed) {
+		drm_irq_uninstall(&adev->ddev);
+		adev->irq.installed = false;
+		if (adev->irq.msi_enabled)
+			pci_free_irq_vectors(adev->pdev);
+
+		if (!amdgpu_device_has_dc_support(adev))
+			flush_work(&adev->hotplug_work);
+	}
+}
+
 /**
  * amdgpu_irq_fini - shut down interrupt handling
  *
@@ -357,19 +372,10 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
  * functionality, shuts down vblank, hotplug and reset interrupt handling,
  * turns off interrupts from all sources (all ASICs).
  */
-void amdgpu_irq_fini(struct amdgpu_device *adev)
+void amdgpu_irq_fini_sw(struct amdgpu_device *adev)
 {
 	unsigned i, j;
 
-	if (adev->irq.installed) {
-		drm_irq_uninstall(adev_to_drm(adev));
-		adev->irq.installed = false;
-		if (adev->irq.msi_enabled)
-			pci_free_irq_vectors(adev->pdev);
-		if (!amdgpu_device_has_dc_support(adev))
-			flush_work(&adev->hotplug_work);
-	}
-
 	for (i = 0; i < AMDGPU_IRQ_CLIENTID_MAX; ++i) {
 		if (!adev->irq.client[i].sources)
 			continue;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
index cf6116648322..78ad4784cc74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
@@ -103,7 +103,8 @@ void amdgpu_irq_disable_all(struct amdgpu_device *adev);
 irqreturn_t amdgpu_irq_handler(int irq, void *arg);
 
 int amdgpu_irq_init(struct amdgpu_device *adev);
-void amdgpu_irq_fini(struct amdgpu_device *adev);
+void amdgpu_irq_fini_sw(struct amdgpu_device *adev);
+void amdgpu_irq_fini_hw(struct amdgpu_device *adev);
 int amdgpu_irq_add_id(struct amdgpu_device *adev,
 		      unsigned client_id, unsigned src_id,
 		      struct amdgpu_irq_src *source);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 39ee88d29cca..f3ecada208b0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -28,6 +28,7 @@
 
 #include "amdgpu.h"
 #include <drm/amdgpu_drm.h>
+#include <drm/drm_drv.h>
 #include "amdgpu_uvd.h"
 #include "amdgpu_vce.h"
 #include "atom.h"
@@ -92,7 +93,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
 	}
 
 	amdgpu_acpi_fini(adev);
-	amdgpu_device_fini(adev);
+	amdgpu_device_fini_hw(adev);
 }
 
 void amdgpu_register_gpu_instance(struct amdgpu_device *adev)
@@ -1219,6 +1220,15 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,
 	pm_runtime_put_autosuspend(dev->dev);
 }
 
+
+void amdgpu_driver_release_kms(struct drm_device *dev)
+{
+	struct amdgpu_device *adev = drm_to_adev(dev);
+
+	amdgpu_device_fini_sw(adev);
+	pci_set_drvdata(adev->pdev, NULL);
+}
+
 /*
  * VBlank related functions.
  */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 0541196ae1ed..844a667f655b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2325,6 +2325,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
 	if (!adev->ras_features || !con)
 		return 0;
 
+
 	/* Need disable ras on all IPs here before ip [hw/sw]fini */
 	amdgpu_ras_disable_all_features(adev, 0);
 	amdgpu_ras_recovery_fini(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index ca1622835296..e7d3d0dbdd96 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -107,7 +107,8 @@ struct amdgpu_fence_driver {
 };
 
 int amdgpu_fence_driver_init(struct amdgpu_device *adev);
-void amdgpu_fence_driver_fini(struct amdgpu_device *adev);
+void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev);
+void amdgpu_fence_driver_fini_sw(struct amdgpu_device *adev);
 void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
 
 int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
index d3745711d55f..183d44a6583c 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
@@ -309,7 +309,7 @@ static int cik_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
index 307c01301c87..d32743949003 100644
--- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
@@ -301,7 +301,7 @@ static int cz_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
index cc957471f31e..da96c6013477 100644
--- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
@@ -300,7 +300,7 @@ static int iceland_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
index f4e4040bbd25..5eea4550b856 100644
--- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
@@ -569,7 +569,7 @@ static int navi10_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c b/drivers/gpu/drm/amd/amdgpu/si_ih.c
index 51880f6ef634..751307f3252c 100644
--- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
@@ -175,7 +175,7 @@ static int si_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
index 249fcbee7871..973d80ec7f6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
@@ -312,7 +312,7 @@ static int tonga_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
index ca8efa5c6978..dead9c2fbd4c 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
@@ -513,7 +513,7 @@ static int vega10_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
index 8a122b413bf5..58993ae1fe11 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
@@ -565,7 +565,7 @@ static int vega20_ih_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	amdgpu_irq_fini(adev);
+	amdgpu_irq_fini_sw(adev);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
 	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v7 03/16] drm/amdkfd: Split kfd suspend from device exit
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 01/16] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 02/16] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 20:33   ` Felix Kuehling
  2021-05-12 14:26 ` [PATCH v7 04/16] " Andrey Grodzovsky
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

Helps to expedite the HW-related teardown into amdgpu_pci_remove.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 5f6696a3c778..2b06dee9a0ce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -170,7 +170,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 	}
 }
 
-void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
+void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev)
 {
 	if (adev->kfd.dev) {
 		kgd2kfd_device_exit(adev->kfd.dev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 14f68c028126..f8e10af99c28 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -127,7 +127,7 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev);
-void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev);
+void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev);
 int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
 				uint32_t vmid, uint64_t gpu_addr,
 				uint32_t *ib_cmd, uint32_t ib_len);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 357b9bf62a1c..ab6d2a43c9a3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -858,10 +858,11 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 	return kfd->init_complete;
 }
 
+
+
 void kgd2kfd_device_exit(struct kfd_dev *kfd)
 {
 	if (kfd->init_complete) {
-		kgd2kfd_suspend(kfd, false);
 		device_queue_manager_uninit(kfd->dqm);
 		kfd_interrupt_exit(kfd);
 		kfd_topology_remove_device(kfd);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v7 04/16] drm/amdgpu: Add early fini callback
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 03/16] drm/amdkfd: Split kfd suspend from device exit Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

Use it to call display code that depends on device->drv_data
before it's set to NULL on device unplug.

v5: Move HW finalization into this callback to prevent MMIO accesses
    post pci remove.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 59 +++++++++++++------
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +++-
 drivers/gpu/drm/amd/include/amd_shared.h      |  2 +
 3 files changed, 52 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3760ce7d8ff8..18598eda18f6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2558,34 +2558,26 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
 	return 0;
 }
 
-/**
- * amdgpu_device_ip_fini - run fini for hardware IPs
- *
- * @adev: amdgpu_device pointer
- *
- * Main teardown pass for hardware IPs.  The list of all the hardware
- * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
- * are run.  hw_fini tears down the hardware associated with each IP
- * and sw_fini tears down any software state associated with each IP.
- * Returns 0 on success, negative error code on failure.
- */
-static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
+static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 {
 	int i, r;
 
-	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
-		amdgpu_virt_release_ras_err_handler_data(adev);
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (!adev->ip_blocks[i].version->funcs->early_fini)
+			continue;
 
-	amdgpu_ras_pre_fini(adev);
+		r = adev->ip_blocks[i].version->funcs->early_fini((void *)adev);
+		if (r) {
+			DRM_DEBUG("early_fini of IP block <%s> failed %d\n",
+				  adev->ip_blocks[i].version->funcs->name, r);
+		}
+	}
 
-	if (adev->gmc.xgmi.num_physical_nodes > 1)
-		amdgpu_xgmi_remove_device(adev);
+	amdgpu_amdkfd_suspend(adev, false);
 
 	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
 	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
 
-	amdgpu_amdkfd_device_fini(adev);
-
 	/* need to disable SMC first */
 	for (i = 0; i < adev->num_ip_blocks; i++) {
 		if (!adev->ip_blocks[i].status.hw)
@@ -2616,6 +2608,33 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 		adev->ip_blocks[i].status.hw = false;
 	}
 
+	return 0;
+}
+
+/**
+ * amdgpu_device_ip_fini - run fini for hardware IPs
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Main teardown pass for hardware IPs.  The list of all the hardware
+ * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
+ * are run.  hw_fini tears down the hardware associated with each IP
+ * and sw_fini tears down any software state associated with each IP.
+ * Returns 0 on success, negative error code on failure.
+ */
+static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
+{
+	int i, r;
+
+	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
+		amdgpu_virt_release_ras_err_handler_data(adev);
+
+	amdgpu_ras_pre_fini(adev);
+
+	if (adev->gmc.xgmi.num_physical_nodes > 1)
+		amdgpu_xgmi_remove_device(adev);
+
+	amdgpu_amdkfd_device_fini_sw(adev);
 
 	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
 		if (!adev->ip_blocks[i].status.sw)
@@ -3683,6 +3702,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_fbdev_fini(adev);
 
 	amdgpu_irq_fini_hw(adev);
+
+	amdgpu_device_ip_fini_early(adev);
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 296704ce3768..6c2c6a51ce6c 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1251,6 +1251,15 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
 	return -EINVAL;
 }
 
+static int amdgpu_dm_early_fini(void *handle)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+
+	amdgpu_dm_audio_fini(adev);
+
+	return 0;
+}
+
 static void amdgpu_dm_fini(struct amdgpu_device *adev)
 {
 	int i;
@@ -1259,8 +1268,6 @@ static void amdgpu_dm_fini(struct amdgpu_device *adev)
 		drm_encoder_cleanup(&adev->dm.mst_encoders[i].base);
 	}
 
-	amdgpu_dm_audio_fini(adev);
-
 	amdgpu_dm_destroy_drm_device(&adev->dm);
 
 #if defined(CONFIG_DRM_AMD_SECURE_DISPLAY)
@@ -2298,6 +2305,7 @@ static const struct amd_ip_funcs amdgpu_dm_funcs = {
 	.late_init = dm_late_init,
 	.sw_init = dm_sw_init,
 	.sw_fini = dm_sw_fini,
+	.early_fini = amdgpu_dm_early_fini,
 	.hw_init = dm_hw_init,
 	.hw_fini = dm_hw_fini,
 	.suspend = dm_suspend,
diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
index 43ed6291b2b8..1ad56da486e4 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -240,6 +240,7 @@ enum amd_dpm_forced_level;
  * @late_init: sets up late driver/hw state (post hw_init) - Optional
  * @sw_init: sets up driver state, does not configure hw
  * @sw_fini: tears down driver state, does not configure hw
+ * @early_fini: tears down stuff before dev detached from driver
  * @hw_init: sets up the hw state
  * @hw_fini: tears down the hw state
  * @late_fini: final cleanup
@@ -268,6 +269,7 @@ struct amd_ip_funcs {
 	int (*late_init)(void *handle);
 	int (*sw_init)(void *handle);
 	int (*sw_fini)(void *handle);
+	int (*early_fini)(void *handle);
 	int (*hw_init)(void *handle);
 	int (*hw_fini)(void *handle);
 	void (*late_fini)(void *handle);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case.
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 04/16] " Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-14 14:41   ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 06/16] drm/amdgpu: Remap all page faults to per process dummy page Andrey Grodzovsky
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

Handle all DMA IOMMU group related dependencies before the
group is removed.

v5: Drop the IOMMU notifier and switch to a lockless call to
    ttm_tt_unpopulate.
v6: Drop the BO unmap list.
v7:
Drop amdgpu_gart_fini.
In amdgpu_ih_ring_fini, do an unconditional check (!ih->ring)
to avoid freeing uninitialized rings.
Call amdgpu_ih_ring_fini unconditionally.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 14 +-------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c     |  6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  5 +++++
 drivers/gpu/drm/amd/amdgpu/cik_ih.c        |  1 -
 drivers/gpu/drm/amd/amdgpu/cz_ih.c         |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c     |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/iceland_ih.c    |  1 -
 drivers/gpu/drm/amd/amdgpu/navi10_ih.c     |  4 ----
 drivers/gpu/drm/amd/amdgpu/si_ih.c         |  1 -
 drivers/gpu/drm/amd/amdgpu/tonga_ih.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c     |  4 ----
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c     |  4 ----
 18 files changed, 13 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 18598eda18f6..a0bff4713672 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3256,7 +3256,6 @@ static const struct attribute *amdgpu_dev_attributes[] = {
 	NULL
 };
 
-
 /**
  * amdgpu_device_init - initialize the driver
  *
@@ -3698,12 +3697,13 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 		amdgpu_ucode_sysfs_fini(adev);
 	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
 
-
 	amdgpu_fbdev_fini(adev);
 
 	amdgpu_irq_fini_hw(adev);
 
 	amdgpu_device_ip_fini_early(adev);
+
+	amdgpu_gart_dummy_page_fini(adev);
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index c5a9a4fb10d2..6460cf723f0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct amdgpu_device *adev)
  *
  * Frees the dummy page used by the driver (all asics).
  */
-static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
+void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
 {
 	if (!adev->dummy_page_addr)
 		return;
@@ -365,15 +365,3 @@ int amdgpu_gart_init(struct amdgpu_device *adev)
 
 	return 0;
 }
-
-/**
- * amdgpu_gart_fini - tear down the driver info for managing the gart
- *
- * @adev: amdgpu_device pointer
- *
- * Tear down the gart driver info and free the dummy page (all asics).
- */
-void amdgpu_gart_fini(struct amdgpu_device *adev)
-{
-	amdgpu_gart_dummy_page_fini(adev);
-}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
index a25fe97b0196..030b9d4c736a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
@@ -57,7 +57,7 @@ void amdgpu_gart_table_vram_free(struct amdgpu_device *adev);
 int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
 void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev);
 int amdgpu_gart_init(struct amdgpu_device *adev);
-void amdgpu_gart_fini(struct amdgpu_device *adev);
+void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev);
 int amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
 		       int pages);
 int amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index faaa6aa2faaf..433469ace6f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -115,9 +115,11 @@ int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
  */
 void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
 {
+
+	if (!ih->ring)
+		return;
+
 	if (ih->use_bus_addr) {
-		if (!ih->ring)
-			return;
 
 		/* add 8 bytes for the rptr/wptr shadows and
 		 * add them to the end of the ring allocation.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 233b64dab94b..32ce0e679dc7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -361,6 +361,11 @@ void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
 		if (!amdgpu_device_has_dc_support(adev))
 			flush_work(&adev->hotplug_work);
 	}
+
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
index 183d44a6583c..df385ffc9768 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
@@ -310,7 +310,6 @@ static int cik_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
index d32743949003..b8c47e0cf37a 100644
--- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
@@ -302,7 +302,6 @@ static int cz_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 2bfd620576f2..5e8bfcdd422e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -954,7 +954,6 @@ static int gmc_v10_0_sw_init(void *handle)
 static void gmc_v10_0_gart_fini(struct amdgpu_device *adev)
 {
 	amdgpu_gart_table_vram_free(adev);
-	amdgpu_gart_fini(adev);
 }
 
 static int gmc_v10_0_sw_fini(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
index 405d6ad09022..0e81e03e9b49 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
@@ -898,7 +898,6 @@ static int gmc_v6_0_sw_fini(void *handle)
 	amdgpu_vm_manager_fini(adev);
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 	release_firmware(adev->gmc.fw);
 	adev->gmc.fw = NULL;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
index 210ada2289ec..0795ea736573 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
@@ -1085,7 +1085,6 @@ static int gmc_v7_0_sw_fini(void *handle)
 	kfree(adev->gmc.vm_fault_info);
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 	release_firmware(adev->gmc.fw);
 	adev->gmc.fw = NULL;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index c1bd190841f8..dbf2e5472069 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -1194,7 +1194,6 @@ static int gmc_v8_0_sw_fini(void *handle)
 	kfree(adev->gmc.vm_fault_info);
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 	release_firmware(adev->gmc.fw);
 	adev->gmc.fw = NULL;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index c82d82da2c73..5ed0adae05cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1601,7 +1601,6 @@ static int gmc_v9_0_sw_fini(void *handle)
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_unref(&adev->gmc.pdb0_bo);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
index da96c6013477..ddfe4eaeea05 100644
--- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
@@ -301,7 +301,6 @@ static int iceland_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
index 5eea4550b856..941d464a2b47 100644
--- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
@@ -570,10 +570,6 @@ static int navi10_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c b/drivers/gpu/drm/amd/amdgpu/si_ih.c
index 751307f3252c..9a24f17a5750 100644
--- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
@@ -176,7 +176,6 @@ static int si_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
index 973d80ec7f6c..b08905d1c00f 100644
--- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
@@ -313,7 +313,6 @@ static int tonga_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
index dead9c2fbd4c..32ec4b8e806a 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
@@ -514,10 +514,6 @@ static int vega10_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
index 58993ae1fe11..f51dfc38ac65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
@@ -566,10 +566,6 @@ static int vega20_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v7 06/16] drm/amdgpu: Remap all page faults to per process dummy page.
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 07/16] PCI: Add support for dev_groups to struct pci_driver Andrey Grodzovsky
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

On device removal, reroute all CPU mappings to the dummy page,
per drm_file instance or imported GEM object.

v4:
Update for modified ttm_bo_vm_dummy_page

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 8c7ec09eb1a4..0d54e70278ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -48,6 +48,7 @@
 #include <drm/ttm/ttm_placement.h>
 
 #include <drm/amdgpu_drm.h>
+#include <drm/drm_drv.h>
 
 #include "amdgpu.h"
 #include "amdgpu_object.h"
@@ -1905,18 +1906,28 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
 static vm_fault_t amdgpu_ttm_fault(struct vm_fault *vmf)
 {
 	struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
+	struct drm_device *ddev = bo->base.dev;
 	vm_fault_t ret;
+	int idx;
 
 	ret = ttm_bo_vm_reserve(bo, vmf);
 	if (ret)
 		return ret;
 
-	ret = amdgpu_bo_fault_reserve_notify(bo);
-	if (ret)
-		goto unlock;
+	if (drm_dev_enter(ddev, &idx)) {
+		ret = amdgpu_bo_fault_reserve_notify(bo);
+		if (ret) {
+			drm_dev_exit(idx);
+			goto unlock;
+		}
 
-	ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
-				       TTM_BO_VM_NUM_PREFAULT, 1);
+		 ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
+						TTM_BO_VM_NUM_PREFAULT, 1);
+
+		 drm_dev_exit(idx);
+	} else {
+		ret = ttm_bo_vm_dummy_page(vmf, vmf->vma->vm_page_prot);
+	}
 	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
 		return ret;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v7 07/16] PCI: Add support for dev_groups to struct pci_driver
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 06/16] drm/amdgpu: Remap all page faults to per process dummy page Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 08/16] drm/amdgpu: Convert driver sysfs attributes to static attributes Andrey Grodzovsky
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Bjorn Helgaas

This helps converting PCI drivers sysfs attributes to static.

Analogous to commit b71b283e3d6d ("USB: add support for dev_groups to
struct usb_driver").
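
For a driver, wiring this up looks as follows (a minimal sketch with
hypothetical "foo" names; patch 08 of this series does the same thing
for amdgpu):

static ssize_t foo_stat_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "ok\n");
}
static DEVICE_ATTR_RO(foo_stat);

static struct attribute *foo_attrs[] = {
	&dev_attr_foo_stat.attr,
	NULL
};

static const struct attribute_group foo_attr_group = {
	.attrs = foo_attrs,
};

static const struct attribute_group *foo_dev_groups[] = {
	&foo_attr_group,
	NULL
};

static struct pci_driver foo_pci_driver = {
	.name		= "foo",
	.dev_groups	= foo_dev_groups,
	/* .id_table, .probe, .remove, ... as usual */
};

The driver core then creates the files once the device is bound to the
driver and removes them on unbind, so the explicit device_create_file()/
device_remove_file() calls (and their error handling) go away.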

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci-driver.c | 1 +
 include/linux/pci.h      | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index ec44a79e951a..3a72352aa5cf 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1385,6 +1385,7 @@ int __pci_register_driver(struct pci_driver *drv, struct module *owner,
 	drv->driver.owner = owner;
 	drv->driver.mod_name = mod_name;
 	drv->driver.groups = drv->groups;
+	drv->driver.dev_groups = drv->dev_groups;
 
 	spin_lock_init(&drv->dynids.lock);
 	INIT_LIST_HEAD(&drv->dynids.list);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..b57755b03009 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -858,6 +858,8 @@ struct module;
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
  * @groups:	Sysfs attribute groups.
+ * @dev_groups: Attributes attached to the device that will be
+ *              created once it is bound to the driver.
  * @driver:	Driver model structure.
  * @dynids:	List of dynamically added device IDs.
  */
@@ -873,6 +875,7 @@ struct pci_driver {
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
+	const struct attribute_group **dev_groups;
 	struct device_driver	driver;
 	struct pci_dynids	dynids;
 };
-- 
2.25.1



* [PATCH v7 08/16] drm/amdgpu: Convert driver sysfs attributes to static attributes
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (6 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 07/16] PCI: Add support for dev_groups to struct pci_driver Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal Andrey Grodzovsky
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

This allows removing the explicit creation and destruction
of those attrs, and thereby avoids warnings on device
finalization after physical device extraction.

v5: Use newly added pci_driver.dev_groups directly

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c | 17 ++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c      | 13 ++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 25 ++++++++------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 14 ++++-------
 4 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
index 494b2e1717d5..879ed3e50a6e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
@@ -1768,6 +1768,15 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev,
 static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
 		   NULL);
 
+static struct attribute *amdgpu_vbios_version_attrs[] = {
+	&dev_attr_vbios_version.attr,
+	NULL
+};
+
+const struct attribute_group amdgpu_vbios_version_attr_group = {
+	.attrs = amdgpu_vbios_version_attrs
+};
+
 /**
  * amdgpu_atombios_fini - free the driver info and callbacks for atombios
  *
@@ -1787,7 +1796,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
 	adev->mode_info.atom_context = NULL;
 	kfree(adev->mode_info.atom_card_info);
 	adev->mode_info.atom_card_info = NULL;
-	device_remove_file(adev->dev, &dev_attr_vbios_version);
 }
 
 /**
@@ -1804,7 +1812,6 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
 {
 	struct card_info *atom_card_info =
 	    kzalloc(sizeof(struct card_info), GFP_KERNEL);
-	int ret;
 
 	if (!atom_card_info)
 		return -ENOMEM;
@@ -1833,12 +1840,6 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
 		amdgpu_atombios_allocate_fb_scratch(adev);
 	}
 
-	ret = device_create_file(adev->dev, &dev_attr_vbios_version);
-	if (ret) {
-		DRM_ERROR("Failed to create device file for VBIOS version\n");
-		return ret;
-	}
-
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 5ebed4c7d9c0..83006f45b10b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1766,6 +1766,18 @@ static struct pci_error_handlers amdgpu_pci_err_handler = {
 	.resume		= amdgpu_pci_resume,
 };
 
+extern const struct attribute_group amdgpu_vram_mgr_attr_group;
+extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
+extern const struct attribute_group amdgpu_vbios_version_attr_group;
+
+static const struct attribute_group *amdgpu_sysfs_groups[] = {
+	&amdgpu_vram_mgr_attr_group,
+	&amdgpu_gtt_mgr_attr_group,
+	&amdgpu_vbios_version_attr_group,
+	NULL,
+};
+
+
 static struct pci_driver amdgpu_kms_pci_driver = {
 	.name = DRIVER_NAME,
 	.id_table = pciidlist,
@@ -1774,6 +1786,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
 	.shutdown = amdgpu_pci_shutdown,
 	.driver.pm = &amdgpu_pm_ops,
 	.err_handler = &amdgpu_pci_err_handler,
+	.dev_groups = amdgpu_sysfs_groups,
 };
 
 static int __init amdgpu_init(void)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index 72962de4c04c..a4404da8ca6d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -75,6 +75,16 @@ static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
 static DEVICE_ATTR(mem_info_gtt_used, S_IRUGO,
 	           amdgpu_mem_info_gtt_used_show, NULL);
 
+static struct attribute *amdgpu_gtt_mgr_attributes[] = {
+	&dev_attr_mem_info_gtt_total.attr,
+	&dev_attr_mem_info_gtt_used.attr,
+	NULL
+};
+
+const struct attribute_group amdgpu_gtt_mgr_attr_group = {
+	.attrs = amdgpu_gtt_mgr_attributes
+};
+
 static const struct ttm_resource_manager_func amdgpu_gtt_mgr_func;
 /**
  * amdgpu_gtt_mgr_init - init GTT manager and DRM MM
@@ -89,7 +99,6 @@ int amdgpu_gtt_mgr_init(struct amdgpu_device *adev, uint64_t gtt_size)
 	struct amdgpu_gtt_mgr *mgr = &adev->mman.gtt_mgr;
 	struct ttm_resource_manager *man = &mgr->manager;
 	uint64_t start, size;
-	int ret;
 
 	man->use_tt = true;
 	man->func = &amdgpu_gtt_mgr_func;
@@ -102,17 +111,6 @@ int amdgpu_gtt_mgr_init(struct amdgpu_device *adev, uint64_t gtt_size)
 	spin_lock_init(&mgr->lock);
 	atomic64_set(&mgr->available, gtt_size >> PAGE_SHIFT);
 
-	ret = device_create_file(adev->dev, &dev_attr_mem_info_gtt_total);
-	if (ret) {
-		DRM_ERROR("Failed to create device file mem_info_gtt_total\n");
-		return ret;
-	}
-	ret = device_create_file(adev->dev, &dev_attr_mem_info_gtt_used);
-	if (ret) {
-		DRM_ERROR("Failed to create device file mem_info_gtt_used\n");
-		return ret;
-	}
-
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_TT, &mgr->manager);
 	ttm_resource_manager_set_used(man, true);
 	return 0;
@@ -142,9 +140,6 @@ void amdgpu_gtt_mgr_fini(struct amdgpu_device *adev)
 	drm_mm_takedown(&mgr->mm);
 	spin_unlock(&mgr->lock);
 
-	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_total);
-	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_used);
-
 	ttm_resource_manager_cleanup(man);
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_TT, NULL);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 2344aba9dca3..8543d6486018 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -152,7 +152,7 @@ static DEVICE_ATTR(mem_info_vis_vram_used, S_IRUGO,
 static DEVICE_ATTR(mem_info_vram_vendor, S_IRUGO,
 		   amdgpu_mem_info_vram_vendor, NULL);
 
-static const struct attribute *amdgpu_vram_mgr_attributes[] = {
+static struct attribute *amdgpu_vram_mgr_attributes[] = {
 	&dev_attr_mem_info_vram_total.attr,
 	&dev_attr_mem_info_vis_vram_total.attr,
 	&dev_attr_mem_info_vram_used.attr,
@@ -161,6 +161,10 @@ static const struct attribute *amdgpu_vram_mgr_attributes[] = {
 	NULL
 };
 
+const struct attribute_group amdgpu_vram_mgr_attr_group = {
+	.attrs = amdgpu_vram_mgr_attributes
+};
+
 static const struct ttm_resource_manager_func amdgpu_vram_mgr_func;
 
 /**
@@ -174,7 +178,6 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_vram_mgr *mgr = &adev->mman.vram_mgr;
 	struct ttm_resource_manager *man = &mgr->manager;
-	int ret;
 
 	ttm_resource_manager_init(man, adev->gmc.real_vram_size >> PAGE_SHIFT);
 
@@ -185,11 +188,6 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 	INIT_LIST_HEAD(&mgr->reservations_pending);
 	INIT_LIST_HEAD(&mgr->reserved_pages);
 
-	/* Add the two VRAM-related sysfs files */
-	ret = sysfs_create_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
-	if (ret)
-		DRM_ERROR("Failed to register sysfs\n");
-
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
 	ttm_resource_manager_set_used(man, true);
 	return 0;
@@ -227,8 +225,6 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev)
 	drm_mm_takedown(&mgr->mm);
 	spin_unlock(&mgr->lock);
 
-	sysfs_remove_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
-
 	ttm_resource_manager_cleanup(man);
 	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, NULL);
 }
-- 
2.25.1



* [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (7 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 08/16] drm/amdgpu: Convert driver sysfs attributes to static attributes Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 20:17   ` Alex Deucher
  2021-05-12 14:26 ` [PATCH v7 10/16] drm/sched: Make timeout timer rearm conditional Andrey Grodzovsky
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.

v5:
Protect more places wher memcopy_to/form_io takes place
Protect IB submissions

v6: Switch to !drm_dev_enter instead of scoping entire code
with brackets.

v7:
Drop the guard around HW ring command emission since the rings are
in GART and not in MMIO.
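
All of the call sites below follow one idiom (a minimal sketch):

	int idx;

	if (!drm_dev_enter(&adev->ddev, &idx))
		return -ENODEV; /* or skip the access and carry on */

	/* ... MMIO register write / memcpy_toio() / memcpy_fromio() ... */

	drm_dev_exit(idx);

drm_dev_enter() takes an SRCU read-side lock and returns false once
drm_dev_unplug() has been called, so the guarded section either runs
to completion before the unplug synchronizes or is skipped entirely.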

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 63 ++++++++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 31 +++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 11 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c    | 22 +++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     |  7 ++-
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c     | 44 +++++++--------
 drivers/gpu/drm/amd/amdgpu/psp_v12_0.c     |  8 +--
 drivers/gpu/drm/amd/amdgpu/psp_v3_1.c      |  8 +--
 drivers/gpu/drm/amd/amdgpu/vce_v4_0.c      | 26 +++++----
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c      | 22 +++++---
 13 files changed, 168 insertions(+), 95 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a0bff4713672..f7cca25c0fa0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -71,6 +71,8 @@
 #include <drm/task_barrier.h>
 #include <linux/pm_runtime.h>
 
+#include <drm/drm_drv.h>
+
 MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -281,7 +283,10 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
 	unsigned long flags;
 	uint32_t hi = ~0;
 	uint64_t last;
+	int idx;
 
+	if (!drm_dev_enter(&adev->ddev, &idx))
+		return;
 
 #ifdef CONFIG_64BIT
 	last = min(pos + size, adev->gmc.visible_vram_size);
@@ -300,7 +305,7 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
 		}
 
 		if (count == size)
-			return;
+			goto exit;
 
 		pos += count;
 		buf += count / 4;
@@ -323,6 +328,9 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
 			*buf++ = RREG32_NO_KIQ(mmMM_DATA);
 	}
 	spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);
+
+exit:
+	drm_dev_exit(idx);
 }
 
 /*
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 4d32233cde92..04ba5eef1e88 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -31,6 +31,8 @@
 #include "amdgpu_ras.h"
 #include "amdgpu_xgmi.h"
 
+#include <drm/drm_drv.h>
+
 /**
  * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
  *
@@ -151,6 +153,10 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
 {
 	void __iomem *ptr = (void *)cpu_pt_addr;
 	uint64_t value;
+	int idx;
+
+	if (!drm_dev_enter(&adev->ddev, &idx))
+		return 0;
 
 	/*
 	 * The following is for PTE only. GART does not have PDEs.
@@ -158,6 +164,9 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
 	value = addr & 0x0000FFFFFFFFF000ULL;
 	value |= flags;
 	writeq(value, ptr + (gpu_page_idx * 8));
+
+	drm_dev_exit(idx);
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 9e769cf6095b..bb6afee61666 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -25,6 +25,7 @@
 
 #include <linux/firmware.h>
 #include <linux/dma-mapping.h>
+#include <drm/drm_drv.h>
 
 #include "amdgpu.h"
 #include "amdgpu_psp.h"
@@ -39,6 +40,8 @@
 #include "amdgpu_ras.h"
 #include "amdgpu_securedisplay.h"
 
+#include <drm/drm_drv.h>
+
 static int psp_sysfs_init(struct amdgpu_device *adev);
 static void psp_sysfs_fini(struct amdgpu_device *adev);
 
@@ -253,7 +256,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
 		   struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr)
 {
 	int ret;
-	int index;
+	int index, idx;
 	int timeout = 20000;
 	bool ras_intr = false;
 	bool skip_unsupport = false;
@@ -261,6 +264,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	if (psp->adev->in_pci_err_recovery)
 		return 0;
 
+	if (!drm_dev_enter(&psp->adev->ddev, &idx))
+		return 0;
+
 	mutex_lock(&psp->mutex);
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
@@ -271,8 +277,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	ret = psp_ring_cmd_submit(psp, psp->cmd_buf_mc_addr, fence_mc_addr, index);
 	if (ret) {
 		atomic_dec(&psp->fence_value);
-		mutex_unlock(&psp->mutex);
-		return ret;
+		goto exit;
 	}
 
 	amdgpu_asic_invalidate_hdp(psp->adev, NULL);
@@ -312,8 +317,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
 			 psp->cmd_buf_mem->cmd_id,
 			 psp->cmd_buf_mem->resp.status);
 		if (!timeout) {
-			mutex_unlock(&psp->mutex);
-			return -EINVAL;
+			ret = -EINVAL;
+			goto exit;
 		}
 	}
 
@@ -321,8 +326,10 @@ psp_cmd_submit_buf(struct psp_context *psp,
 		ucode->tmr_mc_addr_lo = psp->cmd_buf_mem->resp.fw_addr_lo;
 		ucode->tmr_mc_addr_hi = psp->cmd_buf_mem->resp.fw_addr_hi;
 	}
-	mutex_unlock(&psp->mutex);
 
+exit:
+	mutex_unlock(&psp->mutex);
+	drm_dev_exit(idx);
 	return ret;
 }
 
@@ -359,8 +366,7 @@ static int psp_load_toc(struct psp_context *psp,
 	if (!cmd)
 		return -ENOMEM;
 	/* Copy toc to psp firmware private buffer */
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-	memcpy(psp->fw_pri_buf, psp->toc_start_addr, psp->toc_bin_size);
+	psp_copy_fw(psp, psp->toc_start_addr, psp->toc_bin_size);
 
 	psp_prep_load_toc_cmd_buf(cmd, psp->fw_pri_mc_addr, psp->toc_bin_size);
 
@@ -625,8 +631,7 @@ static int psp_asd_load(struct psp_context *psp)
 	if (!cmd)
 		return -ENOMEM;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-	memcpy(psp->fw_pri_buf, psp->asd_start_addr, psp->asd_ucode_size);
+	psp_copy_fw(psp, psp->asd_start_addr, psp->asd_ucode_size);
 
 	psp_prep_asd_load_cmd_buf(cmd, psp->fw_pri_mc_addr,
 				  psp->asd_ucode_size);
@@ -781,8 +786,7 @@ static int psp_xgmi_load(struct psp_context *psp)
 	if (!cmd)
 		return -ENOMEM;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-	memcpy(psp->fw_pri_buf, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
+	psp_copy_fw(psp, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
 
 	psp_prep_ta_load_cmd_buf(cmd,
 				 psp->fw_pri_mc_addr,
@@ -1038,8 +1042,7 @@ static int psp_ras_load(struct psp_context *psp)
 	if (!cmd)
 		return -ENOMEM;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-	memcpy(psp->fw_pri_buf, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
+	psp_copy_fw(psp, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
 
 	psp_prep_ta_load_cmd_buf(cmd,
 				 psp->fw_pri_mc_addr,
@@ -1275,8 +1278,7 @@ static int psp_hdcp_load(struct psp_context *psp)
 	if (!cmd)
 		return -ENOMEM;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-	memcpy(psp->fw_pri_buf, psp->ta_hdcp_start_addr,
+	psp_copy_fw(psp, psp->ta_hdcp_start_addr,
 	       psp->ta_hdcp_ucode_size);
 
 	psp_prep_ta_load_cmd_buf(cmd,
@@ -1427,8 +1429,7 @@ static int psp_dtm_load(struct psp_context *psp)
 	if (!cmd)
 		return -ENOMEM;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-	memcpy(psp->fw_pri_buf, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
+	psp_copy_fw(psp, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
 
 	psp_prep_ta_load_cmd_buf(cmd,
 				 psp->fw_pri_mc_addr,
@@ -1573,8 +1574,7 @@ static int psp_rap_load(struct psp_context *psp)
 	if (!cmd)
 		return -ENOMEM;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-	memcpy(psp->fw_pri_buf, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
+	psp_copy_fw(psp, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
 
 	psp_prep_ta_load_cmd_buf(cmd,
 				 psp->fw_pri_mc_addr,
@@ -3022,7 +3022,7 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
 	struct amdgpu_device *adev = drm_to_adev(ddev);
 	void *cpu_addr;
 	dma_addr_t dma_addr;
-	int ret;
+	int ret, idx;
 	char fw_name[100];
 	const struct firmware *usbc_pd_fw;
 
@@ -3031,6 +3031,9 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
 		return -EBUSY;
 	}
 
+	if (!drm_dev_enter(ddev, &idx))
+		return -ENODEV;
+
 	snprintf(fw_name, sizeof(fw_name), "amdgpu/%s", buf);
 	ret = request_firmware(&usbc_pd_fw, fw_name, adev->dev);
 	if (ret)
@@ -3062,16 +3065,30 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
 rel_buf:
 	dma_free_coherent(adev->dev, usbc_pd_fw->size, cpu_addr, dma_addr);
 	release_firmware(usbc_pd_fw);
-
 fail:
 	if (ret) {
 		DRM_ERROR("Failed to load USBC PD FW, err = %d", ret);
-		return ret;
+		count = ret;
 	}
 
+	drm_dev_exit(idx);
 	return count;
 }
 
+void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size)
+{
+	int idx;
+
+	if (!drm_dev_enter(&psp->adev->ddev, &idx))
+		return;
+
+	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
+	memcpy(psp->fw_pri_buf, start_addr, bin_size);
+
+	drm_dev_exit(idx);
+}
+
+
 static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
 		   psp_usbc_pd_fw_sysfs_read,
 		   psp_usbc_pd_fw_sysfs_write);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 46a5328e00e0..2bfdc278817f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -423,4 +423,6 @@ int psp_get_fw_attestation_records_addr(struct psp_context *psp,
 
 int psp_load_fw_list(struct psp_context *psp,
 		     struct amdgpu_firmware_info **ucode_list, int ucode_count);
+void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size);
+
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index c6dbc0801604..82f0542c7792 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -32,6 +32,7 @@
 #include <linux/module.h>
 
 #include <drm/drm.h>
+#include <drm/drm_drv.h>
 
 #include "amdgpu.h"
 #include "amdgpu_pm.h"
@@ -375,7 +376,7 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
 {
 	unsigned size;
 	void *ptr;
-	int i, j;
+	int i, j, idx;
 	bool in_ras_intr = amdgpu_ras_intr_triggered();
 
 	cancel_delayed_work_sync(&adev->uvd.idle_work);
@@ -403,11 +404,15 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
 		if (!adev->uvd.inst[j].saved_bo)
 			return -ENOMEM;
 
-		/* re-write 0 since err_event_athub will corrupt VCPU buffer */
-		if (in_ras_intr)
-			memset(adev->uvd.inst[j].saved_bo, 0, size);
-		else
-			memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
+		if (drm_dev_enter(&adev->ddev, &idx)) {
+			/* re-write 0 since err_event_athub will corrupt VCPU buffer */
+			if (in_ras_intr)
+				memset(adev->uvd.inst[j].saved_bo, 0, size);
+			else
+				memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
+
+			drm_dev_exit(idx);
+		}
 	}
 
 	if (in_ras_intr)
@@ -420,7 +425,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
 {
 	unsigned size;
 	void *ptr;
-	int i;
+	int i, idx;
 
 	for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
 		if (adev->uvd.harvest_config & (1 << i))
@@ -432,7 +437,10 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
 		ptr = adev->uvd.inst[i].cpu_addr;
 
 		if (adev->uvd.inst[i].saved_bo != NULL) {
-			memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
+			if (drm_dev_enter(&adev->ddev, &idx)) {
+				memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
+				drm_dev_exit(idx);
+			}
 			kvfree(adev->uvd.inst[i].saved_bo);
 			adev->uvd.inst[i].saved_bo = NULL;
 		} else {
@@ -442,8 +450,11 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
 			hdr = (const struct common_firmware_header *)adev->uvd.fw->data;
 			if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
 				offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
-				memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
-					    le32_to_cpu(hdr->ucode_size_bytes));
+				if (drm_dev_enter(&adev->ddev, &idx)) {
+					memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
+						    le32_to_cpu(hdr->ucode_size_bytes));
+					drm_dev_exit(idx);
+				}
 				size -= le32_to_cpu(hdr->ucode_size_bytes);
 				ptr += le32_to_cpu(hdr->ucode_size_bytes);
 			}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index ea6a62f67e38..833203401ef4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -29,6 +29,7 @@
 #include <linux/module.h>
 
 #include <drm/drm.h>
+#include <drm/drm_drv.h>
 
 #include "amdgpu.h"
 #include "amdgpu_pm.h"
@@ -293,7 +294,7 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
 	void *cpu_addr;
 	const struct common_firmware_header *hdr;
 	unsigned offset;
-	int r;
+	int r, idx;
 
 	if (adev->vce.vcpu_bo == NULL)
 		return -EINVAL;
@@ -313,8 +314,12 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
 
 	hdr = (const struct common_firmware_header *)adev->vce.fw->data;
 	offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
-	memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
-		    adev->vce.fw->size - offset);
+
+	if (drm_dev_enter(&adev->ddev, &idx)) {
+		memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
+			    adev->vce.fw->size - offset);
+		drm_dev_exit(idx);
+	}
 
 	amdgpu_bo_kunmap(adev->vce.vcpu_bo);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 201645963ba5..21f7d3644d70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -27,6 +27,7 @@
 #include <linux/firmware.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <drm/drm_drv.h>
 
 #include "amdgpu.h"
 #include "amdgpu_pm.h"
@@ -275,7 +276,7 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
 {
 	unsigned size;
 	void *ptr;
-	int i;
+	int i, idx;
 
 	cancel_delayed_work_sync(&adev->vcn.idle_work);
 
@@ -292,7 +293,10 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
 		if (!adev->vcn.inst[i].saved_bo)
 			return -ENOMEM;
 
-		memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
+		if (drm_dev_enter(&adev->ddev, &idx)) {
+			memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
+			drm_dev_exit(idx);
+		}
 	}
 	return 0;
 }
@@ -301,7 +305,7 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
 {
 	unsigned size;
 	void *ptr;
-	int i;
+	int i, idx;
 
 	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
 		if (adev->vcn.harvest_config & (1 << i))
@@ -313,7 +317,10 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
 		ptr = adev->vcn.inst[i].cpu_addr;
 
 		if (adev->vcn.inst[i].saved_bo != NULL) {
-			memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
+			if (drm_dev_enter(&adev->ddev, &idx)) {
+				memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
+				drm_dev_exit(idx);
+			}
 			kvfree(adev->vcn.inst[i].saved_bo);
 			adev->vcn.inst[i].saved_bo = NULL;
 		} else {
@@ -323,8 +330,11 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
 			hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
 			if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
 				offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
-				memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
-					    le32_to_cpu(hdr->ucode_size_bytes));
+				if (drm_dev_enter(&adev->ddev, &idx)) {
+					memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
+						    le32_to_cpu(hdr->ucode_size_bytes));
+					drm_dev_exit(idx);
+				}
 				size -= le32_to_cpu(hdr->ucode_size_bytes);
 				ptr += le32_to_cpu(hdr->ucode_size_bytes);
 			}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9f868cf3b832..7dd5f10ab570 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -32,6 +32,7 @@
 #include <linux/dma-buf.h>
 
 #include <drm/amdgpu_drm.h>
+#include <drm/drm_drv.h>
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
 #include "amdgpu_amdkfd.h"
@@ -1606,7 +1607,10 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
 	struct amdgpu_vm_update_params params;
 	enum amdgpu_sync_mode sync_mode;
 	uint64_t pfn;
-	int r;
+	int r, idx;
+
+	if (!drm_dev_enter(&adev->ddev, &idx))
+		return -ENODEV;
 
 	memset(&params, 0, sizeof(params));
 	params.adev = adev;
@@ -1715,6 +1719,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
 
 error_unlock:
 	amdgpu_vm_eviction_unlock(vm);
+	drm_dev_exit(idx);
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index 589410c32d09..2cec71e823f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -23,6 +23,7 @@
 #include <linux/firmware.h>
 #include <linux/module.h>
 #include <linux/vmalloc.h>
+#include <drm/drm_drv.h>
 
 #include "amdgpu.h"
 #include "amdgpu_psp.h"
@@ -269,10 +270,8 @@ static int psp_v11_0_bootloader_load_kdb(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy PSP KDB binary to memory */
-	memcpy(psp->fw_pri_buf, psp->kdb_start_addr, psp->kdb_bin_size);
+	psp_copy_fw(psp, psp->kdb_start_addr, psp->kdb_bin_size);
 
 	/* Provide the PSP KDB to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
@@ -302,10 +301,8 @@ static int psp_v11_0_bootloader_load_spl(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy PSP SPL binary to memory */
-	memcpy(psp->fw_pri_buf, psp->spl_start_addr, psp->spl_bin_size);
+	psp_copy_fw(psp, psp->spl_start_addr, psp->spl_bin_size);
 
 	/* Provide the PSP SPL to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
@@ -335,10 +332,8 @@ static int psp_v11_0_bootloader_load_sysdrv(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy PSP System Driver binary to memory */
-	memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
+	psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
 
 	/* Provide the sys driver to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
@@ -371,10 +366,8 @@ static int psp_v11_0_bootloader_load_sos(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy Secure OS binary to PSP memory */
-	memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
+	psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
 
 	/* Provide the PSP secure OS to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
@@ -608,7 +601,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
 	uint32_t p2c_header[4];
 	uint32_t sz;
 	void *buf;
-	int ret;
+	int ret, idx;
 
 	if (ctx->init == PSP_MEM_TRAIN_NOT_SUPPORT) {
 		DRM_DEBUG("Memory training is not supported.\n");
@@ -681,17 +674,24 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
 			return -ENOMEM;
 		}
 
-		memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
-		ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
-		if (ret) {
-			DRM_ERROR("Send long training msg failed.\n");
+		if (drm_dev_enter(&adev->ddev, &idx)) {
+			memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
+			ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
+			if (ret) {
+				DRM_ERROR("Send long training msg failed.\n");
+				vfree(buf);
+				drm_dev_exit(idx);
+				return ret;
+			}
+
+			memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
+			adev->hdp.funcs->flush_hdp(adev, NULL);
 			vfree(buf);
-			return ret;
+			drm_dev_exit(idx);
+		} else {
+			vfree(buf);
+			return -ENODEV;
 		}
-
-		memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
-		adev->hdp.funcs->flush_hdp(adev, NULL);
-		vfree(buf);
 	}
 
 	if (ops & PSP_MEM_TRAIN_SAVE) {
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
index c4828bd3264b..618e5b6b85d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
@@ -138,10 +138,8 @@ static int psp_v12_0_bootloader_load_sysdrv(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy PSP System Driver binary to memory */
-	memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
+	psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
 
 	/* Provide the sys driver to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
@@ -179,10 +177,8 @@ static int psp_v12_0_bootloader_load_sos(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy Secure OS binary to PSP memory */
-	memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
+	psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
 
 	/* Provide the PSP secure OS to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
index f2e725f72d2f..d0a6cccd0897 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
@@ -102,10 +102,8 @@ static int psp_v3_1_bootloader_load_sysdrv(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy PSP System Driver binary to memory */
-	memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
+	psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
 
 	/* Provide the sys driver to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
@@ -143,10 +141,8 @@ static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
 	if (ret)
 		return ret;
 
-	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
-
 	/* Copy Secure OS binary to PSP memory */
-	memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
+	psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
 
 	/* Provide the PSP secure OS to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
index 8e238dea7bef..90910d19db12 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
@@ -25,6 +25,7 @@
  */
 
 #include <linux/firmware.h>
+#include <drm/drm_drv.h>
 
 #include "amdgpu.h"
 #include "amdgpu_vce.h"
@@ -555,16 +556,19 @@ static int vce_v4_0_hw_fini(void *handle)
 static int vce_v4_0_suspend(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-	int r;
+	int r, idx;
 
 	if (adev->vce.vcpu_bo == NULL)
 		return 0;
 
-	if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
-		unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
-		void *ptr = adev->vce.cpu_addr;
+	if (drm_dev_enter(&adev->ddev, &idx)) {
+		if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
+			unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
+			void *ptr = adev->vce.cpu_addr;
 
-		memcpy_fromio(adev->vce.saved_bo, ptr, size);
+			memcpy_fromio(adev->vce.saved_bo, ptr, size);
+		}
+		drm_dev_exit(idx);
 	}
 
 	r = vce_v4_0_hw_fini(adev);
@@ -577,16 +581,20 @@ static int vce_v4_0_suspend(void *handle)
 static int vce_v4_0_resume(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-	int r;
+	int r, idx;
 
 	if (adev->vce.vcpu_bo == NULL)
 		return -EINVAL;
 
 	if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
-		unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
-		void *ptr = adev->vce.cpu_addr;
 
-		memcpy_toio(ptr, adev->vce.saved_bo, size);
+		if (drm_dev_enter(&adev->ddev, &idx)) {
+			unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
+			void *ptr = adev->vce.cpu_addr;
+
+			memcpy_toio(ptr, adev->vce.saved_bo, size);
+			drm_dev_exit(idx);
+		}
 	} else {
 		r = amdgpu_vce_resume(adev);
 		if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index 3f15bf34123a..df34be8ec82d 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -34,6 +34,8 @@
 #include "vcn/vcn_3_0_0_sh_mask.h"
 #include "ivsrcid/vcn/irqsrcs_vcn_2_0.h"
 
+#include <drm/drm_drv.h>
+
 #define mmUVD_CONTEXT_ID_INTERNAL_OFFSET			0x27
 #define mmUVD_GPCOM_VCPU_CMD_INTERNAL_OFFSET			0x0f
 #define mmUVD_GPCOM_VCPU_DATA0_INTERNAL_OFFSET			0x10
@@ -268,16 +270,20 @@ static int vcn_v3_0_sw_init(void *handle)
 static int vcn_v3_0_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-	int i, r;
+	int i, r, idx;
 
-	for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
-		volatile struct amdgpu_fw_shared *fw_shared;
+	if (drm_dev_enter(&adev->ddev, &idx)) {
+		for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
+			volatile struct amdgpu_fw_shared *fw_shared;
 
-		if (adev->vcn.harvest_config & (1 << i))
-			continue;
-		fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
-		fw_shared->present_flag_0 = 0;
-		fw_shared->sw_ring.is_enabled = false;
+			if (adev->vcn.harvest_config & (1 << i))
+				continue;
+			fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
+			fw_shared->present_flag_0 = 0;
+			fw_shared->sw_ring.is_enabled = false;
+		}
+
+		drm_dev_exit(idx);
 	}
 
 	if (amdgpu_sriov_vf(adev))
-- 
2.25.1



* [PATCH v7 10/16] drm/sched: Make timeout timer rearm conditional.
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (8 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 11/16] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

We don't want to rearm the timer if the driver hook reports
that the device is gone.

v5: Update drm_gpu_sched_stat values in code.
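
For context, the return values come from enum drm_gpu_sched_stat in
include/drm/gpu_scheduler.h, which at the time of this series looks
roughly like this:

enum drm_gpu_sched_stat {
	DRM_GPU_SCHED_STAT_NONE,    /* Reserved. Do not use. */
	DRM_GPU_SCHED_STAT_NOMINAL, /* Operating normally, rearm timer. */
	DRM_GPU_SCHED_STAT_ENODEV,  /* Device gone, do not rearm. */
};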

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f4f474944169..8d1211e87101 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
@@ -331,7 +332,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 		list_del_init(&job->list);
 		spin_unlock(&sched->job_list_lock);
 
-		job->sched->ops->timedout_job(job);
+		status = job->sched->ops->timedout_job(job);
 
 		/*
 		 * Guilty job did complete and hence needs to be manually removed
@@ -345,9 +346,11 @@ static void drm_sched_job_timedout(struct work_struct *work)
 		spin_unlock(&sched->job_list_lock);
 	}
 
-	spin_lock(&sched->job_list_lock);
-	drm_sched_start_timeout(sched);
-	spin_unlock(&sched->job_list_lock);
+	if (status != DRM_GPU_SCHED_STAT_ENODEV) {
+		spin_lock(&sched->job_list_lock);
+		drm_sched_start_timeout(sched);
+		spin_unlock(&sched->job_list_lock);
+	}
 }
 
  /**
-- 
2.25.1



* [PATCH v7 11/16] drm/amdgpu: Prevent any job recoveries after device is unplugged.
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (9 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 10/16] drm/sched: Make timeout timer rearm conditional Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal Andrey Grodzovsky
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

Return DRM_GPU_SCHED_STAT_ENODEV back to the scheduler when the device
is not present so the timeout timer will not be rearmed.

v5: Update to match updated return values in enum drm_gpu_sched_stat

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 759b34799221..d33e6d97cc89 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -25,6 +25,8 @@
 #include <linux/wait.h>
 #include <linux/sched.h>
 
+#include <drm/drm_drv.h>
+
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
 
@@ -34,6 +36,15 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
+	int idx;
+
+	if (!drm_dev_enter(&adev->ddev, &idx)) {
+		DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
+			 __func__, s_job->sched->name);
+
+		/* Effectively the job is aborted as the device is gone */
+		return DRM_GPU_SCHED_STAT_ENODEV;
+	}
 
 	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 
@@ -41,7 +52,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
 		DRM_ERROR("ring %s timeout, but soft recovered\n",
 			  s_job->sched->name);
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		goto exit;
 	}
 
 	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
@@ -53,13 +64,15 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
 		amdgpu_device_gpu_recover(ring->adev, job);
-		return DRM_GPU_SCHED_STAT_NOMINAL;
 	} else {
 		drm_sched_suspend_timeout(&ring->sched);
 		if (amdgpu_sriov_vf(adev))
 			adev->virt.tdr_debug = true;
-		return DRM_GPU_SCHED_STAT_NOMINAL;
 	}
+
+exit:
+	drm_dev_exit(idx);
+	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
-- 
2.25.1



* [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (10 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 11/16] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-14 14:42   ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released Andrey Grodzovsky
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

If the device is removed while commands are in flight, you cannot
wait to flush the HW fences on a ring since the device is gone.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 1ffb36bd0b19..fa03702ecbfb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -36,6 +36,7 @@
 #include <linux/firmware.h>
 #include <linux/pm_runtime.h>
 
+#include <drm/drm_drv.h>
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
 
@@ -525,8 +526,7 @@ int amdgpu_fence_driver_init(struct amdgpu_device *adev)
  */
 void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
 {
-	unsigned i, j;
-	int r;
+	int i, r;
 
 	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -535,11 +535,15 @@ void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
 			continue;
 		if (!ring->no_scheduler)
 			drm_sched_fini(&ring->sched);
-		r = amdgpu_fence_wait_empty(ring);
-		if (r) {
-			/* no need to trigger GPU reset as we are unloading */
+		/* You can't wait for HW to signal if it's gone */
+		if (!drm_dev_is_unplugged(&adev->ddev))
+			r = amdgpu_fence_wait_empty(ring);
+		else
+			r = -ENODEV;
+		/* no need to trigger GPU reset as we are unloading */
+		if (r)
 			amdgpu_fence_driver_force_completion(ring);
-		}
+
 		if (ring->fence_drv.irq_src)
 			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
 				       ring->fence_drv.irq_type);
-- 
2.25.1



* [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (11 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-18 14:07   ` Christian König
  2021-05-12 14:26 ` [PATCH v7 14/16] drm/amd/display: Remove superfluous drm_mode_config_cleanup Andrey Grodzovsky
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

Problem: If the scheduler is already stopped by the time the sched_entity
is released and the entity's job_queue is not empty, I encountered
a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
never becomes true.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wake up all those processes stuck in sched_entity flushing,
since the scheduler main thread which would normally wake them up
is stopped by now.

v2:
Reverse order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinsertion back into the rq
due to a race.

v3:
Drop drm_sched_rq_remove_entity, only modify entity->stopped
and check for it in drm_sched_entity_is_idle

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 0249c7450188..2e93e881b65f 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct drm_sched_entity *entity)
 	rmb(); /* for list_empty to work without lock */
 
 	if (list_empty(&entity->list) ||
-	    spsc_queue_count(&entity->job_queue) == 0)
+	    spsc_queue_count(&entity->job_queue) == 0 ||
+	    entity->stopped)
 		return true;
 
 	return false;
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 8d1211e87101..a2a953693b45 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
+	struct drm_sched_entity *s_entity;
+	int i;
+
 	if (sched->thread)
 		kthread_stop(sched->thread);
 
+	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+		struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+		if (!rq)
+			continue;
+
+		spin_lock(&rq->lock);
+		list_for_each_entry(s_entity, &rq->entities, list)
+			/*
+			 * Prevents reinsertion and marks job_queue as idle,
+			 * it will be removed from rq in drm_sched_entity_fini
+			 * eventually
+			 */
+			s_entity->stopped = true;
+		spin_unlock(&rq->lock);
+
+	}
+
+	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
+	wake_up_all(&sched->job_scheduled);
+
 	/* Confirm no work left behind accessing device structures */
 	cancel_delayed_work_sync(&sched->work_tdr);
 
-- 
2.25.1



* [PATCH v7 14/16] drm/amd/display: Remove superfluous drm_mode_config_cleanup
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (12 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 15/16] drm/amdgpu: Verify DMA operations from device are done Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings Andrey Grodzovsky
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Rodrigo Siqueira

It's already being released by the DRM core through devm.
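
For context: drm_mode_config_init() is by now just a managed wrapper
(a sketch of the helper in include/drm/drm_mode_config.h, assuming the
v5.12-era definition):

static inline int drm_mode_config_init(struct drm_device *dev)
{
	return drmm_mode_config_init(dev);
}

drmm_mode_config_init() registers the mode config cleanup as a drmm
release action, so it runs automatically when the drm_device is torn
down, making the explicit call here superfluous.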

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 6c2c6a51ce6c..9728a0158bcb 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3757,7 +3757,6 @@ static int amdgpu_dm_initialize_drm_device(struct amdgpu_device *adev)
 
 static void amdgpu_dm_destroy_drm_device(struct amdgpu_display_manager *dm)
 {
-	drm_mode_config_cleanup(dm->ddev);
 	drm_atomic_private_obj_fini(&dm->atomic_obj);
 	return;
 }
-- 
2.25.1



* [PATCH v7 15/16] drm/amdgpu: Verify DMA operations from device are done
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (13 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 14/16] drm/amd/display: Remove superfluous drm_mode_config_cleanup Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-12 14:26 ` [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings Andrey Grodzovsky
  15 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky, Christian König

In case device removal is just simulated through sysfs, verify that
the device doesn't keep doing DMA to the released memory after
pci_remove is done.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 83006f45b10b..5e6af9e0b7bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1314,7 +1314,13 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 	drm_dev_unplug(dev);
 	amdgpu_driver_unload_kms(dev);
 
+	/*
+	 * Flush any in flight DMA operations from device.
+	 * Clear the Bus Master Enable bit and then wait on the PCIe Device
+	 * Status Transactions Pending bit.
+	 */
 	pci_disable_device(pdev);
+	pci_wait_for_pending_transaction(pdev);
 }
 
 static void
-- 
2.25.1



* [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings
  2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (14 preceding siblings ...)
  2021-05-12 14:26 ` [PATCH v7 15/16] drm/amdgpu: Verify DMA operations from device are done Andrey Grodzovsky
@ 2021-05-12 14:26 ` Andrey Grodzovsky
  2021-05-14 14:42   ` Andrey Grodzovsky
  2021-05-17 17:43   ` Alex Deucher
  15 siblings, 2 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 14:26 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

Access to these mappings must be prevented after pci_remove.

v6: Drop the BO list; unmapping the VRAM BAR is enough.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f7cca25c0fa0..73cbc3c7453f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3666,6 +3666,25 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	return r;
 }
 
+static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
+{
+	/* Clear all CPU mappings pointing to this device */
+	unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
+
+	/* Unmap all mapped bars - Doorbell, registers and VRAM */
+	amdgpu_device_doorbell_fini(adev);
+
+	iounmap(adev->rmmio);
+	adev->rmmio = NULL;
+	if (adev->mman.aper_base_kaddr)
+		iounmap(adev->mman.aper_base_kaddr);
+	adev->mman.aper_base_kaddr = NULL;
+
+	/* Memory manager related */
+	arch_phys_wc_del(adev->gmc.vram_mtrr);
+	arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
+}
+
 /**
  * amdgpu_device_fini - tear down the driver
  *
@@ -3712,6 +3731,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_device_ip_fini_early(adev);
 
 	amdgpu_gart_dummy_page_fini(adev);
+
+	amdgpu_device_unmap_mmio(adev);
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
@@ -3739,9 +3760,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
 	}
 	if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
 		vga_client_register(adev->pdev, NULL, NULL, NULL);
-	iounmap(adev->rmmio);
-	adev->rmmio = NULL;
-	amdgpu_device_doorbell_fini(adev);
 
 	if (IS_ENABLED(CONFIG_PERF_EVENTS))
 		amdgpu_pmu_fini(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 0adffcace326..882fb49f3c41 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -533,6 +533,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
 		return -ENOMEM;
 	drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
 	INIT_LIST_HEAD(&bo->shadow_list);
+
 	bo->vm_bo = NULL;
 	bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
 		bp->domain;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 0d54e70278ca..58ad2fecc9e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
 	amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
 	amdgpu_ttm_fw_reserve_vram_fini(adev);
 
-	if (adev->mman.aper_base_kaddr)
-		iounmap(adev->mman.aper_base_kaddr);
-	adev->mman.aper_base_kaddr = NULL;
-
 	amdgpu_vram_mgr_fini(adev);
 	amdgpu_gtt_mgr_fini(adev);
 	ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
-- 
2.25.1



* Re: [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal
  2021-05-12 14:26 ` [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal Andrey Grodzovsky
@ 2021-05-12 20:17   ` Alex Deucher
  2021-05-12 20:30     ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Deucher @ 2021-05-12 20:17 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander

On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> This should prevent writing to memory or IO ranges possibly
> already allocated for other uses after our device is removed.
>
> v5:
> Protect more places wher memcopy_to/form_io takes place

where

> Protect IB submissions
>
> v6: Switch to !drm_dev_enter instead of scoping entire code
> with brackets.
>
> v7:
> Drop the guard around HW ring command emission since the rings are
> in GART and not in MMIO.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

I think you could split out the psp_fw_copy changes as a separate
cleanup patch.  That's a nice clean up in general.  What about the SMU
code (e.g., amd/pm/powerplay and amd/pm/swsmu)?  There are a bunch of
shared memory areas we interact with in the driver.

Alex


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 63 ++++++++++++++--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 31 +++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 11 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c    | 22 +++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c     | 44 +++++++--------
>  drivers/gpu/drm/amd/amdgpu/psp_v12_0.c     |  8 +--
>  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c      |  8 +--
>  drivers/gpu/drm/amd/amdgpu/vce_v4_0.c      | 26 +++++----
>  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c      | 22 +++++---
>  13 files changed, 168 insertions(+), 95 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a0bff4713672..f7cca25c0fa0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -71,6 +71,8 @@
>  #include <drm/task_barrier.h>
>  #include <linux/pm_runtime.h>
>
> +#include <drm/drm_drv.h>
> +
>  MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
>  MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
>  MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
> @@ -281,7 +283,10 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>         unsigned long flags;
>         uint32_t hi = ~0;
>         uint64_t last;
> +       int idx;
>
> +       if (!drm_dev_enter(&adev->ddev, &idx))
> +               return;
>
>  #ifdef CONFIG_64BIT
>         last = min(pos + size, adev->gmc.visible_vram_size);
> @@ -300,7 +305,7 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>                 }
>
>                 if (count == size)
> -                       return;
> +                       goto exit;
>
>                 pos += count;
>                 buf += count / 4;
> @@ -323,6 +328,9 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>                         *buf++ = RREG32_NO_KIQ(mmMM_DATA);
>         }
>         spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);
> +
> +exit:
> +       drm_dev_exit(idx);
>  }
>
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 4d32233cde92..04ba5eef1e88 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -31,6 +31,8 @@
>  #include "amdgpu_ras.h"
>  #include "amdgpu_xgmi.h"
>
> +#include <drm/drm_drv.h>
> +
>  /**
>   * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
>   *
> @@ -151,6 +153,10 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
>  {
>         void __iomem *ptr = (void *)cpu_pt_addr;
>         uint64_t value;
> +       int idx;
> +
> +       if (!drm_dev_enter(&adev->ddev, &idx))
> +               return 0;
>
>         /*
>          * The following is for PTE only. GART does not have PDEs.
> @@ -158,6 +164,9 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
>         value = addr & 0x0000FFFFFFFFF000ULL;
>         value |= flags;
>         writeq(value, ptr + (gpu_page_idx * 8));
> +
> +       drm_dev_exit(idx);
> +
>         return 0;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 9e769cf6095b..bb6afee61666 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -25,6 +25,7 @@
>
>  #include <linux/firmware.h>
>  #include <linux/dma-mapping.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_psp.h"
> @@ -39,6 +40,8 @@
>  #include "amdgpu_ras.h"
>  #include "amdgpu_securedisplay.h"
>
> +#include <drm/drm_drv.h>
> +
>  static int psp_sysfs_init(struct amdgpu_device *adev);
>  static void psp_sysfs_fini(struct amdgpu_device *adev);
>
> @@ -253,7 +256,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
>                    struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr)
>  {
>         int ret;
> -       int index;
> +       int index, idx;
>         int timeout = 20000;
>         bool ras_intr = false;
>         bool skip_unsupport = false;
> @@ -261,6 +264,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>         if (psp->adev->in_pci_err_recovery)
>                 return 0;
>
> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
> +               return 0;
> +
>         mutex_lock(&psp->mutex);
>
>         memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> @@ -271,8 +277,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
>         ret = psp_ring_cmd_submit(psp, psp->cmd_buf_mc_addr, fence_mc_addr, index);
>         if (ret) {
>                 atomic_dec(&psp->fence_value);
> -               mutex_unlock(&psp->mutex);
> -               return ret;
> +               goto exit;
>         }
>
>         amdgpu_asic_invalidate_hdp(psp->adev, NULL);
> @@ -312,8 +317,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
>                          psp->cmd_buf_mem->cmd_id,
>                          psp->cmd_buf_mem->resp.status);
>                 if (!timeout) {
> -                       mutex_unlock(&psp->mutex);
> -                       return -EINVAL;
> +                       ret = -EINVAL;
> +                       goto exit;
>                 }
>         }
>
> @@ -321,8 +326,10 @@ psp_cmd_submit_buf(struct psp_context *psp,
>                 ucode->tmr_mc_addr_lo = psp->cmd_buf_mem->resp.fw_addr_lo;
>                 ucode->tmr_mc_addr_hi = psp->cmd_buf_mem->resp.fw_addr_hi;
>         }
> -       mutex_unlock(&psp->mutex);
>
> +exit:
> +       mutex_unlock(&psp->mutex);
> +       drm_dev_exit(idx);
>         return ret;
>  }
>
> @@ -359,8 +366,7 @@ static int psp_load_toc(struct psp_context *psp,
>         if (!cmd)
>                 return -ENOMEM;
>         /* Copy toc to psp firmware private buffer */
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -       memcpy(psp->fw_pri_buf, psp->toc_start_addr, psp->toc_bin_size);
> +       psp_copy_fw(psp, psp->toc_start_addr, psp->toc_bin_size);
>
>         psp_prep_load_toc_cmd_buf(cmd, psp->fw_pri_mc_addr, psp->toc_bin_size);
>
> @@ -625,8 +631,7 @@ static int psp_asd_load(struct psp_context *psp)
>         if (!cmd)
>                 return -ENOMEM;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -       memcpy(psp->fw_pri_buf, psp->asd_start_addr, psp->asd_ucode_size);
> +       psp_copy_fw(psp, psp->asd_start_addr, psp->asd_ucode_size);
>
>         psp_prep_asd_load_cmd_buf(cmd, psp->fw_pri_mc_addr,
>                                   psp->asd_ucode_size);
> @@ -781,8 +786,7 @@ static int psp_xgmi_load(struct psp_context *psp)
>         if (!cmd)
>                 return -ENOMEM;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -       memcpy(psp->fw_pri_buf, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
> +       psp_copy_fw(psp, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
>
>         psp_prep_ta_load_cmd_buf(cmd,
>                                  psp->fw_pri_mc_addr,
> @@ -1038,8 +1042,7 @@ static int psp_ras_load(struct psp_context *psp)
>         if (!cmd)
>                 return -ENOMEM;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -       memcpy(psp->fw_pri_buf, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
> +       psp_copy_fw(psp, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
>
>         psp_prep_ta_load_cmd_buf(cmd,
>                                  psp->fw_pri_mc_addr,
> @@ -1275,8 +1278,7 @@ static int psp_hdcp_load(struct psp_context *psp)
>         if (!cmd)
>                 return -ENOMEM;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -       memcpy(psp->fw_pri_buf, psp->ta_hdcp_start_addr,
> +       psp_copy_fw(psp, psp->ta_hdcp_start_addr,
>                psp->ta_hdcp_ucode_size);
>
>         psp_prep_ta_load_cmd_buf(cmd,
> @@ -1427,8 +1429,7 @@ static int psp_dtm_load(struct psp_context *psp)
>         if (!cmd)
>                 return -ENOMEM;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -       memcpy(psp->fw_pri_buf, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
> +       psp_copy_fw(psp, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
>
>         psp_prep_ta_load_cmd_buf(cmd,
>                                  psp->fw_pri_mc_addr,
> @@ -1573,8 +1574,7 @@ static int psp_rap_load(struct psp_context *psp)
>         if (!cmd)
>                 return -ENOMEM;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -       memcpy(psp->fw_pri_buf, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
> +       psp_copy_fw(psp, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
>
>         psp_prep_ta_load_cmd_buf(cmd,
>                                  psp->fw_pri_mc_addr,
> @@ -3022,7 +3022,7 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>         struct amdgpu_device *adev = drm_to_adev(ddev);
>         void *cpu_addr;
>         dma_addr_t dma_addr;
> -       int ret;
> +       int ret, idx;
>         char fw_name[100];
>         const struct firmware *usbc_pd_fw;
>
> @@ -3031,6 +3031,9 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>                 return -EBUSY;
>         }
>
> +       if (!drm_dev_enter(ddev, &idx))
> +               return -ENODEV;
> +
>         snprintf(fw_name, sizeof(fw_name), "amdgpu/%s", buf);
>         ret = request_firmware(&usbc_pd_fw, fw_name, adev->dev);
>         if (ret)
> @@ -3062,16 +3065,30 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>  rel_buf:
>         dma_free_coherent(adev->dev, usbc_pd_fw->size, cpu_addr, dma_addr);
>         release_firmware(usbc_pd_fw);
> -
>  fail:
>         if (ret) {
>                 DRM_ERROR("Failed to load USBC PD FW, err = %d", ret);
> -               return ret;
> +               count = ret;
>         }
>
> +       drm_dev_exit(idx);
>         return count;
>  }
>
> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size)
> +{
> +       int idx;
> +
> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
> +               return;
> +
> +       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> +       memcpy(psp->fw_pri_buf, start_addr, bin_size);
> +
> +       drm_dev_exit(idx);
> +}
> +
> +
>  static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
>                    psp_usbc_pd_fw_sysfs_read,
>                    psp_usbc_pd_fw_sysfs_write);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> index 46a5328e00e0..2bfdc278817f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> @@ -423,4 +423,6 @@ int psp_get_fw_attestation_records_addr(struct psp_context *psp,
>
>  int psp_load_fw_list(struct psp_context *psp,
>                      struct amdgpu_firmware_info **ucode_list, int ucode_count);
> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size);
> +
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> index c6dbc0801604..82f0542c7792 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> @@ -32,6 +32,7 @@
>  #include <linux/module.h>
>
>  #include <drm/drm.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_pm.h"
> @@ -375,7 +376,7 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>  {
>         unsigned size;
>         void *ptr;
> -       int i, j;
> +       int i, j, idx;
>         bool in_ras_intr = amdgpu_ras_intr_triggered();
>
>         cancel_delayed_work_sync(&adev->uvd.idle_work);
> @@ -403,11 +404,15 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>                 if (!adev->uvd.inst[j].saved_bo)
>                         return -ENOMEM;
>
> -               /* re-write 0 since err_event_athub will corrupt VCPU buffer */
> -               if (in_ras_intr)
> -                       memset(adev->uvd.inst[j].saved_bo, 0, size);
> -               else
> -                       memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> +                       /* re-write 0 since err_event_athub will corrupt VCPU buffer */
> +                       if (in_ras_intr)
> +                               memset(adev->uvd.inst[j].saved_bo, 0, size);
> +                       else
> +                               memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
> +
> +                       drm_dev_exit(idx);
> +               }
>         }
>
>         if (in_ras_intr)
> @@ -420,7 +425,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>  {
>         unsigned size;
>         void *ptr;
> -       int i;
> +       int i, idx;
>
>         for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
>                 if (adev->uvd.harvest_config & (1 << i))
> @@ -432,7 +437,10 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>                 ptr = adev->uvd.inst[i].cpu_addr;
>
>                 if (adev->uvd.inst[i].saved_bo != NULL) {
> -                       memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
> +                               memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
> +                               drm_dev_exit(idx);
> +                       }
>                         kvfree(adev->uvd.inst[i].saved_bo);
>                         adev->uvd.inst[i].saved_bo = NULL;
>                 } else {
> @@ -442,8 +450,11 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>                         hdr = (const struct common_firmware_header *)adev->uvd.fw->data;
>                         if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
>                                 offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> -                               memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
> -                                           le32_to_cpu(hdr->ucode_size_bytes));
> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
> +                                       memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
> +                                       drm_dev_exit(idx);
> +                               }
>                                 size -= le32_to_cpu(hdr->ucode_size_bytes);
>                                 ptr += le32_to_cpu(hdr->ucode_size_bytes);
>                         }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> index ea6a62f67e38..833203401ef4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> @@ -29,6 +29,7 @@
>  #include <linux/module.h>
>
>  #include <drm/drm.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_pm.h"
> @@ -293,7 +294,7 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
>         void *cpu_addr;
>         const struct common_firmware_header *hdr;
>         unsigned offset;
> -       int r;
> +       int r, idx;
>
>         if (adev->vce.vcpu_bo == NULL)
>                 return -EINVAL;
> @@ -313,8 +314,12 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
>
>         hdr = (const struct common_firmware_header *)adev->vce.fw->data;
>         offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> -       memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
> -                   adev->vce.fw->size - offset);
> +
> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> +               memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
> +                           adev->vce.fw->size - offset);
> +               drm_dev_exit(idx);
> +       }
>
>         amdgpu_bo_kunmap(adev->vce.vcpu_bo);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> index 201645963ba5..21f7d3644d70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> @@ -27,6 +27,7 @@
>  #include <linux/firmware.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_pm.h"
> @@ -275,7 +276,7 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>  {
>         unsigned size;
>         void *ptr;
> -       int i;
> +       int i, idx;
>
>         cancel_delayed_work_sync(&adev->vcn.idle_work);
>
> @@ -292,7 +293,10 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>                 if (!adev->vcn.inst[i].saved_bo)
>                         return -ENOMEM;
>
> -               memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> +                       memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
> +                       drm_dev_exit(idx);
> +               }
>         }
>         return 0;
>  }
> @@ -301,7 +305,7 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>  {
>         unsigned size;
>         void *ptr;
> -       int i;
> +       int i, idx;
>
>         for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>                 if (adev->vcn.harvest_config & (1 << i))
> @@ -313,7 +317,10 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>                 ptr = adev->vcn.inst[i].cpu_addr;
>
>                 if (adev->vcn.inst[i].saved_bo != NULL) {
> -                       memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
> +                               memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
> +                               drm_dev_exit(idx);
> +                       }
>                         kvfree(adev->vcn.inst[i].saved_bo);
>                         adev->vcn.inst[i].saved_bo = NULL;
>                 } else {
> @@ -323,8 +330,11 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>                         hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
>                         if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
>                                 offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> -                               memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
> -                                           le32_to_cpu(hdr->ucode_size_bytes));
> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
> +                                       memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
> +                                       drm_dev_exit(idx);
> +                               }
>                                 size -= le32_to_cpu(hdr->ucode_size_bytes);
>                                 ptr += le32_to_cpu(hdr->ucode_size_bytes);
>                         }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 9f868cf3b832..7dd5f10ab570 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -32,6 +32,7 @@
>  #include <linux/dma-buf.h>
>
>  #include <drm/amdgpu_drm.h>
> +#include <drm/drm_drv.h>
>  #include "amdgpu.h"
>  #include "amdgpu_trace.h"
>  #include "amdgpu_amdkfd.h"
> @@ -1606,7 +1607,10 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>         struct amdgpu_vm_update_params params;
>         enum amdgpu_sync_mode sync_mode;
>         uint64_t pfn;
> -       int r;
> +       int r, idx;
> +
> +       if (!drm_dev_enter(&adev->ddev, &idx))
> +               return -ENODEV;
>
>         memset(&params, 0, sizeof(params));
>         params.adev = adev;
> @@ -1715,6 +1719,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>
>  error_unlock:
>         amdgpu_vm_eviction_unlock(vm);
> +       drm_dev_exit(idx);
>         return r;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> index 589410c32d09..2cec71e823f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> @@ -23,6 +23,7 @@
>  #include <linux/firmware.h>
>  #include <linux/module.h>
>  #include <linux/vmalloc.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_psp.h"
> @@ -269,10 +270,8 @@ static int psp_v11_0_bootloader_load_kdb(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy PSP KDB binary to memory */
> -       memcpy(psp->fw_pri_buf, psp->kdb_start_addr, psp->kdb_bin_size);
> +       psp_copy_fw(psp, psp->kdb_start_addr, psp->kdb_bin_size);
>
>         /* Provide the PSP KDB to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> @@ -302,10 +301,8 @@ static int psp_v11_0_bootloader_load_spl(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy PSP SPL binary to memory */
> -       memcpy(psp->fw_pri_buf, psp->spl_start_addr, psp->spl_bin_size);
> +       psp_copy_fw(psp, psp->spl_start_addr, psp->spl_bin_size);
>
>         /* Provide the PSP SPL to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> @@ -335,10 +332,8 @@ static int psp_v11_0_bootloader_load_sysdrv(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy PSP System Driver binary to memory */
> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>
>         /* Provide the sys driver to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> @@ -371,10 +366,8 @@ static int psp_v11_0_bootloader_load_sos(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy Secure OS binary to PSP memory */
> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>
>         /* Provide the PSP secure OS to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> @@ -608,7 +601,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>         uint32_t p2c_header[4];
>         uint32_t sz;
>         void *buf;
> -       int ret;
> +       int ret, idx;
>
>         if (ctx->init == PSP_MEM_TRAIN_NOT_SUPPORT) {
>                 DRM_DEBUG("Memory training is not supported.\n");
> @@ -681,17 +674,24 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>                         return -ENOMEM;
>                 }
>
> -               memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
> -               ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
> -               if (ret) {
> -                       DRM_ERROR("Send long training msg failed.\n");
> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> +                       memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
> +                       ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
> +                       if (ret) {
> +                               DRM_ERROR("Send long training msg failed.\n");
> +                               vfree(buf);
> +                               drm_dev_exit(idx);
> +                               return ret;
> +                       }
> +
> +                       memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
> +                       adev->hdp.funcs->flush_hdp(adev, NULL);
>                         vfree(buf);
> -                       return ret;
> +                       drm_dev_exit(idx);
> +               } else {
> +                       vfree(buf);
> +                       return -ENODEV;
>                 }
> -
> -               memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
> -               adev->hdp.funcs->flush_hdp(adev, NULL);
> -               vfree(buf);
>         }
>
>         if (ops & PSP_MEM_TRAIN_SAVE) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> index c4828bd3264b..618e5b6b85d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> @@ -138,10 +138,8 @@ static int psp_v12_0_bootloader_load_sysdrv(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy PSP System Driver binary to memory */
> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>
>         /* Provide the sys driver to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> @@ -179,10 +177,8 @@ static int psp_v12_0_bootloader_load_sos(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy Secure OS binary to PSP memory */
> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>
>         /* Provide the PSP secure OS to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> index f2e725f72d2f..d0a6cccd0897 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> @@ -102,10 +102,8 @@ static int psp_v3_1_bootloader_load_sysdrv(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy PSP System Driver binary to memory */
> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>
>         /* Provide the sys driver to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> @@ -143,10 +141,8 @@ static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
>         if (ret)
>                 return ret;
>
> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> -
>         /* Copy Secure OS binary to PSP memory */
> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>
>         /* Provide the PSP secure OS to bootloader */
>         WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> index 8e238dea7bef..90910d19db12 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> @@ -25,6 +25,7 @@
>   */
>
>  #include <linux/firmware.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_vce.h"
> @@ -555,16 +556,19 @@ static int vce_v4_0_hw_fini(void *handle)
>  static int vce_v4_0_suspend(void *handle)
>  {
>         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> -       int r;
> +       int r, idx;
>
>         if (adev->vce.vcpu_bo == NULL)
>                 return 0;
>
> -       if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> -               void *ptr = adev->vce.cpu_addr;
> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> +               if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> +                       void *ptr = adev->vce.cpu_addr;
>
> -               memcpy_fromio(adev->vce.saved_bo, ptr, size);
> +                       memcpy_fromio(adev->vce.saved_bo, ptr, size);
> +               }
> +               drm_dev_exit(idx);
>         }
>
>         r = vce_v4_0_hw_fini(adev);
> @@ -577,16 +581,20 @@ static int vce_v4_0_suspend(void *handle)
>  static int vce_v4_0_resume(void *handle)
>  {
>         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> -       int r;
> +       int r, idx;
>
>         if (adev->vce.vcpu_bo == NULL)
>                 return -EINVAL;
>
>         if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> -               void *ptr = adev->vce.cpu_addr;
>
> -               memcpy_toio(ptr, adev->vce.saved_bo, size);
> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> +                       void *ptr = adev->vce.cpu_addr;
> +
> +                       memcpy_toio(ptr, adev->vce.saved_bo, size);
> +                       drm_dev_exit(idx);
> +               }
>         } else {
>                 r = amdgpu_vce_resume(adev);
>                 if (r)
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> index 3f15bf34123a..df34be8ec82d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> @@ -34,6 +34,8 @@
>  #include "vcn/vcn_3_0_0_sh_mask.h"
>  #include "ivsrcid/vcn/irqsrcs_vcn_2_0.h"
>
> +#include <drm/drm_drv.h>
> +
>  #define mmUVD_CONTEXT_ID_INTERNAL_OFFSET                       0x27
>  #define mmUVD_GPCOM_VCPU_CMD_INTERNAL_OFFSET                   0x0f
>  #define mmUVD_GPCOM_VCPU_DATA0_INTERNAL_OFFSET                 0x10
> @@ -268,16 +270,20 @@ static int vcn_v3_0_sw_init(void *handle)
>  static int vcn_v3_0_sw_fini(void *handle)
>  {
>         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> -       int i, r;
> +       int i, r, idx;
>
> -       for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
> -               volatile struct amdgpu_fw_shared *fw_shared;
> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> +               for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
> +                       volatile struct amdgpu_fw_shared *fw_shared;
>
> -               if (adev->vcn.harvest_config & (1 << i))
> -                       continue;
> -               fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
> -               fw_shared->present_flag_0 = 0;
> -               fw_shared->sw_ring.is_enabled = false;
> +                       if (adev->vcn.harvest_config & (1 << i))
> +                               continue;
> +                       fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
> +                       fw_shared->present_flag_0 = 0;
> +                       fw_shared->sw_ring.is_enabled = false;
> +               }
> +
> +               drm_dev_exit(idx);
>         }
>
>         if (amdgpu_sriov_vf(adev))
> --
> 2.25.1
>


* Re: [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal
  2021-05-12 20:17   ` Alex Deucher
@ 2021-05-12 20:30     ` Andrey Grodzovsky
  2021-05-12 20:50       ` Alex Deucher
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 20:30 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Mailing list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander



On 2021-05-12 4:17 p.m., Alex Deucher wrote:
> On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> This should prevent writing to memory or IO ranges possibly
>> already allocated for other uses after our device is removed.
>>
>> v5:
>> Protect more places wher memcopy_to/form_io takes place
> 
> where
> 
>> Protect IB submissions
>>
>> v6: Switch to !drm_dev_enter instead of scoping entire code
>> with brackets.
>>
>> v7:
>> Drop guard of HW ring commands emission protection since they
>> are in GART and not in MMIO.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> 
> I think you could split out the psp_fw_copy changes as a separate
> cleanup patch.  That's a nice cleanup in general.  What about the SMU
> code (e.g., amd/pm/powerplay and amd/pm/swsmu)?  There are a bunch of
> shared memory areas we interact with in the driver.

Can you point me to it? Are they in VRAM and not GART?
I searched for all memcpy_to/from_io in our code, but maybe I missed some.
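
If those areas are in VRAM they would need the same treatment as the
PSP/VCN paths above. Roughly (sketch only, the variables here are just
for illustration):

        /* VRAM aperture access -- needs the drm_dev_enter() guard */
        if (drm_dev_enter(&adev->ddev, &idx)) {
                memcpy_fromio(buf, adev->mman.aper_base_kaddr, size);
                drm_dev_exit(idx);
        }

        /* GART BO backed by system pages -- plain CPU memory, no guard */
        memcpy(dst, saved_bo, size);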

Andrey

> 
> Alex
> 
> 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 ++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 63 ++++++++++++++--------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  2 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 31 +++++++----
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 11 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c    | 22 +++++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     |  7 ++-
>>   drivers/gpu/drm/amd/amdgpu/psp_v11_0.c     | 44 +++++++--------
>>   drivers/gpu/drm/amd/amdgpu/psp_v12_0.c     |  8 +--
>>   drivers/gpu/drm/amd/amdgpu/psp_v3_1.c      |  8 +--
>>   drivers/gpu/drm/amd/amdgpu/vce_v4_0.c      | 26 +++++----
>>   drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c      | 22 +++++---
>>   13 files changed, 168 insertions(+), 95 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index a0bff4713672..f7cca25c0fa0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -71,6 +71,8 @@
>>   #include <drm/task_barrier.h>
>>   #include <linux/pm_runtime.h>
>>
>> +#include <drm/drm_drv.h>
>> +
>>   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
>>   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
>>   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
>> @@ -281,7 +283,10 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>>          unsigned long flags;
>>          uint32_t hi = ~0;
>>          uint64_t last;
>> +       int idx;
>>
>> +       if (!drm_dev_enter(&adev->ddev, &idx))
>> +               return;
>>
>>   #ifdef CONFIG_64BIT
>>          last = min(pos + size, adev->gmc.visible_vram_size);
>> @@ -300,7 +305,7 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>>                  }
>>
>>                  if (count == size)
>> -                       return;
>> +                       goto exit;
>>
>>                  pos += count;
>>                  buf += count / 4;
>> @@ -323,6 +328,9 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>>                          *buf++ = RREG32_NO_KIQ(mmMM_DATA);
>>          }
>>          spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);
>> +
>> +exit:
>> +       drm_dev_exit(idx);
>>   }
>>
>>   /*
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index 4d32233cde92..04ba5eef1e88 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -31,6 +31,8 @@
>>   #include "amdgpu_ras.h"
>>   #include "amdgpu_xgmi.h"
>>
>> +#include <drm/drm_drv.h>
>> +
>>   /**
>>    * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
>>    *
>> @@ -151,6 +153,10 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
>>   {
>>          void __iomem *ptr = (void *)cpu_pt_addr;
>>          uint64_t value;
>> +       int idx;
>> +
>> +       if (!drm_dev_enter(&adev->ddev, &idx))
>> +               return 0;
>>
>>          /*
>>           * The following is for PTE only. GART does not have PDEs.
>> @@ -158,6 +164,9 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
>>          value = addr & 0x0000FFFFFFFFF000ULL;
>>          value |= flags;
>>          writeq(value, ptr + (gpu_page_idx * 8));
>> +
>> +       drm_dev_exit(idx);
>> +
>>          return 0;
>>   }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> index 9e769cf6095b..bb6afee61666 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>> @@ -25,6 +25,7 @@
>>
>>   #include <linux/firmware.h>
>>   #include <linux/dma-mapping.h>
>> +#include <drm/drm_drv.h>
>>
>>   #include "amdgpu.h"
>>   #include "amdgpu_psp.h"
>> @@ -39,6 +40,8 @@
>>   #include "amdgpu_ras.h"
>>   #include "amdgpu_securedisplay.h"
>>
>> +#include <drm/drm_drv.h>
>> +
>>   static int psp_sysfs_init(struct amdgpu_device *adev);
>>   static void psp_sysfs_fini(struct amdgpu_device *adev);
>>
>> @@ -253,7 +256,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>                     struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr)
>>   {
>>          int ret;
>> -       int index;
>> +       int index, idx;
>>          int timeout = 20000;
>>          bool ras_intr = false;
>>          bool skip_unsupport = false;
>> @@ -261,6 +264,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>          if (psp->adev->in_pci_err_recovery)
>>                  return 0;
>>
>> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
>> +               return 0;
>> +
>>          mutex_lock(&psp->mutex);
>>
>>          memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
>> @@ -271,8 +277,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>          ret = psp_ring_cmd_submit(psp, psp->cmd_buf_mc_addr, fence_mc_addr, index);
>>          if (ret) {
>>                  atomic_dec(&psp->fence_value);
>> -               mutex_unlock(&psp->mutex);
>> -               return ret;
>> +               goto exit;
>>          }
>>
>>          amdgpu_asic_invalidate_hdp(psp->adev, NULL);
>> @@ -312,8 +317,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>                           psp->cmd_buf_mem->cmd_id,
>>                           psp->cmd_buf_mem->resp.status);
>>                  if (!timeout) {
>> -                       mutex_unlock(&psp->mutex);
>> -                       return -EINVAL;
>> +                       ret = -EINVAL;
>> +                       goto exit;
>>                  }
>>          }
>>
>> @@ -321,8 +326,10 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>                  ucode->tmr_mc_addr_lo = psp->cmd_buf_mem->resp.fw_addr_lo;
>>                  ucode->tmr_mc_addr_hi = psp->cmd_buf_mem->resp.fw_addr_hi;
>>          }
>> -       mutex_unlock(&psp->mutex);
>>
>> +exit:
>> +       mutex_unlock(&psp->mutex);
>> +       drm_dev_exit(idx);
>>          return ret;
>>   }
>>
>> @@ -359,8 +366,7 @@ static int psp_load_toc(struct psp_context *psp,
>>          if (!cmd)
>>                  return -ENOMEM;
>>          /* Copy toc to psp firmware private buffer */
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -       memcpy(psp->fw_pri_buf, psp->toc_start_addr, psp->toc_bin_size);
>> +       psp_copy_fw(psp, psp->toc_start_addr, psp->toc_bin_size);
>>
>>          psp_prep_load_toc_cmd_buf(cmd, psp->fw_pri_mc_addr, psp->toc_bin_size);
>>
>> @@ -625,8 +631,7 @@ static int psp_asd_load(struct psp_context *psp)
>>          if (!cmd)
>>                  return -ENOMEM;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -       memcpy(psp->fw_pri_buf, psp->asd_start_addr, psp->asd_ucode_size);
>> +       psp_copy_fw(psp, psp->asd_start_addr, psp->asd_ucode_size);
>>
>>          psp_prep_asd_load_cmd_buf(cmd, psp->fw_pri_mc_addr,
>>                                    psp->asd_ucode_size);
>> @@ -781,8 +786,7 @@ static int psp_xgmi_load(struct psp_context *psp)
>>          if (!cmd)
>>                  return -ENOMEM;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -       memcpy(psp->fw_pri_buf, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
>> +       psp_copy_fw(psp, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
>>
>>          psp_prep_ta_load_cmd_buf(cmd,
>>                                   psp->fw_pri_mc_addr,
>> @@ -1038,8 +1042,7 @@ static int psp_ras_load(struct psp_context *psp)
>>          if (!cmd)
>>                  return -ENOMEM;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -       memcpy(psp->fw_pri_buf, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
>> +       psp_copy_fw(psp, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
>>
>>          psp_prep_ta_load_cmd_buf(cmd,
>>                                   psp->fw_pri_mc_addr,
>> @@ -1275,8 +1278,7 @@ static int psp_hdcp_load(struct psp_context *psp)
>>          if (!cmd)
>>                  return -ENOMEM;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -       memcpy(psp->fw_pri_buf, psp->ta_hdcp_start_addr,
>> +       psp_copy_fw(psp, psp->ta_hdcp_start_addr,
>>                 psp->ta_hdcp_ucode_size);
>>
>>          psp_prep_ta_load_cmd_buf(cmd,
>> @@ -1427,8 +1429,7 @@ static int psp_dtm_load(struct psp_context *psp)
>>          if (!cmd)
>>                  return -ENOMEM;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -       memcpy(psp->fw_pri_buf, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
>> +       psp_copy_fw(psp, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
>>
>>          psp_prep_ta_load_cmd_buf(cmd,
>>                                   psp->fw_pri_mc_addr,
>> @@ -1573,8 +1574,7 @@ static int psp_rap_load(struct psp_context *psp)
>>          if (!cmd)
>>                  return -ENOMEM;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -       memcpy(psp->fw_pri_buf, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
>> +       psp_copy_fw(psp, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
>>
>>          psp_prep_ta_load_cmd_buf(cmd,
>>                                   psp->fw_pri_mc_addr,
>> @@ -3022,7 +3022,7 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>>          struct amdgpu_device *adev = drm_to_adev(ddev);
>>          void *cpu_addr;
>>          dma_addr_t dma_addr;
>> -       int ret;
>> +       int ret, idx;
>>          char fw_name[100];
>>          const struct firmware *usbc_pd_fw;
>>
>> @@ -3031,6 +3031,9 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>>                  return -EBUSY;
>>          }
>>
>> +       if (!drm_dev_enter(ddev, &idx))
>> +               return -ENODEV;
>> +
>>          snprintf(fw_name, sizeof(fw_name), "amdgpu/%s", buf);
>>          ret = request_firmware(&usbc_pd_fw, fw_name, adev->dev);
>>          if (ret)
>> @@ -3062,16 +3065,30 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>>   rel_buf:
>>          dma_free_coherent(adev->dev, usbc_pd_fw->size, cpu_addr, dma_addr);
>>          release_firmware(usbc_pd_fw);
>> -
>>   fail:
>>          if (ret) {
>>                  DRM_ERROR("Failed to load USBC PD FW, err = %d", ret);
>> -               return ret;
>> +               count = ret;
>>          }
>>
>> +       drm_dev_exit(idx);
>>          return count;
>>   }
>>
>> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size)
>> +{
>> +       int idx;
>> +
>> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
>> +               return;
>> +
>> +       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> +       memcpy(psp->fw_pri_buf, start_addr, bin_size);
>> +
>> +       drm_dev_exit(idx);
>> +}
>> +
>> +
>>   static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
>>                     psp_usbc_pd_fw_sysfs_read,
>>                     psp_usbc_pd_fw_sysfs_write);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
>> index 46a5328e00e0..2bfdc278817f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
>> @@ -423,4 +423,6 @@ int psp_get_fw_attestation_records_addr(struct psp_context *psp,
>>
>>   int psp_load_fw_list(struct psp_context *psp,
>>                       struct amdgpu_firmware_info **ucode_list, int ucode_count);
>> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size);
>> +
>>   #endif
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> index c6dbc0801604..82f0542c7792 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>> @@ -32,6 +32,7 @@
>>   #include <linux/module.h>
>>
>>   #include <drm/drm.h>
>> +#include <drm/drm_drv.h>
>>
>>   #include "amdgpu.h"
>>   #include "amdgpu_pm.h"
>> @@ -375,7 +376,7 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>   {
>>          unsigned size;
>>          void *ptr;
>> -       int i, j;
>> +       int i, j, idx;
>>          bool in_ras_intr = amdgpu_ras_intr_triggered();
>>
>>          cancel_delayed_work_sync(&adev->uvd.idle_work);
>> @@ -403,11 +404,15 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>                  if (!adev->uvd.inst[j].saved_bo)
>>                          return -ENOMEM;
>>
>> -               /* re-write 0 since err_event_athub will corrupt VCPU buffer */
>> -               if (in_ras_intr)
>> -                       memset(adev->uvd.inst[j].saved_bo, 0, size);
>> -               else
>> -                       memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                       /* re-write 0 since err_event_athub will corrupt VCPU buffer */
>> +                       if (in_ras_intr)
>> +                               memset(adev->uvd.inst[j].saved_bo, 0, size);
>> +                       else
>> +                               memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
>> +
>> +                       drm_dev_exit(idx);
>> +               }
>>          }
>>
>>          if (in_ras_intr)
>> @@ -420,7 +425,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>>   {
>>          unsigned size;
>>          void *ptr;
>> -       int i;
>> +       int i, idx;
>>
>>          for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
>>                  if (adev->uvd.harvest_config & (1 << i))
>> @@ -432,7 +437,10 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>>                  ptr = adev->uvd.inst[i].cpu_addr;
>>
>>                  if (adev->uvd.inst[i].saved_bo != NULL) {
>> -                       memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
>> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                               memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
>> +                               drm_dev_exit(idx);
>> +                       }
>>                          kvfree(adev->uvd.inst[i].saved_bo);
>>                          adev->uvd.inst[i].saved_bo = NULL;
>>                  } else {
>> @@ -442,8 +450,11 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>>                          hdr = (const struct common_firmware_header *)adev->uvd.fw->data;
>>                          if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
>>                                  offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
>> -                               memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
>> -                                           le32_to_cpu(hdr->ucode_size_bytes));
>> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                                       memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
>> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
>> +                                       drm_dev_exit(idx);
>> +                               }
>>                                  size -= le32_to_cpu(hdr->ucode_size_bytes);
>>                                  ptr += le32_to_cpu(hdr->ucode_size_bytes);
>>                          }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> index ea6a62f67e38..833203401ef4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>> @@ -29,6 +29,7 @@
>>   #include <linux/module.h>
>>
>>   #include <drm/drm.h>
>> +#include <drm/drm_drv.h>
>>
>>   #include "amdgpu.h"
>>   #include "amdgpu_pm.h"
>> @@ -293,7 +294,7 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
>>          void *cpu_addr;
>>          const struct common_firmware_header *hdr;
>>          unsigned offset;
>> -       int r;
>> +       int r, idx;
>>
>>          if (adev->vce.vcpu_bo == NULL)
>>                  return -EINVAL;
>> @@ -313,8 +314,12 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
>>
>>          hdr = (const struct common_firmware_header *)adev->vce.fw->data;
>>          offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
>> -       memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
>> -                   adev->vce.fw->size - offset);
>> +
>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
>> +               memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
>> +                           adev->vce.fw->size - offset);
>> +               drm_dev_exit(idx);
>> +       }
>>
>>          amdgpu_bo_kunmap(adev->vce.vcpu_bo);
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>> index 201645963ba5..21f7d3644d70 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>> @@ -27,6 +27,7 @@
>>   #include <linux/firmware.h>
>>   #include <linux/module.h>
>>   #include <linux/pci.h>
>> +#include <drm/drm_drv.h>
>>
>>   #include "amdgpu.h"
>>   #include "amdgpu_pm.h"
>> @@ -275,7 +276,7 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>>   {
>>          unsigned size;
>>          void *ptr;
>> -       int i;
>> +       int i, idx;
>>
>>          cancel_delayed_work_sync(&adev->vcn.idle_work);
>>
>> @@ -292,7 +293,10 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>>                  if (!adev->vcn.inst[i].saved_bo)
>>                          return -ENOMEM;
>>
>> -               memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                       memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
>> +                       drm_dev_exit(idx);
>> +               }
>>          }
>>          return 0;
>>   }
>> @@ -301,7 +305,7 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>>   {
>>          unsigned size;
>>          void *ptr;
>> -       int i;
>> +       int i, idx;
>>
>>          for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>                  if (adev->vcn.harvest_config & (1 << i))
>> @@ -313,7 +317,10 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>>                  ptr = adev->vcn.inst[i].cpu_addr;
>>
>>                  if (adev->vcn.inst[i].saved_bo != NULL) {
>> -                       memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
>> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                               memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
>> +                               drm_dev_exit(idx);
>> +                       }
>>                          kvfree(adev->vcn.inst[i].saved_bo);
>>                          adev->vcn.inst[i].saved_bo = NULL;
>>                  } else {
>> @@ -323,8 +330,11 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>>                          hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
>>                          if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
>>                                  offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
>> -                               memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
>> -                                           le32_to_cpu(hdr->ucode_size_bytes));
>> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                                       memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
>> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
>> +                                       drm_dev_exit(idx);
>> +                               }
>>                                  size -= le32_to_cpu(hdr->ucode_size_bytes);
>>                                  ptr += le32_to_cpu(hdr->ucode_size_bytes);
>>                          }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 9f868cf3b832..7dd5f10ab570 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -32,6 +32,7 @@
>>   #include <linux/dma-buf.h>
>>
>>   #include <drm/amdgpu_drm.h>
>> +#include <drm/drm_drv.h>
>>   #include "amdgpu.h"
>>   #include "amdgpu_trace.h"
>>   #include "amdgpu_amdkfd.h"
>> @@ -1606,7 +1607,10 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>>          struct amdgpu_vm_update_params params;
>>          enum amdgpu_sync_mode sync_mode;
>>          uint64_t pfn;
>> -       int r;
>> +       int r, idx;
>> +
>> +       if (!drm_dev_enter(&adev->ddev, &idx))
>> +               return -ENODEV;
>>
>>          memset(&params, 0, sizeof(params));
>>          params.adev = adev;
>> @@ -1715,6 +1719,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>>
>>   error_unlock:
>>          amdgpu_vm_eviction_unlock(vm);
>> +       drm_dev_exit(idx);
>>          return r;
>>   }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
>> index 589410c32d09..2cec71e823f5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
>> @@ -23,6 +23,7 @@
>>   #include <linux/firmware.h>
>>   #include <linux/module.h>
>>   #include <linux/vmalloc.h>
>> +#include <drm/drm_drv.h>
>>
>>   #include "amdgpu.h"
>>   #include "amdgpu_psp.h"
>> @@ -269,10 +270,8 @@ static int psp_v11_0_bootloader_load_kdb(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy PSP KDB binary to memory */
>> -       memcpy(psp->fw_pri_buf, psp->kdb_start_addr, psp->kdb_bin_size);
>> +       psp_copy_fw(psp, psp->kdb_start_addr, psp->kdb_bin_size);
>>
>>          /* Provide the PSP KDB to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> @@ -302,10 +301,8 @@ static int psp_v11_0_bootloader_load_spl(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy PSP SPL binary to memory */
>> -       memcpy(psp->fw_pri_buf, psp->spl_start_addr, psp->spl_bin_size);
>> +       psp_copy_fw(psp, psp->spl_start_addr, psp->spl_bin_size);
>>
>>          /* Provide the PSP SPL to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> @@ -335,10 +332,8 @@ static int psp_v11_0_bootloader_load_sysdrv(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy PSP System Driver binary to memory */
>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>>
>>          /* Provide the sys driver to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> @@ -371,10 +366,8 @@ static int psp_v11_0_bootloader_load_sos(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy Secure OS binary to PSP memory */
>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>>
>>          /* Provide the PSP secure OS to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> @@ -608,7 +601,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>>          uint32_t p2c_header[4];
>>          uint32_t sz;
>>          void *buf;
>> -       int ret;
>> +       int ret, idx;
>>
>>          if (ctx->init == PSP_MEM_TRAIN_NOT_SUPPORT) {
>>                  DRM_DEBUG("Memory training is not supported.\n");
>> @@ -681,17 +674,24 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>>                          return -ENOMEM;
>>                  }
>>
>> -               memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
>> -               ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
>> -               if (ret) {
>> -                       DRM_ERROR("Send long training msg failed.\n");
>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                       memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
>> +                       ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
>> +                       if (ret) {
>> +                               DRM_ERROR("Send long training msg failed.\n");
>> +                               vfree(buf);
>> +                               drm_dev_exit(idx);
>> +                               return ret;
>> +                       }
>> +
>> +                       memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
>> +                       adev->hdp.funcs->flush_hdp(adev, NULL);
>>                          vfree(buf);
>> -                       return ret;
>> +                       drm_dev_exit(idx);
>> +               } else {
>> +                       vfree(buf);
>> +                       return -ENODEV;
>>                  }
>> -
>> -               memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
>> -               adev->hdp.funcs->flush_hdp(adev, NULL);
>> -               vfree(buf);
>>          }
>>
>>          if (ops & PSP_MEM_TRAIN_SAVE) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
>> index c4828bd3264b..618e5b6b85d9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
>> @@ -138,10 +138,8 @@ static int psp_v12_0_bootloader_load_sysdrv(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy PSP System Driver binary to memory */
>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>>
>>          /* Provide the sys driver to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> @@ -179,10 +177,8 @@ static int psp_v12_0_bootloader_load_sos(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy Secure OS binary to PSP memory */
>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>>
>>          /* Provide the PSP secure OS to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
>> index f2e725f72d2f..d0a6cccd0897 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
>> @@ -102,10 +102,8 @@ static int psp_v3_1_bootloader_load_sysdrv(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy PSP System Driver binary to memory */
>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>>
>>          /* Provide the sys driver to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> @@ -143,10 +141,8 @@ static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
>>          if (ret)
>>                  return ret;
>>
>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>> -
>>          /* Copy Secure OS binary to PSP memory */
>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>>
>>          /* Provide the PSP secure OS to bootloader */
>>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
>> index 8e238dea7bef..90910d19db12 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
>> @@ -25,6 +25,7 @@
>>    */
>>
>>   #include <linux/firmware.h>
>> +#include <drm/drm_drv.h>
>>
>>   #include "amdgpu.h"
>>   #include "amdgpu_vce.h"
>> @@ -555,16 +556,19 @@ static int vce_v4_0_hw_fini(void *handle)
>>   static int vce_v4_0_suspend(void *handle)
>>   {
>>          struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>> -       int r;
>> +       int r, idx;
>>
>>          if (adev->vce.vcpu_bo == NULL)
>>                  return 0;
>>
>> -       if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
>> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>> -               void *ptr = adev->vce.cpu_addr;
>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
>> +               if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
>> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>> +                       void *ptr = adev->vce.cpu_addr;
>>
>> -               memcpy_fromio(adev->vce.saved_bo, ptr, size);
>> +                       memcpy_fromio(adev->vce.saved_bo, ptr, size);
>> +               }
>> +               drm_dev_exit(idx);
>>          }
>>
>>          r = vce_v4_0_hw_fini(adev);
>> @@ -577,16 +581,20 @@ static int vce_v4_0_suspend(void *handle)
>>   static int vce_v4_0_resume(void *handle)
>>   {
>>          struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>> -       int r;
>> +       int r, idx;
>>
>>          if (adev->vce.vcpu_bo == NULL)
>>                  return -EINVAL;
>>
>>          if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
>> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>> -               void *ptr = adev->vce.cpu_addr;
>>
>> -               memcpy_toio(ptr, adev->vce.saved_bo, size);
>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>> +                       void *ptr = adev->vce.cpu_addr;
>> +
>> +                       memcpy_toio(ptr, adev->vce.saved_bo, size);
>> +                       drm_dev_exit(idx);
>> +               }
>>          } else {
>>                  r = amdgpu_vce_resume(adev);
>>                  if (r)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>> index 3f15bf34123a..df34be8ec82d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>> @@ -34,6 +34,8 @@
>>   #include "vcn/vcn_3_0_0_sh_mask.h"
>>   #include "ivsrcid/vcn/irqsrcs_vcn_2_0.h"
>>
>> +#include <drm/drm_drv.h>
>> +
>>   #define mmUVD_CONTEXT_ID_INTERNAL_OFFSET                       0x27
>>   #define mmUVD_GPCOM_VCPU_CMD_INTERNAL_OFFSET                   0x0f
>>   #define mmUVD_GPCOM_VCPU_DATA0_INTERNAL_OFFSET                 0x10
>> @@ -268,16 +270,20 @@ static int vcn_v3_0_sw_init(void *handle)
>>   static int vcn_v3_0_sw_fini(void *handle)
>>   {
>>          struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>> -       int i, r;
>> +       int i, r, idx;
>>
>> -       for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
>> -               volatile struct amdgpu_fw_shared *fw_shared;
>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
>> +               for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
>> +                       volatile struct amdgpu_fw_shared *fw_shared;
>>
>> -               if (adev->vcn.harvest_config & (1 << i))
>> -                       continue;
>> -               fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
>> -               fw_shared->present_flag_0 = 0;
>> -               fw_shared->sw_ring.is_enabled = false;
>> +                       if (adev->vcn.harvest_config & (1 << i))
>> +                               continue;
>> +                       fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
>> +                       fw_shared->present_flag_0 = 0;
>> +                       fw_shared->sw_ring.is_enabled = false;
>> +               }
>> +
>> +               drm_dev_exit(idx);
>>          }
>>
>>          if (amdgpu_sriov_vf(adev))
>> --
>> 2.25.1
>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 03/16] drm/amdkfd: Split kfd suspend from device exit
  2021-05-12 14:26 ` [PATCH v7 03/16] drm/amdkfd: Split kfd suspend from device exit Andrey Grodzovsky
@ 2021-05-12 20:33   ` Felix Kuehling
  2021-05-12 20:38     ` Andrey Grodzovsky
  2021-05-20  3:20     ` [PATCH] drm/amdgpu: Add early fini callback Andrey Grodzovsky
  0 siblings, 2 replies; 64+ messages in thread
From: Felix Kuehling @ 2021-05-12 20:33 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas

On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
> Helps to expedite HW related stuff to amdgpu_pci_remove
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 3 ++-
>  3 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 5f6696a3c778..2b06dee9a0ce 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -170,7 +170,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>  	}
>  }
>  
> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev)

You're renaming this function, but I don't see you fixing up any of the
callers. Looks like you do that in the next patch. So this patch breaks
the build, the next one fixes it. Maybe you need to refactor this or
just squash the two patches.
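
Just to illustrate: a squashed patch would carry the rename and the
caller fixup together, so every commit builds on its own. A rough
sketch, assuming amdgpu_device_fini() in amdgpu_device.c is the caller
(hypothetical here -- the real patch should update whatever callers
grep turns up):

	/* amdgpu_amdkfd.c: the function is renamed ... */
	void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev)
	{
		if (adev->kfd.dev) {
			kgd2kfd_device_exit(adev->kfd.dev);
			adev->kfd.dev = NULL;
		}
	}

	/* amdgpu_device.c: ... and its caller is updated in the same commit,
	 * so no intermediate tree ever references the old name */
	void amdgpu_device_fini(struct amdgpu_device *adev)
	{
		/* ... */
		amdgpu_amdkfd_device_fini_sw(adev); /* was amdgpu_amdkfd_device_fini() */
		/* ... */
	}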

Regards,
  Felix


>  {
>  	if (adev->kfd.dev) {
>  		kgd2kfd_device_exit(adev->kfd.dev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 14f68c028126..f8e10af99c28 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -127,7 +127,7 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>  			const void *ih_ring_entry);
>  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>  void amdgpu_amdkfd_device_init(struct amdgpu_device *adev);
> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev);
> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev);
>  int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>  				uint32_t vmid, uint64_t gpu_addr,
>  				uint32_t *ib_cmd, uint32_t ib_len);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 357b9bf62a1c..ab6d2a43c9a3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -858,10 +858,11 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>  	return kfd->init_complete;
>  }
>  
> +
> +
>  void kgd2kfd_device_exit(struct kfd_dev *kfd)
>  {
>  	if (kfd->init_complete) {
> -		kgd2kfd_suspend(kfd, false);
>  		device_queue_manager_uninit(kfd->dqm);
>  		kfd_interrupt_exit(kfd);
>  		kfd_topology_remove_device(kfd);

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 03/16] drm/amdkfd: Split kfd suspend from device exit
  2021-05-12 20:33   ` Felix Kuehling
@ 2021-05-12 20:38     ` Andrey Grodzovsky
  2021-05-20  3:20     ` [PATCH] drm/amdgpu: Add early fini callback Andrey Grodzovsky
  1 sibling, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-12 20:38 UTC (permalink / raw)
  To: Felix Kuehling, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas



On 2021-05-12 4:33 p.m., Felix Kuehling wrote:
> Am 2021-05-12 um 10:26 a.m. schrieb Andrey Grodzovsky:
>> Helps to expedite HW related stuff to amdgpu_pci_remove
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 2 +-
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 3 ++-
>>   3 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 5f6696a3c778..2b06dee9a0ce 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -170,7 +170,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>>   	}
>>   }
>>   
>> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
>> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev)
> 
> You're renaming this function, but I don't see you fixing up any of the
> callers. Looks like you do that in the next patch. So this patch breaks
> the build, the next one fixes it. Maybe you need to refactor this or
> just squash the two patches.

Will do.

Andrey

> 
> Regards,
>    Felix
> 
> 
>>   {
>>   	if (adev->kfd.dev) {
>>   		kgd2kfd_device_exit(adev->kfd.dev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 14f68c028126..f8e10af99c28 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -127,7 +127,7 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>   			const void *ih_ring_entry);
>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>   void amdgpu_amdkfd_device_init(struct amdgpu_device *adev);
>> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev);
>> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev);
>>   int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>>   				uint32_t vmid, uint64_t gpu_addr,
>>   				uint32_t *ib_cmd, uint32_t ib_len);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 357b9bf62a1c..ab6d2a43c9a3 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -858,10 +858,11 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>   	return kfd->init_complete;
>>   }
>>   
>> +
>> +
>>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>   {
>>   	if (kfd->init_complete) {
>> -		kgd2kfd_suspend(kfd, false);
>>   		device_queue_manager_uninit(kfd->dqm);
>>   		kfd_interrupt_exit(kfd);
>>   		kfd_topology_remove_device(kfd);

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal
  2021-05-12 20:30     ` Andrey Grodzovsky
@ 2021-05-12 20:50       ` Alex Deucher
  2021-05-13 14:47         ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Deucher @ 2021-05-12 20:50 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander

On Wed, May 12, 2021 at 4:30 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
>
> On 2021-05-12 4:17 p.m., Alex Deucher wrote:
> > On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> This should prevent writing to memory or IO ranges possibly
> >> already allocated for other uses after our device is removed.
> >>
> >> v5:
> >> Protect more places wher memcopy_to/form_io takes place
> >
> > where
> >
> >> Protect IB submissions
> >>
> >> v6: Switch to !drm_dev_enter instead of scoping entire code
> >> with brackets.
> >>
> >> v7:
> >> Drop guard of HW ring commands emission protection since they
> >> are in GART and not in MMIO.
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >
> > I think you could split out the psp_copy_fw changes as a separate
> > cleanup patch.  That's a nice cleanup in general.  What about the SMU
> > code (e.g., amd/pm/powerplay and amd/pm/swsmu)?  There are a bunch of
> > shared memory areas we interact with in the driver.
>
> Can you point me to it? Are they VRAM and not GART?
> I searched for all memcpy_to/from_io in our code. Maybe I missed some.
>

Mostly vram.  A quick grep shows:

$ grep -r -I AMDGPU_GEM_DOMAIN drivers/gpu/drm/amd/pm/
drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c:
  PAGE_SIZE, AMDGPU_GEM_DOMAIN_GTT,
drivers/gpu/drm/amd/pm/powerplay/smumgr/smu7_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/smu7_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
 AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/smu8_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/smu8_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
       AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
           AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
       AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
       AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
       AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/smu10_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/powerplay/smumgr/smu10_smumgr.c:
AMDGPU_GEM_DOMAIN_VRAM,
drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
    AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:    driver_table->domain =
AMDGPU_GEM_DOMAIN_VRAM;
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:    memory_pool->domain =
AMDGPU_GEM_DOMAIN_GTT;
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:
dummy_read_1_table->domain = AMDGPU_GEM_DOMAIN_VRAM;

In general, the driver puts shared structures in vram and then either
updates them and asks the SMU to read them, or asks the SMU to write
to them by sending it a message via MMIO.  You'll need to look for
places where those shared buffers are accessed.
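
For example, each of those accesses would want the same drm_dev_enter()
guard the rest of this patch uses. A rough sketch only (the table
fields, flush helper and SMU message below are illustrative, not the
exact SMU API):

	int idx, ret = 0;

	if (!drm_dev_enter(&adev->ddev, &idx))
		return -ENODEV; /* device unplugged, the vram behind the table may be reused */

	/* the shared table lives in a vram BO the SMU also reads */
	memcpy_toio(table->cpu_addr, data, table_size);
	amdgpu_asic_flush_hdp(adev, NULL);

	/* ask the SMU to pick up the update via an MMIO message */
	ret = smu_send_smc_msg(smu, SMU_MSG_TransferTableDram2Smu, NULL);

	drm_dev_exit(idx);
	return ret;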

Alex


> Andrey
>
> >
> > Alex
> >
> >
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++-
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 ++++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 63 ++++++++++++++--------
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  2 +
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 31 +++++++----
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 11 ++--
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c    | 22 +++++---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     |  7 ++-
> >>   drivers/gpu/drm/amd/amdgpu/psp_v11_0.c     | 44 +++++++--------
> >>   drivers/gpu/drm/amd/amdgpu/psp_v12_0.c     |  8 +--
> >>   drivers/gpu/drm/amd/amdgpu/psp_v3_1.c      |  8 +--
> >>   drivers/gpu/drm/amd/amdgpu/vce_v4_0.c      | 26 +++++----
> >>   drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c      | 22 +++++---
> >>   13 files changed, 168 insertions(+), 95 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index a0bff4713672..f7cca25c0fa0 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -71,6 +71,8 @@
> >>   #include <drm/task_barrier.h>
> >>   #include <linux/pm_runtime.h>
> >>
> >> +#include <drm/drm_drv.h>
> >> +
> >>   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
> >>   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
> >>   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
> >> @@ -281,7 +283,10 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >>          unsigned long flags;
> >>          uint32_t hi = ~0;
> >>          uint64_t last;
> >> +       int idx;
> >>
> >> +       if (!drm_dev_enter(&adev->ddev, &idx))
> >> +               return;
> >>
> >>   #ifdef CONFIG_64BIT
> >>          last = min(pos + size, adev->gmc.visible_vram_size);
> >> @@ -300,7 +305,7 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >>                  }
> >>
> >>                  if (count == size)
> >> -                       return;
> >> +                       goto exit;
> >>
> >>                  pos += count;
> >>                  buf += count / 4;
> >> @@ -323,6 +328,9 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >>                          *buf++ = RREG32_NO_KIQ(mmMM_DATA);
> >>          }
> >>          spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);
> >> +
> >> +exit:
> >> +       drm_dev_exit(idx);
> >>   }
> >>
> >>   /*
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >> index 4d32233cde92..04ba5eef1e88 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >> @@ -31,6 +31,8 @@
> >>   #include "amdgpu_ras.h"
> >>   #include "amdgpu_xgmi.h"
> >>
> >> +#include <drm/drm_drv.h>
> >> +
> >>   /**
> >>    * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
> >>    *
> >> @@ -151,6 +153,10 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
> >>   {
> >>          void __iomem *ptr = (void *)cpu_pt_addr;
> >>          uint64_t value;
> >> +       int idx;
> >> +
> >> +       if (!drm_dev_enter(&adev->ddev, &idx))
> >> +               return 0;
> >>
> >>          /*
> >>           * The following is for PTE only. GART does not have PDEs.
> >> @@ -158,6 +164,9 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
> >>          value = addr & 0x0000FFFFFFFFF000ULL;
> >>          value |= flags;
> >>          writeq(value, ptr + (gpu_page_idx * 8));
> >> +
> >> +       drm_dev_exit(idx);
> >> +
> >>          return 0;
> >>   }
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> >> index 9e769cf6095b..bb6afee61666 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> >> @@ -25,6 +25,7 @@
> >>
> >>   #include <linux/firmware.h>
> >>   #include <linux/dma-mapping.h>
> >> +#include <drm/drm_drv.h>
> >>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_psp.h"
> >> @@ -39,6 +40,8 @@
> >>   #include "amdgpu_ras.h"
> >>   #include "amdgpu_securedisplay.h"
> >>
> >> +#include <drm/drm_drv.h>
> >> +
> >>   static int psp_sysfs_init(struct amdgpu_device *adev);
> >>   static void psp_sysfs_fini(struct amdgpu_device *adev);
> >>
> >> @@ -253,7 +256,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>                     struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr)
> >>   {
> >>          int ret;
> >> -       int index;
> >> +       int index, idx;
> >>          int timeout = 20000;
> >>          bool ras_intr = false;
> >>          bool skip_unsupport = false;
> >> @@ -261,6 +264,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>          if (psp->adev->in_pci_err_recovery)
> >>                  return 0;
> >>
> >> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
> >> +               return 0;
> >> +
> >>          mutex_lock(&psp->mutex);
> >>
> >>          memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> >> @@ -271,8 +277,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>          ret = psp_ring_cmd_submit(psp, psp->cmd_buf_mc_addr, fence_mc_addr, index);
> >>          if (ret) {
> >>                  atomic_dec(&psp->fence_value);
> >> -               mutex_unlock(&psp->mutex);
> >> -               return ret;
> >> +               goto exit;
> >>          }
> >>
> >>          amdgpu_asic_invalidate_hdp(psp->adev, NULL);
> >> @@ -312,8 +317,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>                           psp->cmd_buf_mem->cmd_id,
> >>                           psp->cmd_buf_mem->resp.status);
> >>                  if (!timeout) {
> >> -                       mutex_unlock(&psp->mutex);
> >> -                       return -EINVAL;
> >> +                       ret = -EINVAL;
> >> +                       goto exit;
> >>                  }
> >>          }
> >>
> >> @@ -321,8 +326,10 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>                  ucode->tmr_mc_addr_lo = psp->cmd_buf_mem->resp.fw_addr_lo;
> >>                  ucode->tmr_mc_addr_hi = psp->cmd_buf_mem->resp.fw_addr_hi;
> >>          }
> >> -       mutex_unlock(&psp->mutex);
> >>
> >> +exit:
> >> +       mutex_unlock(&psp->mutex);
> >> +       drm_dev_exit(idx);
> >>          return ret;
> >>   }
> >>
> >> @@ -359,8 +366,7 @@ static int psp_load_toc(struct psp_context *psp,
> >>          if (!cmd)
> >>                  return -ENOMEM;
> >>          /* Copy toc to psp firmware private buffer */
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -       memcpy(psp->fw_pri_buf, psp->toc_start_addr, psp->toc_bin_size);
> >> +       psp_copy_fw(psp, psp->toc_start_addr, psp->toc_bin_size);
> >>
> >>          psp_prep_load_toc_cmd_buf(cmd, psp->fw_pri_mc_addr, psp->toc_bin_size);
> >>
> >> @@ -625,8 +631,7 @@ static int psp_asd_load(struct psp_context *psp)
> >>          if (!cmd)
> >>                  return -ENOMEM;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -       memcpy(psp->fw_pri_buf, psp->asd_start_addr, psp->asd_ucode_size);
> >> +       psp_copy_fw(psp, psp->asd_start_addr, psp->asd_ucode_size);
> >>
> >>          psp_prep_asd_load_cmd_buf(cmd, psp->fw_pri_mc_addr,
> >>                                    psp->asd_ucode_size);
> >> @@ -781,8 +786,7 @@ static int psp_xgmi_load(struct psp_context *psp)
> >>          if (!cmd)
> >>                  return -ENOMEM;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -       memcpy(psp->fw_pri_buf, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
> >> +       psp_copy_fw(psp, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
> >>
> >>          psp_prep_ta_load_cmd_buf(cmd,
> >>                                   psp->fw_pri_mc_addr,
> >> @@ -1038,8 +1042,7 @@ static int psp_ras_load(struct psp_context *psp)
> >>          if (!cmd)
> >>                  return -ENOMEM;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -       memcpy(psp->fw_pri_buf, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
> >> +       psp_copy_fw(psp, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
> >>
> >>          psp_prep_ta_load_cmd_buf(cmd,
> >>                                   psp->fw_pri_mc_addr,
> >> @@ -1275,8 +1278,7 @@ static int psp_hdcp_load(struct psp_context *psp)
> >>          if (!cmd)
> >>                  return -ENOMEM;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -       memcpy(psp->fw_pri_buf, psp->ta_hdcp_start_addr,
> >> +       psp_copy_fw(psp, psp->ta_hdcp_start_addr,
> >>                 psp->ta_hdcp_ucode_size);
> >>
> >>          psp_prep_ta_load_cmd_buf(cmd,
> >> @@ -1427,8 +1429,7 @@ static int psp_dtm_load(struct psp_context *psp)
> >>          if (!cmd)
> >>                  return -ENOMEM;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -       memcpy(psp->fw_pri_buf, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
> >> +       psp_copy_fw(psp, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
> >>
> >>          psp_prep_ta_load_cmd_buf(cmd,
> >>                                   psp->fw_pri_mc_addr,
> >> @@ -1573,8 +1574,7 @@ static int psp_rap_load(struct psp_context *psp)
> >>          if (!cmd)
> >>                  return -ENOMEM;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -       memcpy(psp->fw_pri_buf, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
> >> +       psp_copy_fw(psp, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
> >>
> >>          psp_prep_ta_load_cmd_buf(cmd,
> >>                                   psp->fw_pri_mc_addr,
> >> @@ -3022,7 +3022,7 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
> >>          struct amdgpu_device *adev = drm_to_adev(ddev);
> >>          void *cpu_addr;
> >>          dma_addr_t dma_addr;
> >> -       int ret;
> >> +       int ret, idx;
> >>          char fw_name[100];
> >>          const struct firmware *usbc_pd_fw;
> >>
> >> @@ -3031,6 +3031,9 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
> >>                  return -EBUSY;
> >>          }
> >>
> >> +       if (!drm_dev_enter(ddev, &idx))
> >> +               return -ENODEV;
> >> +
> >>          snprintf(fw_name, sizeof(fw_name), "amdgpu/%s", buf);
> >>          ret = request_firmware(&usbc_pd_fw, fw_name, adev->dev);
> >>          if (ret)
> >> @@ -3062,16 +3065,30 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
> >>   rel_buf:
> >>          dma_free_coherent(adev->dev, usbc_pd_fw->size, cpu_addr, dma_addr);
> >>          release_firmware(usbc_pd_fw);
> >> -
> >>   fail:
> >>          if (ret) {
> >>                  DRM_ERROR("Failed to load USBC PD FW, err = %d", ret);
> >> -               return ret;
> >> +               count = ret;
> >>          }
> >>
> >> +       drm_dev_exit(idx);
> >>          return count;
> >>   }
> >>
> >> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size)
> >> +{
> >> +       int idx;
> >> +
> >> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
> >> +               return;
> >> +
> >> +       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> +       memcpy(psp->fw_pri_buf, start_addr, bin_size);
> >> +
> >> +       drm_dev_exit(idx);
> >> +}
> >> +
> >> +
> >>   static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
> >>                     psp_usbc_pd_fw_sysfs_read,
> >>                     psp_usbc_pd_fw_sysfs_write);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> >> index 46a5328e00e0..2bfdc278817f 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> >> @@ -423,4 +423,6 @@ int psp_get_fw_attestation_records_addr(struct psp_context *psp,
> >>
> >>   int psp_load_fw_list(struct psp_context *psp,
> >>                       struct amdgpu_firmware_info **ucode_list, int ucode_count);
> >> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size);
> >> +
> >>   #endif
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> >> index c6dbc0801604..82f0542c7792 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> >> @@ -32,6 +32,7 @@
> >>   #include <linux/module.h>
> >>
> >>   #include <drm/drm.h>
> >> +#include <drm/drm_drv.h>
> >>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_pm.h"
> >> @@ -375,7 +376,7 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
> >>   {
> >>          unsigned size;
> >>          void *ptr;
> >> -       int i, j;
> >> +       int i, j, idx;
> >>          bool in_ras_intr = amdgpu_ras_intr_triggered();
> >>
> >>          cancel_delayed_work_sync(&adev->uvd.idle_work);
> >> @@ -403,11 +404,15 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
> >>                  if (!adev->uvd.inst[j].saved_bo)
> >>                          return -ENOMEM;
> >>
> >> -               /* re-write 0 since err_event_athub will corrupt VCPU buffer */
> >> -               if (in_ras_intr)
> >> -                       memset(adev->uvd.inst[j].saved_bo, 0, size);
> >> -               else
> >> -                       memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
> >> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                       /* re-write 0 since err_event_athub will corrupt VCPU buffer */
> >> +                       if (in_ras_intr)
> >> +                               memset(adev->uvd.inst[j].saved_bo, 0, size);
> >> +                       else
> >> +                               memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
> >> +
> >> +                       drm_dev_exit(idx);
> >> +               }
> >>          }
> >>
> >>          if (in_ras_intr)
> >> @@ -420,7 +425,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
> >>   {
> >>          unsigned size;
> >>          void *ptr;
> >> -       int i;
> >> +       int i, idx;
> >>
> >>          for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
> >>                  if (adev->uvd.harvest_config & (1 << i))
> >> @@ -432,7 +437,10 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
> >>                  ptr = adev->uvd.inst[i].cpu_addr;
> >>
> >>                  if (adev->uvd.inst[i].saved_bo != NULL) {
> >> -                       memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
> >> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                               memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
> >> +                               drm_dev_exit(idx);
> >> +                       }
> >>                          kvfree(adev->uvd.inst[i].saved_bo);
> >>                          adev->uvd.inst[i].saved_bo = NULL;
> >>                  } else {
> >> @@ -442,8 +450,11 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
> >>                          hdr = (const struct common_firmware_header *)adev->uvd.fw->data;
> >>                          if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
> >>                                  offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> >> -                               memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
> >> -                                           le32_to_cpu(hdr->ucode_size_bytes));
> >> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                                       memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
> >> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
> >> +                                       drm_dev_exit(idx);
> >> +                               }
> >>                                  size -= le32_to_cpu(hdr->ucode_size_bytes);
> >>                                  ptr += le32_to_cpu(hdr->ucode_size_bytes);
> >>                          }
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> >> index ea6a62f67e38..833203401ef4 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> >> @@ -29,6 +29,7 @@
> >>   #include <linux/module.h>
> >>
> >>   #include <drm/drm.h>
> >> +#include <drm/drm_drv.h>
> >>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_pm.h"
> >> @@ -293,7 +294,7 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
> >>          void *cpu_addr;
> >>          const struct common_firmware_header *hdr;
> >>          unsigned offset;
> >> -       int r;
> >> +       int r, idx;
> >>
> >>          if (adev->vce.vcpu_bo == NULL)
> >>                  return -EINVAL;
> >> @@ -313,8 +314,12 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
> >>
> >>          hdr = (const struct common_firmware_header *)adev->vce.fw->data;
> >>          offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> >> -       memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
> >> -                   adev->vce.fw->size - offset);
> >> +
> >> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +               memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
> >> +                           adev->vce.fw->size - offset);
> >> +               drm_dev_exit(idx);
> >> +       }
> >>
> >>          amdgpu_bo_kunmap(adev->vce.vcpu_bo);
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >> index 201645963ba5..21f7d3644d70 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >> @@ -27,6 +27,7 @@
> >>   #include <linux/firmware.h>
> >>   #include <linux/module.h>
> >>   #include <linux/pci.h>
> >> +#include <drm/drm_drv.h>
> >>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_pm.h"
> >> @@ -275,7 +276,7 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
> >>   {
> >>          unsigned size;
> >>          void *ptr;
> >> -       int i;
> >> +       int i, idx;
> >>
> >>          cancel_delayed_work_sync(&adev->vcn.idle_work);
> >>
> >> @@ -292,7 +293,10 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
> >>                  if (!adev->vcn.inst[i].saved_bo)
> >>                          return -ENOMEM;
> >>
> >> -               memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
> >> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                       memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
> >> +                       drm_dev_exit(idx);
> >> +               }
> >>          }
> >>          return 0;
> >>   }
> >> @@ -301,7 +305,7 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
> >>   {
> >>          unsigned size;
> >>          void *ptr;
> >> -       int i;
> >> +       int i, idx;
> >>
> >>          for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> >>                  if (adev->vcn.harvest_config & (1 << i))
> >> @@ -313,7 +317,10 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
> >>                  ptr = adev->vcn.inst[i].cpu_addr;
> >>
> >>                  if (adev->vcn.inst[i].saved_bo != NULL) {
> >> -                       memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
> >> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                               memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
> >> +                               drm_dev_exit(idx);
> >> +                       }
> >>                          kvfree(adev->vcn.inst[i].saved_bo);
> >>                          adev->vcn.inst[i].saved_bo = NULL;
> >>                  } else {
> >> @@ -323,8 +330,11 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
> >>                          hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
> >>                          if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
> >>                                  offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> >> -                               memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
> >> -                                           le32_to_cpu(hdr->ucode_size_bytes));
> >> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                                       memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
> >> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
> >> +                                       drm_dev_exit(idx);
> >> +                               }
> >>                                  size -= le32_to_cpu(hdr->ucode_size_bytes);
> >>                                  ptr += le32_to_cpu(hdr->ucode_size_bytes);
> >>                          }
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >> index 9f868cf3b832..7dd5f10ab570 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >> @@ -32,6 +32,7 @@
> >>   #include <linux/dma-buf.h>
> >>
> >>   #include <drm/amdgpu_drm.h>
> >> +#include <drm/drm_drv.h>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_trace.h"
> >>   #include "amdgpu_amdkfd.h"
> >> @@ -1606,7 +1607,10 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> >>          struct amdgpu_vm_update_params params;
> >>          enum amdgpu_sync_mode sync_mode;
> >>          uint64_t pfn;
> >> -       int r;
> >> +       int r, idx;
> >> +
> >> +       if (!drm_dev_enter(&adev->ddev, &idx))
> >> +               return -ENODEV;
> >>
> >>          memset(&params, 0, sizeof(params));
> >>          params.adev = adev;
> >> @@ -1715,6 +1719,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> >>
> >>   error_unlock:
> >>          amdgpu_vm_eviction_unlock(vm);
> >> +       drm_dev_exit(idx);
> >>          return r;
> >>   }
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> >> index 589410c32d09..2cec71e823f5 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> >> @@ -23,6 +23,7 @@
> >>   #include <linux/firmware.h>
> >>   #include <linux/module.h>
> >>   #include <linux/vmalloc.h>
> >> +#include <drm/drm_drv.h>
> >>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_psp.h"
> >> @@ -269,10 +270,8 @@ static int psp_v11_0_bootloader_load_kdb(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy PSP KDB binary to memory */
> >> -       memcpy(psp->fw_pri_buf, psp->kdb_start_addr, psp->kdb_bin_size);
> >> +       psp_copy_fw(psp, psp->kdb_start_addr, psp->kdb_bin_size);
> >>
> >>          /* Provide the PSP KDB to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> @@ -302,10 +301,8 @@ static int psp_v11_0_bootloader_load_spl(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy PSP SPL binary to memory */
> >> -       memcpy(psp->fw_pri_buf, psp->spl_start_addr, psp->spl_bin_size);
> >> +       psp_copy_fw(psp, psp->spl_start_addr, psp->spl_bin_size);
> >>
> >>          /* Provide the PSP SPL to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> @@ -335,10 +332,8 @@ static int psp_v11_0_bootloader_load_sysdrv(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy PSP System Driver binary to memory */
> >> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> >> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
> >>
> >>          /* Provide the sys driver to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> @@ -371,10 +366,8 @@ static int psp_v11_0_bootloader_load_sos(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy Secure OS binary to PSP memory */
> >> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> >> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
> >>
> >>          /* Provide the PSP secure OS to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> @@ -608,7 +601,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
> >>          uint32_t p2c_header[4];
> >>          uint32_t sz;
> >>          void *buf;
> >> -       int ret;
> >> +       int ret, idx;
> >>
> >>          if (ctx->init == PSP_MEM_TRAIN_NOT_SUPPORT) {
> >>                  DRM_DEBUG("Memory training is not supported.\n");
> >> @@ -681,17 +674,24 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
> >>                          return -ENOMEM;
> >>                  }
> >>
> >> -               memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
> >> -               ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
> >> -               if (ret) {
> >> -                       DRM_ERROR("Send long training msg failed.\n");
> >> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                       memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
> >> +                       ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
> >> +                       if (ret) {
> >> +                               DRM_ERROR("Send long training msg failed.\n");
> >> +                               vfree(buf);
> >> +                               drm_dev_exit(idx);
> >> +                               return ret;
> >> +                       }
> >> +
> >> +                       memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
> >> +                       adev->hdp.funcs->flush_hdp(adev, NULL);
> >>                          vfree(buf);
> >> -                       return ret;
> >> +                       drm_dev_exit(idx);
> >> +               } else {
> >> +                       vfree(buf);
> >> +                       return -ENODEV;
> >>                  }
> >> -
> >> -               memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
> >> -               adev->hdp.funcs->flush_hdp(adev, NULL);
> >> -               vfree(buf);
> >>          }
> >>
> >>          if (ops & PSP_MEM_TRAIN_SAVE) {
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> >> index c4828bd3264b..618e5b6b85d9 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> >> @@ -138,10 +138,8 @@ static int psp_v12_0_bootloader_load_sysdrv(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy PSP System Driver binary to memory */
> >> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> >> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
> >>
> >>          /* Provide the sys driver to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> @@ -179,10 +177,8 @@ static int psp_v12_0_bootloader_load_sos(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy Secure OS binary to PSP memory */
> >> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> >> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
> >>
> >>          /* Provide the PSP secure OS to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> >> index f2e725f72d2f..d0a6cccd0897 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> >> @@ -102,10 +102,8 @@ static int psp_v3_1_bootloader_load_sysdrv(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy PSP System Driver binary to memory */
> >> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> >> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
> >>
> >>          /* Provide the sys driver to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> @@ -143,10 +141,8 @@ static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
> >>          if (ret)
> >>                  return ret;
> >>
> >> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >> -
> >>          /* Copy Secure OS binary to PSP memory */
> >> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> >> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
> >>
> >>          /* Provide the PSP secure OS to bootloader */
> >>          WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> >> index 8e238dea7bef..90910d19db12 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> >> @@ -25,6 +25,7 @@
> >>    */
> >>
> >>   #include <linux/firmware.h>
> >> +#include <drm/drm_drv.h>
> >>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_vce.h"
> >> @@ -555,16 +556,19 @@ static int vce_v4_0_hw_fini(void *handle)
> >>   static int vce_v4_0_suspend(void *handle)
> >>   {
> >>          struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >> -       int r;
> >> +       int r, idx;
> >>
> >>          if (adev->vce.vcpu_bo == NULL)
> >>                  return 0;
> >>
> >> -       if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> >> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >> -               void *ptr = adev->vce.cpu_addr;
> >> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +               if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> >> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >> +                       void *ptr = adev->vce.cpu_addr;
> >>
> >> -               memcpy_fromio(adev->vce.saved_bo, ptr, size);
> >> +                       memcpy_fromio(adev->vce.saved_bo, ptr, size);
> >> +               }
> >> +               drm_dev_exit(idx);
> >>          }
> >>
> >>          r = vce_v4_0_hw_fini(adev);
> >> @@ -577,16 +581,20 @@ static int vce_v4_0_suspend(void *handle)
> >>   static int vce_v4_0_resume(void *handle)
> >>   {
> >>          struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >> -       int r;
> >> +       int r, idx;
> >>
> >>          if (adev->vce.vcpu_bo == NULL)
> >>                  return -EINVAL;
> >>
> >>          if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> >> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >> -               void *ptr = adev->vce.cpu_addr;
> >>
> >> -               memcpy_toio(ptr, adev->vce.saved_bo, size);
> >> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >> +                       void *ptr = adev->vce.cpu_addr;
> >> +
> >> +                       memcpy_toio(ptr, adev->vce.saved_bo, size);
> >> +                       drm_dev_exit(idx);
> >> +               }
> >>          } else {
> >>                  r = amdgpu_vce_resume(adev);
> >>                  if (r)
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> >> index 3f15bf34123a..df34be8ec82d 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> >> @@ -34,6 +34,8 @@
> >>   #include "vcn/vcn_3_0_0_sh_mask.h"
> >>   #include "ivsrcid/vcn/irqsrcs_vcn_2_0.h"
> >>
> >> +#include <drm/drm_drv.h>
> >> +
> >>   #define mmUVD_CONTEXT_ID_INTERNAL_OFFSET                       0x27
> >>   #define mmUVD_GPCOM_VCPU_CMD_INTERNAL_OFFSET                   0x0f
> >>   #define mmUVD_GPCOM_VCPU_DATA0_INTERNAL_OFFSET                 0x10
> >> @@ -268,16 +270,20 @@ static int vcn_v3_0_sw_init(void *handle)
> >>   static int vcn_v3_0_sw_fini(void *handle)
> >>   {
> >>          struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >> -       int i, r;
> >> +       int i, r, idx;
> >>
> >> -       for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
> >> -               volatile struct amdgpu_fw_shared *fw_shared;
> >> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> >> +               for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
> >> +                       volatile struct amdgpu_fw_shared *fw_shared;
> >>
> >> -               if (adev->vcn.harvest_config & (1 << i))
> >> -                       continue;
> >> -               fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
> >> -               fw_shared->present_flag_0 = 0;
> >> -               fw_shared->sw_ring.is_enabled = false;
> >> +                       if (adev->vcn.harvest_config & (1 << i))
> >> +                               continue;
> >> +                       fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
> >> +                       fw_shared->present_flag_0 = 0;
> >> +                       fw_shared->sw_ring.is_enabled = false;
> >> +               }
> >> +
> >> +               drm_dev_exit(idx);
> >>          }
> >>
> >>          if (amdgpu_sriov_vf(adev))
> >> --
> >> 2.25.1
> >>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal
  2021-05-12 20:50       ` Alex Deucher
@ 2021-05-13 14:47         ` Andrey Grodzovsky
  2021-05-13 14:54           ` Alex Deucher
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-13 14:47 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander



On 2021-05-12 4:50 p.m., Alex Deucher wrote:
> On Wed, May 12, 2021 at 4:30 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>>
>>
>> On 2021-05-12 4:17 p.m., Alex Deucher wrote:
>>> On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>> This should prevent writing to memory or IO ranges possibly
>>>> already allocated for other uses after our device is removed.
>>>>
>>>> v5:
>>>> Protect more places wher memcopy_to/form_io takes place
>>>
>>> where
>>>
>>>> Protect IB submissions
>>>>
>>>> v6: Switch to !drm_dev_enter instead of scoping entire code
>>>> with brackets.
>>>>
>>>> v7:
>>>> Drop guard of HW ring commands emission protection since they
>>>> are in GART and not in MMIO.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>
>>> I think you could split out the psp_copy_fw changes as a separate
>>> cleanup patch.  That's a nice cleanup in general.  What about the SMU
>>> code (e.g., amd/pm/powerplay and amd/pm/swsmu)?  There are a bunch of
>>> shared memory areas we interact with in the driver.
>>
>> Can you point me to it? Are they VRAM and not GART?
>> I searched for all memcpy_to/from_io in our code. Maybe I missed some.
>>
> 
> Mostly vram.  A quick grep shows:
> 
> $ grep -r -I AMDGPU_GEM_DOMAIN drivers/gpu/drm/amd/pm/
> drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c:
>    PAGE_SIZE, AMDGPU_GEM_DOMAIN_GTT,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/smu7_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/smu7_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
>   AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/smu8_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/smu8_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
>         AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
>             AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
>         AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
>         AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
>         AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/smu10_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/powerplay/smumgr/smu10_smumgr.c:
> AMDGPU_GEM_DOMAIN_VRAM,
> drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
> AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
> AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
> AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
>      AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:    driver_table->domain =
> AMDGPU_GEM_DOMAIN_VRAM;
> drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:    memory_pool->domain =
> AMDGPU_GEM_DOMAIN_GTT;
> drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:
> dummy_read_1_table->domain = AMDGPU_GEM_DOMAIN_VRAM;
> 
> In general, the driver puts shared structures in vram and then either
> updates them and asks the SMU to read them, or requests that the SMU
> write to them by writing a message to the SMU via MMIO.  You'll need
> to look for places where those shared buffers are accessed.
> 
> Alex
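
To make it concrete, guarding one of those shared-table accesses with
the pattern from this patch would look roughly like this (a minimal
sketch only - the helper and field names here are made up for
illustration, the real SMU helpers live under drivers/gpu/drm/amd/pm/):

static int smu_update_shared_table(struct smu_context *smu)
{
	struct amdgpu_device *adev = smu->adev;
	int idx;

	if (!drm_dev_enter(&adev->ddev, &idx))
		return -ENODEV;

	/* the shared structure lives in a VRAM BO that is CPU-visible
	 * through the aperture, so this write is MMIO */
	memcpy_toio(smu->shared_table_cpu_addr, &smu->shared_table,
		    sizeof(smu->shared_table));

	/* then message the SMU over MMIO to consume the table */
	smu_send_smc_msg(smu, SMU_MSG_TransferTableDram2Smu, NULL);

	drm_dev_exit(idx);
	return 0;
}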

There are indeed multiple memcpy calls in the pm folder and other places.
The thing is, I never hit them during testing. I invalidate all MMIO
ranges (VRAM and registers) at the end of amdgpu_pci_remove, so if any
of them were hit post device unplug I would see a page fault oops.
The fact that they are not hit means one of three things: either they
are accessed before the pci remove code ends, in which case they are
fine; they are sysfs accessors, as you pointed out to me earlier, which
also cannot be accessed post pci remove since all device related sysfs
is gone; or they are real places that I didn't catch in my tests.
I feel like guarding all of them with drm_dev_enter/exit will really
clutter the code, and that the right approach is the one we are
eventually working toward: preventing any HW accessing code from
running post amdgpu_pci_remove, coupled with force retiring fences.
The invalidation of all MMIO mappings that I added as the last patch in
this series also guarantees that any new unguarded access that actually
happens post device unplug will be quickly caught and identified by a
page fault and can be specifically addressed.
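
The idea of that invalidation, very roughly (a simplified sketch for
illustration only - not the actual patch from this series):

static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
{
	/* Unmap the register BAR so that any late, unguarded
	 * RREG32/WREG32 faults immediately instead of silently
	 * writing to memory that may have been reallocated. */
	iounmap(adev->rmmio);
	adev->rmmio = NULL;

	/* Same for the CPU-visible VRAM aperture used by the
	 * memcpy_to/from_io paths touched in this patch. */
	if (adev->mman.aper_base_kaddr)
		iounmap(adev->mman.aper_base_kaddr);
	adev->mman.aper_base_kaddr = NULL;
}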

Andrey

> 
> 
>> Andrey
>>
>>>
>>> Alex
>>>
>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 ++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 63 ++++++++++++++--------
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  2 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 31 +++++++----
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 11 ++--
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c    | 22 +++++---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     |  7 ++-
>>>>    drivers/gpu/drm/amd/amdgpu/psp_v11_0.c     | 44 +++++++--------
>>>>    drivers/gpu/drm/amd/amdgpu/psp_v12_0.c     |  8 +--
>>>>    drivers/gpu/drm/amd/amdgpu/psp_v3_1.c      |  8 +--
>>>>    drivers/gpu/drm/amd/amdgpu/vce_v4_0.c      | 26 +++++----
>>>>    drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c      | 22 +++++---
>>>>    13 files changed, 168 insertions(+), 95 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index a0bff4713672..f7cca25c0fa0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -71,6 +71,8 @@
>>>>    #include <drm/task_barrier.h>
>>>>    #include <linux/pm_runtime.h>
>>>>
>>>> +#include <drm/drm_drv.h>
>>>> +
>>>>    MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
>>>>    MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
>>>>    MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
>>>> @@ -281,7 +283,10 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>>>>           unsigned long flags;
>>>>           uint32_t hi = ~0;
>>>>           uint64_t last;
>>>> +       int idx;
>>>>
>>>> +       if (!drm_dev_enter(&adev->ddev, &idx))
>>>> +               return;
>>>>
>>>>    #ifdef CONFIG_64BIT
>>>>           last = min(pos + size, adev->gmc.visible_vram_size);
>>>> @@ -300,7 +305,7 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>>>>                   }
>>>>
>>>>                   if (count == size)
>>>> -                       return;
>>>> +                       goto exit;
>>>>
>>>>                   pos += count;
>>>>                   buf += count / 4;
>>>> @@ -323,6 +328,9 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
>>>>                           *buf++ = RREG32_NO_KIQ(mmMM_DATA);
>>>>           }
>>>>           spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);
>>>> +
>>>> +exit:
>>>> +       drm_dev_exit(idx);
>>>>    }
>>>>
>>>>    /*
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> index 4d32233cde92..04ba5eef1e88 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> @@ -31,6 +31,8 @@
>>>>    #include "amdgpu_ras.h"
>>>>    #include "amdgpu_xgmi.h"
>>>>
>>>> +#include <drm/drm_drv.h>
>>>> +
>>>>    /**
>>>>     * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
>>>>     *
>>>> @@ -151,6 +153,10 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
>>>>    {
>>>>           void __iomem *ptr = (void *)cpu_pt_addr;
>>>>           uint64_t value;
>>>> +       int idx;
>>>> +
>>>> +       if (!drm_dev_enter(&adev->ddev, &idx))
>>>> +               return 0;
>>>>
>>>>           /*
>>>>            * The following is for PTE only. GART does not have PDEs.
>>>> @@ -158,6 +164,9 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
>>>>           value = addr & 0x0000FFFFFFFFF000ULL;
>>>>           value |= flags;
>>>>           writeq(value, ptr + (gpu_page_idx * 8));
>>>> +
>>>> +       drm_dev_exit(idx);
>>>> +
>>>>           return 0;
>>>>    }
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>>> index 9e769cf6095b..bb6afee61666 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>>> @@ -25,6 +25,7 @@
>>>>
>>>>    #include <linux/firmware.h>
>>>>    #include <linux/dma-mapping.h>
>>>> +#include <drm/drm_drv.h>
>>>>
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_psp.h"
>>>> @@ -39,6 +40,8 @@
>>>>    #include "amdgpu_ras.h"
>>>>    #include "amdgpu_securedisplay.h"
>>>>
>>>> +#include <drm/drm_drv.h>
>>>> +
>>>>    static int psp_sysfs_init(struct amdgpu_device *adev);
>>>>    static void psp_sysfs_fini(struct amdgpu_device *adev);
>>>>
>>>> @@ -253,7 +256,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>>>                      struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr)
>>>>    {
>>>>           int ret;
>>>> -       int index;
>>>> +       int index, idx;
>>>>           int timeout = 20000;
>>>>           bool ras_intr = false;
>>>>           bool skip_unsupport = false;
>>>> @@ -261,6 +264,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>>>           if (psp->adev->in_pci_err_recovery)
>>>>                   return 0;
>>>>
>>>> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
>>>> +               return 0;
>>>> +
>>>>           mutex_lock(&psp->mutex);
>>>>
>>>>           memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
>>>> @@ -271,8 +277,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>>>           ret = psp_ring_cmd_submit(psp, psp->cmd_buf_mc_addr, fence_mc_addr, index);
>>>>           if (ret) {
>>>>                   atomic_dec(&psp->fence_value);
>>>> -               mutex_unlock(&psp->mutex);
>>>> -               return ret;
>>>> +               goto exit;
>>>>           }
>>>>
>>>>           amdgpu_asic_invalidate_hdp(psp->adev, NULL);
>>>> @@ -312,8 +317,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>>>                            psp->cmd_buf_mem->cmd_id,
>>>>                            psp->cmd_buf_mem->resp.status);
>>>>                   if (!timeout) {
>>>> -                       mutex_unlock(&psp->mutex);
>>>> -                       return -EINVAL;
>>>> +                       ret = -EINVAL;
>>>> +                       goto exit;
>>>>                   }
>>>>           }
>>>>
>>>> @@ -321,8 +326,10 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>>>                   ucode->tmr_mc_addr_lo = psp->cmd_buf_mem->resp.fw_addr_lo;
>>>>                   ucode->tmr_mc_addr_hi = psp->cmd_buf_mem->resp.fw_addr_hi;
>>>>           }
>>>> -       mutex_unlock(&psp->mutex);
>>>>
>>>> +exit:
>>>> +       mutex_unlock(&psp->mutex);
>>>> +       drm_dev_exit(idx);
>>>>           return ret;
>>>>    }
>>>>
>>>> @@ -359,8 +366,7 @@ static int psp_load_toc(struct psp_context *psp,
>>>>           if (!cmd)
>>>>                   return -ENOMEM;
>>>>           /* Copy toc to psp firmware private buffer */
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -       memcpy(psp->fw_pri_buf, psp->toc_start_addr, psp->toc_bin_size);
>>>> +       psp_copy_fw(psp, psp->toc_start_addr, psp->toc_bin_size);
>>>>
>>>>           psp_prep_load_toc_cmd_buf(cmd, psp->fw_pri_mc_addr, psp->toc_bin_size);
>>>>
>>>> @@ -625,8 +631,7 @@ static int psp_asd_load(struct psp_context *psp)
>>>>           if (!cmd)
>>>>                   return -ENOMEM;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -       memcpy(psp->fw_pri_buf, psp->asd_start_addr, psp->asd_ucode_size);
>>>> +       psp_copy_fw(psp, psp->asd_start_addr, psp->asd_ucode_size);
>>>>
>>>>           psp_prep_asd_load_cmd_buf(cmd, psp->fw_pri_mc_addr,
>>>>                                     psp->asd_ucode_size);
>>>> @@ -781,8 +786,7 @@ static int psp_xgmi_load(struct psp_context *psp)
>>>>           if (!cmd)
>>>>                   return -ENOMEM;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -       memcpy(psp->fw_pri_buf, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
>>>> +       psp_copy_fw(psp, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
>>>>
>>>>           psp_prep_ta_load_cmd_buf(cmd,
>>>>                                    psp->fw_pri_mc_addr,
>>>> @@ -1038,8 +1042,7 @@ static int psp_ras_load(struct psp_context *psp)
>>>>           if (!cmd)
>>>>                   return -ENOMEM;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -       memcpy(psp->fw_pri_buf, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
>>>> +       psp_copy_fw(psp, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
>>>>
>>>>           psp_prep_ta_load_cmd_buf(cmd,
>>>>                                    psp->fw_pri_mc_addr,
>>>> @@ -1275,8 +1278,7 @@ static int psp_hdcp_load(struct psp_context *psp)
>>>>           if (!cmd)
>>>>                   return -ENOMEM;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -       memcpy(psp->fw_pri_buf, psp->ta_hdcp_start_addr,
>>>> +       psp_copy_fw(psp, psp->ta_hdcp_start_addr,
>>>>                  psp->ta_hdcp_ucode_size);
>>>>
>>>>           psp_prep_ta_load_cmd_buf(cmd,
>>>> @@ -1427,8 +1429,7 @@ static int psp_dtm_load(struct psp_context *psp)
>>>>           if (!cmd)
>>>>                   return -ENOMEM;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -       memcpy(psp->fw_pri_buf, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
>>>> +       psp_copy_fw(psp, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
>>>>
>>>>           psp_prep_ta_load_cmd_buf(cmd,
>>>>                                    psp->fw_pri_mc_addr,
>>>> @@ -1573,8 +1574,7 @@ static int psp_rap_load(struct psp_context *psp)
>>>>           if (!cmd)
>>>>                   return -ENOMEM;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -       memcpy(psp->fw_pri_buf, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
>>>> +       psp_copy_fw(psp, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
>>>>
>>>>           psp_prep_ta_load_cmd_buf(cmd,
>>>>                                    psp->fw_pri_mc_addr,
>>>> @@ -3022,7 +3022,7 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>>>>           struct amdgpu_device *adev = drm_to_adev(ddev);
>>>>           void *cpu_addr;
>>>>           dma_addr_t dma_addr;
>>>> -       int ret;
>>>> +       int ret, idx;
>>>>           char fw_name[100];
>>>>           const struct firmware *usbc_pd_fw;
>>>>
>>>> @@ -3031,6 +3031,9 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>>>>                   return -EBUSY;
>>>>           }
>>>>
>>>> +       if (!drm_dev_enter(ddev, &idx))
>>>> +               return -ENODEV;
>>>> +
>>>>           snprintf(fw_name, sizeof(fw_name), "amdgpu/%s", buf);
>>>>           ret = request_firmware(&usbc_pd_fw, fw_name, adev->dev);
>>>>           if (ret)
>>>> @@ -3062,16 +3065,30 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
>>>>    rel_buf:
>>>>           dma_free_coherent(adev->dev, usbc_pd_fw->size, cpu_addr, dma_addr);
>>>>           release_firmware(usbc_pd_fw);
>>>> -
>>>>    fail:
>>>>           if (ret) {
>>>>                   DRM_ERROR("Failed to load USBC PD FW, err = %d", ret);
>>>> -               return ret;
>>>> +               count = ret;
>>>>           }
>>>>
>>>> +       drm_dev_exit(idx);
>>>>           return count;
>>>>    }
>>>>
>>>> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size)
>>>> +{
>>>> +       int idx;
>>>> +
>>>> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
>>>> +               return;
>>>> +
>>>> +       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> +       memcpy(psp->fw_pri_buf, start_addr, bin_size);
>>>> +
>>>> +       drm_dev_exit(idx);
>>>> +}
>>>> +
>>>> +
>>>>    static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
>>>>                      psp_usbc_pd_fw_sysfs_read,
>>>>                      psp_usbc_pd_fw_sysfs_write);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
>>>> index 46a5328e00e0..2bfdc278817f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
>>>> @@ -423,4 +423,6 @@ int psp_get_fw_attestation_records_addr(struct psp_context *psp,
>>>>
>>>>    int psp_load_fw_list(struct psp_context *psp,
>>>>                        struct amdgpu_firmware_info **ucode_list, int ucode_count);
>>>> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size);
>>>> +
>>>>    #endif
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>>> index c6dbc0801604..82f0542c7792 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>>> @@ -32,6 +32,7 @@
>>>>    #include <linux/module.h>
>>>>
>>>>    #include <drm/drm.h>
>>>> +#include <drm/drm_drv.h>
>>>>
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_pm.h"
>>>> @@ -375,7 +376,7 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>>>    {
>>>>           unsigned size;
>>>>           void *ptr;
>>>> -       int i, j;
>>>> +       int i, j, idx;
>>>>           bool in_ras_intr = amdgpu_ras_intr_triggered();
>>>>
>>>>           cancel_delayed_work_sync(&adev->uvd.idle_work);
>>>> @@ -403,11 +404,15 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>>>                   if (!adev->uvd.inst[j].saved_bo)
>>>>                           return -ENOMEM;
>>>>
>>>> -               /* re-write 0 since err_event_athub will corrupt VCPU buffer */
>>>> -               if (in_ras_intr)
>>>> -                       memset(adev->uvd.inst[j].saved_bo, 0, size);
>>>> -               else
>>>> -                       memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
>>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                       /* re-write 0 since err_event_athub will corrupt VCPU buffer */
>>>> +                       if (in_ras_intr)
>>>> +                               memset(adev->uvd.inst[j].saved_bo, 0, size);
>>>> +                       else
>>>> +                               memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
>>>> +
>>>> +                       drm_dev_exit(idx);
>>>> +               }
>>>>           }
>>>>
>>>>           if (in_ras_intr)
>>>> @@ -420,7 +425,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>>>>    {
>>>>           unsigned size;
>>>>           void *ptr;
>>>> -       int i;
>>>> +       int i, idx;
>>>>
>>>>           for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
>>>>                   if (adev->uvd.harvest_config & (1 << i))
>>>> @@ -432,7 +437,10 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>>>>                   ptr = adev->uvd.inst[i].cpu_addr;
>>>>
>>>>                   if (adev->uvd.inst[i].saved_bo != NULL) {
>>>> -                       memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
>>>> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                               memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
>>>> +                               drm_dev_exit(idx);
>>>> +                       }
>>>>                           kvfree(adev->uvd.inst[i].saved_bo);
>>>>                           adev->uvd.inst[i].saved_bo = NULL;
>>>>                   } else {
>>>> @@ -442,8 +450,11 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
>>>>                           hdr = (const struct common_firmware_header *)adev->uvd.fw->data;
>>>>                           if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
>>>>                                   offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
>>>> -                               memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
>>>> -                                           le32_to_cpu(hdr->ucode_size_bytes));
>>>> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                                       memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
>>>> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
>>>> +                                       drm_dev_exit(idx);
>>>> +                               }
>>>>                                   size -= le32_to_cpu(hdr->ucode_size_bytes);
>>>>                                   ptr += le32_to_cpu(hdr->ucode_size_bytes);
>>>>                           }
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>>> index ea6a62f67e38..833203401ef4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>>> @@ -29,6 +29,7 @@
>>>>    #include <linux/module.h>
>>>>
>>>>    #include <drm/drm.h>
>>>> +#include <drm/drm_drv.h>
>>>>
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_pm.h"
>>>> @@ -293,7 +294,7 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
>>>>           void *cpu_addr;
>>>>           const struct common_firmware_header *hdr;
>>>>           unsigned offset;
>>>> -       int r;
>>>> +       int r, idx;
>>>>
>>>>           if (adev->vce.vcpu_bo == NULL)
>>>>                   return -EINVAL;
>>>> @@ -313,8 +314,12 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
>>>>
>>>>           hdr = (const struct common_firmware_header *)adev->vce.fw->data;
>>>>           offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
>>>> -       memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
>>>> -                   adev->vce.fw->size - offset);
>>>> +
>>>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +               memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
>>>> +                           adev->vce.fw->size - offset);
>>>> +               drm_dev_exit(idx);
>>>> +       }
>>>>
>>>>           amdgpu_bo_kunmap(adev->vce.vcpu_bo);
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>> index 201645963ba5..21f7d3644d70 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
>>>> @@ -27,6 +27,7 @@
>>>>    #include <linux/firmware.h>
>>>>    #include <linux/module.h>
>>>>    #include <linux/pci.h>
>>>> +#include <drm/drm_drv.h>
>>>>
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_pm.h"
>>>> @@ -275,7 +276,7 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>>>>    {
>>>>           unsigned size;
>>>>           void *ptr;
>>>> -       int i;
>>>> +       int i, idx;
>>>>
>>>>           cancel_delayed_work_sync(&adev->vcn.idle_work);
>>>>
>>>> @@ -292,7 +293,10 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
>>>>                   if (!adev->vcn.inst[i].saved_bo)
>>>>                           return -ENOMEM;
>>>>
>>>> -               memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
>>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                       memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
>>>> +                       drm_dev_exit(idx);
>>>> +               }
>>>>           }
>>>>           return 0;
>>>>    }
>>>> @@ -301,7 +305,7 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>>>>    {
>>>>           unsigned size;
>>>>           void *ptr;
>>>> -       int i;
>>>> +       int i, idx;
>>>>
>>>>           for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
>>>>                   if (adev->vcn.harvest_config & (1 << i))
>>>> @@ -313,7 +317,10 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>>>>                   ptr = adev->vcn.inst[i].cpu_addr;
>>>>
>>>>                   if (adev->vcn.inst[i].saved_bo != NULL) {
>>>> -                       memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
>>>> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                               memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
>>>> +                               drm_dev_exit(idx);
>>>> +                       }
>>>>                           kvfree(adev->vcn.inst[i].saved_bo);
>>>>                           adev->vcn.inst[i].saved_bo = NULL;
>>>>                   } else {
>>>> @@ -323,8 +330,11 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
>>>>                           hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
>>>>                           if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
>>>>                                   offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
>>>> -                               memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
>>>> -                                           le32_to_cpu(hdr->ucode_size_bytes));
>>>> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                                       memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
>>>> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
>>>> +                                       drm_dev_exit(idx);
>>>> +                               }
>>>>                                   size -= le32_to_cpu(hdr->ucode_size_bytes);
>>>>                                   ptr += le32_to_cpu(hdr->ucode_size_bytes);
>>>>                           }
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index 9f868cf3b832..7dd5f10ab570 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -32,6 +32,7 @@
>>>>    #include <linux/dma-buf.h>
>>>>
>>>>    #include <drm/amdgpu_drm.h>
>>>> +#include <drm/drm_drv.h>
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_trace.h"
>>>>    #include "amdgpu_amdkfd.h"
>>>> @@ -1606,7 +1607,10 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>>>>           struct amdgpu_vm_update_params params;
>>>>           enum amdgpu_sync_mode sync_mode;
>>>>           uint64_t pfn;
>>>> -       int r;
>>>> +       int r, idx;
>>>> +
>>>> +       if (!drm_dev_enter(&adev->ddev, &idx))
>>>> +               return -ENODEV;
>>>>
>>>>           memset(&params, 0, sizeof(params));
>>>>           params.adev = adev;
>>>> @@ -1715,6 +1719,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>>>>
>>>>    error_unlock:
>>>>           amdgpu_vm_eviction_unlock(vm);
>>>> +       drm_dev_exit(idx);
>>>>           return r;
>>>>    }
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
>>>> index 589410c32d09..2cec71e823f5 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
>>>> @@ -23,6 +23,7 @@
>>>>    #include <linux/firmware.h>
>>>>    #include <linux/module.h>
>>>>    #include <linux/vmalloc.h>
>>>> +#include <drm/drm_drv.h>
>>>>
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_psp.h"
>>>> @@ -269,10 +270,8 @@ static int psp_v11_0_bootloader_load_kdb(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy PSP KDB binary to memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->kdb_start_addr, psp->kdb_bin_size);
>>>> +       psp_copy_fw(psp, psp->kdb_start_addr, psp->kdb_bin_size);
>>>>
>>>>           /* Provide the PSP KDB to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> @@ -302,10 +301,8 @@ static int psp_v11_0_bootloader_load_spl(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy PSP SPL binary to memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->spl_start_addr, psp->spl_bin_size);
>>>> +       psp_copy_fw(psp, psp->spl_start_addr, psp->spl_bin_size);
>>>>
>>>>           /* Provide the PSP SPL to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> @@ -335,10 +332,8 @@ static int psp_v11_0_bootloader_load_sysdrv(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy PSP System Driver binary to memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
>>>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>>>>
>>>>           /* Provide the sys driver to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> @@ -371,10 +366,8 @@ static int psp_v11_0_bootloader_load_sos(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy Secure OS binary to PSP memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
>>>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>>>>
>>>>           /* Provide the PSP secure OS to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> @@ -608,7 +601,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>>>>           uint32_t p2c_header[4];
>>>>           uint32_t sz;
>>>>           void *buf;
>>>> -       int ret;
>>>> +       int ret, idx;
>>>>
>>>>           if (ctx->init == PSP_MEM_TRAIN_NOT_SUPPORT) {
>>>>                   DRM_DEBUG("Memory training is not supported.\n");
>>>> @@ -681,17 +674,24 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
>>>>                           return -ENOMEM;
>>>>                   }
>>>>
>>>> -               memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
>>>> -               ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
>>>> -               if (ret) {
>>>> -                       DRM_ERROR("Send long training msg failed.\n");
>>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                       memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
>>>> +                       ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
>>>> +                       if (ret) {
>>>> +                               DRM_ERROR("Send long training msg failed.\n");
>>>> +                               vfree(buf);
>>>> +                               drm_dev_exit(idx);
>>>> +                               return ret;
>>>> +                       }
>>>> +
>>>> +                       memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
>>>> +                       adev->hdp.funcs->flush_hdp(adev, NULL);
>>>>                           vfree(buf);
>>>> -                       return ret;
>>>> +                       drm_dev_exit(idx);
>>>> +               } else {
>>>> +                       vfree(buf);
>>>> +                       return -ENODEV;
>>>>                   }
>>>> -
>>>> -               memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
>>>> -               adev->hdp.funcs->flush_hdp(adev, NULL);
>>>> -               vfree(buf);
>>>>           }
>>>>
>>>>           if (ops & PSP_MEM_TRAIN_SAVE) {
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
>>>> index c4828bd3264b..618e5b6b85d9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
>>>> @@ -138,10 +138,8 @@ static int psp_v12_0_bootloader_load_sysdrv(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy PSP System Driver binary to memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
>>>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>>>>
>>>>           /* Provide the sys driver to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> @@ -179,10 +177,8 @@ static int psp_v12_0_bootloader_load_sos(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy Secure OS binary to PSP memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
>>>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>>>>
>>>>           /* Provide the PSP secure OS to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
>>>> index f2e725f72d2f..d0a6cccd0897 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
>>>> @@ -102,10 +102,8 @@ static int psp_v3_1_bootloader_load_sysdrv(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy PSP System Driver binary to memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
>>>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
>>>>
>>>>           /* Provide the sys driver to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> @@ -143,10 +141,8 @@ static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
>>>> -
>>>>           /* Copy Secure OS binary to PSP memory */
>>>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
>>>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
>>>>
>>>>           /* Provide the PSP secure OS to bootloader */
>>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
>>>> index 8e238dea7bef..90910d19db12 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
>>>> @@ -25,6 +25,7 @@
>>>>     */
>>>>
>>>>    #include <linux/firmware.h>
>>>> +#include <drm/drm_drv.h>
>>>>
>>>>    #include "amdgpu.h"
>>>>    #include "amdgpu_vce.h"
>>>> @@ -555,16 +556,19 @@ static int vce_v4_0_hw_fini(void *handle)
>>>>    static int vce_v4_0_suspend(void *handle)
>>>>    {
>>>>           struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>> -       int r;
>>>> +       int r, idx;
>>>>
>>>>           if (adev->vce.vcpu_bo == NULL)
>>>>                   return 0;
>>>>
>>>> -       if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
>>>> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>>>> -               void *ptr = adev->vce.cpu_addr;
>>>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +               if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
>>>> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>>>> +                       void *ptr = adev->vce.cpu_addr;
>>>>
>>>> -               memcpy_fromio(adev->vce.saved_bo, ptr, size);
>>>> +                       memcpy_fromio(adev->vce.saved_bo, ptr, size);
>>>> +               }
>>>> +               drm_dev_exit(idx);
>>>>           }
>>>>
>>>>           r = vce_v4_0_hw_fini(adev);
>>>> @@ -577,16 +581,20 @@ static int vce_v4_0_suspend(void *handle)
>>>>    static int vce_v4_0_resume(void *handle)
>>>>    {
>>>>           struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>> -       int r;
>>>> +       int r, idx;
>>>>
>>>>           if (adev->vce.vcpu_bo == NULL)
>>>>                   return -EINVAL;
>>>>
>>>>           if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
>>>> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>>>> -               void *ptr = adev->vce.cpu_addr;
>>>>
>>>> -               memcpy_toio(ptr, adev->vce.saved_bo, size);
>>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
>>>> +                       void *ptr = adev->vce.cpu_addr;
>>>> +
>>>> +                       memcpy_toio(ptr, adev->vce.saved_bo, size);
>>>> +                       drm_dev_exit(idx);
>>>> +               }
>>>>           } else {
>>>>                   r = amdgpu_vce_resume(adev);
>>>>                   if (r)
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>>>> index 3f15bf34123a..df34be8ec82d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>>>> @@ -34,6 +34,8 @@
>>>>    #include "vcn/vcn_3_0_0_sh_mask.h"
>>>>    #include "ivsrcid/vcn/irqsrcs_vcn_2_0.h"
>>>>
>>>> +#include <drm/drm_drv.h>
>>>> +
>>>>    #define mmUVD_CONTEXT_ID_INTERNAL_OFFSET                       0x27
>>>>    #define mmUVD_GPCOM_VCPU_CMD_INTERNAL_OFFSET                   0x0f
>>>>    #define mmUVD_GPCOM_VCPU_DATA0_INTERNAL_OFFSET                 0x10
>>>> @@ -268,16 +270,20 @@ static int vcn_v3_0_sw_init(void *handle)
>>>>    static int vcn_v3_0_sw_fini(void *handle)
>>>>    {
>>>>           struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>> -       int i, r;
>>>> +       int i, r, idx;
>>>>
>>>> -       for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
>>>> -               volatile struct amdgpu_fw_shared *fw_shared;
>>>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
>>>> +               for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
>>>> +                       volatile struct amdgpu_fw_shared *fw_shared;
>>>>
>>>> -               if (adev->vcn.harvest_config & (1 << i))
>>>> -                       continue;
>>>> -               fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
>>>> -               fw_shared->present_flag_0 = 0;
>>>> -               fw_shared->sw_ring.is_enabled = false;
>>>> +                       if (adev->vcn.harvest_config & (1 << i))
>>>> +                               continue;
>>>> +                       fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
>>>> +                       fw_shared->present_flag_0 = 0;
>>>> +                       fw_shared->sw_ring.is_enabled = false;
>>>> +               }
>>>> +
>>>> +               drm_dev_exit(idx);
>>>>           }
>>>>
>>>>           if (amdgpu_sriov_vf(adev))
>>>> --
>>>> 2.25.1
>>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal
  2021-05-13 14:47         ` Andrey Grodzovsky
@ 2021-05-13 14:54           ` Alex Deucher
  0 siblings, 0 replies; 64+ messages in thread
From: Alex Deucher @ 2021-05-13 14:54 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander

On Thu, May 13, 2021 at 10:47 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
>
> On 2021-05-12 4:50 p.m., Alex Deucher wrote:
> > On Wed, May 12, 2021 at 4:30 PM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >>
> >>
> >> On 2021-05-12 4:17 p.m., Alex Deucher wrote:
> >>> On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
> >>> <andrey.grodzovsky@amd.com> wrote:
> >>>>
> >>>> This should prevent writing to memory or IO ranges possibly
> >>>> already allocated for other uses after our device is removed.
> >>>>
> >>>> v5:
> >>>> Protect more places wher memcopy_to/form_io takes place
> >>>
> >>> where
> >>>
> >>>> Protect IB submissions
> >>>>
> >>>> v6: Switch to !drm_dev_enter instead of scoping entire code
> >>>> with brackets.
> >>>>
> >>>> v7:
> >>>> Drop guard of HW ring commands emission protection since they
> >>>> are in GART and not in MMIO.
> >>>>
> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>>
> >>> I think you could split out the psp_fw_copy changes as a separate
> >>> cleanup patch.  That's a nice clean up in general.  What about the SMU
> >>> code (e.g., amd/pm/powerplay and amd/pm/swsmu)?  There are a bunch of
> >>> shared memory areas we interact with in the driver.
> >>
> >> Can you point me to it? Are they VRAM and not GART?
> >> I searched for all memcpy_to/from_io in our code. Maybe I missed some.
> >>
> >
> > Mostly vram.  A quick grep shows:
> >
> > $ grep -r -I AMDGPU_GEM_DOMAIN drivers/gpu/drm/amd/pm/
> > drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c:
> >    PAGE_SIZE, AMDGPU_GEM_DOMAIN_GTT,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/smu7_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/smu7_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> >   AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/smu8_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/smu8_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
> >         AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
> >             AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
> >         AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
> >         AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega12_smumgr.c:
> >         AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/vega20_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/smu10_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/powerplay/smumgr/smu10_smumgr.c:
> > AMDGPU_GEM_DOMAIN_VRAM,
> > drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
> > AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
> > AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c:        PAGE_SIZE,
> > AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c:
> > AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c:
> >      AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu11/arcturus_ppt.c:
> > AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c:
> > PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:    driver_table->domain =
> > AMDGPU_GEM_DOMAIN_VRAM;
> > drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:    memory_pool->domain =
> > AMDGPU_GEM_DOMAIN_GTT;
> > drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c:
> > dummy_read_1_table->domain = AMDGPU_GEM_DOMAIN_VRAM;
> >
> > In general, the driver puts shared structures in vram and then either
> > updates them and asks the SMU to read them, or requests that the SMU
> > write to them by writing a message to the SMU via MMIO.  You'll need
> > to look for places where those shared buffers are accessed.
> >
> > Alex
>
> There are indeed multiple memcpy calls in the pm folder and other places.
> The thing is, I never hit them during testing. I invalidate all MMIO
> ranges (VRAM and registers) at the end of amdgpu_pci_remove, so if any
> of them were hit post device unplug I would see a page fault oops.
> The fact that they are not hit means one of three things: either they
> are accessed before the pci remove code ends, in which case they are
> fine; they are sysfs accessors, as you pointed out to me earlier, which
> also cannot be accessed post pci remove since all device related sysfs
> is gone; or they are real places that I didn't catch in my tests.
> I feel like guarding all of them with drm_dev_enter/exit will really
> clutter the code, and that the right approach is the one we are
> eventually working toward: preventing any HW accessing code from
> running post amdgpu_pci_remove, coupled with force retiring fences.
> The invalidation of all MMIO mappings that I added as the last patch in
> this series also guarantees that any new unguarded access that actually
> happens post device unplug will be quickly caught and identified by a
> page fault and can be specifically addressed.
>

In general they are only hit during runtime via sysfs (e.g., if you
are querying telemetry data via hwmon or checking the clocks).  We can
always revisit them in the future if they end up being a problem.
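
If one of those paths does trip post-unplug, the fix would be the same
guard pattern as in this patch - e.g., for a hwmon accessor, roughly
(a sketch only, not taken from the actual code):

static ssize_t amdgpu_hwmon_show_temp(struct device *dev,
				      struct device_attribute *attr,
				      char *buf)
{
	struct amdgpu_device *adev = dev_get_drvdata(dev);
	int idx, r, size = sizeof(u32);
	u32 temp;

	if (!drm_dev_enter(&adev->ddev, &idx))
		return -ENODEV;

	/* this ultimately reads an SMU shared buffer in VRAM */
	r = amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_GPU_TEMP,
				   (void *)&temp, &size);
	drm_dev_exit(idx);

	if (r)
		return r;

	return snprintf(buf, PAGE_SIZE, "%d\n", temp);
}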

I can go either way on splitting out the psp_copy_fw cleanup as a
separate patch.  With the typo in the commit message fixed, the patch
is:
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> Andrey
>
> >
> >
> >> Andrey
> >>
> >>>
> >>> Alex
> >>>
> >>>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++-
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 ++++
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 63 ++++++++++++++--------
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  2 +
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 31 +++++++----
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 11 ++--
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c    | 22 +++++---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     |  7 ++-
> >>>>    drivers/gpu/drm/amd/amdgpu/psp_v11_0.c     | 44 +++++++--------
> >>>>    drivers/gpu/drm/amd/amdgpu/psp_v12_0.c     |  8 +--
> >>>>    drivers/gpu/drm/amd/amdgpu/psp_v3_1.c      |  8 +--
> >>>>    drivers/gpu/drm/amd/amdgpu/vce_v4_0.c      | 26 +++++----
> >>>>    drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c      | 22 +++++---
> >>>>    13 files changed, 168 insertions(+), 95 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> index a0bff4713672..f7cca25c0fa0 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> @@ -71,6 +71,8 @@
> >>>>    #include <drm/task_barrier.h>
> >>>>    #include <linux/pm_runtime.h>
> >>>>
> >>>> +#include <drm/drm_drv.h>
> >>>> +
> >>>>    MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
> >>>>    MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
> >>>>    MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
> >>>> @@ -281,7 +283,10 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >>>>           unsigned long flags;
> >>>>           uint32_t hi = ~0;
> >>>>           uint64_t last;
> >>>> +       int idx;
> >>>>
> >>>> +       if (!drm_dev_enter(&adev->ddev, &idx))
> >>>> +               return;
> >>>>
> >>>>    #ifdef CONFIG_64BIT
> >>>>           last = min(pos + size, adev->gmc.visible_vram_size);
> >>>> @@ -300,7 +305,7 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >>>>                   }
> >>>>
> >>>>                   if (count == size)
> >>>> -                       return;
> >>>> +                       goto exit;
> >>>>
> >>>>                   pos += count;
> >>>>                   buf += count / 4;
> >>>> @@ -323,6 +328,9 @@ void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> >>>>                           *buf++ = RREG32_NO_KIQ(mmMM_DATA);
> >>>>           }
> >>>>           spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);
> >>>> +
> >>>> +exit:
> >>>> +       drm_dev_exit(idx);
> >>>>    }
> >>>>
> >>>>    /*
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>>> index 4d32233cde92..04ba5eef1e88 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>>> @@ -31,6 +31,8 @@
> >>>>    #include "amdgpu_ras.h"
> >>>>    #include "amdgpu_xgmi.h"
> >>>>
> >>>> +#include <drm/drm_drv.h>
> >>>> +
> >>>>    /**
> >>>>     * amdgpu_gmc_pdb0_alloc - allocate vram for pdb0
> >>>>     *
> >>>> @@ -151,6 +153,10 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
> >>>>    {
> >>>>           void __iomem *ptr = (void *)cpu_pt_addr;
> >>>>           uint64_t value;
> >>>> +       int idx;
> >>>> +
> >>>> +       if (!drm_dev_enter(&adev->ddev, &idx))
> >>>> +               return 0;
> >>>>
> >>>>           /*
> >>>>            * The following is for PTE only. GART does not have PDEs.
> >>>> @@ -158,6 +164,9 @@ int amdgpu_gmc_set_pte_pde(struct amdgpu_device *adev, void *cpu_pt_addr,
> >>>>           value = addr & 0x0000FFFFFFFFF000ULL;
> >>>>           value |= flags;
> >>>>           writeq(value, ptr + (gpu_page_idx * 8));
> >>>> +
> >>>> +       drm_dev_exit(idx);
> >>>> +
> >>>>           return 0;
> >>>>    }
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> >>>> index 9e769cf6095b..bb6afee61666 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> >>>> @@ -25,6 +25,7 @@
> >>>>
> >>>>    #include <linux/firmware.h>
> >>>>    #include <linux/dma-mapping.h>
> >>>> +#include <drm/drm_drv.h>
> >>>>
> >>>>    #include "amdgpu.h"
> >>>>    #include "amdgpu_psp.h"
> >>>> @@ -39,6 +40,8 @@
> >>>>    #include "amdgpu_ras.h"
> >>>>    #include "amdgpu_securedisplay.h"
> >>>>
> >>>> +#include <drm/drm_drv.h>
> >>>> +
> >>>>    static int psp_sysfs_init(struct amdgpu_device *adev);
> >>>>    static void psp_sysfs_fini(struct amdgpu_device *adev);
> >>>>
> >>>> @@ -253,7 +256,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>>>                      struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr)
> >>>>    {
> >>>>           int ret;
> >>>> -       int index;
> >>>> +       int index, idx;
> >>>>           int timeout = 20000;
> >>>>           bool ras_intr = false;
> >>>>           bool skip_unsupport = false;
> >>>> @@ -261,6 +264,9 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>>>           if (psp->adev->in_pci_err_recovery)
> >>>>                   return 0;
> >>>>
> >>>> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
> >>>> +               return 0;
> >>>> +
> >>>>           mutex_lock(&psp->mutex);
> >>>>
> >>>>           memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> >>>> @@ -271,8 +277,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>>>           ret = psp_ring_cmd_submit(psp, psp->cmd_buf_mc_addr, fence_mc_addr, index);
> >>>>           if (ret) {
> >>>>                   atomic_dec(&psp->fence_value);
> >>>> -               mutex_unlock(&psp->mutex);
> >>>> -               return ret;
> >>>> +               goto exit;
> >>>>           }
> >>>>
> >>>>           amdgpu_asic_invalidate_hdp(psp->adev, NULL);
> >>>> @@ -312,8 +317,8 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>>>                            psp->cmd_buf_mem->cmd_id,
> >>>>                            psp->cmd_buf_mem->resp.status);
> >>>>                   if (!timeout) {
> >>>> -                       mutex_unlock(&psp->mutex);
> >>>> -                       return -EINVAL;
> >>>> +                       ret = -EINVAL;
> >>>> +                       goto exit;
> >>>>                   }
> >>>>           }
> >>>>
> >>>> @@ -321,8 +326,10 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >>>>                   ucode->tmr_mc_addr_lo = psp->cmd_buf_mem->resp.fw_addr_lo;
> >>>>                   ucode->tmr_mc_addr_hi = psp->cmd_buf_mem->resp.fw_addr_hi;
> >>>>           }
> >>>> -       mutex_unlock(&psp->mutex);
> >>>>
> >>>> +exit:
> >>>> +       mutex_unlock(&psp->mutex);
> >>>> +       drm_dev_exit(idx);
> >>>>           return ret;
> >>>>    }
> >>>>
> >>>> @@ -359,8 +366,7 @@ static int psp_load_toc(struct psp_context *psp,
> >>>>           if (!cmd)
> >>>>                   return -ENOMEM;
> >>>>           /* Copy toc to psp firmware private buffer */
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -       memcpy(psp->fw_pri_buf, psp->toc_start_addr, psp->toc_bin_size);
> >>>> +       psp_copy_fw(psp, psp->toc_start_addr, psp->toc_bin_size);
> >>>>
> >>>>           psp_prep_load_toc_cmd_buf(cmd, psp->fw_pri_mc_addr, psp->toc_bin_size);
> >>>>
> >>>> @@ -625,8 +631,7 @@ static int psp_asd_load(struct psp_context *psp)
> >>>>           if (!cmd)
> >>>>                   return -ENOMEM;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -       memcpy(psp->fw_pri_buf, psp->asd_start_addr, psp->asd_ucode_size);
> >>>> +       psp_copy_fw(psp, psp->asd_start_addr, psp->asd_ucode_size);
> >>>>
> >>>>           psp_prep_asd_load_cmd_buf(cmd, psp->fw_pri_mc_addr,
> >>>>                                     psp->asd_ucode_size);
> >>>> @@ -781,8 +786,7 @@ static int psp_xgmi_load(struct psp_context *psp)
> >>>>           if (!cmd)
> >>>>                   return -ENOMEM;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -       memcpy(psp->fw_pri_buf, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
> >>>> +       psp_copy_fw(psp, psp->ta_xgmi_start_addr, psp->ta_xgmi_ucode_size);
> >>>>
> >>>>           psp_prep_ta_load_cmd_buf(cmd,
> >>>>                                    psp->fw_pri_mc_addr,
> >>>> @@ -1038,8 +1042,7 @@ static int psp_ras_load(struct psp_context *psp)
> >>>>           if (!cmd)
> >>>>                   return -ENOMEM;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -       memcpy(psp->fw_pri_buf, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
> >>>> +       psp_copy_fw(psp, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
> >>>>
> >>>>           psp_prep_ta_load_cmd_buf(cmd,
> >>>>                                    psp->fw_pri_mc_addr,
> >>>> @@ -1275,8 +1278,7 @@ static int psp_hdcp_load(struct psp_context *psp)
> >>>>           if (!cmd)
> >>>>                   return -ENOMEM;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -       memcpy(psp->fw_pri_buf, psp->ta_hdcp_start_addr,
> >>>> +       psp_copy_fw(psp, psp->ta_hdcp_start_addr,
> >>>>                  psp->ta_hdcp_ucode_size);
> >>>>
> >>>>           psp_prep_ta_load_cmd_buf(cmd,
> >>>> @@ -1427,8 +1429,7 @@ static int psp_dtm_load(struct psp_context *psp)
> >>>>           if (!cmd)
> >>>>                   return -ENOMEM;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -       memcpy(psp->fw_pri_buf, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
> >>>> +       psp_copy_fw(psp, psp->ta_dtm_start_addr, psp->ta_dtm_ucode_size);
> >>>>
> >>>>           psp_prep_ta_load_cmd_buf(cmd,
> >>>>                                    psp->fw_pri_mc_addr,
> >>>> @@ -1573,8 +1574,7 @@ static int psp_rap_load(struct psp_context *psp)
> >>>>           if (!cmd)
> >>>>                   return -ENOMEM;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -       memcpy(psp->fw_pri_buf, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
> >>>> +       psp_copy_fw(psp, psp->ta_rap_start_addr, psp->ta_rap_ucode_size);
> >>>>
> >>>>           psp_prep_ta_load_cmd_buf(cmd,
> >>>>                                    psp->fw_pri_mc_addr,
> >>>> @@ -3022,7 +3022,7 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
> >>>>           struct amdgpu_device *adev = drm_to_adev(ddev);
> >>>>           void *cpu_addr;
> >>>>           dma_addr_t dma_addr;
> >>>> -       int ret;
> >>>> +       int ret, idx;
> >>>>           char fw_name[100];
> >>>>           const struct firmware *usbc_pd_fw;
> >>>>
> >>>> @@ -3031,6 +3031,9 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
> >>>>                   return -EBUSY;
> >>>>           }
> >>>>
> >>>> +       if (!drm_dev_enter(ddev, &idx))
> >>>> +               return -ENODEV;
> >>>> +
> >>>>           snprintf(fw_name, sizeof(fw_name), "amdgpu/%s", buf);
> >>>>           ret = request_firmware(&usbc_pd_fw, fw_name, adev->dev);
> >>>>           if (ret)
> >>>> @@ -3062,16 +3065,30 @@ static ssize_t psp_usbc_pd_fw_sysfs_write(struct device *dev,
> >>>>    rel_buf:
> >>>>           dma_free_coherent(adev->dev, usbc_pd_fw->size, cpu_addr, dma_addr);
> >>>>           release_firmware(usbc_pd_fw);
> >>>> -
> >>>>    fail:
> >>>>           if (ret) {
> >>>>                   DRM_ERROR("Failed to load USBC PD FW, err = %d", ret);
> >>>> -               return ret;
> >>>> +               count = ret;
> >>>>           }
> >>>>
> >>>> +       drm_dev_exit(idx);
> >>>>           return count;
> >>>>    }
> >>>>
> >>>> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size)
> >>>> +{
> >>>> +       int idx;
> >>>> +
> >>>> +       if (!drm_dev_enter(&psp->adev->ddev, &idx))
> >>>> +               return;
> >>>> +
> >>>> +       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> +       memcpy(psp->fw_pri_buf, start_addr, bin_size);
> >>>> +
> >>>> +       drm_dev_exit(idx);
> >>>> +}
> >>>> +
> >>>> +
> >>>>    static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
> >>>>                      psp_usbc_pd_fw_sysfs_read,
> >>>>                      psp_usbc_pd_fw_sysfs_write);
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> >>>> index 46a5328e00e0..2bfdc278817f 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> >>>> @@ -423,4 +423,6 @@ int psp_get_fw_attestation_records_addr(struct psp_context *psp,
> >>>>
> >>>>    int psp_load_fw_list(struct psp_context *psp,
> >>>>                        struct amdgpu_firmware_info **ucode_list, int ucode_count);
> >>>> +void psp_copy_fw(struct psp_context *psp, uint8_t *start_addr, uint32_t bin_size);
> >>>> +
> >>>>    #endif
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> >>>> index c6dbc0801604..82f0542c7792 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
> >>>> @@ -32,6 +32,7 @@
> >>>>    #include <linux/module.h>
> >>>>
> >>>>    #include <drm/drm.h>
> >>>> +#include <drm/drm_drv.h>
> >>>>
> >>>>    #include "amdgpu.h"
> >>>>    #include "amdgpu_pm.h"
> >>>> @@ -375,7 +376,7 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
> >>>>    {
> >>>>           unsigned size;
> >>>>           void *ptr;
> >>>> -       int i, j;
> >>>> +       int i, j, idx;
> >>>>           bool in_ras_intr = amdgpu_ras_intr_triggered();
> >>>>
> >>>>           cancel_delayed_work_sync(&adev->uvd.idle_work);
> >>>> @@ -403,11 +404,15 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
> >>>>                   if (!adev->uvd.inst[j].saved_bo)
> >>>>                           return -ENOMEM;
> >>>>
> >>>> -               /* re-write 0 since err_event_athub will corrupt VCPU buffer */
> >>>> -               if (in_ras_intr)
> >>>> -                       memset(adev->uvd.inst[j].saved_bo, 0, size);
> >>>> -               else
> >>>> -                       memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
> >>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                       /* re-write 0 since err_event_athub will corrupt VCPU buffer */
> >>>> +                       if (in_ras_intr)
> >>>> +                               memset(adev->uvd.inst[j].saved_bo, 0, size);
> >>>> +                       else
> >>>> +                               memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
> >>>> +
> >>>> +                       drm_dev_exit(idx);
> >>>> +               }
> >>>>           }
> >>>>
> >>>>           if (in_ras_intr)
> >>>> @@ -420,7 +425,7 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
> >>>>    {
> >>>>           unsigned size;
> >>>>           void *ptr;
> >>>> -       int i;
> >>>> +       int i, idx;
> >>>>
> >>>>           for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
> >>>>                   if (adev->uvd.harvest_config & (1 << i))
> >>>> @@ -432,7 +437,10 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
> >>>>                   ptr = adev->uvd.inst[i].cpu_addr;
> >>>>
> >>>>                   if (adev->uvd.inst[i].saved_bo != NULL) {
> >>>> -                       memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
> >>>> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                               memcpy_toio(ptr, adev->uvd.inst[i].saved_bo, size);
> >>>> +                               drm_dev_exit(idx);
> >>>> +                       }
> >>>>                           kvfree(adev->uvd.inst[i].saved_bo);
> >>>>                           adev->uvd.inst[i].saved_bo = NULL;
> >>>>                   } else {
> >>>> @@ -442,8 +450,11 @@ int amdgpu_uvd_resume(struct amdgpu_device *adev)
> >>>>                           hdr = (const struct common_firmware_header *)adev->uvd.fw->data;
> >>>>                           if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
> >>>>                                   offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> >>>> -                               memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
> >>>> -                                           le32_to_cpu(hdr->ucode_size_bytes));
> >>>> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                                       memcpy_toio(adev->uvd.inst[i].cpu_addr, adev->uvd.fw->data + offset,
> >>>> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
> >>>> +                                       drm_dev_exit(idx);
> >>>> +                               }
> >>>>                                   size -= le32_to_cpu(hdr->ucode_size_bytes);
> >>>>                                   ptr += le32_to_cpu(hdr->ucode_size_bytes);
> >>>>                           }
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> >>>> index ea6a62f67e38..833203401ef4 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
> >>>> @@ -29,6 +29,7 @@
> >>>>    #include <linux/module.h>
> >>>>
> >>>>    #include <drm/drm.h>
> >>>> +#include <drm/drm_drv.h>
> >>>>
> >>>>    #include "amdgpu.h"
> >>>>    #include "amdgpu_pm.h"
> >>>> @@ -293,7 +294,7 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
> >>>>           void *cpu_addr;
> >>>>           const struct common_firmware_header *hdr;
> >>>>           unsigned offset;
> >>>> -       int r;
> >>>> +       int r, idx;
> >>>>
> >>>>           if (adev->vce.vcpu_bo == NULL)
> >>>>                   return -EINVAL;
> >>>> @@ -313,8 +314,12 @@ int amdgpu_vce_resume(struct amdgpu_device *adev)
> >>>>
> >>>>           hdr = (const struct common_firmware_header *)adev->vce.fw->data;
> >>>>           offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> >>>> -       memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
> >>>> -                   adev->vce.fw->size - offset);
> >>>> +
> >>>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +               memcpy_toio(cpu_addr, adev->vce.fw->data + offset,
> >>>> +                           adev->vce.fw->size - offset);
> >>>> +               drm_dev_exit(idx);
> >>>> +       }
> >>>>
> >>>>           amdgpu_bo_kunmap(adev->vce.vcpu_bo);
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>> index 201645963ba5..21f7d3644d70 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>> @@ -27,6 +27,7 @@
> >>>>    #include <linux/firmware.h>
> >>>>    #include <linux/module.h>
> >>>>    #include <linux/pci.h>
> >>>> +#include <drm/drm_drv.h>
> >>>>
> >>>>    #include "amdgpu.h"
> >>>>    #include "amdgpu_pm.h"
> >>>> @@ -275,7 +276,7 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
> >>>>    {
> >>>>           unsigned size;
> >>>>           void *ptr;
> >>>> -       int i;
> >>>> +       int i, idx;
> >>>>
> >>>>           cancel_delayed_work_sync(&adev->vcn.idle_work);
> >>>>
> >>>> @@ -292,7 +293,10 @@ int amdgpu_vcn_suspend(struct amdgpu_device *adev)
> >>>>                   if (!adev->vcn.inst[i].saved_bo)
> >>>>                           return -ENOMEM;
> >>>>
> >>>> -               memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
> >>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                       memcpy_fromio(adev->vcn.inst[i].saved_bo, ptr, size);
> >>>> +                       drm_dev_exit(idx);
> >>>> +               }
> >>>>           }
> >>>>           return 0;
> >>>>    }
> >>>> @@ -301,7 +305,7 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
> >>>>    {
> >>>>           unsigned size;
> >>>>           void *ptr;
> >>>> -       int i;
> >>>> +       int i, idx;
> >>>>
> >>>>           for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> >>>>                   if (adev->vcn.harvest_config & (1 << i))
> >>>> @@ -313,7 +317,10 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
> >>>>                   ptr = adev->vcn.inst[i].cpu_addr;
> >>>>
> >>>>                   if (adev->vcn.inst[i].saved_bo != NULL) {
> >>>> -                       memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
> >>>> +                       if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                               memcpy_toio(ptr, adev->vcn.inst[i].saved_bo, size);
> >>>> +                               drm_dev_exit(idx);
> >>>> +                       }
> >>>>                           kvfree(adev->vcn.inst[i].saved_bo);
> >>>>                           adev->vcn.inst[i].saved_bo = NULL;
> >>>>                   } else {
> >>>> @@ -323,8 +330,11 @@ int amdgpu_vcn_resume(struct amdgpu_device *adev)
> >>>>                           hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
> >>>>                           if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP) {
> >>>>                                   offset = le32_to_cpu(hdr->ucode_array_offset_bytes);
> >>>> -                               memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
> >>>> -                                           le32_to_cpu(hdr->ucode_size_bytes));
> >>>> +                               if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                                       memcpy_toio(adev->vcn.inst[i].cpu_addr, adev->vcn.fw->data + offset,
> >>>> +                                                   le32_to_cpu(hdr->ucode_size_bytes));
> >>>> +                                       drm_dev_exit(idx);
> >>>> +                               }
> >>>>                                   size -= le32_to_cpu(hdr->ucode_size_bytes);
> >>>>                                   ptr += le32_to_cpu(hdr->ucode_size_bytes);
> >>>>                           }
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>>> index 9f868cf3b832..7dd5f10ab570 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>>> @@ -32,6 +32,7 @@
> >>>>    #include <linux/dma-buf.h>
> >>>>
> >>>>    #include <drm/amdgpu_drm.h>
> >>>> +#include <drm/drm_drv.h>
> >>>>    #include "amdgpu.h"
> >>>>    #include "amdgpu_trace.h"
> >>>>    #include "amdgpu_amdkfd.h"
> >>>> @@ -1606,7 +1607,10 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> >>>>           struct amdgpu_vm_update_params params;
> >>>>           enum amdgpu_sync_mode sync_mode;
> >>>>           uint64_t pfn;
> >>>> -       int r;
> >>>> +       int r, idx;
> >>>> +
> >>>> +       if (!drm_dev_enter(&adev->ddev, &idx))
> >>>> +               return -ENODEV;
> >>>>
> >>>>           memset(&params, 0, sizeof(params));
> >>>>           params.adev = adev;
> >>>> @@ -1715,6 +1719,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> >>>>
> >>>>    error_unlock:
> >>>>           amdgpu_vm_eviction_unlock(vm);
> >>>> +       drm_dev_exit(idx);
> >>>>           return r;
> >>>>    }
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> >>>> index 589410c32d09..2cec71e823f5 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
> >>>> @@ -23,6 +23,7 @@
> >>>>    #include <linux/firmware.h>
> >>>>    #include <linux/module.h>
> >>>>    #include <linux/vmalloc.h>
> >>>> +#include <drm/drm_drv.h>
> >>>>
> >>>>    #include "amdgpu.h"
> >>>>    #include "amdgpu_psp.h"
> >>>> @@ -269,10 +270,8 @@ static int psp_v11_0_bootloader_load_kdb(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy PSP KDB binary to memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->kdb_start_addr, psp->kdb_bin_size);
> >>>> +       psp_copy_fw(psp, psp->kdb_start_addr, psp->kdb_bin_size);
> >>>>
> >>>>           /* Provide the PSP KDB to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> @@ -302,10 +301,8 @@ static int psp_v11_0_bootloader_load_spl(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy PSP SPL binary to memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->spl_start_addr, psp->spl_bin_size);
> >>>> +       psp_copy_fw(psp, psp->spl_start_addr, psp->spl_bin_size);
> >>>>
> >>>>           /* Provide the PSP SPL to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> @@ -335,10 +332,8 @@ static int psp_v11_0_bootloader_load_sysdrv(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy PSP System Driver binary to memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> >>>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
> >>>>
> >>>>           /* Provide the sys driver to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> @@ -371,10 +366,8 @@ static int psp_v11_0_bootloader_load_sos(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy Secure OS binary to PSP memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> >>>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
> >>>>
> >>>>           /* Provide the PSP secure OS to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> @@ -608,7 +601,7 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
> >>>>           uint32_t p2c_header[4];
> >>>>           uint32_t sz;
> >>>>           void *buf;
> >>>> -       int ret;
> >>>> +       int ret, idx;
> >>>>
> >>>>           if (ctx->init == PSP_MEM_TRAIN_NOT_SUPPORT) {
> >>>>                   DRM_DEBUG("Memory training is not supported.\n");
> >>>> @@ -681,17 +674,24 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
> >>>>                           return -ENOMEM;
> >>>>                   }
> >>>>
> >>>> -               memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
> >>>> -               ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
> >>>> -               if (ret) {
> >>>> -                       DRM_ERROR("Send long training msg failed.\n");
> >>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                       memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
> >>>> +                       ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
> >>>> +                       if (ret) {
> >>>> +                               DRM_ERROR("Send long training msg failed.\n");
> >>>> +                               vfree(buf);
> >>>> +                               drm_dev_exit(idx);
> >>>> +                               return ret;
> >>>> +                       }
> >>>> +
> >>>> +                       memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
> >>>> +                       adev->hdp.funcs->flush_hdp(adev, NULL);
> >>>>                           vfree(buf);
> >>>> -                       return ret;
> >>>> +                       drm_dev_exit(idx);
> >>>> +               } else {
> >>>> +                       vfree(buf);
> >>>> +                       return -ENODEV;
> >>>>                   }
> >>>> -
> >>>> -               memcpy_toio(adev->mman.aper_base_kaddr, buf, sz);
> >>>> -               adev->hdp.funcs->flush_hdp(adev, NULL);
> >>>> -               vfree(buf);
> >>>>           }
> >>>>
> >>>>           if (ops & PSP_MEM_TRAIN_SAVE) {
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> >>>> index c4828bd3264b..618e5b6b85d9 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v12_0.c
> >>>> @@ -138,10 +138,8 @@ static int psp_v12_0_bootloader_load_sysdrv(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy PSP System Driver binary to memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> >>>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
> >>>>
> >>>>           /* Provide the sys driver to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> @@ -179,10 +177,8 @@ static int psp_v12_0_bootloader_load_sos(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy Secure OS binary to PSP memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> >>>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
> >>>>
> >>>>           /* Provide the PSP secure OS to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> >>>> index f2e725f72d2f..d0a6cccd0897 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v3_1.c
> >>>> @@ -102,10 +102,8 @@ static int psp_v3_1_bootloader_load_sysdrv(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy PSP System Driver binary to memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->sys_start_addr, psp->sys_bin_size);
> >>>> +       psp_copy_fw(psp, psp->sys_start_addr, psp->sys_bin_size);
> >>>>
> >>>>           /* Provide the sys driver to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> @@ -143,10 +141,8 @@ static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
> >>>>           if (ret)
> >>>>                   return ret;
> >>>>
> >>>> -       memset(psp->fw_pri_buf, 0, PSP_1_MEG);
> >>>> -
> >>>>           /* Copy Secure OS binary to PSP memory */
> >>>> -       memcpy(psp->fw_pri_buf, psp->sos_start_addr, psp->sos_bin_size);
> >>>> +       psp_copy_fw(psp, psp->sos_start_addr, psp->sos_bin_size);
> >>>>
> >>>>           /* Provide the PSP secure OS to bootloader */
> >>>>           WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> >>>> index 8e238dea7bef..90910d19db12 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
> >>>> @@ -25,6 +25,7 @@
> >>>>     */
> >>>>
> >>>>    #include <linux/firmware.h>
> >>>> +#include <drm/drm_drv.h>
> >>>>
> >>>>    #include "amdgpu.h"
> >>>>    #include "amdgpu_vce.h"
> >>>> @@ -555,16 +556,19 @@ static int vce_v4_0_hw_fini(void *handle)
> >>>>    static int vce_v4_0_suspend(void *handle)
> >>>>    {
> >>>>           struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >>>> -       int r;
> >>>> +       int r, idx;
> >>>>
> >>>>           if (adev->vce.vcpu_bo == NULL)
> >>>>                   return 0;
> >>>>
> >>>> -       if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> >>>> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >>>> -               void *ptr = adev->vce.cpu_addr;
> >>>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +               if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> >>>> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >>>> +                       void *ptr = adev->vce.cpu_addr;
> >>>>
> >>>> -               memcpy_fromio(adev->vce.saved_bo, ptr, size);
> >>>> +                       memcpy_fromio(adev->vce.saved_bo, ptr, size);
> >>>> +               }
> >>>> +               drm_dev_exit(idx);
> >>>>           }
> >>>>
> >>>>           r = vce_v4_0_hw_fini(adev);
> >>>> @@ -577,16 +581,20 @@ static int vce_v4_0_suspend(void *handle)
> >>>>    static int vce_v4_0_resume(void *handle)
> >>>>    {
> >>>>           struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >>>> -       int r;
> >>>> +       int r, idx;
> >>>>
> >>>>           if (adev->vce.vcpu_bo == NULL)
> >>>>                   return -EINVAL;
> >>>>
> >>>>           if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {
> >>>> -               unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >>>> -               void *ptr = adev->vce.cpu_addr;
> >>>>
> >>>> -               memcpy_toio(ptr, adev->vce.saved_bo, size);
> >>>> +               if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +                       unsigned size = amdgpu_bo_size(adev->vce.vcpu_bo);
> >>>> +                       void *ptr = adev->vce.cpu_addr;
> >>>> +
> >>>> +                       memcpy_toio(ptr, adev->vce.saved_bo, size);
> >>>> +                       drm_dev_exit(idx);
> >>>> +               }
> >>>>           } else {
> >>>>                   r = amdgpu_vce_resume(adev);
> >>>>                   if (r)
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> >>>> index 3f15bf34123a..df34be8ec82d 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> >>>> @@ -34,6 +34,8 @@
> >>>>    #include "vcn/vcn_3_0_0_sh_mask.h"
> >>>>    #include "ivsrcid/vcn/irqsrcs_vcn_2_0.h"
> >>>>
> >>>> +#include <drm/drm_drv.h>
> >>>> +
> >>>>    #define mmUVD_CONTEXT_ID_INTERNAL_OFFSET                       0x27
> >>>>    #define mmUVD_GPCOM_VCPU_CMD_INTERNAL_OFFSET                   0x0f
> >>>>    #define mmUVD_GPCOM_VCPU_DATA0_INTERNAL_OFFSET                 0x10
> >>>> @@ -268,16 +270,20 @@ static int vcn_v3_0_sw_init(void *handle)
> >>>>    static int vcn_v3_0_sw_fini(void *handle)
> >>>>    {
> >>>>           struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >>>> -       int i, r;
> >>>> +       int i, r, idx;
> >>>>
> >>>> -       for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
> >>>> -               volatile struct amdgpu_fw_shared *fw_shared;
> >>>> +       if (drm_dev_enter(&adev->ddev, &idx)) {
> >>>> +               for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
> >>>> +                       volatile struct amdgpu_fw_shared *fw_shared;
> >>>>
> >>>> -               if (adev->vcn.harvest_config & (1 << i))
> >>>> -                       continue;
> >>>> -               fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
> >>>> -               fw_shared->present_flag_0 = 0;
> >>>> -               fw_shared->sw_ring.is_enabled = false;
> >>>> +                       if (adev->vcn.harvest_config & (1 << i))
> >>>> +                               continue;
> >>>> +                       fw_shared = adev->vcn.inst[i].fw_shared_cpu_addr;
> >>>> +                       fw_shared->present_flag_0 = 0;
> >>>> +                       fw_shared->sw_ring.is_enabled = false;
> >>>> +               }
> >>>> +
> >>>> +               drm_dev_exit(idx);
> >>>>           }
> >>>>
> >>>>           if (amdgpu_sriov_vf(adev))
> >>>> --
> >>>> 2.25.1
> >>>>
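
For readers skimming the diff: every hunk above applies the same guard
pattern around code that touches the hardware or device-backed memory.
A minimal sketch of that pattern (simplified; example_hw_access is a
made-up name, and the error value is one plausible choice, since the
patch returns 0 from some hooks and -ENODEV from others):

    static int example_hw_access(struct amdgpu_device *adev)
    {
            int idx;

            /* Returns false once drm_dev_unplug() has been called */
            if (!drm_dev_enter(&adev->ddev, &idx))
                    return -ENODEV; /* device is gone, skip the access */

            /* MMIO writes / firmware copies are safe in this section */

            drm_dev_exit(idx);
            return 0;
    }

drm_dev_enter()/drm_dev_exit() delimit an SRCU read-side section, which
lets drm_dev_unplug() synchronize against all in-flight accessors before
the device is torn down.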


* Re: [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case.
  2021-05-12 14:26 ` [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
@ 2021-05-14 14:41   ` Andrey Grodzovsky
  2021-05-14 16:25     ` Felix Kuehling
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-14 14:41 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Ping

Andrey

On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
> Handle all DMA IOMMU group related dependencies before the
> group is removed.
> 
> v5: Drop IOMMU notifier and switch to lockless call to ttm_tt_unpopulate
> v6: Drop the BO unmap list
> v7:
> Drop amdgpu_gart_fini
> In amdgpu_ih_ring_fini do an unconditional check (!ih->ring)
> to avoid freeing uninitialized rings.
> Call amdgpu_ih_ring_fini unconditionally.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 14 +-------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c     |  6 ++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  5 +++++
>   drivers/gpu/drm/amd/amdgpu/cik_ih.c        |  1 -
>   drivers/gpu/drm/amd/amdgpu/cz_ih.c         |  1 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c     |  1 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |  1 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |  1 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |  1 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  1 -
>   drivers/gpu/drm/amd/amdgpu/iceland_ih.c    |  1 -
>   drivers/gpu/drm/amd/amdgpu/navi10_ih.c     |  4 ----
>   drivers/gpu/drm/amd/amdgpu/si_ih.c         |  1 -
>   drivers/gpu/drm/amd/amdgpu/tonga_ih.c      |  1 -
>   drivers/gpu/drm/amd/amdgpu/vega10_ih.c     |  4 ----
>   drivers/gpu/drm/amd/amdgpu/vega20_ih.c     |  4 ----
>   18 files changed, 13 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 18598eda18f6..a0bff4713672 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3256,7 +3256,6 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>   	NULL
>   };
>   
> -
>   /**
>    * amdgpu_device_init - initialize the driver
>    *
> @@ -3698,12 +3697,13 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>   		amdgpu_ucode_sysfs_fini(adev);
>   	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
>   
> -
>   	amdgpu_fbdev_fini(adev);
>   
>   	amdgpu_irq_fini_hw(adev);
>   
>   	amdgpu_device_ip_fini_early(adev);
> +
> +	amdgpu_gart_dummy_page_fini(adev);
>   }
>   
>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> index c5a9a4fb10d2..6460cf723f0a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> @@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct amdgpu_device *adev)
>    *
>    * Frees the dummy page used by the driver (all asics).
>    */
> -static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
>   {
>   	if (!adev->dummy_page_addr)
>   		return;
> @@ -365,15 +365,3 @@ int amdgpu_gart_init(struct amdgpu_device *adev)
>   
>   	return 0;
>   }
> -
> -/**
> - * amdgpu_gart_fini - tear down the driver info for managing the gart
> - *
> - * @adev: amdgpu_device pointer
> - *
> - * Tear down the gart driver info and free the dummy page (all asics).
> - */
> -void amdgpu_gart_fini(struct amdgpu_device *adev)
> -{
> -	amdgpu_gart_dummy_page_fini(adev);
> -}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> index a25fe97b0196..030b9d4c736a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> @@ -57,7 +57,7 @@ void amdgpu_gart_table_vram_free(struct amdgpu_device *adev);
>   int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
>   void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev);
>   int amdgpu_gart_init(struct amdgpu_device *adev);
> -void amdgpu_gart_fini(struct amdgpu_device *adev);
> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev);
>   int amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
>   		       int pages);
>   int amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> index faaa6aa2faaf..433469ace6f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> @@ -115,9 +115,11 @@ int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
>    */
>   void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
>   {
> +
> +	if (!ih->ring)
> +		return;
> +
>   	if (ih->use_bus_addr) {
> -		if (!ih->ring)
> -			return;
>   
>   		/* add 8 bytes for the rptr/wptr shadows and
>   		 * add them to the end of the ring allocation.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index 233b64dab94b..32ce0e679dc7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -361,6 +361,11 @@ void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
>   		if (!amdgpu_device_has_dc_support(adev))
>   			flush_work(&adev->hotplug_work);
>   	}
> +
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
> index 183d44a6583c..df385ffc9768 100644
> --- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
> @@ -310,7 +310,6 @@ static int cik_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   	amdgpu_irq_remove_domain(adev);
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> index d32743949003..b8c47e0cf37a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> @@ -302,7 +302,6 @@ static int cz_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   	amdgpu_irq_remove_domain(adev);
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index 2bfd620576f2..5e8bfcdd422e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -954,7 +954,6 @@ static int gmc_v10_0_sw_init(void *handle)
>   static void gmc_v10_0_gart_fini(struct amdgpu_device *adev)
>   {
>   	amdgpu_gart_table_vram_free(adev);
> -	amdgpu_gart_fini(adev);
>   }
>   
>   static int gmc_v10_0_sw_fini(void *handle)
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
> index 405d6ad09022..0e81e03e9b49 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
> @@ -898,7 +898,6 @@ static int gmc_v6_0_sw_fini(void *handle)
>   	amdgpu_vm_manager_fini(adev);
>   	amdgpu_gart_table_vram_free(adev);
>   	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>   	release_firmware(adev->gmc.fw);
>   	adev->gmc.fw = NULL;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
> index 210ada2289ec..0795ea736573 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
> @@ -1085,7 +1085,6 @@ static int gmc_v7_0_sw_fini(void *handle)
>   	kfree(adev->gmc.vm_fault_info);
>   	amdgpu_gart_table_vram_free(adev);
>   	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>   	release_firmware(adev->gmc.fw);
>   	adev->gmc.fw = NULL;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> index c1bd190841f8..dbf2e5472069 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> @@ -1194,7 +1194,6 @@ static int gmc_v8_0_sw_fini(void *handle)
>   	kfree(adev->gmc.vm_fault_info);
>   	amdgpu_gart_table_vram_free(adev);
>   	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>   	release_firmware(adev->gmc.fw);
>   	adev->gmc.fw = NULL;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index c82d82da2c73..5ed0adae05cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1601,7 +1601,6 @@ static int gmc_v9_0_sw_fini(void *handle)
>   	amdgpu_gart_table_vram_free(adev);
>   	amdgpu_bo_unref(&adev->gmc.pdb0_bo);
>   	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> index da96c6013477..ddfe4eaeea05 100644
> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> @@ -301,7 +301,6 @@ static int iceland_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   	amdgpu_irq_remove_domain(adev);
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> index 5eea4550b856..941d464a2b47 100644
> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> @@ -570,10 +570,6 @@ static int navi10_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c b/drivers/gpu/drm/amd/amdgpu/si_ih.c
> index 751307f3252c..9a24f17a5750 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
> @@ -176,7 +176,6 @@ static int si_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> index 973d80ec7f6c..b08905d1c00f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> @@ -313,7 +313,6 @@ static int tonga_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   	amdgpu_irq_remove_domain(adev);
>   
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> index dead9c2fbd4c..32ec4b8e806a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> @@ -514,10 +514,6 @@ static int vega10_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> index 58993ae1fe11..f51dfc38ac65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> @@ -566,10 +566,6 @@ static int vega20_ih_sw_fini(void *handle)
>   	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   
>   	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>   
>   	return 0;
>   }
> 
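
Condensed, the IH side of this patch amounts to two cooperating changes:
the !ih->ring check moves to the top of the teardown helper so it can be
called unconditionally, and all rings are then freed early from the hot
teardown path. A sketch (not new code, just the two hunks above side by
side, with unrelated details elided):

    void amdgpu_ih_ring_fini(struct amdgpu_device *adev,
                             struct amdgpu_ih_ring *ih)
    {
            if (!ih->ring)  /* never initialized, or already freed */
                    return;
            /* free the bus-addr or BO backed ring as before */
    }

    void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
    {
            /* ... existing hotplug_work flushing ... */
            amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
            amdgpu_ih_ring_fini(adev, &adev->irq.ih);
            amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
            amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
    }

With the early-return guard in place, the per-IP sw_fini callbacks no
longer need their own amdgpu_ih_ring_fini() calls, which is why the rest
of the diff is deletions.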


* Re: [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.
  2021-05-12 14:26 ` [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal Andrey Grodzovsky
@ 2021-05-14 14:42   ` Andrey Grodzovsky
  2021-05-17 14:40     ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-14 14:42 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Ping

Andrey

On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
> If the device is removed while commands are still in flight, you
> cannot wait to flush the HW fences on a ring since the device is gone.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 ++++++++++------
>   1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 1ffb36bd0b19..fa03702ecbfb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -36,6 +36,7 @@
>   #include <linux/firmware.h>
>   #include <linux/pm_runtime.h>
>   
> +#include <drm/drm_drv.h>
>   #include "amdgpu.h"
>   #include "amdgpu_trace.h"
>   
> @@ -525,8 +526,7 @@ int amdgpu_fence_driver_init(struct amdgpu_device *adev)
>    */
>   void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
>   {
> -	unsigned i, j;
> -	int r;
> +	int i, r;
>   
>   	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>   		struct amdgpu_ring *ring = adev->rings[i];
> @@ -535,11 +535,15 @@ void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
>   			continue;
>   		if (!ring->no_scheduler)
>   			drm_sched_fini(&ring->sched);
> -		r = amdgpu_fence_wait_empty(ring);
> -		if (r) {
> -			/* no need to trigger GPU reset as we are unloading */
> +		/* You can't wait for HW to signal if it's gone */
> +		if (!drm_dev_is_unplugged(&adev->ddev))
> +			r = amdgpu_fence_wait_empty(ring);
> +		else
> +			r = -ENODEV;
> +		/* no need to trigger GPU reset as we are unloading */
> +		if (r)
>   			amdgpu_fence_driver_force_completion(ring);
> -		}
> +
>   		if (ring->fence_drv.irq_src)
>   			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>   				       ring->fence_drv.irq_type);
> 
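
Condensed, the new teardown decision in the hunk above is:

    /* Sketch: if the device was unplugged the HW can never signal,
     * so skip the wait and force-complete the fences in SW instead. */
    if (!drm_dev_is_unplugged(&adev->ddev))
            r = amdgpu_fence_wait_empty(ring);
    else
            r = -ENODEV;

    if (r)  /* the wait failed, or the device is gone */
            amdgpu_fence_driver_force_completion(ring);

amdgpu_fence_driver_force_completion() signals the outstanding fences
from the CPU side, so waiters are released even though the ring will
never write its sequence number again.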


* Re: [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings
  2021-05-12 14:26 ` [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings Andrey Grodzovsky
@ 2021-05-14 14:42   ` Andrey Grodzovsky
  2021-05-17 14:41     ` Andrey Grodzovsky
  2021-05-17 17:43   ` Alex Deucher
  1 sibling, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-14 14:42 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Ping

Andrey

On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
> Access to MMIO mappings must be prevented post pci_remove
> 
> v6: Drop BOs list, unmapping VRAM BAR is enough.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
>   3 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f7cca25c0fa0..73cbc3c7453f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3666,6 +3666,25 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> +static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
> +{
> +	/* Clear all CPU mappings pointing to this device */
> +	unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
> +
> +	/* Unmap all mapped bars - Doorbell, registers and VRAM */
> +	amdgpu_device_doorbell_fini(adev);
> +
> +	iounmap(adev->rmmio);
> +	adev->rmmio = NULL;
> +	if (adev->mman.aper_base_kaddr)
> +		iounmap(adev->mman.aper_base_kaddr);
> +	adev->mman.aper_base_kaddr = NULL;
> +
> +	/* Memory manager related */
> +	arch_phys_wc_del(adev->gmc.vram_mtrr);
> +	arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
> +}
> +
>   /**
>    * amdgpu_device_fini - tear down the driver
>    *
> @@ -3712,6 +3731,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>   	amdgpu_device_ip_fini_early(adev);
>   
>   	amdgpu_gart_dummy_page_fini(adev);
> +
> +	amdgpu_device_unmap_mmio(adev);
>   }
>   
>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> @@ -3739,9 +3760,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>   	}
>   	if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>   		vga_client_register(adev->pdev, NULL, NULL, NULL);
> -	iounmap(adev->rmmio);
> -	adev->rmmio = NULL;
> -	amdgpu_device_doorbell_fini(adev);
>   
>   	if (IS_ENABLED(CONFIG_PERF_EVENTS))
>   		amdgpu_pmu_fini(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 0adffcace326..882fb49f3c41 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -533,6 +533,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
>   		return -ENOMEM;
>   	drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
>   	INIT_LIST_HEAD(&bo->shadow_list);
> +
>   	bo->vm_bo = NULL;
>   	bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
>   		bp->domain;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 0d54e70278ca..58ad2fecc9e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>   	amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
>   	amdgpu_ttm_fw_reserve_vram_fini(adev);
>   
> -	if (adev->mman.aper_base_kaddr)
> -		iounmap(adev->mman.aper_base_kaddr);
> -	adev->mman.aper_base_kaddr = NULL;
> -
>   	amdgpu_vram_mgr_fini(adev);
>   	amdgpu_gtt_mgr_fini(adev);
>   	ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
> 
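
Condensed, the ordering inside the new amdgpu_device_unmap_mmio() is:

    /* Sketch: revoke userspace CPU mappings of the BARs first, then
     * tear down the kernel-side ioremaps and memtype bookkeeping. */
    unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);

    amdgpu_device_doorbell_fini(adev);            /* doorbell BAR  */
    iounmap(adev->rmmio);                         /* register BAR  */
    adev->rmmio = NULL;
    if (adev->mman.aper_base_kaddr)
            iounmap(adev->mman.aper_base_kaddr);  /* visible VRAM  */
    adev->mman.aper_base_kaddr = NULL;

    arch_phys_wc_del(adev->gmc.vram_mtrr);
    arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);

Hoisting the iounmap() calls out of amdgpu_device_fini_sw() and
amdgpu_ttm_fini() into this hw-fini path is what keeps later software
teardown from touching stale kernel mappings after pci_remove.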


* Re: [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case.
  2021-05-14 14:41   ` Andrey Grodzovsky
@ 2021-05-14 16:25     ` Felix Kuehling
  2021-05-14 16:26       ` Andrey Grodzovsky
  2021-05-17 14:38       ` [PATCH] " Andrey Grodzovsky
  0 siblings, 2 replies; 64+ messages in thread
From: Felix Kuehling @ 2021-05-14 16:25 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: Alexander.Deucher, gregkh, helgaas

Maybe this patch needs a better explanation of how the GART and IH changes
relate to IOMMU, or of what problem this is meant to fix. Just looking
at the patch I don't see the connection. It looks like just a bunch of
refactoring to me.

Regards,
  Felix


Am 2021-05-14 um 10:41 a.m. schrieb Andrey Grodzovsky:
> Ping
>
> Andrey
>
> On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
>> Handle all DMA IOMMU group related dependencies before the
>> group is removed.
>>
>> v5: Drop IOMMU notifier and switch to lockless call to ttm_tt_unpopulate
>> v6: Drop the BO unmap list
>> v7:
>> Drop amdgpu_gart_fini
>> In amdgpu_ih_ring_fini do an unconditional check (!ih->ring)
>> to avoid freeing uninitialized rings.
>> Call amdgpu_ih_ring_fini unconditionally.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 14 +-------------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c     |  6 ++++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  5 +++++
>>   drivers/gpu/drm/amd/amdgpu/cik_ih.c        |  1 -
>>   drivers/gpu/drm/amd/amdgpu/cz_ih.c         |  1 -
>>   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c     |  1 -
>>   drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |  1 -
>>   drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |  1 -
>>   drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |  1 -
>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  1 -
>>   drivers/gpu/drm/amd/amdgpu/iceland_ih.c    |  1 -
>>   drivers/gpu/drm/amd/amdgpu/navi10_ih.c     |  4 ----
>>   drivers/gpu/drm/amd/amdgpu/si_ih.c         |  1 -
>>   drivers/gpu/drm/amd/amdgpu/tonga_ih.c      |  1 -
>>   drivers/gpu/drm/amd/amdgpu/vega10_ih.c     |  4 ----
>>   drivers/gpu/drm/amd/amdgpu/vega20_ih.c     |  4 ----
>>   18 files changed, 13 insertions(+), 40 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 18598eda18f6..a0bff4713672 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3256,7 +3256,6 @@ static const struct attribute
>> *amdgpu_dev_attributes[] = {
>>       NULL
>>   };
>>   -
>>   /**
>>    * amdgpu_device_init - initialize the driver
>>    *
>> @@ -3698,12 +3697,13 @@ void amdgpu_device_fini_hw(struct
>> amdgpu_device *adev)
>>           amdgpu_ucode_sysfs_fini(adev);
>>       sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
>>   -
>>       amdgpu_fbdev_fini(adev);
>>         amdgpu_irq_fini_hw(adev);
>>         amdgpu_device_ip_fini_early(adev);
>> +
>> +    amdgpu_gart_dummy_page_fini(adev);
>>   }
>>     void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> index c5a9a4fb10d2..6460cf723f0a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> @@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct
>> amdgpu_device *adev)
>>    *
>>    * Frees the dummy page used by the driver (all asics).
>>    */
>> -static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
>> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
>>   {
>>       if (!adev->dummy_page_addr)
>>           return;
>> @@ -365,15 +365,3 @@ int amdgpu_gart_init(struct amdgpu_device *adev)
>>         return 0;
>>   }
>> -
>> -/**
>> - * amdgpu_gart_fini - tear down the driver info for managing the gart
>> - *
>> - * @adev: amdgpu_device pointer
>> - *
>> - * Tear down the gart driver info and free the dummy page (all asics).
>> - */
>> -void amdgpu_gart_fini(struct amdgpu_device *adev)
>> -{
>> -    amdgpu_gart_dummy_page_fini(adev);
>> -}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>> index a25fe97b0196..030b9d4c736a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>> @@ -57,7 +57,7 @@ void amdgpu_gart_table_vram_free(struct
>> amdgpu_device *adev);
>>   int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
>>   void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev);
>>   int amdgpu_gart_init(struct amdgpu_device *adev);
>> -void amdgpu_gart_fini(struct amdgpu_device *adev);
>> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev);
>>   int amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
>>                  int pages);
>>   int amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>> index faaa6aa2faaf..433469ace6f4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>> @@ -115,9 +115,11 @@ int amdgpu_ih_ring_init(struct amdgpu_device
>> *adev, struct amdgpu_ih_ring *ih,
>>    */
>>   void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct
>> amdgpu_ih_ring *ih)
>>   {
>> +
>> +    if (!ih->ring)
>> +        return;
>> +
>>       if (ih->use_bus_addr) {
>> -        if (!ih->ring)
>> -            return;
>>             /* add 8 bytes for the rptr/wptr shadows and
>>            * add them to the end of the ring allocation.
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> index 233b64dab94b..32ce0e679dc7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> @@ -361,6 +361,11 @@ void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
>>           if (!amdgpu_device_has_dc_support(adev))
>>               flush_work(&adev->hotplug_work);
>>       }
>> +
>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>>   }
>>     /**
>> diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>> index 183d44a6583c..df385ffc9768 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>> @@ -310,7 +310,6 @@ static int cik_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>       amdgpu_irq_remove_domain(adev);
>>         return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>> index d32743949003..b8c47e0cf37a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>> @@ -302,7 +302,6 @@ static int cz_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>       amdgpu_irq_remove_domain(adev);
>>         return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> index 2bfd620576f2..5e8bfcdd422e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> @@ -954,7 +954,6 @@ static int gmc_v10_0_sw_init(void *handle)
>>   static void gmc_v10_0_gart_fini(struct amdgpu_device *adev)
>>   {
>>       amdgpu_gart_table_vram_free(adev);
>> -    amdgpu_gart_fini(adev);
>>   }
>>     static int gmc_v10_0_sw_fini(void *handle)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>> index 405d6ad09022..0e81e03e9b49 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>> @@ -898,7 +898,6 @@ static int gmc_v6_0_sw_fini(void *handle)
>>       amdgpu_vm_manager_fini(adev);
>>       amdgpu_gart_table_vram_free(adev);
>>       amdgpu_bo_fini(adev);
>> -    amdgpu_gart_fini(adev);
>>       release_firmware(adev->gmc.fw);
>>       adev->gmc.fw = NULL;
>>   diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>> index 210ada2289ec..0795ea736573 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>> @@ -1085,7 +1085,6 @@ static int gmc_v7_0_sw_fini(void *handle)
>>       kfree(adev->gmc.vm_fault_info);
>>       amdgpu_gart_table_vram_free(adev);
>>       amdgpu_bo_fini(adev);
>> -    amdgpu_gart_fini(adev);
>>       release_firmware(adev->gmc.fw);
>>       adev->gmc.fw = NULL;
>>   diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>> index c1bd190841f8..dbf2e5472069 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>> @@ -1194,7 +1194,6 @@ static int gmc_v8_0_sw_fini(void *handle)
>>       kfree(adev->gmc.vm_fault_info);
>>       amdgpu_gart_table_vram_free(adev);
>>       amdgpu_bo_fini(adev);
>> -    amdgpu_gart_fini(adev);
>>       release_firmware(adev->gmc.fw);
>>       adev->gmc.fw = NULL;
>>   diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index c82d82da2c73..5ed0adae05cf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -1601,7 +1601,6 @@ static int gmc_v9_0_sw_fini(void *handle)
>>       amdgpu_gart_table_vram_free(adev);
>>       amdgpu_bo_unref(&adev->gmc.pdb0_bo);
>>       amdgpu_bo_fini(adev);
>> -    amdgpu_gart_fini(adev);
>>         return 0;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>> index da96c6013477..ddfe4eaeea05 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>> @@ -301,7 +301,6 @@ static int iceland_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>       amdgpu_irq_remove_domain(adev);
>>         return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>> index 5eea4550b856..941d464a2b47 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>> @@ -570,10 +570,6 @@ static int navi10_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>         return 0;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>> index 751307f3252c..9a24f17a5750 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>> @@ -176,7 +176,6 @@ static int si_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>         return 0;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>> index 973d80ec7f6c..b08905d1c00f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>> @@ -313,7 +313,6 @@ static int tonga_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>       amdgpu_irq_remove_domain(adev);
>>         return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>> index dead9c2fbd4c..32ec4b8e806a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>> @@ -514,10 +514,6 @@ static int vega10_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>         return 0;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>> b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>> index 58993ae1fe11..f51dfc38ac65 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>> @@ -566,10 +566,6 @@ static int vega20_ih_sw_fini(void *handle)
>>       struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>         amdgpu_irq_fini_sw(adev);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>         return 0;
>>   }
>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case.
  2021-05-14 16:25     ` Felix Kuehling
@ 2021-05-14 16:26       ` Andrey Grodzovsky
  2021-05-17 14:38       ` [PATCH] " Andrey Grodzovsky
  1 sibling, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-14 16:26 UTC (permalink / raw)
  To: Felix Kuehling, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: Alexander.Deucher, gregkh, helgaas

Makes sense - will update.

Andrey

On 2021-05-14 12:25 p.m., Felix Kuehling wrote:
> Maybe this patch needs a better explanation of how the GART and IH changes
> relate to IOMMU, or of what problem this is meant to fix. Just looking
> at the patch I don't see the connection. Looks like just a bunch of
> refactoring to me.
> 
> Regards,
>    Felix
> 
> 
On 2021-05-14 10:41 a.m., Andrey Grodzovsky wrote:
>> Ping
>>
>> Andrey
>>
>> On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
>>> Handle all DMA IOMMU group related dependencies before the
>>> group is removed.
>>>
>>> v5: Drop IOMMU notifier and switch to lockless call to ttm_tt_unpopulate
>>> v6: Drop the BO unmap list
>>> v7:
>>> Drop amdgpu_gart_fini
>>> In amdgpu_ih_ring_fini do an unconditional check (!ih->ring)
>>> to avoid freeing uninitialized rings.
>>> Call amdgpu_ih_ring_fini unconditionally.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 14 +-------------
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  2 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c     |  6 ++++--
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  5 +++++
>>>    drivers/gpu/drm/amd/amdgpu/cik_ih.c        |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/cz_ih.c         |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c     |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/iceland_ih.c    |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/navi10_ih.c     |  4 ----
>>>    drivers/gpu/drm/amd/amdgpu/si_ih.c         |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/tonga_ih.c      |  1 -
>>>    drivers/gpu/drm/amd/amdgpu/vega10_ih.c     |  4 ----
>>>    drivers/gpu/drm/amd/amdgpu/vega20_ih.c     |  4 ----
>>>    18 files changed, 13 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 18598eda18f6..a0bff4713672 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3256,7 +3256,6 @@ static const struct attribute
>>> *amdgpu_dev_attributes[] = {
>>>        NULL
>>>    };
>>>    -
>>>    /**
>>>     * amdgpu_device_init - initialize the driver
>>>     *
>>> @@ -3698,12 +3697,13 @@ void amdgpu_device_fini_hw(struct
>>> amdgpu_device *adev)
>>>            amdgpu_ucode_sysfs_fini(adev);
>>>        sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
>>>    -
>>>        amdgpu_fbdev_fini(adev);
>>>          amdgpu_irq_fini_hw(adev);
>>>          amdgpu_device_ip_fini_early(adev);
>>> +
>>> +    amdgpu_gart_dummy_page_fini(adev);
>>>    }
>>>      void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> index c5a9a4fb10d2..6460cf723f0a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> @@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct
>>> amdgpu_device *adev)
>>>     *
>>>     * Frees the dummy page used by the driver (all asics).
>>>     */
>>> -static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
>>> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
>>>    {
>>>        if (!adev->dummy_page_addr)
>>>            return;
>>> @@ -365,15 +365,3 @@ int amdgpu_gart_init(struct amdgpu_device *adev)
>>>          return 0;
>>>    }
>>> -
>>> -/**
>>> - * amdgpu_gart_fini - tear down the driver info for managing the gart
>>> - *
>>> - * @adev: amdgpu_device pointer
>>> - *
>>> - * Tear down the gart driver info and free the dummy page (all asics).
>>> - */
>>> -void amdgpu_gart_fini(struct amdgpu_device *adev)
>>> -{
>>> -    amdgpu_gart_dummy_page_fini(adev);
>>> -}
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>>> index a25fe97b0196..030b9d4c736a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
>>> @@ -57,7 +57,7 @@ void amdgpu_gart_table_vram_free(struct
>>> amdgpu_device *adev);
>>>    int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
>>>    void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev);
>>>    int amdgpu_gart_init(struct amdgpu_device *adev);
>>> -void amdgpu_gart_fini(struct amdgpu_device *adev);
>>> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev);
>>>    int amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
>>>                   int pages);
>>>    int amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> index faaa6aa2faaf..433469ace6f4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> @@ -115,9 +115,11 @@ int amdgpu_ih_ring_init(struct amdgpu_device
>>> *adev, struct amdgpu_ih_ring *ih,
>>>     */
>>>    void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct
>>> amdgpu_ih_ring *ih)
>>>    {
>>> +
>>> +    if (!ih->ring)
>>> +        return;
>>> +
>>>        if (ih->use_bus_addr) {
>>> -        if (!ih->ring)
>>> -            return;
>>>              /* add 8 bytes for the rptr/wptr shadows and
>>>             * add them to the end of the ring allocation.
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>> index 233b64dab94b..32ce0e679dc7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>>> @@ -361,6 +361,11 @@ void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
>>>            if (!amdgpu_device_has_dc_support(adev))
>>>                flush_work(&adev->hotplug_work);
>>>        }
>>> +
>>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>>> +    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>>>    }
>>>      /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>>> index 183d44a6583c..df385ffc9768 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
>>> @@ -310,7 +310,6 @@ static int cik_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>        amdgpu_irq_remove_domain(adev);
>>>          return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>> index d32743949003..b8c47e0cf37a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>> @@ -302,7 +302,6 @@ static int cz_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>        amdgpu_irq_remove_domain(adev);
>>>          return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>> index 2bfd620576f2..5e8bfcdd422e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>> @@ -954,7 +954,6 @@ static int gmc_v10_0_sw_init(void *handle)
>>>    static void gmc_v10_0_gart_fini(struct amdgpu_device *adev)
>>>    {
>>>        amdgpu_gart_table_vram_free(adev);
>>> -    amdgpu_gart_fini(adev);
>>>    }
>>>      static int gmc_v10_0_sw_fini(void *handle)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>>> index 405d6ad09022..0e81e03e9b49 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>>> @@ -898,7 +898,6 @@ static int gmc_v6_0_sw_fini(void *handle)
>>>        amdgpu_vm_manager_fini(adev);
>>>        amdgpu_gart_table_vram_free(adev);
>>>        amdgpu_bo_fini(adev);
>>> -    amdgpu_gart_fini(adev);
>>>        release_firmware(adev->gmc.fw);
>>>        adev->gmc.fw = NULL;
>>>    diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>> index 210ada2289ec..0795ea736573 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>> @@ -1085,7 +1085,6 @@ static int gmc_v7_0_sw_fini(void *handle)
>>>        kfree(adev->gmc.vm_fault_info);
>>>        amdgpu_gart_table_vram_free(adev);
>>>        amdgpu_bo_fini(adev);
>>> -    amdgpu_gart_fini(adev);
>>>        release_firmware(adev->gmc.fw);
>>>        adev->gmc.fw = NULL;
>>>    diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>> index c1bd190841f8..dbf2e5472069 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>> @@ -1194,7 +1194,6 @@ static int gmc_v8_0_sw_fini(void *handle)
>>>        kfree(adev->gmc.vm_fault_info);
>>>        amdgpu_gart_table_vram_free(adev);
>>>        amdgpu_bo_fini(adev);
>>> -    amdgpu_gart_fini(adev);
>>>        release_firmware(adev->gmc.fw);
>>>        adev->gmc.fw = NULL;
>>>    diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> index c82d82da2c73..5ed0adae05cf 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> @@ -1601,7 +1601,6 @@ static int gmc_v9_0_sw_fini(void *handle)
>>>        amdgpu_gart_table_vram_free(adev);
>>>        amdgpu_bo_unref(&adev->gmc.pdb0_bo);
>>>        amdgpu_bo_fini(adev);
>>> -    amdgpu_gart_fini(adev);
>>>          return 0;
>>>    }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>> index da96c6013477..ddfe4eaeea05 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>> @@ -301,7 +301,6 @@ static int iceland_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>        amdgpu_irq_remove_domain(adev);
>>>          return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>> index 5eea4550b856..941d464a2b47 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>> @@ -570,10 +570,6 @@ static int navi10_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>          return 0;
>>>    }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>> index 751307f3252c..9a24f17a5750 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>> @@ -176,7 +176,6 @@ static int si_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>          return 0;
>>>    }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>> index 973d80ec7f6c..b08905d1c00f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>> @@ -313,7 +313,6 @@ static int tonga_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>        amdgpu_irq_remove_domain(adev);
>>>          return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>> index dead9c2fbd4c..32ec4b8e806a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>> @@ -514,10 +514,6 @@ static int vega10_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>          return 0;
>>>    }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>> index 58993ae1fe11..f51dfc38ac65 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>> @@ -566,10 +566,6 @@ static int vega20_ih_sw_fini(void *handle)
>>>        struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>>          amdgpu_irq_fini_sw(adev);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>>> -    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>>>          return 0;
>>>    }
>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH] drm/amdgpu: Handle IOMMU enabled case.
  2021-05-14 16:25     ` Felix Kuehling
  2021-05-14 16:26       ` Andrey Grodzovsky
@ 2021-05-17 14:38       ` Andrey Grodzovsky
  2021-05-17 14:48         ` Felix Kuehling
  1 sibling, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-17 14:38 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

Problem:
Handle all DMA IOMMU group related dependencies before the
group is removed. The dependency manifests as follows: with IOMMU
enabled, DMA map/unmap depends on the presence of the IOMMU group
the device belongs to, but this group is released once the device
is removed from the PCI topology.

Fix:
Expedite all such unmap operations into the PCI remove driver callback.
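
For illustration, the ordering constraint looks roughly like this
(a simplified sketch, not the actual remove path; only the two fini
helpers named below come from this patch, the wrapper is made up):

static void pci_remove_ordering_sketch(struct pci_dev *pdev)
{
	struct drm_device *dev = pci_get_drvdata(pdev);
	struct amdgpu_device *adev = drm_to_adev(dev);

	/* Safe: inside the pci_driver remove() callback the device's
	 * IOMMU group still exists, so DMA unmaps still go through. */
	amdgpu_irq_fini_hw(adev);          /* frees/unmaps the IH rings  */
	amdgpu_gart_dummy_page_fini(adev); /* frees the GART dummy page  */

	/* Once remove() returns, the IOMMU group is released; a DMA
	 * unmap deferred to the later sw fini would have no group. */
}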

v5: Drop IOMMU notifier and switch to lockless call to ttm_tt_unpopulate
v6: Drop the BO unmap list
v7:
Drop amdgpu_gart_fini
In amdgpu_ih_ring_fini do an unconditional check (!ih->ring)
to avoid freeing uninitialized rings.
Call amdgpu_ih_ring_fini unconditionally.
v8: Add detailed explanation

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 14 +-------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c     |  6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  5 +++++
 drivers/gpu/drm/amd/amdgpu/cik_ih.c        |  1 -
 drivers/gpu/drm/amd/amdgpu/cz_ih.c         |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c     |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/iceland_ih.c    |  1 -
 drivers/gpu/drm/amd/amdgpu/navi10_ih.c     |  4 ----
 drivers/gpu/drm/amd/amdgpu/si_ih.c         |  1 -
 drivers/gpu/drm/amd/amdgpu/tonga_ih.c      |  1 -
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c     |  4 ----
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c     |  4 ----
 18 files changed, 13 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 18598eda18f6..a0bff4713672 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3256,7 +3256,6 @@ static const struct attribute *amdgpu_dev_attributes[] = {
 	NULL
 };
 
-
 /**
  * amdgpu_device_init - initialize the driver
  *
@@ -3698,12 +3697,13 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 		amdgpu_ucode_sysfs_fini(adev);
 	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
 
-
 	amdgpu_fbdev_fini(adev);
 
 	amdgpu_irq_fini_hw(adev);
 
 	amdgpu_device_ip_fini_early(adev);
+
+	amdgpu_gart_dummy_page_fini(adev);
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index c5a9a4fb10d2..6460cf723f0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct amdgpu_device *adev)
  *
  * Frees the dummy page used by the driver (all asics).
  */
-static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
+void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
 {
 	if (!adev->dummy_page_addr)
 		return;
@@ -365,15 +365,3 @@ int amdgpu_gart_init(struct amdgpu_device *adev)
 
 	return 0;
 }
-
-/**
- * amdgpu_gart_fini - tear down the driver info for managing the gart
- *
- * @adev: amdgpu_device pointer
- *
- * Tear down the gart driver info and free the dummy page (all asics).
- */
-void amdgpu_gart_fini(struct amdgpu_device *adev)
-{
-	amdgpu_gart_dummy_page_fini(adev);
-}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
index a25fe97b0196..030b9d4c736a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
@@ -57,7 +57,7 @@ void amdgpu_gart_table_vram_free(struct amdgpu_device *adev);
 int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
 void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev);
 int amdgpu_gart_init(struct amdgpu_device *adev);
-void amdgpu_gart_fini(struct amdgpu_device *adev);
+void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev);
 int amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
 		       int pages);
 int amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index faaa6aa2faaf..433469ace6f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -115,9 +115,11 @@ int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
  */
 void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
 {
+
+	if (!ih->ring)
+		return;
+
 	if (ih->use_bus_addr) {
-		if (!ih->ring)
-			return;
 
 		/* add 8 bytes for the rptr/wptr shadows and
 		 * add them to the end of the ring allocation.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 233b64dab94b..32ce0e679dc7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -361,6 +361,11 @@ void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
 		if (!amdgpu_device_has_dc_support(adev))
 			flush_work(&adev->hotplug_work);
 	}
+
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
+	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
index 183d44a6583c..df385ffc9768 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
@@ -310,7 +310,6 @@ static int cik_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
index d32743949003..b8c47e0cf37a 100644
--- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
@@ -302,7 +302,6 @@ static int cz_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 2bfd620576f2..5e8bfcdd422e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -954,7 +954,6 @@ static int gmc_v10_0_sw_init(void *handle)
 static void gmc_v10_0_gart_fini(struct amdgpu_device *adev)
 {
 	amdgpu_gart_table_vram_free(adev);
-	amdgpu_gart_fini(adev);
 }
 
 static int gmc_v10_0_sw_fini(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
index 405d6ad09022..0e81e03e9b49 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
@@ -898,7 +898,6 @@ static int gmc_v6_0_sw_fini(void *handle)
 	amdgpu_vm_manager_fini(adev);
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 	release_firmware(adev->gmc.fw);
 	adev->gmc.fw = NULL;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
index 210ada2289ec..0795ea736573 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
@@ -1085,7 +1085,6 @@ static int gmc_v7_0_sw_fini(void *handle)
 	kfree(adev->gmc.vm_fault_info);
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 	release_firmware(adev->gmc.fw);
 	adev->gmc.fw = NULL;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index c1bd190841f8..dbf2e5472069 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -1194,7 +1194,6 @@ static int gmc_v8_0_sw_fini(void *handle)
 	kfree(adev->gmc.vm_fault_info);
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 	release_firmware(adev->gmc.fw);
 	adev->gmc.fw = NULL;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index c82d82da2c73..5ed0adae05cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1601,7 +1601,6 @@ static int gmc_v9_0_sw_fini(void *handle)
 	amdgpu_gart_table_vram_free(adev);
 	amdgpu_bo_unref(&adev->gmc.pdb0_bo);
 	amdgpu_bo_fini(adev);
-	amdgpu_gart_fini(adev);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
index da96c6013477..ddfe4eaeea05 100644
--- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
@@ -301,7 +301,6 @@ static int iceland_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
index 5eea4550b856..941d464a2b47 100644
--- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
@@ -570,10 +570,6 @@ static int navi10_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c b/drivers/gpu/drm/amd/amdgpu/si_ih.c
index 751307f3252c..9a24f17a5750 100644
--- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
@@ -176,7 +176,6 @@ static int si_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
index 973d80ec7f6c..b08905d1c00f 100644
--- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
@@ -313,7 +313,6 @@ static int tonga_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 	amdgpu_irq_remove_domain(adev);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
index dead9c2fbd4c..32ec4b8e806a 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
@@ -514,10 +514,6 @@ static int vega10_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
index 58993ae1fe11..f51dfc38ac65 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
@@ -566,10 +566,6 @@ static int vega20_ih_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
 	amdgpu_irq_fini_sw(adev);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
-	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.
  2021-05-14 14:42   ` Andrey Grodzovsky
@ 2021-05-17 14:40     ` Andrey Grodzovsky
  2021-05-17 17:39       ` Alex Deucher
  2021-05-17 19:39       ` Christian König
  0 siblings, 2 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-17 14:40 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Ping

Andrey

On 2021-05-14 10:42 a.m., Andrey Grodzovsky wrote:
> Ping
> 
> Andrey
> 
> On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
>> If removing while commands are in flight, you cannot wait to flush the
>> HW fences on a ring since the device is gone.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 ++++++++++------
>>   1 file changed, 10 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 1ffb36bd0b19..fa03702ecbfb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -36,6 +36,7 @@
>>   #include <linux/firmware.h>
>>   #include <linux/pm_runtime.h>
>> +#include <drm/drm_drv.h>
>>   #include "amdgpu.h"
>>   #include "amdgpu_trace.h"
>> @@ -525,8 +526,7 @@ int amdgpu_fence_driver_init(struct amdgpu_device 
>> *adev)
>>    */
>>   void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
>>   {
>> -    unsigned i, j;
>> -    int r;
>> +    int i, r;
>>       for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>           struct amdgpu_ring *ring = adev->rings[i];
>> @@ -535,11 +535,15 @@ void amdgpu_fence_driver_fini_hw(struct 
>> amdgpu_device *adev)
>>               continue;
>>           if (!ring->no_scheduler)
>>               drm_sched_fini(&ring->sched);
>> -        r = amdgpu_fence_wait_empty(ring);
>> -        if (r) {
>> -            /* no need to trigger GPU reset as we are unloading */
>> +        /* You can't wait for HW to signal if it's gone */
>> +        if (!drm_dev_is_unplugged(&adev->ddev))
>> +            r = amdgpu_fence_wait_empty(ring);
>> +        else
>> +            r = -ENODEV;
>> +        /* no need to trigger GPU reset as we are unloading */
>> +        if (r)
>>               amdgpu_fence_driver_force_completion(ring);
>> -        }
>> +
>>           if (ring->fence_drv.irq_src)
>>               amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>                          ring->fence_drv.irq_type);
>>
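
For context, the failure mode being avoided here: once the device is
physically gone, no interrupt will ever signal the ring's fences, so an
unconditional amdgpu_fence_wait_empty() blocks forever. Annotated form
of the guarded pattern above (illustrative only):

	/* HW still present: drain outstanding fences normally. */
	if (!drm_dev_is_unplugged(&adev->ddev))
		r = amdgpu_fence_wait_empty(ring);
	else
		r = -ENODEV;	/* HW gone: waiting would hang forever. */

	/* On failure the fences will never signal on their own, so
	 * force-complete them to unblock any waiters. */
	if (r)
		amdgpu_fence_driver_force_completion(ring);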

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings
  2021-05-14 14:42   ` Andrey Grodzovsky
@ 2021-05-17 14:41     ` Andrey Grodzovsky
  0 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-17 14:41 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Ping

Andrey

On 2021-05-14 10:42 a.m., Andrey Grodzovsky wrote:
> Ping
> 
> Andrey
> 
> On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
>> Access to those mappings must be prevented post pci_remove
>>
>> v6: Drop BOs list, unmapping VRAM BAR is enough.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++++++++++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
>>   3 files changed, 22 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f7cca25c0fa0..73cbc3c7453f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3666,6 +3666,25 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>       return r;
>>   }
>> +static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
>> +{
>> +    /* Clear all CPU mappings pointing to this device */
>> +    unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
>> +
>> +    /* Unmap all mapped bars - Doorbell, registers and VRAM */
>> +    amdgpu_device_doorbell_fini(adev);
>> +
>> +    iounmap(adev->rmmio);
>> +    adev->rmmio = NULL;
>> +    if (adev->mman.aper_base_kaddr)
>> +        iounmap(adev->mman.aper_base_kaddr);
>> +    adev->mman.aper_base_kaddr = NULL;
>> +
>> +    /* Memory manager related */
>> +    arch_phys_wc_del(adev->gmc.vram_mtrr);
>> +    arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
>> +}
>> +
>>   /**
>>    * amdgpu_device_fini - tear down the driver
>>    *
>> @@ -3712,6 +3731,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device 
>> *adev)
>>       amdgpu_device_ip_fini_early(adev);
>>       amdgpu_gart_dummy_page_fini(adev);
>> +
>> +    amdgpu_device_unmap_mmio(adev);
>>   }
>>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>> @@ -3739,9 +3760,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device 
>> *adev)
>>       }
>>       if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>>           vga_client_register(adev->pdev, NULL, NULL, NULL);
>> -    iounmap(adev->rmmio);
>> -    adev->rmmio = NULL;
>> -    amdgpu_device_doorbell_fini(adev);
>>       if (IS_ENABLED(CONFIG_PERF_EVENTS))
>>           amdgpu_pmu_fini(adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 0adffcace326..882fb49f3c41 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -533,6 +533,7 @@ static int amdgpu_bo_do_create(struct 
>> amdgpu_device *adev,
>>           return -ENOMEM;
>>       drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, 
>> size);
>>       INIT_LIST_HEAD(&bo->shadow_list);
>> +
>>       bo->vm_bo = NULL;
>>       bo->preferred_domains = bp->preferred_domain ? 
>> bp->preferred_domain :
>>           bp->domain;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 0d54e70278ca..58ad2fecc9e3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>>       amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
>>       amdgpu_ttm_fw_reserve_vram_fini(adev);
>> -    if (adev->mman.aper_base_kaddr)
>> -        iounmap(adev->mman.aper_base_kaddr);
>> -    adev->mman.aper_base_kaddr = NULL;
>> -
>>       amdgpu_vram_mgr_fini(adev);
>>       amdgpu_gtt_mgr_fini(adev);
>>       ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
>>
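
For reference, the key call in the hunk above is unmap_mapping_range()
on the DRM device's address_space: it zaps, across all processes, the
PTEs of every CPU mapping backed by this device, so the next access
faults into the driver instead of touching unplugged MMIO. Simplified
ordering (illustrative only):

	/* 1. Invalidate all existing userspace CPU mappings of this
	 *    device; even_cows=1 also zaps COWed private copies. */
	unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);

	/* 2. Only then tear down the BARs those PTEs pointed into:
	 *    doorbells, the register aperture and the VRAM aperture. */
	amdgpu_device_doorbell_fini(adev);
	iounmap(adev->rmmio);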

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] drm/amdgpu: Handle IOMMU enabled case.
  2021-05-17 14:38       ` [PATCH] " Andrey Grodzovsky
@ 2021-05-17 14:48         ` Felix Kuehling
  0 siblings, 0 replies; 64+ messages in thread
From: Felix Kuehling @ 2021-05-17 14:48 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas

On 2021-05-17 10:38 a.m., Andrey Grodzovsky wrote:
> Problem:
> Handle all DMA IOMMU group related dependencies before the
> group is removed. The dependency manifests as follows: with IOMMU
> enabled, DMA map/unmap depends on the presence of the IOMMU group
> the device belongs to, but this group is released once the device
> is removed from the PCI topology.
>
> Fix:
> Expedite all such unmap operations into the PCI remove driver callback.
>
> v5: Drop IOMMU notifier and switch to lockless call to ttm_tt_unpopulate
> v6: Drop the BO unmap list
> v7:
> Drop amdgpu_gart_fini
> In amdgpu_ih_ring_fini do an unconditional check (!ih->ring)
> to avoid freeing uninitialized rings.
> Call amdgpu_ih_ring_fini unconditionally.
> v8: Add detailed explanation
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

This patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 14 +-------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c     |  6 ++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |  5 +++++
>  drivers/gpu/drm/amd/amdgpu/cik_ih.c        |  1 -
>  drivers/gpu/drm/amd/amdgpu/cz_ih.c         |  1 -
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c     |  1 -
>  drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c      |  1 -
>  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c      |  1 -
>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c      |  1 -
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  1 -
>  drivers/gpu/drm/amd/amdgpu/iceland_ih.c    |  1 -
>  drivers/gpu/drm/amd/amdgpu/navi10_ih.c     |  4 ----
>  drivers/gpu/drm/amd/amdgpu/si_ih.c         |  1 -
>  drivers/gpu/drm/amd/amdgpu/tonga_ih.c      |  1 -
>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c     |  4 ----
>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c     |  4 ----
>  18 files changed, 13 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 18598eda18f6..a0bff4713672 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3256,7 +3256,6 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>  	NULL
>  };
>  
> -
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3698,12 +3697,13 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>  		amdgpu_ucode_sysfs_fini(adev);
>  	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
>  
> -
>  	amdgpu_fbdev_fini(adev);
>  
>  	amdgpu_irq_fini_hw(adev);
>  
>  	amdgpu_device_ip_fini_early(adev);
> +
> +	amdgpu_gart_dummy_page_fini(adev);
>  }
>  
>  void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> index c5a9a4fb10d2..6460cf723f0a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> @@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct amdgpu_device *adev)
>   *
>   * Frees the dummy page used by the driver (all asics).
>   */
> -static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
>  {
>  	if (!adev->dummy_page_addr)
>  		return;
> @@ -365,15 +365,3 @@ int amdgpu_gart_init(struct amdgpu_device *adev)
>  
>  	return 0;
>  }
> -
> -/**
> - * amdgpu_gart_fini - tear down the driver info for managing the gart
> - *
> - * @adev: amdgpu_device pointer
> - *
> - * Tear down the gart driver info and free the dummy page (all asics).
> - */
> -void amdgpu_gart_fini(struct amdgpu_device *adev)
> -{
> -	amdgpu_gart_dummy_page_fini(adev);
> -}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> index a25fe97b0196..030b9d4c736a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> @@ -57,7 +57,7 @@ void amdgpu_gart_table_vram_free(struct amdgpu_device *adev);
>  int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
>  void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev);
>  int amdgpu_gart_init(struct amdgpu_device *adev);
> -void amdgpu_gart_fini(struct amdgpu_device *adev);
> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev);
>  int amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
>  		       int pages);
>  int amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> index faaa6aa2faaf..433469ace6f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> @@ -115,9 +115,11 @@ int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
>   */
>  void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
>  {
> +
> +	if (!ih->ring)
> +		return;
> +
>  	if (ih->use_bus_addr) {
> -		if (!ih->ring)
> -			return;
>  
>  		/* add 8 bytes for the rptr/wptr shadows and
>  		 * add them to the end of the ring allocation.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index 233b64dab94b..32ce0e679dc7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -361,6 +361,11 @@ void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
>  		if (!amdgpu_device_has_dc_support(adev))
>  			flush_work(&adev->hotplug_work);
>  	}
> +
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> +	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
> index 183d44a6583c..df385ffc9768 100644
> --- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
> @@ -310,7 +310,6 @@ static int cik_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  	amdgpu_irq_remove_domain(adev);
>  
>  	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> index d32743949003..b8c47e0cf37a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> @@ -302,7 +302,6 @@ static int cz_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  	amdgpu_irq_remove_domain(adev);
>  
>  	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index 2bfd620576f2..5e8bfcdd422e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -954,7 +954,6 @@ static int gmc_v10_0_sw_init(void *handle)
>  static void gmc_v10_0_gart_fini(struct amdgpu_device *adev)
>  {
>  	amdgpu_gart_table_vram_free(adev);
> -	amdgpu_gart_fini(adev);
>  }
>  
>  static int gmc_v10_0_sw_fini(void *handle)
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
> index 405d6ad09022..0e81e03e9b49 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
> @@ -898,7 +898,6 @@ static int gmc_v6_0_sw_fini(void *handle)
>  	amdgpu_vm_manager_fini(adev);
>  	amdgpu_gart_table_vram_free(adev);
>  	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>  	release_firmware(adev->gmc.fw);
>  	adev->gmc.fw = NULL;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
> index 210ada2289ec..0795ea736573 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
> @@ -1085,7 +1085,6 @@ static int gmc_v7_0_sw_fini(void *handle)
>  	kfree(adev->gmc.vm_fault_info);
>  	amdgpu_gart_table_vram_free(adev);
>  	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>  	release_firmware(adev->gmc.fw);
>  	adev->gmc.fw = NULL;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> index c1bd190841f8..dbf2e5472069 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
> @@ -1194,7 +1194,6 @@ static int gmc_v8_0_sw_fini(void *handle)
>  	kfree(adev->gmc.vm_fault_info);
>  	amdgpu_gart_table_vram_free(adev);
>  	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>  	release_firmware(adev->gmc.fw);
>  	adev->gmc.fw = NULL;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index c82d82da2c73..5ed0adae05cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1601,7 +1601,6 @@ static int gmc_v9_0_sw_fini(void *handle)
>  	amdgpu_gart_table_vram_free(adev);
>  	amdgpu_bo_unref(&adev->gmc.pdb0_bo);
>  	amdgpu_bo_fini(adev);
> -	amdgpu_gart_fini(adev);
>  
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> index da96c6013477..ddfe4eaeea05 100644
> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> @@ -301,7 +301,6 @@ static int iceland_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  	amdgpu_irq_remove_domain(adev);
>  
>  	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> index 5eea4550b856..941d464a2b47 100644
> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> @@ -570,10 +570,6 @@ static int navi10_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c b/drivers/gpu/drm/amd/amdgpu/si_ih.c
> index 751307f3252c..9a24f17a5750 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
> @@ -176,7 +176,6 @@ static int si_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> index 973d80ec7f6c..b08905d1c00f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> @@ -313,7 +313,6 @@ static int tonga_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  	amdgpu_irq_remove_domain(adev);
>  
>  	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> index dead9c2fbd4c..32ec4b8e806a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> @@ -514,10 +514,6 @@ static int vega10_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> index 58993ae1fe11..f51dfc38ac65 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> @@ -566,10 +566,6 @@ static int vega20_ih_sw_fini(void *handle)
>  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>  
>  	amdgpu_irq_fini_sw(adev);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> -	amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.
  2021-05-17 14:40     ` Andrey Grodzovsky
@ 2021-05-17 17:39       ` Alex Deucher
  2021-05-17 19:39       ` Christian König
  1 sibling, 0 replies; 64+ messages in thread
From: Alex Deucher @ 2021-05-17 17:39 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Deucher,
	Alexander, Greg KH, Pekka Paalanen, Bjorn Helgaas, Kuehling,
	Felix

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

On Mon, May 17, 2021 at 10:40 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Ping
>
> Andrey
>
> On 2021-05-14 10:42 a.m., Andrey Grodzovsky wrote:
> > Ping
> >
> > Andrey
> >
> > On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
> >> If removing while commands are in flight, you cannot wait to flush the
> >> HW fences on a ring since the device is gone.
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 ++++++++++------
> >>   1 file changed, 10 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >> index 1ffb36bd0b19..fa03702ecbfb 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >> @@ -36,6 +36,7 @@
> >>   #include <linux/firmware.h>
> >>   #include <linux/pm_runtime.h>
> >> +#include <drm/drm_drv.h>
> >>   #include "amdgpu.h"
> >>   #include "amdgpu_trace.h"
> >> @@ -525,8 +526,7 @@ int amdgpu_fence_driver_init(struct amdgpu_device
> >> *adev)
> >>    */
> >>   void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
> >>   {
> >> -    unsigned i, j;
> >> -    int r;
> >> +    int i, r;
> >>       for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> >>           struct amdgpu_ring *ring = adev->rings[i];
> >> @@ -535,11 +535,15 @@ void amdgpu_fence_driver_fini_hw(struct
> >> amdgpu_device *adev)
> >>               continue;
> >>           if (!ring->no_scheduler)
> >>               drm_sched_fini(&ring->sched);
> >> -        r = amdgpu_fence_wait_empty(ring);
> >> -        if (r) {
> >> -            /* no need to trigger GPU reset as we are unloading */
> >> +        /* You can't wait for HW to signal if it's gone */
> >> +        if (!drm_dev_is_unplugged(&adev->ddev))
> >> +            r = amdgpu_fence_wait_empty(ring);
> >> +        else
> >> +            r = -ENODEV;
> >> +        /* no need to trigger GPU reset as we are unloading */
> >> +        if (r)
> >>               amdgpu_fence_driver_force_completion(ring);
> >> -        }
> >> +
> >>           if (ring->fence_drv.irq_src)
> >>               amdgpu_irq_put(adev, ring->fence_drv.irq_src,
> >>                          ring->fence_drv.irq_type);
> >>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings
  2021-05-12 14:26 ` [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings Andrey Grodzovsky
  2021-05-14 14:42   ` Andrey Grodzovsky
@ 2021-05-17 17:43   ` Alex Deucher
  2021-05-17 18:46     ` Andrey Grodzovsky
  2021-05-17 19:31     ` [PATCH] " Andrey Grodzovsky
  1 sibling, 2 replies; 64+ messages in thread
From: Alex Deucher @ 2021-05-17 17:43 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander

On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Access to those mappings must be prevented post pci_remove
>
> v6: Drop BOs list, unmapping VRAM BAR is enough.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++++++++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
>  3 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f7cca25c0fa0..73cbc3c7453f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3666,6 +3666,25 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>         return r;
>  }
>
> +static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
> +{
> +       /* Clear all CPU mappings pointing to this device */
> +       unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
> +
> +       /* Unmap all mapped bars - Doorbell, registers and VRAM */
> +       amdgpu_device_doorbell_fini(adev);
> +
> +       iounmap(adev->rmmio);
> +       adev->rmmio = NULL;
> +       if (adev->mman.aper_base_kaddr)
> +               iounmap(adev->mman.aper_base_kaddr);
> +       adev->mman.aper_base_kaddr = NULL;
> +
> +       /* Memory manager related */

I think we need:
if (!adev->gmc.xgmi.connected_to_cpu) {
around these two to mirror amdgpu_bo_fini().
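
That is, roughly (an untested sketch of the suggested guard):

	/* Memory manager related; mirror amdgpu_bo_fini() and skip
	 * the MTRR/WC teardown when VRAM is xGMI-connected to the CPU. */
	if (!adev->gmc.xgmi.connected_to_cpu) {
		arch_phys_wc_del(adev->gmc.vram_mtrr);
		arch_io_free_memtype_wc(adev->gmc.aper_base,
					adev->gmc.aper_size);
	}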

Alex

> +       arch_phys_wc_del(adev->gmc.vram_mtrr);
> +       arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
> +}
> +
>  /**
>   * amdgpu_device_fini - tear down the driver
>   *
> @@ -3712,6 +3731,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>         amdgpu_device_ip_fini_early(adev);
>
>         amdgpu_gart_dummy_page_fini(adev);
> +
> +       amdgpu_device_unmap_mmio(adev);
>  }
>
>  void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> @@ -3739,9 +3760,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>         }
>         if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>                 vga_client_register(adev->pdev, NULL, NULL, NULL);
> -       iounmap(adev->rmmio);
> -       adev->rmmio = NULL;
> -       amdgpu_device_doorbell_fini(adev);
>
>         if (IS_ENABLED(CONFIG_PERF_EVENTS))
>                 amdgpu_pmu_fini(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 0adffcace326..882fb49f3c41 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -533,6 +533,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
>                 return -ENOMEM;
>         drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
>         INIT_LIST_HEAD(&bo->shadow_list);
> +
>         bo->vm_bo = NULL;
>         bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
>                 bp->domain;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 0d54e70278ca..58ad2fecc9e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>         amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
>         amdgpu_ttm_fw_reserve_vram_fini(adev);
>
> -       if (adev->mman.aper_base_kaddr)
> -               iounmap(adev->mman.aper_base_kaddr);
> -       adev->mman.aper_base_kaddr = NULL;
> -
>         amdgpu_vram_mgr_fini(adev);
>         amdgpu_gtt_mgr_fini(adev);
>         ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread
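
A sketch of the guard Alex is asking for, mirroring the check already done
in amdgpu_bo_fini(); this is a fragment only, and it assumes the
xgmi.connected_to_cpu field is available, which the follow-up below shows
depends on the branch:

	/*
	 * MTRR/write-combine bookkeeping only exists when VRAM sits
	 * behind a PCI BAR; on XGMI systems where VRAM is directly
	 * CPU-addressable there is nothing to undo.
	 */
	if (!adev->gmc.xgmi.connected_to_cpu) {
		arch_phys_wc_del(adev->gmc.vram_mtrr);
		arch_io_free_memtype_wc(adev->gmc.aper_base,
					adev->gmc.aper_size);
	}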

* Re: [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings
  2021-05-17 17:43   ` Alex Deucher
@ 2021-05-17 18:46     ` Andrey Grodzovsky
  2021-05-17 18:56       ` Alex Deucher
  2021-05-17 19:31     ` [PATCH] " Andrey Grodzovsky
  1 sibling, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-17 18:46 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander



On 2021-05-17 1:43 p.m., Alex Deucher wrote:
> On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> Access to those must be prevented post pci_remove
>>
>> v6: Drop BOs list, unmapping VRAM BAR is enough.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++++++++++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
>>   3 files changed, 22 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f7cca25c0fa0..73cbc3c7453f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3666,6 +3666,25 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>          return r;
>>   }
>>
>> +static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
>> +{
>> +       /* Clear all CPU mappings pointing to this device */
>> +       unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
>> +
>> +       /* Unmap all mapped bars - Doorbell, registers and VRAM */
>> +       amdgpu_device_doorbell_fini(adev);
>> +
>> +       iounmap(adev->rmmio);
>> +       adev->rmmio = NULL;
>> +       if (adev->mman.aper_base_kaddr)
>> +               iounmap(adev->mman.aper_base_kaddr);
>> +       adev->mman.aper_base_kaddr = NULL;
>> +
>> +       /* Memory manager related */
> 
> I think we need:
> if (!adev->gmc.xgmi.connected_to_cpu) {
> around these two to mirror amdgpu_bo_fini().
> 
> Alex

I am working off of drm-misc-next and here amdgpu_xgmi
doesn't have connected_to_cpu yet.

Andrey

> 
>> +       arch_phys_wc_del(adev->gmc.vram_mtrr);
>> +       arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
>> +}
>> +
>>   /**
>>    * amdgpu_device_fini - tear down the driver
>>    *
>> @@ -3712,6 +3731,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>          amdgpu_device_ip_fini_early(adev);
>>
>>          amdgpu_gart_dummy_page_fini(adev);
>> +
>> +       amdgpu_device_unmap_mmio(adev);
>>   }
>>
>>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>> @@ -3739,9 +3760,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>          }
>>          if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>>                  vga_client_register(adev->pdev, NULL, NULL, NULL);
>> -       iounmap(adev->rmmio);
>> -       adev->rmmio = NULL;
>> -       amdgpu_device_doorbell_fini(adev);
>>
>>          if (IS_ENABLED(CONFIG_PERF_EVENTS))
>>                  amdgpu_pmu_fini(adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 0adffcace326..882fb49f3c41 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -533,6 +533,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
>>                  return -ENOMEM;
>>          drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
>>          INIT_LIST_HEAD(&bo->shadow_list);
>> +
>>          bo->vm_bo = NULL;
>>          bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
>>                  bp->domain;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 0d54e70278ca..58ad2fecc9e3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>>          amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
>>          amdgpu_ttm_fw_reserve_vram_fini(adev);
>>
>> -       if (adev->mman.aper_base_kaddr)
>> -               iounmap(adev->mman.aper_base_kaddr);
>> -       adev->mman.aper_base_kaddr = NULL;
>> -
>>          amdgpu_vram_mgr_fini(adev);
>>          amdgpu_gtt_mgr_fini(adev);
>>          ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
>> --
>> 2.25.1
>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings
  2021-05-17 18:46     ` Andrey Grodzovsky
@ 2021-05-17 18:56       ` Alex Deucher
  2021-05-17 19:22         ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Alex Deucher @ 2021-05-17 18:56 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander

On Mon, May 17, 2021 at 2:46 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
>
> On 2021-05-17 1:43 p.m., Alex Deucher wrote:
> > On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> Access to those must be prevented post pci_remove
> >>
> >> v6: Drop BOs list, unmapping VRAM BAR is enough.
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++++++++++---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  1 +
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
> >>   3 files changed, 22 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index f7cca25c0fa0..73cbc3c7453f 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -3666,6 +3666,25 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >>          return r;
> >>   }
> >>
> >> +static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
> >> +{
> >> +       /* Clear all CPU mappings pointing to this device */
> >> +       unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
> >> +
> >> +       /* Unmap all mapped bars - Doorbell, registers and VRAM */
> >> +       amdgpu_device_doorbell_fini(adev);
> >> +
> >> +       iounmap(adev->rmmio);
> >> +       adev->rmmio = NULL;
> >> +       if (adev->mman.aper_base_kaddr)
> >> +               iounmap(adev->mman.aper_base_kaddr);
> >> +       adev->mman.aper_base_kaddr = NULL;
> >> +
> >> +       /* Memory manager related */
> >
> > I think we need:
> > if (!adev->gmc.xgmi.connected_to_cpu) {
> > around these two to mirror amdgpu_bo_fini().
> >
> > Alex
>
> I am working off of drm-misc-next and here amdgpu_xgmi
> doesn't have connected_to_cpu yet.

Ah, right.  Ok.  Do we need to remove the code from bo_fini() if we
handle it here now?

Alex


>
> Andrey
>
> >
> >> +       arch_phys_wc_del(adev->gmc.vram_mtrr);
> >> +       arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
> >> +}
> >> +
> >>   /**
> >>    * amdgpu_device_fini - tear down the driver
> >>    *
> >> @@ -3712,6 +3731,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
> >>          amdgpu_device_ip_fini_early(adev);
> >>
> >>          amdgpu_gart_dummy_page_fini(adev);
> >> +
> >> +       amdgpu_device_unmap_mmio(adev);
> >>   }
> >>
> >>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> >> @@ -3739,9 +3760,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> >>          }
> >>          if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
> >>                  vga_client_register(adev->pdev, NULL, NULL, NULL);
> >> -       iounmap(adev->rmmio);
> >> -       adev->rmmio = NULL;
> >> -       amdgpu_device_doorbell_fini(adev);
> >>
> >>          if (IS_ENABLED(CONFIG_PERF_EVENTS))
> >>                  amdgpu_pmu_fini(adev);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >> index 0adffcace326..882fb49f3c41 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >> @@ -533,6 +533,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
> >>                  return -ENOMEM;
> >>          drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
> >>          INIT_LIST_HEAD(&bo->shadow_list);
> >> +
> >>          bo->vm_bo = NULL;
> >>          bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
> >>                  bp->domain;
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> >> index 0d54e70278ca..58ad2fecc9e3 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> >> @@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
> >>          amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
> >>          amdgpu_ttm_fw_reserve_vram_fini(adev);
> >>
> >> -       if (adev->mman.aper_base_kaddr)
> >> -               iounmap(adev->mman.aper_base_kaddr);
> >> -       adev->mman.aper_base_kaddr = NULL;
> >> -
> >>          amdgpu_vram_mgr_fini(adev);
> >>          amdgpu_gtt_mgr_fini(adev);
> >>          ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
> >> --
> >> 2.25.1
> >>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings
  2021-05-17 18:56       ` Alex Deucher
@ 2021-05-17 19:22         ` Andrey Grodzovsky
  0 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-17 19:22 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Maling list - DRI developers, amd-gfx list, Linux PCI,
	Christian König, Daniel Vetter, Wentland, Harry, Greg KH,
	Kuehling, Felix, Bjorn Helgaas, Deucher, Alexander

On 2021-05-17 2:56 p.m., Alex Deucher wrote:
> On Mon, May 17, 2021 at 2:46 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>>
>>
>> On 2021-05-17 1:43 p.m., Alex Deucher wrote:
>>> On Wed, May 12, 2021 at 10:27 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>> Access to those must be prevented post pci_remove
>>>>
>>>> v6: Drop BOs list, unmapping VRAM BAR is enough.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 +++++++++++++++++++---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
>>>>    3 files changed, 22 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f7cca25c0fa0..73cbc3c7453f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3666,6 +3666,25 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>>           return r;
>>>>    }
>>>>
>>>> +static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
>>>> +{
>>>> +       /* Clear all CPU mappings pointing to this device */
>>>> +       unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
>>>> +
>>>> +       /* Unmap all mapped bars - Doorbell, registers and VRAM */
>>>> +       amdgpu_device_doorbell_fini(adev);
>>>> +
>>>> +       iounmap(adev->rmmio);
>>>> +       adev->rmmio = NULL;
>>>> +       if (adev->mman.aper_base_kaddr)
>>>> +               iounmap(adev->mman.aper_base_kaddr);
>>>> +       adev->mman.aper_base_kaddr = NULL;
>>>> +
>>>> +       /* Memory manager related */
>>>
>>> I think we need:
>>> if (!adev->gmc.xgmi.connected_to_cpu) {
>>> around these two to mirror amdgpu_bo_fini().
>>>
>>> Alex
>>
>> I am working off of drm-misc-next and here amdgpu_xgmi
>> doesn't have connected_to_cpu yet.
> 
> Ah, right.  Ok.  Do we need to remove the code from bo_fini() if we
> handle it here now?
> 
> Alex

My bad, I was on an older kernel due to fixing an internal
ticket last week. In the latest drm-misc-next there is
connected_to_cpu, so I fixed everything as you asked.
Will resend in a moment.

Andrey

> 
> 
>>
>> Andrey
>>
>>>
>>>> +       arch_phys_wc_del(adev->gmc.vram_mtrr);
>>>> +       arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
>>>> +}
>>>> +
>>>>    /**
>>>>     * amdgpu_device_fini - tear down the driver
>>>>     *
>>>> @@ -3712,6 +3731,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>>           amdgpu_device_ip_fini_early(adev);
>>>>
>>>>           amdgpu_gart_dummy_page_fini(adev);
>>>> +
>>>> +       amdgpu_device_unmap_mmio(adev);
>>>>    }
>>>>
>>>>    void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>> @@ -3739,9 +3760,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>           }
>>>>           if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>>>>                   vga_client_register(adev->pdev, NULL, NULL, NULL);
>>>> -       iounmap(adev->rmmio);
>>>> -       adev->rmmio = NULL;
>>>> -       amdgpu_device_doorbell_fini(adev);
>>>>
>>>>           if (IS_ENABLED(CONFIG_PERF_EVENTS))
>>>>                   amdgpu_pmu_fini(adev);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> index 0adffcace326..882fb49f3c41 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> @@ -533,6 +533,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
>>>>                   return -ENOMEM;
>>>>           drm_gem_private_object_init(adev_to_drm(adev), &bo->tbo.base, size);
>>>>           INIT_LIST_HEAD(&bo->shadow_list);
>>>> +
>>>>           bo->vm_bo = NULL;
>>>>           bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain :
>>>>                   bp->domain;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index 0d54e70278ca..58ad2fecc9e3 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>>>>           amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
>>>>           amdgpu_ttm_fw_reserve_vram_fini(adev);
>>>>
>>>> -       if (adev->mman.aper_base_kaddr)
>>>> -               iounmap(adev->mman.aper_base_kaddr);
>>>> -       adev->mman.aper_base_kaddr = NULL;
>>>> -
>>>>           amdgpu_vram_mgr_fini(adev);
>>>>           amdgpu_gtt_mgr_fini(adev);
>>>>           ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
>>>> --
>>>> 2.25.1
>>>>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH] drm/amdgpu: Unmap all MMIO mappings
  2021-05-17 17:43   ` Alex Deucher
  2021-05-17 18:46     ` Andrey Grodzovsky
@ 2021-05-17 19:31     ` Andrey Grodzovsky
  2021-05-18 14:01       ` Andrey Grodzovsky
  1 sibling, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-17 19:31 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

Access to those must be prevented post pci_remove

v6: Drop BOs list, unmapping VRAM BAR is enough.
v8:
Add condition of xgmi.connected_to_cpu to MTRR
handling and remove MTRR handling from the old place.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 +++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  4 ----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
 3 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f7cca25c0fa0..8b50315d1fe1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3666,6 +3666,27 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	return r;
 }
 
+static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
+{
+	/* Clear all CPU mappings pointing to this device */
+	unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
+
+	/* Unmap all mapped bars - Doorbell, registers and VRAM */
+	amdgpu_device_doorbell_fini(adev);
+
+	iounmap(adev->rmmio);
+	adev->rmmio = NULL;
+	if (adev->mman.aper_base_kaddr)
+		iounmap(adev->mman.aper_base_kaddr);
+	adev->mman.aper_base_kaddr = NULL;
+
+	/* Memory manager related */
+	if (!adev->gmc.xgmi.connected_to_cpu) {
+		arch_phys_wc_del(adev->gmc.vram_mtrr);
+		arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
+	}
+}
+
 /**
  * amdgpu_device_fini - tear down the driver
  *
@@ -3712,6 +3733,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_device_ip_fini_early(adev);
 
 	amdgpu_gart_dummy_page_fini(adev);
+
+	amdgpu_device_unmap_mmio(adev);
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
@@ -3739,9 +3762,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
 	}
 	if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
 		vga_client_register(adev->pdev, NULL, NULL, NULL);
-	iounmap(adev->rmmio);
-	adev->rmmio = NULL;
-	amdgpu_device_doorbell_fini(adev);
 
 	if (IS_ENABLED(CONFIG_PERF_EVENTS))
 		amdgpu_pmu_fini(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 0adffcace326..8eabe3c9ad17 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1107,10 +1107,6 @@ int amdgpu_bo_init(struct amdgpu_device *adev)
 void amdgpu_bo_fini(struct amdgpu_device *adev)
 {
 	amdgpu_ttm_fini(adev);
-	if (!adev->gmc.xgmi.connected_to_cpu) {
-		arch_phys_wc_del(adev->gmc.vram_mtrr);
-		arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
-	}
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 0d54e70278ca..58ad2fecc9e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
 	amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
 	amdgpu_ttm_fw_reserve_vram_fini(adev);
 
-	if (adev->mman.aper_base_kaddr)
-		iounmap(adev->mman.aper_base_kaddr);
-	adev->mman.aper_base_kaddr = NULL;
-
 	amdgpu_vram_mgr_fini(adev);
 	amdgpu_gtt_mgr_fini(adev);
 	ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread
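
Two notes on the unmap_mapping_range() call in the patch above. Every user
mmap of this device's BOs lives in the single address_space of the DRM
file's anon inode, so offset 0 with holelen 0 ("to the end of the file")
and even_cows = 1 zaps all of them in one call. From then on, any CPU
access faults into the driver's vm_ops, which is where the per-process
dummy page from earlier in this series can take over. A hedged sketch of
such a fault handler follows; my_bo_fault_real and the dummy_page
bookkeeping are hypothetical stand-ins, not the actual TTM helpers:

#include <linux/mm.h>
#include <drm/drm_drv.h>
#include <drm/drm_gem.h>

static struct page *dummy_page; /* assume allocated per process at open() */

static vm_fault_t my_bo_vm_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct drm_gem_object *obj = vma->vm_private_data;
	vm_fault_t ret;
	int idx;

	if (drm_dev_enter(obj->dev, &idx)) {
		/* device still present: fault in the real BO pages */
		ret = my_bo_fault_real(vmf);
		drm_dev_exit(idx);
	} else {
		/*
		 * device gone: back the VMA with a dummy page so the
		 * app keeps running instead of catching SIGBUS
		 */
		ret = vmf_insert_pfn(vma, vmf->address,
				     page_to_pfn(dummy_page));
	}

	return ret;
}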

* Re: [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.
  2021-05-17 14:40     ` Andrey Grodzovsky
  2021-05-17 17:39       ` Alex Deucher
@ 2021-05-17 19:39       ` Christian König
  2021-05-17 19:46         ` Andrey Grodzovsky
  1 sibling, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-17 19:39 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci, daniel.vetter,
	Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

You need to note who you are pinging here.

I'm still assuming you're waiting for feedback from Daniel. Or should I take a 
look?

Christian.

Am 17.05.21 um 16:40 schrieb Andrey Grodzovsky:
> Ping
>
> Andrey
>
> On 2021-05-14 10:42 a.m., Andrey Grodzovsky wrote:
>> Ping
>>
>> Andrey
>>
>> On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
>>> If removing while commands are in flight you cannot wait to flush the
>>> HW fences on a ring since the device is gone.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 ++++++++++------
>>>   1 file changed, 10 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 1ffb36bd0b19..fa03702ecbfb 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -36,6 +36,7 @@
>>>   #include <linux/firmware.h>
>>>   #include <linux/pm_runtime.h>
>>> +#include <drm/drm_drv.h>
>>>   #include "amdgpu.h"
>>>   #include "amdgpu_trace.h"
>>> @@ -525,8 +526,7 @@ int amdgpu_fence_driver_init(struct 
>>> amdgpu_device *adev)
>>>    */
>>>   void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
>>>   {
>>> -    unsigned i, j;
>>> -    int r;
>>> +    int i, r;
>>>       for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>>           struct amdgpu_ring *ring = adev->rings[i];
>>> @@ -535,11 +535,15 @@ void amdgpu_fence_driver_fini_hw(struct 
>>> amdgpu_device *adev)
>>>               continue;
>>>           if (!ring->no_scheduler)
>>>               drm_sched_fini(&ring->sched);
>>> -        r = amdgpu_fence_wait_empty(ring);
>>> -        if (r) {
>>> -            /* no need to trigger GPU reset as we are unloading */
>>> +        /* You can't wait for HW to signal if it's gone */
>>> +        if (!drm_dev_is_unplugged(&adev->ddev))
>>> +            r = amdgpu_fence_wait_empty(ring);
>>> +        else
>>> +            r = -ENODEV;
>>> +        /* no need to trigger GPU reset as we are unloading */
>>> +        if (r)
>>>               amdgpu_fence_driver_force_completion(ring);
>>> -        }
>>> +
>>>           if (ring->fence_drv.irq_src)
>>>               amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>>                          ring->fence_drv.irq_type);
>>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.
  2021-05-17 19:39       ` Christian König
@ 2021-05-17 19:46         ` Andrey Grodzovsky
  2021-05-17 19:54           ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-17 19:46 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx, linux-pci,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Yep, you can take a look.

Andrey

On 2021-05-17 3:39 p.m., Christian König wrote:
> You need to note who you are pinging here.
> 
> I'm still assuming you're waiting for feedback from Daniel. Or should I take a 
> look?
> 
> Christian.
> 
> Am 17.05.21 um 16:40 schrieb Andrey Grodzovsky:
>> Ping
>>
>> Andrey
>>
>> On 2021-05-14 10:42 a.m., Andrey Grodzovsky wrote:
>>> Ping
>>>
>>> Andrey
>>>
>>> On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
>>>> If removing while commands are in flight you cannot wait to flush the
>>>> HW fences on a ring since the device is gone.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 ++++++++++------
>>>>   1 file changed, 10 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 1ffb36bd0b19..fa03702ecbfb 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -36,6 +36,7 @@
>>>>   #include <linux/firmware.h>
>>>>   #include <linux/pm_runtime.h>
>>>> +#include <drm/drm_drv.h>
>>>>   #include "amdgpu.h"
>>>>   #include "amdgpu_trace.h"
>>>> @@ -525,8 +526,7 @@ int amdgpu_fence_driver_init(struct 
>>>> amdgpu_device *adev)
>>>>    */
>>>>   void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
>>>>   {
>>>> -    unsigned i, j;
>>>> -    int r;
>>>> +    int i, r;
>>>>       for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>>>           struct amdgpu_ring *ring = adev->rings[i];
>>>> @@ -535,11 +535,15 @@ void amdgpu_fence_driver_fini_hw(struct 
>>>> amdgpu_device *adev)
>>>>               continue;
>>>>           if (!ring->no_scheduler)
>>>>               drm_sched_fini(&ring->sched);
>>>> -        r = amdgpu_fence_wait_empty(ring);
>>>> -        if (r) {
>>>> -            /* no need to trigger GPU reset as we are unloading */
>>>> +        /* You can't wait for HW to signal if it's gone */
>>>> +        if (!drm_dev_is_unplugged(&adev->ddev))
>>>> +            r = amdgpu_fence_wait_empty(ring);
>>>> +        else
>>>> +            r = -ENODEV;
>>>> +        /* no need to trigger GPU reset as we are unloading */
>>>> +        if (r)
>>>>               amdgpu_fence_driver_force_completion(ring);
>>>> -        }
>>>> +
>>>>           if (ring->fence_drv.irq_src)
>>>>               amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>>>                          ring->fence_drv.irq_type);
>>>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.
  2021-05-17 19:46         ` Andrey Grodzovsky
@ 2021-05-17 19:54           ` Christian König
  0 siblings, 0 replies; 64+ messages in thread
From: Christian König @ 2021-05-17 19:54 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci, daniel.vetter,
	Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Ok, then putting that on my TODO list for tomorrow.

I've already found a problem with how we finish off fences, going to 
write more on this tomorrow.

Christian.

Am 17.05.21 um 21:46 schrieb Andrey Grodzovsky:
> Yep, you can take a look.
>
> Andrey
>
> On 2021-05-17 3:39 p.m., Christian König wrote:
>> You need to note who you are pinging here.
>>
>> I'm still assuming you're waiting for feedback from Daniel. Or should I 
>> take a look?
>>
>> Christian.
>>
>> Am 17.05.21 um 16:40 schrieb Andrey Grodzovsky:
>>> Ping
>>>
>>> Andrey
>>>
>>> On 2021-05-14 10:42 a.m., Andrey Grodzovsky wrote:
>>>> Ping
>>>>
>>>> Andrey
>>>>
>>>> On 2021-05-12 10:26 a.m., Andrey Grodzovsky wrote:
>>>>> If removing while commands are in flight you cannot wait to flush the
>>>>> HW fences on a ring since the device is gone.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 16 ++++++++++------
>>>>>   1 file changed, 10 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 1ffb36bd0b19..fa03702ecbfb 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -36,6 +36,7 @@
>>>>>   #include <linux/firmware.h>
>>>>>   #include <linux/pm_runtime.h>
>>>>> +#include <drm/drm_drv.h>
>>>>>   #include "amdgpu.h"
>>>>>   #include "amdgpu_trace.h"
>>>>> @@ -525,8 +526,7 @@ int amdgpu_fence_driver_init(struct 
>>>>> amdgpu_device *adev)
>>>>>    */
>>>>>   void amdgpu_fence_driver_fini_hw(struct amdgpu_device *adev)
>>>>>   {
>>>>> -    unsigned i, j;
>>>>> -    int r;
>>>>> +    int i, r;
>>>>>       for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>>>>           struct amdgpu_ring *ring = adev->rings[i];
>>>>> @@ -535,11 +535,15 @@ void amdgpu_fence_driver_fini_hw(struct 
>>>>> amdgpu_device *adev)
>>>>>               continue;
>>>>>           if (!ring->no_scheduler)
>>>>>               drm_sched_fini(&ring->sched);
>>>>> -        r = amdgpu_fence_wait_empty(ring);
>>>>> -        if (r) {
>>>>> -            /* no need to trigger GPU reset as we are unloading */
>>>>> +        /* You can't wait for HW to signal if it's gone */
>>>>> +        if (!drm_dev_is_unplugged(&adev->ddev))
>>>>> +            r = amdgpu_fence_wait_empty(ring);
>>>>> +        else
>>>>> +            r = -ENODEV;
>>>>> +        /* no need to trigger GPU reset as we are unloading */
>>>>> +        if (r)
>>>>>               amdgpu_fence_driver_force_completion(ring);
>>>>> -        }
>>>>> +
>>>>>           if (ring->fence_drv.irq_src)
>>>>>               amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>>>>                          ring->fence_drv.irq_type);
>>>>>
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] drm/amdgpu: Unmap all MMIO mappings
  2021-05-17 19:31     ` [PATCH] " Andrey Grodzovsky
@ 2021-05-18 14:01       ` Andrey Grodzovsky
  0 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-18 14:01 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Ping

Andrey

On 2021-05-17 3:31 p.m., Andrey Grodzovsky wrote:
> Access to those must be prevented post pci_remove
> 
> v6: Drop BOs list, unmapping VRAM BAR is enough.
> v8:
> Add condition of xgmi.connected_to_cpu to MTRR
> handling and remove MTRR handling from the old place.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 +++++++++++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  4 ----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  4 ----
>   3 files changed, 23 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f7cca25c0fa0..8b50315d1fe1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3666,6 +3666,27 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> +static void amdgpu_device_unmap_mmio(struct amdgpu_device *adev)
> +{
> +	/* Clear all CPU mappings pointing to this device */
> +	unmap_mapping_range(adev->ddev.anon_inode->i_mapping, 0, 0, 1);
> +
> +	/* Unmap all mapped bars - Doorbell, registers and VRAM */
> +	amdgpu_device_doorbell_fini(adev);
> +
> +	iounmap(adev->rmmio);
> +	adev->rmmio = NULL;
> +	if (adev->mman.aper_base_kaddr)
> +		iounmap(adev->mman.aper_base_kaddr);
> +	adev->mman.aper_base_kaddr = NULL;
> +
> +	/* Memory manager related */
> +	if (!adev->gmc.xgmi.connected_to_cpu) {
> +		arch_phys_wc_del(adev->gmc.vram_mtrr);
> +		arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
> +	}
> +}
> +
>   /**
>    * amdgpu_device_fini - tear down the driver
>    *
> @@ -3712,6 +3733,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>   	amdgpu_device_ip_fini_early(adev);
>   
>   	amdgpu_gart_dummy_page_fini(adev);
> +
> +	amdgpu_device_unmap_mmio(adev);
>   }
>   
>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> @@ -3739,9 +3762,6 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>   	}
>   	if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
>   		vga_client_register(adev->pdev, NULL, NULL, NULL);
> -	iounmap(adev->rmmio);
> -	adev->rmmio = NULL;
> -	amdgpu_device_doorbell_fini(adev);
>   
>   	if (IS_ENABLED(CONFIG_PERF_EVENTS))
>   		amdgpu_pmu_fini(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 0adffcace326..8eabe3c9ad17 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1107,10 +1107,6 @@ int amdgpu_bo_init(struct amdgpu_device *adev)
>   void amdgpu_bo_fini(struct amdgpu_device *adev)
>   {
>   	amdgpu_ttm_fini(adev);
> -	if (!adev->gmc.xgmi.connected_to_cpu) {
> -		arch_phys_wc_del(adev->gmc.vram_mtrr);
> -		arch_io_free_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
> -	}
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 0d54e70278ca..58ad2fecc9e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1841,10 +1841,6 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
>   	amdgpu_bo_free_kernel(&adev->mman.discovery_memory, NULL, NULL);
>   	amdgpu_ttm_fw_reserve_vram_fini(adev);
>   
> -	if (adev->mman.aper_base_kaddr)
> -		iounmap(adev->mman.aper_base_kaddr);
> -	adev->mman.aper_base_kaddr = NULL;
> -
>   	amdgpu_vram_mgr_fini(adev);
>   	amdgpu_gtt_mgr_fini(adev);
>   	ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_GDS);
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-12 14:26 ` [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released Andrey Grodzovsky
@ 2021-05-18 14:07   ` Christian König
  2021-05-18 15:03     ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-18 14:07 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

In a separate discussion with Daniel we once more iterated over the 
dma_resv requirements and I came to the conclusion that this approach 
here won't work reliably.

The problem is as follows:
1. device A schedules some rendering into a buffer and exports it 
as DMA-buf.
2. device B imports the DMA-buf and wants to consume the rendering, for 
that the fence of device A is replaced with a new operation.
3. device B is hot unplugged and the new operation canceled/never scheduled.

The problem is now that we can't do this since the operation of device A 
is still running and by signaling our fences we run into the problem of 
potential memory corruption.

Not sure how to handle that case. One possibility would be to wait for 
all dependencies of unscheduled jobs before signaling their fences as 
canceled.

Christian.

Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
> Problem: If the scheduler is already stopped by the time sched_entity
> is released and the entity's job_queue is not empty, I encountered
> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
> never becomes true.
>
> Fix: In drm_sched_fini detach all sched_entities from the
> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
> Also wake up all those processes stuck in sched_entity flushing,
> as the scheduler main thread which would wake them up is stopped by now.
>
> v2:
> Reverse order of drm_sched_rq_remove_entity and marking
> s_entity as stopped to prevent reinsertion back to rq due
> to race.
>
> v3:
> Drop drm_sched_rq_remove_entity, only modify entity->stopped
> and check for it in drm_sched_entity_is_idle
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>   drivers/gpu/drm/scheduler/sched_main.c   | 24 ++++++++++++++++++++++++
>   2 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 0249c7450188..2e93e881b65f 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct drm_sched_entity *entity)
>   	rmb(); /* for list_empty to work without lock */
>   
>   	if (list_empty(&entity->list) ||
> -	    spsc_queue_count(&entity->job_queue) == 0)
> +	    spsc_queue_count(&entity->job_queue) == 0 ||
> +	    entity->stopped)
>   		return true;
>   
>   	return false;
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 8d1211e87101..a2a953693b45 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>    */
>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>   {
> +	struct drm_sched_entity *s_entity;
> +	int i;
> +
>   	if (sched->thread)
>   		kthread_stop(sched->thread);
>   
> +	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> +		struct drm_sched_rq *rq = &sched->sched_rq[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		spin_lock(&rq->lock);
> +		list_for_each_entry(s_entity, &rq->entities, list)
> +			/*
> +			 * Prevents reinsertion and marks job_queue as idle,
> +			 * it will be removed from rq in drm_sched_entity_fini
> +			 * eventually
> +			 */
> +			s_entity->stopped = true;
> +		spin_unlock(&rq->lock);
> +
> +	}
> +
> +	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
> +	wake_up_all(&sched->job_scheduled);
> +
>   	/* Confirm no work left behind accessing device structures */
>   	cancel_delayed_work_sync(&sched->work_tdr);
>   


^ permalink raw reply	[flat|nested] 64+ messages in thread
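
The "one possibility" Christian floats at the end of his mail could look
roughly like the sketch below. This is purely illustrative: it assumes a
drm_sched-style job/fence layout, and job_next_dependency() is a
hypothetical iterator, not an existing helper:

/*
 * Drain the entity's queue, but block on each unscheduled job's
 * dependencies before declaring its fences canceled. That keeps the
 * ordering contract intact: nothing that memory management might
 * wait on signals before the work it stands in for could have run.
 */
static void entity_kill_jobs_synced(struct drm_sched_entity *entity)
{
	struct drm_sched_job *job;
	struct dma_fence *dep;

	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
		while ((dep = job_next_dependency(job))) { /* hypothetical */
			dma_fence_wait(dep, false);
			dma_fence_put(dep);
		}

		dma_fence_set_error(&job->s_fence->finished, -ESRCH);
		dma_fence_signal(&job->s_fence->finished);
		job->sched->ops->free_job(job);
	}
}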

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 14:07   ` Christian König
@ 2021-05-18 15:03     ` Andrey Grodzovsky
  2021-05-18 15:15       ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-18 15:03 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling


On 2021-05-18 10:07 a.m., Christian König wrote:
> In a separate discussion with Daniel we once more iterated over the 
> dma_resv requirements and I came to the conclusion that this approach 
> here won't work reliably.
> 
> The problem is as follows:
> 1. device A schedules some rendering into a buffer and exports it 
> as DMA-buf.
> 2. device B imports the DMA-buf and wants to consume the rendering, for 
> that the fence of device A is replaced with a new operation.
> 3. device B is hot unplugged and the new operation canceled/never scheduled.
> 
> The problem is now that we can't do this since the operation of device A 
> is still running and by signaling our fences we run into the problem of 
> potential memory corruption.


I am not sure this problem you describe above is related to this patch.
Here we purely expand the criteria for when sched_entity is
considered idle in order to prevent a hang on device remove.
Were you addressing the patch from yesterday in which you commented
that you found a problem with how we finish fences? It was
'[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'

Also, in the patch series as it is now we only signal HW fences for the
extracted device B; we are not touching any other fences. In fact, as you
may remember, I dropped all new logic for forcing fence completion in
this patch series and only call amdgpu_fence_driver_force_completion
for the HW fences of the extracted device as it's done today anyway.

Andrey

> 
> Not sure how to handle that case. One possibility would be to wait for 
> all dependencies of unscheduled jobs before signaling their fences as 
> canceled.
> 
> Christian.
> 
> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>> Problem: If the scheduler is already stopped by the time sched_entity
>> is released and the entity's job_queue is not empty, I encountered
>> a hang in drm_sched_entity_flush. This is because 
>> drm_sched_entity_is_idle
>> never becomes true.
>>
>> Fix: In drm_sched_fini detach all sched_entities from the
>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>> Also wake up all those processes stuck in sched_entity flushing,
>> as the scheduler main thread which would wake them up is stopped by now.
>>
>> v2:
>> Reverse order of drm_sched_rq_remove_entity and marking
>> s_entity as stopped to prevent reinsertion back to rq due
>> to race.
>>
>> v3:
>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>> and check for it in drm_sched_entity_is_idle
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Reviewed-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 ++++++++++++++++++++++++
>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 0249c7450188..2e93e881b65f 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
>> drm_sched_entity *entity)
>>       rmb(); /* for list_empty to work without lock */
>>       if (list_empty(&entity->list) ||
>> -        spsc_queue_count(&entity->job_queue) == 0)
>> +        spsc_queue_count(&entity->job_queue) == 0 ||
>> +        entity->stopped)
>>           return true;
>>       return false;
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 8d1211e87101..a2a953693b45 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>    */
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>> +    struct drm_sched_entity *s_entity;
>> +    int i;
>> +
>>       if (sched->thread)
>>           kthread_stop(sched->thread);
>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>> DRM_SCHED_PRIORITY_MIN; i--) {
>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>> +
>> +        if (!rq)
>> +            continue;
>> +
>> +        spin_lock(&rq->lock);
>> +        list_for_each_entry(s_entity, &rq->entities, list)
>> +            /*
>> +             * Prevents reinsertion and marks job_queue as idle,
>> +             * it will be removed from rq in drm_sched_entity_fini
>> +             * eventually
>> +             */
>> +            s_entity->stopped = true;
>> +        spin_unlock(&rq->lock);
>> +
>> +    }
>> +
>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this 
>> scheduler */
>> +    wake_up_all(&sched->job_scheduled);
>> +
>>       /* Confirm no work left behind accessing device structures */
>>       cancel_delayed_work_sync(&sched->work_tdr);
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread
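
For readers following along, the distinction Andrey draws here is between
the HW fence a ring emits and the scheduler-level s_fence. In drm_sched the
two are chained, so force-completing the HW fences of the removed device
also releases software-side waiters for jobs that already reached the
hardware. Schematically, simplified from the scheduler main loop rather
than quoted verbatim:

/*
 * Once a job is handed to the hardware, the software "finished"
 * fence is chained to the HW fence returned by the backend:
 */
fence = sched->ops->run_job(sched_job);	/* HW fence from the ring */
r = dma_fence_add_callback(fence, &sched_job->cb,
			   drm_sched_process_job);
/*
 * drm_sched_process_job() signals sched_job->s_fence->finished, so
 * anyone blocked on the software fence is released as soon as the
 * HW fence signals -- whether the GPU signaled it for real or
 * amdgpu_fence_driver_force_completion() wrote the final seqno.
 */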

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 15:03     ` Andrey Grodzovsky
@ 2021-05-18 15:15       ` Christian König
  2021-05-18 16:17         ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-18 15:15 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>
> On 2021-05-18 10:07 a.m., Christian König wrote:
>> In a separate discussion with Daniel we once more iterated over the 
>> dma_resv requirements and I came to the conclusion that this approach 
>> here won't work reliably.
>>
>> The problem is as follows:
>> 1. device A schedules some rendering into a buffer and exports 
>> it as DMA-buf.
>> 2. device B imports the DMA-buf and wants to consume the rendering, 
>> for that the fence of device A is replaced with a new operation.
>> 3. device B is hot unplugged and the new operation canceled/never 
>> scheduled.
>>
>> The problem is now that we can't do this since the operation of 
>> device A is still running and by signaling our fences we run into the 
>> problem of potential memory corruption.
>
>
> I am not sure this problem you describe above is related to this patch.

Well it is kind of related.

> Here we purely expand the criteria for when sched_entity is
> considered idle in order to prevent a hang on device remove.

And exactly that is problematic. See, the jobs on the entity need to 
cleanly wait for their dependencies before they can be completed.

drm_sched_entity_kill_jobs() is also not handling that correctly at the 
moment; we only wait for the last scheduled fence but not for the 
dependencies of the job.

> Were you addressing the patch from yesterday in which you commented
> that you found a problem with how we finish fences? It was
> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>
> Also, in the patch series as it is now we only signal HW fences for the
> extracted device B; we are not touching any other fences. In fact, as you
> may remember, I dropped all new logic for forcing fence completion in
> this patch series and only call amdgpu_fence_driver_force_completion
> for the HW fences of the extracted device as it's done today anyway.

Signaling hardware fences is unproblematic since they are emitted when 
the software scheduling is already completed.

Christian.

>
> Andrey
>
>>
>> Not sure how to handle that case. One possibility would be to wait 
>> for all dependencies of unscheduled jobs before signaling their 
>> fences as canceled.
>>
>> Christian.
>>
>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>> Problem: If the scheduler is already stopped by the time sched_entity
>>> is released and the entity's job_queue is not empty, I encountered
>>> a hang in drm_sched_entity_flush. This is because 
>>> drm_sched_entity_is_idle
>>> never becomes true.
>>>
>>> Fix: In drm_sched_fini detach all sched_entities from the
>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>> Also wake up all those processes stuck in sched_entity flushing,
>>> as the scheduler main thread which would wake them up is stopped by now.
>>>
>>> v2:
>>> Reverse order of drm_sched_rq_remove_entity and marking
>>> s_entity as stopped to prevent reinsertion back to rq due
>>> to race.
>>>
>>> v3:
>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>> and check for it in drm_sched_entity_is_idle
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>> ++++++++++++++++++++++++
>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 0249c7450188..2e93e881b65f 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
>>> drm_sched_entity *entity)
>>>       rmb(); /* for list_empty to work without lock */
>>>       if (list_empty(&entity->list) ||
>>> -        spsc_queue_count(&entity->job_queue) == 0)
>>> +        spsc_queue_count(&entity->job_queue) == 0 ||
>>> +        entity->stopped)
>>>           return true;
>>>       return false;
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 8d1211e87101..a2a953693b45 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>    */
>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>   {
>>> +    struct drm_sched_entity *s_entity;
>>> +    int i;
>>> +
>>>       if (sched->thread)
>>>           kthread_stop(sched->thread);
>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>> +
>>> +        if (!rq)
>>> +            continue;
>>> +
>>> +        spin_lock(&rq->lock);
>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>> +            /*
>>> +             * Prevents reinsertion and marks job_queue as idle,
>>> +             * it will be removed from rq in drm_sched_entity_fini
>>> +             * eventually
>>> +             */
>>> +            s_entity->stopped = true;
>>> +        spin_unlock(&rq->lock);
>>> +
>>> +    }
>>> +
>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this 
>>> scheduler */
>>> +    wake_up_all(&sched->job_scheduled);
>>> +
>>>       /* Confirm no work left behind accessing device structures */
>>>       cancel_delayed_work_sync(&sched->work_tdr);
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 15:15       ` Christian König
@ 2021-05-18 16:17         ` Andrey Grodzovsky
  2021-05-18 16:33           ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-18 16:17 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling



On 2021-05-18 11:15 a.m., Christian König wrote:
> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>
>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>> In a separate discussion with Daniel we once more iterated over the 
>>> dma_resv requirements and I came to the conclusion that this approach 
>>> here won't work reliably.
>>>
>>> The problem is as follows:
>>> 1. device A schedules some rendering into a buffer and exports 
>>> it as DMA-buf.
>>> 2. device B imports the DMA-buf and wants to consume the rendering, 
>>> for that the fence of device A is replaced with a new operation.
>>> 3. device B is hot unplugged and the new operation canceled/never 
>>> scheduled.
>>>
>>> The problem is now that we can't do this since the operation of 
>>> device A is still running and by signaling our fences we run into the 
>>> problem of potential memory corruption.

By signaling s_fence->finished of the canceled operation from the
removed device B, do we in fact cause memory corruption for the uncompleted
job still running on device A? Because there is someone waiting to
read or write from the imported buffer on device B, and they only wait for
the s_fence->finished of device B that we signaled
in drm_sched_entity_kill_jobs?

Andrey

>>
>>
>> I am not sure this problem you describe above is related to this patch.
> 
> Well it is kind of related.
> 
>> Here we purely expand the criteria for when sched_entity is
>> considered idle in order to prevent a hang on device remove.
> 
> And exactly that is problematic. See, the jobs on the entity need to 
> cleanly wait for their dependencies before they can be completed.
> 
> drm_sched_entity_kill_jobs() is also not handling that correctly at the 
> moment; we only wait for the last scheduled fence but not for the 
> dependencies of the job.
> 
>> Were you addressing the patch from yesterday in which you commented
>> that you found a problem with how we finish fences? It was
>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>
>> Also, in the patch series as it is now we only signal HW fences for the
>> extracted device B; we are not touching any other fences. In fact, as you
>> may remember, I dropped all new logic for forcing fence completion in
>> this patch series and only call amdgpu_fence_driver_force_completion
>> for the HW fences of the extracted device as it's done today anyway.
> 
> Signaling hardware fences is unproblematic since they are emitted when 
> the software scheduling is already completed.
> 
> Christian.
> 
>>
>> Andrey
>>
>>>
>>> Not sure how to handle that case. One possibility would be to wait 
>>> for all dependencies of unscheduled jobs before signaling their 
>>> fences as canceled.
>>>
>>> Christian.
>>>
>>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>>> Problem: If the scheduler is already stopped by the time sched_entity
>>>> is released and the entity's job_queue is not empty, I encountered
>>>> a hang in drm_sched_entity_flush. This is because 
>>>> drm_sched_entity_is_idle
>>>> never becomes true.
>>>>
>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>> Also wake up all those processes stuck in sched_entity flushing,
>>>> as the scheduler main thread which would wake them up is stopped by now.
>>>>
>>>> v2:
>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>> to race.
>>>>
>>>> v3:
>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>> and check for it in drm_sched_entity_is_idle
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>> ++++++++++++++++++++++++
>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index 0249c7450188..2e93e881b65f 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
>>>> drm_sched_entity *entity)
>>>>       rmb(); /* for list_empty to work without lock */
>>>>       if (list_empty(&entity->list) ||
>>>> -        spsc_queue_count(&entity->job_queue) == 0)
>>>> +        spsc_queue_count(&entity->job_queue) == 0 ||
>>>> +        entity->stopped)
>>>>           return true;
>>>>       return false;
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 8d1211e87101..a2a953693b45 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>    */
>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>   {
>>>> +    struct drm_sched_entity *s_entity;
>>>> +    int i;
>>>> +
>>>>       if (sched->thread)
>>>>           kthread_stop(sched->thread);
>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>> +
>>>> +        if (!rq)
>>>> +            continue;
>>>> +
>>>> +        spin_lock(&rq->lock);
>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>> +            /*
>>>> +             * Prevents reinsertion and marks job_queue as idle,
>>>> +             * it will be removed from rq in drm_sched_entity_fini
>>>> +             * eventually
>>>> +             */
>>>> +            s_entity->stopped = true;
>>>> +        spin_unlock(&rq->lock);
>>>> +
>>>> +    }
>>>> +
>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this 
>>>> scheduler */
>>>> +    wake_up_all(&sched->job_scheduled);
>>>> +
>>>>       /* Confirm no work left behind accessing device structures */
>>>>       cancel_delayed_work_sync(&sched->work_tdr);
>>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread
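
Christian's objection in the next message reduces to a fence-ordering
hazard that is easiest to see as a sketch. Everything below is a
hypothetical illustration with invented names, not code from the series:

/*
 * resv protects the shared buffer. fence_a is device A's still
 * running rendering; fence_b is device B's dependent operation.
 */
static void chain_hazard_example(struct dma_resv *resv,
				 struct dma_fence *fence_a,
				 struct dma_fence *fence_b)
{
	/* fence_b replaces fence_a as what eviction/freeing waits on */
	dma_resv_add_excl_fence(resv, fence_b);

	/*
	 * hot unplug of device B signals fence_b prematurely, even
	 * though its dependency fence_a has not signaled yet
	 */
	dma_fence_set_error(fence_b, -ENODEV);
	dma_fence_signal(fence_b);

	/*
	 * memory management now sees the buffer as idle and may free
	 * its backing store -- while the work behind fence_a, still
	 * running on device A, keeps writing into it
	 */
}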

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 16:17         ` Andrey Grodzovsky
@ 2021-05-18 16:33           ` Christian König
  2021-05-18 17:43             ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-18 16:33 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Am 18.05.21 um 18:17 schrieb Andrey Grodzovsky:
>
>
> On 2021-05-18 11:15 a.m., Christian König wrote:
>> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>>
>>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>>> In a separate discussion with Daniel we once more iterated over the
>>>> dma_resv requirements and I came to the conclusion that this
>>>> approach here won't work reliably.
>>>>
>>>> The problem is as follows:
>>>> 1. device A schedules some rendering into a buffer and exports
>>>> it as DMA-buf.
>>>> 2. device B imports the DMA-buf and wants to consume the rendering;
>>>> for that, the fence of device A is replaced with a new operation.
>>>> 3. device B is hot plugged and the new operation canceled/never
>>>> scheduled.
>>>>
>>>> The problem is now that we can't do this since the operation of
>>>> device A is still running, and by signaling our fences we run into
>>>> the problem of potential memory corruption.
>
> By signaling s_fence->finished of the canceled operation from the
> removed device B we in fact cause memory corruption for the uncompleted
> job still running on device A? Because there is someone waiting to
> read/write from the imported buffer on device B, and he only waits for
> the s_fence->finished of device B that we signaled
> in drm_sched_entity_kill_jobs?

Exactly that, yes.

In other words, when you have a dependency chain like A->B->C, then memory
management only waits for C before freeing up the memory, for example.

When C is now signaled because the device is hot-plugged before A or B are
finished, they are essentially accessing freed-up memory.

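To make this concrete, the memory-management side behaves roughly like
this (a simplified sketch, not the actual TTM eviction code; the helper
name is made up, dma_resv_wait_timeout_rcu() is the real API):

    /* Sketch: eviction/free only waits on the fences currently
     * installed in the BO's reservation object, i.e. the *last*
     * operation added per slot.
     */
    static void evict_bo_sketch(struct ttm_buffer_object *bo)
    {
            /* waits on the exclusive + shared fences, nothing else */
            dma_resv_wait_timeout_rcu(bo->base.resv, true, false,
                                      MAX_SCHEDULE_TIMEOUT);
            /* if C was force-signaled early, A and B may still be
             * running when we get here */
    }
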
Christian.

>
> Andrey
>
>>>
>>>
>>> I am not sure this problem you describe above is related to this patch.
>>
>> Well it is kind of related.
>>
>>> Here we purely expand the criteria for when sched_entity is
>>> considered idle in order to prevent a hang on device remove.
>>
>> And exactly that is problematic. See the jobs on the entity need to 
>> cleanly wait for their dependencies before they can be completed.
>>
>> drm_sched_entity_kill_jobs() is also not handling that correctly at 
>> the moment, we only wait for the last scheduled fence but not for the 
>> dependencies of the job.
>>
>>> Were you addressing the patch from yesterday in which you commented
>>> that you found a problem with how we finish fences ? It was
>>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>>
>>> Also, in the patch series as it is now we only signal HW fences for the
>>> extracted device B, we are not touching any other fences. In fact as 
>>> you
>>> may remember, I dropped all new logic to forcing fence completion in
>>> this patch series and only call amdgpu_fence_driver_force_completion
>>> for the HW fences of the extracted device as it's done today anyway.
>>
>> Signaling hardware fences is unproblematic since they are emitted 
>> when the software scheduling is already completed.
>>
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>>
>>>> Not sure how to handle that case. One possibility would be to wait 
>>>> for all dependencies of unscheduled jobs before signaling their 
>>>> fences as canceled.
>>>>
>>>> Christian.
>>>>
>>>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>>>> Problem: If the scheduler is already stopped by the time the
>>>>> sched_entity is released and the entity's job_queue is not empty,
>>>>> I encountered a hang in drm_sched_entity_flush. This is because
>>>>> drm_sched_entity_is_idle never becomes true.
>>>>>
>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>> Also wake up all the processes stuck in sched_entity flushing,
>>>>> since the scheduler main thread which would wake them up is
>>>>> stopped by now.
>>>>>
>>>>> v2:
>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>> to race.
>>>>>
>>>>> v3:
>>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>>> and check for it in drm_sched_entity_is_idle
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>>> ++++++++++++++++++++++++
>>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index 0249c7450188..2e93e881b65f 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
>>>>> drm_sched_entity *entity)
>>>>>       rmb(); /* for list_empty to work without lock */
>>>>>       if (list_empty(&entity->list) ||
>>>>> -        spsc_queue_count(&entity->job_queue) == 0)
>>>>> +        spsc_queue_count(&entity->job_queue) == 0 ||
>>>>> +        entity->stopped)
>>>>>           return true;
>>>>>       return false;
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 8d1211e87101..a2a953693b45 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>    */
>>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>   {
>>>>> +    struct drm_sched_entity *s_entity;
>>>>> +    int i;
>>>>> +
>>>>>       if (sched->thread)
>>>>>           kthread_stop(sched->thread);
>>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>> +
>>>>> +        if (!rq)
>>>>> +            continue;
>>>>> +
>>>>> +        spin_lock(&rq->lock);
>>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>>> +            /*
>>>>> +             * Prevents reinsertion and marks job_queue as idle,
>>>>> +             * it will be removed from rq in drm_sched_entity_fini
>>>>> +             * eventually
>>>>> +             */
>>>>> +            s_entity->stopped = true;
>>>>> +        spin_unlock(&rq->lock);
>>>>> +
>>>>> +    }
>>>>> +
>>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this 
>>>>> scheduler */
>>>>> +    wake_up_all(&sched->job_scheduled);
>>>>> +
>>>>>       /* Confirm no work left behind accessing device structures */
>>>>>       cancel_delayed_work_sync(&sched->work_tdr);
>>>>
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 16:33           ` Christian König
@ 2021-05-18 17:43             ` Andrey Grodzovsky
  2021-05-18 18:02               ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-18 17:43 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling



On 2021-05-18 12:33 p.m., Christian König wrote:
> Am 18.05.21 um 18:17 schrieb Andrey Grodzovsky:
>>
>>
>> On 2021-05-18 11:15 a.m., Christian König wrote:
>>> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>>>> In a separate discussion with Daniel we once more iterated over the
>>>>> dma_resv requirements and I came to the conclusion that this
>>>>> approach here won't work reliably.
>>>>>
>>>>> The problem is as follows:
>>>>> 1. device A schedules some rendering into a buffer and exports
>>>>> it as DMA-buf.
>>>>> 2. device B imports the DMA-buf and wants to consume the rendering;
>>>>> for that, the fence of device A is replaced with a new operation.
>>>>> 3. device B is hot plugged and the new operation canceled/never
>>>>> scheduled.
>>>>>
>>>>> The problem is now that we can't do this since the operation of
>>>>> device A is still running, and by signaling our fences we run into
>>>>> the problem of potential memory corruption.
>>
>> By signaling s_fence->finished of the canceled operation from the
>> removed device B we in fact cause memory corruption for the uncompleted
>> job still running on device A ? Because there is someone waiting to
>> read write from the imported buffer on device B and he only waits for
>> the s_fence->finished of device B we signaled
>> in drm_sched_entity_kill_jobs ?
> 
> Exactly that, yes.
> 
> In other words when you have a dependency chain like A->B->C then memory 
> management only waits for C before freeing up the memory for example.
> 
> When C now signaled because the device is hot-plugged before A or B are 
> finished they are essentially accessing freed up memory.

But didn't C import the BO from B or A in this case? Why would he be
the one releasing that memory? He would just be dropping his reference
to the BO, no?

Also, in the general case,
drm_sched_entity_fini->drm_sched_entity_kill_jobs, which is
the one that signals the 'C' fence with an error code, is, as far
as I looked, called when the user of that BO is stopping
the usage anyway (e.g. the drm_driver.postclose callback for when the
user process closes his device file). Who would then access and corrupt
the exported memory on device A where the job hasn't completed yet?

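The call chain I am looking at is roughly this (an illustrative sketch,
not the exact amdgpu path):

    drm_release()                          /* user closes its device fd */
      -> drm_driver.postclose()
        -> drm_sched_entity_fini()
          -> drm_sched_entity_kill_jobs()  /* signals 'C' with -ESRCH */
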
Andrey

> 
> Christian.
> 
>>
>> Andrey
>>
>>>>
>>>>
>>>> I am not sure this problem you describe above is related to this patch.
>>>
>>> Well it is kind of related.
>>>
>>>> Here we purely expand the criteria for when sched_entity is
>>>> considered idle in order to prevent a hang on device remove.
>>>
>>> And exactly that is problematic. See the jobs on the entity need to 
>>> cleanly wait for their dependencies before they can be completed.
>>>
>>> drm_sched_entity_kill_jobs() is also not handling that correctly at 
>>> the moment, we only wait for the last scheduled fence but not for the 
>>> dependencies of the job.
>>>
>>>> Were you addressing the patch from yesterday in which you commented
>>>> that you found a problem with how we finish fences ? It was
>>>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>>>
>>>> Also, in the patch series as it is now we only signal HW fences for the
>>>> extracted device B, we are not touching any other fences. In fact as 
>>>> you
>>>> may remember, I dropped all new logic to forcing fence completion in
>>>> this patch series and only call amdgpu_fence_driver_force_completion
>>>> for the HW fences of the extracted device as it's done today anyway.
>>>
>>> Signaling hardware fences is unproblematic since they are emitted 
>>> when the software scheduling is already completed.
>>>
>>> Christian.
>>>
>>>>
>>>> Andrey
>>>>
>>>>>
>>>>> Not sure how to handle that case. One possibility would be to wait 
>>>>> for all dependencies of unscheduled jobs before signaling their 
>>>>> fences as canceled.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>> is released and entity's job_queue not empty I encountered
>>>>>> a hang in drm_sched_entity_flush. This is because 
>>>>>> drm_sched_entity_is_idle
>>>>>> never becomes true.
>>>>>>
>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>>>
>>>>>> v2:
>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>> to race.
>>>>>>
>>>>>> v3:
>>>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>>>> and check for it in drm_sched_entity_is_idle
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>>>> ++++++++++++++++++++++++
>>>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> index 0249c7450188..2e93e881b65f 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
>>>>>> drm_sched_entity *entity)
>>>>>>       rmb(); /* for list_empty to work without lock */
>>>>>>       if (list_empty(&entity->list) ||
>>>>>> -        spsc_queue_count(&entity->job_queue) == 0)
>>>>>> +        spsc_queue_count(&entity->job_queue) == 0 ||
>>>>>> +        entity->stopped)
>>>>>>           return true;
>>>>>>       return false;
>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> index 8d1211e87101..a2a953693b45 100644
>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>    */
>>>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>   {
>>>>>> +    struct drm_sched_entity *s_entity;
>>>>>> +    int i;
>>>>>> +
>>>>>>       if (sched->thread)
>>>>>>           kthread_stop(sched->thread);
>>>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>> +
>>>>>> +        if (!rq)
>>>>>> +            continue;
>>>>>> +
>>>>>> +        spin_lock(&rq->lock);
>>>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>>>> +            /*
>>>>>> +             * Prevents reinsertion and marks job_queue as idle,
>>>>>> +             * it will removed from rq in drm_sched_entity_fini
>>>>>> +             * eventually
>>>>>> +             */
>>>>>> +            s_entity->stopped = true;
>>>>>> +        spin_unlock(&rq->lock);
>>>>>> +
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this 
>>>>>> scheduler */
>>>>>> +    wake_up_all(&sched->job_scheduled);
>>>>>> +
>>>>>>       /* Confirm no work left behind accessing device structures */
>>>>>>       cancel_delayed_work_sync(&sched->work_tdr);
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 17:43             ` Andrey Grodzovsky
@ 2021-05-18 18:02               ` Christian König
  2021-05-18 18:09                 ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-18 18:02 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

Am 18.05.21 um 19:43 schrieb Andrey Grodzovsky:
> On 2021-05-18 12:33 p.m., Christian König wrote:
>> Am 18.05.21 um 18:17 schrieb Andrey Grodzovsky:
>>>
>>>
>>> On 2021-05-18 11:15 a.m., Christian König wrote:
>>>> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>>>>> In a separate discussion with Daniel we once more iterated over
>>>>>> the dma_resv requirements and I came to the conclusion that this
>>>>>> approach here won't work reliably.
>>>>>>
>>>>>> The problem is as follows:
>>>>>> 1. device A schedules some rendering into a buffer and
>>>>>> exports it as DMA-buf.
>>>>>> 2. device B imports the DMA-buf and wants to consume the
>>>>>> rendering; for that, the fence of device A is replaced with a new
>>>>>> operation.
>>>>>> 3. device B is hot plugged and the new operation canceled/never
>>>>>> scheduled.
>>>>>>
>>>>>> The problem is now that we can't do this since the operation of
>>>>>> device A is still running, and by signaling our fences we run into
>>>>>> the problem of potential memory corruption.
>>>
>>> By signaling s_fence->finished of the canceled operation from the
>>> removed device B we in fact cause memory corruption for the uncompleted
>>> job still running on device A ? Because there is someone waiting to
>>> read write from the imported buffer on device B and he only waits for
>>> the s_fence->finished of device B we signaled
>>> in drm_sched_entity_kill_jobs ?
>>
>> Exactly that, yes.
>>
>> In other words when you have a dependency chain like A->B->C then 
>> memory management only waits for C before freeing up the memory for 
>> example.
>>
>> When C now signaled because the device is hot-plugged before A or B 
>> are finished they are essentially accessing freed up memory.
>
> But didn't C imported the BO form B or A in this case ? Why would he be
> the one releasing that memory ? He would be just dropping his reference
> to the BO, no ?

Well freeing the memory was just an example. The BO could also move back 
to VRAM because of the dropped reference.

> Also in the general case,
> drm_sched_entity_fini->drm_sched_entity_kill_jobs which is
> the one who signals the 'C' fence with error code are as far
> as I looked called from when the user of that BO is stopping
> the usage anyway (e.g. drm_driver.postclose callback for when use
> process closes his device file) who would then access and corrupt
> the exported memory on device A where the job hasn't completed yet ?

The key point is that memory management only waits for the last added
fence; that is the design of the dma_resv object. How that happens is
irrelevant.

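As a one-line illustration of that design (a sketch; the call is the
real dma_resv API, 'bo' and 'new_fence' are placeholders):

    /* adding a new exclusive fence *replaces* the previous one, so
     * waiters never see the earlier fences of the chain */
    dma_resv_add_excl_fence(bo->base.resv, new_fence);
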
Because of this we at least need to wait for all dependencies of the job 
before signaling the fence, even if we cancel the job for some reason.

Christian.

>
> Andrey
>
>>
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>>>
>>>>>
>>>>> I am not sure this problem you describe above is related to this 
>>>>> patch.
>>>>
>>>> Well it is kind of related.
>>>>
>>>>> Here we purely expand the criteria for when sched_entity is
>>>>> considered idle in order to prevent a hang on device remove.
>>>>
>>>> And exactly that is problematic. See the jobs on the entity need to 
>>>> cleanly wait for their dependencies before they can be completed.
>>>>
>>>> drm_sched_entity_kill_jobs() is also not handling that correctly at 
>>>> the moment, we only wait for the last scheduled fence but not for 
>>>> the dependencies of the job.
>>>>
>>>>> Were you addressing the patch from yesterday in which you commented
>>>>> that you found a problem with how we finish fences ? It was
>>>>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>>>>
>>>>> Also, in the patch series as it is now we only signal HW fences 
>>>>> for the
>>>>> extracted device B, we are not touching any other fences. In fact 
>>>>> as you
>>>>> may remember, I dropped all new logic to forcing fence completion in
>>>>> this patch series and only call amdgpu_fence_driver_force_completion
>>>>> for the HW fences of the extracted device as it's done today anyway.
>>>>
>>>> Signaling hardware fences is unproblematic since they are emitted 
>>>> when the software scheduling is already completed.
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>>
>>>>>> Not sure how to handle that case. One possibility would be to 
>>>>>> wait for all dependencies of unscheduled jobs before signaling 
>>>>>> their fences as canceled.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>> is released and entity's job_queue not empty I encountered
>>>>>>> a hang in drm_sched_entity_flush. This is because 
>>>>>>> drm_sched_entity_is_idle
>>>>>>> never becomes true.
>>>>>>>
>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>>>>
>>>>>>> v2:
>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>> to race.
>>>>>>>
>>>>>>> v3:
>>>>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>>>>> and check for it in drm_sched_entity_is_idle
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>>>>> ++++++++++++++++++++++++
>>>>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> index 0249c7450188..2e93e881b65f 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
>>>>>>> drm_sched_entity *entity)
>>>>>>>       rmb(); /* for list_empty to work without lock */
>>>>>>>       if (list_empty(&entity->list) ||
>>>>>>> -        spsc_queue_count(&entity->job_queue) == 0)
>>>>>>> +        spsc_queue_count(&entity->job_queue) == 0 ||
>>>>>>> +        entity->stopped)
>>>>>>>           return true;
>>>>>>>       return false;
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index 8d1211e87101..a2a953693b45 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>    */
>>>>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>   {
>>>>>>> +    struct drm_sched_entity *s_entity;
>>>>>>> +    int i;
>>>>>>> +
>>>>>>>       if (sched->thread)
>>>>>>>           kthread_stop(sched->thread);
>>>>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>> +
>>>>>>> +        if (!rq)
>>>>>>> +            continue;
>>>>>>> +
>>>>>>> +        spin_lock(&rq->lock);
>>>>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>>>>> +            /*
>>>>>>> +             * Prevents reinsertion and marks job_queue as idle,
>>>>>>> +             * it will be removed from rq in drm_sched_entity_fini
>>>>>>> +             * eventually
>>>>>>> +             */
>>>>>>> +            s_entity->stopped = true;
>>>>>>> +        spin_unlock(&rq->lock);
>>>>>>> +
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this 
>>>>>>> scheduler */
>>>>>>> +    wake_up_all(&sched->job_scheduled);
>>>>>>> +
>>>>>>>       /* Confirm no work left behind accessing device structures */
>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 18:02               ` Christian König
@ 2021-05-18 18:09                 ` Andrey Grodzovsky
  2021-05-18 18:13                   ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-18 18:09 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling



On 2021-05-18 2:02 p.m., Christian König wrote:
> Am 18.05.21 um 19:43 schrieb Andrey Grodzovsky:
>> On 2021-05-18 12:33 p.m., Christian König wrote:
>>> Am 18.05.21 um 18:17 schrieb Andrey Grodzovsky:
>>>>
>>>>
>>>> On 2021-05-18 11:15 a.m., Christian König wrote:
>>>>> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>>>>>> In a separate discussion with Daniel we once more iterated over
>>>>>>> the dma_resv requirements and I came to the conclusion that this
>>>>>>> approach here won't work reliably.
>>>>>>>
>>>>>>> The problem is as follows:
>>>>>>> 1. device A schedules some rendering into a buffer and
>>>>>>> exports it as DMA-buf.
>>>>>>> 2. device B imports the DMA-buf and wants to consume the
>>>>>>> rendering; for that, the fence of device A is replaced with a new
>>>>>>> operation.
>>>>>>> 3. device B is hot plugged and the new operation canceled/never
>>>>>>> scheduled.
>>>>>>>
>>>>>>> The problem is now that we can't do this since the operation of
>>>>>>> device A is still running, and by signaling our fences we run into
>>>>>>> the problem of potential memory corruption.
>>>>
>>>> By signaling s_fence->finished of the canceled operation from the
>>>> removed device B we in fact cause memory corruption for the uncompleted
>>>> job still running on device A ? Because there is someone waiting to
>>>> read write from the imported buffer on device B and he only waits for
>>>> the s_fence->finished of device B we signaled
>>>> in drm_sched_entity_kill_jobs ?
>>>
>>> Exactly that, yes.
>>>
>>> In other words when you have a dependency chain like A->B->C then 
>>> memory management only waits for C before freeing up the memory for 
>>> example.
>>>
>>> When C now signaled because the device is hot-plugged before A or B 
>>> are finished they are essentially accessing freed up memory.
>>
>> But didn't C imported the BO form B or A in this case ? Why would he be
>> the one releasing that memory ? He would be just dropping his reference
>> to the BO, no ?
> 
> Well freeing the memory was just an example. The BO could also move back 
> to VRAM because of the dropped reference.
> 
>> Also in the general case,
>> drm_sched_entity_fini->drm_sched_entity_kill_jobs which is
>> the one who signals the 'C' fence with error code are as far
>> as I looked called from when the user of that BO is stopping
>> the usage anyway (e.g. drm_driver.postclose callback for when use
>> process closes his device file) who would then access and corrupt
>> the exported memory on device A where the job hasn't completed yet ?
> 
> Key point is that memory management only waits for the last added fence, 
> that is the design of the dma_resv object. How that happens is irrelevant.
> 
> Because of this we at least need to wait for all dependencies of the job 
> before signaling the fence, even if we cancel the job for some reason.
> 
> Christian.

Would this be the right way to do it?

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 2e93e881b65f..10f784874b63 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -223,10 +223,14 @@ static void drm_sched_entity_kill_jobs(struct 
drm_sched_entity *entity)
  {
         struct drm_sched_job *job;
         int r;
+       struct dma_fence *f;

         while ((job = 
to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
                 struct drm_sched_fence *s_fence = job->s_fence;

+               while ((f = job->sched->ops->dependency(job, entity))) {
+                       dma_fence_wait(f, false);
+                       /* ops->dependency() hands over a fence reference */
+                       dma_fence_put(f);
+               }
+
                 drm_sched_fence_scheduled(s_fence);
                 dma_fence_set_error(&s_fence->finished, -ESRCH);

Andrey



> 
>>
>> Andrey
>>
>>>
>>> Christian.
>>>
>>>>
>>>> Andrey
>>>>
>>>>>>
>>>>>>
>>>>>> I am not sure this problem you describe above is related to this 
>>>>>> patch.
>>>>>
>>>>> Well it is kind of related.
>>>>>
>>>>>> Here we purely expand the criteria for when sched_entity is
>>>>>> considered idle in order to prevent a hang on device remove.
>>>>>
>>>>> And exactly that is problematic. See the jobs on the entity need to 
>>>>> cleanly wait for their dependencies before they can be completed.
>>>>>
>>>>> drm_sched_entity_kill_jobs() is also not handling that correctly at 
>>>>> the moment, we only wait for the last scheduled fence but not for 
>>>>> the dependencies of the job.
>>>>>
>>>>>> Were you addressing the patch from yesterday in which you commented
>>>>>> that you found a problem with how we finish fences ? It was
>>>>>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>>>>>
>>>>>> Also, in the patch series as it is now we only signal HW fences 
>>>>>> for the
>>>>>> extracted device B, we are not touching any other fences. In fact 
>>>>>> as you
>>>>>> may remember, I dropped all new logic to forcing fence completion in
>>>>>> this patch series and only call amdgpu_fence_driver_force_completion
>>>>>> for the HW fences of the extracted device as it's done today anyway.
>>>>>
>>>>> Signaling hardware fences is unproblematic since they are emitted 
>>>>> when the software scheduling is already completed.
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>>
>>>>>>> Not sure how to handle that case. One possibility would be to 
>>>>>>> wait for all dependencies of unscheduled jobs before signaling 
>>>>>>> their fences as canceled.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>>> is released and entity's job_queue not empty I encountered
>>>>>>>> a hang in drm_sched_entity_flush. This is because 
>>>>>>>> drm_sched_entity_is_idle
>>>>>>>> never becomes true.
>>>>>>>>
>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>> as the scheduler main thread which wakes them up is stopped by now.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>> to race.
>>>>>>>>
>>>>>>>> v3:
>>>>>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>>>>>> and check for it in drm_sched_entity_is_idle
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>> index 0249c7450188..2e93e881b65f 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
>>>>>>>> drm_sched_entity *entity)
>>>>>>>>       rmb(); /* for list_empty to work without lock */
>>>>>>>>       if (list_empty(&entity->list) ||
>>>>>>>> -        spsc_queue_count(&entity->job_queue) == 0)
>>>>>>>> +        spsc_queue_count(&entity->job_queue) == 0 ||
>>>>>>>> +        entity->stopped)
>>>>>>>>           return true;
>>>>>>>>       return false;
>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> index 8d1211e87101..a2a953693b45 100644
>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>    */
>>>>>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>   {
>>>>>>>> +    struct drm_sched_entity *s_entity;
>>>>>>>> +    int i;
>>>>>>>> +
>>>>>>>>       if (sched->thread)
>>>>>>>>           kthread_stop(sched->thread);
>>>>>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>> +
>>>>>>>> +        if (!rq)
>>>>>>>> +            continue;
>>>>>>>> +
>>>>>>>> +        spin_lock(&rq->lock);
>>>>>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>>>>>> +            /*
>>>>>>>> +             * Prevents reinsertion and marks job_queue as idle,
>>>>>>>> +             * it will be removed from rq in drm_sched_entity_fini
>>>>>>>> +             * eventually
>>>>>>>> +             */
>>>>>>>> +            s_entity->stopped = true;
>>>>>>>> +        spin_unlock(&rq->lock);
>>>>>>>> +
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for this 
>>>>>>>> scheduler */
>>>>>>>> +    wake_up_all(&sched->job_scheduled);
>>>>>>>> +
>>>>>>>>       /* Confirm no work left behind accessing device structures */
>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 18:09                 ` Andrey Grodzovsky
@ 2021-05-18 18:13                   ` Christian König
  2021-05-18 18:48                     ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-18 18:13 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling


Am 18.05.21 um 20:09 schrieb Andrey Grodzovsky:
> On 2021-05-18 2:02 p.m., Christian König wrote:
>> Am 18.05.21 um 19:43 schrieb Andrey Grodzovsky:
>>> On 2021-05-18 12:33 p.m., Christian König wrote:
>>>> Am 18.05.21 um 18:17 schrieb Andrey Grodzovsky:
>>>>>
>>>>>
>>>>> On 2021-05-18 11:15 a.m., Christian König wrote:
>>>>>> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>>>>>>> In a separate discussion with Daniel we once more iterated over
>>>>>>>> the dma_resv requirements and I came to the conclusion that
>>>>>>>> this approach here won't work reliably.
>>>>>>>>
>>>>>>>> The problem is as follows:
>>>>>>>> 1. device A schedules some rendering into a buffer and
>>>>>>>> exports it as DMA-buf.
>>>>>>>> 2. device B imports the DMA-buf and wants to consume the
>>>>>>>> rendering; for that, the fence of device A is replaced with a new
>>>>>>>> operation.
>>>>>>>> 3. device B is hot plugged and the new operation canceled/never
>>>>>>>> scheduled.
>>>>>>>>
>>>>>>>> The problem is now that we can't do this since the operation of
>>>>>>>> device A is still running, and by signaling our fences we run
>>>>>>>> into the problem of potential memory corruption.
>>>>>
>>>>> By signaling s_fence->finished of the canceled operation from the
>>>>> removed device B we in fact cause memory corruption for the 
>>>>> uncompleted
>>>>> job still running on device A ? Because there is someone waiting to
>>>>> read write from the imported buffer on device B and he only waits for
>>>>> the s_fence->finished of device B we signaled
>>>>> in drm_sched_entity_kill_jobs ?
>>>>
>>>> Exactly that, yes.
>>>>
>>>> In other words when you have a dependency chain like A->B->C then 
>>>> memory management only waits for C before freeing up the memory for 
>>>> example.
>>>>
>>>> When C now signaled because the device is hot-plugged before A or B 
>>>> are finished they are essentially accessing freed up memory.
>>>
>>> But didn't C imported the BO form B or A in this case ? Why would he be
>>> the one releasing that memory ? He would be just dropping his reference
>>> to the BO, no ?
>>
>> Well freeing the memory was just an example. The BO could also move 
>> back to VRAM because of the dropped reference.
>>
>>> Also in the general case,
>>> drm_sched_entity_fini->drm_sched_entity_kill_jobs which is
>>> the one who signals the 'C' fence with error code are as far
>>> as I looked called from when the user of that BO is stopping
>>> the usage anyway (e.g. drm_driver.postclose callback for when use
>>> process closes his device file) who would then access and corrupt
>>> the exported memory on device A where the job hasn't completed yet ?
>>
>> Key point is that memory management only waits for the last added 
>> fence, that is the design of the dma_resv object. How that happens is 
>> irrelevant.
>>
>> Because of this we at least need to wait for all dependencies of the 
>> job before signaling the fence, even if we cancel the job for some 
>> reason.
>>
>> Christian.
>
> Would this be the right way to do it ?

Yes, it is at least a start. The question is whether we can do a
blocking wait here or not.

We install a callback a bit lower to avoid blocking, so I'm pretty sure
that won't work as expected.

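For reference, the non-blocking pattern a bit lower in
drm_sched_entity_kill_jobs() looks roughly like this; a dependency-aware
fix would have to re-arm such a callback for every outstanding
dependency instead of calling dma_fence_wait():

    /* completion is chained through a fence callback on the last
     * scheduled fence rather than a blocking wait */
    r = dma_fence_add_callback(entity->last_scheduled, &job->finish_cb,
                               drm_sched_entity_kill_jobs_cb);
    if (r == -ENOENT) /* fence already signaled */
            drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb);
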
Christian.

>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index 2e93e881b65f..10f784874b63 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -223,10 +223,14 @@ static void drm_sched_entity_kill_jobs(struct 
> drm_sched_entity *entity)
>  {
>         struct drm_sched_job *job;
>         int r;
> +       struct dma_fence *f;
>
>         while ((job = 
> to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>                 struct drm_sched_fence *s_fence = job->s_fence;
>
> +               while ((f = job->sched->ops->dependency(job, entity))) {
> +                       dma_fence_wait(f, false);
> +                       /* ops->dependency() hands over a fence reference */
> +                       dma_fence_put(f);
> +               }
> +
>                 drm_sched_fence_scheduled(s_fence);
>                 dma_fence_set_error(&s_fence->finished, -ESRCH);
>
> Andrey
>
>
>
>>
>>>
>>> Andrey
>>>
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am not sure this problem you describe above is related to this 
>>>>>>> patch.
>>>>>>
>>>>>> Well it is kind of related.
>>>>>>
>>>>>>> Here we purely expand the criteria for when sched_entity is
>>>>>>> considered idle in order to prevent a hang on device remove.
>>>>>>
>>>>>> And exactly that is problematic. See the jobs on the entity need 
>>>>>> to cleanly wait for their dependencies before they can be completed.
>>>>>>
>>>>>> drm_sched_entity_kill_jobs() is also not handling that correctly 
>>>>>> at the moment, we only wait for the last scheduled fence but not 
>>>>>> for the dependencies of the job.
>>>>>>
>>>>>>> Were you addressing the patch from yesterday in which you commented
>>>>>>> that you found a problem with how we finish fences ? It was
>>>>>>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>>>>>>
>>>>>>> Also, in the patch series as it is now we only signal HW fences 
>>>>>>> for the
>>>>>>> extracted device B, we are not touching any other fences. In 
>>>>>>> fact as you
>>>>>>> may remember, I dropped all new logic to forcing fence 
>>>>>>> completion in
>>>>>>> this patch series and only call 
>>>>>>> amdgpu_fence_driver_force_completion
>>>>>>> for the HW fences of the extracted device as it's done today 
>>>>>>> anyway.
>>>>>>
>>>>>> Signaling hardware fences is unproblematic since they are emitted 
>>>>>> when the software scheduling is already completed.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>>
>>>>>>>> Not sure how to handle that case. One possibility would be to 
>>>>>>>> wait for all dependencies of unscheduled jobs before signaling 
>>>>>>>> their fences as canceled.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>>>> is released and entity's job_queue not empty I encountered
>>>>>>>>> a hang in drm_sched_entity_flush. This is because 
>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>> never becomes true.
>>>>>>>>>
>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>> scheduler's run queues. This will satisfy 
>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>>> as the scheduler main thread which wakes them up is stopped by 
>>>>>>>>> now.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>>> to race.
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>>>>>>> and check for it in drm_sched_entity_is_idle
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>>>> ---
>>>>>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> index 0249c7450188..2e93e881b65f 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> @@ -116,7 +116,8 @@ static bool 
>>>>>>>>> drm_sched_entity_is_idle(struct drm_sched_entity *entity)
>>>>>>>>>       rmb(); /* for list_empty to work without lock */
>>>>>>>>>       if (list_empty(&entity->list) ||
>>>>>>>>> - spsc_queue_count(&entity->job_queue) == 0)
>>>>>>>>> + spsc_queue_count(&entity->job_queue) == 0 ||
>>>>>>>>> +        entity->stopped)
>>>>>>>>>           return true;
>>>>>>>>>       return false;
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> index 8d1211e87101..a2a953693b45 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>    */
>>>>>>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>   {
>>>>>>>>> +    struct drm_sched_entity *s_entity;
>>>>>>>>> +    int i;
>>>>>>>>> +
>>>>>>>>>       if (sched->thread)
>>>>>>>>>           kthread_stop(sched->thread);
>>>>>>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>> +
>>>>>>>>> +        if (!rq)
>>>>>>>>> +            continue;
>>>>>>>>> +
>>>>>>>>> +        spin_lock(&rq->lock);
>>>>>>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>>>>>>> +            /*
>>>>>>>>> +             * Prevents reinsertion and marks job_queue as idle,
>>>>>>>>> +             * it will be removed from rq in drm_sched_entity_fini
>>>>>>>>> +             * eventually
>>>>>>>>> +             */
>>>>>>>>> +            s_entity->stopped = true;
>>>>>>>>> +        spin_unlock(&rq->lock);
>>>>>>>>> +
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for 
>>>>>>>>> this scheduler */
>>>>>>>>> +    wake_up_all(&sched->job_scheduled);
>>>>>>>>> +
>>>>>>>>>       /* Confirm no work left behind accessing device 
>>>>>>>>> structures */
>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 18:13                   ` Christian König
@ 2021-05-18 18:48                     ` Andrey Grodzovsky
  2021-05-18 20:56                       ` Andrey Grodzovsky
  2021-05-19 10:57                       ` Christian König
  0 siblings, 2 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-18 18:48 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling



On 2021-05-18 2:13 p.m., Christian König wrote:
> 
> Am 18.05.21 um 20:09 schrieb Andrey Grodzovsky:
>> On 2021-05-18 2:02 p.m., Christian König wrote:
>>> Am 18.05.21 um 19:43 schrieb Andrey Grodzovsky:
>>>> On 2021-05-18 12:33 p.m., Christian König wrote:
>>>>> Am 18.05.21 um 18:17 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>>
>>>>>> On 2021-05-18 11:15 a.m., Christian König wrote:
>>>>>>> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>>>>>>>
>>>>>>>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>>>>>>>> In a separate discussion with Daniel we once more iterated over
>>>>>>>>> the dma_resv requirements and I came to the conclusion that
>>>>>>>>> this approach here won't work reliably.
>>>>>>>>>
>>>>>>>>> The problem is as follows:
>>>>>>>>> 1. device A schedules some rendering into a buffer and
>>>>>>>>> exports it as DMA-buf.
>>>>>>>>> 2. device B imports the DMA-buf and wants to consume the
>>>>>>>>> rendering; for that, the fence of device A is replaced with a new
>>>>>>>>> operation.
>>>>>>>>> 3. device B is hot plugged and the new operation canceled/never
>>>>>>>>> scheduled.
>>>>>>>>>
>>>>>>>>> The problem is now that we can't do this since the operation of
>>>>>>>>> device A is still running, and by signaling our fences we run
>>>>>>>>> into the problem of potential memory corruption.
>>>>>>
>>>>>> By signaling s_fence->finished of the canceled operation from the
>>>>>> removed device B we in fact cause memory corruption for the 
>>>>>> uncompleted
>>>>>> job still running on device A ? Because there is someone waiting to
>>>>>> read write from the imported buffer on device B and he only waits for
>>>>>> the s_fence->finished of device B we signaled
>>>>>> in drm_sched_entity_kill_jobs ?
>>>>>
>>>>> Exactly that, yes.
>>>>>
>>>>> In other words when you have a dependency chain like A->B->C then 
>>>>> memory management only waits for C before freeing up the memory for 
>>>>> example.
>>>>>
>>>>> When C now signaled because the device is hot-plugged before A or B 
>>>>> are finished they are essentially accessing freed up memory.
>>>>
>>>> But didn't C imported the BO form B or A in this case ? Why would he be
>>>> the one releasing that memory ? He would be just dropping his reference
>>>> to the BO, no ?
>>>
>>> Well freeing the memory was just an example. The BO could also move 
>>> back to VRAM because of the dropped reference.
>>>
>>>> Also in the general case,
>>>> drm_sched_entity_fini->drm_sched_entity_kill_jobs which is
>>>> the one who signals the 'C' fence with error code are as far
>>>> as I looked called from when the user of that BO is stopping
>>>> the usage anyway (e.g. drm_driver.postclose callback for when use
>>>> process closes his device file) who would then access and corrupt
>>>> the exported memory on device A where the job hasn't completed yet ?
>>>
>>> Key point is that memory management only waits for the last added 
>>> fence, that is the design of the dma_resv object. How that happens is 
>>> irrelevant.
>>>
>>> Because of this we at least need to wait for all dependencies of the 
>>> job before signaling the fence, even if we cancel the job for some 
>>> reason.
>>>
>>> Christian.
>>
>> Would this be the right way to do it ?
> 
> Yes, it is at least a start. Question is if we can wait blocking here or 
> not.
> 
> We install a callback a bit lower to avoid blocking, so I'm pretty sure 
> that won't work as expected.
> 
> Christian.

I can't see why this would create problems: as long as the dependencies
complete (or are force-completed if they are from the same, extracted,
device but on a different ring), it looks to me like it should work.
I will give it a try.

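For reference, by force-completed I mean roughly what
amdgpu_fence_driver_force_completion() does today (a sketch of the
current implementation):

    /* signal every outstanding HW fence of a ring so that anything
     * still waiting on the extracted device can make progress */
    void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring)
    {
            amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
            amdgpu_fence_process(ring);
    }
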
Andrey

> 
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 2e93e881b65f..10f784874b63 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -223,10 +223,14 @@ static void drm_sched_entity_kill_jobs(struct 
>> drm_sched_entity *entity)
>>  {
>>         struct drm_sched_job *job;
>>         int r;
>> +       struct dma_fence *f;
>>
>>         while ((job = 
>> to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>                 struct drm_sched_fence *s_fence = job->s_fence;
>>
>> +               while ((f = job->sched->ops->dependency(job, entity))) {
>> +                       dma_fence_wait(f, false);
>> +                       /* ops->dependency() hands over a fence reference */
>> +                       dma_fence_put(f);
>> +               }
>> +
>>                 drm_sched_fence_scheduled(s_fence);
>>                 dma_fence_set_error(&s_fence->finished, -ESRCH);
>>
>> Andrey
>>
>>
>>
>>>
>>>>
>>>> Andrey
>>>>
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure this problem you describe above is related to this 
>>>>>>>> patch.
>>>>>>>
>>>>>>> Well it is kind of related.
>>>>>>>
>>>>>>>> Here we purely expand the criteria for when sched_entity is
>>>>>>>> considered idle in order to prevent a hang on device remove.
>>>>>>>
>>>>>>> And exactly that is problematic. See the jobs on the entity need 
>>>>>>> to cleanly wait for their dependencies before they can be completed.
>>>>>>>
>>>>>>> drm_sched_entity_kill_jobs() is also not handling that correctly 
>>>>>>> at the moment, we only wait for the last scheduled fence but not 
>>>>>>> for the dependencies of the job.
>>>>>>>
>>>>>>>> Were you addressing the patch from yesterday in which you commented
>>>>>>>> that you found a problem with how we finish fences ? It was
>>>>>>>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>>>>>>>
>>>>>>>> Also, in the patch series as it is now we only signal HW fences 
>>>>>>>> for the
>>>>>>>> extracted device B, we are not touching any other fences. In 
>>>>>>>> fact as you
>>>>>>>> may remember, I dropped all new logic to forcing fence 
>>>>>>>> completion in
>>>>>>>> this patch series and only call 
>>>>>>>> amdgpu_fence_driver_force_completion
>>>>>>>> for the HW fences of the extracted device as it's done today 
>>>>>>>> anyway.
>>>>>>>
>>>>>>> Signaling hardware fences is unproblematic since they are emitted 
>>>>>>> when the software scheduling is already completed.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Not sure how to handle that case. One possibility would be to 
>>>>>>>>> wait for all dependencies of unscheduled jobs before signaling 
>>>>>>>>> their fences as canceled.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Am 12.05.21 um 16:26 schrieb Andrey Grodzovsky:
>>>>>>>>>> Problem: If scheduler is already stopped by the time sched_entity
>>>>>>>>>> is released and entity's job_queue not empty I encountered
>>>>>>>>>> a hang in drm_sched_entity_flush. This is because 
>>>>>>>>>> drm_sched_entity_is_idle
>>>>>>>>>> never becomes true.
>>>>>>>>>>
>>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>>> scheduler's run queues. This will satisfy 
>>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>>> Also wakeup all those processes stuck in sched_entity flushing
>>>>>>>>>> as the scheduler main thread which wakes them up is stopped by 
>>>>>>>>>> now.
>>>>>>>>>>
>>>>>>>>>> v2:
>>>>>>>>>> Reverse order of drm_sched_rq_remove_entity and marking
>>>>>>>>>> s_entity as stopped to prevent reinsertion back to rq due
>>>>>>>>>> to race.
>>>>>>>>>>
>>>>>>>>>> v3:
>>>>>>>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>>>>>>>> and check for it in drm_sched_entity_is_idle
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>> ---
>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>>> index 0249c7450188..2e93e881b65f 100644
>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>>> @@ -116,7 +116,8 @@ static bool 
>>>>>>>>>> drm_sched_entity_is_idle(struct drm_sched_entity *entity)
>>>>>>>>>>       rmb(); /* for list_empty to work without lock */
>>>>>>>>>>       if (list_empty(&entity->list) ||
>>>>>>>>>> - spsc_queue_count(&entity->job_queue) == 0)
>>>>>>>>>> + spsc_queue_count(&entity->job_queue) == 0 ||
>>>>>>>>>> +        entity->stopped)
>>>>>>>>>>           return true;
>>>>>>>>>>       return false;
>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> index 8d1211e87101..a2a953693b45 100644
>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>>    */
>>>>>>>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>>   {
>>>>>>>>>> +    struct drm_sched_entity *s_entity;
>>>>>>>>>> +    int i;
>>>>>>>>>> +
>>>>>>>>>>       if (sched->thread)
>>>>>>>>>>           kthread_stop(sched->thread);
>>>>>>>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>>> +
>>>>>>>>>> +        if (!rq)
>>>>>>>>>> +            continue;
>>>>>>>>>> +
>>>>>>>>>> +        spin_lock(&rq->lock);
>>>>>>>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>>>>>>>> +            /*
>>>>>>>>>> +             * Prevents reinsertion and marks job_queue as idle,
>>>>>>>>>> +             * it will be removed from rq in drm_sched_entity_fini
>>>>>>>>>> +             * eventually
>>>>>>>>>> +             */
>>>>>>>>>> +            s_entity->stopped = true;
>>>>>>>>>> +        spin_unlock(&rq->lock);
>>>>>>>>>> +
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for 
>>>>>>>>>> this scheduler */
>>>>>>>>>> +    wake_up_all(&sched->job_scheduled);
>>>>>>>>>> +
>>>>>>>>>>       /* Confirm no work left behind accessing device 
>>>>>>>>>> structures */
>>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 18:48                     ` Andrey Grodzovsky
@ 2021-05-18 20:56                       ` Andrey Grodzovsky
  2021-05-19 10:57                       ` Christian König
  1 sibling, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-18 20:56 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: Alexander.Deucher, gregkh, helgaas, Felix.Kuehling



On 2021-05-18 2:48 p.m., Andrey Grodzovsky wrote:
> 
> 
> On 2021-05-18 2:13 p.m., Christian König wrote:
>>
>> Am 18.05.21 um 20:09 schrieb Andrey Grodzovsky:
>>> On 2021-05-18 2:02 p.m., Christian König wrote:
>>>> Am 18.05.21 um 19:43 schrieb Andrey Grodzovsky:
>>>>> On 2021-05-18 12:33 p.m., Christian König wrote:
>>>>>> Am 18.05.21 um 18:17 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>>
>>>>>>> On 2021-05-18 11:15 a.m., Christian König wrote:
>>>>>>>> Am 18.05.21 um 17:03 schrieb Andrey Grodzovsky:
>>>>>>>>>
>>>>>>>>> On 2021-05-18 10:07 a.m., Christian König wrote:
>>>>>>>>>> In a separate discussion with Daniel we once more iterated
>>>>>>>>>> over the dma_resv requirements and I came to the conclusion
>>>>>>>>>> that this approach here won't work reliably.
>>>>>>>>>>
>>>>>>>>>> The problem is as follows:
>>>>>>>>>> 1. device A schedules some rendering into a buffer and
>>>>>>>>>> exports it as DMA-buf.
>>>>>>>>>> 2. device B imports the DMA-buf and wants to consume the
>>>>>>>>>> rendering; for that, the fence of device A is replaced with a
>>>>>>>>>> new operation.
>>>>>>>>>> 3. device B is hot plugged and the new operation
>>>>>>>>>> canceled/never scheduled.
>>>>>>>>>>
>>>>>>>>>> The problem is now that we can't do this since the operation
>>>>>>>>>> of device A is still running, and by signaling our fences we
>>>>>>>>>> run into the problem of potential memory corruption.
>>>>>>>
>>>>>>> By signaling s_fence->finished of the canceled operation from the
>>>>>>> removed device B we in fact cause memory corruption for the 
>>>>>>> uncompleted
>>>>>>> job still running on device A ? Because there is someone waiting to
>>>>>>> read write from the imported buffer on device B and he only waits 
>>>>>>> for
>>>>>>> the s_fence->finished of device B we signaled
>>>>>>> in drm_sched_entity_kill_jobs ?
>>>>>>
>>>>>> Exactly that, yes.
>>>>>>
>>>>>> In other words when you have a dependency chain like A->B->C then 
>>>>>> memory management only waits for C before freeing up the memory 
>>>>>> for example.
>>>>>>
>>>>>> When C now signaled because the device is hot-plugged before A or 
>>>>>> B are finished they are essentially accessing freed up memory.
>>>>>
>>>>> But didn't C imported the BO form B or A in this case ? Why would 
>>>>> he be
>>>>> the one releasing that memory ? He would be just dropping his 
>>>>> reference
>>>>> to the BO, no ?
>>>>
>>>> Well freeing the memory was just an example. The BO could also move 
>>>> back to VRAM because of the dropped reference.
>>>>
>>>>> Also, in the general case,
>>>>> drm_sched_entity_fini->drm_sched_entity_kill_jobs, which is
>>>>> what signals the 'C' fence with an error code, is as far
>>>>> as I looked only called when the user of that BO is stopping
>>>>> its usage anyway (e.g. the drm_driver.postclose callback for when
>>>>> the user process closes its device file). Who would then access
>>>>> and corrupt the exported memory on device A where the job hasn't
>>>>> completed yet?
>>>>
>>>> Key point is that memory management only waits for the last added 
>>>> fence, that is the design of the dma_resv object. How that happens 
>>>> is irrelevant.
>>>>
>>>> Because of this we at least need to wait for all dependencies of the 
>>>> job before signaling the fence, even if we cancel the job for some 
>>>> reason.
>>>>
>>>> Christian.
>>>
>>> Would this be the right way to do it ?
>>
>> Yes, it is at least a start. Question is if we can wait blocking here 
>> or not.
>>
>> We install a callback a bit lower to avoid blocking, so I'm pretty 
>> sure that won't work as expected.
>>
>> Christian.
> 
> I can't see why this would create problems: as long as the dependencies
> complete, or are force completed if they are from the same (extracted)
> device but on a different ring, then it looks to me like it should
> work. I will give it a try.
> 
> Andrey

Well, I gave it a try with my usual tests, like IGT hot unplug during
rapid command submissions and unplugging the card while glxgears runs
with DRI_PRIME=1, and haven't seen issues.

Andrey

> 
>>
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 2e93e881b65f..10f784874b63 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -223,10 +223,14 @@ static void drm_sched_entity_kill_jobs(struct 
>>> drm_sched_entity *entity)
>>>  {
>>>         struct drm_sched_job *job;
>>>         int r;
>>> +       struct dma_fence *f;
>>>
>>>         while ((job = 
>>> to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>                 struct drm_sched_fence *s_fence = job->s_fence;
>>>
>>> +               while ((f = job->sched->ops->dependency(job, entity)))
>>> +                       dma_fence_wait(f, false);
>>> +
>>>                 drm_sched_fence_scheduled(s_fence);
>>>                 dma_fence_set_error(&s_fence->finished, -ESRCH);
>>>
>>> Andrey
>>>
>>>
>>>
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure this problem you describe above is related to 
>>>>>>>>> this patch.
>>>>>>>>
>>>>>>>> Well it is kind of related.
>>>>>>>>
>>>>>>>>> Here we purely expand the criteria for when sched_entity is
>>>>>>>>> considered idle in order to prevent a hang on device remove.
>>>>>>>>
>>>>>>>> And exactly that is problematic. See, the jobs on the entity need 
>>>>>>>> to cleanly wait for their dependencies before they can be 
>>>>>>>> completed.
>>>>>>>>
>>>>>>>> drm_sched_entity_kill_jobs() is also not handling that correctly 
>>>>>>>> at the moment; we only wait for the last scheduled fence but not 
>>>>>>>> for the dependencies of the job.
>>>>>>>>
>>>>>>>>> Were you addressing the patch from yesterday in which you 
>>>>>>>>> commented
>>>>>>>>> that you found a problem with how we finish fences ? It was
>>>>>>>>> '[PATCH v7 12/16] drm/amdgpu: Fix hang on device removal.'
>>>>>>>>>
>>>>>>>>> Also, in the patch series as it is now we only signal HW fences 
>>>>>>>>> for the extracted device B; we are not touching any other 
>>>>>>>>> fences. In fact, as you may remember, I dropped all new logic 
>>>>>>>>> for forcing fence completion in this patch series and only call 
>>>>>>>>> amdgpu_fence_driver_force_completion
>>>>>>>>> for the HW fences of the extracted device, as is done today 
>>>>>>>>> anyway.
>>>>>>>>
>>>>>>>> Signaling hardware fences is unproblematic since they are 
>>>>>>>> emitted when the software scheduling is already completed.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Not sure how to handle that case. One possibility would be to 
>>>>>>>>>> wait for all dependencies of unscheduled jobs before signaling 
>>>>>>>>>> their fences as canceled.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> On 12.05.21 at 16:26, Andrey Grodzovsky wrote:
>>>>>>>>>>> Problem: If the scheduler is already stopped by the time the
>>>>>>>>>>> sched_entity is released and the entity's job_queue is not
>>>>>>>>>>> empty, I encountered a hang in drm_sched_entity_flush. This is
>>>>>>>>>>> because drm_sched_entity_is_idle never becomes true.
>>>>>>>>>>>
>>>>>>>>>>> Fix: In drm_sched_fini detach all sched_entities from the
>>>>>>>>>>> scheduler's run queues. This will satisfy 
>>>>>>>>>>> drm_sched_entity_is_idle.
>>>>>>>>>>> Also wake up all those processes stuck in sched_entity 
>>>>>>>>>>> flushing, as the scheduler main thread which would wake them 
>>>>>>>>>>> up is stopped by now.
>>>>>>>>>>>
>>>>>>>>>>> v2:
>>>>>>>>>>> Reverse the order of drm_sched_rq_remove_entity and marking
>>>>>>>>>>> s_entity as stopped to prevent reinsertion back to the rq due
>>>>>>>>>>> to a race.
>>>>>>>>>>>
>>>>>>>>>>> v3:
>>>>>>>>>>> Drop drm_sched_rq_remove_entity, only modify entity->stopped
>>>>>>>>>>> and check for it in drm_sched_entity_is_idle
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c   | 24 
>>>>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>>>>   2 files changed, 26 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>>>> index 0249c7450188..2e93e881b65f 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>>>> @@ -116,7 +116,8 @@ static bool 
>>>>>>>>>>> drm_sched_entity_is_idle(struct drm_sched_entity *entity)
>>>>>>>>>>>       rmb(); /* for list_empty to work without lock */
>>>>>>>>>>>       if (list_empty(&entity->list) ||
>>>>>>>>>>> - spsc_queue_count(&entity->job_queue) == 0)
>>>>>>>>>>> + spsc_queue_count(&entity->job_queue) == 0 ||
>>>>>>>>>>> +        entity->stopped)
>>>>>>>>>>>           return true;
>>>>>>>>>>>       return false;
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> index 8d1211e87101..a2a953693b45 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> @@ -898,9 +898,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>>>>>>>    */
>>>>>>>>>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>   {
>>>>>>>>>>> +    struct drm_sched_entity *s_entity;
>>>>>>>>>>> +    int i;
>>>>>>>>>>> +
>>>>>>>>>>>       if (sched->thread)
>>>>>>>>>>>           kthread_stop(sched->thread);
>>>>>>>>>>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>>>>>>>>>>> DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>>> +        struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>>>>>>>> +
>>>>>>>>>>> +        if (!rq)
>>>>>>>>>>> +            continue;
>>>>>>>>>>> +
>>>>>>>>>>> +        spin_lock(&rq->lock);
>>>>>>>>>>> +        list_for_each_entry(s_entity, &rq->entities, list)
>>>>>>>>>>> +            /*
>>>>>>>>>>> +             * Prevents reinsertion and marks job_queue as
>>>>>>>>>>> +             * idle; it will be removed from the rq in
>>>>>>>>>>> +             * drm_sched_entity_fini eventually.
>>>>>>>>>>> +             */
>>>>>>>>>>> +            s_entity->stopped = true;
>>>>>>>>>>> +        spin_unlock(&rq->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    /* Wakeup everyone stuck in drm_sched_entity_flush for 
>>>>>>>>>>> this scheduler */
>>>>>>>>>>> +    wake_up_all(&sched->job_scheduled);
>>>>>>>>>>> +
>>>>>>>>>>>       /* Confirm no work left behind accessing device 
>>>>>>>>>>> structures */
>>>>>>>>>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 64+ messages in thread
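
A small illustration of the dma_resv point discussed above (sketch only,
using 5.12-era API names and amdgpu-style field naming as assumptions,
not code from the thread): eviction and teardown wait on the most
recently attached fence, so force-signaling the newest fence in a chain
unblocks them while earlier jobs still run.

	struct dma_resv *resv = bo->tbo.base.resv;

	/* Chain A -> B -> C on one buffer: each new submission replaces
	 * the exclusive fence, so only C (the newest) is left to wait on.
	 */
	dma_resv_lock(resv, NULL);
	dma_resv_add_excl_fence(resv, fence_c);
	dma_resv_unlock(resv);

	/* Memory management waits only on the current fence.  If C is
	 * force-signaled on hot unplug while A and B still execute, this
	 * returns early and the memory can be freed or moved underneath
	 * the still-running jobs A and B.
	 */
	dma_resv_wait_timeout_rcu(resv, false /* exclusive only */,
				  false, MAX_SCHEDULE_TIMEOUT);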

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-18 18:48                     ` Andrey Grodzovsky
  2021-05-18 20:56                       ` Andrey Grodzovsky
@ 2021-05-19 10:57                       ` Christian König
  2021-05-19 11:03                         ` Andrey Grodzovsky
  1 sibling, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-19 10:57 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx,
	linux-pci, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

On 18.05.21 at 20:48, Andrey Grodzovsky wrote:
> [SNIP]
>>>
>>> Would this be the right way to do it ?
>>
>> Yes, it is at least a start. Question is if we can wait blocking here 
>> or not.
>>
>> We install a callback a bit lower to avoid blocking, so I'm pretty 
>> sure that won't work as expected.
>>
>> Christian.
>
> I can't see why this would create problems: as long as the dependencies
> complete, or are force completed if they are from the same (extracted)
> device but on a different ring, then it looks to me like it should
> work. I will give it a try.

Ok, but please also test the case for a killed process.

Christian.

>
> Andrey


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-19 10:57                       ` Christian König
@ 2021-05-19 11:03                         ` Andrey Grodzovsky
  2021-05-19 11:46                           ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-19 11:03 UTC (permalink / raw)
  To: Christian König, Christian König, dri-devel, amd-gfx,
	linux-pci, daniel.vetter, Harry.Wentland
  Cc: Alexander.Deucher, gregkh, ppaalanen, helgaas, Felix.Kuehling



On 2021-05-19 6:57 a.m., Christian König wrote:
> On 18.05.21 at 20:48, Andrey Grodzovsky wrote:
>> [SNIP]
>>>>
>>>> Would this be the right way to do it ?
>>>
>>> Yes, it is at least a start. Question is if we can wait blocking here 
>>> or not.
>>>
>>> We install a callback a bit lower to avoid blocking, so I'm pretty 
>>> sure that won't work as expected.
>>>
>>> Christian.
>>
>> I can't see why this would create problems: as long as the dependencies
>> complete, or are force completed if they are from the same (extracted)
>> device but on a different ring, then it looks to me like it should
>> work. I will give it a try.
> 
> Ok, but please also test the case for a killed process.
> 
> Christian.

You mean something like running glxgears and then simply
terminating it? Because I've done that. Or something more?

Andrey


> 
>>
>> Andrey
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-19 11:03                         ` Andrey Grodzovsky
@ 2021-05-19 11:46                           ` Christian König
  2021-05-19 11:51                             ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-19 11:46 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx,
	linux-pci, daniel.vetter, Harry.Wentland
  Cc: Alexander.Deucher, gregkh, ppaalanen, helgaas, Felix.Kuehling

On 19.05.21 at 13:03, Andrey Grodzovsky wrote:
>
>
> On 2021-05-19 6:57 a.m., Christian König wrote:
>> On 18.05.21 at 20:48, Andrey Grodzovsky wrote:
>>> [SNIP]
>>>>>
>>>>> Would this be the right way to do it ?
>>>>
>>>> Yes, it is at least a start. Question is if we can wait blocking 
>>>> here or not.
>>>>
>>>> We install a callback a bit lower to avoid blocking, so I'm pretty 
>>>> sure that won't work as expected.
>>>>
>>>> Christian.
>>>
>>> I can't see why this would create problems: as long as the dependencies
>>> complete, or are force completed if they are from the same (extracted)
>>> device but on a different ring, then it looks to me like it should
>>> work. I will give it a try.
>>
>> Ok, but please also test the case for a killed process.
>>
>> Christian.
>
> You mean something like running glxgears and then simply
> terminating it? Because I've done that. Or something more?

Well, glxgears is a bit too lightweight for that.

You need at least some test which is limited by the rendering pipeline.

Christian.

>
> Andrey
>
>
>>
>>>
>>> Andrey
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-19 11:46                           ` Christian König
@ 2021-05-19 11:51                             ` Andrey Grodzovsky
  2021-05-19 11:56                               ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-19 11:51 UTC (permalink / raw)
  To: Christian König, Christian König, dri-devel, amd-gfx,
	linux-pci, daniel.vetter, Harry.Wentland
  Cc: Alexander.Deucher, gregkh, ppaalanen, helgaas, Felix.Kuehling



On 2021-05-19 7:46 a.m., Christian König wrote:
> On 19.05.21 at 13:03, Andrey Grodzovsky wrote:
>>
>>
>> On 2021-05-19 6:57 a.m., Christian König wrote:
>>> On 18.05.21 at 20:48, Andrey Grodzovsky wrote:
>>>> [SNIP]
>>>>>>
>>>>>> Would this be the right way to do it ?
>>>>>
>>>>> Yes, it is at least a start. Question is if we can wait blocking 
>>>>> here or not.
>>>>>
>>>>> We install a callback a bit lower to avoid blocking, so I'm pretty 
>>>>> sure that won't work as expected.
>>>>>
>>>>> Christian.
>>>>
>>>> I can't see why this would create problems: as long as the dependencies
>>>> complete, or are force completed if they are from the same (extracted)
>>>> device but on a different ring, then it looks to me like it should
>>>> work. I will give it a try.
>>>
>>> Ok, but please also test the case for a killed process.
>>>
>>> Christian.
>>
>> You mean something like running glxgears and then simply
>> terminating it? Because I've done that. Or something more?
> 
> Well, glxgears is a bit too lightweight for that.
> 
> You need at least some test which is limited by the rendering pipeline.
> 
> Christian.

You mean something that fills the entity queue faster than the sched
thread empties it, so that when we kill the process we actually need to
explicitly go through the remaining jobs' termination? I've done that
too, by inserting an artificial delay in drm_sched_main.

Andrey

> 
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Andrey
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread
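
For reference, the artificial delay mentioned above could be a throwaway
hack along these lines (hypothetical debugging-only change, not
something posted in the series; msleep() needs linux/delay.h):

--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ ... @@ static int drm_sched_main(void *param)
 		sched_job = drm_sched_entity_pop_job(entity);
+
+		/* Testing only: slow the scheduler thread so
+		 * entity->job_queue backs up and killing the process
+		 * exercises drm_sched_entity_kill_jobs().
+		 */
+		msleep(100);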

* Re: [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released
  2021-05-19 11:51                             ` Andrey Grodzovsky
@ 2021-05-19 11:56                               ` Christian König
  2021-05-19 14:14                                 ` [PATCH] drm/sched: Avoid data corruptions Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Christian König @ 2021-05-19 11:56 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx,
	linux-pci, daniel.vetter, Harry.Wentland
  Cc: Alexander.Deucher, gregkh, ppaalanen, helgaas, Felix.Kuehling

On 19.05.21 at 13:51, Andrey Grodzovsky wrote:
>
>
> On 2021-05-19 7:46 a.m., Christian König wrote:
>> On 19.05.21 at 13:03, Andrey Grodzovsky wrote:
>>>
>>>
>>> On 2021-05-19 6:57 a.m., Christian König wrote:
>>>> On 18.05.21 at 20:48, Andrey Grodzovsky wrote:
>>>>> [SNIP]
>>>>>>>
>>>>>>> Would this be the right way to do it ?
>>>>>>
>>>>>> Yes, it is at least a start. Question is if we can wait blocking 
>>>>>> here or not.
>>>>>>
>>>>>> We install a callback a bit lower to avoid blocking, so I'm 
>>>>>> pretty sure that won't work as expected.
>>>>>>
>>>>>> Christian.
>>>>>
>>>>> I can't see why this would create problems: as long as the
>>>>> dependencies complete, or are force completed if they are from the
>>>>> same (extracted) device but on a different ring, then it looks to
>>>>> me like it should work. I will give it a try.
>>>>
>>>> Ok, but please also test the case for a killed process.
>>>>
>>>> Christian.
>>>
>>> You mean something like running glxgears and then simply
>>> terminating it? Because I've done that. Or something more?
>>
>> Well, glxgears is a bit too lightweight for that.
>>
>> You need at least some test which is limited by the rendering pipeline.
>>
>> Christian.
>
> You mean something that fills the entity queue faster than the sched
> thread empties it, so that when we kill the process we actually need to
> explicitly go through the remaining jobs' termination? I've done that
> too, by inserting an artificial delay in drm_sched_main.

Yeah, something like that.

Ok in that case I would say that this should work then.

Christian.

>
> Andrey
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> Andrey
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH] drm/sched: Avoid data corruptions
  2021-05-19 11:56                               ` Christian König
@ 2021-05-19 14:14                                 ` Andrey Grodzovsky
  2021-05-19 14:15                                   ` Christian König
  0 siblings, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-19 14:14 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling,
	Andrey Grodzovsky

Wait for all dependencies of a job to complete before
killing it to avoid data corruption.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 2e93e881b65f..d5cf61972558 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -222,11 +222,16 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity)
 {
 	struct drm_sched_job *job;
+	struct dma_fence *f;
 	int r;
 
 	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
 		struct drm_sched_fence *s_fence = job->s_fence;
 
+		/* Wait for all dependencies to avoid data corruptions */
+		while ((f = job->sched->ops->dependency(job, entity)))
+		dma_fence_wait(f, false);
+
 		drm_sched_fence_scheduled(s_fence);
 		dma_fence_set_error(&s_fence->finished, -ESRCH);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread
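
On the consumer side, the -ESRCH set above is what lets a waiter
distinguish a killed job from a completed one. A minimal sketch
(standard dma_fence usage, not code from this series):

	struct drm_sched_fence *s_fence = job->s_fence;
	long r;

	/* Blocks until the scheduler fence signals, interruptibly. */
	r = dma_fence_wait(&s_fence->finished, true);
	if (r == 0 && s_fence->finished.error == -ESRCH) {
		/* The job was canceled before it ever ran; the contents
		 * of its target buffers must not be trusted.
		 */
	}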

* Re: [PATCH] drm/sched: Avoid data corruptions
  2021-05-19 14:14                                 ` [PATCH] drm/sched: Avoid data corruptions Andrey Grodzovsky
@ 2021-05-19 14:15                                   ` Christian König
  0 siblings, 0 replies; 64+ messages in thread
From: Christian König @ 2021-05-19 14:15 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci, daniel.vetter,
	Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Felix.Kuehling

On 19.05.21 at 16:14, Andrey Grodzovsky wrote:
> Wait for all dependencies of a job to complete before
> killing it to avoid data corruption.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 2e93e881b65f..d5cf61972558 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -222,11 +222,16 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>   static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity)
>   {
>   	struct drm_sched_job *job;
> +	struct dma_fence *f;
>   	int r;
>   
>   	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>   		struct drm_sched_fence *s_fence = job->s_fence;
>   
> +		/* Wait for all dependencies to avoid data corruptions */
> +		while ((f = job->sched->ops->dependency(job, entity)))
> +		dma_fence_wait(f, false);
> +
>   		drm_sched_fence_scheduled(s_fence);
>   		dma_fence_set_error(&s_fence->finished, -ESRCH);
>   


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH] drm/amdgpu: Add early fini callback
  2021-05-12 20:33   ` Felix Kuehling
  2021-05-12 20:38     ` Andrey Grodzovsky
@ 2021-05-20  3:20     ` Andrey Grodzovsky
  2021-05-20  3:29       ` Felix Kuehling
  1 sibling, 1 reply; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-20  3:20 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-pci, ckoenig.leichtzumerken,
	daniel.vetter, Harry.Wentland, Felix.Kuehling
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Andrey Grodzovsky,
	Christian König

Use it to call display code dependent on device->drv_data
before it's set to NULL on device unplug.

v5:
Move HW finalization into this callback to prevent MMIO accesses
post PCI remove.

v7:
Split kfd suspend from device exit to expedite HW-related
stuff to amdgpu_pci_remove

v8:
Squash previous KFD commit into this commit to avoid compile break.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 59 +++++++++++++------
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  3 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +++-
 drivers/gpu/drm/amd/include/amd_shared.h      |  2 +
 6 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 5f6696a3c778..2b06dee9a0ce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -170,7 +170,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 	}
 }
 
-void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
+void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev)
 {
 	if (adev->kfd.dev) {
 		kgd2kfd_device_exit(adev->kfd.dev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 5ffb07b02810..d8a537e8aea5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -127,7 +127,7 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev);
-void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev);
+void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev);
 int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
 				uint32_t vmid, uint64_t gpu_addr,
 				uint32_t *ib_cmd, uint32_t ib_len);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8bee95ad32d9..bc75e35dd8d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2558,34 +2558,26 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
 	return 0;
 }
 
-/**
- * amdgpu_device_ip_fini - run fini for hardware IPs
- *
- * @adev: amdgpu_device pointer
- *
- * Main teardown pass for hardware IPs.  The list of all the hardware
- * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
- * are run.  hw_fini tears down the hardware associated with each IP
- * and sw_fini tears down any software state associated with each IP.
- * Returns 0 on success, negative error code on failure.
- */
-static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
+static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 {
 	int i, r;
 
-	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
-		amdgpu_virt_release_ras_err_handler_data(adev);
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (!adev->ip_blocks[i].version->funcs->early_fini)
+			continue;
 
-	amdgpu_ras_pre_fini(adev);
+		r = adev->ip_blocks[i].version->funcs->early_fini((void *)adev);
+		if (r) {
+			DRM_DEBUG("early_fini of IP block <%s> failed %d\n",
+				  adev->ip_blocks[i].version->funcs->name, r);
+		}
+	}
 
-	if (adev->gmc.xgmi.num_physical_nodes > 1)
-		amdgpu_xgmi_remove_device(adev);
+	amdgpu_amdkfd_suspend(adev, false);
 
 	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
 	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
 
-	amdgpu_amdkfd_device_fini(adev);
-
 	/* need to disable SMC first */
 	for (i = 0; i < adev->num_ip_blocks; i++) {
 		if (!adev->ip_blocks[i].status.hw)
@@ -2616,6 +2608,33 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 		adev->ip_blocks[i].status.hw = false;
 	}
 
+	return 0;
+}
+
+/**
+ * amdgpu_device_ip_fini - run fini for hardware IPs
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Main teardown pass for hardware IPs.  The list of all the hardware
+ * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
+ * are run.  hw_fini tears down the hardware associated with each IP
+ * and sw_fini tears down any software state associated with each IP.
+ * Returns 0 on success, negative error code on failure.
+ */
+static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
+{
+	int i, r;
+
+	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
+		amdgpu_virt_release_ras_err_handler_data(adev);
+
+	amdgpu_ras_pre_fini(adev);
+
+	if (adev->gmc.xgmi.num_physical_nodes > 1)
+		amdgpu_xgmi_remove_device(adev);
+
+	amdgpu_amdkfd_device_fini_sw(adev);
 
 	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
 		if (!adev->ip_blocks[i].status.sw)
@@ -3681,6 +3700,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	amdgpu_fbdev_fini(adev);
 
 	amdgpu_irq_fini_hw(adev);
+
+	amdgpu_device_ip_fini_early(adev);
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 357b9bf62a1c..ab6d2a43c9a3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -858,10 +858,11 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 	return kfd->init_complete;
 }
 
+
+
 void kgd2kfd_device_exit(struct kfd_dev *kfd)
 {
 	if (kfd->init_complete) {
-		kgd2kfd_suspend(kfd, false);
 		device_queue_manager_uninit(kfd->dqm);
 		kfd_interrupt_exit(kfd);
 		kfd_topology_remove_device(kfd);
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 9ca517b65854..f7112865269a 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1251,6 +1251,15 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
 	return -EINVAL;
 }
 
+static int amdgpu_dm_early_fini(void *handle)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+
+	amdgpu_dm_audio_fini(adev);
+
+	return 0;
+}
+
 static void amdgpu_dm_fini(struct amdgpu_device *adev)
 {
 	int i;
@@ -1259,8 +1268,6 @@ static void amdgpu_dm_fini(struct amdgpu_device *adev)
 		drm_encoder_cleanup(&adev->dm.mst_encoders[i].base);
 	}
 
-	amdgpu_dm_audio_fini(adev);
-
 	amdgpu_dm_destroy_drm_device(&adev->dm);
 
 #if defined(CONFIG_DRM_AMD_SECURE_DISPLAY)
@@ -2298,6 +2305,7 @@ static const struct amd_ip_funcs amdgpu_dm_funcs = {
 	.late_init = dm_late_init,
 	.sw_init = dm_sw_init,
 	.sw_fini = dm_sw_fini,
+	.early_fini = amdgpu_dm_early_fini,
 	.hw_init = dm_hw_init,
 	.hw_fini = dm_hw_fini,
 	.suspend = dm_suspend,
diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
index 43ed6291b2b8..1ad56da486e4 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -240,6 +240,7 @@ enum amd_dpm_forced_level;
  * @late_init: sets up late driver/hw state (post hw_init) - Optional
  * @sw_init: sets up driver state, does not configure hw
  * @sw_fini: tears down driver state, does not configure hw
+ * @early_fini: tears down stuff before dev detached from driver
  * @hw_init: sets up the hw state
  * @hw_fini: tears down the hw state
  * @late_fini: final cleanup
@@ -268,6 +269,7 @@ struct amd_ip_funcs {
 	int (*late_init)(void *handle);
 	int (*sw_init)(void *handle);
 	int (*sw_fini)(void *handle);
+	int (*early_fini)(void *handle);
 	int (*hw_init)(void *handle);
 	int (*hw_fini)(void *handle);
 	void (*late_fini)(void *handle);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread
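
As a usage sketch for the new hook (hypothetical 'foo' IP block with
assumed names, not part of the patch): a block opts in by filling
.early_fini in its amd_ip_funcs table, and amdgpu_device_ip_fini_early()
then calls it from amdgpu_device_fini_hw(), i.e. while the drm_device's
driver-private data is still valid.

	static int foo_early_fini(void *handle)
	{
		struct amdgpu_device *adev = (struct amdgpu_device *)handle;

		/* Tear down anything here that dereferences state which
		 * pci remove is about to clear, e.g. dev->dev_private.
		 */
		return 0;
	}

	static const struct amd_ip_funcs foo_ip_funcs = {
		.name = "foo",
		.early_fini = foo_early_fini,
		/* remaining hooks (sw_init, hw_init, ...) omitted */
	};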

* Re: [PATCH] drm/amdgpu: Add early fini callback
  2021-05-20  3:20     ` [PATCH] drm/amdgpu: Add early fini callback Andrey Grodzovsky
@ 2021-05-20  3:29       ` Felix Kuehling
  2021-05-20  3:58         ` Andrey Grodzovsky
  0 siblings, 1 reply; 64+ messages in thread
From: Felix Kuehling @ 2021-05-20  3:29 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Christian König

On 2021-05-19 at 11:20 p.m., Andrey Grodzovsky wrote:
> Use it to call display code dependent on device->drv_data
> before it's set to NULL on device unplug.
>
> v5:
> Move HW finalization into this callback to prevent MMIO accesses
> post PCI remove.
>
> v7:
> Split kfd suspend from device exit to expedite HW-related
> stuff to amdgpu_pci_remove
>
> v8:
> Squash previous KFD commit into this commit to avoid compile break.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Acked-by: Christian König <christian.koenig@amd.com>

See one cosmetic comment inline. With that fixed the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 59 +++++++++++++------
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  3 +-
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +++-
>  drivers/gpu/drm/amd/include/amd_shared.h      |  2 +
>  6 files changed, 56 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 5f6696a3c778..2b06dee9a0ce 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -170,7 +170,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>  	}
>  }
>  
> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev)
>  {
>  	if (adev->kfd.dev) {
>  		kgd2kfd_device_exit(adev->kfd.dev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 5ffb07b02810..d8a537e8aea5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -127,7 +127,7 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>  			const void *ih_ring_entry);
>  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>  void amdgpu_amdkfd_device_init(struct amdgpu_device *adev);
> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev);
> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev);
>  int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>  				uint32_t vmid, uint64_t gpu_addr,
>  				uint32_t *ib_cmd, uint32_t ib_len);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8bee95ad32d9..bc75e35dd8d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2558,34 +2558,26 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>  	return 0;
>  }
>  
> -/**
> - * amdgpu_device_ip_fini - run fini for hardware IPs
> - *
> - * @adev: amdgpu_device pointer
> - *
> - * Main teardown pass for hardware IPs.  The list of all the hardware
> - * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
> - * are run.  hw_fini tears down the hardware associated with each IP
> - * and sw_fini tears down any software state associated with each IP.
> - * Returns 0 on success, negative error code on failure.
> - */
> -static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
> +static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
>  {
>  	int i, r;
>  
> -	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
> -		amdgpu_virt_release_ras_err_handler_data(adev);
> +	for (i = 0; i < adev->num_ip_blocks; i++) {
> +		if (!adev->ip_blocks[i].version->funcs->early_fini)
> +			continue;
>  
> -	amdgpu_ras_pre_fini(adev);
> +		r = adev->ip_blocks[i].version->funcs->early_fini((void *)adev);
> +		if (r) {
> +			DRM_DEBUG("early_fini of IP block <%s> failed %d\n",
> +				  adev->ip_blocks[i].version->funcs->name, r);
> +		}
> +	}
>  
> -	if (adev->gmc.xgmi.num_physical_nodes > 1)
> -		amdgpu_xgmi_remove_device(adev);
> +	amdgpu_amdkfd_suspend(adev, false);
>  
>  	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>  	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>  
> -	amdgpu_amdkfd_device_fini(adev);
> -
>  	/* need to disable SMC first */
>  	for (i = 0; i < adev->num_ip_blocks; i++) {
>  		if (!adev->ip_blocks[i].status.hw)
> @@ -2616,6 +2608,33 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>  		adev->ip_blocks[i].status.hw = false;
>  	}
>  
> +	return 0;
> +}
> +
> +/**
> + * amdgpu_device_ip_fini - run fini for hardware IPs
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * Main teardown pass for hardware IPs.  The list of all the hardware
> + * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
> + * are run.  hw_fini tears down the hardware associated with each IP
> + * and sw_fini tears down any software state associated with each IP.
> + * Returns 0 on success, negative error code on failure.
> + */
> +static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
> +{
> +	int i, r;
> +
> +	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
> +		amdgpu_virt_release_ras_err_handler_data(adev);
> +
> +	amdgpu_ras_pre_fini(adev);
> +
> +	if (adev->gmc.xgmi.num_physical_nodes > 1)
> +		amdgpu_xgmi_remove_device(adev);
> +
> +	amdgpu_amdkfd_device_fini_sw(adev);
>  
>  	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>  		if (!adev->ip_blocks[i].status.sw)
> @@ -3681,6 +3700,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>  	amdgpu_fbdev_fini(adev);
>  
>  	amdgpu_irq_fini_hw(adev);
> +
> +	amdgpu_device_ip_fini_early(adev);
>  }
>  
>  void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 357b9bf62a1c..ab6d2a43c9a3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -858,10 +858,11 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>  	return kfd->init_complete;
>  }
>  
> +
> +
>  void kgd2kfd_device_exit(struct kfd_dev *kfd)

Unnecessary whitespace change.

Regards,
  Felix


>  {
>  	if (kfd->init_complete) {
> -		kgd2kfd_suspend(kfd, false);
>  		device_queue_manager_uninit(kfd->dqm);
>  		kfd_interrupt_exit(kfd);
>  		kfd_topology_remove_device(kfd);
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 9ca517b65854..f7112865269a 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -1251,6 +1251,15 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
>  	return -EINVAL;
>  }
>  
> +static int amdgpu_dm_early_fini(void *handle)
> +{
> +	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> +
> +	amdgpu_dm_audio_fini(adev);
> +
> +	return 0;
> +}
> +
>  static void amdgpu_dm_fini(struct amdgpu_device *adev)
>  {
>  	int i;
> @@ -1259,8 +1268,6 @@ static void amdgpu_dm_fini(struct amdgpu_device *adev)
>  		drm_encoder_cleanup(&adev->dm.mst_encoders[i].base);
>  	}
>  
> -	amdgpu_dm_audio_fini(adev);
> -
>  	amdgpu_dm_destroy_drm_device(&adev->dm);
>  
>  #if defined(CONFIG_DRM_AMD_SECURE_DISPLAY)
> @@ -2298,6 +2305,7 @@ static const struct amd_ip_funcs amdgpu_dm_funcs = {
>  	.late_init = dm_late_init,
>  	.sw_init = dm_sw_init,
>  	.sw_fini = dm_sw_fini,
> +	.early_fini = amdgpu_dm_early_fini,
>  	.hw_init = dm_hw_init,
>  	.hw_fini = dm_hw_fini,
>  	.suspend = dm_suspend,
> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
> index 43ed6291b2b8..1ad56da486e4 100644
> --- a/drivers/gpu/drm/amd/include/amd_shared.h
> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
> @@ -240,6 +240,7 @@ enum amd_dpm_forced_level;
>   * @late_init: sets up late driver/hw state (post hw_init) - Optional
>   * @sw_init: sets up driver state, does not configure hw
>   * @sw_fini: tears down driver state, does not configure hw
> + * @early_fini: tears down stuff before dev detached from driver
>   * @hw_init: sets up the hw state
>   * @hw_fini: tears down the hw state
>   * @late_fini: final cleanup
> @@ -268,6 +269,7 @@ struct amd_ip_funcs {
>  	int (*late_init)(void *handle);
>  	int (*sw_init)(void *handle);
>  	int (*sw_fini)(void *handle);
> +	int (*early_fini)(void *handle);
>  	int (*hw_init)(void *handle);
>  	int (*hw_fini)(void *handle);
>  	void (*late_fini)(void *handle);

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] drm/amdgpu: Add early fini callback
  2021-05-20  3:29       ` Felix Kuehling
@ 2021-05-20  3:58         ` Andrey Grodzovsky
  0 siblings, 0 replies; 64+ messages in thread
From: Andrey Grodzovsky @ 2021-05-20  3:58 UTC (permalink / raw)
  To: Felix Kuehling, dri-devel, amd-gfx, linux-pci,
	ckoenig.leichtzumerken, daniel.vetter, Harry.Wentland
  Cc: ppaalanen, Alexander.Deucher, gregkh, helgaas, Christian König


On 2021-05-19 11:29 p.m., Felix Kuehling wrote:
> On 2021-05-19 at 11:20 p.m., Andrey Grodzovsky wrote:
>> Use it to call display code dependent on device->drv_data
>> before it's set to NULL on device unplug.
>>
>> v5:
>> Move HW finalization into this callback to prevent MMIO accesses
>> post PCI remove.
>>
>> v7:
>> Split kfd suspend from device exit to expedite HW-related
>> stuff to amdgpu_pci_remove
>>
>> v8:
>> Squash previous KFD commit into this commit to avoid compile break.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Acked-by: Christian König <christian.koenig@amd.com>
> See one cosmetic comment inline. With that fixed the patch is
>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


Thanks for the quick response, updated.
Since this was the last commit to review, I also pushed the series to
drm-misc-next.

Andrey

>
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 59 +++++++++++++------
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  3 +-
>>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +++-
>>   drivers/gpu/drm/amd/include/amd_shared.h      |  2 +
>>   6 files changed, 56 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 5f6696a3c778..2b06dee9a0ce 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -170,7 +170,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>>   	}
>>   }
>>   
>> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
>> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev)
>>   {
>>   	if (adev->kfd.dev) {
>>   		kgd2kfd_device_exit(adev->kfd.dev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 5ffb07b02810..d8a537e8aea5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -127,7 +127,7 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>   			const void *ih_ring_entry);
>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>   void amdgpu_amdkfd_device_init(struct amdgpu_device *adev);
>> -void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev);
>> +void amdgpu_amdkfd_device_fini_sw(struct amdgpu_device *adev);
>>   int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>>   				uint32_t vmid, uint64_t gpu_addr,
>>   				uint32_t *ib_cmd, uint32_t ib_len);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 8bee95ad32d9..bc75e35dd8d8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2558,34 +2558,26 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>>   	return 0;
>>   }
>>   
>> -/**
>> - * amdgpu_device_ip_fini - run fini for hardware IPs
>> - *
>> - * @adev: amdgpu_device pointer
>> - *
>> - * Main teardown pass for hardware IPs.  The list of all the hardware
>> - * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
>> - * are run.  hw_fini tears down the hardware associated with each IP
>> - * and sw_fini tears down any software state associated with each IP.
>> - * Returns 0 on success, negative error code on failure.
>> - */
>> -static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>> +static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
>>   {
>>   	int i, r;
>>   
>> -	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
>> -		amdgpu_virt_release_ras_err_handler_data(adev);
>> +	for (i = 0; i < adev->num_ip_blocks; i++) {
>> +		if (!adev->ip_blocks[i].version->funcs->early_fini)
>> +			continue;
>>   
>> -	amdgpu_ras_pre_fini(adev);
>> +		r = adev->ip_blocks[i].version->funcs->early_fini((void *)adev);
>> +		if (r) {
>> +			DRM_DEBUG("early_fini of IP block <%s> failed %d\n",
>> +				  adev->ip_blocks[i].version->funcs->name, r);
>> +		}
>> +	}
>>   
>> -	if (adev->gmc.xgmi.num_physical_nodes > 1)
>> -		amdgpu_xgmi_remove_device(adev);
>> +	amdgpu_amdkfd_suspend(adev, false);
>>   
>>   	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>   	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>   
>> -	amdgpu_amdkfd_device_fini(adev);
>> -
>>   	/* need to disable SMC first */
>>   	for (i = 0; i < adev->num_ip_blocks; i++) {
>>   		if (!adev->ip_blocks[i].status.hw)
>> @@ -2616,6 +2608,33 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>>   		adev->ip_blocks[i].status.hw = false;
>>   	}
>>   
>> +	return 0;
>> +}
>> +
>> +/**
>> + * amdgpu_device_ip_fini - run fini for hardware IPs
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * Main teardown pass for hardware IPs.  The list of all the hardware
>> + * IPs that make up the asic is walked and the hw_fini and sw_fini callbacks
>> + * are run.  hw_fini tears down the hardware associated with each IP
>> + * and sw_fini tears down any software state associated with each IP.
>> + * Returns 0 on success, negative error code on failure.
>> + */
>> +static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>> +{
>> +	int i, r;
>> +
>> +	if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
>> +		amdgpu_virt_release_ras_err_handler_data(adev);
>> +
>> +	amdgpu_ras_pre_fini(adev);
>> +
>> +	if (adev->gmc.xgmi.num_physical_nodes > 1)
>> +		amdgpu_xgmi_remove_device(adev);
>> +
>> +	amdgpu_amdkfd_device_fini_sw(adev);
>>   
>>   	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>>   		if (!adev->ip_blocks[i].status.sw)
>> @@ -3681,6 +3700,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>   	amdgpu_fbdev_fini(adev);
>>   
>>   	amdgpu_irq_fini_hw(adev);
>> +
>> +	amdgpu_device_ip_fini_early(adev);
>>   }
>>   
>>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 357b9bf62a1c..ab6d2a43c9a3 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -858,10 +858,11 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>   	return kfd->init_complete;
>>   }
>>   
>> +
>> +
>>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
> Unnecessary whitespace change.
>
> Regards,
>    Felix
>
>
>>   {
>>   	if (kfd->init_complete) {
>> -		kgd2kfd_suspend(kfd, false);
>>   		device_queue_manager_uninit(kfd->dqm);
>>   		kfd_interrupt_exit(kfd);
>>   		kfd_topology_remove_device(kfd);
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index 9ca517b65854..f7112865269a 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -1251,6 +1251,15 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
>>   	return -EINVAL;
>>   }
>>   
>> +static int amdgpu_dm_early_fini(void *handle)
>> +{
>> +	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>> +
>> +	amdgpu_dm_audio_fini(adev);
>> +
>> +	return 0;
>> +}
>> +
>>   static void amdgpu_dm_fini(struct amdgpu_device *adev)
>>   {
>>   	int i;
>> @@ -1259,8 +1268,6 @@ static void amdgpu_dm_fini(struct amdgpu_device *adev)
>>   		drm_encoder_cleanup(&adev->dm.mst_encoders[i].base);
>>   	}
>>   
>> -	amdgpu_dm_audio_fini(adev);
>> -
>>   	amdgpu_dm_destroy_drm_device(&adev->dm);
>>   
>>   #if defined(CONFIG_DRM_AMD_SECURE_DISPLAY)
>> @@ -2298,6 +2305,7 @@ static const struct amd_ip_funcs amdgpu_dm_funcs = {
>>   	.late_init = dm_late_init,
>>   	.sw_init = dm_sw_init,
>>   	.sw_fini = dm_sw_fini,
>> +	.early_fini = amdgpu_dm_early_fini,
>>   	.hw_init = dm_hw_init,
>>   	.hw_fini = dm_hw_fini,
>>   	.suspend = dm_suspend,
>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
>> index 43ed6291b2b8..1ad56da486e4 100644
>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>> @@ -240,6 +240,7 @@ enum amd_dpm_forced_level;
>>    * @late_init: sets up late driver/hw state (post hw_init) - Optional
>>    * @sw_init: sets up driver state, does not configure hw
>>    * @sw_fini: tears down driver state, does not configure hw
>> + * @early_fini: tears down stuff before dev detached from driver
>>    * @hw_init: sets up the hw state
>>    * @hw_fini: tears down the hw state
>>    * @late_fini: final cleanup
>> @@ -268,6 +269,7 @@ struct amd_ip_funcs {
>>   	int (*late_init)(void *handle);
>>   	int (*sw_init)(void *handle);
>>   	int (*sw_fini)(void *handle);
>> +	int (*early_fini)(void *handle);
>>   	int (*hw_init)(void *handle);
>>   	int (*hw_fini)(void *handle);
>>   	void (*late_fini)(void *handle);

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2021-05-20  3:58 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-12 14:26 [PATCH v7 00/16] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 01/16] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 02/16] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 03/16] drm/amdkfd: Split kfd suspend from device exit Andrey Grodzovsky
2021-05-12 20:33   ` Felix Kuehling
2021-05-12 20:38     ` Andrey Grodzovsky
2021-05-20  3:20     ` [PATCH] drm/amdgpu: Add early fini callback Andrey Grodzovsky
2021-05-20  3:29       ` Felix Kuehling
2021-05-20  3:58         ` Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 04/16] " Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 05/16] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
2021-05-14 14:41   ` Andrey Grodzovsky
2021-05-14 16:25     ` Felix Kuehling
2021-05-14 16:26       ` Andrey Grodzovsky
2021-05-17 14:38       ` [PATCH] " Andrey Grodzovsky
2021-05-17 14:48         ` Felix Kuehling
2021-05-12 14:26 ` [PATCH v7 06/16] drm/amdgpu: Remap all page faults to per process dummy page Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 07/16] PCI: Add support for dev_groups to struct pci_driver Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 08/16] drm/amdgpu: Convert driver sysfs attributes to static attributes Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 09/16] drm/amdgpu: Guard against write accesses after device removal Andrey Grodzovsky
2021-05-12 20:17   ` Alex Deucher
2021-05-12 20:30     ` Andrey Grodzovsky
2021-05-12 20:50       ` Alex Deucher
2021-05-13 14:47         ` Andrey Grodzovsky
2021-05-13 14:54           ` Alex Deucher
2021-05-12 14:26 ` [PATCH v7 10/16] drm/sched: Make timeout timer rearm conditional Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 11/16] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 12/16] drm/amdgpu: Fix hang on device removal Andrey Grodzovsky
2021-05-14 14:42   ` Andrey Grodzovsky
2021-05-17 14:40     ` Andrey Grodzovsky
2021-05-17 17:39       ` Alex Deucher
2021-05-17 19:39       ` Christian König
2021-05-17 19:46         ` Andrey Grodzovsky
2021-05-17 19:54           ` Christian König
2021-05-12 14:26 ` [PATCH v7 13/16] drm/scheduler: Fix hang when sched_entity released Andrey Grodzovsky
2021-05-18 14:07   ` Christian König
2021-05-18 15:03     ` Andrey Grodzovsky
2021-05-18 15:15       ` Christian König
2021-05-18 16:17         ` Andrey Grodzovsky
2021-05-18 16:33           ` Christian König
2021-05-18 17:43             ` Andrey Grodzovsky
2021-05-18 18:02               ` Christian König
2021-05-18 18:09                 ` Andrey Grodzovsky
2021-05-18 18:13                   ` Christian König
2021-05-18 18:48                     ` Andrey Grodzovsky
2021-05-18 20:56                       ` Andrey Grodzovsky
2021-05-19 10:57                       ` Christian König
2021-05-19 11:03                         ` Andrey Grodzovsky
2021-05-19 11:46                           ` Christian König
2021-05-19 11:51                             ` Andrey Grodzovsky
2021-05-19 11:56                               ` Christian König
2021-05-19 14:14                                 ` [PATCH] drm/sched: Avoid data corruptions Andrey Grodzovsky
2021-05-19 14:15                                   ` Christian König
2021-05-12 14:26 ` [PATCH v7 14/16] drm/amd/display: Remove superfluous drm_mode_config_cleanup Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 15/16] drm/amdgpu: Verify DMA operations from device are done Andrey Grodzovsky
2021-05-12 14:26 ` [PATCH v7 16/16] drm/amdgpu: Unmap all MMIO mappings Andrey Grodzovsky
2021-05-14 14:42   ` Andrey Grodzovsky
2021-05-17 14:41     ` Andrey Grodzovsky
2021-05-17 17:43   ` Alex Deucher
2021-05-17 18:46     ` Andrey Grodzovsky
2021-05-17 18:56       ` Alex Deucher
2021-05-17 19:22         ` Andrey Grodzovsky
2021-05-17 19:31     ` [PATCH] " Andrey Grodzovsky
2021-05-18 14:01       ` Andrey Grodzovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).