* [PATCH 0/4] Various fixes to pass libdrm hotunplug tests
@ 2021-08-24 21:01 Andrey Grodzovsky
  2021-08-24 21:01 ` [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini Andrey Grodzovsky
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 21:01 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: ckoenig.leichtzumerken, Andrey Grodzovsky

A bunch of fixes to make the hotplug tests I previously added in [1]
pass with the latest code.
Once accepted, I will enable the tests on the libdrm side.

[1] - https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/172

Andrey Grodzovsky (4):
  drm/amdgpu: Move flush VCE idle_work during HW fini
  drm/ttm: Create pinned list
  drm/amdgpu: Handle IOMMU enabled case
  drm/amdgpu: Add a UAPI flag for hot plug/unplug

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    | 50 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    |  1 -
 drivers/gpu/drm/amd/amdgpu/vce_v2_0.c      |  4 ++
 drivers/gpu/drm/amd/amdgpu/vce_v3_0.c      |  5 ++-
 drivers/gpu/drm/amd/amdgpu/vce_v4_0.c      |  2 +
 drivers/gpu/drm/ttm/ttm_bo.c               | 24 +++++++++--
 drivers/gpu/drm/ttm/ttm_resource.c         |  1 +
 include/drm/ttm/ttm_resource.h             |  1 +
 11 files changed, 88 insertions(+), 6 deletions(-)

-- 
2.25.1



* [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini
  2021-08-24 21:01 [PATCH 0/4] Various fixes to pass libdrm hotunplug tests Andrey Grodzovsky
@ 2021-08-24 21:01 ` Andrey Grodzovsky
  2021-08-25  1:41   ` Quan, Evan
  2021-08-24 21:01 ` [PATCH 2/4] drm/ttm: Create pinned list Andrey Grodzovsky
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 21:01 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: ckoenig.leichtzumerken, Andrey Grodzovsky

Attempts to powergate after the device is removed lead to a crash.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 1 -
 drivers/gpu/drm/amd/amdgpu/vce_v2_0.c   | 4 ++++
 drivers/gpu/drm/amd/amdgpu/vce_v3_0.c   | 5 ++++-
 drivers/gpu/drm/amd/amdgpu/vce_v4_0.c   | 2 ++
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index 1ae7f824adc7..8e8dee9fac9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -218,7 +218,6 @@ int amdgpu_vce_sw_fini(struct amdgpu_device *adev)
 	if (adev->vce.vcpu_bo == NULL)
 		return 0;
 
-	cancel_delayed_work_sync(&adev->vce.idle_work);
 	drm_sched_entity_destroy(&adev->vce.entity);
 
 	amdgpu_bo_free_kernel(&adev->vce.vcpu_bo, &adev->vce.gpu_addr,
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v2_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v2_0.c
index c7d28c169be5..716dfdd020b4 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v2_0.c
@@ -477,6 +477,10 @@ static int vce_v2_0_hw_init(void *handle)
 
 static int vce_v2_0_hw_fini(void *handle)
 {
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+
+	cancel_delayed_work_sync(&adev->vce.idle_work);
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v3_0.c
index 3b82fb289ef6..49581c6e0cea 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v3_0.c
@@ -495,7 +495,10 @@ static int vce_v3_0_hw_fini(void *handle)
 		return r;
 
 	vce_v3_0_stop(adev);
-	return vce_v3_0_set_clockgating_state(adev, AMD_CG_STATE_GATE);
+	r = vce_v3_0_set_clockgating_state(adev, AMD_CG_STATE_GATE);
+	cancel_delayed_work_sync(&adev->vce.idle_work);
+
+	return r;
 }
 
 static int vce_v3_0_suspend(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
index 90910d19db12..3297405fd32d 100644
--- a/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vce_v4_0.c
@@ -550,6 +550,8 @@ static int vce_v4_0_hw_fini(void *handle)
 		DRM_DEBUG("For SRIOV client, shouldn't do anything.\n");
 	}
 
+	cancel_delayed_work_sync(&adev->vce.idle_work);
+
 	return 0;
 }
 
-- 
2.25.1



* [PATCH 2/4] drm/ttm: Create pinned list
  2021-08-24 21:01 [PATCH 0/4] Various fixes to pass libdrm hotunplug tests Andrey Grodzovsky
  2021-08-24 21:01 ` [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini Andrey Grodzovsky
@ 2021-08-24 21:01 ` Andrey Grodzovsky
  2021-08-25  6:40   ` Christian König
  2021-08-24 21:01 ` [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
  2021-08-24 21:01 ` [PATCH 4/4] drm/amdgpu: Add a UAPI flag for hot plug/unplug Andrey Grodzovsky
  3 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 21:01 UTC (permalink / raw)
  To: dri-devel, amd-gfx
  Cc: ckoenig.leichtzumerken, Andrey Grodzovsky, Christian König

This list will be used to capture all non-VRAM BOs that are not
on the LRU, so that when the device is hot unplugged we can iterate
the list and unmap their DMA mappings before the device is removed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c       | 24 +++++++++++++++++++++---
 drivers/gpu/drm/ttm/ttm_resource.c |  1 +
 include/drm/ttm/ttm_resource.h     |  1 +
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 1b950b45cf4b..84ba76ace58f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -69,16 +69,34 @@ static void ttm_bo_mem_space_debug(struct ttm_buffer_object *bo,
 	}
 }
 
-static void ttm_bo_del_from_lru(struct ttm_buffer_object *bo)
+static void ttm_bo_del_from_lru_imp(struct ttm_buffer_object *bo, bool final)
 {
 	struct ttm_device *bdev = bo->bdev;
+	struct ttm_resource_manager *man = NULL;
 
-	list_del_init(&bo->lru);
+	if (bo->resource)
+		man = ttm_manager_type(bdev, bo->resource->mem_type);
+
+
+	if (!final && man && man->use_tt)
+		list_move_tail(&bo->lru, &man->pinned);
+	else
+		list_del_init(&bo->lru);
 
 	if (bdev->funcs->del_from_lru_notify)
 		bdev->funcs->del_from_lru_notify(bo);
 }
 
+static inline void ttm_bo_del_from_lru_final(struct ttm_buffer_object *bo)
+{
+	ttm_bo_del_from_lru_imp(bo, true);
+}
+
+static inline void ttm_bo_del_from_lru(struct ttm_buffer_object *bo)
+{
+	ttm_bo_del_from_lru_imp(bo, false);
+}
+
 static void ttm_bo_bulk_move_set_pos(struct ttm_lru_bulk_move_pos *pos,
 				     struct ttm_buffer_object *bo)
 {
@@ -453,7 +471,7 @@ static void ttm_bo_release(struct kref *kref)
 	}
 
 	spin_lock(&bo->bdev->lru_lock);
-	ttm_bo_del_from_lru(bo);
+	ttm_bo_del_from_lru_final(bo);
 	list_del(&bo->ddestroy);
 	spin_unlock(&bo->bdev->lru_lock);
 
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 2431717376e7..91165f77fe0e 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -85,6 +85,7 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
 
 	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
 		INIT_LIST_HEAD(&man->lru[i]);
+	INIT_LIST_HEAD(&man->pinned);
 	man->move = NULL;
 }
 EXPORT_SYMBOL(ttm_resource_manager_init);
diff --git a/include/drm/ttm/ttm_resource.h b/include/drm/ttm/ttm_resource.h
index 140b6b9a8bbe..1ec0d5ebb59f 100644
--- a/include/drm/ttm/ttm_resource.h
+++ b/include/drm/ttm/ttm_resource.h
@@ -130,6 +130,7 @@ struct ttm_resource_manager {
 	 */
 
 	struct list_head lru[TTM_MAX_BO_PRIORITY];
+	struct list_head pinned;
 
 	/*
 	 * Protected by @move_lock.
-- 
2.25.1



* [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case
  2021-08-24 21:01 [PATCH 0/4] Various fixes to pass libdrm hotunplug tests Andrey Grodzovsky
  2021-08-24 21:01 ` [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini Andrey Grodzovsky
  2021-08-24 21:01 ` [PATCH 2/4] drm/ttm: Create pinned list Andrey Grodzovsky
@ 2021-08-24 21:01 ` Andrey Grodzovsky
  2021-08-25  6:43   ` Christian König
  2021-08-24 21:01 ` [PATCH 4/4] drm/amdgpu: Add a UAPI flag for hot plug/unplug Andrey Grodzovsky
  3 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 21:01 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: ckoenig.leichtzumerken, Andrey Grodzovsky

Handle all DMA IOMMU group related dependencies before the
group is removed, so that we do not access it after it is freed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    | 50 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h    |  1 +
 3 files changed, 53 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0b5764aa98a4..288a465b8101 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3860,6 +3860,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 
 	amdgpu_device_ip_fini_early(adev);
 
+	amdgpu_ttm_clear_dma_mappings(adev);
+
 	amdgpu_gart_dummy_page_fini(adev);
 
 	amdgpu_device_unmap_mmio(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 446943e32e3e..f73d807db3b0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -64,6 +64,7 @@
 static int amdgpu_ttm_backend_bind(struct ttm_device *bdev,
 				   struct ttm_tt *ttm,
 				   struct ttm_resource *bo_mem);
+
 static void amdgpu_ttm_backend_unbind(struct ttm_device *bdev,
 				      struct ttm_tt *ttm);
 
@@ -2293,6 +2294,55 @@ static ssize_t amdgpu_iomem_write(struct file *f, const char __user *buf,
 	return result;
 }
 
+void amdgpu_ttm_clear_dma_mappings(struct amdgpu_device *adev)
+{
+	struct ttm_device *bdev = &adev->mman.bdev;
+	struct ttm_resource_manager *man;
+	struct ttm_buffer_object *bo;
+	unsigned int i, j;
+
+	spin_lock(&bdev->lru_lock);
+	for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
+		man = ttm_manager_type(bdev, i);
+		if (!man || !man->use_tt)
+			continue;
+
+		while (!list_empty(&man->pinned)) {
+			bo = list_first_entry(&man->pinned, struct ttm_buffer_object, lru);
+			/* Take ref against racing releases once lru_lock is unlocked */
+			ttm_bo_get(bo);
+			list_del_init(&bo->lru);
+			spin_unlock(&bdev->lru_lock);
+
+			if (bo->ttm) {
+				amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
+				ttm_tt_destroy_common(bo->bdev, bo->ttm);
+			}
+
+			ttm_bo_put(bo);
+			spin_lock(&bdev->lru_lock);
+		}
+
+		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
+			while (!list_empty(&man->lru[j])) {
+				bo = list_first_entry(&man->lru[j], struct ttm_buffer_object, lru);
+				ttm_bo_get(bo);
+				list_del_init(&bo->lru);
+				spin_unlock(&bdev->lru_lock);
+
+				if (bo->ttm) {
+					amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
+					ttm_tt_destroy_common(bo->bdev, bo->ttm);
+				}
+				ttm_bo_put(bo);
+				spin_lock(&bdev->lru_lock);
+			}
+		}
+	}
+	spin_unlock(&bdev->lru_lock);
+
+}
+
 static const struct file_operations amdgpu_ttm_iomem_fops = {
 	.owner = THIS_MODULE,
 	.read = amdgpu_iomem_read,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index e69f3e8e06e5..02c8eac48a64 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -190,6 +190,7 @@ bool amdgpu_ttm_tt_is_readonly(struct ttm_tt *ttm);
 uint64_t amdgpu_ttm_tt_pde_flags(struct ttm_tt *ttm, struct ttm_resource *mem);
 uint64_t amdgpu_ttm_tt_pte_flags(struct amdgpu_device *adev, struct ttm_tt *ttm,
 				 struct ttm_resource *mem);
+void amdgpu_ttm_clear_dma_mappings(struct amdgpu_device *adev);
 
 void amdgpu_ttm_debugfs_init(struct amdgpu_device *adev);
 
-- 
2.25.1



* [PATCH 4/4] drm/amdgpu: Add a UAPI flag for hot plug/unplug
  2021-08-24 21:01 [PATCH 0/4] Various fixes to pass libdrm hotunplug tests Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2021-08-24 21:01 ` [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
@ 2021-08-24 21:01 ` Andrey Grodzovsky
  3 siblings, 0 replies; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-24 21:01 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: ckoenig.leichtzumerken, Andrey Grodzovsky

To support the libdrm hot plug/unplug tests.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6400259a7c4b..c2fdf67ff551 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -96,9 +96,10 @@
  * - 3.40.0 - Add AMDGPU_IDS_FLAGS_TMZ
  * - 3.41.0 - Add video codec query
  * - 3.42.0 - Add 16bpc fixed point display support
+ * - 3.43.0 - Add device hot plug/unplug support
  */
 #define KMS_DRIVER_MAJOR	3
-#define KMS_DRIVER_MINOR	42
+#define KMS_DRIVER_MINOR	43
 #define KMS_DRIVER_PATCHLEVEL	0
 
 int amdgpu_vram_limit;
-- 
2.25.1



* RE: [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini
  2021-08-24 21:01 ` [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini Andrey Grodzovsky
@ 2021-08-25  1:41   ` Quan, Evan
  2021-08-25  3:20     ` Andrey Grodzovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Quan, Evan @ 2021-08-25  1:41 UTC (permalink / raw)
  To: Grodzovsky, Andrey, dri-devel, amd-gfx
  Cc: ckoenig.leichtzumerken, Grodzovsky, Andrey

[AMD Official Use Only]

Hi Andrey,

I sent out a similar patch set to address an S3 issue, and I believe it should be able to address the issue here too.
https://lists.freedesktop.org/archives/amd-gfx/2021-August/067972.html
https://lists.freedesktop.org/archives/amd-gfx/2021-August/067967.html

BR
Evan
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Andrey Grodzovsky
> Sent: Wednesday, August 25, 2021 5:01 AM
> To: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
> Cc: ckoenig.leichtzumerken@gmail.com; Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com>
> Subject: [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini
> 
> [snip quoted patch]


* Re: [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini
  2021-08-25  1:41   ` Quan, Evan
@ 2021-08-25  3:20     ` Andrey Grodzovsky
  2021-08-25  3:56       ` Quan, Evan
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-25  3:20 UTC (permalink / raw)
  To: Quan, Evan, dri-devel, amd-gfx; +Cc: ckoenig.leichtzumerken

Right, they will cover my use case. When are they landing? I rebased 
today and haven't seen them.

Andrey

On 2021-08-24 9:41 p.m., Quan, Evan wrote:
> [AMD Official Use Only]
>
> Hi Andrey,
>
> I sent out a similar patch set to address an S3 issue, and I believe it should be able to address the issue here too.
> https://lists.freedesktop.org/archives/amd-gfx/2021-August/067972.html
> https://lists.freedesktop.org/archives/amd-gfx/2021-August/067967.html
>
> BR
> Evan
>> [snip quoted patch]


* RE: [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini
  2021-08-25  3:20     ` Andrey Grodzovsky
@ 2021-08-25  3:56       ` Quan, Evan
  0 siblings, 0 replies; 13+ messages in thread
From: Quan, Evan @ 2021-08-25  3:56 UTC (permalink / raw)
  To: Grodzovsky, Andrey, dri-devel, amd-gfx; +Cc: ckoenig.leichtzumerken

[AMD Official Use Only]

Just landed.

Thanks,
Evan
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Wednesday, August 25, 2021 11:20 AM
> To: Quan, Evan <Evan.Quan@amd.com>; dri-devel@lists.freedesktop.org;
> amd-gfx@lists.freedesktop.org
> Cc: ckoenig.leichtzumerken@gmail.com
> Subject: Re: [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW
> fini
> 
> Right, they will cover my use case. When are they landing? I rebased today
> and haven't seen them.
> 
> Andrey
> 
> On 2021-08-24 9:41 p.m., Quan, Evan wrote:
> > [AMD Official Use Only]
> >
> > Hi Andrey,
> >
> > I sent out a similar patch set to address an S3 issue, and I believe it should be
> > able to address the issue here too.
> > https://lists.freedesktop.org/archives/amd-gfx/2021-August/067972.html
> > https://lists.freedesktop.org/archives/amd-gfx/2021-August/067967.html
> >
> > BR
> > Evan
> >> [snip quoted patch]


* Re: [PATCH 2/4] drm/ttm: Create pinned list
  2021-08-24 21:01 ` [PATCH 2/4] drm/ttm: Create pinned list Andrey Grodzovsky
@ 2021-08-25  6:40   ` Christian König
  0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2021-08-25  6:40 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx; +Cc: ckoenig.leichtzumerken

Am 24.08.21 um 23:01 schrieb Andrey Grodzovsky:
> This list will be used to capture all non-VRAM BOs that are not
> on the LRU, so that when the device is hot unplugged we can iterate
> the list and unmap their DMA mappings before the device is removed.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Suggested-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/ttm/ttm_bo.c       | 24 +++++++++++++++++++++---
>   drivers/gpu/drm/ttm/ttm_resource.c |  1 +
>   include/drm/ttm/ttm_resource.h     |  1 +
>   3 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 1b950b45cf4b..84ba76ace58f 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -69,16 +69,34 @@ static void ttm_bo_mem_space_debug(struct ttm_buffer_object *bo,
>   	}
>   }
>   
> -static void ttm_bo_del_from_lru(struct ttm_buffer_object *bo)
> +static void ttm_bo_del_from_lru_imp(struct ttm_buffer_object *bo, bool final)

I think we should rather inline the only user where we actually need to 
delete the BO from the LRU and then rename the function here to 
ttm_bo_move_to_pinned().
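
Just to illustrate the shape, something like this (a rough sketch,
untested; everything except the open-coded caller comes from the patch
above):

static void ttm_bo_move_to_pinned(struct ttm_buffer_object *bo)
{
	struct ttm_device *bdev = bo->bdev;
	struct ttm_resource_manager *man = NULL;

	if (bo->resource)
		man = ttm_manager_type(bdev, bo->resource->mem_type);

	/* BOs leaving the LRU while still alive are parked on the
	 * pinned list, everything else is simply dropped. */
	if (man && man->use_tt)
		list_move_tail(&bo->lru, &man->pinned);
	else
		list_del_init(&bo->lru);

	if (bdev->funcs->del_from_lru_notify)
		bdev->funcs->del_from_lru_notify(bo);
}

ttm_bo_release() would then open-code list_del_init(&bo->lru) plus the
del_from_lru_notify() callback instead of passing a "final" flag around.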

Apart from that looks good to me.

Regards,
Christian.

> [snip remainder of patch]



* Re: [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case
  2021-08-24 21:01 ` [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
@ 2021-08-25  6:43   ` Christian König
  2021-08-25 15:36     ` Andrey Grodzovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-08-25  6:43 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx



Am 24.08.21 um 23:01 schrieb Andrey Grodzovsky:
> Handle all DMA IOMMU group related dependencies before the
> group is removed, so that we do not access it after it is freed.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    | 50 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h    |  1 +
>   3 files changed, 53 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 0b5764aa98a4..288a465b8101 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3860,6 +3860,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>   
>   	amdgpu_device_ip_fini_early(adev);
>   
> +	amdgpu_ttm_clear_dma_mappings(adev);
> +
>   	amdgpu_gart_dummy_page_fini(adev);
>   
>   	amdgpu_device_unmap_mmio(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 446943e32e3e..f73d807db3b0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -64,6 +64,7 @@
>   static int amdgpu_ttm_backend_bind(struct ttm_device *bdev,
>   				   struct ttm_tt *ttm,
>   				   struct ttm_resource *bo_mem);
> +
>   static void amdgpu_ttm_backend_unbind(struct ttm_device *bdev,
>   				      struct ttm_tt *ttm);
>   
> @@ -2293,6 +2294,55 @@ static ssize_t amdgpu_iomem_write(struct file *f, const char __user *buf,
>   	return result;
>   }
>   
> +void amdgpu_ttm_clear_dma_mappings(struct amdgpu_device *adev)

I strongly think that this function should be part of TTM. Something 
like ttm_device_force_unpopulate.
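
Roughly this shape, just as a sketch (the name is made up, and this
assumes the driver unbinds in its unpopulate callback; see also my
comment on ttm_tt_unpopulate below):

void ttm_device_force_unpopulate(struct ttm_device *bdev)
{
	struct ttm_resource_manager *man;
	struct ttm_buffer_object *bo;
	unsigned int i;

	spin_lock(&bdev->lru_lock);
	for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
		man = ttm_manager_type(bdev, i);
		if (!man || !man->use_tt)
			continue;

		while (!list_empty(&man->pinned)) {
			bo = list_first_entry(&man->pinned,
					      struct ttm_buffer_object, lru);
			/* hold a reference against racing releases */
			ttm_bo_get(bo);
			list_del_init(&bo->lru);
			spin_unlock(&bdev->lru_lock);

			if (bo->ttm)
				ttm_tt_unpopulate(bo->bdev, bo->ttm);

			ttm_bo_put(bo);
			spin_lock(&bdev->lru_lock);
		}
		/* ... plus the same loop over the man->lru[] lists ... */
	}
	spin_unlock(&bdev->lru_lock);
}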

> +{
> +	struct ttm_device *bdev = &adev->mman.bdev;
> +	struct ttm_resource_manager *man;
> +	struct ttm_buffer_object *bo;
> +	unsigned int i, j;
> +
> +	spin_lock(&bdev->lru_lock);
> +	for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
> +		man = ttm_manager_type(bdev, i);
> +		if (!man || !man->use_tt)
> +			continue;
> +
> +		while (!list_empty(&man->pinned)) {
> +			bo = list_first_entry(&man->pinned, struct ttm_buffer_object, lru);
> +			/* Take ref against racing releases once lru_lock is unlocked */
> +			ttm_bo_get(bo);
> +			list_del_init(&bo->lru);
> +			spin_unlock(&bdev->lru_lock);
> +
> +			if (bo->ttm) {
> +				amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
> +				ttm_tt_destroy_common(bo->bdev, bo->ttm);

Then you can also cleanly use ttm_tt_unpopulate here; the current 
approach results in incorrect statistics inside TTM atm.

Regards,
Christian.

> +			}
> +
> +			ttm_bo_put(bo);
> +			spin_lock(&bdev->lru_lock);
> +		}
> +
> +		for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
> +			while (!list_empty(&man->lru[j])) {
> +				bo = list_first_entry(&man->lru[j], struct ttm_buffer_object, lru);
> +				ttm_bo_get(bo);
> +				list_del_init(&bo->lru);
> +				spin_unlock(&bdev->lru_lock);
> +
> +				if (bo->ttm) {
> +					amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
> +					ttm_tt_destroy_common(bo->bdev, bo->ttm);
> +				}
> +				ttm_bo_put(bo);
> +				spin_lock(&bdev->lru_lock);
> +			}
> +		}
> +	}
> +	spin_unlock(&bdev->lru_lock);
> +
> +}
> +
>   static const struct file_operations amdgpu_ttm_iomem_fops = {
>   	.owner = THIS_MODULE,
>   	.read = amdgpu_iomem_read,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> index e69f3e8e06e5..02c8eac48a64 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
> @@ -190,6 +190,7 @@ bool amdgpu_ttm_tt_is_readonly(struct ttm_tt *ttm);
>   uint64_t amdgpu_ttm_tt_pde_flags(struct ttm_tt *ttm, struct ttm_resource *mem);
>   uint64_t amdgpu_ttm_tt_pte_flags(struct amdgpu_device *adev, struct ttm_tt *ttm,
>   				 struct ttm_resource *mem);
> +void amdgpu_ttm_clear_dma_mappings(struct amdgpu_device *adev);
>   
>   void amdgpu_ttm_debugfs_init(struct amdgpu_device *adev);
>   



* Re: [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case
  2021-08-25  6:43   ` Christian König
@ 2021-08-25 15:36     ` Andrey Grodzovsky
  2021-08-26 13:43       ` Andrey Grodzovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-25 15:36 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx


On 2021-08-25 2:43 a.m., Christian König wrote:
>
>
> Am 24.08.21 um 23:01 schrieb Andrey Grodzovsky:
>> Handle all DMA IOMMU group related dependencies before the
>> group is removed, so that we do not access it after it is freed.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    | 50 ++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h    |  1 +
>>   3 files changed, 53 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 0b5764aa98a4..288a465b8101 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3860,6 +3860,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device 
>> *adev)
>>         amdgpu_device_ip_fini_early(adev);
>>   +    amdgpu_ttm_clear_dma_mappings(adev);
>> +
>>       amdgpu_gart_dummy_page_fini(adev);
>>         amdgpu_device_unmap_mmio(adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 446943e32e3e..f73d807db3b0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -64,6 +64,7 @@
>>   static int amdgpu_ttm_backend_bind(struct ttm_device *bdev,
>>                      struct ttm_tt *ttm,
>>                      struct ttm_resource *bo_mem);
>> +
>>   static void amdgpu_ttm_backend_unbind(struct ttm_device *bdev,
>>                         struct ttm_tt *ttm);
>>   @@ -2293,6 +2294,55 @@ static ssize_t amdgpu_iomem_write(struct 
>> file *f, const char __user *buf,
>>       return result;
>>   }
>>   +void amdgpu_ttm_clear_dma_mappings(struct amdgpu_device *adev)
>
> I strongly think that this function should be part of TTM. Something 
> like ttm_device_force_unpopulate.


Yes, this is something I also wanted, but see below.


>
>> +{
>> +    struct ttm_device *bdev = &adev->mman.bdev;
>> +    struct ttm_resource_manager *man;
>> +    struct ttm_buffer_object *bo;
>> +    unsigned int i, j;
>> +
>> +    spin_lock(&bdev->lru_lock);
>> +    for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
>> +        man = ttm_manager_type(bdev, i);
>> +        if (!man || !man->use_tt)
>> +            continue;
>> +
>> +        while (!list_empty(&man->pinned)) {
>> +            bo = list_first_entry(&man->pinned, struct 
>> ttm_buffer_object, lru);
>> +            /* Take ref against racing releases once lru_lock is 
>> unlocked */
>> +            ttm_bo_get(bo);
>> +            list_del_init(&bo->lru);
>> +            spin_unlock(&bdev->lru_lock);
>> +
>> +            if (bo->ttm) {
>> +                amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);


amdgpu_ttm_backend_unbind needs to be called separately from
ttm_tt_unpopulate to take care of code flows that do their DMA mapping
through the GART bind and not through ttm_tt_populate. Since that code
is inside amdgpu, I had to place the entire function in amdgpu.
Any suggestions?
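
Roughly, the two flows I mean (simplified):

/*
 * Flow 1: pages are DMA-mapped when the tt is populated,
 *         ttm_tt_populate() -> driver populate callback,
 *         and unmapped again by ttm_tt_unpopulate().
 *
 * Flow 2: the mapping is created at GART bind time instead,
 *         amdgpu_ttm_backend_bind() (e.g. for userptr BOs),
 *         and only amdgpu_ttm_backend_unbind() undoes it.
 *
 * A generic TTM helper would only cover flow 1, which is why the
 * explicit amdgpu_ttm_backend_unbind() call is needed here.
 */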

Andrey


>> + ttm_tt_destroy_common(bo->bdev, bo->ttm);
>
> Then you can also cleanly use ttm_tt_unpopulate here; the current
> approach results in incorrect statistics inside TTM atm.
>
> Regards,
> Christian.
>
>> +            }
>> +
>> +            ttm_bo_put(bo);
>> +            spin_lock(&bdev->lru_lock);
>> +        }
>> +
>> +        for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>> +            while (!list_empty(&man->lru[j])) {
>> +                bo = list_first_entry(&man->lru[j], struct 
>> ttm_buffer_object, lru);
>> +                ttm_bo_get(bo);
>> +                list_del_init(&bo->lru);
>> +                spin_unlock(&bdev->lru_lock);
>> +
>> +                if (bo->ttm) {
>> +                    amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
>> +                    ttm_tt_destroy_common(bo->bdev, bo->ttm);
>> +                }
>> +                ttm_bo_put(bo);
>> +                spin_lock(&bdev->lru_lock);
>> +            }
>> +        }
>> +    }
>> +    spin_unlock(&bdev->lru_lock);
>> +
>> +}
>> +
>>   static const struct file_operations amdgpu_ttm_iomem_fops = {
>>       .owner = THIS_MODULE,
>>       .read = amdgpu_iomem_read,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> index e69f3e8e06e5..02c8eac48a64 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>> @@ -190,6 +190,7 @@ bool amdgpu_ttm_tt_is_readonly(struct ttm_tt *ttm);
>>   uint64_t amdgpu_ttm_tt_pde_flags(struct ttm_tt *ttm, struct 
>> ttm_resource *mem);
>>   uint64_t amdgpu_ttm_tt_pte_flags(struct amdgpu_device *adev, struct 
>> ttm_tt *ttm,
>>                    struct ttm_resource *mem);
>> +void amdgpu_ttm_clear_dma_mappings(struct amdgpu_device *adev);
>>     void amdgpu_ttm_debugfs_init(struct amdgpu_device *adev);
>


* Re: [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case
  2021-08-25 15:36     ` Andrey Grodzovsky
@ 2021-08-26 13:43       ` Andrey Grodzovsky
  2021-08-26 14:52         ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-08-26 13:43 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx

Ping

Andrey

On 2021-08-25 11:36 a.m., Andrey Grodzovsky wrote:
> [snip earlier discussion]


* Re: [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case
  2021-08-26 13:43       ` Andrey Grodzovsky
@ 2021-08-26 14:52         ` Christian König
  0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2021-08-26 14:52 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx

Am 26.08.21 um 15:43 schrieb Andrey Grodzovsky:
> Ping
>
> Andrey
>
> On 2021-08-25 11:36 a.m., Andrey Grodzovsky wrote:
>>
>> On 2021-08-25 2:43 a.m., Christian König wrote:
>>>
>>>
>>> Am 24.08.21 um 23:01 schrieb Andrey Grodzovsky:
>>>> [snip quoted patch]
>>>>   +void amdgpu_ttm_clear_dma_mappings(struct amdgpu_device *adev)
>>>
>>> I strongly think that this function should be part of TTM. Something 
>>> like ttm_device_force_unpopulate.
>>
>>
>> Yes, this is something I also wanted, but see below.
>>
>>
>>>> [snip quoted code]
>>
>>
>> amdgpu_ttm_backend_unbind needs to be called separately from
>> ttm_tt_unpopulate to take care of code flows that do their DMA mapping
>> through the GART bind and not through ttm_tt_populate. Since that code
>> is inside amdgpu, I had to place the entire function in amdgpu.
>> Any suggestions?

I think I've fixed exactly that just recently, see the patch here:

commit b7e8b086ffbc03b890ed22ae63ed5e5bd319d184
Author: Christian König <ckoenig.leichtzumerken@gmail.com>
Date:   Wed Jul 28 15:05:49 2021 +0200

     drm/amdgpu: unbind in amdgpu_ttm_tt_unpopulate

     Doing this in amdgpu_ttm_backend_destroy() is too late.

     It turned out that this is not a good idea at all because it leaves
     pointers to freed up system memory pages in the GART tables of the
     drivers.

But that probably hasn't shown up in amd-staging-drm-next yet.
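
The shape of that change is roughly this (a sketch, not the literal diff):

static void amdgpu_ttm_tt_unpopulate(struct ttm_device *bdev,
				     struct ttm_tt *ttm)
{
	/* Unbind first, so no GART entry keeps pointing at pages that
	 * are about to be freed; doing this only in
	 * amdgpu_ttm_backend_destroy() is too late. */
	amdgpu_ttm_backend_unbind(bdev, ttm);

	/* ... existing unpopulate logic ... */
}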

Christian.




Thread overview: 13+ messages
2021-08-24 21:01 [PATCH 0/4] Various fixes to pass libdrm hotunplug tests Andrey Grodzovsky
2021-08-24 21:01 ` [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini Andrey Grodzovsky
2021-08-25  1:41   ` Quan, Evan
2021-08-25  3:20     ` Andrey Grodzovsky
2021-08-25  3:56       ` Quan, Evan
2021-08-24 21:01 ` [PATCH 2/4] drm/ttm: Create pinned list Andrey Grodzovsky
2021-08-25  6:40   ` Christian König
2021-08-24 21:01 ` [PATCH 3/4] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
2021-08-25  6:43   ` Christian König
2021-08-25 15:36     ` Andrey Grodzovsky
2021-08-26 13:43       ` Andrey Grodzovsky
2021-08-26 14:52         ` Christian König
2021-08-24 21:01 ` [PATCH 4/4] drm/amdgpu: Add a UAPI flag for hot plug/unplug Andrey Grodzovsky
