AMD-GFX Archive on lore.kernel.org
* [PATCH v2 0/8] RFC Support hot device unplug in amdgpu
@ 2020-06-21  6:03 Andrey Grodzovsky
  2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
                   ` (8 more replies)
  0 siblings, 9 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

This RFC is more of a proof of concept than a fully working solution, as there are a few unresolved issues we are hoping to get advice on from people on the mailing list.
Until now, extracting a card either physically (e.g. an eGPU with a Thunderbolt connection) or through emulation via sysfs (/sys/bus/pci/devices/device_id/remove)
would cause random crashes in user apps. The random crashes were mostly due to an app that had mapped a device-backed BO into its address space still
trying to access the BO while the backing device was gone.
To address this first problem, Christian suggested fixing the handling of mapped memory in the clients when the device goes away by forcibly unmapping all buffers
the user processes hold, by clearing their respective VMAs mapping the device BOs. Then, when the VMAs try to fill in the page tables again, we check in the fault handler
whether the device has been removed and, if so, return an error. This generates a SIGBUS to the application, which can then cleanly terminate.
This was indeed done, but it in turn created a problem of kernel OOPSes: while the app was terminating because of the SIGBUS,
it would trigger a use-after-free in the driver by accessing device structures that had already been released in the PCI remove sequence.
This was handled by introducing a 'flush' sequence during device removal, where we wait for the drm file reference count to drop to 0, meaning all user clients directly using this device have terminated.
With this I was able to cleanly emulate device unplug with X and glxgears running, and later to emulate plugging the device back and restarting X and glxgears.

v2:
Based on discussions on the mailing list with Daniel and Pekka [1], and on the document produced by Pekka from those discussions [2], the whole approach of returning SIGBUS
and waiting for all user clients with CPU mappings of device BOs to die was dropped. Instead, as the document suggests, the device structures are kept alive until the last
reference to the device is dropped by a user client, and in the meantime all existing and new CPU mappings of BOs belonging to the device, owned directly or via dma-buf import, are rerouted
to a per user process dummy rw page.
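
A simplified sketch of the rerouting implemented by patch 2 in ttm_bo_vm_fault() (reservation, locking and error handling omitted; dummy_page stands for the per drm_file or per imported BO page added in patch 1):

	vm_fault_t fault_handler(struct vm_fault *vmf)
	{
		struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
		vm_fault_t ret;
		int idx;

		if (drm_dev_enter(bo->base.dev, &idx)) {
			/* Device still present - take the normal TTM fault path. */
			ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
						       TTM_BO_VM_NUM_PREFAULT);
			drm_dev_exit(idx);
			return ret;
		}

		/* Device is gone - install the dummy page instead of a device page. */
		get_page(dummy_page);
		vmf->page = dummy_page;
		return 0;
	}
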
Also, I skipped the 'Requirements for KMS UAPI' section of [2] since I am trying to get the minimal set of requirements that still gives a useful solution to work, which is the
'Requirements for Render and Cross-Device UAPI' section, and so my test case is removing a secondary device which is render only and not involved in KMS.
 
This iteration is still more of a draft, as I am still facing a few unsolved issues, such as a crash in the user client when trying to CPU map an imported BO if the map happens after the device was
removed, and a HW failure to plug back a removed device. Also, since I don't have a real-life setup with an external GPU connected through Thunderbolt, I am using sysfs to emulate PCI remove, and I
expect to encounter more issues once I try this on a real-life case. I am also expecting some help on this from a user who volunteered to test in the related gitlab ticket.
So basically this is mostly a way to get feedback on whether I am moving in the right direction.

[1] - Discussions during v1 of the patchset https://lists.freedesktop.org/archives/dri-devel/2020-May/265386.html
[2] - drm/doc: device hot-unplug for userspace https://www.spinics.net/lists/dri-devel/msg259755.html
[3] - Related gitlab ticket https://gitlab.freedesktop.org/drm/amd/-/issues/1081
 

Andrey Grodzovsky (8):
  drm: Add dummy page per device or GEM object
  drm/ttm: Remap all page faults to per process dummy page.
  drm/ttm: Add unmapping of the entire device address space
  drm/amdgpu: Split amdgpu_device_fini into early and late
  drm/amdgpu: Refactor sysfs removal
  drm/amdgpu: Unmap entire device address space on device remove.
  drm/amdgpu: Fix sdma code crash post device unplug
  drm/amdgpu: Prevent any job recoveries after device is unplugged.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 19 +++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 50 +++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c      | 23 ++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c      | 24 ++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h      |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c      |  8 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c      | 23 +++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c      |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c  | 21 ++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 +++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 +++++-
 drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +++--
 drivers/gpu/drm/drm_file.c                   |  8 ++++
 drivers/gpu/drm/drm_prime.c                  | 10 +++++
 drivers/gpu/drm/ttm/ttm_bo.c                 |  8 +++-
 drivers/gpu/drm/ttm/ttm_bo_vm.c              | 65 ++++++++++++++++++++++++----
 include/drm/drm_file.h                       |  2 +
 include/drm/drm_gem.h                        |  2 +
 include/drm/ttm/ttm_bo_driver.h              |  7 +++
 22 files changed, 286 insertions(+), 55 deletions(-)

-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:35   ` Daniel Vetter
  2020-06-22 13:18   ` Christian König
  2020-06-21  6:03 ` [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

Will be used to reroute CPU-mapped BOs' page faults once
the device is removed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/drm_file.c  |  8 ++++++++
 drivers/gpu/drm/drm_prime.c | 10 ++++++++++
 include/drm/drm_file.h      |  2 ++
 include/drm/drm_gem.h       |  2 ++
 4 files changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index c4c704e..67c0770 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -188,6 +188,12 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
 			goto out_prime_destroy;
 	}
 
+	file->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!file->dummy_page) {
+		ret = -ENOMEM;
+		goto out_prime_destroy;
+	}
+
 	return file;
 
 out_prime_destroy:
@@ -284,6 +290,8 @@ void drm_file_free(struct drm_file *file)
 	if (dev->driver->postclose)
 		dev->driver->postclose(dev, file);
 
+	__free_page(file->dummy_page);
+
 	drm_prime_destroy_file_private(&file->prime);
 
 	WARN_ON(!list_empty(&file->event_list));
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 1de2cde..c482e9c 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -335,6 +335,13 @@ int drm_gem_prime_fd_to_handle(struct drm_device *dev,
 
 	ret = drm_prime_add_buf_handle(&file_priv->prime,
 			dma_buf, *handle);
+
+	if (!ret) {
+		obj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!obj->dummy_page)
+			ret = -ENOMEM;
+	}
+
 	mutex_unlock(&file_priv->prime.lock);
 	if (ret)
 		goto fail;
@@ -1006,6 +1013,9 @@ void drm_prime_gem_destroy(struct drm_gem_object *obj, struct sg_table *sg)
 		dma_buf_unmap_attachment(attach, sg, DMA_BIDIRECTIONAL);
 	dma_buf = attach->dmabuf;
 	dma_buf_detach(attach->dmabuf, attach);
+
+	__free_page(obj->dummy_page);
+
 	/* remove the reference */
 	dma_buf_put(dma_buf);
 }
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index 19df802..349a658 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -335,6 +335,8 @@ struct drm_file {
 	 */
 	struct drm_prime_file_private prime;
 
+	struct page *dummy_page;
+
 	/* private: */
 #if IS_ENABLED(CONFIG_DRM_LEGACY)
 	unsigned long lock_count; /* DRI1 legacy lock count */
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index 0b37506..47460d1 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -310,6 +310,8 @@ struct drm_gem_object {
 	 *
 	 */
 	const struct drm_gem_object_funcs *funcs;
+
+	struct page *dummy_page;
 };
 
 /**
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page.
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
  2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:41   ` Daniel Vetter
  2020-06-22 19:30   ` Christian König
  2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unmapping of the entire device address space Andrey Grodzovsky
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

On device removal, reroute all CPU mappings to the dummy page kept per
drm_file instance or per imported GEM object.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 65 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 389128b..2f8bf5e 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -35,6 +35,8 @@
 #include <drm/ttm/ttm_bo_driver.h>
 #include <drm/ttm/ttm_placement.h>
 #include <drm/drm_vma_manager.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_file.h>
 #include <linux/mm.h>
 #include <linux/pfn_t.h>
 #include <linux/rbtree.h>
@@ -328,19 +330,66 @@ vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
 	pgprot_t prot;
 	struct ttm_buffer_object *bo = vma->vm_private_data;
 	vm_fault_t ret;
+	int idx;
+	struct drm_device *ddev = bo->base.dev;
 
-	ret = ttm_bo_vm_reserve(bo, vmf);
-	if (ret)
-		return ret;
+	if (drm_dev_enter(ddev, &idx)) {
+		ret = ttm_bo_vm_reserve(bo, vmf);
+		if (ret)
+			goto exit;
+
+		prot = vma->vm_page_prot;
 
-	prot = vma->vm_page_prot;
-	ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
-	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
+		ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
+		if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
+			goto exit;
+
+		dma_resv_unlock(bo->base.resv);
+
+exit:
+		drm_dev_exit(idx);
 		return ret;
+	} else {
 
-	dma_resv_unlock(bo->base.resv);
+		struct drm_file *file = NULL;
+		struct page *dummy_page = NULL;
+		int handle;
 
-	return ret;
+		/* We are faulting on imported BO from dma_buf */
+		if (bo->base.dma_buf && bo->base.import_attach) {
+			dummy_page = bo->base.dummy_page;
+		/* We are faulting on non imported BO, find drm_file owning the BO*/
+		} else {
+			struct drm_gem_object *gobj;
+
+			mutex_lock(&ddev->filelist_mutex);
+			list_for_each_entry(file, &ddev->filelist, lhead) {
+				spin_lock(&file->table_lock);
+				idr_for_each_entry(&file->object_idr, gobj, handle) {
+					if (gobj == &bo->base) {
+						dummy_page = file->dummy_page;
+						break;
+					}
+				}
+				spin_unlock(&file->table_lock);
+			}
+			mutex_unlock(&ddev->filelist_mutex);
+		}
+
+		if (dummy_page) {
+			/*
+			 * Let do_fault complete the PTE install e.t.c using vmf->page
+			 *
+			 * TODO - should i call free_page somewhere ?
+			 */
+			get_page(dummy_page);
+			vmf->page = dummy_page;
+			return 0;
+		} else {
+			return VM_FAULT_SIGSEGV;
+		}
+	}
 }
 EXPORT_SYMBOL(ttm_bo_vm_fault);
 
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 3/8] drm/ttm: Add unmapping of the entire device address space
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
  2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
  2020-06-21  6:03 ` [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:45   ` Daniel Vetter
                     ` (2 more replies)
  2020-06-21  6:03 ` [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

Helper function to invalidate all BOs' CPU mappings
once the device is removed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c    | 8 ++++++--
 include/drm/ttm/ttm_bo_driver.h | 7 +++++++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index c5b516f..926a365 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1750,10 +1750,14 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
 	ttm_bo_unmap_virtual_locked(bo);
 	ttm_mem_io_unlock(man);
 }
-
-
 EXPORT_SYMBOL(ttm_bo_unmap_virtual);
 
+void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
+{
+	unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);
+}
+EXPORT_SYMBOL(ttm_bo_unmap_virtual_address_space);
+
 int ttm_bo_wait(struct ttm_buffer_object *bo,
 		bool interruptible, bool no_wait)
 {
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index c9e0fd0..39ea44f 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -601,6 +601,13 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev,
 void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo);
 
 /**
+ * ttm_bo_unmap_virtual_address_space
+ *
+ * @bdev: tear down all the virtual mappings for this device
+ */
+void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev);
+
+/**
  * ttm_bo_unmap_virtual
  *
  * @bo: tear down the virtual mappings for this BO
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unmapping of the entire device address space Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:48   ` Daniel Vetter
  2020-06-21  6:03 ` [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal Andrey Grodzovsky
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

Some of the work in amdgpu_device_fini, such as disabling HW interrupts
and finalizing pending fences, must be done right away on pci_remove,
while most of the work related to finalizing and releasing driver data
structures can be deferred until the drm_driver.release hook is called,
i.e. when the last device reference is dropped.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  6 ++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 24 +++++++++++++++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 23 +++++++++++++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  3 +++
 7 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 2a806cb..604a681 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1003,7 +1003,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		       struct drm_device *ddev,
 		       struct pci_dev *pdev,
 		       uint32_t flags);
-void amdgpu_device_fini(struct amdgpu_device *adev);
+void amdgpu_device_fini_early(struct amdgpu_device *adev);
+void amdgpu_device_fini_late(struct amdgpu_device *adev);
+
 int amdgpu_gpu_wait_for_idle(struct amdgpu_device *adev);
 
 void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
@@ -1188,6 +1190,8 @@ void amdgpu_driver_lastclose_kms(struct drm_device *dev);
 int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv);
 void amdgpu_driver_postclose_kms(struct drm_device *dev,
 				 struct drm_file *file_priv);
+void amdgpu_driver_release_kms(struct drm_device *dev);
+
 int amdgpu_device_ip_suspend(struct amdgpu_device *adev);
 int amdgpu_device_suspend(struct drm_device *dev, bool fbcon);
 int amdgpu_device_resume(struct drm_device *dev, bool fbcon);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cc41e8f..e7b9065 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2309,6 +2309,8 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 {
 	int i, r;
 
+	//DRM_ERROR("adev 0x%llx", (long long unsigned int)adev);
+
 	amdgpu_ras_pre_fini(adev);
 
 	if (adev->gmc.xgmi.num_physical_nodes > 1)
@@ -3304,10 +3306,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
  * Tear down the driver info (all asics).
  * Called at driver shutdown.
  */
-void amdgpu_device_fini(struct amdgpu_device *adev)
+void amdgpu_device_fini_early(struct amdgpu_device *adev)
 {
-	int r;
-
 	DRM_INFO("amdgpu: finishing device.\n");
 	flush_delayed_work(&adev->delayed_init_work);
 	adev->shutdown = true;
@@ -3330,7 +3330,13 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 	if (adev->pm_sysfs_en)
 		amdgpu_pm_sysfs_fini(adev);
 	amdgpu_fbdev_fini(adev);
-	r = amdgpu_device_ip_fini(adev);
+
+	amdgpu_irq_fini_early(adev);
+}
+
+void amdgpu_device_fini_late(struct amdgpu_device *adev)
+{
+	amdgpu_device_ip_fini(adev);
 	if (adev->firmware.gpu_info_fw) {
 		release_firmware(adev->firmware.gpu_info_fw);
 		adev->firmware.gpu_info_fw = NULL;
@@ -3368,6 +3374,7 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 		amdgpu_pmu_fini(adev);
 	if (amdgpu_discovery && adev->asic_type >= CHIP_NAVI10)
 		amdgpu_discovery_fini(adev);
+
 }
 
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 9e5afa5..43592dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1134,12 +1134,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 
-#ifdef MODULE
-	if (THIS_MODULE->state != MODULE_STATE_GOING)
-#endif
-		DRM_ERROR("Hotplug removal is not supported\n");
 	drm_dev_unplug(dev);
 	amdgpu_driver_unload_kms(dev);
+
 	pci_disable_device(pdev);
 	pci_set_drvdata(pdev, NULL);
 	drm_dev_put(dev);
@@ -1445,6 +1442,7 @@ static struct drm_driver kms_driver = {
 	.dumb_create = amdgpu_mode_dumb_create,
 	.dumb_map_offset = amdgpu_mode_dumb_mmap,
 	.fops = &amdgpu_driver_kms_fops,
+	.release = &amdgpu_driver_release_kms,
 
 	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
 	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 0cc4c67..1697655 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -49,6 +49,7 @@
 #include <drm/drm_irq.h>
 #include <drm/drm_vblank.h>
 #include <drm/amdgpu_drm.h>
+#include <drm/drm_drv.h>
 #include "amdgpu.h"
 #include "amdgpu_ih.h"
 #include "atom.h"
@@ -297,6 +298,20 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
 	return 0;
 }
 
+
+void amdgpu_irq_fini_early(struct amdgpu_device *adev)
+{
+	if (adev->irq.installed) {
+		drm_irq_uninstall(adev->ddev);
+		adev->irq.installed = false;
+		if (adev->irq.msi_enabled)
+			pci_free_irq_vectors(adev->pdev);
+
+		if (!amdgpu_device_has_dc_support(adev))
+			flush_work(&adev->hotplug_work);
+	}
+}
+
 /**
  * amdgpu_irq_fini - shut down interrupt handling
  *
@@ -310,15 +325,6 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
 {
 	unsigned i, j;
 
-	if (adev->irq.installed) {
-		drm_irq_uninstall(adev->ddev);
-		adev->irq.installed = false;
-		if (adev->irq.msi_enabled)
-			pci_free_irq_vectors(adev->pdev);
-		if (!amdgpu_device_has_dc_support(adev))
-			flush_work(&adev->hotplug_work);
-	}
-
 	for (i = 0; i < AMDGPU_IRQ_CLIENTID_MAX; ++i) {
 		if (!adev->irq.client[i].sources)
 			continue;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
index c718e94..718c70f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
@@ -104,6 +104,7 @@ irqreturn_t amdgpu_irq_handler(int irq, void *arg);
 
 int amdgpu_irq_init(struct amdgpu_device *adev);
 void amdgpu_irq_fini(struct amdgpu_device *adev);
+void amdgpu_irq_fini_early(struct amdgpu_device *adev);
 int amdgpu_irq_add_id(struct amdgpu_device *adev,
 		      unsigned client_id, unsigned src_id,
 		      struct amdgpu_irq_src *source);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index c0b1904..9d0af22 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -29,6 +29,7 @@
 #include "amdgpu.h"
 #include <drm/drm_debugfs.h>
 #include <drm/amdgpu_drm.h>
+#include <drm/drm_drv.h>
 #include "amdgpu_sched.h"
 #include "amdgpu_uvd.h"
 #include "amdgpu_vce.h"
@@ -86,7 +87,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
 	amdgpu_unregister_gpu_instance(adev);
 
 	if (adev->rmmio == NULL)
-		goto done_free;
+		return;
 
 	if (adev->runpm) {
 		pm_runtime_get_sync(dev->dev);
@@ -95,11 +96,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
 
 	amdgpu_acpi_fini(adev);
 
-	amdgpu_device_fini(adev);
-
-done_free:
-	kfree(adev);
-	dev->dev_private = NULL;
+	amdgpu_device_fini_early(adev);
 }
 
 void amdgpu_register_gpu_instance(struct amdgpu_device *adev)
@@ -1108,6 +1105,20 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,
 	pm_runtime_put_autosuspend(dev->dev);
 }
 
+
+void amdgpu_driver_release_kms (struct drm_device *dev)
+{
+	struct amdgpu_device *adev = dev->dev_private;
+
+	amdgpu_device_fini_late(adev);
+
+	kfree(adev);
+	dev->dev_private = NULL;
+
+	drm_dev_fini(dev);
+	kfree(dev);
+}
+
 /*
  * VBlank related functions.
  */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7348619..169c2239 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2056,9 +2056,12 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 
+	//DRM_ERROR("adev 0x%llx", (long long unsigned int)adev);
+
 	if (!con)
 		return 0;
 
+
 	/* Need disable ras on all IPs here before ip [hw/sw]fini */
 	amdgpu_ras_disable_all_features(adev, 0);
 	amdgpu_ras_recovery_fini(adev);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2020-06-21  6:03 ` [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:51   ` Daniel Vetter
  2020-06-22 13:19   ` Christian König
  2020-06-21  6:03 ` [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove Andrey Grodzovsky
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

Track sysfs files in a list so they can all be removed during pci remove;
otherwise removing them later crashes because the parent folder was
already removed during pci remove.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 13 +++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 ++++++++++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 ++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 ++++++++++-
 drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +++++---
 8 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 604a681..ba3775f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -726,6 +726,15 @@ struct amd_powerplay {
 
 #define AMDGPU_RESET_MAGIC_NUM 64
 #define AMDGPU_MAX_DF_PERFMONS 4
+
+struct amdgpu_sysfs_list_node {
+	struct list_head head;
+	struct device_attribute *attr;
+};
+
+#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
+	struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
+
 struct amdgpu_device {
 	struct device			*dev;
 	struct drm_device		*ddev;
@@ -992,6 +1001,10 @@ struct amdgpu_device {
 	char				product_number[16];
 	char				product_name[32];
 	char				serial[16];
+
+	struct list_head sysfs_files_list;
+	struct mutex	 sysfs_files_list_lock;
+
 };
 
 static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
index fdd52d8..c1549ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
@@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev,
 	return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
 }
 
+
 static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
 		   NULL);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
 
 /**
  * amdgpu_atombios_fini - free the driver info and callbacks for atombios
@@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
 	adev->mode_info.atom_context = NULL;
 	kfree(adev->mode_info.atom_card_info);
 	adev->mode_info.atom_card_info = NULL;
-	device_remove_file(adev->dev, &dev_attr_vbios_version);
 }
 
 /**
@@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
 		return ret;
 	}
 
+	mutex_lock(&adev->sysfs_files_list_lock);
+	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
+	mutex_unlock(&adev->sysfs_files_list_lock);
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e7b9065..3173046 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = {
 	NULL
 };
 
+static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
+
+
 /**
  * amdgpu_device_init - initialize the driver
  *
@@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	INIT_LIST_HEAD(&adev->shadow_list);
 	mutex_init(&adev->shadow_list_lock);
 
+	INIT_LIST_HEAD(&adev->sysfs_files_list);
+	mutex_init(&adev->sysfs_files_list_lock);
+
 	INIT_DELAYED_WORK(&adev->delayed_init_work,
 			  amdgpu_device_delayed_init_work_handler);
 	INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
@@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r) {
 		dev_err(adev->dev, "Could not create amdgpu device attr\n");
 		return r;
+	} else {
+		mutex_lock(&adev->sysfs_files_list_lock);
+		list_add_tail(&dev_attr_handle_product_name.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_product_number.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_serial_number.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_pcie_replay_count.head, &adev->sysfs_files_list);
+		mutex_unlock(&adev->sysfs_files_list_lock);
 	}
 
 	if (IS_ENABLED(CONFIG_PERF_EVENTS))
@@ -3298,6 +3314,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	return r;
 }
 
+static void amdgpu_sysfs_remove_files(struct amdgpu_device *adev)
+{
+	struct amdgpu_sysfs_list_node *node;
+
+	mutex_lock(&adev->sysfs_files_list_lock);
+	list_for_each_entry(node, &adev->sysfs_files_list, head)
+		device_remove_file(adev->dev, node->attr);
+	mutex_unlock(&adev->sysfs_files_list_lock);
+}
+
 /**
  * amdgpu_device_fini - tear down the driver
  *
@@ -3332,6 +3358,11 @@ void amdgpu_device_fini_early(struct amdgpu_device *adev)
 	amdgpu_fbdev_fini(adev);
 
 	amdgpu_irq_fini_early(adev);
+
+	amdgpu_sysfs_remove_files(adev);
+
+	if (adev->ucode_sysfs_en)
+		amdgpu_ucode_sysfs_fini(adev);
 }
 
 void amdgpu_device_fini_late(struct amdgpu_device *adev)
@@ -3366,10 +3397,6 @@ void amdgpu_device_fini_late(struct amdgpu_device *adev)
 	adev->rmmio = NULL;
 	amdgpu_device_doorbell_fini(adev);
 
-	if (adev->ucode_sysfs_en)
-		amdgpu_ucode_sysfs_fini(adev);
-
-	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
 	if (IS_ENABLED(CONFIG_PERF_EVENTS))
 		amdgpu_pmu_fini(adev);
 	if (amdgpu_discovery && adev->asic_type >= CHIP_NAVI10)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index 6271044..e7b6c4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -76,6 +76,9 @@ static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
 static DEVICE_ATTR(mem_info_gtt_used, S_IRUGO,
 	           amdgpu_mem_info_gtt_used_show, NULL);
 
+static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_total);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_used);
+
 /**
  * amdgpu_gtt_mgr_init - init GTT manager and DRM MM
  *
@@ -114,6 +117,11 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
 		return ret;
 	}
 
+	mutex_lock(&adev->sysfs_files_list_lock);
+	list_add_tail(&dev_attr_handle_mem_info_gtt_total.head, &adev->sysfs_files_list);
+	list_add_tail(&dev_attr_handle_mem_info_gtt_used.head, &adev->sysfs_files_list);
+	mutex_unlock(&adev->sysfs_files_list_lock);
+
 	return 0;
 }
 
@@ -127,7 +135,6 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
  */
 static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
 {
-	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
 	struct amdgpu_gtt_mgr *mgr = man->priv;
 	spin_lock(&mgr->lock);
 	drm_mm_takedown(&mgr->mm);
@@ -135,9 +142,6 @@ static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
 	kfree(mgr);
 	man->priv = NULL;
 
-	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_total);
-	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_used);
-
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index ddb4af0c..554fec0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2216,6 +2216,8 @@ static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
 		   psp_usbc_pd_fw_sysfs_read,
 		   psp_usbc_pd_fw_sysfs_write);
 
+static AMDGPU_DEVICE_ATTR_LIST_NODE(usbc_pd_fw);
+
 
 
 const struct amd_ip_funcs psp_ip_funcs = {
@@ -2242,13 +2244,17 @@ static int psp_sysfs_init(struct amdgpu_device *adev)
 
 	if (ret)
 		DRM_ERROR("Failed to create USBC PD FW control file!");
+	else {
+		mutex_lock(&adev->sysfs_files_list_lock);
+		list_add_tail(&dev_attr_handle_usbc_pd_fw.head, &adev->sysfs_files_list);
+		mutex_unlock(&adev->sysfs_files_list_lock);
+	}
 
 	return ret;
 }
 
 static void psp_sysfs_fini(struct amdgpu_device *adev)
 {
-	device_remove_file(adev->dev, &dev_attr_usbc_pd_fw);
 }
 
 const struct amdgpu_ip_block_version psp_v3_1_ip_block =
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 7723937..39c400c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -148,6 +148,12 @@ static DEVICE_ATTR(mem_info_vis_vram_used, S_IRUGO,
 static DEVICE_ATTR(mem_info_vram_vendor, S_IRUGO,
 		   amdgpu_mem_info_vram_vendor, NULL);
 
+static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_total);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_total);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_used);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_used);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_vendor);
+
 static const struct attribute *amdgpu_vram_mgr_attributes[] = {
 	&dev_attr_mem_info_vram_total.attr,
 	&dev_attr_mem_info_vis_vram_total.attr,
@@ -184,6 +190,15 @@ static int amdgpu_vram_mgr_init(struct ttm_mem_type_manager *man,
 	ret = sysfs_create_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
 	if (ret)
 		DRM_ERROR("Failed to register sysfs\n");
+	else {
+		mutex_lock(&adev->sysfs_files_list_lock);
+		list_add_tail(&dev_attr_handle_mem_info_vram_total.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_mem_info_vis_vram_total.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_mem_info_vram_used.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_mem_info_vis_vram_used.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_mem_info_vram_vendor.head, &adev->sysfs_files_list);
+		mutex_unlock(&adev->sysfs_files_list_lock);
+	}
 
 	return 0;
 }
@@ -198,7 +213,6 @@ static int amdgpu_vram_mgr_init(struct ttm_mem_type_manager *man,
  */
 static int amdgpu_vram_mgr_fini(struct ttm_mem_type_manager *man)
 {
-	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
 	struct amdgpu_vram_mgr *mgr = man->priv;
 
 	spin_lock(&mgr->lock);
@@ -206,7 +220,6 @@ static int amdgpu_vram_mgr_fini(struct ttm_mem_type_manager *man)
 	spin_unlock(&mgr->lock);
 	kfree(mgr);
 	man->priv = NULL;
-	sysfs_remove_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 90610b4..455eaa4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -272,6 +272,9 @@ static ssize_t amdgpu_xgmi_show_error(struct device *dev,
 static DEVICE_ATTR(xgmi_device_id, S_IRUGO, amdgpu_xgmi_show_device_id, NULL);
 static DEVICE_ATTR(xgmi_error, S_IRUGO, amdgpu_xgmi_show_error, NULL);
 
+static AMDGPU_DEVICE_ATTR_LIST_NODE(xgmi_device_id);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(xgmi_error);
+
 static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
 					 struct amdgpu_hive_info *hive)
 {
@@ -285,10 +288,19 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
 		return ret;
 	}
 
+	mutex_lock(&adev->sysfs_files_list_lock);
+	list_add_tail(&dev_attr_handle_xgmi_device_id.head, &adev->sysfs_files_list);
+	mutex_unlock(&adev->sysfs_files_list_lock);
+
 	/* Create xgmi error file */
 	ret = device_create_file(adev->dev, &dev_attr_xgmi_error);
 	if (ret)
 		pr_err("failed to create xgmi_error\n");
+	else {
+		mutex_lock(&adev->sysfs_files_list_lock);
+		list_add_tail(&dev_attr_handle_xgmi_error.head, &adev->sysfs_files_list);
+		mutex_unlock(&adev->sysfs_files_list_lock);
+	}
 
 
 	/* Create sysfs link to hive info folder on the first device */
@@ -325,7 +337,6 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
 static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev,
 					  struct amdgpu_hive_info *hive)
 {
-	device_remove_file(adev->dev, &dev_attr_xgmi_device_id);
 	sysfs_remove_link(&adev->dev->kobj, adev->ddev->unique);
 	sysfs_remove_link(hive->kobj, adev->ddev->unique);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
index a7b8292..f95b0b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
+++ b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
@@ -265,6 +265,8 @@ static ssize_t df_v3_6_get_df_cntr_avail(struct device *dev,
 /* device attr for available perfmon counters */
 static DEVICE_ATTR(df_cntr_avail, S_IRUGO, df_v3_6_get_df_cntr_avail, NULL);
 
+static AMDGPU_DEVICE_ATTR_LIST_NODE(df_cntr_avail);
+
 static void df_v3_6_query_hashes(struct amdgpu_device *adev)
 {
 	u32 tmp;
@@ -299,6 +301,11 @@ static void df_v3_6_sw_init(struct amdgpu_device *adev)
 	ret = device_create_file(adev->dev, &dev_attr_df_cntr_avail);
 	if (ret)
 		DRM_ERROR("failed to create file for available df counters\n");
+	else {
+		mutex_lock(&adev->sysfs_files_list_lock);
+		list_add_tail(&dev_attr_handle_df_cntr_avail.head, &adev->sysfs_files_list);
+		mutex_unlock(&adev->sysfs_files_list_lock);
+	}
 
 	for (i = 0; i < AMDGPU_MAX_DF_PERFMONS; i++)
 		adev->df_perfmon_config_assign_mask[i] = 0;
@@ -308,9 +315,6 @@ static void df_v3_6_sw_init(struct amdgpu_device *adev)
 
 static void df_v3_6_sw_fini(struct amdgpu_device *adev)
 {
-
-	device_remove_file(adev->dev, &dev_attr_df_cntr_avail);
-
 }
 
 static void df_v3_6_enable_broadcast_mode(struct amdgpu_device *adev,
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove.
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2020-06-21  6:03 ` [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:56   ` Daniel Vetter
  2020-06-22 19:38   ` Christian König
  2020-06-21  6:03 ` [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug Andrey Grodzovsky
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

Use the new TTM interface to invalidate all existing BO CPU mappings
from all user processes.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 43592dc..6932d75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1135,6 +1135,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 
 	drm_dev_unplug(dev);
+	ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
 	amdgpu_driver_unload_kms(dev);
 
 	pci_disable_device(pdev);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2020-06-21  6:03 ` [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:55   ` Daniel Vetter
  2020-06-22 19:40   ` Christian König
  2020-06-21  6:03 ` [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
  2020-06-22  9:46 ` [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Daniel Vetter
  8 siblings, 2 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

entity->rq becomes NULL after the device is unplugged, so just return
early in that case.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 8d9c6fe..d252427 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -24,6 +24,7 @@
 #include "amdgpu_job.h"
 #include "amdgpu_object.h"
 #include "amdgpu_trace.h"
+#include <drm/drm_drv.h>
 
 #define AMDGPU_VM_SDMA_MIN_NUM_DW	256u
 #define AMDGPU_VM_SDMA_MAX_NUM_DW	(16u * 1024u)
@@ -94,7 +95,12 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	struct drm_sched_entity *entity;
 	struct amdgpu_ring *ring;
 	struct dma_fence *f;
-	int r;
+	int r, idx;
+
+	if (!drm_dev_enter(p->adev->ddev, &idx)) {
+		r = -ENODEV;
+		goto nodev;
+	}
 
 	entity = p->immediate ? &p->vm->immediate : &p->vm->delayed;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
@@ -104,7 +110,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	WARN_ON(ib->length_dw > p->num_dw_left);
 	r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, &f);
 	if (r)
-		goto error;
+		goto job_fail;
 
 	if (p->unlocked) {
 		struct dma_fence *tmp = dma_fence_get(f);
@@ -118,10 +124,15 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	if (fence && !p->immediate)
 		swap(*fence, f);
 	dma_fence_put(f);
-	return 0;
 
-error:
-	amdgpu_job_free(p->job);
+	r = 0;
+
+job_fail:
+	drm_dev_exit(idx);
+nodev:
+	if (r)
+		amdgpu_job_free(p->job);
+
 	return r;
 }
 
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (6 preceding siblings ...)
  2020-06-21  6:03 ` [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug Andrey Grodzovsky
@ 2020-06-21  6:03 ` Andrey Grodzovsky
  2020-06-22  9:53   ` Daniel Vetter
  2020-06-22  9:46 ` [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Daniel Vetter
  8 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-21  6:03 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: Andrey Grodzovsky, daniel.vetter, michel, ppaalanen,
	ckoenig.leichtzumerken, alexdeucher

No point in trying recovery if the device is gone; it just messes things up.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 ++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 ++++++++
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 	return ret;
 }
 
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+	int i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		cancel_delayed_work_sync(&ring->sched.work_tdr);
+	}
+}
+
 static void
 amdgpu_pci_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = dev->dev_private;
 
 	drm_dev_unplug(dev);
+	amdgpu_cancel_all_tdr(adev);
 	ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
 	amdgpu_driver_unload_kms(dev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4720718..87ff0c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,6 +28,8 @@
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
 
+#include <drm/drm_drv.h>
+
 static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 
 	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 
+	if (drm_dev_is_unplugged(adev->ddev)) {
+		DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
+					  s_job->sched->name);
+		return;
+	}
+
 	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
 		DRM_ERROR("ring %s timeout, but soft recovered\n",
 			  s_job->sched->name);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
@ 2020-06-22  9:35   ` Daniel Vetter
  2020-06-22 14:21     ` Pekka Paalanen
  2020-06-22 13:18   ` Christian König
  1 sibling, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:35 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:01AM -0400, Andrey Grodzovsky wrote:
> Will be used to reroute CPU-mapped BOs' page faults once
> the device is removed.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/drm_file.c  |  8 ++++++++
>  drivers/gpu/drm/drm_prime.c | 10 ++++++++++
>  include/drm/drm_file.h      |  2 ++
>  include/drm/drm_gem.h       |  2 ++
>  4 files changed, 22 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index c4c704e..67c0770 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -188,6 +188,12 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
>  			goto out_prime_destroy;
>  	}
>  
> +	file->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!file->dummy_page) {
> +		ret = -ENOMEM;
> +		goto out_prime_destroy;
> +	}
> +
>  	return file;
>  
>  out_prime_destroy:
> @@ -284,6 +290,8 @@ void drm_file_free(struct drm_file *file)
>  	if (dev->driver->postclose)
>  		dev->driver->postclose(dev, file);
>  
> +	__free_page(file->dummy_page);
> +
>  	drm_prime_destroy_file_private(&file->prime);
>  
>  	WARN_ON(!list_empty(&file->event_list));
> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> index 1de2cde..c482e9c 100644
> --- a/drivers/gpu/drm/drm_prime.c
> +++ b/drivers/gpu/drm/drm_prime.c
> @@ -335,6 +335,13 @@ int drm_gem_prime_fd_to_handle(struct drm_device *dev,
>  
>  	ret = drm_prime_add_buf_handle(&file_priv->prime,
>  			dma_buf, *handle);
> +
> +	if (!ret) {
> +		obj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +		if (!obj->dummy_page)
> +			ret = -ENOMEM;
> +	}
> +
>  	mutex_unlock(&file_priv->prime.lock);
>  	if (ret)
>  		goto fail;
> @@ -1006,6 +1013,9 @@ void drm_prime_gem_destroy(struct drm_gem_object *obj, struct sg_table *sg)
>  		dma_buf_unmap_attachment(attach, sg, DMA_BIDIRECTIONAL);
>  	dma_buf = attach->dmabuf;
>  	dma_buf_detach(attach->dmabuf, attach);
> +
> +	__free_page(obj->dummy_page);
> +
>  	/* remove the reference */
>  	dma_buf_put(dma_buf);
>  }
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index 19df802..349a658 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -335,6 +335,8 @@ struct drm_file {
>  	 */
>  	struct drm_prime_file_private prime;
>  

Kerneldoc for these please, including why we need them and when. E.g. the
one in gem_bo should say it's only for exported buffers, so that we're not
colliding security spaces.
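Rough sketch of the kind of comment meant here (wording is only a suggestion):

	/**
	 * @dummy_page:
	 *
	 * Fallback page that CPU mappings get rerouted to once the backing
	 * device has been removed, so user space keeps a valid mapping
	 * instead of faulting. For &drm_gem_object this should only be
	 * allocated for buffers shared via PRIME, so we don't collide
	 * security spaces.
	 */
	struct page *dummy_page;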

> +	struct page *dummy_page;
> +
>  	/* private: */
>  #if IS_ENABLED(CONFIG_DRM_LEGACY)
>  	unsigned long lock_count; /* DRI1 legacy lock count */
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index 0b37506..47460d1 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -310,6 +310,8 @@ struct drm_gem_object {
>  	 *
>  	 */
>  	const struct drm_gem_object_funcs *funcs;
> +
> +	struct page *dummy_page;
>  };

I think amdgpu doesn't care, but everyone else still might care somewhat
about flink. That also shares buffers, so also needs to allocate the
per-bo dummy page.

I also wonder whether we shouldn't have a helper to look up the dummy
page, just to encode in core code how it's supposed to cascade.
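Rough sketch of what such a helper could look like (name invented here):

	static inline struct page *
	drm_gem_get_dummy_page(struct drm_gem_object *obj, struct drm_file *file)
	{
		/* Imported objects carry their own dummy page ... */
		if (obj->dma_buf && obj->import_attach)
			return obj->dummy_page;

		/* ... everything else falls back to the per-file page. */
		return file->dummy_page;
	}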
-Daniel

>  
>  /**
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page.
  2020-06-21  6:03 ` [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
@ 2020-06-22  9:41   ` Daniel Vetter
  2020-06-24  3:31     ` Andrey Grodzovsky
  2020-06-22 19:30   ` Christian König
  1 sibling, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:41 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:02AM -0400, Andrey Grodzovsky wrote:
> On device removal, reroute all CPU mappings to the dummy page kept per
> drm_file instance or per imported GEM object.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 65 ++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 57 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 389128b..2f8bf5e 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -35,6 +35,8 @@
>  #include <drm/ttm/ttm_bo_driver.h>
>  #include <drm/ttm/ttm_placement.h>
>  #include <drm/drm_vma_manager.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_file.h>
>  #include <linux/mm.h>
>  #include <linux/pfn_t.h>
>  #include <linux/rbtree.h>
> @@ -328,19 +330,66 @@ vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)

Hm I think diff and code flow look a bit bad now. What about renaming the
current function to __ttm_bo_vm_fault and then having something like the
below:

ttm_bo_vm_fault(args) {

	if (drm_dev_enter()) {
		ret = __ttm_bo_vm_fault(args);
		drm_dev_exit();
	} else {
		ret = drm_gem_insert_dummy_pfn();
	}

	return ret;
}

I think drm_gem_insert_dummy_pfn(); should be portable across drivers, so
another nice point to try to unify drivers as much as possible.
-Daniel

>  	pgprot_t prot;
>  	struct ttm_buffer_object *bo = vma->vm_private_data;
>  	vm_fault_t ret;
> +	int idx;
> +	struct drm_device *ddev = bo->base.dev;
>  
> -	ret = ttm_bo_vm_reserve(bo, vmf);
> -	if (ret)
> -		return ret;
> +	if (drm_dev_enter(ddev, &idx)) {
> +		ret = ttm_bo_vm_reserve(bo, vmf);
> +		if (ret)
> +			goto exit;
> +
> +		prot = vma->vm_page_prot;
>  
> -	prot = vma->vm_page_prot;
> -	ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
> -	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> +		ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
> +		if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> +			goto exit;
> +
> +		dma_resv_unlock(bo->base.resv);
> +
> +exit:
> +		drm_dev_exit(idx);
>  		return ret;
> +	} else {
>  
> -	dma_resv_unlock(bo->base.resv);
> +		struct drm_file *file = NULL;
> +		struct page *dummy_page = NULL;
> +		int handle;
>  
> -	return ret;
> +		/* We are faulting on imported BO from dma_buf */
> +		if (bo->base.dma_buf && bo->base.import_attach) {
> +			dummy_page = bo->base.dummy_page;
> +		/* We are faulting on non imported BO, find drm_file owning the BO*/

Uh, we can't fish that out of the vma->vm_file pointer somehow? Or is that
one all wrong? Doing this kind of list walk looks pretty horrible.

If the vma doesn't have the right pointer I guess next option is that we
store the drm_file page in gem_bo->dummy_page, and replace it on first
export. But that's going to be tricky to track ...

> +		} else {
> +			struct drm_gem_object *gobj;
> +
> +			mutex_lock(&ddev->filelist_mutex);
> +			list_for_each_entry(file, &ddev->filelist, lhead) {
> +				spin_lock(&file->table_lock);
> +				idr_for_each_entry(&file->object_idr, gobj, handle) {
> +					if (gobj == &bo->base) {
> +						dummy_page = file->dummy_page;
> +						break;
> +					}
> +				}
> +				spin_unlock(&file->table_lock);
> +			}
> +			mutex_unlock(&ddev->filelist_mutex);
> +		}
> +
> +		if (dummy_page) {
> +			/*
> +			 * Let do_fault complete the PTE install e.t.c using vmf->page
> +			 *
> +			 * TODO - should i call free_page somewhere ?

Nah, instead don't call get_page. The page will be around as long as
there's a reference for the drm_file or gem_bo, which is longer than any
mmap. Otherwise yes this would leak really badly.

> +			 */
> +			get_page(dummy_page);
> +			vmf->page = dummy_page;
> +			return 0;
> +		} else {
> +			return VM_FAULT_SIGSEGV;

Hm that would be a kernel bug, wouldn't it? WARN_ON() required here imo.
-Daniel

> +		}
> +	}
>  }
>  EXPORT_SYMBOL(ttm_bo_vm_fault);
>  
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] drm/ttm: Add unmapping of the entire device address space
  2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unmapping of the entire device address space Andrey Grodzovsky
@ 2020-06-22  9:45   ` Daniel Vetter
  2020-06-23  5:00     ` Andrey Grodzovsky
  2020-06-22 19:37   ` Christian König
  2020-06-22 19:47   ` Alex Deucher
  2 siblings, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:45 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:03AM -0400, Andrey Grodzovsky wrote:
> Helper function to invalidate all BOs' CPU mappings
> once the device is removed.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

This seems to be missing the code to invalidate all the dma-buf mmaps?

Probably needs more testcases if you're not yet catching this. Or am I
missing something, and we're exchanging the address space also for
dma-buf?
-Daniel
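
For illustration only, a minimal sketch of what a separate invalidation
could look like, assuming the dma-buf mmaps live on the dma_buf file's
own address_space rather than on bdev->dev_mapping:

	/* per exported/imported BO, in addition to zapping
	 * bdev->dev_mapping for the regular GEM mmaps */
	if (bo->base.dma_buf)
		unmap_mapping_range(bo->base.dma_buf->file->f_mapping,
				    0, 0, 1);

Whether that is actually needed depends on which address_space the
dma-buf mmap path installs the PTEs under, which is exactly the open
question above.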

> ---
>  drivers/gpu/drm/ttm/ttm_bo.c    | 8 ++++++--
>  include/drm/ttm/ttm_bo_driver.h | 7 +++++++
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index c5b516f..926a365 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -1750,10 +1750,14 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
>  	ttm_bo_unmap_virtual_locked(bo);
>  	ttm_mem_io_unlock(man);
>  }
> -
> -
>  EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>  
> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
> +{
> +	unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);
> +}
> +EXPORT_SYMBOL(ttm_bo_unmap_virtual_address_space);
> +
>  int ttm_bo_wait(struct ttm_buffer_object *bo,
>  		bool interruptible, bool no_wait)
>  {
> diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
> index c9e0fd0..39ea44f 100644
> --- a/include/drm/ttm/ttm_bo_driver.h
> +++ b/include/drm/ttm/ttm_bo_driver.h
> @@ -601,6 +601,13 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev,
>  void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo);
>  
>  /**
> + * ttm_bo_unmap_virtual_address_space
> + *
> + * @bdev: tear down all the virtual mappings for this device
> + */
> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev);
> +
> +/**
>   * ttm_bo_unmap_virtual
>   *
>   * @bo: tear down the virtual mappings for this BO
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 0/8] RFC Support hot device unplug in amdgpu
  2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
                   ` (7 preceding siblings ...)
  2020-06-21  6:03 ` [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
@ 2020-06-22  9:46 ` Daniel Vetter
  2020-06-23  5:14   ` Andrey Grodzovsky
  8 siblings, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:46 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:00AM -0400, Andrey Grodzovsky wrote:
> This RFC is more of a proof of concept then a fully working solution as there are a few unresolved issues we are hoping to get advise on from people on the mailing list.
> Until now extracting a card either by physical extraction (e.g. eGPU with thunderbolt connection or by emulation through syfs -> /sys/bus/pci/devices/device_id/remove)
> would cause random crashes in user apps. The random crashes in apps were mostly due to the app having mapped a device backed BO into its address space was still
> trying to access the BO while the backing device was gone.
> To answer this first problem Christian suggested to fix the handling of mapped memory in the clients when the device goes away by forcibly unmap all buffers
> the user processes has by clearing their respective VMAs mapping the device BOs. Then when the VMAs try to fill in the page tables again we check in the fault handler
> if the device is removed and if so, return an error. This will generate a SIGBUS to the application which can then cleanly terminate.
> This indeed was done but this in turn created a problem of kernel OOPs were the OOPSes were due to the fact that while the app was terminating because of the SIGBUS
> it would trigger use after free in the driver by calling to accesses device structures that were already released from the pci remove sequence.
> This was handled by introducing a 'flush' sequence during device removal were we wait for drm file reference to drop to 0 meaning all user clients directly using this device terminated.
> With this I was able to cleanly emulate device unplug with X and glxgears running and later emulate device plug back and restart of X and glxgears.
> 
> v2:
> Based on discussions in the mailing list with Daniel and Pekka [1] and based on the document produced by Pekka from those discussions [2] the whole approach with returning SIGBUS
> and waiting for all user clients having CPU mapping of device BOs to die was dropped. Instead as per the document suggestion the device structures are kept alive until the last
> reference to the device is dropped by user client and in the meanwhile all existing and new CPU mappings of the BOs belonging to the device directly or by dma-buf import are rerouted
> to per user process dummy rw page.
> Also, I skipped the 'Requirements for KMS UAPI' section of [2] since i am trying to get the minimal set of requiremnts that still give useful solution to work and this is the
> 'Requirements for Render and Cross-Device UAPI' section and so my test case is removing a secondary device, which is render only and is not involved in KMS.
>  
> This iteration is still more of a draft as I am still facing a few unsolved issues such as a crash in user client when trying to CPU map imported BO if the map happens after device was
> removed and HW failure to plug back a removed device. Also since i don't have real life setup with external GPU connected through TB I am using sysfs to emulate pci remove and i
> expect to encounter more issues once i try this on real life case. I am also expecting some help on this from a user who volunteered to test in the related gitlab ticket.
> So basically this is more of a  way to get feedback if I am moving in the right direction.
> 
> [1] - Discussions during v1 of the patchset https://lists.freedesktop.org/archives/dri-devel/2020-May/265386.html
> [2] - drm/doc: device hot-unplug for userspace https://www.spinics.net/lists/dri-devel/msg259755.html
> [3] - Related gitlab ticket https://gitlab.freedesktop.org/drm/amd/-/issues/1081

A few high-level comments on the generic parts; I didn't really look at
the amdgpu side yet.

Also a nit: Please tell your mailer to break long lines, it looks funny
and inconsistent otherwise, at least in some of the mailers I use here :-/
-Daniel
>  
> 
> Andrey Grodzovsky (8):
>   drm: Add dummy page per device or GEM object
>   drm/ttm: Remap all page faults to per process dummy page.
>   drm/ttm: Add unmapping of the entire device address space
>   drm/amdgpu: Split amdgpu_device_fini into early and late
>   drm/amdgpu: Refactor sysfs removal
>   drm/amdgpu: Unmap entire device address space on device remove.
>   drm/amdgpu: Fix sdma code crash post device unplug
>   drm/amdgpu: Prevent any job recoveries after device is unplugged.
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 19 +++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 50 +++++++++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c      | 23 ++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c      | 24 ++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h      |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c      |  8 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c      | 23 +++++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c      |  3 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c  | 21 ++++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 +++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 +++++-
>  drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +++--
>  drivers/gpu/drm/drm_file.c                   |  8 ++++
>  drivers/gpu/drm/drm_prime.c                  | 10 +++++
>  drivers/gpu/drm/ttm/ttm_bo.c                 |  8 +++-
>  drivers/gpu/drm/ttm/ttm_bo_vm.c              | 65 ++++++++++++++++++++++++----
>  include/drm/drm_file.h                       |  2 +
>  include/drm/drm_gem.h                        |  2 +
>  include/drm/ttm/ttm_bo_driver.h              |  7 +++
>  22 files changed, 286 insertions(+), 55 deletions(-)
> 
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late
  2020-06-21  6:03 ` [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
@ 2020-06-22  9:48   ` Daniel Vetter
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:48 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:04AM -0400, Andrey Grodzovsky wrote:
> Some of the stuff in amdgpu_device_fini such as HW interrupts
> disable and pending fences finalization must be done right away on
> pci_remove while most of the stuff which relates to finalizing and releasing
> driver data structures can be kept until drm_driver.release hook is called, i.e.
> when the last device reference is dropped.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Long term I think best if as much of this code is converted over to devm
(for hw stuff) and drmm (for sw stuff and allocations). Doing this all
manually is very error prone.

I've started various such patches and others followed, but thus far only
very simple drivers tackled. But it should be doable step by step at
least, so you should have incremental benefits in code complexity right
away I hope.
-Daniel
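
A rough sketch of that direction, with made-up names (drmm_* is from
<drm/drm_managed.h>; the release action runs automatically when the last
drm_device reference is dropped instead of from a hand-written fini
path):

	static void amdgpu_example_fini_action(struct drm_device *dev,
					       void *data)
	{
		/* sw teardown, runs on final drm_dev_put() */
	}

	static int amdgpu_example_init(struct amdgpu_device *adev)
	{
		void *table;

		/* freed automatically together with the drm_device */
		table = drmm_kzalloc(adev->ddev, PAGE_SIZE, GFP_KERNEL);
		if (!table)
			return -ENOMEM;

		return drmm_add_action_or_reset(adev->ddev,
						amdgpu_example_fini_action,
						table);
	}

Hardware-side teardown (mmio unmaps, irqs, ...) would use the equivalent
devm_* helpers against the pci device instead.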

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  6 +++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  6 ++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 24 +++++++++++++++---------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h    |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 23 +++++++++++++++++------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  3 +++
>  7 files changed, 54 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 2a806cb..604a681 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1003,7 +1003,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  		       struct drm_device *ddev,
>  		       struct pci_dev *pdev,
>  		       uint32_t flags);
> -void amdgpu_device_fini(struct amdgpu_device *adev);
> +void amdgpu_device_fini_early(struct amdgpu_device *adev);
> +void amdgpu_device_fini_late(struct amdgpu_device *adev);
> +
>  int amdgpu_gpu_wait_for_idle(struct amdgpu_device *adev);
>  
>  void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,
> @@ -1188,6 +1190,8 @@ void amdgpu_driver_lastclose_kms(struct drm_device *dev);
>  int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv);
>  void amdgpu_driver_postclose_kms(struct drm_device *dev,
>  				 struct drm_file *file_priv);
> +void amdgpu_driver_release_kms(struct drm_device *dev);
> +
>  int amdgpu_device_ip_suspend(struct amdgpu_device *adev);
>  int amdgpu_device_suspend(struct drm_device *dev, bool fbcon);
>  int amdgpu_device_resume(struct drm_device *dev, bool fbcon);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cc41e8f..e7b9065 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2309,6 +2309,8 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>  {
>  	int i, r;
>  
> +	//DRM_ERROR("adev 0x%llx", (long long unsigned int)adev);
> +
>  	amdgpu_ras_pre_fini(adev);
>  
>  	if (adev->gmc.xgmi.num_physical_nodes > 1)
> @@ -3304,10 +3306,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   * Tear down the driver info (all asics).
>   * Called at driver shutdown.
>   */
> -void amdgpu_device_fini(struct amdgpu_device *adev)
> +void amdgpu_device_fini_early(struct amdgpu_device *adev)
>  {
> -	int r;
> -
>  	DRM_INFO("amdgpu: finishing device.\n");
>  	flush_delayed_work(&adev->delayed_init_work);
>  	adev->shutdown = true;
> @@ -3330,7 +3330,13 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>  	if (adev->pm_sysfs_en)
>  		amdgpu_pm_sysfs_fini(adev);
>  	amdgpu_fbdev_fini(adev);
> -	r = amdgpu_device_ip_fini(adev);
> +
> +	amdgpu_irq_fini_early(adev);
> +}
> +
> +void amdgpu_device_fini_late(struct amdgpu_device *adev)
> +{
> +	amdgpu_device_ip_fini(adev);
>  	if (adev->firmware.gpu_info_fw) {
>  		release_firmware(adev->firmware.gpu_info_fw);
>  		adev->firmware.gpu_info_fw = NULL;
> @@ -3368,6 +3374,7 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>  		amdgpu_pmu_fini(adev);
>  	if (amdgpu_discovery && adev->asic_type >= CHIP_NAVI10)
>  		amdgpu_discovery_fini(adev);
> +
>  }
>  
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 9e5afa5..43592dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1134,12 +1134,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  
> -#ifdef MODULE
> -	if (THIS_MODULE->state != MODULE_STATE_GOING)
> -#endif
> -		DRM_ERROR("Hotplug removal is not supported\n");
>  	drm_dev_unplug(dev);
>  	amdgpu_driver_unload_kms(dev);
> +
>  	pci_disable_device(pdev);
>  	pci_set_drvdata(pdev, NULL);
>  	drm_dev_put(dev);
> @@ -1445,6 +1442,7 @@ static struct drm_driver kms_driver = {
>  	.dumb_create = amdgpu_mode_dumb_create,
>  	.dumb_map_offset = amdgpu_mode_dumb_mmap,
>  	.fops = &amdgpu_driver_kms_fops,
> +	.release = &amdgpu_driver_release_kms,
>  
>  	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
>  	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index 0cc4c67..1697655 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -49,6 +49,7 @@
>  #include <drm/drm_irq.h>
>  #include <drm/drm_vblank.h>
>  #include <drm/amdgpu_drm.h>
> +#include <drm/drm_drv.h>
>  #include "amdgpu.h"
>  #include "amdgpu_ih.h"
>  #include "atom.h"
> @@ -297,6 +298,20 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
>  	return 0;
>  }
>  
> +
> +void amdgpu_irq_fini_early(struct amdgpu_device *adev)
> +{
> +	if (adev->irq.installed) {
> +		drm_irq_uninstall(adev->ddev);
> +		adev->irq.installed = false;
> +		if (adev->irq.msi_enabled)
> +			pci_free_irq_vectors(adev->pdev);
> +
> +		if (!amdgpu_device_has_dc_support(adev))
> +			flush_work(&adev->hotplug_work);
> +	}
> +}
> +
>  /**
>   * amdgpu_irq_fini - shut down interrupt handling
>   *
> @@ -310,15 +325,6 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
>  {
>  	unsigned i, j;
>  
> -	if (adev->irq.installed) {
> -		drm_irq_uninstall(adev->ddev);
> -		adev->irq.installed = false;
> -		if (adev->irq.msi_enabled)
> -			pci_free_irq_vectors(adev->pdev);
> -		if (!amdgpu_device_has_dc_support(adev))
> -			flush_work(&adev->hotplug_work);
> -	}
> -
>  	for (i = 0; i < AMDGPU_IRQ_CLIENTID_MAX; ++i) {
>  		if (!adev->irq.client[i].sources)
>  			continue;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
> index c718e94..718c70f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
> @@ -104,6 +104,7 @@ irqreturn_t amdgpu_irq_handler(int irq, void *arg);
>  
>  int amdgpu_irq_init(struct amdgpu_device *adev);
>  void amdgpu_irq_fini(struct amdgpu_device *adev);
> +void amdgpu_irq_fini_early(struct amdgpu_device *adev);
>  int amdgpu_irq_add_id(struct amdgpu_device *adev,
>  		      unsigned client_id, unsigned src_id,
>  		      struct amdgpu_irq_src *source);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index c0b1904..9d0af22 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -29,6 +29,7 @@
>  #include "amdgpu.h"
>  #include <drm/drm_debugfs.h>
>  #include <drm/amdgpu_drm.h>
> +#include <drm/drm_drv.h>
>  #include "amdgpu_sched.h"
>  #include "amdgpu_uvd.h"
>  #include "amdgpu_vce.h"
> @@ -86,7 +87,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
>  	amdgpu_unregister_gpu_instance(adev);
>  
>  	if (adev->rmmio == NULL)
> -		goto done_free;
> +		return;
>  
>  	if (adev->runpm) {
>  		pm_runtime_get_sync(dev->dev);
> @@ -95,11 +96,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
>  
>  	amdgpu_acpi_fini(adev);
>  
> -	amdgpu_device_fini(adev);
> -
> -done_free:
> -	kfree(adev);
> -	dev->dev_private = NULL;
> +	amdgpu_device_fini_early(adev);
>  }
>  
>  void amdgpu_register_gpu_instance(struct amdgpu_device *adev)
> @@ -1108,6 +1105,20 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,
>  	pm_runtime_put_autosuspend(dev->dev);
>  }
>  
> +
> +void amdgpu_driver_release_kms (struct drm_device *dev)
> +{
> +	struct amdgpu_device *adev = dev->dev_private;
> +
> +	amdgpu_device_fini_late(adev);
> +
> +	kfree(adev);
> +	dev->dev_private = NULL;
> +
> +	drm_dev_fini(dev);
> +	kfree(dev);
> +}
> +
>  /*
>   * VBlank related functions.
>   */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 7348619..169c2239 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2056,9 +2056,12 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  
> +	//DRM_ERROR("adev 0x%llx", (long long unsigned int)adev);
> +
>  	if (!con)
>  		return 0;
>  
> +
>  	/* Need disable ras on all IPs here before ip [hw/sw]fini */
>  	amdgpu_ras_disable_all_features(adev, 0);
>  	amdgpu_ras_recovery_fini(adev);
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-21  6:03 ` [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal Andrey Grodzovsky
@ 2020-06-22  9:51   ` Daniel Vetter
  2020-06-22 11:21     ` Greg KH
  2020-06-22 13:19   ` Christian König
  1 sibling, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:51 UTC (permalink / raw)
  To: Andrey Grodzovsky, Greg KH
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> Track sysfs files in a list so they all can be removed during pci remove,
> since otherwise their removal after that causes a crash because the parent
> folder was already removed during pci remove.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Uh I thought sysfs just gets yanked completely. Please check with Greg KH
whether hand-rolling all this really is the right solution here ... Feels
very wrong. I thought this was all supposed to work by adding attributes
before publishing the sysfs node, and then letting sysfs clean up
everything. Not by cleaning up manually yourself.

Adding Greg for an authoritative answer.
-Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 13 +++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 ++++++++++++++++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 ++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 ++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 ++++++++++-
>  drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +++++---
>  8 files changed, 99 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 604a681..ba3775f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -726,6 +726,15 @@ struct amd_powerplay {
>  
>  #define AMDGPU_RESET_MAGIC_NUM 64
>  #define AMDGPU_MAX_DF_PERFMONS 4
> +
> +struct amdgpu_sysfs_list_node {
> +	struct list_head head;
> +	struct device_attribute *attr;
> +};
> +
> +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
> +	struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
> +
>  struct amdgpu_device {
>  	struct device			*dev;
>  	struct drm_device		*ddev;
> @@ -992,6 +1001,10 @@ struct amdgpu_device {
>  	char				product_number[16];
>  	char				product_name[32];
>  	char				serial[16];
> +
> +	struct list_head sysfs_files_list;
> +	struct mutex	 sysfs_files_list_lock;
> +
>  };
>  
>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> index fdd52d8..c1549ee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> @@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev,
>  	return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
>  }
>  
> +
>  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
>  		   NULL);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
>  
>  /**
>   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
> @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
>  	adev->mode_info.atom_context = NULL;
>  	kfree(adev->mode_info.atom_card_info);
>  	adev->mode_info.atom_card_info = NULL;
> -	device_remove_file(adev->dev, &dev_attr_vbios_version);
>  }
>  
>  /**
> @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
>  		return ret;
>  	}
>  
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +
>  	return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e7b9065..3173046 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>  	NULL
>  };
>  
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
> +
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	INIT_LIST_HEAD(&adev->shadow_list);
>  	mutex_init(&adev->shadow_list_lock);
>  
> +	INIT_LIST_HEAD(&adev->sysfs_files_list);
> +	mutex_init(&adev->sysfs_files_list_lock);
> +
>  	INIT_DELAYED_WORK(&adev->delayed_init_work,
>  			  amdgpu_device_delayed_init_work_handler);
>  	INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
> @@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	if (r) {
>  		dev_err(adev->dev, "Could not create amdgpu device attr\n");
>  		return r;
> +	} else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_product_name.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_product_number.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_serial_number.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_pcie_replay_count.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
>  	}
>  
>  	if (IS_ENABLED(CONFIG_PERF_EVENTS))
> @@ -3298,6 +3314,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	return r;
>  }
>  
> +static void amdgpu_sysfs_remove_files(struct amdgpu_device *adev)
> +{
> +	struct amdgpu_sysfs_list_node *node;
> +
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_for_each_entry(node, &adev->sysfs_files_list, head)
> +		device_remove_file(adev->dev, node->attr);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +}
> +
>  /**
>   * amdgpu_device_fini - tear down the driver
>   *
> @@ -3332,6 +3358,11 @@ void amdgpu_device_fini_early(struct amdgpu_device *adev)
>  	amdgpu_fbdev_fini(adev);
>  
>  	amdgpu_irq_fini_early(adev);
> +
> +	amdgpu_sysfs_remove_files(adev);
> +
> +	if (adev->ucode_sysfs_en)
> +		amdgpu_ucode_sysfs_fini(adev);
>  }
>  
>  void amdgpu_device_fini_late(struct amdgpu_device *adev)
> @@ -3366,10 +3397,6 @@ void amdgpu_device_fini_late(struct amdgpu_device *adev)
>  	adev->rmmio = NULL;
>  	amdgpu_device_doorbell_fini(adev);
>  
> -	if (adev->ucode_sysfs_en)
> -		amdgpu_ucode_sysfs_fini(adev);
> -
> -	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
>  	if (IS_ENABLED(CONFIG_PERF_EVENTS))
>  		amdgpu_pmu_fini(adev);
>  	if (amdgpu_discovery && adev->asic_type >= CHIP_NAVI10)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> index 6271044..e7b6c4a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> @@ -76,6 +76,9 @@ static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
>  static DEVICE_ATTR(mem_info_gtt_used, S_IRUGO,
>  	           amdgpu_mem_info_gtt_used_show, NULL);
>  
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_total);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_used);
> +
>  /**
>   * amdgpu_gtt_mgr_init - init GTT manager and DRM MM
>   *
> @@ -114,6 +117,11 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
>  		return ret;
>  	}
>  
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_add_tail(&dev_attr_handle_mem_info_gtt_total.head, &adev->sysfs_files_list);
> +	list_add_tail(&dev_attr_handle_mem_info_gtt_used.head, &adev->sysfs_files_list);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +
>  	return 0;
>  }
>  
> @@ -127,7 +135,6 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
>   */
>  static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
>  {
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
>  	struct amdgpu_gtt_mgr *mgr = man->priv;
>  	spin_lock(&mgr->lock);
>  	drm_mm_takedown(&mgr->mm);
> @@ -135,9 +142,6 @@ static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
>  	kfree(mgr);
>  	man->priv = NULL;
>  
> -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_total);
> -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_used);
> -
>  	return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index ddb4af0c..554fec0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -2216,6 +2216,8 @@ static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
>  		   psp_usbc_pd_fw_sysfs_read,
>  		   psp_usbc_pd_fw_sysfs_write);
>  
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(usbc_pd_fw);
> +
>  
>  
>  const struct amd_ip_funcs psp_ip_funcs = {
> @@ -2242,13 +2244,17 @@ static int psp_sysfs_init(struct amdgpu_device *adev)
>  
>  	if (ret)
>  		DRM_ERROR("Failed to create USBC PD FW control file!");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_usbc_pd_fw.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>  
>  	return ret;
>  }
>  
>  static void psp_sysfs_fini(struct amdgpu_device *adev)
>  {
> -	device_remove_file(adev->dev, &dev_attr_usbc_pd_fw);
>  }
>  
>  const struct amdgpu_ip_block_version psp_v3_1_ip_block =
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 7723937..39c400c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -148,6 +148,12 @@ static DEVICE_ATTR(mem_info_vis_vram_used, S_IRUGO,
>  static DEVICE_ATTR(mem_info_vram_vendor, S_IRUGO,
>  		   amdgpu_mem_info_vram_vendor, NULL);
>  
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_total);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_total);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_used);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_used);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_vendor);
> +
>  static const struct attribute *amdgpu_vram_mgr_attributes[] = {
>  	&dev_attr_mem_info_vram_total.attr,
>  	&dev_attr_mem_info_vis_vram_total.attr,
> @@ -184,6 +190,15 @@ static int amdgpu_vram_mgr_init(struct ttm_mem_type_manager *man,
>  	ret = sysfs_create_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
>  	if (ret)
>  		DRM_ERROR("Failed to register sysfs\n");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_mem_info_vram_total.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vis_vram_total.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vram_used.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vis_vram_used.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vram_vendor.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>  
>  	return 0;
>  }
> @@ -198,7 +213,6 @@ static int amdgpu_vram_mgr_init(struct ttm_mem_type_manager *man,
>   */
>  static int amdgpu_vram_mgr_fini(struct ttm_mem_type_manager *man)
>  {
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
>  	struct amdgpu_vram_mgr *mgr = man->priv;
>  
>  	spin_lock(&mgr->lock);
> @@ -206,7 +220,6 @@ static int amdgpu_vram_mgr_fini(struct ttm_mem_type_manager *man)
>  	spin_unlock(&mgr->lock);
>  	kfree(mgr);
>  	man->priv = NULL;
> -	sysfs_remove_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
>  	return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 90610b4..455eaa4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -272,6 +272,9 @@ static ssize_t amdgpu_xgmi_show_error(struct device *dev,
>  static DEVICE_ATTR(xgmi_device_id, S_IRUGO, amdgpu_xgmi_show_device_id, NULL);
>  static DEVICE_ATTR(xgmi_error, S_IRUGO, amdgpu_xgmi_show_error, NULL);
>  
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(xgmi_device_id);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(xgmi_error);
> +
>  static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
>  					 struct amdgpu_hive_info *hive)
>  {
> @@ -285,10 +288,19 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
>  		return ret;
>  	}
>  
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_add_tail(&dev_attr_handle_xgmi_device_id.head, &adev->sysfs_files_list);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +
>  	/* Create xgmi error file */
>  	ret = device_create_file(adev->dev, &dev_attr_xgmi_error);
>  	if (ret)
>  		pr_err("failed to create xgmi_error\n");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_xgmi_error.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>  
>  
>  	/* Create sysfs link to hive info folder on the first device */
> @@ -325,7 +337,6 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
>  static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev,
>  					  struct amdgpu_hive_info *hive)
>  {
> -	device_remove_file(adev->dev, &dev_attr_xgmi_device_id);
>  	sysfs_remove_link(&adev->dev->kobj, adev->ddev->unique);
>  	sysfs_remove_link(hive->kobj, adev->ddev->unique);
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> index a7b8292..f95b0b2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> +++ b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> @@ -265,6 +265,8 @@ static ssize_t df_v3_6_get_df_cntr_avail(struct device *dev,
>  /* device attr for available perfmon counters */
>  static DEVICE_ATTR(df_cntr_avail, S_IRUGO, df_v3_6_get_df_cntr_avail, NULL);
>  
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(df_cntr_avail);
> +
>  static void df_v3_6_query_hashes(struct amdgpu_device *adev)
>  {
>  	u32 tmp;
> @@ -299,6 +301,11 @@ static void df_v3_6_sw_init(struct amdgpu_device *adev)
>  	ret = device_create_file(adev->dev, &dev_attr_df_cntr_avail);
>  	if (ret)
>  		DRM_ERROR("failed to create file for available df counters\n");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_df_cntr_avail.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>  
>  	for (i = 0; i < AMDGPU_MAX_DF_PERFMONS; i++)
>  		adev->df_perfmon_config_assign_mask[i] = 0;
> @@ -308,9 +315,6 @@ static void df_v3_6_sw_init(struct amdgpu_device *adev)
>  
>  static void df_v3_6_sw_fini(struct amdgpu_device *adev)
>  {
> -
> -	device_remove_file(adev->dev, &dev_attr_df_cntr_avail);
> -
>  }
>  
>  static void df_v3_6_enable_broadcast_mode(struct amdgpu_device *adev,
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.
  2020-06-21  6:03 ` [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
@ 2020-06-22  9:53   ` Daniel Vetter
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:53 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> No point in trying recovery if the device is gone, it just messes things up.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 ++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 ++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 6932d75..5d6d3d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>  	return ret;
>  }
>  
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> +	int i;
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
> +	}
> +}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or your
split in one of the earlier patches not done quite right.
-Daniel
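
A sketch of what a scheduler-level helper could look like (hypothetical
name, wrapping the scheduler's existing work_tdr):

	/* drivers/gpu/drm/scheduler/sched_main.c */
	void drm_sched_cancel_pending_timeout(struct drm_gpu_scheduler *sched)
	{
		cancel_delayed_work_sync(&sched->work_tdr);
	}
	EXPORT_SYMBOL(drm_sched_cancel_pending_timeout);

amdgpu would then only keep the loop over its rings and call the helper
per scheduler, instead of reaching into sched.work_tdr directly.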

> +
>  static void
>  amdgpu_pci_remove(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
> +	struct amdgpu_device *adev = dev->dev_private;
>  
>  	drm_dev_unplug(dev);
> +	amdgpu_cancel_all_tdr(adev);
>  	ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>  	amdgpu_driver_unload_kms(dev);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 4720718..87ff0c0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -28,6 +28,8 @@
>  #include "amdgpu.h"
>  #include "amdgpu_trace.h"
>  
> +#include <drm/drm_drv.h>
> +
>  static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>  {
>  	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> @@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>  
>  	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>  
> +	if (drm_dev_is_unplugged(adev->ddev)) {
> +		DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
> +					  s_job->sched->name);
> +		return;
> +	}
> +
>  	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>  		DRM_ERROR("ring %s timeout, but soft recovered\n",
>  			  s_job->sched->name);
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug
  2020-06-21  6:03 ` [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug Andrey Grodzovsky
@ 2020-06-22  9:55   ` Daniel Vetter
  2020-06-22 19:40   ` Christian König
  1 sibling, 0 replies; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:55 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:07AM -0400, Andrey Grodzovsky wrote:
> entity->rq becomes NULL after the device is unplugged, so just return early
> in that case.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

That looks very deep in amdgpu internals ... how do you even get in here
after the device is fully unplugged on the sw side?

Is this amdkfd doing something stupid because it's entirely unaware of
what amdgpu has done? Something else? This feels like duct-taping over a
more fundamental problem: after hotunplug no one should be able to even
submit anything new, or do bo moves, or well anything really.
-Daniel
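
As an illustration of guarding higher up instead (sketch only, and
amdgpu_cs_ioctl_locked is a made-up split of the existing ioctl), the
submission path could bail out before any of the VM/SDMA code runs:

	int amdgpu_cs_ioctl(struct drm_device *dev, void *data,
			    struct drm_file *filp)
	{
		int idx, r;

		if (!drm_dev_enter(dev, &idx))
			return -ENODEV;

		r = amdgpu_cs_ioctl_locked(dev, data, filp);

		drm_dev_exit(idx);
		return r;
	}

That wouldn't cover kernel-internal users like amdkfd, which is why the
question above matters.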

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 8d9c6fe..d252427 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -24,6 +24,7 @@
>  #include "amdgpu_job.h"
>  #include "amdgpu_object.h"
>  #include "amdgpu_trace.h"
> +#include <drm/drm_drv.h>
>  
>  #define AMDGPU_VM_SDMA_MIN_NUM_DW	256u
>  #define AMDGPU_VM_SDMA_MAX_NUM_DW	(16u * 1024u)
> @@ -94,7 +95,12 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  	struct drm_sched_entity *entity;
>  	struct amdgpu_ring *ring;
>  	struct dma_fence *f;
> -	int r;
> +	int r, idx;
> +
> +	if (!drm_dev_enter(p->adev->ddev, &idx)) {
> +		r = -ENODEV;
> +		goto nodev;
> +	}
>  
>  	entity = p->immediate ? &p->vm->immediate : &p->vm->delayed;
>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
> @@ -104,7 +110,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  	WARN_ON(ib->length_dw > p->num_dw_left);
>  	r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, &f);
>  	if (r)
> -		goto error;
> +		goto job_fail;
>  
>  	if (p->unlocked) {
>  		struct dma_fence *tmp = dma_fence_get(f);
> @@ -118,10 +124,15 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  	if (fence && !p->immediate)
>  		swap(*fence, f);
>  	dma_fence_put(f);
> -	return 0;
>  
> -error:
> -	amdgpu_job_free(p->job);
> +	r = 0;
> +
> +job_fail:
> +	drm_dev_exit(idx);
> +nodev:
> +	if (r)
> +		amdgpu_job_free(p->job);
> +
>  	return r;
>  }
>  
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove.
  2020-06-21  6:03 ` [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove Andrey Grodzovsky
@ 2020-06-22  9:56   ` Daniel Vetter
  2020-06-22 19:38   ` Christian König
  1 sibling, 0 replies; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22  9:56 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Sun, Jun 21, 2020 at 02:03:06AM -0400, Andrey Grodzovsky wrote:
> Use the new TTM interface to invalidate all existing BO CPU mappings
> from all user processes.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 43592dc..6932d75 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1135,6 +1135,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  
>  	drm_dev_unplug(dev);
> +	ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>  	amdgpu_driver_unload_kms(dev);

Hm a ttm, or maybe even vram helper function which wraps drm_dev_unplug +
ttm unmapping into one would be nice I think? I suspect there's going to
be more in the future here.
-Daniel
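
A sketch of such a wrapper (hypothetical name and placement, reusing the
helper added in patch 3/8):

	void drm_ttm_dev_unplug(struct drm_device *dev,
				struct ttm_bo_device *bdev)
	{
		drm_dev_unplug(dev);
		ttm_bo_unmap_virtual_address_space(bdev);
	}

so drivers can't get the ordering of the two calls wrong.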

>  
>  	pci_disable_device(pdev);
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-22  9:51   ` Daniel Vetter
@ 2020-06-22 11:21     ` Greg KH
  2020-06-22 16:07       ` Andrey Grodzovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Greg KH @ 2020-06-22 11:21 UTC (permalink / raw)
  To: Daniel Vetter, Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > Track sysfs files in a list so they all can be removed during pci remove,
> > since otherwise their removal after that causes a crash because the parent
> > folder was already removed during pci remove.

Huh?  That should not happen, do you have a backtrace of that crash?

> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> 
> Uh I thought sysfs just gets yanked completely. Please check with Greg KH
> whether hand-rolling all this really is the right solution here ... Feels
> very wrong. I thought this was all supposed to work by adding attributes
> before publishing the sysfs node, and then letting sysfs clean up
> everything. Not by cleaning up manually yourself.

Yes, that is supposed to be the correct thing to do.

> 
> Adding Greg for an authoritative answer.
> -Daniel
> 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 13 +++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +++++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 ++++++++++++++++++++++++----
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++++++----
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 ++++++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 ++++++++++++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 ++++++++++-
> >  drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +++++---
> >  8 files changed, 99 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 604a681..ba3775f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -726,6 +726,15 @@ struct amd_powerplay {
> >  
> >  #define AMDGPU_RESET_MAGIC_NUM 64
> >  #define AMDGPU_MAX_DF_PERFMONS 4
> > +
> > +struct amdgpu_sysfs_list_node {
> > +	struct list_head head;
> > +	struct device_attribute *attr;
> > +};

You know we have lists of attributes already, called attribute groups,
if you really wanted to do something like this.  But, I don't think so.

Either way, don't hand-roll your own stuff that the driver core has
provided for you for a decade or more, that's just foolish :)
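
For illustration, the attribute-group version of (for example) the VRAM
manager files touched by this patch would look roughly like this,
reusing the DEVICE_ATTRs already defined there:

	static struct attribute *amdgpu_vram_mgr_attrs[] = {
		&dev_attr_mem_info_vram_total.attr,
		&dev_attr_mem_info_vis_vram_total.attr,
		&dev_attr_mem_info_vram_used.attr,
		&dev_attr_mem_info_vis_vram_used.attr,
		&dev_attr_mem_info_vram_vendor.attr,
		NULL
	};

	static const struct attribute_group amdgpu_vram_mgr_group = {
		.attrs = amdgpu_vram_mgr_attrs,
	};

	/* init: sysfs_create_group(&adev->dev->kobj, &amdgpu_vram_mgr_group);
	 * fini: sysfs_remove_group(&adev->dev->kobj, &amdgpu_vram_mgr_group);
	 * i.e. the whole set is created and removed as one unit. */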

> > +
> > +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
> > +	struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
> > +
> >  struct amdgpu_device {
> >  	struct device			*dev;
> >  	struct drm_device		*ddev;
> > @@ -992,6 +1001,10 @@ struct amdgpu_device {
> >  	char				product_number[16];
> >  	char				product_name[32];
> >  	char				serial[16];
> > +
> > +	struct list_head sysfs_files_list;
> > +	struct mutex	 sysfs_files_list_lock;
> > +
> >  };
> >  
> >  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> > index fdd52d8..c1549ee 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> > @@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev,
> >  	return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
> >  }
> >  
> > +
> >  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
> >  		   NULL);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
> >  
> >  /**
> >   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
> > @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
> >  	adev->mode_info.atom_context = NULL;
> >  	kfree(adev->mode_info.atom_card_info);
> >  	adev->mode_info.atom_card_info = NULL;
> > -	device_remove_file(adev->dev, &dev_attr_vbios_version);
> >  }
> >  
> >  /**
> > @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
> >  		return ret;
> >  	}
> >  
> > +	mutex_lock(&adev->sysfs_files_list_lock);
> > +	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
> > +	mutex_unlock(&adev->sysfs_files_list_lock);
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index e7b9065..3173046 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = {
> >  	NULL
> >  };
> >  
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
> > +
> > +
> >  /**
> >   * amdgpu_device_init - initialize the driver
> >   *
> > @@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >  	INIT_LIST_HEAD(&adev->shadow_list);
> >  	mutex_init(&adev->shadow_list_lock);
> >  
> > +	INIT_LIST_HEAD(&adev->sysfs_files_list);
> > +	mutex_init(&adev->sysfs_files_list_lock);
> > +
> >  	INIT_DELAYED_WORK(&adev->delayed_init_work,
> >  			  amdgpu_device_delayed_init_work_handler);
> >  	INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
> > @@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >  	if (r) {
> >  		dev_err(adev->dev, "Could not create amdgpu device attr\n");
> >  		return r;
> > +	} else {
> > +		mutex_lock(&adev->sysfs_files_list_lock);
> > +		list_add_tail(&dev_attr_handle_product_name.head, &adev->sysfs_files_list);
> > +		list_add_tail(&dev_attr_handle_product_number.head, &adev->sysfs_files_list);
> > +		list_add_tail(&dev_attr_handle_serial_number.head, &adev->sysfs_files_list);
> > +		list_add_tail(&dev_attr_handle_pcie_replay_count.head, &adev->sysfs_files_list);
> > +		mutex_unlock(&adev->sysfs_files_list_lock);
> >  	}
> >  
> >  	if (IS_ENABLED(CONFIG_PERF_EVENTS))
> > @@ -3298,6 +3314,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >  	return r;
> >  }
> >  
> > +static void amdgpu_sysfs_remove_files(struct amdgpu_device *adev)
> > +{
> > +	struct amdgpu_sysfs_list_node *node;
> > +
> > +	mutex_lock(&adev->sysfs_files_list_lock);
> > +	list_for_each_entry(node, &adev->sysfs_files_list, head)
> > +		device_remove_file(adev->dev, node->attr);
> > +	mutex_unlock(&adev->sysfs_files_list_lock);
> > +}
> > +
> >  /**
> >   * amdgpu_device_fini - tear down the driver
> >   *
> > @@ -3332,6 +3358,11 @@ void amdgpu_device_fini_early(struct amdgpu_device *adev)
> >  	amdgpu_fbdev_fini(adev);
> >  
> >  	amdgpu_irq_fini_early(adev);
> > +
> > +	amdgpu_sysfs_remove_files(adev);
> > +
> > +	if (adev->ucode_sysfs_en)
> > +		amdgpu_ucode_sysfs_fini(adev);
> >  }
> >  
> >  void amdgpu_device_fini_late(struct amdgpu_device *adev)
> > @@ -3366,10 +3397,6 @@ void amdgpu_device_fini_late(struct amdgpu_device *adev)
> >  	adev->rmmio = NULL;
> >  	amdgpu_device_doorbell_fini(adev);
> >  
> > -	if (adev->ucode_sysfs_en)
> > -		amdgpu_ucode_sysfs_fini(adev);
> > -
> > -	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
> >  	if (IS_ENABLED(CONFIG_PERF_EVENTS))
> >  		amdgpu_pmu_fini(adev);
> >  	if (amdgpu_discovery && adev->asic_type >= CHIP_NAVI10)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> > index 6271044..e7b6c4a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> > @@ -76,6 +76,9 @@ static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
> >  static DEVICE_ATTR(mem_info_gtt_used, S_IRUGO,
> >  	           amdgpu_mem_info_gtt_used_show, NULL);
> >  
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_total);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_used);
> > +
> >  /**
> >   * amdgpu_gtt_mgr_init - init GTT manager and DRM MM
> >   *
> > @@ -114,6 +117,11 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
> >  		return ret;
> >  	}
> >  
> > +	mutex_lock(&adev->sysfs_files_list_lock);
> > +	list_add_tail(&dev_attr_handle_mem_info_gtt_total.head, &adev->sysfs_files_list);
> > +	list_add_tail(&dev_attr_handle_mem_info_gtt_used.head, &adev->sysfs_files_list);
> > +	mutex_unlock(&adev->sysfs_files_list_lock);
> > +
> >  	return 0;
> >  }
> >  
> > @@ -127,7 +135,6 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
> >   */
> >  static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
> >  {
> > -	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
> >  	struct amdgpu_gtt_mgr *mgr = man->priv;
> >  	spin_lock(&mgr->lock);
> >  	drm_mm_takedown(&mgr->mm);
> > @@ -135,9 +142,6 @@ static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
> >  	kfree(mgr);
> >  	man->priv = NULL;
> >  
> > -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_total);
> > -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_used);
> > -
> >  	return 0;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index ddb4af0c..554fec0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -2216,6 +2216,8 @@ static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
> >  		   psp_usbc_pd_fw_sysfs_read,
> >  		   psp_usbc_pd_fw_sysfs_write);
> >  
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(usbc_pd_fw);
> > +
> >  
> >  
> >  const struct amd_ip_funcs psp_ip_funcs = {
> > @@ -2242,13 +2244,17 @@ static int psp_sysfs_init(struct amdgpu_device *adev)
> >  
> >  	if (ret)
> >  		DRM_ERROR("Failed to create USBC PD FW control file!");
> > +	else {
> > +		mutex_lock(&adev->sysfs_files_list_lock);
> > +		list_add_tail(&dev_attr_handle_usbc_pd_fw.head, &adev->sysfs_files_list);
> > +		mutex_unlock(&adev->sysfs_files_list_lock);
> > +	}
> >  
> >  	return ret;
> >  }
> >  
> >  static void psp_sysfs_fini(struct amdgpu_device *adev)
> >  {
> > -	device_remove_file(adev->dev, &dev_attr_usbc_pd_fw);
> >  }
> >  
> >  const struct amdgpu_ip_block_version psp_v3_1_ip_block =
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > index 7723937..39c400c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > @@ -148,6 +148,12 @@ static DEVICE_ATTR(mem_info_vis_vram_used, S_IRUGO,
> >  static DEVICE_ATTR(mem_info_vram_vendor, S_IRUGO,
> >  		   amdgpu_mem_info_vram_vendor, NULL);
> >  
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_total);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_total);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_used);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_used);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_vendor);

Converting all of these individual attributes to an attribute group
would be a nice thing to do anyway.  Makes your logic much simpler and
less error-prone.

But again, the driver core should do all of the device file removal
stuff automatically for you when your PCI device is removed from the
system _UNLESS_ you are doing crazy things like creating child devices
or messing with raw kobjects or other horrible things that I haven't
read the code to see if you are, but hopefully not :)
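
In other words (a sketch, assuming the attributes can all be declared up
front), publishing them through the driver core means no removal code in
the driver at all:

	static struct attribute *amdgpu_attrs[] = {
		&dev_attr_pcie_replay_count.attr,
		NULL
	};
	ATTRIBUTE_GROUPS(amdgpu);

	static struct pci_driver amdgpu_kms_pci_driver = {
		.name			= "amdgpu",
		/* created before the device is announced to userspace,
		 * removed automatically by the driver core on unbind */
		.driver.dev_groups	= amdgpu_groups,
	};

Attributes that only exist conditionally (per-ASIC features and so on)
can still be handled with an is_visible() callback in the group rather
than with create/remove calls sprinkled through init code.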

thanks,

greg k-h

* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
  2020-06-22  9:35   ` Daniel Vetter
@ 2020-06-22 13:18   ` Christian König
  2020-06-22 14:23     ` Daniel Vetter
  2020-06-22 14:32     ` Andrey Grodzovsky
  1 sibling, 2 replies; 54+ messages in thread
From: Christian König @ 2020-06-22 13:18 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

On 21.06.20 at 08:03, Andrey Grodzovsky wrote:
> Will be used to reroute CPU-mapped BO page faults once
> the device is removed.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/drm_file.c  |  8 ++++++++
>   drivers/gpu/drm/drm_prime.c | 10 ++++++++++
>   include/drm/drm_file.h      |  2 ++
>   include/drm/drm_gem.h       |  2 ++
>   4 files changed, 22 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index c4c704e..67c0770 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -188,6 +188,12 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
>   			goto out_prime_destroy;
>   	}
>   
> +	file->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!file->dummy_page) {
> +		ret = -ENOMEM;
> +		goto out_prime_destroy;
> +	}
> +
>   	return file;
>   
>   out_prime_destroy:
> @@ -284,6 +290,8 @@ void drm_file_free(struct drm_file *file)
>   	if (dev->driver->postclose)
>   		dev->driver->postclose(dev, file);
>   
> +	__free_page(file->dummy_page);
> +
>   	drm_prime_destroy_file_private(&file->prime);
>   
>   	WARN_ON(!list_empty(&file->event_list));
> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> index 1de2cde..c482e9c 100644
> --- a/drivers/gpu/drm/drm_prime.c
> +++ b/drivers/gpu/drm/drm_prime.c
> @@ -335,6 +335,13 @@ int drm_gem_prime_fd_to_handle(struct drm_device *dev,
>   
>   	ret = drm_prime_add_buf_handle(&file_priv->prime,
>   			dma_buf, *handle);
> +
> +	if (!ret) {
> +		obj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +		if (!obj->dummy_page)
> +			ret = -ENOMEM;
> +	}
> +

While the per-file case still looks acceptable, this is a clear NAK since
it will massively increase the memory needed for a prime-exported object.

I think that this is quite overkill in the first place and for the hot 
unplug case we can just use the global dummy page as well.

Christian.
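
A sketch of the global variant (module-scope page allocated once; names
made up, and wiring into drm core init/exit omitted):

	static struct page *drm_global_dummy_page;

	static int drm_dummy_page_init(void)
	{
		drm_global_dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
		return drm_global_dummy_page ? 0 : -ENOMEM;
	}

	static void drm_dummy_page_fini(void)
	{
		__free_page(drm_global_dummy_page);
	}

	/* in the fault handler, instead of the per-file / per-GEM page:
	 *	vmf->page = drm_global_dummy_page;
	 */

The trade-off is that every process then faults in the same shared rw
page, so writes by one client become visible to others, which is what
the per-file page was presumably meant to avoid.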

>   	mutex_unlock(&file_priv->prime.lock);
>   	if (ret)
>   		goto fail;
> @@ -1006,6 +1013,9 @@ void drm_prime_gem_destroy(struct drm_gem_object *obj, struct sg_table *sg)
>   		dma_buf_unmap_attachment(attach, sg, DMA_BIDIRECTIONAL);
>   	dma_buf = attach->dmabuf;
>   	dma_buf_detach(attach->dmabuf, attach);
> +
> +	__free_page(obj->dummy_page);
> +
>   	/* remove the reference */
>   	dma_buf_put(dma_buf);
>   }
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index 19df802..349a658 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -335,6 +335,8 @@ struct drm_file {
>   	 */
>   	struct drm_prime_file_private prime;
>   
> +	struct page *dummy_page;
> +
>   	/* private: */
>   #if IS_ENABLED(CONFIG_DRM_LEGACY)
>   	unsigned long lock_count; /* DRI1 legacy lock count */
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index 0b37506..47460d1 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -310,6 +310,8 @@ struct drm_gem_object {
>   	 *
>   	 */
>   	const struct drm_gem_object_funcs *funcs;
> +
> +	struct page *dummy_page;
>   };
>   
>   /**


* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-21  6:03 ` [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal Andrey Grodzovsky
  2020-06-22  9:51   ` Daniel Vetter
@ 2020-06-22 13:19   ` Christian König
  1 sibling, 0 replies; 54+ messages in thread
From: Christian König @ 2020-06-22 13:19 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> Track sysfs files in a list so they all can be removed during pci remove
> since otherwise their removal after that causes crash because parent
> folder was already removed during pci remove.

That looks extremely fishy to me.

It sounds like we just don't remove stuff in the right order.

Christian.

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 13 +++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 ++++++++++++++++++++++++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++++++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 ++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 ++++++++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 ++++++++++-
>   drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +++++---
>   8 files changed, 99 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 604a681..ba3775f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -726,6 +726,15 @@ struct amd_powerplay {
>   
>   #define AMDGPU_RESET_MAGIC_NUM 64
>   #define AMDGPU_MAX_DF_PERFMONS 4
> +
> +struct amdgpu_sysfs_list_node {
> +	struct list_head head;
> +	struct device_attribute *attr;
> +};
> +
> +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
> +	struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
> +
>   struct amdgpu_device {
>   	struct device			*dev;
>   	struct drm_device		*ddev;
> @@ -992,6 +1001,10 @@ struct amdgpu_device {
>   	char				product_number[16];
>   	char				product_name[32];
>   	char				serial[16];
> +
> +	struct list_head sysfs_files_list;
> +	struct mutex	 sysfs_files_list_lock;
> +
>   };
>   
>   static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> index fdd52d8..c1549ee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> @@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev,
>   	return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
>   }
>   
> +
>   static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
>   		   NULL);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
>   
>   /**
>    * amdgpu_atombios_fini - free the driver info and callbacks for atombios
> @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
>   	adev->mode_info.atom_context = NULL;
>   	kfree(adev->mode_info.atom_card_info);
>   	adev->mode_info.atom_card_info = NULL;
> -	device_remove_file(adev->dev, &dev_attr_vbios_version);
>   }
>   
>   /**
> @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
>   		return ret;
>   	}
>   
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +
>   	return 0;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e7b9065..3173046 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>   	NULL
>   };
>   
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
> +
> +
>   /**
>    * amdgpu_device_init - initialize the driver
>    *
> @@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	INIT_LIST_HEAD(&adev->shadow_list);
>   	mutex_init(&adev->shadow_list_lock);
>   
> +	INIT_LIST_HEAD(&adev->sysfs_files_list);
> +	mutex_init(&adev->sysfs_files_list_lock);
> +
>   	INIT_DELAYED_WORK(&adev->delayed_init_work,
>   			  amdgpu_device_delayed_init_work_handler);
>   	INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
> @@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	if (r) {
>   		dev_err(adev->dev, "Could not create amdgpu device attr\n");
>   		return r;
> +	} else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_product_name.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_product_number.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_serial_number.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_pcie_replay_count.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
>   	}
>   
>   	if (IS_ENABLED(CONFIG_PERF_EVENTS))
> @@ -3298,6 +3314,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> +static void amdgpu_sysfs_remove_files(struct amdgpu_device *adev)
> +{
> +	struct amdgpu_sysfs_list_node *node;
> +
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_for_each_entry(node, &adev->sysfs_files_list, head)
> +		device_remove_file(adev->dev, node->attr);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +}
> +
>   /**
>    * amdgpu_device_fini - tear down the driver
>    *
> @@ -3332,6 +3358,11 @@ void amdgpu_device_fini_early(struct amdgpu_device *adev)
>   	amdgpu_fbdev_fini(adev);
>   
>   	amdgpu_irq_fini_early(adev);
> +
> +	amdgpu_sysfs_remove_files(adev);
> +
> +	if (adev->ucode_sysfs_en)
> +		amdgpu_ucode_sysfs_fini(adev);
>   }
>   
>   void amdgpu_device_fini_late(struct amdgpu_device *adev)
> @@ -3366,10 +3397,6 @@ void amdgpu_device_fini_late(struct amdgpu_device *adev)
>   	adev->rmmio = NULL;
>   	amdgpu_device_doorbell_fini(adev);
>   
> -	if (adev->ucode_sysfs_en)
> -		amdgpu_ucode_sysfs_fini(adev);
> -
> -	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
>   	if (IS_ENABLED(CONFIG_PERF_EVENTS))
>   		amdgpu_pmu_fini(adev);
>   	if (amdgpu_discovery && adev->asic_type >= CHIP_NAVI10)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> index 6271044..e7b6c4a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> @@ -76,6 +76,9 @@ static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
>   static DEVICE_ATTR(mem_info_gtt_used, S_IRUGO,
>   	           amdgpu_mem_info_gtt_used_show, NULL);
>   
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_total);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_used);
> +
>   /**
>    * amdgpu_gtt_mgr_init - init GTT manager and DRM MM
>    *
> @@ -114,6 +117,11 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
>   		return ret;
>   	}
>   
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_add_tail(&dev_attr_handle_mem_info_gtt_total.head, &adev->sysfs_files_list);
> +	list_add_tail(&dev_attr_handle_mem_info_gtt_used.head, &adev->sysfs_files_list);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +
>   	return 0;
>   }
>   
> @@ -127,7 +135,6 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
>    */
>   static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
>   {
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
>   	struct amdgpu_gtt_mgr *mgr = man->priv;
>   	spin_lock(&mgr->lock);
>   	drm_mm_takedown(&mgr->mm);
> @@ -135,9 +142,6 @@ static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
>   	kfree(mgr);
>   	man->priv = NULL;
>   
> -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_total);
> -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_used);
> -
>   	return 0;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index ddb4af0c..554fec0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -2216,6 +2216,8 @@ static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
>   		   psp_usbc_pd_fw_sysfs_read,
>   		   psp_usbc_pd_fw_sysfs_write);
>   
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(usbc_pd_fw);
> +
>   
>   
>   const struct amd_ip_funcs psp_ip_funcs = {
> @@ -2242,13 +2244,17 @@ static int psp_sysfs_init(struct amdgpu_device *adev)
>   
>   	if (ret)
>   		DRM_ERROR("Failed to create USBC PD FW control file!");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_usbc_pd_fw.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>   
>   	return ret;
>   }
>   
>   static void psp_sysfs_fini(struct amdgpu_device *adev)
>   {
> -	device_remove_file(adev->dev, &dev_attr_usbc_pd_fw);
>   }
>   
>   const struct amdgpu_ip_block_version psp_v3_1_ip_block =
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 7723937..39c400c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -148,6 +148,12 @@ static DEVICE_ATTR(mem_info_vis_vram_used, S_IRUGO,
>   static DEVICE_ATTR(mem_info_vram_vendor, S_IRUGO,
>   		   amdgpu_mem_info_vram_vendor, NULL);
>   
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_total);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_total);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_used);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_used);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_vendor);
> +
>   static const struct attribute *amdgpu_vram_mgr_attributes[] = {
>   	&dev_attr_mem_info_vram_total.attr,
>   	&dev_attr_mem_info_vis_vram_total.attr,
> @@ -184,6 +190,15 @@ static int amdgpu_vram_mgr_init(struct ttm_mem_type_manager *man,
>   	ret = sysfs_create_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
>   	if (ret)
>   		DRM_ERROR("Failed to register sysfs\n");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_mem_info_vram_total.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vis_vram_total.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vram_used.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vis_vram_used.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_mem_info_vram_vendor.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>   
>   	return 0;
>   }
> @@ -198,7 +213,6 @@ static int amdgpu_vram_mgr_init(struct ttm_mem_type_manager *man,
>    */
>   static int amdgpu_vram_mgr_fini(struct ttm_mem_type_manager *man)
>   {
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
>   	struct amdgpu_vram_mgr *mgr = man->priv;
>   
>   	spin_lock(&mgr->lock);
> @@ -206,7 +220,6 @@ static int amdgpu_vram_mgr_fini(struct ttm_mem_type_manager *man)
>   	spin_unlock(&mgr->lock);
>   	kfree(mgr);
>   	man->priv = NULL;
> -	sysfs_remove_files(&adev->dev->kobj, amdgpu_vram_mgr_attributes);
>   	return 0;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 90610b4..455eaa4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -272,6 +272,9 @@ static ssize_t amdgpu_xgmi_show_error(struct device *dev,
>   static DEVICE_ATTR(xgmi_device_id, S_IRUGO, amdgpu_xgmi_show_device_id, NULL);
>   static DEVICE_ATTR(xgmi_error, S_IRUGO, amdgpu_xgmi_show_error, NULL);
>   
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(xgmi_device_id);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(xgmi_error);
> +
>   static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
>   					 struct amdgpu_hive_info *hive)
>   {
> @@ -285,10 +288,19 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
>   		return ret;
>   	}
>   
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_add_tail(&dev_attr_handle_xgmi_device_id.head, &adev->sysfs_files_list);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +
>   	/* Create xgmi error file */
>   	ret = device_create_file(adev->dev, &dev_attr_xgmi_error);
>   	if (ret)
>   		pr_err("failed to create xgmi_error\n");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_xgmi_error.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>   
>   
>   	/* Create sysfs link to hive info folder on the first device */
> @@ -325,7 +337,6 @@ static int amdgpu_xgmi_sysfs_add_dev_info(struct amdgpu_device *adev,
>   static void amdgpu_xgmi_sysfs_rem_dev_info(struct amdgpu_device *adev,
>   					  struct amdgpu_hive_info *hive)
>   {
> -	device_remove_file(adev->dev, &dev_attr_xgmi_device_id);
>   	sysfs_remove_link(&adev->dev->kobj, adev->ddev->unique);
>   	sysfs_remove_link(hive->kobj, adev->ddev->unique);
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> index a7b8292..f95b0b2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> +++ b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> @@ -265,6 +265,8 @@ static ssize_t df_v3_6_get_df_cntr_avail(struct device *dev,
>   /* device attr for available perfmon counters */
>   static DEVICE_ATTR(df_cntr_avail, S_IRUGO, df_v3_6_get_df_cntr_avail, NULL);
>   
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(df_cntr_avail);
> +
>   static void df_v3_6_query_hashes(struct amdgpu_device *adev)
>   {
>   	u32 tmp;
> @@ -299,6 +301,11 @@ static void df_v3_6_sw_init(struct amdgpu_device *adev)
>   	ret = device_create_file(adev->dev, &dev_attr_df_cntr_avail);
>   	if (ret)
>   		DRM_ERROR("failed to create file for available df counters\n");
> +	else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_df_cntr_avail.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
>   
>   	for (i = 0; i < AMDGPU_MAX_DF_PERFMONS; i++)
>   		adev->df_perfmon_config_assign_mask[i] = 0;
> @@ -308,9 +315,6 @@ static void df_v3_6_sw_init(struct amdgpu_device *adev)
>   
>   static void df_v3_6_sw_fini(struct amdgpu_device *adev)
>   {
> -
> -	device_remove_file(adev->dev, &dev_attr_df_cntr_avail);
> -
>   }
>   
>   static void df_v3_6_enable_broadcast_mode(struct amdgpu_device *adev,


* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-22  9:35   ` Daniel Vetter
@ 2020-06-22 14:21     ` Pekka Paalanen
  2020-06-22 14:24       ` Daniel Vetter
  0 siblings, 1 reply; 54+ messages in thread
From: Pekka Paalanen @ 2020-06-22 14:21 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Andrey Grodzovsky, daniel.vetter, michel, dri-devel, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

On Mon, 22 Jun 2020 11:35:01 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Sun, Jun 21, 2020 at 02:03:01AM -0400, Andrey Grodzovsky wrote:
> > Will be used to reroute CPU mapped BO's page faults once
> > device is removed.
> > 
> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > ---
> >  drivers/gpu/drm/drm_file.c  |  8 ++++++++
> >  drivers/gpu/drm/drm_prime.c | 10 ++++++++++
> >  include/drm/drm_file.h      |  2 ++
> >  include/drm/drm_gem.h       |  2 ++
> >  4 files changed, 22 insertions(+)

...

> > diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> > index 0b37506..47460d1 100644
> > --- a/include/drm/drm_gem.h
> > +++ b/include/drm/drm_gem.h
> > @@ -310,6 +310,8 @@ struct drm_gem_object {
> >  	 *
> >  	 */
> >  	const struct drm_gem_object_funcs *funcs;
> > +
> > +	struct page *dummy_page;
> >  };  
> 
> I think amdgpu doesn't care, but everyone else still might care somewhat
> about flink. That also shares buffers, so also needs to allocate the
> per-bo dummy page.

Do you really care about making flink not explode on device
hot-unplug? Why not just leave flink users die in a fire?
It's not a regression.


Thanks,
pq


* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-22 13:18   ` Christian König
@ 2020-06-22 14:23     ` Daniel Vetter
  2020-06-22 14:32     ` Andrey Grodzovsky
  1 sibling, 0 replies; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22 14:23 UTC (permalink / raw)
  To: Christian König
  Cc: Andrey Grodzovsky, Michel Dänzer, dri-devel, Pekka Paalanen,
	amd-gfx list, Alex Deucher

On Mon, Jun 22, 2020 at 3:18 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> > Will be used to reroute CPU mapped BO's page faults once
> > device is removed.
> >
> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > ---
> >   drivers/gpu/drm/drm_file.c  |  8 ++++++++
> >   drivers/gpu/drm/drm_prime.c | 10 ++++++++++
> >   include/drm/drm_file.h      |  2 ++
> >   include/drm/drm_gem.h       |  2 ++
> >   4 files changed, 22 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> > index c4c704e..67c0770 100644
> > --- a/drivers/gpu/drm/drm_file.c
> > +++ b/drivers/gpu/drm/drm_file.c
> > @@ -188,6 +188,12 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
> >                       goto out_prime_destroy;
> >       }
> >
> > +     file->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +     if (!file->dummy_page) {
> > +             ret = -ENOMEM;
> > +             goto out_prime_destroy;
> > +     }
> > +
> >       return file;
> >
> >   out_prime_destroy:
> > @@ -284,6 +290,8 @@ void drm_file_free(struct drm_file *file)
> >       if (dev->driver->postclose)
> >               dev->driver->postclose(dev, file);
> >
> > +     __free_page(file->dummy_page);
> > +
> >       drm_prime_destroy_file_private(&file->prime);
> >
> >       WARN_ON(!list_empty(&file->event_list));
> > diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> > index 1de2cde..c482e9c 100644
> > --- a/drivers/gpu/drm/drm_prime.c
> > +++ b/drivers/gpu/drm/drm_prime.c
> > @@ -335,6 +335,13 @@ int drm_gem_prime_fd_to_handle(struct drm_device *dev,
> >
> >       ret = drm_prime_add_buf_handle(&file_priv->prime,
> >                       dma_buf, *handle);
> > +
> > +     if (!ret) {
> > +             obj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +             if (!obj->dummy_page)
> > +                     ret = -ENOMEM;
> > +     }
> > +
>
> While the per file case still looks acceptable this is a clear NAK since
> it will massively increase the memory needed for a prime exported object.
>
> I think that this is quite overkill in the first place and for the hot
> unplug case we can just use the global dummy page as well.

Imo we either don't bother with per-file dummy page, or we need this.
Half-way doesn't make much sense, since for anything you dma-buf
exported you have no idea whether it left a sandbox or not.

E.g. anything that's shared between client/compositor has a different
security context, so picking the dummy page of either is the wrong
thing.

If you're worried about the overhead we can also allocate the dummy
page on demand, and SIGBUS if we can't allocate the right one. Then we
just need to track whether a buffer has ever been exported.
-Daniel
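
A minimal sketch of that on-demand variant (the obj->dummy_page field is
from the patch, but the handler, the cmpxchg() locking and where the
"has this ever been exported" check would go are assumptions):

static vm_fault_t drm_gem_unplugged_fault(struct vm_fault *vmf)
{
	struct drm_gem_object *obj = vmf->vma->vm_private_data;

	/* allocate the per-object dummy page only on the first fault
	 * after the device is gone, not at object creation */
	if (!obj->dummy_page) {
		struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

		if (!page)
			return VM_FAULT_SIGBUS; /* can't provide the right page */

		/* concurrent faults must agree on a single page */
		if (cmpxchg(&obj->dummy_page, NULL, page))
			__free_page(page);
	}

	return vmf_insert_pfn(vmf->vma, vmf->address,
			      page_to_pfn(obj->dummy_page));
}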

>
> Christian.
>
> >       mutex_unlock(&file_priv->prime.lock);
> >       if (ret)
> >               goto fail;
> > @@ -1006,6 +1013,9 @@ void drm_prime_gem_destroy(struct drm_gem_object *obj, struct sg_table *sg)
> >               dma_buf_unmap_attachment(attach, sg, DMA_BIDIRECTIONAL);
> >       dma_buf = attach->dmabuf;
> >       dma_buf_detach(attach->dmabuf, attach);
> > +
> > +     __free_page(obj->dummy_page);
> > +
> >       /* remove the reference */
> >       dma_buf_put(dma_buf);
> >   }
> > diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> > index 19df802..349a658 100644
> > --- a/include/drm/drm_file.h
> > +++ b/include/drm/drm_file.h
> > @@ -335,6 +335,8 @@ struct drm_file {
> >        */
> >       struct drm_prime_file_private prime;
> >
> > +     struct page *dummy_page;
> > +
> >       /* private: */
> >   #if IS_ENABLED(CONFIG_DRM_LEGACY)
> >       unsigned long lock_count; /* DRI1 legacy lock count */
> > diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> > index 0b37506..47460d1 100644
> > --- a/include/drm/drm_gem.h
> > +++ b/include/drm/drm_gem.h
> > @@ -310,6 +310,8 @@ struct drm_gem_object {
> >        *
> >        */
> >       const struct drm_gem_object_funcs *funcs;
> > +
> > +     struct page *dummy_page;
> >   };
> >
> >   /**
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-22 14:21     ` Pekka Paalanen
@ 2020-06-22 14:24       ` Daniel Vetter
  2020-06-22 14:28         ` Pekka Paalanen
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22 14:24 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Andrey Grodzovsky, Christian König, Michel Dänzer,
	dri-devel, amd-gfx list, Alex Deucher

On Mon, Jun 22, 2020 at 4:22 PM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Mon, 22 Jun 2020 11:35:01 +0200
> Daniel Vetter <daniel@ffwll.ch> wrote:
>
> > On Sun, Jun 21, 2020 at 02:03:01AM -0400, Andrey Grodzovsky wrote:
> > > Will be used to reroute CPU mapped BO's page faults once
> > > device is removed.
> > >
> > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > > ---
> > >  drivers/gpu/drm/drm_file.c  |  8 ++++++++
> > >  drivers/gpu/drm/drm_prime.c | 10 ++++++++++
> > >  include/drm/drm_file.h      |  2 ++
> > >  include/drm/drm_gem.h       |  2 ++
> > >  4 files changed, 22 insertions(+)
>
> ...
>
> > > diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> > > index 0b37506..47460d1 100644
> > > --- a/include/drm/drm_gem.h
> > > +++ b/include/drm/drm_gem.h
> > > @@ -310,6 +310,8 @@ struct drm_gem_object {
> > >      *
> > >      */
> > >     const struct drm_gem_object_funcs *funcs;
> > > +
> > > +   struct page *dummy_page;
> > >  };
> >
> > I think amdgpu doesn't care, but everyone else still might care somewhat
> > about flink. That also shares buffers, so also needs to allocate the
> > per-bo dummy page.
>
> Do you really care about making flink not explode on device
> hot-unplug? Why not just leave flink users die in a fire?
> It's not a regression.

It's not about exploding, they won't. With flink you can pass a buffer
from one address space to the other, so imo we should avoid false
sharing. E.g. if you happen to write something $secret into a private
buffer, but only $non-secret stuff into shared buffers. Then if you
unplug, your well-kept $secret might suddenly be visible to lots of
other processes you never intended to share it with.

Just feels safer to plug that hole completely.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-22 14:24       ` Daniel Vetter
@ 2020-06-22 14:28         ` Pekka Paalanen
  0 siblings, 0 replies; 54+ messages in thread
From: Pekka Paalanen @ 2020-06-22 14:28 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Andrey Grodzovsky, Christian König, Michel Dänzer,
	dri-devel, amd-gfx list, Alex Deucher

On Mon, 22 Jun 2020 16:24:38 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Mon, Jun 22, 2020 at 4:22 PM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Mon, 22 Jun 2020 11:35:01 +0200
> > Daniel Vetter <daniel@ffwll.ch> wrote:
> >  
> > > On Sun, Jun 21, 2020 at 02:03:01AM -0400, Andrey Grodzovsky wrote:  
> > > > Will be used to reroute CPU mapped BO's page faults once
> > > > device is removed.
> > > >
> > > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > > > ---
> > > >  drivers/gpu/drm/drm_file.c  |  8 ++++++++
> > > >  drivers/gpu/drm/drm_prime.c | 10 ++++++++++
> > > >  include/drm/drm_file.h      |  2 ++
> > > >  include/drm/drm_gem.h       |  2 ++
> > > >  4 files changed, 22 insertions(+)  
> >
> > ...
> >  
> > > > diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> > > > index 0b37506..47460d1 100644
> > > > --- a/include/drm/drm_gem.h
> > > > +++ b/include/drm/drm_gem.h
> > > > @@ -310,6 +310,8 @@ struct drm_gem_object {
> > > >      *
> > > >      */
> > > >     const struct drm_gem_object_funcs *funcs;
> > > > +
> > > > +   struct page *dummy_page;
> > > >  };  
> > >
> > > I think amdgpu doesn't care, but everyone else still might care somewhat
> > > about flink. That also shares buffers, so also needs to allocate the
> > > per-bo dummy page.  
> >
> > Do you really care about making flink not explode on device
> > hot-unplug? Why not just leave flink users die in a fire?
> > It's not a regression.  
> 
> It's not about exploding, they won't. With flink you can pass a buffer
> from one address space to the other, so imo we should avoid false
> sharing. E.g. if you happen to write something $secret into a private
> buffer, but only $non-secret stuff into shared buffers. Then if you
> unplug, your well-kept $secret might suddenly be visible to lots of
> other processes you never intended to share it with.
> 
> Just feels safer to plug that hole completely.

Ah! Ok, I clearly didn't understand the consequences.


Thanks,
pq


* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-22 13:18   ` Christian König
  2020-06-22 14:23     ` Daniel Vetter
@ 2020-06-22 14:32     ` Andrey Grodzovsky
  2020-06-22 17:45       ` Christian König
  1 sibling, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-22 14:32 UTC (permalink / raw)
  To: christian.koenig, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen


On 6/22/20 9:18 AM, Christian König wrote:
> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
>> Will be used to reroute CPU mapped BO's page faults once
>> device is removed.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/drm_file.c  |  8 ++++++++
>>   drivers/gpu/drm/drm_prime.c | 10 ++++++++++
>>   include/drm/drm_file.h      |  2 ++
>>   include/drm/drm_gem.h       |  2 ++
>>   4 files changed, 22 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>> index c4c704e..67c0770 100644
>> --- a/drivers/gpu/drm/drm_file.c
>> +++ b/drivers/gpu/drm/drm_file.c
>> @@ -188,6 +188,12 @@ struct drm_file *drm_file_alloc(struct drm_minor 
>> *minor)
>>               goto out_prime_destroy;
>>       }
>>   +    file->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> +    if (!file->dummy_page) {
>> +        ret = -ENOMEM;
>> +        goto out_prime_destroy;
>> +    }
>> +
>>       return file;
>>     out_prime_destroy:
>> @@ -284,6 +290,8 @@ void drm_file_free(struct drm_file *file)
>>       if (dev->driver->postclose)
>>           dev->driver->postclose(dev, file);
>>   +    __free_page(file->dummy_page);
>> +
>>       drm_prime_destroy_file_private(&file->prime);
>>         WARN_ON(!list_empty(&file->event_list));
>> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
>> index 1de2cde..c482e9c 100644
>> --- a/drivers/gpu/drm/drm_prime.c
>> +++ b/drivers/gpu/drm/drm_prime.c
>> @@ -335,6 +335,13 @@ int drm_gem_prime_fd_to_handle(struct drm_device 
>> *dev,
>>         ret = drm_prime_add_buf_handle(&file_priv->prime,
>>               dma_buf, *handle);
>> +
>> +    if (!ret) {
>> +        obj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> +        if (!obj->dummy_page)
>> +            ret = -ENOMEM;
>> +    }
>> +
>
> While the per file case still looks acceptable this is a clear NAK 
> since it will massively increase the memory needed for a prime 
> exported object.
>
> I think that this is quite overkill in the first place and for the hot 
> unplug case we can just use the global dummy page as well.
>
> Christian.


A global dummy page is fine for read access, but what do you do on write
access? My first approach was indeed to map the global dummy page as read
only and clear VM_SHARED from vma->vm_flags, assuming that this would
trigger the copy-on-write flow in core mm
(https://elixir.bootlin.com/linux/v5.7-rc7/source/mm/memory.c#L3977) on
the next page fault to the same address triggered by a write access. But
then I realized a new COW page will be allocated for each such mapping,
and this is much more wasteful than having a dedicated page per GEM
object. We can indeed optimize by allocating this dummy page on the first
page fault after device disconnect instead of at GEM object creation.

Andrey


>
>> mutex_unlock(&file_priv->prime.lock);
>>       if (ret)
>>           goto fail;
>> @@ -1006,6 +1013,9 @@ void drm_prime_gem_destroy(struct 
>> drm_gem_object *obj, struct sg_table *sg)
>>           dma_buf_unmap_attachment(attach, sg, DMA_BIDIRECTIONAL);
>>       dma_buf = attach->dmabuf;
>>       dma_buf_detach(attach->dmabuf, attach);
>> +
>> +    __free_page(obj->dummy_page);
>> +
>>       /* remove the reference */
>>       dma_buf_put(dma_buf);
>>   }
>> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
>> index 19df802..349a658 100644
>> --- a/include/drm/drm_file.h
>> +++ b/include/drm/drm_file.h
>> @@ -335,6 +335,8 @@ struct drm_file {
>>        */
>>       struct drm_prime_file_private prime;
>>   +    struct page *dummy_page;
>> +
>>       /* private: */
>>   #if IS_ENABLED(CONFIG_DRM_LEGACY)
>>       unsigned long lock_count; /* DRI1 legacy lock count */
>> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
>> index 0b37506..47460d1 100644
>> --- a/include/drm/drm_gem.h
>> +++ b/include/drm/drm_gem.h
>> @@ -310,6 +310,8 @@ struct drm_gem_object {
>>        *
>>        */
>>       const struct drm_gem_object_funcs *funcs;
>> +
>> +    struct page *dummy_page;
>>   };
>>     /**
>

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-22 11:21     ` Greg KH
@ 2020-06-22 16:07       ` Andrey Grodzovsky
  2020-06-22 16:45         ` Greg KH
  0 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-22 16:07 UTC (permalink / raw)
  To: Greg KH, Daniel Vetter
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher




On 6/22/20 7:21 AM, Greg KH wrote:
> On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
>> On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
>>> Track sysfs files in a list so they all can be removed during pci remove
>>> since otherwise their removal after that causes crash because parent
>>> folder was already removed during pci remove.
> Huh?  That should not happen, do you have a backtrace of that crash?


2 examples in the attached trace.

Andrey


>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Uh I thought sysfs just gets yanked completely. Please check with Greg KH
>> whether hand-rolling all this really is the right solution here ... Feels
>> very wrong. I thought this was all supposed to work by adding attributes
>> before publishing the sysfs node, and then letting sysfs clean up
>> everything. Not by cleaning up manually yourself.
> Yes, that is supposed to be the correct thing to do.
>
>> Adding Greg for an authoritative answer.
>> -Daniel
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 13 +++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +++++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 ++++++++++++++++++++++++----
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++++++----
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 ++++++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 ++++++++++++--
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 ++++++++++-
>>>   drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +++++---
>>>   8 files changed, 99 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 604a681..ba3775f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -726,6 +726,15 @@ struct amd_powerplay {
>>>   
>>>   #define AMDGPU_RESET_MAGIC_NUM 64
>>>   #define AMDGPU_MAX_DF_PERFMONS 4
>>> +
>>> +struct amdgpu_sysfs_list_node {
>>> +	struct list_head head;
>>> +	struct device_attribute *attr;
>>> +};
> You know we have lists of attributes already, called attribute groups,
> if you really wanted to do something like this.  But, I don't think so.
>
> Either way, don't hand-roll your own stuff that the driver core has
> provided for you for a decade or more, that's just foolish :)
>
>>> +
>>> +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
>>> +	struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
>>> +
>>>   struct amdgpu_device {
>>>   	struct device			*dev;
>>>   	struct drm_device		*ddev;
>>> @@ -992,6 +1001,10 @@ struct amdgpu_device {
>>>   	char				product_number[16];
>>>   	char				product_name[32];
>>>   	char				serial[16];
>>> +
>>> +	struct list_head sysfs_files_list;
>>> +	struct mutex	 sysfs_files_list_lock;
>>> +
>>>   };
>>>   
>>>   static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
>>> index fdd52d8..c1549ee 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
>>> @@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev,
>>>   	return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
>>>   }
>>>   
>>> +
>>>   static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
>>>   		   NULL);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
>>>   
>>>   /**
>>>    * amdgpu_atombios_fini - free the driver info and callbacks for atombios
>>> @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
>>>   	adev->mode_info.atom_context = NULL;
>>>   	kfree(adev->mode_info.atom_card_info);
>>>   	adev->mode_info.atom_card_info = NULL;
>>> -	device_remove_file(adev->dev, &dev_attr_vbios_version);
>>>   }
>>>   
>>>   /**
>>> @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
>>>   		return ret;
>>>   	}
>>>   
>>> +	mutex_lock(&adev->sysfs_files_list_lock);
>>> +	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
>>> +	mutex_unlock(&adev->sysfs_files_list_lock);
>>> +
>>>   	return 0;
>>>   }
>>>   
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index e7b9065..3173046 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>>>   	NULL
>>>   };
>>>   
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
>>> +
>>> +
>>>   /**
>>>    * amdgpu_device_init - initialize the driver
>>>    *
>>> @@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>   	INIT_LIST_HEAD(&adev->shadow_list);
>>>   	mutex_init(&adev->shadow_list_lock);
>>>   
>>> +	INIT_LIST_HEAD(&adev->sysfs_files_list);
>>> +	mutex_init(&adev->sysfs_files_list_lock);
>>> +
>>>   	INIT_DELAYED_WORK(&adev->delayed_init_work,
>>>   			  amdgpu_device_delayed_init_work_handler);
>>>   	INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
>>> @@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>   	if (r) {
>>>   		dev_err(adev->dev, "Could not create amdgpu device attr\n");
>>>   		return r;
>>> +	} else {
>>> +		mutex_lock(&adev->sysfs_files_list_lock);
>>> +		list_add_tail(&dev_attr_handle_product_name.head, &adev->sysfs_files_list);
>>> +		list_add_tail(&dev_attr_handle_product_number.head, &adev->sysfs_files_list);
>>> +		list_add_tail(&dev_attr_handle_serial_number.head, &adev->sysfs_files_list);
>>> +		list_add_tail(&dev_attr_handle_pcie_replay_count.head, &adev->sysfs_files_list);
>>> +		mutex_unlock(&adev->sysfs_files_list_lock);
>>>   	}
>>>   
>>>   	if (IS_ENABLED(CONFIG_PERF_EVENTS))
>>> @@ -3298,6 +3314,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>   	return r;
>>>   }
>>>   
>>> +static void amdgpu_sysfs_remove_files(struct amdgpu_device *adev)
>>> +{
>>> +	struct amdgpu_sysfs_list_node *node;
>>> +
>>> +	mutex_lock(&adev->sysfs_files_list_lock);
>>> +	list_for_each_entry(node, &adev->sysfs_files_list, head)
>>> +		device_remove_file(adev->dev, node->attr);
>>> +	mutex_unlock(&adev->sysfs_files_list_lock);
>>> +}
>>> +
>>>   /**
>>>    * amdgpu_device_fini - tear down the driver
>>>    *
>>> @@ -3332,6 +3358,11 @@ void amdgpu_device_fini_early(struct amdgpu_device *adev)
>>>   	amdgpu_fbdev_fini(adev);
>>>   
>>>   	amdgpu_irq_fini_early(adev);
>>> +
>>> +	amdgpu_sysfs_remove_files(adev);
>>> +
>>> +	if (adev->ucode_sysfs_en)
>>> +		amdgpu_ucode_sysfs_fini(adev);
>>>   }
>>>   
>>>   void amdgpu_device_fini_late(struct amdgpu_device *adev)
>>> @@ -3366,10 +3397,6 @@ void amdgpu_device_fini_late(struct amdgpu_device *adev)
>>>   	adev->rmmio = NULL;
>>>   	amdgpu_device_doorbell_fini(adev);
>>>   
>>> -	if (adev->ucode_sysfs_en)
>>> -		amdgpu_ucode_sysfs_fini(adev);
>>> -
>>> -	sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
>>>   	if (IS_ENABLED(CONFIG_PERF_EVENTS))
>>>   		amdgpu_pmu_fini(adev);
>>>   	if (amdgpu_discovery && adev->asic_type >= CHIP_NAVI10)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
>>> index 6271044..e7b6c4a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
>>> @@ -76,6 +76,9 @@ static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
>>>   static DEVICE_ATTR(mem_info_gtt_used, S_IRUGO,
>>>   	           amdgpu_mem_info_gtt_used_show, NULL);
>>>   
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_total);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_gtt_used);
>>> +
>>>   /**
>>>    * amdgpu_gtt_mgr_init - init GTT manager and DRM MM
>>>    *
>>> @@ -114,6 +117,11 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
>>>   		return ret;
>>>   	}
>>>   
>>> +	mutex_lock(&adev->sysfs_files_list_lock);
>>> +	list_add_tail(&dev_attr_handle_mem_info_gtt_total.head, &adev->sysfs_files_list);
>>> +	list_add_tail(&dev_attr_handle_mem_info_gtt_used.head, &adev->sysfs_files_list);
>>> +	mutex_unlock(&adev->sysfs_files_list_lock);
>>> +
>>>   	return 0;
>>>   }
>>>   
>>> @@ -127,7 +135,6 @@ static int amdgpu_gtt_mgr_init(struct ttm_mem_type_manager *man,
>>>    */
>>>   static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
>>>   {
>>> -	struct amdgpu_device *adev = amdgpu_ttm_adev(man->bdev);
>>>   	struct amdgpu_gtt_mgr *mgr = man->priv;
>>>   	spin_lock(&mgr->lock);
>>>   	drm_mm_takedown(&mgr->mm);
>>> @@ -135,9 +142,6 @@ static int amdgpu_gtt_mgr_fini(struct ttm_mem_type_manager *man)
>>>   	kfree(mgr);
>>>   	man->priv = NULL;
>>>   
>>> -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_total);
>>> -	device_remove_file(adev->dev, &dev_attr_mem_info_gtt_used);
>>> -
>>>   	return 0;
>>>   }
>>>   
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>> index ddb4af0c..554fec0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>> @@ -2216,6 +2216,8 @@ static DEVICE_ATTR(usbc_pd_fw, S_IRUGO | S_IWUSR,
>>>   		   psp_usbc_pd_fw_sysfs_read,
>>>   		   psp_usbc_pd_fw_sysfs_write);
>>>   
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(usbc_pd_fw);
>>> +
>>>   
>>>   
>>>   const struct amd_ip_funcs psp_ip_funcs = {
>>> @@ -2242,13 +2244,17 @@ static int psp_sysfs_init(struct amdgpu_device *adev)
>>>   
>>>   	if (ret)
>>>   		DRM_ERROR("Failed to create USBC PD FW control file!");
>>> +	else {
>>> +		mutex_lock(&adev->sysfs_files_list_lock);
>>> +		list_add_tail(&dev_attr_handle_usbc_pd_fw.head, &adev->sysfs_files_list);
>>> +		mutex_unlock(&adev->sysfs_files_list_lock);
>>> +	}
>>>   
>>>   	return ret;
>>>   }
>>>   
>>>   static void psp_sysfs_fini(struct amdgpu_device *adev)
>>>   {
>>> -	device_remove_file(adev->dev, &dev_attr_usbc_pd_fw);
>>>   }
>>>   
>>>   const struct amdgpu_ip_block_version psp_v3_1_ip_block =
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>> index 7723937..39c400c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>> @@ -148,6 +148,12 @@ static DEVICE_ATTR(mem_info_vis_vram_used, S_IRUGO,
>>>   static DEVICE_ATTR(mem_info_vram_vendor, S_IRUGO,
>>>   		   amdgpu_mem_info_vram_vendor, NULL);
>>>   
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_total);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_total);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_used);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vis_vram_used);
>>> +static AMDGPU_DEVICE_ATTR_LIST_NODE(mem_info_vram_vendor);
> Converting all of these individual attributes to an attribute group
> would be a nice thing to do anyway.  Makes your logic much simpler and
> less error-prone.
>
> But again, the driver core should do all of the device file removal
> stuff automatically for you when your PCI device is removed from the
> system _UNLESS_ you are doing crazy things like creating child devices
> or messing with raw kobjects or other horrible things that I haven't
> read the code to see if you are, but hopefully not :)
>
> thanks,
>
> greg k-h

[-- Attachment #2: sysfs_oops-1.log --]
[-- Type: text/x-log, Size: 7179 bytes --]

[  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
[  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
[  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
[  925.738240 <    0.000004>] PGD 0 P4D 0 
[  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
[  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
[  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
[  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
[  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
[  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
[  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
[  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
[  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
[  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
[  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
[  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
[  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
[  925.738329 <    0.000006>] Call Trace:
[  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
[  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
[  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
[  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
[  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120
[  925.738406 <    0.000052>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
[  925.738453 <    0.000047>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
[  925.738490 <    0.000037>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
[  925.738529 <    0.000039>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[  925.738548 <    0.000019>]  drm_dev_put+0x5b/0x80 [drm]
[  925.738558 <    0.000010>]  drm_release+0xc6/0xd0 [drm]
[  925.738563 <    0.000005>]  __fput+0xc6/0x260
[  925.738568 <    0.000005>]  task_work_run+0x79/0xb0
[  925.738573 <    0.000005>]  do_exit+0x3d0/0xc60
[  925.738578 <    0.000005>]  do_group_exit+0x47/0xb0
[  925.738583 <    0.000005>]  get_signal+0x18b/0xc30
[  925.738589 <    0.000006>]  do_signal+0x36/0x6a0
[  925.738593 <    0.000004>]  ? force_sig_info_to_task+0xbc/0xd0
[  925.738597 <    0.000004>]  ? signal_wake_up_state+0x15/0x30
[  925.738603 <    0.000006>]  exit_to_usermode_loop+0x6f/0xc0
[  925.738608 <    0.000005>]  prepare_exit_to_usermode+0xc7/0x110
[  925.738613 <    0.000005>]  ret_from_intr+0x25/0x35
[  925.738617 <    0.000004>] RIP: 0033:0x417369
[  925.738621 <    0.000004>] Code: Bad RIP value.
[  925.738625 <    0.000004>] RSP: 002b:00007ffdd6bf0900 EFLAGS: 00010246
[  925.738629 <    0.000004>] RAX: 00007f3eca509000 RBX: 000000000000001e RCX: 00007f3ec95ba260
[  925.738634 <    0.000005>] RDX: 00007f3ec9889790 RSI: 000000000000000a RDI: 0000000000000000
[  925.738639 <    0.000005>] RBP: 00007ffdd6bf0990 R08: 00007f3ec9889780 R09: 00007f3eca4e8700
[  925.738645 <    0.000006>] R10: 000000000000035c R11: 0000000000000246 R12: 00000000021c6170
[  925.738650 <    0.000005>] R13: 00007ffdd6bf0c00 R14: 0000000000000000 R15: 0000000000000000




[   40.880899 <    0.000004>] BUG: kernel NULL pointer dereference, address: 0000000000000090
[   40.880906 <    0.000007>] #PF: supervisor read access in kernel mode
[   40.880910 <    0.000004>] #PF: error_code(0x0000) - not-present page
[   40.880915 <    0.000005>] PGD 0 P4D 0 
[   40.880920 <    0.000005>] Oops: 0000 [#1] SMP PTI
[   40.880924 <    0.000004>] CPU: 1 PID: 2526 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
[   40.880932 <    0.000008>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
[   40.880941 <    0.000009>] RIP: 0010:kernfs_find_ns+0x18/0x110
[   40.880945 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
[   40.880957 <    0.000012>] RSP: 0018:ffffaf3380467ba8 EFLAGS: 00010246
[   40.880963 <    0.000006>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
[   40.880968 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffc0678cfc RDI: 0000000000000000
[   40.880973 <    0.000005>] RBP: ffffffffc0678cfc R08: ffffffffaa379d10 R09: 0000000000000000
[   40.880979 <    0.000006>] R10: ffffaf3380467be0 R11: ffff93547615d128 R12: 0000000000000000
[   40.880984 <    0.000005>] R13: 0000000000000000 R14: ffffffffc0678cfc R15: ffff93549be86130
[   40.880990 <    0.000006>] FS:  00007fd9ecb10700(0000) GS:ffff9354bd840000(0000) knlGS:0000000000000000
[   40.880996 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.881001 <    0.000005>] CR2: 0000000000000090 CR3: 0000000072866001 CR4: 00000000000606e0
[   40.881006 <    0.000005>] Call Trace:
[   40.881011 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
[   40.881016 <    0.000005>]  sysfs_remove_group+0x25/0x80
[   40.881055 <    0.000039>]  amdgpu_device_fini_late+0x3eb/0x440 [amdgpu]
[   40.881095 <    0.000040>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[   40.881109 <    0.000014>]  drm_dev_put+0x5b/0x80 [drm]
[   40.881119 <    0.000010>]  drm_release+0xc6/0xd0 [drm]
[   40.881124 <    0.000005>]  __fput+0xc6/0x260
[   40.881129 <    0.000005>]  task_work_run+0x79/0xb0
[   40.881134 <    0.000005>]  do_exit+0x3d0/0xc60
[   40.881138 <    0.000004>]  do_group_exit+0x47/0xb0
[   40.881143 <    0.000005>]  get_signal+0x18b/0xc30
[   40.881149 <    0.000006>]  do_signal+0x36/0x6a0
[   40.881153 <    0.000004>]  ? force_sig_info_to_task+0xbc/0xd0
[   40.881158 <    0.000005>]  ? signal_wake_up_state+0x15/0x30
[   40.881164 <    0.000006>]  exit_to_usermode_loop+0x6f/0xc0
[   40.881170 <    0.000006>]  prepare_exit_to_usermode+0xc7/0x110
[   40.881176 <    0.000006>]  ret_from_intr+0x25/0x35
[   40.881181 <    0.000005>] RIP: 0033:0x417369
[   40.881185 <    0.000004>] Code: Bad RIP value.
[   40.881188 <    0.000003>] RSP: 002b:00007ffd6a742f90 EFLAGS: 00010246
[   40.881193 <    0.000005>] RAX: 00007fd9ecb31000 RBX: 000000000000001e RCX: 00007fd9ebbe2260
[   40.881199 <    0.000006>] RDX: 00007fd9ebeb1790 RSI: 000000000000000a RDI: 0000000000000000
[   40.881204 <    0.000005>] RBP: 00007ffd6a743020 R08: 00007fd9ebeb1780 R09: 00007fd9ecb10700
[   40.881210 <    0.000006>] R10: 000000000000035c R11: 0000000000000246 R12: 00000000023e0170
[   40.881215 <    0.000005>] R13: 00007ffd6a743290 R14: 0000000000000000 R15: 0000000000000000




* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-22 16:07       ` Andrey Grodzovsky
@ 2020-06-22 16:45         ` Greg KH
  2020-06-23  4:51           ` Andrey Grodzovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Greg KH @ 2020-06-22 16:45 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher

On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
> 
> On 6/22/20 7:21 AM, Greg KH wrote:
> > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > > > Track sysfs files in a list so they all can be removed during pci remove
> > > > since otherwise their removal after that causes crash because parent
> > > > folder was already removed during pci remove.
> > Huh?  That should not happen, do you have a backtrace of that crash?
> 
> 
> 2 examples in the attached trace.

Odd, how did you trigger these?


> [  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
> [  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
> [  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
> [  925.738240 <    0.000004>] PGD 0 P4D 0 
> [  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
> [  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
> [  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> [  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
> [  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> [  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
> [  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
> [  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
> [  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
> [  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
> [  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
> [  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
> [  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
> [  925.738329 <    0.000006>] Call Trace:
> [  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
> [  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
> [  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
> [  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
> [  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120

So the PCI core is trying to clean up attributes that it had registered,
which is fine.  But we can't seem to find the attributes?  Were they
already removed somewhere else?

that's odd.

> [  925.738406 <    0.000052>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
> [  925.738453 <    0.000047>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
> [  925.738490 <    0.000037>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
> [  925.738529 <    0.000039>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
> [  925.738548 <    0.000019>]  drm_dev_put+0x5b/0x80 [drm]
> [  925.738558 <    0.000010>]  drm_release+0xc6/0xd0 [drm]
> [  925.738563 <    0.000005>]  __fput+0xc6/0x260
> [  925.738568 <    0.000005>]  task_work_run+0x79/0xb0
> [  925.738573 <    0.000005>]  do_exit+0x3d0/0xc60
> [  925.738578 <    0.000005>]  do_group_exit+0x47/0xb0
> [  925.738583 <    0.000005>]  get_signal+0x18b/0xc30
> [  925.738589 <    0.000006>]  do_signal+0x36/0x6a0
> [  925.738593 <    0.000004>]  ? force_sig_info_to_task+0xbc/0xd0
> [  925.738597 <    0.000004>]  ? signal_wake_up_state+0x15/0x30
> [  925.738603 <    0.000006>]  exit_to_usermode_loop+0x6f/0xc0
> [  925.738608 <    0.000005>]  prepare_exit_to_usermode+0xc7/0x110
> [  925.738613 <    0.000005>]  ret_from_intr+0x25/0x35
> [  925.738617 <    0.000004>] RIP: 0033:0x417369
> [  925.738621 <    0.000004>] Code: Bad RIP value.
> [  925.738625 <    0.000004>] RSP: 002b:00007ffdd6bf0900 EFLAGS: 00010246
> [  925.738629 <    0.000004>] RAX: 00007f3eca509000 RBX: 000000000000001e RCX: 00007f3ec95ba260
> [  925.738634 <    0.000005>] RDX: 00007f3ec9889790 RSI: 000000000000000a RDI: 0000000000000000
> [  925.738639 <    0.000005>] RBP: 00007ffdd6bf0990 R08: 00007f3ec9889780 R09: 00007f3eca4e8700
> [  925.738645 <    0.000006>] R10: 000000000000035c R11: 0000000000000246 R12: 00000000021c6170
> [  925.738650 <    0.000005>] R13: 00007ffdd6bf0c00 R14: 0000000000000000 R15: 0000000000000000
> 
> 
> 
> 
> [   40.880899 <    0.000004>] BUG: kernel NULL pointer dereference, address: 0000000000000090
> [   40.880906 <    0.000007>] #PF: supervisor read access in kernel mode
> [   40.880910 <    0.000004>] #PF: error_code(0x0000) - not-present page
> [   40.880915 <    0.000005>] PGD 0 P4D 0 
> [   40.880920 <    0.000005>] Oops: 0000 [#1] SMP PTI
> [   40.880924 <    0.000004>] CPU: 1 PID: 2526 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
> [   40.880932 <    0.000008>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> [   40.880941 <    0.000009>] RIP: 0010:kernfs_find_ns+0x18/0x110
> [   40.880945 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> [   40.880957 <    0.000012>] RSP: 0018:ffffaf3380467ba8 EFLAGS: 00010246
> [   40.880963 <    0.000006>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
> [   40.880968 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffc0678cfc RDI: 0000000000000000
> [   40.880973 <    0.000005>] RBP: ffffffffc0678cfc R08: ffffffffaa379d10 R09: 0000000000000000
> [   40.880979 <    0.000006>] R10: ffffaf3380467be0 R11: ffff93547615d128 R12: 0000000000000000
> [   40.880984 <    0.000005>] R13: 0000000000000000 R14: ffffffffc0678cfc R15: ffff93549be86130
> [   40.880990 <    0.000006>] FS:  00007fd9ecb10700(0000) GS:ffff9354bd840000(0000) knlGS:0000000000000000
> [   40.880996 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   40.881001 <    0.000005>] CR2: 0000000000000090 CR3: 0000000072866001 CR4: 00000000000606e0
> [   40.881006 <    0.000005>] Call Trace:
> [   40.881011 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
> [   40.881016 <    0.000005>]  sysfs_remove_group+0x25/0x80
> [   40.881055 <    0.000039>]  amdgpu_device_fini_late+0x3eb/0x440 [amdgpu]
> [   40.881095 <    0.000040>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]

Here this is your driver doing the same thing, removing attributes it
created.  But again they are not there.

So something went through and wiped the tree clean, which if I'm reading
this correctly, your patch would not solve as you would try to also
remove attributes that were already removed, right?

And 5.5-rc7 is a bit old (6 months and many thousands of changes ago),
does this still happen on a modern, released, kernel?

thanks,

greg k-h
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-22 14:32     ` Andrey Grodzovsky
@ 2020-06-22 17:45       ` Christian König
  2020-06-22 17:50         ` Daniel Vetter
  0 siblings, 1 reply; 54+ messages in thread
From: Christian König @ 2020-06-22 17:45 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

Am 22.06.20 um 16:32 schrieb Andrey Grodzovsky:
>
> On 6/22/20 9:18 AM, Christian König wrote:
>> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
>>> Will be used to reroute CPU mapped BO's page faults once
>>> device is removed.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/drm_file.c  |  8 ++++++++
>>>   drivers/gpu/drm/drm_prime.c | 10 ++++++++++
>>>   include/drm/drm_file.h      |  2 ++
>>>   include/drm/drm_gem.h       |  2 ++
>>>   4 files changed, 22 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
>>> index c4c704e..67c0770 100644
>>> --- a/drivers/gpu/drm/drm_file.c
>>> +++ b/drivers/gpu/drm/drm_file.c
>>> @@ -188,6 +188,12 @@ struct drm_file *drm_file_alloc(struct 
>>> drm_minor *minor)
>>>               goto out_prime_destroy;
>>>       }
>>>   +    file->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>> +    if (!file->dummy_page) {
>>> +        ret = -ENOMEM;
>>> +        goto out_prime_destroy;
>>> +    }
>>> +
>>>       return file;
>>>     out_prime_destroy:
>>> @@ -284,6 +290,8 @@ void drm_file_free(struct drm_file *file)
>>>       if (dev->driver->postclose)
>>>           dev->driver->postclose(dev, file);
>>>   +    __free_page(file->dummy_page);
>>> +
>>>       drm_prime_destroy_file_private(&file->prime);
>>>         WARN_ON(!list_empty(&file->event_list));
>>> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
>>> index 1de2cde..c482e9c 100644
>>> --- a/drivers/gpu/drm/drm_prime.c
>>> +++ b/drivers/gpu/drm/drm_prime.c
>>> @@ -335,6 +335,13 @@ int drm_gem_prime_fd_to_handle(struct 
>>> drm_device *dev,
>>>         ret = drm_prime_add_buf_handle(&file_priv->prime,
>>>               dma_buf, *handle);
>>> +
>>> +    if (!ret) {
>>> +        obj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>> +        if (!obj->dummy_page)
>>> +            ret = -ENOMEM;
>>> +    }
>>> +
>>
>> While the per file case still looks acceptable this is a clear NAK 
>> since it will massively increase the memory needed for a prime 
>> exported object.
>>
>> I think that this is quite overkill in the first place and for the 
>> hot unplug case we can just use the global dummy page as well.
>>
>> Christian.
>
>
> Global dummy page is good for read access, what do you do on write 
> access ? My first approach was indeed to map at first global dummy 
> page as read only and mark the vma->vm_flags as !VM_SHARED assuming 
> that this would trigger Copy On Write flow in core mm 
> (https://elixir.bootlin.com/linux/v5.7-rc7/source/mm/memory.c#L3977) 
> on the next page fault to same address triggered by a write access but 
> then i realized a new COW page will be allocated for each such mapping 
> and this is much more wasteful then having a dedicated page per GEM 
> object. 

Yeah, but this is only for a very, very small corner case. What we need 
to prevent is increasing the memory usage during normal operation too much.

Using memory during the unplug is completely unproblematic because we 
just released quite a bunch of it by releasing all those system memory 
buffers.

And I'm pretty sure that COWed pages are correctly accounted towards the 
used memory of a process.

So I think if that approach works as intended and the COW pages are 
released again on unmapping it would be the perfect solution to the problem.

Daniel what do you think?

Regards,
Christian.

> We can indeed optimize by allocating this dummy page on the first page 
> fault after device disconnect instead on GEM object creation.
>
> Andrey
>
>
>>
>>> mutex_unlock(&file_priv->prime.lock);
>>>       if (ret)
>>>           goto fail;
>>> @@ -1006,6 +1013,9 @@ void drm_prime_gem_destroy(struct 
>>> drm_gem_object *obj, struct sg_table *sg)
>>>           dma_buf_unmap_attachment(attach, sg, DMA_BIDIRECTIONAL);
>>>       dma_buf = attach->dmabuf;
>>>       dma_buf_detach(attach->dmabuf, attach);
>>> +
>>> +    __free_page(obj->dummy_page);
>>> +
>>>       /* remove the reference */
>>>       dma_buf_put(dma_buf);
>>>   }
>>> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
>>> index 19df802..349a658 100644
>>> --- a/include/drm/drm_file.h
>>> +++ b/include/drm/drm_file.h
>>> @@ -335,6 +335,8 @@ struct drm_file {
>>>        */
>>>       struct drm_prime_file_private prime;
>>>   +    struct page *dummy_page;
>>> +
>>>       /* private: */
>>>   #if IS_ENABLED(CONFIG_DRM_LEGACY)
>>>       unsigned long lock_count; /* DRI1 legacy lock count */
>>> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
>>> index 0b37506..47460d1 100644
>>> --- a/include/drm/drm_gem.h
>>> +++ b/include/drm/drm_gem.h
>>> @@ -310,6 +310,8 @@ struct drm_gem_object {
>>>        *
>>>        */
>>>       const struct drm_gem_object_funcs *funcs;
>>> +
>>> +    struct page *dummy_page;
>>>   };
>>>     /**
>>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/8] drm: Add dummy page per device or GEM object
  2020-06-22 17:45       ` Christian König
@ 2020-06-22 17:50         ` Daniel Vetter
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Vetter @ 2020-06-22 17:50 UTC (permalink / raw)
  To: Christian König
  Cc: Andrey Grodzovsky, Michel Dänzer, dri-devel, Pekka Paalanen,
	amd-gfx list, Alex Deucher

On Mon, Jun 22, 2020 at 7:45 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 22.06.20 um 16:32 schrieb Andrey Grodzovsky:
> >
> > On 6/22/20 9:18 AM, Christian König wrote:
> >> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> >>> Will be used to reroute CPU mapped BO's page faults once
> >>> device is removed.
> >>>
> >>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>> ---
> >>>   drivers/gpu/drm/drm_file.c  |  8 ++++++++
> >>>   drivers/gpu/drm/drm_prime.c | 10 ++++++++++
> >>>   include/drm/drm_file.h      |  2 ++
> >>>   include/drm/drm_gem.h       |  2 ++
> >>>   4 files changed, 22 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> >>> index c4c704e..67c0770 100644
> >>> --- a/drivers/gpu/drm/drm_file.c
> >>> +++ b/drivers/gpu/drm/drm_file.c
> >>> @@ -188,6 +188,12 @@ struct drm_file *drm_file_alloc(struct
> >>> drm_minor *minor)
> >>>               goto out_prime_destroy;
> >>>       }
> >>>   +    file->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>> +    if (!file->dummy_page) {
> >>> +        ret = -ENOMEM;
> >>> +        goto out_prime_destroy;
> >>> +    }
> >>> +
> >>>       return file;
> >>>     out_prime_destroy:
> >>> @@ -284,6 +290,8 @@ void drm_file_free(struct drm_file *file)
> >>>       if (dev->driver->postclose)
> >>>           dev->driver->postclose(dev, file);
> >>>   +    __free_page(file->dummy_page);
> >>> +
> >>>       drm_prime_destroy_file_private(&file->prime);
> >>>         WARN_ON(!list_empty(&file->event_list));
> >>> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> >>> index 1de2cde..c482e9c 100644
> >>> --- a/drivers/gpu/drm/drm_prime.c
> >>> +++ b/drivers/gpu/drm/drm_prime.c
> >>> @@ -335,6 +335,13 @@ int drm_gem_prime_fd_to_handle(struct
> >>> drm_device *dev,
> >>>         ret = drm_prime_add_buf_handle(&file_priv->prime,
> >>>               dma_buf, *handle);
> >>> +
> >>> +    if (!ret) {
> >>> +        obj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>> +        if (!obj->dummy_page)
> >>> +            ret = -ENOMEM;
> >>> +    }
> >>> +
> >>
> >> While the per file case still looks acceptable this is a clear NAK
> >> since it will massively increase the memory needed for a prime
> >> exported object.
> >>
> >> I think that this is quite overkill in the first place and for the
> >> hot unplug case we can just use the global dummy page as well.
> >>
> >> Christian.
> >
> >
> > Global dummy page is good for read access, what do you do on write
> > access ? My first approach was indeed to map at first global dummy
> > page as read only and mark the vma->vm_flags as !VM_SHARED assuming
> > that this would trigger Copy On Write flow in core mm
> > (https://elixir.bootlin.com/linux/v5.7-rc7/source/mm/memory.c#L3977)
> > on the next page fault to same address triggered by a write access but
> > then i realized a new COW page will be allocated for each such mapping
> > and this is much more wasteful then having a dedicated page per GEM
> > object.
>
> Yeah, but this is only for a very, very small corner case. What we need
> to prevent is increasing the memory usage during normal operation too much.
>
> Using memory during the unplug is completely unproblematic because we
> just released quite a bunch of it by releasing all those system memory
> buffers.
>
> And I'm pretty sure that COWed pages are correctly accounted towards the
> used memory of a process.
>
> So I think if that approach works as intended and the COW pages are
> released again on unmapping it would be the perfect solution to the problem.
>
> Daniel what do you think?

If COW works, sure sounds reasonable. And if we can make sure we
managed to drop all the system allocations (otherwise suddenly 2x
memory usage, worst case). But I have no idea whether we can
retroshoehorn that into an established vma, you might have fun stuff
like a mkwrite handler there (which I thought is the COW handler
thing, but really no idea).

If we need to massively change stuff then I think rw dummy page,
allocated on first fault after hotunplug (maybe just make it one per
object, that's simplest) seems like the much safer option. Much less
code that can go wrong.
-Daniel
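
For illustration, the "rw dummy page allocated per object on the first fault
after hotunplug" variant described above could look roughly like the sketch
below. The dummy_page field is the one added in patch 1/8; the helper name is
made up, this is a sketch of the idea rather than code from the series.

static vm_fault_t ttm_bo_vm_dummy_fault(struct vm_fault *vmf,
					struct drm_gem_object *gobj)
{
	/* Allocate the per-object dummy page only on the first fault after
	 * the device is gone; a real implementation would need locking
	 * against concurrent faults racing to allocate it.
	 */
	if (!gobj->dummy_page) {
		gobj->dummy_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
		if (!gobj->dummy_page)
			return VM_FAULT_OOM;
	}

	/* Hand the same zeroed page to every faulting mapping of this BO;
	 * the core fault code installs vmf->page into the PTE.
	 */
	get_page(gobj->dummy_page);
	vmf->page = gobj->dummy_page;
	return 0;
}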

> Regards,
> Christian.
>
> > We can indeed optimize by allocating this dummy page on the first page
> > fault after device disconnect instead on GEM object creation.
> >
> > Andrey
> >
> >
> >>
> >>> mutex_unlock(&file_priv->prime.lock);
> >>>       if (ret)
> >>>           goto fail;
> >>> @@ -1006,6 +1013,9 @@ void drm_prime_gem_destroy(struct
> >>> drm_gem_object *obj, struct sg_table *sg)
> >>>           dma_buf_unmap_attachment(attach, sg, DMA_BIDIRECTIONAL);
> >>>       dma_buf = attach->dmabuf;
> >>>       dma_buf_detach(attach->dmabuf, attach);
> >>> +
> >>> +    __free_page(obj->dummy_page);
> >>> +
> >>>       /* remove the reference */
> >>>       dma_buf_put(dma_buf);
> >>>   }
> >>> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> >>> index 19df802..349a658 100644
> >>> --- a/include/drm/drm_file.h
> >>> +++ b/include/drm/drm_file.h
> >>> @@ -335,6 +335,8 @@ struct drm_file {
> >>>        */
> >>>       struct drm_prime_file_private prime;
> >>>   +    struct page *dummy_page;
> >>> +
> >>>       /* private: */
> >>>   #if IS_ENABLED(CONFIG_DRM_LEGACY)
> >>>       unsigned long lock_count; /* DRI1 legacy lock count */
> >>> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> >>> index 0b37506..47460d1 100644
> >>> --- a/include/drm/drm_gem.h
> >>> +++ b/include/drm/drm_gem.h
> >>> @@ -310,6 +310,8 @@ struct drm_gem_object {
> >>>        *
> >>>        */
> >>>       const struct drm_gem_object_funcs *funcs;
> >>> +
> >>> +    struct page *dummy_page;
> >>>   };
> >>>     /**
> >>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page.
  2020-06-21  6:03 ` [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
  2020-06-22  9:41   ` Daniel Vetter
@ 2020-06-22 19:30   ` Christian König
  1 sibling, 0 replies; 54+ messages in thread
From: Christian König @ 2020-06-22 19:30 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> On device removal reroute all CPU mappings to dummy page per drm_file
> instance or imported GEM object.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 65 ++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 57 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 389128b..2f8bf5e 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -35,6 +35,8 @@
>   #include <drm/ttm/ttm_bo_driver.h>
>   #include <drm/ttm/ttm_placement.h>
>   #include <drm/drm_vma_manager.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_file.h>
>   #include <linux/mm.h>
>   #include <linux/pfn_t.h>
>   #include <linux/rbtree.h>
> @@ -328,19 +330,66 @@ vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
>   	pgprot_t prot;
>   	struct ttm_buffer_object *bo = vma->vm_private_data;
>   	vm_fault_t ret;
> +	int idx;
> +	struct drm_device *ddev = bo->base.dev;
>   
> -	ret = ttm_bo_vm_reserve(bo, vmf);
> -	if (ret)
> -		return ret;
> +	if (drm_dev_enter(ddev, &idx)) {

Better do this like if (!drm_dev_enter(...)) return ttm_bo_vm_dummy(..);

This way you can move all the dummy fault handling into a separate 
function without cluttering this one here too much.

Christian.
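
A rough sketch of the suggested structure, with ttm_bo_vm_dummy() as the
hypothetical helper that would take over the dummy-page handling from the
else branch of the patch:

vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
{
	struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
	struct drm_device *ddev = bo->base.dev;
	vm_fault_t ret;
	int idx;

	/* Device already unplugged: route the fault to the dummy page. */
	if (!drm_dev_enter(ddev, &idx))
		return ttm_bo_vm_dummy(vmf);

	/* Regular fault path, same logic as today. */
	ret = ttm_bo_vm_reserve(bo, vmf);
	if (ret)
		goto out;

	ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
				       TTM_BO_VM_NUM_PREFAULT);
	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
		goto out;

	dma_resv_unlock(bo->base.resv);
out:
	drm_dev_exit(idx);
	return ret;
}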

> +		ret = ttm_bo_vm_reserve(bo, vmf);
> +		if (ret)
> +			goto exit;
> +
> +		prot = vma->vm_page_prot;
>   
> -	prot = vma->vm_page_prot;
> -	ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
> -	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> +		ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
> +		if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> +			goto exit;
> +
> +		dma_resv_unlock(bo->base.resv);
> +
> +exit:
> +		drm_dev_exit(idx);
>   		return ret;
> +	} else {
>   
> -	dma_resv_unlock(bo->base.resv);
> +		struct drm_file *file = NULL;
> +		struct page *dummy_page = NULL;
> +		int handle;
>   
> -	return ret;
> +		/* We are faulting on imported BO from dma_buf */
> +		if (bo->base.dma_buf && bo->base.import_attach) {
> +			dummy_page = bo->base.dummy_page;
> +		/* We are faulting on non imported BO, find drm_file owning the BO*/
> +		} else {
> +			struct drm_gem_object *gobj;
> +
> +			mutex_lock(&ddev->filelist_mutex);
> +			list_for_each_entry(file, &ddev->filelist, lhead) {
> +				spin_lock(&file->table_lock);
> +				idr_for_each_entry(&file->object_idr, gobj, handle) {
> +					if (gobj == &bo->base) {
> +						dummy_page = file->dummy_page;
> +						break;
> +					}
> +				}
> +				spin_unlock(&file->table_lock);
> +			}
> +			mutex_unlock(&ddev->filelist_mutex);
> +		}
> +
> +		if (dummy_page) {
> +			/*
> +			 * Let do_fault complete the PTE install e.t.c using vmf->page
> +			 *
> +			 * TODO - should i call free_page somewhere ?
> +			 */
> +			get_page(dummy_page);
> +			vmf->page = dummy_page;
> +			return 0;
> +		} else {
> +			return VM_FAULT_SIGSEGV;
> +		}
> +	}
>   }
>   EXPORT_SYMBOL(ttm_bo_vm_fault);
>   

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space
  2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space Andrey Grodzovsky
  2020-06-22  9:45   ` Daniel Vetter
@ 2020-06-22 19:37   ` Christian König
  2020-06-22 19:47   ` Alex Deucher
  2 siblings, 0 replies; 54+ messages in thread
From: Christian König @ 2020-06-22 19:37 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> Helper function to be used to invalidate all BOs CPU mappings
> once device is removed.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/ttm/ttm_bo.c    | 8 ++++++--
>   include/drm/ttm/ttm_bo_driver.h | 7 +++++++
>   2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index c5b516f..926a365 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -1750,10 +1750,14 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
>   	ttm_bo_unmap_virtual_locked(bo);
>   	ttm_mem_io_unlock(man);
>   }
> -
> -
>   EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>   
> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
> +{
> +	unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);
> +}
> +EXPORT_SYMBOL(ttm_bo_unmap_virtual_address_space);
> +
>   int ttm_bo_wait(struct ttm_buffer_object *bo,
>   		bool interruptible, bool no_wait)
>   {
> diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
> index c9e0fd0..39ea44f 100644
> --- a/include/drm/ttm/ttm_bo_driver.h
> +++ b/include/drm/ttm/ttm_bo_driver.h
> @@ -601,6 +601,13 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev,
>   void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo);
>   
>   /**
> + * ttm_bo_unmap_virtual_address_space
> + *
> + * @bdev: tear down all the virtual mappings for this device
> + */
> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev);
> +
> +/**
>    * ttm_bo_unmap_virtual
>    *
>    * @bo: tear down the virtual mappings for this BO

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove.
  2020-06-21  6:03 ` [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove Andrey Grodzovsky
  2020-06-22  9:56   ` Daniel Vetter
@ 2020-06-22 19:38   ` Christian König
  2020-06-22 19:48     ` Alex Deucher
  1 sibling, 1 reply; 54+ messages in thread
From: Christian König @ 2020-06-22 19:38 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> Use the new TTM interface to invalidate all exsisting BO CPU mappings
> form all user proccesses.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

I think those two patches could already land in amd-staging-drm-next 
since they are a good idea independent of how else we fix the other issues.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 43592dc..6932d75 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1135,6 +1135,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>   	struct drm_device *dev = pci_get_drvdata(pdev);
>   
>   	drm_dev_unplug(dev);
> +	ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>   	amdgpu_driver_unload_kms(dev);
>   
>   	pci_disable_device(pdev);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug
  2020-06-21  6:03 ` [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug Andrey Grodzovsky
  2020-06-22  9:55   ` Daniel Vetter
@ 2020-06-22 19:40   ` Christian König
  2020-06-23  5:11     ` Andrey Grodzovsky
  1 sibling, 1 reply; 54+ messages in thread
From: Christian König @ 2020-06-22 19:40 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> entity->rq becomes null aftre device unplugged so just return early
> in that case.

Mhm, do you have a backtrace for this?

This should only be called by an IOCTL and IOCTLs should already call 
drm_dev_enter()/exit() on their own...

Christian.
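
For reference, the pattern referred to here, an IOCTL guarding its body with
drm_dev_enter()/drm_dev_exit(), looks roughly like this sketch
(amdgpu_foo_ioctl is just a placeholder name):

static int amdgpu_foo_ioctl(struct drm_device *dev, void *data,
			    struct drm_file *filp)
{
	int idx, r = 0;

	/* Fails once drm_dev_unplug() has been called; otherwise
	 * drm_dev_unplug() waits until the matching drm_dev_exit().
	 */
	if (!drm_dev_enter(dev, &idx))
		return -ENODEV;

	/* ... the actual ioctl work, safe against concurrent hot-unplug ... */

	drm_dev_exit(idx);
	return r;
}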

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 21 ++++++++++++++++-----
>   1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 8d9c6fe..d252427 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -24,6 +24,7 @@
>   #include "amdgpu_job.h"
>   #include "amdgpu_object.h"
>   #include "amdgpu_trace.h"
> +#include <drm/drm_drv.h>
>   
>   #define AMDGPU_VM_SDMA_MIN_NUM_DW	256u
>   #define AMDGPU_VM_SDMA_MAX_NUM_DW	(16u * 1024u)
> @@ -94,7 +95,12 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	struct drm_sched_entity *entity;
>   	struct amdgpu_ring *ring;
>   	struct dma_fence *f;
> -	int r;
> +	int r, idx;
> +
> +	if (!drm_dev_enter(p->adev->ddev, &idx)) {
> +		r = -ENODEV;
> +		goto nodev;
> +	}
>   
>   	entity = p->immediate ? &p->vm->immediate : &p->vm->delayed;
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
> @@ -104,7 +110,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	WARN_ON(ib->length_dw > p->num_dw_left);
>   	r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, &f);
>   	if (r)
> -		goto error;
> +		goto job_fail;
>   
>   	if (p->unlocked) {
>   		struct dma_fence *tmp = dma_fence_get(f);
> @@ -118,10 +124,15 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	if (fence && !p->immediate)
>   		swap(*fence, f);
>   	dma_fence_put(f);
> -	return 0;
>   
> -error:
> -	amdgpu_job_free(p->job);
> +	r = 0;
> +
> +job_fail:
> +	drm_dev_exit(idx);
> +nodev:
> +	if (r)
> +		amdgpu_job_free(p->job);
> +
>   	return r;
>   }
>   

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space
  2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space Andrey Grodzovsky
  2020-06-22  9:45   ` Daniel Vetter
  2020-06-22 19:37   ` Christian König
@ 2020-06-22 19:47   ` Alex Deucher
  2 siblings, 0 replies; 54+ messages in thread
From: Alex Deucher @ 2020-06-22 19:47 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Daniel Vetter, Michel Dänzer, Maling list - DRI developers,
	Pekka Paalanen, amd-gfx list, Christian König

On Sun, Jun 21, 2020 at 2:05 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Helper function to be used to invalidate all BOs CPU mappings
> once device is removed.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Typo in the subject:
unampping -> unmapping

Alex


> ---
>  drivers/gpu/drm/ttm/ttm_bo.c    | 8 ++++++--
>  include/drm/ttm/ttm_bo_driver.h | 7 +++++++
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index c5b516f..926a365 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -1750,10 +1750,14 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
>         ttm_bo_unmap_virtual_locked(bo);
>         ttm_mem_io_unlock(man);
>  }
> -
> -
>  EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>
> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
> +{
> +       unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);
> +}
> +EXPORT_SYMBOL(ttm_bo_unmap_virtual_address_space);
> +
>  int ttm_bo_wait(struct ttm_buffer_object *bo,
>                 bool interruptible, bool no_wait)
>  {
> diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
> index c9e0fd0..39ea44f 100644
> --- a/include/drm/ttm/ttm_bo_driver.h
> +++ b/include/drm/ttm/ttm_bo_driver.h
> @@ -601,6 +601,13 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev,
>  void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo);
>
>  /**
> + * ttm_bo_unmap_virtual_address_space
> + *
> + * @bdev: tear down all the virtual mappings for this device
> + */
> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev);
> +
> +/**
>   * ttm_bo_unmap_virtual
>   *
>   * @bo: tear down the virtual mappings for this BO
> --
> 2.7.4
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove.
  2020-06-22 19:38   ` Christian König
@ 2020-06-22 19:48     ` Alex Deucher
  2020-06-23 10:22       ` Daniel Vetter
  0 siblings, 1 reply; 54+ messages in thread
From: Alex Deucher @ 2020-06-22 19:48 UTC (permalink / raw)
  To: Christian Koenig
  Cc: Andrey Grodzovsky, Daniel Vetter, Michel Dänzer,
	Maling list - DRI developers, Pekka Paalanen, amd-gfx list

On Mon, Jun 22, 2020 at 3:38 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> > Use the new TTM interface to invalidate all exsisting BO CPU mappings
> > form all user proccesses.
> >
> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> I think those two patches could already land in amd-staging-drm-next
> since they are a good idea independent of how else we fix the other issues.

Please make sure they land in drm-misc as well.

Alex

>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> >   1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 43592dc..6932d75 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -1135,6 +1135,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
> >       struct drm_device *dev = pci_get_drvdata(pdev);
> >
> >       drm_dev_unplug(dev);
> > +     ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
> >       amdgpu_driver_unload_kms(dev);
> >
> >       pci_disable_device(pdev);
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-22 16:45         ` Greg KH
@ 2020-06-23  4:51           ` Andrey Grodzovsky
  2020-06-23  6:05             ` Greg KH
  0 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-23  4:51 UTC (permalink / raw)
  To: Greg KH
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher


On 6/22/20 12:45 PM, Greg KH wrote:
> On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
>> On 6/22/20 7:21 AM, Greg KH wrote:
>>> On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
>>>> On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
>>>>> Track sysfs files in a list so they all can be removed during pci remove
>>>>> since otherwise their removal after that causes crash because parent
>>>>> folder was already removed during pci remove.
>>> Huh?  That should not happen, do you have a backtrace of that crash?
>>
>> 2 examples in the attached trace.
> Odd, how did you trigger these?


By manually triggering PCI remove from sysfs

cd /sys/bus/pci/devices/0000\:05\:00.0 && echo 1 > remove


>
>
>> [  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
>> [  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
>> [  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
>> [  925.738240 <    0.000004>] PGD 0 P4D 0
>> [  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
>> [  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
>> [  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
>> [  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
>> [  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
>> [  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
>> [  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
>> [  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
>> [  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
>> [  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
>> [  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
>> [  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
>> [  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
>> [  925.738329 <    0.000006>] Call Trace:
>> [  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
>> [  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
>> [  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
>> [  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
>> [  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120
> So the PCI core is trying to clean up attributes that it had registered,
> which is fine.  But we can't seem to find the attributes?  Were they
> already removed somewhere else?
>
> that's odd.


Yes, as I pointed out above, I am emulating device removal from sysfs; this 
triggers the pci device remove sequence, and as part of that my specific device 
folder (05:00.0) is removed from the sysfs tree.


>
>> [  925.738406 <    0.000052>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
>> [  925.738453 <    0.000047>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
>> [  925.738490 <    0.000037>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
>> [  925.738529 <    0.000039>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
>> [  925.738548 <    0.000019>]  drm_dev_put+0x5b/0x80 [drm]
>> [  925.738558 <    0.000010>]  drm_release+0xc6/0xd0 [drm]
>> [  925.738563 <    0.000005>]  __fput+0xc6/0x260
>> [  925.738568 <    0.000005>]  task_work_run+0x79/0xb0
>> [  925.738573 <    0.000005>]  do_exit+0x3d0/0xc60
>> [  925.738578 <    0.000005>]  do_group_exit+0x47/0xb0
>> [  925.738583 <    0.000005>]  get_signal+0x18b/0xc30
>> [  925.738589 <    0.000006>]  do_signal+0x36/0x6a0
>> [  925.738593 <    0.000004>]  ? force_sig_info_to_task+0xbc/0xd0
>> [  925.738597 <    0.000004>]  ? signal_wake_up_state+0x15/0x30
>> [  925.738603 <    0.000006>]  exit_to_usermode_loop+0x6f/0xc0
>> [  925.738608 <    0.000005>]  prepare_exit_to_usermode+0xc7/0x110
>> [  925.738613 <    0.000005>]  ret_from_intr+0x25/0x35
>> [  925.738617 <    0.000004>] RIP: 0033:0x417369
>> [  925.738621 <    0.000004>] Code: Bad RIP value.
>> [  925.738625 <    0.000004>] RSP: 002b:00007ffdd6bf0900 EFLAGS: 00010246
>> [  925.738629 <    0.000004>] RAX: 00007f3eca509000 RBX: 000000000000001e RCX: 00007f3ec95ba260
>> [  925.738634 <    0.000005>] RDX: 00007f3ec9889790 RSI: 000000000000000a RDI: 0000000000000000
>> [  925.738639 <    0.000005>] RBP: 00007ffdd6bf0990 R08: 00007f3ec9889780 R09: 00007f3eca4e8700
>> [  925.738645 <    0.000006>] R10: 000000000000035c R11: 0000000000000246 R12: 00000000021c6170
>> [  925.738650 <    0.000005>] R13: 00007ffdd6bf0c00 R14: 0000000000000000 R15: 0000000000000000
>>
>>
>>
>>
>> [   40.880899 <    0.000004>] BUG: kernel NULL pointer dereference, address: 0000000000000090
>> [   40.880906 <    0.000007>] #PF: supervisor read access in kernel mode
>> [   40.880910 <    0.000004>] #PF: error_code(0x0000) - not-present page
>> [   40.880915 <    0.000005>] PGD 0 P4D 0
>> [   40.880920 <    0.000005>] Oops: 0000 [#1] SMP PTI
>> [   40.880924 <    0.000004>] CPU: 1 PID: 2526 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
>> [   40.880932 <    0.000008>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
>> [   40.880941 <    0.000009>] RIP: 0010:kernfs_find_ns+0x18/0x110
>> [   40.880945 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
>> [   40.880957 <    0.000012>] RSP: 0018:ffffaf3380467ba8 EFLAGS: 00010246
>> [   40.880963 <    0.000006>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
>> [   40.880968 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffc0678cfc RDI: 0000000000000000
>> [   40.880973 <    0.000005>] RBP: ffffffffc0678cfc R08: ffffffffaa379d10 R09: 0000000000000000
>> [   40.880979 <    0.000006>] R10: ffffaf3380467be0 R11: ffff93547615d128 R12: 0000000000000000
>> [   40.880984 <    0.000005>] R13: 0000000000000000 R14: ffffffffc0678cfc R15: ffff93549be86130
>> [   40.880990 <    0.000006>] FS:  00007fd9ecb10700(0000) GS:ffff9354bd840000(0000) knlGS:0000000000000000
>> [   40.880996 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   40.881001 <    0.000005>] CR2: 0000000000000090 CR3: 0000000072866001 CR4: 00000000000606e0
>> [   40.881006 <    0.000005>] Call Trace:
>> [   40.881011 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
>> [   40.881016 <    0.000005>]  sysfs_remove_group+0x25/0x80
>> [   40.881055 <    0.000039>]  amdgpu_device_fini_late+0x3eb/0x440 [amdgpu]
>> [   40.881095 <    0.000040>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
> Here this is your driver doing the same thing, removing attributes it
> created.  But again they are not there.
>
> So something went through and wiped the tree clean, which if I'm reading
> this correctly, your patch would not solve as you would try to also
> remove attributes that were already removed, right?


I don't think so, the stack here is from a later stage (after pci remove) where 
the last user process holding a reference to the device file decides to die, 
which triggers the drm_dev_release sequence once the drm dev refcount drops to 
zero. And this is why my patch helps: I am expediting all amdgpu sysfs attribute 
removal to the pci remove stage, when the device folder is still present in the 
sysfs hierarchy. At least this is my understanding of why it helped. I admit I 
am not an expert on sysfs internals.
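
In other words, the patch boils down to remembering every attribute group
amdgpu creates and dropping all of them from amdgpu_pci_remove(), while the
parent directory still exists. A rough sketch of that idea (the list field
and helper names are illustrative, not necessarily what the actual patch
uses):

struct amdgpu_sysfs_entry {
	struct list_head node;
	const struct attribute_group *group;
};

/* Used wherever the driver currently calls sysfs_create_group(). */
static int amdgpu_sysfs_add_group(struct amdgpu_device *adev,
				  const struct attribute_group *group)
{
	struct amdgpu_sysfs_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);
	int r;

	if (!e)
		return -ENOMEM;

	r = sysfs_create_group(&adev->dev->kobj, group);
	if (r) {
		kfree(e);
		return r;
	}

	e->group = group;
	list_add_tail(&e->node, &adev->sysfs_groups);
	return 0;
}

/* Called from amdgpu_pci_remove(), before the device folder goes away. */
static void amdgpu_sysfs_remove_all(struct amdgpu_device *adev)
{
	struct amdgpu_sysfs_entry *e, *tmp;

	list_for_each_entry_safe(e, tmp, &adev->sysfs_groups, node) {
		sysfs_remove_group(&adev->dev->kobj, e->group);
		list_del(&e->node);
		kfree(e);
	}
}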


>
> And 5.5-rc7 is a bit old (6 months and many thousands of changes ago),
> does this still happen on a modern, released, kernel?


I will give it a try with the latest and greatest but it might take some time as 
I have to make a temporary context switch to some urgent task.

Andrey


>
> thanks,
>
> greg k-h
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space
  2020-06-22  9:45   ` Daniel Vetter
@ 2020-06-23  5:00     ` Andrey Grodzovsky
  2020-06-23 10:25       ` Daniel Vetter
  0 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-23  5:00 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher


On 6/22/20 5:45 AM, Daniel Vetter wrote:
> On Sun, Jun 21, 2020 at 02:03:03AM -0400, Andrey Grodzovsky wrote:
>> Helper function to be used to invalidate all BOs CPU mappings
>> once device is removed.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> This seems to be missing the code to invalidate all the dma-buf mmaps?
>
> Probably needs more testcases if you're not yet catching this. Or am I
> missing something, and we're exchanging the address space also for
> dma-buf?
> -Daniel


IMHO the device address space includes all user clients having a CPU view of the 
BO, either from a direct mapping through the drm file or by mapping through the 
imported BO's FD.

Andrey


>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo.c    | 8 ++++++--
>>   include/drm/ttm/ttm_bo_driver.h | 7 +++++++
>>   2 files changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index c5b516f..926a365 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -1750,10 +1750,14 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
>>   	ttm_bo_unmap_virtual_locked(bo);
>>   	ttm_mem_io_unlock(man);
>>   }
>> -
>> -
>>   EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>>   
>> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
>> +{
>> +	unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);
>> +}
>> +EXPORT_SYMBOL(ttm_bo_unmap_virtual_address_space);
>> +
>>   int ttm_bo_wait(struct ttm_buffer_object *bo,
>>   		bool interruptible, bool no_wait)
>>   {
>> diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
>> index c9e0fd0..39ea44f 100644
>> --- a/include/drm/ttm/ttm_bo_driver.h
>> +++ b/include/drm/ttm/ttm_bo_driver.h
>> @@ -601,6 +601,13 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev,
>>   void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo);
>>   
>>   /**
>> + * ttm_bo_unmap_virtual_address_space
>> + *
>> + * @bdev: tear down all the virtual mappings for this device
>> + */
>> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev);
>> +
>> +/**
>>    * ttm_bo_unmap_virtual
>>    *
>>    * @bo: tear down the virtual mappings for this BO
>> -- 
>> 2.7.4
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug
  2020-06-22 19:40   ` Christian König
@ 2020-06-23  5:11     ` Andrey Grodzovsky
  2020-06-23  7:14       ` Christian König
  0 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-23  5:11 UTC (permalink / raw)
  To: christian.koenig, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen


On 6/22/20 3:40 PM, Christian König wrote:
> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
>> entity->rq becomes null aftre device unplugged so just return early
>> in that case.
>
> Mhm, do you have a backtrace for this?
>
> This should only be called by an IOCTL and IOCTLs should already call 
> drm_dev_enter()/exit() on their own...
>
> Christian.


See below, it's not during an IOCTL but during the release of all GEM objects 
when releasing the device. entity->rq becomes null because all the gpu 
schedulers are marked as not ready during the early pci remove stage, so the 
next time an sdma job tries to pick a scheduler to run nothing is available and 
it's set to null.

Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382648] BUG: kernel NULL pointer 
dereference, address: 0000000000000038
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382651] #PF: supervisor read 
access in kernel mode
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382652] #PF: error_code(0x0000) 
- not-present page
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382653] PGD 0 P4D 0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382656] Oops: 0000 [#1] SMP PTI
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382658] CPU: 6 PID: 2598 Comm: 
llvmpipe-6 Tainted: G           OE     5.6.0-dev+ #51
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382659] Hardware name: System 
manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382700] RIP: 
0010:amdgpu_vm_sdma_commit+0x6c/0x270 [amdgpu]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382702] Code: 01 00 00 48 89 ee 
48 c7 c7 ef d4 85 c0 e8 fc 5f e8 ff 48 8b 75 10 48 c7 c7 fd d4 85 c0 e8 ec 5f e8 
ff 48 8b 45 10 41 8b 55 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff ff 0f 84 9b 01 
00 00 48 8b 80
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382704] RSP: 
0018:ffffa88e40f57950 EFLAGS: 00010282
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382705] RAX: 0000000000000000 
RBX: ffffa88e40f579a8 RCX: 0000000000000001
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382707] RDX: 0000000000000014 
RSI: ffff94d4d62388e0 RDI: ffff94d4dbd98e30
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382708] RBP: ffff94d4d2ad3288 
R08: 0000000000000000 R09: 0000000000000001
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382709] R10: 000000000000001f 
R11: 0000000000000000 R12: ffffa88e40f57a48
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382710] R13: ffff94d4d627a5e8 
R14: ffff94d4d424d978 R15: 0000000800100020
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382712] FS: 
00007f30ae694700(0000) GS:ffff94d4dbd80000(0000) knlGS:0000000000000000
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382713] CS:  0010 DS: 0000 ES: 
0000 CR0: 0000000080050033
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382714] CR2: 0000000000000038 
CR3: 0000000121810006 CR4: 00000000000606e0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382716] Call Trace:
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382755] 
amdgpu_vm_bo_update_mapping.constprop.30+0x16b/0x230 [amdgpu]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382795] 
amdgpu_vm_clear_freed+0xd7/0x210 [amdgpu]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382833] 
amdgpu_gem_object_close+0x200/0x2b0 [amdgpu]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382856]  ? 
drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382864]  ? 
drm_gem_object_release_handle+0x2c/0x90 [drm]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382872] 
drm_gem_object_release_handle+0x2c/0x90 [drm]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382879]  ? 
drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382882] idr_for_each+0x48/0xd0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382885]  ? 
_raw_spin_unlock_irqrestore+0x2d/0x50
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382893] 
drm_gem_release+0x1c/0x30 [drm]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382901] 
drm_file_free+0x21d/0x270 [drm]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382908] drm_release+0x67/0xe0 [drm]
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382912] __fput+0xc6/0x260
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382916] task_work_run+0x79/0xb0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382919] do_exit+0x3d0/0xc40
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382921]  ? get_signal+0x13d/0xc30
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382924] do_group_exit+0x47/0xb0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382926] get_signal+0x18b/0xc30
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382929] do_signal+0x36/0x6a0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382931]  ? 
__set_task_comm+0x62/0x120
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382935]  ? 
__x64_sys_futex+0x88/0x180
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382938] 
exit_to_usermode_loop+0x6f/0xc0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382941] do_syscall_64+0x149/0x1c0
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382943] 
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382944] RIP: 0033:0x7f30f7f35360
Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382947] Code: Bad RIP value.


Andrey


>
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 21 ++++++++++++++++-----
>>   1 file changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> index 8d9c6fe..d252427 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> @@ -24,6 +24,7 @@
>>   #include "amdgpu_job.h"
>>   #include "amdgpu_object.h"
>>   #include "amdgpu_trace.h"
>> +#include <drm/drm_drv.h>
>>     #define AMDGPU_VM_SDMA_MIN_NUM_DW    256u
>>   #define AMDGPU_VM_SDMA_MAX_NUM_DW    (16u * 1024u)
>> @@ -94,7 +95,12 @@ static int amdgpu_vm_sdma_commit(struct 
>> amdgpu_vm_update_params *p,
>>       struct drm_sched_entity *entity;
>>       struct amdgpu_ring *ring;
>>       struct dma_fence *f;
>> -    int r;
>> +    int r, idx;
>> +
>> +    if (!drm_dev_enter(p->adev->ddev, &idx)) {
>> +        r = -ENODEV;
>> +        goto nodev;
>> +    }
>>         entity = p->immediate ? &p->vm->immediate : &p->vm->delayed;
>>       ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>> @@ -104,7 +110,7 @@ static int amdgpu_vm_sdma_commit(struct 
>> amdgpu_vm_update_params *p,
>>       WARN_ON(ib->length_dw > p->num_dw_left);
>>       r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, &f);
>>       if (r)
>> -        goto error;
>> +        goto job_fail;
>>         if (p->unlocked) {
>>           struct dma_fence *tmp = dma_fence_get(f);
>> @@ -118,10 +124,15 @@ static int amdgpu_vm_sdma_commit(struct 
>> amdgpu_vm_update_params *p,
>>       if (fence && !p->immediate)
>>           swap(*fence, f);
>>       dma_fence_put(f);
>> -    return 0;
>>   -error:
>> -    amdgpu_job_free(p->job);
>> +    r = 0;
>> +
>> +job_fail:
>> +    drm_dev_exit(idx);
>> +nodev:
>> +    if (r)
>> +        amdgpu_job_free(p->job);
>> +
>>       return r;
>>   }
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/8] RFC Support hot device unplug in amdgpu
  2020-06-22  9:46 ` [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Daniel Vetter
@ 2020-06-23  5:14   ` Andrey Grodzovsky
  2020-06-23  9:04     ` Michel Dänzer
  0 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-23  5:14 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher

I am fighting with Thunderbird to make it limit lines to 80 chars but nothing 
helps. Any suggestions please?

Andrey

On 6/22/20 5:46 AM, Daniel Vetter wrote:
> Also a nit: Please tell your mailer to break long lines, it looks funny
> and inconsistent otherwise, at least in some of the mailers I use here :-/
> -Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-23  4:51           ` Andrey Grodzovsky
@ 2020-06-23  6:05             ` Greg KH
  2020-06-24  3:04               ` Andrey Grodzovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Greg KH @ 2020-06-23  6:05 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher

On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:
> 
> On 6/22/20 12:45 PM, Greg KH wrote:
> > On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
> > > On 6/22/20 7:21 AM, Greg KH wrote:
> > > > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> > > > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > > > > > Track sysfs files in a list so they all can be removed during pci remove
> > > > > > since otherwise their removal after that causes crash because parent
> > > > > > folder was already removed during pci remove.
> > > > Huh?  That should not happen, do you have a backtrace of that crash?
> > > 
> > > 2 examples in the attached trace.
> > Odd, how did you trigger these?
> 
> 
> By manually triggering PCI remove from sysfs
> 
> cd /sys/bus/pci/devices/0000\:05\:00.0 && echo 1 > remove

For some reason, I didn't think that video/drm devices could handle
hot-remove like this.  The "old" PCI hotplug specification explicitly
said that video devices were not supported; has that changed?

And this whole issue is probably tied to the larger issue that Daniel
was asking me about, when it came to device lifetimes and the drm layer,
so odds are we need to fix that up first before worrying about trying to
support this crazy request, right?  :)

> > > [  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
> > > [  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
> > > [  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
> > > [  925.738240 <    0.000004>] PGD 0 P4D 0
> > > [  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
> > > [  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
> > > [  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> > > [  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
> > > [  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> > > [  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
> > > [  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
> > > [  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
> > > [  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
> > > [  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
> > > [  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
> > > [  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
> > > [  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
> > > [  925.738329 <    0.000006>] Call Trace:
> > > [  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
> > > [  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
> > > [  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
> > > [  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
> > > [  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120
> > So the PCI core is trying to clean up attributes that it had registered,
> > which is fine.  But we can't seem to find the attributes?  Were they
> > already removed somewhere else?
> > 
> > that's odd.
> 
> 
> Yes, as I pointed out above, I am emulating device removal from sysfs; this
> triggers the pci device remove sequence, and as part of that my specific device
> folder (05:00.0) is removed from the sysfs tree.

But why are things being removed twice?

> > > [  925.738406 <    0.000052>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
> > > [  925.738453 <    0.000047>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
> > > [  925.738490 <    0.000037>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
> > > [  925.738529 <    0.000039>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
> > > [  925.738548 <    0.000019>]  drm_dev_put+0x5b/0x80 [drm]
> > > [  925.738558 <    0.000010>]  drm_release+0xc6/0xd0 [drm]
> > > [  925.738563 <    0.000005>]  __fput+0xc6/0x260
> > > [  925.738568 <    0.000005>]  task_work_run+0x79/0xb0
> > > [  925.738573 <    0.000005>]  do_exit+0x3d0/0xc60
> > > [  925.738578 <    0.000005>]  do_group_exit+0x47/0xb0
> > > [  925.738583 <    0.000005>]  get_signal+0x18b/0xc30
> > > [  925.738589 <    0.000006>]  do_signal+0x36/0x6a0
> > > [  925.738593 <    0.000004>]  ? force_sig_info_to_task+0xbc/0xd0
> > > [  925.738597 <    0.000004>]  ? signal_wake_up_state+0x15/0x30
> > > [  925.738603 <    0.000006>]  exit_to_usermode_loop+0x6f/0xc0
> > > [  925.738608 <    0.000005>]  prepare_exit_to_usermode+0xc7/0x110
> > > [  925.738613 <    0.000005>]  ret_from_intr+0x25/0x35
> > > [  925.738617 <    0.000004>] RIP: 0033:0x417369
> > > [  925.738621 <    0.000004>] Code: Bad RIP value.
> > > [  925.738625 <    0.000004>] RSP: 002b:00007ffdd6bf0900 EFLAGS: 00010246
> > > [  925.738629 <    0.000004>] RAX: 00007f3eca509000 RBX: 000000000000001e RCX: 00007f3ec95ba260
> > > [  925.738634 <    0.000005>] RDX: 00007f3ec9889790 RSI: 000000000000000a RDI: 0000000000000000
> > > [  925.738639 <    0.000005>] RBP: 00007ffdd6bf0990 R08: 00007f3ec9889780 R09: 00007f3eca4e8700
> > > [  925.738645 <    0.000006>] R10: 000000000000035c R11: 0000000000000246 R12: 00000000021c6170
> > > [  925.738650 <    0.000005>] R13: 00007ffdd6bf0c00 R14: 0000000000000000 R15: 0000000000000000
> > > 
> > > 
> > > 
> > > 
> > > [   40.880899 <    0.000004>] BUG: kernel NULL pointer dereference, address: 0000000000000090
> > > [   40.880906 <    0.000007>] #PF: supervisor read access in kernel mode
> > > [   40.880910 <    0.000004>] #PF: error_code(0x0000) - not-present page
> > > [   40.880915 <    0.000005>] PGD 0 P4D 0
> > > [   40.880920 <    0.000005>] Oops: 0000 [#1] SMP PTI
> > > [   40.880924 <    0.000004>] CPU: 1 PID: 2526 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
> > > [   40.880932 <    0.000008>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> > > [   40.880941 <    0.000009>] RIP: 0010:kernfs_find_ns+0x18/0x110
> > > [   40.880945 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> > > [   40.880957 <    0.000012>] RSP: 0018:ffffaf3380467ba8 EFLAGS: 00010246
> > > [   40.880963 <    0.000006>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
> > > [   40.880968 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffc0678cfc RDI: 0000000000000000
> > > [   40.880973 <    0.000005>] RBP: ffffffffc0678cfc R08: ffffffffaa379d10 R09: 0000000000000000
> > > [   40.880979 <    0.000006>] R10: ffffaf3380467be0 R11: ffff93547615d128 R12: 0000000000000000
> > > [   40.880984 <    0.000005>] R13: 0000000000000000 R14: ffffffffc0678cfc R15: ffff93549be86130
> > > [   40.880990 <    0.000006>] FS:  00007fd9ecb10700(0000) GS:ffff9354bd840000(0000) knlGS:0000000000000000
> > > [   40.880996 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   40.881001 <    0.000005>] CR2: 0000000000000090 CR3: 0000000072866001 CR4: 00000000000606e0
> > > [   40.881006 <    0.000005>] Call Trace:
> > > [   40.881011 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
> > > [   40.881016 <    0.000005>]  sysfs_remove_group+0x25/0x80
> > > [   40.881055 <    0.000039>]  amdgpu_device_fini_late+0x3eb/0x440 [amdgpu]
> > > [   40.881095 <    0.000040>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
> > Here is this is your driver doing the same thing, removing attributes it
> > created.  But again they are not there.
> > 
> > So something went through and wiped the tree clean, which if I'm reading
> > this correctly, your patch would not solve as you would try to also
> > remove attributes that were already removed, right?
> 
> 
> I don't think so, the stack here is from a later stage (after pci remove)
> where the last user process holding a reference to the device file decides
> to die and thus triggers the drm_dev_release sequence after the drm dev
> refcount dropped to zero. And this is why my patch helps: I am expediting all
> amdgpu sysfs attribute removal to the pci remove stage, when the device folder
> is still present in the sysfs hierarchy. At least this is my understanding of
> why it helped. I admit I am not an expert on sysfs internals.

Ok, yeah, I think this is back to the drm lifecycle issues mentioned
above.

{sigh}, I'll get to that once I deal with the -rc1/-rc2 merge fallout,
that will take me a week or so, sorry...

thanks,

greg k-h
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug
  2020-06-23  5:11     ` Andrey Grodzovsky
@ 2020-06-23  7:14       ` Christian König
  0 siblings, 0 replies; 54+ messages in thread
From: Christian König @ 2020-06-23  7:14 UTC (permalink / raw)
  To: Andrey Grodzovsky, christian.koenig, amd-gfx, dri-devel
  Cc: alexdeucher, daniel.vetter, michel, ppaalanen

Am 23.06.20 um 07:11 schrieb Andrey Grodzovsky:
>
> On 6/22/20 3:40 PM, Christian König wrote:
>> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
>>> entity->rq becomes null after the device is unplugged so just return early
>>> in that case.
>>
>> Mhm, do you have a backtrace for this?
>>
>> This should only be called by an IOCTL and IOCTLs should already call 
>> drm_dev_enter()/exit() on their own...
>>
>> Christian.
>
>
> See below, it's not during an IOCTL but during the release of all GEM objects 
> when releasing the device. entity->rq becomes null because all the gpu 
> schedulers are marked as not ready during the early pci remove stage 
> and so the next time an sdma job tries to pick a scheduler to run nothing 
> is available and it's set to null.

I see. This should then probably go into amdgpu_gem_object_close() 
before we reserve the PD.

See, drm_dev_enter()/exit() are kind of a read side lock and with this we 
create a nice lock inversion when we do it in the low-level SDMA VM backend.
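
Roughly what I have in mind (completely untested sketch, just to show the 
placement; don't take the details literally):

	/* at the top of amdgpu_gem_object_close(), before the PD is reserved */
	if (!drm_dev_enter(adev->ddev, &idx))
		return;	/* device is gone, skip the whole HW side */

	/* ... existing reservation and amdgpu_vm_clear_freed() path ... */

	drm_dev_exit(idx);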

Christian.

>
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382648] BUG: kernel 
> NULL pointer dereference, address: 0000000000000038
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382651] #PF: 
> supervisor read access in kernel mode
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382652] #PF: 
> error_code(0x0000) - not-present page
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382653] PGD 0 P4D 0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382656] Oops: 0000 
> [#1] SMP PTI
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382658] CPU: 6 PID: 
> 2598 Comm: llvmpipe-6 Tainted: G           OE     5.6.0-dev+ #51
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382659] Hardware name: 
> System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 
> 12/30/2013
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382700] RIP: 
> 0010:amdgpu_vm_sdma_commit+0x6c/0x270 [amdgpu]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382702] Code: 01 00 00 
> 48 89 ee 48 c7 c7 ef d4 85 c0 e8 fc 5f e8 ff 48 8b 75 10 48 c7 c7 fd 
> d4 85 c0 e8 ec 5f e8 ff 48 8b 45 10 41 8b 55 08 <48> 8b 40 38 85 d2 48 
> 8d b8 30 ff ff ff 0f 84 9b 01 00 00 48 8b 80
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382704] RSP: 
> 0018:ffffa88e40f57950 EFLAGS: 00010282
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382705] RAX: 
> 0000000000000000 RBX: ffffa88e40f579a8 RCX: 0000000000000001
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382707] RDX: 
> 0000000000000014 RSI: ffff94d4d62388e0 RDI: ffff94d4dbd98e30
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382708] RBP: 
> ffff94d4d2ad3288 R08: 0000000000000000 R09: 0000000000000001
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382709] R10: 
> 000000000000001f R11: 0000000000000000 R12: ffffa88e40f57a48
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382710] R13: 
> ffff94d4d627a5e8 R14: ffff94d4d424d978 R15: 0000000800100020
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382712] FS: 
> 00007f30ae694700(0000) GS:ffff94d4dbd80000(0000) knlGS:0000000000000000
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382713] CS:  0010 DS: 
> 0000 ES: 0000 CR0: 0000000080050033
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382714] CR2: 
> 0000000000000038 CR3: 0000000121810006 CR4: 00000000000606e0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382716] Call Trace:
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382755] 
> amdgpu_vm_bo_update_mapping.constprop.30+0x16b/0x230 [amdgpu]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382795] 
> amdgpu_vm_clear_freed+0xd7/0x210 [amdgpu]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382833] 
> amdgpu_gem_object_close+0x200/0x2b0 [amdgpu]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382856]  ? 
> drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382864]  ? 
> drm_gem_object_release_handle+0x2c/0x90 [drm]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382872] 
> drm_gem_object_release_handle+0x2c/0x90 [drm]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382879]  ? 
> drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382882] 
> idr_for_each+0x48/0xd0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382885]  ? 
> _raw_spin_unlock_irqrestore+0x2d/0x50
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382893] 
> drm_gem_release+0x1c/0x30 [drm]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382901] 
> drm_file_free+0x21d/0x270 [drm]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382908] 
> drm_release+0x67/0xe0 [drm]
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382912] __fput+0xc6/0x260
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382916] 
> task_work_run+0x79/0xb0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382919] 
> do_exit+0x3d0/0xc40
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382921]  ? 
> get_signal+0x13d/0xc30
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382924] 
> do_group_exit+0x47/0xb0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382926] 
> get_signal+0x18b/0xc30
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382929] 
> do_signal+0x36/0x6a0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382931]  ? 
> __set_task_comm+0x62/0x120
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382935]  ? 
> __x64_sys_futex+0x88/0x180
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382938] 
> exit_to_usermode_loop+0x6f/0xc0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382941] 
> do_syscall_64+0x149/0x1c0
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382943] 
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382944] RIP: 
> 0033:0x7f30f7f35360
> Jun  8 11:14:56 ubuntu-1604-test kernel: [   44.382947] Code: Bad RIP 
> value.
>
>
> Andrey
>
>
>>
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 21 
>>> ++++++++++++++++-----
>>>   1 file changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> index 8d9c6fe..d252427 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> @@ -24,6 +24,7 @@
>>>   #include "amdgpu_job.h"
>>>   #include "amdgpu_object.h"
>>>   #include "amdgpu_trace.h"
>>> +#include <drm/drm_drv.h>
>>>     #define AMDGPU_VM_SDMA_MIN_NUM_DW    256u
>>>   #define AMDGPU_VM_SDMA_MAX_NUM_DW    (16u * 1024u)
>>> @@ -94,7 +95,12 @@ static int amdgpu_vm_sdma_commit(struct 
>>> amdgpu_vm_update_params *p,
>>>       struct drm_sched_entity *entity;
>>>       struct amdgpu_ring *ring;
>>>       struct dma_fence *f;
>>> -    int r;
>>> +    int r, idx;
>>> +
>>> +    if (!drm_dev_enter(p->adev->ddev, &idx)) {
>>> +        r = -ENODEV;
>>> +        goto nodev;
>>> +    }
>>>         entity = p->immediate ? &p->vm->immediate : &p->vm->delayed;
>>>       ring = container_of(entity->rq->sched, struct amdgpu_ring, 
>>> sched);
>>> @@ -104,7 +110,7 @@ static int amdgpu_vm_sdma_commit(struct 
>>> amdgpu_vm_update_params *p,
>>>       WARN_ON(ib->length_dw > p->num_dw_left);
>>>       r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, &f);
>>>       if (r)
>>> -        goto error;
>>> +        goto job_fail;
>>>         if (p->unlocked) {
>>>           struct dma_fence *tmp = dma_fence_get(f);
>>> @@ -118,10 +124,15 @@ static int amdgpu_vm_sdma_commit(struct 
>>> amdgpu_vm_update_params *p,
>>>       if (fence && !p->immediate)
>>>           swap(*fence, f);
>>>       dma_fence_put(f);
>>> -    return 0;
>>>   -error:
>>> -    amdgpu_job_free(p->job);
>>> +    r = 0;
>>> +
>>> +job_fail:
>>> +    drm_dev_exit(idx);
>>> +nodev:
>>> +    if (r)
>>> +        amdgpu_job_free(p->job);
>>> +
>>>       return r;
>>>   }
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/8] RFC Support hot device unplug in amdgpu
  2020-06-23  5:14   ` Andrey Grodzovsky
@ 2020-06-23  9:04     ` Michel Dänzer
  2020-06-24  3:21       ` Andrey Grodzovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Michel Dänzer @ 2020-06-23  9:04 UTC (permalink / raw)
  To: Andrey Grodzovsky, Daniel Vetter
  Cc: daniel.vetter, amd-gfx, dri-devel, ckoenig.leichtzumerken

On 2020-06-23 7:14 a.m., Andrey Grodzovsky wrote:
> I am fighting with Thunderbird to limit lines to 80 chars but
> nothing helps. Any suggestions please.

Maybe try disabling mail.compose.default_to_paragraph, or check other
*wrap* settings.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove.
  2020-06-22 19:48     ` Alex Deucher
@ 2020-06-23 10:22       ` Daniel Vetter
  2020-06-23 13:16         ` Christian König
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-23 10:22 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Andrey Grodzovsky, Daniel Vetter, Michel Dänzer,
	Maling list - DRI developers, Pekka Paalanen, amd-gfx list,
	Christian Koenig

On Mon, Jun 22, 2020 at 03:48:29PM -0400, Alex Deucher wrote:
> On Mon, Jun 22, 2020 at 3:38 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> > > Use the new TTM interface to invalidate all existing BO CPU mappings
> > > from all user processes.
> > >
> > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >
> > Reviewed-by: Christian König <christian.koenig@amd.com>
> >
> > I think those two patches could already land in amd-staging-drm-next
> > since they are a good idea independent of how else we fix the other issues.
> 
> Please make sure they land in drm-misc as well.

Not sure that's much use, since without any of the fault side changes you
just blow up on the first refault. Seems somewhat silly to charge ahead on
this with the other bits still very much under discussion.

Plus I suggested a possible bikeshed here :-)
-Daniel

> 
> Alex
> 
> >
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> > >   1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index 43592dc..6932d75 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -1135,6 +1135,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
> > >       struct drm_device *dev = pci_get_drvdata(pdev);
> > >
> > >       drm_dev_unplug(dev);
> > > +     ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
> > >       amdgpu_driver_unload_kms(dev);
> > >
> > >       pci_disable_device(pdev);
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] drm/ttm: Add unmapping of the entire device address space
  2020-06-23  5:00     ` Andrey Grodzovsky
@ 2020-06-23 10:25       ` Daniel Vetter
  2020-06-23 12:55         ` Christian König
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Vetter @ 2020-06-23 10:25 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher

On Tue, Jun 23, 2020 at 01:00:02AM -0400, Andrey Grodzovsky wrote:
> 
> On 6/22/20 5:45 AM, Daniel Vetter wrote:
> > On Sun, Jun 21, 2020 at 02:03:03AM -0400, Andrey Grodzovsky wrote:
> > > Helper function to be used to invalidate all BOs CPU mappings
> > > once device is removed.
> > > 
> > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > This seems to be missing the code to invalidate all the dma-buf mmaps?
> > 
> > Probably needs more testcases if you're not yet catching this. Or am I
> > missing something, and we're exchanging the address space also for
> > dma-buf?
> > -Daniel
> 
> 
> IMHO the device address space includes all user clients having a CPU view of
> the BO, either from direct mapping through the drm file or by mapping through
> an imported BO's FD.

Uh this is all very confusing and very much midlayer-y thanks to ttm.

I think a much better solution would be to have a core gem helper for
this (well not even gem really, this is core drm), which directly uses
drm_device->anon_inode->i_mapping.

Then
a) it clearly matches what drm_prime.c does on export
b) can be reused across all drivers, not just ttm

So much better.

What's more, we could then very easily make the generic
drm_dev_unplug_and_unmap helper I've talked about for the amdgpu patch,
which I think would be really neat&pretty.
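
Strawman, completely untested, names totally up for bikeshedding:

	void drm_dev_unmap_all_mappings(struct drm_device *dev)
	{
		/* zap every userspace CPU mapping backed by this device's
		 * anon_inode address_space in one go */
		unmap_mapping_range(dev->anon_inode->i_mapping, 0, 0, 1);
	}

	void drm_dev_unplug_and_unmap(struct drm_device *dev)
	{
		drm_dev_unplug(dev);
		drm_dev_unmap_all_mappings(dev);
	}

amdgpu_pci_remove() would then just call drm_dev_unplug_and_unmap() instead
of open-coding the two steps.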

Thoughts?
-Daniel

> 
> Andrey
> 
> 
> > 
> > > ---
> > >   drivers/gpu/drm/ttm/ttm_bo.c    | 8 ++++++--
> > >   include/drm/ttm/ttm_bo_driver.h | 7 +++++++
> > >   2 files changed, 13 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > index c5b516f..926a365 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > @@ -1750,10 +1750,14 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
> > >   	ttm_bo_unmap_virtual_locked(bo);
> > >   	ttm_mem_io_unlock(man);
> > >   }
> > > -
> > > -
> > >   EXPORT_SYMBOL(ttm_bo_unmap_virtual);
> > > +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
> > > +{
> > > +	unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);
> > > +}
> > > +EXPORT_SYMBOL(ttm_bo_unmap_virtual_address_space);
> > > +
> > >   int ttm_bo_wait(struct ttm_buffer_object *bo,
> > >   		bool interruptible, bool no_wait)
> > >   {
> > > diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
> > > index c9e0fd0..39ea44f 100644
> > > --- a/include/drm/ttm/ttm_bo_driver.h
> > > +++ b/include/drm/ttm/ttm_bo_driver.h
> > > @@ -601,6 +601,13 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev,
> > >   void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo);
> > >   /**
> > > + * ttm_bo_unmap_virtual_address_space
> > > + *
> > > + * @bdev: tear down all the virtual mappings for this device
> > > + */
> > > +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev);
> > > +
> > > +/**
> > >    * ttm_bo_unmap_virtual
> > >    *
> > >    * @bo: tear down the virtual mappings for this BO
> > > -- 
> > > 2.7.4
> > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] drm/ttm: Add unmapping of the entire device address space
  2020-06-23 10:25       ` Daniel Vetter
@ 2020-06-23 12:55         ` Christian König
  0 siblings, 0 replies; 54+ messages in thread
From: Christian König @ 2020-06-23 12:55 UTC (permalink / raw)
  To: Daniel Vetter, Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx, alexdeucher

Am 23.06.20 um 12:25 schrieb Daniel Vetter:
> On Tue, Jun 23, 2020 at 01:00:02AM -0400, Andrey Grodzovsky wrote:
>> On 6/22/20 5:45 AM, Daniel Vetter wrote:
>>> On Sun, Jun 21, 2020 at 02:03:03AM -0400, Andrey Grodzovsky wrote:
>>>> Helper function to be used to invalidate all BOs CPU mappings
>>>> once device is removed.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> This seems to be missing the code to invalidate all the dma-buf mmaps?
>>>
>>> Probably needs more testcases if you're not yet catching this. Or am I
>>> missing something, and we're exchanging the address space also for
>>> dma-buf?
>>> -Daniel
>>
>> IMHO the device address space includes all user clients having a CPU view of
>> the BO, either from direct mapping through the drm file or by mapping through
>> an imported BO's FD.
> Uh this is all very confusing and very much midlayer-y thanks to ttm.
>
> I think a much better solution would be to have a core gem helper for
> this (well not even gem really, this is core drm), which directly uses
> drm_device->anon_inode->i_mapping.
>
> Then
> a) it clearly matches what drm_prime.c does on export
> b) can be reused across all drivers, not just ttm
>
> So much better.
>
> What's more, we could then very easily make the generic
> drm_dev_unplug_and_unmap helper I've talked about for the amdgpu patch,
> which I think would be really neat&pretty.

Good point, that is indeed a rather nice idea.

Christian.

>
> Thoughts?
> -Daniel
>
>> Andrey
>>
>>
>>>> ---
>>>>    drivers/gpu/drm/ttm/ttm_bo.c    | 8 ++++++--
>>>>    include/drm/ttm/ttm_bo_driver.h | 7 +++++++
>>>>    2 files changed, 13 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> index c5b516f..926a365 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> @@ -1750,10 +1750,14 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
>>>>    	ttm_bo_unmap_virtual_locked(bo);
>>>>    	ttm_mem_io_unlock(man);
>>>>    }
>>>> -
>>>> -
>>>>    EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>>>> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
>>>> +{
>>>> +	unmap_mapping_range(bdev->dev_mapping, 0, 0, 1);
>>>> +}
>>>> +EXPORT_SYMBOL(ttm_bo_unmap_virtual_address_space);
>>>> +
>>>>    int ttm_bo_wait(struct ttm_buffer_object *bo,
>>>>    		bool interruptible, bool no_wait)
>>>>    {
>>>> diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
>>>> index c9e0fd0..39ea44f 100644
>>>> --- a/include/drm/ttm/ttm_bo_driver.h
>>>> +++ b/include/drm/ttm/ttm_bo_driver.h
>>>> @@ -601,6 +601,13 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev,
>>>>    void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo);
>>>>    /**
>>>> + * ttm_bo_unmap_virtual_address_space
>>>> + *
>>>> + * @bdev: tear down all the virtual mappings for this device
>>>> + */
>>>> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev);
>>>> +
>>>> +/**
>>>>     * ttm_bo_unmap_virtual
>>>>     *
>>>>     * @bo: tear down the virtual mappings for this BO
>>>> -- 
>>>> 2.7.4
>>>>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove.
  2020-06-23 10:22       ` Daniel Vetter
@ 2020-06-23 13:16         ` Christian König
  2020-06-24  3:12           ` Andrey Grodzovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Christian König @ 2020-06-23 13:16 UTC (permalink / raw)
  To: Daniel Vetter, Alex Deucher
  Cc: Andrey Grodzovsky, Daniel Vetter, Michel Dänzer,
	Maling list - DRI developers, Pekka Paalanen, amd-gfx list

Am 23.06.20 um 12:22 schrieb Daniel Vetter:
> On Mon, Jun 22, 2020 at 03:48:29PM -0400, Alex Deucher wrote:
>> On Mon, Jun 22, 2020 at 3:38 PM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
>>>> Use the new TTM interface to invalidate all existing BO CPU mappings
>>>> from all user processes.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>
>>> I think those two patches could already land in amd-staging-drm-next
>>> since they are a good idea independent of how else we fix the other issues.
>> Please make sure they land in drm-misc as well.
> Not sure that's much use, since without any of the fault side changes you
> just blow up on the first refault. Seems somewhat silly to charge ahead on
> this with the other bits still very much under discussion.

Well what I wanted to say is that we don't need to send out those simple 
patches once more.

> Plus I suggested a possible bikeshed here :-)

No bikeshed, but indeed a rather good idea to not make this a TTM function.

Christian.

> -Daniel
>
>> Alex
>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 43592dc..6932d75 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -1135,6 +1135,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>>>>        struct drm_device *dev = pci_get_drvdata(pdev);
>>>>
>>>>        drm_dev_unplug(dev);
>>>> +     ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>>>>        amdgpu_driver_unload_kms(dev);
>>>>
>>>>        pci_disable_device(pdev);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-23  6:05             ` Greg KH
@ 2020-06-24  3:04               ` Andrey Grodzovsky
  2020-06-24  6:11                 ` Greg KH
  0 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-24  3:04 UTC (permalink / raw)
  To: Greg KH
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher


On 6/23/20 2:05 AM, Greg KH wrote:
> On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:
>> On 6/22/20 12:45 PM, Greg KH wrote:
>>> On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
>>>> On 6/22/20 7:21 AM, Greg KH wrote:
>>>>> On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
>>>>>> On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
>>>>>>> Track sysfs files in a list so they all can be removed during pci remove
>>>>>>> since otherwise their removal after that causes crash because parent
>>>>>>> folder was already removed during pci remove.
>>>>> Huh?  That should not happen, do you have a backtrace of that crash?
>>>> 2 examples in the attached trace.
>>> Odd, how did you trigger these?
>>
>> By manually triggering PCI remove from sysfs
>>
>> cd /sys/bus/pci/devices/0000\:05\:00.0 && echo 1 > remove
> For some reason, I didn't think that video/drm devices could handle
> hot-remove like this.  The "old" PCI hotplug specification explicitly
> said that video devices were not supported, has that changed?
>
> And this whole issue is probably tied to the larger issue that Daniel
> was asking me about, when it came to device lifetimes and the drm layer,
> so odds are we need to fix that up first before worrying about trying to
> support this crazy request, right?  :)
>
>>>> [  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
>>>> [  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
>>>> [  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
>>>> [  925.738240 <    0.000004>] PGD 0 P4D 0
>>>> [  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
>>>> [  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
>>>> [  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
>>>> [  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
>>>> [  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
>>>> [  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
>>>> [  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
>>>> [  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
>>>> [  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
>>>> [  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
>>>> [  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
>>>> [  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
>>>> [  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
>>>> [  925.738329 <    0.000006>] Call Trace:
>>>> [  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
>>>> [  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
>>>> [  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
>>>> [  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
>>>> [  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120
>>> So the PCI core is trying to clean up attributes that it had registered,
>>> which is fine.  But we can't seem to find the attributes?  Were they
>>> already removed somewhere else?
>>>
>>> that's odd.
>>
>> Yes, as i pointed above i am emulating device remove from sysfs and this
>> triggers pci device remove sequence and as part of that my specific device
>> folder (05:00.0) is removed from the sysfs tree.
> But why are things being removed twice?


Not sure I understand, what is removed twice? I remove each sysfs attribute only once.

Andrey


>
>>>> [  925.738406 <    0.000052>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
>>>> [  925.738453 <    0.000047>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
>>>> [  925.738490 <    0.000037>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
>>>> [  925.738529 <    0.000039>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
>>>> [  925.738548 <    0.000019>]  drm_dev_put+0x5b/0x80 [drm]
>>>> [  925.738558 <    0.000010>]  drm_release+0xc6/0xd0 [drm]
>>>> [  925.738563 <    0.000005>]  __fput+0xc6/0x260
>>>> [  925.738568 <    0.000005>]  task_work_run+0x79/0xb0
>>>> [  925.738573 <    0.000005>]  do_exit+0x3d0/0xc60
>>>> [  925.738578 <    0.000005>]  do_group_exit+0x47/0xb0
>>>> [  925.738583 <    0.000005>]  get_signal+0x18b/0xc30
>>>> [  925.738589 <    0.000006>]  do_signal+0x36/0x6a0
>>>> [  925.738593 <    0.000004>]  ? force_sig_info_to_task+0xbc/0xd0
>>>> [  925.738597 <    0.000004>]  ? signal_wake_up_state+0x15/0x30
>>>> [  925.738603 <    0.000006>]  exit_to_usermode_loop+0x6f/0xc0
>>>> [  925.738608 <    0.000005>]  prepare_exit_to_usermode+0xc7/0x110
>>>> [  925.738613 <    0.000005>]  ret_from_intr+0x25/0x35
>>>> [  925.738617 <    0.000004>] RIP: 0033:0x417369
>>>> [  925.738621 <    0.000004>] Code: Bad RIP value.
>>>> [  925.738625 <    0.000004>] RSP: 002b:00007ffdd6bf0900 EFLAGS: 00010246
>>>> [  925.738629 <    0.000004>] RAX: 00007f3eca509000 RBX: 000000000000001e RCX: 00007f3ec95ba260
>>>> [  925.738634 <    0.000005>] RDX: 00007f3ec9889790 RSI: 000000000000000a RDI: 0000000000000000
>>>> [  925.738639 <    0.000005>] RBP: 00007ffdd6bf0990 R08: 00007f3ec9889780 R09: 00007f3eca4e8700
>>>> [  925.738645 <    0.000006>] R10: 000000000000035c R11: 0000000000000246 R12: 00000000021c6170
>>>> [  925.738650 <    0.000005>] R13: 00007ffdd6bf0c00 R14: 0000000000000000 R15: 0000000000000000
>>>>
>>>>
>>>>
>>>>
>>>> [   40.880899 <    0.000004>] BUG: kernel NULL pointer dereference, address: 0000000000000090
>>>> [   40.880906 <    0.000007>] #PF: supervisor read access in kernel mode
>>>> [   40.880910 <    0.000004>] #PF: error_code(0x0000) - not-present page
>>>> [   40.880915 <    0.000005>] PGD 0 P4D 0
>>>> [   40.880920 <    0.000005>] Oops: 0000 [#1] SMP PTI
>>>> [   40.880924 <    0.000004>] CPU: 1 PID: 2526 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
>>>> [   40.880932 <    0.000008>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
>>>> [   40.880941 <    0.000009>] RIP: 0010:kernfs_find_ns+0x18/0x110
>>>> [   40.880945 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
>>>> [   40.880957 <    0.000012>] RSP: 0018:ffffaf3380467ba8 EFLAGS: 00010246
>>>> [   40.880963 <    0.000006>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
>>>> [   40.880968 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffc0678cfc RDI: 0000000000000000
>>>> [   40.880973 <    0.000005>] RBP: ffffffffc0678cfc R08: ffffffffaa379d10 R09: 0000000000000000
>>>> [   40.880979 <    0.000006>] R10: ffffaf3380467be0 R11: ffff93547615d128 R12: 0000000000000000
>>>> [   40.880984 <    0.000005>] R13: 0000000000000000 R14: ffffffffc0678cfc R15: ffff93549be86130
>>>> [   40.880990 <    0.000006>] FS:  00007fd9ecb10700(0000) GS:ffff9354bd840000(0000) knlGS:0000000000000000
>>>> [   40.880996 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   40.881001 <    0.000005>] CR2: 0000000000000090 CR3: 0000000072866001 CR4: 00000000000606e0
>>>> [   40.881006 <    0.000005>] Call Trace:
>>>> [   40.881011 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
>>>> [   40.881016 <    0.000005>]  sysfs_remove_group+0x25/0x80
>>>> [   40.881055 <    0.000039>]  amdgpu_device_fini_late+0x3eb/0x440 [amdgpu]
>>>> [   40.881095 <    0.000040>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
>>> Here is this is your driver doing the same thing, removing attributes it
>>> created.  But again they are not there.
>>>
>>> So something went through and wiped the tree clean, which if I'm reading
>>> this correctly, your patch would not solve as you would try to also
>>> remove attributes that were already removed, right?
>>
>> I don't think so, the stack here is from a later stage (after pci remove)
>> where the last user process holding a reference to the device file decides
>> to die and thus triggers the drm_dev_release sequence after the drm dev
>> refcount dropped to zero. And this is why my patch helps: I am expediting all
>> amdgpu sysfs attribute removal to the pci remove stage, when the device folder
>> is still present in the sysfs hierarchy. At least this is my understanding of
>> why it helped. I admit I am not an expert on sysfs internals.
> Ok, yeah, I think this is back to the drm lifecycle issues mentioned
> above.
>
> {sigh}, I'll get to that once I deal with the -rc1/-rc2 merge fallout,
> that will take me a week or so, sorry...
>
> thanks,
>
> greg k-h
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove.
  2020-06-23 13:16         ` Christian König
@ 2020-06-24  3:12           ` Andrey Grodzovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-24  3:12 UTC (permalink / raw)
  To: Christian König, Daniel Vetter, Alex Deucher
  Cc: Daniel Vetter, Michel Dänzer, Pekka Paalanen,
	Maling list - DRI developers, amd-gfx list


On 6/23/20 9:16 AM, Christian König wrote:
> Am 23.06.20 um 12:22 schrieb Daniel Vetter:
>> On Mon, Jun 22, 2020 at 03:48:29PM -0400, Alex Deucher wrote:
>>> On Mon, Jun 22, 2020 at 3:38 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
>>>>> Use the new TTM interface to invalidate all existing BO CPU mappings
>>>>> from all user processes.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>
>>>> I think those two patches could already land in amd-staging-drm-next
>>>> since they are a good idea independent of how else we fix the other issues.
>>> Please make sure they land in drm-misc as well.
>> Not sure that's much use, since without any of the fault side changes you
>> just blow up on the first refault. Seems somewhat silly to charge ahead on
>> this with the other bits still very much under discussion.
>
> Well what I wanted to say is that we don't need to send out those simple 
> patches once more.
>
>> Plus I suggested a possible bikeshed here :-)
>
> No bikeshed, but indeed a rather good idea to not make this a TTM function.
>
> Christian.


So I will incorporate the suggested changes to turn the TTM part into a generic 
DRM helper and will resend both patches as part of V3 (which might take a while 
now due to a context switch I am doing for another task).

Andrey


>
>> -Daniel
>>
>>> Alex
>>>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
>>>>>    1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index 43592dc..6932d75 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -1135,6 +1135,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>>>>>        struct drm_device *dev = pci_get_drvdata(pdev);
>>>>>
>>>>>        drm_dev_unplug(dev);
>>>>> + ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>>>>>        amdgpu_driver_unload_kms(dev);
>>>>>
>>>>>        pci_disable_device(pdev);
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/8] RFC Support hot device unplug in amdgpu
  2020-06-23  9:04     ` Michel Dänzer
@ 2020-06-24  3:21       ` Andrey Grodzovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-24  3:21 UTC (permalink / raw)
  To: Michel Dänzer, Daniel Vetter
  Cc: daniel.vetter, amd-gfx, dri-devel, ckoenig.leichtzumerken

Tried, didn't have any impact

Andrey


On 6/23/20 5:04 AM, Michel Dänzer wrote:
> On 2020-06-23 7:14 a.m., Andrey Grodzovsky wrote:
>> I am fighting with Thunderbird to limit lines to 80 chars but
>> nothing helps. Any suggestions please.
> Maybe try disabling mail.compose.default_to_paragraph, or check other
> *wrap* settings.
>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page.
  2020-06-22  9:41   ` Daniel Vetter
@ 2020-06-24  3:31     ` Andrey Grodzovsky
  2020-06-24  7:19       ` Daniel Vetter
  0 siblings, 1 reply; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-24  3:31 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	ckoenig.leichtzumerken, alexdeucher


On 6/22/20 5:41 AM, Daniel Vetter wrote:
> On Sun, Jun 21, 2020 at 02:03:02AM -0400, Andrey Grodzovsky wrote:
>> On device removal reroute all CPU mappings to dummy page per drm_file
>> instance or imported GEM object.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 65 ++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 57 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 389128b..2f8bf5e 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -35,6 +35,8 @@
>>   #include <drm/ttm/ttm_bo_driver.h>
>>   #include <drm/ttm/ttm_placement.h>
>>   #include <drm/drm_vma_manager.h>
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_file.h>
>>   #include <linux/mm.h>
>>   #include <linux/pfn_t.h>
>>   #include <linux/rbtree.h>
>> @@ -328,19 +330,66 @@ vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
> Hm I think diff and code flow look a bit bad now. What about renaming the
> current function to __ttm_bo_vm_fault and then having something like the
> below:
>
> ttm_bo_vm_fault(args) {
>
> 	if (drm_dev_enter()) {
> 		__ttm_bo_vm_fault(args);
> 		drm_dev_exit();
> 	} else  {
> 		drm_gem_insert_dummy_pfn();
> 	}
> }
>
> I think drm_gem_insert_dummy_pfn(); should be portable across drivers, so
> another nice point to try to unifiy drivers as much as possible.
> -Daniel
>
>>   	pgprot_t prot;
>>   	struct ttm_buffer_object *bo = vma->vm_private_data;
>>   	vm_fault_t ret;
>> +	int idx;
>> +	struct drm_device *ddev = bo->base.dev;
>>   
>> -	ret = ttm_bo_vm_reserve(bo, vmf);
>> -	if (ret)
>> -		return ret;
>> +	if (drm_dev_enter(ddev, &idx)) {
>> +		ret = ttm_bo_vm_reserve(bo, vmf);
>> +		if (ret)
>> +			goto exit;
>> +
>> +		prot = vma->vm_page_prot;
>>   
>> -	prot = vma->vm_page_prot;
>> -	ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
>> -	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
>> +		ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
>> +		if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
>> +			goto exit;
>> +
>> +		dma_resv_unlock(bo->base.resv);
>> +
>> +exit:
>> +		drm_dev_exit(idx);
>>   		return ret;
>> +	} else {
>>   
>> -	dma_resv_unlock(bo->base.resv);
>> +		struct drm_file *file = NULL;
>> +		struct page *dummy_page = NULL;
>> +		int handle;
>>   
>> -	return ret;
>> +		/* We are faulting on imported BO from dma_buf */
>> +		if (bo->base.dma_buf && bo->base.import_attach) {
>> +			dummy_page = bo->base.dummy_page;
>> +		/* We are faulting on non imported BO, find drm_file owning the BO*/
> Uh, we can't fish that out of the vma->vm_file pointer somehow? Or is that
> one all wrong? Doing this kind of list walk looks pretty horrible.
>
> If the vma doesn't have the right pointer I guess next option is that we
> store the drm_file page in gem_bo->dummy_page, and replace it on first
> export. But that's going to be tricky to track ...
>
>> +		} else {
>> +			struct drm_gem_object *gobj;
>> +
>> +			mutex_lock(&ddev->filelist_mutex);
>> +			list_for_each_entry(file, &ddev->filelist, lhead) {
>> +				spin_lock(&file->table_lock);
>> +				idr_for_each_entry(&file->object_idr, gobj, handle) {
>> +					if (gobj == &bo->base) {
>> +						dummy_page = file->dummy_page;
>> +						break;
>> +					}
>> +				}
>> +				spin_unlock(&file->table_lock);
>> +			}
>> +			mutex_unlock(&ddev->filelist_mutex);
>> +		}
>> +
>> +		if (dummy_page) {
>> +			/*
>> +			 * Let do_fault complete the PTE install e.t.c using vmf->page
>> +			 *
>> +			 * TODO - should i call free_page somewhere ?
> Nah, instead don't call get_page. The page will be around as long as
> there's a reference for the drm_file or gem_bo, which is longer than any
> mmap. Otherwise yes this would leak really badly.


So actually that was my thinking in the first place and I indeed avoided taking 
a reference, and this ended up with multiple BUG_ONs, as seen below, where 
refcount:-63 mapcount:-48 for a page are deep into negative values... Those 
warnings were gone once I added get_page(dummy), which in my opinion implies 
that there is a page reference per PTE and that, when the process address space 
is unmapped and the PTEs are deleted, there is also a put_page somewhere in mm 
core, so the get_page per mapping keeps it balanced.

Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762929] BUG: Bad page map in 
process glxgear:disk$0  pte:8000000132284867 pmd:15aaec067
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762931] page:ffffe63384c8a100 
refcount:-63 mapcount:-48 mapping:0000000000000000 index:0x0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762932] flags: 
0x17fff8000000008(dirty)
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762933] raw: 017fff8000000008 
dead000000000100 dead000000000122 0000000000000000
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762934] raw: 0000000000000000 
0000000000000000 ffffffc1ffffffcf 0000000000000000
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762935] page dumped because: bad pte
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762937] addr:00007fe086263000 
vm_flags:1c0440fb anon_vma:0000000000000000 mapping:ffff9b5cd42db268 index:1008b3
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762981] file:renderD129 
fault:ttm_bo_vm_fault [ttm] mmap:amdgpu_mmap [amdgpu] readpage:0x0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762984] CPU: 5 PID: 2619 Comm: 
glxgear:disk$0 Tainted: G    B      OE 5.6.0-dev+ #51
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762985] Hardware name: System 
manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762985] Call Trace:
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762988] dump_stack+0x68/0x9b
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762990] print_bad_pte+0x19f/0x270
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762992]  ? lock_page_memcg+0x5/0xf0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762995] unmap_page_range+0x777/0xbe0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763000] unmap_vmas+0xcc/0x160
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763004] exit_mmap+0xb5/0x1b0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763009] mmput+0x65/0x140
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763010] do_exit+0x362/0xc40
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763013] do_group_exit+0x47/0xb0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763016] get_signal+0x18b/0xc30
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763019] do_signal+0x36/0x6a0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763021]  ? 
__set_task_comm+0x62/0x120
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763024]  ? 
__x64_sys_futex+0x88/0x180
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763028] 
exit_to_usermode_loop+0x6f/0xc0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763030] do_syscall_64+0x149/0x1c0
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763032] 
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763034] RIP: 0033:0x7fe091bd9360
Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763037] Code: Bad RIP value.
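
So the flow I am relying on is essentially this (simplified sketch of the 
fault path in my patch; the comments are only my understanding of the mm core, 
so take them with a grain of salt):

	/* in the fault handler, after the device is gone: */
	get_page(dummy_page);	/* one extra reference per PTE we install */
	vmf->page = dummy_page;
	return 0;		/* do_fault() installs vmf->page into the PTE */

	/* when the mapping is torn down (munmap/process exit) the mm core
	 * removes the PTE and drops that reference again, so refcount and
	 * mapcount stay balanced and the bad page map warnings go away */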

Andrey


>
>> +			 */
>> +			get_page(dummy_page);
>> +			vmf->page = dummy_page;
>> +			return 0;
>> +		} else {
>> +			return VM_FAULT_SIGSEGV;
> Hm that would be a kernel bug, wouldn't it? WARN_ON() required here imo.
> -Daniel
>
>> +		}
>> +	}
>>   }
>>   EXPORT_SYMBOL(ttm_bo_vm_fault);
>>   
>> -- 
>> 2.7.4
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-24  3:04               ` Andrey Grodzovsky
@ 2020-06-24  6:11                 ` Greg KH
  2020-06-25  1:52                   ` Andrey Grodzovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Greg KH @ 2020-06-24  6:11 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher

On Tue, Jun 23, 2020 at 11:04:30PM -0400, Andrey Grodzovsky wrote:
> 
> On 6/23/20 2:05 AM, Greg KH wrote:
> > On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:
> > > On 6/22/20 12:45 PM, Greg KH wrote:
> > > > On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
> > > > > On 6/22/20 7:21 AM, Greg KH wrote:
> > > > > > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> > > > > > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > Track sysfs files in a list so they all can be removed during pci remove
> > > > > > > > since otherwise their removal after that causes crash because parent
> > > > > > > > folder was already removed during pci remove.
> > > > > > Huh?  That should not happen, do you have a backtrace of that crash?
> > > > > 2 examples in the attached trace.
> > > > Odd, how did you trigger these?
> > > 
> > > By manually triggering PCI remove from sysfs
> > > 
> > > cd /sys/bus/pci/devices/0000\:05\:00.0 && echo 1 > remove
> > For some reason, I didn't think that video/drm devices could handle
> > hot-remove like this.  The "old" PCI hotplug specification explicitly
> > said that video devices were not supported, has that changed?
> > 
> > And this whole issue is probably tied to the larger issue that Daniel
> > was asking me about, when it came to device lifetimes and the drm layer,
> > so odds are we need to fix that up first before worrying about trying to
> > support this crazy request, right?  :)
> > 
> > > > > [  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
> > > > > [  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
> > > > > [  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
> > > > > [  925.738240 <    0.000004>] PGD 0 P4D 0
> > > > > [  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
> > > > > [  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
> > > > > [  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> > > > > [  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
> > > > > [  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> > > > > [  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
> > > > > [  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
> > > > > [  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
> > > > > [  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
> > > > > [  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
> > > > > [  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
> > > > > [  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
> > > > > [  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
> > > > > [  925.738329 <    0.000006>] Call Trace:
> > > > > [  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
> > > > > [  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
> > > > > [  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
> > > > > [  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
> > > > > [  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120
> > > > So the PCI core is trying to clean up attributes that it had registered,
> > > > which is fine.  But we can't seem to find the attributes?  Were they
> > > > already removed somewhere else?
> > > > 
> > > > that's odd.
> > > 
> > > Yes, as i pointed above i am emulating device remove from sysfs and this
> > > triggers pci device remove sequence and as part of that my specific device
> > > folder (05:00.0) is removed from the sysfs tree.
> > But why are things being removed twice?
> 
> 
> Not sure I understand, what is removed twice? I remove each sysfs attribute only once.

This code path shows that the kernel is trying to remove a file that is
not present, so someone removed it already...

thanks,

greg k-h
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page.
  2020-06-24  3:31     ` Andrey Grodzovsky
@ 2020-06-24  7:19       ` Daniel Vetter
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Vetter @ 2020-06-24  7:19 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher

On Tue, Jun 23, 2020 at 11:31:45PM -0400, Andrey Grodzovsky wrote:
> 
> On 6/22/20 5:41 AM, Daniel Vetter wrote:
> > On Sun, Jun 21, 2020 at 02:03:02AM -0400, Andrey Grodzovsky wrote:
> > > On device removal reroute all CPU mappings to dummy page per drm_file
> > > instance or imported GEM object.
> > > 
> > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > > ---
> > >   drivers/gpu/drm/ttm/ttm_bo_vm.c | 65 ++++++++++++++++++++++++++++++++++++-----
> > >   1 file changed, 57 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> > > index 389128b..2f8bf5e 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> > > @@ -35,6 +35,8 @@
> > >   #include <drm/ttm/ttm_bo_driver.h>
> > >   #include <drm/ttm/ttm_placement.h>
> > >   #include <drm/drm_vma_manager.h>
> > > +#include <drm/drm_drv.h>
> > > +#include <drm/drm_file.h>
> > >   #include <linux/mm.h>
> > >   #include <linux/pfn_t.h>
> > >   #include <linux/rbtree.h>
> > > @@ -328,19 +330,66 @@ vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
> > Hm I think diff and code flow look a bit bad now. What about renaming the
> > current function to __ttm_bo_vm_fault and then having something like the
> > below:
> > 
> > ttm_bo_vm_fault(args) {
> > 
> > 	if (drm_dev_enter()) {
> > 		__ttm_bo_vm_fault(args);
> > 		drm_dev_exit();
> > 	} else  {
> > 		drm_gem_insert_dummy_pfn();
> > 	}
> > }
> > 
> > I think drm_gem_insert_dummy_pfn(); should be portable across drivers, so
> > another nice point to try to unifiy drivers as much as possible.
> > -Daniel
> > 
> > >   	pgprot_t prot;
> > >   	struct ttm_buffer_object *bo = vma->vm_private_data;
> > >   	vm_fault_t ret;
> > > +	int idx;
> > > +	struct drm_device *ddev = bo->base.dev;
> > > -	ret = ttm_bo_vm_reserve(bo, vmf);
> > > -	if (ret)
> > > -		return ret;
> > > +	if (drm_dev_enter(ddev, &idx)) {
> > > +		ret = ttm_bo_vm_reserve(bo, vmf);
> > > +		if (ret)
> > > +			goto exit;
> > > +
> > > +		prot = vma->vm_page_prot;
> > > -	prot = vma->vm_page_prot;
> > > -	ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
> > > -	if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> > > +		ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
> > > +		if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> > > +			goto exit;
> > > +
> > > +		dma_resv_unlock(bo->base.resv);
> > > +
> > > +exit:
> > > +		drm_dev_exit(idx);
> > >   		return ret;
> > > +	} else {
> > > -	dma_resv_unlock(bo->base.resv);
> > > +		struct drm_file *file = NULL;
> > > +		struct page *dummy_page = NULL;
> > > +		int handle;
> > > -	return ret;
> > > +		/* We are faulting on imported BO from dma_buf */
> > > +		if (bo->base.dma_buf && bo->base.import_attach) {
> > > +			dummy_page = bo->base.dummy_page;
> > > +		/* We are faulting on non imported BO, find drm_file owning the BO*/
> > Uh, we can't fish that out of the vma->vm_file pointer somehow? Or is that
> > one all wrong? Doing this kind of list walk looks pretty horrible.
> > 
> > If the vma doesn't have the right pointer I guess next option is that we
> > store the drm_file page in gem_bo->dummy_page, and replace it on first
> > export. But that's going to be tricky to track ...
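
What I had in mind is something like the below (sketch only, assuming the fault comes in through the drm device node so that vm_file->private_data is the struct drm_file; an imported dma-buf mapping would not have that):

	struct drm_file *file_priv = vmf->vma->vm_file->private_data;

	dummy_page = file_priv->dummy_page;

which would avoid walking ddev->filelist entirely.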
> > 
> > > +		} else {
> > > +			struct drm_gem_object *gobj;
> > > +
> > > +			mutex_lock(&ddev->filelist_mutex);
> > > +			list_for_each_entry(file, &ddev->filelist, lhead) {
> > > +				spin_lock(&file->table_lock);
> > > +				idr_for_each_entry(&file->object_idr, gobj, handle) {
> > > +					if (gobj == &bo->base) {
> > > +						dummy_page = file->dummy_page;
> > > +						break;
> > > +					}
> > > +				}
> > > +				spin_unlock(&file->table_lock);
> > > +			}
> > > +			mutex_unlock(&ddev->filelist_mutex);
> > > +		}
> > > +
> > > +		if (dummy_page) {
> > > +			/*
> > > +			 * Let do_fault complete the PTE install e.t.c using vmf->page
> > > +			 *
> > > +			 * TODO - should i call free_page somewhere ?
> > Nah, instead don't call get_page. The page will be around as long as
> > there's a reference for the drm_file or gem_bo, which is longer than any
> > mmap. Otherwise yes this would leak really badly.
> 
> 
> So actually that was my thinking in the first place, and I indeed avoided
> taking a reference at first. That ended up with multiple BUG_ONs, as seen
> below, where refcount:-63 mapcount:-48 for a page are deep into negative
> values... Those warnings were gone once I added get_page(dummy), which in
> my opinion implies that there is one page reference per PTE: when the
> process address space is unmapped and the PTEs are deleted, there is also a
> put_page somewhere in mm core, and the get_page per mapping keeps it
> balanced.
> 
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762929] BUG: Bad page map in
> process glxgear:disk$0  pte:8000000132284867 pmd:15aaec067
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762931]
> page:ffffe63384c8a100 refcount:-63 mapcount:-48 mapping:0000000000000000
> index:0x0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762932] flags:
> 0x17fff8000000008(dirty)
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762933] raw:
> 017fff8000000008 dead000000000100 dead000000000122 0000000000000000
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762934] raw:
> 0000000000000000 0000000000000000 ffffffc1ffffffcf 0000000000000000
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762935] page dumped because: bad pte
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762937]
> addr:00007fe086263000 vm_flags:1c0440fb anon_vma:0000000000000000
> mapping:ffff9b5cd42db268 index:1008b3
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762981] file:renderD129
> fault:ttm_bo_vm_fault [ttm] mmap:amdgpu_mmap [amdgpu] readpage:0x0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762984] CPU: 5 PID: 2619
> Comm: glxgear:disk$0 Tainted: G    B      OE 5.6.0-dev+ #51
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762985] Hardware name:
> System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804
> 12/30/2013
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762985] Call Trace:
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762988] dump_stack+0x68/0x9b
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762990] print_bad_pte+0x19f/0x270
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762992]  ? lock_page_memcg+0x5/0xf0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.762995] unmap_page_range+0x777/0xbe0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763000] unmap_vmas+0xcc/0x160
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763004] exit_mmap+0xb5/0x1b0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763009] mmput+0x65/0x140
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763010] do_exit+0x362/0xc40
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763013] do_group_exit+0x47/0xb0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763016] get_signal+0x18b/0xc30
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763019] do_signal+0x36/0x6a0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763021]  ?
> __set_task_comm+0x62/0x120
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763024]  ?
> __x64_sys_futex+0x88/0x180
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763028]
> exit_to_usermode_loop+0x6f/0xc0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763030] do_syscall_64+0x149/0x1c0
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763032]
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763034] RIP: 0033:0x7fe091bd9360
> Jun 20 01:36:43 ubuntu-1604-test kernel: [   98.763037] Code: Bad RIP value.

Uh, I guess that just shows how little I understand how this all works.
But yeah, if we set vmf->page then I guess core mm takes care of
everything, but it apparently expects a page reference.
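
So roughly, with the reference kept, the tail of the unplugged path would then be something like the sketch below (with the WARN_ON from my comment further down folded in); the idea is that each installed PTE holds one page reference, which the core mm drops again when the mapping is torn down:

	if (dummy_page) {
		/*
		 * One reference per installed PTE: the fault core installs
		 * vmf->page into the PTE, and when the VMA is unmapped
		 * zap_pte_range()/unmap_vmas() call put_page(), so refcount
		 * and mapcount stay balanced.
		 */
		get_page(dummy_page);
		vmf->page = dummy_page;
		return 0;
	}

	/* No dummy page to fall back to, that would be a kernel bug. */
	WARN_ON(1);
	return VM_FAULT_SIGSEGV;
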
-Daniel
 
> Andrey
> 
> 
> > 
> > > +			 */
> > > +			get_page(dummy_page);
> > > +			vmf->page = dummy_page;
> > > +			return 0;
> > > +		} else {
> > > +			return VM_FAULT_SIGSEGV;
> > Hm that would be a kernel bug, wouldn't it? WARN_ON() required here imo.
> > -Daniel
> > 
> > > +		}
> > > +	}
> > >   }
> > >   EXPORT_SYMBOL(ttm_bo_vm_fault);
> > > -- 
> > > 2.7.4
> > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
  2020-06-24  6:11                 ` Greg KH
@ 2020-06-25  1:52                   ` Andrey Grodzovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Andrey Grodzovsky @ 2020-06-25  1:52 UTC (permalink / raw)
  To: Greg KH
  Cc: daniel.vetter, michel, dri-devel, ppaalanen, amd-gfx,
	Daniel Vetter, ckoenig.leichtzumerken, alexdeucher


On 6/24/20 2:11 AM, Greg KH wrote:
> On Tue, Jun 23, 2020 at 11:04:30PM -0400, Andrey Grodzovsky wrote:
>> On 6/23/20 2:05 AM, Greg KH wrote:
>>> On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:
>>>> On 6/22/20 12:45 PM, Greg KH wrote:
>>>>> On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
>>>>>> On 6/22/20 7:21 AM, Greg KH wrote:
>>>>>>> On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
>>>>>>>> On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
>>>>>>>>> Track sysfs files in a list so they can all be removed during pci remove,
>>>>>>>>> since removing them after that point crashes because the parent
>>>>>>>>> folder was already removed during pci remove.
>>>>>>> Huh?  That should not happen, do you have a backtrace of that crash?
>>>>>> 2 examples in the attached trace.
>>>>> Odd, how did you trigger these?
>>>> By manually triggering PCI remove from sysfs
>>>>
>>>> cd /sys/bus/pci/devices/0000\:05\:00.0 && echo 1 > remove
>>> For some reason, I didn't think that video/drm devices could handle
>>> hot-remove like this.  The "old" PCI hotplug specification explicitly
>>> said that video devices were not supported, has that changed?
>>>
>>> And this whole issue is probably tied to the larger issue that Daniel
>>> was asking me about, when it came to device lifetimes and the drm layer,
>>> so odds are we need to fix that up first before worrying about trying to
>>> support this crazy request, right?  :)
>>>
>>>>>> [  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
>>>>>> [  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
>>>>>> [  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
>>>>>> [  925.738240 <    0.000004>] PGD 0 P4D 0
>>>>>> [  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
>>>>>> [  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
>>>>>> [  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
>>>>>> [  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
>>>>>> [  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
>>>>>> [  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
>>>>>> [  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
>>>>>> [  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
>>>>>> [  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
>>>>>> [  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
>>>>>> [  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
>>>>>> [  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
>>>>>> [  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
>>>>>> [  925.738329 <    0.000006>] Call Trace:
>>>>>> [  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
>>>>>> [  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
>>>>>> [  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
>>>>>> [  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
>>>>>> [  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120
>>>>> So the PCI core is trying to clean up attributes that it had registered,
>>>>> which is fine.  But we can't seem to find the attributes?  Were they
>>>>> already removed somewhere else?
>>>>>
>>>>> that's odd.
>>>> Yes, as I pointed out above, I am emulating device removal from sysfs; this
>>>> triggers the PCI device remove sequence, and as part of that my specific
>>>> device folder (05:00.0) is removed from the sysfs tree.
>>> But why are things being removed twice?
>>
>> Not sure I understand. What is removed twice? I only remove each sysfs attribute once.
> This code path shows that the kernel is trying to remove a file that is
> not present, so someone removed it already...
>
> thanks,
>
> greg k-h


That is a mystery for me too...

Andrey
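
(For reference, the list-tracking idea from the commit message boils down to something like the sketch below; all names here are placeholders for illustration, not the actual amdgpu code.)

struct tracked_attr {
	struct list_head node;
	struct device_attribute *attr;
};

struct sysfs_tracker {
	struct list_head files;
	struct mutex lock;
};

/* Record every attribute we create, so it can be torn down in one place. */
static int sysfs_track(struct sysfs_tracker *t, struct device *dev,
		       struct device_attribute *attr)
{
	struct tracked_attr *e;
	int ret;

	ret = device_create_file(dev, attr);
	if (ret)
		return ret;

	e = kzalloc(sizeof(*e), GFP_KERNEL);
	if (!e) {
		device_remove_file(dev, attr);
		return -ENOMEM;
	}
	e->attr = attr;
	mutex_lock(&t->lock);
	list_add_tail(&e->node, &t->files);
	mutex_unlock(&t->lock);
	return 0;
}

/* Remove everything early in the pci remove path, while the parent
 * sysfs directory still exists. */
static void sysfs_untrack_all(struct sysfs_tracker *t, struct device *dev)
{
	struct tracked_attr *e, *tmp;

	mutex_lock(&t->lock);
	list_for_each_entry_safe(e, tmp, &t->files, node) {
		device_remove_file(dev, e->attr);
		list_del(&e->node);
		kfree(e);
	}
	mutex_unlock(&t->lock);
}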



end of thread, back to index

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
2020-06-22  9:35   ` Daniel Vetter
2020-06-22 14:21     ` Pekka Paalanen
2020-06-22 14:24       ` Daniel Vetter
2020-06-22 14:28         ` Pekka Paalanen
2020-06-22 13:18   ` Christian König
2020-06-22 14:23     ` Daniel Vetter
2020-06-22 14:32     ` Andrey Grodzovsky
2020-06-22 17:45       ` Christian König
2020-06-22 17:50         ` Daniel Vetter
2020-06-21  6:03 ` [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
2020-06-22  9:41   ` Daniel Vetter
2020-06-24  3:31     ` Andrey Grodzovsky
2020-06-24  7:19       ` Daniel Vetter
2020-06-22 19:30   ` Christian König
2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space Andrey Grodzovsky
2020-06-22  9:45   ` Daniel Vetter
2020-06-23  5:00     ` Andrey Grodzovsky
2020-06-23 10:25       ` Daniel Vetter
2020-06-23 12:55         ` Christian König
2020-06-22 19:37   ` Christian König
2020-06-22 19:47   ` Alex Deucher
2020-06-21  6:03 ` [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
2020-06-22  9:48   ` Daniel Vetter
2020-06-21  6:03 ` [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal Andrey Grodzovsky
2020-06-22  9:51   ` Daniel Vetter
2020-06-22 11:21     ` Greg KH
2020-06-22 16:07       ` Andrey Grodzovsky
2020-06-22 16:45         ` Greg KH
2020-06-23  4:51           ` Andrey Grodzovsky
2020-06-23  6:05             ` Greg KH
2020-06-24  3:04               ` Andrey Grodzovsky
2020-06-24  6:11                 ` Greg KH
2020-06-25  1:52                   ` Andrey Grodzovsky
2020-06-22 13:19   ` Christian König
2020-06-21  6:03 ` [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove Andrey Grodzovsky
2020-06-22  9:56   ` Daniel Vetter
2020-06-22 19:38   ` Christian König
2020-06-22 19:48     ` Alex Deucher
2020-06-23 10:22       ` Daniel Vetter
2020-06-23 13:16         ` Christian König
2020-06-24  3:12           ` Andrey Grodzovsky
2020-06-21  6:03 ` [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug Andrey Grodzovsky
2020-06-22  9:55   ` Daniel Vetter
2020-06-22 19:40   ` Christian König
2020-06-23  5:11     ` Andrey Grodzovsky
2020-06-23  7:14       ` Christian König
2020-06-21  6:03 ` [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
2020-06-22  9:53   ` Daniel Vetter
2020-06-22  9:46 ` [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Daniel Vetter
2020-06-23  5:14   ` Andrey Grodzovsky
2020-06-23  9:04     ` Michel Dänzer
2020-06-24  3:21       ` Andrey Grodzovsky
