* [PATCH 00/35] Add HMM-based SVM memory manager to KFD
@ 2021-01-07  3:00 Felix Kuehling
  2021-01-07  3:00 ` [PATCH 01/35] drm/amdkfd: select kernel DEVICE_PRIVATE option Felix Kuehling
                   ` (36 more replies)
  0 siblings, 37 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

This is the first version of our HMM-based shared virtual memory manager
for KFD. There are still a number of known issues that we're working through
(see below). This will likely lead to some pretty significant changes in
MMU notifier handling and locking on the migration code paths. So don't
get hung up on those details yet.

But I think this is a good time to start getting feedback. We're pretty
confident about the ioctl API, which is both simple and extensible for the
future (see patches 4 and 16). The user mode side of the API can be found here:
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
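
As a rough illustration of the API (a hedged sketch, not code from the Thunk
linked above; the helper name and error handling are made up for this
example), setting a preferred location on a range with the uapi from patch 4
could look like this in user mode:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kfd_ioctl.h>

/* kfd_fd is an open file descriptor for /dev/kfd */
static int example_set_preferred_loc(int kfd_fd, void *addr, uint64_t size,
				     uint32_t gpuid)
{
	struct kfd_ioctl_svm_args *args;
	int r;

	/* one attribute follows the fixed-size part of the args struct */
	args = calloc(1, sizeof(*args) + sizeof(args->attrs[0]));
	if (!args)
		return -1;

	args->start_addr = (uint64_t)(uintptr_t)addr;	/* page aligned */
	args->size = size;				/* page aligned */
	args->op = KFD_IOCTL_SVM_OP_SET_ATTR;
	args->nattr = 1;
	args->attrs[0].type = KFD_IOCTL_SVM_ATTR_PREFERRED_LOC;
	args->attrs[0].value = gpuid;	/* 0 selects system memory */

	r = ioctl(kfd_fd, AMDKFD_IOC_SVM, args);
	free(args);
	return r;
}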

I'd also like another pair of eyes on how we're interfacing with the GPU VM
code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
and some retry IRQ handling changes (32).


Known issues:
* won't work with IOMMU enabled; we need to dma_map all pages properly
* still working on some race conditions and random bugs
* performance is not great yet

Alex Sierra (12):
  drm/amdgpu: replace per_device_list by array
  drm/amdkfd: helper to convert gpu id and idx
  drm/amdkfd: add xnack enabled flag to kfd_process
  drm/amdkfd: add ioctl to configure and query xnack retries
  drm/amdkfd: invalidate tables on page retry fault
  drm/amdkfd: page table restore through svm API
  drm/amdkfd: SVM API call to restore page tables
  drm/amdkfd: add svm_bo reference for eviction fence
  drm/amdgpu: add param bit flag to create SVM BOs
  drm/amdkfd: add svm_bo eviction mechanism support
  drm/amdgpu: svm bo enable_signal call condition
  drm/amdgpu: add svm_bo eviction to enable_signal cb

Philip Yang (23):
  drm/amdkfd: select kernel DEVICE_PRIVATE option
  drm/amdkfd: add svm ioctl API
  drm/amdkfd: Add SVM API support capability bits
  drm/amdkfd: register svm range
  drm/amdkfd: add svm ioctl GET_ATTR op
  drm/amdgpu: add common HMM get pages function
  drm/amdkfd: validate svm range system memory
  drm/amdkfd: register overlap system memory range
  drm/amdkfd: deregister svm range
  drm/amdgpu: export vm update mapping interface
  drm/amdkfd: map svm range to GPUs
  drm/amdkfd: svm range eviction and restore
  drm/amdkfd: register HMM device private zone
  drm/amdkfd: validate vram svm range from TTM
  drm/amdkfd: support xgmi same hive mapping
  drm/amdkfd: copy memory through gart table
  drm/amdkfd: HMM migrate ram to vram
  drm/amdkfd: HMM migrate vram to ram
  drm/amdgpu: reserve fence slot to update page table
  drm/amdgpu: enable retry fault wptr overflow
  drm/amdkfd: refine migration policy with xnack on
  drm/amdkfd: add svm range validate timestamp
  drm/amdkfd: multiple gpu migrate vram to vram

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
 drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
 drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
 include/uapi/linux/kfd_ioctl.h                |  169 +-
 26 files changed, 4296 insertions(+), 291 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h

-- 
2.29.2


* [PATCH 01/35] drm/amdkfd: select kernel DEVICE_PRIVATE option
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
@ 2021-01-07  3:00 ` Felix Kuehling
  2021-01-07  3:00 ` [PATCH 02/35] drm/amdgpu: replace per_device_list by array Felix Kuehling
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

The DEVICE_PRIVATE kernel config option is required for HMM page migration,
to register VRAM (GPU device memory) as DEVICE_PRIVATE zone memory.
Enabling this option requires recompiling the kernel.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
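For context, a minimal sketch of what DEVICE_PRIVATE enables (this is not
code from this series; a later patch registers the VRAM zone for real, and
the names prefixed with example_ are illustrative). A driver registers
device memory as a device-private zone so that struct pages exist for it and
migrate_vma can move data between system memory and VRAM:

#include <linux/err.h>
#include <linux/memremap.h>
#include <linux/mm.h>

static void example_page_free(struct page *page)
{
	/* return the backing VRAM page to the driver's allocator */
}

static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
{
	/* migrate the data back to system memory on CPU access */
	return VM_FAULT_SIGBUS;
}

static const struct dev_pagemap_ops example_pgmap_ops = {
	.page_free	= example_page_free,
	.migrate_to_ram	= example_migrate_to_ram,
};

static int example_register_vram(struct device *dev, struct dev_pagemap *pgmap,
				 u64 base, u64 size)
{
	void *r;

	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->range.start = base;
	pgmap->range.end = base + size - 1;
	pgmap->nr_range = 1;
	pgmap->ops = &example_pgmap_ops;
	pgmap->owner = dev;	/* matched by the driver during migration */

	r = devm_memremap_pages(dev, pgmap);
	return IS_ERR(r) ? PTR_ERR(r) : 0;
}
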
 drivers/gpu/drm/amd/amdkfd/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig b/drivers/gpu/drm/amd/amdkfd/Kconfig
index e8fb10c41f16..33f8efadc6f6 100644
--- a/drivers/gpu/drm/amd/amdkfd/Kconfig
+++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
@@ -7,6 +7,7 @@ config HSA_AMD
 	bool "HSA kernel driver for AMD GPU devices"
 	depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
 	imply AMD_IOMMU_V2 if X86_64
+	select DEVICE_PRIVATE
 	select MMU_NOTIFIER
 	help
 	  Enable this if you want to use HSA features on AMD GPU devices.
-- 
2.29.2


* [PATCH 02/35] drm/amdgpu: replace per_device_list by array
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
  2021-01-07  3:00 ` [PATCH 01/35] drm/amdkfd: select kernel DEVICE_PRIVATE option Felix Kuehling
@ 2021-01-07  3:00 ` Felix Kuehling
  2021-01-07  3:00 ` [PATCH 03/35] drm/amdkfd: helper to convert gpu id and idx Felix Kuehling
                   ` (34 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

Remove per_device_list from kfd_process and replace it with an array of
kfd_process_device pointers of MAX_GPU_INSTANCE size. This helps to manage
the kfd_process_devices bound to a specific kfd_process. The functions used
by kfd_chardev to iterate over the list are also removed, since they are no
longer needed; callers now use a local loop that iterates over the array.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 116 ++++++++----------
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |   8 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  20 +--
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 108 ++++++++--------
 .../amd/amdkfd/kfd_process_queue_manager.c    |   6 +-
 5 files changed, 111 insertions(+), 147 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 8cc51cec988a..8c87afce12df 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -874,52 +874,47 @@ static int kfd_ioctl_get_process_apertures(struct file *filp,
 {
 	struct kfd_ioctl_get_process_apertures_args *args = data;
 	struct kfd_process_device_apertures *pAperture;
-	struct kfd_process_device *pdd;
+	int i;
 
 	dev_dbg(kfd_device, "get apertures for PASID 0x%x", p->pasid);
 
 	args->num_of_nodes = 0;
 
 	mutex_lock(&p->mutex);
+	/* Run over all pdd of the process */
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
+
+		pAperture =
+			&args->process_apertures[args->num_of_nodes];
+		pAperture->gpu_id = pdd->dev->id;
+		pAperture->lds_base = pdd->lds_base;
+		pAperture->lds_limit = pdd->lds_limit;
+		pAperture->gpuvm_base = pdd->gpuvm_base;
+		pAperture->gpuvm_limit = pdd->gpuvm_limit;
+		pAperture->scratch_base = pdd->scratch_base;
+		pAperture->scratch_limit = pdd->scratch_limit;
 
-	/*if the process-device list isn't empty*/
-	if (kfd_has_process_device_data(p)) {
-		/* Run over all pdd of the process */
-		pdd = kfd_get_first_process_device_data(p);
-		do {
-			pAperture =
-				&args->process_apertures[args->num_of_nodes];
-			pAperture->gpu_id = pdd->dev->id;
-			pAperture->lds_base = pdd->lds_base;
-			pAperture->lds_limit = pdd->lds_limit;
-			pAperture->gpuvm_base = pdd->gpuvm_base;
-			pAperture->gpuvm_limit = pdd->gpuvm_limit;
-			pAperture->scratch_base = pdd->scratch_base;
-			pAperture->scratch_limit = pdd->scratch_limit;
-
-			dev_dbg(kfd_device,
-				"node id %u\n", args->num_of_nodes);
-			dev_dbg(kfd_device,
-				"gpu id %u\n", pdd->dev->id);
-			dev_dbg(kfd_device,
-				"lds_base %llX\n", pdd->lds_base);
-			dev_dbg(kfd_device,
-				"lds_limit %llX\n", pdd->lds_limit);
-			dev_dbg(kfd_device,
-				"gpuvm_base %llX\n", pdd->gpuvm_base);
-			dev_dbg(kfd_device,
-				"gpuvm_limit %llX\n", pdd->gpuvm_limit);
-			dev_dbg(kfd_device,
-				"scratch_base %llX\n", pdd->scratch_base);
-			dev_dbg(kfd_device,
-				"scratch_limit %llX\n", pdd->scratch_limit);
-
-			args->num_of_nodes++;
-
-			pdd = kfd_get_next_process_device_data(p, pdd);
-		} while (pdd && (args->num_of_nodes < NUM_OF_SUPPORTED_GPUS));
-	}
+		dev_dbg(kfd_device,
+			"node id %u\n", args->num_of_nodes);
+		dev_dbg(kfd_device,
+			"gpu id %u\n", pdd->dev->id);
+		dev_dbg(kfd_device,
+			"lds_base %llX\n", pdd->lds_base);
+		dev_dbg(kfd_device,
+			"lds_limit %llX\n", pdd->lds_limit);
+		dev_dbg(kfd_device,
+			"gpuvm_base %llX\n", pdd->gpuvm_base);
+		dev_dbg(kfd_device,
+			"gpuvm_limit %llX\n", pdd->gpuvm_limit);
+		dev_dbg(kfd_device,
+			"scratch_base %llX\n", pdd->scratch_base);
+		dev_dbg(kfd_device,
+			"scratch_limit %llX\n", pdd->scratch_limit);
 
+		if (++args->num_of_nodes >= NUM_OF_SUPPORTED_GPUS)
+			break;
+	}
 	mutex_unlock(&p->mutex);
 
 	return 0;
@@ -930,9 +925,8 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp,
 {
 	struct kfd_ioctl_get_process_apertures_new_args *args = data;
 	struct kfd_process_device_apertures *pa;
-	struct kfd_process_device *pdd;
-	uint32_t nodes = 0;
 	int ret;
+	int i;
 
 	dev_dbg(kfd_device, "get apertures for PASID 0x%x", p->pasid);
 
@@ -941,17 +935,7 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp,
 		 * sufficient memory
 		 */
 		mutex_lock(&p->mutex);
-
-		if (!kfd_has_process_device_data(p))
-			goto out_unlock;
-
-		/* Run over all pdd of the process */
-		pdd = kfd_get_first_process_device_data(p);
-		do {
-			args->num_of_nodes++;
-			pdd = kfd_get_next_process_device_data(p, pdd);
-		} while (pdd);
-
+		args->num_of_nodes = p->n_pdds;
 		goto out_unlock;
 	}
 
@@ -966,22 +950,23 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp,
 
 	mutex_lock(&p->mutex);
 
-	if (!kfd_has_process_device_data(p)) {
+	if (!p->n_pdds) {
 		args->num_of_nodes = 0;
 		kfree(pa);
 		goto out_unlock;
 	}
 
 	/* Run over all pdd of the process */
-	pdd = kfd_get_first_process_device_data(p);
-	do {
-		pa[nodes].gpu_id = pdd->dev->id;
-		pa[nodes].lds_base = pdd->lds_base;
-		pa[nodes].lds_limit = pdd->lds_limit;
-		pa[nodes].gpuvm_base = pdd->gpuvm_base;
-		pa[nodes].gpuvm_limit = pdd->gpuvm_limit;
-		pa[nodes].scratch_base = pdd->scratch_base;
-		pa[nodes].scratch_limit = pdd->scratch_limit;
+	for (i = 0; i < min(p->n_pdds, args->num_of_nodes); i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
+
+		pa[i].gpu_id = pdd->dev->id;
+		pa[i].lds_base = pdd->lds_base;
+		pa[i].lds_limit = pdd->lds_limit;
+		pa[i].gpuvm_base = pdd->gpuvm_base;
+		pa[i].gpuvm_limit = pdd->gpuvm_limit;
+		pa[i].scratch_base = pdd->scratch_base;
+		pa[i].scratch_limit = pdd->scratch_limit;
 
 		dev_dbg(kfd_device,
 			"gpu id %u\n", pdd->dev->id);
@@ -997,17 +982,14 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp,
 			"scratch_base %llX\n", pdd->scratch_base);
 		dev_dbg(kfd_device,
 			"scratch_limit %llX\n", pdd->scratch_limit);
-		nodes++;
-
-		pdd = kfd_get_next_process_device_data(p, pdd);
-	} while (pdd && (nodes < args->num_of_nodes));
+	}
 	mutex_unlock(&p->mutex);
 
-	args->num_of_nodes = nodes;
+	args->num_of_nodes = i;
 	ret = copy_to_user(
 			(void __user *)args->kfd_process_device_apertures_ptr,
 			pa,
-			(nodes * sizeof(struct kfd_process_device_apertures)));
+			(i * sizeof(struct kfd_process_device_apertures)));
 	kfree(pa);
 	return ret ? -EFAULT : 0;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
index 5a64915abaf7..1a266b78f0d8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
@@ -131,11 +131,11 @@ int kfd_iommu_bind_process_to_device(struct kfd_process_device *pdd)
  */
 void kfd_iommu_unbind_process(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd;
+	int i;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		if (pdd->bound == PDD_BOUND)
-			amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
+	for (i = 0; i < p->n_pdds; i++)
+		if (p->pdds[i]->bound == PDD_BOUND)
+			amd_iommu_unbind_pasid(p->pdds[i]->dev->pdev, p->pasid);
 }
 
 /* Callback for process shutdown invoked by the IOMMU driver */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index e2ebd5a1d4de..d9f8d3d48aac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 
 #include "amd_shared.h"
+#include "amdgpu.h"
 
 #define KFD_MAX_RING_ENTRY_SIZE	8
 
@@ -644,12 +645,6 @@ enum kfd_pdd_bound {
 
 /* Data that is per-process-per device. */
 struct kfd_process_device {
-	/*
-	 * List of all per-device data for a process.
-	 * Starts from kfd_process.per_device_data.
-	 */
-	struct list_head per_device_list;
-
 	/* The device that owns this data. */
 	struct kfd_dev *dev;
 
@@ -766,10 +761,11 @@ struct kfd_process {
 	uint16_t pasid;
 
 	/*
-	 * List of kfd_process_device structures,
+	 * Array of kfd_process_device pointers,
 	 * one for each device the process is using.
 	 */
-	struct list_head per_device_data;
+	struct kfd_process_device *pdds[MAX_GPU_INSTANCE];
+	uint32_t n_pdds;
 
 	struct process_queue_manager pqm;
 
@@ -867,14 +863,6 @@ void *kfd_process_device_translate_handle(struct kfd_process_device *p,
 void kfd_process_device_remove_obj_handle(struct kfd_process_device *pdd,
 					int handle);
 
-/* Process device data iterator */
-struct kfd_process_device *kfd_get_first_process_device_data(
-							struct kfd_process *p);
-struct kfd_process_device *kfd_get_next_process_device_data(
-						struct kfd_process *p,
-						struct kfd_process_device *pdd);
-bool kfd_has_process_device_data(struct kfd_process *p);
-
 /* PASIDs */
 int kfd_pasid_init(void);
 void kfd_pasid_exit(void);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 2807e1c4d59b..031e752e3154 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -505,7 +505,7 @@ static int kfd_sysfs_create_file(struct kfd_process *p, struct attribute *attr,
 static int kfd_procfs_add_sysfs_stats(struct kfd_process *p)
 {
 	int ret = 0;
-	struct kfd_process_device *pdd;
+	int i;
 	char stats_dir_filename[MAX_SYSFS_FILENAME_LEN];
 
 	if (!p)
@@ -520,7 +520,8 @@ static int kfd_procfs_add_sysfs_stats(struct kfd_process *p)
 	 * - proc/<pid>/stats_<gpuid>/evicted_ms
 	 * - proc/<pid>/stats_<gpuid>/cu_occupancy
 	 */
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
 		struct kobject *kobj_stats;
 
 		snprintf(stats_dir_filename, MAX_SYSFS_FILENAME_LEN,
@@ -571,7 +572,7 @@ static int kfd_procfs_add_sysfs_stats(struct kfd_process *p)
 static int kfd_procfs_add_sysfs_files(struct kfd_process *p)
 {
 	int ret = 0;
-	struct kfd_process_device *pdd;
+	int i;
 
 	if (!p)
 		return -EINVAL;
@@ -584,7 +585,9 @@ static int kfd_procfs_add_sysfs_files(struct kfd_process *p)
 	 * - proc/<pid>/vram_<gpuid>
 	 * - proc/<pid>/sdma_<gpuid>
 	 */
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
+
 		snprintf(pdd->vram_filename, MAX_SYSFS_FILENAME_LEN, "vram_%u",
 			 pdd->dev->id);
 		ret = kfd_sysfs_create_file(p, &pdd->attr_vram, pdd->vram_filename);
@@ -875,21 +878,23 @@ void kfd_unref_process(struct kfd_process *p)
 	kref_put(&p->ref, kfd_process_ref_release);
 }
 
+
 static void kfd_process_device_free_bos(struct kfd_process_device *pdd)
 {
 	struct kfd_process *p = pdd->process;
 	void *mem;
 	int id;
+	int i;
 
 	/*
 	 * Remove all handles from idr and release appropriate
 	 * local memory object
 	 */
 	idr_for_each_entry(&pdd->alloc_idr, mem, id) {
-		struct kfd_process_device *peer_pdd;
 
-		list_for_each_entry(peer_pdd, &p->per_device_data,
-				    per_device_list) {
+		for (i = 0; i < p->n_pdds; i++) {
+			struct kfd_process_device *peer_pdd = p->pdds[i];
+
 			if (!peer_pdd->vm)
 				continue;
 			amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu(
@@ -903,18 +908,19 @@ static void kfd_process_device_free_bos(struct kfd_process_device *pdd)
 
 static void kfd_process_free_outstanding_kfd_bos(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd;
+	int i;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		kfd_process_device_free_bos(pdd);
+	for (i = 0; i < p->n_pdds; i++)
+		kfd_process_device_free_bos(p->pdds[i]);
 }
 
 static void kfd_process_destroy_pdds(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd, *temp;
+	int i;
+
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
 
-	list_for_each_entry_safe(pdd, temp, &p->per_device_data,
-				 per_device_list) {
 		pr_debug("Releasing pdd (topology id %d) for process (pasid 0x%x)\n",
 				pdd->dev->id, p->pasid);
 
@@ -927,8 +933,6 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
 			amdgpu_amdkfd_gpuvm_destroy_process_vm(
 				pdd->dev->kgd, pdd->vm);
 
-		list_del(&pdd->per_device_list);
-
 		if (pdd->qpd.cwsr_kaddr && !pdd->qpd.cwsr_base)
 			free_pages((unsigned long)pdd->qpd.cwsr_kaddr,
 				get_order(KFD_CWSR_TBA_TMA_SIZE));
@@ -949,7 +953,9 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
 		}
 
 		kfree(pdd);
+		p->pdds[i] = NULL;
 	}
+	p->n_pdds = 0;
 }
 
 /* No process locking is needed in this function, because the process
@@ -961,7 +967,7 @@ static void kfd_process_wq_release(struct work_struct *work)
 {
 	struct kfd_process *p = container_of(work, struct kfd_process,
 					     release_work);
-	struct kfd_process_device *pdd;
+	int i;
 
 	/* Remove the procfs files */
 	if (p->kobj) {
@@ -970,7 +976,9 @@ static void kfd_process_wq_release(struct work_struct *work)
 		kobject_put(p->kobj_queues);
 		p->kobj_queues = NULL;
 
-		list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
+		for (i = 0; i < p->n_pdds; i++) {
+			struct kfd_process_device *pdd = p->pdds[i];
+
 			sysfs_remove_file(p->kobj, &pdd->attr_vram);
 			sysfs_remove_file(p->kobj, &pdd->attr_sdma);
 			sysfs_remove_file(p->kobj, &pdd->attr_evict);
@@ -1020,7 +1028,7 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
 					struct mm_struct *mm)
 {
 	struct kfd_process *p;
-	struct kfd_process_device *pdd = NULL;
+	int i;
 
 	/*
 	 * The kfd_process structure can not be free because the
@@ -1044,8 +1052,8 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
 	 * pdd is in debug mode, we should first force unregistration,
 	 * then we will be able to destroy the queues
 	 */
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
-		struct kfd_dev *dev = pdd->dev;
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_dev *dev = p->pdds[i]->dev;
 
 		mutex_lock(kfd_get_dbgmgr_mutex());
 		if (dev && dev->dbgmgr && dev->dbgmgr->pasid == p->pasid) {
@@ -1081,11 +1089,11 @@ static const struct mmu_notifier_ops kfd_process_mmu_notifier_ops = {
 static int kfd_process_init_cwsr_apu(struct kfd_process *p, struct file *filep)
 {
 	unsigned long  offset;
-	struct kfd_process_device *pdd;
+	int i;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
-		struct kfd_dev *dev = pdd->dev;
-		struct qcm_process_device *qpd = &pdd->qpd;
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_dev *dev = p->pdds[i]->dev;
+		struct qcm_process_device *qpd = &p->pdds[i]->qpd;
 
 		if (!dev->cwsr_enabled || qpd->cwsr_kaddr || qpd->cwsr_base)
 			continue;
@@ -1162,7 +1170,7 @@ static struct kfd_process *create_process(const struct task_struct *thread)
 	mutex_init(&process->mutex);
 	process->mm = thread->mm;
 	process->lead_thread = thread->group_leader;
-	INIT_LIST_HEAD(&process->per_device_data);
+	process->n_pdds = 0;
 	INIT_DELAYED_WORK(&process->eviction_work, evict_process_worker);
 	INIT_DELAYED_WORK(&process->restore_work, restore_process_worker);
 	process->last_restore_timestamp = get_jiffies_64();
@@ -1244,11 +1252,11 @@ static int init_doorbell_bitmap(struct qcm_process_device *qpd,
 struct kfd_process_device *kfd_get_process_device_data(struct kfd_dev *dev,
 							struct kfd_process *p)
 {
-	struct kfd_process_device *pdd = NULL;
+	int i;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		if (pdd->dev == dev)
-			return pdd;
+	for (i = 0; i < p->n_pdds; i++)
+		if (p->pdds[i]->dev == dev)
+			return p->pdds[i];
 
 	return NULL;
 }
@@ -1258,6 +1266,8 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
 {
 	struct kfd_process_device *pdd = NULL;
 
+	if (WARN_ON_ONCE(p->n_pdds >= MAX_GPU_INSTANCE))
+		return NULL;
 	pdd = kzalloc(sizeof(*pdd), GFP_KERNEL);
 	if (!pdd)
 		return NULL;
@@ -1286,7 +1296,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
 	pdd->vram_usage = 0;
 	pdd->sdma_past_activity_counter = 0;
 	atomic64_set(&pdd->evict_duration_counter, 0);
-	list_add(&pdd->per_device_list, &p->per_device_data);
+	p->pdds[p->n_pdds++] = pdd;
 
 	/* Init idr used for memory handle translation */
 	idr_init(&pdd->alloc_idr);
@@ -1418,28 +1428,6 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
 	return ERR_PTR(err);
 }
 
-struct kfd_process_device *kfd_get_first_process_device_data(
-						struct kfd_process *p)
-{
-	return list_first_entry(&p->per_device_data,
-				struct kfd_process_device,
-				per_device_list);
-}
-
-struct kfd_process_device *kfd_get_next_process_device_data(
-						struct kfd_process *p,
-						struct kfd_process_device *pdd)
-{
-	if (list_is_last(&pdd->per_device_list, &p->per_device_data))
-		return NULL;
-	return list_next_entry(pdd, per_device_list);
-}
-
-bool kfd_has_process_device_data(struct kfd_process *p)
-{
-	return !(list_empty(&p->per_device_data));
-}
-
 /* Create specific handle mapped to mem from process local memory idr
  * Assumes that the process lock is held.
  */
@@ -1515,11 +1503,13 @@ struct kfd_process *kfd_lookup_process_by_mm(const struct mm_struct *mm)
  */
 int kfd_process_evict_queues(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd;
 	int r = 0;
+	int i;
 	unsigned int n_evicted = 0;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
+
 		r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
 							    &pdd->qpd);
 		if (r) {
@@ -1535,7 +1525,9 @@ int kfd_process_evict_queues(struct kfd_process *p)
 	/* To keep state consistent, roll back partial eviction by
 	 * restoring queues
 	 */
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
+
 		if (n_evicted == 0)
 			break;
 		if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
@@ -1551,10 +1543,12 @@ int kfd_process_evict_queues(struct kfd_process *p)
 /* kfd_process_restore_queues - Restore all user queues of a process */
 int kfd_process_restore_queues(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd;
 	int r, ret = 0;
+	int i;
+
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_process_device *pdd = p->pdds[i];
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list) {
 		r = pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
 							      &pdd->qpd);
 		if (r) {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index eb1635ac8988..95a6c36cea4c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -126,10 +126,10 @@ int pqm_set_gws(struct process_queue_manager *pqm, unsigned int qid,
 
 void kfd_process_dequeue_from_all_devices(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd;
+	int i;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		kfd_process_dequeue_from_device(pdd);
+	for (i = 0; i < p->n_pdds; i++)
+		kfd_process_dequeue_from_device(p->pdds[i]);
 }
 
 int pqm_init(struct process_queue_manager *pqm, struct kfd_process *p)
-- 
2.29.2


* [PATCH 03/35] drm/amdkfd: helper to convert gpu id and idx
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
  2021-01-07  3:00 ` [PATCH 01/35] drm/amdkfd: select kernel DEVICE_PRIVATE option Felix Kuehling
  2021-01-07  3:00 ` [PATCH 02/35] drm/amdgpu: replace per_device_list by array Felix Kuehling
@ 2021-01-07  3:00 ` Felix Kuehling
  2021-01-07  3:00 ` [PATCH 04/35] drm/amdkfd: add svm ioctl API Felix Kuehling
                   ` (33 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

An svm range uses a GPU bitmap to store which GPUs the range maps to.
Applications pass the driver GPU id to specify a GPU, so helpers are needed
to convert between a GPU id and its GPU bitmap index.

Access is through the kfd_process_device pointers array in kfd_process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
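A hedged usage sketch (not part of this patch; the function name is
illustrative, and patch 6 adds a similar helper called
svm_get_supported_dev_by_id): translating a user-visible gpu_id into both a
bitmap index and a device pointer with the new helpers:

#include "kfd_priv.h"

static struct kfd_dev *
example_dev_from_gpuid(struct kfd_process *p, uint32_t gpu_id, int *gpuidx)
{
	struct kfd_dev *dev;
	int idx;

	idx = kfd_process_gpuidx_from_gpuid(p, gpu_id);
	if (idx < 0)
		return NULL;	/* gpu_id unknown to this process */
	if (kfd_process_device_from_gpuidx(p, idx, &dev))
		return NULL;

	*gpuidx = idx;	/* bit position used in the svm range gpu bitmaps */
	return dev;
}
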
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  5 ++++
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 30 ++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index d9f8d3d48aac..4ef8804adcf5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -837,6 +837,11 @@ struct kfd_process *kfd_create_process(struct file *filep);
 struct kfd_process *kfd_get_process(const struct task_struct *);
 struct kfd_process *kfd_lookup_process_by_pasid(unsigned int pasid);
 struct kfd_process *kfd_lookup_process_by_mm(const struct mm_struct *mm);
+int kfd_process_gpuid_from_gpuidx(struct kfd_process *p,
+					uint32_t gpu_idx, uint32_t *gpuid);
+int kfd_process_gpuidx_from_gpuid(struct kfd_process *p, uint32_t gpu_id);
+int kfd_process_device_from_gpuidx(struct kfd_process *p,
+					uint32_t gpu_idx, struct kfd_dev **gpu);
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p);
 int kfd_process_restore_queues(struct kfd_process *p);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 031e752e3154..7396f3a6d0ee 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1561,6 +1561,36 @@ int kfd_process_restore_queues(struct kfd_process *p)
 	return ret;
 }
 
+int kfd_process_gpuid_from_gpuidx(struct kfd_process *p,
+					uint32_t gpu_idx, uint32_t *gpuid)
+{
+	if (gpu_idx < p->n_pdds) {
+		*gpuid = p->pdds[gpu_idx]->dev->id;
+		return 0;
+	}
+	return -EINVAL;
+}
+
+int kfd_process_gpuidx_from_gpuid(struct kfd_process *p, uint32_t gpu_id)
+{
+	int i;
+
+	for (i = 0; i < p->n_pdds; i++)
+		if (p->pdds[i] && gpu_id == p->pdds[i]->dev->id)
+			return i;
+	return -EINVAL;
+}
+
+int kfd_process_device_from_gpuidx(struct kfd_process *p,
+					uint32_t gpu_idx, struct kfd_dev **gpu)
+{
+	if (gpu_idx < p->n_pdds) {
+		*gpu = p->pdds[gpu_idx]->dev;
+		return 0;
+	}
+	return -EINVAL;
+}
+
 static void evict_process_worker(struct work_struct *work)
 {
 	int ret;
-- 
2.29.2


* [PATCH 04/35] drm/amdkfd: add svm ioctl API
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (2 preceding siblings ...)
  2021-01-07  3:00 ` [PATCH 03/35] drm/amdkfd: helper to convert gpu id and idx Felix Kuehling
@ 2021-01-07  3:00 ` Felix Kuehling
  2021-01-07  3:00 ` [PATCH 05/35] drm/amdkfd: Add SVM API support capability bits Felix Kuehling
                   ` (32 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Add svm (shared virtual memory) ioctl data structure and API definition.

The svm ioctl API is designed to be extensible in the future. All
operations are provided by a single IOCTL to preserve ioctl number
space. The arguments structure ends with a variable size array of
attributes that can be used to set or get one or multiple attributes.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
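A hedged user-space sketch of the GET_ATTR flow described in the kernel-doc
below (the op itself is only implemented in a later patch; the function name
and error handling are made up for this example). One attribute per GPU is
passed in with type ACCESS and value gpuid, and the kernel rewrites each
type to report ACCESS, ACCESS_IN_PLACE or NO_ACCESS for that GPU:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kfd_ioctl.h>

static int example_query_gpu_access(int kfd_fd, uint64_t start, uint64_t size,
				    const uint32_t *gpuids, uint32_t n,
				    struct kfd_ioctl_svm_attribute *out)
{
	struct kfd_ioctl_svm_args *args;
	uint32_t i;
	int r;

	args = calloc(1, sizeof(*args) + n * sizeof(args->attrs[0]));
	if (!args)
		return -1;

	args->start_addr = start;	/* page aligned */
	args->size = size;		/* page aligned */
	args->op = KFD_IOCTL_SVM_OP_GET_ATTR;
	args->nattr = n;
	for (i = 0; i < n; i++) {
		args->attrs[i].type = KFD_IOCTL_SVM_ATTR_ACCESS;
		args->attrs[i].value = gpuids[i];
	}

	r = ioctl(kfd_fd, AMDKFD_IOC_SVM, args);
	if (!r)
		for (i = 0; i < n; i++)
			out[i] = args->attrs[i];	/* type reports access kind */

	free(args);
	return r;
}
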
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |   7 ++
 include/uapi/linux/kfd_ioctl.h           | 128 ++++++++++++++++++++++-
 2 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 8c87afce12df..c5288a6e45b9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1746,6 +1746,11 @@ static int kfd_ioctl_smi_events(struct file *filep,
 	return kfd_smi_event_open(dev, &args->anon_fd);
 }
 
+static int kfd_ioctl_svm(struct file *filep, struct kfd_process *p, void *data)
+{
+	return -EINVAL;
+}
+
 #define AMDKFD_IOCTL_DEF(ioctl, _func, _flags) \
 	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func, .flags = _flags, \
 			    .cmd_drv = 0, .name = #ioctl}
@@ -1844,6 +1849,8 @@ static const struct amdkfd_ioctl_desc amdkfd_ioctls[] = {
 
 	AMDKFD_IOCTL_DEF(AMDKFD_IOC_SMI_EVENTS,
 			kfd_ioctl_smi_events, 0),
+
+	AMDKFD_IOCTL_DEF(AMDKFD_IOC_SVM, kfd_ioctl_svm, 0),
 };
 
 #define AMDKFD_CORE_IOCTL_COUNT	ARRAY_SIZE(amdkfd_ioctls)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 695b606da4b1..5d4a4b3e0b61 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -29,9 +29,10 @@
 /*
  * - 1.1 - initial version
  * - 1.3 - Add SMI events support
+ * - 1.4 - Add SVM API
  */
 #define KFD_IOCTL_MAJOR_VERSION 1
-#define KFD_IOCTL_MINOR_VERSION 3
+#define KFD_IOCTL_MINOR_VERSION 4
 
 struct kfd_ioctl_get_version_args {
 	__u32 major_version;	/* from KFD */
@@ -471,6 +472,127 @@ enum kfd_mmio_remap {
 	KFD_MMIO_REMAP_HDP_REG_FLUSH_CNTL = 4,
 };
 
+/* Guarantee host access to memory */
+#define KFD_IOCTL_SVM_FLAG_HOST_ACCESS 0x00000001
+/* Fine grained coherency between all devices with access */
+#define KFD_IOCTL_SVM_FLAG_COHERENT    0x00000002
+/* Use any GPU in same hive as preferred device */
+#define KFD_IOCTL_SVM_FLAG_HIVE_LOCAL  0x00000004
+/* GPUs only read, allows replication */
+#define KFD_IOCTL_SVM_FLAG_GPU_RO      0x00000008
+/* Allow execution on GPU */
+#define KFD_IOCTL_SVM_FLAG_GPU_EXEC    0x00000010
+
+/**
+ * kfd_ioctl_svm_op - SVM ioctl operations
+ *
+ * @KFD_IOCTL_SVM_OP_SET_ATTR: Modify one or more attributes
+ * @KFD_IOCTL_SVM_OP_GET_ATTR: Query one or more attributes
+ */
+enum kfd_ioctl_svm_op {
+	KFD_IOCTL_SVM_OP_SET_ATTR,
+	KFD_IOCTL_SVM_OP_GET_ATTR
+};
+
+/** kfd_ioctl_svm_location - Enum for preferred and prefetch locations
+ *
+ * GPU IDs are used to specify GPUs as preferred and prefetch locations.
+ * Below definitions are used for system memory or for leaving the preferred
+ * location unspecified.
+ */
+enum kfd_ioctl_svm_location {
+	KFD_IOCTL_SVM_LOCATION_SYSMEM = 0,
+	KFD_IOCTL_SVM_LOCATION_UNDEFINED = 0xffffffff
+};
+
+/**
+ * kfd_ioctl_svm_attr_type - SVM attribute types
+ *
+ * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: gpuid of the preferred location, 0 for
+ *                                    system memory
+ * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: gpuid of the prefetch location, 0 for
+ *                                   system memory. Setting this triggers an
+ *                                   immediate prefetch (migration).
+ * @KFD_IOCTL_SVM_ATTR_ACCESS:
+ * @KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE:
+ * @KFD_IOCTL_SVM_ATTR_NO_ACCESS: specify memory access for the gpuid given
+ *                                by the attribute value
+ * @KFD_IOCTL_SVM_ATTR_SET_FLAGS: bitmask of flags to set (see
+ *                                KFD_IOCTL_SVM_FLAG_...)
+ * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS: bitmask of flags to clear
+ * @KFD_IOCTL_SVM_ATTR_GRANULARITY: migration granularity
+ *                                  (log2 num pages)
+ */
+enum kfd_ioctl_svm_attr_type {
+	KFD_IOCTL_SVM_ATTR_PREFERRED_LOC,
+	KFD_IOCTL_SVM_ATTR_PREFETCH_LOC,
+	KFD_IOCTL_SVM_ATTR_ACCESS,
+	KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE,
+	KFD_IOCTL_SVM_ATTR_NO_ACCESS,
+	KFD_IOCTL_SVM_ATTR_SET_FLAGS,
+	KFD_IOCTL_SVM_ATTR_CLR_FLAGS,
+	KFD_IOCTL_SVM_ATTR_GRANULARITY
+};
+
+/**
+ * kfd_ioctl_svm_attribute - Attributes as pairs of type and value
+ *
+ * The meaning of the @value depends on the attribute type.
+ *
+ * @type: attribute type (see enum @kfd_ioctl_svm_attr_type)
+ * @value: attribute value
+ */
+struct kfd_ioctl_svm_attribute {
+	__u32 type;
+	__u32 value;
+};
+
+/**
+ * kfd_ioctl_svm_args - Arguments for SVM ioctl
+ *
+ * @op specifies the operation to perform (see enum
+ * @kfd_ioctl_svm_op).  @start_addr and @size are common for all
+ * operations.
+ *
+ * A variable number of attributes can be given in @attrs.
+ * @nattr specifies the number of attributes. New attributes can be
+ * added in the future without breaking the ABI. If unknown attributes
+ * are given, the function returns -EINVAL.
+ *
+ * @KFD_IOCTL_SVM_OP_SET_ATTR sets attributes for a virtual address
+ * range. It may overlap existing virtual address ranges. If it does,
+ * the existing ranges will be split such that the attribute changes
+ * only apply to the specified address range.
+ *
+ * @KFD_IOCTL_SVM_OP_GET_ATTR returns the intersection of attributes
+ * over all memory in the given range and returns the result as the
+ * attribute value. If different pages have different preferred or
+ * prefetch locations, 0xffffffff will be returned for
+ * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC or
+ * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC respectively. For
+ * @KFD_IOCTL_SVM_ATTR_SET_FLAGS, flags of all pages will be
+ * aggregated by bitwise AND. The minimum migration granularity
+ * throughout the range will be returned for
+ * @KFD_IOCTL_SVM_ATTR_GRANULARITY.
+ *
+ * Querying of accessibility attributes works by initializing the
+ * attribute type to @KFD_IOCTL_SVM_ATTR_ACCESS and the value to the
+ * GPUID being queried. Multiple attributes can be given to allow
+ * querying multiple GPUIDs. The ioctl function overwrites the
+ * attribute type to indicate the access for the specified GPU.
+ *
+ * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS is invalid for
+ * @KFD_IOCTL_SVM_OP_GET_ATTR.
+ */
+struct kfd_ioctl_svm_args {
+	__u64 start_addr;
+	__u64 size;
+	__u32 op;
+	__u32 nattr;
+	/* Variable length array of attributes */
+	struct kfd_ioctl_svm_attribute attrs[0];
+};
+
 #define AMDKFD_IOCTL_BASE 'K'
 #define AMDKFD_IO(nr)			_IO(AMDKFD_IOCTL_BASE, nr)
 #define AMDKFD_IOR(nr, type)		_IOR(AMDKFD_IOCTL_BASE, nr, type)
@@ -571,7 +693,9 @@ enum kfd_mmio_remap {
 #define AMDKFD_IOC_SMI_EVENTS			\
 		AMDKFD_IOWR(0x1F, struct kfd_ioctl_smi_events_args)
 
+#define AMDKFD_IOC_SVM	AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args)
+
 #define AMDKFD_COMMAND_START		0x01
-#define AMDKFD_COMMAND_END		0x20
+#define AMDKFD_COMMAND_END		0x21
 
 #endif
-- 
2.29.2


* [PATCH 05/35] drm/amdkfd: Add SVM API support capability bits
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (3 preceding siblings ...)
  2021-01-07  3:00 ` [PATCH 04/35] drm/amdkfd: add svm ioctl API Felix Kuehling
@ 2021-01-07  3:00 ` Felix Kuehling
  2021-01-07  3:00 ` [PATCH 06/35] drm/amdkfd: register svm range Felix Kuehling
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

An SVMAPISupported property is added to HSA_CAPABILITY; its value matches
HSA_CAPABILITY as defined in the Thunk spec:

SVMAPISupported: it will not be supported on older kernels that don't
have HMM, or on GFXv8 or older GPUs without support for 48-bit virtual
addresses.

A CoherentHostAccess property is added to HSA_MEMORYPROPERTY; its value
matches HSA_MEMORYPROPERTY as defined in the Thunk spec:

CoherentHostAccess: whether or not device memory can be coherently
accessed by the host CPU.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
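A hedged sketch of how user mode could consume the new bit (the parsing of
the node's properties file in the KFD topology sysfs is not shown, and the
function name is illustrative):

#include <stdbool.h>
#include <stdint.h>

#define HSA_CAP_SVMAPI_SUPPORTED	0x04000000	/* from kfd_topology.h below */

/* capability is the "capability" value read from the topology node */
static bool example_node_supports_svm_api(uint32_t capability)
{
	return (capability & HSA_CAP_SVMAPI_SUPPORTED) != 0;
}
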
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 10 ++++++----
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index a3fc23873819..885b8a071717 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1380,6 +1380,7 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 		dev->node_props.capability |= ((HSA_CAP_DOORBELL_TYPE_2_0 <<
 			HSA_CAP_DOORBELL_TYPE_TOTALBITS_SHIFT) &
 			HSA_CAP_DOORBELL_TYPE_TOTALBITS_MASK);
+		dev->node_props.capability |= HSA_CAP_SVMAPI_SUPPORTED;
 		break;
 	default:
 		WARN(1, "Unexpected ASIC family %u",
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
index 326d9b26b7aa..7c5ea9b4b9d9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
@@ -52,8 +52,9 @@
 #define HSA_CAP_RASEVENTNOTIFY			0x00200000
 #define HSA_CAP_ASIC_REVISION_MASK		0x03c00000
 #define HSA_CAP_ASIC_REVISION_SHIFT		22
+#define HSA_CAP_SVMAPI_SUPPORTED		0x04000000
 
-#define HSA_CAP_RESERVED			0xfc078000
+#define HSA_CAP_RESERVED			0xf8078000
 
 struct kfd_node_properties {
 	uint64_t hive_id;
@@ -98,9 +99,10 @@ struct kfd_node_properties {
 #define HSA_MEM_HEAP_TYPE_GPU_LDS	4
 #define HSA_MEM_HEAP_TYPE_GPU_SCRATCH	5
 
-#define HSA_MEM_FLAGS_HOT_PLUGGABLE	0x00000001
-#define HSA_MEM_FLAGS_NON_VOLATILE	0x00000002
-#define HSA_MEM_FLAGS_RESERVED		0xfffffffc
+#define HSA_MEM_FLAGS_HOT_PLUGGABLE		0x00000001
+#define HSA_MEM_FLAGS_NON_VOLATILE		0x00000002
+#define HSA_MEM_FLAGS_COHERENTHOSTACCESS	0x00000004
+#define HSA_MEM_FLAGS_RESERVED			0xfffffff8
 
 struct kfd_mem_properties {
 	struct list_head	list;
-- 
2.29.2


* [PATCH 06/35] drm/amdkfd: register svm range
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (4 preceding siblings ...)
  2021-01-07  3:00 ` [PATCH 05/35] drm/amdkfd: Add SVM API support capability bits Felix Kuehling
@ 2021-01-07  3:00 ` Felix Kuehling
  2021-01-07  3:00 ` [PATCH 07/35] drm/amdkfd: add svm ioctl GET_ATTR op Felix Kuehling
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

The svm range structure stores the range start address, size, attributes,
flags, prefetch location and a GPU bitmap that indicates which GPUs this
range maps to. The same virtual address is shared by the CPU and GPUs.

The process has an svm range list that uses both an interval tree and a
linked list to store all svm ranges registered by the process. The interval
tree is used by the GPU VM fault handler and the CPU page fault handler to
look up the svm range structure for a specific address. The list is used to
scan all ranges in the eviction restore work.

Apply the attributes preferred location, prefetch location, mapping flags
and migration granularity to the svm range, and store the mapping GPU index
in the bitmap.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
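A hedged sketch (not part of this patch; later patches add the real lookup
helpers and locking) of how a fault handler can use the interval tree
described above to find the svm range covering a faulting address, given in
pages:

#include <linux/interval_tree.h>
#include "kfd_svm.h"

static struct svm_range *
example_range_from_addr(struct svm_range_list *svms, unsigned long addr)
{
	struct interval_tree_node *node;

	/* single-page query: start == last == addr */
	node = interval_tree_iter_first(&svms->objects, addr, addr);
	if (!node)
		return NULL;

	return container_of(node, struct svm_range, it_node);
}
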
 drivers/gpu/drm/amd/amdkfd/Makefile      |   3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  21 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  14 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |   9 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 603 +++++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |  93 ++++
 6 files changed, 741 insertions(+), 2 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h

diff --git a/drivers/gpu/drm/amd/amdkfd/Makefile b/drivers/gpu/drm/amd/amdkfd/Makefile
index e1e4115dcf78..387ce0217d35 100644
--- a/drivers/gpu/drm/amd/amdkfd/Makefile
+++ b/drivers/gpu/drm/amd/amdkfd/Makefile
@@ -54,7 +54,8 @@ AMDKFD_FILES	:= $(AMDKFD_PATH)/kfd_module.o \
 		$(AMDKFD_PATH)/kfd_dbgdev.o \
 		$(AMDKFD_PATH)/kfd_dbgmgr.o \
 		$(AMDKFD_PATH)/kfd_smi_events.o \
-		$(AMDKFD_PATH)/kfd_crat.o
+		$(AMDKFD_PATH)/kfd_crat.o \
+		$(AMDKFD_PATH)/kfd_svm.o
 
 ifneq ($(CONFIG_AMD_IOMMU_V2),)
 AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index c5288a6e45b9..2d3ba7e806d5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -38,6 +38,7 @@
 #include "kfd_priv.h"
 #include "kfd_device_queue_manager.h"
 #include "kfd_dbgmgr.h"
+#include "kfd_svm.h"
 #include "amdgpu_amdkfd.h"
 #include "kfd_smi_events.h"
 
@@ -1748,7 +1749,25 @@ static int kfd_ioctl_smi_events(struct file *filep,
 
 static int kfd_ioctl_svm(struct file *filep, struct kfd_process *p, void *data)
 {
-	return -EINVAL;
+	struct kfd_ioctl_svm_args *args = data;
+	int r = 0;
+
+	pr_debug("start 0x%llx size 0x%llx op 0x%x nattr 0x%x\n",
+		 args->start_addr, args->size, args->op, args->nattr);
+
+	if ((args->start_addr & ~PAGE_MASK) || (args->size & ~PAGE_MASK))
+		return -EINVAL;
+	if (!args->start_addr || !args->size)
+		return -EINVAL;
+
+	mutex_lock(&p->mutex);
+
+	r = svm_ioctl(p, args->op, args->start_addr, args->size, args->nattr,
+		      args->attrs);
+
+	mutex_unlock(&p->mutex);
+
+	return r;
 }
 
 #define AMDKFD_IOCTL_DEF(ioctl, _func, _flags) \
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 4ef8804adcf5..cbb2bae1982d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -726,6 +726,17 @@ struct kfd_process_device {
 
 #define qpd_to_pdd(x) container_of(x, struct kfd_process_device, qpd)
 
+struct svm_range_list {
+	struct mutex			lock; /* use svms_lock/unlock(svms) */
+	unsigned int			saved_flags;
+	struct rb_root_cached		objects;
+	struct list_head		list;
+	struct srcu_struct		srcu;
+	struct work_struct		srcu_free_work;
+	struct list_head		free_list;
+	struct mutex			free_list_lock;
+};
+
 /* Process data */
 struct kfd_process {
 	/*
@@ -804,6 +815,9 @@ struct kfd_process {
 	struct kobject *kobj;
 	struct kobject *kobj_queues;
 	struct attribute attr_pasid;
+
+	/* shared virtual memory registered by this process */
+	struct svm_range_list svms;
 };
 
 #define KFD_PROCESS_TABLE_SIZE 5 /* bits: 32 entries */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 7396f3a6d0ee..791f17308b1b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -35,6 +35,7 @@
 #include <linux/pm_runtime.h>
 #include "amdgpu_amdkfd.h"
 #include "amdgpu.h"
+#include "kfd_svm.h"
 
 struct mm_struct;
 
@@ -42,6 +43,7 @@ struct mm_struct;
 #include "kfd_device_queue_manager.h"
 #include "kfd_dbgmgr.h"
 #include "kfd_iommu.h"
+#include "kfd_svm.h"
 
 /*
  * List of struct kfd_process (field kfd_process).
@@ -997,6 +999,7 @@ static void kfd_process_wq_release(struct work_struct *work)
 	kfd_iommu_unbind_process(p);
 
 	kfd_process_free_outstanding_kfd_bos(p);
+	svm_range_list_fini(p);
 
 	kfd_process_destroy_pdds(p);
 	dma_fence_put(p->ef);
@@ -1190,6 +1193,10 @@ static struct kfd_process *create_process(const struct task_struct *thread)
 	if (err != 0)
 		goto err_init_apertures;
 
+	err = svm_range_list_init(process);
+	if (err)
+		goto err_init_svm_range_list;
+
 	/* Must be last, have to use release destruction after this */
 	process->mmu_notifier.ops = &kfd_process_mmu_notifier_ops;
 	err = mmu_notifier_register(&process->mmu_notifier, process->mm);
@@ -1203,6 +1210,8 @@ static struct kfd_process *create_process(const struct task_struct *thread)
 	return process;
 
 err_register_notifier:
+	svm_range_list_fini(process);
+err_init_svm_range_list:
 	kfd_process_free_outstanding_kfd_bos(process);
 	kfd_process_destroy_pdds(process);
 err_init_apertures:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
new file mode 100644
index 000000000000..0b0410837be9
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -0,0 +1,603 @@
+/*
+ * Copyright 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#include <linux/types.h>
+#include "amdgpu_sync.h"
+#include "amdgpu_object.h"
+#include "amdgpu_vm.h"
+#include "amdgpu_mn.h"
+#include "kfd_priv.h"
+#include "kfd_svm.h"
+
+/**
+ * svm_range_unlink - unlink svm_range from lists and interval tree
+ * @prange: svm range structure to be removed
+ *
+ * Remove the svm range from svms interval tree and link list
+ *
+ * Context: The caller must hold svms_lock
+ */
+static void svm_range_unlink(struct svm_range *prange)
+{
+	pr_debug("prange 0x%p [0x%lx 0x%lx]\n", prange, prange->it_node.start,
+		 prange->it_node.last);
+
+	list_del_rcu(&prange->list);
+	interval_tree_remove(&prange->it_node, &prange->svms->objects);
+}
+
+/**
+ * svm_range_add_to_svms - add svm range to svms
+ * @prange: svm range structure to be added
+ *
+ * Add the svm range to svms interval tree and link list
+ *
+ * Context: The caller must hold svms_lock
+ */
+static void svm_range_add_to_svms(struct svm_range *prange)
+{
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	list_add_tail_rcu(&prange->list, &prange->svms->list);
+	interval_tree_insert(&prange->it_node, &prange->svms->objects);
+}
+
+static void svm_range_remove(struct svm_range *prange)
+{
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	kvfree(prange->pages_addr);
+	kfree(prange);
+}
+
+static void
+svm_range_set_default_attributes(int32_t *location, int32_t *prefetch_loc,
+				 uint8_t *granularity, uint32_t *flags)
+{
+	*location = 0;
+	*prefetch_loc = 0;
+	*granularity = 9;
+	*flags =
+		KFD_IOCTL_SVM_FLAG_HOST_ACCESS | KFD_IOCTL_SVM_FLAG_COHERENT;
+}
+
+static struct
+svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
+			 uint64_t last)
+{
+	uint64_t size = last - start + 1;
+	struct svm_range *prange;
+
+	prange = kzalloc(sizeof(*prange), GFP_KERNEL);
+	if (!prange)
+		return NULL;
+	prange->npages = size;
+	prange->svms = svms;
+	prange->it_node.start = start;
+	prange->it_node.last = last;
+	INIT_LIST_HEAD(&prange->list);
+	INIT_LIST_HEAD(&prange->update_list);
+	INIT_LIST_HEAD(&prange->remove_list);
+	svm_range_set_default_attributes(&prange->preferred_loc,
+					 &prange->prefetch_loc,
+					 &prange->granularity, &prange->flags);
+
+	pr_debug("svms 0x%p [0x%llx 0x%llx]\n", svms, start, last);
+
+	return prange;
+}
+
+static struct kfd_dev *
+svm_get_supported_dev_by_id(struct kfd_process *p, uint32_t gpu_id,
+			    int *r_gpuidx)
+{
+	struct kfd_dev *dev;
+	int gpuidx;
+	int r;
+
+	gpuidx = kfd_process_gpuidx_from_gpuid(p, gpu_id);
+	if (gpuidx < 0) {
+		pr_debug("failed to get device by id 0x%x\n", gpu_id);
+		return NULL;
+	}
+	r = kfd_process_device_from_gpuidx(p, gpuidx, &dev);
+	if (r < 0) {
+		pr_debug("failed to get device by idx 0x%x\n", gpuidx);
+		return NULL;
+	}
+	if (dev->device_info->asic_family < CHIP_VEGA10) {
+		pr_debug("device id 0x%x does not support SVM\n", gpu_id);
+		return NULL;
+	}
+	if (r_gpuidx)
+		*r_gpuidx = gpuidx;
+	return dev;
+}
+
+static int
+svm_range_apply_attrs(struct kfd_process *p, struct svm_range *prange,
+		      uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
+{
+	uint32_t i;
+	int gpuidx;
+
+	for (i = 0; i < nattr; i++) {
+		switch (attrs[i].type) {
+		case KFD_IOCTL_SVM_ATTR_PREFERRED_LOC:
+			if (attrs[i].value != KFD_IOCTL_SVM_LOCATION_SYSMEM &&
+			    attrs[i].value != KFD_IOCTL_SVM_LOCATION_UNDEFINED &&
+			    !svm_get_supported_dev_by_id(p, attrs[i].value, NULL))
+				return -EINVAL;
+			prange->preferred_loc = attrs[i].value;
+			break;
+		case KFD_IOCTL_SVM_ATTR_PREFETCH_LOC:
+			if (attrs[i].value != KFD_IOCTL_SVM_LOCATION_SYSMEM &&
+			    !svm_get_supported_dev_by_id(p, attrs[i].value, NULL))
+				return -EINVAL;
+			prange->prefetch_loc = attrs[i].value;
+			break;
+		case KFD_IOCTL_SVM_ATTR_ACCESS:
+		case KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE:
+		case KFD_IOCTL_SVM_ATTR_NO_ACCESS:
+			if (!svm_get_supported_dev_by_id(p, attrs[i].value,
+							 &gpuidx))
+				return -EINVAL;
+			if (attrs[i].type == KFD_IOCTL_SVM_ATTR_NO_ACCESS) {
+				bitmap_clear(prange->bitmap_access, gpuidx, 1);
+				bitmap_clear(prange->bitmap_aip, gpuidx, 1);
+			} else if (attrs[i].type == KFD_IOCTL_SVM_ATTR_ACCESS) {
+				bitmap_set(prange->bitmap_access, gpuidx, 1);
+				bitmap_clear(prange->bitmap_aip, gpuidx, 1);
+			} else {
+				bitmap_clear(prange->bitmap_access, gpuidx, 1);
+				bitmap_set(prange->bitmap_aip, gpuidx, 1);
+			}
+			break;
+		case KFD_IOCTL_SVM_ATTR_SET_FLAGS:
+			prange->flags |= attrs[i].value;
+			break;
+		case KFD_IOCTL_SVM_ATTR_CLR_FLAGS:
+			prange->flags &= ~attrs[i].value;
+			break;
+		case KFD_IOCTL_SVM_ATTR_GRANULARITY:
+			prange->granularity = attrs[i].value;
+			break;
+		default:
+			pr_debug("unknown attr type 0x%x\n", attrs[i].type);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * svm_range_debug_dump - print all range information from svms
+ * @svms: svm range list header
+ *
+ * debug output svm range start, end, pages_addr, prefetch location from svms
+ * interval tree and link list
+ *
+ * Context: The caller must hold svms_lock
+ */
+static void svm_range_debug_dump(struct svm_range_list *svms)
+{
+	struct interval_tree_node *node;
+	struct svm_range *prange;
+
+	pr_debug("dump svms 0x%p list\n", svms);
+	pr_debug("range\tstart\tpage\tend\t\tpages_addr\tlocation\n");
+
+	/* Not using list_for_each_entry_rcu because the caller is holding the
+	 * svms lock
+	 */
+	list_for_each_entry(prange, &svms->list, list) {
+		pr_debug("0x%lx\t0x%llx\t0x%llx\t0x%llx\t0x%x\n",
+			 prange->it_node.start, prange->npages,
+			 prange->it_node.start + prange->npages - 1,
+			 prange->pages_addr ? *prange->pages_addr : 0,
+			 prange->actual_loc);
+	}
+
+	pr_debug("dump svms 0x%p interval tree\n", svms);
+	pr_debug("range\tstart\tpage\tend\t\tpages_addr\tlocation\n");
+	node = interval_tree_iter_first(&svms->objects, 0, ~0ULL);
+	while (node) {
+		prange = container_of(node, struct svm_range, it_node);
+		pr_debug("0x%lx\t0x%llx\t0x%llx\t0x%llx\t0x%x\n",
+			 prange->it_node.start, prange->npages,
+			 prange->it_node.start + prange->npages - 1,
+			 prange->pages_addr ? *prange->pages_addr : 0,
+			 prange->actual_loc);
+		node = interval_tree_iter_next(node, 0, ~0ULL);
+	}
+}
+
+/**
+ * svm_range_handle_overlap - split overlap ranges
+ * @svms: svm range list header
+ * @new: range added with these attributes
+ * @start: range added start address, in pages
+ * @last: range last address, in pages
+ * @update_list: output, the ranges whose attributes are updated. For set_attr,
+ *               these will be validated and mapped to GPUs. For unmap, these
+ *               will be removed and unmapped from GPUs.
+ * @insert_list: output, the ranges to be inserted into svms, attributes are
+ *               not changed. For set_attr, these will be added to svms. For
+ *               unmap, duplicate ranges are removed from update_list because
+ *               they are unmapped and should not be inserted into svms.
+ * @remove_list: output, the ranges to be removed from svms
+ * @left: the remaining range after overlap. For set_attr, this will be added
+ *        as a new range. For unmap, this is ignored.
+ *
+ * There are 5 overlap cases in total.
+ *
+ * Context: The caller must hold svms_lock
+ */
+static int
+svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new,
+			 unsigned long start, unsigned long last,
+			 struct list_head *update_list,
+			 struct list_head *insert_list,
+			 struct list_head *remove_list,
+			 unsigned long *left)
+{
+	struct interval_tree_node *node;
+	struct svm_range *prange;
+	struct svm_range *tmp;
+	int r = 0;
+
+	INIT_LIST_HEAD(update_list);
+	INIT_LIST_HEAD(insert_list);
+	INIT_LIST_HEAD(remove_list);
+
+	node = interval_tree_iter_first(&svms->objects, start, last);
+	while (node) {
+		struct interval_tree_node *next;
+
+		pr_debug("found overlap node [0x%lx 0x%lx]\n", node->start,
+			 node->last);
+
+		prange = container_of(node, struct svm_range, it_node);
+		next = interval_tree_iter_next(node, start, last);
+
+		if (node->start < start && node->last > last) {
+			pr_debug("split in 2 ranges\n");
+			start = last + 1;
+
+		} else if (node->start < start) {
+			/*
+			 * If node->last == last, the loop exits here; for
+			 * node->last < last, the next iteration continues.
+			 */
+			uint64_t old_last = node->last;
+
+			start = old_last + 1;
+
+		} else if (node->start == start && node->last > last) {
+			pr_debug("change old range start\n");
+
+			start = last + 1;
+
+		} else if (node->start == start) {
+			if (prange->it_node.last == last)
+				pr_debug("found exactly same range\n");
+			else
+				pr_debug("next loop to add remaining range\n");
+
+			start = node->last + 1;
+
+		} else { /* node->start > start */
+			pr_debug("add new range at front\n");
+
+			start = node->last + 1;
+		}
+
+		if (r)
+			goto out;
+
+		node = next;
+	}
+
+	if (left && start <= last)
+		*left = last - start + 1;
+
+out:
+	if (r)
+		list_for_each_entry_safe(prange, tmp, insert_list, list)
+			svm_range_remove(prange);
+
+	return r;
+}
+
+static void svm_range_srcu_free_work(struct work_struct *work_struct)
+{
+	struct svm_range_list *svms;
+	struct svm_range *prange;
+	struct svm_range *tmp;
+
+	svms = container_of(work_struct, struct svm_range_list, srcu_free_work);
+
+	synchronize_srcu(&svms->srcu);
+
+	mutex_lock(&svms->free_list_lock);
+	list_for_each_entry_safe(prange, tmp, &svms->free_list, remove_list) {
+		list_del(&prange->remove_list);
+		svm_range_remove(prange);
+	}
+	mutex_unlock(&svms->free_list_lock);
+}
+
+void svm_range_list_fini(struct kfd_process *p)
+{
+	pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms);
+
+	/* Ensure srcu free work is finished before process is destroyed */
+	flush_work(&p->svms.srcu_free_work);
+	cleanup_srcu_struct(&p->svms.srcu);
+	mutex_destroy(&p->svms.free_list_lock);
+}
+
+int svm_range_list_init(struct kfd_process *p)
+{
+	struct svm_range_list *svms = &p->svms;
+	int r;
+
+	svms->objects = RB_ROOT_CACHED;
+	mutex_init(&svms->lock);
+	INIT_LIST_HEAD(&svms->list);
+	r = init_srcu_struct(&svms->srcu);
+	if (r) {
+		pr_debug("failed %d to init srcu\n", r);
+		return r;
+	}
+	INIT_WORK(&svms->srcu_free_work, svm_range_srcu_free_work);
+	INIT_LIST_HEAD(&svms->free_list);
+	mutex_init(&svms->free_list_lock);
+
+	return 0;
+}
+
+/**
+ * svm_range_is_valid - check if virtual address range is valid
+ * @mm: current process mm_struct
+ * @start: range start address, in pages
+ * @size: range size, in pages
+ *
+ * A range is valid if it is entirely covered by VMAs that are not device VMAs
+ *
+ * Context: Process context
+ *
+ * Return:
+ *  true - valid svm range
+ *  false - invalid svm range
+ */
+static bool
+svm_range_is_valid(struct mm_struct *mm, uint64_t start, uint64_t size)
+{
+	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
+	struct vm_area_struct *vma;
+	unsigned long end;
+
+	start <<= PAGE_SHIFT;
+	end = start + (size << PAGE_SHIFT);
+
+	do {
+		vma = find_vma(mm, start);
+		if (!vma || start < vma->vm_start ||
+		    (vma->vm_flags & device_vma))
+			return false;
+		start = min(end, vma->vm_end);
+	} while (start < end);
+
+	return true;
+}
+
+/**
+ * svm_range_add - add svm range and handle overlap
+ * @p: the process whose svms list the range is added to
+ * @start: range start address, in pages
+ * @size: range size, in pages
+ * @nattr: number of attributes
+ * @attrs: array of attributes
+ * @update_list: output, the ranges need validate and update GPU mapping
+ * @insert_list: output, the ranges need insert to svms
+ * @remove_list: output, the ranges are replaced and need remove from svms
+ *
+ * Check whether the virtual address range overlaps any registered ranges;
+ * split the overlapping ranges, and copy and adjust the page addresses and
+ * vram nodes in the old and new ranges.
+ *
+ * Context: Process context, the caller must hold svms_lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+static int
+svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
+	      uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs,
+	      struct list_head *update_list, struct list_head *insert_list,
+	      struct list_head *remove_list)
+{
+	uint64_t last = start + size - 1UL;
+	struct svm_range_list *svms;
+	struct svm_range new = {0};
+	struct svm_range *prange;
+	unsigned long left = 0;
+	int r = 0;
+
+	pr_debug("svms 0x%p [0x%llx 0x%llx]\n", &p->svms, start, last);
+
+	r = svm_range_apply_attrs(p, &new, nattr, attrs);
+	if (r)
+		return r;
+
+	svms = &p->svms;
+
+	r = svm_range_handle_overlap(svms, &new, start, last, update_list,
+				     insert_list, remove_list, &left);
+	if (r)
+		return r;
+
+	if (left) {
+		prange = svm_range_new(svms, last - left + 1, last);
+		list_add(&prange->list, insert_list);
+		list_add(&prange->update_list, update_list);
+	}
+
+	return 0;
+}
+
+static int
+svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
+		   uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
+{
+	struct amdkfd_process_info *process_info = p->kgd_process_info;
+	struct mm_struct *mm = current->mm;
+	struct list_head update_list;
+	struct list_head insert_list;
+	struct list_head remove_list;
+	struct svm_range_list *svms;
+	struct svm_range *prange;
+	struct svm_range *tmp;
+	int srcu_idx;
+	int r = 0;
+
+	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
+		 p->pasid, &p->svms, start, start + size - 1, size);
+
+	mmap_read_lock(mm);
+	if (!svm_range_is_valid(mm, start, size)) {
+		pr_debug("invalid range\n");
+		mmap_read_unlock(mm);
+		return -EFAULT;
+	}
+	mmap_read_unlock(mm);
+
+	mutex_lock(&process_info->lock);
+
+	svms = &p->svms;
+	svms_lock(svms);
+
+	r = svm_range_add(p, start, size, nattr, attrs, &update_list,
+			  &insert_list, &remove_list);
+	if (r) {
+		svms_unlock(svms);
+		mutex_unlock(&process_info->lock);
+		return r;
+	}
+
+	list_for_each_entry_safe(prange, tmp, &insert_list, list)
+		svm_range_add_to_svms(prange);
+
+	/* Hold SRCU read lock so pranges are not freed after unlocking svms */
+	srcu_idx = srcu_read_lock(&svms->srcu);
+	svms_unlock(svms);
+
+	/* Hold mmap read lock and check if svm range was unmapped in parallel */
+	mmap_read_lock(mm);
+
+	if (!svm_range_is_valid(mm, start, size)) {
+		pr_debug("range is unmapped\n");
+		mmap_read_unlock(mm);
+		srcu_read_unlock(&svms->srcu, srcu_idx);
+		r = -EFAULT;
+		goto out_remove;
+	}
+
+	list_for_each_entry(prange, &update_list, update_list) {
+
+		r = svm_range_apply_attrs(p, prange, nattr, attrs);
+		if (r) {
+			pr_debug("failed %d to apply attrs\n", r);
+			mmap_read_unlock(mm);
+			srcu_read_unlock(&prange->svms->srcu, srcu_idx);
+			goto out_remove;
+		}
+	}
+
+	srcu_read_unlock(&svms->srcu, srcu_idx);
+	svms_lock(svms);
+
+	mutex_lock(&svms->free_list_lock);
+	list_for_each_entry_safe(prange, tmp, &remove_list, remove_list) {
+		pr_debug("remove overlap prange svms 0x%p [0x%lx 0x%lx]\n",
+			 prange->svms, prange->it_node.start,
+			 prange->it_node.last);
+		svm_range_unlink(prange);
+
+		pr_debug("schedule to free prange svms 0x%p [0x%lx 0x%lx]\n",
+			 prange->svms, prange->it_node.start,
+			 prange->it_node.last);
+		list_add_tail(&prange->remove_list, &svms->free_list);
+	}
+	if (!list_empty(&svms->free_list))
+		schedule_work(&svms->srcu_free_work);
+	mutex_unlock(&svms->free_list_lock);
+
+	svm_range_debug_dump(svms);
+
+	svms_unlock(svms);
+	mmap_read_unlock(mm);
+	mutex_unlock(&process_info->lock);
+
+	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done\n", p->pasid,
+		 &p->svms, start, start + size - 1);
+
+	return 0;
+
+out_remove:
+	svms_lock(svms);
+	list_for_each_entry_safe(prange, tmp, &insert_list, list) {
+		svm_range_unlink(prange);
+		list_add_tail(&prange->remove_list, &svms->free_list);
+	}
+	if (!list_empty(&svms->free_list))
+		schedule_work(&svms->srcu_free_work);
+	svms_unlock(svms);
+	mutex_unlock(&process_info->lock);
+
+	return r;
+}
+
+int
+svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
+	  uint64_t size, uint32_t nattrs, struct kfd_ioctl_svm_attribute *attrs)
+{
+	int r;
+
+	start >>= PAGE_SHIFT;
+	size >>= PAGE_SHIFT;
+
+	switch (op) {
+	case KFD_IOCTL_SVM_OP_SET_ATTR:
+		r = svm_range_set_attr(p, start, size, nattrs, attrs);
+		break;
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
new file mode 100644
index 000000000000..c7c54fb73dfb
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -0,0 +1,93 @@
+/*
+ * Copyright 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef KFD_SVM_H_
+#define KFD_SVM_H_
+
+#include <linux/rwsem.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/sched/mm.h>
+#include <linux/hmm.h>
+#include "amdgpu.h"
+#include "kfd_priv.h"
+
+/**
+ * struct svm_range - shared virtual memory range
+ *
+ * @svms:       list of svm ranges, structure defined in kfd_process
+ * @it_node:    node [start, last] stored in the interval tree; start and last
+ *              are page numbers, so the size in pages is (last - start + 1)
+ * @list:       link list node, used to scan all ranges of svms
+ * @update_list:link list node used to add to update_list
+ * @remove_list:link list node used to add to remove list
+ * @npages:     number of pages
+ * @pages_addr: list of system memory physical page address
+ * @flags:      flags defined as KFD_IOCTL_SVM_FLAG_*
+ * @preferred_loc: preferred location, 0 for CPU, or GPU id
+ * @prefetch_loc: last prefetch location, 0 for CPU, or GPU id
+ * @actual_loc: the actual location, 0 for CPU, or GPU id
+ * @granularity: migration granularity, log2 of the number of pages
+ * @bitmap_access: index bitmap of GPUs which can access the range
+ * @bitmap_aip: index bitmap of GPUs which can access the range in place
+ *
+ * Data structure for a virtual memory range shared by CPU and GPUs. It can be
+ * backed by system memory (RAM) or device memory (VRAM) and can be migrated
+ * between RAM and VRAM in either direction.
+ */
+struct svm_range {
+	struct svm_range_list		*svms;
+	struct interval_tree_node	it_node;
+	struct list_head		list;
+	struct list_head		update_list;
+	struct list_head		remove_list;
+	uint64_t			npages;
+	dma_addr_t			*pages_addr;
+	uint32_t			flags;
+	uint32_t			preferred_loc;
+	uint32_t			prefetch_loc;
+	uint32_t			actual_loc;
+	uint8_t				granularity;
+	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
+	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
+};
+
+static inline void svms_lock(struct svm_range_list *svms)
+{
+	mutex_lock(&svms->lock);
+	svms->saved_flags = memalloc_nofs_save();
+}
+
+static inline void svms_unlock(struct svm_range_list *svms)
+{
+	memalloc_nofs_restore(svms->saved_flags);
+	mutex_unlock(&svms->lock);
+}
+
+int svm_range_list_init(struct kfd_process *p);
+void svm_range_list_fini(struct kfd_process *p);
+int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
+	      uint64_t size, uint32_t nattrs,
+	      struct kfd_ioctl_svm_attribute *attrs);
+
+#endif /* KFD_SVM_H_ */
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 07/35] drm/amdkfd: add svm ioctl GET_ATTR op
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (5 preceding siblings ...)
  2021-01-07  3:00 ` [PATCH 06/35] drm/amdkfd: register svm range Felix Kuehling
@ 2021-01-07  3:00 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 08/35] drm/amdgpu: add common HMM get pages function Felix Kuehling
                   ` (29 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:00 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Get the intersection of attributes over all memory in the given
range
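
For reference, a minimal standalone sketch of the intersection semantics
(illustrative only, not driver code; struct range_attr, intersect() and
LOCATION_UNDEFINED are made-up names): flags are AND-ed across ranges, a
location is reported only when all ranges agree, and the smallest
granularity wins.

  #include <stdint.h>

  #define LOCATION_UNDEFINED 0xffffffffu

  struct range_attr {
          uint32_t location;      /* preferred location of one range */
          uint32_t flags;         /* KFD_IOCTL_SVM_FLAG_* of one range */
          uint8_t  granularity;   /* log2 number of pages */
  };

  static void intersect(const struct range_attr *r, int n, uint32_t *location,
                        uint32_t *flags, uint8_t *granularity)
  {
          *location = LOCATION_UNDEFINED;
          *flags = 0xffffffffu;
          *granularity = 0xff;

          for (int i = 0; i < n; i++) {
                  if (i == 0)
                          *location = r[i].location;
                  else if (*location != r[i].location)
                          *location = LOCATION_UNDEFINED;
                  *flags &= r[i].flags;
                  if (r[i].granularity < *granularity)
                          *granularity = r[i].granularity;
          }
  }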

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 175 ++++++++++++++++++++++++++-
 1 file changed, 173 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 0b0410837be9..017e77e9ae1e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -75,8 +75,8 @@ static void
 svm_range_set_default_attributes(int32_t *location, int32_t *prefetch_loc,
 				 uint8_t *granularity, uint32_t *flags)
 {
-	*location = 0;
-	*prefetch_loc = 0;
+	*location = KFD_IOCTL_SVM_LOCATION_UNDEFINED;
+	*prefetch_loc = KFD_IOCTL_SVM_LOCATION_UNDEFINED;
 	*granularity = 9;
 	*flags =
 		KFD_IOCTL_SVM_FLAG_HOST_ACCESS | KFD_IOCTL_SVM_FLAG_COHERENT;
@@ -581,6 +581,174 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 	return r;
 }
 
+static int
+svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size,
+		   uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
+{
+	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
+	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
+	bool get_preferred_loc = false;
+	bool get_prefetch_loc = false;
+	bool get_granularity = false;
+	bool get_accessible = false;
+	bool get_flags = false;
+	uint64_t last = start + size - 1UL;
+	struct mm_struct *mm = current->mm;
+	uint8_t granularity = 0xff;
+	struct interval_tree_node *node;
+	struct svm_range_list *svms;
+	struct svm_range *prange;
+	uint32_t prefetch_loc = KFD_IOCTL_SVM_LOCATION_UNDEFINED;
+	uint32_t location = KFD_IOCTL_SVM_LOCATION_UNDEFINED;
+	uint32_t flags = 0xffffffff;
+	int gpuidx;
+	uint32_t i;
+
+	pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
+		 start + size - 1, nattr);
+
+	mmap_read_lock(mm);
+	if (!svm_range_is_valid(mm, start, size)) {
+		pr_debug("invalid range\n");
+		mmap_read_unlock(mm);
+		return -EINVAL;
+	}
+	mmap_read_unlock(mm);
+
+	for (i = 0; i < nattr; i++) {
+		switch (attrs[i].type) {
+		case KFD_IOCTL_SVM_ATTR_PREFERRED_LOC:
+			get_preferred_loc = true;
+			break;
+		case KFD_IOCTL_SVM_ATTR_PREFETCH_LOC:
+			get_prefetch_loc = true;
+			break;
+		case KFD_IOCTL_SVM_ATTR_ACCESS:
+			if (!svm_get_supported_dev_by_id(
+					p, attrs[i].value, NULL))
+				return -EINVAL;
+			get_accessible = true;
+			break;
+		case KFD_IOCTL_SVM_ATTR_SET_FLAGS:
+			get_flags = true;
+			break;
+		case KFD_IOCTL_SVM_ATTR_GRANULARITY:
+			get_granularity = true;
+			break;
+		case KFD_IOCTL_SVM_ATTR_CLR_FLAGS:
+		case KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE:
+		case KFD_IOCTL_SVM_ATTR_NO_ACCESS:
+			fallthrough;
+		default:
+			pr_debug("get invalid attr type 0x%x\n", attrs[i].type);
+			return -EINVAL;
+		}
+	}
+
+	svms = &p->svms;
+
+	svms_lock(svms);
+
+	node = interval_tree_iter_first(&svms->objects, start, last);
+	if (!node) {
+		pr_debug("range attrs not found return default values\n");
+		svm_range_set_default_attributes(&location, &prefetch_loc,
+						 &granularity, &flags);
+		/* TODO: Automatically create SVM ranges and map them on
+		 * GPU page faults
+		if (p->xnack_enabled)
+			bitmap_fill(bitmap_access, MAX_GPU_INSTANCE);
+			FIXME: Only set bits for supported GPUs
+			FIXME: I think this should be done inside
+			svm_range_set_default_attributes, so that it will
+			apply to all newly created ranges
+		 */
+
+		goto fill_values;
+	}
+	bitmap_fill(bitmap_access, MAX_GPU_INSTANCE);
+	bitmap_fill(bitmap_aip, MAX_GPU_INSTANCE);
+
+	while (node) {
+		struct interval_tree_node *next;
+
+		prange = container_of(node, struct svm_range, it_node);
+		next = interval_tree_iter_next(node, start, last);
+
+		if (get_preferred_loc) {
+			if (prange->preferred_loc ==
+					KFD_IOCTL_SVM_LOCATION_UNDEFINED ||
+			    (location != KFD_IOCTL_SVM_LOCATION_UNDEFINED &&
+			     location != prange->preferred_loc)) {
+				location = KFD_IOCTL_SVM_LOCATION_UNDEFINED;
+				get_preferred_loc = false;
+			} else {
+				location = prange->preferred_loc;
+			}
+		}
+		if (get_prefetch_loc) {
+			if (prange->prefetch_loc ==
+					KFD_IOCTL_SVM_LOCATION_UNDEFINED ||
+			    (prefetch_loc != KFD_IOCTL_SVM_LOCATION_UNDEFINED &&
+			     prefetch_loc != prange->prefetch_loc)) {
+				prefetch_loc = KFD_IOCTL_SVM_LOCATION_UNDEFINED;
+				get_prefetch_loc = false;
+			} else {
+				prefetch_loc = prange->prefetch_loc;
+			}
+		}
+		if (get_accessible) {
+			bitmap_and(bitmap_access, bitmap_access,
+				   prange->bitmap_access, MAX_GPU_INSTANCE);
+			bitmap_and(bitmap_aip, bitmap_aip,
+				   prange->bitmap_aip, MAX_GPU_INSTANCE);
+		}
+		if (get_flags)
+			flags &= prange->flags;
+
+		if (get_granularity && prange->granularity < granularity)
+			granularity = prange->granularity;
+
+		node = next;
+	}
+fill_values:
+	svms_unlock(svms);
+
+	for (i = 0; i < nattr; i++) {
+		switch (attrs[i].type) {
+		case KFD_IOCTL_SVM_ATTR_PREFERRED_LOC:
+			attrs[i].value = location;
+			break;
+		case KFD_IOCTL_SVM_ATTR_PREFETCH_LOC:
+			attrs[i].value = prefetch_loc;
+			break;
+		case KFD_IOCTL_SVM_ATTR_ACCESS:
+			gpuidx = kfd_process_gpuidx_from_gpuid(p,
+							       attrs[i].value);
+			if (gpuidx < 0) {
+				pr_debug("invalid gpuid %x\n", attrs[i].value);
+				return -EINVAL;
+			}
+			if (test_bit(gpuidx, bitmap_access))
+				attrs[i].type = KFD_IOCTL_SVM_ATTR_ACCESS;
+			else if (test_bit(gpuidx, bitmap_aip))
+				attrs[i].type =
+					KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE;
+			else
+				attrs[i].type = KFD_IOCTL_SVM_ATTR_NO_ACCESS;
+			break;
+		case KFD_IOCTL_SVM_ATTR_SET_FLAGS:
+			attrs[i].value = flags;
+			break;
+		case KFD_IOCTL_SVM_ATTR_GRANULARITY:
+			attrs[i].value = (uint32_t)granularity;
+			break;
+		}
+	}
+
+	return 0;
+}
+
 int
 svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
 	  uint64_t size, uint32_t nattrs, struct kfd_ioctl_svm_attribute *attrs)
@@ -594,6 +762,9 @@ svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
 	case KFD_IOCTL_SVM_OP_SET_ATTR:
 		r = svm_range_set_attr(p, start, size, nattrs, attrs);
 		break;
+	case KFD_IOCTL_SVM_OP_GET_ATTR:
+		r = svm_range_get_attr(p, start, size, nattrs, attrs);
+		break;
 	default:
 		r = -EINVAL;
 		break;
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 08/35] drm/amdgpu: add common HMM get pages function
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (6 preceding siblings ...)
  2021-01-07  3:00 ` [PATCH 07/35] drm/amdkfd: add svm ioctl GET_ATTR op Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07 10:53   ` Christian König
  2021-01-07  3:01 ` [PATCH 09/35] drm/amdkfd: validate svm range system memory Felix Kuehling
                   ` (28 subsequent siblings)
  36 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Move the HMM get pages function from amdgpu_ttm to amdgpu_mn. This
common function will be used by the new SVM APIs.
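
Roughly the intended calling pattern for the new helpers (an illustrative
sketch only; example_get_user_pages() is not part of this patch, and error
handling and notifier locking are trimmed):

  static int example_get_user_pages(struct amdgpu_bo *bo, struct mm_struct *mm,
                                    struct page **pages, uint64_t start,
                                    uint64_t npages, bool readonly)
  {
          struct hmm_range *range;
          int r;

          r = amdgpu_hmm_range_get_pages(&bo->notifier, mm, pages, start,
                                         npages, &range, readonly,
                                         false /* mmap lock not held */);
          if (r)
                  return r;

          /* ... use pages under the notifier lock ... */

          /* Non-zero means the CPU page tables changed while we were
           * working and the snapshot must be retried.
           */
          return amdgpu_hmm_range_get_pages_done(range);
  }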

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 83 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h  |  7 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 76 +++-------------------
 3 files changed, 100 insertions(+), 66 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 828b5167ff12..997da4237a10 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -155,3 +155,86 @@ void amdgpu_mn_unregister(struct amdgpu_bo *bo)
 	mmu_interval_notifier_remove(&bo->notifier);
 	bo->notifier.mm = NULL;
 }
+
+int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
+			       struct mm_struct *mm, struct page **pages,
+			       uint64_t start, uint64_t npages,
+			       struct hmm_range **phmm_range, bool readonly,
+			       bool mmap_locked)
+{
+	struct hmm_range *hmm_range;
+	unsigned long timeout;
+	unsigned long i;
+	unsigned long *pfns;
+	int r = 0;
+
+	hmm_range = kzalloc(sizeof(*hmm_range), GFP_KERNEL);
+	if (unlikely(!hmm_range))
+		return -ENOMEM;
+
+	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+	if (unlikely(!pfns)) {
+		r = -ENOMEM;
+		goto out_free_range;
+	}
+
+	hmm_range->notifier = notifier;
+	hmm_range->default_flags = HMM_PFN_REQ_FAULT;
+	if (!readonly)
+		hmm_range->default_flags |= HMM_PFN_REQ_WRITE;
+	hmm_range->hmm_pfns = pfns;
+	hmm_range->start = start;
+	hmm_range->end = start + npages * PAGE_SIZE;
+	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+
+retry:
+	hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
+
+	if (likely(!mmap_locked))
+		mmap_read_lock(mm);
+
+	r = hmm_range_fault(hmm_range);
+
+	if (likely(!mmap_locked))
+		mmap_read_unlock(mm);
+	if (unlikely(r)) {
+		/*
+		 * FIXME: This timeout should encompass the retry from
+		 * mmu_interval_read_retry() as well.
+		 */
+		if (r == -EBUSY && !time_after(jiffies, timeout))
+			goto retry;
+		goto out_free_pfns;
+	}
+
+	/*
+	 * Due to default_flags, all pages are HMM_PFN_VALID or
+	 * hmm_range_fault() fails. FIXME: The pages cannot be touched outside
+	 * the notifier_lock, and mmu_interval_read_retry() must be done first.
+	 */
+	for (i = 0; pages && i < npages; i++)
+		pages[i] = hmm_pfn_to_page(pfns[i]);
+
+	*phmm_range = hmm_range;
+
+	return 0;
+
+out_free_pfns:
+	kvfree(pfns);
+out_free_range:
+	kfree(hmm_range);
+
+	return r;
+}
+
+int amdgpu_hmm_range_get_pages_done(struct hmm_range *hmm_range)
+{
+	int r;
+
+	r = mmu_interval_read_retry(hmm_range->notifier,
+				    hmm_range->notifier_seq);
+	kvfree(hmm_range->hmm_pfns);
+	kfree(hmm_range);
+
+	return r;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
index a292238f75eb..7f7d37a457c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
@@ -30,6 +30,13 @@
 #include <linux/workqueue.h>
 #include <linux/interval_tree.h>
 
+int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
+			       struct mm_struct *mm, struct page **pages,
+			       uint64_t start, uint64_t npages,
+			       struct hmm_range **phmm_range, bool readonly,
+			       bool mmap_locked);
+int amdgpu_hmm_range_get_pages_done(struct hmm_range *hmm_range);
+
 #if defined(CONFIG_HMM_MIRROR)
 int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr);
 void amdgpu_mn_unregister(struct amdgpu_bo *bo);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index aaad9e304ad9..f423f42cb9b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -32,7 +32,6 @@
 
 #include <linux/dma-mapping.h>
 #include <linux/iommu.h>
-#include <linux/hmm.h>
 #include <linux/pagemap.h>
 #include <linux/sched/task.h>
 #include <linux/sched/mm.h>
@@ -843,10 +842,8 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 	struct amdgpu_ttm_tt *gtt = (void *)ttm;
 	unsigned long start = gtt->userptr;
 	struct vm_area_struct *vma;
-	struct hmm_range *range;
-	unsigned long timeout;
 	struct mm_struct *mm;
-	unsigned long i;
+	bool readonly;
 	int r = 0;
 
 	mm = bo->notifier.mm;
@@ -862,76 +859,26 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
 		return -ESRCH;
 
-	range = kzalloc(sizeof(*range), GFP_KERNEL);
-	if (unlikely(!range)) {
-		r = -ENOMEM;
-		goto out;
-	}
-	range->notifier = &bo->notifier;
-	range->start = bo->notifier.interval_tree.start;
-	range->end = bo->notifier.interval_tree.last + 1;
-	range->default_flags = HMM_PFN_REQ_FAULT;
-	if (!amdgpu_ttm_tt_is_readonly(ttm))
-		range->default_flags |= HMM_PFN_REQ_WRITE;
-
-	range->hmm_pfns = kvmalloc_array(ttm->num_pages,
-					 sizeof(*range->hmm_pfns), GFP_KERNEL);
-	if (unlikely(!range->hmm_pfns)) {
-		r = -ENOMEM;
-		goto out_free_ranges;
-	}
-
 	mmap_read_lock(mm);
 	vma = find_vma(mm, start);
+	mmap_read_unlock(mm);
 	if (unlikely(!vma || start < vma->vm_start)) {
 		r = -EFAULT;
-		goto out_unlock;
+		goto out_putmm;
 	}
 	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
 		vma->vm_file)) {
 		r = -EPERM;
-		goto out_unlock;
+		goto out_putmm;
 	}
-	mmap_read_unlock(mm);
-	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
-
-retry:
-	range->notifier_seq = mmu_interval_read_begin(&bo->notifier);
 
-	mmap_read_lock(mm);
-	r = hmm_range_fault(range);
-	mmap_read_unlock(mm);
-	if (unlikely(r)) {
-		/*
-		 * FIXME: This timeout should encompass the retry from
-		 * mmu_interval_read_retry() as well.
-		 */
-		if (r == -EBUSY && !time_after(jiffies, timeout))
-			goto retry;
-		goto out_free_pfns;
-	}
-
-	/*
-	 * Due to default_flags, all pages are HMM_PFN_VALID or
-	 * hmm_range_fault() fails. FIXME: The pages cannot be touched outside
-	 * the notifier_lock, and mmu_interval_read_retry() must be done first.
-	 */
-	for (i = 0; i < ttm->num_pages; i++)
-		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
-
-	gtt->range = range;
+	readonly = amdgpu_ttm_tt_is_readonly(ttm);
+	r = amdgpu_hmm_range_get_pages(&bo->notifier, mm, pages, start,
+				       ttm->num_pages, &gtt->range, readonly,
+				       false);
+out_putmm:
 	mmput(mm);
 
-	return 0;
-
-out_unlock:
-	mmap_read_unlock(mm);
-out_free_pfns:
-	kvfree(range->hmm_pfns);
-out_free_ranges:
-	kfree(range);
-out:
-	mmput(mm);
 	return r;
 }
 
@@ -960,10 +907,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
 		 * FIXME: Must always hold notifier_lock for this, and must
 		 * not ignore the return code.
 		 */
-		r = mmu_interval_read_retry(gtt->range->notifier,
-					 gtt->range->notifier_seq);
-		kvfree(gtt->range->hmm_pfns);
-		kfree(gtt->range);
+		r = amdgpu_hmm_range_get_pages_done(gtt->range);
 		gtt->range = NULL;
 	}
 
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 09/35] drm/amdkfd: validate svm range system memory
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (7 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 08/35] drm/amdgpu: add common HMM get pages function Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 10/35] drm/amdkfd: register overlap system memory range Felix Kuehling
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Use HMM to get the system memory page addresses, which will be used to
map the range to GPUs or to migrate it to VRAM.
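
After validation, prange->pages_addr[i] holds the physical address of page i
of the range. For illustration (example_prange_phys() is not part of this
patch), the system page backing a CPU virtual address in a validated range
can be looked up as:

  static dma_addr_t example_prange_phys(struct svm_range *prange,
                                        unsigned long addr)
  {
          unsigned long page = (addr >> PAGE_SHIFT) - prange->it_node.start;

          return prange->pages_addr[page];
  }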

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 88 +++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h  |  2 +
 3 files changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index cbb2bae1982d..97cf267b6f51 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -735,6 +735,7 @@ struct svm_range_list {
 	struct work_struct		srcu_free_work;
 	struct list_head		free_list;
 	struct mutex			free_list_lock;
+	struct mmu_interval_notifier	notifier;
 };
 
 /* Process data */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 017e77e9ae1e..02918faa70d5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -135,6 +135,65 @@ svm_get_supported_dev_by_id(struct kfd_process *p, uint32_t gpu_id,
 	return dev;
 }
 
+/**
+ * svm_range_validate_ram - get system memory pages of svm range
+ *
+ * @mm: the mm_struct of process
+ * @prange: the range struct
+ *
+ * After mapping system memory to the GPU, it may be invalidated at any time
+ * while the application runs. The HMM notifier callback keeps the GPU in sync
+ * with CPU page table updates, so no lock is needed to prevent CPU
+ * invalidation and the hmm_range_get_pages_done return value is not checked.
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+static int
+svm_range_validate_ram(struct mm_struct *mm, struct svm_range *prange)
+{
+	uint64_t i;
+	int r;
+
+	if (!prange->pages_addr) {
+		prange->pages_addr = kvmalloc_array(prange->npages,
+						sizeof(*prange->pages_addr),
+						GFP_KERNEL | __GFP_ZERO);
+		if (!prange->pages_addr)
+			return -ENOMEM;
+	}
+
+	r = amdgpu_hmm_range_get_pages(&prange->svms->notifier, mm, NULL,
+				       prange->it_node.start << PAGE_SHIFT,
+				       prange->npages, &prange->hmm_range,
+				       false, true);
+	if (r) {
+		pr_debug("failed %d to get svm range pages\n", r);
+		return r;
+	}
+
+	for (i = 0; i < prange->npages; i++)
+		prange->pages_addr[i] =
+			PFN_PHYS(prange->hmm_range->hmm_pfns[i]);
+
+	amdgpu_hmm_range_get_pages_done(prange->hmm_range);
+	prange->hmm_range = NULL;
+
+	return 0;
+}
+
+static int
+svm_range_validate(struct mm_struct *mm, struct svm_range *prange)
+{
+	int r = 0;
+
+	pr_debug("actual loc 0x%x\n", prange->actual_loc);
+
+	r = svm_range_validate_ram(mm, prange);
+
+	return r;
+}
+
 static int
 svm_range_apply_attrs(struct kfd_process *p, struct svm_range *prange,
 		      uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
@@ -349,10 +408,28 @@ static void svm_range_srcu_free_work(struct work_struct *work_struct)
 	mutex_unlock(&svms->free_list_lock);
 }
 
+/**
+ * svm_range_cpu_invalidate_pagetables - interval notifier callback
+ *
+ */
+static bool
+svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
+				    const struct mmu_notifier_range *range,
+				    unsigned long cur_seq)
+{
+	return true;
+}
+
+static const struct mmu_interval_notifier_ops svm_range_mn_ops = {
+	.invalidate = svm_range_cpu_invalidate_pagetables,
+};
+
 void svm_range_list_fini(struct kfd_process *p)
 {
 	pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms);
 
+	mmu_interval_notifier_remove(&p->svms.notifier);
+
 	/* Ensure srcu free work is finished before process is destroyed */
 	flush_work(&p->svms.srcu_free_work);
 	cleanup_srcu_struct(&p->svms.srcu);
@@ -375,6 +452,8 @@ int svm_range_list_init(struct kfd_process *p)
 	INIT_WORK(&svms->srcu_free_work, svm_range_srcu_free_work);
 	INIT_LIST_HEAD(&svms->free_list);
 	mutex_init(&svms->free_list_lock);
+	mmu_interval_notifier_insert(&svms->notifier, current->mm, 0, ~1ULL,
+				     &svm_range_mn_ops);
 
 	return 0;
 }
@@ -531,6 +610,15 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 		r = svm_range_apply_attrs(p, prange, nattr, attrs);
 		if (r) {
 			pr_debug("failed %d to apply attrs\n", r);
+			goto out_unlock;
+		}
+
+		r = svm_range_validate(mm, prange);
+		if (r)
+			pr_debug("failed %d to validate svm range\n", r);
+
+out_unlock:
+		if (r) {
 			mmap_read_unlock(mm);
 			srcu_read_unlock(&prange->svms->srcu, srcu_idx);
 			goto out_remove;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index c7c54fb73dfb..4d394f72eefc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -41,6 +41,7 @@
  * @list:       link list node, used to scan all ranges of svms
  * @update_list:link list node used to add to update_list
  * @remove_list:link list node used to add to remove list
+ * @hmm_range:  hmm range structure used by hmm_range_fault to get system pages
  * @npages:     number of pages
  * @pages_addr: list of system memory physical page address
  * @flags:      flags defined as KFD_IOCTL_SVM_FLAG_*
@@ -61,6 +62,7 @@ struct svm_range {
 	struct list_head		list;
 	struct list_head		update_list;
 	struct list_head		remove_list;
+	struct hmm_range		*hmm_range;
 	uint64_t			npages;
 	dma_addr_t			*pages_addr;
 	uint32_t			flags;
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 10/35] drm/amdkfd: register overlap system memory range
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (8 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 09/35] drm/amdkfd: validate svm range system memory Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 11/35] drm/amdkfd: deregister svm range Felix Kuehling
                   ` (26 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

No overlapping intervals [start, last] may exist in the svms object
interval tree. If a process registers a new range that overlaps an old
range, the old range is split into two ranges, depending on whether the
overlap is at the head or the tail of the old range.
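
For illustration (not part of the patch), the five overlap cases
svm_range_handle_overlap() distinguishes between an existing range
[old_start, old_last] in the interval tree and the newly registered range
[start, last] can be summarized as:

  enum overlap_case {
          OLD_ENCLOSES_NEW,       /* old_start < start && old_last > last   */
          OLD_STARTS_BEFORE,      /* old_start < start && old_last <= last  */
          OLD_ENDS_AFTER,         /* old_start == start && old_last > last  */
          SAME_START,             /* old_start == start && old_last <= last */
          NEW_STARTS_BEFORE,      /* old_start > start                      */
  };

  static enum overlap_case classify(unsigned long old_start,
                                    unsigned long old_last,
                                    unsigned long start, unsigned long last)
  {
          if (old_start < start && old_last > last)
                  return OLD_ENCLOSES_NEW;        /* old is split in two     */
          if (old_start < start)
                  return OLD_STARTS_BEFORE;       /* overlap at old's tail   */
          if (old_start == start && old_last > last)
                  return OLD_ENDS_AFTER;          /* overlap at old's head   */
          if (old_start == start)
                  return SAME_START;              /* exact match or shorter  */
          return NEW_STARTS_BEFORE;               /* add new piece in front  */
  }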

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 297 ++++++++++++++++++++++++++-
 1 file changed, 294 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 02918faa70d5..ad007261f54c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -293,6 +293,278 @@ static void svm_range_debug_dump(struct svm_range_list *svms)
 	}
 }
 
+static bool
+svm_range_is_same_attrs(struct svm_range *old, struct svm_range *new)
+{
+	return (old->prefetch_loc == new->prefetch_loc &&
+		old->flags == new->flags &&
+		old->granularity == new->granularity);
+}
+
+static int
+svm_range_split_pages(struct svm_range *new, struct svm_range *old,
+		      uint64_t start, uint64_t last)
+{
+	unsigned long old_start;
+	dma_addr_t *pages_addr;
+	uint64_t d;
+
+	old_start = old->it_node.start;
+	new->pages_addr = kvmalloc_array(new->npages,
+					 sizeof(*new->pages_addr),
+					 GFP_KERNEL | __GFP_ZERO);
+	if (!new->pages_addr)
+		return -ENOMEM;
+
+	d = new->it_node.start - old_start;
+	memcpy(new->pages_addr, old->pages_addr + d,
+	       new->npages * sizeof(*new->pages_addr));
+
+	old->npages = last - start + 1;
+	old->it_node.start = start;
+	old->it_node.last = last;
+
+	pages_addr = kvmalloc_array(old->npages, sizeof(*pages_addr),
+				    GFP_KERNEL);
+	if (!pages_addr) {
+		kvfree(new->pages_addr);
+		return -ENOMEM;
+	}
+
+	d = start - old_start;
+	memcpy(pages_addr, old->pages_addr + d,
+	       old->npages * sizeof(*pages_addr));
+
+	kvfree(old->pages_addr);
+	old->pages_addr = pages_addr;
+
+	return 0;
+}
+
+/**
+ * svm_range_split_adjust - split range and adjust
+ *
+ * @new: new range
+ * @old: the old range
+ * @start: the start address, in pages, that the old range is adjusted to
+ * @last: the last address, in pages, that the old range is adjusted to
+ *
+ * Copy the system memory pages and pages_addr or vram mm_nodes of the old
+ * range into the new range, covering new->npages pages from new->it_node.start;
+ * the remaining old range then covers start to last.
+ *
+ * Return:
+ * 0 - OK, -ENOMEM - out of memory
+ */
+static int
+svm_range_split_adjust(struct svm_range *new, struct svm_range *old,
+		      uint64_t start, uint64_t last)
+{
+	int r = -EINVAL;
+
+	pr_debug("svms 0x%p new 0x%lx old [0x%lx 0x%lx] => [0x%llx 0x%llx]\n",
+		 new->svms, new->it_node.start, old->it_node.start,
+		 old->it_node.last, start, last);
+
+	if (new->it_node.start < old->it_node.start ||
+	    new->it_node.last > old->it_node.last) {
+		WARN_ONCE(1, "invalid new range start or last\n");
+		return -EINVAL;
+	}
+
+	if (old->pages_addr)
+		r = svm_range_split_pages(new, old, start, last);
+	else
+		WARN_ONCE(1, "split adjust invalid pages_addr and nodes\n");
+	if (r)
+		return r;
+
+	new->flags = old->flags;
+	new->preferred_loc = old->preferred_loc;
+	new->prefetch_loc = old->prefetch_loc;
+	new->actual_loc = old->actual_loc;
+	new->granularity = old->granularity;
+	bitmap_copy(new->bitmap_access, old->bitmap_access, MAX_GPU_INSTANCE);
+	bitmap_copy(new->bitmap_aip, old->bitmap_aip, MAX_GPU_INSTANCE);
+
+	return 0;
+}
+
+/**
+ * svm_range_split - split a range in 2 ranges
+ *
+ * @prange: the svm range to split
+ * @start: the remaining range start address in pages
+ * @last: the remaining range last address in pages
+ * @new: the result new range generated
+ *
+ * Two cases only:
+ * case 1: if start == prange->it_node.start
+ *         prange ==> prange[start, last]
+ *         new range [last + 1, prange->it_node.last]
+ *
+ * case 2: if last == prange->it_node.last
+ *         prange ==> prange[start, last]
+ *         new range [prange->it_node.start, start - 1]
+ *
+ * Context: The caller must hold svms_lock
+ *
+ * Return:
+ * 0 - OK, -ENOMEM - out of memory, -EINVAL - invalid start, last
+ */
+static int
+svm_range_split(struct svm_range *prange, uint64_t start, uint64_t last,
+		struct svm_range **new)
+{
+	uint64_t old_start = prange->it_node.start;
+	uint64_t old_last = prange->it_node.last;
+	struct svm_range_list *svms;
+	int r = 0;
+
+	pr_debug("svms 0x%p [0x%llx 0x%llx] to [0x%llx 0x%llx]\n", prange->svms,
+		 old_start, old_last, start, last);
+
+	if (old_start != start && old_last != last)
+		return -EINVAL;
+	if (start < old_start || last > old_last)
+		return -EINVAL;
+
+	svms = prange->svms;
+	if (old_start == start) {
+		*new = svm_range_new(svms, last + 1, old_last);
+		if (!*new)
+			return -ENOMEM;
+		r = svm_range_split_adjust(*new, prange, start, last);
+	} else {
+		*new = svm_range_new(svms, old_start, start - 1);
+		if (!*new)
+			return -ENOMEM;
+		r = svm_range_split_adjust(*new, prange, start, last);
+	}
+
+	return r;
+}
+
+static int
+svm_range_split_two(struct svm_range *prange, struct svm_range *new,
+		    uint64_t start, uint64_t last,
+		    struct list_head *insert_list,
+		    struct list_head *update_list)
+{
+	struct svm_range *tail, *tail2;
+	int r;
+
+	r = svm_range_split(prange, prange->it_node.start, start - 1, &tail);
+	if (r)
+		return r;
+	r = svm_range_split(tail, start, last, &tail2);
+	if (r)
+		return r;
+	list_add(&tail2->list, insert_list);
+	list_add(&tail->list, insert_list);
+
+	if (!svm_range_is_same_attrs(prange, new))
+		list_add(&tail->update_list, update_list);
+
+	return 0;
+}
+
+static int
+svm_range_split_tail(struct svm_range *prange, struct svm_range *new,
+		     uint64_t start, struct list_head *insert_list,
+		     struct list_head *update_list)
+{
+	struct svm_range *tail;
+	int r;
+
+	r = svm_range_split(prange, prange->it_node.start, start - 1, &tail);
+	if (r)
+		return r;
+	list_add(&tail->list, insert_list);
+	if (!svm_range_is_same_attrs(prange, new))
+		list_add(&tail->update_list, update_list);
+
+	return 0;
+}
+
+static int
+svm_range_split_head(struct svm_range *prange, struct svm_range *new,
+		     uint64_t last, struct list_head *insert_list,
+		     struct list_head *update_list)
+{
+	struct svm_range *head;
+	int r;
+
+	r = svm_range_split(prange, last + 1, prange->it_node.last, &head);
+	if (r)
+		return r;
+	list_add(&head->list, insert_list);
+	if (!svm_range_is_same_attrs(prange, new))
+		list_add(&head->update_list, update_list);
+
+	return 0;
+}
+
+static int
+svm_range_split_add_front(struct svm_range *prange, struct svm_range *new,
+			  uint64_t start, uint64_t last,
+			  struct list_head *insert_list,
+			  struct list_head *update_list)
+{
+	struct svm_range *front, *tail;
+	int r = 0;
+
+	front = svm_range_new(prange->svms, start, prange->it_node.start - 1);
+	if (!front)
+		return -ENOMEM;
+
+	list_add(&front->list, insert_list);
+	list_add(&front->update_list, update_list);
+
+	if (prange->it_node.last > last) {
+		pr_debug("split old in 2\n");
+		r = svm_range_split(prange, prange->it_node.start, last, &tail);
+		if (r)
+			return r;
+		list_add(&tail->list, insert_list);
+	}
+	if (!svm_range_is_same_attrs(prange, new))
+		list_add(&prange->update_list, update_list);
+
+	return 0;
+}
+
+struct svm_range *svm_range_clone(struct svm_range *old)
+{
+	struct svm_range *new;
+
+	new = svm_range_new(old->svms, old->it_node.start, old->it_node.last);
+	if (!new)
+		return NULL;
+
+	if (old->pages_addr) {
+		new->pages_addr = kvmalloc_array(new->npages,
+						 sizeof(*new->pages_addr),
+						 GFP_KERNEL);
+		if (!new->pages_addr) {
+			kfree(new);
+			return NULL;
+		}
+		memcpy(new->pages_addr, old->pages_addr,
+		       old->npages * sizeof(*old->pages_addr));
+	}
+
+	new->flags = old->flags;
+	new->preferred_loc = old->preferred_loc;
+	new->prefetch_loc = old->prefetch_loc;
+	new->actual_loc = old->actual_loc;
+	new->granularity = old->granularity;
+	bitmap_copy(new->bitmap_access, old->bitmap_access, MAX_GPU_INSTANCE);
+	bitmap_copy(new->bitmap_aip, old->bitmap_aip, MAX_GPU_INSTANCE);
+
+	return new;
+}
+
 /**
  * svm_range_handle_overlap - split overlap ranges
  * @svms: svm range list header
@@ -334,15 +606,27 @@ svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new,
 	node = interval_tree_iter_first(&svms->objects, start, last);
 	while (node) {
 		struct interval_tree_node *next;
+		struct svm_range *old;
 
 		pr_debug("found overlap node [0x%lx 0x%lx]\n", node->start,
 			 node->last);
 
-		prange = container_of(node, struct svm_range, it_node);
+		old = container_of(node, struct svm_range, it_node);
 		next = interval_tree_iter_next(node, start, last);
 
+		prange = svm_range_clone(old);
+		if (!prange) {
+			r = -ENOMEM;
+			goto out;
+		}
+
+		list_add(&old->remove_list, remove_list);
+		list_add(&prange->list, insert_list);
+
 		if (node->start < start && node->last > last) {
 			pr_debug("split in 2 ranges\n");
+			r = svm_range_split_two(prange, new, start, last,
+						insert_list, update_list);
 			start = last + 1;
 
 		} else if (node->start < start) {
@@ -352,11 +636,15 @@ svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new,
 			 */
 			uint64_t old_last = node->last;
 
+			pr_debug("change old range last\n");
+			r = svm_range_split_tail(prange, new, start,
+						 insert_list, update_list);
 			start = old_last + 1;
 
 		} else if (node->start == start && node->last > last) {
 			pr_debug("change old range start\n");
-
+			r = svm_range_split_head(prange, new, last,
+						 insert_list, update_list);
 			start = last + 1;
 
 		} else if (node->start == start) {
@@ -364,12 +652,15 @@ svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new,
 				pr_debug("found exactly same range\n");
 			else
 				pr_debug("next loop to add remaining range\n");
+			if (!svm_range_is_same_attrs(prange, new))
+				list_add(&prange->update_list, update_list);
 
 			start = node->last + 1;
 
 		} else { /* node->start > start */
 			pr_debug("add new range at front\n");
-
+			r = svm_range_split_add_front(prange, new, start, last,
+						      insert_list, update_list);
 			start = node->last + 1;
 		}
 
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 11/35] drm/amdkfd: deregister svm range
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (9 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 10/35] drm/amdkfd: register overlap system memory range Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 12/35] drm/amdgpu: export vm update mapping interface Felix Kuehling
                   ` (25 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

When an application explicitly calls munmap, or the range is unmapped by
mmput when the application exits, the driver receives an MMU_NOTIFY_UNMAP
event. It first removes the svm range from the process svms object tree
and list, then unmaps it from the GPUs (in a following patch).

Split the svm ranges to handle unmapping of a partial svm range.
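
An illustrative user space sequence that exercises the partial-unmap path
(assuming the buffer was previously registered as a single SVM range via the
KFD SVM ioctl):

  #include <sys/mman.h>
  #include <unistd.h>

  int example_partial_unmap(void)
  {
          long ps = sysconf(_SC_PAGESIZE);
          char *buf = mmap(NULL, 4 * ps, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* ... register [buf, buf + 4 * ps) as one SVM range ... */

          /* Unmapping the middle two pages triggers MMU_NOTIFY_UNMAP for
           * them; the driver splits the registered range so that only the
           * first and the last page remain registered.
           */
          return munmap(buf + ps, 2 * ps);
  }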

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 86 ++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index ad007261f54c..55500ec4972f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -699,15 +699,101 @@ static void svm_range_srcu_free_work(struct work_struct *work_struct)
 	mutex_unlock(&svms->free_list_lock);
 }
 
+static void
+svm_range_unmap_from_cpu(struct mm_struct *mm, unsigned long start,
+			 unsigned long last)
+{
+	struct list_head remove_list;
+	struct list_head update_list;
+	struct list_head insert_list;
+	struct svm_range_list *svms;
+	struct svm_range new = {0};
+	struct svm_range *prange;
+	struct svm_range *tmp;
+	struct kfd_process *p;
+	int r;
+
+	p = kfd_lookup_process_by_mm(mm);
+	if (!p)
+		return;
+	svms = &p->svms;
+
+	pr_debug("notifier svms 0x%p [0x%lx 0x%lx]\n", svms, start, last);
+
+	svms_lock(svms);
+
+	r = svm_range_handle_overlap(svms, &new, start, last, &update_list,
+				     &insert_list, &remove_list, NULL);
+	if (r) {
+		svms_unlock(svms);
+		kfd_unref_process(p);
+		return;
+	}
+
+	mutex_lock(&svms->free_list_lock);
+	list_for_each_entry_safe(prange, tmp, &remove_list, remove_list) {
+		pr_debug("remove svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+			 prange->it_node.start, prange->it_node.last);
+		svm_range_unlink(prange);
+
+		pr_debug("schedule to free svms 0x%p [0x%lx 0x%lx]\n",
+			 prange->svms, prange->it_node.start,
+			 prange->it_node.last);
+		list_add_tail(&prange->remove_list, &svms->free_list);
+	}
+	if (!list_empty(&svms->free_list))
+		schedule_work(&svms->srcu_free_work);
+	mutex_unlock(&svms->free_list_lock);
+
+	/* pranges on update_list are being unmapped from the CPU, remove them
+	 * from the insert list
+	 */
+	list_for_each_entry_safe(prange, tmp, &update_list, update_list) {
+		list_del(&prange->list);
+		mutex_lock(&svms->free_list_lock);
+		list_add_tail(&prange->remove_list, &svms->free_list);
+		mutex_unlock(&svms->free_list_lock);
+	}
+	mutex_lock(&svms->free_list_lock);
+	if (!list_empty(&svms->free_list))
+		schedule_work(&svms->srcu_free_work);
+	mutex_unlock(&svms->free_list_lock);
+
+	list_for_each_entry_safe(prange, tmp, &insert_list, list)
+		svm_range_add_to_svms(prange);
+
+	svms_unlock(svms);
+	kfd_unref_process(p);
+}
+
 /**
  * svm_range_cpu_invalidate_pagetables - interval notifier callback
  *
+ * MMU range unmap notifier to remove svm ranges
  */
 static bool
 svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
 				    const struct mmu_notifier_range *range,
 				    unsigned long cur_seq)
 {
+	unsigned long start = range->start >> PAGE_SHIFT;
+	unsigned long last = (range->end - 1) >> PAGE_SHIFT;
+	struct svm_range_list *svms;
+
+	svms = container_of(mni, struct svm_range_list, notifier);
+
+	if (range->event == MMU_NOTIFY_RELEASE) {
+		pr_debug("cpu release range [0x%lx 0x%lx]\n", range->start,
+			 range->end - 1);
+		return true;
+	}
+	if (range->event == MMU_NOTIFY_UNMAP) {
+		pr_debug("mm 0x%p unmap range [0x%lx 0x%lx]\n", range->mm,
+			 start, last);
+		svm_range_unmap_from_cpu(mni->mm, start, last);
+		return true;
+	}
+
 	return true;
 }
 
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 12/35] drm/amdgpu: export vm update mapping interface
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (10 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 11/35] drm/amdkfd: deregister svm range Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07 10:54   ` Christian König
  2021-01-07  3:01 ` [PATCH 13/35] drm/amdkfd: map svm range to GPUs Felix Kuehling
                   ` (24 subsequent siblings)
  36 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

It will be used by KFD to map svm ranges to GPUs. Because an svm range
has no amdgpu_bo and bo_va, it cannot use the amdgpu_bo_update interface
and has to use the amdgpu vm update interface directly.
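
A sketch of the intended KFD call, simplified from the following patch (no BO
and no drm_mm_node; the addresses of the system pages are passed through the
pages_addr array instead):

  r = amdgpu_vm_bo_update_mapping(adev, adev, vm,
                                  false /* immediate */, false /* unlocked */,
                                  NULL /* resv */,
                                  prange->it_node.start, prange->it_node.last,
                                  pte_flags, 0 /* offset */,
                                  NULL /* nodes */, prange->pages_addr,
                                  &vm->last_update);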

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++++++++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 10 ++++++++++
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index fdbe7d4e8b8b..9c557e8bf0e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1589,15 +1589,14 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
  * Returns:
  * 0 for success, -EINVAL for failure.
  */
-static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
-				       struct amdgpu_device *bo_adev,
-				       struct amdgpu_vm *vm, bool immediate,
-				       bool unlocked, struct dma_resv *resv,
-				       uint64_t start, uint64_t last,
-				       uint64_t flags, uint64_t offset,
-				       struct drm_mm_node *nodes,
-				       dma_addr_t *pages_addr,
-				       struct dma_fence **fence)
+int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
+				struct amdgpu_device *bo_adev,
+				struct amdgpu_vm *vm, bool immediate,
+				bool unlocked, struct dma_resv *resv,
+				uint64_t start, uint64_t last, uint64_t flags,
+				uint64_t offset, struct drm_mm_node *nodes,
+				dma_addr_t *pages_addr,
+				struct dma_fence **fence)
 {
 	struct amdgpu_vm_update_params params;
 	enum amdgpu_sync_mode sync_mode;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 2bf4ef5fb3e1..73ca630520fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -366,6 +366,8 @@ struct amdgpu_vm_manager {
 	spinlock_t				pasid_lock;
 };
 
+struct amdgpu_bo_va_mapping;
+
 #define amdgpu_vm_copy_pte(adev, ib, pe, src, count) ((adev)->vm_manager.vm_pte_funcs->copy_pte((ib), (pe), (src), (count)))
 #define amdgpu_vm_write_pte(adev, ib, pe, value, count, incr) ((adev)->vm_manager.vm_pte_funcs->write_pte((ib), (pe), (value), (count), (incr)))
 #define amdgpu_vm_set_pte_pde(adev, ib, pe, addr, count, incr, flags) ((adev)->vm_manager.vm_pte_funcs->set_pte_pde((ib), (pe), (addr), (count), (incr), (flags)))
@@ -397,6 +399,14 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
 			  struct dma_fence **fence);
 int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
 			   struct amdgpu_vm *vm);
+int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
+				struct amdgpu_device *bo_adev,
+				struct amdgpu_vm *vm, bool immediate,
+				bool unlocked, struct dma_resv *resv,
+				uint64_t start, uint64_t last, uint64_t flags,
+				uint64_t offset, struct drm_mm_node *nodes,
+				dma_addr_t *pages_addr,
+				struct dma_fence **fence);
 int amdgpu_vm_bo_update(struct amdgpu_device *adev,
 			struct amdgpu_bo_va *bo_va,
 			bool clear);
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 13/35] drm/amdkfd: map svm range to GPUs
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (11 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 12/35] drm/amdgpu: export vm update mapping interface Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 14/35] drm/amdkfd: svm range eviction and restore Felix Kuehling
                   ` (23 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Use amdgpu_vm_bo_update_mapping to update GPU page tables, mapping or
unmapping the system memory pages of an svm range on the GPUs.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 ++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   2 +
 2 files changed, 233 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 55500ec4972f..3c4a036609c4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -534,6 +534,229 @@ svm_range_split_add_front(struct svm_range *prange, struct svm_range *new,
 	return 0;
 }
 
+static uint64_t
+svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange)
+{
+	uint32_t flags = prange->flags;
+	uint32_t mapping_flags;
+	uint64_t pte_flags;
+
+	pte_flags = AMDGPU_PTE_VALID;
+	pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED;
+
+	mapping_flags = AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE;
+
+	if (flags & KFD_IOCTL_SVM_FLAG_GPU_RO)
+		mapping_flags &= ~AMDGPU_VM_PAGE_WRITEABLE;
+	if (flags & KFD_IOCTL_SVM_FLAG_GPU_EXEC)
+		mapping_flags |= AMDGPU_VM_PAGE_EXECUTABLE;
+	if (flags & KFD_IOCTL_SVM_FLAG_COHERENT)
+		mapping_flags |= AMDGPU_VM_MTYPE_UC;
+	else
+		mapping_flags |= AMDGPU_VM_MTYPE_NC;
+
+	/* TODO: add CHIP_ARCTURUS new flags for vram mapping */
+
+	pte_flags |= amdgpu_gem_va_map_flags(adev, mapping_flags);
+
+	/* Apply ASIC specific mapping flags */
+	amdgpu_gmc_get_vm_pte(adev, &prange->mapping, &pte_flags);
+
+	pr_debug("PTE flags 0x%llx\n", pte_flags);
+
+	return pte_flags;
+}
+
+static int
+svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
+			 struct svm_range *prange, struct dma_fence **fence)
+{
+	uint64_t init_pte_value = 0;
+	uint64_t start;
+	uint64_t last;
+
+	start = prange->it_node.start;
+	last = prange->it_node.last;
+
+	pr_debug("svms 0x%p [0x%llx 0x%llx]\n", prange->svms, start, last);
+
+	return amdgpu_vm_bo_update_mapping(adev, adev, vm, false, true, NULL,
+					   start, last, init_pte_value, 0,
+					   NULL, NULL, fence);
+}
+
+static int
+svm_range_unmap_from_gpus(struct svm_range *prange)
+{
+	DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE);
+	struct kfd_process_device *pdd;
+	struct dma_fence *fence = NULL;
+	struct amdgpu_device *adev;
+	struct kfd_process *p;
+	struct kfd_dev *dev;
+	uint32_t gpuidx;
+	int r = 0;
+
+	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
+		  MAX_GPU_INSTANCE);
+	p = container_of(prange->svms, struct kfd_process, svms);
+
+	for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
+		pr_debug("unmap from gpu idx 0x%x\n", gpuidx);
+		r = kfd_process_device_from_gpuidx(p, gpuidx, &dev);
+		if (r) {
+			pr_debug("failed to find device idx %d\n", gpuidx);
+			return -EINVAL;
+		}
+
+		pdd = kfd_bind_process_to_device(dev, p);
+		if (IS_ERR(pdd))
+			return -EINVAL;
+
+		adev = (struct amdgpu_device *)dev->kgd;
+
+		r = svm_range_unmap_from_gpu(adev, pdd->vm, prange, &fence);
+		if (r)
+			break;
+
+		if (fence) {
+			r = dma_fence_wait(fence, false);
+			dma_fence_put(fence);
+			fence = NULL;
+			if (r)
+				break;
+		}
+
+		amdgpu_amdkfd_flush_gpu_tlb_pasid((struct kgd_dev *)adev,
+						  p->pasid);
+	}
+
+	return r;
+}
+
+static int svm_range_bo_validate(void *param, struct amdgpu_bo *bo)
+{
+	struct ttm_operation_ctx ctx = { false, false };
+
+	amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_VRAM);
+
+	return ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
+}
+
+static int
+svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
+		     struct svm_range *prange, bool reserve_vm,
+		     struct dma_fence **fence)
+{
+	struct amdgpu_bo *root;
+	dma_addr_t *pages_addr;
+	uint64_t pte_flags;
+	int r = 0;
+
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	if (reserve_vm) {
+		root = amdgpu_bo_ref(vm->root.base.bo);
+		r = amdgpu_bo_reserve(root, true);
+		if (r) {
+			pr_debug("failed %d to reserve root bo\n", r);
+			amdgpu_bo_unref(&root);
+			goto out;
+		}
+		r = amdgpu_vm_validate_pt_bos(adev, vm, svm_range_bo_validate,
+					      NULL);
+		if (r) {
+			pr_debug("failed %d validate pt bos\n", r);
+			goto unreserve_out;
+		}
+	}
+
+	prange->mapping.start = prange->it_node.start;
+	prange->mapping.last = prange->it_node.last;
+	prange->mapping.offset = 0;
+	pte_flags = svm_range_get_pte_flags(adev, prange);
+	prange->mapping.flags = pte_flags;
+	pages_addr = prange->pages_addr;
+
+	r = amdgpu_vm_bo_update_mapping(adev, adev, vm, false, false, NULL,
+					prange->mapping.start,
+					prange->mapping.last, pte_flags,
+					prange->mapping.offset, NULL,
+					pages_addr, &vm->last_update);
+	if (r) {
+		pr_debug("failed %d to map to gpu 0x%lx\n", r,
+			 prange->it_node.start);
+		goto unreserve_out;
+	}
+
+
+	r = amdgpu_vm_update_pdes(adev, vm, false);
+	if (r) {
+		pr_debug("failed %d to update directories 0x%lx\n", r,
+			 prange->it_node.start);
+		goto unreserve_out;
+	}
+
+	if (fence)
+		*fence = dma_fence_get(vm->last_update);
+
+unreserve_out:
+	if (reserve_vm) {
+		amdgpu_bo_unreserve(root);
+		amdgpu_bo_unref(&root);
+	}
+
+out:
+	return r;
+}
+
+static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
+{
+	DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE);
+	struct kfd_process_device *pdd;
+	struct amdgpu_device *adev;
+	struct kfd_process *p;
+	struct kfd_dev *dev;
+	struct dma_fence *fence = NULL;
+	uint32_t gpuidx;
+	int r = 0;
+
+	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
+		  MAX_GPU_INSTANCE);
+	p = container_of(prange->svms, struct kfd_process, svms);
+
+	for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
+		r = kfd_process_device_from_gpuidx(p, gpuidx, &dev);
+		if (r) {
+			pr_debug("failed to find device idx %d\n", gpuidx);
+			return -EINVAL;
+		}
+
+		pdd = kfd_bind_process_to_device(dev, p);
+		if (IS_ERR(pdd))
+			return -EINVAL;
+		adev = (struct amdgpu_device *)dev->kgd;
+
+		r = svm_range_map_to_gpu(adev, pdd->vm, prange, reserve_vm,
+					 &fence);
+		if (r)
+			break;
+
+		if (fence) {
+			r = dma_fence_wait(fence, false);
+			dma_fence_put(fence);
+			fence = NULL;
+			if (r) {
+				pr_debug("failed %d to dma fence wait\n", r);
+				break;
+			}
+		}
+	}
+
+	return r;
+}
+
 struct svm_range *svm_range_clone(struct svm_range *old)
 {
 	struct svm_range *new;
@@ -750,6 +973,7 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, unsigned long start,
 	 */
 	list_for_each_entry_safe(prange, tmp, &update_list, update_list) {
 		list_del(&prange->list);
+		svm_range_unmap_from_gpus(prange);
 		mutex_lock(&svms->free_list_lock);
 		list_add_tail(&prange->remove_list, &svms->free_list);
 		mutex_unlock(&svms->free_list_lock);
@@ -991,8 +1215,14 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 		}
 
 		r = svm_range_validate(mm, prange);
-		if (r)
+		if (r) {
 			pr_debug("failed %d to validate svm range\n", r);
+			goto out_unlock;
+		}
+
+		r = svm_range_map_to_gpus(prange, true);
+		if (r)
+			pr_debug("failed %d to map svm range\n", r);
 
 out_unlock:
 		if (r) {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 4d394f72eefc..fb68b5ee54f8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -42,6 +42,7 @@
  * @update_list:link list node used to add to update_list
  * @remove_list:link list node used to add to remove list
  * @hmm_range:  hmm range structure used by hmm_range_fault to get system pages
+ * @mapping:    bo_va mapping structure to create and update GPU page table
  * @npages:     number of pages
  * @pages_addr: list of system memory physical page address
  * @flags:      flags defined as KFD_IOCTL_SVM_FLAG_*
@@ -63,6 +64,7 @@ struct svm_range {
 	struct list_head		update_list;
 	struct list_head		remove_list;
 	struct hmm_range		*hmm_range;
+	struct amdgpu_bo_va_mapping	mapping;
 	uint64_t			npages;
 	dma_addr_t			*pages_addr;
 	uint32_t			flags;
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 14/35] drm/amdkfd: svm range eviction and restore
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (12 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 13/35] drm/amdkfd: map svm range to GPUs Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 15/35] drm/amdkfd: add xnack enabled flag to kfd_process Felix Kuehling
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

The HMM interval notifier callback signals that the CPU page table will be
updated. Stop the process queues if the updated address belongs to an svm
range registered in the process svms objects tree, and schedule restore
work to update the GPU page table with the new page addresses of the
updated svm range.

The svm restore work uses SRCU to scan the svms list, to avoid a deadlock
between the following two cases:

case 1: svm restore work takes the svm lock to scan the svms list, then
calls hmm_range_fault, which takes mm->mmap_sem.
case 2: the unmap event callback and the set_attr ioctl take mm->mmap_sem,
then take the svm lock to add/remove ranges.

Calling synchronize_srcu in the unmap event callback would deadlock with
the restore work, because the restore work may wait for the unmap event to
finish before taking mm->mmap_sem. Instead, schedule srcu_free_work to
wait for the SRCU read-side critical section in the restore work to finish
before freeing the svm ranges.
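
The locking pattern this amounts to looks roughly like the sketch below
(illustrative only; restore_one_range() and free_ranges() stand in for the
real validate/map and free paths in this series):

	/* restore work: SRCU read side, may sleep in hmm_range_fault */
	idx = srcu_read_lock(&svms->srcu);
	list_for_each_entry_rcu(prange, &svms->list, list)
		restore_one_range(prange);	/* validate + map to GPUs */
	srcu_read_unlock(&svms->srcu, idx);

	/* unmap callback: unlink under the svms lock, defer the free */
	svms_lock(svms);
	list_del_rcu(&prange->list);
	list_add_tail(&prange->remove_list, &svms->free_list);
	svms_unlock(svms);
	schedule_work(&svms->srcu_free_work);	/* no synchronize_srcu here */

	/* srcu_free_work: free once all SRCU readers are done */
	synchronize_srcu(&svms->srcu);
	free_ranges(&svms->free_list);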

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |   2 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 169 ++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |   2 +
 4 files changed, 169 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 97cf267b6f51..f1e95773e19b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -736,6 +736,8 @@ struct svm_range_list {
 	struct list_head		free_list;
 	struct mutex			free_list_lock;
 	struct mmu_interval_notifier	notifier;
+	atomic_t			evicted_ranges;
+	struct delayed_work		restore_work;
 };
 
 /* Process data */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 791f17308b1b..0f31538b2a91 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1048,6 +1048,7 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
 
 	cancel_delayed_work_sync(&p->eviction_work);
 	cancel_delayed_work_sync(&p->restore_work);
+	cancel_delayed_work_sync(&p->svms.restore_work);
 
 	mutex_lock(&p->mutex);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3c4a036609c4..e3ba6e7262a7 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -21,6 +21,7 @@
  */
 
 #include <linux/types.h>
+#include <linux/sched/task.h>
 #include "amdgpu_sync.h"
 #include "amdgpu_object.h"
 #include "amdgpu_vm.h"
@@ -28,6 +29,8 @@
 #include "kfd_priv.h"
 #include "kfd_svm.h"
 
+#define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
+
 /**
  * svm_range_unlink - unlink svm_range from lists and interval tree
  * @prange: svm range structure to be removed
@@ -99,6 +102,7 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
 	INIT_LIST_HEAD(&prange->list);
 	INIT_LIST_HEAD(&prange->update_list);
 	INIT_LIST_HEAD(&prange->remove_list);
+	atomic_set(&prange->invalid, 0);
 	svm_range_set_default_attributes(&prange->preferred_loc,
 					 &prange->prefetch_loc,
 					 &prange->granularity, &prange->flags);
@@ -191,6 +195,10 @@ svm_range_validate(struct mm_struct *mm, struct svm_range *prange)
 
 	r = svm_range_validate_ram(mm, prange);
 
+	pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d invalid %d\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last,
+		 r, atomic_read(&prange->invalid));
+
 	return r;
 }
 
@@ -757,6 +765,151 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
 	return r;
 }
 
+static void svm_range_restore_work(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct amdkfd_process_info *process_info;
+	struct svm_range_list *svms;
+	struct svm_range *prange;
+	struct kfd_process *p;
+	struct mm_struct *mm;
+	int evicted_ranges;
+	int srcu_idx;
+	int invalid;
+	int r;
+
+	svms = container_of(dwork, struct svm_range_list, restore_work);
+	evicted_ranges = atomic_read(&svms->evicted_ranges);
+	if (!evicted_ranges)
+		return;
+
+	pr_debug("restore svm ranges\n");
+
+	/* kfd_process_notifier_release destroys this worker thread. So during
+	 * the lifetime of this thread, kfd_process and mm will be valid.
+	 */
+	p = container_of(svms, struct kfd_process, svms);
+	process_info = p->kgd_process_info;
+	mm = p->mm;
+	if (!mm)
+		return;
+
+	mutex_lock(&process_info->lock);
+	mmap_read_lock(mm);
+	srcu_idx = srcu_read_lock(&svms->srcu);
+
+	list_for_each_entry_rcu(prange, &svms->list, list) {
+		invalid = atomic_read(&prange->invalid);
+		if (!invalid)
+			continue;
+
+		pr_debug("restoring svms 0x%p [0x%lx %lx] invalid %d\n",
+			 prange->svms, prange->it_node.start,
+			 prange->it_node.last, invalid);
+
+		r = svm_range_validate(mm, prange);
+		if (r) {
+			pr_debug("failed %d to validate [0x%lx 0x%lx]\n", r,
+				 prange->it_node.start, prange->it_node.last);
+
+			goto unlock_out;
+		}
+
+		r = svm_range_map_to_gpus(prange, true);
+		if (r) {
+			pr_debug("failed %d to map 0x%lx to gpu\n", r,
+				 prange->it_node.start);
+			goto unlock_out;
+		}
+
+		if (atomic_cmpxchg(&prange->invalid, invalid, 0) != invalid)
+			goto unlock_out;
+	}
+
+	if (atomic_cmpxchg(&svms->evicted_ranges, evicted_ranges, 0) !=
+	    evicted_ranges)
+		goto unlock_out;
+
+	evicted_ranges = 0;
+
+	r = kgd2kfd_resume_mm(mm);
+	if (r) {
+		/* No recovery from this failure. Probably the CP is
+		 * hanging. No point trying again.
+		 */
+		pr_debug("failed %d to resume KFD\n", r);
+	}
+
+	pr_debug("restore svm ranges successfully\n");
+
+unlock_out:
+	srcu_read_unlock(&svms->srcu, srcu_idx);
+	mmap_read_unlock(mm);
+	mutex_unlock(&process_info->lock);
+
+	/* If validation failed, reschedule another attempt */
+	if (evicted_ranges) {
+		pr_debug("reschedule to restore svm range\n");
+		schedule_delayed_work(&svms->restore_work,
+			msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
+	}
+}
+
+/**
+ * svm_range_evict - evict svm range
+ *
+ * Stop all queues of the process to ensure the GPU doesn't access the memory,
+ * then return to let the CPU evict the buffer and proceed with the CPU page
+ * table update.
+ *
+ * No lock is needed to sync CPU page table invalidation with GPU execution.
+ * If an invalidation happens while the restore work is running, the restore
+ * work restarts to map the latest CPU pages to the GPU, then starts the queues.
+ */
+static int
+svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm,
+		unsigned long start, unsigned long last)
+{
+	int invalid, evicted_ranges;
+	int r = 0;
+	struct interval_tree_node *node;
+	struct svm_range *prange;
+
+	svms_lock(svms);
+
+	pr_debug("invalidate svms 0x%p [0x%lx 0x%lx]\n", svms, start, last);
+
+	node = interval_tree_iter_first(&svms->objects, start, last);
+	while (node) {
+		struct interval_tree_node *next;
+
+		prange = container_of(node, struct svm_range, it_node);
+		next = interval_tree_iter_next(node, start, last);
+
+		invalid = atomic_inc_return(&prange->invalid);
+		evicted_ranges = atomic_inc_return(&svms->evicted_ranges);
+		if (evicted_ranges == 1) {
+			pr_debug("evicting svms 0x%p range [0x%lx 0x%lx]\n",
+				 prange->svms, prange->it_node.start,
+				 prange->it_node.last);
+
+			/* First eviction, stop the queues */
+			r = kgd2kfd_quiesce_mm(mm);
+			if (r)
+				pr_debug("failed to quiesce KFD\n");
+
+			pr_debug("schedule to restore svm %p ranges\n", svms);
+			schedule_delayed_work(&svms->restore_work,
+			   msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
+		}
+		node = next;
+	}
+
+	svms_unlock(svms);
+
+	return r;
+}
+
 struct svm_range *svm_range_clone(struct svm_range *old)
 {
 	struct svm_range *new;
@@ -994,6 +1147,11 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, unsigned long start,
  * svm_range_cpu_invalidate_pagetables - interval notifier callback
  *
  * MMU range unmap notifier to remove svm ranges
+ *
+ * If GPU vm fault retry is not enabled, evict the svm range, then restore
+ * work will update GPU mapping.
+ * If GPU vm fault retry is enabled, unmap the svm range from GPU, vm fault
+ * will update GPU mapping.
  */
 static bool
 svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
@@ -1009,15 +1167,14 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
 	if (range->event == MMU_NOTIFY_RELEASE) {
 		pr_debug("cpu release range [0x%lx 0x%lx]\n", range->start,
 			 range->end - 1);
-		return true;
-	}
-	if (range->event == MMU_NOTIFY_UNMAP) {
+	} else if (range->event == MMU_NOTIFY_UNMAP) {
 		pr_debug("mm 0x%p unmap range [0x%lx 0x%lx]\n", range->mm,
 			 start, last);
 		svm_range_unmap_from_cpu(mni->mm, start, last);
-		return true;
+	} else {
+		mmu_interval_set_seq(mni, cur_seq);
+		svm_range_evict(svms, mni->mm, start, last);
 	}
-
 	return true;
 }
 
@@ -1045,6 +1202,8 @@ int svm_range_list_init(struct kfd_process *p)
 	svms->objects = RB_ROOT_CACHED;
 	mutex_init(&svms->lock);
 	INIT_LIST_HEAD(&svms->list);
+	atomic_set(&svms->evicted_ranges, 0);
+	INIT_DELAYED_WORK(&svms->restore_work, svm_range_restore_work);
 	r = init_srcu_struct(&svms->srcu);
 	if (r) {
 		pr_debug("failed %d to init srcu\n", r);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index fb68b5ee54f8..4c7daf8e0b6f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -50,6 +50,7 @@
  * @perfetch_loc: last prefetch location, 0 for CPU, or GPU id
  * @actual_loc: the actual location, 0 for CPU, or GPU id
  * @granularity:migration granularity, log2 num pages
+ * @invalid:    not 0 means cpu page table is invalidated
  * @bitmap_access: index bitmap of GPUs which can access the range
  * @bitmap_aip: index bitmap of GPUs which can access the range in place
  *
@@ -72,6 +73,7 @@ struct svm_range {
 	uint32_t			prefetch_loc;
 	uint32_t			actual_loc;
 	uint8_t				granularity;
+	atomic_t			invalid;
 	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
 	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
 };
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 15/35] drm/amdkfd: add xnack enabled flag to kfd_process
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (13 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 14/35] drm/amdkfd: svm range eviction and restore Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 16/35] drm/amdkfd: add ioctl to configure and query xnack retries Felix Kuehling
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

This flag is used when the CPU page table is invalidated, to decide
between queue eviction and page fault handling.
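
Roughly, the decision this flag feeds in the CPU invalidation path looks
like this (sketch only; the handling itself lands in later patches of this
series):

	p = container_of(svms, struct kfd_process, svms);
	if (p->xnack_enabled) {
		/* GPU can retry the access: just unmap, the retry fault
		 * handler revalidates and maps new pages on demand
		 */
		svm_range_unmap_from_gpus(prange);
	} else {
		/* no retry: quiesce the queues and let the restore work
		 * revalidate and map before resuming
		 */
		svm_range_evict(svms, mm, start, last);
	}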

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  4 +++
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 36 ++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index f1e95773e19b..7a4b4b6dcf32 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -821,6 +821,8 @@ struct kfd_process {
 
 	/* shared virtual memory registered by this process */
 	struct svm_range_list svms;
+
+	bool xnack_enabled;
 };
 
 #define KFD_PROCESS_TABLE_SIZE 5 /* bits: 32 entries */
@@ -874,6 +876,8 @@ struct kfd_process_device *kfd_get_process_device_data(struct kfd_dev *dev,
 struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
 							struct kfd_process *p);
 
+bool kfd_process_xnack_supported(struct kfd_process *p);
+
 int kfd_reserved_mem_mmap(struct kfd_dev *dev, struct kfd_process *process,
 			  struct vm_area_struct *vma);
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 0f31538b2a91..f7a50a364d78 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1157,6 +1157,39 @@ static int kfd_process_device_init_cwsr_dgpu(struct kfd_process_device *pdd)
 	return 0;
 }
 
+bool kfd_process_xnack_supported(struct kfd_process *p)
+{
+	int i;
+
+	/* On most GFXv9 GPUs, the retry mode in the SQ must match the
+	 * boot time retry setting. Mixing processes with different
+	 * XNACK/retry settings can hang the GPU.
+	 *
+	 * Different GPUs can have different noretry settings depending
+	 * on HW bugs or limitations. We need to find at least one
+	 * XNACK mode for this process that's compatible with all GPUs.
+	 * Fortunately GPUs with retry enabled (noretry=0) can run code
+	 * built for XNACK-off. On GFXv9 it may perform slower.
+	 *
+	 * Therefore applications built for XNACK-off can always be
+	 * supported and will be our fallback if any GPU does not
+	 * support retry.
+	 */
+	for (i = 0; i < p->n_pdds; i++) {
+		struct kfd_dev *dev = p->pdds[i]->dev;
+
+		/* Only consider GFXv9 and higher GPUs. Older GPUs don't
+		 * support the SVM APIs and don't need to be considered
+		 * for the XNACK mode selection.
+		 */
+		if (dev->device_info->asic_family >= CHIP_VEGA10 &&
+		    dev->noretry)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * On return the kfd_process is fully operational and will be freed when the
  * mm is released
@@ -1194,6 +1227,9 @@ static struct kfd_process *create_process(const struct task_struct *thread)
 	if (err != 0)
 		goto err_init_apertures;
 
+	/* Check XNACK support after PDDs are created in kfd_init_apertures */
+	process->xnack_enabled = kfd_process_xnack_supported(process);
+
 	err = svm_range_list_init(process);
 	if (err)
 		goto err_init_svm_range_list;
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 16/35] drm/amdkfd: add ioctl to configure and query xnack retries
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (14 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 15/35] drm/amdkfd: add xnack enabled flag to kfd_process Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 17/35] drm/amdkfd: register HMM device private zone Felix Kuehling
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

XNACK retries are used for page fault recovery. Some AMD chip families
support continuously retrying while page table entries are invalid. The
driver must handle the page fault interrupt and fill in a valid entry for
the GPU to continue.

This ioctl allows enabling/disabling XNACK retries per KFD process.
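
A minimal user mode sketch of the new interface, assuming kfd_fd is an
already-open /dev/kfd file descriptor (for illustration only, not part of
this patch):

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kfd_ioctl.h>

	/* a negative value queries the current mode without changing it */
	struct kfd_ioctl_set_xnack_mode_args args = { .xnack_enabled = -1 };

	if (!ioctl(kfd_fd, AMDKFD_IOC_SET_XNACK_MODE, &args))
		printf("XNACK is %s\n", args.xnack_enabled ? "on" : "off");

	/* enable XNACK; fails with EBUSY if user queues already exist,
	 * or EPERM if any GPU of the process cannot run in retry mode
	 */
	args.xnack_enabled = 1;
	if (ioctl(kfd_fd, AMDKFD_IOC_SET_XNACK_MODE, &args))
		perror("set xnack mode");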

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 28 +++++++++++++++
 include/uapi/linux/kfd_ioctl.h           | 43 +++++++++++++++++++++++-
 2 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 2d3ba7e806d5..a9a6a7c8ff21 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1747,6 +1747,31 @@ static int kfd_ioctl_smi_events(struct file *filep,
 	return kfd_smi_event_open(dev, &args->anon_fd);
 }
 
+static int kfd_ioctl_set_xnack_mode(struct file *filep,
+				    struct kfd_process *p, void *data)
+{
+	struct kfd_ioctl_set_xnack_mode_args *args = data;
+	int r = 0;
+
+	mutex_lock(&p->mutex);
+	if (args->xnack_enabled >= 0) {
+		if (!list_empty(&p->pqm.queues)) {
+			pr_debug("Process has user queues running\n");
+			mutex_unlock(&p->mutex);
+			return -EBUSY;
+		}
+		if (args->xnack_enabled && !kfd_process_xnack_supported(p))
+			r = -EPERM;
+		else
+			p->xnack_enabled = args->xnack_enabled;
+	} else {
+		args->xnack_enabled = p->xnack_enabled;
+	}
+	mutex_unlock(&p->mutex);
+
+	return r;
+}
+
 static int kfd_ioctl_svm(struct file *filep, struct kfd_process *p, void *data)
 {
 	struct kfd_ioctl_svm_args *args = data;
@@ -1870,6 +1895,9 @@ static const struct amdkfd_ioctl_desc amdkfd_ioctls[] = {
 			kfd_ioctl_smi_events, 0),
 
 	AMDKFD_IOCTL_DEF(AMDKFD_IOC_SVM, kfd_ioctl_svm, 0),
+
+	AMDKFD_IOCTL_DEF(AMDKFD_IOC_SET_XNACK_MODE,
+			kfd_ioctl_set_xnack_mode, 0),
 };
 
 #define AMDKFD_CORE_IOCTL_COUNT	ARRAY_SIZE(amdkfd_ioctls)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 5d4a4b3e0b61..b1a45cd37ab7 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -593,6 +593,44 @@ struct kfd_ioctl_svm_args {
 	struct kfd_ioctl_svm_attribute attrs[0];
 };
 
+/**
+ * kfd_ioctl_set_xnack_mode_args - Arguments for set_xnack_mode
+ *
+ * @xnack_enabled:       [in/out] Whether to enable XNACK mode for this process
+ *
+ * @xnack_enabled indicates whether recoverable page faults should be
+ * enabled for the current process. 0 means disabled, positive means
+ * enabled, negative means leave unchanged. If enabled, virtual address
+ * translations on GFXv9 and later AMD GPUs can return XNACK and retry
+ * the access until a valid PTE is available. This is used to implement
+ * device page faults.
+ *
+ * On output, @xnack_enabled returns the (new) current mode (0 or
+ * positive). Therefore, a negative input value can be used to query
+ * the current mode without changing it.
+ *
+ * The XNACK mode fundamentally changes the way SVM managed memory works
+ * in the driver, with subtle effects on application performance and
+ * functionality.
+ *
+ * Enabling XNACK mode requires shader programs to be compiled
+ * differently. Furthermore, not all GPUs support changing the mode
+ * per-process. Therefore changing the mode is only allowed while no
+ * user mode queues exist in the process. This ensures that no shader
+ * code is running that may have been compiled for the wrong mode. GPUs
+ * that cannot switch to the requested mode will cause the mode change
+ * to fail. All GPUs used by the process must be in the same XNACK
+ * mode.
+ *
+ * GFXv8 or older GPUs do not support 48 bit virtual addresses or SVM.
+ * Therefore those GPUs are not considered for the XNACK mode switch.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+struct kfd_ioctl_set_xnack_mode_args {
+	__s32 xnack_enabled;
+};
+
 #define AMDKFD_IOCTL_BASE 'K'
 #define AMDKFD_IO(nr)			_IO(AMDKFD_IOCTL_BASE, nr)
 #define AMDKFD_IOR(nr, type)		_IOR(AMDKFD_IOCTL_BASE, nr, type)
@@ -695,7 +733,10 @@ struct kfd_ioctl_svm_args {
 
 #define AMDKFD_IOC_SVM	AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args)
 
+#define AMDKFD_IOC_SET_XNACK_MODE		\
+		AMDKFD_IOWR(0x21, struct kfd_ioctl_set_xnack_mode_args)
+
 #define AMDKFD_COMMAND_START		0x01
-#define AMDKFD_COMMAND_END		0x21
+#define AMDKFD_COMMAND_END		0x22
 
 #endif
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 17/35] drm/amdkfd: register HMM device private zone
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (15 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 16/35] drm/amdkfd: add ioctl to configure and query xnack retries Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-03-01  8:32   ` Daniel Vetter
  2021-01-07  3:01 ` [PATCH 18/35] drm/amdkfd: validate vram svm range from TTM Felix Kuehling
                   ` (19 subsequent siblings)
  36 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Register vram memory as a MEMORY_DEVICE_PRIVATE resource, so that vram
backing pages can be allocated for page migration.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |   3 +
 drivers/gpu/drm/amd/amdkfd/Makefile        |   3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c   | 101 +++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h   |  48 ++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |   3 +
 5 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index db96d69eb45e..562bb5b69137 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -30,6 +30,7 @@
 #include <linux/dma-buf.h>
 #include "amdgpu_xgmi.h"
 #include <uapi/linux/kfd_ioctl.h>
+#include "kfd_migrate.h"
 
 /* Total memory size in system memory and all GPU VRAM. Used to
  * estimate worst case amount of memory to reserve for page tables
@@ -170,12 +171,14 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 		}
 
 		kgd2kfd_device_init(adev->kfd.dev, adev_to_drm(adev), &gpu_resources);
+		svm_migrate_init(adev);
 	}
 }
 
 void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
 {
 	if (adev->kfd.dev) {
+		svm_migrate_fini(adev);
 		kgd2kfd_device_exit(adev->kfd.dev);
 		adev->kfd.dev = NULL;
 	}
diff --git a/drivers/gpu/drm/amd/amdkfd/Makefile b/drivers/gpu/drm/amd/amdkfd/Makefile
index 387ce0217d35..a93301dbc464 100644
--- a/drivers/gpu/drm/amd/amdkfd/Makefile
+++ b/drivers/gpu/drm/amd/amdkfd/Makefile
@@ -55,7 +55,8 @@ AMDKFD_FILES	:= $(AMDKFD_PATH)/kfd_module.o \
 		$(AMDKFD_PATH)/kfd_dbgmgr.o \
 		$(AMDKFD_PATH)/kfd_smi_events.o \
 		$(AMDKFD_PATH)/kfd_crat.o \
-		$(AMDKFD_PATH)/kfd_svm.o
+		$(AMDKFD_PATH)/kfd_svm.o \
+		$(AMDKFD_PATH)/kfd_migrate.o
 
 ifneq ($(CONFIG_AMD_IOMMU_V2),)
 AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
new file mode 100644
index 000000000000..1950b86f1562
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -0,0 +1,101 @@
+/*
+ * Copyright 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#include <linux/types.h>
+#include <linux/hmm.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include "amdgpu_sync.h"
+#include "amdgpu_object.h"
+#include "amdgpu_vm.h"
+#include "amdgpu_mn.h"
+#include "kfd_priv.h"
+#include "kfd_svm.h"
+#include "kfd_migrate.h"
+
+static void svm_migrate_page_free(struct page *page)
+{
+}
+
+/**
+ * svm_migrate_to_ram - CPU page fault handler
+ * @vmf: CPU vm fault vma, address
+ *
+ * Context: vm fault handler, mm->mmap_sem is taken
+ *
+ * Return:
+ * 0 - OK
+ * VM_FAULT_SIGBUS - notice application to have SIGBUS page fault
+ */
+static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct dev_pagemap_ops svm_migrate_pgmap_ops = {
+	.page_free		= svm_migrate_page_free,
+	.migrate_to_ram		= svm_migrate_to_ram,
+};
+
+int svm_migrate_init(struct amdgpu_device *adev)
+{
+	struct kfd_dev *kfddev = adev->kfd.dev;
+	struct dev_pagemap *pgmap;
+	struct resource *res;
+	unsigned long size;
+	void *r;
+
+	/* Page migration works on Vega10 or newer */
+	if (kfddev->device_info->asic_family < CHIP_VEGA10)
+		return -EINVAL;
+
+	pgmap = &kfddev->pgmap;
+	memset(pgmap, 0, sizeof(*pgmap));
+
+	/* TODO: register all vram to HMM for now.
+	 * should remove reserved size
+	 */
+	size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
+	res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+	if (IS_ERR(res))
+		return -ENOMEM;
+
+	pgmap->type = MEMORY_DEVICE_PRIVATE;
+	pgmap->res = *res;
+	pgmap->ops = &svm_migrate_pgmap_ops;
+	pgmap->owner = adev;
+	pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	r = devm_memremap_pages(adev->dev, pgmap);
+	if (IS_ERR(r)) {
+		pr_err("failed to register HMM device memory\n");
+		return PTR_ERR(r);
+	}
+
+	pr_info("HMM registered %ldMB device memory\n", size >> 20);
+
+	return 0;
+}
+
+void svm_migrate_fini(struct amdgpu_device *adev)
+{
+	memunmap_pages(&adev->kfd.dev->pgmap);
+}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
new file mode 100644
index 000000000000..98ab685d3e17
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -0,0 +1,48 @@
+/*
+ * Copyright 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef KFD_MIGRATE_H_
+#define KFD_MIGRATE_H_
+
+#include <linux/rwsem.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/sched/mm.h>
+#include <linux/hmm.h>
+#include "kfd_priv.h"
+#include "kfd_svm.h"
+
+#if defined(CONFIG_DEVICE_PRIVATE)
+int svm_migrate_init(struct amdgpu_device *adev);
+void svm_migrate_fini(struct amdgpu_device *adev);
+
+#else
+static inline int svm_migrate_init(struct amdgpu_device *adev)
+{
+	DRM_WARN_ONCE("DEVICE_PRIVATE kernel config option is not enabled, "
+		      "add CONFIG_DEVICE_PRIVATE=y in config file to fix\n");
+	return -ENODEV;
+}
+static inline void svm_migrate_fini(struct amdgpu_device *adev) {}
+#endif
+#endif /* KFD_MIGRATE_H_ */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 7a4b4b6dcf32..d5367e770b39 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -317,6 +317,9 @@ struct kfd_dev {
 	unsigned int max_doorbell_slices;
 
 	int noretry;
+
+	/* HMM page migration MEMORY_DEVICE_PRIVATE mapping */
+	struct dev_pagemap pgmap;
 };
 
 enum kfd_mempool {
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 18/35] drm/amdkfd: validate vram svm range from TTM
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (16 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 17/35] drm/amdkfd: register HMM device private zone Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 19/35] drm/amdkfd: support xgmi same hive mapping Felix Kuehling
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

If the svm range prefetch location is not zero, use TTM to allocate
amdgpu_bo vram nodes to validate the svm range, then map the vram nodes
to the GPUs.

Use an offset to sub-allocate from the same amdgpu_bo, to handle
overlapping vram ranges while adding a new range or unmapping a range.

svm_bo has a ref count to track the ranges sharing it. When all ranges of
a shared amdgpu_bo have been migrated to ram, the ref count drops to 0,
the amdgpu_bo is released, and svm_bo is set to NULL in all ranges.

To migrate a range from ram back to vram, allocate from the same
amdgpu_bo at the previous offset if the range still has an svm_bo.
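
As an illustration of the offset bookkeeping in svm_range_split_nodes
below, splitting a range that starts at offset 0 of its svm_bo works out
roughly like this (sketch, addresses made up):

	/* old spans pages [0x1000 0x13ff] at offset 0 in its svm_bo.
	 * Keep the head [0x1000 0x11ff] in old, give the tail to new:
	 */
	old->npages = 0x11ff - 0x1000 + 1;		/* 0x200 pages kept */
	new->offset = old->offset + old->npages;	/* tail vram at 0x200 */

	/* Splitting off the head instead keeps old's data in place:
	 *	new->offset = old->offset;
	 *	old->offset += new->npages;
	 * Either way both halves keep addressing the vram pages they
	 * already occupy, so no data has to move.
	 */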

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 342 ++++++++++++++++++++++++---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  20 ++
 2 files changed, 335 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index e3ba6e7262a7..7d91dc49a5a9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -35,7 +35,9 @@
  * svm_range_unlink - unlink svm_range from lists and interval tree
  * @prange: svm range structure to be removed
  *
- * Remove the svm range from svms interval tree and link list
+ * Remove the svm_range from the svms and svm_bo SRCU lists and the svms
+ * interval tree. After this call, synchronize_srcu is needed before the
+ * range can be freed safely.
  *
  * Context: The caller must hold svms_lock
  */
@@ -44,6 +46,12 @@ static void svm_range_unlink(struct svm_range *prange)
 	pr_debug("prange 0x%p [0x%lx 0x%lx]\n", prange, prange->it_node.start,
 		 prange->it_node.last);
 
+	if (prange->svm_bo) {
+		spin_lock(&prange->svm_bo->list_lock);
+		list_del(&prange->svm_bo_list);
+		spin_unlock(&prange->svm_bo->list_lock);
+	}
+
 	list_del_rcu(&prange->list);
 	interval_tree_remove(&prange->it_node, &prange->svms->objects);
 }
@@ -70,6 +78,12 @@ static void svm_range_remove(struct svm_range *prange)
 	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
 		 prange->it_node.start, prange->it_node.last);
 
+	if (prange->mm_nodes) {
+		pr_debug("vram prange svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+			 prange->it_node.start, prange->it_node.last);
+		svm_range_vram_node_free(prange);
+	}
+
 	kvfree(prange->pages_addr);
 	kfree(prange);
 }
@@ -102,7 +116,9 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
 	INIT_LIST_HEAD(&prange->list);
 	INIT_LIST_HEAD(&prange->update_list);
 	INIT_LIST_HEAD(&prange->remove_list);
+	INIT_LIST_HEAD(&prange->svm_bo_list);
 	atomic_set(&prange->invalid, 0);
+	spin_lock_init(&prange->svm_bo_lock);
 	svm_range_set_default_attributes(&prange->preferred_loc,
 					 &prange->prefetch_loc,
 					 &prange->granularity, &prange->flags);
@@ -139,6 +155,16 @@ svm_get_supported_dev_by_id(struct kfd_process *p, uint32_t gpu_id,
 	return dev;
 }
 
+struct amdgpu_device *
+svm_range_get_adev_by_id(struct svm_range *prange, uint32_t gpu_id)
+{
+	struct kfd_process *p =
+			container_of(prange->svms, struct kfd_process, svms);
+	struct kfd_dev *dev = svm_get_supported_dev_by_id(p, gpu_id, NULL);
+
+	return dev ? (struct amdgpu_device *)dev->kgd : NULL;
+}
+
 /**
  * svm_range_validate_ram - get system memory pages of svm range
  *
@@ -186,14 +212,226 @@ svm_range_validate_ram(struct mm_struct *mm, struct svm_range *prange)
 	return 0;
 }
 
+static bool svm_bo_ref_unless_zero(struct svm_range_bo *svm_bo)
+{
+	if (!svm_bo || !kref_get_unless_zero(&svm_bo->kref))
+		return false;
+
+	return true;
+}
+
+static struct svm_range_bo *svm_range_bo_ref(struct svm_range_bo *svm_bo)
+{
+	if (svm_bo)
+		kref_get(&svm_bo->kref);
+
+	return svm_bo;
+}
+
+static void svm_range_bo_release(struct kref *kref)
+{
+	struct svm_range_bo *svm_bo;
+
+	svm_bo = container_of(kref, struct svm_range_bo, kref);
+	/* This cleanup loop does not need to be SRCU safe because there
+	 * should be no SRCU readers while the ref count is 0. Any SRCU
+	 * reader that has a chance of reducing the ref count must take
+	 * an extra reference before srcu_read_lock and release it after
+	 * srcu_read_unlock.
+	 */
+	spin_lock(&svm_bo->list_lock);
+	while (!list_empty(&svm_bo->range_list)) {
+		struct svm_range *prange =
+				list_first_entry(&svm_bo->range_list,
+						struct svm_range, svm_bo_list);
+		pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+			 prange->it_node.start, prange->it_node.last);
+		spin_lock(&prange->svm_bo_lock);
+		prange->svm_bo = NULL;
+		spin_unlock(&prange->svm_bo_lock);
+
+		/* list_del_init tells a concurrent svm_range_vram_node_new when
+		 * it's safe to reuse the svm_bo pointer and svm_bo_list head.
+		 */
+		list_del_init(&prange->svm_bo_list);
+	}
+	spin_unlock(&svm_bo->list_lock);
+
+	amdgpu_bo_unref(&svm_bo->bo);
+	kfree(svm_bo);
+}
+
+static void svm_range_bo_unref(struct svm_range_bo *svm_bo)
+{
+	if (!svm_bo)
+		return;
+
+	kref_put(&svm_bo->kref, svm_range_bo_release);
+}
+
+static struct svm_range_bo *svm_range_bo_new(void)
+{
+	struct svm_range_bo *svm_bo;
+
+	svm_bo = kzalloc(sizeof(*svm_bo), GFP_KERNEL);
+	if (!svm_bo)
+		return NULL;
+
+	kref_init(&svm_bo->kref);
+	INIT_LIST_HEAD(&svm_bo->range_list);
+	spin_lock_init(&svm_bo->list_lock);
+
+	return svm_bo;
+}
+
+int
+svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange,
+			bool clear)
+{
+	struct amdkfd_process_info *process_info;
+	struct amdgpu_bo_param bp;
+	struct svm_range_bo *svm_bo;
+	struct amdgpu_bo *bo;
+	struct kfd_process *p;
+	int r;
+
+	pr_debug("[0x%lx 0x%lx]\n", prange->it_node.start,
+		 prange->it_node.last);
+	spin_lock(&prange->svm_bo_lock);
+	if (prange->svm_bo) {
+		if (prange->mm_nodes) {
+			/* We still have a reference, all is well */
+			spin_unlock(&prange->svm_bo_lock);
+			return 0;
+		}
+		if (svm_bo_ref_unless_zero(prange->svm_bo)) {
+			/* The BO was still around and we got
+			 * a new reference to it
+			 */
+			spin_unlock(&prange->svm_bo_lock);
+			pr_debug("reuse old bo [0x%lx 0x%lx]\n",
+				prange->it_node.start, prange->it_node.last);
+
+			prange->mm_nodes = prange->svm_bo->bo->tbo.mem.mm_node;
+			return 0;
+		}
+
+		spin_unlock(&prange->svm_bo_lock);
+
+		/* We need a new svm_bo. Spin-loop to wait for concurrent
+		 * svm_range_bo_release to finish removing this range from
+		 * its range list. After this, it is safe to reuse the
+		 * svm_bo pointer and svm_bo_list head.
+		 */
+		while (!list_empty_careful(&prange->svm_bo_list))
+			;
+
+	} else {
+		spin_unlock(&prange->svm_bo_lock);
+	}
+
+	svm_bo = svm_range_bo_new();
+	if (!svm_bo) {
+		pr_debug("failed to alloc svm bo\n");
+		return -ENOMEM;
+	}
+
+	memset(&bp, 0, sizeof(bp));
+	bp.size = prange->npages * PAGE_SIZE;
+	bp.byte_align = PAGE_SIZE;
+	bp.domain = AMDGPU_GEM_DOMAIN_VRAM;
+	bp.flags = AMDGPU_GEM_CREATE_NO_CPU_ACCESS;
+	bp.flags |= clear ? AMDGPU_GEM_CREATE_VRAM_CLEARED : 0;
+	bp.type = ttm_bo_type_device;
+	bp.resv = NULL;
+
+	r = amdgpu_bo_create(adev, &bp, &bo);
+	if (r) {
+		pr_debug("failed %d to create bo\n", r);
+		kfree(svm_bo);
+		return r;
+	}
+
+	p = container_of(prange->svms, struct kfd_process, svms);
+	r = amdgpu_bo_reserve(bo, true);
+	if (r) {
+		pr_debug("failed %d to reserve bo\n", r);
+		goto reserve_bo_failed;
+	}
+
+	r = dma_resv_reserve_shared(bo->tbo.base.resv, 1);
+	if (r) {
+		pr_debug("failed %d to reserve bo\n", r);
+		amdgpu_bo_unreserve(bo);
+		goto reserve_bo_failed;
+	}
+	process_info = p->kgd_process_info;
+	amdgpu_bo_fence(bo, &process_info->eviction_fence->base, true);
+
+	amdgpu_bo_unreserve(bo);
+
+	svm_bo->bo = bo;
+	prange->svm_bo = svm_bo;
+	prange->mm_nodes = bo->tbo.mem.mm_node;
+	prange->offset = 0;
+
+	spin_lock(&svm_bo->list_lock);
+	list_add(&prange->svm_bo_list, &svm_bo->range_list);
+	spin_unlock(&svm_bo->list_lock);
+
+	return 0;
+
+reserve_bo_failed:
+	kfree(svm_bo);
+	amdgpu_bo_unref(&bo);
+	prange->mm_nodes = NULL;
+
+	return r;
+}
+
+void svm_range_vram_node_free(struct svm_range *prange)
+{
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	svm_range_bo_unref(prange->svm_bo);
+	prange->mm_nodes = NULL;
+}
+
+static int svm_range_validate_vram(struct svm_range *prange)
+{
+	struct amdgpu_device *adev;
+	int r;
+
+	pr_debug("svms 0x%p [0x%lx 0x%lx] actual_loc 0x%x\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last,
+		 prange->actual_loc);
+
+	adev = svm_range_get_adev_by_id(prange, prange->actual_loc);
+	if (!adev) {
+		pr_debug("failed to get device by id 0x%x\n",
+			 prange->actual_loc);
+		return -EINVAL;
+	}
+
+	r = svm_range_vram_node_new(adev, prange, true);
+	if (r)
+		pr_debug("failed %d to alloc vram\n", r);
+
+	return r;
+}
+
 static int
 svm_range_validate(struct mm_struct *mm, struct svm_range *prange)
 {
-	int r = 0;
+	int r;
 
 	pr_debug("actual loc 0x%x\n", prange->actual_loc);
 
-	r = svm_range_validate_ram(mm, prange);
+	if (!prange->actual_loc)
+		r = svm_range_validate_ram(mm, prange);
+	else
+		r = svm_range_validate_vram(prange);
 
 	pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d invalid %d\n", prange->svms,
 		 prange->it_node.start, prange->it_node.last,
@@ -349,6 +587,35 @@ svm_range_split_pages(struct svm_range *new, struct svm_range *old,
 	return 0;
 }
 
+static int
+svm_range_split_nodes(struct svm_range *new, struct svm_range *old,
+		      uint64_t start, uint64_t last)
+{
+	pr_debug("svms 0x%p new start 0x%lx start 0x%llx last 0x%llx\n",
+		 new->svms, new->it_node.start, start, last);
+
+	old->npages = last - start + 1;
+
+	if (new->it_node.start == old->it_node.start) {
+		new->offset = old->offset;
+		old->offset += new->npages;
+	} else {
+		new->offset = old->offset + old->npages;
+	}
+
+	old->it_node.start = start;
+	old->it_node.last = last;
+
+	new->svm_bo = svm_range_bo_ref(old->svm_bo);
+	new->mm_nodes = old->mm_nodes;
+
+	spin_lock(&new->svm_bo->list_lock);
+	list_add(&new->svm_bo_list, &new->svm_bo->range_list);
+	spin_unlock(&new->svm_bo->list_lock);
+
+	return 0;
+}
+
 /**
  * svm_range_split_adjust - split range and adjust
  *
@@ -382,6 +649,8 @@ svm_range_split_adjust(struct svm_range *new, struct svm_range *old,
 
 	if (old->pages_addr)
 		r = svm_range_split_pages(new, old, start, last);
+	else if (old->actual_loc && old->mm_nodes)
+		r = svm_range_split_nodes(new, old, start, last);
 	else
 		WARN_ONCE(1, "split adjust invalid pages_addr and nodes\n");
 	if (r)
@@ -438,17 +707,14 @@ svm_range_split(struct svm_range *prange, uint64_t start, uint64_t last,
 		return -EINVAL;
 
 	svms = prange->svms;
-	if (old_start == start) {
+	if (old_start == start)
 		*new = svm_range_new(svms, last + 1, old_last);
-		if (!*new)
-			return -ENOMEM;
-		r = svm_range_split_adjust(*new, prange, start, last);
-	} else {
+	else
 		*new = svm_range_new(svms, old_start, start - 1);
-		if (!*new)
-			return -ENOMEM;
-		r = svm_range_split_adjust(*new, prange, start, last);
-	}
+	if (!*new)
+		return -ENOMEM;
+
+	r = svm_range_split_adjust(*new, prange, start, last);
 
 	return r;
 }
@@ -550,7 +816,8 @@ svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange)
 	uint64_t pte_flags;
 
 	pte_flags = AMDGPU_PTE_VALID;
-	pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED;
+	if (!prange->mm_nodes)
+		pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED;
 
 	mapping_flags = AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE;
 
@@ -570,7 +837,9 @@ svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange)
 	/* Apply ASIC specific mapping flags */
 	amdgpu_gmc_get_vm_pte(adev, &prange->mapping, &pte_flags);
 
-	pr_debug("PTE flags 0x%llx\n", pte_flags);
+	pr_debug("svms 0x%p [0x%lx 0x%lx] vram %d system %d PTE flags 0x%llx\n",
+		 prange->svms, prange->it_node.start, prange->it_node.last,
+		 prange->mm_nodes ? 1:0, prange->pages_addr ? 1:0, pte_flags);
 
 	return pte_flags;
 }
@@ -656,7 +925,9 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		     struct svm_range *prange, bool reserve_vm,
 		     struct dma_fence **fence)
 {
-	struct amdgpu_bo *root;
+	struct ttm_validate_buffer tv[2];
+	struct ww_acquire_ctx ticket;
+	struct list_head list;
 	dma_addr_t *pages_addr;
 	uint64_t pte_flags;
 	int r = 0;
@@ -665,13 +936,25 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		 prange->it_node.start, prange->it_node.last);
 
 	if (reserve_vm) {
-		root = amdgpu_bo_ref(vm->root.base.bo);
-		r = amdgpu_bo_reserve(root, true);
+		INIT_LIST_HEAD(&list);
+
+		tv[0].bo = &vm->root.base.bo->tbo;
+		tv[0].num_shared = 4;
+		list_add(&tv[0].head, &list);
+		if (prange->svm_bo && prange->mm_nodes) {
+			tv[1].bo = &prange->svm_bo->bo->tbo;
+			tv[1].num_shared = 1;
+			list_add(&tv[1].head, &list);
+		}
+		r = ttm_eu_reserve_buffers(&ticket, &list, true, NULL);
 		if (r) {
-			pr_debug("failed %d to reserve root bo\n", r);
-			amdgpu_bo_unref(&root);
+			pr_debug("failed %d to reserve bo\n", r);
 			goto out;
 		}
+		if (prange->svm_bo && prange->mm_nodes &&
+		    prange->svm_bo->bo->tbo.evicted)
+			goto unreserve_out;
+
 		r = amdgpu_vm_validate_pt_bos(adev, vm, svm_range_bo_validate,
 					      NULL);
 		if (r) {
@@ -682,7 +965,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 
 	prange->mapping.start = prange->it_node.start;
 	prange->mapping.last = prange->it_node.last;
-	prange->mapping.offset = 0;
+	prange->mapping.offset = prange->offset;
 	pte_flags = svm_range_get_pte_flags(adev, prange);
 	prange->mapping.flags = pte_flags;
 	pages_addr = prange->pages_addr;
@@ -690,7 +973,8 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	r = amdgpu_vm_bo_update_mapping(adev, adev, vm, false, false, NULL,
 					prange->mapping.start,
 					prange->mapping.last, pte_flags,
-					prange->mapping.offset, NULL,
+					prange->mapping.offset,
+					prange->mm_nodes,
 					pages_addr, &vm->last_update);
 	if (r) {
 		pr_debug("failed %d to map to gpu 0x%lx\n", r,
@@ -710,11 +994,8 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		*fence = dma_fence_get(vm->last_update);
 
 unreserve_out:
-	if (reserve_vm) {
-		amdgpu_bo_unreserve(root);
-		amdgpu_bo_unref(&root);
-	}
-
+	if (reserve_vm)
+		ttm_eu_backoff_reservation(&ticket, &list);
 out:
 	return r;
 }
@@ -929,7 +1210,14 @@ struct svm_range *svm_range_clone(struct svm_range *old)
 		memcpy(new->pages_addr, old->pages_addr,
 		       old->npages * sizeof(*old->pages_addr));
 	}
-
+	if (old->svm_bo) {
+		new->mm_nodes = old->mm_nodes;
+		new->offset = old->offset;
+		new->svm_bo = svm_range_bo_ref(old->svm_bo);
+		spin_lock(&new->svm_bo->list_lock);
+		list_add(&new->svm_bo_list, &new->svm_bo->range_list);
+		spin_unlock(&new->svm_bo->list_lock);
+	}
 	new->flags = old->flags;
 	new->preferred_loc = old->preferred_loc;
 	new->prefetch_loc = old->prefetch_loc;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 4c7daf8e0b6f..b1d2db02043b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -32,6 +32,12 @@
 #include "amdgpu.h"
 #include "kfd_priv.h"
 
+struct svm_range_bo {
+	struct amdgpu_bo	*bo;
+	struct kref		kref;
+	struct list_head	range_list; /* all svm ranges shared this bo */
+	spinlock_t		list_lock;
+};
 /**
  * struct svm_range - shared virtual memory range
  *
@@ -45,6 +51,10 @@
  * @mapping:    bo_va mapping structure to create and update GPU page table
  * @npages:     number of pages
  * @pages_addr: list of system memory physical page address
+ * @mm_nodes:   vram nodes allocated
+ * @offset:     range start offset within mm_nodes
+ * @svm_bo:     struct to manage the split amdgpu_bo
+ * @svm_bo_list:link list node, to scan all ranges which share the same svm_bo
  * @flags:      flags defined as KFD_IOCTL_SVM_FLAG_*
  * @perferred_loc: perferred location, 0 for CPU, or GPU id
  * @perfetch_loc: last prefetch location, 0 for CPU, or GPU id
@@ -68,6 +78,11 @@ struct svm_range {
 	struct amdgpu_bo_va_mapping	mapping;
 	uint64_t			npages;
 	dma_addr_t			*pages_addr;
+	struct drm_mm_node		*mm_nodes;
+	uint64_t			offset;
+	struct svm_range_bo		*svm_bo;
+	struct list_head		svm_bo_list;
+	spinlock_t                      svm_bo_lock;
 	uint32_t			flags;
 	uint32_t			preferred_loc;
 	uint32_t			prefetch_loc;
@@ -95,5 +110,10 @@ void svm_range_list_fini(struct kfd_process *p);
 int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
 	      uint64_t size, uint32_t nattrs,
 	      struct kfd_ioctl_svm_attribute *attrs);
+struct amdgpu_device *svm_range_get_adev_by_id(struct svm_range *prange,
+					       uint32_t id);
+int svm_range_vram_node_new(struct amdgpu_device *adev,
+			    struct svm_range *prange, bool clear);
+void svm_range_vram_node_free(struct svm_range *prange);
 
 #endif /* KFD_SVM_H_ */
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 19/35] drm/amdkfd: support xgmi same hive mapping
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (17 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 18/35] drm/amdkfd: validate vram svm range from TTM Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 20/35] drm/amdkfd: copy memory through gart table Felix Kuehling
                   ` (17 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

amdgpu_gmc_get_vm_pte uses the bo_va->is_xgmi same-hive information to set
the pte flags for the GPU mapping. Add a local structure variable bo_va,
set bo_va.is_xgmi, and pass it through mapping->bo_va while mapping to the
GPU.

This assumes the xgmi pstate is high after boot.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 7d91dc49a5a9..8a4d0a3935b6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -26,6 +26,8 @@
 #include "amdgpu_object.h"
 #include "amdgpu_vm.h"
 #include "amdgpu_mn.h"
+#include "amdgpu.h"
+#include "amdgpu_xgmi.h"
 #include "kfd_priv.h"
 #include "kfd_svm.h"
 
@@ -923,10 +925,11 @@ static int svm_range_bo_validate(void *param, struct amdgpu_bo *bo)
 static int
 svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		     struct svm_range *prange, bool reserve_vm,
-		     struct dma_fence **fence)
+		     struct amdgpu_device *bo_adev, struct dma_fence **fence)
 {
 	struct ttm_validate_buffer tv[2];
 	struct ww_acquire_ctx ticket;
+	struct amdgpu_bo_va bo_va;
 	struct list_head list;
 	dma_addr_t *pages_addr;
 	uint64_t pte_flags;
@@ -963,6 +966,11 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		}
 	}
 
+	if (prange->svm_bo && prange->mm_nodes) {
+		bo_va.is_xgmi = amdgpu_xgmi_same_hive(adev, bo_adev);
+		prange->mapping.bo_va = &bo_va;
+	}
+
 	prange->mapping.start = prange->it_node.start;
 	prange->mapping.last = prange->it_node.last;
 	prange->mapping.offset = prange->offset;
@@ -970,7 +978,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	prange->mapping.flags = pte_flags;
 	pages_addr = prange->pages_addr;
 
-	r = amdgpu_vm_bo_update_mapping(adev, adev, vm, false, false, NULL,
+	r = amdgpu_vm_bo_update_mapping(adev, bo_adev, vm, false, false, NULL,
 					prange->mapping.start,
 					prange->mapping.last, pte_flags,
 					prange->mapping.offset,
@@ -994,6 +1002,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		*fence = dma_fence_get(vm->last_update);
 
 unreserve_out:
+	prange->mapping.bo_va = NULL;
 	if (reserve_vm)
 		ttm_eu_backoff_reservation(&ticket, &list);
 out:
@@ -1004,6 +1013,7 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
 {
 	DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE);
 	struct kfd_process_device *pdd;
+	struct amdgpu_device *bo_adev;
 	struct amdgpu_device *adev;
 	struct kfd_process *p;
 	struct kfd_dev *dev;
@@ -1011,6 +1021,11 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
 	uint32_t gpuidx;
 	int r = 0;
 
+	if (prange->svm_bo && prange->mm_nodes)
+		bo_adev = amdgpu_ttm_adev(prange->svm_bo->bo->tbo.bdev);
+	else
+		bo_adev = NULL;
+
 	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
 		  MAX_GPU_INSTANCE);
 	p = container_of(prange->svms, struct kfd_process, svms);
@@ -1027,8 +1042,14 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
 			return -EINVAL;
 		adev = (struct amdgpu_device *)dev->kgd;
 
+		if (bo_adev && adev != bo_adev &&
+		    !amdgpu_xgmi_same_hive(adev, bo_adev)) {
+			pr_debug("cannot map to device idx %d\n", gpuidx);
+			continue;
+		}
+
 		r = svm_range_map_to_gpu(adev, pdd->vm, prange, reserve_vm,
-					 &fence);
+					 bo_adev, &fence);
 		if (r)
 			break;
 
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 20/35] drm/amdkfd: copy memory through gart table
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (18 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 19/35] drm/amdkfd: support xgmi same hive mapping Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 21/35] drm/amdkfd: HMM migrate ram to vram Felix Kuehling
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Use sdma linear copy to migrate data between ram and vram. The sdma linear
copy command uses the kernel buffer function queue to access system memory
through the gart table.

Use the reserved gart table window 0 to map system page addresses; vram page
addresses are mapped directly. The same kernel buffer function ring fills in
the gart table mapping, so it is serialized with the memory copy by the sdma
job submission. For larger buffer migrations we only need to wait for the
last memory-copy sdma fence.
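
A rough usage sketch of the two helpers added below (src, dst and npages are
hypothetical placeholders for the caller's page address arrays; the
signatures are the ones introduced by this patch):

	struct dma_fence *mfence = NULL;
	int r;

	/* src[] holds system page addresses (gart-mapped in chunks inside
	 * the helper), dst[] holds vram physical addresses
	 */
	r = svm_migrate_copy_memory_gart(adev, src, dst, npages,
					 FROM_RAM_TO_VRAM, &mfence);
	if (!r)
		/* only the fence of the last sdma chunk needs to be waited on */
		r = svm_migrate_copy_done(adev, mfence);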

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 172 +++++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   5 +
 2 files changed, 177 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 1950b86f1562..f2019c8f0b80 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -32,6 +32,178 @@
 #include "kfd_svm.h"
 #include "kfd_migrate.h"
 
+static uint64_t
+svm_migrate_direct_mapping_addr(struct amdgpu_device *adev, uint64_t addr)
+{
+	return addr + amdgpu_ttm_domain_start(adev, TTM_PL_VRAM);
+}
+
+static int
+svm_migrate_gart_map(struct amdgpu_ring *ring, uint64_t npages,
+		     uint64_t *addr, uint64_t *gart_addr, uint64_t flags)
+{
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_job *job;
+	unsigned int num_dw, num_bytes;
+	struct dma_fence *fence;
+	uint64_t src_addr, dst_addr;
+	uint64_t pte_flags;
+	void *cpu_addr;
+	int r;
+
+	/* use gart window 0 */
+	*gart_addr = adev->gmc.gart_start;
+
+	num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
+	num_bytes = npages * 8;
+
+	r = amdgpu_job_alloc_with_ib(adev, num_dw * 4 + num_bytes,
+				     AMDGPU_IB_POOL_DELAYED, &job);
+	if (r)
+		return r;
+
+	src_addr = num_dw * 4;
+	src_addr += job->ibs[0].gpu_addr;
+
+	dst_addr = amdgpu_bo_gpu_offset(adev->gart.bo);
+	amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr,
+				dst_addr, num_bytes, false);
+
+	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
+	WARN_ON(job->ibs[0].length_dw > num_dw);
+
+	pte_flags = AMDGPU_PTE_VALID | AMDGPU_PTE_READABLE;
+	pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED;
+	if (!(flags & KFD_IOCTL_SVM_FLAG_GPU_RO))
+		pte_flags |= AMDGPU_PTE_WRITEABLE;
+	pte_flags |= adev->gart.gart_pte_flags;
+
+	cpu_addr = &job->ibs[0].ptr[num_dw];
+
+	r = amdgpu_gart_map(adev, 0, npages, addr, pte_flags, cpu_addr);
+	if (r)
+		goto error_free;
+
+	r = amdgpu_job_submit(job, &adev->mman.entity,
+			      AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
+	if (r)
+		goto error_free;
+
+	dma_fence_put(fence);
+
+	return r;
+
+error_free:
+	amdgpu_job_free(job);
+	return r;
+}
+
+/**
+ * svm_migrate_copy_memory_gart - sdma copy data between ram and vram
+ *
+ * @adev: amdgpu device the sdma ring runs on
+ * @src: source page address array
+ * @dst: destination page address array
+ * @npages: number of pages to copy
+ * @direction: enum MIGRATION_COPY_DIR
+ * @mfence: output, sdma fence to signal after sdma is done
+ *
+ * The ram address uses contiguous GART table entries mapped to the ram pages;
+ * the vram address uses a direct mapping of the vram pages, which must be
+ * npages contiguous pages.
+ * GART update and sdma use the same buffer copy function ring. The copy is
+ * split into multiple GTT_MAX_PAGES transfers; all sdma operations are
+ * serialized, so waiting for the last sdma fence tells us the copy is done.
+ *
+ * Context: Process context, takes and releases gtt_window_lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+
+static int
+svm_migrate_copy_memory_gart(struct amdgpu_device *adev, uint64_t *src,
+			     uint64_t *dst, uint64_t npages,
+			     enum MIGRATION_COPY_DIR direction,
+			     struct dma_fence **mfence)
+{
+	const uint64_t GTT_MAX_PAGES = AMDGPU_GTT_MAX_TRANSFER_SIZE;
+	struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
+	uint64_t gart_s, gart_d;
+	struct dma_fence *next;
+	uint64_t size;
+	int r;
+
+	mutex_lock(&adev->mman.gtt_window_lock);
+
+	while (npages) {
+		size = min(GTT_MAX_PAGES, npages);
+
+		if (direction == FROM_VRAM_TO_RAM) {
+			gart_s = svm_migrate_direct_mapping_addr(adev, *src);
+			r = svm_migrate_gart_map(ring, size, dst, &gart_d, 0);
+
+		} else if (direction == FROM_RAM_TO_VRAM) {
+			r = svm_migrate_gart_map(ring, size, src, &gart_s,
+						 KFD_IOCTL_SVM_FLAG_GPU_RO);
+			gart_d = svm_migrate_direct_mapping_addr(adev, *dst);
+		}
+		if (r) {
+			pr_debug("failed %d to create gart mapping\n", r);
+			goto out_unlock;
+		}
+
+		r = amdgpu_copy_buffer(ring, gart_s, gart_d, size * PAGE_SIZE,
+				       NULL, &next, false, true, false);
+		if (r) {
+			pr_debug("failed %d to copy memory\n", r);
+			goto out_unlock;
+		}
+
+		dma_fence_put(*mfence);
+		*mfence = next;
+		npages -= size;
+		if (npages) {
+			src += size;
+			dst += size;
+		}
+	}
+
+out_unlock:
+	mutex_unlock(&adev->mman.gtt_window_lock);
+
+	return r;
+}
+
+/**
+ * svm_migrate_copy_done - wait for the sdma memory copy to complete
+ *
+ * @adev: amdgpu device the sdma memory copy is executing on
+ * @mfence: migrate fence
+ *
+ * Wait for the dma fence to be signaled; if the copy was split into multiple
+ * sdma operations, this is the fence of the last sdma operation.
+ *
+ * Context: called after svm_migrate_copy_memory
+ *
+ * Return:
+ * 0		- success
+ * otherwise	- error code from dma fence signal
+ */
+int
+svm_migrate_copy_done(struct amdgpu_device *adev, struct dma_fence *mfence)
+{
+	int r = 0;
+
+	if (mfence) {
+		r = dma_fence_wait(mfence, false);
+		dma_fence_put(mfence);
+		pr_debug("sdma copy memory fence done\n");
+	}
+
+	return r;
+}
+
 static void svm_migrate_page_free(struct page *page)
 {
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
index 98ab685d3e17..5db5686fa46a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -32,6 +32,11 @@
 #include "kfd_priv.h"
 #include "kfd_svm.h"
 
+enum MIGRATION_COPY_DIR {
+	FROM_RAM_TO_VRAM = 0,
+	FROM_VRAM_TO_RAM
+};
+
 #if defined(CONFIG_DEVICE_PRIVATE)
 int svm_migrate_init(struct amdgpu_device *adev);
 void svm_migrate_fini(struct amdgpu_device *adev);
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 21/35] drm/amdkfd: HMM migrate ram to vram
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (19 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 20/35] drm/amdkfd: copy memory through gart table Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 22/35] drm/amdkfd: HMM migrate vram to ram Felix Kuehling
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

Registering an svm range with the same address and size, but with
preferred_location changed from CPU to GPU or from GPU to CPU, triggers
migration of the svm range from ram to vram or from vram to ram.

If the svm range prefetch location is a GPU and the
KFD_IOCTL_SVM_FLAG_HOST_ACCESS flag is set, validate the svm range on ram
first, then migrate it from ram to vram.

After the migration to vram completes, CPU access causes a CPU page fault;
the page fault handler migrates the range back to ram and resumes CPU
access.

Migration steps:

1. migrate_vma_setup collects the svm range ram pages, notifies that the
interval is invalidated and unmaps them from the CPU page table; the HMM
interval notifier callback evicts the process queues
2. Allocate new pages in vram using TTM
3. Use svm copy memory to sdma-copy the data from ram to vram
4. migrate_vma_pages copies the ram page structures to the vram page
structures
5. migrate_vma_finalize puts the ram pages, freeing the ram pages and memory
6. The restore work waits until the migration is finished, then updates the
GPU page table mappings to the new vram pages and resumes the process queues

If migrate_vma_setup fails to collect all ram pages of the range, retry up
to 3 times until it succeeds, then start the migration, as sketched below.
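
A condensed sketch of the per-VMA flow implemented below (error handling and
the setup retry loop are dropped; src_pfns and dst_pfns stand in for the pfn
arrays the patch allocates with kvmalloc_array):

	struct migrate_vma migrate = {
		.vma	     = vma,
		.start	     = start,
		.end	     = end,
		.flags	     = MIGRATE_VMA_SELECT_SYSTEM,
		.pgmap_owner = adev,
		.src	     = src_pfns,
		.dst	     = dst_pfns,
	};

	r = migrate_vma_setup(&migrate);		/* step 1 */
	if (!r && migrate.cpages) {
		/* steps 2 and 3: allocate vram pages and sdma-copy the data */
		svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence);
		migrate_vma_pages(&migrate);		/* step 4 */
		svm_migrate_copy_done(adev, mfence);	/* wait for sdma */
		migrate_vma_finalize(&migrate);		/* step 5 */
	}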

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 265 +++++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   2 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 175 ++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |   2 +
 4 files changed, 436 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index f2019c8f0b80..af23f0be7eaf 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -204,6 +204,271 @@ svm_migrate_copy_done(struct amdgpu_device *adev, struct dma_fence *mfence)
 	return r;
 }
 
+static uint64_t
+svm_migrate_node_physical_addr(struct amdgpu_device *adev,
+			       struct drm_mm_node **mm_node, uint64_t *offset)
+{
+	struct drm_mm_node *node = *mm_node;
+	uint64_t pos = *offset;
+
+	if (node->start == AMDGPU_BO_INVALID_OFFSET) {
+		pr_debug("drm node is not validated\n");
+		return 0;
+	}
+
+	pr_debug("vram node start 0x%llx npages 0x%llx\n", node->start,
+		 node->size);
+
+	if (pos >= node->size) {
+		do  {
+			pos -= node->size;
+			node++;
+		} while (pos >= node->size);
+
+		*mm_node = node;
+		*offset = pos;
+	}
+
+	return (node->start + pos) << PAGE_SHIFT;
+}
+
+unsigned long
+svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr)
+{
+	return (addr + adev->kfd.dev->pgmap.res.start) >> PAGE_SHIFT;
+}
+
+static void
+svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
+{
+	struct page *page;
+
+	page = pfn_to_page(pfn);
+	page->zone_device_data = prange;
+	get_page(page);
+	lock_page(page);
+}
+
+static void
+svm_migrate_put_vram_page(struct amdgpu_device *adev, unsigned long addr)
+{
+	struct page *page;
+
+	page = pfn_to_page(svm_migrate_addr_to_pfn(adev, addr));
+	unlock_page(page);
+	put_page(page);
+}
+
+
+static int
+svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
+			 struct migrate_vma *migrate,
+			 struct dma_fence **mfence)
+{
+	uint64_t npages = migrate->cpages;
+	struct drm_mm_node *node;
+	uint64_t *src, *dst;
+	uint64_t vram_addr;
+	uint64_t offset;
+	uint64_t i, j;
+	int r = -ENOMEM;
+
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	src = kvmalloc_array(npages << 1, sizeof(*src), GFP_KERNEL);
+	if (!src)
+		goto out;
+	dst = src + npages;
+
+	r = svm_range_vram_node_new(adev, prange, false);
+	if (r) {
+		pr_debug("failed %d get 0x%llx pages from vram\n", r, npages);
+		goto out_free;
+	}
+
+	node = prange->mm_nodes;
+	offset = prange->offset;
+	vram_addr = svm_migrate_node_physical_addr(adev, &node, &offset);
+	if (!vram_addr) {
+		WARN_ONCE(1, "vram node address is 0\n");
+		r = -ENOMEM;
+		goto out_free;
+	}
+
+	for (i = j = 0; i < npages; i++) {
+		struct page *spage;
+
+		spage = migrate_pfn_to_page(migrate->src[i]);
+		src[i] = page_to_pfn(spage) << PAGE_SHIFT;
+
+		dst[i] = vram_addr + (j << PAGE_SHIFT);
+		migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
+		svm_migrate_get_vram_page(prange, migrate->dst[i]);
+
+		migrate->dst[i] = migrate_pfn(migrate->dst[i]);
+		migrate->dst[i] |= MIGRATE_PFN_LOCKED;
+
+		if (j + offset >= node->size - 1 && i < npages - 1) {
+			r = svm_migrate_copy_memory_gart(adev, src + i - j,
+							 dst + i - j, j + 1,
+							 FROM_RAM_TO_VRAM,
+							 mfence);
+			if (r)
+				goto out_free_vram_pages;
+
+			node++;
+			pr_debug("next node size 0x%llx\n", node->size);
+			vram_addr = node->start << PAGE_SHIFT;
+			offset = 0;
+			j = 0;
+		} else {
+			j++;
+		}
+	}
+
+	r = svm_migrate_copy_memory_gart(adev, src + i - j, dst + i - j, j,
+					 FROM_RAM_TO_VRAM, mfence);
+	if (!r)
+		goto out_free;
+
+out_free_vram_pages:
+	pr_debug("failed %d to copy memory to vram\n", r);
+	while (i--) {
+		svm_migrate_put_vram_page(adev, dst[i]);
+		migrate->dst[i] = 0;
+	}
+
+out_free:
+	kvfree(src);
+out:
+	return r;
+}
+
+static int
+svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
+			struct vm_area_struct *vma, uint64_t start,
+			uint64_t end)
+{
+	uint64_t npages = (end - start) >> PAGE_SHIFT;
+	struct dma_fence *mfence = NULL;
+	struct migrate_vma migrate;
+	int r = -ENOMEM;
+	int retry = 0;
+
+	memset(&migrate, 0, sizeof(migrate));
+	migrate.vma = vma;
+	migrate.start = start;
+	migrate.end = end;
+	migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
+	migrate.pgmap_owner = adev;
+
+	migrate.src = kvmalloc_array(npages << 1, sizeof(*migrate.src),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!migrate.src)
+		goto out;
+	migrate.dst = migrate.src + npages;
+
+retry:
+	r = migrate_vma_setup(&migrate);
+	if (r) {
+		pr_debug("failed %d prepare migrate svms 0x%p [0x%lx 0x%lx]\n",
+			 r, prange->svms, prange->it_node.start,
+			 prange->it_node.last);
+		goto out_free;
+	}
+	if (migrate.cpages != npages) {
+		pr_debug("collect 0x%lx/0x%llx pages, retry\n", migrate.cpages,
+			 npages);
+		migrate_vma_finalize(&migrate);
+		if (retry++ >= 3) {
+			r = -ENOMEM;
+			pr_debug("failed %d migrate svms 0x%p [0x%lx 0x%lx]\n",
+				 r, prange->svms, prange->it_node.start,
+				 prange->it_node.last);
+			goto out_free;
+		}
+
+		goto retry;
+	}
+
+	if (migrate.cpages) {
+		svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence);
+		migrate_vma_pages(&migrate);
+		svm_migrate_copy_done(adev, mfence);
+		migrate_vma_finalize(&migrate);
+	}
+
+	kvfree(prange->pages_addr);
+	prange->pages_addr = NULL;
+
+out_free:
+	kvfree(migrate.src);
+out:
+	return r;
+}
+
+/**
+ * svm_migrate_ram_to_vram - migrate svm range from system to device
+ * @prange: range structure
+ * @best_loc: the device to migrate to
+ *
+ * Context: Process context. The caller must hold mm->mmap_sem and
+ *          prange->lock and take the svms srcu read lock.
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc)
+{
+	unsigned long addr, start, end;
+	struct vm_area_struct *vma;
+	struct amdgpu_device *adev;
+	struct mm_struct *mm;
+	int r = 0;
+
+	if (prange->actual_loc == best_loc) {
+		pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n",
+			 prange->svms, prange->it_node.start,
+			 prange->it_node.last, best_loc);
+		return 0;
+	}
+
+	adev = svm_range_get_adev_by_id(prange, best_loc);
+	if (!adev) {
+		pr_debug("failed to get device by id 0x%x\n", best_loc);
+		return -ENODEV;
+	}
+
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	start = prange->it_node.start << PAGE_SHIFT;
+	end = (prange->it_node.last + 1) << PAGE_SHIFT;
+
+	mm = current->mm;
+
+	for (addr = start; addr < end;) {
+		unsigned long next;
+
+		vma = find_vma(mm, addr);
+		if (!vma || addr < vma->vm_start)
+			break;
+
+		next = min(vma->vm_end, end);
+		r = svm_migrate_vma_to_vram(adev, prange, vma, addr, next);
+		if (r) {
+			pr_debug("failed to migrate\n");
+			break;
+		}
+		addr = next;
+	}
+
+	prange->actual_loc = best_loc;
+
+	return r;
+}
+
 static void svm_migrate_page_free(struct page *page)
 {
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
index 5db5686fa46a..ffae5f989909 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -37,6 +37,8 @@ enum MIGRATION_COPY_DIR {
 	FROM_VRAM_TO_RAM
 };
 
+int svm_migrate_ram_to_vram(struct svm_range *prange,  uint32_t best_loc);
+
 #if defined(CONFIG_DEVICE_PRIVATE)
 int svm_migrate_init(struct amdgpu_device *adev);
 void svm_migrate_fini(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 8a4d0a3935b6..0dbc403413a1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -30,6 +30,7 @@
 #include "amdgpu_xgmi.h"
 #include "kfd_priv.h"
 #include "kfd_svm.h"
+#include "kfd_migrate.h"
 
 #define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
 
@@ -120,6 +121,7 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
 	INIT_LIST_HEAD(&prange->remove_list);
 	INIT_LIST_HEAD(&prange->svm_bo_list);
 	atomic_set(&prange->invalid, 0);
+	mutex_init(&prange->mutex);
 	spin_lock_init(&prange->svm_bo_lock);
 	svm_range_set_default_attributes(&prange->preferred_loc,
 					 &prange->prefetch_loc,
@@ -409,6 +411,11 @@ static int svm_range_validate_vram(struct svm_range *prange)
 		 prange->it_node.start, prange->it_node.last,
 		 prange->actual_loc);
 
+	if (prange->mm_nodes) {
+		pr_debug("validation skipped after migration\n");
+		return 0;
+	}
+
 	adev = svm_range_get_adev_by_id(prange, prange->actual_loc);
 	if (!adev) {
 		pr_debug("failed to get device by id 0x%x\n",
@@ -428,7 +435,9 @@ svm_range_validate(struct mm_struct *mm, struct svm_range *prange)
 {
 	int r;
 
-	pr_debug("actual loc 0x%x\n", prange->actual_loc);
+	pr_debug("svms 0x%p [0x%lx 0x%lx] actual loc 0x%x\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last,
+		 prange->actual_loc);
 
 	if (!prange->actual_loc)
 		r = svm_range_validate_ram(mm, prange);
@@ -1109,28 +1118,36 @@ static void svm_range_restore_work(struct work_struct *work)
 			 prange->svms, prange->it_node.start,
 			 prange->it_node.last, invalid);
 
+		/*
+		 * If the range is migrating, wait until the migration is done.
+		 */
+		mutex_lock(&prange->mutex);
+
 		r = svm_range_validate(mm, prange);
 		if (r) {
 			pr_debug("failed %d to validate [0x%lx 0x%lx]\n", r,
 				 prange->it_node.start, prange->it_node.last);
 
-			goto unlock_out;
+			goto out_unlock;
 		}
 
 		r = svm_range_map_to_gpus(prange, true);
-		if (r) {
+		if (r)
 			pr_debug("failed %d to map 0x%lx to gpu\n", r,
 				 prange->it_node.start);
-			goto unlock_out;
-		}
+
+out_unlock:
+		mutex_unlock(&prange->mutex);
+		if (r)
+			goto out_reschedule;
 
 		if (atomic_cmpxchg(&prange->invalid, invalid, 0) != invalid)
-			goto unlock_out;
+			goto out_reschedule;
 	}
 
 	if (atomic_cmpxchg(&svms->evicted_ranges, evicted_ranges, 0) !=
 	    evicted_ranges)
-		goto unlock_out;
+		goto out_reschedule;
 
 	evicted_ranges = 0;
 
@@ -1144,7 +1161,7 @@ static void svm_range_restore_work(struct work_struct *work)
 
 	pr_debug("restore svm ranges successfully\n");
 
-unlock_out:
+out_reschedule:
 	srcu_read_unlock(&svms->srcu, srcu_idx);
 	mmap_read_unlock(mm);
 	mutex_unlock(&process_info->lock);
@@ -1617,6 +1634,134 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 	return 0;
 }
 
+/* svm_range_best_location - decide the best actual location
+ * @prange: svm range structure
+ *
+ * For xnack off:
+ * If the range maps to a single GPU, the best actual location is the prefetch
+ * loc, which can be CPU or GPU.
+ *
+ * If the range maps to multiple GPUs, the best actual location can be the
+ * prefetch_loc GPU only if the mGPUs are connected in the same xgmi hive. Over
+ * PCIe the best actual location is always CPU, because a GPU cannot access the
+ * vram of other GPUs with PCIe small bar (large bar support is not upstream).
+ *
+ * For xnack on:
+ * The best actual location is the prefetch location. If the mGPUs are in the
+ * same xgmi hive, the range maps to multiple GPUs; otherwise the range maps
+ * only to the actual location GPU; vm faults from other GPUs trigger migration.
+ *
+ * Context: Process context
+ *
+ * Return:
+ * 0 for CPU or GPU id
+ */
+static uint32_t svm_range_best_location(struct svm_range *prange)
+{
+	DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE);
+	uint32_t best_loc = prange->prefetch_loc;
+	struct amdgpu_device *bo_adev;
+	struct amdgpu_device *adev;
+	struct kfd_dev *kfd_dev;
+	struct kfd_process *p;
+	uint32_t gpuidx;
+
+	p = container_of(prange->svms, struct kfd_process, svms);
+
+	/* xnack on */
+	if (p->xnack_enabled)
+		goto out;
+
+	/* xnack off */
+	if (!best_loc || best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED)
+		goto out;
+
+	bo_adev = svm_range_get_adev_by_id(prange, best_loc);
+	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
+		  MAX_GPU_INSTANCE);
+
+	for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
+		kfd_process_device_from_gpuidx(p, gpuidx, &kfd_dev);
+		adev = (struct amdgpu_device *)kfd_dev->kgd;
+
+		if (adev == bo_adev)
+			continue;
+
+		if (!amdgpu_xgmi_same_hive(adev, bo_adev)) {
+			best_loc = 0;
+			break;
+		}
+	}
+
+out:
+	pr_debug("xnack %d svms 0x%p [0x%lx 0x%lx] best loc 0x%x\n",
+		 p->xnack_enabled, &p->svms, prange->it_node.start,
+		 prange->it_node.last, best_loc);
+	return best_loc;
+}
+
+/* svm_range_trigger_migration - start page migration if prefetch loc changed
+ * @mm: current process mm_struct
+ * @prange: svm range structure
+ * @migrated: output, true if migration is triggered
+ *
+ * If the range prefetch_loc is a GPU and the actual loc is cpu 0, migrate the
+ * range from ram to vram.
+ * If the range prefetch_loc is cpu 0 and the actual loc is a GPU, migrate the
+ * range from vram to ram.
+ *
+ * If GPU vm fault retry is not enabled, the migration interacts with the MMU
+ * notifier and the restore work:
+ * 1. migrate_vma_setup invalidates the pages; the MMU notifier callback
+ *    svm_range_evict stops all queues and schedules the restore work
+ * 2. svm_range_restore_work waits until the migration is done via
+ *    a. svm_range_validate_vram taking prange->mutex
+ *    b. svm_range_validate_ram HMM get pages waiting for the CPU fault handler
+ * 3. the restore work updates the GPU mappings and resumes all queues.
+ *
+ * Context: Process context
+ *
+ * Return:
+ * 0 - OK, otherwise - error code of migration
+ */
+static int
+svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
+			    bool *migrated)
+{
+	uint32_t best_loc;
+	int r = 0;
+
+	*migrated = false;
+	best_loc = svm_range_best_location(prange);
+
+	if (best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED ||
+	    best_loc == prange->actual_loc)
+		return 0;
+
+	if (best_loc && !prange->actual_loc &&
+	    !(prange->flags & KFD_IOCTL_SVM_FLAG_HOST_ACCESS))
+		return 0;
+
+	if (best_loc) {
+		if (!prange->actual_loc && !prange->pages_addr) {
+			pr_debug("host access and prefetch to gpu\n");
+			r = svm_range_validate_ram(mm, prange);
+			if (r) {
+				pr_debug("failed %d to validate on ram\n", r);
+				return r;
+			}
+		}
+
+		pr_debug("migrate from ram to vram\n");
+		r = svm_migrate_ram_to_vram(prange, best_loc);
+
+		if (!r)
+			*migrated = true;
+	}
+
+	return r;
+}
+
 static int
 svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 		   uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
@@ -1675,6 +1820,9 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 	}
 
 	list_for_each_entry(prange, &update_list, update_list) {
+		bool migrated;
+
+		mutex_lock(&prange->mutex);
 
 		r = svm_range_apply_attrs(p, prange, nattr, attrs);
 		if (r) {
@@ -1682,6 +1830,16 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 			goto out_unlock;
 		}
 
+		r = svm_range_trigger_migration(mm, prange, &migrated);
+		if (r)
+			goto out_unlock;
+
+		if (migrated) {
+			pr_debug("restore_work will update mappings of GPUs\n");
+			mutex_unlock(&prange->mutex);
+			continue;
+		}
+
 		r = svm_range_validate(mm, prange);
 		if (r) {
 			pr_debug("failed %d to validate svm range\n", r);
@@ -1693,6 +1851,7 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 			pr_debug("failed %d to map svm range\n", r);
 
 out_unlock:
+		mutex_unlock(&prange->mutex);
 		if (r) {
 			mmap_read_unlock(mm);
 			srcu_read_unlock(&prange->svms->srcu, srcu_idx);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index b1d2db02043b..b81dfb32135b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -42,6 +42,7 @@ struct svm_range_bo {
  * struct svm_range - shared virtual memory range
  *
  * @svms:       list of svm ranges, structure defined in kfd_process
+ * @mutex:      to serialize range migration, validation and mapping update
  * @it_node:    node [start, last] stored in interval tree, start, last are page
  *              aligned, page size is (last - start + 1)
  * @list:       link list node, used to scan all ranges of svms
@@ -70,6 +71,7 @@ struct svm_range_bo {
  */
 struct svm_range {
 	struct svm_range_list		*svms;
+	struct mutex			mutex;
 	struct interval_tree_node	it_node;
 	struct list_head		list;
 	struct list_head		update_list;
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 22/35] drm/amdkfd: HMM migrate vram to ram
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (20 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 21/35] drm/amdkfd: HMM migrate ram to vram Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 23/35] drm/amdkfd: invalidate tables on page retry fault Felix Kuehling
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

When a CPU page fault happens, the HMM pgmap_ops callback migrate_to_ram
migrates memory from vram to ram in these steps:

1. migrate_vma_setup collects the vram pages and notifies HMM to invalidate
them; the HMM interval notifier callback evicts the process queues
2. Allocate system memory pages
3. Use svm copy memory to migrate the data from vram to ram
4. migrate_vma_pages copies the page structures from the vram pages to the
ram pages
5. Return VM_FAULT_SIGBUS if the migration failed, to notify the application
6. migrate_vma_finalize puts the vram pages; the page_free callback frees the
vram pages and vram nodes
7. The restore work waits until the migration is finished, then updates the
GPU page table mappings to system memory and resumes the process queues
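
The fault-handler side of this, condensed from svm_migrate_to_ram() and
svm_migrate_vram_to_ram() below (process lookup, srcu locking and debug
output are trimmed):

	/* CPU page fault on a device-private page */
	r = svm_range_split_by_granularity(p, addr, &list);
	if (!r) {
		list_for_each_entry(prange, &list, update_list) {
			mutex_lock(&prange->mutex);
			r = svm_migrate_vram_to_ram(prange, vma->vm_mm);
			mutex_unlock(&prange->mutex);
			if (r)
				break;
		}
	}
	/* step 5: tell the application if the migration failed */
	return r ? VM_FAULT_SIGBUS : 0;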

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 274 ++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   3 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 116 +++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |   4 +
 4 files changed, 392 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index af23f0be7eaf..d33a4cc63495 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -259,6 +259,35 @@ svm_migrate_put_vram_page(struct amdgpu_device *adev, unsigned long addr)
 	put_page(page);
 }
 
+static unsigned long
+svm_migrate_addr(struct amdgpu_device *adev, struct page *page)
+{
+	unsigned long addr;
+
+	addr = page_to_pfn(page) << PAGE_SHIFT;
+	return (addr - adev->kfd.dev->pgmap.res.start);
+}
+
+static struct page *
+svm_migrate_get_sys_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (page)
+		lock_page(page);
+
+	return page;
+}
+
+void svm_migrate_put_sys_page(unsigned long addr)
+{
+	struct page *page;
+
+	page = pfn_to_page(addr >> PAGE_SHIFT);
+	unlock_page(page);
+	put_page(page);
+}
 
 static int
 svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
@@ -471,13 +500,208 @@ int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc)
 
 static void svm_migrate_page_free(struct page *page)
 {
+	/* Keep this function to avoid warning */
+}
+
+static int
+svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
+			struct migrate_vma *migrate,
+			struct dma_fence **mfence)
+{
+	uint64_t npages = migrate->cpages;
+	uint64_t *src, *dst;
+	struct page *dpage;
+	uint64_t i = 0, j;
+	uint64_t addr;
+	int r = 0;
+
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	addr = prange->it_node.start << PAGE_SHIFT;
+
+	src = kvmalloc_array(npages << 1, sizeof(*src), GFP_KERNEL);
+	if (!src)
+		return -ENOMEM;
+
+	dst = src + npages;
+
+	prange->pages_addr = kvmalloc_array(npages, sizeof(*prange->pages_addr),
+					    GFP_KERNEL | __GFP_ZERO);
+	if (!prange->pages_addr) {
+		r = -ENOMEM;
+		goto out_oom;
+	}
+
+	for (i = 0, j = 0; i < npages; i++, j++, addr += PAGE_SIZE) {
+		struct page *spage;
+
+		spage = migrate_pfn_to_page(migrate->src[i]);
+		if (!spage) {
+			pr_debug("failed get spage svms 0x%p [0x%lx 0x%lx]\n",
+				 prange->svms, prange->it_node.start,
+				 prange->it_node.last);
+			r = -ENOMEM;
+			goto out_oom;
+		}
+		src[i] = svm_migrate_addr(adev, spage);
+		if (i > 0 && src[i] != src[i - 1] + PAGE_SIZE) {
+			r = svm_migrate_copy_memory_gart(adev, src + i - j,
+							 dst + i - j, j,
+							 FROM_VRAM_TO_RAM,
+							 mfence);
+			if (r)
+				goto out_oom;
+			j = 0;
+		}
+
+		dpage = svm_migrate_get_sys_page(migrate->vma, addr);
+		if (!dpage) {
+			pr_debug("failed get page svms 0x%p [0x%lx 0x%lx]\n",
+				 prange->svms, prange->it_node.start,
+				 prange->it_node.last);
+			r = -ENOMEM;
+			goto out_oom;
+		}
+
+		dst[i] = page_to_pfn(dpage) << PAGE_SHIFT;
+		*(prange->pages_addr + i) = dst[i];
+
+		migrate->dst[i] = migrate_pfn(page_to_pfn(dpage));
+		migrate->dst[i] |= MIGRATE_PFN_LOCKED;
+
+	}
+
+	r = svm_migrate_copy_memory_gart(adev, src + i - j, dst + i - j, j,
+					 FROM_VRAM_TO_RAM, mfence);
+
+out_oom:
+	kvfree(src);
+	if (r) {
+		pr_debug("failed %d copy to ram\n", r);
+		while (i--) {
+			svm_migrate_put_sys_page(dst[i]);
+			migrate->dst[i] = 0;
+		}
+	}
+
+	return r;
+}
+
+static int
+svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
+		       struct vm_area_struct *vma, uint64_t start, uint64_t end)
+{
+	uint64_t npages = (end - start) >> PAGE_SHIFT;
+	struct dma_fence *mfence = NULL;
+	struct migrate_vma migrate;
+	int r = -ENOMEM;
+
+	memset(&migrate, 0, sizeof(migrate));
+	migrate.vma = vma;
+	migrate.start = start;
+	migrate.end = end;
+	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	migrate.pgmap_owner = adev;
+
+	migrate.src = kvmalloc_array(npages << 1, sizeof(*migrate.src),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!migrate.src)
+		goto out;
+	migrate.dst = migrate.src + npages;
+
+	r = migrate_vma_setup(&migrate);
+	if (r) {
+		pr_debug("failed %d prepare migrate svms 0x%p [0x%lx 0x%lx]\n",
+			 r, prange->svms, prange->it_node.start,
+			 prange->it_node.last);
+		goto out_free;
+	}
+
+	pr_debug("cpages %ld\n", migrate.cpages);
+
+	if (migrate.cpages) {
+		svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence);
+		migrate_vma_pages(&migrate);
+		svm_migrate_copy_done(adev, mfence);
+		migrate_vma_finalize(&migrate);
+	} else {
+		pr_debug("failed collect migrate device pages [0x%lx 0x%lx]\n",
+			 prange->it_node.start, prange->it_node.last);
+	}
+
+out_free:
+	kvfree(migrate.src);
+out:
+	return r;
+}
+
+/**
+ * svm_migrate_vram_to_ram - migrate svm range from device to system
+ * @prange: range structure
+ * @mm: process mm, use current->mm if NULL
+ *
+ * Context: Process context. The caller must hold mm->mmap_sem and
+ *          prange->lock and take the svms srcu read lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm)
+{
+	struct amdgpu_device *adev;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+	unsigned long start;
+	unsigned long end;
+	int r = 0;
+
+	if (!prange->actual_loc) {
+		pr_debug("[0x%lx 0x%lx] already migrated to ram\n",
+			 prange->it_node.start, prange->it_node.last);
+		return 0;
+	}
+
+	adev = svm_range_get_adev_by_id(prange, prange->actual_loc);
+	if (!adev) {
+		pr_debug("failed to get device by id 0x%x\n",
+			 prange->actual_loc);
+		return -ENODEV;
+	}
+
+	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last);
+
+	start = prange->it_node.start << PAGE_SHIFT;
+	end = (prange->it_node.last + 1) << PAGE_SHIFT;
+
+	for (addr = start; addr < end;) {
+		unsigned long next;
+
+		vma = find_vma(mm, addr);
+		if (!vma || addr < vma->vm_start)
+			break;
+
+		next = min(vma->vm_end, end);
+		r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next);
+		if (r) {
+			pr_debug("failed %d to migrate\n", r);
+			break;
+		}
+		addr = next;
+	}
+
+	svm_range_vram_node_free(prange);
+	prange->actual_loc = 0;
+
+	return r;
 }
 
 /**
  * svm_migrate_to_ram - CPU page fault handler
  * @vmf: CPU vm fault vma, address
  *
- * Context: vm fault handler, mm->mmap_sem is taken
+ * Context: vm fault handler, caller holds the mmap lock
  *
  * Return:
  * 0 - OK
@@ -485,7 +709,53 @@ static void svm_migrate_page_free(struct page *page)
  */
 static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
 {
-	return VM_FAULT_SIGBUS;
+	unsigned long addr = vmf->address;
+	struct vm_area_struct *vma;
+	struct svm_range *prange;
+	struct list_head list;
+	struct kfd_process *p;
+	int r = VM_FAULT_SIGBUS;
+	int srcu_idx;
+
+	vma = vmf->vma;
+
+	p = kfd_lookup_process_by_mm(vma->vm_mm);
+	if (!p) {
+		pr_debug("failed find process at fault address 0x%lx\n", addr);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* Prevent the prange from being removed */
+	srcu_idx = srcu_read_lock(&p->svms.srcu);
+
+	addr >>= PAGE_SHIFT;
+	pr_debug("CPU page fault svms 0x%p address 0x%lx\n", &p->svms, addr);
+
+	r = svm_range_split_by_granularity(p, addr, &list);
+	if (r) {
+		pr_debug("failed %d to split range by granularity\n", r);
+		goto out_srcu;
+	}
+
+	list_for_each_entry(prange, &list, update_list) {
+		mutex_lock(&prange->mutex);
+		r = svm_migrate_vram_to_ram(prange, vma->vm_mm);
+		mutex_unlock(&prange->mutex);
+		if (r) {
+			pr_debug("failed %d migrate [0x%lx 0x%lx] to ram\n", r,
+				 prange->it_node.start, prange->it_node.last);
+			goto out_srcu;
+		}
+	}
+
+out_srcu:
+	srcu_read_unlock(&p->svms.srcu, srcu_idx);
+	kfd_unref_process(p);
+
+	if (r)
+		return VM_FAULT_SIGBUS;
+
+	return 0;
 }
 
 static const struct dev_pagemap_ops svm_migrate_pgmap_ops = {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
index ffae5f989909..95fd7b21791f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -38,6 +38,9 @@ enum MIGRATION_COPY_DIR {
 };
 
 int svm_migrate_ram_to_vram(struct svm_range *prange,  uint32_t best_loc);
+int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm);
+unsigned long
+svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
 
 #if defined(CONFIG_DEVICE_PRIVATE)
 int svm_migrate_init(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 0dbc403413a1..37f35f986930 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -819,6 +819,92 @@ svm_range_split_add_front(struct svm_range *prange, struct svm_range *new,
 	return 0;
 }
 
+/**
+ * svm_range_split_by_granularity - collect ranges within granularity boundary
+ *
+ * @p: the process with svms list
+ * @addr: the vm fault address in pages, to search ranges
+ * @list: output, the range list
+ *
+ * Collects small ranges that make up one migration granule and splits the first
+ * and the last range at the granularity boundary
+ *
+ * Context: hold and release svms lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr,
+				   struct list_head *list)
+{
+	struct svm_range *prange;
+	struct svm_range *tail;
+	struct svm_range *new;
+	unsigned long start;
+	unsigned long last;
+	unsigned long size;
+	int r = 0;
+
+	svms_lock(&p->svms);
+
+	prange = svm_range_from_addr(&p->svms, addr);
+	if (!prange) {
+		pr_debug("cannot find svm range at 0x%lx\n", addr);
+		svms_unlock(&p->svms);
+		return -EFAULT;
+	}
+
+	/* Align the split range start and size to the granularity size, so a
+	 * single PTE is used for the whole range; this reduces the number of
+	 * PTEs updated and the L1 TLB space used for translation.
+	 */
+	size = 1ULL << prange->granularity;
+	start = ALIGN_DOWN(addr, size);
+	last = ALIGN(addr + 1, size) - 1;
+	INIT_LIST_HEAD(list);
+
+	pr_debug("svms 0x%p split [0x%lx 0x%lx] at 0x%lx granularity 0x%lx\n",
+		 prange->svms, start, last, addr, size);
+
+	if (start > prange->it_node.start) {
+		r = svm_range_split(prange, prange->it_node.start, start - 1,
+				    &new);
+		if (r)
+			goto out_unlock;
+
+		svm_range_add_to_svms(new);
+	} else {
+		new = prange;
+	}
+
+	while (size > new->npages) {
+		struct interval_tree_node *next;
+
+		list_add(&new->update_list, list);
+
+		next = interval_tree_iter_next(&new->it_node, start, last);
+		if (!next)
+			goto out_unlock;
+
+		size -= new->npages;
+		new = container_of(next, struct svm_range, it_node);
+	}
+
+	if (last < new->it_node.last) {
+		r = svm_range_split(new, new->it_node.start, last, &tail);
+		if (r)
+			goto out_unlock;
+		svm_range_add_to_svms(tail);
+	}
+
+	list_add(&new->update_list, list);
+
+out_unlock:
+	svms_unlock(&p->svms);
+
+	return r;
+}
+
 static uint64_t
 svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange)
 {
@@ -1508,6 +1594,27 @@ static const struct mmu_interval_notifier_ops svm_range_mn_ops = {
 	.invalidate = svm_range_cpu_invalidate_pagetables,
 };
 
+/**
+ * svm_range_from_addr - find svm range from fault address
+ * @svms: svm range list header
+ * @addr: address to search range interval tree, in pages
+ *
+ * Context: The caller must hold svms_lock
+ *
+ * Return: the svm_range found or NULL
+ */
+struct svm_range *
+svm_range_from_addr(struct svm_range_list *svms, unsigned long addr)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(&svms->objects, addr, addr);
+	if (!node)
+		return NULL;
+
+	return container_of(node, struct svm_range, it_node);
+}
+
 void svm_range_list_fini(struct kfd_process *p)
 {
 	pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms);
@@ -1754,11 +1861,14 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
 
 		pr_debug("migrate from ram to vram\n");
 		r = svm_migrate_ram_to_vram(prange, best_loc);
-
-		if (!r)
-			*migrated = true;
+	} else {
+		pr_debug("migrate from vram to ram\n");
+		r = svm_migrate_vram_to_ram(prange, current->mm);
 	}
 
+	if (!r)
+		*migrated = true;
+
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index b81dfb32135b..c67e96f764fe 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -112,10 +112,14 @@ void svm_range_list_fini(struct kfd_process *p);
 int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
 	      uint64_t size, uint32_t nattrs,
 	      struct kfd_ioctl_svm_attribute *attrs);
+struct svm_range *svm_range_from_addr(struct svm_range_list *svms,
+				      unsigned long addr);
 struct amdgpu_device *svm_range_get_adev_by_id(struct svm_range *prange,
 					       uint32_t id);
 int svm_range_vram_node_new(struct amdgpu_device *adev,
 			    struct svm_range *prange, bool clear);
 void svm_range_vram_node_free(struct svm_range *prange);
+int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr,
+				   struct list_head *list);
 
 #endif /* KFD_SVM_H_ */
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 23/35] drm/amdkfd: invalidate tables on page retry fault
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (21 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 22/35] drm/amdkfd: HMM migrate vram to ram Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 24/35] drm/amdkfd: page table restore through svm API Felix Kuehling
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Alex Sierra <alex.sierra@amd.com>

GPU page tables are invalidated by unmapping the prange directly in the mmu
notifier when page fault retry is enabled through the amdgpu_noretry global
parameter. The page table restore is performed in the page fault handler.

If xnack is on, we need to update the GPU mapping after a prefetch migration
to avoid a GPU vm fault, because the range migration unmaps the range from
the GPUs and no restore work is scheduled to update the GPU mapping.
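
The behavioral split in the eviction path, condensed from the
svm_range_evict() hunk below (the evicted_ranges accounting and debug output
are omitted):

	if (!p->xnack_enabled) {
		/* xnack off: evict the queues and let the delayed restore
		 * work re-validate and re-map the range
		 */
		schedule_delayed_work(&svms->restore_work,
		   msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
	} else {
		/* xnack on: just drop the GPU mapping; the retry fault
		 * handler restores it on the next GPU access
		 */
		svm_range_unmap_from_gpus(prange);
	}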

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 37f35f986930..ea27c5ed4ef3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1279,7 +1279,9 @@ svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm,
 	int r = 0;
 	struct interval_tree_node *node;
 	struct svm_range *prange;
+	struct kfd_process *p;
 
+	p = container_of(svms, struct kfd_process, svms);
 	svms_lock(svms);
 
 	pr_debug("invalidate svms 0x%p [0x%lx 0x%lx]\n", svms, start, last);
@@ -1292,8 +1294,13 @@ svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm,
 		next = interval_tree_iter_next(node, start, last);
 
 		invalid = atomic_inc_return(&prange->invalid);
-		evicted_ranges = atomic_inc_return(&svms->evicted_ranges);
-		if (evicted_ranges == 1) {
+
+		if (!p->xnack_enabled) {
+			evicted_ranges =
+				atomic_inc_return(&svms->evicted_ranges);
+			if (evicted_ranges != 1)
+				goto next_node;
+
 			pr_debug("evicting svms 0x%p range [0x%lx 0x%lx]\n",
 				 prange->svms, prange->it_node.start,
 				 prange->it_node.last);
@@ -1306,7 +1313,14 @@ svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm,
 			pr_debug("schedule to restore svm %p ranges\n", svms);
 			schedule_delayed_work(&svms->restore_work,
 			   msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS));
+		} else {
+			pr_debug("invalidate svms 0x%p [0x%lx 0x%lx] %d\n",
+				 prange->svms, prange->it_node.start,
+				 prange->it_node.last, invalid);
+			if (invalid == 1)
+				svm_range_unmap_from_gpus(prange);
 		}
+next_node:
 		node = next;
 	}
 
@@ -1944,7 +1958,7 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 		if (r)
 			goto out_unlock;
 
-		if (migrated) {
+		if (migrated && !p->xnack_enabled) {
 			pr_debug("restore_work will update mappings of GPUs\n");
 			mutex_unlock(&prange->mutex);
 			continue;
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 24/35] drm/amdkfd: page table restore through svm API
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (22 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 23/35] drm/amdkfd: invalidate tables on page retry fault Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 25/35] drm/amdkfd: SVM API call to restore page tables Felix Kuehling
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Alex Sierra <alex.sierra@amd.com>

Page table restore implementation in the SVM API. This is called from the
fault handler in amdgpu_vm to update the page tables through the page fault
retry IH.
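
Condensed from svm_range_restore_pages() below (process and mm reference
handling, svms locking and error paths are trimmed):

	p = kfd_lookup_process_by_pasid(pasid);
	prange = svm_range_from_addr(&p->svms, addr);
	if (prange && atomic_read(&prange->invalid)) {
		mmap_read_lock(mm);
		mutex_lock(&prange->mutex);	/* wait out any migration */
		r = svm_range_validate(mm, prange);
		if (!r)
			r = svm_range_map_to_gpus(prange, true);
		mutex_unlock(&prange->mutex);
		mmap_read_unlock(mm);
	}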

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 78 ++++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  2 +
 2 files changed, 80 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index ea27c5ed4ef3..7346255f7c27 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1629,6 +1629,84 @@ svm_range_from_addr(struct svm_range_list *svms, unsigned long addr)
 	return container_of(node, struct svm_range, it_node);
 }
 
+int
+svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
+			uint64_t addr)
+{
+	int r = 0;
+	int srcu_idx;
+	struct mm_struct *mm = NULL;
+	struct svm_range *prange;
+	struct svm_range_list *svms;
+	struct kfd_process *p;
+
+	p = kfd_lookup_process_by_pasid(pasid);
+	if (!p) {
+		pr_debug("kfd process not founded pasid 0x%x\n", pasid);
+		return -ESRCH;
+	}
+	svms = &p->svms;
+	srcu_idx = srcu_read_lock(&svms->srcu);
+
+	pr_debug("restoring svms 0x%p fault address 0x%llx\n", svms, addr);
+
+	svms_lock(svms);
+	prange = svm_range_from_addr(svms, addr);
+	svms_unlock(svms);
+	if (!prange) {
+		pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",
+			 svms, addr);
+		r = -EFAULT;
+		goto unlock_out;
+	}
+
+	if (!atomic_read(&prange->invalid)) {
+		pr_debug("svms 0x%p [0x%lx %lx] already restored\n",
+			 svms, prange->it_node.start, prange->it_node.last);
+		goto unlock_out;
+	}
+
+	mm = get_task_mm(p->lead_thread);
+	if (!mm) {
+		pr_debug("svms 0x%p failed to get mm\n", svms);
+		r = -ESRCH;
+		goto unlock_out;
+	}
+
+	mmap_read_lock(mm);
+
+	/*
+	 * If the range is migrating, wait until the migration is done.
+	 */
+	mutex_lock(&prange->mutex);
+
+	r = svm_range_validate(mm, prange);
+	if (r) {
+		pr_debug("failed %d to validate svms 0x%p [0x%lx 0x%lx]\n", r,
+			 svms, prange->it_node.start, prange->it_node.last);
+
+		goto mmput_out;
+	}
+
+	pr_debug("restoring svms 0x%p [0x%lx %lx] mapping\n",
+		 svms, prange->it_node.start, prange->it_node.last);
+
+	r = svm_range_map_to_gpus(prange, true);
+	if (r)
+		pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpu\n", r,
+			 svms, prange->it_node.start, prange->it_node.last);
+
+mmput_out:
+	mutex_unlock(&prange->mutex);
+	mmap_read_unlock(mm);
+	mmput(mm);
+unlock_out:
+	srcu_read_unlock(&svms->srcu, srcu_idx);
+	kfd_unref_process(p);
+
+	return r;
+}
+
 void svm_range_list_fini(struct kfd_process *p)
 {
 	pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index c67e96f764fe..e546f36ef709 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -121,5 +121,7 @@ int svm_range_vram_node_new(struct amdgpu_device *adev,
 void svm_range_vram_node_free(struct svm_range *prange);
 int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr,
 				   struct list_head *list);
+int svm_range_restore_pages(struct amdgpu_device *adev,
+			    unsigned int pasid, uint64_t addr);
 
 #endif /* KFD_SVM_H_ */
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 25/35] drm/amdkfd: SVM API call to restore page tables
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (23 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 24/35] drm/amdkfd: page table restore through svm API Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 26/35] drm/amdkfd: add svm_bo reference for eviction fence Felix Kuehling
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

Use the SVM API to restore page tables when retry faults are enabled and the
VM belongs to a compute context.
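
The decision point, condensed from the amdgpu_vm_handle_fault() hunk below:

	addr /= AMDGPU_GPU_PAGE_SIZE;

	if (!amdgpu_noretry && is_compute_context &&
	    !svm_range_restore_pages(adev, pasid, addr)) {
		/* SVM restored the mapping; no need to write a dummy or
		 * no-retry PTE for this fault
		 */
		amdgpu_bo_unref(&root);
		return true;
	}
	/* otherwise fall through to the existing fault handling */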

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9c557e8bf0e5..abdd4e7b4c3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -37,6 +37,7 @@
 #include "amdgpu_gmc.h"
 #include "amdgpu_xgmi.h"
 #include "amdgpu_dma_buf.h"
+#include "kfd_svm.h"
 
 /**
  * DOC: GPUVM
@@ -3301,18 +3302,29 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
 	uint64_t value, flags;
 	struct amdgpu_vm *vm;
 	long r;
+	bool is_compute_context = false;
 
 	spin_lock(&adev->vm_manager.pasid_lock);
 	vm = idr_find(&adev->vm_manager.pasid_idr, pasid);
-	if (vm)
+	if (vm) {
 		root = amdgpu_bo_ref(vm->root.base.bo);
-	else
+		is_compute_context = vm->is_compute_context;
+	} else {
 		root = NULL;
+	}
 	spin_unlock(&adev->vm_manager.pasid_lock);
 
 	if (!root)
 		return false;
 
+	addr /= AMDGPU_GPU_PAGE_SIZE;
+
+	if (!amdgpu_noretry && is_compute_context &&
+		!svm_range_restore_pages(adev, pasid, addr)) {
+		amdgpu_bo_unref(&root);
+		return true;
+	}
+
 	r = amdgpu_bo_reserve(root, true);
 	if (r)
 		goto error_unref;
@@ -3326,18 +3338,16 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
 	if (!vm)
 		goto error_unlock;
 
-	addr /= AMDGPU_GPU_PAGE_SIZE;
 	flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED |
 		AMDGPU_PTE_SYSTEM;
 
-	if (vm->is_compute_context) {
+	if (is_compute_context) {
 		/* Intentionally setting invalid PTE flag
 		 * combination to force a no-retry-fault
 		 */
 		flags = AMDGPU_PTE_EXECUTABLE | AMDGPU_PDE_PTE |
 			AMDGPU_PTE_TF;
 		value = 0;
-
 	} else if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) {
 		/* Redirect the access to the dummy page */
 		value = adev->dummy_page_addr;
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 26/35] drm/amdkfd: add svm_bo reference for eviction fence
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (24 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 25/35] drm/amdkfd: SVM API call to restore page tables Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 27/35] drm/amdgpu: add param bit flag to create SVM BOs Felix Kuehling
                   ` (10 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

[why]
As part of the SVM functionality, the eviction mechanism used for SVM_BOs is
different. This mechanism uses one eviction fence per prange instead of one
fence per kfd_process.

[how]
Add an svm_bo reference to amdgpu_amdkfd_fence to allow differentiating
between SVM_BO and regular BO evictions. This also includes modifications to
set the reference at the fence creation call.
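
With the new parameter, the existing per-process fences keep passing NULL
(see the gpuvm hunks below). A per-prange SVM fence would then be created
roughly like this; the eviction_fence field name is a hypothetical
placeholder, and the actual creation lands in the later eviction patch:

	/* hypothetical sketch: tag the fence with the range's svm_bo so the
	 * eviction path can tell SVM_BO evictions apart from process BOs
	 */
	svm_bo->eviction_fence =
		amdgpu_amdkfd_fence_create(dma_fence_context_alloc(1),
					   mm, svm_bo);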

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h       | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 5 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index bc9f0e42e0a2..fb8be788ac1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -75,6 +75,7 @@ struct amdgpu_amdkfd_fence {
 	struct mm_struct *mm;
 	spinlock_t lock;
 	char timeline_name[TASK_COMM_LEN];
+	struct svm_range_bo *svm_bo;
 };
 
 struct amdgpu_kfd_dev {
@@ -95,7 +96,8 @@ enum kgd_engine_type {
 };
 
 struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
-						       struct mm_struct *mm);
+					struct mm_struct *mm,
+					struct svm_range_bo *svm_bo);
 bool amdkfd_fence_check_mm(struct dma_fence *f, struct mm_struct *mm);
 struct amdgpu_amdkfd_fence *to_amdgpu_amdkfd_fence(struct dma_fence *f);
 int amdgpu_amdkfd_remove_fence_on_pt_pd_bos(struct amdgpu_bo *bo);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
index 3107b9575929..9cc85efa4ed5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
@@ -60,7 +60,8 @@ static atomic_t fence_seq = ATOMIC_INIT(0);
  */
 
 struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
-						       struct mm_struct *mm)
+				struct mm_struct *mm,
+				struct svm_range_bo *svm_bo)
 {
 	struct amdgpu_amdkfd_fence *fence;
 
@@ -73,7 +74,7 @@ struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
 	fence->mm = mm;
 	get_task_comm(fence->timeline_name, current);
 	spin_lock_init(&fence->lock);
-
+	fence->svm_bo = svm_bo;
 	dma_fence_init(&fence->base, &amdkfd_fence_ops, &fence->lock,
 		   context, atomic_inc_return(&fence_seq));
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99ad4e1d0896..8a43f3880022 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -928,7 +928,8 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
 
 		info->eviction_fence =
 			amdgpu_amdkfd_fence_create(dma_fence_context_alloc(1),
-						   current->mm);
+						   current->mm,
+						   NULL);
 		if (!info->eviction_fence) {
 			pr_err("Failed to create eviction fence\n");
 			ret = -ENOMEM;
@@ -2150,7 +2151,8 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
 	 */
 	new_fence = amdgpu_amdkfd_fence_create(
 				process_info->eviction_fence->base.context,
-				process_info->eviction_fence->mm);
+				process_info->eviction_fence->mm,
+				NULL);
 	if (!new_fence) {
 		pr_err("Failed to create eviction fence\n");
 		ret = -ENOMEM;
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 27/35] drm/amdgpu: add param bit flag to create SVM BOs
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (25 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 26/35] drm/amdkfd: add svm_bo reference for eviction fence Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 28/35] drm/amdkfd: add svm_bo eviction mechanism support Felix Kuehling
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

Add an AMDGPU_AMDKFD_CREATE_SVM_BO define bit for SVM BOs. The existing
KFD userptr flag is also moved (and renamed) to concentrate these KFD
BO-type flags in one include file.
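
A minimal usage sketch, assuming only the flag definitions added below
(handle_svm_bo_eviction is a hypothetical placeholder):

  /* The KFD BO-type flags live in the top bits of amdgpu_bo.flags
   * (bits 63 and 62 here), away from the UAPI creation flags.
   */
  bp.flags |= AMDGPU_AMDKFD_CREATE_SVM_BO;        /* tag the BO at creation */

  if (bo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO)    /* test it later, e.g. on eviction */
          handle_svm_bo_eviction(bo);             /* hypothetical helper */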

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 7 ++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h       | 5 +++++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 8a43f3880022..5982d09b6c3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -31,9 +31,6 @@
 #include "amdgpu_dma_buf.h"
 #include <uapi/linux/kfd_ioctl.h>
 
-/* BO flag to indicate a KFD userptr BO */
-#define AMDGPU_AMDKFD_USERPTR_BO (1ULL << 63)
-
 /* Userptr restore delay, just long enough to allow consecutive VM
  * changes to accumulate
  */
@@ -207,7 +204,7 @@ void amdgpu_amdkfd_unreserve_memory_limit(struct amdgpu_bo *bo)
 	u32 domain = bo->preferred_domains;
 	bool sg = (bo->preferred_domains == AMDGPU_GEM_DOMAIN_CPU);
 
-	if (bo->flags & AMDGPU_AMDKFD_USERPTR_BO) {
+	if (bo->flags & AMDGPU_AMDKFD_CREATE_USERPTR_BO) {
 		domain = AMDGPU_GEM_DOMAIN_CPU;
 		sg = false;
 	}
@@ -1241,7 +1238,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 	bo->kfd_bo = *mem;
 	(*mem)->bo = bo;
 	if (user_addr)
-		bo->flags |= AMDGPU_AMDKFD_USERPTR_BO;
+		bo->flags |= AMDGPU_AMDKFD_CREATE_USERPTR_BO;
 
 	(*mem)->va = va;
 	(*mem)->domain = domain;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index adbefd6a655d..b72772ab93fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -37,6 +37,11 @@
 #define AMDGPU_BO_INVALID_OFFSET	LONG_MAX
 #define AMDGPU_BO_MAX_PLACEMENTS	3
 
+/* BO flag to indicate a KFD userptr BO */
+#define AMDGPU_AMDKFD_CREATE_USERPTR_BO	(1ULL << 63)
+#define AMDGPU_AMDKFD_CREATE_SVM_BO	(1ULL << 62)
+
+
 struct amdgpu_bo_param {
 	unsigned long			size;
 	int				byte_align;
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 28/35] drm/amdkfd: add svm_bo eviction mechanism support
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (26 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 27/35] drm/amdgpu: add param bit flag to create SVM BOs Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition Felix Kuehling
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Alex Sierra <alex.sierra@amd.com>

The svm_bo eviction mechanism is different from that of regular BOs.
Every SVM_BO created contains one eviction fence and one worker item
for the eviction process. SVM_BOs can be attached to one or more
pranges. For SVM_BO eviction, TTM calls the enable_signal callback for
every SVM_BO until VRAM space is available. All the ttm_evict calls
here are synchronous; this guarantees that each eviction has completed
and its fence has signaled before the call returns.
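
A condensed sketch of the flow added by this patch (names are taken from
the diff below; declarations, locking, reference counting and error
handling are omitted):

  /* TTM wants to evict the SVM BO: enable_signaling marks it and
   * schedules the per-BO eviction worker.
   */
  WRITE_ONCE(fence->svm_bo->evicting, 1);
  schedule_work(&fence->svm_bo->eviction_work);

  /* Worker: migrate every attached prange back to system memory, then
   * signal the eviction fence so the synchronous ttm_evict call returns.
   */
  list_for_each_entry_rcu(prange, &svm_bo->range_list, svm_bo_list)
          svm_migrate_vram_to_ram(prange, svm_bo->eviction_fence->mm);
  dma_fence_signal(&svm_bo->eviction_fence->base);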

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 197 ++++++++++++++++++++-------
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  13 +-
 2 files changed, 160 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 7346255f7c27..63b745a06740 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -34,6 +34,7 @@
 
 #define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
 
+static void svm_range_evict_svm_bo_worker(struct work_struct *work);
 /**
  * svm_range_unlink - unlink svm_range from lists and interval tree
  * @prange: svm range structure to be removed
@@ -260,7 +261,15 @@ static void svm_range_bo_release(struct kref *kref)
 		list_del_init(&prange->svm_bo_list);
 	}
 	spin_unlock(&svm_bo->list_lock);
-
+	if (!dma_fence_is_signaled(&svm_bo->eviction_fence->base)) {
+		/* We're not in the eviction worker.
+		 * Signal the fence and synchronize with any
+		 * pending eviction work.
+		 */
+		dma_fence_signal(&svm_bo->eviction_fence->base);
+		cancel_work_sync(&svm_bo->eviction_work);
+	}
+	dma_fence_put(&svm_bo->eviction_fence->base);
 	amdgpu_bo_unref(&svm_bo->bo);
 	kfree(svm_bo);
 }
@@ -273,6 +282,62 @@ static void svm_range_bo_unref(struct svm_range_bo *svm_bo)
 	kref_put(&svm_bo->kref, svm_range_bo_release);
 }
 
+static bool svm_range_validate_svm_bo(struct svm_range *prange)
+{
+	spin_lock(&prange->svm_bo_lock);
+	if (!prange->svm_bo) {
+		spin_unlock(&prange->svm_bo_lock);
+		return false;
+	}
+	if (prange->mm_nodes) {
+		/* We still have a reference, all is well */
+		spin_unlock(&prange->svm_bo_lock);
+		return true;
+	}
+	if (svm_bo_ref_unless_zero(prange->svm_bo)) {
+		if (READ_ONCE(prange->svm_bo->evicting)) {
+			struct dma_fence *f;
+			struct svm_range_bo *svm_bo;
+			/* The BO is getting evicted,
+			 * we need to get a new one
+			 */
+			spin_unlock(&prange->svm_bo_lock);
+			svm_bo = prange->svm_bo;
+			f = dma_fence_get(&svm_bo->eviction_fence->base);
+			svm_range_bo_unref(prange->svm_bo);
+			/* wait for the fence to avoid long spin-loop
+			 * at list_empty_careful
+			 */
+			dma_fence_wait(f, false);
+			dma_fence_put(f);
+		} else {
+			/* The BO was still around and we got
+			 * a new reference to it
+			 */
+			spin_unlock(&prange->svm_bo_lock);
+			pr_debug("reuse old bo svms 0x%p [0x%lx 0x%lx]\n",
+				 prange->svms, prange->it_node.start,
+				 prange->it_node.last);
+
+			prange->mm_nodes = prange->svm_bo->bo->tbo.mem.mm_node;
+			return true;
+		}
+
+	} else {
+		spin_unlock(&prange->svm_bo_lock);
+	}
+
+	/* We need a new svm_bo. Spin-loop to wait for concurrent
+	 * svm_range_bo_release to finish removing this range from
+	 * its range list. After this, it is safe to reuse the
+	 * svm_bo pointer and svm_bo_list head.
+	 */
+	while (!list_empty_careful(&prange->svm_bo_list))
+		;
+
+	return false;
+}
+
 static struct svm_range_bo *svm_range_bo_new(void)
 {
 	struct svm_range_bo *svm_bo;
@@ -292,71 +357,54 @@ int
 svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange,
 			bool clear)
 {
-	struct amdkfd_process_info *process_info;
 	struct amdgpu_bo_param bp;
 	struct svm_range_bo *svm_bo;
 	struct amdgpu_bo *bo;
 	struct kfd_process *p;
+	struct mm_struct *mm;
 	int r;
 
-	pr_debug("[0x%lx 0x%lx]\n", prange->it_node.start,
-		 prange->it_node.last);
-	spin_lock(&prange->svm_bo_lock);
-	if (prange->svm_bo) {
-		if (prange->mm_nodes) {
-			/* We still have a reference, all is well */
-			spin_unlock(&prange->svm_bo_lock);
-			return 0;
-		}
-		if (svm_bo_ref_unless_zero(prange->svm_bo)) {
-			/* The BO was still around and we got
-			 * a new reference to it
-			 */
-			spin_unlock(&prange->svm_bo_lock);
-			pr_debug("reuse old bo [0x%lx 0x%lx]\n",
-				prange->it_node.start, prange->it_node.last);
-
-			prange->mm_nodes = prange->svm_bo->bo->tbo.mem.mm_node;
-			return 0;
-		}
-
-		spin_unlock(&prange->svm_bo_lock);
-
-		/* We need a new svm_bo. Spin-loop to wait for concurrent
-		 * svm_range_bo_release to finish removing this range from
-		 * its range list. After this, it is safe to reuse the
-		 * svm_bo pointer and svm_bo_list head.
-		 */
-		while (!list_empty_careful(&prange->svm_bo_list))
-			;
+	p = container_of(prange->svms, struct kfd_process, svms);
+	pr_debug("pasid: %x svms 0x%p [0x%lx 0x%lx]\n", p->pasid, prange->svms,
+		 prange->it_node.start, prange->it_node.last);
 
-	} else {
-		spin_unlock(&prange->svm_bo_lock);
-	}
+	if (svm_range_validate_svm_bo(prange))
+		return 0;
 
 	svm_bo = svm_range_bo_new();
 	if (!svm_bo) {
 		pr_debug("failed to alloc svm bo\n");
 		return -ENOMEM;
 	}
-
+	mm = get_task_mm(p->lead_thread);
+	if (!mm) {
+		pr_debug("failed to get mm\n");
+		kfree(svm_bo);
+		return -ESRCH;
+	}
+	svm_bo->svms = prange->svms;
+	svm_bo->eviction_fence =
+		amdgpu_amdkfd_fence_create(dma_fence_context_alloc(1),
+					   mm,
+					   svm_bo);
+	mmput(mm);
+	INIT_WORK(&svm_bo->eviction_work, svm_range_evict_svm_bo_worker);
+	svm_bo->evicting = 0;
 	memset(&bp, 0, sizeof(bp));
 	bp.size = prange->npages * PAGE_SIZE;
 	bp.byte_align = PAGE_SIZE;
 	bp.domain = AMDGPU_GEM_DOMAIN_VRAM;
 	bp.flags = AMDGPU_GEM_CREATE_NO_CPU_ACCESS;
 	bp.flags |= clear ? AMDGPU_GEM_CREATE_VRAM_CLEARED : 0;
+	bp.flags |= AMDGPU_AMDKFD_CREATE_SVM_BO;
 	bp.type = ttm_bo_type_device;
 	bp.resv = NULL;
 
 	r = amdgpu_bo_create(adev, &bp, &bo);
 	if (r) {
 		pr_debug("failed %d to create bo\n", r);
-		kfree(svm_bo);
-		return r;
+		goto create_bo_failed;
 	}
-
-	p = container_of(prange->svms, struct kfd_process, svms);
 	r = amdgpu_bo_reserve(bo, true);
 	if (r) {
 		pr_debug("failed %d to reserve bo\n", r);
@@ -369,8 +417,7 @@ svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange,
 		amdgpu_bo_unreserve(bo);
 		goto reserve_bo_failed;
 	}
-	process_info = p->kgd_process_info;
-	amdgpu_bo_fence(bo, &process_info->eviction_fence->base, true);
+	amdgpu_bo_fence(bo, &svm_bo->eviction_fence->base, true);
 
 	amdgpu_bo_unreserve(bo);
 
@@ -380,14 +427,16 @@ svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange,
 	prange->offset = 0;
 
 	spin_lock(&svm_bo->list_lock);
-	list_add(&prange->svm_bo_list, &svm_bo->range_list);
+	list_add_rcu(&prange->svm_bo_list, &svm_bo->range_list);
 	spin_unlock(&svm_bo->list_lock);
 
 	return 0;
 
 reserve_bo_failed:
-	kfree(svm_bo);
 	amdgpu_bo_unref(&bo);
+create_bo_failed:
+	dma_fence_put(&svm_bo->eviction_fence->base);
+	kfree(svm_bo);
 	prange->mm_nodes = NULL;
 
 	return r;
@@ -621,7 +670,7 @@ svm_range_split_nodes(struct svm_range *new, struct svm_range *old,
 	new->mm_nodes = old->mm_nodes;
 
 	spin_lock(&new->svm_bo->list_lock);
-	list_add(&new->svm_bo_list, &new->svm_bo->range_list);
+	list_add_rcu(&new->svm_bo_list, &new->svm_bo->range_list);
 	spin_unlock(&new->svm_bo->list_lock);
 
 	return 0;
@@ -1353,7 +1402,7 @@ struct svm_range *svm_range_clone(struct svm_range *old)
 		new->offset = old->offset;
 		new->svm_bo = svm_range_bo_ref(old->svm_bo);
 		spin_lock(&new->svm_bo->list_lock);
-		list_add(&new->svm_bo_list, &new->svm_bo->range_list);
+		list_add_rcu(&new->svm_bo_list, &new->svm_bo->range_list);
 		spin_unlock(&new->svm_bo->list_lock);
 	}
 	new->flags = old->flags;
@@ -1964,6 +2013,62 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
 	return r;
 }
 
+int svm_range_schedule_evict_svm_bo(struct amdgpu_amdkfd_fence *fence)
+{
+	if (!fence)
+		return -EINVAL;
+
+	if (dma_fence_is_signaled(&fence->base))
+		return 0;
+
+	if (fence->svm_bo) {
+		WRITE_ONCE(fence->svm_bo->evicting, 1);
+		schedule_work(&fence->svm_bo->eviction_work);
+	}
+
+	return 0;
+}
+
+static void svm_range_evict_svm_bo_worker(struct work_struct *work)
+{
+	struct svm_range_bo *svm_bo;
+	struct svm_range *prange;
+	struct kfd_process *p;
+	struct mm_struct *mm;
+	int srcu_idx;
+
+	svm_bo = container_of(work, struct svm_range_bo, eviction_work);
+	if (!svm_bo_ref_unless_zero(svm_bo))
+		return; /* svm_bo was freed while eviction was pending */
+
+	/* svm_range_bo_release destroys this worker thread. So during
+	 * the lifetime of this thread, kfd_process and mm will be valid.
+	 */
+	p = container_of(svm_bo->svms, struct kfd_process, svms);
+	mm = p->mm;
+	if (!mm)
+		return;
+
+	mmap_read_lock(mm);
+	srcu_idx = srcu_read_lock(&svm_bo->svms->srcu);
+	list_for_each_entry_rcu(prange, &svm_bo->range_list, svm_bo_list) {
+		pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
+			 prange->it_node.start, prange->it_node.last);
+		mutex_lock(&prange->mutex);
+		svm_migrate_vram_to_ram(prange, svm_bo->eviction_fence->mm);
+		mutex_unlock(&prange->mutex);
+	}
+	srcu_read_unlock(&svm_bo->svms->srcu, srcu_idx);
+	mmap_read_unlock(mm);
+
+	dma_fence_signal(&svm_bo->eviction_fence->base);
+	/* This is the last reference to svm_bo, after svm_range_vram_node_free
+	 * has been called in svm_migrate_vram_to_ram
+	 */
+	WARN_ONCE(kref_read(&svm_bo->kref) != 1, "This was not the last reference\n");
+	svm_range_bo_unref(svm_bo);
+}
+
 static int
 svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 		   uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index e546f36ef709..143573621956 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -33,10 +33,14 @@
 #include "kfd_priv.h"
 
 struct svm_range_bo {
-	struct amdgpu_bo	*bo;
-	struct kref		kref;
-	struct list_head	range_list; /* all svm ranges shared this bo */
-	spinlock_t		list_lock;
+	struct amdgpu_bo		*bo;
+	struct kref			kref;
+	struct list_head		range_list; /* all svm ranges shared this bo */
+	spinlock_t			list_lock;
+	struct amdgpu_amdkfd_fence	*eviction_fence;
+	struct work_struct		eviction_work;
+	struct svm_range_list		*svms;
+	uint32_t			evicting;
 };
 /**
  * struct svm_range - shared virtual memory range
@@ -123,5 +127,6 @@ int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr,
 				   struct list_head *list);
 int svm_range_restore_pages(struct amdgpu_device *adev,
 			    unsigned int pasid, uint64_t addr);
+int svm_range_schedule_evict_svm_bo(struct amdgpu_amdkfd_fence *fence);
 
 #endif /* KFD_SVM_H_ */
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (27 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 28/35] drm/amdkfd: add svm_bo eviction mechanism support Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07 10:56   ` Christian König
  2021-01-07  3:01 ` [PATCH 30/35] drm/amdgpu: add svm_bo eviction to enable_signal cb Felix Kuehling
                   ` (7 subsequent siblings)
  36 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

[why]
To support the svm_bo eviction mechanism.

[how]
If the BO was created with the AMDGPU_AMDKFD_CREATE_SVM_BO flag set,
the enable_signal callback will be called inside amdgpu_evict_flags.
This also guts the BO by removing all placements, so that TTM won't
actually do an eviction. Instead it will discard the memory held by
the BO. This is needed for HMM migration to user-mode system memory
pages.
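
Condensed from the amdgpu_evict_flags change below (RCU locking and the
NULL-fence check are elided):

  if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) {
          /* kick off SVM BO eviction by enabling fence signaling ... */
          dma_fence_enable_sw_signaling(rcu_dereference(resv->fence_excl));
          /* ... and gut the BO: with no placements, TTM discards the
           * backing store instead of copying it anywhere.
           */
          placement->num_placement = 0;
          placement->num_busy_placement = 0;
          return;
  }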

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index f423f42cb9b5..62d4da95d22d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,
 	}
 
 	abo = ttm_to_amdgpu_bo(bo);
+	if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) {
+		struct dma_fence *fence;
+		struct dma_resv *resv = &bo->base._resv;
+
+		rcu_read_lock();
+		fence = rcu_dereference(resv->fence_excl);
+		if (fence && !fence->ops->signaled)
+			dma_fence_enable_sw_signaling(fence);
+
+		placement->num_placement = 0;
+		placement->num_busy_placement = 0;
+		rcu_read_unlock();
+		return;
+	}
 	switch (bo->mem.mem_type) {
 	case AMDGPU_PL_GDS:
 	case AMDGPU_PL_GWS:
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 30/35] drm/amdgpu: add svm_bo eviction to enable_signal cb
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (28 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 31/35] drm/amdgpu: reserve fence slot to update page table Felix Kuehling
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

From: Alex Sierra <alex.sierra@amd.com>

Add support for svm_bo fence eviction to the
amdgpu_amdkfd_fence.enable_signal callback.
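
In short, the callback now dispatches on the fence's svm_bo reference
(condensed from the diff below; the true/false return handling is elided):

  if (!fence->svm_bo)
          /* regular BO: schedule the per-process evict/restore work */
          kgd2kfd_schedule_evict_and_restore_process(fence->mm, f);
  else
          /* SVM BO: schedule the per-BO eviction worker */
          svm_range_schedule_evict_svm_bo(fence);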

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
index 9cc85efa4ed5..98d6e08f22d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
@@ -28,6 +28,7 @@
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
 #include "amdgpu_amdkfd.h"
+#include "kfd_svm.h"
 
 static const struct dma_fence_ops amdkfd_fence_ops;
 static atomic_t fence_seq = ATOMIC_INIT(0);
@@ -123,9 +124,13 @@ static bool amdkfd_fence_enable_signaling(struct dma_fence *f)
 	if (dma_fence_is_signaled(f))
 		return true;
 
-	if (!kgd2kfd_schedule_evict_and_restore_process(fence->mm, f))
-		return true;
-
+	if (!fence->svm_bo) {
+		if (!kgd2kfd_schedule_evict_and_restore_process(fence->mm, f))
+			return true;
+	} else {
+		if (!svm_range_schedule_evict_svm_bo(fence))
+			return true;
+	}
 	return false;
 }
 
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 31/35] drm/amdgpu: reserve fence slot to update page table
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (29 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 30/35] drm/amdgpu: add svm_bo eviction to enable_signal cb Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07 10:57   ` Christian König
  2021-01-07  3:01 ` [PATCH 32/35] drm/amdgpu: enable retry fault wptr overflow Felix Kuehling
                   ` (5 subsequent siblings)
  36 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

A fence slot was not reserved before using SDMA to update the page
table, which causes the kernel BUG backtrace below when handling a VM
retry fault while the application is exiting.

[  133.048143] kernel BUG at /home/yangp/git/compute_staging/kernel/drivers/dma-buf/dma-resv.c:281!
[  133.048487] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]
[  133.048506] RIP: 0010:dma_resv_add_shared_fence+0x204/0x280
[  133.048672]  amdgpu_vm_sdma_commit+0x134/0x220 [amdgpu]
[  133.048788]  amdgpu_vm_bo_update_range+0x220/0x250 [amdgpu]
[  133.048905]  amdgpu_vm_handle_fault+0x202/0x370 [amdgpu]
[  133.049031]  gmc_v9_0_process_interrupt+0x1ab/0x310 [amdgpu]
[  133.049165]  ? kgd2kfd_interrupt+0x9a/0x180 [amdgpu]
[  133.049289]  ? amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
[  133.049408]  amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
[  133.049534]  amdgpu_ih_process+0x9b/0x1c0 [amdgpu]
[  133.049657]  amdgpu_irq_handle_ih1+0x21/0x60 [amdgpu]
[  133.049669]  process_one_work+0x29f/0x640
[  133.049678]  worker_thread+0x39/0x3f0
[  133.049685]  ? process_one_work+0x640/0x640
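
The fix pattern, condensed from the diff below (the surrounding root-PD
reservation and unlock paths are elided):

  /* Reserve one shared fence slot on the root page directory's reservation
   * object before the SDMA-based update adds its fence to it.
   */
  r = dma_resv_reserve_shared(root->tbo.base.resv, 1);
  if (r)
          goto error_unlock;

  r = amdgpu_vm_bo_update_mapping(adev, adev, vm, true, false, NULL, addr,
                                  addr, flags, value, NULL, NULL, NULL);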

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index abdd4e7b4c3b..bd9de870f8f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -3301,7 +3301,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
 	struct amdgpu_bo *root;
 	uint64_t value, flags;
 	struct amdgpu_vm *vm;
-	long r;
+	int r;
 	bool is_compute_context = false;
 
 	spin_lock(&adev->vm_manager.pasid_lock);
@@ -3359,6 +3359,12 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
 		value = 0;
 	}
 
+	r = dma_resv_reserve_shared(root->tbo.base.resv, 1);
+	if (r) {
+		pr_debug("failed %d to reserve fence slot\n", r);
+		goto error_unlock;
+	}
+
 	r = amdgpu_vm_bo_update_mapping(adev, adev, vm, true, false, NULL, addr,
 					addr, flags, value, NULL, NULL,
 					NULL);
@@ -3370,7 +3376,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
 error_unlock:
 	amdgpu_bo_unreserve(root);
 	if (r < 0)
-		DRM_ERROR("Can't handle page fault (%ld)\n", r);
+		DRM_ERROR("Can't handle page fault (%d)\n", r);
 
 error_unref:
 	amdgpu_bo_unref(&root);
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 32/35] drm/amdgpu: enable retry fault wptr overflow
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (30 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 31/35] drm/amdgpu: reserve fence slot to update page table Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07 11:01   ` Christian König
  2021-01-07  3:01 ` [PATCH 33/35] drm/amdkfd: refine migration policy with xnack on Felix Kuehling
                   ` (4 subsequent siblings)
  36 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

If XNACK is on, VM retry fault interrupts are sent to IH ring1, and
ring1 fills up quickly. IH then cannot receive other interrupts, which
causes a deadlock if we migrate a buffer using SDMA and wait for SDMA
completion while handling a retry fault.

Remove VMC from the IH storm client list and enable ring1 write pointer
overflow. IH will then drop retry fault interrupts and still be able to
receive other interrupts while the driver is handling retry faults.

The IH ring1 write pointer is not written back to memory by IH, and the
ring1 write pointer recorded by the self-irq is not updated, so always
read the latest ring1 write pointer from the register.
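
Condensed from the vega10/vega20 get_wptr changes below: ring0 keeps
using the write-back location, while ring1 (and ring2) always read the
register:

  if (ih == &adev->irq.ih) {
          /* only ring0 supports write-back of the write pointer */
          wptr = le32_to_cpu(*ih->wptr_cpu);
          if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
                  goto out;
  }

  /* ring1/ring2, or an overflowed ring0: read the latest write pointer
   * and the overflow status straight from the IH_RB_WPTR register
   */
  wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr);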

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 32 +++++++++-----------------
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 32 +++++++++-----------------
 2 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
index 88626d83e07b..ca8efa5c6978 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
@@ -220,10 +220,8 @@ static int vega10_ih_enable_ring(struct amdgpu_device *adev,
 	tmp = vega10_ih_rb_cntl(ih, tmp);
 	if (ih == &adev->irq.ih)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled);
-	if (ih == &adev->irq.ih1) {
-		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0);
+	if (ih == &adev->irq.ih1)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1);
-	}
 	if (amdgpu_sriov_vf(adev)) {
 		if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) {
 			dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n");
@@ -265,7 +263,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev)
 	u32 ih_chicken;
 	int ret;
 	int i;
-	u32 tmp;
 
 	/* disable irqs */
 	ret = vega10_ih_toggle_interrupts(adev, false);
@@ -291,15 +288,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev)
 		}
 	}
 
-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL,
-			    CLIENT18_IS_STORM_CLIENT, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp);
-
-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp);
-
 	pci_set_master(adev->pdev);
 
 	/* enable interrupts */
@@ -345,11 +333,17 @@ static u32 vega10_ih_get_wptr(struct amdgpu_device *adev,
 	u32 wptr, tmp;
 	struct amdgpu_ih_regs *ih_regs;
 
-	wptr = le32_to_cpu(*ih->wptr_cpu);
-	ih_regs = &ih->ih_regs;
+	if (ih == &adev->irq.ih) {
+		/* Only ring0 supports writeback. On other rings fall back
+		 * to register-based code with overflow checking below.
+		 */
+		wptr = le32_to_cpu(*ih->wptr_cpu);
 
-	if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
-		goto out;
+		if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
+			goto out;
+	}
+
+	ih_regs = &ih->ih_regs;
 
 	/* Double check that the overflow wasn't already cleared. */
 	wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr);
@@ -440,15 +434,11 @@ static int vega10_ih_self_irq(struct amdgpu_device *adev,
 			      struct amdgpu_irq_src *source,
 			      struct amdgpu_iv_entry *entry)
 {
-	uint32_t wptr = cpu_to_le32(entry->src_data[0]);
-
 	switch (entry->ring_id) {
 	case 1:
-		*adev->irq.ih1.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih1_work);
 		break;
 	case 2:
-		*adev->irq.ih2.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih2_work);
 		break;
 	default: break;
diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
index 42032ca380cc..60d1bd51781e 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
@@ -220,10 +220,8 @@ static int vega20_ih_enable_ring(struct amdgpu_device *adev,
 	tmp = vega20_ih_rb_cntl(ih, tmp);
 	if (ih == &adev->irq.ih)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled);
-	if (ih == &adev->irq.ih1) {
-		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0);
+	if (ih == &adev->irq.ih1)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1);
-	}
 	if (amdgpu_sriov_vf(adev)) {
 		if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) {
 			dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n");
@@ -297,7 +295,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev)
 	u32 ih_chicken;
 	int ret;
 	int i;
-	u32 tmp;
 
 	/* disable irqs */
 	ret = vega20_ih_toggle_interrupts(adev, false);
@@ -326,15 +323,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev)
 		}
 	}
 
-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL,
-			    CLIENT18_IS_STORM_CLIENT, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp);
-
-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp);
-
 	pci_set_master(adev->pdev);
 
 	/* enable interrupts */
@@ -379,11 +367,17 @@ static u32 vega20_ih_get_wptr(struct amdgpu_device *adev,
 	u32 wptr, tmp;
 	struct amdgpu_ih_regs *ih_regs;
 
-	wptr = le32_to_cpu(*ih->wptr_cpu);
-	ih_regs = &ih->ih_regs;
+	if (ih == &adev->irq.ih) {
+		/* Only ring0 supports writeback. On other rings fall back
+		 * to register-based code with overflow checking below.
+		 */
+		wptr = le32_to_cpu(*ih->wptr_cpu);
 
-	if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
-		goto out;
+		if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
+			goto out;
+	}
+
+	ih_regs = &ih->ih_regs;
 
 	/* Double check that the overflow wasn't already cleared. */
 	wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr);
@@ -473,15 +467,11 @@ static int vega20_ih_self_irq(struct amdgpu_device *adev,
 			      struct amdgpu_irq_src *source,
 			      struct amdgpu_iv_entry *entry)
 {
-	uint32_t wptr = cpu_to_le32(entry->src_data[0]);
-
 	switch (entry->ring_id) {
 	case 1:
-		*adev->irq.ih1.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih1_work);
 		break;
 	case 2:
-		*adev->irq.ih2.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih2_work);
 		break;
 	default: break;
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 33/35] drm/amdkfd: refine migration policy with xnack on
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (31 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 32/35] drm/amdgpu: enable retry fault wptr overflow Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 34/35] drm/amdkfd: add svm range validate timestamp Felix Kuehling
                   ` (3 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

With XNACK on, the GPU VM fault handler decides the best restore
location, migrates the range to that location, and updates the GPU
mapping to recover from the GPU VM fault.
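
The restore-location decision, condensed from
svm_range_best_restore_location in the diff below (the gpuid/gpuidx
lookup and error handling are elided):

  if (prange->preferred_loc == gpuid)
          return prange->preferred_loc;           /* faulting GPU is the preferred loc */
  if (test_bit(gpuidx, prange->bitmap_access))
          return gpuid;                           /* faulting GPU has ACCESS */
  if (test_bit(gpuidx, prange->bitmap_aip)) {     /* ACCESS_IN_PLACE */
          if (!prange->actual_loc)
                  return 0;                       /* data already in system memory */
          bo_adev = svm_range_get_adev_by_id(prange, prange->actual_loc);
          return amdgpu_xgmi_same_hive(adev, bo_adev) ?
                  prange->actual_loc : 0;         /* same XGMI hive: keep VRAM */
  }
  return -1;                                      /* faulting GPU has no access */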

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c   |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  25 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |   3 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  16 +++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 162 +++++++++++++++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |   3 +-
 7 files changed, 180 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index bd9de870f8f1..50a8f4db22f6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -3320,7 +3320,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
 	addr /= AMDGPU_GPU_PAGE_SIZE;
 
 	if (!amdgpu_noretry && is_compute_context &&
-		!svm_range_restore_pages(adev, pasid, addr)) {
+		!svm_range_restore_pages(adev, vm, pasid, addr)) {
 		amdgpu_bo_unref(&root);
 		return true;
 	}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index d33a4cc63495..2095417c7846 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -441,6 +441,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
  * svm_migrate_ram_to_vram - migrate svm range from system to device
  * @prange: range structure
  * @best_loc: the device to migrate to
+ * @mm: the process mm structure
  *
  * Context: Process context, caller hold mm->mmap_sem and prange->lock and take
  *          svms srcu read lock.
@@ -448,12 +449,12 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
  * Return:
  * 0 - OK, otherwise error code
  */
-int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc)
+int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
+			    struct mm_struct *mm)
 {
 	unsigned long addr, start, end;
 	struct vm_area_struct *vma;
 	struct amdgpu_device *adev;
-	struct mm_struct *mm;
 	int r = 0;
 
 	if (prange->actual_loc == best_loc) {
@@ -475,8 +476,6 @@ int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc)
 	start = prange->it_node.start << PAGE_SHIFT;
 	end = (prange->it_node.last + 1) << PAGE_SHIFT;
 
-	mm = current->mm;
-
 	for (addr = start; addr < end;) {
 		unsigned long next;
 
@@ -740,12 +739,26 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
 	list_for_each_entry(prange, &list, update_list) {
 		mutex_lock(&prange->mutex);
 		r = svm_migrate_vram_to_ram(prange, vma->vm_mm);
-		mutex_unlock(&prange->mutex);
 		if (r) {
 			pr_debug("failed %d migrate [0x%lx 0x%lx] to ram\n", r,
 				 prange->it_node.start, prange->it_node.last);
-			goto out_srcu;
+			goto next;
 		}
+
+		/* xnack off, svm_range_restore_work will update GPU mapping */
+		if (!p->xnack_enabled)
+			goto next;
+
+		/* xnack on, update mapping on GPUs with ACCESS_IN_PLACE */
+		r = svm_range_map_to_gpus(prange, true);
+		if (r)
+			pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx]\n",
+				 r, prange->svms, prange->it_node.start,
+				 prange->it_node.last);
+next:
+		mutex_unlock(&prange->mutex);
+		if (r)
+			break;
 	}
 
 out_srcu:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
index 95fd7b21791f..9949b55d3b6a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -37,7 +37,8 @@ enum MIGRATION_COPY_DIR {
 	FROM_VRAM_TO_RAM
 };
 
-int svm_migrate_ram_to_vram(struct svm_range *prange,  uint32_t best_loc);
+int svm_migrate_ram_to_vram(struct svm_range *prange,  uint32_t best_loc,
+			    struct mm_struct *mm);
 int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm);
 unsigned long
 svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index d5367e770b39..db94f963eb7e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -864,6 +864,9 @@ int kfd_process_gpuid_from_gpuidx(struct kfd_process *p,
 int kfd_process_gpuidx_from_gpuid(struct kfd_process *p, uint32_t gpu_id);
 int kfd_process_device_from_gpuidx(struct kfd_process *p,
 					uint32_t gpu_idx, struct kfd_dev **gpu);
+int kfd_process_gpuid_from_kgd(struct kfd_process *p,
+			       struct amdgpu_device *adev, uint32_t *gpuid,
+			       uint32_t *gpuidx);
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p);
 int kfd_process_restore_queues(struct kfd_process *p);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index f7a50a364d78..69970a3bc176 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1637,6 +1637,22 @@ int kfd_process_device_from_gpuidx(struct kfd_process *p,
 	return -EINVAL;
 }
 
+int
+kfd_process_gpuid_from_kgd(struct kfd_process *p, struct amdgpu_device *adev,
+			   uint32_t *gpuid, uint32_t *gpuidx)
+{
+	struct kgd_dev *kgd = (struct kgd_dev *)adev;
+	int i;
+
+	for (i = 0; i < p->n_pdds; i++)
+		if (p->pdds[i] && p->pdds[i]->dev->kgd == kgd) {
+			*gpuid = p->pdds[i]->dev->id;
+			*gpuidx = i;
+			return 0;
+		}
+	return -EINVAL;
+}
+
 static void evict_process_worker(struct work_struct *work)
 {
 	int ret;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 63b745a06740..8b57f5a471bd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1153,7 +1153,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	return r;
 }
 
-static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
+int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
 {
 	DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE);
 	struct kfd_process_device *pdd;
@@ -1170,9 +1170,29 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
 	else
 		bo_adev = NULL;
 
-	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
-		  MAX_GPU_INSTANCE);
 	p = container_of(prange->svms, struct kfd_process, svms);
+	if (p->xnack_enabled) {
+		bitmap_copy(bitmap, prange->bitmap_aip, MAX_GPU_INSTANCE);
+
+		/* If prefetch range to GPU, or GPU retry fault migrate range to
+		 * GPU, which has ACCESS attribute to the range, create mapping
+		 * on that GPU.
+		 */
+		if (prange->actual_loc) {
+			gpuidx = kfd_process_gpuidx_from_gpuid(p,
+							prange->actual_loc);
+			if (gpuidx < 0) {
+				WARN_ONCE(1, "failed get device by id 0x%x\n",
+					 prange->actual_loc);
+				return -EINVAL;
+			}
+			if (test_bit(gpuidx, prange->bitmap_access))
+				bitmap_set(bitmap, gpuidx, 1);
+		}
+	} else {
+		bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
+			  MAX_GPU_INSTANCE);
+	}
 
 	for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
 		r = kfd_process_device_from_gpuidx(p, gpuidx, &dev);
@@ -1678,16 +1698,77 @@ svm_range_from_addr(struct svm_range_list *svms, unsigned long addr)
 	return container_of(node, struct svm_range, it_node);
 }
 
+/* svm_range_best_restore_location - decide the best fault restore location
+ * @prange: svm range structure
+ * @adev: the GPU on which vm fault happened
+ *
+ * This is only called when xnack is on, to decide the best location to restore
+ * the range mapping after GPU vm fault. Caller uses the best location to do
+ * migration if actual loc is not best location, then update GPU page table
+ * mapping to the best location.
+ *
+ * If vm fault gpu is range preferred loc, the best_loc is preferred loc.
+ * If vm fault gpu idx is on range ACCESSIBLE bitmap, best_loc is vm fault gpu
+ * If vm fault gpu idx is on range ACCESSIBLE_IN_PLACE bitmap, then
+ *    if range actual loc is cpu, best_loc is cpu
+ *    if vm fault gpu is on xgmi same hive of range actual loc gpu, best_loc is
+ *    range actual loc.
+ * Otherwise, GPU no access, best_loc is -1.
+ *
+ * Return:
+ * -1 means vm fault GPU no access
+ * 0 for CPU or GPU id
+ */
+static int32_t
+svm_range_best_restore_location(struct svm_range *prange,
+				struct amdgpu_device *adev)
+{
+	struct amdgpu_device *bo_adev;
+	struct kfd_process *p;
+	int32_t gpuidx;
+	uint32_t gpuid;
+	int r;
+
+	p = container_of(prange->svms, struct kfd_process, svms);
+
+	r = kfd_process_gpuid_from_kgd(p, adev, &gpuid, &gpuidx);
+	if (r < 0) {
+		pr_debug("failed to get gpuid from kgd\n");
+		return -1;
+	}
+
+	if (prange->preferred_loc == gpuid)
+		return prange->preferred_loc;
+
+	if (test_bit(gpuidx, prange->bitmap_access))
+		return gpuid;
+
+	if (test_bit(gpuidx, prange->bitmap_aip)) {
+		if (!prange->actual_loc)
+			return 0;
+
+		bo_adev = svm_range_get_adev_by_id(prange, prange->actual_loc);
+		if (amdgpu_xgmi_same_hive(adev, bo_adev))
+			return prange->actual_loc;
+		else
+			return 0;
+	}
+
+	return -1;
+}
+
 int
-svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
-			uint64_t addr)
+svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm,
+			unsigned int pasid, uint64_t addr)
 {
-	int r = 0;
-	int srcu_idx;
+	struct amdgpu_device *bo_adev;
 	struct mm_struct *mm = NULL;
-	struct svm_range *prange;
 	struct svm_range_list *svms;
+	struct svm_range *prange;
 	struct kfd_process *p;
+	int32_t best_loc;
+	int srcu_idx;
+	int r = 0;
 
 	p = kfd_lookup_process_by_pasid(pasid);
 	if (!p) {
@@ -1706,20 +1787,20 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 		pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",
 			 svms, addr);
 		r = -EFAULT;
-		goto unlock_out;
+		goto out_srcu_unlock;
 	}
 
 	if (!atomic_read(&prange->invalid)) {
 		pr_debug("svms 0x%p [0x%lx %lx] already restored\n",
 			 svms, prange->it_node.start, prange->it_node.last);
-		goto unlock_out;
+		goto out_srcu_unlock;
 	}
 
 	mm = get_task_mm(p->lead_thread);
 	if (!mm) {
 		pr_debug("svms 0x%p failed to get mm\n", svms);
 		r = -ESRCH;
-		goto unlock_out;
+		goto out_srcu_unlock;
 	}
 
 	mmap_read_lock(mm);
@@ -1729,27 +1810,57 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 	 */
 	mutex_lock(&prange->mutex);
 
+	best_loc = svm_range_best_restore_location(prange, adev);
+	if (best_loc == -1) {
+		pr_debug("svms %p failed get best restore loc [0x%lx 0x%lx]\n",
+			 svms, prange->it_node.start, prange->it_node.last);
+		r = -EACCES;
+		goto out_mmput;
+	}
+
+	pr_debug("svms %p [0x%lx 0x%lx] best restore 0x%x, actual loc 0x%x\n",
+		 svms, prange->it_node.start, prange->it_node.last, best_loc,
+		 prange->actual_loc);
+
+	if (prange->actual_loc != best_loc) {
+		if (best_loc)
+			r = svm_migrate_ram_to_vram(prange, best_loc, mm);
+		else
+			r = svm_migrate_vram_to_ram(prange, mm);
+		if (r) {
+			pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
+				 r, svms, prange->it_node.start,
+				 prange->it_node.last);
+			goto out_mmput;
+		}
+	}
+
 	r = svm_range_validate(mm, prange);
 	if (r) {
-		pr_debug("failed %d to validate svms 0x%p [0x%lx 0x%lx]\n", r,
+		pr_debug("failed %d to validate svms %p [0x%lx 0x%lx]\n", r,
 			 svms, prange->it_node.start, prange->it_node.last);
-
-		goto mmput_out;
+		goto out_mmput;
 	}
 
-	pr_debug("restoring svms 0x%p [0x%lx %lx] mapping\n",
-		 svms, prange->it_node.start, prange->it_node.last);
+	if (prange->svm_bo && prange->mm_nodes)
+		bo_adev = amdgpu_ttm_adev(prange->svm_bo->bo->tbo.bdev);
+	else
+		bo_adev = NULL;
+
+	pr_debug("restoring svms 0x%p [0x%lx %lx] mapping, bo_adev is %s\n",
+		 svms, prange->it_node.start, prange->it_node.last,
+		 bo_adev ? "not NULL" : "NULL");
 
 	r = svm_range_map_to_gpus(prange, true);
 	if (r)
-		pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpu\n", r,
-			 svms, prange->it_node.start, prange->it_node.last);
+		pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpus\n",
+			 r, svms, prange->it_node.start, prange->it_node.last);
 
-mmput_out:
+out_mmput:
 	mutex_unlock(&prange->mutex);
 	mmap_read_unlock(mm);
 	mmput(mm);
-unlock_out:
+out_srcu_unlock:
 	srcu_read_unlock(&svms->srcu, srcu_idx);
 	kfd_unref_process(p);
 
@@ -1882,7 +1993,7 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 	return 0;
 }
 
-/* svm_range_best_location - decide the best actual location
+/* svm_range_best_prefetch_location - decide the best prefetch location
  * @prange: svm range structure
  *
  * For xnack off:
@@ -1904,7 +2015,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
  * Return:
  * 0 for CPU or GPU id
  */
-static uint32_t svm_range_best_location(struct svm_range *prange)
+static uint32_t
+svm_range_best_prefetch_location(struct svm_range *prange)
 {
 	DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE);
 	uint32_t best_loc = prange->prefetch_loc;
@@ -1980,7 +2092,7 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
 	int r = 0;
 
 	*migrated = false;
-	best_loc = svm_range_best_location(prange);
+	best_loc = svm_range_best_prefetch_location(prange);
 
 	if (best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED ||
 	    best_loc == prange->actual_loc)
@@ -2001,10 +2113,10 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
 		}
 
 		pr_debug("migrate from ram to vram\n");
-		r = svm_migrate_ram_to_vram(prange, best_loc);
+		r = svm_migrate_ram_to_vram(prange, best_loc, mm);
 	} else {
 		pr_debug("migrate from vram to ram\n");
-		r = svm_migrate_vram_to_ram(prange, current->mm);
+		r = svm_migrate_vram_to_ram(prange, mm);
 	}
 
 	if (!r)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 143573621956..0685eb04b87c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -125,8 +125,9 @@ int svm_range_vram_node_new(struct amdgpu_device *adev,
 void svm_range_vram_node_free(struct svm_range *prange);
 int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr,
 				   struct list_head *list);
-int svm_range_restore_pages(struct amdgpu_device *adev,
+int svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 			    unsigned int pasid, uint64_t addr);
 int svm_range_schedule_evict_svm_bo(struct amdgpu_amdkfd_fence *fence);
+int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm);
 
 #endif /* KFD_SVM_H_ */
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 34/35] drm/amdkfd: add svm range validate timestamp
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (32 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 33/35] drm/amdkfd: refine migration policy with xnack on Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  3:01 ` [PATCH 35/35] drm/amdkfd: multiple gpu migrate vram to vram Felix Kuehling
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

With XNACK on, add a validate timestamp in order to handle GPU VM
faults from multiple GPUs.

If a GPU retry fault requires migrating the range to the best restore
location, use the range validate timestamp to record the system time
after the range is restored and the GPU page table is updated.

Because multiple pages of the same range generate multiple retry
faults, define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING as the time period
during which pending retry faults may still arrive after a page table
update, and use it to skip duplicate retry faults for the same range.

If the difference between the system timestamp and the range's last
validate timestamp is bigger than AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING,
the retry fault came from another GPU, so continue with retry fault
recovery.
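
The duplicate-fault filter, condensed from the diff below (the real code
unlocks and jumps to the srcu/unref exit path instead of returning
directly):

  timestamp = ktime_to_us(ktime_get()) - prange->validate_timestamp;
  /* Retry faults for other pages of this range keep arriving for a short
   * while after the page table update; treat those as duplicates.
   */
  if (timestamp < AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING)
          return 0;       /* range already restored, nothing to do */
  /* otherwise the fault most likely comes from another GPU: recover it */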

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 8b57f5a471bd..65f20a72ddcb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -34,6 +34,11 @@
 
 #define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
 
+/* Long enough to ensure no retry fault comes after svm range is restored and
+ * page table is updated.
+ */
+#define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING	2000
+
 static void svm_range_evict_svm_bo_worker(struct work_struct *work);
 /**
  * svm_range_unlink - unlink svm_range from lists and interval tree
@@ -122,6 +127,7 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
 	INIT_LIST_HEAD(&prange->remove_list);
 	INIT_LIST_HEAD(&prange->svm_bo_list);
 	atomic_set(&prange->invalid, 0);
+	prange->validate_timestamp = ktime_to_us(ktime_get());
 	mutex_init(&prange->mutex);
 	spin_lock_init(&prange->svm_bo_lock);
 	svm_range_set_default_attributes(&prange->preferred_loc,
@@ -482,20 +488,28 @@ static int svm_range_validate_vram(struct svm_range *prange)
 static int
 svm_range_validate(struct mm_struct *mm, struct svm_range *prange)
 {
+	struct kfd_process *p;
 	int r;
 
 	pr_debug("svms 0x%p [0x%lx 0x%lx] actual loc 0x%x\n", prange->svms,
 		 prange->it_node.start, prange->it_node.last,
 		 prange->actual_loc);
 
+	p = container_of(prange->svms, struct kfd_process, svms);
+
 	if (!prange->actual_loc)
 		r = svm_range_validate_ram(mm, prange);
 	else
 		r = svm_range_validate_vram(prange);
 
-	pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d invalid %d\n", prange->svms,
-		 prange->it_node.start, prange->it_node.last,
-		 r, atomic_read(&prange->invalid));
+	if (!r) {
+		if (p->xnack_enabled)
+			atomic_set(&prange->invalid, 0);
+		prange->validate_timestamp = ktime_to_us(ktime_get());
+	}
+
+	pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last, r);
 
 	return r;
 }
@@ -1766,6 +1780,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	struct svm_range_list *svms;
 	struct svm_range *prange;
 	struct kfd_process *p;
+	uint64_t timestamp;
 	int32_t best_loc;
 	int srcu_idx;
 	int r = 0;
@@ -1790,7 +1805,11 @@ svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		goto out_srcu_unlock;
 	}
 
-	if (!atomic_read(&prange->invalid)) {
+	mutex_lock(&prange->mutex);
+	timestamp = ktime_to_us(ktime_get()) - prange->validate_timestamp;
+	mutex_unlock(&prange->mutex);
+	/* skip duplicate vm fault on different pages of same range */
+	if (timestamp < AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING) {
 		pr_debug("svms 0x%p [0x%lx %lx] already restored\n",
 			 svms, prange->it_node.start, prange->it_node.last);
 		goto out_srcu_unlock;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 0685eb04b87c..466ec5537bbb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -66,6 +66,7 @@ struct svm_range_bo {
  * @actual_loc: the actual location, 0 for CPU, or GPU id
  * @granularity:migration granularity, log2 num pages
  * @invalid:    not 0 means cpu page table is invalidated
+ * @validate_timestamp: system timestamp when range is validated
  * @bitmap_access: index bitmap of GPUs which can access the range
  * @bitmap_aip: index bitmap of GPUs which can access the range in place
  *
@@ -95,6 +96,7 @@ struct svm_range {
 	uint32_t			actual_loc;
 	uint8_t				granularity;
 	atomic_t			invalid;
+	uint64_t			validate_timestamp;
 	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
 	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
 };
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 35/35] drm/amdkfd: multiple gpu migrate vram to vram
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (33 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 34/35] drm/amdkfd: add svm range validate timestamp Felix Kuehling
@ 2021-01-07  3:01 ` Felix Kuehling
  2021-01-07  9:23 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Daniel Vetter
  2021-01-13 16:47 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Jerome Glisse
  36 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07  3:01 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

From: Philip Yang <Philip.Yang@amd.com>

If we prefetch a range to a GPU while its actual location is another
GPU, or a GPU retry fault restores pages of a range whose actual
location is another GPU, migrate the range from one GPU to the other.

Use system memory as a bridge because the SDMA engine may not be able
to access the other GPU's VRAM: use the source GPU's SDMA to migrate to
system memory, then use the destination GPU's SDMA to migrate from
system memory to the destination GPU.

Print out gpuid or gpuidx in debug messages.
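
The two-step bridge, condensed from svm_migrate_vram_to_vram in the
diff below:

  /* The source GPU's SDMA copies VRAM -> system memory ... */
  r = svm_migrate_vram_to_ram(prange, mm);
  if (r)
          return r;
  /* ... then the destination GPU's SDMA copies system memory -> its VRAM. */
  return svm_migrate_ram_to_vram(prange, best_loc, mm);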

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 57 +++++++++++++++++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 70 +++++++++++++++++-------
 3 files changed, 103 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 2095417c7846..6c644472cead 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -449,8 +449,9 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
  * Return:
  * 0 - OK, otherwise error code
  */
-int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
-			    struct mm_struct *mm)
+static int
+svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
+			struct mm_struct *mm)
 {
 	unsigned long addr, start, end;
 	struct vm_area_struct *vma;
@@ -470,8 +471,8 @@ int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
 		return -ENODEV;
 	}
 
-	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
-		 prange->it_node.start, prange->it_node.last);
+	pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last, best_loc);
 
 	start = prange->it_node.start << PAGE_SHIFT;
 	end = (prange->it_node.last + 1) << PAGE_SHIFT;
@@ -668,8 +669,9 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm)
 		return -ENODEV;
 	}
 
-	pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms,
-		 prange->it_node.start, prange->it_node.last);
+	pr_debug("svms 0x%p [0x%lx 0x%lx] from gpu 0x%x to ram\n", prange->svms,
+		 prange->it_node.start, prange->it_node.last,
+		 prange->actual_loc);
 
 	start = prange->it_node.start << PAGE_SHIFT;
 	end = (prange->it_node.last + 1) << PAGE_SHIFT;
@@ -696,6 +698,49 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm)
 	return r;
 }
 
+/**
+ * svm_migrate_vram_to_vram - migrate svm range from device to device
+ * @prange: range structure
+ * @best_loc: the device to migrate to
+ * @mm: process mm, use current->mm if NULL
+ *
+ * Context: Process context, caller hold mm->mmap_sem and prange->lock and take
+ *          svms srcu read lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+static int
+svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
+			 struct mm_struct *mm)
+{
+	int r;
+
+	/*
+	 * TODO: for both devices with PCIe large bar or on same xgmi hive, skip
+	 * system memory as migration bridge
+	 */
+
+	pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc);
+
+	r = svm_migrate_vram_to_ram(prange, mm);
+	if (r)
+		return r;
+
+	return svm_migrate_ram_to_vram(prange, best_loc, mm);
+}
+
+int
+svm_migrate_to_vram(struct svm_range *prange, uint32_t best_loc,
+		    struct mm_struct *mm)
+{
+	if  (!prange->actual_loc)
+		return svm_migrate_ram_to_vram(prange, best_loc, mm);
+	else
+		return svm_migrate_vram_to_vram(prange, best_loc, mm);
+
+}
+
 /**
  * svm_migrate_to_ram - CPU page fault handler
  * @vmf: CPU vm fault vma, address
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
index 9949b55d3b6a..bc680619d135 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -37,8 +37,8 @@ enum MIGRATION_COPY_DIR {
 	FROM_VRAM_TO_RAM
 };
 
-int svm_migrate_ram_to_vram(struct svm_range *prange,  uint32_t best_loc,
-			    struct mm_struct *mm);
+int svm_migrate_to_vram(struct svm_range *prange,  uint32_t best_loc,
+			struct mm_struct *mm);
 int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm);
 unsigned long
 svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 65f20a72ddcb..d029fce94db0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -288,8 +288,11 @@ static void svm_range_bo_unref(struct svm_range_bo *svm_bo)
 	kref_put(&svm_bo->kref, svm_range_bo_release);
 }
 
-static bool svm_range_validate_svm_bo(struct svm_range *prange)
+static bool
+svm_range_validate_svm_bo(struct amdgpu_device *adev, struct svm_range *prange)
 {
+	struct amdgpu_device *bo_adev;
+
 	spin_lock(&prange->svm_bo_lock);
 	if (!prange->svm_bo) {
 		spin_unlock(&prange->svm_bo_lock);
@@ -301,6 +304,22 @@ static bool svm_range_validate_svm_bo(struct svm_range *prange)
 		return true;
 	}
 	if (svm_bo_ref_unless_zero(prange->svm_bo)) {
+		/*
+		 * Migrate from GPU to GPU, remove range from source bo_adev
+		 * svm_bo range list, and return false to allocate svm_bo from
+		 * destination adev.
+		 */
+		bo_adev = amdgpu_ttm_adev(prange->svm_bo->bo->tbo.bdev);
+		if (bo_adev != adev) {
+			spin_unlock(&prange->svm_bo_lock);
+
+			spin_lock(&prange->svm_bo->list_lock);
+			list_del_init(&prange->svm_bo_list);
+			spin_unlock(&prange->svm_bo->list_lock);
+
+			svm_range_bo_unref(prange->svm_bo);
+			return false;
+		}
 		if (READ_ONCE(prange->svm_bo->evicting)) {
 			struct dma_fence *f;
 			struct svm_range_bo *svm_bo;
@@ -374,7 +393,7 @@ svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange,
 	pr_debug("pasid: %x svms 0x%p [0x%lx 0x%lx]\n", p->pasid, prange->svms,
 		 prange->it_node.start, prange->it_node.last);
 
-	if (svm_range_validate_svm_bo(prange))
+	if (svm_range_validate_svm_bo(adev, prange))
 		return 0;
 
 	svm_bo = svm_range_bo_new();
@@ -1209,6 +1228,7 @@ int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm)
 	}
 
 	for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
+		pr_debug("mapping to gpu idx 0x%x\n", gpuidx);
 		r = kfd_process_device_from_gpuidx(p, gpuidx, &dev);
 		if (r) {
 			pr_debug("failed to find device idx %d\n", gpuidx);
@@ -1843,7 +1863,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 
 	if (prange->actual_loc != best_loc) {
 		if (best_loc)
-			r = svm_migrate_ram_to_vram(prange, best_loc, mm);
+			r = svm_migrate_to_vram(prange, best_loc, mm);
 		else
 			r = svm_migrate_vram_to_ram(prange, mm);
 		if (r) {
@@ -2056,6 +2076,11 @@ svm_range_best_prefetch_location(struct svm_range *prange)
 		goto out;
 
 	bo_adev = svm_range_get_adev_by_id(prange, best_loc);
+	if (!bo_adev) {
+		WARN_ONCE(1, "failed to get device by id 0x%x\n", best_loc);
+		best_loc = 0;
+		goto out;
+	}
 	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
 		  MAX_GPU_INSTANCE);
 
@@ -2076,6 +2101,7 @@ svm_range_best_prefetch_location(struct svm_range *prange)
 	pr_debug("xnack %d svms 0x%p [0x%lx 0x%lx] best loc 0x%x\n",
 		 p->xnack_enabled, &p->svms, prange->it_node.start,
 		 prange->it_node.last, best_loc);
+
 	return best_loc;
 }
 
@@ -2117,29 +2143,33 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
 	    best_loc == prange->actual_loc)
 		return 0;
 
+	/*
+	 * Prefetch to GPU without the host access flag: set actual_loc to the gpu
+	 * here; validating on the gpu and mapping to gpus is handled afterwards.
+	 */
 	if (best_loc && !prange->actual_loc &&
-	    !(prange->flags & KFD_IOCTL_SVM_FLAG_HOST_ACCESS))
+	    !(prange->flags & KFD_IOCTL_SVM_FLAG_HOST_ACCESS)) {
+		prange->actual_loc = best_loc;
 		return 0;
+	}
 
-	if (best_loc) {
-		if (!prange->actual_loc && !prange->pages_addr) {
-			pr_debug("host access and prefetch to gpu\n");
-			r = svm_range_validate_ram(mm, prange);
-			if (r) {
-				pr_debug("failed %d to validate on ram\n", r);
-				return r;
-			}
-		}
-
-		pr_debug("migrate from ram to vram\n");
-		r = svm_migrate_ram_to_vram(prange, best_loc, mm);
-	} else {
-		pr_debug("migrate from vram to ram\n");
+	if (!best_loc) {
 		r = svm_migrate_vram_to_ram(prange, mm);
+		*migrated = !r;
+		return r;
+	}
+
+	if (!prange->actual_loc && !prange->pages_addr) {
+		pr_debug("host access and prefetch to gpu\n");
+		r = svm_range_validate_ram(mm, prange);
+		if (r) {
+			pr_debug("failed %d to validate on ram\n", r);
+			return r;
+		}
 	}
 
-	if (!r)
-		*migrated = true;
+	r = svm_migrate_to_vram(prange, best_loc, mm);
+	*migrated = !r;
 
 	return r;
 }
-- 
2.29.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (34 preceding siblings ...)
  2021-01-07  3:01 ` [PATCH 35/35] drm/amdkfd: multiple gpu migrate vram to vram Felix Kuehling
@ 2021-01-07  9:23 ` Daniel Vetter
  2021-01-07 16:25   ` Felix Kuehling
  2021-01-13 16:47 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Jerome Glisse
  36 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-07  9:23 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: alex.sierra, philip.yang, dri-devel, amd-gfx

On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> This is the first version of our HMM based shared virtual memory manager
> for KFD. There are still a number of known issues that we're working through
> (see below). This will likely lead to some pretty significant changes in
> MMU notifier handling and locking on the migration code paths. So don't
> get hung up on those details yet.
> 
> But I think this is a good time to start getting feedback. We're pretty
> confident about the ioctl API, which is both simple and extensible for the
> future. (see patches 4,16) The user mode side of the API can be found here:
> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> 
> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> and some retry IRQ handling changes (32).
> 
> 
> Known issues:
> * won't work with IOMMU enabled, we need to dma_map all pages properly
> * still working on some race conditions and random bugs
> * performance is not great yet

Still catching up, but I think there's another one for your list:

 * hmm gpu context preempt vs page fault handling. I've had a short
   discussion about this one with Christian before the holidays, and also
   some private chats with Jerome. It's nasty since no easy fix, much less
   a good idea what's the best approach here.

I'll try to look at this more in-depth when I'm catching up on mails.
-Daniel

> 
> Alex Sierra (12):
>   drm/amdgpu: replace per_device_list by array
>   drm/amdkfd: helper to convert gpu id and idx
>   drm/amdkfd: add xnack enabled flag to kfd_process
>   drm/amdkfd: add ioctl to configure and query xnack retries
>   drm/amdkfd: invalidate tables on page retry fault
>   drm/amdkfd: page table restore through svm API
>   drm/amdkfd: SVM API call to restore page tables
>   drm/amdkfd: add svm_bo reference for eviction fence
>   drm/amdgpu: add param bit flag to create SVM BOs
>   drm/amdkfd: add svm_bo eviction mechanism support
>   drm/amdgpu: svm bo enable_signal call condition
>   drm/amdgpu: add svm_bo eviction to enable_signal cb
> 
> Philip Yang (23):
>   drm/amdkfd: select kernel DEVICE_PRIVATE option
>   drm/amdkfd: add svm ioctl API
>   drm/amdkfd: Add SVM API support capability bits
>   drm/amdkfd: register svm range
>   drm/amdkfd: add svm ioctl GET_ATTR op
>   drm/amdgpu: add common HMM get pages function
>   drm/amdkfd: validate svm range system memory
>   drm/amdkfd: register overlap system memory range
>   drm/amdkfd: deregister svm range
>   drm/amdgpu: export vm update mapping interface
>   drm/amdkfd: map svm range to GPUs
>   drm/amdkfd: svm range eviction and restore
>   drm/amdkfd: register HMM device private zone
>   drm/amdkfd: validate vram svm range from TTM
>   drm/amdkfd: support xgmi same hive mapping
>   drm/amdkfd: copy memory through gart table
>   drm/amdkfd: HMM migrate ram to vram
>   drm/amdkfd: HMM migrate vram to ram
>   drm/amdgpu: reserve fence slot to update page table
>   drm/amdgpu: enable retry fault wptr overflow
>   drm/amdkfd: refine migration policy with xnack on
>   drm/amdkfd: add svm range validate timestamp
>   drm/amdkfd: multiple gpu migrate vram to vram
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
>  include/uapi/linux/kfd_ioctl.h                |  169 +-
>  26 files changed, 4296 insertions(+), 291 deletions(-)
>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> 
> -- 
> 2.29.2
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 08/35] drm/amdgpu: add common HMM get pages function
  2021-01-07  3:01 ` [PATCH 08/35] drm/amdgpu: add common HMM get pages function Felix Kuehling
@ 2021-01-07 10:53   ` Christian König
  0 siblings, 0 replies; 84+ messages in thread
From: Christian König @ 2021-01-07 10:53 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

On 07.01.21 04:01, Felix Kuehling wrote:
> From: Philip Yang <Philip.Yang@amd.com>
>
> Move the HMM get pages function from amdgpu_ttm to amdgpu_mn. This
> common function will be used by new svm APIs.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 83 +++++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h  |  7 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 76 +++-------------------
>   3 files changed, 100 insertions(+), 66 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 828b5167ff12..997da4237a10 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -155,3 +155,86 @@ void amdgpu_mn_unregister(struct amdgpu_bo *bo)
>   	mmu_interval_notifier_remove(&bo->notifier);
>   	bo->notifier.mm = NULL;
>   }
> +
> +int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
> +			       struct mm_struct *mm, struct page **pages,
> +			       uint64_t start, uint64_t npages,
> +			       struct hmm_range **phmm_range, bool readonly,
> +			       bool mmap_locked)
> +{
> +	struct hmm_range *hmm_range;
> +	unsigned long timeout;
> +	unsigned long i;
> +	unsigned long *pfns;
> +	int r = 0;
> +
> +	hmm_range = kzalloc(sizeof(*hmm_range), GFP_KERNEL);
> +	if (unlikely(!hmm_range))
> +		return -ENOMEM;
> +
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (unlikely(!pfns)) {
> +		r = -ENOMEM;
> +		goto out_free_range;
> +	}
> +
> +	hmm_range->notifier = notifier;
> +	hmm_range->default_flags = HMM_PFN_REQ_FAULT;
> +	if (!readonly)
> +		hmm_range->default_flags |= HMM_PFN_REQ_WRITE;
> +	hmm_range->hmm_pfns = pfns;
> +	hmm_range->start = start;
> +	hmm_range->end = start + npages * PAGE_SIZE;
> +	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +
> +retry:
> +	hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
> +
> +	if (likely(!mmap_locked))
> +		mmap_read_lock(mm);
> +
> +	r = hmm_range_fault(hmm_range);
> +
> +	if (likely(!mmap_locked))
> +		mmap_read_unlock(mm);
> +	if (unlikely(r)) {
> +		/*
> +		 * FIXME: This timeout should encompass the retry from
> +		 * mmu_interval_read_retry() as well.
> +		 */
> +		if (r == -EBUSY && !time_after(jiffies, timeout))
> +			goto retry;
> +		goto out_free_pfns;
> +	}
> +
> +	/*
> +	 * Due to default_flags, all pages are HMM_PFN_VALID or
> +	 * hmm_range_fault() fails. FIXME: The pages cannot be touched outside
> +	 * the notifier_lock, and mmu_interval_read_retry() must be done first.
> +	 */
> +	for (i = 0; pages && i < npages; i++)
> +		pages[i] = hmm_pfn_to_page(pfns[i]);
> +
> +	*phmm_range = hmm_range;
> +
> +	return 0;
> +
> +out_free_pfns:
> +	kvfree(pfns);
> +out_free_range:
> +	kfree(hmm_range);
> +
> +	return r;
> +}
> +
> +int amdgpu_hmm_range_get_pages_done(struct hmm_range *hmm_range)
> +{
> +	int r;
> +
> +	r = mmu_interval_read_retry(hmm_range->notifier,
> +				    hmm_range->notifier_seq);
> +	kvfree(hmm_range->hmm_pfns);
> +	kfree(hmm_range);
> +
> +	return r;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> index a292238f75eb..7f7d37a457c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> @@ -30,6 +30,13 @@
>   #include <linux/workqueue.h>
>   #include <linux/interval_tree.h>
>   
> +int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
> +			       struct mm_struct *mm, struct page **pages,
> +			       uint64_t start, uint64_t npages,
> +			       struct hmm_range **phmm_range, bool readonly,
> +			       bool mmap_locked);
> +int amdgpu_hmm_range_get_pages_done(struct hmm_range *hmm_range);
> +
>   #if defined(CONFIG_HMM_MIRROR)
>   int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr);
>   void amdgpu_mn_unregister(struct amdgpu_bo *bo);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index aaad9e304ad9..f423f42cb9b5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -32,7 +32,6 @@
>   
>   #include <linux/dma-mapping.h>
>   #include <linux/iommu.h>
> -#include <linux/hmm.h>
>   #include <linux/pagemap.h>
>   #include <linux/sched/task.h>
>   #include <linux/sched/mm.h>
> @@ -843,10 +842,8 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	struct amdgpu_ttm_tt *gtt = (void *)ttm;
>   	unsigned long start = gtt->userptr;
>   	struct vm_area_struct *vma;
> -	struct hmm_range *range;
> -	unsigned long timeout;
>   	struct mm_struct *mm;
> -	unsigned long i;
> +	bool readonly;
>   	int r = 0;
>   
>   	mm = bo->notifier.mm;
> @@ -862,76 +859,26 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
>   		return -ESRCH;
>   
> -	range = kzalloc(sizeof(*range), GFP_KERNEL);
> -	if (unlikely(!range)) {
> -		r = -ENOMEM;
> -		goto out;
> -	}
> -	range->notifier = &bo->notifier;
> -	range->start = bo->notifier.interval_tree.start;
> -	range->end = bo->notifier.interval_tree.last + 1;
> -	range->default_flags = HMM_PFN_REQ_FAULT;
> -	if (!amdgpu_ttm_tt_is_readonly(ttm))
> -		range->default_flags |= HMM_PFN_REQ_WRITE;
> -
> -	range->hmm_pfns = kvmalloc_array(ttm->num_pages,
> -					 sizeof(*range->hmm_pfns), GFP_KERNEL);
> -	if (unlikely(!range->hmm_pfns)) {
> -		r = -ENOMEM;
> -		goto out_free_ranges;
> -	}
> -
>   	mmap_read_lock(mm);
>   	vma = find_vma(mm, start);
> +	mmap_read_unlock(mm);
>   	if (unlikely(!vma || start < vma->vm_start)) {
>   		r = -EFAULT;
> -		goto out_unlock;
> +		goto out_putmm;
>   	}
>   	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
>   		vma->vm_file)) {
>   		r = -EPERM;
> -		goto out_unlock;
> +		goto out_putmm;
>   	}
> -	mmap_read_unlock(mm);
> -	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> -
> -retry:
> -	range->notifier_seq = mmu_interval_read_begin(&bo->notifier);
>   
> -	mmap_read_lock(mm);
> -	r = hmm_range_fault(range);
> -	mmap_read_unlock(mm);
> -	if (unlikely(r)) {
> -		/*
> -		 * FIXME: This timeout should encompass the retry from
> -		 * mmu_interval_read_retry() as well.
> -		 */
> -		if (r == -EBUSY && !time_after(jiffies, timeout))
> -			goto retry;
> -		goto out_free_pfns;
> -	}
> -
> -	/*
> -	 * Due to default_flags, all pages are HMM_PFN_VALID or
> -	 * hmm_range_fault() fails. FIXME: The pages cannot be touched outside
> -	 * the notifier_lock, and mmu_interval_read_retry() must be done first.
> -	 */
> -	for (i = 0; i < ttm->num_pages; i++)
> -		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
> -
> -	gtt->range = range;
> +	readonly = amdgpu_ttm_tt_is_readonly(ttm);
> +	r = amdgpu_hmm_range_get_pages(&bo->notifier, mm, pages, start,
> +				       ttm->num_pages, &gtt->range, readonly,
> +				       false);
> +out_putmm:
>   	mmput(mm);
>   
> -	return 0;
> -
> -out_unlock:
> -	mmap_read_unlock(mm);
> -out_free_pfns:
> -	kvfree(range->hmm_pfns);
> -out_free_ranges:
> -	kfree(range);
> -out:
> -	mmput(mm);
>   	return r;
>   }
>   
> @@ -960,10 +907,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
>   		 * FIXME: Must always hold notifier_lock for this, and must
>   		 * not ignore the return code.
>   		 */
> -		r = mmu_interval_read_retry(gtt->range->notifier,
> -					 gtt->range->notifier_seq);
> -		kvfree(gtt->range->hmm_pfns);
> -		kfree(gtt->range);
> +		r = amdgpu_hmm_range_get_pages_done(gtt->range);
>   		gtt->range = NULL;
>   	}
>   
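
For reference, callers of the two new helpers end up following this
pattern (a minimal sketch; everything except amdgpu_hmm_range_get_pages()
and amdgpu_hmm_range_get_pages_done() is a placeholder):

static int example_get_range_pages(struct mmu_interval_notifier *notifier,
				   struct mm_struct *mm, unsigned long start,
				   unsigned long npages, struct page **pages)
{
	struct hmm_range *range;
	int r;

	r = amdgpu_hmm_range_get_pages(notifier, mm, pages, start, npages,
				       &range, false /* readonly */,
				       false /* mmap_locked */);
	if (r)
		return r;

	/* ... use pages[] under the notifier lock ... */

	/* Non-zero means an invalidation raced with us and we must retry. */
	if (amdgpu_hmm_range_get_pages_done(range))
		r = -EAGAIN;

	return r;
}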

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 12/35] drm/amdgpu: export vm update mapping interface
  2021-01-07  3:01 ` [PATCH 12/35] drm/amdgpu: export vm update mapping interface Felix Kuehling
@ 2021-01-07 10:54   ` Christian König
  0 siblings, 0 replies; 84+ messages in thread
From: Christian König @ 2021-01-07 10:54 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

On 07.01.21 04:01, Felix Kuehling wrote:
> From: Philip Yang <Philip.Yang@amd.com>
>
> It will be used by kfd to map svm ranges to GPUs. Because an svm range does
> not have an amdgpu_bo and bo_va, it cannot use the amdgpu_bo_update interface
> and uses the amdgpu vm update interface directly.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++++++++---------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 10 ++++++++++
>   2 files changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index fdbe7d4e8b8b..9c557e8bf0e5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1589,15 +1589,14 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>    * Returns:
>    * 0 for success, -EINVAL for failure.
>    */
> -static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> -				       struct amdgpu_device *bo_adev,
> -				       struct amdgpu_vm *vm, bool immediate,
> -				       bool unlocked, struct dma_resv *resv,
> -				       uint64_t start, uint64_t last,
> -				       uint64_t flags, uint64_t offset,
> -				       struct drm_mm_node *nodes,
> -				       dma_addr_t *pages_addr,
> -				       struct dma_fence **fence)
> +int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> +				struct amdgpu_device *bo_adev,
> +				struct amdgpu_vm *vm, bool immediate,
> +				bool unlocked, struct dma_resv *resv,
> +				uint64_t start, uint64_t last, uint64_t flags,
> +				uint64_t offset, struct drm_mm_node *nodes,
> +				dma_addr_t *pages_addr,
> +				struct dma_fence **fence)
>   {
>   	struct amdgpu_vm_update_params params;
>   	enum amdgpu_sync_mode sync_mode;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 2bf4ef5fb3e1..73ca630520fd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -366,6 +366,8 @@ struct amdgpu_vm_manager {
>   	spinlock_t				pasid_lock;
>   };
>   
> +struct amdgpu_bo_va_mapping;
> +
>   #define amdgpu_vm_copy_pte(adev, ib, pe, src, count) ((adev)->vm_manager.vm_pte_funcs->copy_pte((ib), (pe), (src), (count)))
>   #define amdgpu_vm_write_pte(adev, ib, pe, value, count, incr) ((adev)->vm_manager.vm_pte_funcs->write_pte((ib), (pe), (value), (count), (incr)))
>   #define amdgpu_vm_set_pte_pde(adev, ib, pe, addr, count, incr, flags) ((adev)->vm_manager.vm_pte_funcs->set_pte_pde((ib), (pe), (addr), (count), (incr), (flags)))
> @@ -397,6 +399,14 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>   			  struct dma_fence **fence);
>   int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
>   			   struct amdgpu_vm *vm);
> +int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> +				struct amdgpu_device *bo_adev,
> +				struct amdgpu_vm *vm, bool immediate,
> +				bool unlocked, struct dma_resv *resv,
> +				uint64_t start, uint64_t last, uint64_t flags,
> +				uint64_t offset, struct drm_mm_node *nodes,
> +				dma_addr_t *pages_addr,
> +				struct dma_fence **fence);
>   int amdgpu_vm_bo_update(struct amdgpu_device *adev,
>   			struct amdgpu_bo_va *bo_va,
>   			bool clear);
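
For context, the SVM code can now call the exported function directly,
roughly like this (sketch only; example_map_svm_range, pte_flags and
pages_addr are placeholders for whatever the caller has at hand):

static int example_map_svm_range(struct amdgpu_device *adev,
				 struct amdgpu_device *bo_adev,
				 struct amdgpu_vm *vm,
				 struct svm_range *prange,
				 uint64_t pte_flags, dma_addr_t *pages_addr,
				 struct dma_fence **fence)
{
	/* No amdgpu_bo/bo_va for an svm range, so update the VM directly. */
	return amdgpu_vm_bo_update_mapping(adev, bo_adev, vm,
					   false /* immediate */,
					   false /* unlocked */,
					   NULL /* resv */,
					   prange->it_node.start,
					   prange->it_node.last,
					   pte_flags, 0 /* offset */,
					   NULL /* nodes */,
					   pages_addr, fence);
}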

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition
  2021-01-07  3:01 ` [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition Felix Kuehling
@ 2021-01-07 10:56   ` Christian König
  2021-01-07 16:16     ` Felix Kuehling
  0 siblings, 1 reply; 84+ messages in thread
From: Christian König @ 2021-01-07 10:56 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

On 07.01.21 04:01, Felix Kuehling wrote:
> From: Alex Sierra <alex.sierra@amd.com>
>
> [why]
> To support svm bo eviction mechanism.
>
> [how]
> If the BO created has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set,
> enable_signal callback will be called inside amdgpu_evict_flags.
> This also causes gutting of the BO by removing all placements,
> so that TTM won't actually do an eviction. Instead it will discard
> the memory held by the BO. This is needed for HMM migration to user
> mode system memory pages.

I don't think that this will work. What exactly are you doing here?

As Daniel pointed out HMM and dma_fences are fundamentally incompatible.

Christian.

>
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index f423f42cb9b5..62d4da95d22d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,
>   	}
>   
>   	abo = ttm_to_amdgpu_bo(bo);
> +	if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) {
> +		struct dma_fence *fence;
> +		struct dma_resv *resv = &bo->base._resv;
> +
> +		rcu_read_lock();
> +		fence = rcu_dereference(resv->fence_excl);
> +		if (fence && !fence->ops->signaled)
> +			dma_fence_enable_sw_signaling(fence);
> +
> +		placement->num_placement = 0;
> +		placement->num_busy_placement = 0;
> +		rcu_read_unlock();
> +		return;
> +	}
>   	switch (bo->mem.mem_type) {
>   	case AMDGPU_PL_GDS:
>   	case AMDGPU_PL_GWS:

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 31/35] drm/amdgpu: reserve fence slot to update page table
  2021-01-07  3:01 ` [PATCH 31/35] drm/amdgpu: reserve fence slot to update page table Felix Kuehling
@ 2021-01-07 10:57   ` Christian König
  0 siblings, 0 replies; 84+ messages in thread
From: Christian König @ 2021-01-07 10:57 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

On 07.01.21 04:01, Felix Kuehling wrote:
> From: Philip Yang <Philip.Yang@amd.com>
>
> We forgot to reserve a fence slot before using sdma to update the page table,
> which causes the kernel BUG backtrace below when handling a vm retry fault
> while the application is exiting.
>
> [  133.048143] kernel BUG at /home/yangp/git/compute_staging/kernel/drivers/dma-buf/dma-resv.c:281!
> [  133.048487] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]
> [  133.048506] RIP: 0010:dma_resv_add_shared_fence+0x204/0x280
> [  133.048672]  amdgpu_vm_sdma_commit+0x134/0x220 [amdgpu]
> [  133.048788]  amdgpu_vm_bo_update_range+0x220/0x250 [amdgpu]
> [  133.048905]  amdgpu_vm_handle_fault+0x202/0x370 [amdgpu]
> [  133.049031]  gmc_v9_0_process_interrupt+0x1ab/0x310 [amdgpu]
> [  133.049165]  ? kgd2kfd_interrupt+0x9a/0x180 [amdgpu]
> [  133.049289]  ? amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
> [  133.049408]  amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
> [  133.049534]  amdgpu_ih_process+0x9b/0x1c0 [amdgpu]
> [  133.049657]  amdgpu_irq_handle_ih1+0x21/0x60 [amdgpu]
> [  133.049669]  process_one_work+0x29f/0x640
> [  133.049678]  worker_thread+0x39/0x3f0
> [  133.049685]  ? process_one_work+0x640/0x640
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index abdd4e7b4c3b..bd9de870f8f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -3301,7 +3301,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
>   	struct amdgpu_bo *root;
>   	uint64_t value, flags;
>   	struct amdgpu_vm *vm;
> -	long r;
> +	int r;
>   	bool is_compute_context = false;
>   
>   	spin_lock(&adev->vm_manager.pasid_lock);
> @@ -3359,6 +3359,12 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
>   		value = 0;
>   	}
>   
> +	r = dma_resv_reserve_shared(root->tbo.base.resv, 1);
> +	if (r) {
> +		pr_debug("failed %d to reserve fence slot\n", r);
> +		goto error_unlock;
> +	}
> +
>   	r = amdgpu_vm_bo_update_mapping(adev, adev, vm, true, false, NULL, addr,
>   					addr, flags, value, NULL, NULL,
>   					NULL);
> @@ -3370,7 +3376,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
>   error_unlock:
>   	amdgpu_bo_unreserve(root);
>   	if (r < 0)
> -		DRM_ERROR("Can't handle page fault (%ld)\n", r);
> +		DRM_ERROR("Can't handle page fault (%d)\n", r);
>   
>   error_unref:
>   	amdgpu_bo_unref(&root);
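
The rule the BUG_ON in dma-resv.c enforces is simply "reserve before
add"; as a sketch (the helper name is invented):

static int example_reserve_then_add(struct amdgpu_bo *bo,
				    struct dma_fence *fence)
{
	int r;

	/* Reserve a shared fence slot while the BO is still reserved ... */
	r = dma_resv_reserve_shared(bo->tbo.base.resv, 1);
	if (r)
		return r;

	/* ... so the later add in the SDMA commit path cannot trip the BUG. */
	dma_resv_add_shared_fence(bo->tbo.base.resv, fence);
	return 0;
}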

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 32/35] drm/amdgpu: enable retry fault wptr overflow
  2021-01-07  3:01 ` [PATCH 32/35] drm/amdgpu: enable retry fault wptr overflow Felix Kuehling
@ 2021-01-07 11:01   ` Christian König
  0 siblings, 0 replies; 84+ messages in thread
From: Christian König @ 2021-01-07 11:01 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx, dri-devel; +Cc: alex.sierra, Philip Yang

On 07.01.21 04:01, Felix Kuehling wrote:
> From: Philip Yang <Philip.Yang@amd.com>
>
> If xnack is on, VM retry fault interrupts are sent to IH ring1, and ring1
> fills up quickly. IH then cannot receive other interrupts, which causes a
> deadlock if we migrate a buffer using sdma and wait for sdma completion
> while handling a retry fault.
>
> Remove VMC from the IH storm client list and enable ring1 write pointer
> overflow. IH will then drop retry fault interrupts and still be able to
> receive other interrupts while the driver is handling a retry fault.
>
> The IH ring1 write pointer is not written back to memory by IH, and the
> ring1 write pointer recorded by the self-irq is not updated, so always
> read the latest ring1 write pointer from the register.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 32 +++++++++-----------------
>   drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 32 +++++++++-----------------
>   2 files changed, 22 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> index 88626d83e07b..ca8efa5c6978 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> @@ -220,10 +220,8 @@ static int vega10_ih_enable_ring(struct amdgpu_device *adev,
>   	tmp = vega10_ih_rb_cntl(ih, tmp);
>   	if (ih == &adev->irq.ih)
>   		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled);
> -	if (ih == &adev->irq.ih1) {
> -		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0);
> +	if (ih == &adev->irq.ih1)
>   		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1);
> -	}
>   	if (amdgpu_sriov_vf(adev)) {
>   		if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) {
>   			dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n");
> @@ -265,7 +263,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev)
>   	u32 ih_chicken;
>   	int ret;
>   	int i;
> -	u32 tmp;
>   
>   	/* disable irqs */
>   	ret = vega10_ih_toggle_interrupts(adev, false);
> @@ -291,15 +288,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev)
>   		}
>   	}
>   
> -	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL);
> -	tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL,
> -			    CLIENT18_IS_STORM_CLIENT, 1);
> -	WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp);
> -
> -	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL);
> -	tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1);
> -	WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp);
> -
>   	pci_set_master(adev->pdev);
>   
>   	/* enable interrupts */
> @@ -345,11 +333,17 @@ static u32 vega10_ih_get_wptr(struct amdgpu_device *adev,
>   	u32 wptr, tmp;
>   	struct amdgpu_ih_regs *ih_regs;
>   
> -	wptr = le32_to_cpu(*ih->wptr_cpu);
> -	ih_regs = &ih->ih_regs;
> +	if (ih == &adev->irq.ih) {
> +		/* Only ring0 supports writeback. On other rings fall back
> +		 * to register-based code with overflow checking below.
> +		 */
> +		wptr = le32_to_cpu(*ih->wptr_cpu);
>   
> -	if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
> -		goto out;
> +		if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
> +			goto out;
> +	}
> +
> +	ih_regs = &ih->ih_regs;
>   
>   	/* Double check that the overflow wasn't already cleared. */
>   	wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr);
> @@ -440,15 +434,11 @@ static int vega10_ih_self_irq(struct amdgpu_device *adev,
>   			      struct amdgpu_irq_src *source,
>   			      struct amdgpu_iv_entry *entry)
>   {
> -	uint32_t wptr = cpu_to_le32(entry->src_data[0]);
> -
>   	switch (entry->ring_id) {
>   	case 1:
> -		*adev->irq.ih1.wptr_cpu = wptr;
>   		schedule_work(&adev->irq.ih1_work);
>   		break;
>   	case 2:
> -		*adev->irq.ih2.wptr_cpu = wptr;
>   		schedule_work(&adev->irq.ih2_work);
>   		break;
>   	default: break;
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> index 42032ca380cc..60d1bd51781e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> @@ -220,10 +220,8 @@ static int vega20_ih_enable_ring(struct amdgpu_device *adev,
>   	tmp = vega20_ih_rb_cntl(ih, tmp);
>   	if (ih == &adev->irq.ih)
>   		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled);
> -	if (ih == &adev->irq.ih1) {
> -		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0);
> +	if (ih == &adev->irq.ih1)
>   		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1);
> -	}
>   	if (amdgpu_sriov_vf(adev)) {
>   		if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) {
>   			dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n");
> @@ -297,7 +295,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev)
>   	u32 ih_chicken;
>   	int ret;
>   	int i;
> -	u32 tmp;
>   
>   	/* disable irqs */
>   	ret = vega20_ih_toggle_interrupts(adev, false);
> @@ -326,15 +323,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev)
>   		}
>   	}
>   
> -	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL);
> -	tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL,
> -			    CLIENT18_IS_STORM_CLIENT, 1);
> -	WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp);
> -
> -	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL);
> -	tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1);
> -	WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp);
> -
>   	pci_set_master(adev->pdev);
>   
>   	/* enable interrupts */
> @@ -379,11 +367,17 @@ static u32 vega20_ih_get_wptr(struct amdgpu_device *adev,
>   	u32 wptr, tmp;
>   	struct amdgpu_ih_regs *ih_regs;
>   
> -	wptr = le32_to_cpu(*ih->wptr_cpu);
> -	ih_regs = &ih->ih_regs;
> +	if (ih == &adev->irq.ih) {
> +		/* Only ring0 supports writeback. On other rings fall back
> +		 * to register-based code with overflow checking below.
> +		 */
> +		wptr = le32_to_cpu(*ih->wptr_cpu);
>   
> -	if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
> -		goto out;
> +		if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
> +			goto out;
> +	}
> +
> +	ih_regs = &ih->ih_regs;
>   
>   	/* Double check that the overflow wasn't already cleared. */
>   	wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr);
> @@ -473,15 +467,11 @@ static int vega20_ih_self_irq(struct amdgpu_device *adev,
>   			      struct amdgpu_irq_src *source,
>   			      struct amdgpu_iv_entry *entry)
>   {
> -	uint32_t wptr = cpu_to_le32(entry->src_data[0]);
> -
>   	switch (entry->ring_id) {
>   	case 1:
> -		*adev->irq.ih1.wptr_cpu = wptr;
>   		schedule_work(&adev->irq.ih1_work);
>   		break;
>   	case 2:
> -		*adev->irq.ih2.wptr_cpu = wptr;
>   		schedule_work(&adev->irq.ih2_work);
>   		break;
>   	default: break;

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition
  2021-01-07 10:56   ` Christian König
@ 2021-01-07 16:16     ` Felix Kuehling
  2021-01-07 16:28       ` Christian König
  0 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07 16:16 UTC (permalink / raw)
  To: christian.koenig, amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang


On 2021-01-07 5:56 a.m., Christian König wrote:

> On 07.01.21 04:01, Felix Kuehling wrote:
>> From: Alex Sierra <alex.sierra@amd.com>
>>
>> [why]
>> To support svm bo eviction mechanism.
>>
>> [how]
>> If the BO created has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set,
>> enable_signal callback will be called inside amdgpu_evict_flags.
>> This also causes gutting of the BO by removing all placements,
>> so that TTM won't actually do an eviction. Instead it will discard
>> the memory held by the BO. This is needed for HMM migration to user
>> mode system memory pages.
>
> I don't think that this will work. What exactly are you doing here?
We discussed this a while ago when we talked about pipelined gutting.
And you actually helped us out with a fix for that
(https://patchwork.freedesktop.org/patch/379039/).

SVM BOs are BOs in VRAM containing data for HMM ranges. When such a BO
is evicted by TTM, we do an HMM migration of the data to system memory
(triggered by kgd2kfd_schedule_evict_and_restore_process in patch 30).
That means we don't need TTM to copy the BO contents to GTT any more.
Instead we want to use pipelined gutting to allow the VRAM to be freed
once the fence signals that the HMM migration is done (the
dma_fence_signal call near the end of svm_range_evict_svm_bo_worker in
patch 28).
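
To make that concrete, the worker boils down to something like this (a
stripped-down sketch from memory, not a verbatim copy of patch 28; the
svm_bo field names eviction_work, range_list, eviction_mm and
eviction_fence are guesses, and locking and error handling are omitted):

static void svm_range_evict_svm_bo_worker(struct work_struct *work)
{
	struct svm_range_bo *svm_bo;
	struct svm_range *prange;

	svm_bo = container_of(work, struct svm_range_bo, eviction_work);

	/* HMM-migrate every range backed by this BO out of VRAM ... */
	list_for_each_entry(prange, &svm_bo->range_list, svm_bo_list)
		svm_migrate_vram_to_ram(prange, svm_bo->eviction_mm);

	/* ... then signal the eviction fence so TTM can free the VRAM. */
	dma_fence_signal(&svm_bo->eviction_fence->base);
	svm_range_bo_unref(svm_bo);
}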

Regards,
  Felix


>
> As Daniel pointed out HMM and dma_fences are fundamentally incompatible.
>
> Christian.
>
>>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index f423f42cb9b5..62d4da95d22d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct
>> ttm_buffer_object *bo,
>>       }
>>         abo = ttm_to_amdgpu_bo(bo);
>> +    if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) {
>> +        struct dma_fence *fence;
>> +        struct dma_resv *resv = &bo->base._resv;
>> +
>> +        rcu_read_lock();
>> +        fence = rcu_dereference(resv->fence_excl);
>> +        if (fence && !fence->ops->signaled)
>> +            dma_fence_enable_sw_signaling(fence);
>> +
>> +        placement->num_placement = 0;
>> +        placement->num_busy_placement = 0;
>> +        rcu_read_unlock();
>> +        return;
>> +    }
>>       switch (bo->mem.mem_type) {
>>       case AMDGPU_PL_GDS:
>>       case AMDGPU_PL_GWS:
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-07  9:23 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Daniel Vetter
@ 2021-01-07 16:25   ` Felix Kuehling
  2021-01-08 14:40     ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07 16:25 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: alex.sierra, philip.yang, dri-devel, amd-gfx

On 2021-01-07 4:23 a.m., Daniel Vetter wrote:
> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>> This is the first version of our HMM based shared virtual memory manager
>> for KFD. There are still a number of known issues that we're working through
>> (see below). This will likely lead to some pretty significant changes in
>> MMU notifier handling and locking on the migration code paths. So don't
>> get hung up on those details yet.
>>
>> But I think this is a good time to start getting feedback. We're pretty
>> confident about the ioctl API, which is both simple and extensible for the
>> future. (see patches 4,16) The user mode side of the API can be found here:
>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>
>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>> and some retry IRQ handling changes (32).
>>
>>
>> Known issues:
>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>> * still working on some race conditions and random bugs
>> * performance is not great yet
> Still catching up, but I think there's another one for your list:
>
>  * hmm gpu context preempt vs page fault handling. I've had a short
>    discussion about this one with Christian before the holidays, and also
>    some private chats with Jerome. It's nasty since no easy fix, much less
>    a good idea what's the best approach here.

Do you have a pointer to that discussion or any more details?

Thanks,
  Felix


>
> I'll try to look at this more in-depth when I'm catching up on mails.
> -Daniel
>
>> Alex Sierra (12):
>>   drm/amdgpu: replace per_device_list by array
>>   drm/amdkfd: helper to convert gpu id and idx
>>   drm/amdkfd: add xnack enabled flag to kfd_process
>>   drm/amdkfd: add ioctl to configure and query xnack retries
>>   drm/amdkfd: invalidate tables on page retry fault
>>   drm/amdkfd: page table restore through svm API
>>   drm/amdkfd: SVM API call to restore page tables
>>   drm/amdkfd: add svm_bo reference for eviction fence
>>   drm/amdgpu: add param bit flag to create SVM BOs
>>   drm/amdkfd: add svm_bo eviction mechanism support
>>   drm/amdgpu: svm bo enable_signal call condition
>>   drm/amdgpu: add svm_bo eviction to enable_signal cb
>>
>> Philip Yang (23):
>>   drm/amdkfd: select kernel DEVICE_PRIVATE option
>>   drm/amdkfd: add svm ioctl API
>>   drm/amdkfd: Add SVM API support capability bits
>>   drm/amdkfd: register svm range
>>   drm/amdkfd: add svm ioctl GET_ATTR op
>>   drm/amdgpu: add common HMM get pages function
>>   drm/amdkfd: validate svm range system memory
>>   drm/amdkfd: register overlap system memory range
>>   drm/amdkfd: deregister svm range
>>   drm/amdgpu: export vm update mapping interface
>>   drm/amdkfd: map svm range to GPUs
>>   drm/amdkfd: svm range eviction and restore
>>   drm/amdkfd: register HMM device private zone
>>   drm/amdkfd: validate vram svm range from TTM
>>   drm/amdkfd: support xgmi same hive mapping
>>   drm/amdkfd: copy memory through gart table
>>   drm/amdkfd: HMM migrate ram to vram
>>   drm/amdkfd: HMM migrate vram to ram
>>   drm/amdgpu: reserve fence slot to update page table
>>   drm/amdgpu: enable retry fault wptr overflow
>>   drm/amdkfd: refine migration policy with xnack on
>>   drm/amdkfd: add svm range validate timestamp
>>   drm/amdkfd: multiple gpu migrate vram to vram
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
>>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
>>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
>>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
>>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
>>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
>>  include/uapi/linux/kfd_ioctl.h                |  169 +-
>>  26 files changed, 4296 insertions(+), 291 deletions(-)
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>
>> -- 
>> 2.29.2
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition
  2021-01-07 16:16     ` Felix Kuehling
@ 2021-01-07 16:28       ` Christian König
  2021-01-07 16:53         ` Felix Kuehling
  0 siblings, 1 reply; 84+ messages in thread
From: Christian König @ 2021-01-07 16:28 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

On 07.01.21 17:16, Felix Kuehling wrote:
> On 2021-01-07 5:56 a.m., Christian König wrote:
>
>> On 07.01.21 04:01, Felix Kuehling wrote:
>>> From: Alex Sierra <alex.sierra@amd.com>
>>>
>>> [why]
>>> To support svm bo eviction mechanism.
>>>
>>> [how]
>>> If the BO created has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set,
>>> enable_signal callback will be called inside amdgpu_evict_flags.
>>> This also causes gutting of the BO by removing all placements,
>>> so that TTM won't actually do an eviction. Instead it will discard
>>> the memory held by the BO. This is needed for HMM migration to user
>>> mode system memory pages.
>> I don't think that this will work. What exactly are you doing here?
> We discussed this a while ago when we talked about pipelined gutting.
> And you actually helped us out with a fix for that
> (https://patchwork.freedesktop.org/patch/379039/).

That's not what I meant. The pipelined gutting is ok, but why the 
enable_signaling()?

Christian.

>
> SVM BOs are BOs in VRAM containing data for HMM ranges. When such a BO
> is evicted by TTM, we do an HMM migration of the data to system memory
> (triggered by kgd2kfd_schedule_evict_and_restore_process in patch 30).
> That means we don't need TTM to copy the BO contents to GTT any more.
> Instead we want to use pipelined gutting to allow the VRAM to be freed
> once the fence signals that the HMM migration is done (the
> dma_fence_signal call near the end of svm_range_evict_svm_bo_worker in
> patch 28).
>
> Regards,
>    Felix
>
>
>> As Daniel pointed out HMM and dma_fences are fundamentally incompatible.
>>
>> Christian.
>>
>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++
>>>    1 file changed, 14 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index f423f42cb9b5..62d4da95d22d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct
>>> ttm_buffer_object *bo,
>>>        }
>>>          abo = ttm_to_amdgpu_bo(bo);
>>> +    if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) {
>>> +        struct dma_fence *fence;
>>> +        struct dma_resv *resv = &bo->base._resv;
>>> +
>>> +        rcu_read_lock();
>>> +        fence = rcu_dereference(resv->fence_excl);
>>> +        if (fence && !fence->ops->signaled)
>>> +            dma_fence_enable_sw_signaling(fence);
>>> +
>>> +        placement->num_placement = 0;
>>> +        placement->num_busy_placement = 0;
>>> +        rcu_read_unlock();
>>> +        return;
>>> +    }
>>>        switch (bo->mem.mem_type) {
>>>        case AMDGPU_PL_GDS:
>>>        case AMDGPU_PL_GWS:

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition
  2021-01-07 16:28       ` Christian König
@ 2021-01-07 16:53         ` Felix Kuehling
  0 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-07 16:53 UTC (permalink / raw)
  To: Christian König, amd-gfx, dri-devel; +Cc: alex.sierra, philip.yang

On 2021-01-07 11:28 a.m., Christian König wrote:
> On 07.01.21 17:16, Felix Kuehling wrote:
>> On 2021-01-07 5:56 a.m., Christian König wrote:
>>
>>> On 07.01.21 04:01, Felix Kuehling wrote:
>>>> From: Alex Sierra <alex.sierra@amd.com>
>>>>
>>>> [why]
>>>> To support svm bo eviction mechanism.
>>>>
>>>> [how]
>>>> If the BO created has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set,
>>>> enable_signal callback will be called inside amdgpu_evict_flags.
>>>> This also causes gutting of the BO by removing all placements,
>>>> so that TTM won't actually do an eviction. Instead it will discard
>>>> the memory held by the BO. This is needed for HMM migration to user
>>>> mode system memory pages.
>>> I don't think that this will work. What exactly are you doing here?
>> We discussed this a while ago when we talked about pipelined gutting.
>> And you actually helped us out with a fix for that
>> (https://patchwork.freedesktop.org/patch/379039/).
>
> That's not what I meant. The pipelined gutting is ok, but why the
> enable_signaling()?

That's what triggers our eviction fence callback
amdkfd_fence_enable_signaling that schedules the worker doing the
eviction. Without pipelined gutting we'd be getting that callback from
the GPU scheduler when it tries to execute the job that does the
migration. With pipelined gutting we have to call this somewhere ourselves.

I guess we could schedule the eviction worker directly without going
through the fence callback. I think we did it this way because it's more
similar to our KFD BO eviction handling where the worker gets scheduled
by the fence callback.
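
For reference, the fence op itself is tiny; roughly (a sketch from
memory, not a verbatim copy of the patch):

static bool amdkfd_fence_enable_signaling(struct dma_fence *f)
{
	struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);

	if (!fence)
		return false;

	if (dma_fence_is_signaled(f))
		return true;

	/* Kick off the eviction/restore worker that does the HMM migration */
	if (!kgd2kfd_schedule_evict_and_restore_process(fence->mm, f))
		return true;

	return false;
}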

Regards,
  Felix


>
> Christian.
>
>>
>> SVM BOs are BOs in VRAM containing data for HMM ranges. When such a BO
>> is evicted by TTM, we do an HMM migration of the data to system memory
>> (triggered by kgd2kfd_schedule_evict_and_restore_process in patch 30).
>> That means we don't need TTM to copy the BO contents to GTT any more.
>> Instead we want to use pipelined gutting to allow the VRAM to be freed
>> once the fence signals that the HMM migration is done (the
>> dma_fence_signal call near the end of svm_range_evict_svm_bo_worker in
>> patch 28).
>>
>> Regards,
>>    Felix
>>
>>
>>> As Daniel pointed out HMM and dma_fences are fundamentally
>>> incompatible.
>>>
>>> Christian.
>>>
>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++
>>>>    1 file changed, 14 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index f423f42cb9b5..62d4da95d22d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct
>>>> ttm_buffer_object *bo,
>>>>        }
>>>>          abo = ttm_to_amdgpu_bo(bo);
>>>> +    if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) {
>>>> +        struct dma_fence *fence;
>>>> +        struct dma_resv *resv = &bo->base._resv;
>>>> +
>>>> +        rcu_read_lock();
>>>> +        fence = rcu_dereference(resv->fence_excl);
>>>> +        if (fence && !fence->ops->signaled)
>>>> +            dma_fence_enable_sw_signaling(fence);
>>>> +
>>>> +        placement->num_placement = 0;
>>>> +        placement->num_busy_placement = 0;
>>>> +        rcu_read_unlock();
>>>> +        return;
>>>> +    }
>>>>        switch (bo->mem.mem_type) {
>>>>        case AMDGPU_PL_GDS:
>>>>        case AMDGPU_PL_GWS:
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-07 16:25   ` Felix Kuehling
@ 2021-01-08 14:40     ` Daniel Vetter
  2021-01-08 14:45       ` Christian König
                         ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Daniel Vetter @ 2021-01-08 14:40 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: alex.sierra, philip.yang, dri-devel, amd-gfx

On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> On 2021-01-07 4:23 a.m., Daniel Vetter wrote:
> > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> >> This is the first version of our HMM based shared virtual memory manager
> >> for KFD. There are still a number of known issues that we're working through
> >> (see below). This will likely lead to some pretty significant changes in
> >> MMU notifier handling and locking on the migration code paths. So don't
> >> get hung up on those details yet.
> >>
> >> But I think this is a good time to start getting feedback. We're pretty
> >> confident about the ioctl API, which is both simple and extensible for the
> >> future. (see patches 4,16) The user mode side of the API can be found here:
> >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> >>
> >> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> >> and some retry IRQ handling changes (32).
> >>
> >>
> >> Known issues:
> >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> >> * still working on some race conditions and random bugs
> >> * performance is not great yet
> > Still catching up, but I think there's another one for your list:
> >
> >  * hmm gpu context preempt vs page fault handling. I've had a short
> >    discussion about this one with Christian before the holidays, and also
> >    some private chats with Jerome. It's nasty since no easy fix, much less
> >    a good idea what's the best approach here.
> 
> Do you have a pointer to that discussion or any more details?

Essentially if you're handling an hmm page fault from the gpu, you can
deadlock by calling dma_fence_wait on (possibly a chain of) other command
submissions or compute contexts. That deadlocks if you can't preempt while
you have that page fault pending. Two solutions:

- your hw can (at least for compute ctx) preempt even when a page fault is
  pending

- lots of screaming in trying to come up with an alternate solution. They
  all suck.

Note that the dma_fence_wait is a hard requirement, because we need that for
mmu notifiers and shrinkers, disallowing that would disable dynamic memory
management. Which is the current "ttm is self-limited to 50% of system
memory" limitation Christian is trying to lift. So that's really not
a restriction we can lift, at least not in upstream where we need to also
support old style hardware which doesn't have page fault support and
really has no other option to handle memory management than
dma_fence_wait.
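
To spell the cycle out in code form (everything below is invented for
illustration, none of it is from this series):

static struct dma_fence *fence_of_faulting_ctx;

static int service_gpu_page_fault(void)
{
	/* Servicing the fault needs memory, which may enter direct reclaim. */
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return -ENOMEM;
	__free_page(page);
	return 0;
}

static void reclaim_side(void)
{
	/*
	 * Reclaim (shrinkers, mmu notifiers) may legitimately wait on fences
	 * of the very context whose fault is being serviced above. If that
	 * context cannot be preempted while its fault is pending, the fence
	 * never signals and both sides wait forever.
	 */
	dma_fence_wait(fence_of_faulting_ctx, false);
}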

Thread was here:

https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/

There's a few ways to resolve this (without having preempt-capable
hardware), but they're all supremely nasty.
-Daniel

> 
> Thanks,
>   Felix
> 
> 
> >
> > I'll try to look at this more in-depth when I'm catching up on mails.
> > -Daniel
> >
> >> Alex Sierra (12):
> >>   drm/amdgpu: replace per_device_list by array
> >>   drm/amdkfd: helper to convert gpu id and idx
> >>   drm/amdkfd: add xnack enabled flag to kfd_process
> >>   drm/amdkfd: add ioctl to configure and query xnack retries
> >>   drm/amdkfd: invalidate tables on page retry fault
> >>   drm/amdkfd: page table restore through svm API
> >>   drm/amdkfd: SVM API call to restore page tables
> >>   drm/amdkfd: add svm_bo reference for eviction fence
> >>   drm/amdgpu: add param bit flag to create SVM BOs
> >>   drm/amdkfd: add svm_bo eviction mechanism support
> >>   drm/amdgpu: svm bo enable_signal call condition
> >>   drm/amdgpu: add svm_bo eviction to enable_signal cb
> >>
> >> Philip Yang (23):
> >>   drm/amdkfd: select kernel DEVICE_PRIVATE option
> >>   drm/amdkfd: add svm ioctl API
> >>   drm/amdkfd: Add SVM API support capability bits
> >>   drm/amdkfd: register svm range
> >>   drm/amdkfd: add svm ioctl GET_ATTR op
> >>   drm/amdgpu: add common HMM get pages function
> >>   drm/amdkfd: validate svm range system memory
> >>   drm/amdkfd: register overlap system memory range
> >>   drm/amdkfd: deregister svm range
> >>   drm/amdgpu: export vm update mapping interface
> >>   drm/amdkfd: map svm range to GPUs
> >>   drm/amdkfd: svm range eviction and restore
> >>   drm/amdkfd: register HMM device private zone
> >>   drm/amdkfd: validate vram svm range from TTM
> >>   drm/amdkfd: support xgmi same hive mapping
> >>   drm/amdkfd: copy memory through gart table
> >>   drm/amdkfd: HMM migrate ram to vram
> >>   drm/amdkfd: HMM migrate vram to ram
> >>   drm/amdgpu: reserve fence slot to update page table
> >>   drm/amdgpu: enable retry fault wptr overflow
> >>   drm/amdkfd: refine migration policy with xnack on
> >>   drm/amdkfd: add svm range validate timestamp
> >>   drm/amdkfd: multiple gpu migrate vram to vram
> >>
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
> >>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
> >>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
> >>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
> >>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
> >>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
> >>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
> >>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
> >>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
> >>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
> >>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
> >>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
> >>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
> >>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
> >>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
> >>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
> >>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
> >>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
> >>  include/uapi/linux/kfd_ioctl.h                |  169 +-
> >>  26 files changed, 4296 insertions(+), 291 deletions(-)
> >>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> >>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> >>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> >>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> >>
> >> -- 
> >> 2.29.2
> >>
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 14:40     ` Daniel Vetter
@ 2021-01-08 14:45       ` Christian König
  2021-01-08 15:58       ` Felix Kuehling
  2021-01-13 16:56       ` Jerome Glisse
  2 siblings, 0 replies; 84+ messages in thread
From: Christian König @ 2021-01-08 14:45 UTC (permalink / raw)
  To: Daniel Vetter, Felix Kuehling
  Cc: alex.sierra, philip.yang, amd-gfx, dri-devel

Am 08.01.21 um 15:40 schrieb Daniel Vetter:
> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>>>> This is the first version of our HMM based shared virtual memory manager
>>>> for KFD. There are still a number of known issues that we're working through
>>>> (see below). This will likely lead to some pretty significant changes in
>>>> MMU notifier handling and locking on the migration code paths. So don't
>>>> get hung up on those details yet.
>>>>
>>>> But I think this is a good time to start getting feedback. We're pretty
>>>> confident about the ioctl API, which is both simple and extensible for the
>>>> future. (see patches 4,16) The user mode side of the API can be found here:
>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>>>
>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>>>> and some retry IRQ handling changes (32).
>>>>
>>>>
>>>> Known issues:
>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>>>> * still working on some race conditions and random bugs
>>>> * performance is not great yet
>>> Still catching up, but I think there's another one for your list:
>>>
>>>   * hmm gpu context preempt vs page fault handling. I've had a short
>>>     discussion about this one with Christian before the holidays, and also
>>>     some private chats with Jerome. It's nasty since no easy fix, much less
>>>     a good idea what's the best approach here.
>> Do you have a pointer to that discussion or any more details?
> Essentially if you're handling an hmm page fault from the gpu, you can
> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> you can't preempt while you have that page fault pending. Two solutions:
>
> - your hw can (at least for compute ctx) preempt even when a page fault is
>    pending
>
> - lots of screaming in trying to come up with an alternate solution. They
>    all suck.
>
> Note that the dma_fence_wait is hard requirement, because we need that for
> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> management. Which is the current "ttm is self-limited to 50% of system
> memory" limitation Christian is trying to lift. So that's really not
> a restriction we can lift, at least not in upstream where we need to also
> support old style hardware which doesn't have page fault support and
> really has no other option to handle memory management than
> dma_fence_wait.
>
> Thread was here:
>
> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
>
> There's a few ways to resolve this (without having preempt-capable
> hardware), but they're all supremely nasty.
> -Daniel
>
>> Thanks,
>>    Felix
>>
>>
>>> I'll try to look at this more in-depth when I'm catching up on mails.
>>> -Daniel
>>>
>>>> Alex Sierra (12):
>>>>    drm/amdgpu: replace per_device_list by array
>>>>    drm/amdkfd: helper to convert gpu id and idx
>>>>    drm/amdkfd: add xnack enabled flag to kfd_process
>>>>    drm/amdkfd: add ioctl to configure and query xnack retries
>>>>    drm/amdkfd: invalidate tables on page retry fault
>>>>    drm/amdkfd: page table restore through svm API
>>>>    drm/amdkfd: SVM API call to restore page tables
>>>>    drm/amdkfd: add svm_bo reference for eviction fence
>>>>    drm/amdgpu: add param bit flag to create SVM BOs
>>>>    drm/amdkfd: add svm_bo eviction mechanism support
>>>>    drm/amdgpu: svm bo enable_signal call condition
>>>>    drm/amdgpu: add svm_bo eviction to enable_signal cb
>>>>
>>>> Philip Yang (23):
>>>>    drm/amdkfd: select kernel DEVICE_PRIVATE option
>>>>    drm/amdkfd: add svm ioctl API
>>>>    drm/amdkfd: Add SVM API support capability bits
>>>>    drm/amdkfd: register svm range
>>>>    drm/amdkfd: add svm ioctl GET_ATTR op
>>>>    drm/amdgpu: add common HMM get pages function
>>>>    drm/amdkfd: validate svm range system memory
>>>>    drm/amdkfd: register overlap system memory range
>>>>    drm/amdkfd: deregister svm range
>>>>    drm/amdgpu: export vm update mapping interface
>>>>    drm/amdkfd: map svm range to GPUs
>>>>    drm/amdkfd: svm range eviction and restore
>>>>    drm/amdkfd: register HMM device private zone
>>>>    drm/amdkfd: validate vram svm range from TTM
>>>>    drm/amdkfd: support xgmi same hive mapping
>>>>    drm/amdkfd: copy memory through gart table
>>>>    drm/amdkfd: HMM migrate ram to vram
>>>>    drm/amdkfd: HMM migrate vram to ram
>>>>    drm/amdgpu: reserve fence slot to update page table
>>>>    drm/amdgpu: enable retry fault wptr overflow
>>>>    drm/amdkfd: refine migration policy with xnack on
>>>>    drm/amdkfd: add svm range validate timestamp
>>>>    drm/amdkfd: multiple gpu migrate vram to vram
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
>>>>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
>>>>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
>>>>   drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
>>>>   drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
>>>>   drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
>>>>   drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
>>>>   .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
>>>>   include/uapi/linux/kfd_ioctl.h                |  169 +-
>>>>   26 files changed, 4296 insertions(+), 291 deletions(-)
>>>>   create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>   create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>>>   create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>   create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>>>
>>>> -- 
>>>> 2.29.2
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 14:40     ` Daniel Vetter
  2021-01-08 14:45       ` Christian König
@ 2021-01-08 15:58       ` Felix Kuehling
  2021-01-08 16:06         ` Daniel Vetter
  2021-01-13 16:56       ` Jerome Glisse
  2 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-08 15:58 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: alex.sierra, philip.yang, dri-devel, amd-gfx

Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>>>> This is the first version of our HMM based shared virtual memory manager
>>>> for KFD. There are still a number of known issues that we're working through
>>>> (see below). This will likely lead to some pretty significant changes in
>>>> MMU notifier handling and locking on the migration code paths. So don't
>>>> get hung up on those details yet.
>>>>
>>>> But I think this is a good time to start getting feedback. We're pretty
>>>> confident about the ioctl API, which is both simple and extensible for the
>>>> future. (see patches 4,16) The user mode side of the API can be found here:
>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>>>
>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>>>> and some retry IRQ handling changes (32).
>>>>
>>>>
>>>> Known issues:
>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>>>> * still working on some race conditions and random bugs
>>>> * performance is not great yet
>>> Still catching up, but I think there's another one for your list:
>>>
>>>  * hmm gpu context preempt vs page fault handling. I've had a short
>>>    discussion about this one with Christian before the holidays, and also
>>>    some private chats with Jerome. It's nasty since no easy fix, much less
>>>    a good idea what's the best approach here.
>> Do you have a pointer to that discussion or any more details?
> Essentially if you're handling an hmm page fault from the gpu, you can
> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> you can't preempt while you have that page fault pending. Two solutions:
>
> - your hw can (at least for compute ctx) preempt even when a page fault is
>   pending

Our GFXv9 GPUs can do this. GFXv10 cannot.


>
> - lots of screaming in trying to come up with an alternate solution. They
>   all suck.

My idea for GFXv10 is to avoid preemption for memory management purposes
and rely 100% on page faults instead. That is, if the memory manager
needs to prevent GPU access to certain memory, just invalidate the GPU
page table entries pointing to that memory. No waiting for fences is
necessary, except for the SDMA job that invalidates the PTEs, which runs
on a special high-priority queue that should never deadlock. That should
prevent the CPU getting involved in deadlocks in kernel mode. But you
can still deadlock the GPU in user mode if all compute units get stuck
in page faults and can't switch to any useful work any more. So it's
possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
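
To illustrate, a rough sketch of that eviction path (all names below are
made up for illustration and are not the code in this series; only the
dma_fence calls are real API):

#include <linux/dma-fence.h>
#include <linux/err.h>
#include <linux/types.h>

/* Hypothetical: submit a PTE invalidation on the reserved high-priority
 * SDMA queue and return its fence. */
struct dma_fence *sdma_highpri_unmap_range(u64 start, u64 size);

struct svm_range_sketch {
        u64 gpu_va_start;
        u64 gpu_va_size;
};

static int sketch_evict_range(struct svm_range_sketch *range)
{
        struct dma_fence *pte_fence;

        /*
         * Invalidate the GPU PTEs instead of waiting for user job fences.
         * Faulting waves simply retry and fault again once the PTEs are
         * gone.
         */
        pte_fence = sdma_highpri_unmap_range(range->gpu_va_start,
                                             range->gpu_va_size);
        if (IS_ERR(pte_fence))
                return PTR_ERR(pte_fence);

        /*
         * The only wait is for the PTE update itself. That job runs on a
         * dedicated high-priority queue with no dependencies on user mode
         * work, so it cannot get stuck behind a page fault.
         */
        dma_fence_wait(pte_fence, false);
        dma_fence_put(pte_fence);

        /* The backing memory can now be migrated or freed. */
        return 0;
}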

Regards,
  Felix

>
> Note that the dma_fence_wait is hard requirement, because we need that for
> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> management. Which is the current "ttm is self-limited to 50% of system
> memory" limitation Christian is trying to lift. So that's really not
> a restriction we can lift, at least not in upstream where we need to also
> support old style hardware which doesn't have page fault support and
> really has no other option to handle memory management than
> dma_fence_wait.
>
> Thread was here:
>
> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
>
> There's a few ways to resolve this (without having preempt-capable
> hardware), but they're all supremely nasty.
> -Daniel
>
>> Thanks,
>>   Felix
>>
>>
>>> I'll try to look at this more in-depth when I'm catching up on mails.
>>> -Daniel
>>>
>>>> Alex Sierra (12):
>>>>   drm/amdgpu: replace per_device_list by array
>>>>   drm/amdkfd: helper to convert gpu id and idx
>>>>   drm/amdkfd: add xnack enabled flag to kfd_process
>>>>   drm/amdkfd: add ioctl to configure and query xnack retries
>>>>   drm/amdkfd: invalidate tables on page retry fault
>>>>   drm/amdkfd: page table restore through svm API
>>>>   drm/amdkfd: SVM API call to restore page tables
>>>>   drm/amdkfd: add svm_bo reference for eviction fence
>>>>   drm/amdgpu: add param bit flag to create SVM BOs
>>>>   drm/amdkfd: add svm_bo eviction mechanism support
>>>>   drm/amdgpu: svm bo enable_signal call condition
>>>>   drm/amdgpu: add svm_bo eviction to enable_signal cb
>>>>
>>>> Philip Yang (23):
>>>>   drm/amdkfd: select kernel DEVICE_PRIVATE option
>>>>   drm/amdkfd: add svm ioctl API
>>>>   drm/amdkfd: Add SVM API support capability bits
>>>>   drm/amdkfd: register svm range
>>>>   drm/amdkfd: add svm ioctl GET_ATTR op
>>>>   drm/amdgpu: add common HMM get pages function
>>>>   drm/amdkfd: validate svm range system memory
>>>>   drm/amdkfd: register overlap system memory range
>>>>   drm/amdkfd: deregister svm range
>>>>   drm/amdgpu: export vm update mapping interface
>>>>   drm/amdkfd: map svm range to GPUs
>>>>   drm/amdkfd: svm range eviction and restore
>>>>   drm/amdkfd: register HMM device private zone
>>>>   drm/amdkfd: validate vram svm range from TTM
>>>>   drm/amdkfd: support xgmi same hive mapping
>>>>   drm/amdkfd: copy memory through gart table
>>>>   drm/amdkfd: HMM migrate ram to vram
>>>>   drm/amdkfd: HMM migrate vram to ram
>>>>   drm/amdgpu: reserve fence slot to update page table
>>>>   drm/amdgpu: enable retry fault wptr overflow
>>>>   drm/amdkfd: refine migration policy with xnack on
>>>>   drm/amdkfd: add svm range validate timestamp
>>>>   drm/amdkfd: multiple gpu migrate vram to vram
>>>>
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
>>>>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
>>>>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
>>>>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
>>>>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
>>>>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
>>>>  include/uapi/linux/kfd_ioctl.h                |  169 +-
>>>>  26 files changed, 4296 insertions(+), 291 deletions(-)
>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>>>
>>>> -- 
>>>> 2.29.2
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 15:58       ` Felix Kuehling
@ 2021-01-08 16:06         ` Daniel Vetter
  2021-01-08 16:36           ` Felix Kuehling
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-08 16:06 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list

On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
> > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> >> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> >>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> >>>> This is the first version of our HMM based shared virtual memory manager
> >>>> for KFD. There are still a number of known issues that we're working through
> >>>> (see below). This will likely lead to some pretty significant changes in
> >>>> MMU notifier handling and locking on the migration code paths. So don't
> >>>> get hung up on those details yet.
> >>>>
> >>>> But I think this is a good time to start getting feedback. We're pretty
> >>>> confident about the ioctl API, which is both simple and extensible for the
> >>>> future. (see patches 4,16) The user mode side of the API can be found here:
> >>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> >>>>
> >>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> >>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> >>>> and some retry IRQ handling changes (32).
> >>>>
> >>>>
> >>>> Known issues:
> >>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
> >>>> * still working on some race conditions and random bugs
> >>>> * performance is not great yet
> >>> Still catching up, but I think there's another one for your list:
> >>>
> >>>  * hmm gpu context preempt vs page fault handling. I've had a short
> >>>    discussion about this one with Christian before the holidays, and also
> >>>    some private chats with Jerome. It's nasty since no easy fix, much less
> >>>    a good idea what's the best approach here.
> >> Do you have a pointer to that discussion or any more details?
> > Essentially if you're handling an hmm page fault from the gpu, you can
> > deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> > submissions or compute contexts with dma_fence_wait. Which deadlocks if
> > you can't preempt while you have that page fault pending. Two solutions:
> >
> > - your hw can (at least for compute ctx) preempt even when a page fault is
> >   pending
>
> Our GFXv9 GPUs can do this. GFXv10 cannot.

Uh, why did your hw guys drop this :-/

> > - lots of screaming in trying to come up with an alternate solution. They
> >   all suck.
>
> My idea for GFXv10 is to avoid preemption for memory management purposes
> and rely 100% on page faults instead. That is, if the memory manager
> needs to prevent GPU access to certain memory, just invalidate the GPU
> page table entries pointing to that memory. No waiting for fences is
> necessary, except for the SDMA job that invalidates the PTEs, which runs
> on a special high-priority queue that should never deadlock. That should
> prevent the CPU getting involved in deadlocks in kernel mode. But you
> can still deadlock the GPU in user mode if all compute units get stuck
> in page faults and can't switch to any useful work any more. So it's
> possible that we won't be able to use GPU page faults on our GFXv10 GPUs.

This only works if _everything_ in the system works like this, since
you're de facto breaking the cross-driver contract. As soon as there's
some legacy gl workload (userptr) or another driver involved, this
approach falls apart.

I do think it can be rescued with what I call gang scheduling of
engines: i.e. when a given engine (or a group of engines, depending on
how your hw works) is running a context that can cause a page fault, you
must flush out all workloads running on the same engine which could
block a dma_fence (preempt them, or for non-compute stuff, force their
completion). And the other way round: before you can run a legacy
gl workload with a dma_fence on these engines, you need to preempt all
ctxs that could cause page faults and take them at least out of the hw
scheduler queue.

Just reserving an sdma engine for copy jobs and ptes updates and that
stuff is necessary, but not sufficient.
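
A rough sketch of the rule I mean (all structures and helpers below are
hypothetical, just to pin down the invariant):

#include <linux/types.h>

enum ctx_class {
        CTX_DMA_FENCE,  /* legacy gl/vk etc., completion signals a dma_fence */
        CTX_FAULTING,   /* user mode compute that may take hmm page faults */
};

struct engine_group {
        enum ctx_class active_class;
        /* hw queues, run lists, ... */
};

/* Hypothetical helpers. */
void flush_or_preempt_fence_workloads(struct engine_group *eg);
void preempt_and_dequeue_faulting_ctxs(struct engine_group *eg);

/*
 * Invariant: fault-capable contexts and dma_fence-signalling contexts
 * never share an engine group at the same time.
 */
static void gang_schedule(struct engine_group *eg, enum ctx_class incoming)
{
        if (eg->active_class == incoming)
                return; /* same class, nothing to flush */

        if (incoming == CTX_FAULTING)
                /* no work a dma_fence depends on may stay on these engines */
                flush_or_preempt_fence_workloads(eg);
        else
                /* pull fault-capable ctxs out of the hw scheduler first */
                preempt_and_dequeue_faulting_ctxs(eg);

        eg->active_class = incoming;
}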

Another approach that Jerome suggested is to track the reverse
dependency graph of all dma_fence somehow and make sure that direct
reclaim never recurses on an engine you're serving a pagefault for.
Possible in theory, but in practice I think it's not feasible, simply
because it would be way too much work to implement.

Either way it's imo really nasty to come up with a scheme here that
doesn't fail in some corner, or becomes really nasty with inconsistent
rules across different drivers and hw :-(

Cheers, Daniel

>
> Regards,
>   Felix
>
> >
> > Note that the dma_fence_wait is hard requirement, because we need that for
> > mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> > management. Which is the current "ttm is self-limited to 50% of system
> > memory" limitation Christian is trying to lift. So that's really not
> > a restriction we can lift, at least not in upstream where we need to also
> > support old style hardware which doesn't have page fault support and
> > really has no other option to handle memory management than
> > dma_fence_wait.
> >
> > Thread was here:
> >
> > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> >
> > There's a few ways to resolve this (without having preempt-capable
> > hardware), but they're all supremely nasty.
> > -Daniel
> >
> >> Thanks,
> >>   Felix
> >>
> >>
> >>> I'll try to look at this more in-depth when I'm catching up on mails.
> >>> -Daniel
> >>>
> >>>> Alex Sierra (12):
> >>>>   drm/amdgpu: replace per_device_list by array
> >>>>   drm/amdkfd: helper to convert gpu id and idx
> >>>>   drm/amdkfd: add xnack enabled flag to kfd_process
> >>>>   drm/amdkfd: add ioctl to configure and query xnack retries
> >>>>   drm/amdkfd: invalidate tables on page retry fault
> >>>>   drm/amdkfd: page table restore through svm API
> >>>>   drm/amdkfd: SVM API call to restore page tables
> >>>>   drm/amdkfd: add svm_bo reference for eviction fence
> >>>>   drm/amdgpu: add param bit flag to create SVM BOs
> >>>>   drm/amdkfd: add svm_bo eviction mechanism support
> >>>>   drm/amdgpu: svm bo enable_signal call condition
> >>>>   drm/amdgpu: add svm_bo eviction to enable_signal cb
> >>>>
> >>>> Philip Yang (23):
> >>>>   drm/amdkfd: select kernel DEVICE_PRIVATE option
> >>>>   drm/amdkfd: add svm ioctl API
> >>>>   drm/amdkfd: Add SVM API support capability bits
> >>>>   drm/amdkfd: register svm range
> >>>>   drm/amdkfd: add svm ioctl GET_ATTR op
> >>>>   drm/amdgpu: add common HMM get pages function
> >>>>   drm/amdkfd: validate svm range system memory
> >>>>   drm/amdkfd: register overlap system memory range
> >>>>   drm/amdkfd: deregister svm range
> >>>>   drm/amdgpu: export vm update mapping interface
> >>>>   drm/amdkfd: map svm range to GPUs
> >>>>   drm/amdkfd: svm range eviction and restore
> >>>>   drm/amdkfd: register HMM device private zone
> >>>>   drm/amdkfd: validate vram svm range from TTM
> >>>>   drm/amdkfd: support xgmi same hive mapping
> >>>>   drm/amdkfd: copy memory through gart table
> >>>>   drm/amdkfd: HMM migrate ram to vram
> >>>>   drm/amdkfd: HMM migrate vram to ram
> >>>>   drm/amdgpu: reserve fence slot to update page table
> >>>>   drm/amdgpu: enable retry fault wptr overflow
> >>>>   drm/amdkfd: refine migration policy with xnack on
> >>>>   drm/amdkfd: add svm range validate timestamp
> >>>>   drm/amdkfd: multiple gpu migrate vram to vram
> >>>>
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
> >>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
> >>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
> >>>>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
> >>>>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
> >>>>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
> >>>>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
> >>>>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
> >>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
> >>>>  include/uapi/linux/kfd_ioctl.h                |  169 +-
> >>>>  26 files changed, 4296 insertions(+), 291 deletions(-)
> >>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> >>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> >>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> >>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> >>>>
> >>>> --
> >>>> 2.29.2
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 16:06         ` Daniel Vetter
@ 2021-01-08 16:36           ` Felix Kuehling
  2021-01-08 16:53             ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-08 16:36 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list


Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
> On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>> Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
>>> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
>>>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
>>>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>>>>>> This is the first version of our HMM based shared virtual memory manager
>>>>>> for KFD. There are still a number of known issues that we're working through
>>>>>> (see below). This will likely lead to some pretty significant changes in
>>>>>> MMU notifier handling and locking on the migration code paths. So don't
>>>>>> get hung up on those details yet.
>>>>>>
>>>>>> But I think this is a good time to start getting feedback. We're pretty
>>>>>> confident about the ioctl API, which is both simple and extensible for the
>>>>>> future. (see patches 4,16) The user mode side of the API can be found here:
>>>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>>>>>
>>>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>>>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>>>>>> and some retry IRQ handling changes (32).
>>>>>>
>>>>>>
>>>>>> Known issues:
>>>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>>>>>> * still working on some race conditions and random bugs
>>>>>> * performance is not great yet
>>>>> Still catching up, but I think there's another one for your list:
>>>>>
>>>>>  * hmm gpu context preempt vs page fault handling. I've had a short
>>>>>    discussion about this one with Christian before the holidays, and also
>>>>>    some private chats with Jerome. It's nasty since no easy fix, much less
>>>>>    a good idea what's the best approach here.
>>>> Do you have a pointer to that discussion or any more details?
>>> Essentially if you're handling an hmm page fault from the gpu, you can
>>> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
>>> submissions or compute contexts with dma_fence_wait. Which deadlocks if
>>> you can't preempt while you have that page fault pending. Two solutions:
>>>
>>> - your hw can (at least for compute ctx) preempt even when a page fault is
>>>   pending
>> Our GFXv9 GPUs can do this. GFXv10 cannot.
> Uh, why did your hw guys drop this :-/
>
>>> - lots of screaming in trying to come up with an alternate solution. They
>>>   all suck.
>> My idea for GFXv10 is to avoid preemption for memory management purposes
>> and rely 100% on page faults instead. That is, if the memory manager
>> needs to prevent GPU access to certain memory, just invalidate the GPU
>> page table entries pointing to that memory. No waiting for fences is
>> necessary, except for the SDMA job that invalidates the PTEs, which runs
>> on a special high-priority queue that should never deadlock. That should
>> prevent the CPU getting involved in deadlocks in kernel mode. But you
>> can still deadlock the GPU in user mode if all compute units get stuck
>> in page faults and can't switch to any useful work any more. So it's
>> possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
> This only works if _everything_ in the system works like this, since
> you're defacto breaking the cross-driver contract. As soon as there's
> some legacy gl workload (userptr) or another driver involved, this
> approach falls apart.

I think the scenario you have in mind involves a dma_fence that depends
on the resolution of a GPU page fault. With our user mode command
submission model for compute contexts, there are no DMA fences that get
signaled by compute jobs that could get stuck on page faults.

The legacy GL workload would not get GPU page faults. The only way it
could get stuck is if all CUs are stuck on page faults and the command
processor can't find any HW resources to execute it on. That's my user
mode deadlock scenario below. So yeah, you're right, kernel mode can't
avoid getting involved in that unless everything uses user mode command
submissions.

If (big if) we switched to user mode command submission for all compute
and graphics contexts, and no longer use DMA fences to signal their
completion, I think that would solve the problem as far as the kernel is
concerned.


>
> I do think it can be rescued with what I call gang scheduling of
> engines: I.e. when a given engine is running a context (or a group of
> engines, depending how your hw works) that can cause a page fault, you
> must flush out all workloads running on the same engine which could
> block a dma_fence (preempt them, or for non-compute stuff, force their
> completion). And the other way round, i.e. before you can run a legacy
> gl workload with a dma_fence on these engines you need to preempt all
> ctxs that could cause page faults and take them at least out of the hw
> scheduler queue.

Yuck! But yeah, that would work. A less invasive alternative would be to
reserve some compute units for graphics contexts so we can guarantee
forward progress for graphics contexts even when all CUs working on
compute stuff are stuck on page faults.
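
As a strawman, that reservation could be as simple as masking a few CUs
out of every fault-capable compute queue. The constant and helper below
are illustrative only, not the actual CU-mask plumbing:

#include <linux/types.h>

/* e.g. keep 4 CUs free of fault-capable work for gfx forward progress */
#define GFX_RESERVED_CU_MASK    0xfULL

static u64 sketch_queue_cu_mask(u64 requested_mask, bool queue_can_fault)
{
        if (!queue_can_fault)
                return requested_mask;  /* gfx/dma_fence work may use any CU */

        /* fault-capable queues never occupy the reserved CUs */
        return requested_mask & ~GFX_RESERVED_CU_MASK;
}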


>
> Just reserving an sdma engine for copy jobs and ptes updates and that
> stuff is necessary, but not sufficient.
>
> Another approach that Jerome suggested is to track the reverse
> dependency graph of all dma_fence somehow and make sure that direct
> reclaim never recurses on an engine you're serving a pagefault for.
> Possible in theory, but in practice I think not feasible to implement
> because way too much work to implement.

I agree.


>
> Either way it's imo really nasty to come up with a scheme here that
> doesn't fail in some corner, or becomes really nasty with inconsistent
> rules across different drivers and hw :-(

Yeah. The cleanest approach is to avoid DMA fences altogether for
device/engines that can get stuck on page faults. A user mode command
submission model would do that.

Reserving some compute units for graphics contexts that signal fences
but never page fault should also work.

Regards,
  Felix


>
> Cheers, Daniel
>
>> Regards,
>>   Felix
>>
>>> Note that the dma_fence_wait is hard requirement, because we need that for
>>> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
>>> management. Which is the current "ttm is self-limited to 50% of system
>>> memory" limitation Christian is trying to lift. So that's really not
>>> a restriction we can lift, at least not in upstream where we need to also
>>> support old style hardware which doesn't have page fault support and
>>> really has no other option to handle memory management than
>>> dma_fence_wait.
>>>
>>> Thread was here:
>>>
>>> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
>>>
>>> There's a few ways to resolve this (without having preempt-capable
>>> hardware), but they're all supremely nasty.
>>> -Daniel
>>>
>>>> Thanks,
>>>>   Felix
>>>>
>>>>
>>>>> I'll try to look at this more in-depth when I'm catching up on mails.
>>>>> -Daniel
>>>>>
>>>>>> Alex Sierra (12):
>>>>>>   drm/amdgpu: replace per_device_list by array
>>>>>>   drm/amdkfd: helper to convert gpu id and idx
>>>>>>   drm/amdkfd: add xnack enabled flag to kfd_process
>>>>>>   drm/amdkfd: add ioctl to configure and query xnack retries
>>>>>>   drm/amdkfd: invalidate tables on page retry fault
>>>>>>   drm/amdkfd: page table restore through svm API
>>>>>>   drm/amdkfd: SVM API call to restore page tables
>>>>>>   drm/amdkfd: add svm_bo reference for eviction fence
>>>>>>   drm/amdgpu: add param bit flag to create SVM BOs
>>>>>>   drm/amdkfd: add svm_bo eviction mechanism support
>>>>>>   drm/amdgpu: svm bo enable_signal call condition
>>>>>>   drm/amdgpu: add svm_bo eviction to enable_signal cb
>>>>>>
>>>>>> Philip Yang (23):
>>>>>>   drm/amdkfd: select kernel DEVICE_PRIVATE option
>>>>>>   drm/amdkfd: add svm ioctl API
>>>>>>   drm/amdkfd: Add SVM API support capability bits
>>>>>>   drm/amdkfd: register svm range
>>>>>>   drm/amdkfd: add svm ioctl GET_ATTR op
>>>>>>   drm/amdgpu: add common HMM get pages function
>>>>>>   drm/amdkfd: validate svm range system memory
>>>>>>   drm/amdkfd: register overlap system memory range
>>>>>>   drm/amdkfd: deregister svm range
>>>>>>   drm/amdgpu: export vm update mapping interface
>>>>>>   drm/amdkfd: map svm range to GPUs
>>>>>>   drm/amdkfd: svm range eviction and restore
>>>>>>   drm/amdkfd: register HMM device private zone
>>>>>>   drm/amdkfd: validate vram svm range from TTM
>>>>>>   drm/amdkfd: support xgmi same hive mapping
>>>>>>   drm/amdkfd: copy memory through gart table
>>>>>>   drm/amdkfd: HMM migrate ram to vram
>>>>>>   drm/amdkfd: HMM migrate vram to ram
>>>>>>   drm/amdgpu: reserve fence slot to update page table
>>>>>>   drm/amdgpu: enable retry fault wptr overflow
>>>>>>   drm/amdkfd: refine migration policy with xnack on
>>>>>>   drm/amdkfd: add svm range validate timestamp
>>>>>>   drm/amdkfd: multiple gpu migrate vram to vram
>>>>>>
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
>>>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
>>>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
>>>>>>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
>>>>>>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
>>>>>>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
>>>>>>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
>>>>>>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
>>>>>>  include/uapi/linux/kfd_ioctl.h                |  169 +-
>>>>>>  26 files changed, 4296 insertions(+), 291 deletions(-)
>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>>>>>
>>>>>> --
>>>>>> 2.29.2
>>>>>>
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 16:36           ` Felix Kuehling
@ 2021-01-08 16:53             ` Daniel Vetter
  2021-01-08 17:56               ` Felix Kuehling
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-08 16:53 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list

On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
>
> Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
> > On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
> >> Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
> >>> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> >>>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> >>>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> >>>>>> This is the first version of our HMM based shared virtual memory manager
> >>>>>> for KFD. There are still a number of known issues that we're working through
> >>>>>> (see below). This will likely lead to some pretty significant changes in
> >>>>>> MMU notifier handling and locking on the migration code paths. So don't
> >>>>>> get hung up on those details yet.
> >>>>>>
> >>>>>> But I think this is a good time to start getting feedback. We're pretty
> >>>>>> confident about the ioctl API, which is both simple and extensible for the
> >>>>>> future. (see patches 4,16) The user mode side of the API can be found here:
> >>>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> >>>>>>
> >>>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> >>>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> >>>>>> and some retry IRQ handling changes (32).
> >>>>>>
> >>>>>>
> >>>>>> Known issues:
> >>>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
> >>>>>> * still working on some race conditions and random bugs
> >>>>>> * performance is not great yet
> >>>>> Still catching up, but I think there's another one for your list:
> >>>>>
> >>>>>  * hmm gpu context preempt vs page fault handling. I've had a short
> >>>>>    discussion about this one with Christian before the holidays, and also
> >>>>>    some private chats with Jerome. It's nasty since no easy fix, much less
> >>>>>    a good idea what's the best approach here.
> >>>> Do you have a pointer to that discussion or any more details?
> >>> Essentially if you're handling an hmm page fault from the gpu, you can
> >>> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> >>> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> >>> you can't preempt while you have that page fault pending. Two solutions:
> >>>
> >>> - your hw can (at least for compute ctx) preempt even when a page fault is
> >>>   pending
> >> Our GFXv9 GPUs can do this. GFXv10 cannot.
> > Uh, why did your hw guys drop this :-/
> >
> >>> - lots of screaming in trying to come up with an alternate solution. They
> >>>   all suck.
> >> My idea for GFXv10 is to avoid preemption for memory management purposes
> >> and rely 100% on page faults instead. That is, if the memory manager
> >> needs to prevent GPU access to certain memory, just invalidate the GPU
> >> page table entries pointing to that memory. No waiting for fences is
> >> necessary, except for the SDMA job that invalidates the PTEs, which runs
> >> on a special high-priority queue that should never deadlock. That should
> >> prevent the CPU getting involved in deadlocks in kernel mode. But you
> >> can still deadlock the GPU in user mode if all compute units get stuck
> >> in page faults and can't switch to any useful work any more. So it's
> >> possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
> > This only works if _everything_ in the system works like this, since
> > you're defacto breaking the cross-driver contract. As soon as there's
> > some legacy gl workload (userptr) or another driver involved, this
> > approach falls apart.
>
> I think the scenario you have in mind involves a dma_fence that depends
> on the resolution of a GPU page fault. With our user mode command
> submission model for compute contexts, there are no DMA fences that get
> signaled by compute jobs that could get stuck on page faults.
>
> The legacy GL workload would not get GPU page faults. The only way it
> could get stuck is, if all CUs are stuck on page faults and the command
> processor can't find any HW resources to execute it on. That's my user
> mode deadlock scenario below. So yeah, you're right, kernel mode can't
> avoid getting involved in that unless everything uses user mode command
> submissions.
>
> If (big if) we switched to user mode command submission for all compute
> and graphics contexts, and no longer use DMA fences to signal their
> completion, I think that would solve the problem as far as the kernel is
> concerned.

We can't throw dma_fence away because it's uapi built into various
compositor protocols. Otherwise we could pull a wddm2 like microsoft
did on windows and do what you're describing. So completely getting
rid of dma_fences (even if just limited to newer gpus) is also a
decade-long effort at least, since that's roughly how long it'll take
to sunset and convert everything over.

The other problem is that we're now building more stuff on top of
dma_resv like the dynamic dma-buf p2p stuff, now integrated into rdma.
I think even internally in the kernel it would be a massive pain to
untangle our fencing sufficiently to make this all happen without
loops. And I'm not even sure whether we could prevent deadlocks by
splitting dma_fence up into the userspace sync parts and the kernel
internal sync parts, since they leak into each other.

> > I do think it can be rescued with what I call gang scheduling of
> > engines: I.e. when a given engine is running a context (or a group of
> > engines, depending how your hw works) that can cause a page fault, you
> > must flush out all workloads running on the same engine which could
> > block a dma_fence (preempt them, or for non-compute stuff, force their
> > completion). And the other way round, i.e. before you can run a legacy
> > gl workload with a dma_fence on these engines you need to preempt all
> > ctxs that could cause page faults and take them at least out of the hw
> > scheduler queue.
>
> Yuck! But yeah, that would work. A less invasive alternative would be to
> reserve some compute units for graphics contexts so we can guarantee
> forward progress for graphics contexts even when all CUs working on
> compute stuff are stuck on page faults.

Won't this hurt compute workloads? I think we need something where at
least pure compute or pure gl/vk workloads run at full performance.
And without preempt we can't take anything back when we need it, so
we'd have to always upfront reserve some cores just in case.

> > Just reserving an sdma engine for copy jobs and ptes updates and that
> > stuff is necessary, but not sufficient.
> >
> > Another approach that Jerome suggested is to track the reverse
> > dependency graph of all dma_fence somehow and make sure that direct
> > reclaim never recurses on an engine you're serving a pagefault for.
> > Possible in theory, but in practice I think not feasible to implement
> > because way too much work to implement.
>
> I agree.
>
>
> >
> > Either way it's imo really nasty to come up with a scheme here that
> > doesn't fail in some corner, or becomes really nasty with inconsistent
> > rules across different drivers and hw :-(
>
> Yeah. The cleanest approach is to avoid DMA fences altogether for
> device/engines that can get stuck on page faults. A user mode command
> submission model would do that.
>
> Reserving some compute units for graphics contexts that signal fences
> but never page fault should also work.

The trouble is you don't just need engines, you need compute
resources/cores behind them too (assuming I'm understanding correctly
how this works on amd hw). Otherwise you end up with a gl context that
should complete to resolve the deadlock, but can't, because it can't
run its shader since all the shader cores are stuck in compute page
faults somewhere. Hence the gang scheduling would need to be at a
level where you can guarantee full isolation of hw resources, either
because you can preempt stuck compute kernels and let gl shaders run,
or because of hw core partitioning or something else. If you can't, you
need to gang schedule the entire gpu.

I think in practice that's not too ugly, since for pure compute
workloads you're most likely not going to have a desktop running anyway.
And for developer machines we should be able to push the occasional gfx
update through the gpu without causing too much stutter on the
desktop or costing too much perf on the compute side. And pure gl/vk
or pure compute workloads should keep running at full performance.
-Daniel



>
> Regards,
>   Felix
>
>
> >
> > Cheers, Daniel
> >
> >> Regards,
> >>   Felix
> >>
> >>> Note that the dma_fence_wait is hard requirement, because we need that for
> >>> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> >>> management. Which is the current "ttm is self-limited to 50% of system
> >>> memory" limitation Christian is trying to lift. So that's really not
> >>> a restriction we can lift, at least not in upstream where we need to also
> >>> support old style hardware which doesn't have page fault support and
> >>> really has no other option to handle memory management than
> >>> dma_fence_wait.
> >>>
> >>> Thread was here:
> >>>
> >>> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> >>>
> >>> There's a few ways to resolve this (without having preempt-capable
> >>> hardware), but they're all supremely nasty.
> >>> -Daniel
> >>>
> >>>> Thanks,
> >>>>   Felix
> >>>>
> >>>>
> >>>>> I'll try to look at this more in-depth when I'm catching up on mails.
> >>>>> -Daniel
> >>>>>
> >>>>>> Alex Sierra (12):
> >>>>>>   drm/amdgpu: replace per_device_list by array
> >>>>>>   drm/amdkfd: helper to convert gpu id and idx
> >>>>>>   drm/amdkfd: add xnack enabled flag to kfd_process
> >>>>>>   drm/amdkfd: add ioctl to configure and query xnack retries
> >>>>>>   drm/amdkfd: invalidate tables on page retry fault
> >>>>>>   drm/amdkfd: page table restore through svm API
> >>>>>>   drm/amdkfd: SVM API call to restore page tables
> >>>>>>   drm/amdkfd: add svm_bo reference for eviction fence
> >>>>>>   drm/amdgpu: add param bit flag to create SVM BOs
> >>>>>>   drm/amdkfd: add svm_bo eviction mechanism support
> >>>>>>   drm/amdgpu: svm bo enable_signal call condition
> >>>>>>   drm/amdgpu: add svm_bo eviction to enable_signal cb
> >>>>>>
> >>>>>> Philip Yang (23):
> >>>>>>   drm/amdkfd: select kernel DEVICE_PRIVATE option
> >>>>>>   drm/amdkfd: add svm ioctl API
> >>>>>>   drm/amdkfd: Add SVM API support capability bits
> >>>>>>   drm/amdkfd: register svm range
> >>>>>>   drm/amdkfd: add svm ioctl GET_ATTR op
> >>>>>>   drm/amdgpu: add common HMM get pages function
> >>>>>>   drm/amdkfd: validate svm range system memory
> >>>>>>   drm/amdkfd: register overlap system memory range
> >>>>>>   drm/amdkfd: deregister svm range
> >>>>>>   drm/amdgpu: export vm update mapping interface
> >>>>>>   drm/amdkfd: map svm range to GPUs
> >>>>>>   drm/amdkfd: svm range eviction and restore
> >>>>>>   drm/amdkfd: register HMM device private zone
> >>>>>>   drm/amdkfd: validate vram svm range from TTM
> >>>>>>   drm/amdkfd: support xgmi same hive mapping
> >>>>>>   drm/amdkfd: copy memory through gart table
> >>>>>>   drm/amdkfd: HMM migrate ram to vram
> >>>>>>   drm/amdkfd: HMM migrate vram to ram
> >>>>>>   drm/amdgpu: reserve fence slot to update page table
> >>>>>>   drm/amdgpu: enable retry fault wptr overflow
> >>>>>>   drm/amdkfd: refine migration policy with xnack on
> >>>>>>   drm/amdkfd: add svm range validate timestamp
> >>>>>>   drm/amdkfd: multiple gpu migrate vram to vram
> >>>>>>
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
> >>>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
> >>>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
> >>>>>>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
> >>>>>>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
> >>>>>>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
> >>>>>>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
> >>>>>>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
> >>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
> >>>>>>  include/uapi/linux/kfd_ioctl.h                |  169 +-
> >>>>>>  26 files changed, 4296 insertions(+), 291 deletions(-)
> >>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> >>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> >>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> >>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> >>>>>>
> >>>>>> --
> >>>>>> 2.29.2
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> dri-devel mailing list
> >>>>>> dri-devel@lists.freedesktop.org
> >>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
> >



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 16:53             ` Daniel Vetter
@ 2021-01-08 17:56               ` Felix Kuehling
  2021-01-11 16:29                 ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-08 17:56 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list


Am 2021-01-08 um 11:53 a.m. schrieb Daniel Vetter:
> On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>>
>> Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
>>> On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>> Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
>>>>> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
>>>>>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
>>>>>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>>>>>>>> This is the first version of our HMM based shared virtual memory manager
>>>>>>>> for KFD. There are still a number of known issues that we're working through
>>>>>>>> (see below). This will likely lead to some pretty significant changes in
>>>>>>>> MMU notifier handling and locking on the migration code paths. So don't
>>>>>>>> get hung up on those details yet.
>>>>>>>>
>>>>>>>> But I think this is a good time to start getting feedback. We're pretty
>>>>>>>> confident about the ioctl API, which is both simple and extensible for the
>>>>>>>> future. (see patches 4,16) The user mode side of the API can be found here:
>>>>>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>>>>>>>
>>>>>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>>>>>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>>>>>>>> and some retry IRQ handling changes (32).
>>>>>>>>
>>>>>>>>
>>>>>>>> Known issues:
>>>>>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>>>>>>>> * still working on some race conditions and random bugs
>>>>>>>> * performance is not great yet
>>>>>>> Still catching up, but I think there's another one for your list:
>>>>>>>
>>>>>>>  * hmm gpu context preempt vs page fault handling. I've had a short
>>>>>>>    discussion about this one with Christian before the holidays, and also
>>>>>>>    some private chats with Jerome. It's nasty since no easy fix, much less
>>>>>>>    a good idea what's the best approach here.
>>>>>> Do you have a pointer to that discussion or any more details?
>>>>> Essentially if you're handling an hmm page fault from the gpu, you can
>>>>> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
>>>>> submissions or compute contexts with dma_fence_wait. Which deadlocks if
>>>>> you can't preempt while you have that page fault pending. Two solutions:
>>>>>
>>>>> - your hw can (at least for compute ctx) preempt even when a page fault is
>>>>>   pending
>>>> Our GFXv9 GPUs can do this. GFXv10 cannot.
>>> Uh, why did your hw guys drop this :-/

Performance. It's the same reason why the XNACK mode selection API
exists (patch 16). When we enable recoverable page fault handling in the
compute units on GFXv9, it costs some performance even when no page
faults are happening. On GFXv10 that retry fault handling moved out of
the compute units, so they don't take the performance hit. But that
sacrificed the ability to preempt during page faults. We'll need to work
with our hardware teams to restore that capability in a future generation.


>>>
>>>>> - lots of screaming in trying to come up with an alternate solution. They
>>>>>   all suck.
>>>> My idea for GFXv10 is to avoid preemption for memory management purposes
>>>> and rely 100% on page faults instead. That is, if the memory manager
>>>> needs to prevent GPU access to certain memory, just invalidate the GPU
>>>> page table entries pointing to that memory. No waiting for fences is
>>>> necessary, except for the SDMA job that invalidates the PTEs, which runs
>>>> on a special high-priority queue that should never deadlock. That should
>>>> prevent the CPU getting involved in deadlocks in kernel mode. But you
>>>> can still deadlock the GPU in user mode if all compute units get stuck
>>>> in page faults and can't switch to any useful work any more. So it's
>>>> possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
>>> This only works if _everything_ in the system works like this, since
>>> you're defacto breaking the cross-driver contract. As soon as there's
>>> some legacy gl workload (userptr) or another driver involved, this
>>> approach falls apart.
>> I think the scenario you have in mind involves a dma_fence that depends
>> on the resolution of a GPU page fault. With our user mode command
>> submission model for compute contexts, there are no DMA fences that get
>> signaled by compute jobs that could get stuck on page faults.
>>
>> The legacy GL workload would not get GPU page faults. The only way it
>> could get stuck is, if all CUs are stuck on page faults and the command
>> processor can't find any HW resources to execute it on. That's my user
>> mode deadlock scenario below. So yeah, you're right, kernel mode can't
>> avoid getting involved in that unless everything uses user mode command
>> submissions.
>>
>> If (big if) we switched to user mode command submission for all compute
>> and graphics contexts, and no longer use DMA fences to signal their
>> completion, I think that would solve the problem as far as the kernel is
>> concerned.
> We can't throw dma_fence away because it's uapi built into various
> compositor protocols. Otherwise we could pull a wddm2 like microsoft
> did on windows and do what you're describing. So completely getting
> rid of dma_fences (even just limited on newer gpus) is also a decadel
> effort at least, since that's roughly how long it'll take to sunset
> and convert everything over.

OK.


>
> The other problem is that we're now building more stuff on top of
> dma_resv like the dynamic dma-buf p2p stuff, now integrated into rdma.
> I think even internally in the kernel it would be a massive pain to
> untangle our fencing sufficiently to make this all happen without
> loops. And I'm not even sure whether we could prevent deadlocks by
> splitting dma_fence up into the userspace sync parts and the kernel
> internal sync parts since they leak into each another.
>
>>> I do think it can be rescued with what I call gang scheduling of
>>> engines: I.e. when a given engine is running a context (or a group of
>>> engines, depending how your hw works) that can cause a page fault, you
>>> must flush out all workloads running on the same engine which could
>>> block a dma_fence (preempt them, or for non-compute stuff, force their
>>> completion). And the other way round, i.e. before you can run a legacy
>>> gl workload with a dma_fence on these engines you need to preempt all
>>> ctxs that could cause page faults and take them at least out of the hw
>>> scheduler queue.
>> Yuck! But yeah, that would work. A less invasive alternative would be to
>> reserve some compute units for graphics contexts so we can guarantee
>> forward progress for graphics contexts even when all CUs working on
>> compute stuff are stuck on page faults.
> Won't this hurt compute workloads? I think we need something were at
> least pure compute or pure gl/vk workloads run at full performance.
> And without preempt we can't take anything back when we need it, so
> would have to always upfront reserve some cores just in case.

Yes, it would hurt proportionally to how many CUs get reserved. On big
GPUs with many CUs the impact could be quite small.

That said, I'm not sure it'll work on our hardware. Our CUs can execute
multiple wavefronts from different contexts and switch between them with
fine granularity. I'd need to check with our HW engineers whether this
CU-internal context switching is still possible during page faults on
GFXv10.


>
>>> Just reserving an sdma engine for copy jobs and ptes updates and that
>>> stuff is necessary, but not sufficient.
>>>
>>> Another approach that Jerome suggested is to track the reverse
>>> dependency graph of all dma_fence somehow and make sure that direct
>>> reclaim never recurses on an engine you're serving a pagefault for.
>>> Possible in theory, but in practice I think not feasible to implement
>>> because way too much work to implement.
>> I agree.
>>
>>
>>> Either way it's imo really nasty to come up with a scheme here that
>>> doesn't fail in some corner, or becomes really nasty with inconsistent
>>> rules across different drivers and hw :-(
>> Yeah. The cleanest approach is to avoid DMA fences altogether for
>> device/engines that can get stuck on page faults. A user mode command
>> submission model would do that.
>>
>> Reserving some compute units for graphics contexts that signal fences
>> but never page fault should also work.
> The trouble is you don't just need engines, you need compute
> resources/cores behind them too (assuming I'm understading correctly
> how this works on amd hw). Otherwise you end up with a gl context that
> should complete to resolve the deadlock, but can't because it can't
> run it's shader because all the shader cores are stuck in compute page
> faults somewhere.

That's why I suggested reserving some CUs that would never execute
compute workloads that can page fault.


>  Hence the gang scheduling would need to be at a
> level were you can guarantee full isolation of hw resources, either
> because you can preempt stuck compute kernels and let gl shaders run,
> or because of hw core partitiion or something else. If you cant, you
> need to gang schedule the entire gpu.

Yes.


>
> I think in practice that's not too ugly since for pure compute
> workloads you're not going to have a desktop running most likely.

We still need legacy contexts for video decoding and post processing.
But maybe we can find a fix for that too.


>  And
> for developer machines we should be able to push the occasional gfx
> update through the gpu still without causing too much stutter on the
> desktop or costing too much perf on the compute side. And pure gl/vk
> or pure compute workloads should keep running at full performance.

I think it would be acceptable for mostly-compute workloads. It would be
bad for desktop workloads with some compute, e.g. games with
OpenCL-based physics. We're increasingly relying on KFD for all GPU
computing (including OpenCL) in desktop applications. But those could
live without GPU page faults until we can build sane hardware.

Regards,
  Felix


> -Daniel
>
>
>
>> Regards,
>>   Felix
>>
>>
>>> Cheers, Daniel
>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>> Note that the dma_fence_wait is hard requirement, because we need that for
>>>>> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
>>>>> management. Which is the current "ttm is self-limited to 50% of system
>>>>> memory" limitation Christian is trying to lift. So that's really not
>>>>> a restriction we can lift, at least not in upstream where we need to also
>>>>> support old style hardware which doesn't have page fault support and
>>>>> really has no other option to handle memory management than
>>>>> dma_fence_wait.
>>>>>
>>>>> Thread was here:
>>>>>
>>>>> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
>>>>>
>>>>> There's a few ways to resolve this (without having preempt-capable
>>>>> hardware), but they're all supremely nasty.
>>>>> -Daniel
>>>>>
>>>>>> Thanks,
>>>>>>   Felix
>>>>>>
>>>>>>
>>>>>>> I'll try to look at this more in-depth when I'm catching up on mails.
>>>>>>> -Daniel
>>>>>>>
>>>>>>>> [...]
>>>
>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 17:56               ` Felix Kuehling
@ 2021-01-11 16:29                 ` Daniel Vetter
  2021-01-14  5:34                   ` Felix Kuehling
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-11 16:29 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list

On Fri, Jan 08, 2021 at 12:56:24PM -0500, Felix Kuehling wrote:
> 
> Am 2021-01-08 um 11:53 a.m. schrieb Daniel Vetter:
> > On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
> >>
> >> Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
> >>> On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
> >>>> Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
> >>>>> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> >>>>>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> >>>>>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> >>>>>>>> This is the first version of our HMM based shared virtual memory manager
> >>>>>>>> for KFD. There are still a number of known issues that we're working through
> >>>>>>>> (see below). This will likely lead to some pretty significant changes in
> >>>>>>>> MMU notifier handling and locking on the migration code paths. So don't
> >>>>>>>> get hung up on those details yet.
> >>>>>>>>
> >>>>>>>> But I think this is a good time to start getting feedback. We're pretty
> >>>>>>>> confident about the ioctl API, which is both simple and extensible for the
> >>>>>>>> future. (see patches 4,16) The user mode side of the API can be found here:
> >>>>>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> >>>>>>>>
> >>>>>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> >>>>>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> >>>>>>>> and some retry IRQ handling changes (32).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Known issues:
> >>>>>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
> >>>>>>>> * still working on some race conditions and random bugs
> >>>>>>>> * performance is not great yet
> >>>>>>> Still catching up, but I think there's another one for your list:
> >>>>>>>
> >>>>>>>  * hmm gpu context preempt vs page fault handling. I've had a short
> >>>>>>>    discussion about this one with Christian before the holidays, and also
> >>>>>>>    some private chats with Jerome. It's nasty since no easy fix, much less
> >>>>>>>    a good idea what's the best approach here.
> >>>>>> Do you have a pointer to that discussion or any more details?
> >>>>> Essentially if you're handling an hmm page fault from the gpu, you can
> >>>>> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> >>>>> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> >>>>> you can't preempt while you have that page fault pending. Two solutions:
> >>>>>
> >>>>> - your hw can (at least for compute ctx) preempt even when a page fault is
> >>>>>   pending
> >>>> Our GFXv9 GPUs can do this. GFXv10 cannot.
> >>> Uh, why did your hw guys drop this :-/
> 
> Performance. It's the same reason why the XNACK mode selection API
> exists (patch 16). When we enable recoverable page fault handling in the
> compute units on GFXv9, it costs some performance even when no page
> faults are happening. On GFXv10 that retry fault handling moved out of
> the compute units, so they don't take the performance hit. But that
> sacrificed the ability to preempt during page faults. We'll need to work
> with our hardware teams to restore that capability in a future generation.

Ah yes, you need to stall at more points in the compute cores to make sure
you can recover if the page fault gets interrupted.

Maybe my knowledge is outdated, but my understanding is that nvidia can
also preempt (but only for compute jobs, since oh dear the pain this would
be for all the fixed function stuff). Since gfx10 moved page fault
handling further away from compute cores, do you know whether this now
means you can do page faults for (some?) fixed function stuff too? Or
still only for compute?

Supporting page faults for 3D would be a real pain given the corner we're
stuck in right now, but better to know about this early than late :-/

> >>>
> >>>>> - lots of screaming in trying to come up with an alternate solution. They
> >>>>>   all suck.
> >>>> My idea for GFXv10 is to avoid preemption for memory management purposes
> >>>> and rely 100% on page faults instead. That is, if the memory manager
> >>>> needs to prevent GPU access to certain memory, just invalidate the GPU
> >>>> page table entries pointing to that memory. No waiting for fences is
> >>>> necessary, except for the SDMA job that invalidates the PTEs, which runs
> >>>> on a special high-priority queue that should never deadlock. That should
> >>>> prevent the CPU getting involved in deadlocks in kernel mode. But you
> >>>> can still deadlock the GPU in user mode if all compute units get stuck
> >>>> in page faults and can't switch to any useful work any more. So it's
> >>>> possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
> >>> This only works if _everything_ in the system works like this, since
> >>> you're defacto breaking the cross-driver contract. As soon as there's
> >>> some legacy gl workload (userptr) or another driver involved, this
> >>> approach falls apart.
> >> I think the scenario you have in mind involves a dma_fence that depends
> >> on the resolution of a GPU page fault. With our user mode command
> >> submission model for compute contexts, there are no DMA fences that get
> >> signaled by compute jobs that could get stuck on page faults.
> >>
> >> The legacy GL workload would not get GPU page faults. The only way it
> >> could get stuck is, if all CUs are stuck on page faults and the command
> >> processor can't find any HW resources to execute it on. That's my user
> >> mode deadlock scenario below. So yeah, you're right, kernel mode can't
> >> avoid getting involved in that unless everything uses user mode command
> >> submissions.
> >>
> >> If (big if) we switched to user mode command submission for all compute
> >> and graphics contexts, and no longer use DMA fences to signal their
> >> completion, I think that would solve the problem as far as the kernel is
> >> concerned.
> > We can't throw dma_fence away because it's uapi built into various
> > compositor protocols. Otherwise we could pull a wddm2 like microsoft
> > did on windows and do what you're describing. So completely getting
> > rid of dma_fences (even just limited on newer gpus) is also a decadel
> > effort at least, since that's roughly how long it'll take to sunset
> > and convert everything over.
> 
> OK.
> 
> 
> >
> > The other problem is that we're now building more stuff on top of
> > dma_resv like the dynamic dma-buf p2p stuff, now integrated into rdma.
> > I think even internally in the kernel it would be a massive pain to
> > untangle our fencing sufficiently to make this all happen without
> > loops. And I'm not even sure whether we could prevent deadlocks by
> > splitting dma_fence up into the userspace sync parts and the kernel
> > internal sync parts since they leak into each another.
> >
> >>> I do think it can be rescued with what I call gang scheduling of
> >>> engines: I.e. when a given engine is running a context (or a group of
> >>> engines, depending how your hw works) that can cause a page fault, you
> >>> must flush out all workloads running on the same engine which could
> >>> block a dma_fence (preempt them, or for non-compute stuff, force their
> >>> completion). And the other way round, i.e. before you can run a legacy
> >>> gl workload with a dma_fence on these engines you need to preempt all
> >>> ctxs that could cause page faults and take them at least out of the hw
> >>> scheduler queue.
> >> Yuck! But yeah, that would work. A less invasive alternative would be to
> >> reserve some compute units for graphics contexts so we can guarantee
> >> forward progress for graphics contexts even when all CUs working on
> >> compute stuff are stuck on page faults.
> > Won't this hurt compute workloads? I think we need something were at
> > least pure compute or pure gl/vk workloads run at full performance.
> > And without preempt we can't take anything back when we need it, so
> > would have to always upfront reserve some cores just in case.
> 
> Yes, it would hurt proportionally to how many CUs get reserved. On big
> GPUs with many CUs the impact could be quite small.

Also, we could do the reservation only for the time when there's actually
a legacy context with a normal dma_fence in the scheduler queue, assuming
that reserving/unreserving CUs isn't too expensive an operation. If it's
as expensive as a full stall, it's probably not worth the complexity here;
just go with a full stall and only run one or the other at a time.

Wrt desktops I'm also somewhat worried that we might end up killing
desktop workloads if there aren't enough CUs reserved for them and they
end up taking too long and angering either TDR or, worse, the user,
because the desktop is unusable when you start a compute job and get a
big pile of faults. Probably needs some testing to see how bad it is.
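
As a toy illustration of that gating (all names below are invented; this
is neither drm/scheduler nor amdgpu code), the reservation could simply be
refcounted on fence-based jobs: carve out the CUs when the first such job
is queued and hand them back when the last one's fence signals.

#include <linux/atomic.h>

static atomic_t legacy_jobs_in_flight = ATOMIC_INIT(0);

/* Hypothetical hw hooks: partition off a set of CUs that never run
 * page-faulting compute waves, and undo that partition again. */
static void hw_reserve_cus_for_legacy(void) { /* program CU masks */ }
static void hw_release_reserved_cus(void)   { /* restore full CU set */ }

/* Called when a job that will signal a normal dma_fence is queued. */
static void legacy_job_pushed(void)
{
	if (atomic_inc_return(&legacy_jobs_in_flight) == 1)
		hw_reserve_cus_for_legacy();
}

/* Called once that job's dma_fence has signalled. */
static void legacy_job_completed(void)
{
	if (atomic_dec_return(&legacy_jobs_in_flight) == 0)
		hw_release_reserved_cus();
}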

> That said, I'm not sure it'll work on our hardware. Our CUs can execute
> multiple wavefronts from different contexts and switch between them with
> fine granularity. I'd need to check with our HW engineers whether this
> CU-internal context switching is still possible during page faults on
> GFXv10.

You'd need to do the reservation for all contexts/engines which can cause
page faults, otherwise it'd leak.
> 
> 
> >
> >>> Just reserving an sdma engine for copy jobs and ptes updates and that
> >>> stuff is necessary, but not sufficient.
> >>>
> >>> Another approach that Jerome suggested is to track the reverse
> >>> dependency graph of all dma_fence somehow and make sure that direct
> >>> reclaim never recurses on an engine you're serving a pagefault for.
> >>> Possible in theory, but in practice I think not feasible to implement
> >>> because way too much work to implement.
> >> I agree.
> >>
> >>
> >>> Either way it's imo really nasty to come up with a scheme here that
> >>> doesn't fail in some corner, or becomes really nasty with inconsistent
> >>> rules across different drivers and hw :-(
> >> Yeah. The cleanest approach is to avoid DMA fences altogether for
> >> device/engines that can get stuck on page faults. A user mode command
> >> submission model would do that.
> >>
> >> Reserving some compute units for graphics contexts that signal fences
> >> but never page fault should also work.
> > The trouble is you don't just need engines, you need compute
> > resources/cores behind them too (assuming I'm understading correctly
> > how this works on amd hw). Otherwise you end up with a gl context that
> > should complete to resolve the deadlock, but can't because it can't
> > run it's shader because all the shader cores are stuck in compute page
> > faults somewhere.
> 
> That's why I suggested reserving some CUs that would never execute
> compute workloads that can page fault.
> 
> 
> >  Hence the gang scheduling would need to be at a
> > level were you can guarantee full isolation of hw resources, either
> > because you can preempt stuck compute kernels and let gl shaders run,
> > or because of hw core partitiion or something else. If you cant, you
> > need to gang schedule the entire gpu.
> 
> Yes.
> 
> 
> >
> > I think in practice that's not too ugly since for pure compute
> > workloads you're not going to have a desktop running most likely.
> 
> We still need legacy contexts for video decoding and post processing.
> But maybe we can find a fix for that too.

Hm, I'd expect video workloads not to use page faults (even if they use
compute for post processing), the same way that compute in vk/gl would
still use all the legacy fencing (which excludes page fault support).

So a blanket "compute always has to use page fault mode and user sync"
rule doesn't seem feasible to me. And then all the mixed-workload usage
should be fine too.

> >  And
> > for developer machines we should be able to push the occasional gfx
> > update through the gpu still without causing too much stutter on the
> > desktop or costing too much perf on the compute side. And pure gl/vk
> > or pure compute workloads should keep running at full performance.
> 
> I think it would be acceptable for mostly-compute workloads. It would be
> bad for desktop workloads with some compute, e.g. games with
> OpenCL-based physics. We're increasingly relying on KFD for all GPU
> computing (including OpenCL) in desktop applications. But those could
> live without GPU page faults until we can build sane hardware.

Uh ... I guess the challenge here is noticing when your opencl should run
in old-style mode. I guess you could link them together through some
backchannel, so when a gl or vk context is set up you run opencl in the
legacy mode without page faults, for full perf together with vk. Still
doesn't work if the app sets up ocl before vk/gl :-/
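
Purely to illustrate the backchannel idea (everything here is
hypothetical, not an existing runtime interface): a process-wide flag the
GL/VK driver sets and the compute runtime checks before enabling
page-fault mode.

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical process-wide flag shared between the GL/VK driver and the
 * compute runtime; a real implementation would need an actual backchannel. */
static atomic_bool legacy_gfx_context_present;

void on_gl_or_vk_context_created(void)
{
	atomic_store(&legacy_gfx_context_present, true);
}

bool compute_runtime_may_enable_page_faults(void)
{
	/* As noted above, this breaks down if OpenCL initializes first. */
	return !atomic_load(&legacy_gfx_context_present);
}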
-Daniel

> Regards,
>   Felix
> 
> 
> > -Daniel
> >
> >
> >
> >> Regards,
> >>   Felix
> >>
> >>
> >>> Cheers, Daniel
> >>>
> >>>> Regards,
> >>>>   Felix
> >>>>
> >>>>> Note that the dma_fence_wait is hard requirement, because we need that for
> >>>>> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> >>>>> management. Which is the current "ttm is self-limited to 50% of system
> >>>>> memory" limitation Christian is trying to lift. So that's really not
> >>>>> a restriction we can lift, at least not in upstream where we need to also
> >>>>> support old style hardware which doesn't have page fault support and
> >>>>> really has no other option to handle memory management than
> >>>>> dma_fence_wait.
> >>>>>
> >>>>> Thread was here:
> >>>>>
> >>>>> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> >>>>>
> >>>>> There's a few ways to resolve this (without having preempt-capable
> >>>>> hardware), but they're all supremely nasty.
> >>>>> -Daniel
> >>>>>
> >>>>>> Thanks,
> >>>>>>   Felix
> >>>>>>
> >>>>>>
> >>>>>>> I'll try to look at this more in-depth when I'm catching up on mails.
> >>>>>>> -Daniel
> >>>>>>>
> >>>>>>>> [...]
> >>>
> >
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
                   ` (35 preceding siblings ...)
  2021-01-07  9:23 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Daniel Vetter
@ 2021-01-13 16:47 ` Jerome Glisse
  2021-01-14  0:06   ` Felix Kuehling
  36 siblings, 1 reply; 84+ messages in thread
From: Jerome Glisse @ 2021-01-13 16:47 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: alex.sierra, philip.yang, dri-devel, amd-gfx

On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> This is the first version of our HMM based shared virtual memory manager
> for KFD. There are still a number of known issues that we're working through
> (see below). This will likely lead to some pretty significant changes in
> MMU notifier handling and locking on the migration code paths. So don't
> get hung up on those details yet.

[...]

> Known issues:
> * won't work with IOMMU enabled, we need to dma_map all pages properly
> * still working on some race conditions and random bugs
> * performance is not great yet

What would those changes look like? Seeing the issues below, I do not
see how they interplay with MMU notifiers. Can you elaborate?

Cheers,
Jérôme

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-08 14:40     ` Daniel Vetter
  2021-01-08 14:45       ` Christian König
  2021-01-08 15:58       ` Felix Kuehling
@ 2021-01-13 16:56       ` Jerome Glisse
  2021-01-13 20:31         ` Daniel Vetter
  2021-01-14 10:49         ` Christian König
  2 siblings, 2 replies; 84+ messages in thread
From: Jerome Glisse @ 2021-01-13 16:56 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: alex.sierra, philip.yang, Felix Kuehling, amd-gfx, dri-devel

On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > >> This is the first version of our HMM based shared virtual memory manager
> > >> for KFD. There are still a number of known issues that we're working through
> > >> (see below). This will likely lead to some pretty significant changes in
> > >> MMU notifier handling and locking on the migration code paths. So don't
> > >> get hung up on those details yet.
> > >>
> > >> But I think this is a good time to start getting feedback. We're pretty
> > >> confident about the ioctl API, which is both simple and extensible for the
> > >> future. (see patches 4,16) The user mode side of the API can be found here:
> > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > >>
> > >> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> > >> and some retry IRQ handling changes (32).
> > >>
> > >>
> > >> Known issues:
> > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > >> * still working on some race conditions and random bugs
> > >> * performance is not great yet
> > > Still catching up, but I think there's another one for your list:
> > >
> > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > >    discussion about this one with Christian before the holidays, and also
> > >    some private chats with Jerome. It's nasty since no easy fix, much less
> > >    a good idea what's the best approach here.
> > 
> > Do you have a pointer to that discussion or any more details?
> 
> Essentially if you're handling an hmm page fault from the gpu, you can
> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> you can't preempt while you have that page fault pending. Two solutions:
> 
> - your hw can (at least for compute ctx) preempt even when a page fault is
>   pending
> 
> - lots of screaming in trying to come up with an alternate solution. They
>   all suck.
> 
> Note that the dma_fence_wait is hard requirement, because we need that for
> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> management. Which is the current "ttm is self-limited to 50% of system
> memory" limitation Christian is trying to lift. So that's really not
> a restriction we can lift, at least not in upstream where we need to also
> support old style hardware which doesn't have page fault support and
> really has no other option to handle memory management than
> dma_fence_wait.
> 
> Thread was here:
> 
> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> 
> There's a few ways to resolve this (without having preempt-capable
> hardware), but they're all supremely nasty.
> -Daniel
> 

I had a new idea. I wanted to think more about it but have not yet;
anyway, here it is: add a new callback to dma_fence which asks the
question "can it deadlock?". Any time a GPU driver has a pending page
fault (i.e. something calling into the mm) it answers yes, otherwise no.
The GPU shrinker would ask the question before waiting on any dma-fence
and back off if it gets a yes. The shrinker can still try many dma-buf
objects for which it does not get a yes on the associated fence.
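
A rough sketch of that callback idea (the hook and helper below are
invented for illustration; nothing like this exists in the dma_fence API
today):

#include <linux/dma-fence.h>

/* Hypothetical per-fence hook: answers "could waiting on this fence from
 * reclaim deadlock right now?", e.g. because the driver has a GPU page
 * fault pending that must call back into the mm to make progress. */
struct dma_fence_deadlock_ops {
	bool (*may_deadlock)(struct dma_fence *fence);
};

/* Shrinker side: only wait when the driver positively says it is safe;
 * no hook, or a "yes", means back off and try another object instead. */
static bool shrinker_may_wait_on(struct dma_fence *fence,
				 const struct dma_fence_deadlock_ops *ops)
{
	if (!ops || !ops->may_deadlock)
		return false;

	return !ops->may_deadlock(fence);
}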

This does not solve the MMU notifier case. For that you would just
invalidate the GEM userptr object (with a flag, but not releasing the
page refcount) but would not wait for the GPU (i.e. no dma-fence wait
in that code path anymore). The userptr API never really made the
contract that it will always be in sync with the mm's view of the
world, so if a different page gets remapped to the same virtual address
while the GPU is still working with the old pages it should not be an
issue (it would not be in our usage of userptr for compositors and
whatnot).

Maybe I'm overlooking something there.

Cheers,
Jérôme

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-13 16:56       ` Jerome Glisse
@ 2021-01-13 20:31         ` Daniel Vetter
  2021-01-14  3:27           ` Jerome Glisse
  2021-01-14 10:49         ` Christian König
  1 sibling, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-13 20:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list, dri-devel

On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > > >> This is the first version of our HMM based shared virtual memory manager
> > > >> for KFD. There are still a number of known issues that we're working through
> > > >> (see below). This will likely lead to some pretty significant changes in
> > > >> MMU notifier handling and locking on the migration code paths. So don't
> > > >> get hung up on those details yet.
> > > >>
> > > >> But I think this is a good time to start getting feedback. We're pretty
> > > >> confident about the ioctl API, which is both simple and extensible for the
> > > >> future. (see patches 4,16) The user mode side of the API can be found here:
> > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > > >>
> > > >> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> > > >> and some retry IRQ handling changes (32).
> > > >>
> > > >>
> > > >> Known issues:
> > > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > > >> * still working on some race conditions and random bugs
> > > >> * performance is not great yet
> > > > Still catching up, but I think there's another one for your list:
> > > >
> > > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > > >    discussion about this one with Christian before the holidays, and also
> > > >    some private chats with Jerome. It's nasty since no easy fix, much less
> > > >    a good idea what's the best approach here.
> > >
> > > Do you have a pointer to that discussion or any more details?
> >
> > Essentially if you're handling an hmm page fault from the gpu, you can
> > deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> > submissions or compute contexts with dma_fence_wait. Which deadlocks if
> > you can't preempt while you have that page fault pending. Two solutions:
> >
> > - your hw can (at least for compute ctx) preempt even when a page fault is
> >   pending
> >
> > - lots of screaming in trying to come up with an alternate solution. They
> >   all suck.
> >
> > Note that the dma_fence_wait is hard requirement, because we need that for
> > mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> > management. Which is the current "ttm is self-limited to 50% of system
> > memory" limitation Christian is trying to lift. So that's really not
> > a restriction we can lift, at least not in upstream where we need to also
> > support old style hardware which doesn't have page fault support and
> > really has no other option to handle memory management than
> > dma_fence_wait.
> >
> > Thread was here:
> >
> > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> >
> > There's a few ways to resolve this (without having preempt-capable
> > hardware), but they're all supremely nasty.
> > -Daniel
> >
>
> I had a new idea, i wanted to think more about it but have not yet,
> anyway here it is. Adding a new callback to dma fence which ask the
> question can it dead lock ? Any time a GPU driver has pending page
> fault (ie something calling into the mm) it answer yes, otherwise
> no. The GPU shrinker would ask the question before waiting on any
> dma-fence and back of if it gets yes. Shrinker can still try many
> dma buf object for which it does not get a yes on associated fence.

Having that answer on a given fence isn't enough, you still need to
forward that information through the entire dependency graph, across
drivers. That's the hard part, since that dependency graph is very
implicit in the code, and we'd need to first roll it out across all
drivers.

> This does not solve the mmu notifier case, for this you would just
> invalidate the gem userptr object (with a flag but not releasing the
> page refcount) but you would not wait for the GPU (ie no dma fence
> wait in that code path anymore). The userptr API never really made
> the contract that it will always be in sync with the mm view of the
> world so if different page get remapped to same virtual address
> while GPU is still working with the old pages it should not be an
> issue (it would not be in our usage of userptr for compositor and
> what not).
>
> Maybe i overlook something there.

tbh I'm never really clear on how much exactly we need, and whether
maybe the new pin/unpin api should fix it all.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-13 16:47 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Jerome Glisse
@ 2021-01-14  0:06   ` Felix Kuehling
  0 siblings, 0 replies; 84+ messages in thread
From: Felix Kuehling @ 2021-01-14  0:06 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: alex.sierra, philip.yang, dri-devel, amd-gfx


Am 2021-01-13 um 11:47 a.m. schrieb Jerome Glisse:
> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>> This is the first version of our HMM based shared virtual memory manager
>> for KFD. There are still a number of known issues that we're working through
>> (see below). This will likely lead to some pretty significant changes in
>> MMU notifier handling and locking on the migration code paths. So don't
>> get hung up on those details yet.
> [...]
>
>> Known issues:
>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>> * still working on some race conditions and random bugs
>> * performance is not great yet
> What would those changes looks like ? Seeing the issue below i do not
> see how they inter-play with mmu notifier. Can you elaborate.

We currently have some race conditions when multiple threads are causing
migrations concurrently (e.g. CPU page faults, GPU page faults, memory
evictions, and explicit prefetch by the application).

In the current patch series we set up one MMU range notifier for the
entire address space because we had trouble setting up MMU notifiers for
specific address ranges. There are situations where we want to free or
free/resize/reallocate MMU range notifiers, but we can't due to the
locking context we're in:

  * MMU release notifier when a virtual address range is unmapped
  * CPU page fault handler

In both these situations we may need to split virtual address ranges
because we only want to free or migrate part of them. If we have
per-address range notifiers we also need to free or create notifiers,
which is not possible in those contexts. On the other hand, using a
single range notifier for everything causes unnecessary serialization.

We're reworking all of this to have per-address range notifiers that are
updated with a deferred mechanism in workers. I finally figured out how
to do that in a clean way, hopefully without races or deadlocks, which
should also address the other race conditions we had with concurrent
migration triggers. Philip is working on the implementation.
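
To make the deferred mechanism a bit more concrete, here is a very rough
sketch on top of the upstream mmu_interval_notifier API; the svm_range
structure, list and worker below are invented for illustration and are
not the actual implementation Philip is working on:

#include <linux/mmu_notifier.h>
#include <linux/workqueue.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct svm_range {
	struct mmu_interval_notifier notifier;
	struct list_head deferred_list;	/* INIT_LIST_HEAD() at creation */
	/* start/last, DMA mappings, VRAM allocation, ... */
};

static LIST_HEAD(deferred_ranges);
static DEFINE_SPINLOCK(deferred_lock);
static void deferred_range_work(struct work_struct *work);
static DECLARE_WORK(deferred_work, deferred_range_work);

/* Notifier callback: may run in a non-blocking context, so only bump the
 * sequence, unmap from the GPU and defer any notifier surgery. */
static bool svm_range_invalidate(struct mmu_interval_notifier *mni,
				 const struct mmu_notifier_range *range,
				 unsigned long cur_seq)
{
	struct svm_range *prange = container_of(mni, struct svm_range, notifier);

	mmu_interval_set_seq(mni, cur_seq);
	/* invalidate GPU page table entries for [range->start, range->end) */

	spin_lock(&deferred_lock);
	if (list_empty(&prange->deferred_list))
		list_add_tail(&prange->deferred_list, &deferred_ranges);
	spin_unlock(&deferred_lock);
	schedule_work(&deferred_work);
	return true;
}

/* Passed to mmu_interval_notifier_insert() when a range is (re)created. */
static const struct mmu_interval_notifier_ops svm_range_mn_ops = {
	.invalidate = svm_range_invalidate,
};

/* Worker: sleepable context where it is safe to remove, split and
 * re-insert per-range notifiers. */
static void deferred_range_work(struct work_struct *work)
{
	struct svm_range *prange, *tmp;
	LIST_HEAD(ranges);

	spin_lock(&deferred_lock);
	list_splice_init(&deferred_ranges, &ranges);
	spin_unlock(&deferred_lock);

	list_for_each_entry_safe(prange, tmp, &ranges, deferred_list) {
		list_del_init(&prange->deferred_list);
		mmu_interval_notifier_remove(&prange->notifier);
		/* split/resize the range, then mmu_interval_notifier_insert()
		 * fresh notifiers for the surviving pieces */
	}
}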

Regards,
  Felix

>
> Cheers,
> Jérôme
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-13 20:31         ` Daniel Vetter
@ 2021-01-14  3:27           ` Jerome Glisse
  2021-01-14  9:26             ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Jerome Glisse @ 2021-01-14  3:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list, dri-devel

On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
> On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> > > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > > > >> This is the first version of our HMM based shared virtual memory manager
> > > > >> for KFD. There are still a number of known issues that we're working through
> > > > >> (see below). This will likely lead to some pretty significant changes in
> > > > >> MMU notifier handling and locking on the migration code paths. So don't
> > > > >> get hung up on those details yet.
> > > > >>
> > > > >> But I think this is a good time to start getting feedback. We're pretty
> > > > >> confident about the ioctl API, which is both simple and extensible for the
> > > > >> future. (see patches 4,16) The user mode side of the API can be found here:
> > > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > > > >>
> > > > >> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> > > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> > > > >> and some retry IRQ handling changes (32).
> > > > >>
> > > > >>
> > > > >> Known issues:
> > > > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > > > >> * still working on some race conditions and random bugs
> > > > >> * performance is not great yet
> > > > > Still catching up, but I think there's another one for your list:
> > > > >
> > > > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > > > >    discussion about this one with Christian before the holidays, and also
> > > > >    some private chats with Jerome. It's nasty since no easy fix, much less
> > > > >    a good idea what's the best approach here.
> > > >
> > > > Do you have a pointer to that discussion or any more details?
> > >
> > > Essentially if you're handling an hmm page fault from the gpu, you can
> > > deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> > > submissions or compute contexts with dma_fence_wait. Which deadlocks if
> > > you can't preempt while you have that page fault pending. Two solutions:
> > >
> > > - your hw can (at least for compute ctx) preempt even when a page fault is
> > >   pending
> > >
> > > - lots of screaming in trying to come up with an alternate solution. They
> > >   all suck.
> > >
> > > Note that the dma_fence_wait is hard requirement, because we need that for
> > > mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> > > management. Which is the current "ttm is self-limited to 50% of system
> > > memory" limitation Christian is trying to lift. So that's really not
> > > a restriction we can lift, at least not in upstream where we need to also
> > > support old style hardware which doesn't have page fault support and
> > > really has no other option to handle memory management than
> > > dma_fence_wait.
> > >
> > > Thread was here:
> > >
> > > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> > >
> > > There's a few ways to resolve this (without having preempt-capable
> > > hardware), but they're all supremely nasty.
> > > -Daniel
> > >
> >
> > I had a new idea, i wanted to think more about it but have not yet,
> > anyway here it is. Adding a new callback to dma fence which ask the
> > question can it dead lock ? Any time a GPU driver has pending page
> > fault (ie something calling into the mm) it answer yes, otherwise
> > no. The GPU shrinker would ask the question before waiting on any
> > dma-fence and back of if it gets yes. Shrinker can still try many
> > dma buf object for which it does not get a yes on associated fence.
> 
> Having that answer on a given fence isn't enough, you still need to
> forward that information through the entire dependency graph, across
> drivers. That's the hard part, since that dependency graph is very
> implicit in the code, and we'd need to first roll it out across all
> drivers.

Here I am saying: do not wait on fences you are not sure about. Only
wait on fences for which you are 100% certain you cannot deadlock. So if
you can never be sure about a dma-fence, then never wait on dma-fences
in the shrinker. However, most drivers should have enough information in
their shrinker to know whether it is safe to wait on fences internal to
their device driver (and also to know whether any of those fences have
implicit outside dependencies). So a first implementation would be to
say "always deadlocks" and then have each driver build confidence in
what it can ascertain.

> 
> > This does not solve the mmu notifier case, for this you would just
> > invalidate the gem userptr object (with a flag but not releasing the
> > page refcount) but you would not wait for the GPU (ie no dma fence
> > wait in that code path anymore). The userptr API never really made
> > the contract that it will always be in sync with the mm view of the
> > world so if different page get remapped to same virtual address
> > while GPU is still working with the old pages it should not be an
> > issue (it would not be in our usage of userptr for compositor and
> > what not).
> >
> > Maybe i overlook something there.
> 
> tbh I'm never really clear on how much exactly we need, and whether
> maybe the new pin/unpin api should fix it all.

pin/unpin is not a solution; it is to fix something with GUP (where
we need to know if a page is GUPed or not). GUP should die long term,
so anything using GUP (pin/unpin falls into that) should die long term.
Pinning memory is bad, period (it just breaks too much of the mm and it
is unsolvable for things like mremap, splice, ...).

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-11 16:29                 ` Daniel Vetter
@ 2021-01-14  5:34                   ` Felix Kuehling
  2021-01-14 12:19                     ` Christian König
  0 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-14  5:34 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list

On 2021-01-11 at 11:29 a.m., Daniel Vetter wrote:
> On Fri, Jan 08, 2021 at 12:56:24PM -0500, Felix Kuehling wrote:
>> Am 2021-01-08 um 11:53 a.m. schrieb Daniel Vetter:
>>> On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>> Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
>>>>> On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>>>> Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
>>>>>>> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
>>>>>>>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
>>>>>>>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>>>>>>>>>> This is the first version of our HMM based shared virtual memory manager
>>>>>>>>>> for KFD. There are still a number of known issues that we're working through
>>>>>>>>>> (see below). This will likely lead to some pretty significant changes in
>>>>>>>>>> MMU notifier handling and locking on the migration code paths. So don't
>>>>>>>>>> get hung up on those details yet.
>>>>>>>>>>
>>>>>>>>>> But I think this is a good time to start getting feedback. We're pretty
>>>>>>>>>> confident about the ioctl API, which is both simple and extensible for the
>>>>>>>>>> future. (see patches 4,16) The user mode side of the API can be found here:
>>>>>>>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>>>>>>>>>
>>>>>>>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>>>>>>>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>>>>>>>>>> and some retry IRQ handling changes (32).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Known issues:
>>>>>>>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>>>>>>>>>> * still working on some race conditions and random bugs
>>>>>>>>>> * performance is not great yet
>>>>>>>>> Still catching up, but I think there's another one for your list:
>>>>>>>>>
>>>>>>>>>  * hmm gpu context preempt vs page fault handling. I've had a short
>>>>>>>>>    discussion about this one with Christian before the holidays, and also
>>>>>>>>>    some private chats with Jerome. It's nasty since no easy fix, much less
>>>>>>>>>    a good idea what's the best approach here.
>>>>>>>> Do you have a pointer to that discussion or any more details?
>>>>>>> Essentially if you're handling an hmm page fault from the gpu, you can
>>>>>>> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
>>>>>>> submissions or compute contexts with dma_fence_wait. Which deadlocks if
>>>>>>> you can't preempt while you have that page fault pending. Two solutions:
>>>>>>>
>>>>>>> - your hw can (at least for compute ctx) preempt even when a page fault is
>>>>>>>   pending
>>>>>> Our GFXv9 GPUs can do this. GFXv10 cannot.
>>>>> Uh, why did your hw guys drop this :-/
>> Performance. It's the same reason why the XNACK mode selection API
>> exists (patch 16). When we enable recoverable page fault handling in the
>> compute units on GFXv9, it costs some performance even when no page
>> faults are happening. On GFXv10 that retry fault handling moved out of
>> the compute units, so they don't take the performance hit. But that
>> sacrificed the ability to preempt during page faults. We'll need to work
>> with our hardware teams to restore that capability in a future generation.
> Ah yes, you need to stall in more points in the compute cores to make sure
> you can recover if the page fault gets interrupted.
>
> Maybe my knowledge is outdated, but my understanding is that nvidia can
> also preempt (but only for compute jobs, since oh dear the pain this would
> be for all the fixed function stuff). Since gfx10 moved page fault
> handling further away from compute cores, do you know whether this now
> means you can do page faults for (some?) fixed function stuff too? Or
> still only for compute?

I'm not sure.


>
> Supporting page fault for 3d would be real pain with the corner we're
> stuck in right now, but better we know about this early than later :-/

I know Christian hates the idea. We know that page faults on GPUs can be
a huge performance drain because you're stalling potentially so many
threads and the CPU can become a bottleneck dealing with all the page
faults from many GPU threads. On the compute side, applications will be
optimized to avoid them as much as possible, e.g. by pre-faulting or
pre-fetching data before it's needed.

But I think you need page faults to make overcommitted memory with user
mode command submission not suck.


>
>>
>>>>> I do think it can be rescued with what I call gang scheduling of
>>>>> engines: I.e. when a given engine is running a context (or a group of
>>>>> engines, depending how your hw works) that can cause a page fault, you
>>>>> must flush out all workloads running on the same engine which could
>>>>> block a dma_fence (preempt them, or for non-compute stuff, force their
>>>>> completion). And the other way round, i.e. before you can run a legacy
>>>>> gl workload with a dma_fence on these engines you need to preempt all
>>>>> ctxs that could cause page faults and take them at least out of the hw
>>>>> scheduler queue.
>>>> Yuck! But yeah, that would work. A less invasive alternative would be to
>>>> reserve some compute units for graphics contexts so we can guarantee
>>>> forward progress for graphics contexts even when all CUs working on
>>>> compute stuff are stuck on page faults.
>>> Won't this hurt compute workloads? I think we need something were at
>>> least pure compute or pure gl/vk workloads run at full performance.
>>> And without preempt we can't take anything back when we need it, so
>>> would have to always upfront reserve some cores just in case.
>> Yes, it would hurt proportionally to how many CUs get reserved. On big
>> GPUs with many CUs the impact could be quite small.
> Also, we could do the reservation only for the time when there's actually
> a legacy context with normal dma_fence in the scheduler queue. Assuming
> that reserving/unreserving of CUs isn't too expensive operation. If it's
> as expensive as a full stall probably not worth the complexity here and
> just go with a full stall and only run one or the other at a time.
>
> Wrt desktops I'm also somewhat worried that we might end up killing
> desktop workloads if there's not enough CUs reserved for these and they
> end up taking too long and anger either tdr or worse the user because the
> desktop is unuseable when you start a compute job and get a big pile of
> faults. Probably needs some testing to see how bad it is.
>
>> That said, I'm not sure it'll work on our hardware. Our CUs can execute
>> multiple wavefronts from different contexts and switch between them with
>> fine granularity. I'd need to check with our HW engineers whether this
>> CU-internal context switching is still possible during page faults on
>> GFXv10.
> You'd need to do the reservation for all contexts/engines which can cause
> page faults, otherewise it'd leak.

All engines that can page fault and cannot be preempted during faults.

Regards,
  Felix


>>
>>>>> Just reserving an sdma engine for copy jobs and ptes updates and that
>>>>> stuff is necessary, but not sufficient.
>>>>>
>>>>> Another approach that Jerome suggested is to track the reverse
>>>>> dependency graph of all dma_fence somehow and make sure that direct
>>>>> reclaim never recurses on an engine you're serving a pagefault for.
>>>>> Possible in theory, but in practice I think not feasible to implement
>>>>> because way too much work to implement.
>>>> I agree.
>>>>
>>>>
>>>>> Either way it's imo really nasty to come up with a scheme here that
>>>>> doesn't fail in some corner, or becomes really nasty with inconsistent
>>>>> rules across different drivers and hw :-(
>>>> Yeah. The cleanest approach is to avoid DMA fences altogether for
>>>> device/engines that can get stuck on page faults. A user mode command
>>>> submission model would do that.
>>>>
>>>> Reserving some compute units for graphics contexts that signal fences
>>>> but never page fault should also work.
>>> The trouble is you don't just need engines, you need compute
>>> resources/cores behind them too (assuming I'm understading correctly
>>> how this works on amd hw). Otherwise you end up with a gl context that
>>> should complete to resolve the deadlock, but can't because it can't
>>> run it's shader because all the shader cores are stuck in compute page
>>> faults somewhere.
>> That's why I suggested reserving some CUs that would never execute
>> compute workloads that can page fault.
>>
>>
>>>  Hence the gang scheduling would need to be at a
>>> level were you can guarantee full isolation of hw resources, either
>>> because you can preempt stuck compute kernels and let gl shaders run,
>>> or because of hw core partitiion or something else. If you cant, you
>>> need to gang schedule the entire gpu.
>> Yes.
>>
>>
>>> I think in practice that's not too ugly since for pure compute
>>> workloads you're not going to have a desktop running most likely.
>> We still need legacy contexts for video decoding and post processing.
>> But maybe we can find a fix for that too.
> Hm I'd expect video workloads to not use page faults (even if they use
> compute for post processing). Same way that compute in vk/gl would still
> use all the legacy fencing (which excludes page fault support).
>
> So pure "compute always has to use page fault mode and user sync" I don't
> think is feasible. And then all the mixed workloads useage should be fine
> too.
>
>>>  And
>>> for developer machines we should be able to push the occasional gfx
>>> update through the gpu still without causing too much stutter on the
>>> desktop or costing too much perf on the compute side. And pure gl/vk
>>> or pure compute workloads should keep running at full performance.
>> I think it would be acceptable for mostly-compute workloads. It would be
>> bad for desktop workloads with some compute, e.g. games with
>> OpenCL-based physics. We're increasingly relying on KFD for all GPU
>> computing (including OpenCL) in desktop applications. But those could
>> live without GPU page faults until we can build sane hardware.
> Uh ... I guess the challenge here is noticing when your opencl should be
> run in old style mode. I guess you could link them together through some
> backchannel, so when a gl or vk context is set up you run opencl in the
> legacy mode without pagefault for full perf together with vk. Still
> doesn't work if the app sets up ocl before vk/gl :-/
> -Daniel
>
>> Regards,
>>   Felix
>>
>>
>>> -Daniel
>>>
>>>
>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>>> Cheers, Daniel
>>>>>
>>>>>> Regards,
>>>>>>   Felix
>>>>>>
>>>>>>> Note that the dma_fence_wait is hard requirement, because we need that for
>>>>>>> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
>>>>>>> management. Which is the current "ttm is self-limited to 50% of system
>>>>>>> memory" limitation Christian is trying to lift. So that's really not
>>>>>>> a restriction we can lift, at least not in upstream where we need to also
>>>>>>> support old style hardware which doesn't have page fault support and
>>>>>>> really has no other option to handle memory management than
>>>>>>> dma_fence_wait.
>>>>>>>
>>>>>>> Thread was here:
>>>>>>>
>>>>>>> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
>>>>>>>
>>>>>>> There's a few ways to resolve this (without having preempt-capable
>>>>>>> hardware), but they're all supremely nasty.
>>>>>>> -Daniel
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>   Felix
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'll try to look at this more in-depth when I'm catching up on mails.
>>>>>>>>> -Daniel
>>>>>>>>>
>>>>>>>>>> Alex Sierra (12):
>>>>>>>>>>   drm/amdgpu: replace per_device_list by array
>>>>>>>>>>   drm/amdkfd: helper to convert gpu id and idx
>>>>>>>>>>   drm/amdkfd: add xnack enabled flag to kfd_process
>>>>>>>>>>   drm/amdkfd: add ioctl to configure and query xnack retries
>>>>>>>>>>   drm/amdkfd: invalidate tables on page retry fault
>>>>>>>>>>   drm/amdkfd: page table restore through svm API
>>>>>>>>>>   drm/amdkfd: SVM API call to restore page tables
>>>>>>>>>>   drm/amdkfd: add svm_bo reference for eviction fence
>>>>>>>>>>   drm/amdgpu: add param bit flag to create SVM BOs
>>>>>>>>>>   drm/amdkfd: add svm_bo eviction mechanism support
>>>>>>>>>>   drm/amdgpu: svm bo enable_signal call condition
>>>>>>>>>>   drm/amdgpu: add svm_bo eviction to enable_signal cb
>>>>>>>>>>
>>>>>>>>>> Philip Yang (23):
>>>>>>>>>>   drm/amdkfd: select kernel DEVICE_PRIVATE option
>>>>>>>>>>   drm/amdkfd: add svm ioctl API
>>>>>>>>>>   drm/amdkfd: Add SVM API support capability bits
>>>>>>>>>>   drm/amdkfd: register svm range
>>>>>>>>>>   drm/amdkfd: add svm ioctl GET_ATTR op
>>>>>>>>>>   drm/amdgpu: add common HMM get pages function
>>>>>>>>>>   drm/amdkfd: validate svm range system memory
>>>>>>>>>>   drm/amdkfd: register overlap system memory range
>>>>>>>>>>   drm/amdkfd: deregister svm range
>>>>>>>>>>   drm/amdgpu: export vm update mapping interface
>>>>>>>>>>   drm/amdkfd: map svm range to GPUs
>>>>>>>>>>   drm/amdkfd: svm range eviction and restore
>>>>>>>>>>   drm/amdkfd: register HMM device private zone
>>>>>>>>>>   drm/amdkfd: validate vram svm range from TTM
>>>>>>>>>>   drm/amdkfd: support xgmi same hive mapping
>>>>>>>>>>   drm/amdkfd: copy memory through gart table
>>>>>>>>>>   drm/amdkfd: HMM migrate ram to vram
>>>>>>>>>>   drm/amdkfd: HMM migrate vram to ram
>>>>>>>>>>   drm/amdgpu: reserve fence slot to update page table
>>>>>>>>>>   drm/amdgpu: enable retry fault wptr overflow
>>>>>>>>>>   drm/amdkfd: refine migration policy with xnack on
>>>>>>>>>>   drm/amdkfd: add svm range validate timestamp
>>>>>>>>>>   drm/amdkfd: multiple gpu migrate vram to vram
>>>>>>>>>>
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
>>>>>>>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
>>>>>>>>>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
>>>>>>>>>>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
>>>>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
>>>>>>>>>>  include/uapi/linux/kfd_ioctl.h                |  169 +-
>>>>>>>>>>  26 files changed, 4296 insertions(+), 291 deletions(-)
>>>>>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>>>>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>>>>>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> 2.29.2
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> dri-devel mailing list
>>>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-14  3:27           ` Jerome Glisse
@ 2021-01-14  9:26             ` Daniel Vetter
  2021-01-14 10:39               ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14  9:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list, dri-devel

On Thu, Jan 14, 2021 at 4:27 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
> > On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> > > > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > > > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > > > > >> This is the first version of our HMM based shared virtual memory manager
> > > > > >> for KFD. There are still a number of known issues that we're working through
> > > > > >> (see below). This will likely lead to some pretty significant changes in
> > > > > >> MMU notifier handling and locking on the migration code paths. So don't
> > > > > >> get hung up on those details yet.
> > > > > >>
> > > > > >> But I think this is a good time to start getting feedback. We're pretty
> > > > > >> confident about the ioctl API, which is both simple and extensible for the
> > > > > >> future. (see patches 4,16) The user mode side of the API can be found here:
> > > > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > > > > >>
> > > > > >> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> > > > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> > > > > >> and some retry IRQ handling changes (32).
> > > > > >>
> > > > > >>
> > > > > >> Known issues:
> > > > > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > > > > >> * still working on some race conditions and random bugs
> > > > > >> * performance is not great yet
> > > > > > Still catching up, but I think there's another one for your list:
> > > > > >
> > > > > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > > > > >    discussion about this one with Christian before the holidays, and also
> > > > > >    some private chats with Jerome. It's nasty since no easy fix, much less
> > > > > >    a good idea what's the best approach here.
> > > > >
> > > > > Do you have a pointer to that discussion or any more details?
> > > >
> > > > Essentially if you're handling an hmm page fault from the gpu, you can
> > > > deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> > > > submissions or compute contexts with dma_fence_wait. Which deadlocks if
> > > > you can't preempt while you have that page fault pending. Two solutions:
> > > >
> > > > - your hw can (at least for compute ctx) preempt even when a page fault is
> > > >   pending
> > > >
> > > > - lots of screaming in trying to come up with an alternate solution. They
> > > >   all suck.
> > > >
> > > > Note that the dma_fence_wait is hard requirement, because we need that for
> > > > mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> > > > management. Which is the current "ttm is self-limited to 50% of system
> > > > memory" limitation Christian is trying to lift. So that's really not
> > > > a restriction we can lift, at least not in upstream where we need to also
> > > > support old style hardware which doesn't have page fault support and
> > > > really has no other option to handle memory management than
> > > > dma_fence_wait.
> > > >
> > > > Thread was here:
> > > >
> > > > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> > > >
> > > > There's a few ways to resolve this (without having preempt-capable
> > > > hardware), but they're all supremely nasty.
> > > > -Daniel
> > > >
> > >
> > > I had a new idea, i wanted to think more about it but have not yet,
> > > anyway here it is. Adding a new callback to dma fence which ask the
> > > question can it dead lock ? Any time a GPU driver has pending page
> > > fault (ie something calling into the mm) it answer yes, otherwise
> > > no. The GPU shrinker would ask the question before waiting on any
> > > dma-fence and back of if it gets yes. Shrinker can still try many
> > > dma buf object for which it does not get a yes on associated fence.
> >
> > Having that answer on a given fence isn't enough, you still need to
> > forward that information through the entire dependency graph, across
> > drivers. That's the hard part, since that dependency graph is very
> > implicit in the code, and we'd need to first roll it out across all
> > drivers.
>
> Here i am saying do not wait on fence for which you are not sure.
> Only wait on fence for which you are 100% certain you can not dead
> lock. So if you can never be sure on dma fence then never wait on
> dma-fence in the shrinker. However most driver should have enough
> information in their shrinker to know if it is safe to wait on
> fence internal to their device driver (and also know if any of
> those fence has implicit outside dependency). So first implementation
> would be to say always deadlock and then having each driver build
> confidence into what it can ascertain.

I just don't think that actually works in practice:

- on a single gpu you can't wait for vk/gl due to shared CUs, so only
sdma and uvd are left (or whatever else pure fixed function)

- for multi-gpu you get the guessing game of what leaks across gpus
and what doesn't. With p2p dma-buf we're now leaking dma_fence across
gpus even when there's no implicit syncing by userspace (although for
amdgpu this is tricky since iirc it still lacks the flag to let
userspace decide this, so this is more for other drivers).

- you don't just need to guarantee that there's no dma_fence
dependency going back to you, you also need to make sure there's no
other dependency chain through locks or whatever that closes the loop.
And since your proposal here is against the dma_fence lockdep
annotations we have now, lockdep won't help you (and let's be honest,
review doesn't catch this stuff either, so it's up to hangs in
production to catch this stuff)

- you still need the full dependency graph within the driver, and only
i915 scheduler has that afaik. And I'm not sure implementing that was
a bright idea

- assuming it's a deadlock by default means all gl/vk memory is
pinned. That's not nice, plus in addition you need hacks like ttm's
"max 50% of system memory" to paper over the worst fallout, which
Christian is trying to lift. I really do think we need to be able to
move towards more dynamic memory management, not less.

So in the end you're essentially disabling shrinking/eviction of other
gpu tasks, and I don't think that works. I really think the only two
realistic options are
- guarantee forward progress of other dma_fence (hw preemption,
reserved CUs, or whatever else you have)
- guarantee there's not a single offending dma_fence active in the
system that could cause problems

Hand-waving that in theory we could track the dependencies and that in
theory we could do some deadlock avoidance of some sorts about that
just doesn't look like a pragmatic&practical solution to me here. It
feels about as realistic as just creating a completely new memory
management model that sidesteps the entire dma_fence issues we have
due to mixing up kernel memory management and userspace sync fences in
one thing.

Cheers, Daniel

> > > This does not solve the mmu notifier case, for this you would just
> > > invalidate the gem userptr object (with a flag but not releasing the
> > > page refcount) but you would not wait for the GPU (ie no dma fence
> > > wait in that code path anymore). The userptr API never really made
> > > the contract that it will always be in sync with the mm view of the
> > > world so if different page get remapped to same virtual address
> > > while GPU is still working with the old pages it should not be an
> > > issue (it would not be in our usage of userptr for compositor and
> > > what not).
> > >
> > > Maybe i overlook something there.
> >
> > tbh I'm never really clear on how much exactly we need, and whether
> > maybe the new pin/unpin api should fix it all.
>
> pin/unpin is not a solution it is to fix something with GUP (where
> we need to know if a page is GUPed or not). GUP should die longterm
> so anything using GUP (pin/unpin falls into that) should die longterm.
> Pining memory is bad period (it just breaks too much mm and it is
> unsolvable for things like mremap, splice, ...).
>
> Cheers,
> Jérôme
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-14  9:26             ` Daniel Vetter
@ 2021-01-14 10:39               ` Daniel Vetter
  0 siblings, 0 replies; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14 10:39 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list, dri-devel

On Thu, Jan 14, 2021 at 10:26 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Jan 14, 2021 at 4:27 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
> > > On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> > > > > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > > > > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > > > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > > > > > >> This is the first version of our HMM based shared virtual memory manager
> > > > > > >> for KFD. There are still a number of known issues that we're working through
> > > > > > >> (see below). This will likely lead to some pretty significant changes in
> > > > > > >> MMU notifier handling and locking on the migration code paths. So don't
> > > > > > >> get hung up on those details yet.
> > > > > > >>
> > > > > > >> But I think this is a good time to start getting feedback. We're pretty
> > > > > > >> confident about the ioctl API, which is both simple and extensible for the
> > > > > > >> future. (see patches 4,16) The user mode side of the API can be found here:
> > > > > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > > > > > >>
> > > > > > >> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> > > > > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> > > > > > >> and some retry IRQ handling changes (32).
> > > > > > >>
> > > > > > >>
> > > > > > >> Known issues:
> > > > > > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > > > > > >> * still working on some race conditions and random bugs
> > > > > > >> * performance is not great yet
> > > > > > > Still catching up, but I think there's another one for your list:
> > > > > > >
> > > > > > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > > > > > >    discussion about this one with Christian before the holidays, and also
> > > > > > >    some private chats with Jerome. It's nasty since no easy fix, much less
> > > > > > >    a good idea what's the best approach here.
> > > > > >
> > > > > > Do you have a pointer to that discussion or any more details?
> > > > >
> > > > > Essentially if you're handling an hmm page fault from the gpu, you can
> > > > > deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> > > > > submissions or compute contexts with dma_fence_wait. Which deadlocks if
> > > > > you can't preempt while you have that page fault pending. Two solutions:
> > > > >
> > > > > - your hw can (at least for compute ctx) preempt even when a page fault is
> > > > >   pending
> > > > >
> > > > > - lots of screaming in trying to come up with an alternate solution. They
> > > > >   all suck.
> > > > >
> > > > > Note that the dma_fence_wait is hard requirement, because we need that for
> > > > > mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> > > > > management. Which is the current "ttm is self-limited to 50% of system
> > > > > memory" limitation Christian is trying to lift. So that's really not
> > > > > a restriction we can lift, at least not in upstream where we need to also
> > > > > support old style hardware which doesn't have page fault support and
> > > > > really has no other option to handle memory management than
> > > > > dma_fence_wait.
> > > > >
> > > > > Thread was here:
> > > > >
> > > > > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> > > > >
> > > > > There's a few ways to resolve this (without having preempt-capable
> > > > > hardware), but they're all supremely nasty.
> > > > > -Daniel
> > > > >
> > > >
> > > > I had a new idea, i wanted to think more about it but have not yet,
> > > > anyway here it is. Adding a new callback to dma fence which ask the
> > > > question can it dead lock ? Any time a GPU driver has pending page
> > > > fault (ie something calling into the mm) it answer yes, otherwise
> > > > no. The GPU shrinker would ask the question before waiting on any
> > > > dma-fence and back of if it gets yes. Shrinker can still try many
> > > > dma buf object for which it does not get a yes on associated fence.
> > >
> > > Having that answer on a given fence isn't enough, you still need to
> > > forward that information through the entire dependency graph, across
> > > drivers. That's the hard part, since that dependency graph is very
> > > implicit in the code, and we'd need to first roll it out across all
> > > drivers.
> >
> > Here i am saying do not wait on fence for which you are not sure.
> > Only wait on fence for which you are 100% certain you can not dead
> > lock. So if you can never be sure on dma fence then never wait on
> > dma-fence in the shrinker. However most driver should have enough
> > information in their shrinker to know if it is safe to wait on
> > fence internal to their device driver (and also know if any of
> > those fence has implicit outside dependency). So first implementation
> > would be to say always deadlock and then having each driver build
> > confidence into what it can ascertain.
>
> I just don't think that actually works in practice:
>
> - on a single gpu you can't wait for vk/gl due to shared CUs, so only
> sdma and uvd are left (or whatever else pure fixed function)
>
> - for multi-gpu you get the guessing game of what leaks across gpus
> and what doesn't. With p2p dma-buf we're now leaking dma_fence across
> gpus even when there's no implicit syncing by userspace (although for
> amdgpu this is tricky since iirc it still lacks the flag to let
> userspace decide this, so this is more for other drivers).
>
> - you don't just need to guarantee that there's no dma_fence
> dependency going back to you, you also need to make sure there's no
> other depedency chain through locks or whatever that closes the loop.
> And since your proposal here is against the dma_fence lockdep
> annotations we have now, lockdep won't help you (and let's be honest,
> review doesn't catch this stuff either, so it's up to hangs in
> production to catch this stuff)
>
> - you still need the full dependency graph within the driver, and only
> i915 scheduler has that afaik. And I'm not sure implementing that was
> a bright idea
>
> - assuming it's a deadlock by default means all gl/vk memory is
> pinned. That's not nice, plus in additional you need hacks like ttm's
> "max 50% of system memory" to paper over the worst fallout, which
> Christian is trying to lift. I really do think we need to be able to
> move towards more dynamic memory management, not less.

Forgot one issue:

- somehow you need to transport the knowledge that you're in the gpu
fault repair path of a specific engine down to shrinkers/mmu notifiers
and all that. And it needs to be fairly specific, otherwise it just
amounts again to "no more dma_fence_wait allowed".
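
Purely as a thought experiment (nothing like this exists, every name
below is invented), "transporting that knowledge" could look like a
scoped per-task marker that the fault-repair path sets and that
shrinkers/notifiers consult before waiting on work of that engine:

/* Invented for illustration only; a real version would live in task_struct. */
struct gpu_fault_scope {
        const void *engine;             /* engine whose fault is being repaired */
        struct gpu_fault_scope *outer;  /* allow nesting */
};

static struct gpu_fault_scope *current_fault_scope;

static void gpu_fault_scope_enter(struct gpu_fault_scope *s, const void *engine)
{
        s->engine = engine;
        s->outer = current_fault_scope;
        current_fault_scope = s;
}

static void gpu_fault_scope_exit(struct gpu_fault_scope *s)
{
        current_fault_scope = s->outer;
}

/* Shrinker/notifier helper: must we avoid waiting on work of @engine here? */
static bool in_fault_repair_for(const void *engine)
{
        struct gpu_fault_scope *s;

        for (s = current_fault_scope; s; s = s->outer)
                if (s->engine == engine)
                        return true;

        return false;
}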

-Daniel

> So in the end you're essentially disabling shrinking/eviction of other
> gpu tasks, and I don't think that works. I really think the only two
> realistic options are
> - guarantee forward progress of other dma_fence (hw preemption,
> reserved CUs, or whatever else you have)
> - guarantee there's not a single offending dma_fence active in the
> system that could cause problems
>
> Hand-waving that in theory we could track the dependencies and that in
> theory we could do some deadlock avoidance of some sorts about that
> just doesn't look like a pragmatic&practical solution to me here. It
> feels about as realistic as just creating a completely new memory
> management model that sidesteps the entire dma_fence issues we have
> due to mixing up kernel memory management and userspace sync fences in
> one thing.
>
> Cheers, Daniel
>
> > > > This does not solve the mmu notifier case, for this you would just
> > > > invalidate the gem userptr object (with a flag but not releasing the
> > > > page refcount) but you would not wait for the GPU (ie no dma fence
> > > > wait in that code path anymore). The userptr API never really made
> > > > the contract that it will always be in sync with the mm view of the
> > > > world so if different page get remapped to same virtual address
> > > > while GPU is still working with the old pages it should not be an
> > > > issue (it would not be in our usage of userptr for compositor and
> > > > what not).
> > > >
> > > > Maybe i overlook something there.
> > >
> > > tbh I'm never really clear on how much exactly we need, and whether
> > > maybe the new pin/unpin api should fix it all.
> >
> > pin/unpin is not a solution it is to fix something with GUP (where
> > we need to know if a page is GUPed or not). GUP should die longterm
> > so anything using GUP (pin/unpin falls into that) should die longterm.
> > Pining memory is bad period (it just breaks too much mm and it is
> > unsolvable for things like mremap, splice, ...).
> >
> > Cheers,
> > Jérôme
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-13 16:56       ` Jerome Glisse
  2021-01-13 20:31         ` Daniel Vetter
@ 2021-01-14 10:49         ` Christian König
  2021-01-14 11:52           ` Daniel Vetter
  1 sibling, 1 reply; 84+ messages in thread
From: Christian König @ 2021-01-14 10:49 UTC (permalink / raw)
  To: Jerome Glisse, Daniel Vetter
  Cc: alex.sierra, philip.yang, Felix Kuehling, dri-devel, amd-gfx

On 13.01.21 at 17:56, Jerome Glisse wrote:
> On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
>> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
>>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
>>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>>>>> This is the first version of our HMM based shared virtual memory manager
>>>>> for KFD. There are still a number of known issues that we're working through
>>>>> (see below). This will likely lead to some pretty significant changes in
>>>>> MMU notifier handling and locking on the migration code paths. So don't
>>>>> get hung up on those details yet.
>>>>>
>>>>> But I think this is a good time to start getting feedback. We're pretty
>>>>> confident about the ioctl API, which is both simple and extensible for the
>>>>> future. (see patches 4,16) The user mode side of the API can be found here:
>>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>>>>
>>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>>>>> and some retry IRQ handling changes (32).
>>>>>
>>>>>
>>>>> Known issues:
>>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>>>>> * still working on some race conditions and random bugs
>>>>> * performance is not great yet
>>>> Still catching up, but I think there's another one for your list:
>>>>
>>>>   * hmm gpu context preempt vs page fault handling. I've had a short
>>>>     discussion about this one with Christian before the holidays, and also
>>>>     some private chats with Jerome. It's nasty since no easy fix, much less
>>>>     a good idea what's the best approach here.
>>> Do you have a pointer to that discussion or any more details?
>> Essentially if you're handling an hmm page fault from the gpu, you can
>> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
>> submissions or compute contexts with dma_fence_wait. Which deadlocks if
>> you can't preempt while you have that page fault pending. Two solutions:
>>
>> - your hw can (at least for compute ctx) preempt even when a page fault is
>>    pending
>>
>> - lots of screaming in trying to come up with an alternate solution. They
>>    all suck.
>>
>> Note that the dma_fence_wait is hard requirement, because we need that for
>> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
>> management. Which is the current "ttm is self-limited to 50% of system
>> memory" limitation Christian is trying to lift. So that's really not
>> a restriction we can lift, at least not in upstream where we need to also
>> support old style hardware which doesn't have page fault support and
>> really has no other option to handle memory management than
>> dma_fence_wait.
>>
>> Thread was here:
>>
>> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
>>
>> There's a few ways to resolve this (without having preempt-capable
>> hardware), but they're all supremely nasty.
>> -Daniel
>>
> I had a new idea, i wanted to think more about it but have not yet,
> anyway here it is. Adding a new callback to dma fence which ask the
> question can it dead lock ? Any time a GPU driver has pending page
> fault (ie something calling into the mm) it answer yes, otherwise
> no. The GPU shrinker would ask the question before waiting on any
> dma-fence and back of if it gets yes. Shrinker can still try many
> dma buf object for which it does not get a yes on associated fence.
>
> This does not solve the mmu notifier case, for this you would just
> invalidate the gem userptr object (with a flag but not releasing the
> page refcount) but you would not wait for the GPU (ie no dma fence
> wait in that code path anymore). The userptr API never really made
> the contract that it will always be in sync with the mm view of the
> world so if different page get remapped to same virtual address
> while GPU is still working with the old pages it should not be an
> issue (it would not be in our usage of userptr for compositor and
> what not).

The current working idea in my mind goes in a similar direction.

But instead of a callback, I'm adding a completely new class of HMM fences.

Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for
the dma_fences, and HMM fences are ignored in container objects.

When you handle an implicit or explicit synchronization request from 
userspace you need to block for HMM fences to complete before taking any 
resource locks.
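
A hedged sketch of how that class could be told apart (Christian has
not said how; the flag bit and helpers below are my own invention):
container objects and MM paths skip fences carrying the HMM marker,
while the submission/sync path waits for them up front, before any
reservation locks are taken:

#include <linux/dma-fence.h>

/* Hypothetical marker, reusing one of the driver-available flag bits. */
#define HMM_FENCE_FLAG_BIT      DMA_FENCE_FLAG_USER_BITS

static inline bool dma_fence_is_hmm(struct dma_fence *f)
{
        return test_bit(HMM_FENCE_FLAG_BIT, &f->flags);
}

/*
 * Sketch of handling a sync request from userspace: block for HMM
 * fence dependencies first, while no locks are held, then proceed
 * with the usual dma_resv locking and dma_fence handling.
 */
static int handle_sync_dependency(struct dma_fence *dep)
{
        if (dma_fence_is_hmm(dep)) {
                long r = dma_fence_wait(dep, true);

                if (r)
                        return r;
        }

        /* ... now it is safe to take dma_resv locks, add dma_fences, ... */
        return 0;
}

Containers such as dma_resv or syncobjs would then simply never return
fences for which dma_fence_is_hmm() is true to the MM paths.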

Regards,
Christian.

>
> Maybe i overlook something there.
>
> Cheers,
> Jérôme
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-14 10:49         ` Christian König
@ 2021-01-14 11:52           ` Daniel Vetter
  2021-01-14 13:37             ` HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD) Christian König
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14 11:52 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, dri-devel,
	Jerome Glisse, amd-gfx list

On Thu, Jan 14, 2021 at 11:49 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 13.01.21 um 17:56 schrieb Jerome Glisse:
> > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> >> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> >>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> >>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> >>>>> This is the first version of our HMM based shared virtual memory manager
> >>>>> for KFD. There are still a number of known issues that we're working through
> >>>>> (see below). This will likely lead to some pretty significant changes in
> >>>>> MMU notifier handling and locking on the migration code paths. So don't
> >>>>> get hung up on those details yet.
> >>>>>
> >>>>> But I think this is a good time to start getting feedback. We're pretty
> >>>>> confident about the ioctl API, which is both simple and extensible for the
> >>>>> future. (see patches 4,16) The user mode side of the API can be found here:
> >>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> >>>>>
> >>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> >>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> >>>>> and some retry IRQ handling changes (32).
> >>>>>
> >>>>>
> >>>>> Known issues:
> >>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
> >>>>> * still working on some race conditions and random bugs
> >>>>> * performance is not great yet
> >>>> Still catching up, but I think there's another one for your list:
> >>>>
> >>>>   * hmm gpu context preempt vs page fault handling. I've had a short
> >>>>     discussion about this one with Christian before the holidays, and also
> >>>>     some private chats with Jerome. It's nasty since no easy fix, much less
> >>>>     a good idea what's the best approach here.
> >>> Do you have a pointer to that discussion or any more details?
> >> Essentially if you're handling an hmm page fault from the gpu, you can
> >> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> >> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> >> you can't preempt while you have that page fault pending. Two solutions:
> >>
> >> - your hw can (at least for compute ctx) preempt even when a page fault is
> >>    pending
> >>
> >> - lots of screaming in trying to come up with an alternate solution. They
> >>    all suck.
> >>
> >> Note that the dma_fence_wait is hard requirement, because we need that for
> >> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> >> management. Which is the current "ttm is self-limited to 50% of system
> >> memory" limitation Christian is trying to lift. So that's really not
> >> a restriction we can lift, at least not in upstream where we need to also
> >> support old style hardware which doesn't have page fault support and
> >> really has no other option to handle memory management than
> >> dma_fence_wait.
> >>
> >> Thread was here:
> >>
> >> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> >>
> >> There's a few ways to resolve this (without having preempt-capable
> >> hardware), but they're all supremely nasty.
> >> -Daniel
> >>
> > I had a new idea, i wanted to think more about it but have not yet,
> > anyway here it is. Adding a new callback to dma fence which ask the
> > question can it dead lock ? Any time a GPU driver has pending page
> > fault (ie something calling into the mm) it answer yes, otherwise
> > no. The GPU shrinker would ask the question before waiting on any
> > dma-fence and back of if it gets yes. Shrinker can still try many
> > dma buf object for which it does not get a yes on associated fence.
> >
> > This does not solve the mmu notifier case, for this you would just
> > invalidate the gem userptr object (with a flag but not releasing the
> > page refcount) but you would not wait for the GPU (ie no dma fence
> > wait in that code path anymore). The userptr API never really made
> > the contract that it will always be in sync with the mm view of the
> > world so if different page get remapped to same virtual address
> > while GPU is still working with the old pages it should not be an
> > issue (it would not be in our usage of userptr for compositor and
> > what not).
>
> The current working idea in my mind goes into a similar direction.
>
> But instead of a callback I'm adding a complete new class of HMM fences.
>
> Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
> the dma_fences and HMM fences are ignored in container objects.
>
> When you handle an implicit or explicit synchronization request from
> userspace you need to block for HMM fences to complete before taking any
> resource locks.

Isn't that what I call gang scheduling? I.e. you either run in HMM
mode, or in legacy fencing mode (whether implicit or explicit doesn't
really matter I think). By forcing that split we avoid the problem,
but it means occasionally full stalls on mixed workloads.
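
Roughly, such a split could be sketched like this (invented names,
hand-waved details): per engine, or per engine group, the scheduler
tracks which mode it is in and drains the other kind of work before
switching:

/* Invented illustration of gang scheduling between the two modes. */
enum engine_mode { ENGINE_MODE_DMA_FENCE, ENGINE_MODE_HMM };

struct engine_sched {
        enum engine_mode mode;
        /* job queues, locks, ... */
};

/* Hypothetical helpers the driver would have to provide. */
void engine_drain_dma_fence_jobs(struct engine_sched *e);
void engine_preempt_hmm_contexts(struct engine_sched *e);

static void engine_switch_mode(struct engine_sched *e, enum engine_mode want)
{
        if (e->mode == want)
                return;

        if (want == ENGINE_MODE_HMM)
                /* flush/force-complete everything that signals dma_fences */
                engine_drain_dma_fence_jobs(e);
        else
                /* preempt fault-capable contexts, drop them from the hw queue */
                engine_preempt_hmm_contexts(e);

        e->mode = want;
}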

But that's not what Jerome wants (afaiui at least), I think his idea
is to track the reverse dependencies of all the fences floating
around, and then skip evicting an object if you have to wait for any
fence that is problematic for the current calling context. And I don't
think that's very feasible in practice.

So what kind of hmm fences do you have in mind here?
-Daniel


>
> Regards,
> Christian.
>
> >
> > Maybe i overlook something there.
> >
> > Cheers,
> > Jérôme
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD
  2021-01-14  5:34                   ` Felix Kuehling
@ 2021-01-14 12:19                     ` Christian König
  0 siblings, 0 replies; 84+ messages in thread
From: Christian König @ 2021-01-14 12:19 UTC (permalink / raw)
  To: Felix Kuehling, Daniel Vetter
  Cc: Alex Sierra, Yang, Philip, amd-gfx list, dri-devel

On 14.01.21 at 06:34, Felix Kuehling wrote:
> Am 2021-01-11 um 11:29 a.m. schrieb Daniel Vetter:
>> On Fri, Jan 08, 2021 at 12:56:24PM -0500, Felix Kuehling wrote:
>>> Am 2021-01-08 um 11:53 a.m. schrieb Daniel Vetter:
>>>> On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>>> Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
>>>>>> On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>>>>> Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
>>>>>>>> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
>>>>>>>>> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
>>>>>>>>>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>>>>>>>>>>> This is the first version of our HMM based shared virtual memory manager
>>>>>>>>>>> for KFD. There are still a number of known issues that we're working through
>>>>>>>>>>> (see below). This will likely lead to some pretty significant changes in
>>>>>>>>>>> MMU notifier handling and locking on the migration code paths. So don't
>>>>>>>>>>> get hung up on those details yet.
>>>>>>>>>>>
>>>>>>>>>>> But I think this is a good time to start getting feedback. We're pretty
>>>>>>>>>>> confident about the ioctl API, which is both simple and extensible for the
>>>>>>>>>>> future. (see patches 4,16) The user mode side of the API can be found here:
>>>>>>>>>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
>>>>>>>>>>>
>>>>>>>>>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM
>>>>>>>>>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
>>>>>>>>>>> and some retry IRQ handling changes (32).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Known issues:
>>>>>>>>>>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>>>>>>>>>>> * still working on some race conditions and random bugs
>>>>>>>>>>> * performance is not great yet
>>>>>>>>>> Still catching up, but I think there's another one for your list:
>>>>>>>>>>
>>>>>>>>>>   * hmm gpu context preempt vs page fault handling. I've had a short
>>>>>>>>>>     discussion about this one with Christian before the holidays, and also
>>>>>>>>>>     some private chats with Jerome. It's nasty since no easy fix, much less
>>>>>>>>>>     a good idea what's the best approach here.
>>>>>>>>> Do you have a pointer to that discussion or any more details?
>>>>>>>> Essentially if you're handling an hmm page fault from the gpu, you can
>>>>>>>> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
>>>>>>>> submissions or compute contexts with dma_fence_wait. Which deadlocks if
>>>>>>>> you can't preempt while you have that page fault pending. Two solutions:
>>>>>>>>
>>>>>>>> - your hw can (at least for compute ctx) preempt even when a page fault is
>>>>>>>>    pending
>>>>>>> Our GFXv9 GPUs can do this. GFXv10 cannot.
>>>>>> Uh, why did your hw guys drop this :-/
>>> Performance. It's the same reason why the XNACK mode selection API
>>> exists (patch 16). When we enable recoverable page fault handling in the
>>> compute units on GFXv9, it costs some performance even when no page
>>> faults are happening. On GFXv10 that retry fault handling moved out of
>>> the compute units, so they don't take the performance hit. But that
>>> sacrificed the ability to preempt during page faults. We'll need to work
>>> with our hardware teams to restore that capability in a future generation.
>> Ah yes, you need to stall in more points in the compute cores to make sure
>> you can recover if the page fault gets interrupted.
>>
>> Maybe my knowledge is outdated, but my understanding is that nvidia can
>> also preempt (but only for compute jobs, since oh dear the pain this would
>> be for all the fixed function stuff). Since gfx10 moved page fault
>> handling further away from compute cores, do you know whether this now
>> means you can do page faults for (some?) fixed function stuff too? Or
>> still only for compute?
> I'm not sure.
>
>
>> Supporting page fault for 3d would be real pain with the corner we're
>> stuck in right now, but better we know about this early than later :-/
> I know Christian hates the idea.

Well, I don't hate the idea. I just don't think that this will ever
work correctly and with good performance.

A big part of the additional fun is that we currently have a mix of
HMM-capable engines (3D, compute, DMA) and non-HMM-capable engines
(display, multimedia, etc.).

> We know that page faults on GPUs can be
> a huge performance drain because you're stalling potentially so many
> threads and the CPU can become a bottle neck dealing with all the page
> faults from many GPU threads. On the compute side, applications will be
> optimized to avoid them as much as possible, e.g. by pre-faulting or
> pre-fetching data before it's needed.
>
> But I think you need page faults to make overcommitted memory with user
> mode command submission not suck.

Yeah, completely agree.

The only short-term alternative I see is to have an IOCTL telling the
kernel which memory is currently in use. And that is complete nonsense,
because it kills the main advantage of user mode command submission in
the first place.

Regards,
Christian.

>>>>>> I do think it can be rescued with what I call gang scheduling of
>>>>>> engines: I.e. when a given engine is running a context (or a group of
>>>>>> engines, depending how your hw works) that can cause a page fault, you
>>>>>> must flush out all workloads running on the same engine which could
>>>>>> block a dma_fence (preempt them, or for non-compute stuff, force their
>>>>>> completion). And the other way round, i.e. before you can run a legacy
>>>>>> gl workload with a dma_fence on these engines you need to preempt all
>>>>>> ctxs that could cause page faults and take them at least out of the hw
>>>>>> scheduler queue.
>>>>> Yuck! But yeah, that would work. A less invasive alternative would be to
>>>>> reserve some compute units for graphics contexts so we can guarantee
>>>>> forward progress for graphics contexts even when all CUs working on
>>>>> compute stuff are stuck on page faults.
>>>> Won't this hurt compute workloads? I think we need something were at
>>>> least pure compute or pure gl/vk workloads run at full performance.
>>>> And without preempt we can't take anything back when we need it, so
>>>> would have to always upfront reserve some cores just in case.
>>> Yes, it would hurt proportionally to how many CUs get reserved. On big
>>> GPUs with many CUs the impact could be quite small.
>> Also, we could do the reservation only for the time when there's actually
>> a legacy context with normal dma_fence in the scheduler queue. Assuming
>> that reserving/unreserving of CUs isn't too expensive operation. If it's
>> as expensive as a full stall probably not worth the complexity here and
>> just go with a full stall and only run one or the other at a time.
>>
>> Wrt desktops I'm also somewhat worried that we might end up killing
>> desktop workloads if there's not enough CUs reserved for these and they
>> end up taking too long and anger either tdr or worse the user because the
>> desktop is unuseable when you start a compute job and get a big pile of
>> faults. Probably needs some testing to see how bad it is.
>>
>>> That said, I'm not sure it'll work on our hardware. Our CUs can execute
>>> multiple wavefronts from different contexts and switch between them with
>>> fine granularity. I'd need to check with our HW engineers whether this
>>> CU-internal context switching is still possible during page faults on
>>> GFXv10.
>> You'd need to do the reservation for all contexts/engines which can cause
>> page faults, otherewise it'd leak.
> All engines that can page fault and cannot be preempted during faults.
>
> Regards,
>    Felix
>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 11:52           ` Daniel Vetter
@ 2021-01-14 13:37             ` Christian König
  2021-01-14 13:57               ` Daniel Vetter
  2021-01-14 16:51               ` Jerome Glisse
  0 siblings, 2 replies; 84+ messages in thread
From: Christian König @ 2021-01-14 13:37 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, dri-devel,
	Jerome Glisse, amd-gfx list

Am 14.01.21 um 12:52 schrieb Daniel Vetter:
> [SNIP]
>>> I had a new idea, i wanted to think more about it but have not yet,
>>> anyway here it is. Adding a new callback to dma fence which ask the
>>> question can it dead lock ? Any time a GPU driver has pending page
>>> fault (ie something calling into the mm) it answer yes, otherwise
>>> no. The GPU shrinker would ask the question before waiting on any
>>> dma-fence and back of if it gets yes. Shrinker can still try many
>>> dma buf object for which it does not get a yes on associated fence.
>>>
>>> This does not solve the mmu notifier case, for this you would just
>>> invalidate the gem userptr object (with a flag but not releasing the
>>> page refcount) but you would not wait for the GPU (ie no dma fence
>>> wait in that code path anymore). The userptr API never really made
>>> the contract that it will always be in sync with the mm view of the
>>> world so if different page get remapped to same virtual address
>>> while GPU is still working with the old pages it should not be an
>>> issue (it would not be in our usage of userptr for compositor and
>>> what not).
>> The current working idea in my mind goes into a similar direction.
>>
>> But instead of a callback I'm adding a complete new class of HMM fences.
>>
>> Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
>> the dma_fences and HMM fences are ignored in container objects.
>>
>> When you handle an implicit or explicit synchronization request from
>> userspace you need to block for HMM fences to complete before taking any
>> resource locks.
> Isnt' that what I call gang scheduling? I.e. you either run in HMM
> mode, or in legacy fencing mode (whether implicit or explicit doesn't
> really matter I think). By forcing that split we avoid the problem,
> but it means occasionally full stalls on mixed workloads.
>
> But that's not what Jerome wants (afaiui at least), I think his idea
> is to track the reverse dependencies of all the fences floating
> around, and then skip evicting an object if you have to wait for any
> fence that is problematic for the current calling context. And I don't
> think that's very feasible in practice.
>
> So what kind of hmm fences do you have in mind here?

It's a bit more relaxed than your gang scheduling.

The requirements are as follows:

1. dma_fences never depend on hmm_fences.
2. hmm_fences can never preempt dma_fences.
3. dma_fences must be able to preempt hmm_fences or we always reserve 
enough hardware resources (CUs) to guarantee forward progress of dma_fences.

Critical sections are MMU notifiers, page faults, GPU schedulers and 
dma_reservation object locks.

4. It is valid to wait for dma_fences in critical sections.
5. It is not valid to wait for hmm_fences in critical sections.

Fence creation either happens during command submission or by adding 
something like a barrier or signal command to your userspace queue.

6. If we have an hmm_fence as implicit or explicit dependency for 
creating a dma_fence we must wait for that before taking any locks or 
reserving resources.
7. If we have a dma_fence as implicit or explicit dependency for 
creating an hmm_fence we can wait later on. So busy waiting or special 
WAIT hardware commands are valid.

This avoids hard cuts, e.g. we can mix hmm_fences and dma_fences at the 
same time on the hardware.

In other words we can have a high priority gfx queue running jobs based 
on dma_fences and a low priority compute queue running jobs based on 
hmm_fences.

Only when we switch from hmm_fence to dma_fence we need to block the 
submission until all the necessary resources (both memory as well as 
CUs) are available.

This is somewhat an extension to your gang submit idea.
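
To make requirement 6 a bit more concrete, here is a minimal sketch of
the check I have in mind before command submission. Everything except
the dma_fence calls is a made-up name (my_job, hmm_fence_ops), so please
read it as an illustration and not as actual driver code:

#include <linux/dma-fence.h>

/* Hypothetical fence ops shared by all hmm_fences. */
extern const struct dma_fence_ops hmm_fence_ops;

/* Hypothetical job structure with its collected dependencies. */
struct my_job {
        struct dma_fence **deps;
        unsigned int num_deps;
};

/*
 * Requirement 6: resolve all hmm_fence dependencies before the
 * submission takes any reservation locks or reserves resources.
 * Plain dma_fence dependencies are left to the scheduler, see
 * requirement 7.
 */
static int my_job_resolve_hmm_deps(struct my_job *job)
{
        unsigned int i;

        for (i = 0; i < job->num_deps; ++i) {
                struct dma_fence *f = job->deps[i];
                long r;

                if (f->ops != &hmm_fence_ops)
                        continue;

                r = dma_fence_wait(f, true);
                if (r)
                        return r;
        }

        return 0;
}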

Regards,
Christian.

> -Daniel
>


* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 13:37             ` HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD) Christian König
@ 2021-01-14 13:57               ` Daniel Vetter
  2021-01-14 14:13                 ` Christian König
  2021-01-14 16:51               ` Jerome Glisse
  1 sibling, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14 13:57 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, dri-devel,
	Jerome Glisse, amd-gfx list

On Thu, Jan 14, 2021 at 2:37 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
> > [SNIP]
> >>> I had a new idea, i wanted to think more about it but have not yet,
> >>> anyway here it is. Adding a new callback to dma fence which ask the
> >>> question can it dead lock ? Any time a GPU driver has pending page
> >>> fault (ie something calling into the mm) it answer yes, otherwise
> >>> no. The GPU shrinker would ask the question before waiting on any
> >>> dma-fence and back of if it gets yes. Shrinker can still try many
> >>> dma buf object for which it does not get a yes on associated fence.
> >>>
> >>> This does not solve the mmu notifier case, for this you would just
> >>> invalidate the gem userptr object (with a flag but not releasing the
> >>> page refcount) but you would not wait for the GPU (ie no dma fence
> >>> wait in that code path anymore). The userptr API never really made
> >>> the contract that it will always be in sync with the mm view of the
> >>> world so if different page get remapped to same virtual address
> >>> while GPU is still working with the old pages it should not be an
> >>> issue (it would not be in our usage of userptr for compositor and
> >>> what not).
> >> The current working idea in my mind goes into a similar direction.
> >>
> >> But instead of a callback I'm adding a complete new class of HMM fences.
> >>
> >> Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
> >> the dma_fences and HMM fences are ignored in container objects.
> >>
> >> When you handle an implicit or explicit synchronization request from
> >> userspace you need to block for HMM fences to complete before taking any
> >> resource locks.
> > Isnt' that what I call gang scheduling? I.e. you either run in HMM
> > mode, or in legacy fencing mode (whether implicit or explicit doesn't
> > really matter I think). By forcing that split we avoid the problem,
> > but it means occasionally full stalls on mixed workloads.
> >
> > But that's not what Jerome wants (afaiui at least), I think his idea
> > is to track the reverse dependencies of all the fences floating
> > around, and then skip evicting an object if you have to wait for any
> > fence that is problematic for the current calling context. And I don't
> > think that's very feasible in practice.
> >
> > So what kind of hmm fences do you have in mind here?
>
> It's a bit more relaxed than your gang schedule.
>
> See the requirements are as follow:
>
> 1. dma_fences never depend on hmm_fences.
> 2. hmm_fences can never preempt dma_fences.
> 3. dma_fences must be able to preempt hmm_fences or we always reserve
> enough hardware resources (CUs) to guarantee forward progress of dma_fences.
>
> Critical sections are MMU notifiers, page faults, GPU schedulers and
> dma_reservation object locks.
>
> 4. It is valid to wait for a dma_fences in critical sections.
> 5. It is not valid to wait for hmm_fences in critical sections.
>
> Fence creation either happens during command submission or by adding
> something like a barrier or signal command to your userspace queue.
>
> 6. If we have an hmm_fence as implicit or explicit dependency for
> creating a dma_fence we must wait for that before taking any locks or
> reserving resources.
> 7. If we have a dma_fence as implicit or explicit dependency for
> creating an hmm_fence we can wait later on. So busy waiting or special
> WAIT hardware commands are valid.
>
> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the
> same time on the hardware.
>
> In other words we can have a high priority gfx queue running jobs based
> on dma_fences and a low priority compute queue running jobs based on
> hmm_fences.
>
> Only when we switch from hmm_fence to dma_fence we need to block the
> submission until all the necessary resources (both memory as well as
> CUs) are available.
>
> This is somewhat an extension to your gang submit idea.

Either I'm missing something, or this is just exactly what we
documented already with userspace fences in general, and how you can't
have a dma_fence depend upon a userspace fence (or hmm_fence).

My gang scheduling idea is really just an alternative for what you
have listed as item 3 above. Instead of requiring preempt or requiring
guaranteed forward progress of some other sorts we flush out any
pending dma_fence request. But _only_ those which would get stalled by
the job we're running, so high-priority sdma requests we need in the
kernel to shuffle buffers around are still all ok. This would be
needed if your hw can't preempt, and you also have shared engines
between compute and gfx, so reserving CUs won't solve the problem
either.

What I don't mean with my gang scheduling is a completely exclusive
mode between hmm_fence and dma_fence, since that would prevent us from
using copy engines and dma_fence in the kernel to shuffle memory
around for hmm jobs. And that would suck, even on compute-only
workloads. Maybe I should rename "gang scheduling" to "engine flush"
or something like that.
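
A rough sketch of what I mean, with made-up structures (engine,
legacy_job), no locking and no preemption handling, just to illustrate
the "flush only what would get stalled" part:

#include <linux/list.h>
#include <linux/dma-fence.h>

/* Hypothetical per-engine bookkeeping of queued dma_fence based jobs. */
struct legacy_job {
        struct list_head node;
        struct dma_fence *finished;
};

struct engine {
        struct list_head legacy_jobs;
};

/*
 * Before starting a job that can page fault, drain the dma_fence based
 * jobs on exactly the engines that job will occupy, so no dma_fence can
 * end up stuck behind a page fault. Jobs on other engines keep running.
 */
static long engine_flush_legacy_jobs(struct engine **shared, unsigned int n)
{
        unsigned int i;
        long r;

        for (i = 0; i < n; ++i) {
                struct legacy_job *job;

                list_for_each_entry(job, &shared[i]->legacy_jobs, node) {
                        r = dma_fence_wait(job->finished, false);
                        if (r)
                                return r;
                }
        }

        return 0;
}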

I think the basics of userspace or hmm_fence or whatever we'll call it
we've documented already here:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

I think the only thing missing is clarifying a bit what you have under
item 3, i.e. how do we make sure there's no accidental hidden
dependency between hmm_fence and dma_fence. Maybe a subsection about
gpu page fault handling?

Or are we still talking past each another a bit here?
-Daniel


> Regards,
> Christian.
>
> > -Daniel
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 13:57               ` Daniel Vetter
@ 2021-01-14 14:13                 ` Christian König
  2021-01-14 14:23                   ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Christian König @ 2021-01-14 14:13 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list,
	Jerome Glisse, dri-devel

Am 14.01.21 um 14:57 schrieb Daniel Vetter:
> On Thu, Jan 14, 2021 at 2:37 PM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
>>> [SNIP]
>>>>> I had a new idea, i wanted to think more about it but have not yet,
>>>>> anyway here it is. Adding a new callback to dma fence which ask the
>>>>> question can it dead lock ? Any time a GPU driver has pending page
>>>>> fault (ie something calling into the mm) it answer yes, otherwise
>>>>> no. The GPU shrinker would ask the question before waiting on any
>>>>> dma-fence and back of if it gets yes. Shrinker can still try many
>>>>> dma buf object for which it does not get a yes on associated fence.
>>>>>
>>>>> This does not solve the mmu notifier case, for this you would just
>>>>> invalidate the gem userptr object (with a flag but not releasing the
>>>>> page refcount) but you would not wait for the GPU (ie no dma fence
>>>>> wait in that code path anymore). The userptr API never really made
>>>>> the contract that it will always be in sync with the mm view of the
>>>>> world so if different page get remapped to same virtual address
>>>>> while GPU is still working with the old pages it should not be an
>>>>> issue (it would not be in our usage of userptr for compositor and
>>>>> what not).
>>>> The current working idea in my mind goes into a similar direction.
>>>>
>>>> But instead of a callback I'm adding a complete new class of HMM fences.
>>>>
>>>> Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
>>>> the dma_fences and HMM fences are ignored in container objects.
>>>>
>>>> When you handle an implicit or explicit synchronization request from
>>>> userspace you need to block for HMM fences to complete before taking any
>>>> resource locks.
>>> Isnt' that what I call gang scheduling? I.e. you either run in HMM
>>> mode, or in legacy fencing mode (whether implicit or explicit doesn't
>>> really matter I think). By forcing that split we avoid the problem,
>>> but it means occasionally full stalls on mixed workloads.
>>>
>>> But that's not what Jerome wants (afaiui at least), I think his idea
>>> is to track the reverse dependencies of all the fences floating
>>> around, and then skip evicting an object if you have to wait for any
>>> fence that is problematic for the current calling context. And I don't
>>> think that's very feasible in practice.
>>>
>>> So what kind of hmm fences do you have in mind here?
>> It's a bit more relaxed than your gang schedule.
>>
>> See the requirements are as follow:
>>
>> 1. dma_fences never depend on hmm_fences.
>> 2. hmm_fences can never preempt dma_fences.
>> 3. dma_fences must be able to preempt hmm_fences or we always reserve
>> enough hardware resources (CUs) to guarantee forward progress of dma_fences.
>>
>> Critical sections are MMU notifiers, page faults, GPU schedulers and
>> dma_reservation object locks.
>>
>> 4. It is valid to wait for a dma_fences in critical sections.
>> 5. It is not valid to wait for hmm_fences in critical sections.
>>
>> Fence creation either happens during command submission or by adding
>> something like a barrier or signal command to your userspace queue.
>>
>> 6. If we have an hmm_fence as implicit or explicit dependency for
>> creating a dma_fence we must wait for that before taking any locks or
>> reserving resources.
>> 7. If we have a dma_fence as implicit or explicit dependency for
>> creating an hmm_fence we can wait later on. So busy waiting or special
>> WAIT hardware commands are valid.
>>
>> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the
>> same time on the hardware.
>>
>> In other words we can have a high priority gfx queue running jobs based
>> on dma_fences and a low priority compute queue running jobs based on
>> hmm_fences.
>>
>> Only when we switch from hmm_fence to dma_fence we need to block the
>> submission until all the necessary resources (both memory as well as
>> CUs) are available.
>>
>> This is somewhat an extension to your gang submit idea.
> Either I'm missing something, or this is just exactly what we
> documented already with userspace fences in general, and how you can't
> have a dma_fence depend upon a userspace (or hmm_fence).
>
> My gang scheduling idea is really just an alternative for what you
> have listed as item 3 above. Instead of requiring preempt or requiring
> guaranteed forward progress of some other sorts we flush out any
> pending dma_fence request. But _only_ those which would get stalled by
> the job we're running, so high-priority sdma requests we need in the
> kernel to shuffle buffers around are still all ok. This would be
> needed if you're hw can't preempt, and you also have shared engines
> between compute and gfx, so reserving CUs won't solve the problem
> either.
>
> What I don't mean with my gang scheduling is a completely exclusive
> mode between hmm_fence and dma_fence, since that would prevent us from
> using copy engines and dma_fence in the kernel to shuffle memory
> around for hmm jobs. And that would suck, even on compute-only
> workloads. Maybe I should rename "gang scheduling" to "engine flush"
> or something like that.

Yeah, "engine flush" makes it much more clearer.

What I wanted to emphasize is that we have to mix dma_fences and 
hmm_fences running at the same time on the same hardware, fighting over 
the same resources.

E.g. even on the newest hardware multimedia engines can't handle page 
faults, so video decoding/encoding will still produce dma_fences.

> I think the basics of userspace or hmm_fence or whatever we'll call it
> we've documented already here:
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

This talks about the restrictions we have for dma_fences and why 
infinite fences (even as hmm_fence) will never work.

But it doesn't talk about how to handle implicit or explicit 
dependencies with something like hmm_fences.

In other words my proposal above allows hmm_fences to show up in 
dma_reservation objects and be used together with all the explicit 
synchronization we still have, with only a medium amount of work :)

> I think the only thing missing is clarifying a bit what you have under
> item 3, i.e. how do we make sure there's no accidental hidden
> dependency between hmm_fence and dma_fence. Maybe a subsection about
> gpu page fault handling?

The real improvement is item 6. The problem with it is that it requires 
auditing all occasions when we create dma_fences so that we don't 
accidentally depend on an HMM fence.
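
A cheap way to support that audit would be a debug check in every place
that attaches a dependency to a dma_fence producing job, roughly like
this (hmm_fence_ops is again a made-up name, not existing code):

#include <linux/bug.h>
#include <linux/dma-fence.h>

extern const struct dma_fence_ops hmm_fence_ops;        /* hypothetical */

/* Complain loudly if a dma_fence producing job picks up an hmm_fence
 * dependency, which would violate rules 1 and 6. */
static inline void warn_on_hmm_dependency(struct dma_fence *dep)
{
        WARN_ONCE(dep && dep->ops == &hmm_fence_ops,
                  "dma_fence job depends on an hmm_fence, potential deadlock\n");
}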

Regards,
Christian.

>
> Or are we still talking past each another a bit here?
> -Daniel
>
>
>> Regards,
>> Christian.
>>
>>> -Daniel
>>>
>


* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 14:13                 ` Christian König
@ 2021-01-14 14:23                   ` Daniel Vetter
  2021-01-14 15:08                     ` Christian König
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14 14:23 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list,
	Jerome Glisse, dri-devel

On Thu, Jan 14, 2021 at 3:13 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 14.01.21 um 14:57 schrieb Daniel Vetter:
> > On Thu, Jan 14, 2021 at 2:37 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
> >>> [SNIP]
> >>>>> I had a new idea, i wanted to think more about it but have not yet,
> >>>>> anyway here it is. Adding a new callback to dma fence which ask the
> >>>>> question can it dead lock ? Any time a GPU driver has pending page
> >>>>> fault (ie something calling into the mm) it answer yes, otherwise
> >>>>> no. The GPU shrinker would ask the question before waiting on any
> >>>>> dma-fence and back of if it gets yes. Shrinker can still try many
> >>>>> dma buf object for which it does not get a yes on associated fence.
> >>>>>
> >>>>> This does not solve the mmu notifier case, for this you would just
> >>>>> invalidate the gem userptr object (with a flag but not releasing the
> >>>>> page refcount) but you would not wait for the GPU (ie no dma fence
> >>>>> wait in that code path anymore). The userptr API never really made
> >>>>> the contract that it will always be in sync with the mm view of the
> >>>>> world so if different page get remapped to same virtual address
> >>>>> while GPU is still working with the old pages it should not be an
> >>>>> issue (it would not be in our usage of userptr for compositor and
> >>>>> what not).
> >>>> The current working idea in my mind goes into a similar direction.
> >>>>
> >>>> But instead of a callback I'm adding a complete new class of HMM fences.
> >>>>
> >>>> Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
> >>>> the dma_fences and HMM fences are ignored in container objects.
> >>>>
> >>>> When you handle an implicit or explicit synchronization request from
> >>>> userspace you need to block for HMM fences to complete before taking any
> >>>> resource locks.
> >>> Isnt' that what I call gang scheduling? I.e. you either run in HMM
> >>> mode, or in legacy fencing mode (whether implicit or explicit doesn't
> >>> really matter I think). By forcing that split we avoid the problem,
> >>> but it means occasionally full stalls on mixed workloads.
> >>>
> >>> But that's not what Jerome wants (afaiui at least), I think his idea
> >>> is to track the reverse dependencies of all the fences floating
> >>> around, and then skip evicting an object if you have to wait for any
> >>> fence that is problematic for the current calling context. And I don't
> >>> think that's very feasible in practice.
> >>>
> >>> So what kind of hmm fences do you have in mind here?
> >> It's a bit more relaxed than your gang schedule.
> >>
> >> See the requirements are as follow:
> >>
> >> 1. dma_fences never depend on hmm_fences.
> >> 2. hmm_fences can never preempt dma_fences.
> >> 3. dma_fences must be able to preempt hmm_fences or we always reserve
> >> enough hardware resources (CUs) to guarantee forward progress of dma_fences.
> >>
> >> Critical sections are MMU notifiers, page faults, GPU schedulers and
> >> dma_reservation object locks.
> >>
> >> 4. It is valid to wait for a dma_fences in critical sections.
> >> 5. It is not valid to wait for hmm_fences in critical sections.
> >>
> >> Fence creation either happens during command submission or by adding
> >> something like a barrier or signal command to your userspace queue.
> >>
> >> 6. If we have an hmm_fence as implicit or explicit dependency for
> >> creating a dma_fence we must wait for that before taking any locks or
> >> reserving resources.
> >> 7. If we have a dma_fence as implicit or explicit dependency for
> >> creating an hmm_fence we can wait later on. So busy waiting or special
> >> WAIT hardware commands are valid.
> >>
> >> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the
> >> same time on the hardware.
> >>
> >> In other words we can have a high priority gfx queue running jobs based
> >> on dma_fences and a low priority compute queue running jobs based on
> >> hmm_fences.
> >>
> >> Only when we switch from hmm_fence to dma_fence we need to block the
> >> submission until all the necessary resources (both memory as well as
> >> CUs) are available.
> >>
> >> This is somewhat an extension to your gang submit idea.
> > Either I'm missing something, or this is just exactly what we
> > documented already with userspace fences in general, and how you can't
> > have a dma_fence depend upon a userspace (or hmm_fence).
> >
> > My gang scheduling idea is really just an alternative for what you
> > have listed as item 3 above. Instead of requiring preempt or requiring
> > guaranteed forward progress of some other sorts we flush out any
> > pending dma_fence request. But _only_ those which would get stalled by
> > the job we're running, so high-priority sdma requests we need in the
> > kernel to shuffle buffers around are still all ok. This would be
> > needed if you're hw can't preempt, and you also have shared engines
> > between compute and gfx, so reserving CUs won't solve the problem
> > either.
> >
> > What I don't mean with my gang scheduling is a completely exclusive
> > mode between hmm_fence and dma_fence, since that would prevent us from
> > using copy engines and dma_fence in the kernel to shuffle memory
> > around for hmm jobs. And that would suck, even on compute-only
> > workloads. Maybe I should rename "gang scheduling" to "engine flush"
> > or something like that.
>
> Yeah, "engine flush" makes it much more clearer.
>
> What I wanted to emphasis is that we have to mix dma_fences and
> hmm_fences running at the same time on the same hardware fighting over
> the same resources.
>
> E.g. even on the newest hardware multimedia engines can't handle page
> faults, so video decoding/encoding will still produce dma_fences.

Well we also have to mix them so the kernel can shovel data around
using copy engines. Plus we have to mix it at the overall subsystem
level because I'm not sure SoC-class gpus will ever get here, and they
definitely aren't there yet.

> > I think the basics of userspace or hmm_fence or whatever we'll call it
> > we've documented already here:
> >
> > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>
> This talks about the restrictions we have for dma_fences and why
> infinite fences (even as hmm_fence) will never work.
>
> But it doesn't talk about how to handle implicit or explicit
> dependencies with something like hmm_fences.
>
> In other words my proposal above allows for hmm_fences to show up in
> dma_reservation objects and are used together with all this explicit
> synchronization we still have with only a medium amount of work :)

Oh. I don't think we should put any hmm_fence or other infinite fence
into a dma_resv object. At least not into the current dma_resv object,
because then we have that infinite fences problem everywhere, and very
hard to audit.

What we could do is add new hmm_fence only slots for implicit sync,
but I think consensus is that implicit sync is bad, never do it again.
Last time around (for timeline syncobj) we've also pushed the waiting
on cross-over to userspace, and I think that's the right option, so we
need userspace to understand the hmm fence anyway. At that point we
might as well bite the bullet and do another round of wayland/dri
protocols.

So from that pov I think the kernel should at most deal with an
hmm_fence for cross-process communication and maybe some standard wait
primitives (for userspace to use, not for the kernel).

The only use case this would forbid is using page faults for legacy
implicit/explicit dma_fence synced workloads, and I think that's
perfectly ok to not allow. Especially since the motivation here for
all this is compute, and compute doesn't pass around dma_fences
anyway.

> > I think the only thing missing is clarifying a bit what you have under
> > item 3, i.e. how do we make sure there's no accidental hidden
> > dependency between hmm_fence and dma_fence. Maybe a subsection about
> > gpu page fault handling?
>
> The real improvement is item 6. The problem with it is that it requires
> auditing all occasions when we create dma_fences so that we don't
> accidentally depend on an HMM fence.

We have that rule already, it's the "dma_fence must not depend upon an
infinite fence anywhere" rule we documented last summer. So that
doesn't feel new.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Or are we still talking past each another a bit here?
> > -Daniel
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> -Daniel
> >>>
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 14:23                   ` Daniel Vetter
@ 2021-01-14 15:08                     ` Christian König
  2021-01-14 15:40                       ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Christian König @ 2021-01-14 15:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list,
	Jerome Glisse, dri-devel

Am 14.01.21 um 15:23 schrieb Daniel Vetter:
> On Thu, Jan 14, 2021 at 3:13 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 14.01.21 um 14:57 schrieb Daniel Vetter:
>>> On Thu, Jan 14, 2021 at 2:37 PM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
>>>>> [SNIP]
>>>>>>> I had a new idea, i wanted to think more about it but have not yet,
>>>>>>> anyway here it is. Adding a new callback to dma fence which ask the
>>>>>>> question can it dead lock ? Any time a GPU driver has pending page
>>>>>>> fault (ie something calling into the mm) it answer yes, otherwise
>>>>>>> no. The GPU shrinker would ask the question before waiting on any
>>>>>>> dma-fence and back of if it gets yes. Shrinker can still try many
>>>>>>> dma buf object for which it does not get a yes on associated fence.
>>>>>>>
>>>>>>> This does not solve the mmu notifier case, for this you would just
>>>>>>> invalidate the gem userptr object (with a flag but not releasing the
>>>>>>> page refcount) but you would not wait for the GPU (ie no dma fence
>>>>>>> wait in that code path anymore). The userptr API never really made
>>>>>>> the contract that it will always be in sync with the mm view of the
>>>>>>> world so if different page get remapped to same virtual address
>>>>>>> while GPU is still working with the old pages it should not be an
>>>>>>> issue (it would not be in our usage of userptr for compositor and
>>>>>>> what not).
>>>>>> The current working idea in my mind goes into a similar direction.
>>>>>>
>>>>>> But instead of a callback I'm adding a complete new class of HMM fences.
>>>>>>
>>>>>> Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
>>>>>> the dma_fences and HMM fences are ignored in container objects.
>>>>>>
>>>>>> When you handle an implicit or explicit synchronization request from
>>>>>> userspace you need to block for HMM fences to complete before taking any
>>>>>> resource locks.
>>>>> Isnt' that what I call gang scheduling? I.e. you either run in HMM
>>>>> mode, or in legacy fencing mode (whether implicit or explicit doesn't
>>>>> really matter I think). By forcing that split we avoid the problem,
>>>>> but it means occasionally full stalls on mixed workloads.
>>>>>
>>>>> But that's not what Jerome wants (afaiui at least), I think his idea
>>>>> is to track the reverse dependencies of all the fences floating
>>>>> around, and then skip evicting an object if you have to wait for any
>>>>> fence that is problematic for the current calling context. And I don't
>>>>> think that's very feasible in practice.
>>>>>
>>>>> So what kind of hmm fences do you have in mind here?
>>>> It's a bit more relaxed than your gang schedule.
>>>>
>>>> See the requirements are as follow:
>>>>
>>>> 1. dma_fences never depend on hmm_fences.
>>>> 2. hmm_fences can never preempt dma_fences.
>>>> 3. dma_fences must be able to preempt hmm_fences or we always reserve
>>>> enough hardware resources (CUs) to guarantee forward progress of dma_fences.
>>>>
>>>> Critical sections are MMU notifiers, page faults, GPU schedulers and
>>>> dma_reservation object locks.
>>>>
>>>> 4. It is valid to wait for a dma_fences in critical sections.
>>>> 5. It is not valid to wait for hmm_fences in critical sections.
>>>>
>>>> Fence creation either happens during command submission or by adding
>>>> something like a barrier or signal command to your userspace queue.
>>>>
>>>> 6. If we have an hmm_fence as implicit or explicit dependency for
>>>> creating a dma_fence we must wait for that before taking any locks or
>>>> reserving resources.
>>>> 7. If we have a dma_fence as implicit or explicit dependency for
>>>> creating an hmm_fence we can wait later on. So busy waiting or special
>>>> WAIT hardware commands are valid.
>>>>
>>>> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the
>>>> same time on the hardware.
>>>>
>>>> In other words we can have a high priority gfx queue running jobs based
>>>> on dma_fences and a low priority compute queue running jobs based on
>>>> hmm_fences.
>>>>
>>>> Only when we switch from hmm_fence to dma_fence we need to block the
>>>> submission until all the necessary resources (both memory as well as
>>>> CUs) are available.
>>>>
>>>> This is somewhat an extension to your gang submit idea.
>>> Either I'm missing something, or this is just exactly what we
>>> documented already with userspace fences in general, and how you can't
>>> have a dma_fence depend upon a userspace (or hmm_fence).
>>>
>>> My gang scheduling idea is really just an alternative for what you
>>> have listed as item 3 above. Instead of requiring preempt or requiring
>>> guaranteed forward progress of some other sorts we flush out any
>>> pending dma_fence request. But _only_ those which would get stalled by
>>> the job we're running, so high-priority sdma requests we need in the
>>> kernel to shuffle buffers around are still all ok. This would be
>>> needed if you're hw can't preempt, and you also have shared engines
>>> between compute and gfx, so reserving CUs won't solve the problem
>>> either.
>>>
>>> What I don't mean with my gang scheduling is a completely exclusive
>>> mode between hmm_fence and dma_fence, since that would prevent us from
>>> using copy engines and dma_fence in the kernel to shuffle memory
>>> around for hmm jobs. And that would suck, even on compute-only
>>> workloads. Maybe I should rename "gang scheduling" to "engine flush"
>>> or something like that.
>> Yeah, "engine flush" makes it much more clearer.
>>
>> What I wanted to emphasis is that we have to mix dma_fences and
>> hmm_fences running at the same time on the same hardware fighting over
>> the same resources.
>>
>> E.g. even on the newest hardware multimedia engines can't handle page
>> faults, so video decoding/encoding will still produce dma_fences.
> Well we also have to mix them so the kernel can shovel data around
> using copy engines. Plus we have to mix it at the overall subsystem
> level because I'm not sure SoC-class gpus will ever get here,
> definitely aren't yet there for sure.
>
>>> I think the basics of userspace or hmm_fence or whatever we'll call it
>>> we've documented already here:
>>>
>>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>> This talks about the restrictions we have for dma_fences and why
>> infinite fences (even as hmm_fence) will never work.
>>
>> But it doesn't talk about how to handle implicit or explicit
>> dependencies with something like hmm_fences.
>>
>> In other words my proposal above allows for hmm_fences to show up in
>> dma_reservation objects and are used together with all this explicit
>> synchronization we still have with only a medium amount of work :)
> Oh. I don't think we should put any hmm_fence or other infinite fence
> into a dma_resv object. At least not into the current dma_resv object,
> because then we have that infinite fences problem everywhere, and very
> hard to audit.

Yes, exactly. That's why these rules spell out how to mix them, or rather how not to mix them.

> What we could do is add new hmm_fence only slots for implicit sync,

Yeah, we would keep them separate from the dma_fence objects.

> but I think consensus is that implicit sync is bad, never do it again.
> Last time around (for timeline syncobj) we've also pushed the waiting
> on cross-over to userspace, and I think that's the right option, so we
> need userspace to understand the hmm fence anyway. At that point we
> might as well bite the bullet and do another round of wayland/dri
> protocols.

As you said I don't see this happening in the next 5 years either.

So I think we have to somehow solve this in the kernel or we will go in 
circles all the time.

> So from that pov I think the kernel should at most deal with an
> hmm_fence for cross-process communication and maybe some standard wait
> primitives (for userspace to use, not for the kernel).
>
> The only use case this would forbid is using page faults for legacy
> implicit/explicit dma_fence synced workloads, and I think that's
> perfectly ok to not allow. Especially since the motivation here for
> all this is compute, and compute doesn't pass around dma_fences
> anyway.

As Alex said we will rather soon see this for gfx as well, and we most 
likely will see combinations of old dma_fence-based integrated graphics 
with new dedicated GPUs.

So I don't think we can say we reduce the problem to compute and don't 
support anything else.

Regards,
Christian.

>
>>> I think the only thing missing is clarifying a bit what you have under
>>> item 3, i.e. how do we make sure there's no accidental hidden
>>> dependency between hmm_fence and dma_fence. Maybe a subsection about
>>> gpu page fault handling?
>> The real improvement is item 6. The problem with it is that it requires
>> auditing all occasions when we create dma_fences so that we don't
>> accidentally depend on an HMM fence.
> We have that rule already, it's the "dma_fence must not depend upon an
> infinite fence anywhere" rule we documented last summer. So that
> doesn't feel new.
> -Daniel
>
>> Regards,
>> Christian.
>>
>>> Or are we still talking past each another a bit here?
>>> -Daniel
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> -Daniel
>>>>>
>


* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 15:08                     ` Christian König
@ 2021-01-14 15:40                       ` Daniel Vetter
  2021-01-14 16:01                         ` Christian König
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14 15:40 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list,
	Jerome Glisse, dri-devel

On Thu, Jan 14, 2021 at 4:08 PM Christian König
<christian.koenig@amd.com> wrote:
> Am 14.01.21 um 15:23 schrieb Daniel Vetter:
> > On Thu, Jan 14, 2021 at 3:13 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 14.01.21 um 14:57 schrieb Daniel Vetter:
> >>> On Thu, Jan 14, 2021 at 2:37 PM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
> >>>>> [SNIP]
> >>>>>>> I had a new idea, i wanted to think more about it but have not yet,
> >>>>>>> anyway here it is. Adding a new callback to dma fence which ask the
> >>>>>>> question can it dead lock ? Any time a GPU driver has pending page
> >>>>>>> fault (ie something calling into the mm) it answer yes, otherwise
> >>>>>>> no. The GPU shrinker would ask the question before waiting on any
> >>>>>>> dma-fence and back of if it gets yes. Shrinker can still try many
> >>>>>>> dma buf object for which it does not get a yes on associated fence.
> >>>>>>>
> >>>>>>> This does not solve the mmu notifier case, for this you would just
> >>>>>>> invalidate the gem userptr object (with a flag but not releasing the
> >>>>>>> page refcount) but you would not wait for the GPU (ie no dma fence
> >>>>>>> wait in that code path anymore). The userptr API never really made
> >>>>>>> the contract that it will always be in sync with the mm view of the
> >>>>>>> world so if different page get remapped to same virtual address
> >>>>>>> while GPU is still working with the old pages it should not be an
> >>>>>>> issue (it would not be in our usage of userptr for compositor and
> >>>>>>> what not).
> >>>>>> The current working idea in my mind goes into a similar direction.
> >>>>>>
> >>>>>> But instead of a callback I'm adding a complete new class of HMM fences.
> >>>>>>
> >>>>>> Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
> >>>>>> the dma_fences and HMM fences are ignored in container objects.
> >>>>>>
> >>>>>> When you handle an implicit or explicit synchronization request from
> >>>>>> userspace you need to block for HMM fences to complete before taking any
> >>>>>> resource locks.
> >>>>> Isnt' that what I call gang scheduling? I.e. you either run in HMM
> >>>>> mode, or in legacy fencing mode (whether implicit or explicit doesn't
> >>>>> really matter I think). By forcing that split we avoid the problem,
> >>>>> but it means occasionally full stalls on mixed workloads.
> >>>>>
> >>>>> But that's not what Jerome wants (afaiui at least), I think his idea
> >>>>> is to track the reverse dependencies of all the fences floating
> >>>>> around, and then skip evicting an object if you have to wait for any
> >>>>> fence that is problematic for the current calling context. And I don't
> >>>>> think that's very feasible in practice.
> >>>>>
> >>>>> So what kind of hmm fences do you have in mind here?
> >>>> It's a bit more relaxed than your gang schedule.
> >>>>
> >>>> See the requirements are as follow:
> >>>>
> >>>> 1. dma_fences never depend on hmm_fences.
> >>>> 2. hmm_fences can never preempt dma_fences.
> >>>> 3. dma_fences must be able to preempt hmm_fences or we always reserve
> >>>> enough hardware resources (CUs) to guarantee forward progress of dma_fences.
> >>>>
> >>>> Critical sections are MMU notifiers, page faults, GPU schedulers and
> >>>> dma_reservation object locks.
> >>>>
> >>>> 4. It is valid to wait for a dma_fences in critical sections.
> >>>> 5. It is not valid to wait for hmm_fences in critical sections.
> >>>>
> >>>> Fence creation either happens during command submission or by adding
> >>>> something like a barrier or signal command to your userspace queue.
> >>>>
> >>>> 6. If we have an hmm_fence as implicit or explicit dependency for
> >>>> creating a dma_fence we must wait for that before taking any locks or
> >>>> reserving resources.
> >>>> 7. If we have a dma_fence as implicit or explicit dependency for
> >>>> creating an hmm_fence we can wait later on. So busy waiting or special
> >>>> WAIT hardware commands are valid.
> >>>>
> >>>> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the
> >>>> same time on the hardware.
> >>>>
> >>>> In other words we can have a high priority gfx queue running jobs based
> >>>> on dma_fences and a low priority compute queue running jobs based on
> >>>> hmm_fences.
> >>>>
> >>>> Only when we switch from hmm_fence to dma_fence we need to block the
> >>>> submission until all the necessary resources (both memory as well as
> >>>> CUs) are available.
> >>>>
> >>>> This is somewhat an extension to your gang submit idea.
> >>> Either I'm missing something, or this is just exactly what we
> >>> documented already with userspace fences in general, and how you can't
> >>> have a dma_fence depend upon a userspace (or hmm_fence).
> >>>
> >>> My gang scheduling idea is really just an alternative for what you
> >>> have listed as item 3 above. Instead of requiring preempt or requiring
> >>> guaranteed forward progress of some other sorts we flush out any
> >>> pending dma_fence request. But _only_ those which would get stalled by
> >>> the job we're running, so high-priority sdma requests we need in the
> >>> kernel to shuffle buffers around are still all ok. This would be
> >>> needed if you're hw can't preempt, and you also have shared engines
> >>> between compute and gfx, so reserving CUs won't solve the problem
> >>> either.
> >>>
> >>> What I don't mean with my gang scheduling is a completely exclusive
> >>> mode between hmm_fence and dma_fence, since that would prevent us from
> >>> using copy engines and dma_fence in the kernel to shuffle memory
> >>> around for hmm jobs. And that would suck, even on compute-only
> >>> workloads. Maybe I should rename "gang scheduling" to "engine flush"
> >>> or something like that.
> >> Yeah, "engine flush" makes it much more clearer.
> >>
> >> What I wanted to emphasis is that we have to mix dma_fences and
> >> hmm_fences running at the same time on the same hardware fighting over
> >> the same resources.
> >>
> >> E.g. even on the newest hardware multimedia engines can't handle page
> >> faults, so video decoding/encoding will still produce dma_fences.
> > Well we also have to mix them so the kernel can shovel data around
> > using copy engines. Plus we have to mix it at the overall subsystem
> > level because I'm not sure SoC-class gpus will ever get here,
> > definitely aren't yet there for sure.
> >
> >>> I think the basics of userspace or hmm_fence or whatever we'll call it
> >>> we've documented already here:
> >>>
> >>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
> >> This talks about the restrictions we have for dma_fences and why
> >> infinite fences (even as hmm_fence) will never work.
> >>
> >> But it doesn't talk about how to handle implicit or explicit
> >> dependencies with something like hmm_fences.
> >>
> >> In other words my proposal above allows for hmm_fences to show up in
> >> dma_reservation objects and are used together with all this explicit
> >> synchronization we still have with only a medium amount of work :)
> > Oh. I don't think we should put any hmm_fence or other infinite fence
> > into a dma_resv object. At least not into the current dma_resv object,
> > because then we have that infinite fences problem everywhere, and very
> > hard to audit.
>
> Yes, exactly. That's why this rules how to mix them or rather not mix them.
>
> > What we could do is add new hmm_fence only slots for implicit sync,
>
> Yeah, we would have them separated to the dma_fence objects.
>
> > but I think consensus is that implicit sync is bad, never do it again.
> > Last time around (for timeline syncobj) we've also pushed the waiting
> > on cross-over to userspace, and I think that's the right option, so we
> > need userspace to understand the hmm fence anyway. At that point we
> > might as well bite the bullet and do another round of wayland/dri
> > protocols.
>
> As you said I don't see this happening in the next 5 years either.

Well I guess we'll need to get started with that then, when you guys need it.

> So I think we have to somehow solve this in the kernel or we will go in
> circles all the time.
>
> > So from that pov I think the kernel should at most deal with an
> > hmm_fence for cross-process communication and maybe some standard wait
> > primitives (for userspace to use, not for the kernel).
> >
> > The only use case this would forbid is using page faults for legacy
> > implicit/explicit dma_fence synced workloads, and I think that's
> > perfectly ok to not allow. Especially since the motivation here for
> > all this is compute, and compute doesn't pass around dma_fences
> > anyway.
>
> As Alex said we will rather soon see this for gfx as well and we most
> likely will see combinations of old dma_fence based integrated graphics
> with new dedicated GPUs.
>
> So I don't think we can say we reduce the problem to compute and don't
> support anything else.

I'm not against pagefaults for gfx, just against pushing the magic into the
kernel. I don't think that works, because it means we add stall points
where userspace, especially vk userspace, really doesn't want them. So
same as with timeline syncobj, we need to push the compat work into
userspace.

There's going to be a few stall points:
- fully new stack, we wait for the userspace fence in the atomic
commit path (which we can, if we're really careful, since we pin all
buffers upfront and so there's no risk)
- userspace fencing gpu in the client, compositor protocol can pass
around userspace fences, but the compositor still uses dma_fence for
itself. There's some stalling in the compositor, which it does already
anyway when it's collecting new frames from clients
- userspace fencing gpu in the client, but no compositor protocol: We
wait in the swapchain, but in a separate thread so that nothing blocks
that shouldn't block
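
For the last case, a very rough userspace sketch (made-up fence
representation, spinning instead of a proper futex or ioctl wait, no
error handling) of what "wait in a separate thread" could look like:

#include <pthread.h>
#include <sched.h>
#include <stdint.h>

/* Made-up userspace fence: signalled once *addr >= value (wraparound
 * ignored here for brevity). */
struct ufence {
        const volatile uint64_t *addr;
        uint64_t value;
};

struct present_req {
        struct ufence done;                     /* client rendering done */
        void (*queue_flip)(void *image);        /* hands buffer to compositor/KMS */
        void *image;
};

/* Runs on its own thread so the app's render loop never blocks. */
static void *present_thread(void *arg)
{
        struct present_req *req = arg;

        while (*req->done.addr < req->done.value)
                sched_yield();          /* placeholder for a real wait */

        req->queue_flip(req->image);
        return NULL;
}

/* usage: pthread_create(&tid, NULL, present_thread, req); */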

If we instead go with "magic waits in the kernel behind userspace's
back", like what your item 6 would imply, then we're not really
solving anything.

For the actual implementation I think the best would be an extension of
drm_syncobj. Those already have at least conceptually future/infinite
fences, and we already have fd passing, so we "just" need some protocol
to pass them around. Plus we could use the same uapi for timeline
syncobj using dma_fence as for hmm_fence, so it's also easier to transition
userspace to the new world since we don't need the new hw capability
to roll out the new uapi and protocols.

That's not that hard to roll out, and technically a lot better than
hacking up dma_resv and hoping we don't end up stalling in wrong
places, which sounds very "eeeek" to me :-)

Cheers, Daniel

> Regards,
> Christian.
>
> >
> >>> I think the only thing missing is clarifying a bit what you have under
> >>> item 3, i.e. how do we make sure there's no accidental hidden
> >>> dependency between hmm_fence and dma_fence. Maybe a subsection about
> >>> gpu page fault handling?
> >> The real improvement is item 6. The problem with it is that it requires
> >> auditing all occasions when we create dma_fences so that we don't
> >> accidentally depend on an HMM fence.
> > We have that rule already, it's the "dma_fence must not depend upon an
> > infinite fence anywhere" rule we documented last summer. So that
> > doesn't feel new.
> > -Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> Or are we still talking past each another a bit here?
> >>> -Daniel
> >>>
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> -Daniel
> >>>>>
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 15:40                       ` Daniel Vetter
@ 2021-01-14 16:01                         ` Christian König
  2021-01-14 16:36                           ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Christian König @ 2021-01-14 16:01 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list,
	Jerome Glisse, dri-devel

Am 14.01.21 um 16:40 schrieb Daniel Vetter:
> [SNIP]
>> So I think we have to somehow solve this in the kernel or we will go in
>> circles all the time.
>>
>>> So from that pov I think the kernel should at most deal with an
>>> hmm_fence for cross-process communication and maybe some standard wait
>>> primitives (for userspace to use, not for the kernel).
>>>
>>> The only use case this would forbid is using page faults for legacy
>>> implicit/explicit dma_fence synced workloads, and I think that's
>>> perfectly ok to not allow. Especially since the motivation here for
>>> all this is compute, and compute doesn't pass around dma_fences
>>> anyway.
>> As Alex said we will rather soon see this for gfx as well and we most
>> likely will see combinations of old dma_fence based integrated graphics
>> with new dedicated GPUs.
>>
>> So I don't think we can say we reduce the problem to compute and don't
>> support anything else.
> I'm not against pagefaults for gfx, just in pushing the magic into the
> kernel. I don't think that works, because it means we add stall points
> where usespace, especially vk userspace, really doesn't want it. So
> same way like timeline syncobj, we need to push the compat work into
> userspace.
>
> There's going to be a few stall points:
> - fully new stack, we wait for the userspace fence in the atomic
> commit path (which we can, if we're really careful, since we pin all
> buffers upfront and so there's no risk)
> - userspace fencing gpu in the client, compositor protocol can pass
> around userspace fences, but the compositor still uses dma_fence for
> itself. There's some stalling in the compositor, which it does already
> anyway when it's collecting new frames from clients
> - userspace fencing gpu in the client, but no compositor protocol: We
> wait in the swapchain, but in a separate thread so that nothing blocks
> that shouldn't block
>
> If we instead go with "magic waits in the kernel behind userspace's
> back", like what your item 6 would imply, then we're not really
> solving anything.
>
> For actual implementation I think the best would be an extension of
> drm_syncobj. Those already have at least conceptually future/infinite
> fences, and we already have fd passing, so "just" need some protocol
> to pass them around. Plus we could use the same uapi for timeline
> syncobj using dma_fence as for hmm_fence, so also easier to transition
> for userspace to the new world since don't need the new hw capability
> to roll out the new uapi and protocols.
>
> That's not that hard to roll out, and technically a lot better than
> hacking up dma_resv and hoping we don't end up stalling in wrong
> places, which sounds very "eeeek" to me :-)

Yeah, that's what I totally agree upon :)

My idea was just the last resort since we are mixing userspace sync and 
memory management so creatively here.

Stalling in userspace will probably get some push back as well, but 
maybe not as much as stalling in the kernel.

Ok, if we can at least remove implicit sync from the picture, then the 
question remains: how do we integrate HMM fences into drm_syncobj?

Regards,
Christian.

>
> Cheers, Daniel
>


* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 16:01                         ` Christian König
@ 2021-01-14 16:36                           ` Daniel Vetter
  2021-01-14 19:08                             ` Christian König
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14 16:36 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list,
	Jerome Glisse, dri-devel

On Thu, Jan 14, 2021 at 5:01 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 14.01.21 um 16:40 schrieb Daniel Vetter:
> > [SNIP]
> >> So I think we have to somehow solve this in the kernel or we will go in
> >> circles all the time.
> >>
> >>> So from that pov I think the kernel should at most deal with an
> >>> hmm_fence for cross-process communication and maybe some standard wait
> >>> primitives (for userspace to use, not for the kernel).
> >>>
> >>> The only use case this would forbid is using page faults for legacy
> >>> implicit/explicit dma_fence synced workloads, and I think that's
> >>> perfectly ok to not allow. Especially since the motivation here for
> >>> all this is compute, and compute doesn't pass around dma_fences
> >>> anyway.
> >> As Alex said we will rather soon see this for gfx as well and we most
> >> likely will see combinations of old dma_fence based integrated graphics
> >> with new dedicated GPUs.
> >>
> >> So I don't think we can say we reduce the problem to compute and don't
> >> support anything else.
> > I'm not against pagefaults for gfx, just in pushing the magic into the
> > kernel. I don't think that works, because it means we add stall points
> > where usespace, especially vk userspace, really doesn't want it. So
> > same way like timeline syncobj, we need to push the compat work into
> > userspace.
> >
> > There's going to be a few stall points:
> > - fully new stack, we wait for the userspace fence in the atomic
> > commit path (which we can, if we're really careful, since we pin all
> > buffers upfront and so there's no risk)
> > - userspace fencing gpu in the client, compositor protocol can pass
> > around userspace fences, but the compositor still uses dma_fence for
> > itself. There's some stalling in the compositor, which it does already
> > anyway when it's collecting new frames from clients
> > - userspace fencing gpu in the client, but no compositor protocol: We
> > wait in the swapchain, but in a separate thread so that nothing blocks
> > that shouldn't block
> >
> > If we instead go with "magic waits in the kernel behind userspace's
> > back", like what your item 6 would imply, then we're not really
> > solving anything.
> >
> > For actual implementation I think the best would be an extension of
> > drm_syncobj. Those already have at least conceptually future/infinite
> > fences, and we already have fd passing, so "just" need some protocol
> > to pass them around. Plus we could use the same uapi for timeline
> > syncobj using dma_fence as for hmm_fence, so also easier to transition
> > for userspace to the new world since don't need the new hw capability
> > to roll out the new uapi and protocols.
> >
> > That's not that hard to roll out, and technically a lot better than
> > hacking up dma_resv and hoping we don't end up stalling in wrong
> > places, which sounds very "eeeek" to me :-)
>
> Yeah, that's what I totally agree upon :)
>
> My idea was just the last resort since we are mixing userspace sync and
> memory management so creatively here.
>
> Stalling in userspace will probably get some push back as well, but
> maybe not as much as stalling in the kernel.

I guess we need to have last-resort stalling in the kernel, but no
more than what we do with drm_syncobj future fences right now. Like
when anything asks for a dma_fence out of an hmm_fence drm_syncobj, we
just stall until the hmm_fence is signalled, and then create a
dma_fence that's already signalled and return that to the caller.
Obviously this shouldn't happen, since anyone who's timeline-aware will
first check whether the fence has at least materialized and stall
somewhere more useful instead.
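
A very hand-wavy sketch of that conversion (hmm_fence and its wait helper
are just placeholders here; only dma_fence_get_stub() is existing code):

#include <linux/dma-fence.h>

/* placeholder type and wait helper, not actual code */
struct hmm_fence;
void hmm_fence_wait(struct hmm_fence *fence);

static struct dma_fence *materialize_dma_fence(struct hmm_fence *fence)
{
        /* last resort: block until the userspace/hmm fence has signalled */
        hmm_fence_wait(fence);

        /* then hand the caller a dma_fence that is already signalled */
        return dma_fence_get_stub();
}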

> Ok if we can at least remove implicit sync from the picture then the
> question remains how do we integrate HMM into drm_syncobj then?

From an uapi pov probably just an ioctl to create an hmm drm_syncobj,
and a syncobj ioctl to query whether it's a hmm_fence or dma_fence
syncobj, so that userspace can be a bit more clever with where it
should stall - for an hmm_fence the stall will most likely be directly
on the gpu in many cases (so the ioctl should also give us all the
details about that if it's an hmm fence).
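
Strawman sketch only, every name below is made up and nothing like this
exists in the uapi headers today:

#include <linux/types.h>

/* hypothetical query result for an hmm-capable drm_syncobj */
struct drm_syncobj_hmm_info {
        __u32 handle;           /* syncobj handle that was queried */
        __u32 fence_type;       /* 0 = dma_fence, 1 = hmm/userspace fence */
        __u64 gpu_addr;         /* where the gpu can wait on it directly */
        __u64 value;            /* seqno that will signal the fence */
};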

I think the real work is going through all the hardware and trying to
figure out what the common ground for userspace fences are. Stuff like
can they be in system memory, or need something special (wc maybe, but
I hope system memory should be fine for everyone), and how you count,
wrap and compare. I also have no idea how/if we can optimize cpu
waits across different drivers.
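
For the count/wrap/compare part I'd hope the usual wrap-safe seqno trick is
enough, roughly like this (sketch, assuming a 32-bit monotonic counter):

#include <linux/types.h>

/* true once the fence counter has reached (or passed) the wanted value;
 * the signed difference handles wrap-around of the counter */
static inline bool ufence_passed(u32 cur, u32 wanted)
{
        return (s32)(cur - wanted) >= 0;
}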

Plus ideally we get some actual wayland protocol going for passing
drm_syncobj around, so we can test it.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 13:37             ` HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD) Christian König
  2021-01-14 13:57               ` Daniel Vetter
@ 2021-01-14 16:51               ` Jerome Glisse
  2021-01-14 21:13                 ` Felix Kuehling
  1 sibling, 1 reply; 84+ messages in thread
From: Jerome Glisse @ 2021-01-14 16:51 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, dri-devel, amd-gfx list

On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
> > [SNIP]
> > > > I had a new idea, i wanted to think more about it but have not yet,
> > > > anyway here it is. Adding a new callback to dma fence which ask the
> > > > question can it dead lock ? Any time a GPU driver has pending page
> > > > fault (ie something calling into the mm) it answer yes, otherwise
> > > > no. The GPU shrinker would ask the question before waiting on any
> > > > dma-fence and back off if it gets a yes. Shrinker can still try many
> > > > dma buf object for which it does not get a yes on associated fence.
> > > > 
> > > > This does not solve the mmu notifier case, for this you would just
> > > > invalidate the gem userptr object (with a flag but not releasing the
> > > > page refcount) but you would not wait for the GPU (ie no dma fence
> > > > wait in that code path anymore). The userptr API never really made
> > > > the contract that it will always be in sync with the mm view of the
> > > > world so if different page get remapped to same virtual address
> > > > while GPU is still working with the old pages it should not be an
> > > > issue (it would not be in our usage of userptr for compositor and
> > > > what not).
> > > The current working idea in my mind goes into a similar direction.
> > > 
> > > But instead of a callback I'm adding a complete new class of HMM fences.
> > > 
> > > Waiting in the MMU notifier, scheduler, TTM etc etc is only allowed for
> > > the dma_fences and HMM fences are ignored in container objects.
> > > 
> > > When you handle an implicit or explicit synchronization request from
> > > userspace you need to block for HMM fences to complete before taking any
> > > resource locks.
> > Isn't that what I call gang scheduling? I.e. you either run in HMM
> > mode, or in legacy fencing mode (whether implicit or explicit doesn't
> > really matter I think). By forcing that split we avoid the problem,
> > but it means occasionally full stalls on mixed workloads.
> > 
> > But that's not what Jerome wants (afaiui at least), I think his idea
> > is to track the reverse dependencies of all the fences floating
> > around, and then skip evicting an object if you have to wait for any
> > fence that is problematic for the current calling context. And I don't
> > think that's very feasible in practice.
> > 
> > So what kind of hmm fences do you have in mind here?
> 
> It's a bit more relaxed than your gang schedule.
> 
> See the requirements are as follow:
> 
> 1. dma_fences never depend on hmm_fences.
> 2. hmm_fences can never preempt dma_fences.
> 3. dma_fences must be able to preempt hmm_fences or we always reserve enough
> hardware resources (CUs) to guarantee forward progress of dma_fences.
> 
> Critical sections are MMU notifiers, page faults, GPU schedulers and
> dma_reservation object locks.
> 
> 4. It is valid to wait for a dma_fences in critical sections.
> 5. It is not valid to wait for hmm_fences in critical sections.
> 
> Fence creation either happens during command submission or by adding
> something like a barrier or signal command to your userspace queue.
> 
> 6. If we have an hmm_fence as implicit or explicit dependency for creating a
> dma_fence we must wait for that before taking any locks or reserving
> resources.
> 7. If we have a dma_fence as implicit or explicit dependency for creating an
> hmm_fence we can wait later on. So busy waiting or special WAIT hardware
> commands are valid.
> 
> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the same
> time on the hardware.
> 
> In other words we can have a high priority gfx queue running jobs based on
> dma_fences and a low priority compute queue running jobs based on
> hmm_fences.
> 
> Only when we switch from hmm_fence to dma_fence we need to block the
> submission until all the necessary resources (both memory as well as CUs)
> are available.
> 
> This is somewhat an extension to your gang submit idea.

What is an hmm_fence? You should not have fences with hmm at all.
So I am kind of scared now.

Cheers,
Jérôme

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 16:36                           ` Daniel Vetter
@ 2021-01-14 19:08                             ` Christian König
  2021-01-14 20:09                               ` Daniel Vetter
  0 siblings, 1 reply; 84+ messages in thread
From: Christian König @ 2021-01-14 19:08 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, amd-gfx list,
	Jerome Glisse, dri-devel

Am 14.01.21 um 17:36 schrieb Daniel Vetter:
> On Thu, Jan 14, 2021 at 5:01 PM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 14.01.21 um 16:40 schrieb Daniel Vetter:
>>> [SNIP]
>>>> So I think we have to somehow solve this in the kernel or we will go in
>>>> circles all the time.
>>>>
>>>>> So from that pov I think the kernel should at most deal with an
>>>>> hmm_fence for cross-process communication and maybe some standard wait
>>>>> primitives (for userspace to use, not for the kernel).
>>>>>
>>>>> The only use case this would forbid is using page faults for legacy
>>>>> implicit/explicit dma_fence synced workloads, and I think that's
>>>>> perfectly ok to not allow. Especially since the motivation here for
>>>>> all this is compute, and compute doesn't pass around dma_fences
>>>>> anyway.
>>>> As Alex said we will rather soon see this for gfx as well and we most
>>>> likely will see combinations of old dma_fence based integrated graphics
>>>> with new dedicated GPUs.
>>>>
>>>> So I don't think we can say we reduce the problem to compute and don't
>>>> support anything else.
>>> I'm not against pagefaults for gfx, just in pushing the magic into the
>>> kernel. I don't think that works, because it means we add stall points
>>> where userspace, especially vk userspace, really doesn't want it. So
>>> same way like timeline syncobj, we need to push the compat work into
>>> userspace.
>>>
>>> There's going to be a few stall points:
>>> - fully new stack, we wait for the userspace fence in the atomic
>>> commit path (which we can, if we're really careful, since we pin all
>>> buffers upfront and so there's no risk)
>>> - userspace fencing gpu in the client, compositor protocol can pass
>>> around userspace fences, but the compositor still uses dma_fence for
>>> itself. There's some stalling in the compositor, which it does already
>>> anyway when it's collecting new frames from clients
>>> - userspace fencing gpu in the client, but no compositor protocol: We
>>> wait in the swapchain, but in a separate thread so that nothing blocks
>>> that shouldn't block
>>>
>>> If we instead go with "magic waits in the kernel behind userspace's
>>> back", like what your item 6 would imply, then we're not really
>>> solving anything.
>>>
>>> For actual implementation I think the best would be an extension of
>>> drm_syncobj. Those already have at least conceptually future/infinite
>>> fences, and we already have fd passing, so "just" need some protocol
>>> to pass them around. Plus we could use the same uapi for timeline
>>> syncobj using dma_fence as for hmm_fence, so also easier to transition
>>> for userspace to the new world since don't need the new hw capability
>>> to roll out the new uapi and protocols.
>>>
>>> That's not that hard to roll out, and technically a lot better than
>>> hacking up dma_resv and hoping we don't end up stalling in wrong
>>> places, which sounds very "eeeek" to me :-)
>> Yeah, that's what I totally agree upon :)
>>
>> My idea was just the last resort since we are mixing userspace sync and
>> memory management so creatively here.
>>
>> Stalling in userspace will probably get some push back as well, but
>> maybe not as much as stalling in the kernel.
> I guess we need to have last-resort stalling in the kernel, but no
> more than what we do with drm_syncobj future fences right now. Like
> when anything asks for a dma_fence out of an hmm_fence drm_syncobj, we
> just stall until the hmm_fence is signalled, and then create a
> dma_fence that's already signalled and return that to the caller.

Good idea. BTW: We should somehow teach lockdep that this 
materialization of any future fence should not happen while holding a 
reservation lock?

> Obviously this shouldn't happen, since anyone who's timeline-aware will
> first check whether the fence has at least materialized and stall
> somewhere more useful instead.

Well if I'm not completely mistaken it should help with existing stuff 
like an implicit fence for atomic modeset etc...

>> Ok if we can at least remove implicit sync from the picture then the
>> question remains how do we integrate HMM into drm_syncobj then?
>  From an uapi pov probably just an ioctl to create an hmm drm_syncobj,
> and a syncobj ioctl to query whether it's a hmm_fence or dma_fence
> syncobj, so that userspace can be a bit more clever with where it
> should stall - for an hmm_fence the stall will most likely be directly
> on the gpu in many cases (so the ioctl should also give us all the
> details about that if it's an hmm fence).
>
> I think the real work is going through all the hardware and trying to
> figure out what the common ground for userspace fences are. Stuff like
> can they be in system memory, or need something special (wc maybe, but
> I hope system memory should be fine for everyone), and how you count,
> wrap and compare. I also have no idea how/if we can optimize cpu
> waits across different drivers.

I think that this is absolutely hardware-dependent. For example, AMD 
will probably have handles, so that the hardware scheduler can counter 
problems like priority inversion.

What we should probably do is handle this similarly to how DMA-buf is 
handled - if it's the same driver and device behind the drm_syncobj, we 
can use the same handle for both sides.

If it's a different driver or device, we go through some CPU round trip 
for the signaling.

> Plus ideally we get some actual wayland protocol going for passing
> drm_syncobj around, so we can test it.

And DRI3 :)

Christian.

> -Daniel

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 19:08                             ` Christian König
@ 2021-01-14 20:09                               ` Daniel Vetter
  0 siblings, 0 replies; 84+ messages in thread
From: Daniel Vetter @ 2021-01-14 20:09 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Sierra, Yang, Philip, Felix Kuehling, dri-devel,
	Jerome Glisse, amd-gfx list

On Thu, Jan 14, 2021 at 08:08:06PM +0100, Christian König wrote:
> Am 14.01.21 um 17:36 schrieb Daniel Vetter:
> > On Thu, Jan 14, 2021 at 5:01 PM Christian König
> > <christian.koenig@amd.com> wrote:
> > > Am 14.01.21 um 16:40 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > > So I think we have to somehow solve this in the kernel or we will go in
> > > > > circles all the time.
> > > > > 
> > > > > > So from that pov I think the kernel should at most deal with an
> > > > > > hmm_fence for cross-process communication and maybe some standard wait
> > > > > > primitives (for userspace to use, not for the kernel).
> > > > > > 
> > > > > > The only use case this would forbid is using page faults for legacy
> > > > > > implicit/explicit dma_fence synced workloads, and I think that's
> > > > > > perfectly ok to not allow. Especially since the motivation here for
> > > > > > all this is compute, and compute doesn't pass around dma_fences
> > > > > > anyway.
> > > > > As Alex said we will rather soon see this for gfx as well and we most
> > > > > likely will see combinations of old dma_fence based integrated graphics
> > > > > with new dedicated GPUs.
> > > > > 
> > > > > So I don't think we can say we reduce the problem to compute and don't
> > > > > support anything else.
> > > > I'm not against pagefaults for gfx, just in pushing the magic into the
> > > > kernel. I don't think that works, because it means we add stall points
> > > > where userspace, especially vk userspace, really doesn't want it. So
> > > > same way like timeline syncobj, we need to push the compat work into
> > > > userspace.
> > > > 
> > > > There's going to be a few stall points:
> > > > - fully new stack, we wait for the userspace fence in the atomic
> > > > commit path (which we can, if we're really careful, since we pin all
> > > > buffers upfront and so there's no risk)
> > > > - userspace fencing gpu in the client, compositor protocol can pass
> > > > around userspace fences, but the compositor still uses dma_fence for
> > > > itself. There's some stalling in the compositor, which it does already
> > > > anyway when it's collecting new frames from clients
> > > > - userspace fencing gpu in the client, but no compositor protocol: We
> > > > wait in the swapchain, but in a separate thread so that nothing blocks
> > > > that shouldn't block
> > > > 
> > > > If we instead go with "magic waits in the kernel behind userspace's
> > > > back", like what your item 6 would imply, then we're not really
> > > > solving anything.
> > > > 
> > > > For actual implementation I think the best would be an extension of
> > > > drm_syncobj. Those already have at least conceptually future/infinite
> > > > fences, and we already have fd passing, so "just" need some protocol
> > > > to pass them around. Plus we could use the same uapi for timeline
> > > > syncobj using dma_fence as for hmm_fence, so also easier to transition
> > > > for userspace to the new world since don't need the new hw capability
> > > > to roll out the new uapi and protocols.
> > > > 
> > > > That's not that hard to roll out, and technically a lot better than
> > > > hacking up dma_resv and hoping we don't end up stalling in wrong
> > > > places, which sounds very "eeeek" to me :-)
> > > Yeah, that's what I totally agree upon :)
> > > 
> > > My idea was just the last resort since we are mixing userspace sync and
> > > memory management so creatively here.
> > > 
> > > Stalling in userspace will probably get some push back as well, but
> > > maybe not as much as stalling in the kernel.
> > I guess we need to have last-resort stalling in the kernel, but no
> > more than what we do with drm_syncobj future fences right now. Like
> > when anything asks for a dma_fence out of an hmm_fence drm_syncobj, we
> > just stall until the hmm_fence is signalled, and then create a
> > dma_fence that's already signalled and return that to the caller.
> 
> Good idea. BTW: We should somehow teach lockdep that this materialization of
> any future fence should not happen while holding a reservation lock?

Good idea, should be easy to add (although the explanation why it works
needs a comment).

> > Obviously this shouldn't happen, since anyone who's timeline-aware will
> > first check whether the fence has at least materialized and stall
> > somewhere more useful instead.
> 
> Well if I'm not completely mistaken it should help with existing stuff like
> an implicit fence for atomic modeset etc...

Modeset is special:
- we fully pin buffers before we even start waiting. That means the loop
  can't close, since no one can try to evict our pinned buffer and would
  hence end up waiting on our hmm fence. We also only unpin them after
  everything is done.

- there's out-fences, but as long as we require that the in and out
  fences are of the same type that should be all fine. Also since the
  explicit in/out fence stuff is there already it shouldn't be too hard to
  add support for syncobj fences without touching a lot of drivers - all
  the ones that use the atomic commit helpers should Just Work.

> > > Ok if we can at least remove implicit sync from the picture then the
> > > question remains how do we integrate HMM into drm_syncobj then?
> >  From an uapi pov probably just an ioctl to create an hmm drm_syncobj,
> > and a syncobj ioctl to query whether it's a hmm_fence or dma_fence
> > syncobj, so that userspace can be a bit more clever with where it
> > should stall - for an hmm_fence the stall will most likely be directly
> > on the gpu in many cases (so the ioctl should also give us all the
> > details about that if it's an hmm fence).
> > 
> > I think the real work is going through all the hardware and trying to
> > figure out what the common ground for userspace fences are. Stuff like
> > can they be in system memory, or need something special (wc maybe, but
> > I hope system memory should be fine for everyone), and how you count,
> > wrap and compare. I also have no idea how/if we can optimize cpu
> > waits across different drivers.
> 
> I think that this is absolutely hardware-dependent. For example, AMD
> will probably have handles, so that the hardware scheduler can counter
> problems like priority inversion.
> 
> What we should probably do is handle this similarly to how DMA-buf is
> handled - if it's the same driver and device behind the drm_syncobj, we can
> use the same handle for both sides.
> 
> If it's a different driver or device, we go through some CPU round trip for
> the signaling.

I think we should try to be slightly more standardized; dma-buf was a bit
too much of a free-for-all. But maybe that's not really possible, since we tried
this with dma-fence and ended up with exactly the situation you're
describing for hmm fences.

> > Plus ideally we get some actual wayland protocol going for passing
> > drm_syncobj around, so we can test it.
> 
> And DRI3 :)
 
Yeah. Well, probably the Present extension, since that's the thing that's doing
the flipping. At least we only have to really care about XWayland for
that, on this time horizon.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 16:51               ` Jerome Glisse
@ 2021-01-14 21:13                 ` Felix Kuehling
  2021-01-15  7:47                   ` Christian König
  0 siblings, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-01-14 21:13 UTC (permalink / raw)
  To: Jerome Glisse, Christian König
  Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list

Am 2021-01-14 um 11:51 a.m. schrieb Jerome Glisse:
> On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
>> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
>>> [SNIP]
>>>>> I had a new idea, i wanted to think more about it but have not yet,
>>>>> anyway here it is. Adding a new callback to dma fence which ask the
>>>>> question can it dead lock ? Any time a GPU driver has pending page
>>>>> fault (ie something calling into the mm) it answer yes, otherwise
>>>>> no. The GPU shrinker would ask the question before waiting on any
>>>>> dma-fence and back off if it gets a yes. Shrinker can still try many
>>>>> dma buf object for which it does not get a yes on associated fence.
>>>>>
>>>>> This does not solve the mmu notifier case, for this you would just
>>>>> invalidate the gem userptr object (with a flag but not releasing the
>>>>> page refcount) but you would not wait for the GPU (ie no dma fence
>>>>> wait in that code path anymore). The userptr API never really made
>>>>> the contract that it will always be in sync with the mm view of the
>>>>> world so if different page get remapped to same virtual address
>>>>> while GPU is still working with the old pages it should not be an
>>>>> issue (it would not be in our usage of userptr for compositor and
>>>>> what not).
>>>> The current working idea in my mind goes into a similar direction.
>>>>
>>>> But instead of a callback I'm adding a complete new class of HMM fences.
>>>>
>>>> Waiting in the MMU notifier, scheduler, TTM etc etc is only allowed for
>>>> the dma_fences and HMM fences are ignored in container objects.
>>>>
>>>> When you handle an implicit or explicit synchronization request from
>>>> userspace you need to block for HMM fences to complete before taking any
>>>> resource locks.
>>> Isn't that what I call gang scheduling? I.e. you either run in HMM
>>> mode, or in legacy fencing mode (whether implicit or explicit doesn't
>>> really matter I think). By forcing that split we avoid the problem,
>>> but it means occasionally full stalls on mixed workloads.
>>>
>>> But that's not what Jerome wants (afaiui at least), I think his idea
>>> is to track the reverse dependencies of all the fences floating
>>> around, and then skip evicting an object if you have to wait for any
>>> fence that is problematic for the current calling context. And I don't
>>> think that's very feasible in practice.
>>>
>>> So what kind of hmm fences do you have in mind here?
>> It's a bit more relaxed than your gang schedule.
>>
>> See the requirements are as follow:
>>
>> 1. dma_fences never depend on hmm_fences.
>> 2. hmm_fences can never preempt dma_fences.
>> 3. dma_fences must be able to preempt hmm_fences or we always reserve enough
>> hardware resources (CUs) to guarantee forward progress of dma_fences.
>>
>> Critical sections are MMU notifiers, page faults, GPU schedulers and
>> dma_reservation object locks.
>>
>> 4. It is valid to wait for a dma_fences in critical sections.
>> 5. It is not valid to wait for hmm_fences in critical sections.
>>
>> Fence creation either happens during command submission or by adding
>> something like a barrier or signal command to your userspace queue.
>>
>> 6. If we have an hmm_fence as implicit or explicit dependency for creating a
>> dma_fence we must wait for that before taking any locks or reserving
>> resources.
>> 7. If we have a dma_fence as implicit or explicit dependency for creating an
>> hmm_fence we can wait later on. So busy waiting or special WAIT hardware
>> commands are valid.
>>
>> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the same
>> time on the hardware.
>>
>> In other words we can have a high priority gfx queue running jobs based on
>> dma_fences and a low priority compute queue running jobs based on
>> hmm_fences.
>>
>> Only when we switch from hmm_fence to dma_fence we need to block the
>> submission until all the necessary resources (both memory as well as CUs)
>> are available.
>>
>> This is somewhat an extension to your gang submit idea.
> What is an hmm_fence? You should not have fences with hmm at all.
> So I am kind of scared now.

I kind of had the same question trying to follow Christian and Daniel's
discussion. I think an HMM fence would be any fence resulting from the
completion of a user mode operation in a context with HMM-based memory
management that may stall indefinitely due to page faults.

But on a hardware engine that cannot preempt page-faulting work and has
not reserved resources to guarantee forward progress for kernel jobs, I
think all fences will need to be HMM fences, because any work submitted
to such an engine can stall by getting stuck behind a stalled user mode
operation.

So for example, you have a DMA engine that can preempt during page
faults, but a graphics engine that cannot. Then work submitted to the
DMA engine can use dma_fence. But work submitted to the graphics engine
must use hmm_fence. To avoid deadlocks, dma_fences must never depend on
hmm_fences and resolution of page faults must never depend on hmm_fences.
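
A rough sketch of what that rule could look like at submission time (all
names below are hypothetical, purely to illustrate the dependency check):

/* sketch: jobs that produce dma_fences must not wait on hmm_fences */
static int sketch_add_job_dependency(struct sketch_job *job,
                                     struct dma_fence *dep)
{
        if (job->produces_dma_fence && sketch_fence_is_hmm(dep))
                return -EINVAL; /* resolve the hmm_fence before submitting */

        return sketch_queue_dependency(job, dep);
}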

Regards,
  Felix


>
> Cheers,
> Jérôme
>

* Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)
  2021-01-14 21:13                 ` Felix Kuehling
@ 2021-01-15  7:47                   ` Christian König
  0 siblings, 0 replies; 84+ messages in thread
From: Christian König @ 2021-01-15  7:47 UTC (permalink / raw)
  To: Felix Kuehling, Jerome Glisse
  Cc: Alex Sierra, Yang, Philip, dri-devel, amd-gfx list

Am 14.01.21 um 22:13 schrieb Felix Kuehling:
> Am 2021-01-14 um 11:51 a.m. schrieb Jerome Glisse:
>> On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
>>> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
>>>> [SNIP]
>>>>>> I had a new idea, i wanted to think more about it but have not yet,
>>>>>> anyway here it is. Adding a new callback to dma fence which ask the
>>>>>> question can it dead lock ? Any time a GPU driver has pending page
>>>>>> fault (ie something calling into the mm) it answer yes, otherwise
>>>>>> no. The GPU shrinker would ask the question before waiting on any
>>>>>> dma-fence and back off if it gets a yes. Shrinker can still try many
>>>>>> dma buf object for which it does not get a yes on associated fence.
>>>>>>
>>>>>> This does not solve the mmu notifier case, for this you would just
>>>>>> invalidate the gem userptr object (with a flag but not releasing the
>>>>>> page refcount) but you would not wait for the GPU (ie no dma fence
>>>>>> wait in that code path anymore). The userptr API never really made
>>>>>> the contract that it will always be in sync with the mm view of the
>>>>>> world so if different page get remapped to same virtual address
>>>>>> while GPU is still working with the old pages it should not be an
>>>>>> issue (it would not be in our usage of userptr for compositor and
>>>>>> what not).
>>>>> The current working idea in my mind goes into a similar direction.
>>>>>
>>>>> But instead of a callback I'm adding a complete new class of HMM fences.
>>>>>
>>>>> Waiting in the MMU notifier, scheduler, TTM etc etc is only allowed for
>>>>> the dma_fences and HMM fences are ignored in container objects.
>>>>>
>>>>> When you handle an implicit or explicit synchronization request from
>>>>> userspace you need to block for HMM fences to complete before taking any
>>>>> resource locks.
>>>> Isn't that what I call gang scheduling? I.e. you either run in HMM
>>>> mode, or in legacy fencing mode (whether implicit or explicit doesn't
>>>> really matter I think). By forcing that split we avoid the problem,
>>>> but it means occasionally full stalls on mixed workloads.
>>>>
>>>> But that's not what Jerome wants (afaiui at least), I think his idea
>>>> is to track the reverse dependencies of all the fences floating
>>>> around, and then skip evicting an object if you have to wait for any
>>>> fence that is problematic for the current calling context. And I don't
>>>> think that's very feasible in practice.
>>>>
>>>> So what kind of hmm fences do you have in mind here?
>>> It's a bit more relaxed than your gang schedule.
>>>
>>> See the requirements are as follow:
>>>
>>> 1. dma_fences never depend on hmm_fences.
>>> 2. hmm_fences can never preempt dma_fences.
>>> 3. dma_fences must be able to preempt hmm_fences or we always reserve enough
>>> hardware resources (CUs) to guarantee forward progress of dma_fences.
>>>
>>> Critical sections are MMU notifiers, page faults, GPU schedulers and
>>> dma_reservation object locks.
>>>
>>> 4. It is valid to wait for a dma_fences in critical sections.
>>> 5. It is not valid to wait for hmm_fences in critical sections.
>>>
>>> Fence creation either happens during command submission or by adding
>>> something like a barrier or signal command to your userspace queue.
>>>
>>> 6. If we have an hmm_fence as implicit or explicit dependency for creating a
>>> dma_fence we must wait for that before taking any locks or reserving
>>> resources.
>>> 7. If we have a dma_fence as implicit or explicit dependency for creating an
>>> hmm_fence we can wait later on. So busy waiting or special WAIT hardware
>>> commands are valid.
>>>
>>> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the same
>>> time on the hardware.
>>>
>>> In other words we can have a high priority gfx queue running jobs based on
>>> dma_fences and a low priority compute queue running jobs based on
>>> hmm_fences.
>>>
>>> Only when we switch from hmm_fence to dma_fence we need to block the
>>> submission until all the necessary resources (both memory as well as CUs)
>>> are available.
>>>
>>> This is somewhat an extension to your gang submit idea.
>> What is an hmm_fence? You should not have fences with hmm at all.
>> So I am kind of scared now.
> I kind of had the same question trying to follow Christian and Daniel's
> discussion. I think an HMM fence would be any fence resulting from the
> completion of a user mode operation in a context with HMM-based memory
> management that may stall indefinitely due to page faults.

It was more of a placeholder for something which can be used for 
inter-process synchronization.

> But on a hardware engine that cannot preempt page-faulting work and has
> not reserved resources to guarantee forward progress for kernel jobs, I
> think all fences will need to be HMM fences, because any work submitted
> to such an engine can stall by getting stuck behind a stalled user mode
> operation.
>
> So for example, you have a DMA engine that can preempt during page
> faults, but a graphics engine that cannot. Then work submitted to the
> DMA engine can use dma_fence. But work submitted to the graphics engine
> must use hmm_fence. To avoid deadlocks, dma_fences must never depend on
> hmm_fences and resolution of page faults must never depend on hmm_fences.

Yeah, it's a bit more complicated but in general that fits.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> Cheers,
>> Jérôme
>>

* Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
  2021-01-07  3:01 ` [PATCH 17/35] drm/amdkfd: register HMM device private zone Felix Kuehling
@ 2021-03-01  8:32   ` Daniel Vetter
  2021-03-01  8:46     ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-03-01  8:32 UTC (permalink / raw)
  To: Felix Kuehling, Thomas Hellström (VMware), Christian König
  Cc: Alex Sierra, Philip Yang, dri-devel, amd-gfx list

On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
> From: Philip Yang <Philip.Yang@amd.com>
>
> Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
> allocate vram backing pages for page migration.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>

So maybe I'm getting this all wrong, but I think that the current ttm
fault code relies on devmap pte entries (especially for hugepte entries)
to stop get_user_pages. But this only works if the pte happens to not
point at a range with devmap pages.

This patch here changes that, and so probably breaks this devmap pte hack
ttm is using?

If I'm not wrong here then I think we need to first fix up the ttm code to
not use the devmap hack anymore, before a ttm based driver can register a
dev_pagemap. Also adding Thomas since that just came up in another
discussion.
-Daniel


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |   3 +
>  drivers/gpu/drm/amd/amdkfd/Makefile        |   3 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c   | 101 +++++++++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h   |  48 ++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |   3 +
>  5 files changed, 157 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index db96d69eb45e..562bb5b69137 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -30,6 +30,7 @@
>  #include <linux/dma-buf.h>
>  #include "amdgpu_xgmi.h"
>  #include <uapi/linux/kfd_ioctl.h>
> +#include "kfd_migrate.h"
>
>  /* Total memory size in system memory and all GPU VRAM. Used to
>   * estimate worst case amount of memory to reserve for page tables
> @@ -170,12 +171,14 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>               }
>
>               kgd2kfd_device_init(adev->kfd.dev, adev_to_drm(adev), &gpu_resources);
> +             svm_migrate_init(adev);
>       }
>  }
>
>  void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev)
>  {
>       if (adev->kfd.dev) {
> +             svm_migrate_fini(adev);
>               kgd2kfd_device_exit(adev->kfd.dev);
>               adev->kfd.dev = NULL;
>       }
> diff --git a/drivers/gpu/drm/amd/amdkfd/Makefile b/drivers/gpu/drm/amd/amdkfd/Makefile
> index 387ce0217d35..a93301dbc464 100644
> --- a/drivers/gpu/drm/amd/amdkfd/Makefile
> +++ b/drivers/gpu/drm/amd/amdkfd/Makefile
> @@ -55,7 +55,8 @@ AMDKFD_FILES        := $(AMDKFD_PATH)/kfd_module.o \
>               $(AMDKFD_PATH)/kfd_dbgmgr.o \
>               $(AMDKFD_PATH)/kfd_smi_events.o \
>               $(AMDKFD_PATH)/kfd_crat.o \
> -             $(AMDKFD_PATH)/kfd_svm.o
> +             $(AMDKFD_PATH)/kfd_svm.o \
> +             $(AMDKFD_PATH)/kfd_migrate.o
>
>  ifneq ($(CONFIG_AMD_IOMMU_V2),)
>  AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> new file mode 100644
> index 000000000000..1950b86f1562
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -0,0 +1,101 @@
> +/*
> + * Copyright 2020 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/hmm.h>
> +#include <linux/dma-direction.h>
> +#include <linux/dma-mapping.h>
> +#include "amdgpu_sync.h"
> +#include "amdgpu_object.h"
> +#include "amdgpu_vm.h"
> +#include "amdgpu_mn.h"
> +#include "kfd_priv.h"
> +#include "kfd_svm.h"
> +#include "kfd_migrate.h"
> +
> +static void svm_migrate_page_free(struct page *page)
> +{
> +}
> +
> +/**
> + * svm_migrate_to_ram - CPU page fault handler
> + * @vmf: CPU vm fault vma, address
> + *
> + * Context: vm fault handler, mm->mmap_sem is taken
> + *
> + * Return:
> + * 0 - OK
> + * VM_FAULT_SIGBUS - notice application to have SIGBUS page fault
> + */
> +static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +     return VM_FAULT_SIGBUS;
> +}
> +
> +static const struct dev_pagemap_ops svm_migrate_pgmap_ops = {
> +     .page_free              = svm_migrate_page_free,
> +     .migrate_to_ram         = svm_migrate_to_ram,
> +};
> +
> +int svm_migrate_init(struct amdgpu_device *adev)
> +{
> +     struct kfd_dev *kfddev = adev->kfd.dev;
> +     struct dev_pagemap *pgmap;
> +     struct resource *res;
> +     unsigned long size;
> +     void *r;
> +
> +     /* Page migration works on Vega10 or newer */
> +     if (kfddev->device_info->asic_family < CHIP_VEGA10)
> +             return -EINVAL;
> +
> +     pgmap = &kfddev->pgmap;
> +     memset(pgmap, 0, sizeof(*pgmap));
> +
> +     /* TODO: register all vram to HMM for now.
> +      * should remove reserved size
> +      */
> +     size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
> +     res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
> +     if (IS_ERR(res))
> +             return -ENOMEM;
> +
> +     pgmap->type = MEMORY_DEVICE_PRIVATE;
> +     pgmap->res = *res;
> +     pgmap->ops = &svm_migrate_pgmap_ops;
> +     pgmap->owner = adev;
> +     pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
> +     r = devm_memremap_pages(adev->dev, pgmap);
> +     if (IS_ERR(r)) {
> +             pr_err("failed to register HMM device memory\n");
> +             return PTR_ERR(r);
> +     }
> +
> +     pr_info("HMM registered %ldMB device memory\n", size >> 20);
> +
> +     return 0;
> +}
> +
> +void svm_migrate_fini(struct amdgpu_device *adev)
> +{
> +     memunmap_pages(&adev->kfd.dev->pgmap);
> +}
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> new file mode 100644
> index 000000000000..98ab685d3e17
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
> @@ -0,0 +1,48 @@
> +/*
> + * Copyright 2020 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + */
> +
> +#ifndef KFD_MIGRATE_H_
> +#define KFD_MIGRATE_H_
> +
> +#include <linux/rwsem.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/sched/mm.h>
> +#include <linux/hmm.h>
> +#include "kfd_priv.h"
> +#include "kfd_svm.h"
> +
> +#if defined(CONFIG_DEVICE_PRIVATE)
> +int svm_migrate_init(struct amdgpu_device *adev);
> +void svm_migrate_fini(struct amdgpu_device *adev);
> +
> +#else
> +static inline int svm_migrate_init(struct amdgpu_device *adev)
> +{
> +     DRM_WARN_ONCE("DEVICE_PRIVATE kernel config option is not enabled, "
> +                   "add CONFIG_DEVICE_PRIVATE=y in config file to fix\n");
> +     return -ENODEV;
> +}
> +static inline void svm_migrate_fini(struct amdgpu_device *adev) {}
> +#endif
> +#endif /* KFD_MIGRATE_H_ */
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index 7a4b4b6dcf32..d5367e770b39 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -317,6 +317,9 @@ struct kfd_dev {
>       unsigned int max_doorbell_slices;
>
>       int noretry;
> +
> +     /* HMM page migration MEMORY_DEVICE_PRIVATE mapping */
> +     struct dev_pagemap pgmap;
>  };
>
>  enum kfd_mempool {
> --
> 2.29.2
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
  2021-03-01  8:32   ` Daniel Vetter
@ 2021-03-01  8:46     ` Thomas Hellström (Intel)
  2021-03-01  8:58       ` Daniel Vetter
  2021-03-04 17:58       ` Felix Kuehling
  0 siblings, 2 replies; 84+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-01  8:46 UTC (permalink / raw)
  To: Daniel Vetter, Felix Kuehling, Christian König
  Cc: Alex Sierra, Philip Yang, dri-devel, amd-gfx list


On 3/1/21 9:32 AM, Daniel Vetter wrote:
> On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
>> From: Philip Yang <Philip.Yang@amd.com>
>>
>> Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
>> allocate vram backing pages for page migration.
>>
>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
> So maybe I'm getting this all wrong, but I think that the current ttm
> fault code relies on devmap pte entries (especially for hugepte entries)
> to stop get_user_pages. But this only works if the pte happens to not
> point at a range with devmap pages.

I don't think that's in TTM yet, but the proposed fix, yes (see email I 
just sent in another thread),
but only for huge ptes.

>
> This patch here changes that, and so probably breaks this devmap pte hack
> ttm is using?
>
> If I'm not wrong here then I think we need to first fix up the ttm code to
> not use the devmap hack anymore, before a ttm based driver can register a
> dev_pagemap. Also adding Thomas since that just came up in another
> discussion.

It doesn't break the ttm devmap hack per se, but it indeed allows gup to 
the registered range. Here's where I lack understanding, though: why can't 
we allow gup-ing TTM ptes if there indeed is a backing struct page? 
Because registering MEMORY_DEVICE_PRIVATE implies that, right?

/Thomas

> -Daniel
>

* Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
  2021-03-01  8:46     ` Thomas Hellström (Intel)
@ 2021-03-01  8:58       ` Daniel Vetter
  2021-03-01  9:30         ` Thomas Hellström (Intel)
  2021-03-04 17:58       ` Felix Kuehling
  1 sibling, 1 reply; 84+ messages in thread
From: Daniel Vetter @ 2021-03-01  8:58 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Alex Sierra, Philip Yang, Felix Kuehling, dri-devel,
	amd-gfx list, Christian König

On Mon, Mar 01, 2021 at 09:46:44AM +0100, Thomas Hellström (Intel) wrote:
> On 3/1/21 9:32 AM, Daniel Vetter wrote:
> > On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
> > > From: Philip Yang <Philip.Yang@amd.com>
> > > 
> > > Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
> > > allocate vram backing pages for page migration.
> > > 
> > > Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> > > Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
> > So maybe I'm getting this all wrong, but I think that the current ttm
> > fault code relies on devmap pte entries (especially for hugepte entries)
> > to stop get_user_pages. But this only works if the pte happens to not
> > point at a range with devmap pages.
> 
> I don't think that's in TTM yet, but the proposed fix, yes (see email I just
> sent in another thread),
> but only for huge ptes.
> 
> > 
> > This patch here changes that, and so probably breaks this devmap pte hack
> > ttm is using?
> > 
> > If I'm not wrong here then I think we need to first fix up the ttm code to
> > not use the devmap hack anymore, before a ttm based driver can register a
> > dev_pagemap. Also adding Thomas since that just came up in another
> > discussion.
> 
> It doesn't break the ttm devmap hack per se, but it indeed allows gup to
> the registered range. Here's where I lack understanding, though: why can't
> we allow gup-ing TTM ptes if there indeed is a backing struct page? Because
> registering MEMORY_DEVICE_PRIVATE implies that, right?

We need to keep supporting buffer based memory management for all the
non-compute users. Because those require end-of-batch dma_fence semantics,
which prevents us from using gpu page faults, which makes hmm not really
work.

And for a buffer based memory manager we can't have gup pin random pages in
there, that's not really how it works. Worst case ttm just assumes it can
actually move buffers and reallocate them as it sees fit, and your gup
mapping (for direct i/o or whatever) now points at a page of a buffer that
you don't even own anymore. That's not good. Hence also all the
discussions about preventing gup for bo mappings in general.

Once we throw hmm into the mix we need to be really careful that the two
worlds don't collide. Pure hmm is fine, pure bo managed memory is fine,
mixing them is tricky.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
  2021-03-01  8:58       ` Daniel Vetter
@ 2021-03-01  9:30         ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 84+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-01  9:30 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Alex Sierra, Philip Yang, Felix Kuehling, dri-devel,
	amd-gfx list, Christian König


On 3/1/21 9:58 AM, Daniel Vetter wrote:
> On Mon, Mar 01, 2021 at 09:46:44AM +0100, Thomas Hellström (Intel) wrote:
>> On 3/1/21 9:32 AM, Daniel Vetter wrote:
>>> On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
>>>> From: Philip Yang <Philip.Yang@amd.com>
>>>>
>>>> Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
>>>> allocate vram backing pages for page migration.
>>>>
>>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>> So maybe I'm getting this all wrong, but I think that the current ttm
>>> fault code relies on devmap pte entries (especially for hugepte entries)
>>> to stop get_user_pages. But this only works if the pte happens to not
>>> point at a range with devmap pages.
>> I don't think that's in TTM yet, but the proposed fix, yes (see email I just
>> sent in another thread),
>> but only for huge ptes.
>>
>>> This patch here changes that, and so probably breaks this devmap pte hack
>>> ttm is using?
>>>
>>> If I'm not wrong here then I think we need to first fix up the ttm code to
>>> not use the devmap hack anymore, before a ttm based driver can register a
>>> dev_pagemap. Also adding Thomas since that just came up in another
>>> discussion.
>> It doesn't break the ttm devmap hack per se, but it indeed allows gup to
>> the registered range. Here's where I lack understanding, though: why can't
>> we allow gup-ing TTM ptes if there indeed is a backing struct page? Because
>> registering MEMORY_DEVICE_PRIVATE implies that, right?
> We need to keep supporting buffer based memory management for all the
> non-compute users. Because those require end-of-batch dma_fence semantics,
> which prevents us from using gpu page faults, which makes hmm not really
> work.
>
> And for buffer based memory manager we can't have gup pin random pages in
> there, that's not really how it works. Worst case ttm just assumes it can
> actually move buffers and reallocate them as it sees fit, and your gup
> mapping (for direct i/o or whatever) now points at a page of a buffer that
> you don't even own anymore. That's not good. Hence also all the
> discussions about preventing gup for bo mappings in general.
>
> Once we throw hmm into the mix we need to be really careful that the two
> worlds don't collide. Pure hmm is fine, pure bo managed memory is fine,
> mixing them is tricky.
> -Daniel

Hmm, OK so then registering MEMORY_DEVICE_PRIVATE means we can't set 
pxx_devmap because that would allow gup, which, in turn, means no huge 
TTM ptes.

/Thomas

* Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
  2021-03-01  8:46     ` Thomas Hellström (Intel)
  2021-03-01  8:58       ` Daniel Vetter
@ 2021-03-04 17:58       ` Felix Kuehling
  2021-03-11 12:24         ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 84+ messages in thread
From: Felix Kuehling @ 2021-03-04 17:58 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Daniel Vetter, Christian König
  Cc: Alex Sierra, Philip Yang, dri-devel, amd-gfx list


Am 2021-03-01 um 3:46 a.m. schrieb Thomas Hellström (Intel):
>
> On 3/1/21 9:32 AM, Daniel Vetter wrote:
>> On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
>>> From: Philip Yang <Philip.Yang@amd.com>
>>>
>>> Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
>>> allocate vram backing pages for page migration.
>>>
>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> So maybe I'm getting this all wrong, but I think that the current ttm
>> fault code relies on devmap pte entries (especially for hugepte entries)
>> to stop get_user_pages. But this only works if the pte happens to not
>> point at a range with devmap pages.
>
> I don't think that's in TTM yet, but the proposed fix, yes (see email
> I just sent in another thread),
> but only for huge ptes.
>
>>
>> This patch here changes that, and so probably breaks this devmap pte
>> hack
>> ttm is using?
>>
>> If I'm not wrong here then I think we need to first fix up the ttm
>> code to
>> not use the devmap hack anymore, before a ttm based driver can
>> register a
>> dev_pagemap. Also adding Thomas since that just came up in another
>> discussion.
>
> It doesn't break the ttm devmap hack per se, but it indeed allows gup
> to the registered range. Here's where I lack understanding, though: why
> can't we allow gup-ing TTM ptes if there indeed is a backing struct
> page? Because registering MEMORY_DEVICE_PRIVATE implies that, right?

I wasn't aware that TTM used devmap at all. If it does, what type of
memory does it use?

MEMORY_DEVICE_PRIVATE is like swapped out memory. It cannot be mapped in
the CPU page table. GUP would cause a page fault to swap it back into
system memory. We are looking into using MEMORY_DEVICE_GENERIC for a
future coherent memory architecture, where device memory can be
coherently accessed by the CPU and GPU.

As I understand it, our DEVICE_PRIVATE registration is not tied to an
actual physical address. Thus your devmap registration and our devmap
registration could probably coexist without any conflict. You'll just
have the overhead of two sets of struct pages for the same memory.
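
To illustrate the difference (a rough sketch based on svm_migrate_init()
in this patch; using adev->gmc.aper_base/aper_size as the CPU-visible VRAM
aperture is my assumption here):

        /* DEVICE_PRIVATE today: struct pages for a fake, CPU-inaccessible
         * range, only usable as migration targets */
        res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
        pgmap->type = MEMORY_DEVICE_PRIVATE;
        pgmap->res = *res;

        /* a future DEVICE_GENERIC variant would instead cover the real
         * aperture, so the CPU can map the pages coherently */
        pgmap->type = MEMORY_DEVICE_GENERIC;
        pgmap->res = (struct resource)DEFINE_RES_MEM(adev->gmc.aper_base,
                                                     adev->gmc.aper_size);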

Regards,
  Felix


>
> /Thomas
>
>> -Daniel
>>
>

* Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
  2021-03-04 17:58       ` Felix Kuehling
@ 2021-03-11 12:24         ` Thomas Hellström (Intel)
  0 siblings, 0 replies; 84+ messages in thread
From: Thomas Hellström (Intel) @ 2021-03-11 12:24 UTC (permalink / raw)
  To: Felix Kuehling, Daniel Vetter, Christian König
  Cc: Alex Sierra, Philip Yang, dri-devel, amd-gfx list


On 3/4/21 6:58 PM, Felix Kuehling wrote:
> Am 2021-03-01 um 3:46 a.m. schrieb Thomas Hellström (Intel):
>> On 3/1/21 9:32 AM, Daniel Vetter wrote:
>>> On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
>>>> From: Philip Yang <Philip.Yang@amd.com>
>>>>
>>>> Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
>>>> allocate vram backing pages for page migration.
>>>>
>>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>> So maybe I'm getting this all wrong, but I think that the current ttm
>>> fault code relies on devmap pte entries (especially for hugepte entries)
>>> to stop get_user_pages. But this only works if the pte happens to not
>>> point at a range with devmap pages.
>> I don't think that's in TTM yet, but the proposed fix does (see the
>> email I just sent in another thread), though only for huge ptes.
>>
>>> This patch here changes that, and so probably breaks this devmap pte
>>> hack ttm is using?
>>>
>>> If I'm not wrong here then I think we need to first fix up the ttm
>>> code to not use the devmap hack anymore, before a ttm based driver
>>> can register a dev_pagemap. Also adding Thomas since that just came
>>> up in another discussion.
>> It doesn't break the ttm devmap hack per se, but it does allow gup to
>> the registered range. Here's where my understanding falls short: why
>> can't we allow gup-ing TTM ptes if there indeed is a backing
>> struct page? Because registering MEMORY_DEVICE_PRIVATE implies that,
>> right?
> I wasn't aware that TTM used devmap at all. If it does, what type of
> memory does it use?
>
> MEMORY_DEVICE_PRIVATE is like swapped-out memory. It cannot be mapped in
> the CPU page table. GUP would cause a page fault to swap it back into
> system memory. We are looking into using MEMORY_DEVICE_GENERIC for a
> future coherent memory architecture, where device memory can be
> coherently accessed by the CPU and GPU.
>
> As I understand it, our DEVICE_PRIVATE registration is not tied to an
> actual physical address. Thus your devmap registration and our devmap
> registration could probably coexist without any conflict. You'll just
> have the overhead of two sets of struct pages for the same memory.
>
> Regards,
>    Felix

Hi, Felix. TTM doesn't use devmap yet, but we're thinking of using it to
fake pmd_special(), which isn't available. That would mean pmd_devmap()
with no registered dev_pagemap meaning "special" in the sense documented
by vm_normal_page(). The implication here would be that if you register
memory like above, TTM would never be able to set up a huge page table
entry for it. But it sounds like that's not an issue?
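
For what it's worth, here is a rough sketch of the kind of check that
scheme implies, under the assumption that "pmd_devmap() set but no
dev_pagemap registered for the pfn" is what would count as special. The
helper name is made up; this is not code from TTM or the core mm.

#include <linux/memremap.h>
#include <linux/mm.h>

/*
 * Hypothetical illustration only: a huge entry marked devmap whose pfn
 * has no registered dev_pagemap would be treated as "special", i.e. gup
 * must not take a struct page reference on it.
 */
static bool example_pmd_fakes_special(pmd_t pmd)
{
        struct dev_pagemap *pgmap;

        if (!pmd_devmap(pmd))
                return false;

        pgmap = get_dev_pagemap(pmd_pfn(pmd), NULL);
        if (!pgmap)
                return true;    /* devmap bit set, but nothing registered */

        put_dev_pagemap(pgmap);
        return false;           /* a real dev_pagemap backs this range */
}

That is the sense in which a dev_pagemap registered over the same pfns
would defeat the trick for huge entries.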

/Thomas

>
>> /Thomas
>>
>>> -Daniel
>>>

end of thread, other threads:[~2021-03-11 12:25 UTC | newest]

Thread overview: 84+ messages
2021-01-07  3:00 [PATCH 00/35] Add HMM-based SVM memory manager to KFD Felix Kuehling
2021-01-07  3:00 ` [PATCH 01/35] drm/amdkfd: select kernel DEVICE_PRIVATE option Felix Kuehling
2021-01-07  3:00 ` [PATCH 02/35] drm/amdgpu: replace per_device_list by array Felix Kuehling
2021-01-07  3:00 ` [PATCH 03/35] drm/amdkfd: helper to convert gpu id and idx Felix Kuehling
2021-01-07  3:00 ` [PATCH 04/35] drm/amdkfd: add svm ioctl API Felix Kuehling
2021-01-07  3:00 ` [PATCH 05/35] drm/amdkfd: Add SVM API support capability bits Felix Kuehling
2021-01-07  3:00 ` [PATCH 06/35] drm/amdkfd: register svm range Felix Kuehling
2021-01-07  3:00 ` [PATCH 07/35] drm/amdkfd: add svm ioctl GET_ATTR op Felix Kuehling
2021-01-07  3:01 ` [PATCH 08/35] drm/amdgpu: add common HMM get pages function Felix Kuehling
2021-01-07 10:53   ` Christian König
2021-01-07  3:01 ` [PATCH 09/35] drm/amdkfd: validate svm range system memory Felix Kuehling
2021-01-07  3:01 ` [PATCH 10/35] drm/amdkfd: register overlap system memory range Felix Kuehling
2021-01-07  3:01 ` [PATCH 11/35] drm/amdkfd: deregister svm range Felix Kuehling
2021-01-07  3:01 ` [PATCH 12/35] drm/amdgpu: export vm update mapping interface Felix Kuehling
2021-01-07 10:54   ` Christian König
2021-01-07  3:01 ` [PATCH 13/35] drm/amdkfd: map svm range to GPUs Felix Kuehling
2021-01-07  3:01 ` [PATCH 14/35] drm/amdkfd: svm range eviction and restore Felix Kuehling
2021-01-07  3:01 ` [PATCH 15/35] drm/amdkfd: add xnack enabled flag to kfd_process Felix Kuehling
2021-01-07  3:01 ` [PATCH 16/35] drm/amdkfd: add ioctl to configure and query xnack retries Felix Kuehling
2021-01-07  3:01 ` [PATCH 17/35] drm/amdkfd: register HMM device private zone Felix Kuehling
2021-03-01  8:32   ` Daniel Vetter
2021-03-01  8:46     ` Thomas Hellström (Intel)
2021-03-01  8:58       ` Daniel Vetter
2021-03-01  9:30         ` Thomas Hellström (Intel)
2021-03-04 17:58       ` Felix Kuehling
2021-03-11 12:24         ` Thomas Hellström (Intel)
2021-01-07  3:01 ` [PATCH 18/35] drm/amdkfd: validate vram svm range from TTM Felix Kuehling
2021-01-07  3:01 ` [PATCH 19/35] drm/amdkfd: support xgmi same hive mapping Felix Kuehling
2021-01-07  3:01 ` [PATCH 20/35] drm/amdkfd: copy memory through gart table Felix Kuehling
2021-01-07  3:01 ` [PATCH 21/35] drm/amdkfd: HMM migrate ram to vram Felix Kuehling
2021-01-07  3:01 ` [PATCH 22/35] drm/amdkfd: HMM migrate vram to ram Felix Kuehling
2021-01-07  3:01 ` [PATCH 23/35] drm/amdkfd: invalidate tables on page retry fault Felix Kuehling
2021-01-07  3:01 ` [PATCH 24/35] drm/amdkfd: page table restore through svm API Felix Kuehling
2021-01-07  3:01 ` [PATCH 25/35] drm/amdkfd: SVM API call to restore page tables Felix Kuehling
2021-01-07  3:01 ` [PATCH 26/35] drm/amdkfd: add svm_bo reference for eviction fence Felix Kuehling
2021-01-07  3:01 ` [PATCH 27/35] drm/amdgpu: add param bit flag to create SVM BOs Felix Kuehling
2021-01-07  3:01 ` [PATCH 28/35] drm/amdkfd: add svm_bo eviction mechanism support Felix Kuehling
2021-01-07  3:01 ` [PATCH 29/35] drm/amdgpu: svm bo enable_signal call condition Felix Kuehling
2021-01-07 10:56   ` Christian König
2021-01-07 16:16     ` Felix Kuehling
2021-01-07 16:28       ` Christian König
2021-01-07 16:53         ` Felix Kuehling
2021-01-07  3:01 ` [PATCH 30/35] drm/amdgpu: add svm_bo eviction to enable_signal cb Felix Kuehling
2021-01-07  3:01 ` [PATCH 31/35] drm/amdgpu: reserve fence slot to update page table Felix Kuehling
2021-01-07 10:57   ` Christian König
2021-01-07  3:01 ` [PATCH 32/35] drm/amdgpu: enable retry fault wptr overflow Felix Kuehling
2021-01-07 11:01   ` Christian König
2021-01-07  3:01 ` [PATCH 33/35] drm/amdkfd: refine migration policy with xnack on Felix Kuehling
2021-01-07  3:01 ` [PATCH 34/35] drm/amdkfd: add svm range validate timestamp Felix Kuehling
2021-01-07  3:01 ` [PATCH 35/35] drm/amdkfd: multiple gpu migrate vram to vram Felix Kuehling
2021-01-07  9:23 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Daniel Vetter
2021-01-07 16:25   ` Felix Kuehling
2021-01-08 14:40     ` Daniel Vetter
2021-01-08 14:45       ` Christian König
2021-01-08 15:58       ` Felix Kuehling
2021-01-08 16:06         ` Daniel Vetter
2021-01-08 16:36           ` Felix Kuehling
2021-01-08 16:53             ` Daniel Vetter
2021-01-08 17:56               ` Felix Kuehling
2021-01-11 16:29                 ` Daniel Vetter
2021-01-14  5:34                   ` Felix Kuehling
2021-01-14 12:19                     ` Christian König
2021-01-13 16:56       ` Jerome Glisse
2021-01-13 20:31         ` Daniel Vetter
2021-01-14  3:27           ` Jerome Glisse
2021-01-14  9:26             ` Daniel Vetter
2021-01-14 10:39               ` Daniel Vetter
2021-01-14 10:49         ` Christian König
2021-01-14 11:52           ` Daniel Vetter
2021-01-14 13:37             ` HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD) Christian König
2021-01-14 13:57               ` Daniel Vetter
2021-01-14 14:13                 ` Christian König
2021-01-14 14:23                   ` Daniel Vetter
2021-01-14 15:08                     ` Christian König
2021-01-14 15:40                       ` Daniel Vetter
2021-01-14 16:01                         ` Christian König
2021-01-14 16:36                           ` Daniel Vetter
2021-01-14 19:08                             ` Christian König
2021-01-14 20:09                               ` Daniel Vetter
2021-01-14 16:51               ` Jerome Glisse
2021-01-14 21:13                 ` Felix Kuehling
2021-01-15  7:47                   ` Christian König
2021-01-13 16:47 ` [PATCH 00/35] Add HMM-based SVM memory manager to KFD Jerome Glisse
2021-01-14  0:06   ` Felix Kuehling
