* [PATCH 00/23] XeKmd basic SVM support
@ 2024-01-17 22:12 Oak Zeng
  2024-01-17 22:12 ` [PATCH 01/23] drm/xe/svm: Add SVM document Oak Zeng
                   ` (22 more replies)
  0 siblings, 23 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

This is the very basic SVM (shared virtual memory) support in the XeKmd
driver. SVM allows the programmer to use a shared virtual address space
between a CPU program and a GPU program. It abstracts away from the user
the location of the backing memory in a mixed CPU and GPU programming
environment.

This work is based on a previous i915 SVM implementation, mainly from
Niranjana Vishwanathapura and Oak Zeng, which has never been upstreamed
before. This is our first attempt to upstream this work.

This implementation depends on the Linux kernel HMM support. See some
key designs in patch #1.

We are aware there is currently some effort to implement SVM using
GMEM (generalized memory management,
see https://lore.kernel.org/dri-devel/20231128125025.4449-1-weixi.zhu@huawei.com/).
We are open to this new method if it can be merged into the upstream kernel.
Until then, we think it is still safer to support SVM through HMM.

This series only has basic SVM support. We think it is better to post
this series earlier so we can get more eyes on it. Below is the work
that is planned or ongoing:

*Testing: We are working on the IGT tests right now. Some parts of this
series, especially the gpu page table update (patches #7, #8) and the
migration function (patch #10), need some debugging to make them work.

*Virtual address range based memory attributes and hints: We plan to
expose uAPI for users to set memory attributes such as preferred location
or migration granularity etc. on a virtual address range. This is
important for tuning SVM performance.

*GPU vram eviction: One key design choice of this series is that the SVM
layer allocates GPU memory directly from the drm buddy allocator, instead
of from the xe vram manager. There is no BO (buffer object) concept
in this implementation. The key benefit of this approach is that we can
migrate memory at page granularity easily. This also means SVM bypasses
TTM's memory eviction logic. But we want SVM memory and BO driver
memory to be able to mutually evict each other. We have some proof of
concept work to rework the TTM resource manager for this purpose, see
https://lore.kernel.org/dri-devel/20231102043306.2931989-1-oak.zeng@intel.com/
We will continue to work on that series and then implement SVM's eviction
function based on the concept of a drm LRU list shared between SVM and
the TTM/BO driver.
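
As a rough illustration of the no-BO, page-granularity design described
above, per-block bookkeeping on top of drm buddy could look like the
sketch below. The struct and helper are illustrative only, not part of
this series; patch #11 adds the real allocation/free functions.

#include <linux/bitmap.h>
#include <drm/drm_buddy.h>

/* Hypothetical bookkeeping: one bit per 4K page of a buddy block */
struct svm_mem_block {
	struct drm_buddy_block *block;	/* 2^n pages from drm buddy */
	unsigned long *bitmap;		/* set bit == page in use */
	unsigned int npages;
};

/* Free one page; return the whole block to drm buddy once it empties */
static void svm_block_put_page(struct drm_buddy *mm,
			       struct svm_mem_block *sb, unsigned int idx)
{
	clear_bit(idx, sb->bitmap);
	if (bitmap_empty(sb->bitmap, sb->npages))
		drm_buddy_free_block(mm, sb->block);
}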

Oak Zeng (23):
  drm/xe/svm: Add SVM document
  drm/xe/svm: Add svm key data structures
  drm/xe/svm: create xe svm during vm creation
  drm/xe/svm: Trace svm creation
  drm/xe/svm: add helper to retrieve svm range from address
  drm/xe/svm: Introduce a helper to build sg table from hmm range
  drm/xe/svm: Add helper for binding hmm range to gpu
  drm/xe/svm: Add helper to invalidate svm range from GPU
  drm/xe/svm: Remap and provide memmap backing for GPU vram
  drm/xe/svm: Introduce svm migration function
  drm/xe/svm: implement functions to allocate and free device memory
  drm/xe/svm: Trace buddy block allocation and free
  drm/xe/svm: Handle CPU page fault
  drm/xe/svm: trace svm range migration
  drm/xe/svm: Implement functions to register and unregister mmu
    notifier
  drm/xe/svm: Implement the mmu notifier range invalidate callback
  drm/xe/svm: clean up svm range during process exit
  drm/xe/svm: Move a few structures to xe_gt.h
  drm/xe/svm: migrate svm range to vram
  drm/xe/svm: Populate svm range
  drm/xe/svm: GPU page fault support
  drm/xe/svm: Add DRM_XE_SVM kernel config entry
  drm/xe/svm: Add svm memory hints interface

 Documentation/gpu/xe/index.rst       |   1 +
 Documentation/gpu/xe/xe_svm.rst      |   8 +
 drivers/gpu/drm/xe/Kconfig           |  22 ++
 drivers/gpu/drm/xe/Makefile          |   5 +
 drivers/gpu/drm/xe/xe_device_types.h |  20 ++
 drivers/gpu/drm/xe/xe_gt.h           |  20 ++
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  28 +--
 drivers/gpu/drm/xe/xe_migrate.c      | 213 +++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h      |   7 +
 drivers/gpu/drm/xe/xe_mmio.c         |  12 +
 drivers/gpu/drm/xe/xe_pt.c           | 147 +++++++++++-
 drivers/gpu/drm/xe/xe_pt.h           |   5 +
 drivers/gpu/drm/xe/xe_svm.c          | 324 +++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h          | 115 +++++++++
 drivers/gpu/drm/xe/xe_svm_devmem.c   | 232 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm_doc.h      | 121 ++++++++++
 drivers/gpu/drm/xe/xe_svm_migrate.c  | 345 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm_range.c    | 227 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_trace.h        |  71 +++++-
 drivers/gpu/drm/xe/xe_vm.c           |   7 +
 drivers/gpu/drm/xe/xe_vm_types.h     |  15 +-
 include/uapi/drm/xe_drm.h            |  40 ++++
 22 files changed, 1957 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_svm.rst
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm_doc.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm_range.c

-- 
2.26.3


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH 01/23] drm/xe/svm: Add SVM document
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 02/23] drm/xe/svm: Add svm key data structures Oak Zeng
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Add shared virtual memory document.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 Documentation/gpu/xe/index.rst  |   1 +
 Documentation/gpu/xe/xe_svm.rst |   8 +++
 drivers/gpu/drm/xe/xe_svm_doc.h | 121 ++++++++++++++++++++++++++++++++
 3 files changed, 130 insertions(+)
 create mode 100644 Documentation/gpu/xe/xe_svm.rst
 create mode 100644 drivers/gpu/drm/xe/xe_svm_doc.h

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..106b60aba1f0 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
    xe_firmware
    xe_tile
    xe_debugging
+   xe_svm
diff --git a/Documentation/gpu/xe/xe_svm.rst b/Documentation/gpu/xe/xe_svm.rst
new file mode 100644
index 000000000000..62954ba1c6f8
--- /dev/null
+++ b/Documentation/gpu/xe/xe_svm.rst
@@ -0,0 +1,8 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+=====================
+Shared virtual memory
+=====================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_svm_doc.h
+   :doc: Shared virtual memory
diff --git a/drivers/gpu/drm/xe/xe_svm_doc.h b/drivers/gpu/drm/xe/xe_svm_doc.h
new file mode 100644
index 000000000000..de38ee3585e4
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_doc.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef _XE_SVM_DOC_H_
+#define _XE_SVM_DOC_H_
+
+/**
+ * DOC: Shared virtual memory
+ *
+ * Shared Virtual Memory (SVM) allows the programmer to use a single virtual
+ * address space shared between threads executing on CPUs and GPUs. It abstracts
+ * away from the user the location of the backing memory, and hence simplifies
+ * the user programming model. In a non-SVM memory model, the user needs to
+ * explicitly decide memory placement such as device or system memory, and
+ * also needs to explicitly migrate memory between device and system memory.
+ *
+ * Interface
+ * =========
+ *
+ * SVM makes use of the default OS memory allocation and mapping interfaces
+ * such as malloc() and mmap(). A pointer returned from malloc() or mmap()
+ * can be used directly in both CPU and GPU programs.
+ *
+ * SVM also provides an API to set virtual address range based memory attributes
+ * such as preferred memory location, memory migration granularity, and memory
+ * atomic attributes etc. This is similar to the Linux madvise() API.
+ *
+ * Basic implementation
+ * ====================
+ *
+ * The XeKMD implementation is based on the Linux kernel Heterogeneous Memory
+ * Management (HMM) framework. HMM's address space mirroring support allows sharing
+ * of the address space by duplicating sections of CPU page tables in the device page
+ * tables. This enables both CPU and GPU to access a physical memory location using
+ * the same virtual address.
+ *
+ * The Linux kernel also provides the ability to plug device memory into the system
+ * (as a special ZONE_DEVICE type) and allocate a struct page for each device memory
+ * page.
+ *
+ * HMM also provides a mechanism to migrate pages from host to device memory and
+ * vice versa.
+ *
+ * More information on HMM can be found here:
+ * https://www.kernel.org/doc/Documentation/vm/hmm.rst
+ *
+ * Unlike the non-SVM memory allocators (such as gem_create, vm_bind etc.), there
+ * is no buffer object (BO, such as struct ttm_buffer_object, struct drm_gem_object)
+ * in our SVM implementation. We deliberately choose this implementation option
+ * to achieve page granularity memory placement, validation, eviction and migration.
+ *
+ * The SVM layer directly allocates device memory from the drm buddy subsystem. The
+ * memory is organized as many blocks, each of which has 2^n pages. The SVM subsystem
+ * then marks the usage of each page using a simple bitmap. When all pages in a
+ * block are no longer used, SVM returns this block back to the drm buddy subsystem.
+ *
+ * There are 3 events which can trigger the SVM subsystem into action:
+ *
+ * 1. An mmu notifier callback
+ *
+ * Since SVM needs to mirror the program's CPU virtual address space on the GPU
+ * side, when the program's CPU address space changes, SVM needs to make an
+ * identical change on the GPU side. SVM/hmm uses the mmu interval notifier to
+ * achieve this. SVM registers an mmu interval notifier callback with core mm,
+ * and whenever a CPU side virtual address space is changed (e.g., a virtual
+ * address range is unmapped from the CPU by calling munmap), the registered
+ * callback will be called from core mm. SVM then mirrors the CPU address space
+ * change on the GPU side, i.e., unmaps or invalidates the range in the GPU page table.
+ *
+ * 2. A GPU page fault
+ *
+ * At the very beginning of a process's life, no virtual address of the process
+ * is mapped in the GPU page table. So when the GPU accesses any virtual address
+ * of the process, a GPU page fault is triggered. SVM then decides the best memory
+ * location for the faulting address (mainly from a performance consideration;
+ * sometimes correctness requirements are also considered, such as whether the GPU
+ * can perform atomic operations on a certain memory location), migrates memory if
+ * necessary, and maps the faulting address into the GPU page table.
+ *
+ * 3. A CPU page fault
+ *
+ * A CPU page fault is usually managed by Linux core mm. But in a mixed CPU and
+ * GPU programming environment, the backing store of a virtual address range
+ * can be in the GPU's local memory, which is not visible to the CPU
+ * (DEVICE_PRIVATE), so the CPU page fault handler needs to migrate such pages to
+ * system memory for the CPU to be able to access them. Such memory migration is
+ * device specific. HMM has a callback (the migrate_to_ram() function of
+ * dev_pagemap_ops) for the device driver to implement.
+ *
+ *
+ * Memory hints: TBD
+ * =================
+ *
+ * Memory eviction: TBD
+ * ====================
+ *
+ * Lock design
+ * ===========
+ *
+ * https://www.kernel.org/doc/Documentation/vm/hmm.rst, section "Address space mirroring
+ * implementation and API", describes the locking scheme that the driver writer has to
+ * respect. There are 3 lock mechanisms involved in this scheme:
+ *
+ * 1. Use mmap_read/write_lock to protect VMA and cpu page table operations. Operations
+ * such as munmap/mmap and page table updates during NUMA balancing must hold this lock.
+ * hmm_range_fault is a helper function provided by HMM to populate the CPU page table,
+ * so it must be called with this lock held.
+ *
+ * 2. Use xe_svm::mutex to protect device side page table operations. Any attempt to bind
+ * an address range to the GPU, or invalidate an address range from the GPU, should hold
+ * this device lock.
+ *
+ * 3. In the GPU page fault handler, during the device page table update, we hold
+ * xe_svm::mutex but not the mmap_read/write_lock, so the program's address space can change
+ * meanwhile. The mmu notifier seq# is used to detect an unmap during the update; if so, retry.
+ *
+ */
+
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread
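
For readers following the lock design in the document above, the
caller-side retry loop it implies looks roughly like the sketch below
(per Documentation/vm/hmm.rst). xe_bind_svm_range() is added in patch
#7 of this series; the wrapper function itself is illustrative and its
error handling is simplified.

int xe_svm_populate_and_bind(struct xe_vm *vm, struct xe_tile *tile,
			     struct xe_svm_range *range)
{
	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
	unsigned long *pfns;
	struct hmm_range hrange = {
		.notifier = &range->notifier,
		.start = range->start,
		.end = range->end,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

	pfns = kvcalloc(npages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;
	hrange.hmm_pfns = pfns;

	do {
		hrange.notifier_seq = mmu_interval_read_begin(&range->notifier);
		mmap_read_lock(range->notifier.mm);
		ret = hmm_range_fault(&hrange);	/* lock #1: mmap lock */
		mmap_read_unlock(range->notifier.mm);
		if (ret && ret != -EBUSY)
			break;
		if (!ret)
			/* lock #2: xe_svm::mutex is taken inside, where the
			 * notifier seq# is re-checked */
			ret = xe_bind_svm_range(vm, tile, &hrange, 0);
	} while (ret == -EBUSY || ret == -EAGAIN);

	kvfree(pfns);
	return ret;
}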

* [PATCH 02/23] drm/xe/svm: Add svm key data structures
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
  2024-01-17 22:12 ` [PATCH 01/23] drm/xe/svm: Add SVM document Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 03/23] drm/xe/svm: create xe svm during vm creation Oak Zeng
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Add the xe_svm and xe_svm_range data structures. Each xe_svm
represents a svm address space; it maps 1:1 to the
process's mm_struct and 1:1 to the gpu xe_vm
struct.

Each xe_svm_range represents a virtual address range inside
a svm address space. It is similar to the CPU's vm_area_struct,
or to the GPU xe_vma struct. It contains data to synchronize
this address range with the CPU's virtual address range, using the
mmu notifier mechanism. It can also hold this range's memory
attributes set by the user, such as preferred memory location etc -
this is TBD.

Each svm address space is made of many svm virtual address ranges.
All address ranges are maintained in xe_svm's interval tree.

Also add a xe_svm pointer to the xe_vm data structure, so we have
a 1:1 mapping between xe_svm and xe_vm.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h      | 59 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm_types.h |  2 ++
 2 files changed, 61 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..ba301a331f59
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef __XE_SVM_H
+#define __XE_SVM_H
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree_types.h>
+#include <linux/interval_tree.h>
+
+struct xe_vm;
+struct mm_struct;
+
+/**
+ * struct xe_svm - data structure to represent a shared
+ * virtual address space from the device side. xe_svm, xe_vm
+ * and mm_struct have a 1:1:1 relationship.
+ */
+struct xe_svm {
+	/** @vm: The xe_vm address space corresponding to this xe_svm */
+	struct xe_vm *vm;
+	/** @mm: The mm_struct corresponding to this xe_svm */
+	struct mm_struct *mm;
+	/**
+	 * @mutex: A lock used by svm subsystem. It protects:
+	 * 1. below range_tree
+	 * 2. GPU page table update. Serialize all SVM GPU page table updates
+	 */
+	struct mutex mutex;
+	/**
+	 * @range_tree: Interval tree of all svm ranges in this svm
+	 */
+	struct rb_root_cached range_tree;
+};
+
+/**
+ * struct xe_svm_range - Represents a shared virtual address range.
+ */
+struct xe_svm_range {
+	/** @notifier: The mmu interval notifier used to keep track of CPU
+	 * side address range changes. The driver will get a callback with this
+	 * notifier if anything changes on the CPU side, such as the range being
+	 * unmapped from the CPU
+	 */
+	struct mmu_interval_notifier notifier;
+	/** @start: start address of this range, inclusive */
+	u64 start;
+	/** @end: end address of this range, exclusive */
+	u64 end;
+	/** @unregister_notifier_work: A worker used to unregister this notifier */
+	struct work_struct unregister_notifier_work;
+	/** @inode: used to link this range to svm's range_tree */
+	struct interval_tree_node inode;
+};
+#endif
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 63e8a50b88e9..037fb7168c63 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -17,6 +17,7 @@
 #include "xe_pt_types.h"
 #include "xe_range_fence.h"
 
+struct xe_svm;
 struct xe_bo;
 struct xe_sync_entry;
 struct xe_vm;
@@ -279,6 +280,7 @@ struct xe_vm {
 	bool batch_invalidate_tlb;
 	/** @xef: XE file handle for tracking this VM's drm client */
 	struct xe_file *xef;
+	struct xe_svm *svm;
 };
 
 /** struct xe_vma_op_map - VMA map operation */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread
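
To illustrate how the interval tree above is meant to be used: linking
a new range into svm->range_tree could look like the sketch below (the
helper is illustrative, not one of the patches shown here). Note that
interval_tree_insert() treats inode.last as inclusive while
xe_svm_range::end is exclusive, hence the "end - 1".

static void xe_svm_range_insert(struct xe_svm *svm,
				struct xe_svm_range *range)
{
	range->inode.start = range->start;
	range->inode.last = range->end - 1;	/* last is inclusive */

	/* xe_svm::mutex protects the range tree, per the comment above */
	mutex_lock(&svm->mutex);
	interval_tree_insert(&range->inode, &svm->range_tree);
	mutex_unlock(&svm->mutex);
}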

* [PATCH 03/23] drm/xe/svm: create xe svm during vm creation
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
  2024-01-17 22:12 ` [PATCH 01/23] drm/xe/svm: Add SVM document Oak Zeng
  2024-01-17 22:12 ` [PATCH 02/23] drm/xe/svm: Add svm key data structures Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 04/23] drm/xe/svm: Trace svm creation Oak Zeng
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Create the xe_svm struct during xe_vm creation.
Add xe_svm to a global hash table so later on
we can retrieve xe_svm using mm_struct (the key).

Destroy the svm process during xe_vm close.

Also add a helper function to retrieve the svm struct
from the mm struct.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 63 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h | 11 +++++++
 drivers/gpu/drm/xe/xe_vm.c  |  5 +++
 3 files changed, 79 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
new file mode 100644
index 000000000000..559188471949
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mutex.h>
+#include <linux/mm_types.h>
+#include "xe_svm.h"
+
+DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
+
+/**
+ * xe_destroy_svm() - destroy a svm process
+ *
+ * @svm: the xe_svm to destroy
+ */
+void xe_destroy_svm(struct xe_svm *svm)
+{
+	hash_del_rcu(&svm->hnode);
+	mutex_destroy(&svm->mutex);
+	kfree(svm);
+}
+
+/**
+ * xe_create_svm() - create a svm process
+ *
+ * @vm: the xe_vm that we create the svm process for
+ *
+ * Return: the created xe_svm struct, or NULL on allocation failure
+ */
+struct xe_svm *xe_create_svm(struct xe_vm *vm)
+{
+	struct mm_struct *mm = current->mm;
+	struct xe_svm *svm;
+
+	svm = kzalloc(sizeof(struct xe_svm), GFP_KERNEL);
+	if (!svm)
+		return NULL;
+	svm->mm = mm;
+	svm->vm = vm;
+	mutex_init(&svm->mutex);
+	/* Add svm to the global xe_svm_table, keyed by mm for later lookup */
+	hash_add_rcu(xe_svm_table, &svm->hnode, (uintptr_t)mm);
+	return svm;
+}
+
+/**
+ * xe_lookup_svm_by_mm() - retrieve xe_svm from mm struct
+ *
+ * @mm: the mm struct of the svm to retrieve
+ *
+ * Return: the xe_svm struct pointer, or NULL if not found
+ */
+struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
+{
+	struct xe_svm *svm;
+
+	hash_for_each_possible_rcu(xe_svm_table, svm, hnode, (uintptr_t)mm)
+		if (svm->mm == mm)
+			return svm;
+
+	return NULL;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index ba301a331f59..cd3cf92f3784 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -11,10 +11,15 @@
 #include <linux/workqueue.h>
 #include <linux/rbtree_types.h>
 #include <linux/interval_tree.h>
+#include <linux/hashtable.h>
+#include <linux/types.h>
 
 struct xe_vm;
 struct mm_struct;
 
+#define XE_MAX_SVM_PROCESS 5 /* Supports up to 32 (2^5) SVM processes */
+extern DECLARE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
+
 /**
  * struct xe_svm - data structure to represent a shared
  * virtual address space from device side. xe_svm, xe_vm
@@ -35,6 +40,8 @@ struct xe_svm {
 	 * @range_tree: Interval tree of all svm ranges in this svm
 	 */
 	struct rb_root_cached range_tree;
+	/** @hnode: used to add this svm to a global xe_svm_hash table*/
+	struct hlist_node hnode;
 };
 
 /**
@@ -56,4 +63,8 @@ struct xe_svm_range {
 	/** @inode: used to link this range to svm's range_tree */
 	struct interval_tree_node inode;
 };
+
+void xe_destroy_svm(struct xe_svm *svm);
+struct xe_svm *xe_create_svm(struct xe_vm *vm);
+struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index a7e7a0b24099..712fe49d8fb2 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -36,6 +36,7 @@
 #include "xe_trace.h"
 #include "generated/xe_wa_oob.h"
 #include "xe_wa.h"
+#include "xe_svm.h"
 
 #define TEST_VM_ASYNC_OPS_ERROR
 
@@ -1376,6 +1377,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		xe->usm.num_vm_in_non_fault_mode++;
 	mutex_unlock(&xe->usm.lock);
 
+	vm->svm = xe_create_svm(vm);
 	trace_xe_vm_create(vm);
 
 	return vm;
@@ -1496,6 +1498,9 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	for_each_tile(tile, xe, id)
 		xe_range_fence_tree_fini(&vm->rftree[id]);
 
+	if (vm->svm)
+		xe_destroy_svm(vm->svm);
+
 	xe_vm_put(vm);
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread
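
One note on the lookup above: hash_for_each_possible_rcu() must run
inside an RCU read-side critical section. A variant with the missing
rcu_read_lock() added is sketched below (illustrative; since xe_svm has
no refcount in this series and xe_destroy_svm() kfrees right after
hash_del_rcu(), the lifetime of the returned pointer also needs care,
e.g. kfree_rcu() on the destroy side).

#include <linux/rcupdate.h>

struct xe_svm *xe_lookup_svm_by_mm_safe(struct mm_struct *mm)
{
	struct xe_svm *svm, *found = NULL;

	rcu_read_lock();
	hash_for_each_possible_rcu(xe_svm_table, svm, hnode, (uintptr_t)mm)
		if (svm->mm == mm) {
			found = svm;
			break;
		}
	rcu_read_unlock();

	return found;
}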

* [PATCH 04/23] drm/xe/svm: Trace svm creation
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (2 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 03/23] drm/xe/svm: create xe svm during vm creation Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 05/23] drm/xe/svm: add helper to retrieve svm range from address Oak Zeng
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

The xe_vm tracepoint is extended to also print the svm pointer.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_trace.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 95163c303f3e..63867c0fa848 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -467,15 +467,17 @@ DECLARE_EVENT_CLASS(xe_vm,
 		    TP_STRUCT__entry(
 			     __field(u64, vm)
 			     __field(u32, asid)
+			     __field(u64, svm)
 			     ),
 
 		    TP_fast_assign(
 			   __entry->vm = (unsigned long)vm;
 			   __entry->asid = vm->usm.asid;
+			   __entry->svm = (unsigned long)vm->svm;
 			   ),
 
-		    TP_printk("vm=0x%016llx, asid=0x%05x",  __entry->vm,
-			      __entry->asid)
+		    TP_printk("vm=0x%016llx, asid=0x%05x, svm=0x%016llx",  __entry->vm,
+			      __entry->asid, __entry->svm)
 );
 
 DEFINE_EVENT(xe_vm, xe_vm_kill,
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 05/23] drm/xe/svm: add helper to retrieve svm range from address
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (3 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 04/23] drm/xe/svm: Trace svm creation Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range Oak Zeng
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

All valid virtual address ranges are maintained in the svm's
range_tree. This function iterates the svm's range tree and
returns the svm range that contains a specific address.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h       |  2 ++
 drivers/gpu/drm/xe/xe_svm_range.c | 32 +++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm_range.c

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index cd3cf92f3784..3ed106ecc02b 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -67,4 +67,6 @@ struct xe_svm_range {
 void xe_destroy_svm(struct xe_svm *svm);
 struct xe_svm *xe_create_svm(struct xe_vm *vm);
 struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
+struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
+								unsigned long addr);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_range.c b/drivers/gpu/drm/xe/xe_svm_range.c
new file mode 100644
index 000000000000..d8251d38f65e
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_range.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/interval_tree.h>
+#include <linux/container_of.h>
+#include <linux/mutex.h>
+#include "xe_svm.h"
+
+/**
+ * xe_svm_range_from_addr() - retrieve the svm_range containing a virtual address
+ *
+ * @svm: svm that the virtual address belongs to
+ * @addr: the virtual address to retrieve the svm_range for
+ *
+ * Return: the svm range found,
+ * or NULL if no range is found
+ */
+struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
+									unsigned long addr)
+{
+	struct interval_tree_node *node;
+
+	mutex_lock(&svm->mutex);
+	node = interval_tree_iter_first(&svm->range_tree, addr, addr);
+	mutex_unlock(&svm->mutex);
+	if (!node)
+		return NULL;
+
+	return container_of(node, struct xe_svm_range, inode);
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (4 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 05/23] drm/xe/svm: add helper to retrieve svm range from address Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-04-05  0:39   ` Jason Gunthorpe
  2024-01-17 22:12 ` [PATCH 07/23] drm/xe/svm: Add helper for binding hmm range to gpu Oak Zeng
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Introduce the xe_svm_build_sg helper function to build a scatter
gather table from a hmm_range struct. This is preparation work
for binding hmm ranges to the gpu.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 52 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h |  3 +++
 2 files changed, 55 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 559188471949..ab3cc2121869 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -6,6 +6,8 @@
 #include <linux/mutex.h>
 #include <linux/mm_types.h>
 #include "xe_svm.h"
+#include <linux/hmm.h>
+#include <linux/scatterlist.h>
 
 DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
 
@@ -61,3 +63,53 @@ struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
 
 	return NULL;
 }
+
+/**
+ * xe_svm_build_sg() - build a scatter gather table for all the physical pages/pfn
+ * in a hmm_range.
+ *
+ * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
+ * has the pfn numbers of pages that back up this hmm address range.
+ * @st: pointer to the sg table.
+ *
+ * All the contiguous pfns will be collapsed into one entry in
+ * the scatter gather table. This is for the convenience of
+ * later on operations to bind address range to GPU page table.
+ *
+ * This function allocates the storage of the sg table. It is the
+ * caller's responsibility to free it by calling sg_free_table.
+ *
+ * Returns 0 if successful; -ENOMEM if it fails to allocate memory
+ */
+int xe_svm_build_sg(struct hmm_range *range,
+			     struct sg_table *st)
+{
+	struct scatterlist *sg;
+	u64 i, npages;
+
+	sg = NULL;
+	st->nents = 0;
+	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
+
+	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
+		return -ENOMEM;
+
+	for (i = 0; i < npages; i++) {
+		unsigned long addr = page_to_phys(hmm_pfn_to_page(range->hmm_pfns[i]));
+
+		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
+			sg->length += PAGE_SIZE;
+			sg_dma_len(sg) += PAGE_SIZE;
+			continue;
+		}
+
+		sg = sg ? sg_next(sg) : st->sgl;
+		sg_dma_address(sg) = addr;
+		sg_dma_len(sg) = PAGE_SIZE;
+		sg->length = PAGE_SIZE;
+		st->nents++;
+	}
+
+	sg_mark_end(sg);
+	return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 3ed106ecc02b..191bce6425db 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -13,6 +13,8 @@
 #include <linux/interval_tree.h>
 #include <linux/hashtable.h>
 #include <linux/types.h>
+#include <linux/hmm.h>
+#include "xe_device_types.h"
 
 struct xe_vm;
 struct mm_struct;
@@ -69,4 +71,5 @@ struct xe_svm *xe_create_svm(struct xe_vm *vm);
 struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
 struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
 								unsigned long addr);
+int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread
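
A small worked example of the collapsing behavior described above,
assuming 4K pages and pfns that resolve to physical addresses 0x100000,
0x101000, 0x102000 and 0x200000: the resulting sg table has two
entries, one of length 12K at 0x100000 (three contiguous pages
collapsed) and one of length 4K at 0x200000, so the bind code added in
patch #7 walks two segments instead of four individual pages.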

* [PATCH 07/23] drm/xe/svm: Add helper for binding hmm range to gpu
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (5 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 08/23] drm/xe/svm: Add helper to invalidate svm range from GPU Oak Zeng
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Add a helper function xe_bind_svm_range to bind a svm range
to the gpu. A temporary xe_vma is created locally to re-use
the existing page table update functions, which are vma-based.

The svm page table update lock design is different from the
userptr and bo page table updates. A xe_pt_svm_pre_commit
function is introduced for svm range pre-commitment.

A hmm_range pointer is added to the xe_vma struct.

v1: Make userptr the last member of the xe_vma struct

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c       | 114 +++++++++++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_pt.h       |   4 ++
 drivers/gpu/drm/xe/xe_vm_types.h |  13 +++-
 3 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index de1030a47588..f1e479fa3001 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -17,6 +17,7 @@
 #include "xe_trace.h"
 #include "xe_ttm_stolen_mgr.h"
 #include "xe_vm.h"
+#include "xe_svm.h"
 
 struct xe_pt_dir {
 	struct xe_pt pt;
@@ -582,8 +583,15 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 {
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_bo *bo = xe_vma_bo(vma);
-	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
-		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	/*
+	 * FIXME: Right now we assume all svm ranges bound to the GPU are backed
+	 * by device memory. This assumption will change once the migration
+	 * policy is implemented. A svm range's backing store can be a
+	 * mixture of device memory and system memory, on a page-by-page basis.
+	 * We probably need a separate stage_bind function for svm.
+	 */
+	bool is_devmem = vma->svm_sg || (!xe_vma_is_userptr(vma) && bo &&
+		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo)));
 	struct xe_res_cursor curs;
 	struct xe_pt_stage_bind_walk xe_walk = {
 		.base = {
@@ -617,7 +625,10 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 	xe_bo_assert_held(bo);
 
 	if (!xe_vma_is_null(vma)) {
-		if (xe_vma_is_userptr(vma))
+		if (vma->svm_sg)
+			xe_res_first_sg(vma->svm_sg, 0, xe_vma_size(vma),
+					&curs);
+		else if (xe_vma_is_userptr(vma))
 			xe_res_first_sg(vma->userptr.sg, 0, xe_vma_size(vma),
 					&curs);
 		else if (xe_bo_is_vram(bo) || xe_bo_is_stolen(bo))
@@ -1046,6 +1057,28 @@ static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
 	return 0;
 }
 
+static int xe_pt_svm_pre_commit(struct xe_migrate_pt_update *pt_update)
+{
+	struct xe_vma *vma = pt_update->vma;
+	struct hmm_range *range = vma->hmm_range;
+
+	if (mmu_interval_read_retry(range->notifier,
+		    range->notifier_seq)) {
+		/*
+		 * FIXME: is this really necessary? We didn't update GPU
+		 * page table yet...
+		 */
+		xe_vm_invalidate_vma(vma);
+		return -EAGAIN;
+	}
+	return 0;
+}
+
+static const struct xe_migrate_pt_update_ops svm_bind_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.pre_commit = xe_pt_svm_pre_commit,
+};
+
 static const struct xe_migrate_pt_update_ops bind_ops = {
 	.populate = xe_vm_populate_pgtable,
 	.pre_commit = xe_pt_pre_commit,
@@ -1197,7 +1230,8 @@ __xe_pt_bind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue
 	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
 	struct xe_pt_migrate_pt_update bind_pt_update = {
 		.base = {
-			.ops = xe_vma_is_userptr(vma) ? &userptr_bind_ops : &bind_ops,
+			.ops = vma->svm_sg ? &svm_bind_ops :
+					(xe_vma_is_userptr(vma) ? &userptr_bind_ops : &bind_ops),
 			.vma = vma,
 			.tile_id = tile->id,
 		},
@@ -1651,3 +1685,75 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
 
 	return fence;
 }
+
+/**
+ * xe_bind_svm_range() - bind an address range to vm
+ *
+ * @vm: the vm to bind this address range
+ * @tile: the tile to bind this address range to
+ * @range: a hmm_range which includes all the information
+ * needed for binding: virtual address range and physical
+ * pfns to back up this virtual address range.
+ * @flags: the binding flags to set in pte
+ *
+ * This is a helper function used by svm sub-system
+ * to bind a svm range to gpu vm. svm sub-system
+ * doesn't have xe_vma, thus helpers such as
+ * __xe_pt_bind_vma can't be used directly. So this
+ * helper is written for svm sub-system to use.
+ *
+ * This is a synchronous function. When this function
+ * returns, either the svm range is bound to the GPU, or
+ * an error has happened.
+ *
+ * Return: 0 for success or error code for failure
+ * If -EAGAIN is returned, it means the mmu notifier was called
+ * (i.e., there was a concurrent cpu page table update) during
+ * this function; the caller has to retry hmm_range_fault
+ */
+int xe_bind_svm_range(struct xe_vm *vm, struct xe_tile *tile,
+			struct hmm_range *range, u64 flags)
+{
+	struct dma_fence *fence = NULL;
+	struct xe_svm *svm = vm->svm;
+	int ret = 0;
+	/* Create a temp vma to reuse page table helpers such as
+	 * __xe_pt_bind_vma */
+	struct xe_vma vma = {
+		.gpuva = {
+			.va = {
+				.addr = range->start,
+				.range = range->end - range->start,
+			},
+			.vm = &vm->gpuvm,
+			.flags = flags,
+		},
+		.tile_mask = 0x1 << tile->id,
+		.hmm_range = range,
+	};
+
+	ret = xe_svm_build_sg(range, &vma.svm_sgt);
+	if (ret)
+		return ret;
+	vma.svm_sg = &vma.svm_sgt;
+
+	mutex_lock(&svm->mutex);
+	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
+		ret = -EAGAIN;
+		goto unlock;
+	}
+	xe_vm_lock(vm, true);
+	fence = __xe_pt_bind_vma(tile, &vma, vm->q[tile->id], NULL, 0, false);
+	xe_vm_unlock(vm);
+
+unlock:
+	mutex_unlock(&svm->mutex);
+	sg_free_table(vma.svm_sg);
+
+	if (IS_ERR(fence))
+		return PTR_ERR(fence);
+
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
+	return ret;
+}
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 71a4fbfcff43..775d08707466 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -17,6 +17,8 @@ struct xe_sync_entry;
 struct xe_tile;
 struct xe_vm;
 struct xe_vma;
+struct xe_svm;
+struct hmm_range;
 
 /* Largest huge pte is currently 1GiB. May become device dependent. */
 #define MAX_HUGEPTE_LEVEL 2
@@ -45,4 +47,6 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
 
 bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
 
+int xe_bind_svm_range(struct xe_vm *vm, struct xe_tile *tile,
+			struct hmm_range *range, u64 flags);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 037fb7168c63..68c7484b2110 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -21,6 +21,7 @@ struct xe_svm;
 struct xe_bo;
 struct xe_sync_entry;
 struct xe_vm;
+struct hmm_range;
 
 #define TEST_VM_ASYNC_OPS_ERROR
 #define FORCE_ASYNC_OP_ERROR	BIT(31)
@@ -107,9 +108,19 @@ struct xe_vma {
 	 */
 	u16 pat_index;
 
+	/**
+	 * @svm_sgt: a scatter gather table to save svm virtual address range's
+	 * pfns
+	 */
+	struct sg_table svm_sgt;
+	struct sg_table *svm_sg;
+	/** hmm range of this pt update, used by svm */
+	struct hmm_range *hmm_range;
+
 	/**
 	 * @userptr: user pointer state, only allocated for VMAs that are
-	 * user pointers
+	 * user pointers. When you add new members to the xe_vma struct, userptr
+	 * has to be the last member; xe_vma_create assumes this.
 	 */
 	struct xe_userptr userptr;
 };
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 08/23] drm/xe/svm: Add helper to invalidate svm range from GPU
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (6 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 07/23] drm/xe/svm: Add helper for binding hmm range to gpu Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 09/23] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

A svm-subsystem-friendly function is added for svm range invalidation
purposes. The svm subsystem doesn't maintain xe_vma, so a temporary
xe_vma is used to call the function xe_vm_invalidate_vma.

Not sure whether this works or not. Will have to test. If a temporary
vma doesn't work, we will have to call the zap_pte/tlb_inv functions
directly.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c | 33 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_pt.h |  1 +
 2 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index f1e479fa3001..7ae8954be041 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -1757,3 +1757,36 @@ int xe_bind_svm_range(struct xe_vm *vm, struct xe_tile *tile,
 	dma_fence_put(fence);
 	return ret;
 }
+
+/**
+ * xe_invalidate_svm_range() - a helper to invalidate a svm address range
+ *
+ * @vm: The vm that the address range belongs to
+ * @start: start of the virtual address range
+ * @size: size of the virtual address range
+ *
+ * This is a helper function supposed to be used by the svm subsystem.
+ * The svm subsystem doesn't maintain xe_vma, so we create a temporary
+ * xe_vma structure so we can reuse xe_vm_invalidate_vma().
+ */
+void xe_invalidate_svm_range(struct xe_vm *vm, u64 start, u64 size)
+{
+	struct xe_vma vma = {
+		.gpuva = {
+			.va = {
+				.addr = start,
+				.range = size,
+			},
+			.vm = &vm->gpuvm,
+		},
+		/* invalidate from all tiles
+		 * FIXME: We used a temporary vma in xe_bind_svm_range, so
+		 * we lost track of which tile we are bound to. Does
+		 * setting tile_present to all tiles cause a problem
+		 * in xe_vm_invalidate_vma()?
+		 */
+		.tile_present = BIT(vm->xe->info.tile_count) - 1,
+	};
+
+	xe_vm_invalidate_vma(&vma);
+}
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 775d08707466..42d495997635 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -49,4 +49,5 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
 
 int xe_bind_svm_range(struct xe_vm *vm, struct xe_tile *tile,
 			struct hmm_range *range, u64 flags);
+void xe_invalidate_svm_range(struct xe_vm *vm, u64 start, u64 size);
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 09/23] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (7 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 08/23] drm/xe/svm: Add helper to invalidate svm range from GPU Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 10/23] drm/xe/svm: Introduce svm migration function Oak Zeng
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Memory remap GPU vram using devm_memremap_pages, so that each GPU vram
page is backed by a struct page.

Those struct pages are created to allow hmm to migrate buffers between
GPU vram and CPU system memory using the existing Linux migration
mechanism (i.e., the same mechanism used to migrate between CPU system
memory and hard disk).

This is preparation work to enable svm (shared virtual memory) through
the Linux kernel hmm framework. The memory remap's page map type is set
to MEMORY_DEVICE_PRIVATE for now. This means that even though each GPU
vram page gets a struct page and can be mapped in the CPU page table,
such pages are treated as the GPU's private resource, so the CPU can't
access them. If the CPU accesses such a page, a page fault is triggered
and the page will be migrated to system memory.

For GPU devices which support a coherent memory protocol between CPU and
GPU (such as the CXL and CAPI protocols), we can remap device memory as
MEMORY_DEVICE_COHERENT. This is TBD.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h |  8 +++
 drivers/gpu/drm/xe/xe_mmio.c         |  7 +++
 drivers/gpu/drm/xe/xe_svm.h          |  2 +
 drivers/gpu/drm/xe/xe_svm_devmem.c   | 87 ++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 7eda86bd4c2a..6dba5b0ab481 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -99,6 +99,14 @@ struct xe_mem_region {
 	resource_size_t actual_physical_size;
 	/** @mapping: pointer to VRAM mappable space */
 	void __iomem *mapping;
+	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
+	struct dev_pagemap pagemap;
+	/**
+	 * @hpa_base: base host physical address
+	 *
+	 * This is generated when remap device memory as ZONE_DEVICE
+	 */
+	resource_size_t hpa_base;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
index c8c5d74b6e90..3d34dcfa3b3a 100644
--- a/drivers/gpu/drm/xe/xe_mmio.c
+++ b/drivers/gpu/drm/xe/xe_mmio.c
@@ -21,6 +21,7 @@
 #include "xe_macros.h"
 #include "xe_module.h"
 #include "xe_tile.h"
+#include "xe_svm.h"
 
 #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
 #define TILE_COUNT		REG_GENMASK(15, 8)
@@ -285,6 +286,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
 		}
 
 		io_size -= min_t(u64, tile_size, io_size);
+		xe_svm_devm_add(tile, &tile->mem.vram);
 	}
 
 	xe->mem.vram.actual_physical_size = total_size;
@@ -353,10 +355,15 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
 static void mmio_fini(struct drm_device *drm, void *arg)
 {
 	struct xe_device *xe = arg;
+	struct xe_tile *tile;
+	u8 id;
 
 	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
 	if (xe->mem.vram.mapping)
 		iounmap(xe->mem.vram.mapping);
+	for_each_tile(tile, xe, id) {
+		xe_svm_devm_remove(xe, &tile->mem.vram);
+	}
 }
 
 static int xe_verify_lmem_ready(struct xe_device *xe)
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 191bce6425db..b54f7714a1fc 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -72,4 +72,6 @@ struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
 struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
 								unsigned long addr);
 int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
+int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
+void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mr);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
new file mode 100644
index 000000000000..cf7882830247
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mm_types.h>
+#include <linux/sched/mm.h>
+
+#include "xe_device_types.h"
+#include "xe_trace.h"
+
+
+static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
+{
+	return 0;
+}
+
+static void xe_devm_page_free(struct page *page)
+{
+}
+
+static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
+	.page_free = xe_devm_page_free,
+	.migrate_to_ram = xe_devm_migrate_to_ram,
+};
+
+/**
+ * xe_svm_devm_add() - Remap and provide memmap backing for device memory
+ * @tile: tile that the memory region belongs to
+ * @mr: memory region to remap
+ *
+ * This remaps device memory to the host physical address space and creates
+ * struct pages to back the device memory
+ *
+ * Return: 0 on success, standard error code otherwise
+ */
+int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
+	struct resource *res;
+	void *addr;
+	int ret;
+
+	res = devm_request_free_mem_region(dev, &iomem_resource,
+					   mr->usable_size);
+	if (IS_ERR(res)) {
+		ret = PTR_ERR(res);
+		return ret;
+	}
+
+	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	mr->pagemap.range.start = res->start;
+	mr->pagemap.range.end = res->end;
+	mr->pagemap.nr_range = 1;
+	mr->pagemap.ops = &xe_devm_pagemap_ops;
+	mr->pagemap.owner = tile->xe->drm.dev;
+	addr = devm_memremap_pages(dev, &mr->pagemap);
+	if (IS_ERR(addr)) {
+		devm_release_mem_region(dev, res->start, resource_size(res));
+		ret = PTR_ERR(addr);
+		drm_err(&tile->xe->drm, "Failed to remap tile %d memory, errno %d\n",
+				tile->id, ret);
+		return ret;
+	}
+	mr->hpa_base = res->start;
+
+	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
+			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
+	return 0;
+}
+
+/**
+ * xe_svm_devm_remove() - Unmap device memory and free resources
+ * @xe: xe device
+ * @mr: memory region to remove
+ */
+void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mr)
+{
+	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
+
+	if (mr->hpa_base) {
+		devm_memunmap_pages(dev, &mr->pagemap);
+		devm_release_mem_region(dev, mr->pagemap.range.start,
+			mr->pagemap.range.end - mr->pagemap.range.start + 1);
+	}
+}
+
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread
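
To see what hpa_base above is for: once vram is remapped, a ZONE_DEVICE
struct page's host physical address is offset from hpa_base by exactly
the page's offset into vram, so translating a struct page back to a GPU
device physical address is simple arithmetic. A sketch, assuming the
region also records its device physical base (dpa_base below, a field
of xe_mem_region not shown in this hunk):

static u64 xe_mem_region_page_to_dpa(struct xe_mem_region *mr,
				     struct page *page)
{
	u64 hpa = PFN_PHYS(page_to_pfn(page));

	return mr->dpa_base + (hpa - mr->hpa_base);
}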

* [PATCH 10/23] drm/xe/svm: Introduce svm migration function
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (8 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 09/23] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 11/23] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Introduce the xe_migrate_svm function for data migration.
This function is similar to the xe_migrate_copy function
but has different parameters. Instead of BO and ttm
resource parameters, it takes the source and destination
buffers' dpa (device physical address) as parameters. This
function is intended to be used by the svm sub-system,
which doesn't have the BO and TTM concepts.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 213 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h |   7 ++
 2 files changed, 220 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 44725f978f3e..5bd9fd40f93f 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -429,6 +429,37 @@ static bool xe_migrate_allow_identity(u64 size, const struct xe_res_cursor *cur)
 	return cur->size >= size;
 }
 
+/**
+ * pte_update_cmd_size() - calculate the batch buffer command size
+ * to update a flat page table.
+ *
+ * @size: The virtual address range size of the page table to update
+ *
+ * The page table to update is supposed to be a flat 1 level page
+ * table with all entries pointing to 4k pages.
+ *
+ * Return the number of dwords of the update command
+ */
+static u32 pte_update_cmd_size(u64 size)
+{
+	u32 dword;
+	u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+
+	XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
+	/*
+	 * MI_STORE_DATA_IMM command is used to update the page table. Each
+	 * instruction can update at most 0x1ff pte entries. To update
+	 * n (n <= 0x1ff) pte entries, we need:
+	 * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc)
+	 * 2 dword for the page table's physical location
+	 * 2*n dword for value of pte to fill (each pte entry is 2 dwords)
+	 */
+	dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
+	dword += entries * 2;
+
+	return dword;
+}
+
 static u32 pte_update_size(struct xe_migrate *m,
 			   bool is_vram,
 			   struct ttm_resource *res,
@@ -529,6 +560,48 @@ static void emit_pte(struct xe_migrate *m,
 	}
 }
 
+/**
+ * build_pt_update_batch_sram() - build batch buffer commands to update
+ * migration vm page table for system memory
+ *
+ * @m: The migration context
+ * @bb: The batch buffer which hold the page table update commands
+ * @pt_offset: The offset of page table to update, in byte
+ * @dpa: device physical address you want the page table to point to
+ * @size: size of the virtual address space you want the page table to cover
+ */
+static void build_pt_update_batch_sram(struct xe_migrate *m,
+		     struct xe_bb *bb, u32 pt_offset,
+		     u64 dpa, u32 size)
+{
+	u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
+	u32 ptes;
+
+	ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+	while (ptes) {
+		u32 chunk = min(0x1ffU, ptes);
+
+		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
+		bb->cs[bb->len++] = pt_offset;
+		bb->cs[bb->len++] = 0;
+
+		pt_offset += chunk * 8;
+		ptes -= chunk;
+
+		while (chunk--) {
+			u64 addr;
+
+			addr = dpa & PAGE_MASK;
+			addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
+								 addr, pat_index,
+								 0, false, 0);
+			bb->cs[bb->len++] = lower_32_bits(addr);
+			bb->cs[bb->len++] = upper_32_bits(addr);
+			dpa += XE_PAGE_SIZE;
+		}
+	}
+}
+
 #define EMIT_COPY_CCS_DW 5
 static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
 			  u64 dst_ofs, bool dst_is_indirect,
@@ -846,6 +919,146 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 	return fence;
 }
 
+/**
+ * xe_migrate_svm() - A migrate function used by SVM subsystem
+ *
+ * @m: The migration context
+ * @src_dpa: device physical start address of source, from GPU's point of view
+ * @src_is_vram: True if source buffer is in vram.
+ * @dst_dpa: device physical start address of destination, from GPU's point of view
+ * @dst_is_vram: True if destination buffer is in vram.
+ * @size: The size of data to copy.
+ *
+ * Copy @size bytes of data from @src_dpa to @dst_dpa. The functionality
+ * and behavior of this function are similar to the xe_migrate_copy function,
+ * but the interface is different. This function is a helper supposed to
+ * be used by the SVM subsystem. Since in the SVM subsystem there is no buffer
+ * object and ttm, there is no src/dst bo as function input. Instead, we directly
+ * use the src/dst's physical address as function input.
+ *
+ * Since the backing store of any user malloc'ed or mmap'ed memory can be placed
+ * in system memory, it cannot be compressed. Thus this function doesn't need
+ * to consider copying CCS (compression control surface) data as xe_migrate_copy did.
+ *
+ * This function assumes the source buffer and destination buffer are both
+ * physically contiguous.
+ *
+ * We use gpu blitter to copy data. Source and destination are first mapped to
+ * migration vm which is a flat one level (L0) page table, then blitter is used to
+ * perform the copy.
+ *
+ * Return: Pointer to a dma_fence representing the last copy batch, or
+ * an error pointer on failure. If there is a failure, any copy operation
+ * started by the function call has been synced.
+ */
+struct dma_fence *xe_migrate_svm(struct xe_migrate *m,
+				  u64 src_dpa,
+				  bool src_is_vram,
+				  u64 dst_dpa,
+				  bool dst_is_vram,
+				  u64 size)
+{
+#define NUM_PT_PER_BLIT (MAX_PREEMPTDISABLE_TRANSFER / SZ_2M)
+	struct xe_gt *gt = m->tile->primary_gt;
+	struct xe_device *xe = gt_to_xe(gt);
+	struct dma_fence *fence = NULL;
+	u64 src_L0_ofs, dst_L0_ofs;
+	u64 round_update_size;
+	/* A slot is a 4K page of page table, covering 2M of virtual address space */
+	u32 pt_slot;
+	int err;
+
+	while (size) {
+		u32 batch_size = 2; /* arb_clear() + MI_BATCH_BUFFER_END */
+		struct xe_sched_job *job;
+		struct xe_bb *bb;
+		u32 update_idx;
+
+		/* Copy at most MAX_PREEMPTDISABLE_TRANSFER bytes per batch */
+		round_update_size = min_t(u64, size, MAX_PREEMPTDISABLE_TRANSFER);
+
+		/* src pte update */
+		if (!src_is_vram)
+			batch_size += pte_update_cmd_size(round_update_size);
+		/* dst pte update */
+		if (!dst_is_vram)
+			batch_size += pte_update_cmd_size(round_update_size);
+
+		/* Copy command size */
+		batch_size += EMIT_COPY_DW;
+
+		bb = xe_bb_new(gt, batch_size, true);
+		if (IS_ERR(bb)) {
+			err = PTR_ERR(bb);
+			goto err_sync;
+		}
+
+		if (!src_is_vram) {
+			pt_slot = 0;
+			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+					src_dpa, round_update_size);
+			src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		} else {
+			src_L0_ofs = xe_migrate_vram_ofs(xe, src_dpa);
+		}
+
+		if (!dst_is_vram) {
+			pt_slot = NUM_PT_PER_BLIT;
+			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+					dst_dpa, round_update_size);
+			dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		} else {
+			dst_L0_ofs = xe_migrate_vram_ofs(xe, dst_dpa);
+		}
+
+		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
+		update_idx = bb->len;
+
+		emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
+			  XE_PAGE_SIZE);
+
+		mutex_lock(&m->job_mutex);
+		job = xe_bb_create_migration_job(m->q, bb,
+						 xe_migrate_batch_base(m, true),
+						 update_idx);
+		if (IS_ERR(job)) {
+			err = PTR_ERR(job);
+			goto err;
+		}
+
+		xe_sched_job_add_migrate_flush(job, 0);
+		xe_sched_job_arm(job);
+		dma_fence_put(fence);
+		fence = dma_fence_get(&job->drm.s_fence->finished);
+		xe_sched_job_push(job);
+		dma_fence_put(m->fence);
+		m->fence = dma_fence_get(fence);
+
+		mutex_unlock(&m->job_mutex);
+
+		xe_bb_free(bb, fence);
+		size -= round_update_size;
+		src_dpa += round_update_size;
+		dst_dpa += round_update_size;
+		continue;
+
+err:
+		mutex_unlock(&m->job_mutex);
+		xe_bb_free(bb, NULL);
+
+err_sync:
+		/* Sync partial copy if any. FIXME: under job_mutex? */
+		if (fence) {
+			dma_fence_wait(fence, false);
+			dma_fence_put(fence);
+		}
+
+		return ERR_PTR(err);
+	}
+
+	return fence;
+}
+
 static void emit_clear_link_copy(struct xe_gt *gt, struct xe_bb *bb, u64 src_ofs,
 				 u32 size, u32 pitch)
 {
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 951f19318ea4..a532760ae1fa 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -88,6 +88,13 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 				  struct ttm_resource *dst,
 				  bool copy_only_ccs);
 
+struct dma_fence *xe_migrate_svm(struct xe_migrate *m,
+				  u64 src_dpa,
+				  bool src_is_vram,
+				  u64 dst_dpa,
+				  bool dst_is_vram,
+				  u64 size);
+
 struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 				   struct xe_bo *bo,
 				   struct ttm_resource *dst);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 11/23] drm/xe/svm: implement functions to allocate and free device memory
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (9 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 10/23] drm/xe/svm: Introduce svm migration function Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 12/23] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Function xe_devm_alloc_pages allocates pages from drm buddy and performs
housekeeping work for all the pages allocated, such as taking a page
refcount, keeping a bitmap of all pages to denote whether a page is in
use, and putting pages on a drm lru list for eviction purposes.

Function xe_devm_free_blocks returns all memory blocks to the drm buddy
allocator.

Function xe_devm_page_free is a callback function from the hmm layer. It
is called whenever a page's refcount reaches 1. This function clears
the bit of this page in the bitmap. If all the bits in the bitmap are
cleared, it means all the pages have been freed, so we return all the
pages in this memory block back to drm buddy.
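
For illustration, a caller of these helpers would look roughly like the
sketch below (hypothetical snippet, not part of this patch; tile, npages
and migration_failed stand in for the caller's context):

	LIST_HEAD(blocks);
	unsigned long *pfns;
	int ret;

	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;

	ret = xe_devm_alloc_pages(tile, npages, &blocks, pfns);
	if (ret)
		goto out;

	/* use pfn_to_page(pfns[i]) as migration destinations ... */

	/* on a migration error path, return every block to drm buddy */
	if (migration_failed)
		xe_devm_free_blocks(&blocks);
out:
	kvfree(pfns);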

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h        |   9 ++
 drivers/gpu/drm/xe/xe_svm_devmem.c | 146 ++++++++++++++++++++++++++++-
 2 files changed, 154 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index b54f7714a1fc..8551df2b9780 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -74,4 +74,13 @@ struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
 int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
 int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
 void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mem);
+
+
+int xe_devm_alloc_pages(struct xe_tile *tile,
+						unsigned long npages,
+						struct list_head *blocks,
+						unsigned long *pfn);
+
+void xe_devm_free_blocks(struct list_head *blocks);
+void xe_devm_page_free(struct page *page);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index cf7882830247..445e0e1bc3b4 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -5,18 +5,162 @@
 
 #include <linux/mm_types.h>
 #include <linux/sched/mm.h>
+#include <linux/gfp.h>
+#include <linux/migrate.h>
+#include <linux/dma-mapping.h>
+#include <linux/dma-fence.h>
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <drm/drm_buddy.h>
 
 #include "xe_device_types.h"
 #include "xe_trace.h"
+#include "xe_migrate.h"
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_assert.h"
 
+/**
+ * struct xe_svm_block_meta - svm uses this data structure to manage each
+ * block allocated from drm buddy. This will be set to the drm_buddy_block's
+ * private field.
+ *
+ * @lru: used to link this block to drm's lru lists. This will be replaced
+ * with struct drm_lru_entity later.
+ * @tile: tile from which we allocated this block
+ * @bitmap: A bitmap of each page in this block. 1 means this page is used,
+ * 0 means this page is idle. When all bits of this block are 0, it is time
+ * to return this block to drm buddy subsystem.
+ */
+struct xe_svm_block_meta {
+	struct list_head lru;
+	struct xe_tile *tile;
+	unsigned long bitmap[];
+};
+
+static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
+{
+	/* DRM buddy's block offset is 0-based */
+	offset += mr->hpa_base;
+
+	return PHYS_PFN(offset);
+}
+
+/**
+ * xe_devm_alloc_pages() - allocate device pages from buddy allocator
+ *
+ * @tile: which tile to allocate device memory from
+ * @npages: how many pages to allocate
+ * @blocks: used to return the allocated blocks
+ * @pfn: used to return the pfn of all allocated pages. Must be big enough
+ * to hold @npages entries.
+ *
+ * This function allocates blocks of memory from the drm buddy allocator,
+ * and performs initialization work: set struct page::zone_device_data to
+ * point to the memory block; set/initialize drm_buddy_block::private field;
+ * lock each page allocated (per hmm requirement); adding the memory blocks
+ * to an lru manager's lru list is TBD.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int xe_devm_alloc_pages(struct xe_tile *tile,
+						unsigned long npages,
+						struct list_head *blocks,
+						unsigned long *pfn)
+{
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	struct drm_buddy_block *block, *tmp;
+	u64 size = npages << PAGE_SHIFT;
+	int ret = 0, i, j = 0;
+
+	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
+						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
+
+	if (unlikely(ret))
+		return ret;
+
+	list_for_each_entry_safe(block, tmp, blocks, link) {
+		struct xe_mem_region *mr = &tile->mem.vram;
+		u64 block_pfn_first, pages_per_block;
+		struct xe_svm_block_meta *meta;
+		u32 meta_size;
+
+		size = drm_buddy_block_size(mm, block);
+		pages_per_block = size >> PAGE_SHIFT;
+		meta_size = BITS_TO_BYTES(pages_per_block) +
+					sizeof(struct xe_svm_block_meta);
+		meta = kzalloc(meta_size, GFP_KERNEL);
+		if (unlikely(!meta)) {
+			/* hand all blocks back to drm buddy on allocation failure */
+			drm_buddy_free_list(mm, blocks);
+			return -ENOMEM;
+		}
+		bitmap_fill(meta->bitmap, pages_per_block);
+		meta->tile = tile;
+		block->private = meta;
+		block_pfn_first =
+					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+		for (i = 0; i < pages_per_block; i++) {
+			struct page *page;
+
+			pfn[j++] = block_pfn_first + i;
+			page = pfn_to_page(block_pfn_first + i);
+			/* Lock page per hmm requirement, see hmm.rst. */
+			zone_device_page_init(page);
+			page->zone_device_data = block;
+		}
+	}
+
+	return ret;
+}
+
+/* FIXME: we locked the pages by calling zone_device_page_init
+ * in xe_devm_alloc_pages. Should we unlock the pages here?
+ */
+static void free_block(struct drm_buddy_block *block)
+{
+	struct xe_svm_block_meta *meta =
+		(struct xe_svm_block_meta *)block->private;
+	struct xe_tile *tile = meta->tile;
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+
+	kfree(block->private);
+	drm_buddy_free_block(mm, block);
+}
+
+/**
+ * xe_devm_free_blocks() - free all memory blocks
+ *
+ * @blocks: memory blocks list head
+ */
+void xe_devm_free_blocks(struct list_head *blocks)
+{
+	struct drm_buddy_block *block, *tmp;
+
+	list_for_each_entry_safe(block, tmp, blocks, link)
+		free_block(block);
+}
 
 static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
 {
 	return 0;
 }
 
-static void xe_devm_page_free(struct page *page)
+void xe_devm_page_free(struct page *page)
 {
+	struct drm_buddy_block *block =
+					(struct drm_buddy_block *)page->zone_device_data;
+	struct xe_svm_block_meta *meta =
+					(struct xe_svm_block_meta *)block->private;
+	struct xe_tile *tile = meta->tile;
+	struct xe_mem_region *mr = &tile->mem.vram;
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	u64 size = drm_buddy_block_size(mm, block);
+	u64 pages_per_block = size >> PAGE_SHIFT;
+	u64 block_pfn_first =
+					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+	u64 page_pfn = page_to_pfn(page);
+	u64 i = page_pfn - block_pfn_first;
+
+	xe_assert(tile->xe, i < pages_per_block);
+	clear_bit(i, meta->bitmap);
+	if (bitmap_empty(meta->bitmap, pages_per_block))
+		free_block(block);
 }
 
 static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 12/23] drm/xe/svm: Trace buddy block allocation and free
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (10 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 11/23] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 13/23] drm/xe/svm: Handle CPU page fault Oak Zeng
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm_devmem.c |  5 ++++-
 drivers/gpu/drm/xe/xe_trace.h      | 35 ++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 445e0e1bc3b4..5cd54dde4a9d 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -95,6 +95,7 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 		block->private = meta;
 		block_pfn_first =
 					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+		trace_xe_buddy_block_alloc(block, size, block_pfn_first);
 		for (i = 0; i < pages_per_block; i++) {
 			struct page *page;
 
@@ -159,8 +160,10 @@ void xe_devm_page_free(struct page *page)
 
 	xe_assert(tile->xe, i < pages_per_block);
 	clear_bit(i, meta->bitmap);
-	if (bitmap_empty(meta->bitmap, pages_per_block))
+	if (bitmap_empty(meta->bitmap, pages_per_block)) {
+		/* trace before freeing: free_block() may release the block struct */
+		trace_xe_buddy_block_free(block, size, block_pfn_first);
 		free_block(block);
+	}
 }
 
 static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 63867c0fa848..50380f5173ca 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -11,6 +11,7 @@
 
 #include <linux/tracepoint.h>
 #include <linux/types.h>
+#include <drm/drm_buddy.h>
 
 #include "xe_bo_types.h"
 #include "xe_exec_queue_types.h"
@@ -600,6 +601,40 @@ DEFINE_EVENT_PRINT(xe_guc_ctb, xe_guc_ctb_g2h,
 
 );
 
+DECLARE_EVENT_CLASS(xe_buddy_block,
+		    TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+		    TP_ARGS(block, size, pfn),
+
+		    TP_STRUCT__entry(
+			     __field(u64, block)
+			     __field(u64, header)
+			     __field(u64, size)
+			     __field(u64, pfn)
+			     ),
+
+		    TP_fast_assign(
+			   __entry->block = (u64)block;
+			   __entry->header = block->header;
+			   __entry->size = size;
+			   __entry->pfn = pfn;
+			   ),
+
+		    TP_printk("xe svm: buddy block %llx, header %llx, size %llx, pfn %llx",
+			      __entry->block, __entry->header, __entry->size, __entry->pfn)
+);
+
+DEFINE_EVENT(xe_buddy_block, xe_buddy_block_alloc,
+	     TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+	     TP_ARGS(block, size, pfn)
+);
+
+DEFINE_EVENT(xe_buddy_block, xe_buddy_block_free,
+	     TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+	     TP_ARGS(block, size, pfn)
+);
+
 #endif
 
 /* This part must be outside protection */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 13/23] drm/xe/svm: Handle CPU page fault
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (11 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 12/23] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 14/23] drm/xe/svm: trace svm range migration Oak Zeng
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Under SVM, the CPU program and GPU program share one single
virtual address space. The backing store of this virtual address
space can be either in system memory or device memory. Since GPU
device memory is remapped as DEVICE_PRIVATE, the CPU can't access it.
Any CPU access to device memory causes a page fault. Implement
a page fault handler to migrate memory back to system memory and
map it to the CPU page table so the CPU program can proceed.

Also unbind this page from the GPU side, and free the original GPU
device page.
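
The CPU fault handler is wired in through the device's pagemap operations,
roughly as in the sketch below (illustrative hookup only; the two callback
fields are those of struct dev_pagemap_ops from include/linux/memremap.h):

	static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
		.page_free = xe_devm_page_free,
		.migrate_to_ram = xe_devm_migrate_to_ram,
	};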

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h |  12 ++
 drivers/gpu/drm/xe/xe_svm.h          |   8 +-
 drivers/gpu/drm/xe/xe_svm_devmem.c   |  10 +-
 drivers/gpu/drm/xe/xe_svm_migrate.c  | 230 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm_range.c    |  27 ++++
 5 files changed, 280 insertions(+), 7 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 6dba5b0ab481..c08e41cb3229 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -573,4 +573,16 @@ struct xe_file {
 	struct xe_drm_client *client;
 };
 
+static inline struct xe_tile *mem_region_to_tile(struct xe_mem_region *mr)
+{
+	return container_of(mr, struct xe_tile, mem.vram);
+}
+
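+/* Convert a host physical pfn of vram (as seen through struct page) back to
+ * the device physical address the GPU uses: strip the host-side remap base
+ * (hpa_base) and add the device-side base (dpa_base).
+ */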
+static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
+{
+	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+
+	return mr->dpa_base + offset;
+}
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 8551df2b9780..6b93055934f8 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -12,8 +12,10 @@
 #include <linux/rbtree_types.h>
 #include <linux/interval_tree.h>
 #include <linux/hashtable.h>
+#include <linux/mm_types.h>
 #include <linux/types.h>
 #include <linux/hmm.h>
+#include <linux/mm.h>
 #include "xe_device_types.h"
 
 struct xe_vm;
@@ -66,16 +68,20 @@ struct xe_svm_range {
 	struct interval_tree_node inode;
 };
 
+vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf);
 void xe_destroy_svm(struct xe_svm *svm);
 struct xe_svm *xe_create_svm(struct xe_vm *vm);
 struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
 struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
 								unsigned long addr);
+bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
+								struct xe_svm_range *range,
+								struct vm_area_struct *vma);
+
 int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
 int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
 void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mem);
 
-
 int xe_devm_alloc_pages(struct xe_tile *tile,
 						unsigned long npages,
 						struct list_head *blocks,
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 5cd54dde4a9d..01f8385ebb5b 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -11,13 +11,16 @@
 #include <linux/dma-fence.h>
 #include <linux/bitops.h>
 #include <linux/bitmap.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
 #include <drm/drm_buddy.h>
-
 #include "xe_device_types.h"
 #include "xe_trace.h"
 #include "xe_migrate.h"
 #include "xe_ttm_vram_mgr_types.h"
 #include "xe_assert.h"
+#include "xe_pt.h"
+#include "xe_svm.h"
 
 /**
  * struct xe_svm_block_meta - svm uses this data structure to manage each
@@ -137,11 +140,6 @@ void xe_devm_free_blocks(struct list_head *blocks)
 		free_block(block);
 }
 
-static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
-{
-	return 0;
-}
-
 void xe_devm_page_free(struct page *page)
 {
 	struct drm_buddy_block *block =
diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
new file mode 100644
index 000000000000..3be26da33aa3
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/gfp.h>
+#include <linux/migrate.h>
+#include <linux/dma-mapping.h>
+#include <linux/dma-fence.h>
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <drm/drm_buddy.h>
+#include "xe_device_types.h"
+#include "xe_trace.h"
+#include "xe_migrate.h"
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_assert.h"
+#include "xe_pt.h"
+#include "xe_svm.h"
+
+
+/**
+ * alloc_host_page() - allocate one host page for the fault vma
+ *
+ * @dev: (GPU) device that will access the allocated page
+ * @vma: the fault vma that we need to allocate a page for
+ * @addr: the fault address. The allocated page is for this address
+ * @dma_addr: used to output the dma address of the allocated page.
+ * This dma address will be used for the gpu to access this page. The
+ * GPU accesses a host page through a dma mapped address.
+ * @pfn: used to output the pfn of the allocated page.
+ *
+ * This function allocates one host page for the specified vma. It
+ * also does some preparation work for GPU access of this page, such
+ * as mapping this page into the iommu (by calling dma_map_page).
+ *
+ * When this function returns, the page is locked.
+ *
+ * Return: struct page pointer on success,
+ * NULL otherwise
+ */
+static struct page *alloc_host_page(struct device *dev,
+							 struct vm_area_struct *vma,
+							 unsigned long addr,
+							 dma_addr_t *dma_addr,
+							 unsigned long *pfn)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (unlikely(!page))
+		return NULL;
+
+	/* Lock page per hmm requirement, see hmm.rst */
+	lock_page(page);
+	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
+		unlock_page(page);
+		__free_page(page);
+		return NULL;
+	}
+
+	*pfn = migrate_pfn(page_to_pfn(page));
+	return page;
+}
+
+static void free_host_page(struct page *page)
+{
+	unlock_page(page);
+	put_page(page);
+}
+
+static inline struct xe_mem_region *page_to_mem_region(struct page *page)
+{
+	return container_of(page->pgmap, struct xe_mem_region, pagemap);
+}
+
+/**
+ * migrate_page_vram_to_ram() - migrate one page from vram to ram
+ *
+ * @vma: The vma that the page is mapped to
+ * @addr: The virtual address that the page is mapped to
+ * @src_pfn: src page's page frame number
+ * @dst_pfn: used to return the destination page (in system ram)'s pfn
+ *
+ * Allocate one page in system ram and copy memory from device memory
+ * to system ram.
+ *
+ * Return: 0 if this page is already in sram (no need to migrate),
+ * 1 if this page was successfully migrated from vram to sram,
+ * error code otherwise
+ */
+static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
+						unsigned long src_pfn, unsigned long *dst_pfn)
+{
+	struct xe_mem_region *mr;
+	struct xe_tile *tile;
+	struct xe_device *xe;
+	struct device *dev;
+	dma_addr_t dma_addr = 0;
+	struct dma_fence *fence;
+	struct page *host_page;
+	struct page *src_page;
+	u64 src_dpa;
+
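+	/* migrate_vma_setup() clears MIGRATE_PFN_MIGRATE for pages that core
+	 * mm decided cannot (or need not) be migrated; skip those here
+	 */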
+	src_page = migrate_pfn_to_page(src_pfn);
+	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
+		return 0;
+
+	mr = page_to_mem_region(src_page);
+	tile = mem_region_to_tile(mr);
+	xe = tile_to_xe(tile);
+	dev = xe->drm.dev;
+
+	src_dpa = vram_pfn_to_dpa(mr, src_pfn);
+	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
+	if (!host_page)
+		return -ENOMEM;
+
+	fence = xe_migrate_svm(tile->migrate, src_dpa, true,
+						dma_addr, false, PAGE_SIZE);
+	if (IS_ERR(fence)) {
+		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
+		free_host_page(host_page);
+		return PTR_ERR(fence);
+	}
+
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
+	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
+	return 1;
+}
+
+/**
+ * xe_devm_migrate_to_ram() - Migrate memory back to sram on CPU page fault
+ *
+ * @vmf: cpu vm fault structure, contains fault information such as vma etc.
+ *
+ * Note, this is in the CPU's vm fault handler, the caller holds the mmap
+ * read lock.
+ * FIXME: revisit the locking design here. Is there any deadlock?
+ *
+ * This function migrates the whole svm range which contains the fault
+ * address to sram. We try to maintain a 1:1 mapping b/t the vma and the
+ * svm_range (i.e., create one svm range per vma initially and try not to
+ * split it). So this scheme ends up migrating at the vma granularity. This
+ * might not be the best performing scheme when the GPU is in the picture.
+ *
+ * This can be tuned with a migration granularity for performance, for
+ * example, migrating 2M for each CPU page fault, or letting the user
+ * specify how much to migrate. But that is more complicated as such a
+ * scheme requires vma and svm_range splitting.
+ *
+ * This function should also update the GPU page table, so the fault
+ * virtual address points to the same sram location from the GPU side.
+ * This is TBD.
+ *
+ * Return:
+ * 0 on success
+ * VM_FAULT_SIGBUS: failed to migrate page to system memory, the
+ * application will be signaled a SIGBUS
+ */
+vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
+{
+	struct xe_mem_region *mr = page_to_mem_region(vmf->page);
+	struct xe_tile *tile = mem_region_to_tile(mr);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
+	struct xe_svm_range *range = xe_svm_range_from_addr(svm, vmf->address);
+	struct xe_vm *vm = svm->vm;
+	u64 npages = (range->end - range->start) >> PAGE_SHIFT;
+	unsigned long addr = range->start;
+	vm_fault_t ret = 0;
+	void *buf;
+	int i, r;
+
+	struct migrate_vma migrate_vma = {
+		.vma		= vmf->vma,
+		.start		= range->start,
+		.end		= range->end,
+		.pgmap_owner	= xe->drm.dev,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.fault_page = vmf->page,
+	};
+
+	xe_assert(xe, IS_ALIGNED(vmf->address, PAGE_SIZE));
+	xe_assert(xe, IS_ALIGNED(range->start, PAGE_SIZE));
+	xe_assert(xe, IS_ALIGNED(range->end, PAGE_SIZE));
+	/* FIXME: in case of vma split, the svm range might not belong to one vma */
+	xe_assert(xe, xe_svm_range_belongs_to_vma(mm, range, vma));
+
+	buf = kvcalloc(npages, 2 * sizeof(*migrate_vma.src), GFP_KERNEL);
+	if (!buf)
+		return VM_FAULT_OOM;
+	migrate_vma.src = buf;
+	migrate_vma.dst = buf + npages;
+	if (migrate_vma_setup(&migrate_vma) < 0) {
+		ret = VM_FAULT_SIGBUS;
+		goto free_buf;
+	}
+
+	if (!migrate_vma.cpages)
+		goto free_buf;
+
+	for (i = 0; i < npages; i++) {
+		r = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
+							migrate_vma.dst + i);
+		if (r < 0) {
+			ret = VM_FAULT_SIGBUS;
+			break;
+		}
+
+		/* Migration of this page was successful: unbind the src page
+		 * from the gpu, and free the source page
+		 */
+		if (r == 1) {
+			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
+
+			xe_invalidate_svm_range(vm, addr, PAGE_SIZE);
+			xe_devm_page_free(src_page);
+		}
+
+		addr += PAGE_SIZE;
+	}
+
+	migrate_vma_pages(&migrate_vma);
+	migrate_vma_finalize(&migrate_vma);
+free_buf:
+	kvfree(buf);
+	return ret;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm_range.c b/drivers/gpu/drm/xe/xe_svm_range.c
index d8251d38f65e..b32c32f60315 100644
--- a/drivers/gpu/drm/xe/xe_svm_range.c
+++ b/drivers/gpu/drm/xe/xe_svm_range.c
@@ -5,7 +5,9 @@
 
 #include <linux/interval_tree.h>
 #include <linux/container_of.h>
+#include <linux/mm_types.h>
 #include <linux/mutex.h>
+#include <linux/mm.h>
 #include "xe_svm.h"
 
 /**
@@ -30,3 +32,28 @@ struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
 
 	return container_of(node, struct xe_svm_range, inode);
 }
+
+/**
+ * xe_svm_range_belongs_to_vma() - determine whether a virtual address range
+ * belongs to a vma
+ *
+ * @mm: the mm of the virtual address range
+ * @range: the svm virtual address range
+ * @vma: the vma to check the range against
+ *
+ * Return: true if the range belongs to the vma,
+ * false otherwise
+ */
+bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
+								struct xe_svm_range *range,
+								struct vm_area_struct *vma)
+{
+	struct vm_area_struct *vma1, *vma2;
+	unsigned long start = range->start;
+	unsigned long end = range->end;
+
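+	/* probe the vma covering the first and the last bytes of the range;
+	 * any small nonzero probe window works since start/end are page
+	 * aligned
+	 */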
+	vma1 = find_vma_intersection(mm, start, start + 4);
+	vma2 = find_vma_intersection(mm, end - 4, end);
+
+	return (vma1 == vma) && (vma2 == vma);
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 14/23] drm/xe/svm: trace svm range migration
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (12 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 13/23] drm/xe/svm: Handle CPU page fault Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 15/23] drm/xe/svm: Implement functions to register and unregister mmu notifier Oak Zeng
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Add tracepoints to trace svm range migration, either
from vram to sram, or from sram to vram.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm_migrate.c |  1 +
 drivers/gpu/drm/xe/xe_trace.h       | 30 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
index 3be26da33aa3..b4df411e04f3 100644
--- a/drivers/gpu/drm/xe/xe_svm_migrate.c
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -201,6 +201,7 @@ vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
 	if (!migrate_vma.cpages)
 		goto free_buf;
 
+	trace_xe_svm_migrate_vram_to_sram(range);
 	for (i = 0; i < npages; i++) {
 		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
 							migrate_vma.dst + i);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 50380f5173ca..960eec38aee5 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -21,6 +21,7 @@
 #include "xe_guc_exec_queue_types.h"
 #include "xe_sched_job.h"
 #include "xe_vm.h"
+#include "xe_svm.h"
 
 DECLARE_EVENT_CLASS(xe_gt_tlb_invalidation_fence,
 		    TP_PROTO(struct xe_gt_tlb_invalidation_fence *fence),
@@ -601,6 +602,35 @@ DEFINE_EVENT_PRINT(xe_guc_ctb, xe_guc_ctb_g2h,
 
 );
 
+DECLARE_EVENT_CLASS(xe_svm_migrate,
+		    TP_PROTO(struct xe_svm_range *range),
+		    TP_ARGS(range),
+
+		    TP_STRUCT__entry(
+			     __field(u64, start)
+			     __field(u64, end)
+			     ),
+
+		    TP_fast_assign(
+			   __entry->start = range->start;
+			   __entry->end = range->end;
+			   ),
+
+		    TP_printk("Migrate svm range [0x%016llx,0x%016llx)",  __entry->start,
+			      __entry->end)
+);
+
+DEFINE_EVENT(xe_svm_migrate, xe_svm_migrate_vram_to_sram,
+		    TP_PROTO(struct xe_svm_range *range),
+		    TP_ARGS(range)
+);
+
+DEFINE_EVENT(xe_svm_migrate, xe_svm_migrate_sram_to_vram,
+		    TP_PROTO(struct xe_svm_range *range),
+		    TP_ARGS(range)
+);
+
 DECLARE_EVENT_CLASS(xe_buddy_block,
                TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
                TP_ARGS(block, size, pfn),
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 15/23] drm/xe/svm: Implement functions to register and unregister mmu notifier
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (13 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 14/23] drm/xe/svm: trace svm range migration Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 16/23] drm/xe/svm: Implement the mmu notifier range invalidate callback Oak Zeng
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

The xe driver registers a mmu interval notifier with core mm to monitor
vma changes. We register a mmu interval notifier for each svm range. The
mmu interval notifier has to be unregistered from a worker (see the next
patch in this series), so we also initialize a kernel worker to
unregister the mmu interval notifier.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h       | 14 ++++++
 drivers/gpu/drm/xe/xe_svm_range.c | 73 +++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 6b93055934f8..90e665f2bfc6 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -52,16 +52,28 @@ struct xe_svm {
  * struct xe_svm_range - Represents a shared virtual address range.
  */
 struct xe_svm_range {
+	/** @svm: pointer of the xe_svm that this range belongs to */
+	struct xe_svm *svm;
+
 	/** @notifier: The mmu interval notifer used to keep track of CPU
 	 * side address range change. Driver will get a callback with this
 	 * notifier if anything changed from CPU side, such as range is
 	 * unmapped from CPU
 	 */
 	struct mmu_interval_notifier notifier;
+	bool mmu_notifier_registered;
 	/** @start: start address of this range, inclusive */
 	u64 start;
 	/** @end: end address of this range, exclusive */
 	u64 end;
+	/** @vma: the corresponding vma of this svm range.
+	 *  The relationship b/t a vma and svm ranges is 1:N,
+	 *  which means one vma can be split into multiple
+	 *  @xe_svm_range while one @xe_svm_range can have
+	 *  only one vma. An N:N mapping would complicate
+	 *  the code. Let's assume 1:N for now.
+	 */
+	struct vm_area_struct *vma;
 	/** @unregister_notifier_work: A worker used to unregister this notifier */
 	struct work_struct unregister_notifier_work;
 	/** @inode: used to link this range to svm's range_tree */
@@ -77,6 +89,8 @@ struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
 bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
 								struct xe_svm_range *range,
 								struct vm_area_struct *vma);
+void xe_svm_range_unregister_mmu_notifier(struct xe_svm_range *range);
+int xe_svm_range_register_mmu_notifier(struct xe_svm_range *range);
 
 int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
 int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
diff --git a/drivers/gpu/drm/xe/xe_svm_range.c b/drivers/gpu/drm/xe/xe_svm_range.c
index b32c32f60315..286d5f7d6ecd 100644
--- a/drivers/gpu/drm/xe/xe_svm_range.c
+++ b/drivers/gpu/drm/xe/xe_svm_range.c
@@ -4,6 +4,7 @@
  */
 
 #include <linux/interval_tree.h>
+#include <linux/mmu_notifier.h>
 #include <linux/container_of.h>
 #include <linux/mm_types.h>
 #include <linux/mutex.h>
@@ -57,3 +58,75 @@ bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
 
 	return (vma1 == vma) && (vma2 == vma);
 }
+
+static const struct mmu_interval_notifier_ops xe_svm_mni_ops = {
+	.invalidate = NULL,
+};
+
+/**
+ * xe_svm_range_unregister_mmu_notifier() - unregister the mmu interval
+ * notifier of a svm range
+ *
+ * @range: svm range
+ */
+void xe_svm_range_unregister_mmu_notifier(struct xe_svm_range *range)
+{
+	if (!range->mmu_notifier_registered)
+		return;
+
+	mmu_interval_notifier_remove(&range->notifier);
+	range->mmu_notifier_registered = false;
+}
+
+static void xe_svm_unregister_notifier_work(struct work_struct *work)
+{
+	struct xe_svm_range *range;
+
+	range = container_of(work, struct xe_svm_range, unregister_notifier_work);
+
+	xe_svm_range_unregister_mmu_notifier(range);
+
+	/*
+	 * This is called from the mmu notifier MUNMAP event. When munmap is
+	 * called, this range is no longer valid. Remove it.
+	 */
+	mutex_lock(&range->svm->mutex);
+	interval_tree_remove(&range->inode, &range->svm->range_tree);
+	mutex_unlock(&range->svm->mutex);
+	kfree(range);
+}
+
+/**
+ * xe_svm_range_register_mmu_notifier() - register a mmu interval notifier
+ * to monitor vma changes
+ *
+ * @range: svm range to monitor
+ *
+ * This has to be called with mmap_read_lock held
+ */
+int xe_svm_range_register_mmu_notifier(struct xe_svm_range *range)
+{
+	struct vm_area_struct *vma = range->vma;
+	struct mm_struct *mm = range->svm->mm;
+	u64 start, length;
+	int ret = 0;
+
+	if (range->mmu_notifier_registered)
+		return 0;
+
+	start = range->start;
+	length = range->end - start;
+	/* We are inside a mmap_read_lock, but registering a mmu notifier
+	 * requires a mmap_write_lock.
+	 */
+	mmap_read_unlock(mm);
+	mmap_write_lock(mm);
+	ret = mmu_interval_notifier_insert_locked(&range->notifier, vma->vm_mm,
+						start, length, &xe_svm_mni_ops);
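+	/* downgrade, not unlock: the caller entered with mmap_read_lock held
+	 * and expects it to still be held on return
+	 */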
+	mmap_write_downgrade(mm);
+	if (ret)
+		return ret;
+
+	INIT_WORK(&range->unregister_notifier_work, xe_svm_unregister_notifier_work);
+	range->mmu_notifier_registered = true;
+	return ret;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 16/23] drm/xe/svm: Implement the mmu notifier range invalidate callback
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (14 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 15/23] drm/xe/svm: Implement functions to register and unregister mmu notifier Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 17/23] drm/xe/svm: clean up svm range during process exit Oak Zeng
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

To mirror the CPU page table from the GPU side, we register a mmu interval
notifier (in a coming patch of this series). Core mm calls back into the
GPU driver whenever there is a change to a certain virtual address range,
i.e., the range is released or unmapped by the user etc.

This patch implements the GPU driver callback function for such a mmu
interval notifier. In the callback function we unbind the address
range from the GPU if it is unmapped from the CPU side, thus mirroring
the CPU page table change.

We also unregister the mmu interval notifier from core mm in the case
of a munmap event. But we can't unregister the mmu notifier directly from
the mmu notifier range invalidation callback function. The reason is that
during a munmap (see kernel function vm_munmap), a mmap_write_lock is
held, but unregistering the mmu notifier (calling
mmu_interval_notifier_remove) also requires the mmap_write_lock of the
current process.

Thus, we start a kernel worker to unregister the mmu interval notifier on
a MMU_NOTIFY_UNMAP event.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c       |  1 +
 drivers/gpu/drm/xe/xe_svm.h       |  1 -
 drivers/gpu/drm/xe/xe_svm_range.c | 37 ++++++++++++++++++++++++++++++-
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index ab3cc2121869..6393251c0051 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -8,6 +8,7 @@
 #include "xe_svm.h"
 #include <linux/hmm.h>
 #include <linux/scatterlist.h>
+#include "xe_pt.h"
 
 DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
 
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 90e665f2bfc6..0038f98c0cc7 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -54,7 +54,6 @@ struct xe_svm {
 struct xe_svm_range {
 	/** @svm: pointer of the xe_svm that this range belongs to */
 	struct xe_svm *svm;
-
 	/** @notifier: The mmu interval notifer used to keep track of CPU
 	 * side address range change. Driver will get a callback with this
 	 * notifier if anything changed from CPU side, such as range is
diff --git a/drivers/gpu/drm/xe/xe_svm_range.c b/drivers/gpu/drm/xe/xe_svm_range.c
index 286d5f7d6ecd..53dd3be7ab9f 100644
--- a/drivers/gpu/drm/xe/xe_svm_range.c
+++ b/drivers/gpu/drm/xe/xe_svm_range.c
@@ -10,6 +10,7 @@
 #include <linux/mutex.h>
 #include <linux/mm.h>
 #include "xe_svm.h"
+#include "xe_pt.h"
 
 /**
  * xe_svm_range_from_addr() - retrieve svm_range contains a virtual address
@@ -59,8 +60,42 @@ bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
 	return (vma1 == vma) && (vma2 == vma);
 }
 
+static bool xe_svm_range_invalidate(struct mmu_interval_notifier *mni,
+				      const struct mmu_notifier_range *range,
+				      unsigned long cur_seq)
+{
+	struct xe_svm_range *svm_range =
+		container_of(mni, struct xe_svm_range, notifier);
+	struct xe_svm *svm = svm_range->svm;
+	unsigned long length = range->end - range->start;
+
+	/*
+	 * MMU_NOTIFY_RELEASE is called upon process exit to notify the driver
+	 * to release any process resources, such as zapping the GPU page
+	 * table mapping or unregistering the mmu notifier etc. We already
+	 * clear the GPU page table and unregister the mmu notifier in
+	 * xe_destroy_svm, upon process exit. So just simply return here.
+	 */
+	if (range->event == MMU_NOTIFY_RELEASE)
+		return true;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&svm->mutex);
+	else if (!mutex_trylock(&svm->mutex))
+		return false;
+
+	mmu_interval_set_seq(mni, cur_seq);
+	xe_invalidate_svm_range(svm->vm, range->start, length);
+	mutex_unlock(&svm->mutex);
+
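+	/* mmu_interval_notifier_remove() can't be called from this callback:
+	 * munmap holds mmap_write_lock, which the removal also needs.
+	 * Defer the unregistration to a worker.
+	 */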
+	if (range->event == MMU_NOTIFY_UNMAP)
+		queue_work(system_unbound_wq, &svm_range->unregister_notifier_work);
+
+	return true;
+}
+
 static const struct mmu_interval_notifier_ops xe_svm_mni_ops = {
-	.invalidate = NULL,
+	.invalidate = xe_svm_range_invalidate,
 };
 
 /**
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 17/23] drm/xe/svm: clean up svm range during process exit
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (15 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 16/23] drm/xe/svm: Implement the mmu notifier range invalidate callback Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 18/23] drm/xe/svm: Move a few structures to xe_gt.h Oak Zeng
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Clean up svm ranges during process exit: zap the GPU page table of
the svm process on process exit; unregister all the mmu interval
notifiers which were registered before; free the svm ranges and the
svm data structure.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c       | 24 ++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h       |  1 +
 drivers/gpu/drm/xe/xe_svm_range.c | 17 +++++++++++++++++
 3 files changed, 42 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 6393251c0051..5772bfcf7da4 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -9,6 +9,8 @@
 #include <linux/hmm.h>
 #include <linux/scatterlist.h>
 #include "xe_pt.h"
+#include "xe_assert.h"
+#include "xe_vm_types.h"
 
 DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
 
@@ -19,9 +21,31 @@ DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
  */
 void xe_destroy_svm(struct xe_svm *svm)
 {
+#define MAX_SVM_RANGE (1024*1024)
+	struct xe_svm_range **range_array;
+	struct interval_tree_node *node;
+	struct xe_svm_range *range;
+	int i = 0;
+
+	range_array = kvcalloc(MAX_SVM_RANGE, sizeof(struct xe_svm_range *),
+							GFP_KERNEL);
+	/* better to leak the ranges on allocation failure than to crash below */
+	if (!range_array)
+		return;
+	node = interval_tree_iter_first(&svm->range_tree, 0, ~0ULL);
+	while (node) {
+		range = container_of(node, struct xe_svm_range, inode);
+		xe_svm_range_prepare_destroy(range);
+		node = interval_tree_iter_next(node, 0, ~0ULL);
+		xe_assert(svm->vm->xe, i < MAX_SVM_RANGE);
+		range_array[i++] = range;
+	}
+
+	/* Freeing a range (thus range->inode) while traversing above is not
+	 * safe, so free them in a second pass.
+	 */
+	while (i--)
+		kfree(range_array[i]);
+
 	hash_del_rcu(&svm->hnode);
 	mutex_destroy(&svm->mutex);
 	kfree(svm);
+	kvfree(range_array);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 0038f98c0cc7..5b3bd2c064f5 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -90,6 +90,7 @@ bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
 								struct vm_area_struct *vma);
 void xe_svm_range_unregister_mmu_notifier(struct xe_svm_range *range);
 int xe_svm_range_register_mmu_notifier(struct xe_svm_range *range);
+void xe_svm_range_prepare_destroy(struct xe_svm_range *range);
 
 int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
 int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
diff --git a/drivers/gpu/drm/xe/xe_svm_range.c b/drivers/gpu/drm/xe/xe_svm_range.c
index 53dd3be7ab9f..dfb4660dc26f 100644
--- a/drivers/gpu/drm/xe/xe_svm_range.c
+++ b/drivers/gpu/drm/xe/xe_svm_range.c
@@ -165,3 +165,20 @@ int xe_svm_range_register_mmu_notifier(struct xe_svm_range *range)
 	range->mmu_notifier_registered = true;
 	return ret;
 }
+
+/**
+ * xe_svm_range_prepare_destroy() - prepare work to destroy a svm range
+ *
+ * @range: the svm range to destroy
+ *
+ * prepare for a svm range destroy: Zap this range from GPU, unregister mmu
+ * notifier.
+ */
+void xe_svm_range_prepare_destroy(struct xe_svm_range *range)
+{
+	struct xe_vm *vm = range->svm->vm;
+	unsigned long length = range->end - range->start;
+
+	xe_invalidate_svm_range(vm, range->start, length);
+	xe_svm_range_unregister_mmu_notifier(range);
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 18/23] drm/xe/svm: Move a few structures to xe_gt.h
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (16 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 17/23] drm/xe/svm: clean up svm range during process exit Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 19/23] drm/xe/svm: migrate svm range to vram Oak Zeng
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Move the access_type enum and pagefault struct to a header file so
they can be shared with the svm sub-system. This is preparation work
for enabling page faults for svm.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.h           | 20 ++++++++++++++++++++
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 21 ---------------------
 2 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index 4486e083f5ef..51dd288cf1cf 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -17,6 +17,26 @@
 			  xe_hw_engine_is_valid((hwe__)))
 
 #define CCS_MASK(gt) (((gt)->info.engine_mask & XE_HW_ENGINE_CCS_MASK) >> XE_HW_ENGINE_CCS0)
+enum access_type {
+	ACCESS_TYPE_READ = 0,
+	ACCESS_TYPE_WRITE = 1,
+	ACCESS_TYPE_ATOMIC = 2,
+	ACCESS_TYPE_RESERVED = 3,
+};
+
+struct pagefault {
+	u64 page_addr;
+	u32 asid;
+	u16 pdata;
+	u8 vfid;
+	u8 access_type;
+	u8 fault_type;
+	u8 fault_level;
+	u8 engine_class;
+	u8 engine_instance;
+	u8 fault_unsuccessful;
+	bool trva_fault;
+};
 
 #ifdef CONFIG_FAULT_INJECTION
 #include <linux/fault-inject.h> /* XXX: fault-inject.h is broken */
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 5c2603075af9..467d68f8332e 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -23,27 +23,6 @@
 #include "xe_trace.h"
 #include "xe_vm.h"
 
-struct pagefault {
-	u64 page_addr;
-	u32 asid;
-	u16 pdata;
-	u8 vfid;
-	u8 access_type;
-	u8 fault_type;
-	u8 fault_level;
-	u8 engine_class;
-	u8 engine_instance;
-	u8 fault_unsuccessful;
-	bool trva_fault;
-};
-
-enum access_type {
-	ACCESS_TYPE_READ = 0,
-	ACCESS_TYPE_WRITE = 1,
-	ACCESS_TYPE_ATOMIC = 2,
-	ACCESS_TYPE_RESERVED = 3,
-};
-
 enum fault_type {
 	NOT_PRESENT = 0,
 	WRITE_ACCESS_VIOLATION = 1,
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 19/23] drm/xe/svm: migrate svm range to vram
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (17 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 18/23] drm/xe/svm: Move a few structures to xe_gt.h Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 20/23] drm/xe/svm: Populate svm range Oak Zeng
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Since the source pages of a svm range can be physically
non-contiguous, and the destination vram pages can also be
non-contiguous, there is no easy way to migrate multiple pages per
blitter command. We do page by page migration for now.

Migration is best effort. Even if we fail to migrate some pages,
we will try to migrate the remaining pages.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c         |   7 ++
 drivers/gpu/drm/xe/xe_svm.h         |   3 +
 drivers/gpu/drm/xe/xe_svm_migrate.c | 114 ++++++++++++++++++++++++++++
 3 files changed, 124 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 5772bfcf7da4..44d4f4216a93 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -5,12 +5,19 @@
 
 #include <linux/mutex.h>
 #include <linux/mm_types.h>
+#include <linux/interval_tree.h>
+#include <linux/container_of.h>
+#include <linux/types.h>
+#include <linux/migrate.h>
 #include "xe_svm.h"
 #include <linux/hmm.h>
 #include <linux/scatterlist.h>
 #include "xe_pt.h"
 #include "xe_assert.h"
 #include "xe_vm_types.h"
+#include "xe_gt.h"
+#include "xe_migrate.h"
+#include "xe_trace.h"
 
 DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
 
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 5b3bd2c064f5..659bcb7927d6 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -80,6 +80,9 @@ struct xe_svm_range {
 };
 
 vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf);
+int svm_migrate_range_to_vram(struct xe_svm_range *range,
+							struct vm_area_struct *vma,
+							struct xe_tile *tile);
 void xe_destroy_svm(struct xe_svm *svm);
 struct xe_svm *xe_create_svm(struct xe_vm *vm);
 struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
index b4df411e04f3..3724ad6c7aea 100644
--- a/drivers/gpu/drm/xe/xe_svm_migrate.c
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -229,3 +229,117 @@ vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
 	kvfree(buf);
 	return 0;
 }
+
+
+/**
+ * svm_migrate_range_to_vram() - migrate backing store of a va range to vram
+ * Must be called with mmap_read_lock(mm) held.
+ * @range: the va range to migrate. Range should only belong to one vma.
+ * @vma: the vma that this range belongs to. @range can cover the whole @vma
+ * or a sub-range of @vma.
+ * @tile: the destination tile which holds the new backing store of the range
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+int svm_migrate_range_to_vram(struct xe_svm_range *range,
+							struct vm_area_struct *vma,
+							struct xe_tile *tile)
+{
+	struct mm_struct *mm = range->svm->mm;
+	unsigned long start = range->start;
+	unsigned long end = range->end;
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	struct xe_mem_region *mr = &tile->mem.vram;
+	struct migrate_vma migrate = {
+		.vma		= vma,
+		.start		= start,
+		.end		= end,
+		.pgmap_owner	= tile->xe->drm.dev,
+		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
+	};
+	struct device *dev = tile->xe->drm.dev;
+	dma_addr_t *src_dma_addr;
+	struct dma_fence *fence;
+	struct page *src_page;
+	LIST_HEAD(blocks);
+	int ret = 0, i;
+	u64 dst_dpa;
+	void *buf;
+
+	mmap_assert_locked(mm);
+	xe_assert(tile->xe, xe_svm_range_belongs_to_vma(mm, range, vma));
+
+	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*src_dma_addr),
+					GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	migrate.src = buf;
+	migrate.dst = migrate.src + npages;
+	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
+	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
+	if (ret)
+		goto kfree_buf;
+
+	ret = migrate_vma_setup(&migrate);
+	if (ret) {
+		drm_err(&tile->xe->drm, "vma setup returned %d for range [%lx - %lx]\n",
+				ret, start, end);
+		goto free_dst_pages;
+	}
+
+	trace_xe_svm_migrate_sram_to_vram(range);
+	/* FIXME: partial migration of a range.
+	 * Print a warning for now. If this message
+	 * is printed, we need to fall back to page by page
+	 * migration: only migrate pages with MIGRATE_PFN_MIGRATE
+	 */
+	if (migrate.cpages != npages)
+		drm_warn(&tile->xe->drm, "Partial migration for range [%lx - %lx], range is %lu pages, migrated only %lu pages\n",
+				start, end, npages, migrate.cpages);
+
+	/* Migrate page by page for now.
+	 * Both source and destination pages can be physically non-contiguous,
+	 * so there is no good way to migrate multiple pages per blitter
+	 * command.
+	 */
+	for (i = 0; i < npages; i++) {
+		src_page = migrate_pfn_to_page(migrate.src[i]);
+		if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))
+			goto free_dst_page;
+
+		xe_assert(tile->xe, !is_zone_device_page(src_page));
+		src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
+			drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
+			goto free_dst_page;
+		}
+		dst_dpa = vram_pfn_to_dpa(mr, migrate.dst[i]);
+		fence = xe_migrate_svm(tile->migrate, src_dma_addr[i], false,
+				dst_dpa, true, PAGE_SIZE);
+		if (IS_ERR(fence)) {
+			drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
+					migrate.src[i]);
+			/* Migration is best effort. Even if we fail here, we continue */
+			goto free_dst_page;
+		}
+		/* FIXME: Use the first migration's out fence as the second
+		 * migration's input fence, and so on. Only wait on the out
+		 * fence of the last migration?
+		 */
+		dma_fence_wait(fence, false);
+		dma_fence_put(fence);
+		/* the dst page now holds the data; don't fall through and free it */
+		continue;
+free_dst_page:
+		/* this dst page won't be used: release it and clear the entry
+		 * so migrate_vma_pages() doesn't try to install it
+		 */
+		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
+		migrate.dst[i] = 0;
+	}
+
+	for (i = 0; i < npages; i++)
+		if (src_dma_addr[i] && !dma_mapping_error(dev, src_dma_addr[i]))
+			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
+
+	migrate_vma_pages(&migrate);
+	migrate_vma_finalize(&migrate);
+free_dst_pages:
+	if (ret)
+		xe_devm_free_blocks(&blocks);
+kfree_buf:
+	kvfree(buf);
+	return ret;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 20/23] drm/xe/svm: Populate svm range
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (18 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 19/23] drm/xe/svm: migrate svm range to vram Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 21/23] drm/xe/svm: GPU page fault support Oak Zeng
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Add a helper function svm_populate_range to populate
an svm range. This function calls hmm_range_fault
to read the CPU page tables and populate all pfns of this
virtual address range into an array, saved in hmm_range::
hmm_pfns. This is preparation work for binding an svm range
to the GPU. The hmm_pfns array will be used for the GPU binding.
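
For illustration only, a consumer of the populated array could look
roughly like the following sketch (not part of this patch; npages and
the sg table construction step are assumptions):

	#include <linux/hmm.h>

	/* Illustrative only: walk the populated hmm_pfns and turn each
	 * valid entry into a struct page, e.g. when building the sg
	 * table used at GPU bind time.
	 */
	static void walk_populated_pfns(struct hmm_range *hmm_range, u64 npages)
	{
		struct page *page;
		u64 i;

		for (i = 0; i < npages; i++) {
			if (!(hmm_range->hmm_pfns[i] & HMM_PFN_VALID))
				continue;
			page = hmm_pfn_to_page(hmm_range->hmm_pfns[i]);
			/* ... feed page into sg table construction ... */
		}
	}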

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 61 +++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 44d4f4216a93..0c13690a19f5 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -145,3 +145,64 @@ int xe_svm_build_sg(struct hmm_range *range,
 	sg_mark_end(sg);
 	return 0;
 }
+
+/** svm_populate_range() - Populate physical pages of a virtual address range
+ * This function also reads the mmu notifier sequence # (
+ * mmu_interval_read_begin), for the purpose of a later
+ * comparison (through mmu_interval_read_retry).
+ * This must be called with the mmap read or write lock held.
+ *
+ * This function allocates hmm_range->hmm_pfns; it is the caller's
+ * responsibility to free it.
+ *
+ * @svm_range: The svm range to populate
+ * @hmm_range: pointer to hmm_range struct. hmm_range->hmm_pfns
+ * will hold the populated pfns.
+ * @write: populate pages with write permission
+ *
+ * Return: 0 for success; negative errno on failure
+ */
+static int svm_populate_range(struct xe_svm_range *svm_range,
+			    struct hmm_range *hmm_range, bool write)
+{
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
+	u64 npages;
+	int ret;
+
+	mmap_assert_locked(svm_range->svm->mm);
+
+	npages = ((svm_range->end - 1) >> PAGE_SHIFT) -
+						(svm_range->start >> PAGE_SHIFT) + 1;
+	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+	if (unlikely(!pfns))
+		return -ENOMEM;
+
+	if (write)
+		flags |= HMM_PFN_REQ_WRITE;
+
+	memset64((u64 *)pfns, (u64)flags, npages);
+	hmm_range->hmm_pfns = pfns;
+	hmm_range->notifier_seq = mmu_interval_read_begin(&svm_range->notifier);
+	hmm_range->notifier = &svm_range->notifier;
+	hmm_range->start = svm_range->start;
+	hmm_range->end = svm_range->end;
+	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;
+	hmm_range->dev_private_owner = svm_range->svm->vm->xe->drm.dev;
+
+	while (true) {
+		ret = hmm_range_fault(hmm_range);
+		if (ret == -EBUSY) {
+			if (time_after(jiffies, timeout))
+				goto free_pfns;
+			continue;
+		}
+		break;
+	}
+
+free_pfns:
+	if (ret)
+		kvfree(pfns);
+	return ret;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 21/23] drm/xe/svm: GPU page fault support
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (19 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 20/23] drm/xe/svm: Populate svm range Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-23  2:06   ` Welty, Brian
  2024-01-17 22:12 ` [PATCH 22/23] drm/xe/svm: Add DRM_XE_SVM kernel config entry Oak Zeng
  2024-01-17 22:12 ` [PATCH 23/23] drm/xe/svm: Add svm memory hints interface Oak Zeng
  22 siblings, 1 reply; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

On a gpu page fault of a virtual address, try to fault in the virtual
address range to the gpu page table and let the HW retry on the
faulting address.

Right now, we always migrate the whole vma which contains the fault
address to the GPU. This is subject to change once a more sophisticated
migration policy is introduced: deciding whether to migrate memory to
the GPU or map it in place with CPU memory, and at what granularity.

There is a rather complicated locking strategy in this patch. See more
details in xe_svm_doc.h, lock design section.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |   7 ++
 drivers/gpu/drm/xe/xe_svm.c          | 116 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h          |   6 ++
 drivers/gpu/drm/xe/xe_svm_range.c    |  43 ++++++++++
 4 files changed, 172 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 467d68f8332e..462603abab8a 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -22,6 +22,7 @@
 #include "xe_pt.h"
 #include "xe_trace.h"
 #include "xe_vm.h"
+#include "xe_svm.h"
 
 enum fault_type {
 	NOT_PRESENT = 0,
@@ -131,6 +132,11 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 	if (!vm || !xe_vm_in_fault_mode(vm))
 		return -EINVAL;
 
+	if (vm->svm) {
+		ret = xe_svm_handle_gpu_fault(vm, gt, pf);
+		goto put_vm;
+	}
+
 retry_userptr:
 	/*
 	 * TODO: Avoid exclusive lock if VM doesn't have userptrs, or
@@ -219,6 +225,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		if (ret >= 0)
 			ret = 0;
 	}
+put_vm:
 	xe_vm_put(vm);
 
 	return ret;
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 0c13690a19f5..1ade8d7f0ab2 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -12,6 +12,7 @@
 #include "xe_svm.h"
 #include <linux/hmm.h>
 #include <linux/scatterlist.h>
+#include <drm/xe_drm.h>
 #include "xe_pt.h"
 #include "xe_assert.h"
 #include "xe_vm_types.h"
@@ -206,3 +207,118 @@ static int svm_populate_range(struct xe_svm_range *svm_range,
 		kvfree(pfns);
 	return ret;
 }
+
+/**
+ * svm_access_allowed() - Determine whether read and/or write access to a vma is allowed
+ *
+ * @write: true means read and write access; false means read-only access
+ */
+static bool svm_access_allowed(struct vm_area_struct *vma, bool write)
+{
+	unsigned long access = VM_READ;
+
+	if (write)
+		access |= VM_WRITE;
+
+	return (vma->vm_flags & access) == access;
+}
+
+/**
+ * svm_should_migrate() - Determine whether we should migrate a range to
+ * a destination memory region
+ *
+ * @range: The svm memory range to consider
+ * @dst_region: target destination memory region
+ * @is_atomic_fault: Is the intended migration triggered by an atomic access?
+ * On some platforms, we have to migrate memory to guarantee atomic correctness.
+ */
+static bool svm_should_migrate(struct xe_svm_range *range,
+				struct xe_mem_region *dst_region, bool is_atomic_fault)
+{
+	return true;
+}
+
+/**
+ * xe_svm_handle_gpu_fault() - gpu page fault handler for svm subsystem
+ *
+ * @vm: The vm of the fault.
+ * @gt: The gt hardware on which the fault happens.
+ * @pf: page fault descriptor
+ *
+ * Work out backing memory for the fault address, migrate memory from
+ * system memory to gpu vram if necessary, and map the fault address to
+ * the GPU so the GPU HW can retry the last operation which caused the
+ * GPU page fault.
+ */
+int xe_svm_handle_gpu_fault(struct xe_vm *vm,
+				struct xe_gt *gt,
+				struct pagefault *pf)
+{
+	u8 access_type = pf->access_type;
+	u64 page_addr = pf->page_addr;
+	struct hmm_range hmm_range;
+	struct vm_area_struct *vma;
+	struct xe_svm_range *range;
+	struct mm_struct *mm;
+	struct xe_svm *svm;
+	int ret = 0;
+
+	svm = vm->svm;
+	if (!svm)
+		return -EINVAL;
+
+	mm = svm->mm;
+	mmap_read_lock(mm);
+	vma = find_vma_intersection(mm, page_addr, page_addr + 4);
+	if (!vma) {
+		mmap_read_unlock(mm);
+		return -ENOENT;
+	}
+
+	if (!svm_access_allowed(vma, access_type != ACCESS_TYPE_READ)) {
+		mmap_read_unlock(mm);
+		return -EPERM;
+	}
+
+	range = xe_svm_range_from_addr(svm, page_addr);
+	if (!range) {
+		range = xe_svm_range_create(svm, vma);
+		if (!range) {
+			mmap_read_unlock(mm);
+			return -ENOMEM;
+		}
+	}
+
+	if (svm_should_migrate(range, &gt->tile->mem.vram,
+						access_type == ACCESS_TYPE_ATOMIC))
+		/* Migrate the whole svm range for now. This is subject to
+		 * change once we introduce a migration granularity
+		 * parameter for the user to select.
+		 *
+		 * Migration is best effort. If we fail to migrate to vram,
+		 * we just map that range to the gpu in system memory. For
+		 * cases such as gpu atomic operations which require memory
+		 * to be resident in vram, we will fault again and retry.
+		 */
+		svm_migrate_range_to_vram(range, vma, gt->tile);
+
+	ret = svm_populate_range(range, &hmm_range, vma->vm_flags & VM_WRITE);
+	mmap_read_unlock(mm);
+	/* There is no need to destroy this range; it can be reused later.
+	 * On failure, svm_populate_range() has already freed the pfns
+	 * array, so return directly to avoid a double free.
+	 */
+	if (ret)
+		return ret;
+
+	/* FIXME: set the DM, AE flags in the PTE */
+	ret = xe_bind_svm_range(vm, gt->tile, &hmm_range,
+		!(vma->vm_flags & VM_WRITE) ? DRM_XE_VM_BIND_FLAG_READONLY : 0);
+	/* A concurrent cpu page table update happened.
+	 * Return success so we will retry everything
+	 * on the next gpu page fault.
+	 */
+	if (ret == -EAGAIN)
+		ret = 0;
+
+	kvfree(hmm_range.hmm_pfns);
+	return ret;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 659bcb7927d6..a8ff4957a9b8 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -20,6 +20,7 @@
 
 struct xe_vm;
 struct mm_struct;
+struct pagefault;
 
 #define XE_MAX_SVM_PROCESS 5 /* Hash table bits: support up to 32 SVM processes */
 extern DECLARE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
@@ -94,6 +95,8 @@ bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
 void xe_svm_range_unregister_mmu_notifier(struct xe_svm_range *range);
 int xe_svm_range_register_mmu_notifier(struct xe_svm_range *range);
 void xe_svm_range_prepare_destroy(struct xe_svm_range *range);
+struct xe_svm_range *xe_svm_range_create(struct xe_svm *svm,
+									struct vm_area_struct *vma);
 
 int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
 int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
@@ -106,4 +109,7 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 
 void xe_devm_free_blocks(struct list_head *blocks);
 void xe_devm_page_free(struct page *page);
+int xe_svm_handle_gpu_fault(struct xe_vm *vm,
+				struct xe_gt *gt,
+				struct pagefault *pf);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_range.c b/drivers/gpu/drm/xe/xe_svm_range.c
index dfb4660dc26f..05c088dddc2d 100644
--- a/drivers/gpu/drm/xe/xe_svm_range.c
+++ b/drivers/gpu/drm/xe/xe_svm_range.c
@@ -182,3 +182,46 @@ void xe_svm_range_prepare_destroy(struct xe_svm_range *range)
 	xe_invalidate_svm_range(vm, range->start, length);
 	xe_svm_range_unregister_mmu_notifier(range);
 }
+
+static void add_range_to_svm(struct xe_svm_range *range)
+{
+	range->inode.start = range->start;
+	range->inode.last = range->end - 1;	/* interval tree 'last' is inclusive */
+	mutex_lock(&range->svm->mutex);
+	interval_tree_insert(&range->inode, &range->svm->range_tree);
+	mutex_unlock(&range->svm->mutex);
+}
+
+/**
+ * xe_svm_range_create() - create and initialize an svm range
+ *
+ * @svm: the svm that the range belongs to
+ * @vma: the corresponding vma of the range
+ *
+ * Create the range and add it to the svm's interval tree. Register
+ * an mmu interval notifier for this range.
+ *
+ * Return: pointer to the created svm range,
+ * or NULL on failure
+ */
+struct xe_svm_range *xe_svm_range_create(struct xe_svm *svm,
+									struct vm_area_struct *vma)
+{
+	struct xe_svm_range *range = kzalloc(sizeof(*range), GFP_KERNEL);
+
+	if (!range)
+		return NULL;
+
+	range->start = vma->vm_start;
+	range->end = vma->vm_end;
+	range->vma = vma;
+	range->svm = svm;
+
+	if (xe_svm_range_register_mmu_notifier(range)) {
+		kfree(range);
+		return NULL;
+	}
+
+	add_range_to_svm(range);
+	return range;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 22/23] drm/xe/svm: Add DRM_XE_SVM kernel config entry
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (20 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 21/23] drm/xe/svm: GPU page fault support Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  2024-01-17 22:12 ` [PATCH 23/23] drm/xe/svm: Add svm memory hints interface Oak Zeng
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Add a DRM_XE_SVM kernel config entry so the
xe svm feature can be enabled or disabled at
kernel build time.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Kconfig   | 22 ++++++++++++++++++++++
 drivers/gpu/drm/xe/Makefile  |  5 +++++
 drivers/gpu/drm/xe/xe_mmio.c |  5 +++++
 drivers/gpu/drm/xe/xe_vm.c   |  2 ++
 4 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
index 1b57ae38210d..6f498095a915 100644
--- a/drivers/gpu/drm/xe/Kconfig
+++ b/drivers/gpu/drm/xe/Kconfig
@@ -83,6 +83,28 @@ config DRM_XE_FORCE_PROBE
 
 	  Use "!*" to block the probe of the driver for all known devices.
 
+config DRM_XE_SVM
+    bool "Enable Shared Virtual Memory support in xe"
+    depends on DRM_XE
+    depends on ARCH_ENABLE_MEMORY_HOTPLUG
+    depends on ARCH_ENABLE_MEMORY_HOTREMOVE
+    depends on MEMORY_HOTPLUG
+    depends on MEMORY_HOTREMOVE
+    depends on ARCH_HAS_PTE_DEVMAP
+    depends on SPARSEMEM_VMEMMAP
+    depends on ZONE_DEVICE
+    depends on DEVICE_PRIVATE
+    depends on MMU
+    select HMM_MIRROR
+    select MMU_NOTIFIER
+    default y
+    help
+      Choose this option if you want Shared Virtual Memory (SVM)
+      support in xe. With SVM, the virtual address space is shared
+      between CPU and GPU. This means any virtual address, such as
+      one returned by malloc or mmap, a variable on the stack, or a
+      global memory pointer, can be used by the GPU transparently.
+
 menu "drm/Xe Debugging"
 depends on DRM_XE
 depends on EXPERT
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index e16b84f79ddf..ae503f7c1f94 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -283,6 +283,11 @@ xe-$(CONFIG_DRM_XE_DISPLAY) += \
 	i915-display/skl_universal_plane.o \
 	i915-display/skl_watermark.o
 
+xe-$(CONFIG_DRM_XE_SVM) += xe_svm.o \
+						   xe_svm_devmem.o \
+						   xe_svm_range.o \
+						   xe_svm_migrate.o
+
 ifeq ($(CONFIG_ACPI),y)
 	xe-$(CONFIG_DRM_XE_DISPLAY) += \
 		i915-display/intel_acpi.o \
diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
index 3d34dcfa3b3a..99810794bd94 100644
--- a/drivers/gpu/drm/xe/xe_mmio.c
+++ b/drivers/gpu/drm/xe/xe_mmio.c
@@ -286,7 +286,9 @@ int xe_mmio_probe_vram(struct xe_device *xe)
 		}
 
 		io_size -= min_t(u64, tile_size, io_size);
+#if IS_ENABLED(CONFIG_DRM_XE_SVM)
 		xe_svm_devm_add(tile, &tile->mem.vram);
+#endif
 	}
 
 	xe->mem.vram.actual_physical_size = total_size;
@@ -361,8 +363,11 @@ static void mmio_fini(struct drm_device *drm, void *arg)
 	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
 	if (xe->mem.vram.mapping)
 		iounmap(xe->mem.vram.mapping);
+
+#if IS_ENABLED(CONFIG_DRM_XE_SVM)
 	for_each_tile(tile, xe, id) {
 		xe_svm_devm_remove(xe, &tile->mem.vram);
 	}
+#endif
 }
 
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 712fe49d8fb2..3bf19c92e01f 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1377,7 +1377,9 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		xe->usm.num_vm_in_non_fault_mode++;
 	mutex_unlock(&xe->usm.lock);
 
+#if IS_ENABLED(CONFIG_DRM_XE_SVM)
 	vm->svm = xe_create_svm(vm);
+#endif
 	trace_xe_vm_create(vm);
 
 	return vm;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 23/23] drm/xe/svm: Add svm memory hints interface
  2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
                   ` (21 preceding siblings ...)
  2024-01-17 22:12 ` [PATCH 22/23] drm/xe/svm: Add DRM_XE_SVM kernel config entry Oak Zeng
@ 2024-01-17 22:12 ` Oak Zeng
  22 siblings, 0 replies; 123+ messages in thread
From: Oak Zeng @ 2024-01-17 22:12 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 include/uapi/drm/xe_drm.h | 40 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 50bbea0992d9..551ed8706097 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -80,6 +80,7 @@ extern "C" {
  *  - &DRM_IOCTL_XE_EXEC_QUEUE_GET_PROPERTY
  *  - &DRM_IOCTL_XE_EXEC
  *  - &DRM_IOCTL_XE_WAIT_USER_FENCE
+ *  - &DRM_IOCTL_XE_SVM
  */
 
 /*
@@ -100,6 +101,7 @@ extern "C" {
 #define DRM_XE_EXEC_QUEUE_GET_PROPERTY	0x08
 #define DRM_XE_EXEC			0x09
 #define DRM_XE_WAIT_USER_FENCE		0x0a
+#define DRM_XE_SVM			0x0b
 /* Must be kept compact -- no holes */
 
 #define DRM_IOCTL_XE_DEVICE_QUERY		DRM_IOWR(DRM_COMMAND_BASE + DRM_XE_DEVICE_QUERY, struct drm_xe_device_query)
@@ -113,6 +115,7 @@ extern "C" {
 #define DRM_IOCTL_XE_EXEC_QUEUE_GET_PROPERTY	DRM_IOWR(DRM_COMMAND_BASE + DRM_XE_EXEC_QUEUE_GET_PROPERTY, struct drm_xe_exec_queue_get_property)
 #define DRM_IOCTL_XE_EXEC			DRM_IOW(DRM_COMMAND_BASE + DRM_XE_EXEC, struct drm_xe_exec)
 #define DRM_IOCTL_XE_WAIT_USER_FENCE		DRM_IOWR(DRM_COMMAND_BASE + DRM_XE_WAIT_USER_FENCE, struct drm_xe_wait_user_fence)
+#define DRM_IOCTL_XE_SVM			DRM_IOWR(DRM_COMMAND_BASE + DRM_XE_SVM, struct drm_xe_svm_args)
 
 /**
  * DOC: Xe IOCTL Extensions
@@ -1339,6 +1342,43 @@ struct drm_xe_wait_user_fence {
 	__u64 reserved[2];
 };
 
+enum drm_xe_svm_attr_type {
+	DRM_XE_SVM_ATTR_PREFERRED_LOC,
+	DRM_XE_SVM_ATTR_MIGRATION_GRANULARITY,
+	DRM_XE_SVM_ATTR_ATOMIC,
+	DRM_XE_SVM_ATTR_CACHE,
+	DRM_XE_SVM_ATTR_PREFETCH_LOC,
+	DRM_XE_SVM_ATTR_ACCESS_PATTERN,
+};
+
+struct drm_xe_svm_attr {
+	__u32 type;
+	__u32 value;
+};
+
+enum drm_xe_svm_op {
+	DRM_XE_SVM_OP_SET_ATTR,
+	DRM_XE_SVM_OP_GET_ATTR,
+};
+
+/**
+ * struct drm_xe_svm_args - Input of &DRM_IOCTL_XE_SVM
+ *
+ * Set or get memory attributes to a virtual address range
+ */
+struct drm_xe_svm_args {
+	/** @start: start of the virtual address range */
+	__u64 start;
+	/** @size: size of the virtual address range */
+	__u64 size;
+	/** @op: operation, either set or get */
+	__u32 op;
+	/** @nattr: number of attributes */
+	__u32 nattr;
+	/** @attrs: An array of attributes */
+	struct drm_xe_svm_attr attrs[];
+};
+
 #if defined(__cplusplus)
 }
 #endif
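
For illustration only, userspace usage of this ioctl could look
roughly like the following sketch (not part of this patch; the drm fd
handling and the encoding of the attribute value are assumptions):

	#include <stdint.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <drm/xe_drm.h>

	/* Illustrative only: set a preferred-location hint on a
	 * malloc'ed range via DRM_IOCTL_XE_SVM.
	 */
	static int xe_set_preferred_loc(int fd, void *ptr, size_t size,
					uint32_t loc)
	{
		size_t sz = sizeof(struct drm_xe_svm_args) +
			    sizeof(struct drm_xe_svm_attr);
		struct drm_xe_svm_args *args = calloc(1, sz);
		int ret;

		if (!args)
			return -1;
		args->start = (uint64_t)(uintptr_t)ptr;
		args->size = size;
		args->op = DRM_XE_SVM_OP_SET_ATTR;
		args->nattr = 1;
		args->attrs[0].type = DRM_XE_SVM_ATTR_PREFERRED_LOC;
		args->attrs[0].value = loc;
		ret = ioctl(fd, DRM_IOCTL_XE_SVM, args);
		free(args);
		return ret;
	}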
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* Re: [PATCH 21/23] drm/xe/svm: GPU page fault support
  2024-01-17 22:12 ` [PATCH 21/23] drm/xe/svm: GPU page fault support Oak Zeng
@ 2024-01-23  2:06   ` Welty, Brian
  2024-01-23  3:09     ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Welty, Brian @ 2024-01-23  2:06 UTC (permalink / raw)
  To: Oak Zeng, dri-devel, intel-xe
  Cc: matthew.brost, krishnaiah.bommu, niranjana.vishwanathapura,
	himal.prasad.ghimiray, Thomas.Hellstrom


On 1/17/2024 2:12 PM, Oak Zeng wrote:
> On gpu page fault of a virtual address, try to fault in the virtual
> address range to gpu page table and let HW to retry on the faulty
> address.
> 
> Right now, we always migrate the whole vma which contains the fault
> address to GPU. This is subject to change of a more sophisticated
> migration policy: decide whether to migrate memory to GPU or map
> in place with CPU memory; migration granularity.
> 
> There is rather complicated locking strategy in this patch. See more
> details in xe_svm_doc.h, lock design section.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt_pagefault.c |   7 ++
>   drivers/gpu/drm/xe/xe_svm.c          | 116 +++++++++++++++++++++++++++
>   drivers/gpu/drm/xe/xe_svm.h          |   6 ++
>   drivers/gpu/drm/xe/xe_svm_range.c    |  43 ++++++++++
>   4 files changed, 172 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> index 467d68f8332e..462603abab8a 100644
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> @@ -22,6 +22,7 @@
>   #include "xe_pt.h"
>   #include "xe_trace.h"
>   #include "xe_vm.h"
> +#include "xe_svm.h"
>   
>   enum fault_type {
>   	NOT_PRESENT = 0,
> @@ -131,6 +132,11 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>   	if (!vm || !xe_vm_in_fault_mode(vm))
>   		return -EINVAL;
>   
> +	if (vm->svm) {
> +		ret = xe_svm_handle_gpu_fault(vm, gt, pf);
> +		goto put_vm;
> +	}
> +
>   retry_userptr:
>   	/*
>   	 * TODO: Avoid exclusive lock if VM doesn't have userptrs, or
> @@ -219,6 +225,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>   		if (ret >= 0)
>   			ret = 0;
>   	}
> +put_vm:
>   	xe_vm_put(vm);
>   
>   	return ret;
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 0c13690a19f5..1ade8d7f0ab2 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -12,6 +12,7 @@
>   #include "xe_svm.h"
>   #include <linux/hmm.h>
>   #include <linux/scatterlist.h>
> +#include <drm/xe_drm.h>
>   #include "xe_pt.h"
>   #include "xe_assert.h"
>   #include "xe_vm_types.h"
> @@ -206,3 +207,118 @@ static int svm_populate_range(struct xe_svm_range *svm_range,
>   		kvfree(pfns);
>   	return ret;
>   }
> +
> +/**
> + * svm_access_allowed() -  Determine whether read or/and write to vma is allowed
> + *
> + * @write: true means a read and write access; false: read only access
> + */
> +static bool svm_access_allowed(struct vm_area_struct *vma, bool write)
> +{
> +	unsigned long access = VM_READ;
> +
> +	if (write)
> +		access |= VM_WRITE;
> +
> +	return (vma->vm_flags & access) == access;
> +}
> +
> +/**
> + * svm_should_migrate() - Determine whether we should migrate a range to
> + * a destination memory region
> + *
> + * @range: The svm memory range to consider
> + * @dst_region: target destination memory region
> + * @is_atomic_fault: Is the intended migration triggered by a atomic access?
> + * On some platform, we have to migrate memory to guarantee atomic correctness.
> + */
> +static bool svm_should_migrate(struct xe_svm_range *range,
> +				struct xe_mem_region *dst_region, bool is_atomic_fault)
> +{
> +	return true;
> +}
> +
> +/**
> + * xe_svm_handle_gpu_fault() - gpu page fault handler for svm subsystem
> + *
> + * @vm: The vm of the fault.
> + * @gt: The gt hardware on which the fault happens.
> + * @pf: page fault descriptor
> + *
> + * Workout a backing memory for the fault address, migrate memory from
> + * system memory to gpu vram if nessary, and map the fault address to
> + * GPU so GPU HW can retry the last operation which has caused the GPU
> + * page fault.
> + */
> +int xe_svm_handle_gpu_fault(struct xe_vm *vm,
> +				struct xe_gt *gt,
> +				struct pagefault *pf)
> +{
> +	u8 access_type = pf->access_type;
> +	u64 page_addr = pf->page_addr;
> +	struct hmm_range hmm_range;
> +	struct vm_area_struct *vma;
> +	struct xe_svm_range *range;
> +	struct mm_struct *mm;
> +	struct xe_svm *svm;
> +	int ret = 0;
> +
> +	svm = vm->svm;
> +	if (!svm)
> +		return -EINVAL;
> +
> +	mm = svm->mm;
> +	mmap_read_lock(mm);
> +	vma = find_vma_intersection(mm, page_addr, page_addr + 4);
> +	if (!vma) {
> +		mmap_read_unlock(mm);
> +		return -ENOENT;
> +	}
> +
> +	if (!svm_access_allowed (vma, access_type != ACCESS_TYPE_READ)) {
> +		mmap_read_unlock(mm);
> +		return -EPERM;
> +	}
> +
> +	range = xe_svm_range_from_addr(svm, page_addr);
> +	if (!range) {
> +		range = xe_svm_range_create(svm, vma);
> +		if (!range) {
> +			mmap_read_unlock(mm);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	if (svm_should_migrate(range, &gt->tile->mem.vram,
> +						access_type == ACCESS_TYPE_ATOMIC))
> +		/** Migrate whole svm range for now.
> +		 *  This is subject to change once we introduce a migration granularity
> +		 *  parameter for user to select.
> +		 *
> +		 *	Migration is best effort. If we failed to migrate to vram,
> +		 *	we just map that range to gpu in system memory. For cases
> +		 *	such as gpu atomic operation which requires memory to be
> +		 *	resident in vram, we will fault again and retry migration.
> +		 */
> +		svm_migrate_range_to_vram(range, vma, gt->tile);
> +
> +	ret = svm_populate_range(range, &hmm_range, vma->vm_flags & VM_WRITE);
> +	mmap_read_unlock(mm);
> +	/** There is no need to destroy this range. Range can be reused later */
> +	if (ret)
> +		goto free_pfns;
> +
> +	/**FIXME: set the DM, AE flags in PTE*/
> +	ret = xe_bind_svm_range(vm, gt->tile, &hmm_range,
> +		!(vma->vm_flags & VM_WRITE) ? DRM_XE_VM_BIND_FLAG_READONLY : 0);
> +	/** Concurrent cpu page table update happened,
> +	 *  Return successfully so we will retry everything
> +	 *  on next gpu page fault.
> +	 */
> +	if (ret == -EAGAIN)
> +		ret = 0;
> +
> +free_pfns:
> +	kvfree(hmm_range.hmm_pfns);
> +	return ret;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 659bcb7927d6..a8ff4957a9b8 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -20,6 +20,7 @@
>   
>   struct xe_vm;
>   struct mm_struct;
> +struct pagefault;
>   
>   #define XE_MAX_SVM_PROCESS 5 /* Maximumly support 32 SVM process*/
>   extern DECLARE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
> @@ -94,6 +95,8 @@ bool xe_svm_range_belongs_to_vma(struct mm_struct *mm,
>   void xe_svm_range_unregister_mmu_notifier(struct xe_svm_range *range);
>   int xe_svm_range_register_mmu_notifier(struct xe_svm_range *range);
>   void xe_svm_range_prepare_destroy(struct xe_svm_range *range);
> +struct xe_svm_range *xe_svm_range_create(struct xe_svm *svm,
> +									struct vm_area_struct *vma);
>   
>   int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
>   int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> @@ -106,4 +109,7 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
>   
>   void xe_devm_free_blocks(struct list_head *blocks);
>   void xe_devm_page_free(struct page *page);
> +int xe_svm_handle_gpu_fault(struct xe_vm *vm,
> +				struct xe_gt *gt,
> +				struct pagefault *pf);
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_range.c b/drivers/gpu/drm/xe/xe_svm_range.c
> index dfb4660dc26f..05c088dddc2d 100644
> --- a/drivers/gpu/drm/xe/xe_svm_range.c
> +++ b/drivers/gpu/drm/xe/xe_svm_range.c
> @@ -182,3 +182,46 @@ void xe_svm_range_prepare_destroy(struct xe_svm_range *range)
>   	xe_invalidate_svm_range(vm, range->start, length);
>   	xe_svm_range_unregister_mmu_notifier(range);
>   }
> +
> +static void add_range_to_svm(struct xe_svm_range *range)
> +{
> +	range->inode.start = range->start;
> +	range->inode.last = range->end;
> +	mutex_lock(&range->svm->mutex);
> +	interval_tree_insert(&range->inode, &range->svm->range_tree);
> +	mutex_unlock(&range->svm->mutex);
> +}

I have following question / concern.

I believe we are planning for what we call 'shared allocations' to use 
svm.  But what we call device-only allocations will continue to use
GEM_CREATE and those are in the BO-centric world.

But you need to still have the application with one single managed 
address space, yes?  In other words, how will theses co-exist?
It seems you will have collisions.

For example as hmm_range_fault brings a range from host into GPU address 
space,  what if it was already allocated and in use by VM_BIND for
a GEM_CREATE allocated buffer?    That is of course an application error, 
but KMD needs to detect it, and provide one single managed address
space across all allocations from the application....

Continuing on this theme.  Instead of this interval tree, did you 
consider just using drm_gpuvm as the address space manager?
It probably needs some overhaul, and not to assume it is managing only
BO backed allocations, but could work....
And it has all the split/merge support already there, which you will 
need for adding hints later?

Wanted to hear your thoughts.

-Brian



> +
> +/**
> + * xe_svm_range_create() - create and initialize a svm range
> + *
> + * @svm: the svm that the range belongs to
> + * @vma: the corresponding vma of the range
> + *
> + * Create range, add it to svm's interval tree. Regiter a mmu
> + * interval notifier for this range.
> + *
> + * Return the pointer of the created svm range
> + * or NULL if fail
> + */
> +struct xe_svm_range *xe_svm_range_create(struct xe_svm *svm,
> +									struct vm_area_struct *vma)
> +{
> +	struct xe_svm_range *range = kzalloc(sizeof(*range), GFP_KERNEL);
> +
> +	if (!range)
> +		return NULL;
> +
> +	range->start = vma->vm_start;
> +	range->end = vma->vm_end;
> +	range->vma = vma;
> +	range->svm = svm;
> +
> +	if (xe_svm_range_register_mmu_notifier(range)){
> +		kfree(range);
> +		return NULL;
> +	}
> +
> +	add_range_to_svm(range);
> +	return range;
> +}

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH 21/23] drm/xe/svm: GPU page fault support
  2024-01-23  2:06   ` Welty, Brian
@ 2024-01-23  3:09     ` Zeng, Oak
  2024-01-23  3:21       ` Making drm_gpuvm work across gpu devices Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-23  3:09 UTC (permalink / raw)
  To: Welty, Brian, dri-devel, intel-xe
  Cc: Brost, Matthew, Bommu, Krishnaiah, Vishwanathapura, Niranjana,
	Ghimiray, Himal Prasad, Thomas.Hellstrom



> -----Original Message-----
> From: Welty, Brian <brian.welty@intel.com>
> Sent: Monday, January 22, 2024 9:06 PM
> To: Zeng, Oak <oak.zeng@intel.com>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org
> Cc: Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost,
> Matthew <matthew.brost@intel.com>
> Subject: Re: [PATCH 21/23] drm/xe/svm: GPU page fault support
> 
> 
> On 1/17/2024 2:12 PM, Oak Zeng wrote:
> > On gpu page fault of a virtual address, try to fault in the virtual
> > address range to gpu page table and let HW to retry on the faulty
> > address.
> >
> > Right now, we always migrate the whole vma which contains the fault
> > address to GPU. This is subject to change of a more sophisticated
> > migration policy: decide whether to migrate memory to GPU or map
> > in place with CPU memory; migration granularity.
> >
> > There is rather complicated locking strategy in this patch. See more
> > details in xe_svm_doc.h, lock design section.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_gt_pagefault.c |   7 ++
> >   drivers/gpu/drm/xe/xe_svm.c          | 116 +++++++++++++++++++++++++++
> >   drivers/gpu/drm/xe/xe_svm.h          |   6 ++
> >   drivers/gpu/drm/xe/xe_svm_range.c    |  43 ++++++++++
> >   4 files changed, 172 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > index 467d68f8332e..462603abab8a 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > @@ -22,6 +22,7 @@
> >   #include "xe_pt.h"
> >   #include "xe_trace.h"
> >   #include "xe_vm.h"
> > +#include "xe_svm.h"
> >
> >   enum fault_type {
> >   	NOT_PRESENT = 0,
> > @@ -131,6 +132,11 @@ static int handle_pagefault(struct xe_gt *gt, struct
> pagefault *pf)
> >   	if (!vm || !xe_vm_in_fault_mode(vm))
> >   		return -EINVAL;
> >
> > +	if (vm->svm) {
> > +		ret = xe_svm_handle_gpu_fault(vm, gt, pf);
> > +		goto put_vm;
> > +	}
> > +
> >   retry_userptr:
> >   	/*
> >   	 * TODO: Avoid exclusive lock if VM doesn't have userptrs, or
> > @@ -219,6 +225,7 @@ static int handle_pagefault(struct xe_gt *gt, struct
> pagefault *pf)
> >   		if (ret >= 0)
> >   			ret = 0;
> >   	}
> > +put_vm:
> >   	xe_vm_put(vm);
> >
> >   	return ret;
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> > index 0c13690a19f5..1ade8d7f0ab2 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -12,6 +12,7 @@
> >   #include "xe_svm.h"
> >   #include <linux/hmm.h>
> >   #include <linux/scatterlist.h>
> > +#include <drm/xe_drm.h>
> >   #include "xe_pt.h"
> >   #include "xe_assert.h"
> >   #include "xe_vm_types.h"
> > @@ -206,3 +207,118 @@ static int svm_populate_range(struct xe_svm_range
> *svm_range,
> >   		kvfree(pfns);
> >   	return ret;
> >   }
> > +
> > +/**
> > + * svm_access_allowed() -  Determine whether read or/and write to vma is
> allowed
> > + *
> > + * @write: true means a read and write access; false: read only access
> > + */
> > +static bool svm_access_allowed(struct vm_area_struct *vma, bool write)
> > +{
> > +	unsigned long access = VM_READ;
> > +
> > +	if (write)
> > +		access |= VM_WRITE;
> > +
> > +	return (vma->vm_flags & access) == access;
> > +}
> > +
> > +/**
> > + * svm_should_migrate() - Determine whether we should migrate a range to
> > + * a destination memory region
> > + *
> > + * @range: The svm memory range to consider
> > + * @dst_region: target destination memory region
> > + * @is_atomic_fault: Is the intended migration triggered by a atomic access?
> > + * On some platform, we have to migrate memory to guarantee atomic
> correctness.
> > + */
> > +static bool svm_should_migrate(struct xe_svm_range *range,
> > +				struct xe_mem_region *dst_region, bool
> is_atomic_fault)
> > +{
> > +	return true;
> > +}
> > +
> > +/**
> > + * xe_svm_handle_gpu_fault() - gpu page fault handler for svm subsystem
> > + *
> > + * @vm: The vm of the fault.
> > + * @gt: The gt hardware on which the fault happens.
> > + * @pf: page fault descriptor
> > + *
> > + * Workout a backing memory for the fault address, migrate memory from
> > + * system memory to gpu vram if nessary, and map the fault address to
> > + * GPU so GPU HW can retry the last operation which has caused the GPU
> > + * page fault.
> > + */
> > +int xe_svm_handle_gpu_fault(struct xe_vm *vm,
> > +				struct xe_gt *gt,
> > +				struct pagefault *pf)
> > +{
> > +	u8 access_type = pf->access_type;
> > +	u64 page_addr = pf->page_addr;
> > +	struct hmm_range hmm_range;
> > +	struct vm_area_struct *vma;
> > +	struct xe_svm_range *range;
> > +	struct mm_struct *mm;
> > +	struct xe_svm *svm;
> > +	int ret = 0;
> > +
> > +	svm = vm->svm;
> > +	if (!svm)
> > +		return -EINVAL;
> > +
> > +	mm = svm->mm;
> > +	mmap_read_lock(mm);
> > +	vma = find_vma_intersection(mm, page_addr, page_addr + 4);
> > +	if (!vma) {
> > +		mmap_read_unlock(mm);
> > +		return -ENOENT;
> > +	}
> > +
> > +	if (!svm_access_allowed (vma, access_type != ACCESS_TYPE_READ)) {
> > +		mmap_read_unlock(mm);
> > +		return -EPERM;
> > +	}
> > +
> > +	range = xe_svm_range_from_addr(svm, page_addr);
> > +	if (!range) {
> > +		range = xe_svm_range_create(svm, vma);
> > +		if (!range) {
> > +			mmap_read_unlock(mm);
> > +			return -ENOMEM;
> > +		}
> > +	}
> > +
> > +	if (svm_should_migrate(range, &gt->tile->mem.vram,
> > +						access_type ==
> ACCESS_TYPE_ATOMIC))
> > +		/** Migrate whole svm range for now.
> > +		 *  This is subject to change once we introduce a migration
> granularity
> > +		 *  parameter for user to select.
> > +		 *
> > +		 *	Migration is best effort. If we failed to migrate to vram,
> > +		 *	we just map that range to gpu in system memory. For
> cases
> > +		 *	such as gpu atomic operation which requires memory to
> be
> > +		 *	resident in vram, we will fault again and retry migration.
> > +		 */
> > +		svm_migrate_range_to_vram(range, vma, gt->tile);
> > +
> > +	ret = svm_populate_range(range, &hmm_range, vma->vm_flags &
> VM_WRITE);
> > +	mmap_read_unlock(mm);
> > +	/** There is no need to destroy this range. Range can be reused later */
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	/**FIXME: set the DM, AE flags in PTE*/
> > +	ret = xe_bind_svm_range(vm, gt->tile, &hmm_range,
> > +		!(vma->vm_flags & VM_WRITE) ?
> DRM_XE_VM_BIND_FLAG_READONLY : 0);
> > +	/** Concurrent cpu page table update happened,
> > +	 *  Return successfully so we will retry everything
> > +	 *  on next gpu page fault.
> > +	 */
> > +	if (ret == -EAGAIN)
> > +		ret = 0;
> > +
> > +free_pfns:
> > +	kvfree(hmm_range.hmm_pfns);
> > +	return ret;
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index 659bcb7927d6..a8ff4957a9b8 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -20,6 +20,7 @@
> >
> >   struct xe_vm;
> >   struct mm_struct;
> > +struct pagefault;
> >
> >   #define XE_MAX_SVM_PROCESS 5 /* Maximumly support 32 SVM process*/
> >   extern DECLARE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
> > @@ -94,6 +95,8 @@ bool xe_svm_range_belongs_to_vma(struct mm_struct
> *mm,
> >   void xe_svm_range_unregister_mmu_notifier(struct xe_svm_range *range);
> >   int xe_svm_range_register_mmu_notifier(struct xe_svm_range *range);
> >   void xe_svm_range_prepare_destroy(struct xe_svm_range *range);
> > +struct xe_svm_range *xe_svm_range_create(struct xe_svm *svm,
> > +									struct
> vm_area_struct *vma);
> >
> >   int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
> >   int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> > @@ -106,4 +109,7 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >
> >   void xe_devm_free_blocks(struct list_head *blocks);
> >   void xe_devm_page_free(struct page *page);
> > +int xe_svm_handle_gpu_fault(struct xe_vm *vm,
> > +				struct xe_gt *gt,
> > +				struct pagefault *pf);
> >   #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_range.c
> b/drivers/gpu/drm/xe/xe_svm_range.c
> > index dfb4660dc26f..05c088dddc2d 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_range.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_range.c
> > @@ -182,3 +182,46 @@ void xe_svm_range_prepare_destroy(struct
> xe_svm_range *range)
> >   	xe_invalidate_svm_range(vm, range->start, length);
> >   	xe_svm_range_unregister_mmu_notifier(range);
> >   }
> > +
> > +static void add_range_to_svm(struct xe_svm_range *range)
> > +{
> > +	range->inode.start = range->start;
> > +	range->inode.last = range->end;
> > +	mutex_lock(&range->svm->mutex);
> > +	interval_tree_insert(&range->inode, &range->svm->range_tree);
> > +	mutex_unlock(&range->svm->mutex);
> > +}
> 
> I have following question / concern.
> 
> I believe we are planning for what we call 'shared allocations' to use
> svm.  But what we call device-only allocations, will continue to use
> GEM_CREATE and those are in the BO-centric world.
> 
> But you need to still have the application with one single managed
> address space, yes?  In other words, how will theses co-exist?
> It seems you will have collisions.

Yes, those two types of allocators have to co-exist.


> 
> For example as hmm_range_fault brings a range from host into GPU address
> space,  what if it was already allocated and in use by VM_BIND for
> a GEM_CREATE allocated buffer?    That is of course application error,
> but KMD needs to detect it, and provide one single managed address
> space across all allocations from the application....


This is a very good question. Yes, agreed, we should check for this application error. Fortunately this is doable: all vm_bind virtual address ranges are tracked in the xe_vm/drm_gpuvm struct. In this case, we should iterate the drm_gpuvm rb trees of *all* gpu devices (as an xe_vm is for one device only) to see whether there is a conflict, roughly as sketched below. Will make the change soon.
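
Something along these lines (illustrative only; the list of per-device
xe_vms on the svm struct and the field names are assumptions):

	/* Illustrative only: reject an svm range that overlaps an
	 * existing vm_bind mapping on any gpu device. Assumes the
	 * process-wide xe_svm keeps a list of its per-device xe_vms
	 * and that each xe_vm embeds a drm_gpuvm.
	 */
	static bool svm_range_collides_with_bind(struct xe_svm *svm,
						 u64 start, u64 end)
	{
		struct xe_vm *vm;

		list_for_each_entry(vm, &svm->vm_list, svm_link) {
			/* drm_gpuva_find_first() returns the first
			 * mapping overlapping the given range, if any.
			 */
			if (drm_gpuva_find_first(&vm->gpuvm, start,
						 end - start))
				return true;
		}
		return false;
	}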


> 
> Continuing on this theme.  Instead of this interval tree, did you
> consider to just use drm_gpuvm as address space manager?
> It probably needs some overhaul, and not to assume it is managing only
> BO backed allocations, but could work....
> And it has all the split/merge support already there, which you will
> need for adding hints later?


Yes, another good point. I discussed the approach of leveraging drm_gpuvm with Matt Brost. The good thing is we can leverage all the range split/merge utilities there.

The difficulty with using drm_gpuvm is that today xe_vm/drm_gpuvm are per-device based (see the *dev pointer in each structure), while xe_svm should work across all gpu devices.... So it is hard for xe_svm to inherit from drm_gpuvm as-is...

One approach Matt mentioned is to change drm_gpuvm a little to make it work across gpu devices. I think this should be doable. I looked at the dev pointer in drm_gpuvm; it isn't really used much. It is used just to print some warning messages, with no real logic depending on it...

So what we can do is remove the dev pointer from drm_gpuvm and, instead of having xe_vm inherit from drm_gpuvm, keep a drm_gpuvm pointer in xe_vm and let xe_svm inherit from drm_gpuvm, roughly as sketched below. Matt pointed all those ideas out to me. We thought we would make svm work without changing the xekmd base driver and drm as a first step, and try this idea as a second step...
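
Roughly like this (illustrative field names only, not the actual
structures):

	/* Illustrative only: one process-wide drm_gpuvm embedded in
	 * xe_svm, with each per-device xe_vm pointing at it when svm
	 * is in use.
	 */
	struct xe_svm {
		struct drm_gpuvm base;	/* process-wide address space */
		struct mm_struct *mm;
		/* ... */
	};

	struct xe_vm {
		struct drm_gpuvm *gpuvm; /* &xe_svm.base in svm mode */
		struct xe_device *xe;	 /* per-device state stays here */
		/* ... */
	};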

But since you also have this idea, I will start an email to the drm_gpuvm designer to ask about feasibility. If it turns out to be feasible, I will make it work in one step. Considering this will save some code in the memory hint part, I think it is worth the time to consider it right now.

Thanks,
Oak

> 
> Wanted to hear your thoughts.
> 
> -Brian
> 
> 
> 
> > +
> > +/**
> > + * xe_svm_range_create() - create and initialize a svm range
> > + *
> > + * @svm: the svm that the range belongs to
> > + * @vma: the corresponding vma of the range
> > + *
> > + * Create range, add it to svm's interval tree. Regiter a mmu
> > + * interval notifier for this range.
> > + *
> > + * Return the pointer of the created svm range
> > + * or NULL if fail
> > + */
> > +struct xe_svm_range *xe_svm_range_create(struct xe_svm *svm,
> > +									struct
> vm_area_struct *vma)
> > +{
> > +	struct xe_svm_range *range = kzalloc(sizeof(*range), GFP_KERNEL);
> > +
> > +	if (!range)
> > +		return NULL;
> > +
> > +	range->start = vma->vm_start;
> > +	range->end = vma->vm_end;
> > +	range->vma = vma;
> > +	range->svm = svm;
> > +
> > +	if (xe_svm_range_register_mmu_notifier(range)){
> > +		kfree(range);
> > +		return NULL;
> > +	}
> > +
> > +	add_range_to_svm(range);
> > +	return range;
> > +}

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Making drm_gpuvm work across gpu devices
  2024-01-23  3:09     ` Zeng, Oak
@ 2024-01-23  3:21       ` Zeng, Oak
  2024-01-23 11:13         ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-23  3:21 UTC (permalink / raw)
  To: Danilo Krummrich, Dave Airlie, Christian König
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, intel-xe

Hi Danilo and all,

During the work on Intel's SVM code, we came up with the idea of making drm_gpuvm work across multiple gpu devices. See some discussion here: https://lore.kernel.org/dri-devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd11.prod.outlook.com/

The reason we try to do this is, for an SVM (shared virtual memory across the cpu program and all gpu programs on all gpu devices) process, the address space has to be shared across all gpu devices. So if we make drm_gpuvm work across devices, then our SVM code can leverage drm_gpuvm as well.

At a first look, it seems feasible because drm_gpuvm doesn't really use the drm_device *drm pointer a lot. This param is used only for printing/warning. So I think maybe we can delete this drm field from drm_gpuvm.

This way, on a multiple gpu device system, for one process, we can have only one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for each gpu device).

What do you think?

Thanks,
Oak

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-23  3:21       ` Making drm_gpuvm work across gpu devices Zeng, Oak
@ 2024-01-23 11:13         ` Christian König
  2024-01-23 19:37           ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-23 11:13 UTC (permalink / raw)
  To: Zeng, Oak, Danilo Krummrich, Dave Airlie, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, intel-xe

Hi Oak,

Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> Hi Danilo and all,
>
> During the work of Intel's SVM code, we came up the idea of making drm_gpuvm to work across multiple gpu devices. See some discussion here: https://lore.kernel.org/dri-devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd11.prod.outlook.com/
>
> The reason we try to do this is, for a SVM (shared virtual memory across cpu program and all gpu program on all gpu devices) process, the address space has to be across all gpu devices. So if we make drm_gpuvm to work across devices, then our SVM code can leverage drm_gpuvm as well.
>
> At a first look, it seems feasible because drm_gpuvm doesn't really use the drm_device *drm pointer a lot. This param is used only for printing/warning. So I think maybe we can delete this drm field from drm_gpuvm.
>
> This way, on a multiple gpu device system, for one process, we can have only one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for each gpu device).
>
> What do you think?

Well from the GPUVM side I don't think it would make much difference if 
we have the drm device or not.

But the experience we had with the KFD I think I should mention that we 
should absolutely *not* deal with multiple devices at the same time in 
the UAPI or VM objects inside the driver.

The background is that all the APIs inside the Linux kernel are built 
around the idea that they work with only one device at a time. This 
accounts for both low level APIs like the DMA API as well as pretty high 
level things like for example file system address space etc...

So when you have multiple GPUs you either have an inseparable cluster of 
them, in which case you would also only have one drm_device. Or you have 
separate drm_devices, which also result in separate drm render nodes and 
separate virtual address spaces and also eventually separate IOMMU 
domains which gives you separate dma_addresses for the same page and so 
separate GPUVM page tables....

It's up to you how to implement it, but I think it's pretty clear that 
you need separate drm_gpuvm objects to manage those.

That you map the same thing in all those virtual address spaces at the 
same address is a completely different optimization problem I think. 
What we could certainly do is to optimize hmm_range_fault by making 
hmm_range a reference counted object and using it for multiple devices 
at the same time if those devices request the same range of an mm_struct.
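
A minimal sketch of that idea, purely illustrative (no such object
exists in the kernel today):

	/* Illustrative only: a refcounted wrapper so multiple devices
	 * faulting the same range of an mm_struct can share one walk.
	 */
	struct shared_hmm_range {
		struct kref refcount;
		struct hmm_range range;
	};

	static void shared_hmm_range_release(struct kref *kref)
	{
		struct shared_hmm_range *shr =
			container_of(kref, struct shared_hmm_range,
				     refcount);

		kvfree(shr->range.hmm_pfns);
		kfree(shr);
	}

	static void shared_hmm_range_put(struct shared_hmm_range *shr)
	{
		kref_put(&shr->refcount, shared_hmm_range_release);
	}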

I think if you start using the same drm_gpuvm for multiple devices you 
will sooner or later start to run into the same mess we have seen with 
KFD, where we moved more and more functionality from the KFD to the DRM 
render node because we found that a lot of the stuff simply doesn't work 
correctly with a single object to maintain the state.

Just one more point to your original discussion on the xe list: I think 
it's perfectly valid for an application to map something at the same 
address you already have something else.

Cheers,
Christian.

>
> Thanks,
> Oak


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-23 11:13         ` Christian König
@ 2024-01-23 19:37           ` Zeng, Oak
  2024-01-23 20:17             ` Felix Kuehling
                               ` (2 more replies)
  0 siblings, 3 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-23 19:37 UTC (permalink / raw)
  To: Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

Thanks Christian. I have some comment inline below.

Danilo, can you also take a look and give your feedback? Thanks.

> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: Tuesday, January 23, 2024 6:13 AM
> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>;
> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> > Hi Danilo and all,
> >
> > During the work of Intel's SVM code, we came up the idea of making
> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> https://lore.kernel.org/dri-
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> 11.prod.outlook.com/
> >
> > The reason we try to do this is, for a SVM (shared virtual memory across cpu
> program and all gpu program on all gpu devices) process, the address space has
> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
> then our SVM code can leverage drm_gpuvm as well.
> >
> > At a first look, it seems feasible because drm_gpuvm doesn't really use the
> drm_device *drm pointer a lot. This param is used only for printing/warning. So I
> think maybe we can delete this drm field from drm_gpuvm.
> >
> > This way, on a multiple gpu device system, for one process, we can have only
> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
> each gpu device).
> >
> > What do you think?
> 
> Well from the GPUVM side I don't think it would make much difference if
> we have the drm device or not.
> 
> But the experience we had with the KFD I think I should mention that we
> should absolutely *not* deal with multiple devices at the same time in
> the UAPI or VM objects inside the driver.
> 
> The background is that all the APIs inside the Linux kernel are build
> around the idea that they work with only one device at a time. This
> accounts for both low level APIs like the DMA API as well as pretty high
> level things like for example file system address space etc...

Yes most API are per device based.

One exception I know of is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represents a process across N gpu devices. Cc Felix.

Need to say, kfd SVM represents a shared virtual address space across the CPU and all GPU devices on the system. This is by definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you have to use dma-buf export/import etc).

We have the same design requirement for SVM. For anyone who wants to implement the SVM concept, this is a hard requirement. Since drm now has the drm_gpuvm concept, which strictly speaking is designed for one device, I want to see whether we can extend drm_gpuvm to make it work for both a single device (as used in xe) and multiple devices (as will be used in the SVM code). That is why I brought up this topic.

> 
> So when you have multiple GPUs you either have an inseparable cluster of
> them which case you would also only have one drm_device. Or you have
> separated drm_device which also results in separate drm render nodes and
> separate virtual address spaces and also eventually separate IOMMU
> domains which gives you separate dma_addresses for the same page and so
> separate GPUVM page tables....

I am thinking we can still have each device keep its separate drm_device/render node/iommu domains/gpu page table, just as we have today. I do not plan to change this picture.

But the virtual address space will support two modes of operation: 
1. one drm_gpuvm per device. This is when svm is not in the picture.
2. all devices in the process share one single drm_gpuvm, when svm is in the picture. In the xe driver design, we have to support a mixed use of legacy mode (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu submission). So whenever SVM is in the picture, we want one single process address space across all devices. Drm_gpuvm doesn't need to be aware of these two operation modes; it is the driver's responsibility to use the appropriate mode. 

For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series: https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/) can inherit from drm_gpuvm, while each xe_vm (still a per-device based struct) will just have a pointer to the drm_gpuvm structure. This way, when svm is in play, we build a 1 process : 1 mm_struct : 1 xe_svm : N xe_vm correlation, which means a shared address space across gpu devices.

This requires some changes to the drm_gpuvm design:
1. The drm_device *drm pointer: in mode #2 operation, this can be NULL, meaning this drm_gpuvm is not for a specific gpu device.
2. The common dma_resv object: drm_gem_object *r_obj. *Does one dma_resv object allocated/initialized for one device work for all devices*? At first look, dma_resv is just a CPU structure maintaining dma-fences, so I guess it should work for all devices? I definitely need you to comment.  


> 
> It's up to you how to implement it, but I think it's pretty clear that
> you need separate drm_gpuvm objects to manage those.

As explained above, I am thinking of one drm_gpuvm object across all devices when SVM is in the picture...

> 
> That you map the same thing in all those virtual address spaces at the
> same address is a completely different optimization problem I think.

Not sure I follow here... the requirement from SVM is that one virtual address points to the same physical backing store. For example, whenever the CPU or any GPU device accesses this virtual address, it refers to the same physical content. Of course the physical backing store can be migrated b/t host memory and any of the GPUs' device memory, but the content should be consistent.

So we are mapping the same physical content to the same virtual address in either the cpu page table or any gpu device's page table...

> What we could certainly do is to optimize hmm_range_fault by making
> hmm_range a reference counted object and using it for multiple devices
> at the same time if those devices request the same range of an mm_struct.
> 

I don't quite follow. If you are trying to resolve a multiple-device concurrent access problem, I think we should serialize concurrent device faults to one address range; see the sketch below. The reason is, during device fault handling we might migrate the backing store, so hmm_range->hmm_pfns[] might have changed after one device accesses it.
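
Something like this sketch (illustrative only; assumes xe_svm_range
grows a per-range fault_lock mutex):

	/* Illustrative only: serialize device faults to one address
	 * range so a migration triggered by one device cannot
	 * invalidate the hmm_pfns[] another device is consuming.
	 */
	static int svm_handle_fault_serialized(struct xe_svm_range *range,
					       int (*handler)(struct xe_svm_range *))
	{
		int ret;

		mutex_lock(&range->fault_lock);
		ret = handler(range);
		mutex_unlock(&range->fault_lock);
		return ret;
	}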

> I think if you start using the same drm_gpuvm for multiple devices you
> will sooner or later start to run into the same mess we have seen with
> KFD, where we moved more and more functionality from the KFD to the DRM
> render node because we found that a lot of the stuff simply doesn't work
> correctly with a single object to maintain the state.

As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represents all hardware gpu devices. That is why during kfd open, many pdds (process device data) are created, each for one hardware device for this process. Yes, the code is a little complicated.

Kfd manages the shared virtual address space in the kfd driver code, like the splitting, merging etc. Here I am looking at whether we can leverage the drm_gpuvm code for those functions. 

As for the shared virtual address space across gpu devices, it is a hard requirement for the svm/system allocator (aka malloc for gpu programs). We need to make it work either at the driver level or the drm_gpuvm level. Drm_gpuvm is better because the work can be shared b/t drivers.

Thanks a lot,
Oak

> 
> Just one more point to your original discussion on the xe list: I think
> it's perfectly valid for an application to map something at the same
> address you already have something else.
> 
> Cheers,
> Christian.
> 
> >
> > Thanks,
> > Oak



* Re: Making drm_gpuvm work across gpu devices
  2024-01-23 19:37           ` Zeng, Oak
@ 2024-01-23 20:17             ` Felix Kuehling
  2024-01-25  1:39               ` Zeng, Oak
  2024-01-23 23:56             ` Danilo Krummrich
  2024-01-24  8:33             ` Christian König
  2 siblings, 1 reply; 123+ messages in thread
From: Felix Kuehling @ 2024-01-23 20:17 UTC (permalink / raw)
  To: Zeng, Oak, Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

On 2024-01-23 14:37, Zeng, Oak wrote:
> Thanks Christian. I have some comment inline below.
>
> Danilo, can you also take a look and give your feedback? Thanks.

Sorry, just catching up with this thread now. I'm also not familiar with 
drm_gpuvm.

Some general observations based on my experience with KFD, amdgpu and 
SVM. With SVM we have a single virtual address space managed in user 
mode (basically using mmap) with attributes per virtual address range 
maintained in the kernel mode driver. Different devices get their 
mappings of pieces of that address space using the same virtual 
addresses. We also support migration to different DEVICE_PRIVATE memory 
spaces.

However, we still have page tables managed per device. Each device can 
have different page table formats and layout (e.g. different GPU 
generations in the same system) and the same memory may be mapped with 
different flags on different devices in order to get the right coherence 
behaviour. We also need to maintain per-device DMA mappings somewhere. 
That means, as far as the device page tables are concerned, we still 
have separate address spaces. SVM only adds a layer on top, which 
coordinates these separate device virtual address spaces so that some 
parts of them provide the appearance of a shared virtual address space.
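To make that concrete, the kernel-side state could look something like the following (a simplified sketch with made-up names such as svm_range_sketch and MAX_GPUS, not the actual KFD structures): process-wide attributes per range, plus per-device mapping state:

struct svm_range_sketch {
        u64 start, last;                /* virtual address range */
        u32 preferred_loc;              /* per-range attribute */
        u32 flags;                      /* coherence/caching attributes */
        struct {
                dma_addr_t *dma_addr;   /* per-device DMA mappings */
                bool mapped;            /* resident in this device's PT? */
        } dev_state[MAX_GPUS];          /* MAX_GPUS is a made-up bound */
};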

At some point you need to decide, where you draw the boundary between 
managing a per-process shared virtual address space and managing 
per-device virtual address spaces. In amdgpu that boundary is currently 
where kfd_svm code calls amdgpu_vm code to manage the per-device page 
tables.

In the amdgpu driver, we still have the traditional memory management 
APIs in the render nodes that don't do SVM. They share the device 
virtual address spaces with SVM. We have to be careful that we don't try 
to manage the same device virtual address ranges with these two 
different memory managers. In practice, we let the non-SVM APIs use the 
upper half of the canonical address space, while the lower half can be 
used almost entirely for SVM.

Regards,
   Felix


>
>> -----Original Message-----
>> From: Christian König <christian.koenig@amd.com>
>> Sent: Tuesday, January 23, 2024 6:13 AM
>> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>;
>> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
>> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
>> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
>> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
>> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
>> <matthew.brost@intel.com>
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>> Hi Oak,
>>
>> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
>>> Hi Danilo and all,
>>>
>>> During the work of Intel's SVM code, we came up the idea of making
>> drm_gpuvm to work across multiple gpu devices. See some discussion here:
>> https://lore.kernel.org/dri-
>> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
>> 11.prod.outlook.com/
>>> The reason we try to do this is, for a SVM (shared virtual memory across cpu
>> program and all gpu program on all gpu devices) process, the address space has
>> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
>> then our SVM code can leverage drm_gpuvm as well.
>>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
>> drm_device *drm pointer a lot. This param is used only for printing/warning. So I
>> think maybe we can delete this drm field from drm_gpuvm.
>>> This way, on a multiple gpu device system, for one process, we can have only
>> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
>> each gpu device).
>>> What do you think?
>> Well from the GPUVM side I don't think it would make much difference if
>> we have the drm device or not.
>>
>> But the experience we had with the KFD I think I should mention that we
>> should absolutely *not* deal with multiple devices at the same time in
>> the UAPI or VM objects inside the driver.
>>
>> The background is that all the APIs inside the Linux kernel are build
>> around the idea that they work with only one device at a time. This
>> accounts for both low level APIs like the DMA API as well as pretty high
>> level things like for example file system address space etc...
> Yes most API are per device based.
>
> One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices. Cc Felix.
>
> Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).
>
> We have the same design requirement of SVM. For anyone who want to implement the SVM concept, this is a hard requirement. Since now drm has the drm_gpuvm concept which strictly speaking is designed for one device, I want to see whether we can extend drm_gpuvm to make it work for both single device (as used in xe) and multipe devices (will be used in the SVM code). That is why I brought up this topic.
>
>> So when you have multiple GPUs you either have an inseparable cluster of
>> them which case you would also only have one drm_device. Or you have
>> separated drm_device which also results in separate drm render nodes and
>> separate virtual address spaces and also eventually separate IOMMU
>> domains which gives you separate dma_addresses for the same page and so
>> separate GPUVM page tables....
> I am thinking we can still make each device has its separate drm_device/render node/iommu domains/gpu page table. Just as what we have today. I am not plan to change this picture.
>
> But the virtual address space will support two modes of operation:
> 1. one drm_gpuvm per device. This is when svm is not in the picture
> 2. all devices in the process share one single drm_gpuvm, when svm is in the picture. In xe driver design, we have to support a mixture use of legacy mode (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu submission). So whenever SVM is in the picture, we want one single process address space across all devices. Drm_gpuvm doesn't need to be aware of those two operation modes. It is driver's responsibility to use different mode.
>
> For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series: https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/) can inherit from drm_gpuvm while each xe_vm (still a per-device based struct) will just have a pointer to the drm_gpuvm structure. This way when svm is in play, we build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means shared address space across gpu devices.
>
> This requires some changes of drm_gpuvm design:
> 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL, means this drm_gpuvm is not for specific gpu device
> 2. The common dma_resv object: drm_gem_object *r_obj. *Does one dma_resv object allocated/initialized for one device work for all devices*? From first look, dma_resv is just some CPU structure maintaining dma-fences. So I guess it should work for all devices? I definitely need you to comment.
>
>
>> It's up to you how to implement it, but I think it's pretty clear that
>> you need separate drm_gpuvm objects to manage those.
> As explained above, I am thinking of one drm_gpuvm object across all devices when SVM is in the picture...
>
>> That you map the same thing in all those virtual address spaces at the
>> same address is a completely different optimization problem I think.
> Not sure I follow here... the requirement from SVM is, one virtual address points to same physical backing store. For example, whenever CPU or any GPU device access this virtual address, it refers to the same physical content. Of course the physical backing store can be migrated b/t host memory and any of the GPU's device memory, but the content should be consistent.
>
> So we are mapping same physical content to the same virtual address in either cpu page table or any gpu device's page table...
>
>> What we could certainly do is to optimize hmm_range_fault by making
>> hmm_range a reference counted object and using it for multiple devices
>> at the same time if those devices request the same range of an mm_struct.
>>
> Not very follow. If you are trying to resolve a multiple devices concurrent access problem, I think we should serialize concurrent device fault to one address range. The reason is, during device fault handling, we might migrate the backing store so hmm_range->hmm_pfns[] might have changed after one device access it.
>
>> I think if you start using the same drm_gpuvm for multiple devices you
>> will sooner or later start to run into the same mess we have seen with
>> KFD, where we moved more and more functionality from the KFD to the DRM
>> render node because we found that a lot of the stuff simply doesn't work
>> correctly with a single object to maintain the state.
> As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process. Yes the codes are a little complicated.
>
> Kfd manages the shared virtual address space in the kfd driver codes, like the split, merging etc. Here I am looking whether we can leverage the drm_gpuvm code for those functions.
>
> As of the shared virtual address space across gpu devices, it is a hard requirement for svm/system allocator (aka malloc for gpu program). We need to make it work either at driver level or drm_gpuvm level. Drm_gpuvm is better because the work can be shared b/t drivers.
>
> Thanks a lot,
> Oak
>
>> Just one more point to your original discussion on the xe list: I think
>> it's perfectly valid for an application to map something at the same
>> address you already have something else.
>>
>> Cheers,
>> Christian.
>>
>>> Thanks,
>>> Oak


* Re: Making drm_gpuvm work across gpu devices
  2024-01-23 19:37           ` Zeng, Oak
  2024-01-23 20:17             ` Felix Kuehling
@ 2024-01-23 23:56             ` Danilo Krummrich
  2024-01-24  3:57               ` Zeng, Oak
  2024-01-24  8:33             ` Christian König
  2 siblings, 1 reply; 123+ messages in thread
From: Danilo Krummrich @ 2024-01-23 23:56 UTC (permalink / raw)
  To: Zeng, Oak, Christian König, Dave Airlie, Daniel Vetter,
	Felix Kuehling
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

Hi Oak,

On 1/23/24 20:37, Zeng, Oak wrote:
> Thanks Christian. I have some comment inline below.
> 
> Danilo, can you also take a look and give your feedback? Thanks.

I agree with everything Christian already wrote. Except for the KFD parts, which
I'm simply not familiar with, I had exactly the same thoughts after reading your
initial mail.

Please find some more comments below.

> 
>> -----Original Message-----
>> From: Christian König <christian.koenig@amd.com>
>> Sent: Tuesday, January 23, 2024 6:13 AM
>> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>;
>> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
>> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
>> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
>> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
>> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
>> <matthew.brost@intel.com>
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>> Hi Oak,
>>
>> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
>>> Hi Danilo and all,
>>>
>>> During the work of Intel's SVM code, we came up the idea of making
>> drm_gpuvm to work across multiple gpu devices. See some discussion here:
>> https://lore.kernel.org/dri-
>> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
>> 11.prod.outlook.com/
>>>
>>> The reason we try to do this is, for a SVM (shared virtual memory across cpu
>> program and all gpu program on all gpu devices) process, the address space has
>> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
>> then our SVM code can leverage drm_gpuvm as well.
>>>
>>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
>> drm_device *drm pointer a lot. This param is used only for printing/warning. So I
>> think maybe we can delete this drm field from drm_gpuvm.
>>>
>>> This way, on a multiple gpu device system, for one process, we can have only
>> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
>> each gpu device).
>>>
>>> What do you think?
>>
>> Well from the GPUVM side I don't think it would make much difference if
>> we have the drm device or not.
>>
>> But the experience we had with the KFD I think I should mention that we
>> should absolutely *not* deal with multiple devices at the same time in
>> the UAPI or VM objects inside the driver.
>>
>> The background is that all the APIs inside the Linux kernel are build
>> around the idea that they work with only one device at a time. This
>> accounts for both low level APIs like the DMA API as well as pretty high
>> level things like for example file system address space etc...
> 
> Yes most API are per device based.
> 
> One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices. Cc Felix.
> 
> Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).
> 
> We have the same design requirement of SVM. For anyone who want to implement the SVM concept, this is a hard requirement. Since now drm has the drm_gpuvm concept which strictly speaking is designed for one device, I want to see whether we can extend drm_gpuvm to make it work for both single device (as used in xe) and multipe devices (will be used in the SVM code). That is why I brought up this topic.
> 
>>
>> So when you have multiple GPUs you either have an inseparable cluster of
>> them which case you would also only have one drm_device. Or you have
>> separated drm_device which also results in separate drm render nodes and
>> separate virtual address spaces and also eventually separate IOMMU
>> domains which gives you separate dma_addresses for the same page and so
>> separate GPUVM page tables....
> 
> I am thinking we can still make each device has its separate drm_device/render node/iommu domains/gpu page table. Just as what we have today. I am not plan to change this picture.
> 
> But the virtual address space will support two modes of operation:
> 1. one drm_gpuvm per device. This is when svm is not in the picture
> 2. all devices in the process share one single drm_gpuvm, when svm is in the picture. In xe driver design, we have to support a mixture use of legacy mode (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu submission). So whenever SVM is in the picture, we want one single process address space across all devices. Drm_gpuvm doesn't need to be aware of those two operation modes. It is driver's responsibility to use different mode.
> 
> For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series: https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/) can inherit from drm_gpuvm while each xe_vm (still a per-device based struct) will just have a pointer to the drm_gpuvm structure. This way when svm is in play, we build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means shared address space across gpu devices.

With a shared GPUVM structure, how do you track actual per device resources such as
page tables? You also need to consider that the page table layout, memory mapping
flags may vary from device to device due to different GPU chipsets or revisions.

Also, if you replace the embedded GPUVM structure with a pointer to a shared one,
you may run into all kinds of difficulties due to increasing complexity in terms
of locking, synchronization, lifetime and potential unwind operations in error paths.
I haven't thought it through yet, but I wouldn't be surprised entirely if there are
cases where you simply run into circular dependencies.

Also, looking at the conversation in the linked patch series:

<snip>

>> For example as hmm_range_fault brings a range from host into GPU address
>> space,  what if it was already allocated and in use by VM_BIND for
>> a GEM_CREATE allocated buffer?    That is of course application error,
>> but KMD needs to detect it, and provide one single managed address
>> space across all allocations from the application....

> This is very good question. Yes agree we should check this application error. Fortunately this is doable. All vm_bind virtual address range are tracked in xe_vm/drm_gpuvm struct. In this case, we should iterate the drm_gpuvm's rb tree of *all* gpu devices (as xe_vm is for one device only) to see whether there is a conflict. Will make the change soon.

<snip>

How do you do that if xe_vm->gpuvm is just a pointer to the GPUVM structure within xe_svm?

> 
> This requires some changes of drm_gpuvm design:
> 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL, means this drm_gpuvm is not for specific gpu device
> 2. The common dma_resv object: drm_gem_object *r_obj. *Does one dma_resv object allocated/initialized for one device work for all devices*? From first look, dma_resv is just some CPU structure maintaining dma-fences. So I guess it should work for all devices? I definitely need you to comment.

The general rule is that drivers can share the common dma_resv across GEM objects that
are only mapped within the VM owning the dma_resv, but never within another VM.

Now, your question is whether multiple VMs can share the same common dma_resv. I think
that calls for trouble, since it would create dependencies that simply aren't needed
and might even introduce locking issues.

However, that's optional, you can simply decide to not make use of the common dma_resv
and all the optimizations based on it.

> 
> 
>>
>> It's up to you how to implement it, but I think it's pretty clear that
>> you need separate drm_gpuvm objects to manage those.
> 
> As explained above, I am thinking of one drm_gpuvm object across all devices when SVM is in the picture...
> 
>>
>> That you map the same thing in all those virtual address spaces at the
>> same address is a completely different optimization problem I think.
> 
> Not sure I follow here... the requirement from SVM is, one virtual address points to same physical backing store. For example, whenever CPU or any GPU device access this virtual address, it refers to the same physical content. Of course the physical backing store can be migrated b/t host memory and any of the GPU's device memory, but the content should be consistent.

Technically, multiple different GPUs will have separate virtual address spaces, it's
just that you create mappings within all of them such that the same virtual address
resolves to the same physical content on all of them.

So, having a single GPUVM instance representing all of them might give the illusion of
a single unified address space, but you still need to maintain each device's address
space backing resources, such as page tables, separately.

- Danilo

> 
> So we are mapping same physical content to the same virtual address in either cpu page table or any gpu device's page table...
> 
>> What we could certainly do is to optimize hmm_range_fault by making
>> hmm_range a reference counted object and using it for multiple devices
>> at the same time if those devices request the same range of an mm_struct.
>>
> 
> Not very follow. If you are trying to resolve a multiple devices concurrent access problem, I think we should serialize concurrent device fault to one address range. The reason is, during device fault handling, we might migrate the backing store so hmm_range->hmm_pfns[] might have changed after one device access it.
> 
>> I think if you start using the same drm_gpuvm for multiple devices you
>> will sooner or later start to run into the same mess we have seen with
>> KFD, where we moved more and more functionality from the KFD to the DRM
>> render node because we found that a lot of the stuff simply doesn't work
>> correctly with a single object to maintain the state.
> 
> As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process. Yes the codes are a little complicated.
> 
> Kfd manages the shared virtual address space in the kfd driver codes, like the split, merging etc. Here I am looking whether we can leverage the drm_gpuvm code for those functions.
> 
> As of the shared virtual address space across gpu devices, it is a hard requirement for svm/system allocator (aka malloc for gpu program). We need to make it work either at driver level or drm_gpuvm level. Drm_gpuvm is better because the work can be shared b/t drivers.
> 
> Thanks a lot,
> Oak
> 
>>
>> Just one more point to your original discussion on the xe list: I think
>> it's perfectly valid for an application to map something at the same
>> address you already have something else.
>>
>> Cheers,
>> Christian.
>>
>>>
>>> Thanks,
>>> Oak
> 



* RE: Making drm_gpuvm work across gpu devices
  2024-01-23 23:56             ` Danilo Krummrich
@ 2024-01-24  3:57               ` Zeng, Oak
  2024-01-24  4:14                 ` Zeng, Oak
  2024-01-25 22:13                 ` Danilo Krummrich
  0 siblings, 2 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-24  3:57 UTC (permalink / raw)
  To: Danilo Krummrich, Christian König, Dave Airlie,
	Daniel Vetter, Felix Kuehling, Welty, Brian
  Cc: Brost, Matthew, Thomas.Hellstrom, dri-devel, Ghimiray,
	Himal Prasad, Gupta, saurabhg, Bommu,  Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

Thanks a lot Danilo.

Maybe I wasn't clear enough. In the solution I proposed, each device still has separate VM/page tables. Each device still needs to manage the mappings, page table flags, etc. It is just that in the SVM use case, all devices share one drm_gpuvm instance. As I understand it, drm_gpuvm's main function is VA range split and merging. I don't see why that wouldn't work across GPU devices.

But I read more about drm_gpuvm. Its split/merge functions take a drm_gem_object parameter, see drm_gpuvm_sm_map_ops_create and drm_gpuvm_sm_map. Actually the whole drm_gpuvm is designed for BO-centric drivers; for example, it has a drm_gpuvm_bo concept to keep track of the 1 BO : N gpuva mapping. The whole purpose of leveraging drm_gpuvm was to re-use the VA split/merge functions for SVM. But in our SVM implementation there is no buffer object at all, so I don't think our SVM code can leverage drm_gpuvm.
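For reference, the signature that drove this conclusion is roughly the following (quoted from memory from include/drm/drm_gpuvm.h, so please double-check the header); the drm_gem_object parameter is what makes it BO-centric:

struct drm_gpuva_ops *
drm_gpuvm_sm_map_ops_create(struct drm_gpuvm *gpuvm,
                            u64 addr, u64 range,
                            struct drm_gem_object *obj, u64 offset);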

I will give up this approach, unless Matt or Brian can see a way.

A few replies inline... @Welty, Brian, I had more thoughts inline on one of your original questions...

> -----Original Message-----
> From: Danilo Krummrich <dakr@redhat.com>
> Sent: Tuesday, January 23, 2024 6:57 PM
> To: Zeng, Oak <oak.zeng@intel.com>; Christian König
> <christian.koenig@amd.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> On 1/23/24 20:37, Zeng, Oak wrote:
> > Thanks Christian. I have some comment inline below.
> >
> > Danilo, can you also take a look and give your feedback? Thanks.
> 
> I agree with everything Christian already wrote. Except for the KFD parts, which
> I'm simply not familiar with, I had exactly the same thoughts after reading your
> initial mail.
> 
> Please find some more comments below.
> 
> >
> >> -----Original Message-----
> >> From: Christian König <christian.koenig@amd.com>
> >> Sent: Tuesday, January 23, 2024 6:13 AM
> >> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>;
> >> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
> >> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org;
> intel-
> >> xe@lists.freedesktop.org; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>;
> >> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> >> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> >> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> >> <matthew.brost@intel.com>
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >> Hi Oak,
> >>
> >> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> >>> Hi Danilo and all,
> >>>
> >>> During the work of Intel's SVM code, we came up the idea of making
> >> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> >> https://lore.kernel.org/dri-
> >>
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> >> 11.prod.outlook.com/
> >>>
> >>> The reason we try to do this is, for a SVM (shared virtual memory across cpu
> >> program and all gpu program on all gpu devices) process, the address space
> has
> >> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
> >> then our SVM code can leverage drm_gpuvm as well.
> >>>
> >>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
> >> drm_device *drm pointer a lot. This param is used only for printing/warning.
> So I
> >> think maybe we can delete this drm field from drm_gpuvm.
> >>>
> >>> This way, on a multiple gpu device system, for one process, we can have only
> >> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
> >> each gpu device).
> >>>
> >>> What do you think?
> >>
> >> Well from the GPUVM side I don't think it would make much difference if
> >> we have the drm device or not.
> >>
> >> But the experience we had with the KFD I think I should mention that we
> >> should absolutely *not* deal with multiple devices at the same time in
> >> the UAPI or VM objects inside the driver.
> >>
> >> The background is that all the APIs inside the Linux kernel are build
> >> around the idea that they work with only one device at a time. This
> >> accounts for both low level APIs like the DMA API as well as pretty high
> >> level things like for example file system address space etc...
> >
> > Yes most API are per device based.
> >
> > One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
> function, it is per-process based. Each kfd_process represent a process across N
> gpu devices. Cc Felix.
> >
> > Need to say, kfd SVM represent a shared virtual address space across CPU and
> all GPU devices on the system. This is by the definition of SVM (shared virtual
> memory). This is very different from our legacy gpu *device* driver which works
> for only one device (i.e., if you want one device to access another device's
> memory, you will have to use dma-buf export/import etc).
> >
> > We have the same design requirement of SVM. For anyone who want to
> implement the SVM concept, this is a hard requirement. Since now drm has the
> drm_gpuvm concept which strictly speaking is designed for one device, I want to
> see whether we can extend drm_gpuvm to make it work for both single device
> (as used in xe) and multipe devices (will be used in the SVM code). That is why I
> brought up this topic.
> >
> >>
> >> So when you have multiple GPUs you either have an inseparable cluster of
> >> them which case you would also only have one drm_device. Or you have
> >> separated drm_device which also results in separate drm render nodes and
> >> separate virtual address spaces and also eventually separate IOMMU
> >> domains which gives you separate dma_addresses for the same page and so
> >> separate GPUVM page tables....
> >
> > I am thinking we can still make each device has its separate drm_device/render
> node/iommu domains/gpu page table. Just as what we have today. I am not plan
> to change this picture.
> >
> > But the virtual address space will support two modes of operation:
> > 1. one drm_gpuvm per device. This is when svm is not in the picture
> > 2. all devices in the process share one single drm_gpuvm, when svm is in the
> picture. In xe driver design, we have to support a mixture use of legacy mode
> (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu
> submission). So whenever SVM is in the picture, we want one single process
> address space across all devices. Drm_gpuvm doesn't need to be aware of those
> two operation modes. It is driver's responsibility to use different mode.
> >
> > For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit
> from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series:
> https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/)
> can inherit from drm_gpuvm while each xe_vm (still a per-device based struct)
> will just have a pointer to the drm_gpuvm structure. This way when svm is in play,
> we build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means
> shared address space across gpu devices.
> 
> With a shared GPUVM structure, how do you track actual per device resources
> such as
> page tables? You also need to consider that the page table layout, memory
> mapping
> flags may vary from device to device due to different GPU chipsets or revisions.

The per-device page tables, flags, etc. are still managed per device, via the xe_vm in the xekmd driver.

> 
> Also, if you replace the shared GPUVM structure with a pointer to a shared one,
> you may run into all kinds of difficulties due to increasing complexity in terms
> of locking, synchronization, lifetime and potential unwind operations in error
> paths.
> I haven't thought it through yet, but I wouldn't be surprised entirely if there are
> cases where you simply run into circular dependencies.

Makes sense. I can't see through this without proof-of-concept code either.

> 
> Also, looking at the conversation in the linked patch series:
> 
> <snip>
> 
> >> For example as hmm_range_fault brings a range from host into GPU address
> >> space,  what if it was already allocated and in use by VM_BIND for
> >> a GEM_CREATE allocated buffer?    That is of course application error,
> >> but KMD needs to detect it, and provide one single managed address
> >> space across all allocations from the application....
> 
> > This is very good question. Yes agree we should check this application error.
> Fortunately this is doable. All vm_bind virtual address range are tracked in
> xe_vm/drm_gpuvm struct. In this case, we should iterate the drm_gpuvm's rb
> tree of *all* gpu devices (as xe_vm is for one device only) to see whether there
> is a conflict. Will make the change soon.
> 
> <snip>
> 
> How do you do that if xe_vm->gpuvm is just a pointer to the GPUVM structure
> within xe_svm?

In the proposed approach, we have a single drm_gpuvm instance for one process. Each device's xe_vm points to this drm_gpuvm instance. This drm_gpuvm's rb tree maintains all the VA ranges we have in this process. We can just walk this rb tree to see if there is a conflict.

But I didn't answer Brian's question completely... In a mixed use of vm_bind and malloc/mmap, the virtual address used by vm_bind should first be reserved in user space using mmap. So all valid virtual addresses should be tracked by a Linux kernel vm_area_struct.

Both vm_bind'ed and malloc'ed virtual addresses can cause a GPU page fault. Our fault handler should first check whether the faulting address is a vm_bind VA and service the fault accordingly; if not, serve the fault in the SVM path; if the SVM path also fails, it is an invalid address. So from the user's perspective, both of the following work:

    ptr = mmap()
    vm_bind(ptr, bo)
    submit gpu kernel using ptr

or:

    ptr = mmap()
    submit gpu kernel using ptr

Whether vm_bind was called or not decides the GPU fault handler code path; a sketch of the dispatch follows. Hopefully this answers @Welty, Brian's original question.
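In pseudo-C, with every name below hypothetical (vm_bind_range_lookup, handle_vm_bind_fault, svm_range_valid, handle_svm_fault, and the vm->svm back-pointer), the dispatch would be:

static int gpu_fault_handler_sketch(struct xe_vm *vm, u64 addr)
{
        if (vm_bind_range_lookup(vm, addr))      /* vm_bind'ed VA? */
                return handle_vm_bind_fault(vm, addr);

        if (svm_range_valid(vm->svm, addr))      /* malloc/mmap'ed VA? */
                return handle_svm_fault(vm->svm, addr);

        return -EFAULT;  /* neither vm_bind nor SVM: invalid address */
}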


> 
> >
> > This requires some changes of drm_gpuvm design:
> > 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL,
> means this drm_gpuvm is not for specific gpu device
> > 2. The common dma_resv object: drm_gem_object *r_obj. *Does one
> dma_resv object allocated/initialized for one device work for all devices*? From
> first look, dma_resv is just some CPU structure maintaining dma-fences. So I
> guess it should work for all devices? I definitely need you to comment.
> 
> The general rule is that drivers can share the common dma_resv across GEM
> objects that
> are only mapped within the VM owning the dma_resv, but never within another
> VM.
> 
> Now, your question is whether multiple VMs can share the same common
> dma_resv. I think
> that calls for trouble, since it would create dependencies that simply aren't
> needed
> and might even introduce locking issues.
> 
> However, that's optional, you can simply decide to not make use of the common
> dma_resv
> and all the optimizations based on it.

Ok, got it.
> 
> >
> >
> >>
> >> It's up to you how to implement it, but I think it's pretty clear that
> >> you need separate drm_gpuvm objects to manage those.
> >
> > As explained above, I am thinking of one drm_gpuvm object across all devices
> when SVM is in the picture...
> >
> >>
> >> That you map the same thing in all those virtual address spaces at the
> >> same address is a completely different optimization problem I think.
> >
> > Not sure I follow here... the requirement from SVM is, one virtual address
> points to same physical backing store. For example, whenever CPU or any GPU
> device access this virtual address, it refers to the same physical content. Of
> course the physical backing store can be migrated b/t host memory and any of
> the GPU's device memory, but the content should be consistent.
> 
> Technically, multiple different GPUs will have separate virtual address spaces, it's
> just that you create mappings within all of them such that the same virtual
> address
> resolves to the same physical content on all of them.
> 
> So, having a single GPUVM instance representing all of them might give the
> illusion of
> a single unified address space, but you still need to maintain each device's
> address
> space backing resources, such as page tables, separately.

Yes agreed.

Regards,
Oak
> 
> - Danilo
> 
> >
> > So we are mapping same physical content to the same virtual address in either
> cpu page table or any gpu device's page table...
> >
> >> What we could certainly do is to optimize hmm_range_fault by making
> >> hmm_range a reference counted object and using it for multiple devices
> >> at the same time if those devices request the same range of an mm_struct.
> >>
> >
> > Not very follow. If you are trying to resolve a multiple devices concurrent access
> problem, I think we should serialize concurrent device fault to one address range.
> The reason is, during device fault handling, we might migrate the backing store so
> hmm_range->hmm_pfns[] might have changed after one device access it.
> >
> >> I think if you start using the same drm_gpuvm for multiple devices you
> >> will sooner or later start to run into the same mess we have seen with
> >> KFD, where we moved more and more functionality from the KFD to the DRM
> >> render node because we found that a lot of the stuff simply doesn't work
> >> correctly with a single object to maintain the state.
> >
> > As I understand it, KFD is designed to work across devices. A single pseudo
> /dev/kfd device represent all hardware gpu devices. That is why during kfd open,
> many pdd (process device data) is created, each for one hardware device for this
> process. Yes the codes are a little complicated.
> >
> > Kfd manages the shared virtual address space in the kfd driver codes, like the
> split, merging etc. Here I am looking whether we can leverage the drm_gpuvm
> code for those functions.
> >
> > As of the shared virtual address space across gpu devices, it is a hard
> requirement for svm/system allocator (aka malloc for gpu program). We need to
> make it work either at driver level or drm_gpuvm level. Drm_gpuvm is better
> because the work can be shared b/t drivers.
> >
> > Thanks a lot,
> > Oak
> >
> >>
> >> Just one more point to your original discussion on the xe list: I think
> >> it's perfectly valid for an application to map something at the same
> >> address you already have something else.
> >>
> >> Cheers,
> >> Christian.
> >>
> >>>
> >>> Thanks,
> >>> Oak
> >



* RE: Making drm_gpuvm work across gpu devices
  2024-01-24  3:57               ` Zeng, Oak
@ 2024-01-24  4:14                 ` Zeng, Oak
  2024-01-24  6:48                   ` Christian König
  2024-01-25 22:13                 ` Danilo Krummrich
  1 sibling, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-24  4:14 UTC (permalink / raw)
  To: Zeng, Oak, Danilo Krummrich, Christian König, Dave Airlie,
	Daniel Vetter, Felix Kuehling, Welty, Brian
  Cc: Brost, Matthew, Thomas.Hellstrom, dri-devel, Ghimiray,
	Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

Danilo,

Maybe before I give up, I should also ask: currently drm_gpuvm is designed for a BO-centric world. Would it be easy to make the VA range split/merge work purely on VA ranges, without a BO? Conceptually this should work, since we are merging/splitting virtual address ranges, which can be decoupled completely from BOs.

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Zeng,
> Oak
> Sent: Tuesday, January 23, 2024 10:57 PM
> To: Danilo Krummrich <dakr@redhat.com>; Christian König
> <christian.koenig@amd.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian
> <brian.welty@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>;
> Thomas.Hellstrom@linux.intel.com; dri-devel@lists.freedesktop.org; Ghimiray,
> Himal Prasad <himal.prasad.ghimiray@intel.com>; Gupta, saurabhg
> <saurabhg.gupta@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org
> Subject: RE: Making drm_gpuvm work across gpu devices
> 
> Thanks a lot Danilo.
> 
> Maybe I wasn't clear enough. In the solution I proposed, each device still have
> separate vm/page tables. Each device still need to manage the mapping, page
> table flags etc. It is just in svm use case, all devices share one drm_gpuvm
> instance. As I understand it, drm_gpuvm's main function is the va range split and
> merging. I don't see why it doesn't work across gpu devices.
> 
> But I read more about drm_gpuvm. Its split merge function takes a
> drm_gem_object parameter, see drm_gpuvm_sm_map_ops_create and
> drm_gpuvm_sm_map. Actually the whole drm_gpuvm is designed for BO-centric
> driver, for example, it has a drm_gpuvm_bo concept to keep track of the
> 1BO:Ngpuva mapping. The whole purpose of leveraging drm_gpuvm is to re-use
> the va split/merge functions for SVM. But in our SVM implementation, there is no
> buffer object at all. So I don't think our SVM codes can leverage drm_gpuvm.
> 
> I will give up this approach, unless Matt or Brian can see a way.
> 
> A few replies inline.... @Welty, Brian I had more thoughts inline to one of your
> original question....
> 
> > -----Original Message-----
> > From: Danilo Krummrich <dakr@redhat.com>
> > Sent: Tuesday, January 23, 2024 6:57 PM
> > To: Zeng, Oak <oak.zeng@intel.com>; Christian König
> > <christian.koenig@amd.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
> > <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
> > Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org;
> intel-
> > xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
> > Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> > Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> > <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
> > Subject: Re: Making drm_gpuvm work across gpu devices
> >
> > Hi Oak,
> >
> > On 1/23/24 20:37, Zeng, Oak wrote:
> > > Thanks Christian. I have some comment inline below.
> > >
> > > Danilo, can you also take a look and give your feedback? Thanks.
> >
> > I agree with everything Christian already wrote. Except for the KFD parts, which
> > I'm simply not familiar with, I had exactly the same thoughts after reading your
> > initial mail.
> >
> > Please find some more comments below.
> >
> > >
> > >> -----Original Message-----
> > >> From: Christian König <christian.koenig@amd.com>
> > >> Sent: Tuesday, January 23, 2024 6:13 AM
> > >> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich
> <dakr@redhat.com>;
> > >> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
> > >> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org;
> > intel-
> > >> xe@lists.freedesktop.org; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>;
> > >> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> > >> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> > >> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> > >> <matthew.brost@intel.com>
> > >> Subject: Re: Making drm_gpuvm work across gpu devices
> > >>
> > >> Hi Oak,
> > >>
> > >> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> > >>> Hi Danilo and all,
> > >>>
> > >>> During the work of Intel's SVM code, we came up the idea of making
> > >> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> > >> https://lore.kernel.org/dri-
> > >>
> >
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> > >> 11.prod.outlook.com/
> > >>>
> > >>> The reason we try to do this is, for a SVM (shared virtual memory across
> cpu
> > >> program and all gpu program on all gpu devices) process, the address space
> > has
> > >> to be across all gpu devices. So if we make drm_gpuvm to work across
> devices,
> > >> then our SVM code can leverage drm_gpuvm as well.
> > >>>
> > >>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
> > >> drm_device *drm pointer a lot. This param is used only for printing/warning.
> > So I
> > >> think maybe we can delete this drm field from drm_gpuvm.
> > >>>
> > >>> This way, on a multiple gpu device system, for one process, we can have
> only
> > >> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one
> for
> > >> each gpu device).
> > >>>
> > >>> What do you think?
> > >>
> > >> Well from the GPUVM side I don't think it would make much difference if
> > >> we have the drm device or not.
> > >>
> > >> But the experience we had with the KFD I think I should mention that we
> > >> should absolutely *not* deal with multiple devices at the same time in
> > >> the UAPI or VM objects inside the driver.
> > >>
> > >> The background is that all the APIs inside the Linux kernel are build
> > >> around the idea that they work with only one device at a time. This
> > >> accounts for both low level APIs like the DMA API as well as pretty high
> > >> level things like for example file system address space etc...
> > >
> > > Yes most API are per device based.
> > >
> > > One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
> > function, it is per-process based. Each kfd_process represent a process across N
> > gpu devices. Cc Felix.
> > >
> > > Need to say, kfd SVM represent a shared virtual address space across CPU
> and
> > all GPU devices on the system. This is by the definition of SVM (shared virtual
> > memory). This is very different from our legacy gpu *device* driver which
> works
> > for only one device (i.e., if you want one device to access another device's
> > memory, you will have to use dma-buf export/import etc).
> > >
> > > We have the same design requirement of SVM. For anyone who want to
> > implement the SVM concept, this is a hard requirement. Since now drm has the
> > drm_gpuvm concept which strictly speaking is designed for one device, I want
> to
> > see whether we can extend drm_gpuvm to make it work for both single device
> > (as used in xe) and multipe devices (will be used in the SVM code). That is why I
> > brought up this topic.
> > >
> > >>
> > >> So when you have multiple GPUs you either have an inseparable cluster of
> > >> them which case you would also only have one drm_device. Or you have
> > >> separated drm_device which also results in separate drm render nodes and
> > >> separate virtual address spaces and also eventually separate IOMMU
> > >> domains which gives you separate dma_addresses for the same page and so
> > >> separate GPUVM page tables....
> > >
> > > I am thinking we can still make each device has its separate
> drm_device/render
> > node/iommu domains/gpu page table. Just as what we have today. I am not
> plan
> > to change this picture.
> > >
> > > But the virtual address space will support two modes of operation:
> > > 1. one drm_gpuvm per device. This is when svm is not in the picture
> > > 2. all devices in the process share one single drm_gpuvm, when svm is in the
> > picture. In xe driver design, we have to support a mixture use of legacy mode
> > (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu
> > submission). So whenever SVM is in the picture, we want one single process
> > address space across all devices. Drm_gpuvm doesn't need to be aware of
> those
> > two operation modes. It is driver's responsibility to use different mode.
> > >
> > > For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit
> > from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series:
> > https://lore.kernel.org/dri-devel/20240117221223.18540-1-
> oak.zeng@intel.com/)
> > can inherit from drm_gpuvm while each xe_vm (still a per-device based struct)
> > will just have a pointer to the drm_gpuvm structure. This way when svm is in
> play,
> > we build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means
> > shared address space across gpu devices.
> >
> > With a shared GPUVM structure, how do you track actual per device resources
> > such as
> > page tables? You also need to consider that the page table layout, memory
> > mapping
> > flags may vary from device to device due to different GPU chipsets or revisions.
> 
> The per device page table, flags etc are still managed per-device based, which is
> the xe_vm in the xekmd driver.
> 
> >
> > Also, if you replace the shared GPUVM structure with a pointer to a shared one,
> > you may run into all kinds of difficulties due to increasing complexity in terms
> > of locking, synchronization, lifetime and potential unwind operations in error
> > paths.
> > I haven't thought it through yet, but I wouldn't be surprised entirely if there are
> > cases where you simply run into circular dependencies.
> 
> Make sense, I can't see through this without a prove of concept code either.
> 
> >
> > Also, looking at the conversation in the linked patch series:
> >
> > <snip>
> >
> > >> For example as hmm_range_fault brings a range from host into GPU address
> > >> space,  what if it was already allocated and in use by VM_BIND for
> > >> a GEM_CREATE allocated buffer?    That is of course application error,
> > >> but KMD needs to detect it, and provide one single managed address
> > >> space across all allocations from the application....
> >
> > > This is very good question. Yes agree we should check this application error.
> > Fortunately this is doable. All vm_bind virtual address range are tracked in
> > xe_vm/drm_gpuvm struct. In this case, we should iterate the drm_gpuvm's rb
> > tree of *all* gpu devices (as xe_vm is for one device only) to see whether
> there
> > is a conflict. Will make the change soon.
> >
> > <snip>
> >
> > How do you do that if xe_vm->gpuvm is just a pointer to the GPUVM structure
> > within xe_svm?
> 
> In the proposed approach, we have a single drm_gpuvm instance for one process.
> All device's xe_vm pointing to this drm_gpuvm instance. This drm_gpuvm's rb
> tree maintains all the va range we have in this process. We can just walk this rb
> tree to see if there is a conflict.
> 
> But I didn't answer Brian's question completely... In a mixed use of vm_bind and
> malloc/mmap, the virtual address used by vm_bind should first be reserved in
> user space using mmap. So all valid virtual address should be tracked by linux
> kernel vma_struct.
> 
> Both vm_bind and malloc'ed virtual address can cause a gpu page fault. Our fault
> handler should first see whether this is a vm_bind va and service the fault
> accordingly; if not, then serve the fault in the SVM path; if SVM path also failed, it
> is an invalid address. So from user perspective, user can use:
> Ptr = mmap()
> Vm_bind(ptr, bo)
> Submit gpu kernel using ptr
> Or:
> Ptr = mmap()
> Submit gpu kernel using ptr
> Whether vm_bind is called or not decides the gpu fault handler code path.
> Hopefully this answers @Welty, Brian's original question
> 
> 
> >
> > >
> > > This requires some changes of drm_gpuvm design:
> > > 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL,
> > means this drm_gpuvm is not for specific gpu device
> > > 2. The common dma_resv object: drm_gem_object *r_obj. *Does one
> > dma_resv object allocated/initialized for one device work for all devices*? From
> > first look, dma_resv is just some CPU structure maintaining dma-fences. So I
> > guess it should work for all devices? I definitely need you to comment.
> >
> > The general rule is that drivers can share the common dma_resv across GEM
> > objects that
> > are only mapped within the VM owning the dma_resv, but never within
> another
> > VM.
> >
> > Now, your question is whether multiple VMs can share the same common
> > dma_resv. I think
> > that calls for trouble, since it would create dependencies that simply aren't
> > needed
> > and might even introduce locking issues.
> >
> > However, that's optional, you can simply decide to not make use of the
> common
> > dma_resv
> > and all the optimizations based on it.
> 
> Ok, got it.
> >
> > >
> > >
> > >>
> > >> It's up to you how to implement it, but I think it's pretty clear that
> > >> you need separate drm_gpuvm objects to manage those.
> > >
> > > As explained above, I am thinking of one drm_gpuvm object across all devices
> > when SVM is in the picture...
> > >
> > >>
> > >> That you map the same thing in all those virtual address spaces at the
> > >> same address is a completely different optimization problem I think.
> > >
> > > Not sure I follow here... the requirement from SVM is, one virtual address
> > points to same physical backing store. For example, whenever CPU or any GPU
> > device access this virtual address, it refers to the same physical content. Of
> > course the physical backing store can be migrated b/t host memory and any of
> > the GPU's device memory, but the content should be consistent.
> >
> > Technically, multiple different GPUs will have separate virtual address spaces,
> it's
> > just that you create mappings within all of them such that the same virtual
> > address
> > resolves to the same physical content on all of them.
> >
> > So, having a single GPUVM instance representing all of them might give the
> > illusion of
> > a single unified address space, but you still need to maintain each device's
> > address
> > space backing resources, such as page tables, separately.
> 
> Yes agreed.
> 
> Regards,
> Oak
> >
> > - Danilo
> >
> > >
> > > So we are mapping same physical content to the same virtual address in
> either
> > cpu page table or any gpu device's page table...
> > >
> > >> What we could certainly do is to optimize hmm_range_fault by making
> > >> hmm_range a reference counted object and using it for multiple devices
> > >> at the same time if those devices request the same range of an mm_struct.
> > >>
> > >
> > > Not very follow. If you are trying to resolve a multiple devices concurrent
> access
> > problem, I think we should serialize concurrent device fault to one address
> range.
> > The reason is, during device fault handling, we might migrate the backing store
> so
> > hmm_range->hmm_pfns[] might have changed after one device access it.
> > >
> > >> I think if you start using the same drm_gpuvm for multiple devices you
> > >> will sooner or later start to run into the same mess we have seen with
> > >> KFD, where we moved more and more functionality from the KFD to the
> DRM
> > >> render node because we found that a lot of the stuff simply doesn't work
> > >> correctly with a single object to maintain the state.
> > >
> > > As I understand it, KFD is designed to work across devices. A single pseudo
> > /dev/kfd device represent all hardware gpu devices. That is why during kfd
> open,
> > many pdd (process device data) is created, each for one hardware device for
> this
> > process. Yes the codes are a little complicated.
> > >
> > > Kfd manages the shared virtual address space in the kfd driver codes, like the
> > split, merging etc. Here I am looking whether we can leverage the drm_gpuvm
> > code for those functions.
> > >
> > > As of the shared virtual address space across gpu devices, it is a hard
> > requirement for svm/system allocator (aka malloc for gpu program). We need
> to
> > make it work either at driver level or drm_gpuvm level. Drm_gpuvm is better
> > because the work can be shared b/t drivers.
> > >
> > > Thanks a lot,
> > > Oak
> > >
> > >>
> > >> Just one more point to your original discussion on the xe list: I think
> > >> it's perfectly valid for an application to map something at the same
> > >> address you already have something else.
> > >>
> > >> Cheers,
> > >> Christian.
> > >>
> > >>>
> > >>> Thanks,
> > >>> Oak
> > >



* Re: Making drm_gpuvm work across gpu devices
  2024-01-24  4:14                 ` Zeng, Oak
@ 2024-01-24  6:48                   ` Christian König
  0 siblings, 0 replies; 123+ messages in thread
From: Christian König @ 2024-01-24  6:48 UTC (permalink / raw)
  To: Zeng, Oak, Danilo Krummrich, Dave Airlie, Daniel Vetter,
	Felix Kuehling, Welty, Brian
  Cc: Brost, Matthew, Thomas.Hellstrom, dri-devel, Ghimiray,
	Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

Am 24.01.24 um 05:14 schrieb Zeng, Oak:
> Danilo,
>
> Maybe before I give up, I should also ask: currently drm_gpuvm is designed for a BO-centric world. Is it easy to make the va range split/merge work simply on va ranges, without a BO? Conceptually this should work, as we are merging/splitting virtual address ranges, which can be decoupled completely from BOs.

At least AMD GPUs have a similar requirement to manage virtual ranges 
which are not backed by a BO, for example PRT ranges.

I expect that we can still use drm_gpuvm for this and the BO is simply 
NULL in that case.
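
A minimal sketch of what that could look like, assuming the existing
drm_gpuva/drm_gpuva_insert() primitives; the surrounding function and its
error handling are illustrative, not actual driver code:

/* Sketch: insert a VA range with no BO backing it, e.g. a PRT range. */
static int map_unbacked_range(struct drm_gpuvm *gpuvm, u64 addr, u64 range)
{
        struct drm_gpuva *va = kzalloc(sizeof(*va), GFP_KERNEL);

        if (!va)
                return -ENOMEM;

        va->va.addr = addr;
        va->va.range = range;
        va->gem.obj = NULL;     /* no buffer object behind this mapping */

        return drm_gpuva_insert(gpuvm, va);
}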

Regards,
Christian.

>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Zeng,
>> Oak
>> Sent: Tuesday, January 23, 2024 10:57 PM
>> To: Danilo Krummrich <dakr@redhat.com>; Christian König
>> <christian.koenig@amd.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
>> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian
>> <brian.welty@intel.com>
>> Cc: Brost, Matthew <matthew.brost@intel.com>;
>> Thomas.Hellstrom@linux.intel.com; dri-devel@lists.freedesktop.org; Ghimiray,
>> Himal Prasad <himal.prasad.ghimiray@intel.com>; Gupta, saurabhg
>> <saurabhg.gupta@intel.com>; Bommu, Krishnaiah
>> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
>> <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org
>> Subject: RE: Making drm_gpuvm work across gpu devices
>>
>> Thanks a lot Danilo.
>>
>> Maybe I wasn't clear enough. In the solution I proposed, each device still has
>> separate vm/page tables. Each device still needs to manage its mappings, page
>> table flags etc. It is just that in the svm use case, all devices share one drm_gpuvm
>> instance. As I understand it, drm_gpuvm's main function is va range split and
>> merging. I don't see why it wouldn't work across gpu devices.
>>
>> But I read more about drm_gpuvm. Its split/merge functions take a
>> drm_gem_object parameter; see drm_gpuvm_sm_map_ops_create and
>> drm_gpuvm_sm_map. Actually the whole of drm_gpuvm is designed for BO-centric
>> drivers; for example, it has a drm_gpuvm_bo concept to keep track of the
>> 1 BO : N gpuva mapping. The whole purpose of leveraging drm_gpuvm was to re-use
>> the va split/merge functions for SVM. But in our SVM implementation, there is no
>> buffer object at all, so I don't think our SVM code can leverage drm_gpuvm.
>>
>> I will give up this approach, unless Matt or Brian can see a way.
>>
>> A few replies inline... @Welty, Brian, I had more thoughts inline on one of your
>> original questions...
>>
>>> -----Original Message-----
>>> From: Danilo Krummrich <dakr@redhat.com>
>>> Sent: Tuesday, January 23, 2024 6:57 PM
>>> To: Zeng, Oak <oak.zeng@intel.com>; Christian König
>>> <christian.koenig@amd.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
>>> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
>>> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org;
>> intel-
>>> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
>>> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>>> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
>>> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
>>> <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
>>> Subject: Re: Making drm_gpuvm work across gpu devices
>>>
>>> Hi Oak,
>>>
>>> On 1/23/24 20:37, Zeng, Oak wrote:
>>>> Thanks Christian. I have some comment inline below.
>>>>
>>>> Danilo, can you also take a look and give your feedback? Thanks.
>>> I agree with everything Christian already wrote. Except for the KFD parts, which
>>> I'm simply not familiar with, I had exactly the same thoughts after reading your
>>> initial mail.
>>>
>>> Please find some more comments below.
>>>
>>>>> -----Original Message-----
>>>>> From: Christian König <christian.koenig@amd.com>
>>>>> Sent: Tuesday, January 23, 2024 6:13 AM
>>>>> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich
>> <dakr@redhat.com>;
>>>>> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
>>>>> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org;
>>> intel-
>>>>> xe@lists.freedesktop.org; Bommu, Krishnaiah
>>> <krishnaiah.bommu@intel.com>;
>>>>> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>>>>> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
>>>>> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
>>>>> <matthew.brost@intel.com>
>>>>> Subject: Re: Making drm_gpuvm work across gpu devices
>>>>>
>>>>> Hi Oak,
>>>>>
>>>>> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
>>>>>> Hi Danilo and all,
>>>>>>
>>>>>> During the work of Intel's SVM code, we came up the idea of making
>>>>> drm_gpuvm to work across multiple gpu devices. See some discussion here:
>>>>> https://lore.kernel.org/dri-
>>>>>
>> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
>>>>> 11.prod.outlook.com/
>>>>>> The reason we try to do this is, for a SVM (shared virtual memory across
>> cpu
>>>>> program and all gpu program on all gpu devices) process, the address space
>>> has
>>>>> to be across all gpu devices. So if we make drm_gpuvm to work across
>> devices,
>>>>> then our SVM code can leverage drm_gpuvm as well.
>>>>>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
>>>>> drm_device *drm pointer a lot. This param is used only for printing/warning.
>>> So I
>>>>> think maybe we can delete this drm field from drm_gpuvm.
>>>>>> This way, on a multiple gpu device system, for one process, we can have
>> only
>>>>> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one
>> for
>>>>> each gpu device).
>>>>>> What do you think?
>>>>> Well from the GPUVM side I don't think it would make much difference if
>>>>> we have the drm device or not.
>>>>>
>>>>> But the experience we had with the KFD I think I should mention that we
>>>>> should absolutely *not* deal with multiple devices at the same time in
>>>>> the UAPI or VM objects inside the driver.
>>>>>
>>>>> The background is that all the APIs inside the Linux kernel are build
>>>>> around the idea that they work with only one device at a time. This
>>>>> accounts for both low level APIs like the DMA API as well as pretty high
>>>>> level things like for example file system address space etc...
>>>> Yes most API are per device based.
>>>>
>>>> One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
>>> function, it is per-process based. Each kfd_process represent a process across N
>>> gpu devices. Cc Felix.
>>>> Need to say, kfd SVM represent a shared virtual address space across CPU
>> and
>>> all GPU devices on the system. This is by the definition of SVM (shared virtual
>>> memory). This is very different from our legacy gpu *device* driver which
>> works
>>> for only one device (i.e., if you want one device to access another device's
>>> memory, you will have to use dma-buf export/import etc).
>>>> We have the same design requirement of SVM. For anyone who want to
>>> implement the SVM concept, this is a hard requirement. Since now drm has the
>>> drm_gpuvm concept which strictly speaking is designed for one device, I want
>> to
>>> see whether we can extend drm_gpuvm to make it work for both single device
>>> (as used in xe) and multiple devices (will be used in the SVM code). That is why I
>>> brought up this topic.
>>>>> So when you have multiple GPUs you either have an inseparable cluster of
>>>>> them which case you would also only have one drm_device. Or you have
>>>>> separated drm_device which also results in separate drm render nodes and
>>>>> separate virtual address spaces and also eventually separate IOMMU
>>>>> domains which gives you separate dma_addresses for the same page and so
>>>>> separate GPUVM page tables....
>>>> I am thinking we can still make each device has its separate
>> drm_device/render
>>> node/iommu domains/gpu page table. Just as what we have today. I am not
>> plan
>>> to change this picture.
>>>> But the virtual address space will support two modes of operation:
>>>> 1. one drm_gpuvm per device. This is when svm is not in the picture
>>>> 2. all devices in the process share one single drm_gpuvm, when svm is in the
>>> picture. In xe driver design, we have to support a mixture use of legacy mode
>>> (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu
>>> submission). So whenever SVM is in the picture, we want one single process
>>> address space across all devices. Drm_gpuvm doesn't need to be aware of
>> those
>>> two operation modes. It is driver's responsibility to use different mode.
>>>> For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit
>>> from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series:
>>> https://lore.kernel.org/dri-devel/20240117221223.18540-1-
>> oak.zeng@intel.com/)
>>> can inherit from drm_gpuvm while each xe_vm (still a per-device based struct)
>>> will just have a pointer to the drm_gpuvm structure. This way when svm is in
>> play,
>>> we build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means
>>> shared address space across gpu devices.
>>>
>>> With a shared GPUVM structure, how do you track actual per device resources
>>> such as
>>> page tables? You also need to consider that the page table layout, memory
>>> mapping
>>> flags may vary from device to device due to different GPU chipsets or revisions.
>> The per-device page tables, flags etc. are still managed per device, via the
>> xe_vm in the xekmd driver.
>>
>>> Also, if you replace the shared GPUVM structure with a pointer to a shared one,
>>> you may run into all kinds of difficulties due to increasing complexity in terms
>>> of locking, synchronization, lifetime and potential unwind operations in error
>>> paths.
>>> I haven't thought it through yet, but I wouldn't be surprised entirely if there are
>>> cases where you simply run into circular dependencies.
>> Makes sense. I can't see through this without proof-of-concept code either.
>>
>>> Also, looking at the conversation in the linked patch series:
>>>
>>> <snip>
>>>
>>>>> For example as hmm_range_fault brings a range from host into GPU address
>>>>> space,  what if it was already allocated and in use by VM_BIND for
>>>>> a GEM_CREATE allocated buffer?    That is of course application error,
>>>>> but KMD needs to detect it, and provide one single managed address
>>>>> space across all allocations from the application....
>>>> This is a very good question. Yes, agreed, we should check for this application error.
>>> Fortunately this is doable. All vm_bind virtual address ranges are tracked in the
>>> xe_vm/drm_gpuvm struct. In this case, we should iterate the drm_gpuvm rb
>>> tree of *all* gpu devices (as an xe_vm is for one device only) to see whether there
>>> is a conflict. Will make the change soon.
>>>
>>> <snip>
>>>
>>> How do you do that if xe_vm->gpuvm is just a pointer to the GPUVM structure
>>> within xe_svm?
>> In the proposed approach, we have a single drm_gpuvm instance for one process.
>> All devices' xe_vm structs point to this drm_gpuvm instance. This drm_gpuvm's rb
>> tree maintains all the va ranges we have in this process. We can just walk this rb
>> tree to see if there is a conflict.
>>
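A sketch of that conflict check, assuming the shared drm_gpuvm tracks every
va range of the process; drm_gpuva_find_first() is the existing drm_gpuvm
range lookup, the wrapper function is illustrative:

/* Reject a new mapping if any existing gpuva overlaps [addr, addr + range). */
static bool range_is_free(struct drm_gpuvm *gpuvm, u64 addr, u64 range)
{
        return drm_gpuva_find_first(gpuvm, addr, range) == NULL;
}
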
>> But I didn't answer Brian's question completely... In a mixed use of vm_bind and
>> malloc/mmap, the virtual addresses used by vm_bind should first be reserved in
>> user space using mmap. So all valid virtual addresses are tracked by the linux
>> kernel's vma structs.
>>
>> Both vm_bind'ed and malloc'ed virtual addresses can cause a gpu page fault. Our
>> fault handler should first check whether the faulting address is a vm_bind va and
>> service the fault accordingly; if not, serve the fault via the SVM path; if the SVM
>> path also fails, it is an invalid address. So from the user's perspective, the usage is:
>> Ptr = mmap()
>> Vm_bind(ptr, bo)
>> Submit gpu kernel using ptr
>> Or:
>> Ptr = mmap()
>> Submit gpu kernel using ptr
>> Whether vm_bind was called or not decides the gpu fault handler code path (see
>> the sketch just below). Hopefully this answers @Welty, Brian's original question.
>>
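A sketch of that dispatch; the helper names are hypothetical stand-ins for
the real xekmd lookups, not actual functions:

/* Gpu page fault dispatch: try the vm_bind path first, then fall back
 * to the SVM path; an address neither path knows is invalid. */
static int gpu_fault_handler(struct xe_vm *vm, u64 fault_addr)
{
        if (vm_bind_va_lookup(vm, fault_addr))          /* hypothetical */
                return handle_bound_fault(vm, fault_addr);

        if (!svm_handle_fault(vm->svm, fault_addr))     /* hypothetical */
                return 0;

        return -EFAULT;
}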
>>
>>>> This requires some changes of drm_gpuvm design:
>>>> 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL,
>>> means this drm_gpuvm is not for specific gpu device
>>>> 2. The common dma_resv object: drm_gem_object *r_obj. *Does one
>>> dma_resv object allocated/initialized for one device work for all devices*? From
>>> first look, dma_resv is just some CPU structure maintaining dma-fences. So I
>>> guess it should work for all devices? I definitely need you to comment.
>>>
>>> The general rule is that drivers can share the common dma_resv across GEM
>>> objects that
>>> are only mapped within the VM owning the dma_resv, but never within
>> another
>>> VM.
>>>
>>> Now, your question is whether multiple VMs can share the same common
>>> dma_resv. I think
>>> that calls for trouble, since it would create dependencies that simply aren't
>>> needed
>>> and might even introduce locking issues.
>>>
>>> However, that's optional, you can simply decide to not make use of the
>> common
>>> dma_resv
>>> and all the optimizations based on it.
>> Ok, got it.
>>>>
>>>>> It's up to you how to implement it, but I think it's pretty clear that
>>>>> you need separate drm_gpuvm objects to manage those.
>>>> As explained above, I am thinking of one drm_gpuvm object across all devices
>>> when SVM is in the picture...
>>>>> That you map the same thing in all those virtual address spaces at the
>>>>> same address is a completely different optimization problem I think.
>>>> Not sure I follow here... the requirement from SVM is, one virtual address
>>> points to same physical backing store. For example, whenever CPU or any GPU
>>> device access this virtual address, it refers to the same physical content. Of
>>> course the physical backing store can be migrated b/t host memory and any of
>>> the GPU's device memory, but the content should be consistent.
>>>
>>> Technically, multiple different GPUs will have separate virtual address spaces,
>> it's
>>> just that you create mappings within all of them such that the same virtual
>>> address
>>> resolves to the same physical content on all of them.
>>>
>>> So, having a single GPUVM instance representing all of them might give the
>>> illusion of
>>> a single unified address space, but you still need to maintain each device's
>>> address
>>> space backing resources, such as page tables, separately.
>> Yes agreed.
>>
>> Regards,
>> Oak
>>> - Danilo
>>>
>>>> So we are mapping same physical content to the same virtual address in
>> either
>>> cpu page table or any gpu device's page table...
>>>>> What we could certainly do is to optimize hmm_range_fault by making
>>>>> hmm_range a reference counted object and using it for multiple devices
>>>>> at the same time if those devices request the same range of an mm_struct.
>>>>>
>>>> Not very follow. If you are trying to resolve a multiple devices concurrent
>> access
>>> problem, I think we should serialize concurrent device fault to one address
>> range.
>>> The reason is, during device fault handling, we might migrate the backing store
>> so
>>> hmm_range->hmm_pfns[] might have changed after one device access it.
>>>>> I think if you start using the same drm_gpuvm for multiple devices you
>>>>> will sooner or later start to run into the same mess we have seen with
>>>>> KFD, where we moved more and more functionality from the KFD to the
>> DRM
>>>>> render node because we found that a lot of the stuff simply doesn't work
>>>>> correctly with a single object to maintain the state.
>>>> As I understand it, KFD is designed to work across devices. A single pseudo
>>> /dev/kfd device represent all hardware gpu devices. That is why during kfd
>> open,
>>> many pdd (process device data) is created, each for one hardware device for
>> this
>>> process. Yes the codes are a little complicated.
>>>> Kfd manages the shared virtual address space in the kfd driver codes, like the
>>> split, merging etc. Here I am looking whether we can leverage the drm_gpuvm
>>> code for those functions.
>>>> As of the shared virtual address space across gpu devices, it is a hard
>>> requirement for svm/system allocator (aka malloc for gpu program). We need
>> to
>>> make it work either at driver level or drm_gpuvm level. Drm_gpuvm is better
>>> because the work can be shared b/t drivers.
>>>> Thanks a lot,
>>>> Oak
>>>>
>>>>> Just one more point to your original discussion on the xe list: I think
>>>>> it's perfectly valid for an application to map something at the same
>>>>> address you already have something else.
>>>>>
>>>>> Cheers,
>>>>> Christian.
>>>>>
>>>>>> Thanks,
>>>>>> Oak



* Re: Making drm_gpuvm work across gpu devices
  2024-01-23 19:37           ` Zeng, Oak
  2024-01-23 20:17             ` Felix Kuehling
  2024-01-23 23:56             ` Danilo Krummrich
@ 2024-01-24  8:33             ` Christian König
  2024-01-25  1:17               ` Zeng, Oak
                                 ` (3 more replies)
  2 siblings, 4 replies; 123+ messages in thread
From: Christian König @ 2024-01-24  8:33 UTC (permalink / raw)
  To: Zeng, Oak, Danilo Krummrich, Dave Airlie, Daniel Vetter, Felix Kuehling
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe


Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> [SNIP]
> Yes most API are per device based.
>
> One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do 
that ever again.

> Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found to be a blocker for 
virtualization projects. Having SVM as a device-independent feature which 
somehow ties to the process address space turned out to be an extremely 
bad idea.

The background is that this only works for some use cases but not all of 
them.

What's working much better is to just have a mirror functionality which 
says that a range A..B of the process address space is mapped into a 
range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for 
higher level APIs and not something you need at the UAPI or even inside 
the low level kernel memory management.
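
A sketch of that mirror model; the structures below are made up for
illustration, nothing here is an existing kernel API:

/* An explicit mirror of CPU range [cpu_start, cpu_start + size) into
 * GPU range [gpu_start, gpu_start + size). */
struct mirror_range {
        u64 cpu_start;  /* start of the mirrored process address range */
        u64 gpu_start;  /* where that range lives in this device's VA space */
        u64 size;
};

/* Translate a CPU address into the per-device GPU address. */
static u64 mirror_cpu_to_gpu(const struct mirror_range *m, u64 cpu_addr)
{
        return m->gpu_start + (cpu_addr - m->cpu_start);
}

The SVM behaviour required by the higher level APIs is then just the special
case where cpu_start == gpu_start for every range.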

When you talk about migrating memory to a device you also do this on a 
per-device basis and *not* tied to the process address space. If you 
then get crappy performance because userspace gave contradicting 
information about where to migrate memory, then that's a bug in userspace 
and not something the kernel should try to prevent somehow.

[SNIP]
>> I think if you start using the same drm_gpuvm for multiple devices you
>> will sooner or later start to run into the same mess we have seen with
>> KFD, where we moved more and more functionality from the KFD to the DRM
>> render node because we found that a lot of the stuff simply doesn't work
>> correctly with a single object to maintain the state.
> As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I 
see this design as a rather extreme failure. And I think it's one of the 
reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending 
the CPU process into the GPUs, but this idea only works for a few use 
cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with 
CPU address != GPU address because the VAs are actually coming from the 
guest VM and not the host process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc. This should not 
have any influence on the design of the kernel UAPI.

If you want to do something similar to KFD for Xe I think you need to 
get explicit permission to do this from Dave and Daniel and maybe even 
Linus.

Regards,
Christian.



* RE: Making drm_gpuvm work across gpu devices
  2024-01-24  8:33             ` Christian König
@ 2024-01-25  1:17               ` Zeng, Oak
  2024-01-25  1:25                 ` David Airlie
                                   ` (2 more replies)
  2024-01-25 16:42               ` Zeng, Oak
                                 ` (2 subsequent siblings)
  3 siblings, 3 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-25  1:17 UTC (permalink / raw)
  To: Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, Shah, Ankur N, Winiarski, Michal
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe


Hi Christian,

Even though I mentioned the KFD design, I didn't mean to copy it. I also had a hard time understanding the difficulty KFD has under a virtualization environment.

For us, Xekmd doesn't need to know whether it is running under a bare metal or virtualized environment. Xekmd is always a guest driver. All the virtual addresses used in xekmd are guest virtual addresses. For SVM, we require all the VF devices to share one single address space with the guest CPU program, so all the design that works in a bare metal environment automatically works under a virtualized environment. +@Shah, Ankur N +@Winiarski, Michal to back me up if I am wrong.

Again, a shared virtual address space b/t the cpu and all gpu devices is a hard requirement for our system allocator design (which means malloc'ed memory, cpu stack variables and globals can be directly used in a gpu program; same requirement as the kfd SVM design). This was aligned with our user space software stack.

For anyone who wants to implement a system allocator, or SVM, this is a hard requirement. I started this thread hoping I could leverage the drm_gpuvm design to manage the shared virtual address space (as the address range split/merge function was scary to me and I didn't want to re-invent it). I guess my takeaway from you and Danilo is that this approach is a NAK. Thomas also mentioned to me that drm_gpuvm is overkill for our svm address range split/merge. So I will make things work first by managing the address ranges inside xekmd. I can revisit the drm_gpuvm approach in the future.

Maybe a pseudo user program can illustrate our programming model:


Fd0 = open(card0)
Fd1 = open(card1)
Vm0 = xe_vm_create(fd0) //driver creates the process's xe_svm on its first vm_create
Vm1 = xe_vm_create(fd1) //driver re-uses the xe_svm created above if called from the same process
Queue0 = xe_exec_queue_create(fd0, vm0)
Queue1 = xe_exec_queue_create(fd1, vm1)
//check p2p capability calling L0 API….
ptr = malloc() //this replaces bo_create, vm_bind, dma-import/export
Xe_exec(queue0, ptr) //submit gpu job which uses ptr, on card0
Xe_exec(queue1, ptr) //submit gpu job which uses ptr, on card1
//Gpu page fault handles memory allocation/migration/mapping to gpu

As you can see from the above model, our design is a little bit different from the KFD design. The user needs to explicitly create a gpuvm (vm0 and vm1 above) for each gpu device. The driver internally has an xe_svm representing the shared address space b/t the cpu and multiple gpu devices, but the end user doesn't see it and doesn't need to create it. The shared virtual address space is really managed by linux core mm (through the vma struct, mm_struct etc). From each gpu device's perspective, it just operates under its own gpuvm, not aware of the existence of the other gpuvms, even though in reality all those gpuvms share the same virtual address space.
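
The 1 process : 1 mm_struct : 1 xe_svm : N xe_vm correlation could be
sketched like this; the field names are illustrative, not the actual xekmd
structures:

struct xe_svm {                         /* one per process, internal only */
        struct mm_struct *mm;           /* the single CPU address space */
        struct list_head vm_list;       /* every xe_vm of this process */
};

struct xe_vm {                          /* one per device, user-created */
        struct xe_device *xe;           /* the device this vm belongs to */
        struct xe_svm *svm;             /* shared address space, if SVM is used */
        /* per-device page tables, mapping flags, ... */
};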

See one more comment inline

From: Christian König <christian.koenig@amd.com>
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]

Yes most API are per device based.

One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that ever again.

Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.

When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow.

[SNIP]

I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state.

As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process.


I don't get the problem here. For us, under virtualization, both the cpu addresses and gpu virtual addresses operated on in xekmd are guest virtual addresses. They can still share the same virtual address space (as SVM requires).

Oak


SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus.

Regards,
Christian.



* Re: Making drm_gpuvm work across gpu devices
  2024-01-25  1:17               ` Zeng, Oak
@ 2024-01-25  1:25                 ` David Airlie
  2024-01-25  5:25                   ` Zeng, Oak
  2024-01-25 11:00                 ` Re: Making " 周春明(日月)
  2024-01-25 17:15                 ` Making " Felix Kuehling
  2 siblings, 1 reply; 123+ messages in thread
From: David Airlie @ 2024-01-25  1:25 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Ghimiray, Himal Prasad, Thomas.Hellstrom, Winiarski, Michal,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, intel-xe,
	Gupta, saurabhg, Danilo Krummrich, Daniel Vetter, Brost, Matthew,
	Bommu, Krishnaiah, Vishwanathapura, Niranjana,
	Christian König

>
>
> For us, Xekmd doesn't need to know it is running under bare metal or virtualized environment. Xekmd is always a guest driver. All the virtual address used in xekmd is guest virtual address. For SVM, we require all the VF devices share one single shared address space with guest CPU program. So all the design works in bare metal environment can automatically work under virtualized environment. +@Shah, Ankur N +@Winiarski, Michal to backup me if I am wrong.
>
>
>
> Again, shared virtual address space b/t cpu and all gpu devices is a hard requirement for our system allocator design (which means malloc’ed memory, cpu stack variables, globals can be directly used in gpu program. Same requirement as kfd SVM design). This was aligned with our user space software stack.

Just to make a very general point here (I'm hoping you listen to
Christian a bit more and hoping he replies in more detail): just
because you have a system allocator design done, it doesn't in any way
oblige the kernel driver to accept that design.
Bad system design should be pushed back on, not enforced in
implementation stages. It's a trap Intel falls into regularly, since
they say, well, we already agreed on this design with the userspace team
and we can't change it now. This isn't acceptable. Design includes
upstream discussion and feedback; if you have, say, misdesigned the system
allocator (and I'm not saying you definitely have), and this is
pushback on that, then you have to go fix your system
architecture.

KFD was an experiment like this. I pushed back on AMD at the start,
saying it was likely a bad plan; we let it go and got a lot of
experience in why it was a bad design.

Dave.



* RE: Making drm_gpuvm work across gpu devices
  2024-01-23 20:17             ` Felix Kuehling
@ 2024-01-25  1:39               ` Zeng, Oak
  0 siblings, 0 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-25  1:39 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, Danilo Krummrich,
	Dave Airlie, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

Thank you, Felix, for sharing. See a few comments inline.

> -----Original Message-----
> From: Felix Kuehling <felix.kuehling@amd.com>
> Sent: Tuesday, January 23, 2024 3:17 PM
> To: Zeng, Oak <oak.zeng@intel.com>; Christian König <christian.koenig@amd.com>;
> Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel
> Vetter <daniel@ffwll.ch>
> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> On 2024-01-23 14:37, Zeng, Oak wrote:
> > Thanks Christian. I have some comment inline below.
> >
> > Danilo, can you also take a look and give your feedback? Thanks.
> 
> Sorry, just catching up with this thread now. I'm also not familiar with
> drm_gpuvm.
> 
> Some general observations based on my experience with KFD, amdgpu and
> SVM. With SVM we have a single virtual address space managed in user
> mode (basically using mmap) with attributes per virtual address range
> maintained in the kernel mode driver. Different devices get their
> mappings of pieces of that address space using the same virtual
> addresses. We also support migration to different DEVICE_PRIVATE memory
> spaces.

I think one and the same virtual address can be mapped into different devices. For different devices, reading from the same virtual address results in the same content. The driver either makes the page table point to the same physical location, or migrates before mapping. I guess you imply this.
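
For reference, the per-device mirroring step could look roughly like this;
hmm_range_fault() is the real HMM entry point, while xe_vm_map_pfns() is a
hypothetical helper, and the mandatory mmap locking and notifier_seq retry
loop are omitted:

/* Fault a CPU va range and mirror it into one device's page table; each
 * device repeats this with its own flags/format for the same va range. */
static int mirror_range_into_device(struct xe_vm *vm, struct hmm_range *range)
{
        int ret = hmm_range_fault(range);       /* fills range->hmm_pfns[] */

        if (ret)
                return ret;

        return xe_vm_map_pfns(vm, range->start, range->end, range->hmm_pfns);
}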

> 
> However, we still have page tables managed per device. Each device can
> have different page table formats and layout (e.g. different GPU
> generations in the same system) and the same memory may be mapped with
> different flags on different devices in order to get the right coherence
> behaviour. We also need to maintain per-device DMA mappings somewhere.
> That means, as far as the device page tables are concerned, we still
> have separate address spaces. SVM only adds a layer on top, which
> coordinates these separate device virtual address spaces so that some
> parts of them provide the appearance of a shared virtual address space.
> 

Yes exactly the same understanding.

> At some point you need to decide, where you draw the boundary between
> managing a per-process shared virtual address space and managing
> per-device virtual address spaces. In amdgpu that boundary is currently
> where kfd_svm code calls amdgpu_vm code to manage the per-device page
> tables.

Exactly; in the xe driver it is xe_svm and xe_vm. Just different names 😊

> 
> In the amdgpu driver, we still have the traditional memory management
> APIs in the render nodes that don't do SVM. They share the device
> virtual address spaces with SVM. We have to be careful that we don't try
> to manage the same device virtual address ranges with these two
> different memory managers. In practice, we let the non-SVM APIs use the
> upper half of the canonical address space, while the lower half can be
> used almost entirely for SVM.
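
A sketch of that split, assuming a 48-bit canonical address space; the
constants are illustrative, not amdgpu's actual layout:

/* Lower canonical half: left to SVM, matching the CPU user va range. */
#define SVM_VA_START            0x0000000000000000ULL
#define SVM_VA_END              0x00007fffffffffffULL
/* Upper canonical half: reserved for the traditional BO-based APIs. */
#define NON_SVM_VA_START        0xffff800000000000ULL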

In xekmd, we also have to support a mixed usage of the traditional gem_create/vm_bind APIs and malloc. 

I am just wondering why you had to split the canonical address space b/t those two usage models. To illustrate the two usages:

Traditional model:
Ptr = mmap(anonymous)
Vm_bind(ptr, bo)
Submit gpu kernel using ptr

System allocator model:
Ptr = mmap(anonymous) or malloc()
Submit gpu kernel using ptr

The point is, both ptrs are allocated anonymously in user space inside one process address space, so there is no collision even if you don't deliberately divide the canonical address space (a sketch follows below).
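
A user-space sketch of the two models side by side; xe_vm_bind() is
shorthand for the real vm_bind ioctl, not an actual library call:

size_t size = 1 << 20;

/* Traditional model: reserve the va range first, then bind a BO to it. */
void *p1 = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
xe_vm_bind(vm, bo, (uint64_t)p1);       /* shorthand for the ioctl */

/* System allocator model: any anonymous allocation just works. */
void *p2 = malloc(size);

/* p1 and p2 both came from the kernel's per-process va allocator, so
 * they can never collide with each other. */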

Thanks,
Oak
> 
> Regards,
>    Felix
> 
> 
> >
> >> -----Original Message-----
> >> From: Christian König <christian.koenig@amd.com>
> >> Sent: Tuesday, January 23, 2024 6:13 AM
> >> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>;
> >> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
> >> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
> >> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
> >> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> >> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> >> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> >> <matthew.brost@intel.com>
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >> Hi Oak,
> >>
> >> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> >>> Hi Danilo and all,
> >>>
> >>> During the work of Intel's SVM code, we came up the idea of making
> >> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> >> https://lore.kernel.org/dri-
> >>
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> >> 11.prod.outlook.com/
> >>> The reason we try to do this is, for a SVM (shared virtual memory across cpu
> >> program and all gpu program on all gpu devices) process, the address space has
> >> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
> >> then our SVM code can leverage drm_gpuvm as well.
> >>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
> >> drm_device *drm pointer a lot. This param is used only for printing/warning. So I
> >> think maybe we can delete this drm field from drm_gpuvm.
> >>> This way, on a multiple gpu device system, for one process, we can have only
> >> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
> >> each gpu device).
> >>> What do you think?
> >> Well from the GPUVM side I don't think it would make much difference if
> >> we have the drm device or not.
> >>
> >> But the experience we had with the KFD I think I should mention that we
> >> should absolutely *not* deal with multiple devices at the same time in
> >> the UAPI or VM objects inside the driver.
> >>
> >> The background is that all the APIs inside the Linux kernel are build
> >> around the idea that they work with only one device at a time. This
> >> accounts for both low level APIs like the DMA API as well as pretty high
> >> level things like for example file system address space etc...
> > Yes most API are per device based.
> >
> > One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
> function, it is per-process based. Each kfd_process represent a process across N gpu
> devices. Cc Felix.
> >
> > Need to say, kfd SVM represent a shared virtual address space across CPU and all
> GPU devices on the system. This is by the definition of SVM (shared virtual memory).
> This is very different from our legacy gpu *device* driver which works for only one
> device (i.e., if you want one device to access another device's memory, you will have
> to use dma-buf export/import etc).
> >
> > We have the same design requirement of SVM. For anyone who want to
> implement the SVM concept, this is a hard requirement. Since now drm has the
> drm_gpuvm concept which strictly speaking is designed for one device, I want to see
> whether we can extend drm_gpuvm to make it work for both single device (as used
> in xe) and multipe devices (will be used in the SVM code). That is why I brought up
> this topic.
> >
> >> So when you have multiple GPUs you either have an inseparable cluster of
> >> them which case you would also only have one drm_device. Or you have
> >> separated drm_device which also results in separate drm render nodes and
> >> separate virtual address spaces and also eventually separate IOMMU
> >> domains which gives you separate dma_addresses for the same page and so
> >> separate GPUVM page tables....
> > I am thinking we can still make each device has its separate drm_device/render
> node/iommu domains/gpu page table. Just as what we have today. I am not plan to
> change this picture.
> >
> > But the virtual address space will support two modes of operation:
> > 1. one drm_gpuvm per device. This is when svm is not in the picture
> > 2. all devices in the process share one single drm_gpuvm, when svm is in the
> picture. In xe driver design, we have to support a mixture use of legacy mode (such
> as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu
> submission). So whenever SVM is in the picture, we want one single process address
> space across all devices. Drm_gpuvm doesn't need to be aware of those two
> operation modes. It is driver's responsibility to use different mode.
> >
> > For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit from
> drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series:
> https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/)
> can inherit from drm_gpuvm while each xe_vm (still a per-device based struct) will
> just have a pointer to the drm_gpuvm structure. This way when svm is in play, we
> build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means shared
> address space across gpu devices.
> >
> > This requires some changes of drm_gpuvm design:
> > 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL, means
> this drm_gpuvm is not for specific gpu device
> > 2. The common dma_resv object: drm_gem_object *r_obj. *Does one dma_resv
> object allocated/initialized for one device work for all devices*? From first look,
> dma_resv is just some CPU structure maintaining dma-fences. So I guess it should
> work for all devices? I definitely need you to comment.
> >
> >
> >> It's up to you how to implement it, but I think it's pretty clear that
> >> you need separate drm_gpuvm objects to manage those.
> > As explained above, I am thinking of one drm_gpuvm object across all devices
> when SVM is in the picture...
> >
> >> That you map the same thing in all those virtual address spaces at the
> >> same address is a completely different optimization problem I think.
> > Not sure I follow here... the requirement from SVM is, one virtual address points to
> same physical backing store. For example, whenever CPU or any GPU device access
> this virtual address, it refers to the same physical content. Of course the physical
> backing store can be migrated b/t host memory and any of the GPU's device memory,
> but the content should be consistent.
> >
> > So we are mapping same physical content to the same virtual address in either cpu
> page table or any gpu device's page table...
> >
> >> What we could certainly do is to optimize hmm_range_fault by making
> >> hmm_range a reference counted object and using it for multiple devices
> >> at the same time if those devices request the same range of an mm_struct.
> >>
> > Not very follow. If you are trying to resolve a multiple devices concurrent access
> problem, I think we should serialize concurrent device fault to one address range.
> The reason is, during device fault handling, we might migrate the backing store so
> hmm_range->hmm_pfns[] might have changed after one device access it.
> >
> >> I think if you start using the same drm_gpuvm for multiple devices you
> >> will sooner or later start to run into the same mess we have seen with
> >> KFD, where we moved more and more functionality from the KFD to the DRM
> >> render node because we found that a lot of the stuff simply doesn't work
> >> correctly with a single object to maintain the state.
> > As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd
> device represent all hardware gpu devices. That is why during kfd open, many pdd
> (process device data) is created, each for one hardware device for this process. Yes
> the codes are a little complicated.
> >
> > Kfd manages the shared virtual address space in the kfd driver codes, like the split,
> merging etc. Here I am looking whether we can leverage the drm_gpuvm code for
> those functions.
> >
> > As of the shared virtual address space across gpu devices, it is a hard requirement
> for svm/system allocator (aka malloc for gpu program). We need to make it work
> either at driver level or drm_gpuvm level. Drm_gpuvm is better because the work
> can be shared b/t drivers.
> >
> > Thanks a lot,
> > Oak
> >
> >> Just one more point to your original discussion on the xe list: I think
> >> it's perfectly valid for an application to map something at the same
> >> address you already have something else.
> >>
> >> Cheers,
> >> Christian.
> >>
> >>> Thanks,
> >>> Oak


* RE: Making drm_gpuvm work across gpu devices
  2024-01-25  1:25                 ` David Airlie
@ 2024-01-25  5:25                   ` Zeng, Oak
  2024-01-26 10:09                     ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-25  5:25 UTC (permalink / raw)
  To: David Airlie
  Cc: Brost, Matthew, Thomas.Hellstrom, Winiarski, Michal,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel,
	Christian König, Ghimiray, Himal Prasad, Daniel Vetter,
	Bommu,  Krishnaiah, Gupta, saurabhg, Vishwanathapura, Niranjana,
	intel-xe, Danilo Krummrich

Hi Dave,

Let me step back. When I wrote "shared virtual address space b/t cpu and all gpu devices is a hard requirement for our system allocator design", I meant this is not only Intel's design requirement. Rather, it is a common requirement for Intel, AMD and Nvidia alike. Take a look at the cuda driver API definition of cuMemAllocManaged (search for this API on https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM); it says: 

"The pointer is valid on the CPU and on all GPUs in the system that support managed memory."

This means the program's virtual address space is shared b/t the CPU and all GPU devices on the system. The system allocator we are discussing is just one step more advanced than cuMemAllocManaged: it allows malloc'ed memory to be shared b/t the CPU and all GPU devices.
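
For example, the usage that documentation describes looks like this;
cuMemAllocManaged() is the real CUDA driver API (context setup and error
handling omitted):

#include <cuda.h>

size_t nbytes = 1 << 20;
CUdeviceptr dptr;

/* One allocation; the returned pointer is valid on the CPU and on every
 * GPU in the system that supports managed memory. */
cuMemAllocManaged(&dptr, nbytes, CU_MEM_ATTACH_GLOBAL);

((char *)(uintptr_t)dptr)[0] = 42;      /* the CPU can touch it directly */
/* ...and a kernel launched on any GPU may dereference the same pointer. */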

I hope we all agree with this point.

With that, I agree with Christian that in the kmd we should make driver code per-device based instead of managing all devices in one driver instance. Our system allocator (and generally xekmd) design follows this rule: we make xe_vm per-device based - one device is *not* aware of another device's address space, as I explained in the previous email. I started this email seeking one drm_gpuvm instance to cover all GPU devices. I gave up on this approach (at least for now) per Danilo's and Christian's feedback: we will continue to have a per-device drm_gpuvm. I hope this is aligned with Christian, but I will have to wait for his reply to my previous email.

I hope this clarifies things a little.

Regards,
Oak 

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of David
> Airlie
> Sent: Wednesday, January 24, 2024 8:25 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> Thomas.Hellstrom@linux.intel.com; Winiarski, Michal
> <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
> Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>; dri-
> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg
> <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel
> Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>; Bommu,
> Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Christian König
> <christian.koenig@amd.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> >
> >
> > For us, Xekmd doesn't need to know it is running under bare metal or
> virtualized environment. Xekmd is always a guest driver. All the virtual address
> used in xekmd is guest virtual address. For SVM, we require all the VF devices
> share one single shared address space with guest CPU program. So all the design
> works in bare metal environment can automatically work under virtualized
> environment. +@Shah, Ankur N +@Winiarski, Michal to backup me if I am wrong.
> >
> >
> >
> > Again, shared virtual address space b/t cpu and all gpu devices is a hard
> requirement for our system allocator design (which means malloc’ed memory,
> cpu stack variables, globals can be directly used in gpu program. Same
> requirement as kfd SVM design). This was aligned with our user space software
> stack.
> 
> Just to make a very general point here (I'm hoping you listen to
> Christian a bit more and hoping he replies in more detail), but just
> because you have a system allocator design done, it doesn't in any way
> enforce the requirements on the kernel driver to accept that design.
> Bad system design should be pushed back on, not enforced in
> implementation stages. It's a trap Intel falls into regularly since
> they say well we already agreed this design with the userspace team
> and we can't change it now. This isn't acceptable. Design includes
> upstream discussion and feedback, if you say misdesigned the system
> allocator (and I'm not saying you definitely have), and this is
> pushing back on that, then you have to go fix your system
> architecture.
> 
> KFD was an experiment like this, I pushed back on AMD at the start
> saying it was likely a bad plan, we let it go and got a lot of
> experience in why it was a bad design.
> 
> Dave.



* Re: Making drm_gpuvm work across gpu devices
  2024-01-25  1:17               ` Zeng, Oak
  2024-01-25  1:25                 ` David Airlie
@ 2024-01-25 11:00                 ` 周春明(日月)
  2024-01-25 17:00                   ` Zeng, Oak
  2024-01-25 17:15                 ` Making " Felix Kuehling
  2 siblings, 1 reply; 123+ messages in thread
From: 周春明(日月) @ 2024-01-25 11:00 UTC (permalink / raw)
  To: Zeng, Oak, Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, Shah, Ankur N, Winiarski, Michal
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Gupta, saurabhg,
	Vishwanathapura, Niranjana, intel-xe


[snip]
Fd0 = open(card0)
Fd1 = open(card1)
Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's first vm_create
Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called from same process
Queue0 = xe_exec_queue_create(fd0, vm0)
Queue1 = xe_exec_queue_create(fd1, vm1)
//check p2p capability calling L0 API….
ptr = malloc()//this replace bo_create, vm_bind, dma-import/export
Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0
Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1
//Gpu page fault handles memory allocation/migration/mapping to gpu
[snip]
Hi Oak,
From your sample code, you not only need a va manager across gpu devices, but also across the cpu, right?
I think you need a UVA (unified va) manager in user space, and the drm_gpuvm ranges should be reserved out of the cpu va space. That way, malloc'ed vas and gpu vas live in the same space and will not conflict; then, via the HMM mechanism, gpu devices can safely use the VAs passed from HMM (see the sketch below).
By the way, I'm not familiar with drm_gpuvm. Traditionally, gpu drivers often put the va manager in user space; I am not sure what benefit we get from a drm_gpuvm invented in kernel space. Can anyone help explain more?
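
A user-space sketch of that reservation idea; uva_alloc() stands in for a
hypothetical UVA manager, only the mmap() call is real:

/* Illustrative: reserve 1 TiB of the process va space. */
#define UVA_REGION_SIZE (1ULL << 40)

/* Reserve a chunk of the process va space with no access permissions;
 * malloc/mmap will never hand out addresses inside it, so gpu vas carved
 * from this region cannot collide with CPU allocations. */
void *base = mmap(NULL, UVA_REGION_SIZE, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

/* The hypothetical user-space UVA manager then sub-allocates gpu vas
 * out of [base, base + UVA_REGION_SIZE). */
uint64_t gpu_va = uva_alloc(uva_mgr, size);
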
- Chunming
------------------------------------------------------------------
发件人:Zeng, Oak <oak.zeng@intel.com>
发送时间:2024年1月25日(星期四) 09:17
收件人:"Christian König" <christian.koenig@amd.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; "Shah, Ankur N" <ankur.n.shah@intel.com>; "Winiarski, Michal" <michal.winiarski@intel.com>
抄 送:"Brost, Matthew" <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com <Thomas.Hellstrom@linux.intel.com>; "Welty, Brian" <brian.welty@intel.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>; "Gupta, saurabhg" <saurabhg.gupta@intel.com>; "Bommu, Krishnaiah" <krishnaiah.bommu@intel.com>; "Vishwanathapura, Niranjana" <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org <intel-xe@lists.freedesktop.org>
主 题:RE: Making drm_gpuvm work across gpu devices
Hi Christian,
Even though I mentioned KFD design, I didn’t mean to copy the KFD design. I also had hard time to understand the difficulty of KFD under virtualization environment.
For us, Xekmd doesn't need to know it is running under bare metal or virtualized environment. Xekmd is always a guest driver. All the virtual address used in xekmd is guest virtual address. For SVM, we require all the VF devices share one single shared address space with guest CPU program. So all the design works in bare metal environment can automatically work under virtualized environment. +@Shah, Ankur N <mailto:ankur.n.shah@intel.com > +@Winiarski, Michal <mailto:michal.winiarski@intel.com > to backup me if I am wrong.
Again, shared virtual address space b/t cpu and all gpu devices is a hard requirement for our system allocator design (which means malloc’ed memory, cpu stack variables, globals can be directly used in gpu program. Same requirement as kfd SVM design). This was aligned with our user space software stack. 
For anyone who want to implement system allocator, or SVM, this is a hard requirement. I started this thread hoping I can leverage the drm_gpuvm design to manage the shared virtual address space (as the address range split/merge function was scary to me and I didn’t want re-invent). I guess my takeaway from this you and Danilo is this approach is a NAK. Thomas also mentioned to me drm_gpuvm is a overkill for our svm address range split/merge. So I will make things work first by manage address range xekmd internally. I can re-look drm-gpuvm approach in the future.
Maybe a pseudo user program can illustrate our programming model:
Fd0 = open(card0)
Fd1 = open(card1)
Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's first vm_create
Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called from same process
Queue0 = xe_exec_queue_create(fd0, vm0)
Queue1 = xe_exec_queue_create(fd1, vm1)
//check p2p capability calling L0 API….
ptr = malloc()//this replace bo_create, vm_bind, dma-import/export
Xe_exec(queue0,  ptr)//submit gpu job which use ptr, on card0
Xe_exec(queue1,  ptr)//submit gpu job which use ptr, on card1
//Gpu page fault handles memory allocation/migration/mapping to gpu
As you can see, from above model, our design is a little bit different than the KFD design. user need to explicitly create gpuvm (vm0 and vm1 above) for each gpu device. Driver internally have a xe_svm represent the shared address space b/t cpu and multiple gpu devices. But end user doesn’t see and no need to create xe_svm. The shared virtual address space is really managed by linux core mm (through the vma struct, mm_struct etc). From each gpu device’s perspective, it just operate under its own gpuvm, not aware of the existence of other gpuvm, even though in reality all those gpuvm shares a same virtual address space.
See one more comment inline
From: Christian König <christian.koenig@amd.com> 
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices
Am 23.01.24 um 20:37 schrieb Zeng, Oak:
[SNIP] 
Yes most API are per device based. One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.
 Yeah and that was a big mistake in my opinion. We should really not do that ever again.
Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).
 Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea.
 The background is that this only works for some use cases but not all of them.
 What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space.
 Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.
 When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow.
 [SNIP]
I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state. As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.
 Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda.
 This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general.
 A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process.
I don’t get the problem here. For us, under virtualization, both the cpu address and the gpu virtual address operated on in xekmd are guest virtual addresses. They can still share the same virtual address space (as SVM requires).
Oak
 SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.
 If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus.
 Regards,
 Christian.

[-- Attachment #2: Type: text/html, Size: 21792 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-24  8:33             ` Christian König
  2024-01-25  1:17               ` Zeng, Oak
@ 2024-01-25 16:42               ` Zeng, Oak
  2024-01-25 18:32               ` Daniel Vetter
  2024-02-23 20:12               ` Zeng, Oak
  3 siblings, 0 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-25 16:42 UTC (permalink / raw)
  To: Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

[-- Attachment #1: Type: text/plain, Size: 5575 bytes --]

Hi Christian,

I got a few more questions inline

From: Christian König <christian.koenig@amd.com>
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]



Yes most API are per device based.



One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that ever again.



Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.


The whole purpose of the HMM design is to create a shared address space between a CPU and a GPU program. See here: https://www.kernel.org/doc/Documentation/vm/hmm.rst. Mapping a process address range A..B to a range C..D of the GPU address space is exactly what is referred to as a “split address space” in the HMM design.
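
For concreteness, the core of an HMM mirror in a kernel driver looks roughly like the sketch below, following the pattern documented in hmm.rst. Error handling and the driver page-table locking are trimmed, and the notifier/pfns setup is assumed to be done by the caller:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static int svm_populate_range(struct mmu_interval_notifier *notifier,
			      struct mm_struct *mm, unsigned long start,
			      unsigned long end, unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* faults CPU pages in, fills pfns */
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;
	/* If no CPU-side invalidation raced us (real code checks this under
	 * the driver's page-table lock), write pfns into the GPU page table
	 * at the *same* virtual addresses: the shared, not split, model. */
	if (mmu_interval_read_retry(notifier, range.notifier_seq))
		goto again;
	return 0;
}
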



When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow.

[SNIP]


I think if you start using the same drm_gpuvm for multiple devices you

will sooner or later start to run into the same mess we have seen with

KFD, where we moved more and more functionality from the KFD to the DRM

render node because we found that a lot of the stuff simply doesn't work

correctly with a single object to maintain the state.



As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process.


Are you talking about a general virtualization setup such as SRIOV or GPU device pass-through, or something else?

In a typical virtualization setup, a gpu driver such as xekmd or amdgpu is always a guest driver. In the xekmd case, xekmd doesn’t need to know it is operating in a virtualized environment, so the virtual addresses in the driver are guest virtual addresses. From the kmd driver’s perspective, there is no difference between bare metal and virtualized.

Are you talking about a special virtualized setup such as para-virtualization/VirGL? I need more background info to understand why you end up with CPU address != GPU address in SVM….


SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.


Maybe there is a terminology problem here. I agree with what you said above. We have also achieved the SVM concept with our BO-centric drivers such as i915 and xekmd.

But we are mainly talking about the system allocator here, i.e. using malloc’ed memory directly in a GPU program, and we want to leverage HMM. A system allocator can be used to implement the same SVM concept as in OpenCL/Cuda/ROCm, but SVM can also be implemented with a BO-centric driver.
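
In user space terms the difference looks like this (pseudo code in the same style as my earlier example; the BO-path function names are illustrative):

/* BO-centric SVM: explicit allocation and binding, per device */
bo = bo_create(fd, size);
vm_bind(fd, vm, bo, va);	/* gpu va chosen to match the cpu va */
cpu_ptr = bo_mmap(fd, bo);	/* cpu view of the same pages */

/* System allocator on top of HMM: no bo, no bind */
ptr = malloc(size);		/* any cpu pointer just works on the gpu */
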


If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus.

If you look at my series https://lore.kernel.org/dri-devel/20231221043812.3783313-1-oak.zeng@intel.com/, you will see I am not doing things the way KFD does.

Regards,
Oak


Regards,
Christian.

[-- Attachment #2: Type: text/html, Size: 10249 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: 回复:Making drm_gpuvm work across gpu devices
  2024-01-25 11:00                 ` 回复:Making " 周春明(日月)
@ 2024-01-25 17:00                   ` Zeng, Oak
  0 siblings, 0 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-25 17:00 UTC (permalink / raw)
  To: 周春明(日月),
	Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, Shah, Ankur N, Winiarski, Michal
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Gupta, saurabhg,
	Vishwanathapura, Niranjana, intel-xe

[-- Attachment #1: Type: text/plain, Size: 12184 bytes --]

Hi Chunming,


From: 周春明(日月) <riyue.zcm@alibaba-inc.com>
Sent: Thursday, January 25, 2024 6:01 AM
To: Zeng, Oak <oak.zeng@intel.com>; Christian König <christian.koenig@amd.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; Shah, Ankur N <ankur.n.shah@intel.com>; Winiarski, Michal <michal.winiarski@intel.com>
Cc: Brost, Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org
Subject: 回复:Making drm_gpuvm work across gpu devices

[snip]

Fd0 = open(card0)

Fd1 = open(card1)

Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's first vm_create

Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called from same process

Queue0 = xe_exec_queue_create(fd0, vm0)

Queue1 = xe_exec_queue_create(fd1, vm1)

//check p2p capability calling L0 API….

ptr = malloc()//this replace bo_create, vm_bind, dma-import/export

Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0

Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1

//Gpu page fault handles memory allocation/migration/mapping to gpu
[snip]
Hi Oak,
From your sample code, you not only need a va-manager across gpu devices, but also across the cpu, right?

No. Per the feedback from Christian and Danilo, I am giving up the idea of making drm_gpuvm work across gpu devices. I might come back to it later, but for now it is not the plan anymore.

I think you need a UVA (unified va) manager in user space that reserves the drm_gpuvm range out of the cpu va space. That way, malloc’ed va and gpu va are in the same space and will not conflict. Then, via the HMM mechanism, gpu devices can safely use the VA passed from HMM.

Under HMM, the GPU and CPU are simply under the same address space: the same virtual address represents the same allocation for both the CPU and the GPUs. See the hmm doc here: https://www.kernel.org/doc/Documentation/vm/hmm.rst. A user space program doesn’t need to reserve any address range; all address ranges are managed by the linux kernel core mm. Today the GPU kmd driver only keeps some structures to save address-range-based memory attributes.
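
That driver-side structure can be as small as an interval-tree node plus the attributes. A hypothetical sketch (every field name here is made up for illustration, not taken from the series):

#include <linux/interval_tree.h>
#include <linux/mmu_notifier.h>

struct xe_svm_range {
	struct interval_tree_node it_node;	/* .start/.last key the range in an rbtree */
	struct mmu_interval_notifier notifier;	/* cpu-side invalidation hook */
	u32 placement_hint;			/* e.g. prefer vram vs. system memory */
	u32 migration_granularity;
};

The address space itself stays with core mm; the driver only keeps this per-range bookkeeping.
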

Regards,
Oak

By the way, I'm not familiar with drm_gpuvm. Traditionally, gpu drivers often put the va-manager in user space; I'm not sure what benefit we get from a drm_gpuvm invented in kernel space. Can anyone help explain more?

- Chunming
------------------------------------------------------------------
From: Zeng, Oak <oak.zeng@intel.com>
Sent: Thursday, January 25, 2024 09:17
To: "Christian König" <christian.koenig@amd.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; "Shah, Ankur N" <ankur.n.shah@intel.com>; "Winiarski, Michal" <michal.winiarski@intel.com>
Cc: "Brost, Matthew" <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com; "Welty, Brian" <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>; "Gupta, saurabhg" <saurabhg.gupta@intel.com>; "Bommu, Krishnaiah" <krishnaiah.bommu@intel.com>; "Vishwanathapura, Niranjana" <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org
Subject: RE: Making drm_gpuvm work across gpu devices

Hi Christian,

Even though I mentioned KFD design, I didn’t mean to copy the KFD design. I also had hard time to understand the difficulty of KFD under virtualization environment.

For us, Xekmd doesn't need to know it is running under bare metal or virtualized environment. Xekmd is always a guest driver. All the virtual address used in xekmd is guest virtual address. For SVM, we require all the VF devices share one single shared address space with guest CPU program. So all the design works in bare metal environment can automatically work under virtualized environment. +@Shah, Ankur N +@Winiarski, Michal to backup me if I am wrong.

Again, shared virtual address space b/t cpu and all gpu devices is a hard requirement for our system allocator design (which means malloc’ed memory, cpu stack variables, globals can be directly used in gpu program. Same requirement as kfd SVM design). This was aligned with our user space software stack.

For anyone who want to implement system allocator, or SVM, this is a hard requirement. I started this thread hoping I can leverage the drm_gpuvm design to manage the shared virtual address space (as the address range split/merge function was scary to me and I didn’t want re-invent). I guess my takeaway from this you and Danilo is this approach is a NAK. Thomas also mentioned to me drm_gpuvm is a overkill for our svm address range split/merge. So I will make things work first by manage address range xekmd internally. I can re-look drm-gpuvm approach in the future.

Maybe a pseudo user program can illustrate our programming model:


Fd0 = open(card0)

Fd1 = open(card1)

Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's first vm_create

Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called from same process

Queue0 = xe_exec_queue_create(fd0, vm0)

Queue1 = xe_exec_queue_create(fd1, vm1)

//check p2p capability calling L0 API….

ptr = malloc()//this replace bo_create, vm_bind, dma-import/export

Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0

Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1

//Gpu page fault handles memory allocation/migration/mapping to gpu

As you can see, from above model, our design is a little bit different than the KFD design. user need to explicitly create gpuvm (vm0 and vm1 above) for each gpu device. Driver internally have a xe_svm represent the shared address space b/t cpu and multiple gpu devices. But end user doesn’t see and no need to create xe_svm. The shared virtual address space is really managed by linux core mm (through the vma struct, mm_struct etc). From each gpu device’s perspective, it just operate under its own gpuvm, not aware of the existence of other gpuvm, even though in reality all those gpuvm shares a same virtual address space.

See one more comment inline

From: Christian König <christian.koenig@amd.com>
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 23.01.24 um 20:37 schrieb Zeng, Oak:
[SNIP]



Yes most API are per device based.



One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that ever again.


Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found as blocker for a virtualization projects. Having SVM as device independent feature which somehow ties to the process address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.

When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow.

[SNIP]

I think if you start using the same drm_gpuvm for multiple devices you

will sooner or later start to run into the same mess we have seen with

KFD, where we moved more and more functionality from the KFD to the DRM

render node because we found that a lot of the stuff simply doesn't work

correctly with a single object to maintain the state.



As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process.


I don’t get the problem here. For us, under virtualization, both the cpu address and gpu virtual address operated in xekmd is guest virtual address. They can still share the same virtual address space (as SVM required)

Oak


SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus.

Regards,
Christian.

[-- Attachment #2: Type: text/html, Size: 31807 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-25  1:17               ` Zeng, Oak
  2024-01-25  1:25                 ` David Airlie
  2024-01-25 11:00                 ` 回复:Making " 周春明(日月)
@ 2024-01-25 17:15                 ` Felix Kuehling
  2024-01-25 18:37                   ` Zeng, Oak
  2 siblings, 1 reply; 123+ messages in thread
From: Felix Kuehling @ 2024-01-25 17:15 UTC (permalink / raw)
  To: Zeng, Oak, Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Shah, Ankur N, Winiarski, Michal
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe


On 2024-01-24 20:17, Zeng, Oak wrote:
>
> Hi Christian,
>
> Even though I mentioned KFD design, I didn’t mean to copy the KFD 
> design. I also had hard time to understand the difficulty of KFD under 
> virtualization environment.
>
The problem with virtualization is related to virtualization design 
choices. There is a single process that proxies requests from multiple 
processes in one (or more?) VMs to the GPU driver. That means, we need a 
single process with multiple contexts (and address spaces). One proxy 
process on the host must support multiple guest address spaces.

I don't know much more than these very high level requirements, and I 
only found out about those a few weeks ago. Due to my own bias I can't 
comment whether there are bad design choices in the proxy architecture 
or in KFD or both. The way we are considering fixing this, is to enable 
creating multiple KFD contexts in the same process. Each of those 
contexts will still represent a shared virtual address space across 
devices (but not the CPU). Because the device address space is not 
shared with the CPU, we cannot support our SVM API in this situation.

I still believe that it makes sense to have the kernel mode driver aware 
of a shared virtual address space at some level. A per-GPU API and an 
API that doesn't require matching CPU and GPU virtual addresses would 
enable more flexibility, at the cost of duplicate information tracking for 
multiple devices and duplicate overhead for things like MMU notifiers 
and interval tree data structures. Having to coordinate multiple devices 
with potentially different address spaces would probably make it more 
awkward to implement memory migration. The added flexibility would go 
mostly unused, except in some very niche applications.
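
To make the duplication concrete: with a per-GPU API, each device ends up 
registering its own interval notifier (and interval-tree entry) for the 
same CPU range. A rough sketch, where dev_list, sva_notifier and the ops 
are made-up names, not code from any driver:

static int sva_register_range(struct mm_struct *mm, unsigned long start,
			      unsigned long length)
{
	struct my_dev *dev;
	int err;

	/* one notifier per device for the same [start, start + length) */
	list_for_each_entry(dev, &dev_list, node) {
		err = mmu_interval_notifier_insert(&dev->sva_notifier, mm,
						   start, length,
						   &dev_sva_notifier_ops);
		if (err)
			return err;	/* real code would unwind */
	}
	return 0;
}
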

Regards,
   Felix


> For us, Xekmd doesn't need to know it is running under bare metal or 
> virtualized environment. Xekmd is always a guest driver. All the 
> virtual address used in xekmd is guest virtual address. For SVM, we 
> require all the VF devices share one single shared address space with 
> guest CPU program. So all the design works in bare metal environment 
> can automatically work under virtualized environment. +@Shah, Ankur N 
> <mailto:ankur.n.shah@intel.com> +@Winiarski, Michal 
> <mailto:michal.winiarski@intel.com> to backup me if I am wrong.
>
> Again, shared virtual address space b/t cpu and all gpu devices is a 
> hard requirement for our system allocator design (which means 
> malloc’ed memory, cpu stack variables, globals can be directly used in 
> gpu program. Same requirement as kfd SVM design). This was aligned 
> with our user space software stack.
>
> For anyone who want to implement system allocator, or SVM, this is a 
> hard requirement. I started this thread hoping I can leverage the 
> drm_gpuvm design to manage the shared virtual address space (as the 
> address range split/merge function was scary to me and I didn’t want 
> re-invent). I guess my takeaway from this you and Danilo is this 
> approach is a NAK. Thomas also mentioned to me drm_gpuvm is a overkill 
> for our svm address range split/merge. So I will make things work 
> first by manage address range xekmd internally. I can re-look 
> drm-gpuvm approach in the future.
>
> Maybe a pseudo user program can illustrate our programming model:
>
> Fd0 = open(card0)
>
> Fd1 = open(card1)
>
> Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's 
> first vm_create
>
> Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called 
> from same process
>
> Queue0 = xe_exec_queue_create(fd0, vm0)
>
> Queue1 = xe_exec_queue_create(fd1, vm1)
>
> //check p2p capability calling L0 API….
>
> ptr = malloc()//this replace bo_create, vm_bind, dma-import/export
>
> Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0
>
> Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1
>
> //Gpu page fault handles memory allocation/migration/mapping to gpu
>
> As you can see, from above model, our design is a little bit different 
> than the KFD design. user need to explicitly create gpuvm (vm0 and vm1 
> above) for each gpu device. Driver internally have a xe_svm represent 
> the shared address space b/t cpu and multiple gpu devices. But end 
> user doesn’t see and no need to create xe_svm. The shared virtual 
> address space is really managed by linux core mm (through the vma 
> struct, mm_struct etc). From each gpu device’s perspective, it just 
> operate under its own gpuvm, not aware of the existence of other 
> gpuvm, even though in reality all those gpuvm shares a same virtual 
> address space.
>
> See one more comment inline
>
> *From:*Christian König <christian.koenig@amd.com>
> *Sent:* Wednesday, January 24, 2024 3:33 AM
> *To:* Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich 
> <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter 
> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
> *Cc:* Welty, Brian <brian.welty@intel.com>; 
> dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; 
> Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad 
> <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; 
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; 
> Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg 
> <saurabhg.gupta@intel.com>
> *Subject:* Re: Making drm_gpuvm work across gpu devices
>
> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>
>     [SNIP]
>
>     Yes most API are per device based.
>
>     One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.
>
>
> Yeah and that was a big mistake in my opinion. We should really not do 
> that ever again.
>
>
>     Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).
>
>
> Exactly that thinking is what we have currently found as blocker for a 
> virtualization projects. Having SVM as device independent feature 
> which somehow ties to the process address space turned out to be an 
> extremely bad idea.
>
> The background is that this only works for some use cases but not all 
> of them.
>
> What's working much better is to just have a mirror functionality 
> which says that a range A..B of the process address space is mapped 
> into a range C..D of the GPU address space.
>
> Those ranges can then be used to implement the SVM feature required 
> for higher level APIs and not something you need at the UAPI or even 
> inside the low level kernel memory management.
>
> When you talk about migrating memory to a device you also do this on a 
> per device basis and *not* tied to the process address space. If you 
> then get crappy performance because userspace gave contradicting 
> information where to migrate memory then that's a bug in userspace and 
> not something the kernel should try to prevent somehow.
>
> [SNIP]
>
>         I think if you start using the same drm_gpuvm for multiple devices you
>
>         will sooner or later start to run into the same mess we have seen with
>
>         KFD, where we moved more and more functionality from the KFD to the DRM
>
>         render node because we found that a lot of the stuff simply doesn't work
>
>         correctly with a single object to maintain the state.
>
>     As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.
>
>
> Yes, I'm perfectly aware of that. And I can only repeat myself that I 
> see this design as a rather extreme failure. And I think it's one of 
> the reasons why NVidia is so dominant with Cuda.
>
> This whole approach KFD takes was designed with the idea of extending 
> the CPU process into the GPUs, but this idea only works for a few use 
> cases and is not something we should apply to drivers in general.
>
> A very good example are virtualization use cases where you end up with 
> CPU address != GPU address because the VAs are actually coming from 
> the guest VM and not the host process.
>
> I don’t get the problem here. For us, under virtualization, both the 
> cpu address and gpu virtual address operated in xekmd is guest virtual 
> address. They can still share the same virtual address space (as SVM 
> required)
>
> Oak
>
>
>
> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should 
> not have any influence on the design of the kernel UAPI.
>
> If you want to do something similar as KFD for Xe I think you need to 
> get explicit permission to do this from Dave and Daniel and maybe even 
> Linus.
>
> Regards,
> Christian.
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-24  8:33             ` Christian König
  2024-01-25  1:17               ` Zeng, Oak
  2024-01-25 16:42               ` Zeng, Oak
@ 2024-01-25 18:32               ` Daniel Vetter
  2024-01-25 21:02                 ` Zeng, Oak
                                   ` (2 more replies)
  2024-02-23 20:12               ` Zeng, Oak
  3 siblings, 3 replies; 123+ messages in thread
From: Daniel Vetter @ 2024-01-25 18:32 UTC (permalink / raw)
  To: Christian König
  Cc: Brost, Matthew, Thomas.Hellstrom, Felix Kuehling, Welty, Brian,
	Ghimiray, Himal Prasad, Zeng, Oak, Gupta, saurabhg,
	Danilo Krummrich, dri-devel, Daniel Vetter, Bommu, Krishnaiah,
	Dave Airlie, Vishwanathapura, Niranjana, intel-xe

On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> > [SNIP]
> > Yes most API are per device based.
> > 
> > One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.
> 
> Yeah and that was a big mistake in my opinion. We should really not do that
> ever again.
> 
> > Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).
> 
> Exactly that thinking is what we have currently found as blocker for a
> virtualization projects. Having SVM as device independent feature which
> somehow ties to the process address space turned out to be an extremely bad
> idea.
> 
> The background is that this only works for some use cases but not all of
> them.
> 
> What's working much better is to just have a mirror functionality which says
> that a range A..B of the process address space is mapped into a range C..D
> of the GPU address space.
> 
> Those ranges can then be used to implement the SVM feature required for
> higher level APIs and not something you need at the UAPI or even inside the
> low level kernel memory management.
> 
> When you talk about migrating memory to a device you also do this on a per
> device basis and *not* tied to the process address space. If you then get
> crappy performance because userspace gave contradicting information where to
> migrate memory then that's a bug in userspace and not something the kernel
> should try to prevent somehow.
> 
> [SNIP]
> > > I think if you start using the same drm_gpuvm for multiple devices you
> > > will sooner or later start to run into the same mess we have seen with
> > > KFD, where we moved more and more functionality from the KFD to the DRM
> > > render node because we found that a lot of the stuff simply doesn't work
> > > correctly with a single object to maintain the state.
> > As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.
> 
> Yes, I'm perfectly aware of that. And I can only repeat myself that I see
> this design as a rather extreme failure. And I think it's one of the reasons
> why NVidia is so dominant with Cuda.
> 
> This whole approach KFD takes was designed with the idea of extending the
> CPU process into the GPUs, but this idea only works for a few use cases and
> is not something we should apply to drivers in general.
> 
> A very good example are virtualization use cases where you end up with CPU
> address != GPU address because the VAs are actually coming from the guest VM
> and not the host process.
> 
> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have
> any influence on the design of the kernel UAPI.
> 
> If you want to do something similar as KFD for Xe I think you need to get
> explicit permission to do this from Dave and Daniel and maybe even Linus.

I think the one and only exception where an SVM uapi like in kfd makes
sense, is if the _hardware_ itself, not the software stack defined
semantics that you've happened to build on top of that hw, enforces a 1:1
mapping with the cpu process address space.

Which means your hardware is using PASID, IOMMU based translation, PCI-ATS
(address translation services) or whatever your hw calls it and has _no_
device-side pagetables on top. Which from what I've seen all devices with
device-memory have, simply because they need some place to store whether
that memory is currently in device memory or should be translated using
PASID. Currently there's no gpu that works with PASID only, but there are
some on-cpu-die accelerator things that do work like that.

Maybe in the future there will be some accelerators that are fully cpu
cache coherent (including atomics) with something like CXL, and the
on-device memory is managed as normal system memory with struct page as
ZONE_DEVICE and accelerator va -> physical address translation is only
done with PASID ... but for now I haven't seen that, definitely not in
upstream drivers.

And the moment you have some per-device pagetables or per-device memory
management of some sort (like using gpuva mgr) then I'm 100% agreeing with
Christian that the kfd SVM model is too strict and not a great idea.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-25 17:15                 ` Making " Felix Kuehling
@ 2024-01-25 18:37                   ` Zeng, Oak
  2024-01-26 13:23                     ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-25 18:37 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, Danilo Krummrich,
	Dave Airlie, Daniel Vetter, Shah, Ankur N, Winiarski, Michal
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe



> -----Original Message-----
> From: Felix Kuehling <felix.kuehling@amd.com>
> Sent: Thursday, January 25, 2024 12:16 PM
> To: Zeng, Oak <oak.zeng@intel.com>; Christian König
> <christian.koenig@amd.com>; Danilo Krummrich <dakr@redhat.com>; Dave
> Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Shah, Ankur N
> <ankur.n.shah@intel.com>; Winiarski, Michal <michal.winiarski@intel.com>
> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> 
> On 2024-01-24 20:17, Zeng, Oak wrote:
> >
> > Hi Christian,
> >
> > Even though I mentioned KFD design, I didn’t mean to copy the KFD
> > design. I also had hard time to understand the difficulty of KFD under
> > virtualization environment.
> >
> The problem with virtualization is related to virtualization design
> choices. There is a single process that proxies requests from multiple
> processes in one (or more?) VMs to the GPU driver. That means, we need a
> single process with multiple contexts (and address spaces). One proxy
> process on the host must support multiple guest address spaces.

My first response is: why can't processes in the virtual machine open the /dev/kfd device themselves?

Also try to picture why the base amdgpu driver (which is per hardware device) doesn't have this problem... creating multiple contexts under a single amdgpu device, each context servicing one guest process?
> 
> I don't know much more than these very high level requirements, and I
> only found out about those a few weeks ago. Due to my own bias I can't
> comment whether there are bad design choices in the proxy architecture
> or in KFD or both. The way we are considering fixing this, is to enable
> creating multiple KFD contexts in the same process. Each of those
> contexts will still represent a shared virtual address space across
> devices (but not the CPU). Because the device address space is not
> shared with the CPU, we cannot support our SVM API in this situation.
> 

One kfd process with multiple contexts, each context having a shared address space across devices.... I do see some complications 😊

> I still believe that it makes sense to have the kernel mode driver aware
> of a shared virtual address space at some level. A per-GPU API and an
> API that doesn't require matching CPU and GPU virtual addresses would
> enable more flexibility at the cost duplicate information tracking for
> multiple devices and duplicate overhead for things like MMU notifiers
> and interval tree data structures. Having to coordinate multiple devices
> with potentially different address spaces would probably make it more
> awkward to implement memory migration. The added flexibility would go
> mostly unused, except in some very niche applications.
> 
> Regards,
>    Felix
> 
> 
> > For us, Xekmd doesn't need to know it is running under bare metal or
> > virtualized environment. Xekmd is always a guest driver. All the
> > virtual address used in xekmd is guest virtual address. For SVM, we
> > require all the VF devices share one single shared address space with
> > guest CPU program. So all the design works in bare metal environment
> > can automatically work under virtualized environment. +@Shah, Ankur N
> > <mailto:ankur.n.shah@intel.com> +@Winiarski, Michal
> > <mailto:michal.winiarski@intel.com> to backup me if I am wrong.
> >
> > Again, shared virtual address space b/t cpu and all gpu devices is a
> > hard requirement for our system allocator design (which means
> > malloc’ed memory, cpu stack variables, globals can be directly used in
> > gpu program. Same requirement as kfd SVM design). This was aligned
> > with our user space software stack.
> >
> > For anyone who want to implement system allocator, or SVM, this is a
> > hard requirement. I started this thread hoping I can leverage the
> > drm_gpuvm design to manage the shared virtual address space (as the
> > address range split/merge function was scary to me and I didn’t want
> > re-invent). I guess my takeaway from this you and Danilo is this
> > approach is a NAK. Thomas also mentioned to me drm_gpuvm is a overkill
> > for our svm address range split/merge. So I will make things work
> > first by manage address range xekmd internally. I can re-look
> > drm-gpuvm approach in the future.
> >
> > Maybe a pseudo user program can illustrate our programming model:
> >
> > Fd0 = open(card0)
> >
> > Fd1 = open(card1)
> >
> > Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's
> > first vm_create
> >
> > Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called
> > from same process
> >
> > Queue0 = xe_exec_queue_create(fd0, vm0)
> >
> > Queue1 = xe_exec_queue_create(fd1, vm1)
> >
> > //check p2p capability calling L0 API….
> >
> > ptr = malloc()//this replace bo_create, vm_bind, dma-import/export
> >
> > Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0
> >
> > Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1
> >
> > //Gpu page fault handles memory allocation/migration/mapping to gpu
> >
> > As you can see, from above model, our design is a little bit different
> > than the KFD design. user need to explicitly create gpuvm (vm0 and vm1
> > above) for each gpu device. Driver internally have a xe_svm represent
> > the shared address space b/t cpu and multiple gpu devices. But end
> > user doesn’t see and no need to create xe_svm. The shared virtual
> > address space is really managed by linux core mm (through the vma
> > struct, mm_struct etc). From each gpu device’s perspective, it just
> > operate under its own gpuvm, not aware of the existence of other
> > gpuvm, even though in reality all those gpuvm shares a same virtual
> > address space.
> >
> > See one more comment inline
> >
> > *From:*Christian König <christian.koenig@amd.com>
> > *Sent:* Wednesday, January 24, 2024 3:33 AM
> > *To:* Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich
> > <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
> > <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
> > *Cc:* Welty, Brian <brian.welty@intel.com>;
> > dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com;
> > Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>;
> > Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg
> > <saurabhg.gupta@intel.com>
> > *Subject:* Re: Making drm_gpuvm work across gpu devices
> >
> > Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> >
> >     [SNIP]
> >
> >     Yes most API are per device based.
> >
> >     One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
> function, it is per-process based. Each kfd_process represent a process across N
> gpu devices.
> >
> >
> > Yeah and that was a big mistake in my opinion. We should really not do
> > that ever again.
> >
> >
> >     Need to say, kfd SVM represent a shared virtual address space across CPU
> and all GPU devices on the system. This is by the definition of SVM (shared virtual
> memory). This is very different from our legacy gpu *device* driver which works
> for only one device (i.e., if you want one device to access another device's
> memory, you will have to use dma-buf export/import etc).
> >
> >
> > Exactly that thinking is what we have currently found as blocker for a
> > virtualization projects. Having SVM as device independent feature
> > which somehow ties to the process address space turned out to be an
> > extremely bad idea.
> >
> > The background is that this only works for some use cases but not all
> > of them.
> >
> > What's working much better is to just have a mirror functionality
> > which says that a range A..B of the process address space is mapped
> > into a range C..D of the GPU address space.
> >
> > Those ranges can then be used to implement the SVM feature required
> > for higher level APIs and not something you need at the UAPI or even
> > inside the low level kernel memory management.
> >
> > When you talk about migrating memory to a device you also do this on a
> > per device basis and *not* tied to the process address space. If you
> > then get crappy performance because userspace gave contradicting
> > information where to migrate memory then that's a bug in userspace and
> > not something the kernel should try to prevent somehow.
> >
> > [SNIP]
> >
> >         I think if you start using the same drm_gpuvm for multiple devices you
> >
> >         will sooner or later start to run into the same mess we have seen with
> >
> >         KFD, where we moved more and more functionality from the KFD to the
> DRM
> >
> >         render node because we found that a lot of the stuff simply doesn't work
> >
> >         correctly with a single object to maintain the state.
> >
> >     As I understand it, KFD is designed to work across devices. A single pseudo
> /dev/kfd device represent all hardware gpu devices. That is why during kfd open,
> many pdd (process device data) is created, each for one hardware device for this
> process.
> >
> >
> > Yes, I'm perfectly aware of that. And I can only repeat myself that I
> > see this design as a rather extreme failure. And I think it's one of
> > the reasons why NVidia is so dominant with Cuda.
> >
> > This whole approach KFD takes was designed with the idea of extending
> > the CPU process into the GPUs, but this idea only works for a few use
> > cases and is not something we should apply to drivers in general.
> >
> > A very good example are virtualization use cases where you end up with
> > CPU address != GPU address because the VAs are actually coming from
> > the guest VM and not the host process.
> >
> > I don’t get the problem here. For us, under virtualization, both the
> > cpu address and gpu virtual address operated in xekmd is guest virtual
> > address. They can still share the same virtual address space (as SVM
> > required)
> >
> > Oak
> >
> >
> >
> > SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should
> > not have any influence on the design of the kernel UAPI.
> >
> > If you want to do something similar as KFD for Xe I think you need to
> > get explicit permission to do this from Dave and Daniel and maybe even
> > Linus.
> >
> > Regards,
> > Christian.
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-25 18:32               ` Daniel Vetter
@ 2024-01-25 21:02                 ` Zeng, Oak
  2024-01-26  8:21                 ` Thomas Hellström
  2024-01-29 15:03                 ` Felix Kuehling
  2 siblings, 0 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-25 21:02 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: Brost, Matthew, Thomas.Hellstrom, Felix Kuehling, Welty, Brian,
	Ghimiray, Himal Prasad, dri-devel, Gupta, saurabhg,
	Danilo Krummrich, Bommu, Krishnaiah, Dave Airlie,
	Vishwanathapura, Niranjana, intel-xe



> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Thursday, January 25, 2024 1:33 PM
> To: Christian König <christian.koenig@amd.com>
> Cc: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>;
> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix
> Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-
> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost,
> Matthew <matthew.brost@intel.com>; Gupta, saurabhg
> <saurabhg.gupta@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
> > Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> > > [SNIP]
> > > Yes most API are per device based.
> > >
> > > One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
> function, it is per-process based. Each kfd_process represent a process across N
> gpu devices.
> >
> > Yeah and that was a big mistake in my opinion. We should really not do that
> > ever again.
> >
> > > Need to say, kfd SVM represent a shared virtual address space across CPU
> and all GPU devices on the system. This is by the definition of SVM (shared virtual
> memory). This is very different from our legacy gpu *device* driver which works
> for only one device (i.e., if you want one device to access another device's
> memory, you will have to use dma-buf export/import etc).
> >
> > Exactly that thinking is what we have currently found as blocker for a
> > virtualization projects. Having SVM as device independent feature which
> > somehow ties to the process address space turned out to be an extremely bad
> > idea.
> >
> > The background is that this only works for some use cases but not all of
> > them.
> >
> > What's working much better is to just have a mirror functionality which says
> > that a range A..B of the process address space is mapped into a range C..D
> > of the GPU address space.
> >
> > Those ranges can then be used to implement the SVM feature required for
> > higher level APIs and not something you need at the UAPI or even inside the
> > low level kernel memory management.
> >
> > When you talk about migrating memory to a device you also do this on a per
> > device basis and *not* tied to the process address space. If you then get
> > crappy performance because userspace gave contradicting information where
> to
> > migrate memory then that's a bug in userspace and not something the kernel
> > should try to prevent somehow.
> >
> > [SNIP]
> > > > I think if you start using the same drm_gpuvm for multiple devices you
> > > > will sooner or later start to run into the same mess we have seen with
> > > > KFD, where we moved more and more functionality from the KFD to the
> DRM
> > > > render node because we found that a lot of the stuff simply doesn't work
> > > > correctly with a single object to maintain the state.
> > > As I understand it, KFD is designed to work across devices. A single pseudo
> /dev/kfd device represent all hardware gpu devices. That is why during kfd open,
> many pdd (process device data) is created, each for one hardware device for this
> process.
> >
> > Yes, I'm perfectly aware of that. And I can only repeat myself that I see
> > this design as a rather extreme failure. And I think it's one of the reasons
> > why NVidia is so dominant with Cuda.
> >
> > This whole approach KFD takes was designed with the idea of extending the
> > CPU process into the GPUs, but this idea only works for a few use cases and
> > is not something we should apply to drivers in general.
> >
> > A very good example are virtualization use cases where you end up with CPU
> > address != GPU address because the VAs are actually coming from the guest
> VM
> > and not the host process.
> >
> > SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have
> > any influence on the design of the kernel UAPI.
> >
> > If you want to do something similar as KFD for Xe I think you need to get
> > explicit permission to do this from Dave and Daniel and maybe even Linus.
> 
> I think the one and only one exception where an SVM uapi like in kfd makes
> sense, is if the _hardware_ itself, not the software stack defined
> semantics that you've happened to build on top of that hw, enforces a 1:1
> mapping with the cpu process address space.
> 
> Which means your hardware is using PASID, IOMMU based translation, PCI-ATS
> (address translation services) or whatever your hw calls it and has _no_
> device-side pagetables on top. Which from what I've seen all devices with
> device-memory have, simply because they need some place to store whether
> that memory is currently in device memory or should be translated using
> PASID. Currently there's no gpu that works with PASID only, but there are
> some on-cpu-die accelerator things that do work like that.
> 
> Maybe in the future there will be some accelerators that are fully cpu
> cache coherent (including atomics) with something like CXL, and the
> on-device memory is managed as normal system memory with struct page as
> ZONE_DEVICE and accelerator va -> physical address translation is only
> done with PASID ... but for now I haven't seen that, definitely not in
> upstream drivers.
> 
> And the moment you have some per-device pagetables or per-device memory
> management of some sort (like using gpuva mgr) then I'm 100% agreeing with
> Christian that the kfd SVM model is too strict and not a great idea.
> 


A GPU is nothing more than a piece of HW that accelerates parts of a program, just like an extra CPU core. From this perspective, a unified virtual address space across the CPU and all GPU devices (and any other accelerators) is always more convenient to program than a split address space between devices.

In reality, GPU programming started from a split address space. HMM is designed to provide a unified virtual address space without the advanced hardware features you listed above.

I am aware that Nvidia's new hardware platforms such as Grace Hopper natively support the Unified Memory programming model through hardware-based memory coherence among all CPUs and GPUs. For such systems, HMM is not required.

You can think of HMM as a software-based solution for providing a unified address space between the cpu and devices. Both AMD and Nvidia have been providing a unified address space through hmm. I think it is still valuable.
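
For what it is worth, the software side of this is the MEMORY_DEVICE_PRIVATE machinery: device memory gets struct pages so core mm can migrate to and from it. A rough sketch (vr and the pagemap ops are illustrative names, not the actual series code; real setup also reserves a free physical range for the pfns, e.g. via request_free_mem_region()):

	struct dev_pagemap *pgmap = &vr->pagemap;	/* vr: driver's vram region */
	void *p;

	pgmap->type = MEMORY_DEVICE_PRIVATE;	/* vram is not cpu-addressable here */
	pgmap->range.start = vr->hpa_base;
	pgmap->range.end = vr->hpa_base + vr->size - 1;
	pgmap->nr_range = 1;
	pgmap->ops = &vr_pagemap_ops;	/* supplies .migrate_to_ram for cpu faults */
	p = devm_memremap_pages(dev, pgmap);	/* creates ZONE_DEVICE struct pages */
	if (IS_ERR(p))
		return PTR_ERR(p);
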

Regards,
Oak  



> Cheers, Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-24  3:57               ` Zeng, Oak
  2024-01-24  4:14                 ` Zeng, Oak
@ 2024-01-25 22:13                 ` Danilo Krummrich
  1 sibling, 0 replies; 123+ messages in thread
From: Danilo Krummrich @ 2024-01-25 22:13 UTC (permalink / raw)
  To: Zeng, Oak, Christian König, Dave Airlie, Daniel Vetter,
	Felix Kuehling, Welty, Brian
  Cc: Brost, Matthew, Thomas.Hellstrom, dri-devel, Ghimiray,
	Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

On 1/24/24 04:57, Zeng, Oak wrote:
> Thanks a lot Danilo.
> 
> Maybe I wasn't clear enough. In the solution I proposed, each device still have separate vm/page tables. Each device still need to manage the mapping, page table flags etc. It is just in svm use case, all devices share one drm_gpuvm instance. As I understand it, drm_gpuvm's main function is the va range split and merging. I don't see why it doesn't work across gpu devices.

I'm pretty sure it does work. You can indeed use GPUVM for tracking mappings using
the split and merge feature only, ignoring all other features it provides. However,
I don't think it's a good idea to have a single GPUVM instance to track the memory
mappings of different devices with different page tables, different object lifetimes,
etc.

> 
> But I read more about drm_gpuvm. Its split/merge functions take a drm_gem_object parameter, see drm_gpuvm_sm_map_ops_create and drm_gpuvm_sm_map. Actually the whole of drm_gpuvm is designed for BO-centric drivers; for example, it has a drm_gpuvm_bo concept to keep track of the 1 BO : N gpuva mapping. The whole purpose of leveraging drm_gpuvm is to re-use the va split/merge functions for SVM. But in our SVM implementation, there is no buffer object at all. So I don't think our SVM code can leverage drm_gpuvm.

That's all optional features. As mentioned above, you can use GPUVM for tracking mappings
using the split and merge feature only. The drm_gem_object parameter in
drm_gpuvm_sm_map_ops_create() can simply be NULL. Afaik, Xe already does that for its
userptr stuff. But again, I don't think it's a good idea to track memory mappings of
multiple independent physical devices and driver instances in a single place,
whether you use GPUVM or a custom implementation.
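A rough sketch of that NULL-object usage (against the in-tree drm_gpuvm API; error handling trimmed and the per-op bodies are only placeholders):

#include <drm/drm_gpuvm.h>
#include <linux/err.h>

static int svm_track_mapping(struct drm_gpuvm *gpuvm, u64 addr, u64 range)
{
	struct drm_gpuva_ops *ops;
	struct drm_gpuva_op *op;

	/* NULL GEM object: only the va split/merge steps are generated. */
	ops = drm_gpuvm_sm_map_ops_create(gpuvm, addr, range, NULL, 0);
	if (IS_ERR(ops))
		return PTR_ERR(ops);

	drm_gpuva_for_each_op(op, ops) {
		switch (op->op) {
		case DRM_GPUVA_OP_MAP:
			/* insert the new va, program device page tables */
			break;
		case DRM_GPUVA_OP_REMAP:
			/* split an existing va around the new range */
			break;
		case DRM_GPUVA_OP_UNMAP:
			/* tear down a fully overlapped va */
			break;
		default:
			break;
		}
	}

	drm_gpuva_ops_free(gpuvm, ops);
	return 0;
}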

- Danilo

> 
> I will give up this approach, unless Matt or Brian can see a way.
> 
> A few replies inline.... @Welty, Brian, I had more thoughts inline on one of your original questions....
> 
>> -----Original Message-----
>> From: Danilo Krummrich <dakr@redhat.com>
>> Sent: Tuesday, January 23, 2024 6:57 PM
>> To: Zeng, Oak <oak.zeng@intel.com>; Christian König
>> <christian.koenig@amd.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
>> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
>> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
>> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
>> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
>> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
>> <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>> Hi Oak,
>>
>> On 1/23/24 20:37, Zeng, Oak wrote:
>>> Thanks Christian. I have some comment inline below.
>>>
>>> Danilo, can you also take a look and give your feedback? Thanks.
>>
>> I agree with everything Christian already wrote. Except for the KFD parts, which
>> I'm simply not familiar with, I had exactly the same thoughts after reading your
>> initial mail.
>>
>> Please find some more comments below.
>>
>>>
>>>> -----Original Message-----
>>>> From: Christian König <christian.koenig@amd.com>
>>>> Sent: Tuesday, January 23, 2024 6:13 AM
>>>> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>;
>>>> Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
>>>> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org;
>> intel-
>>>> xe@lists.freedesktop.org; Bommu, Krishnaiah
>> <krishnaiah.bommu@intel.com>;
>>>> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>>>> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
>>>> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
>>>> <matthew.brost@intel.com>
>>>> Subject: Re: Making drm_gpuvm work across gpu devices
>>>>
>>>> Hi Oak,
>>>>
>>>> On 23.01.24 04:21, Zeng, Oak wrote:
>>>>> Hi Danilo and all,
>>>>>
>>>>> During the work of Intel's SVM code, we came up the idea of making
>>>> drm_gpuvm to work across multiple gpu devices. See some discussion here:
>>>> https://lore.kernel.org/dri-
>>>>
>> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
>>>> 11.prod.outlook.com/
>>>>>
>>>>> The reason we try to do this is, for a SVM (shared virtual memory across cpu
>>>> program and all gpu program on all gpu devices) process, the address space
>> has
>>>> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
>>>> then our SVM code can leverage drm_gpuvm as well.
>>>>>
>>>>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
>>>> drm_device *drm pointer a lot. This param is used only for printing/warning.
>> So I
>>>> think maybe we can delete this drm field from drm_gpuvm.
>>>>>
>>>>> This way, on a multiple gpu device system, for one process, we can have only
>>>> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
>>>> each gpu device).
>>>>>
>>>>> What do you think?
>>>>
>>>> Well from the GPUVM side I don't think it would make much difference if
>>>> we have the drm device or not.
>>>>
>>>> But the experience we had with the KFD I think I should mention that we
>>>> should absolutely *not* deal with multiple devices at the same time in
>>>> the UAPI or VM objects inside the driver.
>>>>
>>>> The background is that all the APIs inside the Linux kernel are build
>>>> around the idea that they work with only one device at a time. This
>>>> accounts for both low level APIs like the DMA API as well as pretty high
>>>> level things like for example file system address space etc...
>>>
>>> Yes most API are per device based.
>>>
>>> One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
>> function, it is per-process based. Each kfd_process represent a process across N
>> gpu devices. Cc Felix.
>>>
>>> Need to say, kfd SVM represent a shared virtual address space across CPU and
>> all GPU devices on the system. This is by the definition of SVM (shared virtual
>> memory). This is very different from our legacy gpu *device* driver which works
>> for only one device (i.e., if you want one device to access another device's
>> memory, you will have to use dma-buf export/import etc).
>>>
>>> We have the same design requirement of SVM. For anyone who want to
>> implement the SVM concept, this is a hard requirement. Since now drm has the
>> drm_gpuvm concept which strictly speaking is designed for one device, I want to
>> see whether we can extend drm_gpuvm to make it work for both single device
>> (as used in xe) and multipe devices (will be used in the SVM code). That is why I
>> brought up this topic.
>>>
>>>>
>>>> So when you have multiple GPUs you either have an inseparable cluster of
>>>> them which case you would also only have one drm_device. Or you have
>>>> separated drm_device which also results in separate drm render nodes and
>>>> separate virtual address spaces and also eventually separate IOMMU
>>>> domains which gives you separate dma_addresses for the same page and so
>>>> separate GPUVM page tables....
>>>
>>> I am thinking we can still make each device has its separate drm_device/render
>> node/iommu domains/gpu page table. Just as what we have today. I am not plan
>> to change this picture.
>>>
>>> But the virtual address space will support two modes of operation:
>>> 1. one drm_gpuvm per device. This is when svm is not in the picture
>>> 2. all devices in the process share one single drm_gpuvm, when svm is in the
>> picture. In xe driver design, we have to support a mixture use of legacy mode
>> (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu
>> submission). So whenever SVM is in the picture, we want one single process
>> address space across all devices. Drm_gpuvm doesn't need to be aware of those
>> two operation modes. It is driver's responsibility to use different mode.
>>>
>>> For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit
>> from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series:
>> https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/)
>> can inherit from drm_gpuvm while each xe_vm (still a per-device based struct)
>> will just have a pointer to the drm_gpuvm structure. This way when svm is in play,
>> we build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means
>> shared address space across gpu devices.
>>
>> With a shared GPUVM structure, how do you track actual per device resources
>> such as
>> page tables? You also need to consider that the page table layout, memory
>> mapping
>> flags may vary from device to device due to different GPU chipsets or revisions.
> 
> The per-device page tables, flags etc. are still managed per device; that is the xe_vm's job in the xekmd driver.
> 
>>
>> Also, if you replace the shared GPUVM structure with a pointer to a shared one,
>> you may run into all kinds of difficulties due to increasing complexity in terms
>> of locking, synchronization, lifetime and potential unwind operations in error
>> paths.
>> I haven't thought it through yet, but I wouldn't be surprised entirely if there are
>> cases where you simply run into circular dependencies.
> 
> Makes sense; I can't see through this without proof-of-concept code either.
> 
>>
>> Also, looking at the conversation in the linked patch series:
>>
>> <snip>
>>
>>>> For example as hmm_range_fault brings a range from host into GPU address
>>>> space,  what if it was already allocated and in use by VM_BIND for
>>>> a GEM_CREATE allocated buffer?    That is of course application error,
>>>> but KMD needs to detect it, and provide one single managed address
>>>> space across all allocations from the application....
>>
>>> This is very good question. Yes agree we should check this application error.
>> Fortunately this is doable. All vm_bind virtual address range are tracked in
>> xe_vm/drm_gpuvm struct. In this case, we should iterate the drm_gpuvm's rb
>> tree of *all* gpu devices (as xe_vm is for one device only) to see whether there
>> is a conflict. Will make the change soon.
>>
>> <snip>
>>
>> How do you do that if xe_vm->gpuvm is just a pointer to the GPUVM structure
>> within xe_svm?
> 
> In the proposed approach, we have a single drm_gpuvm instance for one process. All devices' xe_vm point to this drm_gpuvm instance. This drm_gpuvm's rb tree maintains all the va ranges we have in this process. We can just walk this rb tree to see if there is a conflict.
> 
> But I didn't answer Brian's question completely... In a mixed use of vm_bind and malloc/mmap, the virtual address used by vm_bind should first be reserved in user space using mmap. So all valid virtual addresses should be tracked by linux kernel vma structs.
> 
> Both vm_bind'ed and malloc'ed virtual addresses can cause a gpu page fault. Our fault handler should first check whether the faulting address is a vm_bind va and service the fault accordingly; if not, serve the fault in the SVM path; if the SVM path also fails, it is an invalid address. So from the user perspective, the user can use either:
> Ptr = mmap()
> Vm_bind(ptr, bo)
> Submit gpu kernel using ptr
> Or:
> Ptr = mmap()
> Submit gpu kernel using ptr
> Whether vm_bind is called or not decides the gpu fault handler code path; a rough sketch of that dispatch follows below. Hopefully this answers @Welty, Brian's original question.
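A rough sketch of that dispatch order (all types and helper names below are hypothetical, chosen for illustration; they are not the series' actual API):

#include <linux/types.h>

struct xe_vm;
struct xe_svm;

/* Hypothetical helpers standing in for the real fault handling code. */
bool xe_vm_has_bound_range(struct xe_vm *vm, u64 addr);
int xe_vm_fault_bound_range(struct xe_vm *vm, u64 addr);
int xe_svm_handle_fault(struct xe_svm *svm, u64 addr);

static int gpu_fault_dispatch(struct xe_vm *vm, struct xe_svm *svm,
			      u64 fault_addr)
{
	/* 1. vm_bind ranges are tracked in xe_vm/drm_gpuvm: try them first. */
	if (xe_vm_has_bound_range(vm, fault_addr))
		return xe_vm_fault_bound_range(vm, fault_addr);

	/* 2. Not a bound range: assume malloc'ed/mmap'ed memory, SVM path. */
	if (!xe_svm_handle_fault(svm, fault_addr))
		return 0;

	/* 3. Neither path claims the address: invalid access. */
	return -EFAULT;
}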
> 
> 
>>
>>>
>>> This requires some changes of drm_gpuvm design:
>>> 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL,
>> means this drm_gpuvm is not for specific gpu device
>>> 2. The common dma_resv object: drm_gem_object *r_obj. *Does one
>> dma_resv object allocated/initialized for one device work for all devices*? From
>> first look, dma_resv is just some CPU structure maintaining dma-fences. So I
>> guess it should work for all devices? I definitely need you to comment.
>>
>> The general rule is that drivers can share the common dma_resv across GEM
>> objects that
>> are only mapped within the VM owning the dma_resv, but never within another
>> VM.
>>
>> Now, your question is whether multiple VMs can share the same common
>> dma_resv. I think
>> that calls for trouble, since it would create dependencies that simply aren't
>> needed
>> and might even introduce locking issues.
>>
>> However, that's optional, you can simply decide to not make use of the common
>> dma_resv
>> and all the optimizations based on it.
> 
> Ok, got it.
>>
>>>
>>>
>>>>
>>>> It's up to you how to implement it, but I think it's pretty clear that
>>>> you need separate drm_gpuvm objects to manage those.
>>>
>>> As explained above, I am thinking of one drm_gpuvm object across all devices
>> when SVM is in the picture...
>>>
>>>>
>>>> That you map the same thing in all those virtual address spaces at the
>>>> same address is a completely different optimization problem I think.
>>>
>>> Not sure I follow here... the requirement from SVM is, one virtual address
>> points to same physical backing store. For example, whenever CPU or any GPU
>> device access this virtual address, it refers to the same physical content. Of
>> course the physical backing store can be migrated b/t host memory and any of
>> the GPU's device memory, but the content should be consistent.
>>
>> Technically, multiple different GPUs will have separate virtual address spaces, it's
>> just that you create mappings within all of them such that the same virtual
>> address
>> resolves to the same physical content on all of them.
>>
>> So, having a single GPUVM instance representing all of them might give the
>> illusion of
>> a single unified address space, but you still need to maintain each device's
>> address
>> space backing resources, such as page tables, separately.
> 
> Yes agreed.
> 
> Regards,
> Oak
>>
>> - Danilo
>>
>>>
>>> So we are mapping same physical content to the same virtual address in either
>> cpu page table or any gpu device's page table...
>>>
>>>> What we could certainly do is to optimize hmm_range_fault by making
>>>> hmm_range a reference counted object and using it for multiple devices
>>>> at the same time if those devices request the same range of an mm_struct.
>>>>
>>>
>>> Not very follow. If you are trying to resolve a multiple devices concurrent access
>> problem, I think we should serialize concurrent device fault to one address range.
>> The reason is, during device fault handling, we might migrate the backing store so
>> hmm_range->hmm_pfns[] might have changed after one device access it.
>>>
>>>> I think if you start using the same drm_gpuvm for multiple devices you
>>>> will sooner or later start to run into the same mess we have seen with
>>>> KFD, where we moved more and more functionality from the KFD to the DRM
>>>> render node because we found that a lot of the stuff simply doesn't work
>>>> correctly with a single object to maintain the state.
>>>
>>> As I understand it, KFD is designed to work across devices. A single pseudo
>> /dev/kfd device represent all hardware gpu devices. That is why during kfd open,
>> many pdd (process device data) is created, each for one hardware device for this
>> process. Yes the codes are a little complicated.
>>>
>>> Kfd manages the shared virtual address space in the kfd driver codes, like the
>> split, merging etc. Here I am looking whether we can leverage the drm_gpuvm
>> code for those functions.
>>>
>>> As of the shared virtual address space across gpu devices, it is a hard
>> requirement for svm/system allocator (aka malloc for gpu program). We need to
>> make it work either at driver level or drm_gpuvm level. Drm_gpuvm is better
>> because the work can be shared b/t drivers.
>>>
>>> Thanks a lot,
>>> Oak
>>>
>>>>
>>>> Just one more point to your original discussion on the xe list: I think
>>>> it's perfectly valid for an application to map something at the same
>>>> address you already have something else.
>>>>
>>>> Cheers,
>>>> Christian.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Oak
>>>
> 



* Re: Making drm_gpuvm work across gpu devices
  2024-01-25 18:32               ` Daniel Vetter
  2024-01-25 21:02                 ` Zeng, Oak
@ 2024-01-26  8:21                 ` Thomas Hellström
  2024-01-26 12:52                   ` Christian König
  2024-01-29 15:03                 ` Felix Kuehling
  2 siblings, 1 reply; 123+ messages in thread
From: Thomas Hellström @ 2024-01-26  8:21 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: Brost, Matthew, Dave Airlie, Felix Kuehling, Welty, Brian, Zeng,
	Oak, Ghimiray, Himal Prasad, dri-devel, Bommu, Krishnaiah, Gupta,
	saurabhg, Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

Hi, all

On Thu, 2024-01-25 at 19:32 +0100, Daniel Vetter wrote:
> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
> > On 23.01.24 20:37, Zeng, Oak wrote:
> > > [SNIP]
> > > Yes most API are per device based.
> > > 
> > > One exception I know is actually the kfd SVM API. If you look at
> > > the svm_ioctl function, it is per-process based. Each kfd_process
> > > represent a process across N gpu devices.
> > 
> > Yeah and that was a big mistake in my opinion. We should really not
> > do that
> > ever again.
> > 
> > > Need to say, kfd SVM represent a shared virtual address space
> > > across CPU and all GPU devices on the system. This is by the
> > > definition of SVM (shared virtual memory). This is very different
> > > from our legacy gpu *device* driver which works for only one
> > > device (i.e., if you want one device to access another device's
> > > memory, you will have to use dma-buf export/import etc).
> > 
> > Exactly that thinking is what we have currently found as blocker
> > for a
> > virtualization projects. Having SVM as device independent feature
> > which
> > somehow ties to the process address space turned out to be an
> > extremely bad
> > idea.
> > 
> > The background is that this only works for some use cases but not
> > all of
> > them.
> > 
> > What's working much better is to just have a mirror functionality
> > which says
> > that a range A..B of the process address space is mapped into a
> > range C..D
> > of the GPU address space.
> > 
> > Those ranges can then be used to implement the SVM feature required
> > for
> > higher level APIs and not something you need at the UAPI or even
> > inside the
> > low level kernel memory management.
> > 
> > When you talk about migrating memory to a device you also do this
> > on a per
> > device basis and *not* tied to the process address space. If you
> > then get
> > crappy performance because userspace gave contradicting information
> > where to
> > migrate memory then that's a bug in userspace and not something the
> > kernel
> > should try to prevent somehow.
> > 
> > [SNIP]
> > > > I think if you start using the same drm_gpuvm for multiple
> > > > devices you
> > > > will sooner or later start to run into the same mess we have
> > > > seen with
> > > > KFD, where we moved more and more functionality from the KFD to
> > > > the DRM
> > > > render node because we found that a lot of the stuff simply
> > > > doesn't work
> > > > correctly with a single object to maintain the state.
> > > As I understand it, KFD is designed to work across devices. A
> > > single pseudo /dev/kfd device represent all hardware gpu devices.
> > > That is why during kfd open, many pdd (process device data) is
> > > created, each for one hardware device for this process.
> > 
> > Yes, I'm perfectly aware of that. And I can only repeat myself that
> > I see
> > this design as a rather extreme failure. And I think it's one of
> > the reasons
> > why NVidia is so dominant with Cuda.
> > 
> > This whole approach KFD takes was designed with the idea of
> > extending the
> > CPU process into the GPUs, but this idea only works for a few use
> > cases and
> > is not something we should apply to drivers in general.
> > 
> > A very good example are virtualization use cases where you end up
> > with CPU
> > address != GPU address because the VAs are actually coming from the
> > guest VM
> > and not the host process.
> > 
> > SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should
> > not have
> > any influence on the design of the kernel UAPI.
> > 
> > If you want to do something similar as KFD for Xe I think you need
> > to get
> > explicit permission to do this from Dave and Daniel and maybe even
> > Linus.
> 
> I think the one and only one exception where an SVM uapi like in kfd
> makes
> sense, is if the _hardware_ itself, not the software stack defined
> semantics that you've happened to build on top of that hw, enforces a
> 1:1
> mapping with the cpu process address space.
> 
> Which means your hardware is using PASID, IOMMU based translation,
> PCI-ATS
> (address translation services) or whatever your hw calls it and has
> _no_
> device-side pagetables on top. Which from what I've seen all devices
> with
> device-memory have, simply because they need some place to store
> whether
> that memory is currently in device memory or should be translated
> using
> PASID. Currently there's no gpu that works with PASID only, but there
> are
> some on-cpu-die accelerator things that do work like that.
> 
> Maybe in the future there will be some accelerators that are fully
> cpu
> cache coherent (including atomics) with something like CXL, and the
> on-device memory is managed as normal system memory with struct page
> as
> ZONE_DEVICE and accelerator va -> physical address translation is
> only
> done with PASID ... but for now I haven't seen that, definitely not
> in
> upstream drivers.
> 
> And the moment you have some per-device pagetables or per-device
> memory
> management of some sort (like using gpuva mgr) then I'm 100% agreeing
> with
> Christian that the kfd SVM model is too strict and not a great idea.
> 
> Cheers, Sima


I'm trying to digest all the comments here. The end goal is to be able
to support something similar to this:

https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

Christian, If I understand you correctly, you're strongly suggesting
not to try to manage a common virtual address space across different
devices in the kernel, but merely providing building blocks to do so,
like for example a generalized userptr with migration support using
HMM; that way each "mirror" of the CPU mm would be per device and
inserted into the gpu_vm just like any other gpu_vma, and user-space
would dictate the A..B -> C..D mapping by choosing the GPU_VA for the
vma.

Sima, it sounds like you're suggesting to shy away from hmm and not
even attempt to support this except if it can be done using IOMMU sva
on selected hardware?

Could you clarify a bit?

Thanks,
Thomas

* Re: Making drm_gpuvm work across gpu devices
  2024-01-25  5:25                   ` Zeng, Oak
@ 2024-01-26 10:09                     ` Christian König
  2024-01-26 20:13                       ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-26 10:09 UTC (permalink / raw)
  To: Zeng, Oak, David Airlie
  Cc: Brost, Matthew, Thomas.Hellstrom, Winiarski, Michal,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, Ghimiray,
	Himal Prasad, Daniel Vetter, Bommu, Krishnaiah, Gupta, saurabhg,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

Hi Oak,

you can still use SVM, but it should not be a design criterion for the
kernel UAPI. In other words, the UAPI should be designed in such a way
that the GPU virtual address can be equal to the CPU virtual address of
a buffer, but can also be different, to support use cases where this
isn't the case.

Additionally to what Dave wrote I can summarize a few things I have 
learned while working on the AMD GPU drivers in the last decade or so:

1. Userspace requirements are *not* relevant for UAPI or even more 
general kernel driver design.

2. What should be done is to look at the hardware capabilities and try 
to expose those in a safe manner to userspace.

3. The userspace requirements are then used to validate the kernel 
driver and especially the UAPI design to ensure that nothing was missed.

The consequence of this is that nobody should ever use things like Cuda, 
Vulkan, OpenCL, OpenGL etc.. as argument to propose a certain UAPI design.

What should be done instead is to say: My hardware works in this and 
that way -> we want to expose it like this -> because that enables us to 
implement the high level API in this and that way.

Only this gives then a complete picture of how things interact together 
and allows the kernel community to influence and validate the design.

This doesn't mean that you need to throw away everything, but it gives a 
clear restriction that designs are not nailed in stone and for example 
you can't use something like a waterfall model.

Going to answer on your other questions separately.

Regards,
Christian.

On 25.01.24 06:25, Zeng, Oak wrote:
> Hi Dave,
>
> Let me step back. When I wrote "shared virtual address space b/t cpu and all gpu devices is a hard requirement for our system allocator design", I meant this is not only Intel's design requirement. Rather, this is a common requirement for Intel, AMD and Nvidia alike. Take a look at the cuda driver API definition of cuMemAllocManaged (search this API on https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM); it says:
>
> "The pointer is valid on the CPU and on all GPUs in the system that support managed memory."
>
> This means the program virtual address space is shared b/t the CPU and all GPU devices on the system. The system allocator we are discussing is just one step more advanced than cuMemAllocManaged: it allows malloc'ed memory to be shared b/t the CPU and all GPU devices.
>
> I hope we all agree with this point.
>
> With that, I agree with Christian that in kmd we should make driver code per-device based instead of managing all devices in one driver instance. Our system allocator (and generally xekmd) design follows this rule: we make xe_vm per-device based - one device is *not* aware of another device's address space, as I explained in a previous email. I started this email seeking one drm_gpuvm instance to cover all GPU devices. I gave up this approach (at least for now) per Danilo and Christian's feedback: we will continue to have per-device drm_gpuvm. I hope this is aligned with Christian but I will have to wait for Christian's reply to my previous email.
>
> I hope this clarifies things a little.
>
> Regards,
> Oak
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of David
>> Airlie
>> Sent: Wednesday, January 24, 2024 8:25 PM
>> To: Zeng, Oak <oak.zeng@intel.com>
>> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>> Thomas.Hellstrom@linux.intel.com; Winiarski, Michal
>> <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
>> Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>; dri-
>> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg
>> <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel
>> Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>; Bommu,
>> Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
>> <niranjana.vishwanathapura@intel.com>; Christian König
>> <christian.koenig@amd.com>
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>>>
>>> For us, Xekmd doesn't need to know it is running under bare metal or
>> virtualized environment. Xekmd is always a guest driver. All the virtual address
>> used in xekmd is guest virtual address. For SVM, we require all the VF devices
>> share one single shared address space with guest CPU program. So all the design
>> works in bare metal environment can automatically work under virtualized
>> environment. +@Shah, Ankur N +@Winiarski, Michal to backup me if I am wrong.
>>>
>>>
>>> Again, shared virtual address space b/t cpu and all gpu devices is a hard
>> requirement for our system allocator design (which means malloc’ed memory,
>> cpu stack variables, globals can be directly used in gpu program. Same
>> requirement as kfd SVM design). This was aligned with our user space software
>> stack.
>>
>> Just to make a very general point here (I'm hoping you listen to
>> Christian a bit more and hoping he replies in more detail), but just
>> because you have a system allocator design done, it doesn't in any way
>> enforce the requirements on the kernel driver to accept that design.
>> Bad system design should be pushed back on, not enforced in
>> implementation stages. It's a trap Intel falls into regularly since
>> they say well we already agreed this design with the userspace team
>> and we can't change it now. This isn't acceptable. Design includes
>> upstream discussion and feedback, if you say misdesigned the system
>> allocator (and I'm not saying you definitely have), and this is
>> pushing back on that, then you have to go fix your system
>> architecture.
>>
>> KFD was an experiment like this, I pushed back on AMD at the start
>> saying it was likely a bad plan, we let it go and got a lot of
>> experience in why it was a bad design.
>>
>> Dave.



* Re: Making drm_gpuvm work across gpu devices
  2024-01-26  8:21                 ` Thomas Hellström
@ 2024-01-26 12:52                   ` Christian König
  2024-01-27  2:21                     ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-26 12:52 UTC (permalink / raw)
  To: Thomas Hellström, Daniel Vetter
  Cc: Brost, Matthew, Dave Airlie, Felix Kuehling, Welty, Brian, Zeng,
	Oak, Ghimiray, Himal Prasad, dri-devel, Bommu, Krishnaiah, Gupta,
	saurabhg, Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

On 26.01.24 09:21, Thomas Hellström wrote:
> Hi, all
>
> On Thu, 2024-01-25 at 19:32 +0100, Daniel Vetter wrote:
>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>> On 23.01.24 20:37, Zeng, Oak wrote:
>>>> [SNIP]
>>>> Yes most API are per device based.
>>>>
>>>> One exception I know is actually the kfd SVM API. If you look at
>>>> the svm_ioctl function, it is per-process based. Each kfd_process
>>>> represent a process across N gpu devices.
>>> Yeah and that was a big mistake in my opinion. We should really not
>>> do that
>>> ever again.
>>>
>>>> Need to say, kfd SVM represent a shared virtual address space
>>>> across CPU and all GPU devices on the system. This is by the
>>>> definition of SVM (shared virtual memory). This is very different
>>>> from our legacy gpu *device* driver which works for only one
>>>> device (i.e., if you want one device to access another device's
>>>> memory, you will have to use dma-buf export/import etc).
>>> Exactly that thinking is what we have currently found as blocker
>>> for a
>>> virtualization projects. Having SVM as device independent feature
>>> which
>>> somehow ties to the process address space turned out to be an
>>> extremely bad
>>> idea.
>>>
>>> The background is that this only works for some use cases but not
>>> all of
>>> them.
>>>
>>> What's working much better is to just have a mirror functionality
>>> which says
>>> that a range A..B of the process address space is mapped into a
>>> range C..D
>>> of the GPU address space.
>>>
>>> Those ranges can then be used to implement the SVM feature required
>>> for
>>> higher level APIs and not something you need at the UAPI or even
>>> inside the
>>> low level kernel memory management.
>>>
>>> When you talk about migrating memory to a device you also do this
>>> on a per
>>> device basis and *not* tied to the process address space. If you
>>> then get
>>> crappy performance because userspace gave contradicting information
>>> where to
>>> migrate memory then that's a bug in userspace and not something the
>>> kernel
>>> should try to prevent somehow.
>>>
>>> [SNIP]
>>>>> I think if you start using the same drm_gpuvm for multiple
>>>>> devices you
>>>>> will sooner or later start to run into the same mess we have
>>>>> seen with
>>>>> KFD, where we moved more and more functionality from the KFD to
>>>>> the DRM
>>>>> render node because we found that a lot of the stuff simply
>>>>> doesn't work
>>>>> correctly with a single object to maintain the state.
>>>> As I understand it, KFD is designed to work across devices. A
>>>> single pseudo /dev/kfd device represent all hardware gpu devices.
>>>> That is why during kfd open, many pdd (process device data) is
>>>> created, each for one hardware device for this process.
>>> Yes, I'm perfectly aware of that. And I can only repeat myself that
>>> I see
>>> this design as a rather extreme failure. And I think it's one of
>>> the reasons
>>> why NVidia is so dominant with Cuda.
>>>
>>> This whole approach KFD takes was designed with the idea of
>>> extending the
>>> CPU process into the GPUs, but this idea only works for a few use
>>> cases and
>>> is not something we should apply to drivers in general.
>>>
>>> A very good example are virtualization use cases where you end up
>>> with CPU
>>> address != GPU address because the VAs are actually coming from the
>>> guest VM
>>> and not the host process.
>>>
>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should
>>> not have
>>> any influence on the design of the kernel UAPI.
>>>
>>> If you want to do something similar as KFD for Xe I think you need
>>> to get
>>> explicit permission to do this from Dave and Daniel and maybe even
>>> Linus.
>> I think the one and only one exception where an SVM uapi like in kfd
>> makes
>> sense, is if the _hardware_ itself, not the software stack defined
>> semantics that you've happened to build on top of that hw, enforces a
>> 1:1
>> mapping with the cpu process address space.
>>
>> Which means your hardware is using PASID, IOMMU based translation,
>> PCI-ATS
>> (address translation services) or whatever your hw calls it and has
>> _no_
>> device-side pagetables on top. Which from what I've seen all devices
>> with
>> device-memory have, simply because they need some place to store
>> whether
>> that memory is currently in device memory or should be translated
>> using
>> PASID. Currently there's no gpu that works with PASID only, but there
>> are
>> some on-cpu-die accelerator things that do work like that.
>>
>> Maybe in the future there will be some accelerators that are fully
>> cpu
>> cache coherent (including atomics) with something like CXL, and the
>> on-device memory is managed as normal system memory with struct page
>> as
>> ZONE_DEVICE and accelerator va -> physical address translation is
>> only
>> done with PASID ... but for now I haven't seen that, definitely not
>> in
>> upstream drivers.
>>
>> And the moment you have some per-device pagetables or per-device
>> memory
>> management of some sort (like using gpuva mgr) then I'm 100% agreeing
>> with
>> Christian that the kfd SVM model is too strict and not a great idea.
>>
>> Cheers, Sima
>
> I'm trying to digest all the comments here, The end goal is to be able
> to support something similar to this here:
>
> https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
>
> Christian, If I understand you correctly, you're strongly suggesting
> not to try to manage a common virtual address space across different
> devices in the kernel, but merely providing building blocks to do so,
> like for example a generalized userptr with migration support using
> HMM; That way each "mirror" of the CPU mm would be per device and
> inserted into the gpu_vm just like any other gpu_vma, and user-space
> would dictate the A..B -> C..D mapping by choosing the GPU_VA for the
> vma.

Exactly that, yes.

> Sima, it sounds like you're suggesting to shy away from hmm and not
> even attempt to support this except if it can be done using IOMMU sva
> on selected hardware?

I think that comment goes more in the direction of: if you have
ATS/ATC/PRI capable hardware which exposes the functionality to make
memory reads and writes directly into the address space of the CPU, then
yes, an SVM-only interface is ok because the hardware can't do anything
else. But as long as you have something like GPUVM, please don't
restrict yourself.

Which I totally agree with as well. The ATS/ATC/PRI combination doesn't
allow using separate page tables for the device and the CPU, and so also
no separate VAs.

This was one of the reasons why we stopped using this approach for AMD GPUs.

Regards,
Christian.

> Could you clarify a bit?
>
> Thanks,
> Thomas

* Re: Making drm_gpuvm work across gpu devices
  2024-01-25 18:37                   ` Zeng, Oak
@ 2024-01-26 13:23                     ` Christian König
  0 siblings, 0 replies; 123+ messages in thread
From: Christian König @ 2024-01-26 13:23 UTC (permalink / raw)
  To: Zeng, Oak, Felix Kuehling, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Shah, Ankur N, Winiarski, Michal
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe

On 25.01.24 19:37, Zeng, Oak wrote:
>> -----Original Message-----
>> From: Felix Kuehling <felix.kuehling@amd.com>
>> Sent: Thursday, January 25, 2024 12:16 PM
>> To: Zeng, Oak <oak.zeng@intel.com>; Christian König
>> <christian.koenig@amd.com>; Danilo Krummrich <dakr@redhat.com>; Dave
>> Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Shah, Ankur N
>> <ankur.n.shah@intel.com>; Winiarski, Michal <michal.winiarski@intel.com>
>> Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-
>> xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>;
>> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>> Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana
>> <niranjana.vishwanathapura@intel.com>; Brost, Matthew
>> <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>>
>> On 2024-01-24 20:17, Zeng, Oak wrote:
>>> Hi Christian,
>>>
>>> Even though I mentioned KFD design, I didn’t mean to copy the KFD
>>> design. I also had hard time to understand the difficulty of KFD under
>>> virtualization environment.
>>>
>> The problem with virtualization is related to virtualization design
>> choices. There is a single process that proxies requests from multiple
>> processes in one (or more?) VMs to the GPU driver. That means, we need a
>> single process with multiple contexts (and address spaces). One proxy
>> process on the host must support multiple guest address spaces.
> My first response is: why can't processes in the virtual machine open the /dev/kfd device themselves?

Because it's not using SRIOV, we are using native context and so the KFD 
driver is on the host and not the guest.

> Also try to picture why the base amdgpu driver (which is per-hardware-device based) doesn't have this problem... it creates multiple contexts under a single amdgpu device, each context servicing one guest process.

Yes, exactly that.

>> I don't know much more than these very high level requirements, and I
>> only found out about those a few weeks ago. Due to my own bias I can't
>> comment whether there are bad design choices in the proxy architecture
>> or in KFD or both. The way we are considering fixing this, is to enable
>> creating multiple KFD contexts in the same process. Each of those
>> contexts will still represent a shared virtual address space across
>> devices (but not the CPU). Because the device address space is not
>> shared with the CPU, we cannot support our SVM API in this situation.
>>
> One kfd process, multiple contexts, each context has a shared address space across devices.... I do see some complications 😊

Yeah, if the process which talks to the driver is not the process which 
initiated the work then the concept to tie anything to device state 
obviously doesn't work.

Regards,
Christian.

>
>> I still believe that it makes sense to have the kernel mode driver aware
>> of a shared virtual address space at some level. A per-GPU API and an
>> API that doesn't require matching CPU and GPU virtual addresses would
>> enable more flexibility at the cost of duplicate information tracking for
>> multiple devices and duplicate overhead for things like MMU notifiers
>> and interval tree data structures. Having to coordinate multiple devices
>> with potentially different address spaces would probably make it more
>> awkward to implement memory migration. The added flexibility would go
>> mostly unused, except in some very niche applications.
>>
>> Regards,
>>     Felix
>>
>>
>>> For us, Xekmd doesn't need to know it is running under bare metal or
>>> virtualized environment. Xekmd is always a guest driver. All the
>>> virtual address used in xekmd is guest virtual address. For SVM, we
>>> require all the VF devices share one single shared address space with
>>> guest CPU program. So all the design works in bare metal environment
>>> can automatically work under virtualized environment. +@Shah, Ankur N
>>> <ankur.n.shah@intel.com> +@Winiarski, Michal
>>> <michal.winiarski@intel.com> to backup me if I am wrong.
>>>
>>> Again, shared virtual address space b/t cpu and all gpu devices is a
>>> hard requirement for our system allocator design (which means
>>> malloc’ed memory, cpu stack variables, globals can be directly used in
>>> gpu program. Same requirement as kfd SVM design). This was aligned
>>> with our user space software stack.
>>>
>>> For anyone who want to implement system allocator, or SVM, this is a
>>> hard requirement. I started this thread hoping I can leverage the
>>> drm_gpuvm design to manage the shared virtual address space (as the
>>> address range split/merge function was scary to me and I didn’t want
>>> re-invent). I guess my takeaway from this you and Danilo is this
>>> approach is a NAK. Thomas also mentioned to me drm_gpuvm is a overkill
>>> for our svm address range split/merge. So I will make things work
>>> first by manage address range xekmd internally. I can re-look
>>> drm-gpuvm approach in the future.
>>>
>>> Maybe a pseudo user program can illustrate our programming model:
>>>
>>> Fd0 = open(card0)
>>>
>>> Fd1 = open(card1)
>>>
>>> Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's
>>> first vm_create
>>>
>>> Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called
>>> from same process
>>>
>>> Queue0 = xe_exec_queue_create(fd0, vm0)
>>>
>>> Queue1 = xe_exec_queue_create(fd1, vm1)
>>>
>>> //check p2p capability calling L0 API….
>>>
>>> ptr = malloc()//this replace bo_create, vm_bind, dma-import/export
>>>
>>> Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0
>>>
>>> Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1
>>>
>>> //Gpu page fault handles memory allocation/migration/mapping to gpu
>>>
>>> As you can see, from above model, our design is a little bit different
>>> than the KFD design. user need to explicitly create gpuvm (vm0 and vm1
>>> above) for each gpu device. Driver internally have a xe_svm represent
>>> the shared address space b/t cpu and multiple gpu devices. But end
>>> user doesn’t see and no need to create xe_svm. The shared virtual
>>> address space is really managed by linux core mm (through the vma
>>> struct, mm_struct etc). From each gpu device’s perspective, it just
>>> operate under its own gpuvm, not aware of the existence of other
>>> gpuvm, even though in reality all those gpuvm shares a same virtual
>>> address space.
>>>
>>> See one more comment inline
>>>
>>> *From:* Christian König <christian.koenig@amd.com>
>>> *Sent:* Wednesday, January 24, 2024 3:33 AM
>>> *To:* Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich
>>> <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
>>> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
>>> *Cc:* Welty, Brian <brian.welty@intel.com>;
>>> dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>> Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad
>>> <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com;
>>> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>;
>>> Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg
>>> <saurabhg.gupta@intel.com>
>>> *Subject:* Re: Making drm_gpuvm work across gpu devices
>>>
>>> On 23.01.24 20:37, Zeng, Oak wrote:
>>>
>>>      [SNIP]
>>>
>>>      Yes most API are per device based.
>>>
>>>      One exception I know is actually the kfd SVM API. If you look at the svm_ioctl
>> function, it is per-process based. Each kfd_process represent a process across N
>> gpu devices.
>>>
>>> Yeah and that was a big mistake in my opinion. We should really not do
>>> that ever again.
>>>
>>>
>>>      Need to say, kfd SVM represent a shared virtual address space across CPU
>> and all GPU devices on the system. This is by the definition of SVM (shared virtual
>> memory). This is very different from our legacy gpu *device* driver which works
>> for only one device (i.e., if you want one device to access another device's
>> memory, you will have to use dma-buf export/import etc).
>>>
>>> Exactly that thinking is what we have currently found as blocker for a
>>> virtualization projects. Having SVM as device independent feature
>>> which somehow ties to the process address space turned out to be an
>>> extremely bad idea.
>>>
>>> The background is that this only works for some use cases but not all
>>> of them.
>>>
>>> What's working much better is to just have a mirror functionality
>>> which says that a range A..B of the process address space is mapped
>>> into a range C..D of the GPU address space.
>>>
>>> Those ranges can then be used to implement the SVM feature required
>>> for higher level APIs and not something you need at the UAPI or even
>>> inside the low level kernel memory management.
>>>
>>> When you talk about migrating memory to a device you also do this on a
>>> per device basis and *not* tied to the process address space. If you
>>> then get crappy performance because userspace gave contradicting
>>> information where to migrate memory then that's a bug in userspace and
>>> not something the kernel should try to prevent somehow.
>>>
>>> [SNIP]
>>>
>>>          I think if you start using the same drm_gpuvm for multiple devices you
>>>
>>>          will sooner or later start to run into the same mess we have seen with
>>>
>>>          KFD, where we moved more and more functionality from the KFD to the
>> DRM
>>>          render node because we found that a lot of the stuff simply doesn't work
>>>
>>>          correctly with a single object to maintain the state.
>>>
>>>      As I understand it, KFD is designed to work across devices. A single pseudo
>> /dev/kfd device represent all hardware gpu devices. That is why during kfd open,
>> many pdd (process device data) is created, each for one hardware device for this
>> process.
>>>
>>> Yes, I'm perfectly aware of that. And I can only repeat myself that I
>>> see this design as a rather extreme failure. And I think it's one of
>>> the reasons why NVidia is so dominant with Cuda.
>>>
>>> This whole approach KFD takes was designed with the idea of extending
>>> the CPU process into the GPUs, but this idea only works for a few use
>>> cases and is not something we should apply to drivers in general.
>>>
>>> A very good example are virtualization use cases where you end up with
>>> CPU address != GPU address because the VAs are actually coming from
>>> the guest VM and not the host process.
>>>
>>> I don’t get the problem here. For us, under virtualization, both the
>>> cpu address and gpu virtual address operated in xekmd is guest virtual
>>> address. They can still share the same virtual address space (as SVM
>>> required)
>>>
>>> Oak
>>>
>>>
>>>
>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should
>>> not have any influence on the design of the kernel UAPI.
>>>
>>> If you want to do something similar as KFD for Xe I think you need to
>>> get explicit permission to do this from Dave and Daniel and maybe even
>>> Linus.
>>>
>>> Regards,
>>> Christian.
>>>


* RE: Making drm_gpuvm work across gpu devices
  2024-01-26 10:09                     ` Christian König
@ 2024-01-26 20:13                       ` Zeng, Oak
  2024-01-29 10:10                         ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-26 20:13 UTC (permalink / raw)
  To: Christian König, David Airlie
  Cc: Brost, Matthew, Thomas.Hellstrom, Winiarski, Michal,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, Ghimiray,
	Himal Prasad, Daniel Vetter, Bommu,  Krishnaiah, Gupta, saurabhg,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich



> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: Friday, January 26, 2024 5:10 AM
> To: Zeng, Oak <oak.zeng@intel.com>; David Airlie <airlied@redhat.com>
> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> Thomas.Hellstrom@linux.intel.com; Winiarski, Michal
> <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
> Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>; dri-
> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg
> <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel
> Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>; Bommu,
> Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> you can still use SVM, but it should not be a design criterion for the
> kernel UAPI. In other words, the UAPI should be designed in such a way
> that the GPU virtual address can be equal to the CPU virtual address of
> a buffer, but can also be different to support use cases where this
> isn't the case.

Terminology:
SVM: any technology which can achieve a shared virtual address space b/t the cpu and devices. The virtual address space can be managed by user space or kernel space. Intel implemented an SVM based on the BO-centric gpu driver (gem-create, vm-bind), where the virtual address space is managed by the UMD.
System allocator: another way of implementing SVM. The user just uses malloc'ed memory for gpu submission. The virtual address space is managed by Linux core mm. In practice, we leverage HMM to implement the system allocator.
This article describes the details of all those models: https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

Our programming model allows a mixed use of the system allocator (even though system allocator is ) and traditional vm_bind (where the cpu address can != the gpu address). Let me re-post the pseudo code:

	1. Fd0 = open(/"dev/dri/render0")
	2. Fd1 = open("/dev/dri/render1")
	3. Fd3 = open("/dev/dri/xe-svm")
	4. Gpu_Vm0 = xe_vm_create(fd0)
	5. Gpu_Vm1 = xe_vm_create(fd1)
	6. Queue0 = xe_exec_queue_create(fd0, gpu_vm0)
	7. Queue1 = xe_exec_queue_create(fd1, gpu_vm1)
	8. ptr = malloc()
	9. bo = xe_bo_create(fd0)
	10. Vm_bind(bo, gpu_vm0, va) //va is from the UMD; the cpu can access the bo with the same or a different va. It is the UMD's responsibility to ensure that va doesn't conflict with malloc'ed ptrs.
	11. Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0
	12. Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1
	13. Xe_exec(queue0, va)//submit gpu job which use va, on card0

In the above code, the va used in vm_bind (line 10, Intel's API to bind an object to a va for GPU access) can be different from the CPU address used when the cpu accesses the same object. But whenever the user uses a malloc'ed ptr for GPU submission (lines 11 and 12, the so-called system allocator), it implies that the CPU and GPUs use the same ptr for access.

In the above vm_bind, it is the user/UMD's responsibility to guarantee that the vm_bind va doesn't conflict with any malloc'ed ptr. Otherwise it is treated as a programming error.

I think this design still meets your design restrictions. 
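For illustration, a small user-space sketch of that reservation rule: grab the va range from core mm with an anonymous PROT_NONE mmap before binding it, so malloc can never return an overlapping pointer (xe_vm_bind() here is a hypothetical UMD wrapper, not the real ioctl interface):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical UMD wrapper around the driver's bind ioctl. */
int xe_vm_bind(int fd, uint32_t vm, uint32_t bo, uint64_t va, size_t size);

void *reserve_and_bind(int fd, uint32_t vm, uint32_t bo, size_t size)
{
	/* PROT_NONE reservation: core mm owns this range now, so later
	 * malloc()/mmap() calls cannot hand out the same addresses. */
	void *va = mmap(NULL, size, PROT_NONE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (va == MAP_FAILED)
		return NULL;

	if (xe_vm_bind(fd, vm, bo, (uint64_t)(uintptr_t)va, size)) {
		munmap(va, size);
		return NULL;
	}
	return va;
}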


> 
> Additionally to what Dave wrote I can summarize a few things I have
> learned while working on the AMD GPU drivers in the last decade or so:
> 
> 1. Userspace requirements are *not* relevant for UAPI or even more
> general kernel driver design.
> 
> 2. What should be done is to look at the hardware capabilities and try
> to expose those in a safe manner to userspace.
> 
> 3. The userspace requirements are then used to validate the kernel
> driver and especially the UAPI design to ensure that nothing was missed.
> 
> The consequence of this is that nobody should ever use things like Cuda,
> Vulkan, OpenCL, OpenGL etc.. as argument to propose a certain UAPI design.
> 
> What should be done instead is to say: My hardware works in this and
> that way -> we want to expose it like this -> because that enables us to
> implement the high level API in this and that way.
> 
> Only this gives then a complete picture of how things interact together
> and allows the kernel community to influence and validate the design.

What you described above is mainly bottom-up. I know other people do top-down, or whole-system vertical HW-SW co-design. I don't have a strong opinion here.

Regards,
Oak

> 
> This doesn't mean that you need to throw away everything, but it gives a
> clear restriction that designs are not nailed in stone and for example
> you can't use something like a waterfall model.
> 
> Going to answer on your other questions separately.
> 
> Regards,
> Christian.
> 
> On 25.01.24 06:25, Zeng, Oak wrote:
> > Hi Dave,
> >
> > Let me step back. When I wrote " shared virtual address space b/t cpu and all
> gpu devices is a hard requirement for our system allocator design", I meant this is
> not only Intel's design requirement. Rather this is a common requirement for
> both Intel, AMD and Nvidia. Take a look at cuda driver API definition of
> cuMemAllocManaged (search this API on https://docs.nvidia.com/cuda/cuda-
> driver-api/group__CUDA__MEM.html#group__CUDA__MEM), it said:
> >
> > "The pointer is valid on the CPU and on all GPUs in the system that support
> managed memory."
> >
> > This means the program virtual address space is shared b/t CPU and all GPU
> devices on the system. The system allocator we are discussing is just one step
> more advanced than cuMemAllocManaged: it allows malloc'ed memory to be shared
> b/t CPU and all GPU devices.
> >
> > I hope we all agree with this point.
> >
> > With that, I agree with Christian that in kmd we should make driver code per-
> device based instead of managing all devices in one driver instance. Our system
> allocator (and generally xekmd) design follows this rule: we make xe_vm per
> device based - one device is *not* aware of other device's address space, as I
> explained in previous email. I started this email seeking one drm_gpuvm
> instance to cover all GPU devices. I gave up this approach (at least for now) per
> Danilo and Christian's feedback: We will continue to have per device based
> drm_gpuvm. I hope this is aligned with Christian but I will have to wait for
> Christian's reply to my previous email.
> >
> > I hope this clarifies things a little.
> >
> > Regards,
> > Oak
> >
> >> -----Original Message-----
> >> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> David
> >> Airlie
> >> Sent: Wednesday, January 24, 2024 8:25 PM
> >> To: Zeng, Oak <oak.zeng@intel.com>
> >> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
> >> Thomas.Hellstrom@linux.intel.com; Winiarski, Michal
> >> <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>;
> Welty,
> >> Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>;
> dri-
> >> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg
> >> <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel
> >> Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>;
> Bommu,
> >> Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> >> <niranjana.vishwanathapura@intel.com>; Christian König
> >> <christian.koenig@amd.com>
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >>>
> >>> For us, Xekmd doesn't need to know it is running under bare metal or
> >> virtualized environment. Xekmd is always a guest driver. All the virtual address
> >> used in xekmd is guest virtual address. For SVM, we require all the VF devices
> >> share one single shared address space with guest CPU program. So all the
> design
> >> works in bare metal environment can automatically work under virtualized
> >> environment. +@Shah, Ankur N +@Winiarski, Michal to backup me if I am
> wrong.
> >>>
> >>>
> >>> Again, shared virtual address space b/t cpu and all gpu devices is a hard
> >> requirement for our system allocator design (which means malloc’ed memory,
> >> cpu stack variables, globals can be directly used in gpu program. Same
> >> requirement as kfd SVM design). This was aligned with our user space
> software
> >> stack.
> >>
> >> Just to make a very general point here (I'm hoping you listen to
> >> Christian a bit more and hoping he replies in more detail), but just
> >> because you have a system allocator design done, it doesn't in any way
> >> enforce the requirements on the kernel driver to accept that design.
> >> Bad system design should be pushed back on, not enforced in
> >> implementation stages. It's a trap Intel falls into regularly since
> >> they say well we already agreed this design with the userspace team
> >> and we can't change it now. This isn't acceptable. Design includes
> >> upstream discussion and feedback, if you say misdesigned the system
> >> allocator (and I'm not saying you definitely have), and this is
> >> pushing back on that, then you have to go fix your system
> >> architecture.
> >>
> >> KFD was an experiment like this, I pushed back on AMD at the start
> >> saying it was likely a bad plan, we let it go and got a lot of
> >> experience in why it was a bad design.
> >>
> >> Dave.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-26 12:52                   ` Christian König
@ 2024-01-27  2:21                     ` Zeng, Oak
  2024-01-29 10:19                       ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-27  2:21 UTC (permalink / raw)
  To: Christian König, Thomas Hellström, Daniel Vetter
  Cc: Brost, Matthew, Dave Airlie, Felix Kuehling, Welty, Brian,
	dri-devel, Ghimiray, Himal Prasad, Bommu, Krishnaiah, Gupta,
	saurabhg, Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

[-- Attachment #1: Type: text/plain, Size: 8130 bytes --]

Regarding the idea of expanding userptr to support migration, we explored this idea a long time ago. It provides functionality similar to the system allocator, but its interface is not as convenient. Besides the shared virtual address space, another benefit of a system allocator is that offloading a CPU program to the GPU is easier: you don't need to call a driver-specific API (such as register_userptr or vm_bind in this case) for memory allocation.

We also scoped the implementation. It turned out to be big, and not as beautiful as HMM. That is why we gave up this approach.
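
In sketch form, the convenience difference looks like this (the xe_* calls are illustrative stand-ins for driver-specific registration, not the real Xe uAPI):

    /* userptr flow: every allocation needs a driver-specific call
     * before the GPU can touch it. */
    void *buf = malloc(size);
    xe_register_userptr(fd, vm, buf, size, gpu_va); /* hypothetical */
    xe_exec(queue, gpu_va);

    /* system allocator flow: malloc'ed memory is directly usable,
     * and the GPU uses the same virtual address as the CPU. */
    void *buf2 = malloc(size);
    xe_exec(queue, (uint64_t)buf2); /* no registration step */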

From: Christian König <christian.koenig@amd.com>
Sent: Friday, January 26, 2024 7:52 AM
To: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>
Cc: Brost, Matthew <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Zeng, Oak <oak.zeng@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; dri-devel@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Dave Airlie <airlied@redhat.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org
Subject: Re: Making drm_gpuvm work across gpu devices

Am 26.01.24 um 09:21 schrieb Thomas Hellström:
> Hi, all
>
> On Thu, 2024-01-25 at 19:32 +0100, Daniel Vetter wrote:
>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>> [SNIP]
>>>> Yes most API are per device based.
>>>>
>>>> One exception I know is actually the kfd SVM API. If you look at
>>>> the svm_ioctl function, it is per-process based. Each kfd_process
>>>> represent a process across N gpu devices.
>>> Yeah and that was a big mistake in my opinion. We should really not
>>> do that ever again.
>>>
>>>> Need to say, kfd SVM represent a shared virtual address space
>>>> across CPU and all GPU devices on the system. This is by the
>>>> definition of SVM (shared virtual memory). This is very different
>>>> from our legacy gpu *device* driver which works for only one
>>>> device (i.e., if you want one device to access another device's
>>>> memory, you will have to use dma-buf export/import etc).
>>> Exactly that thinking is what we have currently found as blocker
>>> for a virtualization projects. Having SVM as device independent
>>> feature which somehow ties to the process address space turned out
>>> to be an extremely bad idea.
>>>
>>> The background is that this only works for some use cases but not
>>> all of them.
>>>
>>> What's working much better is to just have a mirror functionality
>>> which says that a range A..B of the process address space is mapped
>>> into a range C..D of the GPU address space.
>>>
>>> Those ranges can then be used to implement the SVM feature required
>>> for higher level APIs and not something you need at the UAPI or
>>> even inside the low level kernel memory management.
>>>
>>> When you talk about migrating memory to a device you also do this
>>> on a per device basis and *not* tied to the process address space.
>>> If you then get crappy performance because userspace gave
>>> contradicting information where to migrate memory then that's a bug
>>> in userspace and not something the kernel should try to prevent
>>> somehow.
>>>
>>> [SNIP]
>>>>> I think if you start using the same drm_gpuvm for multiple
>>>>> devices you will sooner or later start to run into the same mess
>>>>> we have seen with KFD, where we moved more and more functionality
>>>>> from the KFD to the DRM render node because we found that a lot
>>>>> of the stuff simply doesn't work correctly with a single object
>>>>> to maintain the state.
>>>> As I understand it, KFD is designed to work across devices. A
>>>> single pseudo /dev/kfd device represent all hardware gpu devices.
>>>> That is why during kfd open, many pdd (process device data) is
>>>> created, each for one hardware device for this process.
>>> Yes, I'm perfectly aware of that. And I can only repeat myself that
>>> I see this design as a rather extreme failure. And I think it's one
>>> of the reasons why NVidia is so dominant with Cuda.
>>>
>>> This whole approach KFD takes was designed with the idea of
>>> extending the CPU process into the GPUs, but this idea only works
>>> for a few use cases and is not something we should apply to drivers
>>> in general.
>>>
>>> A very good example are virtualization use cases where you end up
>>> with CPU address != GPU address because the VAs are actually coming
>>> from the guest VM and not the host process.
>>>
>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should
>>> not have any influence on the design of the kernel UAPI.
>>>
>>> If you want to do something similar as KFD for Xe I think you need
>>> to get explicit permission to do this from Dave and Daniel and
>>> maybe even Linus.
>> I think the one and only one exception where an SVM uapi like in kfd
>> makes sense, is if the _hardware_ itself, not the software stack
>> defined semantics that you've happened to build on top of that hw,
>> enforces a 1:1 mapping with the cpu process address space.
>>
>> Which means your hardware is using PASID, IOMMU based translation,
>> PCI-ATS (address translation services) or whatever your hw calls it
>> and has _no_ device-side pagetables on top. Which from what I've seen
>> all devices with device-memory have, simply because they need some
>> place to store whether that memory is currently in device memory or
>> should be translated using PASID. Currently there's no gpu that works
>> with PASID only, but there are some on-cpu-die accelerator things
>> that do work like that.
>>
>> Maybe in the future there will be some accelerators that are fully
>> cpu cache coherent (including atomics) with something like CXL, and
>> the on-device memory is managed as normal system memory with struct
>> page as ZONE_DEVICE and accelerator va -> physical address
>> translation is only done with PASID ... but for now I haven't seen
>> that, definitely not in upstream drivers.
>>
>> And the moment you have some per-device pagetables or per-device
>> memory management of some sort (like using gpuva mgr) then I'm 100%
>> agreeing with Christian that the kfd SVM model is too strict and not
>> a great idea.
>>
>> Cheers, Sima
>
> I'm trying to digest all the comments here, The end goal is to be able
> to support something similar to this here:
>
> https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
>
> Christian, If I understand you correctly, you're strongly suggesting
> not to try to manage a common virtual address space across different
> devices in the kernel, but merely providing building blocks to do so,
> like for example a generalized userptr with migration support using
> HMM; That way each "mirror" of the CPU mm would be per device and
> inserted into the gpu_vm just like any other gpu_vma, and user-space
> would dictate the A..B -> C..D mapping by choosing the GPU_VA for the
> vma.

Exactly that, yes.

> Sima, it sounds like you're suggesting to shy away from hmm and not
> even attempt to support this except if it can be done using IOMMU sva
> on selected hardware?

I think that comment goes more into the direction of: If you have ATS/ATC/PRI capable hardware which exposes the functionality to make memory reads and writes directly into the address space of the CPU then yes an SVM only interface is ok because the hardware can't do anything else. But as long as you have something like GPUVM then please don't restrict yourself.

Which I totally agree on as well. The ATS/ATC/PRI combination doesn't allow using separate page tables for device and CPU, and so also no separate VAs.

This was one of the reasons why we stopped using this approach for AMD GPUs.

Regards,
Christian.

> Could you clarify a bit?
>
> Thanks,
> Thomas

[-- Attachment #2: Type: text/html, Size: 15387 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-26 20:13                       ` Zeng, Oak
@ 2024-01-29 10:10                         ` Christian König
  2024-01-29 20:09                           ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-29 10:10 UTC (permalink / raw)
  To: Zeng, Oak, David Airlie
  Cc: Brost, Matthew, Thomas.Hellstrom, Winiarski, Michal,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, Ghimiray,
	Himal Prasad, Daniel Vetter, Bommu, Krishnaiah, Gupta, saurabhg,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

[-- Attachment #1: Type: text/plain, Size: 9837 bytes --]

Am 26.01.24 um 21:13 schrieb Zeng, Oak:
>> -----Original Message-----
>> From: Christian König <christian.koenig@amd.com>
>> Sent: Friday, January 26, 2024 5:10 AM
>> To: Zeng, Oak <oak.zeng@intel.com>; David Airlie <airlied@redhat.com>
>> Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>;
>> Thomas.Hellstrom@linux.intel.com; Winiarski, Michal
>> <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
>> Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>; dri-
>> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg
>> <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel
>> Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>; Bommu,
>> Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
>> <niranjana.vishwanathapura@intel.com>
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>> Hi Oak,
>>
>> you can still use SVM, but it should not be a design criteria for the
>> kernel UAPI. In other words the UAPI should be designed in such a way
>> that the GPU virtual address can be equal to the CPU virtual address of
>> a buffer, but can also be different to support use cases where this
>> isn't the case.
> Terminology:
> SVM: any technology which can achieve a shared virtual address space b/t cpu and devices. The virtual address space can be managed by user space or kernel space. Intel implemented a SVM, based on the BO-centric gpu driver (gem-create, vm-bind) where virtual address space is managed by UMD.
> System allocator: another way of implement SVM. User just use malloc'ed memory for gpu submission. Virtual address space is managed by Linux core mm. In practice, we leverage HMM to implement system allocator.
> This article describes the details of all those different models: https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
>
> Our programming model allows a mixture use of system allocator (even though system allocator is ) and traditional vm_bind (where cpu address can != gpu address). Let me re-post the pseudo codes:
>
> 	1. Fd0 = open(/"dev/dri/render0")
> 	2. Fd1 = open("/dev/dri/render1")
> 	3. Fd3 = open("/dev/dri/xe-svm")
> 	4. Gpu_Vm0 =xe_vm_create(fd0)
> 	5. Gpu_Vm1 = xe_vm_create(fd1)
> 	6. Queue0 = xe_exec_queue_create(fd0, gpu_vm0)
> 	7. Queue1 = xe_exec_queue_create(fd1, gpu_vm1)
> 	8. ptr = malloc()
> 	9. bo = xe_bo_create(fd0)
> 	10. Vm_bind(bo, gpu_vm0, va)//va is from UMD, cpu can access bo with same or different va. It is UMD's responsibility that va doesn't conflict with malloc'ed PTRs.
> 	11. Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0
> 	12. Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1
> 	13. Xe_exec(queue0, va)//submit gpu job which use va, on card0
>
> In above codes, the va used in vm_bind (line 10, Intel's API to bind an object to a va for GPU access) can be different from the CPU address when cpu access the same object. But whenever user use malloc'ed ptr for GPU submission (line 11, 12, so called system allocator), it implies CPU and GPUs use the same ptr to access.
>
> In above vm_bind, it is user/UMD's responsibility to guarantee that vm_bind va doesn't conflict with malloc'ed ptr. Otherwise it is treated as programming error.
>
> I think this design still meets your design restrictions.
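
The quoted pseudo code, written out as a C sketch for readability; the xe_* wrappers and device node names are illustrative, following the quoted steps rather than any actual ioctl interface:

    int fd0 = open("/dev/dri/render0", O_RDWR);
    int fd1 = open("/dev/dri/render1", O_RDWR);
    int fd3 = open("/dev/dri/xe-svm", O_RDWR);   /* questioned below */

    uint32_t gpu_vm0 = xe_vm_create(fd0);        /* per-device VM */
    uint32_t gpu_vm1 = xe_vm_create(fd1);
    uint32_t queue0  = xe_exec_queue_create(fd0, gpu_vm0);
    uint32_t queue1  = xe_exec_queue_create(fd1, gpu_vm1);

    void *ptr = malloc(length);                  /* system allocator */
    uint32_t bo = xe_bo_create(fd0);
    xe_vm_bind(bo, gpu_vm0, va);  /* va chosen by UMD; must not collide
                                   * with malloc'ed pointers */

    xe_exec(queue0, ptr);  /* malloc'ed ptr on card0: CPU VA == GPU VA */
    xe_exec(queue1, ptr);  /* same ptr on card1 */
    xe_exec(queue0, va);   /* BO path: GPU va may differ from CPU VA */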

Well why do you need this "Fd3 = open("/dev/dri/xe-svm")" ?

As far as I see fd3 isn't used anywhere. What you can do is to bind 
parts of your process address space to your driver connections (fd1, fd2 
etc.) with a vm_bind(), but this should *not* come about through 
implicitly using some other file descriptor in the process.

As far as I can see this design is exactly what failed so badly with KFD.
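
In sketch form, the per-device alternative could look like this (xe_vm_bind_mirror() is a hypothetical wrapper, not an existing call):

    /* Each fd/VM mirrors the CPU range with its own page tables;
     * no extra /dev/dri/xe-svm node is involved. */
    xe_vm_bind_mirror(fd0, gpu_vm0, cpu_start, length);
    xe_vm_bind_mirror(fd1, gpu_vm1, cpu_start, length);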

Regards,
Christian.

>
>
>> Additionally to what Dave wrote I can summarize a few things I have
>> learned while working on the AMD GPU drivers in the last decade or so:
>>
>> 1. Userspace requirements are *not* relevant for UAPI or even more
>> general kernel driver design.
>>
>> 2. What should be done is to look at the hardware capabilities and try
>> to expose those in a save manner to userspace.
>>
>> 3. The userspace requirements are then used to validate the kernel
>> driver and especially the UAPI design to ensure that nothing was missed.
>>
>> The consequence of this is that nobody should ever use things like Cuda,
>> Vulkan, OpenCL, OpenGL etc.. as argument to propose a certain UAPI design.
>>
>> What should be done instead is to say: My hardware works in this and
>> that way -> we want to expose it like this -> because that enables us to
>> implement the high level API in this and that way.
>>
>> Only this gives then a complete picture of how things interact together
>> and allows the kernel community to influence and validate the design.
> What you described above is mainly bottom up. I know other people do top down, or whole system vertical HW-SW co-design. I don't have strong opinion here.
>
> Regards,
> Oak
>
>> This doesn't mean that you need to throw away everything, but it gives a
>> clear restriction that designs are not nailed in stone and for example
>> you can't use something like a waterfall model.
>>
>> Going to answer on your other questions separately.
>>
>> Regards,
>> Christian.
>>
>> Am 25.01.24 um 06:25 schrieb Zeng, Oak:
>>> Hi Dave,
>>>
>>> Let me step back. When I wrote " shared virtual address space b/t cpu and all
>> gpu devices is a hard requirement for our system allocator design", I meant this is
>> not only Intel's design requirement. Rather this is a common requirement for
>> both Intel, AMD and Nvidia. Take a look at cuda driver API definition of
>> cuMemAllocManaged (search this API on https://docs.nvidia.com/cuda/cuda-
>> driver-api/group__CUDA__MEM.html#group__CUDA__MEM), it said:
>>> "The pointer is valid on the CPU and on all GPUs in the system that support
>> managed memory."
>>> This means the program virtual address space is shared b/t CPU and all GPU
>> devices on the system. The system allocator we are discussing is just one step
>> advanced than cuMemAllocManaged: it allows malloc'ed memory to be shared
>> b/t CPU and all GPU devices.
>>> I hope we all agree with this point.
>>>
>>> With that, I agree with Christian that in kmd we should make driver code per-
>> device based instead of managing all devices in one driver instance. Our system
>> allocator (and generally xekmd)design follows this rule: we make xe_vm per
>> device based - one device is *not* aware of other device's address space, as I
>> explained in previous email. I started this email seeking a one drm_gpuvm
>> instance to cover all GPU devices. I gave up this approach (at least for now) per
>> Danilo and Christian's feedback: We will continue to have per device based
>> drm_gpuvm. I hope this is aligned with Christian but I will have to wait for
>> Christian's reply to my previous email.
>>> I hope this clarify thing a little.
>>>
>>> Regards,
>>> Oak
>>>
>>>> -----Original Message-----
>>>> From: dri-devel<dri-devel-bounces@lists.freedesktop.org>  On Behalf Of
>> David
>>>> Airlie
>>>> Sent: Wednesday, January 24, 2024 8:25 PM
>>>> To: Zeng, Oak<oak.zeng@intel.com>
>>>> Cc: Ghimiray, Himal Prasad<himal.prasad.ghimiray@intel.com>;
>>>> Thomas.Hellstrom@linux.intel.com; Winiarski, Michal
>>>> <michal.winiarski@intel.com>; Felix Kuehling<felix.kuehling@amd.com>;
>> Welty,
>>>> Brian<brian.welty@intel.com>; Shah, Ankur N<ankur.n.shah@intel.com>;
>> dri-
>>>> devel@lists.freedesktop.org;intel-xe@lists.freedesktop.org; Gupta, saurabhg
>>>> <saurabhg.gupta@intel.com>; Danilo Krummrich<dakr@redhat.com>; Daniel
>>>> Vetter<daniel@ffwll.ch>; Brost, Matthew<matthew.brost@intel.com>;
>> Bommu,
>>>> Krishnaiah<krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
>>>> <niranjana.vishwanathapura@intel.com>; Christian König
>>>> <christian.koenig@amd.com>
>>>> Subject: Re: Making drm_gpuvm work across gpu devices
>>>>
>>>>> For us, Xekmd doesn't need to know it is running under bare metal or
>>>> virtualized environment. Xekmd is always a guest driver. All the virtual address
>>>> used in xekmd is guest virtual address. For SVM, we require all the VF devices
>>>> share one single shared address space with guest CPU program. So all the
>> design
>>>> works in bare metal environment can automatically work under virtualized
>>>> environment. +@Shah, Ankur N +@Winiarski, Michal to backup me if I am
>> wrong.
>>>>>
>>>>> Again, shared virtual address space b/t cpu and all gpu devices is a hard
>>>> requirement for our system allocator design (which means malloc’ed memory,
>>>> cpu stack variables, globals can be directly used in gpu program. Same
>>>> requirement as kfd SVM design). This was aligned with our user space
>> software
>>>> stack.
>>>>
>>>> Just to make a very general point here (I'm hoping you listen to
>>>> Christian a bit more and hoping he replies in more detail), but just
>>>> because you have a system allocator design done, it doesn't in any way
>>>> enforce the requirements on the kernel driver to accept that design.
>>>> Bad system design should be pushed back on, not enforced in
>>>> implementation stages. It's a trap Intel falls into regularly since
>>>> they say well we already agreed this design with the userspace team
>>>> and we can't change it now. This isn't acceptable. Design includes
>>>> upstream discussion and feedback, if you say misdesigned the system
>>>> allocator (and I'm not saying you definitely have), and this is
>>>> pushing back on that, then you have to go fix your system
>>>> architecture.
>>>>
>>>> KFD was an experiment like this, I pushed back on AMD at the start
>>>> saying it was likely a bad plan, we let it go and got a lot of
>>>> experience in why it was a bad design.
>>>>
>>>> Dave.

[-- Attachment #2: Type: text/html, Size: 16133 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-27  2:21                     ` Zeng, Oak
@ 2024-01-29 10:19                       ` Christian König
  2024-01-30  0:21                         ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-29 10:19 UTC (permalink / raw)
  To: Zeng, Oak, Thomas Hellström, Daniel Vetter, Dave Airlie
  Cc: Brost, Matthew, Felix Kuehling, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Gupta, saurabhg,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

[-- Attachment #1: Type: text/plain, Size: 10204 bytes --]

Well Daniel and Dave noted it as well, so I'm just repeating it: Your 
design choices are not an argument to get something upstream.

It's the job of the maintainers, and in the end of Linus, to judge 
whether something is acceptable or not.

As far as I can see a good part of this idea has already been exercised 
at length with KFD, and it turned out not to be the best approach.

So from what I've seen the design you outlined is extremely unlikely to 
go upstream.

Regards,
Christian.

Am 27.01.24 um 03:21 schrieb Zeng, Oak:
>
> Regarding the idea of expanding userptr to support migration, we 
> explored this idea long time ago. It provides similar functions of the 
> system allocator but its interface is not as convenient as system 
> allocator. Besides the shared virtual address space, another benefit 
> of a system allocator is, you can offload cpu program to gpu easier, 
> you don’t need to call driver specific API (such as register_userptr 
> and vm_bind in this case) for memory allocation.
>
> We also scoped the implementation. It turned out to be big, and not as 
> beautiful as hmm. Why we gave up this approach.
>
> From: Christian König <christian.koenig@amd.com>
> Sent: Friday, January 26, 2024 7:52 AM
> To: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel
> Vetter <daniel@ffwll.ch>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Felix Kuehling
> <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>;
> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Zeng, Oak
> <oak.zeng@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>;
> Danilo Krummrich <dakr@redhat.com>; dri-devel@lists.freedesktop.org;
> Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Dave Airlie
> <airlied@redhat.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org
> Subject: Re: Making drm_gpuvm work across gpu devices
>
> Am 26.01.24 um 09:21 schrieb Thomas Hellström:
>> Hi, all
>>
>> On Thu, 2024-01-25 at 19:32 +0100, Daniel Vetter wrote:
>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>> [SNIP]
>>>>> Yes most API are per device based.
>>>>>
>>>>> One exception I know is actually the kfd SVM API. If you look at
>>>>> the svm_ioctl function, it is per-process based. Each kfd_process
>>>>> represent a process across N gpu devices.
>>>> Yeah and that was a big mistake in my opinion. We should really not
>>>> do that ever again.
>>>>
>>>>> Need to say, kfd SVM represent a shared virtual address space
>>>>> across CPU and all GPU devices on the system. This is by the
>>>>> definition of SVM (shared virtual memory). This is very different
>>>>> from our legacy gpu *device* driver which works for only one
>>>>> device (i.e., if you want one device to access another device's
>>>>> memory, you will have to use dma-buf export/import etc).
>>>> Exactly that thinking is what we have currently found as blocker
>>>> for a virtualization projects. Having SVM as device independent
>>>> feature which somehow ties to the process address space turned out
>>>> to be an extremely bad idea.
>>>>
>>>> The background is that this only works for some use cases but not
>>>> all of them.
>>>>
>>>> What's working much better is to just have a mirror functionality
>>>> which says that a range A..B of the process address space is
>>>> mapped into a range C..D of the GPU address space.
>>>>
>>>> Those ranges can then be used to implement the SVM feature
>>>> required for higher level APIs and not something you need at the
>>>> UAPI or even inside the low level kernel memory management.
>>>>
>>>> When you talk about migrating memory to a device you also do this
>>>> on a per device basis and *not* tied to the process address space.
>>>> If you then get crappy performance because userspace gave
>>>> contradicting information where to migrate memory then that's a
>>>> bug in userspace and not something the kernel should try to
>>>> prevent somehow.
>>>>
>>>> [SNIP]
>>>>>> I think if you start using the same drm_gpuvm for multiple
>>>>>> devices you will sooner or later start to run into the same mess
>>>>>> we have seen with KFD, where we moved more and more
>>>>>> functionality from the KFD to the DRM render node because we
>>>>>> found that a lot of the stuff simply doesn't work correctly with
>>>>>> a single object to maintain the state.
>>>>> As I understand it, KFD is designed to work across devices. A
>>>>> single pseudo /dev/kfd device represent all hardware gpu devices.
>>>>> That is why during kfd open, many pdd (process device data) is
>>>>> created, each for one hardware device for this process.
>>>> Yes, I'm perfectly aware of that. And I can only repeat myself
>>>> that I see this design as a rather extreme failure. And I think
>>>> it's one of the reasons why NVidia is so dominant with Cuda.
>>>>
>>>> This whole approach KFD takes was designed with the idea of
>>>> extending the CPU process into the GPUs, but this idea only works
>>>> for a few use cases and is not something we should apply to
>>>> drivers in general.
>>>>
>>>> A very good example are virtualization use cases where you end up
>>>> with CPU address != GPU address because the VAs are actually
>>>> coming from the guest VM and not the host process.
>>>>
>>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This
>>>> should not have any influence on the design of the kernel UAPI.
>>>>
>>>> If you want to do something similar as KFD for Xe I think you need
>>>> to get explicit permission to do this from Dave and Daniel and
>>>> maybe even Linus.
>>> I think the one and only one exception where an SVM uapi like in
>>> kfd makes sense, is if the _hardware_ itself, not the software
>>> stack defined semantics that you've happened to build on top of
>>> that hw, enforces a 1:1 mapping with the cpu process address space.
>>>
>>> Which means your hardware is using PASID, IOMMU based translation,
>>> PCI-ATS (address translation services) or whatever your hw calls it
>>> and has _no_ device-side pagetables on top. Which from what I've
>>> seen all devices with device-memory have, simply because they need
>>> some place to store whether that memory is currently in device
>>> memory or should be translated using PASID. Currently there's no
>>> gpu that works with PASID only, but there are some on-cpu-die
>>> accelerator things that do work like that.
>>>
>>> Maybe in the future there will be some accelerators that are fully
>>> cpu cache coherent (including atomics) with something like CXL, and
>>> the on-device memory is managed as normal system memory with struct
>>> page as ZONE_DEVICE and accelerator va -> physical address
>>> translation is only done with PASID ... but for now I haven't seen
>>> that, definitely not in upstream drivers.
>>>
>>> And the moment you have some per-device pagetables or per-device
>>> memory management of some sort (like using gpuva mgr) then I'm 100%
>>> agreeing with Christian that the kfd SVM model is too strict and
>>> not a great idea.
>>>
>>> Cheers, Sima
>>
>> I'm trying to digest all the comments here, The end goal is to be
>> able to support something similar to this here:
>>
>> https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
>>
>> Christian, If I understand you correctly, you're strongly suggesting
>> not to try to manage a common virtual address space across different
>> devices in the kernel, but merely providing building blocks to do
>> so, like for example a generalized userptr with migration support
>> using HMM; That way each "mirror" of the CPU mm would be per device
>> and inserted into the gpu_vm just like any other gpu_vma, and
>> user-space would dictate the A..B -> C..D mapping by choosing the
>> GPU_VA for the vma.
>
> Exactly that, yes.
>
>> Sima, it sounds like you're suggesting to shy away from hmm and not
>> even attempt to support this except if it can be done using IOMMU
>> sva on selected hardware?
>
> I think that comment goes more into the direction of: If you have
> ATS/ATC/PRI capable hardware which exposes the functionality to make
> memory reads and writes directly into the address space of the CPU
> then yes an SVM only interface is ok because the hardware can't do
> anything else. But as long as you have something like GPUVM then
> please don't restrict yourself.
>
> Which I totally agree on as well. The ATS/ATC/PRI combination doesn't
> allow using separate page tables for device and CPU, and so also no
> separate VAs.
>
> This was one of the reasons why we stopped using this approach for
> AMD GPUs.
>
> Regards,
> Christian.
>
>> Could you clarify a bit?
>>
>> Thanks,
>> Thomas

[-- Attachment #2: Type: text/html, Size: 20461 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-25 18:32               ` Daniel Vetter
  2024-01-25 21:02                 ` Zeng, Oak
  2024-01-26  8:21                 ` Thomas Hellström
@ 2024-01-29 15:03                 ` Felix Kuehling
  2024-01-29 15:33                   ` Christian König
  2 siblings, 1 reply; 123+ messages in thread
From: Felix Kuehling @ 2024-01-29 15:03 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, Ghimiray,
	Himal Prasad, dri-devel, Gupta, saurabhg, Danilo Krummrich, Zeng,
	Oak, Bommu, Krishnaiah, Dave Airlie, Vishwanathapura, Niranjana,
	intel-xe

On 2024-01-25 13:32, Daniel Vetter wrote:
> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>> [SNIP]
>>> Yes most API are per device based.
>>>
>>> One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.
>> Yeah and that was a big mistake in my opinion. We should really not do that
>> ever again.
>>
>>> Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).
>> Exactly that thinking is what we have currently found as blocker for a
>> virtualization projects. Having SVM as device independent feature which
>> somehow ties to the process address space turned out to be an extremely bad
>> idea.
>>
>> The background is that this only works for some use cases but not all of
>> them.
>>
>> What's working much better is to just have a mirror functionality which says
>> that a range A..B of the process address space is mapped into a range C..D
>> of the GPU address space.
>>
>> Those ranges can then be used to implement the SVM feature required for
>> higher level APIs and not something you need at the UAPI or even inside the
>> low level kernel memory management.
>>
>> When you talk about migrating memory to a device you also do this on a per
>> device basis and *not* tied to the process address space. If you then get
>> crappy performance because userspace gave contradicting information where to
>> migrate memory then that's a bug in userspace and not something the kernel
>> should try to prevent somehow.
>>
>> [SNIP]
>>>> I think if you start using the same drm_gpuvm for multiple devices you
>>>> will sooner or later start to run into the same mess we have seen with
>>>> KFD, where we moved more and more functionality from the KFD to the DRM
>>>> render node because we found that a lot of the stuff simply doesn't work
>>>> correctly with a single object to maintain the state.
>>> As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.
>> Yes, I'm perfectly aware of that. And I can only repeat myself that I see
>> this design as a rather extreme failure. And I think it's one of the reasons
>> why NVidia is so dominant with Cuda.
>>
>> This whole approach KFD takes was designed with the idea of extending the
>> CPU process into the GPUs, but this idea only works for a few use cases and
>> is not something we should apply to drivers in general.
>>
>> A very good example are virtualization use cases where you end up with CPU
>> address != GPU address because the VAs are actually coming from the guest VM
>> and not the host process.
>>
>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have
>> any influence on the design of the kernel UAPI.
>>
>> If you want to do something similar as KFD for Xe I think you need to get
>> explicit permission to do this from Dave and Daniel and maybe even Linus.
> I think the one and only one exception where an SVM uapi like in kfd makes
> sense, is if the _hardware_ itself, not the software stack defined
> semantics that you've happened to build on top of that hw, enforces a 1:1
> mapping with the cpu process address space.
>
> Which means your hardware is using PASID, IOMMU based translation, PCI-ATS
> (address translation services) or whatever your hw calls it and has _no_
> device-side pagetables on top. Which from what I've seen all devices with
> device-memory have, simply because they need some place to store whether
> that memory is currently in device memory or should be translated using
> PASID. Currently there's no gpu that works with PASID only, but there are
> some on-cpu-die accelerator things that do work like that.
>
> Maybe in the future there will be some accelerators that are fully cpu
> cache coherent (including atomics) with something like CXL, and the
> on-device memory is managed as normal system memory with struct page as
> ZONE_DEVICE and accelerator va -> physical address translation is only
> done with PASID ... but for now I haven't seen that, definitely not in
> upstream drivers.
>
> And the moment you have some per-device pagetables or per-device memory
> management of some sort (like using gpuva mgr) then I'm 100% agreeing with
> Christian that the kfd SVM model is too strict and not a great idea.

That basically means, without ATS/PRI+PASID you cannot implement a 
unified memory programming model, where GPUs or accelerators access 
virtual addresses without pre-registering them with an SVM API call.

Unified memory is a feature implemented by the KFD SVM API and used by 
ROCm. This is used e.g. to implement OpenMP USM (unified shared memory). 
It's implemented with recoverable GPU page faults. If the page fault 
interrupt handler cannot assume a shared virtual address space, then 
implementing this feature isn't possible.

Regards,
   Felix


>
> Cheers, Sima

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-29 15:03                 ` Felix Kuehling
@ 2024-01-29 15:33                   ` Christian König
  2024-01-29 16:24                     ` Felix Kuehling
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-29 15:33 UTC (permalink / raw)
  To: Felix Kuehling, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, Ghimiray,
	Himal Prasad, dri-devel, Gupta, saurabhg, Danilo Krummrich, Zeng,
	Oak, Bommu, Krishnaiah, Dave Airlie, Vishwanathapura, Niranjana,
	intel-xe

Am 29.01.24 um 16:03 schrieb Felix Kuehling:
> On 2024-01-25 13:32, Daniel Vetter wrote:
>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>> [SNIP]
>>>> Yes most API are per device based.
>>>>
>>>> One exception I know is actually the kfd SVM API. If you look at 
>>>> the svm_ioctl function, it is per-process based. Each kfd_process 
>>>> represent a process across N gpu devices.
>>> Yeah and that was a big mistake in my opinion. We should really not 
>>> do that
>>> ever again.
>>>
>>>> Need to say, kfd SVM represent a shared virtual address space 
>>>> across CPU and all GPU devices on the system. This is by the 
>>>> definition of SVM (shared virtual memory). This is very different 
>>>> from our legacy gpu *device* driver which works for only one device 
>>>> (i.e., if you want one device to access another device's memory, 
>>>> you will have to use dma-buf export/import etc).
>>> Exactly that thinking is what we have currently found as blocker for a
>>> virtualization projects. Having SVM as device independent feature which
>>> somehow ties to the process address space turned out to be an 
>>> extremely bad
>>> idea.
>>>
>>> The background is that this only works for some use cases but not 
>>> all of
>>> them.
>>>
>>> What's working much better is to just have a mirror functionality 
>>> which says
>>> that a range A..B of the process address space is mapped into a 
>>> range C..D
>>> of the GPU address space.
>>>
>>> Those ranges can then be used to implement the SVM feature required for
>>> higher level APIs and not something you need at the UAPI or even 
>>> inside the
>>> low level kernel memory management.
>>>
>>> When you talk about migrating memory to a device you also do this on 
>>> a per
>>> device basis and *not* tied to the process address space. If you 
>>> then get
>>> crappy performance because userspace gave contradicting information 
>>> where to
>>> migrate memory then that's a bug in userspace and not something the 
>>> kernel
>>> should try to prevent somehow.
>>>
>>> [SNIP]
>>>>> I think if you start using the same drm_gpuvm for multiple devices 
>>>>> you
>>>>> will sooner or later start to run into the same mess we have seen 
>>>>> with
>>>>> KFD, where we moved more and more functionality from the KFD to 
>>>>> the DRM
>>>>> render node because we found that a lot of the stuff simply 
>>>>> doesn't work
>>>>> correctly with a single object to maintain the state.
>>>> As I understand it, KFD is designed to work across devices. A 
>>>> single pseudo /dev/kfd device represent all hardware gpu devices. 
>>>> That is why during kfd open, many pdd (process device data) is 
>>>> created, each for one hardware device for this process.
>>> Yes, I'm perfectly aware of that. And I can only repeat myself that 
>>> I see
>>> this design as a rather extreme failure. And I think it's one of the 
>>> reasons
>>> why NVidia is so dominant with Cuda.
>>>
>>> This whole approach KFD takes was designed with the idea of 
>>> extending the
>>> CPU process into the GPUs, but this idea only works for a few use 
>>> cases and
>>> is not something we should apply to drivers in general.
>>>
>>> A very good example are virtualization use cases where you end up 
>>> with CPU
>>> address != GPU address because the VAs are actually coming from the 
>>> guest VM
>>> and not the host process.
>>>
>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should 
>>> not have
>>> any influence on the design of the kernel UAPI.
>>>
>>> If you want to do something similar as KFD for Xe I think you need 
>>> to get
>>> explicit permission to do this from Dave and Daniel and maybe even 
>>> Linus.
>> I think the one and only one exception where an SVM uapi like in kfd 
>> makes
>> sense, is if the _hardware_ itself, not the software stack defined
>> semantics that you've happened to build on top of that hw, enforces a 
>> 1:1
>> mapping with the cpu process address space.
>>
>> Which means your hardware is using PASID, IOMMU based translation, 
>> PCI-ATS
>> (address translation services) or whatever your hw calls it and has _no_
>> device-side pagetables on top. Which from what I've seen all devices 
>> with
>> device-memory have, simply because they need some place to store whether
>> that memory is currently in device memory or should be translated using
>> PASID. Currently there's no gpu that works with PASID only, but there 
>> are
>> some on-cpu-die accelerator things that do work like that.
>>
>> Maybe in the future there will be some accelerators that are fully cpu
>> cache coherent (including atomics) with something like CXL, and the
>> on-device memory is managed as normal system memory with struct page as
>> ZONE_DEVICE and accelerator va -> physical address translation is only
>> done with PASID ... but for now I haven't seen that, definitely not in
>> upstream drivers.
>>
>> And the moment you have some per-device pagetables or per-device memory
>> management of some sort (like using gpuva mgr) then I'm 100% agreeing 
>> with
>> Christian that the kfd SVM model is too strict and not a great idea.
>
> That basically means, without ATS/PRI+PASID you cannot implement a 
> unified memory programming model, where GPUs or accelerators access 
> virtual addresses without pre-registering them with an SVM API call.
>
> Unified memory is a feature implemented by the KFD SVM API and used by 
> ROCm. This is used e.g. to implement OpenMP USM (unified shared 
> memory). It's implemented with recoverable GPU page faults. If the 
> page fault interrupt handler cannot assume a shared virtual address 
> space, then implementing this feature isn't possible.

Why not? As far as I can see the OpenMP USM is just another funky way of 
userptr handling.

The difference is that with a userptr we assume that we always need to 
request the whole block A..B from a mapping, while for page fault based 
handling it can be just any page in between A and B which is requested 
and made available to the GPU address space.

As far as I can see there is absolutely no need for any special SVM 
handling.
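
In sketch form, with make_range_resident() standing in for whatever the 
driver uses to fault pages in and map them:

    /* userptr model: the whole registered range is requested up front. */
    static int userptr_populate(struct gpu_vm *vm, u64 a, u64 b)
    {
            return make_range_resident(vm, a, b);     /* all of [a, b) */
    }

    /* page fault model: only the faulting page inside [a, b), on demand. */
    static int fault_populate(struct gpu_vm *vm, u64 addr)
    {
            u64 page = addr & PAGE_MASK;

            return make_range_resident(vm, page, page + PAGE_SIZE);
    }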

Regards,
Christian.

>
> Regards,
>   Felix
>
>
>>
>> Cheers, Sima


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-29 15:33                   ` Christian König
@ 2024-01-29 16:24                     ` Felix Kuehling
  2024-01-29 16:28                       ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Felix Kuehling @ 2024-01-29 16:24 UTC (permalink / raw)
  To: Christian König, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, Ghimiray,
	Himal Prasad, dri-devel, Gupta, saurabhg, Danilo Krummrich, Zeng,
	Oak, Bommu, Krishnaiah, Dave Airlie, Vishwanathapura, Niranjana,
	intel-xe


On 2024-01-29 10:33, Christian König wrote:
> Am 29.01.24 um 16:03 schrieb Felix Kuehling:
>> On 2024-01-25 13:32, Daniel Vetter wrote:
>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>> [SNIP]
>>>>> Yes most API are per device based.
>>>>>
>>>>> One exception I know is actually the kfd SVM API. If you look at 
>>>>> the svm_ioctl function, it is per-process based. Each kfd_process 
>>>>> represent a process across N gpu devices.
>>>> Yeah and that was a big mistake in my opinion. We should really not 
>>>> do that
>>>> ever again.
>>>>
>>>>> Need to say, kfd SVM represent a shared virtual address space 
>>>>> across CPU and all GPU devices on the system. This is by the 
>>>>> definition of SVM (shared virtual memory). This is very different 
>>>>> from our legacy gpu *device* driver which works for only one 
>>>>> device (i.e., if you want one device to access another device's 
>>>>> memory, you will have to use dma-buf export/import etc).
>>>> Exactly that thinking is what we have currently found as blocker for a
>>>> virtualization projects. Having SVM as device independent feature 
>>>> which
>>>> somehow ties to the process address space turned out to be an 
>>>> extremely bad
>>>> idea.
>>>>
>>>> The background is that this only works for some use cases but not 
>>>> all of
>>>> them.
>>>>
>>>> What's working much better is to just have a mirror functionality 
>>>> which says
>>>> that a range A..B of the process address space is mapped into a 
>>>> range C..D
>>>> of the GPU address space.
>>>>
>>>> Those ranges can then be used to implement the SVM feature required 
>>>> for
>>>> higher level APIs and not something you need at the UAPI or even 
>>>> inside the
>>>> low level kernel memory management.
>>>>
>>>> When you talk about migrating memory to a device you also do this 
>>>> on a per
>>>> device basis and *not* tied to the process address space. If you 
>>>> then get
>>>> crappy performance because userspace gave contradicting information 
>>>> where to
>>>> migrate memory then that's a bug in userspace and not something the 
>>>> kernel
>>>> should try to prevent somehow.
>>>>
>>>> [SNIP]
>>>>>> I think if you start using the same drm_gpuvm for multiple 
>>>>>> devices you
>>>>>> will sooner or later start to run into the same mess we have seen 
>>>>>> with
>>>>>> KFD, where we moved more and more functionality from the KFD to 
>>>>>> the DRM
>>>>>> render node because we found that a lot of the stuff simply 
>>>>>> doesn't work
>>>>>> correctly with a single object to maintain the state.
>>>>> As I understand it, KFD is designed to work across devices. A 
>>>>> single pseudo /dev/kfd device represent all hardware gpu devices. 
>>>>> That is why during kfd open, many pdd (process device data) is 
>>>>> created, each for one hardware device for this process.
>>>> Yes, I'm perfectly aware of that. And I can only repeat myself that 
>>>> I see
>>>> this design as a rather extreme failure. And I think it's one of 
>>>> the reasons
>>>> why NVidia is so dominant with Cuda.
>>>>
>>>> This whole approach KFD takes was designed with the idea of 
>>>> extending the
>>>> CPU process into the GPUs, but this idea only works for a few use 
>>>> cases and
>>>> is not something we should apply to drivers in general.
>>>>
>>>> A very good example are virtualization use cases where you end up 
>>>> with CPU
>>>> address != GPU address because the VAs are actually coming from the 
>>>> guest VM
>>>> and not the host process.
>>>>
>>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should 
>>>> not have
>>>> any influence on the design of the kernel UAPI.
>>>>
>>>> If you want to do something similar as KFD for Xe I think you need 
>>>> to get
>>>> explicit permission to do this from Dave and Daniel and maybe even 
>>>> Linus.
>>> I think the one and only one exception where an SVM uapi like in kfd 
>>> makes
>>> sense, is if the _hardware_ itself, not the software stack defined
>>> semantics that you've happened to build on top of that hw, enforces 
>>> a 1:1
>>> mapping with the cpu process address space.
>>>
>>> Which means your hardware is using PASID, IOMMU based translation, 
>>> PCI-ATS
>>> (address translation services) or whatever your hw calls it and has 
>>> _no_
>>> device-side pagetables on top. Which from what I've seen all devices 
>>> with
>>> device-memory have, simply because they need some place to store 
>>> whether
>>> that memory is currently in device memory or should be translated using
>>> PASID. Currently there's no gpu that works with PASID only, but 
>>> there are
>>> some on-cpu-die accelerator things that do work like that.
>>>
>>> Maybe in the future there will be some accelerators that are fully cpu
>>> cache coherent (including atomics) with something like CXL, and the
>>> on-device memory is managed as normal system memory with struct page as
>>> ZONE_DEVICE and accelerator va -> physical address translation is only
>>> done with PASID ... but for now I haven't seen that, definitely not in
>>> upstream drivers.
>>>
>>> And the moment you have some per-device pagetables or per-device memory
>>> management of some sort (like using gpuva mgr) then I'm 100% 
>>> agreeing with
>>> Christian that the kfd SVM model is too strict and not a great idea.
>>
>> That basically means, without ATS/PRI+PASID you cannot implement a 
>> unified memory programming model, where GPUs or accelerators access 
>> virtual addresses without pre-registering them with an SVM API call.
>>
>> Unified memory is a feature implemented by the KFD SVM API and used 
>> by ROCm. This is used e.g. to implement OpenMP USM (unified shared 
>> memory). It's implemented with recoverable GPU page faults. If the 
>> page fault interrupt handler cannot assume a shared virtual address 
>> space, then implementing this feature isn't possible.
>
> Why not? As far as I can see the OpenMP USM is just another funky way 
> of userptr handling.
>
> The difference is that in an userptr we assume that we always need to 
> request the whole block A..B from a mapping while for page fault based 
> handling it can be just any page in between A and B which is requested 
> and made available to the GPU address space.
>
> As far as I can see there is absolutely no need for any special SVM 
> handling.

It does assume a shared virtual address space between CPU and GPUs. 
There are no API calls to tell the driver that address A on the CPU maps 
to address B on GPU1 and address C on GPU2. The KFD SVM API was 
designed to work with this programming model, by augmenting the shared 
virtual address mappings with virtual address range attributes that can 
modify the migration policy and indicate prefetching, prefaulting, etc. 
You could think of it as madvise on steroids.
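
For reference, the userspace side of that interface looks roughly like 
this (types and flags from include/uapi/linux/kfd_ioctl.h; simplified, 
error handling omitted, kfd_fd is an open /dev/kfd):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kfd_ioctl.h>

    /* madvise-style: set a preferred location for an address range. */
    static int svm_set_preferred_loc(int kfd_fd, void *addr, size_t size,
                                     unsigned int gpu_id)
    {
            struct {
                    struct kfd_ioctl_svm_args args;
                    struct kfd_ioctl_svm_attribute attr;
            } s;

            memset(&s, 0, sizeof(s));
            s.args.start_addr = (unsigned long)addr;
            s.args.size       = size;
            s.args.op         = KFD_IOCTL_SVM_OP_SET_ATTR;
            s.args.nattr      = 1;
            s.attr.type       = KFD_IOCTL_SVM_ATTR_PREFERRED_LOC;
            s.attr.value      = gpu_id;

            return ioctl(kfd_fd, AMDKFD_IOC_SVM, &s.args);
    }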

Regards,
   Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Cheers, Sima
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-29 16:24                     ` Felix Kuehling
@ 2024-01-29 16:28                       ` Christian König
  2024-01-29 17:52                         ` Felix Kuehling
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-29 16:28 UTC (permalink / raw)
  To: Felix Kuehling, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, Ghimiray,
	Himal Prasad, dri-devel, Gupta, saurabhg, Danilo Krummrich, Zeng,
	Oak, Bommu, Krishnaiah, Dave Airlie, Vishwanathapura, Niranjana,
	intel-xe

Am 29.01.24 um 17:24 schrieb Felix Kuehling:
> On 2024-01-29 10:33, Christian König wrote:
>> Am 29.01.24 um 16:03 schrieb Felix Kuehling:
>>> On 2024-01-25 13:32, Daniel Vetter wrote:
>>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>>> [SNIP]
>>>>>> Yes most API are per device based.
>>>>>>
>>>>>> One exception I know is actually the kfd SVM API. If you look at 
>>>>>> the svm_ioctl function, it is per-process based. Each kfd_process 
>>>>>> represent a process across N gpu devices.
>>>>> Yeah and that was a big mistake in my opinion. We should really 
>>>>> not do that
>>>>> ever again.
>>>>>
>>>>>> Need to say, kfd SVM represent a shared virtual address space 
>>>>>> across CPU and all GPU devices on the system. This is by the 
>>>>>> definition of SVM (shared virtual memory). This is very different 
>>>>>> from our legacy gpu *device* driver which works for only one 
>>>>>> device (i.e., if you want one device to access another device's 
>>>>>> memory, you will have to use dma-buf export/import etc).
>>>>> Exactly that thinking is what we have currently found as blocker 
>>>>> for a
>>>>> virtualization projects. Having SVM as device independent feature 
>>>>> which
>>>>> somehow ties to the process address space turned out to be an 
>>>>> extremely bad
>>>>> idea.
>>>>>
>>>>> The background is that this only works for some use cases but not 
>>>>> all of
>>>>> them.
>>>>>
>>>>> What's working much better is to just have a mirror functionality 
>>>>> which says
>>>>> that a range A..B of the process address space is mapped into a 
>>>>> range C..D
>>>>> of the GPU address space.
>>>>>
>>>>> Those ranges can then be used to implement the SVM feature 
>>>>> required for
>>>>> higher level APIs and not something you need at the UAPI or even 
>>>>> inside the
>>>>> low level kernel memory management.
>>>>>
>>>>> When you talk about migrating memory to a device you also do this 
>>>>> on a per
>>>>> device basis and *not* tied to the process address space. If you 
>>>>> then get
>>>>> crappy performance because userspace gave contradicting 
>>>>> information where to
>>>>> migrate memory then that's a bug in userspace and not something 
>>>>> the kernel
>>>>> should try to prevent somehow.
>>>>>
>>>>> [SNIP]
>>>>>>> I think if you start using the same drm_gpuvm for multiple 
>>>>>>> devices you
>>>>>>> will sooner or later start to run into the same mess we have 
>>>>>>> seen with
>>>>>>> KFD, where we moved more and more functionality from the KFD to 
>>>>>>> the DRM
>>>>>>> render node because we found that a lot of the stuff simply 
>>>>>>> doesn't work
>>>>>>> correctly with a single object to maintain the state.
>>>>>> As I understand it, KFD is designed to work across devices. A 
>>>>>> single pseudo /dev/kfd device represent all hardware gpu devices. 
>>>>>> That is why during kfd open, many pdd (process device data) is 
>>>>>> created, each for one hardware device for this process.
>>>>> Yes, I'm perfectly aware of that. And I can only repeat myself 
>>>>> that I see
>>>>> this design as a rather extreme failure. And I think it's one of 
>>>>> the reasons
>>>>> why NVidia is so dominant with Cuda.
>>>>>
>>>>> This whole approach KFD takes was designed with the idea of 
>>>>> extending the
>>>>> CPU process into the GPUs, but this idea only works for a few use 
>>>>> cases and
>>>>> is not something we should apply to drivers in general.
>>>>>
>>>>> A very good example are virtualization use cases where you end up 
>>>>> with CPU
>>>>> address != GPU address because the VAs are actually coming from 
>>>>> the guest VM
>>>>> and not the host process.
>>>>>
>>>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This 
>>>>> should not have
>>>>> any influence on the design of the kernel UAPI.
>>>>>
>>>>> If you want to do something similar as KFD for Xe I think you need 
>>>>> to get
>>>>> explicit permission to do this from Dave and Daniel and maybe even 
>>>>> Linus.
>>>> I think the one and only one exception where an SVM uapi like in 
>>>> kfd makes
>>>> sense, is if the _hardware_ itself, not the software stack defined
>>>> semantics that you've happened to build on top of that hw, enforces 
>>>> a 1:1
>>>> mapping with the cpu process address space.
>>>>
>>>> Which means your hardware is using PASID, IOMMU based translation, 
>>>> PCI-ATS
>>>> (address translation services) or whatever your hw calls it and has 
>>>> _no_
>>>> device-side pagetables on top. Which from what I've seen all 
>>>> devices with
>>>> device-memory have, simply because they need some place to store 
>>>> whether
>>>> that memory is currently in device memory or should be translated 
>>>> using
>>>> PASID. Currently there's no gpu that works with PASID only, but 
>>>> there are
>>>> some on-cpu-die accelerator things that do work like that.
>>>>
>>>> Maybe in the future there will be some accelerators that are fully cpu
>>>> cache coherent (including atomics) with something like CXL, and the
>>>> on-device memory is managed as normal system memory with struct 
>>>> page as
>>>> ZONE_DEVICE and accelerator va -> physical address translation is only
>>>> done with PASID ... but for now I haven't seen that, definitely not in
>>>> upstream drivers.
>>>>
>>>> And the moment you have some per-device pagetables or per-device 
>>>> memory
>>>> management of some sort (like using gpuva mgr) then I'm 100% 
>>>> agreeing with
>>>> Christian that the kfd SVM model is too strict and not a great idea.
>>>
>>> That basically means, without ATS/PRI+PASID you cannot implement a 
>>> unified memory programming model, where GPUs or accelerators access 
>>> virtual addresses without pre-registering them with an SVM API call.
>>>
>>> Unified memory is a feature implemented by the KFD SVM API and used 
>>> by ROCm. This is used e.g. to implement OpenMP USM (unified shared 
>>> memory). It's implemented with recoverable GPU page faults. If the 
>>> page fault interrupt handler cannot assume a shared virtual address 
>>> space, then implementing this feature isn't possible.
>>
>> Why not? As far as I can see the OpenMP USM is just another funky way 
>> of userptr handling.
>>
>> The difference is that in an userptr we assume that we always need to 
>> request the whole block A..B from a mapping while for page fault 
>> based handling it can be just any page in between A and B which is 
>> requested and made available to the GPU address space.
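As a rough kernel-side sketch of that page fault based handling, using
the upstream HMM helpers (simplified: the sequence-retry loop and the
driver's page-table locking are elided, and the function name is made
up):

#include <linux/hmm.h>

/* Fault in a single page around a device fault address and report its
 * pfn, instead of pinning the whole userptr block A..B up front. */
static int mirror_fault_one_page(struct mm_struct *mm,
                                 struct mmu_interval_notifier *notifier,
                                 unsigned long addr)
{
        unsigned long pfn;
        struct hmm_range range = {
                .notifier = notifier,
                .start = addr & PAGE_MASK,
                .end = (addr & PAGE_MASK) + PAGE_SIZE,
                .hmm_pfns = &pfn,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

        range.notifier_seq = mmu_interval_read_begin(notifier);
        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(mm);
        if (ret)
                return ret;     /* caller retries on -EBUSY */

        /* Program the GPU page table for this one page here, after
         * validating mmu_interval_read_retry(notifier,
         * range.notifier_seq) under the driver's page-table lock. */
        return 0;
}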
>>
>> As far as I can see there is absolutely no need for any special SVM 
>> handling.
>
> It does assume a shared virtual address space between CPU and GPUs. 
> There are no API calls to tell the driver that address A on the CPU 
> maps to address B on the GPU1 and address C on GPU2. The KFD SVM API 
> was designed to work with this programming model, by augmenting the 
> shared virtual address mappings with virtual address range attributes 
> that can modify the migration policy and indicate prefetching, 
> prefaulting, etc. You could think of it as madvise on steroids.
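For concreteness, a hedged user-space sketch of the kind of call meant
here. The struct and ioctl names below are my recollection of the
upstream uapi/linux/kfd_ioctl.h; treat the exact names as an
assumption and check the header:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kfd_ioctl.h>

/* Set a preferred location on a VA range via the KFD SVM ioctl.  The
 * range is identified purely by its CPU virtual address - the shared
 * address space assumption described above. */
static int svm_set_preferred_loc(int kfd_fd, void *addr, size_t size,
                                 unsigned int gpu_id)
{
        struct kfd_ioctl_svm_args *args;
        int ret;

        args = calloc(1, sizeof(*args) + sizeof(args->attrs[0]));
        if (!args)
                return -1;

        args->start_addr = (unsigned long)addr; /* CPU VA == GPU VA */
        args->size = size;
        args->op = KFD_IOCTL_SVM_OP_SET_ATTR;
        args->nattr = 1;
        args->attrs[0].type = KFD_IOCTL_SVM_ATTR_PREFERRED_LOC;
        args->attrs[0].value = gpu_id;

        ret = ioctl(kfd_fd, AMDKFD_IOC_SVM, args);
        free(args);
        return ret;
}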

Yeah, so what? In this case you just say through an IOCTL that CPU range 
A..B should map to GPU range C..D and for A/B and C/D you use the 
maximum of the address space.

There is no restriction that this needs to be accurate in any way. It's 
just that it can be accurate to be more efficient and eventually use only 
a fraction of the address space instead of all of it for some use cases.

So this isn't a blocker, it's just one special use case.

Regards,
Christian.

>
> Regards,
>   Felix
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Cheers, Sima
>>



* Re: Making drm_gpuvm work across gpu devices
  2024-01-29 16:28                       ` Christian König
@ 2024-01-29 17:52                         ` Felix Kuehling
  2024-01-29 19:03                           ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Felix Kuehling @ 2024-01-29 17:52 UTC (permalink / raw)
  To: Christian König, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, Ghimiray,
	Himal Prasad, dri-devel, Gupta, saurabhg, Danilo Krummrich, Zeng,
	Oak, Bommu, Krishnaiah, Dave Airlie, Vishwanathapura, Niranjana,
	intel-xe


On 2024-01-29 11:28, Christian König wrote:
> Am 29.01.24 um 17:24 schrieb Felix Kuehling:
>> On 2024-01-29 10:33, Christian König wrote:
>>> Am 29.01.24 um 16:03 schrieb Felix Kuehling:
>>>> On 2024-01-25 13:32, Daniel Vetter wrote:
>>>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>>>> [SNIP]
>>>>>>> Yes most API are per device based.
>>>>>>>
>>>>>>> One exception I know is actually the kfd SVM API. If you look at 
>>>>>>> the svm_ioctl function, it is per-process based. Each 
>>>>>>> kfd_process represents a process across N gpu devices.
>>>>>> Yeah and that was a big mistake in my opinion. We should really 
>>>>>> not do that
>>>>>> ever again.
>>>>>>
>>>>>>> Need to say, kfd SVM represents a shared virtual address space 
>>>>>>> across CPU and all GPU devices on the system. This is by the 
>>>>>>> definition of SVM (shared virtual memory). This is very 
>>>>>>> different from our legacy gpu *device* driver which works for 
>>>>>>> only one device (i.e., if you want one device to access another 
>>>>>>> device's memory, you will have to use dma-buf export/import etc).
>>>>>> Exactly that thinking is what we have currently found as a blocker
>>>>>> for virtualization projects. Having SVM as a device independent
>>>>>> feature which somehow ties to the process address space turned out
>>>>>> to be an extremely bad idea.
>>>>>>
>>>>>> The background is that this only works for some use cases but not 
>>>>>> all of
>>>>>> them.
>>>>>>
>>>>>> What's working much better is to just have a mirror functionality 
>>>>>> which says
>>>>>> that a range A..B of the process address space is mapped into a 
>>>>>> range C..D
>>>>>> of the GPU address space.
>>>>>>
>>>>>> Those ranges can then be used to implement the SVM feature 
>>>>>> required for
>>>>>> higher level APIs and not something you need at the UAPI or even 
>>>>>> inside the
>>>>>> low level kernel memory management.
>>>>>>
>>>>>> When you talk about migrating memory to a device you also do this 
>>>>>> on a per
>>>>>> device basis and *not* tied to the process address space. If you 
>>>>>> then get
>>>>>> crappy performance because userspace gave contradicting 
>>>>>> information where to
>>>>>> migrate memory then that's a bug in userspace and not something 
>>>>>> the kernel
>>>>>> should try to prevent somehow.
>>>>>>
>>>>>> [SNIP]
>>>>>>>> I think if you start using the same drm_gpuvm for multiple 
>>>>>>>> devices you
>>>>>>>> will sooner or later start to run into the same mess we have 
>>>>>>>> seen with
>>>>>>>> KFD, where we moved more and more functionality from the KFD to 
>>>>>>>> the DRM
>>>>>>>> render node because we found that a lot of the stuff simply 
>>>>>>>> doesn't work
>>>>>>>> correctly with a single object to maintain the state.
>>>>>>> As I understand it, KFD is designed to work across devices. A
>>>>>>> single pseudo /dev/kfd device represents all hardware gpu
>>>>>>> devices. That is why during kfd open, many pdds (process device
>>>>>>> data) are created, each for one hardware device for this process.
>>>>>> Yes, I'm perfectly aware of that. And I can only repeat myself 
>>>>>> that I see
>>>>>> this design as a rather extreme failure. And I think it's one of 
>>>>>> the reasons
>>>>>> why NVidia is so dominant with Cuda.
>>>>>>
>>>>>> This whole approach KFD takes was designed with the idea of 
>>>>>> extending the
>>>>>> CPU process into the GPUs, but this idea only works for a few use 
>>>>>> cases and
>>>>>> is not something we should apply to drivers in general.
>>>>>>
>>>>>> A very good example are virtualization use cases where you end up 
>>>>>> with CPU
>>>>>> address != GPU address because the VAs are actually coming from 
>>>>>> the guest VM
>>>>>> and not the host process.
>>>>>>
>>>>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This 
>>>>>> should not have
>>>>>> any influence on the design of the kernel UAPI.
>>>>>>
>>>>>> If you want to do something similar as KFD for Xe I think you 
>>>>>> need to get
>>>>>> explicit permission to do this from Dave and Daniel and maybe 
>>>>>> even Linus.
>>>>> I think the one and only one exception where an SVM uapi like in 
>>>>> kfd makes
>>>>> sense, is if the _hardware_ itself, not the software stack defined
>>>>> semantics that you've happened to build on top of that hw, 
>>>>> enforces a 1:1
>>>>> mapping with the cpu process address space.
>>>>>
>>>>> Which means your hardware is using PASID, IOMMU based translation, 
>>>>> PCI-ATS
>>>>> (address translation services) or whatever your hw calls it and 
>>>>> has _no_
>>>>> device-side pagetables on top. Which from what I've seen all 
>>>>> devices with
>>>>> device-memory have, simply because they need some place to store 
>>>>> whether
>>>>> that memory is currently in device memory or should be translated 
>>>>> using
>>>>> PASID. Currently there's no gpu that works with PASID only, but 
>>>>> there are
>>>>> some on-cpu-die accelerator things that do work like that.
>>>>>
>>>>> Maybe in the future there will be some accelerators that are fully 
>>>>> cpu
>>>>> cache coherent (including atomics) with something like CXL, and the
>>>>> on-device memory is managed as normal system memory with struct 
>>>>> page as
>>>>> ZONE_DEVICE and accelerator va -> physical address translation is 
>>>>> only
>>>>> done with PASID ... but for now I haven't seen that, definitely 
>>>>> not in
>>>>> upstream drivers.
>>>>>
>>>>> And the moment you have some per-device pagetables or per-device 
>>>>> memory
>>>>> management of some sort (like using gpuva mgr) then I'm 100% 
>>>>> agreeing with
>>>>> Christian that the kfd SVM model is too strict and not a great idea.
>>>>
>>>> That basically means, without ATS/PRI+PASID you cannot implement a 
>>>> unified memory programming model, where GPUs or accelerators access 
>>>> virtual addresses without pre-registering them with an SVM API call.
>>>>
>>>> Unified memory is a feature implemented by the KFD SVM API and used 
>>>> by ROCm. This is used e.g. to implement OpenMP USM (unified shared 
>>>> memory). It's implemented with recoverable GPU page faults. If the 
>>>> page fault interrupt handler cannot assume a shared virtual address 
>>>> space, then implementing this feature isn't possible.
>>>
>>> Why not? As far as I can see the OpenMP USM is just another funky 
>>> way of userptr handling.
>>>
>>> The difference is that in an userptr we assume that we always need 
>>> to request the whole block A..B from a mapping while for page fault 
>>> based handling it can be just any page in between A and B which is 
>>> requested and made available to the GPU address space.
>>>
>>> As far as I can see there is absolutely no need for any special SVM 
>>> handling.
>>
>> It does assume a shared virtual address space between CPU and GPUs. 
>> There are no API calls to tell the driver that address A on the CPU 
>> maps to address B on the GPU1 and address C on GPU2. The KFD SVM API 
>> was designed to work with this programming model, by augmenting the 
>> shared virtual address mappings with virtual address range attributes 
>> that can modify the migration policy and indicate prefetching, 
>> prefaulting, etc. You could think of it as madvise on steroids.
>
> Yeah, so what? In this case you just say through an IOCTL that CPU 
> range A..B should map to GPU range C..D and for A/B and C/D you use 
> the maximum of the address space.

What I want is that address range A..B on the CPU matches A..B on the 
GPU, because I'm sharing pointers between CPU and GPU. I can't think of 
any sane user mode using a unified memory programming model that would 
ever ask KFD to map unified memory mappings to a different address range 
on the GPU. Adding such an ioctl is a complete waste of time, and can 
only serve to add unnecessary complexity.

Regards,
   Felix


>
> There is no restriction that this needs to be accurate in any way. It's 
> just that it can be accurate to be more efficient and eventually use 
> only a fraction of the address space instead of all of it for some use 
> cases.
>
> So this isn't a blocker, it's just one special use case.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>>>
>>>>> Cheers, Sima
>>>
>


* Re: Making drm_gpuvm work across gpu devices
  2024-01-29 17:52                         ` Felix Kuehling
@ 2024-01-29 19:03                           ` Christian König
  2024-01-29 20:24                             ` Felix Kuehling
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-29 19:03 UTC (permalink / raw)
  To: Felix Kuehling, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, Ghimiray,
	Himal Prasad, dri-devel, Gupta, saurabhg, Danilo Krummrich, Zeng,
	Oak, Bommu, Krishnaiah, Dave Airlie, Vishwanathapura, Niranjana,
	intel-xe

Am 29.01.24 um 18:52 schrieb Felix Kuehling:
> On 2024-01-29 11:28, Christian König wrote:
>> Am 29.01.24 um 17:24 schrieb Felix Kuehling:
>>> On 2024-01-29 10:33, Christian König wrote:
>>>> Am 29.01.24 um 16:03 schrieb Felix Kuehling:
>>>>> On 2024-01-25 13:32, Daniel Vetter wrote:
>>>>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>>>>> [SNIP]
>>>>>>>> Yes most API are per device based.
>>>>>>>>
>>>>>>>> One exception I know is actually the kfd SVM API. If you look 
>>>>>>>> at the svm_ioctl function, it is per-process based. Each 
>>>>>>>> kfd_process represents a process across N gpu devices.
>>>>>>> Yeah and that was a big mistake in my opinion. We should really 
>>>>>>> not do that
>>>>>>> ever again.
>>>>>>>
>>>>>>>> Need to say, kfd SVM represents a shared virtual address space 
>>>>>>>> across CPU and all GPU devices on the system. This is by the 
>>>>>>>> definition of SVM (shared virtual memory). This is very 
>>>>>>>> different from our legacy gpu *device* driver which works for 
>>>>>>>> only one device (i.e., if you want one device to access another 
>>>>>>>> device's memory, you will have to use dma-buf export/import etc).
>>>>>>> Exactly that thinking is what we have currently found as a blocker
>>>>>>> for virtualization projects. Having SVM as a device independent
>>>>>>> feature which somehow ties to the process address space turned out
>>>>>>> to be an extremely bad idea.
>>>>>>>
>>>>>>> The background is that this only works for some use cases but 
>>>>>>> not all of
>>>>>>> them.
>>>>>>>
>>>>>>> What's working much better is to just have a mirror 
>>>>>>> functionality which says
>>>>>>> that a range A..B of the process address space is mapped into a 
>>>>>>> range C..D
>>>>>>> of the GPU address space.
>>>>>>>
>>>>>>> Those ranges can then be used to implement the SVM feature 
>>>>>>> required for
>>>>>>> higher level APIs and not something you need at the UAPI or even 
>>>>>>> inside the
>>>>>>> low level kernel memory management.
>>>>>>>
>>>>>>> When you talk about migrating memory to a device you also do 
>>>>>>> this on a per
>>>>>>> device basis and *not* tied to the process address space. If you 
>>>>>>> then get
>>>>>>> crappy performance because userspace gave contradicting 
>>>>>>> information where to
>>>>>>> migrate memory then that's a bug in userspace and not something 
>>>>>>> the kernel
>>>>>>> should try to prevent somehow.
>>>>>>>
>>>>>>> [SNIP]
>>>>>>>>> I think if you start using the same drm_gpuvm for multiple 
>>>>>>>>> devices you
>>>>>>>>> will sooner or later start to run into the same mess we have 
>>>>>>>>> seen with
>>>>>>>>> KFD, where we moved more and more functionality from the KFD 
>>>>>>>>> to the DRM
>>>>>>>>> render node because we found that a lot of the stuff simply 
>>>>>>>>> doesn't work
>>>>>>>>> correctly with a single object to maintain the state.
>>>>>>>> As I understand it, KFD is designed to work across devices. A
>>>>>>>> single pseudo /dev/kfd device represents all hardware gpu
>>>>>>>> devices. That is why during kfd open, many pdds (process device
>>>>>>>> data) are created, each for one hardware device for this process.
>>>>>>> Yes, I'm perfectly aware of that. And I can only repeat myself 
>>>>>>> that I see
>>>>>>> this design as a rather extreme failure. And I think it's one of 
>>>>>>> the reasons
>>>>>>> why NVidia is so dominant with Cuda.
>>>>>>>
>>>>>>> This whole approach KFD takes was designed with the idea of 
>>>>>>> extending the
>>>>>>> CPU process into the GPUs, but this idea only works for a few 
>>>>>>> use cases and
>>>>>>> is not something we should apply to drivers in general.
>>>>>>>
>>>>>>> A very good example are virtualization use cases where you end 
>>>>>>> up with CPU
>>>>>>> address != GPU address because the VAs are actually coming from 
>>>>>>> the guest VM
>>>>>>> and not the host process.
>>>>>>>
>>>>>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This 
>>>>>>> should not have
>>>>>>> any influence on the design of the kernel UAPI.
>>>>>>>
>>>>>>> If you want to do something similar as KFD for Xe I think you 
>>>>>>> need to get
>>>>>>> explicit permission to do this from Dave and Daniel and maybe 
>>>>>>> even Linus.
>>>>>> I think the one and only one exception where an SVM uapi like in 
>>>>>> kfd makes
>>>>>> sense, is if the _hardware_ itself, not the software stack defined
>>>>>> semantics that you've happened to build on top of that hw, 
>>>>>> enforces a 1:1
>>>>>> mapping with the cpu process address space.
>>>>>>
>>>>>> Which means your hardware is using PASID, IOMMU based 
>>>>>> translation, PCI-ATS
>>>>>> (address translation services) or whatever your hw calls it and 
>>>>>> has _no_
>>>>>> device-side pagetables on top. Which from what I've seen all 
>>>>>> devices with
>>>>>> device-memory have, simply because they need some place to store 
>>>>>> whether
>>>>>> that memory is currently in device memory or should be translated 
>>>>>> using
>>>>>> PASID. Currently there's no gpu that works with PASID only, but 
>>>>>> there are
>>>>>> some on-cpu-die accelerator things that do work like that.
>>>>>>
>>>>>> Maybe in the future there will be some accelerators that are 
>>>>>> fully cpu
>>>>>> cache coherent (including atomics) with something like CXL, and the
>>>>>> on-device memory is managed as normal system memory with struct 
>>>>>> page as
>>>>>> ZONE_DEVICE and accelerator va -> physical address translation is 
>>>>>> only
>>>>>> done with PASID ... but for now I haven't seen that, definitely 
>>>>>> not in
>>>>>> upstream drivers.
>>>>>>
>>>>>> And the moment you have some per-device pagetables or per-device 
>>>>>> memory
>>>>>> management of some sort (like using gpuva mgr) then I'm 100% 
>>>>>> agreeing with
>>>>>> Christian that the kfd SVM model is too strict and not a great idea.
>>>>>
>>>>> That basically means, without ATS/PRI+PASID you cannot implement a 
>>>>> unified memory programming model, where GPUs or accelerators 
>>>>> access virtual addresses without pre-registering them with an SVM 
>>>>> API call.
>>>>>
>>>>> Unified memory is a feature implemented by the KFD SVM API and 
>>>>> used by ROCm. This is used e.g. to implement OpenMP USM (unified 
>>>>> shared memory). It's implemented with recoverable GPU page faults. 
>>>>> If the page fault interrupt handler cannot assume a shared virtual 
>>>>> address space, then implementing this feature isn't possible.
>>>>
>>>> Why not? As far as I can see the OpenMP USM is just another funky 
>>>> way of userptr handling.
>>>>
>>>> The difference is that in an userptr we assume that we always need 
>>>> to request the whole block A..B from a mapping while for page fault 
>>>> based handling it can be just any page in between A and B which is 
>>>> requested and made available to the GPU address space.
>>>>
>>>> As far as I can see there is absolutely no need for any special SVM 
>>>> handling.
>>>
>>> It does assume a shared virtual address space between CPU and GPUs. 
>>> There are no API calls to tell the driver that address A on the CPU 
>>> maps to address B on the GPU1 and address C on GPU2. The KFD SVM API 
>>> was designed to work with this programming model, by augmenting the 
>>> shared virtual address mappings with virtual address range 
>>> attributes that can modify the migration policy and indicate 
>>> prefetching, prefaulting, etc. You could think of it as madvise on 
>>> steroids.
>>
>> Yeah, so what? In this case you just say through an IOCTL that CPU 
>> range A..B should map to GPU range C..D and for A/B and C/D you use 
>> the maximum of the address space.
>
> What I want is that address range A..B on the CPU matches A..B on the 
> GPU, because I'm sharing pointers between CPU and GPU. I can't think 
> of any sane user mode using a unified memory programming model that 
> would ever ask KFD to map unified memory mappings to a different 
> address range on the GPU. Adding such an ioctl is a complete waste of 
> time, and can only serve to add unnecessary complexity.

This is exactly the use case which happens when the submitting process 
is not the one originally stitching together the command stream.

Basically all native context, virtualization and other proxy use cases 
work like this.

So the CPU address not matching the GPU address is an absolutely real 
use case, and the GPU VA interface should be able to handle it.

Regards,
Christian.



>
> Regards,
>   Felix
>
>
>>
>> There is no restriction that this needs to be accurate in any way. It's 
>> just that it can be accurate to be more efficient and eventually use 
>> only a fraction of the address space instead of all of it for some 
>> use cases.
>>
>> So this isn't a blocker, it's just one special use case.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>>   Felix
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers, Sima
>>>>
>>



* RE: Making drm_gpuvm work across gpu devices
  2024-01-29 10:10                         ` Christian König
@ 2024-01-29 20:09                           ` Zeng, Oak
  0 siblings, 0 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-29 20:09 UTC (permalink / raw)
  To: Christian König, David Airlie, jglisse, rcampbell, apopple
  Cc: Brost, Matthew, Thomas.Hellstrom, Winiarski, Michal,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, Ghimiray,
	Himal Prasad, Daniel Vetter, Bommu,  Krishnaiah, Gupta, saurabhg,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich


Hi Christian,

Even though this email thread was started to discuss a shared virtual address space b/t multiple GPU devices, I eventually found that you don't even agree with a shared virtual address space b/t a CPU and a GPU program. So let's forget about multiple GPU devices for now. I will try to explain the shared address space b/t the cpu and one gpu.

HMM was designed to solve the GPU programmability problem with one very fundamental assumption: the GPU program shares the same virtual address space as the CPU program. For example, with HMM any CPU pointer (malloc'ed memory, stack variables, globals) can be used directly in your GPU shader program. Are you against this design goal? HMM is already part of Linux core MM and Linus approved this design. CC'ed Jérôme.

Here is an example of how an application can use the system allocator (HMM). I copied it from https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/. CC'ed a few Nvidia folks.

void sortfile(FILE* fp, int N) {
  char* data = (char*)malloc(N);    /* plain malloc'ed CPU pointer */

  fread(data, 1, N, fp);
  qsort<<<...>>>(data, N, 1, cmp);  /* same pointer passed straight to the GPU kernel */
  cudaDeviceSynchronize();          /* wait for the GPU to finish */

  use_data(data);                   /* CPU consumes the sorted data */
  free(data);
}

As you can see, the malloc'ed ptr is used directly in the GPU program: no userptr ioctl, no vm_bind. This is the model Intel also wants to support, like AMD and Nvidia.

Lastly, nouveau in the kernel already supports HMM and the system allocator. It also supports a shared virtual address space b/t CPU and GPU programs. All the code is already merged upstream.


See also my comments inline to your questions below.

I will address your other email separately.

Regards,
Oak

From: Christian König <christian.koenig@amd.com>
Sent: Monday, January 29, 2024 5:11 AM
To: Zeng, Oak <oak.zeng@intel.com>; David Airlie <airlied@redhat.com>
Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Winiarski, Michal <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 26.01.24 um 21:13 schrieb Zeng, Oak:

-----Original Message-----
From: Christian König <christian.koenig@amd.com>
Sent: Friday, January 26, 2024 5:10 AM
To: Zeng, Oak <oak.zeng@intel.com>; David Airlie <airlied@redhat.com>
Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Winiarski, Michal <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices



Hi Oak,

you can still use SVM, but it should not be a design criteria for the kernel UAPI. In other words the UAPI should be designed in such a way that the GPU virtual address can be equal to the CPU virtual address of a buffer, but can also be different to support use cases where this isn't the case.

Terminology:

SVM: any technology which can achieve a shared virtual address space b/t the cpu and devices. The virtual address space can be managed by user space or kernel space. Intel implemented an SVM on top of the BO-centric gpu driver (gem-create, vm-bind) where the virtual address space is managed by UMD.

System allocator: another way of implementing SVM. Users just use malloc'ed memory for gpu submission. The virtual address space is managed by Linux core mm. In practice, we leverage HMM to implement the system allocator.

This article describes the details of all those different models: https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

Our programming model allows a mixed use of the system allocator and traditional vm_bind (where the cpu address can != the gpu address). Let me re-post the pseudo code:



 1. Fd0 = open("/dev/dri/render0")
 2. Fd1 = open("/dev/dri/render1")
 3. Fd3 = open("/dev/dri/xe-svm")
 4. Gpu_Vm0 = xe_vm_create(fd0)
 5. Gpu_Vm1 = xe_vm_create(fd1)
 6. Queue0 = xe_exec_queue_create(fd0, gpu_vm0)
 7. Queue1 = xe_exec_queue_create(fd1, gpu_vm1)
 8. ptr = malloc()
 9. bo = xe_bo_create(fd0)
10. Vm_bind(bo, gpu_vm0, va)  // va is from UMD; the cpu can access bo with the same or a different va. It is UMD's responsibility that va doesn't conflict with malloc'ed ptrs.
11. Xe_exec(queue0, ptr)      // submit a gpu job which uses ptr, on card0
12. Xe_exec(queue1, ptr)      // submit a gpu job which uses ptr, on card1
13. Xe_exec(queue0, va)       // submit a gpu job which uses va, on card0



In the above code, the va used in vm_bind (line 10, Intel's API to bind an object to a va for GPU access) can be different from the CPU address used when the cpu accesses the same object. But whenever the user uses a malloc'ed ptr for GPU submission (lines 11 and 12, the so-called system allocator), it implies that the CPU and GPUs use the same ptr for access.

In the above vm_bind, it is the user/UMD's responsibility to guarantee that the vm_bind va doesn't conflict with any malloc'ed ptr. Otherwise it is treated as a programming error.

I think this design still meets your design restrictions.

Well why do you need this "Fd3 = open("/dev/dri/xe-svm")"?
As far as I see fd3 isn't used anywhere.

We use fd3 for the memory hint ioctls (I didn't write them in the above program). Under the system allocator picture, a memory hint is applied to a virtual address range in a process and is not specific to one GPU device, so we can't use fd0 and fd1 for this purpose. For example, you can set the preferred memory location of an address range to be gpu device1's memory.
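For illustration only, a sketch of what such a hint ioctl on fd3 could
look like. No such uAPI exists in xekmd today; every name below is made
up:

struct xe_svm_madvise {        /* hypothetical, for illustration only */
        __u64 start;           /* start of the CPU VA range the hint applies to */
        __u64 size;            /* length of the range */
        __u32 attr;            /* e.g. a hypothetical PREFERRED_LOC attribute */
        __u32 value;           /* e.g. an instance id naming gpu device1's vram */
};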


What you can do is to bind parts of your process address space to your driver connections (fd1, fd2 etc..) with a vm_bind(), but this should *not* happen because of implicitly using some other file descriptor in the process.

We already have a vm_bind api which is used for a split CPU and GPU virtual address space (meaning the GPU virtual address space can != the CPU virtual address space) for KMD. In this case it is UMD's responsibility to manage the whole virtual address space; UMD can make CPU VA == GPU VA or CPU VA != GPU VA, it doesn't matter to KMD. We already have this working. We also used this approach to achieve a shared virtual address space b/t CPU and GPU, where UMD managed to make CPU VA == GPU VA.

All the discussion in this email thread was triggered by our effort to support the system allocator, which means an application can use CPU pointers directly in a GPU shader program *without* extra driver IOCTL calls. The purpose of this programming model is to further simplify GPU programming across all programming languages. By the definition of the system allocator, GPU VA is always == CPU VA.

Our API/xeKmd is designed to work for both of the above two programming models.


As far as I can see this design is exactly what failed so badly with KFD.

Regards,
Christian.











Additionally to what Dave wrote I can summarize a few things I have learned while working on the AMD GPU drivers in the last decade or so:

1. Userspace requirements are *not* relevant for UAPI or even more general kernel driver design.

2. What should be done is to look at the hardware capabilities and try to expose those in a safe manner to userspace.

3. The userspace requirements are then used to validate the kernel driver and especially the UAPI design to ensure that nothing was missed.

The consequence of this is that nobody should ever use things like Cuda, Vulkan, OpenCL, OpenGL etc.. as an argument to propose a certain UAPI design.

What should be done instead is to say: my hardware works in this and that way -> we want to expose it like this -> because that enables us to implement the high level API in this and that way.

Only this gives a complete picture of how things interact together and allows the kernel community to influence and validate the design.

What you described above is mainly bottom up. I know other people do top down, or whole-system vertical HW-SW co-design. I don't have a strong opinion here.

Regards,
Oak

This doesn't mean that you need to throw away everything, but it gives a clear restriction that designs are not nailed in stone and, for example, you can't use something like a waterfall model.

Going to answer your other questions separately.

Regards,
Christian.



Am 25.01.24 um 06:25 schrieb Zeng, Oak:

Hi Dave,

Let me step back. When I wrote "shared virtual address space b/t cpu and all gpu devices is a hard requirement for our system allocator design", I meant this is not only Intel's design requirement. Rather, this is a common requirement for Intel, AMD and Nvidia. Take a look at the cuda driver API definition of cuMemAllocManaged (search for this API on https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM); it says:

"The pointer is valid on the CPU and on all GPUs in the system that support managed memory."

This means the program virtual address space is shared b/t the CPU and all GPU devices on the system. The system allocator we are discussing is just one step more advanced than cuMemAllocManaged: it allows malloc'ed memory to be shared b/t the CPU and all GPU devices.

I hope we all agree on this point.

With that, I agree with Christian that in the kmd we should make driver code per-device based instead of managing all devices in one driver instance. Our system allocator (and generally xekmd) design follows this rule: we make xe_vm per-device based - one device is *not* aware of another device's address space, as I explained in a previous email. I started this email thread seeking a single drm_gpuvm instance to cover all GPU devices. I gave up this approach (at least for now) per Danilo and Christian's feedback: we will continue to have per-device drm_gpuvm. I hope this is aligned with Christian, but I will have to wait for Christian's reply to my previous email.

I hope this clarifies things a little.

Regards,
Oak



-----Original Message-----
From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of David Airlie
Sent: Wednesday, January 24, 2024 8:25 PM
To: Zeng, Oak <oak.zeng@intel.com>
Cc: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Winiarski, Michal <michal.winiarski@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; Shah, Ankur N <ankur.n.shah@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Gupta, saurabhg <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Brost, Matthew <matthew.brost@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Christian König <christian.koenig@amd.com>
Subject: Re: Making drm_gpuvm work across gpu devices





For us, Xekmd doesn't need to know whether it is running under bare metal or a virtualized environment. Xekmd is always a guest driver. All the virtual addresses used in xekmd are guest virtual addresses. For SVM, we require all the VF devices to share one single shared address space with the guest CPU program. So all the design that works in a bare metal environment can automatically work under a virtualized environment. +@Shah, Ankur N +@Winiarski, Michal to back me up if I am wrong.

Again, a shared virtual address space b/t the cpu and all gpu devices is a hard requirement for our system allocator design (which means malloc'ed memory, cpu stack variables and globals can be directly used in a gpu program; the same requirement as the kfd SVM design). This was aligned with our user space software stack.



Just to make a very general point here (I'm hoping you listen to Christian a bit more and hoping he replies in more detail), but just because you have a system allocator design done, it doesn't in any way enforce the requirements on the kernel driver to accept that design. Bad system design should be pushed back on, not enforced in implementation stages. It's a trap Intel falls into regularly since they say well we already agreed this design with the userspace team and we can't change it now. This isn't acceptable. Design includes upstream discussion and feedback: if you misdesigned the system allocator (and I'm not saying you definitely have), and this is pushing back on that, then you have to go fix your system architecture.

KFD was an experiment like this, I pushed back on AMD at the start saying it was likely a bad plan, we let it go and got a lot of experience in why it was a bad design.

Dave.






* Re: Making drm_gpuvm work across gpu devices
  2024-01-29 19:03                           ` Christian König
@ 2024-01-29 20:24                             ` Felix Kuehling
  0 siblings, 0 replies; 123+ messages in thread
From: Felix Kuehling @ 2024-01-29 20:24 UTC (permalink / raw)
  To: Christian König, Daniel Vetter
  Cc: Brost, Matthew, Thomas.Hellstrom, Welty, Brian, Ghimiray,
	Himal Prasad, dri-devel, Gupta, saurabhg, Danilo Krummrich, Zeng,
	Oak, Bommu, Krishnaiah, Dave Airlie, Vishwanathapura, Niranjana,
	intel-xe


On 2024-01-29 14:03, Christian König wrote:
> Am 29.01.24 um 18:52 schrieb Felix Kuehling:
>> On 2024-01-29 11:28, Christian König wrote:
>>> Am 29.01.24 um 17:24 schrieb Felix Kuehling:
>>>> On 2024-01-29 10:33, Christian König wrote:
>>>>> Am 29.01.24 um 16:03 schrieb Felix Kuehling:
>>>>>> On 2024-01-25 13:32, Daniel Vetter wrote:
>>>>>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>>>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>>>>>> [SNIP]
>>>>>>>>> Yes most API are per device based.
>>>>>>>>>
>>>>>>>>> One exception I know is actually the kfd SVM API. If you look 
>>>>>>>>> at the svm_ioctl function, it is per-process based. Each 
>>>>>>>>> kfd_process represents a process across N gpu devices.
>>>>>>>> Yeah and that was a big mistake in my opinion. We should really 
>>>>>>>> not do that
>>>>>>>> ever again.
>>>>>>>>
>>>>>>>>> Need to say, kfd SVM represents a shared virtual address space 
>>>>>>>>> across CPU and all GPU devices on the system. This is by the 
>>>>>>>>> definition of SVM (shared virtual memory). This is very 
>>>>>>>>> different from our legacy gpu *device* driver which works for 
>>>>>>>>> only one device (i.e., if you want one device to access 
>>>>>>>>> another device's memory, you will have to use dma-buf 
>>>>>>>>> export/import etc).
>>>>>>>> Exactly that thinking is what we have currently found as a
>>>>>>>> blocker for virtualization projects. Having SVM as a device
>>>>>>>> independent feature which somehow ties to the process address
>>>>>>>> space turned out to be an extremely bad idea.
>>>>>>>>
>>>>>>>> The background is that this only works for some use cases but 
>>>>>>>> not all of
>>>>>>>> them.
>>>>>>>>
>>>>>>>> What's working much better is to just have a mirror 
>>>>>>>> functionality which says
>>>>>>>> that a range A..B of the process address space is mapped into a 
>>>>>>>> range C..D
>>>>>>>> of the GPU address space.
>>>>>>>>
>>>>>>>> Those ranges can then be used to implement the SVM feature 
>>>>>>>> required for
>>>>>>>> higher level APIs and not something you need at the UAPI or 
>>>>>>>> even inside the
>>>>>>>> low level kernel memory management.
>>>>>>>>
>>>>>>>> When you talk about migrating memory to a device you also do 
>>>>>>>> this on a per
>>>>>>>> device basis and *not* tied to the process address space. If 
>>>>>>>> you then get
>>>>>>>> crappy performance because userspace gave contradicting 
>>>>>>>> information where to
>>>>>>>> migrate memory then that's a bug in userspace and not something 
>>>>>>>> the kernel
>>>>>>>> should try to prevent somehow.
>>>>>>>>
>>>>>>>> [SNIP]
>>>>>>>>>> I think if you start using the same drm_gpuvm for multiple 
>>>>>>>>>> devices you
>>>>>>>>>> will sooner or later start to run into the same mess we have 
>>>>>>>>>> seen with
>>>>>>>>>> KFD, where we moved more and more functionality from the KFD 
>>>>>>>>>> to the DRM
>>>>>>>>>> render node because we found that a lot of the stuff simply 
>>>>>>>>>> doesn't work
>>>>>>>>>> correctly with a single object to maintain the state.
>>>>>>>>> As I understand it, KFD is designed to work across devices. A
>>>>>>>>> single pseudo /dev/kfd device represents all hardware gpu
>>>>>>>>> devices. That is why during kfd open, many pdds (process device
>>>>>>>>> data) are created, each for one hardware device for this process.
>>>>>>>> Yes, I'm perfectly aware of that. And I can only repeat myself 
>>>>>>>> that I see
>>>>>>>> this design as a rather extreme failure. And I think it's one 
>>>>>>>> of the reasons
>>>>>>>> why NVidia is so dominant with Cuda.
>>>>>>>>
>>>>>>>> This whole approach KFD takes was designed with the idea of 
>>>>>>>> extending the
>>>>>>>> CPU process into the GPUs, but this idea only works for a few 
>>>>>>>> use cases and
>>>>>>>> is not something we should apply to drivers in general.
>>>>>>>>
>>>>>>>> A very good example are virtualization use cases where you end 
>>>>>>>> up with CPU
>>>>>>>> address != GPU address because the VAs are actually coming from 
>>>>>>>> the guest VM
>>>>>>>> and not the host process.
>>>>>>>>
>>>>>>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This 
>>>>>>>> should not have
>>>>>>>> any influence on the design of the kernel UAPI.
>>>>>>>>
>>>>>>>> If you want to do something similar as KFD for Xe I think you 
>>>>>>>> need to get
>>>>>>>> explicit permission to do this from Dave and Daniel and maybe 
>>>>>>>> even Linus.
>>>>>>> I think the one and only one exception where an SVM uapi like in 
>>>>>>> kfd makes
>>>>>>> sense, is if the _hardware_ itself, not the software stack defined
>>>>>>> semantics that you've happened to build on top of that hw, 
>>>>>>> enforces a 1:1
>>>>>>> mapping with the cpu process address space.
>>>>>>>
>>>>>>> Which means your hardware is using PASID, IOMMU based 
>>>>>>> translation, PCI-ATS
>>>>>>> (address translation services) or whatever your hw calls it and 
>>>>>>> has _no_
>>>>>>> device-side pagetables on top. Which from what I've seen all 
>>>>>>> devices with
>>>>>>> device-memory have, simply because they need some place to store 
>>>>>>> whether
>>>>>>> that memory is currently in device memory or should be 
>>>>>>> translated using
>>>>>>> PASID. Currently there's no gpu that works with PASID only, but 
>>>>>>> there are
>>>>>>> some on-cpu-die accelerator things that do work like that.
>>>>>>>
>>>>>>> Maybe in the future there will be some accelerators that are 
>>>>>>> fully cpu
>>>>>>> cache coherent (including atomics) with something like CXL, and the
>>>>>>> on-device memory is managed as normal system memory with struct 
>>>>>>> page as
>>>>>>> ZONE_DEVICE and accelerator va -> physical address translation 
>>>>>>> is only
>>>>>>> done with PASID ... but for now I haven't seen that, definitely 
>>>>>>> not in
>>>>>>> upstream drivers.
>>>>>>>
>>>>>>> And the moment you have some per-device pagetables or per-device 
>>>>>>> memory
>>>>>>> management of some sort (like using gpuva mgr) then I'm 100% 
>>>>>>> agreeing with
>>>>>>> Christian that the kfd SVM model is too strict and not a great 
>>>>>>> idea.
>>>>>>
>>>>>> That basically means, without ATS/PRI+PASID you cannot implement 
>>>>>> a unified memory programming model, where GPUs or accelerators 
>>>>>> access virtual addresses without pre-registering them with an SVM 
>>>>>> API call.
>>>>>>
>>>>>> Unified memory is a feature implemented by the KFD SVM API and 
>>>>>> used by ROCm. This is used e.g. to implement OpenMP USM (unified 
>>>>>> shared memory). It's implemented with recoverable GPU page 
>>>>>> faults. If the page fault interrupt handler cannot assume a 
>>>>>> shared virtual address space, then implementing this feature 
>>>>>> isn't possible.
>>>>>
>>>>> Why not? As far as I can see the OpenMP USM is just another funky 
>>>>> way of userptr handling.
>>>>>
>>>>> The difference is that in an userptr we assume that we always need 
>>>>> to request the whole block A..B from a mapping while for page 
>>>>> fault based handling it can be just any page in between A and B 
>>>>> which is requested and made available to the GPU address space.
>>>>>
>>>>> As far as I can see there is absolutely no need for any special 
>>>>> SVM handling.
>>>>
>>>> It does assume a shared virtual address space between CPU and GPUs. 
>>>> There are no API calls to tell the driver that address A on the CPU 
>>>> maps to address B on the GPU1 and address C on GPU2. The KFD SVM 
>>>> API was designed to work with this programming model, by augmenting 
>>>> the shared virtual address mappings with virtual address range 
>>>> attributes that can modify the migration policy and indicate 
>>>> prefetching, prefaulting, etc. You could think of it as madvise on 
>>>> steroids.
>>>
>>> Yeah, so what? In this case you just say through an IOCTL that CPU 
>>> range A..B should map to GPU range C..D and for A/B and C/D you use 
>>> the maximum of the address space.
>>
>> What I want is that address range A..B on the CPU matches A..B on the 
>> GPU, because I'm sharing pointers between CPU and GPU. I can't think 
>> of any sane user mode using a unified memory programming model that 
>> would ever ask KFD to map unified memory mappings to a different 
>> address range on the GPU. Adding such an ioctl is a complete waste of 
>> time, and can only serve to add unnecessary complexity.
>
> This is exactly the use case which happens when the submitting process 
> is not the one originally stitching together the command stream.
>
> Basically all native context, virtualization and other proxy use cases 
> work like this.

You cannot use unified memory in this type of environment. A GPU page 
fault would occur in a GPU virtual address space in the host-side proxy 
process. That page fault would need to be resolved to map some memory 
from a process running in a guest? Which guest? Which process? That's 
anyone's guess. There is no way to annotate that, because the pointer in 
the fault is coming from a shader program that's running in the guest 
context and is completely unaware of the virtualization. There are no API 
calls from the guest before the fault occurs to establish a meaningful 
mapping.

The way this virtualization of the GPU is implemented, with a proxy 
multiplexing multiple clients, is just fundamentally incompatible with a 
unified memory programming model that has to assume a shared virtual 
address space to make sense. You'd need separate proxy processes on the 
host side to handle that. You can't blame this on bad design decisions 
in the SVM API. As I see it, it's just a fundamental limitation of the 
virtualization approach, which cannot handle guest applications that want 
to use a unified shared memory programming model.

That's why I suggested to completely disable the SVM API in this 
virtualization scenario when you create a KFD context that's separate 
from the process address space. It is not essential for any 
non-unified-memory functionality. ROCm user mode has fallbacks to work 
without it, because we also need to support older kernels and GPUs that 
didn't support this programming model.

Regards,
   Felix


>
> So the CPU address not matching the GPU address is an absolutely real 
> use case, and the GPU VA interface should be able to handle it.
>
> Regards,
> Christian.
>
>
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> There is no restriction that this needs to be accurate in any way. It's 
>>> just that it can be accurate to be more efficient and eventually use 
>>> only a fraction of the address space instead of all of it for some 
>>> use cases.
>>>
>>> So this isn't a blocker, it's just one special use case.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>   Felix
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers, Sima
>>>>>
>>>
>


* RE: Making drm_gpuvm work across gpu devices
  2024-01-29 10:19                       ` Christian König
@ 2024-01-30  0:21                         ` Zeng, Oak
  2024-01-30  8:39                           ` Christian König
  2024-01-30  8:43                           ` Thomas Hellström
  0 siblings, 2 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-30  0:21 UTC (permalink / raw)
  To: Christian König, Thomas Hellström, Daniel Vetter, Dave Airlie
  Cc: Brost, Matthew, Felix Kuehling, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu,  Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich


The example you used to prove that KFD is a design failure argues against *any* design which utilizes the system allocator and hmm. Having one proxy process running on the host to handle many guest processes doesn't fit into the concept of a "shared address space b/t cpu and gpu": the shared address space has to be within one process, while your proxy process represents many guest processes. It is a fundamental conflict.

Your userptr proposal doesn't solve this problem either:
Imagine guest process 1 maps CPU address range A…B to GPU address range C…D.
Guest process 2 also maps CPU address range A…B to GPU address range C…D; since processes 1 and 2 are two different processes, it is legal for process 2 to make the exact same mapping.
Now when a gpu shader accesses an address in C…D, a gpu page fault happens. What does your proxy process do? Which guest process should this fault be directed to and handled by? Unless you have extra information/APIs to tell the proxy process and the GPU HW, there is no way to figure it out.

Compared to the shared virtual address space concept of HMM, the userptr design is nothing new, except that it allows the CPU and GPU to use different addresses to access the same object. If you replace C…D above with A…B, the description above becomes a description of the "problem" of the HMM/shared virtual address design.

Both designs have the same difficulty with your example of the special virtualization environment setup.

As said, we spent effort scoping the userptr solution some time ago. The problems we found enabling userptr with migration were:

  1.  The user interface of userptr is not as convenient as the system allocator. With the userptr solution, users need to call userptr_ioctl and vm_bind for *every* single cpu pointer they want to use in a gpu program, while with the system allocator, programmers just use any cpu pointer directly in the gpu program without any extra driver ioctls.
  2.  We don't see the real benefit of using a different GPU address range C…D than A…B, unless you can prove my above reasoning wrong. In most use cases you can make GPU C…D == CPU A…B, so why bother?
  3.  Looking into implementation details: since hmm fundamentally assumes a shared virtual address space b/t the cpu and the device, for the userptr solution to leverage hmm you need to perform an address space conversion every time you call into hmm functions, as the sketch below illustrates.
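To illustrate point 3, a hedged sketch of that conversion; the mirror
bookkeeping struct is hypothetical. Before calling anything like
hmm_range_fault(), a driver that lets GPU VA C…D mirror CPU VA A…B has
to translate back to CPU addresses, because hmm only understands the
process (CPU) address space:

struct userptr_mirror {                 /* hypothetical driver bookkeeping */
        unsigned long cpu_start;        /* A */
        unsigned long gpu_start;        /* C */
        unsigned long size;             /* B - A, which equals D - C */
};

static unsigned long gpu_to_cpu_va(struct userptr_mirror *m,
                                   unsigned long gpu_va)
{
        /* C..D -> A..B before calling e.g. hmm_range_fault() */
        return m->cpu_start + (gpu_va - m->gpu_start);
}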

In summary, a GPU device is just a piece of HW to accelerate your CPU program. If the HW allows, it is more convenient to use a shared address space b/t the cpu and GPU. On old HW (for example, with no gpu page fault support, or a gpu with only a very limited address space), we can disable the system allocator/SVM. If you would use a different address space on a modern GPU, why don't you use different address spaces on different CPU cores?

Regards,
Oak
From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Monday, January 29, 2024 5:20 AM
To: Zeng, Oak <oak.zeng@intel.com>; Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Dave Airlie <airlied@redhat.com>
Cc: Brost, Matthew <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Well Daniel and Dave noted it as well, so I'm just repeating it: Your design choices are not an argument to get something upstream.

It's the job of the maintainers, and in the end Linus, to judge whether something is acceptable or not.

As far as I can see a good part of this idea has been exercised at length with KFD, and it turned out to not be the best approach.

So from what I've seen the design you outlined is extremely unlikely to go upstream.

Regards,
Christian.
Am 27.01.24 um 03:21 schrieb Zeng, Oak:
Regarding the idea of expanding userptr to support migration, we explored this idea a long time ago. It provides functionality similar to the system allocator, but its interface is not as convenient. Besides the shared virtual address space, another benefit of a system allocator is that you can offload a cpu program to the gpu more easily; you don't need to call a driver specific API (such as register_userptr and vm_bind in this case) for memory allocation.

We also scoped the implementation. It turned out to be big, and not as beautiful as hmm. That's why we gave up this approach.

From: Christian König <christian.koenig@amd.com>
Sent: Friday, January 26, 2024 7:52 AM
To: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>
Cc: Brost, Matthew <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Zeng, Oak <oak.zeng@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com>; dri-devel@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Dave Airlie <airlied@redhat.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org
Subject: Re: Making drm_gpuvm work across gpu devices

Am 26.01.24 um 09:21 schrieb Thomas Hellström:
> Hi, all
>
> On Thu, 2024-01-25 at 19:32 +0100, Daniel Vetter wrote:
>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>> [SNIP]
>>>> Yes, most APIs are per-device based.
>>>>
>>>> One exception I know is actually the kfd SVM API. If you look at
>>>> the svm_ioctl function, it is per-process based. Each kfd_process
>>>> represents a process across N gpu devices.
>>>
>>> Yeah, and that was a big mistake in my opinion. We should really
>>> not do that ever again.
>>>
>>>> Need to say, kfd SVM represents a shared virtual address space
>>>> across the CPU and all GPU devices on the system. This is by the
>>>> definition of SVM (shared virtual memory). This is very different
>>>> from our legacy gpu *device* driver which works for only one
>>>> device (i.e., if you want one device to access another device's
>>>> memory, you will have to use dma-buf export/import etc).
>>>
>>> Exactly that thinking is what we have currently found as a blocker
>>> for a virtualization project. Having SVM as a device-independent
>>> feature which somehow ties to the process address space turned out
>>> to be an extremely bad idea.
>>>
>>> The background is that this only works for some use cases but not
>>> all of them.
>>>
>>> What's working much better is to just have a mirror functionality
>>> which says that a range A..B of the process address space is
>>> mapped into a range C..D of the GPU address space.
>>>
>>> Those ranges can then be used to implement the SVM feature
>>> required for higher level APIs and not something you need at the
>>> UAPI or even inside the low level kernel memory management.
>>>
>>> When you talk about migrating memory to a device you also do this
>>> on a per-device basis and *not* tied to the process address space.
>>> If you then get crappy performance because userspace gave
>>> contradicting information where to migrate memory then that's a
>>> bug in userspace and not something the kernel should try to
>>> prevent somehow.
>>>
>>> [SNIP]
>>>>> I think if you start using the same drm_gpuvm for multiple
>>>>> devices you will sooner or later start to run into the same mess
>>>>> we have seen with KFD, where we moved more and more
>>>>> functionality from the KFD to the DRM render node because we
>>>>> found that a lot of the stuff simply doesn't work correctly with
>>>>> a single object to maintain the state.
>>>> As I understand it, KFD is designed to work across devices. A
>>>> single pseudo /dev/kfd device represents all hardware gpu
>>>> devices. That is why during kfd open, many pdds (process device
>>>> data) are created, each for one hardware device for this process.
>>>
>>> Yes, I'm perfectly aware of that. And I can only repeat myself
>>> that I see this design as a rather extreme failure. And I think
>>> it's one of the reasons why NVidia is so dominant with Cuda.
>>>
>>> This whole approach KFD takes was designed with the idea of
>>> extending the CPU process into the GPUs, but this idea only works
>>> for a few use cases and is not something we should apply to
>>> drivers in general.
>>>
>>> A very good example are virtualization use cases where you end up
>>> with CPU address != GPU address because the VAs are actually
>>> coming from the guest VM and not the host process.
>>>
>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc. This should
>>> not have any influence on the design of the kernel UAPI.
>>>
>>> If you want to do something similar as KFD for Xe I think you need
>>> to get explicit permission to do this from Dave and Daniel and
>>> maybe even Linus.
>>
>> I think the one and only exception where an SVM uapi like in kfd
>> makes sense is if the _hardware_ itself, not the software stack
>> defined semantics that you've happened to build on top of that hw,
>> enforces a 1:1 mapping with the cpu process address space.
>>
>> Which means your hardware is using PASID, IOMMU based translation,
>> PCI-ATS (address translation services) or whatever your hw calls it
>> and has _no_ device-side pagetables on top. Which from what I've
>> seen all devices with device-memory have, simply because they need
>> some place to store whether that memory is currently in device
>> memory or should be translated using PASID. Currently there's no
>> gpu that works with PASID only, but there are some on-cpu-die
>> accelerator things that do work like that.
>>
>> Maybe in the future there will be some accelerators that are fully
>> cpu cache coherent (including atomics) with something like CXL, and
>> the on-device memory is managed as normal system memory with struct
>> page as ZONE_DEVICE and accelerator va -> physical address
>> translation is only done with PASID ... but for now I haven't seen
>> that, definitely not in upstream drivers.
>>
>> And the moment you have some per-device pagetables or per-device
>> memory management of some sort (like using gpuva mgr) then I'm 100%
>> agreeing with Christian that the kfd SVM model is too strict and
>> not a great idea.
>>
>> Cheers, Sima
>
> I'm trying to digest all the comments here. The end goal is to be
> able to support something similar to this here:
>
> https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
>
> Christian, if I understand you correctly, you're strongly suggesting
> not to try to manage a common virtual address space across different
> devices in the kernel, but merely providing building blocks to do
> so, like for example a generalized userptr with migration support
> using HMM; that way each "mirror" of the CPU mm would be per device
> and inserted into the gpu_vm just like any other gpu_vma, and
> user-space would dictate the A..B -> C..D mapping by choosing the
> GPU_VA for the vma.

Exactly that, yes.

> Sima, it sounds like you're suggesting to shy away from hmm and not
> even attempt to support this except if it can be done using IOMMU
> sva on selected hardware?

I think that comment goes more into the direction of: if you have ATS/ATC/PRI capable hardware which exposes the functionality to make memory reads and writes directly into the address space of the CPU, then yes an SVM-only interface is ok because the hardware can't do anything else. But as long as you have something like GPUVM then please don't restrict yourself.

Which I totally agree on as well. The ATS/ATC/PRI combination doesn't allow using separate page tables for the device and the CPU, and so also does not allow separate VAs.

This was one of the reasons why we stopped using this approach for AMD GPUs.

Regards,
Christian.

> Could you clarify a bit?
>
> Thanks,
> Thomas

[-- Attachment #2: Type: text/html, Size: 24395 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-30  0:21                         ` Zeng, Oak
@ 2024-01-30  8:39                           ` Christian König
  2024-01-30 22:29                             ` Zeng, Oak
  2024-01-30  8:43                           ` Thomas Hellström
  1 sibling, 1 reply; 123+ messages in thread
From: Christian König @ 2024-01-30  8:39 UTC (permalink / raw)
  To: Zeng, Oak, Thomas Hellström, Daniel Vetter, Dave Airlie
  Cc: Brost, Matthew, Felix Kuehling, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

[-- Attachment #1: Type: text/plain, Size: 19132 bytes --]

Am 30.01.24 um 01:21 schrieb Zeng, Oak:
>
> The example you used to prove that KFD is a design failure is against
> *any* design which utilizes a system allocator and hmm. The way that
> one proxy process runs on the host to handle many guest processes
> doesn’t fit into the concept of “shared address space b/t cpu and
> gpu”. The shared address space has to be within one process. Your
> proxy process represents many guest processes. It is a fundamental
> conflict.
>
> Also your userptr proposal doesn’t solve this problem either:
>
> Imagine you have guest process 1 mapping CPU address range A…B to GPU
> address range C…D.
>
> And you have guest process 2 mapping CPU address range A…B to GPU
> address range C…D; since process 1 and 2 are two different processes,
> it is legal for process 2 to do the exact same mapping.
>
> Now when a gpu shader accesses address C…D, a gpu page fault happens.
> What does your proxy process do? Which guest process will this fault
> be directed to and handled by? Unless you have extra information/API
> to tell the proxy process and GPU HW, there is no way to figure out.
>

Well yes, as far as I can see the fundamental design issue in the KFD is 
that it ties together CPU and GPU address space. That came from the 
implementation using the ATS/PRI feature to access the CPU address space 
from the GPU.

If you don't do ATS/PRI then you don't have that restriction and you can 
do as many GPU address spaces per CPU process as you want. This is just 
how the hw works.

So in your example above when you have multiple mappings for the range 
A..B you also have multiple GPU address spaces and so can distinguish 
where the page fault is coming from just by looking at its source. All 
you then need is userfaultfd() to forward the fault to the client and 
you are pretty much done.
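
For illustration, a minimal userspace sketch of that userfaultfd() flow 
(forward_to_client() is a hypothetical stand-in for whatever IPC the 
proxy uses; resolving the fault with UFFDIO_COPY etc. is omitted):

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  extern void forward_to_client(unsigned long long addr); /* hypothetical IPC */

  /* Register one client's mirrored range, then wait for and forward
   * the next missing-page fault reported on the uffd file descriptor. */
  static int proxy_watch(void *addr, size_t len)
  {
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
          struct uffdio_api api = { .api = UFFD_API };
          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode  = UFFDIO_REGISTER_MODE_MISSING,
          };
          struct uffd_msg msg;

          if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
              ioctl(uffd, UFFDIO_REGISTER, &reg))
                  return -1;

          if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
              msg.event == UFFD_EVENT_PAGEFAULT)
                  forward_to_client(msg.arg.pagefault.address);

          return uffd;
  }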

> Compared to the shared virtual address space concept of HMM, the
> userptr design is nothing new except that it allows CPU and GPU to
> use different addresses to access the same object. If you replace the
> above C…D with A…B, the above description becomes a description of
> the “problem” of the HMM/shared virtual address design.
>
> Both designs have the same difficulty with your example of the
> special virtualization environment setup.
>
> As said, we spent effort scoping the userptr solution some time ago.
> The problems we found enabling userptr with migration were:
>
>  1. The user interface of userptr is not as convenient as a system
>     allocator. With the userptr solution, the user needs to call
>     userptr_ioctl and vm_bind for *every* single cpu pointer that
>     they want to use in a gpu program. While with a system allocator,
>     the programmer just uses any cpu pointer directly in the gpu
>     program without any extra driver ioctls.
>

And I think exactly that is questionable. Why not at least call it for 
the whole address space once during initialization?

>     We don’t see the real benefit of using a different GPU address
>     C…D than the A..B, unless you can prove my above reasoning is
>     wrong. In most use cases, you can make GPU C…D == CPU A…B, so
>     why bother then?

Because there are cases where this isn't true. We just recently ran into 
exactly that use case with a customer. It might be that you will never 
need this, but again the approach should generally be that the kernel 
exposes the hardware features and as far as I can see the hardware can 
do this.

And apart from those use cases there is also another good reason for 
this: CPUs are going towards 5 levels of page tables and GPUs are 
lagging behind. It's not unrealistic to run into cases where you can 
only mirror parts of the CPU address space into the GPU address space 
because of hardware restrictions. And in this case you absolutely do 
want the flexibility to have different address ranges.
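
To make the range idea concrete: the mirror is nothing more than an 
offset. A trivial sketch with hypothetical names, not code from this 
series:

  #include <linux/types.h>

  /* Hypothetical mirror: CPU range [A, B) mapped at GPU range [C, D)
   * with B - A == D - C. Translating a faulting GPU VA back to the
   * CPU VA is then a single offset computation. */
  struct va_mirror {
          u64 cpu_start;  /* A */
          u64 gpu_start;  /* C */
          u64 length;     /* B - A == D - C */
  };

  static inline u64 gpu_to_cpu(const struct va_mirror *m, u64 gpu_va)
  {
          return m->cpu_start + (gpu_va - m->gpu_start);
  }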


>     Looking into implementation details: since hmm fundamentally
>     assumes a shared virtual address space b/t cpu and device, for
>     the userptr solution to leverage hmm, you need to perform an
>     address space conversion every time you call into hmm functions.

Correct, but that is trivial. I mean we do nothing else with VMAs 
mapping into the address space of files on the CPU either.

Which is by the way a good analogy. The CPU address space consists of 
anonymous memory and file mappings, where the later covers both real 
files on a file system as well as devices.

The struct address_space in the Linux kernel for example describes a 
file address space and not the CPU address space because the later is 
just a technical tool to form an execution environment which can access 
the former.

With GPUs it's pretty much the same. You have mappings which can be 
backed by CPU address space using functionalities like HMM as well as 
buffer objects created directly through device drivers.
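
And the conversion really is that trivial. Building on the hypothetical 
va_mirror above, a call into the real hmm_range_fault() API would look 
roughly like this (the mmap read lock and mmu_interval_notifier retry 
loop that production code needs are omitted):

  #include <linux/hmm.h>
  #include <linux/mm.h>
  #include <linux/mmu_notifier.h>

  /* Sketch only: fault in one window of the mirror by translating
   * the GPU VA window to the CPU VA window up front. */
  static int mirror_populate(const struct va_mirror *m,
                             struct mmu_interval_notifier *notifier,
                             u64 gpu_va, unsigned long npages,
                             unsigned long *pfns)
  {
          struct hmm_range range = {
                  .notifier      = notifier,
                  .start         = gpu_to_cpu(m, gpu_va),
                  .end           = gpu_to_cpu(m, gpu_va) +
                                   (npages << PAGE_SHIFT),
                  .hmm_pfns      = pfns,
                  .default_flags = HMM_PFN_REQ_FAULT,
          };

          return hmm_range_fault(&range);
  }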

> In summary, a GPU device is just a piece of HW to accelerate your CPU 
> program.
>

Well, exactly that's not how I see it. CPU accelerators are extensions 
like SSE, AVX, FPUs etc... GPUs are accelerators attached as I/O 
devices.

And that GPUs are separate from the CPU is a benefit which gives them 
an advantage over CPU based acceleration approaches.

This obviously makes GPUs harder to program and SVM is a method to 
counter this, but that doesn't make SVM a good design pattern for kernel 
or device driver interfaces.

> If HW allows, it is more convenient to use a shared address space b/t 
> cpu and GPU. On old HW (for example, no gpu page fault support, or 
> the gpu only has a very limited address space), we can disable the 
> system allocator/SVM. If you use a different address space on a 
> modern GPU, why don't you use different address spaces on different 
> CPU cores?
>

Quite simple: modern CPUs are homogeneous. From the application point 
of view they still look more or less the same as they did 40 years ago.

GPUs on the other hand look quite a bit different. SVM is now a tool to 
reduce this difference but it doesn't make the differences in the 
execution environment go away.

And I can only repeat myself that this is actually a good thing, 
because otherwise GPUs would lose some of their advantage over CPUs.

Regards,
Christian.

> Regards,
>
> Oak
>
> [SNIP]

[-- Attachment #2: Type: text/html, Size: 34413 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-30  0:21                         ` Zeng, Oak
  2024-01-30  8:39                           ` Christian König
@ 2024-01-30  8:43                           ` Thomas Hellström
  1 sibling, 0 replies; 123+ messages in thread
From: Thomas Hellström @ 2024-01-30  8:43 UTC (permalink / raw)
  To: Zeng, Oak, Christian König, Daniel Vetter, Dave Airlie
  Cc: Brost, Matthew, Felix Kuehling, Welty, Brian, dri-devel,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

[-- Attachment #1: Type: text/plain, Size: 16985 bytes --]

Hi, Oak,

On 1/30/24 01:21, Zeng, Oak wrote:
>
> The example you used to prove that KFD is a design failure is against
> *any* design which utilizes a system allocator and hmm. The way that
> one proxy process runs on the host to handle many guest processes
> doesn’t fit into the concept of “shared address space b/t cpu and
> gpu”. The shared address space has to be within one process. Your
> proxy process represents many guest processes. It is a fundamental
> conflict.
>
> Also your userptr proposal doesn’t solve this problem either:
>
> Imagine you have guest process 1 mapping CPU address range A…B to GPU
> address range C…D.
>
> And you have guest process 2 mapping CPU address range A…B to GPU
> address range C…D; since process 1 and 2 are two different processes,
> it is legal for process 2 to do the exact same mapping.
>
> Now when a gpu shader accesses address C…D, a gpu page fault happens.
> What does your proxy process do? Which guest process will this fault
> be directed to and handled by? Unless you have extra information/API
> to tell the proxy process and GPU HW, there is no way to figure out.
>
> Compared to the shared virtual address space concept of HMM, the
> userptr design is nothing new except that it allows CPU and GPU to
> use different addresses to access the same object. If you replace the
> above C…D with A…B, the above description becomes a description of
> the “problem” of the HMM/shared virtual address design.
>
> Both designs have the same difficulty with your example of the
> special virtualization environment setup.
>
> As said, we spent effort scoping the userptr solution some time ago.
> The problems we found enabling userptr with migration were:
>
>  1. The user interface of userptr is not as convenient as a system
>     allocator. With the userptr solution, the user needs to call
>     userptr_ioctl and vm_bind for *every* single cpu pointer that
>     they want to use in a gpu program. While with a system allocator,
>     the programmer just uses any cpu pointer directly in the gpu
>     program without any extra driver ioctls.
>
No, the augmented userptr (let's call it "hmmptr" to distinguish here) 
would typically only be bound once when the VM is created. It's just a 
different way to expose the whole SVM mapping to user-space. It's 
sparsely populated and is not backed by a bo, and it is per-device, so 
UMD would have to replicate the SVM setup and attribute settings on 
each device.
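
Something like the following, purely hypothetical, uapi illustrates 
what I mean; none of these names exist in Xe today:

  /* Hypothetical uapi sketch only. A single vm_bind-style call at
   * VM-create time exposes a sparse, bo-less "hmmptr" covering the
   * CPU VA range, populated only on GPU fault. */
  struct drm_xe_hmmptr_bind {                     /* hypothetical */
          __u64 vm_id;
          __u64 cpu_start;   /* A: start of the mirrored CPU range */
          __u64 gpu_start;   /* C: GPU VA it appears at (A == C allowed) */
          __u64 length;      /* B - A; may cover the whole VA space */
  #define XE_HMMPTR_SPARSE   (1 << 0)             /* populate on fault only */
          __u32 flags;
          __u32 pad;
  };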

>  2. We don’t see the real benefit of using a different GPU address C…D
>     than the A..B, unless you can prove my above reasoning is wrong.
>     In most use cases, you can make GPU C…D == CPU A…B, so why bother
>     then?
>  3. Looking into implementation details: since hmm fundamentally
>     assumes a shared virtual address space b/t cpu and device, for the
>     userptr solution to leverage hmm, you need to perform an address
>     space conversion every time you call into hmm functions.
>
I think much of the focus in the discussion lands on the A..B -> C..D 
mapping. It's just an added flexibility with little or no 
implementation cost. Although I must admit I'm not fully clear about 
the actual use-case. In a para-virtualized environment like virGL or 
vmware's vmx/renderers I could imagine C..D being the guest virtual 
addresses, including compute kernel pointers, and A..B being the host 
renderer's CPU virtual addresses. (The host creates the VMs, and then 
this translation is needed. I'm not sure para-virtualized SVM exists 
ATM, but forcing A==C, B==D in the uAPI would rule out such a beast in 
the future?)

/Thomas


>
> In summary, a GPU device is just a piece of HW to accelerate your CPU 
> program. If HW allows, it is more convenient to use a shared address 
> space b/t cpu and GPU. On old HW (for example, no gpu page fault 
> support, or the gpu only has a very limited address space), we can 
> disable the system allocator/SVM. If you use a different address 
> space on a modern GPU, why don't you use different address spaces on 
> different CPU cores?
>
> Regards,
>
> Oak
>
> [SNIP]

[-- Attachment #2: Type: text/html, Size: 32968 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-30  8:39                           ` Christian König
@ 2024-01-30 22:29                             ` Zeng, Oak
  2024-01-30 23:12                               ` David Airlie
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-01-30 22:29 UTC (permalink / raw)
  To: Christian König, Thomas Hellström, Daniel Vetter, Dave Airlie
  Cc: Brost, Matthew, rcampbell, apopple, Felix Kuehling, Welty, Brian,
	Shah, Ankur N, dri-devel, jglisse, Ghimiray, Himal Prasad, Gupta,
	saurabhg, Bommu,  Krishnaiah, Vishwanathapura, Niranjana,
	intel-xe, Danilo Krummrich

[-- Attachment #1: Type: text/plain, Size: 17584 bytes --]

Hi Christian,

Nvidia's Nouveau driver uses exactly the same concept of SVM with HMM: the GPU address in a process is exactly the same as the CPU virtual address. It is already in the upstream Linux kernel. We at Intel just follow the same direction for our customers. Why are we not allowed to?
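
For reference, the Nouveau uAPI I mean is essentially this (recalled 
from include/uapi/drm/nouveau_drm.h; the field details should be 
verified against the tree):

  /* The process opts its CPU VA space into SVM once, reserving only
   * an unmanaged window for the driver's own use. */
  struct drm_nouveau_svm_init {
          __u64 unmanaged_addr;   /* start of VA range not used for SVM */
          __u64 unmanaged_size;   /* size of that range */
  };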

From: Christian König <christian.koenig@amd.com>
Sent: Tuesday, January 30, 2024 3:40 AM
To: Zeng, Oak <oak.zeng@intel.com>; Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Dave Airlie <airlied@redhat.com>
Cc: Brost, Matthew <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 30.01.24 um 01:21 schrieb Zeng, Oak:

The example you used to prove that KFD is a design failure, is against *any* design which utilize system allocator and hmm. The way that one proxy process running on host to handle many guest processes, doesn’t fit into the concept of “share address space b/t cpu and gpu”. The shared address space has to be within one process. Your proxy process represent many guest processes. It is a fundamental conflict.

Also your userptr proposal does’t solve this problem either:
Imagine you have guest process1 mapping CPU address range A…B to GPU address range C…D
And you have guest process 2 mapping CPU address range A…B to GPU address range C…D, since process 1 and 2 are two different process, it is legal for process 2 to do the exact same mapping.
Now when gpu shader access address C…D, a gpu page fault happens, what does your proxy process do? Which guest process will this fault be directed to and handled? Except you have extra information/API to tell proxy process and GPU HW, there is no way to figure out.

Well yes, as far as I can see the fundamental design issue in the KFD is that it ties together CPU and GPU address space. That came from the implementation using the ATS/PRI feature to access the CPU address space from the GPU.

If you don't do ATS/PRI then you don't have that restriction and you can do as many GPU address spaces per CPU process as you want. This is just how the hw works.

So in your example above when you have multiple mappings for the range A..B you also have multiple GPU address spaces and so can distinct where the page fault is coming from just by looking at the source of it. All you then need is userfaultfd() to forward the fault to the client and you are pretty much done.



Compared to the shared virtual address space concept of HMM, the userptr design is nothing new except it allows CPU and GPU to use different address to access the same object. If you replace above C…D with A…B, above description becomes a description of the “problem” of HMM/shared virtual address design.

Both design has the same difficulty with your example of the special virtualization environment setup.

As said, we spent effort scoped the userptr solution some time ago. The problem we found enabling userptr with migration were:

  1.  The user interface of userptr is not as convenient as system allocator. With the userptr solution, user need to call userptr_ioctl and vm_bind for *every* single cpu pointer that he want to use in a gpu program. While with system allocator, programmer just use any cpu pointer directly in gpu program without any extra driver ioctls.

And I think exactly that is questionable. Why not at least call it for the whole address space once during initialization?

>     We don’t see the real benefit of using a different Gpu address C…D than the A..B, except you can prove my above reasoning is wrong. In most use cases, you can make GPU C…D == CPU A…B, why bother then?

Because there are cases where this isn't true. We just recently ran into exactly that use case with a customer. It might be that you will never need this, but again the approach should generally be that the kernel exposes the hardware features and as far as I can see the hardware can do this.

And apart from those use cases there is also another good reason for this: CPU are going towards 5 level of page tables and GPUs are lacking behind. It's not unrealistic to run into cases where you can only mirror parts of the CPU address space into the GPU address space because of hardware restrictions. And in this case you absolutely do want the flexibility to have different address ranges.


>     Looked into implementation details, since hmm fundamentally assume a shared virtual address space b/t cpu and device, for the userptr solution to leverage hmm, you need perform address space conversion every time you calls into hmm functions.

Correct, but that is trivial. I mean we do nothing else with VMAs mapping into the address space of files on the CPU either.

Which is by the way a good analogy. The CPU address space consists of anonymous memory and file mappings, where the later covers both real files on a file system as well as devices.

The struct address_space in the Linux kernel for example describes a file address space and not the CPU address space because the later is just a technical tool to form an execution environment which can access the former.

With GPUs it's pretty much the same. You have mappings which can be backed by CPU address space using functionalities like HMM as well as buffer objects created directly through device drivers.



In summary, GPU device is just a piece of HW to accelerate your CPU program.

Well exactly that's not how I see it. CPU accelerators are extensions like SSE, AVX, FPUs etc... GPU are accelerators attached as I/O devices.

And that GPUs are separate to the CPU is a benefit which gives them advantage over CPU based acceleration approaches.

This obviously makes GPUs harder to program and SVM is a method to counter this, but that doesn't make SVM a good design pattern for kernel or device driver interfaces.


If HW allows, it is more convenient to use shared address space b/t cpu and GPU. On old HW (example, no gpu page fault support, or gpu only has a very limited address space), we can disable system allocator/SVM. If you use different address space on modern GPU, why don’t you use different address space on different CPU cores?

Quite simple, modern CPU are homogeneous. From the application point of view they still look more or less the same they did 40 years ago.

GPUs on the other hand look quite a bit different. SVM is now a tool to reduce this difference but it doesn't make the differences in execution environment go away.

And I can only repeat myself that this is actually a good thing, cause otherwise GPUs would loose some of their advantage over CPUs.

Regards,
Christian.



Regards,
Oak
From: dri-devel <dri-devel-bounces@lists.freedesktop.org><mailto:dri-devel-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Monday, January 29, 2024 5:20 AM
To: Zeng, Oak <oak.zeng@intel.com><mailto:oak.zeng@intel.com>; Thomas Hellström <thomas.hellstrom@linux.intel.com><mailto:thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch><mailto:daniel@ffwll.ch>; Dave Airlie <airlied@redhat.com><mailto:airlied@redhat.com>
Cc: Brost, Matthew <matthew.brost@intel.com><mailto:matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com><mailto:felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com><mailto:brian.welty@intel.com>; dri-devel@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com><mailto:himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com><mailto:krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com><mailto:saurabhg.gupta@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com><mailto:niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org<mailto:intel-xe@lists.freedesktop.org>; Danilo Krummrich <dakr@redhat.com><mailto:dakr@redhat.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Well Daniel and Dave noted it as well, so I'm just repeating it: Your design choices are not an argument to get something upstream.

It's the job of the maintainers and at the end of the Linus to judge of something is acceptable or not.

As far as I can see a good part of this this idea has been exercised lengthy with KFD and it turned out to not be the best approach.

So from what I've seen the design you outlined is extremely unlikely to go upstream.

Regards,
Christian.
Am 27.01.24 um 03:21 schrieb Zeng, Oak:
Regarding the idea of expanding userptr to support migration, we explored this idea long time ago. It provides similar functions of the system allocator but its interface is not as convenient as system allocator. Besides the shared virtual address space, another benefit of a system allocator is, you can offload cpu program to gpu easier, you don’t need to call driver specific API (such as register_userptr and vm_bind in this case) for memory allocation.

We also scoped the implementation. It turned out to be big, and not as beautiful as hmm. Why we gave up this approach.

From: Christian König <christian.koenig@amd.com><mailto:christian.koenig@amd.com>
Sent: Friday, January 26, 2024 7:52 AM
To: Thomas Hellström <thomas.hellstrom@linux.intel.com><mailto:thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch><mailto:daniel@ffwll.ch>
Cc: Brost, Matthew <matthew.brost@intel.com><mailto:matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com><mailto:felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com><mailto:brian.welty@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com><mailto:himal.prasad.ghimiray@intel.com>; Zeng, Oak <oak.zeng@intel.com><mailto:oak.zeng@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com><mailto:saurabhg.gupta@intel.com>; Danilo Krummrich <dakr@redhat.com><mailto:dakr@redhat.com>; dri-devel@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com><mailto:krishnaiah.bommu@intel.com>; Dave Airlie <airlied@redhat.com><mailto:airlied@redhat.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com><mailto:niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org<mailto:intel-xe@lists.freedesktop.org>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 26.01.24 um 09:21 schrieb Thomas Hellström:
> Hi, all
>
> On Thu, 2024-01-25 at 19:32 +0100, Daniel Vetter wrote:
>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>> [SNIP]
>>>>
>>>> Yes most API are per device based.
>>>>
>>>> One exception I know is actually the kfd SVM API. If you look at
>>>> the svm_ioctl function, it is per-process based. Each kfd_process
>>>> represent a process across N gpu devices.
>>>
>>> Yeah and that was a big mistake in my opinion. We should really not
>>> do that ever again.
>>>
>>>> Need to say, kfd SVM represent a shared virtual address space
>>>> across CPU and all GPU devices on the system. This is by the
>>>> definition of SVM (shared virtual memory). This is very different
>>>> from our legacy gpu *device* driver which works for only one
>>>> device (i.e., if you want one device to access another device's
>>>> memory, you will have to use dma-buf export/import etc).
>>>
>>> Exactly that thinking is what we have currently found as a blocker
>>> for virtualization projects. Having SVM as a device independent
>>> feature which somehow ties to the process address space turned out
>>> to be an extremely bad idea.
>>>
>>> The background is that this only works for some use cases but not
>>> all of them.
>>>
>>> What's working much better is to just have a mirror functionality
>>> which says that a range A..B of the process address space is mapped
>>> into a range C..D of the GPU address space.
>>>
>>> Those ranges can then be used to implement the SVM feature required
>>> for higher level APIs and not something you need at the UAPI or
>>> even inside the low level kernel memory management.
>>>
>>> When you talk about migrating memory to a device you also do this
>>> on a per device basis and *not* tied to the process address space.
>>> If you then get crappy performance because userspace gave
>>> contradicting information where to migrate memory then that's a bug
>>> in userspace and not something the kernel should try to prevent
>>> somehow.
>>>
>>>> [SNIP]
>>>>
>>>>> I think if you start using the same drm_gpuvm for multiple
>>>>> devices you will sooner or later start to run into the same mess
>>>>> we have seen with KFD, where we moved more and more functionality
>>>>> from the KFD to the DRM render node because we found that a lot
>>>>> of the stuff simply doesn't work correctly with a single object
>>>>> to maintain the state.
>>>>
>>>> As I understand it, KFD is designed to work across devices. A
>>>> single pseudo /dev/kfd device represent all hardware gpu devices.
>>>> That is why during kfd open, many pdd (process device data) is
>>>> created, each for one hardware device for this process.
>>>
>>> Yes, I'm perfectly aware of that. And I can only repeat myself that
>>> I see this design as a rather extreme failure. And I think it's one
>>> of the reasons why NVidia is so dominant with Cuda.
>>>
>>> This whole approach KFD takes was designed with the idea of
>>> extending the CPU process into the GPUs, but this idea only works
>>> for a few use cases and is not something we should apply to drivers
>>> in general.
>>>
>>> A very good example are virtualization use cases where you end up
>>> with CPU address != GPU address because the VAs are actually coming
>>> from the guest VM and not the host process.
>>>
>>> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should
>>> not have any influence on the design of the kernel UAPI.
>>>
>>> If you want to do something similar as KFD for Xe I think you need
>>> to get explicit permission to do this from Dave and Daniel and
>>> maybe even Linus.
>>
>> I think the one and only one exception where an SVM uapi like in kfd
>> makes sense, is if the _hardware_ itself, not the software stack
>> defined semantics that you've happened to build on top of that hw,
>> enforces a 1:1 mapping with the cpu process address space.
>>
>> Which means your hardware is using PASID, IOMMU based translation,
>> PCI-ATS (address translation services) or whatever your hw calls it
>> and has _no_ device-side pagetables on top. Which from what I've
>> seen all devices with device-memory have, simply because they need
>> some place to store whether that memory is currently in device
>> memory or should be translated using PASID. Currently there's no gpu
>> that works with PASID only, but there are some on-cpu-die
>> accelerator things that do work like that.
>>
>> Maybe in the future there will be some accelerators that are fully
>> cpu cache coherent (including atomics) with something like CXL, and
>> the on-device memory is managed as normal system memory with struct
>> page as ZONE_DEVICE and accelerator va -> physical address
>> translation is only done with PASID ... but for now I haven't seen
>> that, definitely not in upstream drivers.
>>
>> And the moment you have some per-device pagetables or per-device
>> memory management of some sort (like using gpuva mgr) then I'm 100%
>> agreeing with Christian that the kfd SVM model is too strict and not
>> a great idea.
>>
>> Cheers, Sima
>
> I'm trying to digest all the comments here. The end goal is to be
> able to support something similar to this here:
>
> https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
>
> Christian, if I understand you correctly, you're strongly suggesting
> not to try to manage a common virtual address space across different
> devices in the kernel, but merely providing building blocks to do so,
> like for example a generalized userptr with migration support using
> HMM; that way each "mirror" of the CPU mm would be per device and
> inserted into the gpu_vm just like any other gpu_vma, and user-space
> would dictate the A..B -> C..D mapping by choosing the GPU_VA for the
> vma.

Exactly that, yes.

> Sima, it sounds like you're suggesting to shy away from hmm and not
> even attempt to support this except if it can be done using IOMMU sva
> on selected hardware?

I think that comment goes more into the direction of: if you have
ATS/ATC/PRI capable hardware which exposes the functionality to make
memory reads and writes directly into the address space of the CPU,
then yes an SVM-only interface is ok because the hardware can't do
anything else. But as long as you have something like GPUVM then
please don't restrict yourself.

Which I totally agree on as well. The ATS/ATC/PRI combination doesn't
allow using separate page tables for device and CPU, and so also not
separate VAs.

This was one of the reasons why we stopped using this approach for AMD
GPUs.

Regards,
Christian.

> Could you clarify a bit?
>
> Thanks,
> Thomas
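
A minimal sketch of the mirror model discussed above, assuming a hypothetical vm_bind-style interface (all names here are made up for illustration; this is not the actual Xe or drm_gpuvm uAPI):

#include <linux/types.h>

/* Userspace mirrors CPU range A..B at a GPU VA C..D of its own
 * choosing; the kernel only provides the building block and does not
 * manage a common address space across devices. */
struct hypothetical_mirror_bind {
	__u64 vm_id;     /* per-device GPU VM to insert the mirror vma into */
	__u64 cpu_start; /* A: start of the CPU range to mirror */
	__u64 gpu_start; /* C: GPU VA chosen by userspace */
	__u64 size;      /* B - A == D - C */
	__u32 flags;     /* e.g. HMM-backed userptr with migration support */
	__u32 pad;
};

SVM as exposed by OpenCL/Cuda/ROCm runtimes is then just the special case cpu_start == gpu_start, requested by the runtime rather than enforced by the kernel.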

[-- Attachment #2: Type: text/html, Size: 30510 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-30 22:29                             ` Zeng, Oak
@ 2024-01-30 23:12                               ` David Airlie
  2024-01-31  9:15                                 ` Daniel Vetter
  0 siblings, 1 reply; 123+ messages in thread
From: David Airlie @ 2024-01-30 23:12 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Brost, Matthew, Thomas Hellström, rcampbell, apopple,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, intel-xe,
	jglisse, Ghimiray, Himal Prasad, Daniel Vetter, Gupta, saurabhg,
	Bommu, Krishnaiah, Vishwanathapura, Niranjana,
	Christian König, Danilo Krummrich

On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
>
> Hi Christian,
>
>
>
> Nvidia Nouveau driver uses exactly the same concept of SVM with HMM, GPU address in the same process is exactly the same with CPU virtual address. It is already in upstream Linux kernel. We Intel just follow the same direction for our customers. Why we are not allowed?


Oak, this isn't how upstream works, you don't get to appeal to
customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
isn't something NVIDIA would ever suggest for their customers. We also
likely wouldn't just accept NVIDIA's current solution upstream without
some serious discussions. The implementation in nouveau was more of a
sample HMM use case rather than a serious implementation. I suspect if
we do get down the road of making nouveau an actual compute driver for
SVM etc then it would have to severely change.

Dave.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-30 23:12                               ` David Airlie
@ 2024-01-31  9:15                                 ` Daniel Vetter
  2024-01-31 20:17                                   ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Daniel Vetter @ 2024-01-31  9:15 UTC (permalink / raw)
  To: David Airlie
  Cc: Brost, Matthew, Thomas Hellström, rcampbell, apopple,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, Zeng, Oak, intel-xe,
	jglisse, Ghimiray, Himal Prasad, dri-devel, Daniel Vetter, Gupta,
	saurabhg, Bommu, Krishnaiah, Vishwanathapura, Niranjana,
	Christian König, Danilo Krummrich

On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
> On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
> >
> > Hi Christian,
> >
> >
> >
> > Nvidia Nouveau driver uses exactly the same concept of SVM with HMM, GPU address in the same process is exactly the same with CPU virtual address. It is already in upstream Linux kernel. We Intel just follow the same direction for our customers. Why we are not allowed?
> 
> 
> Oak, this isn't how upstream works, you don't get to appeal to
> customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
> isn't something NVIDIA would ever suggest for their customers. We also
> likely wouldn't just accept NVIDIA's current solution upstream without
> some serious discussions. The implementation in nouveau was more of a
> sample HMM use case rather than a serious implementation. I suspect if
> we do get down the road of making nouveau an actual compute driver for
> SVM etc then it would have to severely change.

Yeah on the nouveau hmm code specifically my gut feeling impression is
that we didn't really make friends with that among core kernel
maintainers. It's a bit too much just a tech demo to be able to merge the
hmm core apis for nvidia's out-of-tree driver.

Also, a few years of learning and experience gaining happened meanwhile -
you always have to look at an api design in the context of when it was
designed, and that context changes all the time.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-31  9:15                                 ` Daniel Vetter
@ 2024-01-31 20:17                                   ` Zeng, Oak
  2024-01-31 20:59                                     ` Zeng, Oak
  2024-02-01  8:52                                     ` Christian König
  0 siblings, 2 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-31 20:17 UTC (permalink / raw)
  To: Daniel Vetter, David Airlie
  Cc: Brost, Matthew, Thomas Hellström, rcampbell, apopple,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, intel-xe,
	jglisse, Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu,
	 Krishnaiah, Vishwanathapura, Niranjana, Christian König,
	Danilo Krummrich

Hi Sima, Dave,

I am well aware the nouveau driver is not what Nvidia does with its customers. The key argument is: can we move forward with the concept of a shared virtual address space b/t CPU and GPU? This is the foundation of HMM. We already have split address space support with other driver APIs. SVM, from its name, means shared address space. Are we allowed to implement another driver model to make SVM work, along with other APIs supporting split address space? Those two schemes can co-exist in harmony. We actually have real use cases that use both models in one application.

Hi Christian, Thomas,

In your scheme, GPU VA can != GPU VA. This does introduce some flexibility. But this scheme alone doesn't solve the problem of the proxy process/para-virtualization. You will still need a second mechanism to partition the GPU VA space b/t guest process1 and guest process2, because the proxy process (or the host hypervisor, whatever you call it) uses one single gpu page table for all the guest/client processes, so the GPU VAs for different guest processes can't overlap. If this second mechanism exists, we can of course use the same mechanism to partition the CPU VA space between guest processes as well; then we can still use a shared VA b/t CPU and GPU inside one process, while process1's and process2's address spaces (for both cpu and gpu) don't overlap. This second mechanism is the key to solving the proxy process problem, not the flexibility you introduced.

In practice, your scheme also has a risk of running out of process address space, because you have to partition the whole address space b/t processes. Apparently allowing each guest process to own the whole process space and using separate GPU/CPU page tables for different processes is a better solution than using a single page table and partitioning the process space b/t processes.

For Intel GPU, para-virtualization (xenGT, see https://github.com/intel/XenGT-Preview-kernel; it is a similar idea to the proxy process in Felix's email; they are all SW-based GPU virtualization technologies) is an old project. It has been replaced with HW accelerated SRIOV/system virtualization; XenGT was abandoned a long time ago. So agreed, your scheme adds some flexibility. The question is, do we have a valid use case for such flexibility? I don't see a single one ATM.

I also looked into how to implement your scheme. You basically rejected the very foundation of the hmm design, which is a shared address space b/t CPU and GPU. In your scheme, GPU VA = CPU VA + offset. In every single place where the driver needs to call hmm facilities such as hmm_range_fault or migrate_vma_setup, and in the mmu notifier callback, you need to offset the GPU VA to get a CPU VA. From the application writer's perspective, whenever he wants to use a CPU pointer in his GPU program, he has to add that offset. Do you think this is awkward?
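
As a concrete illustration of the offset bookkeeping described above (a minimal sketch, assuming a fixed per-process offset; the helper name is made up):

/* With GPU VA = CPU VA + offset, every HMM facility (hmm_range_fault,
 * migrate_vma_setup, mmu notifier callbacks) operates on CPU VAs, so
 * the driver would have to translate at each of those boundaries. */
static unsigned long gpu_va_to_cpu_va(unsigned long gpu_va,
				      unsigned long offset)
{
	return gpu_va - offset;	/* undo GPU VA = CPU VA + offset */
}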

Finally, to implement SVM, we need to implement some memory hint API which applies to a virtual address range across all GPU devices. For example, the user would say: for this virtual address range, I prefer the backing store memory to be on GPU deviceX (because the user knows deviceX will use this address range much more than other GPU devices or the CPU). It doesn't make sense to me to make such an API per device based. For example, if you tell device A that the preferred memory location is device B's memory, this doesn't sound correct to me, because in your scheme device A is not even aware of the existence of device B, right?

Regards,
Oak
> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Wednesday, January 31, 2024 4:15 AM
> To: David Airlie <airlied@redhat.com>
> Cc: Zeng, Oak <oak.zeng@intel.com>; Christian König
> <christian.koenig@amd.com>; Thomas Hellström
> <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Brost,
> Matthew <matthew.brost@intel.com>; Felix Kuehling
> <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-
> devel@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N
> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
> apopple@nvidia.com
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
> > On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
> > >
> > > Hi Christian,
> > >
> > >
> > >
> > > Nvidia Nouveau driver uses exactly the same concept of SVM with HMM,
> GPU address in the same process is exactly the same with CPU virtual address. It
> is already in upstream Linux kernel. We Intel just follow the same direction for
> our customers. Why we are not allowed?
> >
> >
> > Oak, this isn't how upstream works, you don't get to appeal to
> > customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
> > isn't something NVIDIA would ever suggest for their customers. We also
> > likely wouldn't just accept NVIDIA's current solution upstream without
> > some serious discussions. The implementation in nouveau was more of a
> > sample HMM use case rather than a serious implementation. I suspect if
> > we do get down the road of making nouveau an actual compute driver for
> > SVM etc then it would have to severely change.
> 
> Yeah on the nouveau hmm code specifically my gut feeling impression is
> that we didn't really make friends with that among core kernel
> maintainers. It's a bit too much just a tech demo to be able to merge the
> hmm core apis for nvidia's out-of-tree driver.
> 
> Also, a few years of learning and experience gaining happened meanwhile -
> you always have to look at an api design in the context of when it was
> designed, and that context changes all the time.
> 
> Cheers, Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-31 20:17                                   ` Zeng, Oak
@ 2024-01-31 20:59                                     ` Zeng, Oak
  2024-02-01  8:52                                     ` Christian König
  1 sibling, 0 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-01-31 20:59 UTC (permalink / raw)
  To: Daniel Vetter, David Airlie
  Cc: Brost, Matthew, Thomas Hellström, rcampbell, apopple,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, intel-xe,
	jglisse, Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu,
	 Krishnaiah, Vishwanathapura, Niranjana, Christian König,
	Danilo Krummrich

Fixed one typo: "GPU VA can != GPU VA" should be "GPU VA can != CPU VA"

> -----Original Message-----
> From: Zeng, Oak
> Sent: Wednesday, January 31, 2024 3:17 PM
> To: Daniel Vetter <daniel@ffwll.ch>; David Airlie <airlied@redhat.com>
> Cc: Christian König <christian.koenig@amd.com>; Thomas Hellström
> <thomas.hellstrom@linux.intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
> Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal
> Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N
> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
> apopple@nvidia.com
> Subject: RE: Making drm_gpuvm work across gpu devices
> 
> Hi Sima, Dave,
> 
> I am well aware nouveau driver is not what Nvidia do with their customer. The
> key argument is, can we move forward with the concept shared virtual address
> space b/t CPU and GPU? This is the foundation of HMM. We already have split
> address space support with other driver API. SVM, from its name, it means
> shared address space. Are we allowed to implement another driver model to
> allow SVM work, along with other APIs supporting split address space? Those two
> scheme can co-exist in harmony. We actually have real use cases to use both
> models in one application.
> 
> Hi Christian, Thomas,
> 
> In your scheme, GPU VA can != CPU VA. This does introduce some flexibility. But
> this scheme alone doesn't solve the problem of the proxy process/para-
> virtualization. You will still need a second mechanism to partition GPU VA space
> b/t guest process1 and guest process2 because proxy process (or the host
> hypervisor whatever you call it) use one single gpu page table for all the
> guest/client processes. GPU VA for different guest process can't overlap. If this
> second mechanism exist, we of course can use the same mechanism to partition
> CPU VA space between guest processes as well, then we can still use shared VA
> b/t CPU and GPU inside one process, but process1 and process2's address space
> (for both cpu and gpu) doesn't overlap. This second mechanism is the key to
> solve the proxy process problem, not the flexibility you introduced.
> 
> In practice, your scheme also have a risk of running out of process space because
> you have to partition whole address space b/t processes. Apparently allowing
> each guest process to own the whole process space and using separate GPU/CPU
> page table for different processes is a better solution than using single page table
> and partition process space b/t processes.
> 
> For Intel GPU, para-virtualization (xenGT, see https://github.com/intel/XenGT-
> Preview-kernel. It is similar idea of the proxy process in Flex's email. They are all
> SW-based GPU virtualization technology) is an old project. It is now replaced with
> HW accelerated SRIOV/system virtualization. XenGT is abandoned long time ago.
> So agreed your scheme add some flexibility. The question is, do we have a valid
> use case to use such flexibility? I don't see a single one ATM.
> 
> I also pictured into how to implement your scheme. You basically rejected the
> very foundation of hmm design which is shared address space b/t CPU and GPU.
> In your scheme, GPU VA = CPU VA + offset. In every single place where driver
> need to call hmm facilities such as hmm_range_fault, migrate_vma_setup and in
> mmu notifier call back, you need to offset the GPU VA to get a CPU VA. From
> application writer's perspective, whenever he want to use a CPU pointer in his
> GPU program, he add to add that offset. Do you think this is awkward?
> 
> Finally, to implement SVM, we need to implement some memory hint API which
> applies to a virtual address range across all GPU devices. For example, user would
> say, for this virtual address range, I prefer the backing store memory to be on
> GPU deviceX (because user knows deviceX would use this address range much
> more than other GPU devices or CPU). It doesn't make sense to me to make such
> API per device based. For example, if you tell device A that the preferred
> memory location is device B memory, this doesn't sounds correct to me because
> in your scheme, device A is not even aware of the existence of device B. right?
> 
> Regards,
> Oak
> > -----Original Message-----
> > From: Daniel Vetter <daniel@ffwll.ch>
> > Sent: Wednesday, January 31, 2024 4:15 AM
> > To: David Airlie <airlied@redhat.com>
> > Cc: Zeng, Oak <oak.zeng@intel.com>; Christian König
> > <christian.koenig@amd.com>; Thomas Hellström
> > <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Brost,
> > Matthew <matthew.brost@intel.com>; Felix Kuehling
> > <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-
> > devel@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Gupta, saurabhg
> <saurabhg.gupta@intel.com>;
> > Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
> > xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur
> N
> > <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
> > apopple@nvidia.com
> > Subject: Re: Making drm_gpuvm work across gpu devices
> >
> > On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
> > > On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
> > > >
> > > > Hi Christian,
> > > >
> > > >
> > > >
> > > > Nvidia Nouveau driver uses exactly the same concept of SVM with HMM,
> > GPU address in the same process is exactly the same with CPU virtual address.
> It
> > is already in upstream Linux kernel. We Intel just follow the same direction for
> > our customers. Why we are not allowed?
> > >
> > >
> > > Oak, this isn't how upstream works, you don't get to appeal to
> > > customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
> > > isn't something NVIDIA would ever suggest for their customers. We also
> > > likely wouldn't just accept NVIDIA's current solution upstream without
> > > some serious discussions. The implementation in nouveau was more of a
> > > sample HMM use case rather than a serious implementation. I suspect if
> > > we do get down the road of making nouveau an actual compute driver for
> > > SVM etc then it would have to severely change.
> >
> > Yeah on the nouveau hmm code specifically my gut feeling impression is
> > that we didn't really make friends with that among core kernel
> > maintainers. It's a bit too much just a tech demo to be able to merge the
> > hmm core apis for nvidia's out-of-tree driver.
> >
> > Also, a few years of learning and experience gaining happened meanwhile -
> > you always have to look at an api design in the context of when it was
> > designed, and that context changes all the time.
> >
> > Cheers, Sima
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-01-31 20:17                                   ` Zeng, Oak
  2024-01-31 20:59                                     ` Zeng, Oak
@ 2024-02-01  8:52                                     ` Christian König
  2024-02-29 18:22                                       ` Zeng, Oak
  1 sibling, 1 reply; 123+ messages in thread
From: Christian König @ 2024-02-01  8:52 UTC (permalink / raw)
  To: Zeng, Oak, Daniel Vetter, David Airlie
  Cc: Brost, Matthew, Thomas Hellström, rcampbell, apopple,
	Felix Kuehling, Welty, Brian, Shah, Ankur N, dri-devel, jglisse,
	Ghimiray, Himal Prasad, Gupta, saurabhg, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, intel-xe, Danilo Krummrich

Hi Oak,

Am 31.01.24 um 21:17 schrieb Zeng, Oak:
> Hi Sima, Dave,
>
> I am well aware nouveau driver is not what Nvidia do with their customer. The key argument is, can we move forward with the concept shared virtual address space b/t CPU and GPU? This is the foundation of HMM. We already have split address space support with other driver API. SVM, from its name, it means shared address space. Are we allowed to implement another driver model to allow SVM work, along with other APIs supporting split address space? Those two scheme can co-exist in harmony. We actually have real use cases to use both models in one application.
>
> Hi Christian, Thomas,
>
> In your scheme, GPU VA can != GPU VA. This does introduce some flexibility. But this scheme alone doesn't solve the problem of the proxy process/para-virtualization. You will still need a second mechanism to partition GPU VA space b/t guest process1 and guest process2 because proxy process (or the host hypervisor whatever you call it) use one single gpu page table for all the guest/client processes. GPU VA for different guest process can't overlap. If this second mechanism exist, we of course can use the same mechanism to partition CPU VA space between guest processes as well, then we can still use shared VA b/t CPU and GPU inside one process, but process1 and process2's address space (for both cpu and gpu) doesn't overlap. This second mechanism is the key to solve the proxy process problem, not the flexibility you introduced.

That approach was suggested before, but it doesn't work. First of all 
you create a massive security hole when you give the GPU full access to 
the QEMU CPU process which runs the virtualization.

So even if you say CPU VA == GPU VA you still need some kind of 
flexibility otherwise you can't implement this use case securely.

Additionally, the CPU VAs are usually controlled by the OS and not 
some driver, so to make sure that host and guest VAs don't overlap you 
would need to add some kind of sync between the guest and host OS kernels.

> In practice, your scheme also have a risk of running out of process space because you have to partition whole address space b/t processes. Apparently allowing each guest process to own the whole process space and using separate GPU/CPU page table for different processes is a better solution than using single page table and partition process space b/t processes.

Yeah, that you run out of address space is certainly possible. But as I 
said CPUs are switching to 5 levels of page tables, and if you look at 
for example a "cat maps | cut -c-4 | sort -u" of a process (i.e. of its 
/proc/<pid>/maps) you will find that only a handful of 4GiB segments 
are actually used, and thanks to recoverable page faults you can map 
those between host and client on demand. This gives you at least enough 
address space to handle a couple of thousand clients.

> For Intel GPU, para-virtualization (xenGT, see https://github.com/intel/XenGT-Preview-kernel. It is similar idea of the proxy process in Flex's email. They are all SW-based GPU virtualization technology) is an old project. It is now replaced with HW accelerated SRIOV/system virtualization. XenGT is abandoned long time ago. So agreed your scheme add some flexibility. The question is, do we have a valid use case to use such flexibility? I don't see a single one ATM.

Yeah, we have SRIOV functionality on AMD hw as well, but for some use 
cases it's just too inflexible.

> I also pictured into how to implement your scheme. You basically rejected the very foundation of hmm design which is shared address space b/t CPU and GPU. In your scheme, GPU VA = CPU VA + offset. In every single place where driver need to call hmm facilities such as hmm_range_fault, migrate_vma_setup and in mmu notifier call back, you need to offset the GPU VA to get a CPU VA. From application writer's perspective, whenever he want to use a CPU pointer in his GPU program, he add to add that offset. Do you think this is awkward?

What? This flexibility is there precisely so that the application 
writer doesn't have to deal with any offset.

> Finally, to implement SVM, we need to implement some memory hint API which applies to a virtual address range across all GPU devices. For example, user would say, for this virtual address range, I prefer the backing store memory to be on GPU deviceX (because user knows deviceX would use this address range much more than other GPU devices or CPU). It doesn't make sense to me to make such API per device based. For example, if you tell device A that the preferred memory location is device B memory, this doesn't sounds correct to me because in your scheme, device A is not even aware of the existence of device B. right?

Correct, and while the additional flexibility is somewhat optional I 
strongly think that it is mandatory to not have a centralized approach 
for device driver settings.

Going away from the well defined file descriptor based handling of 
device driver interfaces was one of the worst ideas I've ever seen in 
roughly thirty years of working with Unixoid operating systems. It 
basically broke everything, from reverse lookup handling for mmap() to 
file system privileges for hardware access.

As far as I can see anything which goes into the direction of opening 
/dev/kfd or /dev/xe_svm or something similar and saying that this then 
results in implicit SVM for your render nodes is an absolute no-go 
and would require an explicit acknowledgement from Linus on the design 
to do something like that.

What you can do is to have an IOCTL for the render node file descriptor 
which says this device should do SVM with the current CPU address space 
and another IOCTL which says range A..B is preferred to migrate to this 
device for HMM when the device runs into a page fault.
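
A sketch of what those two IOCTLs could look like, with made-up names (not an existing uAPI, just an illustration of the file descriptor based approach):

#include <linux/ioctl.h>
#include <linux/types.h>

/* "this device should do SVM with the current CPU address space" */
struct hypothetical_svm_enable {
	__u64 vm_id;	/* GPU VM on this render node that mirrors the CPU mm */
};

/* "range A..B is preferred to migrate to this device for HMM when
 * the device runs into a page fault" -- per device and per range, so
 * contradicting hints across devices remain a userspace problem. */
struct hypothetical_svm_set_preferred_range {
	__u64 vm_id;
	__u64 start;	/* A */
	__u64 end;	/* B */
};

#define HYPOTHETICAL_IOCTL_SVM_ENABLE \
	_IOW('x', 0x40, struct hypothetical_svm_enable)
#define HYPOTHETICAL_IOCTL_SVM_SET_PREFERRED_RANGE \
	_IOW('x', 0x41, struct hypothetical_svm_set_preferred_range)

Because both operations live on the render node file descriptor, the usual file-descriptor based permission model keeps working.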

And yes that obviously means shitty performance for device drivers 
because pages play ping-pong if userspace gives contradicting 
information for migrations, but that is how it is supposed to be.

Everything else which works across the borders of a device driver's 
scope should be implemented as a system call with the relevant review 
process around it.

Regards,
Christian.

>
> Regards,
> Oak
>> -----Original Message-----
>> From: Daniel Vetter <daniel@ffwll.ch>
>> Sent: Wednesday, January 31, 2024 4:15 AM
>> To: David Airlie <airlied@redhat.com>
>> Cc: Zeng, Oak <oak.zeng@intel.com>; Christian König
>> <christian.koenig@amd.com>; Thomas Hellström
>> <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Brost,
>> Matthew <matthew.brost@intel.com>; Felix Kuehling
>> <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-
>> devel@lists.freedesktop.org; Ghimiray, Himal Prasad
>> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
>> <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>;
>> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
>> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N
>> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
>> apopple@nvidia.com
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>> On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
>>> On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
>>>> Hi Christian,
>>>>
>>>>
>>>>
>>>> Nvidia Nouveau driver uses exactly the same concept of SVM with HMM,
>> GPU address in the same process is exactly the same with CPU virtual address. It
>> is already in upstream Linux kernel. We Intel just follow the same direction for
>> our customers. Why we are not allowed?
>>>
>>> Oak, this isn't how upstream works, you don't get to appeal to
>>> customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
>>> isn't something NVIDIA would ever suggest for their customers. We also
>>> likely wouldn't just accept NVIDIA's current solution upstream without
>>> some serious discussions. The implementation in nouveau was more of a
>>> sample HMM use case rather than a serious implementation. I suspect if
>>> we do get down the road of making nouveau an actual compute driver for
>>> SVM etc then it would have to severely change.
>> Yeah on the nouveau hmm code specifically my gut feeling impression is
>> that we didn't really make friends with that among core kernel
>> maintainers. It's a bit too much just a tech demo to be able to merge the
>> hmm core apis for nvidia's out-of-tree driver.
>>
>> Also, a few years of learning and experience gaining happened meanwhile -
>> you always have to look at an api design in the context of when it was
>> designed, and that context changes all the time.
>>
>> Cheers, Sima
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-01-24  8:33             ` Christian König
                                 ` (2 preceding siblings ...)
  2024-01-25 18:32               ` Daniel Vetter
@ 2024-02-23 20:12               ` Zeng, Oak
  2024-02-27  6:54                 ` Christian König
  3 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-02-23 20:12 UTC (permalink / raw)
  To: Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Thomas.Hellstrom, Vishwanathapura, Niranjana,
	Brost, Matthew, Gupta, saurabhg

[-- Attachment #1: Type: text/plain, Size: 5564 bytes --]

Hi Christian,

I go back to this old email to ask a question.

Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.”

There are two categories of SVM:

  1.  driver svm allocator: this is implemented in user space, e.g., cudaMallocManaged (cuda) or zeMemAllocShared (L0) or clSVMAlloc (openCL). Intel already has gem_create/vm_bind in xekmd and our umd implemented clSVMAlloc and zeMemAllocShared on top of gem_create/vm_bind. Range A..B of the process address space is mapped into a range C..D of the GPU address space, exactly as you said.
  2.  system svm allocator: this doesn’t introduce an extra driver API for memory allocation. Any valid CPU virtual address can be used directly and transparently in a GPU program without any extra driver API call. Quote from kernel Documentation/vm/hmm.rst: “Any application memory region (private anonymous, shared memory, or regular file backed memory) can be used by a device transparently” and “to share the address space by duplicating the CPU page table in the device page table so the same address points to the same physical memory for any valid main memory address in the process address space”. In the system svm allocator, we don’t need that A..B C..D mapping. (Both categories are sketched in the example below.)
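
A minimal user-space sketch of the two categories, using CUDA as the example API (launch_kernel_with() is a made-up stand-in for an actual kernel launch; error handling omitted):

#include <stdlib.h>
#include <cuda_runtime.h>

/* Stand-in for launching a GPU kernel that dereferences ptr. */
extern void launch_kernel_with(void *ptr);

void example(void)
{
	/* 1. driver svm allocator: an explicit allocation call; the
	 * runtime/KMD set up the A..B -> C..D mapping underneath
	 * (gem_create/vm_bind in the xekmd case). */
	void *p1;
	cudaMallocManaged(&p1, 4096, cudaMemAttachGlobal);
	launch_kernel_with(p1);

	/* 2. system svm allocator: no allocation API at all; any valid
	 * CPU address works on the GPU transparently, faulted in via
	 * HMM. */
	void *p2 = malloc(4096);
	launch_kernel_with(p2);
}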

It looks like you were talking of 1). Were you?

Oak
From: Christian König <christian.koenig@amd.com>
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> [SNIP]
>
> Yes most API are per device based.
>
> One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that ever again.

> Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found as a blocker for virtualization projects. Having SVM as a device independent feature which somehow ties to the process address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says that a range A..B of the process address space is mapped into a range C..D of the GPU address space.

Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.

When you talk about migrating memory to a device you also do this on a per device basis and *not* tied to the process address space. If you then get crappy performance because userspace gave contradicting information where to migrate memory then that's a bug in userspace and not something the kernel should try to prevent somehow.

> [SNIP]
>
>> I think if you start using the same drm_gpuvm for multiple devices you will sooner or later start to run into the same mess we have seen with KFD, where we moved more and more functionality from the KFD to the DRM render node because we found that a lot of the stuff simply doesn't work correctly with a single object to maintain the state.
>
> As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design as a rather extreme failure. And I think it's one of the reasons why NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the CPU process into the GPUs, but this idea only works for a few use cases and is not something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU address != GPU address because the VAs are actually coming from the guest VM and not the host process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to get explicit permission to do this from Dave and Daniel and maybe even Linus.

Regards,
Christian.

[-- Attachment #2: Type: text/html, Size: 11601 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-02-23 20:12               ` Zeng, Oak
@ 2024-02-27  6:54                 ` Christian König
  2024-02-27 15:58                   ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Christian König @ 2024-02-27  6:54 UTC (permalink / raw)
  To: Zeng, Oak, Danilo Krummrich, Dave Airlie, Daniel Vetter,
	Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Thomas.Hellstrom, Vishwanathapura, Niranjana,
	Brost, Matthew, Gupta, saurabhg

[-- Attachment #1: Type: text/plain, Size: 6441 bytes --]

Hi Oak,

Am 23.02.24 um 21:12 schrieb Zeng, Oak:
>
> Hi Christian,
>
> I go back this old email to ask a question.
>

sorry totally missed that one.

> Quote from your email:
>
> “Those ranges can then be used to implement the SVM feature required 
> for higher level APIs and not something you need at the UAPI or even 
> inside the low level kernel memory management.”
>
> “SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should 
> not have any influence on the design of the kernel UAPI.”
>
> There are two category of SVM:
>
>  1. driver svm allocator: this is implemented in user space,  i.g.,
>     cudaMallocManaged (cuda) or zeMemAllocShared (L0) or
>     clSVMAlloc(openCL). Intel already have gem_create/vm_bind in xekmd
>     and our umd implemented clSVMAlloc and zeMemAllocShared on top of
>     gem_create/vm_bind. Range A..B of the process address space is
>     mapped into a range C..D of the GPU address space, exactly as you
>     said.
>  2. system svm allocator:  This doesn’t introduce extra driver API for
>     memory allocation. Any valid CPU virtual address can be used
>     directly transparently in a GPU program without any extra driver
>     API call. Quote from kernel Documentation/vm/hmm.hst: “Any
>     application memory region (private anonymous, shared memory, or
>     regular file backed memory) can be used by a device transparently”
>     and “to share the address space by duplicating the CPU page table
>     in the device page table so the same address points to the same
>     physical memory for any valid main memory address in the process
>     address space”. In system svm allocator, we don’t need that A..B
>     C..D mapping.
>
> It looks like you were talking of 1). Were you?
>

No, even when you fully mirror the whole address space from a process 
into the GPU you still need to enable this somehow with an IOCTL.

And while enabling this you absolutely should specify to which part of 
the address space this mirroring applies and where it maps to.

I see the system svm allocator as just a special case of the driver 
allocator where not fully backed buffer objects are allocated, but 
rather sparse ones which are filled and migrated on demand.

Regards,
Christian.

> Oak
>
> From: Christian König <christian.koenig@amd.com>
> Sent: Wednesday, January 24, 2024 3:33 AM
> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich 
> <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter 
> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>
> Cc: Welty, Brian <brian.welty@intel.com>; 
> dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; 
> Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad 
> <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; 
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; 
> Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg 
> <saurabhg.gupta@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
>
> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>
>     [SNIP]
>
>     Yes most API are per device based.
>
>     One exception I know is actually the kfd SVM API. If you look at the svm_ioctl function, it is per-process based. Each kfd_process represent a process across N gpu devices.
>
>
> Yeah and that was a big mistake in my opinion. We should really not do 
> that ever again.
>
>
>     Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU devices on the system. This is by the definition of SVM (shared virtual memory). This is very different from our legacy gpu *device* driver which works for only one device (i.e., if you want one device to access another device's memory, you will have to use dma-buf export/import etc).
>
>
> Exactly that thinking is what we have currently found as blocker for a 
> virtualization projects. Having SVM as device independent feature 
> which somehow ties to the process address space turned out to be an 
> extremely bad idea.
>
> The background is that this only works for some use cases but not all 
> of them.
>
> What's working much better is to just have a mirror functionality 
> which says that a range A..B of the process address space is mapped 
> into a range C..D of the GPU address space.
>
> Those ranges can then be used to implement the SVM feature required 
> for higher level APIs and not something you need at the UAPI or even 
> inside the low level kernel memory management.
>
> When you talk about migrating memory to a device you also do this on a 
> per device basis and *not* tied to the process address space. If you 
> then get crappy performance because userspace gave contradicting 
> information where to migrate memory then that's a bug in userspace and 
> not something the kernel should try to prevent somehow.
>
> [SNIP]
>
>         I think if you start using the same drm_gpuvm for multiple devices you
>
>         will sooner or later start to run into the same mess we have seen with
>
>         KFD, where we moved more and more functionality from the KFD to the DRM
>
>         render node because we found that a lot of the stuff simply doesn't work
>
>         correctly with a single object to maintain the state.
>
>     As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd device represent all hardware gpu devices. That is why during kfd open, many pdd (process device data) is created, each for one hardware device for this process.
>
>
> Yes, I'm perfectly aware of that. And I can only repeat myself that I 
> see this design as a rather extreme failure. And I think it's one of 
> the reasons why NVidia is so dominant with Cuda.
>
> This whole approach KFD takes was designed with the idea of extending 
> the CPU process into the GPUs, but this idea only works for a few use 
> cases and is not something we should apply to drivers in general.
>
> A very good example are virtualization use cases where you end up with 
> CPU address != GPU address because the VAs are actually coming from 
> the guest VM and not the host process.
>
> SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should 
> not have any influence on the design of the kernel UAPI.
>
> If you want to do something similar as KFD for Xe I think you need to 
> get explicit permission to do this from Dave and Daniel and maybe even 
> Linus.
>
> Regards,
> Christian.
>

[-- Attachment #2: Type: text/html, Size: 13922 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-02-27  6:54                 ` Christian König
@ 2024-02-27 15:58                   ` Zeng, Oak
  2024-02-28 19:51                     ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-02-27 15:58 UTC (permalink / raw)
  To: Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Thomas.Hellstrom, Vishwanathapura, Niranjana,
	Brost, Matthew, Gupta, saurabhg

[-- Attachment #1: Type: text/plain, Size: 4047 bytes --]



From: Christian König <christian.koenig@amd.com>
Sent: Tuesday, February 27, 2024 1:54 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; jglisse@redhat.com
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

> Hi Oak,
>
> Am 23.02.24 um 21:12 schrieb Zeng, Oak:
>> Hi Christian,
>>
>> I go back to this old email to ask a question.
>
> sorry totally missed that one.
>
>> Quote from your email:
>> “Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.”
>> “SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.”
>>
>> There are two categories of SVM:
>>
>>   1.  driver svm allocator: this is implemented in user space, e.g., cudaMallocManaged (cuda) or zeMemAllocShared (L0) or clSVMAlloc (openCL). Intel already has gem_create/vm_bind in xekmd and our umd implemented clSVMAlloc and zeMemAllocShared on top of gem_create/vm_bind. Range A..B of the process address space is mapped into a range C..D of the GPU address space, exactly as you said.
>>   2.  system svm allocator: this doesn’t introduce an extra driver API for memory allocation. Any valid CPU virtual address can be used directly and transparently in a GPU program without any extra driver API call. Quote from kernel Documentation/vm/hmm.rst: “Any application memory region (private anonymous, shared memory, or regular file backed memory) can be used by a device transparently” and “to share the address space by duplicating the CPU page table in the device page table so the same address points to the same physical memory for any valid main memory address in the process address space”. In the system svm allocator, we don’t need that A..B C..D mapping.
>>
>> It looks like you were talking of 1). Were you?
>
> No, even when you fully mirror the whole address space from a process into the GPU you still need to enable this somehow with an IOCTL.
>
> And while enabling this you absolutely should specify to which part of the address space this mirroring applies and where it maps to.

Let's say we have a hardware platform where both CPU and GPU support a 57-bit virtual address range. How do you decide “which part of the address space this mirroring applies to”? You have to mirror the whole address space (0~2^57-1), don't you? As you designed it, the gigantic vm_bind/mirroring happens at process initialization time, and at that time you don't know which part of the address space will be used for the gpu program.

> I see the system svm allocator as just a special case of the driver allocator where not fully backed buffer objects are allocated, but rather sparse ones which are filled and migrated on demand.

The above statement is true to me. We don't have a BO for the system svm allocator. It is a sparse one, as we don't map the whole vma to the GPU. Our migration policy decides which pages/how much of the vma is migrated/mapped to the GPU page table.

The difference b/t your mind and mine is, you want a gigantic vma (created during the gigantic vm_bind) to be sparsely populated to the gpu, while I thought a vma (xe_vma in the xekmd code) is a place to save memory attributes (such as caching, user preferred placement etc). All those memory attributes are range based, i.e., the user can specify that range1 is cached while range2 is uncached. So I don't see how you can manage it with the gigantic vma.

Regards,
Oak

> Regards,
> Christian.




[-- Attachment #2: Type: text/html, Size: 9400 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-02-27 15:58                   ` Zeng, Oak
@ 2024-02-28 19:51                     ` Zeng, Oak
  2024-02-29  9:41                       ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-02-28 19:51 UTC (permalink / raw)
  To: Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Thomas.Hellstrom, Vishwanathapura, Niranjana,
	Brost, Matthew, Gupta, saurabhg

[-- Attachment #1: Type: text/plain, Size: 4999 bytes --]


The mail wasn’t indent/preface correctly. Manually format it.


From: Christian König <christian.koenig@amd.com>
Sent: Tuesday, February 27, 2024 1:54 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; jglisse@redhat.com
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,
Am 23.02.24 um 21:12 schrieb Zeng, Oak:
Hi Christian,

I go back this old email to ask a question.

sorry totally missed that one.


Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.”

There are two category of SVM:

1.       driver svm allocator: this is implemented in user space, e.g., cudaMallocManaged (cuda) or zeMemAllocShared (L0) or clSVMAlloc (openCL). Intel already has gem_create/vm_bind in xekmd and our umd implemented clSVMAlloc and zeMemAllocShared on top of gem_create/vm_bind. Range A..B of the process address space is mapped into a range C..D of the GPU address space, exactly as you said.

2.       system svm allocator: this doesn’t introduce an extra driver API for memory allocation. Any valid CPU virtual address can be used directly and transparently in a GPU program without any extra driver API call. Quote from kernel Documentation/vm/hmm.rst: “Any application memory region (private anonymous, shared memory, or regular file backed memory) can be used by a device transparently” and “to share the address space by duplicating the CPU page table in the device page table so the same address points to the same physical memory for any valid main memory address in the process address space”. In the system svm allocator, we don’t need that A..B C..D mapping.

It looks like you were talking of 1). Were you?

No, even when you fully mirror the whole address space from a process into the GPU you still need to enable this somehow with an IOCTL.

And while enabling this you absolutely should specify to which part of the address space this mirroring applies and where it maps to.


[Zeng, Oak]
Let's say we have a hardware platform where both CPU and GPU support a 57-bit virtual address range (used as an example; the statement applies to any address range). How do you decide “which part of the address space this mirroring applies to”? You have to mirror the whole address space [0~2^57-1], don't you? As you designed it, the gigantic vm_bind/mirroring happens at process initialization time, and at that time you don't know which part of the address space will be used for the gpu program. Remember that for the system allocator, *any* valid CPU address can be used for the GPU program. If you add an offset to [0~2^57-1], you get an address out of the 57-bit address range. Is this a valid concern?


I see the system svm allocator as just a special case of the driver allocator where not fully backed buffer objects are allocated, but rather sparse one which are filled and migrated on demand.


[Zeng, Oak]
The above statement is true to me. We don't have a BO for the system svm allocator. It is a sparse one, as we can sparsely map the vma to the GPU. Our migration policy decides which pages/how much of the vma is migrated/mapped to the GPU page table.

The difference b/t your mind and mine is, you want a gigantic vma (created during the gigantic vm_bind) to be sparsely populated to the gpu, while I thought a vma (xe_vma in the xekmd code) is a place to save memory attributes (such as caching, user preferred placement etc). All those memory attributes are range based, i.e., the user can specify that range1 is cached while range2 is uncached. So I don't see how you can manage it with the gigantic vma. Do you split your gigantic vma later to save range-based memory attributes?
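
To make the range-based attribute point concrete, a sketch of what each range would have to carry (an illustrative struct, not existing xekmd code):

#include <linux/types.h>

/* Memory attributes apply per VA range, so either each range gets its
 * own (xe_)vma, or one gigantic mirror vma has to carry a list of
 * these and be split whenever an attribute changes. */
struct hypothetical_mem_attr_range {
	__u64 start;
	__u64 end;
	__u32 preferred_location;	/* e.g. deviceX vram vs system memory */
	__u32 caching;			/* e.g. range1 cached, range2 uncached */
};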

Regards,
Oak


Regards,
Christian.



[-- Attachment #2: Type: text/html, Size: 12000 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-02-28 19:51                     ` Zeng, Oak
@ 2024-02-29  9:41                       ` Christian König
  2024-02-29 16:05                         ` Zeng, Oak
  2024-02-29 17:12                         ` Thomas Hellström
  0 siblings, 2 replies; 123+ messages in thread
From: Christian König @ 2024-02-29  9:41 UTC (permalink / raw)
  To: Zeng, Oak, Danilo Krummrich, Dave Airlie, Daniel Vetter,
	Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Thomas.Hellstrom, Vishwanathapura, Niranjana,
	Brost, Matthew, Gupta, saurabhg

[-- Attachment #1: Type: text/plain, Size: 6103 bytes --]

Am 28.02.24 um 20:51 schrieb Zeng, Oak:
>
> The mail wasn’t indent/preface correctly. Manually format it.
>
> From: Christian König <christian.koenig@amd.com>
> Sent: Tuesday, February 27, 2024 1:54 AM
> To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich 
> <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter 
> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; 
> jglisse@redhat.com
> Cc: Welty, Brian <brian.welty@intel.com>; 
> dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; 
> Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad 
> <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; 
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; 
> Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg 
> <saurabhg.gupta@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
>
> Hi Oak,
>
> Am 23.02.24 um 21:12 schrieb Zeng, Oak:
>
>     Hi Christian,
>
>     I go back this old email to ask a question.
>
>
> sorry totally missed that one.
>
>     Quote from your email:
>
>     “Those ranges can then be used to implement the SVM feature
>     required for higher level APIs and not something you need at the
>     UAPI or even inside the low level kernel memory management.”
>
>     “SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This
>     should not have any influence on the design of the kernel UAPI.”
>
>     There are two category of SVM:
>
>     1.driver svm allocator: this is implemented in user space,  i.g.,
>     cudaMallocManaged (cuda) or zeMemAllocShared (L0) or
>     clSVMAlloc(openCL). Intel already have gem_create/vm_bind in xekmd
>     and our umd implemented clSVMAlloc and zeMemAllocShared on top of
>     gem_create/vm_bind. Range A..B of the process address space is
>     mapped into a range C..D of the GPU address space, exactly as you
>     said.
>
>     2.system svm allocator:  This doesn’t introduce extra driver API
>     for memory allocation. Any valid CPU virtual address can be used
>     directly transparently in a GPU program without any extra driver
>     API call. Quote from kernel Documentation/vm/hmm.rst: “Any
>     application memory region (private anonymous, shared memory, or
>     regular file backed memory) can be used by a device transparently”
>     and “to share the address space by duplicating the CPU page table
>     in the device page table so the same address points to the same
>     physical memory for any valid main memory address in the process
>     address space”. In system svm allocator, we don’t need that A..B
>     C..D mapping.
>
>     It looks like you were talking of 1). Were you?
>
>
> No, even when you fully mirror the whole address space from a process 
> into the GPU you still need to enable this somehow with an IOCTL.
>
> And while enabling this you absolutely should specify to which part of 
> the address space this mirroring applies and where it maps to.
>
> */[Zeng, Oak] /*
>
> Lets say we have a hardware platform where both CPU and GPU support 
> 57bit(use it for example. The statement apply to any address range) 
> virtual address range, how do you decide “which part of the address 
> space this mirroring applies”? You have to mirror the whole address 
> space [0~2^57-1], do you? As you designed it, the gigantic 
> vm_bind/mirroring happens at the process initialization time, and at 
> that time, you don’t know which part of the address space will be used 
> for gpu program. Remember for system allocator, *any* valid CPU 
> address can be used for GPU program.  If you add an offset to 
> [0~2^57-1], you get an address out of 57bit address range. Is this a 
> valid concern?
>

Well, you can perfectly well mirror on demand. You just need something
similar to userfaultfd() for the GPU. This way you don't need to mirror
the full address space, but can rather work with large chunks created on
demand, let's say 1GiB or something like that.
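
As a rough illustration (hypothetical code: gpu_vm and chunk_bind() are
made-up names, not an existing API), the fault handler would create the
mirror lazily, one aligned chunk at a time, instead of binding the whole
[0, 2^57-1] space up front:

#include <stdint.h>

struct gpu_vm;                            /* opaque VM handle (hypothetical) */
int chunk_bind(struct gpu_vm *vm, uint64_t start,
	       uint64_t len);             /* hypothetical bind call */

#define MIRROR_CHUNK (1ULL << 30)         /* 1GiB, as suggested above */

/* On a GPU page fault, mirror only the chunk containing the fault. */
static int handle_gpu_fault(struct gpu_vm *vm, uint64_t fault_addr)
{
	uint64_t start = fault_addr & ~(MIRROR_CHUNK - 1);

	return chunk_bind(vm, start, MIRROR_CHUNK);
}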

The virtual address space is basically just a hardware functionality to
route memory accesses. While the mirroring approach is a very common use
case for data centers and high performance computing, there are quite a
number of different use cases which make use of the virtual address space
in a non-"standard" fashion. The native context approach for VMs is just
one example; databases and emulators are another.

>
>
> I see the system svm allocator as just a special case of the driver 
> allocator where not fully backed buffer objects are allocated, but 
> rather sparse one which are filled and migrated on demand.
>
> */[Zeng, Oak] /*
>
> Above statement is true to me. We don’t have BO for system svm 
> allocator. It is a sparse one as we can sparsely map vma to GPU. Our 
> migration policy decide which pages/how much of the vma is 
> migrated/mapped to GPU page table.
>
> *//*
>
> The difference b/t your mind and mine is, you want a gigantic vma 
> (created during the gigantic vm_bind) to be sparsely populated to gpu. 
> While I thought vma (xe_vma in xekmd codes) is a place to save memory 
> attributes (such as caching, user preferred placement etc). All those 
> memory attributes are range based, i.e., user can specify range1 is 
> cached while range2 is uncached. So I don’t see how you can manage it 
> with the gigantic vma. Do you split your gigantic vma later to save 
> range based memory attributes?
>

Yes, exactly that. I mean the splitting and eventual merging of ranges
is a standard functionality of the GPUVM code.

So when you need to store additional attributes per range, I would
strongly suggest making use of this splitting and merging functionality
as well.

So basically an IOCTL which says range A..B of the GPU address space is
mapped to offset X of the CPU address space with parameters Y (caching,
migration behavior etc.). That is essentially the same as what we have
for mapping GEM objects; the provider of the backing store is just
something different.
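
A minimal sketch of how the arguments of such an IOCTL could be laid out
(the struct and field names are illustrative, not an existing uAPI):

#include <stdint.h>

/* Map GPU range A..B to CPU offset X with per-range parameters Y. */
struct gpu_mirror_bind {
	uint64_t gpu_start;	/* A: start of the GPU address range */
	uint64_t gpu_end;	/* B: end of the GPU address range */
	uint64_t cpu_offset;	/* X: CPU address the range mirrors */
	uint32_t caching;	/* Y: caching attribute for this range */
	uint32_t migration;	/* Y: migration behavior for this range */
};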

Regards,
Christian.

> Regards,
>
> Oak
>
>
>
> Regards,
> Christian.
>

[-- Attachment #2: Type: text/html, Size: 15506 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-02-29  9:41                       ` Christian König
@ 2024-02-29 16:05                         ` Zeng, Oak
  2024-02-29 17:12                         ` Thomas Hellström
  1 sibling, 0 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-02-29 16:05 UTC (permalink / raw)
  To: Christian König, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Thomas.Hellstrom, Vishwanathapura, Niranjana,
	Brost, Matthew, Gupta, saurabhg

[-- Attachment #1: Type: text/plain, Size: 8777 bytes --]

Hi Christian,

Can you elaborate on the mirror-on-demand/userfaultfd idea?

userfaultfd is a way for user space to take over page-fault handling of a user-registered range. At first look, it seems you want a user-space page-fault handler to mirror a large chunk of memory to the GPU. I would imagine this handler lives in the UMD, because the whole purpose of the system svm allocator is to let the user use a CPU address (such as a malloc’ed one) in a GPU program without an extra driver API call, so the registration and mirroring of this large chunk can’t be in the user program. With this, I pictured the sequence below:

At process initialization time, the UMD registers a large chunk (let’s say 1GiB) of memory using userfaultfd. This includes:

  1.  mem = mmap(NULL, 1GiB, MAP_ANON)
  2.  register range [mem, mem + 1GiB] through userfaultfd
  3.  after that, the UMD can wait on page-fault events. When a page fault happens, the UMD calls vm_bind to mirror the [mem, mem+1GiB] range to the GPU

Now, in a user program:
                ptr = malloc(size);
                submit a GPU program which uses ptr

This is what I can picture, and it doesn’t work: ptr can’t belong to the [mem, mem+1GiB] range, so you can’t vm_bind/mirror ptr on demand to the GPU.

Also, the page-fault event in 3) above can’t happen at all. A page fault only happens when the *CPU* accesses mem, but in our case it could be that *only the GPU* touches the memory.

The point is, with the system svm allocator, the user can use *any* valid CPU address in a GPU program. This address can be anywhere in the range [0~2^57-1]. This design requirement is quite simple and clean, and I don’t see how to satisfy it with userfaultfd/on-demand mirroring.
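
For reference, here is the pictured sequence as a compilable user-space
sketch against the real userfaultfd(2) API; the vm_bind step is only a
comment, since that uAPI does not exist. It also shows where the scheme
breaks: malloc() need not return a pointer inside the registered chunk,
and a GPU-only access never raises a CPU fault on this fd:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define CHUNK (1UL << 30)	/* 1GiB, per step 1 above */

int main(void)
{
	long uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	void *mem;

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;

	/* Step 1: reserve the chunk. */
	mem = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Step 2: register [mem, mem + CHUNK) for missing-page faults. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)mem, .len = CHUNK },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	/* Step 3: wait for a fault, then (hypothetically) vm_bind the
	 * chunk to the GPU. But a malloc()'ed ptr lies outside
	 * [mem, mem + CHUNK), and a GPU-only access to mem raises no
	 * CPU fault here, so this read never covers the general case. */
	struct uffd_msg msg;
	read(uffd, &msg, sizeof(msg));

	return 0;
}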

Regards,
Oak

From: Christian König <christian.koenig@amd.com>
Sent: Thursday, February 29, 2024 4:41 AM
To: Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; jglisse@redhat.com
Cc: Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Thomas.Hellstrom@linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Am 28.02.24 um 20:51 schrieb Zeng, Oak:


The mail wasn’t indent/preface correctly. Manually format it.


From: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
Sent: Tuesday, February 27, 2024 1:54 AM
To: Zeng, Oak <oak.zeng@intel.com<mailto:oak.zeng@intel.com>>; Danilo Krummrich <dakr@redhat.com<mailto:dakr@redhat.com>>; Dave Airlie <airlied@redhat.com<mailto:airlied@redhat.com>>; Daniel Vetter <daniel@ffwll.ch<mailto:daniel@ffwll.ch>>; Felix Kuehling <felix.kuehling@amd.com<mailto:felix.kuehling@amd.com>>; jglisse@redhat.com<mailto:jglisse@redhat.com>
Cc: Welty, Brian <brian.welty@intel.com<mailto:brian.welty@intel.com>>; dri-devel@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org>; intel-xe@lists.freedesktop.org<mailto:intel-xe@lists.freedesktop.org>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com<mailto:krishnaiah.bommu@intel.com>>; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com<mailto:himal.prasad.ghimiray@intel.com>>; Thomas.Hellstrom@linux.intel.com<mailto:Thomas.Hellstrom@linux.intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com<mailto:niranjana.vishwanathapura@intel.com>>; Brost, Matthew <matthew.brost@intel.com<mailto:matthew.brost@intel.com>>; Gupta, saurabhg <saurabhg.gupta@intel.com<mailto:saurabhg.gupta@intel.com>>
Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,
Am 23.02.24 um 21:12 schrieb Zeng, Oak:
Hi Christian,

I go back this old email to ask a question.

sorry totally missed that one.



Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher level APIs and not something you need at the UAPI or even inside the low level kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any influence on the design of the kernel UAPI.”

There are two category of SVM:

1.       driver svm allocator: this is implemented in user space,  i.g., cudaMallocManaged (cuda) or zeMemAllocShared (L0) or clSVMAlloc(openCL). Intel already have gem_create/vm_bind in xekmd and our umd implemented clSVMAlloc and zeMemAllocShared on top of gem_create/vm_bind. Range A..B of the process address space is mapped into a range C..D of the GPU address space, exactly as you said.

2.       system svm allocator:  This doesn’t introduce extra driver API for memory allocation. Any valid CPU virtual address can be used directly transparently in a GPU program without any extra driver API call. Quote from kernel Documentation/vm/hmm.rst: “Any application memory region (private anonymous, shared memory, or regular file backed memory) can be used by a device transparently” and “to share the address space by duplicating the CPU page table in the device page table so the same address points to the same physical memory for any valid main memory address in the process address space”. In system svm allocator, we don’t need that A..B C..D mapping.

It looks like you were talking of 1). Were you?

No, even when you fully mirror the whole address space from a process into the GPU you still need to enable this somehow with an IOCTL.

And while enabling this you absolutely should specify to which part of the address space this mirroring applies and where it maps to.


[Zeng, Oak]
Lets say we have a hardware platform where both CPU and GPU support 57bit(use it for example. The statement apply to any address range) virtual address range, how do you decide “which part of the address space this mirroring applies”? You have to mirror the whole address space [0~2^57-1], do you? As you designed it, the gigantic vm_bind/mirroring happens at the process initialization time, and at that time, you don’t know which part of the address space will be used for gpu program. Remember for system allocator, *any* valid CPU address can be used for GPU program.  If you add an offset to [0~2^57-1], you get an address out of 57bit address range. Is this a valid concern?

Well you can perfectly mirror on demand. You just need something similar to userfaultfd() for the GPU. This way you don't need to mirror the full address space, but can rather work with large chunks created on demand, let's say 1GiB or something like that.

The virtual address space is basically just a hardware functionality to route memory accesses. While the mirroring approach is a very common use case for data-centers and high performance computing there are quite a number of different use cases which makes use of virtual address space in a non "standard" fashion. The native context approach for VMs is just one example, databases and emulators are another one.




I see the system svm allocator as just a special case of the driver allocator where not fully backed buffer objects are allocated, but rather sparse one which are filled and migrated on demand.


[Zeng, Oak]
Above statement is true to me. We don’t have BO for system svm allocator. It is a sparse one as we can sparsely map vma to GPU. Our migration policy decide which pages/how much of the vma is migrated/mapped to GPU page table.

The difference b/t your mind and mine is, you want a gigantic vma (created during the gigantic vm_bind) to be sparsely populated to gpu. While I thought vma (xe_vma in xekmd codes) is a place to save memory attributes (such as caching, user preferred placement etc). All those memory attributes are range based, i.e., user can specify range1 is cached while range2 is uncached. So I don’t see how you can manage it with the gigantic vma. Do you split your gigantic vma later to save range based memory attributes?

Yes, exactly that. I mean the splitting and eventually merging of ranges is a standard functionality of the GPUVM code.

So when you need to store additional attributes per range then I would strongly suggest to make use of this splitting and merging functionality as well.

So basically an IOCTL which says range A..B of the GPU address space is mapped to offset X of the CPU address space with parameters Y (caching, migration behavior etc..). That is essentially the same we have for mapping GEM objects, the provider of the backing store is just something different.

Regards,
Christian.



Regards,
Oak


Regards,
Christian.





[-- Attachment #2: Type: text/html, Size: 22772 bytes --]

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-02-29  9:41                       ` Christian König
  2024-02-29 16:05                         ` Zeng, Oak
@ 2024-02-29 17:12                         ` Thomas Hellström
  2024-03-01  7:01                           ` Christian König
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Hellström @ 2024-02-29 17:12 UTC (permalink / raw)
  To: Christian König, Zeng, Oak, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Vishwanathapura, Niranjana, Brost, Matthew, Gupta,
	saurabhg

Hi, Christian.

On Thu, 2024-02-29 at 10:41 +0100, Christian König wrote:
> Am 28.02.24 um 20:51 schrieb Zeng, Oak:
> > 
> > The mail wasn’t indent/preface correctly. Manually format it.
> > 
> > *From:*Christian König <christian.koenig@amd.com>
> > *Sent:* Tuesday, February 27, 2024 1:54 AM
> > *To:* Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich 
> > <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter 
> > <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>; 
> > jglisse@redhat.com
> > *Cc:* Welty, Brian <brian.welty@intel.com>; 
> > dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; 
> > Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal
> > Prasad 
> > <himal.prasad.ghimiray@intel.com>;
> > Thomas.Hellstrom@linux.intel.com; 
> > Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; 
> > Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg 
> > <saurabhg.gupta@intel.com>
> > *Subject:* Re: Making drm_gpuvm work across gpu devices
> > 
> > Hi Oak,
> > 
> > Am 23.02.24 um 21:12 schrieb Zeng, Oak:
> > 
> >     Hi Christian,
> > 
> >     I go back this old email to ask a question.
> > 
> > 
> > sorry totally missed that one.
> > 
> >     Quote from your email:
> > 
> >     “Those ranges can then be used to implement the SVM feature
> >     required for higher level APIs and not something you need at
> > the
> >     UAPI or even inside the low level kernel memory management.”
> > 
> >     “SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This
> >     should not have any influence on the design of the kernel
> > UAPI.”
> > 
> >     There are two category of SVM:
> > 
> >     1.driver svm allocator: this is implemented in user space,
> >  i.g.,
> >     cudaMallocManaged (cuda) or zeMemAllocShared (L0) or
> >     clSVMAlloc(openCL). Intel already have gem_create/vm_bind in
> > xekmd
> >     and our umd implemented clSVMAlloc and zeMemAllocShared on top
> > of
> >     gem_create/vm_bind. Range A..B of the process address space is
> >     mapped into a range C..D of the GPU address space, exactly as
> > you
> >     said.
> > 
> >     2.system svm allocator:  This doesn’t introduce extra driver
> > API
> >     for memory allocation. Any valid CPU virtual address can be
> > used
> >     directly transparently in a GPU program without any extra
> > driver
> >     API call. Quote from kernel Documentation/vm/hmm.rst: “Any
> >     application memory region (private anonymous, shared memory, or
> >     regular file backed memory) can be used by a device
> > transparently”
> >     and “to share the address space by duplicating the CPU page
> > table
> >     in the device page table so the same address points to the same
> >     physical memory for any valid main memory address in the
> > process
> >     address space”. In system svm allocator, we don’t need that
> > A..B
> >     C..D mapping.
> > 
> >     It looks like you were talking of 1). Were you?
> > 
> > 
> > No, even when you fully mirror the whole address space from a
> > process 
> > into the GPU you still need to enable this somehow with an IOCTL.
> > 
> > And while enabling this you absolutely should specify to which part
> > of 
> > the address space this mirroring applies and where it maps to.
> > 
> > */[Zeng, Oak] /*
> > 
> > Lets say we have a hardware platform where both CPU and GPU support
> > 57bit(use it for example. The statement apply to any address range)
> > virtual address range, how do you decide “which part of the address
> > space this mirroring applies”? You have to mirror the whole address
> > space [0~2^57-1], do you? As you designed it, the gigantic 
> > vm_bind/mirroring happens at the process initialization time, and
> > at 
> > that time, you don’t know which part of the address space will be
> > used 
> > for gpu program. Remember for system allocator, *any* valid CPU 
> > address can be used for GPU program.  If you add an offset to 
> > [0~2^57-1], you get an address out of 57bit address range. Is this
> > a 
> > valid concern?
> > 
> 
> Well you can perfectly mirror on demand. You just need something
> similar 
> to userfaultfd() for the GPU. This way you don't need to mirror the
> full 
> address space, but can rather work with large chunks created on
> demand, 
> let's say 1GiB or something like that.


What we're looking at as the current design is an augmented userptr
(A..B -> C..D mapping) which is internally sparsely populated in
chunks. The KMD manages the population using GPU pagefaults. We
acknowledge that some parts of this mirror will not have a valid CPU
mapping. That is, there is no vma, so a GPU page fault that resolves to
such a mirror address will cause an error. Would you have any concerns
/ objections against such an approach?
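
For what it's worth, the population step presumably boils down to the
standard hmm_range_fault() pattern; a simplified sketch (not xekmd code;
the retry on notifier invalidation and the GPU page-table write are
elided):

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Sketch: fault in the CPU pages backing one chunk of the mirror. */
static int populate_chunk(struct mm_struct *mm,
			  struct mmu_interval_notifier *notifier,
			  unsigned long start, unsigned long end,
			  unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= notifier,
		.notifier_seq	= mmu_interval_read_begin(notifier),
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* fails when no vma backs the range */
	mmap_read_unlock(mm);

	/* On success pfns[] describes the CPU pages; the KMD would now
	 * build the GPU page-table entries for this chunk from them. */
	return ret;
}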

Thanks,
Thomas




^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-02-01  8:52                                     ` Christian König
@ 2024-02-29 18:22                                       ` Zeng, Oak
  2024-03-08  4:43                                         ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-02-29 18:22 UTC (permalink / raw)
  To: Christian König, Daniel Vetter, David Airlie
  Cc: Thomas Hellström, Brost, Matthew, Felix Kuehling, Welty,
	Brian, dri-devel, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Gupta, saurabhg, Vishwanathapura, Niranjana, intel-xe,
	Danilo Krummrich, Shah, Ankur N, jglisse, rcampbell, apopple

Hi Christian/Daniel/Dave/Felix/Thomas, and all,

We have been refining our design internally over the past month. Below is our plan. Please let us know if you have any concerns.

1) Remove the pseudo /dev/xe-svm device. All system-allocator interfaces will go through the /dev/dri/render devices; there will be no global interface.

2) Unify the userptr and system-allocator code. We will treat userptr as a special case of the system allocator without migration capability. We will introduce the hmmptr concept for the system allocator and extend the vm_bind API to map a range A..B of the process address space to a range C..D of the GPU address space for an hmmptr (a sketch follows below). For an hmmptr, if a GPU program accesses an address that is not backed by a core-mm vma, it is a fatal error.

3) Multiple device support. We have identified p2p use cases where we might want to leave memory on a foreign device or direct migrations to a foreign device, and therefore might need a global structure that tracks or caches the migration state per process address space. We haven't completely settled this design; we will come back when we have more details.

4) We will first land this code in xekmd, then look at moving some common code to the drm layer so it can also be used by other vendors.
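
A minimal sketch of what the extended vm_bind arguments in 2) could look
like (hypothetical names and layout, for illustration only):

#include <stdint.h>

#define VM_BIND_FLAG_HMMPTR (1u << 0)	/* hypothetical: select hmmptr mode */

/* Map process range A..B to GPU range C..D, populated by hmm on demand. */
struct vm_bind_hmmptr_sketch {
	uint64_t cpu_start;	/* A: start of the process address range */
	uint64_t gpu_start;	/* C: start of the GPU address range */
	uint64_t range;		/* B - A == D - C */
	uint32_t flags;		/* VM_BIND_FLAG_HMMPTR */
};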

Thomas and I still have open questions for Christian. We will follow up.

Thanks all for this discussion.

Regards,
Oak

> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: Thursday, February 1, 2024 3:52 AM
> To: Zeng, Oak <oak.zeng@intel.com>; Daniel Vetter <daniel@ffwll.ch>; David
> Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
> Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal
> Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N
> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
> apopple@nvidia.com
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> Am 31.01.24 um 21:17 schrieb Zeng, Oak:
> > Hi Sima, Dave,
> >
> > I am well aware nouveau driver is not what Nvidia do with their customer. The
> key argument is, can we move forward with the concept shared virtual address
> space b/t CPU and GPU? This is the foundation of HMM. We already have split
> address space support with other driver API. SVM, from its name, it means
> shared address space. Are we allowed to implement another driver model to
> allow SVM work, along with other APIs supporting split address space? Those two
> scheme can co-exist in harmony. We actually have real use cases to use both
> models in one application.
> >
> > Hi Christian, Thomas,
> >
> > In your scheme, GPU VA can != CPU VA. This does introduce some flexibility.
> But this scheme alone doesn't solve the problem of the proxy process/para-
> virtualization. You will still need a second mechanism to partition GPU VA space
> b/t guest process1 and guest process2 because proxy process (or the host
> hypervisor whatever you call it) use one single gpu page table for all the
> guest/client processes. GPU VA for different guest process can't overlap. If this
> second mechanism exist, we of course can use the same mechanism to partition
> CPU VA space between guest processes as well, then we can still use shared VA
> b/t CPU and GPU inside one process, but process1 and process2's address space
> (for both cpu and gpu) doesn't overlap. This second mechanism is the key to
> solve the proxy process problem, not the flexibility you introduced.
> 
> That approach was suggested before, but it doesn't work. First of all
> you create a massive security hole when you give the GPU full access to
> the QEMU CPU process which runs the virtualization.
> 
> So even if you say CPU VA == GPU VA you still need some kind of
> flexibility otherwise you can't implement this use case securely.
> 
> Additional to this the CPU VAs are usually controlled by the OS and not
> some driver, so to make sure that host and guest VAs don't overlap you
> would need to add some kind of sync between the guest and host OS kernels.
> 
> > In practice, your scheme also has a risk of running out of process space
> because you have to partition whole address space b/t processes. Apparently
> allowing each guest process to own the whole process space and using separate
> GPU/CPU page table for different processes is a better solution than using single
> page table and partition process space b/t processes.
> 
> Yeah that you run out of address space is certainly possible. But as I
> said CPUs are switching to 5 levels of page tables and if you look at
> for example a "cat maps | cut -c-4 | sort -u" of a process you will find
> that only a handful of 4GiB segments are actually used and thanks to
> recoverable page faults you can map those between host and client on
> demand. This gives you at least enough address space to handle a couple
> of thousand clients.
> 
> > For Intel GPU, para-virtualization (xenGT, see https://github.com/intel/XenGT-
> Preview-kernel. It is similar idea of the proxy process in Flex's email. They are all
> SW-based GPU virtualization technology) is an old project. It is now replaced with
> HW accelerated SRIOV/system virtualization. XenGT is abandoned long time ago.
> So agreed your scheme add some flexibility. The question is, do we have a valid
> use case to use such flexibility? I don't see a single one ATM.
> 
> Yeah, we have SRIOV functionality on AMD hw as well, but for some use
> cases it's just too inflexible.
> 
> > I also looked into how to implement your scheme. You basically rejected the
> very foundation of hmm design which is shared address space b/t CPU and GPU.
> In your scheme, GPU VA = CPU VA + offset. In every single place where the
> driver needs to call hmm facilities such as hmm_range_fault,
> migrate_vma_setup and in the mmu notifier callback, you need to offset the
> GPU VA to get a CPU VA. From the application writer's perspective, whenever
> he wants to use a CPU pointer in his GPU program, he has to add that
> offset. Do you think this is awkward?
> 
> What? This flexibility is there precisely so that the application writer
> does not have to change any offset.
> 
> > Finally, to implement SVM, we need to implement some memory hint API
> which applies to a virtual address range across all GPU devices. For example, user
> would say, for this virtual address range, I prefer the backing store memory to be
> on GPU deviceX (because user knows deviceX would use this address range
> much more than other GPU devices or CPU). It doesn't make sense to me to
> make such API per device based. For example, if you tell device A that the
> preferred memory location is device B memory, this doesn't sounds correct to
> me because in your scheme, device A is not even aware of the existence of
> device B. right?
> 
> Correct, and while the additional flexibility is somewhat optional, I
> strongly think that not having a centralized approach for device driver
> settings is mandatory.
> 
> Going away from the well-defined file-descriptor-based handling of
> device driver interfaces was one of the worst ideas I've ever seen in
> roughly thirty years of working with Unix-like operating systems. It
> basically broke everything, from reverse lookup handling for mmap() to
> file system privileges for hardware access.
> 
> As far as I can see, anything which goes in the direction of opening
> /dev/kfd or /dev/xe_svm or something similar and saying that this then
> results in implicit SVM for your render nodes is an absolute no-go
> and would require an explicit acknowledgement from Linus on the design
> to do something like that.
> 
> What you can do is to have an IOCTL for the render node file descriptor
> which says this device should do SVM with the current CPU address space
> and another IOCTL which says range A..B is preferred to migrate to this
> device for HMM when the device runs into a page fault.
> 
> And yes that obviously means shitty performance for device drivers,
> because pages play ping-pong if userspace gives contradicting information
> for migrations, but that is how it is supposed to be.
> 
> Everything else which works across the borders of a device driver's scope
> should be implemented as a system call with the relevant review process
> around it.
> 
> Regards,
> Christian.
> 
> >
> > Regards,
> > Oak
> >> -----Original Message-----
> >> From: Daniel Vetter <daniel@ffwll.ch>
> >> Sent: Wednesday, January 31, 2024 4:15 AM
> >> To: David Airlie <airlied@redhat.com>
> >> Cc: Zeng, Oak <oak.zeng@intel.com>; Christian König
> >> <christian.koenig@amd.com>; Thomas Hellström
> >> <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Brost,
> >> Matthew <matthew.brost@intel.com>; Felix Kuehling
> >> <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-
> >> devel@lists.freedesktop.org; Ghimiray, Himal Prasad
> >> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> >> <krishnaiah.bommu@intel.com>; Gupta, saurabhg
> <saurabhg.gupta@intel.com>;
> >> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
> >> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah,
> Ankur N
> >> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
> >> apopple@nvidia.com
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >> On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
> >>> On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
> >>>> Hi Christian,
> >>>>
> >>>>
> >>>>
> >>>> Nvidia Nouveau driver uses exactly the same concept of SVM with HMM,
> >> GPU address in the same process is exactly the same with CPU virtual address.
> It
> >> is already in upstream Linux kernel. We Intel just follow the same direction for
> >> our customers. Why we are not allowed?
> >>>
> >>> Oak, this isn't how upstream works, you don't get to appeal to
> >>> customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
> >>> isn't something NVIDIA would ever suggest for their customers. We also
> >>> likely wouldn't just accept NVIDIA's current solution upstream without
> >>> some serious discussions. The implementation in nouveau was more of a
> >>> sample HMM use case rather than a serious implementation. I suspect if
> >>> we do get down the road of making nouveau an actual compute driver for
> >>> SVM etc then it would have to severely change.
> >> Yeah on the nouveau hmm code specifically my gut feeling impression is
> >> that we didn't really make friends with that among core kernel
> >> maintainers. It's a bit too much just a tech demo to be able to merge the
> >> hmm core apis for nvidia's out-of-tree driver.
> >>
> >> Also, a few years of learning and experience gaining happened meanwhile -
> >> you always have to look at an api design in the context of when it was
> >> designed, and that context changes all the time.
> >>
> >> Cheers, Sima
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-02-29 17:12                         ` Thomas Hellström
@ 2024-03-01  7:01                           ` Christian König
  0 siblings, 0 replies; 123+ messages in thread
From: Christian König @ 2024-03-01  7:01 UTC (permalink / raw)
  To: Thomas Hellström, Zeng, Oak, Danilo Krummrich, Dave Airlie,
	Daniel Vetter, Felix Kuehling, jglisse
  Cc: Welty, Brian, dri-devel, intel-xe, Bommu, Krishnaiah, Ghimiray,
	Himal Prasad, Vishwanathapura, Niranjana, Brost, Matthew, Gupta,
	saurabhg

Hi Thomas,

Am 29.02.24 um 18:12 schrieb Thomas Hellström:
> Hi, Christian.
>
> On Thu, 2024-02-29 at 10:41 +0100, Christian König wrote:
>> Am 28.02.24 um 20:51 schrieb Zeng, Oak:
>>> The mail wasn’t indent/preface correctly. Manually format it.
>>>
>>> *From:*Christian König <christian.koenig@amd.com>
>>> *Sent:* Tuesday, February 27, 2024 1:54 AM
>>> *To:* Zeng, Oak <oak.zeng@intel.com>; Danilo Krummrich
>>> <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
>>> <daniel@ffwll.ch>; Felix Kuehling <felix.kuehling@amd.com>;
>>> jglisse@redhat.com
>>> *Cc:* Welty, Brian <brian.welty@intel.com>;
>>> dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>> Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Ghimiray, Himal
>>> Prasad
>>> <himal.prasad.ghimiray@intel.com>;
>>> Thomas.Hellstrom@linux.intel.com;
>>> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>;
>>> Brost, Matthew <matthew.brost@intel.com>; Gupta, saurabhg
>>> <saurabhg.gupta@intel.com>
>>> *Subject:* Re: Making drm_gpuvm work across gpu devices
>>>
>>> Hi Oak,
>>>
>>> Am 23.02.24 um 21:12 schrieb Zeng, Oak:
>>>
>>>      Hi Christian,
>>>
>>>      I go back this old email to ask a question.
>>>
>>>
>>> sorry totally missed that one.
>>>
>>>      Quote from your email:
>>>
>>>      “Those ranges can then be used to implement the SVM feature
>>>      required for higher level APIs and not something you need at
>>> the
>>>      UAPI or even inside the low level kernel memory management.”
>>>
>>>      “SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This
>>>      should not have any influence on the design of the kernel
>>> UAPI.”
>>>
>>>      There are two category of SVM:
>>>
>>>      1.driver svm allocator: this is implemented in user space,
>>>   i.g.,
>>>      cudaMallocManaged (cuda) or zeMemAllocShared (L0) or
>>>      clSVMAlloc(openCL). Intel already have gem_create/vm_bind in
>>> xekmd
>>>      and our umd implemented clSVMAlloc and zeMemAllocShared on top
>>> of
>>>      gem_create/vm_bind. Range A..B of the process address space is
>>>      mapped into a range C..D of the GPU address space, exactly as
>>> you
>>>      said.
>>>
>>>      2.system svm allocator:  This doesn’t introduce extra driver
>>> API
>>>      for memory allocation. Any valid CPU virtual address can be
>>> used
>>>      directly transparently in a GPU program without any extra
>>> driver
>>>      API call. Quote from kernel Documentation/vm/hmm.rst: “Any
>>>      application memory region (private anonymous, shared memory, or
>>>      regular file backed memory) can be used by a device
>>> transparently”
>>>      and “to share the address space by duplicating the CPU page
>>> table
>>>      in the device page table so the same address points to the same
>>>      physical memory for any valid main memory address in the
>>> process
>>>      address space”. In system svm allocator, we don’t need that
>>> A..B
>>>      C..D mapping.
>>>
>>>      It looks like you were talking of 1). Were you?
>>>
>>>
>>> No, even when you fully mirror the whole address space from a
>>> process
>>> into the GPU you still need to enable this somehow with an IOCTL.
>>>
>>> And while enabling this you absolutely should specify to which part
>>> of
>>> the address space this mirroring applies and where it maps to.
>>>
>>> */[Zeng, Oak] /*
>>>
>>> Lets say we have a hardware platform where both CPU and GPU support
>>> 57bit(use it for example. The statement apply to any address range)
>>> virtual address range, how do you decide “which part of the address
>>> space this mirroring applies”? You have to mirror the whole address
>>> space [0~2^57-1], do you? As you designed it, the gigantic
>>> vm_bind/mirroring happens at the process initialization time, and
>>> at
>>> that time, you don’t know which part of the address space will be
>>> used
>>> for gpu program. Remember for system allocator, *any* valid CPU
>>> address can be used for GPU program.  If you add an offset to
>>> [0~2^57-1], you get an address out of 57bit address range. Is this
>>> a
>>> valid concern?
>>>
>> Well you can perfectly mirror on demand. You just need something
>> similar
>> to userfaultfd() for the GPU. This way you don't need to mirror the
>> full
>> address space, but can rather work with large chunks created on
>> demand,
>> let's say 1GiB or something like that.
>
> What we're looking at as the current design is an augmented userptr
> (A..B -> C..D mapping) which is internally sparsely populated in
> chunks. KMD manages the population using gpu pagefaults. We acknowledge
> that some parts of this mirror will not have a valid CPU mapping. That
> is, no vma so a gpu page-fault that resolves to such a mirror address
> will cause an error. Would you have any concerns / objections against
> such an approach?

Nope, as far as I can see that sounds like a perfectly valid design to me.

Regards,
Christian.

>
> Thanks,
> Thomas
>
>
>


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: Making drm_gpuvm work across gpu devices
  2024-02-29 18:22                                       ` Zeng, Oak
@ 2024-03-08  4:43                                         ` Zeng, Oak
  2024-03-08 10:07                                           ` Christian König
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-03-08  4:43 UTC (permalink / raw)
  To: Zeng, Oak, Christian König, Daniel Vetter, David Airlie
  Cc: Thomas Hellström, Brost, Matthew, Felix Kuehling, Welty,
	Brian, dri-devel, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Gupta, saurabhg, Vishwanathapura, Niranjana, intel-xe,
	Danilo Krummrich, Shah, Ankur N, jglisse, rcampbell, apopple

Hello all,

Since I didn't get a reply to this one, I assume the points below are agreed. But feel free to let us know if you don't agree.

Thanks,
Oak

-----Original Message-----
From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Zeng, Oak
Sent: Thursday, February 29, 2024 1:23 PM
To: Christian König <christian.koenig@amd.com>; Daniel Vetter <daniel@ffwll.ch>; David Airlie <airlied@redhat.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Brost, Matthew <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com; apopple@nvidia.com
Subject: RE: Making drm_gpuvm work across gpu devices

Hi Christian/Daniel/Dave/Felix/Thomas, and all,

We have been refining our design internally over the past month. Below is our plan. Please let us know if you have any concerns.

1) Remove the pseudo /dev/xe-svm device. All system-allocator interfaces will go through the /dev/dri/render devices; there will be no global interface.

2) Unify the userptr and system-allocator code. We will treat userptr as a special case of the system allocator without migration capability. We will introduce the hmmptr concept for the system allocator and extend the vm_bind API to map a range A..B of the process address space to a range C..D of the GPU address space for an hmmptr. For an hmmptr, if a GPU program accesses an address that is not backed by a core-mm vma, it is a fatal error.

3) Multiple device support. We have identified p2p use cases where we might want to leave memory on a foreign device or direct migrations to a foreign device, and therefore might need a global structure that tracks or caches the migration state per process address space. We haven't completely settled this design; we will come back when we have more details.

4) We will first land this code in xekmd, then look at moving some common code to the drm layer so it can also be used by other vendors.

Thomas and I still have open questions for Christian. We will follow up.

Thanks all for this discussion.

Regards,
Oak

> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: Thursday, February 1, 2024 3:52 AM
> To: Zeng, Oak <oak.zeng@intel.com>; Daniel Vetter <daniel@ffwll.ch>; David
> Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
> Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal
> Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>;
> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N
> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
> apopple@nvidia.com
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> Am 31.01.24 um 21:17 schrieb Zeng, Oak:
> > Hi Sima, Dave,
> >
> > I am well aware nouveau driver is not what Nvidia do with their customer. The
> key argument is, can we move forward with the concept shared virtual address
> space b/t CPU and GPU? This is the foundation of HMM. We already have split
> address space support with other driver API. SVM, from its name, it means
> shared address space. Are we allowed to implement another driver model to
> allow SVM work, along with other APIs supporting split address space? Those two
> scheme can co-exist in harmony. We actually have real use cases to use both
> models in one application.
> >
> > Hi Christian, Thomas,
> >
> > In your scheme, GPU VA can != GPU VA. This does introduce some flexibility.
> But this scheme alone doesn't solve the problem of the proxy process/para-
> virtualization. You will still need a second mechanism to partition GPU VA space
> b/t guest process1 and guest process2 because proxy process (or the host
> hypervisor whatever you call it) use one single gpu page table for all the
> guest/client processes. GPU VA for different guest process can't overlap. If this
> second mechanism exist, we of course can use the same mechanism to partition
> CPU VA space between guest processes as well, then we can still use shared VA
> b/t CPU and GPU inside one process, but process1 and process2's address space
> (for both cpu and gpu) doesn't overlap. This second mechanism is the key to
> solve the proxy process problem, not the flexibility you introduced.
> 
> That approach was suggested before, but it doesn't work. First of all
> you create a massive security hole when you give the GPU full access to
> the QEMU CPU process which runs the virtualization.
> 
> So even if you say CPU VA == GPU VA you still need some kind of
> flexibility otherwise you can't implement this use case securely.
> 
> Additional to this the CPU VAs are usually controlled by the OS and not
> some driver, so to make sure that host and guest VAs don't overlap you
> would need to add some kind of sync between the guest and host OS kernels.
> 
> > In practice, your scheme also have a risk of running out of process space
> because you have to partition whole address space b/t processes. Apparently
> allowing each guest process to own the whole process space and using separate
> GPU/CPU page table for different processes is a better solution than using single
> page table and partition process space b/t processes.
> 
> Yeah that you run out of address space is certainly possible. But as I
> said CPUs are switching to 5 level of pages tables and if you look at
> for example a "cat maps | cut -c-4 | sort -u" of process you will find
> that only a handful of 4GiB segments are actually used and thanks to
> recoverable page faults you can map those between host and client on
> demand. This gives you at least enough address space to handle a couple
> of thousand clients.
> 
> > For Intel GPU, para-virtualization (xenGT, see https://github.com/intel/XenGT-
> Preview-kernel. It is similar idea of the proxy process in Flex's email. They are all
> SW-based GPU virtualization technology) is an old project. It is now replaced with
> HW accelerated SRIOV/system virtualization. XenGT is abandoned long time ago.
> So agreed your scheme add some flexibility. The question is, do we have a valid
> use case to use such flexibility? I don't see a single one ATM.
> 
> Yeah, we have SRIOV functionality on AMD hw as well, but for some use
> cases it's just to inflexible.
> 
> > I also pictured into how to implement your scheme. You basically rejected the
> very foundation of hmm design which is shared address space b/t CPU and GPU.
> In your scheme, GPU VA = CPU VA + offset. In every single place where driver
> need to call hmm facilities such as hmm_range_fault, migrate_vma_setup and in
> mmu notifier call back, you need to offset the GPU VA to get a CPU VA. From
> application writer's perspective, whenever he want to use a CPU pointer in his
> GPU program, he add to add that offset. Do you think this is awkward?
> 
> What? This flexibility is there to prevent the application writer to
> change any offset.
> 
> > Finally, to implement SVM, we need to implement some memory hint API
> which applies to a virtual address range across all GPU devices. For example, user
> would say, for this virtual address range, I prefer the backing store memory to be
> on GPU deviceX (because user knows deviceX would use this address range
> much more than other GPU devices or CPU). It doesn't make sense to me to
> make such API per device based. For example, if you tell device A that the
> preferred memory location is device B memory, this doesn't sounds correct to
> me because in your scheme, device A is not even aware of the existence of
> device B. right?
> 
> Correct and while the additional flexibility is somewhat option I
> strongly think that not having a centralized approach for device driver
> settings is mandatory.
> 
> Going away from the well defined file descriptor based handling of
> device driver interfaces was one of the worst ideas I've ever seen in
> roughly thirty years of working with Unixiode operating systems. It
> basically broke everything, from reverse lockup handling for mmap() to
> file system privileges for hardware access.
> 
> As far as I can see anything which goes into the direction of opening
> /dev/kfd or /dev/xe_svm or something similar and saying that this then
> results into implicit SVM for your render nodes is an absolutely no-go
> and would required and explicit acknowledgement from Linus on the design
> to do something like that.
> 
> What you can do is to have an IOCTL for the render node file descriptor
> which says this device should do SVM with the current CPU address space
> and another IOCTL which says range A..B is preferred to migrate to this
> device for HMM when the device runs into a page fault.
> 
> And yes that obviously means shitty performance for device drivers
> because page play ping/pong if userspace gives contradicting information
> for migrations, but that is something supposed to be.
> 
> Everything else which works over the boarders of a device driver scope
> should be implemented as system call with the relevant review process
> around it.
> 
> Regards,
> Christian.
> 
> >
> > Regards,
> > Oak
> >> -----Original Message-----
> >> From: Daniel Vetter <daniel@ffwll.ch>
> >> Sent: Wednesday, January 31, 2024 4:15 AM
> >> To: David Airlie <airlied@redhat.com>
> >> Cc: Zeng, Oak <oak.zeng@intel.com>; Christian König
> >> <christian.koenig@amd.com>; Thomas Hellström
> >> <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Brost,
> >> Matthew <matthew.brost@intel.com>; Felix Kuehling
> >> <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-
> >> devel@lists.freedesktop.org; Ghimiray, Himal Prasad
> >> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> >> <krishnaiah.bommu@intel.com>; Gupta, saurabhg
> <saurabhg.gupta@intel.com>;
> >> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
> >> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah,
> Ankur N
> >> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
> >> apopple@nvidia.com
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >> On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
> >>> On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
> >>>> Hi Christian,
> >>>>
> >>>>
> >>>>
> >>>> Nvidia Nouveau driver uses exactly the same concept of SVM with HMM,
> >> GPU address in the same process is exactly the same with CPU virtual address.
> It
> >> is already in upstream Linux kernel. We Intel just follow the same direction for
> >> our customers. Why we are not allowed?
> >>>
> >>> Oak, this isn't how upstream works, you don't get to appeal to
> >>> customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
> >>> isn't something NVIDIA would ever suggest for their customers. We also
> >>> likely wouldn't just accept NVIDIA's current solution upstream without
> >>> some serious discussions. The implementation in nouveau was more of a
> >>> sample HMM use case rather than a serious implementation. I suspect if
> >>> we do get down the road of making nouveau an actual compute driver for
> >>> SVM etc then it would have to severely change.
> >> Yeah on the nouveau hmm code specifically my gut feeling impression is
> >> that we didn't really make friends with that among core kernel
> >> maintainers. It's a bit too much just a tech demo to be able to merge the
> >> hmm core apis for nvidia's out-of-tree driver.
> >>
> >> Also, a few years of learning and experience gaining happened meanwhile -
> >> you always have to look at an api design in the context of when it was
> >> designed, and that context changes all the time.
> >>
> >> Cheers, Sima
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: Making drm_gpuvm work across gpu devices
  2024-03-08  4:43                                         ` Zeng, Oak
@ 2024-03-08 10:07                                           ` Christian König
  0 siblings, 0 replies; 123+ messages in thread
From: Christian König @ 2024-03-08 10:07 UTC (permalink / raw)
  To: Zeng, Oak, Daniel Vetter, David Airlie
  Cc: Thomas Hellström, Brost, Matthew, Felix Kuehling, Welty,
	Brian, dri-devel, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Gupta, saurabhg, Vishwanathapura, Niranjana, intel-xe,
	Danilo Krummrich, Shah, Ankur N, jglisse, rcampbell, apopple

Hi Oak,

sorry, the mail sounded like you didn't expect a reply.

And yes, the approaches outlined in the mail sound really good to me.

Regards,
Christian.

Am 08.03.24 um 05:43 schrieb Zeng, Oak:
> Hello all,
>
> Since I didn't get a reply for this one, I assume below are agreed. But feel free to let us know if you don't agree.
>
> Thanks,
> Oak
>
> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Zeng, Oak
> Sent: Thursday, February 29, 2024 1:23 PM
> To: Christian König <christian.koenig@amd.com>; Daniel Vetter <daniel@ffwll.ch>; David Airlie <airlied@redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Brost, Matthew <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com; apopple@nvidia.com
> Subject: RE: Making drm_gpuvm work across gpu devices
>
> Hi Christian/Daniel/Dave/Felix/Thomas, and all,
>
> We have been refining our design internally in the past month. Below is our plan. Please let us know if you have any concern.
>
> 1) Remove pseudo /dev/xe-svm device. All system allocator interfaces will be through /dev/dri/render devices. Not global interface.
>
> 2) Unify userptr and system allocator codes. We will treat userptr as a speciality of system allocator without migration capability. We will introduce the hmmptr concept for system allocator. We will extend vm_bind API to map a range A..B of process address space to a range C..D of GPU address space for hmmptr. For hmmptr, if gpu program accesses an address which is not backed by core mm vma, it is a fatal error.
>
> 3) Multiple device support. We have identified p2p use-cases where we might want to leave memory on a foreign device or direct migrations to a foreign device and therefore might need a global structure that tracks or caches the migration state per process address space. We didn't completely settle down this design. We will come back when we have more details.
>
> 4)We will first work on this code on xekmd then look to move some common codes to drm layer so it can also be used by other vendors.
>
> Thomas and me still have open questions to Christian. We will follow up.
>
> Thanks all for this discussion.
>
> Regards,
> Oak
>
>> -----Original Message-----
>> From: Christian König <christian.koenig@amd.com>
>> Sent: Thursday, February 1, 2024 3:52 AM
>> To: Zeng, Oak <oak.zeng@intel.com>; Daniel Vetter <daniel@ffwll.ch>; David
>> Airlie <airlied@redhat.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Brost, Matthew
>> <matthew.brost@intel.com>; Felix Kuehling <felix.kuehling@amd.com>; Welty,
>> Brian <brian.welty@intel.com>; dri-devel@lists.freedesktop.org; Ghimiray, Himal
>> Prasad <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
>> <krishnaiah.bommu@intel.com>; Gupta, saurabhg <saurabhg.gupta@intel.com>;
>> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
>> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah, Ankur N
>> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
>> apopple@nvidia.com
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>> Hi Oak,
>>
>> Am 31.01.24 um 21:17 schrieb Zeng, Oak:
>>> Hi Sima, Dave,
>>>
>>> I am well aware nouveau driver is not what Nvidia do with their customer. The
>> key argument is, can we move forward with the concept shared virtual address
>> space b/t CPU and GPU? This is the foundation of HMM. We already have split
>> address space support with other driver API. SVM, from its name, it means
>> shared address space. Are we allowed to implement another driver model to
>> allow SVM work, along with other APIs supporting split address space? Those two
>> scheme can co-exist in harmony. We actually have real use cases to use both
>> models in one application.
>>> Hi Christian, Thomas,
>>>
>>> In your scheme, GPU VA can != GPU VA. This does introduce some flexibility.
>> But this scheme alone doesn't solve the problem of the proxy process/para-
>> virtualization. You will still need a second mechanism to partition GPU VA space
>> b/t guest process1 and guest process2 because proxy process (or the host
>> hypervisor whatever you call it) use one single gpu page table for all the
>> guest/client processes. GPU VA for different guest process can't overlap. If this
>> second mechanism exist, we of course can use the same mechanism to partition
>> CPU VA space between guest processes as well, then we can still use shared VA
>> b/t CPU and GPU inside one process, but process1 and process2's address space
>> (for both cpu and gpu) doesn't overlap. This second mechanism is the key to
>> solve the proxy process problem, not the flexibility you introduced.
>>
>> That approach was suggested before, but it doesn't work. First of all
>> you create a massive security hole when you give the GPU full access to
>> the QEMU CPU process which runs the virtualization.
>>
>> So even if you say CPU VA == GPU VA you still need some kind of
>> flexibility otherwise you can't implement this use case securely.
>>
>> In addition to this, the CPU VAs are usually controlled by the OS and not
>> some driver, so to make sure that host and guest VAs don't overlap you
>> would need to add some kind of sync between the guest and host OS kernels.
>>
>>> In practice, your scheme also has a risk of running out of process space
>> because you have to partition the whole address space b/t processes. Apparently
>> allowing each guest process to own the whole process space and using separate
>> GPU/CPU page tables for different processes is a better solution than using a single
>> page table and partitioning the process space b/t processes.
>>
>> Yeah, that you run out of address space is certainly possible. But as I
>> said, CPUs are switching to 5 levels of page tables, and if you look at
>> for example a "cat maps | cut -c-4 | sort -u" of a process you will find
>> that only a handful of 4GiB segments are actually used, and thanks to
>> recoverable page faults you can map those between host and client on
>> demand. This gives you at least enough address space to handle a couple
>> of thousand clients.
>>
>>> For Intel GPU, para-virtualization (xenGT, see https://github.com/intel/XenGT-
>> Preview-kernel. It is a similar idea to the proxy process in Felix's email. They are all
>> SW-based GPU virtualization technologies) is an old project. It is now replaced with
>> HW-accelerated SRIOV/system virtualization. XenGT was abandoned a long time ago.
>> So agreed, your scheme adds some flexibility. The question is, do we have a valid
>> use case to use such flexibility? I don't see a single one ATM.
>>
>> Yeah, we have SRIOV functionality on AMD hw as well, but for some use
>> cases it's just too inflexible.
>>
>>> I also looked into how to implement your scheme. You basically rejected the
>> very foundation of the hmm design, which is a shared address space b/t CPU and GPU.
>> In your scheme, GPU VA = CPU VA + offset. In every single place where the driver
>> needs to call hmm facilities such as hmm_range_fault, migrate_vma_setup, and in
>> the mmu notifier callback, you need to offset the GPU VA to get a CPU VA. From an
>> application writer's perspective, whenever they want to use a CPU pointer in their
>> GPU program, they have to add that offset. Do you think this is awkward?
>>
>> What? This flexibility is there to prevent the application writer from
>> having to apply any offset.
>>
>>> Finally, to implement SVM, we need to implement some memory hint API
>> which applies to a virtual address range across all GPU devices. For example, a user
>> would say: for this virtual address range, I prefer the backing store memory to be
>> on GPU deviceX (because the user knows deviceX would use this address range
>> much more than other GPU devices or the CPU). It doesn't make sense to me to
>> make such an API per-device. For example, if you tell device A that the
>> preferred memory location is device B's memory, this doesn't sound correct to
>> me because in your scheme, device A is not even aware of the existence of
>> device B, right?
>>
>> Correct, and while the additional flexibility is somewhat optional, I
>> strongly think that not having a centralized approach for device driver
>> settings is mandatory.
>>
>> Going away from the well-defined file descriptor based handling of
>> device driver interfaces was one of the worst ideas I've ever seen in
>> roughly thirty years of working with Unix-like operating systems. It
>> basically broke everything, from reverse lookup handling for mmap() to
>> file system privileges for hardware access.
>>
>> As far as I can see, anything which goes in the direction of opening
>> /dev/kfd or /dev/xe_svm or something similar and saying that this then
>> results in implicit SVM for your render nodes is an absolute no-go
>> and would require an explicit acknowledgement from Linus on the design
>> to do something like that.
>>
>> What you can do is to have an IOCTL for the render node file descriptor
>> which says this device should do SVM with the current CPU address space
>> and another IOCTL which says range A..B is preferred to migrate to this
>> device for HMM when the device runs into a page fault.
>>
>> And yes, that obviously means shitty performance for device drivers
>> because pages play ping-pong if userspace gives contradicting information
>> for migrations, but that is how it is supposed to be.
>>
>> Everything else which works across the borders of a device driver's scope
>> should be implemented as a system call with the relevant review process
>> around it.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Oak
>>>> -----Original Message-----
>>>> From: Daniel Vetter <daniel@ffwll.ch>
>>>> Sent: Wednesday, January 31, 2024 4:15 AM
>>>> To: David Airlie <airlied@redhat.com>
>>>> Cc: Zeng, Oak <oak.zeng@intel.com>; Christian König
>>>> <christian.koenig@amd.com>; Thomas Hellström
>>>> <thomas.hellstrom@linux.intel.com>; Daniel Vetter <daniel@ffwll.ch>; Brost,
>>>> Matthew <matthew.brost@intel.com>; Felix Kuehling
>>>> <felix.kuehling@amd.com>; Welty, Brian <brian.welty@intel.com>; dri-
>>>> devel@lists.freedesktop.org; Ghimiray, Himal Prasad
>>>> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
>>>> <krishnaiah.bommu@intel.com>; Gupta, saurabhg
>> <saurabhg.gupta@intel.com>;
>>>> Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; intel-
>>>> xe@lists.freedesktop.org; Danilo Krummrich <dakr@redhat.com>; Shah,
>> Ankur N
>>>> <ankur.n.shah@intel.com>; jglisse@redhat.com; rcampbell@nvidia.com;
>>>> apopple@nvidia.com
>>>> Subject: Re: Making drm_gpuvm work across gpu devices
>>>>
>>>> On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
>>>>> On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng@intel.com> wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>>
>>>>>>
>>>>>> The Nvidia Nouveau driver uses exactly the same concept of SVM with HMM:
>>>> the GPU address in the same process is exactly the same as the CPU virtual
>>>> address. It is already in the upstream Linux kernel. We at Intel just follow
>>>> the same direction for our customers. Why are we not allowed to?
>>>>> Oak, this isn't how upstream works, you don't get to appeal to
>>>>> customers or internal design. nouveau isn't "NVIDIA"'s and it certainly
>>>>> isn't something NVIDIA would ever suggest for their customers. We also
>>>>> likely wouldn't just accept NVIDIA's current solution upstream without
>>>>> some serious discussions. The implementation in nouveau was more of a
>>>>> sample HMM use case rather than a serious implementation. I suspect if
>>>>> we do get down the road of making nouveau an actual compute driver for
>>>>> SVM etc then it would have to severely change.
>>>> Yeah on the nouveau hmm code specifically my gut feeling impression is
>>>> that we didn't really make friends with that among core kernel
>>>> maintainers. It's a bit too much just a tech demo to be able to merge the
>>>> hmm core apis for nvidia's out-of-tree driver.
>>>>
>>>> Also, a few years of learning and experience gaining happened meanwhile -
>>>> you always have to look at an api design in the context of when it was
>>>> designed, and that context changes all the time.
>>>>
>>>> Cheers, Sima
>>>> --
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-01-17 22:12 ` [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range Oak Zeng
@ 2024-04-05  0:39   ` Jason Gunthorpe
  2024-04-05  3:33     ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-05  0:39 UTC (permalink / raw)
  To: Oak Zeng
  Cc: dri-devel, intel-xe, matthew.brost, Thomas.Hellstrom,
	brian.welty, himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura, Leon Romanovsky

On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> +/**
> + * xe_svm_build_sg() - build a scatter gather table for all the physical pages/pfn
> + * in a hmm_range.
> + *
> + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> + * has the pfn numbers of pages that back up this hmm address range.
> + * @st: pointer to the sg table.
> + *
> + * All the contiguous pfns will be collapsed into one entry in
> + * the scatter gather table. This is for the convenience of
> + * later on operations to bind address range to GPU page table.
> + *
> + * This function allocates the storage of the sg table. It is
> + * caller's responsibility to free it calling sg_free_table.
> + *
> + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> + */
> +int xe_svm_build_sg(struct hmm_range *range,
> +			     struct sg_table *st)
> +{
> +	struct scatterlist *sg;
> +	u64 i, npages;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> +
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		unsigned long addr = range->hmm_pfns[i];
> +
> +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg =  sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}

I didn't look at this series a lot but I wanted to make a few
remarks.. This I don't like quite a lot. Yes, the DMA API interaction
with hmm_range_fault is pretty bad, but it should not be hacked
around like this. Leon is working on a series to improve it:

https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/

Please participate there too. In the meantime you should just call
dma_map_page for every single page like ODP does.
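
For illustration, a page-by-page mapping in the spirit of what ODP does might
look roughly like the sketch below (map_hmm_range() is a made-up name; unmap
and error unwinding are left to the caller):

	/* Sketch: map the result of hmm_range_fault() one page at a time
	 * instead of collapsing it into an sg table. */
	static int map_hmm_range(struct device *dev, struct hmm_range *range,
				 dma_addr_t *dma_addrs)
	{
		unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
		unsigned long i;

		for (i = 0; i < npages; i++) {
			struct page *page;

			if (!(range->hmm_pfns[i] & HMM_PFN_VALID))
				continue;
			page = hmm_pfn_to_page(range->hmm_pfns[i]);
			dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
						    DMA_BIDIRECTIONAL);
			if (dma_mapping_error(dev, dma_addrs[i]))
				return -EIO; /* caller unmaps what was mapped */
		}
		return 0;
	}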

Also, I tried to follow the large discussion in the end but it was
quite hard to read the text in Lore for some reason.

I would just opine some general points on how I see hmm_range_fault
being used by drivers.

First of all, the device should have a private page table. At least
one, but ideally many. Obviously it should work, so I found it a bit
puzzling the talk about problems with virtualization. Either the
private page table works virtualized, including faults, or it should
not be available..

Second, I see hmm_range_fault as having two main design patterns of
interaction. Either it is the entire exclusive owner of a single
device private page table and fully mirrors the mm page table into the
device table.

Or it is a selective mirror where it copies part of the mm page table
into a "vma" of a possibly shared device page table. The
hmm_range_fault bit would exclusively own its bit of VMA.

So I find it quite strange that this RFC is creating VMAs,
notifiers and ranges on the fly. That should happen when userspace
indicates it wants some/all of the MM to be bound to a specific device
private pagetable/address space.

I also agree with the general spirit of the remarks that there should
not be a process binding or any kind of "global" character
device. Device private page tables are by their very nature private to
the device and should be created through a device specific char
dev. If you have a VMA concept for these page tables then a HMM
mirroring one is simply a different type of VMA along with all the
others.

I was also looking at the mmu notifier register/unregister with some
suspicion, it seems wrong to have a patch talking about "process
exit". Notifiers should be destroyed when the device private page
table is destroyed, or the VMA is destroyed. Attempting to connect it
to a process beyond tying the lifetime of a page table to a FD is
nonsensical.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-05  0:39   ` Jason Gunthorpe
@ 2024-04-05  3:33     ` Zeng, Oak
  2024-04-05 12:37       ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-04-05  3:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 4, 2024 8:39 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> > +/**
> > + * xe_svm_build_sg() - build a scatter gather table for all the physical
> pages/pfn
> > + * in a hmm_range.
> > + *
> > + * @range: the hmm range that we build the sg table from. range-
> >hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +int xe_svm_build_sg(struct hmm_range *range,
> > +			     struct sg_table *st)
> > +{
> > +	struct scatterlist *sg;
> > +	u64 i, npages;
> > +
> > +	sg = NULL;
> > +	st->nents = 0;
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> > +
> > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		unsigned long addr = range->hmm_pfns[i];
> > +
> > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > +			sg->length += PAGE_SIZE;
> > +			sg_dma_len(sg) += PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		sg =  sg ? sg_next(sg) : st->sgl;
> > +		sg_dma_address(sg) = addr;
> > +		sg_dma_len(sg) = PAGE_SIZE;
> > +		sg->length = PAGE_SIZE;
> > +		st->nents++;
> > +	}
> > +
> > +	sg_mark_end(sg);
> > +	return 0;
> > +}
> 
> I didn't look at this series a lot but I wanted to make a few
> remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> with hmm_range_fault is pretty bad, but it should not be hacked
> around like this. Leon is working on a series to improve it:
> 
> https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/


I completely agree the above code is really ugly. We definitely want to adopt a better way. I will spend some time on the above series.

> 
> Please participate there too. In the mean time you should just call
> dma_map_page for every single page like ODP does.

The above code deals with a case where a dma map is not needed. As I understand it, whether we need a dma map depends on the device topology. For example, when a device accesses host memory or another device's memory through pcie, we need dma mapping; if the connection b/t devices is xelink (similar to nvidia's nvlink), all devices' memory can be in the same address space, so no dma mapping is needed.


> 
> Also, I tried to follow the large discussion in the end but it was
> quite hard to read the text in Lore for some reason.

Did you mean this discussion: https://lore.kernel.org/dri-devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works well for me in the Chrome browser.


> 
> I would just opine some general points on how I see hmm_range_fault
> being used by drivers.
> 
> First of all, the device should have a private page table. At least
> one, but ideally many. Obviously it should work, so I found it a bit
> puzzling the talk about problems with virtualization. Either the
> private page table works virtualized, including faults, or it should
> not be available..

To be very honest, I was also very confused. In this series, I had one very fundamental assumption: that with hmm, any valid cpu virtual address is also a valid gpu virtual address. But Christian had a very strong opinion that the gpu va can have an offset from the cpu va. He mentioned a failed use case with amdkfd and claimed an offset can solve their problem.

For all our known use cases, gpu va == cpu va. But we have agreed to make the uAPI flexible so we can introduce an offset if a use case comes up in the future.

> 
> Second, I see hmm_range_fault as having two main design patterns
> interactions. Either it is the entire exclusive owner of a single
> device private page table and fully mirrors the mm page table into the
> device table.
> 
> Or it is a selective mirror where it copies part of the mm page table
> into a "vma" of a possibly shared device page table. The
> hmm_range_fault bit would exclusively own its bit of VMA.

Can you explain what the "hmm_range_fault bit" is?


The whole page table (process space) is mirrored. But we don't have to copy the whole CPU page table to the device page table. We only need to copy the page table entries of an address range which is accessed by the GPU. For those address ranges which are not accessed by the GPU, there is no need to set up the GPU page table.

> 
> So I find it quite strange that this RFC is creating VMAs,
> notifiers and ranges on the fly. That should happen when userspace
> indicates it wants some/all of the MM to be bound to a specific device
> private pagetable/address space.

We register notifiers on the fly because the GPU doesn't access all the valid CPU virtual addresses. The GPU only accesses a subset of valid CPU addresses.

Do you think registering a huge mmu notifier to cover the whole address space would be good? I am not sure, but there would be many more unnecessary callbacks from the mmu to the device driver.

Similarly, we create ranges on the fly only for those ranges that are accessed by the gpu. But we have some ideas about keeping one gigantic range/VMA representing the whole address space while creating only some "gpu page table state ranges" on the fly. This idea requires some refactoring of our xe driver and we will evaluate it more to decide whether we want to go this way.


> 
> I also agree with the general spirit of the remarks that there should
> not be a process binding or any kind of "global" character
> device. 

Even though a global pseudo device looks bad, it does come with some benefits. For example, if you want to set memory attributes on a shared virtual address range b/t all devices, you can set such attributes through an ioctl of the global device. We have agreed to remove our global character device, and we will repeat the memory attribute setting on all devices one by one.

Is /dev/nvidia-uvm a global character device for uvm purposes?

> Device private page tables are by their very nature private to
> the device and should be created through a device specific char
> dev. If you have a VMA concept for these page tables then a HMM
> mirroring one is simply a different type of VMA along with all the
> others.
> 
> I was also looking at the mmu notifier register/unregister with some
> suspicion, it seems wrong to have a patch talking about "process
> exit". Notifiers should be destroyed when the device private page
> table is destroyed, or the VMA is destroyed. 

Right. I have dropped the concept of process exit. I will soon send out the new series for review.

Oak

> Attempting to connect it
> to a process beyond tying the lifetime of a page table to a FD is
> nonsensical.
> 
> Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-05  3:33     ` Zeng, Oak
@ 2024-04-05 12:37       ` Jason Gunthorpe
  2024-04-05 16:42         ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-05 12:37 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Fri, Apr 05, 2024 at 03:33:10AM +0000, Zeng, Oak wrote:
> > 
> > I didn't look at this series a lot but I wanted to make a few
> > remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> > with hmm_range_fault is pretty bad, but it should not be hacked
> > around like this. Leon is working on a series to improve it:
> > 
> > https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> 
> 
> I completely agree above codes are really ugly. We definitely want
> to adapt to a better way. I will spend some time on above series.
> 
> > 
> > Please participate there too. In the mean time you should just call
> > dma_map_page for every single page like ODP does.
> 
> Above codes deal with a case where dma map is not needed. As I
> understand it, whether we need a dma map depends on the devices
> topology. For example, when device access host memory or another
> device's memory through pcie, we need dma mapping; if the connection
> b/t devices is xelink (similar to nvidia's nvlink), all device's
> memory can be in same address space, so no dma mapping is needed.

Then you call dma_map_page to do your DMA side and you avoid it for
the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
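
Concretely, the per-page decision could be a sketch like this, where
xe_page_to_dpa() is a hypothetical helper that turns a driver-owned
DEVICE_PRIVATE page into a local device physical address:

	/* Sketch: per-page address selection. System pages get dma mapped;
	 * DEVICE_PRIVATE pages owned by this device are translated locally
	 * and never go through the DMA API. */
	static u64 pfn_to_device_addr(struct device *dev, unsigned long hmm_pfn)
	{
		struct page *page = hmm_pfn_to_page(hmm_pfn);

		if (is_device_private_page(page))
			return xe_page_to_dpa(page);	/* hypothetical helper */

		return dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	}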

> > Also, I tried to follow the large discussion in the end but it was
> > quite hard to read the text in Lore for some reason.
> 
> Did you mean this discussion: https://lore.kernel.org/dri-devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works good for me with chrome browser.

That is the one I am referring to

> > I would just opine some general points on how I see hmm_range_fault
> > being used by drivers.
> > 
> > First of all, the device should have a private page table. At least
> > one, but ideally many. Obviously it should work, so I found it a bit
> > puzzling the talk about problems with virtualization. Either the
> > private page table works virtualized, including faults, or it should
> > not be available..
>
> To be very honest, I was also very confused. In this series, I had
> one very fundamental assumption that with hmm any valid cpu virtual
> address is also a valid gpu virtual address. But Christian had a
> very strong opinion that the gpu va can have an offset to cpu va. He
> mentioned a failed use case with amdkfd and claimed an offset can
> solve their problem.

Offset is something different, I said the VM's view of the page table
should fully work. You shouldn't get into a weird situation where the
VM is populating the page table and can't handle faults or something.

If the VMM has a weird design where there is only one page table and
it needs to partition space based on slicing it into regions then
fine, but the delegated region to the guest OS should still work
fully.

> > Or it is a selective mirror where it copies part of the mm page table
> > into a "vma" of a possibly shared device page table. The
> > > hmm_range_fault bit would exclusively own its bit of VMA.
> 
> Can you explain what is "hmm_range_fault bit"?

I mean if you setup a mirror VMA in a device private page table then that
range of VA will be owned by the hmm_range_fault code and will mirror
a subset of a mm into that VMA. This is the offset you mention
above. The MM's VA and the device private page table VA do not have to
be the same (eg we implement this option in RDMA's ODP)

A 1:1 SVA mapping is a special case of this where there is a single
GPU VMA that spans the entire process address space with a 1:1 VA (no
offset).
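
In code form the translation is just a constant offset per mirror VMA, with
1:1 SVA as the degenerate case (mirror_vma and its fields are hypothetical
bookkeeping, not an existing structure):

	/* Sketch: CPU VA for a faulting device VA inside a mirror VMA. */
	static unsigned long device_va_to_cpu_va(struct mirror_vma *gvma,
						 unsigned long device_va)
	{
		/* 1:1 SVA is simply gvma->cpu_start == gvma->device_start */
		return device_va - gvma->device_start + gvma->cpu_start;
	}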

> Do you think register a huge mmu notifier to cover the whole address
> space would be good? I don't know here but there would be much more
> unnecessary callbacks from mmu to device driver.

Yes. IMHO you should try to optimize the invalidations away in driver
logic not through dynamic mmu notifiers. Installing and removing a
notifier is very expensive.

> Similarly, we create range only the fly for those range that is
> accessed by gpu. But we have some idea to keep one gigantic
> range/VMA representing the whole address space while creating only
> some "gpu page table state range" on the fly. This idea requires
> some refactor to our xe driver and we will evaluate it more to
> decide whether we want to go this way.

This is a better direction.
 
> > I also agree with the general spirit of the remarks that there should
> > not be a process binding or any kind of "global" character
> > device. 
> 
> Even though a global pseudo device looks bad, it does come with some
> benefit. For example, if you want to set a memory attributes to a
> shared virtual address range b/t all devices, you can set such
> attributes through a ioctl of the global device. We have agreed to
> remove our global character device and we will repeat the memory
> attributes setting on all devices one by one.

That implies you have a global shared device private page table which
is sort of impossible because of how the DMA API works.

Having the kernel iterate over all the private page tables vs having
the userspace iterate over all the private page tables doesn't seem
like a worthwhile difference to justify a global cdev.

> Is /dev/nvidia-uvm a global character device for uvm purpose?

No idea, I wouldn't assume anything the nvidia drivers do is aligned
with what upstream expects.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-05 12:37       ` Jason Gunthorpe
@ 2024-04-05 16:42         ` Zeng, Oak
  2024-04-05 18:02           ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-04-05 16:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky



> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Jason
> Gunthorpe
> Sent: Friday, April 5, 2024 8:37 AM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Fri, Apr 05, 2024 at 03:33:10AM +0000, Zeng, Oak wrote:
> > >
> > > I didn't look at this series a lot but I wanted to make a few
> > > remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> > > with hmm_range_fault is pretty bad, but it should not be hacked
> > > around like this. Leon is working on a series to improve it:
> > >
> > > https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> >
> >
> > I completely agree above codes are really ugly. We definitely want
> > to adapt to a better way. I will spend some time on above series.
> >
> > >
> > > Please participate there too. In the mean time you should just call
> > > dma_map_page for every single page like ODP does.
> >
> > Above codes deal with a case where dma map is not needed. As I
> > understand it, whether we need a dma map depends on the devices
> > topology. For example, when device access host memory or another
> > device's memory through pcie, we need dma mapping; if the connection
> > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > memory can be in same address space, so no dma mapping is needed.
> 
> Then you call dma_map_page to do your DMA side and you avoid it for
> the DEVICE_PRIVATE side. SG list doesn't help this anyhow.

When a dma map is needed, we use dma_map_sgtable, a different flavor of the dma_map_page function.

The reason we also used (mis-used) the sg list for the non-dma-map cases is because we want to re-use some data structure. Our goal here is, for a hmm_range, to build a list of device physical addresses (dma-mapped or not), which will be used later on to program the device page table. We re-used the sg list structure for the non-dma-map cases so those two cases can share the same page table programming code. Since the sg list was originally designed for dma-map, it does look like it is mis-used here.

Need to mention, even for some DEVICE_PRIVATE memory, we also need dma-mapping. For example, if you have two devices connected to each other through PCIe, both devices' memory is registered as DEVICE_PRIVATE to hmm. I believe we need a dma-map when one device accesses another device's memory. The two devices' memory doesn't belong to the same address space in this case. For modern GPUs with xeLink/nvLink/XGMI, this is not needed.


> 
> > > Also, I tried to follow the large discussion in the end but it was
> > > quite hard to read the text in Lore for some reason.
> >
> > Did you mean this discussion: https://lore.kernel.org/dri-
> devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works good
> for me with chrome browser.
> 
> That is the one I am referring to
> 
> > > I would just opine some general points on how I see hmm_range_fault
> > > being used by drivers.
> > >
> > > First of all, the device should have a private page table. At least
> > > one, but ideally many. Obviously it should work, so I found it a bit
> > > puzzling the talk about problems with virtualization. Either the
> > > private page table works virtualized, including faults, or it should
> > > not be available..
> >
> > To be very honest, I was also very confused. In this series, I had
> > one very fundamental assumption that with hmm any valid cpu virtual
> > address is also a valid gpu virtual address. But Christian had a
> > very strong opinion that the gpu va can have an offset to cpu va. He
> > mentioned a failed use case with amdkfd and claimed an offset can
> > solve their problem.
> 
> Offset is something different, I said the VM's view of the page table
> should fully work. You shouldn't get into a weird situation where the
> VM is populating the page table and can't handle faults or something.
> 

We don't have such a weird situation. There are two layers of translation when running under a virtualized environment. From the guest VM's perspective, the first level page table is in the guest device physical address space. It is no different from the bare-metal situation. Our driver doesn't need to know whether it runs virtualized or bare-metal for first level page table programming and page fault handling.

> If the VMM has a weird design where there is only one page table and
> it needs to partition space based on slicing it into regions then
> fine, but the delegated region to the guest OS should still work
> fully.

Agree.

> 
> > > Or it is a selective mirror where it copies part of the mm page table
> > > into a "vma" of a possibly shared device page table. The
> > > hmm_range_fault bit would exclusively own its bit of VMA.
> >
> > Can you explain what is "hmm_range_fault bit"?
> 
> I mean if you setup a mirror VMA in a device private page table then that
> range of VA will be owned by the hmm_range_fault code and will mirror
> a subset of a mm into that VMA. This is the offset you mention
> above. The MM's VA and the device private page table VA do not have to
> be the same (eg we implement this option in RDMA's ODP)

Ok, it is great to hear of a use case where cpu va != gpu va.

> 
> A 1:1 SVA mapping is a special case of this where there is a single
> GPU VMA that spans the entire process address space with a 1:1 VA (no
> offset).

From an implementation perspective, we can have one device page table per process for such a 1:1 va mapping, but it is not necessary to have a single gpu vma. We can have many gpu vmas, each covering a segment of this address space. We had this structure before this svm/system allocator work. This is also true for core mm vmas. But as said, we are also open to/exploring a single gigantic gpu vma to cover the whole address space.

> 
> > Do you think register a huge mmu notifier to cover the whole address
> > space would be good? I don't know here but there would be much more
> > unnecessary callbacks from mmu to device driver.
> 
> Yes. IMHO you should try to optimize the invalidations away in driver
> logic not through dynamic mmu notifiers. Installing and removing a
> notifier is very expensive.

Ok, we will compare the performance of those two methods.

> 
> > Similarly, we create range only the fly for those range that is
> > accessed by gpu. But we have some idea to keep one gigantic
> > range/VMA representing the whole address space while creating only
> > some "gpu page table state range" on the fly. This idea requires
> > some refactor to our xe driver and we will evaluate it more to
> > decide whether we want to go this way.
> 
> This is a better direction.
> 
> > > I also agree with the general spirit of the remarks that there should
> > > not be a process binding or any kind of "global" character
> > > device.
> >
> > Even though a global pseudo device looks bad, it does come with some
> > benefit. For example, if you want to set a memory attributes to a
> > shared virtual address range b/t all devices, you can set such
> > attributes through a ioctl of the global device. We have agreed to
> > remove our global character device and we will repeat the memory
> > attributes setting on all devices one by one.
> 
> That implies you have a global shared device private page table which
> is sort of impossible because of how the DMA API works.

In the case of multiple devices in one machine, we don't have a shared global device page table. Each device can still have its own page table, even though the page tables of all devices mirror the same address space (i.e., the same virtual address points to the same physical location in the cpu page table and all device page tables).

The reason we had the global device is mainly for setting vma memory attributes. For example, for one vma, a user can set the preferred backing store placement to be on a specific device. Without a global pseudo device, we would have to perform such settings on each device through each device's ioctl, and we would not be able to tell device 0 that the preferred placement is on device 1, because device 0 doesn't even know of the existence of device 1...

Anyway let's continue multiple devices discussion under thread "Cross-device and cross-driver HMM support"

Oak

> 
> Having the kernel iterate over all the private page tables vs having
> the userspace iterate over all the private page tables doesn't seem
> like a worthwhile difference to justify a global cdev.
> 
> > Is /dev/nvidia-uvm a global character device for uvm purpose?
> 
> No idea, I wouldn't assume anything the nvidia drivers do is aligned
> with what upstream expects.
> 
> Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-05 16:42         ` Zeng, Oak
@ 2024-04-05 18:02           ` Jason Gunthorpe
  2024-04-09 16:45             ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-05 18:02 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Fri, Apr 05, 2024 at 04:42:14PM +0000, Zeng, Oak wrote:
> > > Above codes deal with a case where dma map is not needed. As I
> > > understand it, whether we need a dma map depends on the devices
> > > topology. For example, when device access host memory or another
> > > device's memory through pcie, we need dma mapping; if the connection
> > > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > > memory can be in same address space, so no dma mapping is needed.
> > 
> > Then you call dma_map_page to do your DMA side and you avoid it for
> > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> 
> When dma map is needed, we used dma_map_sgtable, a different flavor
> of the dma-map-page function.

I saw, I am saying this should not be done. You cannot unmap bits of
a sgl mapping if an invalidation comes in.

> The reason we also used (mis-used) sg list for non-dma-map cases, is
> because we want to re-use some data structure. Our goal here is, for
> a hmm_range, build a list of device physical address (can be
> dma-mapped or not), which will be used later on to program the
> device page table. We re-used the sg list structure for the
> non-dma-map cases so those two cases can share the same page table
> programming codes. Since sg list was originally designed for
> dma-map, it does look like this is mis-used here.

Please don't use sg list at all for this.
 
> Need to mention, even for some DEVICE_PRIVATE memory, we also need
> dma-mapping. For example, if you have two devices connected to each
> other through PCIe, both devices memory are registered as
> DEVICE_PRIVATE to hmm. 

Yes, but you don't ever dma map DEVICE_PRIVATE.

> I believe we need a dma-map when one device access another device's
> memory. Two devices' memory doesn't belong to same address space in
> this case. For modern GPU with xeLink/nvLink/XGMI, this is not
> needed.

Review my emails here:

https://lore.kernel.org/dri-devel/20240403125712.GA1744080@nvidia.com/

Which explain how it should work.

> > A 1:1 SVA mapping is a special case of this where there is a single
> > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > offset).
> 
> From implementation perspective, we can have one device page table
> for one process for such 1:1 va mapping, but it is not necessary to
> have a single gpu vma. We can have many gpu vma each cover a segment
> of this address space. 

This is not what I'm talking about. The GPU VMA is bound to a specific
MM VA, it should not be created on demand.

If you want the full 1:1 SVA case to optimize invalidations you don't
need something like a VMA, a simple bitmap reducing the address space
into 1024 faulted in chunks or something would be much cheaper than
some dynamic VMA ranges.
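
As a concrete sketch of that idea (names and chunk size are illustrative only):

	/* Sketch: track faulted-in 1GiB chunks of a 1:1 SVA mirror with a
	 * bitmap instead of dynamic VMA ranges. */
	#define SVA_CHUNK_SHIFT 30

	struct sva_mirror {
		unsigned long *faulted;	/* one bit per chunk, bitmap_zalloc()'d */
	};

	static bool sva_chunk_present(struct sva_mirror *m, unsigned long va)
	{
		return test_bit(va >> SVA_CHUNK_SHIFT, m->faulted);
	}

	static void sva_chunk_mark(struct sva_mirror *m, unsigned long va)
	{
		set_bit(va >> SVA_CHUNK_SHIFT, m->faulted);
	}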

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-05 18:02           ` Jason Gunthorpe
@ 2024-04-09 16:45             ` Zeng, Oak
  2024-04-09 17:24               ` Jason Gunthorpe
  2024-04-09 17:33               ` Matthew Brost
  0 siblings, 2 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-04-09 16:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

Hi Jason

We are re-spinning this series based on the previous community feedback. I will send out a v2 soon. There are big changes compared to v1, so it would be better to discuss this work based on v2.

See some reply inline.

> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 5, 2024 2:02 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Fri, Apr 05, 2024 at 04:42:14PM +0000, Zeng, Oak wrote:
> > > > Above codes deal with a case where dma map is not needed. As I
> > > > understand it, whether we need a dma map depends on the devices
> > > > topology. For example, when device access host memory or another
> > > > device's memory through pcie, we need dma mapping; if the connection
> > > > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > > > memory can be in same address space, so no dma mapping is needed.
> > >
> > > Then you call dma_map_page to do your DMA side and you avoid it for
> > > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> >
> > When dma map is needed, we used dma_map_sgtable, a different flavor
> > of the dma-map-page function.
> 
> I saw, I am saying this should not be done. You cannot unmap bits of
> a sgl mapping if an invalidation comes in.

You are right, if we register a huge mmu interval notifier to cover the whole address space, then we should use dma map/unmap pages to map bits of the address space. We will explore this approach.

Right now, in the xe driver, mmu interval notifiers are dynamically registered with small address ranges. We map/unmap the whole small address range each time. So functionally it still works, but it might not be as performant as the method you described. This is the existing logic for our userptr code. Our system allocator inherits this logic automatically, as our system allocator design is built on top of userptr (will send out v2 soon). We plan to make things work in the first stage, then do some performance improvements like you suggested here.

> 
> > The reason we also used (mis-used) sg list for non-dma-map cases, is
> > because we want to re-use some data structure. Our goal here is, for
> > a hmm_range, build a list of device physical address (can be
> > dma-mapped or not), which will be used later on to program the
> > device page table. We re-used the sg list structure for the
> > non-dma-map cases so those two cases can share the same page table
> > programming codes. Since sg list was originally designed for
> > dma-map, it does look like this is mis-used here.
> 
> Please don't use sg list at all for this.

As explained, we use the sg list for device private pages so we can re-use the gpu page table update code. The input of the gpu page table update code in this case is a list of dma addresses (in the case of system memory) or device physical addresses (in the case of device private pages). The gpu page table update code in the xe driver is pretty complicated, so re-using that code is preferable for us. If we introduced a different data structure, we would have to re-write part of the gpu page table update code.

I don't see an obvious problem with this approach. But if you see it as a problem, I am open to changing it.

> 
> > Need to mention, even for some DEVICE_PRIVATE memory, we also need
> > dma-mapping. For example, if you have two devices connected to each
> > other through PCIe, both devices memory are registered as
> > DEVICE_PRIVATE to hmm.
> 
> Yes, but you don't ever dma map DEVICE_PRIVATE.
> 
> > I believe we need a dma-map when one device access another device's
> > memory. Two devices' memory doesn't belong to same address space in
> > this case. For modern GPU with xeLink/nvLink/XGMI, this is not
> > needed.
> 
> Review my emails here:
> 
> https://lore.kernel.org/dri-devel/20240403125712.GA1744080@nvidia.com/
> 
> Which explain how it should work.

You are right. A dma map is not needed for device private pages.

> 
> > > A 1:1 SVA mapping is a special case of this where there is a single
> > > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > > offset).
> >
> > From implementation perspective, we can have one device page table
> > for one process for such 1:1 va mapping, but it is not necessary to
> > have a single gpu vma. We can have many gpu vma each cover a segment
> > of this address space.
> 
> This is not what I'm talking about. The GPU VMA is bound to a specific
> MM VA, it should not be created on demand.

Today we have two places where we create a gpu vma: 1) during a vm_bind ioctl, and 2) during a page fault on a system allocator range (this will be in v2 of this series).

So for case 2), we do create the gpu vma on demand. We also plan to explore doing this differently, such as keeping only a gigantic gpu vma covering the whole address space for the system allocator while creating only some gpu page table state during page fault handling. This is planned for stage 2.

> 
> If you want the full 1:1 SVA case to optimize invalidations you don't
> need something like a VMA, 

A gpu vma (the xe_vma struct in the xe driver code) is a very fundamental concept in our driver. Leveraging the vma lets us re-use a lot of existing driver code, such as the page table update.

But you are right, strictly speaking we don't need a vma. Actually in this v1 version I sent out, we don't have a gpu vma concept for the system allocator, but we do have an svm range concept. We created a temporary vma for page table update purposes. Again, this design will be obsolete in v2 - in v2 the system allocator leverages the userptr code, which incorporates the gpu vma.


> a simple bitmap reducing the address space
> into 1024 faulted in chunks or something would be much cheaper than
> some dynamic VMA ranges.


I suspect something dynamic is still necessary, either a vma or page table state (1 vma but many page table states created dynamically, as planned in our stage 2). The reason is, we still need some corresponding gpu structure to match the cpu vm_area_struct. For example, when a gpu page fault happens, you look up the cpu vm_area_struct for the fault address and create a corresponding state/struct. And people can have as many cpu vm_area_structs as they want.

Oak  

> 
> Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-09 16:45             ` Zeng, Oak
@ 2024-04-09 17:24               ` Jason Gunthorpe
  2024-04-23 21:17                 ` Zeng, Oak
  2024-04-09 17:33               ` Matthew Brost
  1 sibling, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-09 17:24 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:

> > I saw, I am saying this should not be done. You cannot unmap bits of
> > a sgl mapping if an invalidation comes in.
> 
> You are right, if we register a huge mmu interval notifier to cover
> the whole address space, then we should use dma map/unmap pages to
> map bits of the address space. We will explore this approach.
> 
> Right now, in xe driver, mmu interval notifier is dynamically
> registered with small address range. We map/unmap the whole small
> address ranges each time. So functionally it still works. But it
> might not be as performant as the method you said. 

Please don't do this, it is not how hmm_range_fault() should be
used.

It is intended to be page by page and there is no API surface
available to safely try to construct covering ranges. Drivers
definitely should not try to invent such a thing.
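
For reference, the intended usage is roughly the pattern documented in
Documentation/mm/hmm.rst: one long-lived notifier tied to the page table, with
the begin/retry sequence around each fault. A sketch, assuming a struct
mmu_interval_notifier notifier already inserted over the mirrored span:

	unsigned long pfns[1];	/* one page; size to the faulting span */
	struct hmm_range range = {
		.notifier = &notifier,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(&notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}
	/* take the driver page table lock, then revalidate */
	if (mmu_interval_read_retry(&notifier, range.notifier_seq))
		goto again;	/* after dropping the lock */
	/* range.hmm_pfns[] is now stable: program the device page table */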

> > Please don't use sg list at all for this.
> 
> As explained, we use sg list for device private pages so we can
> re-used the gpu page table update codes. 

I'm asking you not to use SGL lists for that too. SGL lists are not
generic data structures to hold DMA lists.

> > This is not what I'm talking about. The GPU VMA is bound to a specific
> > MM VA, it should not be created on demand.
> 
> Today we have two places where we create gpu vma: 1) create gpu vma
> during a vm_bind ioctl 2) create gpu vma during a page fault of the
> system allocator range (this will be in v2 of this series).

Don't do 2.

> I suspect something dynamic is still necessary, either a vma or a
> page table state (1 vma but many page table state created
> dynamically, as planned in our stage 2). 

I'm expecting you'd populate the page table memory on demand.

> The reason is, we still need some gpu corresponding structure to
> match the cpu vm_area_struct.

Definitely not.

> For example, when gpu page fault happens, you look
> up the cpu vm_area_struct for the fault address and create a
> corresponding state/struct. And people can have as many cpu
> vm_area_struct as they want.

No you don't.

You call hmm_range_fault() and it does everything for you. A driver
should never touch CPU VMAs and must not be aware of them in any way.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-09 16:45             ` Zeng, Oak
  2024-04-09 17:24               ` Jason Gunthorpe
@ 2024-04-09 17:33               ` Matthew Brost
  1 sibling, 0 replies; 123+ messages in thread
From: Matthew Brost @ 2024-04-09 17:33 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Jason Gunthorpe, dri-devel, intel-xe, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Tue, Apr 09, 2024 at 10:45:22AM -0600, Zeng, Oak wrote:

Oak - A few drive-by comments...

> Hi Jason
> 
> We are re-spinning this series based on the previous community feedback. I will send out a v2 soon. There are big changes compared to v1, so it would be better to discuss this work based on v2. 
> 
> See some reply inline.
> 
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, April 5, 2024 2:02 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> > Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> > Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> > Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> > hmm range
> > 
> > On Fri, Apr 05, 2024 at 04:42:14PM +0000, Zeng, Oak wrote:
> > > > > Above codes deal with a case where dma map is not needed. As I
> > > > > understand it, whether we need a dma map depends on the devices
> > > > > topology. For example, when device access host memory or another
> > > > > device's memory through pcie, we need dma mapping; if the connection
> > > > > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > > > > memory can be in same address space, so no dma mapping is needed.
> > > >
> > > > Then you call dma_map_page to do your DMA side and you avoid it for
> > > > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> > >
> > > When dma map is needed, we used dma_map_sgtable, a different flavor
> > > of the dma-map-page function.
> > 
> > I saw, I am saying this should not be done. You cannot unmap bits of
> > a sgl mapping if an invalidation comes in.
> 
> You are right, if we register a huge mmu interval notifier to cover the whole address space, then we should use dma map/unmap pages to map bits of the address space. We will explore this approach.
> 
> Right now, in xe driver, mmu interval notifier is dynamically registered with small address range. We map/unmap the whole small address ranges each time. So functionally it still works. But it might not be as performant as the method you said. This is existing logic for our userptr codes. Our system allocator inherit this logic automatically as our system allocator design is built on top of userptr (will send out v2 soon ).We plan to make things work in the first stage then do some performance improvement like you suggested here.
>

Agree that reworking the notifier design for the initial phase is probably out of
scope as that would be a fairly large rework. IMO if/when we switch from
a 1:1 relationship between VMA and PT state to 1:N, that is
likely when it would make sense to redesign our notifier too.

A 1:N relationship is basically a prerequisite for 1 notifier to work, or at least
a new locking structure (maybe ref counting too?) to be able to safely go
from a large notifier -> invalidation chunk.

> > 
> > > The reason we also used (mis-used) sg list for non-dma-map cases, is
> > > because we want to re-use some data structure. Our goal here is, for
> > > a hmm_range, build a list of device physical address (can be
> > > dma-mapped or not), which will be used later on to program the
> > > device page table. We re-used the sg list structure for the
> > > non-dma-map cases so those two cases can share the same page table
> > > programming codes. Since sg list was originally designed for
> > > dma-map, it does look like this is mis-used here.
> > 
> > Please don't use sg list at all for this.
> 
> As explained, we use sg list for device private pages so we can re-used the gpu page table update codes. The input of the gpu page table update codes in this case is a list of dma address (in case of system memory) or device physical address (in case of device private pages). The gpu page table update codes in xe driver is pretty complicated, so re-use that codes is a preferable thing for us. If we introduce different data structure we would have to re-write part of the gpu page table update codes.
> 

Use the buddy blocks for the xe_res_cursor... We basically already have
this in place, see xe_res_first which takes a struct ttm_resource *res
as an argument and underneath uses buddy blocks for the cursor. 
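
A rough sketch of what that walk could look like (assuming the svm code keeps
the struct ttm_resource backing the range; cursor fields and helpers per the
existing xe_res_cursor code):

	/* Sketch: walk the VRAM buddy blocks behind a ttm_resource with
	 * the existing cursor instead of building an sg table. */
	struct xe_res_cursor cur;

	for (xe_res_first(res, 0, size, &cur); cur.remaining;
	     xe_res_next(&cur, cur.size)) {
		/* cur.start is the device physical address of this block;
		 * program PTEs for [cur.start, cur.start + cur.size) */
	}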

> I don't see an obvious problem of this approach. But if you see this a problem, I am open to change it.
>

This should be trivial to change assuming buddy blocks are stored
somewhere (they must be, right?), so I'd do this right away.

Only got to here, maybe reply a bit more later to below comments...

Matt
 
> > 
> > > Need to mention, even for some DEVICE_PRIVATE memory, we also need
> > > dma-mapping. For example, if you have two devices connected to each
> > > other through PCIe, both devices memory are registered as
> > > DEVICE_PRIVATE to hmm.
> > 
> > Yes, but you don't ever dma map DEVICE_PRIVATE.
> > 
> > > I believe we need a dma-map when one device access another device's
> > > memory. Two devices' memory doesn't belong to same address space in
> > > this case. For modern GPU with xeLink/nvLink/XGMI, this is not
> > > needed.
> > 
> > Review my emails here:
> > 
> > https://lore.kernel.org/dri-devel/20240403125712.GA1744080@nvidia.com/
> > 
> > Which explain how it should work.
> 
> You are right. Dma map is not needed for device private
> 
> > 
> > > > A 1:1 SVA mapping is a special case of this where there is a single
> > > > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > > > offset).
> > >
> > > From implementation perspective, we can have one device page table
> > > for one process for such 1:1 va mapping, but it is not necessary to
> > > have a single gpu vma. We can have many gpu vma each cover a segment
> > > of this address space.
> > 
> > This is not what I'm talking about. The GPU VMA is bound to a specific
> > MM VA, it should not be created on demand.
> 
> Today we have two places where we create gpu vma: 1) create gpu vma during a vm_bind ioctl 2) create gpu vma during a page fault of the system allocator range (this will be in v2 of this series).
> 
> So for case 2), we do create gpu vma on demand. We also plan to explore doing this differently, such as only keep a gigantic gpu vma covering the whole address space for system allocator while only create some gpu page table state during page fault handling. This is planned for stage 2.
> 
> > 
> > If you want the full 1:1 SVA case to optimize invalidations you don't
> > need something like a VMA, 
> 
> A gpu vma (xe_vma struct in xe driver codes) is a very fundamental concept in our driver. Leveraging vma can re-use a lot of existing driver codes such as page table update.
> 
> But you are right, strictly speaking we don't need a vma. Actually in this v1 version I sent out, we don't have a gpu vma concept for system allocator. But we do have a svm range concept. We created a temporary vma for page table update purpose. Again this design will be obsolete in v2 - in v2 system allocator leverage userptr codes which incorporate with gpu vma. 
> 
> 
> a simple bitmap reducing the address space
> > into 1024 faulted in chunks or something would be much cheaper than
> > some dynamic VMA ranges.
> 
> 
> I suspect something dynamic is still necessary, either a vma or a page table state (1 vma but many page table state created dynamically, as planned in our stage 2). The reason is, we still need some gpu corresponding structure to match the cpu vm_area_struct. For example, when gpu page fault happens, you look up the cpu vm_area_struct for the fault address and create a corresponding state/struct. And people can have as many cpu vm_area_struct as they want.
> 
> Oak  
> 
> > 
> > Jason


* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-09 17:24               ` Jason Gunthorpe
@ 2024-04-23 21:17                 ` Zeng, Oak
  2024-04-24  2:31                   ` Matthew Brost
  2024-04-24 13:48                   ` Jason Gunthorpe
  0 siblings, 2 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-04-23 21:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

Hi Jason,

Sorry for a late reply. I have been working on a v2 of this series: https://patchwork.freedesktop.org/series/132229/. This version addresses some of your concerns, such as removing the global character device and removing the svm process concept (which needs further cleanup per Matt's feedback).

But the main concern you raised is not addressed yet. I need to make sure I fully understand your concerns first. See inline.



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 9, 2024 1:24 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost, Matthew
> <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> 
> > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > a sgl mapping if an invalidation comes in.
> >
> > You are right, if we register a huge mmu interval notifier to cover
> > the whole address space, then we should use dma map/unmap pages to
> > map bits of the address space. We will explore this approach.
> >
> > Right now, in xe driver, mmu interval notifier is dynamically
> > registered with small address range. We map/unmap the whole small
> > address ranges each time. So functionally it still works. But it
> > might not be as performant as the method you said.
> 
> Please don't do this, it is not how hmm_range_fault() should be
> used.
> 
> It is intended to be page by page and there is no API surface
> available to safely try to construct covering ranges. Drivers
> definitely should not try to invent such a thing.

I need your help to understand this comment. Our gpu mirrors the whole CPU virtual address space. It is the first design pattern in your previous reply (entire exclusive owner of a single device private page table and fully mirrors the mm page table into the device table).

What do you mean by "page by page"/"no API surface available to safely try to construct covering ranges"? As I understand it, hmm_range_fault takes a virtual address range (defined in the hmm_range struct) and walks the cpu page table in this range. It is a range-based API.

From your previous reply ("So I find it a quite strange that this RFC is creating VMA's, notifiers and ranges on the fly "), it seems you are questioning why we are creating vmas and registering mmu interval notifiers on the fly. Let me try to explain it. Xe_vma is a very fundamental concept in the xe driver. Gpu page table updates and invalidations are all vma-based. This concept predates this svm work. For svm, we create a 2M vma (the size is user configurable) during the gpu page fault handler and register this 2M range with an mmu interval notifier.

Now let me try to figure out what we can do if we don't create a vma. We could map one page (the one containing the gpu fault address) into the gpu page table. But that doesn't work for us because the GPU cache and TLB would not be performant mapping a 4K page at a time. One way to think of the vma is as a chunk size that is good for GPU HW performance.

And the mmu notifier... if we don't register mmu notifiers on the fly, do we register one mmu notifier to cover the whole CPU virtual address space (which would be huge, e.g., 0~2^56 on a 57-bit machine with a half/half user/kernel space split)? That wouldn't be performant either, because the gpu driver would get an invalidation callback for any address range unmapped from the cpu program, even if it is never touched by the GPU. In our approach, we only register an mmu notifier for an address range that we know the gpu will touch.
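
For reference, the registration in our fault handler looks roughly like this (a sketch; xe_svm_range_create() and xe_svm_notifier_ops are placeholders for our driver structures):

	/* carve a 2M (default) chunk around the faulting address */
	u64 start = ALIGN_DOWN(fault_addr, SZ_2M);
	struct xe_svm_range *range = xe_svm_range_create(vm, start, SZ_2M);

	err = mmu_interval_notifier_insert(&range->notifier, mm,
					   start, SZ_2M,
					   &xe_svm_notifier_ops);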

> 
> > > Please don't use sg list at all for this.
> >
> > As explained, we use an sg list for device private pages so we can
> > re-use the gpu page table update code.
> 
> I'm asking you not to use SGL lists for that too. SGL lists are not
> generic data structures to hold DMA lists.

Matt mentioned using drm_buddy_block. I will see how that works out.

> 
> > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > MM VA, it should not be created on demand.
> >
> > Today we have two places where we create gpu vma: 1) create gpu vma
> > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > system allocator range (this will be in v2 of this series).
> 
> Don't do 2.

As said, we will try the approach of one gigantic gpu vma with N page table states. We will create page table states in page fault handling. But this is only planned for stage 2. 

> 
> > I suspect something dynamic is still necessary, either a vma or a
> > page table state (1 vma but many page table state created
> > dynamically, as planned in our stage 2).
> 
> I'm expecting you'd populate the page table memory on demand.

We do populate the gpu page table on demand. When the gpu accesses a virtual address, we populate the gpu page table.


> 
> > The reason is, we still need some gpu corresponding structure to
> > match the cpu vm_area_struct.
> 
> Definately not.

See explanation above.

> 
> > For example, when gpu page fault happens, you look
> > up the cpu vm_area_struct for the fault address and create a
> > corresponding state/struct. And people can have as many cpu
> > vm_area_struct as they want.
> 
> No you don't.

See the explanations above. I need your help to understand how we can do it without a vma (or chunk). One way the GPU driver differs from an RDMA driver is that RDMA doesn't have device private memory, so there is no migration. It only needs to dma-map the system memory pages and use them to fill the RDMA page table, so RDMA doesn't need another memory manager such as our buddy allocator. RDMA only deals with system memory, which is completely struct page based management. Page by page makes 100% sense for RDMA. 

But for the gpu, we need a way to use device local memory efficiently. This is the main reason we have the vma/chunk concept.
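
To illustrate the vram side, the chunk's backing memory comes straight from drm_buddy, roughly (a sketch; vram->mm and the sizes are illustrative):

	LIST_HEAD(blocks);

	/* raw buddy blocks, no BO, so pages can later migrate back to
	 * system memory individually */
	err = drm_buddy_alloc_blocks(&vram->mm, 0, vram->size, SZ_2M,
				     PAGE_SIZE, &blocks, 0);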

Thanks,
Oak


> 
> You call hmm_range_fault() and it does everything for you. A driver
> should never touch CPU VMAs and must not be aware of them in any way.
> 
> Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-23 21:17                 ` Zeng, Oak
@ 2024-04-24  2:31                   ` Matthew Brost
  2024-04-24 13:57                     ` Jason Gunthorpe
  2024-04-24 13:48                   ` Jason Gunthorpe
  1 sibling, 1 reply; 123+ messages in thread
From: Matthew Brost @ 2024-04-24  2:31 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Jason Gunthorpe, dri-devel, intel-xe, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Tue, Apr 23, 2024 at 03:17:03PM -0600, Zeng, Oak wrote:
> Hi Jason,
> 
> Sorry for a late reply. I have been working on a v2 of this series: https://patchwork.freedesktop.org/series/132229/. This version addresses some of your concerns, such as removing the global character device and removing the svm process concept (which needs further cleanup per Matt's feedback).
> 
> But the main concern you raised is not addressed yet. I need to make sure I fully understand your concerns first. See inline.
> 

A few extra comments with references below.

> 
> 
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 9, 2024 1:24 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost, Matthew
> > <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty, Brian
> > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> > Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> > hmm range
> > 
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > 
> > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > a sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in xe driver, mmu interval notifier is dynamically
> > > registered with small address range. We map/unmap the whole small
> > > address ranges each time. So functionally it still works. But it
> > > might not be as performant as the method you said.
> > 
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> > 
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definitely should not try to invent such a thing.
>
> I need your help to understand this comment. Our gpu mirrors the whole CPU virtual address space. It is the first design pattern in your previous reply (entire exclusive owner of a single device private page table and fully mirrors the mm page table into the device table).
> 
> What do you mean by "page by page"/"no API surface available to safely try to construct covering ranges"? As I understand it, hmm_range_fault takes a virtual address range (defined in the hmm_range struct) and walks the cpu page table in this range. It is a range-based API.
> 
> From your previous reply ("So I find it a quite strange that this RFC is creating VMA's, notifiers and ranges on the fly "), it seems you are questioning why we are creating vmas and registering mmu interval notifiers on the fly. Let me try to explain it. Xe_vma is a very fundamental concept in the xe driver. Gpu page table updates and invalidations are all vma-based. This concept predates this svm work. For svm, we create a 2M vma (the size is user configurable) during the gpu page fault handler and register this 2M range with an mmu interval notifier.
> 
> Now let me try to figure out what we can do if we don't create a vma. We could map one page (the one containing the gpu fault address) into the gpu page table. But that doesn't work for us because the GPU cache and TLB would not be performant mapping a 4K page at a time. One way to think of the vma is as a chunk size that is good for GPU HW performance.
> 
> And the mmu notifier... if we don't register mmu notifiers on the fly, do we register one mmu notifier to cover the whole CPU virtual address space (which would be huge, e.g., 0~2^56 on a 57-bit machine with a half/half user/kernel space split)? That wouldn't be performant either, because the gpu driver would get an invalidation callback for any address range unmapped from the cpu program, even if it is never touched by the GPU. In our approach, we only register an mmu notifier for an address range that we know the gpu will touch. 
>

AMD seems to register notifiers on demand for parts of the address space
[1], I think Nvidia's open source driver does this too (can look this up
if needed). We (Intel) also do this in Xe and the i915 for userptrs
(explicitly binding a user address via IOCTL) and it seems to work
quite well.

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c#L130
 
> > 
> > > > Please don't use sg list at all for this.
> > >
> > > As explained, we use an sg list for device private pages so we can
> > > re-use the gpu page table update code.
> > 
> > I'm asking you not to use SGL lists for that too. SGL lists are not
> > generic data structures to hold DMA lists.
> 
> Matt mentioned using drm_buddy_block. I will see how that works out.
> 

We should probably actually build an iterator (xe_res_cursor) for the
device pages returned from hmm_range_fault, now that I think about this
more.

> > 
> > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > MM VA, it should not be created on demand.
> > >
> > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > system allocator range (this will be in v2 of this series).
> > 
> > Don't do 2.

You have to create something, actually 2 things, on a GPU page fault:
something to track the page table state and something to track the VRAM
memory allocation. Both AMD's and Nvidia's open source drivers do this.

In AMD's driver the page table state is svm_range [2] and VRAM state is
svm_range_bo [3]. 

Nvidia's open source driver also does something similar (again can track
down a ref if needed).

Conceptually Xe will do something similar; these are trending towards
an xe_vma and an xe_bo respectively. The exact details are TBD but the
concept is solid.

[2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_svm.h#L109
[3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_svm.h#L42 

> 
> As said, we will try the approach of one gigantic gpu vma with N page table states. We will create page table states in page fault handling. But this is only planned for stage 2. 
> 
> > 
> > > I suspect something dynamic is still necessary, either a vma or a
> > > page table state (1 vma but many page table state created
> > > dynamically, as planned in our stage 2).
> > 
> > I'm expecting you'd populate the page table memory on demand.
> 
> We do populate the gpu page table on demand. When the gpu accesses a virtual address, we populate the gpu page table.
> 
> 
> > 
> > > The reason is, we still need some gpu corresponding structure to
> > > match the cpu vm_area_struct.
> > 
> > Definately not.
> 
> See explanation above.
> 

Agree the GPU doesn't need to match vm_area_struct, but the allocation
must be a subset of (or equal to) a vm_area_struct. Again, other drivers
do this too.

e.g. You can't allocate a 2MB chunk if the vm_area_struct looked up is
only 64k.
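
i.e. Something like this on the fault path (illustrative only; assumes
the mmap lock is held for find_vma()):

	struct vm_area_struct *vma = find_vma(mm, fault_addr);

	/* clamp the default 2MB chunk to the CPU VMA it lands in */
	u64 start = max_t(u64, ALIGN_DOWN(fault_addr, SZ_2M), vma->vm_start);
	u64 end = min_t(u64, ALIGN(fault_addr + 1, SZ_2M), vma->vm_end);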

> > 
> > > For example, when gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_struct as they want.
> > 
> > No you don't.

Yes you do. See below.

> 
> See the explanations above. I need your help to understand how we can do it without a vma (or chunk). One way the GPU driver differs from an RDMA driver is that RDMA doesn't have device private memory, so there is no migration. It only needs to dma-map the system memory pages and use them to fill the RDMA page table, so RDMA doesn't need another memory manager such as our buddy allocator. RDMA only deals with system memory, which is completely struct page based management. Page by page makes 100% sense for RDMA. 
> 
> But for the gpu, we need a way to use device local memory efficiently. This is the main reason we have the vma/chunk concept.
> 
> Thanks,
> Oak
> 
> 
> > 
> > You call hmm_range_fault() and it does everything for you. A driver
> > should never touch CPU VMAs and must not be aware of them in any way.
> > 

struct vm_area_struct is an argument to the migrate_vma* functions [4], so
yes drivers need to be aware of CPU VMAs.

Again AMD [5], Nouveau [6], and Nvidia's open source driver (again no
ref, but can dig one up) all look up CPU VMAs on a GPU page fault or SVM
bind IOCTL.

[4] https://elixir.bootlin.com/linux/latest/source/include/linux/migrate.h#L186
[5] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c#L522
[6] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_svm.c#L182
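
A minimal sketch of that flow, per the migrate_vma API in [4] (error
handling omitted; owner is whatever pgmap owner the driver uses):

	struct migrate_vma migrate = {
		.vma		= vma,		/* the looked-up CPU VMA */
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};

	err = migrate_vma_setup(&migrate);
	/* allocate VRAM pages, copy the data, fill dst_pfns ... */
	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);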

Matt

> > Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-23 21:17                 ` Zeng, Oak
  2024-04-24  2:31                   ` Matthew Brost
@ 2024-04-24 13:48                   ` Jason Gunthorpe
  2024-04-24 23:59                     ` Zeng, Oak
  1 sibling, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 13:48 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > 
> > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > a sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in xe driver, mmu interval notifier is dynamically
> > > registered with small address range. We map/unmap the whole small
> > > address ranges each time. So functionally it still works. But it
> > > might not be as performant as the method you said.
> > 
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> > 
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definitely should not try to invent such a thing.
> 
> I need your help to understand this comment. Our gpu mirrors the
> whole CPU virtual address space. It is the first design pattern in
> your previous reply (entire exclusive owner of a single device
> private page table and fully mirrors the mm page table into the
> device table.)
> 
> What do you mean by "page by page"/" no API surface available to
> safely try to construct covering ranges"? As I understand it,
> hmm_range_fault take a virtual address range (defined in hmm_range
> struct), and walk cpu page table in this range. It is a range based
> API.

Yes, but invalidation is not linked to range_fault so you can get
invalidations of single pages. You are binding range_fault to
dma_map_sg but then you can't handle invalidation at all sanely.
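
The only sane structure is the documented seqno retry loop, see
Documentation/mm/hmm.rst (take_lock()/release_lock() are the driver's
page table lock, as in the hmm.rst example):

again:
	range.notifier_seq = mmu_interval_read_begin(&interval_sub);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	take_lock(driver->update);
	if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
		release_lock(driver->update);
		goto again;
	}
	/* use range.hmm_pfns[] to update the device page table, then */
	release_lock(driver->update);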

> From your previous reply ("So I find it a quite strange that this
> RFC is creating VMA's, notifiers and ranges on the fly "), it seems
> you are concerning why we are creating vma and register mmu interval
> notifier on the fly. Let me try to explain it. Xe_vma is a very
> fundamental concept in xe driver. 

I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
and you do need to ensure the page table manipulation has an API that
is usable. "populate an entire VMA" / "invalidate an entire VMA" is
not a sufficient primitive.

> The gpu page table update, invalidation are all vma-based. This
> concept exists before this svm work. For svm, we create a 2M (the
> size is user configurable) vma during gpu page fault handler and
> register this 2M range to mmu interval notifier.

You can create VMAs, but they can't be fully populated and they should
be a lot bigger than 2M. ODP uses a similar 2 level scheme for its
SVA; the "vma" granule is 512M.

The key thing is the VMA is just a container for the notifier and
other data structures; it doesn't insist the range be fully populated,
and it must support page-by-page unmap/remap/invalidation.
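
i.e. the invalidate callback acts on exactly the sub-range it is handed,
nothing more (a sketch; sva_vma, pt_lock() and unmap_gpu_pages() are
placeholders):

static bool sva_invalidate(struct mmu_interval_notifier *mni,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq)
{
	struct sva_vma *v = container_of(mni, struct sva_vma, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	pt_lock(v);
	mmu_interval_set_seq(mni, cur_seq);
	/* zap only the pages covered by this event, not the whole vma */
	unmap_gpu_pages(v, range->start, range->end);
	pt_unlock(v);
	return true;
}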

> Now I try to figure out if we don't create vma, what can we do? We
> can map one page (which contains the gpu fault address) to gpu page
> table. But that doesn't work for us because the GPU cache and TLB
> would not be performant for 4K page each time. One way to think of
> the vma is a chunk size which is good for GPU HW performance.

From this perspective invalidation is driven by the range the
invalidation callback gets; it could be a single page, but probably
bigger.

mapping is driven by the range passed to hmm_range_fault() during
fault handling, which is entirely based on your driver's prefetch
logic.

GPU TLB invalidation sequences should happen once, at the end, for any
invalidation or range_fault sequence regardless. Nothing about "gpu
vmas" should have anything to do with this.

> And the mmu notifier... if we don't register the mmu notifier on the
> fly, do we register one mmu notifier to cover the whole CPU virtual
> address space (which would be huge, e.g., 0~2^56 on a 57 bit
> machine, if we have half half user space kernel space split)? That
> won't be performant as well because for any address range that is
> unmapped from cpu program, but even if they are never touched by
> GPU, gpu driver still got a invalidation callback. In our approach,
> we only register a mmu notifier for address range that we know gpu
> would touch it.

When SVA is used, something, somewhere, has to decide if a CPU VA
intersects with a HW VA.

The mmu notifiers are organized in an interval (red/black) tree, so if
you have a huge number of them the RB search becomes very expensive.

Conversely, your GPU page table is organized in a radix tree, so
detecting non-presence during invalidation is a simple radix
walk. Indeed for the obviously unused ranges it is probably a pointer
load and a single de-ref to check it.

I fully expect the radix walk is much, much faster than a huge number
of 2M notifiers in the red/black tree.

Notifiers for SVA cases should be giant. If not the entire memory
space, then at least something like 512M/1G kind of size, neatly
aligned with something in your page table structure so the radix walk
can start lower in the tree automatically.

> > > For example, when gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_struct as they want.
> > 
> > No you don't.
> 
> See explains above. I need your help to understand how we can do it
> without a vma (or chunk). One thing GPU driver is different from
> RDMA driver is, RDMA doesn't have device private memory so no
> migration. 

I want to be very clear: there should be no interaction of your
hmm_range_fault based code with MM centric VMAs. You MUST NOT look up
the CPU vm_area_struct in your driver.

> It only need to dma-map the system memory pages and use
> them to fill RDMA page table. so RDMA don't need another memory
> manager such as our buddy. RDMA only deal with system memory which
> is completely struct page based management. Page by page make 100 %
> sense for RDMA.

I don't think this is the issue at hand, you just have some historical,
poorly designed page table manipulation code from what I can
understand..

Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-24  2:31                   ` Matthew Brost
@ 2024-04-24 13:57                     ` Jason Gunthorpe
  2024-04-24 16:35                       ` Matthew Brost
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 13:57 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Zeng, Oak, dri-devel, intel-xe, Thomas.Hellstrom, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:

> AMD seems to register notifiers on demand for parts of the address space
> [1], I think Nvidia's open source driver does this too (can look this up
> if needed). We (Intel) also do this in Xe and the i915 for userptrs
> (explicitly binding a user address via IOCTL) and it seems to work
> quite well.

I always thought AMD's implementation of this stuff was bad..

> > > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > > MM VA, it should not be created on demand.
> > > >
> > > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > > system allocator range (this will be in v2 of this series).
> > > 
> > > Don't do 2.
> 
> You have to create something, actually 2 things, on a GPU page fault.
> Something to track the page table state and something to track VRAM
> memory allocation. Both AMD and Nvidia's open source driver do this.

VRAM memory allocation should be tracked by the mm side, under the
covers of hmm_range_fault (or migration prior to invoking
hmm_range_fault).

VRAM memory allocation or management has nothing to do with SVA.

From there the only need is to copy hmm_range_fault results into GPU
PTEs. You definitely do not *need* some other data structure.

> > > > The reason is, we still need some gpu corresponding structure to
> > > > match the cpu vm_area_struct.
> > > 
> > > Definately not.
> > 
> > See explanation above.
> 
> Agree the GPU doesn't need to match vm_area_struct, but the allocation must
> be a subset of (or equal to) a vm_area_struct. Again, other drivers do this
> too.

No, absolutely not. There can be no linking of CPU vm_area_structs to
how a driver operates hmm_range_fault().

You probably need to do something like this for your migration logic,
but that is separate.

> > > You call hmm_range_fault() and it does everything for you. A driver
> > > should never touch CPU VMAs and must not be aware of them in any way.
> 
> struct vm_area_struct is an argument to the migrate_vma* functions [4], so
> yes drivers need to be aware of CPU VMAs.

That is something else. If you want to mess with migration during your
GPU fault path then fine, that is some "migration module", but it
should have NOTHING to do with how hmm_range_fault() is invoked or how
the *SVA* flow operates.

You are mixing things up here, this thread is talking about
hmm_range_fault() and SVA.

migration is something that happens before doing the SVA mirroring
flows.

Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-24 13:57                     ` Jason Gunthorpe
@ 2024-04-24 16:35                       ` Matthew Brost
  2024-04-24 16:44                         ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Matthew Brost @ 2024-04-24 16:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zeng, Oak, dri-devel, intel-xe, Thomas.Hellstrom, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Wed, Apr 24, 2024 at 10:57:54AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:
> 
> > AMD seems to register notifiers on demand for parts of the address space
> > [1], I think Nvidia's open source driver does this too (can look this up
> > if needed). We (Intel) also do this in Xe and the i915 for userptrs
> > (explicitly binding a user address via IOCTL) and it seems to work
> > quite well.
> 
> I always thought AMD's implementation of this stuff was bad..
> 

No comment on the quality of AMD's implementation.

But in general the view among my team members is that registering
notifiers on demand for sub-ranges is an accepted practice. If we find
the perf for this is terrible, it would be fairly easy to switch to 1
large notifier mirroring the entire CPU address space. This is a small
design detail IMO.

> > > > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > > > MM VA, it should not be created on demand.
> > > > >
> > > > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > > > system allocator range (this will be in v2 of this series).
> > > > 
> > > > Don't do 2.
> > 
> > You have to create something, actually 2 things, on a GPU page fault.
> > Something to track the page table state and something to track VRAM
> > memory allocation. Both AMD and Nvidia's open source driver do this.
> 
> VRAM memory allocation should be tracked by the mm side, under the
> covers of hmm_range_fault (or migration prior to invoking
> hmm_range_fault).
> 

Yes, the step I describe above is optionally done *before* calling
hmm_range_fault.

> VRAM memory allocation or management has nothing to do with SVA.
> 
> From there the only need is to copy hmm_range_fault results into GPU
> PTEs. You definitely do not *need* some other data structure.
> 

You do not *need* some other data structure, as you could always just
walk the page tables, but in practice a data structure exists in a tree
of sorts with the key being a VA range. The data structure has meta
data about the mapping; all GPU drivers seem to have this. This data
structure, along with pages returned from hmm_range_fault, is used to
program the GPU PTEs.

Again the allocation of this data structure happens *before* calling
hmm_range_fault on the first GPU fault within an unmapped range.

> > > > > The reason is, we still need some gpu corresponding structure to
> > > > > match the cpu vm_area_struct.
> > > > 
> > > > Definately not.
> > > 
> > > See explanation above.
> > 
> > Agree the GPU doesn't need to match vm_area_struct, but the allocation must
> > be a subset of (or equal to) a vm_area_struct. Again, other drivers do this
> > too.
> 
> No, absolutely not. There can be no linking of CPU vm_area_structs to
> how a driver operates hmm_range_fault().
>

Agree. Once we are calling hmm_range_fault, the vm_area_struct is out of
the picture.

We also will never store a vm_area_struct in our driver; it is only
looked up on demand when required for migration.
 
> You probably need to do something like this for your migration logic,
> but that is separate.
> 

Yes.

> > > > You call hmm_range_fault() and it does everything for you. A driver
> > > > should never touch CPU VMAs and must not be aware of them in any way.
> > 
> > struct vm_area_struct is an argument to the migrate_vma* functions [4], so
> > yes drivers need to be aware of CPU VMAs.
> 
> That is something else. If you want to mess with migration during your
> GPU fault path then fine that is some "migration module", but it
> should have NOTHING to do with how hmm_range_fault() is invoked or how
> the *SVA* flow operates.
>

Agree. hmm_range_fault is invoked in an opaque way whether the backing
store is SRAM or VRAM, and flows in a uniform way. The only difference is
how we resolve the hmm pfn to a value in the GPU page tables (device
private pages are device physical addresses, while CPU pages are dma
mapped).
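
Per hmm pfn that resolution is roughly (a sketch; xe_vram_address() is a
stand-in for the driver's buddy-block lookup):

	struct page *page = hmm_pfn_to_page(range->hmm_pfns[i]);
	u64 addr;

	if (is_device_private_page(page))
		addr = xe_vram_address(page);	/* device physical */
	else
		addr = dma_map_page(dev, page, 0, PAGE_SIZE,
				    DMA_BIDIRECTIONAL);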

> You are mixing things up here, this thread is talking about
> hmm_range_fault() and SVA.
>

I figure there was a level of confusion in this thread. I think we are
now aligned?

Thanks for your time.
Matt

> migration is something that happens before doing the SVA mirroring
> flows.
> 
> Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-24 16:35                       ` Matthew Brost
@ 2024-04-24 16:44                         ` Jason Gunthorpe
  2024-04-24 16:56                           ` Matthew Brost
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 16:44 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Zeng, Oak, dri-devel, intel-xe, Thomas.Hellstrom, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Wed, Apr 24, 2024 at 04:35:17PM +0000, Matthew Brost wrote:
> On Wed, Apr 24, 2024 at 10:57:54AM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:
> > 
> > > AMD seems to register notifiers on demand for parts of the address space
> > > [1], I think Nvidia's open source driver does this too (can look this up
> > > if needed). We (Intel) also do this in Xe and the i915 for userptrs
> > > (explicitly binding a user address via IOCTL) and it seems to work
> > > quite well.
> > 
> > I always thought AMD's implementation of this stuff was bad..
> 
> No comment on the quality of AMD's implementation.
> 
> But in general the view among my team members is that registering notifiers
> on demand for sub-ranges is an accepted practice.

Yes, but not on a 2M granule, and not without sparsity. Do it on
something like an aligned 512M and it would be fairly reasonable.

> You do not *need* some other data structure as you could always just
> walk the page tables but in practice a data structure exists in a tree
> of sorts with the key being a VA range. The data structure has meta
> data about the mapping, all GPU drivers seem to have this. 

What "meta data" is there for a SVA mapping? The entire page table is
an SVA.

> structure, along with pages returned from hmm_range_fault, is used to
> program the GPU PTEs.

Most likely pages returned from hmm_range_fault() can just be stored
directly in the page table's PTEs. I'd be surprised if you actually
need separate storage. (ignoring some of the current issues with the
DMA API)

> Again the allocation of this data structure happens *before* calling
> hmm_range_fault on the first GPU fault within an unmapped range.

The SVA page table and hmm_range_fault are tightly connected together;
if a vma is needed to make it work then it is not "before", it is
part of.

Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-24 16:44                         ` Jason Gunthorpe
@ 2024-04-24 16:56                           ` Matthew Brost
  2024-04-24 17:48                             ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Matthew Brost @ 2024-04-24 16:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zeng, Oak, dri-devel, intel-xe, Thomas.Hellstrom, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Wed, Apr 24, 2024 at 01:44:11PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 04:35:17PM +0000, Matthew Brost wrote:
> > On Wed, Apr 24, 2024 at 10:57:54AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:
> > > 
> > > > AMD seems to register notifiers on demand for parts of the address space
> > > > [1], I think Nvidia's open source driver does this too (can look this up
> > > > if needed). We (Intel) also do this in Xe and the i915 for userptrs
> > > > (explicitly binding a user address via IOCTL) and it seems to work
> > > > quite well.
> > > 
> > > I always thought AMD's implementation of this stuff was bad..
> > 
> > No comment on the quality of AMD's implementation.
> > 
> > But in general the view among my team members is that registering notifiers
> > on demand for sub-ranges is an accepted practice.
> 
> Yes, but not on a 2M granule, and not without sparsity. Do it on
> something like an aligned 512M and it would be fairly reasonable.
> 
> > You do not *need* some other data structure as you could always just
> > walk the page tables but in practice a data structure exists in a tree
> > of sorts with the key being a VA range. The data structure has meta
> > data about the mapping, all GPU drivers seem to have this. 
> 
> What "meta data" is there for a SVA mapping? The entire page table is
> an SVA.
> 

Whether we have allocated memory for GPU page tables in the range,
whether the range has been invalidated, the notifier seqno, what type of
mapping this is (SVA, BO, userptr, NULL)...

The "meta data" covers all types of mappings, not just SVA. SVA is a
specific class of the "meta data".
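
Purely illustrative (not the actual xe structure), think of it as
something like:

	struct mapping_meta {
		u64 start, end;			/* VA range, the tree key */
		enum { M_SVA, M_BO, M_USERPTR, M_NULL } type;
		unsigned long notifier_seq;	/* seqno at last (re)bind */
		bool invalidated;
		void *pt_state;			/* page table memory, if any */
	};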

> > structure, along with pages returned from hmm_range_fault, is used to
> > program the GPU PTEs.
> 
> Most likely pages returned from hmm_range_fault() can just be stored
> directly in the page table's PTEs. I'd be surprised if you actually
> need separate storage. (ignoring some of the current issues with the
> DMA API)
> 

In theory that could work, but again in practice this is not how it is
done. The "meta data" covers all the mapping classes mentioned above.
Our PTE programming code needs to handle all the different requirements
of these specific classes in a single code path.

> > Again the allocation of this data structure happens *before* calling
> > hmm_range_fault on the first GPU fault within an unmapped range.
> 
> The SVA page table and hmm_range_fault are tightly connected together,
> if a vma is needed to make it work then it is not "before", it is
> part of.
>

It is companion data for the GPU page table walk. See the above
explanation.

Matt
 
> Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-24 16:56                           ` Matthew Brost
@ 2024-04-24 17:48                             ` Jason Gunthorpe
  0 siblings, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 17:48 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Zeng, Oak, dri-devel, intel-xe, Thomas.Hellstrom, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Wed, Apr 24, 2024 at 04:56:57PM +0000, Matthew Brost wrote:
> > What "meta data" is there for a SVA mapping? The entire page table is
> > an SVA.
> 
> If we have allocated memory for GPU page tables in the range,

This is encoded directly in the radix tree.

> if range
> has been invalidated, 

As I keep saying, these ranges need sparsity, so you have to store
per-page state in the radix tree recording whether it is invalidated.
There should be no need for per-range tracking.

> notifier seqno, 

? The mmu notifier infrastructure handles seqno. You need an mmu
notifier someplace that covers that SVA, or fractionally covers it,
but that is not tied to the PTEs.

> what type of mapping this is (SVA,
> BO, userptr, NULL)...

Which is back to a whole "SVA" type "vma" that covers the entire GPU
page table when you bind the mm in the first place.

> > > structure, along with pages returned from hmm_range_fault, are used to
> > > program the GPU PTEs.
> > 
> > Most likely pages returned from hmm_range_fault() can just be stored
> > directly in the page table's PTEs. I'd be surprised if you actually
> > need seperate storage. (ignoring some of the current issues with the
> > DMA API)
> In theory that could work, but again in practice this is not how it is
> done. The "meta data" covers all the mapping classes mentioned above.
> Our PTE programming code needs to handle all the different requirements
> of these specific classes in a single code path.

Which is back to my first thesis: this is all about trying to bolt on
an existing PTE scheme which is ill-suited to the needs of SVA and
hmm_range_fault.

Jason


* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-24 13:48                   ` Jason Gunthorpe
@ 2024-04-24 23:59                     ` Zeng, Oak
  2024-04-25  1:05                       ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-04-24 23:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

Hi Jason,

I went through the conversation b/t you and Matt. I think we are pretty much aligned. Here is what I get from this thread:

1) hmm range fault size / gpu page table map size: you prefer a bigger gpu vma size, with the vma sparsely mapped to the gpu. Our vma size is configurable through a user madvise api. We do plan to try 1 gigantic vma with sparse mapping. That requires us to restructure the driver for a 1 vma : N page table states mapping. This will be stage 2 work.

2) invalidation: you prefer a giant notifier. We can consider this if it turns out our implementation is not performant. Currently we don't know.

3) whether the driver can look up the cpu vma: I think we need this for data migration purposes.


See also comments inline.


> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 9:49 AM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > >
> > > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > > a sgl mapping if an invalidation comes in.
> > > >
> > > > You are right, if we register a huge mmu interval notifier to cover
> > > > the whole address space, then we should use dma map/unmap pages
> to
> > > > map bits of the address space. We will explore this approach.
> > > >
> > > > Right now, in xe driver, mmu interval notifier is dynamically
> > > > registered with small address range. We map/unmap the whole small
> > > > address ranges each time. So functionally it still works. But it
> > > > might not be as performant as the method you said.
> > >
> > > Please don't do this, it is not how hmm_range_fault() should be
> > > used.
> > >
> > > It is intended to be page by page and there is no API surface
> > > available to safely try to construct covering ranges. Drivers
> > > definitely should not try to invent such a thing.
> >
> > I need your help to understand this comment. Our gpu mirrors the
> > whole CPU virtual address space. It is the first design pattern in
> > your previous reply (entire exclusive owner of a single device
> > private page table and fully mirrors the mm page table into the
> > device table.)
> >
> > What do you mean by "page by page"/" no API surface available to
> > safely try to construct covering ranges"? As I understand it,
> > hmm_range_fault take a virtual address range (defined in hmm_range
> > struct), and walk cpu page table in this range. It is a range based
> > API.
> 
> Yes, but invalidation is not linked to range_fault so you can get
> invalidations of single pages. You are binding range_fault to
> dma_map_sg but then you can't handle invalidation at all sanely.

Ok, I understand your point now.

Yes, strictly speaking we can get invalidation of a single page. This can be triggered by core mm numa balancing or ksm (kernel samepage merging). At present, my understanding is that single page (or a few pages) invalidation is not a very common case. The more common cases are invalidation triggered by user munmap, or invalidation triggered by hmm migration itself (triggered in migrate_vma_setup). I will experiment with this.

User munmap obviously triggers range-based invalidation.

The invalidation triggered by hmm vma migration is also range based, as we choose to migrate at vma granularity due to performance considerations, as explained.

I agree that in the case of single page invalidation, the current code is not performant because we invalidate the whole vma. What I can do is look at the mmu_notifier_range parameter of the invalidation callback and only invalidate that range. This requires our driver to split the vma state and page table state. It is a big change. We plan to do it in stage 2.

> 
> > From your previous reply ("So I find it a quite strange that this
> > RFC is creating VMA's, notifiers and ranges on the fly "), it seems
> > you are concerning why we are creating vma and register mmu interval
> > notifier on the fly. Let me try to explain it. Xe_vma is a very
> > fundamental concept in xe driver.
> 
> I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> and you do need to ensure the page table manipulation has an API that
> is usable. "populate an entire VMA" / "invalidate an entire VMA" is
> not a sufficient primitive.

I understand that invalidating an entire VMA might not be performant. I will improve it as explained above.

I think whether we want to populate an entire VMA or only one page is still the driver's choice. For us, populating a whole VMA (2M bytes by default, but overridable by the user) is still a performant option. If we populated one page at a time, we would run into continuous gpu page faults when the gpu accesses the following pages. In most of our compute workloads, the gpu needs to process big chunks of data, e.g., many MiB or even GiB, and the page fault overhead is huge per our measurements.

Do you suggest per-page population? Or do you suggest populating the entire address space or the entire memory region? I did look at the RDMA odp code. The function ib_umem_odp_map_dma_and_lock is also a range-based population. It seems it populates the whole memory region each time, but I am not very sure. 

> 
> > The gpu page table update, invalidation are all vma-based. This
> > concept exists before this svm work. For svm, we create a 2M (the
> > size is user configurable) vma during gpu page fault handler and
> > register this 2M range to mmu interval notifier.
> 
> You can create VMAs, but they can't be fully populated and they should
> be alot bigger than 2M. ODP uses a similar 2 level scheme for it's
> SVA, the "vma" granual is 512M.

Oh, I see. So you are suggesting a much bigger granularity. That makes sense to me. Our design actually supports a much bigger granularity. The migration/population granularity is configurable in our design. It is a memory advise API, and one of the advises is called "MIGRATION_GRANULARITY". This part of the code is not in my series yet as it is being worked on by Himal, who is also on this email list. We will publish that work soon for review. 

> 
> The key thing is the VMA is just a container for the notifier and
> other data structures, it doesn't insist the range be fully populated
> and it must support page-by-page unmap/remap/invalidateion.

Agreed, and I don't see a hard conflict between our implementation and this concept. The linux core mm vma (struct vm_area_struct) represents an address range in the process's address space. The xe driver would create some xe_vmas to cover a sub-region of a core mm vma. For example, if the core mm vma is 1GiB, we might create xe_vmas of 512MiB or 2MiB, depending on our MIGRATION_GRANULARITY setting. 

As explained, we can support page-by-page map/unmap. Our design makes sure we can map/unmap at any page boundary if we want. The granularity setting all depends on performance data.


> 
> > Now I try to figure out if we don't create vma, what can we do? We
> > can map one page (which contains the gpu fault address) to gpu page
> > table. But that doesn't work for us because the GPU cache and TLB
> > would not be performant for 4K page each time. One way to think of
> > the vma is a chunk size which is good for GPU HW performance.
> 
> From this perspective invalidation is driven by the range the
> invalidation callback gets, it could be a single page, but probably
> bigger.

Agree


> 
> mapping is driven by the range passed to hmm_range_fault() during
> fault handling, which is entirely based on your drivers prefetch
> logic.

In our driver, mapping can be triggered by either prefetch or fault. 

Prefetch is a user API, so the user can decide the range.

The range used on a fault is decided by the MIGRATION_GRANULARITY user setting. The default value is 2MiB, as said. 


> 
> GPU TLB invalidation sequences should happen once, at the end, for any
> invalidation or range_fault sequence regardless. Nothing about "gpu
> vmas" should have anything to do with this.
> 
> > And the mmu notifier... if we don't register the mmu notifier on the
> > fly, do we register one mmu notifier to cover the whole CPU virtual
> > address space (which would be huge, e.g., 0~2^56 on a 57 bit
> > machine, if we have half half user space kernel space split)? That
> > won't be performant as well because for any address range that is
> > unmapped from cpu program, but even if they are never touched by
> > GPU, gpu driver still got a invalidation callback. In our approach,
> > we only register a mmu notifier for address range that we know gpu
> > would touch it.
> 
> When SVA is used something, somewhere, has to decide if a CPU VA
> intersects with a HW VA.
> 
> The mmu notifiers are organized in an interval (red/black) tree, so if
> you have a huge number of them the RB search becomes very expensive.
> 
> Conversely, your GPU page table is organized in a radix tree, so
> detecting non-presence during invalidation is a simple radix
> walk. Indeed for the obviously unused ranges it is probably a pointer
> load and single de-ref to check it.
> 
> I fully expect the radix walk is much, much faster than a huge number
> of 2M notifiers in the red/black tree.
> 
> Notifiers for SVA cases should be giant. If not the entire memory
> space, then at least something like 512M/1G kind of size, neatly
> aligned with something in your page table structure so the radix walk
> can start lower in the tree automatically.

In our implementation, our page table is not organized as a radix tree. Maybe this is an area we can improve. For invalidation, we don't need to walk the page table to figure out which range is present in the page table, since we only register an mmu notifier when a range is actually mapped to the gpu page table. So every notifier callback is for a valid range in the gpu page table.

I agree many small 2M notifiers can slow down the red/black tree walk on the core mm side. But with a giant notifier, core mm calls back into the driver many more times than with small notifiers - the driver would be called back even for a range that is not mapped to the gpu page table.

So I am not sure which approach is faster. But I can experiment with this.


> 
> > > > For example, when gpu page fault happens, you look
> > > > up the cpu vm_area_struct for the fault address and create a
> > > > corresponding state/struct. And people can have as many cpu
> > > > vm_area_struct as they want.
> > >
> > > No you don't.
> >
> > See explains above. I need your help to understand how we can do it
> > without a vma (or chunk). One thing GPU driver is different from
> > RDMA driver is, RDMA doesn't have device private memory so no
> > migration.
> 
> I want to be very clear, there should be no interaction of your
> hmm_range_fault based code with MM centric VMAs. You MUST NOT look
> up
> the CPU vma_area_struct in your driver.

Without looking up the cpu vma, we can't even decide whether our gpu accessed a valid address or not.

When the GPU accesses an address, valid or not (e.g., an out-of-bound access), the hardware generates a page fault. If we don't look up the cpu vma, how do we determine whether the gpu made an out-of-bound access?

Furthermore, we call the hmm helper migrate_vma_setup for migration, which takes a struct migrate_vma parameter. Migrate_vma has a vma field. If we don't look up the cpu vma, how do we get this field?

Oak


> 
> > It only need to dma-map the system memory pages and use
> > them to fill RDMA page table. so RDMA don't need another memory
> > manager such as our buddy. RDMA only deal with system memory which
> > is completely struct page based management. Page by page make 100 %
> > sense for RDMA.
> 
> I don't think this is the issue at hand, you just have some historical
> poorly designed page table manipulation code from what I can
> understand..
> 
> Jason


* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-24 23:59                     ` Zeng, Oak
@ 2024-04-25  1:05                       ` Jason Gunthorpe
  2024-04-26  9:55                         ` Thomas Hellström
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-25  1:05 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: dri-devel, intel-xe, Brost, Matthew, Thomas.Hellstrom, Welty,
	Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Wed, Apr 24, 2024 at 11:59:18PM +0000, Zeng, Oak wrote:
> Hi Jason,
> 
> I went through the conversation b/t you and Matt. I think we are pretty much aligned. Here is what I get from this thread:
> 
> 1) hmm range fault size / gpu page table map size: you prefer a bigger
> gpu vma size, with the vma sparsely mapped to the gpu. Our vma size is
> configurable through a user madvise api. 

That is even worse! It is not a user tunable in any way, shape or form!

> 2) invalidation: you prefer a giant notifier. We can consider this if
> it turns out our implementation is not performant. Currently we
> don't know.

It is the wrong way to use the API to have many small notifiers;
please don't use it wrong.
 
> 3) whether the driver can look up the cpu vma: I think we need this for data migration purposes.

The migration code may, but not the SVA/hmm_range_fault code. 

> > > What do you mean by "page by page"/"no API surface available to
> > > safely try to construct covering ranges"? As I understand it,
> > > hmm_range_fault takes a virtual address range (defined in the
> > > hmm_range struct) and walks the cpu page table in this range. It is
> > > a range-based API.
> > 
> > Yes, but invalidation is not linked to range_fault so you can get
> > invalidations of single pages. You are binding range_fault to
> > dma_map_sg but then you can't handle invalidation at all sanely.
> 
> Ok, I understand your point now.
> 
> Yes, strictly speaking we can get invalidation of a single page. This
> can be triggered by core mm numa balancing or ksm (kernel samepage
> merging). At present, my understanding is that single page (or a few
> pages) invalidation is not a very common case. The more common cases
> are invalidation triggered by user munmap, or invalidation triggered
> by hmm migration itself (triggered in migrate_vma_setup). I will
> experiment with this.

Regardless, it must be handled, and unmapping an entire 'gpu vma' is a
horrible implementation of HMM.

> I agree that in the case of single page invalidation, the current code
> is not performant because we invalidate the whole vma. What I can do
> is look at the mmu_notifier_range parameter of the invalidation
> callback and only invalidate that range. This requires our driver to
> split the vma state and page table state. It is a big change. We plan
> to do it in stage 2.

Which is, again, continuing to explain why this VMA based approach
is a completely wrong way to use hmm.

> > I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> > and you do need to ensure the page table manipulation has an API that
> > is usable. "populate an entire VMA" / "invalidate an entire VMA" is
> > not a sufficient primitive.
> 
> I understand that invalidating an entire VMA might not be performant.
> I will improve it as explained above.

Please stop saying performant. There are correct ways to use hmm and
bad ways. What you are doing is a bad way. Even if the performance is
only a little different it is still the kind of horrible code
structure I don't want to see in new hmm users.

> I think whether we want to populate an entire VMA or only one page is
> still the driver's choice. 

hmm_range_fault() should be driven with a prefetch/fault-around
scheme. This has nothing to do with a durable VMA record, and it
addresses these concerns.

> Do you suggest per-page population? Or do you suggest populating the
> entire address space or the entire memory region? I did look at the
> RDMA odp code. The function ib_umem_odp_map_dma_and_lock is also a
> range-based population. It seems it populates the whole memory region
> each time, but I am not very sure.

Performance here is very complicated. You often want to allow
userspace to prefetch data into the GPU page table to warm it up to
avoid page faults, as faults are generally super slow and hard to
manage performantly. ODP has many options to control this at a fine
granularity.

> > You can create VMAs, but they can't be fully populated and they should
> > be a lot bigger than 2M. ODP uses a similar 2-level scheme for its
> > SVA, the "vma" granule is 512M.
> 
> Oh, I see. So you are suggesting a much bigger granularity. That
> makes sense to me. Our design actually supports a much bigger
> granularity. The migration/population granularity is configurable in
> our design. It is a memory advise API and one of the advises is
> called "MIGRATION_GRANULARITY".

I don't think the notifier granule should be user tunable, you are
actually doing something different - it sounds like it mostly acts as
a prefetch tunable.

> > The key thing is the VMA is just a container for the notifier and
> > other data structures, it doesn't insist the range be fully populated
> > and it must support page-by-page unmap/remap/invalidation.
> 
> Agree and I don't see a hard conflict of our implementation to this
> concept. So the linux core mm vma (struct vm_area_struct) represent
> an address range in the process's address space. Xe driver would
> create some xe_vma to cover a sub-region of a core mm vma. 

No again and again NO. hmm_range_fault users are not, and must not be
linked to CPU VMAs ever.

> > mapping is driven by the range passed to hmm_range_fault() during
> > fault handling, which is entirely based on your drivers prefetch
> > logic.
> 
> In our driver, mapping can be triggered by either prefetch or fault. 
> 
> Prefetch is a user API so user can decide the range.

> The range used in fault is decided by MIGRATION_GRANULARITY user
> setting. The default value is 2MiB as said.

Which is basically turning it into a fault prefetch tunable.
 
> In our implementation, our page table is not organized as a radix
> tree. 

Huh? What does the HW in the GPU do to figure out how to issue DMAs? Your
GPU HW doesn't have a radix tree walker you are configuring?

> Maybe this is an area we can improve. For invalidation, we don't
> need to walk the page table to figure out which range is present in
> the page table, since we only register a mmu notifier when a range is
> actually mapped to the gpu page table. So all notifier callbacks are
> for a valid range in the gpu page table.

But this is not page by page.

> I agree many small 2M notifiers can slow down the red/black tree
> walk on the core mm side. But with a giant notifier, core mm would
> call back into the driver many more times than with small notifiers -
> the driver would be called back even if a range is not mapped to the
> gpu page table.

The driver should do a cheaper check than a red/black tree lookup to
decide if it needs to do any work, e.g. with a typical radix page table,
or some kind of fast sparse bitmap. At a 2M granule and 100GB workloads
these days this is an obviously losing idea.
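
As a rough sketch of the kind of cheap check meant here (the structure
and names are hypothetical, not from this series): one bit per 2M
granule of the mirrored space, set when that granule has GPU PTEs, so
the notifier callback can bail out without touching any rb tree:

static bool gpuvm_granule_mapped(struct my_gpuvm *vm, unsigned long addr)
{
        /* One bit per 2M granule; set when GPU PTEs exist there. */
        unsigned long bit = (addr - vm->start) >> 21;

        return test_bit(bit, vm->mapped_bitmap);
}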

> > I want to be very clear, there should be no interaction of your
> > hmm_range_fault based code with MM centric VMAs. You MUST NOT look
> > up
> > the CPU vma_area_struct in your driver.
> 
> Without looking up the cpu vma, we can't even decide whether our gpu
> accessed a valid address or not.

hmm_range_fault provides this for you.
 
> Furthermore, we call the hmm helper migrate_vma_setup for migration,
> which takes a struct migrate_vma parameter. struct migrate_vma has a
> vma field. If we don't look up the cpu vma, how do we get this field?

Migration is a different topic. The vma lookup for a migration has
nothing to do with hmm_range_fault and SVA.

(and perhaps arguably this is structured wrong as you probably want
hmm_range_fault to do the migrations for you as it walks the range to
avoid double walks and things)

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-25  1:05                       ` Jason Gunthorpe
@ 2024-04-26  9:55                         ` Thomas Hellström
  2024-04-26 12:00                           ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Hellström @ 2024-04-26  9:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Zeng, Oak
  Cc: dri-devel, intel-xe, Brost, Matthew, Welty, Brian, Ghimiray,
	Himal Prasad, Bommu, Krishnaiah, Vishwanathapura, Niranjana,
	Leon Romanovsky

Hi, Jason.

I've quickly read through the discussion here and have a couple of
questions and clarifications to hopefully help moving forward on
aligning on an approach for this.

Let's for simplicity initially ignore migration and assume this is on
integrated hardware, since it seems the disconnect is around the
hmm_range_fault() usage.

First, the gpu_vma structure is something that partitions the gpu_vm
that holds gpu-related range metadata, like what to mirror, desired gpu
caching policies etc. These are managed (created, removed and split)
mainly from user-space. These are stored and looked up from an rb-tree.

Each such mirroring gpu_vma holds an mmu_interval notifier.

For invalidation-only purposes, the mmu_interval seqno is not tracked.
An invalidation thus only zaps page-table entries, causing subsequent
accesses to fault. Hence for this purpose, having a single notifier
that covers a huge range is desirable and does not become a problem.
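
As an illustration, a minimal sketch of such a single wide notifier
using the mmu_interval notifier API (the gpu_vm structure, lock and
zap helper are hypothetical, not the actual Xe code):

static bool gpuvm_invalidate(struct mmu_interval_notifier *mni,
                             const struct mmu_notifier_range *range,
                             unsigned long cur_seq)
{
        struct my_gpuvm *vm = container_of(mni, struct my_gpuvm, notifier);

        if (mmu_notifier_range_blockable(range))
                mutex_lock(&vm->notifier_lock);
        else if (!mutex_trylock(&vm->notifier_lock))
                return false;

        mmu_interval_set_seq(mni, cur_seq);
        /* Zap only; a later GPU fault repopulates the PTEs. */
        gpuvm_zap_ptes(vm, range->start, range->end);
        mutex_unlock(&vm->notifier_lock);
        return true;
}

static const struct mmu_interval_notifier_ops gpuvm_mni_ops = {
        .invalidate = gpuvm_invalidate,
};

/* Registration, e.g. at bind time - one notifier for the whole
 * mirrored range:
 *
 *      err = mmu_interval_notifier_insert(&vm->notifier, current->mm,
 *                                         vm->start, vm->size,
 *                                         &gpuvm_mni_ops);
 */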

Now, when we hit a fault, we want to use hmm_range_fault() to re-
populate the faulting PTE, but also to pre-fault a range. Using a range
here (let's call this a prefault range for clarity) rather than to
insert a single PTE is for multiple reasons:

1) avoid subsequent adjacent faults
2a) Using huge GPU page-table entries.
2b) Updating the GPU page-table (tree-based and multi-level) becomes
more efficient when using a range.

Here, depending on hardware, 2a might be more or less crucial for GPU
performance. 2b somewhat ties into 2a but otherwise does not affect gpu
performance.

This is why we've been using dma_map_sg() for these ranges, since it is
assumed the benefits gained from 2) above by far outweigh any benefit
from finer-granularity dma-mappings on the rarer occasion of faults.
Are there other benefits from single-page dma mappings that you think
we need to consider here?

Second, when pre-faulting a range like this, the mmu interval notifier
seqno comes into play, until the gpu ptes for the prefault range are
safely in place. Now if an invalidation happens in a completely
separate part of the mirror range, it will bump the seqno and force us
to rerun the fault processing unnecessarily. Hence, for this purpose we
ideally just want to get a seqno bump covering the prefault range.
That's why finer-granularity mmu_interval notifiers might be beneficial
(and then cached for future re-use of the same prefault range). This
leads me to the next question:

You mention that mmu_notifiers are expensive to register. From looking
at the code it seems *mmu_interval* notifiers are cheap unless there
are ongoing invalidations in which case using a gpu_vma-wide notifier
would block anyway? Could you clarify a bit more the cost involved
here? If we don't register these smaller-range interval notifiers, do
you think the seqno bumps from unrelated subranges would be a real
problem?

Finally the size of the pre-faulting range is something we need to
tune. Currently it is cpu vma - wide. I understand you strongly suggest
this should be avoided. Could you elaborate a bit on why this is such a
bad choice?

Thanks,
Thomas



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-26  9:55                         ` Thomas Hellström
@ 2024-04-26 12:00                           ` Jason Gunthorpe
  2024-04-26 14:49                             ` Thomas Hellström
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-26 12:00 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Zeng, Oak, dri-devel, intel-xe, Brost, Matthew, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> First, the gpu_vma structure is something that partitions the gpu_vm
> that holds gpu-related range metadata, like what to mirror, desired gpu
> caching policies etc. These are managed (created, removed and split)
> mainly from user-space. These are stored and looked up from an rb-tree.

Except we are talking about SVA here, so all of this should not be
exposed to userspace.

> Now, when we hit a fault, we want to use hmm_range_fault() to re-
> populate the faulting PTE, but also to pre-fault a range. Using a range
> here (let's call this a prefault range for clarity) rather than to
> insert a single PTE is for multiple reasons:

I've never said to do a single range, everyone using hmm_range_fault()
has some kind of prefetch/populate-around algorithm.

> This is why we've been using dma_map_sg() for these ranges, since it is
> assumed the benefits gained from 

This doesn't logically follow. You need to use dma_map_page page by
page and batch that into your update mechanism.

If you use dma_map_sg you get into the world of wrongness where you
have to track ranges and invalidation has to wipe an entire range -
because you cannot do a dma unmap of a single page from a dma_map_sg
mapping. This is all the wrong way to use hmm_range_fault.

hmm_range_fault() is page table mirroring, it fundamentally must be
page-by-page. The target page table structure must have similar
properties to the MM page table - especially page by page
validate/invalidate. Meaning you cannot use dma_map_sg().
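
To sketch the difference (assumed driver context, not code from this
series): each page gets its own mapping, batched into the page-table
update, so invalidation can later unmap exactly the pages the notifier
names:

static int map_range_pages(struct device *dev, struct hmm_range *range,
                           dma_addr_t *dma_addr, unsigned long npages)
{
        unsigned long i;

        for (i = 0; i < npages; i++) {
                struct page *page = hmm_pfn_to_page(range->hmm_pfns[i]);

                /* One mapping per page, unlike one dma_map_sg() blob. */
                dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
                                           DMA_BIDIRECTIONAL);
                if (dma_mapping_error(dev, dma_addr[i]))
                        return -EFAULT;
        }
        return 0;       /* dma_addr[] now feeds the PTE update batch */
}

On invalidation the affected pages can be individually
dma_unmap_page()'d, which a single dma_map_sg() mapping cannot do.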

> Second, when pre-faulting a range like this, the mmu interval notifier
> seqno comes into play, until the gpu ptes for the prefault range are
> safely in place. Now if an invalidation happens in a completely
> separate part of the mirror range, it will bump the seqno and force us
> to rerun the fault processing unnecessarily. 

This is how hmm_range_fault() works. Drivers should not do hacky
things to try to "improve" this. SVA granules should be large, maybe
not the entire MM, but still quite large. 2M is far too small.

There is a tradeoff here of slowing down the entire MM vs risking an
iteration during fault processing. We want to err toward making fault
processing slower because fault processing is already really slow.
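
For reference, this is the standard retry pattern around
hmm_range_fault() (the surrounding structure and lock name are
hypothetical); an invalidation anywhere in the notifier's range just
sends the driver around the loop again before the GPU PTEs are
committed:

static int fault_and_map(struct my_gpuvm *vm, struct mm_struct *mm,
                         struct hmm_range *range)
{
        int ret;

again:
        range->notifier_seq = mmu_interval_read_begin(range->notifier);
        mmap_read_lock(mm);
        ret = hmm_range_fault(range);
        mmap_read_unlock(mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                return ret;
        }

        mutex_lock(&vm->notifier_lock);
        if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
                /* Some invalidation raced us; just iterate again. */
                mutex_unlock(&vm->notifier_lock);
                goto again;
        }
        /* ... commit the GPU PTEs here ... */
        mutex_unlock(&vm->notifier_lock);
        return 0;
}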

> Hence, for this purpose we
> ideally just want to get a seqno bump covering the prefault range.

Ideally, but this is not something we can get for free.

> That's why finer-granularity mmu_interval notifiers might be beneficial
> (and then cached for future re-use of the same prefault range). This
> leads me to the next question:

It is not the design, please don't invent crazy special Intel things
on top of hmm_range_fault.

> You mention that mmu_notifiers are expensive to register. From looking
> at the code it seems *mmu_interval* notifiers are cheap unless there
> are ongoing invalidations in which case using a gpu_vma-wide notifier
> would block anyway? Could you clarify a bit more the cost involved
> here?

The rb tree insertions become expensive the larger the tree is. If you
have only a couple of notifiers it is reasonable.

> If we don't register these smaller-range interval notifiers, do
> you think the seqno bumps from unrelated subranges would be a real
> problem?

I don't think it is, you'd need to have a workload which was
aggressively manipulating the CPU mm (which is also pretty slow). If
the workload is doing that then it also really won't like being slowed
down by the giant rb tree. 

You can't win with an argument that collisions are likely due to an
app pattern that puts a lot of stress on the MM, so the right response
is to make the MM slower.

> Finally the size of the pre-faulting range is something we need to
> tune. 

Correct.

> Currently it is cpu vma - wide. I understand you strongly suggest
> this should be avoided. Could you elaborate a bit on why this is such a
> bad choice?

Why would a prefetch have anything to do with a VMA? Ie your app calls
malloc() and gets a little allocation out of a giant mmap() arena -
you want to prefault the entire arena? Does that really make any
sense?

Mirroring is a huge PITA, IMHO it should be discouraged in favour of
SVA. Sadly too many CPUs still cannot implement SVA.

With mirroring there is no good way for the system to predict what the
access pattern is. The only way to make this actually performant is
for userspace to explicitly manage the mirroring with some kind of
prefetching scheme to avoid faulting its accesses except in
extraordinary cases.

VMA is emphatically not a hint about what to prefetch. You should
balance your prefetching based on HW performance and related. If it is
tidy for HW to fault around a 2M granule then just do that.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-26 12:00                           ` Jason Gunthorpe
@ 2024-04-26 14:49                             ` Thomas Hellström
  2024-04-26 16:35                               ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Hellström @ 2024-04-26 14:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zeng, Oak, dri-devel, intel-xe, Brost, Matthew, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > First, the gpu_vma structure is something that partitions the
> > gpu_vm
> > that holds gpu-related range metadata, like what to mirror, desired
> > gpu
> > caching policies etc. These are managed (created, removed and
> > split)
> > mainly from user-space. These are stored and looked up from an rb-
> > tree.
> 
> Except we are talking about SVA here, so all of this should not be
> exposed to userspace.

I think you are misreading. This is on the level "Mirror this region of
the cpu_vm", "prefer this region placed in VRAM", "GPU will do atomic
accesses on this region", very similar to cpu mmap / munmap and
madvise. What I'm trying to say here is that this does not directly
affect the SVA except whether to do SVA or not, and in that case what
region of the CPU mm will be mirrored, and in addition, any gpu
attributes for the mirrored region.

> 
> > Now, when we hit a fault, we want to use hmm_range_fault() to re-
> > populate the faulting PTE, but also to pre-fault a range. Using a
> > range
> > here (let's call this a prefault range for clarity) rather than to
> > insert a single PTE is for multiple reasons:
> 
> I've never said to do a single range, everyone using
> hmm_range_fault()
> has some kind of prefetch/populate-around algorithm.
> 
> > This is why we've been using dma_map_sg() for these ranges, since
> > it is
> > assumed the benefits gained from 
> 
> This doesn't logically follow. You need to use dma_map_page page by
> page and batch that into your update mechanism.
> 
> If you use dma_map_sg you get into the world of wrongness where you
> have to track ranges and invalidation has to wipe an entire range -
> because you cannot do a dma unmap of a single page from a dma_map_sg
> mapping. This is all the wrong way to use hmm_range_fault.
> 
> hmm_range_fault() is page table mirroring, it fundamentally must be
> page-by-page. The target page table structure must have similar
> properties to the MM page table - especially page by page
> validate/invalidate. Meaning you cannot use dma_map_sg().

To me this is purely an optimization to make the driver page-table and
hence the GPU TLB benefit from iommu coalescing / large pages and large
driver PTEs. It is true that invalidation will sometimes shoot down
large gpu ptes unnecessarily but it will not put any additional burden
on the core AFAICT. For the dma mappings themselves they aren't touched
on invalidation since zapping the gpu PTEs effectively stops any dma
accesses. The dma mappings are rebuilt on the next gpu pagefault,
which, as you mention, are considered slow anyway, but will probably
still reuse the same prefault region, hence needing to rebuild the dma
mappings anyway.

So as long as we are correct and do not adversely affect core mm, if
the gpu performance (for whatever reason) is severely hampered if
large gpu page-table-entries are not used, couldn't this be considered
left to the driver?

And a related question. What about THP pages? OK to set up a single
dma-mapping to those?


> 
> > Second, when pre-faulting a range like this, the mmu interval
> > notifier
> > seqno comes into play, until the gpu ptes for the prefault range
> > are
> > safely in place. Now if an invalidation happens in a completely
> > separate part of the mirror range, it will bump the seqno and force
> > us
> > to rerun the fault processing unnecessarily. 
> 
> This is how hmm_range_fault() works. Drivers should not do hacky
> things to try to "improve" this. SVA granules should be large, maybe
> not the entire MM, but still quite large. 2M is far too small.
> 
> There is a tradeoff here of slowing down the entire MM vs risking an
> iteration during fault processing. We want to err toward making fault
> processing slower because fault processing is already really slow.
> 
> > Hence, for this purpose we
> > ideally just want to get a seqno bump covering the prefault range.
> 
> Ideally, but this is not something we can get for free.
> 
> > That's why finer-granularity mmu_interval notifiers might be
> > beneficial
> > (and then cached for future re-use of the same prefault range).
> > This
> > leads me to the next question:
> 
> It is not the design, please don't invent crazy special Intel things
> on top of hmm_range_fault.

For the record, this is not a "crazy special Intel" invention. It's the
way all GPU implementations do this so far. We're currently catching
up. If we're going to do this in another way, we fully need to
understand why it's a bad thing to do. That's why these questions are
asked.

> 
> > You mention that mmu_notifiers are expensive to register. From
> > looking
> > at the code it seems *mmu_interval* notifiers are cheap unless
> > there
> > are ongoing invalidations in which case using a gpu_vma-wide
> > notifier
> > would block anyway? Could you clarify a bit more the cost involved
> > here?
> 
> The rb tree insertions become expensive the larger the tree is. If
> you
> have only a couple of notifiers it is reasonable.
> 
> > If we don't register these smaller-range interval notifiers, do
> > you think the seqno bumps from unrelated subranges would be a real
> > problem?
> 
> I don't think it is, you'd need to have a workload which was
> aggressively manipulating the CPU mm (which is also pretty slow). If
> the workload is doing that then it also really won't like being
> slowed
> down by the giant rb tree.

OK, this makes sense, and will also simplify implementation.

> 
> You can't win with an argument that collisions are likely due to an
> app pattern that puts a lot of stress on the MM, so the right response
> is to make the MM slower.
> 
> > Finally the size of the pre-faulting range is something we need to
> > tune. 
> 
> Correct.
> 
> > Currently it is cpu vma - wide. I understand you strongly suggest
> > this should be avoided. Could you elaborate a bit on why this is
> > such a
> > bad choice?
> 
> Why would a prefetch have anything to do with a VMA? Ie your app
> calls
> malloc() and gets a little allocation out of a giant mmap() arena -
> you want to prefault the entire arena? Does that really make any
> sense?

Personally, no it doesn't. I'd rather use some sort of fixed-size
chunk. But to rephrase, the question was more into the strong "drivers
should not be aware of the cpu mm vma structures" comment. While I
fully agree they are probably not very useful for determining the size
of gpu prefault regions, is there anything else we should be aware of
here?

Thanks,
Thomas


> 
> Mirroring is a huge PITA, IMHO it should be discouraged in favour of
> SVA. Sadly too many CPUs still cannot implement SVA.
> 
> With mirroring there is no good way for the system to predict what
> the
> access pattern is. The only way to make this actually performant is
> for userspace to explicitly manage the mirroring with some kind of
> prefetching scheme to avoid faulting its accesses except in
> extraordinary cases.
> 
> VMA is emphatically not a hint about what to prefetch. You should
> balance your prefetching based on HW performance and related. If it
> is
> tidy for HW to fault around a 2M granule then just do that.
> 
> Jason


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-26 14:49                             ` Thomas Hellström
@ 2024-04-26 16:35                               ` Jason Gunthorpe
  2024-04-29  8:25                                 ` Thomas Hellström
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-26 16:35 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Zeng, Oak, dri-devel, intel-xe, Brost, Matthew, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Fri, Apr 26, 2024 at 04:49:26PM +0200, Thomas Hellström wrote:
> On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > > First, the gpu_vma structure is something that partitions the
> > > gpu_vm
> > > that holds gpu-related range metadata, like what to mirror, desired
> > > gpu
> > > caching policies etc. These are managed (created, removed and
> > > split)
> > > mainly from user-space. These are stored and looked up from an rb-
> > > tree.
> > 
> > Except we are talking about SVA here, so all of this should not be
> > exposed to userspace.
> 
> I think you are misreading. this is on the level "Mirror this region of
> the cpu_vm", "prefer this region placed in VRAM", "GPU will do atomic
> accesses on this region", very similar to cpu mmap / munmap and
> madvise. What I'm trying to say here is that this does not directly
> affect the SVA except whether to do SVA or not, and in that case what
> region of the CPU mm will be mirrored, and in addition, any gpu
> attributes for the mirrored region.

SVA is where you bind the whole MM and device faults dynamically populate
the mirror page table. There aren't non-SVA regions. Metadata, like
you describe, is metadata for the allocation/migration mechanism, not
for the page table, and has nothing to do with the SVA mirror
operation.

Yes there is another common scheme where you bind a window of CPU to a
window on the device and mirror a fixed range, but this is a quite
different thing. It is not SVA, it has a fixed range, and it is
probably bound to a single GPU VMA in a multi-VMA device page table.

SVA is not just a whole bunch of windows being dynamically created by
the OS, that is entirely the wrong mental model. It would be horrible
to expose to userspace something like that as uAPI. Any hidden SVA
granules and other implementation specific artifacts must not be made
visible to userspace!!

> > If you use dma_map_sg you get into the world of wrongness where you
> > have to track ranges and invalidation has to wipe an entire range -
> > because you cannot do a dma unmap of a single page from a dma_map_sg
> > mapping. This is all the wrong way to use hmm_range_fault.
> > 
> > hmm_range_fault() is page table mirroring, it fundamentally must be
> > page-by-page. The target page table structure must have similar
> > properties to the MM page table - especially page by page
> > validate/invalidate. Meaning you cannot use dma_map_sg().
> 
> To me this is purely an optimization to make the driver page-table and
> hence the GPU TLB benefit from iommu coalescing / large pages and large
> driver PTEs.

This is a different topic. Leon is working on improving the DMA API to
get these kinds of benefits for HMM users. dma_map_sg is not the path
to get this. Leon's work should be significantly better in terms of
optimizing IOVA contiguity for a GPU use case. You can get a
guaranteed DMA contiguity at your chosen granule level, even up to
something like 512M.

> It is true that invalidation will sometimes shoot down
> large gpu ptes unnecessarily but it will not put any additional burden
> on the core AFAICT. 

In my experience people doing performance workloads don't enable the
IOMMU due to the high performance cost, so while optimizing iommu
coalescing is sort of interesting, it is not as important as using the
APIs properly and not harming the much more common situation when
there is no iommu and there is no artificial contiguity.

> on invalidation since zapping the gpu PTEs effectively stops any dma
> accesses. The dma mappings are rebuilt on the next gpu pagefault,
> which, as you mention, are considered slow anyway, but will probably
> still reuse the same prefault region, hence needing to rebuild the dma
> mappings anyway.

This is bad too. The DMA should not remain mapped after pages have
been freed, it completely destroys the concept of IOMMU enforced DMA
security and the ACPI notion of untrusted external devices.

> So as long as we are correct and do not adversely affect core mm, If
> the gpu performance (for whatever reason) is severely hampered if
> large gpu page-table-entries are not used, couldn't this be considered
> left to the driver?

Please use the APIs properly. We are trying to improve the DMA API to
better support HMM users, and doing unnecessary things like this in
drivers is only harmful to that kind of consolidation.

There is nothing stopping getting large GPU page table entries for
large CPU page table entries.

> And a related question. What about THP pages? OK to set up a single
> dma-mapping to those?

Yes, THP is still a page and dma_map_page() will map it.
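
E.g., a minimal sketch (assumed driver context, names hypothetical): a
2M THP is a single compound page, so one call covers it:

/* thp_page is assumed to be the head page of a 2M THP. */
dma_addr_t daddr = dma_map_page(dev, thp_page, 0, HPAGE_PMD_SIZE,
                                DMA_BIDIRECTIONAL);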
 
> > > That's why finer-granularity mmu_interval notifiers might be
> > > beneficial
> > > (and then cached for future re-use of the same prefault range).
> > > This
> > > leads me to the next question:
> > 
> > It is not the design, please don't invent crazy special Intel things
> > on top of hmm_range_fault.
> 
> For the record, this is not a "crazy special Intel" invention. It's the
> way all GPU implementations do this so far.

"all GPU implementations" you mean AMD, and AMD predates alot of the
modern versions of this infrastructure IIRC.

> > Why would a prefetch have anything to do with a VMA? Ie your app
> > calls
> > malloc() and gets a little allocation out of a giant mmap() arena -
> > you want to prefault the entire arena? Does that really make any
> > sense?
> 
> Personally, no it doesn't. I'd rather use some sort of fixed-size
> chunk. But to rephrase, the question was more into the strong "drivers
> should not be aware of the cpu mm vma structures" comment. 

But this is essentially why - there is nothing useful the driver can
possibly learn from the CPU VMA to drive
hmm_range_fault(). hmm_range_fault() already has to walk the VMAs; if
someday something is actually needed it needs to be integrated in a
general way, not by having the driver touch vmas directly.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-26 16:35                               ` Jason Gunthorpe
@ 2024-04-29  8:25                                 ` Thomas Hellström
  2024-04-30 17:30                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Hellström @ 2024-04-29  8:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zeng, Oak, dri-devel, intel-xe, Brost, Matthew, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Fri, 2024-04-26 at 13:35 -0300, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 04:49:26PM +0200, Thomas Hellström wrote:
> > On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> > > On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > > > First, the gpu_vma structure is something that partitions the
> > > > gpu_vm
> > > > that holds gpu-related range metadata, like what to mirror,
> > > > desired
> > > > gpu
> > > > caching policies etc. These are managed (created, removed and
> > > > split)
> > > > mainly from user-space. These are stored and looked up from an
> > > > rb-
> > > > tree.
> > > 
> > > Except we are talking about SVA here, so all of this should not
> > > be
> > > exposed to userspace.
> > 
> > I think you are misreading. this is on the level "Mirror this
> > region of
> > the cpu_vm", "prefer this region placed in VRAM", "GPU will do
> > atomic
> > accesses on this region", very similar to cpu mmap / munmap and
> > madvise. What I'm trying to say here is that this does not directly
> > affect the SVA except whether to do SVA or not, and in that case
> > what
> > region of the CPU mm will be mirrored, and in addition, any gpu
> > attributes for the mirrored region.
> 
> SVA is where you bind the whole MM and device faults dynamically populate
> the mirror page table. There aren't non-SVA regions. Metadata, like
> you describe, is metadata for the allocation/migration mechanism,
> not
> for the page table, and has nothing to do with the SVA mirror
> operation.



> 
> Yes there is another common scheme where you bind a window of CPU to
> a
> window on the device and mirror a fixed range, but this is a quite
> different thing. It is not SVA, it has a fixed range, and it is
> probably bound to a single GPU VMA in a multi-VMA device page table.

And this above here is exactly what we're implementing, and the GPU
page-tables are populated using device faults. Regions (large) of the
mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
buffer objects.

> 
> SVA is not just a whole bunch of windows being dynamically created by
> the OS, that is entirely the wrong mental model. It would be horrible
> to expose to userspace something like that as uAPI. Any hidden SVA
> granules and other implementation specific artifacts must not be made
> visible to userspace!!

Implementation-specific artifacts are not to be made visible to user-
space.

> 
> > > If you use dma_map_sg you get into the world of wrongness where
> > > you
> > > have to track ranges and invalidation has to wipe an entire range
> > > -
> > > because you cannot do a dma unmap of a single page from a
> > > dma_map_sg
> > > mapping. This is all the wrong way to use hmm_range_fault.
> > > 
> > > hmm_range_fault() is page table mirroring, it fundamentally must
> > > be
> > > page-by-page. The target page table structure must have similar
> > > properties to the MM page table - especially page by page
> > > validate/invalidate. Meaning you cannot use dma_map_sg().
> > 
> > To me this is purely an optimization to make the driver page-table
> > and
> > hence the GPU TLB benefit from iommu coalescing / large pages and
> > large
> > driver PTEs.
> 
> This is a different topic. Leon is working on improving the DMA API
> to
> get these kinds of benefits for HMM users. dma_map_sg is not the path
> to get this. Leon's work should be significantly better in terms of
> optimizing IOVA contiguity for a GPU use case. You can get a
> guaranteed DMA contiguity at your chosen granule level, even up to
> something like 512M.
> 
> > It is true that invalidation will sometimes shoot down
> > large gpu ptes unnecessarily but it will not put any additional
> > burden
> > on the core AFAICT. 
> 
> In my experience people doing performance workloads don't enable the
> IOMMU due to the high performance cost, so while optimizing iommu
> coalescing is sort of interesting, it is not as important as using
> the
> APIs properly and not harming the much more common situation when
> there is no iommu and there is no artificial contiguity.
> 
> > on invalidation since zapping the gpu PTEs effectively stops any
> > dma
> > accesses. The dma mappings are rebuilt on the next gpu pagefault,
> > which, as you mention, are considered slow anyway, but will
> > probably
> > still reuse the same prefault region, hence needing to rebuild the
> > dma
> > mappings anyway.
> 
> This is bad too. The DMA should not remain mapped after pages have
> been freed, it completely destroys the concept of IOMMU enforced DMA
> security and the ACPI notion of untrusted external devices.

Hmm. Yes, good point.

> 
> > So as long as we are correct and do not adversely affect core mm,
> > If
> > the gpu performance (for whatever reason) is severely hampered if
> > large gpu page-table-entries are not used, couldn't this be
> > considered
> > left to the driver?
> 
> Please use the APIs properly. We are trying to improve the DMA API to
> better support HMM users, and doing unnecessary things like this in
> drivers is only harmful to that kind of consolidation.
> 
> There is nothing stopping getting large GPU page table entries for
> large CPU page table entries.
> 
> > And a related question. What about THP pages? OK to set up a single
> > dma-mapping to those?
> 
> Yes, THP is still a page and dma_map_page() will map it.

OK great. This is probably sufficient for the performance concern for
now.

Thanks,
Thomas

>  
> > > > That's why finer-granularity mmu_interval notifiers might be
> > > > beneficial
> > > > (and then cached for future re-use of the same prefault range).
> > > > This
> > > > leads me to the next question:
> > > 
> > > It is not the design, please don't invent crazy special Intel
> > > things
> > > on top of hmm_range_fault.
> > 
> > For the record, this is not a "crazy special Intel" invention. It's
> > the
> > way all GPU implementations do this so far.
> 
> "all GPU implementations" you mean AMD, and AMD predates alot of the
> modern versions of this infrastructure IIRC.
> 
> > > Why would a prefetch have anything to do with a VMA? Ie your app
> > > calls
> > > malloc() and gets a little allocation out of a giant mmap() arena
> > > -
> > > you want to prefault the entire arena? Does that really make any
> > > sense?
> > 
> > Personally, no it doesn't. I'd rather use some sort of fixed-size
> > chunk. But to rephrase, the question was more into the strong
> > "drivers
> > should not be aware of the cpu mm vma structures" comment. 
> 
> But this is essentially why - there is nothing useful the driver can
> possibly learn from the CPU VMA to drive
> hmm_range_fault(). hmm_range_fault() already has to walk the VMAs; if
> someday something is actually needed it needs to be integrated in a
> general way, not by having the driver touch vmas directly.
> 
> Jason



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-29  8:25                                 ` Thomas Hellström
@ 2024-04-30 17:30                                   ` Jason Gunthorpe
  2024-04-30 18:57                                     ` Daniel Vetter
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-04-30 17:30 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Zeng, Oak, dri-devel, intel-xe, Brost, Matthew, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:

> > Yes there is another common scheme where you bind a window of CPU to
> > a
> > window on the device and mirror a fixed range, but this is a quite
> > different thing. It is not SVA, it has a fixed range, and it is
> > probably bound to a single GPU VMA in a multi-VMA device page table.
> 
> And this above here is exactly what we're implementing, and the GPU
> page-tables are populated using device faults. Regions (large) of the
> mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> buffer objects.

Well, not really, if that was the case you'd have a single VMA over
the entire bound range, not dynamically create them.

A single VMA that uses hmm_range_fault() to populate the VM is
completely logical.

Having a hidden range of mm binding and then creating/destroying 2M
VMAs dynamically is the thing that doesn't make a lot of sense.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-30 17:30                                   ` Jason Gunthorpe
@ 2024-04-30 18:57                                     ` Daniel Vetter
  2024-05-01  0:09                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Daniel Vetter @ 2024-04-30 18:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström, Zeng, Oak, dri-devel, intel-xe, Brost,
	Matthew, Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> 
> > > Yes there is another common scheme where you bind a window of CPU to
> > > a
> > > window on the device and mirror a fixed range, but this is a quite
> > > different thing. It is not SVA, it has a fixed range, and it is
> > > probably bound to a single GPU VMA in a multi-VMA device page table.
> > 
> > And this above here is exactly what we're implementing, and the GPU
> > page-tables are populated using device faults. Regions (large) of the
> > mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> > buffer objects.
> 
> Well, not really, if that was the case you'd have a single VMA over
> the entire bound range, not dynamically create them.
> 
> A single VMA that uses hmm_range_fault() to populate the VM is
> completely logical.
> 
> Having a hidden range of mm binding and then creating/destroying 2M
> VMAs dynamically is the thing that doesn't make a lot of sense.

I only noticed this thread now but fyi I did dig around in the
implementation and it's summarily an absolute no-go imo for multiple
reasons. It starts with this approach of trying to mirror cpu vma (which I
think originated from amdkfd) leading to all kinds of locking fun, and
then it gets substantially worse when you dig into the details.

I think until something more solid shows up you can just ignore this. I do
fully agree that for sva the main mirroring primitive needs to be page
centric, so dma_map_sg. There's a bit of a question around how to make the
necessary batching efficient and the locking/mmu_interval_notifier scale
enough, but I had some long chats with Thomas and I think there are enough
options to spawn pretty much any possible upstream consensus. So I'm not
worried.

But first this needs to be page-centric in the fundamental mirroring
approach.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-04-30 18:57                                     ` Daniel Vetter
@ 2024-05-01  0:09                                       ` Jason Gunthorpe
  2024-05-02  8:04                                         ` Daniel Vetter
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-05-01  0:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Thomas Hellström, Zeng, Oak, dri-devel, intel-xe, Brost,
	Matthew, Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> > 
> > > > Yes there is another common scheme where you bind a window of CPU to
> > > > a
> > > > window on the device and mirror a fixed range, but this is a quite
> > > > different thing. It is not SVA, it has a fixed range, and it is
> > > > probably bound to a single GPU VMA in a multi-VMA device page table.
> > > 
> > > And this above here is exactly what we're implementing, and the GPU
> > > page-tables are populated using device faults. Regions (large) of the
> > > mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> > > buffer objects.
> > 
> > Well, not really, if that was the case you'd have a single VMA over
> > the entire bound range, not dynamically create them.
> > 
> > A single VMA that uses hmm_range_fault() to populate the VM is
> > completely logical.
> > 
> > Having a hidden range of mm binding and then creating/destroying 2M
> > VMAs dynamically is the thing that doesn't make a lot of sense.
> 
> I only noticed this thread now but fyi I did dig around in the
> implementation and it's summarily an absolute no-go imo for multiple
> reasons. It starts with this approach of trying to mirror cpu vma (which I
> think originated from amdkfd) leading to all kinds of locking fun, and
> then it gets substantially worse when you dig into the details.

:(

Why does the DRM side struggle so much with hmm_range_fault? I would
have thought it should have a fairly straightforward and logical
connection to the GPU page table.

FWIW, it does make sense to have both a window and a full MM option
for hmm_range_fault. ODP does both and it is fine..

> I think until something more solid shows up you can just ignore this. I do
> fully agree that for sva the main mirroring primitive needs to be page
> centric, so dma_map_sg. 
              ^^^^^^^^^^

dma_map_page

> There's a bit of a question around how to make the
> necessary batching efficient and the locking/mmu_interval_notifier scale
> enough, but I had some long chats with Thomas and I think there are enough
> options to spawn pretty much any possible upstream consensus. So I'm not
> worried.

Sure, the new DMA API will bring some more considerations to this as
well. ODP uses a 512M granule scheme and it seems to be OK. By far the
worst part of all this is the faulting performance. I've yet to hear any
complaints about mmu notifier performance.

> But first this needs to be page-centric in the fundamental mirroring
> approach.

Yes

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-01  0:09                                       ` Jason Gunthorpe
@ 2024-05-02  8:04                                         ` Daniel Vetter
  2024-05-02  9:11                                           ` Thomas Hellström
  0 siblings, 1 reply; 123+ messages in thread
From: Daniel Vetter @ 2024-05-02  8:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Thomas Hellström, Zeng, Oak, dri-devel,
	intel-xe, Brost, Matthew, Welty, Brian, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Vishwanathapura, Niranjana, Leon Romanovsky

On Tue, Apr 30, 2024 at 09:09:15PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> > > 
> > > > > Yes there is another common scheme where you bind a window of CPU to
> > > > > a
> > > > > window on the device and mirror a fixed range, but this is a quite
> > > > > different thing. It is not SVA, it has a fixed range, and it is
> > > > > probably bound to a single GPU VMA in a multi-VMA device page table.
> > > > 
> > > > And this above here is exactly what we're implementing, and the GPU
> > > > page-tables are populated using device faults. Regions (large) of the
> > > > mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> > > > buffer objects.
> > > 
> > > Well, not really, if that was the case you'd have a single VMA over
> > > the entire bound range, not dynamically create them.
> > > 
> > > A single VMA that uses hmm_range_fault() to populate the VM is
> > > completely logical.
> > > 
> > > Having a hidden range of mm binding and then creating/destroying 2M
> > > VMAs dynamically is the thing that doesn't make a lot of sense.
> > 
> > I only noticed this thread now but fyi I did dig around in the
> > implementation and it's summarily an absolute no-go imo for multiple
> > reasons. It starts with this approach of trying to mirror cpu vma (which I
> > think originated from amdkfd) leading to all kinds of locking fun, and
> > then it gets substantially worse when you dig into the details.
> 
> :(
> 
> Why does the DRM side struggle so much with hmm_range_fault? I would
> have thought it should have a fairly straightforward and logical
> connection to the GPU page table.

Short summary is that traditionally gpu memory was managed with buffer
objects, and each individual buffer object owns the page tables for its
va range.

For hmm you don't have that buffer object, and you want the pagetables to
be fairly independent (maybe even with their own locking like linux cpu
pagetables do) from any mapping/backing storage. Getting to that world is
a lot of reshuffling, and so thus far all the code went with the quick
hack route of creating ad-hoc ranges that look like buffer objects to the
rest of the driver code. This includes the merged amdkfd hmm code, and if
you dig around in that it results in some really annoying locking
inversions because that middle layer of fake buffer object lookalikes only
gets in the way and results in a fairly fundamental impedance mismatch
with core linux mm locking.

> FWIW, it does make sense to have both a window and a full MM option
> for hmm_range_fault. ODP does both and it is fine..
> 
> > I think until something more solid shows up you can just ignore this. I do
> > fully agree that for sva the main mirroring primitive needs to be page
> > centric, so dma_map_sg. 
>               ^^^^^^^^^^
> 
> dma_map_page

Oops yes.

> > There's a bit of a question around how to make the
> > necessary batching efficient and the locking/mmu_interval_notifier scale
> > enough, but I had some long chats with Thomas and I think there are enough
> > options to spawn pretty much any possible upstream consensus. So I'm not
> > worried.
> 
> Sure, the new DMA API will bring some more considerations to this as
> well. ODP uses a 512M granule scheme and it seems to be OK. By far the
> worst part of all this is the faulting performance. I've yet to hear any
> complaints about mmu notifier performance.

Yeah I don't expect there to be any need for performance improvements on
the mmu notifier side of things. All the concerns I've heard felt rather
theoretical, or were just fallout of that fake buffer object layer in the
middle.

At worst I guess the gpu pagetables need per-pgtable locking like the cpu
pagetables have, and then maybe keep track of mmu notifier sequence
numbers on a per-pgtable basis, so that invalidates and faults on
different va ranges have no impact on one another. But even that is most
likely way, way down the road.

> > But first this needs to be page-centric in the fundamental mirroring
> > approach.
> 
> Yes

Ok, clarifying consensus on this was the main reason I replied; it felt a
bit like the thread was derailing into details that don't yet matter.

Thanks, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-02  8:04                                         ` Daniel Vetter
@ 2024-05-02  9:11                                           ` Thomas Hellström
  2024-05-02 12:46                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Hellström @ 2024-05-02  9:11 UTC (permalink / raw)
  To: Daniel Vetter, Jason Gunthorpe
  Cc: Zeng, Oak, dri-devel, intel-xe, Brost, Matthew, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

Hi,

Although I haven't had a chance yet to look through the current code,
here are some comments along the way.


On Thu, 2024-05-02 at 10:04 +0200, Daniel Vetter wrote:
> On Tue, Apr 30, 2024 at 09:09:15PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström
> > > > wrote:
> > > > 
> > > > > > Yes there is another common scheme where you bind a window
> > > > > > of CPU to
> > > > > > a
> > > > > > window on the device and mirror a fixed range, but this is
> > > > > > a quite
> > > > > > different thing. It is not SVA, it has a fixed range, and
> > > > > > it is
> > > > > > probably bound to a single GPU VMA in a multi-VMA device
> > > > > > page table.
> > > > > 
> > > > > And this above here is exactly what we're implementing, and
> > > > > the GPU
> > > > > page-tables are populated using device faults. Regions
> > > > > (large) of the
> > > > > mirrored CPU mm need to coexist in the same GPU vm as
> > > > > traditional GPU
> > > > > buffer objects.
> > > > 
> > > > Well, not really, if that was the case you'd have a single VMA
> > > > over
> > > > the entire bound range, not dynamically create them.
> > > > 
> > > > A single VMA that uses hmm_range_fault() to populate the VM is
> > > > completely logical.
> > > > 
> > > > Having a hidden range of mm binding and then
> > > > creating/destroying 2M
> > > > VMAs dynamically is the thing that doesn't make a lot of sense.
> > > 
> > > I only noticed this thread now but fyi I did dig around in the
> > > implementation and it's summarily an absolute no-go imo for
> > > multiple
> > > reasons. It starts with this approach of trying to mirror cpu vma
> > > (which I
> > > think originated from amdkfd) leading to all kinds of locking
> > > fun, and
> > > then it gets substantially worse when you dig into the details.

It's true the cpu vma lookup is a remnant from amdkfd. The idea here is
to replace that with fixed prefaulting ranges of tunable size. So far,
as you mention, the prefaulting range has been determined by the CPU
vma size. Given previous feedback, this is going to change.

Still the prefaulting range needs to be restricted to avoid -EFAULT
failures in hmm_range_fault(). That can ofc be done by calling it
without HMM_PFN_REQ_FAULT for the range and interpreting the returned
pfns. There is a performance concern of this approach as compared to
peeking at the CPU vmas directly, since hmm_range_fault() would need to
be called twice. Any guidelines/ideas here?

The second aspect of this is gpu_vma creation / splitting on fault that
the current implementation has. The plan is to get rid of that as well,
in favour of a sparsely populated gpu_vma. The reason for this
implementation was the easy integration with drm_gpuvm.

Still, I don't see any locking problems here, though, maintaining the lock order

gpu_vm->lock
mmap_lock
dma_resv
reclaim
notifier_lock

throughout the code. What is likely to get somewhat problematic,
though, is VRAM eviction.

/Thomas




> 
> Ok, clarifying consensus on this was the main reason I replied; it felt a
> bit like the thread was derailing into details that don't yet matter.
> 
> Thanks, Sima

/Thomas


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-02  9:11                                           ` Thomas Hellström
@ 2024-05-02 12:46                                             ` Jason Gunthorpe
  2024-05-02 15:01                                               ` Thomas Hellström
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-05-02 12:46 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Daniel Vetter, Zeng, Oak, dri-devel, intel-xe, Brost, Matthew,
	Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Thu, May 02, 2024 at 11:11:04AM +0200, Thomas Hellström wrote:

> It's true the cpu vma lookup is a remnant from amdkfd. The idea here is
> to replace that with fixed prefaulting ranges of tunable size. So far,
> as you mention, the prefaulting range has been determined by the CPU
> vma size. Given previous feedback, this is going to change.

Perhaps limiting prefault to a VMA barrier is a reasonable thing to
do, but the implementation should be pushed into hmm_range_fault and
not open coded in the driver.

> Still the prefaulting range needs to be restricted to avoid -EFAULT
> failures in hmm_range_fault(). That can ofc be done by calling it
> without HMM_PFN_REQ_FAULT for the range and interpreting the returned
> pfns.

Yes, this is exactly what that feature is for, you mark your prefetch
differently from the fault critical page(s).

> There is a performance concern of this approach as compared to
> peeking at the CPU vmas directly, since hmm_range_fault() would need to
> be called twice. Any guidelines/ideas here?

If there is something wrong with hmm_range_fault() then please fix
it. I'm not sure why you'd call it twice, the HMM_PFN_REQ_FAULT is per
PFN?
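
For illustration, a sketch of the per-PFN usage (assumed driver
context; pfns[] and fault_idx are hypothetical locals): only the
fault-critical page is marked HMM_PFN_REQ_FAULT, and the surrounding
prefetch window is walked opportunistically, so it cannot produce
-EFAULT:

/* Only the faulting page must be made present. */
memset(pfns, 0, npages * sizeof(*pfns));
pfns[fault_idx] = HMM_PFN_REQ_FAULT;

range.hmm_pfns = pfns;
range.default_flags = 0;
range.pfn_flags_mask = HMM_PFN_REQ_FAULT;  /* honour per-pfn flags */

ret = hmm_range_fault(&range);  /* prefetch-only pfns may come back 0 */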

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-02 12:46                                             ` Jason Gunthorpe
@ 2024-05-02 15:01                                               ` Thomas Hellström
  2024-05-02 19:25                                                 ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Hellström @ 2024-05-02 15:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Zeng, Oak, dri-devel, intel-xe, Brost, Matthew,
	Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Thu, 2024-05-02 at 09:46 -0300, Jason Gunthorpe wrote:
> On Thu, May 02, 2024 at 11:11:04AM +0200, Thomas Hellström wrote:
> 
> > It's true the cpu vma lookup is a remnant from amdkfd. The idea
> > here is
> > to replace that with fixed prefaulting ranges of tunable size. So
> > far,
> > as you mention, the prefaulting range has been determined by the
> > CPU
> > vma size. Given previous feedback, this is going to change.
> 
> Perhaps limiting prefault to a VMA barrier is a reasonable thing to
> do, but the implementation should be pushed into hmm_range_fault and
> not open coded in the driver.
> 
> > Still the prefaulting range needs to be restricted to avoid -EFAULT
> > failures in hmm_range_fault(). That can ofc be done by calling it
> > without HMM_PFN_REQ_FAULT for the range and interpreting the returned
> > pfns.
> 
> Yes, this is exactly what that feature is for, you mark your prefetch
> differently from the fault critical page(s).
> 
> > There is a performance concern of this approach as compared to
> > peeking at the CPU vmas directly, since hmm_range_fault() would
> > need to
> > be called twice. Any guidelines/ideas here?
> 
> If there is something wrong with hmm_range_fault() then please fix
> it. I'm not sure why you'd call it twice, the HMM_PFN_REQ_FAULT is
> per
> PFN?

Ah, yes you're right. I somehow thought it was per range. Makes sense
now.

Thanks,
Thomas



> 
> Jason


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-02 15:01                                               ` Thomas Hellström
@ 2024-05-02 19:25                                                 ` Zeng, Oak
  2024-05-03 13:37                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-05-02 19:25 UTC (permalink / raw)
  To: Thomas Hellström, Jason Gunthorpe
  Cc: Daniel Vetter, dri-devel, intel-xe, Brost, Matthew, Welty, Brian,
	Ghimiray, Himal Prasad, Bommu, Krishnaiah, Vishwanathapura,
	Niranjana, Leon Romanovsky

Hi Jason,

I tried to understand how you expect us to use hmm_range_fault... it seems you want us to call hmm_range_fault two times on each gpu page fault:

1.
Call hmm_range_fault a first time: the pfn of the fault address is set with HMM_PFN_REQ_FAULT;
other pfns in the PREFAULT_SIZE range will be set to 0.
hmm_range_fault returns:
	pfn with the 0 flag or the HMM_PFN_VALID flag means a valid pfn
	pfn with the HMM_PFN_ERROR flag means an invalid pfn

2.	
Then call hmm_range_fault a second time,
setting the hmm_range start/end to cover only the valid pfns,
and with all valid pfns, set the REQ_FAULT flag.


Basically, use hmm_range_fault to figure out the valid address range in the first round; then really fault (e.g., trigger a cpu fault to allocate system pages) in the second call to hmm_range_fault.

Do I understand it correctly?

This is strange to me. We should already know the valid address range before we call hmm_range_fault, because the migration code needs to look up the cpu vma anyway. What is the point of the first hmm_range_fault?

Oak

> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Sent: Thursday, May 2, 2024 11:02 AM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Zeng, Oak <oak.zeng@intel.com>; dri-
> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Thu, 2024-05-02 at 09:46 -0300, Jason Gunthorpe wrote:
> > On Thu, May 02, 2024 at 11:11:04AM +0200, Thomas Hellström wrote:
> >
> > > It's true the cpu vma lookup is a remnant from amdkfd. The idea
> > > here is
> > > to replace that with fixed prefaulting ranges of tunable size. So
> > > far,
> > > as you mention, the prefaulting range has been determined by the
> > > CPU
> > > vma size. Given previous feedback, this is going to change.
> >
> > Perhaps limiting prefault to a VMA barrier is a reasonable thing to
> > do, but the implementation should be pushed into hmm_range_fault and
> > not open coded in the driver.
> >
> > > Still the prefaulting range needs to be restricted to avoid -EFAULT
> > > failures in hmm_range_fault(). That can ofc be done by calling it
> > > without HMM_PFN_REQ_FAULT for the range and interpreting the returned
> > > pfns.
> >
> > Yes, this is exactly what that feature is for, you mark your prefetch
> > differently from the fault critical page(s).
> >
> > > There is a performance concern of this approach as compared to
> > > peeking at the CPU vmas directly, since hmm_range_fault() would
> > > need to
> > > be called twice. Any guidelines or ideas here?
> >
> > If there is something wrong with hmm_range_fault() then please fix
> > it. I'm not sure why you'd call it twice, the HMM_PFN_REQ_FAULT is
> > per
> > PFN?
> 
> Ah, yes you're right. I somehow thought it was per range. Makes sense
> now.
> 
> Thanks,
> Thomas
> 
> 
> 
> >
> > Jason


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-02 19:25                                                 ` Zeng, Oak
@ 2024-05-03 13:37                                                   ` Jason Gunthorpe
  2024-05-03 14:43                                                     ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-05-03 13:37 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Thomas Hellström, Daniel Vetter, dri-devel, intel-xe, Brost,
	Matthew, Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Thu, May 02, 2024 at 07:25:50PM +0000, Zeng, Oak wrote:
> Hi Jason,
> 
> I tried to understand how you supposed us to use hmm range fault... it seems you want us to call hmm range fault two times on each gpu page fault:
 
> 1.
> Call hmm_range_fault a first time; the pfn of the fault address is set with HMM_PFN_REQ_FAULT
> Other pfns in the PREFAULT_SIZE range will be set to 0
> hmm_range_fault returns:
> 	A pfn with the 0 flag or the HMM_PFN_VALID flag means a valid pfn
> 	A pfn with the HMM_PFN_ERROR flag means an invalid pfn
> 
> 2.
> Then call hmm_range_fault a second time
> Setting the hmm_range start/end to cover only valid pfns
> For all valid pfns, set the REQ_FAULT flag

Why would you do this? The first already did the faults you needed and
returned all the easy pfns that don't require faulting.

> Basically, use hmm_range_fault to figure out the valid address range
> in the first round; then really fault (e.g., trigger a cpu fault to
> allocate system pages) in the second call to hmm_range_fault.

You don't fault on prefetch. Prefetch is about mirroring already
populated pages, it should not be causing new faults.
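
In HMM terms that is just a snapshot pass; roughly (sketch only):

/* Sketch: a pure mirror/snapshot pass. With no HMM_PFN_REQ_FAULT set
 * anywhere, hmm_range_fault() faults nothing; it only reports what is
 * already populated and leaves holes as 0 in range->hmm_pfns[].
 */
static int snapshot_window(struct hmm_range *range)
{
        range->default_flags = 0;       /* request no faulting at all */
        range->pfn_flags_mask = 0;
        return hmm_range_fault(range);
}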

> Do I understand it correctly?

No
 
> This is strange to me. We should already know the valid address
> range before we call hmm_range_fault, because the migration code
> needs to look up the cpu vma anyway. What is the point of the first
> hmm_range_fault?

I don't really understand why the GPU driver would drive migration off
of faulting. It doesn't make a lot of sense, especially if you are
prefetching CPU pages into the GPU and thus won't get faults for them.

If your plan is to leave the GPU page tables unpopulated and then
migrate on every fault to try to achieve some kind of locality then
you'd want to drive the hmm prefetch on the migration window (so you
don't populate unmigrated pages) and hope for the best.

However, the migration stuff should really not be in the driver
either. That should be core DRM logic to manage that. It is so
convoluted and full of policy that all the drivers should be working
in the same way. 

The GPU fault handler should indicate to some core DRM function that a
GPU memory access occurred and get back a prefetch window to pass into
hmm_range_fault. The driver will mirror what the core code tells it.
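
As a strawman, and every name below is invented, something like:

/* Strawman interface, all names invented: the driver reports the
 * access, the core owns the locality policy and hands back the
 * window the driver should mirror via hmm_range_fault().
 */
struct drm_svm;

struct drm_svm_window {
        unsigned long start;    /* first byte of the range to mirror */
        unsigned long end;      /* one past the last byte */
};

int drm_svm_handle_access(struct drm_svm *svm, struct mm_struct *mm,
                          unsigned long fault_addr, bool is_write,
                          struct drm_svm_window *out_window);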

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-03 13:37                                                   ` Jason Gunthorpe
@ 2024-05-03 14:43                                                     ` Zeng, Oak
  2024-05-03 16:28                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Zeng, Oak @ 2024-05-03 14:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström, Daniel Vetter, dri-devel, intel-xe, Brost,
	Matthew, Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 3, 2024 9:38 AM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter
> <daniel@ffwll.ch>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Brost, Matthew <matthew.brost@intel.com>;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Thu, May 02, 2024 at 07:25:50PM +0000, Zeng, Oak wrote:
> > Hi Jason,
> >
> > I tried to understand how you intended us to use hmm_range_fault... it
> seems you want us to call hmm_range_fault two times on each gpu page fault:
> 
> > 1.
> > Call hmm_range_fault a first time; the pfn of the fault address is set
> with HMM_PFN_REQ_FAULT
> > Other pfns in the PREFAULT_SIZE range will be set to 0
> > hmm_range_fault returns:
> > 	A pfn with the 0 flag or the HMM_PFN_VALID flag means a valid pfn
> > 	A pfn with the HMM_PFN_ERROR flag means an invalid pfn
> >
> > 2.
> > Then call hmm_range_fault a second time
> > Setting the hmm_range start/end to cover only valid pfns
> > For all valid pfns, set the REQ_FAULT flag
> 
> Why would you do this? The first already did the faults you needed and
> returned all the easy pfns that don't require faulting.

But we have a use case where we want to fault-in pages other than the page which contains the GPU fault address, e.g., a user malloc'ed or mmap'ed 8MiB buffer, with no CPU touching of this buffer before the GPU accesses it. Let's say a GPU access caused a GPU page fault at the 2MiB place. The first hmm-range-fault would only fault-in the page at the 2MiB place, because in the first call we only set REQ_FAULT on the pfn at the 2MiB place.

In such a case, we would go over all the pfns returned from the first hmm-range-fault to learn which pfns are faultable pages but not faulted-in yet (pfn flag == 0), and which pfns are not possible to fault-in in the future (pfn flag == HMM_PFN_ERROR), then call hmm-range-fault again, setting REQ_FAULT on all faultable pages.
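
In code, what I describe above would be roughly the following (a sketch of my understanding only; it reuses the pfns array, range, and npages from the first call):

/* Sketch: after the first call, skip pfns that can never be faulted
 * in, and request faulting for everything else on the second pass.
 */
unsigned long i;

for (i = 0; i < npages; i++) {
        if (pfns[i] & HMM_PFN_ERROR)
                pfns[i] = 0;                 /* not faultable, leave it out */
        else
                pfns[i] = HMM_PFN_REQ_FAULT; /* faultable: populate it now */
}
ret = hmm_range_fault(&range);               /* second pass */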

> 
> > Basically, use hmm_range_fault to figure out the valid address range
> > in the first round; then really fault (e.g., trigger a cpu fault to
> > allocate system pages) in the second call to hmm_range_fault.
> 
> You don't fault on prefetch. Prefetch is about mirroring already
> populated pages, it should not be causing new faults.

Maybe we are using different wording here. We have this scenario that we call prefetch, or whatever you call it:

On a GPU page fault at an address A, we want to map an address range (e.g., 2MiB, or whatever size depending on settings) around address A to the GPU page table. The range around A could have no backing pages when the GPU page fault happens. We want to populate the 2MiB range. We can call it prefetch because most of the pages in this range have not been accessed by the GPU yet, but we expect the GPU to access them soon.

You mentioned "already populated pages". Who populated those pages then? Was it a CPU access that populated them? If the CPU accesses those pages first, it is true the pages can already be populated. But it is also a valid use case where the GPU accesses an address before the CPU does, so there are no "already populated pages" at GPU page fault time. Please let us know what the picture in your head is. We seem to picture it completely differently.



> 
> > Do I understand it correctly?
> 
> No
> 
> > This is strange to me. We should already know the valid address
> > range before we call hmm_range_fault, because the migration code
> > needs to look up the cpu vma anyway. What is the point of the first
> > hmm_range_fault?
> 
> I don't really understand why the GPU driver would drive migration off
> of faulting. It doesn't make a lot of sense, especially if you are
> prefetching CPU pages into the GPU and thus won't get faults for them.
> 

Migration on GPU fault is definitely what we want to do. Unlike the RDMA case, the GPU has its own device memory. The size of the device memory is comparable to the size of CPU system memory, sometimes bigger. We lean heavily on device memory for performance. This is why HMM introduced the MEMORY_DEVICE_PRIVATE memory type.

On a GPU page fault, the driver decides whether we need to migrate pages to device memory based on a number of factors, such as user hints, atomic correctness requirements, etc. We could migrate, or we could leave pages in CPU system memory, all tuned for performance.


> If your plan is to leave the GPU page tables unpopulated and then
> migrate on every fault to try to achieve some kind of locality then
> you'd want to drive the hmm prefetch on the migration window (so you
> don't populate unmigrated pages) and hope for the best.


Exactly what we did. We decide the migration window by:

1) look up the CPU VMA which contains the GPU fault address
2) decide a migration window per the migration granularity setting (e.g., 2MiB), inside the CPU VMA. If the CPU VMA is smaller than the migration granularity, the migration window is the whole CPU VMA range; otherwise, part of the VMA range is migrated.

We then prefetch the migration window. If migration happened, it is true all pages are already populated. But there are use cases where migration is skipped and we want to fault-in through hmm-range-fault; see the explanation above.
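
In code, the window selection is roughly the following (a sketch only; CHUNK stands for the migration granularity, e.g. SZ_2M, and vma/fault_addr are assumed to come from the fault handler):

/* Sketch: clamp a CHUNK-aligned window around the fault address to
 * the CPU VMA; a VMA smaller than CHUNK naturally yields the whole
 * VMA range.
 */
unsigned long start = max(vma->vm_start, ALIGN_DOWN(fault_addr, CHUNK));
unsigned long end = min(vma->vm_end, start + CHUNK);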

> 
> However, the migration stuff should really not be in the driver
> either. That should be core DRM logic to manage that. It is so
> convoluted and full of policy that all the drivers should be working
> in the same way.

Completely agreed. Moving the migration infrastructure to DRM is part of our plan. We want to first prove the concept with the xekmd driver, then move the helpers and infrastructure to DRM. A driver should be as simple as implementing a few callback functions for device-specific page table programming and device migration, and calling some DRM common functions during GPU page fault handling.
 

> 
> The GPU fault handler should indicate to some core DRM function that a
> GPU memory access occurred and get back a prefetch window to pass into
> hmm_range_fault. The driver will mirror what the core code tells it.

No objections to this approach.

Oak

> 
> Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-03 14:43                                                     ` Zeng, Oak
@ 2024-05-03 16:28                                                       ` Jason Gunthorpe
  2024-05-03 20:29                                                         ` Zeng, Oak
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2024-05-03 16:28 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Thomas Hellström, Daniel Vetter, dri-devel, intel-xe, Brost,
	Matthew, Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Fri, May 03, 2024 at 02:43:19PM +0000, Zeng, Oak wrote:
> > > 2.
> > > Then call hmm_range_fault a second time
> > > Setting the hmm_range start/end to cover only valid pfns
> > > For all valid pfns, set the REQ_FAULT flag
> > 
> > Why would you do this? The first already did the faults you needed and
> > returned all the easy pfns that don't require faulting.
> 
> But we have a use case where we want to fault-in pages other than the
> page which contains the GPU fault address, e.g., a user malloc'ed or
> mmap'ed 8MiB buffer, with no CPU touching of this buffer before the
> GPU accesses it. Let's say a GPU access caused a GPU page fault at
> the 2MiB place. The first hmm-range-fault would only fault-in the
> page at the 2MiB place, because in the first call we only set
> REQ_FAULT on the pfn at the 2MiB place.

Honestly, that doesn't make a lot of sense to me, but if you really
want that you should add some new flag and have hmm_range_fault do
this kind of speculative faulting. I think you will end up
significantly overfaulting.

It also doesn't make sense to do faulting in hmm prefetch if you are
going to do migration to force the fault anyhow.


> > > Basically, use hmm_range_fault to figure out the valid address range
> > > in the first round; then really fault (e.g., trigger a cpu fault to
> > > allocate system pages) in the second call to hmm_range_fault.
> > 
> > You don't fault on prefetch. Prefetch is about mirroring already
> > populated pages, it should not be causing new faults.
> 
> Maybe we are using different wording here. We have this scenario that
> we call prefetch, or whatever you call it:
>
> On a GPU page fault at an address A, we want to map an address range
> (e.g., 2MiB, or whatever size depending on settings) around address A
> to the GPU page table. The range around A could have no backing pages
> when the GPU page fault happens. We want to populate the 2MiB range.
> We can call it prefetch because most of the pages in this range have
> not been accessed by the GPU yet, but we expect the GPU to access
> them soon.

This isn't prefetch, that is prefaulting.
 
> You mentioned "already populated pages". Who populated those pages
> then? Was it a CPU access that populated them? If the CPU accesses
> those pages first, it is true the pages can already be populated.

Yes, I would think that is a pretty common case

> But it is also a valid use case where the GPU accesses an address
> before the CPU does, so there are no "already populated pages" at
> GPU page fault time. Please let us know what the picture in your
> head is. We seem to picture it completely differently.

And sure, this could happen too, but I feel like it is an application
issue to not prefault the buffers it knows the GPU is going to
touch.

Again, our experiments have shown that taking the fault path is so
slow that sane applications must explicitly prefault and prefetch as
much as possible to avoid the faults in the first place.

I'm not sure I fully agree there is a real need to aggressively optimize
the faulting path like you are describing when it shouldn't really be
used in a performant application :\

> 2) decide a migration window per the migration granularity setting
> (e.g., 2MiB), inside the CPU VMA. If the CPU VMA is smaller than the
> migration granularity, the migration window is the whole CPU VMA
> range; otherwise, part of the VMA range is migrated.

Seems rather arbitrary to me. You are quite likely to capture some
memory that is CPU memory and cause thrashing. As I said before in
common cases the heap will be large single VMAs, so this kind of
scheme is just going to fault a whole bunch of unrelated malloc
objects over to the GPU.

Not sure how it is really a good idea.

Adaptive locality of memory is still an unsolved problem in Linux,
sadly.

> > However, the migration stuff should really not be in the driver
> > either. That should be core DRM logic to manage that. It is so
> > convoluted and full of policy that all the drivers should be working
> > in the same way.
> 
> Completely agreed. Moving the migration infrastructure to DRM is
> part of our plan. We want to first prove the concept with the xekmd
> driver, then move the helpers and infrastructure to DRM. A driver
> should be as simple as implementing a few callback functions for
> device-specific page table programming and device migration, and
> calling some DRM common functions during GPU page fault handling.

You'd be better to start out this way so people can look at and
understand the core code on its own merits.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-03 16:28                                                       ` Jason Gunthorpe
@ 2024-05-03 20:29                                                         ` Zeng, Oak
  2024-05-04  1:03                                                           ` Dave Airlie
  2024-05-06 13:33                                                           ` Jason Gunthorpe
  0 siblings, 2 replies; 123+ messages in thread
From: Zeng, Oak @ 2024-05-03 20:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström, Daniel Vetter, dri-devel, intel-xe, Brost,
	Matthew, Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 3, 2024 12:28 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter
> <daniel@ffwll.ch>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Brost, Matthew <matthew.brost@intel.com>;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Fri, May 03, 2024 at 02:43:19PM +0000, Zeng, Oak wrote:
> > > > 2.
> > > > Then call hmm_range_fault a second time
> > > > Setting the hmm_range start/end to cover only valid pfns
> > > > For all valid pfns, set the REQ_FAULT flag
> > >
> > > Why would you do this? The first already did the faults you needed and
> > > returned all the easy pfns that don't require faulting.
> >
> > But we have a use case where we want to fault-in pages other than the
> > page which contains the GPU fault address, e.g., a user malloc'ed or
> > mmap'ed 8MiB buffer, with no CPU touching of this buffer before the
> > GPU accesses it. Let's say a GPU access caused a GPU page fault at
> > the 2MiB place. The first hmm-range-fault would only fault-in the
> > page at the 2MiB place, because in the first call we only set
> > REQ_FAULT on the pfn at the 2MiB place.
> 
> Honestly, that doesn't make a lot of sense to me, but if you really
> want that you should add some new flag and have hmm_range_fault do
> this kind of speculative faulting. I think you will end up
> significantly overfaulting.

The above 2-step hmm-range-fault approach was just my guess at what you were suggesting. Since you don't like the CPU VMA lookup, we came up with this 2-step hmm-range-fault thing. The first step has the same functionality as a CPU VMA lookup.

I also think this approach doesn't make sense.

In our original approach, we look up the CPU VMA before migration, calling hmm-range-fault in a non-speculative way; there is no overfaulting, because we only call hmm-range-fault within a valid range that we get from the CPU VMA.

> 
> It also doesn't make sense to do faulting in hmm prefetch if you are
> going to do migration to force the fault anyhow.

What do you mean by hmm prefetch?

As explained, we call hmm-range-fault in two scenarios:

1) call hmm-range-fault to get the current status of the cpu page table without causing a CPU fault. When the address range has already been accessed by the CPU before the GPU, or when we migrate such a range, we run into this scenario

2) when the CPU never accessed the range and the driver determined there is no need to migrate, we call hmm-range-fault to trigger a cpu fault and allocate system pages for this range.

> 
> 
> > > > Basically, use hmm_range_fault to figure out the valid address range
> > > > in the first round; then really fault (e.g., trigger a cpu fault to
> > > > allocate system pages) in the second call to hmm_range_fault.
> > >
> > > You don't fault on prefetch. Prefetch is about mirroring already
> > > populated pages, it should not be causing new faults.
> >
> > Maybe we are using different wording here. We have this scenario
> > that we call prefetch, or whatever you call it:
> >
> > On a GPU page fault at an address A, we want to map an address range
> > (e.g., 2MiB, or whatever size depending on settings) around address A
> > to the GPU page table. The range around A could have no backing pages
> > when the GPU page fault happens. We want to populate the 2MiB range.
> > We can call it prefetch because most of the pages in this range have
> > not been accessed by the GPU yet, but we expect the GPU to access
> > them soon.
> 
> This isn't prefetch, that is prefaulting.

Sure, prefaulting is a better name. 

We do have another prefetch API which can be called from user space to prefetch before GPU job submission.


> 
> > You mentioned "already populated pages". Who populated those pages
> > then? Was it a CPU access that populated them? If the CPU accesses
> > those pages first, it is true the pages can already be populated.
> 
> Yes, I would think that is a pretty common case
> 
> > But it is also a valid use case where the GPU accesses an address
> > before the CPU does, so there are no "already populated pages" at
> > GPU page fault time. Please let us know what the picture in your
> > head is. We seem to picture it completely differently.
> 
> And sure, this could happen too, but I feel like it is an application
> issue to not prefault the buffers it knows the GPU is going to
> touch.
> 
> Again, our experiments have shown that taking the fault path is so
> slow that sane applications must explicitly prefault and prefetch as
> much as possible to avoid the faults in the first place.

I agree the fault path has a huge overhead. We all agree.


> 
> I'm not sure I fully agree there is a real need to aggressively optimize
> the faulting path like you are describing when it shouldn't really be
> used in a performant application :\

As a driver, we need to support all possible scenarios. Our way of using hmm-range-fault is generalized enough to deal with both situations: when the application is smart enough to prefetch/prefault, hmm-range-fault just gets back the existing pfns; otherwise it falls back to the slow faulting path.

It is not an aggressive optimization. The code is written for the fast path, but it also works for the slow path.


> 
> > 2) decide a migration window per the migration granularity setting
> > (e.g., 2MiB), inside the CPU VMA. If the CPU VMA is smaller than the
> > migration granularity, the migration window is the whole CPU VMA
> > range; otherwise, part of the VMA range is migrated.
> 
> Seems rather arbitrary to me. You are quite likely to capture some
> memory that is CPU memory and cause thrashing. As I said before in
> common cases the heap will be large single VMAs, so this kind of
> scheme is just going to fault a whole bunch of unrelated malloc
> objects over to the GPU.

I want to listen more here.

Here is my understanding. For mallocs of small size, such as less than one page or a few pages, memory is allocated from the heap.

When the malloc is much more than one page, glibc's behavior is to mmap it directly from the OS, rather than from the heap.

In glibc the threshold is defined by MMAP_THRESHOLD. The default value is 128K: https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html
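
For illustration only, the threshold is a per-process glibc tunable:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
        mallopt(M_MMAP_THRESHOLD, 128 * 1024); /* the glibc default */
        void *small = malloc(64 * 1024);       /* typically served from the heap */
        void *large = malloc(1024 * 1024);     /* above threshold: its own mmap'ed VMA */

        free(large);
        free(small);
        return 0;
}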

So on the heap, there are some small VMAs, each containing a few pages, normally one page per VMA. In the worst case, a VMA shouldn't be bigger than MMAP_THRESHOLD.

In a reasonable GPU application, people use the GPU for compute, which usually involves large amounts of data: many MiB, sometimes even many GiB.

Now going back to our scheme: I picture that in most applications, the CPU VMA search ends up with a big VMA, MiB, GiB, etc.

If we end up with a VMA that is only a few pages, we fault in the whole VMA. It is true that for this case we fault in unrelated malloc objects. Maybe we can fine-tune here to only fault in one page (the minimum fault size) for such a case. Admittedly one page can also hold a bunch of unrelated objects. But overall we think this should not be the common case.

Let me know if this understanding is correct.

Or what would you like to do in such situation?

> 
> Not sure how it is really a good idea.
> 
> Adaptive locality of memory is still an unsolved problem in Linux,
> sadly.
> 
> > > However, the migration stuff should really not be in the driver
> > > either. That should be core DRM logic to manage that. It is so
> > > convoluted and full of policy that all the drivers should be working
> > > in the same way.
> >
> > Completely agreed. Moving the migration infrastructure to DRM is
> > part of our plan. We want to first prove the concept with the xekmd
> > driver, then move the helpers and infrastructure to DRM. A driver
> > should be as simple as implementing a few callback functions for
> > device-specific page table programming and device migration, and
> > calling some DRM common functions during GPU page fault handling.
> 
> You'd be better to start out this way so people can look at and
> understand the core code on its own merits.

The two-step approach was agreed with the DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)


Oak

> 
> Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-03 20:29                                                         ` Zeng, Oak
@ 2024-05-04  1:03                                                           ` Dave Airlie
  2024-05-06 13:04                                                             ` Daniel Vetter
  2024-05-06 13:33                                                           ` Jason Gunthorpe
  1 sibling, 1 reply; 123+ messages in thread
From: Dave Airlie @ 2024-05-04  1:03 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Jason Gunthorpe, Thomas Hellström, Daniel Vetter, dri-devel,
	intel-xe, Brost, Matthew, Welty, Brian, Ghimiray, Himal Prasad,
	Bommu, Krishnaiah, Vishwanathapura, Niranjana, Leon Romanovsky

> Let me know if this understanding is correct.
>
> Or what would you like to do in such situation?
>
> >
> > Not sure how it is really a good idea.
> >
> > Adaptive locality of memory is still an unsolved problem in Linux,
> > sadly.
> >
> > > > However, the migration stuff should really not be in the driver
> > > > either. That should be core DRM logic to manage that. It is so
> > > > convoluted and full of policy that all the drivers should be working
> > > > in the same way.
> > >
> > > Completely agreed. Moving the migration infrastructure to DRM is
> > > part of our plan. We want to first prove the concept with the xekmd
> > > driver, then move the helpers and infrastructure to DRM. A driver
> > > should be as simple as implementing a few callback functions for
> > > device-specific page table programming and device migration, and
> > > calling some DRM common functions during GPU page fault handling.
> >
> > You'd be better to start out this way so people can look at and
> > understand the core code on its own merits.
>
> The two-step approach was agreed with the DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)

After this discussion and the other cross-device HMM stuff I think we
should probably push more for common code up-front. I think doing this
in a driver without considering the bigger picture might not end up
extractable, and then I fear the developers will just move on to other
things due to management pressure to land features over correctness.

I think we have enough people on the list that can review this stuff,
and even if the common code ends up being a little xe specific,
iterating it will be easier outside the driver, as we can clearly
demarcate what is inside and outside.

Dave.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-04  1:03                                                           ` Dave Airlie
@ 2024-05-06 13:04                                                             ` Daniel Vetter
  2024-05-06 23:50                                                               ` Matthew Brost
  0 siblings, 1 reply; 123+ messages in thread
From: Daniel Vetter @ 2024-05-06 13:04 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Zeng, Oak, Jason Gunthorpe, Thomas Hellström, Daniel Vetter,
	dri-devel, intel-xe, Brost, Matthew, Welty, Brian, Ghimiray,
	Himal Prasad, Bommu, Krishnaiah, Vishwanathapura, Niranjana,
	Leon Romanovsky

On Sat, May 04, 2024 at 11:03:03AM +1000, Dave Airlie wrote:
> > Let me know if this understanding is correct.
> >
> > Or what would you like to do in such situation?
> >
> > >
> > > Not sure how it is really a good idea.
> > >
> > > Adaptive locality of memory is still an unsolved problem in Linux,
> > > sadly.
> > >
> > > > > However, the migration stuff should really not be in the driver
> > > > > either. That should be core DRM logic to manage that. It is so
> > > > > convoluted and full of policy that all the drivers should be working
> > > > > in the same way.
> > > >
> > > > Completely agreed. Moving the migration infrastructure to DRM is
> > > > part of our plan. We want to first prove the concept with the xekmd
> > > > driver, then move the helpers and infrastructure to DRM. A driver
> > > > should be as simple as implementing a few callback functions for
> > > > device-specific page table programming and device migration, and
> > > > calling some DRM common functions during GPU page fault handling.
> > >
> > > You'd be better to start out this way so people can look at and
> > > understand the core code on its own merits.
> >
> > The two-step approach was agreed with the DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)
> 
> After this discussion and the other cross-device HMM stuff I think we
> should probably push more for common code up-front. I think doing this
> in a driver without considering the bigger picture might not end up
> extractable, and then I fear the developers will just move on to other
> things due to management pressure to land features over correctness.
> 
> I think we have enough people on the list that can review this stuff,
> and even if the common code ends up being a little xe specific,
> iterating it will be easier outside the driver, as we can clearly
> demarcate what is inside and outside.

tldr; Yeah concurring.

I think like with the gpu vma stuff we should at least aim for the core
data structures, and more importantly, the locking design and how it
interacts with core mm services to be common code.

I read through amdkfd and I think that one is warning enough that this
area is one of these cases where going with common code aggressively is
much better. Because it will be buggy in terrible "how do we get out of
this design corner again ever?" ways no matter what. But with common code
there will at least be all of dri-devel and hopefully some mm folks
involved in sorting things out.

Most other areas it's indeed better to explore the design space with a few
drivers before going with common code, at the cost of having some really
terrible driver code in upstream. But here the cost of some really bad
design in drivers is just too expensive imo.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-03 20:29                                                         ` Zeng, Oak
  2024-05-04  1:03                                                           ` Dave Airlie
@ 2024-05-06 13:33                                                           ` Jason Gunthorpe
  1 sibling, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2024-05-06 13:33 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: Thomas Hellström, Daniel Vetter, dri-devel, intel-xe, Brost,
	Matthew, Welty, Brian, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Vishwanathapura, Niranjana, Leon Romanovsky

On Fri, May 03, 2024 at 08:29:39PM +0000, Zeng, Oak wrote:

> > > But we have a use case where we want to fault-in pages other than
> > > the page which contains the GPU fault address, e.g., a user
> > > malloc'ed or mmap'ed 8MiB buffer, with no CPU touching of this
> > > buffer before the GPU accesses it. Let's say a GPU access caused a
> > > GPU page fault at the 2MiB place. The first hmm-range-fault would
> > > only fault-in the page at the 2MiB place, because in the first call
> > > we only set REQ_FAULT on the pfn at the 2MiB place.
> > 
> > Honestly, that doesn't make a lot of sense to me, but if you really
> > want that you should add some new flag and have hmm_range_fault do
> > this kind of speculative faulting. I think you will end up
> > significantly overfaulting.
> 
> The above 2-step hmm-range-fault approach was just my guess at what
> you were suggesting. Since you don't like the CPU VMA lookup, we came
> up with this 2-step hmm-range-fault thing. The first step has the
> same functionality as a CPU VMA lookup.

If you want to retain the GPU fault flag as a signal for changing
locality then you have to correct the locality and resolve all faults
before calling hmm_range_fault(). hmm_range_fault() will never do
faulting. It will always just read in the already resolved pages.
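
I.e., the ordering would be roughly this (pseudo-C; every name below is hypothetical, only the ordering matters):

struct gpu_svm;
struct window { unsigned long start, end; };  /* hypothetical */

/* Hypothetical helpers, assumed to exist elsewhere. */
struct window decide_locality(struct gpu_svm *svm, unsigned long addr);
int migrate_window(struct gpu_svm *svm, struct window *w);
int mirror_window(struct gpu_svm *svm, struct window *w);
int gpu_bind_window(struct gpu_svm *svm, struct window *w);

static int gpu_fault_handler(struct gpu_svm *svm, unsigned long addr)
{
        struct window w = decide_locality(svm, addr); /* policy, core code */
        int ret;

        ret = migrate_window(svm, &w);   /* resolve the faults / move pages */
        if (ret)
                return ret;
        ret = mirror_window(svm, &w);    /* hmm_range_fault(): pure read-back */
        if (ret)
                return ret;
        return gpu_bind_window(svm, &w); /* program the GPU page table */
}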

> > It also doesn't make sense to do faulting in hmm prefetch if you are
> > going to do migration to force the fault anyhow.
> 
> What do you mean by hmm prefetch?

I mean the pages that are not part of the critical fault
resolution. The pages you are preloading into the GPU page table
without an immediate need.
 
> As explained, we call hmm-range-fault in two scenarios:
> 
> 1) call hmm-range-fault to get the current status of the cpu page
> table without causing a CPU fault. When the address range has already
> been accessed by the CPU before the GPU, or when we migrate such a
> range, we run into this scenario

This is because you are trying to keep locality management outside of
the core code - it is creating this problem. As I said below, locality
management should be core code, not in drivers. It may be hmm core
code, not drm, but regardless.
 
> We do have another prefetch API which can be called from user space
> to prefetch before GPU job submission.

This API seems like it would break the use of faulting as a mechanism
to manage locality...

> > I'm not sure I full agree there is a real need to agressively optimize
> > the faulting path like you are describing when it shouldn't really be
> > used in a performant application :\
> 
> As a driver, we need to support all possible scenarios. 

Functional support is different from micro-optimizing it.

> > > 2) decide a migration window per the migration granularity setting
> > > (e.g., 2MiB), inside the CPU VMA. If the CPU VMA is smaller than
> > > the migration granularity, the migration window is the whole CPU
> > > VMA range; otherwise, part of the VMA range is migrated.
> > 
> > Seems rather arbitrary to me. You are quite likely to capture some
> > memory that is CPU memory and cause thrashing. As I said before in
> > common cases the heap will be large single VMAs, so this kind of
> > scheme is just going to fault a whole bunch of unrelated malloc
> > objects over to the GPU.
> 
> I want to listen more here.
> 
> Here is my understanding. For mallocs of small size, such as less
> than one page or a few pages, memory is allocated from the heap.
> 
> When the malloc is much more than one page, glibc's behavior is to
> mmap it directly from the OS, rather than from the heap.

Yes, "much more"; there is some crossover where very large allocations
may get their own arena.
 
> In glibc the threshold is defined by MMAP_THRESHOLD. The default
> value is 128K:
> https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html

Sure
 
> So on the heap, there are some small VMAs, each containing a few
> pages, normally one page per VMA. In the worst case, a VMA shouldn't
> be bigger than MMAP_THRESHOLD.

Huh? That isn't quite how it works. The glibc arenas for < 128K
allocations can be quite big; they will often come from the brk heap,
which is a single large VMA. The above only says that allocations over
128K will get their own VMAs. It doesn't say small allocations get
small VMAs.

Of course there are many allocator libraries with different schemes
and tunables.

> In a reasonable GPU application, people use the GPU for compute,
> which usually involves large amounts of data: many MiB, sometimes
> even many GiB.

Then the application can also prefault the whole thing.

> Now going back to our scheme: I picture that in most applications,
> the CPU VMA search ends up with a big VMA, MiB, GiB, etc.

I'm not sure. Some may, but not all, and not all memory touched by the
GPU will necessarily come from the giant allocation even in the apps
that do work that way.

> If we end up with a VMA that is only a few pages, we fault in the
> whole VMA. It is true that for this case we fault in unrelated
> malloc objects. Maybe we can fine-tune here to only fault in one
> page (the minimum fault size) for such a case. Admittedly one page
> can also hold a bunch of unrelated objects. But overall we think
> this should not be the common case.

This is the obvious solution; without some kind of special knowledge
the kernel possibly shouldn't attempt to optimize by speculating how
to resolve the fault - or minimally the speculation needs to be a
tunable (ugh).

Broadly, I think using fault indication to indicate locality of pages
that haven't been faulted is pretty bad. Locality indications need to
come from some way that reliably indicates if the device is touching
the pages at all.

Arguably this can never be performant, so I'd argue you should focus
on making things simply work (i.e. single fault, no prefault, basic
prefetch) and not expect to achieve high-quality dynamic
locality. The application must specify; the application must prefault &
prefetch.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-06 13:04                                                             ` Daniel Vetter
@ 2024-05-06 23:50                                                               ` Matthew Brost
  2024-05-07 11:56                                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Matthew Brost @ 2024-05-06 23:50 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Zeng, Oak, Jason Gunthorpe, Thomas Hellström,
	dri-devel, intel-xe, Welty, Brian, Ghimiray, Himal Prasad, Bommu,
	Krishnaiah, Vishwanathapura, Niranjana, Leon Romanovsky

On Mon, May 06, 2024 at 03:04:15PM +0200, Daniel Vetter wrote:
> On Sat, May 04, 2024 at 11:03:03AM +1000, Dave Airlie wrote:
> > > Let me know if this understanding is correct.
> > >
> > > Or what would you like to do in such situation?
> > >
> > > >
> > > > Not sure how it is really a good idea.
> > > >
> > > > Adaptive locality of memory is still an unsolved problem in Linux,
> > > > sadly.
> > > >
> > > > > > However, the migration stuff should really not be in the driver
> > > > > > either. That should be core DRM logic to manage that. It is so
> > > > > > convoluted and full of policy that all the drivers should be working
> > > > > > in the same way.
> > > > >
> > > > > Completely agreed. Moving the migration infrastructure to DRM is
> > > > > part of our plan. We want to first prove the concept with the xekmd
> > > > > driver, then move the helpers and infrastructure to DRM. A driver
> > > > > should be as simple as implementing a few callback functions for
> > > > > device-specific page table programming and device migration, and
> > > > > calling some DRM common functions during GPU page fault handling.
> > > >
> > > > You'd be better to start out this way so people can look at and
> > > > understand the core code on its own merits.
> > >
> > > The two-step approach was agreed with the DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)
> > 
> > After this discussion and the other cross-device HMM stuff I think we
> > should probably push more for common code up-front. I think doing this
> > in a driver without considering the bigger picture might not end up
> > extractable, and then I fear the developers will just move on to other
> > things due to management pressure to land features over correctness.
> > 
> > I think we have enough people on the list that can review this stuff,
> > and even if the common code ends up being a little xe specific,
> > iterating it will be easier outside the driver, as we can clearly
> > demarcate what is inside and outside.
> 
> tldr; Yeah concurring.
> 
> I think like with the gpu vma stuff we should at least aim for the core
> data structures, and more importantly, the locking design and how it
> interacts with core mm services to be common code.
> 

I believe this is a reasonable request and hopefully, it should end up
being a pretty thin layer. drm_gpusvm? Have some ideas. Let's see what
we come up with.
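
Very roughly, something like this - every name below is invented at
this point:

/* The common layer would own ranges, locking and notifier handling;
 * the driver supplies only the hardware-specific ops.
 */
struct drm_gpusvm_range;

struct drm_gpusvm_ops {
        int (*bind)(struct drm_gpusvm_range *range);        /* write GPU PTEs */
        void (*invalidate)(struct drm_gpusvm_range *range); /* zap GPU PTEs */
        int (*copy_to_vram)(struct drm_gpusvm_range *range);
        int (*copy_to_sram)(struct drm_gpusvm_range *range);
};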

Matt

> I read through amdkfd and I think that one is warning enough that this
> area is one of these cases where going with common code aggressively is
> much better. Because it will be buggy in terrible "how do we get out of
> this design corner again ever?" ways no matter what. But with common code
> there will at least be all of dri-devel and hopefully some mm folks
> involved in sorting things out.
> 
> Most other areas it's indeed better to explore the design space with a few
> drivers before going with common code, at the cost of having some really
> terrible driver code in upstream. But here the cost of some really bad
> design in drivers is just too expensive imo.
> -Sima
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
  2024-05-06 23:50                                                               ` Matthew Brost
@ 2024-05-07 11:56                                                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2024-05-07 11:56 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Daniel Vetter, Dave Airlie, Zeng, Oak, Thomas Hellström,
	dri-devel, intel-xe, Welty, Brian, Ghimiray, Himal Prasad, Bommu,
	Krishnaiah, Vishwanathapura, Niranjana, Leon Romanovsky

On Mon, May 06, 2024 at 11:50:36PM +0000, Matthew Brost wrote:
> > I think like with the gpu vma stuff we should at least aim for the core
> > data structures, and more importantly, the locking design and how it
> > interacts with core mm services to be common code.
> 
> I believe this is a reasonable request and hopefully, it should end up
> being a pretty thin layer. drm_gpusvm? Have some ideas. Let's see what
> we come up with.

It sounds to me like some of the important pieces should not be in DRM
but somewhere in mm.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

end of thread, other threads:[~2024-05-07 11:56 UTC | newest]

Thread overview: 123+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
2024-01-17 22:12 ` [PATCH 01/23] drm/xe/svm: Add SVM document Oak Zeng
2024-01-17 22:12 ` [PATCH 02/23] drm/xe/svm: Add svm key data structures Oak Zeng
2024-01-17 22:12 ` [PATCH 03/23] drm/xe/svm: create xe svm during vm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 04/23] drm/xe/svm: Trace svm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 05/23] drm/xe/svm: add helper to retrieve svm range from address Oak Zeng
2024-01-17 22:12 ` [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range Oak Zeng
2024-04-05  0:39   ` Jason Gunthorpe
2024-04-05  3:33     ` Zeng, Oak
2024-04-05 12:37       ` Jason Gunthorpe
2024-04-05 16:42         ` Zeng, Oak
2024-04-05 18:02           ` Jason Gunthorpe
2024-04-09 16:45             ` Zeng, Oak
2024-04-09 17:24               ` Jason Gunthorpe
2024-04-23 21:17                 ` Zeng, Oak
2024-04-24  2:31                   ` Matthew Brost
2024-04-24 13:57                     ` Jason Gunthorpe
2024-04-24 16:35                       ` Matthew Brost
2024-04-24 16:44                         ` Jason Gunthorpe
2024-04-24 16:56                           ` Matthew Brost
2024-04-24 17:48                             ` Jason Gunthorpe
2024-04-24 13:48                   ` Jason Gunthorpe
2024-04-24 23:59                     ` Zeng, Oak
2024-04-25  1:05                       ` Jason Gunthorpe
2024-04-26  9:55                         ` Thomas Hellström
2024-04-26 12:00                           ` Jason Gunthorpe
2024-04-26 14:49                             ` Thomas Hellström
2024-04-26 16:35                               ` Jason Gunthorpe
2024-04-29  8:25                                 ` Thomas Hellström
2024-04-30 17:30                                   ` Jason Gunthorpe
2024-04-30 18:57                                     ` Daniel Vetter
2024-05-01  0:09                                       ` Jason Gunthorpe
2024-05-02  8:04                                         ` Daniel Vetter
2024-05-02  9:11                                           ` Thomas Hellström
2024-05-02 12:46                                             ` Jason Gunthorpe
2024-05-02 15:01                                               ` Thomas Hellström
2024-05-02 19:25                                                 ` Zeng, Oak
2024-05-03 13:37                                                   ` Jason Gunthorpe
2024-05-03 14:43                                                     ` Zeng, Oak
2024-05-03 16:28                                                       ` Jason Gunthorpe
2024-05-03 20:29                                                         ` Zeng, Oak
2024-05-04  1:03                                                           ` Dave Airlie
2024-05-06 13:04                                                             ` Daniel Vetter
2024-05-06 23:50                                                               ` Matthew Brost
2024-05-07 11:56                                                                 ` Jason Gunthorpe
2024-05-06 13:33                                                           ` Jason Gunthorpe
2024-04-09 17:33               ` Matthew Brost
2024-01-17 22:12 ` [PATCH 07/23] drm/xe/svm: Add helper for binding hmm range to gpu Oak Zeng
2024-01-17 22:12 ` [PATCH 08/23] drm/xe/svm: Add helper to invalidate svm range from GPU Oak Zeng
2024-01-17 22:12 ` [PATCH 09/23] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
2024-01-17 22:12 ` [PATCH 10/23] drm/xe/svm: Introduce svm migration function Oak Zeng
2024-01-17 22:12 ` [PATCH 11/23] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
2024-01-17 22:12 ` [PATCH 12/23] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
2024-01-17 22:12 ` [PATCH 13/23] drm/xe/svm: Handle CPU page fault Oak Zeng
2024-01-17 22:12 ` [PATCH 14/23] drm/xe/svm: trace svm range migration Oak Zeng
2024-01-17 22:12 ` [PATCH 15/23] drm/xe/svm: Implement functions to register and unregister mmu notifier Oak Zeng
2024-01-17 22:12 ` [PATCH 16/23] drm/xe/svm: Implement the mmu notifier range invalidate callback Oak Zeng
2024-01-17 22:12 ` [PATCH 17/23] drm/xe/svm: clean up svm range during process exit Oak Zeng
2024-01-17 22:12 ` [PATCH 18/23] drm/xe/svm: Move a few structures to xe_gt.h Oak Zeng
2024-01-17 22:12 ` [PATCH 19/23] drm/xe/svm: migrate svm range to vram Oak Zeng
2024-01-17 22:12 ` [PATCH 20/23] drm/xe/svm: Populate svm range Oak Zeng
2024-01-17 22:12 ` [PATCH 21/23] drm/xe/svm: GPU page fault support Oak Zeng
2024-01-23  2:06   ` Welty, Brian
2024-01-23  3:09     ` Zeng, Oak
2024-01-23  3:21       ` Making drm_gpuvm work across gpu devices Zeng, Oak
2024-01-23 11:13         ` Christian König
2024-01-23 19:37           ` Zeng, Oak
2024-01-23 20:17             ` Felix Kuehling
2024-01-25  1:39               ` Zeng, Oak
2024-01-23 23:56             ` Danilo Krummrich
2024-01-24  3:57               ` Zeng, Oak
2024-01-24  4:14                 ` Zeng, Oak
2024-01-24  6:48                   ` Christian König
2024-01-25 22:13                 ` Danilo Krummrich
2024-01-24  8:33             ` Christian König
2024-01-25  1:17               ` Zeng, Oak
2024-01-25  1:25                 ` David Airlie
2024-01-25  5:25                   ` Zeng, Oak
2024-01-26 10:09                     ` Christian König
2024-01-26 20:13                       ` Zeng, Oak
2024-01-29 10:10                         ` Christian König
2024-01-29 20:09                           ` Zeng, Oak
2024-01-25 11:00                 ` 回复:Making " 周春明(日月)
2024-01-25 17:00                   ` Zeng, Oak
2024-01-25 17:15                 ` Making " Felix Kuehling
2024-01-25 18:37                   ` Zeng, Oak
2024-01-26 13:23                     ` Christian König
2024-01-25 16:42               ` Zeng, Oak
2024-01-25 18:32               ` Daniel Vetter
2024-01-25 21:02                 ` Zeng, Oak
2024-01-26  8:21                 ` Thomas Hellström
2024-01-26 12:52                   ` Christian König
2024-01-27  2:21                     ` Zeng, Oak
2024-01-29 10:19                       ` Christian König
2024-01-30  0:21                         ` Zeng, Oak
2024-01-30  8:39                           ` Christian König
2024-01-30 22:29                             ` Zeng, Oak
2024-01-30 23:12                               ` David Airlie
2024-01-31  9:15                                 ` Daniel Vetter
2024-01-31 20:17                                   ` Zeng, Oak
2024-01-31 20:59                                     ` Zeng, Oak
2024-02-01  8:52                                     ` Christian König
2024-02-29 18:22                                       ` Zeng, Oak
2024-03-08  4:43                                         ` Zeng, Oak
2024-03-08 10:07                                           ` Christian König
2024-01-30  8:43                           ` Thomas Hellström
2024-01-29 15:03                 ` Felix Kuehling
2024-01-29 15:33                   ` Christian König
2024-01-29 16:24                     ` Felix Kuehling
2024-01-29 16:28                       ` Christian König
2024-01-29 17:52                         ` Felix Kuehling
2024-01-29 19:03                           ` Christian König
2024-01-29 20:24                             ` Felix Kuehling
2024-02-23 20:12               ` Zeng, Oak
2024-02-27  6:54                 ` Christian König
2024-02-27 15:58                   ` Zeng, Oak
2024-02-28 19:51                     ` Zeng, Oak
2024-02-29  9:41                       ` Christian König
2024-02-29 16:05                         ` Zeng, Oak
2024-02-29 17:12                         ` Thomas Hellström
2024-03-01  7:01                           ` Christian König
2024-01-17 22:12 ` [PATCH 22/23] drm/xe/svm: Add DRM_XE_SVM kernel config entry Oak Zeng
2024-01-17 22:12 ` [PATCH 23/23] drm/xe/svm: Add svm memory hints interface Oak Zeng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).