* [v2 00/31] Basic system allocator support in xe driver
@ 2024-04-09 20:17 Oak Zeng
  2024-04-09 20:17 ` [v2 01/31] drm/xe: Refactor vm_bind Oak Zeng
                   ` (31 more replies)
  0 siblings, 32 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

This is v2 of the basic system allocator support in the xe KMD driver.
v1 is here: https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.zeng@intel.com/

Significant design changes were made since v1, based on drm community
review feedback:

1) Introduce a vm_bind uAPI for the system allocator. With this uAPI, the
user can optionally bind a CPU virtual address range A..B to a GPU virtual
address range C..D. Right now we force A..B == C..D since we don't have
a valid use case where A..B != C..D, but the interface is built so it can
be extended easily if such a use case comes up. See patches 3 to 8 for
this work.
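
For illustration, a minimal userspace sketch of what such a bind could look
like, going through the existing drm_xe_vm_bind ioctl with the
DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag added later in this series. The
choice of DRM_XE_VM_BIND_OP_MAP, the zeroed GEM object, and the omitted
pat_index/sync handling are assumptions for illustration, not the series'
final uAPI:

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <drm/xe_drm.h>

/*
 * Sketch: bind the CPU range [va, va + size) as a system-allocator range
 * in GPU VM @vm_id. GPU VA == CPU VA because A..B == C..D is enforced.
 * Error handling and pat_index selection are omitted.
 */
static int bind_system_allocator(int fd, uint32_t vm_id, void *va,
				 uint64_t size)
{
	struct drm_xe_vm_bind bind;

	memset(&bind, 0, sizeof(bind));
	bind.vm_id = vm_id;
	bind.num_binds = 1;
	bind.bind.op = DRM_XE_VM_BIND_OP_MAP;	/* assumption */
	bind.bind.flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
	bind.bind.obj = 0;			/* no BO backs this range */
	bind.bind.addr = (uint64_t)(uintptr_t)va;
	bind.bind.range = size;

	return drmIoctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
}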

2) Unify the system allocator and userptr code. System allocator and
userptr now share the same code for GPU page table programming, mmu
interval notifier and VMA invalidation, page fault handling, and lock
design, so the two paths are largely unified.
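
To show the shape of that shared path, here is a minimal sketch of
populating a range with hmm_range_fault under an mmu interval notifier,
the core pattern both userptr and the system allocator rely on. The
container and field names (my_userptr, pfns, owner) are illustrative, not
the driver's:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_userptr {			/* illustrative container */
	struct mmu_interval_notifier notifier;
	unsigned long *pfns;		/* one entry per page in the range */
	void *owner;			/* device-private page owner */
};

static int populate_range(struct my_userptr *up, struct mm_struct *mm,
			  u64 start, u64 end, bool write)
{
	struct hmm_range range = {
		.notifier = &up->notifier,
		.start = start,
		.end = end,
		.hmm_pfns = up->pfns,
		.default_flags = HMM_PFN_REQ_FAULT |
				 (write ? HMM_PFN_REQ_WRITE : 0),
		.dev_private_owner = up->owner,
	};
	int ret;

	do {
		range.notifier_seq = mmu_interval_read_begin(&up->notifier);
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(mm);
		if (ret == -EBUSY)
			continue;	/* notifier fired, retry */
		if (ret)
			return ret;
	} while (mmu_interval_read_retry(&up->notifier, range.notifier_seq));

	/*
	 * pfns[] now describe system or device-private pages; the sg table
	 * and GPU PTEs are built from them under the notifier lock,
	 * rechecking the sequence number before commit.
	 */
	return 0;
}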

This work is built on top of Matt Brost's huge vm_bind refactor series.
The first patch is a squash of Matt's 30-patch series, included for
reference purposes.

This work is still at an early stage. It is sent out so we can get some
early eyes on it. We are open to any comments and suggestions.

The work planned in our bucket includes:

* Virtual address range based memory attributes and hints: we plan to
expose uAPI for the user to set memory attributes such as preferred
location or migration granularity on a virtual address range. This is
important for tuning SVM performance.

* GPU vram eviction: one key design choice of this series is that the SVM
layer allocates GPU memory directly from the drm buddy allocator, instead
of from the xe vram manager. There is no BO (buffer object) concept
in this implementation. The key benefit of this approach is that we can
easily migrate memory at page granularity (see the allocation sketch after
this list). It also means SVM bypasses TTM's memory eviction logic. But we
want SVM memory and BO driver memory to be able to mutually evict each
other. We have some proof-of-concept work to rework the TTM resource
manager for this purpose, see
https://lore.kernel.org/dri-devel/20231102043306.2931989-1-oak.zeng@intel.com/
We will continue work on that series, then implement SVM's eviction
function based on the concept of a drm LRU list shared between SVM and
the TTM/BO driver.

* Try 1 VMA with N PT_state for the system allocator and userptr: one
gigantic VMA to hold the address space's initial, constant default state
and N PT_state structures to hold mutable page table state. Also try to
register only one mmu interval notifier for the whole address range.

* Multiple GPU device support
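
As referenced in the GPU vram eviction item above, a minimal sketch of
page-granular vram allocation straight from drm_buddy, with no BO or TTM
resource wrapped around the blocks. The drm_buddy instance and helper name
stand in for the tile's vram manager state and are assumptions, not this
series' code:

#include <drm/drm_buddy.h>
#include <linux/list.h>
#include <linux/mm.h>

/*
 * Allocate @npages of vram directly from the buddy allocator. Using
 * PAGE_SIZE as the minimum block size is what allows later migration at
 * page granularity; blocks are returned with drm_buddy_free_list().
 */
static int svm_alloc_vram_pages(struct drm_buddy *mm, u64 npages,
				struct list_head *blocks)
{
	return drm_buddy_alloc_blocks(mm, 0, mm->size,
				      npages << PAGE_SHIFT, PAGE_SIZE,
				      blocks, 0);
}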

Matthew Brost (7):
  drm/xe: Refactor vm_bind
  drm/xe: Invalidate userptr VMA on page pin fault
  drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse
  drm/xe: Fix op->tile_mask for fault mode
  drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
  drm/xe: Create userptr if page fault occurs on system_allocator VMA
  drm/xe: Add faulted userptr VMA garbage collector

Oak Zeng (24):
  drm/xe/svm: Add SVM document
  drm/xe: Introduce helper to populate userptr
  drm/xe: Introduce a helper to free sg table
  drm/xe: Use hmm_range_fault to populate user pages
  drm/xe/svm: Remap and provide memmap backing for GPU vram
  drm/xe/svm: Introduce DRM_XE_SVM kernel config
  drm/xe: Introduce helper to get tile from memory region
  drm/xe: Introduce a helper to get dpa from pfn
  drm/xe/svm: Get xe memory region from page
  drm/xe: Get xe_vma from xe_userptr
  drm/xe/svm: Build userptr sg table for device pages
  drm/xe/svm: Determine a vma is backed by device memory
  drm/xe: add xe lock document
  drm/xe/svm: Introduce svm migration function
  drm/xe/svm: implement functions to allocate and free device memory
  drm/xe/svm: Trace buddy block allocation and free
  drm/xe/svm: Create and destroy xe svm
  drm/xe/svm: Add vm to xe_svm process
  drm/xe: Make function lookup_vma public
  drm/xe/svm: Handle CPU page fault
  drm/xe/svm: Introduce helper to migrate vma to vram
  drm/xe/svm: trace svm migration
  drm/xe/svm: Add a helper to determine a vma is fault userptr
  drm/xe/svm: Migration from sram to vram for system allocator

 Documentation/gpu/xe/index.rst              |    2 +
 Documentation/gpu/xe/xe_lock.rst            |    8 +
 Documentation/gpu/xe/xe_svm.rst             |    8 +
 drivers/gpu/drm/xe/Kconfig                  |   22 +
 drivers/gpu/drm/xe/Makefile                 |    6 +
 drivers/gpu/drm/xe/tests/xe_migrate.c       |   86 -
 drivers/gpu/drm/xe/xe_bo.c                  |    7 +-
 drivers/gpu/drm/xe/xe_bo.h                  |    4 +-
 drivers/gpu/drm/xe/xe_device.c              |   35 +
 drivers/gpu/drm/xe/xe_device.h              |   10 +
 drivers/gpu/drm/xe/xe_device_types.h        |   24 +
 drivers/gpu/drm/xe/xe_exec.c                |   41 +-
 drivers/gpu/drm/xe/xe_exec_queue.c          |  120 +-
 drivers/gpu/drm/xe/xe_exec_queue_types.h    |   20 +-
 drivers/gpu/drm/xe/xe_gt_pagefault.c        |   52 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   59 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    3 +
 drivers/gpu/drm/xe/xe_guc_submit.c          |   22 +-
 drivers/gpu/drm/xe/xe_hmm.c                 |  329 ++++
 drivers/gpu/drm/xe/xe_hmm.h                 |   18 +
 drivers/gpu/drm/xe/xe_lock_doc.h            |  113 ++
 drivers/gpu/drm/xe/xe_migrate.c             |  602 ++++---
 drivers/gpu/drm/xe/xe_migrate.h             |   53 +-
 drivers/gpu/drm/xe/xe_mmio.c                |    6 +
 drivers/gpu/drm/xe/xe_pci.c                 |    1 +
 drivers/gpu/drm/xe/xe_pt.c                  | 1301 +++++++++-----
 drivers/gpu/drm/xe/xe_pt.h                  |   15 +-
 drivers/gpu/drm/xe/xe_pt_exec_queue.c       |  180 ++
 drivers/gpu/drm/xe/xe_pt_exec_queue.h       |   14 +
 drivers/gpu/drm/xe/xe_pt_types.h            |   53 +
 drivers/gpu/drm/xe/xe_sched_job.c           |   68 +-
 drivers/gpu/drm/xe/xe_sched_job_types.h     |   31 +-
 drivers/gpu/drm/xe/xe_svm.c                 |  122 ++
 drivers/gpu/drm/xe/xe_svm.h                 |   88 +
 drivers/gpu/drm/xe/xe_svm_devmem.c          |  231 +++
 drivers/gpu/drm/xe/xe_svm_doc.h             |  121 ++
 drivers/gpu/drm/xe/xe_svm_migrate.c         |  340 ++++
 drivers/gpu/drm/xe/xe_sync.c                |   15 +
 drivers/gpu/drm/xe/xe_sync.h                |    1 +
 drivers/gpu/drm/xe/xe_tile.c                |    7 +
 drivers/gpu/drm/xe/xe_trace.h               |   69 +-
 drivers/gpu/drm/xe/xe_uc_fw.c               |    1 +
 drivers/gpu/drm/xe/xe_vm.c                  | 1768 ++++++++++---------
 drivers/gpu/drm/xe/xe_vm.h                  |   40 +-
 drivers/gpu/drm/xe/xe_vm_types.h            |  229 ++-
 include/drm/xe_pciids.h                     |   16 +
 include/uapi/drm/xe_drm.h                   |   15 +-
 47 files changed, 4432 insertions(+), 1944 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_lock.rst
 create mode 100644 Documentation/gpu/xe/xe_svm.rst
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
 create mode 100644 drivers/gpu/drm/xe/xe_lock_doc.h
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.c
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm_doc.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c

-- 
2.26.3



* [v2 01/31] drm/xe: Refactor vm_bind
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 02/31] drm/xe/svm: Add SVM document Oak Zeng
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

drm/xe: Lock all gpuva ops during VM bind IOCTL

Lock all gpuva ops and validate all BOs in a single step during the VM
bind IOCTL. This helps with the transition to making all gpuva ops in a
VM bind IOCTL a single atomic job.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add ops_execute function which returns a fence

Add ops_execute function which returns a fence. This will be helpful to
initiate all binds (VM bind IOCTL, rebinds in exec IOCTL, rebinds in
preempt rebind worker, and rebinds in pagefaults) via a gpuva ops list.
Returning a fence is needed in various paths.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move migrate to prefetch to op_lock function

Migrates need to be done under drm exec to make lockdep happy, so move
the migration done for prefetches under the op_lock function.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add struct xe_vma_ops abstraction

Having a structure which encapsulates a list of VMA operations will help
enable 1 job for the entire list.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update xe_vm_rebind to use dummy VMA operations

All bind interfaces are transitioning to use VMA ops; update
xe_vm_rebind to use VMA ops.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Simplify VM bind IOCTL error handling and cleanup

Clean up everything in VM bind IOCTL in 1 path for both errors and
non-errors. Also move VM bind IOCTL cleanup from ops (also used by
non-IOCTL binds) to the VM bind IOCTL.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update pagefaults to use dummy VMA operations

All bind interfaces are transitioning to use VMA ops; update
pagefaults to use VMA ops.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: s/xe_tile_migrate_engine/xe_tile_migrate_exec_queue

xe_engine is now xe_exec_queue; adjust this function's name to reflect that.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add some members to xe_vma_ops

This will help with moving to single jobs for many bind operations.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add vm_bind_ioctl_ops_install_fences helper

Simplify VM bind code by signaling out-fences / destroying VMAs in a
single location. Will help with the transition to a single job for many
bind ops.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move setting last fence to vm_bind_ioctl_ops_install_fences

This moves setting of the last fence to a single location.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move ufence check to op_lock

Rather than checking for an unsignaled ufence at unbind time, check for
this during the op_lock function. This will help with the transition to
1 job per VM bind IOCTL.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Move ufence add to vm_bind_ioctl_ops_install_fences

Rather than adding a ufence to a VMA in the bind function, add the
ufence to all VMAs in the IOCTL that require binds in
vm_bind_ioctl_ops_install_fences. This will help with the transition to
1 job per VM bind IOCTL.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add xe_gt_tlb_invalidation_range and convert PT layer to use this

xe_gt_tlb_invalidation_range accepts a start and end address rather than
a VMA. This will enable multiple VMAs to be invalidated in a single
invalidation. Update the PT layer to use this new function.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add xe_vm_pgtable_update_op to xe_vma_ops

Will help with the conversion to 1 job per VM bind IOCTL. Only the
allocation is implemented in this patch.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Use ordered WQ for TLB invalidation fences

TLB invalidation fences need to be ordered within an exec queue, and if
an unordered WQ is used, TLB invalidation fences could be reordered. Use
an ordered WQ to fix this.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
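
A minimal sketch of the pattern, assuming an ordered workqueue onto which
each invalidation fence's signaling work is queued; the names below are
illustrative, not the driver's:

#include <linux/errno.h>
#include <linux/workqueue.h>

/*
 * An ordered workqueue executes at most one work item at a time, in
 * queueing order, so fence signaling work queued per invalidation cannot
 * be reordered.
 */
static struct workqueue_struct *tlb_inval_wq;

static int tlb_inval_wq_init(void)
{
	tlb_inval_wq = alloc_ordered_workqueue("tlb-inval-ordered-wq", 0);
	return tlb_inval_wq ? 0 : -ENOMEM;
}

/* per invalidation fence, instead of an unordered system workqueue: */
static void tlb_inval_fence_queue(struct work_struct *work)
{
	queue_work(tlb_inval_wq, work);
}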

drm/xe: Delete PT update selftest

IGTs (e.g. xe_vm) can provide the exact same coverage as the PT update
selftest. The PT update selftest depends on internal functions which can
change, thus maintaining this test is costly and provides no extra
coverage. Delete this test.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Convert multiple bind ops into single job

This aligns with the uAPI where an array of binds, or a single bind that
results in multiple GPUVA ops, is considered a single atomic operation.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Remove old functions defs in xe_pt.h

__xe_pt_bind_vma and __xe_pt_unbind_vma are unused; remove them.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update PT layer with better error handling

Update the PT layer so that if a memory allocation for a PTE fails, the
error can be propagated to the user without requiring the VM to be killed.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update xe_vm_rebind to return int

Now that rebinds are installed in the kernel dma-resv slot, the fence
returned from xe_vm_rebind is unused aside from error checking. Update
xe_vm_rebind to return an int.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

drm/xe: Move vma rebinding to the drm_exec locking loop

Rebinding might allocate page-table bos, causing evictions.
To support blocking locking during these evictions,
perform the rebinding in the drm_exec locking loop.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

drm/xe: Update VM trace events

The trace events have changed with the move to a single job per VM bind
IOCTL; update the trace events to align with the old behavior as much as
possible.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Update clear / populate arguments

This will help implement CPU binds in run_job() as 'struct
xe_migrate_pt_update' is not available at the time of run_job().

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add __xe_migrate_update_pgtables_cpu helper

This will help implement CPU binds as the submission backend can call
this helper when a bind job's dependencies are resolved.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: CPU binds for jobs

No reason to use the GPU for binds. In run_job, use the CPU to do binds
once the bind job's dependencies are resolved.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Don't use migrate exec queue for page fault binds

Now that the CPU is always used for binds even in jobs, CPU bind jobs
can pass GPU jobs in the same exec queue, resulting in dma-fences
signaling out of order. Use a dedicated exec queue for binds issued from
page faults to avoid ordering issues.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add VM bind IOCTL error injection

Add VM bind IOCTL error injection, which steals the MSB of the bind flags
field; if set, errors are injected at various points in the VM bind
IOCTL. Intended to validate error paths. Enabled by CONFIG_DRM_XE_DEBUG.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe/guc: Assert timed out jobs are not from a VM exec queue

With CPU binds, jobs cannot time out; assert this is not happening.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add PT exec queues

Add PT exec queues which are used to implement VM bind / unbind
operations. PT exec queues use a different DRM scheduler backend
(compared to the GuC / execlist submission backends) which uses the CPU
to update page tables once all dependencies for a job are resolved.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

drm/xe: Add PVC support

Add PVC PCIe IDs and GuC firmware.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/Makefile                 |    1 +
 drivers/gpu/drm/xe/tests/xe_migrate.c       |   86 --
 drivers/gpu/drm/xe/xe_bo.c                  |    7 +-
 drivers/gpu/drm/xe/xe_bo.h                  |    4 +-
 drivers/gpu/drm/xe/xe_device.c              |   35 +
 drivers/gpu/drm/xe/xe_device.h              |    2 +
 drivers/gpu/drm/xe/xe_device_types.h        |   16 +
 drivers/gpu/drm/xe/xe_exec.c                |   41 +-
 drivers/gpu/drm/xe/xe_exec_queue.c          |  120 +-
 drivers/gpu/drm/xe/xe_exec_queue_types.h    |   20 +-
 drivers/gpu/drm/xe/xe_gt_pagefault.c        |   10 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   59 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    3 +
 drivers/gpu/drm/xe/xe_guc_submit.c          |   22 +-
 drivers/gpu/drm/xe/xe_migrate.c             |  385 ++----
 drivers/gpu/drm/xe/xe_migrate.h             |   46 +-
 drivers/gpu/drm/xe/xe_pci.c                 |    1 +
 drivers/gpu/drm/xe/xe_pt.c                  | 1242 ++++++++++++-------
 drivers/gpu/drm/xe/xe_pt.h                  |   15 +-
 drivers/gpu/drm/xe/xe_pt_exec_queue.c       |  180 +++
 drivers/gpu/drm/xe/xe_pt_exec_queue.h       |   14 +
 drivers/gpu/drm/xe/xe_pt_types.h            |   53 +
 drivers/gpu/drm/xe/xe_sched_job.c           |   68 +-
 drivers/gpu/drm/xe/xe_sched_job_types.h     |   31 +-
 drivers/gpu/drm/xe/xe_sync.c                |   15 +
 drivers/gpu/drm/xe/xe_sync.h                |    1 +
 drivers/gpu/drm/xe/xe_trace.h               |   21 +-
 drivers/gpu/drm/xe/xe_uc_fw.c               |    1 +
 drivers/gpu/drm/xe/xe_vm.c                  | 1124 ++++++++---------
 drivers/gpu/drm/xe/xe_vm.h                  |    9 +-
 drivers/gpu/drm/xe/xe_vm_types.h            |  198 +--
 include/drm/xe_pciids.h                     |   16 +
 32 files changed, 2142 insertions(+), 1704 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.c
 create mode 100644 drivers/gpu/drm/xe/xe_pt_exec_queue.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 3c3e67885559..bf43a3690e13 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -118,6 +118,7 @@ xe-y += xe_bb.o \
 	xe_pm.o \
 	xe_preempt_fence.o \
 	xe_pt.o \
+	xe_pt_exec_queue.o \
 	xe_pt_walk.o \
 	xe_query.o \
 	xe_range_fence.o \
diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
index ce531498f57f..de2c1b7ec371 100644
--- a/drivers/gpu/drm/xe/tests/xe_migrate.c
+++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
@@ -62,36 +62,6 @@ static int run_sanity_job(struct xe_migrate *m, struct xe_device *xe,
 	return 0;
 }
 
-static void
-sanity_populate_cb(struct xe_migrate_pt_update *pt_update,
-		   struct xe_tile *tile, struct iosys_map *map, void *dst,
-		   u32 qword_ofs, u32 num_qwords,
-		   const struct xe_vm_pgtable_update *update)
-{
-	struct migrate_test_params *p =
-		to_migrate_test_params(xe_cur_kunit_priv(XE_TEST_LIVE_MIGRATE));
-	int i;
-	u64 *ptr = dst;
-	u64 value;
-
-	for (i = 0; i < num_qwords; i++) {
-		value = (qword_ofs + i - update->ofs) * 0x1111111111111111ULL;
-		if (map)
-			xe_map_wr(tile_to_xe(tile), map, (qword_ofs + i) *
-				  sizeof(u64), u64, value);
-		else
-			ptr[i] = value;
-	}
-
-	kunit_info(xe_cur_kunit(), "Used %s.\n", map ? "CPU" : "GPU");
-	if (p->force_gpu && map)
-		KUNIT_FAIL(xe_cur_kunit(), "GPU pagetable update used CPU.\n");
-}
-
-static const struct xe_migrate_pt_update_ops sanity_ops = {
-	.populate = sanity_populate_cb,
-};
-
 #define check(_retval, _expected, str, _test)				\
 	do { if ((_retval) != (_expected)) {				\
 			KUNIT_FAIL(_test, "Sanity check failed: " str	\
@@ -209,57 +179,6 @@ static void test_copy_vram(struct xe_migrate *m, struct xe_bo *bo,
 	test_copy(m, bo, test, region);
 }
 
-static void test_pt_update(struct xe_migrate *m, struct xe_bo *pt,
-			   struct kunit *test, bool force_gpu)
-{
-	struct xe_device *xe = tile_to_xe(m->tile);
-	struct dma_fence *fence;
-	u64 retval, expected;
-	ktime_t then, now;
-	int i;
-
-	struct xe_vm_pgtable_update update = {
-		.ofs = 1,
-		.qwords = 0x10,
-		.pt_bo = pt,
-	};
-	struct xe_migrate_pt_update pt_update = {
-		.ops = &sanity_ops,
-	};
-	struct migrate_test_params p = {
-		.base.id = XE_TEST_LIVE_MIGRATE,
-		.force_gpu = force_gpu,
-	};
-
-	test->priv = &p;
-	/* Test xe_migrate_update_pgtables() updates the pagetable as expected */
-	expected = 0xf0f0f0f0f0f0f0f0ULL;
-	xe_map_memset(xe, &pt->vmap, 0, (u8)expected, pt->size);
-
-	then = ktime_get();
-	fence = xe_migrate_update_pgtables(m, m->q->vm, NULL, m->q, &update, 1,
-					   NULL, 0, &pt_update);
-	now = ktime_get();
-	if (sanity_fence_failed(xe, fence, "Migration pagetable update", test))
-		return;
-
-	kunit_info(test, "Updating without syncing took %llu us,\n",
-		   (unsigned long long)ktime_to_us(ktime_sub(now, then)));
-
-	dma_fence_put(fence);
-	retval = xe_map_rd(xe, &pt->vmap, 0, u64);
-	check(retval, expected, "PTE[0] must stay untouched", test);
-
-	for (i = 0; i < update.qwords; i++) {
-		retval = xe_map_rd(xe, &pt->vmap, (update.ofs + i) * 8, u64);
-		check(retval, i * 0x1111111111111111ULL, "PTE update", test);
-	}
-
-	retval = xe_map_rd(xe, &pt->vmap, 8 * (update.ofs + update.qwords),
-			   u64);
-	check(retval, expected, "PTE[0x11] must stay untouched", test);
-}
-
 static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
 {
 	struct xe_tile *tile = m->tile;
@@ -398,11 +317,6 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
 		test_copy_vram(m, big, test);
 	}
 
-	kunit_info(test, "Testing page table update using CPU if GPU idle.\n");
-	test_pt_update(m, pt, test, false);
-	kunit_info(test, "Testing page table update using GPU\n");
-	test_pt_update(m, pt, test, true);
-
 out:
 	xe_bb_free(bb, NULL);
 free_tiny:
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index b89ac6db68a1..7a90d269d4dd 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -2265,16 +2265,16 @@ void __xe_bo_release_dummy(struct kref *kref)
 
 /**
  * xe_bo_put_commit() - Put bos whose put was deferred by xe_bo_put_deferred().
+ * @xe: Xe device
  * @deferred: The lockless list used for the call to xe_bo_put_deferred().
  *
  * Puts all bos whose put was deferred by xe_bo_put_deferred().
  * The @deferred list can be either an onstack local list or a global
  * shared list used by a workqueue.
  */
-void xe_bo_put_commit(struct llist_head *deferred)
+void xe_bo_put_commit(struct xe_device *xe, struct llist_head *deferred)
 {
 	struct llist_node *freed;
-	struct xe_bo *bo, *next;
 
 	if (!deferred)
 		return;
@@ -2283,8 +2283,7 @@ void xe_bo_put_commit(struct llist_head *deferred)
 	if (!freed)
 		return;
 
-	llist_for_each_entry_safe(bo, next, freed, freed)
-		drm_gem_object_free(&bo->ttm.base.refcount);
+	xe_device_put_deferred(xe, freed);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index c59ad15961ce..10b2b14b4c0d 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -10,7 +10,6 @@
 
 #include "xe_bo_types.h"
 #include "xe_macros.h"
-#include "xe_vm_types.h"
 #include "xe_vm.h"
 
 /**
@@ -309,10 +308,11 @@ xe_bo_put_deferred(struct xe_bo *bo, struct llist_head *deferred)
 	if (!kref_put(&bo->ttm.base.refcount, __xe_bo_release_dummy))
 		return false;
 
+	xe_vm_get(bo->vm);
 	return llist_add(&bo->freed, deferred);
 }
 
-void xe_bo_put_commit(struct llist_head *deferred);
+void xe_bo_put_commit(struct xe_device *xe, struct llist_head *deferred);
 
 struct sg_table *xe_bo_sg(struct xe_bo *bo);
 
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 919ad88f0495..80628bdcfd48 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -226,6 +226,9 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
 {
 	struct xe_device *xe = to_xe_device(dev);
 
+	flush_work(&xe->mem.deferred_work);
+	xe_assert(xe, !llist_del_all(&xe->mem.deferred));
+
 	if (xe->ordered_wq)
 		destroy_workqueue(xe->ordered_wq);
 
@@ -235,6 +238,35 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
 	ttm_device_fini(&xe->ttm);
 }
 
+void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred)
+{
+	struct xe_bo *bo, *next;
+
+	llist_for_each_entry_safe(bo, next, deferred, freed) {
+		init_llist_node(&bo->freed);
+		llist_add(&bo->freed, &xe->mem.deferred);
+	}
+	queue_work(system_wq, &xe->mem.deferred_work);
+}
+
+static void deferred_work(struct work_struct *w)
+{
+	struct xe_device *xe = container_of(w, struct xe_device,
+					    mem.deferred_work);
+	struct llist_node *freed = llist_del_all(&xe->mem.deferred);
+	struct xe_bo *bo, *next;
+
+	if (!freed)
+		return;
+
+	llist_for_each_entry_safe(bo, next, freed, freed) {
+		struct xe_vm *vm = bo->vm;
+
+		drm_gem_object_free(&bo->ttm.base.refcount);
+		xe_vm_put(vm);
+	}
+}
+
 struct xe_device *xe_device_create(struct pci_dev *pdev,
 				   const struct pci_device_id *ent)
 {
@@ -299,6 +331,9 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 		goto err;
 	}
 
+	init_llist_head(&xe->mem.deferred);
+	INIT_WORK(&xe->mem.deferred_work, deferred_work);
+
 	err = xe_display_create(xe);
 	if (WARN_ON(err))
 		goto err;
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 14be34d9f543..74eb9833d4d8 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -176,4 +176,6 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
 u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
 u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
 
+void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 9785eef2e5a4..e73b9a086718 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -22,6 +22,10 @@
 #include "xe_sriov_types.h"
 #include "xe_step_types.h"
 
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+#define TEST_VM_OPS_ERROR
+#endif
+
 #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
 #include "soc/intel_pch.h"
 #include "intel_display_core.h"
@@ -315,6 +319,10 @@ struct xe_device {
 		struct xe_mem_region vram;
 		/** @mem.sys_mgr: system TTM manager */
 		struct ttm_resource_manager sys_mgr;
+		/** @mem.deferred: deferred list to destroy PT entries */
+		struct llist_head deferred;
+		/** @mem.deferred_work: worker to destroy PT entries */
+		struct work_struct deferred_work;
 	} mem;
 
 	/** @sriov: device level virtualization data */
@@ -455,6 +463,14 @@ struct xe_device {
 	/** @needs_flr_on_fini: requests function-reset on fini */
 	bool needs_flr_on_fini;
 
+#ifdef TEST_VM_OPS_ERROR
+	/**
+	 * @vm_inject_error_position: inject errors at different places in VM
+	 * bind IOCTL based on this value
+	 */
+	u8 vm_inject_error_position;
+#endif
+
 	/* private: */
 
 #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
index 952496c6260d..64dc412f84a6 100644
--- a/drivers/gpu/drm/xe/xe_exec.c
+++ b/drivers/gpu/drm/xe/xe_exec.c
@@ -135,6 +135,10 @@ static int xe_exec_fn(struct drm_gpuvm_exec *vm_exec)
 			return ret;
 	}
 
+	ret = xe_vm_rebind(vm, false);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -152,7 +156,6 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	struct drm_exec *exec = &vm_exec.exec;
 	u32 i, num_syncs = 0, num_ufence = 0;
 	struct xe_sched_job *job;
-	struct dma_fence *rebind_fence;
 	struct xe_vm *vm;
 	bool write_locked, skip_retry = false;
 	ktime_t end = 0;
@@ -167,7 +170,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	if (XE_IOCTL_DBG(xe, !q))
 		return -ENOENT;
 
-	if (XE_IOCTL_DBG(xe, q->flags & EXEC_QUEUE_FLAG_VM))
+	if (XE_IOCTL_DBG(xe, q->flags & EXEC_QUEUE_FLAG_PT))
 		return -EINVAL;
 
 	if (XE_IOCTL_DBG(xe, args->num_batch_buffer &&
@@ -285,39 +288,7 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto err_exec;
 	}
 
-	/*
-	 * Rebind any invalidated userptr or evicted BOs in the VM, non-compute
-	 * VM mode only.
-	 */
-	rebind_fence = xe_vm_rebind(vm, false);
-	if (IS_ERR(rebind_fence)) {
-		err = PTR_ERR(rebind_fence);
-		goto err_put_job;
-	}
-
-	/*
-	 * We store the rebind_fence in the VM so subsequent execs don't get
-	 * scheduled before the rebinds of userptrs / evicted BOs is complete.
-	 */
-	if (rebind_fence) {
-		dma_fence_put(vm->rebind_fence);
-		vm->rebind_fence = rebind_fence;
-	}
-	if (vm->rebind_fence) {
-		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
-			     &vm->rebind_fence->flags)) {
-			dma_fence_put(vm->rebind_fence);
-			vm->rebind_fence = NULL;
-		} else {
-			dma_fence_get(vm->rebind_fence);
-			err = drm_sched_job_add_dependency(&job->drm,
-							   vm->rebind_fence);
-			if (err)
-				goto err_put_job;
-		}
-	}
-
-	/* Wait behind munmap style rebinds */
+	/* Wait for rebinds */
 	if (!xe_vm_in_lr_mode(vm)) {
 		err = drm_sched_job_add_resv_dependencies(&job->drm,
 							  xe_vm_resv(vm),
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 6a83bc57826a..149b6ffcda6e 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -19,6 +19,7 @@
 #include "xe_macros.h"
 #include "xe_migrate.h"
 #include "xe_pm.h"
+#include "xe_pt_exec_queue.h"
 #include "xe_ring_ops_types.h"
 #include "xe_trace.h"
 #include "xe_vm.h"
@@ -43,6 +44,8 @@ static struct xe_exec_queue *__xe_exec_queue_alloc(struct xe_device *xe,
 	struct xe_gt *gt = hwe->gt;
 	int err;
 
+	xe_assert(xe, !(flags & EXEC_QUEUE_FLAG_PT));
+
 	/* only kernel queues can be permanent */
 	XE_WARN_ON((flags & EXEC_QUEUE_FLAG_PERMANENT) && !(flags & EXEC_QUEUE_FLAG_KERNEL));
 
@@ -53,6 +56,7 @@ static struct xe_exec_queue *__xe_exec_queue_alloc(struct xe_device *xe,
 	kref_init(&q->refcount);
 	q->flags = flags;
 	q->hwe = hwe;
+	q->xe = xe;
 	q->gt = gt;
 	q->class = hwe->class;
 	q->width = width;
@@ -61,7 +65,6 @@ static struct xe_exec_queue *__xe_exec_queue_alloc(struct xe_device *xe,
 	q->ring_ops = gt->ring_ops[hwe->class];
 	q->ops = gt->exec_queue_ops;
 	INIT_LIST_HEAD(&q->compute.link);
-	INIT_LIST_HEAD(&q->multi_gt_link);
 
 	q->sched_props.timeslice_us = hwe->eclass->sched_props.timeslice_us;
 	q->sched_props.preempt_timeout_us =
@@ -106,7 +109,7 @@ static void __xe_exec_queue_free(struct xe_exec_queue *q)
 
 static int __xe_exec_queue_init(struct xe_exec_queue *q)
 {
-	struct xe_device *xe = gt_to_xe(q->gt);
+	struct xe_device *xe = q->xe;
 	int i, err;
 
 	for (i = 0; i < q->width; ++i) {
@@ -127,7 +130,7 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q)
 	 * can perform GuC CT actions when needed. Caller is expected to have
 	 * already grabbed the rpm ref outside any sensitive locks.
 	 */
-	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm))
+	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && !q->vm)
 		drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));
 
 	return 0;
@@ -198,15 +201,8 @@ struct xe_exec_queue *xe_exec_queue_create_class(struct xe_device *xe, struct xe
 void xe_exec_queue_destroy(struct kref *ref)
 {
 	struct xe_exec_queue *q = container_of(ref, struct xe_exec_queue, refcount);
-	struct xe_exec_queue *eq, *next;
 
 	xe_exec_queue_last_fence_put_unlocked(q);
-	if (!(q->flags & EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD)) {
-		list_for_each_entry_safe(eq, next, &q->multi_gt_list,
-					 multi_gt_link)
-			xe_exec_queue_put(eq);
-	}
-
 	q->ops->fini(q);
 }
 
@@ -216,7 +212,7 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
 
 	for (i = 0; i < q->width; ++i)
 		xe_lrc_finish(q->lrc + i);
-	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm))
+	if (q->gt && !(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && !q->vm)
 		xe_device_mem_access_put(gt_to_xe(q->gt));
 	__xe_exec_queue_free(q);
 }
@@ -454,35 +450,6 @@ find_hw_engine(struct xe_device *xe,
 			       eci.engine_instance, true);
 }
 
-static u32 bind_exec_queue_logical_mask(struct xe_device *xe, struct xe_gt *gt,
-					struct drm_xe_engine_class_instance *eci,
-					u16 width, u16 num_placements)
-{
-	struct xe_hw_engine *hwe;
-	enum xe_hw_engine_id id;
-	u32 logical_mask = 0;
-
-	if (XE_IOCTL_DBG(xe, width != 1))
-		return 0;
-	if (XE_IOCTL_DBG(xe, num_placements != 1))
-		return 0;
-	if (XE_IOCTL_DBG(xe, eci[0].engine_instance != 0))
-		return 0;
-
-	eci[0].engine_class = DRM_XE_ENGINE_CLASS_COPY;
-
-	for_each_hw_engine(hwe, gt, id) {
-		if (xe_hw_engine_is_reserved(hwe))
-			continue;
-
-		if (hwe->class ==
-		    user_to_xe_engine_class[DRM_XE_ENGINE_CLASS_COPY])
-			logical_mask |= BIT(hwe->logical_instance);
-	}
-
-	return logical_mask;
-}
-
 static u32 calc_validate_logical_mask(struct xe_device *xe, struct xe_gt *gt,
 				      struct drm_xe_engine_class_instance *eci,
 				      u16 width, u16 num_placements)
@@ -544,7 +511,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	struct drm_xe_engine_class_instance __user *user_eci =
 		u64_to_user_ptr(args->instances);
 	struct xe_hw_engine *hwe;
-	struct xe_vm *vm, *migrate_vm;
+	struct xe_vm *vm;
 	struct xe_gt *gt;
 	struct xe_exec_queue *q = NULL;
 	u32 logical_mask;
@@ -570,48 +537,15 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 		return -EINVAL;
 
 	if (eci[0].engine_class == DRM_XE_ENGINE_CLASS_VM_BIND) {
-		for_each_gt(gt, xe, id) {
-			struct xe_exec_queue *new;
-			u32 flags;
-
-			if (xe_gt_is_media_type(gt))
-				continue;
-
-			eci[0].gt_id = gt->info.id;
-			logical_mask = bind_exec_queue_logical_mask(xe, gt, eci,
-								    args->width,
-								    args->num_placements);
-			if (XE_IOCTL_DBG(xe, !logical_mask))
-				return -EINVAL;
+		if (XE_IOCTL_DBG(xe, args->extensions))
+			return -EINVAL;
 
-			hwe = find_hw_engine(xe, eci[0]);
-			if (XE_IOCTL_DBG(xe, !hwe))
-				return -EINVAL;
-
-			/* The migration vm doesn't hold rpm ref */
-			xe_device_mem_access_get(xe);
-
-			flags = EXEC_QUEUE_FLAG_VM | (id ? EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD : 0);
-
-			migrate_vm = xe_migrate_get_vm(gt_to_tile(gt)->migrate);
-			new = xe_exec_queue_create(xe, migrate_vm, logical_mask,
-						   args->width, hwe, flags,
-						   args->extensions);
-
-			xe_device_mem_access_put(xe); /* now held by engine */
-
-			xe_vm_put(migrate_vm);
-			if (IS_ERR(new)) {
-				err = PTR_ERR(new);
-				if (q)
-					goto put_exec_queue;
-				return err;
-			}
-			if (id == 0)
-				q = new;
-			else
-				list_add_tail(&new->multi_gt_list,
-					      &q->multi_gt_link);
+		xe_device_mem_access_get(xe);
+		q = xe_pt_exec_queue_create(xe);
+		xe_device_mem_access_put(xe); /* now held by exec queue */
+		if (IS_ERR(q)) {
+			err = PTR_ERR(q);
+			return err;
 		}
 	} else {
 		gt = xe_device_get_gt(xe, eci[0].gt_id);
@@ -714,8 +648,7 @@ int xe_exec_queue_get_property_ioctl(struct drm_device *dev, void *data,
  */
 bool xe_exec_queue_is_lr(struct xe_exec_queue *q)
 {
-	return q->vm && xe_vm_in_lr_mode(q->vm) &&
-		!(q->flags & EXEC_QUEUE_FLAG_VM);
+	return q->vm && xe_vm_in_lr_mode(q->vm);
 }
 
 static s32 xe_exec_queue_num_job_inflight(struct xe_exec_queue *q)
@@ -753,6 +686,12 @@ bool xe_exec_queue_ring_full(struct xe_exec_queue *q)
  */
 bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 {
+	if (q->flags & EXEC_QUEUE_FLAG_PT) {
+		struct dma_fence *fence = q->last_fence ?: dma_fence_get_stub();
+
+		return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
+	}
+
 	if (xe_exec_queue_is_parallel(q)) {
 		int i;
 
@@ -771,16 +710,9 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
-	struct xe_exec_queue *eq = q, *next;
-
-	list_for_each_entry_safe(eq, next, &eq->multi_gt_list,
-				 multi_gt_link) {
-		q->ops->kill(eq);
-		xe_vm_remove_compute_exec_queue(q->vm, eq);
-	}
-
 	q->ops->kill(q);
-	xe_vm_remove_compute_exec_queue(q->vm, q);
+	if (q->vm)
+		xe_vm_remove_compute_exec_queue(q->vm, q);
 }
 
 int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
@@ -812,7 +744,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
 static void xe_exec_queue_last_fence_lockdep_assert(struct xe_exec_queue *q,
 						    struct xe_vm *vm)
 {
-	if (q->flags & EXEC_QUEUE_FLAG_VM)
+	if (q->flags & EXEC_QUEUE_FLAG_PT)
 		lockdep_assert_held(&vm->lock);
 	else
 		xe_vm_assert_held(vm);
diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
index 62b3d9d1d7cd..3a2dcaed561f 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
@@ -19,6 +19,7 @@ struct xe_execlist_exec_queue;
 struct xe_gt;
 struct xe_guc_exec_queue;
 struct xe_hw_engine;
+struct xe_pt_exec_queue;
 struct xe_vm;
 
 enum xe_exec_queue_priority {
@@ -38,6 +39,8 @@ enum xe_exec_queue_priority {
  * a kernel object.
  */
 struct xe_exec_queue {
+	/** @xe: Xe device */
+	struct xe_device *xe;
 	/** @gt: graphics tile this exec queue can submit to */
 	struct xe_gt *gt;
 	/**
@@ -78,12 +81,10 @@ struct xe_exec_queue {
 #define EXEC_QUEUE_FLAG_PERMANENT		BIT(2)
 /* queue keeps running pending jobs after destroy ioctl */
 #define EXEC_QUEUE_FLAG_PERSISTENT		BIT(3)
-/* for VM jobs. Caller needs to hold rpm ref when creating queue with this flag */
-#define EXEC_QUEUE_FLAG_VM			BIT(4)
-/* child of VM queue for multi-tile VM jobs */
-#define EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD	BIT(5)
+/* for PT jobs. Caller needs to hold rpm ref when creating queue with this flag */
+#define EXEC_QUEUE_FLAG_PT			BIT(4)
 /* kernel exec_queue only, set priority to highest level */
-#define EXEC_QUEUE_FLAG_HIGH_PRIORITY		BIT(6)
+#define EXEC_QUEUE_FLAG_HIGH_PRIORITY		BIT(5)
 
 	/**
 	 * @flags: flags for this exec queue, should statically setup aside from ban
@@ -91,18 +92,13 @@ struct xe_exec_queue {
 	 */
 	unsigned long flags;
 
-	union {
-		/** @multi_gt_list: list head for VM bind engines if multi-GT */
-		struct list_head multi_gt_list;
-		/** @multi_gt_link: link for VM bind engines if multi-GT */
-		struct list_head multi_gt_link;
-	};
-
 	union {
 		/** @execlist: execlist backend specific state for exec queue */
 		struct xe_execlist_exec_queue *execlist;
 		/** @guc: GuC backend specific state for exec queue */
 		struct xe_guc_exec_queue *guc;
+		/** @pt: PT backend specific state for exec queue */
+		struct xe_pt_exec_queue *pt;
 	};
 
 	/**
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 73c535193a98..e4f5a80a46fc 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -19,7 +19,6 @@
 #include "xe_guc.h"
 #include "xe_guc_ct.h"
 #include "xe_migrate.h"
-#include "xe_pt.h"
 #include "xe_trace.h"
 #include "xe_vm.h"
 
@@ -209,8 +208,13 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 
 	/* Bind VMA only to the GT that has faulted */
 	trace_xe_vma_pf_bind(vma);
-	fence = __xe_pt_bind_vma(tile, vma, xe_tile_migrate_engine(tile), NULL, 0,
-				 vma->tile_present & BIT(tile->id));
+	ret = xe_vm_populate_dummy_rebind(vm, vma, BIT(tile->id));
+	if (ret)
+		goto unlock_dma_resv;
+	vm->dummy_ops.vops.pt_update_ops[tile->id].q =
+		xe_tile_migrate_bind_exec_queue(tile);
+	fence = xe_vm_ops_execute(vm, &vm->dummy_ops.vops);
+	xe_vma_ops_free(&vm->dummy_ops.vops);
 	if (IS_ERR(fence)) {
 		ret = PTR_ERR(fence);
 		goto unlock_dma_resv;
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index a3c4ffba679d..ac2bf86de39a 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -264,11 +264,15 @@ int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt)
 }
 
 /**
- * xe_gt_tlb_invalidation_vma - Issue a TLB invalidation on this GT for a VMA
+ * xe_gt_tlb_invalidation_range - Issue a TLB invalidation on this GT for an
+ * address range
+ *
  * @gt: graphics tile
  * @fence: invalidation fence which will be signal on TLB invalidation
  * completion, can be NULL
- * @vma: VMA to invalidate
+ * @start: start address
+ * @end: end address
+ * @asid: address space id
  *
  * Issue a range based TLB invalidation if supported, if not fallback to a full
  * TLB invalidation. Completion of TLB is asynchronous and caller can either use
@@ -278,17 +282,15 @@ int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt)
  * Return: Seqno which can be passed to xe_gt_tlb_invalidation_wait on success,
  * negative error code on error.
  */
-int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
-			       struct xe_gt_tlb_invalidation_fence *fence,
-			       struct xe_vma *vma)
+int xe_gt_tlb_invalidation_range(struct xe_gt *gt,
+				 struct xe_gt_tlb_invalidation_fence *fence,
+				 u64 start, u64 end, u32 asid)
 {
 	struct xe_device *xe = gt_to_xe(gt);
 #define MAX_TLB_INVALIDATION_LEN	7
 	u32 action[MAX_TLB_INVALIDATION_LEN];
 	int len = 0;
 
-	xe_gt_assert(gt, vma);
-
 	/* Execlists not supported */
 	if (gt_to_xe(gt)->info.force_execlist) {
 		if (fence)
@@ -302,8 +304,8 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 	if (!xe->info.has_range_tlb_invalidation) {
 		action[len++] = MAKE_INVAL_OP(XE_GUC_TLB_INVAL_FULL);
 	} else {
-		u64 start = xe_vma_start(vma);
-		u64 length = xe_vma_size(vma);
+		u64 orig_start = start;
+		u64 length = end - start;
 		u64 align, end;
 
 		if (length < SZ_4K)
@@ -316,12 +318,12 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 		 * address mask covering the required range.
 		 */
 		align = roundup_pow_of_two(length);
-		start = ALIGN_DOWN(xe_vma_start(vma), align);
-		end = ALIGN(xe_vma_end(vma), align);
+		start = ALIGN_DOWN(start, align);
+		end = ALIGN(end, align);
 		length = align;
 		while (start + length < end) {
 			length <<= 1;
-			start = ALIGN_DOWN(xe_vma_start(vma), length);
+			start = ALIGN_DOWN(orig_start, length);
 		}
 
 		/*
@@ -330,16 +332,17 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 		 */
 		if (length >= SZ_2M) {
 			length = max_t(u64, SZ_16M, length);
-			start = ALIGN_DOWN(xe_vma_start(vma), length);
+			start = ALIGN_DOWN(orig_start, length);
 		}
 
 		xe_gt_assert(gt, length >= SZ_4K);
 		xe_gt_assert(gt, is_power_of_2(length));
-		xe_gt_assert(gt, !(length & GENMASK(ilog2(SZ_16M) - 1, ilog2(SZ_2M) + 1)));
+		xe_gt_assert(gt, !(length & GENMASK(ilog2(SZ_16M) - 1,
+						    ilog2(SZ_2M) + 1)));
 		xe_gt_assert(gt, IS_ALIGNED(start, length));
 
 		action[len++] = MAKE_INVAL_OP(XE_GUC_TLB_INVAL_PAGE_SELECTIVE);
-		action[len++] = xe_vma_vm(vma)->usm.asid;
+		action[len++] = asid;
 		action[len++] = lower_32_bits(start);
 		action[len++] = upper_32_bits(start);
 		action[len++] = ilog2(length) - ilog2(SZ_4K);
@@ -350,6 +353,32 @@ int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 	return send_tlb_invalidation(&gt->uc.guc, fence, action, len);
 }
 
+/**
+ * xe_gt_tlb_invalidation_vma - Issue a TLB invalidation on this GT for a VMA
+ * @gt: graphics tile
+ * @fence: invalidation fence which will be signal on TLB invalidation
+ * completion, can be NULL
+ * @vma: VMA to invalidate
+ *
+ * Issue a range based TLB invalidation if supported, if not fallback to a full
+ * TLB invalidation. Completion of TLB is asynchronous and caller can either use
+ * the invalidation fence or seqno + xe_gt_tlb_invalidation_wait to wait for
+ * completion.
+ *
+ * Return: Seqno which can be passed to xe_gt_tlb_invalidation_wait on success,
+ * negative error code on error.
+ */
+int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
+			       struct xe_gt_tlb_invalidation_fence *fence,
+			       struct xe_vma *vma)
+{
+	xe_gt_assert(gt, vma);
+
+	return xe_gt_tlb_invalidation_range(gt, fence, xe_vma_start(vma),
+					    xe_vma_end(vma),
+					    xe_vma_vm(vma)->usm.asid);
+}
+
 /**
  * xe_gt_tlb_invalidation_wait - Wait for TLB to complete
  * @gt: graphics tile
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
index fbb743d80d2c..bf3bebd9f985 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
@@ -20,6 +20,9 @@ int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt);
 int xe_gt_tlb_invalidation_vma(struct xe_gt *gt,
 			       struct xe_gt_tlb_invalidation_fence *fence,
 			       struct xe_vma *vma);
+int xe_gt_tlb_invalidation_range(struct xe_gt *gt,
+				 struct xe_gt_tlb_invalidation_fence *fence,
+				 u64 start, u64 end, u32 asid);
 int xe_gt_tlb_invalidation_wait(struct xe_gt *gt, int seqno);
 int xe_guc_tlb_invalidation_done_handler(struct xe_guc *guc, u32 *msg, u32 len);
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 19efdb2f881f..83dc799589db 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -17,6 +17,7 @@
 #include "abi/guc_klvs_abi.h"
 #include "regs/xe_lrc_layout.h"
 #include "xe_assert.h"
+#include "xe_bo.h"
 #include "xe_devcoredump.h"
 #include "xe_device.h"
 #include "xe_exec_queue.h"
@@ -719,6 +720,11 @@ static void submit_exec_queue(struct xe_exec_queue *q)
 	}
 }
 
+static bool is_pt_job(struct xe_sched_job *job)
+{
+	return test_bit(JOB_FLAG_PT, &job->fence->flags);
+}
+
 static struct dma_fence *
 guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 {
@@ -728,6 +734,8 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
 	struct xe_device *xe = guc_to_xe(guc);
 	bool lr = xe_exec_queue_is_lr(q);
 
+	xe_assert(xe, !is_pt_job(job));
+	xe_assert(xe, !(q->flags & EXEC_QUEUE_FLAG_PT));
 	xe_assert(xe, !(exec_queue_destroyed(q) || exec_queue_pending_disable(q)) ||
 		  exec_queue_banned(q) || exec_queue_suspended(q));
 
@@ -929,6 +937,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	int err = -ETIME;
 	int i = 0;
 
+	xe_assert(xe, !(q->flags & EXEC_QUEUE_FLAG_PT));
+
 	/*
 	 * TDR has fired before free job worker. Common if exec queue
 	 * immediately closed after last fence signaled.
@@ -943,8 +953,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 		   xe_sched_job_seqno(job), q->guc->id, q->flags);
 	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_KERNEL,
 		   "Kernel-submitted job timed out\n");
-	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q),
-		   "VM job timed out on non-killed execqueue\n");
 
 	simple_error_capture(q);
 	xe_devcoredump(job);
@@ -958,8 +966,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * Kernel jobs should never fail, nor should VM jobs if they do
 	 * somethings has gone wrong and the GT needs a reset
 	 */
-	if (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
-	    (q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q))) {
+	if (q->flags & EXEC_QUEUE_FLAG_KERNEL) {
 		if (!xe_sched_invalidate_job(job, 2)) {
 			xe_sched_add_pending_job(sched, job);
 			xe_sched_submission_start(sched);
@@ -1439,11 +1446,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
 	trace_xe_exec_queue_stop(q);
 
 	/*
-	 * Ban any engine (aside from kernel and engines used for VM ops) with a
-	 * started but not complete job or if a job has gone through a GT reset
-	 * more than twice.
+	 * Ban any engine (aside from kernel) with a started but not complete
+	 * job or if a job has gone through a GT reset more than twice.
 	 */
-	if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM))) {
+	if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL)) {
 		struct xe_sched_job *job = xe_sched_first_pending_job(sched);
 
 		if (job) {
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index ee1bb938c493..82b63bdb9c47 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -28,6 +28,7 @@
 #include "xe_map.h"
 #include "xe_mocs.h"
 #include "xe_pt.h"
+#include "xe_pt_exec_queue.h"
 #include "xe_res_cursor.h"
 #include "xe_sched_job.h"
 #include "xe_sync.h"
@@ -41,6 +42,8 @@
 struct xe_migrate {
 	/** @q: Default exec queue used for migration */
 	struct xe_exec_queue *q;
+	/** @bind_q: Default exec queue used for binds */
+	struct xe_exec_queue *bind_q;
 	/** @tile: Backpointer to the tile this struct xe_migrate belongs to. */
 	struct xe_tile *tile;
 	/** @job_mutex: Timeline mutex for @eng. */
@@ -84,19 +87,24 @@ struct xe_migrate {
 #define MAX_PTE_PER_SDI 0x1FE
 
 /**
- * xe_tile_migrate_engine() - Get this tile's migrate engine.
+ * xe_tile_migrate_exec_queue() - Get this tile's migrate exec queue.
  * @tile: The tile.
  *
- * Returns the default migrate engine of this tile.
+ * Returns the default migrate exec queue of this tile.
  * TODO: Perhaps this function is slightly misplaced, and even unneeded?
  *
- * Return: The default migrate engine
+ * Return: The default migrate exec queue
  */
-struct xe_exec_queue *xe_tile_migrate_engine(struct xe_tile *tile)
+struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile)
 {
 	return tile->migrate->q;
 }
 
+struct xe_exec_queue *xe_tile_migrate_bind_exec_queue(struct xe_tile *tile)
+{
+	return tile->migrate->bind_q;
+}
+
 static void xe_migrate_fini(struct drm_device *dev, void *arg)
 {
 	struct xe_migrate *m = arg;
@@ -111,6 +119,8 @@ static void xe_migrate_fini(struct drm_device *dev, void *arg)
 	mutex_destroy(&m->job_mutex);
 	xe_vm_close_and_put(m->q->vm);
 	xe_exec_queue_put(m->q);
+	if (m->bind_q)
+		xe_exec_queue_put(m->bind_q);
 }
 
 static u64 xe_migrate_vm_addr(u64 slot, u32 level)
@@ -368,6 +378,12 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
 		if (!hwe || !logical_mask)
 			return ERR_PTR(-EINVAL);
 
+		m->bind_q = xe_pt_exec_queue_create(xe);
+		if (IS_ERR(m->bind_q)) {
+			xe_vm_close_and_put(vm);
+			return ERR_CAST(m->bind_q);
+		}
+
 		m->q = xe_exec_queue_create(xe, vm, logical_mask, 1, hwe,
 					    EXEC_QUEUE_FLAG_KERNEL |
 					    EXEC_QUEUE_FLAG_PERMANENT |
@@ -379,6 +395,8 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
 						  EXEC_QUEUE_FLAG_PERMANENT);
 	}
 	if (IS_ERR(m->q)) {
+		if (m->bind_q)
+			xe_exec_queue_put(m->bind_q);
 		xe_vm_close_and_put(vm);
 		return ERR_CAST(m->q);
 	}
@@ -1105,50 +1123,6 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 	return fence;
 }
 
-static void write_pgtable(struct xe_tile *tile, struct xe_bb *bb, u64 ppgtt_ofs,
-			  const struct xe_vm_pgtable_update *update,
-			  struct xe_migrate_pt_update *pt_update)
-{
-	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
-	u32 chunk;
-	u32 ofs = update->ofs, size = update->qwords;
-
-	/*
-	 * If we have 512 entries (max), we would populate it ourselves,
-	 * and update the PDE above it to the new pointer.
-	 * The only time this can only happen if we have to update the top
-	 * PDE. This requires a BO that is almost vm->size big.
-	 *
-	 * This shouldn't be possible in practice.. might change when 16K
-	 * pages are used. Hence the assert.
-	 */
-	xe_tile_assert(tile, update->qwords < MAX_NUM_PTE);
-	if (!ppgtt_ofs)
-		ppgtt_ofs = xe_migrate_vram_ofs(tile_to_xe(tile),
-						xe_bo_addr(update->pt_bo, 0,
-							   XE_PAGE_SIZE));
-
-	do {
-		u64 addr = ppgtt_ofs + ofs * 8;
-
-		chunk = min(size, MAX_PTE_PER_SDI);
-
-		/* Ensure populatefn can do memset64 by aligning bb->cs */
-		if (!(bb->len & 1))
-			bb->cs[bb->len++] = MI_NOOP;
-
-		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
-		bb->cs[bb->len++] = lower_32_bits(addr);
-		bb->cs[bb->len++] = upper_32_bits(addr);
-		ops->populate(pt_update, tile, NULL, bb->cs + bb->len, ofs, chunk,
-			      update);
-
-		bb->len += chunk * 2;
-		ofs += chunk;
-		size -= chunk;
-	} while (size);
-}
-
 struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m)
 {
 	return xe_vm_get(m->q->vm);
@@ -1164,289 +1138,152 @@ struct migrate_test_params {
 	container_of(_priv, struct migrate_test_params, base)
 #endif
 
+void __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
+				      const struct xe_migrate_pt_update_ops *ops,
+				      struct xe_vm_pgtable_update_op *pt_op,
+				      int num_ops)
+{
+	u32 j, i;
+
+	for (j = 0; j < num_ops; ++j, ++pt_op) {
+		for (i = 0; i < pt_op->num_entries; i++) {
+			const struct xe_vm_pgtable_update *update =
+				&pt_op->entries[i];
+
+			if (pt_op->bind)
+				ops->populate(tile, &update->pt_bo->vmap,
+					      NULL, update->ofs, update->qwords,
+					      update);
+			else
+				ops->clear(vm, tile, &update->pt_bo->vmap,
+					   NULL, update->ofs, update->qwords,
+					   update);
+		}
+	}
+
+	trace_xe_vm_cpu_bind(vm);
+	xe_device_wmb(vm->xe);
+}
+
 static struct dma_fence *
 xe_migrate_update_pgtables_cpu(struct xe_migrate *m,
-			       struct xe_vm *vm, struct xe_bo *bo,
-			       const struct  xe_vm_pgtable_update *updates,
-			       u32 num_updates, bool wait_vm,
 			       struct xe_migrate_pt_update *pt_update)
 {
 	XE_TEST_DECLARE(struct migrate_test_params *test =
 			to_migrate_test_params
 			(xe_cur_kunit_priv(XE_TEST_LIVE_MIGRATE));)
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
-	struct dma_fence *fence;
+	struct xe_vm *vm = pt_update->vops->vm;
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&pt_update->vops->pt_update_ops[pt_update->tile_id];
 	int err;
-	u32 i;
 
 	if (XE_TEST_ONLY(test && test->force_gpu))
 		return ERR_PTR(-ETIME);
 
-	if (bo && !dma_resv_test_signaled(bo->ttm.base.resv,
-					  DMA_RESV_USAGE_KERNEL))
-		return ERR_PTR(-ETIME);
-
-	if (wait_vm && !dma_resv_test_signaled(xe_vm_resv(vm),
-					       DMA_RESV_USAGE_BOOKKEEP))
-		return ERR_PTR(-ETIME);
-
 	if (ops->pre_commit) {
 		pt_update->job = NULL;
 		err = ops->pre_commit(pt_update);
 		if (err)
 			return ERR_PTR(err);
 	}
-	for (i = 0; i < num_updates; i++) {
-		const struct xe_vm_pgtable_update *update = &updates[i];
-
-		ops->populate(pt_update, m->tile, &update->pt_bo->vmap, NULL,
-			      update->ofs, update->qwords, update);
-	}
-
-	if (vm) {
-		trace_xe_vm_cpu_bind(vm);
-		xe_device_wmb(vm->xe);
-	}
-
-	fence = dma_fence_get_stub();
-
-	return fence;
-}
-
-static bool no_in_syncs(struct xe_vm *vm, struct xe_exec_queue *q,
-			struct xe_sync_entry *syncs, u32 num_syncs)
-{
-	struct dma_fence *fence;
-	int i;
-
-	for (i = 0; i < num_syncs; i++) {
-		fence = syncs[i].fence;
 
-		if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
-				       &fence->flags))
-			return false;
-	}
-	if (q) {
-		fence = xe_exec_queue_last_fence_get(q, vm);
-		if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
-			dma_fence_put(fence);
-			return false;
-		}
-		dma_fence_put(fence);
-	}
+	__xe_migrate_update_pgtables_cpu(vm, m->tile, ops,
+					 pt_update_ops->ops,
+					 pt_update_ops->num_ops);
 
-	return true;
+	return dma_fence_get_stub();
 }
 
-/**
- * xe_migrate_update_pgtables() - Pipelined page-table update
- * @m: The migrate context.
- * @vm: The vm we'll be updating.
- * @bo: The bo whose dma-resv we will await before updating, or NULL if userptr.
- * @q: The exec queue to be used for the update or NULL if the default
- * migration engine is to be used.
- * @updates: An array of update descriptors.
- * @num_updates: Number of descriptors in @updates.
- * @syncs: Array of xe_sync_entry to await before updating. Note that waits
- * will block the engine timeline.
- * @num_syncs: Number of entries in @syncs.
- * @pt_update: Pointer to a struct xe_migrate_pt_update, which contains
- * pointers to callback functions and, if subclassed, private arguments to
- * those.
- *
- * Perform a pipelined page-table update. The update descriptors are typically
- * built under the same lock critical section as a call to this function. If
- * using the default engine for the updates, they will be performed in the
- * order they grab the job_mutex. If different engines are used, external
- * synchronization is needed for overlapping updates to maintain page-table
- * consistency. Note that the meaing of "overlapping" is that the updates
- * touch the same page-table, which might be a higher-level page-directory.
- * If no pipelining is needed, then updates may be performed by the cpu.
- *
- * Return: A dma_fence that, when signaled, indicates the update completion.
- */
-struct dma_fence *
-xe_migrate_update_pgtables(struct xe_migrate *m,
-			   struct xe_vm *vm,
-			   struct xe_bo *bo,
-			   struct xe_exec_queue *q,
-			   const struct xe_vm_pgtable_update *updates,
-			   u32 num_updates,
-			   struct xe_sync_entry *syncs, u32 num_syncs,
-			   struct xe_migrate_pt_update *pt_update)
+static struct dma_fence *
+__xe_migrate_update_pgtables(struct xe_migrate *m,
+			     struct xe_migrate_pt_update *pt_update,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops)
 {
 	const struct xe_migrate_pt_update_ops *ops = pt_update->ops;
 	struct xe_tile *tile = m->tile;
-	struct xe_gt *gt = tile->primary_gt;
-	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_sched_job *job;
 	struct dma_fence *fence;
-	struct drm_suballoc *sa_bo = NULL;
-	struct xe_vma *vma = pt_update->vma;
-	struct xe_bb *bb;
-	u32 i, batch_size, ppgtt_ofs, update_idx, page_ofs = 0;
-	u64 addr;
-	int err = 0;
-	bool usm = !q && xe->info.has_usm;
-	bool first_munmap_rebind = vma &&
-		vma->gpuva.flags & XE_VMA_FIRST_REBIND;
-	struct xe_exec_queue *q_override = !q ? m->q : q;
-	u16 pat_index = xe->pat.idx[XE_CACHE_WB];
-
-	/* Use the CPU if no in syncs and engine is idle */
-	if (no_in_syncs(vm, q, syncs, num_syncs) && xe_exec_queue_is_idle(q_override)) {
-		fence =  xe_migrate_update_pgtables_cpu(m, vm, bo, updates,
-							num_updates,
-							first_munmap_rebind,
-							pt_update);
-		if (!IS_ERR(fence) || fence == ERR_PTR(-EAGAIN))
-			return fence;
-	}
-
-	/* fixed + PTE entries */
-	if (IS_DGFX(xe))
-		batch_size = 2;
-	else
-		batch_size = 6 + num_updates * 2;
-
-	for (i = 0; i < num_updates; i++) {
-		u32 num_cmds = DIV_ROUND_UP(updates[i].qwords, MAX_PTE_PER_SDI);
-
-		/* align noop + MI_STORE_DATA_IMM cmd prefix */
-		batch_size += 4 * num_cmds + updates[i].qwords * 2;
-	}
-
-	/*
-	 * XXX: Create temp bo to copy from, if batch_size becomes too big?
-	 *
-	 * Worst case: Sum(2 * (each lower level page size) + (top level page size))
-	 * Should be reasonably bound..
-	 */
-	xe_tile_assert(tile, batch_size < SZ_128K);
-
-	bb = xe_bb_new(gt, batch_size, !q && xe->info.has_usm);
-	if (IS_ERR(bb))
-		return ERR_CAST(bb);
-
-	/* For sysmem PTE's, need to map them in our hole.. */
-	if (!IS_DGFX(xe)) {
-		ppgtt_ofs = NUM_KERNEL_PDE - 1;
-		if (q) {
-			xe_tile_assert(tile, num_updates <= NUM_VMUSA_WRITES_PER_UNIT);
-
-			sa_bo = drm_suballoc_new(&m->vm_update_sa, 1,
-						 GFP_KERNEL, true, 0);
-			if (IS_ERR(sa_bo)) {
-				err = PTR_ERR(sa_bo);
-				goto err;
-			}
-
-			ppgtt_ofs = NUM_KERNEL_PDE +
-				(drm_suballoc_soffset(sa_bo) /
-				 NUM_VMUSA_UNIT_PER_PAGE);
-			page_ofs = (drm_suballoc_soffset(sa_bo) %
-				    NUM_VMUSA_UNIT_PER_PAGE) *
-				VM_SA_UPDATE_UNIT_SIZE;
-		}
-
-		/* Map our PT's to gtt */
-		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(num_updates);
-		bb->cs[bb->len++] = ppgtt_ofs * XE_PAGE_SIZE + page_ofs;
-		bb->cs[bb->len++] = 0; /* upper_32_bits */
-
-		for (i = 0; i < num_updates; i++) {
-			struct xe_bo *pt_bo = updates[i].pt_bo;
-
-			xe_tile_assert(tile, pt_bo->size == SZ_4K);
-
-			addr = vm->pt_ops->pte_encode_bo(pt_bo, 0, pat_index, 0);
-			bb->cs[bb->len++] = lower_32_bits(addr);
-			bb->cs[bb->len++] = upper_32_bits(addr);
-		}
-
-		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
-		update_idx = bb->len;
-
-		addr = xe_migrate_vm_addr(ppgtt_ofs, 0) +
-			(page_ofs / sizeof(u64)) * XE_PAGE_SIZE;
-		for (i = 0; i < num_updates; i++)
-			write_pgtable(tile, bb, addr + i * XE_PAGE_SIZE,
-				      &updates[i], pt_update);
-	} else {
-		/* phys pages, no preamble required */
-		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
-		update_idx = bb->len;
-
-		for (i = 0; i < num_updates; i++)
-			write_pgtable(tile, bb, 0, &updates[i], pt_update);
-	}
+	bool is_migrate = pt_update_ops->q == m->bind_q;
+	int err;
 
-	if (!q)
+	if (is_migrate)
 		mutex_lock(&m->job_mutex);
 
-	job = xe_bb_create_migration_job(q ?: m->q, bb,
-					 xe_migrate_batch_base(m, usm),
-					 update_idx);
+	job = xe_sched_job_create(pt_update_ops->q, NULL);
 	if (IS_ERR(job)) {
 		err = PTR_ERR(job);
 		goto err_bb;
 	}
 
-	/* Wait on BO move */
-	if (bo) {
-		err = job_add_deps(job, bo->ttm.base.resv,
-				   DMA_RESV_USAGE_KERNEL);
-		if (err)
-			goto err_job;
-	}
-
-	/*
-	 * Munmap style VM unbind, need to wait for all jobs to be complete /
-	 * trigger preempts before moving forward
-	 */
-	if (first_munmap_rebind) {
-		err = job_add_deps(job, xe_vm_resv(vm),
-				   DMA_RESV_USAGE_BOOKKEEP);
-		if (err)
-			goto err_job;
-	}
-
-	err = xe_sched_job_last_fence_add_dep(job, vm);
-	for (i = 0; !err && i < num_syncs; i++)
-		err = xe_sync_entry_add_deps(&syncs[i], job);
-
-	if (err)
-		goto err_job;
-
 	if (ops->pre_commit) {
 		pt_update->job = job;
 		err = ops->pre_commit(pt_update);
 		if (err)
 			goto err_job;
 	}
+
+	set_bit(JOB_FLAG_PT, &job->fence->flags);
+	job->pt_update[0].vm = pt_update->vops->vm;
+	job->pt_update[0].tile = tile;
+	job->pt_update[0].ops = ops;
+	job->pt_update[0].pt_op = pt_update_ops->ops;
+	job->pt_update[0].num_ops = pt_update_ops->num_ops;
+	job->pt_update[0].deferred = pt_update_ops->deferred;
+
+	/* Submission backend now owns freeing of pt_update_ops->ops */
+	init_llist_head(&pt_update_ops->deferred);
+	pt_update_ops->skip_free = true;
+
 	xe_sched_job_arm(job);
 	fence = dma_fence_get(&job->drm.s_fence->finished);
 	xe_sched_job_push(job);
 
-	if (!q)
+	if (is_migrate)
 		mutex_unlock(&m->job_mutex);
 
-	xe_bb_free(bb, fence);
-	drm_suballoc_free(sa_bo, fence);
-
 	return fence;
 
 err_job:
 	xe_sched_job_put(job);
 err_bb:
-	if (!q)
+	if (is_migrate)
 		mutex_unlock(&m->job_mutex);
-	xe_bb_free(bb, NULL);
-err:
-	drm_suballoc_free(sa_bo, NULL);
 	return ERR_PTR(err);
 }
 
+/**
+ * xe_migrate_update_pgtables() - Pipelined page-table update
+ * @m: The migrate context.
+ * @pt_update: PT update arguments
+ *
+ * Perform a pipelined page-table update. The update descriptors are typically
+ * built under the same lock critical section as a call to this function. If
+ * using the default engine for the updates, they will be performed in the
+ * order they grab the job_mutex. If different engines are used, external
+ * synchronization is needed for overlapping updates to maintain page-table
+ * consistency. Note that the meaning of "overlapping" is that the updates
+ * touch the same page-table, which might be a higher-level page-directory.
+ * If no pipelining is needed, then updates may be performed by the cpu.
+ *
+ * Return: A dma_fence that, when signaled, indicates the update completion.
+ */
+struct dma_fence *
+xe_migrate_update_pgtables(struct xe_migrate *m,
+			   struct xe_migrate_pt_update *pt_update)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&pt_update->vops->pt_update_ops[pt_update->tile_id];
+	struct dma_fence *fence;
+
+	fence = xe_migrate_update_pgtables_cpu(m, pt_update);
+	if (!IS_ERR(fence))
+		return fence;
+
+	return __xe_migrate_update_pgtables(m, pt_update, pt_update_ops);
+}
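
For reference, the reworked entry point is meant to be driven roughly as below; this mirrors the caller added later in xe_pt_update_ops_run(), with error handling simplified (illustrative only):

	struct xe_migrate_pt_update update = {
		.ops = &migrate_ops,	/* populate/clear/pre_commit callbacks */
		.vops = vops,		/* VMA operations for this tile */
		.tile_id = tile->id,
	};
	struct dma_fence *fence;

	/* CPU path is tried first; on failure a scheduler job is created */
	fence = xe_migrate_update_pgtables(tile->migrate, &update);
	if (IS_ERR(fence))
		return PTR_ERR(fence);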
+
 /**
  * xe_migrate_wait() - Complete all operations using the xe_migrate context
  * @m: Migrate context to wait for.
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 951f19318ea4..701bb27349b0 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -22,6 +22,7 @@ struct xe_pt;
 struct xe_tile;
 struct xe_vm;
 struct xe_vm_pgtable_update;
+struct xe_vm_pgtable_update_op;
 struct xe_vma;
 
 /**
@@ -31,7 +32,6 @@ struct xe_vma;
 struct xe_migrate_pt_update_ops {
 	/**
 	 * @populate: Populate a command buffer or page-table with ptes.
-	 * @pt_update: Embeddable callback argument.
 	 * @tile: The tile for the current operation.
 	 * @map: struct iosys_map into the memory to be populated.
 	 * @pos: If @map is NULL, map into the memory to be populated.
@@ -43,10 +43,27 @@ struct xe_migrate_pt_update_ops {
 	 * page-table system to populate command buffers or shared
 	 * page-tables with PTEs.
 	 */
-	void (*populate)(struct xe_migrate_pt_update *pt_update,
-			 struct xe_tile *tile, struct iosys_map *map,
+	void (*populate)(struct xe_tile *tile, struct iosys_map *map,
 			 void *pos, u32 ofs, u32 num_qwords,
 			 const struct xe_vm_pgtable_update *update);
+	/**
+	 * @clear: Clear a command buffer or page-table of ptes.
+	 * @vm: VM being updated
+	 * @tile: The tile for the current operation.
+	 * @map: struct iosys_map into the memory to be cleared.
+	 * @pos: If @map is NULL, map into the memory to be cleared.
+	 * @ofs: qword offset into @map, unused if @map is NULL.
+	 * @num_qwords: Number of qwords to write.
+	 * @update: Information about the PTEs to be removed.
+	 *
+	 * This interface is intended to be used as a callback into the
+	 * page-table system to clear command buffers or shared
+	 * page-tables of PTEs.
+	 */
+	void (*clear)(struct xe_vm *vm, struct xe_tile *tile,
+		      struct iosys_map *map, void *pos, u32 ofs,
+		      u32 num_qwords,
+		      const struct xe_vm_pgtable_update *update);
 
 	/**
 	 * @pre_commit: Callback to be called just before arming the
@@ -67,14 +84,10 @@ struct xe_migrate_pt_update_ops {
 struct xe_migrate_pt_update {
 	/** @ops: Pointer to the struct xe_migrate_pt_update_ops callbacks */
 	const struct xe_migrate_pt_update_ops *ops;
-	/** @vma: The vma we're updating the pagetable for. */
-	struct xe_vma *vma;
+	/** @vops: VMA operations */
+	struct xe_vma_ops *vops;
 	/** @job: The job if a GPU page-table update. NULL otherwise */
 	struct xe_sched_job *job;
-	/** @start: Start of update for the range fence */
-	u64 start;
-	/** @last: Last of update for the range fence */
-	u64 last;
 	/** @tile_id: Tile ID of the update */
 	u8 tile_id;
 };
@@ -94,17 +107,18 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 
 struct xe_vm *xe_migrate_get_vm(struct xe_migrate *m);
 
+void __xe_migrate_update_pgtables_cpu(struct xe_vm *vm, struct xe_tile *tile,
+				      const struct xe_migrate_pt_update_ops *ops,
+				      struct xe_vm_pgtable_update_op *pt_op,
+				      int num_ops);
+
 struct dma_fence *
 xe_migrate_update_pgtables(struct xe_migrate *m,
-			   struct xe_vm *vm,
-			   struct xe_bo *bo,
-			   struct xe_exec_queue *q,
-			   const struct xe_vm_pgtable_update *updates,
-			   u32 num_updates,
-			   struct xe_sync_entry *syncs, u32 num_syncs,
 			   struct xe_migrate_pt_update *pt_update);
 
 void xe_migrate_wait(struct xe_migrate *m);
 
-struct xe_exec_queue *xe_tile_migrate_engine(struct xe_tile *tile);
+struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile);
+struct xe_exec_queue *xe_tile_migrate_bind_exec_queue(struct xe_tile *tile);
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index c401d4890386..99968762306c 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -375,6 +375,7 @@ static const struct pci_device_id pciidlist[] = {
 	XE_DG1_IDS(INTEL_VGA_DEVICE, &dg1_desc),
 	XE_ATS_M_IDS(INTEL_VGA_DEVICE, &ats_m_desc),
 	XE_DG2_IDS(INTEL_VGA_DEVICE, &dg2_desc),
+	XE_PVC_IDS(INTEL_VGA_DEVICE, &pvc_desc),
 	XE_MTL_IDS(INTEL_VGA_DEVICE, &mtl_desc),
 	XE_LNL_IDS(INTEL_VGA_DEVICE, &lnl_desc),
 	{ }
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 7f54bc3e389d..1ff01d616dac 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -8,12 +8,14 @@
 #include "xe_bo.h"
 #include "xe_device.h"
 #include "xe_drm_client.h"
+#include "xe_exec_queue.h"
 #include "xe_gt.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_migrate.h"
 #include "xe_pt_types.h"
 #include "xe_pt_walk.h"
 #include "xe_res_cursor.h"
+#include "xe_sync.h"
 #include "xe_trace.h"
 #include "xe_ttm_stolen_mgr.h"
 #include "xe_vm.h"
@@ -324,6 +326,7 @@ xe_pt_new_shared(struct xe_walk_update *wupd, struct xe_pt *parent,
 	entry->pt = parent;
 	entry->flags = 0;
 	entry->qwords = 0;
+	entry->level = parent->level;
 
 	if (alloc_entries) {
 		entry->pt_entries = kmalloc_array(XE_PDES,
@@ -791,9 +794,8 @@ bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma)
 }
 
 static void
-xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *tile,
-		       struct iosys_map *map, void *data,
-		       u32 qword_ofs, u32 num_qwords,
+xe_vm_populate_pgtable(struct xe_tile *tile, struct iosys_map *map,
+		       void *data, u32 qword_ofs, u32 num_qwords,
 		       const struct xe_vm_pgtable_update *update)
 {
 	struct xe_pt_entry *ptes = update->pt_entries;
@@ -809,19 +811,27 @@ xe_vm_populate_pgtable(struct xe_migrate_pt_update *pt_update, struct xe_tile *t
 	}
 }
 
-static void xe_pt_abort_bind(struct xe_vma *vma,
-			     struct xe_vm_pgtable_update *entries,
-			     u32 num_entries)
+static void xe_pt_cancel_bind(struct xe_vma *vma,
+			      struct xe_vm_pgtable_update *entries,
+			      u32 num_entries)
 {
 	u32 i, j;
 
 	for (i = 0; i < num_entries; i++) {
-		if (!entries[i].pt_entries)
+		struct xe_pt *pt = entries[i].pt;
+
+		if (!pt)
 			continue;
 
-		for (j = 0; j < entries[i].qwords; j++)
-			xe_pt_destroy(entries[i].pt_entries[j].pt, xe_vma_vm(vma)->flags, NULL);
+		if (pt->level) {
+			for (j = 0; j < entries[i].qwords; j++)
+				xe_pt_destroy(entries[i].pt_entries[j].pt,
+					      xe_vma_vm(vma)->flags, NULL);
+		}
+
 		kfree(entries[i].pt_entries);
+		entries[i].pt_entries = NULL;
+		entries[i].qwords = 0;
 	}
 }
 
@@ -831,18 +841,15 @@ static void xe_pt_commit_locks_assert(struct xe_vma *vma)
 
 	lockdep_assert_held(&vm->lock);
 
-	if (xe_vma_is_userptr(vma))
-		lockdep_assert_held_read(&vm->userptr.notifier_lock);
-	else if (!xe_vma_is_null(vma))
+	if (!xe_vma_is_userptr(vma) && !xe_vma_is_null(vma))
 		dma_resv_assert_held(xe_vma_bo(vma)->ttm.base.resv);
 
 	xe_vm_assert_held(vm);
 }
 
-static void xe_pt_commit_bind(struct xe_vma *vma,
-			      struct xe_vm_pgtable_update *entries,
-			      u32 num_entries, bool rebind,
-			      struct llist_head *deferred)
+static void xe_pt_commit(struct xe_vma *vma,
+			 struct xe_vm_pgtable_update *entries,
+			 u32 num_entries, struct llist_head *deferred)
 {
 	u32 i, j;
 
@@ -850,31 +857,90 @@ static void xe_pt_commit_bind(struct xe_vma *vma,
 
 	for (i = 0; i < num_entries; i++) {
 		struct xe_pt *pt = entries[i].pt;
+
+		if (!pt->level)
+			continue;
+
+		for (j = 0; j < entries[i].qwords; j++) {
+			struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
+
+			xe_pt_destroy(oldpte, xe_vma_vm(vma)->flags, deferred);
+		}
+	}
+}
+
+static void xe_pt_abort_bind(struct xe_vma *vma,
+			     struct xe_vm_pgtable_update *entries,
+			     u32 num_entries, bool rebind)
+{
+	int i, j;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (i = num_entries - 1; i >= 0; --i) {
+		struct xe_pt *pt = entries[i].pt;
 		struct xe_pt_dir *pt_dir;
 
 		if (!rebind)
-			pt->num_live += entries[i].qwords;
+			pt->num_live -= entries[i].qwords;
 
-		if (!pt->level) {
-			kfree(entries[i].pt_entries);
+		if (!pt->level)
 			continue;
+
+		pt_dir = as_xe_pt_dir(pt);
+		for (j = 0; j < entries[i].qwords; j++) {
+			u32 j_ = j + entries[i].ofs;
+			struct xe_pt *newpte = xe_pt_entry(pt_dir, j_);
+			struct xe_pt *oldpte = entries[i].pt_entries[j].pt;
+
+			pt_dir->children[j_] = oldpte ? &oldpte->base : 0;
+			xe_pt_destroy(newpte, xe_vma_vm(vma)->flags, NULL);
 		}
+	}
+}
+
+static void xe_pt_commit_prepare_bind(struct xe_vma *vma,
+				      struct xe_vm_pgtable_update *entries,
+				      u32 num_entries, bool rebind)
+{
+	u32 i, j;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (i = 0; i < num_entries; i++) {
+		struct xe_pt *pt = entries[i].pt;
+		struct xe_pt_dir *pt_dir;
+
+		if (!rebind)
+			pt->num_live += entries[i].qwords;
+
+		if (!pt->level)
+			continue;
 
 		pt_dir = as_xe_pt_dir(pt);
 		for (j = 0; j < entries[i].qwords; j++) {
 			u32 j_ = j + entries[i].ofs;
 			struct xe_pt *newpte = entries[i].pt_entries[j].pt;
+			struct xe_pt *oldpte = NULL;
 
 			if (xe_pt_entry(pt_dir, j_))
-				xe_pt_destroy(xe_pt_entry(pt_dir, j_),
-					      xe_vma_vm(vma)->flags, deferred);
+				oldpte = xe_pt_entry(pt_dir, j_);
 
 			pt_dir->children[j_] = &newpte->base;
+			entries[i].pt_entries[j].pt = oldpte;
 		}
-		kfree(entries[i].pt_entries);
 	}
 }
 
+static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
+			    u32 num_entries)
+{
+	u32 i;
+
+	for (i = 0; i < num_entries; i++)
+		kfree(entries[i].pt_entries);
+}
+
 static int
 xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
 		   struct xe_vm_pgtable_update *entries, u32 *num_entries)
@@ -885,20 +951,19 @@ xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
 	err = xe_pt_stage_bind(tile, vma, entries, num_entries);
 	if (!err)
 		xe_tile_assert(tile, *num_entries);
-	else /* abort! */
-		xe_pt_abort_bind(vma, entries, *num_entries);
 
 	return err;
 }
 
 static void xe_vm_dbg_print_entries(struct xe_device *xe,
 				    const struct xe_vm_pgtable_update *entries,
-				    unsigned int num_entries)
+				    unsigned int num_entries, bool bind)
 #if (IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM))
 {
 	unsigned int i;
 
-	vm_dbg(&xe->drm, "%u entries to update\n", num_entries);
+	vm_dbg(&xe->drm, "%s: %u entries to update\n", bind ? "bind" : "unbind",
+	       num_entries);
 	for (i = 0; i < num_entries; i++) {
 		const struct xe_vm_pgtable_update *entry = &entries[i];
 		struct xe_pt *xe_pt = entry->pt;
@@ -919,66 +984,122 @@ static void xe_vm_dbg_print_entries(struct xe_device *xe,
 {}
 #endif
 
-#ifdef CONFIG_DRM_XE_USERPTR_INVAL_INJECT
+static int job_add_deps(struct xe_sched_job *job, struct dma_resv *resv,
+			enum dma_resv_usage usage)
+{
+	return drm_sched_job_add_resv_dependencies(&job->drm, resv, usage);
+}
 
-static int xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+static bool no_in_syncs(struct xe_sync_entry *syncs, u32 num_syncs)
 {
-	u32 divisor = uvma->userptr.divisor ? uvma->userptr.divisor : 2;
-	static u32 count;
+	int i;
 
-	if (count++ % divisor == divisor - 1) {
-		struct xe_vm *vm = xe_vma_vm(&uvma->vma);
+	for (i = 0; i < num_syncs; i++) {
+		struct dma_fence *fence = syncs[i].fence;
 
-		uvma->userptr.divisor = divisor << 1;
-		spin_lock(&vm->userptr.invalidated_lock);
-		list_move_tail(&uvma->userptr.invalidate_link,
-			       &vm->userptr.invalidated);
-		spin_unlock(&vm->userptr.invalidated_lock);
-		return true;
+		if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
+				       &fence->flags))
+			return false;
 	}
 
-	return false;
+	return true;
 }
 
-#else
-
-static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+static int vma_add_deps(struct xe_vma *vma, struct xe_sched_job *job)
 {
-	return false;
+	struct xe_bo *bo = xe_vma_bo(vma);
+
+	xe_bo_assert_held(bo);
+
+	if (bo && !bo->vm) {
+		if (!job) {
+			if (!dma_resv_test_signaled(bo->ttm.base.resv,
+						    DMA_RESV_USAGE_KERNEL))
+				return -ETIME;
+		} else {
+			return job_add_deps(job, bo->ttm.base.resv,
+					    DMA_RESV_USAGE_KERNEL);
+		}
+	}
+
+	return 0;
 }
 
-#endif
+static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
+		       struct xe_sched_job *job)
+{
+	int err = 0;
 
-/**
- * struct xe_pt_migrate_pt_update - Callback argument for pre-commit callbacks
- * @base: Base we derive from.
- * @bind: Whether this is a bind or an unbind operation. A bind operation
- *        makes the pre-commit callback error with -EAGAIN if it detects a
- *        pending invalidation.
- * @locked: Whether the pre-commit callback locked the userptr notifier lock
- *          and it needs unlocking.
- */
-struct xe_pt_migrate_pt_update {
-	struct xe_migrate_pt_update base;
-	bool bind;
-	bool locked;
-};
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		err = vma_add_deps(op->map.vma, job);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			err = vma_add_deps(op->remap.prev, job);
+		if (!err && op->remap.next)
+			err = vma_add_deps(op->remap.next, job);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		err = vma_add_deps(gpuva_to_vma(op->base.prefetch.va), job);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
 
-/*
- * This function adds the needed dependencies to a page-table update job
- * to make sure racing jobs for separate bind engines don't race writing
- * to the same page-table range, wreaking havoc. Initially use a single
- * fence for the entire VM. An optimization would use smaller granularity.
- */
 static int xe_pt_vm_dependencies(struct xe_sched_job *job,
-				 struct xe_range_fence_tree *rftree,
-				 u64 start, u64 last)
+				 struct xe_vm *vm,
+				 struct xe_vma_ops *vops,
+				 struct xe_vm_pgtable_update_ops *pt_update_ops,
+				 struct xe_range_fence_tree *rftree)
 {
 	struct xe_range_fence *rtfence;
 	struct dma_fence *fence;
-	int err;
+	struct xe_vma_op *op;
+	int err = 0, i;
+
+	xe_vm_assert_held(vm);
 
-	rtfence = xe_range_fence_tree_first(rftree, start, last);
+	if (!job && !no_in_syncs(vops->syncs, vops->num_syncs))
+		return -ETIME;
+
+	if (!job && !xe_exec_queue_is_idle(pt_update_ops->q))
+		return -ETIME;
+
+	if (pt_update_ops->wait_vm_bookkeep) {
+		if (!job) {
+			if (!dma_resv_test_signaled(xe_vm_resv(vm),
+						    DMA_RESV_USAGE_BOOKKEEP))
+				return -ETIME;
+		} else {
+			err = job_add_deps(job, xe_vm_resv(vm),
+					   DMA_RESV_USAGE_BOOKKEEP);
+			if (err)
+				return err;
+		}
+	} else if (pt_update_ops->wait_vm_kernel) {
+		if (!job) {
+			if (!dma_resv_test_signaled(xe_vm_resv(vm),
+						    DMA_RESV_USAGE_KERNEL))
+				return -ETIME;
+		} else {
+			err = job_add_deps(job, xe_vm_resv(vm),
+					   DMA_RESV_USAGE_KERNEL);
+			if (err)
+				return err;
+		}
+	}
+
+	rtfence = xe_range_fence_tree_first(rftree, pt_update_ops->start,
+					    pt_update_ops->last);
 	while (rtfence) {
 		fence = rtfence->fence;
 
@@ -996,88 +1117,152 @@ static int xe_pt_vm_dependencies(struct xe_sched_job *job,
 				return err;
 		}
 
-		rtfence = xe_range_fence_tree_next(rtfence, start, last);
+		rtfence = xe_range_fence_tree_next(rtfence,
+						   pt_update_ops->start,
+						   pt_update_ops->last);
 	}
 
-	return 0;
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_add_deps(vm, op, job);
+		if (err)
+			return err;
+	}
+
+	for (i = 0; job && !err && i < vops->num_syncs; i++)
+		err = xe_sync_entry_add_deps(&vops->syncs[i], job);
+
+	return err;
 }
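
The same dependency walk thus serves two modes, selected by whether a job is passed in. Condensed, the two invocations made (via the pre_commit callback) during a single update attempt look like this (illustrative only; the real calls happen from the CPU path and the job path in xe_migrate.c):

	/* CPU attempt: job == NULL, only test that everything has signaled */
	err = xe_pt_vm_dependencies(NULL, vm, vops, pt_update_ops, rftree);
	if (err == -ETIME)
		/* GPU fallback: a scheduler job collects the dependencies */
		err = xe_pt_vm_dependencies(job, vm, vops, pt_update_ops,
					    rftree);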
 
 static int xe_pt_pre_commit(struct xe_migrate_pt_update *pt_update)
 {
-	struct xe_range_fence_tree *rftree =
-		&xe_vma_vm(pt_update->vma)->rftree[pt_update->tile_id];
+	struct xe_vma_ops *vops = pt_update->vops;
+	struct xe_vm *vm = vops->vm;
+	struct xe_range_fence_tree *rftree = &vm->rftree[pt_update->tile_id];
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[pt_update->tile_id];
+
+	return xe_pt_vm_dependencies(pt_update->job, vm, pt_update->vops,
+				     pt_update_ops, rftree);
+}
+
+#ifdef CONFIG_DRM_XE_USERPTR_INVAL_INJECT
+
+static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
+{
+	u32 divisor = uvma->userptr.divisor ? uvma->userptr.divisor : 2;
+	static u32 count;
+
+	if (count++ % divisor == divisor - 1) {
+		uvma->userptr.divisor = divisor << 1;
+		return true;
+	}
 
-	return xe_pt_vm_dependencies(pt_update->job, rftree,
-				     pt_update->start, pt_update->last);
+	return false;
 }
 
-static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
+#else
+
+static bool xe_pt_userptr_inject_eagain(struct xe_userptr_vma *uvma)
 {
-	struct xe_pt_migrate_pt_update *userptr_update =
-		container_of(pt_update, typeof(*userptr_update), base);
-	struct xe_userptr_vma *uvma = to_userptr_vma(pt_update->vma);
-	unsigned long notifier_seq = uvma->userptr.notifier_seq;
-	struct xe_vm *vm = xe_vma_vm(&uvma->vma);
-	int err = xe_pt_vm_dependencies(pt_update->job,
-					&vm->rftree[pt_update->tile_id],
-					pt_update->start,
-					pt_update->last);
+	return false;
+}
 
-	if (err)
-		return err;
+#endif
 
-	userptr_update->locked = false;
+static void vma_check_userptr(struct xe_vm *vm, struct xe_vma *vma)
+{
+	struct xe_userptr_vma *uvma;
+	unsigned long notifier_seq;
 
-	/*
-	 * Wait until nobody is running the invalidation notifier, and
-	 * since we're exiting the loop holding the notifier lock,
-	 * nobody can proceed invalidating either.
-	 *
-	 * Note that we don't update the vma->userptr.notifier_seq since
-	 * we don't update the userptr pages.
-	 */
-	do {
-		down_read(&vm->userptr.notifier_lock);
-		if (!mmu_interval_read_retry(&uvma->userptr.notifier,
-					     notifier_seq))
-			break;
+	lockdep_assert_held_read(&vm->userptr.notifier_lock);
 
-		up_read(&vm->userptr.notifier_lock);
+	if (!xe_vma_is_userptr(vma))
+		return;
 
-		if (userptr_update->bind)
-			return -EAGAIN;
+	uvma = to_userptr_vma(vma);
+	notifier_seq = uvma->userptr.notifier_seq;
 
-		notifier_seq = mmu_interval_read_begin(&uvma->userptr.notifier);
-	} while (true);
+	if (uvma->userptr.initial_bind || xe_vm_in_fault_mode(vm))
+		return;
 
-	/* Inject errors to test_whether they are handled correctly */
-	if (userptr_update->bind && xe_pt_userptr_inject_eagain(uvma)) {
-		up_read(&vm->userptr.notifier_lock);
-		return -EAGAIN;
+	if (!mmu_interval_read_retry(&uvma->userptr.notifier,
+				     notifier_seq) &&
+	    !xe_pt_userptr_inject_eagain(uvma))
+		return;
+
+	spin_lock(&vm->userptr.invalidated_lock);
+	list_move_tail(&uvma->userptr.invalidate_link,
+		       &vm->userptr.invalidated);
+	spin_unlock(&vm->userptr.invalidated_lock);
+
+	if (xe_vm_in_preempt_fence_mode(vm)) {
+		struct dma_resv_iter cursor;
+		struct dma_fence *fence;
+
+		dma_resv_iter_begin(&cursor, xe_vm_resv(vm),
+				    DMA_RESV_USAGE_BOOKKEEP);
+		dma_resv_for_each_fence_unlocked(&cursor, fence)
+			dma_fence_enable_sw_signaling(fence);
+		dma_resv_iter_end(&cursor);
 	}
+}
 
-	userptr_update->locked = true;
+static void op_check_userptr(struct xe_vm *vm, struct xe_vma_op *op)
+{
+	lockdep_assert_held_read(&vm->userptr.notifier_lock);
 
-	return 0;
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		vma_check_userptr(vm, op->map.vma);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			vma_check_userptr(vm, op->remap.prev);
+		if (op->remap.next)
+			vma_check_userptr(vm, op->remap.next);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		vma_check_userptr(vm, gpuva_to_vma(op->base.prefetch.va));
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
 }
 
-static const struct xe_migrate_pt_update_ops bind_ops = {
-	.populate = xe_vm_populate_pgtable,
-	.pre_commit = xe_pt_pre_commit,
-};
+static int xe_pt_userptr_pre_commit(struct xe_migrate_pt_update *pt_update)
+{
+	struct xe_vm *vm = pt_update->vops->vm;
+	struct xe_vma_ops *vops = pt_update->vops;
+	struct xe_vma_op *op;
+	int err;
 
-static const struct xe_migrate_pt_update_ops userptr_bind_ops = {
-	.populate = xe_vm_populate_pgtable,
-	.pre_commit = xe_pt_userptr_pre_commit,
-};
+	err = xe_pt_pre_commit(pt_update);
+	if (err)
+		return err;
+
+	down_read(&vm->userptr.notifier_lock);
+
+	list_for_each_entry(op, &vops->list, link)
+		op_check_userptr(vm, op);
+
+	return 0;
+}
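
Unlike the old per-VMA pre-commit, the notifier lock taken here is intentionally left held and is only dropped once the fences have been installed. The pairing implemented by this patch, in outline:

	/* xe_pt_userptr_pre_commit() */
	down_read(&vm->userptr.notifier_lock);
	/* ... op_check_userptr() per op, job armed, fences installed ... */
	/* xe_pt_update_ops_run(), on both the success and error paths */
	if (pt_update_ops->needs_userptr_lock)
		up_read(&vm->userptr.notifier_lock);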
 
 struct invalidation_fence {
 	struct xe_gt_tlb_invalidation_fence base;
 	struct xe_gt *gt;
-	struct xe_vma *vma;
 	struct dma_fence *fence;
 	struct dma_fence_cb cb;
 	struct work_struct work;
+	u64 start;
+	u64 end;
+	u32 asid;
 };
 
 static const char *
@@ -1105,7 +1290,7 @@ static void invalidation_fence_cb(struct dma_fence *fence,
 
 	trace_xe_gt_tlb_invalidation_fence_cb(&ifence->base);
 	if (!ifence->fence->error) {
-		queue_work(system_wq, &ifence->work);
+		queue_work(ifence->gt->ordered_wq, &ifence->work);
 	} else {
 		ifence->base.base.error = ifence->fence->error;
 		dma_fence_signal(&ifence->base.base);
@@ -1120,13 +1305,14 @@ static void invalidation_fence_work_func(struct work_struct *w)
 		container_of(w, struct invalidation_fence, work);
 
 	trace_xe_gt_tlb_invalidation_fence_work_func(&ifence->base);
-	xe_gt_tlb_invalidation_vma(ifence->gt, &ifence->base, ifence->vma);
+	xe_gt_tlb_invalidation_range(ifence->gt, &ifence->base, ifence->start,
+				     ifence->end, ifence->asid);
 }
 
 static int invalidation_fence_init(struct xe_gt *gt,
 				   struct invalidation_fence *ifence,
 				   struct dma_fence *fence,
-				   struct xe_vma *vma)
+				   u64 start, u64 end, u32 asid)
 {
 	int ret;
 
@@ -1144,7 +1330,9 @@ static int invalidation_fence_init(struct xe_gt *gt,
 	dma_fence_get(&ifence->base.base);	/* Ref for caller */
 	ifence->fence = fence;
 	ifence->gt = gt;
-	ifence->vma = vma;
+	ifence->start = start;
+	ifence->end = end;
+	ifence->asid = asid;
 
 	INIT_WORK(&ifence->work, invalidation_fence_work_func);
 	ret = dma_fence_add_callback(fence, &ifence->cb, invalidation_fence_cb);
@@ -1161,178 +1349,6 @@ static int invalidation_fence_init(struct xe_gt *gt,
 	return ret && ret != -ENOENT ? ret : 0;
 }
 
-static void xe_pt_calc_rfence_interval(struct xe_vma *vma,
-				       struct xe_pt_migrate_pt_update *update,
-				       struct xe_vm_pgtable_update *entries,
-				       u32 num_entries)
-{
-	int i, level = 0;
-
-	for (i = 0; i < num_entries; i++) {
-		const struct xe_vm_pgtable_update *entry = &entries[i];
-
-		if (entry->pt->level > level)
-			level = entry->pt->level;
-	}
-
-	/* Greedy (non-optimal) calculation but simple */
-	update->base.start = ALIGN_DOWN(xe_vma_start(vma),
-					0x1ull << xe_pt_shift(level));
-	update->base.last = ALIGN(xe_vma_end(vma),
-				  0x1ull << xe_pt_shift(level)) - 1;
-}
-
-/**
- * __xe_pt_bind_vma() - Build and connect a page-table tree for the vma
- * address range.
- * @tile: The tile to bind for.
- * @vma: The vma to bind.
- * @q: The exec_queue with which to do pipelined page-table updates.
- * @syncs: Entries to sync on before binding the built tree to the live vm tree.
- * @num_syncs: Number of @sync entries.
- * @rebind: Whether we're rebinding this vma to the same address range without
- * an unbind in-between.
- *
- * This function builds a page-table tree (see xe_pt_stage_bind() for more
- * information on page-table building), and the xe_vm_pgtable_update entries
- * abstracting the operations needed to attach it to the main vm tree. It
- * then takes the relevant locks and updates the metadata side of the main
- * vm tree and submits the operations for pipelined attachment of the
- * gpu page-table to the vm main tree, (which can be done either by the
- * cpu and the GPU).
- *
- * Return: A valid dma-fence representing the pipelined attachment operation
- * on success, an error pointer on error.
- */
-struct dma_fence *
-__xe_pt_bind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool rebind)
-{
-	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
-	struct xe_pt_migrate_pt_update bind_pt_update = {
-		.base = {
-			.ops = xe_vma_is_userptr(vma) ? &userptr_bind_ops : &bind_ops,
-			.vma = vma,
-			.tile_id = tile->id,
-		},
-		.bind = true,
-	};
-	struct xe_vm *vm = xe_vma_vm(vma);
-	u32 num_entries;
-	struct dma_fence *fence;
-	struct invalidation_fence *ifence = NULL;
-	struct xe_range_fence *rfence;
-	int err;
-
-	bind_pt_update.locked = false;
-	xe_bo_assert_held(xe_vma_bo(vma));
-	xe_vm_assert_held(vm);
-
-	vm_dbg(&xe_vma_vm(vma)->xe->drm,
-	       "Preparing bind, with range [%llx...%llx) engine %p.\n",
-	       xe_vma_start(vma), xe_vma_end(vma), q);
-
-	err = xe_pt_prepare_bind(tile, vma, entries, &num_entries);
-	if (err)
-		goto err;
-	xe_tile_assert(tile, num_entries <= ARRAY_SIZE(entries));
-
-	xe_vm_dbg_print_entries(tile_to_xe(tile), entries, num_entries);
-	xe_pt_calc_rfence_interval(vma, &bind_pt_update, entries,
-				   num_entries);
-
-	/*
-	 * If rebind, we have to invalidate TLB on !LR vms to invalidate
-	 * cached PTEs point to freed memory. on LR vms this is done
-	 * automatically when the context is re-enabled by the rebind worker,
-	 * or in fault mode it was invalidated on PTE zapping.
-	 *
-	 * If !rebind, and scratch enabled VMs, there is a chance the scratch
-	 * PTE is already cached in the TLB so it needs to be invalidated.
-	 * on !LR VMs this is done in the ring ops preceding a batch, but on
-	 * non-faulting LR, in particular on user-space batch buffer chaining,
-	 * it needs to be done here.
-	 */
-	if ((rebind && !xe_vm_in_lr_mode(vm) && !vm->batch_invalidate_tlb) ||
-	    (!rebind && xe_vm_has_scratch(vm) && xe_vm_in_preempt_fence_mode(vm))) {
-		ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
-		if (!ifence)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
-	if (!rfence) {
-		kfree(ifence);
-		return ERR_PTR(-ENOMEM);
-	}
-
-	fence = xe_migrate_update_pgtables(tile->migrate,
-					   vm, xe_vma_bo(vma), q,
-					   entries, num_entries,
-					   syncs, num_syncs,
-					   &bind_pt_update.base);
-	if (!IS_ERR(fence)) {
-		bool last_munmap_rebind = vma->gpuva.flags & XE_VMA_LAST_REBIND;
-		LLIST_HEAD(deferred);
-		int err;
-
-		err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
-					    &xe_range_fence_kfree_ops,
-					    bind_pt_update.base.start,
-					    bind_pt_update.base.last, fence);
-		if (err)
-			dma_fence_wait(fence, false);
-
-		/* TLB invalidation must be done before signaling rebind */
-		if (ifence) {
-			int err = invalidation_fence_init(tile->primary_gt, ifence, fence,
-							  vma);
-			if (err) {
-				dma_fence_put(fence);
-				kfree(ifence);
-				return ERR_PTR(err);
-			}
-			fence = &ifence->base.base;
-		}
-
-		/* add shared fence now for pagetable delayed destroy */
-		dma_resv_add_fence(xe_vm_resv(vm), fence, !rebind &&
-				   last_munmap_rebind ?
-				   DMA_RESV_USAGE_KERNEL :
-				   DMA_RESV_USAGE_BOOKKEEP);
-
-		if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
-					   DMA_RESV_USAGE_BOOKKEEP);
-		xe_pt_commit_bind(vma, entries, num_entries, rebind,
-				  bind_pt_update.locked ? &deferred : NULL);
-
-		/* This vma is live (again?) now */
-		vma->tile_present |= BIT(tile->id);
-
-		if (bind_pt_update.locked) {
-			to_userptr_vma(vma)->userptr.initial_bind = true;
-			up_read(&vm->userptr.notifier_lock);
-			xe_bo_put_commit(&deferred);
-		}
-		if (!rebind && last_munmap_rebind &&
-		    xe_vm_in_preempt_fence_mode(vm))
-			xe_vm_queue_rebind_worker(vm);
-	} else {
-		kfree(rfence);
-		kfree(ifence);
-		if (bind_pt_update.locked)
-			up_read(&vm->userptr.notifier_lock);
-		xe_pt_abort_bind(vma, entries, num_entries);
-	}
-
-	return fence;
-
-err:
-	return ERR_PTR(err);
-}
-
 struct xe_pt_stage_unbind_walk {
 	/** @base: The pagewalk base-class. */
 	struct xe_pt_walk base;
@@ -1430,7 +1446,7 @@ xe_pt_stage_unbind_post_descend(struct xe_ptw *parent, pgoff_t offset,
 				     &end_offset))
 		return 0;
 
-	(void)xe_pt_new_shared(&xe_walk->wupd, xe_child, offset, false);
+	(void)xe_pt_new_shared(&xe_walk->wupd, xe_child, offset, true);
 	xe_walk->wupd.updates[level].update->qwords = end_offset - offset;
 
 	return 0;
@@ -1478,13 +1494,12 @@ static unsigned int xe_pt_stage_unbind(struct xe_tile *tile, struct xe_vma *vma,
 }
 
 static void
-xe_migrate_clear_pgtable_callback(struct xe_migrate_pt_update *pt_update,
-				  struct xe_tile *tile, struct iosys_map *map,
-				  void *ptr, u32 qword_ofs, u32 num_qwords,
+xe_migrate_clear_pgtable_callback(struct xe_vm *vm, struct xe_tile *tile,
+				  struct iosys_map *map, void *ptr,
+				  u32 qword_ofs, u32 num_qwords,
 				  const struct xe_vm_pgtable_update *update)
 {
-	struct xe_vma *vma = pt_update->vma;
-	u64 empty = __xe_pt_empty_pte(tile, xe_vma_vm(vma), update->pt->level);
+	u64 empty = __xe_pt_empty_pte(tile, vm, update->level);
 	int i;
 
 	if (map && map->is_iomem)
@@ -1498,171 +1513,556 @@ xe_migrate_clear_pgtable_callback(struct xe_migrate_pt_update *pt_update,
 		memset64(ptr, empty, num_qwords);
 }
 
+static void xe_pt_abort_unbind(struct xe_vma *vma,
+			       struct xe_vm_pgtable_update *entries,
+			       u32 num_entries)
+{
+	int j, i;
+
+	xe_pt_commit_locks_assert(vma);
+
+	for (j = num_entries - 1; j >= 0; --j) {
+		struct xe_vm_pgtable_update *entry = &entries[j];
+		struct xe_pt *pt = entry->pt;
+		struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
+
+		pt->num_live += entry->qwords;
+
+		if (!pt->level)
+			continue;
+
+		for (i = entry->ofs; i < entry->ofs + entry->qwords; i++)
+			pt_dir->children[i] =
+				entries[j].pt_entries[i - entry->ofs].pt ?
+				&entries[j].pt_entries[i - entry->ofs].pt->base : 0;
+	}
+}
+
 static void
-xe_pt_commit_unbind(struct xe_vma *vma,
-		    struct xe_vm_pgtable_update *entries, u32 num_entries,
-		    struct llist_head *deferred)
+xe_pt_commit_prepare_unbind(struct xe_vma *vma,
+			    struct xe_vm_pgtable_update *entries,
+			    u32 num_entries)
 {
-	u32 j;
+	int j, i;
 
 	xe_pt_commit_locks_assert(vma);
 
 	for (j = 0; j < num_entries; ++j) {
 		struct xe_vm_pgtable_update *entry = &entries[j];
 		struct xe_pt *pt = entry->pt;
+		struct xe_pt_dir *pt_dir;
 
 		pt->num_live -= entry->qwords;
-		if (pt->level) {
-			struct xe_pt_dir *pt_dir = as_xe_pt_dir(pt);
-			u32 i;
+		if (!pt->level)
+			continue;
 
-			for (i = entry->ofs; i < entry->ofs + entry->qwords;
-			     i++) {
-				if (xe_pt_entry(pt_dir, i))
-					xe_pt_destroy(xe_pt_entry(pt_dir, i),
-						      xe_vma_vm(vma)->flags, deferred);
+		pt_dir = as_xe_pt_dir(pt);
+		for (i = entry->ofs; i < entry->ofs + entry->qwords; i++) {
+			if (xe_pt_entry(pt_dir, i))
+				entries[j].pt_entries[i - entry->ofs].pt =
+					xe_pt_entry(pt_dir, i);
+			else
+				entries[j].pt_entries[i - entry->ofs].pt = NULL;
+			pt_dir->children[i] = NULL;
+		}
+	}
+}
 
-				pt_dir->children[i] = NULL;
-			}
+static void
+xe_pt_update_ops_rfence_interval(struct xe_vm_pgtable_update_ops *pt_update_ops,
+				 struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	int i, level = 0;
+	u64 start, last;
+
+	for (i = 0; i < pt_op->num_entries; i++) {
+		const struct xe_vm_pgtable_update *entry = &pt_op->entries[i];
+
+		if (entry->pt->level > level)
+			level = entry->pt->level;
+	}
+
+	/* Greedy (non-optimal) calculation but simple */
+	start = ALIGN_DOWN(xe_vma_start(vma), 0x1ull << xe_pt_shift(level));
+	last = ALIGN(xe_vma_end(vma), 0x1ull << xe_pt_shift(level)) - 1;
+
+	if (start < pt_update_ops->start)
+		pt_update_ops->start = start;
+	if (last > pt_update_ops->last)
+		pt_update_ops->last = last;
+}
+
+static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
+			   struct xe_vm_pgtable_update_ops *pt_update_ops,
+			   struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+	int err;
+
+	xe_bo_assert_held(xe_vma_bo(vma));
+
+	vm_dbg(&xe_vma_vm(vma)->xe->drm,
+	       "Preparing bind, with range [%llx...%llx)\n",
+	       xe_vma_start(vma), xe_vma_end(vma) - 1);
+
+	pt_op->vma = NULL;
+	pt_op->bind = true;
+	pt_op->rebind = BIT(tile->id) & vma->tile_present;
+
+	err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
+				 &pt_op->num_entries);
+	if (!err) {
+		xe_tile_assert(tile, pt_op->num_entries <=
+			       ARRAY_SIZE(pt_op->entries));
+		xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+					pt_op->num_entries, true);
+
+		xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+		++pt_update_ops->current_op;
+		pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
+
+		/*
+		 * If rebind, we have to invalidate TLB on !LR vms to invalidate
+		 * cached PTEs point to freed memory. on LR vms this is done
+		 * automatically when the context is re-enabled by the rebind
+		 * worker, or in fault mode it was invalidated on PTE zapping.
+		 *
+		 * If !rebind, and scratch enabled VMs, there is a chance the
+		 * scratch PTE is already cached in the TLB so it needs to be
+		 * invalidated. on !LR VMs this is done in the ring ops
+		 * preceding a batch, but on non-faulting LR, in particular on
+		 * user-space batch buffer chaining, it needs to be done here.
+		 */
+		pt_update_ops->needs_invalidation |=
+			(pt_op->rebind && !xe_vm_in_lr_mode(vm) &&
+			!vm->batch_invalidate_tlb) ||
+			(!pt_op->rebind && vm->scratch_pt[tile->id] &&
+			 xe_vm_in_preempt_fence_mode(vm));
+
+		pt_op->vma = vma;
+		xe_pt_commit_prepare_bind(vma, pt_op->entries,
+					  pt_op->num_entries, pt_op->rebind);
+	} else {
+		xe_pt_cancel_bind(vma, pt_op->entries, pt_op->num_entries);
+	}
+
+	return err;
+}
+
+static int unbind_op_prepare(struct xe_tile *tile,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops,
+			     struct xe_vma *vma)
+{
+	u32 current_op = pt_update_ops->current_op;
+	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
+
+	xe_bo_assert_held(xe_vma_bo(vma));
+
+	vm_dbg(&xe_vma_vm(vma)->xe->drm,
+	       "Preparing unbind, with range [%llx...%llx)\n",
+	       xe_vma_start(vma), xe_vma_end(vma) - 1);
+
+	pt_op->vma = vma;
+	pt_op->bind = false;
+	pt_op->rebind = false;
+
+	pt_op->num_entries = xe_pt_stage_unbind(tile, vma, pt_op->entries);
+
+	xe_vm_dbg_print_entries(tile_to_xe(tile), pt_op->entries,
+				pt_op->num_entries, false);
+	xe_pt_update_ops_rfence_interval(pt_update_ops, vma);
+	++pt_update_ops->current_op;
+	pt_update_ops->needs_userptr_lock |= xe_vma_is_userptr(vma);
+	pt_update_ops->needs_invalidation = true;
+
+	xe_pt_commit_prepare_unbind(vma, pt_op->entries, pt_op->num_entries);
+
+	return 0;
+}
+
+static int op_prepare(struct xe_vm *vm,
+		      struct xe_tile *tile,
+		      struct xe_vm_pgtable_update_ops *pt_update_ops,
+		      struct xe_vma_op *op)
+{
+	int err = 0;
+
+	xe_vm_assert_held(vm);
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
+		pt_update_ops->wait_vm_kernel = true;
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		err = unbind_op_prepare(tile, pt_update_ops,
+					gpuva_to_vma(op->base.remap.unmap->va));
+
+		if (!err && op->remap.prev) {
+			err = bind_op_prepare(vm, tile, pt_update_ops,
+					      op->remap.prev);
+			pt_update_ops->wait_vm_bookkeep = true;
+		}
+		if (!err && op->remap.next) {
+			err = bind_op_prepare(vm, tile, pt_update_ops,
+					      op->remap.next);
+			pt_update_ops->wait_vm_bookkeep = true;
+		}
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		err = unbind_op_prepare(tile, pt_update_ops,
+					gpuva_to_vma(op->base.unmap.va));
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		err = bind_op_prepare(vm, tile, pt_update_ops,
+				      gpuva_to_vma(op->base.prefetch.va));
+		pt_update_ops->wait_vm_kernel = true;
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
+
+static void
+xe_pt_update_ops_init(struct xe_vm_pgtable_update_ops *pt_update_ops)
+{
+	init_llist_head(&pt_update_ops->deferred);
+	pt_update_ops->start = ~0x0ull;
+	pt_update_ops->last = 0x0ull;
+}
+
+/**
+ * xe_pt_update_ops_prepare() - Prepare PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Prepare PT update operations, which includes updating internal PT state,
+ * allocating memory for page tables, populating the page tables being pruned
+ * in, and creating PT update operations for leaf insertion / removal.
+ *
+ * Return: 0 on success, negative error code on error.
+ */
+int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	struct xe_vma_op *op;
+	int err;
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	xe_pt_update_ops_init(pt_update_ops);
+
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_prepare(vops->vm, tile, pt_update_ops, op);
+
+		if (err)
+			return err;
+	}
+
+	xe_tile_assert(tile, pt_update_ops->current_op ==
+		       pt_update_ops->num_ops);
+
+#ifdef TEST_VM_OPS_ERROR
+	if (vops->inject_error &&
+	    vops->vm->xe->vm_inject_error_position == FORCE_OP_ERROR_PREPARE)
+		return -ENOSPC;
+#endif
+
+	return 0;
+}
+
+static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
+			   struct xe_vm_pgtable_update_ops *pt_update_ops,
+			   struct xe_vma *vma, struct dma_fence *fence)
+{
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+	vma->tile_present |= BIT(tile->id);
+	if (xe_vma_is_userptr(vma)) {
+		lockdep_assert_held_read(&vm->userptr.notifier_lock);
+		to_userptr_vma(vma)->userptr.initial_bind = true;
+	}
+
+	/*
+	 * Kick rebind worker if this bind triggers preempt fences and not in
+	 * the rebind worker
+	 */
+	if (pt_update_ops->wait_vm_bookkeep &&
+	    xe_vm_in_preempt_fence_mode(vm) &&
+	    !current->mm)
+		xe_vm_queue_rebind_worker(vm);
+}
+
+static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
+			     struct xe_vm_pgtable_update_ops *pt_update_ops,
+			     struct xe_vma *vma, struct dma_fence *fence)
+{
+	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
+		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
+				   pt_update_ops->wait_vm_bookkeep ?
+				   DMA_RESV_USAGE_KERNEL :
+				   DMA_RESV_USAGE_BOOKKEEP);
+	vma->tile_present &= ~BIT(tile->id);
+	if (!vma->tile_present) {
+		list_del_init(&vma->combined_links.rebind);
+		if (xe_vma_is_userptr(vma)) {
+			lockdep_assert_held_read(&vm->userptr.notifier_lock);
+
+			spin_lock(&vm->userptr.invalidated_lock);
+			list_del_init(&to_userptr_vma(vma)->userptr.invalidate_link);
+			spin_unlock(&vm->userptr.invalidated_lock);
 		}
 	}
 }
 
-static const struct xe_migrate_pt_update_ops unbind_ops = {
-	.populate = xe_migrate_clear_pgtable_callback,
+static void op_commit(struct xe_vm *vm,
+		      struct xe_tile *tile,
+		      struct xe_vm_pgtable_update_ops *pt_update_ops,
+		      struct xe_vma_op *op, struct dma_fence *fence)
+{
+	xe_vm_assert_held(vm);
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+			break;
+
+		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		unbind_op_commit(vm, tile, pt_update_ops,
+				 gpuva_to_vma(op->base.remap.unmap->va), fence);
+
+		if (op->remap.prev)
+			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
+				       fence);
+		if (op->remap.next)
+			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
+				       fence);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		unbind_op_commit(vm, tile, pt_update_ops,
+				 gpuva_to_vma(op->base.unmap.va), fence);
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		bind_op_commit(vm, tile, pt_update_ops,
+			       gpuva_to_vma(op->base.prefetch.va), fence);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+}
+
+static const struct xe_migrate_pt_update_ops migrate_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.clear = xe_migrate_clear_pgtable_callback,
 	.pre_commit = xe_pt_pre_commit,
 };
 
-static const struct xe_migrate_pt_update_ops userptr_unbind_ops = {
-	.populate = xe_migrate_clear_pgtable_callback,
+static const struct xe_migrate_pt_update_ops userptr_migrate_ops = {
+	.populate = xe_vm_populate_pgtable,
+	.clear = xe_migrate_clear_pgtable_callback,
 	.pre_commit = xe_pt_userptr_pre_commit,
 };
 
 /**
- * __xe_pt_unbind_vma() - Disconnect and free a page-table tree for the vma
- * address range.
- * @tile: The tile to unbind for.
- * @vma: The vma to unbind.
- * @q: The exec_queue with which to do pipelined page-table updates.
- * @syncs: Entries to sync on before disconnecting the tree to be destroyed.
- * @num_syncs: Number of @sync entries.
+ * xe_pt_update_ops_run() - Run PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
  *
- * This function builds a the xe_vm_pgtable_update entries abstracting the
- * operations needed to detach the page-table tree to be destroyed from the
- * man vm tree.
- * It then takes the relevant locks and submits the operations for
- * pipelined detachment of the gpu page-table from  the vm main tree,
- * (which can be done either by the cpu and the GPU), Finally it frees the
- * detached page-table tree.
+ * Run PT update operations, which includes committing internal PT state
+ * changes, creating a job for the PT update operations (leaf insertion /
+ * removal), and installing the job fence in the VM / BO reservation objects
+ * and the range-fence tree.
  *
- * Return: A valid dma-fence representing the pipelined detachment operation
- * on success, an error pointer on error.
+ * Return: fence on success, negative ERR_PTR on error.
  */
 struct dma_fence *
-__xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		   struct xe_sync_entry *syncs, u32 num_syncs)
+xe_pt_update_ops_run(struct xe_tile *tile, struct xe_vma_ops *vops)
 {
-	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
-	struct xe_pt_migrate_pt_update unbind_pt_update = {
-		.base = {
-			.ops = xe_vma_is_userptr(vma) ? &userptr_unbind_ops :
-			&unbind_ops,
-			.vma = vma,
-			.tile_id = tile->id,
-		},
-	};
-	struct xe_vm *vm = xe_vma_vm(vma);
-	u32 num_entries;
-	struct dma_fence *fence = NULL;
-	struct invalidation_fence *ifence;
+	struct xe_vm *vm = vops->vm;
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	struct dma_fence *fence;
+	struct invalidation_fence *ifence = NULL;
 	struct xe_range_fence *rfence;
+	struct xe_vma_op *op;
+	int err = 0, i;
+	struct xe_migrate_pt_update update = {
+		.ops = pt_update_ops->needs_userptr_lock ?
+			&userptr_migrate_ops :
+			&migrate_ops,
+		.vops = vops,
+		.tile_id = tile->id
+	};
 
-	LLIST_HEAD(deferred);
-
-	xe_bo_assert_held(xe_vma_bo(vma));
+	lockdep_assert_held(&vm->lock);
 	xe_vm_assert_held(vm);
 
-	vm_dbg(&xe_vma_vm(vma)->xe->drm,
-	       "Preparing unbind, with range [%llx...%llx) engine %p.\n",
-	       xe_vma_start(vma), xe_vma_end(vma), q);
-
-	num_entries = xe_pt_stage_unbind(tile, vma, entries);
-	xe_tile_assert(tile, num_entries <= ARRAY_SIZE(entries));
-
-	xe_vm_dbg_print_entries(tile_to_xe(tile), entries, num_entries);
-	xe_pt_calc_rfence_interval(vma, &unbind_pt_update, entries,
-				   num_entries);
+#ifdef TEST_VM_OPS_ERROR
+	if (vops->inject_error &&
+	    vm->xe->vm_inject_error_position == FORCE_OP_ERROR_RUN)
+		return ERR_PTR(-ENOSPC);
+#endif
 
-	ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
-	if (!ifence)
-		return ERR_PTR(-ENOMEM);
+	if (pt_update_ops->needs_invalidation) {
+		ifence = kzalloc(sizeof(*ifence), GFP_KERNEL);
+		if (!ifence) {
+			err = -ENOMEM;
+			goto kill_vm_tile1;
+		}
+	}
 
 	rfence = kzalloc(sizeof(*rfence), GFP_KERNEL);
 	if (!rfence) {
-		kfree(ifence);
-		return ERR_PTR(-ENOMEM);
+		err = -ENOMEM;
+		goto free_ifence;
 	}
 
-	/*
-	 * Even if we were already evicted and unbind to destroy, we need to
-	 * clear again here. The eviction may have updated pagetables at a
-	 * lower level, because it needs to be more conservative.
-	 */
-	fence = xe_migrate_update_pgtables(tile->migrate,
-					   vm, NULL, q ? q :
-					   vm->q[tile->id],
-					   entries, num_entries,
-					   syncs, num_syncs,
-					   &unbind_pt_update.base);
-	if (!IS_ERR(fence)) {
-		int err;
-
-		err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
-					    &xe_range_fence_kfree_ops,
-					    unbind_pt_update.base.start,
-					    unbind_pt_update.base.last, fence);
+	/* Point of no return - VM killed if failure after this */
+	for (i = 0; i < pt_update_ops->num_ops; ++i) {
+		struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[i];
+
+		xe_pt_commit(pt_op->vma, pt_op->entries,
+			     pt_op->num_entries, &pt_update_ops->deferred);
+		pt_op->vma = NULL;	/* skip in xe_pt_update_ops_abort */
+	}
+
+	fence = xe_migrate_update_pgtables(tile->migrate, &update);
+	if (IS_ERR(fence)) {
+		err = PTR_ERR(fence);
+		goto kill_vm_tile0;
+	}
+
+	err = xe_range_fence_insert(&vm->rftree[tile->id], rfence,
+				    &xe_range_fence_kfree_ops,
+				    pt_update_ops->start,
+				    pt_update_ops->last, fence);
+	if (err)
+		dma_fence_wait(fence, false);
+
+	/* tlb invalidation must be done before signaling rebind */
+	if (ifence) {
+		err = invalidation_fence_init(tile->primary_gt, ifence, fence,
+					      pt_update_ops->start,
+					      pt_update_ops->last,
+					      vm->usm.asid);
 		if (err)
-			dma_fence_wait(fence, false);
-
-		/* TLB invalidation must be done before signaling unbind */
-		err = invalidation_fence_init(tile->primary_gt, ifence, fence, vma);
-		if (err) {
-			dma_fence_put(fence);
-			kfree(ifence);
-			return ERR_PTR(err);
-		}
+			goto put_fence;
 		fence = &ifence->base.base;
+	}
 
-		/* add shared fence now for pagetable delayed destroy */
-		dma_resv_add_fence(xe_vm_resv(vm), fence,
-				   DMA_RESV_USAGE_BOOKKEEP);
+	dma_resv_add_fence(xe_vm_resv(vm), fence,
+			   pt_update_ops->wait_vm_bookkeep ?
+			   DMA_RESV_USAGE_KERNEL :
+			   DMA_RESV_USAGE_BOOKKEEP);
 
-		/* This fence will be installed by caller when doing eviction */
-		if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
-			dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
-					   DMA_RESV_USAGE_BOOKKEEP);
-		xe_pt_commit_unbind(vma, entries, num_entries,
-				    unbind_pt_update.locked ? &deferred : NULL);
-		vma->tile_present &= ~BIT(tile->id);
-	} else {
-		kfree(rfence);
-		kfree(ifence);
-	}
+	list_for_each_entry(op, &vops->list, link)
+		op_commit(vops->vm, tile, pt_update_ops, op, fence);
 
-	if (!vma->tile_present)
-		list_del_init(&vma->combined_links.rebind);
+	if (pt_update_ops->needs_userptr_lock)
+		up_read(&vm->userptr.notifier_lock);
 
-	if (unbind_pt_update.locked) {
-		xe_tile_assert(tile, xe_vma_is_userptr(vma));
+	return fence;
 
-		if (!vma->tile_present) {
-			spin_lock(&vm->userptr.invalidated_lock);
-			list_del_init(&to_userptr_vma(vma)->userptr.invalidate_link);
-			spin_unlock(&vm->userptr.invalidated_lock);
-		}
+put_fence:
+	if (pt_update_ops->needs_userptr_lock)
 		up_read(&vm->userptr.notifier_lock);
-		xe_bo_put_commit(&deferred);
+	dma_fence_put(fence);
+kill_vm_tile0:
+	if (!tile->id)
+		xe_vm_kill(vops->vm, false);
+	kfree(rfence);
+free_ifence:
+	kfree(ifence);
+kill_vm_tile1:
+	if (tile->id)
+		xe_vm_kill(vops->vm, false);
+
+	return ERR_PTR(err);
+}
+
+/**
+ * xe_pt_update_ops_free() - Free PT update operations
+ * @pt_op: Array of PT update operations
+ * @num_ops: Number of PT update operations
+ *
+ * Free PT update operations
+ */
+void xe_pt_update_ops_free(struct xe_vm_pgtable_update_op *pt_op, u32 num_ops)
+{
+	u32 i;
+
+	for (i = 0; i < num_ops; ++i, ++pt_op)
+		xe_pt_free_bind(pt_op->entries, pt_op->num_entries);
+}
+
+/**
+ * xe_pt_update_ops_fini() - Finish PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Finish PT update operations by committing the deferred destruction of
+ * page-table memory.
+ */
+void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	xe_bo_put_commit(tile_to_xe(tile), &pt_update_ops->deferred);
+	if (!pt_update_ops->skip_free)
+		xe_pt_update_ops_free(pt_update_ops->ops,
+				      pt_update_ops->num_ops);
+	else
+		pt_update_ops->ops = NULL;
+}
+
+/**
+ * xe_pt_update_ops_abort() - Abort PT update operations
+ * @tile: Tile of PT update operations
+ * @vops: VMA operations
+ *
+ * Abort PT update operations by unwinding internal PT state.
+ */
+void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops)
+{
+	struct xe_vm_pgtable_update_ops *pt_update_ops =
+		&vops->pt_update_ops[tile->id];
+	int i;
+
+	lockdep_assert_held(&vops->vm->lock);
+	xe_vm_assert_held(vops->vm);
+
+	for (i = pt_update_ops->num_ops - 1; i >= 0; --i) {
+		struct xe_vm_pgtable_update_op *pt_op =
+			&pt_update_ops->ops[i];
+
+		if (!pt_op->vma || i >= pt_update_ops->current_op)
+			continue;
+
+		if (pt_op->bind)
+			xe_pt_abort_bind(pt_op->vma, pt_op->entries,
+					 pt_op->num_entries,
+					 pt_op->rebind);
+		else
+			xe_pt_abort_unbind(pt_op->vma, pt_op->entries,
+					   pt_op->num_entries);
 	}
 
-	return fence;
+	xe_pt_update_ops_fini(tile, vops);
 }
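
Taken together, the four new entry points are expected to be driven per tile by the VM bind code (not part of this excerpt) roughly as follows; a sketch of the intended lifecycle, with details simplified:

	err = xe_pt_update_ops_prepare(tile, vops);
	if (err)
		goto abort;

	fence = xe_pt_update_ops_run(tile, vops);
	if (IS_ERR(fence)) {
		err = PTR_ERR(fence);
		goto abort;
	}

	xe_pt_update_ops_fini(tile, vops);	/* commit deferred PT frees */
	return 0;

abort:
	xe_pt_update_ops_abort(tile, vops);	/* unwind prepared PT state */
	return err;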
diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
index 71a4fbfcff43..989c9b190fa0 100644
--- a/drivers/gpu/drm/xe/xe_pt.h
+++ b/drivers/gpu/drm/xe/xe_pt.h
@@ -17,6 +17,7 @@ struct xe_sync_entry;
 struct xe_tile;
 struct xe_vm;
 struct xe_vma;
+struct xe_vma_ops;
 
 /* Largest huge pte is currently 1GiB. May become device dependent. */
 #define MAX_HUGEPTE_LEVEL 2
@@ -34,14 +35,12 @@ void xe_pt_populate_empty(struct xe_tile *tile, struct xe_vm *vm,
 
 void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred);
 
-struct dma_fence *
-__xe_pt_bind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool rebind);
-
-struct dma_fence *
-__xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queue *q,
-		   struct xe_sync_entry *syncs, u32 num_syncs);
+int xe_pt_update_ops_prepare(struct xe_tile *tile, struct xe_vma_ops *vops);
+struct dma_fence *xe_pt_update_ops_run(struct xe_tile *tile,
+				       struct xe_vma_ops *vops);
+void xe_pt_update_ops_fini(struct xe_tile *tile, struct xe_vma_ops *vops);
+void xe_pt_update_ops_abort(struct xe_tile *tile, struct xe_vma_ops *vops);
+void xe_pt_update_ops_free(struct xe_vm_pgtable_update_op *pt_op, u32 num_ops);
 
 bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
 
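
The single-VMA __xe_pt_bind_vma()/__xe_pt_unbind_vma() entry points are replaced
above by a prepare/run/fini/abort lifecycle driven per tile. A minimal sketch of
the intended calling sequence, mirroring xe_vm_ops_execute() later in this
series (locking, multi-tile fence aggregation and deferred-free details are
trimmed; the helper name below is made up for illustration):

/* Sketch only: drives the new PT update API for one tile. */
static struct dma_fence *pt_update_one_tile(struct xe_tile *tile,
					    struct xe_vma_ops *vops)
{
	struct dma_fence *fence;
	int err;

	/* Stage page-table entries and allocate what the update needs */
	err = xe_pt_update_ops_prepare(tile, vops);
	if (err) {
		xe_pt_update_ops_abort(tile, vops);	/* unwind staged state */
		return ERR_PTR(err);
	}

	/* Commit the update via a PT job; returns a fence for completion */
	fence = xe_pt_update_ops_run(tile, vops);
	if (IS_ERR(fence)) {
		xe_pt_update_ops_abort(tile, vops);
		return fence;
	}

	/* Release / defer-free page-table memory made unused by the update */
	xe_pt_update_ops_fini(tile, vops);

	return fence;
}
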
diff --git a/drivers/gpu/drm/xe/xe_pt_exec_queue.c b/drivers/gpu/drm/xe/xe_pt_exec_queue.c
new file mode 100644
index 000000000000..2a6ae6267594
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_pt_exec_queue.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <drm/gpu_scheduler.h>
+
+#include "xe_bo.h"
+#include "xe_device.h"
+#include "xe_exec_queue.h"
+#include "xe_migrate.h"
+#include "xe_pt.h"
+#include "xe_pt_exec_queue.h"
+#include "xe_sched_job.h"
+#include "xe_trace.h"
+
+/**
+ * struct xe_pt_exec_queue - PT specific state for an xe_exec_queue
+ */
+struct xe_pt_exec_queue {
+	/** @q: Backpointer to parent xe_exec_queue */
+	struct xe_exec_queue *q;
+	/** @sched: GPU scheduler for this xe_exec_queue */
+	struct drm_gpu_scheduler sched;
+	/** @entity: Scheduler entity for this xe_exec_queue */
+	struct drm_sched_entity entity;
+	/** @fini_async: do final fini async from this worker */
+	struct work_struct fini_async;
+};
+
+static bool is_pt_job(struct xe_sched_job *job)
+{
+	return test_bit(JOB_FLAG_PT, &job->fence->flags);
+}
+
+static void cleanup_pt_job(struct xe_device *xe, struct xe_sched_job *job)
+{
+	xe_pt_update_ops_free(job->pt_update[0].pt_op,
+			      job->pt_update[0].num_ops);
+	xe_bo_put_commit(xe, &job->pt_update[0].deferred);
+	kfree(job->pt_update[0].pt_op);
+}
+
+static void run_pt_job(struct xe_device *xe, struct xe_sched_job *job)
+{
+	__xe_migrate_update_pgtables_cpu(job->pt_update[0].vm,
+					 job->pt_update[0].tile,
+					 job->pt_update[0].ops,
+					 job->pt_update[0].pt_op,
+					 job->pt_update[0].num_ops);
+	cleanup_pt_job(xe, job);
+}
+
+static struct dma_fence *
+pt_exec_queue_run_job(struct drm_sched_job *drm_job)
+{
+	struct xe_sched_job *job = to_xe_sched_job(drm_job);
+	struct xe_exec_queue *q = job->q;
+	struct xe_device *xe = q->xe;
+
+	xe_assert(xe, is_pt_job(job));
+	xe_assert(xe, q->flags & EXEC_QUEUE_FLAG_PT);
+
+	trace_xe_sched_job_run(job);
+	run_pt_job(xe, job);
+
+	return NULL;
+}
+
+static void pt_exec_queue_free_job(struct drm_sched_job *drm_job)
+{
+	struct xe_sched_job *job = to_xe_sched_job(drm_job);
+
+	trace_xe_sched_job_free(job);
+	xe_sched_job_put(job);
+}
+
+static const struct drm_sched_backend_ops drm_sched_ops = {
+	.run_job = pt_exec_queue_run_job,
+	.free_job = pt_exec_queue_free_job,
+};
+
+static void pt_exec_queue_kill(struct xe_exec_queue *q)
+{
+}
+
+static void __pt_exec_queue_fini_async(struct work_struct *w)
+{
+	struct xe_pt_exec_queue *pe =
+		container_of(w, struct xe_pt_exec_queue, fini_async);
+	struct xe_exec_queue *q = pe->q;
+
+	trace_xe_exec_queue_destroy(q);
+
+	drm_sched_entity_fini(&pe->entity);
+	drm_sched_fini(&pe->sched);
+
+	kfree(pe);
+
+	xe_device_mem_access_put(q->xe);
+	xe_exec_queue_fini(q);
+}
+
+static void pt_exec_queue_fini(struct xe_exec_queue *q)
+{
+	INIT_WORK(&q->pt->fini_async, __pt_exec_queue_fini_async);
+	queue_work(system_wq, &q->pt->fini_async);
+}
+
+static bool pt_exec_queue_reset_status(struct xe_exec_queue *q)
+{
+	return false;
+}
+
+static const struct xe_exec_queue_ops pt_exec_queue_ops = {
+	.kill = pt_exec_queue_kill,
+	.fini = pt_exec_queue_fini,
+	.reset_status = pt_exec_queue_reset_status,
+};
+
+struct xe_exec_queue *xe_pt_exec_queue_create(struct xe_device *xe)
+{
+	struct drm_gpu_scheduler *sched;
+	struct xe_exec_queue *q;
+	struct xe_pt_exec_queue *pe;
+	int err;
+
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&q->refcount);
+	q->flags = EXEC_QUEUE_FLAG_PT;
+	q->ops = &pt_exec_queue_ops;
+
+	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
+	if (!pe) {
+		err = -ENOMEM;
+		goto err_free;
+	}
+
+	err = drm_sched_init(&pe->sched, &drm_sched_ops, system_wq, 1, 64, 64,
+			     MAX_SCHEDULE_TIMEOUT, system_wq, NULL,
+			     q->name, xe->drm.dev);
+	if (err)
+		goto err_free;
+
+	sched = &pe->sched;
+	err = drm_sched_entity_init(&pe->entity, 0, &sched, 1, NULL);
+	if (err)
+		goto err_sched;
+
+	q->xe = xe;
+	q->pt = pe;
+	pe->q = q;
+	q->entity = &pe->entity;
+
+	xe_exec_queue_assign_name(q, 0);
+	trace_xe_exec_queue_create(q);
+
+	/*
+	 * Normally the user vm holds an rpm ref to keep the device
+	 * awake, and the context holds a ref for the vm. However, for
+	 * some engines we use the kernel's migrate vm underneath, which offers
+	 * no such rpm ref, or we lack a vm. Make sure we keep a ref here, so we
+	 * can perform GuC CT actions when needed. Caller is expected to have
+	 * already grabbed the rpm ref outside any sensitive locks.
+	 */
+	drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));
+
+	return q;
+
+err_sched:
+	drm_sched_fini(&pe->sched);
+err_free:
+	kfree(pe);
+	kfree(q);
+
+	return ERR_PTR(err);
+}
diff --git a/drivers/gpu/drm/xe/xe_pt_exec_queue.h b/drivers/gpu/drm/xe/xe_pt_exec_queue.h
new file mode 100644
index 000000000000..a4d16b845418
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_pt_exec_queue.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_PT_EXEC_QUEUE_H_
+#define _XE_PT_EXEC_QUEUE_H_
+
+struct xe_device;
+struct xe_exec_queue;
+
+struct xe_exec_queue *xe_pt_exec_queue_create(struct xe_device *xe);
+
+#endif
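
For context, the PT exec queue declared here replaces the old per-tile VM bind
queues. A minimal usage sketch (hypothetical helper name, mirroring the
xe_vm_create() change later in this series):

static int demo_attach_pt_queue(struct xe_device *xe, struct xe_vm *vm)
{
	struct xe_exec_queue *q;

	q = xe_pt_exec_queue_create(xe);
	if (IS_ERR(q))
		return PTR_ERR(q);

	/* A single queue per VM now serves all bind/unbind PT jobs */
	vm->q = q;

	return 0;
}
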
diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
index cee70cb0f014..cfd0d35408a5 100644
--- a/drivers/gpu/drm/xe/xe_pt_types.h
+++ b/drivers/gpu/drm/xe/xe_pt_types.h
@@ -70,8 +70,61 @@ struct xe_vm_pgtable_update {
 	/** @pt_entries: Newly added pagetable entries */
 	struct xe_pt_entry *pt_entries;
 
+	/** @level: level of update */
+	unsigned int level;
+
 	/** @flags: Target flags */
 	u32 flags;
 };
 
+/** struct xe_vm_pgtable_update_op - Page table update operation */
+struct xe_vm_pgtable_update_op {
+	/** @entries: entries to update for this operation */
+	struct xe_vm_pgtable_update entries[XE_VM_MAX_LEVEL * 2 + 1];
+	/** @vma: VMA for operation, operation not valid if NULL */
+	struct xe_vma *vma;
+	/** @num_entries: number of entries for this update operation */
+	u32 num_entries;
+	/** @bind: is a bind */
+	bool bind;
+	/** @rebind: is a rebind */
+	bool rebind;
+};
+
+/** struct xe_vm_pgtable_update_ops - Page table update operations */
+struct xe_vm_pgtable_update_ops {
+	/** @ops: operations */
+	struct xe_vm_pgtable_update_op *ops;
+	/** @deferred: deferred list to destroy PT entries */
+	struct llist_head deferred;
+	/** @q: exec queue for PT operations */
+	struct xe_exec_queue *q;
+	/** @start: start address of ops */
+	u64 start;
+	/** @last: last address of ops */
+	u64 last;
+	/** @num_ops: number of operations */
+	u32 num_ops;
+	/** @current_op: current operation */
+	u32 current_op;
+	/** @needs_userptr_lock: Needs userptr lock */
+	bool needs_userptr_lock;
+	/** @needs_invalidation: Needs invalidation */
+	bool needs_invalidation;
+	/**
+	 * @wait_vm_bookkeep: PT operations need to wait until VM is idle
+	 * (bookkeep dma-resv slots are idle) and stage all future VM activity
+	 * behind these operations (install PT operations into VM kernel
+	 * dma-resv slot).
+	 */
+	bool wait_vm_bookkeep;
+	/**
+	 * @wait_vm_kernel: PT operations need to wait until VM kernel dma-resv
+	 * slots are idle.
+	 */
+	bool wait_vm_kernel;
+	/** @skip_free: Free @ops in submission backend rather than in IOCTL */
+	bool skip_free;
+};
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
index 8151ddafb940..fc24e675f922 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -23,19 +23,22 @@ static struct kmem_cache *xe_sched_job_parallel_slab;
 
 int __init xe_sched_job_module_init(void)
 {
+	struct xe_sched_job *job;
+	size_t size;
+
+	size = struct_size(job, batch_addr, 1);
 	xe_sched_job_slab =
-		kmem_cache_create("xe_sched_job",
-				  sizeof(struct xe_sched_job) +
-				  sizeof(u64), 0,
+		kmem_cache_create("xe_sched_job", size, 0,
 				  SLAB_HWCACHE_ALIGN, NULL);
 	if (!xe_sched_job_slab)
 		return -ENOMEM;
 
+	size = max_t(size_t,
+		     struct_size(job, batch_addr,
+				 XE_HW_ENGINE_MAX_INSTANCE),
+		     struct_size(job, pt_update, 1));
 	xe_sched_job_parallel_slab =
-		kmem_cache_create("xe_sched_job_parallel",
-				  sizeof(struct xe_sched_job) +
-				  sizeof(u64) *
-				  XE_HW_ENGINE_MAX_INSTANCE, 0,
+		kmem_cache_create("xe_sched_job_parallel", size, 0,
 				  SLAB_HWCACHE_ALIGN, NULL);
 	if (!xe_sched_job_parallel_slab) {
 		kmem_cache_destroy(xe_sched_job_slab);
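
The open-coded sizeof arithmetic is replaced with struct_size(), which computes
the header plus the flexible-array tail with overflow checking. A self-contained
sketch of the same idiom (demo types only, not driver code):

#include <linux/overflow.h>
#include <linux/types.h>

struct demo_job {
	unsigned long flags;
	u64 batch_addr[];		/* flexible array member */
};

/* sizeof(struct demo_job) + width * sizeof(u64), saturating on overflow */
static size_t demo_job_size(unsigned int width)
{
	struct demo_job *job = NULL;	/* only used for sizeof, never dereferenced */

	return struct_size(job, batch_addr, width);
}
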
@@ -62,18 +65,21 @@ bool xe_sched_job_is_migration(struct xe_exec_queue *q)
 	return q->vm && (q->vm->flags & XE_VM_FLAG_MIGRATION);
 }
 
-static void job_free(struct xe_sched_job *job)
+static bool parallel_slab(struct xe_exec_queue *q)
 {
-	struct xe_exec_queue *q = job->q;
-	bool is_migration = xe_sched_job_is_migration(q);
+	return !q->width || xe_exec_queue_is_parallel(q) ||
+		xe_sched_job_is_migration(q);
+}
 
-	kmem_cache_free(xe_exec_queue_is_parallel(job->q) || is_migration ?
-			xe_sched_job_parallel_slab : xe_sched_job_slab, job);
+static void job_free(struct xe_sched_job *job)
+{
+	kmem_cache_free(parallel_slab(job->q) ? xe_sched_job_parallel_slab :
+			xe_sched_job_slab, job);
 }
 
 static struct xe_device *job_to_xe(struct xe_sched_job *job)
 {
-	return gt_to_xe(job->q->gt);
+	return job->q->xe;
 }
 
 struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
@@ -86,17 +92,19 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 	int i, j;
 	u32 width;
 
-	/* only a kernel context can submit a vm-less job */
-	XE_WARN_ON(!q->vm && !(q->flags & EXEC_QUEUE_FLAG_KERNEL));
+	/* only a kernel and pt exec queue can submit a vm-less job */
+	XE_WARN_ON(!q->vm && !(q->flags & EXEC_QUEUE_FLAG_KERNEL) &&
+		   !(q->flags & EXEC_QUEUE_FLAG_PT));
 
-	/* Migration and kernel engines have their own locking */
-	if (!(q->flags & (EXEC_QUEUE_FLAG_KERNEL | EXEC_QUEUE_FLAG_VM))) {
+	/* Kernel and pt exec queues have their own locking */
+	if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL) &&
+	    !(q->flags & EXEC_QUEUE_FLAG_PT)) {
 		lockdep_assert_held(&q->vm->lock);
 		if (!xe_vm_in_lr_mode(q->vm))
 			xe_vm_assert_held(q->vm);
 	}
 
-	job = job_alloc(xe_exec_queue_is_parallel(q) || is_migration);
+	job = job_alloc(parallel_slab(q));
 	if (!job)
 		return ERR_PTR(-ENOMEM);
 
@@ -108,7 +116,15 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 	if (err)
 		goto err_free;
 
-	if (!xe_exec_queue_is_parallel(q)) {
+	if (!batch_addr) {
+		xe_assert(q->xe, q->flags & EXEC_QUEUE_FLAG_PT);
+
+		job->fence = dma_fence_allocate_private_stub(ktime_get());
+		if (!job->fence) {
+			err = -ENOMEM;
+			goto err_sched_job;
+		}
+	} else if (!xe_exec_queue_is_parallel(q)) {
 		job->fence = xe_lrc_create_seqno_fence(q->lrc);
 		if (IS_ERR(job->fence)) {
 			err = PTR_ERR(job->fence);
@@ -148,12 +164,14 @@ struct xe_sched_job *xe_sched_job_create(struct xe_exec_queue *q,
 		job->fence = &cf->base;
 	}
 
-	width = q->width;
-	if (is_migration)
-		width = 2;
+	if (batch_addr) {
+		width = q->width;
+		if (is_migration)
+			width = 2;
 
-	for (i = 0; i < width; ++i)
-		job->batch_addr[i] = batch_addr[i];
+		for (i = 0; i < width; ++i)
+			job->batch_addr[i] = batch_addr[i];
+	}
 
 	/* All other jobs require a VM to be open which has a ref */
 	if (unlikely(q->flags & EXEC_QUEUE_FLAG_KERNEL))
@@ -282,7 +300,7 @@ struct xe_sched_job_snapshot *
 xe_sched_job_snapshot_capture(struct xe_sched_job *job)
 {
 	struct xe_exec_queue *q = job->q;
-	struct xe_device *xe = q->gt->tile->xe;
+	struct xe_device *xe = job_to_xe(job);
 	struct xe_sched_job_snapshot *snapshot;
 	size_t len = sizeof(*snapshot) + (sizeof(u64) * q->width);
 	u16 i;
diff --git a/drivers/gpu/drm/xe/xe_sched_job_types.h b/drivers/gpu/drm/xe/xe_sched_job_types.h
index b1d83da50a53..29ca43d1eb65 100644
--- a/drivers/gpu/drm/xe/xe_sched_job_types.h
+++ b/drivers/gpu/drm/xe/xe_sched_job_types.h
@@ -11,6 +11,28 @@
 #include <drm/gpu_scheduler.h>
 
 struct xe_exec_queue;
+struct xe_migrate_pt_update_ops;
+struct xe_tile;
+struct xe_vm;
+struct xe_vm_pgtable_update_op;
+
+/**
+ * struct pt_update_args - PT update arguments
+ */
+struct pt_update_args {
+	/** @vm: VM */
+	struct xe_vm *vm;
+	/** @tile: Tile */
+	struct xe_tile *tile;
+	/** @ops: Migrate PT update ops */
+	const struct xe_migrate_pt_update_ops *ops;
+	/** @pt_op: PT update ops */
+	struct xe_vm_pgtable_update_op *pt_op;
+	/** @deferred: deferred list to destroy PT entries */
+	struct llist_head deferred;
+	/** @num_ops: number of PT update ops */
+	int num_ops;
+};
 
 /**
  * struct xe_sched_job - XE schedule job (batch buffer tracking)
@@ -27,6 +49,7 @@ struct xe_sched_job {
 	 * can safely reference fence, fence cannot safely reference job.
 	 */
 #define JOB_FLAG_SUBMIT		DMA_FENCE_FLAG_USER_BITS
+#define JOB_FLAG_PT		(DMA_FENCE_FLAG_USER_BITS << 1)
 	struct dma_fence *fence;
 	/** @user_fence: write back value when BB is complete */
 	struct {
@@ -39,8 +62,12 @@ struct xe_sched_job {
 	} user_fence;
 	/** @migrate_flush_flags: Additional flush flags for migration jobs */
 	u32 migrate_flush_flags;
-	/** @batch_addr: batch buffer address of job */
-	u64 batch_addr[];
+	union {
+		/** @batch_addr: batch buffer address of job */
+		DECLARE_FLEX_ARRAY(u64, batch_addr);
+		/** @pt_update: PT update arguments */
+		DECLARE_FLEX_ARRAY(struct pt_update_args, pt_update);
+	};
 };
 
 struct xe_sched_job_snapshot {
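
batch_addr and pt_update now overlay the same flexible tail of the job. A
flexible array member cannot sit directly inside a union, which is why
DECLARE_FLEX_ARRAY() is used; a self-contained sketch of the pattern (demo
types only):

#include <linux/stddef.h>
#include <linux/types.h>

struct demo_args {
	int value;
};

struct demo_job_tail {
	u32 common;			/* at least one fixed member is required */
	union {
		/* one view of the flexible tail */
		DECLARE_FLEX_ARRAY(u64, words);
		/* another view of the same storage */
		DECLARE_FLEX_ARRAY(struct demo_args, args);
	};
};
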
diff --git a/drivers/gpu/drm/xe/xe_sync.c b/drivers/gpu/drm/xe/xe_sync.c
index 02c9577fe418..07aa65d9bcab 100644
--- a/drivers/gpu/drm/xe/xe_sync.c
+++ b/drivers/gpu/drm/xe/xe_sync.c
@@ -343,6 +343,21 @@ xe_sync_in_fence_get(struct xe_sync_entry *sync, int num_sync,
 	return ERR_PTR(-ENOMEM);
 }
 
+/**
+ * __xe_sync_ufence_get() - Get a reference to a user fence
+ * @ufence: input user fence
+ *
+ * Take an additional reference on an existing user fence
+ *
+ * Return: xe_user_fence pointer with reference
+ */
+struct xe_user_fence *__xe_sync_ufence_get(struct xe_user_fence *ufence)
+{
+	user_fence_get(ufence);
+
+	return ufence;
+}
+
 /**
  * xe_sync_ufence_get() - Get user fence from sync
  * @sync: input sync
diff --git a/drivers/gpu/drm/xe/xe_sync.h b/drivers/gpu/drm/xe/xe_sync.h
index 0fd0d51208e6..26e9ec9de1a8 100644
--- a/drivers/gpu/drm/xe/xe_sync.h
+++ b/drivers/gpu/drm/xe/xe_sync.h
@@ -38,6 +38,7 @@ static inline bool xe_sync_is_ufence(struct xe_sync_entry *sync)
 	return !!sync->ufence;
 }
 
+struct xe_user_fence *__xe_sync_ufence_get(struct xe_user_fence *ufence);
 struct xe_user_fence *xe_sync_ufence_get(struct xe_sync_entry *sync);
 void xe_sync_ufence_put(struct xe_user_fence *ufence);
 int xe_sync_ufence_get_status(struct xe_user_fence *ufence);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 4ddc55527f9a..c4704c5f3c72 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -147,8 +147,9 @@ DECLARE_EVENT_CLASS(xe_exec_queue,
 			   __entry->logical_mask = q->logical_mask;
 			   __entry->gt_id = q->gt->info.id;
 			   __entry->width = q->width;
-			   __entry->guc_id = q->guc->id;
-			   __entry->guc_state = atomic_read(&q->guc->state);
+			   __entry->guc_id = q->guc ? q->guc->id : 0;
+			   __entry->guc_state = q->guc ?
+			   atomic_read(&q->guc->state) : 0;
 			   __entry->flags = q->flags;
 			   ),
 
@@ -264,9 +265,9 @@ DECLARE_EVENT_CLASS(xe_sched_job,
 
 		    TP_fast_assign(
 			   __entry->seqno = xe_sched_job_seqno(job);
-			   __entry->guc_id = job->q->guc->id;
-			   __entry->guc_state =
-			   atomic_read(&job->q->guc->state);
+			   __entry->guc_id = job->q->guc ? job->q->guc->id : 0;
+			   __entry->guc_state = job->q->guc ?
+			   atomic_read(&job->q->guc->state) : 0;
 			   __entry->flags = job->q->flags;
 			   __entry->error = job->fence->error;
 			   __entry->fence = (unsigned long)job->fence;
@@ -423,11 +424,6 @@ DEFINE_EVENT(xe_vma, xe_vma_acc,
 	     TP_ARGS(vma)
 );
 
-DEFINE_EVENT(xe_vma, xe_vma_fail,
-	     TP_PROTO(struct xe_vma *vma),
-	     TP_ARGS(vma)
-);
-
 DEFINE_EVENT(xe_vma, xe_vma_bind,
 	     TP_PROTO(struct xe_vma *vma),
 	     TP_ARGS(vma)
@@ -541,6 +537,11 @@ DEFINE_EVENT(xe_vm, xe_vm_rebind_worker_exit,
 	     TP_ARGS(vm)
 );
 
+DEFINE_EVENT(xe_vm, xe_vm_ops_fail,
+	     TP_PROTO(struct xe_vm *vm),
+	     TP_ARGS(vm)
+);
+
 /* GuC */
 DECLARE_EVENT_CLASS(xe_guc_ct_flow_control,
 		    TP_PROTO(u32 _head, u32 _tail, u32 size, u32 space, u32 len),
diff --git a/drivers/gpu/drm/xe/xe_uc_fw.c b/drivers/gpu/drm/xe/xe_uc_fw.c
index a9d25b3fa67c..d6f788a42979 100644
--- a/drivers/gpu/drm/xe/xe_uc_fw.c
+++ b/drivers/gpu/drm/xe/xe_uc_fw.c
@@ -105,6 +105,7 @@ struct fw_blobs_by_type {
 #define XE_GUC_FIRMWARE_DEFS(fw_def, mmp_ver, major_ver)			\
 	fw_def(LUNARLAKE,	major_ver(xe,	guc,	lnl,	70, 19, 2))	\
 	fw_def(METEORLAKE,	major_ver(i915,	guc,	mtl,	70, 19, 2))	\
+	fw_def(PVC,		major_ver(i915,	guc,	pvc,	70, 19, 2))	\
 	fw_def(DG2,		major_ver(i915,	guc,	dg2,	70, 19, 2))	\
 	fw_def(DG1,		major_ver(i915,	guc,	dg1,	70, 19, 2))	\
 	fw_def(ALDERLAKE_N,	major_ver(i915,	guc,	tgl,	70, 19, 2))	\
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 643b3701a738..8ba037e7ce5c 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -34,6 +34,7 @@
 #include "xe_pm.h"
 #include "xe_preempt_fence.h"
 #include "xe_pt.h"
+#include "xe_pt_exec_queue.h"
 #include "xe_res_cursor.h"
 #include "xe_sync.h"
 #include "xe_trace.h"
@@ -413,19 +414,23 @@ int __xe_vm_userptr_needs_repin(struct xe_vm *vm)
 
 #define XE_VM_REBIND_RETRY_TIMEOUT_MS 1000
 
-static void xe_vm_kill(struct xe_vm *vm)
+void xe_vm_kill(struct xe_vm *vm, bool unlocked)
 {
 	struct xe_exec_queue *q;
 
 	lockdep_assert_held(&vm->lock);
 
-	xe_vm_lock(vm, false);
+	if (unlocked)
+		xe_vm_lock(vm, false);
+
 	vm->flags |= XE_VM_FLAG_BANNED;
 	trace_xe_vm_kill(vm);
 
 	list_for_each_entry(q, &vm->preempt.exec_queues, compute.link)
 		q->ops->kill(q);
-	xe_vm_unlock(vm);
+
+	if (unlocked)
+		xe_vm_unlock(vm);
 
 	/* TODO: Inform user the VM is banned */
 }
@@ -515,14 +520,19 @@ static int xe_preempt_work_begin(struct drm_exec *exec, struct xe_vm *vm,
 	if (err)
 		return err;
 
-	return drm_gpuvm_validate(&vm->gpuvm, exec);
+	err = drm_gpuvm_validate(&vm->gpuvm, exec);
+	if (err)
+		return err;
+
+	err = xe_vm_rebind(vm, true);
+
+	return err;
 }
 
 static void preempt_rebind_work_func(struct work_struct *w)
 {
 	struct xe_vm *vm = container_of(w, struct xe_vm, preempt.rebind_work);
 	struct drm_exec exec;
-	struct dma_fence *rebind_fence;
 	unsigned int fence_count = 0;
 	LIST_HEAD(preempt_fences);
 	ktime_t end = 0;
@@ -568,18 +578,7 @@ static void preempt_rebind_work_func(struct work_struct *w)
 	if (err)
 		goto out_unlock;
 
-	rebind_fence = xe_vm_rebind(vm, true);
-	if (IS_ERR(rebind_fence)) {
-		err = PTR_ERR(rebind_fence);
-		goto out_unlock;
-	}
-
-	if (rebind_fence) {
-		dma_fence_wait(rebind_fence, false);
-		dma_fence_put(rebind_fence);
-	}
-
-	/* Wait on munmap style VM unbinds */
+	/* Wait on rebinds */
 	wait = dma_resv_wait_timeout(xe_vm_resv(vm),
 				     DMA_RESV_USAGE_KERNEL,
 				     false, MAX_SCHEDULE_TIMEOUT);
@@ -621,7 +620,7 @@ static void preempt_rebind_work_func(struct work_struct *w)
 
 	if (err) {
 		drm_warn(&vm->xe->drm, "VM worker error: %d\n", err);
-		xe_vm_kill(vm);
+		xe_vm_kill(vm, true);
 	}
 	up_write(&vm->lock);
 
@@ -751,19 +750,103 @@ int xe_vm_userptr_check_repin(struct xe_vm *vm)
 		list_empty_careful(&vm->userptr.invalidated)) ? 0 : -EAGAIN;
 }
 
-static struct dma_fence *
-xe_vm_bind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-	       struct xe_sync_entry *syncs, u32 num_syncs,
-	       bool first_op, bool last_op);
+static void xe_vma_ops_init(struct xe_vma_ops *vops, struct xe_vm *vm,
+			    struct xe_exec_queue *q,
+			    struct xe_sync_entry *syncs, u32 num_syncs)
+{
+	memset(vops, 0, sizeof(*vops));
+	INIT_LIST_HEAD(&vops->list);
+	vops->vm = vm;
+	vops->q = q;
+	vops->syncs = syncs;
+	vops->num_syncs = num_syncs;
+}
+
+static int xe_vma_ops_alloc(struct xe_vma_ops *vops)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i) {
+		if (!vops->pt_update_ops[i].num_ops)
+			continue;
+
+		vops->pt_update_ops[i].ops =
+			kmalloc_array(vops->pt_update_ops[i].num_ops,
+				      sizeof(*vops->pt_update_ops[i].ops),
+				      GFP_KERNEL);
+		if (!vops->pt_update_ops[i].ops)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void xe_vma_ops_free(struct xe_vma_ops *vops)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
+		kfree(vops->pt_update_ops[i].ops);
+}
+
+/**
+ * xe_vm_populate_dummy_rebind() - Populate dummy rebind VMA ops
+ * @vm: The VM.
+ * @vma: VMA to populate dummy VMA ops for
+ * @tile_mask: tile mask for VMA ops
+ *
+ * Populate dummy VMA ops which can be used to issue a rebind for the VMA
+ *
+ * Return: 0 on success, -ENOMEM on failure
+ */
+int xe_vm_populate_dummy_rebind(struct xe_vm *vm, struct xe_vma *vma,
+				u8 tile_mask)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i) {
+		if (BIT(i) & tile_mask) {
+			struct xe_vm_pgtable_update_op *pt_op =
+				vm->dummy_ops.vops.pt_update_ops[i].ops;
+
+			memset(&vm->dummy_ops.vops.pt_update_ops[i], 0,
+			       sizeof(vm->dummy_ops.vops.pt_update_ops[i]));
+			vm->dummy_ops.vops.pt_update_ops[i].ops = pt_op;
+			vm->dummy_ops.vops.pt_update_ops[i].num_ops = 1;
+
+			/*
+			 * Wait for VM to be idle / schedule execs + resume
+			 * behind rebinds
+			 */
+			vm->dummy_ops.vops.pt_update_ops[i].wait_vm_bookkeep =
+				true;
+		} else {
+			vm->dummy_ops.vops.pt_update_ops[i].num_ops = 0;
+		}
+	}
+	vm->dummy_ops.op.base.op = DRM_GPUVA_OP_MAP;
+	vm->dummy_ops.op.base.map.va.addr = vma->gpuva.va.addr;
+	vm->dummy_ops.op.base.map.va.range = vma->gpuva.va.range;
+	vm->dummy_ops.op.base.map.gem.obj = vma->gpuva.gem.obj;
+	vm->dummy_ops.op.base.map.gem.offset = vma->gpuva.gem.offset;
+	vm->dummy_ops.op.tile_mask = vma->tile_mask;
+	vm->dummy_ops.op.map.vma = vma;
+	vm->dummy_ops.op.map.immediate = true;
+	vm->dummy_ops.op.map.dumpable = vma->gpuva.flags & XE_VMA_DUMPABLE;
+	vm->dummy_ops.op.map.is_null = xe_vma_is_null(vma);
+
+	return xe_vma_ops_alloc(&vm->dummy_ops.vops);
+}
 
-struct dma_fence *xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
+int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 {
 	struct dma_fence *fence = NULL;
 	struct xe_vma *vma, *next;
+	int err;
 
 	lockdep_assert_held(&vm->lock);
 	if (xe_vm_in_lr_mode(vm) && !rebind_worker)
-		return NULL;
+		return 0;
 
 	xe_vm_assert_held(vm);
 	list_for_each_entry_safe(vma, next, &vm->rebind_list,
@@ -776,12 +859,19 @@ struct dma_fence *xe_vm_rebind(struct xe_vm *vm, bool rebind_worker)
 			trace_xe_vma_rebind_worker(vma);
 		else
 			trace_xe_vma_rebind_exec(vma);
-		fence = xe_vm_bind_vma(vma, NULL, NULL, 0, false, false);
+
+		err = xe_vm_populate_dummy_rebind(vm, vma, vma->tile_present);
+		if (err)
+			return err;
+
+		fence = xe_vm_ops_execute(vm, &vm->dummy_ops.vops);
+		xe_vma_ops_free(&vm->dummy_ops.vops);
 		if (IS_ERR(fence))
-			return fence;
+			return PTR_ERR(fence);
 	}
 
-	return fence;
+	dma_fence_put(fence);
+	return 0;
 }
 
 static void xe_vma_free(struct xe_vma *vma)
@@ -1285,6 +1375,15 @@ static void xe_vm_free_scratch(struct xe_vm *vm)
 	}
 }
 
+static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask)
+{
+	int i;
+
+	for (i = 0; i < XE_MAX_TILES_PER_DEVICE; ++i)
+		if (BIT(i) & tile_mask)
+			++vops->pt_update_ops[i].num_ops;
+}
+
 struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 {
 	struct drm_gem_object *vm_resv_obj;
@@ -1306,6 +1405,12 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	init_rwsem(&vm->lock);
 	mutex_init(&vm->snap_mutex);
 
+	xe_vma_ops_init(&vm->dummy_ops.vops, vm, NULL, NULL, 0);
+	INIT_LIST_HEAD(&vm->dummy_ops.op.link);
+	list_add(&vm->dummy_ops.op.link, &vm->dummy_ops.vops.list);
+	for (id = 0; id < XE_MAX_TILES_PER_DEVICE; ++id)
+		vm->dummy_ops.vops.pt_update_ops[id].num_ops = 1;
+
 	INIT_LIST_HEAD(&vm->rebind_list);
 
 	INIT_LIST_HEAD(&vm->userptr.repin_list);
@@ -1381,32 +1486,20 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 			continue;
 
 		xe_pt_populate_empty(tile, vm, vm->pt_root[id]);
+		number_tiles++;
 	}
 	dma_resv_unlock(xe_vm_resv(vm));
 
 	/* Kernel migration VM shouldn't have a circular loop.. */
 	if (!(flags & XE_VM_FLAG_MIGRATION)) {
-		for_each_tile(tile, xe, id) {
-			struct xe_gt *gt = tile->primary_gt;
-			struct xe_vm *migrate_vm;
-			struct xe_exec_queue *q;
-			u32 create_flags = EXEC_QUEUE_FLAG_VM;
+		struct xe_exec_queue *q;
 
-			if (!vm->pt_root[id])
-				continue;
-
-			migrate_vm = xe_migrate_get_vm(tile->migrate);
-			q = xe_exec_queue_create_class(xe, gt, migrate_vm,
-						       XE_ENGINE_CLASS_COPY,
-						       create_flags);
-			xe_vm_put(migrate_vm);
-			if (IS_ERR(q)) {
-				err = PTR_ERR(q);
-				goto err_close;
-			}
-			vm->q[id] = q;
-			number_tiles++;
+		q = xe_pt_exec_queue_create(xe);
+		if (IS_ERR(q)) {
+			err = PTR_ERR(q);
+			goto err_close;
 		}
+		vm->q = q;
 	}
 
 	if (number_tiles > 1)
@@ -1430,12 +1523,12 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	return ERR_PTR(err);
 
 err_no_resv:
-	mutex_destroy(&vm->snap_mutex);
+	if (!(flags & XE_VM_FLAG_MIGRATION))
+		xe_device_mem_access_put(xe);
 	for_each_tile(tile, xe, id)
 		xe_range_fence_tree_fini(&vm->rftree[id]);
+	mutex_destroy(&vm->snap_mutex);
 	kfree(vm);
-	if (!(flags & XE_VM_FLAG_MIGRATION))
-		xe_device_mem_access_put(xe);
 	return ERR_PTR(err);
 }
 
@@ -1461,19 +1554,13 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	if (xe_vm_in_preempt_fence_mode(vm))
 		flush_work(&vm->preempt.rebind_work);
 
-	down_write(&vm->lock);
-	for_each_tile(tile, xe, id) {
-		if (vm->q[id])
-			xe_exec_queue_last_fence_put(vm->q[id], vm);
-	}
-	up_write(&vm->lock);
+	if (vm->q) {
+		down_write(&vm->lock);
+		xe_exec_queue_last_fence_put(vm->q, vm);
+		up_write(&vm->lock);
 
-	for_each_tile(tile, xe, id) {
-		if (vm->q[id]) {
-			xe_exec_queue_kill(vm->q[id]);
-			xe_exec_queue_put(vm->q[id]);
-			vm->q[id] = NULL;
-		}
+		xe_exec_queue_kill(vm->q);
+		xe_exec_queue_put(vm->q);
 	}
 
 	down_write(&vm->lock);
@@ -1572,7 +1659,6 @@ static void vm_destroy_work_func(struct work_struct *w)
 		XE_WARN_ON(vm->pt_root[id]);
 
 	trace_xe_vm_free(vm);
-	dma_fence_put(vm->rebind_fence);
 	kfree(vm);
 }
 
@@ -1606,168 +1692,7 @@ u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
 static struct xe_exec_queue *
 to_wait_exec_queue(struct xe_vm *vm, struct xe_exec_queue *q)
 {
-	return q ? q : vm->q[0];
-}
-
-static struct dma_fence *
-xe_vm_unbind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-		 struct xe_sync_entry *syncs, u32 num_syncs,
-		 bool first_op, bool last_op)
-{
-	struct xe_vm *vm = xe_vma_vm(vma);
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-	struct xe_tile *tile;
-	struct dma_fence *fence = NULL;
-	struct dma_fence **fences = NULL;
-	struct dma_fence_array *cf = NULL;
-	int cur_fence = 0, i;
-	int number_tiles = hweight8(vma->tile_present);
-	int err;
-	u8 id;
-
-	trace_xe_vma_unbind(vma);
-
-	if (vma->ufence) {
-		struct xe_user_fence * const f = vma->ufence;
-
-		if (!xe_sync_ufence_get_status(f))
-			return ERR_PTR(-EBUSY);
-
-		vma->ufence = NULL;
-		xe_sync_ufence_put(f);
-	}
-
-	if (number_tiles > 1) {
-		fences = kmalloc_array(number_tiles, sizeof(*fences),
-				       GFP_KERNEL);
-		if (!fences)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	for_each_tile(tile, vm->xe, id) {
-		if (!(vma->tile_present & BIT(id)))
-			goto next;
-
-		fence = __xe_pt_unbind_vma(tile, vma, q ? q : vm->q[id],
-					   first_op ? syncs : NULL,
-					   first_op ? num_syncs : 0);
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			goto err_fences;
-		}
-
-		if (fences)
-			fences[cur_fence++] = fence;
-
-next:
-		if (q && vm->pt_root[id] && !list_empty(&q->multi_gt_list))
-			q = list_next_entry(q, multi_gt_list);
-	}
-
-	if (fences) {
-		cf = dma_fence_array_create(number_tiles, fences,
-					    vm->composite_fence_ctx,
-					    vm->composite_fence_seqno++,
-					    false);
-		if (!cf) {
-			--vm->composite_fence_seqno;
-			err = -ENOMEM;
-			goto err_fences;
-		}
-	}
-
-	fence = cf ? &cf->base : !fence ?
-		xe_exec_queue_last_fence_get(wait_exec_queue, vm) : fence;
-	if (last_op) {
-		for (i = 0; i < num_syncs; i++)
-			xe_sync_entry_signal(&syncs[i], NULL, fence);
-	}
-
-	return fence;
-
-err_fences:
-	if (fences) {
-		while (cur_fence)
-			dma_fence_put(fences[--cur_fence]);
-		kfree(fences);
-	}
-
-	return ERR_PTR(err);
-}
-
-static struct dma_fence *
-xe_vm_bind_vma(struct xe_vma *vma, struct xe_exec_queue *q,
-	       struct xe_sync_entry *syncs, u32 num_syncs,
-	       bool first_op, bool last_op)
-{
-	struct xe_tile *tile;
-	struct dma_fence *fence;
-	struct dma_fence **fences = NULL;
-	struct dma_fence_array *cf = NULL;
-	struct xe_vm *vm = xe_vma_vm(vma);
-	int cur_fence = 0, i;
-	int number_tiles = hweight8(vma->tile_mask);
-	int err;
-	u8 id;
-
-	trace_xe_vma_bind(vma);
-
-	if (number_tiles > 1) {
-		fences = kmalloc_array(number_tiles, sizeof(*fences),
-				       GFP_KERNEL);
-		if (!fences)
-			return ERR_PTR(-ENOMEM);
-	}
-
-	for_each_tile(tile, vm->xe, id) {
-		if (!(vma->tile_mask & BIT(id)))
-			goto next;
-
-		fence = __xe_pt_bind_vma(tile, vma, q ? q : vm->q[id],
-					 first_op ? syncs : NULL,
-					 first_op ? num_syncs : 0,
-					 vma->tile_present & BIT(id));
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			goto err_fences;
-		}
-
-		if (fences)
-			fences[cur_fence++] = fence;
-
-next:
-		if (q && vm->pt_root[id] && !list_empty(&q->multi_gt_list))
-			q = list_next_entry(q, multi_gt_list);
-	}
-
-	if (fences) {
-		cf = dma_fence_array_create(number_tiles, fences,
-					    vm->composite_fence_ctx,
-					    vm->composite_fence_seqno++,
-					    false);
-		if (!cf) {
-			--vm->composite_fence_seqno;
-			err = -ENOMEM;
-			goto err_fences;
-		}
-	}
-
-	if (last_op) {
-		for (i = 0; i < num_syncs; i++)
-			xe_sync_entry_signal(&syncs[i], NULL,
-					     cf ? &cf->base : fence);
-	}
-
-	return cf ? &cf->base : fence;
-
-err_fences:
-	if (fences) {
-		while (cur_fence)
-			dma_fence_put(fences[--cur_fence]);
-		kfree(fences);
-	}
-
-	return ERR_PTR(err);
+	return q ? q : vm->q;
 }
 
 static struct xe_user_fence *
@@ -1785,89 +1710,6 @@ find_ufence_get(struct xe_sync_entry *syncs, u32 num_syncs)
 	return NULL;
 }
 
-static int __xe_vm_bind(struct xe_vm *vm, struct xe_vma *vma,
-			struct xe_exec_queue *q, struct xe_sync_entry *syncs,
-			u32 num_syncs, bool immediate, bool first_op,
-			bool last_op)
-{
-	struct dma_fence *fence;
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-	struct xe_user_fence *ufence;
-
-	xe_vm_assert_held(vm);
-
-	ufence = find_ufence_get(syncs, num_syncs);
-	if (vma->ufence && ufence)
-		xe_sync_ufence_put(vma->ufence);
-
-	vma->ufence = ufence ?: vma->ufence;
-
-	if (immediate) {
-		fence = xe_vm_bind_vma(vma, q, syncs, num_syncs, first_op,
-				       last_op);
-		if (IS_ERR(fence))
-			return PTR_ERR(fence);
-	} else {
-		int i;
-
-		xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
-
-		fence = xe_exec_queue_last_fence_get(wait_exec_queue, vm);
-		if (last_op) {
-			for (i = 0; i < num_syncs; i++)
-				xe_sync_entry_signal(&syncs[i], NULL, fence);
-		}
-	}
-
-	if (last_op)
-		xe_exec_queue_last_fence_set(wait_exec_queue, vm, fence);
-	dma_fence_put(fence);
-
-	return 0;
-}
-
-static int xe_vm_bind(struct xe_vm *vm, struct xe_vma *vma, struct xe_exec_queue *q,
-		      struct xe_bo *bo, struct xe_sync_entry *syncs,
-		      u32 num_syncs, bool immediate, bool first_op,
-		      bool last_op)
-{
-	int err;
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(bo);
-
-	if (bo && immediate) {
-		err = xe_bo_validate(bo, vm, true);
-		if (err)
-			return err;
-	}
-
-	return __xe_vm_bind(vm, vma, q, syncs, num_syncs, immediate, first_op,
-			    last_op);
-}
-
-static int xe_vm_unbind(struct xe_vm *vm, struct xe_vma *vma,
-			struct xe_exec_queue *q, struct xe_sync_entry *syncs,
-			u32 num_syncs, bool first_op, bool last_op)
-{
-	struct dma_fence *fence;
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(xe_vma_bo(vma));
-
-	fence = xe_vm_unbind_vma(vma, q, syncs, num_syncs, first_op, last_op);
-	if (IS_ERR(fence))
-		return PTR_ERR(fence);
-
-	xe_vma_destroy(vma, fence);
-	if (last_op)
-		xe_exec_queue_last_fence_set(wait_exec_queue, vm, fence);
-	dma_fence_put(fence);
-
-	return 0;
-}
-
 #define ALL_DRM_XE_VM_CREATE_FLAGS (DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE | \
 				    DRM_XE_VM_CREATE_FLAG_LR_MODE | \
 				    DRM_XE_VM_CREATE_FLAG_FAULT_MODE)
@@ -2008,43 +1850,6 @@ static const u32 region_to_mem_type[] = {
 	XE_PL_VRAM1,
 };
 
-static int xe_vm_prefetch(struct xe_vm *vm, struct xe_vma *vma,
-			  struct xe_exec_queue *q, u32 region,
-			  struct xe_sync_entry *syncs, u32 num_syncs,
-			  bool first_op, bool last_op)
-{
-	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, q);
-	int err;
-
-	xe_assert(vm->xe, region <= ARRAY_SIZE(region_to_mem_type));
-
-	if (!xe_vma_has_no_bo(vma)) {
-		err = xe_bo_migrate(xe_vma_bo(vma), region_to_mem_type[region]);
-		if (err)
-			return err;
-	}
-
-	if (vma->tile_mask != (vma->tile_present & ~vma->usm.tile_invalidated)) {
-		return xe_vm_bind(vm, vma, q, xe_vma_bo(vma), syncs, num_syncs,
-				  true, first_op, last_op);
-	} else {
-		int i;
-
-		/* Nothing to do, signal fences now */
-		if (last_op) {
-			for (i = 0; i < num_syncs; i++) {
-				struct dma_fence *fence =
-					xe_exec_queue_last_fence_get(wait_exec_queue, vm);
-
-				xe_sync_entry_signal(&syncs[i], NULL, fence);
-				dma_fence_put(fence);
-			}
-		}
-
-		return 0;
-	}
-}
-
 static void prep_vma_destroy(struct xe_vm *vm, struct xe_vma *vma,
 			     bool post_commit)
 {
@@ -2168,6 +1973,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
 
 		if (__op->op == DRM_GPUVA_OP_MAP) {
+			op->map.immediate = !xe_vm_in_fault_mode(vm);
 			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
 			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
 			op->map.pat_index = pat_index;
@@ -2329,35 +2135,30 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-
 static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 				   struct drm_gpuva_ops *ops,
 				   struct xe_sync_entry *syncs, u32 num_syncs,
-				   struct list_head *ops_list, bool last)
+				   struct xe_vma_ops *vops, bool last)
 {
 	struct xe_device *xe = vm->xe;
-	struct xe_vma_op *last_op = NULL;
 	struct drm_gpuva_op *__op;
+	struct xe_tile *tile;
+	u8 id, tile_mask = 0;
 	int err = 0;
 
 	lockdep_assert_held_write(&vm->lock);
 
+	for_each_tile(tile, vm->xe, id)
+		tile_mask |= 0x1 << id;
+
 	drm_gpuva_for_each_op(__op, ops) {
 		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
 		struct xe_vma *vma;
-		bool first = list_empty(ops_list);
 		unsigned int flags = 0;
 
 		INIT_LIST_HEAD(&op->link);
-		list_add_tail(&op->link, ops_list);
-
-		if (first) {
-			op->flags |= XE_VMA_OP_FIRST;
-			op->num_syncs = num_syncs;
-			op->syncs = syncs;
-		}
-
-		op->q = q;
+		list_add_tail(&op->link, &vops->list);
+		op->tile_mask = tile_mask;
 
 		switch (op->base.op) {
 		case DRM_GPUVA_OP_MAP:
@@ -2373,6 +2174,9 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 				return PTR_ERR(vma);
 
 			op->map.vma = vma;
+			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+				xe_vma_ops_incr_pt_update_ops(vops,
+							      op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_REMAP:
@@ -2417,6 +2221,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 					vm_dbg(&xe->drm, "REMAP:SKIP_PREV: addr=0x%016llx, range=0x%016llx",
 					       (ULL)op->remap.start,
 					       (ULL)op->remap.range);
+				} else {
+					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
 
@@ -2453,228 +2259,30 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
 					vm_dbg(&xe->drm, "REMAP:SKIP_NEXT: addr=0x%016llx, range=0x%016llx",
 					       (ULL)op->remap.start,
 					       (ULL)op->remap.range);
+				} else {
+					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
 		case DRM_GPUVA_OP_PREFETCH:
-			/* Nothing to do */
+			/* FIXME: Need to skip some prefetch ops */
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 		}
 
-		last_op = op;
-
 		err = xe_vma_op_commit(vm, op);
 		if (err)
 			return err;
 	}
 
-	/* FIXME: Unhandled corner case */
-	XE_WARN_ON(!last_op && last && !list_empty(ops_list));
-
-	if (!last_op)
-		return 0;
-
-	last_op->ops = ops;
-	if (last) {
-		last_op->flags |= XE_VMA_OP_LAST;
-		last_op->num_syncs = num_syncs;
-		last_op->syncs = syncs;
-	}
-
 	return 0;
 }
 
-static int op_execute(struct drm_exec *exec, struct xe_vm *vm,
-		      struct xe_vma *vma, struct xe_vma_op *op)
-{
-	int err;
-
-	lockdep_assert_held_write(&vm->lock);
-
-	err = xe_vm_prepare_vma(exec, vma, 1);
-	if (err)
-		return err;
-
-	xe_vm_assert_held(vm);
-	xe_bo_assert_held(xe_vma_bo(vma));
-
-	switch (op->base.op) {
-	case DRM_GPUVA_OP_MAP:
-		err = xe_vm_bind(vm, vma, op->q, xe_vma_bo(vma),
-				 op->syncs, op->num_syncs,
-				 !xe_vm_in_fault_mode(vm),
-				 op->flags & XE_VMA_OP_FIRST,
-				 op->flags & XE_VMA_OP_LAST);
-		break;
-	case DRM_GPUVA_OP_REMAP:
-	{
-		bool prev = !!op->remap.prev;
-		bool next = !!op->remap.next;
-
-		if (!op->remap.unmap_done) {
-			if (prev || next)
-				vma->gpuva.flags |= XE_VMA_FIRST_REBIND;
-			err = xe_vm_unbind(vm, vma, op->q, op->syncs,
-					   op->num_syncs,
-					   op->flags & XE_VMA_OP_FIRST,
-					   op->flags & XE_VMA_OP_LAST &&
-					   !prev && !next);
-			if (err)
-				break;
-			op->remap.unmap_done = true;
-		}
-
-		if (prev) {
-			op->remap.prev->gpuva.flags |= XE_VMA_LAST_REBIND;
-			err = xe_vm_bind(vm, op->remap.prev, op->q,
-					 xe_vma_bo(op->remap.prev), op->syncs,
-					 op->num_syncs, true, false,
-					 op->flags & XE_VMA_OP_LAST && !next);
-			op->remap.prev->gpuva.flags &= ~XE_VMA_LAST_REBIND;
-			if (err)
-				break;
-			op->remap.prev = NULL;
-		}
-
-		if (next) {
-			op->remap.next->gpuva.flags |= XE_VMA_LAST_REBIND;
-			err = xe_vm_bind(vm, op->remap.next, op->q,
-					 xe_vma_bo(op->remap.next),
-					 op->syncs, op->num_syncs,
-					 true, false,
-					 op->flags & XE_VMA_OP_LAST);
-			op->remap.next->gpuva.flags &= ~XE_VMA_LAST_REBIND;
-			if (err)
-				break;
-			op->remap.next = NULL;
-		}
-
-		break;
-	}
-	case DRM_GPUVA_OP_UNMAP:
-		err = xe_vm_unbind(vm, vma, op->q, op->syncs,
-				   op->num_syncs, op->flags & XE_VMA_OP_FIRST,
-				   op->flags & XE_VMA_OP_LAST);
-		break;
-	case DRM_GPUVA_OP_PREFETCH:
-		err = xe_vm_prefetch(vm, vma, op->q, op->prefetch.region,
-				     op->syncs, op->num_syncs,
-				     op->flags & XE_VMA_OP_FIRST,
-				     op->flags & XE_VMA_OP_LAST);
-		break;
-	default:
-		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
-	}
-
-	if (err)
-		trace_xe_vma_fail(vma);
-
-	return err;
-}
-
-static int __xe_vma_op_execute(struct xe_vm *vm, struct xe_vma *vma,
-			       struct xe_vma_op *op)
-{
-	struct drm_exec exec;
-	int err;
-
-retry_userptr:
-	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
-	drm_exec_until_all_locked(&exec) {
-		err = op_execute(&exec, vm, vma, op);
-		drm_exec_retry_on_contention(&exec);
-		if (err)
-			break;
-	}
-	drm_exec_fini(&exec);
-
-	if (err == -EAGAIN) {
-		lockdep_assert_held_write(&vm->lock);
-
-		if (op->base.op == DRM_GPUVA_OP_REMAP) {
-			if (!op->remap.unmap_done)
-				vma = gpuva_to_vma(op->base.remap.unmap->va);
-			else if (op->remap.prev)
-				vma = op->remap.prev;
-			else
-				vma = op->remap.next;
-		}
-
-		if (xe_vma_is_userptr(vma)) {
-			err = xe_vma_userptr_pin_pages(to_userptr_vma(vma));
-			if (!err)
-				goto retry_userptr;
-
-			trace_xe_vma_fail(vma);
-		}
-	}
-
-	return err;
-}
-
-static int xe_vma_op_execute(struct xe_vm *vm, struct xe_vma_op *op)
-{
-	int ret = 0;
-
-	lockdep_assert_held_write(&vm->lock);
-
-	switch (op->base.op) {
-	case DRM_GPUVA_OP_MAP:
-		ret = __xe_vma_op_execute(vm, op->map.vma, op);
-		break;
-	case DRM_GPUVA_OP_REMAP:
-	{
-		struct xe_vma *vma;
-
-		if (!op->remap.unmap_done)
-			vma = gpuva_to_vma(op->base.remap.unmap->va);
-		else if (op->remap.prev)
-			vma = op->remap.prev;
-		else
-			vma = op->remap.next;
-
-		ret = __xe_vma_op_execute(vm, vma, op);
-		break;
-	}
-	case DRM_GPUVA_OP_UNMAP:
-		ret = __xe_vma_op_execute(vm, gpuva_to_vma(op->base.unmap.va),
-					  op);
-		break;
-	case DRM_GPUVA_OP_PREFETCH:
-		ret = __xe_vma_op_execute(vm,
-					  gpuva_to_vma(op->base.prefetch.va),
-					  op);
-		break;
-	default:
-		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
-	}
-
-	return ret;
-}
-
-static void xe_vma_op_cleanup(struct xe_vm *vm, struct xe_vma_op *op)
-{
-	bool last = op->flags & XE_VMA_OP_LAST;
-
-	if (last) {
-		while (op->num_syncs--)
-			xe_sync_entry_cleanup(&op->syncs[op->num_syncs]);
-		kfree(op->syncs);
-		if (op->q)
-			xe_exec_queue_put(op->q);
-	}
-	if (!list_empty(&op->link))
-		list_del(&op->link);
-	if (op->ops)
-		drm_gpuva_ops_free(&vm->gpuvm, op->ops);
-	if (last)
-		xe_vm_put(vm);
-}
-
 static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
 			     bool post_commit, bool prev_post_commit,
 			     bool next_post_commit)
@@ -2751,38 +2359,354 @@ static void vm_bind_ioctl_ops_unwind(struct xe_vm *vm,
 					 op->flags & XE_VMA_OP_PREV_COMMITTED,
 					 op->flags & XE_VMA_OP_NEXT_COMMITTED);
 		}
+	}
+}
+
+static int vma_lock(struct drm_exec *exec, struct xe_vma *vma, bool validate)
+{
+	struct xe_bo *bo = xe_vma_bo(vma);
+	int err = 0;
+
+	if (bo) {
+		if (!bo->vm)
+			err = drm_exec_prepare_obj(exec, &bo->ttm.base, 1);
+		if (!err && validate)
+			err = xe_bo_validate(bo, xe_vma_vm(vma), true);
+	}
+
+	return err;
+}
+
+static int check_ufence(struct xe_vma *vma)
+{
+	if (vma->ufence) {
+		struct xe_user_fence * const f = vma->ufence;
+
+		if (!xe_sync_ufence_get_status(f))
+			return -EBUSY;
+
+		vma->ufence = NULL;
+		xe_sync_ufence_put(f);
+	}
+
+	return 0;
+}
+
+static int op_lock(struct drm_exec *exec, struct xe_vm *vm,
+		   struct xe_vma_op *op)
+{
+	int err = 0;
+
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		err = vma_lock(exec, op->map.vma, !xe_vm_in_fault_mode(vm));
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		err = check_ufence(gpuva_to_vma(op->base.remap.unmap->va));
+		if (err)
+			break;
+
+		err = vma_lock(exec, gpuva_to_vma(op->base.remap.unmap->va),
+			       false);
+		if (!err && op->remap.prev)
+			err = vma_lock(exec, op->remap.prev, true);
+		if (!err && op->remap.next)
+			err = vma_lock(exec, op->remap.next, true);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		err = check_ufence(gpuva_to_vma(op->base.unmap.va));
+		if (err)
+			break;
+
+		err = vma_lock(exec, gpuva_to_vma(op->base.unmap.va), false);
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+		u32 region = op->prefetch.region;
+
+		xe_assert(vm->xe, region <= ARRAY_SIZE(region_to_mem_type));
+
+		err = vma_lock(exec, vma, false);
+		if (!err && !xe_vma_has_no_bo(vma))
+			err = xe_bo_migrate(xe_vma_bo(vma), region);
+		break;
+	}
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+
+	return err;
+}
+
+static int vm_bind_ioctl_ops_lock(struct drm_exec *exec,
+				  struct xe_vm *vm,
+				  struct xe_vma_ops *vops)
+{
+	struct xe_vma_op *op;
+	int err;
+
+	err = drm_exec_prepare_obj(exec, xe_vm_obj(vm), 1);
+	if (err)
+		return err;
+
+	list_for_each_entry(op, &vops->list, link) {
+		err = op_lock(exec, vm, op);
+		if (err)
+			return err;
+	}
+
+#ifdef TEST_VM_OPS_ERROR
+	if (vops->inject_error &&
+	    vm->xe->vm_inject_error_position == FORCE_OP_ERROR_LOCK)
+		return -ENOSPC;
+#endif
+
+	return 0;
+}
+
+static void op_trace(struct xe_vma_op *op)
+{
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		trace_xe_vma_bind(op->map.vma);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		trace_xe_vma_unbind(gpuva_to_vma(op->base.remap.unmap->va));
+		if (op->remap.prev)
+			trace_xe_vma_bind(op->remap.prev);
+		if (op->remap.next)
+			trace_xe_vma_bind(op->remap.next);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		trace_xe_vma_unbind(gpuva_to_vma(op->base.unmap.va));
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		trace_xe_vma_bind(gpuva_to_vma(op->base.prefetch.va));
+		break;
+	default:
+		XE_WARN_ON("NOT POSSIBLE");
+	}
+}
+
+static void trace_xe_vm_ops_execute(struct xe_vma_ops *vops)
+{
+	struct xe_vma_op *op;
+
+	list_for_each_entry(op, &vops->list, link)
+		op_trace(op);
+}
+
+static int vm_ops_setup_tile_args(struct xe_vm *vm, struct xe_vma_ops *vops)
+{
+	struct xe_tile *tile;
+	int number_tiles = 0;
+	u8 id;
+
+	for_each_tile(tile, vm->xe, id) {
+		if (vops->pt_update_ops[id].num_ops)
+			++number_tiles;
+
+		if (vops->pt_update_ops[id].q)
+			continue;
+
+		vops->pt_update_ops[id].q = vops->q ?: vm->q;
+	}
+
+	return number_tiles;
+}
+
+/**
+ * xe_vm_ops_execute() - Execute VMA ops
+ * @vm: The VM.
+ * @vops: VMA ops to execute
+ *
+ * Execute VMA ops binding / unbinding VMAs
+ *
+ * Return: A fence for VMA ops on success, ERR_PTR on failure
+ */
+struct dma_fence *xe_vm_ops_execute(struct xe_vm *vm, struct xe_vma_ops *vops)
+{
+	struct xe_tile *tile;
+	struct dma_fence *fence = NULL;
+	struct dma_fence **fences = NULL;
+	struct dma_fence_array *cf = NULL;
+	int number_tiles = 0, current_fence = 0, err;
+	u8 id;
+
+	number_tiles = vm_ops_setup_tile_args(vm, vops);
+	if (number_tiles == 0)
+		return ERR_PTR(-ENODATA);
+
+	if (number_tiles > 1) {
+		fences = kmalloc_array(number_tiles, sizeof(*fences),
+				       GFP_KERNEL);
+		if (!fences) {
+			fence = ERR_PTR(-ENOMEM);
+			goto err_trace;
+		}
+	}
 
-		drm_gpuva_ops_free(&vm->gpuvm, __ops);
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		err = xe_pt_update_ops_prepare(tile, vops);
+		if (err) {
+			fence = ERR_PTR(err);
+			goto err_out;
+		}
+	}
+
+	trace_xe_vm_ops_execute(vops);
+
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		fence = xe_pt_update_ops_run(tile, vops);
+		if (IS_ERR(fence))
+			goto err_out;
+
+		if (fences)
+			fences[current_fence++] = fence;
+	}
+
+	if (fences) {
+		cf = dma_fence_array_create(number_tiles, fences,
+					    vm->composite_fence_ctx,
+					    vm->composite_fence_seqno++,
+					    false);
+		if (!cf) {
+			--vm->composite_fence_seqno;
+			fence = ERR_PTR(-ENOMEM);
+			goto err_out;
+		}
+		fence = &cf->base;
 	}
+
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		xe_pt_update_ops_fini(tile, vops);
+	}
+
+	return fence;
+
+err_out:
+	for_each_tile(tile, vm->xe, id) {
+		if (!vops->pt_update_ops[id].num_ops)
+			continue;
+
+		xe_pt_update_ops_abort(tile, vops);
+	}
+	while (current_fence)
+		dma_fence_put(fences[--current_fence]);
+	kfree(fences);
+	kfree(cf);
+
+err_trace:
+	trace_xe_vm_ops_fail(vm);
+	return fence;
+}
+
+static void vma_add_ufence(struct xe_vma *vma, struct xe_user_fence *ufence)
+{
+	if (vma->ufence)
+		xe_sync_ufence_put(vma->ufence);
+	vma->ufence = __xe_sync_ufence_get(ufence);
+}
+
+static void op_add_ufence(struct xe_vm *vm, struct xe_vma_op *op,
+			  struct xe_user_fence *ufence)
+{
+	switch (op->base.op) {
+	case DRM_GPUVA_OP_MAP:
+		vma_add_ufence(op->map.vma, ufence);
+		break;
+	case DRM_GPUVA_OP_REMAP:
+		if (op->remap.prev)
+			vma_add_ufence(op->remap.prev, ufence);
+		if (op->remap.next)
+			vma_add_ufence(op->remap.next, ufence);
+		break;
+	case DRM_GPUVA_OP_UNMAP:
+		break;
+	case DRM_GPUVA_OP_PREFETCH:
+		vma_add_ufence(gpuva_to_vma(op->base.prefetch.va), ufence);
+		break;
+	default:
+		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
+	}
+}
+
+static void vm_bind_ioctl_ops_install_fences(struct xe_vm *vm,
+					     struct xe_vma_ops *vops,
+					     struct dma_fence *fence)
+{
+	struct xe_exec_queue *wait_exec_queue = to_wait_exec_queue(vm, vops->q);
+	struct xe_user_fence *ufence;
+	struct xe_vma_op *op;
+	int i;
+
+	ufence = find_ufence_get(vops->syncs, vops->num_syncs);
+	list_for_each_entry(op, &vops->list, link) {
+		if (ufence)
+			op_add_ufence(vm, op, ufence);
+
+		if (op->base.op == DRM_GPUVA_OP_UNMAP)
+			xe_vma_destroy(gpuva_to_vma(op->base.unmap.va), fence);
+		else if (op->base.op == DRM_GPUVA_OP_REMAP)
+			xe_vma_destroy(gpuva_to_vma(op->base.remap.unmap->va),
+				       fence);
+	}
+	if (ufence)
+		xe_sync_ufence_put(ufence);
+	for (i = 0; i < vops->num_syncs; i++)
+		xe_sync_entry_signal(vops->syncs + i, NULL, fence);
+	xe_exec_queue_last_fence_set(wait_exec_queue, vm, fence);
+	dma_fence_put(fence);
 }
 
 static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
-				     struct list_head *ops_list)
+				     struct xe_vma_ops *vops)
 {
-	struct xe_vma_op *op, *next;
+	struct drm_exec exec;
+	struct dma_fence *fence;
 	int err;
 
 	lockdep_assert_held_write(&vm->lock);
 
-	list_for_each_entry_safe(op, next, ops_list, link) {
-		err = xe_vma_op_execute(vm, op);
-		if (err) {
-			drm_warn(&vm->xe->drm, "VM op(%d) failed with %d",
-				 op->base.op, err);
-			/*
-			 * FIXME: Killing VM rather than proper error handling
-			 */
-			xe_vm_kill(vm);
-			return -ENOSPC;
+	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT |
+		      DRM_EXEC_IGNORE_DUPLICATES, 0);
+	drm_exec_until_all_locked(&exec) {
+		err = vm_bind_ioctl_ops_lock(&exec, vm, vops);
+		drm_exec_retry_on_contention(&exec);
+		if (err)
+			goto unlock;
+
+		fence = xe_vm_ops_execute(vm, vops);
+		if (IS_ERR(fence)) {
+			err = PTR_ERR(fence);
+			goto unlock;
 		}
-		xe_vma_op_cleanup(vm, op);
+
+		vm_bind_ioctl_ops_install_fences(vm, vops, fence);
 	}
 
-	return 0;
+unlock:
+	drm_exec_fini(&exec);
+	return err;
 }
 
+#ifdef TEST_VM_OPS_ERROR
+#define SUPPORTED_FLAGS	(FORCE_OP_ERROR | DRM_XE_VM_BIND_FLAG_NULL | \
+	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+#else
 #define SUPPORTED_FLAGS	(DRM_XE_VM_BIND_FLAG_NULL | \
 	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+#endif
 #define XE_64K_PAGE_MASK 0xffffull
 #define ALL_DRM_XE_SYNCS_FLAGS (DRM_XE_SYNCS_FLAG_WAIT_FOR_OP)
 
@@ -2936,7 +2860,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	u32 num_syncs, num_ufence = 0;
 	struct xe_sync_entry *syncs = NULL;
 	struct drm_xe_vm_bind_op *bind_ops;
-	LIST_HEAD(ops_list);
+	struct xe_vma_ops vops;
 	int err;
 	int i;
 
@@ -2951,7 +2875,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto free_objs;
 		}
 
-		if (XE_IOCTL_DBG(xe, !(q->flags & EXEC_QUEUE_FLAG_VM))) {
+		if (XE_IOCTL_DBG(xe, !(q->flags & EXEC_QUEUE_FLAG_PT))) {
 			err = -EINVAL;
 			goto put_exec_queue;
 		}
@@ -3087,6 +3011,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto free_syncs;
 	}
 
+	xe_vma_ops_init(&vops, vm, q, syncs, num_syncs);
 	for (i = 0; i < args->num_binds; ++i) {
 		u64 range = bind_ops[i].range;
 		u64 addr = bind_ops[i].addr;
@@ -3106,42 +3031,39 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		}
 
 		err = vm_bind_ioctl_ops_parse(vm, q, ops[i], syncs, num_syncs,
-					      &ops_list,
-					      i == args->num_binds - 1);
+					      &vops, i == args->num_binds - 1);
 		if (err)
 			goto unwind_ops;
+
+#ifdef TEST_VM_OPS_ERROR
+		if (flags & FORCE_OP_ERROR) {
+			vops.inject_error = true;
+			vm->xe->vm_inject_error_position =
+				(vm->xe->vm_inject_error_position + 1) %
+				FORCE_OP_ERROR_COUNT;
+		}
+#endif
 	}
 
 	/* Nothing to do */
-	if (list_empty(&ops_list)) {
+	if (list_empty(&vops.list)) {
 		err = -ENODATA;
 		goto unwind_ops;
 	}
 
-	xe_vm_get(vm);
-	if (q)
-		xe_exec_queue_get(q);
-
-	err = vm_bind_ioctl_ops_execute(vm, &ops_list);
-
-	up_write(&vm->lock);
-
-	if (q)
-		xe_exec_queue_put(q);
-	xe_vm_put(vm);
-
-	for (i = 0; bos && i < args->num_binds; ++i)
-		xe_bo_put(bos[i]);
-
-	kvfree(bos);
-	kvfree(ops);
-	if (args->num_binds > 1)
-		kvfree(bind_ops);
+	err = xe_vma_ops_alloc(&vops);
+	if (err)
+		goto unwind_ops;
 
-	return err;
+	err = vm_bind_ioctl_ops_execute(vm, &vops);
 
 unwind_ops:
-	vm_bind_ioctl_ops_unwind(vm, ops, args->num_binds);
+	if (err && err != -ENODATA)
+		vm_bind_ioctl_ops_unwind(vm, ops, args->num_binds);
+	xe_vma_ops_free(&vops);
+	for (i = args->num_binds - 1; i >= 0; --i)
+		if (ops[i])
+			drm_gpuva_ops_free(&vm->gpuvm, ops[i]);
 free_syncs:
 	if (err == -ENODATA)
 		err = vm_bind_ioctl_signal_fences(vm, q, syncs, num_syncs);
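
Condensed view of the rebuilt bind path using the helpers added above (all of
them are static to xe_vm.c; the wrapper below is hypothetical and drops
argument parsing, sync setup and the unwind path):

static int demo_bind_flow(struct xe_vm *vm, struct xe_exec_queue *q,
			  struct drm_gpuva_ops *ops,
			  struct xe_sync_entry *syncs, u32 num_syncs)
{
	struct xe_vma_ops vops;
	int err;

	/* 1. Init the container for this IOCTL's VMA operations */
	xe_vma_ops_init(&vops, vm, q, syncs, num_syncs);

	/* 2. Parse GPUVA ops, counting PT update ops per tile */
	err = vm_bind_ioctl_ops_parse(vm, q, ops, syncs, num_syncs, &vops, true);
	if (err)
		return err;

	/* 3. Allocate the per-tile PT update op arrays */
	err = xe_vma_ops_alloc(&vops);
	if (err)
		return err;

	/* 4. Lock with drm_exec, execute (prepare/run/fini), install fences */
	err = vm_bind_ioctl_ops_execute(vm, &vops);

	xe_vma_ops_free(&vops);
	return err;
}
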
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 6df1f1c7f85d..492237b60341 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -207,7 +207,7 @@ int __xe_vm_userptr_needs_repin(struct xe_vm *vm);
 
 int xe_vm_userptr_check_repin(struct xe_vm *vm);
 
-struct dma_fence *xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
+int xe_vm_rebind(struct xe_vm *vm, bool rebind_worker);
 
 int xe_vm_invalidate_vma(struct xe_vma *vma);
 
@@ -262,6 +262,13 @@ static inline struct dma_resv *xe_vm_resv(struct xe_vm *vm)
  */
 #define xe_vm_assert_held(vm) dma_resv_assert_held(xe_vm_resv(vm))
 
+int xe_vm_populate_dummy_rebind(struct xe_vm *vm, struct xe_vma *vma,
+				u8 tile_mask);
+void xe_vma_ops_free(struct xe_vma_ops *vops);
+struct dma_fence *xe_vm_ops_execute(struct xe_vm *vm, struct xe_vma_ops *vops);
+
+void xe_vm_kill(struct xe_vm *vm, bool unlocked);
+
 #if IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM)
 #define vm_dbg drm_dbg
 #else
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 79b5cab57711..d0a08e927db7 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -18,9 +18,21 @@
 #include "xe_range_fence.h"
 
 struct xe_bo;
+struct xe_device;
 struct xe_sync_entry;
 struct xe_user_fence;
 struct xe_vm;
+struct xe_vm_pgtable_update_op;
+
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+#define TEST_VM_OPS_ERROR
+#define FORCE_OP_ERROR	BIT(31)
+
+#define FORCE_OP_ERROR_LOCK	0
+#define FORCE_OP_ERROR_PREPARE	1
+#define FORCE_OP_ERROR_RUN	2
+#define FORCE_OP_ERROR_COUNT	3
+#endif
 
 #define XE_VMA_READ_ONLY	DRM_GPUVA_USERBITS
 #define XE_VMA_DESTROYED	(DRM_GPUVA_USERBITS << 1)
@@ -124,7 +136,96 @@ struct xe_userptr_vma {
 	struct xe_userptr userptr;
 };
 
-struct xe_device;
+/** struct xe_vma_op_map - VMA map operation */
+struct xe_vma_op_map {
+	/** @vma: VMA to map */
+	struct xe_vma *vma;
+	/** @immediate: Immediate bind */
+	bool immediate;
+	/** @is_null: is NULL binding */
+	bool is_null;
+	/** @dumpable: whether BO is dumped on GPU hang */
+	bool dumpable;
+	/** @pat_index: The pat index to use for this operation. */
+	u16 pat_index;
+};
+
+/** struct xe_vma_op_remap - VMA remap operation */
+struct xe_vma_op_remap {
+	/** @prev: VMA preceding part of a split mapping */
+	struct xe_vma *prev;
+	/** @next: VMA subsequent part of a split mapping */
+	struct xe_vma *next;
+	/** @start: start of the VMA unmap */
+	u64 start;
+	/** @range: range of the VMA unmap */
+	u64 range;
+	/** @skip_prev: skip prev rebind */
+	bool skip_prev;
+	/** @skip_next: skip next rebind */
+	bool skip_next;
+	/** @unmap_done: unmap operation is done */
+	bool unmap_done;
+};
+
+/** struct xe_vma_op_prefetch - VMA prefetch operation */
+struct xe_vma_op_prefetch {
+	/** @region: memory region to prefetch to */
+	u32 region;
+};
+
+/** enum xe_vma_op_flags - flags for VMA operation */
+enum xe_vma_op_flags {
+	/** @XE_VMA_OP_COMMITTED: VMA operation committed */
+	XE_VMA_OP_COMMITTED		= BIT(0),
+	/** @XE_VMA_OP_PREV_COMMITTED: Previous VMA operation committed */
+	XE_VMA_OP_PREV_COMMITTED	= BIT(1),
+	/** @XE_VMA_OP_NEXT_COMMITTED: Next VMA operation committed */
+	XE_VMA_OP_NEXT_COMMITTED	= BIT(2),
+};
+
+/** struct xe_vma_op - VMA operation */
+struct xe_vma_op {
+	/** @base: GPUVA base operation */
+	struct drm_gpuva_op base;
+	/** @num_syncs: number of syncs */
+	u32 num_syncs;
+	/** @link: async operation link */
+	struct list_head link;
+	/** @flags: operation flags */
+	enum xe_vma_op_flags flags;
+	/** @tile_mask: Tile mask for operation */
+	u8 tile_mask;
+
+	union {
+		/** @map: VMA map operation specific data */
+		struct xe_vma_op_map map;
+		/** @remap: VMA remap operation specific data */
+		struct xe_vma_op_remap remap;
+		/** @prefetch: VMA prefetch operation specific data */
+		struct xe_vma_op_prefetch prefetch;
+	};
+};
+
+/** struct xe_vma_ops - VMA operations */
+struct xe_vma_ops {
+	/** @list: list of VMA operations */
+	struct list_head list;
+	/** @vm: VM */
+	struct xe_vm *vm;
+	/** @q: exec queue for VMA operations */
+	struct xe_exec_queue *q;
+	/** @syncs: syncs for these operations */
+	struct xe_sync_entry *syncs;
+	/** @num_syncs: number of syncs */
+	u32 num_syncs;
+	/** @pt_update_ops: page table update operations */
+	struct xe_vm_pgtable_update_ops pt_update_ops[XE_MAX_TILES_PER_DEVICE];
+#ifdef TEST_VM_OPS_ERROR
+	/** @inject_error: inject error to test error handling */
+	bool inject_error;
+#endif
+};
 
 struct xe_vm {
 	/** @gpuvm: base GPUVM used to track VMAs */
@@ -133,7 +234,7 @@ struct xe_vm {
 	struct xe_device *xe;
 
 	/* exec queue used for (un)binding vma's */
-	struct xe_exec_queue *q[XE_MAX_TILES_PER_DEVICE];
+	struct xe_exec_queue *q;
 
 	/** @lru_bulk_move: Bulk LRU move list for this VM's BOs */
 	struct ttm_lru_bulk_move lru_bulk_move;
@@ -180,9 +281,6 @@ struct xe_vm {
 	 */
 	struct list_head rebind_list;
 
-	/** @rebind_fence: rebind fence from execbuf */
-	struct dma_fence *rebind_fence;
-
 	/**
 	 * @destroy_work: worker to destroy VM, needed as a dma_fence signaling
 	 * from an irq context can be last put and the destroy needs to be able
@@ -267,92 +365,18 @@ struct xe_vm {
 		bool capture_once;
 	} error_capture;
 
+	/** @dummy_ops: dummy VMA ops to issue rebinds */
+	struct {
+		/** @dummy_ops.vops: dummy VMA ops */
+		struct xe_vma_ops vops;
+		/** @dummy_ops.op: dummy VMA op */
+		struct xe_vma_op op;
+	} dummy_ops;
+
 	/** @batch_invalidate_tlb: Always invalidate TLB before batch start */
 	bool batch_invalidate_tlb;
 	/** @xef: XE file handle for tracking this VM's drm client */
 	struct xe_file *xef;
 };
 
-/** struct xe_vma_op_map - VMA map operation */
-struct xe_vma_op_map {
-	/** @vma: VMA to map */
-	struct xe_vma *vma;
-	/** @is_null: is NULL binding */
-	bool is_null;
-	/** @dumpable: whether BO is dumped on GPU hang */
-	bool dumpable;
-	/** @pat_index: The pat index to use for this operation. */
-	u16 pat_index;
-};
-
-/** struct xe_vma_op_remap - VMA remap operation */
-struct xe_vma_op_remap {
-	/** @prev: VMA preceding part of a split mapping */
-	struct xe_vma *prev;
-	/** @next: VMA subsequent part of a split mapping */
-	struct xe_vma *next;
-	/** @start: start of the VMA unmap */
-	u64 start;
-	/** @range: range of the VMA unmap */
-	u64 range;
-	/** @skip_prev: skip prev rebind */
-	bool skip_prev;
-	/** @skip_next: skip next rebind */
-	bool skip_next;
-	/** @unmap_done: unmap operation in done */
-	bool unmap_done;
-};
-
-/** struct xe_vma_op_prefetch - VMA prefetch operation */
-struct xe_vma_op_prefetch {
-	/** @region: memory region to prefetch to */
-	u32 region;
-};
-
-/** enum xe_vma_op_flags - flags for VMA operation */
-enum xe_vma_op_flags {
-	/** @XE_VMA_OP_FIRST: first VMA operation for a set of syncs */
-	XE_VMA_OP_FIRST			= BIT(0),
-	/** @XE_VMA_OP_LAST: last VMA operation for a set of syncs */
-	XE_VMA_OP_LAST			= BIT(1),
-	/** @XE_VMA_OP_COMMITTED: VMA operation committed */
-	XE_VMA_OP_COMMITTED		= BIT(2),
-	/** @XE_VMA_OP_PREV_COMMITTED: Previous VMA operation committed */
-	XE_VMA_OP_PREV_COMMITTED	= BIT(3),
-	/** @XE_VMA_OP_NEXT_COMMITTED: Next VMA operation committed */
-	XE_VMA_OP_NEXT_COMMITTED	= BIT(4),
-};
-
-/** struct xe_vma_op - VMA operation */
-struct xe_vma_op {
-	/** @base: GPUVA base operation */
-	struct drm_gpuva_op base;
-	/**
-	 * @ops: GPUVA ops, when set call drm_gpuva_ops_free after this
-	 * operations is processed
-	 */
-	struct drm_gpuva_ops *ops;
-	/** @q: exec queue for this operation */
-	struct xe_exec_queue *q;
-	/**
-	 * @syncs: syncs for this operation, only used on first and last
-	 * operation
-	 */
-	struct xe_sync_entry *syncs;
-	/** @num_syncs: number of syncs */
-	u32 num_syncs;
-	/** @link: async operation link */
-	struct list_head link;
-	/** @flags: operation flags */
-	enum xe_vma_op_flags flags;
-
-	union {
-		/** @map: VMA map operation specific data */
-		struct xe_vma_op_map map;
-		/** @remap: VMA remap operation specific data */
-		struct xe_vma_op_remap remap;
-		/** @prefetch: VMA prefetch operation specific data */
-		struct xe_vma_op_prefetch prefetch;
-	};
-};
 #endif
diff --git a/include/drm/xe_pciids.h b/include/drm/xe_pciids.h
index bc7cbef6e9d8..7b62be9bb86e 100644
--- a/include/drm/xe_pciids.h
+++ b/include/drm/xe_pciids.h
@@ -173,6 +173,22 @@
 	XE_ATS_M150_IDS(MACRO__, ## __VA_ARGS__),\
 	XE_ATS_M75_IDS(MACRO__, ## __VA_ARGS__)
 
+/* PVC */
+#define XE_PVC_IDS(MACRO__, ...)		\
+	MACRO__(0x0B69, ## __VA_ARGS__),	\
+	MACRO__(0x0B6E, ## __VA_ARGS__),	\
+	MACRO__(0x0BD5, ## __VA_ARGS__),	\
+	MACRO__(0x0BD4, ## __VA_ARGS__),	\
+	MACRO__(0x0BD6, ## __VA_ARGS__),	\
+	MACRO__(0x0BD7, ## __VA_ARGS__),	\
+	MACRO__(0x0BD8, ## __VA_ARGS__),	\
+	MACRO__(0x0BD9, ## __VA_ARGS__),	\
+	MACRO__(0x0BDA, ## __VA_ARGS__),	\
+	MACRO__(0x0BDB, ## __VA_ARGS__),	\
+	MACRO__(0x0BE0, ## __VA_ARGS__),	\
+	MACRO__(0x0BE1, ## __VA_ARGS__),	\
+	MACRO__(0x0BE5, ## __VA_ARGS__)
+
 /* MTL / ARL */
 #define XE_MTL_IDS(MACRO__, ...)		\
 	MACRO__(0x7D40, ## __VA_ARGS__),	\
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 02/31] drm/xe/svm: Add SVM document
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
  2024-04-09 20:17 ` [v2 01/31] drm/xe: Refactor vm_bind Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault Oak Zeng
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Add shared virtual memory document.
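
For reference, the retry scheme described in the document's "Lock design"
section below could look roughly like the sketch here. This is illustrative
only: struct xe_svm and its mm/notifier/mutex members, as well as
program_gpu_page_table(), are placeholders for code added later in this
series, not part of this patch.

  static int xe_svm_bind_range(struct xe_svm *svm, unsigned long start,
                               unsigned long npages, unsigned long *pfns)
  {
          struct hmm_range range = {
                  .notifier = &svm->notifier,     /* mmu_interval_notifier */
                  .start = start,
                  .end = start + npages * PAGE_SIZE,
                  .hmm_pfns = pfns,
                  .default_flags = HMM_PFN_REQ_FAULT,
          };
          int ret;

  again:
          range.notifier_seq = mmu_interval_read_begin(range.notifier);

          mmap_read_lock(svm->mm);                /* lock #1 */
          ret = hmm_range_fault(&range);
          mmap_read_unlock(svm->mm);
          if (ret == -EBUSY)
                  goto again;
          if (ret)
                  return ret;

          mutex_lock(&svm->mutex);                /* lock #2, also taken by the
                                                   * invalidation callback */
          if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
                  /* CPU address space changed underneath us: retry (case #3) */
                  mutex_unlock(&svm->mutex);
                  goto again;
          }
          ret = program_gpu_page_table(svm, &range);      /* placeholder */
          mutex_unlock(&svm->mutex);

          return ret;
  }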

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 Documentation/gpu/xe/index.rst  |   1 +
 Documentation/gpu/xe/xe_svm.rst |   8 +++
 drivers/gpu/drm/xe/xe_svm_doc.h | 121 ++++++++++++++++++++++++++++++++
 3 files changed, 130 insertions(+)
 create mode 100644 Documentation/gpu/xe/xe_svm.rst
 create mode 100644 drivers/gpu/drm/xe/xe_svm_doc.h

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..106b60aba1f0 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
    xe_firmware
    xe_tile
    xe_debugging
+   xe_svm
diff --git a/Documentation/gpu/xe/xe_svm.rst b/Documentation/gpu/xe/xe_svm.rst
new file mode 100644
index 000000000000..62954ba1c6f8
--- /dev/null
+++ b/Documentation/gpu/xe/xe_svm.rst
@@ -0,0 +1,8 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+=====================
+Shared virtual memory
+=====================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_svm_doc.h
+   :doc: Shared virtual memory
diff --git a/drivers/gpu/drm/xe/xe_svm_doc.h b/drivers/gpu/drm/xe/xe_svm_doc.h
new file mode 100644
index 000000000000..de38ee3585e4
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_doc.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef _XE_SVM_DOC_H_
+#define _XE_SVM_DOC_H_
+
+/**
+ * DOC: Shared virtual memory
+ *
+ * Shared Virtual Memory (SVM) allows the programmer to use a single virtual
+ * address space shared between threads executing on CPUs and GPUs. It abstracts
+ * away from the user the location of the backing memory, and hence simplifies
+ * the user programming model. In a non-SVM memory model, the user needs to
+ * explicitly decide memory placement, such as device or system memory, and also
+ * needs to explicitly migrate memory between device and system memory.
+ *
+ * Interface
+ * =========
+ *
+ * SVM makes use of the default OS memory allocation and mapping interfaces such
+ * as malloc() and mmap(). The pointer returned from malloc() or mmap() can be
+ * used directly in both CPU and GPU programs.
+ *
+ * SVM also provides an API to set virtual address range based memory attributes
+ * such as preferred memory location, memory migration granularity, and memory
+ * atomic attributes. This is similar to the Linux madvise API.
+ *
+ * Basic implementation
+ * ====================
+ *
+ * The XeKMD implementation is based on the Linux kernel Heterogeneous Memory
+ * Management (HMM) framework. HMM's address space mirroring support allows
+ * sharing of the address space by duplicating sections of CPU page tables in the
+ * device page tables. This enables both the CPU and the GPU to access a physical
+ * memory location using the same virtual address.
+ *
+ * The Linux kernel also provides the ability to plug device memory into the
+ * system (as a special ZONE_DEVICE type) and allocates a struct page for each
+ * device memory page.
+ *
+ * HMM also provides a mechanism to migrate pages from host to device memory and
+ * vice versa.
+ *
+ * More information on HMM can be found here:
+ * https://www.kernel.org/doc/Documentation/vm/hmm.rst
+ *
+ * Unlike the non-SVM memory allocators (such as gem_create, vm_bind etc.),
+ * there is no buffer object (BO, such as struct ttm_buffer_object or struct
+ * drm_gem_object) in our SVM implementation. We deliberately chose this option
+ * to achieve page granularity memory placement, validation, eviction and migration.
+ *
+ * The SVM layer allocates device memory directly from the drm buddy subsystem.
+ * The memory is organized as many blocks, each of which has 2^n pages. The SVM
+ * subsystem then marks the usage of each page using a simple bitmap. When all
+ * pages in a block are no longer used, SVM returns the block to drm buddy.
+ *
+ * There are 3 events which can trigger the SVM subsystem into action:
+ *
+ * 1. A mmu notifier callback
+ *
+ * Since SVM needs to mirror the program's CPU virtual address space on the GPU
+ * side, whenever the program's CPU address space changes, SVM needs to make an
+ * identical change on the GPU side. SVM/HMM use a mmu interval notifier to do
+ * this: SVM registers a mmu interval notifier callback function with core mm,
+ * and whenever a CPU side virtual address range changes (e.g., when a range is
+ * unmapped by the CPU calling munmap), the registered callback is called from
+ * core mm. SVM then mirrors the CPU address space change on the GPU side, i.e.,
+ * unmaps or invalidates the virtual address range from the GPU page table.
+ *
+ * 2. A GPU page fault
+ *
+ * At the very beginning of a process's life, no virtual address of the process
+ * is mapped in the GPU page table. So when the GPU accesses any virtual address
+ * of the process, a GPU page fault is triggered. SVM then decides the best memory
+ * location for the faulting address (mainly from performance considerations;
+ * sometimes correctness requirements are also considered, such as whether the GPU
+ * can perform atomic operations on a certain memory location), migrates memory if
+ * necessary, and maps the faulting address into the GPU page table.
+ *
+ * 3. A CPU page fault
+ *
+ * A CPU page fault is usually managed by Linux core mm. But in a mixed CPU and
+ * GPU programming environment, the backing store of a virtual address range can
+ * be in the GPU's local memory, which is not visible to the CPU (DEVICE_PRIVATE),
+ * so the CPU page fault handler needs to migrate such pages to system memory for
+ * the CPU to be able to access them. Such memory migration is device specific.
+ * HMM has a callback function (the migrate_to_ram function of dev_pagemap_ops)
+ * for the device driver to implement.
+ *
+ *
+ * Memory hints: TBD
+ * =================
+ *
+ * Memory eviction: TBD
+ * ====================
+ *
+ * Lock design
+ * ===========
+ *
+ * https://www.kernel.org/doc/Documentation/vm/hmm.rst section "Address space
+ * mirroring implementation and API" describes the locking scheme that a driver
+ * writer has to respect. There are 3 locking mechanisms involved in this scheme:
+ *
+ * 1. Use mmap_read/write_lock to protect VMA and CPU page table operations.
+ * Operations such as munmap/mmap and page table updates during NUMA balancing
+ * must hold this lock. hmm_range_fault is a helper function provided by HMM to
+ * populate the CPU page table, so it must be called with this lock held.
+ *
+ * 2. Use xe_svm::mutex to protect device side page table operations. Any attempt to bind
+ * an address range to the GPU, or invalidate an address range from the GPU, should hold it.
+ *
+ * 3. In the GPU page fault handler, during the device page table update, we hold the
+ * xe_svm::mutex, but we don't hold the mmap_read/write_lock. So the program's address
+ * space can change during the GPU page table update. The mmu notifier seq# is used to
+ * determine whether an unmap happened during the device page table update; if so, retry.
+ *
+ */
+
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
  2024-04-09 20:17 ` [v2 01/31] drm/xe: Refactor vm_bind Oak Zeng
  2024-04-09 20:17 ` [v2 02/31] drm/xe/svm: Add SVM document Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse Oak Zeng
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

Rather than return an error to the user or ban the VM when a userptr VMA
page pin fails with -EFAULT, invalidate the VMA's mappings. This supports
the UMD use case of freeing a userptr while still having bindings.
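
A rough userspace-side illustration of that use case (hypothetical
wrapper names, not a real API; error handling omitted):

  size_t sz = 16 * 4096;
  void *ptr = aligned_alloc(4096, sz);

  /* bind_userptr() stands in for a DRM_IOCTL_XE_VM_BIND call that maps
   * the malloc'd range as a userptr at gpu_va. */
  bind_userptr(fd, vm_id, (uint64_t)ptr, gpu_va, sz);

  /* ... submit GPU work using gpu_va ... */

  free(ptr);      /* the backing pages may now go away */

  /* The binding at gpu_va still exists. On the next userptr repin the
   * page pin fails with -EFAULT, and with this patch the kernel simply
   * invalidates the VMA mappings instead of erroring out or banning
   * the VM. */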

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  4 ++--
 drivers/gpu/drm/xe/xe_trace.h        |  2 +-
 drivers/gpu/drm/xe/xe_vm.c           | 20 +++++++++++++-------
 drivers/gpu/drm/xe/xe_vm_types.h     |  7 ++-----
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index e4f5a80a46fc..c49b1409e168 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -68,7 +68,7 @@ static bool access_is_atomic(enum access_type access_type)
 static bool vma_is_valid(struct xe_tile *tile, struct xe_vma *vma)
 {
 	return BIT(tile->id) & vma->tile_present &&
-		!(BIT(tile->id) & vma->usm.tile_invalidated);
+		!(BIT(tile->id) & vma->tile_invalidated);
 }
 
 static bool vma_matches(struct xe_vma *vma, u64 page_addr)
@@ -230,7 +230,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 
 	if (xe_vma_is_userptr(vma))
 		ret = xe_vma_userptr_check_repin(to_userptr_vma(vma));
-	vma->usm.tile_invalidated &= ~BIT(tile->id);
+	vma->tile_invalidated &= ~BIT(tile->id);
 
 unlock_dma_resv:
 	drm_exec_fini(&exec);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index c4704c5f3c72..5f7d26bf4cd7 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -464,7 +464,7 @@ DEFINE_EVENT(xe_vma, xe_vma_userptr_invalidate,
 	     TP_ARGS(vma)
 );
 
-DEFINE_EVENT(xe_vma, xe_vma_usm_invalidate,
+DEFINE_EVENT(xe_vma, xe_vma_invalidate,
 	     TP_PROTO(struct xe_vma *vma),
 	     TP_ARGS(vma)
 );
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 8ba037e7ce5c..e1c1c18825ff 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -723,11 +723,18 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
 	list_for_each_entry_safe(uvma, next, &vm->userptr.repin_list,
 				 userptr.repin_link) {
 		err = xe_vma_userptr_pin_pages(uvma);
-		if (err < 0)
-			return err;
-
 		list_del_init(&uvma->userptr.repin_link);
-		list_move_tail(&uvma->vma.combined_links.rebind, &vm->rebind_list);
+		if (err == -EFAULT) {
+			err = xe_vm_invalidate_vma(&uvma->vma);
+			if (err)
+				return err;
+		} else {
+			if (err < 0)
+				return err;
+
+			list_move_tail(&uvma->vma.combined_links.rebind,
+				       &vm->rebind_list);
+		}
 	}
 
 	return 0;
@@ -3136,9 +3143,8 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	u8 id;
 	int ret;
 
-	xe_assert(xe, xe_vm_in_fault_mode(xe_vma_vm(vma)));
 	xe_assert(xe, !xe_vma_is_null(vma));
-	trace_xe_vma_usm_invalidate(vma);
+	trace_xe_vma_invalidate(vma);
 
 	/* Check that we don't race with page-table updates */
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
@@ -3176,7 +3182,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 		}
 	}
 
-	vma->usm.tile_invalidated = vma->tile_mask;
+	vma->tile_invalidated = vma->tile_mask;
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index d0a08e927db7..2bb76adf66a1 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -96,11 +96,8 @@ struct xe_vma {
 		struct work_struct destroy_work;
 	};
 
-	/** @usm: unified shared memory state */
-	struct {
-		/** @tile_invalidated: VMA has been invalidated */
-		u8 tile_invalidated;
-	} usm;
+	/** @tile_invalidated: VMA has been invalidated */
+	u8 tile_invalidated;
 
 	/** @tile_mask: Tile mask of where to create binding for this VMA */
 	u8 tile_mask;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (2 preceding siblings ...)
  2024-04-09 20:17 ` [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 05/31] drm/xe: Fix op->tile_mask for fault mode Oak Zeng
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

Drop exec queue and last arguments from vm_bind_ioctl_ops_parse as these
are unused.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index e1c1c18825ff..c0c6bb163a9e 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2142,10 +2142,9 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
-				   struct drm_gpuva_ops *ops,
+static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				   struct xe_sync_entry *syncs, u32 num_syncs,
-				   struct xe_vma_ops *vops, bool last)
+				   struct xe_vma_ops *vops)
 {
 	struct xe_device *xe = vm->xe;
 	struct drm_gpuva_op *__op;
@@ -3037,8 +3036,8 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto unwind_ops;
 		}
 
-		err = vm_bind_ioctl_ops_parse(vm, q, ops[i], syncs, num_syncs,
-					      &vops, i == args->num_binds - 1);
+		err = vm_bind_ioctl_ops_parse(vm, ops[i], syncs, num_syncs,
+					      &vops);
 		if (err)
 			goto unwind_ops;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 05/31] drm/xe: Fix op->tile_mask for fault mode
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (3 preceding siblings ...)
  2024-04-09 20:17 ` [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

op->tile_mask might be a subset of all tiles if in fault mode. Fix
unmaps by setting op->tile_mask to the unmapped VMA's tile_present field.

FIXME: This should be squashed into an earlier patch

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index c0c6bb163a9e..7ce7dbeb6f0a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2190,6 +2190,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			struct xe_vma *old =
 				gpuva_to_vma(op->base.remap.unmap->va);
 
+			op->tile_mask = old->tile_present;
 			op->remap.start = xe_vma_start(old);
 			op->remap.range = xe_vma_size(old);
 
@@ -2273,6 +2274,13 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+			op->tile_mask = vma->tile_present;
+			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			break;
+		}
 		case DRM_GPUVA_OP_PREFETCH:
 			/* FIXME: Need to skip some prefetch ops */
 			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (4 preceding siblings ...)
  2024-04-09 20:17 ` [v2 05/31] drm/xe: Fix op->tile_mask for fault mode Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

Add the DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag which is used to create
unpopulated (no memory backing or GPU page tables) VMAs. These VMAs are
referred to as system allocator VMAs. The idea is that on page fault the
memory backing and GPU page tables will be populated.
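
A rough userspace sketch of how the flag is meant to be used (not part of
this patch; it assumes fd and vm_id refer to a VM created with
DRM_XE_VM_CREATE_FLAG_FAULT_MODE, the reserved VA range is illustrative
only, and error handling is omitted):

  struct drm_xe_vm_bind bind = {
          .vm_id = vm_id,
          .num_binds = 1,
          .bind = {
                  .op = DRM_XE_VM_BIND_OP_MAP,
                  .flags = DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR,
                  .obj = 0,               /* BO handle MBZ */
                  .obj_offset = 0,        /* BO offset MBZ */
                  .addr = 0,              /* start of the reserved VA range */
                  .range = 1ull << 47,    /* cover the CPU VA space in use */
          },
  };

  ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);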

FIXME: Only supporting 1 to 1 mapping between user address space and
GPU address space

v1: enforce DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR for fault mode VMs

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c       |  73 +++++++++++++----
 drivers/gpu/drm/xe/xe_vm.c       | 132 +++++++++++++++++++------------
 drivers/gpu/drm/xe/xe_vm.h       |   8 +-
 drivers/gpu/drm/xe/xe_vm_types.h |   3 +
 include/uapi/drm/xe_drm.h        |  15 +++-
 5 files changed, 161 insertions(+), 70 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 1ff01d616dac..846e896edcb5 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -1030,6 +1030,11 @@ static int op_add_deps(struct xe_vm *vm, struct xe_vma_op *op,
 {
 	int err = 0;
 
+	/*
+	 * No need to check for is_system_allocator here as vma_add_deps is a
+	 * NOP if VMA is_system_allocator
+	 */
+
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
 		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
@@ -1602,6 +1607,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
 	int err;
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1659,6 +1665,7 @@ static int unbind_op_prepare(struct xe_tile *tile,
 	u32 current_op = pt_update_ops->current_op;
 	struct xe_vm_pgtable_update_op *pt_op = &pt_update_ops->ops[current_op];
 
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
 	xe_bo_assert_held(xe_vma_bo(vma));
 
 	vm_dbg(&xe_vma_vm(vma)->xe->drm,
@@ -1694,15 +1701,21 @@ static int op_prepare(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.remap.unmap->va));
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, old);
 
 		if (!err && op->remap.prev) {
 			err = bind_op_prepare(vm, tile, pt_update_ops,
@@ -1715,15 +1728,28 @@ static int op_prepare(struct xe_vm *vm,
 			pt_update_ops->wait_vm_bookkeep = true;
 		}
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		err = unbind_op_prepare(tile, pt_update_ops,
-					gpuva_to_vma(op->base.unmap.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = unbind_op_prepare(tile, pt_update_ops, vma);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		err = bind_op_prepare(vm, tile, pt_update_ops,
-				      gpuva_to_vma(op->base.prefetch.va));
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (xe_vma_is_system_allocator(vma))
+			break;
+
+		err = bind_op_prepare(vm, tile, pt_update_ops, vma);
 		pt_update_ops->wait_vm_kernel = true;
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
@@ -1785,6 +1811,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			   struct xe_vm_pgtable_update_ops *pt_update_ops,
 			   struct xe_vma *vma, struct dma_fence *fence)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1810,6 +1838,8 @@ static void unbind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
 			     struct xe_vm_pgtable_update_ops *pt_update_ops,
 			     struct xe_vma *vma, struct dma_fence *fence)
 {
+	xe_tile_assert(tile, !xe_vma_is_system_allocator(vma));
+
 	if (!xe_vma_has_no_bo(vma) && !xe_vma_bo(vma)->vm)
 		dma_resv_add_fence(xe_vma_bo(vma)->ttm.base.resv, fence,
 				   pt_update_ops->wait_vm_bookkeep ?
@@ -1837,14 +1867,20 @@ static void op_commit(struct xe_vm *vm,
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
-		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
+		if ((!op->map.immediate && xe_vm_in_fault_mode(vm)) ||
+		    op->map.is_system_allocator)
 			break;
 
 		bind_op_commit(vm, tile, pt_update_ops, op->map.vma, fence);
 		break;
 	case DRM_GPUVA_OP_REMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.remap.unmap->va), fence);
+	{
+		struct xe_vma *old = gpuva_to_vma(op->base.remap.unmap->va);
+
+		if (xe_vma_is_system_allocator(old))
+			break;
+
+		unbind_op_commit(vm, tile, pt_update_ops, old, fence);
 
 		if (op->remap.prev)
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
@@ -1853,14 +1889,23 @@ static void op_commit(struct xe_vm *vm,
 			bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
 				       fence);
 		break;
+	}
 	case DRM_GPUVA_OP_UNMAP:
-		unbind_op_commit(vm, tile, pt_update_ops,
-				 gpuva_to_vma(op->base.unmap.va), fence);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			unbind_op_commit(vm, tile, pt_update_ops, vma, fence);
 		break;
+	}
 	case DRM_GPUVA_OP_PREFETCH:
-		bind_op_commit(vm, tile, pt_update_ops,
-			       gpuva_to_vma(op->base.prefetch.va), fence);
+	{
+		struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
+		if (!xe_vma_is_system_allocator(vma))
+			bind_op_commit(vm, tile, pt_update_ops, vma, fence);
 		break;
+	}
 	default:
 		drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 	}
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 7ce7dbeb6f0a..d31d067d2e8b 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -841,6 +841,8 @@ int xe_vm_populate_dummy_rebind(struct xe_vm *vm, struct xe_vma *vma,
 	vm->dummy_ops.op.map.immediate = true;
 	vm->dummy_ops.op.map.dumpable = vma->gpuva.flags & XE_VMA_DUMPABLE;
 	vm->dummy_ops.op.map.is_null = xe_vma_is_null(vma);
+	vm->dummy_ops.op.map.is_system_allocator =
+		xe_vma_is_system_allocator(vma);
 
 	return xe_vma_ops_alloc(&vm->dummy_ops.vops);
 }
@@ -889,9 +891,10 @@ static void xe_vma_free(struct xe_vma *vma)
 		kfree(vma);
 }
 
-#define VMA_CREATE_FLAG_READ_ONLY	BIT(0)
-#define VMA_CREATE_FLAG_IS_NULL		BIT(1)
-#define VMA_CREATE_FLAG_DUMPABLE	BIT(2)
+#define VMA_CREATE_FLAG_READ_ONLY		BIT(0)
+#define VMA_CREATE_FLAG_IS_NULL			BIT(1)
+#define VMA_CREATE_FLAG_DUMPABLE		BIT(2)
+#define VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR	BIT(3)
 
 static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 				    struct xe_bo *bo,
@@ -905,6 +908,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	bool read_only = (flags & VMA_CREATE_FLAG_READ_ONLY);
 	bool is_null = (flags & VMA_CREATE_FLAG_IS_NULL);
 	bool dumpable = (flags & VMA_CREATE_FLAG_DUMPABLE);
+	bool is_system_allocator =
+		(flags & VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR);
 
 	xe_assert(vm->xe, start < end);
 	xe_assert(vm->xe, end < vm->size);
@@ -913,7 +918,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 	 * Allocate and ensure that the xe_vma_is_userptr() return
 	 * matches what was allocated.
 	 */
-	if (!bo && !is_null) {
+	if (!bo && !is_null && !is_system_allocator) {
 		struct xe_userptr_vma *uvma = kzalloc(sizeof(*uvma), GFP_KERNEL);
 
 		if (!uvma)
@@ -925,6 +930,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		if (!vma)
 			return ERR_PTR(-ENOMEM);
 
+		if (is_system_allocator)
+			vma->gpuva.flags |= XE_VMA_SYSTEM_ALLOCATOR;
 		if (is_null)
 			vma->gpuva.flags |= DRM_GPUVA_SPARSE;
 		if (bo)
@@ -967,7 +974,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 		drm_gpuva_link(&vma->gpuva, vm_bo);
 		drm_gpuvm_bo_put(vm_bo);
 	} else /* userptr or null */ {
-		if (!is_null) {
+		if (!is_null && !is_system_allocator) {
 			struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
 			u64 size = end - start + 1;
 			int err;
@@ -1024,7 +1031,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 		 */
 		mmu_interval_notifier_remove(&userptr->notifier);
 		xe_vm_put(vm);
-	} else if (xe_vma_is_null(vma)) {
+	} else if (xe_vma_is_null(vma) || xe_vma_is_system_allocator(vma)) {
 		xe_vm_put(vm);
 	} else {
 		xe_bo_put(xe_vma_bo(vma));
@@ -1063,7 +1070,7 @@ static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
 		spin_lock(&vm->userptr.invalidated_lock);
 		list_del(&to_userptr_vma(vma)->userptr.invalidate_link);
 		spin_unlock(&vm->userptr.invalidated_lock);
-	} else if (!xe_vma_is_null(vma)) {
+	} else if (!xe_vma_is_null(vma) && !xe_vma_is_system_allocator(vma)) {
 		xe_bo_assert_held(xe_vma_bo(vma));
 
 		drm_gpuva_unlink(&vma->gpuva);
@@ -1982,6 +1989,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
 		if (__op->op == DRM_GPUVA_OP_MAP) {
 			op->map.immediate = !xe_vm_in_fault_mode(vm);
 			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+			op->map.is_system_allocator = flags &
+				DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
 			op->map.pat_index = pat_index;
 		} else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
@@ -2173,6 +2182,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				VMA_CREATE_FLAG_IS_NULL : 0;
 			flags |= op->map.dumpable ?
 				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->map.is_system_allocator ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
 			vma = new_vma(vm, &op->base.map, op->map.pat_index,
 				      flags);
@@ -2180,7 +2191,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				return PTR_ERR(vma);
 
 			op->map.vma = vma;
-			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
+			if ((op->map.immediate || !xe_vm_in_fault_mode(vm)) &&
+			    !op->map.is_system_allocator)
 				xe_vma_ops_incr_pt_update_ops(vops,
 							      op->tile_mask);
 			break;
@@ -2189,22 +2201,25 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 		{
 			struct xe_vma *old =
 				gpuva_to_vma(op->base.remap.unmap->va);
+			bool skip = xe_vma_is_system_allocator(old);
 
 			op->tile_mask = old->tile_present;
 			op->remap.start = xe_vma_start(old);
 			op->remap.range = xe_vma_size(old);
 
-			if (op->base.remap.prev) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_READ_ONLY ?
+				VMA_CREATE_FLAG_READ_ONLY : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				DRM_GPUVA_SPARSE ?
+				VMA_CREATE_FLAG_IS_NULL : 0;
+			flags |= op->base.remap.unmap->va->flags &
+				XE_VMA_DUMPABLE ?
+				VMA_CREATE_FLAG_DUMPABLE : 0;
+			flags |= xe_vma_is_system_allocator(old) ?
+				VMA_CREATE_FLAG_IS_SYSTEM_ALLOCATOR : 0;
 
+			if (op->base.remap.prev) {
 				vma = new_vma(vm, op->base.remap.prev,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2216,9 +2231,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_prev = !xe_vma_is_userptr(old) &&
+				op->remap.skip_prev = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_end(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_prev) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2234,16 +2250,6 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			}
 
 			if (op->base.remap.next) {
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_READ_ONLY ?
-					VMA_CREATE_FLAG_READ_ONLY : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					DRM_GPUVA_SPARSE ?
-					VMA_CREATE_FLAG_IS_NULL : 0;
-				flags |= op->base.remap.unmap->va->flags &
-					XE_VMA_DUMPABLE ?
-					VMA_CREATE_FLAG_DUMPABLE : 0;
-
 				vma = new_vma(vm, op->base.remap.next,
 					      old->pat_index, flags);
 				if (IS_ERR(vma))
@@ -2255,9 +2261,10 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 				 * Userptr creates a new SG mapping so
 				 * we must also rebind.
 				 */
-				op->remap.skip_next = !xe_vma_is_userptr(old) &&
+				op->remap.skip_next = skip ||
+					(!xe_vma_is_userptr(old) &&
 					IS_ALIGNED(xe_vma_start(vma),
-						   xe_vma_max_pte_size(old));
+						   xe_vma_max_pte_size(old)));
 				if (op->remap.skip_next) {
 					xe_vma_set_pte_size(vma, xe_vma_max_pte_size(old));
 					op->remap.range -=
@@ -2270,7 +2277,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 					xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 				}
 			}
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!skip)
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_UNMAP:
@@ -2278,13 +2286,19 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 			struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
 
 			op->tile_mask = vma->tile_present;
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
 		}
 		case DRM_GPUVA_OP_PREFETCH:
+		{
+			struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+
 			/* FIXME: Need to skip some prefetch ops */
-			xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
+			if (!xe_vma_is_system_allocator(vma))
+				xe_vma_ops_incr_pt_update_ops(vops, op->tile_mask);
 			break;
+		}
 		default:
 			drm_warn(&vm->xe->drm, "NOT POSSIBLE");
 		}
@@ -2715,22 +2729,31 @@ static int vm_bind_ioctl_ops_execute(struct xe_vm *vm,
 }
 
 #ifdef TEST_VM_OPS_ERROR
-#define SUPPORTED_FLAGS	(FORCE_OP_ERROR | DRM_XE_VM_BIND_FLAG_NULL | \
-	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+#define SUPPORTED_FLAGS	(FORCE_OP_ERROR | \
+			 DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR | \
+			 DRM_XE_VM_BIND_FLAG_NULL | \
+			 DRM_XE_VM_BIND_FLAG_DUMPABLE)
 #else
 #define SUPPORTED_FLAGS	(DRM_XE_VM_BIND_FLAG_NULL | \
-	 DRM_XE_VM_BIND_FLAG_DUMPABLE)
+			 DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR | \
+			 DRM_XE_VM_BIND_FLAG_DUMPABLE)
 #endif
 #define XE_64K_PAGE_MASK 0xffffull
 #define ALL_DRM_XE_SYNCS_FLAGS (DRM_XE_SYNCS_FLAG_WAIT_FOR_OP)
 
 static int vm_bind_ioctl_check_args(struct xe_device *xe,
+				    struct xe_file *xef,
 				    struct drm_xe_vm_bind *args,
-				    struct drm_xe_vm_bind_op **bind_ops)
+				    struct drm_xe_vm_bind_op **bind_ops,
+				    struct xe_vm **vm)
 {
 	int err;
 	int i;
 
+	*vm = xe_vm_lookup(xef, args->vm_id);
+	if (XE_IOCTL_DBG(xe, !*vm))
+		return -EINVAL;
+
 	if (XE_IOCTL_DBG(xe, args->pad || args->pad2) ||
 	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
 		return -EINVAL;
@@ -2768,9 +2791,16 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 		u64 obj_offset = (*bind_ops)[i].obj_offset;
 		u32 prefetch_region = (*bind_ops)[i].prefetch_mem_region_instance;
 		bool is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
+		bool is_system_allocator = flags &
+			DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR;
 		u16 pat_index = (*bind_ops)[i].pat_index;
 		u16 coh_mode;
 
+		if (is_system_allocator && !xe_vm_in_fault_mode(*vm)) {
+			err = -EINVAL;
+			goto free_bind_ops;
+		}
+
 		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
 			err = -EINVAL;
 			goto free_bind_ops;
@@ -2791,13 +2821,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
 
 		if (XE_IOCTL_DBG(xe, op > DRM_XE_VM_BIND_OP_PREFETCH) ||
 		    XE_IOCTL_DBG(xe, flags & ~SUPPORTED_FLAGS) ||
-		    XE_IOCTL_DBG(xe, obj && is_null) ||
-		    XE_IOCTL_DBG(xe, obj_offset && is_null) ||
+		    XE_IOCTL_DBG(xe, obj && (is_null || is_system_allocator)) ||
+		    XE_IOCTL_DBG(xe, obj_offset &&
+				 (is_null || is_system_allocator)) ||
 		    XE_IOCTL_DBG(xe, op != DRM_XE_VM_BIND_OP_MAP &&
-				 is_null) ||
+				 (is_null || is_system_allocator)) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_MAP &&
-				 !is_null) ||
+				 !is_null && !is_system_allocator) ||
 		    XE_IOCTL_DBG(xe, !obj &&
 				 op == DRM_XE_VM_BIND_OP_UNMAP_ALL) ||
 		    XE_IOCTL_DBG(xe, addr &&
@@ -2878,7 +2909,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	int err;
 	int i;
 
-	err = vm_bind_ioctl_check_args(xe, args, &bind_ops);
+	err = vm_bind_ioctl_check_args(xe, xef, args, &bind_ops, &vm);
 	if (err)
 		return err;
 
@@ -2895,12 +2926,6 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		}
 	}
 
-	vm = xe_vm_lookup(xef, args->vm_id);
-	if (XE_IOCTL_DBG(xe, !vm)) {
-		err = -EINVAL;
-		goto put_exec_queue;
-	}
-
 	err = down_write_killable(&vm->lock);
 	if (err)
 		goto put_vm;
@@ -3151,6 +3176,7 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 	int ret;
 
 	xe_assert(xe, !xe_vma_is_null(vma));
+	xe_assert(xe, !xe_vma_is_system_allocator(vma));
 	trace_xe_vma_invalidate(vma);
 
 	/* Check that we don't race with page-table updates */
@@ -3215,8 +3241,9 @@ int xe_analyze_vm(struct drm_printer *p, struct xe_vm *vm, int gt_id)
 		struct xe_vma *vma = gpuva_to_vma(gpuva);
 		bool is_userptr = xe_vma_is_userptr(vma);
 		bool is_null = xe_vma_is_null(vma);
+		bool is_system_allocator = xe_vma_is_system_allocator(vma);
 
-		if (is_null) {
+		if (is_null || is_system_allocator) {
 			addr = 0;
 		} else if (is_userptr) {
 			struct sg_table *sg = to_userptr_vma(vma)->userptr.sg;
@@ -3235,7 +3262,8 @@ int xe_analyze_vm(struct drm_printer *p, struct xe_vm *vm, int gt_id)
 		drm_printf(p, " [%016llx-%016llx] S:0x%016llx A:%016llx %s\n",
 			   xe_vma_start(vma), xe_vma_end(vma) - 1,
 			   xe_vma_size(vma),
-			   addr, is_null ? "NULL" : is_userptr ? "USR" :
+			   addr, is_system_allocator ? "SYSTEM ALLOCATOR" :
+			   is_null ? "NULL" : is_userptr ? "USR" :
 			   is_vram ? "VRAM" : "SYS");
 	}
 	up_read(&vm->lock);
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 492237b60341..6e5470a409fc 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -150,6 +150,11 @@ static inline bool xe_vma_is_null(struct xe_vma *vma)
 	return vma->gpuva.flags & DRM_GPUVA_SPARSE;
 }
 
+static inline bool xe_vma_is_system_allocator(struct xe_vma *vma)
+{
+	return vma->gpuva.flags & XE_VMA_SYSTEM_ALLOCATOR;
+}
+
 static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 {
 	return !xe_vma_bo(vma);
@@ -157,7 +162,8 @@ static inline bool xe_vma_has_no_bo(struct xe_vma *vma)
 
 static inline bool xe_vma_is_userptr(struct xe_vma *vma)
 {
-	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma);
+	return xe_vma_has_no_bo(vma) && !xe_vma_is_null(vma) &&
+		!xe_vma_is_system_allocator(vma);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 2bb76adf66a1..e5d12bf4cf87 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -45,6 +45,7 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_PTE_64K		(DRM_GPUVA_USERBITS << 8)
 #define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 9)
 #define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 10)
+#define XE_VMA_SYSTEM_ALLOCATOR	(DRM_GPUVA_USERBITS << 11)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
@@ -141,6 +142,8 @@ struct xe_vma_op_map {
 	bool immediate;
 	/** @is_null: is NULL binding */
 	bool is_null;
+	/** @is_system_allocator: is system allocator binding */
+	bool is_system_allocator;
 	/** @dumpable: whether BO is dumped on GPU hang */
 	bool dumpable;
 	/** @pat_index: The pat index to use for this operation. */
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 2fc19177d2b0..50ab31d59fe2 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -869,6 +869,12 @@ struct drm_xe_vm_destroy {
  *    will only be valid for DRM_XE_VM_BIND_OP_MAP operations, the BO
  *    handle MBZ, and the BO offset MBZ. This flag is intended to
  *    implement VK sparse bindings.
+ *  - %DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR - When the system allocator flag is
+ *    set, no mappings are created; rather, the range is reserved for system
+ *    allocations which will be populated on GPU page faults. Only valid on VMs
+ *    with DRM_XE_VM_CREATE_FLAG_FAULT_MODE set. The system allocator flag is
+ *    only valid for DRM_XE_VM_BIND_OP_MAP operations, the BO handle MBZ, and
+ *    the BO offset MBZ.
  */
 struct drm_xe_vm_bind_op {
 	/** @extensions: Pointer to the first extension struct, if any */
@@ -921,7 +927,9 @@ struct drm_xe_vm_bind_op {
 	 * on the @pat_index. For such mappings there is no actual memory being
 	 * mapped (the address in the PTE is invalid), so the various PAT memory
 	 * attributes likely do not apply.  Simply leaving as zero is one
-	 * option (still a valid pat_index).
+	 * option (still a valid pat_index). The same applies to
+	 * DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR bindings, as for such mappings
+	 * there is no actual memory being mapped.
 	 */
 	__u16 pat_index;
 
@@ -955,8 +963,9 @@ struct drm_xe_vm_bind_op {
 	/** @op: Bind operation to perform */
 	__u32 op;
 
-#define DRM_XE_VM_BIND_FLAG_NULL	(1 << 2)
-#define DRM_XE_VM_BIND_FLAG_DUMPABLE	(1 << 3)
+#define DRM_XE_VM_BIND_FLAG_NULL		(1 << 2)
+#define DRM_XE_VM_BIND_FLAG_DUMPABLE		(1 << 3)
+#define DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR	(1 << 4)
 	/** @flags: Bind flags */
 	__u32 flags;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (5 preceding siblings ...)
  2024-04-09 20:17 ` [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector Oak Zeng
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

If a page fault occurs on a system_allocator VMA, create a userptr VMA
to replace the faulted region and map it to the GPU.

v1: Pass the userptr to the req_offset of the sm_map_ops_create
    function. This fixes a failure with malloc'd memory (Oak)
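
Together with the system allocator bind from the previous patch, the
intended end-to-end flow looks roughly like this (exec_on_gpu() is a
hypothetical submission helper, not a real API):

  int *data = malloc(N * sizeof(*data));

  for (int i = 0; i < N; i++)
          data[i] = i;                    /* CPU touches the pages first */

  /* The first GPU access to (uint64_t)data faults; with this patch the
   * fault handler finds the system_allocator VMA covering the address,
   * carves out a userptr VMA matching the CPU VMA, and maps it to the
   * GPU. No explicit per-allocation vm_bind is needed. */
  exec_on_gpu(fd, exec_queue, (uint64_t)data, N);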

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |  13 +++
 drivers/gpu/drm/xe/xe_vm.c           | 115 +++++++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_vm.h           |   2 +
 drivers/gpu/drm/xe/xe_vm_types.h     |   3 +
 4 files changed, 128 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index c49b1409e168..c9c2f15d9f5b 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -166,6 +166,19 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		goto unlock_vm;
 	}
 
+	/*
+	 * Create userptr VMA if fault occurs in a range reserved for system
+	 * allocator.
+	 */
+	if (xe_vma_is_system_allocator(vma)) {
+		vma = xe_vm_fault_userptr(vm, pf->page_addr);
+		if (IS_ERR(vma)) {
+			xe_vm_kill(vm, true);
+			ret = PTR_ERR(vma);
+			goto unlock_vm;
+		}
+	}
+
 	if (!xe_vma_is_userptr(vma) ||
 	    !xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
 		downgrade_write(&vm->lock);
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index d31d067d2e8b..1ae7f4160061 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1411,6 +1411,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		return ERR_PTR(-ENOMEM);
 
 	vm->xe = xe;
+	vm->mm = current->mm;
 
 	vm->size = 1ull << xe->info.va_bits;
 
@@ -2151,9 +2152,11 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 	return err;
 }
 
-static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
-				   struct xe_sync_entry *syncs, u32 num_syncs,
-				   struct xe_vma_ops *vops)
+static int vm_bind_ioctl_ops_update_gpuvm_state(struct xe_vm *vm,
+						struct drm_gpuva_ops *ops,
+						struct xe_sync_entry *syncs,
+						u32 num_syncs,
+						struct xe_vma_ops *vops)
 {
 	struct xe_device *xe = vm->xe;
 	struct drm_gpuva_op *__op;
@@ -3069,8 +3072,8 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 			goto unwind_ops;
 		}
 
-		err = vm_bind_ioctl_ops_parse(vm, ops[i], syncs, num_syncs,
-					      &vops);
+		err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops[i], syncs,
+							   num_syncs, &vops);
 		if (err)
 			goto unwind_ops;
 
@@ -3438,3 +3441,105 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap)
 	}
 	kvfree(snap);
 }
+
+/**
+ * xe_vm_fault_userptr() - VM fault userptr
+ * @vm: VM
+ * @fault_addr: fault address
+ *
+ * Create userptr VMA from fault address
+ *
+ * Return: newly created userptr VMA on success, ERR_PTR on failure
+ */
+struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
+{
+	struct vm_area_struct *vas;
+	struct mm_struct *mm = vm->mm;
+	struct xe_vma_ops vops;
+	struct drm_gpuva_ops *ops = NULL;
+	struct drm_gpuva_op *__op;
+	struct xe_vma *vma = NULL;
+	u64 start, range;
+	int err;
+
+	vm_dbg(&vm->xe->drm, "FAULT: addr=0x%016llx", fault_addr);
+
+	if (!mmget_not_zero(mm))
+		return ERR_PTR(-EFAULT);
+
+	kthread_use_mm(mm);
+
+	mmap_read_lock(mm);
+	vas = find_vma_intersection(mm, fault_addr, fault_addr + 4);
+	if (!vas) {
+		err = -ENOENT;
+		goto err_unlock;
+	}
+
+	vm_dbg(&vm->xe->drm, "FOUND VAS: vm_start=0x%016lx, vm_end=0x%016lx",
+	       vas->vm_start, vas->vm_end);
+
+	start = vas->vm_start;
+	range = vas->vm_end - vas->vm_start;
+	mmap_read_unlock(mm);
+
+	ops = drm_gpuvm_sm_map_ops_create(&vm->gpuvm, start, range, 0, start);
+	if (IS_ERR(ops)) {
+		err = PTR_ERR(ops);
+		goto err_kthread;
+	}
+
+	drm_gpuva_for_each_op(__op, ops)
+		print_op(vm->xe, __op);
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops, NULL, 0, &vops);
+	if (err)
+		goto err_kthread;
+
+	/*
+	 * No need to execute ops as we just want to update GPUVM state; the
+	 * page fault handler will update the GPU page tables. Find the VMA
+	 * that needs a GPU mapping and return it to the page fault handler.
+	 */
+	xe_vm_lock(vm, false);
+	drm_gpuva_for_each_op(__op, ops) {
+		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+		if (__op->op == DRM_GPUVA_OP_MAP) {
+			xe_assert(vm->xe, !vma);
+			vma = op->map.vma;
+		} else if (__op->op == DRM_GPUVA_OP_UNMAP) {
+			xe_vma_destroy(gpuva_to_vma(op->base.unmap.va), NULL);
+		} else if (__op->op == DRM_GPUVA_OP_REMAP) {
+			xe_vma_destroy(gpuva_to_vma(op->base.remap.unmap->va),
+				       NULL);
+		}
+	}
+	xe_vm_unlock(vm);
+
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return vma;
+
+err_unlock:
+	mmap_read_unlock(mm);
+err_kthread:
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	if (ops) {
+		drm_gpuva_for_each_op_reverse(__op, ops) {
+			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+			xe_vma_op_unwind(vm, op,
+					 op->flags & XE_VMA_OP_COMMITTED,
+					 op->flags & XE_VMA_OP_PREV_COMMITTED,
+					 op->flags & XE_VMA_OP_NEXT_COMMITTED);
+		}
+		drm_gpuva_ops_free(&vm->gpuvm, ops);
+	}
+
+	return ERR_PTR(err);
+}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 6e5470a409fc..97d38daf0e9a 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -244,6 +244,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma);
 
 int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma);
 
+struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr);
+
 bool xe_vm_validate_should_retry(struct drm_exec *exec, int err, ktime_t *end);
 
 int xe_analyze_vm(struct drm_printer *p, struct xe_vm *vm, int gt_id);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index e5d12bf4cf87..cb67a3918990 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -233,6 +233,9 @@ struct xe_vm {
 
 	struct xe_device *xe;
 
+	/** @mm: user MM of VM */
+	struct mm_struct *mm;
+
 	/* exec queue used for (un)binding vma's */
 	struct xe_exec_queue *q;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (6 preceding siblings ...)
  2024-04-09 20:17 ` [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 09/31] drm/xe: Introduce helper to populate userptr Oak Zeng
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

From: Matthew Brost <matthew.brost@intel.com>

When a faulted userptr VMA (allocated by the page fault handler) is
invalidated, add it to a list from which a garbage collector will unmap
it from the GPU, destroy the faulted userptr VMA, and replace it with a
system_allocator VMA.

v1: Run the garbage collector only on the MMU_NOTIFY_UNMAP event. For
    other events, we just invalidate the GPU page table but keep the vma
    because the userptr still exists. On the next GPU access, we will
    revalidate and rebind this userptr to the GPU (Oak)
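
For illustration, the userspace action that ends up triggering the
garbage collector, continuing the hypothetical SVM flow sketched in the
earlier patches (exec_on_gpu() remains a placeholder):

  void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  exec_on_gpu(fd, exec_queue, (uint64_t)ptr, len); /* GPU fault creates a
                                                    * faulted userptr VMA */

  /* munmap() fires MMU_NOTIFY_UNMAP: the notifier zaps the GPU PTEs and
   * queues the garbage collector, which unmaps and destroys the faulted
   * userptr VMA and puts a system_allocator VMA back over the range so
   * it can fault again later. */
  munmap(ptr, len);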

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c |   6 ++
 drivers/gpu/drm/xe/xe_vm.c           | 151 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm.h           |   1 +
 drivers/gpu/drm/xe/xe_vm_types.h     |  12 +++
 4 files changed, 170 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index c9c2f15d9f5b..707a3466f36b 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -154,12 +154,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		return -EINVAL;
 
 retry_userptr:
+	xe_vm_userptr_garbage_collector(vm);
+
 	/*
 	 * TODO: Avoid exclusive lock if VM doesn't have userptrs, or
 	 * start out read-locked?
 	 */
 	down_write(&vm->lock);
 	write_locked = true;
+	if (xe_vm_is_closed_or_banned(vm)) {
+		ret = -ENOENT;
+		goto unlock_vm;
+	}
 	vma = lookup_vma(vm, pf->page_addr);
 	if (!vma) {
 		ret = -EINVAL;
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 1ae7f4160061..95dda229a9fe 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -692,6 +692,18 @@ static bool vma_userptr_invalidate(struct mmu_interval_notifier *mni,
 		XE_WARN_ON(err);
 	}
 
+	if (range->event == MMU_NOTIFY_UNMAP &&
+	    vma->gpuva.flags & XE_VMA_FAULT_USERPTR &&
+	    !xe_vm_is_closed(vm) && !xe_vm_is_banned(vm) &&
+	    !(vma->gpuva.flags & XE_VMA_DESTROYED) && vma->tile_present) {
+		spin_lock(&vm->userptr.invalidated_lock);
+		list_move_tail(&userptr->invalidate_link,
+			       &vm->userptr.fault_invalidated);
+		spin_unlock(&vm->userptr.invalidated_lock);
+
+		queue_work(system_wq, &vm->userptr.garbage_collector);
+	}
+
 	trace_xe_vma_userptr_invalidate_complete(vma);
 
 	return true;
@@ -1398,6 +1410,8 @@ static void xe_vma_ops_incr_pt_update_ops(struct xe_vma_ops *vops, u8 tile_mask)
 			++vops->pt_update_ops[i].num_ops;
 }
 
+static void vm_userptr_garbage_collector(struct work_struct *w);
+
 struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 {
 	struct drm_gem_object *vm_resv_obj;
@@ -1430,8 +1444,10 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 
 	INIT_LIST_HEAD(&vm->userptr.repin_list);
 	INIT_LIST_HEAD(&vm->userptr.invalidated);
+	INIT_LIST_HEAD(&vm->userptr.fault_invalidated);
 	init_rwsem(&vm->userptr.notifier_lock);
 	spin_lock_init(&vm->userptr.invalidated_lock);
+	INIT_WORK(&vm->userptr.garbage_collector, vm_userptr_garbage_collector);
 
 	INIT_WORK(&vm->destroy_work, vm_destroy_work_func);
 
@@ -1568,6 +1584,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	xe_vm_close(vm);
 	if (xe_vm_in_preempt_fence_mode(vm))
 		flush_work(&vm->preempt.rebind_work);
+	if (xe_vm_in_fault_mode(vm))
+		flush_work(&vm->userptr.garbage_collector);
 
 	if (vm->q) {
 		down_write(&vm->lock);
@@ -3509,6 +3527,7 @@ struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
 		if (__op->op == DRM_GPUVA_OP_MAP) {
 			xe_assert(vm->xe, !vma);
 			vma = op->map.vma;
+			vma->gpuva.flags |= XE_VMA_FAULT_USERPTR;
 		} else if (__op->op == DRM_GPUVA_OP_UNMAP) {
 			xe_vma_destroy(gpuva_to_vma(op->base.unmap.va), NULL);
 		} else if (__op->op == DRM_GPUVA_OP_REMAP) {
@@ -3543,3 +3562,135 @@ struct xe_vma *xe_vm_fault_userptr(struct xe_vm *vm, u64 fault_addr)
 
 	return ERR_PTR(err);
 }
+
+static int
+vm_userptr_garbage_collector_destroy_uvma(struct xe_vm *vm,
+					  struct xe_userptr_vma *uvma)
+{
+	struct mm_struct *mm = vm->mm;
+	struct xe_vma_ops vops;
+	struct drm_gpuva_ops *ops = NULL;
+	struct drm_gpuva_op *__op;
+	struct xe_tile *tile;
+	u8 id;
+	int err;
+
+	vm_dbg(&vm->xe->drm, "GARBAGE COLLECTOR: addr=0x%016llx, range=0x%016llx",
+	       xe_vma_start(&uvma->vma), xe_vma_size(&uvma->vma));
+
+	xe_assert(vm->xe, uvma->vma.gpuva.flags & XE_VMA_FAULT_USERPTR);
+	lockdep_assert_held_write(&vm->lock);
+
+	if (!mmget_not_zero(mm))
+		return -EFAULT;
+
+	kthread_use_mm(mm);
+
+	/* Blow away xe_userptr_vma with system_allocator VMA */
+	ops = drm_gpuvm_sm_map_ops_create(&vm->gpuvm,
+					  xe_vma_start(&uvma->vma),
+					  xe_vma_size(&uvma->vma), 0, 0);
+	if (IS_ERR(ops)) {
+		err = PTR_ERR(ops);
+		goto err_kthread;
+	}
+
+	drm_gpuva_for_each_op(__op, ops) {
+		struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
+
+		if (__op->op == DRM_GPUVA_OP_MAP) {
+			op->map.immediate = true;
+			op->map.is_system_allocator = true;
+		}
+
+		print_op(vm->xe, __op);
+	}
+
+	xe_vma_ops_init(&vops, vm, NULL, NULL, 0);
+	err = vm_bind_ioctl_ops_update_gpuvm_state(vm, ops, NULL, 0, &vops);
+	if (err)
+		goto err_kthread;
+
+	/*
+	 * Order behind any user operations and use same exec queue as page
+	 * fault handler.
+	 */
+	for_each_tile(tile, vm->xe, id) {
+		vops.pt_update_ops[tile->id].wait_vm_bookkeep = true;
+		vops.pt_update_ops[tile->id].q =
+			xe_tile_migrate_bind_exec_queue(tile);
+	}
+
+	err = xe_vma_ops_alloc(&vops);
+	if (err)
+		goto err_kthread;
+
+	err = vm_bind_ioctl_ops_execute(vm, &vops);
+
+	xe_vma_ops_free(&vops);
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return err;
+
+err_kthread:
+	kthread_unuse_mm(mm);
+	mmput(mm);
+	if (ops)
+		drm_gpuva_ops_free(&vm->gpuvm, ops);
+
+	return err;
+}
+
+static void vm_userptr_garbage_collector(struct work_struct *w)
+{
+	struct xe_vm *vm =
+		container_of(w, struct xe_vm, userptr.garbage_collector);
+	struct xe_userptr_vma *uvma, *next;
+	int err;
+
+	xe_assert(vm->xe, xe_vm_in_fault_mode(vm));
+
+	down_write(&vm->lock);
+
+	if (xe_vm_is_closed_or_banned(vm))
+		goto unlock;
+
+	/*
+	 * FIXME: Could create 1 set of VMA ops for all VMAs on
+	 * fault_invalidated list
+	 */
+
+	spin_lock(&vm->userptr.invalidated_lock);
+	list_for_each_entry_safe(uvma, next, &vm->userptr.fault_invalidated,
+				 userptr.invalidate_link) {
+		list_del_init(&uvma->userptr.invalidate_link);
+		spin_unlock(&vm->userptr.invalidated_lock);
+
+		err = vm_userptr_garbage_collector_destroy_uvma(vm, uvma);
+		if (err) {
+			XE_WARN_ON("Garbage collection failed, killing VM");
+			xe_vm_kill(vm, true);
+		}
+
+		spin_lock(&vm->userptr.invalidated_lock);
+	}
+	spin_unlock(&vm->userptr.invalidated_lock);
+
+unlock:
+	up_write(&vm->lock);
+}
+
+/**
+ * xe_vm_userptr_garbage_collector() - VM userptr garbage collector
+ * @vm: VM
+ *
+ * For all invalidated faulted userptr VMAs (created by the page fault handler),
+ * unmap from the GPU, destroy the faulted userptr VMA, and replace it with a
+ * system_allocator VMA.
+ */
+void xe_vm_userptr_garbage_collector(struct xe_vm *vm)
+{
+	vm_userptr_garbage_collector(&vm->userptr.garbage_collector);
+}
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 97d38daf0e9a..0b2790f697db 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -276,6 +276,7 @@ void xe_vma_ops_free(struct xe_vma_ops *vops);
 struct dma_fence *xe_vm_ops_execute(struct xe_vm *vm, struct xe_vma_ops *vops);
 
 void xe_vm_kill(struct xe_vm *vm, bool unlocked);
+void xe_vm_userptr_garbage_collector(struct xe_vm *vm);
 
 #if IS_ENABLED(CONFIG_DRM_XE_DEBUG_VM)
 #define vm_dbg drm_dbg
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index cb67a3918990..fbf6bfcf59a8 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -46,6 +46,7 @@ struct xe_vm_pgtable_update_op;
 #define XE_VMA_PTE_COMPACT	(DRM_GPUVA_USERBITS << 9)
 #define XE_VMA_DUMPABLE		(DRM_GPUVA_USERBITS << 10)
 #define XE_VMA_SYSTEM_ALLOCATOR	(DRM_GPUVA_USERBITS << 11)
+#define XE_VMA_FAULT_USERPTR	(DRM_GPUVA_USERBITS << 12)
 
 /** struct xe_userptr - User pointer */
 struct xe_userptr {
@@ -326,6 +327,17 @@ struct xe_vm {
 		 * write mode.
 		 */
 		struct list_head invalidated;
+		/**
+		 * @userptr.fault_invalidated: List of invalidated userptrs,
+		 * created by page fault, which will be destroyed by the garbage
+		 * collector. Protected from access with the @invalidated_lock.
+		 */
+		struct list_head fault_invalidated;
+		/**
+		 * @userptr.garbage_collector: worker to implement destroying of
+		 * userptrs on @userptr.fault_invalidated list.
+		 */
+		struct work_struct garbage_collector;
 	} userptr;
 
 	/** @preempt: preempt state */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 09/31] drm/xe: Introduce helper to populate userptr
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (7 preceding siblings ...)
  2024-04-09 20:17 ` [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 10/31] drm/xe: Introduce a helper to free sg table Oak Zeng
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a helper function xe_userptr_populate_range to populate
a userptr range. This function calls hmm_range_fault to read
CPU page tables and populate all pfns/pages of this virtual address
range.

If the populated page is a system memory page, dma-mapping is performed
to get a dma-address which can later be used by the GPU to access the
pages.

If the populated page is a device private page, we calculate the dpa
(device physical address) of the page. This will be handled in future
patches.

The dma-address or dpa is then saved in the userptr's sg table. This is
preparation work to replace the get_user_pages_fast code in the userptr
code path.
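
As an illustration only (not part of this patch), the expected calling
pattern for the new helper looks roughly like the sketch below; the exact
error handling shown is an assumption for review purposes:

	/* Sketch: the caller holds the mmap lock around the population step */
	mmap_read_lock(uvma->userptr.notifier.mm);
	err = xe_userptr_populate_range(uvma);	/* fills uvma->userptr.sgt */
	mmap_read_unlock(uvma->userptr.notifier.mm);
	if (err)
		return err;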

v1: Address review comments:
    separate a npage_in_range function (Matt)
    reparameterize function xe_userptr_populate_range function (Matt)
    move mmu_interval_read_begin() call into while loop (Thomas)
    s/mark_range_accessed/xe_mark_range_accessed (Thomas)
    use set_page_dirty_lock (vs set_page_dirty) (Thomas)
    move a few checking in xe_vma_userptr_pin_pages to hmm.c (Matt)
v2: Remove device private page support. Only support system
    pages for now. use dma-map-sg rather than dma-map-page (Matt/Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Kconfig  |   1 +
 drivers/gpu/drm/xe/Makefile |   2 +
 drivers/gpu/drm/xe/xe_hmm.c | 224 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hmm.h |  17 +++
 4 files changed, 244 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.h

diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
index 1a556d087e63..449a1ecbc92a 100644
--- a/drivers/gpu/drm/xe/Kconfig
+++ b/drivers/gpu/drm/xe/Kconfig
@@ -41,6 +41,7 @@ config DRM_XE
 	select MMU_NOTIFIER
 	select WANT_DEV_COREDUMP
 	select AUXILIARY_BUS
+	select HMM_MIRROR
 	help
 	  Experimental driver for Intel Xe series GPUs
 
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index bf43a3690e13..fff70fc9a09e 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -146,6 +146,8 @@ xe-y += xe_bb.o \
 	xe_wa.o \
 	xe_wopcm.o
 
+xe-$(CONFIG_HMM_MIRROR) += xe_hmm.o
+
 # graphics hardware monitoring (HWMON) support
 xe-$(CONFIG_HWMON) += xe_hwmon.o
 
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
new file mode 100644
index 000000000000..4011207630a5
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/dma-mapping.h>
+#include <linux/memremap.h>
+#include <linux/swap.h>
+#include <linux/hmm.h>
+#include <linux/mm.h>
+#include "xe_hmm.h"
+#include "xe_vm.h"
+#include "xe_bo.h"
+
+static u64 xe_npages_in_range(unsigned long start, unsigned long end)
+{
+	return (PAGE_ALIGN(end) - PAGE_ALIGN_DOWN(start)) >> PAGE_SHIFT;
+}
+
+/**
+ * xe_mark_range_accessed() - mark a range as accessed, so core mm
+ * has such information for memory eviction or write back to
+ * hard disk
+ *
+ * @range: the range to mark
+ * @write: if we write to this range, mark pages in this range
+ * as dirty
+ */
+static void xe_mark_range_accessed(struct hmm_range *range, bool write)
+{
+	struct page *page;
+	u64 i, npages;
+
+	npages = xe_npages_in_range(range->start, range->end);
+	for (i = 0; i < npages; i++) {
+		page = hmm_pfn_to_page(range->hmm_pfns[i]);
+		if (write)
+			set_page_dirty_lock(page);
+
+		mark_page_accessed(page);
+	}
+}
+
+/**
+ * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
+ * in a hmm_range. dma-map pages if necessary. dma-address is saved in sg table
+ * and will be used to program GPU page table later.
+ *
+ * @xe: the xe device who will access the dma-address in sg table
+ * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
+ * has the pfn numbers of pages that back up this hmm address range.
+ * @st: pointer to the sg table.
+ * @write: whether we write to this range. This decides dma map direction
+ * for system pages. If write, we map it bi-directional; otherwise
+ * DMA_TO_DEVICE
+ *
+ * All the contiguous pfns will be collapsed into one entry in
+ * the scatter gather table. This is for the purpose of efficiently
+ * programming GPU page table.
+ *
+ * The dma_address in the sg table will later be used by GPU to
+ * access memory. So if the memory is system memory, we need to
+ * do a dma-mapping so it can be accessed by GPU/DMA.
+ *
+ * FIXME: This function currently only supports pages in system
+ * memory. If the memory is GPU local memory (of the GPU which
+ * is going to access the memory), we need the gpu dpa (device physical
+ * address), and there is no need for dma-mapping. This is TBD.
+ *
+ * FIXME: dma-mapping for peer gpu device to access remote gpu's
+ * memory. Add this when you support p2p
+ *
+ * This function allocates the storage of the sg table. It is the
+ * caller's responsibility to free it by calling sg_free_table.
+ *
+ * Returns 0 if successful; -ENOMEM if it fails to allocate memory
+ */
+static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
+			     struct sg_table *st, bool write)
+{
+	struct device *dev = xe->drm.dev;
+	struct page **pages;
+	u64 i, npages;
+	int ret;
+
+	npages = xe_npages_in_range(range->start, range->end);
+	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	for (i = 0; i < npages; i++) {
+		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
+		xe_assert(xe, !is_device_private_page(pages[i]));
+	}
+
+	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
+			npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
+	if (ret)
+		goto free_pages;
+
+	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
+			DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+
+free_pages:
+	kvfree(pages);
+	return ret;
+}
+
+/**
+ * xe_userptr_populate_range() - Populate physical pages of a virtual
+ * address range
+ *
+ * @uvma: userptr vma which has information of the range to populate.
+ *
+ * This function populates the physical pages of a virtual
+ * address range. The populated physical pages are saved in
+ * the userptr's sg table. It is similar to get_user_pages but calls
+ * hmm_range_fault.
+ *
+ * This function also reads the mmu notifier sequence #
+ * (mmu_interval_read_begin), for the purpose of later
+ * comparison (through mmu_interval_read_retry).
+ *
+ * This must be called with mmap read or write lock held.
+ *
+ * This function allocates the storage of the userptr sg table.
+ * It is the caller's responsibility to free it by calling sg_free_table.
+ *
+ * Returns: 0 for success; negative error code on failure
+ */
+int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
+{
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
+	struct xe_userptr *userptr;
+	struct xe_vma *vma = &uvma->vma;
+	u64 start = xe_vma_userptr(vma);
+	u64 end = start + xe_vma_size(vma);
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct hmm_range hmm_range;
+	bool write = !xe_vma_read_only(vma);
+	bool in_kthread = !current->mm;
+	unsigned long notifier_seq;
+	u64 npages;
+	int ret;
+
+	userptr = &uvma->userptr;
+	mmap_assert_locked(userptr->notifier.mm);
+
+	if (vma->gpuva.flags & XE_VMA_DESTROYED)
+		return 0;
+
+	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
+	if (notifier_seq == userptr->notifier_seq)
+		return 0;
+
+	npages = xe_npages_in_range(start, end);
+	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+	if (unlikely(!pfns))
+		return -ENOMEM;
+
+	if (write)
+		flags |= HMM_PFN_REQ_WRITE;
+
+	if (in_kthread) {
+		if (!mmget_not_zero(userptr->notifier.mm)) {
+			ret = -EFAULT;
+			goto free_pfns;
+		}
+		kthread_use_mm(userptr->notifier.mm);
+	}
+
+	memset64((u64 *)pfns, (u64)flags, npages);
+	hmm_range.hmm_pfns = pfns;
+	hmm_range.notifier = &userptr->notifier;
+	hmm_range.start = start;
+	hmm_range.end = end;
+	hmm_range.pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;
+	/*
+	 * FIXME:
+	 * Setting dev_private_owner prevents hmm_range_fault from faulting
+	 * in the device private pages owned by the caller. See function
+	 * hmm_vma_handle_pte. In the multiple GPU case, this should be set to
+	 * the device owner of the best migration destination. E.g., if
+	 * device0/vm0 has a page fault, but we have determined the best
+	 * placement of the fault address should be on device1, we should set
+	 * the below to device1 instead of device0.
+	 */
+	hmm_range.dev_private_owner = vm->xe;
+
+	while (true) {
+		hmm_range.notifier_seq = mmu_interval_read_begin(&userptr->notifier);
+		ret = hmm_range_fault(&hmm_range);
+		if (time_after(jiffies, timeout))
+			break;
+
+		if (ret == -EBUSY)
+			continue;
+		break;
+	}
+
+	if (in_kthread) {
+		kthread_unuse_mm(userptr->notifier.mm);
+		mmput(userptr->notifier.mm);
+	}
+
+	if (ret)
+		goto free_pfns;
+
+	ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
+	if (ret)
+		goto free_pfns;
+
+	xe_mark_range_accessed(&hmm_range, write);
+	userptr->sg = &userptr->sgt;
+	userptr->notifier_seq = hmm_range.notifier_seq;
+
+free_pfns:
+	kvfree(pfns);
+	return ret;
+}
+
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
new file mode 100644
index 000000000000..91686a751711
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hmm.h
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <linux/types.h>
+
+struct xe_userptr_vma;
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+int xe_userptr_populate_range(struct xe_userptr_vma *uvma);
+#else
+static inline int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
+{
+	return -ENODEV;
+}
+#endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 10/31] drm/xe: Introduce a helper to free sg table
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (8 preceding siblings ...)
  2024-04-09 20:17 ` [v2 09/31] drm/xe: Introduce helper to populate userptr Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce the xe_userptr_free_sg helper to dma-unmap all
addresses in the userptr's sg table and free the sg table.
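
For illustration, the expected pairing with the populate helper is roughly
as sketched below (an assumption, mirroring the guard used in this patch):

	/* Sketch: drop any previously built sg table before repopulating */
	if (uvma->userptr.sg)
		xe_userptr_free_sg(uvma);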

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Suggested-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_hmm.c | 30 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hmm.h |  1 +
 2 files changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index 4011207630a5..427c6bc49949 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -3,6 +3,7 @@
  * Copyright © 2024 Intel Corporation
  */
 
+#include <linux/scatterlist.h>
 #include <linux/mmu_notifier.h>
 #include <linux/dma-mapping.h>
 #include <linux/memremap.h>
@@ -107,6 +108,32 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
 	return ret;
 }
 
+/**
+ * xe_userptr_free_sg() - Free the scatter gather table of userptr
+ *
+ * @uvma: the userptr vma which holds the scatter gather table
+ *
+ * With function xe_userptr_populate_range, we allocate the storage of
+ * the userptr sg table. This is a helper function to free this
+ * sg table, and dma unmap the addresses in the table.
+ */
+void xe_userptr_free_sg(struct xe_userptr_vma *uvma)
+{
+	struct xe_userptr *userptr = &uvma->userptr;
+	struct xe_vma *vma = &uvma->vma;
+	bool write = !xe_vma_read_only(vma);
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct xe_device *xe = vm->xe;
+	struct device *dev = xe->drm.dev;
+
+	xe_assert(xe, userptr->sg);
+	dma_unmap_sgtable(dev, userptr->sg,
+			write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
+
+	sg_free_table(userptr->sg);
+	userptr->sg = NULL;
+}
+
 /**
  * xe_userptr_populate_range() - Populate physical pages of a virtual
  * address range
@@ -156,6 +183,9 @@ int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
 	if (notifier_seq == userptr->notifier_seq)
 		return 0;
 
+	if (userptr->sg)
+		xe_userptr_free_sg(uvma);
+
 	npages = xe_npages_in_range(start, end);
 	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
 	if (unlikely(!pfns))
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
index 91686a751711..7bb49bbde5a4 100644
--- a/drivers/gpu/drm/xe/xe_hmm.h
+++ b/drivers/gpu/drm/xe/xe_hmm.h
@@ -15,3 +15,4 @@ static inline int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
 	return -ENODEV;
 }
 #endif
+void xe_userptr_free_sg(struct xe_userptr_vma *uvma);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (9 preceding siblings ...)
  2024-04-09 20:17 ` [v2 10/31] drm/xe: Introduce a helper to free sg table Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

This is an effort to unify hmmptr (aka system allocator)
and userptr code. hmm_range_fault is used to populate
a virtual address range for both hmmptr and userptr,
instead of hmmptr using hmm_range_fault and userptr
using get_user_pages_fast.

This also aligns with the AMD gpu driver's behavior. In
the long term, we plan to move some common helpers in this
area into the drm layer so they can be re-used by different
vendors.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 122 ++++---------------------------------
 1 file changed, 12 insertions(+), 110 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 95dda229a9fe..61d336f24a65 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -39,6 +39,7 @@
 #include "xe_sync.h"
 #include "xe_trace.h"
 #include "xe_wa.h"
+#include "xe_hmm.h"
 
 static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
 {
@@ -66,113 +67,21 @@ int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma)
 
 int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 {
-	struct xe_userptr *userptr = &uvma->userptr;
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct xe_device *xe = vm->xe;
-	const unsigned long num_pages = xe_vma_size(vma) >> PAGE_SHIFT;
-	struct page **pages;
-	bool in_kthread = !current->mm;
-	unsigned long notifier_seq;
-	int pinned, ret, i;
-	bool read_only = xe_vma_read_only(vma);
+	struct xe_userptr *userptr;
+	int ret;
 
 	lockdep_assert_held(&vm->lock);
 	xe_assert(xe, xe_vma_is_userptr(vma));
-retry:
-	if (vma->gpuva.flags & XE_VMA_DESTROYED)
-		return 0;
-
-	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
-	if (notifier_seq == userptr->notifier_seq)
-		return 0;
-
-	pages = kvmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
-
-	if (userptr->sg) {
-		dma_unmap_sgtable(xe->drm.dev,
-				  userptr->sg,
-				  read_only ? DMA_TO_DEVICE :
-				  DMA_BIDIRECTIONAL, 0);
-		sg_free_table(userptr->sg);
-		userptr->sg = NULL;
-	}
-
-	pinned = ret = 0;
-	if (in_kthread) {
-		if (!mmget_not_zero(userptr->notifier.mm)) {
-			ret = -EFAULT;
-			goto mm_closed;
-		}
-		kthread_use_mm(userptr->notifier.mm);
-	}
-
-	while (pinned < num_pages) {
-		ret = get_user_pages_fast(xe_vma_userptr(vma) +
-					  pinned * PAGE_SIZE,
-					  num_pages - pinned,
-					  read_only ? 0 : FOLL_WRITE,
-					  &pages[pinned]);
-		if (ret < 0)
-			break;
-
-		pinned += ret;
-		ret = 0;
-	}
 
-	if (in_kthread) {
-		kthread_unuse_mm(userptr->notifier.mm);
-		mmput(userptr->notifier.mm);
-	}
-mm_closed:
-	if (ret)
-		goto out;
-
-	ret = sg_alloc_table_from_pages_segment(&userptr->sgt, pages,
-						pinned, 0,
-						(u64)pinned << PAGE_SHIFT,
-						xe_sg_segment_size(xe->drm.dev),
-						GFP_KERNEL);
-	if (ret) {
-		userptr->sg = NULL;
-		goto out;
-	}
-	userptr->sg = &userptr->sgt;
-
-	ret = dma_map_sgtable(xe->drm.dev, userptr->sg,
-			      read_only ? DMA_TO_DEVICE :
-			      DMA_BIDIRECTIONAL,
-			      DMA_ATTR_SKIP_CPU_SYNC |
-			      DMA_ATTR_NO_KERNEL_MAPPING);
-	if (ret) {
-		sg_free_table(userptr->sg);
-		userptr->sg = NULL;
-		goto out;
-	}
-
-	for (i = 0; i < pinned; ++i) {
-		if (!read_only) {
-			lock_page(pages[i]);
-			set_page_dirty(pages[i]);
-			unlock_page(pages[i]);
-		}
+	userptr = &uvma->userptr;
+	mmap_read_lock(userptr->notifier.mm);
+	ret = xe_userptr_populate_range(uvma);
+	mmap_read_unlock(userptr->notifier.mm);
 
-		mark_page_accessed(pages[i]);
-	}
-
-out:
-	release_pages(pages, pinned);
-	kvfree(pages);
-
-	if (!(ret < 0)) {
-		userptr->notifier_seq = notifier_seq;
-		if (xe_vma_userptr_check_repin(uvma) == -EAGAIN)
-			goto retry;
-	}
-
-	return ret < 0 ? ret : 0;
+	return ret;
 }
 
 static bool preempt_fences_waiting(struct xe_vm *vm)
@@ -1016,8 +925,6 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 static void xe_vma_destroy_late(struct xe_vma *vma)
 {
 	struct xe_vm *vm = xe_vma_vm(vma);
-	struct xe_device *xe = vm->xe;
-	bool read_only = xe_vma_read_only(vma);
 
 	if (vma->ufence) {
 		xe_sync_ufence_put(vma->ufence);
@@ -1025,16 +932,11 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 	}
 
 	if (xe_vma_is_userptr(vma)) {
-		struct xe_userptr *userptr = &to_userptr_vma(vma)->userptr;
+		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		struct xe_userptr *userptr = &uvma->userptr;
 
-		if (userptr->sg) {
-			dma_unmap_sgtable(xe->drm.dev,
-					  userptr->sg,
-					  read_only ? DMA_TO_DEVICE :
-					  DMA_BIDIRECTIONAL, 0);
-			sg_free_table(userptr->sg);
-			userptr->sg = NULL;
-		}
+		if (userptr->sg)
+			xe_userptr_free_sg(uvma);
 
 		/*
 		 * Since userptr pages are not pinned, we can't remove
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (10 preceding siblings ...)
  2024-04-09 20:17 ` [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:09   ` Matthew Brost
  2024-04-16 19:01   ` Matthew Brost
  2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
                   ` (19 subsequent siblings)
  31 siblings, 2 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Remap GPU vram using devm_memremap_pages, so each GPU vram
page is backed by a struct page.

Those struct pages are created to allow hmm to migrate buffers b/t
GPU vram and CPU system memory using the existing Linux migration
mechanism (i.e., the one used to migrate b/t CPU system memory and
hard disk).

This is preparation work to enable svm (shared virtual memory) through
the Linux kernel hmm framework. The memory remap's page map type is set
to MEMORY_DEVICE_PRIVATE for now. This means that even though each GPU
vram page gets a struct page and can be mapped in the CPU page table,
such pages are treated as the GPU's private resource, so the CPU can't
access them. If the CPU accesses such a page, a page fault is triggered
and the page will be migrated to system memory.

For GPU devices which support a coherent memory protocol b/t CPU and
GPU (such as CXL and CAPI), we can remap device memory as
MEMORY_DEVICE_COHERENT. This is TBD.
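
Purely as an illustration of what the struct page backing enables (not part
of this patch), once a vram range is remapped with devm_memremap_pages(),
core mm helpers can be applied to its pages, e.g.:

	/* Sketch: a remapped vram pfn now has a struct page behind it */
	struct page *page = pfn_to_page(pfn);

	/*
	 * True for pages remapped as MEMORY_DEVICE_PRIVATE; a CPU access to
	 * such a page faults and invokes the pagemap's .migrate_to_ram op.
	 */
	bool is_vram = is_device_private_page(page);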

v1:
Changes per code review feedback from Matt:
    change .o order in Makefile
    fix indentation
    change code order in mmio_fini
    remove unnecessary header file
    uniform xe_svm_devm_add/_remove parameter
    use tile (vs dev) as pagemap.owner during memremap
    only remap vram for platform that support usm
Changes per review feedback from Brian:
    s/xe_svm_devm_add/xe_devm_add
    s/xe_svm_devm_remove/xe_devm_remove
    move calling of xe_devm_add to xe_tile.c

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Makefile          |  1 +
 drivers/gpu/drm/xe/xe_device_types.h |  8 +++
 drivers/gpu/drm/xe/xe_mmio.c         |  6 ++
 drivers/gpu/drm/xe/xe_svm.h          | 15 +++++
 drivers/gpu/drm/xe/xe_svm_devmem.c   | 89 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_tile.c         |  4 ++
 6 files changed, 123 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index fff70fc9a09e..cd5213ba182b 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -129,6 +129,7 @@ xe-y += xe_bb.o \
 	xe_sa.o \
 	xe_sched_job.o \
 	xe_step.o \
+	xe_svm_devmem.o \
 	xe_sync.o \
 	xe_tile.o \
 	xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index e73b9a086718..d6a14327986b 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -103,6 +103,14 @@ struct xe_mem_region {
 	resource_size_t actual_physical_size;
 	/** @mapping: pointer to VRAM mappable space */
 	void __iomem *mapping;
+	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
+	struct dev_pagemap pagemap;
+	/**
+	 * @hpa_base: base host physical address
+	 *
+	 * This is generated when remap device memory as ZONE_DEVICE
+	 */
+	resource_size_t hpa_base;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
index 7ba2477452d7..12923fe6abae 100644
--- a/drivers/gpu/drm/xe/xe_mmio.c
+++ b/drivers/gpu/drm/xe/xe_mmio.c
@@ -22,6 +22,7 @@
 #include "xe_module.h"
 #include "xe_sriov.h"
 #include "xe_tile.h"
+#include "xe_svm.h"
 
 #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
 #define TILE_COUNT		REG_GENMASK(15, 8)
@@ -354,6 +355,11 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
 static void mmio_fini(struct drm_device *drm, void *arg)
 {
 	struct xe_device *xe = arg;
+	struct xe_tile *tile;
+	u8 id;
+
+	for_each_tile(tile, xe, id)
+		xe_devm_remove(tile, &tile->mem.vram);
 
 	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
 	if (xe->mem.vram.mapping)
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..e944971cfc6d
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef __XE_SVM_H
+#define __XE_SVM_H
+
+struct xe_tile;
+struct xe_mem_region;
+
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
+void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
new file mode 100644
index 000000000000..31af56e8285a
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mm_types.h>
+#include <linux/sched/mm.h>
+
+#include "xe_device_types.h"
+#include "xe_svm.h"
+
+
+static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
+{
+	return 0;
+}
+
+static void xe_devm_page_free(struct page *page)
+{
+}
+
+static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
+	.page_free = xe_devm_page_free,
+	.migrate_to_ram = xe_devm_migrate_to_ram,
+};
+
+/**
+ * xe_devm_add() - Remap and provide memmap backing for device memory
+ * @tile: tile that the memory region belongs to
+ * @mr: memory region to remap
+ *
+ * This remaps device memory to the host physical address space and creates
+ * struct pages to back the device memory
+ *
+ * Return: 0 on success, standard error code otherwise
+ */
+int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+	struct xe_device *xe = tile_to_xe(tile);
+	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
+	struct resource *res;
+	void *addr;
+	int ret;
+
+	res = devm_request_free_mem_region(dev, &iomem_resource,
+					   mr->usable_size);
+	if (IS_ERR(res)) {
+		ret = PTR_ERR(res);
+		return ret;
+	}
+
+	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	mr->pagemap.range.start = res->start;
+	mr->pagemap.range.end = res->end;
+	mr->pagemap.nr_range = 1;
+	mr->pagemap.ops = &xe_devm_pagemap_ops;
+	mr->pagemap.owner = xe;
+	addr = devm_memremap_pages(dev, &mr->pagemap);
+	if (IS_ERR(addr)) {
+		devm_release_mem_region(dev, res->start, resource_size(res));
+		ret = PTR_ERR(addr);
+		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
+				tile->id, ret);
+		return ret;
+	}
+	mr->hpa_base = res->start;
+
+	drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
+			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
+	return 0;
+}
+
+/**
+ * xe_devm_remove() - Unmap device memory and free resources
+ * @tile: xe tile
+ * @mr: memory region to remove
+ */
+void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
+
+	/* FIXME: Does the below cause a kernel hang during module remove? */
+	if (mr->hpa_base) {
+		devm_memunmap_pages(dev, &mr->pagemap);
+		devm_release_mem_region(dev, mr->pagemap.range.start,
+			mr->pagemap.range.end - mr->pagemap.range.start + 1);
+	}
+}
+
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index 0650b2fa75ef..f1c4f9de51df 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -14,6 +14,7 @@
 #include "xe_tile_sysfs.h"
 #include "xe_ttm_vram_mgr.h"
 #include "xe_wa.h"
+#include "xe_svm.h"
 
 /**
  * DOC: Multi-tile Design
@@ -158,6 +159,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
  */
 int xe_tile_init_noalloc(struct xe_tile *tile)
 {
+	struct xe_device *xe = tile_to_xe(tile);
 	int err;
 
 	xe_device_mem_access_get(tile_to_xe(tile));
@@ -175,6 +177,8 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
 
 	xe_tile_sysfs_init(tile);
 
+	if (xe->info.has_usm)
+		xe_devm_add(tile, &tile->mem.vram);
 err_mem_access:
 	xe_device_mem_access_put(tile_to_xe(tile));
 	return err;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (11 preceding siblings ...)
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:13   ` Matthew Brost
  2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
                   ` (18 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a DRM_XE_SVM kernel config entry for
the xe svm feature. The xe svm feature allows sharing
the virtual address space between CPU and GPU programs.
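
For example, the feature can then be toggled in the kernel configuration as
below (an illustrative fragment; the new entry defaults to y):

	CONFIG_DRM_XE=m
	CONFIG_DRM_XE_SVM=y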

v1: Improve commit message (Thomas)
    Avoid using #if directive (Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Kconfig   | 21 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_tile.c |  7 +++++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
index 449a1ecbc92a..0accb2cb81d6 100644
--- a/drivers/gpu/drm/xe/Kconfig
+++ b/drivers/gpu/drm/xe/Kconfig
@@ -84,6 +84,27 @@ config DRM_XE_FORCE_PROBE
 	  4571.
 
 	  Use "!*" to block the probe of the driver for all known devices.
+config DRM_XE_SVM
+	bool "Enable Shared Virtual Memory support in xe"
+	depends on DRM_XE
+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+	depends on ARCH_ENABLE_MEMORY_HOTREMOVE
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on ARCH_HAS_PTE_DEVMAP
+	depends on SPARSEMEM_VMEMMAP
+	depends on ZONE_DEVICE
+	depends on DEVICE_PRIVATE
+	depends on MMU
+	select HMM_MIRROR
+	select MMU_NOTIFIER
+	default y
+	help
+	  Choose this option if you want Shared Virtual Memory (SVM)
+	  support in xe. With SVM, virtual address space is shared
+	  between CPU and GPU. This means any virtual address, such
+	  as one returned by malloc or mmap, variables on the stack, or
+	  global memory pointers, can be used by the GPU transparently.
 
 menu "drm/Xe Debugging"
 depends on DRM_XE
diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
index f1c4f9de51df..a1a436912fe3 100644
--- a/drivers/gpu/drm/xe/xe_tile.c
+++ b/drivers/gpu/drm/xe/xe_tile.c
@@ -159,9 +159,12 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
  */
 int xe_tile_init_noalloc(struct xe_tile *tile)
 {
-	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_device __maybe_unused *xe;
 	int err;
 
+	if (IS_ENABLED(CONFIG_DRM_XE_SVM))
+		xe = tile_to_xe(tile);
+
 	xe_device_mem_access_get(tile_to_xe(tile));
 
 	err = tile_ttm_mgr_init(tile);
@@ -177,7 +180,7 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
 
 	xe_tile_sysfs_init(tile);
 
-	if (xe->info.has_usm)
+	if (IS_ENABLED(CONFIG_DRM_XE_SVM) && xe->info.has_usm)
 		xe_devm_add(tile, &tile->mem.vram);
 err_mem_access:
 	xe_device_mem_access_put(tile_to_xe(tile));
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 14/31] drm/xe: Introduce helper to get tile from memory region
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (12 preceding siblings ...)
  2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:17   ` Matthew Brost
  2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
                   ` (17 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a simple helper to retrieve the tile from a memory region.

v1: move the function to xe_device.h (Matt)
    improve commit message, add kerneldoc (Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_device.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 74eb9833d4d8..68082357aebd 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -178,4 +178,12 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
 
 void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred);
 
+/**
+ * xe_mem_region_to_tile() - retrieve tile from memory region
+ * @mr: the memory region we retrieve tile from
+ */
+static inline struct xe_tile *xe_mem_region_to_tile(struct xe_mem_region *mr)
+{
+	return container_of(mr, struct xe_tile, mem.vram);
+}
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (13 preceding siblings ...)
  2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:35   ` Matthew Brost
  2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
                   ` (16 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Since we now create struct page backing for each vram page,
each vram page now also has a pfn, just like system memory.
This allows us to calculate the device physical address from the pfn.
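
As a worked example (with made-up numbers): if the vram region was remapped
at hpa_base = 0x200000000 and its device physical base is dpa_base = 0, then
a vram page with pfn = 0x200040 translates to

	dpa = dpa_base + ((pfn << PAGE_SHIFT) - hpa_base)
	    = 0 + (0x200040000 - 0x200000000)
	    = 0x40000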

v1: move the function to xe_svm.h (Matt)
    s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa (Matt)
    add kernel document for the helper (Thomas)

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index e944971cfc6d..8a34429eb674 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -6,8 +6,31 @@
 #ifndef __XE_SVM_H
 #define __XE_SVM_H
 
-struct xe_tile;
-struct xe_mem_region;
+#include "xe_device_types.h"
+#include "xe_device.h"
+#include "xe_assert.h"
+
+/**
+ * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
+ *
+ * @mr: The memory region that page resides in
+ * @pfn: page frame number of the page
+ *
+ * Returns: the device physical address of the page
+ */
+static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
+{
+	u64 dpa;
+	struct xe_tile *tile = xe_mem_region_to_tile(mr);
+	struct xe_device *xe = tile_to_xe(tile);
+	u64 offset;
+
+	xe_assert(xe, (pfn << PAGE_SHIFT) >= mr->hpa_base);
+	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+	dpa = mr->dpa_base + offset;
+
+	return dpa;
+}
 
 int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
 void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 16/31] drm/xe/svm: Get xe memory region from page
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (14 preceding siblings ...)
  2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:38   ` Matthew Brost
  2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
                   ` (15 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

For a gpu vram page, we now have a struct page backing
it. The struct page's pgmap points to the xe_mem_region's
pagemap. This allows us to retrieve the xe_mem_region
from a struct page.
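
As an illustration only (not taken from this series), this pairs with the
dpa helper from the previous patch roughly as follows:

	/* Sketch: from a vram struct page back to its device physical address */
	struct xe_mem_region *mr = xe_page_to_mem_region(page);
	u64 dpa = xe_mem_region_pfn_to_dpa(mr, page_to_pfn(page));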

v1: move the function to xe_svm.h

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 8a34429eb674..624c1581f8ba 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -6,6 +6,7 @@
 #ifndef __XE_SVM_H
 #define __XE_SVM_H
 
+#include <linux/mm_types.h>
 #include "xe_device_types.h"
 #include "xe_device.h"
 #include "xe_assert.h"
@@ -35,4 +36,14 @@ static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
 int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
 void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
 
+/**
+ * xe_page_to_mem_region() - Get a page's memory region
+ *
+ * @page: a struct page pointer pointing to a page in vram memory region
+ */
+static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
+{
+	return container_of(page->pgmap, struct xe_mem_region, pagemap);
+}
+
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 17/31] drm/xe: Get xe_vma from xe_userptr
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (15 preceding siblings ...)
  2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:42   ` Matthew Brost
  2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
                   ` (14 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a helper to get xe_vma from xe_userptr.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 0b2790f697db..4860747592ad 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -178,6 +178,20 @@ static inline struct xe_userptr_vma *to_userptr_vma(struct xe_vma *vma)
 	return container_of(vma, struct xe_userptr_vma, vma);
 }
 
+/**
+ * xe_userptr_to_vma() - Return xe_vma from a xe_userptr pointer
+ *
+ * @userptr: The userptr struct pointer
+ */
+
+static inline struct xe_vma *xe_userptr_to_vma(struct xe_userptr *userptr)
+{
+	struct xe_userptr_vma *uvma;
+
+	uvma = container_of(userptr, struct xe_userptr_vma, userptr);
+	return &uvma->vma;
+}
+
 u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile);
 
 int xe_vm_create_ioctl(struct drm_device *dev, void *data,
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 18/31] drm/xe/svm: Build userptr sg table for device pages
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (16 preceding siblings ...)
  2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:52   ` Matthew Brost
  2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
                   ` (13 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Previously the function xe_build_sg only supported userptrs backed
by system memory pages. Now this function is extended to also
support userptrs backed by device pages.

For device pages, there is no need for dma-mapping. Instead, we
calculate the device page's dpa (device physical address) and
use the dpa to fill the sg table.

As of now, we assume each userptr is backed either entirely by
system memory pages or entirely by device pages. There is no support
for a mixture of device and system memory pages backing one userptr.
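
For illustration, the device-page path collapses physically contiguous dpa
ranges into single sg entries; with made-up addresses and 4K pages:

	pfn[0] -> dpa 0x40000, pfn[1] -> dpa 0x41000, pfn[2] -> dpa 0x42000
		=> one sg entry: sg_dma_address = 0x40000, sg_dma_len = 0x3000
	pfn[3] -> dpa 0x90000 (not contiguous with the previous entry)
		=> new sg entry: sg_dma_address = 0x90000, sg_dma_len = 0x1000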

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_hmm.c      | 121 +++++++++++++++++++++++++------
 drivers/gpu/drm/xe/xe_vm_types.h |   2 +
 2 files changed, 100 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index 427c6bc49949..a261c1dd2060 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -11,6 +11,7 @@
 #include <linux/hmm.h>
 #include <linux/mm.h>
 #include "xe_hmm.h"
+#include "xe_svm.h"
 #include "xe_vm.h"
 #include "xe_bo.h"
 
@@ -43,15 +44,90 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
 	}
 }
 
+/**
+ * xe_build_sg_device_pages() - build sg table for userptr when the backing store
+ * is device pages
+ *
+ * @st: sg table to build
+ * @hmm_pfns: pfn array of the userptr
+ * @pages: struct page array of this userptr
+ * @npages: how many pages in this userptr
+ */
+static int xe_build_sg_device_pages(struct sg_table *st, unsigned long *hmm_pfns,
+						struct page **pages, uint64_t npages)
+{
+	struct scatterlist *sg;
+	int i;
+
+	sg = NULL;
+	st->nents = 0;
+	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
+		return -ENOMEM;
+
+	for (i = 0; i < npages; i++) {
+		unsigned long addr;
+		struct xe_mem_region *mr;
+
+		mr = xe_page_to_mem_region(pages[i]);
+		addr = xe_mem_region_pfn_to_dpa(mr, hmm_pfns[i]);
+		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
+			sg->length += PAGE_SIZE;
+			sg_dma_len(sg) += PAGE_SIZE;
+			continue;
+		}
+
+		sg =  sg ? sg_next(sg) : st->sgl;
+		sg_dma_address(sg) = addr;
+		sg_dma_len(sg) = PAGE_SIZE;
+		sg->length = PAGE_SIZE;
+		st->nents++;
+	}
+
+	sg_mark_end(sg);
+	return 0;
+}
+
+/**
+ * xe_validate_hmm_pfns() - validate all pages in a userptr belong to one memory
+ * region, and populate the pages array.
+ *
+ * @userptr: The userptr to validate
+ * @hmm_pfns: an array holding hmm pfns
+ * @npages: number of pages of this userptr
+ * @pages: output parameter to hold the populated pages from pfn.
+ */
+static void xe_validate_hmm_pfns(struct xe_userptr *userptr, unsigned long *hmm_pfns,
+						uint64_t npages, struct page **pages)
+{
+	int i;
+	struct xe_vma *vma = xe_userptr_to_vma(userptr);
+	struct xe_vm *vm = xe_vma_vm(vma);
+
+	pages[0] = hmm_pfn_to_page(hmm_pfns[0]);
+	userptr->is_device_pages = is_device_private_page(pages[0]);
+	for (i = 1; i < npages; i++) {
+		pages[i] = hmm_pfn_to_page(hmm_pfns[i]);
+		/*
+		 * We currently assume no mixture of device pages and system memory
+		 * pages in one userptr. If it turns out this is not true, we will
+		 * either split the userptr into device-page-based and system-memory-
+		 * based parts, or support a mixed backing store in one userptr.
+		 */
+		xe_assert(vm->xe,
+			userptr->is_device_pages == is_device_private_page(pages[i]));
+	}
+}
+
+
 /**
  * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
  * in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
  * and will be used to program GPU page table later.
  *
  * @xe: the xe device who will access the dma-address in sg table
+ * @userptr: the userptr that we build the sg table for
  * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
  * has the pfn numbers of pages that back up this hmm address range.
- * @st: pointer to the sg table.
  * @write: whether we write to this range. This decides dma map direction
  * for system pages. If write we map it bi-diretional; otherwise
  * DMA_TO_DEVICE
@@ -64,11 +140,6 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
  * access memory. So if the memory is system memory, we need to
  * do a dma-mapping so it can be accessed by GPU/DMA.
  *
- * FIXME: This function currently only support pages in system
- * memory. If the memory is GPU local memory (of the GPU who
- * is going to access memory), we need gpu dpa (device physical
- * address), and there is no need of dma-mapping. This is TBD.
- *
  * FIXME: dma-mapping for peer gpu device to access remote gpu's
  * memory. Add this when you support p2p
  *
@@ -77,12 +148,13 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
  *
  * Returns 0 if successful; -ENOMEM if fails to allocate memory
  */
-static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
-			     struct sg_table *st, bool write)
+static int xe_build_sg(struct xe_device *xe, struct xe_userptr *userptr,
+					struct hmm_range *range, bool write)
 {
+	struct sg_table *st = &userptr->sgt;
 	struct device *dev = xe->drm.dev;
 	struct page **pages;
-	u64 i, npages;
+	u64 npages;
 	int ret;
 
 	npages = xe_npages_in_range(range->start, range->end);
@@ -90,19 +162,22 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
 	if (!pages)
 		return -ENOMEM;
 
-	for (i = 0; i < npages; i++) {
-		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
-		xe_assert(xe, !is_device_private_page(pages[i]));
-	}
-
-	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
-			npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
-	if (ret)
-		goto free_pages;
+	xe_validate_hmm_pfns(userptr, range->hmm_pfns, npages, pages);
+	if (!userptr->is_device_pages) {
+		ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
+				npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
+		if (ret)
+			goto free_pages;
 
-	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
-			DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+		ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
+				DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+	} else {
+		ret = xe_build_sg_device_pages(st, range->hmm_pfns, pages, npages);
+		if (ret)
+			goto free_pages;
+	}
 
+	userptr->sg = st;
 free_pages:
 	kvfree(pages);
 	return ret;
@@ -127,7 +202,8 @@ void xe_userptr_free_sg(struct xe_userptr_vma *uvma)
 	struct device *dev = xe->drm.dev;
 
 	xe_assert(xe, userptr->sg);
-	dma_unmap_sgtable(dev, userptr->sg,
+	if (!userptr->is_device_pages)
+		dma_unmap_sgtable(dev, userptr->sg,
 			write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
 
 	sg_free_table(userptr->sg);
@@ -239,12 +315,11 @@ int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
 	if (ret)
 		goto free_pfns;
 
-	ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
+	ret = xe_build_sg(vm->xe, userptr, &hmm_range, write);
 	if (ret)
 		goto free_pfns;
 
 	xe_mark_range_accessed(&hmm_range, write);
-	userptr->sg = &userptr->sgt;
 	userptr->notifier_seq = hmm_range.notifier_seq;
 
 free_pfns:
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index fbf6bfcf59a8..3b4debfecc9b 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -64,6 +64,8 @@ struct xe_userptr {
 	struct sg_table *sg;
 	/** @notifier_seq: notifier sequence number */
 	unsigned long notifier_seq;
+	/** @is_device_pages: the backing store is in device memory*/
+	bool is_device_pages;
 	/**
 	 * @initial_bind: user pointer has been bound at least once.
 	 * write: vm->userptr.notifier_lock in read mode and vm->resv held.
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (17 preceding siblings ...)
  2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 21:56   ` Matthew Brost
  2024-04-09 20:17 ` [v2 20/31] drm/xe: add xe lock document Oak Zeng
                   ` (12 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

With the system allocator, a userptr can now also be backed by
device memory. Introduce a helper function, xe_vma_is_devmem,
to determine whether a vma is backed by device memory.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_pt.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 846e896edcb5..525092111be9 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -577,6 +577,17 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
 	.pt_entry = xe_pt_stage_bind_entry,
 };
 
+static bool xe_vma_is_devmem(struct xe_vma *vma)
+{
+	if (xe_vma_is_userptr(vma)) {
+		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		return uvma->userptr.is_device_pages;
+	} else {
+		struct xe_bo *bo = xe_vma_bo(vma);
+		return bo && (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	}
+}
+
 /**
  * xe_pt_stage_bind() - Build a disconnected page-table tree for a given address
  * range.
@@ -601,8 +612,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
 {
 	struct xe_device *xe = tile_to_xe(tile);
 	struct xe_bo *bo = xe_vma_bo(vma);
-	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
-		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
+	bool is_devmem = xe_vma_is_devmem(vma);
 	struct xe_res_cursor curs;
 	struct xe_pt_stage_bind_walk xe_walk = {
 		.base = {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 20/31] drm/xe: add xe lock document
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (18 preceding siblings ...)
  2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

This is not intended to be complete documentation of xe locks. It
only documents some key locks used in the xe driver and gives an
example to illustrate lock usage.

This is just a start. We should eventually refine this document.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 Documentation/gpu/xe/index.rst   |   1 +
 Documentation/gpu/xe/xe_lock.rst |   8 +++
 drivers/gpu/drm/xe/xe_lock_doc.h | 113 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm_types.h |   2 +-
 4 files changed, 123 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/gpu/xe/xe_lock.rst
 create mode 100644 drivers/gpu/drm/xe/xe_lock_doc.h

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index 106b60aba1f0..6ae2c8e7bbb4 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -24,3 +24,4 @@ DG2, etc is provided to prototype the driver.
    xe_tile
    xe_debugging
    xe_svm
+   xe_lock
diff --git a/Documentation/gpu/xe/xe_lock.rst b/Documentation/gpu/xe/xe_lock.rst
new file mode 100644
index 000000000000..24e4c2e7c5d1
--- /dev/null
+++ b/Documentation/gpu/xe/xe_lock.rst
@@ -0,0 +1,8 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+==============
+xe lock design
+==============
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_lock_doc.h
+   :doc: xe lock design
diff --git a/drivers/gpu/drm/xe/xe_lock_doc.h b/drivers/gpu/drm/xe/xe_lock_doc.h
new file mode 100644
index 000000000000..0fab623ce056
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_lock_doc.h
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#ifndef _XE_LOCK_DOC_H_
+#define _XE_LOCK_DOC_H_
+
+/**
+ * DOC: xe lock design
+ *
+ * Locks used in xekmd are complicated. This document tries to document the
+ * very fundamentals, such as the key locks used, their purpose and the
+ * order of locking if you need to hold multiple locks.
+ *
+ * Locks used in xekmd
+ * ===================
+ * 1. xe_vm::lock
+ * xe_vm::lock is used mainly to protect data in xe_vm struct, more specifically
+ * this includes below:
+ *
+ * 1) vm::rebind_list
+ * 2) vm::flags, only XE_VM_FLAG_BANNED bit
+ * 3) vma::tile_present
+ * 4) userptr::repin_list
+ * 5) userptr::invalidated list
+ * 6) vm::preempt::exec_queue
+ * 7) drm_gpuvm::rb list and tree
+ * 8) vm::size
+ * 9) vm::q[]->last_fence, only if q->flags' EXEC_QUEUE_FLAG_VM is set,
+ *    see xe_exec_queue_last_fence_lockdep_assert
+ * 10) a contested list during vm close. see xe_vm_close_and_put
+ *
+ * 2. mm mmap_lock
+ * mm's mmap_lock is used to protect mm's memory mapping such as CPU page
+ * tables. Linux core mm holds this lock whenever it needs to change a
+ * process's address space mapping, for example, during a user munmap call.
+ *
+ * xe holds mmap_lock when it needs to walk the CPU page table, such as when
+ * it calls hmm_range_fault to populate CPU page tables.
+ *
+ * 3. xe_vm's dma-resv
+ * xe_vm's dma reservation object is used to protect GPU page table updates.
+ * For BO type vmas, dma resv is enough for page table updates. For userptr
+ * and hmmptr, besides dma resv, we need an extra notifier_lock to avoid
+ * page table updates colliding with userptr invalidation. See below.
+ *
+ * 4. xe_vm::userptr::notifier_lock
+ * notifier_lock is used to protect userptr/hmmptr GPU page table updates,
+ * to avoid an update collision with userptr invalidation. So notifier_lock
+ * is required in the userptr invalidate callback function. Notifier_lock
+ * is the "user_lock" in the documentation of mmu_interval_read_begin().
+ *
+ * Lock order
+ * ==========
+ * Acquiring locks in the same order can avoid deadlocks. The locking
+ * order of the above locks is:
+ *
+ * xe_vm::lock => mmap_lock => xe_vm::dma-resv => notifier_lock
+ *
+ *
+ * Use case, pseudo code
+ * =====================
+ *
+ * Below is pseudo code of hmmptr's gpu page fault handler:
+ *
+ * get gpu vm from page fault asid
+ * Down_write(vm->lock)
+ * walk vma tree, get vma of fault address
+ *
+ * Again:
+ * Mmap_read_lock
+ * do page migration for vma if needed
+ * vma->userptr.notifier_seq = mmu_interval_read_begin(&vma->userptr.notifier)
+ * call hmm_range_fault to retrieve vma's pfns/pages
+ * Mmap_read_unlock
+ *
+ * xe_vm_lock(vm)
+ * down_read(&vm->userptr.notifier_lock);
+ * if (mmu_interval_read_retry()) {
+ *     up_read(&vm->userptr.notifier_lock);
+ *     goto Again; //collision happened with userptr invalidation, retry
+ * }
+ *
+ * xe_vm_populate_pgtable or submit gpu job to update page table
+ * up_read(&vm->userptr.notifier_lock);
+ *
+ * xe_vm_unlock(vm)
+ * Up_write(vm->lock)
+ *
+ * In the above code, we first hold vm->lock so we can walk the vm's vma tree
+ * to get the vma of the fault address.
+ *
+ * Then we do page migration if needed. Page migration is not needed for
+ * userptr but might be needed for hmmptr. After migration, we populate
+ * the pfns of the vma. Since this requires walking the CPU page table, we
+ * hold the mmap_lock in this step.
+ *
+ * After that, the remaining work is to update GPU page table with the
+ * pfns/pages populated above. Since we use vm's dma-resv object to protect
+ * gpu page table update, we need to hold vm's dma-resv in this step.
+ *
+ * Since we don't hold the mmap_lock during GPU page table update, the user
+ * might perform munmap simultaneously, which can cause userptr invalidation.
+ * If such a collision happens, we will retry.
+ *
+ * notifier_lock is held in both the mmu notifier callback (not listed above)
+ * and GPU page table update.
+ *
+ */
+#endif
+
+
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 3b4debfecc9b..d1f5949d4a3b 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -271,7 +271,7 @@ struct xe_vm {
 
 	/**
 	 * @lock: outer most lock, protects objects of anything attached to this
-	 * VM
+	 * VM. See more details in xe_lock_doc.h
 	 */
 	struct rw_semaphore lock;
 	/**
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 21/31] drm/xe/svm: Introduce svm migration function
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (19 preceding siblings ...)
  2024-04-09 20:17 ` [v2 20/31] drm/xe: add xe lock document Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:06   ` Matthew Brost
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
                   ` (10 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce the xe_migrate_pa function for data migration.
This function is similar to the xe_migrate_copy function
but has different parameters. Instead of BO and ttm
resource parameters, it takes the source and destination
buffers' physical addresses as parameters. This function is
intended to be used by the svm sub-system, which doesn't
have the BO and TTM concepts.
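
For reference, a minimal usage sketch (not part of this patch; it
mirrors what the later svm migration patches do, with src_dma and
dst_dpa as assumed locals): copy one dma-mapped system page into a
vram device physical address and wait for the blit to finish:

	struct dma_fence *fence;

	/* src_dma: dma-mapped system page, dst_dpa: vram device physical address */
	fence = xe_migrate_pa(tile->migrate, src_dma, false, dst_dpa, true, PAGE_SIZE);
	if (IS_ERR(fence))
		return PTR_ERR(fence);

	dma_fence_wait(fence, false);
	dma_fence_put(fence);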

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_migrate.c | 217 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_migrate.h |   7 ++
 2 files changed, 224 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 82b63bdb9c47..f1d53911253b 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -462,6 +462,37 @@ static bool xe_migrate_allow_identity(u64 size, const struct xe_res_cursor *cur)
 	return cur->size >= size;
 }
 
+/**
+ * pte_update_cmd_size() - calculate the batch buffer command size
+ * to update a flat page table.
+ *
+ * @size: The virtual address range size of the page table to update
+ *
+ * The page table to update is supposed to be a flat 1 level page
+ * table with all entries pointing to 4k pages.
+ *
+ * Return the number of dwords of the update command
+ */
+static u32 pte_update_cmd_size(u64 size)
+{
+	u32 dword;
+	u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+
+	XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
+	/*
+	 * MI_STORE_DATA_IMM command is used to update the page table. Each
+	 * instruction can update at most 0x1ff pte entries. To update
+	 * n (n <= 0x1ff) pte entries, we need:
+	 * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc)
+	 * 2 dword for the page table's physical location
+	 * 2*n dword for value of pte to fill (each pte entry is 2 dwords)
+	 */
+	dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
+	dword += entries * 2;
+
+	return dword;
+}
+
 static u32 pte_update_size(struct xe_migrate *m,
 			   bool is_vram,
 			   struct ttm_resource *res,
@@ -562,6 +593,48 @@ static void emit_pte(struct xe_migrate *m,
 	}
 }
 
+/**
+ * build_pt_update_batch_sram() - build batch buffer commands to update
+ * migration vm page table for system memory
+ *
+ * @m: The migration context
+ * @bb: The batch buffer which hold the page table update commands
+ * @pt_offset: The offset of page table to update, in byte
+ * @pa: device physical address you want the page table to point to
+ * @size: size of the virtual address space you want the page table to cover
+ */
+static void build_pt_update_batch_sram(struct xe_migrate *m,
+		     struct xe_bb *bb, u32 pt_offset,
+		     u64 pa, u32 size)
+{
+	u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
+	u32 ptes;
+
+	ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
+	while (ptes) {
+		u32 chunk = min(0x1ffU, ptes);
+
+		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
+		bb->cs[bb->len++] = pt_offset;
+		bb->cs[bb->len++] = 0;
+
+		pt_offset += chunk * 8;
+		ptes -= chunk;
+
+		while (chunk--) {
+			u64 addr;
+
+			addr = pa & PAGE_MASK;
+			addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
+								 addr, pat_index,
+								 0, false, 0);
+			bb->cs[bb->len++] = lower_32_bits(addr);
+			bb->cs[bb->len++] = upper_32_bits(addr);
+			pa += XE_PAGE_SIZE;
+		}
+	}
+}
+
 #define EMIT_COPY_CCS_DW 5
 static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
 			  u64 dst_ofs, bool dst_is_indirect,
@@ -879,6 +952,150 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 	return fence;
 }
 
+/**
+ * xe_migrate_pa() - Migrate buffers with src and dst physical address
+ *
+ * @m: The migration context
+ * @src_pa: physical address of source, from GPU's point of view. This is a
+ * device physical address (dpa) when source is in vram. When source is in
+ * system memory, this is a dma mapped host physical address
+ * @src_is_vram: True if source buffer is in vram.
+ * @dst_pa: physical address of destination, from GPU's point of view. This is a
+ * device physical address (dpa) when the destination is in vram. When the
+ * destination is in system memory, this is a dma mapped host physical address
+ * @dst_is_vram: True if destination buffer is in vram.
+ * @size: The size of data to copy.
+ *
+ * Copy @size bytes of data from @src_pa to @dst_pa. The functionality
+ * and behavior of this function is similar to xe_migrate_copy function, but
+ * the interface is different. This function is a helper function supposed to
+ * be used by the SVM subsystem. Since in the SVM subsystem there is no buffer object
+ * and ttm, there is no src/dst bo as function input. Instead, we directly use
+ * src/dst's physical address as function input.
+ *
+ * Since the backing store of any user malloc'ed or mmap'ed memory can be placed in
+ * system memory, it cannot be compressed. Thus this function doesn't need
+ * to consider copying CCS (compression control surface) data as xe_migrate_copy does.
+ *
+ * This function assumes the source buffer and destination buffer are all physically
+ * contiguous.
+ *
+ * We use gpu blitter to copy data. Source and destination are first mapped to
+ * migration vm which is a flat one level (L0) page table, then blitter is used to
+ * perform the copy.
+ *
+ * Return: Pointer to a dma_fence representing the last copy batch, or
+ * an error pointer on failure. If there is a failure, any copy operation
+ * started by the function call has been synced.
+ */
+struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
+				  u64 src_pa,
+				  bool src_is_vram,
+				  u64 dst_pa,
+				  bool dst_is_vram,
+				  u64 size)
+{
+#define NUM_PT_PER_BLIT (MAX_PREEMPTDISABLE_TRANSFER / SZ_2M)
+	struct xe_gt *gt = m->tile->primary_gt;
+	struct xe_device *xe = gt_to_xe(gt);
+	struct dma_fence *fence = NULL;
+	u64 src_L0_ofs, dst_L0_ofs;
+	u64 round_update_size;
+	/* A slot is a 4K page of page table, covers 2M virtual address*/
+	u32 pt_slot;
+	int err;
+
+	while (size) {
+		u32 batch_size = 2; /* arb_clear() + MI_BATCH_BUFFER_END */
+		struct xe_sched_job *job;
+		struct xe_bb *bb;
+		u32 update_idx;
+
+		/* Copy at most MAX_PREEMPTDISABLE_TRANSFER bytes per iteration. Why? */
+		round_update_size = min_t(u64, size, MAX_PREEMPTDISABLE_TRANSFER);
+
+		/* src pte update*/
+		if (!src_is_vram)
+			batch_size += pte_update_cmd_size(round_update_size);
+		/* dst pte update*/
+		if (!dst_is_vram)
+			batch_size += pte_update_cmd_size(round_update_size);
+
+		/* Copy command size*/
+		batch_size += EMIT_COPY_DW;
+
+		bb = xe_bb_new(gt, batch_size, true);
+		if (IS_ERR(bb)) {
+			err = PTR_ERR(bb);
+			goto err_sync;
+		}
+
+		if (!src_is_vram) {
+			pt_slot = 0;
+			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+					src_pa, round_update_size);
+			src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		}
+		else
+			src_L0_ofs = xe_migrate_vram_ofs(xe, src_pa);
+
+		if (!dst_is_vram) {
+			pt_slot = NUM_PT_PER_BLIT;
+			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
+					dst_pa, round_update_size);
+			dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
+		}
+		else
+			dst_L0_ofs = xe_migrate_vram_ofs(xe, dst_pa);
+
+
+		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
+		update_idx = bb->len;
+
+		emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
+			  XE_PAGE_SIZE);
+
+		mutex_lock(&m->job_mutex);
+		job = xe_bb_create_migration_job(m->q, bb,
+						 xe_migrate_batch_base(m, true),
+						 update_idx);
+		if (IS_ERR(job)) {
+			err = PTR_ERR(job);
+			goto err;
+		}
+
+		xe_sched_job_add_migrate_flush(job, 0);
+		xe_sched_job_arm(job);
+		dma_fence_put(fence);
+		fence = dma_fence_get(&job->drm.s_fence->finished);
+		xe_sched_job_push(job);
+		dma_fence_put(m->fence);
+		m->fence = dma_fence_get(fence);
+
+		mutex_unlock(&m->job_mutex);
+
+		xe_bb_free(bb, fence);
+		size -= round_update_size;
+		src_pa += round_update_size;
+		dst_pa += round_update_size;
+		continue;
+
+err:
+		mutex_unlock(&m->job_mutex);
+		xe_bb_free(bb, NULL);
+
+err_sync:
+		/* Sync partial copy if any. FIXME: under job_mutex? */
+		if (fence) {
+			dma_fence_wait(fence, false);
+			dma_fence_put(fence);
+		}
+
+		return ERR_PTR(err);
+	}
+
+	return fence;
+}
 static void emit_clear_link_copy(struct xe_gt *gt, struct xe_bb *bb, u64 src_ofs,
 				 u32 size, u32 pitch)
 {
diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
index 701bb27349b0..98b480244265 100644
--- a/drivers/gpu/drm/xe/xe_migrate.h
+++ b/drivers/gpu/drm/xe/xe_migrate.h
@@ -101,6 +101,13 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
 				  struct ttm_resource *dst,
 				  bool copy_only_ccs);
 
+struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
+				  u64 src_pa,
+				  bool src_is_vram,
+				  u64 dst_pa,
+				  bool dst_is_vram,
+				  u64 size);
+
 struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 				   struct xe_bo *bo,
 				   struct ttm_resource *dst);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (20 preceding siblings ...)
  2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:23   ` Matthew Brost
  2024-04-17 20:55   ` Matthew Brost
  2024-04-09 20:17 ` [v2 23/31] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
                   ` (9 subsequent siblings)
  31 siblings, 2 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Function xe_devm_alloc_pages allocates pages from drm buddy and performs
housekeeping work for all the pages allocated, such as getting a page
refcount, keeping a bitmap of all pages to denote whether a page is in
use, and putting pages on a drm lru list for eviction purposes.

Function xe_devm_free_blocks returns a list of memory blocks to the drm
buddy allocator.

Function xe_devm_page_free is a callback function from the hmm layer. It
is called whenever a page's refcount reaches 1. This function clears
the bit of this page in the bitmap. If all the bits in the bitmap are
cleared, it means all the pages have been freed, and we return all the
pages in this memory block back to drm buddy.
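
A rough usage sketch of the pair (not part of this patch; error handling
is trimmed, the pfns array is assumed to be sized by the caller with
npages entries, and do_migration() is a hypothetical stand-in for the
blits that fill the allocated vram pages):

	LIST_HEAD(blocks);
	unsigned long *pfns;
	int ret;

	pfns = kvcalloc(npages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;

	ret = xe_devm_alloc_pages(tile, npages, &blocks, pfns);
	if (ret)
		goto out;

	ret = do_migration(tile, pfns, npages);	/* blit data into pfns[0..npages-1] */
	if (ret)
		/* nothing usable was migrated: return the blocks to drm buddy */
		xe_devm_free_blocks(&blocks);
out:
	kvfree(pfns);
	return ret;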

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h        |   7 ++
 drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-
 2 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 624c1581f8ba..92a3ee90d5a7 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -46,4 +46,11 @@ static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
 	return container_of(page->pgmap, struct xe_mem_region, pagemap);
 }
 
+int xe_devm_alloc_pages(struct xe_tile *tile,
+						unsigned long npages,
+						struct list_head *blocks,
+						unsigned long *pfn);
+
+void xe_devm_free_blocks(struct list_head *blocks);
+void xe_devm_page_free(struct page *page);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 31af56e8285a..5ba0cd9a70b0 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -5,18 +5,161 @@
 
 #include <linux/mm_types.h>
 #include <linux/sched/mm.h>
-
+#include <linux/gfp.h>
+#include <linux/migrate.h>
+#include <linux/dma-mapping.h>
+#include <linux/dma-fence.h>
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <drm/drm_buddy.h>
 #include "xe_device_types.h"
 #include "xe_svm.h"
+#include "xe_migrate.h"
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_assert.h"
 
+/**
+ * struct xe_svm_block_meta - svm uses this data structure to manage each
+ * block allocated from drm buddy. This will be set to the drm_buddy_block's
+ * private field.
+ *
+ * @lru: used to link this block to drm's lru lists. This will be replaced
+ * with struct drm_lru_entity later.
+ * @tile: tile from which we allocated this block
+ * @bitmap: A bitmap of each page in this block. 1 means this page is used,
+ * 0 means this page is idle. When all bits of this block are 0, it is time
+ * to return this block to drm buddy subsystem.
+ */
+struct xe_svm_block_meta {
+	struct list_head lru;
+	struct xe_tile *tile;
+	unsigned long bitmap[];
+};
 
 static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
 {
 	return 0;
 }
 
-static void xe_devm_page_free(struct page *page)
+static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
+{
+	/** DRM buddy's block offset is 0-based*/
+	offset += mr->hpa_base;
+
+	return PHYS_PFN(offset);
+}
+
+/** FIXME: we locked page by calling zone_device_page_init
+ *  in xe_devm_alloc_pages. Should we unlock pages here?
+ */
+static void free_block(struct drm_buddy_block *block)
+{
+	struct xe_svm_block_meta *meta =
+		(struct xe_svm_block_meta *)block->private;
+	struct xe_tile *tile  = meta->tile;
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+
+	kfree(block->private);
+	drm_buddy_free_block(mm, block);
+}
+
+void xe_devm_page_free(struct page *page)
+{
+	struct drm_buddy_block *block =
+					(struct drm_buddy_block *)page->zone_device_data;
+	struct xe_svm_block_meta *meta =
+					(struct xe_svm_block_meta *)block->private;
+	struct xe_tile *tile  = meta->tile;
+	struct xe_mem_region *mr = &tile->mem.vram;
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	u64 size = drm_buddy_block_size(mm, block);
+	u64 pages_per_block = size >> PAGE_SHIFT;
+	u64 block_pfn_first =
+					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+	u64 page_pfn = page_to_pfn(page);
+	u64 i = page_pfn - block_pfn_first;
+
+	xe_assert(tile->xe, i < pages_per_block);
+	clear_bit(i, meta->bitmap);
+	if (bitmap_empty(meta->bitmap, pages_per_block))
+		free_block(block);
+}
+
+/**
+ * xe_devm_alloc_pages() - allocate device pages from buddy allocator
+ *
+ * @tile: which tile to allocate device memory from
+ * @npages: how many pages to allocate
+ * @blocks: used to return the allocated blocks
+ * @pfn: used to return the pfns of all allocated pages. Must be big enough
+ * to hold @npages entries.
+ *
+ * This function allocates blocks of memory from the drm buddy allocator, and
+ * performs initialization work: sets struct page::zone_device_data to point
+ * to the memory block; sets/initializes the drm_buddy_block::private field;
+ * locks each page allocated; adds the memory block to the lru manager's lru
+ * list - this is TBD.
+ *
+ * Return: 0 on success
+ * error code otherwise
+ */
+int xe_devm_alloc_pages(struct xe_tile *tile,
+						unsigned long npages,
+						struct list_head *blocks,
+						unsigned long *pfn)
+{
+	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
+	struct drm_buddy_block *block, *tmp;
+	u64 size = npages << PAGE_SHIFT;
+	int ret = 0, i, j = 0;
+
+	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
+						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
+
+	if (unlikely(ret))
+		return ret;
+
+	list_for_each_entry_safe(block, tmp, blocks, link) {
+		struct xe_mem_region *mr = &tile->mem.vram;
+		u64 block_pfn_first, pages_per_block;
+		struct xe_svm_block_meta *meta;
+		u32 meta_size;
+
+		size = drm_buddy_block_size(mm, block);
+		pages_per_block = size >> PAGE_SHIFT;
+		meta_size = BITS_TO_BYTES(pages_per_block) +
+					sizeof(struct xe_svm_block_meta);
+		meta = kzalloc(meta_size, GFP_KERNEL);
+		bitmap_fill(meta->bitmap, pages_per_block);
+		meta->tile = tile;
+		block->private = meta;
+		block_pfn_first =
+					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+		for (i = 0; i < pages_per_block; i++) {
+			struct page *page;
+
+			pfn[j++] = block_pfn_first + i;
+			page = pfn_to_page(block_pfn_first + i);
+			/**Lock page per hmm requirement, see hmm.rst.*/
+			zone_device_page_init(page);
+			page->zone_device_data = block;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xe_devm_free_blocks() - free all memory blocks
+ *
+ * @blocks: memory blocks list head
+ */
+void xe_devm_free_blocks(struct list_head *blocks)
 {
+	struct drm_buddy_block *block, *tmp;
+
+	list_for_each_entry_safe(block, tmp, blocks, link)
+		free_block(block);
 }
 
 static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 23/31] drm/xe/svm: Trace buddy block allocation and free
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (21 preceding siblings ...)
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

trace_xe_buddy_block_alloc and trace_xe_buddy_block_free
are added to trace buddy allocation and free.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm_devmem.c |  6 ++++-
 drivers/gpu/drm/xe/xe_trace.h      | 35 ++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 5ba0cd9a70b0..088ac209ad80 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -17,6 +17,7 @@
 #include "xe_migrate.h"
 #include "xe_ttm_vram_mgr_types.h"
 #include "xe_assert.h"
+#include "xe_trace.h"
 
 /**
  * struct xe_svm_block_meta - svm uses this data structure to manage each
@@ -81,8 +82,10 @@ void xe_devm_page_free(struct page *page)
 
 	xe_assert(tile->xe, i < pages_per_block);
 	clear_bit(i, meta->bitmap);
-	if (bitmap_empty(meta->bitmap, pages_per_block))
+	if (bitmap_empty(meta->bitmap, pages_per_block)) {
+		trace_xe_buddy_block_free(block, size, block_pfn_first);
 		free_block(block);
+	}
 }
 
 /**
@@ -135,6 +138,7 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 		block->private = meta;
 		block_pfn_first =
 					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
+		trace_xe_buddy_block_alloc(block, size, block_pfn_first);
 		for (i = 0; i < pages_per_block; i++) {
 			struct page *page;
 
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index 5f7d26bf4cd7..f3fcce9f1434 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -21,6 +21,7 @@
 #include "xe_guc_exec_queue_types.h"
 #include "xe_sched_job.h"
 #include "xe_vm.h"
+#include <drm/drm_buddy.h>
 
 DECLARE_EVENT_CLASS(xe_gt_tlb_invalidation_fence,
 		    TP_PROTO(struct xe_gt_tlb_invalidation_fence *fence),
@@ -622,6 +623,40 @@ DEFINE_EVENT_PRINT(xe_guc_ctb, xe_guc_ctb_g2h,
 
 );
 
+DECLARE_EVENT_CLASS(xe_buddy_block,
+               TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+               TP_ARGS(block, size, pfn),
+
+               TP_STRUCT__entry(
+                               __field(u64, block)
+                               __field(u64, header)
+                               __field(u64, size)
+                               __field(u64, pfn)
+               ),
+
+               TP_fast_assign(
+                               __entry->block = (u64)block;
+                               __entry->header = block->header;
+                               __entry->size = size;
+                               __entry->pfn = pfn;
+               ),
+
+               TP_printk("xe svm: allocated block %llx, block header %llx, size %llx, pfn %llx\n",
+                       __entry->block, __entry->header, __entry->size, __entry->pfn)
+);
+
+
+DEFINE_EVENT(xe_buddy_block, xe_buddy_block_alloc,
+               TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+               TP_ARGS(block, size, pfn)
+);
+
+
+DEFINE_EVENT(xe_buddy_block, xe_buddy_block_free,
+               TP_PROTO(struct drm_buddy_block *block, u64 size, u64 pfn),
+               TP_ARGS(block, size, pfn)
+);
+
 #endif
 
 /* This part must be outside protection */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 24/31] drm/xe/svm: Create and destroy xe svm
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (22 preceding siblings ...)
  2024-04-09 20:17 ` [v2 23/31] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:25   ` Matthew Brost
  2024-04-09 20:17 ` [v2 25/31] drm/xe/svm: Add vm to xe_svm process Oak Zeng
                   ` (7 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a data structure xe_svm to represent a shared virtual
address space b/t a CPU program and GPU program. Each process can
have at most one xe_svm instance. One xe_svm can have
multiple gpu vms.

Introduce helper functions to create and destroy an xe_svm instance.
Once an xe_svm instance is created, it is added to a global hash table
keyed by mm_struct. Later on we can retrieve the xe_svm using the mm_struct.
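
A minimal sketch of the intended lookup-or-create pattern (not part of
this patch; the next patch in the series uses exactly this when a gpu
vm is created):

	struct xe_svm *svm;

	/* one xe_svm per process: look it up by mm, create it on first use */
	svm = xe_lookup_svm_by_mm(current->mm);
	if (!svm)
		svm = xe_create_svm();

	/* ... attach gpu vms to svm->vm_list under svm->mutex ... */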

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/Makefile |  1 +
 drivers/gpu/drm/xe/xe_svm.c | 77 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h | 23 +++++++++++
 3 files changed, 101 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index cd5213ba182b..f89d77b6d654 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -129,6 +129,7 @@ xe-y += xe_bb.o \
 	xe_sa.o \
 	xe_sched_job.o \
 	xe_step.o \
+	xe_svm.o \
 	xe_svm_devmem.o \
 	xe_sync.o \
 	xe_tile.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
new file mode 100644
index 000000000000..416cfc81c053
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mutex.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/hashtable.h>
+#include "xe_svm.h"
+
+#define XE_MAX_SVM_PROCESS 5 /* 32 (2^5) hash buckets for SVM processes */
+DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
+
+/**
+ * xe_create_svm() - create a svm instance
+ *
+ * One xe_svm struct represents a shared address space
+ * between a cpu and gpu program. So one xe_svm is associated
+ * with one mm_struct.
+ *
+ * If an xe_svm for this process already exists, just return
+ * it; otherwise create one.
+ *
+ * Return the created xe_svm struct pointer
+ */
+struct xe_svm *xe_create_svm(void)
+{
+	struct mm_struct *mm = current->mm;
+	struct xe_svm *svm;
+
+	svm = xe_lookup_svm_by_mm(mm);
+	if (svm)
+		return svm;
+
+	svm = kzalloc(sizeof(struct xe_svm), GFP_KERNEL);
+	svm->mm = mm;
+	mutex_init(&svm->mutex);
+	INIT_LIST_HEAD(&svm->vm_list);
+	/** Add svm to global xe_svm_table hash table
+	 *  use mm as key so later we can retrieve svm using mm
+	 */
+	hash_add_rcu(xe_svm_table, &svm->hnode, (uintptr_t)mm);
+	return svm;
+}
+
+/**
+ * xe_destroy_svm() - destroy a svm process
+ *
+ * @svm: the xe_svm to destroy
+ */
+void xe_destroy_svm(struct xe_svm *svm)
+{
+	BUG_ON(!list_empty(&svm->vm_list));
+	hash_del_rcu(&svm->hnode);
+	mutex_destroy(&svm->mutex);
+	kfree(svm);
+}
+
+
+/**
+ * xe_lookup_svm_by_mm() - retrieve xe_svm from mm struct
+ *
+ * @mm: the mm struct of the svm to retrieve
+ *
+ * Return the xe_svm struct pointer, or NULL if fail
+ */
+struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
+{
+	struct xe_svm *svm;
+
+	hash_for_each_possible_rcu(xe_svm_table, svm, hnode, (uintptr_t)mm)
+		if (svm->mm == mm)
+			return svm;
+
+	return NULL;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 92a3ee90d5a7..066740fb93f5 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -11,6 +11,29 @@
 #include "xe_device.h"
 #include "xe_assert.h"
 
+
+/**
+ * struct xe_svm - data structure to represent a shared
+ * virtual address space from device side. xe_svm and
+ * mm_struct have a 1:1 relationship.
+ */
+struct xe_svm {
+	/** @mm: The mm_struct corresponding to this xe_svm */
+	struct mm_struct *mm;
+	/**
+	 * @mutex: A lock protecting the vm_list below
+	 */
+	struct mutex mutex;
+	/** @hnode: used to add this svm to the global xe_svm_table hash table */
+	struct hlist_node hnode;
+	/** @vm_list: a list of gpu vms in this svm space */
+	struct list_head vm_list;
+};
+
+extern struct xe_svm *xe_create_svm(void);
+void xe_destroy_svm(struct xe_svm *svm);
+extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
+
 /**
  * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
  *
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 25/31] drm/xe/svm: Add vm to xe_svm process
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (23 preceding siblings ...)
  2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

One shared virtual address space (xe_svm) works across the CPU
and multiple GPUs under one CPU process. Each xe_svm process
can have multiple gpu vms, for example, one gpu vm per
gpu card. Add the gpu vm to the current xe_svm process during
xe_vm creation to note that this gpu vm participates in the shared
virtual address space with the current CPU process; also
remove the xe_vm from the xe_svm on xe_vm destroy.

FIXME: right now we blindly add all xe_vms to svm. Should
we introduce uAPI to let the user decide which xe_vm participates
in svm?

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c      | 45 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h      |  3 +++
 drivers/gpu/drm/xe/xe_vm.c       |  5 ++++
 drivers/gpu/drm/xe/xe_vm_types.h |  2 ++
 4 files changed, 55 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 416cfc81c053..1f4c2d32121a 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/hashtable.h>
 #include "xe_svm.h"
+#include "xe_vm_types.h"
 
 #define XE_MAX_SVM_PROCESS 5 /* Maximumly support 32 SVM process*/
 DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
@@ -75,3 +76,47 @@ struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
 
 	return NULL;
 }
+
+/**
+ * xe_svm_add_vm() - add a gpu vm to the current svm process
+ *
+ * @vm: The gpu vm to add to the current svm process.
+ *
+ * One shared virtual address space (xe_svm) works across CPU
+ * and multiple GPUs. So each xe_svm process can have N gpu
+ * vms, for example, one gpu vm per gpu card. This function
+ * adds a gpu vm to the current xe_svm process.
+ */
+void xe_svm_add_vm(struct xe_vm *vm)
+{
+	struct xe_svm *svm;
+
+	svm = xe_lookup_svm_by_mm(current->mm);
+	if (!svm)
+		svm = xe_create_svm();
+
+	mutex_lock(&svm->mutex);
+	list_add(&vm->svm_link, &svm->vm_list);
+	mutex_unlock(&svm->mutex);
+}
+
+/**
+ * xe_svm_remove_vm() - remove a gpu vm from svm process
+ *
+ * @vm: The gpu vm to remove from svm process.
+ */
+void xe_svm_remove_vm(struct xe_vm *vm)
+{
+	struct xe_svm *svm;
+
+	svm = xe_lookup_svm_by_mm(current->mm);
+	if (!svm)
+		return;
+
+	mutex_lock(&svm->mutex);
+	list_del(&vm->svm_link);
+	mutex_unlock(&svm->mutex);
+
+	if (list_empty(&svm->vm_list))
+		xe_destroy_svm(svm);
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 066740fb93f5..f601dffe3fc1 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -11,6 +11,7 @@
 #include "xe_device.h"
 #include "xe_assert.h"
 
+struct xe_vm;
 
 /**
  * struct xe_svm - data structure to represent a shared
@@ -33,6 +34,8 @@ struct xe_svm {
 extern struct xe_svm *xe_create_svm(void);
 void xe_destroy_svm(struct xe_svm *svm);
 extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
+void xe_svm_add_vm(struct xe_vm *vm);
+void xe_svm_remove_vm(struct xe_vm *vm);
 
 /**
  * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 61d336f24a65..498b36469d00 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -40,6 +40,7 @@
 #include "xe_trace.h"
 #include "xe_wa.h"
 #include "xe_hmm.h"
+#include "xe_svm.h"
 
 static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
 {
@@ -1347,6 +1348,7 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 	INIT_LIST_HEAD(&vm->userptr.repin_list);
 	INIT_LIST_HEAD(&vm->userptr.invalidated);
 	INIT_LIST_HEAD(&vm->userptr.fault_invalidated);
+	INIT_LIST_HEAD(&vm->svm_link);
 	init_rwsem(&vm->userptr.notifier_lock);
 	spin_lock_init(&vm->userptr.invalidated_lock);
 	INIT_WORK(&vm->userptr.garbage_collector, vm_userptr_garbage_collector);
@@ -1445,6 +1447,8 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		xe->usm.num_vm_in_non_fault_mode++;
 	mutex_unlock(&xe->usm.lock);
 
+	/** FIXME: Should we add vm to svm conditionally? Per uAPI?*/
+	xe_svm_add_vm(vm);
 	trace_xe_vm_create(vm);
 
 	return vm;
@@ -1562,6 +1566,7 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 	for_each_tile(tile, xe, id)
 		xe_range_fence_tree_fini(&vm->rftree[id]);
 
+	xe_svm_remove_vm(vm);
 	xe_vm_put(vm);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index d1f5949d4a3b..eb797195c374 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -394,6 +394,8 @@ struct xe_vm {
 	bool batch_invalidate_tlb;
 	/** @xef: XE file handle for tracking this VM's drm client */
 	struct xe_file *xef;
+	/** @svm_link: used to link this vm to xe_svm's vm_list*/
+	struct list_head svm_link;
 };
 
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 26/31] drm/xe: Make function lookup_vma public
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (24 preceding siblings ...)
  2024-04-09 20:17 ` [v2 25/31] drm/xe/svm: Add vm to xe_svm process Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-10 22:26   ` Matthew Brost
  2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
                   ` (5 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Make this function public as it will be used by later patches. Also
rename it to xe_vm_lookup_vma.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 10 ++++++++--
 drivers/gpu/drm/xe/xe_vm.h           |  1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 707a3466f36b..668984f0769e 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -80,7 +80,13 @@ static bool vma_matches(struct xe_vma *vma, u64 page_addr)
 	return true;
 }
 
-static struct xe_vma *lookup_vma(struct xe_vm *vm, u64 page_addr)
+/**
+ * xe_vm_lookup_vma() - look up a vma from address
+ *
+ * @vm: the xe_vm that the vma resides in
+ * @page_addr: address to look up
+ */
+struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr)
 {
 	struct xe_vma *vma = NULL;
 
@@ -166,7 +172,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 		ret = -ENOENT;
 		goto unlock_vm;
 	}
-	vma = lookup_vma(vm, pf->page_addr);
+	vma = xe_vm_lookup_vma(vm, pf->page_addr);
 	if (!vma) {
 		ret = -EINVAL;
 		goto unlock_vm;
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 4860747592ad..d55330988e32 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -306,3 +306,4 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm);
 void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap);
 void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p);
 void xe_vm_snapshot_free(struct xe_vm_snapshot *snap);
+struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (25 preceding siblings ...)
  2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:07   ` Matthew Brost
  2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

With SVM, the CPU and GPU programs share one and the same
virtual address space. The backing store of this virtual address
space can be either in system memory or device memory. Since GPU
device memory is remapped as DEVICE_PRIVATE, the CPU can't access it.
Any CPU access to device memory causes a page fault. Implement
a page fault handler to migrate memory back to system memory and
map it to the CPU page table so the CPU program can proceed.

Also unbind this page from the GPU side, and free the original GPU
device page.
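
Condensed, the handler follows the core kernel's migrate_vma flow; a
sketch with error handling omitted (the real code below also frees the
source device pages and unmaps the dma addresses):

	struct migrate_vma mvma = {
		.vma         = vmf->vma,
		.start       = vmf->vma->vm_start,
		.end         = vmf->vma->vm_end,
		.pgmap_owner = xe,
		.flags       = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		.fault_page  = vmf->page,
	};

	migrate_vma_setup(&mvma);      /* collect and isolate the device pages */
	/* per page: alloc a system page, blit vram -> sram with xe_migrate_pa() */
	xe_vm_invalidate_vma(xe_vma);  /* drop the GPU mapping of the range */
	migrate_vma_pages(&mvma);      /* install the system pages in the CPU PTEs */
	migrate_vma_finalize(&mvma);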

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Makefile         |   1 +
 drivers/gpu/drm/xe/xe_svm.h         |   8 +-
 drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
 drivers/gpu/drm/xe/xe_svm_migrate.c | 222 ++++++++++++++++++++++++++++
 4 files changed, 230 insertions(+), 8 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index f89d77b6d654..65289acdd563 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -131,6 +131,7 @@ xe-y += xe_bb.o \
 	xe_step.o \
 	xe_svm.o \
 	xe_svm_devmem.o \
+	xe_svm_migrate.o \
 	xe_sync.o \
 	xe_tile.o \
 	xe_tile_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index f601dffe3fc1..c9e4239c44b4 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -7,11 +7,11 @@
 #define __XE_SVM_H
 
 #include <linux/mm_types.h>
+#include <linux/mm.h>
 #include "xe_device_types.h"
 #include "xe_device.h"
 #include "xe_assert.h"
-
-struct xe_vm;
+#include "xe_vm_types.h"
 
 /**
  * struct xe_svm - data structure to represent a shared
@@ -31,6 +31,9 @@ struct xe_svm {
 	struct list_head vm_list;
 };
 
+#define xe_svm_for_each_vm(svm, vm)					\
+		list_for_each_entry(vm, &svm->vm_list, svm_link)
+
 extern struct xe_svm *xe_create_svm(void);
 void xe_destroy_svm(struct xe_svm *svm);
 extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
@@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 
 void xe_devm_free_blocks(struct list_head *blocks);
 void xe_devm_page_free(struct page *page);
+vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
index 088ac209ad80..32ada458f1dd 100644
--- a/drivers/gpu/drm/xe/xe_svm_devmem.c
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -37,11 +37,6 @@ struct xe_svm_block_meta {
 	unsigned long bitmap[];
 };
 
-static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
-{
-	return 0;
-}
-
 static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
 {
 	/** DRM buddy's block offset is 0-based*/
@@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
 
 static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
 	.page_free = xe_devm_page_free,
-	.migrate_to_ram = xe_devm_migrate_to_ram,
+	.migrate_to_ram = xe_svm_migrate_to_sram,
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
new file mode 100644
index 000000000000..0db831af098e
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/gfp.h>
+#include <linux/migrate.h>
+#include <linux/dma-mapping.h>
+#include <linux/dma-fence.h>
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <drm/drm_buddy.h>
+#include "xe_device_types.h"
+#include "xe_device.h"
+#include "xe_trace.h"
+#include "xe_migrate.h"
+#include "xe_ttm_vram_mgr_types.h"
+#include "xe_assert.h"
+#include "xe_pt.h"
+#include "xe_svm.h"
+#include "xe_vm.h"
+
+
+/**
+ * alloc_host_page() - allocate one host page for the fault vma
+ *
+ * @dev: (GPU) device that will access the allocated page
+ * @vma: the fault vma that we need allocate page for
+ * @addr: the fault address. The allocated page is for this address
+ * @dma_addr: used to output the dma address of the allocated page.
+ * This dma address will be used by the gpu to access this page. The GPU
+ * accesses a host page through a dma mapped address.
+ * @pfn: used to output the pfn of the allocated page.
+ *
+ * This function allocates one host page for the specified vma. It
+ * also does some preparation work for the GPU to access this page, such
+ * as mapping this page through the iommu (by calling dma_map_page).
+ *
+ * When this function returns, the page is locked.
+ *
+ * Return the struct page pointer on success,
+ * NULL otherwise
+ */
+static struct page *alloc_host_page(struct device *dev,
+							 struct vm_area_struct *vma,
+							 unsigned long addr,
+							 dma_addr_t *dma_addr,
+							 unsigned long *pfn)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (unlikely(!page))
+		return NULL;
+
+	/**Lock page per hmm requirement, see hmm.rst*/
+	lock_page(page);
+	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
+		unlock_page(page);
+		__free_page(page);
+		return NULL;
+	}
+
+	*pfn = migrate_pfn(page_to_pfn(page));
+	return page;
+}
+
+static void free_host_page(struct page *page)
+{
+	unlock_page(page);
+	put_page(page);
+}
+
+/**
+ * migrate_page_vram_to_ram() - migrate one page from vram to ram
+ *
+ * @vma: The vma that the page is mapped to
+ * @addr: The virtual address that the page is mapped to
+ * @src_pfn: src page's page frame number
+ * @dst_pfn: used to return the destination page (in system ram)'s pfn
+ *
+ * Allocate one page in system ram and copy memory from device memory
+ * to system ram.
+ *
+ * Return: 0 if this page is already in sram (no need to migrate),
+ * 1 if this page was successfully migrated from vram to sram,
+ * negative error code otherwise
+ */
+static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
+						unsigned long src_pfn, unsigned long *dst_pfn)
+{
+	struct xe_mem_region *mr;
+	struct xe_tile *tile;
+	struct xe_device *xe;
+	struct device *dev;
+	dma_addr_t dma_addr = 0;
+	struct dma_fence *fence;
+	struct page *host_page;
+	struct page *src_page;
+	u64 src_dpa;
+
+	src_page = migrate_pfn_to_page(src_pfn);
+	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
+		return 0;
+
+	mr = xe_page_to_mem_region(src_page);
+	tile = xe_mem_region_to_tile(mr);
+	xe = tile_to_xe(tile);
+	dev = xe->drm.dev;
+
+	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
+	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
+	if (!host_page)
+		return -ENOMEM;
+
+	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
+						dma_addr, false, PAGE_SIZE);
+	if (IS_ERR(fence)) {
+		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
+		free_host_page(host_page);
+		return PTR_ERR(fence);
+	}
+
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
+	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
+	return 1;
+}
+
+/**
+ * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
+ *
+ * @vmf: cpu vm fault structure, contains fault information such as vma etc.
+ *
+ * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
+ *
+ * This function migrates the gpu vma which contains the fault address to sram.
+ * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e., create one
+ * gpu vma for one cpu vma initially and try not to split it). So this scheme ends
+ * up migrating at the vma granularity. This might not be the most performant scheme.
+ *
+ * This can be tuned with a migration granularity for performance, for example,
+ * migrating 2M for each CPU page fault, or letting the user specify how much to migrate.
+ * This is more complex due to vma splitting.
+ *
+ * This function should also update the GPU page table, so the faulting virtual
+ * address points to the same sram location from the GPU side. This is TBD.
+ *
+ * Return:
+ * 0 on success
+ * VM_FAULT_SIGBUS: failed to migrate the page to system memory, the application
+ * will be signaled a SIGBUS
+ */
+vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
+{
+	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
+	struct xe_tile *tile = xe_mem_region_to_tile(mr);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
+	unsigned long addr = vma->vm_start;
+	u64 npages = vma_pages(vma);
+	struct xe_vma *xe_vma;
+	vm_fault_t ret = 0;
+	struct xe_vm *vm;
+	void *buf;
+	int i;
+
+	struct migrate_vma migrate_vma = {
+		.vma		= vmf->vma,
+		.start		= vma->vm_start,
+		.end		= vma->vm_end,
+		.pgmap_owner	= xe,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.fault_page = vmf->page,
+	};
+
+	buf = kvcalloc(npages, 2 * sizeof(*migrate_vma.src), GFP_KERNEL);
+	migrate_vma.src = buf;
+	migrate_vma.dst = buf + npages;
+	if (migrate_vma_setup(&migrate_vma) < 0) {
+		ret = VM_FAULT_SIGBUS;
+		goto free_buf;
+	}
+
+	if (!migrate_vma.cpages)
+		goto free_buf;
+
+	for (i = 0; i < npages; i++) {
+		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
+							migrate_vma.dst + i);
+		if (ret < 0) {
+			ret = VM_FAULT_SIGBUS;
+			break;
+		}
+
+		/** Migration has been successful, free source page */
+		if (ret == 1) {
+			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
+
+			xe_devm_page_free(src_page);
+		}
+
+		addr += PAGE_SIZE;
+	}
+
+	xe_svm_for_each_vm(svm, vm) {
+		xe_assert(xe, vm->mm == mm);
+		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
+		if (xe_vma)
+			xe_vm_invalidate_vma(xe_vma);
+	}
+	migrate_vma_pages(&migrate_vma);
+	migrate_vma_finalize(&migrate_vma);
+free_buf:
+	kvfree(buf);
+	return 0;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (26 preceding siblings ...)
  2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:49   ` Matthew Brost
  2024-04-09 20:17 ` [v2 29/31] drm/xe/svm: trace svm migration Oak Zeng
                   ` (3 subsequent siblings)
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Introduce a helper function xe_svm_migrate_vma_to_vram.

Since the source pages of the svm range can be physically
non-contiguous, and the destination vram pages can also be
non-contiguous, there is no easy way to migrate multiple pages per
blitter command. We do page-by-page migration for now.

Migration is best effort. Even if we fail to migrate some pages,
we will try to migrate the remaining pages.

FIXME: Use one blitter command to copy when both src and dst are
physically contiguous

FIXME: when a vma is partially migrated, split the vma as we assume
no mixed vma placement.
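
A hypothetical sketch of how the helper is meant to be wired into the
GPU page fault path (the actual hookup comes in the last patch of the
series; the fault handler is assumed to hold mmap_read_lock and to have
already looked up the faulting vma and destination tile):

	/* GPU fault on a system allocator range: pull the backing store into vram */
	err = xe_svm_migrate_vma_to_vram(vm, vma, tile);
	if (err)
		/* migration is best effort; fall back to system memory pages */
		err = 0;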

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.h         |   2 +
 drivers/gpu/drm/xe/xe_svm_migrate.c | 115 ++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index c9e4239c44b4..18ce2e3757c5 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
 void xe_devm_free_blocks(struct list_head *blocks);
 void xe_devm_page_free(struct page *page);
 vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
+int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
+							struct xe_tile *tile);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
index 0db831af098e..ab8dd1f58aa4 100644
--- a/drivers/gpu/drm/xe/xe_svm_migrate.c
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
 	kvfree(buf);
 	return 0;
 }
+
+/**
+ * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to vram
+ * Must be called with mmap_read_lock held.
+ * @vm: the vm that the vma belongs to
+ * @vma: the vma to migrate.
+ * @tile: the destination tile which holds the new backing store of the range
+ *
+ * Returns: negative errno on failure, 0 on success
+ */
+int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
+							struct xe_vma *vma,
+							struct xe_tile *tile)
+{
+	struct mm_struct *mm = vm->mm;
+	unsigned long start = xe_vma_start(vma);
+	unsigned long end = xe_vma_end(vma);
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	struct xe_mem_region *mr = &tile->mem.vram;
+	struct vm_area_struct *vas;
+
+	struct migrate_vma migrate = {
+		.start		= start,
+		.end		= end,
+		.pgmap_owner	= tile->xe,
+		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
+	};
+	struct device *dev = tile->xe->drm.dev;
+	dma_addr_t *src_dma_addr;
+	struct dma_fence *fence;
+	struct page *src_page;
+	LIST_HEAD(blocks);
+	int ret = 0, i;
+	u64 dst_dpa;
+	void *buf;
+
+	mmap_assert_locked(mm);
+
+	vas = find_vma_intersection(mm, start, start + 4);
+	if (!vas)
+		return -ENOENT;
+
+	migrate.vma = vas;
+	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*src_dma_addr),
+					GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	migrate.src = buf;
+	migrate.dst = migrate.src + npages;
+	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
+	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
+	if (ret)
+		goto kfree_buf;
+
+	ret = migrate_vma_setup(&migrate);
+	if (ret) {
+		drm_err(&tile->xe->drm, "vma setup returned %d for range [%lx - %lx]\n",
+				ret, start, end);
+		goto free_dst_pages;
+	}
+
+	/** FIXME: partial migration of a range prints a warning for now.
+	 * If this message is printed, we need to split the xe_vma as we
+	 * don't support mixed placement of one vma
+	 */
+	if (migrate.cpages != npages)
+		drm_warn(&tile->xe->drm, "Partial migration for range [%lx - %lx], range is %ld pages, migrate only %ld pages\n",
+				start, end, npages, migrate.cpages);
+
+	/** Migrate page by page for now.
+	 * Both source pages and destination pages can be physically non-contiguous,
+	 * so there is no good way to migrate multiple pages per blitter command.
+	 */
+	for (i = 0; i < npages; i++) {
+		src_page = migrate_pfn_to_page(migrate.src[i]);
+		if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))
+			goto free_dst_page;
+
+		xe_assert(tile->xe, !is_zone_device_page(src_page));
+		src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
+			drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
+			goto free_dst_page;
+		}
+		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
+		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
+				dst_dpa, true, PAGE_SIZE);
+		if (IS_ERR(fence)) {
+			drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
+					migrate.src[i]);
+			/** Migration is best effort. Even if we fail here, we continue */
+			goto free_dst_page;
+		}
+		/** FIXME: Use the first migration's out fence as the second migration's input fence,
+		 * and so on. Only wait on the out fence of the last migration?
+		 */
+		dma_fence_wait(fence, false);
+		dma_fence_put(fence);
+free_dst_page:
+		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
+	}
+
+	for (i = 0; i < npages; i++)
+		if (!(dma_mapping_error(dev, src_dma_addr[i])))
+			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
+
+	migrate_vma_pages(&migrate);
+	migrate_vma_finalize(&migrate);
+free_dst_pages:
+	if (ret)
+		xe_devm_free_blocks(&blocks);
+kfree_buf:
+	kvfree(buf);
+	return ret;
+}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 29/31] drm/xe/svm: trace svm migration
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (27 preceding siblings ...)
  2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

Add two trace points to trace svm migrations.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_svm_migrate.c |  5 ++++-
 drivers/gpu/drm/xe/xe_trace.h       | 11 +++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
index ab8dd1f58aa4..69096d81bf02 100644
--- a/drivers/gpu/drm/xe/xe_svm_migrate.c
+++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
@@ -211,8 +211,10 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
 	xe_svm_for_each_vm(svm, vm) {
 		xe_assert(xe, vm->mm == mm);
 		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
-		if (xe_vma)
+		if (xe_vma) {
+			trace_xe_svm_migrate_to_sram(xe_vma);
 			xe_vm_invalidate_vma(xe_vma);
+		}
 	}
 	migrate_vma_pages(&migrate_vma);
 	migrate_vma_finalize(&migrate_vma);
@@ -328,6 +330,7 @@ int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
 
 	migrate_vma_pages(&migrate);
 	migrate_vma_finalize(&migrate);
+	trace_xe_svm_migrate_to_vram(vma);
 free_dst_pages:
 	if (ret)
 		xe_devm_free_blocks(&blocks);
diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
index f3fcce9f1434..12e0c9856540 100644
--- a/drivers/gpu/drm/xe/xe_trace.h
+++ b/drivers/gpu/drm/xe/xe_trace.h
@@ -480,6 +480,17 @@ DEFINE_EVENT(xe_vma, xe_vma_userptr_invalidate_complete,
 	     TP_ARGS(vma)
 );
 
+DEFINE_EVENT(xe_vma, xe_svm_migrate_to_sram,
+		    TP_PROTO(struct xe_vma *vma),
+		    TP_ARGS(vma)
+);
+
+
+DEFINE_EVENT(xe_vma, xe_svm_migrate_to_vram,
+		    TP_PROTO(struct xe_vma *vma),
+		    TP_ARGS(vma)
+);
+
 DECLARE_EVENT_CLASS(xe_vm,
 		    TP_PROTO(struct xe_vm *vm),
 		    TP_ARGS(vm),
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (28 preceding siblings ...)
  2024-04-09 20:17 ` [v2 29/31] drm/xe/svm: trace svm migration Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:50   ` Matthew Brost
  2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
  2024-04-09 20:52 ` ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver Patchwork
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

xe_vma_is_fault_userptr is added to determine whether a vma is
a fault userptr.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index d55330988e32..a718f927e362 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -166,6 +166,11 @@ static inline bool xe_vma_is_userptr(struct xe_vma *vma)
 		!xe_vma_is_system_allocator(vma);
 }
 
+static inline bool xe_vma_is_fault_userptr(struct xe_vma *vma)
+{
+	return xe_vma_is_userptr(vma) && (vma->gpuva.flags & XE_VMA_FAULT_USERPTR);
+}
+
 /**
  * to_userptr_vma() - Return a pointer to an embedding userptr vma
  * @vma: Pointer to the embedded struct xe_vma
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (29 preceding siblings ...)
  2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
@ 2024-04-09 20:17 ` Oak Zeng
  2024-04-11  2:55   ` Matthew Brost
  2024-04-09 20:52 ` ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver Patchwork
  31 siblings, 1 reply; 58+ messages in thread
From: Oak Zeng @ 2024-04-09 20:17 UTC (permalink / raw)
  To: intel-xe
  Cc: himal.prasad.ghimiray, krishnaiah.bommu, matthew.brost,
	Thomas.Hellstrom, brian.welty

If applicable, migrate a vma from sram to vram for system allocator.
Traditional userptr is not migrated. Only a userptr created during
fault (aka a userptr split from a system allocator vma) can be
migrated.

FIXME: The migration should be conditional on user memory attribute
settings. Add this logic when memory attributes are supported.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 9 ++++++++-
 drivers/gpu/drm/xe/xe_vm.c           | 4 ----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 668984f0769e..c6ba00049964 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -20,6 +20,7 @@
 #include "xe_guc_ct.h"
 #include "xe_migrate.h"
 #include "xe_trace.h"
+#include "xe_svm.h"
 #include "xe_vm.h"
 
 struct pagefault {
@@ -209,12 +210,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
 
 	if (xe_vma_is_userptr(vma) && write_locked) {
 		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+		struct xe_userptr *userptr = &uvma->userptr;
 
 		spin_lock(&vm->userptr.invalidated_lock);
-		list_del_init(&uvma->userptr.invalidate_link);
+		list_del_init(&userptr->invalidate_link);
 		spin_unlock(&vm->userptr.invalidated_lock);
 
+		mmap_read_lock(userptr->notifier.mm);
+		/**FIXME: Add migration policy here*/
+		if (xe_vma_is_fault_userptr(vma))
+			xe_svm_migrate_vma_to_vram(vm, vma, tile);
 		ret = xe_vma_userptr_pin_pages(uvma);
+		mmap_read_unlock(userptr->notifier.mm);
 		if (ret)
 			goto unlock_vm;
 
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 498b36469d00..8a58fe144a02 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -71,16 +71,12 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct xe_device *xe = vm->xe;
-	struct xe_userptr *userptr;
 	int ret;
 
 	lockdep_assert_held(&vm->lock);
 	xe_assert(xe, xe_vma_is_userptr(vma));
 
-	userptr = &uvma->userptr;
-	mmap_read_lock(userptr->notifier.mm);
 	ret = xe_userptr_populate_range(uvma);
-	mmap_read_unlock(userptr->notifier.mm);
 
 	return ret;
 }
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver
  2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
                   ` (30 preceding siblings ...)
  2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
@ 2024-04-09 20:52 ` Patchwork
  31 siblings, 0 replies; 58+ messages in thread
From: Patchwork @ 2024-04-09 20:52 UTC (permalink / raw)
  To: Oak Zeng; +Cc: intel-xe

== Series Details ==

Series: Basic system allocator support in xe driver
URL   : https://patchwork.freedesktop.org/series/132229/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 7be27f645de2 drm-tip: 2024y-04m-09d-17h-53m-12s UTC integration manifest
=== git am output follows ===
error: patch failed: drivers/gpu/drm/xe/xe_bo.h:10
error: drivers/gpu/drm/xe/xe_bo.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_device.c:226
error: drivers/gpu/drm/xe/xe_device.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_exec.c:135
error: drivers/gpu/drm/xe/xe_exec.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_exec_queue_types.h:78
error: drivers/gpu/drm/xe/xe_exec_queue_types.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_pci.c:375
error: drivers/gpu/drm/xe/xe_pci.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_pt.c:1161
error: drivers/gpu/drm/xe/xe_pt.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_sched_job_types.h:39
error: drivers/gpu/drm/xe/xe_sched_job_types.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_trace.h:264
error: drivers/gpu/drm/xe/xe_trace.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_uc_fw.c:105
error: drivers/gpu/drm/xe/xe_uc_fw.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_vm.c:515
error: drivers/gpu/drm/xe/xe_vm.c: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_vm.h:207
error: drivers/gpu/drm/xe/xe_vm.h: patch does not apply
error: patch failed: drivers/gpu/drm/xe/xe_vm_types.h:180
error: drivers/gpu/drm/xe/xe_vm_types.h: patch does not apply
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Applying: drm/xe: Refactor vm_bind
Patch failed at 0001 drm/xe: Refactor vm_bind
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
@ 2024-04-10 21:09   ` Matthew Brost
  2024-04-16 19:01   ` Matthew Brost
  1 sibling, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:09 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:23PM -0400, Oak Zeng wrote:

Oak - Doing a very high level review due to early stage of the code.

> Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> page is backed by a struct page.
> 
> Those struct pages are created to allow hmm to migrate buffers between
> GPU vram and CPU system memory using the existing Linux migration
> mechanism (i.e., the one used to migrate between CPU system memory and disk).
> 
> This is preparation work to enable svm (shared virtual memory) through
> the Linux kernel hmm framework. The memory remap's page map type is set
> to MEMORY_DEVICE_PRIVATE for now. This means that even though each GPU
> vram page gets a struct page and can be mapped in the CPU page table,
> such pages are treated as the GPU's private resource, so the CPU can't
> access them. If the CPU accesses such a page, a page fault is triggered
> and the page will be migrated to system memory.
> 
> For GPU device which supports coherent memory protocol b/t CPU and
> GPU (such as CXL and CAPI protocol), we can remap device memory as
> MEMORY_DEVICE_COHERENT. This is TBD.
> 
> v1:
> Changes per code review feedback from Matt:
>     change .o order in Makefile
>     fix indentation
>     change code order in mmio_fini
>     remove unnecessary header file
>     uniform xe_svm_devm_add/_remove parameter
>     use tile (vs dev) as pagemap.owner during memremap
>     only remap vram for platform that support usm
> Changes per review feedback from Brian:
>     s/xe_svm_devm_add/xe_devm_add
>     s/xe_svm_devm_remove/xe_devm_remove
>     move calling of xe_devm_add to xe_tile.c
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile          |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  8 +++
>  drivers/gpu/drm/xe/xe_mmio.c         |  6 ++
>  drivers/gpu/drm/xe/xe_svm.h          | 15 +++++
>  drivers/gpu/drm/xe/xe_svm_devmem.c   | 89 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_tile.c         |  4 ++
>  6 files changed, 123 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index fff70fc9a09e..cd5213ba182b 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -129,6 +129,7 @@ xe-y += xe_bb.o \
>  	xe_sa.o \
>  	xe_sched_job.o \
>  	xe_step.o \
> +	xe_svm_devmem.o \

This goes for the series - IMO let's have all the svm code in one file
unless we have a really good reason not to. For the series I see 4:

mbrost@lstrano-desk:xe$ ls -la *.c | grep svm
-rw-rw-r-- 1 mbrost mbrost  8975 Apr 10 11:17 xe_svm.c
-rw-rw-r-- 1 mbrost mbrost  6774 Apr 10 11:17 xe_svm_devmem.c
-rw-rw-r-- 1 mbrost mbrost 10636 Apr 10 11:17 xe_svm_migrate.c
-rw-rw-r-- 1 mbrost mbrost  5940 Apr 10 11:17 xe_svm_range.c

Personally I'd like the name xe_devmem.c (or xe_devm.c); I'm open to xe_svm.c
too, I guess.

Whatever name we land on let's also try to make sure all exported
functions (in *.h file) start with the same prefix as the file too.

>  	xe_sync.o \
>  	xe_tile.o \
>  	xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index e73b9a086718..d6a14327986b 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -103,6 +103,14 @@ struct xe_mem_region {
>  	resource_size_t actual_physical_size;
>  	/** @mapping: pointer to VRAM mappable space */
>  	void __iomem *mapping;
> +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> +	struct dev_pagemap pagemap;
> +	/**
> +	 * @hpa_base: base host physical address
> +	 *
> +	 * This is generated when remap device memory as ZONE_DEVICE
> +	 */
> +	resource_size_t hpa_base;
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> index 7ba2477452d7..12923fe6abae 100644
> --- a/drivers/gpu/drm/xe/xe_mmio.c
> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> @@ -22,6 +22,7 @@
>  #include "xe_module.h"
>  #include "xe_sriov.h"
>  #include "xe_tile.h"
> +#include "xe_svm.h"
>  
>  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
>  #define TILE_COUNT		REG_GENMASK(15, 8)
> @@ -354,6 +355,11 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
>  static void mmio_fini(struct drm_device *drm, void *arg)
>  {
>  	struct xe_device *xe = arg;
> +	struct xe_tile *tile;
> +	u8 id;
> +
> +	for_each_tile(tile, xe, id)
> +		xe_devm_remove(tile, &tile->mem.vram);
>  
>  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>  	if (xe->mem.vram.mapping)
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> new file mode 100644
> index 000000000000..e944971cfc6d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#ifndef __XE_SVM_H
> +#define __XE_SVM_H
> +
> +struct xe_tile;
> +struct xe_mem_region;
> +
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> new file mode 100644
> index 000000000000..31af56e8285a
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/mm_types.h>
> +#include <linux/sched/mm.h>
> +
> +#include "xe_device_types.h"
> +#include "xe_svm.h"
> +
> +
> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	return 0;
> +}
> +
> +static void xe_devm_page_free(struct page *page)
> +{
> +}
> +
> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> +	.page_free = xe_devm_page_free,
> +	.migrate_to_ram = xe_devm_migrate_to_ram,
> +};
> +
> +/**
> + * xe_devm_add: Remap and provide memmap backing for device memory
> + * @tile: tile that the memory region blongs to
> + * @mr: memory region to remap
> + *
> + * This remap device memory to host physical address space and create
> + * struct page to back device memory
> + *
> + * Return: 0 on success standard error code otherwise
> + */
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> +	struct resource *res;
> +	void *addr;
> +	int ret;
> +
> +	res = devm_request_free_mem_region(dev, &iomem_resource,
> +					   mr->usable_size);
> +	if (IS_ERR(res)) {
> +		ret = PTR_ERR(res);
> +		return ret;
> +	}
> +
> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	mr->pagemap.range.start = res->start;
> +	mr->pagemap.range.end = res->end;
> +	mr->pagemap.nr_range = 1;
> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> +	mr->pagemap.owner = xe;

Nit: I know I suggested this in another series too - a helper to go from xe
-> owner which can be used in the various places we set this.
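
Something like this, maybe (the name xe_svm_devm_owner is just a placeholder,
assuming we keep the xe device as the owner):

static inline void *xe_svm_devm_owner(struct xe_device *xe)
{
	return xe;
}

Then this memremap path and any hmm_range_fault dev_private_owner setting
could share the same helper.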

> +	addr = devm_memremap_pages(dev, &mr->pagemap);
> +	if (IS_ERR(addr)) {
> +		devm_release_mem_region(dev, res->start, resource_size(res));
> +		ret = PTR_ERR(addr);
> +		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
> +				tile->id, ret);
> +		return ret;
> +	}
> +	mr->hpa_base = res->start;
> +
> +	drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> +			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> +	return 0;
> +}
> +
> +/**
> + * xe_devm_remove: Unmap device memory and free resources
> + * @tile: xe tile
> + * @mr: memory region to remove
> + */
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> +
> +	/*FIXME: Does the below cause a kernel hang during module removal?*/

Goes for the series, try resolve issues rather than having FIXMEs.

Matt

> +	if (mr->hpa_base) {
> +		devm_memunmap_pages(dev, &mr->pagemap);
> +		devm_release_mem_region(dev, mr->pagemap.range.start,
> +			mr->pagemap.range.end - mr->pagemap.range.start + 1);
> +	}
> +}
> +
> diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> index 0650b2fa75ef..f1c4f9de51df 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -14,6 +14,7 @@
>  #include "xe_tile_sysfs.h"
>  #include "xe_ttm_vram_mgr.h"
>  #include "xe_wa.h"
> +#include "xe_svm.h"
>  
>  /**
>   * DOC: Multi-tile Design
> @@ -158,6 +159,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
>   */
>  int xe_tile_init_noalloc(struct xe_tile *tile)
>  {
> +	struct xe_device *xe = tile_to_xe(tile);
>  	int err;
>  
>  	xe_device_mem_access_get(tile_to_xe(tile));
> @@ -175,6 +177,8 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
>  
>  	xe_tile_sysfs_init(tile);
>  
> +	if (xe->info.has_usm)
> +		xe_devm_add(tile, &tile->mem.vram);
>  err_mem_access:
>  	xe_device_mem_access_put(tile_to_xe(tile));
>  	return err;
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config
  2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
@ 2024-04-10 21:13   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:13 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:24PM -0400, Oak Zeng wrote:
> Introduce a DRM_XE_SVM kernel config entry for

Maybe consider another name for this? I could see use cases for non-SVM
where we still want private pages mapped (e.g. VRAM userptrs on
non-faulting devices). Don't really have a suggestion, but it's worth
considering.

> xe svm feature. The xe svm feature allows sharing of
> virtual address space between CPU and GPU programs.
> 
> v1: Improve commit message (Thomas)
>     Avoid using #if directive (Thomas)
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Kconfig   | 21 +++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_tile.c |  7 +++++--
>  2 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
> index 449a1ecbc92a..0accb2cb81d6 100644
> --- a/drivers/gpu/drm/xe/Kconfig
> +++ b/drivers/gpu/drm/xe/Kconfig
> @@ -84,6 +84,27 @@ config DRM_XE_FORCE_PROBE
>  	  4571.
>  
>  	  Use "!*" to block the probe of the driver for all known devices.
> +config DRM_XE_SVM
> +	bool "Enable Shared Virtual Memory support in xe"
> +	depends on DRM_XE
> +	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +	depends on ARCH_ENABLE_MEMORY_HOTREMOVE
> +	depends on MEMORY_HOTPLUG
> +	depends on MEMORY_HOTREMOVE
> +	depends on ARCH_HAS_PTE_DEVMAP
> +	depends on SPARSEMEM_VMEMMAP
> +	depends on ZONE_DEVICE
> +	depends on DEVICE_PRIVATE
> +	depends on MMU
> +	select HMM_MIRROR
> +	select MMU_NOTIFIER
> +	default y
> +	help
> +	  Choose this option if you want Shared Virtual Memory (SVM)
> +	  support in xe. With SVM, virtual address space is shared
> +	  between CPU and GPU. This means any virtual address, such
> +	  as one returned by malloc or mmap, a variable on the stack, or a
> +	  global memory pointer, can be used by the GPU transparently.
>  
>  menu "drm/Xe Debugging"
>  depends on DRM_XE
> diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> index f1c4f9de51df..a1a436912fe3 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -159,9 +159,12 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
>   */
>  int xe_tile_init_noalloc(struct xe_tile *tile)
>  {
> -	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_device __maybe_unused *xe;

Just assign this here blindly? The __maybe_unused should suppress the
warning when CONFIG_DRM_XE_SVM is false, and it should just compile out if it is.

Matt 

>  	int err;
>  
> +	if (IS_ENABLED(CONFIG_DRM_XE_SVM))
> +		xe = tile_to_xe(tile);
> +
>  	xe_device_mem_access_get(tile_to_xe(tile));
>  
>  	err = tile_ttm_mgr_init(tile);
> @@ -177,7 +180,7 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
>  
>  	xe_tile_sysfs_init(tile);
>  
> -	if (xe->info.has_usm)
> +	if (IS_ENABLED(CONFIG_DRM_XE_SVM) && xe->info.has_usm)
>  		xe_devm_add(tile, &tile->mem.vram);
>  err_mem_access:
>  	xe_device_mem_access_put(tile_to_xe(tile));
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 14/31] drm/xe: Introduce helper to get tile from memory region
  2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
@ 2024-04-10 21:17   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:17 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:25PM -0400, Oak Zeng wrote:
> Introduce a simple helper to retrieve tile from memory region
> 
> v1: move the function to xe_device.h (Matt)
>     improve commit message, add kerneldoc (Thomas)
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>

This LGTM but can it be moved to xe_tile.h? That might be a better place
but I know xe_device.h and xe_tile.h are intertwined a bit.

Matt

> ---
>  drivers/gpu/drm/xe/xe_device.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 74eb9833d4d8..68082357aebd 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -178,4 +178,12 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>  
>  void xe_device_put_deferred(struct xe_device *xe, struct llist_node *deferred);
>  
> +/**
> + * xe_mem_region_to_tile() - retrieve tile from memory region
> + * @mr: the memory region we retrieve tile from
> + */
> +static inline struct xe_tile *xe_mem_region_to_tile(struct xe_mem_region *mr)
> +{
> +	return container_of(mr, struct xe_tile, mem.vram);
> +}
>  #endif
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn
  2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
@ 2024-04-10 21:35   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:35 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:26PM -0400, Oak Zeng wrote:
> Since we now create struct page backing for each vram page,
> each vram page now also has a pfn, just like system memory.
> This allows us to calculate the device physical address from the pfn.
> 
> v1: move the function to xe_svm.h (Matt)
>     s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa (Matt)
>     add kernel document for the helper (Thomas)
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index e944971cfc6d..8a34429eb674 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -6,8 +6,31 @@
>  #ifndef __XE_SVM_H
>  #define __XE_SVM_H
>  
> -struct xe_tile;
> -struct xe_mem_region;
> +#include "xe_device_types.h"
> +#include "xe_device.h"
> +#include "xe_assert.h"

Hmm, including all these headers is frowned upon and indicates to me
this is likely the wrong location. The new header hopefully is clean with
only forward decls and function defs. I know Xe headers are not great at
this but let's not make this worse than it is.

Maybe it should be xe_device.h? Also, if we move the entire implementation into
one *.c file, it is possible this function could be private to that C file too.

> +
> +/**
> + * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
> + *
> + * @mr: The memory region that page resides in
> + * @pfn: page frame number of the page
> + *
> + * Returns: the device physical address of the page
> + */
> +static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)

I'd change this to xe_mem_region_page_to_dpa with a struct page argument
rather than a pfn. The pfn can then be derived from the page.

I think this is better as we will have 3 types of pfn, all with different
values / shifts.

- hmm pfn
- migrate pfn
- linux core pfn

If a migrate pfn or hmm pfn were passed in as an argument we'd get the
wrong dpa. I think passing in a page is safer and less bug-prone. In my
example, if we had a migrate pfn or hmm pfn, we'd use the appropriate
helper to get the struct page.

This also aligns with how a similar AMD helper (svm_migrate_addr, [1]) is
implemented.

Matt

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c#L234
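
Roughly something like this, I think (untested sketch; same math as the
current helper, but deriving the core-kernel pfn from the struct page so the
wrong pfn flavor can't be passed in):

static inline u64 xe_mem_region_page_to_dpa(struct xe_mem_region *mr,
					    struct page *page)
{
	struct xe_tile *tile = xe_mem_region_to_tile(mr);
	struct xe_device *xe = tile_to_xe(tile);
	u64 hpa = (u64)page_to_pfn(page) << PAGE_SHIFT;

	xe_assert(xe, hpa >= mr->hpa_base);

	return mr->dpa_base + (hpa - mr->hpa_base);
}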

> +{
> +	u64 dpa;
> +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	u64 offset;
> +
> +	xe_assert(xe, (pfn << PAGE_SHIFT) >= mr->hpa_base);
> +	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> +	dpa = mr->dpa_base + offset;
> +
> +	return dpa;
> +}
>  
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>  void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 16/31] drm/xe/svm: Get xe memory region from page
  2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
@ 2024-04-10 21:38   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:38 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:27PM -0400, Oak Zeng wrote:
> For a gpu vram page, we now have a struct page backing
> it. The struct page's pgmap points to the xe_mem_region's
> pagemap. This allows us to retrieve the xe_mem_region
> from a struct page.
> 
> v1: move the function to xe_svm.h
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 8a34429eb674..624c1581f8ba 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h

See my comments about location in the previous patch, and also consider whether
this can be a private function if the implementation is in one *.c file.

> @@ -6,6 +6,7 @@
>  #ifndef __XE_SVM_H
>  #define __XE_SVM_H
>  
> +#include <linux/mm_types.h>
>  #include "xe_device_types.h"
>  #include "xe_device.h"
>  #include "xe_assert.h"
> @@ -35,4 +36,14 @@ static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>  void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
>  
> +/**
> + * xe_page_to_mem_region() - Get a page's memory region
> + *
> + * @page: a struct page pointer pointing to a page in vram memory region
> + */
> +static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
> +{
> +	return container_of(page->pgmap, struct xe_mem_region, pagemap);
> +}

If the previous patch becomes xe_mem_region_page_to_dpa and we want very robust
code, we could add an assert to that function.

xe_assert(xe, mr == xe_page_to_mem_region(page));

Matt

> +
>  #endif
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 17/31] drm/xe: Get xe_vma from xe_userptr
  2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
@ 2024-04-10 21:42   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:42 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:28PM -0400, Oak Zeng wrote:
> Introduce a helper to get xe_vma from xe_userptr.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_vm.h | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index 0b2790f697db..4860747592ad 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -178,6 +178,20 @@ static inline struct xe_userptr_vma *to_userptr_vma(struct xe_vma *vma)
>  	return container_of(vma, struct xe_userptr_vma, vma);
>  }
>  
> +/**
> + * xe_userptr_to_vma() - Return xe_vma from a xe_userptr pointer
> + *
> + * @userptr: The userptr struct pointer
> + */
> +

Extra newline. Otherwise LGTM.

Matt

> +static inline struct xe_vma *xe_userptr_to_vma(struct xe_userptr *userptr)
> +{
> +	struct xe_userptr_vma *uvma;
> +
> +	uvma = container_of(userptr, struct xe_userptr_vma, userptr);
> +	return &uvma->vma;
> +}
> +
>  u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile);
>  
>  int xe_vm_create_ioctl(struct drm_device *dev, void *data,
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 18/31] drm/xe/svm: Build userptr sg table for device pages
  2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
@ 2024-04-10 21:52   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:52 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:29PM -0400, Oak Zeng wrote:
> Previously the function xe_build_sg only supported userptrs backed by
> system memory pages. Now this function is extended to support userptrs
> backed by device pages as well.
> 
> For device pages, there is no need for dma-mapping. Instead, we
> calculate the device page's dpa (device physical address) and
> use the dpa to fill the sg table.
> 
> As of now, we assume each userptr is backed either entirely by
> system memory pages or entirely by device pages. There is no support
> for a mixture of device and system memory pages backing one userptr.
> 

I'm not sure this is acceptable, per Jason's suggestion (or rather
insistence) not to use an sg list for a collection of dpas [1].

For a single device we should just be able to use the buddy blocks as the
cursor, which I suggest in [1]. Maybe this doesn't work in a multi-device
case, but it certainly should work for a single device. Since we are
targeting a single device first, let's get it working without abusing an SG
list.

Matt

[1] https://patchwork.freedesktop.org/patch/574894/?series=128910&rev=1

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_hmm.c      | 121 +++++++++++++++++++++++++------
>  drivers/gpu/drm/xe/xe_vm_types.h |   2 +
>  2 files changed, 100 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> index 427c6bc49949..a261c1dd2060 100644
> --- a/drivers/gpu/drm/xe/xe_hmm.c
> +++ b/drivers/gpu/drm/xe/xe_hmm.c
> @@ -11,6 +11,7 @@
>  #include <linux/hmm.h>
>  #include <linux/mm.h>
>  #include "xe_hmm.h"
> +#include "xe_svm.h"
>  #include "xe_vm.h"
>  #include "xe_bo.h"
>  
> @@ -43,15 +44,90 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
>  	}
>  }
>  
> +/**
> + * xe_build_sg_device_pages() - build sg table for userptr when the backing store
> + * is device pages
> + *
> + * @st: sg table to build
> + * @hmm_pfns: pfn array of the userptr
> + * @pages: struct page array of this userptr
> + * @npages: how many pages in this userptr
> + */
> +static int xe_build_sg_device_pages(struct sg_table *st, unsigned long *hmm_pfns,
> +						struct page **pages, uint64_t npages)
> +{
> +	struct scatterlist *sg;
> +	int i;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		unsigned long addr;
> +		struct xe_mem_region *mr;
> +
> +		mr = xe_page_to_mem_region(pages[i]);
> +		addr = xe_mem_region_pfn_to_dpa(mr, hmm_pfns[i]);
> +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg =  sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}
> +
> +/**
> + * xe_validate_hmm_pfns() - validate all pages in a userptr belong to one memory
> + * region, and populate the pages array.
> + *
> + * @userptr: The userptr to validate
> + * @hmm_pfns: an array holding hmm pfns
> + * @npages: number of pages of this userptr
> + * @pages: output parameter to hold the populated pages from pfn.
> + */
> +static void xe_validate_hmm_pfns(struct xe_userptr *userptr, unsigned long *hmm_pfns,
> +						uint64_t npages, struct page **pages)
> +{
> +	int i;
> +	struct xe_vma *vma = xe_userptr_to_vma(userptr);
> +	struct xe_vm *vm = xe_vma_vm(vma);
> +
> +	pages[0] = hmm_pfn_to_page(hmm_pfns[0]);
> +	userptr->is_device_pages = is_device_private_page(pages[0]);
> +	for (i = 1; i < npages; i++) {
> +		pages[i] = hmm_pfn_to_page(hmm_pfns[i]);
> +		/**
> +		 * We currently assume no mixture of device pages and system memory
> +		 * pages in one userptr. If it turns out this is not true, we will
> +		 * either split the userptr into device-page-backed and system-memory-backed
> +		 * parts, or support a mixed backing store in one userptr.
> +		 */
> +		xe_assert(vm->xe,
> +			userptr->is_device_pages == is_device_private_page(pages[i]));
> +	}
> +}
> +
> +
>  /**
>   * xe_build_sg() - build a scatter gather table for all the physical pages/pfn
>   * in a hmm_range. dma-map pages if necessary. dma-address is save in sg table
>   * and will be used to program GPU page table later.
>   *
>   * @xe: the xe device who will access the dma-address in sg table
> + * @userptr: the userptr that we build the sg table for
>   * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
>   * has the pfn numbers of pages that back up this hmm address range.
> - * @st: pointer to the sg table.
>   * @write: whether we write to this range. This decides dma map direction
>   * for system pages. If write we map it bi-diretional; otherwise
>   * DMA_TO_DEVICE
> @@ -64,11 +140,6 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
>   * access memory. So if the memory is system memory, we need to
>   * do a dma-mapping so it can be accessed by GPU/DMA.
>   *
> - * FIXME: This function currently only support pages in system
> - * memory. If the memory is GPU local memory (of the GPU who
> - * is going to access memory), we need gpu dpa (device physical
> - * address), and there is no need of dma-mapping. This is TBD.
> - *
>   * FIXME: dma-mapping for peer gpu device to access remote gpu's
>   * memory. Add this when you support p2p
>   *
> @@ -77,12 +148,13 @@ static void xe_mark_range_accessed(struct hmm_range *range, bool write)
>   *
>   * Returns 0 if successful; -ENOMEM if fails to allocate memory
>   */
> -static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
> -			     struct sg_table *st, bool write)
> +static int xe_build_sg(struct xe_device *xe, struct xe_userptr *userptr,
> +					struct hmm_range *range, bool write)
>  {
> +	struct sg_table *st = &userptr->sgt;
>  	struct device *dev = xe->drm.dev;
>  	struct page **pages;
> -	u64 i, npages;
> +	u64 npages;
>  	int ret;
>  
>  	npages = xe_npages_in_range(range->start, range->end);
> @@ -90,19 +162,22 @@ static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
>  	if (!pages)
>  		return -ENOMEM;
>  
> -	for (i = 0; i < npages; i++) {
> -		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);
> -		xe_assert(xe, !is_device_private_page(pages[i]));
> -	}
> -
> -	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
> -			npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
> -	if (ret)
> -		goto free_pages;
> +	xe_validate_hmm_pfns(userptr, range->hmm_pfns, npages, pages);
> +	if (!userptr->is_device_pages) {
> +		ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
> +				npages << PAGE_SHIFT, xe_sg_segment_size(dev), GFP_KERNEL);
> +		if (ret)
> +			goto free_pages;
>  
> -	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
> -			DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
> +		ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
> +				DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
> +	} else {
> +		ret = xe_build_sg_device_pages(st, range->hmm_pfns, pages, npages);
> +		if (ret)
> +			goto free_pages;
> +	}
>  
> +	userptr->sg = st;
>  free_pages:
>  	kvfree(pages);
>  	return ret;
> @@ -127,7 +202,8 @@ void xe_userptr_free_sg(struct xe_userptr_vma *uvma)
>  	struct device *dev = xe->drm.dev;
>  
>  	xe_assert(xe, userptr->sg);
> -	dma_unmap_sgtable(dev, userptr->sg,
> +	if (!userptr->is_device_pages)
> +		dma_unmap_sgtable(dev, userptr->sg,
>  			write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
>  
>  	sg_free_table(userptr->sg);
> @@ -239,12 +315,11 @@ int xe_userptr_populate_range(struct xe_userptr_vma *uvma)
>  	if (ret)
>  		goto free_pfns;
>  
> -	ret = xe_build_sg(vm->xe, &hmm_range, &userptr->sgt, write);
> +	ret = xe_build_sg(vm->xe, userptr, &hmm_range, write);
>  	if (ret)
>  		goto free_pfns;
>  
>  	xe_mark_range_accessed(&hmm_range, write);
> -	userptr->sg = &userptr->sgt;
>  	userptr->notifier_seq = hmm_range.notifier_seq;
>  
>  free_pfns:
> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> index fbf6bfcf59a8..3b4debfecc9b 100644
> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> @@ -64,6 +64,8 @@ struct xe_userptr {
>  	struct sg_table *sg;
>  	/** @notifier_seq: notifier sequence number */
>  	unsigned long notifier_seq;
> +	/** @is_device_pages: the backing store is in device memory*/
> +	bool is_device_pages;
>  	/**
>  	 * @initial_bind: user pointer has been bound at least once.
>  	 * write: vm->userptr.notifier_lock in read mode and vm->resv held.
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory
  2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
@ 2024-04-10 21:56   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 21:56 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:30PM -0400, Oak Zeng wrote:
> With the system allocator, a userptr can now be backed by device
> memory as well. Introduce a helper function xe_vma_is_devmem
> to determine whether a vma is backed by device memory.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pt.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> index 846e896edcb5..525092111be9 100644
> --- a/drivers/gpu/drm/xe/xe_pt.c
> +++ b/drivers/gpu/drm/xe/xe_pt.c
> @@ -577,6 +577,17 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
>  	.pt_entry = xe_pt_stage_bind_entry,
>  };
>  
> +static bool xe_vma_is_devmem(struct xe_vma *vma)

At some point we probably want to scrub the driver as we intermix
devmem, vram, and lmem nomenclature. I think in each case we mean the same
thing. Anyway, that is a little out of scope here.

> +{
> +	if (xe_vma_is_userptr(vma)) {
> +		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> +		return uvma->userptr.is_device_pages;

Helper itself LGTM. Maybe promote to xe_vm.c/xe_vm.h?

Also consider other options rather than the userptr.is_device_pages flag
here (e.g. look for buddy blocks, check a gpuvm flag, etc...). Can live
with a flag, but if we can do without it, great.
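
e.g. something along these lines if we end up storing the buddy blocks on the
userptr (the 'blocks' list below is hypothetical):

static bool xe_vma_is_devmem(struct xe_vma *vma)
{
	if (xe_vma_is_userptr(vma))
		return !list_empty(&to_userptr_vma(vma)->userptr.blocks);

	return xe_vma_bo(vma) && (xe_bo_is_vram(xe_vma_bo(vma)) ||
				  xe_bo_is_stolen_devmem(xe_vma_bo(vma)));
}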

Matt

> +	} else {
> +		struct xe_bo *bo = xe_vma_bo(vma);
> +		return bo && (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
> +	}
> +}
> +
>  /**
>   * xe_pt_stage_bind() - Build a disconnected page-table tree for a given address
>   * range.
> @@ -601,8 +612,7 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
>  {
>  	struct xe_device *xe = tile_to_xe(tile);
>  	struct xe_bo *bo = xe_vma_bo(vma);
> -	bool is_devmem = !xe_vma_is_userptr(vma) && bo &&
> -		(xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo));
> +	bool is_devmem = xe_vma_is_devmem(vma);
>  	struct xe_res_cursor curs;
>  	struct xe_pt_stage_bind_walk xe_walk = {
>  		.base = {
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 21/31] drm/xe/svm: Introduce svm migration function
  2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
@ 2024-04-10 22:06   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 22:06 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:32PM -0400, Oak Zeng wrote:
> Introduce the xe_migrate_pa function for data migration.
> This function is similar to the xe_migrate_copy function
> but has different parameters. Instead of BO and ttm
> resource parameters, it takes the source and destination
> buffers' physical addresses as parameters. This function is
> intended to be used by the svm sub-system, which doesn't
> have the BO and TTM concepts.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_migrate.c | 217 ++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_migrate.h |   7 ++
>  2 files changed, 224 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 82b63bdb9c47..f1d53911253b 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -462,6 +462,37 @@ static bool xe_migrate_allow_identity(u64 size, const struct xe_res_cursor *cur)
>  	return cur->size >= size;
>  }
>  
> +/**
> + * pte_update_cmd_size() - calculate the batch buffer command size
> + * to update a flat page table.
> + *
> + * @size: The virtual address range size of the page table to update
> + *
> + * The page table to update is supposed to be a flat 1 level page
> + * table with all entries pointing to 4k pages.
> + *
> + * Return the number of dwords of the update command
> + */
> +static u32 pte_update_cmd_size(u64 size)
> +{
> +	u32 dword;
> +	u64 entries = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> +
> +	XE_WARN_ON(size > MAX_PREEMPTDISABLE_TRANSFER);
> +	/*
> +	 * MI_STORE_DATA_IMM command is used to update page table. Each
> + * instruction can update at most 0x1ff pte entries. To update
> +	 * n (n <= 0x1ff) pte entries, we need:
> +	 * 1 dword for the MI_STORE_DATA_IMM command header (opcode etc)
> +	 * 2 dword for the page table's physical location
> +	 * 2*n dword for value of pte to fill (each pte entry is 2 dwords)
> +	 */
> +	dword = (1 + 2) * DIV_ROUND_UP(entries, 0x1ff);
> +	dword += entries * 2;
> +
> +	return dword;
> +}
> +
>  static u32 pte_update_size(struct xe_migrate *m,
>  			   bool is_vram,
>  			   struct ttm_resource *res,
> @@ -562,6 +593,48 @@ static void emit_pte(struct xe_migrate *m,
>  	}
>  }
>  
> +/**
> + * build_pt_update_batch_sram() - build batch buffer commands to update
> + * migration vm page table for system memory
> + *
> + * @m: The migration context
> + * @bb: The batch buffer which hold the page table update commands
> + * @pt_offset: The offset of page table to update, in byte
> + * @pa: device physical address you want the page table to point to
> + * @size: size of the virtual address space you want the page table to cover
> + */
> +static void build_pt_update_batch_sram(struct xe_migrate *m,
> +		     struct xe_bb *bb, u32 pt_offset,
> +		     u64 pa, u32 size)
> +{
> +	u16 pat_index = tile_to_xe(m->tile)->pat.idx[XE_CACHE_WB];
> +	u32 ptes;
> +
> +	ptes = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> +	while (ptes) {
> +		u32 chunk = min(0x1ffU, ptes);
> +
> +		bb->cs[bb->len++] = MI_STORE_DATA_IMM | MI_SDI_NUM_QW(chunk);
> +		bb->cs[bb->len++] = pt_offset;
> +		bb->cs[bb->len++] = 0;
> +
> +		pt_offset += chunk * 8;
> +		ptes -= chunk;
> +
> +		while (chunk--) {
> +			u64 addr;
> +
> +			addr = pa & PAGE_MASK;
> +			addr = m->q->vm->pt_ops->pte_encode_addr(m->tile->xe,
> +								 addr, pat_index,
> +								 0, false, 0);
> +			bb->cs[bb->len++] = lower_32_bits(addr);
> +			bb->cs[bb->len++] = upper_32_bits(addr);
> +			pa += XE_PAGE_SIZE;
> +		}
> +	}
> +}
> +
>  #define EMIT_COPY_CCS_DW 5
>  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
>  			  u64 dst_ofs, bool dst_is_indirect,
> @@ -879,6 +952,150 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
>  	return fence;
>  }
>  
> +/**
> + * xe_migrate_pa() - Migrate buffers with src and dst physical address
> + *
> + * @m: The migration context
> + * @src_pa: physical address of source, from GPU's point of view. This is a
> + * device physical address (dpa) when source is in vram. When source is in
> + * system memory, this is a dma mapped host physical address
> + * @src_is_vram: True if source buffer is in vram.
> + * @dst_pa: physical address of destination, from GPU's point of view. This is a
> + * device physical address (dpa) when the destination is in vram. When the
> + * destination is in system memory, this is a dma mapped host physical address
> + * @dst_is_vram: True if destination buffer is in vram.
> + * @size: The size of data to copy.
> + *
> + * Copy @size bytes of data from @src_pa to @dst_pa. The functionality
> + * and behavior of this function is similar to xe_migrate_copy function, but
> + * the interface is different. This function is a helper function supposed to
> + * be used by SVM subsytem. Since in SVM subsystem there is no buffer object
> + * and ttm, there is no src/dst bo as function input. Instead, we directly use
> + * src/dst's physical address as function input.
> + *
> + * Since the back store of any user malloc'ed or mmap'ed memory can be placed in
> + * system  memory, it can not be compressed. Thus this function doesn't need
> + * to consider copy CCS (compression control surface) data as xe_migrate_copy did.
> + *
> + * This function assumes the source buffer and destination buffer are all physically
> + * contiguous.
> + *
> + * We use gpu blitter to copy data. Source and destination are first mapped to
> + * migration vm which is a flat one level (L0) page table, then blitter is used to
> + * perform the copy.
> + *
> + * Return: Pointer to a dma_fence representing the last copy batch, or
> + * an error pointer on failure. If there is a failure, any copy operation
> + * started by the function call has been synced.
> + */
> +struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
> +				  u64 src_pa,
> +				  bool src_is_vram,
> +				  u64 dst_pa,
> +				  bool dst_is_vram,
> +				  u64 size)

This assumes both addresses are contiguous if size > 4k.

I don't think that needs to be the case when one of the addresses is sram
(dma_addr), as we dynamically map sram pages into PT entries, i.e. only
VRAM addresses need to be contiguous.

I'd suggest this function take an array of dma_addr and one vram
address to maximize copy efficiency. Also add a direction variable
(i.e. is vram the source or the destination).
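
i.e. something along these lines (signature is only a sketch, names are
hypothetical):

struct dma_fence *xe_migrate_vram(struct xe_migrate *m,
				  unsigned long npages,
				  dma_addr_t *sram_addr, /* npages entries */
				  u64 vram_addr,         /* must be contiguous */
				  bool to_vram);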

> +{
> +#define NUM_PT_PER_BLIT (MAX_PREEMPTDISABLE_TRANSFER / SZ_2M)
> +	struct xe_gt *gt = m->tile->primary_gt;
> +	struct xe_device *xe = gt_to_xe(gt);
> +	struct dma_fence *fence = NULL;
> +	u64 src_L0_ofs, dst_L0_ofs;
> +	u64 round_update_size;
> +	/* A slot is a 4K page of page table, covers 2M virtual address*/
> +	u32 pt_slot;
> +	int err;
> +
> +	while (size) {

We might not need this loop either if we make the caller enforce the
chunking (i.e. cap size at 2 MB or whatever MAX_PREEMPTDISABLE_TRANSFER
is).

> +		u32 batch_size = 2; /* arb_clear() + MI_BATCH_BUFFER_END */
> +		struct xe_sched_job *job;
> +		struct xe_bb *bb;
> +		u32 update_idx;
> +
> +		/* Copy at most MAX_PREEMPTDISABLE_TRANSFER bytes. Why? */
> +		round_update_size = min_t(u64, size, MAX_PREEMPTDISABLE_TRANSFER);
> +
> +		/* src pte update*/
> +		if (!src_is_vram)
> +			batch_size += pte_update_cmd_size(round_update_size);
> +		/* dst pte update*/
> +		if (!dst_is_vram)
> +			batch_size += pte_update_cmd_size(round_update_size);
> +
> +		/* Copy command size*/
> +		batch_size += EMIT_COPY_DW;
> +
> +		bb = xe_bb_new(gt, batch_size, true);
> +		if (IS_ERR(bb)) {
> +			err = PTR_ERR(bb);
> +			goto err_sync;
> +		}
> +
> +		if (!src_is_vram) {
> +			pt_slot = 0;
> +			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
> +					src_pa, round_update_size);
> +			src_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> +		}
> +		else
> +			src_L0_ofs = xe_migrate_vram_ofs(xe, src_pa);
> +
> +		if (!dst_is_vram) {
> +			pt_slot = NUM_PT_PER_BLIT;
> +			build_pt_update_batch_sram(m, bb, pt_slot * XE_PAGE_SIZE,
> +					dst_pa, round_update_size);
> +			dst_L0_ofs = xe_migrate_vm_addr(pt_slot, 0);
> +		}
> +		else
> +			dst_L0_ofs = xe_migrate_vram_ofs(xe, dst_pa);
> +
> +
> +		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> +		update_idx = bb->len;
> +
> +		emit_copy(gt, bb, src_L0_ofs, dst_L0_ofs, round_update_size,
> +			  XE_PAGE_SIZE);
> +
> +		mutex_lock(&m->job_mutex);
> +		job = xe_bb_create_migration_job(m->q, bb,
> +						 xe_migrate_batch_base(m, true),
> +						 update_idx);
> +		if (IS_ERR(job)) {
> +			err = PTR_ERR(job);
> +			goto err;
> +		}
> +
> +		xe_sched_job_add_migrate_flush(job, 0);
> +		xe_sched_job_arm(job);
> +		dma_fence_put(fence);
> +		fence = dma_fence_get(&job->drm.s_fence->finished);
> +		xe_sched_job_push(job);
> +		dma_fence_put(m->fence);
> +		m->fence = dma_fence_get(fence);
> +
> +		mutex_unlock(&m->job_mutex);
> +
> +		xe_bb_free(bb, fence);
> +		size -= round_update_size;
> +		src_pa += round_update_size;
> +		dst_pa += round_update_size;
> +		continue;
> +
> +err:
> +		mutex_unlock(&m->job_mutex);
> +		xe_bb_free(bb, NULL);
> +
> +err_sync:
> +		/* Sync partial copy if any. FIXME: under job_mutex? */
> +		if (fence) {
> +			dma_fence_wait(fence, false);
> +			dma_fence_put(fence);
> +		}
> +
> +		return ERR_PTR(err);
> +	}
> +
> +	return fence;
> +}
>  static void emit_clear_link_copy(struct xe_gt *gt, struct xe_bb *bb, u64 src_ofs,
>  				 u32 size, u32 pitch)
>  {
> diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
> index 701bb27349b0..98b480244265 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.h
> +++ b/drivers/gpu/drm/xe/xe_migrate.h
> @@ -101,6 +101,13 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
>  				  struct ttm_resource *dst,
>  				  bool copy_only_ccs);
>  
> +struct dma_fence *xe_migrate_pa(struct xe_migrate *m,
> +				  u64 src_pa,
> +				  bool src_is_vram,
> +				  u64 dst_pa,
> +				  bool dst_is_vram,
> +				  u64 size);
> +

An option would be to export xe_migrate_from_vram / xe_migrate_to_vram and
then internally call the function I suggest above with the correct
direction argument.
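
e.g. (sketch only, names are just a suggestion):

struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
				     unsigned long npages,
				     dma_addr_t *src_addr,
				     u64 dst_vram_addr);

struct dma_fence *xe_migrate_from_vram(struct xe_migrate *m,
				       unsigned long npages,
				       u64 src_vram_addr,
				       dma_addr_t *dst_addr);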

Matt

>  struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>  				   struct xe_bo *bo,
>  				   struct ttm_resource *dst);
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
@ 2024-04-10 22:23   ` Matthew Brost
  2024-04-15 20:13     ` Zeng, Oak
  2024-04-17 20:55   ` Matthew Brost
  1 sibling, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 22:23 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> Function xe_devm_alloc_pages allocates pages from drm buddy and performs
> housekeeping work for all the pages allocated, such as getting a page
> refcount, keeping a bitmap of all pages to denote whether a page is in
> use, and putting pages on a drm lru list for eviction purposes.
> 
> Function xe_devm_free_blocks returns a list of memory blocks to the drm
> buddy allocator.
> 
> Function xe_devm_page_free is a callback function from the hmm layer. It
> is called whenever a page's refcount reaches 1. This function clears
> this page's bit in the bitmap. If all the bits in the bitmap are
> cleared, it means all the pages have been freed, and we return all the pages
> in this memory block back to drm buddy.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
>  drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-

See comments about file in previous patches, they apply here too.

>  2 files changed, 152 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 624c1581f8ba..92a3ee90d5a7 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -46,4 +46,11 @@ static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
>  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
>  }
>  
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn);
> +
> +void xe_devm_free_blocks(struct list_head *blocks);
> +void xe_devm_page_free(struct page *page);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> index 31af56e8285a..5ba0cd9a70b0 100644
> --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -5,18 +5,161 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/sched/mm.h>
> -
> +#include <linux/gfp.h>
> +#include <linux/migrate.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dma-fence.h>
> +#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> +#include <drm/drm_buddy.h>
>  #include "xe_device_types.h"
>  #include "xe_svm.h"
> +#include "xe_migrate.h"
> +#include "xe_ttm_vram_mgr_types.h"
> +#include "xe_assert.h"
>  
> +/**
> + * struct xe_svm_block_meta - svm uses this data structure to manage each
> + * block allocated from drm buddy. This will be set to the drm_buddy_block's
> + * private field.
> + *
> + * @lru: used to link this block to drm's lru lists. This will be replaced
> + * with struct drm_lru_entity later.
> + * @tile: tile from which we allocated this block
> + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> + * 0 means this page is idle. When all bits of this block are 0, it is time
> + * to return this block to drm buddy subsystem.
> + */
> +struct xe_svm_block_meta {
> +	struct list_head lru;
> +	struct xe_tile *tile;
> +	unsigned long bitmap[];
> +};

This looks unnecessary to me, but admittedly I haven't looked at the LRU stuff.

I am thinking roughly...

- I think we drop all this special tracking (kill xe_svm_block_meta)
- Have functions to allocate / free the buddy blocks, store buddy blocks in userptr
- Blocks are allocated before migration to VRAM
- Blocks can be freed on either CPU fault after migration or on VMA
  destroy (probably depends on madvise hints for the VMA where we free
  blocks)
- Blocks allocated / freed at a chunk (xe_vma in this code) granularity
  (conceptually the same if we switch to a 1 to N ratio between xe_vma &
  pt_state)
- block->private == memory region so we can get the pfn from the block
- When we need migrate_pfns we loop over buddy blocks populating
  migrate.dst (rough sketch after this list)
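
Something roughly along these lines for the last point (completely
untested sketch; the function name is made up, and it reuses
block_offset_to_pfn() / xe_mem_region_to_tile() from this series, with
block->private pointing at the xe_mem_region):

	static void populate_migrate_dst(struct list_head *blocks,
					 unsigned long *dst)
	{
		struct drm_buddy_block *block;
		unsigned long i = 0;

		list_for_each_entry(block, blocks, link) {
			struct xe_mem_region *mr = block->private;
			struct xe_tile *tile = xe_mem_region_to_tile(mr);
			struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
			u64 pfn = block_offset_to_pfn(mr,
					drm_buddy_block_offset(block));
			u64 npages = drm_buddy_block_size(mm, block) >>
					PAGE_SHIFT;
			u64 j;

			/* one migrate_pfn entry per page of the block */
			for (j = 0; j < npages; ++j)
				dst[i++] = migrate_pfn(pfn + j);
		}
	}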

Also I noticed the drm_buddy_* calls in this file are not protected by a
lock; we will need that. Currently it is tile->mem.vram_mgr->lock in the
VRAM mgr code, so we either need to reach into there or move this lock
to a common place so the VRAM manager and block allocations for SVM
don't race with each other.
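
i.e. something as simple as this around the allocation (and the
equivalent around drm_buddy_free_block), assuming for now we can just
reach into the existing VRAM manager lock:

	mutex_lock(&tile->mem.vram_mgr->lock);
	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
				     blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
	mutex_unlock(&tile->mem.vram_mgr->lock);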

Matt

>  
>  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
>  {
>  	return 0;
>  }
>  
> -static void xe_devm_page_free(struct page *page)
> +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> +{
> +	/** DRM buddy's block offset is 0-based*/
> +	offset += mr->hpa_base;
> +
> +	return PHYS_PFN(offset);
> +}
> +
> +/** FIXME: we locked page by calling zone_device_page_init
> + *  in xe_devm_alloc_pages. Should we unlock pages here?
> + */
> +static void free_block(struct drm_buddy_block *block)
> +{
> +	struct xe_svm_block_meta *meta =
> +		(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +
> +	kfree(block->private);
> +	drm_buddy_free_block(mm, block);
> +}
> +
> +void xe_devm_page_free(struct page *page)
> +{
> +	struct drm_buddy_block *block =
> +					(struct drm_buddy_block *)page->zone_device_data;
> +	struct xe_svm_block_meta *meta =
> +					(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct xe_mem_region *mr = &tile->mem.vram;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	u64 size = drm_buddy_block_size(mm, block);
> +	u64 pages_per_block = size >> PAGE_SHIFT;
> +	u64 block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +	u64 page_pfn = page_to_pfn(page);
> +	u64 i = page_pfn - block_pfn_first;
> +
> +	xe_assert(tile->xe, i < pages_per_block);
> +	clear_bit(i, meta->bitmap);
> +	if (bitmap_empty(meta->bitmap, pages_per_block))
> +		free_block(block);
> +}
> +
> +/**
> + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> + *
> + * @xe_tile: which tile to allocate device memory from
> + * @npages: how many pages to allocate
> + * @blocks: used to return the allocated blocks
> + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> + * to hold at @npages entries.
> + *
> + * This function allocate blocks of memory from drm buddy allocator, and
> + * performs initialization work: set struct page::zone_device_data to point
> + * to the memory block; set/initialize drm_buddy_block::private field;
> + * lock_page for each page allocated; add memory block to lru managers lru
> + * list - this is TBD.
> + *
> + * return: 0 on success
> + * error code otherwise
> + */
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn)
> +{
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	struct drm_buddy_block *block, *tmp;
> +	u64 size = npages << PAGE_SHIFT;
> +	int ret = 0, i, j = 0;
> +
> +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> +						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
> +
> +	if (unlikely(ret))
> +		return ret;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link) {
> +		struct xe_mem_region *mr = &tile->mem.vram;
> +		u64 block_pfn_first, pages_per_block;
> +		struct xe_svm_block_meta *meta;
> +		u32 meta_size;
> +
> +		size = drm_buddy_block_size(mm, block);
> +		pages_per_block = size >> PAGE_SHIFT;
> +		meta_size = BITS_TO_BYTES(pages_per_block) +
> +					sizeof(struct xe_svm_block_meta);
> +		meta = kzalloc(meta_size, GFP_KERNEL);
> +		bitmap_fill(meta->bitmap, pages_per_block);
> +		meta->tile = tile;
> +		block->private = meta;
> +		block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +		for(i = 0; i < pages_per_block; i++) {
> +			struct page *page;
> +
> +			pfn[j++] = block_pfn_first + i;
> +			page = pfn_to_page(block_pfn_first + i);
> +			/**Lock page per hmm requirement, see hmm.rst.*/
> +			zone_device_page_init(page);
> +			page->zone_device_data = block;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * xe_devm_free_blocks() - free all memory blocks
> + *
> + * @blocks: memory blocks list head
> + */
> +void xe_devm_free_blocks(struct list_head *blocks)
>  {
> +	struct drm_buddy_block *block, *tmp;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link)
> +		free_block(block);
>  }
>  
>  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 24/31] drm/xe/svm: Create and destroy xe svm
  2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
@ 2024-04-10 22:25   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 22:25 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:35PM -0400, Oak Zeng wrote:
> Introduce a data structure xe_svm to represent a shared virtual
> address space b/t the CPU program and GPU program. Each process can
> only have at most one xe_svm instance. One xe_svm can have
> multiple gpu vms.
> 
> Introduce helper functions to create and destroy xe_svm instance.
> Once xe_svm instance is created, it is added to a global hash table
> keyed by mm_struct. Later on we can retrieve xe_svm using mm_struct.
> 

I don't think this is needed at all; I will explain a bit later in the
series, but I am quite sure this can be dropped entirely.

Matt

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile |  1 +
>  drivers/gpu/drm/xe/xe_svm.c | 77 +++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_svm.h | 23 +++++++++++
>  3 files changed, 101 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index cd5213ba182b..f89d77b6d654 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -129,6 +129,7 @@ xe-y += xe_bb.o \
>  	xe_sa.o \
>  	xe_sched_job.o \
>  	xe_step.o \
> +	xe_svm.o \
>  	xe_svm_devmem.o \
>  	xe_sync.o \
>  	xe_tile.o \
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> new file mode 100644
> index 000000000000..416cfc81c053
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/mutex.h>
> +#include <linux/mm_types.h>
> +#include <linux/kernel.h>
> +#include <linux/hashtable.h>
> +#include "xe_svm.h"
> +
> +#define XE_MAX_SVM_PROCESS 5 /* Maximumly support 32 SVM process*/
> +DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
> +
> +/**
> + * xe_create_svm() - create a svm instance
> + *
> + * one xe_svm struct represent a shared address space
> + * between cpu and gpu program. So one xe_svm is associated
> + * to one mm_struct.
> + *
> + * If xe_svm for this process already exists, just return
> + * it; otherwise create one.
> + *
> + * Return the created xe svm struct pointer
> + */
> +struct xe_svm *xe_create_svm(void)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct xe_svm *svm;
> +
> +	svm = xe_lookup_svm_by_mm(mm);
> +	if (svm)
> +		return svm;
> +
> +	svm = kzalloc(sizeof(struct xe_svm), GFP_KERNEL);
> +	svm->mm = mm;
> +	mutex_init(&svm->mutex);
> +	INIT_LIST_HEAD(&svm->vm_list);
> +	/** Add svm to global xe_svm_table hash table
> +	 *  use mm as key so later we can retrieve svm using mm
> +	 */
> +	hash_add_rcu(xe_svm_table, &svm->hnode, (uintptr_t)mm);
> +	return svm;
> +}
> +
> +/**
> + * xe_destroy_svm() - destroy a svm process
> + *
> + * @svm: the xe_svm to destroy
> + */
> +void xe_destroy_svm(struct xe_svm *svm)
> +{
> +	BUG_ON(list_empty(&svm->vm_list));
> +	hash_del_rcu(&svm->hnode);
> +	mutex_destroy(&svm->mutex);
> +	kfree(svm);
> +}
> +
> +
> +/**
> + * xe_lookup_svm_by_mm() - retrieve xe_svm from mm struct
> + *
> + * @mm: the mm struct of the svm to retrieve
> + *
> + * Return the xe_svm struct pointer, or NULL if fail
> + */
> +struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
> +{
> +	struct xe_svm *svm;
> +
> +	hash_for_each_possible_rcu(xe_svm_table, svm, hnode, (uintptr_t)mm)
> +		if (svm->mm == mm)
> +			return svm;
> +
> +	return NULL;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 92a3ee90d5a7..066740fb93f5 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -11,6 +11,29 @@
>  #include "xe_device.h"
>  #include "xe_assert.h"
>  
> +
> +/**
> + * struct xe_svm - data structure to represent a shared
> + * virtual address space from device side. xe_svm and
> + * mm_struct has a 1:1 relationship.
> + */
> +struct xe_svm {
> +	/** @mm: The mm_struct corresponding to this xe_svm */
> +	struct mm_struct *mm;
> +	/**
> +	 * @mutex: A lock protects below vm_list
> +	 */
> +	struct mutex mutex;
> +	/** @hnode: used to add this svm to a global xe_svm_hash table*/
> +	struct hlist_node hnode;
> +	/** @vm_list: a list gpu vm in this svm space */
> +	struct list_head vm_list;
> +};
> +
> +extern struct xe_svm *xe_create_svm(void);
> +void xe_destroy_svm(struct xe_svm *svm);
> +extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> +
>  /**
>   * xe_mem_region_pfn_to_dpa() - Calculate page's dpa from pfn
>   *
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 26/31] drm/xe: Make function lookup_vma public
  2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
@ 2024-04-10 22:26   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-10 22:26 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:37PM -0400, Oak Zeng wrote:
> Make this function public as it will be used by later patches. Also
> rename it to xe_vm_lookup_vma
> 

Like the previous patch, I'm pretty sure this can be dropped too. Again,
I will fully explain later.

Matt

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_pagefault.c | 10 ++++++++--
>  drivers/gpu/drm/xe/xe_vm.h           |  1 +
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> index 707a3466f36b..668984f0769e 100644
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> @@ -80,7 +80,13 @@ static bool vma_matches(struct xe_vma *vma, u64 page_addr)
>  	return true;
>  }
>  
> -static struct xe_vma *lookup_vma(struct xe_vm *vm, u64 page_addr)
> +/**
> + * xe_vm_lookup_vma() - look up a vma from address
> + *
> + * @vm: the xe_vm that the vma resides in
> + * @page_address: address to look up
> + */
> +struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr)
>  {
>  	struct xe_vma *vma = NULL;
>  
> @@ -166,7 +172,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>  		ret = -ENOENT;
>  		goto unlock_vm;
>  	}
> -	vma = lookup_vma(vm, pf->page_addr);
> +	vma = xe_vm_lookup_vma(vm, pf->page_addr);
>  	if (!vma) {
>  		ret = -EINVAL;
>  		goto unlock_vm;
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index 4860747592ad..d55330988e32 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -306,3 +306,4 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm);
>  void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap);
>  void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p);
>  void xe_vm_snapshot_free(struct xe_vm_snapshot *snap);
> +struct xe_vma *xe_vm_lookup_vma(struct xe_vm *vm, u64 page_addr);
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
@ 2024-04-11  2:07   ` Matthew Brost
  2024-04-12 17:24     ` Zeng, Oak
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2024-04-11  2:07 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> Under the picture of svm, CPU and GPU program share one same
> virtual address space. The backing store of this virtual address
> space can be either in system memory or device memory. Since GPU
> device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> Any CPU access to device memory causes a page fault. Implement
> a page fault handler to migrate memory back to system memory and
> map it to CPU page table so the CPU program can proceed.
> 
> Also unbind this page from GPU side, and free the original GPU
> device page
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile         |   1 +
>  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
>  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
>  drivers/gpu/drm/xe/xe_svm_migrate.c | 222 ++++++++++++++++++++++++++++
>  4 files changed, 230 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index f89d77b6d654..65289acdd563 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
>  	xe_step.o \
>  	xe_svm.o \
>  	xe_svm_devmem.o \
> +	xe_svm_migrate.o \

See comments about file org, same thing applies here. Let's put all of
the svm implementation in a single file.

>  	xe_sync.o \
>  	xe_tile.o \
>  	xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index f601dffe3fc1..c9e4239c44b4 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -7,11 +7,11 @@
>  #define __XE_SVM_H
>  
>  #include <linux/mm_types.h>
> +#include <linux/mm.h>
>  #include "xe_device_types.h"
>  #include "xe_device.h"
>  #include "xe_assert.h"
> -
> -struct xe_vm;
> +#include "xe_vm_types.h"
>  
>  /**
>   * struct xe_svm - data structure to represent a shared
> @@ -31,6 +31,9 @@ struct xe_svm {
>  	struct list_head vm_list;
>  };
>  
> +#define xe_svm_for_each_vm(svm, vm)					\
> +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> +

Don't think this is need, see below.

>  extern struct xe_svm *xe_create_svm(void);
>  void xe_destroy_svm(struct xe_svm *svm);
>  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
>  
>  void xe_devm_free_blocks(struct list_head *blocks);
>  void xe_devm_page_free(struct page *page);
> +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> index 088ac209ad80..32ada458f1dd 100644
> --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
>  	unsigned long bitmap[];
>  };
>  
> -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> -{
> -	return 0;
> -}
> -
>  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
>  {
>  	/** DRM buddy's block offset is 0-based*/
> @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
>  
>  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
>  	.page_free = xe_devm_page_free,
> -	.migrate_to_ram = xe_devm_migrate_to_ram,
> +	.migrate_to_ram = xe_svm_migrate_to_sram,

Again single file so this will be static function, no reason to export
this.

>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> new file mode 100644
> index 000000000000..0db831af098e
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> @@ -0,0 +1,222 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/gfp.h>
> +#include <linux/migrate.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dma-fence.h>
> +#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <drm/drm_buddy.h>
> +#include "xe_device_types.h"
> +#include "xe_device.h"
> +#include "xe_trace.h"
> +#include "xe_migrate.h"
> +#include "xe_ttm_vram_mgr_types.h"
> +#include "xe_assert.h"
> +#include "xe_pt.h"
> +#include "xe_svm.h"
> +#include "xe_vm.h"
> +
> +
> +/**
> + * alloc_host_page() - allocate one host page for the fault vma
> + *
> + * @dev: (GPU) device that will access the allocated page
> + * @vma: the fault vma that we need allocate page for
> + * @addr: the fault address. The allocated page is for this address
> + * @dma_addr: used to output the dma address of the allocated page.
> + * This dma address will be used for gpu to access this page. GPU
> + * access host page through a dma mapped address.
> + * @pfn: used to output the pfn of the allocated page.
> + *
> + * This function allocate one host page for the specified vma. It
> + * also does some prepare work for GPU to access this page, such
> + * as map this page to iommu (by calling dma_map_page).
> + *
> + * When this function returns, the page is locked.
> + *
> + * Return struct page pointer when success
> + * NULL otherwise
> + */
> +static struct page *alloc_host_page(struct device *dev,
> +							 struct vm_area_struct *vma,
> +							 unsigned long addr,
> +							 dma_addr_t *dma_addr,
> +							 unsigned long *pfn)

Weird alignment. Also, I don't think we want to allocate a page at a
time...

Beyond that, I can't say I'm a fan of two arguments being returned and
populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
haven't seen many functions in that style in Linux.

Probably makes more sense to have a function which allocates pages,
locks them, and populates the pfn array (migrate_pfn) rather than doing
this a page at a time.
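
Rough idea, untested and with a made-up name, doing only the allocate +
lock + migrate_pfn population here (the dma mapping would be a separate
batched step):

	static int alloc_and_lock_host_pages(struct vm_area_struct *vma,
					     unsigned long start,
					     unsigned long npages,
					     unsigned long *mpfn)
	{
		unsigned long addr = start;
		unsigned long i;

		for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
			struct page *page = alloc_page_vma(GFP_HIGHUSER,
							   vma, addr);

			if (!page)
				return -ENOMEM;	/* caller unwinds [0, i) */

			/* Lock page per hmm requirement, see hmm.rst. */
			lock_page(page);
			mpfn[i] = migrate_pfn(page_to_pfn(page));
		}

		return 0;
	}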

> +{
> +	struct page *page;
> +
> +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> +	if (unlikely(!page))
> +		return NULL;
> +
> +	/**Lock page per hmm requirement, see hmm.rst*/
> +	lock_page(page);
> +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);

The device is writing to these pages so I think DMA_BIDIRECTIONAL is
needed, right? As mentioned above I think this should be broken out into
a different step too.

> +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> +		unlock_page(page);
> +		__free_page(page);
> +		return NULL;
> +	}
> +
> +	*pfn = migrate_pfn(page_to_pfn(page));
> +	return page;
> +}
> +
> +static void free_host_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> + *
> + * @vma: The vma that the page is mapped to
> + * @addr: The virtual address that the page is mapped to
> + * @src_pfn: src page's page frame number
> + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> + *
> + * Allocate one page in system ram and copy memory from device memory
> + * to system ram.
> + *
> + * Return: 0 if this page is already in sram (no need to migrate)
> + * 1: successfully migrated this page from vram to sram.
> + * error code otherwise
> + */
> +static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
> +						unsigned long src_pfn, unsigned long *dst_pfn)
> +{

We definitely don't want to copy one page at a time. I touch on this in
[1]. Basically this is going to perform poorly unless we use larger
copies; the migrate code supports non-contiguous sram addresses, and
vram addresses will likely be contiguous due to the buddy allocator.
 
[1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1

> +	struct xe_mem_region *mr;
> +	struct xe_tile *tile;
> +	struct xe_device *xe;
> +	struct device *dev;
> +	dma_addr_t dma_addr = 0;
> +	struct dma_fence *fence;
> +	struct page *host_page;
> +	struct page *src_page;
> +	u64 src_dpa;
> +
> +	src_page = migrate_pfn_to_page(src_pfn);
> +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))

I'm going to say this is a bug if !src_page ||
!is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
we return -EFAULT (or another error code if that makes more sense). We
are migrating from the device where we know we have backing store from
the original fault.

> +		return 0;
> +
> +	mr = xe_page_to_mem_region(src_page);
> +	tile = xe_mem_region_to_tile(mr);
> +	xe = tile_to_xe(tile);
> +	dev = xe->drm.dev;
> +
> +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> +	if (!host_page)
> +		return -ENOMEM;
> +
> +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> +						dma_addr, false, PAGE_SIZE);
> +	if (IS_ERR(fence)) {
> +		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> +		free_host_page(host_page);
> +		return PTR_ERR(fence);
> +	}
> +
> +	dma_fence_wait(fence, false);

Even if we did want to migrate a page at a time, we only need to wait on
the last fence due to the ordered nature of exec queues.

> +	dma_fence_put(fence);
> +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);

With the above, we will likely unmap all dma pages in a single function
once the last fence is signaled.

> +	return 1;
> +}
> +
> +/**
> + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
> + *
> + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> + *
> + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> + *
> + * This function migrate one gpu vma which contains the fault address to sram.
> + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e., create one
> + * gpu vma for one cpu vma initially and try not to split it). So this scheme end
> + * up migrate at the vma granularity. This might not be the best performant scheme
> + *
> + * This can be tunned with a migration granularity for  performance, for example,
> + * migration 2M for each CPU page fault, or let user specify how much to migrate.
> + * This is more complex due to vma splitting.
> + *
> + * This function should also update GPU page table, so the fault virtual address
> + * points to the same sram location from GPU side. This is TBD.
> + *
> + * Return:
> + * 0 on success
> + * VM_FAULT_SIGBUS: failed to migrate page to system memory, application
> + * will be signaled a SIGBUG
> + */
> +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> +{
> +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);

I don't think this is needed... More below.

> +	unsigned long addr = vma->vm_start;
> +	u64 npages = vma_pages(vma);
> +	struct xe_vma *xe_vma;
> +	vm_fault_t ret = 0;
> +	struct xe_vm *vm;
> +	void *buf;
> +	int i;
> +
> +	struct migrate_vma migrate_vma = {
> +		.vma		= vmf->vma,
> +		.start		= vma->vm_start,
> +		.end		= vma->vm_end,

So I know in my PoC I had the fault user pointer (xe_vma) == the struct
vm_area_struct of the GPU fault. That is definitely wrong. We likely
want to allocate a sub-range of the vm_area_struct for the xe_vma; we
can call this a chunk size. Logical chunk sizes would be aligned 2MB,
64k, and finally 4k, trying the largest first... The chunk sizes are
trivial as we can likely just have a table with values; the key here is
that the vm_area_struct vm_start / vm_end are not what we want to use
here, rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the
xe_vma from the faulting page's vmf->page->zone_device_data field,
unless you have another use for that field...

I also commented on my patch with my suggestion to implement chunk sizes
too (rough sketch below).
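
For reference, the chunk table could be as dumb as this (made-up names,
untested), picking the largest aligned chunk around the fault address
that still fits inside the CPU vma:

	static const unsigned long xe_svm_chunk_sizes[] = {
		SZ_2M, SZ_64K, SZ_4K,
	};

	static unsigned long xe_svm_chunk_size(struct vm_area_struct *vas,
					       unsigned long addr)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(xe_svm_chunk_sizes); ++i) {
			unsigned long size = xe_svm_chunk_sizes[i];
			unsigned long start = ALIGN_DOWN(addr, size);

			if (start >= vas->vm_start &&
			    start + size <= vas->vm_end)
				return size;
		}

		return SZ_4K;
	}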

> +		.pgmap_owner	= xe,

Again helper for this.

> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page = vmf->page,
> +	};
> +
> +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> +	migrate_vma.src = buf;
> +	migrate_vma.dst = buf + npages;
> +	if (migrate_vma_setup(&migrate_vma) < 0) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto free_buf;
> +	}
> +
> +	if (!migrate_vma.cpages)

This is an error, need to set a return value.

> +		goto free_buf;
> +

We probably should check migrate.cpages != npages too as I also think
this is an error.

> +	for (i = 0; i < npages; i++) {
> +		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
> +							migrate_vma.dst + i);
> +		if (ret < 0) {
> +			ret = VM_FAULT_SIGBUS;
> +			break;
> +		}
> +
> +		/** Migration has been successful, free source page */
> +		if (ret == 1) {
> +			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
> +
> +			xe_devm_page_free(src_page);
> +		}
> +
> +		addr += PAGE_SIZE;
> +	}

I touched on this above; this should be reworked to roughly the
following (rough sketch after the list):

- alloc pages and populate migrate_vma.dst
- dma map sram pages
- migrate a chunk of contiguous vram addresses at a time
- wait on last dma fence from migrate
- unmap dma pages
- unlock and free all pages
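
In rough code form (untested; every helper below is a made-up name for
one of the steps above, and error unwinding of the allocated pages is
omitted):

	/* allocate + lock sram pages, fill migrate_vma.dst with migrate_pfn()s */
	ret = alloc_and_lock_host_pages(vma, addr, npages, migrate_vma.dst);
	if (ret)
		goto err_finalize;

	/* dma map all sram pages for the copy engine in one pass */
	ret = dma_map_host_pages(dev, migrate_vma.dst, dma_addrs, npages);
	if (ret)
		goto err_finalize;

	/* one blit per run of contiguous vram addresses in migrate_vma.src */
	fence = copy_contiguous_vram_runs(tile, migrate_vma.src, dma_addrs,
					  npages);
	if (IS_ERR(fence))
		goto err_unmap;

	/* exec queue is ordered, only the last fence matters */
	dma_fence_wait(fence, false);
	dma_fence_put(fence);

	migrate_vma_pages(&migrate_vma);
err_unmap:
	dma_unmap_host_pages(dev, dma_addrs, npages);
err_finalize:
	migrate_vma_finalize(&migrate_vma);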

> +
> +	xe_svm_for_each_vm(svm, vm) {
> +		xe_assert(xe, vm->mm == mm);
> +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> +		if (xe_vma)
> +			xe_vm_invalidate_vma(xe_vma);
> +	}

I've touched on why this isn't needed... I think one of these
migrate_vma_* functions will trigger all MMU notifiers registered for
the range. The notifier owns the invalidate then.

Beyond this, maybe I'm confused about a few things and how this all fits
together. Doesn't every user process have its own unique mm, fd, and vm
(i.e. its own address space)? If a user wants a shared address space
then they use threads with a single mm, fd, and vm.

So even if we had to resolve the xe_vma's here and do an invalidate
here, I am very confused about what this is doing. Is this the case with
multiple devices where each VM points to a different device? Again, in
that case I don't think an xe_svm structure would be needed; on a GPU
fault we should be able to detect from the faulting page's
zone_device_data and pgmap owner whether the fault already has physical
backing on another GPU, and resolve how to map it into the GPU with a
fault... Jason suggests this in the following thread [2] and I think I
agree with him.

[2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-240632fd3e35@amd.com/T/

> +	migrate_vma_pages(&migrate_vma);

This logic is going to change but ... 

On an error I think we only want to call migrate_vma_finalize to revert
pages back to the original state (i.e. migrate_vma_pages commits the
page changes which we don't want to do on an error).

> +	migrate_vma_finalize(&migrate_vma);
> +free_buf:
> +	kvfree(buf);
> +	return 0;

I don't think 0 should blindly be returned here; if there is an error,
return VM_FAULT_SIGBUS. We likely want a high-level error message too.

Matt

> +}
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
@ 2024-04-11  2:49   ` Matthew Brost
  2024-04-12 21:21     ` Zeng, Oak
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2024-04-11  2:49 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> Introduce a helper function xe_svm_migrate_vma_to_vram.
> 
> Since the source pages of the svm range can be physically not
> contiguous, and the destination vram pages can also be not
> contiguous, there is no easy way to migrate multiple pages per
> blitter command. We do page by page migration for now.
> 
> Migration is best effort. Even if we fail to migrate some pages,
> we will try to migrate the rest of the pages.
> 
> FIXME: Use one blitter command to copy when both src and dst are
> physically contiguous
> 

Yep, I touch on this throughout the series. Only vram needs to be
contiguous though, as we dynamically create PT mappings for sram pages
in the migrate code. Getting this in is a must and should be done
immediately IMO, as this is a very, very basic performance thing we know
needs to be done. We will likely have to tune this code quite a bit for
performance, so getting known things done would be helpful.

> FIXME: when a vma is partially migrated, split vma as we assume
> no mixture vma placement.
> 

Agree we do not want to support partial migrations. We likely want to
return -EAGAIN for something and fall back to a smaller xe_vma chunk
size, which I discussed in [1] and commented on in [2].

Migration should be best effort too: if we fail to migrate we can always
leave the backing store in sram.

I do have a question though: when do we get partial migrations? A user
having called mlock on some of the pages? I just want to make sure I
fully understand that case.

[1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
[2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1

> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h         |   2 +
>  drivers/gpu/drm/xe/xe_svm_migrate.c | 115 ++++++++++++++++++++++++++++

The same comments on file structure throughout the series apply here too.

>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index c9e4239c44b4..18ce2e3757c5 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
>  void xe_devm_free_blocks(struct list_head *blocks);
>  void xe_devm_page_free(struct page *page);
>  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> +							struct xe_tile *tile);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> index 0db831af098e..ab8dd1f58aa4 100644
> --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
>  	kvfree(buf);
>  	return 0;
>  }
> +
> +/**
> + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to vram
> + * Must be called with mmap_read_lock held.
> + * @vm: the vm that the vma belongs to
> + * @vma: the vma to migrate.
> + * @tile: the destination tile which holds the new backing store of the range
> + *
> + * Returns: negative errno on faiure, 0 on success
> + */
> +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> +							struct xe_vma *vma,
> +							struct xe_tile *tile)
> +{
> +	struct mm_struct *mm = vm->mm;
> +	unsigned long start = xe_vma_start(vma);
> +	unsigned long end = xe_vma_end(vma);
> +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> +	struct xe_mem_region *mr = &tile->mem.vram;
> +	struct vm_area_struct *vas;
> +
> +	struct migrate_vma migrate = {
> +		.start		= start,
> +		.end		= end,
> +		.pgmap_owner	= tile->xe,

Again helper to assign owner.

> +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> +	};
> +	struct device *dev = tile->xe->drm.dev;
> +	dma_addr_t *src_dma_addr;
> +	struct dma_fence *fence;
> +	struct page *src_page;
> +	LIST_HEAD(blocks);
> +	int ret = 0, i;
> +	u64 dst_dpa;
> +	void *buf;
> +
> +	mmap_assert_locked(mm);

This mmap_assert_locked is ambiguous; we should make it clear whether
this is read or write locked. Doesn't it have to be write-locked to do
the migrate pages?

A larger question about the locking... The CPU fault handler holds the
mmap lock in write mode, right? 

I'm asking because I think, at least initially, we want to hold the mmap
lock in a way that the GPU handler and CPU handler do not race, i.e.
from the fault userptr create in the GPU fault handler to issuing the
bind, we prevent the CPU fault handler from running.

I'm having issues figuring out how to prevent races between initial
binds of userptrs and userptr invalidates on faulting VMs. This race is
seen in xe_exec_fault_mode, for example... So preventing races between
the CPU / GPU fault handlers with the mmap lock probably is a good idea
initially. We can likely make the locking finer grained once this is all
working and I figure out how to handle this race better.

> +
> +	vas = find_vma_intersection(mm, start, start + 4);

find_vma should work fine here.

> +	if (!vas)
> +		return -ENOENT;
> +
> +	migrate.vma = vas;
> +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> +					GFP_KERNEL);
> +	if(!buf)
> +		return -ENOMEM;
> +	migrate.src = buf;
> +	migrate.dst = migrate.src + npages;
> +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);

Again as I discussed in [3] I think this should be broken out into a
different step with the blocks allocated before this, and here just
populate migrate.dst from the existing blocks.

[3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1

> +	if (ret)
> +		goto kfree_buf;
> +
> +	ret = migrate_vma_setup(&migrate);
> +	if (ret) {
> +		drm_err(&tile->xe->drm, "vma setup returned %d for range [%lx - %lx]\n",
> +				ret, start, end);
> +		goto free_dst_pages;
> +	}
> +
> +	/**FIXME: partial migration of a range print a warning for now.
> +	 * If this message is printed, we need to split xe_vma as we
> +	 * don't support a mixture placement of one vma
> +	 */
> +	if (migrate.cpages != npages)
> +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx - %lx], range is %ld pages, migrate only %ld pages\n",
> +				start, end, npages, migrate.cpages);

As discussed above, we shouldn't support this. We should fall back to a
smaller xe_vma chunk size until we find one that works, or simply leave
the pages in sram and map those pages to the GPU.

> +
> +	/**Migrate page by page for now.
> +	 * Both source pages and destination pages can physically not contiguous,
> +	 * there is no good way to migrate multiple pages per blitter command.
> +	 */

I've touched on this a bunch throughout the series; let's do better than
a page-at-a-time migration.

The algorithm should be very similar to what I discussed here [4] but
with a few key differences.

- I think the sram pages can be unpopulated (page == NULL) if the user
  has not yet touched the page
- Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid

In these cases this indicates we have to issue a copy for the pages we
have accumulated with contiguous vram addresses (rough sketch below).

[4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
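
Roughly like this (untested; copy_sram_run_to_vram() is a made-up
wrapper that issues one blit for @run_len dma-mapped sram pages into
contiguous vram starting at @dst_dpa; a real version would also break
the run when the vram pfns in migrate.dst stop being contiguous):

	unsigned long run_start = 0, run_len = 0;
	struct dma_fence *last_fence = NULL;

	for (i = 0; i < npages; ++i) {
		struct page *src_page = migrate_pfn_to_page(migrate.src[i]);
		bool hole = !src_page ||
			    !(migrate.src[i] & MIGRATE_PFN_MIGRATE);

		if (!hole) {
			if (!run_len)
				run_start = i;
			run_len++;
		}

		/* flush the accumulated run at a hole or at the end */
		if (run_len && (hole || i + 1 == npages)) {
			u64 dst_dpa = xe_mem_region_pfn_to_dpa(mr,
						migrate.dst[run_start]);

			fence = copy_sram_run_to_vram(tile,
						&src_dma_addr[run_start],
						dst_dpa, run_len);
			if (!IS_ERR(fence)) {
				dma_fence_put(last_fence);
				last_fence = fence;
			}
			run_len = 0;
		}
	}

	/* exec queue is ordered, waiting on the last fence is enough */
	if (last_fence) {
		dma_fence_wait(last_fence, false);
		dma_fence_put(last_fence);
	}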

> +	for (i = 0; i < npages; i++) {
> +		src_page = migrate_pfn_to_page(migrate.src[i]);
> +		if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))

I discussed this in the CPU fault patch: once we call migrate_vma_setup,
on subsequent errors we need to call migrate_vma_finalize to revert the
pages to the original state. At least I think so, if I am reading the
documentation correctly.

Here on error we just free the pages...

Matt

> +			goto free_dst_page;
> +
> +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> +		src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
> +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> +			drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
> +			goto free_dst_page;
> +		}
> +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> +				dst_dpa, true, PAGE_SIZE);
> +		if (IS_ERR(fence)) {
> +			drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
> +					migrate.src[i]);
> +			/**Migration is best effort. Even we failed here, we continue*/
> +			goto free_dst_page;
> +		}
> +		/**FIXME: Use the first migration's out fence as the second migration's input fence,
> +		 * and so on. Only wait the out fence of last migration?
> +		 */
> +		dma_fence_wait(fence, false);
> +		dma_fence_put(fence);
> +free_dst_page:
> +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> +	}
> +
> +	for (i = 0; i < npages; i++)
> +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
> +
> +	migrate_vma_pages(&migrate);
> +	migrate_vma_finalize(&migrate);
> +free_dst_pages:
> +	if (ret)
> +		xe_devm_free_blocks(&blocks);
> +kfree_buf:
> +	kfree(buf);
> +	return ret;
> +}
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr
  2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
@ 2024-04-11  2:50   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-11  2:50 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:41PM -0400, Oak Zeng wrote:
> xe_vma_is_fault_userptr is added to determine whether the vma is
> a fault userptr.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_vm.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
> index d55330988e32..a718f927e362 100644
> --- a/drivers/gpu/drm/xe/xe_vm.h
> +++ b/drivers/gpu/drm/xe/xe_vm.h
> @@ -166,6 +166,11 @@ static inline bool xe_vma_is_userptr(struct xe_vma *vma)
>  		!xe_vma_is_system_allocator(vma);
>  }
>  
> +static inline bool xe_vma_is_fault_userptr(struct xe_vma *vma)
> +{
> +	return xe_vma_is_userptr(vma) && (vma->gpuva.flags & XE_VMA_FAULT_USERPTR);

Presumably we never set XE_VMA_FAULT_USERPTR when xe_vma_is_userptr is
false, so it's probably safe to just check XE_VMA_FAULT_USERPTR here.

Matt

> +}
> +
>  /**
>   * to_userptr_vma() - Return a pointer to an embedding userptr vma
>   * @vma: Pointer to the embedded struct xe_vma
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator
  2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
@ 2024-04-11  2:55   ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-11  2:55 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:42PM -0400, Oak Zeng wrote:
> If applicable, migrate a vma from sram to vram for system allocator.
> Traditional userptr is not migrated. Only userptr created during
> fault (aka a userptr split from a system allocator vma) can be
> migrated.
> 
> FIXME: The migration should be conditional on user memory attributes
> setting. Add this logic when memory attributes are supported
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_pagefault.c | 9 ++++++++-
>  drivers/gpu/drm/xe/xe_vm.c           | 4 ----
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> index 668984f0769e..c6ba00049964 100644
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> @@ -20,6 +20,7 @@
>  #include "xe_guc_ct.h"
>  #include "xe_migrate.h"
>  #include "xe_trace.h"
> +#include "xe_svm.h"
>  #include "xe_vm.h"
>  
>  struct pagefault {
> @@ -209,12 +210,18 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>  
>  	if (xe_vma_is_userptr(vma) && write_locked) {
>  		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> +		struct xe_userptr *userptr = &uvma->userptr;
>  
>  		spin_lock(&vm->userptr.invalidated_lock);
> -		list_del_init(&uvma->userptr.invalidate_link);
> +		list_del_init(&userptr->invalidate_link);
>  		spin_unlock(&vm->userptr.invalidated_lock);
>  
> +		mmap_read_lock(userptr->notifier.mm);
> +		/**FIXME: Add migration policy here*/
> +		if (xe_vma_is_fault_userptr(vma))
> +			xe_svm_migrate_vma_to_vram(vm, vma, tile);

Agree we need a policy here...

See my comments about locking in [1]; I'm thinking that if we migrate we
likely want to hold the mmap lock until at least the bind is issued, to
prevent races with the CPU fault handler, at least initially.

[1] https://patchwork.freedesktop.org/patch/588542/?series=132229&rev=1

>  		ret = xe_vma_userptr_pin_pages(uvma);
> +		mmap_read_unlock(userptr->notifier.mm);
>  		if (ret)
>  			goto unlock_vm;
>  
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 498b36469d00..8a58fe144a02 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -71,16 +71,12 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
>  	struct xe_vma *vma = &uvma->vma;
>  	struct xe_vm *vm = xe_vma_vm(vma);
>  	struct xe_device *xe = vm->xe;
> -	struct xe_userptr *userptr;
>  	int ret;
>  
>  	lockdep_assert_held(&vm->lock);
>  	xe_assert(xe, xe_vma_is_userptr(vma));
>  
> -	userptr = &uvma->userptr;
> -	mmap_read_lock(userptr->notifier.mm);
>  	ret = xe_userptr_populate_range(uvma);
> -	mmap_read_unlock(userptr->notifier.mm);

Now you won't have the lock here for other callers of this function...
We probably need to have locked / unlocked versions or an argument here.

Matt

>  
>  	return ret;
>  }
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-11  2:07   ` Matthew Brost
@ 2024-04-12 17:24     ` Zeng, Oak
  2024-04-12 18:10       ` Matthew Brost
  0 siblings, 1 reply; 58+ messages in thread
From: Zeng, Oak @ 2024-04-12 17:24 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Thomas.Hellstrom, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 10:07 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > Under the picture of svm, CPU and GPU program share one same
> > virtual address space. The backing store of this virtual address
> > space can be either in system memory or device memory. Since GPU
> > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > Any CPU access to device memory causes a page fault. Implement
> > a page fault handler to migrate memory back to system memory and
> > map it to CPU page table so the CPU program can proceed.
> >
> > Also unbind this page from GPU side, and free the original GPU
> > device page
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile         |   1 +
> >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> ++++++++++++++++++++++++++++
> >  4 files changed, 230 insertions(+), 8 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index f89d77b6d654..65289acdd563 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> >  	xe_step.o \
> >  	xe_svm.o \
> >  	xe_svm_devmem.o \
> > +	xe_svm_migrate.o \
> 
> See comments about file org, same thing applies here. Let's put all of
> the svm implementation in a single file.
> 
> >  	xe_sync.o \
> >  	xe_tile.o \
> >  	xe_tile_sysfs.o \
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index f601dffe3fc1..c9e4239c44b4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -7,11 +7,11 @@
> >  #define __XE_SVM_H
> >
> >  #include <linux/mm_types.h>
> > +#include <linux/mm.h>
> >  #include "xe_device_types.h"
> >  #include "xe_device.h"
> >  #include "xe_assert.h"
> > -
> > -struct xe_vm;
> > +#include "xe_vm_types.h"
> >
> >  /**
> >   * struct xe_svm - data structure to represent a shared
> > @@ -31,6 +31,9 @@ struct xe_svm {
> >  	struct list_head vm_list;
> >  };
> >
> > +#define xe_svm_for_each_vm(svm, vm)
> 	\
> > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > +
> 
> Don't think this is need, see below.
> 
> >  extern struct xe_svm *xe_create_svm(void);
> >  void xe_destroy_svm(struct xe_svm *svm);
> >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >
> >  void xe_devm_free_blocks(struct list_head *blocks);
> >  void xe_devm_page_free(struct page *page);
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > index 088ac209ad80..32ada458f1dd 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> >  	unsigned long bitmap[];
> >  };
> >
> > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > -{
> > -	return 0;
> > -}
> > -
> >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> >  {
> >  	/** DRM buddy's block offset is 0-based*/
> > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
> >
> >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> >  	.page_free = xe_devm_page_free,
> > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> 
> Again single file so this will be static function, no reason to export
> this.
> 
> >  };
> >
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > new file mode 100644
> > index 000000000000..0db831af098e
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -0,0 +1,222 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> > + */
> > +
> > +#include <linux/gfp.h>
> > +#include <linux/migrate.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/bitops.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/kernel.h>
> > +#include <linux/slab.h>
> > +#include <drm/drm_buddy.h>
> > +#include "xe_device_types.h"
> > +#include "xe_device.h"
> > +#include "xe_trace.h"
> > +#include "xe_migrate.h"
> > +#include "xe_ttm_vram_mgr_types.h"
> > +#include "xe_assert.h"
> > +#include "xe_pt.h"
> > +#include "xe_svm.h"
> > +#include "xe_vm.h"
> > +
> > +
> > +/**
> > + * alloc_host_page() - allocate one host page for the fault vma
> > + *
> > + * @dev: (GPU) device that will access the allocated page
> > + * @vma: the fault vma that we need allocate page for
> > + * @addr: the fault address. The allocated page is for this address
> > + * @dma_addr: used to output the dma address of the allocated page.
> > + * This dma address will be used for gpu to access this page. GPU
> > + * access host page through a dma mapped address.
> > + * @pfn: used to output the pfn of the allocated page.
> > + *
> > + * This function allocate one host page for the specified vma. It
> > + * also does some prepare work for GPU to access this page, such
> > + * as map this page to iommu (by calling dma_map_page).
> > + *
> > + * When this function returns, the page is locked.
> > + *
> > + * Return struct page pointer when success
> > + * NULL otherwise
> > + */
> > +static struct page *alloc_host_page(struct device *dev,
> > +							 struct vm_area_struct
> *vma,
> > +							 unsigned long addr,
> > +							 dma_addr_t
> *dma_addr,
> > +							 unsigned long *pfn)
> 
> Weird alignment, also I don't think we are want to allocate a page at
> time...
> 
> Beyond that, can't say I'm a fan of 2 arguments being return and
> populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> haven't seen a lot that style function in Linux.
> 
> Probably makes more sense to have a function which allocates pages,
> locks them, and populates the pfn array (migrate_pfn) rather than doing
> this a page at a time.
> 
> > +{
> > +	struct page *page;
> > +
> > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > +	if (unlikely(!page))
> > +		return NULL;
> > +
> > +	/**Lock page per hmm requirement, see hmm.rst*/
> > +	lock_page(page);
> > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> DMA_FROM_DEVICE);
> 
> The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> needed, right? As mentioned above I think this should be broken out into
> a different step too.
> 
> > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > +		unlock_page(page);
> > +		__free_page(page);
> > +		return NULL;
> > +	}
> > +
> > +	*pfn = migrate_pfn(page_to_pfn(page));
> > +	return page;
> > +}
> > +
> > +static void free_host_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > + *
> > + * @vma: The vma that the page is mapped to
> > + * @addr: The virtual address that the page is mapped to
> > + * @src_pfn: src page's page frame number
> > + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> > + *
> > + * Allocate one page in system ram and copy memory from device memory
> > + * to system ram.
> > + *
> > + * Return: 0 if this page is already in sram (no need to migrate)
> > + * 1: successfully migrated this page from vram to sram.
> > + * error code otherwise
> > + */
> > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma,
> unsigned long addr,
> > +						unsigned long src_pfn,
> unsigned long *dst_pfn)
> > +{
> 
> We definitely don't want to copy 1 page at time. I touch on this in [1].
> Basically this going to perform poorly unless we use larger copies, the
> migrate code supports non-contigous sram addresses, and vram addresses
> will likely be contigous due to the buddy allocator.
> 
> [1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> 
> > +	struct xe_mem_region *mr;
> > +	struct xe_tile *tile;
> > +	struct xe_device *xe;
> > +	struct device *dev;
> > +	dma_addr_t dma_addr = 0;
> > +	struct dma_fence *fence;
> > +	struct page *host_page;
> > +	struct page *src_page;
> > +	u64 src_dpa;
> > +
> > +	src_page = migrate_pfn_to_page(src_pfn);
> > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> 
> I'm going to say this is a bug if !src_page ||
> !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
> we return -EFAULT (or another error code if that makes more sense). We
> are migrating from the device where we know we have backing store from
> the original fault.
> 
> > +		return 0;
> > +
> > +	mr = xe_page_to_mem_region(src_page);
> > +	tile = xe_mem_region_to_tile(mr);
> > +	xe = tile_to_xe(tile);
> > +	dev = xe->drm.dev;
> > +
> > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > +	if (!host_page)
> > +		return -ENOMEM;
> > +
> > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > +						dma_addr, false, PAGE_SIZE);
> > +	if (IS_ERR(fence)) {
> > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> DMA_FROM_DEVICE);
> > +		free_host_page(host_page);
> > +		return PTR_ERR(fence);
> > +	}
> > +
> > +	dma_fence_wait(fence, false);
> 
> Even if we did want to migrate a page at a time, we only need to wait on
> the last fence due to the ordered nature of exec queues.
> 
> > +	dma_fence_put(fence);
> > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> 
> With above, will likely unmap all dma pages in a single function once
> the last fence is signaled.
> 
> > +	return 1;
> > +}
> > +
> > +/**
> > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page
> fault
> > + *
> > + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> > + *
> > + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> > + *
> > + * This function migrate one gpu vma which contains the fault address to
> sram.
> > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e.,
> create one
> > + * gpu vma for one cpu vma initially and try not to split it). So this scheme
> end
> > + * up migrate at the vma granularity. This might not be the best performant
> scheme
> > + *
> > + * This can be tunned with a migration granularity for  performance, for
> example,
> > + * migration 2M for each CPU page fault, or let user specify how much to
> migrate.
> > + * This is more complex due to vma splitting.
> > + *
> > + * This function should also update GPU page table, so the fault virtual
> address
> > + * points to the same sram location from GPU side. This is TBD.
> > + *
> > + * Return:
> > + * 0 on success
> > + * VM_FAULT_SIGBUS: failed to migrate page to system memory,
> application
> > + * will be signaled a SIGBUG
> > + */
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > +{
> > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > +	struct xe_device *xe = tile_to_xe(tile);
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> 
> I don't think this is needed... More below.
> 
> > +	unsigned long addr = vma->vm_start;
> > +	u64 npages = vma_pages(vma);
> > +	struct xe_vma *xe_vma;
> > +	vm_fault_t ret = 0;
> > +	struct xe_vm *vm;
> > +	void *buf;
> > +	int i;
> > +
> > +	struct migrate_vma migrate_vma = {
> > +		.vma		= vmf->vma,
> > +		.start		= vma->vm_start,
> > +		.end		= vma->vm_end,
> 
> So I know in my PoC I had the fault user pointer (xe_vma) == struct
> vm_area_struct of the GPU fault. That is definitely wrong. We likely
> want to allocate sub-range of vm_area_struct for the xe_vma, we can call
> this a chunk size. Logical chunks sizes would be aligned 2MB, 64k, and
> finally 4k in sizes trying the largest first... The chunk sizes are
> trivial as we likely can just have table with values, the key here is
> the vm_area_struct vm_start / vm_end are not what we want to use here
> rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the xe_vma
> from the faulting page vmf->page->zone_device_data field unless you have
> another use that field...

You are right. Such work is planned in the memory attributes part that Himal is working on. We have a migration_granularity attribute which allows the user to set the chunk size.

> 
> I also comment on my patch with my suggestion implement chunk sizes too.
> 
> > +		.pgmap_owner	= xe,
> 
> Again helper for this.
> 
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page = vmf->page,
> > +	};
> > +
> > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > +	migrate_vma.src = buf;
> > +	migrate_vma.dst = buf + npages;
> > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > +		ret = VM_FAULT_SIGBUS;
> > +		goto free_buf;
> > +	}
> > +
> > +	if (!migrate_vma.cpages)
> 
> This is an error, need to set a return value.
> 
> > +		goto free_buf;
> > +
> 
> We probably should check migrate.cpages != npages too as I also think
> this is an error.
> 
> > +	for (i = 0; i < npages; i++) {
> > +		ret = migrate_page_vram_to_ram(vma, addr,
> migrate_vma.src[i],
> > +							migrate_vma.dst + i);
> > +		if (ret < 0) {
> > +			ret = VM_FAULT_SIGBUS;
> > +			break;
> > +		}
> > +
> > +		/** Migration has been successful, free source page */
> > +		if (ret == 1) {
> > +			struct page *src_page =
> migrate_pfn_to_page(migrate_vma.src[i]);
> > +
> > +			xe_devm_page_free(src_page);
> > +		}
> > +
> > +		addr += PAGE_SIZE;
> > +	}
> 
> I touch on this above, this should be reworked to roughly:
> 
> - alloc pages and populate migrate_vma.dst
> - dma map sram pages
> - migrate a chunk of contigous vram addresses at a time
> - wait on last dma fence from migrate
> - unmap dma pages
> - unlock and free all pages
> 
> > +
> > +	xe_svm_for_each_vm(svm, vm) {
> > +		xe_assert(xe, vm->mm == mm);
> > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > +		if (xe_vma)
> > +			xe_vm_invalidate_vma(xe_vma);
> > +	}
> 
> I've touched on why this isn't needed... I think one of these
> migrate_vma_* functions will trigger all MMU notifiers registered for
> the range. The notifier owns the invalidate then.

Very good point. Yes, after reading the migrate_vma_setup function, I agree this function will call the mmu notifiers with the MMU_NOTIFY_MIGRATE event. Today we invalidate the vma on this event. So yes, the code above is not needed.

I did identify one potential improvement: when the mmu notifier is called back with the MMU_NOTIFY_MIGRATE event, if migrate_vma_setup was called from the gpu page fault path, we can skip the gpu vma invalidation since we will update the gpu page table after the migration anyway. A page table invalidation is not needed in that case. But this should only be a minor improvement.
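For what it's worth, a minimal sketch of such a filter in the interval notifier's invalidate() callback, assuming the xe device is used as the migration/pgmap owner (mni_to_xe() is a made-up helper, the rest is the stock mmu_interval_notifier API):

static bool xe_svm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct xe_device *xe = mni_to_xe(mni);	/* made-up helper */

	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * Migration we started ourselves from the gpu fault path: the gpu
	 * page table is rebuilt right after the copy anyway, so the gpu
	 * side invalidation could be skipped here.
	 */
	if (range->event == MMU_NOTIFY_MIGRATE && range->owner == xe)
		return true;

	/* ... otherwise invalidate the gpu page table for the range ... */
	return true;
}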


> 
> Beyond this, maybe I'm confused about a few things and how this fits all
> together. Doesn't every user process have its own unique mm, fd, and vm
> (e.g. own address space)? If a user want a shared address space then use
> threads with a single mm, fd, and vm.

Yes, this is also my understanding. Each user process has its own mm struct and device fds. 

In the shared address space case, such as when multiple pthreads are created through pthread_create in one process, all those pthreads have different kernel task_structs, but all those task_structs (say we get them from the "current" macro) share the same mm struct, which means they all live inside one cpu address space.

With this work, we are basically extending this shared cpu address space to the gpu program, so the cpu program and gpu program can share one address space.

Since we allow the user to create multiple gpu vms for one device (let's focus on one device for now), each shared address space can have multiple gpu vms... each gpuvm should be able to register its own mmu notifier with core mm, even if those notifiers cover the same address range. But I will have to test this out. If all this works, the code above is not needed. If different gpuvms can't register mmu notifiers for the same address range, then we would need a fix....


> 
> So even if we had to resolve the xe_vma's here and do an invalidate here
> very confused what this is doing. This is this the case with multiple
> devices and each VM points to a different device? 

Right now I only focus on a single device. See above. This is to solve the one gpu device but multiple gpu vms case. But as said above, for now I don't think this is needed. I need to test the mmu notifier behavior more: whether it allows us to insert two notifiers for the same range in one mm....

Oak

> Again so that case I
> don't think a xe_svm structure would be needed, on GPU fault we should
> be to detect from the faulting page zone_device_data and pgmap owner
> if the fault already has a physical backing on another GPU and resolve
> how to map it into GPU with a fault... Jason suggests this in the
> following thread [2] and I think I agree with him.
> 
> [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-
> 240632fd3e35@amd.com/T/
> 
> > +	migrate_vma_pages(&migrate_vma);
> 
> This logic is going to change but ...
> 
> On an error I think we only want to call migrate_vma_finalize to revert
> pages back to the original state (i.e. migrate_vma_pages commits the
> page changes which we don't want to do on an error).
> 
> > +	migrate_vma_finalize(&migrate_vma);
> > +free_buf:
> > +	kvfree(buf);
> > +	return 0;
> 
> I don't think 0 should blindly be return here, if there is an error
> return VM_FAULT_SIGBUS. We likely want a high level error message too.
> 
> Matt
> 
> > +}
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-12 17:24     ` Zeng, Oak
@ 2024-04-12 18:10       ` Matthew Brost
  2024-04-12 18:39         ` Zeng, Oak
  0 siblings, 1 reply; 58+ messages in thread
From: Matthew Brost @ 2024-04-12 18:10 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Thomas.Hellstrom, Welty, Brian

On Fri, Apr 12, 2024 at 11:24:06AM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 10:07 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> > 
> > On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > > Under the picture of svm, CPU and GPU program share one same
> > > virtual address space. The backing store of this virtual address
> > > space can be either in system memory or device memory. Since GPU
> > > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > > Any CPU access to device memory causes a page fault. Implement
> > > a page fault handler to migrate memory back to system memory and
> > > map it to CPU page table so the CPU program can proceed.
> > >
> > > Also unbind this page from GPU side, and free the original GPU
> > > device page
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile         |   1 +
> > >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> > >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> > ++++++++++++++++++++++++++++
> > >  4 files changed, 230 insertions(+), 8 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> > >
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index f89d77b6d654..65289acdd563 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> > >  	xe_step.o \
> > >  	xe_svm.o \
> > >  	xe_svm_devmem.o \
> > > +	xe_svm_migrate.o \
> > 
> > See comments about file org, same thing applies here. Let's put all of
> > the svm implementation in a single file.
> > 
> > >  	xe_sync.o \
> > >  	xe_tile.o \
> > >  	xe_tile_sysfs.o \
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index f601dffe3fc1..c9e4239c44b4 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -7,11 +7,11 @@
> > >  #define __XE_SVM_H
> > >
> > >  #include <linux/mm_types.h>
> > > +#include <linux/mm.h>
> > >  #include "xe_device_types.h"
> > >  #include "xe_device.h"
> > >  #include "xe_assert.h"
> > > -
> > > -struct xe_vm;
> > > +#include "xe_vm_types.h"
> > >
> > >  /**
> > >   * struct xe_svm - data structure to represent a shared
> > > @@ -31,6 +31,9 @@ struct xe_svm {
> > >  	struct list_head vm_list;
> > >  };
> > >
> > > +#define xe_svm_for_each_vm(svm, vm)
> > 	\
> > > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > > +
> > 
> > Don't think this is need, see below.
> > 
> > >  extern struct xe_svm *xe_create_svm(void);
> > >  void xe_destroy_svm(struct xe_svm *svm);
> > >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > >
> > >  void xe_devm_free_blocks(struct list_head *blocks);
> > >  void xe_devm_page_free(struct page *page);
> > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > index 088ac209ad80..32ada458f1dd 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> > >  	unsigned long bitmap[];
> > >  };
> > >
> > > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > -{
> > > -	return 0;
> > > -}
> > > -
> > >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > >  {
> > >  	/** DRM buddy's block offset is 0-based*/
> > > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
> > >
> > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > >  	.page_free = xe_devm_page_free,
> > > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> > 
> > Again single file so this will be static function, no reason to export
> > this.
> > 
> > >  };
> > >
> > >  /**
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > new file mode 100644
> > > index 000000000000..0db831af098e
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > @@ -0,0 +1,222 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/gfp.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/dma-fence.h>
> > > +#include <linux/bitops.h>
> > > +#include <linux/bitmap.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/slab.h>
> > > +#include <drm/drm_buddy.h>
> > > +#include "xe_device_types.h"
> > > +#include "xe_device.h"
> > > +#include "xe_trace.h"
> > > +#include "xe_migrate.h"
> > > +#include "xe_ttm_vram_mgr_types.h"
> > > +#include "xe_assert.h"
> > > +#include "xe_pt.h"
> > > +#include "xe_svm.h"
> > > +#include "xe_vm.h"
> > > +
> > > +
> > > +/**
> > > + * alloc_host_page() - allocate one host page for the fault vma
> > > + *
> > > + * @dev: (GPU) device that will access the allocated page
> > > + * @vma: the fault vma that we need allocate page for
> > > + * @addr: the fault address. The allocated page is for this address
> > > + * @dma_addr: used to output the dma address of the allocated page.
> > > + * This dma address will be used for gpu to access this page. GPU
> > > + * access host page through a dma mapped address.
> > > + * @pfn: used to output the pfn of the allocated page.
> > > + *
> > > + * This function allocate one host page for the specified vma. It
> > > + * also does some prepare work for GPU to access this page, such
> > > + * as map this page to iommu (by calling dma_map_page).
> > > + *
> > > + * When this function returns, the page is locked.
> > > + *
> > > + * Return struct page pointer when success
> > > + * NULL otherwise
> > > + */
> > > +static struct page *alloc_host_page(struct device *dev,
> > > +							 struct vm_area_struct
> > *vma,
> > > +							 unsigned long addr,
> > > +							 dma_addr_t
> > *dma_addr,
> > > +							 unsigned long *pfn)
> > 
> > Weird alignment, also I don't think we are want to allocate a page at
> > time...
> > 
> > Beyond that, can't say I'm a fan of 2 arguments being return and
> > populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> > haven't seen a lot that style function in Linux.
> > 
> > Probably makes more sense to have a function which allocates pages,
> > locks them, and populates the pfn array (migrate_pfn) rather than doing
> > this a page at a time.
> > 
> > > +{
> > > +	struct page *page;
> > > +
> > > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > > +	if (unlikely(!page))
> > > +		return NULL;
> > > +
> > > +	/**Lock page per hmm requirement, see hmm.rst*/
> > > +	lock_page(page);
> > > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > DMA_FROM_DEVICE);
> > 
> > The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> > needed, right? As mentioned above I think this should be broken out into
> > a different step too.
> > 
> > > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > > +		unlock_page(page);
> > > +		__free_page(page);
> > > +		return NULL;
> > > +	}
> > > +
> > > +	*pfn = migrate_pfn(page_to_pfn(page));
> > > +	return page;
> > > +}
> > > +
> > > +static void free_host_page(struct page *page)
> > > +{
> > > +	unlock_page(page);
> > > +	put_page(page);
> > > +}
> > > +
> > > +/**
> > > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > > + *
> > > + * @vma: The vma that the page is mapped to
> > > + * @addr: The virtual address that the page is mapped to
> > > + * @src_pfn: src page's page frame number
> > > + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> > > + *
> > > + * Allocate one page in system ram and copy memory from device memory
> > > + * to system ram.
> > > + *
> > > + * Return: 0 if this page is already in sram (no need to migrate)
> > > + * 1: successfully migrated this page from vram to sram.
> > > + * error code otherwise
> > > + */
> > > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma,
> > unsigned long addr,
> > > +						unsigned long src_pfn,
> > unsigned long *dst_pfn)
> > > +{
> > 
> > We definitely don't want to copy 1 page at time. I touch on this in [1].
> > Basically this going to perform poorly unless we use larger copies, the
> > migrate code supports non-contigous sram addresses, and vram addresses
> > will likely be contigous due to the buddy allocator.
> > 
> > [1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> > 
> > > +	struct xe_mem_region *mr;
> > > +	struct xe_tile *tile;
> > > +	struct xe_device *xe;
> > > +	struct device *dev;
> > > +	dma_addr_t dma_addr = 0;
> > > +	struct dma_fence *fence;
> > > +	struct page *host_page;
> > > +	struct page *src_page;
> > > +	u64 src_dpa;
> > > +
> > > +	src_page = migrate_pfn_to_page(src_pfn);
> > > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> > 
> > I'm going to say this is a bug if !src_page ||
> > !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
> > we return -EFAULT (or another error code if that makes more sense). We
> > are migrating from the device where we know we have backing store from
> > the original fault.
> > 
> > > +		return 0;
> > > +
> > > +	mr = xe_page_to_mem_region(src_page);
> > > +	tile = xe_mem_region_to_tile(mr);
> > > +	xe = tile_to_xe(tile);
> > > +	dev = xe->drm.dev;
> > > +
> > > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > > +	if (!host_page)
> > > +		return -ENOMEM;
> > > +
> > > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > > +						dma_addr, false, PAGE_SIZE);
> > > +	if (IS_ERR(fence)) {
> > > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> > DMA_FROM_DEVICE);
> > > +		free_host_page(host_page);
> > > +		return PTR_ERR(fence);
> > > +	}
> > > +
> > > +	dma_fence_wait(fence, false);
> > 
> > Even if we did want to migrate a page at a time, we only need to wait on
> > the last fence due to the ordered nature of exec queues.
> > 
> > > +	dma_fence_put(fence);
> > > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > 
> > With above, will likely unmap all dma pages in a single function once
> > the last fence is signaled.
> > 
> > > +	return 1;
> > > +}
> > > +
> > > +/**
> > > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page
> > fault
> > > + *
> > > + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> > > + *
> > > + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> > > + *
> > > + * This function migrate one gpu vma which contains the fault address to
> > sram.
> > > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e.,
> > create one
> > > + * gpu vma for one cpu vma initially and try not to split it). So this scheme
> > end
> > > + * up migrate at the vma granularity. This might not be the best performant
> > scheme
> > > + *
> > > + * This can be tunned with a migration granularity for  performance, for
> > example,
> > > + * migration 2M for each CPU page fault, or let user specify how much to
> > migrate.
> > > + * This is more complex due to vma splitting.
> > > + *
> > > + * This function should also update GPU page table, so the fault virtual
> > address
> > > + * points to the same sram location from GPU side. This is TBD.
> > > + *
> > > + * Return:
> > > + * 0 on success
> > > + * VM_FAULT_SIGBUS: failed to migrate page to system memory,
> > application
> > > + * will be signaled a SIGBUG
> > > + */
> > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > > +{
> > > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > > +	struct vm_area_struct *vma = vmf->vma;
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> > 
> > I don't think this is needed... More below.
> > 
> > > +	unsigned long addr = vma->vm_start;
> > > +	u64 npages = vma_pages(vma);
> > > +	struct xe_vma *xe_vma;
> > > +	vm_fault_t ret = 0;
> > > +	struct xe_vm *vm;
> > > +	void *buf;
> > > +	int i;
> > > +
> > > +	struct migrate_vma migrate_vma = {
> > > +		.vma		= vmf->vma,
> > > +		.start		= vma->vm_start,
> > > +		.end		= vma->vm_end,
> > 
> > So I know in my PoC I had the fault user pointer (xe_vma) == struct
> > vm_area_struct of the GPU fault. That is definitely wrong. We likely
> > want to allocate sub-range of vm_area_struct for the xe_vma, we can call
> > this a chunk size. Logical chunks sizes would be aligned 2MB, 64k, and
> > finally 4k in sizes trying the largest first... The chunk sizes are
> > trivial as we likely can just have table with values, the key here is
> > the vm_area_struct vm_start / vm_end are not what we want to use here
> > rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the xe_vma

After I typed this, I realized I made a mistake here...

s/xe_vma_start/xe_vma_userptr/
s/xe_vma_end/xe_vma_userptr + xe_vma_size/

But you get the idea - zone_device_data points to Xe-specific chunk
data (currently xe_vma, could be xe_pt_state or something later if we
switch to 1:N).

Check AMD's + Nvidia's drivers and they both use this field in a similar
way.
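
i.e. roughly, in sketch form (xe_vma standing in for whatever chunk object
we settle on, other fields elided):

	/* vram alloc path: associate each device page with its chunk */
	page->zone_device_data = xe_vma;

	/* CPU fault handler: recover the chunk from the faulting page */
	struct xe_vma *xe_vma = vmf->page->zone_device_data;
	struct migrate_vma migrate = {
		.vma	= vmf->vma,
		.start	= xe_vma_userptr(xe_vma),
		.end	= xe_vma_userptr(xe_vma) + xe_vma_size(xe_vma),
		...
	};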

> > from the faulting page vmf->page->zone_device_data field unless you have
> > another use that field...
> 
> You are right. Such work is planned in the memory attributes part that Himal is working on. We have a migration_granularity attribute which allow user to set the chunk size.
> 

That makes sense. The chunk size is always just a hint though, right?

> > 
> > I also comment on my patch with my suggestion implement chunk sizes too.
> > 
> > > +		.pgmap_owner	= xe,
> > 
> > Again helper for this.
> > 
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page = vmf->page,
> > > +	};
> > > +
> > > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > > +	migrate_vma.src = buf;
> > > +	migrate_vma.dst = buf + npages;
> > > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > > +		ret = VM_FAULT_SIGBUS;
> > > +		goto free_buf;
> > > +	}
> > > +
> > > +	if (!migrate_vma.cpages)
> > 
> > This is an error, need to set a return value.
> > 
> > > +		goto free_buf;
> > > +
> > 
> > We probably should check migrate.cpages != npages too as I also think
> > this is an error.
> > 
> > > +	for (i = 0; i < npages; i++) {
> > > +		ret = migrate_page_vram_to_ram(vma, addr,
> > migrate_vma.src[i],
> > > +							migrate_vma.dst + i);
> > > +		if (ret < 0) {
> > > +			ret = VM_FAULT_SIGBUS;
> > > +			break;
> > > +		}
> > > +
> > > +		/** Migration has been successful, free source page */
> > > +		if (ret == 1) {
> > > +			struct page *src_page =
> > migrate_pfn_to_page(migrate_vma.src[i]);
> > > +
> > > +			xe_devm_page_free(src_page);
> > > +		}
> > > +
> > > +		addr += PAGE_SIZE;
> > > +	}
> > 
> > I touch on this above, this should be reworked to roughly:
> > 
> > - alloc pages and populate migrate_vma.dst
> > - dma map sram pages
> > - migrate a chunk of contigous vram addresses at a time
> > - wait on last dma fence from migrate
> > - unmap dma pages
> > - unlock and free all pages
> > 
> > > +
> > > +	xe_svm_for_each_vm(svm, vm) {
> > > +		xe_assert(xe, vm->mm == mm);
> > > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > > +		if (xe_vma)
> > > +			xe_vm_invalidate_vma(xe_vma);
> > > +	}
> > 
> > I've touched on why this isn't needed... I think one of these
> > migrate_vma_* functions will trigger all MMU notifiers registered for
> > the range. The notifier owns the invalidate then.
> 
> Very good point. Yes after read migrate_vma_setup function, I agree this function will call mmu notifiers with MMU_NOTIFY_MIGRATE event. Today we invalidate vma with this event. So yes, above codes are not needed.
> 
> I do identified one potential improvement: when mmu notifier is called back with MMU_NOTIFY_MIGRATE event, if the migrate_vma_setup is called from the gpu page fault path, we can ignore the gpu vma invalidation as we will update gpu page table later after migration anyway. I think an page table invalidation is not needed in such case. But this should be just a minor improvement.
>

We skip invalidations if the initial_bind flag is clear, which should
cover the initial GPU fault. There is certainly room for improvement /
optimization in the MMU notifier though; it is kind of messy right now
too. IMO work like this can be done once the basic design is working and
tests are in place to verify changes / optimizations.
 
> 
> > 
> > Beyond this, maybe I'm confused about a few things and how this fits all
> > together. Doesn't every user process have its own unique mm, fd, and vm
> > (e.g. own address space)? If a user want a shared address space then use
> > threads with a single mm, fd, and vm.
> 
> Yes, this is also my understanding. Each user process has its own mm struct and device fds. 
> 
> In a shared address space case, such as there are multiple pthread created through pthread_create in one process, all those pthreads should have different kernel task_struct, but all those task_struct (say we get it from "current" macro) should share one same mm struct, which means they all lives inside one cpu address space.
> 
> Now with this work, we are now basically extending this shared cpu address space to gpu program. So both cpu program and gpu program can share one address space.
> 
> Since we allow user to create multiple gpu vm for one device (lets focus on one device for now), so each shared address space can have multiple gpu vm... each gpuvm should be able to register its own mmu notifier to core mm, even if those notifier has the same address range. But I will have to test this out. If all this works, above codes are not needed. If different gpuvm can't register mmu notifier for same address range, then we would need a fix....
>

The mmu notifier code is implemented with an interval tree which
supports overlapping ranges (i.e. we can have multiple VMs register
notifiers with the same address range in a single MM).
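
E.g. something along these lines should just work - two gpu vms in the same
process each inserting a notifier over the exact same range (sketch only,
names made up, error handling elided):

	/* done once per gpu vm that opts into the system allocator */
	err = mmu_interval_notifier_insert(&vm->svm_notifier, current->mm,
					   start, end - start,
					   &xe_svm_notifier_ops);

Both inserts succeed; on an invalidation the core mm walks the interval
tree and calls back into every notifier that overlaps the range.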
 
> 
> > 
> > So even if we had to resolve the xe_vma's here and do an invalidate here
> > very confused what this is doing. This is this the case with multiple
> > devices and each VM points to a different device? 
> 
> Right now I only focus on single device. See above. This is to solve one gpu device but multiple gpu vm case. But as said above, for now I don't think this is needed. I need to test more on the mmu notifier behavior: whether it allow us to insert two notifiers for the same range for one mm....
> 

Agree that our focus should be on a single device now. If that design is
well thought out, I don't think extending this to multiple devices will
be a huge change either.

Matt

> Oak
> 
> Again so that case I
> > don't think a xe_svm structure would be needed, on GPU fault we should
> > be to detect from the faulting page zone_device_data and pgmap owner
> > if the fault already has a physical backing on another GPU and resolve
> > how to map it into GPU with a fault... Jason suggests this in the
> > following thread [2] and I think I agree with him.
> > 
> > [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-
> > 240632fd3e35@amd.com/T/
> > 
> > > +	migrate_vma_pages(&migrate_vma);
> > 
> > This logic is going to change but ...
> > 
> > On an error I think we only want to call migrate_vma_finalize to revert
> > pages back to the original state (i.e. migrate_vma_pages commits the
> > page changes which we don't want to do on an error).
> > 
> > > +	migrate_vma_finalize(&migrate_vma);
> > > +free_buf:
> > > +	kvfree(buf);
> > > +	return 0;
> > 
> > I don't think 0 should blindly be return here, if there is an error
> > return VM_FAULT_SIGBUS. We likely want a high level error message too.
> > 
> > Matt
> > 
> > > +}
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [v2 27/31] drm/xe/svm: Handle CPU page fault
  2024-04-12 18:10       ` Matthew Brost
@ 2024-04-12 18:39         ` Zeng, Oak
  0 siblings, 0 replies; 58+ messages in thread
From: Zeng, Oak @ 2024-04-12 18:39 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Thomas.Hellstrom, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Friday, April 12, 2024 2:11 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Fri, Apr 12, 2024 at 11:24:06AM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Wednesday, April 10, 2024 10:07 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > > Brian <brian.welty@intel.com>
> > > Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> > >
> > > On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > > > Under the picture of svm, CPU and GPU program share one same
> > > > virtual address space. The backing store of this virtual address
> > > > space can be either in system memory or device memory. Since GPU
> > > > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > > > Any CPU access to device memory causes a page fault. Implement
> > > > a page fault handler to migrate memory back to system memory and
> > > > map it to CPU page table so the CPU program can proceed.
> > > >
> > > > Also unbind this page from GPU side, and free the original GPU
> > > > device page
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile         |   1 +
> > > >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> > > >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> > > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> > > ++++++++++++++++++++++++++++
> > > >  4 files changed, 230 insertions(+), 8 deletions(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > index f89d77b6d654..65289acdd563 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> > > >  	xe_step.o \
> > > >  	xe_svm.o \
> > > >  	xe_svm_devmem.o \
> > > > +	xe_svm_migrate.o \
> > >
> > > See comments about file org, same thing applies here. Let's put all of
> > > the svm implementation in a single file.
> > >
> > > >  	xe_sync.o \
> > > >  	xe_tile.o \
> > > >  	xe_tile_sysfs.o \
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > > index f601dffe3fc1..c9e4239c44b4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -7,11 +7,11 @@
> > > >  #define __XE_SVM_H
> > > >
> > > >  #include <linux/mm_types.h>
> > > > +#include <linux/mm.h>
> > > >  #include "xe_device_types.h"
> > > >  #include "xe_device.h"
> > > >  #include "xe_assert.h"
> > > > -
> > > > -struct xe_vm;
> > > > +#include "xe_vm_types.h"
> > > >
> > > >  /**
> > > >   * struct xe_svm - data structure to represent a shared
> > > > @@ -31,6 +31,9 @@ struct xe_svm {
> > > >  	struct list_head vm_list;
> > > >  };
> > > >
> > > > +#define xe_svm_for_each_vm(svm, vm)
> > > 	\
> > > > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > > > +
> > >
> > > Don't think this is need, see below.
> > >
> > > >  extern struct xe_svm *xe_create_svm(void);
> > > >  void xe_destroy_svm(struct xe_svm *svm);
> > > >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > > > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > > >
> > > >  void xe_devm_free_blocks(struct list_head *blocks);
> > > >  void xe_devm_page_free(struct page *page);
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > index 088ac209ad80..32ada458f1dd 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> > > >  	unsigned long bitmap[];
> > > >  };
> > > >
> > > > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > -{
> > > > -	return 0;
> > > > -}
> > > > -
> > > >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > >  {
> > > >  	/** DRM buddy's block offset is 0-based*/
> > > > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head
> *blocks)
> > > >
> > > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > >  	.page_free = xe_devm_page_free,
> > > > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> > >
> > > Again single file so this will be static function, no reason to export
> > > this.
> > >
> > > >  };
> > > >
> > > >  /**
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > new file mode 100644
> > > > index 000000000000..0db831af098e
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > @@ -0,0 +1,222 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2023 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/gfp.h>
> > > > +#include <linux/migrate.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/dma-fence.h>
> > > > +#include <linux/bitops.h>
> > > > +#include <linux/bitmap.h>
> > > > +#include <linux/kernel.h>
> > > > +#include <linux/slab.h>
> > > > +#include <drm/drm_buddy.h>
> > > > +#include "xe_device_types.h"
> > > > +#include "xe_device.h"
> > > > +#include "xe_trace.h"
> > > > +#include "xe_migrate.h"
> > > > +#include "xe_ttm_vram_mgr_types.h"
> > > > +#include "xe_assert.h"
> > > > +#include "xe_pt.h"
> > > > +#include "xe_svm.h"
> > > > +#include "xe_vm.h"
> > > > +
> > > > +
> > > > +/**
> > > > + * alloc_host_page() - allocate one host page for the fault vma
> > > > + *
> > > > + * @dev: (GPU) device that will access the allocated page
> > > > + * @vma: the fault vma that we need allocate page for
> > > > + * @addr: the fault address. The allocated page is for this address
> > > > + * @dma_addr: used to output the dma address of the allocated page.
> > > > + * This dma address will be used for gpu to access this page. GPU
> > > > + * access host page through a dma mapped address.
> > > > + * @pfn: used to output the pfn of the allocated page.
> > > > + *
> > > > + * This function allocate one host page for the specified vma. It
> > > > + * also does some prepare work for GPU to access this page, such
> > > > + * as map this page to iommu (by calling dma_map_page).
> > > > + *
> > > > + * When this function returns, the page is locked.
> > > > + *
> > > > + * Return struct page pointer when success
> > > > + * NULL otherwise
> > > > + */
> > > > +static struct page *alloc_host_page(struct device *dev,
> > > > +							 struct vm_area_struct
> > > *vma,
> > > > +							 unsigned long addr,
> > > > +							 dma_addr_t
> > > *dma_addr,
> > > > +							 unsigned long *pfn)
> > >
> > > Weird alignment, also I don't think we are want to allocate a page at
> > > time...
> > >
> > > Beyond that, can't say I'm a fan of 2 arguments being return and
> > > populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> > > haven't seen a lot that style function in Linux.
> > >
> > > Probably makes more sense to have a function which allocates pages,
> > > locks them, and populates the pfn array (migrate_pfn) rather than doing
> > > this a page at a time.
> > >
> > > > +{
> > > > +	struct page *page;
> > > > +
> > > > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > > > +	if (unlikely(!page))
> > > > +		return NULL;
> > > > +
> > > > +	/**Lock page per hmm requirement, see hmm.rst*/
> > > > +	lock_page(page);
> > > > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > >
> > > The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> > > needed, right? As mentioned above I think this should be broken out into
> > > a different step too.
> > >
> > > > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > > > +		unlock_page(page);
> > > > +		__free_page(page);
> > > > +		return NULL;
> > > > +	}
> > > > +
> > > > +	*pfn = migrate_pfn(page_to_pfn(page));
> > > > +	return page;
> > > > +}
> > > > +
> > > > +static void free_host_page(struct page *page)
> > > > +{
> > > > +	unlock_page(page);
> > > > +	put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > > > + *
> > > > + * @vma: The vma that the page is mapped to
> > > > + * @addr: The virtual address that the page is mapped to
> > > > + * @src_pfn: src page's page frame number
> > > > + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> > > > + *
> > > > + * Allocate one page in system ram and copy memory from device
> memory
> > > > + * to system ram.
> > > > + *
> > > > + * Return: 0 if this page is already in sram (no need to migrate)
> > > > + * 1: successfully migrated this page from vram to sram.
> > > > + * error code otherwise
> > > > + */
> > > > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma,
> > > unsigned long addr,
> > > > +						unsigned long src_pfn,
> > > unsigned long *dst_pfn)
> > > > +{
> > >
> > > We definitely don't want to copy 1 page at time. I touch on this in [1].
> > > Basically this going to perform poorly unless we use larger copies, the
> > > migrate code supports non-contigous sram addresses, and vram addresses
> > > will likely be contigous due to the buddy allocator.
> > >
> > > [1]
> https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> > >
> > > > +	struct xe_mem_region *mr;
> > > > +	struct xe_tile *tile;
> > > > +	struct xe_device *xe;
> > > > +	struct device *dev;
> > > > +	dma_addr_t dma_addr = 0;
> > > > +	struct dma_fence *fence;
> > > > +	struct page *host_page;
> > > > +	struct page *src_page;
> > > > +	u64 src_dpa;
> > > > +
> > > > +	src_page = migrate_pfn_to_page(src_pfn);
> > > > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> > >
> > > I'm going to say this is a bug if !src_page ||
> > > !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE)
> and
> > > we return -EFAULT (or another error code if that makes more sense). We
> > > are migrating from the device where we know we have backing store from
> > > the original fault.
> > >
> > > > +		return 0;
> > > > +
> > > > +	mr = xe_page_to_mem_region(src_page);
> > > > +	tile = xe_mem_region_to_tile(mr);
> > > > +	xe = tile_to_xe(tile);
> > > > +	dev = xe->drm.dev;
> > > > +
> > > > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > > > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > > > +	if (!host_page)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > > > +						dma_addr, false, PAGE_SIZE);
> > > > +	if (IS_ERR(fence)) {
> > > > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > > > +		free_host_page(host_page);
> > > > +		return PTR_ERR(fence);
> > > > +	}
> > > > +
> > > > +	dma_fence_wait(fence, false);
> > >
> > > Even if we did want to migrate a page at a time, we only need to wait on
> > > the last fence due to the ordered nature of exec queues.
> > >
> > > > +	dma_fence_put(fence);
> > > > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > >
> > > With above, will likely unmap all dma pages in a single function once
> > > the last fence is signaled.
> > >
> > > > +	return 1;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU
> page
> > > fault
> > > > + *
> > > > + * @vmf: cpu vm fault structure, contains fault information such as vma
> etc.
> > > > + *
> > > > + * Note, this is in CPU's vm fault handler, caller holds the mmap read
> lock.
> > > > + *
> > > > + * This function migrate one gpu vma which contains the fault address to
> > > sram.
> > > > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e.,
> > > create one
> > > > + * gpu vma for one cpu vma initially and try not to split it). So this
> scheme
> > > end
> > > > + * up migrate at the vma granularity. This might not be the best
> performant
> > > scheme
> > > > + *
> > > > + * This can be tunned with a migration granularity for  performance, for
> > > example,
> > > > + * migration 2M for each CPU page fault, or let user specify how much
> to
> > > migrate.
> > > > + * This is more complex due to vma splitting.
> > > > + *
> > > > + * This function should also update GPU page table, so the fault virtual
> > > address
> > > > + * points to the same sram location from GPU side. This is TBD.
> > > > + *
> > > > + * Return:
> > > > + * 0 on success
> > > > + * VM_FAULT_SIGBUS: failed to migrate page to system memory,
> > > application
> > > > + * will be signaled a SIGBUG
> > > > + */
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > > > +{
> > > > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > > > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > > > +	struct xe_device *xe = tile_to_xe(tile);
> > > > +	struct vm_area_struct *vma = vmf->vma;
> > > > +	struct mm_struct *mm = vma->vm_mm;
> > > > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> > >
> > > I don't think this is needed... More below.
> > >
> > > > +	unsigned long addr = vma->vm_start;
> > > > +	u64 npages = vma_pages(vma);
> > > > +	struct xe_vma *xe_vma;
> > > > +	vm_fault_t ret = 0;
> > > > +	struct xe_vm *vm;
> > > > +	void *buf;
> > > > +	int i;
> > > > +
> > > > +	struct migrate_vma migrate_vma = {
> > > > +		.vma		= vmf->vma,
> > > > +		.start		= vma->vm_start,
> > > > +		.end		= vma->vm_end,
> > >
> > > So I know in my PoC I had the fault user pointer (xe_vma) == struct
> > > vm_area_struct of the GPU fault. That is definitely wrong. We likely
> > > want to allocate sub-range of vm_area_struct for the xe_vma, we can call
> > > this a chunk size. Logical chunks sizes would be aligned 2MB, 64k, and
> > > finally 4k in sizes trying the largest first... The chunk sizes are
> > > trivial as we likely can just have table with values, the key here is
> > > the vm_area_struct vm_start / vm_end are not what we want to use here
> > > rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the
> xe_vma
> 
> After I typed this, realized I made a mistake here...
> 
> s/xe_vma_start/xe_vma_userptr/
> s/xe_vma_end/xe_vma_userptr + xe_vma_size/
> 
> But you get the idea - the zone_device_data points Xe specific chunk
> data (currently xe_vma, could be xe_pt_state our something later if we
> switch to 1:N).
> 
> Check AMD's + Nvidia's drivers and they both use this field in a similar
> way.
> 
> > > from the faulting page vmf->page->zone_device_data field unless you have
> > > another use that field...
> >
> > You are right. Such work is planned in the memory attributes part that Himal
> is working on. We have a migration_granularity attribute which allow user to
> set the chunk size.
> >
> 
> That makes sense. The chunk size is always just hint though, right?


I believe we should have a default chunk size, such as 2M, and the user can override it through hints.
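
Roughly what I have in mind for the default selection (sketch only;
fault_addr and vma, the cpu vma containing the fault, are assumed from the
surrounding fault handler, and a user hint would simply replace the table):

	static const unsigned long chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };
	unsigned long start, end;
	int i;

	/* largest aligned chunk around the fault that still fits in the cpu vma */
	for (i = 0; i < ARRAY_SIZE(chunk_sizes); i++) {
		start = ALIGN_DOWN(fault_addr, chunk_sizes[i]);
		end = start + chunk_sizes[i];
		if (start >= vma->vm_start && end <= vma->vm_end)
			break;
	}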

> 
> > >
> > > I also comment on my patch with my suggestion implement chunk sizes too.
> > >
> > > > +		.pgmap_owner	= xe,
> > >
> > > Again helper for this.
> > >
> > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page = vmf->page,
> > > > +	};
> > > > +
> > > > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > > > +	migrate_vma.src = buf;
> > > > +	migrate_vma.dst = buf + npages;
> > > > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > > > +		ret = VM_FAULT_SIGBUS;
> > > > +		goto free_buf;
> > > > +	}
> > > > +
> > > > +	if (!migrate_vma.cpages)
> > >
> > > This is an error, need to set a return value.
> > >
> > > > +		goto free_buf;
> > > > +
> > >
> > > We probably should check migrate.cpages != npages too as I also think
> > > this is an error.
> > >
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		ret = migrate_page_vram_to_ram(vma, addr,
> > > migrate_vma.src[i],
> > > > +							migrate_vma.dst + i);
> > > > +		if (ret < 0) {
> > > > +			ret = VM_FAULT_SIGBUS;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		/** Migration has been successful, free source page */
> > > > +		if (ret == 1) {
> > > > +			struct page *src_page =
> > > migrate_pfn_to_page(migrate_vma.src[i]);
> > > > +
> > > > +			xe_devm_page_free(src_page);
> > > > +		}
> > > > +
> > > > +		addr += PAGE_SIZE;
> > > > +	}
> > >
> > > I touch on this above, this should be reworked to roughly:
> > >
> > > - alloc pages and populate migrate_vma.dst
> > > - dma map sram pages
> > > - migrate a chunk of contigous vram addresses at a time
> > > - wait on last dma fence from migrate
> > > - unmap dma pages
> > > - unlock and free all pages
> > >
> > > > +
> > > > +	xe_svm_for_each_vm(svm, vm) {
> > > > +		xe_assert(xe, vm->mm == mm);
> > > > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > > > +		if (xe_vma)
> > > > +			xe_vm_invalidate_vma(xe_vma);
> > > > +	}
> > >
> > > I've touched on why this isn't needed... I think one of these
> > > migrate_vma_* functions will trigger all MMU notifiers registered for
> > > the range. The notifier owns the invalidate then.
> >
> > Very good point. Yes after read migrate_vma_setup function, I agree this
> function will call mmu notifiers with MMU_NOTIFY_MIGRATE event. Today we
> invalidate vma with this event. So yes, above codes are not needed.
> >
> > I do identified one potential improvement: when mmu notifier is called back
> with MMU_NOTIFY_MIGRATE event, if the migrate_vma_setup is called from
> the gpu page fault path, we can ignore the gpu vma invalidation as we will
> update gpu page table later after migration anyway. I think an page table
> invalidation is not needed in such case. But this should be just a minor
> improvement.
> >
> 
> We skip invalidations if the initial_bind flag is clear which should
> cover at the initial GPU fault. There is certainly room for improvement
> / optimizations in the MMU notifier though, it is kinda messy right now
> too. IMO work like this can be done once the basic design is working +
> tests in place to verify changes / optimizations.
> 

agreed

> >
> > >
> > > Beyond this, maybe I'm confused about a few things and how this fits all
> > > together. Doesn't every user process have its own unique mm, fd, and vm
> > > (e.g. own address space)? If a user want a shared address space then use
> > > threads with a single mm, fd, and vm.
> >
> > Yes, this is also my understanding. Each user process has its own mm struct
> and device fds.
> >
> > In a shared address space case, such as there are multiple pthread created
> through pthread_create in one process, all those pthreads should have different
> kernel task_struct, but all those task_struct (say we get it from "current" macro)
> should share one same mm struct, which means they all lives inside one cpu
> address space.
> >
> > Now with this work, we are now basically extending this shared cpu address
> space to gpu program. So both cpu program and gpu program can share one
> address space.
> >
> > Since we allow user to create multiple gpu vm for one device (lets focus on
> one device for now), so each shared address space can have multiple gpu vm...
> each gpuvm should be able to register its own mmu notifier to core mm, even
> if those notifier has the same address range. But I will have to test this out. If
> all this works, above codes are not needed. If different gpuvm can't register
> mmu notifier for same address range, then we would need a fix....
> >
> 
> The mmu notifier code is implemented with the interval tree which
> supports overlapping rangers (i.e. we can have multiple VM's register
> notifiers with the sam address range in a single MM).

Ok, that is great. I will delete the xe_svm struct.

Oak

> 
> >
> > >
> > > So even if we had to resolve the xe_vma's here and do an invalidate here
> > > very confused what this is doing. This is this the case with multiple
> > > devices and each VM points to a different device?
> >
> > Right now I only focus on single device. See above. This is to solve one gpu
> device but multiple gpu vm case. But as said above, for now I don't think this is
> needed. I need to test more on the mmu notifier behavior: whether it allow us
> to insert two notifiers for the same range for one mm....
> >
> 
> Agree that our focus should be on single device now. If that design it
> well thought out I don't think extending this to multiple devices will
> be a huge change either.
> 
> Matt
> 
> > Oak
> >
> > Again so that case I
> > > don't think a xe_svm structure would be needed, on GPU fault we should
> > > be to detect from the faulting page zone_device_data and pgmap owner
> > > if the fault already has a physical backing on another GPU and resolve
> > > how to map it into GPU with a fault... Jason suggests this in the
> > > following thread [2] and I think I agree with him.
> > >
> > > [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-
> > > 240632fd3e35@amd.com/T/
> > >
> > > > +	migrate_vma_pages(&migrate_vma);
> > >
> > > This logic is going to change but ...
> > >
> > > On an error I think we only want to call migrate_vma_finalize to revert
> > > pages back to the original state (i.e. migrate_vma_pages commits the
> > > page changes which we don't want to do on an error).
> > >
> > > > +	migrate_vma_finalize(&migrate_vma);
> > > > +free_buf:
> > > > +	kvfree(buf);
> > > > +	return 0;
> > >
> > > I don't think 0 should blindly be return here, if there is an error
> > > return VM_FAULT_SIGBUS. We likely want a high level error message too.
> > >
> > > Matt
> > >
> > > > +}
> > > > --
> > > > 2.26.3
> > > >

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-11  2:49   ` Matthew Brost
@ 2024-04-12 21:21     ` Zeng, Oak
  2024-04-15 19:40       ` Matthew Brost
  0 siblings, 1 reply; 58+ messages in thread
From: Zeng, Oak @ 2024-04-12 21:21 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Thomas.Hellstrom, Welty, Brian



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 10:49 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
> 
> On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > Introduce a helper function xe_svm_migrate_vma_to_vram.
> >
> > Since the source pages of the svm range can be physically not
> > contiguous, and the destination vram pages can also be not
> > contiguous, there is no easy way to migrate multiple pages per
> > blitter command. We do page by page migration for now.
> >
> > Migration is best effort. Even if we fail to migrate some pages,
> > we will try to migrate the rest pages.
> >
> > FIXME: Use one blitter command to copy when both src and dst are
> > physically contiguous
> >
> 
> Yep, touch in this throughout the series. Only vram needs to be
> contiguous though as we dynamically create PT mappings for sram pages in
> the migrate code. Getting this in a must and should be done immediately
> IMO as this a very, very basic perform thing we know needs to be done.
> We will likely have to tune this code quite a bit for performance so
> getting known things done would be helpful.
> 
> > FIXME: when a vma is partially migrated, split vma as we assume
> > no mixture vma placement.
> >
> 
> Agree we do not want support partial migrations. We likely want to
> return -EAGAIN for something and fault back to a smaller xe_vma chunk
> size which I discussed in [1] and add comment on in [2].
> 
> Migration should be best case too if we fail migrate we can always leave
> backing store in sram too.
> 
> I do have question though, when do we get partial migrations? A user
> having called mlock on some of the pages? I just want to make sure I
> fully understand that case.

Yah, mlock could be one case...

I also looked at the hmm code. There are a few other cases where MIGRATE_PFN_MIGRATE is not set (so we skip the migration afterwards), such as when the pte is empty and the vma is file-backed (not anonymous), or the entry is swapped out to disk, etc. See the function migrate_vma_collect_pmd for details.
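
So the copy loop has to tolerate holes, roughly (sketch only):

	for (i = 0; i < npages; i++) {
		if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE)) {
			/* mlocked, file-backed, swapped out, ... - leave this
			 * page in sram; it simply stays out of the migration
			 */
			migrate.dst[i] = 0;
			continue;
		}
		/* this page participates in the copy to vram */
	}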


> 
> [1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> [2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> 
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> ++++++++++++++++++++++++++++
> 
> Same comment on file structure throughout the series apply here too.
> 
> >  2 files changed, 117 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index c9e4239c44b4..18ce2e3757c5 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >  void xe_devm_free_blocks(struct list_head *blocks);
> >  void xe_devm_page_free(struct page *page);
> >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> > +							struct xe_tile *tile);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > index 0db831af098e..ab8dd1f58aa4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> vm_fault *vmf)
> >  	kvfree(buf);
> >  	return 0;
> >  }
> > +
> > +/**
> > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to
> vram
> > + * Must be called with mmap_read_lock held.
> > + * @vm: the vm that the vma belongs to
> > + * @vma: the vma to migrate.
> > + * @tile: the destination tile which holds the new backing store of the range
> > + *
> > + * Returns: negative errno on faiure, 0 on success
> > + */
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > +							struct xe_vma *vma,
> > +							struct xe_tile *tile)
> > +{
> > +	struct mm_struct *mm = vm->mm;
> > +	unsigned long start = xe_vma_start(vma);
> > +	unsigned long end = xe_vma_end(vma);
> > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > +	struct xe_mem_region *mr = &tile->mem.vram;
> > +	struct vm_area_struct *vas;
> > +
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= tile->xe,
> 
> Again helper to assign owner.
> 
> > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct device *dev = tile->xe->drm.dev;
> > +	dma_addr_t *src_dma_addr;
> > +	struct dma_fence *fence;
> > +	struct page *src_page;
> > +	LIST_HEAD(blocks);
> > +	int ret = 0, i;
> > +	u64 dst_dpa;
> > +	void *buf;
> > +
> > +	mmap_assert_locked(mm);
> 
> This mmap_assert_locked is ambiguous, we should make it clear if this
> read or write locked. Doesn't it have to be write to do the migrate
> pages?

I followed the hmm document (Documentation/mm/hmm.rst), see the section "Migration to and from device memory". It explicitly uses a read lock in that document.

I believe a read_lock is enough for the migrate_vma_setup/migrate_vma_finalize().

As I understand it, mm.mmap_lock protects the process's address space. When we modify the process's address space, such as in mmap/munmap, we need to hold the lock in write mode; if we only read the process's address space, such as in migrate_vma_setup/finalize or in the cpu page fault handler, we only need the lock in read mode.

I also cross-checked the amd driver. They also use a read lock; see the function svm_range_restore_pages in kfd_svm.c.
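
i.e. for the gpu-fault-path migration to vram, where we take the lock
ourselves, the flow from hmm.rst only reads the address space (sketch,
error handling and the migrate_vma field setup elided):

	mmap_read_lock(mm);
	vma = find_vma(mm, start);
	/* fill in migrate.vma/start/end/src/dst/pgmap_owner/flags */
	ret = migrate_vma_setup(&migrate);
	/* allocate vram and copy every entry with MIGRATE_PFN_MIGRATE set */
	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);
	mmap_read_unlock(mm);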


> 
> A larger question about the locking... The CPU fault handler holds the
> mmap lock in write mode, right?

No. Since the cpu fault handler doesn't modify the process address space, and instead only fills in the cpu page table for some valid address range, a read lock is enough.

> 
> I'm asking as basically I think at least initially we want to hold the
> mmap lock in a way that the GPU handler and CPU handler do not race.
> i.e. From fault userptr create in GPU fault handler to issuing the bind
> we prevent the CPU fault handler from running.

Yes, we hold the mmap_read_lock in both the cpu and gpu fault handlers to avoid that race.

In user mmap/munmap (such as the kernel function vm_munmap), we hold the mmap_write_lock, which prevents it from racing with the cpu and gpu fault handlers.


> 
> I'm having issues figuring out how to prevent races between initial
> binds of userptrs and userptr invalidates on faulting VMs. This race is
> seen any xe_exec_fault_mode for example... So preventing races between
> CPU / GPU fault handler with the mmap probably is a good idea initially.
> Likely can make the locking finer grained once this is all working and I
> figure out how to handle this race better.


I don't quite follow here. 

Initial bind of userptr... if you mean the bind in the gpu page fault handler, then the racing with invalidation roughly works like below:
Invalidation is called with mmap_write_lock held.
In userptr_pin_page, we hold mmap_read_lock, so we know that during the page pin, invalidation is excluded.
After the pin, before programming the gpu page table, we check whether an invalidation happened *after the last pin but before programming the page table*; if that happened, we retry.
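Roughly, in code (an illustrative sketch using the generic mmu interval notifier helpers, not the actual xe implementation; the hmm_range fields are assumed to be set up elsewhere):

	again:
		seq = mmu_interval_read_begin(&userptr->notifier);
		range.notifier_seq = seq;
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);	/* pin / fault in the pages */
		mmap_read_unlock(mm);
		if (ret)
			goto err;
		/* under the same lock the invalidation callback takes: */
		if (mmu_interval_read_retry(&userptr->notifier, seq))
			goto again;		/* invalidated since the pin, redo */
		/* now it is safe to program the gpu page tables */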



Oak

> 
> > +
> > +	vas = find_vma_intersection(mm, start, start + 4);
> 
> find_vma should work fine here.
> 
> > +	if (!vas)
> > +		return -ENOENT;
> > +
> > +	migrate.vma = vas;
> > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> > +					GFP_KERNEL);
> > +	if(!buf)
> > +		return -ENOMEM;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> 
> Again as I discussed in [3] I think this should be broken out into a
> different step with the blocks allocated before this, and here just
> populate migrate.dst from the existing blocks.
> 
> [3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> 
> > +	if (ret)
> > +		goto kfree_buf;
> > +
> > +	ret = migrate_vma_setup(&migrate);
> > +	if (ret) {
> > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> [%lx - %lx]\n",
> > +				ret, start, end);
> > +		goto free_dst_pages;
> > +	}
> > +
> > +	/**FIXME: partial migration of a range print a warning for now.
> > +	 * If this message is printed, we need to split xe_vma as we
> > +	 * don't support a mixture placement of one vma
> > +	 */
> > +	if (migrate.cpages != npages)
> > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
>  %lx], range is %ld pages, migrate only %ld pages\n",
> > +				start, end, npages, migrate.cpages);
> 
> As discussed above, we shouldn't support this. We should fall back to
> smaller xe_vma chunk size until we find one that works or simply leave
> the pages in sram and map those pages to GPU.
> 
> > +
> > +	/**Migrate page by page for now.
> > +	 * Both source pages and destination pages can physically not
> contiguous,
> > +	 * there is no good way to migrate multiple pages per blitter command.
> > +	 */
> 
> Touched on this a bunch throughout the series, lets do better than a
> page a time migration.
> 
> Algorithm should be very similar to what I discussed here [4] but with a
> few key differences.
> 
> - I think the sram pages can be unpopulated (page == NULL) if the user
>   has not yet touched the page
> - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> 
> In these cases this indicate we have to issue a copy for the pages we
> are accumulating with contigous vram addresses.
> 
> [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> 
> > +	for (i = 0; i < npages; i++) {
> > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > +		if (unlikely(!src_page || !(migrate.src[i] &
> MIGRATE_PFN_MIGRATE)))
> 
> Discussed this in the CPU fault patch, once we call migrate_vma_setup,
> on subsequent errors we need to call migrate_vma_finalize to revert the
> pages to the original state. At least I think if I am reading the doc
> after this correctly.
> 
> Here on error we just free the pages...
> 
> Matt
> 
> > +			goto free_dst_page;
> > +
> > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0,
> PAGE_SIZE, DMA_TO_DEVICE);
> > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > +			drm_warn(&tile->xe->drm, "dma map error for host
> pfn %lx\n", migrate.src[i]);
> > +			goto free_dst_page;
> > +		}
> > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > +				dst_dpa, true, PAGE_SIZE);
> > +		if (IS_ERR(fence)) {
> > +			drm_warn(&tile->xe->drm, "migrate host page
> (pfn: %lx) to vram failed\n",
> > +					migrate.src[i]);
> > +			/**Migration is best effort. Even we failed here, we
> continue*/
> > +			goto free_dst_page;
> > +		}
> > +		/**FIXME: Use the first migration's out fence as the second
> migration's input fence,
> > +		 * and so on. Only wait the out fence of last migration?
> > +		 */
> > +		dma_fence_wait(fence, false);
> > +		dma_fence_put(fence);
> > +free_dst_page:
> > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > +	}
> > +
> > +	for (i = 0; i < npages; i++)
> > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE,
> DMA_TO_DEVICE);
> > +
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +free_dst_pages:
> > +	if (ret)
> > +		xe_devm_free_blocks(&blocks);
> > +kfree_buf:
> > +	kfree(buf);
> > +	return ret;
> > +}
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
  2024-04-12 21:21     ` Zeng, Oak
@ 2024-04-15 19:40       ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-15 19:40 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Thomas.Hellstrom, Welty, Brian

On Fri, Apr 12, 2024 at 03:21:04PM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 10:49 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
> > 
> > On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > > Introduce a helper function xe_svm_migrate_vma_to_vram.
> > >
> > > Since the source pages of the svm range can be physically not
> > > contiguous, and the destination vram pages can also be not
> > > contiguous, there is no easy way to migrate multiple pages per
> > > blitter command. We do page by page migration for now.
> > >
> > > Migration is best effort. Even if we fail to migrate some pages,
> > > we will try to migrate the rest pages.
> > >
> > > FIXME: Use one blitter command to copy when both src and dst are
> > > physically contiguous
> > >
> > 
> > Yep, touch in this throughout the series. Only vram needs to be
> > contiguous though as we dynamically create PT mappings for sram pages in
> > the migrate code. Getting this in a must and should be done immediately
> > IMO as this a very, very basic perform thing we know needs to be done.
> > We will likely have to tune this code quite a bit for performance so
> > getting known things done would be helpful.
> > 
> > > FIXME: when a vma is partially migrated, split vma as we assume
> > > no mixture vma placement.
> > >
> > 
> > Agree we do not want support partial migrations. We likely want to
> > return -EAGAIN for something and fault back to a smaller xe_vma chunk
> > size which I discussed in [1] and add comment on in [2].
> > 
> > Migration should be best case too if we fail migrate we can always leave
> > backing store in sram too.
> > 
> > I do have question though, when do we get partial migrations? A user
> > having called mlock on some of the pages? I just want to make sure I
> > fully understand that case.
> 
> Yah, mlock could be one case...
> 
> I also looked the hmm code. There are few other cases where MIGRATE_PFN_MIGRATE is not set (so we skip migration after), such as pte is NULL and vma is file-backed (not anonymous); entry is swapped out to hard disk etc. see more details in function migrate_vma_collect_pmd.
> 
> 
> > 
> > [1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > [2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> > 
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> > ++++++++++++++++++++++++++++
> > 
> > Same comment on file structure throughout the series apply here too.
> > 
> > >  2 files changed, 117 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index c9e4239c44b4..18ce2e3757c5 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > >  void xe_devm_free_blocks(struct list_head *blocks);
> > >  void xe_devm_page_free(struct page *page);
> > >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> > > +							struct xe_tile *tile);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > index 0db831af098e..ab8dd1f58aa4 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> > vm_fault *vmf)
> > >  	kvfree(buf);
> > >  	return 0;
> > >  }
> > > +
> > > +/**
> > > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to
> > vram
> > > + * Must be called with mmap_read_lock held.
> > > + * @vm: the vm that the vma belongs to
> > > + * @vma: the vma to migrate.
> > > + * @tile: the destination tile which holds the new backing store of the range
> > > + *
> > > + * Returns: negative errno on faiure, 0 on success
> > > + */
> > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > > +							struct xe_vma *vma,
> > > +							struct xe_tile *tile)
> > > +{
> > > +	struct mm_struct *mm = vm->mm;
> > > +	unsigned long start = xe_vma_start(vma);
> > > +	unsigned long end = xe_vma_end(vma);
> > > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > +	struct vm_area_struct *vas;
> > > +
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= tile->xe,
> > 
> > Again helper to assign owner.
> > 
> > > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct device *dev = tile->xe->drm.dev;
> > > +	dma_addr_t *src_dma_addr;
> > > +	struct dma_fence *fence;
> > > +	struct page *src_page;
> > > +	LIST_HEAD(blocks);
> > > +	int ret = 0, i;
> > > +	u64 dst_dpa;
> > > +	void *buf;
> > > +
> > > +	mmap_assert_locked(mm);
> > 
> > This mmap_assert_locked is ambiguous, we should make it clear if this
> > read or write locked. Doesn't it have to be write to do the migrate
> > pages?
> 
> I followed the hmm document (Documentation/mm/hmm.rst), see the section "Migration to and from device memory". It explicitly shows a read_lock in this document.
> 
> I believe a read_lock is enough for migrate_vma_setup()/migrate_vma_finalize().
> 
> As I understand it, mm.mmap_lock protects the process's address space. When we modify the process's address space, such as in mmap/munmap, we need to hold the lock in write mode; if we only read the address space, such as in migrate_vma_setup/finalize or in the cpu page fault handler, a read mode lock is enough.
> 
> I also cross-checked the amd driver. It also uses a read lock; see function svm_range_restore_pages in kfd_svm.c.
> 

Yea, I see that too. I'm still trying to figure out the locking; IMO the
locking document might actually be wrong, or at the very least the locking
design is very ill-conceived. We can discuss internally a bit before I
publicly share my grievances.

> 
> > 
> > A larger question about the locking... The CPU fault handler holds the
> > mmap lock in write mode, right?
> 
> No. Since the cpu fault handler doesn't modify the process address space and only fills in cpu page table entries for an already valid address range, a read lock is enough.
> 

Ah, yes after digging around a bit I see this.

> > 
> > I'm asking as basically I think at least initially we want to hold the
> > mmap lock in a way that the GPU handler and CPU handler do not race.
> > i.e. From fault userptr create in GPU fault handler to issuing the bind
> > we prevent the CPU fault handler from running.
> 
> Yes, we hold mmap_read_lock in both the cpu and gpu fault handlers to avoid that race.
>

That's not how rw locks work. 2 threads can both hold the read lock in
parallel (shared read access), while only 1 thread can hold the write lock
(exclusive write access, no one can hold the read lock either). Thus my
concern about the cpu and gpu fault handlers running in parallel and the
larger locking design questions. Again we can talk through this in
detail internally a bit.
 
> In user mmap/munmap (such as the kernel function vm_munmap), we hold mmap_write_lock, which prevents it from racing with the cpu and gpu fault handlers.
> 
> 
> > 
> > I'm having issues figuring out how to prevent races between initial
> > binds of userptrs and userptr invalidates on faulting VMs. This race is
> > seen any xe_exec_fault_mode for example... So preventing races between
> > CPU / GPU fault handler with the mmap probably is a good idea initially.
> > Likely can make the locking finer grained once this is all working and I
> > figure out how to handle this race better.
> 
> 
> I don't quite follow here. 
> 
> Initial bind of userptr... if you mean the bind in the gpu page fault handler, then the racing with invalidation roughly works like below:
> Invalidation is called with mmap_write_lock held.

Is it? If the notifier does the invalidation via migrate_vma_setup in the
CPU fault handler, we established above that only the mmap_read_lock is
held.

> In userptr_pin_page, we hold mmap_read_lock, so we know that during the page pin, invalidation is excluded.

Nope, see above: invalidation can happen while userptr_pin_page is
executing because of the read lock. The seqno check (described below) is
what prevents programming of bad page tables.

> After the pin, before programming the gpu page table, we check whether an invalidation happened *after the last pin but before programming the page table*; if that happened, we retry.
>

Yes, that is how it works on tip but I am refactoring it in [1]. I was
trying to avoid the retry loop by turning PDE/PTE writes into clears if an
invalidation happened, but I am not sure that works without a larger
refactor due to the nature of PDEs being shared. I may need the retry loop,
but that also gets tricky with an array of binds... There are a few options
here and we will land on a solution once [2] is merged.

Regardless, my point here is still valid - it may not be the worst idea,
when getting this code initially up and running, to just grab
mmap_write_lock in the GPU fault handler as a BKL (big kernel lock) to
prevent all races. Once the code is stable and stress testing is in place,
switch to finer grained locking as defined in the HMM document (or newly
defined if we find the locking design is insufficient).
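Something like the following for the initial bring-up (an illustrative sketch; the helper names are hypothetical, not existing xe functions):

	/* GPU fault handler, "big lock" variant: taking the mmap lock in
	 * write mode excludes the CPU fault handler (which takes it in
	 * read mode) for the whole fault service.
	 */
	mmap_write_lock(mm);
	vma = lookup_or_create_userptr_vma(vm, fault_addr);	/* hypothetical */
	err = migrate_and_pin_pages(tile, vma);			/* hypothetical */
	if (!err)
		err = bind_vma_to_gpu_page_tables(vm, vma);	/* hypothetical */
	mmap_write_unlock(mm);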

Matt

[1] https://patchwork.freedesktop.org/series/125608/
[2] https://patchwork.freedesktop.org/series/132246/

> 
> 
> Oak
> 
> > 
> > > +
> > > +	vas = find_vma_intersection(mm, start, start + 4);
> > 
> > find_vma should work fine here.
> > 
> > > +	if (!vas)
> > > +		return -ENOENT;
> > > +
> > > +	migrate.vma = vas;
> > > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> > > +					GFP_KERNEL);
> > > +	if(!buf)
> > > +		return -ENOMEM;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> > 
> > Again as I discussed in [3] I think this should be broken out into a
> > different step with the blocks allocated before this, and here just
> > populate migrate.dst from the existing blocks.
> > 
> > [3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> > 
> > > +	if (ret)
> > > +		goto kfree_buf;
> > > +
> > > +	ret = migrate_vma_setup(&migrate);
> > > +	if (ret) {
> > > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> > [%lx - %lx]\n",
> > > +				ret, start, end);
> > > +		goto free_dst_pages;
> > > +	}
> > > +
> > > +	/**FIXME: partial migration of a range print a warning for now.
> > > +	 * If this message is printed, we need to split xe_vma as we
> > > +	 * don't support a mixture placement of one vma
> > > +	 */
> > > +	if (migrate.cpages != npages)
> > > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
> >  %lx], range is %ld pages, migrate only %ld pages\n",
> > > +				start, end, npages, migrate.cpages);
> > 
> > As discussed above, we shouldn't support this. We should fall back to
> > smaller xe_vma chunk size until we find one that works or simply leave
> > the pages in sram and map those pages to GPU.
> > 
> > > +
> > > +	/**Migrate page by page for now.
> > > +	 * Both source pages and destination pages can physically not
> > contiguous,
> > > +	 * there is no good way to migrate multiple pages per blitter command.
> > > +	 */
> > 
> > Touched on this a bunch throughout the series, lets do better than a
> > page a time migration.
> > 
> > Algorithm should be very similar to what I discussed here [4] but with a
> > few key differences.
> > 
> > - I think the sram pages can be unpopulated (page == NULL) if the user
> >   has not yet touched the page
> > - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> > 
> > In these cases this indicate we have to issue a copy for the pages we
> > are accumulating with contigous vram addresses.
> > 
> > [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > 
> > > +	for (i = 0; i < npages; i++) {
> > > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > > +		if (unlikely(!src_page || !(migrate.src[i] &
> > MIGRATE_PFN_MIGRATE)))
> > 
> > Discussed this in the CPU fault patch, once we call migrate_vma_setup,
> > on subsequent errors we need to call migrate_vma_finalize to revert the
> > pages to the original state. At least I think if I am reading the doc
> > after this correctly.
> > 
> > Here on error we just free the pages...
> > 
> > Matt
> > 
> > > +			goto free_dst_page;
> > > +
> > > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0,
> > PAGE_SIZE, DMA_TO_DEVICE);
> > > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > > +			drm_warn(&tile->xe->drm, "dma map error for host
> > pfn %lx\n", migrate.src[i]);
> > > +			goto free_dst_page;
> > > +		}
> > > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > > +				dst_dpa, true, PAGE_SIZE);
> > > +		if (IS_ERR(fence)) {
> > > +			drm_warn(&tile->xe->drm, "migrate host page
> > (pfn: %lx) to vram failed\n",
> > > +					migrate.src[i]);
> > > +			/**Migration is best effort. Even we failed here, we
> > continue*/
> > > +			goto free_dst_page;
> > > +		}
> > > +		/**FIXME: Use the first migration's out fence as the second
> > migration's input fence,
> > > +		 * and so on. Only wait the out fence of last migration?
> > > +		 */
> > > +		dma_fence_wait(fence, false);
> > > +		dma_fence_put(fence);
> > > +free_dst_page:
> > > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > > +	}
> > > +
> > > +	for (i = 0; i < npages; i++)
> > > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE,
> > DMA_TO_DEVICE);
> > > +
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +free_dst_pages:
> > > +	if (ret)
> > > +		xe_devm_free_blocks(&blocks);
> > > +kfree_buf:
> > > +	kfree(buf);
> > > +	return ret;
> > > +}
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-10 22:23   ` Matthew Brost
@ 2024-04-15 20:13     ` Zeng, Oak
  2024-04-15 21:19       ` Matthew Brost
  0 siblings, 1 reply; 58+ messages in thread
From: Zeng, Oak @ 2024-04-15 20:13 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Thomas.Hellstrom, Welty, Brian

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Wednesday, April 10, 2024 6:24 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> Brian <brian.welty@intel.com>
> Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free
> device memory
> 
> On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > Function xe_devm_alloc_pages allocate pages from drm buddy and perform
> > house keeping work for all the pages allocated, such as get a page
> > refcount, keep a bitmap of all pages to denote whether a page is in
> > use, put pages to a drm lru list for eviction purpose.
> >
> > Function xe_devm_free_blocks return list of memory blocks to drm buddy
> > allocator.
> >
> > Function xe_devm_free_page is a call back function from hmm layer. It
> > is called whenever a page's refcount reaches to 1. This function clears
> > the bit of this page in the bitmap. If all the bits in the bitmap is
> > cleared, it means all the pages have been freed, we return all the pages
> > in this memory block back to drm buddy.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147
> ++++++++++++++++++++++++++++-
> 
> See comments about file in previous patches, they apply here too.
> 
> >  2 files changed, 152 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index 624c1581f8ba..92a3ee90d5a7 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> *xe_page_to_mem_region(struct page *page)
> >  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
> >  }
> >
> > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > +						unsigned long npages,
> > +						struct list_head *blocks,
> > +						unsigned long *pfn);
> > +
> > +void xe_devm_free_blocks(struct list_head *blocks);
> > +void xe_devm_page_free(struct page *page);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > index 31af56e8285a..5ba0cd9a70b0 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -5,18 +5,161 @@
> >
> >  #include <linux/mm_types.h>
> >  #include <linux/sched/mm.h>
> > -
> > +#include <linux/gfp.h>
> > +#include <linux/migrate.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/bitops.h>
> > +#include <linux/bitmap.h>
> > +#include <drm/drm_buddy.h>
> >  #include "xe_device_types.h"
> >  #include "xe_svm.h"
> > +#include "xe_migrate.h"
> > +#include "xe_ttm_vram_mgr_types.h"
> > +#include "xe_assert.h"
> >
> > +/**
> > + * struct xe_svm_block_meta - svm uses this data structure to manage each
> > + * block allocated from drm buddy. This will be set to the
> drm_buddy_block's
> > + * private field.
> > + *
> > + * @lru: used to link this block to drm's lru lists. This will be replace
> > + * with struct drm_lru_entity later.
> > + * @tile: tile from which we allocated this block
> > + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > + * to return this block to drm buddy subsystem.
> > + */
> > +struct xe_svm_block_meta {
> > +	struct list_head lru;
> > +	struct xe_tile *tile;
> > +	unsigned long bitmap[];
> > +};
> 
> This looks not needed to me but admittedly haven't looked at the LRU stuff.
> 
> I am thinking roughly...
> 
> - I think we drop all this special tracking (kill xe_svm_block_meta)
> - Have functions to allocate / free the buddy blocks, store buddy blocks in
> userptr
> - Blocks are allocated before migration to VRAM
> - Blocks can be freed on either CPU fault after migration or on VMA
>   destroy (probably depends on madvive hints for VMA where we free
>   blocks)
> - Blocks allocated / freed at ia chunk (xe_vma in this code) granularity
>   (conceptually the same if we switch to 1 to N ratio between xe_vma &
>   pt_state)
> - block->private == memory region so we can get pfn from block
> - When we need migrate_pfns we loop over buddy blocks populating
> migrate.dst

I thought about your scheme. The freeing of device memory is not completely controlled by the driver: core mm can call back into the driver to free a device memory page. xe_devm_page_free in this series is such a callback function registered with core mm. This is why the above data structure has to have a bitmap. The bitmap marks which pages have been freed; when all pages in a buddy block are freed, it is time to free the whole buddy block.

In your scheme, we allocate/free at xe_vma granularity. So I imagine you would have a list of buddy blocks in the userptr, and free all blocks in the list once every page in all blocks has been freed. My scheme frees memory at buddy block granularity - I think that is natural because the buddy free interface is also block based.

You would eventually need to introduce a lru link to link each buddy block to a lru list when vram eviction comes into the picture.

So I just explained why the above xe_svm_block_meta was introduced, i.e. why the bitmap and lru fields are necessary to me. If you drop this data structure, they will have to show up in another way.

> 
> Also I noticed the drm_buddy_* calls in this file are not protected by a
> lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> VRAM mgr code, we either need to reach into there or move this lock to
> common place so the VRAM manager and block allocations for SVM don't
> race with each other.
> 

Yes, the lock has to be added. Thanks for pointing this out. Maybe move the tile->mem.vram_mgr->lock to the xe_tile level so it can be shared b/t BO-driver and system allocator?
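For example (a sketch, assuming the lock is hoisted to something like tile->mem.vram_lock, which is a hypothetical name, and shared by the TTM VRAM manager and the SVM allocation path):

	mutex_lock(&tile->mem.vram_lock);
	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
				     blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);
	mutex_unlock(&tile->mem.vram_lock);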

Oak

> Matt
> 
> >
> >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> >  {
> >  	return 0;
> >  }
> >
> > -static void xe_devm_page_free(struct page *page)
> > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > +{
> > +	/** DRM buddy's block offset is 0-based*/
> > +	offset += mr->hpa_base;
> > +
> > +	return PHYS_PFN(offset);
> > +}
> > +
> > +/** FIXME: we locked page by calling zone_device_page_init
> > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > + */
> > +static void free_block(struct drm_buddy_block *block)
> > +{
> > +	struct xe_svm_block_meta *meta =
> > +		(struct xe_svm_block_meta *)block->private;
> > +	struct xe_tile *tile  = meta->tile;
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +
> > +	kfree(block->private);
> > +	drm_buddy_free_block(mm, block);
> > +}
> > +
> > +void xe_devm_page_free(struct page *page)
> > +{
> > +	struct drm_buddy_block *block =
> > +					(struct drm_buddy_block *)page-
> >zone_device_data;
> > +	struct xe_svm_block_meta *meta =
> > +					(struct xe_svm_block_meta *)block-
> >private;
> > +	struct xe_tile *tile  = meta->tile;
> > +	struct xe_mem_region *mr = &tile->mem.vram;
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +	u64 size = drm_buddy_block_size(mm, block);
> > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > +	u64 block_pfn_first =
> > +					block_offset_to_pfn(mr,
> drm_buddy_block_offset(block));
> > +	u64 page_pfn = page_to_pfn(page);
> > +	u64 i = page_pfn - block_pfn_first;
> > +
> > +	xe_assert(tile->xe, i < pages_per_block);
> > +	clear_bit(i, meta->bitmap);
> > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > +		free_block(block);
> > +}
> > +
> > +/**
> > + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> > + *
> > + * @xe_tile: which tile to allocate device memory from
> > + * @npages: how many pages to allocate
> > + * @blocks: used to return the allocated blocks
> > + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> > + * to hold at @npages entries.
> > + *
> > + * This function allocate blocks of memory from drm buddy allocator, and
> > + * performs initialization work: set struct page::zone_device_data to point
> > + * to the memory block; set/initialize drm_buddy_block::private field;
> > + * lock_page for each page allocated; add memory block to lru managers lru
> > + * list - this is TBD.
> > + *
> > + * return: 0 on success
> > + * error code otherwise
> > + */
> > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > +						unsigned long npages,
> > +						struct list_head *blocks,
> > +						unsigned long *pfn)
> > +{
> > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > +	struct drm_buddy_block *block, *tmp;
> > +	u64 size = npages << PAGE_SHIFT;
> > +	int ret = 0, i, j = 0;
> > +
> > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > +						blocks,
> DRM_BUDDY_TOPDOWN_ALLOCATION);
> > +
> > +	if (unlikely(ret))
> > +		return ret;
> > +
> > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > +		struct xe_mem_region *mr = &tile->mem.vram;
> > +		u64 block_pfn_first, pages_per_block;
> > +		struct xe_svm_block_meta *meta;
> > +		u32 meta_size;
> > +
> > +		size = drm_buddy_block_size(mm, block);
> > +		pages_per_block = size >> PAGE_SHIFT;
> > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > +					sizeof(struct xe_svm_block_meta);
> > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > +		bitmap_fill(meta->bitmap, pages_per_block);
> > +		meta->tile = tile;
> > +		block->private = meta;
> > +		block_pfn_first =
> > +					block_offset_to_pfn(mr,
> drm_buddy_block_offset(block));
> > +		for(i = 0; i < pages_per_block; i++) {
> > +			struct page *page;
> > +
> > +			pfn[j++] = block_pfn_first + i;
> > +			page = pfn_to_page(block_pfn_first + i);
> > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > +			zone_device_page_init(page);
> > +			page->zone_device_data = block;
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * xe_devm_free_blocks() - free all memory blocks
> > + *
> > + * @blocks: memory blocks list head
> > + */
> > +void xe_devm_free_blocks(struct list_head *blocks)
> >  {
> > +	struct drm_buddy_block *block, *tmp;
> > +
> > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > +		free_block(block);
> >  }
> >
> >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-15 20:13     ` Zeng, Oak
@ 2024-04-15 21:19       ` Matthew Brost
  0 siblings, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-15 21:19 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Ghimiray, Himal Prasad, Bommu, Krishnaiah,
	Thomas.Hellstrom, Welty, Brian

On Mon, Apr 15, 2024 at 02:13:55PM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Wednesday, April 10, 2024 6:24 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty,
> > Brian <brian.welty@intel.com>
> > Subject: Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free
> > device memory
> > 
> > On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> > > Function xe_devm_alloc_pages allocate pages from drm buddy and perform
> > > house keeping work for all the pages allocated, such as get a page
> > > refcount, keep a bitmap of all pages to denote whether a page is in
> > > use, put pages to a drm lru list for eviction purpose.
> > >
> > > Function xe_devm_free_blocks return list of memory blocks to drm buddy
> > > allocator.
> > >
> > > Function xe_devm_free_page is a call back function from hmm layer. It
> > > is called whenever a page's refcount reaches to 1. This function clears
> > > the bit of this page in the bitmap. If all the bits in the bitmap is
> > > cleared, it means all the pages have been freed, we return all the pages
> > > in this memory block back to drm buddy.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
> > >  drivers/gpu/drm/xe/xe_svm_devmem.c | 147
> > ++++++++++++++++++++++++++++-
> > 
> > See comments about file in previous patches, they apply here too.
> > 
> > >  2 files changed, 152 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index 624c1581f8ba..92a3ee90d5a7 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -46,4 +46,11 @@ static inline struct xe_mem_region
> > *xe_page_to_mem_region(struct page *page)
> > >  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
> > >  }
> > >
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn);
> > > +
> > > +void xe_devm_free_blocks(struct list_head *blocks);
> > > +void xe_devm_page_free(struct page *page);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > index 31af56e8285a..5ba0cd9a70b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -5,18 +5,161 @@
> > >
> > >  #include <linux/mm_types.h>
> > >  #include <linux/sched/mm.h>
> > > -
> > > +#include <linux/gfp.h>
> > > +#include <linux/migrate.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/dma-fence.h>
> > > +#include <linux/bitops.h>
> > > +#include <linux/bitmap.h>
> > > +#include <drm/drm_buddy.h>
> > >  #include "xe_device_types.h"
> > >  #include "xe_svm.h"
> > > +#include "xe_migrate.h"
> > > +#include "xe_ttm_vram_mgr_types.h"
> > > +#include "xe_assert.h"
> > >
> > > +/**
> > > + * struct xe_svm_block_meta - svm uses this data structure to manage each
> > > + * block allocated from drm buddy. This will be set to the
> > drm_buddy_block's
> > > + * private field.
> > > + *
> > > + * @lru: used to link this block to drm's lru lists. This will be replace
> > > + * with struct drm_lru_entity later.
> > > + * @tile: tile from which we allocated this block
> > > + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> > > + * 0 means this page is idle. When all bits of this block are 0, it is time
> > > + * to return this block to drm buddy subsystem.
> > > + */
> > > +struct xe_svm_block_meta {
> > > +	struct list_head lru;
> > > +	struct xe_tile *tile;
> > > +	unsigned long bitmap[];
> > > +};
> > 
> > This looks not needed to me but admittedly haven't looked at the LRU stuff.
> > 
> > I am thinking roughly...
> > 
> > - I think we drop all this special tracking (kill xe_svm_block_meta)
> > - Have functions to allocate / free the buddy blocks, store buddy blocks in
> > userptr
> > - Blocks are allocated before migration to VRAM
> > - Blocks can be freed on either CPU fault after migration or on VMA
> >   destroy (probably depends on madvive hints for VMA where we free
> >   blocks)
> > - Blocks allocated / freed at ia chunk (xe_vma in this code) granularity
> >   (conceptually the same if we switch to 1 to N ratio between xe_vma &
> >   pt_state)
> > - block->private == memory region so we can get pfn from block
> > - When we need migrate_pfns we loop over buddy blocks populating
> > migrate.dst
> 
> I thought about your scheme. The freeing of device memory is not completely controlled by the driver: core mm can call back into the driver to free a device memory page. xe_devm_page_free in this series is such a callback function registered with core mm. This is why the above data structure has to have a bitmap. The bitmap marks which pages have been freed; when all pages in a buddy block are freed, it is time to free the whole buddy block.
>

Certainly in this scenario we'd also get a mmu notifier with
MMU_NOTIFY_UNMAP when pages are being freed too, right? The notifier
puts the VMA in the garbage collector and the blocks are freed when the VMA
is destroyed.

The garbage collector only supports complete unmaps at the moment, but if
we need to support partial unmaps (likely, as users can munmap
partial buffers) we can, with the garbage collector transferring ownership of
blocks from one VMA to another as needed.

It is possible I don't fully understand the ref counting scheme for pages
either, and we will need to implement dev_pagemap_ops.page_free
(seems likely now that I am typing) rather than the notifier scheme
described above...

If we need to do this, then roughly...

- page->zone_device_data is still the Xe chunk (xe_vma currently)
- Ref count device pages in the Xe chunk (or perhaps the individual block?
  need to think about this more, but certainly the bitmap is overkill)
- page_free decrements the ref count
- When the ref count goes to zero, free the blocks

Again this seems to align with Nouveau and AMD (haven't checked Nvidia's
open source driver) and aligns with the design in Xe of everything being
at chunk granularity (e.g. no partial migrations within a chunk).
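A minimal sketch of that direction (hypothetical names, not the actual xe code; the kref is init'd and one reference is taken per device page at allocation time, then dropped from the page_free callback):

	struct xe_svm_chunk {			/* hypothetical, hangs off the userptr/vma */
		struct xe_tile *tile;
		struct list_head blocks;	/* drm_buddy blocks backing this chunk */
		struct kref refcount;		/* one reference per device page handed out */
	};

	static void xe_svm_chunk_release(struct kref *kref)
	{
		struct xe_svm_chunk *chunk =
			container_of(kref, struct xe_svm_chunk, refcount);

		xe_devm_free_blocks(&chunk->blocks);
		kfree(chunk);
	}

	static void xe_devm_page_free(struct page *page)
	{
		struct xe_svm_chunk *chunk = page->zone_device_data;

		kref_put(&chunk->refcount, xe_svm_chunk_release);
	}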

I guess we will need a test that does partial unmaps to figure out all
of these details...

> In your scheme, we allocate/free at xe_vma granularity. So I imagine you would have a list of buddy blocks in the userptr, and free all blocks in the list once every page in all blocks has been freed. My scheme frees memory at buddy block granularity - I think that is natural because the buddy free interface is also block based.
> 
> You would eventually need to introduce a lru link to link each buddy block to a lru list when vram eviction comes into the picture.
> 
> So I just explained why the above xe_svm_block_meta was introduced, i.e. why the bitmap and lru fields are necessary to me. If you drop this data structure, they will have to show up in another way.
>

The LRU will likely be at chunk granularity too (i.e. xe_vma, not at block
level).

Also, in most cases xe_vma == 1 block if the buddy allocator is doing its
job, so there is no reason to optimize at block level here.
 
> > 
> > Also I noticed the drm_buddy_* calls in this file are not protected by a
> > lock, we will need that. Currently it is tile->mem.vram_mgr->lock in the
> > VRAM mgr code, we either need to reach into there or move this lock to
> > common place so the VRAM manager and block allocations for SVM don't
> > race with each other.
> > 
> 
> Yes, the lock has to be added. Thanks for pointing this out. Maybe move the tile->mem.vram_mgr->lock to the xe_tile level so it can be shared b/t BO-driver and system allocator?
>

Yea, tile->mem.vram_lock might be a better location.

Matt
 
> Oak
> 
> > Matt
> > 
> > >
> > >  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > >  {
> > >  	return 0;
> > >  }
> > >
> > > -static void xe_devm_page_free(struct page *page)
> > > +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > +{
> > > +	/** DRM buddy's block offset is 0-based*/
> > > +	offset += mr->hpa_base;
> > > +
> > > +	return PHYS_PFN(offset);
> > > +}
> > > +
> > > +/** FIXME: we locked page by calling zone_device_page_init
> > > + *  in xe_devm_alloc_pages. Should we unlock pages here?
> > > + */
> > > +static void free_block(struct drm_buddy_block *block)
> > > +{
> > > +	struct xe_svm_block_meta *meta =
> > > +		(struct xe_svm_block_meta *)block->private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +
> > > +	kfree(block->private);
> > > +	drm_buddy_free_block(mm, block);
> > > +}
> > > +
> > > +void xe_devm_page_free(struct page *page)
> > > +{
> > > +	struct drm_buddy_block *block =
> > > +					(struct drm_buddy_block *)page-
> > >zone_device_data;
> > > +	struct xe_svm_block_meta *meta =
> > > +					(struct xe_svm_block_meta *)block-
> > >private;
> > > +	struct xe_tile *tile  = meta->tile;
> > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	u64 size = drm_buddy_block_size(mm, block);
> > > +	u64 pages_per_block = size >> PAGE_SHIFT;
> > > +	u64 block_pfn_first =
> > > +					block_offset_to_pfn(mr,
> > drm_buddy_block_offset(block));
> > > +	u64 page_pfn = page_to_pfn(page);
> > > +	u64 i = page_pfn - block_pfn_first;
> > > +
> > > +	xe_assert(tile->xe, i < pages_per_block);
> > > +	clear_bit(i, meta->bitmap);
> > > +	if (bitmap_empty(meta->bitmap, pages_per_block))
> > > +		free_block(block);
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> > > + *
> > > + * @xe_tile: which tile to allocate device memory from
> > > + * @npages: how many pages to allocate
> > > + * @blocks: used to return the allocated blocks
> > > + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> > > + * to hold at @npages entries.
> > > + *
> > > + * This function allocate blocks of memory from drm buddy allocator, and
> > > + * performs initialization work: set struct page::zone_device_data to point
> > > + * to the memory block; set/initialize drm_buddy_block::private field;
> > > + * lock_page for each page allocated; add memory block to lru managers lru
> > > + * list - this is TBD.
> > > + *
> > > + * return: 0 on success
> > > + * error code otherwise
> > > + */
> > > +int xe_devm_alloc_pages(struct xe_tile *tile,
> > > +						unsigned long npages,
> > > +						struct list_head *blocks,
> > > +						unsigned long *pfn)
> > > +{
> > > +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> > > +	struct drm_buddy_block *block, *tmp;
> > > +	u64 size = npages << PAGE_SHIFT;
> > > +	int ret = 0, i, j = 0;
> > > +
> > > +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> > > +						blocks,
> > DRM_BUDDY_TOPDOWN_ALLOCATION);
> > > +
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link) {
> > > +		struct xe_mem_region *mr = &tile->mem.vram;
> > > +		u64 block_pfn_first, pages_per_block;
> > > +		struct xe_svm_block_meta *meta;
> > > +		u32 meta_size;
> > > +
> > > +		size = drm_buddy_block_size(mm, block);
> > > +		pages_per_block = size >> PAGE_SHIFT;
> > > +		meta_size = BITS_TO_BYTES(pages_per_block) +
> > > +					sizeof(struct xe_svm_block_meta);
> > > +		meta = kzalloc(meta_size, GFP_KERNEL);
> > > +		bitmap_fill(meta->bitmap, pages_per_block);
> > > +		meta->tile = tile;
> > > +		block->private = meta;
> > > +		block_pfn_first =
> > > +					block_offset_to_pfn(mr,
> > drm_buddy_block_offset(block));
> > > +		for(i = 0; i < pages_per_block; i++) {
> > > +			struct page *page;
> > > +
> > > +			pfn[j++] = block_pfn_first + i;
> > > +			page = pfn_to_page(block_pfn_first + i);
> > > +			/**Lock page per hmm requirement, see hmm.rst.*/
> > > +			zone_device_page_init(page);
> > > +			page->zone_device_data = block;
> > > +		}
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * xe_devm_free_blocks() - free all memory blocks
> > > + *
> > > + * @blocks: memory blocks list head
> > > + */
> > > +void xe_devm_free_blocks(struct list_head *blocks)
> > >  {
> > > +	struct drm_buddy_block *block, *tmp;
> > > +
> > > +	list_for_each_entry_safe(block, tmp, blocks, link)
> > > +		free_block(block);
> > >  }
> > >
> > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
  2024-04-10 21:09   ` Matthew Brost
@ 2024-04-16 19:01   ` Matthew Brost
  1 sibling, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-16 19:01 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:23PM -0400, Oak Zeng wrote:
> Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> page is backed by a struct page.
> 
> Those struct pages are created to allow hmm migrate buffer b/t
> GPU vram and CPU system memory using existing Linux migration
> mechanism (i.e., migrating b/t CPU system memory and hard disk).
> 
> This is prepare work to enable svm (shared virtual memory) through
> Linux kernel hmm framework. The memory remap's page map type is set
> to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> vram page get a struct page and can be mapped in CPU page table,
> but such pages are treated as GPU's private resource, so CPU can't
> access them. If CPU access such page, a page fault is triggered
> and page will be migrate to system memory.
> 
> For GPU device which supports coherent memory protocol b/t CPU and
> GPU (such as CXL and CAPI protocol), we can remap device memory as
> MEMORY_DEVICE_COHERENT. This is TBD.
> 
> v1:
> Changes per code review feedback from Matt:
>     change .o order in Makefile
>     fix indentation
>     change code order in mmio_fini
>     remove unnecessary header file
>     uniform xe_svm_devm_add/_remove parameter
>     use tile (vs dev) as pagemap.owner during memremap
>     only remap vram for platform that support usm
> Changes per review feedback from Brian:
>     s/xe_svm_devm_add/xe_devm_add
>     s/xe_svm_devm_remove/xe_devm_remove
>     move calling of xe_devm_add to xe_tile.c
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile          |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  8 +++
>  drivers/gpu/drm/xe/xe_mmio.c         |  6 ++
>  drivers/gpu/drm/xe/xe_svm.h          | 15 +++++
>  drivers/gpu/drm/xe/xe_svm_devmem.c   | 89 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_tile.c         |  4 ++
>  6 files changed, 123 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index fff70fc9a09e..cd5213ba182b 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -129,6 +129,7 @@ xe-y += xe_bb.o \
>  	xe_sa.o \
>  	xe_sched_job.o \
>  	xe_step.o \
> +	xe_svm_devmem.o \
>  	xe_sync.o \
>  	xe_tile.o \
>  	xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index e73b9a086718..d6a14327986b 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -103,6 +103,14 @@ struct xe_mem_region {
>  	resource_size_t actual_physical_size;
>  	/** @mapping: pointer to VRAM mappable space */
>  	void __iomem *mapping;
> +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> +	struct dev_pagemap pagemap;
> +	/**
> +	 * @hpa_base: base host physical address
> +	 *
> +	 * This is generated when remap device memory as ZONE_DEVICE
> +	 */
> +	resource_size_t hpa_base;
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> index 7ba2477452d7..12923fe6abae 100644
> --- a/drivers/gpu/drm/xe/xe_mmio.c
> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> @@ -22,6 +22,7 @@
>  #include "xe_module.h"
>  #include "xe_sriov.h"
>  #include "xe_tile.h"
> +#include "xe_svm.h"
>  
>  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
>  #define TILE_COUNT		REG_GENMASK(15, 8)
> @@ -354,6 +355,11 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
>  static void mmio_fini(struct drm_device *drm, void *arg)
>  {
>  	struct xe_device *xe = arg;
> +	struct xe_tile *tile;
> +	u8 id;
> +
> +	for_each_tile(tile, xe, id)
> +		xe_devm_remove(tile, &tile->mem.vram);
>  
>  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>  	if (xe->mem.vram.mapping)
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> new file mode 100644
> index 000000000000..e944971cfc6d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#ifndef __XE_SVM_H
> +#define __XE_SVM_H
> +
> +struct xe_tile;
> +struct xe_mem_region;
> +
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> new file mode 100644
> index 000000000000..31af56e8285a
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/mm_types.h>
> +#include <linux/sched/mm.h>
> +
> +#include "xe_device_types.h"
> +#include "xe_svm.h"
> +
> +
> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	return 0;
> +}
> +
> +static void xe_devm_page_free(struct page *page)
> +{
> +}
> +
> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> +	.page_free = xe_devm_page_free,
> +	.migrate_to_ram = xe_devm_migrate_to_ram,
> +};
> +
> +/**
> + * xe_devm_add: Remap and provide memmap backing for device memory
> + * @tile: tile that the memory region blongs to
> + * @mr: memory region to remap
> + *
> + * This remap device memory to host physical address space and create
> + * struct page to back device memory
> + *
> + * Return: 0 on success standard error code otherwise
> + */
> +int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> +	struct resource *res;
> +	void *addr;
> +	int ret;
> +
> +	res = devm_request_free_mem_region(dev, &iomem_resource,
> +					   mr->usable_size);
> +	if (IS_ERR(res)) {
> +		ret = PTR_ERR(res);
> +		return ret;
> +	}
> +
> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	mr->pagemap.range.start = res->start;
> +	mr->pagemap.range.end = res->end;
> +	mr->pagemap.nr_range = 1;
> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> +	mr->pagemap.owner = xe;
> +	addr = devm_memremap_pages(dev, &mr->pagemap);
> +	if (IS_ERR(addr)) {
> +		devm_release_mem_region(dev, res->start, resource_size(res));
> +		ret = PTR_ERR(addr);
> +		drm_err(&xe->drm, "Failed to remap tile %d memory, errno %d\n",
> +				tile->id, ret);
> +		return ret;
> +	}
> +	mr->hpa_base = res->start;
> +
> +	drm_info(&xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> +			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> +	return 0;
> +}
> +
> +/**
> + * xe_devm_remove: Unmap device memory and free resources
> + * @tile: xe tile
> + * @mr: memory region to remove
> + */
> +void xe_devm_remove(struct xe_tile *tile, struct xe_mem_region *mr)

Also I don't think this function is needed...

devm_memremap_pages registers devm_memremap_pages_release via
devm_add_action_or_reset...

And if it were needed, we'd want to register a devm_fini function rather than
exporting a function and calling it from the mmio layer.
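E.g. (a sketch of what registering our own cleanup could look like; xe_devm_fini is a hypothetical name, and with devm_memremap_pages already queueing its own release action it may end up doing very little):

	static void xe_devm_fini(void *arg)		/* hypothetical */
	{
		struct xe_mem_region *mr = arg;

		/* driver-side bookkeeping at unbind; the pagemap itself is
		 * torn down by the release action devm_memremap_pages()
		 * registered for us.
		 */
		mr->hpa_base = 0;
	}

	/* at the end of xe_devm_add(), once devm_memremap_pages() succeeds: */
	return devm_add_action_or_reset(dev, xe_devm_fini, mr);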

Matt

> +{
> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> +
> +	/*FIXME: Does below cause a kernel hange during moduel remove?*/
> +	if (mr->hpa_base) {
> +		devm_memunmap_pages(dev, &mr->pagemap);
> +		devm_release_mem_region(dev, mr->pagemap.range.start,
> +			mr->pagemap.range.end - mr->pagemap.range.start + 1);
> +	}
> +}
> +
> diff --git a/drivers/gpu/drm/xe/xe_tile.c b/drivers/gpu/drm/xe/xe_tile.c
> index 0650b2fa75ef..f1c4f9de51df 100644
> --- a/drivers/gpu/drm/xe/xe_tile.c
> +++ b/drivers/gpu/drm/xe/xe_tile.c
> @@ -14,6 +14,7 @@
>  #include "xe_tile_sysfs.h"
>  #include "xe_ttm_vram_mgr.h"
>  #include "xe_wa.h"
> +#include "xe_svm.h"
>  
>  /**
>   * DOC: Multi-tile Design
> @@ -158,6 +159,7 @@ static int tile_ttm_mgr_init(struct xe_tile *tile)
>   */
>  int xe_tile_init_noalloc(struct xe_tile *tile)
>  {
> +	struct xe_device *xe = tile_to_xe(tile);
>  	int err;
>  
>  	xe_device_mem_access_get(tile_to_xe(tile));
> @@ -175,6 +177,8 @@ int xe_tile_init_noalloc(struct xe_tile *tile)
>  
>  	xe_tile_sysfs_init(tile);
>  
> +	if (xe->info.has_usm)
> +		xe_devm_add(tile, &tile->mem.vram);
>  err_mem_access:
>  	xe_device_mem_access_put(tile_to_xe(tile));
>  	return err;
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory
  2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
  2024-04-10 22:23   ` Matthew Brost
@ 2024-04-17 20:55   ` Matthew Brost
  1 sibling, 0 replies; 58+ messages in thread
From: Matthew Brost @ 2024-04-17 20:55 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, himal.prasad.ghimiray, krishnaiah.bommu,
	Thomas.Hellstrom, brian.welty

On Tue, Apr 09, 2024 at 04:17:33PM -0400, Oak Zeng wrote:
> Function xe_devm_alloc_pages allocate pages from drm buddy and perform
> house keeping work for all the pages allocated, such as get a page
> refcount, keep a bitmap of all pages to denote whether a page is in
> use, put pages to a drm lru list for eviction purpose.
> 
> Function xe_devm_free_blocks return list of memory blocks to drm buddy
> allocator.
> 
> Function xe_devm_free_page is a call back function from hmm layer. It
> is called whenever a page's refcount reaches to 1. This function clears
> the bit of this page in the bitmap. If all the bits in the bitmap is
> cleared, it means all the pages have been freed, we return all the pages
> in this memory block back to drm buddy.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_svm.h        |   7 ++
>  drivers/gpu/drm/xe/xe_svm_devmem.c | 147 ++++++++++++++++++++++++++++-
>  2 files changed, 152 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 624c1581f8ba..92a3ee90d5a7 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -46,4 +46,11 @@ static inline struct xe_mem_region *xe_page_to_mem_region(struct page *page)
>  	return container_of(page->pgmap, struct xe_mem_region, pagemap);
>  }
>  
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn);
> +
> +void xe_devm_free_blocks(struct list_head *blocks);
> +void xe_devm_page_free(struct page *page);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> index 31af56e8285a..5ba0cd9a70b0 100644
> --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -5,18 +5,161 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/sched/mm.h>
> -
> +#include <linux/gfp.h>
> +#include <linux/migrate.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dma-fence.h>
> +#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> +#include <drm/drm_buddy.h>
>  #include "xe_device_types.h"
>  #include "xe_svm.h"
> +#include "xe_migrate.h"
> +#include "xe_ttm_vram_mgr_types.h"
> +#include "xe_assert.h"
>  
> +/**
> + * struct xe_svm_block_meta - svm uses this data structure to manage each
> + * block allocated from drm buddy. This will be set to the drm_buddy_block's
> + * private field.
> + *
> + * @lru: used to link this block to drm's lru lists. This will be replaced
> + * with struct drm_lru_entity later.
> + * @tile: tile from which we allocated this block
> + * @bitmap: A bitmap of each page in this block. 1 means this page is used,
> + * 0 means this page is idle. When all bits of this block are 0, it is time
> + * to return this block to the drm buddy subsystem.
> + */
> +struct xe_svm_block_meta {
> +	struct list_head lru;
> +	struct xe_tile *tile;
> +	unsigned long bitmap[];
> +};
>  
>  static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
>  {
>  	return 0;
>  }
>  
> -static void xe_devm_page_free(struct page *page)
> +static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> +{
> +	/* DRM buddy's block offset is 0-based */
> +	offset += mr->hpa_base;
> +
> +	return PHYS_PFN(offset);
> +}
> +
> +/* FIXME: we locked the pages by calling zone_device_page_init
> + * in xe_devm_alloc_pages. Should we unlock pages here?
> + */
> +static void free_block(struct drm_buddy_block *block)
> +{
> +	struct xe_svm_block_meta *meta =
> +		(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +
> +	kfree(block->private);
> +	drm_buddy_free_block(mm, block);
> +}
> +
> +void xe_devm_page_free(struct page *page)
> +{
> +	struct drm_buddy_block *block =
> +					(struct drm_buddy_block *)page->zone_device_data;
> +	struct xe_svm_block_meta *meta =
> +					(struct xe_svm_block_meta *)block->private;
> +	struct xe_tile *tile  = meta->tile;
> +	struct xe_mem_region *mr = &tile->mem.vram;
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	u64 size = drm_buddy_block_size(mm, block);
> +	u64 pages_per_block = size >> PAGE_SHIFT;
> +	u64 block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +	u64 page_pfn = page_to_pfn(page);
> +	u64 i = page_pfn - block_pfn_first;
> +
> +	xe_assert(tile->xe, i < pages_per_block);
> +	clear_bit(i, meta->bitmap);
> +	if (bitmap_empty(meta->bitmap, pages_per_block))
> +		free_block(block);
> +}
> +
> +/**
> + * xe_devm_alloc_pages() - allocate device pages from buddy allocator
> + *
> + * @tile: which tile to allocate device memory from
> + * @npages: how many pages to allocate
> + * @blocks: used to return the allocated blocks
> + * @pfn: used to return the pfn of all allocated pages. Must be big enough
> + * to hold @npages entries.
> + *
> + * This function allocates blocks of memory from the drm buddy allocator and
> + * performs initialization work: set struct page::zone_device_data to point
> + * to the memory block; set/initialize the drm_buddy_block::private field;
> + * lock_page for each page allocated; add the memory block to the lru
> + * manager's lru list - this is TBD.
> + *
> + * Return: 0 on success,
> + * error code otherwise
> + */
> +int xe_devm_alloc_pages(struct xe_tile *tile,
> +						unsigned long npages,
> +						struct list_head *blocks,
> +						unsigned long *pfn)
> +{
> +	struct drm_buddy *mm = &tile->mem.vram_mgr->mm;
> +	struct drm_buddy_block *block, *tmp;
> +	u64 size = npages << PAGE_SHIFT;
> +	int ret = 0, i, j = 0;
> +
> +	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, PAGE_SIZE,
> +						blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);

Realized this while discussing ref counting off the list: the minimum buddy
allocation size can be either PAGE_SIZE or SZ_64K depending on the platform
too. We store this in the VM via the XE_VM_FLAG_64K flag.

Matt
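
For illustration, a minimal sketch of what that could look like in the
allocation above (an assumption only: it presumes the VM, or at least its
flags, gets plumbed down into xe_devm_alloc_pages()):

	/* Respect the platform's minimum GPU page size when carving buddy blocks */
	u64 min_block_size = (vm->flags & XE_VM_FLAG_64K) ? SZ_64K : PAGE_SIZE;

	ret = drm_buddy_alloc_blocks(mm, 0, mm->size, size, min_block_size,
				     blocks, DRM_BUDDY_TOPDOWN_ALLOCATION);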

> +
> +	if (unlikely(ret))
> +		return ret;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link) {
> +		struct xe_mem_region *mr = &tile->mem.vram;
> +		u64 block_pfn_first, pages_per_block;
> +		struct xe_svm_block_meta *meta;
> +		u32 meta_size;
> +
> +		size = drm_buddy_block_size(mm, block);
> +		pages_per_block = size >> PAGE_SHIFT;
> +		/* bitmap_*() helpers operate on whole unsigned longs */
> +		meta_size = sizeof(struct xe_svm_block_meta) +
> +					BITS_TO_LONGS(pages_per_block) * sizeof(unsigned long);
> +		meta = kzalloc(meta_size, GFP_KERNEL);
> +		bitmap_fill(meta->bitmap, pages_per_block);
> +		meta->tile = tile;
> +		block->private = meta;
> +		block_pfn_first =
> +					block_offset_to_pfn(mr, drm_buddy_block_offset(block));
> +		for (i = 0; i < pages_per_block; i++) {
> +			struct page *page;
> +
> +			pfn[j++] = block_pfn_first + i;
> +			page = pfn_to_page(block_pfn_first + i);
> +			/* Lock page per hmm requirement, see hmm.rst. */
> +			zone_device_page_init(page);
> +			page->zone_device_data = block;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * xe_devm_free_blocks() - free all memory blocks
> + *
> + * @blocks: memory blocks list head
> + */
> +void xe_devm_free_blocks(struct list_head *blocks)
>  {
> +	struct drm_buddy_block *block, *tmp;
> +
> +	list_for_each_entry_safe(block, tmp, blocks, link)
> +		free_block(block);
>  }
>  
>  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> -- 
> 2.26.3
> 


end of thread, other threads:[~2024-04-17 20:55 UTC | newest]

Thread overview: 58+ messages
2024-04-09 20:17 [v2 00/31] Basic system allocator support in xe driver Oak Zeng
2024-04-09 20:17 ` [v2 01/31] drm/xe: Refactor vm_bind Oak Zeng
2024-04-09 20:17 ` [v2 02/31] drm/xe/svm: Add SVM document Oak Zeng
2024-04-09 20:17 ` [v2 03/31] drm/xe: Invalidate userptr VMA on page pin fault Oak Zeng
2024-04-09 20:17 ` [v2 04/31] drm/xe: Drop unused arguments from vm_bind_ioctl_ops_parse Oak Zeng
2024-04-09 20:17 ` [v2 05/31] drm/xe: Fix op->tile_mask for fault mode Oak Zeng
2024-04-09 20:17 ` [v2 06/31] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATOR flag Oak Zeng
2024-04-09 20:17 ` [v2 07/31] drm/xe: Create userptr if page fault occurs on system_allocator VMA Oak Zeng
2024-04-09 20:17 ` [v2 08/31] drm/xe: Add faulted userptr VMA garbage collector Oak Zeng
2024-04-09 20:17 ` [v2 09/31] drm/xe: Introduce helper to populate userptr Oak Zeng
2024-04-09 20:17 ` [v2 10/31] drm/xe: Introduce a helper to free sg table Oak Zeng
2024-04-09 20:17 ` [v2 11/31] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
2024-04-09 20:17 ` [v2 12/31] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
2024-04-10 21:09   ` Matthew Brost
2024-04-16 19:01   ` Matthew Brost
2024-04-09 20:17 ` [v2 13/31] drm/xe/svm: Introduce DRM_XE_SVM kernel config Oak Zeng
2024-04-10 21:13   ` Matthew Brost
2024-04-09 20:17 ` [v2 14/31] drm/xe: Introduce helper to get tile from memory region Oak Zeng
2024-04-10 21:17   ` Matthew Brost
2024-04-09 20:17 ` [v2 15/31] drm/xe: Introduce a helper to get dpa from pfn Oak Zeng
2024-04-10 21:35   ` Matthew Brost
2024-04-09 20:17 ` [v2 16/31] drm/xe/svm: Get xe memory region from page Oak Zeng
2024-04-10 21:38   ` Matthew Brost
2024-04-09 20:17 ` [v2 17/31] drm/xe: Get xe_vma from xe_userptr Oak Zeng
2024-04-10 21:42   ` Matthew Brost
2024-04-09 20:17 ` [v2 18/31] drm/xe/svm: Build userptr sg table for device pages Oak Zeng
2024-04-10 21:52   ` Matthew Brost
2024-04-09 20:17 ` [v2 19/31] drm/xe/svm: Determine a vma is backed by device memory Oak Zeng
2024-04-10 21:56   ` Matthew Brost
2024-04-09 20:17 ` [v2 20/31] drm/xe: add xe lock document Oak Zeng
2024-04-09 20:17 ` [v2 21/31] drm/xe/svm: Introduce svm migration function Oak Zeng
2024-04-10 22:06   ` Matthew Brost
2024-04-09 20:17 ` [v2 22/31] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
2024-04-10 22:23   ` Matthew Brost
2024-04-15 20:13     ` Zeng, Oak
2024-04-15 21:19       ` Matthew Brost
2024-04-17 20:55   ` Matthew Brost
2024-04-09 20:17 ` [v2 23/31] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
2024-04-09 20:17 ` [v2 24/31] drm/xe/svm: Create and destroy xe svm Oak Zeng
2024-04-10 22:25   ` Matthew Brost
2024-04-09 20:17 ` [v2 25/31] drm/xe/svm: Add vm to xe_svm process Oak Zeng
2024-04-09 20:17 ` [v2 26/31] drm/xe: Make function lookup_vma public Oak Zeng
2024-04-10 22:26   ` Matthew Brost
2024-04-09 20:17 ` [v2 27/31] drm/xe/svm: Handle CPU page fault Oak Zeng
2024-04-11  2:07   ` Matthew Brost
2024-04-12 17:24     ` Zeng, Oak
2024-04-12 18:10       ` Matthew Brost
2024-04-12 18:39         ` Zeng, Oak
2024-04-09 20:17 ` [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram Oak Zeng
2024-04-11  2:49   ` Matthew Brost
2024-04-12 21:21     ` Zeng, Oak
2024-04-15 19:40       ` Matthew Brost
2024-04-09 20:17 ` [v2 29/31] drm/xe/svm: trace svm migration Oak Zeng
2024-04-09 20:17 ` [v2 30/31] drm/xe/svm: Add a helper to determine a vma is fault userptr Oak Zeng
2024-04-11  2:50   ` Matthew Brost
2024-04-09 20:17 ` [v2 31/31] drm/xe/svm: Migration from sram to vram for system allocator Oak Zeng
2024-04-11  2:55   ` Matthew Brost
2024-04-09 20:52 ` ✗ CI.Patch_applied: failure for Basic system allocator support in xe driver Patchwork
