* ✓ CI.Patch_applied: success for Use hmm_range_fault to populate user page
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
@ 2024-03-14  3:28 ` Patchwork
  2024-03-14  3:28 ` ✗ CI.checkpatch: warning " Patchwork
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Patchwork @ 2024-03-14  3:28 UTC (permalink / raw)
  To: Oak Zeng; +Cc: intel-xe

== Series Details ==

Series: Use hmm_range_fault to populate user page
URL   : https://patchwork.freedesktop.org/series/131117/
State : success

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: 790a1d4e546a drm-tip: 2024y-03m-13d-20h-00m-39s UTC integration manifest
=== git am output follows ===
.git/rebase-apply/patch:198: new blank line at EOF.
+
warning: 1 line adds whitespace errors.
.git/rebase-apply/patch:241: new blank line at EOF.
+
warning: 1 line adds whitespace errors.
Applying: drm/xe/svm: Remap and provide memmap backing for GPU vram
Applying: drm/xe: Helper to get memory region from tile
Applying: drm/xe: Helper to get dpa from pfn
Applying: drm/xe: Helper to populate a userptr or hmmptr
Applying: drm/xe: Use hmm_range_fault to populate user pages



^ permalink raw reply	[flat|nested] 49+ messages in thread

* ✗ CI.checkpatch: warning for Use hmm_range_fault to populate user page
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
  2024-03-14  3:28 ` ✓ CI.Patch_applied: success for " Patchwork
@ 2024-03-14  3:28 ` Patchwork
  2024-03-14  3:29 ` ✗ CI.KUnit: failure " Patchwork
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Patchwork @ 2024-03-14  3:28 UTC (permalink / raw)
  To: Oak Zeng; +Cc: intel-xe

== Series Details ==

Series: Use hmm_range_fault to populate user page
URL   : https://patchwork.freedesktop.org/series/131117/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
a9eb1ac8298ef9f9146567c29fa762d8e9efa1ef
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 949ff440ef2fd98f08340e015405731c9c20401c
Author: Oak Zeng <oak.zeng@intel.com>
Date:   Wed Mar 13 23:35:53 2024 -0400

    drm/xe: Use hmm_range_fault to populate user pages
    
    This is an effort to unify the hmmptr (aka system allocator)
    and userptr code. hmm_range_fault is used to populate
    a virtual address range for both hmmptr and userptr,
    instead of hmmptr using hmm_range_fault and userptr
    using get_user_pages_fast.

    This also aligns with the AMD gpu driver's behavior. In the
    long term, we plan to move some common helpers in this
    area into the drm layer so they can be re-used by different
    vendors.
    
    Signed-off-by: Oak Zeng <oak.zeng@intel.com>
+ /mt/dim checkpatch 790a1d4e546a1d7f1cc5316c77f21379a4083250 drm-intel
835de08a6cb6 drm/xe/svm: Remap and provide memmap backing for GPU vram
Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 6, in <module>
    from ply import lex, yacc
ModuleNotFoundError: No module named 'ply'
-:58: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#58: FILE: drivers/gpu/drm/xe/xe_device_types.h:103:
+    struct dev_pagemap pagemap;$

-:64: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#64: FILE: drivers/gpu/drm/xe/xe_device_types.h:109:
+    resource_size_t hpa_base;$

-:93: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#93: FILE: drivers/gpu/drm/xe/xe_mmio.c:359:
+    struct xe_tile *tile;$

-:94: WARNING:LEADING_SPACE: please, no spaces at the start of a line
#94: FILE: drivers/gpu/drm/xe/xe_mmio.c:360:
+    u8 id;$

-:107: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#107: 
new file mode 100644

-:112: WARNING:SPDX_LICENSE_TAG: Improper SPDX comment style for 'drivers/gpu/drm/xe/xe_svm.h', please use '/*' instead
#112: FILE: drivers/gpu/drm/xe/xe_svm.h:1:
+// SPDX-License-Identifier: MIT

-:112: WARNING:SPDX_LICENSE_TAG: Missing or malformed SPDX-License-Identifier tag in line 1
#112: FILE: drivers/gpu/drm/xe/xe_svm.h:1:
+// SPDX-License-Identifier: MIT

-:144: CHECK:LINE_SPACING: Please don't use multiple blank lines
#144: FILE: drivers/gpu/drm/xe/xe_svm_devmem.c:13:
+
+

-:194: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#194: FILE: drivers/gpu/drm/xe/xe_svm_devmem.c:63:
+		drm_err(&tile->xe->drm, "Failed to remap tile %d memory, errno %d\n",
+				tile->id, ret);

-:200: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#200: FILE: drivers/gpu/drm/xe/xe_svm_devmem.c:69:
+	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
+			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);

-:212: WARNING:IF_0: Consider removing the code enclosed by this #if 0 and its #endif
#212: FILE: drivers/gpu/drm/xe/xe_svm_devmem.c:81:
+#if 0

-:218: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#218: FILE: drivers/gpu/drm/xe/xe_svm_devmem.c:87:
+		devm_release_mem_region(dev, mr->pagemap.range.start,
+			mr->pagemap.range.end - mr->pagemap.range.start +1);

-:218: CHECK:SPACING: spaces preferred around that '+' (ctx:WxV)
#218: FILE: drivers/gpu/drm/xe/xe_svm_devmem.c:87:
+			mr->pagemap.range.end - mr->pagemap.range.start +1);
 			                                                ^

total: 0 errors, 8 warnings, 5 checks, 159 lines checked
2d11ab5ab4ca drm/xe: Helper to get memory region from tile
-:7: WARNING:COMMIT_MESSAGE: Missing commit description - Add an appropriate one

total: 0 errors, 1 warnings, 0 checks, 9 lines checked
244ff8649c11 drm/xe: Helper to get dpa from pfn
-:24: WARNING:LINE_SPACING: Missing a blank line after declarations
#24: FILE: drivers/gpu/drm/xe/xe_device_types.h:583:
+	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+	dpa = mr->dpa_base + offset;

total: 0 errors, 1 warnings, 0 checks, 12 lines checked
c0b901d8869e drm/xe: Helper to populate a userptr or hmmptr
Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 6, in <module>
    from ply import lex, yacc
ModuleNotFoundError: No module named 'ply'
-:46: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#46: 
new file mode 100644

-:123: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#123: FILE: drivers/gpu/drm/xe/xe_hmm.c:73:
+static int build_sg(struct xe_device *xe, struct hmm_range *range,
+			     struct sg_table *st, bool write)

-:147: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#147: FILE: drivers/gpu/drm/xe/xe_hmm.c:97:
+			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+					write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);

-:194: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#194: FILE: drivers/gpu/drm/xe/xe_hmm.c:144:
+int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
+						bool write)

-:228: WARNING:REPEATED_WORD: Possible repeated word: 'the'
#228: FILE: drivers/gpu/drm/xe/xe_hmm.c:178:
+	 * Set the the dev_private_owner can prevent hmm_range_fault to fault

-:270: WARNING:SPDX_LICENSE_TAG: Improper SPDX comment style for 'drivers/gpu/drm/xe/xe_hmm.h', please use '/*' instead
#270: FILE: drivers/gpu/drm/xe/xe_hmm.h:1:
+// SPDX-License-Identifier: MIT

-:270: WARNING:SPDX_LICENSE_TAG: Missing or malformed SPDX-License-Identifier tag in line 1
#270: FILE: drivers/gpu/drm/xe/xe_hmm.h:1:
+// SPDX-License-Identifier: MIT

-:281: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#281: FILE: drivers/gpu/drm/xe/xe_hmm.h:12:
+int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
+						bool write);

total: 0 errors, 4 warnings, 4 checks, 234 lines checked
949ff440ef2f drm/xe: Use hmm_range_fault to populate user pages



^ permalink raw reply	[flat|nested] 49+ messages in thread

* ✗ CI.KUnit: failure for Use hmm_range_fault to populate user page
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
  2024-03-14  3:28 ` ✓ CI.Patch_applied: success for " Patchwork
  2024-03-14  3:28 ` ✗ CI.checkpatch: warning " Patchwork
@ 2024-03-14  3:29 ` Patchwork
  2024-03-14  3:35 ` [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Patchwork @ 2024-03-14  3:29 UTC (permalink / raw)
  To: Oak Zeng; +Cc: intel-xe

== Series Details ==

Series: Use hmm_range_fault to populate user page
URL   : https://patchwork.freedesktop.org/series/131117/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../arch/x86/um/user-offsets.c:17:6: warning: no previous prototype for ‘foo’ [-Wmissing-prototypes]
   17 | void foo(void)
      |      ^~~
In file included from ../arch/um/kernel/asm-offsets.c:1:
../arch/x86/um/shared/sysdep/kernel-offsets.h:9:6: warning: no previous prototype for ‘foo’ [-Wmissing-prototypes]
    9 | void foo(void)
      |      ^~~
../arch/x86/um/bugs_64.c:9:6: warning: no previous prototype for ‘arch_check_bugs’ [-Wmissing-prototypes]
    9 | void arch_check_bugs(void)
      |      ^~~~~~~~~~~~~~~
../arch/x86/um/bugs_64.c:13:6: warning: no previous prototype for ‘arch_examine_signal’ [-Wmissing-prototypes]
   13 | void arch_examine_signal(int sig, struct uml_pt_regs *regs)
      |      ^~~~~~~~~~~~~~~~~~~
../arch/x86/um/fault.c:18:5: warning: no previous prototype for ‘arch_fixup’ [-Wmissing-prototypes]
   18 | int arch_fixup(unsigned long address, struct uml_pt_regs *regs)
      |     ^~~~~~~~~~
../arch/x86/um/os-Linux/registers.c:146:15: warning: no previous prototype for ‘get_thread_reg’ [-Wmissing-prototypes]
  146 | unsigned long get_thread_reg(int reg, jmp_buf *buf)
      |               ^~~~~~~~~~~~~~
../arch/x86/um/vdso/um_vdso.c:16:5: warning: no previous prototype for ‘__vdso_clock_gettime’ [-Wmissing-prototypes]
   16 | int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
      |     ^~~~~~~~~~~~~~~~~~~~
../arch/x86/um/vdso/um_vdso.c:30:5: warning: no previous prototype for ‘__vdso_gettimeofday’ [-Wmissing-prototypes]
   30 | int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
      |     ^~~~~~~~~~~~~~~~~~~
../arch/x86/um/vdso/um_vdso.c:44:21: warning: no previous prototype for ‘__vdso_time’ [-Wmissing-prototypes]
   44 | __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
      |                     ^~~~~~~~~~~
../arch/x86/um/vdso/um_vdso.c:57:1: warning: no previous prototype for ‘__vdso_getcpu’ [-Wmissing-prototypes]
   57 | __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
      | ^~~~~~~~~~~~~
../arch/x86/um/os-Linux/mcontext.c:7:6: warning: no previous prototype for ‘get_regs_from_mc’ [-Wmissing-prototypes]
    7 | void get_regs_from_mc(struct uml_pt_regs *regs, mcontext_t *mc)
      |      ^~~~~~~~~~~~~~~~
../arch/um/os-Linux/skas/process.c:107:6: warning: no previous prototype for ‘wait_stub_done’ [-Wmissing-prototypes]
  107 | void wait_stub_done(int pid)
      |      ^~~~~~~~~~~~~~
../arch/um/os-Linux/skas/process.c:683:6: warning: no previous prototype for ‘__switch_mm’ [-Wmissing-prototypes]
  683 | void __switch_mm(struct mm_id *mm_idp)
      |      ^~~~~~~~~~~
../arch/um/os-Linux/main.c:187:7: warning: no previous prototype for ‘__wrap_malloc’ [-Wmissing-prototypes]
  187 | void *__wrap_malloc(int size)
      |       ^~~~~~~~~~~~~
../arch/um/os-Linux/main.c:208:7: warning: no previous prototype for ‘__wrap_calloc’ [-Wmissing-prototypes]
  208 | void *__wrap_calloc(int n, int size)
      |       ^~~~~~~~~~~~~
../arch/um/os-Linux/main.c:222:6: warning: no previous prototype for ‘__wrap_free’ [-Wmissing-prototypes]
  222 | void __wrap_free(void *ptr)
      |      ^~~~~~~~~~~
../arch/um/os-Linux/mem.c:28:6: warning: no previous prototype for ‘kasan_map_memory’ [-Wmissing-prototypes]
   28 | void kasan_map_memory(void *start, size_t len)
      |      ^~~~~~~~~~~~~~~~
../arch/um/os-Linux/mem.c:212:13: warning: no previous prototype for ‘check_tmpexec’ [-Wmissing-prototypes]
  212 | void __init check_tmpexec(void)
      |             ^~~~~~~~~~~~~
../arch/x86/um/ptrace_64.c:111:5: warning: no previous prototype for ‘poke_user’ [-Wmissing-prototypes]
  111 | int poke_user(struct task_struct *child, long addr, long data)
      |     ^~~~~~~~~
../arch/x86/um/ptrace_64.c:171:5: warning: no previous prototype for ‘peek_user’ [-Wmissing-prototypes]
  171 | int peek_user(struct task_struct *child, long addr, long data)
      |     ^~~~~~~~~
../arch/um/kernel/skas/mmu.c:17:5: warning: no previous prototype for ‘init_new_context’ [-Wmissing-prototypes]
   17 | int init_new_context(struct task_struct *task, struct mm_struct *mm)
      |     ^~~~~~~~~~~~~~~~
../arch/um/kernel/skas/mmu.c:60:6: warning: no previous prototype for ‘destroy_context’ [-Wmissing-prototypes]
   60 | void destroy_context(struct mm_struct *mm)
      |      ^~~~~~~~~~~~~~~
../arch/um/os-Linux/signal.c:75:6: warning: no previous prototype for ‘sig_handler’ [-Wmissing-prototypes]
   75 | void sig_handler(int sig, struct siginfo *si, mcontext_t *mc)
      |      ^~~~~~~~~~~
../arch/um/os-Linux/signal.c:111:6: warning: no previous prototype for ‘timer_alarm_handler’ [-Wmissing-prototypes]
  111 | void timer_alarm_handler(int sig, struct siginfo *unused_si, mcontext_t *mc)
      |      ^~~~~~~~~~~~~~~~~~~
../arch/um/os-Linux/start_up.c:301:12: warning: no previous prototype for ‘parse_iomem’ [-Wmissing-prototypes]
  301 | int __init parse_iomem(char *str, int *add)
      |            ^~~~~~~~~~~
../arch/um/kernel/skas/process.c:36:12: warning: no previous prototype for ‘start_uml’ [-Wmissing-prototypes]
   36 | int __init start_uml(void)
      |            ^~~~~~~~~
../arch/x86/um/signal.c:560:6: warning: no previous prototype for ‘sys_rt_sigreturn’ [-Wmissing-prototypes]
  560 | long sys_rt_sigreturn(void)
      |      ^~~~~~~~~~~~~~~~
../arch/x86/um/syscalls_64.c:48:6: warning: no previous prototype for ‘arch_switch_to’ [-Wmissing-prototypes]
   48 | void arch_switch_to(struct task_struct *to)
      |      ^~~~~~~~~~~~~~
../arch/um/kernel/mem.c:202:8: warning: no previous prototype for ‘pgd_alloc’ [-Wmissing-prototypes]
  202 | pgd_t *pgd_alloc(struct mm_struct *mm)
      |        ^~~~~~~~~
../arch/um/kernel/mem.c:215:7: warning: no previous prototype for ‘uml_kmalloc’ [-Wmissing-prototypes]
  215 | void *uml_kmalloc(int size, int flags)
      |       ^~~~~~~~~~~
../arch/um/kernel/process.c:51:5: warning: no previous prototype for ‘pid_to_processor_id’ [-Wmissing-prototypes]
   51 | int pid_to_processor_id(int pid)
      |     ^~~~~~~~~~~~~~~~~~~
../arch/um/kernel/process.c:87:7: warning: no previous prototype for ‘__switch_to’ [-Wmissing-prototypes]
   87 | void *__switch_to(struct task_struct *from, struct task_struct *to)
      |       ^~~~~~~~~~~
../arch/um/kernel/process.c:140:6: warning: no previous prototype for ‘fork_handler’ [-Wmissing-prototypes]
  140 | void fork_handler(void)
      |      ^~~~~~~~~~~~
../arch/um/kernel/process.c:217:6: warning: no previous prototype for ‘arch_cpu_idle’ [-Wmissing-prototypes]
  217 | void arch_cpu_idle(void)
      |      ^~~~~~~~~~~~~
../arch/um/kernel/process.c:253:5: warning: no previous prototype for ‘copy_to_user_proc’ [-Wmissing-prototypes]
  253 | int copy_to_user_proc(void __user *to, void *from, int size)
      |     ^~~~~~~~~~~~~~~~~
../arch/um/kernel/process.c:263:5: warning: no previous prototype for ‘clear_user_proc’ [-Wmissing-prototypes]
  263 | int clear_user_proc(void __user *buf, int size)
      |     ^~~~~~~~~~~~~~~
../arch/um/kernel/process.c:271:6: warning: no previous prototype for ‘set_using_sysemu’ [-Wmissing-prototypes]
  271 | void set_using_sysemu(int value)
      |      ^~~~~~~~~~~~~~~~
../arch/um/kernel/process.c:278:5: warning: no previous prototype for ‘get_using_sysemu’ [-Wmissing-prototypes]
  278 | int get_using_sysemu(void)
      |     ^~~~~~~~~~~~~~~~
../arch/um/kernel/process.c:316:12: warning: no previous prototype for ‘make_proc_sysemu’ [-Wmissing-prototypes]
  316 | int __init make_proc_sysemu(void)
      |            ^~~~~~~~~~~~~~~~
../arch/um/kernel/process.c:348:15: warning: no previous prototype for ‘arch_align_stack’ [-Wmissing-prototypes]
  348 | unsigned long arch_align_stack(unsigned long sp)
      |               ^~~~~~~~~~~~~~~~
../arch/um/kernel/reboot.c:45:6: warning: no previous prototype for ‘machine_restart’ [-Wmissing-prototypes]
   45 | void machine_restart(char * __unused)
      |      ^~~~~~~~~~~~~~~
../arch/um/kernel/reboot.c:51:6: warning: no previous prototype for ‘machine_power_off’ [-Wmissing-prototypes]
   51 | void machine_power_off(void)
      |      ^~~~~~~~~~~~~~~~~
../arch/um/kernel/reboot.c:57:6: warning: no previous prototype for ‘machine_halt’ [-Wmissing-prototypes]
   57 | void machine_halt(void)
      |      ^~~~~~~~~~~~
../arch/um/kernel/tlb.c:579:6: warning: no previous prototype for ‘flush_tlb_mm_range’ [-Wmissing-prototypes]
  579 | void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
      |      ^~~~~~~~~~~~~~~~~~
../arch/um/kernel/tlb.c:594:6: warning: no previous prototype for ‘force_flush_all’ [-Wmissing-prototypes]
  594 | void force_flush_all(void)
      |      ^~~~~~~~~~~~~~~
../arch/um/kernel/um_arch.c:408:19: warning: no previous prototype for ‘read_initrd’ [-Wmissing-prototypes]
  408 | int __init __weak read_initrd(void)
      |                   ^~~~~~~~~~~
../arch/um/kernel/um_arch.c:461:7: warning: no previous prototype for ‘text_poke’ [-Wmissing-prototypes]
  461 | void *text_poke(void *addr, const void *opcode, size_t len)
      |       ^~~~~~~~~
../arch/um/kernel/um_arch.c:473:6: warning: no previous prototype for ‘text_poke_sync’ [-Wmissing-prototypes]
  473 | void text_poke_sync(void)
      |      ^~~~~~~~~~~~~~
../arch/um/kernel/kmsg_dump.c:60:12: warning: no previous prototype for ‘kmsg_dumper_stdout_init’ [-Wmissing-prototypes]
   60 | int __init kmsg_dumper_stdout_init(void)
      |            ^~~~~~~~~~~~~~~~~~~~~~~
../lib/iomap.c:156:5: warning: no previous prototype for ‘ioread64_lo_hi’ [-Wmissing-prototypes]
  156 | u64 ioread64_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:163:5: warning: no previous prototype for ‘ioread64_hi_lo’ [-Wmissing-prototypes]
  163 | u64 ioread64_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:170:5: warning: no previous prototype for ‘ioread64be_lo_hi’ [-Wmissing-prototypes]
  170 | u64 ioread64be_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:178:5: warning: no previous prototype for ‘ioread64be_hi_lo’ [-Wmissing-prototypes]
  178 | u64 ioread64be_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:264:6: warning: no previous prototype for ‘iowrite64_lo_hi’ [-Wmissing-prototypes]
  264 | void iowrite64_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:272:6: warning: no previous prototype for ‘iowrite64_hi_lo’ [-Wmissing-prototypes]
  272 | void iowrite64_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:280:6: warning: no previous prototype for ‘iowrite64be_lo_hi’ [-Wmissing-prototypes]
  280 | void iowrite64be_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
../lib/iomap.c:288:6: warning: no previous prototype for ‘iowrite64be_hi_lo’ [-Wmissing-prototypes]
  288 | void iowrite64be_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
../drivers/gpu/drm/xe/xe_hmm.c: In function ‘build_sg’:
../drivers/gpu/drm/xe/xe_hmm.c:93:9: error: implicit declaration of function ‘page_to_mem_region’; did you mean ‘release_mem_region’? [-Werror=implicit-function-declaration]
   93 |    mr = page_to_mem_region(page);
      |         ^~~~~~~~~~~~~~~~~~
      |         release_mem_region
../drivers/gpu/drm/xe/xe_hmm.c:93:7: warning: assignment to ‘struct xe_mem_region *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
   93 |    mr = page_to_mem_region(page);
      |       ^
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:243: drivers/gpu/drm/xe/xe_hmm.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:481: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:481: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:481: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:481: drivers] Error 2
make[2]: *** [/kernel/Makefile:1921: .] Error 2
make[1]: *** [/kernel/Makefile:240: __sub-make] Error 2
make: *** [Makefile:240: __sub-make] Error 2

[03:28:55] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[03:28:59] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 0/5] Use hmm_range_fault to populate user page
@ 2024-03-14  3:35 Oak Zeng
  2024-03-14  3:28 ` ✓ CI.Patch_applied: success for " Patchwork
                   ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread
From: Oak Zeng @ 2024-03-14  3:35 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, matthew.brost, airlied, brian.welty,
	himal.prasad.ghimiray

This is an effort to unify hmmptr (system allocator) and userptr.
A helper, xe_hmm_populate_range, is created to populate user
pages using hmm_range_fault instead of get_user_pages_fast.
This helper is then used to replace some of the userptr code.

The same helper will be used later for hmmptr.

This is part of the hmmptr (system allocator) code. Since this
part can be merged separately, it is sent out for CI and review
first. It will be followed by the rest of the hmmptr code.
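
For reference, the general pattern the new helper follows is the
standard hmm_range_fault flow from the kernel HMM documentation. The
sketch below is an illustrative outline with generic names only, not
the actual driver code; patch 4 contains the real xe implementation.

#include <linux/hmm.h>
#include <linux/jiffies.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Illustrative sketch of the generic hmm_range_fault pattern. */
static int populate_range(struct mmu_interval_notifier *notifier,
			  unsigned long start, unsigned long end,
			  unsigned long *pfns, bool write)
{
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT |
				 (write ? HMM_PFN_REQ_WRITE : 0),
	};
	unsigned long timeout =
		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
	int ret;

	do {
		/* Snapshot the notifier sequence before faulting. */
		range.notifier_seq = mmu_interval_read_begin(notifier);
		mmap_read_lock(notifier->mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(notifier->mm);
		/* -EBUSY means an invalidation raced with us: retry. */
	} while (ret == -EBUSY && !time_after(jiffies, timeout));

	/*
	 * Before committing the pfns (e.g. to GPU page tables) the
	 * caller must take its notifier lock and check
	 * mmu_interval_read_retry(notifier, range.notifier_seq); if
	 * that returns true the snapshot is stale and this whole
	 * sequence must be redone.
	 */
	return ret;
}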

Oak Zeng (5):
  drm/xe/svm: Remap and provide memmap backing for GPU vram
  drm/xe: Helper to get memory region from tile
  drm/xe: Helper to get dpa from pfn
  drm/xe: Helper to populate a userptr or hmmptr
  drm/xe: Use hmm_range_fault to populate user pages

 drivers/gpu/drm/xe/Makefile          |   4 +-
 drivers/gpu/drm/xe/xe_device_types.h |  22 +++
 drivers/gpu/drm/xe/xe_hmm.c          | 213 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hmm.h          |  12 ++
 drivers/gpu/drm/xe/xe_mmio.c         |   8 +
 drivers/gpu/drm/xe/xe_svm.h          |  14 ++
 drivers/gpu/drm/xe/xe_svm_devmem.c   |  91 ++++++++++++
 drivers/gpu/drm/xe/xe_vm.c           | 105 +------------
 8 files changed, 367 insertions(+), 102 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c

-- 
2.26.3


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
                   ` (2 preceding siblings ...)
  2024-03-14  3:29 ` ✗ CI.KUnit: failure " Patchwork
@ 2024-03-14  3:35 ` Oak Zeng
  2024-03-14 17:17   ` Matthew Brost
  2024-03-15  1:45   ` Welty, Brian
  2024-03-14  3:35 ` [PATCH 2/5] drm/xe: Helper to get memory region from tile Oak Zeng
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Oak Zeng @ 2024-03-14  3:35 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, matthew.brost, airlied, brian.welty,
	himal.prasad.ghimiray

Remap GPU vram using devm_memremap_pages, so each GPU vram
page is backed by a struct page.

Those struct pages are created to allow hmm to migrate buffers
between GPU vram and CPU system memory, using the existing Linux
migration mechanism (i.e., the one used to migrate between CPU
system memory and hard disk).

This is preparation work to enable svm (shared virtual memory)
through the Linux kernel hmm framework. The remapped memory's page
map type is set to MEMORY_DEVICE_PRIVATE for now. This means that
even though each GPU vram page gets a struct page and can be mapped
in the CPU page table, such pages are treated as the GPU's private
resource, so the CPU can't access them. If the CPU accesses such a
page, a page fault is triggered and the page will be migrated to
system memory.

For GPU devices which support a coherent memory protocol between
CPU and GPU (such as the CXL and CAPI protocols), we can remap
device memory as MEMORY_DEVICE_COHERENT. This is TBD.
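
Note: the dev_pagemap_ops callbacks added in this patch
(migrate_to_ram, page_free) are stubs for now. For reference only, a
MEMORY_DEVICE_PRIVATE migrate_to_ram handler is expected to
eventually follow the core migrate_vma flow, roughly as in the sketch
below (illustrative, hypothetical names, not part of this patch):

#include <linux/migrate.h>
#include <linux/mm.h>

/* Illustrative sketch only: rough shape of a migrate_to_ram handler. */
static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src_pfn = 0, dst_pfn = 0;
	struct migrate_vma mig = {
		.vma		= vmf->vma,
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.src		= &src_pfn,
		.dst		= &dst_pfn,
		.fault_page	= vmf->page,
		/* must match the pagemap.owner set in xe_svm_devm_add() */
		.pgmap_owner	= vmf->page->pgmap->owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};

	if (migrate_vma_setup(&mig))
		return VM_FAULT_SIGBUS;

	if (mig.cpages) {
		/*
		 * Allocate a system memory page, copy the vram page's
		 * contents into it, set dst_pfn accordingly
		 * (migrate_pfn(page_to_pfn(new_page))), then:
		 */
		migrate_vma_pages(&mig);
		migrate_vma_finalize(&mig);
	}
	return 0;
}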

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Makefile          |  3 +-
 drivers/gpu/drm/xe/xe_device_types.h |  9 +++
 drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
 drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
 drivers/gpu/drm/xe/xe_svm_devmem.c   | 91 ++++++++++++++++++++++++++++
 5 files changed, 124 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index c531210695db..840467080e59 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -142,7 +142,8 @@ xe-y += xe_bb.o \
 	xe_vram_freq.o \
 	xe_wait_user_fence.o \
 	xe_wa.o \
-	xe_wopcm.o
+	xe_wopcm.o \
+	xe_svm_devmem.o
 
 # graphics hardware monitoring (HWMON) support
 xe-$(CONFIG_HWMON) += xe_hwmon.o
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 9785eef2e5a4..f27c3bee8ce7 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -99,6 +99,15 @@ struct xe_mem_region {
 	resource_size_t actual_physical_size;
 	/** @mapping: pointer to VRAM mappable space */
 	void __iomem *mapping;
+	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
+    struct dev_pagemap pagemap;
+    /**
+     * @hpa_base: base host physical address
+     *
+     * This is generated when remap device memory as ZONE_DEVICE
+     */
+    resource_size_t hpa_base;
+
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
index e3db3a178760..0d795394bc4c 100644
--- a/drivers/gpu/drm/xe/xe_mmio.c
+++ b/drivers/gpu/drm/xe/xe_mmio.c
@@ -22,6 +22,7 @@
 #include "xe_module.h"
 #include "xe_sriov.h"
 #include "xe_tile.h"
+#include "xe_svm.h"
 
 #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
 #define TILE_COUNT		REG_GENMASK(15, 8)
@@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
 		}
 
 		io_size -= min_t(u64, tile_size, io_size);
+		xe_svm_devm_add(tile, &tile->mem.vram);
 	}
 
 	xe->mem.vram.actual_physical_size = total_size;
@@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
 static void mmio_fini(struct drm_device *drm, void *arg)
 {
 	struct xe_device *xe = arg;
+    struct xe_tile *tile;
+    u8 id;
 
 	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
 	if (xe->mem.vram.mapping)
 		iounmap(xe->mem.vram.mapping);
+
+	for_each_tile(tile, xe, id)
+		xe_svm_devm_remove(xe, &tile->mem.vram);
+
 }
 
 static int xe_verify_lmem_ready(struct xe_device *xe)
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
new file mode 100644
index 000000000000..09f9afb0e7d4
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef __XE_SVM_H
+#define __XE_SVM_H
+
+#include "xe_device_types.h"
+
+int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
+void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mem);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
new file mode 100644
index 000000000000..63b7a1961cc6
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/mm_types.h>
+#include <linux/sched/mm.h>
+
+#include "xe_device_types.h"
+#include "xe_trace.h"
+#include "xe_svm.h"
+
+
+static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
+{
+	return 0;
+}
+
+static void xe_devm_page_free(struct page *page)
+{
+}
+
+static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
+	.page_free = xe_devm_page_free,
+	.migrate_to_ram = xe_devm_migrate_to_ram,
+};
+
+/**
+ * xe_svm_devm_add: Remap and provide memmap backing for device memory
+ * @tile: tile that the memory region blongs to
+ * @mr: memory region to remap
+ *
+ * This remap device memory to host physical address space and create
+ * struct page to back device memory
+ *
+ * Return: 0 on success standard error code otherwise
+ */
+int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
+{
+	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
+	struct resource *res;
+	void *addr;
+	int ret;
+
+	res = devm_request_free_mem_region(dev, &iomem_resource,
+					   mr->usable_size);
+	if (IS_ERR(res)) {
+		ret = PTR_ERR(res);
+		return ret;
+	}
+
+	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	mr->pagemap.range.start = res->start;
+	mr->pagemap.range.end = res->end;
+	mr->pagemap.nr_range = 1;
+	mr->pagemap.ops = &xe_devm_pagemap_ops;
+	mr->pagemap.owner = tile->xe->drm.dev;
+	addr = devm_memremap_pages(dev, &mr->pagemap);
+	if (IS_ERR(addr)) {
+		devm_release_mem_region(dev, res->start, resource_size(res));
+		ret = PTR_ERR(addr);
+		drm_err(&tile->xe->drm, "Failed to remap tile %d memory, errno %d\n",
+				tile->id, ret);
+		return ret;
+	}
+	mr->hpa_base = res->start;
+
+	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
+			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
+	return 0;
+}
+
+/**
+ * xe_svm_devm_remove: Unmap device memory and free resources
+ * @xe: xe device
+ * @mr: memory region to remove
+ */
+void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mr)
+{
+	/*FIXME: below cause a kernel hange during moduel remove*/
+#if 0
+	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
+
+	if (mr->hpa_base) {
+		devm_memunmap_pages(dev, &mr->pagemap);
+		devm_release_mem_region(dev, mr->pagemap.range.start,
+			mr->pagemap.range.end - mr->pagemap.range.start +1);
+	}
+#endif
+}
+
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 2/5] drm/xe: Helper to get memory region from tile
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
                   ` (3 preceding siblings ...)
  2024-03-14  3:35 ` [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
@ 2024-03-14  3:35 ` Oak Zeng
  2024-03-14 17:33   ` Matthew Brost
  2024-03-14 17:44   ` Matthew Brost
  2024-03-14  3:35 ` [PATCH 3/5] drm/xe: Helper to get dpa from pfn Oak Zeng
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Oak Zeng @ 2024-03-14  3:35 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, matthew.brost, airlied, brian.welty,
	himal.prasad.ghimiray

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index f27c3bee8ce7..bbea40b57e84 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -571,4 +571,9 @@ struct xe_file {
 	struct xe_drm_client *client;
 };
 
+static inline struct xe_tile *mem_region_to_tile(struct xe_mem_region *mr)
+{
+	return container_of(mr, struct xe_tile, mem.vram);
+}
+
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 3/5] drm/xe: Helper to get dpa from pfn
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
                   ` (4 preceding siblings ...)
  2024-03-14  3:35 ` [PATCH 2/5] drm/xe: Helper to get memory region from tile Oak Zeng
@ 2024-03-14  3:35 ` Oak Zeng
  2024-03-14 17:39   ` Matthew Brost
  2024-03-14  3:35 ` [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr Oak Zeng
  2024-03-14  3:35 ` [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
  7 siblings, 1 reply; 49+ messages in thread
From: Oak Zeng @ 2024-03-14  3:35 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, matthew.brost, airlied, brian.welty,
	himal.prasad.ghimiray

Since we now create struct page backing for each vram page,
each vram page also has a pfn, just like system memory.
This allows us to calculate the device physical address from the pfn.
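
As a worked example (made-up numbers, assuming PAGE_SHIFT == 12):

  hpa_base = 0x200000000 (host physical base of the remapped region)
  dpa_base = 0x0         (device physical base of the region)
  pfn      = 0x200100
  host PA  = pfn << PAGE_SHIFT   = 0x200100000
  offset   = host PA - hpa_base  = 0x100000
  dpa      = dpa_base + offset   = 0x100000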

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index bbea40b57e84..bf349321f037 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -576,4 +576,12 @@ static inline struct xe_tile *mem_region_to_tile(struct xe_mem_region *mr)
 	return container_of(mr, struct xe_tile, mem.vram);
 }
 
+static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
+{
+	u64 dpa;
+	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
+	dpa = mr->dpa_base + offset;
+	return dpa;
+}
+
 #endif
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
                   ` (5 preceding siblings ...)
  2024-03-14  3:35 ` [PATCH 3/5] drm/xe: Helper to get dpa from pfn Oak Zeng
@ 2024-03-14  3:35 ` Oak Zeng
  2024-03-14 20:25   ` Matthew Brost
                     ` (2 more replies)
  2024-03-14  3:35 ` [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
  7 siblings, 3 replies; 49+ messages in thread
From: Oak Zeng @ 2024-03-14  3:35 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, matthew.brost, airlied, brian.welty,
	himal.prasad.ghimiray

Add a helper function, xe_hmm_populate_range, to populate
a userptr or hmmptr range. This function calls hmm_range_fault
to read the CPU page tables and populate all pfns/pages of the
virtual address range.

If a populated page is a system memory page, dma-mapping is performed
to get a dma-address which the GPU can later use to access the page.

If a populated page is a device private page, we calculate the dpa
(device physical address) of the page.

The dma-address or dpa is then saved in the userptr's sg table. This is
preparation work to replace the get_user_pages_fast code in the userptr
code path. The helper function will also be used to populate hmmptr
later.
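
As with the existing userptr path, the pages populated here are only
valid until the next mmu notifier invalidation. The helper records the
notifier sequence number so a caller can validate it before committing
the sg table to GPU page tables, roughly as in this illustrative
sketch (hypothetical function name, not part of this patch):

#include <linux/mmu_notifier.h>

/*
 * Illustrative sketch: caller-side check before using the pages
 * populated by xe_hmm_populate_range(). Must be called under the
 * same lock the driver's notifier invalidation callback takes.
 */
static bool xe_userptr_pages_valid(struct xe_userptr *userptr)
{
	return !mmu_interval_read_retry(&userptr->notifier,
					userptr->notifier_seq);
}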

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/Makefile |   3 +-
 drivers/gpu/drm/xe/xe_hmm.c | 213 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hmm.h |  12 ++
 3 files changed, 227 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
 create mode 100644 drivers/gpu/drm/xe/xe_hmm.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 840467080e59..29dcbc938b01 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -143,7 +143,8 @@ xe-y += xe_bb.o \
 	xe_wait_user_fence.o \
 	xe_wa.o \
 	xe_wopcm.o \
-	xe_svm_devmem.o
+	xe_svm_devmem.o \
+	xe_hmm.o
 
 # graphics hardware monitoring (HWMON) support
 xe-$(CONFIG_HWMON) += xe_hwmon.o
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
new file mode 100644
index 000000000000..c45c2447d386
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -0,0 +1,213 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/dma-mapping.h>
+#include <linux/memremap.h>
+#include <linux/swap.h>
+#include <linux/mm.h>
+#include "xe_hmm.h"
+#include "xe_vm.h"
+
+/**
+ * mark_range_accessed() - mark a range is accessed, so core mm
+ * have such information for memory eviction or write back to
+ * hard disk
+ *
+ * @range: the range to mark
+ * @write: if write to this range, we mark pages in this range
+ * as dirty
+ */
+static void mark_range_accessed(struct hmm_range *range, bool write)
+{
+	struct page *page;
+	u64 i, npages;
+
+	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
+	for (i = 0; i < npages; i++) {
+		page = hmm_pfn_to_page(range->hmm_pfns[i]);
+		if (write) {
+			lock_page(page);
+			set_page_dirty(page);
+			unlock_page(page);
+		}
+		mark_page_accessed(page);
+	}
+}
+
+/**
+ * build_sg() - build a scatter gather table for all the physical pages/pfn
+ * in a hmm_range. dma-address is save in sg table and will be used to program
+ * GPU page table later.
+ *
+ * @xe: the xe device who will access the dma-address in sg table
+ * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
+ * has the pfn numbers of pages that back up this hmm address range.
+ * @st: pointer to the sg table.
+ * @write: whether we write to this range. This decides dma map direction
+ * for system pages. If write we map it bi-diretional; otherwise
+ * DMA_TO_DEVICE
+ *
+ * All the contiguous pfns will be collapsed into one entry in
+ * the scatter gather table. This is for the convenience of
+ * later on operations to bind address range to GPU page table.
+ *
+ * The dma_address in the sg table will later be used by GPU to
+ * access memory. So if the memory is system memory, we need to
+ * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
+ * is GPU local memory (of the GPU who is going to access memory),
+ * we need gpu dpa (device physical address), and there is no need
+ * of dma-mapping.
+ *
+ * FIXME: dma-mapping for peer gpu device to access remote gpu's
+ * memory. Add this when you support p2p
+ *
+ * This function allocates the storage of the sg table. It is
+ * caller's responsibility to free it calling sg_free_table.
+ *
+ * Returns 0 if successful; -ENOMEM if fails to allocate memory
+ */
+static int build_sg(struct xe_device *xe, struct hmm_range *range,
+			     struct sg_table *st, bool write)
+{
+	struct device *dev = xe->drm.dev;
+	struct scatterlist *sg;
+	u64 i, npages;
+
+	sg = NULL;
+	st->nents = 0;
+	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
+
+	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
+		return -ENOMEM;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page;
+		unsigned long addr;
+		struct xe_mem_region *mr;
+
+		page = hmm_pfn_to_page(range->hmm_pfns[i]);
+		if (is_device_private_page(page)) {
+			mr = page_to_mem_region(page);
+			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
+		} else {
+			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+					write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
+		}
+
+		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
+			sg->length += PAGE_SIZE;
+			sg_dma_len(sg) += PAGE_SIZE;
+			continue;
+		}
+
+		sg =  sg ? sg_next(sg) : st->sgl;
+		sg_dma_address(sg) = addr;
+		sg_dma_len(sg) = PAGE_SIZE;
+		sg->length = PAGE_SIZE;
+		st->nents++;
+	}
+
+	sg_mark_end(sg);
+	return 0;
+}
+
+/**
+ * xe_hmm_populate_range() - Populate physical pages of a virtual
+ * address range
+ *
+ * @vma: vma has information of the range to populate. only vma
+ * of userptr and hmmptr type can be populated.
+ * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
+ * will hold the populated pfns.
+ * @write: populate pages with write permission
+ *
+ * This function populate the physical pages of a virtual
+ * address range. The populated physical pages is saved in
+ * userptr's sg table. It is similar to get_user_pages but call
+ * hmm_range_fault.
+ *
+ * This function also read mmu notifier sequence # (
+ * mmu_interval_read_begin), for the purpose of later
+ * comparison (through mmu_interval_read_retry).
+ *
+ * This must be called with mmap read or write lock held.
+ *
+ * This function allocates the storage of the userptr sg table.
+ * It is caller's responsibility to free it calling sg_free_table.
+ *
+ * returns: 0 for succuss; negative error no on failure
+ */
+int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
+						bool write)
+{
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
+	struct xe_userptr_vma *userptr_vma;
+	struct xe_userptr *userptr;
+	u64 start = vma->gpuva.va.addr;
+	u64 end = start + vma->gpuva.va.range;
+	struct xe_vm *vm = xe_vma_vm(vma);
+	u64 npages;
+	int ret;
+
+	userptr_vma = to_userptr_vma(vma);
+	userptr = &userptr_vma->userptr;
+	mmap_assert_locked(userptr->notifier.mm);
+
+	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
+	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+	if (unlikely(!pfns))
+		return -ENOMEM;
+
+	if (write)
+		flags |= HMM_PFN_REQ_WRITE;
+
+	memset64((u64 *)pfns, (u64)flags, npages);
+	hmm_range->hmm_pfns = pfns;
+	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
+	hmm_range->notifier = &userptr->notifier;
+	hmm_range->start = start;
+	hmm_range->end = end;
+	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;
+	/**
+	 * FIXME:
+	 * Set the the dev_private_owner can prevent hmm_range_fault to fault
+	 * in the device private pages owned by caller. See function
+	 * hmm_vma_handle_pte. In multiple GPU case, this should be set to the
+	 * device owner of the best migration destination. e.g., device0/vm0
+	 * has a page fault, but we have determined the best placement of
+	 * the fault address should be on device1, we should set below to
+	 * device1 instead of device0.
+	 */
+	hmm_range->dev_private_owner = vm->xe->drm.dev;
+
+	while (true) {
+		ret = hmm_range_fault(hmm_range);
+		if (time_after(jiffies, timeout))
+			break;
+
+		if (ret == -EBUSY)
+			continue;
+		break;
+	}
+
+	if (ret)
+		goto free_pfns;
+
+	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
+	if (ret)
+		goto free_pfns;
+
+	mark_range_accessed(hmm_range, write);
+	userptr->sg = &userptr->sgt;
+	userptr->notifier_seq = hmm_range->notifier_seq;
+
+free_pfns:
+	kvfree(pfns);
+	return ret;
+}
+
diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
new file mode 100644
index 000000000000..960f3f6d36ae
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hmm.h
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#include <linux/types.h>
+#include <linux/hmm.h>
+#include "xe_vm_types.h"
+#include "xe_svm.h"
+
+int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
+						bool write);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages
  2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
                   ` (6 preceding siblings ...)
  2024-03-14  3:35 ` [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr Oak Zeng
@ 2024-03-14  3:35 ` Oak Zeng
  2024-03-14 20:54   ` Matthew Brost
  7 siblings, 1 reply; 49+ messages in thread
From: Oak Zeng @ 2024-03-14  3:35 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, matthew.brost, airlied, brian.welty,
	himal.prasad.ghimiray

This is an effort to unify the hmmptr (aka system allocator)
and userptr code. hmm_range_fault is used to populate
a virtual address range for both hmmptr and userptr,
instead of hmmptr using hmm_range_fault and userptr
using get_user_pages_fast.

This also aligns with the AMD gpu driver's behavior. In the
long term, we plan to move some common helpers in this
area into the drm layer so they can be re-used by different
vendors.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/xe/xe_vm.c | 105 ++-----------------------------------
 1 file changed, 4 insertions(+), 101 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index db3f049a47dc..d6088dcac74a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -38,6 +38,7 @@
 #include "xe_sync.h"
 #include "xe_trace.h"
 #include "xe_wa.h"
+#include "xe_hmm.h"
 
 static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
 {
@@ -65,113 +66,15 @@ int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma)
 
 int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 {
-	struct xe_userptr *userptr = &uvma->userptr;
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct xe_device *xe = vm->xe;
-	const unsigned long num_pages = xe_vma_size(vma) >> PAGE_SHIFT;
-	struct page **pages;
-	bool in_kthread = !current->mm;
-	unsigned long notifier_seq;
-	int pinned, ret, i;
-	bool read_only = xe_vma_read_only(vma);
+	bool write = !xe_vma_read_only(vma);
+	struct hmm_range hmm_range;
 
 	lockdep_assert_held(&vm->lock);
 	xe_assert(xe, xe_vma_is_userptr(vma));
-retry:
-	if (vma->gpuva.flags & XE_VMA_DESTROYED)
-		return 0;
-
-	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
-	if (notifier_seq == userptr->notifier_seq)
-		return 0;
-
-	pages = kvmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
-
-	if (userptr->sg) {
-		dma_unmap_sgtable(xe->drm.dev,
-				  userptr->sg,
-				  read_only ? DMA_TO_DEVICE :
-				  DMA_BIDIRECTIONAL, 0);
-		sg_free_table(userptr->sg);
-		userptr->sg = NULL;
-	}
-
-	pinned = ret = 0;
-	if (in_kthread) {
-		if (!mmget_not_zero(userptr->notifier.mm)) {
-			ret = -EFAULT;
-			goto mm_closed;
-		}
-		kthread_use_mm(userptr->notifier.mm);
-	}
-
-	while (pinned < num_pages) {
-		ret = get_user_pages_fast(xe_vma_userptr(vma) +
-					  pinned * PAGE_SIZE,
-					  num_pages - pinned,
-					  read_only ? 0 : FOLL_WRITE,
-					  &pages[pinned]);
-		if (ret < 0)
-			break;
-
-		pinned += ret;
-		ret = 0;
-	}
-
-	if (in_kthread) {
-		kthread_unuse_mm(userptr->notifier.mm);
-		mmput(userptr->notifier.mm);
-	}
-mm_closed:
-	if (ret)
-		goto out;
-
-	ret = sg_alloc_table_from_pages_segment(&userptr->sgt, pages,
-						pinned, 0,
-						(u64)pinned << PAGE_SHIFT,
-						xe_sg_segment_size(xe->drm.dev),
-						GFP_KERNEL);
-	if (ret) {
-		userptr->sg = NULL;
-		goto out;
-	}
-	userptr->sg = &userptr->sgt;
-
-	ret = dma_map_sgtable(xe->drm.dev, userptr->sg,
-			      read_only ? DMA_TO_DEVICE :
-			      DMA_BIDIRECTIONAL,
-			      DMA_ATTR_SKIP_CPU_SYNC |
-			      DMA_ATTR_NO_KERNEL_MAPPING);
-	if (ret) {
-		sg_free_table(userptr->sg);
-		userptr->sg = NULL;
-		goto out;
-	}
-
-	for (i = 0; i < pinned; ++i) {
-		if (!read_only) {
-			lock_page(pages[i]);
-			set_page_dirty(pages[i]);
-			unlock_page(pages[i]);
-		}
-
-		mark_page_accessed(pages[i]);
-	}
-
-out:
-	release_pages(pages, pinned);
-	kvfree(pages);
-
-	if (!(ret < 0)) {
-		userptr->notifier_seq = notifier_seq;
-		if (xe_vma_userptr_check_repin(uvma) == -EAGAIN)
-			goto retry;
-	}
-
-	return ret < 0 ? ret : 0;
+	return xe_hmm_populate_range(vma, &hmm_range, write);
 }
 
 static bool preempt_fences_waiting(struct xe_vm *vm)
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-14  3:35 ` [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
@ 2024-03-14 17:17   ` Matthew Brost
  2024-03-14 18:32     ` Zeng, Oak
  2024-03-15  1:45   ` Welty, Brian
  1 sibling, 1 reply; 49+ messages in thread
From: Matthew Brost @ 2024-03-14 17:17 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, thomas.hellstrom, airlied, brian.welty, himal.prasad.ghimiray

On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> Remap GPU vram using devm_memremap_pages, so each GPU vram
> page is backed by a struct page.
> 
> Those struct pages are created to allow hmm to migrate buffers
> between GPU vram and CPU system memory, using the existing Linux
> migration mechanism (i.e., the one used to migrate between CPU
> system memory and hard disk).
> 
> This is preparation work to enable svm (shared virtual memory)
> through the Linux kernel hmm framework. The remapped memory's page
> map type is set to MEMORY_DEVICE_PRIVATE for now. This means that
> even though each GPU vram page gets a struct page and can be mapped
> in the CPU page table, such pages are treated as the GPU's private
> resource, so the CPU can't access them. If the CPU accesses such a
> page, a page fault is triggered and the page will be migrated to
> system memory.
> 

Is this really true? We can map VRAM BOs to the CPU without migrating
back and forth. Admittedly I don't know the inner workings of how this
works, but in IGTs we do this all the time.

  54         batch_bo = xe_bo_create(fd, vm, batch_size,
  55                                 vram_if_possible(fd, 0),
  56                                 DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
  57         batch_map = xe_bo_map(fd, batch_bo, batch_size);

The BO is created in VRAM and then mapped to the CPU.

I don't think there is an expectation of coherence, but rather of a
caching mode and exclusive access to the memory based on synchronization.

e.g.
User write BB/data via CPU to GPU memory
User calls exec
GPU read / write memory
User wait on sync indicating exec done
User reads result

All of this works without migration. Are we not planning to support this
flow with SVM?

Afaik this migration dance really only needs to be done if the CPU and
GPU are using atomics on a shared memory region and the GPU device
doesn't support a coherent memory protocol (e.g. PVC).

> For GPU device which supports coherent memory protocol b/t CPU and
> GPU (such as CXL and CAPI protocol), we can remap device memory as
> MEMORY_DEVICE_COHERENT. This is TBD.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile          |  3 +-
>  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
>  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
>  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
>  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91 ++++++++++++++++++++++++++++
>  5 files changed, 124 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index c531210695db..840467080e59 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
>  	xe_vram_freq.o \
>  	xe_wait_user_fence.o \
>  	xe_wa.o \
> -	xe_wopcm.o
> +	xe_wopcm.o \
> +	xe_svm_devmem.o

These should be in alphabetical order.

>  
>  # graphics hardware monitoring (HWMON) support
>  xe-$(CONFIG_HWMON) += xe_hwmon.o
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 9785eef2e5a4..f27c3bee8ce7 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -99,6 +99,15 @@ struct xe_mem_region {
>  	resource_size_t actual_physical_size;
>  	/** @mapping: pointer to VRAM mappable space */
>  	void __iomem *mapping;
> +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> +    struct dev_pagemap pagemap;
> +    /**
> +     * @hpa_base: base host physical address
> +     *
> +     * This is generated when remap device memory as ZONE_DEVICE
> +     */
> +    resource_size_t hpa_base;

Weird indentation. This goes for the entire series, look at checkpatch.

> +
>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> index e3db3a178760..0d795394bc4c 100644
> --- a/drivers/gpu/drm/xe/xe_mmio.c
> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> @@ -22,6 +22,7 @@
>  #include "xe_module.h"
>  #include "xe_sriov.h"
>  #include "xe_tile.h"
> +#include "xe_svm.h"
>  
>  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
>  #define TILE_COUNT		REG_GENMASK(15, 8)
> @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
>  		}
>  
>  		io_size -= min_t(u64, tile_size, io_size);
> +		xe_svm_devm_add(tile, &tile->mem.vram);

Do we want to do this probe for all devices with VRAM or only a subset?

>  	}
>  
>  	xe->mem.vram.actual_physical_size = total_size;
> @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
>  static void mmio_fini(struct drm_device *drm, void *arg)
>  {
>  	struct xe_device *xe = arg;
> +    struct xe_tile *tile;
> +    u8 id;
>  
>  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>  	if (xe->mem.vram.mapping)
>  		iounmap(xe->mem.vram.mapping);
> +
> +	for_each_tile(tile, xe, id)
> +		xe_svm_devm_remove(xe, &tile->mem.vram);

This should probably be above the existing code. It is typical on fini
to do things in reverse order from init.

> +
>  }
>  
>  static int xe_verify_lmem_ready(struct xe_device *xe)
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> new file mode 100644
> index 000000000000..09f9afb0e7d4
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -0,0 +1,14 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation

2024?

> + */
> +
> +#ifndef __XE_SVM_H
> +#define __XE_SVM_H
> +
> +#include "xe_device_types.h"

I don't think you need to include this. Rather just forward decl structs
used here.

e.g.

struct xe_device;
struct xe_mem_region;
struct xe_tile;

> +
> +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mem);

The arguments are incongruent here. Typically we want these to
match.

> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c

Incongruent between xe_svm.h and xe_svm_devmem.c. Again these two should
match.

> new file mode 100644
> index 000000000000..63b7a1961cc6
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation

2024?

> + */
> +
> +#include <linux/mm_types.h>
> +#include <linux/sched/mm.h>
> +
> +#include "xe_device_types.h"
> +#include "xe_trace.h"

xe_trace.h appears to be unused.

> +#include "xe_svm.h"
> +
> +
> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	return 0;
> +}
> +
> +static void xe_devm_page_free(struct page *page)
> +{
> +}
> +
> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> +	.page_free = xe_devm_page_free,
> +	.migrate_to_ram = xe_devm_migrate_to_ram,
> +};
> +

Assume these are placeholders that will be populated later?

> +/**
> + * xe_svm_devm_add: Remap and provide memmap backing for device memory
> + * @tile: tile that the memory region blongs to
> + * @mr: memory region to remap
> + *
> + * This remap device memory to host physical address space and create
> + * struct page to back device memory
> + *
> + * Return: 0 on success standard error code otherwise
> + */
> +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)

Here I see the xe_mem_region is from tile->mem.vram; I'm wondering
whether, rather than using tile->mem.vram, we should use xe->mem.vram
when enabling svm? Isn't the idea behind svm that the entire memory is
one view?

I suppose if we do that, we would also only use one TTM VRAM manager /
buddy allocator. I thought I saw some patches flying around for that too.

> +{
> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> +	struct resource *res;
> +	void *addr;
> +	int ret;
> +
> +	res = devm_request_free_mem_region(dev, &iomem_resource,
> +					   mr->usable_size);
> +	if (IS_ERR(res)) {
> +		ret = PTR_ERR(res);
> +		return ret;
> +	}
> +
> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	mr->pagemap.range.start = res->start;
> +	mr->pagemap.range.end = res->end;
> +	mr->pagemap.nr_range = 1;
> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> +	mr->pagemap.owner = tile->xe->drm.dev;
> +	addr = devm_memremap_pages(dev, &mr->pagemap);
> +	if (IS_ERR(addr)) {
> +		devm_release_mem_region(dev, res->start, resource_size(res));
> +		ret = PTR_ERR(addr);
> +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory, errno %d\n",
> +				tile->id, ret);
> +		return ret;
> +	}
> +	mr->hpa_base = res->start;
> +
> +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> +			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> +	return 0;
> +}
> +
> +/**
> + * xe_svm_devm_remove: Unmap device memory and free resources
> + * @xe: xe device
> + * @mr: memory region to remove
> + */
> +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mr)
> +{
> +	/*FIXME: below cause a kernel hange during moduel remove*/
> +#if 0
> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> +
> +	if (mr->hpa_base) {
> +		devm_memunmap_pages(dev, &mr->pagemap);
> +		devm_release_mem_region(dev, mr->pagemap.range.start,
> +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> +	}
> +#endif

This would need to be fixed too.

Matt 

> +}
> +
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] drm/xe: Helper to get memory region from tile
  2024-03-14  3:35 ` [PATCH 2/5] drm/xe: Helper to get memory region from tile Oak Zeng
@ 2024-03-14 17:33   ` Matthew Brost
  2024-03-14 17:44   ` Matthew Brost
  1 sibling, 0 replies; 49+ messages in thread
From: Matthew Brost @ 2024-03-14 17:33 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, thomas.hellstrom, airlied, brian.welty, himal.prasad.ghimiray

On Wed, Mar 13, 2024 at 11:35:50PM -0400, Oak Zeng wrote:
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index f27c3bee8ce7..bbea40b57e84 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -571,4 +571,9 @@ struct xe_file {
>  	struct xe_drm_client *client;
>  };
>  
> +static inline struct xe_tile *mem_region_to_tile(struct xe_mem_region *mr)
> +{
> +	return container_of(mr, struct xe_tile, mem.vram);
> +}

Helpers shouldn't be in *_types.h files; only struct defs should be.

Also s/mem_region_to_tile/xe_mem_region_to_tile/
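
e.g. something like this in a regular header (sketch only, xe_device.h is
just a guess at where it could live):

static inline struct xe_tile *xe_mem_region_to_tile(struct xe_mem_region *mr)
{
	return container_of(mr, struct xe_tile, mem.vram);
}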

Matt

> +
>  #endif
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
  2024-03-14  3:35 ` [PATCH 3/5] drm/xe: Helper to get dpa from pfn Oak Zeng
@ 2024-03-14 17:39   ` Matthew Brost
  2024-03-15 17:29     ` Zeng, Oak
  2024-03-18 12:09     ` Hellstrom, Thomas
  0 siblings, 2 replies; 49+ messages in thread
From: Matthew Brost @ 2024-03-14 17:39 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, thomas.hellstrom, airlied, brian.welty, himal.prasad.ghimiray

On Wed, Mar 13, 2024 at 11:35:51PM -0400, Oak Zeng wrote:
> Since we now create struct page backing for each vram page,
> each vram page now also has a pfn, just like system memory.
> This allow us to calcuate device physical address from pfn.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index bbea40b57e84..bf349321f037 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -576,4 +576,12 @@ static inline struct xe_tile *mem_region_to_tile(struct xe_mem_region *mr)
>  	return container_of(mr, struct xe_tile, mem.vram);
>  }
>  
> +static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
> +{
> +	u64 dpa;
> +	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;

Can't this be negative? 

e.g. if pfn == 0, offset == -mr->hpa_base.

Or is the assumption that (pfn << PAGE_SHIFT) is always > mr->hpa_base?

If so, can we add an assert or something to ensure we are using this function
correctly?
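
e.g. (sketch, assuming xe_assert() and the renamed tile helper from patch 2
are usable from wherever this ends up living):

static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
{
	u64 offset;

	xe_assert(xe_mem_region_to_tile(mr)->xe,
		  (pfn << PAGE_SHIFT) >= mr->hpa_base);
	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;

	return mr->dpa_base + offset;
}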

> +	dpa = mr->dpa_base + offset;
> +	return dpa;
> +}

Same as previous patch, should be *.h not a *_types.h file.

Also, as this is an xe_mem_region, not explicitly vram, maybe:

s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa/

Matt

> +
>  #endif
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] drm/xe: Helper to get memory region from tile
  2024-03-14  3:35 ` [PATCH 2/5] drm/xe: Helper to get memory region from tile Oak Zeng
  2024-03-14 17:33   ` Matthew Brost
@ 2024-03-14 17:44   ` Matthew Brost
  2024-03-15  2:48     ` Zeng, Oak
  1 sibling, 1 reply; 49+ messages in thread
From: Matthew Brost @ 2024-03-14 17:44 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, thomas.hellstrom, airlied, brian.welty, himal.prasad.ghimiray

On Wed, Mar 13, 2024 at 11:35:50PM -0400, Oak Zeng wrote:
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>

Missed this. Need a commit message. Also this looks to be unused in this series.

Matt

> ---
>  drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index f27c3bee8ce7..bbea40b57e84 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -571,4 +571,9 @@ struct xe_file {
>  	struct xe_drm_client *client;
>  };
>  
> +static inline struct xe_tile *mem_region_to_tile(struct xe_mem_region *mr)
> +{
> +	return container_of(mr, struct xe_tile, mem.vram);
> +}
> +
>  #endif
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-14 17:17   ` Matthew Brost
@ 2024-03-14 18:32     ` Zeng, Oak
  2024-03-14 20:49       ` Matthew Brost
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-14 18:32 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Thursday, March 14, 2024 1:18 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> > page is backed by a struct page.
> >
> > Those struct pages are created to allow hmm migrate buffer b/t
> > GPU vram and CPU system memory using existing Linux migration
> > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> >
> > This is prepare work to enable svm (shared virtual memory) through
> > Linux kernel hmm framework. The memory remap's page map type is set
> > to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> > vram page get a struct page and can be mapped in CPU page table,
> > but such pages are treated as GPU's private resource, so CPU can't
> > access them. If CPU access such page, a page fault is triggered
> > and page will be migrate to system memory.
> >
> 
> Is this really true? We can map VRAM BOs to the CPU without having
> migarte back and forth. Admittedly I don't know the inner workings of
> how this works but in IGTs we do this all the time.
> 
>   54         batch_bo = xe_bo_create(fd, vm, batch_size,
>   55                                 vram_if_possible(fd, 0),
>   56                                 DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
>   57         batch_map = xe_bo_map(fd, batch_bo, batch_size);
> 
> The BO is created in VRAM and then mapped to the CPU.
> 
> I don't think there is an expectation of coherence rather caching mode
> and exclusive access of the memory based on synchronization.
> 
> e.g.
> User write BB/data via CPU to GPU memory
> User calls exec
> GPU read / write memory
> User wait on sync indicating exec done
> User reads result
> 
> All of this works without migration. Are we not planing supporting flow
> with SVM?
> 
> Afaik this migration dance really only needs to be done if the CPU and
> GPU are using atomics on a shared memory region and the GPU device
> doesn't support a coherent memory protocol (e.g. PVC).

All you said is true. On much of our HW, the CPU can actually access device memory, cache coherently or not.

The problem is, this is not true for all GPU vendors. For example, on some HW from some vendors, the CPU can only access part of device memory: the so-called small BAR concept.

So when HMM was defined, such factors were considered, and MEMORY_DEVICE_PRIVATE was defined. With this definition, the CPU can't access device memory.

So you can think of it as a limitation of HMM.

Note this is only part 1 of our system allocator work. We do plan to support DEVICE_COHERENT for our newer devices, see below. With this option, we don't have unnecessary migration back and forth.

You can think of this as just working out all the code paths. 90% of the driver code for DEVICE_PRIVATE and DEVICE_COHERENT will be the same. Our real use of the system allocator will be DEVICE_COHERENT mode, while DEVICE_PRIVATE mode allows us to exercise the code on old HW.

Make sense?


> 
> > For GPU device which supports coherent memory protocol b/t CPU and
> > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > MEMORY_DEVICE_COHERENT. This is TBD.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile          |  3 +-
> >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> ++++++++++++++++++++++++++++
> >  5 files changed, 124 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index c531210695db..840467080e59 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> >  	xe_vram_freq.o \
> >  	xe_wait_user_fence.o \
> >  	xe_wa.o \
> > -	xe_wopcm.o
> > +	xe_wopcm.o \
> > +	xe_svm_devmem.o
> 
> These should be in alphabetical order.

Will fix
> 
> >
> >  # graphics hardware monitoring (HWMON) support
> >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> > index 9785eef2e5a4..f27c3bee8ce7 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -99,6 +99,15 @@ struct xe_mem_region {
> >  	resource_size_t actual_physical_size;
> >  	/** @mapping: pointer to VRAM mappable space */
> >  	void __iomem *mapping;
> > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> > +    struct dev_pagemap pagemap;
> > +    /**
> > +     * @hpa_base: base host physical address
> > +     *
> > +     * This is generated when remap device memory as ZONE_DEVICE
> > +     */
> > +    resource_size_t hpa_base;
> 
> Weird indentation. This goes for the entire series, look at checkpatch.

Will fix
> 
> > +
> >  };
> >
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> > index e3db3a178760..0d795394bc4c 100644
> > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > @@ -22,6 +22,7 @@
> >  #include "xe_module.h"
> >  #include "xe_sriov.h"
> >  #include "xe_tile.h"
> > +#include "xe_svm.h"
> >
> >  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> >  		}
> >
> >  		io_size -= min_t(u64, tile_size, io_size);
> > +		xe_svm_devm_add(tile, &tile->mem.vram);
> 
> Do we want to do this probe for all devices with VRAM or only a subset?

All
> 
> >  	}
> >
> >  	xe->mem.vram.actual_physical_size = total_size;
> > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
> >  static void mmio_fini(struct drm_device *drm, void *arg)
> >  {
> >  	struct xe_device *xe = arg;
> > +    struct xe_tile *tile;
> > +    u8 id;
> >
> >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> >  	if (xe->mem.vram.mapping)
> >  		iounmap(xe->mem.vram.mapping);
> > +
> > +	for_each_tile(tile, xe, id)
> > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> 
> This should probably be above existing code. Typical on fini to do
> things in reverse order from init.

Will fix
> 
> > +
> >  }
> >
> >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > new file mode 100644
> > index 000000000000..09f9afb0e7d4
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -0,0 +1,14 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> 
> 2024?

This patch was actually written in 2023.
> 
> > + */
> > +
> > +#ifndef __XE_SVM_H
> > +#define __XE_SVM_H
> > +
> > +#include "xe_device_types.h"
> 
> I don't think you need to include this. Rather just forward decl structs
> used here.

Will fix
> 
> e.g.
> 
> struct xe_device;
> struct xe_mem_region;
> struct xe_tile;
> 
> > +
> > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> *mem);
> 
> The arguments here are incongruent here. Typically we want these to
> match.

Will fix
> 
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> 
> Incongruent between xe_svm.h and xe_svm_devmem.c. 

Did you mean mem vs mr? If yes, will fix.

Again these two
> should
> match.
> 
> > new file mode 100644
> > index 000000000000..63b7a1961cc6
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -0,0 +1,91 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> 
> 2024?
It is from 2023
> 
> > + */
> > +
> > +#include <linux/mm_types.h>
> > +#include <linux/sched/mm.h>
> > +
> > +#include "xe_device_types.h"
> > +#include "xe_trace.h"
> 
> xe_trace.h appears to be unused.

Will fix
> 
> > +#include "xe_svm.h"
> > +
> > +
> > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	return 0;
> > +}
> > +
> > +static void xe_devm_page_free(struct page *page)
> > +{
> > +}
> > +
> > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > +	.page_free = xe_devm_page_free,
> > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > +};
> > +
> 
> Assume these are placeholders that will be populated later?


Correct.
> 
> > +/**
> > + * xe_svm_devm_add: Remap and provide memmap backing for device
> memory
> > + * @tile: tile that the memory region blongs to
> > + * @mr: memory region to remap
> > + *
> > + * This remap device memory to host physical address space and create
> > + * struct page to back device memory
> > + *
> > + * Return: 0 on success standard error code otherwise
> > + */
> > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> 
> Here I see the xe_mem_region is from tile->mem.vram, wondering rather
> than using the tile->mem.vram we should use xe->mem.vram when enabling
> svm? Isn't the idea behind svm the entire memory is 1 view?

Still need to use tile. The reason is, memory of different tiles can have different characteristics, such as latency. So we want to differentiate memory b/t tiles also in svm. I need to change the " mr->pagemap.owner = tile->xe->drm.dev " below; the owner should also be the tile. This is the way hmm differentiates memory of different tiles.

With svm it is 1 view, from the virtual address space perspective and from the physical struct page perspective. You can think of all the tiles' vram as stacked together to form a unified view together with system memory. This doesn't prohibit us from differentiating memory from different tiles. This differentiation allows us to optimize performance, i.e., we can wisely place memory in a specific tile. If we don't differentiate, this is not possible.
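
Something like below is what I have in mind (a sketch of the change, using the
tile pointer itself as the opaque owner token):

	mr->pagemap.owner = tile;

and then matching that on the hmm_range_fault side via dev_private_owner.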

> 
> I suppose if we do that we also only use 1 TTM VRAM manager / buddy
> allocator too. I thought I saw some patches flying around for that too.

The TTM vram manager is not in the picture. We deliberately avoided it per previous discussion.

Yes, same buddy allocator. It is in my previous POC: https://lore.kernel.org/dri-devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't put those patches in this series because I want to merge these small patches separately.
> 
> > +{
> > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > +	struct resource *res;
> > +	void *addr;
> > +	int ret;
> > +
> > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > +					   mr->usable_size);
> > +	if (IS_ERR(res)) {
> > +		ret = PTR_ERR(res);
> > +		return ret;
> > +	}
> > +
> > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > +	mr->pagemap.range.start = res->start;
> > +	mr->pagemap.range.end = res->end;
> > +	mr->pagemap.nr_range = 1;
> > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > +	mr->pagemap.owner = tile->xe->drm.dev;
> > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > +	if (IS_ERR(addr)) {
> > +		devm_release_mem_region(dev, res->start, resource_size(res));
> > +		ret = PTR_ERR(addr);
> > +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
> errno %d\n",
> > +				tile->id, ret);
> > +		return ret;
> > +	}
> > +	mr->hpa_base = res->start;
> > +
> > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
> remapped to %pr\n",
> > +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
> res);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * xe_svm_devm_remove: Unmap device memory and free resources
> > + * @xe: xe device
> > + * @mr: memory region to remove
> > + */
> > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> *mr)
> > +{
> > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > +#if 0
> > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > +
> > +	if (mr->hpa_base) {
> > +		devm_memunmap_pages(dev, &mr->pagemap);
> > +		devm_release_mem_region(dev, mr->pagemap.range.start,
> > +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> > +	}
> > +#endif
> 
> This would need to be fixed too.


Yes...

Oak
> 
> Matt
> 
> > +}
> > +
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-14  3:35 ` [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr Oak Zeng
@ 2024-03-14 20:25   ` Matthew Brost
  2024-03-16  1:35     ` Zeng, Oak
  2024-03-18 11:53   ` Hellstrom, Thomas
  2024-03-18 13:12   ` Hellstrom, Thomas
  2 siblings, 1 reply; 49+ messages in thread
From: Matthew Brost @ 2024-03-14 20:25 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, thomas.hellstrom, airlied, brian.welty, himal.prasad.ghimiray

On Wed, Mar 13, 2024 at 11:35:52PM -0400, Oak Zeng wrote:
> Add a helper function xe_hmm_populate_range to populate
> a a userptr or hmmptr range. This functions calls hmm_range_fault
> to read CPU page tables and populate all pfns/pages of this
> virtual address range.
> 
> If the populated page is system memory page, dma-mapping is performed
> to get a dma-address which can be used later for GPU to access pages.
> 
> If the populated page is device private page, we calculate the dpa (
> device physical address) of the page.
> 
> The dma-address or dpa is then saved in userptr's sg table. This is
> prepare work to replace the get_user_pages_fast code in userptr code
> path. The helper function will also be used to populate hmmptr later.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile |   3 +-
>  drivers/gpu/drm/xe/xe_hmm.c | 213 ++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
>  3 files changed, 227 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
>  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 840467080e59..29dcbc938b01 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
>  	xe_wait_user_fence.o \
>  	xe_wa.o \
>  	xe_wopcm.o \
> -	xe_svm_devmem.o
> +	xe_svm_devmem.o \
> +	xe_hmm.o
>  
>  # graphics hardware monitoring (HWMON) support
>  xe-$(CONFIG_HWMON) += xe_hwmon.o
> diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> new file mode 100644
> index 000000000000..c45c2447d386
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hmm.c
> @@ -0,0 +1,213 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/memremap.h>
> +#include <linux/swap.h>
> +#include <linux/mm.h>
> +#include "xe_hmm.h"
> +#include "xe_vm.h"
> +
> +/**
> + * mark_range_accessed() - mark a range is accessed, so core mm
> + * have such information for memory eviction or write back to
> + * hard disk
> + *
> + * @range: the range to mark
> + * @write: if write to this range, we mark pages in this range
> + * as dirty
> + */
> +static void mark_range_accessed(struct hmm_range *range, bool write)
> +{
> +	struct page *page;
> +	u64 i, npages;
> +
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> +	for (i = 0; i < npages; i++) {
> +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> +		if (write) {
> +			lock_page(page);
> +			set_page_dirty(page);
> +			unlock_page(page);
> +		}
> +		mark_page_accessed(page);
> +	}
> +}
> +
> +/**
> + * build_sg() - build a scatter gather table for all the physical pages/pfn
> + * in a hmm_range. dma-address is save in sg table and will be used to program
> + * GPU page table later.
> + *
> + * @xe: the xe device who will access the dma-address in sg table
> + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> + * has the pfn numbers of pages that back up this hmm address range.
> + * @st: pointer to the sg table.
> + * @write: whether we write to this range. This decides dma map direction
> + * for system pages. If write we map it bi-diretional; otherwise
> + * DMA_TO_DEVICE
> + *
> + * All the contiguous pfns will be collapsed into one entry in
> + * the scatter gather table. This is for the convenience of
> + * later on operations to bind address range to GPU page table.
> + *
> + * The dma_address in the sg table will later be used by GPU to
> + * access memory. So if the memory is system memory, we need to
> + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> + * is GPU local memory (of the GPU who is going to access memory),
> + * we need gpu dpa (device physical address), and there is no need
> + * of dma-mapping.
> + *
> + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> + * memory. Add this when you support p2p
> + *
> + * This function allocates the storage of the sg table. It is
> + * caller's responsibility to free it calling sg_free_table.
> + *
> + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> + */
> +static int build_sg(struct xe_device *xe, struct hmm_range *range,

xe is unused.

> +			     struct sg_table *st, bool write)
> +{
> +	struct device *dev = xe->drm.dev;
> +	struct scatterlist *sg;
> +	u64 i, npages;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> +
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		struct page *page;
> +		unsigned long addr;
> +		struct xe_mem_region *mr;
> +
> +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> +		if (is_device_private_page(page)) {
> +			mr = page_to_mem_region(page);

Not seeing where page_to_mem_region is defined.

> +			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
> +		} else {
> +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> +					write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
> +		}
> +
> +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg =  sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}
> +

Hmm, this looks way too open coded to me.

Can't we do something like this:

struct page **pages = convert from range->hmm_pfns
sg_alloc_table_from_pages_segment
if (is_device_private_page())
        populate sg table via vram_pfn_to_dpa
else
        dma_map_sgtable
free(pages)

This assumes we are not mixing is_device_private_page & system memory
pages in a single struct hmm_range.
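
Rough sketch of what I mean (hypothetical helper name npages_in_range, and
assuming the whole range is either all device private or all system memory):

static int build_sg(struct xe_device *xe, struct hmm_range *range,
		    struct sg_table *st, bool write)
{
	struct device *dev = xe->drm.dev;
	u64 npages = npages_in_range(range->start, range->end);
	struct page **pages;
	u64 i;
	int ret = 0;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	for (i = 0; i < npages; i++)
		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);

	if (is_device_private_page(pages[0])) {
		/* populate st entries from vram_pfn_to_dpa(), as in the
		 * open coded loop above */
	} else {
		ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
				npages << PAGE_SHIFT,
				xe_sg_segment_size(dev), GFP_KERNEL);
		if (!ret)
			ret = dma_map_sgtable(dev, st,
					      write ? DMA_BIDIRECTIONAL :
					      DMA_TO_DEVICE,
					      DMA_ATTR_SKIP_CPU_SYNC);
	}

	kvfree(pages);
	return ret;
}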


> +/**
> + * xe_hmm_populate_range() - Populate physical pages of a virtual
> + * address range
> + *
> + * @vma: vma has information of the range to populate. only vma
> + * of userptr and hmmptr type can be populated.
> + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> + * will hold the populated pfns.
> + * @write: populate pages with write permission
> + *
> + * This function populate the physical pages of a virtual
> + * address range. The populated physical pages is saved in
> + * userptr's sg table. It is similar to get_user_pages but call
> + * hmm_range_fault.
> + *
> + * This function also read mmu notifier sequence # (
> + * mmu_interval_read_begin), for the purpose of later
> + * comparison (through mmu_interval_read_retry).
> + *
> + * This must be called with mmap read or write lock held.
> + *
> + * This function allocates the storage of the userptr sg table.
> + * It is caller's responsibility to free it calling sg_free_table.

I'd add a helper that frees the sg table & unmaps the dma pages if
needed.
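
e.g. (sketch only, the name is made up):

static void xe_hmm_userptr_free_sg(struct xe_userptr_vma *uvma)
{
	struct xe_userptr *userptr = &uvma->userptr;
	struct xe_vm *vm = xe_vma_vm(&uvma->vma);

	if (!userptr->sg)
		return;

	dma_unmap_sgtable(vm->xe->drm.dev, userptr->sg,
			  xe_vma_read_only(&uvma->vma) ? DMA_TO_DEVICE :
			  DMA_BIDIRECTIONAL, 0);
	sg_free_table(userptr->sg);
	userptr->sg = NULL;
}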

> + *
> + * returns: 0 for succuss; negative error no on failure
> + */
> +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
> +						bool write)
> +{

The layering is all wrong here; we shouldn't be touching struct xe_vma
directly in the hmm layer.

Pass in a populated hmm_range and sgt. Or alternatively pass in
arguments and then populate a hmm_range locally.

> +	unsigned long timeout =
> +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> +	struct xe_userptr_vma *userptr_vma;
> +	struct xe_userptr *userptr;
> +	u64 start = vma->gpuva.va.addr;
> +	u64 end = start + vma->gpuva.va.range;

We have helpers - xe_vma_start and xe_vma_end - but I don't think either of
these is correct in this case.

xe_vma_start is the address at which this is bound to the GPU; we want the
userptr address.

So I think it would be:

start = xe_vma_userptr()
end = xe_vma_userptr() + xe_vma_size()


> +	struct xe_vm *vm = xe_vma_vm(vma);
> +	u64 npages;
> +	int ret;
> +
> +	userptr_vma = to_userptr_vma(vma);
> +	userptr = &userptr_vma->userptr;
> +	mmap_assert_locked(userptr->notifier.mm);
> +
> +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;

This math is done above; if you need this math in the next rev, add a helper.
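
e.g. (sketch, the name is made up):

static u64 npages_in_range(unsigned long start, unsigned long end)
{
	return ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
}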

> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (unlikely(!pfns))
> +		return -ENOMEM;
> +
> +	if (write)
> +		flags |= HMM_PFN_REQ_WRITE;
> +
> +	memset64((u64 *)pfns, (u64)flags, npages);

Why is this needed, can't we just set hmm_range->default_flags?
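
i.e. (sketch):

	hmm_range->default_flags = HMM_PFN_REQ_FAULT;
	if (write)
		hmm_range->default_flags |= HMM_PFN_REQ_WRITE;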

> +	hmm_range->hmm_pfns = pfns;
> +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);

We need a notifier_seq == userptr->notifier_seq check that just
returns, right?
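
e.g. (sketch, mirroring the check in the existing userptr code):

	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
	if (hmm_range->notifier_seq == userptr->notifier_seq)
		return 0;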

> +	hmm_range->notifier = &userptr->notifier;
> +	hmm_range->start = start;
> +	hmm_range->end = end;
> +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;

Is this needed? AMD and Nouveau do not set this argument.

> +	/**
> +	 * FIXME:
> +	 * Set the the dev_private_owner can prevent hmm_range_fault to fault
> +	 * in the device private pages owned by caller. See function
> +	 * hmm_vma_handle_pte. In multiple GPU case, this should be set to the
> +	 * device owner of the best migration destination. e.g., device0/vm0
> +	 * has a page fault, but we have determined the best placement of
> +	 * the fault address should be on device1, we should set below to
> +	 * device1 instead of device0.
> +	 */
> +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> +
> +	while (true) {

mmap_read_lock(mm);

> +		ret = hmm_range_fault(hmm_range);

mmap_read_unlock(mm);

> +		if (time_after(jiffies, timeout))
> +			break;
> +
> +		if (ret == -EBUSY)
> +			continue;
> +		break;
> +	}
> +
> +	if (ret)
> +		goto free_pfns;
> +
> +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> +	if (ret)
> +		goto free_pfns;
> +
> +	mark_range_accessed(hmm_range, write);
> +	userptr->sg = &userptr->sgt;

Again, this should be set in the caller after this function returns.

> +	userptr->notifier_seq = hmm_range->notifier_seq;

This could be passed by reference, I guess, and set here.

> +
> +free_pfns:
> +	kvfree(pfns);
> +	return ret;
> +}
> +
> diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
> new file mode 100644
> index 000000000000..960f3f6d36ae
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hmm.h
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/types.h>
> +#include <linux/hmm.h>
> +#include "xe_vm_types.h"
> +#include "xe_svm.h"

As per the previous patches, no need to include xe_*.h files; just forward
declare any argument structs.

Matt

> +
> +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
> +						bool write);
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-14 18:32     ` Zeng, Oak
@ 2024-03-14 20:49       ` Matthew Brost
  2024-03-15 16:00         ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Matthew Brost @ 2024-03-14 20:49 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad

On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Thursday, March 14, 2024 1:18 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> > GPU vram
> > 
> > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> > > page is backed by a struct page.
> > >
> > > Those struct pages are created to allow hmm migrate buffer b/t
> > > GPU vram and CPU system memory using existing Linux migration
> > > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> > >
> > > This is prepare work to enable svm (shared virtual memory) through
> > > Linux kernel hmm framework. The memory remap's page map type is set
> > > to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> > > vram page get a struct page and can be mapped in CPU page table,
> > > but such pages are treated as GPU's private resource, so CPU can't
> > > access them. If CPU access such page, a page fault is triggered
> > > and page will be migrate to system memory.
> > >
> > 
> > Is this really true? We can map VRAM BOs to the CPU without having
> > migarte back and forth. Admittedly I don't know the inner workings of
> > how this works but in IGTs we do this all the time.
> > 
> >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> >   55                                 vram_if_possible(fd, 0),
> >   56                                 DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> >   57         batch_map = xe_bo_map(fd, batch_bo, batch_size);
> > 
> > The BO is created in VRAM and then mapped to the CPU.
> > 
> > I don't think there is an expectation of coherence rather caching mode
> > and exclusive access of the memory based on synchronization.
> > 
> > e.g.
> > User write BB/data via CPU to GPU memory
> > User calls exec
> > GPU read / write memory
> > User wait on sync indicating exec done
> > User reads result
> > 
> > All of this works without migration. Are we not planing supporting flow
> > with SVM?
> > 
> > Afaik this migration dance really only needs to be done if the CPU and
> > GPU are using atomics on a shared memory region and the GPU device
> > doesn't support a coherent memory protocol (e.g. PVC).
> 
> All you said is true. On many of our HW, CPU can actually access device memory, cache coherently or not. 
> 
> The problem is, this is not true for all GPU vendors. For example, on some HW from some vendor, CPU can only access partially of device memory. The so called small bar concept.
> 
> So when HMM is defined, such factors were considered, and MEMORY_DEVICE_PRIVATE is defined. With this definition, CPU can't access device memory.
> 
> So you can think it is a limitation of HMM.
> 

Is it though? I see other types: MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA. From my limited
understanding, it looks to me like one of those modes would support my
example.

> Note this is only part 1 of our system allocator work. We do plan to support DEVICE_COHERENT for our newer device, see below. With this option, we don't have unnecessary migration back and forth.
> 
> You can think this is just work out all the code path. 90% of the driver code for DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real use of system allocator will be DEVICE_COHERENT mode. While DEVICE_PRIVATE mode allow us to exercise the code on old HW. 
> 
> Make sense?
>

I guess if we want the system allocator to always be coherent, then yes you
need this dynamic migration with faulting on either side.

I was thinking the system allocator would behave like my example
above, with madvise dictating the coherence rules.

Maybe I missed this in the system allocator design, but my feeling is we
shouldn't arbitrarily enforce coherence, as that could lead to poor
performance due to constant migration.

> 
> > 
> > > For GPU device which supports coherent memory protocol b/t CPU and
> > > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > > MEMORY_DEVICE_COHERENT. This is TBD.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > ++++++++++++++++++++++++++++
> > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > >
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index c531210695db..840467080e59 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > >  	xe_vram_freq.o \
> > >  	xe_wait_user_fence.o \
> > >  	xe_wa.o \
> > > -	xe_wopcm.o
> > > +	xe_wopcm.o \
> > > +	xe_svm_devmem.o
> > 
> > These should be in alphabetical order.
> 
> Will fix
> > 
> > >
> > >  # graphics hardware monitoring (HWMON) support
> > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > >  	resource_size_t actual_physical_size;
> > >  	/** @mapping: pointer to VRAM mappable space */
> > >  	void __iomem *mapping;
> > > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> > > +    struct dev_pagemap pagemap;
> > > +    /**
> > > +     * @hpa_base: base host physical address
> > > +     *
> > > +     * This is generated when remap device memory as ZONE_DEVICE
> > > +     */
> > > +    resource_size_t hpa_base;
> > 
> > Weird indentation. This goes for the entire series, look at checkpatch.
> 
> Will fix
> > 
> > > +
> > >  };
> > >
> > >  /**
> > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> > > index e3db3a178760..0d795394bc4c 100644
> > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > @@ -22,6 +22,7 @@
> > >  #include "xe_module.h"
> > >  #include "xe_sriov.h"
> > >  #include "xe_tile.h"
> > > +#include "xe_svm.h"
> > >
> > >  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> > >  		}
> > >
> > >  		io_size -= min_t(u64, tile_size, io_size);
> > > +		xe_svm_devm_add(tile, &tile->mem.vram);
> > 
> > Do we want to do this probe for all devices with VRAM or only a subset?
> 
> All

Can you explain why?

> > 
> > >  	}
> > >
> > >  	xe->mem.vram.actual_physical_size = total_size;
> > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
> > >  static void mmio_fini(struct drm_device *drm, void *arg)
> > >  {
> > >  	struct xe_device *xe = arg;
> > > +    struct xe_tile *tile;
> > > +    u8 id;
> > >
> > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > >  	if (xe->mem.vram.mapping)
> > >  		iounmap(xe->mem.vram.mapping);
> > > +
> > > +	for_each_tile(tile, xe, id)
> > > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > 
> > This should probably be above existing code. Typical on fini to do
> > things in reverse order from init.
> 
> Will fix
> > 
> > > +
> > >  }
> > >
> > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > new file mode 100644
> > > index 000000000000..09f9afb0e7d4
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -0,0 +1,14 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > 
> > 2024?
> 
> This patch was actually written 2023 
> > 
> > > + */
> > > +
> > > +#ifndef __XE_SVM_H
> > > +#define __XE_SVM_H
> > > +
> > > +#include "xe_device_types.h"
> > 
> > I don't think you need to include this. Rather just forward decl structs
> > used here.
> 
> Will fix
> > 
> > e.g.
> > 
> > struct xe_device;
> > struct xe_mem_region;
> > struct xe_tile;
> > 
> > > +
> > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > *mem);
> > 
> > The arguments here are incongruent here. Typically we want these to
> > match.
> 
> Will fix
> > 
> > > +
> > > +#endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > 
> > Incongruent between xe_svm.h and xe_svm_devmem.c. 
> 
> Did you mean mem vs mr? if yes, will fix
> 
> Again these two
> > should
> > match.
> > 
> > > new file mode 100644
> > > index 000000000000..63b7a1961cc6
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -0,0 +1,91 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > 
> > 2024?
> It is from 2023
> > 
> > > + */
> > > +
> > > +#include <linux/mm_types.h>
> > > +#include <linux/sched/mm.h>
> > > +
> > > +#include "xe_device_types.h"
> > > +#include "xe_trace.h"
> > 
> > xe_trace.h appears to be unused.
> 
> Will fix
> > 
> > > +#include "xe_svm.h"
> > > +
> > > +
> > > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static void xe_devm_page_free(struct page *page)
> > > +{
> > > +}
> > > +
> > > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > +	.page_free = xe_devm_page_free,
> > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > +};
> > > +
> > 
> > Assume these are placeholders that will be populated later?
> 
> 
> corrrect
> > 
> > > +/**
> > > + * xe_svm_devm_add: Remap and provide memmap backing for device
> > memory
> > > + * @tile: tile that the memory region blongs to
> > > + * @mr: memory region to remap
> > > + *
> > > + * This remap device memory to host physical address space and create
> > > + * struct page to back device memory
> > > + *
> > > + * Return: 0 on success standard error code otherwise
> > > + */
> > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > 
> > Here I see the xe_mem_region is from tile->mem.vram, wondering rather
> > than using the tile->mem.vram we should use xe->mem.vram when enabling
> > svm? Isn't the idea behind svm the entire memory is 1 view?
> 
> Still need to use tile. The reason is, memory of different tile can have different characteristics, such as latency. So we want to differentiate memory b/t tiles also in svm. I need to change below " mr->pagemap.owner = tile->xe->drm.dev ". the owner should also be tile. This is the way hmm differentiate memory of different tile.
> 
> With svm it is 1 view, from virtual address space perspective and from physical struct page perspective. You can think of all the tile's vram is stacked together to form a unified view together with system memory. This doesn't prohibit us from differentiate memory from different tile. This differentiation allow us to optimize performance, i.e., we can wisely place memory in specific tile. If we don't differentiate, this is not possible. 
>

Ok makes sense.

Matt

> > 
> > I suppose if we do that we also only use 1 TTM VRAM manager / buddy
> > allocator too. I thought I saw some patches flying around for that too.
> 
> Ttm vram manager is not in the picture. We deliberately avoided it per previous discussion
> 
> Yes same buddy allocator. It is in my previous POC: https://lore.kernel.org/dri-devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't put those patches in this series because I want to merge this small patches separately.
> > 
> > > +{
> > > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > > +	struct resource *res;
> > > +	void *addr;
> > > +	int ret;
> > > +
> > > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > > +					   mr->usable_size);
> > > +	if (IS_ERR(res)) {
> > > +		ret = PTR_ERR(res);
> > > +		return ret;
> > > +	}
> > > +
> > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > +	mr->pagemap.range.start = res->start;
> > > +	mr->pagemap.range.end = res->end;
> > > +	mr->pagemap.nr_range = 1;
> > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > +	if (IS_ERR(addr)) {
> > > +		devm_release_mem_region(dev, res->start, resource_size(res));
> > > +		ret = PTR_ERR(addr);
> > > +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
> > errno %d\n",
> > > +				tile->id, ret);
> > > +		return ret;
> > > +	}
> > > +	mr->hpa_base = res->start;
> > > +
> > > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
> > remapped to %pr\n",
> > > +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
> > res);
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_svm_devm_remove: Unmap device memory and free resources
> > > + * @xe: xe device
> > > + * @mr: memory region to remove
> > > + */
> > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > *mr)
> > > +{
> > > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > > +#if 0
> > > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > > +
> > > +	if (mr->hpa_base) {
> > > +		devm_memunmap_pages(dev, &mr->pagemap);
> > > +		devm_release_mem_region(dev, mr->pagemap.range.start,
> > > +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> > > +	}
> > > +#endif
> > 
> > This would need to be fixed too.
> 
> 
> Yes...
> 
> Oak
> > 
> > Matt
> > 
> > > +}
> > > +
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages
  2024-03-14  3:35 ` [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
@ 2024-03-14 20:54   ` Matthew Brost
  2024-03-19  2:36     ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Matthew Brost @ 2024-03-14 20:54 UTC (permalink / raw)
  To: Oak Zeng
  Cc: intel-xe, thomas.hellstrom, airlied, brian.welty, himal.prasad.ghimiray

On Wed, Mar 13, 2024 at 11:35:53PM -0400, Oak Zeng wrote:
> This is an effort to unify hmmptr (aka system allocator)
> and userptr code. hmm_range_fault is used to populate
> a virtual address range for both hmmptr and userptr,
> instead of hmmptr using hmm_range_fault and userptr
> using get_user_pages_fast.
> 
> This also aligns with AMD gpu driver's behavior. In
> long term, we plan to put some common helpers in this
> area to drm layer so it can be re-used by different
> vendors.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_vm.c | 105 ++-----------------------------------
>  1 file changed, 4 insertions(+), 101 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index db3f049a47dc..d6088dcac74a 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -38,6 +38,7 @@
>  #include "xe_sync.h"
>  #include "xe_trace.h"
>  #include "xe_wa.h"
> +#include "xe_hmm.h"
>  
>  static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
>  {
> @@ -65,113 +66,15 @@ int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma)
>  
>  int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)

See my comments in the previous patch about layering; those comments are
valid here too.

>  {
> -	struct xe_userptr *userptr = &uvma->userptr;
>  	struct xe_vma *vma = &uvma->vma;
>  	struct xe_vm *vm = xe_vma_vm(vma);
>  	struct xe_device *xe = vm->xe;
> -	const unsigned long num_pages = xe_vma_size(vma) >> PAGE_SHIFT;
> -	struct page **pages;
> -	bool in_kthread = !current->mm;
> -	unsigned long notifier_seq;
> -	int pinned, ret, i;
> -	bool read_only = xe_vma_read_only(vma);
> +	bool write = !xe_vma_read_only(vma);
> +	struct hmm_range hmm_range;
>  
>  	lockdep_assert_held(&vm->lock);
>  	xe_assert(xe, xe_vma_is_userptr(vma));
> -retry:
> -	if (vma->gpuva.flags & XE_VMA_DESTROYED)
> -		return 0;

^^^
This should not be dropped. Both the vma->gpuva.flags & XE_VMA_DESTROYED
check and the userptr invalidation retry loop should still be in here.
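
e.g. keep a skeleton along these lines (sketch, with xe_hmm_populate_range
standing in for whatever the helper ends up being called):

retry:
	if (vma->gpuva.flags & XE_VMA_DESTROYED)
		return 0;

	ret = xe_hmm_populate_range(vma, &hmm_range, write);
	if (ret)
		return ret;

	if (xe_vma_userptr_check_repin(uvma) == -EAGAIN)
		goto retry;

	return 0;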

> -
> -	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
> -	if (notifier_seq == userptr->notifier_seq)
> -		return 0;
> -
> -	pages = kvmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL);
> -	if (!pages)
> -		return -ENOMEM;
> -
> -	if (userptr->sg) {
> -		dma_unmap_sgtable(xe->drm.dev,
> -				  userptr->sg,
> -				  read_only ? DMA_TO_DEVICE :
> -				  DMA_BIDIRECTIONAL, 0);
> -		sg_free_table(userptr->sg);
> -		userptr->sg = NULL;
> -	}

^^^
Likewise, I don't think this should be dropped either.

> -
> -	pinned = ret = 0;
> -	if (in_kthread) {
> -		if (!mmget_not_zero(userptr->notifier.mm)) {
> -			ret = -EFAULT;
> -			goto mm_closed;
> -		}
> -		kthread_use_mm(userptr->notifier.mm);
> -	}

^^^
Nor this.

> -
> -	while (pinned < num_pages) {
> -		ret = get_user_pages_fast(xe_vma_userptr(vma) +
> -					  pinned * PAGE_SIZE,
> -					  num_pages - pinned,
> -					  read_only ? 0 : FOLL_WRITE,
> -					  &pages[pinned]);
> -		if (ret < 0)
> -			break;
> -
> -		pinned += ret;
> -		ret = 0;
> -	}

^^^
We should be replacing this.

> -
> -	if (in_kthread) {
> -		kthread_unuse_mm(userptr->notifier.mm);
> -		mmput(userptr->notifier.mm);
> -	}
> -mm_closed:
> -	if (ret)
> -		goto out;
> -
> -	ret = sg_alloc_table_from_pages_segment(&userptr->sgt, pages,
> -						pinned, 0,
> -						(u64)pinned << PAGE_SHIFT,
> -						xe_sg_segment_size(xe->drm.dev),
> -						GFP_KERNEL);
> -	if (ret) {
> -		userptr->sg = NULL;
> -		goto out;
> -	}
> -	userptr->sg = &userptr->sgt;
> -
> -	ret = dma_map_sgtable(xe->drm.dev, userptr->sg,
> -			      read_only ? DMA_TO_DEVICE :
> -			      DMA_BIDIRECTIONAL,
> -			      DMA_ATTR_SKIP_CPU_SYNC |
> -			      DMA_ATTR_NO_KERNEL_MAPPING);
> -	if (ret) {
> -		sg_free_table(userptr->sg);
> -		userptr->sg = NULL;
> -		goto out;
> -	}
> -
> -	for (i = 0; i < pinned; ++i) {
> -		if (!read_only) {
> -			lock_page(pages[i]);
> -			set_page_dirty(pages[i]);
> -			unlock_page(pages[i]);
> -		}
> -
> -		mark_page_accessed(pages[i]);
> -	}
> -
> -out:
> -	release_pages(pages, pinned);
> -	kvfree(pages);

^^^
Through here (minus exiting the kthread) should be replaced with the hmm
call. I guess the kthread enter / exit could be in the hmm layer too.
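
e.g. the hmm helper could do the same dance around hmm_range_fault() as the
code being deleted here (sketch):

	if (in_kthread) {
		if (!mmget_not_zero(userptr->notifier.mm))
			return -EFAULT;
		kthread_use_mm(userptr->notifier.mm);
	}

	/* ... hmm_range_fault() retry loop ... */

	if (in_kthread) {
		kthread_unuse_mm(userptr->notifier.mm);
		mmput(userptr->notifier.mm);
	}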

Matt

> -
> -	if (!(ret < 0)) {
> -		userptr->notifier_seq = notifier_seq;
> -		if (xe_vma_userptr_check_repin(uvma) == -EAGAIN)
> -			goto retry;
> -	}
> -
> -	return ret < 0 ? ret : 0;
> +	return xe_hmm_populate_range(vma, &hmm_range, write);
>  }
>  
>  static bool preempt_fences_waiting(struct xe_vm *vm)
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-14  3:35 ` [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
  2024-03-14 17:17   ` Matthew Brost
@ 2024-03-15  1:45   ` Welty, Brian
  2024-03-15  3:10     ` Zeng, Oak
  1 sibling, 1 reply; 49+ messages in thread
From: Welty, Brian @ 2024-03-15  1:45 UTC (permalink / raw)
  To: Oak Zeng, intel-xe
  Cc: thomas.hellstrom, matthew.brost, airlied, himal.prasad.ghimiray

Hi Oak,

On 3/13/2024 8:35 PM, Oak Zeng wrote:
> Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> page is backed by a struct page.
> 
> Those struct pages are created to allow hmm migrate buffer b/t
> GPU vram and CPU system memory using existing Linux migration
> mechanism (i.e., migrating b/t CPU system memory and hard disk).
> 
> This is prepare work to enable svm (shared virtual memory) through
> Linux kernel hmm framework. The memory remap's page map type is set
> to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> vram page get a struct page and can be mapped in CPU page table,
> but such pages are treated as GPU's private resource, so CPU can't
> access them. If CPU access such page, a page fault is triggered
> and page will be migrate to system memory.
> 
> For GPU device which supports coherent memory protocol b/t CPU and
> GPU (such as CXL and CAPI protocol), we can remap device memory as
> MEMORY_DEVICE_COHERENT. This is TBD.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>   drivers/gpu/drm/xe/Makefile          |  3 +-
>   drivers/gpu/drm/xe/xe_device_types.h |  9 +++
>   drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
>   drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
>   drivers/gpu/drm/xe/xe_svm_devmem.c   | 91 ++++++++++++++++++++++++++++
>   5 files changed, 124 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>   create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index c531210695db..840467080e59 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
>   	xe_vram_freq.o \
>   	xe_wait_user_fence.o \
>   	xe_wa.o \
> -	xe_wopcm.o
> +	xe_wopcm.o \
> +	xe_svm_devmem.o

Minor, but maintainers want above alphabetically sorted.

>   
>   # graphics hardware monitoring (HWMON) support
>   xe-$(CONFIG_HWMON) += xe_hwmon.o
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 9785eef2e5a4..f27c3bee8ce7 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -99,6 +99,15 @@ struct xe_mem_region {
>   	resource_size_t actual_physical_size;
>   	/** @mapping: pointer to VRAM mappable space */
>   	void __iomem *mapping;
> +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> +    struct dev_pagemap pagemap;
> +    /**
> +     * @hpa_base: base host physical address
> +     *
> +     * This is generated when remap device memory as ZONE_DEVICE
> +     */
> +    resource_size_t hpa_base;
> +
>   };
>   
>   /**
> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> index e3db3a178760..0d795394bc4c 100644
> --- a/drivers/gpu/drm/xe/xe_mmio.c
> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> @@ -22,6 +22,7 @@
>   #include "xe_module.h"
>   #include "xe_sriov.h"
>   #include "xe_tile.h"
> +#include "xe_svm.h"
>   
>   #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
>   #define TILE_COUNT		REG_GENMASK(15, 8)
> @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
>   		}
>   
>   		io_size -= min_t(u64, tile_size, io_size);
> +		xe_svm_devm_add(tile, &tile->mem.vram);

I think a slightly more appropriate call site for this might be
xe_tile_init_noalloc(), as that function states it is preparing the tile
for VRAM allocations.
Also, I mention this because we might like the flexibility in the future to
call this once for xe_device.mem.vram, instead of calling it for each tile,
and it is easier to handle that in xe_tile.c instead of xe_mmio.c.
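
e.g. (sketch; exactly where inside xe_tile_init_noalloc() this would slot in
is a guess on my part):

	/* in xe_tile_init_noalloc(), once the tile's VRAM region is known */
	err = xe_svm_devm_add(tile, &tile->mem.vram);
	if (err)
		return err;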

Related comment below.


>   	}
>   
>   	xe->mem.vram.actual_physical_size = total_size;
> @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
>   static void mmio_fini(struct drm_device *drm, void *arg)
>   {
>   	struct xe_device *xe = arg;
> +    struct xe_tile *tile;
> +    u8 id;
>   
>   	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>   	if (xe->mem.vram.mapping)
>   		iounmap(xe->mem.vram.mapping);
> +
> +	for_each_tile(tile, xe, id)
> +		xe_svm_devm_remove(xe, &tile->mem.vram);
> +
>   }
>   
>   static int xe_verify_lmem_ready(struct xe_device *xe)
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> new file mode 100644
> index 000000000000..09f9afb0e7d4
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -0,0 +1,14 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#ifndef __XE_SVM_H
> +#define __XE_SVM_H
> +
> +#include "xe_device_types.h"
> +
> +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mem);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> new file mode 100644
> index 000000000000..63b7a1961cc6
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/mm_types.h>
> +#include <linux/sched/mm.h>
> +
> +#include "xe_device_types.h"
> +#include "xe_trace.h"
> +#include "xe_svm.h"
> +
> +
> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> +{
> +	return 0;
> +}
> +
> +static void xe_devm_page_free(struct page *page)
> +{
> +}
> +
> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> +	.page_free = xe_devm_page_free,
> +	.migrate_to_ram = xe_devm_migrate_to_ram,
> +};
> +
> +/**
> + * xe_svm_devm_add: Remap and provide memmap backing for device memory

Do we really need 'svm' in the function name?

> + * @tile: tile that the memory region blongs to

We might like the flexibility in the future to call this once for 
xe_device.mem.vram, instead of calling it for each tile.
So can we remove the tile argument, and just pass the xe_device pointer
and tile->id?
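
i.e. something along these lines (sketch only, naming up to you):

int xe_devm_add(struct xe_device *xe, u8 tile_id, struct xe_mem_region *mr);

	/* per-tile vram, as today */
	xe_devm_add(tile->xe, tile->id, &tile->mem.vram);

	/* possible future device-wide vram */
	xe_devm_add(xe, 0, &xe->mem.vram);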


> + * @mr: memory region to remap
> + *
> + * This remap device memory to host physical address space and create
> + * struct page to back device memory
> + *
> + * Return: 0 on success standard error code otherwise
> + */
> +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> +{
> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> +	struct resource *res;
> +	void *addr;
> +	int ret;
> +
> +	res = devm_request_free_mem_region(dev, &iomem_resource,
> +					   mr->usable_size);
> +	if (IS_ERR(res)) {
> +		ret = PTR_ERR(res);
> +		return ret;
> +	}
> +
> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	mr->pagemap.range.start = res->start;
> +	mr->pagemap.range.end = res->end;
> +	mr->pagemap.nr_range = 1;
> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> +	mr->pagemap.owner = tile->xe->drm.dev;
> +	addr = devm_memremap_pages(dev, &mr->pagemap);
> +	if (IS_ERR(addr)) {
> +		devm_release_mem_region(dev, res->start, resource_size(res));
> +		ret = PTR_ERR(addr);
> +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory, errno %d\n",
> +				tile->id, ret);
> +		return ret;
> +	}
> +	mr->hpa_base = res->start;
> +
> +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm, remapped to %pr\n",
> +			tile->id, mr->io_start, mr->io_start + mr->usable_size, res);
> +	return 0;
> +}
> +
> +/**
> + * xe_svm_devm_remove: Unmap device memory and free resources
> + * @xe: xe device
> + * @mr: memory region to remove
> + */
> +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region *mr)
> +{
> +	/*FIXME: below cause a kernel hange during moduel remove*/
> +#if 0
> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> +
> +	if (mr->hpa_base) {
> +		devm_memunmap_pages(dev, &mr->pagemap);
> +		devm_release_mem_region(dev, mr->pagemap.range.start,
> +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> +	}
> +#endif
> +}
> +

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 2/5] drm/xe: Helper to get memory region from tile
  2024-03-14 17:44   ` Matthew Brost
@ 2024-03-15  2:48     ` Zeng, Oak
  0 siblings, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-15  2:48 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Thursday, March 14, 2024 1:44 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 2/5] drm/xe: Helper to get memory region from tile
> 
> On Wed, Mar 13, 2024 at 11:35:50PM -0400, Oak Zeng wrote:
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> 
> Missed this. Need a commit message. Also this looks to be unused in this series.

Yes... I accidentally picked the wrong patch. Will drop this one and pick up the correct one.

Oak
> 
> Matt
> 
> > ---
> >  drivers/gpu/drm/xe/xe_device_types.h | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> > index f27c3bee8ce7..bbea40b57e84 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -571,4 +571,9 @@ struct xe_file {
> >  	struct xe_drm_client *client;
> >  };
> >
> > +static inline struct xe_tile *mem_region_to_tile(struct xe_mem_region *mr)
> > +{
> > +	return container_of(mr, struct xe_tile, mem.vram);
> > +}
> > +
> >  #endif
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-15  1:45   ` Welty, Brian
@ 2024-03-15  3:10     ` Zeng, Oak
  2024-03-15  3:16       ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-15  3:10 UTC (permalink / raw)
  To: Welty, Brian, intel-xe
  Cc: Hellstrom, Thomas, Brost, Matthew, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Welty, Brian <brian.welty@intel.com>
> Sent: Thursday, March 14, 2024 9:46 PM
> To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org
> Cc: Hellstrom, Thomas <thomas.hellstrom@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> Hi Oak,
> 
> On 3/13/2024 8:35 PM, Oak Zeng wrote:
> > Memory remap GPU vram using devm_memremap_pages, so each GPU vram
> > page is backed by a struct page.
> >
> > Those struct pages are created to allow hmm migrate buffer b/t
> > GPU vram and CPU system memory using existing Linux migration
> > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> >
> > This is prepare work to enable svm (shared virtual memory) through
> > Linux kernel hmm framework. The memory remap's page map type is set
> > to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> > vram page get a struct page and can be mapped in CPU page table,
> > but such pages are treated as GPU's private resource, so CPU can't
> > access them. If CPU access such page, a page fault is triggered
> > and page will be migrate to system memory.
> >
> > For GPU device which supports coherent memory protocol b/t CPU and
> > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > MEMORY_DEVICE_COHERENT. This is TBD.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >   drivers/gpu/drm/xe/Makefile          |  3 +-
> >   drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> >   drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> >   drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> >   drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> ++++++++++++++++++++++++++++
> >   5 files changed, 124 insertions(+), 1 deletion(-)
> >   create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> >   create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index c531210695db..840467080e59 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> >   	xe_vram_freq.o \
> >   	xe_wait_user_fence.o \
> >   	xe_wa.o \
> > -	xe_wopcm.o
> > +	xe_wopcm.o \
> > +	xe_svm_devmem.o
> 
> Minor, but maintainers want above alphabetically sorted.

Correct. Matt pointed out the same. Will fix
> 
> >
> >   # graphics hardware monitoring (HWMON) support
> >   xe-$(CONFIG_HWMON) += xe_hwmon.o
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> > index 9785eef2e5a4..f27c3bee8ce7 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -99,6 +99,15 @@ struct xe_mem_region {
> >   	resource_size_t actual_physical_size;
> >   	/** @mapping: pointer to VRAM mappable space */
> >   	void __iomem *mapping;
> > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> > +    struct dev_pagemap pagemap;
> > +    /**
> > +     * @hpa_base: base host physical address
> > +     *
> > +     * This is generated when remap device memory as ZONE_DEVICE
> > +     */
> > +    resource_size_t hpa_base;
> > +
> >   };
> >
> >   /**
> > diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> > index e3db3a178760..0d795394bc4c 100644
> > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > @@ -22,6 +22,7 @@
> >   #include "xe_module.h"
> >   #include "xe_sriov.h"
> >   #include "xe_tile.h"
> > +#include "xe_svm.h"
> >
> >   #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> >   #define TILE_COUNT		REG_GENMASK(15, 8)
> > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> >   		}
> >
> >   		io_size -= min_t(u64, tile_size, io_size);
> > +		xe_svm_devm_add(tile, &tile->mem.vram);
> 
> I think slightly more appropriate call site for this might be
> xe_tile_init_noalloc(), as that function states it is preparing tile
> for VRAM allocations.
> Also, I mention because we might like the flexiblity in future to call
> this once for xe_device.mem.vram, instead of calling for each tile,
> and easier to handle that in xe_tile.c instead of xe_mmio.c.

Good point. Will follow.
> 
> Related comment below.
> 
> 
> >   	}
> >
> >   	xe->mem.vram.actual_physical_size = total_size;
> > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
> >   static void mmio_fini(struct drm_device *drm, void *arg)
> >   {
> >   	struct xe_device *xe = arg;
> > +    struct xe_tile *tile;
> > +    u8 id;
> >
> >   	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> >   	if (xe->mem.vram.mapping)
> >   		iounmap(xe->mem.vram.mapping);
> > +
> > +	for_each_tile(tile, xe, id)
> > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > +
> >   }
> >
> >   static int xe_verify_lmem_ready(struct xe_device *xe)
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > new file mode 100644
> > index 000000000000..09f9afb0e7d4
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -0,0 +1,14 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> > + */
> > +
> > +#ifndef __XE_SVM_H
> > +#define __XE_SVM_H
> > +
> > +#include "xe_device_types.h"
> > +
> > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> *mem);
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > new file mode 100644
> > index 000000000000..63b7a1961cc6
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -0,0 +1,91 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> > + */
> > +
> > +#include <linux/mm_types.h>
> > +#include <linux/sched/mm.h>
> > +
> > +#include "xe_device_types.h"
> > +#include "xe_trace.h"
> > +#include "xe_svm.h"
> > +
> > +
> > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > +{
> > +	return 0;
> > +}
> > +
> > +static void xe_devm_page_free(struct page *page)
> > +{
> > +}
> > +
> > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > +	.page_free = xe_devm_page_free,
> > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > +};
> > +
> > +/**
> > + * xe_svm_devm_add: Remap and provide memmap backing for device
> memory
> 
> Do we really need 'svm' in function name?

Good point. I will remove svm.
> 
> > + * @tile: tile that the memory region blongs to
> 
> We might like flexibility in future to call this once for
> xe_device.mem.vram, instead of calling for each tile.
> So can we remove the tile argument, and just pass the xe_device pointer
> and tile->id ?

This is interesting. 

First of all, I got the line below wrong: mr->pagemap.owner = tile->xe->drm.dev;

For a NUMA vram system, this should be: mr->pagemap.owner = tile;

For UMA vram, this should be: mr->pagemap.owner = tile_to_xe(tile);

This owner is important. It is used later by hmm to decide migration, so we need to set it in a way that lets hmm identify the different vram regions.

Based on the above, I think the tile parameter is better. For UMA, the caller needs to make sure to call it only once, and any tile pointer would work. That does sound a little weird, but I don’t have a better idea.
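
In code, what I mean is roughly the below. Sketch only: vram_is_per_tile() is a made-up helper to illustrate the split, and whatever we set here also has to match the dev_private_owner we later pass to hmm_range_fault.

	if (vram_is_per_tile(tile->xe))
		mr->pagemap.owner = tile;		/* NUMA vram: one owner per tile */
	else
		mr->pagemap.owner = tile_to_xe(tile);	/* UMA vram: one owner for the device */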

Oak
> 
> 
> > + * @mr: memory region to remap
> > + *
> > + * This remap device memory to host physical address space and create
> > + * struct page to back device memory
> > + *
> > + * Return: 0 on success standard error code otherwise
> > + */
> > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > +{
> > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > +	struct resource *res;
> > +	void *addr;
> > +	int ret;
> > +
> > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > +					   mr->usable_size);
> > +	if (IS_ERR(res)) {
> > +		ret = PTR_ERR(res);
> > +		return ret;
> > +	}
> > +
> > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > +	mr->pagemap.range.start = res->start;
> > +	mr->pagemap.range.end = res->end;
> > +	mr->pagemap.nr_range = 1;
> > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > +	mr->pagemap.owner = tile->xe->drm.dev;
> > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > +	if (IS_ERR(addr)) {
> > +		devm_release_mem_region(dev, res->start, resource_size(res));
> > +		ret = PTR_ERR(addr);
> > +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
> errno %d\n",
> > +				tile->id, ret);
> > +		return ret;
> > +	}
> > +	mr->hpa_base = res->start;
> > +
> > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
> remapped to %pr\n",
> > +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
> res);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * xe_svm_devm_remove: Unmap device memory and free resources
> > + * @xe: xe device
> > + * @mr: memory region to remove
> > + */
> > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> *mr)
> > +{
> > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > +#if 0
> > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > +
> > +	if (mr->hpa_base) {
> > +		devm_memunmap_pages(dev, &mr->pagemap);
> > +		devm_release_mem_region(dev, mr->pagemap.range.start,
> > +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> > +	}
> > +#endif
> > +}
> > +

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-15  3:10     ` Zeng, Oak
@ 2024-03-15  3:16       ` Zeng, Oak
  2024-03-15 18:05         ` Welty, Brian
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-15  3:16 UTC (permalink / raw)
  To: Zeng, Oak, Welty, Brian, intel-xe
  Cc: Hellstrom, Thomas, Brost, Matthew, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Intel-xe <intel-xe-bounces@lists.freedesktop.org> On Behalf Of Zeng, Oak
> Sent: Thursday, March 14, 2024 11:10 PM
> To: Welty, Brian <brian.welty@intel.com>; intel-xe@lists.freedesktop.org
> Cc: Hellstrom, Thomas <thomas.hellstrom@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> 
> 
> > -----Original Message-----
> > From: Welty, Brian <brian.welty@intel.com>
> > Sent: Thursday, March 14, 2024 9:46 PM
> > To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org
> > Cc: Hellstrom, Thomas <thomas.hellstrom@intel.com>; Brost, Matthew
> > <matthew.brost@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing
> for
> > GPU vram
> >
> > Hi Oak,
> >
> > On 3/13/2024 8:35 PM, Oak Zeng wrote:
> > > Memory remap GPU vram using devm_memremap_pages, so each GPU
> vram
> > > page is backed by a struct page.
> > >
> > > Those struct pages are created to allow hmm migrate buffer b/t
> > > GPU vram and CPU system memory using existing Linux migration
> > > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> > >
> > > This is prepare work to enable svm (shared virtual memory) through
> > > Linux kernel hmm framework. The memory remap's page map type is set
> > > to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> > > vram page get a struct page and can be mapped in CPU page table,
> > > but such pages are treated as GPU's private resource, so CPU can't
> > > access them. If CPU access such page, a page fault is triggered
> > > and page will be migrate to system memory.
> > >
> > > For GPU device which supports coherent memory protocol b/t CPU and
> > > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > > MEMORY_DEVICE_COHERENT. This is TBD.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/Makefile          |  3 +-
> > >   drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > >   drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > >   drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > >   drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > ++++++++++++++++++++++++++++
> > >   5 files changed, 124 insertions(+), 1 deletion(-)
> > >   create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > >   create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > >
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index c531210695db..840467080e59 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > >   	xe_vram_freq.o \
> > >   	xe_wait_user_fence.o \
> > >   	xe_wa.o \
> > > -	xe_wopcm.o
> > > +	xe_wopcm.o \
> > > +	xe_svm_devmem.o
> >
> > Minor, but maintainers want above alphabetically sorted.
> 
> Correct. Matt pointed out the same. Will fix
> >
> > >
> > >   # graphics hardware monitoring (HWMON) support
> > >   xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > >   	resource_size_t actual_physical_size;
> > >   	/** @mapping: pointer to VRAM mappable space */
> > >   	void __iomem *mapping;
> > > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> > > +    struct dev_pagemap pagemap;
> > > +    /**
> > > +     * @hpa_base: base host physical address
> > > +     *
> > > +     * This is generated when remap device memory as ZONE_DEVICE
> > > +     */
> > > +    resource_size_t hpa_base;
> > > +
> > >   };
> > >
> > >   /**
> > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> b/drivers/gpu/drm/xe/xe_mmio.c
> > > index e3db3a178760..0d795394bc4c 100644
> > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > @@ -22,6 +22,7 @@
> > >   #include "xe_module.h"
> > >   #include "xe_sriov.h"
> > >   #include "xe_tile.h"
> > > +#include "xe_svm.h"
> > >
> > >   #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > >   #define TILE_COUNT		REG_GENMASK(15, 8)
> > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> > >   		}
> > >
> > >   		io_size -= min_t(u64, tile_size, io_size);
> > > +		xe_svm_devm_add(tile, &tile->mem.vram);
> >
> > I think slightly more appropriate call site for this might be
> > xe_tile_init_noalloc(), as that function states it is preparing tile
> > for VRAM allocations.
> > Also, I mention because we might like the flexiblity in future to call
> > this once for xe_device.mem.vram, instead of calling for each tile,
> > and easier to handle that in xe_tile.c instead of xe_mmio.c.
> 
> Good point. Will follow.

Sorry, with my comment below, do you still want to call it in xe_tile_init_noalloc?

For UMA, we only need to call it once. If we do it in xe_tile_init_noalloc(), it would be called multiple times (once per tile). Right?
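
If it does stay in xe_tile_init_noalloc(), I guess the UMA case could be reduced to something like the untested sketch below, but it still feels a bit odd:

	/* UMA sketch: register only once, from the root tile, and hand it
	 * the device-wide region instead of the per-tile one.
	 */
	if (tile->id == 0)
		err = xe_svm_devm_add(tile, &tile->xe->mem.vram);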

Oak

> >
> > Related comment below.
> >
> >
> > >   	}
> > >
> > >   	xe->mem.vram.actual_physical_size = total_size;
> > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
> > >   static void mmio_fini(struct drm_device *drm, void *arg)
> > >   {
> > >   	struct xe_device *xe = arg;
> > > +    struct xe_tile *tile;
> > > +    u8 id;
> > >
> > >   	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > >   	if (xe->mem.vram.mapping)
> > >   		iounmap(xe->mem.vram.mapping);
> > > +
> > > +	for_each_tile(tile, xe, id)
> > > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > > +
> > >   }
> > >
> > >   static int xe_verify_lmem_ready(struct xe_device *xe)
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > new file mode 100644
> > > index 000000000000..09f9afb0e7d4
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -0,0 +1,14 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > > + */
> > > +
> > > +#ifndef __XE_SVM_H
> > > +#define __XE_SVM_H
> > > +
> > > +#include "xe_device_types.h"
> > > +
> > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > *mem);
> > > +
> > > +#endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > new file mode 100644
> > > index 000000000000..63b7a1961cc6
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > @@ -0,0 +1,91 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/mm_types.h>
> > > +#include <linux/sched/mm.h>
> > > +
> > > +#include "xe_device_types.h"
> > > +#include "xe_trace.h"
> > > +#include "xe_svm.h"
> > > +
> > > +
> > > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static void xe_devm_page_free(struct page *page)
> > > +{
> > > +}
> > > +
> > > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > +	.page_free = xe_devm_page_free,
> > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > +};
> > > +
> > > +/**
> > > + * xe_svm_devm_add: Remap and provide memmap backing for device
> > memory
> >
> > Do we really need 'svm' in function name?
> 
> Good point. I will remove svm.
> >
> > > + * @tile: tile that the memory region blongs to
> >
> > We might like flexibility in future to call this once for
> > xe_device.mem.vram, instead of calling for each tile.
> > So can we remove the tile argument, and just pass the xe_device pointer
> > and tile->id ?
> 
> This is interesting.
> 
> First of all, I programmed wrong below: mr->pagemap.owner = tile->xe-
> >drm.dev;
> 
> This should be: mr->pagemap.owner = tile for NUMA vram system.
> 
> For UMA vram, this should be: mr->pagemap.owner = tile_to_xe(tile);
> 
> This owner is important. It is used later to decide migration by hmm. We need to
> set the owner for hmm to identify different vram region.
> 
> Based on above, I think the tile parameter is better. For UMA, caller need to
> make sure call it once, any tile pointer should work. This does sound a little weird.
> But I don’t have a better idea.
> 
> Oak
> >
> >
> > > + * @mr: memory region to remap
> > > + *
> > > + * This remap device memory to host physical address space and create
> > > + * struct page to back device memory
> > > + *
> > > + * Return: 0 on success standard error code otherwise
> > > + */
> > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > > +{
> > > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > > +	struct resource *res;
> > > +	void *addr;
> > > +	int ret;
> > > +
> > > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > > +					   mr->usable_size);
> > > +	if (IS_ERR(res)) {
> > > +		ret = PTR_ERR(res);
> > > +		return ret;
> > > +	}
> > > +
> > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > +	mr->pagemap.range.start = res->start;
> > > +	mr->pagemap.range.end = res->end;
> > > +	mr->pagemap.nr_range = 1;
> > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > +	if (IS_ERR(addr)) {
> > > +		devm_release_mem_region(dev, res->start, resource_size(res));
> > > +		ret = PTR_ERR(addr);
> > > +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
> > errno %d\n",
> > > +				tile->id, ret);
> > > +		return ret;
> > > +	}
> > > +	mr->hpa_base = res->start;
> > > +
> > > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
> > remapped to %pr\n",
> > > +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
> > res);
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_svm_devm_remove: Unmap device memory and free resources
> > > + * @xe: xe device
> > > + * @mr: memory region to remove
> > > + */
> > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > *mr)
> > > +{
> > > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > > +#if 0
> > > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > > +
> > > +	if (mr->hpa_base) {
> > > +		devm_memunmap_pages(dev, &mr->pagemap);
> > > +		devm_release_mem_region(dev, mr->pagemap.range.start,
> > > +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> > > +	}
> > > +#endif
> > > +}
> > > +

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-14 20:49       ` Matthew Brost
@ 2024-03-15 16:00         ` Zeng, Oak
  2024-03-15 20:39           ` Matthew Brost
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-15 16:00 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Thursday, March 14, 2024 4:49 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> > Hi Matt,
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Thursday, March 14, 2024 1:18 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing
> for
> > > GPU vram
> > >
> > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > > Memory remap GPU vram using devm_memremap_pages, so each GPU
> vram
> > > > page is backed by a struct page.
> > > >
> > > > Those struct pages are created to allow hmm migrate buffer b/t
> > > > GPU vram and CPU system memory using existing Linux migration
> > > > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> > > >
> > > > This is prepare work to enable svm (shared virtual memory) through
> > > > Linux kernel hmm framework. The memory remap's page map type is set
> > > > to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> > > > vram page get a struct page and can be mapped in CPU page table,
> > > > but such pages are treated as GPU's private resource, so CPU can't
> > > > access them. If CPU access such page, a page fault is triggered
> > > > and page will be migrate to system memory.
> > > >
> > >
> > > Is this really true? We can map VRAM BOs to the CPU without having
> > > migarte back and forth. Admittedly I don't know the inner workings of
> > > how this works but in IGTs we do this all the time.
> > >
> > >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> > >   55                                 vram_if_possible(fd, 0),
> > >   56                                 DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > >   57         batch_map = xe_bo_map(fd, batch_bo, batch_size);
> > >
> > > The BO is created in VRAM and then mapped to the CPU.
> > >
> > > I don't think there is an expectation of coherence rather caching mode
> > > and exclusive access of the memory based on synchronization.
> > >
> > > e.g.
> > > User write BB/data via CPU to GPU memory
> > > User calls exec
> > > GPU read / write memory
> > > User wait on sync indicating exec done
> > > User reads result
> > >
> > > All of this works without migration. Are we not planing supporting flow
> > > with SVM?
> > >
> > > Afaik this migration dance really only needs to be done if the CPU and
> > > GPU are using atomics on a shared memory region and the GPU device
> > > doesn't support a coherent memory protocol (e.g. PVC).
> >
> > All you said is true. On many of our HW, CPU can actually access device memory,
> cache coherently or not.
> >
> > The problem is, this is not true for all GPU vendors. For example, on some HW
> from some vendor, CPU can only access partially of device memory. The so called
> small bar concept.
> >
> > So when HMM is defined, such factors were considered, and
> MEMORY_DEVICE_PRIVATE is defined. With this definition, CPU can't access
> device memory.
> >
> > So you can think it is a limitation of HMM.
> >
> 
> Is it though? I see other type MEMORY_DEVICE_FS_DAX,
> MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA. From my
> limited
> understanding it looks to me like one of those modes would support my
> example.


No, the above are for other purposes. HMM only supports DEVICE_PRIVATE and DEVICE_COHERENT.

> 
> > Note this is only part 1 of our system allocator work. We do plan to support
> DEVICE_COHERENT for our newer device, see below. With this option, we don't
> have unnecessary migration back and forth.
> >
> > You can think this is just work out all the code path. 90% of the driver code for
> DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real use of system
> allocator will be DEVICE_COHERENT mode. While DEVICE_PRIVATE mode allow us
> to exercise the code on old HW.
> >
> > Make sense?
> >
> 
> I guess if we want the system allocator to always coherent, then yes you
> need this dynamic migration with faulting on either side.
> 
> I was thinking the system allocator would be behave like my example
> above with madvise dictating the coherence rules.
> 
> Maybe I missed this in system allocator design but my feeling is we
> shouldn't arbitrarily enforce coherence as that could lead to poor
> performance due to constant migration.

The system allocator itself doesn't enforce coherence. Coherence is built into the user programming model: the system allocator allows both the GPU and the CPU to access system-allocated pointers, but it doesn't necessarily guarantee that data accessed from the CPU and the GPU is coherent. It is the user program's responsibility to maintain data coherence.

Data migration in the driver is optional, depending on platform capability, user preference, and correctness and performance considerations. The driver's internal data migration of course shouldn't break data coherence.

Of course different vendors can have different data coherence schemes. For example, it is entirely the designer's choice to build a model with HW-automatic data coherence or with software-explicit data coherence. On our platform, we let the user program select the coherence mode by setting the pat_index for the GPU and the cpu_caching mode for the CPU, so the flexibility is given completely to the user program. Nothing in this contract is changed by the system allocator design.

Going back to the question of what memory type we should use to register our vram with core mm: HMM currently supports two types, PRIVATE and COHERENT. The coherent type requires HW and BIOS support which we don't have right now, so PRIVATE is the only available option. As said, we plan to support the coherent type, which lets us avoid unnecessary data migration, but that is stage 2.
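
So for stage 1 the setup stays as posted, and stage 2 would only flip the pagemap type, roughly like the sketch below; xe_vram_is_coherent() is a made-up predicate for "HW/BIOS supports coherent vram".

	mr->pagemap.type = xe_vram_is_coherent(xe) ?
		MEMORY_DEVICE_COHERENT : MEMORY_DEVICE_PRIVATE;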

> 
> >
> > >
> > > > For GPU device which supports coherent memory protocol b/t CPU and
> > > > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > ++++++++++++++++++++++++++++
> > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > index c531210695db..840467080e59 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > >  	xe_vram_freq.o \
> > > >  	xe_wait_user_fence.o \
> > > >  	xe_wa.o \
> > > > -	xe_wopcm.o
> > > > +	xe_wopcm.o \
> > > > +	xe_svm_devmem.o
> > >
> > > These should be in alphabetical order.
> >
> > Will fix
> > >
> > > >
> > > >  # graphics hardware monitoring (HWMON) support
> > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > >  	resource_size_t actual_physical_size;
> > > >  	/** @mapping: pointer to VRAM mappable space */
> > > >  	void __iomem *mapping;
> > > > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> > > > +    struct dev_pagemap pagemap;
> > > > +    /**
> > > > +     * @hpa_base: base host physical address
> > > > +     *
> > > > +     * This is generated when remap device memory as ZONE_DEVICE
> > > > +     */
> > > > +    resource_size_t hpa_base;
> > >
> > > Weird indentation. This goes for the entire series, look at checkpatch.
> >
> > Will fix
> > >
> > > > +
> > > >  };
> > > >
> > > >  /**
> > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> b/drivers/gpu/drm/xe/xe_mmio.c
> > > > index e3db3a178760..0d795394bc4c 100644
> > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > @@ -22,6 +22,7 @@
> > > >  #include "xe_module.h"
> > > >  #include "xe_sriov.h"
> > > >  #include "xe_tile.h"
> > > > +#include "xe_svm.h"
> > > >
> > > >  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> > > >  		}
> > > >
> > > >  		io_size -= min_t(u64, tile_size, io_size);
> > > > +		xe_svm_devm_add(tile, &tile->mem.vram);
> > >
> > > Do we want to do this probe for all devices with VRAM or only a subset?
> >
> > All
> 
> Can you explain why?

To me it is natural to add all device memory to hmm. In the hmm design, device memory is used as a special swap destination for system memory. I would ask why we would only want to add a subset of vram. By a subset, do you mean only the vram of one tile instead of all tiles?

Oak


> 
> > >
> > > >  	}
> > > >
> > > >  	xe->mem.vram.actual_physical_size = total_size;
> > > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
> > > >  static void mmio_fini(struct drm_device *drm, void *arg)
> > > >  {
> > > >  	struct xe_device *xe = arg;
> > > > +    struct xe_tile *tile;
> > > > +    u8 id;
> > > >
> > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > > >  	if (xe->mem.vram.mapping)
> > > >  		iounmap(xe->mem.vram.mapping);
> > > > +
> > > > +	for_each_tile(tile, xe, id)
> > > > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > >
> > > This should probably be above existing code. Typical on fini to do
> > > things in reverse order from init.
> >
> > Will fix
> > >
> > > > +
> > > >  }
> > > >
> > > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > > new file mode 100644
> > > > index 000000000000..09f9afb0e7d4
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -0,0 +1,14 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2023 Intel Corporation
> > >
> > > 2024?
> >
> > This patch was actually written 2023
> > >
> > > > + */
> > > > +
> > > > +#ifndef __XE_SVM_H
> > > > +#define __XE_SVM_H
> > > > +
> > > > +#include "xe_device_types.h"
> > >
> > > I don't think you need to include this. Rather just forward decl structs
> > > used here.
> >
> > Will fix
> > >
> > > e.g.
> > >
> > > struct xe_device;
> > > struct xe_mem_region;
> > > struct xe_tile;
> > >
> > > > +
> > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> > > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > > *mem);
> > >
> > > The arguments here are incongruent here. Typically we want these to
> > > match.
> >
> > Will fix
> > >
> > > > +
> > > > +#endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > >
> > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> >
> > Did you mean mem vs mr? if yes, will fix
> >
> > Again these two
> > > should
> > > match.
> > >
> > > > new file mode 100644
> > > > index 000000000000..63b7a1961cc6
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > @@ -0,0 +1,91 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2023 Intel Corporation
> > >
> > > 2024?
> > It is from 2023
> > >
> > > > + */
> > > > +
> > > > +#include <linux/mm_types.h>
> > > > +#include <linux/sched/mm.h>
> > > > +
> > > > +#include "xe_device_types.h"
> > > > +#include "xe_trace.h"
> > >
> > > xe_trace.h appears to be unused.
> >
> > Will fix
> > >
> > > > +#include "xe_svm.h"
> > > > +
> > > > +
> > > > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > +{
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void xe_devm_page_free(struct page *page)
> > > > +{
> > > > +}
> > > > +
> > > > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > > +	.page_free = xe_devm_page_free,
> > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > +};
> > > > +
> > >
> > > Assume these are placeholders that will be populated later?
> >
> >
> > corrrect
> > >
> > > > +/**
> > > > + * xe_svm_devm_add: Remap and provide memmap backing for device
> > > memory
> > > > + * @tile: tile that the memory region blongs to
> > > > + * @mr: memory region to remap
> > > > + *
> > > > + * This remap device memory to host physical address space and create
> > > > + * struct page to back device memory
> > > > + *
> > > > + * Return: 0 on success standard error code otherwise
> > > > + */
> > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > >
> > > Here I see the xe_mem_region is from tile->mem.vram, wondering rather
> > > than using the tile->mem.vram we should use xe->mem.vram when enabling
> > > svm? Isn't the idea behind svm the entire memory is 1 view?
> >
> > Still need to use tile. The reason is, memory of different tile can have different
> characteristics, such as latency. So we want to differentiate memory b/t tiles also
> in svm. I need to change below " mr->pagemap.owner = tile->xe->drm.dev ". the
> owner should also be tile. This is the way hmm differentiate memory of different
> tile.
> >
> > With svm it is 1 view, from virtual address space perspective and from physical
> struct page perspective. You can think of all the tile's vram is stacked together to
> form a unified view together with system memory. This doesn't prohibit us from
> differentiate memory from different tile. This differentiation allow us to optimize
> performance, i.e., we can wisely place memory in specific tile. If we don't
> differentiate, this is not possible.
> >
> 
> Ok makes sense.
> 
> Matt
> 
> > >
> > > I suppose if we do that we also only use 1 TTM VRAM manager / buddy
> > > allocator too. I thought I saw some patches flying around for that too.
> >
> > Ttm vram manager is not in the picture. We deliberately avoided it per previous
> discussion
> >
> > Yes same buddy allocator. It is in my previous POC: https://lore.kernel.org/dri-
> devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't put those patches
> in this series because I want to merge this small patches separately.
> > >
> > > > +{
> > > > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > > > +	struct resource *res;
> > > > +	void *addr;
> > > > +	int ret;
> > > > +
> > > > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > > > +					   mr->usable_size);
> > > > +	if (IS_ERR(res)) {
> > > > +		ret = PTR_ERR(res);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > +	mr->pagemap.range.start = res->start;
> > > > +	mr->pagemap.range.end = res->end;
> > > > +	mr->pagemap.nr_range = 1;
> > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > > +	if (IS_ERR(addr)) {
> > > > +		devm_release_mem_region(dev, res->start, resource_size(res));
> > > > +		ret = PTR_ERR(addr);
> > > > +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
> > > errno %d\n",
> > > > +				tile->id, ret);
> > > > +		return ret;
> > > > +	}
> > > > +	mr->hpa_base = res->start;
> > > > +
> > > > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
> > > remapped to %pr\n",
> > > > +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
> > > res);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_svm_devm_remove: Unmap device memory and free resources
> > > > + * @xe: xe device
> > > > + * @mr: memory region to remove
> > > > + */
> > > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > > *mr)
> > > > +{
> > > > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > > > +#if 0
> > > > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > > > +
> > > > +	if (mr->hpa_base) {
> > > > +		devm_memunmap_pages(dev, &mr->pagemap);
> > > > +		devm_release_mem_region(dev, mr->pagemap.range.start,
> > > > +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> > > > +	}
> > > > +#endif
> > >
> > > This would need to be fixed too.
> >
> >
> > Yes...
> >
> > Oak
> > >
> > > Matt
> > >
> > > > +}
> > > > +
> > > > --
> > > > 2.26.3
> > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
  2024-03-14 17:39   ` Matthew Brost
@ 2024-03-15 17:29     ` Zeng, Oak
  2024-03-16  1:33       ` Matthew Brost
  2024-03-18 12:09     ` Hellstrom, Thomas
  1 sibling, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-15 17:29 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Thursday, March 14, 2024 1:39 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
> 
> On Wed, Mar 13, 2024 at 11:35:51PM -0400, Oak Zeng wrote:
> > Since we now create struct page backing for each vram page,
> > each vram page now also has a pfn, just like system memory.
> > This allow us to calcuate device physical address from pfn.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> > index bbea40b57e84..bf349321f037 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -576,4 +576,12 @@ static inline struct xe_tile *mem_region_to_tile(struct
> xe_mem_region *mr)
> >  	return container_of(mr, struct xe_tile, mem.vram);
> >  }
> >
> > +static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
> > +{
> > +	u64 dpa;
> > +	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> 
> Can't this be negative?
> 
> e.g. if pfn == 0, offset == -mr->hpa_base.
> 
> Or is the assumption (pfn << PAGE_SHIFT) is always > mr->hpa_base?
> 
> If so can we an assert or something to ensure we using this function correctly.

Yes, we can assert it. hpa_base is the host physical base address of this memory region, while the pfn should point to a page inside this memory region.

I will add an assertion.
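
Something like the untested sketch below, moved to an xe_*.h header and using your suggested name; XE_WARN_ON is a stand-in for whichever assert macro we settle on:

static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
{
	u64 hpa = pfn << PAGE_SHIFT;
	u64 offset;

	/* the pfn must point into this region's remapped range */
	XE_WARN_ON(hpa < mr->hpa_base ||
		   hpa >= mr->hpa_base + mr->usable_size);

	offset = hpa - mr->hpa_base;
	return mr->dpa_base + offset;
}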


> 
> > +	dpa = mr->dpa_base + offset;
> > +	return dpa;
> > +}
> 
> Same as previous patch, should be *.h not a *_types.h file.

Yes will fix.
> 
> Also as this is xe_mem_region not explictly vram. Maybe:
> 
> s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa/

struct xe_mem_region can only represent vram, right? I mean, it can't represent system memory. Copied the definition below:

/**
 * struct xe_mem_region - memory region structure
 * This is used to describe a memory region in xe
 * device, such as HBM memory or CXL extension memory.
 */

Oak

> 
> Matt
> 
> > +
> >  #endif
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-15  3:16       ` Zeng, Oak
@ 2024-03-15 18:05         ` Welty, Brian
  2024-03-15 23:11           ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Welty, Brian @ 2024-03-15 18:05 UTC (permalink / raw)
  To: Zeng, Oak, intel-xe
  Cc: Hellstrom, Thomas, Brost, Matthew, airlied, Ghimiray, Himal Prasad


On 3/14/2024 8:16 PM, Zeng, Oak wrote:
[...]
>>>> diff --git a/drivers/gpu/drm/xe/xe_mmio.c
>> b/drivers/gpu/drm/xe/xe_mmio.c
>>>> index e3db3a178760..0d795394bc4c 100644
>>>> --- a/drivers/gpu/drm/xe/xe_mmio.c
>>>> +++ b/drivers/gpu/drm/xe/xe_mmio.c
>>>> @@ -22,6 +22,7 @@
>>>>    #include "xe_module.h"
>>>>    #include "xe_sriov.h"
>>>>    #include "xe_tile.h"
>>>> +#include "xe_svm.h"
>>>>
>>>>    #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
>>>>    #define TILE_COUNT		REG_GENMASK(15, 8)
>>>> @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
>>>>    		}
>>>>
>>>>    		io_size -= min_t(u64, tile_size, io_size);
>>>> +		xe_svm_devm_add(tile, &tile->mem.vram);
>>>
>>> I think slightly more appropriate call site for this might be
>>> xe_tile_init_noalloc(), as that function states it is preparing tile
>>> for VRAM allocations.
>>> Also, I mention because we might like the flexiblity in future to call
>>> this once for xe_device.mem.vram, instead of calling for each tile,
>>> and easier to handle that in xe_tile.c instead of xe_mmio.c.
>>
>> Good point. Will follow.
> 
> Sorry, with my comment below, do you still want to call it in xe_tile_init_noalloc?
> 
> For UMA, we only need to call it once. If you do it in init-noalloc, you would call it multiple times. Right?
> 
> Oak
> 

Oh, I hoped you were still going to move it.
I prefer it in xe_tile.c instead of here.  But feel free to ask the
maintainers about it.
I think the UMA case would anyway add a conditional to disable these calls
wherever they are placed, in favor of calling against xe_device.mem.vram.
But this is pretty minor and can be revisited later too.

I commented down below, to keep it to one email.

>>>
>>> Related comment below.
>>>
>>>
>>>>    	}
>>>>
>>>>    	xe->mem.vram.actual_physical_size = total_size;
>>>> @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
>>>>    static void mmio_fini(struct drm_device *drm, void *arg)
>>>>    {
>>>>    	struct xe_device *xe = arg;
>>>> +    struct xe_tile *tile;
>>>> +    u8 id;
>>>>
>>>>    	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
>>>>    	if (xe->mem.vram.mapping)
>>>>    		iounmap(xe->mem.vram.mapping);
>>>> +
>>>> +	for_each_tile(tile, xe, id)
>>>> +		xe_svm_devm_remove(xe, &tile->mem.vram);
>>>> +
>>>>    }
>>>>
[...]
>>> b/drivers/gpu/drm/xe/xe_svm_devmem.c
>>>> new file mode 100644
>>>> index 000000000000..63b7a1961cc6
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
>>>> @@ -0,0 +1,91 @@
>>>> +// SPDX-License-Identifier: MIT
>>>> +/*
>>>> + * Copyright © 2023 Intel Corporation
>>>> + */
>>>> +
>>>> +#include <linux/mm_types.h>
>>>> +#include <linux/sched/mm.h>
>>>> +
>>>> +#include "xe_device_types.h"
>>>> +#include "xe_trace.h"
>>>> +#include "xe_svm.h"
>>>> +
>>>> +
>>>> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static void xe_devm_page_free(struct page *page)
>>>> +{
>>>> +}
>>>> +
>>>> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
>>>> +	.page_free = xe_devm_page_free,
>>>> +	.migrate_to_ram = xe_devm_migrate_to_ram,
>>>> +};
>>>> +
>>>> +/**
>>>> + * xe_svm_devm_add: Remap and provide memmap backing for device
>>> memory
>>>
>>> Do we really need 'svm' in function name?
>>
>> Good point. I will remove svm.
>>>
>>>> + * @tile: tile that the memory region blongs to
>>>
>>> We might like flexibility in future to call this once for
>>> xe_device.mem.vram, instead of calling for each tile.
>>> So can we remove the tile argument, and just pass the xe_device pointer
>>> and tile->id ?
>>
>> This is interesting.
>>
>> First of all, I programmed wrong below: mr->pagemap.owner = tile->xe-
>>> drm.dev;
>>
>> This should be: mr->pagemap.owner = tile for NUMA vram system
Oh, okay.  Glad we caught that.

Still, wouldn't using the 'mr' pointer as the owner be sufficient? Then it 
is common for both cases.
Avoiding the strong association with 'tile' is nice if possible.

But I haven't studied the rest of the code and how pagemap.owner is used, 
so maybe you actually need the tile pointer for something.
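
i.e. just the below (sketch), with the same mr then used as the
dev_private_owner on the hmm_range_fault side:

	/* the region pointer is already unique, whether vram ends up
	 * per-tile or device-wide */
	mr->pagemap.owner = mr;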

Well, this also seems like something minor and could be revisited later.

-Brian


>>
>> For UMA vram, this should be: mr->pagemap.owner = tile_to_xe(tile);
>>
>> This owner is important. It is used later to decide migration by hmm. We need to
>> set the owner for hmm to identify different vram region.
>>
>> Based on above, I think the tile parameter is better. For UMA, caller need to
>> make sure call it once, any tile pointer should work. This does sound a little weird.
>> But I don’t have a better idea.
>>
>> Oak
>>>
>>>
>>>> + * @mr: memory region to remap
>>>> + *
>>>> + * This remap device memory to host physical address space and create
>>>> + * struct page to back device memory
>>>> + *
>>>> + * Return: 0 on success standard error code otherwise
>>>> + */
>>>> +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
>>>> +{
>>>> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
>>>> +	struct resource *res;
>>>> +	void *addr;
>>>> +	int ret;
>>>> +
>>>> +	res = devm_request_free_mem_region(dev, &iomem_resource,
>>>> +					   mr->usable_size);
>>>> +	if (IS_ERR(res)) {
>>>> +		ret = PTR_ERR(res);
>>>> +		return ret;
>>>> +	}
>>>> +
>>>> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
>>>> +	mr->pagemap.range.start = res->start;
>>>> +	mr->pagemap.range.end = res->end;
>>>> +	mr->pagemap.nr_range = 1;
>>>> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
>>>> +	mr->pagemap.owner = tile->xe->drm.dev;
>>>> +	addr = devm_memremap_pages(dev, &mr->pagemap);
>>>> +	if (IS_ERR(addr)) {
>>>> +		devm_release_mem_region(dev, res->start, resource_size(res));
>>>> +		ret = PTR_ERR(addr);
>>>> +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
>>> errno %d\n",
>>>> +				tile->id, ret);
>>>> +		return ret;
>>>> +	}
>>>> +	mr->hpa_base = res->start;
>>>> +
>>>> +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
>>> remapped to %pr\n",
>>>> +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
>>> res);
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>> + * xe_svm_devm_remove: Unmap device memory and free resources
>>>> + * @xe: xe device
>>>> + * @mr: memory region to remove
>>>> + */
>>>> +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
>>> *mr)
>>>> +{
>>>> +	/*FIXME: below cause a kernel hange during moduel remove*/
>>>> +#if 0
>>>> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
>>>> +
>>>> +	if (mr->hpa_base) {
>>>> +		devm_memunmap_pages(dev, &mr->pagemap);
>>>> +		devm_release_mem_region(dev, mr->pagemap.range.start,
>>>> +			mr->pagemap.range.end - mr->pagemap.range.start +1);
>>>> +	}
>>>> +#endif
>>>> +}
>>>> +

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-15 16:00         ` Zeng, Oak
@ 2024-03-15 20:39           ` Matthew Brost
  2024-03-15 21:31             ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Matthew Brost @ 2024-03-15 20:39 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad

On Fri, Mar 15, 2024 at 10:00:06AM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Thursday, March 14, 2024 4:49 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> > GPU vram
> > 
> > On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> > > Hi Matt,
> > >
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > Sent: Thursday, March 14, 2024 1:18 PM
> > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > <himal.prasad.ghimiray@intel.com>
> > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing
> > for
> > > > GPU vram
> > > >
> > > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > > > Memory remap GPU vram using devm_memremap_pages, so each GPU
> > vram
> > > > > page is backed by a struct page.
> > > > >
> > > > > Those struct pages are created to allow hmm migrate buffer b/t
> > > > > GPU vram and CPU system memory using existing Linux migration
> > > > > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> > > > >
> > > > > This is prepare work to enable svm (shared virtual memory) through
> > > > > Linux kernel hmm framework. The memory remap's page map type is set
> > > > > to MEMORY_DEVICE_PRIVATE for now. This means even though each GPU
> > > > > vram page get a struct page and can be mapped in CPU page table,
> > > > > but such pages are treated as GPU's private resource, so CPU can't
> > > > > access them. If CPU access such page, a page fault is triggered
> > > > > and page will be migrate to system memory.
> > > > >
> > > >
> > > > Is this really true? We can map VRAM BOs to the CPU without having
> > > > migarte back and forth. Admittedly I don't know the inner workings of
> > > > how this works but in IGTs we do this all the time.
> > > >
> > > >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> > > >   55                                 vram_if_possible(fd, 0),
> > > >   56                                 DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > > >   57         batch_map = xe_bo_map(fd, batch_bo, batch_size);
> > > >
> > > > The BO is created in VRAM and then mapped to the CPU.
> > > >
> > > > I don't think there is an expectation of coherence rather caching mode
> > > > and exclusive access of the memory based on synchronization.
> > > >
> > > > e.g.
> > > > User write BB/data via CPU to GPU memory
> > > > User calls exec
> > > > GPU read / write memory
> > > > User wait on sync indicating exec done
> > > > User reads result
> > > >
> > > > All of this works without migration. Are we not planing supporting flow
> > > > with SVM?
> > > >
> > > > Afaik this migration dance really only needs to be done if the CPU and
> > > > GPU are using atomics on a shared memory region and the GPU device
> > > > doesn't support a coherent memory protocol (e.g. PVC).
> > >
> > > All you said is true. On many of our HW, CPU can actually access device memory,
> > cache coherently or not.
> > >
> > > The problem is, this is not true for all GPU vendors. For example, on some HW
> > from some vendor, CPU can only access partially of device memory. The so called
> > small bar concept.
> > >
> > > So when HMM is defined, such factors were considered, and
> > MEMORY_DEVICE_PRIVATE is defined. With this definition, CPU can't access
> > device memory.
> > >
> > > So you can think it is a limitation of HMM.
> > >
> > 
> > Is it though? I see other type MEMORY_DEVICE_FS_DAX,
> > MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA. From my
> > limited
> > understanding it looks to me like one of those modes would support my
> > example.
> 
> 
> No, above are for other purposes. HMM only support DEVICE_PRIVATE and DEVICE_COHERENT.
> 
> > 
> > > Note this is only part 1 of our system allocator work. We do plan to support
> > DEVICE_COHERENT for our newer device, see below. With this option, we don't
> > have unnecessary migration back and forth.
> > >
> > > You can think this is just work out all the code path. 90% of the driver code for
> > DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real use of system
> > allocator will be DEVICE_COHERENT mode. While DEVICE_PRIVATE mode allow us
> > to exercise the code on old HW.
> > >
> > > Make sense?
> > >
> > 
> > I guess if we want the system allocator to always coherent, then yes you
> > need this dynamic migration with faulting on either side.
> > 
> > I was thinking the system allocator would be behave like my example
> > above with madvise dictating the coherence rules.
> > 
> > Maybe I missed this in system allocator design but my feeling is we
> > shouldn't arbitrarily enforce coherence as that could lead to poor
> > performance due to constant migration.
> 
> System allocator itself doesn't enforce coherence. Coherence is built in user programming model. So system allocator allow both GPU and CPU access system allocated pointers, but it doesn't necessarily guarantee the data accessed from CPU/GPU is coherent. It is user program's responsibility to maintain data coherence.
> 
> Data migration in driver is optional, depending on platform capability, user preference, correctness and performance consideration. Driver internal data migration of course shouldn't break data coherence. 
> 
> Of course different vendor can have different data coherence scheme. For example, it is completely designer's flexibility to build model that is HW automatic data coherence or software explicit data coherence. On our platform, we allow user program to select different coherence mode by setting pat_index for gpu and cpu_caching mode for CPU. So we have completely give the flexibility to user program. Nothing of this contract is changed in system allocator design. 
> 
> Going back to the question of what memory type we should use to register our vram to core mm. HMM currently support two types: PRIVATE and COHERENT. The coherent type requires some HW and BIOS support which we don't have right now. So the only available is PRIVATE. We have not other option right now. As said, we plan to support coherent type where we can avoid unnecessary data migration. But that is stage 2.
>

Thanks for the explanation. After reading your replies, the HMM doc,
and looking at the code, this all makes sense.

> > 
> > >
> > > >
> > > > > For GPU device which supports coherent memory protocol b/t CPU and
> > > > > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > > >
> > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > > ++++++++++++++++++++++++++++
> > > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > > index c531210695db..840467080e59 100644
> > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > > >  	xe_vram_freq.o \
> > > > >  	xe_wait_user_fence.o \
> > > > >  	xe_wa.o \
> > > > > -	xe_wopcm.o
> > > > > +	xe_wopcm.o \
> > > > > +	xe_svm_devmem.o
> > > >
> > > > These should be in alphabetical order.
> > >
> > > Will fix
> > > >
> > > > >
> > > > >  # graphics hardware monitoring (HWMON) support
> > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > > >  	resource_size_t actual_physical_size;
> > > > >  	/** @mapping: pointer to VRAM mappable space */
> > > > >  	void __iomem *mapping;
> > > > > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE */
> > > > > +    struct dev_pagemap pagemap;
> > > > > +    /**
> > > > > +     * @hpa_base: base host physical address
> > > > > +     *
> > > > > +     * This is generated when remap device memory as ZONE_DEVICE
> > > > > +     */
> > > > > +    resource_size_t hpa_base;
> > > >
> > > > Weird indentation. This goes for the entire series, look at checkpatch.
> > >
> > > Will fix
> > > >
> > > > > +
> > > > >  };
> > > > >
> > > > >  /**
> > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> > b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > index e3db3a178760..0d795394bc4c 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > @@ -22,6 +22,7 @@
> > > > >  #include "xe_module.h"
> > > > >  #include "xe_sriov.h"
> > > > >  #include "xe_tile.h"
> > > > > +#include "xe_svm.h"
> > > > >
> > > > >  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> > > > >  		}
> > > > >
> > > > >  		io_size -= min_t(u64, tile_size, io_size);
> > > > > +		xe_svm_devm_add(tile, &tile->mem.vram);
> > > >
> > > > Do we want to do this probe for all devices with VRAM or only a subset?
> > >
> > > All
> > 
> > Can you explain why?
> 
> It is natural for me to add all device memory to hmm. In hmm design, device memory is used as a special swap out for system memory. I would ask why we only want to add a subset of vram? By a subset, do you mean only vram of one tile instead of all tiles?
> 

I think we're talking about different things; my bad on the wording in
the original question.

Let me ask again - should we be calling xe_svm_devm_add on all
*platforms* that have VRAM, i.e. should we do this on PVC but not DG2?

Matt

> Oak
> 
> 
> > 
> > > >
> > > > >  	}
> > > > >
> > > > >  	xe->mem.vram.actual_physical_size = total_size;
> > > > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
> > > > >  static void mmio_fini(struct drm_device *drm, void *arg)
> > > > >  {
> > > > >  	struct xe_device *xe = arg;
> > > > > +    struct xe_tile *tile;
> > > > > +    u8 id;
> > > > >
> > > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > > > >  	if (xe->mem.vram.mapping)
> > > > >  		iounmap(xe->mem.vram.mapping);
> > > > > +
> > > > > +	for_each_tile(tile, xe, id)
> > > > > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > > >
> > > > This should probably be above existing code. Typical on fini to do
> > > > things in reverse order from init.
> > >
> > > Will fix
> > > >
> > > > > +
> > > > >  }
> > > > >
> > > > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > > > new file mode 100644
> > > > > index 000000000000..09f9afb0e7d4
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > @@ -0,0 +1,14 @@
> > > > > +// SPDX-License-Identifier: MIT
> > > > > +/*
> > > > > + * Copyright © 2023 Intel Corporation
> > > >
> > > > 2024?
> > >
> > > This patch was actually written 2023
> > > >
> > > > > + */
> > > > > +
> > > > > +#ifndef __XE_SVM_H
> > > > > +#define __XE_SVM_H
> > > > > +
> > > > > +#include "xe_device_types.h"
> > > >
> > > > I don't think you need to include this. Rather just forward decl structs
> > > > used here.
> > >
> > > Will fix
> > > >
> > > > e.g.
> > > >
> > > > struct xe_device;
> > > > struct xe_mem_region;
> > > > struct xe_tile;
> > > >
> > > > > +
> > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mem);
> > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > > > *mem);
> > > >
> > > > The arguments here are incongruent here. Typically we want these to
> > > > match.
> > >
> > > Will fix
> > > >
> > > > > +
> > > > > +#endif
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > >
> > > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> > >
> > > Did you mean mem vs mr? if yes, will fix
> > >
> > > Again these two
> > > > should
> > > > match.
> > > >
> > > > > new file mode 100644
> > > > > index 000000000000..63b7a1961cc6
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > @@ -0,0 +1,91 @@
> > > > > +// SPDX-License-Identifier: MIT
> > > > > +/*
> > > > > + * Copyright © 2023 Intel Corporation
> > > >
> > > > 2024?
> > > It is from 2023
> > > >
> > > > > + */
> > > > > +
> > > > > +#include <linux/mm_types.h>
> > > > > +#include <linux/sched/mm.h>
> > > > > +
> > > > > +#include "xe_device_types.h"
> > > > > +#include "xe_trace.h"
> > > >
> > > > xe_trace.h appears to be unused.
> > >
> > > Will fix
> > > >
> > > > > +#include "xe_svm.h"
> > > > > +
> > > > > +
> > > > > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static void xe_devm_page_free(struct page *page)
> > > > > +{
> > > > > +}
> > > > > +
> > > > > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > > > +	.page_free = xe_devm_page_free,
> > > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > > +};
> > > > > +
> > > >
> > > > Assume these are placeholders that will be populated later?
> > >
> > >
> > > corrrect
> > > >
> > > > > +/**
> > > > > + * xe_svm_devm_add: Remap and provide memmap backing for device
> > > > memory
> > > > > + * @tile: tile that the memory region blongs to
> > > > > + * @mr: memory region to remap
> > > > > + *
> > > > > + * This remap device memory to host physical address space and create
> > > > > + * struct page to back device memory
> > > > > + *
> > > > > + * Return: 0 on success standard error code otherwise
> > > > > + */
> > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > > >
> > > > Here I see the xe_mem_region is from tile->mem.vram, wondering rather
> > > > than using the tile->mem.vram we should use xe->mem.vram when enabling
> > > > svm? Isn't the idea behind svm the entire memory is 1 view?
> > >
> > > Still need to use tile. The reason is, memory of different tile can have different
> > characteristics, such as latency. So we want to differentiate memory b/t tiles also
> > in svm. I need to change below " mr->pagemap.owner = tile->xe->drm.dev ". the
> > owner should also be tile. This is the way hmm differentiate memory of different
> > tile.
> > >
> > > With svm it is 1 view, from virtual address space perspective and from physical
> > struct page perspective. You can think of all the tile's vram is stacked together to
> > form a unified view together with system memory. This doesn't prohibit us from
> > differentiate memory from different tile. This differentiation allow us to optimize
> > performance, i.e., we can wisely place memory in specific tile. If we don't
> > differentiate, this is not possible.
> > >
> > 
> > Ok makes sense.
> > 
> > Matt
> > 
> > > >
> > > > I suppose if we do that we also only use 1 TTM VRAM manager / buddy
> > > > allocator too. I thought I saw some patches flying around for that too.
> > >
> > > Ttm vram manager is not in the picture. We deliberately avoided it per previous
> > discussion
> > >
> > > Yes same buddy allocator. It is in my previous POC: https://lore.kernel.org/dri-
> > devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't put those patches
> > in this series because I want to merge this small patches separately.
> > > >
> > > > > +{
> > > > > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > > > > +	struct resource *res;
> > > > > +	void *addr;
> > > > > +	int ret;
> > > > > +
> > > > > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > > > > +					   mr->usable_size);
> > > > > +	if (IS_ERR(res)) {
> > > > > +		ret = PTR_ERR(res);
> > > > > +		return ret;
> > > > > +	}
> > > > > +
> > > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > > +	mr->pagemap.range.start = res->start;
> > > > > +	mr->pagemap.range.end = res->end;
> > > > > +	mr->pagemap.nr_range = 1;
> > > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > > > +	if (IS_ERR(addr)) {
> > > > > +		devm_release_mem_region(dev, res->start, resource_size(res));
> > > > > +		ret = PTR_ERR(addr);
> > > > > +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
> > > > errno %d\n",
> > > > > +				tile->id, ret);
> > > > > +		return ret;
> > > > > +	}
> > > > > +	mr->hpa_base = res->start;
> > > > > +
> > > > > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
> > > > remapped to %pr\n",
> > > > > +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
> > > > res);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * xe_svm_devm_remove: Unmap device memory and free resources
> > > > > + * @xe: xe device
> > > > > + * @mr: memory region to remove
> > > > > + */
> > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct xe_mem_region
> > > > *mr)
> > > > > +{
> > > > > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > > > > +#if 0
> > > > > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > > > > +
> > > > > +	if (mr->hpa_base) {
> > > > > +		devm_memunmap_pages(dev, &mr->pagemap);
> > > > > +		devm_release_mem_region(dev, mr->pagemap.range.start,
> > > > > +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> > > > > +	}
> > > > > +#endif
> > > >
> > > > This would need to be fixed too.
> > >
> > >
> > > Yes...
> > >
> > > Oak
> > > >
> > > > Matt
> > > >
> > > > > +}
> > > > > +
> > > > > --
> > > > > 2.26.3
> > > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-15 20:39           ` Matthew Brost
@ 2024-03-15 21:31             ` Zeng, Oak
  2024-03-16  1:25               ` Matthew Brost
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-15 21:31 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Friday, March 15, 2024 4:40 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> On Fri, Mar 15, 2024 at 10:00:06AM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Thursday, March 14, 2024 4:49 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing
> for
> > > GPU vram
> > >
> > > On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> > > > Hi Matt,
> > > >
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: Thursday, March 14, 2024 1:18 PM
> > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > <himal.prasad.ghimiray@intel.com>
> > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> backing
> > > for
> > > > > GPU vram
> > > > >
> > > > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > > > > Memory remap GPU vram using devm_memremap_pages, so each
> GPU
> > > vram
> > > > > > page is backed by a struct page.
> > > > > >
> > > > > > Those struct pages are created to allow hmm migrate buffer b/t
> > > > > > GPU vram and CPU system memory using existing Linux migration
> > > > > > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> > > > > >
> > > > > > This is prepare work to enable svm (shared virtual memory) through
> > > > > > Linux kernel hmm framework. The memory remap's page map type is
> set
> > > > > > to MEMORY_DEVICE_PRIVATE for now. This means even though each
> GPU
> > > > > > vram page get a struct page and can be mapped in CPU page table,
> > > > > > but such pages are treated as GPU's private resource, so CPU can't
> > > > > > access them. If CPU access such page, a page fault is triggered
> > > > > > and page will be migrate to system memory.
> > > > > >
> > > > >
> > > > > Is this really true? We can map VRAM BOs to the CPU without having
> > > > > migarte back and forth. Admittedly I don't know the inner workings of
> > > > > how this works but in IGTs we do this all the time.
> > > > >
> > > > >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> > > > >   55                                 vram_if_possible(fd, 0),
> > > > >   56
> DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > > > >   57         batch_map = xe_bo_map(fd, batch_bo, batch_size);
> > > > >
> > > > > The BO is created in VRAM and then mapped to the CPU.
> > > > >
> > > > > I don't think there is an expectation of coherence rather caching mode
> > > > > and exclusive access of the memory based on synchronization.
> > > > >
> > > > > e.g.
> > > > > User write BB/data via CPU to GPU memory
> > > > > User calls exec
> > > > > GPU read / write memory
> > > > > User wait on sync indicating exec done
> > > > > User reads result
> > > > >
> > > > > All of this works without migration. Are we not planing supporting flow
> > > > > with SVM?
> > > > >
> > > > > Afaik this migration dance really only needs to be done if the CPU and
> > > > > GPU are using atomics on a shared memory region and the GPU device
> > > > > doesn't support a coherent memory protocol (e.g. PVC).
> > > >
> > > > All you said is true. On many of our HW, CPU can actually access device
> memory,
> > > cache coherently or not.
> > > >
> > > > The problem is, this is not true for all GPU vendors. For example, on some
> HW
> > > from some vendor, CPU can only access partially of device memory. The so
> called
> > > small bar concept.
> > > >
> > > > So when HMM is defined, such factors were considered, and
> > > MEMORY_DEVICE_PRIVATE is defined. With this definition, CPU can't access
> > > device memory.
> > > >
> > > > So you can think it is a limitation of HMM.
> > > >
> > >
> > > Is it though? I see other type MEMORY_DEVICE_FS_DAX,
> > > MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA. From my
> > > limited
> > > understanding it looks to me like one of those modes would support my
> > > example.
> >
> >
> > No, above are for other purposes. HMM only support DEVICE_PRIVATE and
> DEVICE_COHERENT.
> >
> > >
> > > > Note this is only part 1 of our system allocator work. We do plan to support
> > > DEVICE_COHERENT for our newer device, see below. With this option, we
> don't
> > > have unnecessary migration back and forth.
> > > >
> > > > You can think this is just work out all the code path. 90% of the driver code
> for
> > > DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real use of system
> > > allocator will be DEVICE_COHERENT mode. While DEVICE_PRIVATE mode
> allow us
> > > to exercise the code on old HW.
> > > >
> > > > Make sense?
> > > >
> > >
> > > I guess if we want the system allocator to always coherent, then yes you
> > > need this dynamic migration with faulting on either side.
> > >
> > > I was thinking the system allocator would be behave like my example
> > > above with madvise dictating the coherence rules.
> > >
> > > Maybe I missed this in system allocator design but my feeling is we
> > > shouldn't arbitrarily enforce coherence as that could lead to poor
> > > performance due to constant migration.
> >
> > System allocator itself doesn't enforce coherence. Coherence is built in user
> programming model. So system allocator allow both GPU and CPU access system
> allocated pointers, but it doesn't necessarily guarantee the data accessed from
> CPU/GPU is coherent. It is user program's responsibility to maintain data
> coherence.
> >
> > Data migration in driver is optional, depending on platform capability, user
> preference, correctness and performance consideration. Driver internal data
> migration of course shouldn't break data coherence.
> >
> > Of course different vendor can have different data coherence scheme. For
> example, it is completely designer's flexibility to build model that is HW automatic
> data coherence or software explicit data coherence. On our platform, we allow
> user program to select different coherence mode by setting pat_index for gpu
> and cpu_caching mode for CPU. So we have completely give the flexibility to user
> program. Nothing of this contract is changed in system allocator design.
> >
> > Going back to the question of what memory type we should use to register our
> vram to core mm. HMM currently support two types: PRIVATE and COHERENT.
> The coherent type requires some HW and BIOS support which we don't have
> right now. So the only available is PRIVATE. We have not other option right now.
> As said, we plan to support coherent type where we can avoid unnecessary data
> migration. But that is stage 2.
> >
> 
> Thanks for the explaination. After reading your replies, the HMM doc,
> and looking at code this all makes sense.
> 
> > >
> > > >
> > > > >
> > > > > > For GPU device which supports coherent memory protocol b/t CPU and
> > > > > > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > > > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > > > >
> > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > > > ++++++++++++++++++++++++++++
> > > > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> > > > > > index c531210695db..840467080e59 100644
> > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > > > >  	xe_vram_freq.o \
> > > > > >  	xe_wait_user_fence.o \
> > > > > >  	xe_wa.o \
> > > > > > -	xe_wopcm.o
> > > > > > +	xe_wopcm.o \
> > > > > > +	xe_svm_devmem.o
> > > > >
> > > > > These should be in alphabetical order.
> > > >
> > > > Will fix
> > > > >
> > > > > >
> > > > > >  # graphics hardware monitoring (HWMON) support
> > > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > > > >  	resource_size_t actual_physical_size;
> > > > > >  	/** @mapping: pointer to VRAM mappable space */
> > > > > >  	void __iomem *mapping;
> > > > > > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE
> */
> > > > > > +    struct dev_pagemap pagemap;
> > > > > > +    /**
> > > > > > +     * @hpa_base: base host physical address
> > > > > > +     *
> > > > > > +     * This is generated when remap device memory as ZONE_DEVICE
> > > > > > +     */
> > > > > > +    resource_size_t hpa_base;
> > > > >
> > > > > Weird indentation. This goes for the entire series, look at checkpatch.
> > > >
> > > > Will fix
> > > > >
> > > > > > +
> > > > > >  };
> > > > > >
> > > > > >  /**
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> > > b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > index e3db3a178760..0d795394bc4c 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > @@ -22,6 +22,7 @@
> > > > > >  #include "xe_module.h"
> > > > > >  #include "xe_sriov.h"
> > > > > >  #include "xe_tile.h"
> > > > > > +#include "xe_svm.h"
> > > > > >
> > > > > >  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> > > > > >  		}
> > > > > >
> > > > > >  		io_size -= min_t(u64, tile_size, io_size);
> > > > > > +		xe_svm_devm_add(tile, &tile->mem.vram);
> > > > >
> > > > > Do we want to do this probe for all devices with VRAM or only a subset?
> > > >
> > > > All
> > >
> > > Can you explain why?
> >
> > It is natural for me to add all device memory to hmm. In hmm design, device
> memory is used as a special swap out for system memory. I would ask why we
> only want to add a subset of vram? By a subset, do you mean only vram of one
> tile instead of all tiles?
> >
> 
> I think we talking about different things, my bad on wording in the
> original question.
> 
> Let me ask again - should be calling xe_svm_devm_add on all *platforms*
> that have VRAM. i.e. Should we do this on PVC but not DG2?


Oh, I see. Good question. On i915, this feature was only tested on PVC. We don't have a plan to enable it on platforms older than PVC.

Let me add a check here to only enable it on platforms newer than PVC.

Oak 

> 
> Matt
> 
> > Oak
> >
> >
> > >
> > > > >
> > > > > >  	}
> > > > > >
> > > > > >  	xe->mem.vram.actual_physical_size = total_size;
> > > > > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device
> *xe)
> > > > > >  static void mmio_fini(struct drm_device *drm, void *arg)
> > > > > >  {
> > > > > >  	struct xe_device *xe = arg;
> > > > > > +    struct xe_tile *tile;
> > > > > > +    u8 id;
> > > > > >
> > > > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > > > > >  	if (xe->mem.vram.mapping)
> > > > > >  		iounmap(xe->mem.vram.mapping);
> > > > > > +
> > > > > > +	for_each_tile(tile, xe, id)
> > > > > > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > > > >
> > > > > This should probably be above existing code. Typical on fini to do
> > > > > things in reverse order from init.
> > > >
> > > > Will fix
> > > > >
> > > > > > +
> > > > > >  }
> > > > > >
> > > > > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..09f9afb0e7d4
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > @@ -0,0 +1,14 @@
> > > > > > +// SPDX-License-Identifier: MIT
> > > > > > +/*
> > > > > > + * Copyright © 2023 Intel Corporation
> > > > >
> > > > > 2024?
> > > >
> > > > This patch was actually written 2023
> > > > >
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef __XE_SVM_H
> > > > > > +#define __XE_SVM_H
> > > > > > +
> > > > > > +#include "xe_device_types.h"
> > > > >
> > > > > I don't think you need to include this. Rather just forward decl structs
> > > > > used here.
> > > >
> > > > Will fix
> > > > >
> > > > > e.g.
> > > > >
> > > > > struct xe_device;
> > > > > struct xe_mem_region;
> > > > > struct xe_tile;
> > > > >
> > > > > > +
> > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region
> *mem);
> > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> xe_mem_region
> > > > > *mem);
> > > > >
> > > > > The arguments here are incongruent here. Typically we want these to
> > > > > match.
> > > >
> > > > Will fix
> > > > >
> > > > > > +
> > > > > > +#endif
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > >
> > > > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> > > >
> > > > Did you mean mem vs mr? if yes, will fix
> > > >
> > > > Again these two
> > > > > should
> > > > > match.
> > > > >
> > > > > > new file mode 100644
> > > > > > index 000000000000..63b7a1961cc6
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > @@ -0,0 +1,91 @@
> > > > > > +// SPDX-License-Identifier: MIT
> > > > > > +/*
> > > > > > + * Copyright © 2023 Intel Corporation
> > > > >
> > > > > 2024?
> > > > It is from 2023
> > > > >
> > > > > > + */
> > > > > > +
> > > > > > +#include <linux/mm_types.h>
> > > > > > +#include <linux/sched/mm.h>
> > > > > > +
> > > > > > +#include "xe_device_types.h"
> > > > > > +#include "xe_trace.h"
> > > > >
> > > > > xe_trace.h appears to be unused.
> > > >
> > > > Will fix
> > > > >
> > > > > > +#include "xe_svm.h"
> > > > > > +
> > > > > > +
> > > > > > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > > > +{
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static void xe_devm_page_free(struct page *page)
> > > > > > +{
> > > > > > +}
> > > > > > +
> > > > > > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > > > > +	.page_free = xe_devm_page_free,
> > > > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > > > +};
> > > > > > +
> > > > >
> > > > > Assume these are placeholders that will be populated later?
> > > >
> > > >
> > > > corrrect
> > > > >
> > > > > > +/**
> > > > > > + * xe_svm_devm_add: Remap and provide memmap backing for
> device
> > > > > memory
> > > > > > + * @tile: tile that the memory region blongs to
> > > > > > + * @mr: memory region to remap
> > > > > > + *
> > > > > > + * This remap device memory to host physical address space and create
> > > > > > + * struct page to back device memory
> > > > > > + *
> > > > > > + * Return: 0 on success standard error code otherwise
> > > > > > + */
> > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > > > >
> > > > > Here I see the xe_mem_region is from tile->mem.vram, wondering rather
> > > > > than using the tile->mem.vram we should use xe->mem.vram when
> enabling
> > > > > svm? Isn't the idea behind svm the entire memory is 1 view?
> > > >
> > > > Still need to use tile. The reason is, memory of different tile can have
> different
> > > characteristics, such as latency. So we want to differentiate memory b/t tiles
> also
> > > in svm. I need to change below " mr->pagemap.owner = tile->xe->drm.dev ".
> the
> > > owner should also be tile. This is the way hmm differentiate memory of
> different
> > > tile.
> > > >
> > > > With svm it is 1 view, from virtual address space perspective and from
> physical
> > > struct page perspective. You can think of all the tile's vram is stacked together
> to
> > > form a unified view together with system memory. This doesn't prohibit us
> from
> > > differentiate memory from different tile. This differentiation allow us to
> optimize
> > > performance, i.e., we can wisely place memory in specific tile. If we don't
> > > differentiate, this is not possible.
> > > >
> > >
> > > Ok makes sense.
> > >
> > > Matt
> > >
> > > > >
> > > > > I suppose if we do that we also only use 1 TTM VRAM manager / buddy
> > > > > allocator too. I thought I saw some patches flying around for that too.
> > > >
> > > > Ttm vram manager is not in the picture. We deliberately avoided it per
> previous
> > > discussion
> > > >
> > > > Yes same buddy allocator. It is in my previous POC:
> https://lore.kernel.org/dri-
> > > devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't put those
> patches
> > > in this series because I want to merge this small patches separately.
> > > > >
> > > > > > +{
> > > > > > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > > > > > +	struct resource *res;
> > > > > > +	void *addr;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > > > > > +					   mr->usable_size);
> > > > > > +	if (IS_ERR(res)) {
> > > > > > +		ret = PTR_ERR(res);
> > > > > > +		return ret;
> > > > > > +	}
> > > > > > +
> > > > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > > > +	mr->pagemap.range.start = res->start;
> > > > > > +	mr->pagemap.range.end = res->end;
> > > > > > +	mr->pagemap.nr_range = 1;
> > > > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > > > > +	if (IS_ERR(addr)) {
> > > > > > +		devm_release_mem_region(dev, res->start,
> resource_size(res));
> > > > > > +		ret = PTR_ERR(addr);
> > > > > > +		drm_err(&tile->xe->drm, "Failed to remap tile %d
> memory,
> > > > > errno %d\n",
> > > > > > +				tile->id, ret);
> > > > > > +		return ret;
> > > > > > +	}
> > > > > > +	mr->hpa_base = res->start;
> > > > > > +
> > > > > > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to
> devm,
> > > > > remapped to %pr\n",
> > > > > > +			tile->id, mr->io_start, mr->io_start + mr-
> >usable_size,
> > > > > res);
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_svm_devm_remove: Unmap device memory and free resources
> > > > > > + * @xe: xe device
> > > > > > + * @mr: memory region to remove
> > > > > > + */
> > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> xe_mem_region
> > > > > *mr)
> > > > > > +{
> > > > > > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > > > > > +#if 0
> > > > > > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > > > > > +
> > > > > > +	if (mr->hpa_base) {
> > > > > > +		devm_memunmap_pages(dev, &mr->pagemap);
> > > > > > +		devm_release_mem_region(dev, mr-
> >pagemap.range.start,
> > > > > > +			mr->pagemap.range.end - mr-
> >pagemap.range.start +1);
> > > > > > +	}
> > > > > > +#endif
> > > > >
> > > > > This would need to be fixed too.
> > > >
> > > >
> > > > Yes...
> > > >
> > > > Oak
> > > > >
> > > > > Matt
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > --
> > > > > > 2.26.3
> > > > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-15 18:05         ` Welty, Brian
@ 2024-03-15 23:11           ` Zeng, Oak
  0 siblings, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-15 23:11 UTC (permalink / raw)
  To: Welty, Brian, intel-xe
  Cc: Hellstrom, Thomas, Brost, Matthew, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Welty, Brian <brian.welty@intel.com>
> Sent: Friday, March 15, 2024 2:05 PM
> To: Zeng, Oak <oak.zeng@intel.com>; intel-xe@lists.freedesktop.org
> Cc: Hellstrom, Thomas <thomas.hellstrom@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> 
> On 3/14/2024 8:16 PM, Zeng, Oak wrote:
> [...]
> >>>> diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> >> b/drivers/gpu/drm/xe/xe_mmio.c
> >>>> index e3db3a178760..0d795394bc4c 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_mmio.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> >>>> @@ -22,6 +22,7 @@
> >>>>    #include "xe_module.h"
> >>>>    #include "xe_sriov.h"
> >>>>    #include "xe_tile.h"
> >>>> +#include "xe_svm.h"
> >>>>
> >>>>    #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> >>>>    #define TILE_COUNT		REG_GENMASK(15, 8)
> >>>> @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> >>>>    		}
> >>>>
> >>>>    		io_size -= min_t(u64, tile_size, io_size);
> >>>> +		xe_svm_devm_add(tile, &tile->mem.vram);
> >>>
> >>> I think slightly more appropriate call site for this might be
> >>> xe_tile_init_noalloc(), as that function states it is preparing tile
> >>> for VRAM allocations.
> >>> Also, I mention because we might like the flexiblity in future to call
> >>> this once for xe_device.mem.vram, instead of calling for each tile,
> >>> and easier to handle that in xe_tile.c instead of xe_mmio.c.
> >>
> >> Good point. Will follow.
> >
> > Sorry, with my comment below, do you still want to call it in xe_tile_init_noalloc?
> >
> > For UMA, we only need to call it once. If you do it in init-noalloc, you would call
> it multiple times. Right?
> >
> > Oak
> >
> 
> Oh, I hoped you were still going to move it.
> I prefer it in xe_tile.c instead of here.  But feel free to ask
> maintainers about it.
> I think the UMA would anyway add conditional to disable these calls
> wherever they are placed, in favor of calling against xe_device.mem.vram.
> But this is pretty minor and can revisit later too.

Let me move it to xe_tile.c then... we can make whatever changes are needed once UMA is settled.
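
Roughly something like this (just a sketch of the placement, with the
rest of xe_tile_init_noalloc elided; the helper name may also change
once the 'svm' prefix is dropped):

int xe_tile_init_noalloc(struct xe_tile *tile)
{
	/* ... existing tile init (ttm managers, bb pool, etc.) ... */

	/*
	 * Hypothetical placement: register this tile's VRAM with
	 * ZONE_DEVICE once the tile is otherwise ready for allocations.
	 * Error handling is omitted in this sketch.
	 */
	xe_svm_devm_add(tile, &tile->mem.vram);

	return 0;
}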

> 
> I commented down below, to keep to one email.
> 
> >>>
> >>> Related comment below.
> >>>
> >>>
> >>>>    	}
> >>>>
> >>>>    	xe->mem.vram.actual_physical_size = total_size;
> >>>> @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device *xe)
> >>>>    static void mmio_fini(struct drm_device *drm, void *arg)
> >>>>    {
> >>>>    	struct xe_device *xe = arg;
> >>>> +    struct xe_tile *tile;
> >>>> +    u8 id;
> >>>>
> >>>>    	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> >>>>    	if (xe->mem.vram.mapping)
> >>>>    		iounmap(xe->mem.vram.mapping);
> >>>> +
> >>>> +	for_each_tile(tile, xe, id)
> >>>> +		xe_svm_devm_remove(xe, &tile->mem.vram);
> >>>> +
> >>>>    }
> >>>>
> [...]
> >>> b/drivers/gpu/drm/xe/xe_svm_devmem.c
> >>>> new file mode 100644
> >>>> index 000000000000..63b7a1961cc6
> >>>> --- /dev/null
> >>>> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> >>>> @@ -0,0 +1,91 @@
> >>>> +// SPDX-License-Identifier: MIT
> >>>> +/*
> >>>> + * Copyright © 2023 Intel Corporation
> >>>> + */
> >>>> +
> >>>> +#include <linux/mm_types.h>
> >>>> +#include <linux/sched/mm.h>
> >>>> +
> >>>> +#include "xe_device_types.h"
> >>>> +#include "xe_trace.h"
> >>>> +#include "xe_svm.h"
> >>>> +
> >>>> +
> >>>> +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> >>>> +{
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +static void xe_devm_page_free(struct page *page)
> >>>> +{
> >>>> +}
> >>>> +
> >>>> +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> >>>> +	.page_free = xe_devm_page_free,
> >>>> +	.migrate_to_ram = xe_devm_migrate_to_ram,
> >>>> +};
> >>>> +
> >>>> +/**
> >>>> + * xe_svm_devm_add: Remap and provide memmap backing for device
> >>> memory
> >>>
> >>> Do we really need 'svm' in function name?
> >>
> >> Good point. I will remove svm.
> >>>
> >>>> + * @tile: tile that the memory region blongs to
> >>>
> >>> We might like flexibility in future to call this once for
> >>> xe_device.mem.vram, instead of calling for each tile.
> >>> So can we remove the tile argument, and just pass the xe_device pointer
> >>> and tile->id ?
> >>
> >> This is interesting.
> >>
> >> First of all, I programmed wrong below: mr->pagemap.owner = tile->xe-
> >>> drm.dev;
> >>
> >> This should be: mr->pagemap.owner = tile for NUMA vram system
> Oh, okay.  Glad we caught that.
> 
> Still, wouldn't using 'mr' pointer as the owner be sufficient and then
> is common for both cases?
> Avoiding the strong association with 'tile' is nice if possible.
> 
> But I haven't studied rest of code and how pagemap.owner is used, so
> maybe you actually need the tile pointer for something.

mr would be more unified. But when I looked further, for now we still need to set the owner to dev, so I take back my comment about setting the owner to tile.

The reason is, we also need to set the owner when calling hmm_range_fault, see patch 5. In patch 5 we don't have the mr and tile, we only have the vma and vm, and from the vm we can only get the dev, not the tile or mr...

It looks like the xe_vm is designed per device, not per tile. I wouldn't change that in this series. Making the vm per tile is doable, but the main purpose of this series is to prove the system allocator concept on PVC and position it for new platforms which might have a UMA architecture b/t tiles. So optimizations such as migration b/t tiles are low priority.

So for now, let's still go back to dev for the owner.
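
To make the pairing concrete, a minimal sketch (the fault-side variable
names are assumptions, not the actual patch 5 code) - the owner written
at remap time and the dev_private_owner passed to hmm_range_fault()
must be the same pointer, and the device is the only thing both sides
can reach today:

	/* registration side (this patch) */
	mr->pagemap.owner = tile->xe->drm.dev;

	/* fault side (patch 5, sketched) */
	struct hmm_range range = {
		.notifier = &userptr->notifier,	/* hypothetical */
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
		.dev_private_owner = vm->xe->drm.dev,
	};

	ret = hmm_range_fault(&range);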

Oak

> 
> Well, this also seems like something minor and could be revisited later.
> 
> -Brian
> 
> 
> >>
> >> For UMA vram, this should be: mr->pagemap.owner = tile_to_xe(tile);
> >>
> >> This owner is important. It is used later to decide migration by hmm. We need
> to
> >> set the owner for hmm to identify different vram region.
> >>
> >> Based on above, I think the tile parameter is better. For UMA, caller need to
> >> make sure call it once, any tile pointer should work. This does sound a little
> weird.
> >> But I don’t have a better idea.
> >>
> >> Oak
> >>>
> >>>
> >>>> + * @mr: memory region to remap
> >>>> + *
> >>>> + * This remap device memory to host physical address space and create
> >>>> + * struct page to back device memory
> >>>> + *
> >>>> + * Return: 0 on success standard error code otherwise
> >>>> + */
> >>>> +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> >>>> +{
> >>>> +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> >>>> +	struct resource *res;
> >>>> +	void *addr;
> >>>> +	int ret;
> >>>> +
> >>>> +	res = devm_request_free_mem_region(dev, &iomem_resource,
> >>>> +					   mr->usable_size);
> >>>> +	if (IS_ERR(res)) {
> >>>> +		ret = PTR_ERR(res);
> >>>> +		return ret;
> >>>> +	}
> >>>> +
> >>>> +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> >>>> +	mr->pagemap.range.start = res->start;
> >>>> +	mr->pagemap.range.end = res->end;
> >>>> +	mr->pagemap.nr_range = 1;
> >>>> +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> >>>> +	mr->pagemap.owner = tile->xe->drm.dev;
> >>>> +	addr = devm_memremap_pages(dev, &mr->pagemap);
> >>>> +	if (IS_ERR(addr)) {
> >>>> +		devm_release_mem_region(dev, res->start, resource_size(res));
> >>>> +		ret = PTR_ERR(addr);
> >>>> +		drm_err(&tile->xe->drm, "Failed to remap tile %d memory,
> >>> errno %d\n",
> >>>> +				tile->id, ret);
> >>>> +		return ret;
> >>>> +	}
> >>>> +	mr->hpa_base = res->start;
> >>>> +
> >>>> +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to devm,
> >>> remapped to %pr\n",
> >>>> +			tile->id, mr->io_start, mr->io_start + mr->usable_size,
> >>> res);
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * xe_svm_devm_remove: Unmap device memory and free resources
> >>>> + * @xe: xe device
> >>>> + * @mr: memory region to remove
> >>>> + */
> >>>> +void xe_svm_devm_remove(struct xe_device *xe, struct
> xe_mem_region
> >>> *mr)
> >>>> +{
> >>>> +	/*FIXME: below cause a kernel hange during moduel remove*/
> >>>> +#if 0
> >>>> +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> >>>> +
> >>>> +	if (mr->hpa_base) {
> >>>> +		devm_memunmap_pages(dev, &mr->pagemap);
> >>>> +		devm_release_mem_region(dev, mr->pagemap.range.start,
> >>>> +			mr->pagemap.range.end - mr->pagemap.range.start +1);
> >>>> +	}
> >>>> +#endif
> >>>> +}
> >>>> +

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-15 21:31             ` Zeng, Oak
@ 2024-03-16  1:25               ` Matthew Brost
  2024-03-18 10:16                 ` Hellstrom, Thomas
  2024-03-18 14:51                 ` Zeng, Oak
  0 siblings, 2 replies; 49+ messages in thread
From: Matthew Brost @ 2024-03-16  1:25 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad

On Fri, Mar 15, 2024 at 03:31:24PM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Friday, March 15, 2024 4:40 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> > GPU vram
> > 
> > On Fri, Mar 15, 2024 at 10:00:06AM -0600, Zeng, Oak wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > Sent: Thursday, March 14, 2024 4:49 PM
> > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > <himal.prasad.ghimiray@intel.com>
> > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing
> > for
> > > > GPU vram
> > > >
> > > > On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> > > > > Hi Matt,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > Sent: Thursday, March 14, 2024 1:18 PM
> > > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > > <himal.prasad.ghimiray@intel.com>
> > > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > backing
> > > > for
> > > > > > GPU vram
> > > > > >
> > > > > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > > > > > Memory remap GPU vram using devm_memremap_pages, so each
> > GPU
> > > > vram
> > > > > > > page is backed by a struct page.
> > > > > > >
> > > > > > > Those struct pages are created to allow hmm migrate buffer b/t
> > > > > > > GPU vram and CPU system memory using existing Linux migration
> > > > > > > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> > > > > > >
> > > > > > > This is prepare work to enable svm (shared virtual memory) through
> > > > > > > Linux kernel hmm framework. The memory remap's page map type is
> > set
> > > > > > > to MEMORY_DEVICE_PRIVATE for now. This means even though each
> > GPU
> > > > > > > vram page get a struct page and can be mapped in CPU page table,
> > > > > > > but such pages are treated as GPU's private resource, so CPU can't
> > > > > > > access them. If CPU access such page, a page fault is triggered
> > > > > > > and page will be migrate to system memory.
> > > > > > >
> > > > > >
> > > > > > Is this really true? We can map VRAM BOs to the CPU without having
> > > > > > migarte back and forth. Admittedly I don't know the inner workings of
> > > > > > how this works but in IGTs we do this all the time.
> > > > > >
> > > > > >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> > > > > >   55                                 vram_if_possible(fd, 0),
> > > > > >   56
> > DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > > > > >   57         batch_map = xe_bo_map(fd, batch_bo, batch_size);
> > > > > >
> > > > > > The BO is created in VRAM and then mapped to the CPU.
> > > > > >
> > > > > > I don't think there is an expectation of coherence rather caching mode
> > > > > > and exclusive access of the memory based on synchronization.
> > > > > >
> > > > > > e.g.
> > > > > > User write BB/data via CPU to GPU memory
> > > > > > User calls exec
> > > > > > GPU read / write memory
> > > > > > User wait on sync indicating exec done
> > > > > > User reads result
> > > > > >
> > > > > > All of this works without migration. Are we not planing supporting flow
> > > > > > with SVM?
> > > > > >
> > > > > > Afaik this migration dance really only needs to be done if the CPU and
> > > > > > GPU are using atomics on a shared memory region and the GPU device
> > > > > > doesn't support a coherent memory protocol (e.g. PVC).
> > > > >
> > > > > All you said is true. On many of our HW, CPU can actually access device
> > memory,
> > > > cache coherently or not.
> > > > >
> > > > > The problem is, this is not true for all GPU vendors. For example, on some
> > HW
> > > > from some vendor, CPU can only access partially of device memory. The so
> > called
> > > > small bar concept.
> > > > >
> > > > > So when HMM is defined, such factors were considered, and
> > > > MEMORY_DEVICE_PRIVATE is defined. With this definition, CPU can't access
> > > > device memory.
> > > > >
> > > > > So you can think it is a limitation of HMM.
> > > > >
> > > >
> > > > Is it though? I see other type MEMORY_DEVICE_FS_DAX,
> > > > MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA. From my
> > > > limited
> > > > understanding it looks to me like one of those modes would support my
> > > > example.
> > >
> > >
> > > No, above are for other purposes. HMM only support DEVICE_PRIVATE and
> > DEVICE_COHERENT.
> > >
> > > >
> > > > > Note this is only part 1 of our system allocator work. We do plan to support
> > > > DEVICE_COHERENT for our newer device, see below. With this option, we
> > don't
> > > > have unnecessary migration back and forth.
> > > > >
> > > > > You can think this is just work out all the code path. 90% of the driver code
> > for
> > > > DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real use of system
> > > > allocator will be DEVICE_COHERENT mode. While DEVICE_PRIVATE mode
> > allow us
> > > > to exercise the code on old HW.
> > > > >
> > > > > Make sense?
> > > > >
> > > >
> > > > I guess if we want the system allocator to always coherent, then yes you
> > > > need this dynamic migration with faulting on either side.
> > > >
> > > > I was thinking the system allocator would be behave like my example
> > > > above with madvise dictating the coherence rules.
> > > >
> > > > Maybe I missed this in system allocator design but my feeling is we
> > > > shouldn't arbitrarily enforce coherence as that could lead to poor
> > > > performance due to constant migration.
> > >
> > > System allocator itself doesn't enforce coherence. Coherence is built in user
> > programming model. So system allocator allow both GPU and CPU access system
> > allocated pointers, but it doesn't necessarily guarantee the data accessed from
> > CPU/GPU is coherent. It is user program's responsibility to maintain data
> > coherence.
> > >
> > > Data migration in driver is optional, depending on platform capability, user
> > preference, correctness and performance consideration. Driver internal data
> > migration of course shouldn't break data coherence.
> > >
> > > Of course different vendor can have different data coherence scheme. For
> > example, it is completely designer's flexibility to build model that is HW automatic
> > data coherence or software explicit data coherence. On our platform, we allow
> > user program to select different coherence mode by setting pat_index for gpu
> > and cpu_caching mode for CPU. So we have completely give the flexibility to user
> > program. Nothing of this contract is changed in system allocator design.
> > >
> > > Going back to the question of what memory type we should use to register our
> > vram to core mm. HMM currently support two types: PRIVATE and COHERENT.
> > The coherent type requires some HW and BIOS support which we don't have
> > right now. So the only available is PRIVATE. We have not other option right now.
> > As said, we plan to support coherent type where we can avoid unnecessary data
> > migration. But that is stage 2.
> > >
> > 
> > Thanks for the explaination. After reading your replies, the HMM doc,
> > and looking at code this all makes sense.
> > 
> > > >
> > > > >
> > > > > >
> > > > > > > For GPU device which supports coherent memory protocol b/t CPU and
> > > > > > > GPU (such as CXL and CAPI protocol), we can remap device memory as
> > > > > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > > > > >
> > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > > > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > > > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > > > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > > > > ++++++++++++++++++++++++++++
> > > > > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > > > > > > index c531210695db..840467080e59 100644
> > > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > > > > >  	xe_vram_freq.o \
> > > > > > >  	xe_wait_user_fence.o \
> > > > > > >  	xe_wa.o \
> > > > > > > -	xe_wopcm.o
> > > > > > > +	xe_wopcm.o \
> > > > > > > +	xe_svm_devmem.o
> > > > > >
> > > > > > These should be in alphabetical order.
> > > > >
> > > > > Will fix
> > > > > >
> > > > > > >
> > > > > > >  # graphics hardware monitoring (HWMON) support
> > > > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > > > > >  	resource_size_t actual_physical_size;
> > > > > > >  	/** @mapping: pointer to VRAM mappable space */
> > > > > > >  	void __iomem *mapping;
> > > > > > > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE
> > */
> > > > > > > +    struct dev_pagemap pagemap;
> > > > > > > +    /**
> > > > > > > +     * @hpa_base: base host physical address
> > > > > > > +     *
> > > > > > > +     * This is generated when remap device memory as ZONE_DEVICE
> > > > > > > +     */
> > > > > > > +    resource_size_t hpa_base;
> > > > > >
> > > > > > Weird indentation. This goes for the entire series, look at checkpatch.
> > > > >
> > > > > Will fix
> > > > > >
> > > > > > > +
> > > > > > >  };
> > > > > > >
> > > > > > >  /**
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> > > > b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > index e3db3a178760..0d795394bc4c 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > @@ -22,6 +22,7 @@
> > > > > > >  #include "xe_module.h"
> > > > > > >  #include "xe_sriov.h"
> > > > > > >  #include "xe_tile.h"
> > > > > > > +#include "xe_svm.h"
> > > > > > >
> > > > > > >  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > > > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device *xe)
> > > > > > >  		}
> > > > > > >
> > > > > > >  		io_size -= min_t(u64, tile_size, io_size);
> > > > > > > +		xe_svm_devm_add(tile, &tile->mem.vram);
> > > > > >
> > > > > > Do we want to do this probe for all devices with VRAM or only a subset?
> > > > >
> > > > > All
> > > >
> > > > Can you explain why?
> > >
> > > It is natural for me to add all device memory to hmm. In the hmm design,
> > > device memory is used as a special swap space for system memory. I would
> > > ask why we would only want to add a subset of vram. By a subset, do you
> > > mean only the vram of one tile instead of all tiles?
> > >
> > 
> > I think we are talking about different things, my bad on the wording in the
> > original question.
> > 
> > Let me ask again - should we be calling xe_svm_devm_add on all *platforms*
> > that have VRAM? i.e. should we do this on PVC but not DG2?
> 
> 
> Oh, I see. Good question. On i915, this feature was only tested on PVC. We don't have a plan to enable it on platforms older than PVC.
> 
> Let me add a check here to only enable it on platforms newer than PVC.
> 

Probably actually check 'xe->info.has_usm'.

We might want to rename field too and drop the 'usm' nomenclature but
that can be done later.
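
With the existing field name, the probe would just become something like
this (sketch only; where exactly the check lives is up to you):

	/* Only create struct page backing on platforms with recoverable
	 * pagefault support (USM); skip it everywhere else for now.
	 */
	if (xe->info.has_usm)
		xe_svm_devm_add(tile, &tile->mem.vram);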

Matt

> Oak 
> 
> > 
> > Matt
> > 
> > > Oak
> > >
> > >
> > > >
> > > > > >
> > > > > > >  	}
> > > > > > >
> > > > > > >  	xe->mem.vram.actual_physical_size = total_size;
> > > > > > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct xe_device
> > *xe)
> > > > > > >  static void mmio_fini(struct drm_device *drm, void *arg)
> > > > > > >  {
> > > > > > >  	struct xe_device *xe = arg;
> > > > > > > +    struct xe_tile *tile;
> > > > > > > +    u8 id;
> > > > > > >
> > > > > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > > > > > >  	if (xe->mem.vram.mapping)
> > > > > > >  		iounmap(xe->mem.vram.mapping);
> > > > > > > +
> > > > > > > +	for_each_tile(tile, xe, id)
> > > > > > > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > > > > >
> > > > > > This should probably be above existing code. Typical on fini to do
> > > > > > things in reverse order from init.
> > > > >
> > > > > Will fix
> > > > > >
> > > > > > > +
> > > > > > >  }
> > > > > > >
> > > > > > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..09f9afb0e7d4
> > > > > > > --- /dev/null
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > @@ -0,0 +1,14 @@
> > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > +/*
> > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > >
> > > > > > 2024?
> > > > >
> > > > > This patch was actually written 2023
> > > > > >
> > > > > > > + */
> > > > > > > +
> > > > > > > +#ifndef __XE_SVM_H
> > > > > > > +#define __XE_SVM_H
> > > > > > > +
> > > > > > > +#include "xe_device_types.h"
> > > > > >
> > > > > > I don't think you need to include this. Rather just forward decl structs
> > > > > > used here.
> > > > >
> > > > > Will fix
> > > > > >
> > > > > > e.g.
> > > > > >
> > > > > > struct xe_device;
> > > > > > struct xe_mem_region;
> > > > > > struct xe_tile;
> > > > > >
> > > > > > > +
> > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region
> > *mem);
> > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > xe_mem_region
> > > > > > *mem);
> > > > > >
> > > > > > The arguments here are incongruent here. Typically we want these to
> > > > > > match.
> > > > >
> > > > > Will fix
> > > > > >
> > > > > > > +
> > > > > > > +#endif
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > >
> > > > > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> > > > >
> > > > > Did you mean mem vs mr? if yes, will fix
> > > > >
> > > > > Again these two
> > > > > > should
> > > > > > match.
> > > > > >
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..63b7a1961cc6
> > > > > > > --- /dev/null
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > @@ -0,0 +1,91 @@
> > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > +/*
> > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > >
> > > > > > 2024?
> > > > > It is from 2023
> > > > > >
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include <linux/mm_types.h>
> > > > > > > +#include <linux/sched/mm.h>
> > > > > > > +
> > > > > > > +#include "xe_device_types.h"
> > > > > > > +#include "xe_trace.h"
> > > > > >
> > > > > > xe_trace.h appears to be unused.
> > > > >
> > > > > Will fix
> > > > > >
> > > > > > > +#include "xe_svm.h"
> > > > > > > +
> > > > > > > +
> > > > > > > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > > > > +{
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void xe_devm_page_free(struct page *page)
> > > > > > > +{
> > > > > > > +}
> > > > > > > +
> > > > > > > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > > > > > +	.page_free = xe_devm_page_free,
> > > > > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > > > > +};
> > > > > > > +
> > > > > >
> > > > > > Assume these are placeholders that will be populated later?
> > > > >
> > > > >
> > > > > corrrect
> > > > > >
> > > > > > > +/**
> > > > > > > + * xe_svm_devm_add: Remap and provide memmap backing for
> > device
> > > > > > memory
> > > > > > > + * @tile: tile that the memory region blongs to
> > > > > > > + * @mr: memory region to remap
> > > > > > > + *
> > > > > > > + * This remap device memory to host physical address space and create
> > > > > > > + * struct page to back device memory
> > > > > > > + *
> > > > > > > + * Return: 0 on success standard error code otherwise
> > > > > > > + */
> > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region *mr)
> > > > > >
> > > > > > Here I see the xe_mem_region is from tile->mem.vram, wondering rather
> > > > > > than using the tile->mem.vram we should use xe->mem.vram when
> > enabling
> > > > > > svm? Isn't the idea behind svm the entire memory is 1 view?
> > > > >
> > > > > Still need to use tile. The reason is, memory of different tiles can
> > > > > have different characteristics, such as latency. So we want to
> > > > > differentiate memory b/t tiles also in svm. I need to change the below
> > > > > "mr->pagemap.owner = tile->xe->drm.dev"; the owner should also be the
> > > > > tile. This is the way hmm differentiates memory of different tiles.
> > > > >
> > > > > With svm it is 1 view, from the virtual address space perspective and
> > > > > from the physical struct page perspective. You can think of all the
> > > > > tiles' vram as stacked together to form a unified view together with
> > > > > system memory. This doesn't prohibit us from differentiating memory
> > > > > from different tiles. This differentiation allows us to optimize
> > > > > performance, i.e., we can wisely place memory in a specific tile. If
> > > > > we don't differentiate, this is not possible.
> > > > >
> > > >
> > > > Ok makes sense.
> > > >
> > > > Matt
> > > >
> > > > > >
> > > > > > I suppose if we do that we also only use 1 TTM VRAM manager / buddy
> > > > > > allocator too. I thought I saw some patches flying around for that too.
> > > > >
> > > > > Ttm vram manager is not in the picture. We deliberately avoided it per
> > previous
> > > > discussion
> > > > >
> > > > > Yes same buddy allocator. It is in my previous POC:
> > https://lore.kernel.org/dri-
> > > > devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't put those
> > patches
> > > > in this series because I want to merge this small patches separately.
> > > > > >
> > > > > > > +{
> > > > > > > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > > > > > > +	struct resource *res;
> > > > > > > +	void *addr;
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > > > > > > +					   mr->usable_size);
> > > > > > > +	if (IS_ERR(res)) {
> > > > > > > +		ret = PTR_ERR(res);
> > > > > > > +		return ret;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > > > > +	mr->pagemap.range.start = res->start;
> > > > > > > +	mr->pagemap.range.end = res->end;
> > > > > > > +	mr->pagemap.nr_range = 1;
> > > > > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > > > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > > > > > +	if (IS_ERR(addr)) {
> > > > > > > +		devm_release_mem_region(dev, res->start,
> > resource_size(res));
> > > > > > > +		ret = PTR_ERR(addr);
> > > > > > > +		drm_err(&tile->xe->drm, "Failed to remap tile %d
> > memory,
> > > > > > errno %d\n",
> > > > > > > +				tile->id, ret);
> > > > > > > +		return ret;
> > > > > > > +	}
> > > > > > > +	mr->hpa_base = res->start;
> > > > > > > +
> > > > > > > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to
> > devm,
> > > > > > remapped to %pr\n",
> > > > > > > +			tile->id, mr->io_start, mr->io_start + mr-
> > >usable_size,
> > > > > > res);
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * xe_svm_devm_remove: Unmap device memory and free resources
> > > > > > > + * @xe: xe device
> > > > > > > + * @mr: memory region to remove
> > > > > > > + */
> > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > xe_mem_region
> > > > > > *mr)
> > > > > > > +{
> > > > > > > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > > > > > > +#if 0
> > > > > > > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > > > > > > +
> > > > > > > +	if (mr->hpa_base) {
> > > > > > > +		devm_memunmap_pages(dev, &mr->pagemap);
> > > > > > > +		devm_release_mem_region(dev, mr-
> > >pagemap.range.start,
> > > > > > > +			mr->pagemap.range.end - mr-
> > >pagemap.range.start +1);
> > > > > > > +	}
> > > > > > > +#endif
> > > > > >
> > > > > > This would need to be fixed too.
> > > > >
> > > > >
> > > > > Yes...
> > > > >
> > > > > Oak
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > --
> > > > > > > 2.26.3
> > > > > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
  2024-03-15 17:29     ` Zeng, Oak
@ 2024-03-16  1:33       ` Matthew Brost
  2024-03-18 19:25         ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Matthew Brost @ 2024-03-16  1:33 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad

On Fri, Mar 15, 2024 at 11:29:33AM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Thursday, March 14, 2024 1:39 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
> > 
> > On Wed, Mar 13, 2024 at 11:35:51PM -0400, Oak Zeng wrote:
> > > Since we now create struct page backing for each vram page,
> > > each vram page now also has a pfn, just like system memory.
> > > This allow us to calcuate device physical address from pfn.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index bbea40b57e84..bf349321f037 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -576,4 +576,12 @@ static inline struct xe_tile *mem_region_to_tile(struct
> > xe_mem_region *mr)
> > >  	return container_of(mr, struct xe_tile, mem.vram);
> > >  }
> > >
> > > +static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
> > > +{
> > > +	u64 dpa;
> > > +	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> > 
> > Can't this be negative?
> > 
> > e.g. if pfn == 0, offset == -mr->hpa_base.
> > 
> > Or is the assumption (pfn << PAGE_SHIFT) is always > mr->hpa_base?
> > 
> > If so can we an assert or something to ensure we using this function correctly.
> 
> Yes we can assert it. The hpa_base is the host physical base address for this memory region, while pfn should point to a page inside this memory region.
> 
> I will add an assertion.
> 
> 
> > 
> > > +	dpa = mr->dpa_base + offset;
> > > +	return dpa;
> > > +}
> > 
> > Same as previous patch, should be *.h not a *_types.h file.
> 
> Yes will fix.
> > 
> > Also as this is xe_mem_region not explictly vram. Maybe:
> > 
> > s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa/
> 
> Xe_mem_region can only represent vram, right? I mean it can't represent system memory. Copied the definition below:
> 
> /**
>  * struct xe_mem_region - memory region structure
>  * This is used to describe a memory region in xe
>  * device, such as HBM memory or CXL extension memory.
>  */
> 

Ah yes, but still, as the first argument is an xe_mem_region, I think the
function name should reflect that.
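
Something like this is what I have in mind (just a sketch; the warn-on
macro choice and placement are up to you):

	static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr,
						   u64 pfn)
	{
		u64 offset;

		/* pfn must point to a page inside this memory region */
		XE_WARN_ON((pfn << PAGE_SHIFT) < mr->hpa_base);

		offset = (pfn << PAGE_SHIFT) - mr->hpa_base;

		return mr->dpa_base + offset;
	}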

Matt

> Oak
> 
> > 
> > Matt
> > 
> > > +
> > >  #endif
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-14 20:25   ` Matthew Brost
@ 2024-03-16  1:35     ` Zeng, Oak
  2024-03-18  0:29       ` Matthew Brost
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-16  1:35 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Thursday, March 14, 2024 4:25 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> On Wed, Mar 13, 2024 at 11:35:52PM -0400, Oak Zeng wrote:
> > Add a helper function xe_hmm_populate_range to populate
> > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > to read CPU page tables and populate all pfns/pages of this
> > virtual address range.
> >
> > If the populated page is system memory page, dma-mapping is performed
> > to get a dma-address which can be used later for GPU to access pages.
> >
> > If the populated page is device private page, we calculate the dpa (
> > device physical address) of the page.
> >
> > The dma-address or dpa is then saved in userptr's sg table. This is
> > prepare work to replace the get_user_pages_fast code in userptr code
> > path. The helper function will also be used to populate hmmptr later.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile |   3 +-
> >  drivers/gpu/drm/xe/xe_hmm.c | 213
> ++++++++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> >  3 files changed, 227 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index 840467080e59..29dcbc938b01 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> >  	xe_wait_user_fence.o \
> >  	xe_wa.o \
> >  	xe_wopcm.o \
> > -	xe_svm_devmem.o
> > +	xe_svm_devmem.o \
> > +	xe_hmm.o
> >
> >  # graphics hardware monitoring (HWMON) support
> >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> > new file mode 100644
> > index 000000000000..c45c2447d386
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > @@ -0,0 +1,213 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/memremap.h>
> > +#include <linux/swap.h>
> > +#include <linux/mm.h>
> > +#include "xe_hmm.h"
> > +#include "xe_vm.h"
> > +
> > +/**
> > + * mark_range_accessed() - mark a range is accessed, so core mm
> > + * have such information for memory eviction or write back to
> > + * hard disk
> > + *
> > + * @range: the range to mark
> > + * @write: if write to this range, we mark pages in this range
> > + * as dirty
> > + */
> > +static void mark_range_accessed(struct hmm_range *range, bool write)
> > +{
> > +	struct page *page;
> > +	u64 i, npages;
> > +
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> > +	for (i = 0; i < npages; i++) {
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (write) {
> > +			lock_page(page);
> > +			set_page_dirty(page);
> > +			unlock_page(page);
> > +		}
> > +		mark_page_accessed(page);
> > +	}
> > +}
> > +
> > +/**
> > + * build_sg() - build a scatter gather table for all the physical pages/pfn
> > + * in a hmm_range. dma-address is save in sg table and will be used to
> program
> > + * GPU page table later.
> > + *
> > + * @xe: the xe device who will access the dma-address in sg table
> > + * @range: the hmm range that we build the sg table from. range-
> >hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + * @write: whether we write to this range. This decides dma map direction
> > + * for system pages. If write we map it bi-diretional; otherwise
> > + * DMA_TO_DEVICE
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * The dma_address in the sg table will later be used by GPU to
> > + * access memory. So if the memory is system memory, we need to
> > + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> > + * is GPU local memory (of the GPU who is going to access memory),
> > + * we need gpu dpa (device physical address), and there is no need
> > + * of dma-mapping.
> > + *
> > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > + * memory. Add this when you support p2p
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> 
> xe is unused.

It is used in the line below:
> 
> > +			     struct sg_table *st, bool write)
> > +{
> > +	struct device *dev = xe->drm.dev;

Used here

> > +	struct scatterlist *sg;
> > +	u64 i, npages;
> > +
> > +	sg = NULL;
> > +	st->nents = 0;
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> > +
> > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		struct page *page;
> > +		unsigned long addr;
> > +		struct xe_mem_region *mr;
> > +
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (is_device_private_page(page)) {
> > +			mr = page_to_mem_region(page);
> 
> Not seeing where page_to_mem_region is defined.


Yeah, I forgot to pick up this patch. Will pick it up...
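
For reference, the helper in that patch looks roughly like below (sketch
from memory; the posted version may differ slightly):

	static inline struct xe_mem_region *page_to_mem_region(struct page *page)
	{
		/* Device private pages carry a dev_pagemap pointer, and our
		 * dev_pagemap is embedded in xe_mem_region, so container_of()
		 * gets us back to the owning memory region.
		 */
		return container_of(page->pgmap, struct xe_mem_region, pagemap);
	}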

> 
> > +			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
> > +		} else {
> > +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > +					write ? DMA_BIDIRECTIONAL :
> DMA_TO_DEVICE);
> > +		}
> > +
> > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > +			sg->length += PAGE_SIZE;
> > +			sg_dma_len(sg) += PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		sg =  sg ? sg_next(sg) : st->sgl;
> > +		sg_dma_address(sg) = addr;
> > +		sg_dma_len(sg) = PAGE_SIZE;
> > +		sg->length = PAGE_SIZE;
> > +		st->nents++;
> > +	}
> > +
> > +	sg_mark_end(sg);
> > +	return 0;
> > +}
> > +
> 
> Hmm, this looks way to open coded to me.
> 
> Can't we do something like this:
> 
> struct page **pages = convert from range->hmm_pfns
> sg_alloc_table_from_pages_segment
> if (is_device_private_page())
>         populatue sg table via vram_pfn_to_dpa
> else
>         dma_map_sgtable
> free(pages)
> 
> This assume we are not mixing is_device_private_page & system memory
> pages in a single struct hmm_range.



That is exactly what I pictured. We actually can mix vram and system memory. The reason is that migration of an hmm range from system memory to vram can fail for whatever reason, and the range can end up partially migrated. Migration is best effort in such a case; we just map the system memory in that range to the gpu. This will come up in the coming system allocator patches...

This case was found during i915 testing...

I also checked amd's code. They assume the same; just take a look at the function svm_range_dma_map_dev.

I agree that the code is not nice, but I don't have a better way given the above.

> 
> 
> > +/**
> > + * xe_hmm_populate_range() - Populate physical pages of a virtual
> > + * address range
> > + *
> > + * @vma: vma has information of the range to populate. only vma
> > + * of userptr and hmmptr type can be populated.
> > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > + * will hold the populated pfns.
> > + * @write: populate pages with write permission
> > + *
> > + * This function populate the physical pages of a virtual
> > + * address range. The populated physical pages is saved in
> > + * userptr's sg table. It is similar to get_user_pages but call
> > + * hmm_range_fault.
> > + *
> > + * This function also read mmu notifier sequence # (
> > + * mmu_interval_read_begin), for the purpose of later
> > + * comparison (through mmu_interval_read_retry).
> > + *
> > + * This must be called with mmap read or write lock held.
> > + *
> > + * This function allocates the storage of the userptr sg table.
> > + * It is caller's responsibility to free it calling sg_free_table.
> 
> I'd add a helper to free the sg_free_table & unmap the dma pages if
> needed.

Ok, due to the reason I explained above, there will be a little complication. I will do it in a separate patch.
> 
> > + *
> > + * returns: 0 for succuss; negative error no on failure
> > + */
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> > +						bool write)
> > +{
> 
> The layering is all wrong here, we shouldn't be touching struct xe_vma
> directly in hmm layer.

I have to admit we don't have a clear layering here. This function is supposed to be a shared function which will be used by both hmmptr and userptr.

Maybe you got an impression from my POC series that we don't have xe_vma in system allocator codes. That was true. But after the design discussion, we have decided to unify userptr code and hmmptr codes, so we will have xe_vma also in system allocator code. Basically xe_vma will replace the xe_svm_range concept in my patch series.

Of course, per Thomas we can further optimize by splitting xe_vma into mutable and immutable parts... but that should be step 2.

> 
> Pass in a populated hmm_range and sgt. Or alternatively pass in
> arguments and then populate a hmm_range locally.

Xe_vma is the input parameter here.

I will remove hmm_range from the function parameters and make it a local variable. I figured I don't need it as an output anymore. All we need is already in the sg table. 

> 
> > +	unsigned long timeout =
> > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > +	struct xe_userptr_vma *userptr_vma;
> > +	struct xe_userptr *userptr;
> > +	u64 start = vma->gpuva.va.addr;
> > +	u64 end = start + vma->gpuva.va.range;
> 
> We have helper - xe_vma_start and xe_vma_end but I think either of these
> are correct in this case.
> 
> xe_vma_start is the address which this bound to the GPU, we want the
> userptr address.
> 
> So I think it would be:
> 
> start = xe_vma_userptr()
> end = xe_vma_userptr() + xe_vma_size()

You are correct. Will fix. I mixed this up because, in the system allocator, the cpu address always == the gpu address.
> 
> 
> > +	struct xe_vm *vm = xe_vma_vm(vma);
> > +	u64 npages;
> > +	int ret;
> > +
> > +	userptr_vma = to_userptr_vma(vma);
> > +	userptr = &userptr_vma->userptr;
> > +	mmap_assert_locked(userptr->notifier.mm);
> > +
> > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
> 
> This math is done above, if you need this math in next rev add a helper.


Will do.
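
Probably something as simple as this (hypothetical helper name):

	static u64 xe_npages_in_range(unsigned long start, unsigned long end)
	{
		return ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
	}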
> 
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (unlikely(!pfns))
> > +		return -ENOMEM;
> > +
> > +	if (write)
> > +		flags |= HMM_PFN_REQ_WRITE;
> > +
> > +	memset64((u64 *)pfns, (u64)flags, npages);
> 
> Why is this needed, can't we just set hmm_range->default_flags?

In this case, setting default_flags also works.

This is some code I inherited from Niranjana. Per my testing it works.

Basically there are two ways to control the flags:
default_flags is the coarse-grained way, applied to all pfns.
The way I am using here is per-pfn, fine-grained flag setting. It also works.
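
To illustrate the two options (just a sketch; variable names follow the
patch and the struct hmm_range fields are from include/linux/hmm.h):

	/* coarse grained: one request flag set applied to every pfn */
	hmm_range->default_flags = HMM_PFN_REQ_FAULT |
				   (write ? HMM_PFN_REQ_WRITE : 0);

	/* fine grained: per-pfn request flags, honored via pfn_flags_mask */
	memset64((u64 *)pfns, (u64)flags, npages);
	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;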

> 
> > +	hmm_range->hmm_pfns = pfns;
> > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr-
> >notifier);
> 
> We need a userptr->notifier == userptr->notifier_seq check that just
> returns, right?

Yes this sequence number is used to decide whether we need to retry. See function xe_pt_userptr_pre_commit
> 
> > +	hmm_range->notifier = &userptr->notifier;
> > +	hmm_range->start = start;
> > +	hmm_range->end = end;
> > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> HMM_PFN_REQ_WRITE;
> 
> Is this needed? AMD and Nouveau do not set this argument.


As explained above, amd and nouveau only use the coarse-grained setting.

> 
> > +	/**
> > +	 * FIXME:
> > +	 * Set the the dev_private_owner can prevent hmm_range_fault to fault
> > +	 * in the device private pages owned by caller. See function
> > +	 * hmm_vma_handle_pte. In multiple GPU case, this should be set to the
> > +	 * device owner of the best migration destination. e.g., device0/vm0
> > +	 * has a page fault, but we have determined the best placement of
> > +	 * the fault address should be on device1, we should set below to
> > +	 * device1 instead of device0.
> > +	 */
> > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > +
> > +	while (true) {
> 
> mmap_read_lock(mm);
> 
> > +		ret = hmm_range_fault(hmm_range);
> 
> mmap_read_unlock(mm);


This needs to be called by the caller. The reason is that, in the system allocator code path, the mmap lock is already held before calling into this helper.

I will check patch 5 for this purpose.

> 
> > +		if (time_after(jiffies, timeout))
> > +			break;
> > +
> > +		if (ret == -EBUSY)
> > +			continue;
> > +		break;
> > +	}
> > +
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	mark_range_accessed(hmm_range, write);
> > +	userptr->sg = &userptr->sgt;
> 
> Again this should be set in caller after this function return.

Why can't we set it here? It is shared b/t userptr and hmmptr.


> 
> > +	userptr->notifier_seq = hmm_range->notifier_seq;
> 
> This is could be a pass by reference I guess and set here.

Sorry, I don't understand this comment.

> 
> > +
> > +free_pfns:
> > +	kvfree(pfns);
> > +	return ret;
> > +}
> > +
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
> > new file mode 100644
> > index 000000000000..960f3f6d36ae
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > @@ -0,0 +1,12 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/hmm.h>
> > +#include "xe_vm_types.h"
> > +#include "xe_svm.h"
> 
> As per the previous patches no need to xe_*.h files, just forward
> declare any arguments.


Will do.

Oak
> 
> Matt
> 
> > +
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> > +						bool write);
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-16  1:35     ` Zeng, Oak
@ 2024-03-18  0:29       ` Matthew Brost
  0 siblings, 0 replies; 49+ messages in thread
From: Matthew Brost @ 2024-03-18  0:29 UTC (permalink / raw)
  To: Zeng, Oak
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad

On Fri, Mar 15, 2024 at 07:35:15PM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Thursday, March 14, 2024 4:25 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> > 
> > On Wed, Mar 13, 2024 at 11:35:52PM -0400, Oak Zeng wrote:
> > > Add a helper function xe_hmm_populate_range to populate
> > > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > > to read CPU page tables and populate all pfns/pages of this
> > > virtual address range.
> > >
> > > If the populated page is system memory page, dma-mapping is performed
> > > to get a dma-address which can be used later for GPU to access pages.
> > >
> > > If the populated page is device private page, we calculate the dpa (
> > > device physical address) of the page.
> > >
> > > The dma-address or dpa is then saved in userptr's sg table. This is
> > > prepare work to replace the get_user_pages_fast code in userptr code
> > > path. The helper function will also be used to populate hmmptr later.
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > ++++++++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > >
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index 840467080e59..29dcbc938b01 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > >  	xe_wait_user_fence.o \
> > >  	xe_wa.o \
> > >  	xe_wopcm.o \
> > > -	xe_svm_devmem.o
> > > +	xe_svm_devmem.o \
> > > +	xe_hmm.o
> > >
> > >  # graphics hardware monitoring (HWMON) support
> > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> > > new file mode 100644
> > > index 000000000000..c45c2447d386
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > @@ -0,0 +1,213 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/swap.h>
> > > +#include <linux/mm.h>
> > > +#include "xe_hmm.h"
> > > +#include "xe_vm.h"
> > > +
> > > +/**
> > > + * mark_range_accessed() - mark a range is accessed, so core mm
> > > + * have such information for memory eviction or write back to
> > > + * hard disk
> > > + *
> > > + * @range: the range to mark
> > > + * @write: if write to this range, we mark pages in this range
> > > + * as dirty
> > > + */
> > > +static void mark_range_accessed(struct hmm_range *range, bool write)
> > > +{
> > > +	struct page *page;
> > > +	u64 i, npages;
> > > +
> > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> > PAGE_SHIFT) + 1;
> > > +	for (i = 0; i < npages; i++) {
> > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > +		if (write) {
> > > +			lock_page(page);
> > > +			set_page_dirty(page);
> > > +			unlock_page(page);
> > > +		}
> > > +		mark_page_accessed(page);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * build_sg() - build a scatter gather table for all the physical pages/pfn
> > > + * in a hmm_range. dma-address is save in sg table and will be used to
> > program
> > > + * GPU page table later.
> > > + *
> > > + * @xe: the xe device who will access the dma-address in sg table
> > > + * @range: the hmm range that we build the sg table from. range-
> > >hmm_pfns[]
> > > + * has the pfn numbers of pages that back up this hmm address range.
> > > + * @st: pointer to the sg table.
> > > + * @write: whether we write to this range. This decides dma map direction
> > > + * for system pages. If write we map it bi-diretional; otherwise
> > > + * DMA_TO_DEVICE
> > > + *
> > > + * All the contiguous pfns will be collapsed into one entry in
> > > + * the scatter gather table. This is for the convenience of
> > > + * later on operations to bind address range to GPU page table.
> > > + *
> > > + * The dma_address in the sg table will later be used by GPU to
> > > + * access memory. So if the memory is system memory, we need to
> > > + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> > > + * is GPU local memory (of the GPU who is going to access memory),
> > > + * we need gpu dpa (device physical address), and there is no need
> > > + * of dma-mapping.
> > > + *
> > > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > > + * memory. Add this when you support p2p
> > > + *
> > > + * This function allocates the storage of the sg table. It is
> > > + * caller's responsibility to free it calling sg_free_table.
> > > + *
> > > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > > + */
> > > +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> > 
> > xe is unused.
> 
> It is used in below line
> > 
> > > +			     struct sg_table *st, bool write)
> > > +{
> > > +	struct device *dev = xe->drm.dev;
> 
> Used here
> 
> > > +	struct scatterlist *sg;
> > > +	u64 i, npages;
> > > +
> > > +	sg = NULL;
> > > +	st->nents = 0;
> > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> > PAGE_SHIFT) + 1;
> > > +
> > > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > > +		return -ENOMEM;
> > > +
> > > +	for (i = 0; i < npages; i++) {
> > > +		struct page *page;
> > > +		unsigned long addr;
> > > +		struct xe_mem_region *mr;
> > > +
> > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > +		if (is_device_private_page(page)) {
> > > +			mr = page_to_mem_region(page);
> > 
> > Not seeing where page_to_mem_region is defined.
> 
> 
> Yah,,,, I forgot to pick up this patch. Will pick up...
> 
> > 
> > > +			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
> > > +		} else {
> > > +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > +					write ? DMA_BIDIRECTIONAL :
> > DMA_TO_DEVICE);
> > > +		}
> > > +
> > > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > > +			sg->length += PAGE_SIZE;
> > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > +			continue;
> > > +		}
> > > +
> > > +		sg =  sg ? sg_next(sg) : st->sgl;
> > > +		sg_dma_address(sg) = addr;
> > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > +		sg->length = PAGE_SIZE;
> > > +		st->nents++;
> > > +	}
> > > +
> > > +	sg_mark_end(sg);
> > > +	return 0;
> > > +}
> > > +
> > 
> > Hmm, this looks way to open coded to me.
> > 
> > Can't we do something like this:
> > 
> > struct page **pages = convert from range->hmm_pfns
> > sg_alloc_table_from_pages_segment
> > if (is_device_private_page())
> >         populatue sg table via vram_pfn_to_dpa
> > else
> >         dma_map_sgtable
> > free(pages)
> > 
> > This assume we are not mixing is_device_private_page & system memory
> > pages in a single struct hmm_range.
> 
> 
> 
> That is exactly what I pictured. We actually can mix vram and system memory. The reason is that migration of an hmm range from system memory to vram can fail for whatever reason, and the range can end up partially migrated. Migration is best effort in such a case; we just map the system memory in that range to the gpu. This will come up in the coming system allocator patches...
> 
> This case was found during i915 testing...
> 
> I also checked amd's code. They assume the same; just take a look at the function svm_range_dma_map_dev.
> 
> I agree that the code is not nice, but I don't have a better way given the above.
> 

Hmm, based on what you are saying, if a migration fails and ends up
partial, I'd say we'd do one of two things.

1. Reverse the partial migration
2. Split GPUVA (or pt_state if supporting 1:N) based on the migration

I don't think we'd ever want a GPUVA (or pt_state) that is mixed between
SRAM / VRAM for a variety of reasons.

This is all speculation (both my comment and yours) as we haven't gotten
to implementing the system allocator in Xe.

How about for this series we drop all device mapping support (this would
be untested dead code in Xe which we shouldn't merge anyways) and just
add the hmm layer which maps a SG for system memory only.

Once we need device mapping support we add this into hmm layer then.

Sound reasonable?
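
i.e. roughly (sketch only; system memory only, variable names follow your
patch, error unwinding kept minimal):

	struct page **pages;
	u64 i;
	int ret;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	for (i = 0; i < npages; i++)
		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);

	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
						npages << PAGE_SHIFT,
						UINT_MAX, GFP_KERNEL);
	if (!ret)
		ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL :
				      DMA_TO_DEVICE, 0);

	kvfree(pages);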

> > 
> > 
> > > +/**
> > > + * xe_hmm_populate_range() - Populate physical pages of a virtual
> > > + * address range
> > > + *
> > > + * @vma: vma has information of the range to populate. only vma
> > > + * of userptr and hmmptr type can be populated.
> > > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > > + * will hold the populated pfns.
> > > + * @write: populate pages with write permission
> > > + *
> > > + * This function populate the physical pages of a virtual
> > > + * address range. The populated physical pages is saved in
> > > + * userptr's sg table. It is similar to get_user_pages but call
> > > + * hmm_range_fault.
> > > + *
> > > + * This function also read mmu notifier sequence # (
> > > + * mmu_interval_read_begin), for the purpose of later
> > > + * comparison (through mmu_interval_read_retry).
> > > + *
> > > + * This must be called with mmap read or write lock held.
> > > + *
> > > + * This function allocates the storage of the userptr sg table.
> > > + * It is caller's responsibility to free it calling sg_free_table.
> > 
> > I'd add a helper to free the sg_free_table & unmap the dma pages if
> > needed.
> 
> Ok, due to the reason I explained above, there will be a little complication. I will do it in a separate patch.

Maybe something like:

void xe_hmm_unmap_sg(struct xe_device *xe, struct sg_table *sg, bool read_only)

Again, for now this will call dma_unmap_sgtable & sg_free_table as we
only support system memory; once we support device memory it will skip
dma_unmap_sgtable for the device memory entries.
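
So roughly (sketch only; everything in the table is assumed dma-mapped
system memory for now):

	void xe_hmm_unmap_sg(struct xe_device *xe, struct sg_table *sg,
			     bool read_only)
	{
		dma_unmap_sgtable(xe->drm.dev, sg,
				  read_only ? DMA_TO_DEVICE : DMA_BIDIRECTIONAL,
				  0);
		sg_free_table(sg);
	}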

> > 
> > > + *
> > > + * returns: 0 for succuss; negative error no on failure
> > > + */
> > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > *hmm_range,
> > > +						bool write)
> > > +{
> > 
> > The layering is all wrong here, we shouldn't be touching struct xe_vma
> > directly in hmm layer.
> 
> I have to admit we don't have a clear layering here. This function is supposed to be a shared function which will be used by both hmmptr and userptr.
> 
> Maybe you got an impression from my POC series that we don't have xe_vma in system allocator codes. That was true. But after the design discussion, we have decided to unify userptr code and hmmptr codes, so we will have xe_vma also in system allocator code. Basically xe_vma will replace the xe_svm_range concept in my patch series.
> 
> Of course, per Thomas we can further optimize by splitting xe_vma into mutable and immutable parts... but that should be step 2.
> 
> > 
> > Pass in a populated hmm_range and sgt. Or alternatively pass in
> > arguments and then populate a hmm_range locally.
> 
> Xe_vma is the input parameter here.
>

It shouldn't be, that's my point. It is the wrong layering.
 
> I will remove Hmm_range from function parameter and make it a local. I figured I don't need this as an output anymore. All we need is already in sg table. 
> 

I was thinking a prototype like this:

int xe_hmm_map_sg(struct xe_device *xe, struct sg_table *sg,
		  struct mmu_interval_notifier *notifier,
		  long unsigned *current_seq, u64 addr, u64 size,
		  bool read_only, bool need_unmap)

Again now only system memory support. This function will replace most of
the current code in xe_vma_userptr_pin_pages. I have more on this in
patch 5.

If you like xe_hmm_populate_sg more, open to that name too. As with
xe_hmm_unmap_sg, feel free to rename too.
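
The call site in xe_vma_userptr_pin_pages would then be roughly this
(sketch; xe_vma_read_only() and need_unmap are assumptions about what the
vma layer already tracks, not part of the prototype itself):

	struct xe_userptr *userptr = &uvma->userptr;
	struct xe_vma *vma = &uvma->vma;
	int err;

	err = xe_hmm_map_sg(xe_vma_vm(vma)->xe, &userptr->sgt,
			    &userptr->notifier, &userptr->notifier_seq,
			    xe_vma_userptr(vma), xe_vma_size(vma),
			    xe_vma_read_only(vma), need_unmap);
	if (!err)
		userptr->sg = &userptr->sgt;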

> > 
> > > +	unsigned long timeout =
> > > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > +	struct xe_userptr_vma *userptr_vma;
> > > +	struct xe_userptr *userptr;
> > > +	u64 start = vma->gpuva.va.addr;
> > > +	u64 end = start + vma->gpuva.va.range;
> > 
> > We have helper - xe_vma_start and xe_vma_end but I think either of these
> > are correct in this case.
> > 
> > xe_vma_start is the address which this bound to the GPU, we want the
> > userptr address.
> > 
> > So I think it would be:
> > 
> > start = xe_vma_userptr()
> > end = xe_vma_userptr() + xe_vma_size()
> 
> You are correct. Will fix. I mixed this because, in system allocator, cpu address always == gpu address 
> > 
> > 
> > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > +	u64 npages;
> > > +	int ret;
> > > +
> > > +	userptr_vma = to_userptr_vma(vma);
> > > +	userptr = &userptr_vma->userptr;
> > > +	mmap_assert_locked(userptr->notifier.mm);
> > > +
> > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
> > 
> > This math is done above, if you need this math in next rev add a helper.
> 
> 
> Will do.
> > 
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > > +	if (unlikely(!pfns))
> > > +		return -ENOMEM;
> > > +
> > > +	if (write)
> > > +		flags |= HMM_PFN_REQ_WRITE;
> > > +
> > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > 
> > Why is this needed, can't we just set hmm_range->default_flags?
> 
> In this case, set default_flags also work.
> 
> This is some codes I inherited from Niranjana. Per my test before it works.
> 
> Basically there are two way to control the flags:
> Default is the coarse grain way applying to all pfns
> The way I am using here is a per pfn fine grained flag setting. It also works.
> 

I'd stick with default_flags unless we have reason not to.

> > 
> > > +	hmm_range->hmm_pfns = pfns;
> > > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr-
> > >notifier);
> > 
> > We need a userptr->notifier == userptr->notifier_seq check that just
> > returns, right?
> 
> Yes this sequence number is used to decide whether we need to retry. See function xe_pt_userptr_pre_commit

This check basically means the current sg is valid and short-circuits.
Maybe we don't need it? I don't think we ever call this function when
this condition is true. But if it is removed then we should have an
assert that userptr->notifier_seq (*current_seq in my function prototype
above) != mmu_interval_read_begin(). This ensures we are not calling this
function / mapping a sg table when we already have a valid one.
function / mapping a sg table when already have a valid one.

> > 
> > > +	hmm_range->notifier = &userptr->notifier;
> > > +	hmm_range->start = start;
> > > +	hmm_range->end = end;
> > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > HMM_PFN_REQ_WRITE;
> > 
> > Is this needed? AMD and Nouveau do not set this argument.
> 
> 
> As explained above. Amd and nouveau only use coarse grain setting
> 

My position is to use the coarse-grained setting until we have a reason not to.

> > 
> > > +	/**
> > > +	 * FIXME:
> > > +	 * Set the the dev_private_owner can prevent hmm_range_fault to fault
> > > +	 * in the device private pages owned by caller. See function
> > > +	 * hmm_vma_handle_pte. In multiple GPU case, this should be set to the
> > > +	 * device owner of the best migration destination. e.g., device0/vm0
> > > +	 * has a page fault, but we have determined the best placement of
> > > +	 * the fault address should be on device1, we should set below to
> > > +	 * device1 instead of device0.
> > > +	 */
> > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > +
> > > +	while (true) {
> > 
> > mmap_read_lock(mm);
> > 
> > > +		ret = hmm_range_fault(hmm_range);
> > 
> > mmap_read_unlock(mm);
> 
> 
> This need to be called in caller. The reason is, in the system allocator code path, mmap lock is already hold before calling into this helper.
> 
> I will check patch 5 for this purpose.
> 

This is missing in patch 5.

Also, per the hmm.rst doc and 2 of the 3 existing users of
hmm_range_fault in the kernel, this lock is taken / dropped directly
before / after hmm_range_fault. For the current use case (userptr)
taking this lock here certainly will work.

Again, saying the system allocator will hold this lock is IMO speculation
and shouldn't dictate what we do in this series. If we have to move the
lock once the system allocator lands, we can do that then.
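
i.e. something like this inside the helper (sketch; variable names follow
your patch, mm taken from the notifier):

	struct mm_struct *mm = userptr->notifier.mm;

	while (true) {
		hmm_range->notifier_seq =
			mmu_interval_read_begin(&userptr->notifier);

		mmap_read_lock(mm);
		ret = hmm_range_fault(hmm_range);
		mmap_read_unlock(mm);

		if (ret == -EBUSY) {
			if (time_after(jiffies, timeout))
				break;
			continue;
		}
		break;
	}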

> > 
> > > +		if (time_after(jiffies, timeout))
> > > +			break;
> > > +
> > > +		if (ret == -EBUSY)
> > > +			continue;
> > > +		break;
> > > +	}
> > > +
> > > +	if (ret)
> > > +		goto free_pfns;
> > > +
> > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > +	if (ret)
> > > +		goto free_pfns;
> > > +
> > > +	mark_range_accessed(hmm_range, write);
> > > +	userptr->sg = &userptr->sgt;
> > 
> > Again this should be set in caller after this function return.
> 
> why we can't set it here? It is shared b/t userptr and hmmptr
>

This is owned by the VMA userptr and should be set in that layer (set in
xe_vma_userptr_pin_pages). Setting this value means the SG is valid
(this maps to need_unmap in my function prototype example).

> 
> > 
> > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > 
> > This is could be a pass by reference I guess and set here.
> 
> Sorry, I don't understand this comment.
>

This is current_seq in my function prototype example. Does that make
sense?

Matt

> > 
> > > +
> > > +free_pfns:
> > > +	kvfree(pfns);
> > > +	return ret;
> > > +}
> > > +
> > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
> > > new file mode 100644
> > > index 000000000000..960f3f6d36ae
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > @@ -0,0 +1,12 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/types.h>
> > > +#include <linux/hmm.h>
> > > +#include "xe_vm_types.h"
> > > +#include "xe_svm.h"
> > 
> > As per the previous patches no need to xe_*.h files, just forward
> > declare any arguments.
> 
> 
> Will do.
> 
> Oak
> > 
> > Matt
> > 
> > > +
> > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > *hmm_range,
> > > +						bool write);
> > > --
> > > 2.26.3
> > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-16  1:25               ` Matthew Brost
@ 2024-03-18 10:16                 ` Hellstrom, Thomas
  2024-03-18 15:02                   ` Zeng, Oak
  2024-03-18 14:51                 ` Zeng, Oak
  1 sibling, 1 reply; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-18 10:16 UTC (permalink / raw)
  To: Brost, Matthew, Zeng, Oak
  Cc: intel-xe, Welty,  Brian, airlied, Ghimiray, Himal Prasad

On Sat, 2024-03-16 at 01:25 +0000, Matthew Brost wrote:
> On Fri, Mar 15, 2024 at 03:31:24PM -0600, Zeng, Oak wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Friday, March 15, 2024 4:40 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > > backing for
> > > GPU vram
> > > 
> > > On Fri, Mar 15, 2024 at 10:00:06AM -0600, Zeng, Oak wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: Thursday, March 14, 2024 4:49 PM
> > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > <himal.prasad.ghimiray@intel.com>
> > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > > > > backing
> > > for
> > > > > GPU vram
> > > > > 
> > > > > On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> > > > > > Hi Matt,
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > Sent: Thursday, March 14, 2024 1:18 PM
> > > > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty,
> > > > > > > Brian
> > > > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > > > <himal.prasad.ghimiray@intel.com>
> > > > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide
> > > > > > > memmap
> > > backing
> > > > > for
> > > > > > > GPU vram
> > > > > > > 
> > > > > > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > > > > > > Memory remap GPU vram using devm_memremap_pages, so
> > > > > > > > each
> > > GPU
> > > > > vram
> > > > > > > > page is backed by a struct page.
> > > > > > > > 
> > > > > > > > Those struct pages are created to allow hmm migrate
> > > > > > > > buffer b/t
> > > > > > > > GPU vram and CPU system memory using existing Linux
> > > > > > > > migration
> > > > > > > > mechanism (i.e., migrating b/t CPU system memory and
> > > > > > > > hard disk).
> > > > > > > > 
> > > > > > > > This is prepare work to enable svm (shared virtual
> > > > > > > > memory) through
> > > > > > > > Linux kernel hmm framework. The memory remap's page map
> > > > > > > > type is
> > > set
> > > > > > > > to MEMORY_DEVICE_PRIVATE for now. This means even
> > > > > > > > though each
> > > GPU
> > > > > > > > vram page get a struct page and can be mapped in CPU
> > > > > > > > page table,
> > > > > > > > but such pages are treated as GPU's private resource,
> > > > > > > > so CPU can't
> > > > > > > > access them. If CPU access such page, a page fault is
> > > > > > > > triggered
> > > > > > > > and page will be migrate to system memory.
> > > > > > > > 
> > > > > > > 
> > > > > > > Is this really true? We can map VRAM BOs to the CPU without
> > > > > > > having to migrate back and forth. Admittedly I don't know the
> > > > > > > inner workings of how this works but in IGTs we do this all the
> > > > > > > time.
> > > > > > > 
> > > > > > >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> > > > > > >   55                                 vram_if_possible(fd,
> > > > > > > 0),
> > > > > > >   56
> > > DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > > > > > >   57         batch_map = xe_bo_map(fd, batch_bo,
> > > > > > > batch_size);
> > > > > > > 
> > > > > > > The BO is created in VRAM and then mapped to the CPU.
> > > > > > > 
> > > > > > > I don't think there is an expectation of coherence rather
> > > > > > > caching mode
> > > > > > > and exclusive access of the memory based on
> > > > > > > synchronization.
> > > > > > > 
> > > > > > > e.g.
> > > > > > > User write BB/data via CPU to GPU memory
> > > > > > > User calls exec
> > > > > > > GPU read / write memory
> > > > > > > User wait on sync indicating exec done
> > > > > > > User reads result
> > > > > > > 
> > > > > > > All of this works without migration. Are we not planning on
> > > > > > > supporting this flow with SVM?
> > > > > > > 
> > > > > > > Afaik this migration dance really only needs to be done
> > > > > > > if the CPU and
> > > > > > > GPU are using atomics on a shared memory region and the
> > > > > > > GPU device
> > > > > > > doesn't support a coherent memory protocol (e.g. PVC).
> > > > > > 
> > > > > > All you said is true. On many of our HW, the CPU can actually
> > > > > > access device memory, cache coherently or not.
> > > > > > 
> > > > > > The problem is, this is not true for all GPU vendors. For example,
> > > > > > on some HW from some vendor, the CPU can only access part of device
> > > > > > memory. The so-called small bar concept.
> > > > > > 
> > > > > > So when HMM was defined, such factors were considered, and
> > > > > > MEMORY_DEVICE_PRIVATE was defined. With this definition, the CPU
> > > > > > can't access device memory.
> > > > > >
> > > > > > So you can think of it as a limitation of HMM.
> > > > > > 
> > > > > 
> > > > > Is it though? I see other types: MEMORY_DEVICE_FS_DAX,
> > > > > MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA. From my
> > > > > limited
> > > > > understanding it looks to me like one of those modes would
> > > > > support my
> > > > > example.
> > > > 
> > > > 
> > > > No, the above are for other purposes. HMM only supports
> > > > DEVICE_PRIVATE and DEVICE_COHERENT.
> > > > 
> > > > > 
> > > > > > Note this is only part 1 of our system allocator work. We do plan
> > > > > > to support DEVICE_COHERENT for our newer devices, see below. With
> > > > > > this option, we don't have unnecessary migration back and forth.
> > > > > > 
> > > > > > You can think of this as just working out all the code paths. 90%
> > > > > > of the driver code for DEVICE_PRIVATE and DEVICE_COHERENT will be
> > > > > > the same. Our real use of the system allocator will be
> > > > > > DEVICE_COHERENT mode, while DEVICE_PRIVATE mode allows us to
> > > > > > exercise the code on old HW.
> > > > > >
> > > > > > Make sense?
> > > > > > 
> > > > > 
> > > > > I guess if we want the system allocator to always be coherent,
> > > > > then yes you need this dynamic migration with faulting on either
> > > > > side.
> > > > > 
> > > > > I was thinking the system allocator would behave like my example
> > > > > above with madvise dictating the coherence rules.
> > > > > 
> > > > > Maybe I missed this in the system allocator design but my feeling
> > > > > is we shouldn't arbitrarily enforce coherence as that could lead to
> > > > > poor performance due to constant migration.
> > > > 
> > > > The system allocator itself doesn't enforce coherence. Coherence is
> > > > built into the user programming model. So the system allocator allows
> > > > both GPU and CPU to access system allocated pointers, but it doesn't
> > > > necessarily guarantee the data accessed from CPU/GPU is coherent. It is
> > > > the user program's responsibility to maintain data coherence.
> > > > 
> > > > Data migration in the driver is optional, depending on platform
> > > > capability, user preference, and correctness and performance
> > > > considerations. Driver internal data migration of course shouldn't
> > > > break data coherence.
> > > > 
> > > > Of course different vendor can have different data coherence
> > > > scheme. For
> > > example, it is completely designer's flexibility to build model
> > > that is HW automatic
> > > data coherence or software explicit data coherence. On our
> > > platform, we allow
> > > user program to select different coherence mode by setting
> > > pat_index for gpu
> > > and cpu_caching mode for CPU. So we have completely give the
> > > flexibility to user
> > > program. Nothing of this contract is changed in system allocator
> > > design.
> > > > 
> > > > Going back to the question of what memory type we should use to
> > > > register our
> > > vram to core mm. HMM currently support two types: PRIVATE and
> > > COHERENT.
> > > The coherent type requires some HW and BIOS support which we
> > > don't have
> > > right now. So the only available is PRIVATE. We have not other
> > > option right now.
> > > As said, we plan to support coherent type where we can avoid
> > > unnecessary data
> > > migration. But that is stage 2.
> > > > 
> > > 
> > > Thanks for the explanation. After reading your replies, the HMM
> > > doc, and looking at the code, this all makes sense.
> > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > For GPU device which supports coherent memory protocol
> > > > > > > > b/t CPU and
> > > > > > > > GPU (such as CXL and CAPI protocol), we can remap
> > > > > > > > device memory as
> > > > > > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > > > > > > 
> > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > > > > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > > > > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > > > > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > > > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > > > > > ++++++++++++++++++++++++++++
> > > > > > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > > > > > > index c531210695db..840467080e59 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > > > > > >  	xe_vram_freq.o \
> > > > > > > >  	xe_wait_user_fence.o \
> > > > > > > >  	xe_wa.o \
> > > > > > > > -	xe_wopcm.o
> > > > > > > > +	xe_wopcm.o \
> > > > > > > > +	xe_svm_devmem.o
> > > > > > > 
> > > > > > > These should be in alphabetical order.
> > > > > > 
> > > > > > Will fix
> > > > > > > 
> > > > > > > > 
> > > > > > > >  # graphics hardware monitoring (HWMON) support
> > > > > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > > > > > >  	resource_size_t actual_physical_size;
> > > > > > > >  	/** @mapping: pointer to VRAM mappable space
> > > > > > > > */
> > > > > > > >  	void __iomem *mapping;
> > > > > > > > +	/** @pagemap: Used to remap device memory as
> > > > > > > > ZONE_DEVICE
> > > */
> > > > > > > > +    struct dev_pagemap pagemap;
> > > > > > > > +    /**
> > > > > > > > +     * @hpa_base: base host physical address
> > > > > > > > +     *
> > > > > > > > +     * This is generated when remap device memory as
> > > > > > > > ZONE_DEVICE
> > > > > > > > +     */
> > > > > > > > +    resource_size_t hpa_base;
> > > > > > > 
> > > > > > > Weird indentation. This goes for the entire series, look
> > > > > > > at checkpatch.
> > > > > > 
> > > > > > Will fix
> > > > > > > 
> > > > > > > > +
> > > > > > > >  };
> > > > > > > > 
> > > > > > > >  /**
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > index e3db3a178760..0d795394bc4c 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > @@ -22,6 +22,7 @@
> > > > > > > >  #include "xe_module.h"
> > > > > > > >  #include "xe_sriov.h"
> > > > > > > >  #include "xe_tile.h"
> > > > > > > > +#include "xe_svm.h"
> > > > > > > > 
> > > > > > > >  #define
> > > > > > > > XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > > > > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > > > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct
> > > > > > > > xe_device *xe)
> > > > > > > >  		}
> > > > > > > > 
> > > > > > > >  		io_size -= min_t(u64, tile_size,
> > > > > > > > io_size);
> > > > > > > > +		xe_svm_devm_add(tile, &tile-
> > > > > > > > >mem.vram);
> > > > > > > 
> > > > > > > Do we want to do this probe for all devices with VRAM or
> > > > > > > only a subset?
> > > > > > 
> > > > > > All
> > > > > 
> > > > > Can you explain why?
> > > > 
> > > > It is natural for me to add all device memory to hmm. In hmm
> > > > design, device
> > > memory is used as a special swap out for system memory. I would
> > > ask why we
> > > only want to add a subset of vram? By a subset, do you mean only
> > > vram of one
> > > tile instead of all tiles?
> > > > 
> > > 
> > > I think we're talking about different things, my bad on the wording
> > > in the original question.
> > > 
> > > Let me ask again - should we be calling xe_svm_devm_add on all
> > > *platforms*
> > > that have VRAM. i.e. Should we do this on PVC but not DG2?
> > 
> > 
> > Oh, I see. Good question. On i915, this feature was only tested on
> > PVC. We don't have a plan to enable it on platforms older than PVC.
> >
> > Let me add a check here to only enable it on platforms newer than PVC.
> > 
> 
> Probably actually check 'xe->info.has_usm'.
> 
> We might want to rename field too and drop the 'usm' nomenclature but
> that can be done later.

Perhaps "has_recoverable_pagefaults" or some for of abbreviation.

Another question w.r.t. this is whether we should do this
unconditionally, even on platforms that support it. Adding a struct page
per VRAM page would potentially consume a significant amount of system
memory.
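
(For a rough sense of scale: with a 64-byte struct page and 4 KiB pages,
16 GiB of VRAM needs on the order of 256 MiB of system memory just for
the memmap.) If we do gate it, a minimal sketch of the call site could
look like the below; the has_usm flag follows Matt's suggestion above
and the exact field name is an assumption:

		io_size -= min_t(u64, tile_size, io_size);
		/* Only create struct page backing for VRAM on platforms
		 * where recoverable pagefaults (and thus SVM) are usable.
		 */
		if (xe->info.has_usm)
			xe_svm_devm_add(tile, &tile->mem.vram);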

/Thomas



> 
> Matt
> 
> > Oak 
> > 
> > > 
> > > Matt
> > > 
> > > > Oak
> > > > 
> > > > 
> > > > > 
> > > > > > > 
> > > > > > > >  	}
> > > > > > > > 
> > > > > > > >  	xe->mem.vram.actual_physical_size =
> > > > > > > > total_size;
> > > > > > > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct
> > > > > > > > xe_device
> > > *xe)
> > > > > > > >  static void mmio_fini(struct drm_device *drm, void
> > > > > > > > *arg)
> > > > > > > >  {
> > > > > > > >  	struct xe_device *xe = arg;
> > > > > > > > +    struct xe_tile *tile;
> > > > > > > > +    u8 id;
> > > > > > > > 
> > > > > > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe-
> > > > > > > > >mmio.regs);
> > > > > > > >  	if (xe->mem.vram.mapping)
> > > > > > > >  		iounmap(xe->mem.vram.mapping);
> > > > > > > > +
> > > > > > > > +	for_each_tile(tile, xe, id)
> > > > > > > > +		xe_svm_devm_remove(xe, &tile-
> > > > > > > > >mem.vram);
> > > > > > > 
> > > > > > > This should probably be above existing code. Typical on
> > > > > > > fini to do
> > > > > > > things in reverse order from init.
> > > > > > 
> > > > > > Will fix
> > > > > > > 
> > > > > > > > +
> > > > > > > >  }
> > > > > > > > 
> > > > > > > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..09f9afb0e7d4
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > @@ -0,0 +1,14 @@
> > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > +/*
> > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > > 
> > > > > > > 2024?
> > > > > > 
> > > > > > This patch was actually written 2023
> > > > > > > 
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#ifndef __XE_SVM_H
> > > > > > > > +#define __XE_SVM_H
> > > > > > > > +
> > > > > > > > +#include "xe_device_types.h"
> > > > > > > 
> > > > > > > I don't think you need to include this. Rather just
> > > > > > > forward decl structs
> > > > > > > used here.
> > > > > > 
> > > > > > Will fix
> > > > > > > 
> > > > > > > e.g.
> > > > > > > 
> > > > > > > struct xe_device;
> > > > > > > struct xe_mem_region;
> > > > > > > struct xe_tile;
> > > > > > > 
> > > > > > > > +
> > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct
> > > > > > > > xe_mem_region
> > > *mem);
> > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > > xe_mem_region
> > > > > > > *mem);
> > > > > > > 
> > > > > > > The arguments here are incongruent here. Typically we
> > > > > > > want these to
> > > > > > > match.
> > > > > > 
> > > > > > Will fix
> > > > > > > 
> > > > > > > > +
> > > > > > > > +#endif
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > 
> > > > > > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> > > > > > 
> > > > > > Did you mean mem vs mr? if yes, will fix
> > > > > > 
> > > > > > Again these two
> > > > > > > should
> > > > > > > match.
> > > > > > > 
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..63b7a1961cc6
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > @@ -0,0 +1,91 @@
> > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > +/*
> > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > > 
> > > > > > > 2024?
> > > > > > It is from 2023
> > > > > > > 
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#include <linux/mm_types.h>
> > > > > > > > +#include <linux/sched/mm.h>
> > > > > > > > +
> > > > > > > > +#include "xe_device_types.h"
> > > > > > > > +#include "xe_trace.h"
> > > > > > > 
> > > > > > > xe_trace.h appears to be unused.
> > > > > > 
> > > > > > Will fix
> > > > > > > 
> > > > > > > > +#include "xe_svm.h"
> > > > > > > > +
> > > > > > > > +
> > > > > > > > +static vm_fault_t xe_devm_migrate_to_ram(struct
> > > > > > > > vm_fault *vmf)
> > > > > > > > +{
> > > > > > > > +	return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void xe_devm_page_free(struct page *page)
> > > > > > > > +{
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static const struct dev_pagemap_ops
> > > > > > > > xe_devm_pagemap_ops = {
> > > > > > > > +	.page_free = xe_devm_page_free,
> > > > > > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > > > > > +};
> > > > > > > > +
> > > > > > > 
> > > > > > > Assume these are placeholders that will be populated
> > > > > > > later?
> > > > > > 
> > > > > > 
> > > > > > Correct.
> > > > > > > 
> > > > > > > > +/**
> > > > > > > > + * xe_svm_devm_add: Remap and provide memmap backing
> > > > > > > > for
> > > device
> > > > > > > memory
> > > > > > > > + * @tile: tile that the memory region blongs to
> > > > > > > > + * @mr: memory region to remap
> > > > > > > > + *
> > > > > > > > + * This remap device memory to host physical address
> > > > > > > > space and create
> > > > > > > > + * struct page to back device memory
> > > > > > > > + *
> > > > > > > > + * Return: 0 on success standard error code otherwise
> > > > > > > > + */
> > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct
> > > > > > > > xe_mem_region *mr)
> > > > > > > 
> > > > > > > Here I see the xe_mem_region is from tile->mem.vram,
> > > > > > > wondering rather
> > > > > > > than using the tile->mem.vram we should use xe->mem.vram
> > > > > > > when
> > > enabling
> > > > > > > svm? Isn't the idea behind svm the entire memory is 1
> > > > > > > view?
> > > > > > 
> > > > > > Still need to use tile. The reason is, memory of different
> > > > > > tile can have
> > > different
> > > > > characteristics, such as latency. So we want to differentiate
> > > > > memory b/t tiles
> > > also
> > > > > in svm. I need to change below " mr->pagemap.owner = tile-
> > > > > >xe->drm.dev ".
> > > the
> > > > > owner should also be tile. This is the way hmm differentiate
> > > > > memory of
> > > different
> > > > > tile.
> > > > > > 
> > > > > > With svm it is 1 view, from virtual address space
> > > > > > perspective and from
> > > physical
> > > > > struct page perspective. You can think of all the tile's vram
> > > > > is stacked together
> > > to
> > > > > form a unified view together with system memory. This doesn't
> > > > > prohibit us
> > > from
> > > > > differentiate memory from different tile. This
> > > > > differentiation allow us to
> > > optimize
> > > > > performance, i.e., we can wisely place memory in specific
> > > > > tile. If we don't
> > > > > differentiate, this is not possible.
> > > > > > 
> > > > > 
> > > > > Ok makes sense.
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > > 
> > > > > > > I suppose if we do that we also only use 1 TTM VRAM
> > > > > > > manager / buddy
> > > > > > > allocator too. I thought I saw some patches flying around
> > > > > > > for that too.
> > > > > > 
> > > > > > Ttm vram manager is not in the picture. We deliberately
> > > > > > avoided it per
> > > previous
> > > > > discussion
> > > > > > 
> > > > > > Yes same buddy allocator. It is in my previous POC:
> > > https://lore.kernel.org/dri-
> > > > > devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't
> > > > > put those
> > > patches
> > > > > in this series because I want to merge this small patches
> > > > > separately.
> > > > > > > 
> > > > > > > > +{
> > > > > > > > +	struct device *dev = &to_pci_dev(tile->xe-
> > > > > > > > >drm.dev)->dev;
> > > > > > > > +	struct resource *res;
> > > > > > > > +	void *addr;
> > > > > > > > +	int ret;
> > > > > > > > +
> > > > > > > > +	res = devm_request_free_mem_region(dev,
> > > > > > > > &iomem_resource,
> > > > > > > > +					   mr-
> > > > > > > > >usable_size);
> > > > > > > > +	if (IS_ERR(res)) {
> > > > > > > > +		ret = PTR_ERR(res);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > > > > > +	mr->pagemap.range.start = res->start;
> > > > > > > > +	mr->pagemap.range.end = res->end;
> > > > > > > > +	mr->pagemap.nr_range = 1;
> > > > > > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > > > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > > > > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > > > > > > +	if (IS_ERR(addr)) {
> > > > > > > > +		devm_release_mem_region(dev, res-
> > > > > > > > >start,
> > > resource_size(res));
> > > > > > > > +		ret = PTR_ERR(addr);
> > > > > > > > +		drm_err(&tile->xe->drm, "Failed to
> > > > > > > > remap tile %d
> > > memory,
> > > > > > > errno %d\n",
> > > > > > > > +				tile->id, ret);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > > > +	mr->hpa_base = res->start;
> > > > > > > > +
> > > > > > > > +	drm_info(&tile->xe->drm, "Added tile %d memory
> > > > > > > > [%llx-%llx] to
> > > devm,
> > > > > > > remapped to %pr\n",
> > > > > > > > +			tile->id, mr->io_start, mr-
> > > > > > > > >io_start + mr-
> > > > usable_size,
> > > > > > > res);
> > > > > > > > +	return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * xe_svm_devm_remove: Unmap device memory and free
> > > > > > > > resources
> > > > > > > > + * @xe: xe device
> > > > > > > > + * @mr: memory region to remove
> > > > > > > > + */
> > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > > xe_mem_region
> > > > > > > *mr)
> > > > > > > > +{
> > > > > > > > +	/*FIXME: below cause a kernel hange during
> > > > > > > > moduel remove*/
> > > > > > > > +#if 0
> > > > > > > > +	struct device *dev = &to_pci_dev(xe->drm.dev)-
> > > > > > > > >dev;
> > > > > > > > +
> > > > > > > > +	if (mr->hpa_base) {
> > > > > > > > +		devm_memunmap_pages(dev, &mr-
> > > > > > > > >pagemap);
> > > > > > > > +		devm_release_mem_region(dev, mr-
> > > > pagemap.range.start,
> > > > > > > > +			mr->pagemap.range.end - mr-
> > > > pagemap.range.start +1);
> > > > > > > > +	}
> > > > > > > > +#endif
> > > > > > > 
> > > > > > > This would need to be fixed too.
> > > > > > 
> > > > > > 
> > > > > > Yes...
> > > > > > 
> > > > > > Oak
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > --
> > > > > > > > 2.26.3
> > > > > > > > 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-14  3:35 ` [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr Oak Zeng
  2024-03-14 20:25   ` Matthew Brost
@ 2024-03-18 11:53   ` Hellstrom, Thomas
  2024-03-18 19:50     ` Zeng, Oak
  2024-03-18 13:12   ` Hellstrom, Thomas
  2 siblings, 1 reply; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-18 11:53 UTC (permalink / raw)
  To: intel-xe, Zeng,  Oak
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad

Hi, Oak.


On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> Add a helper function xe_hmm_populate_range to populate
> a userptr or hmmptr range. This function calls hmm_range_fault
> to read CPU page tables and populate all pfns/pages of this
> virtual address range.
> 
> If the populated page is a system memory page, dma-mapping is performed
> to get a dma-address which can be used later for GPU to access pages.
> 
> If the populated page is device private page, we calculate the dpa (
> device physical address) of the page.
> 
> The dma-address or dpa is then saved in userptr's sg table. This is
> preparation work to replace the get_user_pages_fast code in the userptr code
> path. The helper function will also be used to populate hmmptr later.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile |   3 +-
>  drivers/gpu/drm/xe/xe_hmm.c | 213
> ++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
>  3 files changed, 227 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
>  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h

I mostly agree with Matt's review comments on this patch. Some
additional comments below.

> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index 840467080e59..29dcbc938b01 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
>  	xe_wait_user_fence.o \
>  	xe_wa.o \
>  	xe_wopcm.o \
> -	xe_svm_devmem.o
> +	xe_svm_devmem.o \
> +	xe_hmm.o
>  
>  # graphics hardware monitoring (HWMON) support
>  xe-$(CONFIG_HWMON) += xe_hwmon.o
> diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> b/drivers/gpu/drm/xe/xe_hmm.c
> new file mode 100644
> index 000000000000..c45c2447d386
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hmm.c
> @@ -0,0 +1,213 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/memremap.h>
> +#include <linux/swap.h>
> +#include <linux/mm.h>
> +#include "xe_hmm.h"
> +#include "xe_vm.h"
> +
> +/**
> + * mark_range_accessed() - mark a range is accessed, so core mm
> + * have such information for memory eviction or write back to
> + * hard disk
> + *
> + * @range: the range to mark
> + * @write: if write to this range, we mark pages in this range
> + * as dirty
> + */
> +static void mark_range_accessed(struct hmm_range *range, bool write)

Some of the static function names aren't really unique enough to avoid
clashing with a future core function of the same name. Please consider
using an xe_ prefix in such cases. It will also make backtraces easier
to follow.
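
E.g. (illustrative names only, matching the signatures in this patch):

static void xe_mark_range_accessed(struct hmm_range *range, bool write);
static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
		       struct sg_table *st, bool write);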


> +{
> +	struct page *page;
> +	u64 i, npages;
> +
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> +	for (i = 0; i < npages; i++) {
> +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> +		if (write) {
> +			lock_page(page);
> +			set_page_dirty(page);
> +			unlock_page(page);

Could be using set_page_dirty_lock() here.
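
I.e., a sketch of the loop body using it:

		page = hmm_pfn_to_page(range->hmm_pfns[i]);
		if (write)
			set_page_dirty_lock(page);
		mark_page_accessed(page);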

/Thomas


> +		}
> +		mark_page_accessed(page);
> +	}
> +}
> +
> +/**
> + * build_sg() - build a scatter gather table for all the physical
> pages/pfn
> + * in a hmm_range. dma-address is save in sg table and will be used
> to program
> + * GPU page table later.
> + *
> + * @xe: the xe device who will access the dma-address in sg table
> + * @range: the hmm range that we build the sg table from. range-
> >hmm_pfns[]
> + * has the pfn numbers of pages that back up this hmm address range.
> + * @st: pointer to the sg table.
> + * @write: whether we write to this range. This decides dma map
> direction
> + * for system pages. If write we map it bi-diretional; otherwise
> + * DMA_TO_DEVICE
> + *
> + * All the contiguous pfns will be collapsed into one entry in
> + * the scatter gather table. This is for the convenience of
> + * later on operations to bind address range to GPU page table.
> + *
> + * The dma_address in the sg table will later be used by GPU to
> + * access memory. So if the memory is system memory, we need to
> + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> + * is GPU local memory (of the GPU who is going to access memory),
> + * we need gpu dpa (device physical address), and there is no need
> + * of dma-mapping.
> + *
> + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> + * memory. Add this when you support p2p
> + *
> + * This function allocates the storage of the sg table. It is
> + * caller's responsibility to free it calling sg_free_table.
> + *
> + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> + */
> +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> +			     struct sg_table *st, bool write)
> +{
> +	struct device *dev = xe->drm.dev;
> +	struct scatterlist *sg;
> +	u64 i, npages;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> +
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		struct page *page;
> +		unsigned long addr;
> +		struct xe_mem_region *mr;
> +
> +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> +		if (is_device_private_page(page)) {
> +			mr = page_to_mem_region(page);
> +			addr = vram_pfn_to_dpa(mr, range-
> >hmm_pfns[i]);
> +		} else {
> +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> +					write ? DMA_BIDIRECTIONAL :
> DMA_TO_DEVICE);
> +		}
> +
> +		if (sg && (addr == (sg_dma_address(sg) + sg-
> >length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg =  sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}
> +
> +/**
> + * xe_hmm_populate_range() - Populate physical pages of a virtual
> + * address range
> + *
> + * @vma: vma has information of the range to populate. only vma
> + * of userptr and hmmptr type can be populated.
> + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> + * will hold the populated pfns.
> + * @write: populate pages with write permission
> + *
> + * This function populate the physical pages of a virtual
> + * address range. The populated physical pages is saved in
> + * userptr's sg table. It is similar to get_user_pages but call
> + * hmm_range_fault.
> + *
> + * This function also read mmu notifier sequence # (
> + * mmu_interval_read_begin), for the purpose of later
> + * comparison (through mmu_interval_read_retry).
> + *
> + * This must be called with mmap read or write lock held.
> + *
> + * This function allocates the storage of the userptr sg table.
> + * It is caller's responsibility to free it calling sg_free_table.
> + *
> + * returns: 0 for succuss; negative error no on failure
> + */
> +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> +						bool write)
> +{
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> +	struct xe_userptr_vma *userptr_vma;
> +	struct xe_userptr *userptr;
> +	u64 start = vma->gpuva.va.addr;
> +	u64 end = start + vma->gpuva.va.range;
> +	struct xe_vm *vm = xe_vma_vm(vma);
> +	u64 npages;
> +	int ret;
> +
> +	userptr_vma = to_userptr_vma(vma);
> +	userptr = &userptr_vma->userptr;
> +	mmap_assert_locked(userptr->notifier.mm);
> +
> +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) +
> 1;
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (unlikely(!pfns))
> +		return -ENOMEM;
> +
> +	if (write)
> +		flags |= HMM_PFN_REQ_WRITE;
> +
> +	memset64((u64 *)pfns, (u64)flags, npages);
> +	hmm_range->hmm_pfns = pfns;
> +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr-
> >notifier);
> +	hmm_range->notifier = &userptr->notifier;
> +	hmm_range->start = start;
> +	hmm_range->end = end;
> +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> HMM_PFN_REQ_WRITE;
> +	/**
> +	 * FIXME:
> +	 * Set the the dev_private_owner can prevent hmm_range_fault
> to fault
> +	 * in the device private pages owned by caller. See function
> +	 * hmm_vma_handle_pte. In multiple GPU case, this should be
> set to the
> +	 * device owner of the best migration destination. e.g.,
> device0/vm0
> +	 * has a page fault, but we have determined the best
> placement of
> +	 * the fault address should be on device1, we should set
> below to
> +	 * device1 instead of device0.
> +	 */
> +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> +
> +	while (true) {
> +		ret = hmm_range_fault(hmm_range);
> +		if (time_after(jiffies, timeout))
> +			break;
> +
> +		if (ret == -EBUSY)
> +			continue;
> +		break;
> +	}
> +
> +	if (ret)
> +		goto free_pfns;
> +
> +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> +	if (ret)
> +		goto free_pfns;
> +
> +	mark_range_accessed(hmm_range, write);
> +	userptr->sg = &userptr->sgt;
> +	userptr->notifier_seq = hmm_range->notifier_seq;
> +
> +free_pfns:
> +	kvfree(pfns);
> +	return ret;
> +}
> +
> diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> b/drivers/gpu/drm/xe/xe_hmm.h
> new file mode 100644
> index 000000000000..960f3f6d36ae
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hmm.h
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/types.h>
> +#include <linux/hmm.h>
> +#include "xe_vm_types.h"
> +#include "xe_svm.h"
> +
> +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> +						bool write);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
  2024-03-14 17:39   ` Matthew Brost
  2024-03-15 17:29     ` Zeng, Oak
@ 2024-03-18 12:09     ` Hellstrom, Thomas
  2024-03-18 19:27       ` Zeng, Oak
  1 sibling, 1 reply; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-18 12:09 UTC (permalink / raw)
  To: Brost, Matthew, Zeng, Oak
  Cc: intel-xe, Welty,  Brian, airlied, Ghimiray, Himal Prasad

On Thu, 2024-03-14 at 17:39 +0000, Matthew Brost wrote:
> On Wed, Mar 13, 2024 at 11:35:51PM -0400, Oak Zeng wrote:
> > Since we now create struct page backing for each vram page,
> > each vram page now also has a pfn, just like system memory.
> > > This allows us to calculate the device physical address from the pfn.

Please use imperative language according to the patch guidelines:
something like "Add a" or "Introduce a".

> > 
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > index bbea40b57e84..bf349321f037 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -576,4 +576,12 @@ static inline struct xe_tile
> > *mem_region_to_tile(struct xe_mem_region *mr)
> >  	return container_of(mr, struct xe_tile, mem.vram);
> >  }
> >  
> > +static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64
> > pfn)

Static inline header functions also need kerneldoc unless there are
strong reasons not to.
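
Also, since an assert and a rename are asked for in the quoted comments
below, here is a sketch of what the helper could look like with both;
the use of xe_assert() and mem_region_to_tile() here is an assumption
about the available plumbing:

static inline u64 xe_mem_region_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
{
	u64 offset;

	/* A pfn below the remapped base would make the offset wrap. */
	xe_assert(mem_region_to_tile(mr)->xe, (pfn << PAGE_SHIFT) >= mr->hpa_base);

	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;

	return mr->dpa_base + offset;
}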

/Thomas



> > +{
> > +	u64 dpa;
> > +	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> 
> Can't this be negative? 
> 
> e.g. if pfn == 0, offset == -mr->hpa_base.
> 
> Or is the assumption (pfn << PAGE_SHIFT) is always > mr->hpa_base?
> 
> If so, can we add an assert or something to ensure we are using this
> function correctly?
> 
> > +	dpa = mr->dpa_base + offset;
> > +	return dpa;
> > +}
> 
> Same as previous patch, should be *.h not a *_types.h file.
> 
> Also, as this is xe_mem_region, not explicitly vram. Maybe:
> 
> s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa/
> 
> Matt
> 
> > +
> >  #endif
> > -- 
> > 2.26.3
> > 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-14  3:35 ` [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr Oak Zeng
  2024-03-14 20:25   ` Matthew Brost
  2024-03-18 11:53   ` Hellstrom, Thomas
@ 2024-03-18 13:12   ` Hellstrom, Thomas
  2024-03-18 14:49     ` Zeng, Oak
  2 siblings, 1 reply; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-18 13:12 UTC (permalink / raw)
  To: intel-xe, Zeng,  Oak
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad

Hi, Oak,

Found another thing, see below:

On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> Add a helper function xe_hmm_populate_range to populate
> a userptr or hmmptr range. This function calls hmm_range_fault
> to read CPU page tables and populate all pfns/pages of this
> virtual address range.
> 
> If the populated page is a system memory page, dma-mapping is performed
> to get a dma-address which can be used later for GPU to access pages.
> 
> If the populated page is device private page, we calculate the dpa (
> device physical address) of the page.
> 
> The dma-address or dpa is then saved in userptr's sg table. This is
> preparation work to replace the get_user_pages_fast code in the userptr code
> path. The helper function will also be used to populate hmmptr later.
> 
> Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> Cc: Brian Welty <brian.welty@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile |   3 +-
>  drivers/gpu/drm/xe/xe_hmm.c | 213
> ++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
>  3 files changed, 227 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
>  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index 840467080e59..29dcbc938b01 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
>  	xe_wait_user_fence.o \
>  	xe_wa.o \
>  	xe_wopcm.o \
> -	xe_svm_devmem.o
> +	xe_svm_devmem.o \
> +	xe_hmm.o
>  
>  # graphics hardware monitoring (HWMON) support
>  xe-$(CONFIG_HWMON) += xe_hwmon.o
> diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> b/drivers/gpu/drm/xe/xe_hmm.c
> new file mode 100644
> index 000000000000..c45c2447d386
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hmm.c
> @@ -0,0 +1,213 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/memremap.h>
> +#include <linux/swap.h>
> +#include <linux/mm.h>
> +#include "xe_hmm.h"
> +#include "xe_vm.h"
> +
> +/**
> + * mark_range_accessed() - mark a range is accessed, so core mm
> + * have such information for memory eviction or write back to
> + * hard disk
> + *
> + * @range: the range to mark
> + * @write: if write to this range, we mark pages in this range
> + * as dirty
> + */
> +static void mark_range_accessed(struct hmm_range *range, bool write)
> +{
> +	struct page *page;
> +	u64 i, npages;
> +
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> +	for (i = 0; i < npages; i++) {
> +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> +		if (write) {
> +			lock_page(page);
> +			set_page_dirty(page);
> +			unlock_page(page);
> +		}
> +		mark_page_accessed(page);
> +	}
> +}
> +
> +/**
> + * build_sg() - build a scatter gather table for all the physical
> pages/pfn
> + * in a hmm_range. dma-address is save in sg table and will be used
> to program
> + * GPU page table later.
> + *
> + * @xe: the xe device who will access the dma-address in sg table
> + * @range: the hmm range that we build the sg table from. range-
> >hmm_pfns[]
> + * has the pfn numbers of pages that back up this hmm address range.
> + * @st: pointer to the sg table.
> + * @write: whether we write to this range. This decides dma map
> direction
> + * for system pages. If write we map it bi-diretional; otherwise
> + * DMA_TO_DEVICE
> + *
> + * All the contiguous pfns will be collapsed into one entry in
> + * the scatter gather table. This is for the convenience of
> + * later on operations to bind address range to GPU page table.
> + *
> + * The dma_address in the sg table will later be used by GPU to
> + * access memory. So if the memory is system memory, we need to
> + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> + * is GPU local memory (of the GPU who is going to access memory),
> + * we need gpu dpa (device physical address), and there is no need
> + * of dma-mapping.
> + *
> + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> + * memory. Add this when you support p2p
> + *
> + * This function allocates the storage of the sg table. It is
> + * caller's responsibility to free it calling sg_free_table.
> + *
> + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> + */
> +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> +			     struct sg_table *st, bool write)
> +{
> +	struct device *dev = xe->drm.dev;
> +	struct scatterlist *sg;
> +	u64 i, npages;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> +
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		struct page *page;
> +		unsigned long addr;
> +		struct xe_mem_region *mr;
> +
> +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> +		if (is_device_private_page(page)) {
> +			mr = page_to_mem_region(page);
> +			addr = vram_pfn_to_dpa(mr, range-
> >hmm_pfns[i]);
> +		} else {
> +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> +					write ? DMA_BIDIRECTIONAL :
> DMA_TO_DEVICE);
> +		}
> +
> +		if (sg && (addr == (sg_dma_address(sg) + sg-
> >length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg =  sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}
> +
> +/**
> + * xe_hmm_populate_range() - Populate physical pages of a virtual
> + * address range
> + *
> + * @vma: vma has information of the range to populate. only vma
> + * of userptr and hmmptr type can be populated.
> + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> + * will hold the populated pfns.
> + * @write: populate pages with write permission
> + *
> + * This function populate the physical pages of a virtual
> + * address range. The populated physical pages is saved in
> + * userptr's sg table. It is similar to get_user_pages but call
> + * hmm_range_fault.
> + *
> + * This function also read mmu notifier sequence # (
> + * mmu_interval_read_begin), for the purpose of later
> + * comparison (through mmu_interval_read_retry).
> + *
> + * This must be called with mmap read or write lock held.
> + *
> + * This function allocates the storage of the userptr sg table.
> + * It is caller's responsibility to free it calling sg_free_table.
> + *
> + * returns: 0 for succuss; negative error no on failure
> + */
> +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> +						bool write)
> +{
> +	unsigned long timeout =
> +		jiffies +
> msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> +	struct xe_userptr_vma *userptr_vma;
> +	struct xe_userptr *userptr;
> +	u64 start = vma->gpuva.va.addr;
> +	u64 end = start + vma->gpuva.va.range;
> +	struct xe_vm *vm = xe_vma_vm(vma);
> +	u64 npages;
> +	int ret;
> +
> +	userptr_vma = to_userptr_vma(vma);
> +	userptr = &userptr_vma->userptr;
> +	mmap_assert_locked(userptr->notifier.mm);
> +
> +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) +
> 1;
> +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> +	if (unlikely(!pfns))
> +		return -ENOMEM;
> +
> +	if (write)
> +		flags |= HMM_PFN_REQ_WRITE;
> +
> +	memset64((u64 *)pfns, (u64)flags, npages);
> +	hmm_range->hmm_pfns = pfns;
> +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr-
> >notifier);
> +	hmm_range->notifier = &userptr->notifier;
> +	hmm_range->start = start;
> +	hmm_range->end = end;
> +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> HMM_PFN_REQ_WRITE;
> +	/**
> +	 * FIXME:
> +	 * Set the the dev_private_owner can prevent hmm_range_fault
> to fault
> +	 * in the device private pages owned by caller. See function
> +	 * hmm_vma_handle_pte. In multiple GPU case, this should be
> set to the
> +	 * device owner of the best migration destination. e.g.,
> device0/vm0
> +	 * has a page fault, but we have determined the best
> placement of
> +	 * the fault address should be on device1, we should set
> below to
> +	 * device1 instead of device0.
> +	 */
> +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> +
> +	while (true) {
> +		ret = hmm_range_fault(hmm_range);
> +		if (time_after(jiffies, timeout))
> +			break;
> +
> +		if (ret == -EBUSY)
> +			continue;

If (ret == -EBUSY) it looks from the hmm_range_fault() implementation
like hmm_range->notifier_seq has become invalid and without calling 
mmu_interval_read_begin() again, we will end up in an infinite loop?
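
For reference, the usage pattern documented for hmm_range_fault() in
Documentation/mm/hmm.rst re-reads the sequence number on every retry,
roughly like the sketch below (locking elided, variable names taken
from this patch):

	while (true) {
		hmm_range->notifier_seq =
			mmu_interval_read_begin(&userptr->notifier);
		ret = hmm_range_fault(hmm_range);
		if (ret == -EBUSY && !time_after(jiffies, timeout))
			continue;
		break;
	}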

/Thomas



> +		break;
> +	}
> +
> +	if (ret)
> +		goto free_pfns;
> +
> +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> +	if (ret)
> +		goto free_pfns;
> +
> +	mark_range_accessed(hmm_range, write);
> +	userptr->sg = &userptr->sgt;
> +	userptr->notifier_seq = hmm_range->notifier_seq;
> +
> +free_pfns:
> +	kvfree(pfns);
> +	return ret;
> +}
> +
> diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> b/drivers/gpu/drm/xe/xe_hmm.h
> new file mode 100644
> index 000000000000..960f3f6d36ae
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hmm.h
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/types.h>
> +#include <linux/hmm.h>
> +#include "xe_vm_types.h"
> +#include "xe_svm.h"
> +
> +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> +						bool write);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-18 13:12   ` Hellstrom, Thomas
@ 2024-03-18 14:49     ` Zeng, Oak
  2024-03-18 15:40       ` Hellstrom, Thomas
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-18 14:49 UTC (permalink / raw)
  To: Hellstrom, Thomas, intel-xe
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Monday, March 18, 2024 9:13 AM
> To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> Hi, Oak,
> 
> Found another thing, see below:
> 
> On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > Add a helper function xe_hmm_populate_range to populate
> > a userptr or hmmptr range. This function calls hmm_range_fault
> > to read CPU page tables and populate all pfns/pages of this
> > virtual address range.
> >
> > If the populated page is a system memory page, dma-mapping is performed
> > to get a dma-address which can be used later for GPU to access pages.
> >
> > If the populated page is device private page, we calculate the dpa (
> > device physical address) of the page.
> >
> > The dma-address or dpa is then saved in userptr's sg table. This is
> > preparation work to replace the get_user_pages_fast code in the userptr code
> > path. The helper function will also be used to populate hmmptr later.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile |   3 +-
> >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > ++++++++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> >  3 files changed, 227 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index 840467080e59..29dcbc938b01 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> >  	xe_wait_user_fence.o \
> >  	xe_wa.o \
> >  	xe_wopcm.o \
> > -	xe_svm_devmem.o
> > +	xe_svm_devmem.o \
> > +	xe_hmm.o
> >
> >  # graphics hardware monitoring (HWMON) support
> >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > b/drivers/gpu/drm/xe/xe_hmm.c
> > new file mode 100644
> > index 000000000000..c45c2447d386
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > @@ -0,0 +1,213 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/memremap.h>
> > +#include <linux/swap.h>
> > +#include <linux/mm.h>
> > +#include "xe_hmm.h"
> > +#include "xe_vm.h"
> > +
> > +/**
> > + * mark_range_accessed() - mark a range is accessed, so core mm
> > + * have such information for memory eviction or write back to
> > + * hard disk
> > + *
> > + * @range: the range to mark
> > + * @write: if write to this range, we mark pages in this range
> > + * as dirty
> > + */
> > +static void mark_range_accessed(struct hmm_range *range, bool write)
> > +{
> > +	struct page *page;
> > +	u64 i, npages;
> > +
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> > PAGE_SHIFT) + 1;
> > +	for (i = 0; i < npages; i++) {
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (write) {
> > +			lock_page(page);
> > +			set_page_dirty(page);
> > +			unlock_page(page);
> > +		}
> > +		mark_page_accessed(page);
> > +	}
> > +}
> > +
> > +/**
> > + * build_sg() - build a scatter gather table for all the physical
> > pages/pfn
> > + * in a hmm_range. dma-address is save in sg table and will be used
> > to program
> > + * GPU page table later.
> > + *
> > + * @xe: the xe device who will access the dma-address in sg table
> > + * @range: the hmm range that we build the sg table from. range-
> > >hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + * @write: whether we write to this range. This decides dma map
> > direction
> > + * for system pages. If write we map it bi-diretional; otherwise
> > + * DMA_TO_DEVICE
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * The dma_address in the sg table will later be used by GPU to
> > + * access memory. So if the memory is system memory, we need to
> > + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> > + * is GPU local memory (of the GPU who is going to access memory),
> > + * we need gpu dpa (device physical address), and there is no need
> > + * of dma-mapping.
> > + *
> > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > + * memory. Add this when you support p2p
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> > +			     struct sg_table *st, bool write)
> > +{
> > +	struct device *dev = xe->drm.dev;
> > +	struct scatterlist *sg;
> > +	u64 i, npages;
> > +
> > +	sg = NULL;
> > +	st->nents = 0;
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> > PAGE_SHIFT) + 1;
> > +
> > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		struct page *page;
> > +		unsigned long addr;
> > +		struct xe_mem_region *mr;
> > +
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (is_device_private_page(page)) {
> > +			mr = page_to_mem_region(page);
> > +			addr = vram_pfn_to_dpa(mr, range-
> > >hmm_pfns[i]);
> > +		} else {
> > +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > +					write ? DMA_BIDIRECTIONAL :
> > DMA_TO_DEVICE);
> > +		}
> > +
> > +		if (sg && (addr == (sg_dma_address(sg) + sg-
> > >length))) {
> > +			sg->length += PAGE_SIZE;
> > +			sg_dma_len(sg) += PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		sg =  sg ? sg_next(sg) : st->sgl;
> > +		sg_dma_address(sg) = addr;
> > +		sg_dma_len(sg) = PAGE_SIZE;
> > +		sg->length = PAGE_SIZE;
> > +		st->nents++;
> > +	}
> > +
> > +	sg_mark_end(sg);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * xe_hmm_populate_range() - Populate physical pages of a virtual
> > + * address range
> > + *
> > + * @vma: vma has information of the range to populate. only vma
> > + * of userptr and hmmptr type can be populated.
> > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > + * will hold the populated pfns.
> > + * @write: populate pages with write permission
> > + *
> > + * This function populate the physical pages of a virtual
> > + * address range. The populated physical pages is saved in
> > + * userptr's sg table. It is similar to get_user_pages but call
> > + * hmm_range_fault.
> > + *
> > + * This function also read mmu notifier sequence # (
> > + * mmu_interval_read_begin), for the purpose of later
> > + * comparison (through mmu_interval_read_retry).
> > + *
> > + * This must be called with mmap read or write lock held.
> > + *
> > + * This function allocates the storage of the userptr sg table.
> > + * It is caller's responsibility to free it calling sg_free_table.
> > + *
> > + * returns: 0 for succuss; negative error no on failure
> > + */
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > *hmm_range,
> > +						bool write)
> > +{
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > +	struct xe_userptr_vma *userptr_vma;
> > +	struct xe_userptr *userptr;
> > +	u64 start = vma->gpuva.va.addr;
> > +	u64 end = start + vma->gpuva.va.range;
> > +	struct xe_vm *vm = xe_vma_vm(vma);
> > +	u64 npages;
> > +	int ret;
> > +
> > +	userptr_vma = to_userptr_vma(vma);
> > +	userptr = &userptr_vma->userptr;
> > +	mmap_assert_locked(userptr->notifier.mm);
> > +
> > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) +
> > 1;
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (unlikely(!pfns))
> > +		return -ENOMEM;
> > +
> > +	if (write)
> > +		flags |= HMM_PFN_REQ_WRITE;
> > +
> > +	memset64((u64 *)pfns, (u64)flags, npages);
> > +	hmm_range->hmm_pfns = pfns;
> > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr-
> > >notifier);
> > +	hmm_range->notifier = &userptr->notifier;
> > +	hmm_range->start = start;
> > +	hmm_range->end = end;
> > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > HMM_PFN_REQ_WRITE;
> > +	/**
> > +	 * FIXME:
> > +	 * Set the the dev_private_owner can prevent hmm_range_fault
> > to fault
> > +	 * in the device private pages owned by caller. See function
> > +	 * hmm_vma_handle_pte. In multiple GPU case, this should be
> > set to the
> > +	 * device owner of the best migration destination. e.g.,
> > device0/vm0
> > +	 * has a page fault, but we have determined the best
> > placement of
> > +	 * the fault address should be on device1, we should set
> > below to
> > +	 * device1 instead of device0.
> > +	 */
> > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > +
> > +	while (true) {
> > +		ret = hmm_range_fault(hmm_range);
> > +		if (time_after(jiffies, timeout))
> > +			break;
> > +
> > +		if (ret == -EBUSY)
> > +			continue;
> 
> If (ret == -EBUSY) it looks from the hmm_range_fault() implementation
> like hmm_range->notifier_seq has become invalid and without calling
> mmu_interval_read_begin() again, we will end up in an infinite loop?
> 

I noticed this before and had a read_begin in the while loop. But on second thought, xe_hmm_populate_range is called with the mmap lock held, so after the read_begin is called above, there can't be an invalidation before mmap unlock. So theoretically EBUSY can't happen?

Oak

> /Thomas
> 
> 
> 
> > +		break;
> > +	}
> > +
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	mark_range_accessed(hmm_range, write);
> > +	userptr->sg = &userptr->sgt;
> > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > +
> > +free_pfns:
> > +	kvfree(pfns);
> > +	return ret;
> > +}
> > +
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> > b/drivers/gpu/drm/xe/xe_hmm.h
> > new file mode 100644
> > index 000000000000..960f3f6d36ae
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > @@ -0,0 +1,12 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/hmm.h>
> > +#include "xe_vm_types.h"
> > +#include "xe_svm.h"
> > +
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > *hmm_range,
> > +						bool write);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-16  1:25               ` Matthew Brost
  2024-03-18 10:16                 ` Hellstrom, Thomas
@ 2024-03-18 14:51                 ` Zeng, Oak
  1 sibling, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-18 14:51 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Friday, March 15, 2024 9:26 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> On Fri, Mar 15, 2024 at 03:31:24PM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Friday, March 15, 2024 4:40 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing
> for
> > > GPU vram
> > >
> > > On Fri, Mar 15, 2024 at 10:00:06AM -0600, Zeng, Oak wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: Thursday, March 14, 2024 4:49 PM
> > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > <himal.prasad.ghimiray@intel.com>
> > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> backing
> > > for
> > > > > GPU vram
> > > > >
> > > > > On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> > > > > > Hi Matt,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > Sent: Thursday, March 14, 2024 1:18 PM
> > > > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > > > <himal.prasad.ghimiray@intel.com>
> > > > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > > backing
> > > > > for
> > > > > > > GPU vram
> > > > > > >
> > > > > > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > > > > > > Memory remap GPU vram using devm_memremap_pages, so each
> > > GPU
> > > > > vram
> > > > > > > > page is backed by a struct page.
> > > > > > > >
> > > > > > > > Those struct pages are created to allow hmm migrate buffer b/t
> > > > > > > > GPU vram and CPU system memory using existing Linux migration
> > > > > > > > mechanism (i.e., migrating b/t CPU system memory and hard disk).
> > > > > > > >
> > > > > > > > This is prepare work to enable svm (shared virtual memory) through
> > > > > > > > Linux kernel hmm framework. The memory remap's page map type
> is
> > > set
> > > > > > > > to MEMORY_DEVICE_PRIVATE for now. This means even though
> each
> > > GPU
> > > > > > > > vram page get a struct page and can be mapped in CPU page table,
> > > > > > > > but such pages are treated as GPU's private resource, so CPU can't
> > > > > > > > access them. If CPU access such page, a page fault is triggered
> > > > > > > > and page will be migrate to system memory.
> > > > > > > >
> > > > > > >
> > > > > > > Is this really true? We can map VRAM BOs to the CPU without having
> > > > > > > migarte back and forth. Admittedly I don't know the inner workings of
> > > > > > > how this works but in IGTs we do this all the time.
> > > > > > >
> > > > > > >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> > > > > > >   55                                 vram_if_possible(fd, 0),
> > > > > > >   56
> > > DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > > > > > >   57         batch_map = xe_bo_map(fd, batch_bo, batch_size);
> > > > > > >
> > > > > > > The BO is created in VRAM and then mapped to the CPU.
> > > > > > >
> > > > > > > I don't think there is an expectation of coherence rather caching mode
> > > > > > > and exclusive access of the memory based on synchronization.
> > > > > > >
> > > > > > > e.g.
> > > > > > > User write BB/data via CPU to GPU memory
> > > > > > > User calls exec
> > > > > > > GPU read / write memory
> > > > > > > User wait on sync indicating exec done
> > > > > > > User reads result
> > > > > > >
> > > > > > > All of this works without migration. Are we not planing supporting flow
> > > > > > > with SVM?
> > > > > > >
> > > > > > > Afaik this migration dance really only needs to be done if the CPU and
> > > > > > > GPU are using atomics on a shared memory region and the GPU
> device
> > > > > > > doesn't support a coherent memory protocol (e.g. PVC).
> > > > > >
> > > > > > All you said is true. On many of our HW, CPU can actually access device
> > > memory,
> > > > > cache coherently or not.
> > > > > >
> > > > > > The problem is, this is not true for all GPU vendors. For example, on
> some
> > > HW
> > > > > from some vendor, CPU can only access partially of device memory. The
> so
> > > called
> > > > > small bar concept.
> > > > > >
> > > > > > So when HMM is defined, such factors were considered, and
> > > > > MEMORY_DEVICE_PRIVATE is defined. With this definition, CPU can't
> access
> > > > > device memory.
> > > > > >
> > > > > > So you can think it is a limitation of HMM.
> > > > > >
> > > > >
> > > > > Is it though? I see other type MEMORY_DEVICE_FS_DAX,
> > > > > MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA. From
> my
> > > > > limited
> > > > > understanding it looks to me like one of those modes would support my
> > > > > example.
> > > >
> > > >
> > > > No, above are for other purposes. HMM only support DEVICE_PRIVATE and
> > > DEVICE_COHERENT.
> > > >
> > > > >
> > > > > > Note this is only part 1 of our system allocator work. We do plan to
> support
> > > > > DEVICE_COHERENT for our newer device, see below. With this option, we
> > > don't
> > > > > have unnecessary migration back and forth.
> > > > > >
> > > > > > You can think this is just work out all the code path. 90% of the driver
> code
> > > for
> > > > > DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real use of
> system
> > > > > allocator will be DEVICE_COHERENT mode. While DEVICE_PRIVATE mode
> > > allow us
> > > > > to exercise the code on old HW.
> > > > > >
> > > > > > Make sense?
> > > > > >
> > > > >
> > > > > I guess if we want the system allocator to always coherent, then yes you
> > > > > need this dynamic migration with faulting on either side.
> > > > >
> > > > > I was thinking the system allocator would be behave like my example
> > > > > above with madvise dictating the coherence rules.
> > > > >
> > > > > Maybe I missed this in system allocator design but my feeling is we
> > > > > shouldn't arbitrarily enforce coherence as that could lead to poor
> > > > > performance due to constant migration.
> > > >
> > > > System allocator itself doesn't enforce coherence. Coherence is built in user
> > > programming model. So system allocator allow both GPU and CPU access
> system
> > > allocated pointers, but it doesn't necessarily guarantee the data accessed
> from
> > > CPU/GPU is coherent. It is user program's responsibility to maintain data
> > > coherence.
> > > >
> > > > Data migration in driver is optional, depending on platform capability, user
> > > preference, correctness and performance consideration. Driver internal data
> > > migration of course shouldn't break data coherence.
> > > >
> > > > Of course different vendor can have different data coherence scheme. For
> > > example, it is completely designer's flexibility to build model that is HW
> automatic
> > > data coherence or software explicit data coherence. On our platform, we
> allow
> > > user program to select different coherence mode by setting pat_index for
> gpu
> > > and cpu_caching mode for CPU. So we have completely give the flexibility to
> user
> > > program. Nothing of this contract is changed in system allocator design.
> > > >
> > > > Going back to the question of what memory type we should use to register
> our
> > > vram to core mm. HMM currently support two types: PRIVATE and COHERENT.
> > > The coherent type requires some HW and BIOS support which we don't have
> > > right now. So the only available is PRIVATE. We have not other option right
> now.
> > > As said, we plan to support coherent type where we can avoid unnecessary
> data
> > > migration. But that is stage 2.
> > > >
> > >
> > > Thanks for the explaination. After reading your replies, the HMM doc,
> > > and looking at code this all makes sense.
> > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > For GPU device which supports coherent memory protocol b/t CPU
> and
> > > > > > > > GPU (such as CXL and CAPI protocol), we can remap device memory
> as
> > > > > > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > > > > > >
> > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > > > > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > > > > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > > > > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > > > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > > > > > ++++++++++++++++++++++++++++
> > > > > > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > >
> > > > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > > > > > > index c531210695db..840467080e59 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > > > > > >  	xe_vram_freq.o \
> > > > > > > >  	xe_wait_user_fence.o \
> > > > > > > >  	xe_wa.o \
> > > > > > > > -	xe_wopcm.o
> > > > > > > > +	xe_wopcm.o \
> > > > > > > > +	xe_svm_devmem.o
> > > > > > >
> > > > > > > These should be in alphabetical order.
> > > > > >
> > > > > > Will fix
> > > > > > >
> > > > > > > >
> > > > > > > >  # graphics hardware monitoring (HWMON) support
> > > > > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > > > > > >  	resource_size_t actual_physical_size;
> > > > > > > >  	/** @mapping: pointer to VRAM mappable space */
> > > > > > > >  	void __iomem *mapping;
> > > > > > > > +	/** @pagemap: Used to remap device memory as ZONE_DEVICE
> > > */
> > > > > > > > +    struct dev_pagemap pagemap;
> > > > > > > > +    /**
> > > > > > > > +     * @hpa_base: base host physical address
> > > > > > > > +     *
> > > > > > > > +     * This is generated when remap device memory as
> ZONE_DEVICE
> > > > > > > > +     */
> > > > > > > > +    resource_size_t hpa_base;
> > > > > > >
> > > > > > > Weird indentation. This goes for the entire series, look at checkpatch.
> > > > > >
> > > > > > Will fix
> > > > > > >
> > > > > > > > +
> > > > > > > >  };
> > > > > > > >
> > > > > > > >  /**
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > index e3db3a178760..0d795394bc4c 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > @@ -22,6 +22,7 @@
> > > > > > > >  #include "xe_module.h"
> > > > > > > >  #include "xe_sriov.h"
> > > > > > > >  #include "xe_tile.h"
> > > > > > > > +#include "xe_svm.h"
> > > > > > > >
> > > > > > > >  #define XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > > > > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > > > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct xe_device
> *xe)
> > > > > > > >  		}
> > > > > > > >
> > > > > > > >  		io_size -= min_t(u64, tile_size, io_size);
> > > > > > > > +		xe_svm_devm_add(tile, &tile->mem.vram);
> > > > > > >
> > > > > > > Do we want to do this probe for all devices with VRAM or only a subset?
> > > > > >
> > > > > > All
> > > > >
> > > > > Can you explain why?
> > > >
> > > > It is natural for me to add all device memory to hmm. In hmm design, device
> > > memory is used as a special swap out for system memory. I would ask why
> we
> > > only want to add a subset of vram? By a subset, do you mean only vram of
> one
> > > tile instead of all tiles?
> > > >
> > >
> > > I think we talking about different things, my bad on wording in the
> > > original question.
> > >
> > > Let me ask again - should be calling xe_svm_devm_add on all *platforms*
> > > that have VRAM. i.e. Should we do this on PVC but not DG2?
> >
> >
> > Oh, I see. Good question. On i915, this feature was only tested on PVC. We
> don't have a plan to enable it on older platform than PVC.
> >
> > Let me add a check here, only enabled it on platform newer than PVC
> >
> 
> Probably actually check 'xe->info.has_usm'.
> 
> We might want to rename field too and drop the 'usm' nomenclature but
> that can be done later.

Good idea. Let me gate it with a "has_usm" check.
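
(Illustration only, not part of this series.) Assuming the flag stays as
'xe->info.has_usm', the gate in xe_mmio_probe_vram() could look roughly like:

	/* Sketch: only remap VRAM as ZONE_DEVICE pages on platforms with
	 * USM (recoverable page fault) support, per the discussion above. */
	if (xe->info.has_usm)
		xe_svm_devm_add(tile, &tile->mem.vram);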

Oak

> 
> Matt
> 
> > Oak
> >
> > >
> > > Matt
> > >
> > > > Oak
> > > >
> > > >
> > > > >
> > > > > > >
> > > > > > > >  	}
> > > > > > > >
> > > > > > > >  	xe->mem.vram.actual_physical_size = total_size;
> > > > > > > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct
> xe_device
> > > *xe)
> > > > > > > >  static void mmio_fini(struct drm_device *drm, void *arg)
> > > > > > > >  {
> > > > > > > >  	struct xe_device *xe = arg;
> > > > > > > > +    struct xe_tile *tile;
> > > > > > > > +    u8 id;
> > > > > > > >
> > > > > > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> > > > > > > >  	if (xe->mem.vram.mapping)
> > > > > > > >  		iounmap(xe->mem.vram.mapping);
> > > > > > > > +
> > > > > > > > +	for_each_tile(tile, xe, id)
> > > > > > > > +		xe_svm_devm_remove(xe, &tile->mem.vram);
> > > > > > >
> > > > > > > This should probably be above existing code. Typical on fini to do
> > > > > > > things in reverse order from init.
> > > > > >
> > > > > > Will fix
> > > > > > >
> > > > > > > > +
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..09f9afb0e7d4
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > @@ -0,0 +1,14 @@
> > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > +/*
> > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > >
> > > > > > > 2024?
> > > > > >
> > > > > > This patch was actually written 2023
> > > > > > >
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#ifndef __XE_SVM_H
> > > > > > > > +#define __XE_SVM_H
> > > > > > > > +
> > > > > > > > +#include "xe_device_types.h"
> > > > > > >
> > > > > > > I don't think you need to include this. Rather just forward decl structs
> > > > > > > used here.
> > > > > >
> > > > > > Will fix
> > > > > > >
> > > > > > > e.g.
> > > > > > >
> > > > > > > struct xe_device;
> > > > > > > struct xe_mem_region;
> > > > > > > struct xe_tile;
> > > > > > >
> > > > > > > > +
> > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region
> > > *mem);
> > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > > xe_mem_region
> > > > > > > *mem);
> > > > > > >
> > > > > > > The arguments here are incongruent here. Typically we want these to
> > > > > > > match.
> > > > > >
> > > > > > Will fix
> > > > > > >
> > > > > > > > +
> > > > > > > > +#endif
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > >
> > > > > > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> > > > > >
> > > > > > Did you mean mem vs mr? if yes, will fix
> > > > > >
> > > > > > Again these two
> > > > > > > should
> > > > > > > match.
> > > > > > >
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..63b7a1961cc6
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > @@ -0,0 +1,91 @@
> > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > +/*
> > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > >
> > > > > > > 2024?
> > > > > > It is from 2023
> > > > > > >
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#include <linux/mm_types.h>
> > > > > > > > +#include <linux/sched/mm.h>
> > > > > > > > +
> > > > > > > > +#include "xe_device_types.h"
> > > > > > > > +#include "xe_trace.h"
> > > > > > >
> > > > > > > xe_trace.h appears to be unused.
> > > > > >
> > > > > > Will fix
> > > > > > >
> > > > > > > > +#include "xe_svm.h"
> > > > > > > > +
> > > > > > > > +
> > > > > > > > +static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > > > > > +{
> > > > > > > > +	return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void xe_devm_page_free(struct page *page)
> > > > > > > > +{
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > > > > > > +	.page_free = xe_devm_page_free,
> > > > > > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > > > > > +};
> > > > > > > > +
> > > > > > >
> > > > > > > Assume these are placeholders that will be populated later?
> > > > > >
> > > > > >
> > > > > > corrrect
> > > > > > >
> > > > > > > > +/**
> > > > > > > > + * xe_svm_devm_add: Remap and provide memmap backing for
> > > device
> > > > > > > memory
> > > > > > > > + * @tile: tile that the memory region blongs to
> > > > > > > > + * @mr: memory region to remap
> > > > > > > > + *
> > > > > > > > + * This remap device memory to host physical address space and
> create
> > > > > > > > + * struct page to back device memory
> > > > > > > > + *
> > > > > > > > + * Return: 0 on success standard error code otherwise
> > > > > > > > + */
> > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct xe_mem_region
> *mr)
> > > > > > >
> > > > > > > Here I see the xe_mem_region is from tile->mem.vram, wondering
> rather
> > > > > > > than using the tile->mem.vram we should use xe->mem.vram when
> > > enabling
> > > > > > > svm? Isn't the idea behind svm the entire memory is 1 view?
> > > > > >
> > > > > > Still need to use tile. The reason is, memory of different tile can have
> > > different
> > > > > characteristics, such as latency. So we want to differentiate memory b/t
> tiles
> > > also
> > > > > in svm. I need to change below " mr->pagemap.owner = tile->xe-
> >drm.dev ".
> > > the
> > > > > owner should also be tile. This is the way hmm differentiate memory of
> > > different
> > > > > tile.
> > > > > >
> > > > > > With svm it is 1 view, from virtual address space perspective and from
> > > physical
> > > > > struct page perspective. You can think of all the tile's vram is stacked
> together
> > > to
> > > > > form a unified view together with system memory. This doesn't prohibit
> us
> > > from
> > > > > differentiate memory from different tile. This differentiation allow us to
> > > optimize
> > > > > performance, i.e., we can wisely place memory in specific tile. If we don't
> > > > > differentiate, this is not possible.
> > > > > >
> > > > >
> > > > > Ok makes sense.
> > > > >
> > > > > Matt
> > > > >
> > > > > > >
> > > > > > > I suppose if we do that we also only use 1 TTM VRAM manager /
> buddy
> > > > > > > allocator too. I thought I saw some patches flying around for that too.
> > > > > >
> > > > > > Ttm vram manager is not in the picture. We deliberately avoided it per
> > > previous
> > > > > discussion
> > > > > >
> > > > > > Yes same buddy allocator. It is in my previous POC:
> > > https://lore.kernel.org/dri-
> > > > > devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't put those
> > > patches
> > > > > in this series because I want to merge this small patches separately.
> > > > > > >
> > > > > > > > +{
> > > > > > > > +	struct device *dev = &to_pci_dev(tile->xe->drm.dev)->dev;
> > > > > > > > +	struct resource *res;
> > > > > > > > +	void *addr;
> > > > > > > > +	int ret;
> > > > > > > > +
> > > > > > > > +	res = devm_request_free_mem_region(dev, &iomem_resource,
> > > > > > > > +					   mr->usable_size);
> > > > > > > > +	if (IS_ERR(res)) {
> > > > > > > > +		ret = PTR_ERR(res);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > > > > > +	mr->pagemap.range.start = res->start;
> > > > > > > > +	mr->pagemap.range.end = res->end;
> > > > > > > > +	mr->pagemap.nr_range = 1;
> > > > > > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > > > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > > > > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > > > > > > +	if (IS_ERR(addr)) {
> > > > > > > > +		devm_release_mem_region(dev, res->start,
> > > resource_size(res));
> > > > > > > > +		ret = PTR_ERR(addr);
> > > > > > > > +		drm_err(&tile->xe->drm, "Failed to remap tile %d
> > > memory,
> > > > > > > errno %d\n",
> > > > > > > > +				tile->id, ret);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > > > +	mr->hpa_base = res->start;
> > > > > > > > +
> > > > > > > > +	drm_info(&tile->xe->drm, "Added tile %d memory [%llx-%llx] to
> > > devm,
> > > > > > > remapped to %pr\n",
> > > > > > > > +			tile->id, mr->io_start, mr->io_start + mr-
> > > >usable_size,
> > > > > > > res);
> > > > > > > > +	return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * xe_svm_devm_remove: Unmap device memory and free
> resources
> > > > > > > > + * @xe: xe device
> > > > > > > > + * @mr: memory region to remove
> > > > > > > > + */
> > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > > xe_mem_region
> > > > > > > *mr)
> > > > > > > > +{
> > > > > > > > +	/*FIXME: below cause a kernel hange during moduel remove*/
> > > > > > > > +#if 0
> > > > > > > > +	struct device *dev = &to_pci_dev(xe->drm.dev)->dev;
> > > > > > > > +
> > > > > > > > +	if (mr->hpa_base) {
> > > > > > > > +		devm_memunmap_pages(dev, &mr->pagemap);
> > > > > > > > +		devm_release_mem_region(dev, mr-
> > > >pagemap.range.start,
> > > > > > > > +			mr->pagemap.range.end - mr-
> > > >pagemap.range.start +1);
> > > > > > > > +	}
> > > > > > > > +#endif
> > > > > > >
> > > > > > > This would need to be fixed too.
> > > > > >
> > > > > >
> > > > > > Yes...
> > > > > >
> > > > > > Oak
> > > > > > >
> > > > > > > Matt
> > > > > > >
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > --
> > > > > > > > 2.26.3
> > > > > > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-18 10:16                 ` Hellstrom, Thomas
@ 2024-03-18 15:02                   ` Zeng, Oak
  2024-03-18 15:46                     ` Hellstrom, Thomas
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-18 15:02 UTC (permalink / raw)
  To: Hellstrom, Thomas, Brost, Matthew
  Cc: intel-xe, Welty,  Brian, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Monday, March 18, 2024 6:16 AM
> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Welty, Brian <brian.welty@intel.com>;
> airlied@gmail.com; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for
> GPU vram
> 
> On Sat, 2024-03-16 at 01:25 +0000, Matthew Brost wrote:
> > On Fri, Mar 15, 2024 at 03:31:24PM -0600, Zeng, Oak wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > Sent: Friday, March 15, 2024 4:40 PM
> > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > <himal.prasad.ghimiray@intel.com>
> > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > > > backing for
> > > > GPU vram
> > > >
> > > > On Fri, Mar 15, 2024 at 10:00:06AM -0600, Zeng, Oak wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > Sent: Thursday, March 14, 2024 4:49 PM
> > > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > > <himal.prasad.ghimiray@intel.com>
> > > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > > > > > backing
> > > > for
> > > > > > GPU vram
> > > > > >
> > > > > > On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak wrote:
> > > > > > > Hi Matt,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > > Sent: Thursday, March 14, 2024 1:18 PM
> > > > > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty,
> > > > > > > > Brian
> > > > > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > > > > <himal.prasad.ghimiray@intel.com>
> > > > > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide
> > > > > > > > memmap
> > > > backing
> > > > > > for
> > > > > > > > GPU vram
> > > > > > > >
> > > > > > > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng wrote:
> > > > > > > > > Memory remap GPU vram using devm_memremap_pages, so
> > > > > > > > > each
> > > > GPU
> > > > > > vram
> > > > > > > > > page is backed by a struct page.
> > > > > > > > >
> > > > > > > > > Those struct pages are created to allow hmm migrate
> > > > > > > > > buffer b/t
> > > > > > > > > GPU vram and CPU system memory using existing Linux
> > > > > > > > > migration
> > > > > > > > > mechanism (i.e., migrating b/t CPU system memory and
> > > > > > > > > hard disk).
> > > > > > > > >
> > > > > > > > > This is prepare work to enable svm (shared virtual
> > > > > > > > > memory) through
> > > > > > > > > Linux kernel hmm framework. The memory remap's page map
> > > > > > > > > type is
> > > > set
> > > > > > > > > to MEMORY_DEVICE_PRIVATE for now. This means even
> > > > > > > > > though each
> > > > GPU
> > > > > > > > > vram page get a struct page and can be mapped in CPU
> > > > > > > > > page table,
> > > > > > > > > but such pages are treated as GPU's private resource,
> > > > > > > > > so CPU can't
> > > > > > > > > access them. If CPU access such page, a page fault is
> > > > > > > > > triggered
> > > > > > > > > and page will be migrate to system memory.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Is this really true? We can map VRAM BOs to the CPU
> > > > > > > > without having
> > > > > > > > migarte back and forth. Admittedly I don't know the inner
> > > > > > > > workings of
> > > > > > > > how this works but in IGTs we do this all the time.
> > > > > > > >
> > > > > > > >   54         batch_bo = xe_bo_create(fd, vm, batch_size,
> > > > > > > >   55                                 vram_if_possible(fd,
> > > > > > > > 0),
> > > > > > > >   56
> > > > DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > > > > > > >   57         batch_map = xe_bo_map(fd, batch_bo,
> > > > > > > > batch_size);
> > > > > > > >
> > > > > > > > The BO is created in VRAM and then mapped to the CPU.
> > > > > > > >
> > > > > > > > I don't think there is an expectation of coherence rather
> > > > > > > > caching mode
> > > > > > > > and exclusive access of the memory based on
> > > > > > > > synchronization.
> > > > > > > >
> > > > > > > > e.g.
> > > > > > > > User write BB/data via CPU to GPU memory
> > > > > > > > User calls exec
> > > > > > > > GPU read / write memory
> > > > > > > > User wait on sync indicating exec done
> > > > > > > > User reads result
> > > > > > > >
> > > > > > > > All of this works without migration. Are we not planing
> > > > > > > > supporting flow
> > > > > > > > with SVM?
> > > > > > > >
> > > > > > > > Afaik this migration dance really only needs to be done
> > > > > > > > if the CPU and
> > > > > > > > GPU are using atomics on a shared memory region and the
> > > > > > > > GPU device
> > > > > > > > doesn't support a coherent memory protocol (e.g. PVC).
> > > > > > >
> > > > > > > All you said is true. On many of our HW, CPU can actually
> > > > > > > access device
> > > > memory,
> > > > > > cache coherently or not.
> > > > > > >
> > > > > > > The problem is, this is not true for all GPU vendors. For
> > > > > > > example, on some
> > > > HW
> > > > > > from some vendor, CPU can only access partially of device
> > > > > > memory. The so
> > > > called
> > > > > > small bar concept.
> > > > > > >
> > > > > > > So when HMM is defined, such factors were considered, and
> > > > > > MEMORY_DEVICE_PRIVATE is defined. With this definition, CPU
> > > > > > can't access
> > > > > > device memory.
> > > > > > >
> > > > > > > So you can think it is a limitation of HMM.
> > > > > > >
> > > > > >
> > > > > > Is it though? I see other type MEMORY_DEVICE_FS_DAX,
> > > > > > MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA.
> From my
> > > > > > limited
> > > > > > understanding it looks to me like one of those modes would
> > > > > > support my
> > > > > > example.
> > > > >
> > > > >
> > > > > No, above are for other purposes. HMM only support
> > > > > DEVICE_PRIVATE and
> > > > DEVICE_COHERENT.
> > > > >
> > > > > >
> > > > > > > Note this is only part 1 of our system allocator work. We
> > > > > > > do plan to support
> > > > > > DEVICE_COHERENT for our newer device, see below. With this
> > > > > > option, we
> > > > don't
> > > > > > have unnecessary migration back and forth.
> > > > > > >
> > > > > > > You can think this is just work out all the code path. 90%
> > > > > > > of the driver code
> > > > for
> > > > > > DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real use
> > > > > > of system
> > > > > > allocator will be DEVICE_COHERENT mode. While DEVICE_PRIVATE
> > > > > > mode
> > > > allow us
> > > > > > to exercise the code on old HW.
> > > > > > >
> > > > > > > Make sense?
> > > > > > >
> > > > > >
> > > > > > I guess if we want the system allocator to always coherent,
> > > > > > then yes you
> > > > > > need this dynamic migration with faulting on either side.
> > > > > >
> > > > > > I was thinking the system allocator would be behave like my
> > > > > > example
> > > > > > above with madvise dictating the coherence rules.
> > > > > >
> > > > > > Maybe I missed this in system allocator design but my feeling
> > > > > > is we
> > > > > > shouldn't arbitrarily enforce coherence as that could lead to
> > > > > > poor
> > > > > > performance due to constant migration.
> > > > >
> > > > > System allocator itself doesn't enforce coherence. Coherence is
> > > > > built in user
> > > > programming model. So system allocator allow both GPU and CPU
> > > > access system
> > > > allocated pointers, but it doesn't necessarily guarantee the data
> > > > accessed from
> > > > CPU/GPU is coherent. It is user program's responsibility to
> > > > maintain data
> > > > coherence.
> > > > >
> > > > > Data migration in driver is optional, depending on platform
> > > > > capability, user
> > > > preference, correctness and performance consideration. Driver
> > > > internal data
> > > > migration of course shouldn't break data coherence.
> > > > >
> > > > > Of course different vendor can have different data coherence
> > > > > scheme. For
> > > > example, it is completely designer's flexibility to build model
> > > > that is HW automatic
> > > > data coherence or software explicit data coherence. On our
> > > > platform, we allow
> > > > user program to select different coherence mode by setting
> > > > pat_index for gpu
> > > > and cpu_caching mode for CPU. So we have completely give the
> > > > flexibility to user
> > > > program. Nothing of this contract is changed in system allocator
> > > > design.
> > > > >
> > > > > Going back to the question of what memory type we should use to
> > > > > register our
> > > > vram to core mm. HMM currently support two types: PRIVATE and
> > > > COHERENT.
> > > > The coherent type requires some HW and BIOS support which we
> > > > don't have
> > > > right now. So the only available is PRIVATE. We have not other
> > > > option right now.
> > > > As said, we plan to support coherent type where we can avoid
> > > > unnecessary data
> > > > migration. But that is stage 2.
> > > > >
> > > >
> > > > Thanks for the explaination. After reading your replies, the HMM
> > > > doc,
> > > > and looking at code this all makes sense.
> > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > For GPU device which supports coherent memory protocol
> > > > > > > > > b/t CPU and
> > > > > > > > > GPU (such as CXL and CAPI protocol), we can remap
> > > > > > > > > device memory as
> > > > > > > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > > > > ---
> > > > > > > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > > > > > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > > > > > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > > > > > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > > > > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > > > > > > ++++++++++++++++++++++++++++
> > > > > > > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > b/drivers/gpu/drm/xe/Makefile
> > > > > > > > > index c531210695db..840467080e59 100644
> > > > > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > > > > > > >  	xe_vram_freq.o \
> > > > > > > > >  	xe_wait_user_fence.o \
> > > > > > > > >  	xe_wa.o \
> > > > > > > > > -	xe_wopcm.o
> > > > > > > > > +	xe_wopcm.o \
> > > > > > > > > +	xe_svm_devmem.o
> > > > > > > >
> > > > > > > > These should be in alphabetical order.
> > > > > > >
> > > > > > > Will fix
> > > > > > > >
> > > > > > > > >
> > > > > > > > >  # graphics hardware monitoring (HWMON) support
> > > > > > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > > > > > > >  	resource_size_t actual_physical_size;
> > > > > > > > >  	/** @mapping: pointer to VRAM mappable space
> > > > > > > > > */
> > > > > > > > >  	void __iomem *mapping;
> > > > > > > > > +	/** @pagemap: Used to remap device memory as
> > > > > > > > > ZONE_DEVICE
> > > > */
> > > > > > > > > +    struct dev_pagemap pagemap;
> > > > > > > > > +    /**
> > > > > > > > > +     * @hpa_base: base host physical address
> > > > > > > > > +     *
> > > > > > > > > +     * This is generated when remap device memory as
> > > > > > > > > ZONE_DEVICE
> > > > > > > > > +     */
> > > > > > > > > +    resource_size_t hpa_base;
> > > > > > > >
> > > > > > > > Weird indentation. This goes for the entire series, look
> > > > > > > > at checkpatch.
> > > > > > >
> > > > > > > Will fix
> > > > > > > >
> > > > > > > > > +
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > >  /**
> > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > > index e3db3a178760..0d795394bc4c 100644
> > > > > > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > > @@ -22,6 +22,7 @@
> > > > > > > > >  #include "xe_module.h"
> > > > > > > > >  #include "xe_sriov.h"
> > > > > > > > >  #include "xe_tile.h"
> > > > > > > > > +#include "xe_svm.h"
> > > > > > > > >
> > > > > > > > >  #define
> > > > > > > > > XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > > > > > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > > > > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct
> > > > > > > > > xe_device *xe)
> > > > > > > > >  		}
> > > > > > > > >
> > > > > > > > >  		io_size -= min_t(u64, tile_size,
> > > > > > > > > io_size);
> > > > > > > > > +		xe_svm_devm_add(tile, &tile-
> > > > > > > > > >mem.vram);
> > > > > > > >
> > > > > > > > Do we want to do this probe for all devices with VRAM or
> > > > > > > > only a subset?
> > > > > > >
> > > > > > > All
> > > > > >
> > > > > > Can you explain why?
> > > > >
> > > > > It is natural for me to add all device memory to hmm. In hmm
> > > > > design, device
> > > > memory is used as a special swap out for system memory. I would
> > > > ask why we
> > > > only want to add a subset of vram? By a subset, do you mean only
> > > > vram of one
> > > > tile instead of all tiles?
> > > > >
> > > >
> > > > I think we talking about different things, my bad on wording in
> > > > the
> > > > original question.
> > > >
> > > > Let me ask again - should be calling xe_svm_devm_add on all
> > > > *platforms*
> > > > that have VRAM. i.e. Should we do this on PVC but not DG2?
> > >
> > >
> > > Oh, I see. Good question. On i915, this feature was only tested on
> > > PVC. We don't have a plan to enable it on older platform than PVC.
> > >
> > > Let me add a check here, only enabled it on platform newer than PVC
> > >
> >
> > Probably actually check 'xe->info.has_usm'.
> >
> > We might want to rename field too and drop the 'usm' nomenclature but
> > that can be done later.
> 
> Perhaps "has_recoverable_pagefaults" or some for of abbreviation.

USM has two flavors: a driver allocator and a system allocator. 

Both flavors depend on recoverable page faults, and in our current implementation that is the only HW feature they depend on. In the future, we might have other implementations that depend on more HW features, such as ATS (address translation services).

So at least for now, "has_usm" and "has_recoverable_pagefaults" are pretty much the same thing. I will keep "has_usm" for now and am open to renaming it in the future.

> 
> Another question w r t this is whether we should do this
> unconditionally even on platforms that support it. Adding a struct_page
> per VRAM page would potentially consume a significant amount of system
> memory.

Yes, I was thinking of adding a kernel configuration option so this feature can be enabled/disabled at compile time. Do you want me to add one?
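
(Illustration only.) One way to express that would be a compile-time gate on a
hypothetical Kconfig symbol, say CONFIG_DRM_XE_SVM_DEVMEM, checked in the same
probe path; the symbol name is an assumption, not something defined in this
series:

	/* Sketch: compile-time gate via a hypothetical Kconfig option,
	 * combined with the runtime platform check discussed earlier. */
	if (IS_ENABLED(CONFIG_DRM_XE_SVM_DEVMEM) && xe->info.has_usm)
		xe_svm_devm_add(tile, &tile->mem.vram);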

Oak 
> 
> /Thomas
> 
> 
> 
> >
> > Matt
> >
> > > Oak
> > >
> > > >
> > > > Matt
> > > >
> > > > > Oak
> > > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > >  	}
> > > > > > > > >
> > > > > > > > >  	xe->mem.vram.actual_physical_size =
> > > > > > > > > total_size;
> > > > > > > > > @@ -354,10 +356,16 @@ void xe_mmio_probe_tiles(struct
> > > > > > > > > xe_device
> > > > *xe)
> > > > > > > > >  static void mmio_fini(struct drm_device *drm, void
> > > > > > > > > *arg)
> > > > > > > > >  {
> > > > > > > > >  	struct xe_device *xe = arg;
> > > > > > > > > +    struct xe_tile *tile;
> > > > > > > > > +    u8 id;
> > > > > > > > >
> > > > > > > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe-
> > > > > > > > > >mmio.regs);
> > > > > > > > >  	if (xe->mem.vram.mapping)
> > > > > > > > >  		iounmap(xe->mem.vram.mapping);
> > > > > > > > > +
> > > > > > > > > +	for_each_tile(tile, xe, id)
> > > > > > > > > +		xe_svm_devm_remove(xe, &tile-
> > > > > > > > > >mem.vram);
> > > > > > > >
> > > > > > > > This should probably be above existing code. Typical on
> > > > > > > > fini to do
> > > > > > > > things in reverse order from init.
> > > > > > >
> > > > > > > Will fix
> > > > > > > >
> > > > > > > > > +
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > >  static int xe_verify_lmem_ready(struct xe_device *xe)
> > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > > new file mode 100644
> > > > > > > > > index 000000000000..09f9afb0e7d4
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > > @@ -0,0 +1,14 @@
> > > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > > +/*
> > > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > > >
> > > > > > > > 2024?
> > > > > > >
> > > > > > > This patch was actually written 2023
> > > > > > > >
> > > > > > > > > + */
> > > > > > > > > +
> > > > > > > > > +#ifndef __XE_SVM_H
> > > > > > > > > +#define __XE_SVM_H
> > > > > > > > > +
> > > > > > > > > +#include "xe_device_types.h"
> > > > > > > >
> > > > > > > > I don't think you need to include this. Rather just
> > > > > > > > forward decl structs
> > > > > > > > used here.
> > > > > > >
> > > > > > > Will fix
> > > > > > > >
> > > > > > > > e.g.
> > > > > > > >
> > > > > > > > struct xe_device;
> > > > > > > > struct xe_mem_region;
> > > > > > > > struct xe_tile;
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct
> > > > > > > > > xe_mem_region
> > > > *mem);
> > > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > > > xe_mem_region
> > > > > > > > *mem);
> > > > > > > >
> > > > > > > > The arguments here are incongruent here. Typically we
> > > > > > > > want these to
> > > > > > > > match.
> > > > > > >
> > > > > > > Will fix
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +#endif
> > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > >
> > > > > > > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> > > > > > >
> > > > > > > Did you mean mem vs mr? if yes, will fix
> > > > > > >
> > > > > > > Again these two
> > > > > > > > should
> > > > > > > > match.
> > > > > > > >
> > > > > > > > > new file mode 100644
> > > > > > > > > index 000000000000..63b7a1961cc6
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > > @@ -0,0 +1,91 @@
> > > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > > +/*
> > > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > > >
> > > > > > > > 2024?
> > > > > > > It is from 2023
> > > > > > > >
> > > > > > > > > + */
> > > > > > > > > +
> > > > > > > > > +#include <linux/mm_types.h>
> > > > > > > > > +#include <linux/sched/mm.h>
> > > > > > > > > +
> > > > > > > > > +#include "xe_device_types.h"
> > > > > > > > > +#include "xe_trace.h"
> > > > > > > >
> > > > > > > > xe_trace.h appears to be unused.
> > > > > > >
> > > > > > > Will fix
> > > > > > > >
> > > > > > > > > +#include "xe_svm.h"
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +static vm_fault_t xe_devm_migrate_to_ram(struct
> > > > > > > > > vm_fault *vmf)
> > > > > > > > > +{
> > > > > > > > > +	return 0;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static void xe_devm_page_free(struct page *page)
> > > > > > > > > +{
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static const struct dev_pagemap_ops
> > > > > > > > > xe_devm_pagemap_ops = {
> > > > > > > > > +	.page_free = xe_devm_page_free,
> > > > > > > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > >
> > > > > > > > Assume these are placeholders that will be populated
> > > > > > > > later?
> > > > > > >
> > > > > > >
> > > > > > > corrrect
> > > > > > > >
> > > > > > > > > +/**
> > > > > > > > > + * xe_svm_devm_add: Remap and provide memmap backing
> > > > > > > > > for
> > > > device
> > > > > > > > memory
> > > > > > > > > + * @tile: tile that the memory region blongs to
> > > > > > > > > + * @mr: memory region to remap
> > > > > > > > > + *
> > > > > > > > > + * This remap device memory to host physical address
> > > > > > > > > space and create
> > > > > > > > > + * struct page to back device memory
> > > > > > > > > + *
> > > > > > > > > + * Return: 0 on success standard error code otherwise
> > > > > > > > > + */
> > > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct
> > > > > > > > > xe_mem_region *mr)
> > > > > > > >
> > > > > > > > Here I see the xe_mem_region is from tile->mem.vram,
> > > > > > > > wondering rather
> > > > > > > > than using the tile->mem.vram we should use xe->mem.vram
> > > > > > > > when
> > > > enabling
> > > > > > > > svm? Isn't the idea behind svm the entire memory is 1
> > > > > > > > view?
> > > > > > >
> > > > > > > Still need to use tile. The reason is, memory of different
> > > > > > > tile can have
> > > > different
> > > > > > characteristics, such as latency. So we want to differentiate
> > > > > > memory b/t tiles
> > > > also
> > > > > > in svm. I need to change below " mr->pagemap.owner = tile-
> > > > > > >xe->drm.dev ".
> > > > the
> > > > > > owner should also be tile. This is the way hmm differentiate
> > > > > > memory of
> > > > different
> > > > > > tile.
> > > > > > >
> > > > > > > With svm it is 1 view, from virtual address space
> > > > > > > perspective and from
> > > > physical
> > > > > > struct page perspective. You can think of all the tile's vram
> > > > > > is stacked together
> > > > to
> > > > > > form a unified view together with system memory. This doesn't
> > > > > > prohibit us
> > > > from
> > > > > > differentiate memory from different tile. This
> > > > > > differentiation allow us to
> > > > optimize
> > > > > > performance, i.e., we can wisely place memory in specific
> > > > > > tile. If we don't
> > > > > > differentiate, this is not possible.
> > > > > > >
> > > > > >
> > > > > > Ok makes sense.
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > >
> > > > > > > > I suppose if we do that we also only use 1 TTM VRAM
> > > > > > > > manager / buddy
> > > > > > > > allocator too. I thought I saw some patches flying around
> > > > > > > > for that too.
> > > > > > >
> > > > > > > Ttm vram manager is not in the picture. We deliberately
> > > > > > > avoided it per
> > > > previous
> > > > > > discussion
> > > > > > >
> > > > > > > Yes same buddy allocator. It is in my previous POC:
> > > > https://lore.kernel.org/dri-
> > > > > > devel/20240117221223.18540-12-oak.zeng@intel.com/. I didn't
> > > > > > put those
> > > > patches
> > > > > > in this series because I want to merge this small patches
> > > > > > separately.
> > > > > > > >
> > > > > > > > > +{
> > > > > > > > > +	struct device *dev = &to_pci_dev(tile->xe-
> > > > > > > > > >drm.dev)->dev;
> > > > > > > > > +	struct resource *res;
> > > > > > > > > +	void *addr;
> > > > > > > > > +	int ret;
> > > > > > > > > +
> > > > > > > > > +	res = devm_request_free_mem_region(dev,
> > > > > > > > > &iomem_resource,
> > > > > > > > > +					   mr-
> > > > > > > > > >usable_size);
> > > > > > > > > +	if (IS_ERR(res)) {
> > > > > > > > > +		ret = PTR_ERR(res);
> > > > > > > > > +		return ret;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > > > > > > +	mr->pagemap.range.start = res->start;
> > > > > > > > > +	mr->pagemap.range.end = res->end;
> > > > > > > > > +	mr->pagemap.nr_range = 1;
> > > > > > > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > > > > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > > > > > > +	addr = devm_memremap_pages(dev, &mr->pagemap);
> > > > > > > > > +	if (IS_ERR(addr)) {
> > > > > > > > > +		devm_release_mem_region(dev, res-
> > > > > > > > > >start,
> > > > resource_size(res));
> > > > > > > > > +		ret = PTR_ERR(addr);
> > > > > > > > > +		drm_err(&tile->xe->drm, "Failed to
> > > > > > > > > remap tile %d
> > > > memory,
> > > > > > > > errno %d\n",
> > > > > > > > > +				tile->id, ret);
> > > > > > > > > +		return ret;
> > > > > > > > > +	}
> > > > > > > > > +	mr->hpa_base = res->start;
> > > > > > > > > +
> > > > > > > > > +	drm_info(&tile->xe->drm, "Added tile %d memory
> > > > > > > > > [%llx-%llx] to
> > > > devm,
> > > > > > > > remapped to %pr\n",
> > > > > > > > > +			tile->id, mr->io_start, mr-
> > > > > > > > > >io_start + mr-
> > > > > usable_size,
> > > > > > > > res);
> > > > > > > > > +	return 0;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/**
> > > > > > > > > + * xe_svm_devm_remove: Unmap device memory and free
> > > > > > > > > resources
> > > > > > > > > + * @xe: xe device
> > > > > > > > > + * @mr: memory region to remove
> > > > > > > > > + */
> > > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe, struct
> > > > xe_mem_region
> > > > > > > > *mr)
> > > > > > > > > +{
> > > > > > > > > +	/*FIXME: below cause a kernel hange during
> > > > > > > > > moduel remove*/
> > > > > > > > > +#if 0
> > > > > > > > > +	struct device *dev = &to_pci_dev(xe->drm.dev)-
> > > > > > > > > >dev;
> > > > > > > > > +
> > > > > > > > > +	if (mr->hpa_base) {
> > > > > > > > > +		devm_memunmap_pages(dev, &mr-
> > > > > > > > > >pagemap);
> > > > > > > > > +		devm_release_mem_region(dev, mr-
> > > > > pagemap.range.start,
> > > > > > > > > +			mr->pagemap.range.end - mr-
> > > > > pagemap.range.start +1);
> > > > > > > > > +	}
> > > > > > > > > +#endif
> > > > > > > >
> > > > > > > > This would need to be fixed too.
> > > > > > >
> > > > > > >
> > > > > > > Yes...
> > > > > > >
> > > > > > > Oak
> > > > > > > >
> > > > > > > > Matt
> > > > > > > >
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > --
> > > > > > > > > 2.26.3
> > > > > > > > >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-18 14:49     ` Zeng, Oak
@ 2024-03-18 15:40       ` Hellstrom, Thomas
  2024-03-18 16:09         ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-18 15:40 UTC (permalink / raw)
  To: intel-xe, Zeng,  Oak
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad

Hi,

On Mon, 2024-03-18 at 14:49 +0000, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > Sent: Monday, March 18, 2024 9:13 AM
> > To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or
> > hmmptr
> > 
> > Hi, Oak,
> > 
> > Found another thing, see below:
> > 
> > On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > > Add a helper function xe_hmm_populate_range to populate
> > > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > > to read CPU page tables and populate all pfns/pages of this
> > > virtual address range.
> > > 
> > > If the populated page is system memory page, dma-mapping is
> > > performed
> > > to get a dma-address which can be used later for GPU to access
> > > pages.
> > > 
> > > If the populated page is device private page, we calculate the
> > > dpa (
> > > device physical address) of the page.
> > > 
> > > The dma-address or dpa is then saved in userptr's sg table. This
> > > is
> > > prepare work to replace the get_user_pages_fast code in userptr
> > > code
> > > path. The helper function will also be used to populate hmmptr
> > > later.
> > > 
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > > ++++++++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index 840467080e59..29dcbc938b01 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > >  	xe_wait_user_fence.o \
> > >  	xe_wa.o \
> > >  	xe_wopcm.o \
> > > -	xe_svm_devmem.o
> > > +	xe_svm_devmem.o \
> > > +	xe_hmm.o
> > > 
> > >  # graphics hardware monitoring (HWMON) support
> > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > > b/drivers/gpu/drm/xe/xe_hmm.c
> > > new file mode 100644
> > > index 000000000000..c45c2447d386
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > @@ -0,0 +1,213 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/swap.h>
> > > +#include <linux/mm.h>
> > > +#include "xe_hmm.h"
> > > +#include "xe_vm.h"
> > > +
> > > +/**
> > > + * mark_range_accessed() - mark a range is accessed, so core mm
> > > + * have such information for memory eviction or write back to
> > > + * hard disk
> > > + *
> > > + * @range: the range to mark
> > > + * @write: if write to this range, we mark pages in this range
> > > + * as dirty
> > > + */
> > > +static void mark_range_accessed(struct hmm_range *range, bool
> > > write)
> > > +{
> > > +	struct page *page;
> > > +	u64 i, npages;
> > > +
> > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > >start >>
> > > PAGE_SHIFT) + 1;
> > > +	for (i = 0; i < npages; i++) {
> > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > +		if (write) {
> > > +			lock_page(page);
> > > +			set_page_dirty(page);
> > > +			unlock_page(page);
> > > +		}
> > > +		mark_page_accessed(page);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * build_sg() - build a scatter gather table for all the
> > > physical
> > > pages/pfn
> > > + * in a hmm_range. dma-address is save in sg table and will be
> > > used
> > > to program
> > > + * GPU page table later.
> > > + *
> > > + * @xe: the xe device who will access the dma-address in sg
> > > table
> > > + * @range: the hmm range that we build the sg table from. range-
> > > > hmm_pfns[]
> > > + * has the pfn numbers of pages that back up this hmm address
> > > range.
> > > + * @st: pointer to the sg table.
> > > + * @write: whether we write to this range. This decides dma map
> > > direction
> > > + * for system pages. If write we map it bi-diretional; otherwise
> > > + * DMA_TO_DEVICE
> > > + *
> > > + * All the contiguous pfns will be collapsed into one entry in
> > > + * the scatter gather table. This is for the convenience of
> > > + * later on operations to bind address range to GPU page table.
> > > + *
> > > + * The dma_address in the sg table will later be used by GPU to
> > > + * access memory. So if the memory is system memory, we need to
> > > + * do a dma-mapping so it can be accessed by GPU/DMA. If the
> > > memory
> > > + * is GPU local memory (of the GPU who is going to access
> > > memory),
> > > + * we need gpu dpa (device physical address), and there is no
> > > need
> > > + * of dma-mapping.
> > > + *
> > > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > > + * memory. Add this when you support p2p
> > > + *
> > > + * This function allocates the storage of the sg table. It is
> > > + * caller's responsibility to free it calling sg_free_table.
> > > + *
> > > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > > + */
> > > +static int build_sg(struct xe_device *xe, struct hmm_range
> > > *range,
> > > +			     struct sg_table *st, bool write)
> > > +{
> > > +	struct device *dev = xe->drm.dev;
> > > +	struct scatterlist *sg;
> > > +	u64 i, npages;
> > > +
> > > +	sg = NULL;
> > > +	st->nents = 0;
> > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > >start >>
> > > PAGE_SHIFT) + 1;
> > > +
> > > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > > +		return -ENOMEM;
> > > +
> > > +	for (i = 0; i < npages; i++) {
> > > +		struct page *page;
> > > +		unsigned long addr;
> > > +		struct xe_mem_region *mr;
> > > +
> > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > +		if (is_device_private_page(page)) {
> > > +			mr = page_to_mem_region(page);
> > > +			addr = vram_pfn_to_dpa(mr, range-
> > > > hmm_pfns[i]);
> > > +		} else {
> > > +			addr = dma_map_page(dev, page, 0,
> > > PAGE_SIZE,
> > > +					write ?
> > > DMA_BIDIRECTIONAL :
> > > DMA_TO_DEVICE);
> > > +		}
> > > +
> > > +		if (sg && (addr == (sg_dma_address(sg) + sg-
> > > > length))) {
> > > +			sg->length += PAGE_SIZE;
> > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > +			continue;
> > > +		}
> > > +
> > > +		sg =  sg ? sg_next(sg) : st->sgl;
> > > +		sg_dma_address(sg) = addr;
> > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > +		sg->length = PAGE_SIZE;
> > > +		st->nents++;
> > > +	}
> > > +
> > > +	sg_mark_end(sg);
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_hmm_populate_range() - Populate physical pages of a
> > > virtual
> > > + * address range
> > > + *
> > > + * @vma: vma has information of the range to populate. only vma
> > > + * of userptr and hmmptr type can be populated.
> > > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > > + * will hold the populated pfns.
> > > + * @write: populate pages with write permission
> > > + *
> > > + * This function populate the physical pages of a virtual
> > > + * address range. The populated physical pages is saved in
> > > + * userptr's sg table. It is similar to get_user_pages but call
> > > + * hmm_range_fault.
> > > + *
> > > + * This function also read mmu notifier sequence # (
> > > + * mmu_interval_read_begin), for the purpose of later
> > > + * comparison (through mmu_interval_read_retry).
> > > + *
> > > + * This must be called with mmap read or write lock held.
> > > + *
> > > + * This function allocates the storage of the userptr sg table.
> > > + * It is caller's responsibility to free it calling
> > > sg_free_table.
> > > + *
> > > + * returns: 0 for succuss; negative error no on failure
> > > + */
> > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > *hmm_range,
> > > +						bool write)
> > > +{
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > +	struct xe_userptr_vma *userptr_vma;
> > > +	struct xe_userptr *userptr;
> > > +	u64 start = vma->gpuva.va.addr;
> > > +	u64 end = start + vma->gpuva.va.range;
> > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > +	u64 npages;
> > > +	int ret;
> > > +
> > > +	userptr_vma = to_userptr_vma(vma);
> > > +	userptr = &userptr_vma->userptr;
> > > +	mmap_assert_locked(userptr->notifier.mm);
> > > +
> > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >>
> > > PAGE_SHIFT) +
> > > 1;
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +	if (unlikely(!pfns))
> > > +		return -ENOMEM;
> > > +
> > > +	if (write)
> > > +		flags |= HMM_PFN_REQ_WRITE;
> > > +
> > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > > +	hmm_range->hmm_pfns = pfns;
> > > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
> > > +	hmm_range->notifier = &userptr->notifier;
> > > +	hmm_range->start = start;
> > > +	hmm_range->end = end;
> > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > > HMM_PFN_REQ_WRITE;
> > > +	/**
> > > +	 * FIXME:
> > > +	 * Set the the dev_private_owner can prevent
> > > hmm_range_fault
> > > to fault
> > > +	 * in the device private pages owned by caller. See
> > > function
> > > +	 * hmm_vma_handle_pte. In multiple GPU case, this should
> > > be
> > > set to the
> > > +	 * device owner of the best migration destination. e.g.,
> > > device0/vm0
> > > +	 * has a page fault, but we have determined the best
> > > placement of
> > > +	 * the fault address should be on device1, we should set
> > > below to
> > > +	 * device1 instead of device0.
> > > +	 */
> > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > +
> > > +	while (true) {
> > > +		ret = hmm_range_fault(hmm_range);
> > > +		if (time_after(jiffies, timeout))
> > > +			break;
> > > +
> > > +		if (ret == -EBUSY)
> > > +			continue;
> > 
> > If (ret == -EBUSY) it looks from the hmm_range_fault()
> > implementation
> > like hmm_range->notifier_seq has become invalid and without calling
> > mmu_interval_read_begin() again, we will end up in an infinite
> > loop?
> > 
> 
> I noticed this thing before and had a read_begin in the while loop.
> But on second thought, function xe_hmm_populate_range is called
> inside the mmap lock, so after the read_begin is called above, there
> can't be an invalidation before mmap unlock. So theoretically EBUSY
> can't happen?
> 
> Oak

Invalidations can happen due to many different things that don't need
the mmap lock. File truncation and I think page reclaim are typical
examples.

/Thomas


> 
> > /Thomas
> > 
> > 
> > 
> > > +		break;
> > > +	}
> > > +
> > > +	if (ret)
> > > +		goto free_pfns;
> > > +
> > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > +	if (ret)
> > > +		goto free_pfns;
> > > +
> > > +	mark_range_accessed(hmm_range, write);
> > > +	userptr->sg = &userptr->sgt;
> > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > > +
> > > +free_pfns:
> > > +	kvfree(pfns);
> > > +	return ret;
> > > +}
> > > +
> > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> > > b/drivers/gpu/drm/xe/xe_hmm.h
> > > new file mode 100644
> > > index 000000000000..960f3f6d36ae
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > @@ -0,0 +1,12 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/types.h>
> > > +#include <linux/hmm.h>
> > > +#include "xe_vm_types.h"
> > > +#include "xe_svm.h"
> > > +
> > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > *hmm_range,
> > > +						bool write);
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram
  2024-03-18 15:02                   ` Zeng, Oak
@ 2024-03-18 15:46                     ` Hellstrom, Thomas
  0 siblings, 0 replies; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-18 15:46 UTC (permalink / raw)
  To: Brost, Matthew, Zeng, Oak
  Cc: intel-xe, Welty,  Brian, airlied, Ghimiray, Himal Prasad

On Mon, 2024-03-18 at 15:02 +0000, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > Sent: Monday, March 18, 2024 6:16 AM
> > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Welty, Brian
> > <brian.welty@intel.com>;
> > airlied@gmail.com; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > backing for
> > GPU vram
> > 
> > On Sat, 2024-03-16 at 01:25 +0000, Matthew Brost wrote:
> > > On Fri, Mar 15, 2024 at 03:31:24PM -0600, Zeng, Oak wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: Friday, March 15, 2024 4:40 PM
> > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > <himal.prasad.ghimiray@intel.com>
> > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide memmap
> > > > > backing for
> > > > > GPU vram
> > > > > 
> > > > > On Fri, Mar 15, 2024 at 10:00:06AM -0600, Zeng, Oak wrote:
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > Sent: Thursday, March 14, 2024 4:49 PM
> > > > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty,
> > > > > > > Brian
> > > > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > > > <himal.prasad.ghimiray@intel.com>
> > > > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and provide
> > > > > > > memmap
> > > > > > > backing
> > > > > for
> > > > > > > GPU vram
> > > > > > > 
> > > > > > > On Thu, Mar 14, 2024 at 12:32:36PM -0600, Zeng, Oak
> > > > > > > wrote:
> > > > > > > > Hi Matt,
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > > > > > Sent: Thursday, March 14, 2024 1:18 PM
> > > > > > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > > > > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > > > > > > > <thomas.hellstrom@intel.com>; airlied@gmail.com;
> > > > > > > > > Welty,
> > > > > > > > > Brian
> > > > > > > > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > > > > > > > <himal.prasad.ghimiray@intel.com>
> > > > > > > > > Subject: Re: [PATCH 1/5] drm/xe/svm: Remap and
> > > > > > > > > provide
> > > > > > > > > memmap
> > > > > backing
> > > > > > > for
> > > > > > > > > GPU vram
> > > > > > > > > 
> > > > > > > > > On Wed, Mar 13, 2024 at 11:35:49PM -0400, Oak Zeng
> > > > > > > > > wrote:
> > > > > > > > > > Memory remap GPU vram using devm_memremap_pages, so
> > > > > > > > > > each
> > > > > GPU
> > > > > > > vram
> > > > > > > > > > page is backed by a struct page.
> > > > > > > > > > 
> > > > > > > > > > Those struct pages are created to allow hmm migrate
> > > > > > > > > > buffer b/t
> > > > > > > > > > GPU vram and CPU system memory using existing Linux
> > > > > > > > > > migration
> > > > > > > > > > mechanism (i.e., migrating b/t CPU system memory
> > > > > > > > > > and
> > > > > > > > > > hard disk).
> > > > > > > > > > 
> > > > > > > > > > This is prepare work to enable svm (shared virtual
> > > > > > > > > > memory) through
> > > > > > > > > > Linux kernel hmm framework. The memory remap's page
> > > > > > > > > > map
> > > > > > > > > > type is
> > > > > set
> > > > > > > > > > to MEMORY_DEVICE_PRIVATE for now. This means even
> > > > > > > > > > though each
> > > > > GPU
> > > > > > > > > > vram page get a struct page and can be mapped in
> > > > > > > > > > CPU
> > > > > > > > > > page table,
> > > > > > > > > > but such pages are treated as GPU's private
> > > > > > > > > > resource,
> > > > > > > > > > so CPU can't
> > > > > > > > > > access them. If CPU access such page, a page fault
> > > > > > > > > > is
> > > > > > > > > > triggered
> > > > > > > > > > and page will be migrate to system memory.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Is this really true? We can map VRAM BOs to the CPU
> > > > > > > > > without having
> > > > > > > > > migarte back and forth. Admittedly I don't know the
> > > > > > > > > inner
> > > > > > > > > workings of
> > > > > > > > > how this works but in IGTs we do this all the time.
> > > > > > > > > 
> > > > > > > > >   54         batch_bo = xe_bo_create(fd, vm,
> > > > > > > > > batch_size,
> > > > > > > > >   55                                
> > > > > > > > > vram_if_possible(fd,
> > > > > > > > > 0),
> > > > > > > > >   56
> > > > > DRM_XE_GEM_CREATE_FLAG_NEEDS_VISIBLE_VRAM);
> > > > > > > > >   57         batch_map = xe_bo_map(fd, batch_bo,
> > > > > > > > > batch_size);
> > > > > > > > > 
> > > > > > > > > The BO is created in VRAM and then mapped to the CPU.
> > > > > > > > > 
> > > > > > > > > I don't think there is an expectation of coherence
> > > > > > > > > rather
> > > > > > > > > caching mode
> > > > > > > > > and exclusive access of the memory based on
> > > > > > > > > synchronization.
> > > > > > > > > 
> > > > > > > > > e.g.
> > > > > > > > > User write BB/data via CPU to GPU memory
> > > > > > > > > User calls exec
> > > > > > > > > GPU read / write memory
> > > > > > > > > User wait on sync indicating exec done
> > > > > > > > > User reads result
> > > > > > > > > 
> > > > > > > > > All of this works without migration. Are we not
> > > > > > > > > planing
> > > > > > > > > supporting flow
> > > > > > > > > with SVM?
> > > > > > > > > 
> > > > > > > > > Afaik this migration dance really only needs to be
> > > > > > > > > done
> > > > > > > > > if the CPU and
> > > > > > > > > GPU are using atomics on a shared memory region and
> > > > > > > > > the
> > > > > > > > > GPU device
> > > > > > > > > doesn't support a coherent memory protocol (e.g.
> > > > > > > > > PVC).
> > > > > > > > 
> > > > > > > > All you said is true. On many of our HW, CPU can
> > > > > > > > actually
> > > > > > > > access device
> > > > > memory,
> > > > > > > cache coherently or not.
> > > > > > > > 
> > > > > > > > The problem is, this is not true for all GPU vendors.
> > > > > > > > For
> > > > > > > > example, on some
> > > > > HW
> > > > > > > from some vendor, CPU can only access partially of device
> > > > > > > memory. The so
> > > > > called
> > > > > > > small bar concept.
> > > > > > > > 
> > > > > > > > So when HMM is defined, such factors were considered,
> > > > > > > > and
> > > > > > > MEMORY_DEVICE_PRIVATE is defined. With this definition,
> > > > > > > CPU
> > > > > > > can't access
> > > > > > > device memory.
> > > > > > > > 
> > > > > > > > So you can think it is a limitation of HMM.
> > > > > > > > 
> > > > > > > 
> > > > > > > Is it though? I see other type MEMORY_DEVICE_FS_DAX,
> > > > > > > MEMORY_DEVICE_GENERIC, and MEMORY_DEVICE_PCI_P2PDMA.
> > From my
> > > > > > > limited
> > > > > > > understanding it looks to me like one of those modes
> > > > > > > would
> > > > > > > support my
> > > > > > > example.
> > > > > > 
> > > > > > 
> > > > > > No, above are for other purposes. HMM only support
> > > > > > DEVICE_PRIVATE and
> > > > > DEVICE_COHERENT.
> > > > > > 
> > > > > > > 
> > > > > > > > Note this is only part 1 of our system allocator work.
> > > > > > > > We
> > > > > > > > do plan to support
> > > > > > > DEVICE_COHERENT for our newer device, see below. With
> > > > > > > this
> > > > > > > option, we
> > > > > don't
> > > > > > > have unnecessary migration back and forth.
> > > > > > > > 
> > > > > > > > You can think this is just work out all the code path.
> > > > > > > > 90%
> > > > > > > > of the driver code
> > > > > for
> > > > > > > DEVICE_PRIVATE and DEVICE_COHERENT will be same. Our real
> > > > > > > use
> > > > > > > of system
> > > > > > > allocator will be DEVICE_COHERENT mode. While
> > > > > > > DEVICE_PRIVATE
> > > > > > > mode
> > > > > allow us
> > > > > > > to exercise the code on old HW.
> > > > > > > > 
> > > > > > > > Make sense?
> > > > > > > > 
> > > > > > > 
> > > > > > > I guess if we want the system allocator to always
> > > > > > > coherent,
> > > > > > > then yes you
> > > > > > > need this dynamic migration with faulting on either side.
> > > > > > > 
> > > > > > > I was thinking the system allocator would be behave like
> > > > > > > my
> > > > > > > example
> > > > > > > above with madvise dictating the coherence rules.
> > > > > > > 
> > > > > > > Maybe I missed this in system allocator design but my
> > > > > > > feeling
> > > > > > > is we
> > > > > > > shouldn't arbitrarily enforce coherence as that could
> > > > > > > lead to
> > > > > > > poor
> > > > > > > performance due to constant migration.
> > > > > > 
> > > > > > System allocator itself doesn't enforce coherence.
> > > > > > Coherence is
> > > > > > built in user
> > > > > programming model. So system allocator allow both GPU and CPU
> > > > > access system
> > > > > allocated pointers, but it doesn't necessarily guarantee the
> > > > > data
> > > > > accessed from
> > > > > CPU/GPU is coherent. It is user program's responsibility to
> > > > > maintain data
> > > > > coherence.
> > > > > > 
> > > > > > Data migration in driver is optional, depending on platform
> > > > > > capability, user
> > > > > preference, correctness and performance consideration. Driver
> > > > > internal data
> > > > > migration of course shouldn't break data coherence.
> > > > > > 
> > > > > > Of course different vendor can have different data
> > > > > > coherence
> > > > > > scheme. For
> > > > > example, it is completely designer's flexibility to build
> > > > > model
> > > > > that is HW automatic
> > > > > data coherence or software explicit data coherence. On our
> > > > > platform, we allow
> > > > > user program to select different coherence mode by setting
> > > > > pat_index for gpu
> > > > > and cpu_caching mode for CPU. So we have completely give the
> > > > > flexibility to user
> > > > > program. Nothing of this contract is changed in system
> > > > > allocator
> > > > > design.
> > > > > > 
> > > > > > Going back to the question of what memory type we should
> > > > > > use to
> > > > > > register our
> > > > > vram to core mm. HMM currently support two types: PRIVATE and
> > > > > COHERENT.
> > > > > The coherent type requires some HW and BIOS support which we
> > > > > don't have
> > > > > right now. So the only available is PRIVATE. We have not
> > > > > other
> > > > > option right now.
> > > > > As said, we plan to support coherent type where we can avoid
> > > > > unnecessary data
> > > > > migration. But that is stage 2.
> > > > > > 
> > > > > 
> > > > > Thanks for the explaination. After reading your replies, the
> > > > > HMM
> > > > > doc,
> > > > > and looking at code this all makes sense.
> > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > For GPU device which supports coherent memory
> > > > > > > > > > protocol
> > > > > > > > > > b/t CPU and
> > > > > > > > > > GPU (such as CXL and CAPI protocol), we can remap
> > > > > > > > > > device memory as
> > > > > > > > > > MEMORY_DEVICE_COHERENT. This is TBD.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > > > > > ---
> > > > > > > > > >  drivers/gpu/drm/xe/Makefile          |  3 +-
> > > > > > > > > >  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
> > > > > > > > > >  drivers/gpu/drm/xe/xe_mmio.c         |  8 +++
> > > > > > > > > >  drivers/gpu/drm/xe/xe_svm.h          | 14 +++++
> > > > > > > > > >  drivers/gpu/drm/xe/xe_svm_devmem.c   | 91
> > > > > > > > > ++++++++++++++++++++++++++++
> > > > > > > > > >  5 files changed, 124 insertions(+), 1 deletion(-)
> > > > > > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > > >  create mode 100644
> > > > > > > > > > drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > > b/drivers/gpu/drm/xe/Makefile
> > > > > > > > > > index c531210695db..840467080e59 100644
> > > > > > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > > > > > @@ -142,7 +142,8 @@ xe-y += xe_bb.o \
> > > > > > > > > >  	xe_vram_freq.o \
> > > > > > > > > >  	xe_wait_user_fence.o \
> > > > > > > > > >  	xe_wa.o \
> > > > > > > > > > -	xe_wopcm.o
> > > > > > > > > > +	xe_wopcm.o \
> > > > > > > > > > +	xe_svm_devmem.o
> > > > > > > > > 
> > > > > > > > > These should be in alphabetical order.
> > > > > > > > 
> > > > > > > > Will fix
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > >  # graphics hardware monitoring (HWMON) support
> > > > > > > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > > > index 9785eef2e5a4..f27c3bee8ce7 100644
> > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > > > > > > > @@ -99,6 +99,15 @@ struct xe_mem_region {
> > > > > > > > > >  	resource_size_t actual_physical_size;
> > > > > > > > > >  	/** @mapping: pointer to VRAM mappable
> > > > > > > > > > space
> > > > > > > > > > */
> > > > > > > > > >  	void __iomem *mapping;
> > > > > > > > > > +	/** @pagemap: Used to remap device memory
> > > > > > > > > > as
> > > > > > > > > > ZONE_DEVICE
> > > > > */
> > > > > > > > > > +    struct dev_pagemap pagemap;
> > > > > > > > > > +    /**
> > > > > > > > > > +     * @hpa_base: base host physical address
> > > > > > > > > > +     *
> > > > > > > > > > +     * This is generated when remap device memory
> > > > > > > > > > as
> > > > > > > > > > ZONE_DEVICE
> > > > > > > > > > +     */
> > > > > > > > > > +    resource_size_t hpa_base;
> > > > > > > > > 
> > > > > > > > > Weird indentation. This goes for the entire series,
> > > > > > > > > look
> > > > > > > > > at checkpatch.
> > > > > > > > 
> > > > > > > > Will fix
> > > > > > > > > 
> > > > > > > > > > +
> > > > > > > > > >  };
> > > > > > > > > > 
> > > > > > > > > >  /**
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > > > index e3db3a178760..0d795394bc4c 100644
> > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_mmio.c
> > > > > > > > > > @@ -22,6 +22,7 @@
> > > > > > > > > >  #include "xe_module.h"
> > > > > > > > > >  #include "xe_sriov.h"
> > > > > > > > > >  #include "xe_tile.h"
> > > > > > > > > > +#include "xe_svm.h"
> > > > > > > > > > 
> > > > > > > > > >  #define
> > > > > > > > > > XEHP_MTCFG_ADDR		XE_REG(0x101800)
> > > > > > > > > >  #define TILE_COUNT		REG_GENMASK(15, 8)
> > > > > > > > > > @@ -286,6 +287,7 @@ int xe_mmio_probe_vram(struct
> > > > > > > > > > xe_device *xe)
> > > > > > > > > >  		}
> > > > > > > > > > 
> > > > > > > > > >  		io_size -= min_t(u64, tile_size,
> > > > > > > > > > io_size);
> > > > > > > > > > +		xe_svm_devm_add(tile, &tile-
> > > > > > > > > > > mem.vram);
> > > > > > > > > 
> > > > > > > > > Do we want to do this probe for all devices with VRAM
> > > > > > > > > or
> > > > > > > > > only a subset?
> > > > > > > > 
> > > > > > > > All
> > > > > > > 
> > > > > > > Can you explain why?
> > > > > > 
> > > > > > It is natural for me to add all device memory to hmm. In
> > > > > > hmm
> > > > > > design, device
> > > > > memory is used as a special swap out for system memory. I
> > > > > would
> > > > > ask why we
> > > > > only want to add a subset of vram? By a subset, do you mean
> > > > > only
> > > > > vram of one
> > > > > tile instead of all tiles?
> > > > > > 
> > > > > 
> > > > > I think we talking about different things, my bad on wording
> > > > > in
> > > > > the
> > > > > original question.
> > > > > 
> > > > > Let me ask again - should be calling xe_svm_devm_add on all
> > > > > *platforms*
> > > > > that have VRAM. i.e. Should we do this on PVC but not DG2?
> > > > 
> > > > 
> > > > Oh, I see. Good question. On i915, this feature was only tested
> > > > on
> > > > PVC. We don't have a plan to enable it on older platform than
> > > > PVC.
> > > > 
> > > > Let me add a check here, only enabled it on platform newer than
> > > > PVC
> > > > 
> > > 
> > > Probably actually check 'xe->info.has_usm'.
> > > 
> > > We might want to rename field too and drop the 'usm' nomenclature
> > > but
> > > that can be done later.
> > 
> > Perhaps "has_recoverable_pagefaults" or some for of abbreviation.
> 
> USM has two flavors: a driver allocator and a system allocator.
> 
> Both flavors depend on recoverable page faults. And in our current
> implementation, they depend only on the recoverable page fault HW
> feature. In the future, we might have other implementations that
> depend on more HW features, such as ATS (address translation
> service).
> 
> So at least for now, "has_usm" and "has_recoverable_pagefaults" are
> pretty much the same thing. I will keep has_usm for now, and am open
> to a change in the future.
> 
> > 
> > Another question w r t this is whether we should do this
> > unconditionally even on platforms that support it. Adding a
> > struct_page
> > per VRAM page would potentially consume a significant amount of
> > system
> > memory.
> 
> Yes, I was thinking of adding a kernel configuration option so this
> feature can be enabled/disabled at compile time. Do you want me to
> add one?

Yes I think that is a good idea.

We also might want to add per-device enabling in sysfs, or
set up on first use and disable it if unused at shrinking time.

But IMO let's start with a kernel configuration option for now.

/Thomas
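
For reference, a rough sketch of what the build-time option plus the
has_usm guard discussed above could look like (the Kconfig symbol name
and the exact placement are illustrative, not taken from this series):

config DRM_XE_DEVMEM_MIRROR
	bool "Back GPU VRAM with struct pages for SVM/hmm migration"
	depends on DRM_XE
	default n
	help
	  Remap GPU VRAM with devm_memremap_pages() so every VRAM page is
	  backed by a struct page and can take part in hmm migration. This
	  costs additional system memory proportional to the VRAM size.

and in xe_mmio_probe_vram(), something along the lines of:

	if (IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) && xe->info.has_usm)
		xe_svm_devm_add(tile, &tile->mem.vram);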


> 
> Oak 
> > 
> > /Thomas
> > 
> > 
> > 
> > > 
> > > Matt
> > > 
> > > > Oak
> > > > 
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > Oak
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > >  	}
> > > > > > > > > > 
> > > > > > > > > >  	xe->mem.vram.actual_physical_size =
> > > > > > > > > > total_size;
> > > > > > > > > > @@ -354,10 +356,16 @@ void
> > > > > > > > > > xe_mmio_probe_tiles(struct
> > > > > > > > > > xe_device
> > > > > *xe)
> > > > > > > > > >  static void mmio_fini(struct drm_device *drm, void
> > > > > > > > > > *arg)
> > > > > > > > > >  {
> > > > > > > > > >  	struct xe_device *xe = arg;
> > > > > > > > > > +    struct xe_tile *tile;
> > > > > > > > > > +    u8 id;
> > > > > > > > > > 
> > > > > > > > > >  	pci_iounmap(to_pci_dev(xe->drm.dev), xe-
> > > > > > > > > > > mmio.regs);
> > > > > > > > > >  	if (xe->mem.vram.mapping)
> > > > > > > > > >  		iounmap(xe->mem.vram.mapping);
> > > > > > > > > > +
> > > > > > > > > > +	for_each_tile(tile, xe, id)
> > > > > > > > > > +		xe_svm_devm_remove(xe, &tile-
> > > > > > > > > > > mem.vram);
> > > > > > > > > 
> > > > > > > > > This should probably be above existing code. Typical
> > > > > > > > > on
> > > > > > > > > fini to do
> > > > > > > > > things in reverse order from init.
> > > > > > > > 
> > > > > > > > Will fix
> > > > > > > > > 
> > > > > > > > > > +
> > > > > > > > > >  }
> > > > > > > > > > 
> > > > > > > > > >  static int xe_verify_lmem_ready(struct xe_device
> > > > > > > > > > *xe)
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 000000000000..09f9afb0e7d4
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > > > > > > @@ -0,0 +1,14 @@
> > > > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > > > +/*
> > > > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > > > > 
> > > > > > > > > 2024?
> > > > > > > > 
> > > > > > > > This patch was actually written 2023
> > > > > > > > > 
> > > > > > > > > > + */
> > > > > > > > > > +
> > > > > > > > > > +#ifndef __XE_SVM_H
> > > > > > > > > > +#define __XE_SVM_H
> > > > > > > > > > +
> > > > > > > > > > +#include "xe_device_types.h"
> > > > > > > > > 
> > > > > > > > > I don't think you need to include this. Rather just
> > > > > > > > > forward decl structs
> > > > > > > > > used here.
> > > > > > > > 
> > > > > > > > Will fix
> > > > > > > > > 
> > > > > > > > > e.g.
> > > > > > > > > 
> > > > > > > > > struct xe_device;
> > > > > > > > > struct xe_mem_region;
> > > > > > > > > struct xe_tile;
> > > > > > > > > 
> > > > > > > > > > +
> > > > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct
> > > > > > > > > > xe_mem_region
> > > > > *mem);
> > > > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe,
> > > > > > > > > > struct
> > > > > xe_mem_region
> > > > > > > > > *mem);
> > > > > > > > > 
> > > > > > > > > The arguments here are incongruent here. Typically we
> > > > > > > > > want these to
> > > > > > > > > match.
> > > > > > > > 
> > > > > > > > Will fix
> > > > > > > > > 
> > > > > > > > > > +
> > > > > > > > > > +#endif
> > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > > 
> > > > > > > > > Incongruent between xe_svm.h and xe_svm_devmem.c.
> > > > > > > > 
> > > > > > > > Did you mean mem vs mr? if yes, will fix
> > > > > > > > 
> > > > > > > > Again these two
> > > > > > > > > should
> > > > > > > > > match.
> > > > > > > > > 
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 000000000000..63b7a1961cc6
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > > > > > > > @@ -0,0 +1,91 @@
> > > > > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > > > > +/*
> > > > > > > > > > + * Copyright © 2023 Intel Corporation
> > > > > > > > > 
> > > > > > > > > 2024?
> > > > > > > > It is from 2023
> > > > > > > > > 
> > > > > > > > > > + */
> > > > > > > > > > +
> > > > > > > > > > +#include <linux/mm_types.h>
> > > > > > > > > > +#include <linux/sched/mm.h>
> > > > > > > > > > +
> > > > > > > > > > +#include "xe_device_types.h"
> > > > > > > > > > +#include "xe_trace.h"
> > > > > > > > > 
> > > > > > > > > xe_trace.h appears to be unused.
> > > > > > > > 
> > > > > > > > Will fix
> > > > > > > > > 
> > > > > > > > > > +#include "xe_svm.h"
> > > > > > > > > > +
> > > > > > > > > > +
> > > > > > > > > > +static vm_fault_t xe_devm_migrate_to_ram(struct
> > > > > > > > > > vm_fault *vmf)
> > > > > > > > > > +{
> > > > > > > > > > +	return 0;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static void xe_devm_page_free(struct page *page)
> > > > > > > > > > +{
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static const struct dev_pagemap_ops
> > > > > > > > > > xe_devm_pagemap_ops = {
> > > > > > > > > > +	.page_free = xe_devm_page_free,
> > > > > > > > > > +	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > > > > > > > +};
> > > > > > > > > > +
> > > > > > > > > 
> > > > > > > > > Assume these are placeholders that will be populated
> > > > > > > > > later?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > corrrect
> > > > > > > > > 
> > > > > > > > > > +/**
> > > > > > > > > > + * xe_svm_devm_add: Remap and provide memmap
> > > > > > > > > > backing
> > > > > > > > > > for
> > > > > device
> > > > > > > > > memory
> > > > > > > > > > + * @tile: tile that the memory region blongs to
> > > > > > > > > > + * @mr: memory region to remap
> > > > > > > > > > + *
> > > > > > > > > > + * This remap device memory to host physical
> > > > > > > > > > address
> > > > > > > > > > space and create
> > > > > > > > > > + * struct page to back device memory
> > > > > > > > > > + *
> > > > > > > > > > + * Return: 0 on success standard error code
> > > > > > > > > > otherwise
> > > > > > > > > > + */
> > > > > > > > > > +int xe_svm_devm_add(struct xe_tile *tile, struct
> > > > > > > > > > xe_mem_region *mr)
> > > > > > > > > 
> > > > > > > > > Here I see the xe_mem_region is from tile->mem.vram,
> > > > > > > > > wondering rather
> > > > > > > > > than using the tile->mem.vram we should use xe-
> > > > > > > > > >mem.vram
> > > > > > > > > when
> > > > > enabling
> > > > > > > > > svm? Isn't the idea behind svm the entire memory is 1
> > > > > > > > > view?
> > > > > > > > 
> > > > > > > > Still need to use tile. The reason is, memory of
> > > > > > > > different
> > > > > > > > tile can have
> > > > > different
> > > > > > > characteristics, such as latency. So we want to
> > > > > > > differentiate
> > > > > > > memory b/t tiles
> > > > > also
> > > > > > > in svm. I need to change below " mr->pagemap.owner =
> > > > > > > tile-
> > > > > > > > xe->drm.dev ".
> > > > > the
> > > > > > > owner should also be tile. This is the way hmm
> > > > > > > differentiate
> > > > > > > memory of
> > > > > different
> > > > > > > tile.
> > > > > > > > 
> > > > > > > > With svm it is 1 view, from virtual address space
> > > > > > > > perspective and from
> > > > > physical
> > > > > > > struct page perspective. You can think of all the tile's
> > > > > > > vram
> > > > > > > is stacked together
> > > > > to
> > > > > > > form a unified view together with system memory. This
> > > > > > > doesn't
> > > > > > > prohibit us
> > > > > from
> > > > > > > differentiate memory from different tile. This
> > > > > > > differentiation allow us to
> > > > > optimize
> > > > > > > performance, i.e., we can wisely place memory in specific
> > > > > > > tile. If we don't
> > > > > > > differentiate, this is not possible.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok makes sense.
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > I suppose if we do that we also only use 1 TTM VRAM
> > > > > > > > > manager / buddy
> > > > > > > > > allocator too. I thought I saw some patches flying
> > > > > > > > > around
> > > > > > > > > for that too.
> > > > > > > > 
> > > > > > > > Ttm vram manager is not in the picture. We deliberately
> > > > > > > > avoided it per
> > > > > previous
> > > > > > > discussion
> > > > > > > > 
> > > > > > > > Yes same buddy allocator. It is in my previous POC:
> > > > > https://lore.kernel.org/dri-
> > > > > > > devel/20240117221223.18540-12-oak.zeng@intel.com/. I
> > > > > > > didn't
> > > > > > > put those
> > > > > patches
> > > > > > > in this series because I want to merge this small patches
> > > > > > > separately.
> > > > > > > > > 
> > > > > > > > > > +{
> > > > > > > > > > +	struct device *dev = &to_pci_dev(tile->xe-
> > > > > > > > > > > drm.dev)->dev;
> > > > > > > > > > +	struct resource *res;
> > > > > > > > > > +	void *addr;
> > > > > > > > > > +	int ret;
> > > > > > > > > > +
> > > > > > > > > > +	res = devm_request_free_mem_region(dev,
> > > > > > > > > > &iomem_resource,
> > > > > > > > > > +					   mr-
> > > > > > > > > > > usable_size);
> > > > > > > > > > +	if (IS_ERR(res)) {
> > > > > > > > > > +		ret = PTR_ERR(res);
> > > > > > > > > > +		return ret;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	mr->pagemap.type = MEMORY_DEVICE_PRIVATE;
> > > > > > > > > > +	mr->pagemap.range.start = res->start;
> > > > > > > > > > +	mr->pagemap.range.end = res->end;
> > > > > > > > > > +	mr->pagemap.nr_range = 1;
> > > > > > > > > > +	mr->pagemap.ops = &xe_devm_pagemap_ops;
> > > > > > > > > > +	mr->pagemap.owner = tile->xe->drm.dev;
> > > > > > > > > > +	addr = devm_memremap_pages(dev, &mr-
> > > > > > > > > > >pagemap);
> > > > > > > > > > +	if (IS_ERR(addr)) {
> > > > > > > > > > +		devm_release_mem_region(dev, res-
> > > > > > > > > > > start,
> > > > > resource_size(res));
> > > > > > > > > > +		ret = PTR_ERR(addr);
> > > > > > > > > > +		drm_err(&tile->xe->drm, "Failed to
> > > > > > > > > > remap tile %d
> > > > > memory,
> > > > > > > > > errno %d\n",
> > > > > > > > > > +				tile->id, ret);
> > > > > > > > > > +		return ret;
> > > > > > > > > > +	}
> > > > > > > > > > +	mr->hpa_base = res->start;
> > > > > > > > > > +
> > > > > > > > > > +	drm_info(&tile->xe->drm, "Added tile %d
> > > > > > > > > > memory
> > > > > > > > > > [%llx-%llx] to
> > > > > devm,
> > > > > > > > > remapped to %pr\n",
> > > > > > > > > > +			tile->id, mr->io_start,
> > > > > > > > > > mr-
> > > > > > > > > > > io_start + mr-
> > > > > > usable_size,
> > > > > > > > > res);
> > > > > > > > > > +	return 0;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +/**
> > > > > > > > > > + * xe_svm_devm_remove: Unmap device memory and
> > > > > > > > > > free
> > > > > > > > > > resources
> > > > > > > > > > + * @xe: xe device
> > > > > > > > > > + * @mr: memory region to remove
> > > > > > > > > > + */
> > > > > > > > > > +void xe_svm_devm_remove(struct xe_device *xe,
> > > > > > > > > > struct
> > > > > xe_mem_region
> > > > > > > > > *mr)
> > > > > > > > > > +{
> > > > > > > > > > +	/*FIXME: below cause a kernel hange during
> > > > > > > > > > moduel remove*/
> > > > > > > > > > +#if 0
> > > > > > > > > > +	struct device *dev = &to_pci_dev(xe-
> > > > > > > > > > >drm.dev)-
> > > > > > > > > > > dev;
> > > > > > > > > > +
> > > > > > > > > > +	if (mr->hpa_base) {
> > > > > > > > > > +		devm_memunmap_pages(dev, &mr-
> > > > > > > > > > > pagemap);
> > > > > > > > > > +		devm_release_mem_region(dev, mr-
> > > > > > pagemap.range.start,
> > > > > > > > > > +			mr->pagemap.range.end -
> > > > > > > > > > mr-
> > > > > > pagemap.range.start +1);
> > > > > > > > > > +	}
> > > > > > > > > > +#endif
> > > > > > > > > 
> > > > > > > > > This would need to be fixed too.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Yes...
> > > > > > > > 
> > > > > > > > Oak
> > > > > > > > > 
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > --
> > > > > > > > > > 2.26.3
> > > > > > > > > > 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-18 15:40       ` Hellstrom, Thomas
@ 2024-03-18 16:09         ` Zeng, Oak
  0 siblings, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-18 16:09 UTC (permalink / raw)
  To: Hellstrom, Thomas, intel-xe
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Monday, March 18, 2024 11:40 AM
> To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> Hi,
> 
> On Mon, 2024-03-18 at 14:49 +0000, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > > Sent: Monday, March 18, 2024 9:13 AM
> > > To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or
> > > hmmptr
> > >
> > > Hi, Oak,
> > >
> > > Found another thing, see below:
> > >
> > > On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > > > Add a helper function xe_hmm_populate_range to populate
> > > > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > > > to read CPU page tables and populate all pfns/pages of this
> > > > virtual address range.
> > > >
> > > > If the populated page is system memory page, dma-mapping is
> > > > performed
> > > > to get a dma-address which can be used later for GPU to access
> > > > pages.
> > > >
> > > > If the populated page is device private page, we calculate the
> > > > dpa (
> > > > device physical address) of the page.
> > > >
> > > > The dma-address or dpa is then saved in userptr's sg table. This
> > > > is
> > > > prepare work to replace the get_user_pages_fast code in userptr
> > > > code
> > > > path. The helper function will also be used to populate hmmptr
> > > > later.
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > > >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > > > ++++++++++++++++++++++++++++++++++++
> > > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > b/drivers/gpu/drm/xe/Makefile
> > > > index 840467080e59..29dcbc938b01 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > > >  	xe_wait_user_fence.o \
> > > >  	xe_wa.o \
> > > >  	xe_wopcm.o \
> > > > -	xe_svm_devmem.o
> > > > +	xe_svm_devmem.o \
> > > > +	xe_hmm.o
> > > >
> > > >  # graphics hardware monitoring (HWMON) support
> > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > > > b/drivers/gpu/drm/xe/xe_hmm.c
> > > > new file mode 100644
> > > > index 000000000000..c45c2447d386
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > > @@ -0,0 +1,213 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/mmu_notifier.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/memremap.h>
> > > > +#include <linux/swap.h>
> > > > +#include <linux/mm.h>
> > > > +#include "xe_hmm.h"
> > > > +#include "xe_vm.h"
> > > > +
> > > > +/**
> > > > + * mark_range_accessed() - mark a range is accessed, so core mm
> > > > + * have such information for memory eviction or write back to
> > > > + * hard disk
> > > > + *
> > > > + * @range: the range to mark
> > > > + * @write: if write to this range, we mark pages in this range
> > > > + * as dirty
> > > > + */
> > > > +static void mark_range_accessed(struct hmm_range *range, bool
> > > > write)
> > > > +{
> > > > +	struct page *page;
> > > > +	u64 i, npages;
> > > > +
> > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > +		if (write) {
> > > > +			lock_page(page);
> > > > +			set_page_dirty(page);
> > > > +			unlock_page(page);
> > > > +		}
> > > > +		mark_page_accessed(page);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * build_sg() - build a scatter gather table for all the
> > > > physical
> > > > pages/pfn
> > > > + * in a hmm_range. dma-address is save in sg table and will be
> > > > used
> > > > to program
> > > > + * GPU page table later.
> > > > + *
> > > > + * @xe: the xe device who will access the dma-address in sg
> > > > table
> > > > + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> > > > + * has the pfn numbers of pages that back up this hmm address
> > > > range.
> > > > + * @st: pointer to the sg table.
> > > > + * @write: whether we write to this range. This decides dma map
> > > > direction
> > > > + * for system pages. If write we map it bi-diretional; otherwise
> > > > + * DMA_TO_DEVICE
> > > > + *
> > > > + * All the contiguous pfns will be collapsed into one entry in
> > > > + * the scatter gather table. This is for the convenience of
> > > > + * later on operations to bind address range to GPU page table.
> > > > + *
> > > > + * The dma_address in the sg table will later be used by GPU to
> > > > + * access memory. So if the memory is system memory, we need to
> > > > + * do a dma-mapping so it can be accessed by GPU/DMA. If the
> > > > memory
> > > > + * is GPU local memory (of the GPU who is going to access
> > > > memory),
> > > > + * we need gpu dpa (device physical address), and there is no
> > > > need
> > > > + * of dma-mapping.
> > > > + *
> > > > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > > > + * memory. Add this when you support p2p
> > > > + *
> > > > + * This function allocates the storage of the sg table. It is
> > > > + * caller's responsibility to free it calling sg_free_table.
> > > > + *
> > > > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > > > + */
> > > > +static int build_sg(struct xe_device *xe, struct hmm_range
> > > > *range,
> > > > +			     struct sg_table *st, bool write)
> > > > +{
> > > > +	struct device *dev = xe->drm.dev;
> > > > +	struct scatterlist *sg;
> > > > +	u64 i, npages;
> > > > +
> > > > +	sg = NULL;
> > > > +	st->nents = 0;
> > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> > > > +
> > > > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > > > +		return -ENOMEM;
> > > > +
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		struct page *page;
> > > > +		unsigned long addr;
> > > > +		struct xe_mem_region *mr;
> > > > +
> > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > +		if (is_device_private_page(page)) {
> > > > +			mr = page_to_mem_region(page);
> > > > +			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
> > > > +		} else {
> > > > +			addr = dma_map_page(dev, page, 0,
> > > > PAGE_SIZE,
> > > > +					write ?
> > > > DMA_BIDIRECTIONAL :
> > > > DMA_TO_DEVICE);
> > > > +		}
> > > > +
> > > > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > > > +			sg->length += PAGE_SIZE;
> > > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +		sg =  sg ? sg_next(sg) : st->sgl;
> > > > +		sg_dma_address(sg) = addr;
> > > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > > +		sg->length = PAGE_SIZE;
> > > > +		st->nents++;
> > > > +	}
> > > > +
> > > > +	sg_mark_end(sg);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_hmm_populate_range() - Populate physical pages of a
> > > > virtual
> > > > + * address range
> > > > + *
> > > > + * @vma: vma has information of the range to populate. only vma
> > > > + * of userptr and hmmptr type can be populated.
> > > > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > > > + * will hold the populated pfns.
> > > > + * @write: populate pages with write permission
> > > > + *
> > > > + * This function populate the physical pages of a virtual
> > > > + * address range. The populated physical pages is saved in
> > > > + * userptr's sg table. It is similar to get_user_pages but call
> > > > + * hmm_range_fault.
> > > > + *
> > > > + * This function also read mmu notifier sequence # (
> > > > + * mmu_interval_read_begin), for the purpose of later
> > > > + * comparison (through mmu_interval_read_retry).
> > > > + *
> > > > + * This must be called with mmap read or write lock held.
> > > > + *
> > > > + * This function allocates the storage of the userptr sg table.
> > > > + * It is caller's responsibility to free it calling
> > > > sg_free_table.
> > > > + *
> > > > + * returns: 0 for succuss; negative error no on failure
> > > > + */
> > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > > *hmm_range,
> > > > +						bool write)
> > > > +{
> > > > +	unsigned long timeout =
> > > > +		jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > > +	struct xe_userptr_vma *userptr_vma;
> > > > +	struct xe_userptr *userptr;
> > > > +	u64 start = vma->gpuva.va.addr;
> > > > +	u64 end = start + vma->gpuva.va.range;
> > > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > > +	u64 npages;
> > > > +	int ret;
> > > > +
> > > > +	userptr_vma = to_userptr_vma(vma);
> > > > +	userptr = &userptr_vma->userptr;
> > > > +	mmap_assert_locked(userptr->notifier.mm);
> > > > +
> > > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >>
> > > > PAGE_SHIFT) +
> > > > 1;
> > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > +	if (unlikely(!pfns))
> > > > +		return -ENOMEM;
> > > > +
> > > > +	if (write)
> > > > +		flags |= HMM_PFN_REQ_WRITE;
> > > > +
> > > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > > > +	hmm_range->hmm_pfns = pfns;
> > > > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
> > > > +	hmm_range->notifier = &userptr->notifier;
> > > > +	hmm_range->start = start;
> > > > +	hmm_range->end = end;
> > > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > > > HMM_PFN_REQ_WRITE;
> > > > +	/**
> > > > +	 * FIXME:
> > > > +	 * Set the the dev_private_owner can prevent
> > > > hmm_range_fault
> > > > to fault
> > > > +	 * in the device private pages owned by caller. See
> > > > function
> > > > +	 * hmm_vma_handle_pte. In multiple GPU case, this should
> > > > be
> > > > set to the
> > > > +	 * device owner of the best migration destination. e.g.,
> > > > device0/vm0
> > > > +	 * has a page fault, but we have determined the best
> > > > placement of
> > > > +	 * the fault address should be on device1, we should set
> > > > below to
> > > > +	 * device1 instead of device0.
> > > > +	 */
> > > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > > +
> > > > +	while (true) {
> > > > +		ret = hmm_range_fault(hmm_range);
> > > > +		if (time_after(jiffies, timeout))
> > > > +			break;
> > > > +
> > > > +		if (ret == -EBUSY)
> > > > +			continue;
> > >
> > > If (ret == -EBUSY) it looks from the hmm_range_fault()
> > > implementation
> > > like hmm_range->notifier_seq has become invalid and without calling
> > > mmu_interval_read_begin() again, we will end up in an infinite
> > > loop?
> > >
> >
> > I noticed this thing before and had a read_begin in the while loop.
> > But on second thought, function xe_hmm_populate_range is called
> > inside the mmap lock, so after the read_begin is called above, there
> > can't be an invalidation before mmap unlock. So theoretically EBUSY
> > can't happen?
> >
> > Oak
> 
> Invalidations can happen due to many different things that don't need
> the mmap lock. File truncation and I think page reclaim are typical
> examples.

I see. I will move the read_begin into the loop then. Thanks!

Oak
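
For context, a minimal sketch of the reworked loop, following the usual
hmm_range_fault() retry pattern of re-reading the notifier sequence
before every attempt (not the final patch):

	while (true) {
		hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
		ret = hmm_range_fault(hmm_range);
		if (ret == -EBUSY) {
			if (time_after(jiffies, timeout))
				break;
			continue;
		}
		break;
	}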

> 
> /Thomas
> 
> 
> >
> > > /Thomas
> > >
> > >
> > >
> > > > +		break;
> > > > +	}
> > > > +
> > > > +	if (ret)
> > > > +		goto free_pfns;
> > > > +
> > > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > > +	if (ret)
> > > > +		goto free_pfns;
> > > > +
> > > > +	mark_range_accessed(hmm_range, write);
> > > > +	userptr->sg = &userptr->sgt;
> > > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > > > +
> > > > +free_pfns:
> > > > +	kvfree(pfns);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> > > > b/drivers/gpu/drm/xe/xe_hmm.h
> > > > new file mode 100644
> > > > index 000000000000..960f3f6d36ae
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > > @@ -0,0 +1,12 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/types.h>
> > > > +#include <linux/hmm.h>
> > > > +#include "xe_vm_types.h"
> > > > +#include "xe_svm.h"
> > > > +
> > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > > *hmm_range,
> > > > +						bool write);
> >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
  2024-03-16  1:33       ` Matthew Brost
@ 2024-03-18 19:25         ` Zeng, Oak
  0 siblings, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-18 19:25 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad



> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Friday, March 15, 2024 9:34 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
> 
> On Fri, Mar 15, 2024 at 11:29:33AM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: Thursday, March 14, 2024 1:39 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> > > <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> > > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
> > >
> > > On Wed, Mar 13, 2024 at 11:35:51PM -0400, Oak Zeng wrote:
> > > > Since we now create struct page backing for each vram page,
> > > > each vram page now also has a pfn, just like system memory.
> > > > This allow us to calcuate device physical address from pfn.
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > > index bbea40b57e84..bf349321f037 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > @@ -576,4 +576,12 @@ static inline struct xe_tile
> *mem_region_to_tile(struct
> > > xe_mem_region *mr)
> > > >  	return container_of(mr, struct xe_tile, mem.vram);
> > > >  }
> > > >
> > > > +static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
> > > > +{
> > > > +	u64 dpa;
> > > > +	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> > >
> > > Can't this be negative?
> > >
> > > e.g. if pfn == 0, offset == -mr->hpa_base.
> > >
> > > Or is the assumption (pfn << PAGE_SHIFT) is always > mr->hpa_base?
> > >
> > > If so can we an assert or something to ensure we using this function correctly.
> >
> > Yes we can assert it. The hpa_base is the host physical base address for this
> memory region, while pfn should point to a page inside this memory region.
> >
> > I will add an assertion.
> >
> >
> > >
> > > > +	dpa = mr->dpa_base + offset;
> > > > +	return dpa;
> > > > +}
> > >
> > > Same as previous patch, should be *.h not a *_types.h file.
> >
> > Yes will fix.
> > >
> > > Also as this is xe_mem_region not explictly vram. Maybe:
> > >
> > > s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa/
> >
> > Xe_mem_region can only represent vram, right? I mean it can't represent
> system memory. Copied the definition below:
> >
> > /**
> >  * struct xe_mem_region - memory region structure
> >  * This is used to describe a memory region in xe
> >  * device, such as HBM memory or CXL extension memory.
> >  */
> >
> 
> Ah yes but still as the first argument is xe_mem_region I think the
> function name should reflect that.

Sure, will rename it then.

Oak
> 
> Matt
> 
> > Oak
> >
> > >
> > > Matt
> > >
> > > > +
> > > >  #endif
> > > > --
> > > > 2.26.3
> > > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
  2024-03-18 12:09     ` Hellstrom, Thomas
@ 2024-03-18 19:27       ` Zeng, Oak
  0 siblings, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-18 19:27 UTC (permalink / raw)
  To: Hellstrom, Thomas, Brost, Matthew
  Cc: intel-xe, Welty,  Brian, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Monday, March 18, 2024 8:10 AM
> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Welty, Brian <brian.welty@intel.com>;
> airlied@gmail.com; Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 3/5] drm/xe: Helper to get dpa from pfn
> 
> On Thu, 2024-03-14 at 17:39 +0000, Matthew Brost wrote:
> > On Wed, Mar 13, 2024 at 11:35:51PM -0400, Oak Zeng wrote:
> > > Since we now create struct page backing for each vram page,
> > > each vram page now also has a pfn, just like system memory.
> > > This allow us to calcuate device physical address from pfn.
> 
> Please use imperative language according to the patch guidelines:
> Something like "Add a" or "Introduce A"
> 
> > >
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device_types.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index bbea40b57e84..bf349321f037 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -576,4 +576,12 @@ static inline struct xe_tile
> > > *mem_region_to_tile(struct xe_mem_region *mr)
> > >  	return container_of(mr, struct xe_tile, mem.vram);
> > >  }
> > >
> > > +static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64
> > > pfn)
> 
> static inline header functions also need kerneldoc unless strong
> reasons not to.

Good point. Will document it.

Oak
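
For illustration only, with the kerneldoc and the range assertion asked
for in this and the sibling thread, the helper might end up roughly like
the sketch below (the assert macro choice is an assumption; Matt also
suggested renaming it to xe_mem_region_pfn_to_dpa to match the first
argument):

/**
 * vram_pfn_to_dpa() - Translate the pfn of a VRAM page to a device
 * physical address
 * @mr: memory region the pfn belongs to
 * @pfn: page frame number of a page inside @mr
 *
 * Return: the device physical address backing @pfn.
 */
static inline u64 vram_pfn_to_dpa(struct xe_mem_region *mr, u64 pfn)
{
	u64 offset;

	/* The pfn must point into this region, i.e. above hpa_base */
	XE_WARN_ON((pfn << PAGE_SHIFT) < mr->hpa_base);
	offset = (pfn << PAGE_SHIFT) - mr->hpa_base;

	return mr->dpa_base + offset;
}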

> 
> /Thomas
> 
> 
> 
> > > +{
> > > +	u64 dpa;
> > > +	u64 offset = (pfn << PAGE_SHIFT) - mr->hpa_base;
> >
> > Can't this be negative?
> >
> > e.g. if pfn == 0, offset == -mr->hpa_base.
> >
> > Or is the assumption (pfn << PAGE_SHIFT) is always > mr->hpa_base?
> >
> > If so can we an assert or something to ensure we using this function
> > correctly.
> >
> > > +	dpa = mr->dpa_base + offset;
> > > +	return dpa;
> > > +}
> >
> > Same as previous patch, should be *.h not a *_types.h file.
> >
> > Also as this is xe_mem_region not explictly vram. Maybe:
> >
> > s/vram_pfn_to_dpa/xe_mem_region_pfn_to_dpa/
> >
> > Matt
> >
> > > +
> > >  #endif
> > > --
> > > 2.26.3
> > >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-18 11:53   ` Hellstrom, Thomas
@ 2024-03-18 19:50     ` Zeng, Oak
  2024-03-19  8:41       ` Hellstrom, Thomas
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-18 19:50 UTC (permalink / raw)
  To: Hellstrom, Thomas, intel-xe
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Monday, March 18, 2024 7:54 AM
> To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> Hi, Oak.
> 
> 
> On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > Add a helper function xe_hmm_populate_range to populate
> > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > to read CPU page tables and populate all pfns/pages of this
> > virtual address range.
> >
> > If the populated page is system memory page, dma-mapping is performed
> > to get a dma-address which can be used later for GPU to access pages.
> >
> > If the populated page is device private page, we calculate the dpa (
> > device physical address) of the page.
> >
> > The dma-address or dpa is then saved in userptr's sg table. This is
> > prepare work to replace the get_user_pages_fast code in userptr code
> > path. The helper function will also be used to populate hmmptr later.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > Cc: Brian Welty <brian.welty@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile |   3 +-
> >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > ++++++++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> >  3 files changed, 227 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> 
> I mostly agree with Matt's review comments on this patch. Some
> additional below.
> 
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index 840467080e59..29dcbc938b01 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> >  	xe_wait_user_fence.o \
> >  	xe_wa.o \
> >  	xe_wopcm.o \
> > -	xe_svm_devmem.o
> > +	xe_svm_devmem.o \
> > +	xe_hmm.o
> >
> >  # graphics hardware monitoring (HWMON) support
> >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > b/drivers/gpu/drm/xe/xe_hmm.c
> > new file mode 100644
> > index 000000000000..c45c2447d386
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > @@ -0,0 +1,213 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/memremap.h>
> > +#include <linux/swap.h>
> > +#include <linux/mm.h>
> > +#include "xe_hmm.h"
> > +#include "xe_vm.h"
> > +
> > +/**
> > + * mark_range_accessed() - mark a range is accessed, so core mm
> > + * have such information for memory eviction or write back to
> > + * hard disk
> > + *
> > + * @range: the range to mark
> > + * @write: if write to this range, we mark pages in this range
> > + * as dirty
> > + */
> > +static void mark_range_accessed(struct hmm_range *range, bool write)
> 
> Some of the static function names aren't really unique enough not to
> stand in the way for a future core function name clash. Please consider
> using an xe_ prefix in such cases. It will also make backtraces easier
> to follow.

I will add an xe_ prefix for the backtrace reason...

As I understand it, a static function has file scope, so even if we have a core function with
the same name in the future, as long as they are not in the same file, there won't be any name clash...

> 
> 
> > +{
> > +	struct page *page;
> > +	u64 i, npages;
> > +
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> > PAGE_SHIFT) + 1;
> > +	for (i = 0; i < npages; i++) {
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (write) {
> > +			lock_page(page);
> > +			set_page_dirty(page);
> > +			unlock_page(page);
> 
> Could be using set_page_dirty_lock() here.

Will fix, Thanks
Oak
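
Roughly, the loop then becomes (sketch only):

	for (i = 0; i < npages; i++) {
		struct page *page = hmm_pfn_to_page(range->hmm_pfns[i]);

		/* set_page_dirty_lock() takes and drops the page lock for us */
		if (write)
			set_page_dirty_lock(page);
		mark_page_accessed(page);
	}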

> 
> /Thomas
> 
> 
> > +		}
> > +		mark_page_accessed(page);
> > +	}
> > +}
> > +
> > +/**
> > + * build_sg() - build a scatter gather table for all the physical
> > pages/pfn
> > + * in a hmm_range. dma-address is save in sg table and will be used
> > to program
> > + * GPU page table later.
> > + *
> > + * @xe: the xe device who will access the dma-address in sg table
> > + * @range: the hmm range that we build the sg table from. range-
> > >hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + * @write: whether we write to this range. This decides dma map
> > direction
> > + * for system pages. If write we map it bi-diretional; otherwise
> > + * DMA_TO_DEVICE
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * The dma_address in the sg table will later be used by GPU to
> > + * access memory. So if the memory is system memory, we need to
> > + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> > + * is GPU local memory (of the GPU who is going to access memory),
> > + * we need gpu dpa (device physical address), and there is no need
> > + * of dma-mapping.
> > + *
> > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > + * memory. Add this when you support p2p
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> > +			     struct sg_table *st, bool write)
> > +{
> > +	struct device *dev = xe->drm.dev;
> > +	struct scatterlist *sg;
> > +	u64 i, npages;
> > +
> > +	sg = NULL;
> > +	st->nents = 0;
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> > PAGE_SHIFT) + 1;
> > +
> > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		struct page *page;
> > +		unsigned long addr;
> > +		struct xe_mem_region *mr;
> > +
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (is_device_private_page(page)) {
> > +			mr = page_to_mem_region(page);
> > +			addr = vram_pfn_to_dpa(mr, range-
> > >hmm_pfns[i]);
> > +		} else {
> > +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > +					write ? DMA_BIDIRECTIONAL :
> > DMA_TO_DEVICE);
> > +		}
> > +
> > +		if (sg && (addr == (sg_dma_address(sg) + sg-
> > >length))) {
> > +			sg->length += PAGE_SIZE;
> > +			sg_dma_len(sg) += PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		sg =  sg ? sg_next(sg) : st->sgl;
> > +		sg_dma_address(sg) = addr;
> > +		sg_dma_len(sg) = PAGE_SIZE;
> > +		sg->length = PAGE_SIZE;
> > +		st->nents++;
> > +	}
> > +
> > +	sg_mark_end(sg);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * xe_hmm_populate_range() - Populate physical pages of a virtual
> > + * address range
> > + *
> > + * @vma: vma has information of the range to populate. only vma
> > + * of userptr and hmmptr type can be populated.
> > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > + * will hold the populated pfns.
> > + * @write: populate pages with write permission
> > + *
> > + * This function populate the physical pages of a virtual
> > + * address range. The populated physical pages is saved in
> > + * userptr's sg table. It is similar to get_user_pages but call
> > + * hmm_range_fault.
> > + *
> > + * This function also read mmu notifier sequence # (
> > + * mmu_interval_read_begin), for the purpose of later
> > + * comparison (through mmu_interval_read_retry).
> > + *
> > + * This must be called with mmap read or write lock held.
> > + *
> > + * This function allocates the storage of the userptr sg table.
> > + * It is caller's responsibility to free it calling sg_free_table.
> > + *
> > + * returns: 0 for succuss; negative error no on failure
> > + */
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > *hmm_range,
> > +						bool write)
> > +{
> > +	unsigned long timeout =
> > +		jiffies +
> > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > +	struct xe_userptr_vma *userptr_vma;
> > +	struct xe_userptr *userptr;
> > +	u64 start = vma->gpuva.va.addr;
> > +	u64 end = start + vma->gpuva.va.range;
> > +	struct xe_vm *vm = xe_vma_vm(vma);
> > +	u64 npages;
> > +	int ret;
> > +
> > +	userptr_vma = to_userptr_vma(vma);
> > +	userptr = &userptr_vma->userptr;
> > +	mmap_assert_locked(userptr->notifier.mm);
> > +
> > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) +
> > 1;
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (unlikely(!pfns))
> > +		return -ENOMEM;
> > +
> > +	if (write)
> > +		flags |= HMM_PFN_REQ_WRITE;
> > +
> > +	memset64((u64 *)pfns, (u64)flags, npages);
> > +	hmm_range->hmm_pfns = pfns;
> > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr-
> > >notifier);
> > +	hmm_range->notifier = &userptr->notifier;
> > +	hmm_range->start = start;
> > +	hmm_range->end = end;
> > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > HMM_PFN_REQ_WRITE;
> > +	/**
> > +	 * FIXME:
> > +	 * Set the the dev_private_owner can prevent hmm_range_fault
> > to fault
> > +	 * in the device private pages owned by caller. See function
> > +	 * hmm_vma_handle_pte. In multiple GPU case, this should be
> > set to the
> > +	 * device owner of the best migration destination. e.g.,
> > device0/vm0
> > +	 * has a page fault, but we have determined the best
> > placement of
> > +	 * the fault address should be on device1, we should set
> > below to
> > +	 * device1 instead of device0.
> > +	 */
> > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > +
> > +	while (true) {
> > +		ret = hmm_range_fault(hmm_range);
> > +		if (time_after(jiffies, timeout))
> > +			break;
> > +
> > +		if (ret == -EBUSY)
> > +			continue;
> > +		break;
> > +	}
> > +
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	mark_range_accessed(hmm_range, write);
> > +	userptr->sg = &userptr->sgt;
> > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > +
> > +free_pfns:
> > +	kvfree(pfns);
> > +	return ret;
> > +}
> > +
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> > b/drivers/gpu/drm/xe/xe_hmm.h
> > new file mode 100644
> > index 000000000000..960f3f6d36ae
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > @@ -0,0 +1,12 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/hmm.h>
> > +#include "xe_vm_types.h"
> > +#include "xe_svm.h"
> > +
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > *hmm_range,
> > +						bool write);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages
  2024-03-14 20:54   ` Matthew Brost
@ 2024-03-19  2:36     ` Zeng, Oak
  0 siblings, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-19  2:36 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe, Hellstrom, Thomas, airlied, Welty, Brian, Ghimiray,
	Himal Prasad

Hi Matt,

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Thursday, March 14, 2024 4:55 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; airlied@gmail.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages
> 
> On Wed, Mar 13, 2024 at 11:35:53PM -0400, Oak Zeng wrote:
> > This is an effort to unify hmmptr (aka system allocator)
> > and userptr code. hmm_range_fault is used to populate
> > a virtual address range for both hmmptr and userptr,
> > instead of hmmptr using hmm_range_fault and userptr
> > using get_user_pages_fast.
> >
> > This also aligns with AMD gpu driver's behavior. In
> > long term, we plan to put some common helpers in this
> > area to drm layer so it can be re-used by different
> > vendors.
> >
> > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_vm.c | 105 ++-----------------------------------
> >  1 file changed, 4 insertions(+), 101 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index db3f049a47dc..d6088dcac74a 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -38,6 +38,7 @@
> >  #include "xe_sync.h"
> >  #include "xe_trace.h"
> >  #include "xe_wa.h"
> > +#include "xe_hmm.h"
> >
> >  static struct drm_gem_object *xe_vm_obj(struct xe_vm *vm)
> >  {
> > @@ -65,113 +66,15 @@ int xe_vma_userptr_check_repin(struct
> xe_userptr_vma *uvma)
> >
> >  int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
> 
> See my comments in the previous patch about layer, those comments are
> valid here too.
> 
> >  {
> > -	struct xe_userptr *userptr = &uvma->userptr;
> >  	struct xe_vma *vma = &uvma->vma;
> >  	struct xe_vm *vm = xe_vma_vm(vma);
> >  	struct xe_device *xe = vm->xe;
> > -	const unsigned long num_pages = xe_vma_size(vma) >> PAGE_SHIFT;
> > -	struct page **pages;
> > -	bool in_kthread = !current->mm;
> > -	unsigned long notifier_seq;
> > -	int pinned, ret, i;
> > -	bool read_only = xe_vma_read_only(vma);
> > +	bool write = !xe_vma_read_only(vma);
> > +	struct hmm_range hmm_range;
> >
> >  	lockdep_assert_held(&vm->lock);
> >  	xe_assert(xe, xe_vma_is_userptr(vma));
> > -retry:
> > -	if (vma->gpuva.flags & XE_VMA_DESTROYED)
> > -		return 0;
> 
> ^^^
> This should not be dropped. Both the vma->gpuva.flags & XE_VMA_DESTROYED
> and userptr invalidation check retry loop should still be in here.

I will move this check into hmm.c
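
Roughly, the top of the hmm.c helper would gain something like the below
(sketch; exact placement and the retry handling will be sorted out in v1):

	if (vma->gpuva.flags & XE_VMA_DESTROYED)
		return 0;

	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
	if (notifier_seq == userptr->notifier_seq)
		/* Pages are still valid, nothing to re-populate */
		return 0;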
> 
> > -
> > -	notifier_seq = mmu_interval_read_begin(&userptr->notifier);
> > -	if (notifier_seq == userptr->notifier_seq)
> > -		return 0;
> > -
> > -	pages = kvmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL);
> > -	if (!pages)
> > -		return -ENOMEM;
> > -
> > -	if (userptr->sg) {
> > -		dma_unmap_sgtable(xe->drm.dev,
> > -				  userptr->sg,
> > -				  read_only ? DMA_TO_DEVICE :
> > -				  DMA_BIDIRECTIONAL, 0);
> > -		sg_free_table(userptr->sg);
> > -		userptr->sg = NULL;
> > -	}
> 
> ^^^
> Likewise, I don't think this should be dropped either.

Will move to hmm.c
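
i.e. keep something like this in the helper, before re-populating
(sketch, using the helper's write flag instead of read_only):

	if (userptr->sg) {
		dma_unmap_sgtable(xe->drm.dev, userptr->sg,
				  write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE,
				  0);
		sg_free_table(userptr->sg);
		userptr->sg = NULL;
	}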

> 
> > -
> > -	pinned = ret = 0;
> > -	if (in_kthread) {
> > -		if (!mmget_not_zero(userptr->notifier.mm)) {
> > -			ret = -EFAULT;
> > -			goto mm_closed;
> > -		}
> > -		kthread_use_mm(userptr->notifier.mm);
> > -	}
> 
> ^^^
> Nor this.

Will move to hmm.c
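
Something like the below around the hmm_range_fault call (sketch):

	bool in_kthread = !current->mm;

	if (in_kthread) {
		if (!mmget_not_zero(userptr->notifier.mm))
			return -EFAULT;
		kthread_use_mm(userptr->notifier.mm);
	}

	/* ... hmm_range_fault based population and sg table build ... */

	if (in_kthread) {
		kthread_unuse_mm(userptr->notifier.mm);
		mmput(userptr->notifier.mm);
	}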


> 
> > -
> > -	while (pinned < num_pages) {
> > -		ret = get_user_pages_fast(xe_vma_userptr(vma) +
> > -					  pinned * PAGE_SIZE,
> > -					  num_pages - pinned,
> > -					  read_only ? 0 : FOLL_WRITE,
> > -					  &pages[pinned]);
> > -		if (ret < 0)
> > -			break;
> > -
> > -		pinned += ret;
> > -		ret = 0;
> > -	}
> 
> ^^^
> We should be replacing this.
> 
> > -
> > -	if (in_kthread) {
> > -		kthread_unuse_mm(userptr->notifier.mm);
> > -		mmput(userptr->notifier.mm);
> > -	}
> > -mm_closed:
> > -	if (ret)
> > -		goto out;
> > -
> > -	ret = sg_alloc_table_from_pages_segment(&userptr->sgt, pages,
> > -						pinned, 0,
> > -						(u64)pinned << PAGE_SHIFT,
> > -						xe_sg_segment_size(xe-
> >drm.dev),
> > -						GFP_KERNEL);
> > -	if (ret) {
> > -		userptr->sg = NULL;
> > -		goto out;
> > -	}
> > -	userptr->sg = &userptr->sgt;
> > -
> > -	ret = dma_map_sgtable(xe->drm.dev, userptr->sg,
> > -			      read_only ? DMA_TO_DEVICE :
> > -			      DMA_BIDIRECTIONAL,
> > -			      DMA_ATTR_SKIP_CPU_SYNC |
> > -			      DMA_ATTR_NO_KERNEL_MAPPING);
> > -	if (ret) {
> > -		sg_free_table(userptr->sg);
> > -		userptr->sg = NULL;
> > -		goto out;
> > -	}
> > -
> > -	for (i = 0; i < pinned; ++i) {
> > -		if (!read_only) {
> > -			lock_page(pages[i]);
> > -			set_page_dirty(pages[i]);
> > -			unlock_page(pages[i]);
> > -		}
> > -
> > -		mark_page_accessed(pages[i]);
> > -	}
> > -
> > -out:
> > -	release_pages(pages, pinned);
> > -	kvfree(pages);
> 
> ^^^
> Through here (minus existing the kthread) with hmm call. I guess the
> kthread enter / exit could be in the hmm layer too.


Moved the missing parts to hmm.c. Will send out v1.

Thanks a lot for the review, Matt!

Oak
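
For reference, the caller in xe_vm.c should then shrink to roughly the
below (a rough sketch of what v1 might look like; details such as where
the repin retry ends up are still per your comments above):

int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
{
	struct xe_vma *vma = &uvma->vma;
	struct xe_vm *vm = xe_vma_vm(vma);
	struct xe_device *xe = vm->xe;
	bool write = !xe_vma_read_only(vma);
	struct hmm_range hmm_range;
	int ret;

	lockdep_assert_held(&vm->lock);
	xe_assert(xe, xe_vma_is_userptr(vma));

retry:
	ret = xe_hmm_populate_range(vma, &hmm_range, write);
	if (ret)
		return ret;

	if (xe_vma_userptr_check_repin(uvma) == -EAGAIN)
		goto retry;

	return 0;
}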
> 
> Matt
> 
> > -
> > -	if (!(ret < 0)) {
> > -		userptr->notifier_seq = notifier_seq;
> > -		if (xe_vma_userptr_check_repin(uvma) == -EAGAIN)
> > -			goto retry;
> > -	}
> > -
> > -	return ret < 0 ? ret : 0;
> > +	return xe_hmm_populate_range(vma, &hmm_range, write);
> >  }
> >
> >  static bool preempt_fences_waiting(struct xe_vm *vm)
> > --
> > 2.26.3
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-18 19:50     ` Zeng, Oak
@ 2024-03-19  8:41       ` Hellstrom, Thomas
  2024-03-19 16:13         ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-19  8:41 UTC (permalink / raw)
  To: intel-xe, Zeng,  Oak
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad

On Mon, 2024-03-18 at 19:50 +0000, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > Sent: Monday, March 18, 2024 7:54 AM
> > To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or
> > hmmptr
> > 
> > Hi, Oak.
> > 
> > 
> > On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > > Add a helper function xe_hmm_populate_range to populate
> > > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > > to read CPU page tables and populate all pfns/pages of this
> > > virtual address range.
> > > 
> > > If the populated page is system memory page, dma-mapping is
> > > performed
> > > to get a dma-address which can be used later for GPU to access
> > > pages.
> > > 
> > > If the populated page is device private page, we calculate the
> > > dpa (
> > > device physical address) of the page.
> > > 
> > > The dma-address or dpa is then saved in userptr's sg table. This
> > > is
> > > prepare work to replace the get_user_pages_fast code in userptr
> > > code
> > > path. The helper function will also be used to populate hmmptr
> > > later.
> > > 
> > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura@intel.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Cc: Brian Welty <brian.welty@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > > ++++++++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > 
> > I mostly agree with Matt's review comments on this patch. Some
> > additional below.
> > 
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index 840467080e59..29dcbc938b01 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > >  	xe_wait_user_fence.o \
> > >  	xe_wa.o \
> > >  	xe_wopcm.o \
> > > -	xe_svm_devmem.o
> > > +	xe_svm_devmem.o \
> > > +	xe_hmm.o
> > > 
> > >  # graphics hardware monitoring (HWMON) support
> > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > > b/drivers/gpu/drm/xe/xe_hmm.c
> > > new file mode 100644
> > > index 000000000000..c45c2447d386
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > @@ -0,0 +1,213 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/mmu_notifier.h>
> > > +#include <linux/dma-mapping.h>
> > > +#include <linux/memremap.h>
> > > +#include <linux/swap.h>
> > > +#include <linux/mm.h>
> > > +#include "xe_hmm.h"
> > > +#include "xe_vm.h"
> > > +
> > > +/**
> > > + * mark_range_accessed() - mark a range is accessed, so core mm
> > > + * have such information for memory eviction or write back to
> > > + * hard disk
> > > + *
> > > + * @range: the range to mark
> > > + * @write: if write to this range, we mark pages in this range
> > > + * as dirty
> > > + */
> > > +static void mark_range_accessed(struct hmm_range *range, bool
> > > write)
> > 
> > Some of the static function names aren't really unique enough not
> > to
> > stand in the way for a future core function name clash. Please
> > consider
> > using an xe_ prefix in such cases. It will also make backtraces
> > easier
> > to follow.
> 
> I will add a xe_prefix for the backtrace reason...
> 
> As I understand it, static function is file scope, so even if we have
> a core function with
> The same name in the future, as long as they are not in the same
> file, there wont be any name clash...

The name can't be used for a future core extern function?

/Thomas

> 
> > 
> > 
> > > +{
> > > +	struct page *page;
> > > +	u64 i, npages;
> > > +
> > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > >start >>
> > > PAGE_SHIFT) + 1;
> > > +	for (i = 0; i < npages; i++) {
> > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > +		if (write) {
> > > +			lock_page(page);
> > > +			set_page_dirty(page);
> > > +			unlock_page(page);
> > 
> > Could be using set_page_dirty_lock() here.
> 
> Will fix, Thanks
> Oak
> 
> > 
> > /Thomas
> > 
> > 
> > > +		}
> > > +		mark_page_accessed(page);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * build_sg() - build a scatter gather table for all the
> > > physical
> > > pages/pfn
> > > + * in a hmm_range. dma-address is save in sg table and will be
> > > used
> > > to program
> > > + * GPU page table later.
> > > + *
> > > + * @xe: the xe device who will access the dma-address in sg
> > > table
> > > + * @range: the hmm range that we build the sg table from. range-
> > > > hmm_pfns[]
> > > + * has the pfn numbers of pages that back up this hmm address
> > > range.
> > > + * @st: pointer to the sg table.
> > > + * @write: whether we write to this range. This decides dma map
> > > direction
> > > + * for system pages. If write we map it bi-diretional; otherwise
> > > + * DMA_TO_DEVICE
> > > + *
> > > + * All the contiguous pfns will be collapsed into one entry in
> > > + * the scatter gather table. This is for the convenience of
> > > + * later on operations to bind address range to GPU page table.
> > > + *
> > > + * The dma_address in the sg table will later be used by GPU to
> > > + * access memory. So if the memory is system memory, we need to
> > > + * do a dma-mapping so it can be accessed by GPU/DMA. If the
> > > memory
> > > + * is GPU local memory (of the GPU who is going to access
> > > memory),
> > > + * we need gpu dpa (device physical address), and there is no
> > > need
> > > + * of dma-mapping.
> > > + *
> > > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > > + * memory. Add this when you support p2p
> > > + *
> > > + * This function allocates the storage of the sg table. It is
> > > + * caller's responsibility to free it calling sg_free_table.
> > > + *
> > > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > > + */
> > > +static int build_sg(struct xe_device *xe, struct hmm_range
> > > *range,
> > > +			     struct sg_table *st, bool write)
> > > +{
> > > +	struct device *dev = xe->drm.dev;
> > > +	struct scatterlist *sg;
> > > +	u64 i, npages;
> > > +
> > > +	sg = NULL;
> > > +	st->nents = 0;
> > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > >start >>
> > > PAGE_SHIFT) + 1;
> > > +
> > > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > > +		return -ENOMEM;
> > > +
> > > +	for (i = 0; i < npages; i++) {
> > > +		struct page *page;
> > > +		unsigned long addr;
> > > +		struct xe_mem_region *mr;
> > > +
> > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > +		if (is_device_private_page(page)) {
> > > +			mr = page_to_mem_region(page);
> > > +			addr = vram_pfn_to_dpa(mr, range-
> > > > hmm_pfns[i]);
> > > +		} else {
> > > +			addr = dma_map_page(dev, page, 0,
> > > PAGE_SIZE,
> > > +					write ?
> > > DMA_BIDIRECTIONAL :
> > > DMA_TO_DEVICE);
> > > +		}
> > > +
> > > +		if (sg && (addr == (sg_dma_address(sg) + sg-
> > > > length))) {
> > > +			sg->length += PAGE_SIZE;
> > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > +			continue;
> > > +		}
> > > +
> > > +		sg =  sg ? sg_next(sg) : st->sgl;
> > > +		sg_dma_address(sg) = addr;
> > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > +		sg->length = PAGE_SIZE;
> > > +		st->nents++;
> > > +	}
> > > +
> > > +	sg_mark_end(sg);
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_hmm_populate_range() - Populate physical pages of a
> > > virtual
> > > + * address range
> > > + *
> > > + * @vma: vma has information of the range to populate. only vma
> > > + * of userptr and hmmptr type can be populated.
> > > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > > + * will hold the populated pfns.
> > > + * @write: populate pages with write permission
> > > + *
> > > + * This function populate the physical pages of a virtual
> > > + * address range. The populated physical pages is saved in
> > > + * userptr's sg table. It is similar to get_user_pages but call
> > > + * hmm_range_fault.
> > > + *
> > > + * This function also read mmu notifier sequence # (
> > > + * mmu_interval_read_begin), for the purpose of later
> > > + * comparison (through mmu_interval_read_retry).
> > > + *
> > > + * This must be called with mmap read or write lock held.
> > > + *
> > > + * This function allocates the storage of the userptr sg table.
> > > + * It is caller's responsibility to free it calling
> > > sg_free_table.
> > > + *
> > > + * returns: 0 for succuss; negative error no on failure
> > > + */
> > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > *hmm_range,
> > > +						bool write)
> > > +{
> > > +	unsigned long timeout =
> > > +		jiffies +
> > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > +	struct xe_userptr_vma *userptr_vma;
> > > +	struct xe_userptr *userptr;
> > > +	u64 start = vma->gpuva.va.addr;
> > > +	u64 end = start + vma->gpuva.va.range;
> > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > +	u64 npages;
> > > +	int ret;
> > > +
> > > +	userptr_vma = to_userptr_vma(vma);
> > > +	userptr = &userptr_vma->userptr;
> > > +	mmap_assert_locked(userptr->notifier.mm);
> > > +
> > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >>
> > > PAGE_SHIFT) +
> > > 1;
> > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > GFP_KERNEL);
> > > +	if (unlikely(!pfns))
> > > +		return -ENOMEM;
> > > +
> > > +	if (write)
> > > +		flags |= HMM_PFN_REQ_WRITE;
> > > +
> > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > > +	hmm_range->hmm_pfns = pfns;
> > > +	hmm_range->notifier_seq =
> > > mmu_interval_read_begin(&userptr-
> > > > notifier);
> > > +	hmm_range->notifier = &userptr->notifier;
> > > +	hmm_range->start = start;
> > > +	hmm_range->end = end;
> > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > > HMM_PFN_REQ_WRITE;
> > > +	/**
> > > +	 * FIXME:
> > > +	 * Set the the dev_private_owner can prevent
> > > hmm_range_fault
> > > to fault
> > > +	 * in the device private pages owned by caller. See
> > > function
> > > +	 * hmm_vma_handle_pte. In multiple GPU case, this should
> > > be
> > > set to the
> > > +	 * device owner of the best migration destination. e.g.,
> > > device0/vm0
> > > +	 * has a page fault, but we have determined the best
> > > placement of
> > > +	 * the fault address should be on device1, we should set
> > > below to
> > > +	 * device1 instead of device0.
> > > +	 */
> > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > +
> > > +	while (true) {
> > > +		ret = hmm_range_fault(hmm_range);
> > > +		if (time_after(jiffies, timeout))
> > > +			break;
> > > +
> > > +		if (ret == -EBUSY)
> > > +			continue;
> > > +		break;
> > > +	}
> > > +
> > > +	if (ret)
> > > +		goto free_pfns;
> > > +
> > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > +	if (ret)
> > > +		goto free_pfns;
> > > +
> > > +	mark_range_accessed(hmm_range, write);
> > > +	userptr->sg = &userptr->sgt;
> > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > > +
> > > +free_pfns:
> > > +	kvfree(pfns);
> > > +	return ret;
> > > +}
> > > +
> > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> > > b/drivers/gpu/drm/xe/xe_hmm.h
> > > new file mode 100644
> > > index 000000000000..960f3f6d36ae
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > @@ -0,0 +1,12 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2024 Intel Corporation
> > > + */
> > > +
> > > +#include <linux/types.h>
> > > +#include <linux/hmm.h>
> > > +#include "xe_vm_types.h"
> > > +#include "xe_svm.h"
> > > +
> > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > *hmm_range,
> > > +						bool write);
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-19  8:41       ` Hellstrom, Thomas
@ 2024-03-19 16:13         ` Zeng, Oak
  2024-03-19 19:52           ` Hellstrom, Thomas
  0 siblings, 1 reply; 49+ messages in thread
From: Zeng, Oak @ 2024-03-19 16:13 UTC (permalink / raw)
  To: Hellstrom, Thomas, intel-xe
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad



> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Tuesday, March 19, 2024 4:42 AM
> To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> On Mon, 2024-03-18 at 19:50 +0000, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > > Sent: Monday, March 18, 2024 7:54 AM
> > > To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or
> > > hmmptr
> > >
> > > Hi, Oak.
> > >
> > >
> > > On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > > > Add a helper function xe_hmm_populate_range to populate
> > > > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > > > to read CPU page tables and populate all pfns/pages of this
> > > > virtual address range.
> > > >
> > > > If the populated page is system memory page, dma-mapping is
> > > > performed
> > > > to get a dma-address which can be used later for GPU to access
> > > > pages.
> > > >
> > > > If the populated page is device private page, we calculate the
> > > > dpa (
> > > > device physical address) of the page.
> > > >
> > > > The dma-address or dpa is then saved in userptr's sg table. This
> > > > is
> > > > prepare work to replace the get_user_pages_fast code in userptr
> > > > code
> > > > path. The helper function will also be used to populate hmmptr
> > > > later.
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > > >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > > > ++++++++++++++++++++++++++++++++++++
> > > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > >
> > > I mostly agree with Matt's review comments on this patch. Some
> > > additional below.
> > >
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > b/drivers/gpu/drm/xe/Makefile
> > > > index 840467080e59..29dcbc938b01 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > > >  	xe_wait_user_fence.o \
> > > >  	xe_wa.o \
> > > >  	xe_wopcm.o \
> > > > -	xe_svm_devmem.o
> > > > +	xe_svm_devmem.o \
> > > > +	xe_hmm.o
> > > >
> > > >  # graphics hardware monitoring (HWMON) support
> > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > > > b/drivers/gpu/drm/xe/xe_hmm.c
> > > > new file mode 100644
> > > > index 000000000000..c45c2447d386
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > > @@ -0,0 +1,213 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/mmu_notifier.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/memremap.h>
> > > > +#include <linux/swap.h>
> > > > +#include <linux/mm.h>
> > > > +#include "xe_hmm.h"
> > > > +#include "xe_vm.h"
> > > > +
> > > > +/**
> > > > + * mark_range_accessed() - mark a range is accessed, so core mm
> > > > + * have such information for memory eviction or write back to
> > > > + * hard disk
> > > > + *
> > > > + * @range: the range to mark
> > > > + * @write: if write to this range, we mark pages in this range
> > > > + * as dirty
> > > > + */
> > > > +static void mark_range_accessed(struct hmm_range *range, bool
> > > > write)
> > >
> > > Some of the static function names aren't really unique enough not
> > > to
> > > stand in the way for a future core function name clash. Please
> > > consider
> > > using an xe_ prefix in such cases. It will also make backtraces
> > > easier
> > > to follow.
> >
> > I will add a xe_prefix for the backtrace reason...
> >
> > As I understand it, static function is file scope, so even if we have
> > a core function with
> > The same name in the future, as long as they are not in the same
> > file, there wont be any name clash...
> 
> The name can't be used for a future core extern function?

Yes, at least this is my understanding of static functions. A static function doesn't have a symbol, so it can't clash with another static function of the same name in a different file. For example, if you readelf -s xe.ko, you won't be able to find xe_mark_range_accessed, because it is a static function. I think the compiler just gives a static function an address instead of a symbol.

Oak

> 
> /Thomas
> 
> >
> > >
> > >
> > > > +{
> > > > +	struct page *page;
> > > > +	u64 i, npages;
> > > > +
> > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > > >start >>
> > > > PAGE_SHIFT) + 1;
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > +		if (write) {
> > > > +			lock_page(page);
> > > > +			set_page_dirty(page);
> > > > +			unlock_page(page);
> > >
> > > Could be using set_page_dirty_lock() here.
> >
> > Will fix, Thanks
> > Oak
> >
> > >
> > > /Thomas
> > >
> > >
> > > > +		}
> > > > +		mark_page_accessed(page);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * build_sg() - build a scatter gather table for all the
> > > > physical
> > > > pages/pfn
> > > > + * in a hmm_range. dma-address is save in sg table and will be
> > > > used
> > > > to program
> > > > + * GPU page table later.
> > > > + *
> > > > + * @xe: the xe device who will access the dma-address in sg
> > > > table
> > > > + * @range: the hmm range that we build the sg table from. range-
> > > > > hmm_pfns[]
> > > > + * has the pfn numbers of pages that back up this hmm address
> > > > range.
> > > > + * @st: pointer to the sg table.
> > > > + * @write: whether we write to this range. This decides dma map
> > > > direction
> > > > + * for system pages. If write we map it bi-diretional; otherwise
> > > > + * DMA_TO_DEVICE
> > > > + *
> > > > + * All the contiguous pfns will be collapsed into one entry in
> > > > + * the scatter gather table. This is for the convenience of
> > > > + * later on operations to bind address range to GPU page table.
> > > > + *
> > > > + * The dma_address in the sg table will later be used by GPU to
> > > > + * access memory. So if the memory is system memory, we need to
> > > > + * do a dma-mapping so it can be accessed by GPU/DMA. If the
> > > > memory
> > > > + * is GPU local memory (of the GPU who is going to access
> > > > memory),
> > > > + * we need gpu dpa (device physical address), and there is no
> > > > need
> > > > + * of dma-mapping.
> > > > + *
> > > > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > > > + * memory. Add this when you support p2p
> > > > + *
> > > > + * This function allocates the storage of the sg table. It is
> > > > + * caller's responsibility to free it calling sg_free_table.
> > > > + *
> > > > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > > > + */
> > > > +static int build_sg(struct xe_device *xe, struct hmm_range
> > > > *range,
> > > > +			     struct sg_table *st, bool write)
> > > > +{
> > > > +	struct device *dev = xe->drm.dev;
> > > > +	struct scatterlist *sg;
> > > > +	u64 i, npages;
> > > > +
> > > > +	sg = NULL;
> > > > +	st->nents = 0;
> > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > > >start >>
> > > > PAGE_SHIFT) + 1;
> > > > +
> > > > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > > > +		return -ENOMEM;
> > > > +
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		struct page *page;
> > > > +		unsigned long addr;
> > > > +		struct xe_mem_region *mr;
> > > > +
> > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > +		if (is_device_private_page(page)) {
> > > > +			mr = page_to_mem_region(page);
> > > > +			addr = vram_pfn_to_dpa(mr, range-
> > > > > hmm_pfns[i]);
> > > > +		} else {
> > > > +			addr = dma_map_page(dev, page, 0,
> > > > PAGE_SIZE,
> > > > +					write ?
> > > > DMA_BIDIRECTIONAL :
> > > > DMA_TO_DEVICE);
> > > > +		}
> > > > +
> > > > +		if (sg && (addr == (sg_dma_address(sg) + sg-
> > > > > length))) {
> > > > +			sg->length += PAGE_SIZE;
> > > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +		sg =  sg ? sg_next(sg) : st->sgl;
> > > > +		sg_dma_address(sg) = addr;
> > > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > > +		sg->length = PAGE_SIZE;
> > > > +		st->nents++;
> > > > +	}
> > > > +
> > > > +	sg_mark_end(sg);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_hmm_populate_range() - Populate physical pages of a
> > > > virtual
> > > > + * address range
> > > > + *
> > > > + * @vma: vma has information of the range to populate. only vma
> > > > + * of userptr and hmmptr type can be populated.
> > > > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > > > + * will hold the populated pfns.
> > > > + * @write: populate pages with write permission
> > > > + *
> > > > + * This function populate the physical pages of a virtual
> > > > + * address range. The populated physical pages is saved in
> > > > + * userptr's sg table. It is similar to get_user_pages but call
> > > > + * hmm_range_fault.
> > > > + *
> > > > + * This function also read mmu notifier sequence # (
> > > > + * mmu_interval_read_begin), for the purpose of later
> > > > + * comparison (through mmu_interval_read_retry).
> > > > + *
> > > > + * This must be called with mmap read or write lock held.
> > > > + *
> > > > + * This function allocates the storage of the userptr sg table.
> > > > + * It is caller's responsibility to free it calling
> > > > sg_free_table.
> > > > + *
> > > > + * returns: 0 for succuss; negative error no on failure
> > > > + */
> > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > > *hmm_range,
> > > > +						bool write)
> > > > +{
> > > > +	unsigned long timeout =
> > > > +		jiffies +
> > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > > +	struct xe_userptr_vma *userptr_vma;
> > > > +	struct xe_userptr *userptr;
> > > > +	u64 start = vma->gpuva.va.addr;
> > > > +	u64 end = start + vma->gpuva.va.range;
> > > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > > +	u64 npages;
> > > > +	int ret;
> > > > +
> > > > +	userptr_vma = to_userptr_vma(vma);
> > > > +	userptr = &userptr_vma->userptr;
> > > > +	mmap_assert_locked(userptr->notifier.mm);
> > > > +
> > > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >>
> > > > PAGE_SHIFT) +
> > > > 1;
> > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > GFP_KERNEL);
> > > > +	if (unlikely(!pfns))
> > > > +		return -ENOMEM;
> > > > +
> > > > +	if (write)
> > > > +		flags |= HMM_PFN_REQ_WRITE;
> > > > +
> > > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > > > +	hmm_range->hmm_pfns = pfns;
> > > > +	hmm_range->notifier_seq =
> > > > mmu_interval_read_begin(&userptr-
> > > > > notifier);
> > > > +	hmm_range->notifier = &userptr->notifier;
> > > > +	hmm_range->start = start;
> > > > +	hmm_range->end = end;
> > > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > > > HMM_PFN_REQ_WRITE;
> > > > +	/**
> > > > +	 * FIXME:
> > > > +	 * Set the the dev_private_owner can prevent
> > > > hmm_range_fault
> > > > to fault
> > > > +	 * in the device private pages owned by caller. See
> > > > function
> > > > +	 * hmm_vma_handle_pte. In multiple GPU case, this should
> > > > be
> > > > set to the
> > > > +	 * device owner of the best migration destination. e.g.,
> > > > device0/vm0
> > > > +	 * has a page fault, but we have determined the best
> > > > placement of
> > > > +	 * the fault address should be on device1, we should set
> > > > below to
> > > > +	 * device1 instead of device0.
> > > > +	 */
> > > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > > +
> > > > +	while (true) {
> > > > +		ret = hmm_range_fault(hmm_range);
> > > > +		if (time_after(jiffies, timeout))
> > > > +			break;
> > > > +
> > > > +		if (ret == -EBUSY)
> > > > +			continue;
> > > > +		break;
> > > > +	}
> > > > +
> > > > +	if (ret)
> > > > +		goto free_pfns;
> > > > +
> > > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > > +	if (ret)
> > > > +		goto free_pfns;
> > > > +
> > > > +	mark_range_accessed(hmm_range, write);
> > > > +	userptr->sg = &userptr->sgt;
> > > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > > > +
> > > > +free_pfns:
> > > > +	kvfree(pfns);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h
> > > > b/drivers/gpu/drm/xe/xe_hmm.h
> > > > new file mode 100644
> > > > index 000000000000..960f3f6d36ae
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > > @@ -0,0 +1,12 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/types.h>
> > > > +#include <linux/hmm.h>
> > > > +#include "xe_vm_types.h"
> > > > +#include "xe_svm.h"
> > > > +
> > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> > > > *hmm_range,
> > > > +						bool write);
> >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-19 16:13         ` Zeng, Oak
@ 2024-03-19 19:52           ` Hellstrom, Thomas
  2024-03-19 20:01             ` Zeng, Oak
  0 siblings, 1 reply; 49+ messages in thread
From: Hellstrom, Thomas @ 2024-03-19 19:52 UTC (permalink / raw)
  To: intel-xe, Zeng,  Oak
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad

On Tue, 2024-03-19 at 16:13 +0000, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > Sent: Tuesday, March 19, 2024 4:42 AM
> > To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or
> > hmmptr
> > 
> > On Mon, 2024-03-18 at 19:50 +0000, Zeng, Oak wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > > > Sent: Monday, March 18, 2024 7:54 AM
> > > > To: intel-xe@lists.freedesktop.org; Zeng, Oak
> > > > <oak.zeng@intel.com>
> > > > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > > > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal
> > > > Prasad
> > > > <himal.prasad.ghimiray@intel.com>
> > > > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr
> > > > or
> > > > hmmptr
> > > > 
> > > > Hi, Oak.
> > > > 
> > > > 
> > > > On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > > > > Add a helper function xe_hmm_populate_range to populate
> > > > > a a userptr or hmmptr range. This functions calls
> > > > > hmm_range_fault
> > > > > to read CPU page tables and populate all pfns/pages of this
> > > > > virtual address range.
> > > > > 
> > > > > If the populated page is system memory page, dma-mapping is
> > > > > performed
> > > > > to get a dma-address which can be used later for GPU to
> > > > > access
> > > > > pages.
> > > > > 
> > > > > If the populated page is device private page, we calculate
> > > > > the
> > > > > dpa (
> > > > > device physical address) of the page.
> > > > > 
> > > > > The dma-address or dpa is then saved in userptr's sg table.
> > > > > This
> > > > > is
> > > > > prepare work to replace the get_user_pages_fast code in
> > > > > userptr
> > > > > code
> > > > > path. The helper function will also be used to populate
> > > > > hmmptr
> > > > > later.
> > > > > 
> > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > <niranjana.vishwanathapura@intel.com>
> > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > <niranjana.vishwanathapura@intel.com>
> > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > > > >  drivers/gpu/drm/xe/xe_hmm.c | 213
> > > > > ++++++++++++++++++++++++++++++++++++
> > > > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > > > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > > > 
> > > > I mostly agree with Matt's review comments on this patch. Some
> > > > additional below.
> > > > 
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > > b/drivers/gpu/drm/xe/Makefile
> > > > > index 840467080e59..29dcbc938b01 100644
> > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > > > >  	xe_wait_user_fence.o \
> > > > >  	xe_wa.o \
> > > > >  	xe_wopcm.o \
> > > > > -	xe_svm_devmem.o
> > > > > +	xe_svm_devmem.o \
> > > > > +	xe_hmm.o
> > > > > 
> > > > >  # graphics hardware monitoring (HWMON) support
> > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c
> > > > > b/drivers/gpu/drm/xe/xe_hmm.c
> > > > > new file mode 100644
> > > > > index 000000000000..c45c2447d386
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > > > @@ -0,0 +1,213 @@
> > > > > +// SPDX-License-Identifier: MIT
> > > > > +/*
> > > > > + * Copyright © 2024 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +#include <linux/mmu_notifier.h>
> > > > > +#include <linux/dma-mapping.h>
> > > > > +#include <linux/memremap.h>
> > > > > +#include <linux/swap.h>
> > > > > +#include <linux/mm.h>
> > > > > +#include "xe_hmm.h"
> > > > > +#include "xe_vm.h"
> > > > > +
> > > > > +/**
> > > > > + * mark_range_accessed() - mark a range is accessed, so core
> > > > > mm
> > > > > + * have such information for memory eviction or write back
> > > > > to
> > > > > + * hard disk
> > > > > + *
> > > > > + * @range: the range to mark
> > > > > + * @write: if write to this range, we mark pages in this
> > > > > range
> > > > > + * as dirty
> > > > > + */
> > > > > +static void mark_range_accessed(struct hmm_range *range,
> > > > > bool
> > > > > write)
> > > > 
> > > > Some of the static function names aren't really unique enough
> > > > not
> > > > to
> > > > stand in the way for a future core function name clash. Please
> > > > consider
> > > > using an xe_ prefix in such cases. It will also make backtraces
> > > > easier
> > > > to follow.
> > > 
> > > I will add a xe_prefix for the backtrace reason...
> > > 
> > > As I understand it, static function is file scope, so even if we
> > > have
> > > a core function with
> > > The same name in the future, as long as they are not in the same
> > > file, there wont be any name clash...
> > 
> > The name can't be used for a future core extern function?
> 
> Yes, at least this is my understanding of static function. Static
> function doesn't have a symbol so it can't clash with other static
> function in other file with the same name. For example, if you
> readelf -s xe.ko, you won't be able to find xe_mark_range_accessed,
> because it is a static function. I think compiler just give static
> function an address instead of a symbol. 

I meant, for example, if you call a function "build_sg". Then if the core
wants to implement an extern function called "build_sg" and includes it
in a reasonably widely-used header, you will get a compilation failure.
Hence the use of non-specific static function names restricts future
core use of the same name for public functions.

/Thomas
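
For example (made-up header name and signature, just to illustrate the
clash):

/* some_core_header.h - hypothetical future core header */
int build_sg(void *pages, unsigned long npages);

/* xe_hmm.c */
#include "some_core_header.h"

/* error: static declaration of 'build_sg' follows non-static declaration */
static int build_sg(void *pages, unsigned long npages)
{
	return 0;
}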




> 
> Oak
> 
> > 
> > /Thomas
> > 
> > > 
> > > > 
> > > > 
> > > > > +{
> > > > > +	struct page *page;
> > > > > +	u64 i, npages;
> > > > > +
> > > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > > > > start >>
> > > > > PAGE_SHIFT) + 1;
> > > > > +	for (i = 0; i < npages; i++) {
> > > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > > +		if (write) {
> > > > > +			lock_page(page);
> > > > > +			set_page_dirty(page);
> > > > > +			unlock_page(page);
> > > > 
> > > > Could be using set_page_dirty_lock() here.
> > > 
> > > Will fix, Thanks
> > > Oak
> > > 
> > > > 
> > > > /Thomas
> > > > 
> > > > 
> > > > > +		}
> > > > > +		mark_page_accessed(page);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * build_sg() - build a scatter gather table for all the
> > > > > physical
> > > > > pages/pfn
> > > > > + * in a hmm_range. dma-address is save in sg table and will
> > > > > be
> > > > > used
> > > > > to program
> > > > > + * GPU page table later.
> > > > > + *
> > > > > + * @xe: the xe device who will access the dma-address in sg
> > > > > table
> > > > > + * @range: the hmm range that we build the sg table from.
> > > > > range-
> > > > > > hmm_pfns[]
> > > > > + * has the pfn numbers of pages that back up this hmm
> > > > > address
> > > > > range.
> > > > > + * @st: pointer to the sg table.
> > > > > + * @write: whether we write to this range. This decides dma
> > > > > map
> > > > > direction
> > > > > + * for system pages. If write we map it bi-diretional;
> > > > > otherwise
> > > > > + * DMA_TO_DEVICE
> > > > > + *
> > > > > + * All the contiguous pfns will be collapsed into one entry
> > > > > in
> > > > > + * the scatter gather table. This is for the convenience of
> > > > > + * later on operations to bind address range to GPU page
> > > > > table.
> > > > > + *
> > > > > + * The dma_address in the sg table will later be used by GPU
> > > > > to
> > > > > + * access memory. So if the memory is system memory, we need
> > > > > to
> > > > > + * do a dma-mapping so it can be accessed by GPU/DMA. If the
> > > > > memory
> > > > > + * is GPU local memory (of the GPU who is going to access
> > > > > memory),
> > > > > + * we need gpu dpa (device physical address), and there is
> > > > > no
> > > > > need
> > > > > + * of dma-mapping.
> > > > > + *
> > > > > + * FIXME: dma-mapping for peer gpu device to access remote
> > > > > gpu's
> > > > > + * memory. Add this when you support p2p
> > > > > + *
> > > > > + * This function allocates the storage of the sg table. It
> > > > > is
> > > > > + * caller's responsibility to free it calling sg_free_table.
> > > > > + *
> > > > > + * Returns 0 if successful; -ENOMEM if fails to allocate
> > > > > memory
> > > > > + */
> > > > > +static int build_sg(struct xe_device *xe, struct hmm_range
> > > > > *range,
> > > > > +			     struct sg_table *st, bool
> > > > > write)
> > > > > +{
> > > > > +	struct device *dev = xe->drm.dev;
> > > > > +	struct scatterlist *sg;
> > > > > +	u64 i, npages;
> > > > > +
> > > > > +	sg = NULL;
> > > > > +	st->nents = 0;
> > > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range-
> > > > > > start >>
> > > > > PAGE_SHIFT) + 1;
> > > > > +
> > > > > +	if (unlikely(sg_alloc_table(st, npages,
> > > > > GFP_KERNEL)))
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	for (i = 0; i < npages; i++) {
> > > > > +		struct page *page;
> > > > > +		unsigned long addr;
> > > > > +		struct xe_mem_region *mr;
> > > > > +
> > > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > > +		if (is_device_private_page(page)) {
> > > > > +			mr = page_to_mem_region(page);
> > > > > +			addr = vram_pfn_to_dpa(mr, range-
> > > > > > hmm_pfns[i]);
> > > > > +		} else {
> > > > > +			addr = dma_map_page(dev, page, 0,
> > > > > PAGE_SIZE,
> > > > > +					write ?
> > > > > DMA_BIDIRECTIONAL :
> > > > > DMA_TO_DEVICE);
> > > > > +		}
> > > > > +
> > > > > +		if (sg && (addr == (sg_dma_address(sg) + sg-
> > > > > > length))) {
> > > > > +			sg->length += PAGE_SIZE;
> > > > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > > > +		sg =  sg ? sg_next(sg) : st->sgl;
> > > > > +		sg_dma_address(sg) = addr;
> > > > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > > > +		sg->length = PAGE_SIZE;
> > > > > +		st->nents++;
> > > > > +	}
> > > > > +
> > > > > +	sg_mark_end(sg);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * xe_hmm_populate_range() - Populate physical pages of a
> > > > > virtual
> > > > > + * address range
> > > > > + *
> > > > > + * @vma: vma has information of the range to populate. only
> > > > > vma
> > > > > + * of userptr and hmmptr type can be populated.
> > > > > + * @hmm_range: pointer to hmm_range struct. hmm_rang-
> > > > > >hmm_pfns
> > > > > + * will hold the populated pfns.
> > > > > + * @write: populate pages with write permission
> > > > > + *
> > > > > + * This function populate the physical pages of a virtual
> > > > > + * address range. The populated physical pages is saved in
> > > > > + * userptr's sg table. It is similar to get_user_pages but
> > > > > call
> > > > > + * hmm_range_fault.
> > > > > + *
> > > > > + * This function also read mmu notifier sequence # (
> > > > > + * mmu_interval_read_begin), for the purpose of later
> > > > > + * comparison (through mmu_interval_read_retry).
> > > > > + *
> > > > > + * This must be called with mmap read or write lock held.
> > > > > + *
> > > > > + * This function allocates the storage of the userptr sg
> > > > > table.
> > > > > + * It is caller's responsibility to free it calling
> > > > > sg_free_table.
> > > > > + *
> > > > > + * returns: 0 for succuss; negative error no on failure
> > > > > + */
> > > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct
> > > > > hmm_range
> > > > > *hmm_range,
> > > > > +						bool write)
> > > > > +{
> > > > > +	unsigned long timeout =
> > > > > +		jiffies +
> > > > > msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > > > +	struct xe_userptr_vma *userptr_vma;
> > > > > +	struct xe_userptr *userptr;
> > > > > +	u64 start = vma->gpuva.va.addr;
> > > > > +	u64 end = start + vma->gpuva.va.range;
> > > > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > > > +	u64 npages;
> > > > > +	int ret;
> > > > > +
> > > > > +	userptr_vma = to_userptr_vma(vma);
> > > > > +	userptr = &userptr_vma->userptr;
> > > > > +	mmap_assert_locked(userptr->notifier.mm);
> > > > > +
> > > > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >>
> > > > > PAGE_SHIFT) +
> > > > > 1;
> > > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns),
> > > > > GFP_KERNEL);
> > > > > +	if (unlikely(!pfns))
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	if (write)
> > > > > +		flags |= HMM_PFN_REQ_WRITE;
> > > > > +
> > > > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > > > > +	hmm_range->hmm_pfns = pfns;
> > > > > +	hmm_range->notifier_seq =
> > > > > mmu_interval_read_begin(&userptr-
> > > > > > notifier);
> > > > > +	hmm_range->notifier = &userptr->notifier;
> > > > > +	hmm_range->start = start;
> > > > > +	hmm_range->end = end;
> > > > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> > > > > HMM_PFN_REQ_WRITE;
> > > > > +	/*
> > > > > +	 * FIXME:
> > > > > +	 * Setting dev_private_owner can prevent hmm_range_fault from
> > > > > +	 * faulting in the device private pages owned by the caller. See
> > > > > +	 * function hmm_vma_handle_pte. In the multiple GPU case, this
> > > > > +	 * should be set to the device owner of the best migration
> > > > > +	 * destination, e.g., device0/vm0 has a page fault, but we have
> > > > > +	 * determined the best placement of the fault address should be
> > > > > +	 * on device1; then we should set below to device1 instead of
> > > > > +	 * device0.
> > > > > +	 */
> > > > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > > > +
> > > > > +	while (true) {
> > > > > +		ret = hmm_range_fault(hmm_range);
> > > > > +		if (time_after(jiffies, timeout))
> > > > > +			break;
> > > > > +
> > > > > +		if (ret == -EBUSY)
> > > > > +			continue;
> > > > > +		break;
> > > > > +	}
> > > > > +
> > > > > +	if (ret)
> > > > > +		goto free_pfns;
> > > > > +
> > > > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > > > +	if (ret)
> > > > > +		goto free_pfns;
> > > > > +
> > > > > +	mark_range_accessed(hmm_range, write);
> > > > > +	userptr->sg = &userptr->sgt;
> > > > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > > > > +
> > > > > +free_pfns:
> > > > > +	kvfree(pfns);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
> > > > > new file mode 100644
> > > > > index 000000000000..960f3f6d36ae
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > > > @@ -0,0 +1,12 @@
> > > > > +// SPDX-License-Identifier: MIT
> > > > > +/*
> > > > > + * Copyright © 2024 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +#include <linux/types.h>
> > > > > +#include <linux/hmm.h>
> > > > > +#include "xe_vm_types.h"
> > > > > +#include "xe_svm.h"
> > > > > +
> > > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
> > > > > +					  bool write);
> > > 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
  2024-03-19 19:52           ` Hellstrom, Thomas
@ 2024-03-19 20:01             ` Zeng, Oak
  0 siblings, 0 replies; 49+ messages in thread
From: Zeng, Oak @ 2024-03-19 20:01 UTC (permalink / raw)
  To: Hellstrom, Thomas, intel-xe
  Cc: Brost, Matthew, Welty, Brian, airlied, Ghimiray, Himal Prasad

Hi Thomas,

> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Tuesday, March 19, 2024 3:52 PM
> To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> On Tue, 2024-03-19 at 16:13 +0000, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > > Sent: Tuesday, March 19, 2024 4:42 AM
> > > To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or
> > > hmmptr
> > >
> > > On Mon, 2024-03-18 at 19:50 +0000, Zeng, Oak wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > > > > Sent: Monday, March 18, 2024 7:54 AM
> > > > > To: intel-xe@lists.freedesktop.org; Zeng, Oak
> > > > > <oak.zeng@intel.com>
> > > > > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > > > > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal
> > > > > Prasad
> > > > > <himal.prasad.ghimiray@intel.com>
> > > > > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr
> > > > > or
> > > > > hmmptr
> > > > >
> > > > > Hi, Oak.
> > > > >
> > > > >
> > > > > On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > > > > > Add a helper function xe_hmm_populate_range to populate
> > > > > > a userptr or hmmptr range. This function calls hmm_range_fault
> > > > > > to read CPU page tables and populate all pfns/pages of this
> > > > > > virtual address range.
> > > > > >
> > > > > > If the populated page is a system memory page, dma-mapping is
> > > > > > performed to get a dma-address which can be used later for the
> > > > > > GPU to access pages.
> > > > > >
> > > > > > If the populated page is a device private page, we calculate the
> > > > > > dpa (device physical address) of the page.
> > > > > >
> > > > > > The dma-address or dpa is then saved in userptr's sg table. This
> > > > > > is preparation work to replace the get_user_pages_fast code in
> > > > > > the userptr code path. The helper function will also be used to
> > > > > > populate hmmptr later.
> > > > > >
> > > > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > > > Co-developed-by: Niranjana Vishwanathapura
> > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > Signed-off-by: Niranjana Vishwanathapura
> > > > > > <niranjana.vishwanathapura@intel.com>
> > > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > > > > >  drivers/gpu/drm/xe/xe_hmm.c | 213 ++++++++++++++++++++++++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > > > > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > > > >
> > > > > I mostly agree with Matt's review comments on this patch. Some
> > > > > additional below.
> > > > >
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > > > index 840467080e59..29dcbc938b01 100644
> > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > > > > >  	xe_wait_user_fence.o \
> > > > > >  	xe_wa.o \
> > > > > >  	xe_wopcm.o \
> > > > > > -	xe_svm_devmem.o
> > > > > > +	xe_svm_devmem.o \
> > > > > > +	xe_hmm.o
> > > > > >
> > > > > >  # graphics hardware monitoring (HWMON) support
> > > > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..c45c2447d386
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > > > > @@ -0,0 +1,213 @@
> > > > > > +// SPDX-License-Identifier: MIT
> > > > > > +/*
> > > > > > + * Copyright © 2024 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +#include <linux/mmu_notifier.h>
> > > > > > +#include <linux/dma-mapping.h>
> > > > > > +#include <linux/memremap.h>
> > > > > > +#include <linux/swap.h>
> > > > > > +#include <linux/mm.h>
> > > > > > +#include "xe_hmm.h"
> > > > > > +#include "xe_vm.h"
> > > > > > +
> > > > > > +/**
> > > > > > + * mark_range_accessed() - mark a range as accessed, so the core mm
> > > > > > + * has this information for memory eviction or writeback to
> > > > > > + * hard disk
> > > > > > + *
> > > > > > + * @range: the range to mark
> > > > > > + * @write: if we write to this range, we mark pages in this range
> > > > > > + * as dirty
> > > > > > + */
> > > > > > +static void mark_range_accessed(struct hmm_range *range, bool write)
> > > > >
> > > > > Some of the static function names aren't really unique enough not to
> > > > > stand in the way of a future core function name clash. Please consider
> > > > > using an xe_ prefix in such cases. It will also make backtraces easier
> > > > > to follow.
> > > >
> > > > I will add an xe_ prefix for the backtrace reason...
> > > >
> > > > As I understand it, a static function has file scope, so even if we
> > > > have a core function with the same name in the future, as long as they
> > > > are not in the same file, there won't be any name clash...
> > >
> > > The name can't be used for a future core extern function?
> >
> > Yes, at least this is my understanding of static functions. A static
> > function doesn't have an external symbol, so it can't clash with another
> > static function of the same name in a different file. For example, if you
> > run readelf -s on xe.ko, you won't be able to find xe_mark_range_accessed,
> > because it is a static function. I think the compiler just gives a static
> > function an address instead of a symbol.
> 
> I meant for example if you call a function "build_sg". Then if the core
> wants to implement an extern function called "build_sg" and includes it
> in a reasonably wide-used header you will get a compilation failure.
> Hence the use of non-specific static function names restricts future
> core use of the same name for public functions.

I see what you meant now 😊. Makes sense. Will add a prefix to the functions.
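
Just to make the plan concrete (a sketch only; final names can still change
in the next revision), the file-local helpers in xe_hmm.c would simply gain
the driver prefix:

static void xe_mark_range_accessed(struct hmm_range *range, bool write);
static int xe_build_sg(struct xe_device *xe, struct hmm_range *range,
		       struct sg_table *st, bool write);

and the call sites in xe_hmm_populate_range() would be updated to match:

	ret = xe_build_sg(vm->xe, hmm_range, &userptr->sgt, write);
	if (ret)
		goto free_pfns;

	xe_mark_range_accessed(hmm_range, write);

With the prefix, a future core declaration of build_sg() or
mark_range_accessed() in a widely included header can't collide with these
at compile time, and a backtrace immediately shows the xe origin.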

Oak
> 
> /Thomas
> 
> 
> 
> 
> >
> > Oak
> >
> > >
> > > /Thomas
> > >
> > > >
> > > > >
> > > > >
> > > > > > +{
> > > > > > +	struct page *page;
> > > > > > +	u64 i, npages;
> > > > > > +
> > > > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> > > > > > +	for (i = 0; i < npages; i++) {
> > > > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > > > +		if (write) {
> > > > > > +			lock_page(page);
> > > > > > +			set_page_dirty(page);
> > > > > > +			unlock_page(page);
> > > > >
> > > > > Could be using set_page_dirty_lock() here.
> > > >
> > > > Will fix, Thanks
> > > > Oak
> > > >
> > > > >
> > > > > /Thomas
> > > > >
> > > > >
> > > > > > +		}
> > > > > > +		mark_page_accessed(page);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * build_sg() - build a scatter gather table for all the physical
> > > > > > pages/pfns
> > > > > > + * in a hmm_range. The dma-address is saved in the sg table and will
> > > > > > be used to program
> > > > > > + * the GPU page table later.
> > > > > > + *
> > > > > > + * @xe: the xe device that will access the dma-address in the sg
> > > > > > table
> > > > > > + * @range: the hmm range that we build the sg table from.
> > > > > > range->hmm_pfns[]
> > > > > > + * has the pfn numbers of the pages that back this hmm address
> > > > > > range.
> > > > > > + * @st: pointer to the sg table.
> > > > > > + * @write: whether we write to this range. This decides the dma map
> > > > > > direction
> > > > > > + * for system pages. If write, we map it bidirectional; otherwise
> > > > > > + * DMA_TO_DEVICE
> > > > > > + *
> > > > > > + * All contiguous pfns will be collapsed into one entry in
> > > > > > + * the scatter gather table. This is for the convenience of
> > > > > > + * later operations that bind the address range to the GPU page
> > > > > > table.
> > > > > > + *
> > > > > > + * The dma_address in the sg table will later be used by the GPU to
> > > > > > + * access memory. So if the memory is system memory, we need to
> > > > > > + * do a dma-mapping so it can be accessed by GPU/DMA. If the
> > > > > > memory
> > > > > > + * is GPU local memory (of the GPU that is going to access
> > > > > > memory),
> > > > > > + * we need the gpu dpa (device physical address), and there is no
> > > > > > need
> > > > > > + * for dma-mapping.
> > > > > > + *
> > > > > > + * FIXME: dma-mapping for a peer gpu device to access a remote
> > > > > > gpu's
> > > > > > + * memory. Add this when p2p is supported.
> > > > > > + *
> > > > > > + * This function allocates the storage of the sg table. It is
> > > > > > + * the caller's responsibility to free it by calling sg_free_table.
> > > > > > + *
> > > > > > + * Returns 0 if successful; -ENOMEM if it fails to allocate
> > > > > > memory
> > > > > > + */
> > > > > > +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> > > > > > +			     struct sg_table *st, bool write)
> > > > > > +{
> > > > > > +	struct device *dev = xe->drm.dev;
> > > > > > +	struct scatterlist *sg;
> > > > > > +	u64 i, npages;
> > > > > > +
> > > > > > +	sg = NULL;
> > > > > > +	st->nents = 0;
> > > > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> > > > > > +
> > > > > > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > > > > > +		return -ENOMEM;
> > > > > > +
> > > > > > +	for (i = 0; i < npages; i++) {
> > > > > > +		struct page *page;
> > > > > > +		unsigned long addr;
> > > > > > +		struct xe_mem_region *mr;
> > > > > > +
> > > > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > > > +		if (is_device_private_page(page)) {
> > > > > > +			mr = page_to_mem_region(page);
> > > > > > +			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
> > > > > > +		} else {
> > > > > > +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > > > > +					write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
> > > > > > +		}
> > > > > > +
> > > > > > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > > > > > +			sg->length += PAGE_SIZE;
> > > > > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > > > > +			continue;
> > > > > > +		}
> > > > > > +
> > > > > > +		sg =  sg ? sg_next(sg) : st->sgl;
> > > > > > +		sg_dma_address(sg) = addr;
> > > > > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > > > > +		sg->length = PAGE_SIZE;
> > > > > > +		st->nents++;
> > > > > > +	}
> > > > > > +
> > > > > > +	sg_mark_end(sg);
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_hmm_populate_range() - Populate physical pages of a virtual
> > > > > > + * address range
> > > > > > + *
> > > > > > + * @vma: vma has information of the range to populate. Only vma
> > > > > > + * of userptr and hmmptr type can be populated.
> > > > > > + * @hmm_range: pointer to hmm_range struct. hmm_range->hmm_pfns
> > > > > > + * will hold the populated pfns.
> > > > > > + * @write: populate pages with write permission
> > > > > > + *
> > > > > > + * This function populates the physical pages of a virtual
> > > > > > + * address range. The populated physical pages are saved in
> > > > > > + * userptr's sg table. It is similar to get_user_pages but calls
> > > > > > + * hmm_range_fault.
> > > > > > + *
> > > > > > + * This function also reads the mmu notifier sequence # (
> > > > > > + * mmu_interval_read_begin), for the purpose of later
> > > > > > + * comparison (through mmu_interval_read_retry).
> > > > > > + *
> > > > > > + * This must be called with mmap read or write lock held.
> > > > > > + *
> > > > > > + * This function allocates the storage of the userptr sg table.
> > > > > > + * It is the caller's responsibility to free it by calling
> > > > > > + * sg_free_table.
> > > > > > + *
> > > > > > + * Returns: 0 for success; negative errno on failure
> > > > > > + */
> > > > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
> > > > > > +					  bool write)
> > > > > > +{
> > > > > > +	unsigned long timeout =
> > > > > > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > > > > +	struct xe_userptr_vma *userptr_vma;
> > > > > > +	struct xe_userptr *userptr;
> > > > > > +	u64 start = vma->gpuva.va.addr;
> > > > > > +	u64 end = start + vma->gpuva.va.range;
> > > > > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > > > > +	u64 npages;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	userptr_vma = to_userptr_vma(vma);
> > > > > > +	userptr = &userptr_vma->userptr;
> > > > > > +	mmap_assert_locked(userptr->notifier.mm);
> > > > > > +
> > > > > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
> > > > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > > > > > +	if (unlikely(!pfns))
> > > > > > +		return -ENOMEM;
> > > > > > +
> > > > > > +	if (write)
> > > > > > +		flags |= HMM_PFN_REQ_WRITE;
> > > > > > +
> > > > > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > > > > > +	hmm_range->hmm_pfns = pfns;
> > > > > > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
> > > > > > +	hmm_range->notifier = &userptr->notifier;
> > > > > > +	hmm_range->start = start;
> > > > > > +	hmm_range->end = end;
> > > > > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;
> > > > > > +	/*
> > > > > > +	 * FIXME:
> > > > > > +	 * Setting dev_private_owner can prevent hmm_range_fault from
> > > > > > +	 * faulting in the device private pages owned by the caller. See
> > > > > > +	 * function hmm_vma_handle_pte. In the multiple GPU case, this
> > > > > > +	 * should be set to the device owner of the best migration
> > > > > > +	 * destination, e.g., device0/vm0 has a page fault, but we have
> > > > > > +	 * determined the best placement of the fault address should be
> > > > > > +	 * on device1; then we should set below to device1 instead of
> > > > > > +	 * device0.
> > > > > > +	 */
> > > > > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > > > > +
> > > > > > +	while (true) {
> > > > > > +		ret = hmm_range_fault(hmm_range);
> > > > > > +		if (time_after(jiffies, timeout))
> > > > > > +			break;
> > > > > > +
> > > > > > +		if (ret == -EBUSY)
> > > > > > +			continue;
> > > > > > +		break;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (ret)
> > > > > > +		goto free_pfns;
> > > > > > +
> > > > > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > > > > +	if (ret)
> > > > > > +		goto free_pfns;
> > > > > > +
> > > > > > +	mark_range_accessed(hmm_range, write);
> > > > > > +	userptr->sg = &userptr->sgt;
> > > > > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > > > > > +
> > > > > > +free_pfns:
> > > > > > +	kvfree(pfns);
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..960f3f6d36ae
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > > > > @@ -0,0 +1,12 @@
> > > > > > +// SPDX-License-Identifier: MIT
> > > > > > +/*
> > > > > > + * Copyright © 2024 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +#include <linux/types.h>
> > > > > > +#include <linux/hmm.h>
> > > > > > +#include "xe_vm_types.h"
> > > > > > +#include "xe_svm.h"
> > > > > > +
> > > > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
> > > > > > +					  bool write);
> > > >
> >


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2024-03-19 20:01 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-14  3:35 [PATCH 0/5] Use hmm_range_fault to populate user page Oak Zeng
2024-03-14  3:28 ` ✓ CI.Patch_applied: success for " Patchwork
2024-03-14  3:28 ` ✗ CI.checkpatch: warning " Patchwork
2024-03-14  3:29 ` ✗ CI.KUnit: failure " Patchwork
2024-03-14  3:35 ` [PATCH 1/5] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
2024-03-14 17:17   ` Matthew Brost
2024-03-14 18:32     ` Zeng, Oak
2024-03-14 20:49       ` Matthew Brost
2024-03-15 16:00         ` Zeng, Oak
2024-03-15 20:39           ` Matthew Brost
2024-03-15 21:31             ` Zeng, Oak
2024-03-16  1:25               ` Matthew Brost
2024-03-18 10:16                 ` Hellstrom, Thomas
2024-03-18 15:02                   ` Zeng, Oak
2024-03-18 15:46                     ` Hellstrom, Thomas
2024-03-18 14:51                 ` Zeng, Oak
2024-03-15  1:45   ` Welty, Brian
2024-03-15  3:10     ` Zeng, Oak
2024-03-15  3:16       ` Zeng, Oak
2024-03-15 18:05         ` Welty, Brian
2024-03-15 23:11           ` Zeng, Oak
2024-03-14  3:35 ` [PATCH 2/5] drm/xe: Helper to get memory region from tile Oak Zeng
2024-03-14 17:33   ` Matthew Brost
2024-03-14 17:44   ` Matthew Brost
2024-03-15  2:48     ` Zeng, Oak
2024-03-14  3:35 ` [PATCH 3/5] drm/xe: Helper to get dpa from pfn Oak Zeng
2024-03-14 17:39   ` Matthew Brost
2024-03-15 17:29     ` Zeng, Oak
2024-03-16  1:33       ` Matthew Brost
2024-03-18 19:25         ` Zeng, Oak
2024-03-18 12:09     ` Hellstrom, Thomas
2024-03-18 19:27       ` Zeng, Oak
2024-03-14  3:35 ` [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr Oak Zeng
2024-03-14 20:25   ` Matthew Brost
2024-03-16  1:35     ` Zeng, Oak
2024-03-18  0:29       ` Matthew Brost
2024-03-18 11:53   ` Hellstrom, Thomas
2024-03-18 19:50     ` Zeng, Oak
2024-03-19  8:41       ` Hellstrom, Thomas
2024-03-19 16:13         ` Zeng, Oak
2024-03-19 19:52           ` Hellstrom, Thomas
2024-03-19 20:01             ` Zeng, Oak
2024-03-18 13:12   ` Hellstrom, Thomas
2024-03-18 14:49     ` Zeng, Oak
2024-03-18 15:40       ` Hellstrom, Thomas
2024-03-18 16:09         ` Zeng, Oak
2024-03-14  3:35 ` [PATCH 5/5] drm/xe: Use hmm_range_fault to populate user pages Oak Zeng
2024-03-14 20:54   ` Matthew Brost
2024-03-19  2:36     ` Zeng, Oak
