* [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_*
@ 2021-08-13  6:31 Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 01/13] ext4/xfs: add page refcount helper Alex Sierra
                   ` (12 more replies)
  0 siblings, 13 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

v1:
AMD is building a system architecture for the Frontier supercomputer with a
coherent interconnect between CPUs and GPUs. This hardware architecture allows
the CPUs to coherently access GPU device memory. We have hardware in our labs
and we are working with our partner HPE on the BIOS, firmware and software
for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver looks
it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the migrate_vma_*
helpers so we can support page-based migration in our unified memory allocations,
while also supporting CPU access to those pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
correctly in the migrate_vma_* helpers. We are looking for feedback about this
approach. If we're close, what's needed to make our patches acceptable upstream?
If we're not close, any suggestions how else to achieve what we are trying to do
(i.e. page migration and coherent CPU access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently upstreamed
to Dave Airlie's drm-next branch
https://cgit.freedesktop.org/drm/drm/log/?h=drm-next
On top of that we did some rework of our VRAM management for migrations to remove
some incorrect assumptions, allow partially successful migrations and GPU memory
mappings that mix pages in VRAM and system memory.
https://lore.kernel.org/dri-devel/20210527205606.2660-6-Felix.Kuehling@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc

v2:
This version of the patch series merges the "[RFC PATCH v3 0/2]
mm: remove extra ZONE_DEVICE struct page refcount" patch series by
Ralph Campbell. Our changes to support the device generic type in the
migrate_vma helpers are applied on top of it.
This has been tested on systems with device memory that the CPU can
access coherently.

It also addresses the following feedback on v1:
- Isolate the kernel/resource.c modification in one patch, based
on Christoph's feedback.
- Add helpers to check for the generic and private types to avoid
duplicated long lines.

v3:
- Include cover letter from v1.
- Rename dax_layout_is_idle_page func to dax_page_unused in patch
ext4/xfs: add page refcount helper.

v4:
- Add support for zone device generic type in lib/test_hmm and
tools/testing/selftests/vm/hmm-tests.
- Add missing page refcount helper to fuse/dax.c. This was included in
one of Ralph Campbell's patches.

v5:
- Cosmetic changes on patches 3, 5 and 13
- A bug was found while running one of the xfstests (generic/413) used to
validate the fs_dax device type. It was first introduced by the patch "mm: remove
extra ZONE_DEVICE struct page refcount", which is part of this patch series.
The bug showed up as a WARNING message at the try_grab_page function call, due to
a page refcounter equal to zero. Part of the "mm: remove extra ZONE_DEVICE struct
page refcount" changes was to initialize the page refcounter to zero. Therefore,
a special condition was added to try_grab_page in this v5, where it checks for
zone device pages too. It is included in the same patch.

This is how the mm changes from this patch series have been validated:
- hmm-tests were run using the device private and device generic types. The
latter was just added in this patch series. efi_fake_mem was used to mimic SPM
memory for device generic.
- The xfstests tool was used to validate the fs-dax device type and the page
refcounter changes. A DAX configuration was used along with emulated Persistent
Memory set as memmap=4G!4G memmap=4G!9G. xfstests were run from the ext4 and
generic lists. Some of them did not run due to limitations in the
configuration, e.g. tests not supporting a specific file system or DAX mode.
Only three tests failed: generic/356, generic/357 and ext4/049. However, these
failures were consistent before and after applying this patch series.
xfstest configuration:
TEST_DEV=/dev/pmem0
TEST_DIR=/mnt/ram0
SCRATCH_DEV=/dev/pmem1
SCRATCH_MNT=/mnt/ram1
TEST_FS_MOUNT_OPTS="-o dax"
EXT_MOUNT_OPTIONS="-o dax"
MKFS_OPTIONS="-b4096"
xfstest passed list:
Ext4:
001,003,005,021,022,023,025,026,030,031,032,036,037,038,042,043,044,271,306
Generic:
1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,20,21,22,23,24,25,28,29,30,31,32,33,35,37,
50,52,53,58,60,61,62,63,64,67,69,70,71,75,76,78,79,80,82,84,86,87,88,91,92,94,
96,97,98,99,103,105,112,113,114,117,120,124,126,129,130,131,135,141,169,184,
198,207,210,211,212,213,214,215,221,223,225,228,236,237,240,244,245,246,247,
248,249,255,257,258,263,277,286,294,306,307,308,309,313,315,316,318,319,337,
346,360,361,371,375,377,379,380,383,384,385,386,389,391,392,393,394,400,401,
403,404,406,409,410,411,412,413,417,420,422,423,424,425,426,427,428

v6:
- This patch series was rebased on amd-staging-drm-next, which in turn is
based on v5.13:
https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-staging-drm-next
- Handle null pointers in dmirror_allocate_chunk at test_hmm.c
- Here's a link to the repo including these patch series:
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/alexsierrag/device_generic

- CONFIGS required to run hmm-tests and xfstest with no special Hardware.
For hmm-tests:
CONFIG_EFI_FAKE_MEMMAP=y
CONFIG_EFI_SOFT_RESERVE=y
CONFIG_TEST_HMM=m
CONFIG_RUNTIME_TESTING_MENU=y

For xfstest using emulated persistent memory:
CONFIG_X86_PMEM_LEGACY=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=y
CONFIG_FS_DAX=y
CONFIG_DAX_DRIVER=y
CONFIG_VIRTIO_FS=y

HMM configs for both hmm-test and xfstest:
CONFIG_ZONE_DEVICE=y
CONFIG_HMM_MIRROR=y
CONFIG_MMU_NOTIFIER=y
CONFIG_DEVICE_PRIVATE=y 

- Kernel parameters to run hmm-tests and xfstests.
These tests require either emulated persistent memory (EPM) for xfstests or
fake special purpose memory (FSPM) for the hmm-tests device generic type
configuration. Both are carved out of system memory by reserving ranges of
physical addresses through specific kernel parameters. Make sure your kernel
was built with the CONFIGS mentioned above. Once memory ranges are reserved
through these two mechanisms, the kernel cannot use them as regular system
memory until the kernel parameters are removed. Both mechanisms use similar
parameters to define physical address and size. FSPM, however, takes a third
field, the attribute value. Here’s the syntax for both:
FSPM: efi_fake_mem=nn[KMG]@ss[KMG]:aa
EPM: memmap=nn[KMG]!ss[KMG]
'nn' defines the size of the memory reserved
'ss' defines the physical/usable start address. This can be taken from the BIOS-e820 mem table
'aa' specifies the attribute. The SPM attribute is EFI_MEMORY_SP (0x40000)
[KMG]: refers to kilo, mega, giga
To find an available memory region, look at the BIOS-e820 mem table, which is
usually printed at kernel boot (dmesg). In this table, choose ranges that are
marked as 'usable' and are at least as large as your desired reservation.
Ex. the range below has a size of about 13GB, out of a total of 16GB of system
memory.
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000044eafffff] usable

In our testing, we require two ranges of 4GB each for xfstests. And two more of
1GB each for hmm-tests. Total of 10GB.

Based on range above we set these two kernel parameters as follows:
EPM:
memmap=4G!4G memmap=4G!9G
FSPM:
efi_fake_mem=1G@0x200000000:0x40000,1G@0x340000000:0x40000
We alternate one EPM reservation (4GB) and one FSPM reservation (1GB), starting
at the 4GB address.
These kernel parameters can be passed by editing the grub file "/etc/default/grub":
GRUB_CMDLINE_LINUX="memmap=4G!4G memmap=4G!9G
efi_fake_mem=1G@0x200000000:0x40000,1G@0x340000000:0x40000"
Once you have modified this file, don’t forget to update grub.
$sudo update-grub

After booting with these parameters applied, you should see the new ranges
defined at the "extended physical RAM map" table. This is printed at boot:
reserve setup_data: [mem 0x0000000100000000-0x00000001ffffffff] persistent (type 12)

reserve setup_data: [mem 0x0000000200000000-0x000000023fffffff] soft reserved

reserve setup_data: [mem 0x0000000240000000-0x000000033fffffff] persistent (type 12)

reserve setup_data: [mem 0x0000000340000000-0x000000037fffffff] soft reserved

As you see, EPM ranges are now labeled as Persistent (type 12) and FSPM ranges
as soft reserved. 
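
The layout above can be double-checked with a bit of address arithmetic before
rebooting. What follows is only a minimal userspace sketch, not part of this
series: struct resv_range, the resv_* helpers and example_layout are
hypothetical names made up for illustration. It encodes the four reservations
from the example command line and checks that each one ends where the boot map
shows, fits inside the 'usable' BIOS-e820 range and does not overlap another.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical helper type: one reserved range, [start, start + size). */
struct resv_range {
	uint64_t start;
	uint64_t size;
};

/* Last byte of the range, as printed in the boot-time memory map. */
static uint64_t resv_end(const struct resv_range *r)
{
	return r->start + r->size - 1;
}

/* True if r lies entirely inside the usable region [lo, hi] (inclusive). */
static bool resv_fits(const struct resv_range *r, uint64_t lo, uint64_t hi)
{
	return r->start >= lo && resv_end(r) <= hi;
}

/* True if no two ranges in the array share a byte. */
static bool resv_disjoint(const struct resv_range *r, size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (r[i].start <= resv_end(&r[j]) &&
			    r[j].start <= resv_end(&r[i]))
				return false;
	return true;
}

/* The four reservations from the example kernel command line. */
static const struct resv_range example_layout[] = {
	{ 4 * GiB,        4 * GiB },	/* memmap=4G!4G */
	{ 0x200000000ULL, 1 * GiB },	/* efi_fake_mem=1G@0x200000000 */
	{ 9 * GiB,        4 * GiB },	/* memmap=4G!9G */
	{ 0x340000000ULL, 1 * GiB },	/* efi_fake_mem=1G@0x340000000 */
};
```

Calling resv_fits() on each entry with lo = 0x100000000 and hi = 0x44eafffff
(the usable range from the BIOS-e820 line quoted earlier) is the check that
matters on a different machine: only the array entries and the two bounds
would need to change.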

- Setting and running hmm-tests
These tests can now be run with either the device private or the device generic
type. The latter requires setting up Special Purpose Memory.
To run them manually, from your kernel directory go to:
$cd tools/testing/selftests/vm/

To run device private, enter:
$sudo ./test_hmm.sh smoke

To run device generic, you must pass the physical start addresses of both SP
regions. In this example, these can be taken from the table above, from the
ranges labeled "soft reserved":
$sudo ./test_hmm.sh smoke 0x200000000 0x340000000

The same hmm-tests are executed for both device types.

- Setting and running xfstest
Clone xfstests-dev repo
$git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
$cd xfstests-dev
$make
$sudo make install

On xfstests-dev directory, create a local.config file with the following
information:
TEST_DEV=/dev/pmem0
TEST_DIR=/mnt/ram0
SCRATCH_DEV=/dev/pmem1
SCRATCH_MNT=/mnt/ram1
TEST_FS_MOUNT_OPTS="-o dax"
EXT_MOUNT_OPTIONS="-o dax"
MKFS_OPTIONS="-b4096"

Create mounting directories:
$sudo mkdir /mnt/ram0
$sudo mkdir /mnt/ram1

Every time you boot, you need to create an ext4 file system on the emulated
persistent memory partitions.
$sudo mkfs.ext4 /dev/pmem0
$sudo mkfs.ext4 /dev/pmem1

To run the tests:
$sudo ./check -g quick

Alex Sierra (11):
  kernel: resource: lookup_resource as exported symbol
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: generic type as sys mem on migration to ram
  include/linux/mm.h: helpers to check zone device generic type
  mm: add generic type support to migrate_vma helpers
  mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  lib: test_hmm add ioctl to get zone device type
  lib: test_hmm add module param for zone device type
  lib: add support for device generic type in test_hmm
  tools: update hmm-test to support device generic type
  tools: update test_hmm script to support SP config

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c       |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  22 ++-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   2 +-
 fs/dax.c                                 |   8 +-
 fs/ext4/inode.c                          |   5 +-
 fs/fuse/dax.c                            |   4 +-
 fs/xfs/xfs_file.c                        |   4 +-
 include/linux/dax.h                      |  10 +
 include/linux/memremap.h                 |   7 +-
 include/linux/mm.h                       |  21 +-
 kernel/resource.c                        |   1 +
 lib/test_hmm.c                           | 237 +++++++++++++++--------
 lib/test_hmm_uapi.h                      |  16 ++
 mm/internal.h                            |   8 +
 mm/memremap.c                            |  69 ++-----
 mm/migrate.c                             |  23 ++-
 mm/page_alloc.c                          |   3 +
 mm/swap.c                                |  45 +----
 tools/testing/selftests/vm/hmm-tests.c   | 142 ++++++++++++--
 tools/testing/selftests/vm/test_hmm.sh   |  20 +-
 20 files changed, 411 insertions(+), 238 deletions(-)

-- 
2.32.0



* [PATCH v6 01/13] ext4/xfs: add page refcount helper
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-15  9:01   ` Christoph Hellwig
  2021-08-13  6:31 ` [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

From: Ralph Campbell <rcampbell@nvidia.com>

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

v3:
[AS]: rename dax_layout_is_idle_page func to dax_page_unused

v4:
[AS]: This ref count functionality was missing on fuse/dax.c.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/fuse/dax.c       |  4 +---
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 5 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 62352cbcf0f4..c387d09e3e5a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -369,7 +369,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_page_unused(page));
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -383,7 +383,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (page_ref_count(page) > 1)
+		if (!dax_page_unused(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fe6045a46599..05ffe6875cb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3971,10 +3971,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei));
+		error = dax_wait_page(ei, page, ext4_wait_dax_page);
 	} while (error == 0);
 
 	return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index ff99ab2a3c43..2b1f190ba78a 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -677,9 +677,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, fuse_wait_dax_page(inode));
+	return dax_wait_page(inode, page, fuse_wait_dax_page);
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..182057281086 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -840,9 +840,7 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8b5da1d60dbc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_page_unused(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb)				\
+	___wait_var_event(&(_page)->_refcount,				\
+		dax_page_unused(_page),				\
+		TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
-- 
2.32.0
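
For readers following along, the idle convention that this patch wraps can be
modelled in a few lines of userspace C. This is only a hedged sketch under
simplifying assumptions: struct mock_page and the mock_* functions are
hypothetical stand-ins, not kernel API, and the only thing taken from the
patch is the rule that a ZONE_DEVICE DAX page with a reference count of 1 is
considered unused.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct page: refcount only. */
struct mock_page {
	atomic_int refcount;
};

/* Same rule as dax_page_unused() in this patch: a count of 1 means idle. */
static bool mock_dax_page_unused(struct mock_page *page)
{
	return atomic_load(&page->refcount) == 1;
}

/* A user (for example a DMA or get_user_pages() caller) pins the page. */
static void mock_get_page(struct mock_page *page)
{
	atomic_fetch_add(&page->refcount, 1);
}

/*
 * Drop one pin. Returns true if this drop brought the count back to 1,
 * which is the point where a waiter on the page would be woken up.
 */
static bool mock_put_page(struct mock_page *page)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	return atomic_fetch_sub(&page->refcount, 1) - 1 == 1;
}
```

Starting from a count of 1, one mock_get_page() makes the page busy and the
matching mock_put_page() reports it idle again. Keeping the comparison in a
single helper is what lets a later change move the idle value without
touching every caller.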



* [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 01/13] ext4/xfs: add page refcount helper Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-15 15:37   ` Christoph Hellwig
  2021-08-18  0:01     ` Ralph Campbell
  2021-08-13  6:31 ` [PATCH v6 03/13] kernel: resource: lookup_resource as exported symbol Alex Sierra
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

From: Ralph Campbell <rcampbell@nvidia.com>

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

v2:
AS: merged this patch in linux 5.11 version

v5:
AS: add a condition to try_grab_page to check for the zone device type when the
page ref counter is checked for being less than or equal to zero. For zone
device pages, the ref counter is initialized to zero.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 13 +----
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 68 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---------------
 12 files changed, 46 insertions(+), 115 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 			return NULL;
 	}
 
-	get_page(page);
+	init_page_count(page);
 	lock_page(page);
 	return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index c387d09e3e5a..1166630b7190 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -571,14 +571,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *       pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8b5da1d60dbc..05fc982ce153 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)
 
 static inline bool dax_page_unused(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb)				\
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 45a79da89c5f..77ff5fd0685f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
 	/*
-	 * Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-	 * reach 0 refcount unless there is a refcount bug. This allows the
-	 * device driver to implement its own memory management.)
+	 * Called once the page refcount reaches 0. The reference count
+	 * should be reset to one with init_page_count(page) before reusing
+	 * the page. This allows the device driver to implement its own
+	 * memory management.
 	 */
 	void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ae31622deef..d48a1f0889d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1218,7 +1218,7 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
 static inline __must_check bool try_get_page(struct page *page)
 {
 	page = compound_head(page);
-	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+	if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))
 		return false;
 	page_ref_inc(page);
 	return true;
@@ -1228,17 +1228,6 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the page is free and we
-	 * need to inform the device driver through callback. See
-	 * include/linux/memremap.h and HMM for details.
-	 */
-	if (page_is_devmap_managed(page)) {
-		put_devmap_managed_page(page);
-		return;
-	}
-
 	if (put_page_testzero(page))
 		__put_page(page);
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..6998f10350ea 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	}
 
 	dpage->zone_device_data = rpage;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 
diff --git a/mm/internal.h b/mm/internal.h
index e8fdb531f887..5438cceca4b9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -667,4 +667,12 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 
 void vunmap_range_noflush(unsigned long start, unsigned long end);
 
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 15a074ffb8d7..5aa8163fd948 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include "internal.h"
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -37,32 +38,6 @@ unsigned long memremap_compat_align(void)
 EXPORT_SYMBOL_GPL(memremap_compat_align);
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_FS_DAX)
-		static_branch_dec(&devmap_managed_key);
-}
-
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_FS_DAX)
-		static_branch_inc(&devmap_managed_key);
-}
-#else
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-}
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -102,16 +77,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
-static unsigned long pfn_next(unsigned long pfn)
-{
-	if (pfn % 1024 == 0)
-		cond_resched();
-	return pfn + 1;
-}
-
-#define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
 	if (pgmap->ops && pgmap->ops->kill)
@@ -167,20 +132,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
-	unsigned long pfn;
 	int i;
 
 	dev_pagemap_kill(pgmap);
 	for (i = 0; i < pgmap->nr_range; i++)
-		for_each_device_pfn(pfn, pgmap, i)
-			put_page(pfn_to_page(pfn));
+		percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+						pfn_first(pgmap, i));
 	dev_pagemap_cleanup(pgmap);
 
 	for (i = 0; i < pgmap->nr_range; i++)
 		pageunmap_range(pgmap, i);
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
-	devmap_managed_enable_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -382,8 +345,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 	}
 
-	devmap_managed_enable_get(pgmap);
-
 	/*
 	 * Clear the pgmap nr_range as it will be incremented for each
 	 * successfully processed range. This communicates how many
@@ -498,16 +459,10 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
 {
-	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
-		wake_up_var(&page->_refcount);
-		return;
-	}
 
 	__ClearPageWaiters(page);
-
 	mem_cgroup_uncharge(page);
 
 	/*
@@ -534,4 +489,19 @@ void free_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
+
+void free_zone_device_page(struct page *page)
+{
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_FS_DAX:
+		/* notify page idle */
+		wake_up_var(&page->_refcount);
+		return;
+	case MEMORY_DEVICE_PRIVATE:
+		free_device_private_page(page);
+		return;
+	default:
+		return;
+	}
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 41ff2c9896c4..e3a10e2a1bb3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -350,11 +350,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 {
 	int expected_count = 1;
 
-	/*
-	 * Device private pages have an extra refcount as they are
-	 * ZONE_DEVICE pages.
-	 */
-	expected_count += is_device_private_page(page);
 	if (mapping)
 		expected_count += thp_nr_pages(page) + page_has_private(page);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef2265f86b91..1ef1f733af5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6414,6 +6414,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
+		/* ZONE_DEVICE pages start with a zero reference count. */
+		set_page_count(page, 0);
+
 		/*
 		 * Mark page reserved as it will need to wait for onlining
 		 * phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index dfb48cf9c2c9..9e821f1951c5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -114,12 +114,11 @@ static void __put_compound_page(struct page *page)
 void __put_page(struct page *page)
 {
 	if (is_zone_device_page(page)) {
-		put_dev_pagemap(page->pgmap);
-
 		/*
 		 * The page belongs to the device that created pgmap. Do
 		 * not return it to page allocator.
 		 */
+		free_zone_device_page(page);
 		return;
 	}
 
@@ -917,29 +916,18 @@ void release_pages(struct page **pages, int nr)
 		if (is_huge_zero_page(page))
 			continue;
 
+		if (!put_page_testzero(page))
+			continue;
+
 		if (is_zone_device_page(page)) {
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			/*
-			 * ZONE_DEVICE pages that return 'false' from
-			 * page_is_devmap_managed() do not require special
-			 * processing, and instead, expect a call to
-			 * put_page_testzero().
-			 */
-			if (page_is_devmap_managed(page)) {
-				put_devmap_managed_page(page);
-				continue;
-			}
-			if (put_page_testzero(page))
-				put_dev_pagemap(page->pgmap);
+			free_zone_device_page(page);
 			continue;
 		}
 
-		if (!put_page_testzero(page))
-			continue;
-
 		if (PageCompound(page)) {
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -1143,26 +1131,3 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-	int count;
-
-	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-		return;
-
-	count = page_ref_dec_return(page);
-
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.32.0
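
The refcount rules after this patch can be summarised with a small userspace
model. It is a sketch under stated assumptions, not kernel code: struct
model_page and the model_* functions are hypothetical, the counter is not
atomic, and free_calls merely counts how often the free path (the place where
free_zone_device_page() would run) is reached. The comparison in
model_try_get_page() is copied from the patched try_get_page().

```c
#include <stdbool.h>

/* Simplified stand-in for struct page. */
struct model_page {
	int refcount;
	bool zone_device;
	int free_calls;	/* how many times the free path was reached */
};

/*
 * Same comparison as the patched try_get_page(): the lowest count that
 * may still be incremented is 0 for ZONE_DEVICE pages and 1 otherwise.
 */
static bool model_try_get_page(struct model_page *page)
{
	if (page->refcount < (int)!page->zone_device)
		return false;
	page->refcount++;
	return true;
}

/* put_page() without the devmap special case: free when the count hits 0. */
static void model_put_page(struct model_page *page)
{
	if (--page->refcount == 0)
		page->free_calls++;
}
```

With both pages starting at a count of 0, model_try_get_page() succeeds for
the ZONE_DEVICE page and fails for the regular one, and a single
model_put_page() on the ZONE_DEVICE page brings it back to 0 and reaches the
free path exactly once.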



* [PATCH v6 03/13] kernel: resource: lookup_resource as exported symbol
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 01/13] ext4/xfs: add page refcount helper Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM Alex Sierra
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

The AMD architecture for the Frontier supercomputer will
have device memory which can be coherently accessed by
the CPU. The system BIOS advertises this memory as SPM
(special purpose memory) in the UEFI system address map.

The AMDGPU driver needs to be able to lookup this resource
in order to claim it as MEMORY_DEVICE_GENERIC using
devm_memremap_pages.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 kernel/resource.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/resource.c b/kernel/resource.c
index ca9f5198a01f..227fc9fab573 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -772,6 +772,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)
 
 	return res;
 }
+EXPORT_SYMBOL_GPL(lookup_resource);
 
 /*
  * Insert a resource into the resource tree. If successful, return NULL,
-- 
2.32.0



* [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (2 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 03/13] kernel: resource: lookup_resource as exported symbol Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-15  9:10   ` Christoph Hellwig
  2021-08-13  6:31 ` [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram Alex Sierra
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

When the CPU is connected through XGMI, it has coherent
access to the VRAM resource. In this case, that resource
is looked up in the resource tree at the device GMC aperture base.
This resource is used along with the device type, which can
be DEVICE_PRIVATE or DEVICE_GENERIC, to create the device
page map region.
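
The decision this patch makes can be summarized with a small
userspace sketch. All names below (model_memory_type,
model_pgmap_plan, model_plan_pgmap) are hypothetical and exist only
for illustration; the real code operates on struct dev_pagemap.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's memory_type values. */
enum model_memory_type {
	MODEL_DEVICE_PRIVATE = 1,
	MODEL_DEVICE_GENERIC,
};

struct model_pgmap_plan {
	enum model_memory_type type;
	bool owns_region;	/* true if teardown must release the region */
};

static struct model_pgmap_plan model_plan_pgmap(bool xgmi_connected_to_cpu)
{
	struct model_pgmap_plan plan;

	if (xgmi_connected_to_cpu) {
		/* The existing SPM resource is looked up, not allocated. */
		plan.type = MODEL_DEVICE_GENERIC;
		plan.owns_region = false;
	} else {
		/* A free region is requested, so it is released on fini. */
		plan.type = MODEL_DEVICE_PRIVATE;
		plan.owns_region = true;
	}

	return plan;
}
```

This mirrors why svm_migrate_fini only releases the memory region in
the private case: in the generic case the driver did not request the
region in the first place.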

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index dab290a4d19d..24a8b6d4f947 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -868,6 +868,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
 	struct resource *res;
 	unsigned long size;
 	void *r;
+	bool xgmi_connected_to_cpu = adev->gmc.xgmi.connected_to_cpu;
 
 	/* Page migration works on Vega10 or newer */
 	if (kfddev->device_info->asic_family < CHIP_VEGA10)
@@ -880,17 +881,22 @@ int svm_migrate_init(struct amdgpu_device *adev)
 	 * should remove reserved size
 	 */
 	size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
-	res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+	if (xgmi_connected_to_cpu)
+		res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
+	else
+		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+
 	if (IS_ERR(res))
 		return -ENOMEM;
 
-	pgmap->type = MEMORY_DEVICE_PRIVATE;
 	pgmap->nr_range = 1;
 	pgmap->range.start = res->start;
 	pgmap->range.end = res->end;
+	pgmap->type = xgmi_connected_to_cpu ?
+				MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE;
 	pgmap->ops = &svm_migrate_pgmap_ops;
 	pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
-	pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	pgmap->flags = 0;
 	r = devm_memremap_pages(adev->dev, pgmap);
 	if (IS_ERR(r)) {
 		pr_err("failed to register HMM device memory\n");
@@ -914,6 +920,7 @@ void svm_migrate_fini(struct amdgpu_device *adev)
 	struct dev_pagemap *pgmap = &adev->kfd.dev->pgmap;
 
 	devm_memunmap_pages(adev->dev, pgmap);
-	devm_release_mem_region(adev->dev, pgmap->range.start,
-				pgmap->range.end - pgmap->range.start + 1);
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+		devm_release_mem_region(adev->dev, pgmap->range.start,
+					pgmap->range.end - pgmap->range.start + 1);
 }
-- 
2.32.0


* [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (3 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-15 15:38   ` Christoph Hellwig
  2021-08-13  6:31 ` [PATCH v6 06/13] include/linux/mm.h: helpers to check zone device generic type Alex Sierra
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

When migrating from VRAM to RAM, generic device type memory
is accessed by the CPU in the same way as system RAM. The
migrate flags select the source pages, which in the generic
type case should be set to MIGRATE_VMA_SELECT_SYSTEM.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 24a8b6d4f947..e5b10de83a5f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -616,9 +616,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 	migrate.vma = vma;
 	migrate.start = start;
 	migrate.end = end;
-	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
 
+	if (adev->gmc.xgmi.connected_to_cpu)
+		migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
+	else
+		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
 	size *= npages;
 	buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
-- 
2.32.0


* [PATCH v6 06/13] include/linux/mm.h: helpers to check zone device generic type
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (4 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-15  9:16   ` Christoph Hellwig
  2021-08-13  6:31 ` [PATCH v6 07/13] mm: add generic type support to migrate_vma helpers Alex Sierra
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

Add the is_device_page() helper, which checks whether a
zone device page is either private or generic type.
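
To make the relationship between the private-only check and the
combined check concrete, here is a small userspace model. The
model_page struct and the MODEL_* names are assumptions made for this
sketch; the kernel helpers read page->pgmap->type instead.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the ZONE_DEVICE memory types. */
enum model_memory_type {
	MODEL_PRIVATE = 1,
	MODEL_FS_DAX,
	MODEL_GENERIC,
	MODEL_PCI_P2PDMA,
};

struct model_page {
	bool zone_device;		/* models is_zone_device_page() */
	enum model_memory_type type;	/* only meaningful if zone_device */
};

/* True only for device private pages. */
static bool model_is_device_private_page(const struct model_page *page)
{
	return page->zone_device && page->type == MODEL_PRIVATE;
}

/* True for device private and device generic pages. */
static bool model_is_device_page(const struct model_page *page)
{
	return page->zone_device &&
	       (page->type == MODEL_PRIVATE || page->type == MODEL_GENERIC);
}
```

A generic page can then be identified as a page for which the
combined check is true and the private check is false.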

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 include/linux/mm.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d48a1f0889d1..c25cdb92038f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1187,6 +1187,14 @@ static inline bool is_device_private_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		is_zone_device_page(page) &&
+		(page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+		 page->pgmap->type == MEMORY_DEVICE_GENERIC);
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-- 
2.32.0


* [PATCH v6 07/13] mm: add generic type support to migrate_vma helpers
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (5 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 06/13] include/linux/mm.h: helpers to check zone device generic type Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-15  9:19   ` Christoph Hellwig
  2021-08-13  6:31 ` [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages Alex Sierra
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

Add the device generic type case to the migrate_vma_pages and
migrate_vma_check_page helpers.
Generic and private device types use the same
conditions to decide whether to migrate pages to or from
device memory.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 mm/migrate.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index e3a10e2a1bb3..f9e6bfa2867c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2565,7 +2565,7 @@ static bool migrate_vma_check_page(struct page *page)
 		 * FIXME proper solution is to rework migration_entry_wait() so
 		 * it does not need to take a reference on page.
 		 */
-		return is_device_private_page(page);
+		return is_device_page(page);
 	}
 
 	/* For file back page */
@@ -2854,7 +2854,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
  *     handle_pte_fault()
  *       do_anonymous_page()
  * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or generic page.
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
@@ -2925,10 +2925,14 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 
 			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
 			entry = swp_entry_to_pte(swp_entry);
+		} else if (is_device_page(page)) {
+			entry = mk_pte(page, vma->vm_page_prot);
+			if (vma->vm_flags & VM_WRITE)
+				entry = pte_mkwrite(pte_mkdirty(entry));
 		} else {
 			/*
-			 * For now we only support migrating to un-addressable
-			 * device memory.
+			 * We support migrating to private and generic types for device
+			 * zone memory.
 			 */
 			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
 			goto abort;
@@ -3034,10 +3038,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 		mapping = page_mapping(page);
 
 		if (is_zone_device_page(newpage)) {
-			if (is_device_private_page(newpage)) {
+			if (is_device_page(newpage)) {
 				/*
-				 * For now only support private anonymous when
-				 * migrating to un-addressable device memory.
+				 * For now only support private and generic
+				 * anonymous when migrating to device memory.
 				 */
 				if (mapping) {
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-- 
2.32.0


* [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (6 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 07/13] mm: add generic type support to migrate_vma helpers Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-15 15:40   ` Christoph Hellwig
  2021-08-13  6:31 ` [PATCH v6 09/13] lib: test_hmm add ioctl to get zone device type Alex Sierra
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

Add a MEMORY_DEVICE_GENERIC case to free_zone_device_page so that
device generic type memory is able to free its pages properly.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 mm/memremap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 5aa8163fd948..5773e15b6ac9 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -459,7 +459,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-static void free_device_private_page(struct page *page)
+static void free_device_page(struct page *page)
 {
 
 	__ClearPageWaiters(page);
@@ -498,7 +498,8 @@ void free_zone_device_page(struct page *page)
 		wake_up_var(&page->_refcount);
 		return;
 	case MEMORY_DEVICE_PRIVATE:
-		free_device_private_page(page);
+	case MEMORY_DEVICE_GENERIC:
+		free_device_page(page);
 		return;
 	default:
 		return;
-- 
2.32.0


* [PATCH v6 09/13] lib: test_hmm add ioctl to get zone device type
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (7 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 10/13] lib: test_hmm add module param for " Alex Sierra
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

Add a new ioctl command to query the zone device type. This will
be used once test_hmm adds the zone device generic type.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 lib/test_hmm.c      | 15 ++++++++++++++-
 lib/test_hmm_uapi.h |  7 +++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 6998f10350ea..3cd91ca31dd7 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -82,6 +82,7 @@ struct dmirror_chunk {
 struct dmirror_device {
 	struct cdev		cdevice;
 	struct hmm_devmem	*devmem;
+	unsigned int            zone_device_type;
 
 	unsigned int		devmem_capacity;
 	unsigned int		devmem_count;
@@ -468,6 +469,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	if (IS_ERR(res))
 		goto err_devmem;
 
+	mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
 	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
 	devmem->pagemap.range.start = res->start;
 	devmem->pagemap.range.end = res->end;
@@ -912,6 +914,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
 	return ret;
 }
 
+static int dmirror_get_device_type(struct dmirror *dmirror,
+			    struct hmm_dmirror_cmd *cmd)
+{
+	mutex_lock(&dmirror->mutex);
+	cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+	mutex_unlock(&dmirror->mutex);
+
+	return 0;
+}
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
 					unsigned int command,
 					unsigned long arg)
@@ -952,7 +963,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 	case HMM_DMIRROR_SNAPSHOT:
 		ret = dmirror_snapshot(dmirror, &cmd);
 		break;
-
+	case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+		ret = dmirror_get_device_type(dmirror, &cmd);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 670b4ef2a5b6..ee88701793d5 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -26,6 +26,7 @@ struct hmm_dmirror_cmd {
 	__u64		npages;
 	__u64		cpages;
 	__u64		faults;
+	__u64		zone_device_type;
 };
 
 /* Expose the address space of the calling process through hmm device file */
@@ -33,6 +34,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE	_IOWR('H', 0x04, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -60,4 +62,9 @@ enum {
 	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
 };
 
+enum {
+	/* 0 is reserved to catch uninitialized type fields */
+	HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
 #endif /* _LIB_TEST_HMM_UAPI_H */
-- 
2.32.0


* [PATCH v6 10/13] lib: test_hmm add module param for zone device type
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (8 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 09/13] lib: test_hmm add ioctl to get zone device type Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 11/13] lib: add support for device generic type in test_hmm Alex Sierra
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

To configure the device generic type in test_hmm, two
module parameters must be passed, spm_addr_dev0 and
spm_addr_dev1, which correspond to the SP start address
of each of the two devices. If neither parameter is passed,
the private device type is configured.
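
The both-or-neither rule can be sketched in userspace as follows.
The function and enum names are hypothetical, and returning -EINVAL
for the mixed case is an illustrative choice of this sketch rather
than what the patch itself returns.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the test_hmm zone device types. */
enum model_zone_device_type {
	MODEL_TYPE_UNSET = 0,
	MODEL_TYPE_PRIVATE = 1,
	MODEL_TYPE_GENERIC,
};

/*
 * Pick the device type from the two SPM start addresses.
 * Returns 0 on success, or -EINVAL if only one address is set.
 */
static int model_select_device_type(uint64_t spm_addr_dev0,
				    uint64_t spm_addr_dev1,
				    enum model_zone_device_type *type)
{
	if (!spm_addr_dev0 && !spm_addr_dev1) {
		*type = MODEL_TYPE_PRIVATE;
		return 0;
	}
	if (spm_addr_dev0 && spm_addr_dev1) {
		*type = MODEL_TYPE_GENERIC;
		return 0;
	}

	/* Only one of the two parameters was provided. */
	return -EINVAL;
}
```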

v5:
Remove the devmem->pagemap.type = MEMORY_DEVICE_PRIVATE assignment
in dmirror_allocate_chunk that forced pagemap.type
to MEMORY_DEVICE_PRIVATE

v6:
Check for null pointers for resource and memremap references
in dmirror_allocate_chunk
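
The reason for the extra null check is that an error-pointer test
alone does not treat NULL as a failure. The following userspace
model shows the difference; the model_* helpers are simplified
re-implementations written for this sketch, not the kernel macros
themselves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno value encoded in an error pointer in this model. */
#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True only for pointers in the top MODEL_MAX_ERRNO addresses. */
static bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* True for NULL as well as for encoded errors. */
static bool model_is_err_or_null(const void *ptr)
{
	return !ptr || model_is_err(ptr);
}
```

A lookup that reports failure by returning NULL passes the first
check unnoticed, which is why the combined check is used here.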

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 lib/test_hmm.c      | 56 ++++++++++++++++++++++++++++++++-------------
 lib/test_hmm_uapi.h |  1 +
 2 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 3cd91ca31dd7..b4f885c6c6ae 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -33,6 +33,16 @@
 #define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE		16
 
+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+		"Specify start address for SPM (special purpose memory) used for device 0. By setting this Generic device type will be used. Make sure spm_addr_dev1 is set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+		"Specify start address for SPM (special purpose memory) used for device 1. By setting this Generic device type will be used. Make sure spm_addr_dev0 is set too");
+
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
@@ -450,11 +460,11 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 	return ret;
 }
 
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 				   struct page **ppage)
 {
 	struct dmirror_chunk *devmem;
-	struct resource *res;
+	struct resource *res = NULL;
 	unsigned long pfn;
 	unsigned long pfn_first;
 	unsigned long pfn_last;
@@ -462,15 +472,26 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
 	if (!devmem)
-		return false;
+		return -ENOMEM;
+
+	if (!spm_addr_dev0 && !spm_addr_dev1) {
+		res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+					      "hmm_dmirror");
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+	} else if (spm_addr_dev0 && spm_addr_dev1) {
+		res = lookup_resource(&iomem_resource, MINOR(mdevice->cdevice.dev) ?
+							spm_addr_dev0 :
+							spm_addr_dev1);
+		devmem->pagemap.type = MEMORY_DEVICE_GENERIC;
+		mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_GENERIC;
+	} else {
+		pr_err("Both spm_addr_dev parameters should be set\n");
+	}
 
-	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
-				      "hmm_dmirror");
-	if (IS_ERR(res))
+	if (IS_ERR_OR_NULL(res))
 		goto err_devmem;
 
-	mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
 	devmem->pagemap.range.start = res->start;
 	devmem->pagemap.range.end = res->end;
 	devmem->pagemap.nr_range = 1;
@@ -493,10 +514,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		mdevice->devmem_capacity = new_capacity;
 		mdevice->devmem_chunks = new_chunks;
 	}
-
 	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
-	if (IS_ERR(ptr))
+	if (IS_ERR_OR_NULL(ptr)) {
+		if (ptr)
+			ret = PTR_ERR(ptr);
+		else
+			ret = -EFAULT;
 		goto err_release;
+	}
 
 	devmem->mdevice = mdevice;
 	pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -1097,10 +1122,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 	if (ret)
 		return ret;
 
-	/* Build a list of free ZONE_DEVICE private struct pages */
-	dmirror_allocate_chunk(mdevice, NULL);
-
-	return 0;
+	/* Build a list of free ZONE_DEVICE struct pages */
+	return dmirror_allocate_chunk(mdevice, NULL);
 }
 
 static void dmirror_device_remove(struct dmirror_device *mdevice)
@@ -1113,8 +1136,9 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
 				mdevice->devmem_chunks[i];
 
 			memunmap_pages(&devmem->pagemap);
-			release_mem_region(devmem->pagemap.range.start,
-					   range_len(&devmem->pagemap.range));
+			if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+				release_mem_region(devmem->pagemap.range.start,
+						   range_len(&devmem->pagemap.range));
 			kfree(devmem);
 		}
 		kfree(mdevice->devmem_chunks);
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index ee88701793d5..17a6b5059871 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -65,6 +65,7 @@ enum {
 enum {
 	/* 0 is reserved to catch uninitialized type fields */
 	HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+	HMM_DMIRROR_MEMORY_DEVICE_GENERIC,
 };
 
 #endif /* _LIB_TEST_HMM_UAPI_H */
-- 
2.32.0


* [PATCH v6 11/13] lib: add support for device generic type in test_hmm
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (9 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 10/13] lib: test_hmm add module param for " Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 12/13] tools: update hmm-test to support device generic type Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 13/13] tools: update test_hmm script to support SP config Alex Sierra
  12 siblings, 0 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

Device generic type uses device memory that is coherently
accessible by the CPU. Usually, this is shown as an SP
(special purpose) memory range in the BIOS-e820 memory
enumeration. If the system has no SP memory, it can be
faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges
of at least 256MB each. These can be specified with the
efi_fake_mem kernel parameter. For example, two SP ranges
of 1GB starting at physical addresses 0x100000000 and
0x140000000:
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
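
As a sanity check on the example ranges, the arithmetic can be
sketched in userspace. The struct and helper names are hypothetical;
the sketch assumes that the 0x40000 attribute is the special purpose
(SP) memory bit and uses 256MB as the minimum size stated above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CHUNK_SIZE	(256ULL * 1024 * 1024)
#define MODEL_EFI_MEMORY_SP	0x40000ULL	/* assumed SP attribute bit */

/* One size@start:attr entry of the fake memory map. */
struct model_fake_range {
	uint64_t start;
	uint64_t size;
	uint64_t attr;
};

/* A range is usable if it is marked SP and holds at least one chunk. */
static bool model_range_usable(const struct model_fake_range *r)
{
	return (r->attr & MODEL_EFI_MEMORY_SP) &&
	       r->size >= MODEL_CHUNK_SIZE;
}

/* True if the two half-open ranges [start, start + size) intersect. */
static bool model_ranges_overlap(const struct model_fake_range *a,
				 const struct model_fake_range *b)
{
	return a->start < b->start + b->size &&
	       b->start < a->start + a->size;
}
```

With the values above, the first range ends exactly where the second
one begins (0x100000000 + 1GB = 0x140000000), so the two ranges are
adjacent and do not overlap.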

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 lib/test_hmm.c      | 166 +++++++++++++++++++++++++++-----------------
 lib/test_hmm_uapi.h |  10 ++-
 2 files changed, 113 insertions(+), 63 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index b4f885c6c6ae..42edcc8eaad2 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -469,6 +469,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	unsigned long pfn_first;
 	unsigned long pfn_last;
 	void *ptr;
+	int ret = -ENOMEM;
 
 	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
 	if (!devmem)
@@ -550,7 +551,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	}
 	spin_unlock(&mdevice->lock);
 
-	return true;
+	return 0;
 
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
@@ -558,7 +559,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 err_devmem:
 	kfree(devmem);
 
-	return false;
+	return ret;
 }
 
 static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
@@ -567,8 +568,10 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	struct page *rpage;
 
 	/*
-	 * This is a fake device so we alloc real system memory to store
-	 * our device memory.
+	 * For ZONE_DEVICE private type, this is a fake device so we alloc real
+	 * system memory to store our device memory.
+	 * For ZONE_DEVICE generic type we use the actual dpage to store the data
+	 * and ignore rpage.
 	 */
 	rpage = alloc_page(GFP_HIGHUSER);
 	if (!rpage)
@@ -601,7 +604,7 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 					   struct dmirror *dmirror)
 {
 	struct dmirror_device *mdevice = dmirror->mdevice;
-	const unsigned long *src = args->src;
+	unsigned long *src = args->src;
 	unsigned long *dst = args->dst;
 	unsigned long addr;
 
@@ -619,12 +622,18 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 * unallocated pte_none() or read-only zero page.
 		 */
 		spage = migrate_pfn_to_page(*src);
-
+		if (spage && is_zone_device_page(spage)) {
+			pr_debug("page already in device spage pfn: 0x%lx\n",
+				  page_to_pfn(spage));
+			*src &= ~MIGRATE_PFN_MIGRATE;
+			continue;
+		}
 		dpage = dmirror_devmem_alloc_page(mdevice);
 		if (!dpage)
 			continue;
 
-		rpage = dpage->zone_device_data;
+		rpage = is_device_private_page(dpage) ? dpage->zone_device_data :
+							dpage;
 		if (spage)
 			copy_highpage(rpage, spage);
 		else
@@ -636,8 +645,10 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 * the simulated device memory and that page holds the pointer
 		 * to the mirror.
 		 */
+		rpage = dpage->zone_device_data;
 		rpage->zone_device_data = dmirror;
-
+		pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+			 page_to_pfn(spage), page_to_pfn(dpage));
 		*dst = migrate_pfn(page_to_pfn(dpage)) |
 			    MIGRATE_PFN_LOCKED;
 		if ((*src & MIGRATE_PFN_WRITE) ||
@@ -671,10 +682,13 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 			continue;
 
 		/*
-		 * Store the page that holds the data so the page table
-		 * doesn't have to deal with ZONE_DEVICE private pages.
+		 * For ZONE_DEVICE private pages we store the page that
+		 * holds the data so the page table doesn't have to deal it.
+		 * For ZONE_DEVICE generic pages we store the actual page, since
+		 * the CPU has coherent access to the page.
 		 */
-		entry = dpage->zone_device_data;
+		entry = is_device_private_page(dpage) ? dpage->zone_device_data :
+							dpage;
 		if (*dst & MIGRATE_PFN_WRITE)
 			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
 		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
@@ -688,6 +702,47 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 	return 0;
 }
 
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+						      struct dmirror *dmirror)
+{
+	unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long start = args->start;
+	unsigned long end = args->end;
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE,
+				       src++, dst++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(*src);
+		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+		if (is_device_private_page(spage)) {
+			spage = spage->zone_device_data;
+		} else {
+			pr_debug("page already in system or SPM spage pfn: 0x%lx\n",
+				  page_to_pfn(spage));
+			*src &= ~MIGRATE_PFN_MIGRATE;
+			continue;
+		}
+		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+		if (!dpage)
+			continue;
+		pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+			 page_to_pfn(spage), page_to_pfn(dpage));
+
+		lock_page(dpage);
+		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+		copy_highpage(dpage, spage);
+		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+		if (*src & MIGRATE_PFN_WRITE)
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+	return 0;
+}
+
+
 static int dmirror_migrate(struct dmirror *dmirror,
 			   struct hmm_dmirror_cmd *cmd)
 {
@@ -729,33 +784,46 @@ static int dmirror_migrate(struct dmirror *dmirror,
 		args.start = addr;
 		args.end = next;
 		args.pgmap_owner = dmirror->mdevice;
-		args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+		args.flags = (!cmd->alloc_to_devmem &&
+			     dmirror->mdevice->zone_device_type ==
+			     HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+			     MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+			     MIGRATE_VMA_SELECT_SYSTEM;
 		ret = migrate_vma_setup(&args);
 		if (ret)
 			goto out;
 
-		dmirror_migrate_alloc_and_copy(&args, dmirror);
+		if (cmd->alloc_to_devmem) {
+			pr_debug("Migrating from sys mem to device mem\n");
+			dmirror_migrate_alloc_and_copy(&args, dmirror);
+		} else {
+			pr_debug("Migrating from device mem to sys mem\n");
+			dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+		}
 		migrate_vma_pages(&args);
-		dmirror_migrate_finalize_and_map(&args, dmirror);
+		if (cmd->alloc_to_devmem)
+			dmirror_migrate_finalize_and_map(&args, dmirror);
 		migrate_vma_finalize(&args);
 	}
 	mmap_read_unlock(mm);
 	mmput(mm);
 
-	/* Return the migrated data for verification. */
-	ret = dmirror_bounce_init(&bounce, start, size);
-	if (ret)
-		return ret;
-	mutex_lock(&dmirror->mutex);
-	ret = dmirror_do_read(dmirror, start, end, &bounce);
-	mutex_unlock(&dmirror->mutex);
-	if (ret == 0) {
-		if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
-				 bounce.size))
-			ret = -EFAULT;
+	/* Return the migrated data for verification. only for pages in device zone */
+	if (cmd->alloc_to_devmem) {
+		ret = dmirror_bounce_init(&bounce, start, size);
+		if (ret)
+			return ret;
+		mutex_lock(&dmirror->mutex);
+		ret = dmirror_do_read(dmirror, start, end, &bounce);
+		mutex_unlock(&dmirror->mutex);
+		if (ret == 0) {
+			if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+					 bounce.size))
+				ret = -EFAULT;
+		}
+		cmd->cpages = bounce.cpages;
+		dmirror_bounce_fini(&bounce);
 	}
-	cmd->cpages = bounce.cpages;
-	dmirror_bounce_fini(&bounce);
 	return ret;
 
 out:
@@ -779,9 +847,15 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
 	}
 
 	page = hmm_pfn_to_page(entry);
-	if (is_device_private_page(page)) {
-		/* Is the page migrated to this device or some other? */
-		if (dmirror->mdevice == dmirror_page_to_device(page))
+	if (is_device_page(page)) {
+		/* Is page ZONE_DEVICE generic? */
+		if (!is_device_private_page(page))
+			*perm = HMM_DMIRROR_PROT_DEV_GENERIC;
+		/*
+		 * Is page ZONE_DEVICE private migrated to
+		 * this device or some other?
+		 */
+		else if (dmirror->mdevice == dmirror_page_to_device(page))
 			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
 		else
 			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
@@ -1028,38 +1102,6 @@ static void dmirror_devmem_free(struct page *page)
 	spin_unlock(&mdevice->lock);
 }
 
-static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
-						      struct dmirror *dmirror)
-{
-	const unsigned long *src = args->src;
-	unsigned long *dst = args->dst;
-	unsigned long start = args->start;
-	unsigned long end = args->end;
-	unsigned long addr;
-
-	for (addr = start; addr < end; addr += PAGE_SIZE,
-				       src++, dst++) {
-		struct page *dpage, *spage;
-
-		spage = migrate_pfn_to_page(*src);
-		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
-			continue;
-		spage = spage->zone_device_data;
-
-		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-		if (!dpage)
-			continue;
-
-		lock_page(dpage);
-		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
-		copy_highpage(dpage, spage);
-		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
-		if (*src & MIGRATE_PFN_WRITE)
-			*dst |= MIGRATE_PFN_WRITE;
-	}
-	return 0;
-}
-
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
 	struct migrate_vma args;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 17a6b5059871..1f2322286fba 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -17,8 +17,12 @@
  * @addr: (in) user address the device will read/write
  * @ptr: (in) user address where device data is copied to/from
  * @npages: (in) number of pages to read/write
+ * @alloc_to_devmem: (in) desired allocation destination during migration.
+ * True if allocation is to device memory.
+ * False if allocation is to system memory.
  * @cpages: (out) number of pages copied
  * @faults: (out) number of device page faults seen
+ * @zone_device_type: (out) zone device memory type
  */
 struct hmm_dmirror_cmd {
 	__u64		addr;
@@ -26,7 +30,8 @@ struct hmm_dmirror_cmd {
 	__u64		npages;
 	__u64		cpages;
 	__u64		faults;
-	__u64		zone_device_type;
+	__u32		zone_device_type;
+	__u32		alloc_to_devmem;
 };
 
 /* Expose the address space of the calling process through hmm device file */
@@ -49,6 +54,8 @@ struct hmm_dmirror_cmd {
  *					device the ioctl() is made
  * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
  *					other device
+ * HMM_DMIRROR_PROT_DEV_GENERIC: Migrate device generic page on the device
+ *				 the ioctl() is made
  */
 enum {
 	HMM_DMIRROR_PROT_ERROR			= 0xFF,
@@ -60,6 +67,7 @@ enum {
 	HMM_DMIRROR_PROT_ZERO			= 0x10,
 	HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL	= 0x20,
 	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
+	HMM_DMIRROR_PROT_DEV_GENERIC		= 0x40,
 };
 
 enum {
-- 
2.32.0
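
The uapi hunk above replaces one 64-bit zone_device_type field with two
32-bit fields. A minimal userspace sketch, assuming the struct holds only
the six fields documented in the header comment and using uint64_t and
uint32_t as stand-ins for __u64 and __u32, suggests that the size of
struct hmm_dmirror_cmd and the offsets of the earlier fields are not
changed by the split. The struct names cmd_old and cmd_new are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout before the hunk: a single 64-bit type field. */
struct cmd_old {
	uint64_t addr;
	uint64_t ptr;
	uint64_t npages;
	uint64_t cpages;
	uint64_t faults;
	uint64_t zone_device_type;
};

/* Assumed layout after the hunk: the same 8 bytes split in two. */
struct cmd_new {
	uint64_t addr;
	uint64_t ptr;
	uint64_t npages;
	uint64_t cpages;
	uint64_t faults;
	uint32_t zone_device_type;
	uint32_t alloc_to_devmem;
};

/* Compile-time checks: total size and the type field offset are unchanged. */
_Static_assert(sizeof(struct cmd_old) == sizeof(struct cmd_new),
	       "struct size changed");
_Static_assert(offsetof(struct cmd_old, zone_device_type) ==
	       offsetof(struct cmd_new, zone_device_type),
	       "zone_device_type moved");
```

Because the checks are compile-time assertions, a layout mismatch in
this sketch would show up as a build error rather than a test failure.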



* [PATCH v6 12/13] tools: update hmm-test to support device generic type
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (10 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 11/13] lib: add support for device generic type in test_hmm Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  2021-08-13  6:31 ` [PATCH v6 13/13] tools: update test_hmm script to support SP config Alex Sierra
  12 siblings, 0 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

Test cases such as migrate_fault and migrate_multiple
were modified to explicitly migrate from device to system memory,
without relying on page faults, when using the device generic
type.

The snapshot test case was updated to read the memory device type
first and, based on that, check for the proper returned results.
The migrate_ping_pong test case was added to test explicit migration
from device to system memory for both private and generic
zone types.

Helpers to migrate from device to system memory and vice versa
were also added.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 tools/testing/selftests/vm/hmm-tests.c | 142 +++++++++++++++++++++----
 1 file changed, 124 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 5d1ac691b9f4..70632b195497 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,8 @@ struct hmm_buffer {
 	int		fd;
 	uint64_t	cpages;
 	uint64_t	faults;
+	int		zone_device_type;
+	bool		alloc_to_devmem;
 };
 
 #define TWOMEG		(1 << 21)
@@ -133,6 +135,7 @@ static int hmm_dmirror_cmd(int fd,
 	cmd.addr = (__u64)buffer->ptr;
 	cmd.ptr = (__u64)buffer->mirror;
 	cmd.npages = npages;
+	cmd.alloc_to_devmem = buffer->alloc_to_devmem;
 
 	for (;;) {
 		ret = ioctl(fd, request, &cmd);
@@ -144,6 +147,7 @@ static int hmm_dmirror_cmd(int fd,
 	}
 	buffer->cpages = cmd.cpages;
 	buffer->faults = cmd.faults;
+	buffer->zone_device_type = cmd.zone_device_type;
 
 	return 0;
 }
@@ -211,6 +215,34 @@ static void hmm_nanosleep(unsigned int n)
 	nanosleep(&t, NULL);
 }
 
+static int hmm_migrate_sys_to_dev(int fd,
+				   struct hmm_buffer *buffer,
+				   unsigned long npages)
+{
+	buffer->alloc_to_devmem = true;
+	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+				   struct hmm_buffer *buffer,
+				   unsigned long npages)
+{
+	buffer->alloc_to_devmem = false;
+	return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+}
+
+static int hmm_is_private_device(int fd, bool *res)
+{
+	struct hmm_buffer buffer;
+	int ret;
+
+	buffer.ptr = 0;
+	ret = hmm_dmirror_cmd(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &buffer, 1);
+	*res = (buffer.zone_device_type == HMM_DMIRROR_MEMORY_DEVICE_PRIVATE);
+
+	return ret;
+}
+
 /*
  * Simple NULL test of device open/close.
  */
@@ -875,7 +907,7 @@ TEST_F(hmm, migrate)
 		ptr[i] = i;
 
 	/* Migrate memory to device. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, npages);
 
@@ -923,7 +955,7 @@ TEST_F(hmm, migrate_fault)
 		ptr[i] = i;
 
 	/* Migrate memory to device. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, npages);
 
@@ -936,7 +968,7 @@ TEST_F(hmm, migrate_fault)
 		ASSERT_EQ(ptr[i], i);
 
 	/* Migrate memory to the device again. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, npages);
 
@@ -976,7 +1008,7 @@ TEST_F(hmm, migrate_shared)
 	ASSERT_NE(buffer->ptr, MAP_FAILED);
 
 	/* Migrate memory to device. */
-	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 	ASSERT_EQ(ret, -ENOENT);
 
 	hmm_buffer_free(buffer);
@@ -1015,7 +1047,7 @@ TEST_F(hmm2, migrate_mixed)
 	p = buffer->ptr;
 
 	/* Migrating a protected area should be an error. */
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
 	ASSERT_EQ(ret, -EINVAL);
 
 	/* Punch a hole after the first page address. */
@@ -1023,7 +1055,7 @@ TEST_F(hmm2, migrate_mixed)
 	ASSERT_EQ(ret, 0);
 
 	/* We expect an error if the vma doesn't cover the range. */
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
 	ASSERT_EQ(ret, -EINVAL);
 
 	/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1087,13 @@ TEST_F(hmm2, migrate_mixed)
 
 	/* Now try to migrate pages 2-5 to device 1. */
 	buffer->ptr = p + 2 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, 4);
 
 	/* Page 5 won't be migrated to device 0 because it's on device 1. */
 	buffer->ptr = p + 5 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
 	ASSERT_EQ(ret, -ENOENT);
 	buffer->ptr = p;
 
@@ -1070,8 +1102,12 @@ TEST_F(hmm2, migrate_mixed)
 }
 
 /*
- * Migrate anonymous memory to device private memory and fault it back to system
- * memory multiple times.
+ * Migrate anonymous memory to device memory and back to system memory
+ * multiple times. In case of private zone configuration, this is done
+ * through fault pages accessed by CPU. In case of generic zone configuration,
+ * the pages from the device should be explicitly migrated back to system memory.
+ * The reason is Generic device zone has coherent access to CPU, therefore
+ * it will not generate any page fault.
  */
 TEST_F(hmm, migrate_multiple)
 {
@@ -1082,7 +1118,9 @@ TEST_F(hmm, migrate_multiple)
 	unsigned long c;
 	int *ptr;
 	int ret;
+	bool is_private;
 
+	ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);
 	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
 	ASSERT_NE(npages, 0);
 	size = npages << self->page_shift;
@@ -1107,8 +1145,7 @@ TEST_F(hmm, migrate_multiple)
 			ptr[i] = i;
 
 		/* Migrate memory to device. */
-		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
-				      npages);
+		ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
 		ASSERT_EQ(ret, 0);
 		ASSERT_EQ(buffer->cpages, npages);
 
@@ -1116,7 +1153,12 @@ TEST_F(hmm, migrate_multiple)
 		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
 			ASSERT_EQ(ptr[i], i);
 
-		/* Fault pages back to system memory and check them. */
+		/* Migrate back to system memory and check them. */
+		if (!is_private) {
+			ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+			ASSERT_EQ(ret, 0);
+		}
+
 		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
 			ASSERT_EQ(ptr[i], i);
 
@@ -1261,10 +1303,12 @@ TEST_F(hmm2, snapshot)
 	unsigned char *m;
 	int ret;
 	int val;
+	bool is_private;
 
 	npages = 7;
 	size = npages << self->page_shift;
 
+	ASSERT_EQ(hmm_is_private_device(self->fd0, &is_private), 0);
 	buffer = malloc(sizeof(*buffer));
 	ASSERT_NE(buffer, NULL);
 
@@ -1312,13 +1356,13 @@ TEST_F(hmm2, snapshot)
 
 	/* Page 5 will be migrated to device 0. */
 	buffer->ptr = p + 5 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, 1);
 
 	/* Page 6 will be migrated to device 1. */
 	buffer->ptr = p + 6 * self->page_size;
-	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 1);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, 1);
 
@@ -1335,9 +1379,16 @@ TEST_F(hmm2, snapshot)
 	ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
 	ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
 	ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
-	ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
-			HMM_DMIRROR_PROT_WRITE);
-	ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+	if (is_private) {
+		ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+				HMM_DMIRROR_PROT_WRITE);
+		ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+	} else {
+		ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_GENERIC |
+				HMM_DMIRROR_PROT_WRITE);
+		ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_GENERIC |
+				HMM_DMIRROR_PROT_WRITE);
+	}
 
 	hmm_buffer_free(buffer);
 }
@@ -1485,4 +1536,59 @@ TEST_F(hmm2, double_map)
 	hmm_buffer_free(buffer);
 }
 
+/*
+ * Migrate anonymous memory to device memory and migrate back to system memory
+ * explicitly, without generating a page fault.
+ */
+TEST_F(hmm, migrate_ping_pong)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	buffer->alloc_to_devmem = true;
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Migrate memory back to system mem. */
+	ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+
+	/* Check the buffer migrated back to system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
 TEST_HARNESS_MAIN
-- 
2.32.0
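
The snapshot hunk above makes the expected values for pages 5 and 6
depend on the zone device type. The table can be sketched as a small
helper. The values of HMM_DMIRROR_PROT_NONE and HMM_DMIRROR_PROT_WRITE
are assumptions here (they come from the part of test_hmm_uapi.h not
shown in this thread); the other values are taken from the uapi hunk in
patch 11. The helper name expected_prot is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	HMM_DMIRROR_PROT_ERROR             = 0xFF,
	HMM_DMIRROR_PROT_NONE              = 0x00, /* assumed value */
	HMM_DMIRROR_PROT_WRITE             = 0x02, /* assumed value */
	HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
	HMM_DMIRROR_PROT_DEV_GENERIC       = 0x40,
};

/*
 * Expected snapshot byte for pages 5 and 6 of the hmm2 snapshot test.
 * Page 5 was migrated to device 0 and page 6 to device 1.
 */
static unsigned char expected_prot(bool is_private, int page)
{
	switch (page) {
	case 5:
		return (is_private ? HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL
				   : HMM_DMIRROR_PROT_DEV_GENERIC) |
		       HMM_DMIRROR_PROT_WRITE;
	case 6:
		/* Private memory on the other device is not visible. */
		if (is_private)
			return HMM_DMIRROR_PROT_NONE;
		return HMM_DMIRROR_PROT_DEV_GENERIC | HMM_DMIRROR_PROT_WRITE;
	default:
		return HMM_DMIRROR_PROT_ERROR;
	}
}
```

In the generic case both pages report the same value, which matches the
two ASSERT_EQ lines in the else branch of the test.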



* [PATCH v6 13/13] tools: update test_hmm script to support SP config
  2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
                   ` (11 preceding siblings ...)
  2021-08-13  6:31 ` [PATCH v6 12/13] tools: update hmm-test to support device generic type Alex Sierra
@ 2021-08-13  6:31 ` Alex Sierra
  12 siblings, 0 replies; 50+ messages in thread
From: Alex Sierra @ 2021-08-13  6:31 UTC (permalink / raw)
  To: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

Add two more parameters to set the spm_addr_dev0 and spm_addr_dev1
addresses. These two parameters configure the starting SP
address for each device in the test_hmm driver.
Consequently, this configures the zone device type as generic.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 tools/testing/selftests/vm/test_hmm.sh | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..3eeabe94399f 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,7 +40,18 @@ check_test_requirements()
 
 load_driver()
 {
-	modprobe $DRIVER > /dev/null 2>&1
+	if [ $# -eq 0 ]; then
+		modprobe $DRIVER > /dev/null 2>&1
+	else
+		if [ $# -eq 2 ]; then
+			modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2 \
+				> /dev/null 2>&1
+		else
+			echo "Missing module parameters. Make sure to pass"\
+			"spm_addr_dev0 and spm_addr_dev1"
+			usage
+		fi
+	fi
 	if [ $? == 0 ]; then
 		major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
 		mknod /dev/hmm_dmirror0 c $major 0
@@ -58,7 +69,7 @@ run_smoke()
 {
 	echo "Running smoke test. Note, this test provides basic coverage."
 
-	load_driver
+	load_driver $1 $2
 	$(dirname "${BASH_SOURCE[0]}")/hmm-tests
 	unload_driver
 }
@@ -75,6 +86,9 @@ usage()
 	echo "# Smoke testing"
 	echo "./${TEST_NAME}.sh smoke"
 	echo
+	echo "# Smoke testing with SPM enabled"
+	echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+	echo
 	exit 0
 }
 
@@ -84,7 +98,7 @@ function run_test()
 		usage
 	else
 		if [ "$1" = "smoke" ]; then
-			run_smoke
+			run_smoke $2 $3
 		else
 			usage
 		fi
-- 
2.32.0



* Re: [PATCH v6 01/13] ext4/xfs: add page refcount helper
  2021-08-13  6:31 ` [PATCH v6 01/13] ext4/xfs: add page refcount helper Alex Sierra
@ 2021-08-15  9:01   ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-15  9:01 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse

On Fri, Aug 13, 2021 at 01:31:38AM -0500, Alex Sierra wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> v3:
> [AS]: rename dax_layout_is_idle_page func to dax_page_unused
> 
> v4:
> [AS]: This ref count functionality was missing on fuse/dax.c.

These per-patch changelogs go under the "---", otherwise they totally
mess up the logs when committed to git.  Same for the other patches in
this series.

But the change itself looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM
  2021-08-13  6:31 ` [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM Alex Sierra
@ 2021-08-15  9:10   ` Christoph Hellwig
  2021-08-16 18:54     ` Felix Kuehling
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-15  9:10 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse

> @@ -880,17 +881,22 @@ int svm_migrate_init(struct amdgpu_device *adev)
>  	 * should remove reserved size
>  	 */
>  	size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
> -	res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
> +	if (xgmi_connected_to_cpu)
> +		res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
> +	else
> +		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
> +

Can you explain what the point of the lookup_resource is here? res->start
is obviously identical to the start value you pass in.  So this is used
as a way to query the length, but I'm pretty sure the driver must
already know that as it inserted the resource itself, right?

On a slightly higher-level note, svm_migrate_init is already a bit of a
mess with all the if/else, and with the above addressed it will become
a bit more so.  I think splitting it into a device private and a device
generic case would probably help people who find it later to understand
the code much better.  Even more so with a useful comment.


* Re: [PATCH v6 06/13] include/linux/mm.h: helpers to check zone device generic type
  2021-08-13  6:31 ` [PATCH v6 06/13] include/linux/mm.h: helpers to check zone device generic type Alex Sierra
@ 2021-08-15  9:16   ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-15  9:16 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse

On Fri, Aug 13, 2021 at 01:31:43AM -0500, Alex Sierra wrote:
> Two helpers added. One checks if zone device page is generic
> type. The other if page is either private or generic type.
> 
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v6 07/13] mm: add generic type support to migrate_vma helpers
  2021-08-13  6:31 ` [PATCH v6 07/13] mm: add generic type support to migrate_vma helpers Alex Sierra
@ 2021-08-15  9:19   ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-15  9:19 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse

On Fri, Aug 13, 2021 at 01:31:44AM -0500, Alex Sierra wrote:
> Device generic type case added for migrate_vma_pages and
> migrate_vma_check_page helpers.
> Both, generic and private device types have the same
> conditions to decide to migrate pages from/to device
> memory.

This reads a little weird, mostly because it doesn't use up the line
length nicely:

Add the device generic type case to the migrate_vma_pages and
migrate_vma_check_page helpers.  This new case is handled identically
to the existing device private case.

> +			 * We support migrating to private and generic types for device
> +			 * zone memory.

Don't spill comments over 80 characters.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-13  6:31 ` [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
@ 2021-08-15 15:37   ` Christoph Hellwig
  2021-08-15 20:40     ` John Hubbard
  2021-08-18  0:01     ` Ralph Campbell
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-15 15:37 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse, John Hubbard

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8ae31622deef..d48a1f0889d1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1218,7 +1218,7 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
>  static inline __must_check bool try_get_page(struct page *page)
>  {
>  	page = compound_head(page);
> -	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> +	if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))

Please avoid the overly long line.  In fact I'd be tempted to just not
bother here and keep the old, looser check.  Especially given that
John has a patch ready that removes try_get_page entirely.
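
For reference, the difference between the two conditions discussed
above can be sketched in plain C. This is a userspace illustration only:
the helper names are hypothetical, and a bool parameter stands in for
is_zone_device_page(). The new condition warns on a zero refcount for
ordinary pages but tolerates it for ZONE_DEVICE pages, while the old one
warns on zero for every page.

```c
#include <assert.h>
#include <stdbool.h>

/* Old condition: any page with refcount <= 0 triggers the warning. */
static bool old_check_warns(int refcount)
{
	return refcount <= 0;
}

/*
 * New condition from the hunk. !is_zone_device evaluates to 1 for
 * ordinary pages and 0 for ZONE_DEVICE pages, so the threshold is
 * "< 1" or "< 0" respectively.
 */
static bool new_check_warns(int refcount, bool is_zone_device)
{
	return refcount < (int)!is_zone_device;
}
```

The two conditions only disagree for a ZONE_DEVICE page with a refcount
of exactly zero.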


* Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram
  2021-08-13  6:31 ` [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram Alex Sierra
@ 2021-08-15 15:38   ` Christoph Hellwig
  2021-08-16 19:53     ` Sierra Guiza, Alejandro (Alex)
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-15 15:38 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse

On Fri, Aug 13, 2021 at 01:31:42AM -0500, Alex Sierra wrote:
>  	migrate.vma = vma;
>  	migrate.start = start;
>  	migrate.end = end;
> -	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>  	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
>  
> +	if (adev->gmc.xgmi.connected_to_cpu)
> +		migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
> +	else
> +		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;

It's been a while since I touched this migrate code, but doesn't this
mean that if the range already contains system memory the migration
now won't do anything for the connected_to_cpu case?


* Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  2021-08-13  6:31 ` [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages Alex Sierra
@ 2021-08-15 15:40   ` Christoph Hellwig
  2021-08-16 19:00     ` Felix Kuehling
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-15 15:40 UTC (permalink / raw)
  To: Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, hch, jgg, jglisse, Roger Pau Monne,
	Dan Williams, Boris Ostrovsky

On Fri, Aug 13, 2021 at 01:31:45AM -0500, Alex Sierra wrote:
> Add MEMORY_DEVICE_GENERIC case to free_zone_device_page callback.
> Device generic type memory case is now able to free its pages properly.

How is this going to work for the two existing MEMORY_DEVICE_GENERIC
users that now change behavior, and which don't have a ->page_free
callback at all?

> 
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
>  mm/memremap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 5aa8163fd948..5773e15b6ac9 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -459,7 +459,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -static void free_device_private_page(struct page *page)
> +static void free_device_page(struct page *page)
>  {
>  
>  	__ClearPageWaiters(page);
> @@ -498,7 +498,8 @@ void free_zone_device_page(struct page *page)
>  		wake_up_var(&page->_refcount);
>  		return;
>  	case MEMORY_DEVICE_PRIVATE:
> -		free_device_private_page(page);
> +	case MEMORY_DEVICE_GENERIC:
> +		free_device_page(page);
>  		return;
>  	default:
>  		return;
> -- 
> 2.32.0
---end quoted text---


* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-15 15:37   ` Christoph Hellwig
@ 2021-08-15 20:40     ` John Hubbard
  2021-08-16 18:56       ` Felix Kuehling
  2021-08-20  6:33         ` Jerome Glisse
  0 siblings, 2 replies; 50+ messages in thread
From: John Hubbard @ 2021-08-15 20:40 UTC (permalink / raw)
  To: Christoph Hellwig, Alex Sierra
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, jgg, jglisse

On 8/15/21 8:37 AM, Christoph Hellwig wrote:
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 8ae31622deef..d48a1f0889d1 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1218,7 +1218,7 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
>>   static inline __must_check bool try_get_page(struct page *page)
>>   {
>>   	page = compound_head(page);
>> -	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>> +	if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))
> 
> Please avoid the overly long line.  In fact I'd be tempted to just not
> bother here and keep the old, more lose check.  Especially given that
> John has a patch ready that removes try_get_page entirely.
> 

Yes. Andrew has accepted it into mmotm.

Ralph's patch here was written well before my cleanup that removed
try_get_page() [1]. But now that we're here, if you drop this hunk then
it will make merging easier, I think.


[1] https://lore.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com

thanks,
--
John Hubbard
NVIDIA


* Re: [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM
  2021-08-15  9:10   ` Christoph Hellwig
@ 2021-08-16 18:54     ` Felix Kuehling
  2021-08-17  5:47       ` Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: Felix Kuehling @ 2021-08-16 18:54 UTC (permalink / raw)
  To: Christoph Hellwig, Alex Sierra
  Cc: akpm, linux-mm, rcampbell, linux-ext4, linux-xfs, amd-gfx,
	dri-devel, jgg, jglisse

Am 2021-08-15 um 5:10 a.m. schrieb Christoph Hellwig:
>> @@ -880,17 +881,22 @@ int svm_migrate_init(struct amdgpu_device *adev)
>>  	 * should remove reserved size
>>  	 */
>>  	size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
>> -	res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
>> +	if (xgmi_connected_to_cpu)
>> +		res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
>> +	else
>> +		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
>> +
> Can you explain what the point of the lookup_resource is here? res->start
> is obviously identical to the start value you pass in.  So this is used
> as a way to query the length, but I'm pretty sure the driver must
> already know that as it inserted the resource itself, right?

I think you're right. We only need the start and end address from
lookup_resource and we already know that anyway. It means we can drop
patch 3 from the series.

Just to be sure, we'll confirm that the end address determined by our
driver matches the one from lookup_resource (coming from the system
address map in the system BIOS). If there were a mismatch, it would
probably be a bug (in the driver or the BIOS) that we'd need to fix anyway.


>
> On a slightly higher level comment svm_migrate_init is a bit of a mess
> with all the if/else already, and with the above addressed will become
> a bit more.  I think splitting it into a device private and device
> generic case would probably help people finding it to understand the code
> much better later on.  Even more so with a useful comment.

I don't really see the "mess" you're talking about. Including the above,
there are only 3 conditional statements in that function that are not
error-handling related:

        /* Page migration works on Vega10 or newer */
        if (kfddev->device_info->asic_family < CHIP_VEGA10)
                return -EINVAL;
...
        if (xgmi_connected_to_cpu)
                res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
        else
                res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
...
        pgmap->type = xgmi_connected_to_cpu ?
                                MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE;


Regards,
  Felix
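
The consistency check mentioned above (driver-computed range versus the
range returned by lookup_resource) could look roughly like the following
userspace sketch. The struct and helper names are hypothetical, and the
inclusive end address mirrors the convention of the kernel's struct
resource. The real check would operate on the kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct resource; 'end' is inclusive. */
struct res_sketch {
	uint64_t start;
	uint64_t end;
};

/* Round x up to a multiple of a, where a is a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* True if the looked-up range is exactly [base, base + size - 1]. */
static bool res_matches(const struct res_sketch *res,
			uint64_t base, uint64_t size)
{
	if (!res || size == 0)
		return false;
	return res->start == base && res->end == base + size - 1;
}
```

A mismatch reported by such a check would point at a driver or BIOS bug,
as noted in the reply.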




* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-15 20:40     ` John Hubbard
@ 2021-08-16 18:56       ` Felix Kuehling
  2021-08-20  6:33         ` Jerome Glisse
  1 sibling, 0 replies; 50+ messages in thread
From: Felix Kuehling @ 2021-08-16 18:56 UTC (permalink / raw)
  To: John Hubbard, Christoph Hellwig, Alex Sierra
  Cc: akpm, linux-mm, rcampbell, linux-ext4, linux-xfs, amd-gfx,
	dri-devel, jgg, jglisse

Am 2021-08-15 um 4:40 p.m. schrieb John Hubbard:
> On 8/15/21 8:37 AM, Christoph Hellwig wrote:
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 8ae31622deef..d48a1f0889d1 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1218,7 +1218,7 @@ __maybe_unused struct page
>>> *try_grab_compound_head(struct page *page, int refs,
>>>   static inline __must_check bool try_get_page(struct page *page)
>>>   {
>>>       page = compound_head(page);
>>> -    if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>>> +    if (WARN_ON_ONCE(page_ref_count(page) <
>>> (int)!is_zone_device_page(page)))
>>
>> Please avoid the overly long line.  In fact I'd be tempted to just not
>> bother here and keep the old, more lose check.  Especially given that
>> John has a patch ready that removes try_get_page entirely.
>>
>
> Yes. Andrew has accepted it into mmotm.
>
> Ralph's patch here was written well before my cleanup that removed
> try_grab_page() [1]. But now that we're here, if you drop this hunk then
> it will make merging easier, I think.
>
>
> [1]
> https://lore.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com

Hi John,

Thanks for the pointer. We'll drop this hunk and add a statement to our
patch description to highlight the dependency on your patch.

Regards,
  Felix


>
> thanks,
> -- 
> John Hubbard
> NVIDIA
>


* Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  2021-08-15 15:40   ` Christoph Hellwig
@ 2021-08-16 19:00     ` Felix Kuehling
  2021-08-17  5:50       ` Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: Felix Kuehling @ 2021-08-16 19:00 UTC (permalink / raw)
  To: Christoph Hellwig, Alex Sierra
  Cc: akpm, linux-mm, rcampbell, linux-ext4, linux-xfs, amd-gfx,
	dri-devel, jgg, jglisse, Roger Pau Monne, Dan Williams,
	Boris Ostrovsky


Am 2021-08-15 um 11:40 a.m. schrieb Christoph Hellwig:
> On Fri, Aug 13, 2021 at 01:31:45AM -0500, Alex Sierra wrote:
>> Add MEMORY_DEVICE_GENERIC case to free_zone_device_page callback.
>> Device generic type memory case is now able to free its pages properly.
> How is this going to work for the two existing MEMORY_DEVICE_GENERIC
> that now change behavior?  And which don't have a ->page_free callback
> at all?

That's a good catch. Existing drivers shouldn't need a page_free
callback if they didn't have one before. That means we need to add a
NULL-pointer check in free_device_page.

Regards,
  Felix


>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> ---
>>  mm/memremap.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 5aa8163fd948..5773e15b6ac9 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -459,7 +459,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>  
>>  #ifdef CONFIG_DEV_PAGEMAP_OPS
>> -static void free_device_private_page(struct page *page)
>> +static void free_device_page(struct page *page)
>>  {
>>  
>>  	__ClearPageWaiters(page);
>> @@ -498,7 +498,8 @@ void free_zone_device_page(struct page *page)
>>  		wake_up_var(&page->_refcount);
>>  		return;
>>  	case MEMORY_DEVICE_PRIVATE:
>> -		free_device_private_page(page);
>> +	case MEMORY_DEVICE_GENERIC:
>> +		free_device_page(page);
>>  		return;
>>  	default:
>>  		return;
>> -- 
>> 2.32.0
> ---end quoted text---
>
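
A minimal sketch of the NULL-pointer check suggested in this reply,
using simplified userspace stand-ins for struct page and struct
dev_pagemap. The field layout and the names free_device_page_sketch and
mark_freed are hypothetical; only the idea of skipping an absent
->page_free callback comes from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page;

struct dev_pagemap_ops {
	void (*page_free)(struct page *page);	/* optional callback */
};

struct dev_pagemap {
	const struct dev_pagemap_ops *ops;	/* may be NULL */
};

struct page {
	struct dev_pagemap *pgmap;
	int freed;				/* set by the example callback */
};

/* Call ->page_free only if the driver provided one; report whether it ran. */
static bool free_device_page_sketch(struct page *page)
{
	const struct dev_pagemap_ops *ops = page->pgmap->ops;

	if (!ops || !ops->page_free)
		return false;
	ops->page_free(page);
	return true;
}

/* Example callback for a driver that does implement ->page_free. */
static void mark_freed(struct page *page)
{
	page->freed = 1;
}
```

With this shape, existing MEMORY_DEVICE_GENERIC users without a
callback would see no behavior change.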


* Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram
  2021-08-15 15:38   ` Christoph Hellwig
@ 2021-08-16 19:53     ` Sierra Guiza, Alejandro (Alex)
  2021-08-16 22:06       ` Zeng, Oak
  2021-08-17  5:49       ` Christoph Hellwig
  0 siblings, 2 replies; 50+ messages in thread
From: Sierra Guiza, Alejandro (Alex) @ 2021-08-16 19:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: akpm, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, jgg, jglisse


On 8/15/2021 10:38 AM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2021 at 01:31:42AM -0500, Alex Sierra wrote:
>>   	migrate.vma = vma;
>>   	migrate.start = start;
>>   	migrate.end = end;
>> -	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>>   	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
>>   
>> +	if (adev->gmc.xgmi.connected_to_cpu)
>> +		migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
>> +	else
>> +		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
> It's been a while since I touched this migrate code, but doesn't this
> mean that if the range already contains system memory the migration
> now won't do anything? for the connected_to_cpu case?

In the connected_to_cpu case above, we are explicitly migrating from
device memory to system memory with the device generic type. With this
type, device PTEs are present in the CPU page table.

During the migrate_vma_collect_pmd walk in migrate_vma_setup, there is a
condition for present PTEs that requires migrate->flags to include
MIGRATE_VMA_SELECT_SYSTEM. Otherwise, migration of the entry is skipped.
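
The selection check described above can be sketched in userspace as
follows. This is only a simplified model, not the code in mm/migrate.c:
the struct, the function and the MODEL_* names are made up for this
example, the flag values are illustrative, and the pgmap_owner
comparison is left out.

```c
#include <stdbool.h>

/* Illustrative flag values for this sketch only; the real definitions
 * live in include/linux/migrate.h. */
enum model_select_flags {
	MODEL_SELECT_SYSTEM         = 1 << 0,
	MODEL_SELECT_DEVICE_PRIVATE = 1 << 1,
};

/* Hypothetical, simplified view of one page table entry. */
struct model_pte {
	bool present;        /* PTE is present in the CPU page table */
	bool device_private; /* non-present swap entry for a device private page */
};

/* Returns true if the entry would be collected for migration. */
static bool model_collect(struct model_pte pte, unsigned int flags)
{
	/* Non-present entries are only collected if they refer to a
	 * device private page and the caller selected that source. */
	if (!pte.present)
		return pte.device_private &&
		       (flags & MODEL_SELECT_DEVICE_PRIVATE) != 0;

	/* Present PTEs, which include DEVICE_GENERIC pages, are only
	 * collected when SELECT_SYSTEM is set. */
	return (flags & MODEL_SELECT_SYSTEM) != 0;
}
```

In this model a present (device generic) entry is skipped when only
MODEL_SELECT_DEVICE_PRIVATE is passed, which is the case the patch is
trying to avoid.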

Regards,
Alex S.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram
  2021-08-16 19:53     ` Sierra Guiza, Alejandro (Alex)
@ 2021-08-16 22:06       ` Zeng, Oak
  2021-08-17  0:42         ` Felix Kuehling
  2021-08-17  5:49       ` Christoph Hellwig
  1 sibling, 1 reply; 50+ messages in thread
From: Zeng, Oak @ 2021-08-16 22:06 UTC (permalink / raw)
  To: Sierra Guiza, Alejandro (Alex), Christoph Hellwig
  Cc: akpm, Kuehling, Felix, linux-mm, rcampbell, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, jgg, jglisse




 

On 2021-08-16, 3:53 PM, "amd-gfx on behalf of Sierra Guiza, Alejandro (Alex)" <amd-gfx-bounces@lists.freedesktop.org on behalf of alex.sierra@amd.com> wrote:


    On 8/15/2021 10:38 AM, Christoph Hellwig wrote:
    > On Fri, Aug 13, 2021 at 01:31:42AM -0500, Alex Sierra wrote:
    >>   	migrate.vma = vma;
    >>   	migrate.start = start;
    >>   	migrate.end = end;
    >> -	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
    >>   	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
    >>   
    >> +	if (adev->gmc.xgmi.connected_to_cpu)
    >> +		migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
    >> +	else
    >> +		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
    > It's been a while since I touched this migrate code, but doesn't this
    > mean that if the range already contains system memory the migration
    > now won't do anything? for the connected_to_cpu case?

    For above’s condition equal to connected_to_cpu , we’re explicitly 
    migrating from
    device memory to system memory with device generic type. 

For the MEMORY_DEVICE_GENERIC memory type, why do we need to explicitly migrate from device memory to normal system memory? I thought the design was that the CPU can access this type of memory in place without migration (just like the CPU accesses normal system memory), so there is no need to migrate it to normal system memory.

With this patch, the migration behavior will be: when memory is accessed by the CPU, it is migrated to normal system memory; when memory is accessed by the GPU, it is migrated to device VRAM. This is basically the same behavior as when VRAM is treated as DEVICE_PRIVATE.

I thought the whole goal of introducing DEVICE_GENERIC was to avoid such back-and-forth migration between device memory and normal system memory. But maybe I am missing something here.

Regards,
Oak

In this type, 
    device PTEs are
    present in CPU page table.

    During migrate_vma_collect_pmd walk op at migrate_vma_setup call, 
    there’s a condition
    for present pte that require migrate->flags be set for 
    MIGRATE_VMA_SELECT_SYSTEM.
    Otherwise, the migration for this entry will be ignored.

    Regards,
    Alex S.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram
  2021-08-16 22:06       ` Zeng, Oak
@ 2021-08-17  0:42         ` Felix Kuehling
  0 siblings, 0 replies; 50+ messages in thread
From: Felix Kuehling @ 2021-08-17  0:42 UTC (permalink / raw)
  To: Zeng, Oak, Sierra Guiza, Alejandro (Alex), Christoph Hellwig
  Cc: akpm, linux-mm, rcampbell, linux-ext4, linux-xfs, amd-gfx,
	dri-devel, jgg, jglisse

On 2021-08-16 at 6:06 p.m., Zeng, Oak wrote:
> Regards,
> Oak 
>
>  
>
> On 2021-08-16, 3:53 PM, "amd-gfx on behalf of Sierra Guiza, Alejandro (Alex)" <amd-gfx-bounces@lists.freedesktop.org on behalf of alex.sierra@amd.com> wrote:
>
>
>     On 8/15/2021 10:38 AM, Christoph Hellwig wrote:
>     > On Fri, Aug 13, 2021 at 01:31:42AM -0500, Alex Sierra wrote:
>     >>   	migrate.vma = vma;
>     >>   	migrate.start = start;
>     >>   	migrate.end = end;
>     >> -	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>     >>   	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
>     >>   
>     >> +	if (adev->gmc.xgmi.connected_to_cpu)
>     >> +		migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
>     >> +	else
>     >> +		migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>     > It's been a while since I touched this migrate code, but doesn't this
>     > mean that if the range already contains system memory the migration
>     > now won't do anything? for the connected_to_cpu case?
>
>     For above’s condition equal to connected_to_cpu , we’re explicitly 
>     migrating from
>     device memory to system memory with device generic type. 
>
> For MEMORY_DEVICE_GENERIC memory type, why do we need to explicitly migrate it from device memory to normal system memory? I thought the design was, for this type of memory, CPU can access it in place without migration(just like CPU access normal system memory), so there is no need to migrate such type of memory to normal system memory...
>
> With this patch, the migration behavior will be: when memory is accessed by CPU, it will be migrated to normal system memory; when memory is accessed by GPU, it will be migrated to device vram. This is basically the same behavior as when vram is treated as DEVICE_PRIVATE. 
>
> I thought the whole goal of introducing DEVICE_GENERIC is to avoid such back and forth migration b/t device memory and normal system memory. But maybe I am missing something here....

Hi Oak,

By using MEMORY_DEVICE_GENERIC we can avoid CPU page faults triggering
migration back to system memory on every CPU access on the Frontier
system architecture, because such pages can be mapped in the CPU page
table. You're right that this is the reason for the whole patch series.

But we still need the ability to migrate from MEMORY_DEVICE_GENERIC to
system memory for reasons other than CPU page faults. Applications can
request migrations explicitly (hipMemPrefetchAsync). Or we can be forced
to migrate data due to memory pressure from other allocations (evictions
in the TTM memory allocator).
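
To make the distinction concrete, here is a small userspace sketch of
which triggers lead to a migration back to system memory for each pgmap
type, as described in the two paragraphs above. The enum and function
names are hypothetical and exist only for this example; this is not
driver code.

```c
#include <stdbool.h>

enum model_pgmap_type {
	MODEL_DEVICE_PRIVATE,
	MODEL_DEVICE_GENERIC,
};

enum model_trigger {
	MODEL_CPU_FAULT, /* CPU touches a page that lives in VRAM */
	MODEL_PREFETCH,  /* explicit request to move data to system memory */
	MODEL_EVICTION,  /* memory pressure from other VRAM allocations */
};

/* Returns true if the trigger results in a migration to system memory. */
static bool model_migrates_to_ram(enum model_pgmap_type type,
				  enum model_trigger trigger)
{
	/* DEVICE_GENERIC pages are mapped in the CPU page table, so a
	 * CPU access does not fault and does not force a migration. */
	if (trigger == MODEL_CPU_FAULT)
		return type == MODEL_DEVICE_PRIVATE;

	/* Explicit prefetches and evictions apply to both types. */
	return true;
}
```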

Regards,
  Felix


>
> Regards,
> Oak
>
> In this type, 
>     device PTEs are
>     present in CPU page table.
>
>     During migrate_vma_collect_pmd walk op at migrate_vma_setup call, 
>     there’s a condition
>     for present pte that require migrate->flags be set for 
>     MIGRATE_VMA_SELECT_SYSTEM.
>     Otherwise, the migration for this entry will be ignored.
>
>     Regards,
>     Alex S.
>
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM
  2021-08-16 18:54     ` Felix Kuehling
@ 2021-08-17  5:47       ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-17  5:47 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christoph Hellwig, Alex Sierra, akpm, linux-mm, rcampbell,
	linux-ext4, linux-xfs, amd-gfx, dri-devel, jgg, jglisse

On Mon, Aug 16, 2021 at 02:54:30PM -0400, Felix Kuehling wrote:
> I think you're right. We only need the start and end address from
> lookup_resource and we already know that anyway. It means we can drop
> patch 3 from the series.
> 
> Just to be sure, we'll confirm that the end address determined by our
> driver matches the one from lookup_resource (coming from the system
> address map in the system BIOS). If there were a mismatch, it would
> probably be a bug (in the driver or the BIOS) that we'd need to fix anyway.

Or rather that the driver-claimed area is smaller than or the same as
the BIOS range.  No harm (except for potential performance implications)
when you don't use all of it.
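
A minimal sketch of such a containment check, written as a standalone
userspace function. The struct and function names are made up for this
example; the only thing borrowed from the kernel is the inclusive
start/end convention of struct resource.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, same convention as struct resource. */
struct model_range {
	uint64_t start;
	uint64_t end;
};

/* Returns true if the range claimed by the driver lies entirely inside
 * the range advertised by the BIOS. A smaller claimed range is fine. */
static bool model_range_within(struct model_range claimed,
			       struct model_range bios)
{
	return claimed.start <= claimed.end &&
	       claimed.start >= bios.start &&
	       claimed.end <= bios.end;
}
```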

> I don't really see the "mess" you're talking about. Including the above,
> there are only 3 conditional statements in that function that are not
> error-handling related:
> 
>         /* Page migration works on Vega10 or newer */
>         if (kfddev->device_info->asic_family < CHIP_VEGA10)
>                 return -EINVAL;
> ...
>         if (xgmi_connected_to_cpu)
>                 res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
>         else
>                 res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
> ...
>         pgmap->type = xgmi_connected_to_cpu ?
>                                 MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE;
> 

Plus the devm_release_mem_region error handling that is currently missing.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram
  2021-08-16 19:53     ` Sierra Guiza, Alejandro (Alex)
  2021-08-16 22:06       ` Zeng, Oak
@ 2021-08-17  5:49       ` Christoph Hellwig
  1 sibling, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-17  5:49 UTC (permalink / raw)
  To: Sierra Guiza, Alejandro (Alex)
  Cc: Christoph Hellwig, akpm, Felix.Kuehling, linux-mm, rcampbell,
	linux-ext4, linux-xfs, amd-gfx, dri-devel, jgg, jglisse

On Mon, Aug 16, 2021 at 02:53:18PM -0500, Sierra Guiza, Alejandro (Alex) wrote:
> For above’s condition equal to connected_to_cpu , we’re explicitly 
> migrating from
> device memory to system memory with device generic type. In this type, 
> device PTEs are
> present in CPU page table.
>
> During migrate_vma_collect_pmd walk op at migrate_vma_setup call, there’s 
> a condition
> for present pte that require migrate->flags be set for 
> MIGRATE_VMA_SELECT_SYSTEM.
> Otherwise, the migration for this entry will be ignored.

I think we might need a new SELECT flag here for IOMEM.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  2021-08-16 19:00     ` Felix Kuehling
@ 2021-08-17  5:50       ` Christoph Hellwig
  2021-08-17 15:44         ` Felix Kuehling
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-17  5:50 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christoph Hellwig, Alex Sierra, akpm, linux-mm, rcampbell,
	linux-ext4, linux-xfs, amd-gfx, dri-devel, jgg, jglisse,
	Roger Pau Monne, Dan Williams, Boris Ostrovsky

On Mon, Aug 16, 2021 at 03:00:49PM -0400, Felix Kuehling wrote:
> 
> On 2021-08-15 at 11:40 a.m., Christoph Hellwig wrote:
> > On Fri, Aug 13, 2021 at 01:31:45AM -0500, Alex Sierra wrote:
> >> Add MEMORY_DEVICE_GENERIC case to free_zone_device_page callback.
> >> Device generic type memory case is now able to free its pages properly.
> > How is this going to work for the two existing MEMORY_DEVICE_GENERIC
> > that now change behavior?  And which don't have a ->page_free callback
> > at all?
> 
> That's a good catch. Existing drivers shouldn't need a page_free
> callback if they didn't have one before. That means we need to add a
> NULL-pointer check in free_device_page.

Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
->mapping = NULL).
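
The NULL-pointer check mentioned in the quoted text could look roughly
like the following userspace model. All model_* types and functions are
hypothetical stand-ins for this example, not the mm/memremap.c code, and
the sketch deliberately leaves out the state clearing, since whether
existing DEVICE_GENERIC users should get it is the open question here.

```c
#include <stddef.h>

struct model_page;

struct model_pgmap_ops {
	void (*page_free)(struct model_page *page); /* may be NULL */
};

struct model_pgmap {
	const struct model_pgmap_ops *ops; /* may be NULL */
};

struct model_page {
	struct model_pgmap *pgmap;
	int freed; /* example-only counter of page_free invocations */
};

/* Calls the driver's page_free callback only if one was registered. */
static void model_free_device_page(struct model_page *page)
{
	const struct model_pgmap_ops *ops = page->pgmap->ops;

	if (ops && ops->page_free)
		ops->page_free(page);
}

/* Example callback that records that it ran. */
static void model_count_free(struct model_page *page)
{
	page->freed++;
}
```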

In many ways this seems like you want to bring back the DEVICE_PUBLIC
pgmap type that was removed a while ago due to the lack of users
instead of overloading the generic type.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  2021-08-17  5:50       ` Christoph Hellwig
@ 2021-08-17 15:44         ` Felix Kuehling
  2021-08-20  5:05           ` Christoph Hellwig
  0 siblings, 1 reply; 50+ messages in thread
From: Felix Kuehling @ 2021-08-17 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Sierra, akpm, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, jgg, jglisse, Roger Pau Monne, Dan Williams,
	Boris Ostrovsky

On 2021-08-17 at 1:50 a.m., Christoph Hellwig wrote:
> On Mon, Aug 16, 2021 at 03:00:49PM -0400, Felix Kuehling wrote:
>> On 2021-08-15 at 11:40 a.m., Christoph Hellwig wrote:
>>> On Fri, Aug 13, 2021 at 01:31:45AM -0500, Alex Sierra wrote:
>>>> Add MEMORY_DEVICE_GENERIC case to free_zone_device_page callback.
>>>> Device generic type memory case is now able to free its pages properly.
>>> How is this going to work for the two existing MEMORY_DEVICE_GENERIC
>>> that now change behavior?  And which don't have a ->page_free callback
>>> at all?
>> That's a good catch. Existing drivers shouldn't need a page_free
>> callback if they didn't have one before. That means we need to add a
>> NULL-pointer check in free_device_page.
> Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
> ->mapping = NULL).
>
> In many ways this seems like you want to bring back the DEVICE_PUBLIC
> pgmap type that was removed a while ago due to the lack of users
> instead of overloading the generic type.

I think so. I'm not clear about how DEVICE_PUBLIC differed from what
DEVICE_GENERIC is today. As I understand it, DEVICE_PUBLIC was removed
because it was unused and also known to be broken in some ways.
DEVICE_GENERIC seemed close enough to what we need, other than not being
supported in the migration helpers.

Would you see benefit in re-introducing DEVICE_PUBLIC as a distinct
memory type from DEVICE_GENERIC? What would be the benefits of making
that distinction?

Thanks,
  Felix



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-13  6:31 ` [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
@ 2021-08-18  0:01     ` Ralph Campbell
  2021-08-18  0:01     ` Ralph Campbell
  1 sibling, 0 replies; 50+ messages in thread
From: Ralph Campbell @ 2021-08-18  0:01 UTC (permalink / raw)
  To: Alex Sierra, akpm, Felix.Kuehling, linux-mm, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

On 8/12/21 11:31 PM, Alex Sierra wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
>
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
>
> v2:
> AS: merged this patch in linux 5.11 version
>
> v5:
> AS: add condition at try_grab_page to check for the zone device type, while
> page ref counter is checked less/equal to zero. In case of device zone, pages
> ref counter are initialized to zero.
>
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
>   arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>   fs/dax.c                               |  4 +-
>   include/linux/dax.h                    |  2 +-
>   include/linux/memremap.h               |  7 +--
>   include/linux/mm.h                     | 13 +----
>   lib/test_hmm.c                         |  2 +-
>   mm/internal.h                          |  8 +++
>   mm/memremap.c                          | 68 +++++++-------------------
>   mm/migrate.c                           |  5 --
>   mm/page_alloc.c                        |  3 ++
>   mm/swap.c                              | 45 ++---------------
>   12 files changed, 46 insertions(+), 115 deletions(-)
>
I haven't seen a response to the issues I raised back at v3 of this series.
https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/

Did I miss something?


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-18  0:01     ` Ralph Campbell
@ 2021-08-18  0:35       ` Felix Kuehling
  -1 siblings, 0 replies; 50+ messages in thread
From: Felix Kuehling @ 2021-08-18  0:35 UTC (permalink / raw)
  To: Ralph Campbell, Alex Sierra, akpm, linux-mm, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse


On 2021-08-17 at 8:01 p.m., Ralph Campbell wrote:
> On 8/12/21 11:31 PM, Alex Sierra wrote:
>> From: Ralph Campbell <rcampbell@nvidia.com>
>>
>> ZONE_DEVICE struct pages have an extra reference count that
>> complicates the
>> code for put_page() and several places in the kernel that need to
>> check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't
>> need to
>> be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> v5:
>> AS: add condition at try_grab_page to check for the zone device type,
>> while
>> page ref counter is checked less/equal to zero. In case of device
>> zone, pages
>> ref counter are initialized to zero.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>   fs/dax.c                               |  4 +-
>>   include/linux/dax.h                    |  2 +-
>>   include/linux/memremap.h               |  7 +--
>>   include/linux/mm.h                     | 13 +----
>>   lib/test_hmm.c                         |  2 +-
>>   mm/internal.h                          |  8 +++
>>   mm/memremap.c                          | 68 +++++++-------------------
>>   mm/migrate.c                           |  5 --
>>   mm/page_alloc.c                        |  3 ++
>>   mm/swap.c                              | 45 ++---------------
>>   12 files changed, 46 insertions(+), 115 deletions(-)
>>
> I haven't seen a response to the issues I raised back at v3 of this
> series.
> https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/
>
>
> Did I miss something?

I think part of the response was that we did more testing. Alex added
support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
about a zero page refcount in try_get_page. The fix is in the latest
version of patch 2. But it's already obsolete because John Hubbard is
about to remove that function altogether.

I think the issues you raised were more uncertainty than known bugs. It
seems the fact that you can have DAX pages with 0 refcount is a feature
more than a bug.

Regards,
  Felix



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-18  0:35       ` Felix Kuehling
  (?)
@ 2021-08-18 19:28       ` Ralph Campbell
  2021-08-19 18:00         ` Sierra Guiza, Alejandro (Alex)
  2021-08-20  4:56         ` Christoph Hellwig
  -1 siblings, 2 replies; 50+ messages in thread
From: Ralph Campbell @ 2021-08-18 19:28 UTC (permalink / raw)
  To: Felix Kuehling, Alex Sierra, akpm, linux-mm, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

On 8/17/21 5:35 PM, Felix Kuehling wrote:
> On 2021-08-17 at 8:01 p.m., Ralph Campbell wrote:
>> On 8/12/21 11:31 PM, Alex Sierra wrote:
>>> From: Ralph Campbell <rcampbell@nvidia.com>
>>>
>>> ZONE_DEVICE struct pages have an extra reference count that
>>> complicates the
>>> code for put_page() and several places in the kernel that need to
>>> check the
>>> reference count to see that a page is not being used (gup, compaction,
>>> migration, etc.). Clean up the code so the reference count doesn't
>>> need to
>>> be treated specially for ZONE_DEVICE.
>>>
>>> v2:
>>> AS: merged this patch in linux 5.11 version
>>>
>>> v5:
>>> AS: add condition at try_grab_page to check for the zone device type,
>>> while
>>> page ref counter is checked less/equal to zero. In case of device
>>> zone, pages
>>> ref counter are initialized to zero.
>>>
>>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>> ---
>>>    arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>>    drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>>    fs/dax.c                               |  4 +-
>>>    include/linux/dax.h                    |  2 +-
>>>    include/linux/memremap.h               |  7 +--
>>>    include/linux/mm.h                     | 13 +----
>>>    lib/test_hmm.c                         |  2 +-
>>>    mm/internal.h                          |  8 +++
>>>    mm/memremap.c                          | 68 +++++++-------------------
>>>    mm/migrate.c                           |  5 --
>>>    mm/page_alloc.c                        |  3 ++
>>>    mm/swap.c                              | 45 ++---------------
>>>    12 files changed, 46 insertions(+), 115 deletions(-)
>>>
>> I haven't seen a response to the issues I raised back at v3 of this
>> series.
>> https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/
>>
>>
>> Did I miss something?
> I think part of the response was that we did more testing. Alex added
> support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
> recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
> about a zero page refcount in try_get_page. The fix is in the latest
> version of patch 2. But it's already obsolete because John Hubbard is
> about to remove that function altogether.
>
> I think the issues you raised were more uncertainty than known bugs. It
> seems the fact that you can have DAX pages with 0 refcount is a feature
> more than a bug.
>
> Regards,
>    Felix

Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
In that case, mmap() of a DAX device will call insert_page() which calls
get_page() which would trigger VM_BUG_ON_PAGE().

I can believe it is OK for PTE_SPECIAL page table entries to have no
struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
a zero reference count using insert_pfn().

I find it hard to believe that other MM developers don't see an issue
with a struct page with refcount == 0 and mapcount == 1.

I don't see where init_page_count() is being called for the
MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
driver allocates and passes to migrate_vma_setup().
Looks like svm_migrate_get_vram_page() needs to call init_page_count()
instead of get_page(). (I'm looking at branch origin/alexsierrag/device_generic
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git)
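
For illustration, the difference between the two calls on a page whose
count is still zero can be modeled in userspace like this. The model_*
names are invented for this sketch; the only part taken from the kernel
is the shape of the refcount sanity check, and the model returns false
where a CONFIG_DEBUG_VM kernel would hit VM_BUG_ON_PAGE().

```c
#include <stdbool.h>

struct model_page {
	int refcount;
};

/* Same shape as the sanity check used by get_page(): true when the
 * count is zero or has gone slightly negative. */
static bool model_refcount_is_bad(const struct model_page *page)
{
	return (unsigned int)page->refcount + 127u <= 127u;
}

/* get_page() model: refuses to take a reference on a zero count. */
static bool model_get_page(struct model_page *page)
{
	if (model_refcount_is_bad(page))
		return false;
	page->refcount++;
	return true;
}

/* init_page_count() model: unconditionally sets the count to 1. */
static void model_init_page_count(struct model_page *page)
{
	page->refcount = 1;
}
```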

Also, what about the other places where is_device_private_page() is called?
Don't they need to be updated to call is_device_page() instead?
One of my goals for this patch was to remove special casing reference counts
for ZONE_DEVICE pages in rmap.c, etc.

I still think this patch needs an ACK from a FS/DAX maintainer.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-18 19:28       ` Ralph Campbell
@ 2021-08-19 18:00         ` Sierra Guiza, Alejandro (Alex)
  2021-08-19 19:59           ` Felix Kuehling
  2021-08-20  7:17             ` Jerome Glisse
  2021-08-20  4:56         ` Christoph Hellwig
  1 sibling, 2 replies; 50+ messages in thread
From: Sierra Guiza, Alejandro (Alex) @ 2021-08-19 18:00 UTC (permalink / raw)
  To: Ralph Campbell, Felix Kuehling, akpm, linux-mm, linux-ext4, linux-xfs
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse


On 8/18/2021 2:28 PM, Ralph Campbell wrote:
> On 8/17/21 5:35 PM, Felix Kuehling wrote:
>> On 2021-08-17 at 8:01 p.m., Ralph Campbell wrote:
>>> On 8/12/21 11:31 PM, Alex Sierra wrote:
>>>> From: Ralph Campbell <rcampbell@nvidia.com>
>>>>
>>>> ZONE_DEVICE struct pages have an extra reference count that
>>>> complicates the
>>>> code for put_page() and several places in the kernel that need to
>>>> check the
>>>> reference count to see that a page is not being used (gup, compaction,
>>>> migration, etc.). Clean up the code so the reference count doesn't
>>>> need to
>>>> be treated specially for ZONE_DEVICE.
>>>>
>>>> v2:
>>>> AS: merged this patch in linux 5.11 version
>>>>
>>>> v5:
>>>> AS: add condition at try_grab_page to check for the zone device type,
>>>> while
>>>> page ref counter is checked less/equal to zero. In case of device
>>>> zone, pages
>>>> ref counter are initialized to zero.
>>>>
>>>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>> ---
>>>>    arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>>>    drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>>>    fs/dax.c                               |  4 +-
>>>>    include/linux/dax.h                    |  2 +-
>>>>    include/linux/memremap.h               |  7 +--
>>>>    include/linux/mm.h                     | 13 +----
>>>>    lib/test_hmm.c                         |  2 +-
>>>>    mm/internal.h                          |  8 +++
>>>>    mm/memremap.c                          | 68 +++++++-------------------
>>>>    mm/migrate.c                           |  5 --
>>>>    mm/page_alloc.c                        |  3 ++
>>>>    mm/swap.c                              | 45 ++---------------
>>>>    12 files changed, 46 insertions(+), 115 deletions(-)
>>>>
>>> I haven't seen a response to the issues I raised back at v3 of this
>>> series.
>>> https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/
>>>
>>>
>>>
>>> Did I miss something?
>> I think part of the response was that we did more testing. Alex added
>> support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
>> recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
>> about a zero page refcount in try_get_page. The fix is in the latest
>> version of patch 2. But it's already obsolete because John Hubbard is
>> about to remove that function altogether.
>>
>> I think the issues you raised were more uncertainty than known bugs. It
>> seems the fact that you can have DAX pages with 0 refcount is a feature
>> more than a bug.
>>
>> Regards,
>>    Felix
>
> Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
> In that case, mmap() of a DAX device will call insert_page() which calls
> get_page() which would trigger VM_BUG_ON_PAGE().
>
> I can believe it is OK for PTE_SPECIAL page table entries to have no
> struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
> a zero reference count using insert_pfn().
Hi Ralph,
We have tried the DAX tests with and without CONFIG_ARCH_HAS_PTE_SPECIAL
defined. Apparently none of the tests hits that condition for a DAX
device. Of course, that doesn't mean it can't happen.

Regards,
Alex S.

>
>
> I find it hard to believe that other MM developers don't see an issue
> with a struct page with refcount == 0 and mapcount == 1.
>
> I don't see where init_page_count() is being called for the
> MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
> driver allocates and passes to migrate_vma_setup().
> Looks like svm_migrate_get_vram_page() needs to call init_page_count()
> instead of get_page(). (I'm looking at branch 
> origin/alexsierrag/device_generic
> https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git)
Yes, you're right. My bad. Thanks for catching this. I didn't realize
CONFIG_DEBUG_VM was not defined in my build, so this BUG was never caught.
It worked after I replaced get_page with init_page_count in
svm_migrate_get_vram_page. However, I don't think this is the best way
to fix it. Ideally, get_page should also work for device pages with a
ref count of 0. Otherwise, we could overwrite the refcount if someone
else is grabbing the page concurrently.
I was thinking of adding a special condition in get_page for device pages.
This could also fix the insert_page -> get_page call from a DAX device.

Regards,
Alex S.
>
>
> Also, what about the other places where is_device_private_page() is 
> called?
> Don't they need to be updated to call is_device_page() instead?
> One of my goals for this patch was to remove special casing reference 
> counts
> for ZONE_DEVICE pages in rmap.c, etc.
Correct, is_device_private_page is still used in the rmap, memcontrol and
migrate.c files. It looks like the calls in rmap and memcontrol should be
replaced by is_device_page. However, I still need to test to validate this.
In migrate.c it is used in remove_migration_pte and
migrate_vma_insert_page, but those are conditions specific to the
private device type.
Thanks for raising these questions, I think we're getting close.

Regards,
Alex S.
>
> I still think this patch needs an ACK from a FS/DAX maintainer.
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-19 18:00         ` Sierra Guiza, Alejandro (Alex)
@ 2021-08-19 19:59           ` Felix Kuehling
  2021-08-20  4:40             ` Christoph Hellwig
  2021-08-20  7:17             ` Jerome Glisse
  1 sibling, 1 reply; 50+ messages in thread
From: Felix Kuehling @ 2021-08-19 19:59 UTC (permalink / raw)
  To: Sierra Guiza, Alejandro (Alex),
	Ralph Campbell, akpm, linux-mm, linux-ext4, linux-xfs,
	Theodore Ts'o
  Cc: amd-gfx, dri-devel, hch, jgg, jglisse

On 2021-08-19 at 2:00 p.m., Sierra Guiza, Alejandro (Alex) wrote:
>
> On 8/18/2021 2:28 PM, Ralph Campbell wrote:
>> On 8/17/21 5:35 PM, Felix Kuehling wrote:
>>> On 2021-08-17 at 8:01 p.m., Ralph Campbell wrote:
>>>> On 8/12/21 11:31 PM, Alex Sierra wrote:
>>>>> From: Ralph Campbell <rcampbell@nvidia.com>
>>>>>
>>>>> ZONE_DEVICE struct pages have an extra reference count that
>>>>> complicates the
>>>>> code for put_page() and several places in the kernel that need to
>>>>> check the
>>>>> reference count to see that a page is not being used (gup,
>>>>> compaction,
>>>>> migration, etc.). Clean up the code so the reference count doesn't
>>>>> need to
>>>>> be treated specially for ZONE_DEVICE.
>>>>>
>>>>> v2:
>>>>> AS: merged this patch in linux 5.11 version
>>>>>
>>>>> v5:
>>>>> AS: add condition at try_grab_page to check for the zone device type,
>>>>> while
>>>>> page ref counter is checked less/equal to zero. In case of device
>>>>> zone, pages
>>>>> ref counter are initialized to zero.
>>>>>
>>>>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>>> ---
>>>>>    arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>>>>    drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>>>>    fs/dax.c                               |  4 +-
>>>>>    include/linux/dax.h                    |  2 +-
>>>>>    include/linux/memremap.h               |  7 +--
>>>>>    include/linux/mm.h                     | 13 +----
>>>>>    lib/test_hmm.c                         |  2 +-
>>>>>    mm/internal.h                          |  8 +++
>>>>>    mm/memremap.c                          | 68 +++++++-------------------
>>>>>    mm/migrate.c                           |  5 --
>>>>>    mm/page_alloc.c                        |  3 ++
>>>>>    mm/swap.c                              | 45 ++---------------
>>>>>    12 files changed, 46 insertions(+), 115 deletions(-)
>>>>>
>>>> I haven't seen a response to the issues I raised back at v3 of this
>>>> series.
>>>> https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/
>>>>
>>>>
>>>>
>>>> Did I miss something?
>>> I think part of the response was that we did more testing. Alex added
>>> support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
>>> recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
>>> about a zero page refcount in try_get_page. The fix is in the latest
>>> version of patch 2. But it's already obsolete because John Hubbard is
>>> about to remove that function altogether.
>>>
>>> I think the issues you raised were more uncertainty than known bugs. It
>>> seems the fact that you can have DAX pages with 0 refcount is a feature
>>> more than a bug.
>>>
>>> Regards,
>>>    Felix
>>
>> Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
>> In that case, mmap() of a DAX device will call insert_page() which calls
>> get_page() which would trigger VM_BUG_ON_PAGE().
>>
>> I can believe it is OK for PTE_SPECIAL page table entries to have no
>> struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
>> a zero reference count using insert_pfn().
> Hi Ralph,
> We have tried the DAX tests with and without
> CONFIG_ARCH_HAS_PTE_SPECIAL defined.
> Apparently none of the tests touches that condition for a DAX device.
> Of course, that doesn't mean it couldn't happen.
>
> Regards,
> Alex S.
>
>>
>>
>> I find it hard to believe that other MM developers don't see an issue
>> with a struct page with refcount == 0 and mapcount == 1.
>>
>> I don't see where init_page_count() is being called for the
>> MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
>> driver allocates and passes to migrate_vma_setup().
>> Looks like svm_migrate_get_vram_page() needs to call init_page_count()
>> instead of get_page(). (I'm looking at branch
>> origin/alexsierrag/device_generic
>> https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git)
> Yes, you're right. My bad. Thanks for catching this. I didn't realize
> CONFIG_DEBUG_VM was not defined in my build, so this BUG was never
> caught.
> It worked after I replaced get_page with init_page_count in
> svm_migrate_get_vram_page. However, I don't think this is the best way
> to fix it.
> Ideally, get_page should also work for device pages with a ref count
> of 0. Otherwise, we could overwrite the refcount if someone else is
> grabbing the page concurrently.

I think using init_page_count in svm_migrate_get_vram_page is the right
answer. This is where the page first gets allocated and initialized
(data migrated into it). I think nobody should have or try to take a
reference to the page before that. We should probably also add a
VM_BUG_ON_PAGE(page_ref_count(page) != 0) before calling init_page_count
to make sure of that.
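
A minimal userspace sketch of the check proposed above, for illustration
only. struct page_model and the model_* helpers are hypothetical stand-ins
for struct page, page_ref_count() and init_page_count(); the sketch assumes
init_page_count() sets the count to 1, and it returns false where the
kernel would trip VM_BUG_ON_PAGE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; only the refcount is modeled. */
struct page_model {
	atomic_int refcount;
};

/* Models page_ref_count(): read the current reference count. */
static int model_page_ref_count(struct page_model *page)
{
	return atomic_load(&page->refcount);
}

/*
 * Models the proposed allocation step: the page must be unreferenced
 * when the driver hands it out, and only then is the count set to 1.
 * Returns false in the case where the kernel check would fire.
 * The check and the store are two separate steps, not one atomic operation.
 */
static bool model_init_vram_page(struct page_model *page)
{
	if (model_page_ref_count(page) != 0)
		return false;			/* VM_BUG_ON_PAGE would trigger here */
	atomic_store(&page->refcount, 1);	/* assumed effect of init_page_count() */
	return true;
}
```

Calling model_init_vram_page() a second time on the same page fails, which
is the situation the proposed VM_BUG_ON_PAGE is meant to catch. The sketch
does not address concurrent callers.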


> I was thinking of adding a special condition in get_page for device pages.
> This could also fix the insert_page -> get_page call from a DAX device.

[+Theodore]

I got lost trying to understand how DAX counts page references and how
the PTE_SPECIAL option affects that. Theodore, can you help with this?
Is there an easy way to test without CONFIG_ARCH_HAS_PTE_SPECIAL on x86,
or do we need to test on a CPU architecture that doesn't support this
feature?

Thanks,
  Felix


>
> Regards,
> Alex S.
>>
>>
>> Also, what about the other places where is_device_private_page() is
>> called?
>> Don't they need to be updated to call is_device_page() instead?
>> One of my goals for this patch was to remove special casing reference
>> counts
>> for ZONE_DEVICE pages in rmap.c, etc.
> Correct, is_device_private_page is still used in the rmap, memcontrol and
> migrate.c files. It looks like the calls in rmap and memcontrol should be
> replaced by is_device_page. However, I still need to test to validate
> this. In migrate.c it is used in remove_migration_pte and
> migrate_vma_insert_page, but those are conditions specific to the
> private device type.
> Thanks for raising these questions, I think we're getting close.
>
> Regards,
> Alex S.
>>
>> I still think this patch needs an ACK from a FS/DAX maintainer.
>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-19 19:59           ` Felix Kuehling
@ 2021-08-20  4:40             ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-20  4:40 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Sierra Guiza, Alejandro (Alex),
	Ralph Campbell, akpm, linux-mm, linux-ext4, linux-xfs,
	Theodore Ts'o, amd-gfx, dri-devel, hch, jgg, jglisse

On Thu, Aug 19, 2021 at 03:59:56PM -0400, Felix Kuehling wrote:
> I got lost trying to understand how DAX counts page references and how
> the PTE_SPECIAL option affects that. Theodore, can you help with this?
> Is there an easy way to test without CONFIG_ARCH_HAS_PTE_SPECIAL on x86,
> or do we need to test on a CPU architecture that doesn't support this
> feature?

I think the right answer is to simply disallow ZONE_DEVICE pages
if ARCH_HAS_PTE_SPECIAL is not supported.  ARCH_HAS_PTE_SPECIAL is
supported by all modern architecture ports that can make use of
ZONE_DEVICE / dev_pagemap, so we can avoid this pocket of barely
testable code entirely:


diff --git a/mm/Kconfig b/mm/Kconfig
index 40a9bfcd5062e1..2823bbfd1c8c70 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -775,6 +775,7 @@ config ZONE_DMA32
 
 config ZONE_DEVICE
 	bool "Device memory (pmem, HMM, etc...) hotplug support"
+	depends on ARCH_HAS_PTE_SPECIAL
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on SPARSEMEM_VMEMMAP

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-18 19:28       ` Ralph Campbell
  2021-08-19 18:00         ` Sierra Guiza, Alejandro (Alex)
@ 2021-08-20  4:56         ` Christoph Hellwig
  1 sibling, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-20  4:56 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Felix Kuehling, Alex Sierra, akpm, linux-mm, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, hch, jgg, jglisse

On Wed, Aug 18, 2021 at 12:28:30PM -0700, Ralph Campbell wrote:
> Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
> In that case, mmap() of a DAX device will call insert_page() which calls
> get_page() which would trigger VM_BUG_ON_PAGE().

__vm_insert_mixed still ends up calling insert_pfn for the
!CONFIG_ARCH_HAS_PTE_SPECIAL case if pfn_t_devmap() is true, which it
should be for DAX.  (And as said in my other mail, I suspect we should
disallow that case anyway, as no one can test it in practice.)
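
A rough userspace model of the branch described above, for illustration
only. The enum and model_pick_insert_path() are hypothetical names; the
three booleans stand in for IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL),
pfn_t_devmap() and "the pfn has a struct page", and the exact shape of the
condition in __vm_insert_mixed is an assumption based on this description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two ways the mapping can be installed. */
enum insert_path {
	PATH_INSERT_PAGE,	/* refcounted path, ends in get_page() */
	PATH_INSERT_PFN		/* raw pfn path, no get_page() */
};

/*
 * Models the decision: the refcounted insert_page() path is only taken
 * when the architecture lacks PTE_SPECIAL, the pfn is not devmap, and
 * the pfn is backed by a struct page. Everything else uses insert_pfn().
 */
static enum insert_path model_pick_insert_path(bool arch_has_pte_special,
					       bool pfn_is_devmap,
					       bool pfn_has_struct_page)
{
	if (!arch_has_pte_special && !pfn_is_devmap && pfn_has_struct_page)
		return PATH_INSERT_PAGE;
	return PATH_INSERT_PFN;
}
```

Under these assumptions a devmap pfn selects PATH_INSERT_PFN even without
PTE_SPECIAL, so the get_page() call is not reached for DAX.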

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  2021-08-17 15:44         ` Felix Kuehling
@ 2021-08-20  5:05           ` Christoph Hellwig
  2021-08-20  7:24               ` Jerome Glisse
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Hellwig @ 2021-08-20  5:05 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christoph Hellwig, Alex Sierra, akpm, linux-mm, rcampbell,
	linux-ext4, linux-xfs, amd-gfx, dri-devel, jgg, jglisse,
	Roger Pau Monne, Dan Williams, Boris Ostrovsky

On Tue, Aug 17, 2021 at 11:44:54AM -0400, Felix Kuehling wrote:
> >> That's a good catch. Existing drivers shouldn't need a page_free
> >> callback if they didn't have one before. That means we need to add a
> >> NULL-pointer check in free_device_page.
> > Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
> > ->mapping = NULL).
> >
> > In many ways this seems like you want to bring back the DEVICE_PUBLIC
> > pgmap type that was removed a while ago due to the lack of users
> > instead of overloading the generic type.
> 
> I think so. I'm not clear about how DEVICE_PUBLIC differed from what
> DEVICE_GENERIC is today. As I understand it, DEVICE_PUBLIC was removed
> because it was unused and also known to be broken in some ways.
> DEVICE_GENERIC seemed close enough to what we need, other than not being
> supported in the migration helpers.
> 
> Would you see benefit in re-introducing DEVICE_PUBLIC as a distinct
> memory type from DEVICE_GENERIC? What would be the benefits of making
> that distinction?

The old DEVICE_PUBLIC mostly differed in that it allowed the page
to be returned from vm_normal_page, which I think was horribly buggy.

But the point is not to bring back these old semantics.  The idea
is to be able to differentiate between your new coherent on-device
memory and the existing DEVICE_GENERIC.  That is, call the
code in free_devmap_managed_page that is currently only used
for device private pages also for your new public device pages without
affecting the devdax and xen use cases.
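
A small userspace sketch of that distinction, for illustration only. The
enum values and model_uses_private_free_path() are hypothetical;
MODEL_DEVICE_COHERENT in particular is an invented placeholder for the
suggested new type and is not an existing pgmap type in this series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of pgmap types; names do not match kernel enums. */
enum model_memory_type {
	MODEL_DEVICE_PRIVATE,
	MODEL_DEVICE_FS_DAX,
	MODEL_DEVICE_GENERIC,
	MODEL_DEVICE_COHERENT	/* placeholder for the suggested new type */
};

/*
 * Models which types would run the free path that is currently used
 * only for device private pages (state clearing plus the page_free
 * callback). Generic (devdax, xen) and FS DAX pages are left untouched.
 */
static bool model_uses_private_free_path(enum model_memory_type type)
{
	switch (type) {
	case MODEL_DEVICE_PRIVATE:
	case MODEL_DEVICE_COHERENT:
		return true;
	default:
		return false;
	}
}
```

Keying the decision on a distinct type means existing DEVICE_GENERIC users
never need a page_free callback.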

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-15 20:40     ` John Hubbard
  2021-08-16 18:56       ` Felix Kuehling
@ 2021-08-20  6:33         ` Jerome Glisse
  1 sibling, 0 replies; 50+ messages in thread
From: Jerome Glisse @ 2021-08-20  6:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christoph Hellwig, Alex Sierra, Andrew Morton, Kuehling, Felix,
	linux-mm, Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list,
	Maling list - DRI developers, jgg, Jerome Glisse

Note that you do not want GUP to succeed on a device page; I do not see
where that is handled in the new code.

On Sun, Aug 15, 2021 at 1:40 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 8/15/21 8:37 AM, Christoph Hellwig wrote:
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 8ae31622deef..d48a1f0889d1 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1218,7 +1218,7 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
> >>   static inline __must_check bool try_get_page(struct page *page)
> >>   {
> >>      page = compound_head(page);
> >> -    if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> >> +    if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))
> >
> > Please avoid the overly long line.  In fact I'd be tempted to just not
> > > bother here and keep the old, looser check.  Especially given that
> > John has a patch ready that removes try_get_page entirely.
> >
>
> Yes. Andrew has accepted it into mmotm.
>
> Ralph's patch here was written well before my cleanup that removed
> try_grab_page() [1]. But now that we're here, if you drop this hunk then
> it will make merging easier, I think.
>
>
> [1] https://lore.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
>
> thanks,
> --
> John Hubbard
> NVIDIA
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
  2021-08-19 18:00         ` Sierra Guiza, Alejandro (Alex)
@ 2021-08-20  7:17             ` Jerome Glisse
  2021-08-20  7:17             ` Jerome Glisse
  1 sibling, 0 replies; 50+ messages in thread
From: Jerome Glisse @ 2021-08-20  7:17 UTC (permalink / raw)
  To: Sierra Guiza, Alejandro (Alex)
  Cc: Ralph Campbell, Felix Kuehling, Andrew Morton, linux-mm,
	linux-ext4, linux-xfs, amd-gfx list,
	Maling list - DRI developers, Christoph Hellwig, jgg,
	Jerome Glisse

On Thu, Aug 19, 2021 at 11:00 AM Sierra Guiza, Alejandro (Alex)
<alex.sierra@amd.com> wrote:
>
>
> On 8/18/2021 2:28 PM, Ralph Campbell wrote:
> > On 8/17/21 5:35 PM, Felix Kuehling wrote:
> >> On 2021-08-17 at 8:01 p.m., Ralph Campbell wrote:
> >>> On 8/12/21 11:31 PM, Alex Sierra wrote:
> >>>> From: Ralph Campbell <rcampbell@nvidia.com>
> >>>>
> >>>> ZONE_DEVICE struct pages have an extra reference count that
> >>>> complicates the
> >>>> code for put_page() and several places in the kernel that need to
> >>>> check the
> >>>> reference count to see that a page is not being used (gup, compaction,
> >>>> migration, etc.). Clean up the code so the reference count doesn't
> >>>> need to
> >>>> be treated specially for ZONE_DEVICE.
> >>>>
> >>>> v2:
> >>>> AS: merged this patch in linux 5.11 version
> >>>>
> >>>> v5:
> >>>> AS: add condition at try_grab_page to check for the zone device type,
> >>>> while
> >>>> page ref counter is checked less/equal to zero. In case of device
> >>>> zone, pages
> >>>> ref counter are initialized to zero.
> >>>>
> >>>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> >>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> >>>> ---
> >>>>    arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
> >>>>    drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
> >>>>    fs/dax.c                               |  4 +-
> >>>>    include/linux/dax.h                    |  2 +-
> >>>>    include/linux/memremap.h               |  7 +--
> >>>>    include/linux/mm.h                     | 13 +----
> >>>>    lib/test_hmm.c                         |  2 +-
> >>>>    mm/internal.h                          |  8 +++
> >>>>    mm/memremap.c                          | 68 +++++++-------------------
> >>>>    mm/migrate.c                           |  5 --
> >>>>    mm/page_alloc.c                        |  3 ++
> >>>>    mm/swap.c                              | 45 ++---------------
> >>>>    12 files changed, 46 insertions(+), 115 deletions(-)
> >>>>
> >>> I haven't seen a response to the issues I raised back at v3 of this
> >>> series.
> >>> https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/
> >>>
> >>>
> >>>
> >>> Did I miss something?
> >> I think part of the response was that we did more testing. Alex added
> >> support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
> >> recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
> >> about a zero page refcount in try_get_page. The fix is in the latest
> >> version of patch 2. But it's already obsolete because John Hubbard is
> >> about to remove that function altogether.
> >>
> >> I think the issues you raised were more uncertainty than known bugs. It
> >> seems the fact that you can have DAX pages with 0 refcount is a feature
> >> more than a bug.
> >>
> >> Regards,
> >>    Felix
> >
> > Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
> > In that case, mmap() of a DAX device will call insert_page() which calls
> > get_page() which would trigger VM_BUG_ON_PAGE().
> >
> > I can believe it is OK for PTE_SPECIAL page table entries to have no
> > struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
> > a zero reference count using insert_pfn().
> Hi Ralph,
> We have tried the DAX tests with and without CONFIG_ARCH_HAS_PTE_SPECIAL
> defined.
> Apparently none of the tests touches that condition for a DAX device.
> Of course, that doesn't mean it couldn't happen.
>
> Regards,
> Alex S.
>
> >
> >
> > I find it hard to believe that other MM developers don't see an issue
> > with a struct page with refcount == 0 and mapcount == 1.
> >
> > I don't see where init_page_count() is being called for the
> > MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
> > driver allocates and passes to migrate_vma_setup().
> > Looks like svm_migrate_get_vram_page() needs to call init_page_count()
> > instead of get_page(). (I'm looking at branch
> > origin/alexsierrag/device_generic
> > https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git)
> Yes, you're right. My bad. Thanks for catching this. I didn't realize
> CONFIG_DEBUG_VM was not defined in my build, so this BUG was never caught.
> It worked after I replaced get_page with init_page_count in
> svm_migrate_get_vram_page. However, I don't think this is the best way
> to fix it.

You definitely don't want to do that. Reinitializing the page refcount is
wrong; it should be done once. Nouveau is not a good example here.

> Ideally, get_page should also work for device pages with a ref count
> of 0. Otherwise, we could overwrite the refcount if someone else is
> grabbing the page concurrently.
> I was thinking of adding a special condition in get_page for device pages.
> This could also fix the insert_page -> get_page call from a DAX device.

What is the issue here exactly?

>
> Regards,
> Alex S.
> >
> >
> > Also, what about the other places where is_device_private_page() is
> > called?
> > Don't they need to be updated to call is_device_page() instead?
> > One of my goals for this patch was to remove special casing reference
> > counts
> > for ZONE_DEVICE pages in rmap.c, etc.
> Correct, is_device_private_page is still used in the rmap, memcontrol and
> migrate.c files. It looks like the calls in rmap and memcontrol should be
> replaced by is_device_page.

No, you do not want to do that. The private case is a special case for a reason.

Jerome

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
@ 2021-08-20  7:17             ` Jerome Glisse
  0 siblings, 0 replies; 50+ messages in thread
From: Jerome Glisse @ 2021-08-20  7:17 UTC (permalink / raw)
  To: Sierra Guiza, Alejandro (Alex)
  Cc: Ralph Campbell, Felix Kuehling, Andrew Morton, linux-mm,
	linux-ext4, linux-xfs, amd-gfx list,
	Maling list - DRI developers, Christoph Hellwig, jgg,
	Jerome Glisse

On Thu, Aug 19, 2021 at 11:00 AM Sierra Guiza, Alejandro (Alex)
<alex.sierra@amd.com> wrote:
>
>
> On 8/18/2021 2:28 PM, Ralph Campbell wrote:
> > On 8/17/21 5:35 PM, Felix Kuehling wrote:
> >> Am 2021-08-17 um 8:01 p.m. schrieb Ralph Campbell:
> >>> On 8/12/21 11:31 PM, Alex Sierra wrote:
> >>>> From: Ralph Campbell <rcampbell@nvidia.com>
> >>>>
> >>>> ZONE_DEVICE struct pages have an extra reference count that
> >>>> complicates the
> >>>> code for put_page() and several places in the kernel that need to
> >>>> check the
> >>>> reference count to see that a page is not being used (gup, compaction,
> >>>> migration, etc.). Clean up the code so the reference count doesn't
> >>>> need to
> >>>> be treated specially for ZONE_DEVICE.
> >>>>
> >>>> v2:
> >>>> AS: merged this patch in linux 5.11 version
> >>>>
> >>>> v5:
> >>>> AS: add condition at try_grab_page to check for the zone device type,
> >>>> while
> >>>> page ref counter is checked less/equal to zero. In case of device
> >>>> zone, pages
> >>>> ref counter are initialized to zero.
> >>>>
> >>>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> >>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> >>>> ---
> >>>>    arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
> >>>>    drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
> >>>>    fs/dax.c                               |  4 +-
> >>>>    include/linux/dax.h                    |  2 +-
> >>>>    include/linux/memremap.h               |  7 +--
> >>>>    include/linux/mm.h                     | 13 +----
> >>>>    lib/test_hmm.c                         |  2 +-
> >>>>    mm/internal.h                          |  8 +++
> >>>>    mm/memremap.c                          | 68
> >>>> +++++++-------------------
> >>>>    mm/migrate.c                           |  5 --
> >>>>    mm/page_alloc.c                        |  3 ++
> >>>>    mm/swap.c                              | 45 ++---------------
> >>>>    12 files changed, 46 insertions(+), 115 deletions(-)
> >>>>
> >>> I haven't seen a response to the issues I raised back at v3 of this
> >>> series.
> >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-mm%2F4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc%40nvidia.com%2F&amp;data=04%7C01%7Calex.sierra%40amd.com%7Cd2bd2d4fbf764528540908d9627e5dcd%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637649117156919772%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=P7FxYm%2BkJaCkMFa3OHtuKrPOn7SvytFRmYQdIzq7rN4%3D&amp;reserved=0
> >>>
> >>>
> >>>
> >>> Did I miss something?
> >> I think part of the response was that we did more testing. Alex added
> >> support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
> >> recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
> >> about a zero page refcount in try_get_page. The fix is in the latest
> >> version of patch 2. But it's already obsolete because John Hubbard is
> >> about to remove that function altogether.
> >>
> >> I think the issues you raised were more uncertainty than known bugs. It
> >> seems the fact that you can have DAX pages with 0 refcount is a feature
> >> more than a bug.
> >>
> >> Regards,
> >>    Felix
> >
> > Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
> > In that case, mmap() of a DAX device will call insert_page() which calls
> > get_page() which would trigger VM_BUG_ON_PAGE().
> >
> > I can believe it is OK for PTE_SPECIAL page table entries to have no
> > struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
> > a zero reference count using insert_pfn().
> Hi Ralph,
> We have tried the DAX tests with and without CONFIG_ARCH_HAS_PTE_SPECIAL
> defined.
> Apparently none of the tests touches that condition for a DAX device. Of
> course,
> that doesn't mean it could happen.
>
> Regards,
> Alex S.
>
> >
> >
> > I find it hard to believe that other MM developers don't see an issue
> > with a struct page with refcount == 0 and mapcount == 1.
> >
> > I don't see where init_page_count() is being called for the
> > MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
> > driver allocates and passes to migrate_vma_setup().
> > Looks like svm_migrate_get_vram_page() needs to call init_page_count()
> > instead of get_page(). (I'm looking at branch
> > origin/alexsierrag/device_generic
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FRadeonOpenCompute%2FROCK-Kernel-Driver.git&amp;data=04%7C01%7Calex.sierra%40amd.com%7Cd2bd2d4fbf764528540908d9627e5dcd%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637649117156919772%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=IXe8HP2s8x5OdJdERBkGOYJCQk3iqCu5AYkwpDL8zec%3D&amp;reserved=0)
> Yes, you're right. My bad. Thanks for catching this up. I didn't realize
> I was missing
> to define CONFIG_DEBUG_VM on my build. Therefore this BUG was never caught.
> It worked after I replaced get_pages by init_page_count at
> svm_migrate_get_vram_page. However, I don't think this is the best way
> to fix it.

You definitly don't want to do that. reiniting the page refcounter is
wrong it should be done once. Nouveau is not a good example here.

> Ideally, get_pages call should work for device pages with ref count
> equal to 0
> too. Otherwise, we could overwrite refcounter if someone else is
> grabbing the page
> concurrently.
> I was thinking to add a special condition in get_pages for dev pages.
> This could
> also fix the insert_page -> get_page call from a DAX device.

What is the issue here exactly ?

>
> Regards,
> Alex S.
> >
> >
> > Also, what about the other places where is_device_private_page() is
> > called? Don't they need to be updated to call is_device_page() instead?
> > One of my goals for this patch was to remove special casing reference
> > counts for ZONE_DEVICE pages in rmap.c, etc.
> Correct, is_device_private_page is still used in the rmap, memcontrol and
> migrate.c files. It looks like the calls in rmap and memcontrol should be
> replaced by the is_device_page function.

No, you do not want to do that. The private case is a special case for a reason.

Jerome
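
The refcount point above can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: toy_page, toy_init_count,
toy_get and toy_put are hypothetical stand-ins for struct page,
init_page_count, get_page and put_page, and none of this is kernel code.
It shows how re-initializing a count that another holder has already
incremented silently drops that holder's reference.

```c
#include <stdatomic.h>

/* Toy stand-in for struct page; only the refcount is modelled. */
struct toy_page {
	atomic_int refcount;
};

/* Models init_page_count(): unconditionally overwrites the count. */
static void toy_init_count(struct toy_page *p)
{
	atomic_store(&p->refcount, 1);
}

/* Models get_page(): adds one reference to the current count. */
static void toy_get(struct toy_page *p)
{
	atomic_fetch_add(&p->refcount, 1);
}

/* Models put_page(): returns 1 when the last reference is dropped. */
static int toy_put(struct toy_page *p)
{
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/*
 * Returns 1 if the page is "freed" while a second holder still
 * believes it owns a reference.
 */
static int lost_reference_demo(void)
{
	struct toy_page page;

	toy_init_count(&page);	/* done once at creation: count == 1 */
	toy_get(&page);		/* second holder: count == 2 */
	toy_init_count(&page);	/* wrong: re-init clobbers count to 1 */

	/* First holder drops its reference: count hits 0 too early. */
	return toy_put(&page);
}

/* The same sequence without the second init keeps the page alive. */
static int correct_demo(void)
{
	struct toy_page page;

	toy_init_count(&page);	/* count == 1 */
	toy_get(&page);		/* count == 2 */

	return toy_put(&page);	/* count == 1, not freed */
}
```

In this model lost_reference_demo() reports an early free while
correct_demo() does not, which is the reason for initializing the count
only once.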


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  2021-08-20  5:05           ` Christoph Hellwig
  2021-08-20  7:24               ` Jerome Glisse
@ 2021-08-20  7:24               ` Jerome Glisse
  0 siblings, 0 replies; 50+ messages in thread
From: Jerome Glisse @ 2021-08-20  7:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Felix Kuehling, Alex Sierra, Andrew Morton, linux-mm,
	Ralph Campbell, linux-ext4, linux-xfs, amd-gfx list,
	Maling list - DRI developers, jgg, Jerome Glisse,
	Roger Pau Monne, Dan Williams, Boris Ostrovsky

On Thu, Aug 19, 2021 at 10:05 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Aug 17, 2021 at 11:44:54AM -0400, Felix Kuehling wrote:
> > >> That's a good catch. Existing drivers shouldn't need a page_free
> > >> callback if they didn't have one before. That means we need to add a
> > >> NULL-pointer check in free_device_page.
> > > Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
> > > ->mapping = NULL).
> > >
> > > In many ways this seems like you want to bring back the DEVICE_PUBLIC
> > > pgmap type that was removed a while ago due to the lack of users
> > > instead of overloading the generic type.
> >
> > I think so. I'm not clear about how DEVICE_PUBLIC differed from what
> > DEVICE_GENERIC is today. As I understand it, DEVICE_PUBLIC was removed
> > because it was unused and also known to be broken in some ways.
> > DEVICE_GENERIC seemed close enough to what we need, other than not being
> > supported in the migration helpers.
> >
> > Would you see benefit in re-introducing DEVICE_PUBLIC as a distinct
> > memory type from DEVICE_GENERIC? What would be the benefits of making
> > that distinction?
>
> The old DEVICE_PUBLIC mostly differed in that it allowed the page
> to be returned from vm_normal_page, which I think was horribly buggy.

Why was that buggy? If I were to do it now, I would return
DEVICE_PUBLIC pages from vm_normal_page but I would ban pinning, as
pinning is exceptionally wrong for GPUs. If you migrate some random
anonymous/file page back to your GPU memory and it gets pinned there then
there is no way for the GPU to migrate the page out. Quickly you will
run out of physically contiguous memory and things like big graphics
buffer allocations (anything that needs physically contiguous memory)
will fail. It is less of an issue on some hardware that relies less and
less on physically contiguous memory but I do not think it is
completely gone from all hw.

> But the point is not to bring back these old semantics.  The idea
> is to be able to differentiate between your new coherent on-device
> memory and the existing DEVICE_GENERIC.  That is, call the
> code in free_devmap_managed_page that is currently only used
> for device private pages also for your new public device pages without
> affecting the devdax and xen use cases.

Yes, I would rather bring back DEVICE_PUBLIC than try to use
DEVICE_GENERIC. The GENERIC change was done for users that closely
matched DAX semantics and that is not the case here, at least not from
my point of view.

Jerome
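
The idea discussed in this message (a distinct memory type that shares the
device-private free path, plus the NULL check on page_free) could look
roughly like the userspace sketch below. Everything here is an assumption
made for illustration: the toy_* types and TOY_DEVICE_PUBLIC are
hypothetical and do not reflect the actual kernel structures or any patch
in this series.

```c
#include <stddef.h>

enum toy_memory_type {
	TOY_DEVICE_PRIVATE,
	TOY_DEVICE_GENERIC,	/* devdax and xen style users */
	TOY_DEVICE_PUBLIC,	/* hypothetical coherent device memory */
};

struct toy_page;

struct toy_pagemap_ops {
	void (*page_free)(struct toy_page *page);
};

struct toy_pagemap {
	enum toy_memory_type type;
	const struct toy_pagemap_ops *ops;	/* may be NULL */
};

struct toy_page {
	struct toy_pagemap *pgmap;
	void *mapping;
};

/* Counts driver callbacks so the behaviour can be observed. */
static int toy_free_calls;

static void toy_driver_page_free(struct toy_page *page)
{
	(void)page;
	toy_free_calls++;
}

/*
 * Models the free path: only private and public pages get their state
 * cleared and the driver callback invoked; generic pages are left
 * untouched. Returns 1 if the driver callback ran.
 */
static int toy_free_device_page(struct toy_page *page)
{
	const struct toy_pagemap *pgmap = page->pgmap;

	switch (pgmap->type) {
	case TOY_DEVICE_PRIVATE:
	case TOY_DEVICE_PUBLIC:
		break;
	default:
		return 0;	/* existing generic users are unaffected */
	}

	page->mapping = NULL;	/* state clearing done on this path */

	/* Drivers without a page_free callback must not crash here. */
	if (pgmap->ops && pgmap->ops->page_free) {
		pgmap->ops->page_free(page);
		return 1;
	}
	return 0;
}
```

Keying the behaviour on a separate type keeps the devdax and xen users of
the generic type on their current path, which is the distinction argued
for above.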

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2021-08-20  7:45 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-13  6:31 [PATCH v6 00/13] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
2021-08-13  6:31 ` [PATCH v6 01/13] ext4/xfs: add page refcount helper Alex Sierra
2021-08-15  9:01   ` Christoph Hellwig
2021-08-13  6:31 ` [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
2021-08-15 15:37   ` Christoph Hellwig
2021-08-15 20:40     ` John Hubbard
2021-08-16 18:56       ` Felix Kuehling
2021-08-20  6:33       ` Jerome Glisse
2021-08-20  6:33         ` Jerome Glisse
2021-08-20  6:33         ` Jerome Glisse
2021-08-18  0:01   ` Ralph Campbell
2021-08-18  0:01     ` Ralph Campbell
2021-08-18  0:35     ` Felix Kuehling
2021-08-18  0:35       ` Felix Kuehling
2021-08-18 19:28       ` Ralph Campbell
2021-08-19 18:00         ` Sierra Guiza, Alejandro (Alex)
2021-08-19 19:59           ` Felix Kuehling
2021-08-20  4:40             ` Christoph Hellwig
2021-08-20  7:17           ` Jerome Glisse
2021-08-20  7:17             ` Jerome Glisse
2021-08-20  4:56         ` Christoph Hellwig
2021-08-13  6:31 ` [PATCH v6 03/13] kernel: resource: lookup_resource as exported symbol Alex Sierra
2021-08-13  6:31 ` [PATCH v6 04/13] drm/amdkfd: add SPM support for SVM Alex Sierra
2021-08-15  9:10   ` Christoph Hellwig
2021-08-16 18:54     ` Felix Kuehling
2021-08-17  5:47       ` Christoph Hellwig
2021-08-13  6:31 ` [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram Alex Sierra
2021-08-15 15:38   ` Christoph Hellwig
2021-08-16 19:53     ` Sierra Guiza, Alejandro (Alex)
2021-08-16 22:06       ` Zeng, Oak
2021-08-17  0:42         ` Felix Kuehling
2021-08-17  5:49       ` Christoph Hellwig
2021-08-13  6:31 ` [PATCH v6 06/13] include/linux/mm.h: helpers to check zone device generic type Alex Sierra
2021-08-15  9:16   ` Christoph Hellwig
2021-08-13  6:31 ` [PATCH v6 07/13] mm: add generic type support to migrate_vma helpers Alex Sierra
2021-08-15  9:19   ` Christoph Hellwig
2021-08-13  6:31 ` [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages Alex Sierra
2021-08-15 15:40   ` Christoph Hellwig
2021-08-16 19:00     ` Felix Kuehling
2021-08-17  5:50       ` Christoph Hellwig
2021-08-17 15:44         ` Felix Kuehling
2021-08-20  5:05           ` Christoph Hellwig
2021-08-20  7:24             ` Jerome Glisse
2021-08-13  6:31 ` [PATCH v6 09/13] lib: test_hmm add ioctl to get zone device type Alex Sierra
2021-08-13  6:31 ` [PATCH v6 10/13] lib: test_hmm add module param for " Alex Sierra
2021-08-13  6:31 ` [PATCH v6 11/13] lib: add support for device generic type in test_hmm Alex Sierra
2021-08-13  6:31 ` [PATCH v6 12/13] tools: update hmm-test to support device generic type Alex Sierra
2021-08-13  6:31 ` [PATCH v6 13/13] tools: update test_hmm script to support SP config Alex Sierra
